Using VCT to Identify PII and Information Governance Applications

Using Visual Classification to Identify PII Document Types

Visual document classification offers a different approach to detecting and protecting PII in “unstructured” content. A key point is that PII does not occur at random across document types or within documents. Like gold, it occurs in veins or in ore, and visual classification is well suited to identifying the PII veins and ore in the mountains of unstructured content maintained by large organizations. Visual classification works by clustering all documents based on their visual similarity so that the document types that normally contain PII can be identified. This approach ignores whether text is available and focuses on appearance, thereby normalizing documents regardless of the type of file in which the content was stored.

Reviewing Visual Clusters for Multiple Information Governance Purposes

Typically, the number of visually similar clusters is less than one percent of the total number of documents, and the clusters can be reviewed in order of size, largest first.

Within a few days, the PII detection team can:

Review clusters representing well over 99% of all the documents in an organization,
Eliminate clusters that do not have ongoing business or legal value,
Tag those remaining clusters that contain PII,
Assign document-type name labels to them, and
Identify the PII attributes or data elements present in each document type.

Subsequent reviews only have to examine new clusters of visually similar documents that have formed since the last review. Decisions made about existing clusters are simply applied to documents that are later added to those clusters.
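As a rough illustration of how cluster-level decisions can carry over to later documents, the sketch below keeps reviewer decisions in a lookup table keyed by cluster. The cluster IDs, field names, and function names are assumptions made for this example and are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class ClusterDecision:
    """A reviewer's decision for one visual cluster (illustrative fields)."""
    doc_type: str
    contains_pii: bool
    pii_attributes: tuple = ()


# Decisions recorded during an earlier review, keyed by a hypothetical cluster ID.
decisions = {
    "cluster-0001": ClusterDecision("W-2 form", True, ("ssn", "name", "address")),
    "cluster-0002": ClusterDecision("Cafeteria menu", False),
}


def classify_new_document(cluster_id, decisions):
    """Return the inherited decision, or None if the cluster has not been reviewed."""
    return decisions.get(cluster_id)


def clusters_needing_review(incoming, decisions):
    """incoming: iterable of (doc_id, cluster_id). Returns the new cluster IDs, sorted."""
    return sorted({cluster_id for _, cluster_id in incoming if cluster_id not in decisions})
```

Under these assumptions, a document that lands in an already reviewed cluster inherits its label and PII tag at once, and only the clusters returned by clusters_needing_review go back to the review team.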

After the review of visually similar clusters, the organization can decide what types of protection are warranted for each type of document, whether or not those documents had associated text values:

What level of storage is indicated, e.g., should some of the clusters be on encrypted servers?
Which people or job functions should be able to see specific document types?
What retention period should apply?

Visual Classification and Text-Based Approaches Are Complementary

Visual classification is not used to the exclusion of text-based approaches. In fact, text-based pattern and term searching techniques can be used in conjunction with visual classification to provide the most comprehensive detection and protection options available.

After visual clusters are formed, searches can be made for patterns like social security numbers or for lists of potential PII like medical diagnoses. The results are then viewed arranged by visual cluster to determine whether some of the clusters that were not originally tagged as regularly having PII ought to be included in the PII category of clusters.

Note that even if not all documents in a cluster have associated text, the ones that have text can be identified as having PII and this can result in all the documents in the cluster receiving the additional PII protection they warrant.
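A minimal sketch of this idea follows, assuming each document is represented as a dictionary with hypothetical keys 'id', 'cluster', and 'text' (where 'text' is None for image-only documents). The pattern covers only social security numbers written as 123-45-6789, and tagging a whole cluster on a single hit is a simplification; in practice a reviewer would likely confirm the decision.

```python
import re
from collections import defaultdict

# Simplified pattern for US social security numbers written as 123-45-6789.
# Real detection would need more formats and validation; this is only a sketch.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pii_hits_by_cluster(documents):
    """Group documents whose text matches the pattern by their visual cluster."""
    hits = defaultdict(list)
    for doc in documents:
        text = doc.get("text")
        if text and SSN_PATTERN.search(text):
            hits[doc["cluster"]].append(doc["id"])
    return dict(hits)


def propagate_pii_tag(documents, hits):
    """Tag every document in a cluster with at least one hit, with or without text."""
    return {doc["id"]: doc["cluster"] in hits for doc in documents}
```

Here an image-only document receives the PII tag because a sibling in the same cluster had searchable text that matched.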

Detecting PII Flags or Cues

Text search can also locate words that often serve as flags or cues for PII. For example, the terms “SSAN” or “SS#” or “Social Security Number” often signal that the information close by includes social security numbers. If documents cannot be sorted or arranged by visual cluster or document type, reviewing the results of such a search can be very burdensome because there can be so many hits. However, when the results can be reviewed by cluster or document type, attention can be focused only on those clusters that have not already been designated as containing PII.
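The filtering step can be sketched as follows. The document layout ('cluster' and 'text' keys) and the function name are assumptions for illustration; the cue terms are the three mentioned above.

```python
import re
from collections import Counter

# Cue terms from the text. "SS#" ends in a non-word character, so it takes
# no trailing word boundary.
CUE_PATTERN = re.compile(r"\bSSAN\b|\bSS#|\bSocial Security Number\b", re.IGNORECASE)


def cue_hits_for_review(documents, pii_clusters):
    """Count cue hits per cluster, skipping clusters already tagged as containing PII."""
    counts = Counter()
    for doc in documents:
        if doc["cluster"] in pii_clusters:
            continue  # already protected, no need to review again
        text = doc.get("text")
        if text and CUE_PATTERN.search(text):
            counts[doc["cluster"]] += 1
    return counts.most_common()
```

The result is a short list of untagged clusters ordered by number of hits, which is far easier to review than a flat list of every matching document.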

Protective Measures

Once documents or clusters have been identified as having PII, they can be afforded the appropriate level of protection. Protective measures include:

Encrypted Storage

Encrypting data helps protect it and lowers regulatory risks in the event of a data breach. However, organizations may not want to encrypt everything they have. Visual classification greatly reduces the content that is kept because much content can be disposed of, and then only a portion of what is retained may warrant encryption.

Restricted Access

Having consistent, reliable document type classification of all stored content permits organizations to restrict employees’ access to only those documents they need to perform their jobs.

Prompt Disposition

Without consistent document classification, many organizations end up keeping everything either forever or for the longest retention period associated with any of the documents in a collection. Consistent, reliable classification permits granular retention schedules that can be readily applied, considerably reducing the volume of content at risk.
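To make the idea of a granular retention schedule concrete, here is a small sketch. The document types and retention periods are invented for the example and are not recommendations; real schedules come from legal and records-management requirements.

```python
from datetime import date

# Hypothetical retention schedule, in years, keyed by document type.
RETENTION_YEARS = {"invoice": 7, "job application": 2, "visitor log": 1}


def disposition_date(doc_type, created, schedule=RETENTION_YEARS):
    """Return the date on which a document of this type becomes eligible for disposal."""
    years = schedule[doc_type]
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # February 29 has no counterpart in a non-leap target year.
        return created.replace(year=created.year + years, day=28)


def due_for_disposition(documents, today):
    """documents: dicts with assumed keys 'id', 'type', 'created'."""
    return [d["id"] for d in documents if disposition_date(d["type"], d["created"]) <= today]
```

With per-type periods like these, a visitor log can be disposed of years before an invoice created on the same day, rather than both being held for the longest period in the collection.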

Three other protective options ought to be considered:

Text Redaction

The visual classification system is based on cataloging the graphical elements on all pages. As part of that process, it content-enables image-only documents so that they have searchable text. Whether the system provided the searchable text or the text was already present when the documents were processed, it knows the page coordinates for the text values associated with the pages. It uses those coordinates to perform high-speed, highly accurate redactions using expressions or word lists, on the order of 700,000 redactions per CPU per hour.
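The general technique of coordinate-based redaction can be sketched as below. This is not the vendor's implementation: the Word structure, the pattern, and the function name are assumptions, and the sketch matches single tokens only, so an expression that spans several tokens would need extra handling.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One text token with its bounding box on the page (hypothetical layout)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


# Simplified single-token pattern for social security numbers.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def find_redaction_boxes(words, pattern=SSN_PATTERN, word_list=()):
    """Return the bounding box of every token matching the pattern or the word list."""
    blocked = {w.lower() for w in word_list}
    return [
        (w.x0, w.y0, w.x1, w.y1)
        for w in words
        if pattern.match(w.text) or w.text.lower() in blocked
    ]
```

The returned boxes identify the regions to black out on the page image and the tokens to drop from the text layer.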

This industrial-grade redaction capability means that when producing or turning over documents to third parties, the PII can be simply removed. It also provides the organization with the option to work with redacted copies of documents where circumstances warrant. The obvious benefit is that people can’t steal or inadvertently disclose PII that isn’t there.

Zoned Redaction

Some forms are completed with handwriting that is not susceptible to text- or word-based redaction. In other cases, the words in part of a document are too variable to specify which patterns or words will be used. And as already discussed, some documents do not have accurate, reliable text.

In these circumstances the application can also provide zoned redaction where all of the content that falls within certain page coordinates will be redacted.
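A minimal sketch of zoned redaction follows, assuming a page image stored as rows of grayscale pixel values and a text layer stored as (text, box) pairs. Both representations and the function names are assumptions for illustration; zones are given as (x0, y0, x1, y1) with the right and bottom edges exclusive.

```python
def blank_zone(pixels, zone, fill=0):
    """Overwrite every pixel inside the zone with the fill value (0 = black)."""
    x0, y0, x1, y1 = zone
    for y in range(max(y0, 0), min(y1, len(pixels))):
        row = pixels[y]
        for x in range(max(x0, 0), min(x1, len(row))):
            row[x] = fill
    return pixels


def drop_text_in_zone(words, zone):
    """Remove tokens whose box overlaps the zone so the text layer matches the image."""
    zx0, zy0, zx1, zy1 = zone
    return [
        (text, box)
        for text, box in words
        if not (box[0] < zx1 and box[2] > zx0 and box[1] < zy1 and box[3] > zy0)
    ]
```

Because the zone is defined by coordinates rather than content, the same zone can be applied to every page of a given document type, for example the signature block of a standard form.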

Logging Redactions

Regardless of which type of redaction is used, a detailed log of all redactions is provided, including what was redacted and the reason given for the redaction. Both the text and image layers of redacted documents are redacted.
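One possible shape for such a log is sketched below. The field names and the JSON Lines export are assumptions for the example, not a description of the product's actual log format.

```python
import json


def log_redaction(log, doc_id, page, box, redacted_text, reason, method):
    """Append one redaction record to the log and return it."""
    entry = {
        "doc_id": doc_id,
        "page": page,
        "box": list(box),                # page coordinates of the redacted region
        "redacted_text": redacted_text,  # None when no text was available (zoned)
        "reason": reason,
        "method": method,                # "text" or "zone"
    }
    log.append(entry)
    return entry


def export_log(log):
    """Serialize the log as JSON Lines, one redaction per line."""
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in log)
```

A log that records what was redacted contains the PII itself, so it would presumably need the same protection as the original documents.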

Expedited Disposition

Often documents are used to collect information that is then input into a database or decision support system. If organizations knew the information was accurately entered, the documents could be viewed as transitory and be scheduled for immediate or expedited disposition.

Visual classification technology provides an automated way to validate that the information on specific documents was entered correctly and then to flag those documents for expedited disposition. Automated attribute extraction can pull specified data elements and format them. These data values can then be checked against the values maintained in databases or decision support systems.
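The validation check can be sketched as follows, assuming extracted attributes and database records are both available as dictionaries. The key names, the normalization rule, and the function names are assumptions for illustration only.

```python
def normalize(value):
    """Lowercase and strip punctuation and spaces so formatting differences are ignored."""
    return "".join(ch for ch in str(value).lower() if ch.isalnum())


def validate_document(extracted, record, fields):
    """Return the fields whose extracted value is missing or differs from the record."""
    mismatches = []
    for name in fields:
        value = normalize(extracted.get(name, ""))
        if not value or value != normalize(record.get(name, "")):
            mismatches.append(name)
    return mismatches


def flag_for_expedited_disposition(documents, database, fields):
    """Flag documents whose extracted values all match the corresponding database record."""
    flagged = []
    for doc in documents:
        record = database.get(doc["record_id"])
        if record is not None and not validate_document(doc["extracted"], record, fields):
            flagged.append(doc["id"])
    return flagged
```

Documents with any mismatch, or with no matching record, stay out of the flagged list so that someone can check them before the source document is disposed of.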

A Final Word

Detecting and protecting PII is one of several major document-centric information governance initiatives that are dependent on consistent document classification. Others include records retention and disposition, file share remediation, content migration, silo consolidation and digitization. The good news is that visual classification can serve as the foundation for all those initiatives, making the most effective use of the time and energy invested in reviewing and classifying an organization’s otherwise “unstructured” documents.