Reviewing Visual Clusters for Multiple Information Governance Purposes
Typically, the number of visually-similar clusters is less than one percent of the total number of documents, and the clusters can be viewed starting with the largest clusters first.
Within a few days, the PII detection team can:
- Review clusters representing well over 99% of all the documents in an organization,
- Eliminate clusters that do not have ongoing business or legal value,
- Tag those remaining clusters that contain PII,
- Assign document-type name labels to them, and
- Identify the PII attributes or data elements present in each document type.
Subsequent reviews only have to examine new clusters of visually-similar documents that have formed since the time of the last review. Decisions made about existing clusters are simply applied to documents that are later added to the cluster.
Whether or not those documents had associated text values, after the review of visually-similar clusters, the organization can now decide what types of protection is warranted for each type of document:
- What level of storage is indicated, e.g., should some of the clusters be on encrypted servers?
- Which people or job functions should be able to see specific document types?
- What retention period should apply?
Visual Classification and Text-Based Approaches Are Complementary
Visual classification is not used to the exclusion of text-based approaches. In fact, text-based pattern and term searching techniques can be used in conjunction with visual classification to provide the most comprehensive detection and protection options available.
After visual clusters are formed, searches can be made for patterns like social security numbers or for lists of potential PII like medical diagnoses. The results are then viewed arranged by visual cluster to determine whether some of the clusters that were not originally tagged as regularly having PII ought to be included in the PII category of clusters.
Note that even if not all documents in a cluster have associated text, the ones that have text can be identified as having PII and this can result in all the documents in the cluster receiving the additional PII protection they warrant.
Detecting PII Flags or Cues
Text search can also locate words that often serve as flags or cues for PII. For example, the terms “SSAN” or “SS#” or “Social Security Number” will often serve as flags that the information close by includes social security numbers. If documents cannot be sorted or arranged by visual cluster or document type it could be very burdensome to review the results of such a search because there can be so many hits. However, when the results can be reviewed by cluster or document type, attention can be focused only on those clusters that have not already been designated as containing PII.
Protective Measures
Once documents or clusters have been identified as having PII they can be afforded the appropriate level of protection. These include:
Encrypted Storage
Encrypting data helps protect it and lowers regulatory risks in the event of a data breach. However, organizations may not want to encrypt everything they have. Visual classification greatly reduces the content that is kept because much content can be disposed of, and then only a portion of what is retained may warrant encryption.
Restricted Access
Having consistent, reliable document type classification of all stored content permits organizations to restrict employees’ access to only those documents they need to perform their jobs.
Prompt Disposition
Without consistent document classification, many organizations end up keeping everything either forever or for the longest retention period associated with any of the documents in a collection. Consistent, reliable classification permits granular retention schedules that can be readily applied, considerably reducing the volume of content at risk.
Three other protective options ought to be considered:
Text Redaction
Visual classification system is based on cataloging the graphical elements on all pages. As part of that process it content-enables image-only documents to provide searchable text. Whether it provided the searchable text or the text was already present when the documents were processed, the system knows the page coordinates for the text values associated with the pages. It uses those coordinates to perform high-speed, highly-accurate redactions using expressions or word lists, on the order of 700,000 redactions per CPU per hour.