Data classification, or information extraction, is the process of extracting semantics from text. One of the main motivations for this work stems from the need to protect and control the use of sensitive information within an enterprise. For example, this process may include identifying sensitive columns in databases or identifying personal information in documents.

This activity includes multiple research and development aspects. First, we are laying the groundwork to identify and classify basic data items (such as credit card numbers and SSNs). Next, building on these mechanisms, we consider different targets of classification and research directions:

  • Classification of metadata (e.g., document properties such as file name)
  • Classification of information in semi-structured and structured documents, where the context (such as a column) may provide additional hints
  • Classification of plain text, where natural language processing may provide additional information
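
To make the first two ideas concrete, the sketch below shows one possible way to detect basic data items and to let a column name act as a contextual hint. It is an illustration only, not the mechanism used in this activity: the regular expressions, the label names, the `COLUMN_HINTS` table, the function names, and the 0.8 threshold are all assumptions chosen for the example. Credit card candidates are filtered with the standard Luhn checksum to cut down on false positives.

```python
import re

# Hypothetical patterns for illustration; real data needs broader formats.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")   # 13 to 19 digits, optional separators
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")       # US SSN written as NNN-NN-NNNN

# Hypothetical column-name fragments that hint at a label.
COLUMN_HINTS = {"ssn": ("ssn", "social"), "credit_card": ("card", "cc_num")}


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify_value(value: str) -> set[str]:
    """Return the set of basic data item labels found in a single value."""
    labels = set()
    for match in CARD_RE.finditer(value):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            labels.add("credit_card")
    if SSN_RE.search(value):
        labels.add("ssn")
    return labels


def classify_column(name: str, values: list[str], threshold: float = 0.8) -> set[str]:
    """Label a column when enough of its values match a basic data item.

    A matching column name halves the required fraction (an assumed policy).
    """
    counts: dict[str, int] = {}
    for value in values:
        for label in classify_value(value):
            counts[label] = counts.get(label, 0) + 1

    lowered = name.lower()
    result = set()
    for label, n in counts.items():
        hinted = any(h in lowered for h in COLUMN_HINTS.get(label, ()))
        needed = threshold / 2 if hinted else threshold
        if n / len(values) >= needed:
            result.add(label)
    return result
```

With these assumptions, a column named `customer_ssn` in which only half of the values look like SSNs would still be labeled, while the same values under a column named `notes` would not. How much weight the context should carry is exactly the kind of parameter a classification framework would need to expose.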

Finally, we are combining existing mechanisms (e.g., document classification) with the newly developed ones described above, providing the means to control and adjust classification parameters and delivering a holistic solution for data classification.

The goal of this activity is to develop a generic classification framework able to support multiple diverse applications.