
1000x faster ingestion

Deep Search: Connecting unstructured data

*Deep Search can ingest 20 pages per second, whereas a typical human expert takes 1–2 minutes per page just to read.
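
As a rough check on the headline figure, the two rates quoted above can be compared directly. The sketch below is only back-of-the-envelope arithmetic on the numbers in the footnote; the function and constant names are illustrative.

```python
# Back-of-the-envelope comparison of the two reading rates quoted above.
DEEP_SEARCH_PAGES_PER_SECOND = 20      # from the footnote
HUMAN_SECONDS_PER_PAGE = (60, 120)     # "1-2 minutes per page", from the footnote


def speedup(pages_per_second, human_seconds_per_page):
    """Return how many pages the machine ingests in the time a human reads one."""
    return pages_per_second * human_seconds_per_page


low, high = (speedup(DEEP_SEARCH_PAGES_PER_SECOND, s) for s in HUMAN_SECONDS_PER_PAGE)
print(f"Speedup: {low}x to {high}x")  # prints "Speedup: 1200x to 2400x"
```

On these numbers the ratio is between 1200x and 2400x, which is consistent with the rounded "1000x" claim.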

Deep Search uses natural language processing to ingest and analyze massive amounts of data—structured and unstructured. Researchers can then extract, explore, and make connections faster than ever.

See how we used Deep Search to discover a new molecule

How it works

Deep Search imports and analyzes data from public, private, structured, and unstructured sources. AI then breaks down the data and classifies it into fundamental parts.

The Deep Search process starts with unstructured data such as journal articles, patents, or technical reports. No matter whether this data comes from public or proprietary sources, businesses can leverage both securely through our hybrid cloud.

After reviewing unstructured data, the user annotates a few documents to create an AI model. The model then classifies all documents into their fundamental parts. By using this AI model and NLP (Natural Language Processing), Deep Search is able to ingest and understand large collections of documents and unstructured data at scale, automatically extracting semantic units and their relationships.

Once the data has been consolidated and extracted, Deep Search organizes and structures it into a searchable knowledge graph—enabling users to robustly explore information extracted from tens of thousands of documents without having to read a single paper.

Figure D1.

Ingesting the information


Unstructured documents like PDFs of articles and patents are uploaded. Text, bitmap images and line paths are then parsed.

Graphic showing outlines of digital documents


Graphic showing a digital document with labels for machine learning

If there is no model available, a new one can easily be created and trained. To train a model, documents are first categorized by layout.

Next, sections from a sample of unique pages are annotated with semantic labels. The model is then applied to the rest of the document, automatically annotating each page.


The predicted annotations are inspected and corrected to improve the model's performance. Once refined, the model is ready to be applied to other documents.

The remaining documents are parsed, labeled, and assembled into a JSON file that contains both the content and structure of the originals.
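
The exact schema of this JSON file is not given here. As a purely illustrative sketch, a converted document might be assembled along the following lines. Every field name (`file`, `pages`, `cells`, `label`, `bbox`), the file name, and the sample text are assumptions made for the example, not the actual Deep Search output format.

```python
import json

# Hypothetical labeled cells from one parsed page. Each cell carries its text,
# the semantic label assigned by the layout model, and a bounding box
# (x0, y0, x1, y1) that preserves where it sat on the page.
labeled_cells = [
    {"label": "title", "text": "An example article title", "bbox": [72, 700, 540, 730]},
    {"label": "paragraph", "text": "First paragraph of the body.", "bbox": [72, 600, 540, 690]},
    {"label": "caption", "text": "Figure 1. An example caption.", "bbox": [72, 300, 540, 320]},
]


def assemble_document(filename, pages):
    """Combine per-page labeled cells into one dict holding content and structure."""
    return {
        "file": filename,
        "pages": [
            {"page": number, "cells": cells}
            for number, cells in enumerate(pages, start=1)
        ],
    }


document = assemble_document("example.pdf", [labeled_cells])
print(json.dumps(document, indent=2))
```

Keeping the label and position next to each piece of text is what lets later steps treat a title differently from a caption or a table cell.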

Graphic showing JSON extract from digital documents

Figure D2.

Constructing the knowledge graph


NLP components are run on JSON files, linking entities and extracting relationships.

Information is no longer contained within a document, but part of a larger ecosystem created from many documents.
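
A toy version of this step can be sketched as follows. It assumes a small hand-written entity dictionary and treats co-occurrence in a sentence as a relationship; a production system would rely on trained NLP models instead, and all names here (`ENTITY_DICTIONARY`, `find_entities`, `extract_relations`, the `co_occurs_with` relation) are hypothetical.

```python
import re
from itertools import combinations

# Hypothetical entity dictionary: surface form -> (entity type, canonical name).
ENTITY_DICTIONARY = {
    "deep search": ("system", "Deep Search"),
    "knowledge graph": ("concept", "knowledge graph"),
    "patents": ("document_type", "patent"),
}


def find_entities(sentence):
    """Return (type, canonical name) for each dictionary entry found in the sentence."""
    found = []
    lowered = sentence.lower()
    for surface, (entity_type, canonical_name) in ENTITY_DICTIONARY.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.append((entity_type, canonical_name))
    return found


def extract_relations(sentences):
    """Link every pair of entities that co-occur in the same sentence."""
    relations = set()
    for sentence in sentences:
        entities = sorted(find_entities(sentence))
        for left, right in combinations(entities, 2):
            relations.add((left[1], "co_occurs_with", right[1]))
    return relations


sentences = [
    "Deep Search builds a knowledge graph from patents.",
    "The knowledge graph can then be queried.",
]
relations = extract_relations(sentences)
for relation in sorted(relations):
    print(relation)
```

Because entities are stored under a canonical name rather than the wording of any one sentence, mentions from different documents end up attached to the same node.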

Graphic showing highlighted entities in sentences


Graphic showing connected nodes of a knowledge graph

Extracted information is combined with other sources, such as private or public databases, to form a searchable knowledge graph.


When queried, the system is able to form links between different nodes within the knowledge graph.

It understands that the same material may be written in different ways, for example as an abbreviation or a formula, and makes accurate connections.
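
One simple way to picture this behavior is an alias table that maps every spelling of a material to a single node. The sketch below is a minimal illustration under that assumption, not a description of how Deep Search resolves names internally; the `KnowledgeGraph` class, the `mentioned_in` relation, and the document identifiers are hypothetical.

```python
from collections import defaultdict

# Hypothetical alias table: each spelling maps to one canonical node name.
ALIASES = {
    "titanium dioxide": "titanium dioxide",
    "tio2": "titanium dioxide",                      # chemical formula
    "polyethylene terephthalate": "polyethylene terephthalate",
    "pet": "polyethylene terephthalate",             # abbreviation
}


def canonical(name):
    """Map a surface form to its canonical node; unknown names pass through lowercased."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


class KnowledgeGraph:
    def __init__(self):
        self.edges = defaultdict(set)  # node -> set of (relation, node)

    def add(self, source, relation, target):
        """Store an edge between the canonical forms of source and target."""
        self.edges[canonical(source)].add((relation, canonical(target)))

    def neighbors(self, name):
        """Return all edges of a node, whichever spelling is used in the query."""
        return sorted(self.edges.get(canonical(name), set()))


graph = KnowledgeGraph()
graph.add("TiO2", "mentioned_in", "doc-1")
graph.add("titanium dioxide", "mentioned_in", "doc-2")
print(graph.neighbors("TiO2"))
```

A query for either spelling returns both documents, because both mentions were stored under the same node.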

Graphic showing semantically linked nodes of a knowledge graph

By accelerating the intake, organization, and understanding of massive amounts of data, Deep Search empowers researchers to grasp previously daunting bodies of information in a fraction of the time.

Deep Search for Future of Materials
Dr. Peter Staar, Ph.D.
Deep Search at work:

Project Photoresist

Turning a vast collection of patents, papers, and data into an interactive library of molecules.

To find a more sustainable photoacid generator (PAG), we first needed to know what already existed. Our research team began by collecting nearly all published literature on PAGs: over 6,000 patents, open-source documents, and publicly available material datasheets. This was a massive amount of information that would need to be carefully analyzed and organized.

While data intake traditionally requires whole teams of researchers to slowly read through stacks of documents, Deep Search was able to complete this task in a fraction of the time. Once trained, it was able to extract and identify most known PAG molecules and their reported properties.

Deep Search then generated a knowledge graph of all the extracted data, giving researchers the flexibility to examine the vast set of molecules in multiple ways. We used the knowledge graph to sort the PAGs into families, looking first for outliers: PAGs that used non-toxic metals rather than the traditional sulfur.

From this small subset, our team was able to quickly identify a molecule that already existed in the IBM lab—it had been sitting on the shelves, unused! The new candidate showed good sensitivity in an extreme ultraviolet (EUV) photoresist. A better view of the data alone had presented us with a promising PAG alternative, saving our team whole rounds of design, simulation, and testing from the outset of the project.

Figure D3.


Collect & Organize

Deep Search ingested 6,000+ patents, publications, and data sheets in order to create a catalogue of all known PAGs.

Graphic with the label “Extracted PAG families”


Graphic with the labels “Extracted PAG families”, “Lambda Max”, “Biodegradability”, and “LD50”


Examining the Deep Search knowledge graph, we were able to identify fragmented sections that needed information, as well as discover a promising PAG already in our lab.

Case Studies

Case study

We built a biochemical knowledge graph with Nagase to identify novel carbohydrate enzymes.

Case study

Helping the community understand COVID-19. In just two weeks, we built an explorable knowledge graph of the COVID-19 database for researchers working on vaccines and treatment for the novel coronavirus.


Pierre L. Dognin, Igor Melnyk, Inkit Padhi, Cicero Nogueira dos Santos, and Payel Das.

EMNLP (2020)

Matteo Manica, Christoph Auer, Valery Weber, Federico Zipoli, Michele Dolfi, Peter Staar, Teodoro Laino, Costas Bekas, Akihiro Fujita, Hiroki Toda, Shuichi Hirose, and Yasumitsu Orii.

KDD (2020)

Chieh Lin, Pei-Hua Wang, Yi Hsiao, Yi-Tsu Chan, Amanda C. Engler, Jed W. Pitera, Daniel P. Sanders, Joy Cheng, and Yufeng J. Tseng.

ACS Applied Polymer Materials (2020)

Peter W Staar, Michele Dolfi, Christoph Auer, and Costas Bekas.

KDD (2018)

Discovery Workloads on the Hybrid Cloud

Emerging discovery workflows are posing new challenges for compute, network, storage, and usability. IBM Research supports these new workflows by bringing together world-class physical infrastructure, a hybrid cloud platform that unifies computing, data, and the user experience, and full-stack intelligence for orchestrating discovery workflows across computing environments.