Use deep search to explore the COVID-19 corpus

To help researchers access structured and unstructured data quickly, IBM Research has developed a cloud-based AI research service that has ingested a corpus of thousands of papers from the COVID-19 Open Research Dataset (CORD-19) and licensed databases from DrugBank, ClinicalTrials.gov and GenBank. The tool lets users run specific queries against these collections and extract critical COVID-19 knowledge, including embedded text, tables and figures.

Current usage

Last updated: 29-June-2020

Ingested papers (more will be added daily): 158,524

Registered users: 513

Corpus Conversion Service and Corpus Processing Service 

To unlock the knowledge in the published structured and unstructured data on COVID-19, IBM researchers are making two key technologies available: the Corpus Conversion Service and the Corpus Processing Service. Both are already in extensive use in the materials science, automotive and energy industries.

The Corpus Conversion Service can ingest 100,000 PDF pages per day, including scanned documents, on a single server. It then trains and applies machine learning models that extract the content of these documents with high accuracy, at a scale not achieved before. We have applied this technology to thousands of PDFs on the coronavirus and COVID-19 and combined the results with curated databases from DrugBank, ClinicalTrials.gov and GenBank.
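To put the quoted throughput in perspective, the following minimal sketch converts it into pages per second and estimates how long a batch of PDFs would take on one server. Only the 100,000 pages per day figure comes from the text; the function names, the batch size and the average page count are hypothetical values chosen for illustration.

```python
# Single-server figure quoted in the text above.
PAGES_PER_DAY = 100_000
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_second(pages_per_day=PAGES_PER_DAY):
    """Average sustained rate implied by a daily page count."""
    return pages_per_day / SECONDS_PER_DAY


def server_days(num_documents, avg_pages_per_document, pages_per_day=PAGES_PER_DAY):
    """Days one server would need for a batch, assuming a constant rate."""
    return num_documents * avg_pages_per_document / pages_per_day


rate = pages_per_second()
# Hypothetical batch: 5,000 PDFs averaging 10 pages each (assumed, not from the source).
days = server_days(5_000, 10)

print(round(rate, 2))  # 1.16
print(days)            # 0.5
```

Under these assumptions, a batch of a few thousand papers fits within a single day on one server, which is consistent with the statement that thousands of PDFs have been processed.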

The Corpus Processing Service integrates data from databases and publications into a knowledge graph, so that these can be queried to retrieve known facts and to generate novel insights.  

Examples of the types of queries:

  1. Which drugs have been used so far, and what are the outcomes?
  2. Identify newly reported risk factors.

Corpus Processing Service features

The Corpus Conversion Service allows us to convert the latest PDF papers (e.g. from bioRxiv) into JSON documents. These can be ingested into a knowledge graph as unstructured data, allowing users to explore the latest published research.

The knowledge graph incorporates data from various sources, both unstructured (e.g. CORD-19 documents and converted PDF files) and structured (e.g. DrugBank, GenBank and clinical trials). The current knowledge graph contains approximately 4 million nodes and 50 million edges.
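The idea of combining structured and unstructured sources in one graph can be sketched with a small in-memory example. This is an illustration only and does not reflect the service's actual data model or API: the KnowledgeGraph class, the node identifiers and the "mentions" edge label are all hypothetical.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy in-memory graph: typed nodes plus undirected, labeled edges."""

    def __init__(self):
        self.nodes = {}                # node id -> {"type": ..., "source": ...}
        self.edges = defaultdict(set)  # node id -> {(label, neighbor id)}

    def add_node(self, node_id, node_type, source):
        self.nodes[node_id] = {"type": node_type, "source": source}

    def add_edge(self, a, label, b):
        # Store the edge in both directions so it can be followed either way.
        self.edges[a].add((label, b))
        self.edges[b].add((label, a))

    def neighbors(self, node_id, node_type=None):
        """Return sorted neighbor ids, optionally restricted to one node type."""
        found = {n for _, n in self.edges[node_id]}
        if node_type is not None:
            found = {n for n in found if self.nodes[n]["type"] == node_type}
        return sorted(found)


kg = KnowledgeGraph()
# Structured side: a placeholder standing in for a curated database entry.
kg.add_node("drug:example-drug", "drug", source="structured")
# Unstructured side: placeholders standing in for converted papers.
kg.add_node("paper:doc-1", "paper", source="unstructured")
kg.add_node("paper:doc-2", "paper", source="unstructured")
kg.add_edge("paper:doc-1", "mentions", "drug:example-drug")
kg.add_edge("paper:doc-2", "mentions", "drug:example-drug")

papers = kg.neighbors("drug:example-drug", node_type="paper")
print(papers)  # ['paper:doc-1', 'paper:doc-2']
```

Because both kinds of source end up as nodes in the same graph, a question that starts from a database entry can be answered with documents, and the other way round.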

The knowledge graph will be updated and extended regularly to incorporate newly reported data. 

For advanced users, we offer deep search capabilities. These let users build complex query workflows on the knowledge graph to obtain specific answers from the literature. One example is a search for evidence on the incubation time.
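A query workflow of this kind can be thought of as a chain of steps, each taking a set of graph nodes and returning a new set. The sketch below is a hedged illustration of that pattern and not the Deep Search workflow interface: the toy graph is synthetic, and the step names search, traverse, keep and run are assumptions made for this example.

```python
from functools import reduce

# Synthetic toy graph, invented purely for illustration.
NODES = {
    "term:incubation time": "term",
    "paper:doc-1": "paper",
    "paper:doc-2": "paper",
    "table:doc-2/t1": "table",
}
EDGES = [
    ("term:incubation time", "paper:doc-1"),
    ("term:incubation time", "table:doc-2/t1"),
    ("table:doc-2/t1", "paper:doc-2"),
]


def search(text):
    """Step: ignore the incoming set and start from nodes whose id contains `text`."""
    return lambda _current: {n for n in NODES if text in n}


def traverse():
    """Step: move to every node exactly one edge away from the current set."""
    def step(current):
        reached = set()
        for a, b in EDGES:
            if a in current:
                reached.add(b)
            if b in current:
                reached.add(a)
        return reached
    return step


def keep(node_type):
    """Step: keep only nodes of the given type."""
    return lambda current: {n for n in current if NODES[n] == node_type}


def run(workflow):
    """Apply the steps in order, starting from an empty node set."""
    return sorted(reduce(lambda nodes, step: step(nodes), workflow, set()))


# Papers linked directly to the search term.
direct = run([search("incubation"), traverse(), keep("paper")])
# Papers reached through a table that is linked to the search term.
via_table = run([search("incubation"), traverse(), keep("table"), traverse(), keep("paper")])

print(direct)     # ['paper:doc-1']
print(via_table)  # ['paper:doc-2']
```

Chaining small steps this way keeps each one easy to reason about, and a longer workflow is simply a longer list.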

Request access

Access will be granted to scientists and academics. To request access, the Deep Search service collects your name, email address, affiliation and intended use. This personal information will be used solely to assess whether to grant you access and to give approved individuals access to our site, content and services. It will not be used for any other purpose and will be retained for 12 months. If you have been granted access and no longer wish to have it, or wish your information to be removed, you can submit a withdrawal request at any time.

DrugBank data is available under a CC-BY-NC 4.0 licence. The datasets can be used freely in a non-commercial application or project. If you are interested in using DrugBank data in a commercial product or application, please see the DrugBank release page.

More information on our processing can be found in the IBM Privacy Statement. By submitting this form, I acknowledge that I have read and understand the IBM Privacy Statement. I accept the product Terms and Conditions of this registration form.

Research resources

The research resources below highlight the scientific advances that drive many of the Corpus Processing Service's capabilities.

Paper

An Information Extraction and Knowledge Graph Platform for Accelerating Biochemical Discoveries

KDD (2019)

Paper

Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale

SysML Conference (2018)

Blog

Corpus Conversion Service Makes PDF Content Discoverable

IBM Research

IBM’s response to COVID-19

To meet the global challenge of COVID-19, the world must come together — and IBM has resources to share. Watch this page for a growing list of offers to help you learn, adapt and overcome. 
