E. Data sets that can accelerate AI research
(RFI question 9)

Last updated July 28, 2016

A major bottleneck in developing and validating AI systems is public access to sufficiently large, openly curated training data sets. Machine learning, whether supervised or unsupervised, requires large, unbiased data sets to train accurate models. Deep learning is advancing speech transcription, language translation, image captioning, and question-answering capabilities. Each new AI advance, e.g., video comprehension, requires the creation of new data sets. Deep domain tasks, such as cancer radiology or insurance adjustment, require specialized and often hard-to-obtain data sets. Incentives must be created for greater sharing of both input data sets and trained models through mechanisms like model zoos.

The open curation and sharing of large data sets is essential for the development and validation of cognitive systems. Increasingly, machine learning is relied on to train models using both supervised and unsupervised techniques. Tremendous progress has been made using deep learning in particular; with open data sets, trained models have greatly advanced speech transcription, language translation, image captioning, and question answering.

Examples: There are notable examples, such as ImageNet (14M images, 22K label categories), where open training data has been essential for propelling the field forward. Ongoing efforts such as Visual Genome, MegaFace, and YouTube-8M continue to produce valuable data sets. However, the bottleneck for sustaining further progress in cognitive systems is public access to sufficiently large, openly curated training data sets.

One can easily foresee that trade bodies or industry associations might wish to create open “primer” data sets for the industries they represent, much as organizations such as RosettaNet worked to develop e-business and electronics industry data standards in the early 2000s.

Avoiding Biases and Skew: The complexities of machine learning based systems make it imperative that data sets be developed in an open and transparent way. Biases are often embedded in training data, and they become compounded during training. For example, skew with respect to race has been found in popularly used face recognition data sets. In one high-profile instance, a large commercial company’s photo tagging service was found to be falsely and insensitively labeling people of certain races as animals. When gaps or biases in the training data go undetected, cognitive systems are left with blind spots, which can have dire consequences.
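To make the notion of skew concrete, the following is a minimal, hypothetical sketch of the kind of audit that open curation enables: it simply tallies how often each annotated group appears in a data set and reports the imbalance between the most and least represented groups. The group names and counts are illustrative only, not drawn from any actual data set.

```python
from collections import Counter

def label_skew_report(labels):
    """Summarize how evenly labels are distributed in a data set.

    Returns each label's share of the data and the ratio between the
    most and least common labels -- a simple signal of skew.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio

# Illustrative annotations for a face data set labeled with (hypothetical)
# demographic groups; a real audit would use the data set's actual metadata.
labels = ["group_a"] * 800 + ["group_b"] * 150 + ["group_c"] * 50
shares, ratio = label_skew_report(labels)
print(shares)  # {'group_a': 0.8, 'group_b': 0.15, 'group_c': 0.05}
print(ratio)   # 16.0 -- the most common group is 16x the least common
```

Open curation means that such audits can be run, and their findings challenged, by anyone, rather than only by the data set's original collectors.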

Sharing Models: Beyond the open curation of large data sets, there is great benefit in the open sharing of trained models. As the volume of training data has grown, the computation required for training cognitive systems has increased tremendously. It may take months of processing to learn a single discriminative model for action detection from a large video data set. Greater sharing of these trained models through mechanisms like the Caffe Model Zoo lowers the barrier to entry for researchers and enables greater participation from the community in advancing the field.
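As a rough illustration of how a model zoo lowers that barrier, the sketch below loads a network definition and pre-trained weights using Caffe's Python interface. The file names are placeholders for files a researcher would download from a Model Zoo entry; the weights encode the months of training that the downstream user does not have to repeat.

```python
import caffe

# Model definition and pre-trained weights downloaded from a Model Zoo entry
# (placeholder file names; each zoo entry documents its own files).
model_def = "deploy.prototxt"
model_weights = "bvlc_reference_caffenet.caffemodel"

# Load the network in test (inference) mode; the expensive training phase
# has already been paid for by whoever published the weights.
net = caffe.Net(model_def, model_weights, caffe.TEST)

# The pre-trained network can now be used directly for inference, or
# fine-tuned on a smaller, task-specific data set.
print(net.blobs["data"].data.shape)
```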
