
The Hearst Metrotone Newsreel Collection

The Hearst Metrotone newsreel collection is one of the most important resources for twentieth-century history. It includes approximately 850 hours of newsreel footage covering major events from 1914 to 1971.

The collection's documentation is on aging paper and includes 500,000 typed index cards as well as synopsis and disposition sheets. The cards provide detailed descriptions of the events recorded on each newsreel and are themselves irreplaceable historical documents.

The collection was bequeathed to the UCLA archive in 1981. Access to the collection has been restricted in an effort to preserve the deteriorating paper catalogue. For more than 10 years, UCLA searched for an affordable way to digitize the paper documentation and create a searchable online database that would be accessible to the general public.

The Technical Challenge
The text on the collection's index cards was of very poor quality because of aging, old copying techniques, improper print positioning, and old typewriter fonts.

UCLA tried off-the-shelf OCR systems and found them unusable. Most OCR systems are optimized for clean text in the most prevalent computer-generated fonts. Deterioration that left characters broken or connected to one another also severely degraded OCR performance.

Examples of index cards:

Figure 27 - Example of Hearst Metrotone newsreel cards

Our Approach
Haifa researchers treated the entire digitization operation as a special "business process" and designed both the workflow and the processing modules to optimize it.

The key elements of the solution include:

  • Advanced image preprocessing - color scanning, color dropout, and binarization.
  • Enhanced OCR - sophisticated segmentation into lines, words, and individual characters, and OCR tuned to the old typewriter fonts.
  • An adaptive spell checker to enhance the recognition rate.
  • Replica detection, which yields a substantial saving - most of the cards are filed under several categories and are therefore replicated. We found these replicas by analyzing geometric structures among tens of thousands of possibilities and by using partial recognition results.
  • Layout analysis, which relates each line of text to the appropriate category, e.g., film description, camera location, or date.
  • Special key-in applications that enhance operator productivity in validating and correcting the post-OCR data.
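
To make the first element concrete, the following is a minimal sketch of color dropout followed by binarization. It is not the project's implementation: the function names, the dropout color, the tolerance, and the fixed threshold are all hypothetical values chosen for illustration, and a real system would more likely use an adaptive threshold.

```python
# Hypothetical sketch: an image is a list of rows of (R, G, B) tuples, 0..255.
# The dropout color, tolerance, and threshold are illustrative assumptions.

def drop_color(pixel, dropout=(200, 60, 60), tolerance=80):
    """Replace pixels close to the dropout color (e.g. red form lines) with white."""
    distance = sum(abs(c - d) for c, d in zip(pixel, dropout))
    return (255, 255, 255) if distance <= tolerance else pixel


def to_gray(pixel):
    """Integer luminance approximation using the ITU-R BT.601 weights."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000


def binarize(image, dropout=(200, 60, 60), tolerance=80, threshold=128):
    """Return rows of 1 (ink) and 0 (background) after color dropout."""
    return [
        [1 if to_gray(drop_color(p, dropout, tolerance)) < threshold else 0
         for p in row]
        for row in image
    ]


# A tiny 2x3 "card": paper, typed ink, and reddish marks that should vanish.
card = [
    [(250, 245, 230), (30, 30, 30), (210, 70, 65)],
    [(205, 55, 60), (40, 35, 45), (240, 240, 235)],
]
bitmap = binarize(card)
print(bitmap)
```

Without the dropout step, the reddish pixels in this example would fall below the threshold and be kept as ink, which is why the sketch applies dropout on the color image before thresholding.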

The workflow exploits special characteristics of the cards, for example:

  • OCR voting once the replicas of a card have been found.
  • Extracting all cross-index terms first and then re-running OCR on the body of the cards with an enhanced dictionary that includes the added words.
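
The two workflow steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's algorithm: it assumes the OCR readings of the replicas are already aligned to equal length, it uses a simple character-wise majority vote, and it stands in Python's `difflib` fuzzy matching for the dictionary-assisted recognition. The function names, sample strings, and the 0.8 cutoff are hypothetical.

```python
from collections import Counter
from difflib import get_close_matches


def vote(readings):
    """Character-wise majority vote over equal-length OCR readings of replicas."""
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*readings)
    )


def correct(text, dictionary, cutoff=0.8):
    """Replace each word with its closest dictionary entry, if one is close enough."""
    words = []
    for word in text.split():
        match = get_close_matches(word, dictionary, n=1, cutoff=cutoff)
        words.append(match[0] if match else word)
    return " ".join(words)


# Three hypothetical OCR readings of the same card, each with a different error.
replicas = ["R0OSEVELT SPEAKS", "ROOSEVELT SPFAKS", "ROOSEVFLT SPEAKS"]
voted = vote(replicas)

# A hypothetical base dictionary, then one enhanced with cross-index terms.
base_dictionary = ["SPEAKS", "PRESIDENT"]
cross_index_terms = ["ROOSEVELT"]
enhanced_dictionary = base_dictionary + cross_index_terms

single_reading = "ROOSEVFLT SPEAKS"
before = correct(single_reading, base_dictionary)
after = correct(single_reading, enhanced_dictionary)
print(voted, "|", before, "|", after)
```

Voting only helps when the replicas disagree in different places, and the dictionary pass only helps for words it contains, which is one reason to harvest the cross-index terms before re-reading the card bodies.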

The following diagram explains the order of automatic and manual processing of the index cards.


Figure 28 - Flow diagram of the Hearst Metrotone index cards processing

Key-in Process:

Figure 29
Left: Layout analysis of card.
Right: Same card after OCR and key-in.

Some Statistics:

  • 500,000 cards were processed.
  • The automatic processing took about four days working in parallel on eight computers.
  • The recognition rate was increased two-fold using IBM's intelligent document processing technology.
  • The data key-in process took about 50 seconds per card. Key-in productivity enhancements significantly reduced the time required for correcting each card.
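
For a sense of scale, the figures above can be turned into rough per-card numbers. The calculation below is only back-of-the-envelope arithmetic: it assumes the four days were full 24-hour days on all eight computers, and it assumes every one of the 500,000 cards was keyed individually, which gives an upper bound because replica detection was meant to avoid exactly that.

```python
CARDS = 500_000
AUTO_DAYS = 4                 # "about four days" (assumed to be full 24-hour days)
COMPUTERS = 8
KEYIN_SECONDS_PER_CARD = 50   # "about 50 seconds per card"

SECONDS_PER_DAY = 24 * 60 * 60

# Total machine time for automatic processing, spread over all cards.
machine_seconds = AUTO_DAYS * COMPUTERS * SECONDS_PER_DAY
auto_seconds_per_card = machine_seconds / CARDS

# Upper bound on operator time if no card were skipped as a replica.
keyin_hours_upper_bound = CARDS * KEYIN_SECONDS_PER_CARD / 3600

print(f"automatic: {auto_seconds_per_card:.2f} machine-seconds per card")
print(f"key-in upper bound: {keyin_hours_upper_bound:.0f} operator-hours")
```

Under these assumptions the automatic stage costs about 5.5 machine-seconds per card, while manual key-in is bounded by roughly 6,900 operator-hours, so the manual stage dominates the effort.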