InfoChem is performing automatic chemical named entity recognition on Chemisches Zentralblatt, one of the most important abstracts journals for the period 1830-1969.
The aim of the project is to build a structure-searchable database that offers language-independent search in this important historical source.
Chemisches Zentralblatt is the first and oldest abstracts journal published in the field of chemistry. It covers the chemical literature from 1830 to 1969 and documents the "birth" of chemistry as a science, as distinct from alchemy.
Over 140 years, Chemisches Zentralblatt published 900,000 pages: 700,000 contain ca. 2 million abstracts, and 200,000 are indexes.
The documents, mainly in .tiff format, undergo OCR processing. The ICANNOTATOR performs named entity recognition with the support of the optimized SPRESI dictionaries; the abstracted names are then converted into connection tables by the name-to-structure tool, which is also integrated in the ICANNOTATOR and supported by the dictionaries. The connection tables and the associated names are stored in a database, and combined searches of text and structures can then be performed on a federated search system. From the hit list, a direct link to the original literature leads straight to the page containing the information.
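The workflow above can be illustrated with a minimal sketch. It is not the ICANNOTATOR and does not use the SPRESI dictionaries: the toy dictionary, the function names, the page identifier and the use of SMILES strings in place of connection tables are all assumptions made for illustration only.

```python
import re

# Hypothetical toy dictionary: chemical name -> SMILES string.
# The SMILES strings stand in for connection tables.
TOY_DICTIONARY = {
    "benzol": "c1ccccc1",
    "äthanol": "CCO",
    "essigsäure": "CC(=O)O",
}


def recognize_names(ocr_text, dictionary):
    """Return dictionary names found in the OCR text, in order of first appearance."""
    tokens = re.findall(r"\w+", ocr_text.lower())
    found = []
    for token in tokens:
        if token in dictionary and token not in found:
            found.append(token)
    return found


def index_page(page_id, ocr_text, dictionary=TOY_DICTIONARY):
    """Build one database record per recognized name, keeping a link back to the page."""
    names = recognize_names(ocr_text, dictionary)
    return [
        {"page": page_id, "name": name, "structure": dictionary[name]}
        for name in names
    ]


# "page-0001" is a made-up page identifier.
records = index_page("page-0001", "Benzol wird mit Äthanol und Essigsäure versetzt.")
for record in records:
    print(record)
```

A real system would need to handle OCR errors, multi-word and systematic names, and names absent from any dictionary, none of which this exact-match lookup attempts.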
To verify the quality of the automatic process, we also manually abstracted structures from a sample set and performed a quantitative comparison.
We have automatically abstracted 900,000 pages, obtaining ca. 1 million unique chemical names and 500,000 unique structures. The quantitative comparison with a manually abstracted sample set shows over 60% recall and nearly 90% precision for our process.
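For readers unfamiliar with the two measures, the following sketch shows how precision and recall can be computed when the automatically abstracted structures are compared with a manually abstracted reference set. The function name and the sample data are illustrative assumptions; they are not the project's evaluation data, and the resulting numbers are unrelated to the figures reported above.

```python
def precision_recall(automatic, manual):
    """Compare automatically found items against a manually curated reference set."""
    automatic, manual = set(automatic), set(manual)
    true_positives = len(automatic & manual)
    # Precision: share of automatically found items that are correct.
    precision = true_positives / len(automatic) if automatic else 0.0
    # Recall: share of reference items that were found automatically.
    recall = true_positives / len(manual) if manual else 0.0
    return precision, recall


# Made-up example sets of structures, written as SMILES strings.
manual = {"CCO", "c1ccccc1", "CC(=O)O", "CO", "C"}
automatic = {"CCO", "c1ccccc1", "CC(=O)O", "CCC"}

precision, recall = precision_recall(automatic, manual)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

High precision with lower recall, as reported for this project, means that most of the structures found are correct, while some structures present in the source are missed.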
Scientists can now perform combined text and structure / substructure searches in Chemisches Zentralblatt using the federated search system ICFEDSEARCH.
Please feel free to contact us if you need further information.
Last modification: January 14, 2010.
Landsberger Straße 408/V
D-81241 München
Germany
| Phone: | +49 (0)89 58 30 02 |
|---|---|
| Fax: | +49 (0)89 580 38 39 |
| Email: | info@infochem.de |