Conference Proceeding

Formalization and preliminary evaluation of a pipeline for text extraction from infographics

Details

Citation

Böschen F & Scherp A (2015) Formalization and preliminary evaluation of a pipeline for text extraction from infographics. In: Görg S, Bergmann R & Müller G (eds.) Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB, volume 1458. CEUR Workshop Proceedings, 1458. LWA 2015 Workshops: KDML, FGWM, IR, FGDB, Trier, Germany, 07.10.2015-09.10.2015. Aachen, Germany: CEUR Workshop Proceedings, pp. 20-31. http://ceur-ws.org/Vol-1458/D03_CRC13_Boeschen.pdf

Abstract
We propose a pipeline for text extraction from infographics that makes use of a novel combination of data mining and computer vision techniques. The pipeline defines a sequence of steps to identify characters, cluster them into text lines, determine their rotation angle, and apply state-of-the-art OCR to recognise the text. In this paper, we formally define the pipeline and present its current implementation. In addition, we have conducted preliminary evaluations over a data corpus of 121 manually annotated infographics from a broad range of illustration types such as bar charts, pie charts, line charts, maps, and others. We assess the results of our text extraction pipeline by comparing it with two baselines. Finally, we sketch an outline for future work and possibilities for improving the pipeline.

Keywords
Infographics; OCR; multi-oriented text extraction; formalization

Journal
CEUR Workshop Proceedings: Volume 1458

Status: Published
Title of series: CEUR Workshop Proceedings
Number in series: 1458
Publication date: 31/12/2015
URL: http://hdl.handle.net/1893/28051
Publisher: CEUR Workshop Proceedings
Publisher URL: http://ceur-ws.org/Vol-1458/D03_CRC13_Boeschen.pdf
Place of publication: Aachen, Germany
ISSN of series: 1613-0073
ISBN: N/A
Conference: LWA 2015 Workshops: KDML, FGWM, IR, FGDB
Conference location: Trier, Germany
Dates