Conference Paper (published)

Multi-oriented text extraction from information graphics

Details

Citation

Böschen F & Scherp A (2015) Multi-oriented text extraction from information graphics. In: Proceedings of the 2015 ACM Symposium on Document Engineering (DocEng '15). 2015 ACM Symposium on Document Engineering, Lausanne, Switzerland, 08.09.2015-11.09.2015. New York: ACM, pp. 35-38. https://doi.org/10.1145/2682571.2797092

Abstract
Existing research on analyzing information graphics assumes that perfect text detection and extraction are available. However, text extraction from information graphics is far from solved. To fill this gap, we propose a novel processing pipeline for multi-oriented text extraction from infographics. The pipeline applies a combination of data mining and computer vision techniques to identify text elements, cluster them into text lines, and compute their orientation, and then uses a state-of-the-art open source OCR engine to perform the text recognition. We evaluate our method on 121 infographics extracted from an open access corpus of scientific publications. The results show that our approach is effective and significantly outperforms a state-of-the-art baseline.
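
The abstract does not say how the orientation of a text line is computed. As a rough illustration only, the sketch below estimates it from the centroids of the text elements clustered into one line, using the principal axis of their scatter matrix. The function name `line_orientation` and the use of centroids are assumptions made for this example and are not taken from the paper.

```python
import math


def line_orientation(centroids):
    """Estimate the angle of a text line, in degrees, from element centroids.

    Hypothetical helper: fits the principal axis of the 2D points and
    returns its angle relative to the x-axis. In image coordinates the
    y-axis usually points down, so the sign of the angle is flipped
    compared with the usual mathematical convention.
    """
    n = len(centroids)
    if n < 2:
        raise ValueError("need at least two centroids")

    # Mean of the points
    mx = sum(x for x, _ in centroids) / n
    my = sum(y for _, y in centroids) / n

    # Entries of the (unnormalised) 2x2 scatter matrix
    sxx = sum((x - mx) ** 2 for x, _ in centroids)
    syy = sum((y - my) ** 2 for _, y in centroids)
    sxy = sum((x - mx) * (y - my) for x, y in centroids)

    # Closed-form angle of the dominant eigenvector of the scatter matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.degrees(theta)
```

For example, `line_orientation([(0, 0), (1, 1), (2, 2)])` returns an angle of approximately 45 degrees. An estimate of this kind could be used to rotate a cropped text line to horizontal before passing it to an OCR engine.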

Keywords
Infographics; OCR; multi-oriented text extraction

Journal
DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering

Status	Published
Publication date	31/12/2015
URL	http://hdl.handle.net/1893/28052
Publisher	ACM
Place of publication	New York
ISBN	9781450333078
Conference	2015 ACM Symposium on Document Engineering
Conference location	Lausanne, Switzerland
Dates	08/09/2015 - 11/09/2015