Daniel P. Lopresti: Noisy Text

Errors are unavoidable in advanced
computer vision applications such as optical character recognition
(OCR), and the noise induced by these errors presents a serious
challenge to downstream processes that attempt to make use of such
data. Some of my work involves developing techniques to measure the
impact of recognition errors on the NLP stages of a standard text
analysis pipeline: sentence boundary detection, tokenization, and
part-of-speech tagging. I have developed a methodology that formulates
error classification as an optimization problem solvable using a
hierarchical dynamic programming approach, and used this technique to
analyze OCR errors and their cascading effects as they travel through
the pipeline. Listed below are papers that describe this work:
“Optical Character Recognition Errors and Their Effects on Natural Language Processing,” D. Lopresti, Proceedings of the ACM SIGIR Workshop on Analytics for Noisy Unstructured Text Data, July 2008, Singapore, pp. 9-16.

“Measuring the Impact of Character Recognition Errors on Downstream Text Analysis,” D. Lopresti, Proceedings of Document Recognition and Retrieval XV (IS&T/SPIE International Symposium on Electronic Imaging), January 2008.

“Performance Evaluation for Text Processing of Noisy Inputs,” D. Lopresti, Proceedings of the 20th Annual ACM Symposium on Applied Computing (Document Engineering Track), March 2005, Santa Fe, NM, pp. 759-763. (Abstract) (PDF 83 kbytes)

“Summarizing Noisy Documents” (with H. Jing and C. Shih), Proceedings of the Symposium on Document Image Understanding Technology, April 2003, Greenbelt, MD, pp. 111-119. (PDF 97 kbytes)

I am also co-chair of the Workshops on Analytics for Noisy Unstructured Text Data. The first workshop, AND 2007, was held in Hyderabad, India, in January 2007 in conjunction with the Twentieth International Joint Conference on Artificial Intelligence (IJCAI). The second workshop, AND 2008, was held in Singapore in July 2008 in conjunction with the Thirty-first Annual International ACM SIGIR Conference. The third workshop, AND 2009, was held in Barcelona in July 2009 in conjunction with the Tenth International Conference on Document Analysis and Recognition (ICDAR).

Noisy Text Dataset
A recent paper of mine, presented at the AND 2008 workshop, examines
the impact of OCR errors on a large collection of scanned pages that I
constructed specifically for this task. I am making the collection
available to the international research community to help foster
similar studies. The data set is derived from the well-known
Reuters-21578 news corpus.
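
As a rough illustration of how OCR output from a collection like this might be compared against its ground-truth text, the sketch below aligns two strings with the standard edit-distance dynamic program and labels each difference as a substitution, insertion, or deletion. It is a simplified, character-level stand-in and not the hierarchical method described in the papers above; the function name `classify_errors` and the sample strings are assumptions made purely for illustration.

```python
from typing import List, Tuple

EditOp = Tuple[str, str, str]  # (error type, ground-truth char, OCR char)


def classify_errors(truth: str, ocr: str) -> Tuple[int, List[EditOp]]:
    """Align `ocr` against `truth`; return (edit distance, list of errors).

    Hypothetical helper for illustration only: plain unit-cost edit
    distance, not the hierarchical formulation from the papers.
    """
    m, n = len(truth), len(ocr)

    # dist[i][j] = edit distance between truth[:i] and ocr[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i  # delete every ground-truth character
    for j in range(1, n + 1):
        dist[0][j] = j  # insert every OCR character

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (char lost by OCR)
                dist[i][j - 1] + 1,         # insertion (spurious OCR char)
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ops: List[EditOp] = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    ops.append(("substitution", truth[i - 1], ocr[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("deletion", truth[i - 1], ""))
            i -= 1
        else:
            ops.append(("insertion", "", ocr[j - 1]))
            j -= 1

    ops.reverse()  # report errors in left-to-right reading order
    return dist[m][n], ops


if __name__ == "__main__":
    distance, errors = classify_errors("modern", "rnodern")
    print(distance)
    for op in errors:
        print(op)
```

In the sample call, the ground-truth word "modern" is compared with "rnodern", reflecting the common OCR confusion of "m" with "rn"; the reported distance is 2. When several alignments are equally cheap, the traceback returns just one of them, so the exact list of labeled errors depends on the tie-breaking order used here.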