Locating tables in scanned documents with heterogeneous layout

Jahan, Akmal, M.A.C

Please use this identifier to cite or link to this item: http://ir.lib.seu.ac.lk/handle/123456789/368

Full metadata record

DC Field	Value	Language
dc.contributor.author	Jahan, Akmal, M.A.C
dc.date.accessioned	2015-09-03T08:46:40Z
dc.date.available	2015-09-03T08:46:40Z
dc.date.issued	2014
dc.identifier.citation	Annual Science Research Session 2014
dc.identifier.uri	http://ir.lib.seu.ac.lk/123456789/368
dc.description.abstract	The pool of knowledge available to mankind depends on its learning resources, which range from ancient printed documents to present-day electronic materials. Rapidly converting the material held in traditional libraries to digital form requires a significant amount of work on format preservation. Most printed documents contain not only characters and their formatting but also associated non-text objects such as tables, charts and graphics. Since most existing optical character recognition techniques have difficulty detecting such objects and do not concentrate on preserving the format of the contents when reproducing them, we attempt to locate all types of tables in scanned documents with heterogeneous layout. In general, multi-column documents are not cleanly divided by the inter-column space. Long headings, center-aligned page numbers, lengthy text in headers and footers, and horizontal lines intrude heavily on the inter-column space, which is commonly used in layout analysis. To address this issue, we propose an algorithm that uses a specific threshold to eliminate the interfering parts of the inter-column space and local thresholds for word spacing and line height to detect and extract all categories of tables from scanned documents. From an experiment on 50 documents, we conclude that our algorithm has an overall accuracy of about 73% in detecting tables in multi-column layouts. Although documents with complex layouts still pose some problems, the system could handle some documents of this kind as well. Because the algorithm does not depend entirely on the number of columns, the inter-column spaces, or the rule lines that bound the tables, it can detect all categories of tables in scanned documents with a range of different layouts.	en_US
dc.language.iso	en_US	en_US
dc.publisher	Faculty of Applied Science, South Eastern University of Sri Lanka, Oluvil # 32360, Sri Lanka	en_US
dc.subject	Optical character recognition	en_US
dc.title	Locating tables in scanned documents with heterogeneous layout	en_US
dc.type	Conference paper	en_US
Appears in Collections:	ASRS - FAS 2014
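
The abstract mentions local thresholds for word spacing as a cue for finding tables, but gives no implementation details. The following is only a minimal sketch of that general idea and not the author's algorithm. It assumes each text line is already available as a list of word boxes (x_start, x_end) in pixels. The function names, the factor of 2.5, and the min_rows and min_gaps parameters are hypothetical choices made for illustration.

```python
from statistics import median


def sorted_gaps(words):
    """Return the horizontal gaps between consecutive words of one text line.

    words: list of (x_start, x_end) tuples (assumed input format).
    """
    ordered = sorted(words)
    return [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]


def local_gap_threshold(lines, factor=2.5):
    """Derive a word-space threshold from the page itself (hypothetical rule):
    a gap counts as wide if it exceeds factor times the median positive gap."""
    gaps = [g for line in lines for g in sorted_gaps(line) if g > 0]
    return factor * median(gaps) if gaps else float("inf")


def find_table_regions(lines, factor=2.5, min_rows=2, min_gaps=1):
    """Group consecutive lines that contain wide gaps into candidate tables.

    Returns a list of (first_line_index, last_line_index) pairs.
    """
    threshold = local_gap_threshold(lines, factor)
    regions, start = [], None
    for i, line in enumerate(lines):
        wide = sum(1 for g in sorted_gaps(line) if g > threshold)
        is_candidate = wide >= min_gaps
        if is_candidate and start is None:
            start = i  # a candidate run begins here
        elif not is_candidate and start is not None:
            if i - start >= min_rows:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the last line of the page.
    if start is not None and len(lines) - start >= min_rows:
        regions.append((start, len(lines) - 1))
    return regions


# Made-up sample page: two body lines, three table-like rows, one body line.
sample_lines = [
    [(0, 40), (45, 90), (95, 130), (135, 180)],
    [(0, 30), (35, 80), (85, 120)],
    [(0, 30), (70, 100), (140, 170)],
    [(0, 25), (70, 95), (140, 165)],
    [(0, 35), (70, 90), (140, 160)],
    [(0, 50), (55, 100), (105, 150)],
]
print(find_table_regions(sample_lines))  # -> [(2, 4)]
```

In a multi-column page, the gutter between columns would itself appear as a wide gap on every body line, so a sketch like this would need the page split into columns first. That is the kind of layout analysis the abstract says is disturbed by long headings, page numbers, and rule lines.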

Files in This Item:

File	Description	Size	Format
LOCATING TABLES IN.pdf		30.87 kB	Adobe PDF
