|IMPROVING THE COMPRESSION ALGORITHMS PERFORMANCE FOR SCANNED AMHARIC PDF FILES
|Data Compression, OCR, Page layout segmentation, Portable Document file
|ST. MARY’S UNIVERSITY
|The advancement and accessibility of digital computers, together with the introduction of the Internet and the World Wide Web, have led to a massive information explosion all over the world. As a result, large numbers of newspapers, magazines, and printed documents are available, carrying information and knowledge from many different areas. Documents in PDF format facilitate office automation and the move towards the paperless office. However, PDFs can become inconveniently large when they contain a large amount of high-resolution content such as images and graphics, or even just a very large number of pages. To make the information and knowledge embedded in these PDF documents accessible and shareable with the public, the data size needs to be minimized using different mechanisms. This study was conducted to develop a compression system for Amharic PDF documents by applying an effective page segmentation technique that identifies text and non-text blocks, with the aim of reconstructing PDF document layouts while optimizing the memory space and transmission bandwidth required. The first step of the proposed approach is separating textual and non-textual objects. A combination of page segmentation techniques, namely connected components with dilation and connected-component area, height, and width analysis, is applied to detect the graphics part of a document. In the experiments, the proposed approach achieved an average accuracy of 78% on this step. The next step, after textual and non-textual separation, is column block detection for textual objects. Similar page segmentation techniques are applied to segment the column layout. The proposed technique identified the column layout with an accuracy of 89%, and all coordinate information about each column block is stored for the reconstruction stage. Finally, the extracted objects are compressed using the Huffman compression algorithm.
The proposed approach was tested on different PDF documents and compressed the extracted objects with a compression ratio of less than 50%, which is a better compression result than existing commercial compression tools. The proposed approach is also capable of reconstructing the compressed data after decompression. Based on the stored layout coordinate information, the non-textual blocks and textual columns of the original PDF documents were reconstructed with an average accuracy of 74%. For correctly segmented column and paragraph blocks, the proposed technique achieved an accuracy rate of 92%. However, the performance of the proposed method is greatly affected by black shading introduced into PDF document images during scanning, and irregularly shaped images with non-rectangular text blocks result in the loss of some text and make segmentation difficult.
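
The abstract names Huffman coding as the final compression step and reports a compression ratio, but it gives no implementation details. The following is a minimal Python sketch of textbook Huffman coding over raw bytes, not the thesis's actual code. The function names (`build_huffman_codes`, `encode`, `decode`, `compression_ratio`) are hypothetical, and the ratio is assumed to mean compressed size divided by original size; the abstract does not define it.

```python
import heapq
from collections import Counter


def build_huffman_codes(data: bytes) -> dict[int, str]:
    """Return a prefix-free bit string for each distinct byte value in data."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(data: bytes, codes: dict[int, str]) -> str:
    """Concatenate the code of every byte into one bit string."""
    return "".join(codes[b] for b in data)


def decode(bits: str, codes: dict[int, str]) -> bytes:
    """Invert encode() by matching prefix-free codes from left to right."""
    inverse = {c: s for s, c in codes.items()}
    out = bytearray()
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)


def compression_ratio(data: bytes, bits: str) -> float:
    """Compressed bits over original bits (assumed definition of the ratio)."""
    return len(bits) / (8 * len(data))
```

Under this assumed definition, a ratio below 0.5 would correspond to the "less than 50%" figure reported above. The sketch ignores the cost of storing the code table and packs nothing into actual bytes, so it illustrates the principle only.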
|Appears in Collections:
|Master of computer science
|Haimanot Andargachew Final thesis V2.pdf