[nabop] Fwd: Announcing PDF2OCR
David Andrews
dandrews at visi.com
Fri Sep 14 10:01:30 CDT 2007
>
>Now available at
>http://www.EmpowermentZone.com/pdf2ocr.zip
>
>PDF2OCR 1.0
>Released September 14, 2007
>Public Domain by Jamal Mazrui
>
>Following up on a tip from Ken Perry about the open source Tesseract-OCR
>project at Google, I have tried to use this OCR engine to build a free
>program for producing accessible text from an image-based PDF. Such files
>are created by scanning equipment or software printer drivers that save
>only the picture of text, without the actual characters themselves. This
>makes them inaccessible to most PDF viewing utilities, which extract text
>but do not perform OCR on images.
>
>I could not find an existing Windows solution on the web, but did get
>useful ideas from Linux-oriented ones. What I am calling PDF2OCR combines
>Tesseract from
>http://code.google.com/p/tesseract-ocr
>with the GhostScript interpreter from
>http://ghostscript.com
>
>GhostScript creates a .tif file from the .pdf file of interest, and then
>Tesseract creates a .txt file from that. The current implementation is a
>batch file, pdf2ocr.bat, with the following syntax on the command line:
>pdf2ocr SourceRootName
>where SourceRootName is the name of a PDF file without the .pdf extension.
>This produces a text file with the same name except for a .txt extension.
>The PDF name can include a directory path, but not embedded spaces. For
>example,
>pdf2ocr c:\temp\test
>produces
>c:\temp\test.txt
>When the conversion is complete, the batch file prints tesseract.log to
>the screen; this log file is recreated for each conversion.
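
The two-step pipeline described above can be sketched roughly as follows. This is a sketch, not the contents of pdf2ocr.bat, which are not shown in the announcement. The executable names (gswin32c, tesseract), the TIFF device, and the resolution are all assumptions, and the right device depends on what a given Tesseract build can read. The function only builds the command lines; it does not run them.

```python
def build_commands(source_root, gs_exe="gswin32c", tess_exe="tesseract",
                   device="tiffg4", dpi=300):
    """Return (ghostscript_cmd, tesseract_cmd) for source_root.pdf.

    The flags are assumptions for illustration, not taken from pdf2ocr.bat.
    """
    if " " in source_root:
        # The announcement says embedded spaces are not supported.
        raise ValueError("SourceRootName must not contain spaces")
    pdf = source_root + ".pdf"
    tif = source_root + ".tif"
    gs_cmd = [
        gs_exe,
        "-dNOPAUSE", "-dBATCH", "-q",   # run unattended, then exit
        "-sDEVICE=" + device,           # TIFF output device (assumed)
        "-r" + str(dpi),                # rendering resolution (assumed)
        "-sOutputFile=" + tif,
        pdf,
    ]
    # Tesseract takes an output base name and appends .txt itself.
    tess_cmd = [tess_exe, tif, source_root]
    return gs_cmd, tess_cmd


if __name__ == "__main__":
    for cmd in build_commands(r"c:\temp\test"):
        print(" ".join(cmd))
```

With both programs installed, each list could be passed to subprocess.run to perform the actual conversion.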
>
>Installation consists of unzipping the pdf2ocr.zip archive to a target
>directory, e.g., to one called
>C:\PDF2OCR
>This directory contains the executable files, as well as three
>subdirectories with support files. The gsdata subdirectory contains many
>files I gathered from an installed GhostScript directory tree. The
>tessdata subdirectory contains language support for Tesseract (I have only
>distributed English files, but other languages are available from the
>Google site). The misc subdirectory contains sample files, some source
>code, and this documentation.
>
>A sample image-based PDF is named mlk.pdf -- the letter Martin Luther
>King, Jr. wrote from the Birmingham Jail. Another sample is debate.pdf --
>the legal agreement between the Bush and Kerry campaigns concerning
>Presidential debates. Each of the two commercial OCR programs tested,
>Kurzweil 1000 and PDF Magic, converted one of these files well but failed
>entirely on the other (a different file for each program). Their results,
>as well as those of PDF2OCR, are provided in text files. Please understand
>that Tesseract is not the best OCR available, though it is generally
>considered the best free OCR at present.
>
>In order to run the batch file from any directory, you can add the PDF2OCR
>directory to the path of a console session with a command like the
>following:
>set path=c:\pdf2ocr;%path%
>You can add the path for every console session via the Advanced tab page
>of the System applet in Control Panel.
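
To illustrate what the set path command above does to the search path, here is a small sketch that prepends a directory to a Windows-style PATH string only if it is not already listed. The function name prepend_to_path is hypothetical, and the comparison rules (case-insensitive, ignoring a trailing backslash) are a simplifying assumption.

```python
def prepend_to_path(path_value, directory):
    """Return path_value with directory in front, unless already listed."""
    # Windows separates PATH entries with semicolons.
    entries = [e for e in path_value.split(";") if e]
    wanted = directory.rstrip("\\").lower()
    if any(e.rstrip("\\").lower() == wanted for e in entries):
        return path_value  # already present, leave unchanged
    return ";".join([directory] + entries)


if __name__ == "__main__":
    print(prepend_to_path(r"C:\Windows;C:\Windows\System32", r"c:\pdf2ocr"))
```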
>
>To easily convert multiple PDFs in a directory, I have also created a
>utility called dir2ocr.exe. Simply pass the directory name to process as
>a parameter, e.g.,
>dir2ocr c:\temp
>If no parameter is passed, the current directory is assumed. Source code
>for this PowerBASIC program that calls pdf2ocr.bat is in the files
>dir2ocr.bas and fn.inc, located in the misc subdirectory.
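
For readers who do not use PowerBASIC, the directory-walking idea can be sketched in Python. This is not a translation of dir2ocr.bas, whose source is not reproduced here. It assumes only what the announcement states: each PDF in the directory is handed to pdf2ocr by its name without the .pdf extension, and the current directory is used when no parameter is given. The function name pdf_root_names is hypothetical.

```python
import os
import sys


def pdf_root_names(directory="."):
    """Return paths of the PDFs in directory, minus the .pdf extension."""
    roots = []
    for name in sorted(os.listdir(directory)):
        root, ext = os.path.splitext(name)
        if ext.lower() == ".pdf":
            roots.append(os.path.join(directory, root))
    return roots


if __name__ == "__main__":
    # Default to the current directory when no parameter is passed.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for root in pdf_root_names(target):
        # Print the command rather than running it, so the sketch is safe
        # to try on a machine without PDF2OCR installed.
        print("pdf2ocr " + root)
```

Printing the commands instead of executing them keeps the sketch harmless. A real version would invoke the batch file for each name and would need to skip or report names containing spaces.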
>
>The PDF2OCR download is large, about 14 megabytes as a compressed
>archive. Other techniques for getting text from a PDF should probably be
>tried first. When other tools do not work or are unavailable, however, I
>hope this helps to bridge an accessibility gap. Feel free to enhance it
>in the spirit of open source development!
>
>Jamal Mazrui
>jamal at EmpowermentZone.com
David Andrews and white cane Harry.