[blindlaw] Fwd: Announcing PDF2OCR

Steve Jacobson steve.jacobson at visi.com
Fri Sep 14 14:00:13 CDT 2007


Tim,

Jamal's program does essentially the same thing that you do with OpenBook,
but using other tools.  There are several ways of approaching the problem of
PDFs containing scanned images.  Your solution, judging from what Jamal says,
will probably give better results because you are using a better character
recognition engine.  K-1000 also has this capability.  OmniPage and, I
believe, FineReader, both commercially available OCR programs, also include
methods of getting text out of scanned-image PDFs.  I have also read that some
versions of Microsoft Office can do this, because they contain an OCR engine
and a printer driver that captures an image of a document instead of printing
it.  In some cases, Jamal's program might be simpler to run, especially if you
need to convert a series of PDF documents.  He didn't say this, but it is also
conceivable that his process might ignore some security flags that other
programs honor, letting you get text from PDFs with some protection.  I do not
know this for certain, but in the past, the GhostScript process that he is
using has had that advantage.  Of course, given that this is a list for
lawyers, it should be mentioned that extracting text from a protected document
might violate copyright, although it is probably permitted if the text is
extracted only to make the document accessible.

The biggest benefit of Jamal's program is that it gives computer users who do
not own any of the more expensive alternatives a way to read scanned PDFs.
There are certainly blind people out there who use their computers primarily
for e-mail and web browsing and who don't own Microsoft Office or any of the
OCR applications.  This might be very useful in that environment.

I hope this helps.

Best regards,

Steve Jacobson

On Fri, 14 Sep 2007 09:07:52 -0700, Ford, Tim (CDPH-OLS) wrote:

>I wonder if this new program does something that I can do through
>Openbook?

>I use Openbook with these kinds of image-only PDF files by using
>Openbook's virtual printer option, named Freedom Import Printer (for
>Freedom Scientific).  So all I do is bring up the PDF version, and if
>there is just an image there, I hit the print command; Openbook
>launches and converts the PDF file to text.  This is essentially as if
>I printed out the document and ran it into Openbook through a scanner.

>Does this program do the same thing?

> 

>-----Original Message-----
>From: blindlaw-bounces at nfbnet.org [mailto:blindlaw-bounces at nfbnet.org]
>On Behalf Of David Andrews
>Sent: Friday, September 14, 2007 8:02 AM
>To: promotion-technology at nfbnet.org; gui-talk at nfbnet.org;
>nfbcs at nfbnet.org; blindtlk at nfbnet.org; blindlaw at nfbnet.org;
>nabs-l at nfbnet.org; nabop at nfbnet.org
>Subject: [blindlaw] Fwd: Announcing PDF2OCR


>>
>>Now available at
>>http://www.EmpowermentZone.com/pdf2ocr.zip
>>
>>PDF2OCR 1.0
>>Released September 14, 2007
>>Public Domain by Jamal Mazrui
>>
>>Following up on a tip from Ken Perry about the open source 
>>Tesseract-OCR project at Google, I have tried to use this OCR engine to
>>build a free program for producing accessible text from an image-based 
>>PDF.  Such files are created by scanning equipment or software printer 
>>drivers that save only the picture of text, without the actual 
>>characters themselves.  This makes them inaccessible to most PDF 
>>viewing utilities, which extract text but do not perform OCR on images.
>>
>>I could not find an existing Windows solution on the web, but did get 
>>useful ideas from Linux-oriented ones.  What I am calling PDF2OCR 
>>combines Tesseract from http://code.google.com/p/tesseract-ocr
>>with the GhostScript interpreter from
>>http://ghostscript.com
>>
>>GhostScript creates a .tif file from the .pdf file of interest, and 
>>then Tesseract creates a .txt file from that.  The current 
>>implementation is a batch file, pdf2ocr.bat, with the following syntax
>>on the command line:
>>pdf2ocr SourceRootName
>>where SourceRootName is the name of a PDF file without the .pdf
>>extension.  This produces a text file with the same name except for a
>>.txt extension.  The PDF name can include a directory path, but not
>>embedded spaces.  For example,
>>pdf2ocr c:\temp\test
>>produces c:\temp\test.txt.  When complete, the batch file prints
>>tesseract.log to the screen -- a file that is recreated for each
>>conversion.
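
The two steps described above can be sketched in Python.  This is only an
illustration under stated assumptions, not the contents of pdf2ocr.bat, which
are not shown in the announcement.  The executable names (gswin32c,
tesseract), the tiffg4 output device, and the 300 dpi resolution are my
assumptions; the function names build_commands and pdf_to_text are
hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(root, gs_exe="gswin32c", tesseract_exe="tesseract"):
    """Build the two command lines that turn <root>.pdf into <root>.txt.

    root is the PDF name without the .pdf extension, as pdf2ocr expects.
    The executable names and Ghostscript settings are assumptions.
    """
    pdf = f"{root}.pdf"
    tif = f"{root}.tif"
    gs_cmd = [
        gs_exe,
        "-dNOPAUSE", "-dBATCH", "-q",  # run unattended, then exit quietly
        "-sDEVICE=tiffg4",             # assumed: black-and-white TIFF output
        "-r300",                       # assumed: 300 dpi rendering
        f"-sOutputFile={tif}",
        pdf,
    ]
    # Tesseract takes an image and an output base name; it adds .txt itself.
    ocr_cmd = [tesseract_exe, tif, root]
    return gs_cmd, ocr_cmd


def pdf_to_text(root):
    """Run both steps in order and return the path of the text file."""
    gs_cmd, ocr_cmd = build_commands(root)
    subprocess.run(gs_cmd, check=True)   # step 1: .pdf -> .tif
    subprocess.run(ocr_cmd, check=True)  # step 2: .tif -> .txt
    return Path(f"{root}.txt")
```

With both programs on the path, pdf_to_text(r"c:\temp\test") would be the
counterpart of the pdf2ocr c:\temp\test example.  Passing the arguments as a
list rather than one command string is a deliberate choice here, since it
avoids quoting problems with the file names.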
>>
>>Installation consists of unzipping the pdf2ocr.zip archive to a target 
>>directory, e.g., to one called C:\PDF2OCR.  This directory contains the 
>>executable files, as well as three subdirectories with support files.  
>>The gsdata subdirectory contains many files I gathered from an 
>>installed GhostScript directory tree.  The tessdata subdirectory 
>>contains language support for Tesseract (I have only distributed 
>>English files, but other languages are available from the Google site).
>>The misc subdirectory contains sample files, some source code, and this
>>documentation.
>>
>>A sample image-based PDF is named mlk.pdf -- the letter Martin Luther 
>>King, Jr. wrote from the Birmingham Jail.  Another sample is debate.pdf
>>-- the legal agreement between the Bush and Kerry campaigns concerning 
>>Presidential debates.  Two commercial OCR programs tested, Kurzweil 
>>1000 and PDF Magic, converted one of these files well, but not the 
>>other at all (a different one for each).  Their results, as well as 
>>that of PDF2OCR, are provided in text files.  Please understand that 
>>Tesseract is not the best OCR available, though it is generally 
>>considered the best free OCR at present.
>>
>>In order to run the batch file from any directory, you can add the 
>>PDF2OCR directory to the path of a console session with a command like 
>>the following:
>>set path=c:\pdf2ocr;%path%
>>You can add the path for every console session via the Advanced tab 
>>page of the System applet in Control Panel.
>>
>>To easily convert multiple PDFs in a directory, I have also created a 
>>utility called dir2ocr.exe.  Simply pass the directory name to process 
>>as a parameter, e.g.,
>>dir2ocr c:\temp
>>If no parameter is passed, the current directory is assumed.  Source
>>code for this PowerBASIC program that calls pdf2ocr.bat is in the files
>>dir2ocr.bas and fn.inc, located in the misc subdirectory.
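
The directory pass can be sketched the same way.  This is not a translation
of dir2ocr.bas, which I have not read; it only shows one way to collect the
root names that pdf2ocr expects, defaulting to the current directory and
setting aside names with embedded spaces, since the announcement says those
are not supported.  The function name pdf_roots is hypothetical.

```python
from pathlib import Path


def pdf_roots(directory="."):
    """List PDF root names (path without .pdf) found in directory.

    Returns two lists: roots that can be passed to pdf2ocr, and roots
    skipped because the path contains a space.
    """
    usable, skipped = [], []
    for pdf in sorted(Path(directory).iterdir()):
        # Only regular files with a .pdf extension, in any letter case.
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue
        root = str(pdf.with_suffix(""))
        if " " in root:
            skipped.append(root)
        else:
            usable.append(root)
    return usable, skipped
```

Each usable root could then be handed to pdf2ocr in turn, and the skipped
list reported so that those files can be renamed and tried again.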
>>
>>The PDF2OCR download is large, about 14 megabytes as a compressed 
>>archive.  Other techniques of getting text from a PDF should probably 
>>be tried first.  When other tools do not work or are unavailable, 
>>however, I hope this helps to bridge an accessibility gap.  Feel free 
>>to enhance it in the spirit of open source development!
>>
>>Jamal Mazrui
>>jamal at EmpowermentZone.com

>David Andrews and white cane Harry.


>_______________________________________________
>blindlaw mailing list
>blindlaw at nfbnet.org
>http://www.nfbnet.org/mailman/listinfo/blindlaw





