[blindlaw] Fwd: Announcing PDF2OCR

Ford, Tim (CDPH-OLS) Tim.Ford at cdph.ca.gov
Fri Sep 14 11:07:52 CDT 2007


I wonder whether this new program does something that I can already do
through Openbook.

I use Openbook with these kinds of image-only PDF files by using
Openbook's virtual printer option, named Freedom Import Printer (for
Freedom Scientific).  So all I do is bring up the PDF version, and if
there is just an image there, I hit the print command, Openbook
launches, and converts the PDF file to text.  This is essentially as if
I printed out the document and fed it into Openbook through a scanner.

Does this program do the same thing?

 

-----Original Message-----
From: blindlaw-bounces at nfbnet.org [mailto:blindlaw-bounces at nfbnet.org]
On Behalf Of David Andrews
Sent: Friday, September 14, 2007 8:02 AM
To: promotion-technology at nfbnet.org; gui-talk at nfbnet.org;
nfbcs at nfbnet.org; blindtlk at nfbnet.org; blindlaw at nfbnet.org;
nabs-l at nfbnet.org; nabop at nfbnet.org
Subject: [blindlaw] Fwd: Announcing PDF2OCR


>
>Now available at
>http://www.EmpowermentZone.com/pdf2ocr.zip
>
>PDF2OCR 1.0
>Released September 14, 2007
>Public Domain by Jamal Mazrui
>
>Following up on a tip from Ken Perry about the open source 
>Tesseract-OCR project at Google, I have tried to use this OCR engine to 
>build a free program for producing accessible text from an image-based 
>PDF.  Such files are created by scanning equipment or software printer 
>drivers that save only the picture of text, without the actual 
>characters themselves.  This makes them inaccessible to most PDF 
>viewing utilities, which extract text but do not perform OCR on images.
>
>I could not find an existing Windows solution on the web, but did get 
>useful ideas from Linux-oriented ones.  What I am calling PDF2OCR 
>combines Tesseract from http://code.google.com/p/tesseract-ocr
>with the GhostScript interpreter from
>http://ghostscript.com
>
>GhostScript creates a .tif file from the .pdf file of interest, and 
>then Tesseract creates a .txt file from that.  The current 
>implementation is a batch file, pdf2ocr.bat, with the following syntax 
>on the command line:
>pdf2ocr SourceRootName
>where SourceRootName is the name of a PDF file without the .pdf 
>extension.
>This produces a text file with the same name except for a .txt 
>extension.
>The PDF name can include a directory path, but not embedded spaces.  
>For example, pdf2ocr c:\temp\test produces c:\temp\test.txt.  When 
>complete, the batch file prints tesseract.log to the screen -- a file 
>that is recreated for each conversion.
>
>Installation consists of unzipping the pdf2ocr.zip archive to a target 
>directory, e.g., to one called C:\PDF2OCR.  This directory contains the 
>executable files, as well as three subdirectories with support files.  
>The gsdata subdirectory contains many files I gathered from an 
>installed GhostScript directory tree.  The tessdata subdirectory 
>contains language support for Tesseract (I have only distributed 
>English files, but other languages are available from the Google site).
>The misc subdirectory contains sample files, some source code, and this 
>documentation.
>
>A sample image-based PDF is named mlk.pdf -- the letter Martin Luther 
>King, Jr. wrote from the Birmingham Jail.  Another sample is debate.pdf 
>-- the legal agreement between the Bush and Kerry campaigns concerning 
>Presidential debates.  Two commercial OCR programs tested, Kurzweil 
>1000 and PDF Magic, converted one of these files well, but not the 
>other at all (a different one for each).  Their results, as well as 
>that of PDF2OCR, are provided in text files.  Please understand that 
>Tesseract is not the best OCR available, though it is generally 
>considered the best free OCR at present.
>
>In order to run the batch file from any directory, you can add the 
>PDF2OCR directory to the path of a console session with a command like 
>the following:
>set path=c:\pdf2ocr;%path%
>You can add the path for every console session via the Advanced tab 
>page of the System applet in Control Panel.
>
>To easily convert multiple PDFs in a directory, I have also created a 
>utility called dir2ocr.exe.  Simply pass the directory name to process 
>as a parameter, e.g., dir2ocr c:\temp.  If no parameter is passed, the 
>current directory is assumed.  Source code for this PowerBASIC program 
>that calls pdf2ocr.bat is in the files dir2ocr.bas and fn.inc, located 
>in the misc subdirectory.
>
>The PDF2OCR download is large, about 14 megabytes as a compressed 
>archive.  Other techniques for getting text from a PDF should probably 
>be tried first.  When other tools do not work or are unavailable, 
>however, I hope this helps to bridge an accessibility gap.  Feel free 
>to enhance it in the spirit of open source development!
>
>Jamal Mazrui
>jamal at EmpowermentZone.com

David Andrews and white cane Harry.
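
For readers who want to see the shape of the two-step conversion
described in the announcement (GhostScript renders the PDF to a .tif,
then Tesseract reads the .tif into a .txt), here is a rough Python
sketch.  It is only an illustration, not the contents of pdf2ocr.bat or
dir2ocr.bas, which are not shown in the message.  The executable name
gswin32c, the tiffg4 output device, the 300 dpi resolution, and all of
the function names are assumptions made for this example.

```python
import subprocess
from pathlib import Path


def ghostscript_command(root, gs_exe="gswin32c"):
    """Build the argument list that renders root.pdf into root.tif.

    The executable name, device, and resolution are assumptions; the
    actual pdf2ocr.bat may use different settings.
    """
    return [
        gs_exe,
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when the input has been processed
        "-q",                 # suppress startup messages
        "-sDEVICE=tiffg4",    # black-and-white TIFF output (assumed)
        "-r300",              # 300 dpi (assumed)
        f"-sOutputFile={root}.tif",
        f"{root}.pdf",
    ]


def tesseract_command(root, tesseract_exe="tesseract"):
    """Build the argument list that turns root.tif into root.txt.

    Tesseract takes an output base name and appends .txt itself.
    """
    return [tesseract_exe, f"{root}.tif", str(root)]


def convert(root, run=subprocess.run):
    """Run both steps for one PDF, given its name without .pdf."""
    for command in (ghostscript_command(root), tesseract_command(root)):
        run(command, check=True)


def pdf_roots(directory="."):
    """List the PDFs in a directory as root names (no .pdf extension),
    roughly what a dir2ocr-style wrapper would loop over."""
    return sorted(
        str(path.with_suffix(""))
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

With both programs on the path, convert(r"c:\temp\test") would be
expected to leave c:\temp\test.txt next to the PDF, and looping convert
over pdf_roots(r"c:\temp") would cover a whole directory.  Passing
argument lists rather than a single command string also sidesteps the
embedded-spaces limitation mentioned for the batch file.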


_______________________________________________
blindlaw mailing list
blindlaw at nfbnet.org
http://www.nfbnet.org/mailman/listinfo/blindlaw

