[blindlaw] Fwd: Announcing PDF2OCR
Ford, Tim (CDPH-OLS)
Tim.Ford at cdph.ca.gov
Fri Sep 14 12:01:11 CDT 2007
Thanks very much for the clarification. Considering how expensive the
Kurzweil 1000 and OpenBook software is, alternatives such as this are
no doubt appreciated.
-----Original Message-----
From: blindlaw-bounces at nfbnet.org [mailto:blindlaw-bounces at nfbnet.org]
On Behalf Of David Andrews
Sent: Friday, September 14, 2007 9:47 AM
To: NFBnet Blind Law Mailing List
Subject: Re: [blindlaw] Fwd: Announcing PDF2OCR
Both OpenBook and K1000 have methods for dealing with this kind of
situation; however, I forwarded it for those who may not have those
packages.
Dave
At 11:07 AM 9/14/2007, you wrote:
>I wonder if this new program does something that I can do through
>OpenBook?
>
>I use OpenBook with these kinds of image-only PDF files by using
>OpenBook's virtual printer option, named Freedom Import Printer (for
>Freedom Scientific). All I do is bring up the PDF version, and if
>there is just an image there, I hit the print command, OpenBook
>launches and converts the PDF file to text. This is essentially as if
>I printed out the document and ran it into OpenBook through a scanner.
>
>Does this program do the same thing?
>
>
>
>-----Original Message-----
>From: blindlaw-bounces at nfbnet.org [mailto:blindlaw-bounces at nfbnet.org]
>On Behalf Of David Andrews
>Sent: Friday, September 14, 2007 8:02 AM
>To: promotion-technology at nfbnet.org; gui-talk at nfbnet.org;
>nfbcs at nfbnet.org; blindtlk at nfbnet.org; blindlaw at nfbnet.org;
>nabs-l at nfbnet.org; nabop at nfbnet.org
>Subject: [blindlaw] Fwd: Announcing PDF2OCR
>
>
> >
> >Now available at
> >http://www.EmpowermentZone.com/pdf2ocr.zip
> >
> >PDF2OCR 1.0
> >Released September 14, 2007
> >Public Domain by Jamal Mazrui
> >
 > >Following up on a tip from Ken Perry about the open source
 > >Tesseract-OCR project at Google, I have tried to use this OCR engine
 > >to build a free program for producing accessible text from an
 > >image-based PDF. Such files are created by scanning equipment or
 > >software printer drivers that save only the picture of text, without
 > >the actual characters themselves. This makes them inaccessible to
 > >most PDF viewing utilities, which extract text but do not perform OCR
 > >on images.
> >
> >I could not find an existing Windows solution on the web, but did get
> >useful ideas from Linux-oriented ones. What I am calling PDF2OCR
> >combines Tesseract from http://code.google.com/p/tesseract-ocr
> >with the GhostScript interpreter from http://ghostscript.com
> >
 > >GhostScript creates a .tif file from the .pdf file of interest, and
 > >then Tesseract creates a .txt file from that. The current
 > >implementation is a batch file, pdf2ocr.bat, with the following
 > >syntax on the command line:
 > >pdf2ocr SourceRootName
 > >where SourceRootName is the name of a PDF file without the .pdf
 > >extension. This produces a text file with the same name except for a
 > >.txt extension. The PDF name can include a directory path, but not
 > >embedded spaces. For example,
 > >pdf2ocr c:\temp\test
 > >produces c:\temp\test.txt. When complete, the batch file prints
 > >tesseract.log to the screen -- a file that is recreated for each
 > >conversion.
> >
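The two-stage conversion described above can be sketched in Python. This
is only an illustration, not the contents of pdf2ocr.bat, which are not
shown in the announcement. The executable names (gswin32c, tesseract),
the tiffgray output device, the 300 dpi resolution, and the function
names are all assumptions.

```python
import subprocess


def build_commands(source_root, gs_exe="gswin32c", tesseract_exe="tesseract"):
    """Return the GhostScript and Tesseract command lines for one PDF.

    source_root is the PDF name without the .pdf extension, as in the
    announcement. Executable names, device, and resolution are assumed.
    """
    if " " in source_root:
        # The announcement says embedded spaces are not supported.
        raise ValueError("embedded spaces are not supported in the PDF name")
    pdf = source_root + ".pdf"
    tif = source_root + ".tif"
    # Stage 1: GhostScript renders the PDF pages to a TIFF image.
    gs_cmd = [gs_exe, "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffgray", "-r300",
              "-sOutputFile=" + tif, pdf]
    # Stage 2: Tesseract reads the TIFF and writes <source_root>.txt.
    ocr_cmd = [tesseract_exe, tif, source_root]
    return gs_cmd, ocr_cmd


def pdf2ocr(source_root):
    """Run both stages in order and return the name of the text file."""
    for cmd in build_commands(source_root):
        subprocess.run(cmd, check=True)
    return source_root + ".txt"


# Example (requires GhostScript and Tesseract on the path):
# pdf2ocr(r"c:\temp\test")
```

Building the command lists separately from running them makes it easy to
print and inspect the commands before anything is executed.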
 > >Installation consists of unzipping the pdf2ocr.zip archive to a
 > >target directory, e.g., to one called C:\PDF2OCR. This directory
 > >contains the executable files, as well as three subdirectories with
 > >support files. The gsdata subdirectory contains many files I gathered
 > >from an installed GhostScript directory tree. The tessdata
 > >subdirectory contains language support for Tesseract (I have only
 > >distributed English files, but other languages are available from the
 > >Google site). The misc subdirectory contains sample files, some source
 > >code, and this documentation.
> >
 > >A sample image-based PDF is named mlk.pdf -- the letter Martin Luther
 > >King, Jr. wrote from the Birmingham Jail. Another sample is
 > >debate.pdf -- the legal agreement between the Bush and Kerry campaigns
> >concerning Presidential debates. Two commercial OCR programs tested,
> >Kurzweil 1000 and PDF Magic, converted one of these files well, but
> >not the other at all (a different one for each). Their results, as
> >well as that of PDF2OCR, are provided in text files. Please
> >understand that Tesseract is not the best OCR available, though it is
> >generally considered the best free OCR at present.
> >
 > >In order to run the batch file from any directory, you can add the
 > >PDF2OCR directory to the path of a console session with a command
 > >like the following:
 > >set path=c:\pdf2ocr;%path%
> >You can add the path for every console session via the Advanced tab
> >page of the System applet in Control Panel.
> >
 > >To easily convert multiple PDFs in a directory, I have also created a
 > >utility called dir2ocr.exe. Simply pass the directory name to
 > >process as a parameter, e.g.,
 > >dir2ocr c:\temp
 > >If no parameter is passed, the current directory is assumed. Source
 > >code for this PowerBASIC program that calls pdf2ocr.bat is in the
 > >files dir2ocr.bas and fn.inc, located in the misc subdirectory.
> >
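For readers without PowerBASIC, the directory-level idea can be sketched
in Python. This is not a translation of dir2ocr.bas, whose source is not
reproduced here. The function names are hypothetical, and the sketch
assumes pdf2ocr.bat is reachable on the path and accepts a root name as
described above.

```python
import os
import subprocess


def pdf_roots(directory="."):
    """Return root names (path without .pdf) for each PDF in directory."""
    roots = []
    for name in sorted(os.listdir(directory)):
        root, ext = os.path.splitext(name)
        if ext.lower() == ".pdf":
            roots.append(os.path.join(directory, root))
    return roots


def convert_directory(directory="."):
    """Call pdf2ocr.bat once per PDF; defaults to the current directory."""
    for root in pdf_roots(directory):
        if " " in root:
            # pdf2ocr does not accept names with embedded spaces.
            print("skipping (embedded space):", root)
            continue
        # Assumption: pdf2ocr.bat is on the path (Windows only).
        subprocess.run(["cmd", "/c", "pdf2ocr", root], check=False)


# Example: convert_directory(r"c:\temp")
```

Names containing spaces are skipped rather than passed through, since the
batch file is documented as not supporting them.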
> >The PDF2OCR download is large, about 14 megabytes as a compressed
> >archive. Other techniques of getting text from a PDF should probably
> >be tried first. When other tools do not work or are unavailable,
> >however, I hope this helps to bridge an accessibility gap. Feel free
> >to enhance it in the spirit of open source development!
> >
> >Jamal Mazrui
> >jamal at EmpowermentZone.com
>
>David Andrews and white cane Harry.
>
>
>_______________________________________________
>blindlaw mailing list
>blindlaw at nfbnet.org
>http://www.nfbnet.org/mailman/listinfo/blindlaw
David Andrews and white cane Harry.