[blindlaw] Fwd: Announcing PDF2OCR
Ford, Tim (CDPH-OLS)
Tim.Ford at cdph.ca.gov
Fri Sep 14 11:07:52 CDT 2007
I wonder if this new program does something that I can do through
Openbook?
I use Openbook with these kinds of image-only PDF files by using
Openbook's virtual printer option, named Freedom Import Printer (for
Freedom Scientific). So all I do is bring up the PDF version, and if
there is just an image there, I hit the print command, Openbook
launches, and converts the PDF file to text. This is essentially as if
I printed out the document and ran it into Openbook through a scanner.
Does this program do the same thing?
-----Original Message-----
From: blindlaw-bounces at nfbnet.org [mailto:blindlaw-bounces at nfbnet.org]
On Behalf Of David Andrews
Sent: Friday, September 14, 2007 8:02 AM
To: promotion-technology at nfbnet.org; gui-talk at nfbnet.org;
nfbcs at nfbnet.org; blindtlk at nfbnet.org; blindlaw at nfbnet.org;
nabs-l at nfbnet.org; nabop at nfbnet.org
Subject: [blindlaw] Fwd: Announcing PDF2OCR
>
>Now available at
>http://www.EmpowermentZone.com/pdf2ocr.zip
>
>PDF2OCR 1.0
>Released September 14, 2007
>Public Domain by Jamal Mazrui
>
>Following up on a tip from Ken Perry about the open source
>Tesseract-OCR project at Google, I have tried to use this OCR engine to
>build a free program for producing accessible text from an image-based
>PDF. Such files are created by scanning equipment or software printer
>drivers that save only the picture of text, without the actual
>characters themselves. This makes them inaccessible to most PDF
>viewing utilities, which extract text but do not perform OCR on images.
>
>I could not find an existing Windows solution on the web, but did get
>useful ideas from Linux-oriented ones. What I am calling PDF2OCR
>combines Tesseract from http://code.google.com/p/tesseract-ocr
>with the GhostScript interpreter from
>http://ghostscript.com
>
>GhostScript creates a .tif file from the .pdf file of interest, and
>then Tesseract creates a .txt file from that. The current
>implementation is a batch file, pdf2ocr.bat, with the following syntax
>on the command line:
>pdf2ocr SourceRootName
>where SourceRootName is the name of a PDF file without the .pdf
>extension. This produces a text file with the same name except for a
>.txt extension. The PDF name can include a directory path, but not
>embedded spaces. For example,
>pdf2ocr c:\temp\test
>produces c:\temp\test.txt
>When complete, the batch file prints tesseract.log to the screen -- a
>file that is recreated for each conversion.
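
The two-step conversion described above (GhostScript renders the PDF to a
TIFF, then Tesseract reads the TIFF) could be sketched in Python roughly as
follows. This is only an illustration, not the contents of pdf2ocr.bat: the
executable name gswin32c, the tiffgray device, the 300 dpi resolution, and
the function names are all assumptions chosen for the example.

```python
import subprocess


def build_commands(source_root, gs_exe="gswin32c", tesseract_exe="tesseract",
                   device="tiffgray", dpi=300):
    """Return the two argument lists for a PDF -> TIFF -> text pipeline.

    source_root is the PDF name without the .pdf extension, as in the
    pdf2ocr syntax. Executable names, device, and dpi are assumed values.
    """
    pdf_path = source_root + ".pdf"
    tif_path = source_root + ".tif"
    # Step 1: GhostScript renders every page of the PDF into one TIFF file.
    gs_cmd = [gs_exe, "-dNOPAUSE", "-dBATCH", "-dQUIET",
              "-sDEVICE=" + device, "-r" + str(dpi),
              "-sOutputFile=" + tif_path, pdf_path]
    # Step 2: Tesseract takes the image and an output base name,
    # and appends ".txt" to that base name itself.
    ocr_cmd = [tesseract_exe, tif_path, source_root]
    return [gs_cmd, ocr_cmd]


def convert(source_root):
    """Run both steps in order; raise CalledProcessError if either fails."""
    for cmd in build_commands(source_root):
        subprocess.run(cmd, check=True)
    return source_root + ".txt"
```

Calling convert(r"c:\temp\test") would, under these assumptions, read
c:\temp\test.pdf and leave the text in c:\temp\test.txt. The sketch does not
reproduce the tesseract.log output that the batch file prints.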
>
>Installation consists of unzipping the pdf2ocr.zip archive to a target
>directory, e.g., to one called C:\PDF2OCR. This directory contains the
>executable files, as well as three subdirectories with support files.
>The gsdata subdirectory contains many files I gathered from an
>installed GhostScript directory tree. The tessdata subdirectory
>contains language support for Tesseract (I have only distributed
>English files, but other languages are available from the Google site).
>The misc subdirectory contains sample files, some source code, and this
>documentation.
>
>A sample image-based PDF is named mlk.pdf -- the letter Martin Luther
>King, Jr. wrote from the Birmingham Jail. Another sample is debate.pdf
>-- the legal agreement between the Bush and Kerry campaigns concerning
>Presidential debates. The two commercial OCR programs tested,
>Kurzweil 1000 and PDF Magic, each converted one of these files well
>but failed completely on the other (a different file for each). Their
>results, as well as those of PDF2OCR, are provided in text files.
>Please understand that
>Tesseract is not the best OCR available, though it is generally
>considered the best free OCR at present.
>
>In order to run the batch file from any directory, you can add the
>PDF2OCR directory to the path of a console session with a command like
>the following:
>set path=c:\pdf2ocr;%path%
>You can add the path for every console session via the Advanced tab
>page of the System applet in Control Panel.
>
>To easily convert multiple PDFs in a directory, I have also created a
>utility called dir2ocr.exe. Simply pass the directory name to process
>as a parameter, e.g.,
>dir2ocr c:\temp
>If no parameter is passed, the current directory is assumed. Source
>code for this PowerBASIC program that calls pdf2ocr.bat is in the files
>dir2ocr.bas and fn.inc, located in the misc subdirectory.
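
For anyone without PowerBASIC, the directory loop described above might look
something like this in Python. It is a rough sketch, not a port of
dir2ocr.bas: the function names are made up for the example, and skipping
paths that contain spaces is an assumption based on the stated pdf2ocr
limitation, not on how dir2ocr.exe actually behaves.

```python
import os
import subprocess


def find_pdf_roots(directory="."):
    """List PDF files in directory as root names (path minus .pdf).

    Paths containing spaces are skipped, since pdf2ocr is documented as
    not accepting embedded spaces (how dir2ocr.exe handles them is unknown).
    """
    roots = []
    for name in sorted(os.listdir(directory)):
        root, ext = os.path.splitext(os.path.join(directory, name))
        if ext.lower() == ".pdf" and " " not in root:
            roots.append(root)
    return roots


def convert_directory(directory="."):
    """Call pdf2ocr once per PDF; assumes pdf2ocr.bat is on the path."""
    for root in find_pdf_roots(directory):
        # Batch files are run through the Windows command interpreter.
        subprocess.run(["cmd", "/c", "pdf2ocr", root], check=True)
```

As with dir2ocr, calling convert_directory() with no argument processes the
current directory. It only works on Windows with the PDF2OCR directory
already added to the path.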
>
>The PDF2OCR download is large, about 14 megabytes as a compressed
>archive. Other techniques of getting text from a PDF should probably
>be tried first. When other tools do not work or are unavailable,
>however, I hope this helps to bridge an accessibility gap. Feel free
>to enhance it in the spirit of open source development!
>
>Jamal Mazrui
>jamal at EmpowermentZone.com
David Andrews and white cane Harry.
_______________________________________________
blindlaw mailing list
blindlaw at nfbnet.org
http://www.nfbnet.org/mailman/listinfo/blindlaw