OCRing Documents from PACER

Thu Aug 03 2006

hacks

Occasionally, when examining documents on PACER (the electronic filing system for federal courts in the US), you get one that was scanned in but never OCR'd, which means you can't select the text in the document. How do you go about getting it OCR'd? If you have access to Acrobat Professional 7.0, it's rather easy, although it takes some time. Here's how:

Make sure you have Acrobat Pro open, then follow these steps (this assumes you're on Windows, but it's not much different on a Mac):

Open the PDF you want to OCR in Acrobat.
Select "Save As" and save it as TIFF into a new directory (it will save each page as a separate TIFF file).
Close the original PDF file from Acrobat.
In Windows Explorer (or by using the "create PDF" option in Acrobat's "File" menu), select all the TIFF files, right-click and select "combine in Acrobat". Make sure that the pages are in order and then hit OK so that it creates a new PDF out of the TIFF files.
Now, in the "Document" menu, select "Start OCR" and let it OCR all pages. This will take a while. When prompted, save the PDF under a new name.
When it is done, you have a PDF with OCR'd text. However, because each page is a TIFF image, the resulting PDF is very large (my 22-page PDF was 31MB!).
To reduce the size, select the "PDF optimizer" in the "Advanced Tools" menu. Click on the size calculator and you'll see that most of the space is being consumed by document overhead and the TIFF images.
Click on the "Images" option in the optimizer and go with the defaults (this will turn your 600dpi TIFF images into 300dpi JPGs). This will also take a while.
Save this file and you should now have a searchable, selectable, cut-and-pasteable PDF that also reads well when printed and doesn't take up too much space.
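
The "make sure that the pages are in order" step is the one most likely to go wrong, since a plain alphabetical sort puts page 10 before page 2. Here is a minimal Python sketch for checking the order before combining. It assumes the exported TIFFs carry a page number somewhere in their file names (something like "doc_Page_2.tif"; the exact naming Acrobat uses may differ), and the function names are my own, not anything from Acrobat:

```python
import re
from pathlib import Path


def page_sort_key(name):
    """Split a file name into text and number chunks so that
    'doc_Page_10.tif' sorts after 'doc_Page_2.tif'."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so chunks at the same position always have the same type.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def ordered_pages(directory, pattern="*.tif*"):
    """Return the TIFF files in `directory` in page order.

    The default pattern matches both .tif and .tiff; matching is
    case-sensitive on some systems, so adjust it if your files
    end in .TIF.
    """
    return sorted(Path(directory).glob(pattern),
                  key=lambda p: page_sort_key(p.name))


if __name__ == "__main__":
    # Hypothetical directory name; point this at wherever you saved the TIFFs.
    for page in ordered_pages("pacer_tiffs"):
        print(page.name)
```

Run it against the directory of TIFFs and compare the printed list with the order shown in Acrobat's combine dialog before you hit OK.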

Discussion: Regular Acrobat users might say, "Jeez, that's quite a bit of trouble to go through for a simple OCR." However, PACER includes weird stuff in its PDFs such that a standard OCR doesn't work, which is why you have to go through the saving-to-TIFF part.