How to Fix a Corrupted, Unsearchable PDF


I apologize for the lack of pretty formatting. I'm in a hurry so I am just throwing this text file onto my site. :)

Simple instructions for making a corrupted pdf searchable

THE SYMPTOMS:

I recently bought an e-book from someone who had used an older Mac OS to create it. The book opened just fine. I could *see* the words in it. But I could not search for words in the book. All the programs that I used to do this (Windows Explorer, Foxit Reader, Adobe Acrobat, LibreOffice, various web browsers, Evernote Premium) either told me that the word was not found, or just stared at me blankly as though I hadn't just told them to search. The only search query that got a response was a search for a single letter or digit. However, I never found the letter or digit that I searched for; instead, I got a series of other characters one after another. For example, if I searched for the letter 'h', I would get, in succession: w, w, w, w, w, m, m, m, m, 2, 2, m, m, m, f, f, f, f, etc. After maybe 30 times of hitting Search Again, whatever program I was using seemed to get bored with the game, because it would then take me back to the top of the document and start finding instances of 'w' again. My boyfriend opened the document using his Mac and his Linux box, and he couldn't search it either.

Another symptom was that the text was uncopiable. I tried copying and pasting the text into various editors, but all that gave me was code.

This was completely unacceptable. Since the dawn of electronic documents, reading to me has also meant SEARCHING. If an electronic document is unsearchable, it is actually LESS convenient (to me) than a paper document. I'm guessing that if you are bothering to read this post I don't have to explain to you why it's so important to be able to search a document and copy text from it, and to be able to search the text of an entire directory full of hundreds of pdf's. Copying is also important to me; if I have to use a pen and paper to copy a quote from an ebook, well, then I'm still back in the 19th century, aren't I?

The person who had created the book (several years ago?) wasn't available to fix it for me. So I spent a few weeks trying to figure out the problem. I know, that seems pathetic for someone who claims to be working on artificial intelligence--unless you take into account that I simply don't ever deal with pdf's except to read them when I have absolutely no other choice. I posted an entry several years ago in which I coined a Zen koan referencing the difficulty of using pdf's, their unbelievably large file size, the inability to convert them to usable plain text, and the infuriating cutesy little hand that clenches itself into a fist yet doesn't do much: "Adobe: It's the sound of one hand flipping you off." I thought it best to just leave that whole world alone. I installed Foxit Reader, which does a much better job with such exotic functions as scrolling, paging down, etc.

So I had no experience with manipulating pdf's and didn't know that, as a Windows 7 user, I owned the software to do it. As I scoured the web for the solution, I came across many almost-unintelligible (to me) explanations of the problem and what to do about it. In general, I found plenty of explanations of why there was a problem, but the forum discussions typically ended with the problem going unsolved. The basic gist I got is that there is a very kludgy work-around using Adobe Acrobat. This is a program I never use, because I've always hated it (and pdf's). I thought it was just a reader anyway, and a terribly awkward one at that.

So last night I got to know Adobe Acrobat. I had no idea what most of the menu items did, so I just tried everything and failed until something worked.

ONE SOLUTION:

To save you from the same grief, here are step-by-step instructions. There may be other solutions; this is simply the first one that I found that I could do myself, without paying a web service or Kinkos to do it for twice what I paid for the e-book. If you don't happen to have Adobe Acrobat, you almost certainly have a friend who does. And there may be other pdf manipulators that can do the same thing (I looked hard but couldn't find a way to do it with Foxit or with Evernote, even though Evernote can read text from snapshots of your handwriting!).

1. LAUNCH Adobe Acrobat

2. Using the File menu, OPEN the corrupted document. (I don't know what to do if you're not even able to open the file. Sorry!)

3. (VERIFY that Acrobat can't search the document, in case you haven't done so, just to avoid unnecessary work.)

4. EXPORT: Once the document is open, use the File menu again, and choose EXPORT / IMAGE / PNG. Your corrupted pdf will be saved as a series of images with the file extension ".png", one for each page of the pdf document. Don't worry, they will be numbered automatically by Acrobat, and they aren't terribly large. My document was 200 pages long, so I got 200 little image files in .png format. The export may take a couple of minutes. You won't get any further signals from Adobe to tell you it's done--just go look in the directory that contains the original and see if it made png files with names like:

chemistryBook_Page_001.png
chemistryBook_Page_002.png

5. COLLECT: Once you have the image files, collect them all by cut and paste into their own directory.

6. OCR: Under the Document menu, choose OCR TEXT RECOGNITION / RECOGNIZE TEXT IN MULTIPLE FILES USING OCR

7. ADD FILES: You will be shown a dialog box with the title "Paper Capture Multiple Files" and the subtitle "Run OCR on a set of images." There is a button that says "Add Files". Click this button, choose ADD FOLDERS, and browse to the folder that contains your png files. Highlight that folder, click OK. The files will then appear within this dialog box. Make sure that the files are in the proper order, or you will be sad. Click OK.

8. CHOOSE OUTPUT OPTIONS: Now you will get a dialog box entitled "Output Options". You have several choices to make here:

TARGET FOLDER: Click "Specific Folder", then Browse to your folder full of images, click "Make New Folder", name the folder (something like "CHEMISTRYBOOKIMAGEFILES" so you can find it easily and know what is in it), click OK.

FILE NAMING: Click "Keep Original File Names". This will preserve Acrobat's automatic numbering of your files--you will need that to get the page ordering right! UNcheck "Overwrite existing files" just to avoid a terrible mishap, unless you are very pressed for disk space or unless this is your 5th time attempting to follow these instructions and you've already got way too many duplicates of the output files. If you have the disk space, just make a new empty folder for your 6th try.

OUTPUT FORMAT: Select "Save File(s) as Adobe PDF". Click OK.

Now wait for Adobe to execute optical character recognition on the image files. Its output will be one little pdf file for each little image file that it OCR's.

9. COLLATE THE FILES INTO ONE: Under the File menu, select COMBINE / MERGE FILES INTO A SINGLE PDF. This step is optional; maybe you wanted a bunch of little files, or maybe you wanted to divide your enormous original document into 2 or 3 more manageable documents. To divide the file, just make a separate directory for the png files you want in each smaller final document, and repeat steps 6 through 9 for each directory. BE CAREFUL WITH NAMING! Make sure you choose a unique name, because if you got something wrong, you will want to be able to go back to your original corrupted pdf and try again. If your original is named "CHEMISTRY.PDF", please remember to name this new file something like "CHEMISTRY-FIXED.PDF".

If you really despise pdf, you can try using different output formats in Step 8. I do hate pdf, but I chose pdf for two reasons: one is that I had more confidence that pdf would retain important features like charts and graphs and labeled photos in my document. The other is that I was so so so so SO tired of doing all this pdf crap instead of the chemistry work that I'd gotten the ebook to help me with that I didn't want to do anything fancy with file formats at this point. Let me know if you try output to rtf or ascii and get good results.

10. TEST: Open the merged document(s) in all of the pdf readers and web browsers you will want to use with it, and try searching with it. Use your file browser and try to search for text in the directory with a word you know the file contains. Searchable? Good job, you're done, cheers!

Not searchable? Oh noes! Check that you opened the correct document (maybe you opened the original by mistake). Try the entire process again. If that doesn't help, try the entire process again, but output to plain text this time. My apologies, but, being a complete newbie myself, I have no further advice on this topic.
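
If you would rather not do steps 5 and 7 by hand, the collecting and the page-order check can be scripted. What follows is only a minimal sketch in Python, not part of the Acrobat procedure. It assumes the exported files follow the naming shown in step 4 (something ending in "_Page_" plus digits plus ".png") and that numbering starts at 1. The function names and the folder paths in the usage line are made up for illustration. It copies the images rather than moving them, so the originals stay where Acrobat put them.

```python
import re
import shutil
from pathlib import Path

# Assumed naming pattern, based on the example names in step 4:
# <anything>_Page_<digits>.png
PAGE_RE = re.compile(r"_Page_(\d+)\.png$", re.IGNORECASE)


def page_number(path):
    """Return the page number embedded in a file name, or None if absent."""
    match = PAGE_RE.search(Path(path).name)
    return int(match.group(1)) if match else None


def collect_pages(source_dir, target_dir):
    """Copy the exported page images into target_dir; return them in page order."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    pages = [p for p in source.iterdir()
             if p.is_file() and page_number(p) is not None]
    pages.sort(key=page_number)  # numeric sort, so page 10 follows page 9
    return [Path(shutil.copy2(page, target / page.name)) for page in pages]


def missing_pages(paths):
    """List page numbers absent from the run 1..highest page number found."""
    found = {page_number(p) for p in paths}
    found.discard(None)
    if not found:
        return []
    return [n for n in range(1, max(found) + 1) if n not in found]
```

A call such as collect_pages("C:/ebooks", "C:/ebooks/chemistryBook_pages") would fill the new folder, and passing its result to missing_pages should give an empty list if every page made it through the export. Anything else in that list is a page to go looking for before you start the OCR.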

NB! My output PDF is of rather low quality. It looks like it was literally scanned from a 10th-iteration paper copy. Don't know how to fix that, after the fact or somewhere in the above process. It's good enough, so I'm just dealing with the shaky blurriness. I seem to remember somewhere that I could choose a high quality output, but, again, I didn't want to do anything fancy with vectors and rasterizing and layers and other terms that I don't know before I verified that I could do something basic and get back to chemistry asap.
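
One possible, unverified cause of the blurriness is that the page images from step 4 were exported at a low resolution. A commonly cited rule of thumb for OCR is around 300 pixels per inch. The sketch below, in Python, is a way to check what you actually got: it reads the pixel width and height out of a PNG file's header and divides by the page size in inches. The page size is something you have to supply yourself (8.5 by 11 inches is just an assumed example), and the function names are made up for illustration.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Read (width, height) in pixels from a PNG file's IHDR header."""
    with open(path, "rb") as handle:
        header = handle.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} does not look like a PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def effective_dpi(pixels, inches):
    """Pixels per inch along one side of the page."""
    return pixels / inches
```

For example, if png_size reports a page image 1275 pixels wide and the printed page would be 8.5 inches wide, effective_dpi(1275, 8.5) works out to 150 pixels per inch, which is half the rule-of-thumb figure. If your numbers come out low, that would be a reason to hunt for a resolution or quality setting in the export dialog before running the OCR again.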

My blog is not open to public comments. If you have questions, email me. My address is carolyn at my domain name. I will do my best to help you because I know how frustrating and crippling this problem can be, and I know how daunting this whole pretend-ocr process is.