How To Turn A Physical Book Into A Searchable PDF File
Turn your books into PDFs!
Let's Dig In
What you'll need:
- A Fujitsu ScanSnap Scanner
- Adobe Acrobat Pro software (NOT "Acrobat Reader")
- A book you are willing to cut the binding off (obviously, don't do this with very rare or
Of course, you can scan a book with a flatbed scanner, but it would take forever and you would get a lot of (very unprofessional looking) shadows from one side of each page being raised up because of the binding. A better option is to use a "sheet-fed document scanner" like the Fujitsu ScanSnap, which is specifically built to be able to scan stacks of (uniform) paper quickly.
Step 1: Take your book to Kinko's or a professional print shop and have them cut the binding off with their industrial-strength paper cutting machine. What you want is a straight, clean, vertical cut about 1/8 to 1/4 of an inch in from the edge of the spine. Make sure that the cut isn't too close to the binding, or the pages will still have some glue on them and may stick together when going through the scanner, which risks tearing them. You can mark the book in advance to show the shop where you want the cut. After the binding is off, make sure to keep the pages in order. You can do this by putting the freshly cut book in a flat paper bag or Manila envelope.
Step 2: Set your ScanSnap software (called "ScanSnap Manager") to scan double-sided pages, in color, and select the quality level you want. The choices are "normal", "better", "best", or "excellent". The higher the setting, the better the quality, but the longer the scan will take. Scanning a whole book on the "excellent" setting would take a very long time, so use that setting for the front and back cover only. For the rest of the book, "best" should give you a good balance between quality and speed. Make sure "PDF" is selected as the file option. Scan in color even if the pages are black and white; if you choose black and white, the scans won't look as good. Set the paper size you want. You can choose "automatic detection", which works well if the book happens to be a standard size, but it's more likely that you'll need the "custom" setting to get the size exactly right. Measure a typical page from the book (height and width), select "New Custom Size", type the measurements in, and then double-check that the custom size is actually selected (it is possible to enter a custom size without it being 'active'). Choose the file name format "yyyy_MM_dd_HH_mm_ss" so that each file that is created has a unique name. This will help later because the scans will automatically sort in the order you scanned them. Finally, choose your "image saving folder" and take note of the location.
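To see why the timestamp file name keeps the scans in order, here is a small Python sketch. It is only an illustration, not part of the ScanSnap software: the mapping of "yyyy_MM_dd_HH_mm_ss" to Python's format codes and the example scan times are assumptions made for the demo.

```python
from datetime import datetime, timedelta

# Assumed Python equivalent of the "yyyy_MM_dd_HH_mm_ss" pattern.
STAMP_FORMAT = "%Y_%m_%d_%H_%M_%S"


def scan_file_name(moment: datetime) -> str:
    """Build a file name like 2024_03_09_14_05_30.pdf for a scan time."""
    return moment.strftime(STAMP_FORMAT) + ".pdf"


# Three hypothetical scans taken a few minutes apart, listed out of order.
first = datetime(2024, 3, 9, 14, 5, 30)
names = [
    scan_file_name(first + timedelta(minutes=12)),
    scan_file_name(first),
    scan_file_name(first + timedelta(minutes=4)),
]

# A plain alphabetical sort puts them back in scanning order,
# because the largest time unit (the year) comes first in the name.
ordered = sorted(names)
print(ordered)
```

Because every field is zero-padded and the fields run from year down to second, sorting by name and sorting by scan time give the same result.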
Step 3: Start scanning your book. You'll want to scan each section separately in a way that makes sense, keeping in mind that each group of pages scanned together will be saved into a separate file. For your end result, you'll want a PDF file that makes sense, with each chapter separated out in a table of contents, so think ahead when you're in the scanning phase and make sure that each section/chapter is scanned separately. For example:
- front cover
- copyright, title page, acknowledgments
- table of contents
- chapter 1
- chapter 2
Step 4: Once all the scanning is done, find the folder of scans on your hard drive and name it "original scans". Then make a copy of the folder, so that you still have the original files if you ever need to come back to them. The reason for this is that we are going to be modifying the files when we "OCR" them (OCR stands for "optical character recognition" and is something you have to do if you want your PDF to be "searchable"). From now on, we'll be working on the copies.
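You can make the copy by hand in your file manager, but if you prefer to script it, the following Python sketch shows one way. The function name, the " - working copy" suffix, and the placeholder file are all made up for this example; the demo runs on a throwaway temporary folder so it cannot touch your real scans.

```python
import shutil
import tempfile
from pathlib import Path


def make_working_copy(original: Path, suffix: str = " - working copy") -> Path:
    """Copy a scan folder next to itself and return the new folder's path."""
    working = original.with_name(original.name + suffix)
    # copytree raises FileExistsError if the copy already exists,
    # so an earlier working copy is never silently overwritten.
    shutil.copytree(original, working)
    return working


# Demonstration on a temporary folder with one placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original scans"
    original.mkdir()
    (original / "2024_03_09_14_05_30.pdf").write_bytes(b"placeholder")

    working = make_working_copy(original)
    copied = sorted(p.name for p in working.iterdir())
    print(working.name, copied)
```

To use it on real files, you would call make_working_copy with the path to your own "original scans" folder and then do the OCR on the files in the folder it returns.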
Open Adobe Acrobat Pro and open the first file. From the "Tools" tab, select "Recognize Text > In This File", select "All pages" (or "current page" if it's only a one-page file) and click OK. The default settings should work fine, but have a quick look to make sure the software isn't going to "downsample"; keep it at 600 dpi. After this first file has been "OCR'd", you're ready to batch process the remaining files. Do that by selecting "Tools" > "Recognize Text" > "In Multiple Files" and then dragging in all the scanned PDF files EXCEPT the first one (remember, you've already OCR'd that). When you see all the files in the "Recognize text…" window, click OK. Now go get a cup of coffee, because the computer will need to "cook" for a while and process everything. When that's done, you're ready to move on to step 5.
Naming Your Chapters and Combining Everything Together
Now, let's check that the OCR was done correctly. Find the folder on your hard drive that contains the OCR'd files. Open one of them, choose a word on the page, choose "Find", type in that word and hit enter. Does that word highlight? Was the computer able to find it? If so, the OCR worked. If the OCR didn't work, you'll need to consult your Adobe Acrobat Pro software documentation to see what you're doing wrong. If it did work, you can close the file and we can start naming the chapters.
Step 5: Start with the first file, which should be the front cover. Open it quickly to check, then close it. Name it "01-Front Cover". Open the next file and check what it is (let's say it's the copyright page and title page). Name that "02-Copyright, Title". Open the next file (let's say it's the table of contents). Name that "03-Table of Contents". Open the next file (let's say it's Chapter 1). Name that "04-Chapter 1"…and so on. Do this until all of the files are named. Now you're ready to combine all of the separate, named PDF files into one master PDF. Open Adobe Acrobat Pro and from the main screen, select the option "Combine Files into PDF". Drag all the files into that window (they should be in order because we started each file name with 01, 02, etc.). Click the "Combine Files" button and wait for the software to do its magic. Now the only thing left to do is save the PDF file. Choose "Save As" from the File menu, give it a name like "(Book Title) Final OCR'd" and you're done.
It will be quite a big file, so you can also save another version with a reduced file size. From the File menu, choose "Save As" > "Reduced Size PDF" and remember to indicate "reduced size" in the name. For example, "(Book Title) Final OCR'd, reduced".
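If you have a lot of chapters, the renaming part of Step 5 can be tedious to do by hand. The Python sketch below shows one possible way to automate it, assuming you have already opened each file and written down the titles in scanning order. The function names, the title list, and the placeholder files are invented for this example, and the demo runs in a temporary folder.

```python
import tempfile
from pathlib import Path


def numbered_names(titles: list[str]) -> list[str]:
    """Return names like '01-Front Cover.pdf' for titles given in book order."""
    return [f"{i:02d}-{title}.pdf" for i, title in enumerate(titles, start=1)]


def rename_scans(folder: Path, titles: list[str]) -> list[str]:
    """Rename the PDFs in folder, taken in sorted (scan) order, to numbered titles."""
    scans = sorted(folder.glob("*.pdf"))
    if len(scans) != len(titles):
        # Refuse to guess if the title list does not match the files.
        raise ValueError(f"{len(scans)} files but {len(titles)} titles")
    new_names = numbered_names(titles)
    for scan, new_name in zip(scans, new_names):
        scan.rename(folder / new_name)
    return new_names


# Titles in the order the sections were scanned (example values).
titles = ["Front Cover", "Copyright, Title", "Table of Contents", "Chapter 1"]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Placeholder files standing in for the timestamp-named scans.
    for stamp in ["2024_03_09_14_05_30", "2024_03_09_14_09_30",
                  "2024_03_09_14_17_30", "2024_03_09_14_31_02"]:
        (folder / f"{stamp}.pdf").write_bytes(b"placeholder")

    rename_scans(folder, titles)
    result = sorted(p.name for p in folder.iterdir())
    print(result)
```

The two-digit prefix keeps the files in order for up to 99 sections. It is still worth opening a few of the renamed files to confirm that each title matches its contents before combining them.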