When you scan a document directly into a PDF file, Acrobat captures all the text and graphics on each page as though they were all just one big graphic image. This is fine as far as it goes, except that it doesn’t go very far because you can neither edit nor search the PDF document (because, as far as Acrobat is concerned, the document doesn’t contain any text to edit or search, just one humongous graphic). That’s where the Paper Capture plug-in in Acrobat 5 for Windows comes into play: You can use it to make a PDF that you can just search or both search and edit.
For some unknown reason, some of the first copies of Acrobat 5 for Windows shipped without the Paper Capture plug-in. If you find that your Tools menu in Acrobat 5 is missing the Paper Capture item, you need to download and install the Paper Capture plug-in from the Adobe Web site. Note that the Paper Capture plug-in has a 50-page document limit. If you need to process PDF documents over 50 pages in length, you need to look into purchasing Adobe Acrobat Capture, a full-blown version of the Paper Capture plug-in that can handle longer documents.
To use Paper Capture, all you have to do is choose Tools –> Paper Capture to open the Paper Capture Plug-In dialog box, select the page or pages to be processed (All Pages, Current Page, or From Page x to y), and then click the OK button; the Paper Capture utility does the rest. As it processes the page or pages in the document that you designated, a Paper Capture Plug-In alert dialog box keeps you informed of its progress in preparing and performing the page recognition. When Paper Capture finishes doing the page recognition, this alert dialog box disappears and you can then save the changes to your PDF document with the File –> Save command.
When doing the page recognition in a PDF document, the Paper Capture plug-in offers you a choice between the following three Output Style options:
- Formatted Text & Graphics to make the text in the PDF document both editable and searchable. Select this setting if you not only want to be able to find text in the document but also possibly make editing changes to it.
- Searchable Image (Exact) to make the text in the PDF document searchable but not editable (this is the default setting). Use this setting if you’re processing a document that needs to be searchable but should never be edited in any way, such as an executed contract.
- Searchable Image (Compact) to make the text in the PDF document searchable but not editable and to compress its graphics. Select this setting if you’re processing a document whose text requires searching without editing and that also contains a fair number of graphic images that need compressing. When you select this setting, Paper Capture applies JPEG compression to color images and ZIP compression to black-and-white images.
To check whether a PDF is text-searchable, use the mouse to highlight a word in the text. If a single word cannot be highlighted and the entire page turns blue to indicate that it is an image, the text is not searchable. To make a scanned PDF searchable and editable, you need some sort of Optical Character Recognition (OCR) software that can detect the text in the scanned document. In Acrobat DC, for example, open create-searchable.pdf (or a photo of one of your own documents), select the Enhance Scans tool in the right-hand pane, and select Enhance Camera Image to bring up the Enhance submenu, where you can adjust skewing. Then select the correct option from the Content drop-down; Auto Detect is the default and works on most scanned documents. All of this assumes a PDF that was created from a scanner or an image. If your PDF was created from an Office document (e.g. Word, Excel), you can use one of a number of online PDF to Word converters instead.
To select a different output style setting, click the Preferences button in the Paper Capture Plug-In dialog box to open the Preferences dialog box. This dialog box not only enables you to select a new output style in the PDF Output Style pop-up menu but also to designate the primary language used in the text in the Primary OCR Language pop-up menu (OCR stands for Optical Character Recognition, which is the kind of software that Paper Capture uses to recognize and convert text captured as a graphic into text that can be searched and edited).
If your PDF document contains graphic images, you can tell Paper Capture how much to compress the images by selecting the maximum resolution in the Downsample Images pop-up menu. This menu offers you three options in addition to None (for no compression): Low (300 dpi), Medium (150 dpi), and High (72 dpi). The Low, Medium, and High options refer to the amount of compression applied to the images, and the values 300, 150, and 72 dpi (dots per inch) refer to their resolution and thus their quality. As always, the higher the amount of compression, the smaller the file size and the lower the image quality.
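To get a feel for what these settings mean in pixels, here is a minimal sketch of the arithmetic. It assumes a hypothetical US Letter page (8.5 x 11 inches) scanned at 600 dpi; the function names are illustrative and not part of Acrobat.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return the (width, height) in pixels of a page of the given size at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def downsample_ratio(source_dpi, target_dpi):
    """Fraction of the original pixel count left after downsampling (never more than 1)."""
    if target_dpi >= source_dpi:
        return 1.0  # downsampling never adds pixels
    return (target_dpi / source_dpi) ** 2


# Paper Capture's Downsample Images options, as listed above.
OPTIONS = {"Low": 300, "Medium": 150, "High": 72}

if __name__ == "__main__":
    # Assumed example: a US Letter page originally scanned at 600 dpi.
    for name, dpi in OPTIONS.items():
        w, h = pixel_dimensions(8.5, 11, dpi)
        kept = downsample_ratio(600, dpi)
        print(f"{name:6} {dpi:3} dpi: {w} x {h} px, {kept:.2%} of the original pixels")
```

Pixel count is only a rough proxy for file size, because the compression method applied to the images matters as well.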
After processing the pages of your PDF document with the Paper Capture plug-in, use the Find feature (Ctrl+F on Windows and Command key+F on the Mac) to search for words or phrases in the text to verify it can be searched. If you used the Formatted Text & Graphics output style in doing the page recognition, you can select the TouchUp Text Tool by clicking its button on the Editing toolbar or by typing T, and then click the I-beam pointer in a line of text to select the line with a bounding box to verify that you can edit the text as well. Always remember to use File –> Save to save the changes made to your document by processing with Paper Capture.
Is there any freeware OCR software (for Linux and/or Windows) that can take a PDF scanned document as input and output a Searchable PDF like Adobe Acrobat does?
By searchable PDF I mean that the OCRed text is laid invisibly over the original text and can be selected with the mouse and copied.
I know that gscan2pdf on Linux can do something like this, but the text is placed in the top left corner of the page and is way too small, not at all synchronized with the text on the background scanned page. This is because gscan2pdf feeds the whole page to an OCR engine. It should instead decompose the image into small images containing single lines of text or small paragraphs and send those to the OCR software.
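What "synchronized" placement requires can be sketched as a coordinate mapping: each recognized word's bounding box, which OCR engines typically report in image pixels, has to be scaled to PDF points and flipped vertically before the invisible text is drawn. This is only an illustration of the idea, not gscan2pdf's actual code; the function name and the sample box are hypothetical.

```python
def pixel_box_to_pdf_points(box_px, dpi, page_height_px):
    """Map an OCR word box from image pixels to PDF points.

    box_px is (left, top, right, bottom) with the origin at the top-left
    of the scanned image. PDF user space uses points (1/72 inch) with the
    origin at the bottom-left, so the y axis has to be flipped.
    Returns (x, y, width, height) in points, where (x, y) is the
    lower-left corner of the box.
    """
    left, top, right, bottom = box_px
    scale = 72.0 / dpi  # points per pixel
    x = left * scale
    y = (page_height_px - bottom) * scale
    width = (right - left) * scale
    height = (bottom - top) * scale
    return x, y, width, height


if __name__ == "__main__":
    # Hypothetical word box on a page scanned at 300 dpi (3300 px tall).
    print(pixel_box_to_pdf_points((300, 600, 900, 660), dpi=300, page_height_px=3300))
```

The box height in points also gives a reasonable font size for the hidden text, which is what keeps it from ending up "way too small".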
Nicolas Raoul
Cornelius
10 Answers
A tool that lets you do that is PDF-XChange Viewer. The free version will allow you to OCR your document in a variety of languages (you can download additional language packs for free) and add the OCR'd text as an overlay text layer you can copy from and search with CTRL+F.
- fast PDF viewer with a lot of features
- fast OCR engine (unless you choose the best accuracy)
- a lot of options have the PRO icon next to them (available only on the Pro version) but you can hide them
- color management and custom screen DPI settings
- Windows only application, which doesn't seem to work on Wine (the viewer works, but the OCR function makes it crash)
What it doesn't do:
- the OCR doesn't take advantage of multiple cores
- OCR doesn't detect character styles (bold, italic) or the copy function loses them
- it doesn't use correct Romanian diacritics, but that can be fixed if you copy the text into an editor and do a search and replace.
Guido Domenici
Try pdfsandwich. From the man page:

pdfsandwich generates 'sandwich' OCR pdf files, i.e. pdf files which contain only images (no text) will be processed by optical character recognition (OCR) and the text will be added to each page invisibly 'behind' the images.

pdfsandwich is a command line utility. If you have a scanned pdf file, for instance alice.pdf (which is the first chapter of a novel you might have heard of), invoke pdfsandwich on it. This will generate a file alice_ocr.pdf which looks like the original file, but the recognized text will be placed behind the scanned images. You can make full text searches now or select text areas.

Another option might be OCRmyPDF.
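For the alice.pdf example, the call and the resulting file name can be sketched as follows. This assumes pdfsandwich's default behavior of taking the input file as its only required argument and writing the result next to it with an _ocr suffix; the helper names are hypothetical, and the actual run is left commented out because it needs pdfsandwich installed.

```python
import shlex
from pathlib import Path


def pdfsandwich_command(input_pdf):
    """Build the argument list for running pdfsandwich with default options."""
    return ["pdfsandwich", str(input_pdf)]


def expected_output(input_pdf):
    """Output name following the alice.pdf -> alice_ocr.pdf pattern."""
    path = Path(input_pdf)
    return path.with_name(f"{path.stem}_ocr{path.suffix}")


if __name__ == "__main__":
    cmd = pdfsandwich_command("alice.pdf")
    print(shlex.join(cmd))
    print(expected_output("alice.pdf"))
    # To actually run it (requires pdfsandwich to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```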
student
The newer version of Tesseract (3.03 RC at the time of writing this) can do this:
- free, open source and cross-platform
- starting from version 3.03 PDF output is available
- CLI software
- multiple languages support
- unfortunately, single image input, so to make a complete document, one must create a batch script to convert each page image to searchable PDF. After that PDF pages should be combined to a single PDF using tools like pdftk.
This is the command:
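One way to script the per-page workflow from the last bullet is sketched below. It is not necessarily the invocation this answer had in mind: it assumes the usual `tesseract <image> <outputbase> -l <lang> pdf` form and pdftk's `cat output` merge, and the file names and helper functions are hypothetical. The commands are only built and printed; the line that would execute them is commented out.

```python
from pathlib import Path


def tesseract_command(image, lang="eng"):
    """Command that OCRs one page image into <image stem>.pdf next to it.

    Tesseract appends the .pdf extension to the output base itself.
    """
    image = Path(image)
    output_base = image.with_suffix("")  # e.g. page-001.tif -> page-001
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def pdftk_merge_command(page_pdfs, merged_pdf):
    """Command that concatenates the per-page PDFs in the given order."""
    return ["pdftk", *map(str, page_pdfs), "cat", "output", str(merged_pdf)]


def build_batch(images, merged_pdf, lang="eng"):
    """Return every command needed: one tesseract call per page, then one merge."""
    images = sorted(Path(p) for p in images)  # keep pages in name order
    commands = [tesseract_command(img, lang) for img in images]
    page_pdfs = [img.with_suffix(".pdf") for img in images]
    commands.append(pdftk_merge_command(page_pdfs, merged_pdf))
    return commands


if __name__ == "__main__":
    for cmd in build_batch(["page-002.tif", "page-001.tif"], "document.pdf"):
        print(" ".join(cmd))
        # import subprocess; subprocess.run(cmd, check=True)  # run for real
```

Zero-padded page names (page-001, page-002, ...) matter here, because the merge order follows the sorted file names.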
Cornelius
pypdfocr is what worked for me. It is a Python script streamlining the whole Tesseract usage. After getting dependencies installed (on Linux it's a much simpler process) it's as simple as typing:

    pypdfocr myfile.pdf

And opening myfile_ocr.pdf a while later.

Zaroth
I use Microsoft OneNote as an OCR tool. Right-clicking an image lets you copy all the text in it, and OneNote can also search for text within images. It is free and accurate, runs on Windows, and supports almost all image formats.
It can also search through PDF files, and images in PDF files.
A bonus is that it supports multiple languages, including English, French and Spanish :)
BarathVutukuri
https://www.microsoft.com/en-us/store/p/leadtools-ocr/9wzdncrdr0d5 is a small simple WinRT app (runs fine on Win10 as well) that does nothing more than take an image or pdf and output a sandwich PDF or text. It's kinda ugly and has absolutely no configuration, but it does this one small task perfectly.
James Polley
You can get searchable text using Google Drive.
First, choose a key setting. Under 'general' in your Google Drive settings, check the box next to 'Convert uploads: Convert uploaded files to Google Docs editor format.'
Now upload the pdf to your Google Drive (click 'new', then 'file upload'). When the upload is complete (might take a minute or two), right click it. (If you have trouble finding it, try hitting 'Recent' in the left-hand sidebar.) As I was saying, right-click the pdf you uploaded, and choose 'Open with... Google Docs'. Now you will have searchable text.
aparente001
While the other answers on this thread focus on desktop software, I've had a lot of success with this webservice: http://www.searchablepdfs.org/
It allows you to upload a PDF of a scanned document, and it generates a 'sandwich PDF' with embedded OCR text that you can copy/paste.
Pros:
- Fast
- High quality OCR text recognition (the results I've gotten have been at least as good as what I've been able to get from using tesseract, which Cornelius mentioned)
- Cross-platform (it's a web application so you don't need to install any software yourself)
- Free
Cons:
- Only supports English documents
- Only processes up to 10 pages per file
calvinyoung
Another option is pdf2pdfocr (https://github.com/LeoFCardoso/pdf2pdfocr) that is based on Tesseract-OCR and can run natively on Windows, MacOS and Linux operating systems.
Disclaimer: I'm the pdf2pdfocr developer.
Leo Cardoso
Two more options:
1) Online: www.sandwichpdf.com
2) Desktop (multiple OSes): NAPS2 - https://www.naps2.com/
kpk