Readable English strives to provide the best text conversion experience possible. As part of this commitment, we have developed technology that converts the text in PDF documents into Readable English text. PDF is a universal format that many different programs can produce, and that variety adds complexity that may cause issues during document conversion. To achieve the best possible results, please follow the guidelines below.
On the surface, all PDFs may look the same. However, there are two main types of PDF, distinguished by how they were created.
Text Based PDFs (Ideal)
File → Export As → PDF
Text Based PDFs are the most common type. This is what any word-processing program or web browser usually produces when exporting a PDF file. All text in the document can be selected with the cursor and is stored within the document itself as digital text. While images may be embedded in the document, these images do not contain text as they would, for example, in a scan. In general, if the PDF was created by exporting a document from another digital text format, the result will be a Text Based PDF and our technology should provide great results.
Image Based PDFs (Scans, Not Ideal)
An Image Based PDF is produced by a document scanner. While our technology can convert these documents, two problems may lead to poor results: (1) Poor Scans and (2) Hidden Text. The former is easy to spot, while the latter can lead to converted documents with badly misaligned glyphs.
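As a rough illustration of the distinction, the sketch below sorts a document into one of three kinds from per-page measurements. The `PageStats` fields, the `classify_pdf` name, and the 0.9 coverage threshold are assumptions made for this example, not a description of how Readable English classifies files; the measurements themselves would have to come from whatever PDF library you use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page measurements, gathered with a PDF library of your choice."""
    char_count: int        # characters found in the page's digital text layer
    image_coverage: float  # fraction of the page covered by its largest image (0.0 to 1.0)


def classify_pdf(pages: List[PageStats], full_page_threshold: float = 0.9) -> str:
    """Return 'text-based', 'image-based', or 'image-based with hidden text'.

    A page counts as scanned when one image covers nearly the whole page.
    The threshold is an assumed value, not a documented one.
    """
    scanned = [p for p in pages if p.image_coverage >= full_page_threshold]
    if not scanned:
        # No full-page images: the text is stored as digital text.
        return "text-based"
    if any(p.char_count > 0 for p in scanned):
        # A full-page image that also carries text suggests a hidden OCR layer.
        return "image-based with hidden text"
    return "image-based"
```

For example, `classify_pdf([PageStats(char_count=950, image_coverage=1.0)])` falls into the hidden-text case, because the page is a full-page image that still carries a text layer.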
Poor Scans
To convert the text in an image, we must run Optical Character Recognition (OCR) algorithms to recognize the text. OCR results may be poor because of low image quality, artifacts such as page warping during scanning, or complex background colors or patterns. It is almost always immediately clear when this is the problem. With few exceptions, the only image-based documents that convert well are those with straight, clear text on a uniform background, ideally white, with minimal or no page warping and no rotation.
Hidden (“Searchable”) Text
When some printer-scanners scan a document, they perform their own OCR to recognize the text. This text is made transparent so that it cannot be seen except by the computer. The purpose is to enable copy-and-paste and to make the text searchable. Unfortunately, when this occurs, our conversion technology interprets the document as a Text Based PDF and applies Readable English (RE) glyphs to the hidden text, which is not perfectly aligned with the visible text in the image. If the glyphs on your converted document are not perfectly aligned with the characters and your document came from a scan, this is the most likely reason. Your options are to re-scan the document without making it searchable, or to copy and paste the text into another document and then export that document as a PDF, thus converting it to a Text Based PDF.
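For readers who want to check a file themselves, hidden OCR text is commonly written with PDF text rendering mode 3 (the `3 Tr` operator, which draws text with neither fill nor stroke), although other techniques exist. The sketch below looks for that operator in a page content stream that has already been decompressed; the function name is hypothetical, and this is not necessarily how Readable English detects hidden text.

```python
import re

# "3 Tr" sets text rendering mode 3 (invisible text) in a PDF content stream.
# The lookbehind keeps operands such as "13" or "0.3" from matching.
INVISIBLE_TEXT_MODE = re.compile(rb"(?<![\w.])3\s+Tr\b")


def has_invisible_text(decoded_stream: bytes) -> bool:
    """Return True if a decoded content stream switches to invisible text.

    Assumes the stream has already been decompressed. A literal string that
    happens to contain "3 Tr" would be a false positive, so treat the result
    as a hint rather than proof.
    """
    return INVISIBLE_TEXT_MODE.search(decoded_stream) is not None
```

A stream such as `b"BT /F1 12 Tf 3 Tr (hello) Tj ET"` would be flagged, while the same stream with `0 Tr` (normal filled text) would not.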