Extract text from an image/PDF


For this purpose, you may read my previous related post here.

I am going to introduce (again) the tesseract OCR engine. This time I am using Ubuntu 16.04, and the command to install it is:

sudo apt install tesseract-ocr
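
To confirm that the install worked, a quick check along these lines may help. This is only a sketch: it assumes the package put a tesseract binary on your PATH, and the function name check_tesseract is made up for illustration.

```shell
#!/bin/sh
# check_tesseract: hypothetical helper, not part of tesseract itself.
# Prints the version and the installed languages if tesseract is on PATH.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        # Some versions print this information on stderr, so merge the streams.
        tesseract --version 2>&1 | head -n 1
        tesseract --list-langs 2>&1    # "eng" should appear in this list
        return 0
    else
        echo "tesseract not found; try: sudo apt install tesseract-ocr" >&2
        return 1
    fi
}
```

After sourcing the file, run check_tesseract. If eng is missing from the language list, the English data is probably not installed.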

If you have a PDF and want to convert it to an image for further processing, you can use various methods. One of them is ImageMagick's convert:

convert input.pdf output.png

But this produces a relatively low-resolution image (convert renders PDFs at 72 DPI by default), which may lead to poor OCR output.

So, instead we use:

convert -density 300 -quality 100 input.pdf output.png

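The -density 300 option renders the PDF at 300 DPI instead of the default, and -quality 100 asks convert for its highest quality setting. PNG is lossless, so for PNG output -quality mostly affects compression; the higher density is what improves the OCR result.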

Note that if input.pdf is a multi-page PDF, convert will create one output image per page, named output-0.png, output-1.png and so on.

So finally, use tesseract as:

tesseract output.png text_file -l eng

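This creates text_file.txt in the current directory (tesseract adds the .txt extension itself), and -l eng selects English as the recognition language. You may play with the various options of convert or tesseract based on your needs.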
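
For a multi-page PDF, the two steps above could be tied together roughly as follows. This is a minimal sketch, not a polished tool: the function name ocr_pdf and its arguments are hypothetical, it assumes both convert and tesseract are installed, and it writes one text file per page next to the PDF.

```shell
#!/bin/sh
# ocr_pdf: hypothetical helper that renders a PDF to 300 DPI PNGs and OCRs each page.
# Usage: ocr_pdf input.pdf [lang]      (lang defaults to eng)
# Note: plain sh has no local variables, so pdf, lang, base and img are globals.
ocr_pdf() {
    pdf=$1
    lang=${2:-eng}
    base=${pdf%.pdf}    # strip the .pdf extension: input.pdf -> input

    # A multi-page PDF gives base-0.png, base-1.png, ...; a single page gives base.png.
    convert -density 300 -quality 100 "$pdf" "$base.png" || return 1

    for img in "$base".png "$base"-*.png; do
        [ -e "$img" ] || continue    # skip names or patterns that matched nothing
        # The second argument is the output base name; tesseract appends .txt.
        tesseract "$img" "${img%.png}" -l "$lang" || return 1
    done
}
```

After sourcing the file, ocr_pdf input.pdf should leave input-0.txt, input-1.txt and so on beside the PDF. PNG files left over from an earlier run that match the same names would be picked up too, so it is worth clearing them first.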
