Extract text out of an image/PDF

For this purpose, you may read my previous related post here.

I am going to introduce (again) the tesseract OCR engine. This time I am using Ubuntu 16.04, and the command to install it is:

sudo apt install tesseract-ocr

If you have a PDF and want to convert it to an image for further processing, you can use various methods. One of them is:

convert input.pdf output.png

But this will produce a relatively low-resolution image (convert renders PDFs at 72 DPI unless told otherwise), which may lead to poor text from the OCR.

So, instead we use:

convert -density 300 -quality 100 input.pdf output.png

This raises the rendering density to 300 DPI and sets the quality to 100 so that convert does not reduce it.

Note that if input.pdf is a multi-page PDF, convert will create one output image per page, named output-0.png, output-1.png and so on.

So finally, use tesseract as:

tesseract output.png text_file -l eng

It will create text_file.txt in the same directory. You may play with the various options of convert and tesseract based on your needs.
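For a multi-page PDF, the two commands above can be chained in a small loop. The following is only a rough sketch: pdf_to_text is a made-up helper name, the page-%03d.png pattern is my own choice (it asks convert for zero-padded page numbers so the files sort in page order), and it assumes both convert and tesseract are installed.

```shell
# Sketch: OCR every page of a PDF with the convert and tesseract commands above.
# pdf_to_text is a hypothetical helper name, not part of either tool.
pdf_to_text() {
    local pdf="$1"    # input PDF, e.g. input.pdf
    local out="$2"    # combined text file, e.g. all_pages.txt
    local workdir img

    workdir="$(mktemp -d)" || return 1

    # Render each page at 300 DPI. %03d is replaced with a zero-padded
    # page index (page-000.png, page-001.png, ...).
    convert -density 300 -quality 100 "$pdf" "$workdir/page-%03d.png" || return 1

    : > "$out"        # start with an empty output file

    for img in "$workdir"/page-*.png; do
        [ -e "$img" ] || continue
        # tesseract adds the .txt extension to the output base name itself
        tesseract "$img" "${img%.png}" -l eng || return 1
        cat "${img%.png}.txt" >> "$out"
    done

    rm -rf "$workdir"
}
```

After sourcing this in a shell, it could be called as: pdf_to_text input.pdf all_pages.txt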
