Extract text out of a image/PDF

For this purpose, you may read my previous related post here.

I am going to introduce (again) to the tesseract OCR engine. But this time I am using 16.04 and the command to install it is:

sudo apt install tesseract-ocr

If you have some PDF and want it to convert to image to further process it. You may use various methods. One of them may be:

convert input.pdf output.png

But this will produce a relatively low-resolution image that may result in bad text out of OCR.

So, instead we use:

convert -density 300 -quality 100 input.pdf output.png

Changing the density and tell it to not to decrease the quality than 100%.

Note if the input.pdf is a multi-page PDF, it will create different output images named like: output-0.png, output-1.png and so on.

So finally, use tesseract as:

tesseract output.png text_file -l eng

It will create a text_file.txt in the same directory. You may play with various options of convert or tesseract based on your needs.

Advertisements

4 thoughts on “Extract text out of a image/PDF

  1. Las actualizaciones de software de Apple por norma general incluyen tecnologías avanzadas de ahorro de energía. http://www.4shared.com/office/W9f1wp_7ce/Son_Demasiado_Altas_Las_Valora.html

    Like

  2. Great tool ..thanks for sharing

    Like

  3. Pingback: Darshpreet Singh

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this:
search previous next tag category expand menu location phone mail time cart zoom edit close