Extract text out of an image/PDF

For this purpose, you may read my previous related post here.

I am going to introduce (again) the tesseract OCR engine. This time I am using Ubuntu 16.04, and the command to install it is:

sudo apt install tesseract-ocr

If you have a PDF and want to convert it to an image for further processing, you may use various methods. One of them is ImageMagick’s convert:

convert input.pdf output.png

But this will produce a relatively low-resolution image, which may result in poor text from the OCR.

So, instead we use:

convert -density 300 -quality 100 input.pdf output.png

The -density 300 option rasterizes the PDF at 300 DPI instead of ImageMagick’s default of 72, and -quality 100 tells it not to reduce the quality below 100%.

Note: if input.pdf is a multi-page PDF, it will create separate output images named output-0.png, output-1.png and so on.

So finally, use tesseract as:

tesseract output.png text_file -l eng

It will create a text_file.txt in the same directory. You may play with various options of convert or tesseract based on your needs.
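
For a multi-page PDF, the page images can be fed to tesseract in a loop. The following is a minimal sketch, assuming the images follow the output-N.png naming described above; ocr_pages and the OCR_CMD override are hypothetical names introduced here for illustration (OCR_CMD only lets you dry-run the loop with another command instead of tesseract).

```shell
# Sketch: OCR every page image that convert produced from a multi-page PDF.
# ocr_pages and OCR_CMD are illustrative names, not part of tesseract.
ocr_pages() {
    prefix="$1"                          # e.g. "output" for output-0.png, output-1.png, ...
    for img in "$prefix"-*.png; do
        [ -e "$img" ] || continue        # skip when the pattern matched no files
        base="${img%.png}"               # output-0.png -> output-0
        # tesseract appends .txt itself, so this writes output-0.txt and so on
        "${OCR_CMD:-tesseract}" "$img" "$base" -l eng
    done
}
```

Call it as ocr_pages output, then join the results with cat output-*.txt. Keep in mind that the shell sorts output-10.png before output-2.png, so for PDFs with more than ten pages the pieces may need reordering.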

Pandoc: convert document formats

I had to submit some text to MediaWiki, which uses its own wiki markup, but my text was in a Google Docs file. Initially, I looked for online conversion tools, and for them I downloaded the Doc in HTML format. But that didn’t work. For example, http://pandoc.org/try doesn’t support docx as input. It does support HTML, but it didn’t much like the HTML produced by Google Docs.

The thing that finally worked for me was installing pandoc on my system and doing the conversion locally.

Ubuntu guys can install it simply using:

sudo apt-get install pandoc

Then I read the manpage and found the -t flag useful. I searched for “mediawiki” in the manpage and that did the job.

The actual command that I fired was:

pandoc -t mediawiki input.docx > output

A file named output will then be created in the current directory, in MediaWiki markup format.
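
The same conversion can be written with the input format and output file spelled out, which helps when pandoc cannot guess the format from the extension. This is only a sketch: to_mediawiki and the PANDOC_CMD override are hypothetical names for illustration, and the .wiki extension is an arbitrary choice. The -f, -t and -o flags are pandoc’s own.

```shell
# Sketch: convert a docx file to MediaWiki markup with explicit formats.
# to_mediawiki and PANDOC_CMD are illustrative names, not part of pandoc.
to_mediawiki() {
    src="$1"                             # e.g. input.docx
    dest="${src%.*}.wiki"                # input.docx -> input.wiki
    # -f: input format, -t: output format, -o: output file
    "${PANDOC_CMD:-pandoc}" -f docx -t mediawiki "$src" -o "$dest"
}
```

Running to_mediawiki input.docx would then write input.wiki next to the original file.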

Update: More about pandoc can be seen at http://pandoc.org/ and the formatting help for mediawiki content can be found at https://www.mediawiki.org/wiki/Help:Formatting

IRC: Nickname is already in use

IRC stands for Internet Relay Chat. It is a text-based protocol used for communication and follows a client/server architecture. Channels are hosted on a server, and you use your client application to connect to that server and join a channel.

I have used the XChat IRC client and weechat (console based, 2-3 times).

So, coming back to the title of this post. Ah! You are probably here because of that.

I had XChat set up on one operating system (let’s say Arch) with a registered nick, e.g. mandeep_7. Now suppose that, for some reason, I temporarily shifted to another operating system (let’s say Ubuntu). If I now want to use the same nick (/nick mandeep_7), it gives the error: Nickname is already in use. There may be several reasons, such as a sudden shutdown that left you still logged in to that server.

So what I tried is:

/ghost mandeep_7 password

Use your password instead of “password”. Now try the command /nick mandeep_7

You’ll now be able to identify yourself as the old user (mandeep_7) on the new OS.

You should now try: /msg NickServ identify <password>

Another thing to look into is linking nicknames.


Editing OpenStreetMap

I just signed up on the OSM website, i.e. www.openstreetmap.org. After signing up, I explored it a bit and located my house on it. It was quite empty nearby. We can add places, roads, streets etc. using three functions: Point, Line and Area. So I used Area to mark my home and assigned it my name. You may search with my name; it’s the only entry that appears (as of now). Then I edited the map of my college, GNDEC, a bit. I also added my old school, Guru Nanak Khalsa Sen. Sec. School. Earlier, I couldn’t find any mark related to it. Then I added more neighbourhood houses. Now it’s quite colourful over there. 🙂

One thing I noticed on the OSM website is that we can’t normally see a satellite view. But while editing maps, we can see the imagery (satellite view) using Bing or Mapnik. Please correct me if something is wrong here.

Scan documents via mobile camera – OCR

You might have wondered about scanning a document or some pages of a book and storing them as digital documents. So here is how to do it.

You need a good quality mobile phone camera to be able to take good quality pictures that can be clearly understood.

There are many mobile applications for Android that help to scan images. Textfairy is an application that enhances the clicked picture and extracts text from it. The extracted text was not very reliable, but it’s OK. Another application is CamScanner. The limitation of these apps is that I couldn’t do batch processing after clicking the pictures. I found one application, Droid scan, that did the batch operation. I applied it to 88 images and it took a lot of time. When I reduced the number of images to 5, it completed within a few minutes, but the output was not correct and some images were distorted.

On Linux, we may use the tesseract tool to obtain the text from an image (scanned or clicked). To install tesseract, use the command below, which will also install the English language data for OCR.

sudo apt-get install tesseract-ocr tesseract-ocr-eng

After it is installed, try it on an image using the following command:

tesseract input.jpg output -l eng

Here input.jpg is the image from which you want to extract the text. The second argument, output, is the base name of the target file: tesseract appends .txt itself, so the extracted text ends up in output.txt. And -l eng specifies the language.
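
Since batch processing was the weak point of the mobile apps, here is a rough sketch of doing it with tesseract over a folder of photos. It assumes the pictures are .jpg files in one directory and that your tesseract version accepts stdout as the output base (worth checking in its manpage); ocr_batch and the OCR_CMD override are hypothetical names used only for illustration.

```shell
# Sketch: OCR all .jpg photos in a directory and print the combined text.
# ocr_batch and OCR_CMD are illustrative names, not part of tesseract.
ocr_batch() {
    dir="$1"                             # directory holding the photographed pages
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue        # skip when the pattern matched no files
        printf '==== %s ====\n' "$img"   # separator so each page can be found later
        # "stdout" as the output base sends the text to standard output
        "${OCR_CMD:-tesseract}" "$img" stdout -l eng
    done
}
```

Redirect the output to keep it, for example ocr_batch . > book.txt for the photos in the current directory.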

Tesseract has been very accurate in my experience.
Read more on OCR at https://en.wikipedia.org/wiki/Optical_character_recognition


Save files via URL directly to Google Drive, Dropbox etc. without Downloading


Yes, this is the website that provides the facility to save files directly to Google Drive, Dropbox, an FTP server etc. (and some more) without downloading and uploading them again.

Just paste the link of the file to be downloaded and then choose the destination. Verify your account with the service (like Google).

Then it will be saved!!!

I tried 10-15 PDF files and it worked. But it couldn’t save a zip file 237 MB in size (I don’t know the reason). Please tell me in the comments if you have used it.