Cleaning up Noisy Microfilm/Microfiche

I am new to GIMP and am trying to figure out a way to clean up some noisy microfiche scans. All I need is for the scans to be readable enough that my OCR program will work on them. The microfiche was in very bad shape and the readers were quite old as well, so the scanned copy of the text was the best I could get after much fiddling around. Just getting a legible copy was a feat. I have attached one image to this message. Text in bold can be recognized through OCR, but regular text just seems to blend in with all the dust and speckles.

I was advised by my PDF support person to export the PDF into an image format and run the image through image manipulation software to lighten the background without altering the text. However, in GIMP 2.8, the Layers dialog does not list "background" as a layer option. This was perplexing because when I searched the web for help using GIMP to lighten the background of an image, several different sites said that you can choose "background" as a layer.

I tried opening the image as layers, used the stack option to choose the bottom image, and fiddled around with various settings, but no visible changes appeared. The only thing I was able to do was work on the top layer, which changed both text and background and ended up making the text too light, without enough contrast to be useful. Is there a way to isolate the background, leaving the text intact? I don't need these to look great by any means, just legible enough that the OCR program can work. It may be impossible, but I did want to find out.

Many many thanks!

Attachment: Test 1.jpg (2.19 MB)

A scanned image has no separate background; everything is in one layer, so you have to approach this another way. Try Despeckle with different settings until the OCR recognises most letters. Use the eraser on the difficult parts and let the word processor handle the final cleanup. After that you should check your text thoroughly.

There is nothing to lighten here: your background is already 255,255,255. What you need is to get rid of the speckles. I don't think there is a good enough tool for this job in GIMP. Try to find a specialized tool ...
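To illustrate what "getting rid of speckles" on an already white background can mean, here is a minimal sketch in plain Python. It is not a GIMP feature and not any poster's tool; the function name `remove_small_specks` and its `threshold` and `min_size` parameters are made up for this example. It treats every pixel darker than `threshold` as ink, groups touching ink pixels into blobs, and paints blobs smaller than `min_size` pixels white. The image is assumed to be a grayscale list of rows with values 0-255; loading and saving the scan is left out.

```python
from collections import deque


def remove_small_specks(image, threshold=128, min_size=4):
    """Whiten dark blobs smaller than min_size pixels (hypothetical helper).

    image is a list of rows of grayscale values (0 = black, 255 = white).
    Pixels below threshold count as ink. A new image is returned.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if seen[y][x] or image[y][x] >= threshold:
                continue
            # Flood fill to collect one blob of touching ink pixels
            # (diagonal neighbours count as touching).
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx]
                                and image[ny][nx] < threshold):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Small blobs are assumed to be dust; larger ones are kept as text.
            if len(blob) < min_size:
                for cy, cx in blob:
                    result[cy][cx] = 255
    return result
```

The obvious weakness is that periods, commas and the dots on "i" are small blobs too, so `min_size` has to be tuned per scan resolution, and some punctuation will probably be lost along with the dust.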

Try a median filter, maybe coupled with enlarge/shrink.
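For anyone unsure what a median filter does, a rough sketch in plain Python follows. It replaces each pixel with the median of its neighbourhood, so an isolated dark pixel surrounded by white disappears while strokes wider than the window survive. The function name `median_filter` and the `radius` parameter are illustrative only, and the image is assumed to be a grayscale list of rows (0-255); in practice you would use the filter built into your image tool rather than this slow loop.

```python
from statistics import median


def median_filter(image, radius=1):
    """Return a filtered copy of a grayscale image (list of rows).

    Each pixel becomes the median of the (2*radius+1) x (2*radius+1)
    window around it. At the borders the window is clipped to the image.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # Clipped windows can hold an even number of values, in which
            # case median() averages the two middle ones; truncate to int.
            row.append(int(median(window)))
        result.append(row)
    return result
```

A larger `radius` removes bigger specks but also starts eating thin strokes of regular-weight text, which is presumably why the bold text in the scan survives OCR better than the rest.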

Automated removal of speckles, dust and toner particles from scanned images is a much tougher task than it seems.
For Linux users there is a GIMP plugin called Nuvola Tools (needs to be installed), designed for removing shadows, speckles, dust and toner particles from scanned images.
On Windows/Mac, try the G'MIC plugin (also needs to be installed). In the G'MIC plugin UI, open Available Filters -> Enhancement -> Remove Hot Pixels. Set Input/Output layers to Active (default) and New active layer(s). Leave the default settings for Mask size and Threshold. Apply the filter 3-4 times (don't hit OK); this will create 3-4 new layers with reduced noise. See which one is good enough for your OCR software to recognize characters. If the OCR gets the job done with, say, 80% accuracy, that is about the most you can expect from such a heavily "polluted" image. OCR is never 100% accurate to begin with, even when it deals with clean, high-res text.

I was able to clean it like this (with a custom Python script):

http://i49.tinypic.com/xaxic5.png

Thanks to everyone for their very helpful comments! I'm going to keep working on this and will try out all the great advice in this thread. If I can get this figured out it would be incredibly helpful as I will have to use more microfilm over the coming months for my dissertation research.

Tibor95, I am wondering what a custom Python script is, and whether it would be possible for me to use the one you created on other pages in the document as I experiment more with this?

Thanks so much again for all the help!
