25 Apr 2012

Converting Images of Documents to Near Scanned Quality

I’m working on a project at the moment where we’re handling a large quantity of documents that have been photographed in less than ideal circumstances. These photographs are high-resolution color photos of typed and handwritten documents that need to be transformed into scan-quality documents. The steps involved are:

  • Convert color to grayscale (128 shades or fewer)
  • Remove background noise and speckles/flecks/etc
  • Reduce size of images for easier distribution and lower bandwidth usage (currently ~40 GB for the set of images)

After a few days of working on the problem and learning about the technologies involved, here’s the script I came up with, along with notes about sources and inspirations. (Note: much of the heavy lifting is done by ImageMagick and Fred’s ImageMagick Scripts.)

[gist id=2494870]
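For readers who want a feel for the three steps listed above before opening the gist, here is a rough Python sketch that builds one ImageMagick command per image. It is not the script from the gist and it does not use Fred’s ImageMagick Scripts. The helper names (`build_command`, `clean_image`), the 50% scale factor, and the use of the plain `-despeckle` filter are all assumptions made for illustration. It also assumes the ImageMagick 6 `convert` binary is on the PATH (ImageMagick 7 installs it as `magick`).

```python
import shutil
import subprocess


def build_command(src, dst, shades=128, scale_percent=50, binary="convert"):
    """Build an ImageMagick argument list for one image (hypothetical helper).

    The defaults are illustrative guesses, not values taken from the gist.
    """
    return [
        binary,
        str(src),
        "-colorspace", "Gray",            # step 1: drop color information
        "-despeckle",                     # step 2: reduce speckle noise
        "-resize", f"{scale_percent}%",   # step 3: shrink the image
        "-colors", str(shades),           # step 1: cap the number of gray levels,
                                          # applied last so resizing cannot add more
        str(dst),
    ]


def clean_image(src, dst, **options):
    """Run the command built above; raises if ImageMagick is not installed."""
    cmd = build_command(src, dst, **options)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)
```

Calling `clean_image("photo.jpg", "photo.png")` would process a single file. Real photographs with uneven lighting usually need stronger background cleanup than `-despeckle` alone, which is where the more specialised scripts come in.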