Quack – an ALTO viewer

Never heard of ALTO (Analyzed Layout and Text Object)? Fear not, it is a very simple representation of structured text from physical pages, and I mean that in a positive way. Its primary use is to hold OCRed text. A piece of ALTO XML is a list of boxes, representing pages. Each page is a list of boxes, representing blocks. Blocks can, among other things, be a list of boxes, representing lines. Each line is a list of boxes, representing words. Each word is just a word. No more Chinese boxes. For all its simplicity, there is a curious lack of simple viewers for the format. That’s where Quack comes in.
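
To make the boxes-in-boxes idea concrete, here is a minimal Python sketch that walks a tiny hand-written ALTO-like fragment and lists each word with its box. The element and attribute names (Page, TextBlock, TextLine, String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) follow the ALTO schema as far as I know it. The sample document and the helper names `local_name` and `words_with_boxes` are made up for illustration, and this is not how Quack itself reads the files.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; real ALTO files carry an XML
# namespace and far more metadata than this.
SAMPLE_ALTO = """
<alto>
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="B1" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="120">
          <TextLine ID="L1" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="50">
            <String CONTENT="Hello" HPOS="100" VPOS="200" WIDTH="150" HEIGHT="50"/>
            <SP/>
            <String CONTENT="ALTO" HPOS="270" VPOS="200" WIDTH="140" HEIGHT="50"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def words_with_boxes(alto_xml):
    """Return (line_id, word, (hpos, vpos, width, height)) for every word."""
    root = ET.fromstring(alto_xml.strip())
    words = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        for child in line:
            if local_name(child.tag) != "String":
                continue  # skip spaces (SP) and anything else that is not a word
            box = tuple(
                float(child.get(name, "0"))
                for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
            words.append((line.get("ID"), child.get("CONTENT", ""), box))
    return words


if __name__ == "__main__":
    for line_id, word, box in words_with_boxes(SAMPLE_ALTO):
        print(line_id, word, box)
```

The same coordinates are what an overlay needs: each box can be drawn on top of the image at its HPOS/VPOS with the given WIDTH and HEIGHT.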

Quack displaying one of the sample images

Quack is a bash hack from yours truly that grew out of control. The focus is small-scale quality assurance: a few hundred images is OK, a few thousand is manageable. It traverses a folder structure of images with corresponding ALTO files and creates HTML pages containing zoomable images with ALTO structure overlays. The pages can be viewed directly from the file system or (of course) put on a web server. Features are

  • OpenSeadragon used for smooth zoom & pan of images of arbitrary size (arbitrary meaning at least up to 50 gigapixel in this context)
  • Overlays
    • Connected blocks for inspecting segmentation quality
    • Grid for inspecting skewing and rotation
    • Burned-out highlights and lowlights
  • OCR-text on mouse over
  • Histogram for each image
  • Darkest and brightest greyscale value with percentage (e.g. 12% of the image is absolute black)
  • Thumbs view for folders
    • Burned-out highlights and lowlights overlays in thumbs view
  • Iterative updates of the display files (additions do not require a full rebuild)
  • Works with a fairly standard Linux setup, except for deepzoom, which is optional but recommended for large images. It should work under Cygwin, but this is untested
  • Dog slow for big ALTO files as bash is heavily misused for text processing and HTML templating
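
The greyscale statistics in the list above are simple to state precisely. The following Python sketch computes a histogram, the darkest and brightest value with the percentage of pixels at each, and the share of burned-out pixels. It works on a plain list of 0-255 values so it stays self-contained. The function names `grey_stats` and `burned_out_percent` are hypothetical, and this illustrates the idea only; it is not Quack's bash implementation.

```python
from collections import Counter


def grey_stats(pixels):
    """Histogram plus darkest/brightest value with percentage of pixels.

    `pixels` is any iterable of greyscale values in the range 0-255.
    """
    histogram = Counter(pixels)
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("no pixels given")
    darkest = min(histogram)
    brightest = max(histogram)
    return {
        "histogram": histogram,
        "darkest": (darkest, 100.0 * histogram[darkest] / total),
        "brightest": (brightest, 100.0 * histogram[brightest] / total),
    }


def burned_out_percent(pixels, low=0, high=255):
    """Percentage of pixels at or below `low` and at or above `high`."""
    values = list(pixels)
    if not values:
        raise ValueError("no pixels given")
    lowlights = sum(1 for value in values if value <= low)
    highlights = sum(1 for value in values if value >= high)
    return (100.0 * lowlights / len(values), 100.0 * highlights / len(values))


if __name__ == "__main__":
    # Made-up sample: eight pixels, three pure black and two pure white.
    sample = [0, 0, 0, 10, 128, 200, 255, 255]
    stats = grey_stats(sample)
    print("darkest:", stats["darkest"])
    print("brightest:", stats["brightest"])
    print("burned out (low, high):", burned_out_percent(sample))
```

A high percentage at value 0 or 255 suggests clipped shadows or highlights, which is what the burned-out overlays make visible on the image itself.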

Quack is Apache 2.0 Open Source. Clone it from https://github.com/tokee/quack and give it a spin. It even has sample images to get you started.

One Response to “Quack – an ALTO viewer”

  1. villadsen Says:

    Great idea posting this on github.
