Never heard of ALTO (Analyzed Layout and Text Object)? Fear not, it is a very simple representation of structured text from physical pages, and I mean that in a positive way. Its primary use is to hold OCRed text. A piece of ALTO XML is a list of boxes, representing pages. Each page is a list of boxes, representing blocks. Blocks can, among other things, be a list of boxes, representing lines. Each line is a list of boxes, representing words. Each word is just a word. No more Chinese boxes. For all its simplicity, there is a curious lack of simple viewers for the format. That’s where Quack comes in.
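To make the boxes-in-boxes idea concrete, here is a minimal Python sketch that walks a tiny, made-up ALTO snippet from page down to word. The element and attribute names (`Page`, `TextBlock`, `TextLine`, `String`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`, `CONTENT`) come from the ALTO schema; the sample document, the helper names and the namespace version are assumptions for illustration only, and this is not how Quack itself processes ALTO.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; real ALTO files carry far more metadata.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="B1" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="120">
          <TextLine HPOS="100" VPOS="200" WIDTH="800" HEIGHT="50">
            <String HPOS="100" VPOS="200" WIDTH="180" HEIGHT="50" CONTENT="Hello"/>
            <SP/>
            <String HPOS="300" VPOS="200" WIDTH="200" HEIGHT="50" CONTENT="ALTO"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local(tag):
    """Strip the XML namespace so the sketch works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def descendants(element, name):
    """All descendants of element whose local tag name equals name."""
    return [e for e in element.iter() if local(e.tag) == name]


def box(element):
    """(x, y, width, height); attributes missing on the element count as 0."""
    return tuple(float(element.get(k, 0)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))


def parse_alto(alto_xml):
    """Return pages -> blocks -> lines -> words as nested dicts."""
    root = ET.fromstring(alto_xml)
    pages = []
    for page in descendants(root, "Page"):
        blocks = []
        for block in descendants(page, "TextBlock"):
            lines = []
            for line in descendants(block, "TextLine"):
                words = [{"text": w.get("CONTENT", ""), "box": box(w)}
                         for w in descendants(line, "String")]
                lines.append({"box": box(line), "words": words})
            blocks.append({"box": box(block), "lines": lines})
        pages.append({"box": box(page), "blocks": blocks})
    return pages


pages = parse_alto(SAMPLE_ALTO)
for line in pages[0]["blocks"][0]["lines"]:
    print(" ".join(word["text"] for word in line["words"]))
```

Each level carries its own box, which is all a viewer needs in order to draw overlays on top of the scanned image.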
Quack is a bash hack from yours truly that grew out of control. The focus is small-scale quality assurance: a few hundred images is OK, a few thousand is manageable. It traverses a folder structure of images with corresponding ALTO files and creates HTML pages containing zoomable images with ALTO structure overlays. The pages can be viewed directly from the file system or (of course) put on a web server. Features:
- OpenSeadragon for smooth zoom & pan of images of arbitrary size (arbitrary meaning at least up to 50 gigapixels in this context)
- Connected blocks for inspecting segmentation quality
- Grid for inspecting skewing and rotation
- Burned-out highlights and lowlights
- OCR-text on mouse over
- Histogram for each image
- Darkest and brightest greyscale value with percentage (e.g. 12% of the image is absolute black)
- Thumbs view for folders
- Burned-out highlight and lowlight overlays in thumbs view
- Iterative updates of the display files (additions do not require a full rebuild)
- Works with a fairly standard Linux setup, except for deepzoom, which is optional but recommended for large images. Should work under Cygwin, but this is not tested
- Dog slow for big ALTO files as bash is heavily misused for text processing and HTML templating
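For the curious, the histogram and the darkest/brightest numbers in the list above boil down to simple counting. The following is a rough Python sketch of that arithmetic, assuming the image has already been reduced to a flat list of 8-bit greyscale values. The function name `grey_stats` and the sample pixels are made up for illustration, and this is not necessarily how Quack computes its figures.

```python
def grey_stats(pixels):
    """Histogram plus darkest/brightest grey value with their share of the image.

    pixels: flat iterable of ints in the range 0..255 (an assumed input format).
    """
    histogram = [0] * 256
    total = 0
    for value in pixels:
        histogram[value] += 1
        total += 1
    if total == 0:
        raise ValueError("no pixels given")

    # First and last non-empty histogram buckets.
    darkest = next(v for v in range(256) if histogram[v])
    brightest = next(v for v in range(255, -1, -1) if histogram[v])

    def percent(value):
        return 100.0 * histogram[value] / total

    return {
        "histogram": histogram,
        "darkest": (darkest, percent(darkest)),
        "brightest": (brightest, percent(brightest)),
    }


# Made-up 100-pixel "image": 12 black, 80 mid-grey and 8 white pixels.
stats = grey_stats([0] * 12 + [128] * 80 + [255] * 8)
print(stats["darkest"])    # (0, 12.0)
print(stats["brightest"])  # (255, 8.0)
```

A large share of pixels at 0 or 255 is a hint that shadows or highlights were clipped during scanning, which is what the burned-out overlays are meant to make visible.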
Quack is Apache 2.0 Open Source. Clone it from https://github.com/tokee/quack and give it a spin. It even has sample images to get you started.