Our large-scale digitization project of newspapers from microfilm is just on the verge of going into production. Being technical lead on the process of ingesting and controlling quality of the digitized data has been a roller coaster of excitement, disillusionment, humility, admiration, despair, confidence, sleeplessness, and triumph.
I would like to record some of the more important lessons learned, as seen from my role in the project. I hope this will help other organizations doing similar projects. I wish we had had these insights at the beginning of our project, but I also recognize that some of these lessons can only be learned by experience.
Lesson 1: For a good end result, you need people in your organization that understand all parts of the process of digitization
At the beginning of our project, we had people in our organization who knew a lot about our source material (newspapers and microfilm) and people who knew a lot about digital preservation.
We did not have people who knew much about microfilm scanning.
We assumed this would be no problem, because we would hire a contractor, and they would know about the issues concerning microfilm scanning.
We were not wrong as such. The contractor we chose had great expertise in microfilm scanning. And yet it still turned out we ended up with a gap in required knowledge.
The reason is that most scanning companies do not scan for preservation. They scan for presentation. The two scenarios entail two different sets of requirements. Our major requirement was to have a digital copy that resembled our physical copy as closely as possible. The usual requirement a scanning company gets from its customers is to produce the most legible files at the lowest possible cost. These two sets of requirements are not always compatible.
One example was image compression. We had decided on losslessly compressed images (in JPEG2000). Lossless compression is more expensive than lossy compression when it comes to storage, but it avoids the graphic artifacts that lossy compression always leaves, which can be a hassle in any post-processing or migration of the images. Since we were scanning to replace originals, we opted for the costly but unedited files.
Once we got our first images, though, inspection showed definite signs of lossy compression artifacts. The files themselves were in a lossless format as expected, but the compression artifacts were there all the same. Somewhere along the path to our lossless JPEG2000 images, a lossy compression had taken place. The contractor assured us that they used no lossy compression. Not until we visited the contractor and saw the scanning stations did we find the culprit. It was the scanners themselves! It turned out that the scanner used JPEG as an intermediary format when transferring the images to the scanner processing software. So in the end we got the costly lossless image format, but the artifacts from lossy compression as well. It was a pure lose/lose situation. And even worse, there was no obvious way to turn it off! We finally managed to resolve it, though, with three-way communication between us, the scanner manufacturer and the contractor. Luckily, there was a non-obvious way to avoid the JPEG transfer format: changing the color profile from “gray-scale” to “gray-scale (lossless)”.
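For readers wondering how lossy artifacts can be spotted in a nominally lossless file: JPEG compresses grayscale images in 8×8 pixel blocks, so one common heuristic is to compare brightness steps across that 8-pixel grid with steps everywhere else. The sketch below is only an illustration, not the analysis we actually ran. The function name `blockiness_ratio` is made up for this example, and it assumes the block grid is aligned with the left edge of the image and looks at horizontal steps only.

```python
def blockiness_ratio(pixels, block=8):
    """Compare brightness steps across block boundaries with steps inside blocks.

    pixels: list of rows, each row a list of gray values (ints).
    Returns mean boundary step / mean inner step. A value well above 1
    hints at a visible block grid; it does not prove JPEG was involved.
    """
    boundary_sum = boundary_n = inner_sum = inner_n = 0
    for row in pixels:
        for x in range(len(row) - 1):
            step = abs(row[x + 1] - row[x])
            if (x + 1) % block == 0:      # step crosses a block boundary
                boundary_sum += step
                boundary_n += 1
            else:                         # step lies inside one block
                inner_sum += step
                inner_n += 1
    if boundary_n == 0 or inner_n == 0:
        raise ValueError("image is too narrow to contain a block boundary")
    boundary_mean = boundary_sum / boundary_n
    inner_mean = inner_sum / inner_n
    if inner_mean == 0:
        return float("inf") if boundary_mean > 0 else 1.0
    return boundary_mean / inner_mean


# Synthetic rows: a smooth ramp, and a ramp with an extra jump every 8 pixels.
smooth = [[x for x in range(32)] for _ in range(4)]
blocky = [[(x // 8) * 10 + (x % 8) for x in range(32)] for _ in range(4)]
print(blockiness_ratio(smooth), blockiness_ratio(blocky))
```

On real scans the ratio is noisy, so it is better suited to flagging files for a human to inspect than to accepting or rejecting them automatically.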
As another example, we had in our tender the requirement that the images should not be post-processed in any way. No sharpening, no brightening, no enhancement. We wanted the original scans from the microfilm scanner. The reason for this was that we can always do post-processing on the images for presentation purposes, but once you post-process an image, you lose information that cannot be regained – you can’t “unsharpen” an image and get the original back. We had assumed this would be one of our more easily met requirements. After all, we were asking contractors not to do a task, rather than to perform one.
However, ensuring that images are not post-processed was a difficult task on its own. First there was the problem of communicating it at all. Scanning companies have great expertise in adjusting images for the best possible experience, and now we asked them not to do that. At first this completely disrupted communication, because our starting points were so different. Then there was the problem that some of the post-processing was done by the scanner software, and the contractor had no idea how to turn it off. Once again, it took three-way communication between the scanner manufacturer, our organization, and the contractor before we found a way to get the scanner to deliver the original images without post-processing.
The crucial point in both these examples is that we would not even have noticed all of this, if we hadn’t had a competent, dedicated expert in our organization, analyzing the images and finding the artifacts of lossy compression and post-processing. And in our case we only had that by sheer luck. We had not scheduled any time for this analysis or dedicated an expert to the job. We had drawn on expertise like this when writing the tender, so the requirements were clear and documented, and we had expected the contractor to honor these requirements as written. It was no one’s job to make sure they did.
However, one employee stepped up on his own initiative. He is an autodidact image expert, who originally was not assigned to the project at all. He took a look at the images and started pointing out the various signs of post-processing. He wrote analysis tools and went out of his way to communicate to the rest of the digitization project how artifacts could be seen and how histograms could expose the signs of post-processing. It is doubtful whether we would ever have had the quality of images we are getting from this project, if it had not been for his initiative.
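To give an idea of how a histogram can expose post-processing: when the gray levels of an 8-bit image are stretched, for example by brightening or contrast enhancement, the result typically has regularly spaced empty levels, a "comb" pattern that an untouched scan does not show. The following is a minimal sketch of that one idea, not the tools our colleague wrote. The function names are invented for this example, and it assumes the pixels are available as a flat list of integers from 0 to 255.

```python
def histogram(pixels, levels=256):
    """Count how many pixels have each gray level."""
    counts = [0] * levels
    for value in pixels:
        counts[value] += 1
    return counts


def empty_level_fraction(counts):
    """Fraction of unused gray levels between the darkest and brightest used level.

    Close to 0 for a natural scan; a high value suggests the levels were
    stretched or otherwise remapped after scanning.
    """
    used = [level for level, n in enumerate(counts) if n > 0]
    if len(used) < 2:
        return 0.0
    inner = counts[used[0]:used[-1] + 1]
    return sum(1 for n in inner if n == 0) / len(inner)


# Synthetic data: 100 consecutive gray levels, then the same data with
# its contrast doubled, which leaves every other level empty.
original = list(range(50, 150))
stretched = [(v - 50) * 2 for v in original]
print(empty_level_fraction(histogram(original)))
print(empty_level_fraction(histogram(stretched)))
```

A real image has far more pixels than gray levels, so gaps in the middle of the histogram are rare unless something has remapped the values.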
Lesson 2: Your requirements are never as clear as you think they are
This one is really a no-brainer and did not come as a surprise for us, but it bears repeating.
Assuming you can write something in a tender and then have it delivered as described is an illusion. You really need to discuss and explain each requirement to your contractor, if you want a common understanding. And even then you should expect to have to clarify at any point during the process.
Also, in a large-scale digitization project, your source material is not as well-known as you think it is. You will find exceptions, faults and human errors that cause the source material to vary from the specifications.
Make sure you keep communication open with the contractor to clarify such issues. And make sure you have resources available to handle that communication.
Examples can be trivial – we had cases where metadata documents were delivered with the example text from the templates in our specifications, instead of the actual values they should contain. But they can also be much more complex – for instance we asked our contractors to record the section title of our newspapers in metadata. But how do you tell an Indian operator where to find a section title in a Danish newspaper?
Examples can also be the other way round. Sometimes your requirements propose a poorer solution than what the contractor can provide. We had our contractors suggest a better solution for recording metadata for target test scans. Be open to suggestions from your contractor; in some cases they know the game better than you do.
Lesson 3: Do not underestimate the resources required for QA
Doing a large-scale digitization project probably means you don’t have time to look at all the output you get from your contractor. The solution is fairly obvious when you work in an IT department: Let the computer do it for you. We planned a pretty elaborate automatic QA system, which would check data and metadata for conformity to specifications and internal consistency. We also put into our contract that this automatic QA system should be run by the contractor as well to check their data before delivery.
This turned out to be a much larger task than we had anticipated. While the requirements are simple enough, there is so much grunt work to do that it takes a lot of resources to make a complete check of the entire specification. Communicating with the contractor about getting the tool to run and interpreting the results is an important part of getting value from the automatic QA tool. We have found that assumptions about technical platforms, input and output, and even communicating output of failed automatic QA are things that should not be underestimated.
However, the value of this has been very high. It has served to clarify requirements in both our own organization and with our contractor, and it has given us a sound basis for accepting the data from our contractor.
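To make the kind of check concrete, here is a small sketch of two of the simplest tests such a tool can run on a metadata record: that required fields are filled in, and that no field still holds the example text from the specification template (a mistake we did see in deliveries). This is an illustration under assumptions, not our actual QA system; `check_metadata` and the field names are hypothetical, and records are assumed to be plain dictionaries of strings.

```python
def check_metadata(record, required_fields, template_examples):
    """Return a list of human-readable problems found in one metadata record.

    record:            field name -> delivered value
    required_fields:   field names that must be present and non-empty
    template_examples: field name -> example text from the specification
    """
    problems = []
    for field in required_fields:
        value = (record.get(field) or "").strip()
        if not value:
            problems.append(f"{field}: missing or empty")
        elif value == template_examples.get(field, "").strip():
            problems.append(f"{field}: still contains the template example text")
    return problems


# Hypothetical template and delivered record.
template = {"section_title": "Section title goes here", "date": "YYYY-MM-DD"}
record = {"section_title": "Section title goes here", "date": "1923-04-17"}
for problem in check_metadata(record, ["section_title", "date", "page_number"], template):
    print(problem)
```

Returning readable messages rather than a bare pass/fail matters more than it might seem, since the contractor has to be able to act on the output without asking what it means.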
In other, smaller digitization projects, we have sometimes tried to avoid doing a thorough automatic QA check. Our experience in these cases is that this has simply postponed the discovery of mistakes that could have been detected automatically until our manual QA spot checks. The net effect is that the time spent on manual QA and on requesting redeliveries has greatly increased. So our recommendation is to do thorough automatic QA, but also to expect this to be a substantial task.
Even thorough automatic QA does not replace the need for a manual QA process, but since you don’t have time to check every file manually, you will need to do a spot check. Our strategy in this case has been twofold. First, we take a random sample of images to check, giving us a statistical model that allows us to make statements about the probability of undiscovered mistakes. Second, we add to this list the images that an automatic analysis tool marks as suspicious – for instance very dark images, unexpected (that is: possibly wrong) metadata, or very low OCR success rates.
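As a rough illustration of the statistics involved (not our actual model): if a fraction p of the pages were defective, a random sample of n pages would contain no defects with probability (1 - p)^n. Turning that around gives the sample size needed so that a clean sample supports a claim like "with 95% confidence, fewer than 1% of pages are defective". The sketch assumes independent random draws from a batch much larger than the sample, and the function names are invented for this example.

```python
import math
import random


def sample_size(max_defect_rate, confidence=0.95):
    """Pages to sample so that a defect rate of max_defect_rate or worse
    would produce at least one defective page in the sample with the given
    probability. Assumes independent draws from a much larger batch."""
    if not 0 < max_defect_rate < 1 or not 0 < confidence < 1:
        raise ValueError("rates must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_defect_rate))


def build_checklist(page_ids, suspicious_ids, n, seed=None):
    """Pick n random pages, then list the suspicious pages not already picked.

    The two lists are kept apart because only the random part supports
    statistical statements about the whole batch.
    """
    rng = random.Random(seed)
    sampled = rng.sample(page_ids, min(n, len(page_ids)))
    already_picked = set(sampled)
    extra = [p for p in suspicious_ids if p not in already_picked]
    return sampled, extra


n = sample_size(0.01, 0.95)
pages = [f"page-{i:05d}" for i in range(30_000)]   # hypothetical page IDs
sampled, extra = build_checklist(pages, ["page-00007", "page-12345"], n, seed=1)
print(n, len(sampled), len(extra))
```

Note that the pages flagged as suspicious must not be counted in the statistical statement, since they were not chosen at random.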
We have had our contractor build a workflow system for doing the actual manual QA process for us. Given the input of random and suspect pages, the system presents them in an interface where a QA tester can approve images, or reject them with a reason. A supervisor will then use the output from the testers to confirm the reception of the data, or request a redelivery with some data corrected.
Even though the contractor builds our manual QA interface, we still need to integrate with this system, and the resources required for this should not be underestimated. We opted to have the tool installed in our location, to ensure the data checked in the manual QA spot check was in fact the data that was delivered to us. If the manual QA spot check had been done at the contractor, in theory the data could have been altered after the manual QA spot check and before delivery. Communication concerning installation of the system and providing access to images for manual QA spot check also turned out to be time consuming.
In conclusion, in a large-scale digitization project, QA is a substantial part of the project, and must be expected to require considerable resources.
Lesson 4: Expect a lot of time to elapse before first approved batch
This lesson may be a corollary of the previous three, but it seems to be one that needs to be learned time and time again.
When doing time-lines for a digitization project, you always have a tendency to expect everything to go smoothly. We had made that assumption once again in this project, and as we should have expected, it didn’t happen.
Nothing went wrong as such, but during planning we simply didn’t take into account the time it takes to communicate about requirements. So when we received the first pilot batch, our time-line said we would go into production soon after. This, of course, did not happen. What happened was that the communication process about what needed to be changed (in the specification or in the data) started. After this communication process had been completed, it took a while before new data could be delivered. And then the cycle started again.
Our newly revised plan has no final deadline. Instead it has deadlines on each cycle, until we approve the first batch. We expect this to take some time. The plan says we allow three weeks for the first cycle; then, when problems seem less substantial, we reduce the cycle to two weeks. Finally we go to one-week cycles for more trivial issues. And once we have finally approved the batch, we can go into production. Obviously, this pushes our original deadline back by months, but this is really how our plan should have been designed from the very beginning. So make sure your plans allow time to work out the kinks and approve the workflow before you plan on going into production.
Lesson 5: Everything and nothing you learned from small-scale digitization projects still applies
Running small-scale digitization projects is a good way to prepare you for handling a large-scale digitization project. You learn a lot about writing tenders, communicating with contractors, what doing QA and ingest entails, how you evaluate scan results etc. It is definitely recommended to do several small-scale digitization projects before you go large-scale.
But a lot of the things you learned in small-scale digitization projects turn out not to apply when you go large-scale.
We are digitizing 32 million newspaper pages in three years. That means that every single day, we need to be able to receive 30,000 pages. With each page being roughly 15 MB, that’s close to half a terabyte a day. Suddenly a lot of resources usually taken for granted need to be re-evaluated. Do we have enough storage space just to receive the data? Do we have enough bandwidth? Can our in-house routers keep up with managing the data? How long will it actually take to run characterization of all these files? Can we keep up? What happens if we are delayed a day or even a week?
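The arithmetic behind these figures is worth spelling out, because it also gives a first answer to the bandwidth question. The sketch below uses the numbers from the text and adds two assumptions of its own: deliveries are spread evenly over all 365 days of the year, and units are decimal (1 TB = 1,000,000 MB).

```python
TOTAL_PAGES = 32_000_000   # from the project scope
YEARS = 3
MB_PER_PAGE = 15           # rough average page size


def daily_load(total_pages=TOTAL_PAGES, years=YEARS,
               mb_per_page=MB_PER_PAGE, days_per_year=365):
    """Return (pages per day, TB per day, sustained Mbit/s).

    Assumes an even flow of data around the clock; real deliveries come in
    batches, so peak rates will be higher than the sustained figure.
    """
    pages_per_day = total_pages / (years * days_per_year)
    mb_per_day = pages_per_day * mb_per_page
    tb_per_day = mb_per_day / 1_000_000
    mbit_per_s = mb_per_day * 8 / 86_400   # 86,400 seconds in a day
    return pages_per_day, tb_per_day, mbit_per_s


pages, terabytes, mbit = daily_load()
print(f"{pages:,.0f} pages/day, {terabytes:.2f} TB/day, {mbit:.1f} Mbit/s sustained")
```

Rerunning the function with fewer delivery days per year, or after a week's delay that has to be caught up, shows quickly how little slack there is.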
Also, in small digitization projects, manually handling minor issues is feasible. Even doing a manual check of the delivered data is feasible. In this case, if you wanted to check everything, a full-time employee would only be able to spend about two thirds of a second per page to keep up. So you really need to accept that you cannot manually do anything on the whole project. Spot checks and automation are crucial.
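The two-thirds-of-a-second figure can be reproduced with a short calculation. The working-time numbers here are assumptions chosen for illustration (250 working days a year, 8 hours a day); other reasonable choices give a somewhat different but equally hopeless answer.

```python
def seconds_per_page(total_pages=32_000_000, years=3,
                     working_days_per_year=250, hours_per_day=8):
    """Time one full-time employee could spend on each page if every page
    had to be looked at within the project period."""
    working_seconds = years * working_days_per_year * hours_per_day * 3600
    return working_seconds / total_pages


print(f"{seconds_per_page():.3f} seconds per page")
```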
This also means that the only ones who will ever see every page of your digitization project will probably be your contractor. Plan carefully what you want from them, because you probably only have this one chance to get it. If you want anyone to read anything printed on the newspaper page, now is the time to specify it. If you want anything recorded about the visual quality of the page, now is the time.
Another point is that you need to be very careful what you accept as okay. If you accept something sub-optimal because it can always be rectified later, “later” will probably turn out to be “never”. This needs to be taken into account every time a decision is made that affects the entire digitization output.
Every kind of project has its own gotchas and kinks. Large-scale digitization projects are no exception.
Listed above are some of our most important lessons learned so far, as seen from the perspective of a technical lead primarily working with receiving the files. This is only one small part of the mass digitization project, and other lessons are learned in different parts of the organization.
I hope these lessons will be of use to other people – and even to ourselves next time we embark on a mass digitization adventure. It has been an exhilarating ride so far, and it has taken a lot of effort to get a good result. The next step is giving our users access to all this newspaper gold. Let the new challenges begin!