
The Universal Digital Collections Cookbook

Updated on November 14, 2016

Jack is a volunteer at the CCNY Archives. Before retiring, he worked at IBM for over 28 years. His articles have over 120,000 views.


How does one go about creating a universal digital collection? My hub is the result of over 20 years of experience dealing with museums, libraries, and archives around the world. There were many lessons learned and failed initiatives, but they were not in vain. Our experience has taught us the way forward. Technology has improved to the point where it is viable to create the ultimate collection, one that will be everlasting.

- February 2016

Public Domain

The information I've provided here is in the public domain. I am not claiming any copyrights or inventions. It is based on the body of knowledge I have gained over the past 20 years. My intent is to help organizations large and small get a better handle on the monumental tasks of preservation, archiving, and content management. I welcome any feedback.

Obsolete Media

The Cloud Servers


In my 20-plus years working with large institutions, the single most important thing I've learned is that haste makes waste. In trying to achieve something quickly, the effort and resources spent end up being mostly wasted. In order to achieve something long lasting (by that I mean on the order of hundreds of years), one needs to take the time and implement wisely, so that no work has to be duplicated sometime later. I've been involved with numerous pilot projects where the idea is to try something new on a limited basis. If it proves to be acceptable, it is adopted.

There is a learning curve for any newcomer. The problems of archiving and preservation have been around for centuries. Traditional libraries and museums are very suspicious of new technology, and rightly so. If a technology is only good for a few decades, that is unacceptable. Such has been true of digital media like CD-ROM, DAT tape, and DVD. The difference now is that the cost and density of storage have reached the point where cost is almost negligible. With "cloud" servers, the goal of a truly permanent storage medium is at hand.

Design Goals

The main purpose of any collection is preservation. No matter what objects we are concerned with (paintings, letters, artifacts, communications...), what is the best way to preserve a collection?

A doomsday scenario. One way to think about preservation is to imagine the worst case. If a nuclear war wipes out a significant part of our planet, or an asteroid strikes, how will a collection survive? Believe it or not, there are systems in place today to address that very case. The most important and sensitive documents are digitized, and copies are stored on different continents, buried deep in mountains. In addition, physical printouts of the binary files on paper are stored as a backup for the case when all electronic means are no longer available. These people have concluded that the most reliable form of preservation is still binary code printed on paper. Think about it: after years of technological advances aimed at the paperless office, the simplest and most reliable medium is 1's and 0's on paper.

Most collections do not require that extreme treatment. What are the goals of a typical collection?

  • long-term preservation
  • accuracy of reproduction
  • full indexing
  • secure storage and backup
  • security and copyright protection
  • easy and quick access
  • finding aids
  • an easy-to-use interface
  • a simple workflow
  • cost effectiveness
  • a modular approach

Universal Module Approach

The whole task can be broken down into individual modules.

  • Capture
  • Index
  • Associations
  • Storage (cloud server, batch and single)
  • Search (finding aids...)
  • Retrieval (web access, read only and R/W)
  • Security (rights management, data encryption)
  • Image Processing
  • Web design (front end appeal)
  • Workflow
  • Intelligent Agent (cache)
  • Standards (File formats, Industry standards)
  • Backup (automated)
  • Physical inventory and management (Long term storage)
  • Restoration (as needed)
  • Audit trail

A Framework For Implementation

The nice thing about a modular approach is that you need not implement everything at once. The framework is what is important. Once it is established, the pieces can be assembled at your own pace.

Coming up with the proper framework depends on many factors, such as the type of content, the condition of the content, the value of the content, the sensitivity of the content, etc.

Funding and Human Resources

As with any project of this magnitude, the funding source must be addressed up front. There is no point in starting a long term project without full financial commitment. The amount required depends on the scope of the project and the time frame required. Once a project is put in motion, there are a few options as to allocating resources to get the job done.

For example, the prepping and scanning may be outsourced to a service bureau that can do the task faster and cheaper.

A minimal staff of people with an assortment of skills is required. Teamwork is a key component of success.

Getting the right equipment to match the tasks, and purchasing the right software that will be long lasting, are also part of this planning.

Prioritize Contents

Another key decision before embarking on a project is to prioritize your content. What material will be processed first? It is not possible to do everything at once. Certain types of content (perhaps the most important) should be done first. Within a content type, the same applies: you may choose to do only a subset of that content type before moving on to another type.

Model After the Human Brain (A Detour)

If you think about it, our human brain is the ultimate archive. It contains all the information of our whole life experiences. It is accessible via our memory though some are better than others. It is compact and it is accessible via multiple triggers (our senses). It is organized in a very sophisticated form that we are just beginning to understand. It is dynamic and changing. It has built in redundancies in case of disaster such as trauma injuries. It is the perfect archive.

We can try to model a collection after our own brain and take advantage of what is known. Our brain works on intuition more than on exact data. It relates many attributes to create a "memory packet" that can be easily recalled via various triggers. That is because we have created numerous neuron connections to the same event. Instead of recalling only by a name, we can recall by a smell, a sensation, a color, a sound, etc.

In our own daily activities, if we are looking for a document in our file system, we usually think of a name or a category to narrow our search. We alphabetize by name so that we can locate a document easily and quickly. That works fine if the document we are seeking is indexed properly or filed in the right folder. If we are looking for a specific document that is filed in a different folder, that is much harder. Fortunately, our brain can do some analyzing and guess where that document may be. Technology can help, too. We can digitize the files into searchable PDF files. Once we have all the files in place, we can search on a particular name or phrase, which will pull up all occurrences of that term. This will help locate the document, assuming you can find a term unique enough within that collection.
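To make the idea concrete, here is a minimal sketch of a cross-collection term search. It assumes (my assumption, for illustration only) that the text of each searchable PDF has also been exported to a plain-text sidecar file; searching inside PDF files directly would require a PDF library.

```python
import os

def search_collection(root, term):
    """Walk a folder tree of extracted text files and return every
    file whose contents contain the search term (case-insensitive)."""
    term = term.lower()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                if term in f.read().lower():
                    hits.append(path)
    return hits
```

The same "find every occurrence of a unique term" behavior is what a content management system's full-text search provides out of the box.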

Workflow (A Sample Implementation)

Let me take you through a simple workflow for a specific collection - say paper documents in 8.5 x 11 letter format. For this discussion, let's assume we are talking about personnel records. Assume you have a file cabinet with folders of documents stored in alphabetic order by last name, first name. Each folder contains an assortment of paper records for a given employee. Being personnel records, they are sensitive and require access limitations.

Here is the workflow and programs to convert them to a digital archive. (Items in bold are specific hardware/software recommendations).

1. Document Prep Step - All folders must be prepped first to remove any staples or post-it notes, with any tears repaired as necessary. Each folder is prepped from the first page to the last.

2. Document Scanner - The folders are put through a document scanner such as a Kodak desktop scanner with CapturePro. The scanner should be production grade with reasonable speed and able to scan duplex.

3. Quality Assurance - After scanning to TIF format files, each page is examined by a Q/A operator using a standard software program such as Microsoft Office Document Imaging. The Q/A operator is required to check the quality of the scan, delete blank pages, rotate pages as needed, and identify any defects.

4. The Scan Operator will perform any corrections as needed.

5. A batch operation is performed on a set of files (perhaps all scanned during one day) to convert the TIF files to searchable PDF files. Adobe Acrobat Pro is a good program to use. This operation may take a while to process and may run overnight.

6. An index operator creates a CSV file containing the name (index field) of each folder along with the file name of the PDF file. Utilities exist that will help with this, such as ImagEntry. Such a utility ensures accuracy of data entry by creating a double-blind process.

7. A good content management system for documents is Cabinet SAFE. This product is easy to use and contains various access protection mechanisms and encryption of data. It also allows web clients to log in to access the data anywhere.

8. There is an option to batch process a set of documents and import them into this system. An administrator can set up users with various access privileges and assign login passwords.

9. Once completed, one can log in as a user and verify that all the records were imported properly. Users also have various search capabilities to help locate the folders. Besides searching by the index info, there is the option to search across the whole set of records to find a particular term. The system will return all the occurrences of that term.

10. Finally, the system has audit trail capabilities that can generate reports on access activities by user.
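Step 6's CSV index and double-blind verification can be sketched as follows. The function names and field layout here are my own illustration, not ImagEntry's actual interface; the point is that two operators key the same data independently, and only the rows where the two passes disagree need to be rechecked.

```python
import csv

def write_index(rows, path):
    """Write the folder index as a CSV file: one row per personnel
    folder, mapping the name fields to the PDF produced for it."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["last_name", "first_name", "pdf_file"])
        writer.writerows(rows)

def double_entry_mismatches(pass1, pass2):
    """Compare two independent keying passes (double-blind entry) and
    return the row positions where the two operators disagree."""
    return [i for i, (a, b) in enumerate(zip(pass1, pass2)) if a != b]
```

Because only the mismatched rows are re-examined, the double-blind scheme catches typos at a fraction of the cost of proofreading every entry.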

It is desirable to keep all original content in storage as long as possible. There are data storage services such as Iron Mountain that will store boxes of content and track them with bar codes.

Also, besides the simple data entry, there are several options to assist with indexing. You can use a cover sheet that includes the index fields and let the Capture software pick them up at scan time or alternatively, you can choose to use a barcode scheme to create the index.

Document Prep (Sidebar)

This is an important step not to be overlooked or belittled. As with any assortment of documents, you will find there are exceptions. There may be over-sized paper, cardboard, punched holes, coffee stains, assorted clips, handwritten notes, etc. The document prepper must handle all these cases and fix them in a way that allows easy scanning and does not cause damage to the equipment or paper jams. It is tedious work and requires paying attention to details. It is not a high-paying job, and yet it is essential to creating a quality end product.

Depending on the source material and how it was stored, there are cases where the original content may have been damaged by water, mold, or fungus. Prepping such content requires special skills, gloves and masks, and patience.

Rule Of Thumb

For paper documents, a typical box will contain 2,000 pages on average. A pallet will hold 32 boxes. A typical service bureau will charge 10-12 cents per page. This averages $200 per box and $6,400 per pallet. In deciding whether to outsource or do the work in house, this is a good rule of thumb. A service bureau will be able to turn around 32 boxes in about 2 weeks.
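These rule-of-thumb numbers are easy to turn into a quick estimator. This is only a sketch for budgeting; an actual service bureau quote will vary with page condition, format, and volume.

```python
# Rule-of-thumb averages from the text.
PAGES_PER_BOX = 2000
BOXES_PER_PALLET = 32

def outsourcing_cost(boxes, cents_per_page=10):
    """Estimate the service-bureau cost in dollars for a given
    number of boxes at the quoted per-page rate."""
    pages = boxes * PAGES_PER_BOX
    return pages * cents_per_page / 100

# One box at 10 cents/page comes to $200; a full pallet of 32 boxes, $6,400.
```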

If you have only a few boxes, the decision is simple: do it in house. If you have 32 boxes or more, go with a service bureau. They are geared to handle large volumes, and they employ trained workers who are detail oriented. It is cost efficient and a win-win.

Sound Practices

  • Keep the original if at all possible.
  • Digitize the content at the highest quality possible.
  • Quality and accuracy should be emphasized throughout the process.
  • A simple workflow is best.
  • Adopt standards.
  • Ease of use should be a design goal.
  • Security must not be an afterthought.
  • Don't put your trust and assets in one basket.
  • Pay attention to details.
  • Never duplicate work.
  • Teamwork allows for redundancy.
  • Hire people who care and have good work ethics.
  • Use gloves when handling original documents, photos, or film.
  • No food or drinks allowed in the work area.
  • Use pencil only (no ink pens).
  • Store originals in a cool and dry place and, most importantly, above ground.
  • Use archival-quality folders (store documents vertically).

A Photo Archive Workflow (Example)

As another example, I am currently a part-time volunteer at the local archives in Westchester County. They have a large collection of photographs going back almost 100 years. A simple workflow to capture them is as follows:

1. Use an Epson Expression Tabletop Scanner to capture originals at 400ppi with 24 bits color and store as TIF format files to a folder \Archive\

2. Create a logical naming convention for the collection. Each file must have a unique file name. It could be as simple as ABCDnnnn, where nnnn is a numeric sequence. This particular scheme fits with the Epson scanner software, which can be configured to scan using this type of scheme and will save time during production scanning.

3. Sort the photos by size and orientation. Make cardboard jigs to the sizes of the photos. Scan the photos in sequence.

4. After each scan, open the image file with PhotoShop to verify the scan and make sure the orientation is proper. Save the image to a different folder named \JPG\ and select the highest compression setting.

5. You can optionally use Microsoft Access to create a simple database containing the name of the photo and some related index information such as photographer, year, attributes etc. and a link to the JPG file. This file can be exported to a CSV file for future imports to other Content Management system.
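The ABCDnnnn naming convention from step 2 can be sketched as a small helper. This is illustrative only (the Epson software generates the names itself during scanning), but the same scheme is handy for pre-building an index file or verifying that no numbers were skipped.

```python
def make_filenames(prefix, count, start=1, width=4, ext="tif"):
    """Generate sequential archive file names of the form
    <prefix><zero-padded number>.<ext>, e.g. ABCD0001.tif."""
    return [f"{prefix}{start + i:0{width}d}.{ext}" for i in range(count)]
```

A zero-padded fixed width keeps the files in correct order under a plain alphabetic sort, which matters once the collection grows past 9 or 99 items.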

This simple workflow can produce a large amount of content for archival purposes. Given current hardware, one can process up to 100 photos per 8-hour work day, which amounts to about 25,000 per year.
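The throughput estimate works out as follows, assuming roughly 250 working days per year:

```python
def annual_throughput(items_per_day, workdays_per_year=250):
    """Project yearly output from a daily scanning rate."""
    return items_per_day * workdays_per_year

# 100 photos per 8-hour day over ~250 workdays = 25,000 photos per year.
```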


This is my contribution to help promote good preservation practices. There are many aspects to this topic, but none more important than getting the process right. If the process is "right", all other concerns or issues will fall into place. Don't be afraid of the unknowns. Many institutions, out of fear or suspicion, decide to do nothing or wait until the next breakthrough. The problem is, the longer one waits, the higher the probability of disaster. As with any original content, a fire, flood, or tornado may be a death sentence: content lost forever.

This cookbook is the starting point. It is a guide to help all organizations to think about what they value most and start marching in the direction that will bring them security and peace of mind. Remember, it is OK to ask for help and guidance and seek out a second opinion.

Let me know by way of comments and feedback if you find this helpful. Thanks for reading.

© 2016 Jack Lee

