Arts Autos Books Business Education Entertainment Family Fashion Food Games Gender Health Holidays Home HubPages Personal Finance Pets Politics Religion Sports Technology Travel

The Universal Digital Collections Cookbook

Updated on November 14, 2016

Jack Lee

Before retiring, Jack worked at IBM for over 28 years. His articles have over 120,000 views.

Contact Author

Introduction

How does one go about creating an universal digital collection? My hub is a result of over 20 years of experience dealing with museums, libraries and archives around the world. There were many lessons learned and failed initiatives but they were not in vain. Our experience has taught us the way forward. Technology has improved to a point where it is viable to create the ultimate collection that will be ever lasting.

- February 2016

Public Domain

The information I've provided here is classified under public domain. I am not claiming any copyrights or inventions. It is based on my body of knowledge gained over the past 20 years. My intent is to help all organizations large and small to get a better handle on the monumental task of preservation and archive and content management. I welcome any feedback.

Obsolete Media

The Cloud Servers

Background

In my 20 plus years working with large institutions, the one most important thing I've learned is that haste makes waste. In trying to achieve something quick, the efforts and resources spent end up being mostly wasted. In order to achieve something long lasting (by that I mean in the order of hundreds of years), one need to take the time and implement wisely such that any work would not be duplicated sometimes later. I've been involved with numerous pilot projects where the idea is to try something new on a limited basis. If it is proven to be acceptable, it would be adopted.

There is a learning curve for any newcomer. The problems of archive and preservation have been around for centuries. Traditional libraries and museums are very suspicious of new technology and rightly so. If the technology is only good for a few decades, that is unacceptable. Such are true of digital media such as CDROM, DAT tape and DVD. The difference now is the cost and density of storage have reach the point where it is almost negligent. With the "cloud" servers, the goal of a truly permanent media storage is at hand.

Design Goals

The main purpose of any collection is preservation. No matter what objects we are concerned with (painting, letters, artifacts, communications...), what is the best way to preserve a collection?

A doomsday scenario. The one way to think about preservation is to imagine the worst case scenario. What if a nuclear war wipes out a significant part of our planet or an asteroid strikes, how will a collection survive? Believe it or not, there are systems in place today to address that very case. The most important and sensitive documents are digitized and copies stored on different continents buried deep in mountains. In addition, there are physical printouts of binary files on paper that are stored as backup for the case when all electronic means are no longer available. These people have concluded that the most reliable form of preservation is still binary codes on printed paper. Think about it, after years of technological advances, trying to reach the paperless office, the simplest and most reliable source is 1's and 0's on paper.

Most collections do not require that extreme treatment. What are the goals of a typical collection?

long term preservation
accuracy of reproduction
full indexing
secure storage and backup
security and copyrights
easy and quick access
finding aids
ease of use interface
simple workflow
be cost effective
modular approach

Universal Module Approach

The whole task can be broken down into individual modules.

Capture
Index
Associations
Storage (cloud server, batch and single)
Search (finding aids...)
Retrieval (web access, read only and R/W)
Security (rights management, data encryption))
Image Processing
Web design (front end appeal)
Workflow
Intelligent Agent (cache)
Standards (File formats, Industry standards)
Backup (automated)
Physical inventory and management (Long term storage)
Restoration (as needed)
Audit trail

A Framework For Implementation

The nice thing about a modular approach is that you need not to implement all at once. The framework is what is important. Once it is established, the pieces can be assembled at your own pace.

Coming up with the proper framework depends on many factors. Such as the type of content, the condition of content, the the value of content, the sensitivity of content etc.

Funding and Human Resources

As with any project of this magnitude, the funding source must be addressed up front. There is no point in starting a long term project without full financial commitment. The amount required depends on the scope of the project and the time frame required. Once a project is put in motion, there are a few options as to allocating resources to get the job done.

For example, the prepping and scanning may be outsourced to a Service Bureau who can do the task faster and cheaper.

A minimal staff of people with an assortment of skills are required. Teamwork is a key component to success.

Getting the right equipment to match the tasks and purchasing the right software that will be long lasting.

Prioritize Contents

Another key decision before embarking on a project is to prioritize your content. What material will be processed first? It is not possible to do everything together. There should be a priority where certain type of content (perhaps most important) be done first. Within the type of content, the same applies. You may choose to do only a subset of that content type before moving to another type.

Model After the Human Brain (A Detour)

If you think about it, our human brain is the ultimate archive. It contains all the information of our whole life experiences. It is accessible via our memory though some are better than others. It is compact and it is accessible via multiple triggers (our senses). It is organized in a very sophisticated form that we are just beginning to understand. It is dynamic and changing. It has built in redundancies in case of disaster such as trauma injuries. It is the perfect archive.

We can try and model a collection after our own brain and take advantage of what is known. Our brain works on intuition more than on exact data. Our brain relates many attributes to create a "memory packet" that can be easily recalled via various triggers. That is because we have created numerous neuron connections to the same event. Instead of recall only by a name, we can recall by a smell or a sensation or a color or a sound etc.

In our own daily activities, if we are looking for a document in our file system, we usually think of a name or a category to narrow our search. We alphabetize the name so that we can locate it easily and quickly. That works fine if the document we are seeking is indexed properly or filed in the right folder. If we are looking for a specific document that are filed in a different folder, that is much harder. Fortunately, our brain can do some analyzing and guess as to where that document may be. Technology can help also. We can digitize the files into searchable PDF files. Once we have all the files in place, we can search on a particular name or phrase that will pull up all occurrences of that term. This will help locate the document assuming you can fine a term unique enough to that collection.

Workflow (A Sample Implementation)

Let me take you through a simple workflow for a specific collection - say paper documents (8.5 x 11) letter format. For this discussion, let's assume we are talking about Personnel records. Assume you have a file cabinet with folders of documents stored in alphabetic order by Last name, First Name. Each folder contains an assortment of paper records for a given employee. Being personnel records, they are sensitive and must require access limitations.

Here is the workflow and programs to convert them to a digital archive. (Items in bold are specific hardware/software recommendations).

1. Document Prep Step - All folders must be prepped first to remove any staples or post-it notes and repaired as necessary any tears etc. Each folder will be prepped from fist page to the last page.

2. Document Scanner - The folders are put through a document scanner such as a Kodak desktop scanner with CapturePro. The scanner should be production grade with reasonable speed and able to scan duplex.

3. Quality Assurance - After scanning to a TIF format file, each page is examined by a Q/A operator using a standard software program such as Microsoft Office Document Imaging. The Q/A operator will be required to check for quality of the scan, delete blank pages, rotate page as needed and identify any defects.

4. The Scan Operator will perform any corrections as needed.

5. A batch operation will be performed on a set of files (perhaps all scanned during one day) and convert all TIF files to a searchable PDF file. The Adobe Acrobat Pro is a good program to use. This operation may take a while to process and may run overnight.

6. An index Operator will create a CSV file containing the Name (index field) of each folder along with the file name of the PDF file. There exists utilities that will help with this such as ImagEntry. This utility will insure accuracy of data entry by creating a double blind process.

7. A good Content Management System for documents is Cabinet SAFE. This product is easy to use and contains various access protection mechanism and encryption of data. It will also allow web clients to login to access the data anywhere.

8. There is an option to batch process a set of documents and import them into this system. An administrator can setup users based on various access privileges and assign login password.

9. Once completed, one can login as a user and verify that all the records were imported properly. The users also have various search capability to help locate the folders. Besides searching by the index info, there is the option to search across the whole set of records to find a particular term. The system will return all the occurrences of that term.

10. Finally, the system has an audit trail capabilities that can generate reports on access activities based on users.

It is desirable to keep all original content in storage as long as possible. There are data storage services such as Iron Mountain that will store boxes of content and track them with bar codes.

Also, besides the simple data entry, there are several options to assist with indexing. You can use a cover sheet that includes the index fields and let the Capture software pick them up at scan time or alternatively, you can choose to use a barcode scheme to create the index.

Document Prep (Sidebar)

This is an important step not to be overlooked or belittled. As with any assortment of documents, you will find there are exceptions. There may be over-sized paper, card boards, punched holes, coffee stains and assorted clips and hand written notes etc. The document prepper must handle all these encounters and fix them in a way that will allow easy scanning and not cause damage to the equipment or paper jams. It is tedious work and requires paying attention to details. It is not a high paying job and yet it is essential to creating a quality end product.

Depending on the source material and how they were stored, there are cases where the original content may have been damaged by water or mold or fungus. Prepping of these content require special skills and wearing gloves and masks and patience.

Rule Of Thumb

For paper documents, a typical box will contain 2000 pages on average. A pallet will hold 32 boxes. Typical Service Bureau will charge 10-12 cents per page. This average $200 per box and $6400 per pallet. In deciding whether to outsource or do in house, this is a good rule of thumb. A service bureau will be able to turn around 32 boxes in about 2 weeks.

If you have only a few boxes, the decision is simple. Do it in house. If you have 32 boxes or more, go with a service bureau. They are gears to handle large volumes and they employ trained workers that are detail oriented. It is cost efficient and a win win.

Sound Practices

Keep the original if at all possible.
Digitize the content at the highest quality possible.
Quality and accuracy should be emphasized through out the process.
Simple workflow is best.
Adopt standards.
Ease of use should be a design goal.
Security must not be an after thought.
Don't put your trust and assets in one basket.
Pay attention to details.
Never duplicate work.
Teamwork allows for redundancy.
Hire people that cares and have the good work ethics.
Use gloves when handling original documents, photos or film.
No food or drinks allowed in the work area.
Use pencil only (no ink pens)
Store originals in a cool and dry place and most importantly above ground.
Use Archival quality folders (store documents vertically)

A Photo Archive Workflow (Example)

As another example, I am currently a P/T volunteer at the local Archives in Westchester County. They have a large collection of photographs going back almost 100 years. A simple workflow to capture them are as follows:

1. Use an Epson Expression Tabletop Scanner to capture originals at 400ppi with 24 bits color and store as TIF format files to a folder \Archive\

2. Create a logical naming convention for the collection. Each file must have a unique file name. It could be as simple as ABCDnnnn where nnnn is a sequence of numeric numbers. This particular scheme fits with the Epson scanner software. It is configured to scan using this type of scheme and will save time during production scanning.

3. Sort the photos by size and orientation. Make cardboard jigs to the sizes of the photos. Scan the photos in sequence.

4. After each scan, open the image file with PhotoShop to verify the scan and make sure the orientation is proper. Save the image to a different folder named \JPG\ and select the highest compression setting.

5. You can optionally use Microsoft Access to create a simple database containing the name of the photo and some related index information such as photographer, year, attributes etc. and a link to the JPG file. This file can be exported to a CSV file for future imports to other Content Management system.

This simple workflow can produce a large amount of content for archival purposes. Given current hardware, one can process up to 100 photos per 8 hour work day which amount to 25,000 per year.

Summary

This is my contribution to help promote good preservation practices. There are many aspects to this topic but none more important then getting the process right. If the process is "right" all other concerns or issues will fall in place. Don't be afraid of the unknowns. Many institutions, out of fear or suspicion, decides to do nothing or wait till the next breakthrough. The problem is, the longer one waits, the probability of disaster is increased. As with any original content, a fire or flood or tornado... maybe the death sentence and lost of content forever.

This cookbook is the starting point. It is a guide to help all organizations to think about what they value most and start marching in the direction that will bring them security and peace of mind. Remember, it is OK to ask for help and guidance and seek out a second opinion.

Let me know by way of comments and feedback if you find this helpful. Thanks for reading.

Some Related Information

My Career at IBM Research
I was active from 1983 till 2002 working on some exciting projects at IBM Research.
Adobe Acrobat Pro DC Review & Rating | PCMag.com
Acrobat has always towered above all rival PDF software, and the latest version is the best upgrade in years.
Document Management Software | Cabinet SAFE
Go paperless with enterprise document management software & AP solutions from Cabinet. Save time, save money. In the cloud or installed.
Safe Storage White Paper
eBizDocs Document Management Solutions; Albany, NY
eBizDocs, an award-winning leader in electronic document management solutions, document conversion, intelligent data capture and scanner sales and service.
Image Processing for the web
My tutorial on some basic image processing techniques.

Microeconomics
Optimum Population Theory
by Sundaram Ponnusamy1
History of Humanities
Exploring Fragonard’s Horrifying Corpse Collection
by Mamerto Adan0
San Francisco
Ode To Spring - San Francisco Botanical Garden in February
by Viet Doan8
Microsoft Word
How to Use Word Processing Software (NVQ Diploma in Business and Administration & IT)
by Dahlia Ambrose7
Birthdays & Celebrations
Best 5th Birthday Wishes Collections
by Bik Bar6

This website uses cookies

As a user in the EEA, your approval is needed on a few things. To provide a better website experience, hubpages.com uses cookies (and other similar technologies) and may collect, process, and share personal data. Please choose which areas of our service you consent to our doing so.

Necessary

Features

Marketing

Statistics

Approve All & Submit
Approve Checked Only

For more information on managing or withdrawing consents and how we handle data, visit our Privacy Policy at: https://corp.maven.io/privacy-policy

Show Details

Necessary
HubPages Device ID	This is used to identify particular browsers or devices when the access the service, and is used for security reasons.
Login	This is necessary to sign in to the HubPages Service.
Google Recaptcha	This is used to prevent bots and spam. (Privacy Policy)
Akismet	This is used to detect comment spam. (Privacy Policy)
HubPages Google Analytics	This is used to provide data on traffic to our website, all personally identifyable data is anonymized. (Privacy Policy)
HubPages Traffic Pixel	This is used to collect data on traffic to articles and other pages on our site. Unless you are signed in to a HubPages account, all personally identifiable information is anonymized.
Amazon Web Services	This is a cloud services platform that we used to host our service. (Privacy Policy)
Cloudflare	This is a cloud CDN service that we use to efficiently deliver files required for our service to operate such as javascript, cascading style sheets, images, and videos. (Privacy Policy)
Google Hosted Libraries	Javascript software libraries such as jQuery are loaded at endpoints on the googleapis.com or gstatic.com domains, for performance and efficiency reasons. (Privacy Policy)

Features
Google Custom Search	This is feature allows you to search the site. (Privacy Policy)
Google Maps	Some articles have Google Maps embedded in them. (Privacy Policy)
Google Charts	This is used to display charts and graphs on articles and the author center. (Privacy Policy)
Google AdSense Host API	This service allows you to sign up for or associate a Google AdSense account with HubPages, so that you can earn money from ads on your articles. No data is shared unless you engage with this feature. (Privacy Policy)
Google YouTube	Some articles have YouTube videos embedded in them. (Privacy Policy)
Vimeo	Some articles have Vimeo videos embedded in them. (Privacy Policy)
Paypal	This is used for a registered author who enrolls in the HubPages Earnings program and requests to be paid via PayPal. No data is shared with Paypal unless you engage with this feature. (Privacy Policy)
Facebook Login	You can use this to streamline signing up for, or signing in to your Hubpages account. No data is shared with Facebook unless you engage with this feature. (Privacy Policy)
Maven	This supports the Maven widget and search functionality. (Privacy Policy)

Marketing
Google AdSense	This is an ad network. (Privacy Policy)
Google DoubleClick	Google provides ad serving technology and runs an ad network. (Privacy Policy)
Index Exchange	This is an ad network. (Privacy Policy)
Sovrn	This is an ad network. (Privacy Policy)
Facebook Ads	This is an ad network. (Privacy Policy)
Amazon Unified Ad Marketplace	This is an ad network. (Privacy Policy)
AppNexus	This is an ad network. (Privacy Policy)
Openx	This is an ad network. (Privacy Policy)
Rubicon Project	This is an ad network. (Privacy Policy)
TripleLift	This is an ad network. (Privacy Policy)
Say Media	We partner with Say Media to deliver ad campaigns on our sites. (Privacy Policy)
Remarketing Pixels	We may use remarketing pixels from advertising networks such as Google AdWords, Bing Ads, and Facebook in order to advertise the HubPages Service to people that have visited our sites.
Conversion Tracking Pixels	We may use conversion tracking pixels from advertising networks such as Google AdWords, Bing Ads, and Facebook in order to identify when an advertisement has successfully resulted in the desired action, such as signing up for the HubPages Service or publishing an article on the HubPages Service.

Statistics
Author Google Analytics	This is used to provide traffic data and reports to the authors of articles on the HubPages Service. (Privacy Policy)
Comscore	ComScore is a media measurement and analytics company providing marketing data and analytics to enterprises, media and advertising agencies, and publishers. Non-consent will result in ComScore only processing obfuscated personal data. (Privacy Policy)
Amazon Tracking Pixel	Some articles display amazon products as part of the Amazon Affiliate program, this pixel provides traffic statistics for those products (Privacy Policy)
Clicksco	This is a data management platform studying reader behavior (Privacy Policy)