Archival Research Using Digital Archives

My first encounter with a digital archive was a collection called Post-War Europe: Refugees, Exile and Resettlement. The collection contains 119,000 pages of scanned materials from the National Archives and Wiener Library in London relating to post-World War II “displaced persons.” For someone familiar with the National Archives, the Post-War Europe collection was an easy “bridge” into digital research. The digital collection reproduced the original record organization system from the National Archives, including the folder titles and reference numbers (FO 371/57708 for example). This made it easy to move back and forth between records that had been scanned and those available only in the London archives. A browse feature made it possible to see a full list of the scanned materials and to study the arrangement of documents and the relationships between different fonds.

The Post-War Europe collection placed the archival record online with the power of full-text search, subject browsing and access to materials whenever needed. My research, teaching and writing took place concurrently.  The archive was alive and accessible in new ways.  My excitement about this new tool inspired me to design a course on digital archives.  The seminar would give me a chance to reflect on my own research and to explore the new potential of digital collections and computational methods.

I decided to dedicate as much time as possible in “Refugee History and Digital Archives” to in-class exercises using primary sources. The class met for three-hour sessions once a week. The first hour was generally used for discussion of common readings, questions and essential context for working with the day’s archive. A student presentation offered an introduction to the collection and commentary on the archive’s interface. The remaining time was used for small-group projects using the digital archive. Exercises were designed to become increasingly complicated. We began with single documents, then a single folder of documents, then multiple folder collections and finally entire archives. On our best days, the class agreed on a common research problem and divided the research tasks among small groups led by graduate students. Each group researched part of the problem and we combined our results as a class in the final half hour. The class was consistently able to conduct primary source research in small groups and to link their findings into a master narrative by the end of class. Our research both reproduced familiar narratives and suggested directions for future research. This was extremely rewarding to witness.

Success came with a careful balance of structure and spontaneity.  For one class, I gave each student a single document to read and discuss with his or her group before working with the larger collection.  These materials gave the students key evidence and set their discussion in specific directions.  The final pieces fit together very well and there was a tangible feeling of accomplishment.  Total freedom produced less certain results, but the findings were often more original and unexpected.

After ten weeks working with multiple online archives, we had a final discussion about our experiences. Looking back on the course, three topics stand out and deserve further reflection: the diversity of archival interfaces, the problem of archival integrity and the future potential of digital research methods.

Archival Interfaces

As the students and I reflected on the course, it became clear that nearly all of the archives that we worked with had a unique interface. This required students to learn and adapt to the available tools, capabilities and limitations of each website and its interface. Full-text search is available in some archives, but not others. Most archives do not allow users to browse record descriptions and locations. There is no consistency in the ability to search by date, record type or other attributes. One is left to hope that a research interface will emerge that can be customized for researchers’ needs.

We need a historical source analysis program, comparable to ArcGIS for mapping and spatial analysis, that would allow users to compile and analyze materials from digital archives. Such a program would function as a professional research tool that could be adapted to the requirements of computational methods and Digital Humanities projects. This would allow researchers to utilize materials from existing digitized collections in addition to images and scans collected from analog archives. Few researchers work solely from digital or analog collections. The ability to create one’s own digital archive of relevant materials from any source for analysis using the same platform would be liberating. We already use Zotero and EndNote to organize secondary sources. Surely there is great potential for a unified historical information analysis platform.

Additionally, as one student noted, a tool is needed for handwritten texts. Imagine a paleography tool that could be used with any digital archive to quickly read handwritten documents and generate text for computational analysis. The technology for such a tool exists, but none of the archives used in our course featured one. This tool would utilize the potential of digital archives to make handwritten text, which is often overlooked or demands significantly more time to read, more accessible. Students would be able to work from both the generated text and the handwritten original.

To be useful for research, digital archives need to make their data available to researchers.  Here I am talking about the actual image files and data, not just a web interface.  For example, the Voices of the Holocaust digital archive allows users to display several types of data on a map.  Where did David Boder interview Holocaust survivors?  Where were the subjects born?  Using Google Maps, Voices makes good use of this spatial data. At the same time, without access to the KML files, users are bound by the visualizations on the site and it is impossible to overlay layers or engage in more advanced analysis.
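To make the point about KML files concrete, here is a minimal sketch of what a researcher could do if an archive released its map data. It assumes a standard KML file with Placemark entries; the sample fragment, place names and coordinates are hypothetical and are not taken from the Voices of the Holocaust site. The function name read_placemarks is my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical KML fragment; names and coordinates are illustrative only.
SAMPLE_KML = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Interview site A</name>
      <Point><coordinates>2.3522,48.8566,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Interview site B</name>
      <Point><coordinates>11.5820,48.1351,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def read_placemarks(kml_text):
    """Return a list of (name, latitude, longitude) tuples from KML text."""
    root = ET.fromstring(kml_text.encode("utf-8"))
    places = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = placemark.findtext(
            ".//kml:coordinates", default="", namespaces=KML_NS
        )
        if not coords.strip():
            continue  # skip placemarks without point coordinates
        # KML stores coordinates as longitude,latitude[,altitude].
        lon, lat = coords.strip().split(",")[:2]
        places.append((name.strip(), float(lat), float(lon)))
    return places


places = read_placemarks(SAMPLE_KML)
for name, lat, lon in places:
    print(f"{name}: {lat}, {lon}")
```

Once the points are in a plain list like this, they can be combined with layers from other sources or loaded into a GIS program, which is exactly what a fixed web visualization does not allow.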

Ideally, digital archives should make their holdings available as plain text, not just scanned images. In the very best archives, both document images and text are available. The digital archive of the Harvard Project on the Soviet Social System, for example, allows users to easily move between document images and text with “view image” and “view text” buttons.  Further access to this “raw text” in digitized archives would open possibilities for textual analysis, topic modeling and other computational tasks that are not possible with document images alone.

For an exercise on topic modeling, we used a program called MALLET. This program requires its input as plain text in .txt files. I had to spend several hours cutting and pasting from the archive’s web pages to create files that the program could analyze. Direct access to the interview text files would have significantly shortened this labor. It simply isn’t practical to pull text from tens of thousands of pages from a web archive.
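The cutting and pasting described above can be partly automated. The following is only a sketch, assuming the pages have already been saved as HTML and that the transcript is the visible text of the page; the sample pages and the helper names (TextExtractor, html_to_text, save_corpus) are hypothetical. It uses only Python's standard library and writes one plain-text file per document.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Strip markup from one HTML page and return its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def save_corpus(pages, out_dir):
    """Write one .txt file per page; pages maps a document id to its HTML."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for doc_id, html in pages.items():
        path = out_dir / f"{doc_id}.txt"
        path.write_text(html_to_text(html), encoding="utf-8")
        written.append(path)
    return written


# Hypothetical saved pages standing in for interview transcripts.
pages = {
    "interview_001": "<html><body><h1>Interview 1</h1>"
                     "<p>We left in 1945.</p><script>var x=1;</script></body></html>",
    "interview_002": "<html><body><p>The camp was near the border.</p></body></html>",
}
corpus_dir = Path(tempfile.mkdtemp()) / "corpus"
files = save_corpus(pages, corpus_dir)
```

A directory of text files like this is the kind of input that MALLET's import-dir command is designed to read. Real archive pages would also contain menus and footers, so the extraction step would need to be adapted to each site, which is one more reason direct access to the text files would be preferable.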

We also need an easier way to cite materials from digital archives. Notes with a full URL are nothing but gobbledygook.  What if the Chicago Manual of Style had a format for footnotes with an embedded hyperlink?  This method worked very well with my students’ final papers.  When grading the papers, it was thrilling to just click on a footnote and have direct access to the source. I could see their discoveries in the original.  Whenever I had questions about a student’s interpretations of a document, I needed only to click on the footnote and consult the document image.  This is a “must have” for future scholarship of all kinds.  We might even move to a time when publishers require images of all cited documents to link with footnotes for digitally published books and articles.

Archival Integrity

Digital archives are changing and evolving with astonishing speed. Even in the short ten weeks of our class, the interface for the Post-War Europe collection changed dramatically.  Documents were removed and the essential browse function was taken off the site.  All of these changes are part of a transition to “cross-searchable” collections.  Both Gale Cengage’s Archives Unbound and Chadwyck-Healey’s Digital National Security Archive (DNSA) make it possible to search across dozens of digital archives with a single search.  A search for “refugees” with DNSA delivers more than 2,000 results from topical collections ranging from CIA covert actions in Afghanistan to transcripts of Henry Kissinger’s telephone conversations.  While I welcome the ability to search across collections, we should also pause for caution.

At this point in history, most archives are “born analog” and must be digitized for inclusion in a digital archive.  In their analog version, archives have a detailed finding aid and have been organized in ways that allow researchers to identify what kinds of materials are held in the collection and where to find them.  It is impossible for a researcher to read every page in an archive, so he or she depends on archival organization to assess a collection, to understand its origins, its possibilities and limitations.  Knowledge of the materials as a collection informs how we find materials, evaluate their authenticity and their usefulness as historical evidence.

As archival collections are chopped into individually searchable units, we lose vital data on the provenance and organization of archival collections. In the Post-War Europe collection, for example, it is no longer possible to search by reference number or to browse folder titles. We are now entirely dependent on the search engine for access to the documents. While search results help us to find useful bits of text, it is also essential to study documents as part of a larger archival collection. As an example, let’s think of this as a box of letters. If we read each letter in order, then we begin to see a dialogue unfold between the letter writers. If we only look at letters with the word “computer” in them, that dialogue is far more elusive.
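The box-of-letters problem can be illustrated with a small sketch. It assumes that each scanned item keeps its folder reference number and its position within the folder; the Record structure, the second reference number and all of the document texts below are invented for illustration. A keyword search returns a single hit, while the folder context restores the exchange around it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    reference: str  # folder-level reference number
    position: int   # order of the item within its folder
    text: str


# Hypothetical records: the texts are invented for this example.
RECORDS = [
    Record("FO 371/57708", 1, "Request for a report on the camps."),
    Record("FO 371/57708", 2, "Reply: the report on refugees is attached."),
    Record("FO 371/57708", 3, "Minute questioning the figures in the reply."),
    Record("FO 371/57709", 1, "Unrelated note on shipping schedules."),
]


def search(records, term):
    """Case-insensitive keyword search over the record texts."""
    term = term.lower()
    return [r for r in records if term in r.text.lower()]


def folder_context(records, hit):
    """Return every item from the hit's folder, in original order."""
    same_folder = [r for r in records if r.reference == hit.reference]
    return sorted(same_folder, key=lambda r: r.position)


hits = search(RECORDS, "refugees")
context = folder_context(RECORDS, hits[0])
for record in context:
    print(record.reference, record.position, record.text)
```

The search alone finds only the reply. The request that prompted it and the minute that questioned it never use the search term, and they become visible only because the reference number and item order were preserved.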

There are many audiences for digital archives and it is understandable that many digital collections are intended for the general public.  The current trend in digital archiving seems to be a public-history model where digital archives function more as virtual museum exhibits than as research archives.  While this provides important access to primary sources, advanced undergraduates and graduate students quickly grow frustrated by limited collections that show little promise for original research or do not lend themselves to innovative research methods.  Given the immense potential of digital archives for research, professional historians must assert their needs and engage in greater dialogue with acquisitions librarians and those creating digital collections.

Digital Potential

On the positive side, digital archives hold great potential to move past our traditional dependence on provenance, archival description and organization. Digital archives allow researchers to approach documents as simple containers of text and data. We can identify statistically relevant patterns among billions of characters, words and topics. With these methods, the individual origins of a single document are irrelevant. Our findings do not depend on where a text comes from or who wrote it, but on the occurrence of statistically relevant patterns within a massive body of text. This new digital history demands innovative methods of archival research and analysis.

We can create algorithms that organize archival collections in new ways. With tokenization and named-entity recognition, it is possible to identify elements in a document. A program could identify all of the names of the letter writers in our box of letters. This would make it possible to quickly evaluate the personalities represented in the collection and to map their networks of correspondence. Similarly, it is possible to identify the date a letter was written or sent and to quickly arrange the letters chronologically. These methods make it possible to organize the full archive in ways that may reveal previously unseen historical narratives. Organizational tasks that might have required several years for an archivist will be done in a matter of seconds by an algorithm.
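As a rough sketch of this idea, the example below pulls a date, a recipient and a sender out of each letter, sorts the letters chronologically and counts who wrote to whom. It rests on strong assumptions that I should state plainly: the letters are already plain text, each contains a date written like "12 March 1946", a "Dear ...," salutation and a signature on the last line. The sample letters and names are invented, and simple pattern matching stands in here for real named-entity recognition.

```python
import re
from collections import Counter
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Matches dates such as "12 March 1946".
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")
# Matches a salutation line such as "Dear Mr. Hall,".
SALUTATION_RE = re.compile(r"^Dear ([^,\n]+),", re.MULTILINE)

# Hypothetical letters, invented for illustration.
LETTERS = [
    "3 May 1946\nDear Mr. Hall,\nThank you for your reply.\nAnna Weiss",
    "12 March 1946\nDear Miss Weiss,\nYour request has been received.\nJ. Hall",
    "20 April 1946\nDear Mr. Hall,\nI write again about my brother.\nAnna Weiss",
]


def parse_letter(text):
    """Extract date, recipient and sender under the assumptions above."""
    d = DATE_RE.search(text)
    s = SALUTATION_RE.search(text)
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    return {
        "date": date(int(d.group(3)), MONTHS[d.group(2)], int(d.group(1))) if d else None,
        "recipient": s.group(1).strip() if s else None,
        "sender": lines[-1] if lines else None,  # assumes signature is last line
        "text": text,
    }


parsed = [parse_letter(t) for t in LETTERS]

# Arrange the dated letters chronologically.
dated = sorted((p for p in parsed if p["date"]), key=lambda p: p["date"])

# Count sender -> recipient pairs as edges of a correspondence network.
edges = Counter(
    (p["sender"], p["recipient"])
    for p in parsed
    if p["sender"] and p["recipient"]
)

for p in dated:
    print(p["date"].isoformat(), p["sender"], "->", p["recipient"])
```

Even this toy example exposes a real difficulty: "Anna Weiss" signs her letters but is addressed as "Miss Weiss," so the program treats them as two different people. Reconciling name variants is the kind of judgment that still calls for a historian or an archivist.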

Additionally, digital research has the potential to identify change over time using computational patterns rather than human narratives.  This opens exciting new avenues for research.  Computers can perform the essential tasks of arrangement and curation in innovative new ways.  There are immense new possibilities, but we first need to create plain-text digital archives from analog documents.  As a discipline, we are only beginning to recognize this inevitable evolution in our research methods and analysis. Historians will need to understand and master skills that were previously associated with computer and archival science.  One must hope that we will embrace the potential of digital archival analysis, while retaining the best traditions and practices of analog historical research.

Andrew Paul Janco
Human Rights Program
The University of Chicago

Image: Archives organized with network visualization and analysis. League of Nations Archives (UN Geneva). Wikimedia Commons.

The views, perspectives, and opinions expressed here and by those providing comments are those of the author(s) and commentator(s) alone, and do not reflect the opinions of Dissertation Reviews, its members, editors, or advisory board members.


