Wrangling Data July 24 - 30


In the previous module, we successfully grabbed lots of data from various online repositories. Some of it was already in well-structured tables; much of it was not. All of it was text, though. Initially, it (or most of it) was just scanned images of documents. At some point, optical character recognition (OCR) was used to distinguish the black dots from the white dots in those images, and to recognize the patterns that make up letters, numbers, and punctuation. There are commercial products that can do this (and we have some installed in the Underhill Research Room that you can use), and there are free products that you can install on your computer to do it yourself.

It all looks so neat and tidy. Ian Milligan discusses this 'illusionary order' and its implications for historians:

In this article, I make two arguments. Firstly, online historical databases have profoundly shaped Canadian historiography. In a shift that is rarely – if ever – made explicit, Canadian historians have profoundly reacted to the availability of online databases. Secondly, historians need to understand how OCR works, in order to bring a level of methodological rigor to their work that use these sources.

Just as we saw with Ted Underwood's article on theorizing search, these 'simple' steps in the research process are anything but. They are also profoundly theoretical in how they change what it is we can know. In archaeology, every step of the method, every stage in the process, has a profound impact on the stories we eventually tell about the past. Decisions we make destroy data, and create new data. Historians aren't used to thinking about these kinds of issues!

There are also manual ways of doing the same thing that OCR does - we call these things 'humans', and we organize their work through 'crowdsourcing'. We break the process up into wee manageable steps, and make these available over the net. Sometimes we gamify these steps, to make them more 'fun'. If several people all work on the same piece of text, the thinking is that errors will cancel each other out: a proper transcription will emerge from the work of the crowd. While transcriptions might've provided the earliest examples of crowdsourcing research (but see also The HeritageCrowd Project and the subsequent 'How I Lost the Crowd'), other tasks are now finding their way into the crowdsourced world - see the archaeological applications within the MicroPasts platform. These include things like 'masking' artefact photographs in order to develop 3D photogrammetric models.
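To make the 'errors cancel each other out' idea concrete, here is a minimal sketch in Python of a word-by-word majority vote. It is only an illustration: the function name `consensus` and the three volunteer transcriptions are invented for this example, and it assumes every transcription has the same number of words, which real crowdsourced text rarely does without an alignment step first.

```python
from collections import Counter


def consensus(transcriptions):
    """Return a word-by-word majority vote across several transcriptions.

    Assumes every transcription splits into the same number of words;
    real crowdsourced text would need to be aligned first.
    """
    token_lists = [t.split() for t in transcriptions]
    lengths = {len(tokens) for tokens in token_lists}
    if len(lengths) != 1:
        raise ValueError("transcriptions must have the same number of words")
    # zip(*...) groups the first words together, then the second words, and so on
    return " ".join(
        Counter(column).most_common(1)[0][0] for column in zip(*token_lists)
    )


# Three hypothetical volunteers, each making a different mistake
volunteers = [
    "The Republic of Texos sends greetings",
    "The Republic of Texas sends greetings",
    "The Repub1ic of Texas sends greetings",
]
print(consensus(volunteers))
```

Each volunteer gets something wrong, but never the same word, so the vote recovers a clean line. If two volunteers shared the same misreading, the vote would happily preserve it.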

But often, we don't have a whole crowd. We're just one person, alone, with a computer, at the archive. Or working with someone else's digitized image that we found online. How do we wrangle that data? Let's start with M. H. Beals's account of how she 'xml'd her way to data management' and then consider a few more of the nuts and bolts of her work in OA TEI-XML DH on the WWW; or, My Guide to Acronymic Success.

This kind of work is extraordinarily important! You already had a taste of it in the TEI exercise in the last module. (Now, if we had a seriously big project where we were transcribing lots of text, we'd invest in a dedicated XML editor like Oxygen - there are plugins available and frameworks for doing historical transcription on this platform. There is a 30-day free trial license if you want to give it a try. But for now, Notepad++, TextWrangler, Komodo Edit, Sublime Text, or any of a number of good text editors will do all that we need to do.) Also, check out the TEI. Take 15 minutes and read through What is XML and Why Should Humanists Care? by David Birnbaum. Annotate!

In this module we're going to do some other kinds of wrangling.


In the exercises for this week we are going to focus on some bare-bones wrangling of data. First, we are going to do some activities and exercises to get in the right frame of mind. Then, we'll switch gears and we'll use regular expressions to search and extract information from the Diplomatic Correspondence of the Republic of Texas, which you'll find at the Internet Archive. If you're on a PC, download Notepad++ - this is a souped-up version of the simple notepad application, and allows us to do very useful things indeed. If you're on a Mac, TextWrangler is probably already installed and is all you need. If you're working in Linux, you can use whatever text editor you're familiar with.
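To give a flavour of what 'search and extract' with regular expressions looks like, here is a small sketch in Python. The sample lines are made up for illustration (they are not quoted from the Diplomatic Correspondence), and the pattern assumes entries shaped like 'Sender to Recipient, Month day, year'; the real text will be messier, and in the exercises you will build your own patterns inside a text editor rather than in code.

```python
import re

# Made-up lines in the style of a correspondence index (not quoted from the source)
sample = """\
Smith to Jones, March 3, 1837
Some ordinary sentence that is not an index entry.
Jones to Brown, April 12, 1838
"""

# A sender, the word "to", a recipient, a comma, then "Month day, year"
pattern = re.compile(
    r"^(?P<sender>[A-Z][\w. ]*?) to (?P<recipient>[A-Z][\w. ]*?), "
    r"(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})$",
    re.MULTILINE,
)

# Each match becomes a (sender, recipient, date) row
rows = [m.group("sender", "recipient", "date") for m in pattern.finditer(sample)]
for row in rows:
    print(row)
```

The named groups do the extracting: lines that fit the pattern become rows of a table, and the line that does not fit is skipped.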

We'll conclude by using 'Open Refine' to tidy up the information we extracted from the Texan correspondence.
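As a rough idea of what that tidying involves, the sketch below groups spelling variants by reducing each one to a crude 'fingerprint'. It is loosely modelled on the key-collision idea behind Open Refine's clustering feature, but it is not Open Refine's actual code; the function names `fingerprint` and `cluster` and the sample names are assumptions made for this illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a crude key so that near-duplicates collide."""
    # Strip accents by decomposing characters and dropping anything non-ASCII
    ascii_text = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    # Lowercase and remove punctuation
    cleaned = ascii_text.lower().translate(str.maketrans("", "", string.punctuation))
    # Sort the unique words so that word order stops mattering
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical name variants of the kind OCR'd text tends to produce
names = ["Houston, Sam", "Sam Houston", "sam houston.", "Anson Jones"]
print(cluster(names))
```

The three Houston variants end up in one group, ready for a human to decide which spelling should win. That last step is a judgment call, which is why the tool only suggests merges.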

Things you will learn in this module:

  • the power of regular expressions. 'Search' and 'Replace' in Word just won't cut it any more for you! (Another reason why you should write in Markdown in the first place and then convert to Word for the final typesetting.)
  • Open Refine as a powerful engine for tidying up the messiness that is OCR'd text.

What you need to do this week

  1. Respond to the readings and the reading questions through annotation (taking care to respond to others' annotations as well) - see the instructions below. Remember to tag your annotations with 'hist3814o' so that we can find them here.
  2. Do the exercises for this module, pushing yourself as far as you can. Annotate the instructions where they might be unclear or confusing; see if others have annotated them as well, and respond to them with help if you can. Keep an eye on our Slack channel - you can always offer help or seek out help there. Write a blog post describing what happened as you went through the exercises (your successes, your failures, the help you may have found/received), and link to your 'faillog' (i.e., the notes you upload to your GitHub account - for more on that, see the exercises!).
  3. Submit your work here


Select one of the articles behind the links above OR select one of the articles below to annotate.

Blevins, Mining and Mapping the Production of Space: A View of the World from Houston

Blevins, Space, Nation, and the Triumph of Region: A View of the World from Houston

Ian Milligan on Imageplot and here

Shawn Graham on extracting text & diy OCR

Reading questions: On your blog, reflect on any data cleaning you've had to do in other classes. Why don't historians discuss this kind of work? Make reference (or link) to key annotations, whether by you or one of your peers, to support your points.