Module 4 Exercises

There are many different tools and approaches you could use to visualize your data, both as a preliminary pass to spot the holes and also for more formal analysis. For this module, then, I would like you to select the two of these exercises that seem most germane to your final project.

You are welcome to work through more of them, of course, but I want the exercises to move your own research forward. Some of these I wrote; some are adapted from The Macroscope; others are adapted or used holus-bolus from scholars like Miriam Posner, Fred Gibbs, and Heather Froehlich (and I'm grateful that they shared their materials!). Finally, you are welcome to explore the lessons and tutorials at The Programming Historian if they seem appropriate to what you want to do for your project.

But what if I haven't any idea of what to do for the final project? Then read through the various tutorials for inspiration. Find something that strikes you as interesting, and then talk to me about how you might employ the ideas or concepts with regard to the Equity data.

Things to install?

Many of these exercises involve installing more software on your own machine. For those exercises that involve using R and RStudio, you are welcome to install RStudio on your own machine OR to use it in DHBox. Please read this quick introduction to R and RStudio carefully.

(If you decide to install R and RStudio on your own machine, I would suggest you read the introductory bits from Lincoln Mullen's book-in-progress, Computational Historical Thinking, especially the 'setup' part under 'getting started' (pay attention to the bit on installing packages and dependencies). If you spot any exercises in Mullen's book that seem relevant to your project, you may do those as an alternative to the ones here. Alternatively, go to Swirl and learn the basics of R within R. DHNow links to a new Basic Text Mining in R tutorial which is worth checking out as well.)

NB: It is always very important to record in your own notebooks which version of R you used for your analysis, which versions of any R packages you installed and used, and so on, because packages can go out of date.
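
One easy way to do this, if you are working in R, is to paste the output of a couple of built-in functions straight into your notebook:

```r
# record your computational environment in your open notebook
R.version.string            # which version of R you are running
sessionInfo()               # R version, OS, and versions of loaded packages
packageVersion("mallet")    # the version of one specific package
                            # (swap in whichever package you are actually using)
```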

In the table below I've gathered the exercises together under the headings of 'text', 'networks', 'maps', and 'charts'. I've also added some entries that I am categorizing under 'art'. The first is a series of exercises on the sonification of data; the second is a guide to making twitterbots; the third is about glitching digital imagery. These approaches can provide surprising and novel insights into history, as they move from representing history digitally to performing it. See, for instance, Daniel Ruten's final project for an undergraduate digital history class at the University of Saskatchewan, in which he translated simple wordclouds of a WWI diary into a profound auditory performance. I would be very interested indeed to see any final projects in HIST3814o give sonification, twitterbots, or glitch a try.

| Texts | Networks | Maps | Charts | Art |
| --- | --- | --- | --- | --- |
| Topic Modeling Tool | Network analysis in Gephi | Simple mapping & georectifying | Quick charts using RAW | Sonification |
| Topic Modeling in R | Converting 2-mode to 1-mode | QGIS (tutorials by Fred Gibbs) | | Twitterbots |
| Text Analysis with Overview | Network Analysis in R | Geoparsing with Python | | Glitching Photos |
| Corpus Linguistics with AntConc | Network Analysis in Cytoscape | Palladio with Posner | | |
| Text Analysis with Voyant | Choose your own adventure | Leaflet.js Maps | | |

Exercise 1

Network Visualization

This exercise uses the open-source programme Gephi, which you install on your own computer. If you'd rather not install anything, please see Network Analysis in R instead.

Recall that the index of the collected letters of the Republic of Texas was just a list of letters from so-and-so to so-and-so. We haven't looked at the content of those letters, but the shape of the network - the metadata of that correspondence - can be revealing (remember Paul Revere!). When we stitch that together into a network of people connected because they exchanged letters, we end up with a shard of their social network. Networks can be queried for things like power, position, and role, and so, used judiciously, they can help us suss out something of the social structures in which this history took place. I would recommend that you also take a long look at Scott Weingart's series, Networks Demystified. Finally, heed our warning.

In this exercise, you will transform your Texan Correspondence data into a network, which you will then visualize with the open-source programme Gephi. The detailed instructions are here.


Exercise 2

Topic Modeling Tool

In exercise 2, you will use the 'Topic Modeling Tool' to create a simple topic model and a webpage that allows you to browse the results.

  1. Download the tool.
  2. Make sure you have some content on your own machine; the Colonial Newspaper Database is a handy corpus. (Created by Melodee Beals, it's a series of cleanly transcribed late 18th- and early 19th-century newspaper articles from Scotland and Northern England; you can grab my copy from here.) Or perhaps you might move your copy of the Shawville Equity out of DHBox onto your computer. At the command prompt in DHBox, type $ ls to make sure you can see your Equity folder (note that you can't zip a folder from the command line while you are inside it, so cd out of it if necessary). Assuming your files are in equityfolder, zip the folder up with this command: $ zip -r equityfiles.zip equityfolder. Then use the file manager to download the zip file, and unzip it on your machine.
  3. Double-click on the file you downloaded in step 1. This will open a Java-based graphical user interface to one of the most common topic-modeling approaches, 'Latent Dirichlet Allocation'.
  4. Set the input to be the Colonial Newspaper Database OR the Shawville Equity folder.
  5. Set the output to be somewhere neat and tidy on your computer.
  6. Set the number of topics you'd like to model.
  7. Click 'train topics' to run the algorithm.
  8. When it finishes, go to the folder you selected for output and find the file 'all_topics.html' in the 'output_html' folder. Click on that, and you now have a browser-based way of navigating your topics and documents. In the 'output_csv' folder, you will find the same information as csv files, which you could then load into a spreadsheet (or into R; see the sketch below) for other kinds of visualizations, which we'll talk about in class.
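
If you'd rather poke at that csv output in R than in a spreadsheet, here is a minimal sketch. The exact file and column names inside output_csv vary between versions of the tool, so the ones below are assumptions; look inside your own output_csv folder and adjust to match what you actually have.

```r
# a minimal sketch: read one of the Topic Modeling Tool's csv outputs into R.
# NB the file name and column positions here are assumptions -- check your
# own output_csv folder and adjust accordingly.
topics <- read.csv("output_csv/topics_metadata.csv", stringsAsFactors = FALSE)

str(topics)   # inspect what you actually have: one row per document?

# assuming the second column holds each document's proportion for the first
# topic, plot how that topic is distributed across the corpus
hist(topics[[2]],
     main = "Distribution of topic 1 across documents",
     xlab = "proportion of document assigned to topic 1")
```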

Make a note in your open notebook about your process and your observations. How does reading this material in this way change/challenge/or focus your understanding of the material?


Exercise 3

Topic Modeling in R

Exercise 2 was quite a simple way to do topic modeling. In this exercise, we are going to use a package for the R statistical language called 'mallet' to do our topic modeling. One way isn't necessarily better than the other, although doing our analysis within R opens up the potential for extending the analysis or combining it with other data. First, read this introduction to R so what follows isn't a complete shock!
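
To give you a sense of the workflow before you dive into the full exercise, here is a minimal sketch using the mallet package. It assumes a folder of plain-text files (one article per file) and a stopword file with one word per line; the paths, the number of topics, and the number of training iterations are all placeholders to adjust for your own material.

```r
# a minimal sketch of topic modeling in R with the 'mallet' package
# (install.packages("mallet") first; it also needs Java via rJava)
library(mallet)

# assumes a folder of plain-text files and a stopword file, one word per line
# -- adjust these paths to your own material
files <- list.files("cnd_txt", pattern = "\\.txt$", full.names = TRUE)
documents <- data.frame(id = basename(files),
                        text = sapply(files, function(f) paste(readLines(f), collapse = " ")),
                        stringsAsFactors = FALSE)

instances <- mallet.import(documents$id, documents$text, "en-stopwords.txt")

topic.model <- MalletLDA(num.topics = 20)   # experiment with the number of topics
topic.model$loadDocuments(instances)
topic.model$train(200)                      # more iterations: better, but slower

# the ten most heavily weighted words in topic 1, and per-document proportions
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
mallet.top.words(topic.model, topic.words[1, ], num.top.words = 10)
doc.topics <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)
```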


Exercise 4

Text Analysis with Overview

In exercise 4, we're going to look at the Colonial Newspaper Database again, but this time using a tool called 'Overview'. Overview uses a different approach than the topic models we've been discussing. In essence, it looks at word frequencies and their distributions within a document, and within a corpus, to organize the documents into folders of progressively similar word use.

You can download Overview to run on your own machine, but for our purposes, the hosted version at https://www.overviewdocs.com/ is sufficient. Go to that page, watch the video, create an account, and then log in. (More help about how Overview works may be found on their blog, including helpful videos.)

Once you're inside, click 'import from a CSV file' and upload CND.csv (which you can download and save to your own machine from here - right-click and 'save as'). On the 'UPLOAD A CSV FILE' page in Overview, click 'browse' and select CND.csv. It will give you a preview. There are a number of options here: you can tell Overview which words to ignore, and which words to give added importance to. Which words will you select? Make a note in your notebook. Then hit 'upload'.

A new page appears, called 'YOUR DOCUMENT SETS'. Click on the one you just uploaded. A file folder tree showing documents of progressively greater similarity will open; on the right hand side will be the list of documents within each box (the box in question will be greyed out when you click on it, so you know where you are). You can search for words in your document, and Overview will tell you where they are; you can tag documents that you find interesting. The Overview system allows you to jump between a distant, macroscopic view and a close, document level view. Jump back and forth, see what you can find. For suggestions about how to use Overview effectively, try their blog. Make notes about what you observe in your notebook. Also, you can export your tagged document set from Overview, so that you could visualize the patterns of tagging in a spreadsheet (for instance).

Going further: do you see how you could upload the documents that you collected during Module 2?


Exercise 5

Corpus Linguistics with AntConc

Heather Froehlich has put together an excellent step-by-step guide to using AntConc for exploring textual patterns within, and across, corpora of texts. Work your way through her tutorial.

Can you get our example materials (from the Colonial Newspaper Database) into AntConc? This might help you to split the csv into individual txt files (there is also a minimal R sketch below). Alternatively, do you have any materials of your own, already collected? Feed them into AntConc. What patterns do you see? What if you compare your materials against other corpora of texts?
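
If you'd like to do the splitting in R, here is a minimal sketch. The column name 'text' is an assumption about CND.csv; run names(cnd) first and adjust to the actual headers.

```r
# a minimal sketch: split CND.csv into one .txt file per article so that
# AntConc can read them. The column name 'text' is an assumption -- run
# names(cnd) and adjust to whatever the headers actually are.
cnd <- read.csv("CND.csv", stringsAsFactors = FALSE)
names(cnd)   # check what the columns are actually called

dir.create("cnd_txt", showWarnings = FALSE)
for (i in seq_len(nrow(cnd))) {
  outfile <- file.path("cnd_txt", paste0("article-", i, ".txt"))
  writeLines(cnd$text[i], outfile)
}
```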

FYI, here is a collection of corpora that you can explore.


Exercise 6

Text Analysis with Voyant

In Module 2, if you recall, we worked through how to transform XML using stylesheets. Melodee Beals used a stylesheet to transform her database into a series of individual txt files. In the exercises above, a transformer was used to turn the database into a single CSV file. In this exercise, we are going to use Voyant Tools to visualize patterns of word use in the database. Voyant can read either a CSV or text files. The advantage of uploading a folder of text files is that, if the files are in chronological order, Voyant's default visualizations will also be arranged in chronological order, and thus we can see change over time.

Go to http://voyant-tools.org. Paste the URL to the csv of the CND database: https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv .

Now, open a new browser window, and go here http://voyant-tools.org/?corpus=colonial-newspapers&stopList=stop.en.taporware.txt

Do you see the difference? In the latter window, the individual articles have been uploaded individually, and thus are treated as individual documents in chronological order.

Explore the corpus, comparing terms over time, looking at keywords in context, and using the RezoViz tool to create a graph where people, places, and organizations that appear in the same documents (and across documents) are connected (you can find 'rezoviz' under the cogwheel icon at the top right of the panel). Google these terms and tools for what they mean and how others have used them. You can embed any of the tools in your blogs by using the 'save' icon and getting the iframe or embed code. You can apply 'stopwords' by clicking on the cogwheel in any of the different tools, and selecting stopwords. Apply the stopwords globally, and you'll only have to do this once! What patterns do you see? What do different tools highlight? Which ones are useful? What patterns do you see that strike you as interesting? Note this all down.

Going further: upload materials you collected in Module 2 and explore them.


Exercise 7

Quick Charts Using RAW

A quick chart can be a handy thing to have. Google Sheets, Microsoft Excel, and a host of other programs can make excellent charts quickly with their wizard functions; never hesitate to turn to these. However, they are not always good with non-numeric data. In Module 3, you used NER to extract place names from a text. After some further munging with regex, you might have ended up with a CSV that looks like this. Can we do a quick visualization of this information? One useful tool is RAW. Open that in a new window. Copy the table of data of places mentioned in the Texan correspondence, and paste it into the data input box at the top of the RAW screen.

oh noes an error!

A quick data munge

You should get an error message, to the effect that you need to check 'line 2'. What's gone wrong? RAW has checked the number of values you have in that row, and compared it to the number of columns in row 1 (which contains all the column names). It sees that the two don't match. What we need to do is add a default null value in those cells. So, go to Google Sheets, click the 'go to google sheets' button, and then click on the big green plus sign to start a new sheet. Paste the following into the top-left cell (cell A1):

=IMPORTDATA("https://raw.githubusercontent.com/hist3907b-winter2015/module4-holes/master/texas.csv")

Pretty neat, eh? Now, here's the thing: even though your sheet looks like it is filled with information, it's not (at least, as far as the script we are about to run is concerned). That is to say, the sheet itself only has one cell of data, and that one cell is grabbing info from elsewhere on the web and dynamically filling the sheet. The script we're going to run works only on static values (more or less).

So, place your cursor in cell B1. On a Mac, hit shift+cmd+downarrow; on a Windows machine, hit shift+ctrl+downarrow. Then, on a Mac, hit shift+cmd+rightarrow; on Windows, shift+ctrl+rightarrow. Then copy all of that data (cmd+c or ctrl+c). Then, under 'Edit', select 'paste special' -> 'paste VALUES only'.

The formula you put in cell A1 now says #REF!. You can delete it now. This mucking about is necessary so that the add-on script we are about to run will work.

We now need to fill those empty values. In the toolbar, click Add-ons -> Get add-ons. Search for 'blanks'; you want to add Blank Detector.

Now, click somewhere in your data. On a Mac, hit cmd+a; on Windows, hit ctrl+a. This highlights all of your data. Click Add-ons -> Blank Detector -> detect cells. A dialogue panel will open on the right-hand side of your screen. Click the button beside 'set value' and type in null. Hit 'run'. All of the blank cells will fill with the word null. Delete column A (which formerly had record numbers but is now just filled with the word null; we don't need it). If you get the error 'run exceeded maximum time', just hit the run button again. This script might take a few minutes.
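
If you'd rather skip the spreadsheet gymnastics altogether, here is a minimal sketch of the same munge in R. It assumes the texas.csv linked above; it fills the blank cells with the word null, drops the first column of record numbers, and writes out a cleaned file whose contents you can paste into RAW.

```r
# a minimal sketch of the same data munge in R (an alternative to the
# Google Sheets + Blank Detector route described above)
url <- "https://raw.githubusercontent.com/hist3907b-winter2015/module4-holes/master/texas.csv"
texas <- read.csv(url, stringsAsFactors = FALSE)

texas <- texas[ , -1]                          # drop the record-number column
texas[texas == "" | is.na(texas)] <- "null"    # fill empty cells with 'null'

# write it out; open the file and copy-paste its contents into RAW
write.csv(texas, "texas-cleaned.csv", row.names = FALSE)
```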

You can now copy and paste your table of data into the data input box in RAW, and you should get the green thumbs up saying x records have been successfully parsed!

Playing with RAW

RAW takes your data, and depending on your choices, passes it into chart templates built on the d3.js code library. D3.js is a powerful library for making all sorts of charts (including interactive ones). If this sort of thing interests you, you can follow the tutorials in Elijah Meeks' excellent new book.

With your data pasted in, you can now experiment with a number of different visualizations that are all built on the d3.js code library. Try the ‘alluvial’ diagram. Pick place1 and place2 as your dimensions - you click and drag the green boxes under 'map your data' into the 'steps' box. Leave the 'size' box empty. Under 'customize your visualization' you can click inside the 'width' box to make the diagram wider and more legible.

Does anything jump out? Try place3 and place4. Try place1, place2, place3, and place4 in a single alluvial diagram. When we look at the original letters, we see that the writer often identified the town in which he was writing and the town of the addressee. Why choose the third and fourth places? Perhaps it makes sense, for a given research question, to assume that with the pleasantries out of the way the writers will discuss the places important to their message. Experiment! This is one of the joys of working with data: experimenting to see how you can deform your materials to see them in a new light. (The quick cross-tab sketch below is one way to check which pairings dominate.)
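
As a quick sanity check on what the alluvial diagram is showing you, here is a minimal sketch in R that counts the most frequent place1 -> place2 pairings. It uses the cleaned file from the earlier sketch; the file and column names are assumptions, so adjust them to whatever you actually have.

```r
# a quick check: which place1 -> place2 pairings appear most often?
# (file and column names are assumptions -- adjust to your own data)
texas <- read.csv("texas-cleaned.csv", stringsAsFactors = FALSE)

pairings <- table(paste(texas$place1, "->", texas$place2))
head(sort(pairings, decreasing = TRUE), 10)   # the ten most common pairings
```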

You can export your visualization under the 'download' box at the bottom of the RAW page - your choices are as a simple raster image (png), a vector image (svg) or a data representation (json).


Exercise 8

Simple Mapping and Georectifying

In this exercise, you will find a historical map online, upload a copy to a mapwarper service, georectify it, and then display the map online, via a hosted service like CartoDB, and also through a map you will build yourself using leaflet.js. Finally, we will also convert csv to geojson using http://togeojson.com/, and we'll map that as a GitHub gist. We'll also grab a geojson file hosted on a GitHub gist and import it into CartoDB. (If you'd rather do the csv-to-geojson conversion in R, there's a minimal sketch just below.)
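
Here is that minimal sketch: an alternative to http://togeojson.com/ using the sf package in R. The file name and the column names 'place', 'lat', and 'long' are assumptions; adjust them to match your own csv.

```r
# a minimal sketch: convert a csv of points into GeoJSON with the sf package
# (install.packages("sf") first if you don't have it)
library(sf)

# assumes a csv with columns 'place', 'lat', and 'long' -- adjust to taste
places <- read.csv("my-places.csv", stringsAsFactors = FALSE)

pts <- st_as_sf(places, coords = c("long", "lat"), crs = 4326)
st_write(pts, "my-places.geojson", driver = "GeoJSON")
```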

Georectifying

Georectifying is the process of taking an image (whether it is of a historical map, chart, air photo, or whatever) and manipulating its geometry so that it matches a geographic projection. Think of it like this: you take your hand-drawn map and use pushpins to pin down known locations on your map to a globe. As you pin, your image stretches and warps. Traditionally, this has not been an easy thing to do if you are new to GIS; in recent years, the learning curve has flattened significantly. In this exercise, we'll grab an image, upload it to the Harvard Library MapWarper service, and then export it as a tileset which can be used in other mapping programs.

  1. Get a historical map. I like the Fire Insurance plans from the Gatineau Valley Historical Society; I'm sure you can find others to suit your interests.
  2. Right-click, 'save as...', and grab a copy. Save it somewhere handy.
  3. Go to Harvard World MapWarp and sign up for an account. Then log in.
  4. Go to the upload screen.
  5. Fill in as much of the metadata as you can. Then select your map from your computer, and upload it.
  6. On the next page, click 'rectify'.
  7. Pan and zoom both maps until you're sure you're looking at the same area in both. Double click in a map, select the pencil icon, and click on a point (location) you are sure you can match in the other window. Then click on the other map window, select the pencil, and then click on the same point. The 'add control point' button below and between both maps will light up. Click on this to confirm that this is a control point you want. Do this at least three times; the more times you can do it, the better the map warp.
  8. Having selected your control points, click on 'warp image'.
  9. You can now click on the 'export' panel and get the URL for your georectified image in a few different formats. If you clicked on the KML option, a Google Maps window will open like so. For many webmapping applications, the Tiles (Google/OSM scheme): Tiles Based URL is what you want. You'll get a URL like this: http://warp.worldmap.harvard.edu/maps/tile/4152/z/x/y.png . Save that info; you'll need it later (the sketch below shows one way to check that it works).
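
If you want to double-check that your tiles actually serve before moving on, here is a minimal sketch using the leaflet package in R. Swap in your own tiles-based URL from the export panel; note that the /z/x/y part of the URL becomes {z}/{x}/{y}, and the coordinates in setView are just a guess at Texas that you should adjust to wherever your map is.

```r
# a minimal sketch: preview your warped map tiles with the leaflet package in R
# (install.packages("leaflet") first if you don't have it)
library(leaflet)

# the URL below is the example from the tutorial -- substitute your own,
# with /z/x/y rewritten as {z}/{x}/{y}
my_tiles <- "http://warp.worldmap.harvard.edu/maps/tile/4152/{z}/{x}/{y}.png"

leaflet() %>%
  addTiles() %>%                                       # a standard OSM base layer
  addTiles(urlTemplate = my_tiles,
           options = tileOptions(opacity = 0.8)) %>%   # your georectified map on top
  setView(lng = -98, lat = 31, zoom = 6)               # roughly Texas; adjust to your map
```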

You have now georectified a map. Let's use that map as a base layer in Palladio.

We need some place data for Palladio; I'll be copying and pasting it directly in. (For more on how to input geographic data into Palladio, see this tutorial.) Note how the data needs to be formatted: basically, you want something like this:

Place Coordinates
Mexico 23.634501,-102.552784
California 36.778261,-119.4179324
Brazos 32.661389,-98.121667

etc.: that is, a tab between 'place' and 'coordinates' in the first line, a tab between 'mexico' and the latitude, and a comma between latitude and longitude.

  1. Go to Palladio. Hit 'start' then 'upload spreadsheet or csv'. In the box, paste in your data. You can progress to the next step without having any real data: just paste or type something in - see the video below. Obviously, you won't have any points on your map, but if you were having trouble with that step, this allows you to bypass it to continue on with this tutorial.
  2. Click on 'map'. Under 'places', select 'coordinates'. Then click 'add new layer'. In the popup, beside 'Choose one of Palladio default layers or create a new one.', select 'custom'. This is where you're going to paste in that tiles-based URL from the map warper. Paste it in, but replace the /z/x/y part with {z}/{x}/{y}. Click 'add'.

Here is a video walkthrough; places where you might have got into trouble include getting past the initial data entry box in Palladio, and finding exactly where to paste in your georectified map URL.

Congratulations! You've georectified a map, and used it as a base layer for a visualization of some point data. Here are some notes on using a georectified map with the CartoDB service.



Exercise 9

Network Analysis in R

Earlier, we took the index from the Texan Correspondence, a list of letters from so-and-so to so-and-so. When we stitch that together into a network of people connected because they exchanged letters, we end up with a shard of their social network. Networks can be queried for things like power, position, and role, and so, used judiciously, they can help us suss out something of the social structures in which this history took place. Before you go any further, make sure you also take a long look at Scott Weingart's series, Networks Demystified. Finally, heed our warning.

This exercise uses the R language to do our analysis, which in DHBox we access via RStudio, a programming environment. Please read this introduction to R and then progress to the exercise. (To give you a sense of where the exercise heads, there's a minimal sketch below.)
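
Here is that minimal sketch, using the igraph package. It assumes an edge list csv with 'source' and 'target' columns, which is roughly what you built from the Texan correspondence index; the file and column names are placeholders to adjust for your own data.

```r
# a minimal sketch of network analysis in R with the igraph package
# (install.packages("igraph") first if you don't have it)
library(igraph)

# assumes an edge list csv with 'source' and 'target' columns -- i.e. the
# writer and recipient of each letter; adjust names to your own data
edges <- read.csv("texas-letters-edgelist.csv", stringsAsFactors = FALSE)

g <- graph_from_data_frame(edges, directed = TRUE)

# who sits where in this network?
sort(degree(g), decreasing = TRUE)[1:10]        # the best-connected correspondents
sort(betweenness(g), decreasing = TRUE)[1:10]   # potential brokers between groups

plot(g, vertex.size = 5, vertex.label.cex = 0.7, edge.arrow.size = 0.3)
```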

Exercise 10

QGIS

There are many excellent tutorials around concerning how to get started with GIS. Our own library, in the MADGIC centre, has tremendous resources, and I would encourage you to speak with the map librarians before embarking on any serious mapping projects. In the short term, the historian Fred Gibbs has an excellent series on using the open-source GIS platform QGIS to make and map historical data.

For this exercise, I would recommend you try Gibbs' first tutorial,

'Making a map with QGIS'

...and then, try georectifying a historical map and adding it to your GIS:

'Using Historical maps with qgis'

Going Further

There are many tutorials at The Programming Historian that are appropriate here. Try some under the 'data manipulation' or 'distant reading' headings.