One Fine Day

So, you know how it goes. You have the best of intentions. You have the motivation to work on your new blog (a lot more than you used to, anyway), and you’re going to capitalize on it.

Well, another year has passed, and I’ve not touched this website, nor this blog.

In my defense, I’ve been working (somewhat) diligently on my dissertation, and I’m getting to the point where I can see the potential first rays of light shining from that oncoming train that is my graduation deadline. When things clarify a bit, when I can say for certain that it’s a train or the end of the tunnel…that’s when I’ll put some more effort into this enterprise.

Until then, I ask only for patience from the unending void of the universe that is my readership. I’ll get there one day.

Why I’m Moving to a CMS

For those of you who have been following the adventures of the #digimeth community here at the University of New Mexico, it should come as no surprise that I’ve been doing an awful lot of coding. The truth is that I don’t mind; there’s a kind of simple elegance to making useful code that does what I want it to do, and I can enjoy the act of creation when writing code like that.

The problem comes when not all code is like that. I’m not talking about the part where one wrestles with the code to make it do what one wants it to do, nor even the part where it’s difficult to integrate one part of one’s code with another part. I’m specifically speaking to the difficulties of maintaining code once it has ceased to be on the front burner. I’m talking about all of the ridiculous changes that need to be made just to move a collection of web pages to a subdirectory. Hand-coded websites are great and they show a dedication that is worthy of recognition, but let’s face it: if one is going to maintain a hand-coded website, it is highly unlikely that one will have the time to do anything worthy of mention on that website.

It’s for this reason that I’ve decided to move justinlarsen.net to a web content management system called Drupal. The fact is that I’ve been so impressed with the way that the folks over at Reclaim Hosting have integrated WordPress (what you’re reading now) that it only makes sense to shift a few things around and install another piece of software designed to make life a lot easier, at least as far as upkeep is concerned.

I do, of course, understand the value of making people learn to code their own HTML, and I believe that there is a great deal of value in doing things the hard way at first. Once that lesson is learned, however, it stands to reason that we should not reinvent the wheel. More importantly, there are a number of things one can learn from using content management software that one can’t learn when one is elbows-deep in angle brackets and hrefs. As this is the kind of software most of us will use once we leave the context of the course, gaining experience with it now has a real benefit. It’s also important conceptually; the way a CMS handles processes like publishing and editing differs in important ways from the method we’ve used so far of simply uploading text files.

Thus, sometime in the next two weeks I’ll be backing up everything (including this blog) and converting over to the new system. The truth of the matter is that I’ll be doing nothing much over the Christmas holiday, anyway; this should keep me occupied for at least a little while. I’ll document how it all goes here, of course, too.

Dots never felt like such an accomplishment

Behold:

The Travels of Widsith (through line 30)

This may not seem like much (trust me, I know), but this image above represents a real albatross to me, one that I look forward to removing from my neck so I can get back to the serious business of trying to make my website not suck (along with all the rest of the stuff I’m supposed to be doing).

There are a number of problems with this image that I could list, but the fact that I have dots on the map, and indeed that there is a map at all, kind of overshadows my disappointment in my own abilities here.  I know I’ve mentioned it before, but I’m not really geographically inclined.  I also have varying levels of patience with software that is designed for a very narrow user base, especially when that user base does not include me or anyone with whom I converse on a semi-regular basis.  Add in the fact that the software in question is open source, meaning that it is designed by people who donate their time to make it work, but don’t necessarily donate their time to make it usable, and we have a perfect storm, or at least the prime conditions for me to remove the software suite tout de suite.

Nonetheless, I seem to have found a way to make Quantum GIS (QGIS) put the things I want in a place I can live with, which is satisfactory.  Just don’t expect me to invite it over for dinner.

Ultimately, the process of getting those dots up there would have been a lot easier had I been smart and selected a map project that didn’t require me to delve as deeply into the research as mine did.  Indeed, the fact that there are so few dots on that map is indicative of the fact that I was only able to process the tribes from the first thirty or so lines of Widsith.  There are, in all seriousness, about one hundred more dots to place if I were to be completionist about this, which I might be as I have a bit more time.  It might even be fun to use some Javascript to make the thing interactive at some point, placing a copy of the poem next to the map and highlighting the location when the tribe’s name receives mouse focus or something.

Still, since we’re supposed to be showing our steps, I’ll start with this:

You’ll notice that I’ve got a decent amount of information in there.  I collated what I could glean from Kemp Malone’s study of the poem along with additional scholarship on the Web and then did my best to put dots in locations that seemed reasonable.  Let me reiterate: there is so much up in the air about all of this that any attempt to locate the majority of these tribes is an act of speculation at best.  The points on the map are there only to designate the most general places possible because I have neither the time nor the patience to learn how to work with areas and polygons.  The purpose was to give a general idea of how widely traveled Widsith claims to be; the dots do well enough to get that point across.

From there, it was easy enough to import the Google sheet into Google’s Maps Engine.

Well, sort of, anyway.  The data went in just fine, but certain parts, like the hyperlinks, didn’t convert for some reason.  I had to go in and manually add each of the hyperlinks to the data table of each group, being careful to put the right link in the right spot.  The fact that the data in the Maps Engine software had been moved around made it far more difficult than it needed to be, but I got it in there eventually, anyway.  To its credit, however, the software automatically created a hyperlink out of the URL I posted in each cell.

Afterwards, I had to get it over to QGIS.  I followed Jessica Troy’s advice and installed Google Earth before doing much else.  I then went back into Maps Engine Lite and exported the map as KML, and then opened the new KML file as a layer in Google Earth.  Voilà!  More dots!

My spreadsheet data as rendered by Google Earth

I then cast about, looking for a way to save the imported data in some format other than the one it arrived in.  I found nothing.  Curious about what would happen if I just loaded the KML file into QGIS directly, I plugged it in and then followed Jessica’s directions for getting a base map working.  I’m not sure whether having Google Earth installed made some sort of difference, but I was able to get the information into QGIS without a problem.  The results are what you see at the top of this article.
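
For anyone who wants to peek inside one of these KML exports before trusting them to QGIS, a few lines of Python will list the placemarks a file contains.  This is only a sketch; “widsith.kml” stands in for whatever filename Maps Engine actually produces.

import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

tree = ET.parse("widsith.kml")          # placeholder filename for the Maps Engine export
for placemark in tree.getroot().iter("{http://www.opengis.net/kml/2.2}Placemark"):
    name = placemark.find("kml:name", KML_NS)
    coords = placemark.find(".//kml:coordinates", KML_NS)
    if name is not None and coords is not None:
        lon, lat = coords.text.strip().split(",")[:2]   # KML stores longitude first
        print("%s: %s, %s" % (name.text, lat, lon))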

I’m not really all that interested in making a map that does fun, dynamic things online.  As much as it would be cool to be able to switch the different locations on and off by fit or thula, I think I’d rather concentrate on getting the data entered in as completely as possible.  Still, there must be projects for the future, as well, which merely reminds me to remain philosophical about GIS and its role in visualizing history and literature.

Encode This!

What is simultaneously a terrifically frustrating programming language to learn and a nigh-unredeemable horror flick starring Robert Englund (but without the really cool razorglove)? Give up?

It’s Python…er…Python!

The truth is, it’s not usually so bad. I’ve had much worse times learning things like Visual Basic and C++, so I guess I shouldn’t complain.  I’m just frustrated because I don’t have time to deal with something piddly like this: I have to figure out how to change the encoding on my files so I can use extended characters.

Why, you ask? Well, as I’m working with Old English in my webscraping/topic modeling project, I need to process texts that use characters that have disappeared from the language, specifically Æ (ash), Þ (thorn), and Ð (eth).  I also need to support both upper and lower case versions of these letter forms.  In the good news department, ᚹ (wynn, apparently not supported here, either) has already been converted to W in the source files I’ll be using.

So what now?  Well, I get to run a bunch of functions in Python that look like this:

import re

def add_ash(text):                            # 'text' rather than 'input', to avoid shadowing the builtin
    output = re.sub('&aelig', 'æ', text)      # swap the HTML entity for ash with the character itself
    return output

The problem, of course, is that Python 2 defaults to ASCII, meaning it only understands about 128 characters natively unless it’s told otherwise (a choice I understand, since the language is designed to be useful and low-overhead); I have to figure out how to change the encoding to UTF-8 or some such.

I honestly have no idea WHAT I need to change the encoding in (the entire script? the Python environment itself? Time and Space, perhaps?), nor how to go about doing so.  All I know is that the word “þæt” should not come out looking like “Ã¾Ã¦t” when I dump my data into MALLET.
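
From what I’ve been able to piece together so far, the answer may be all three at once: declare the encoding of the script itself, keep the text in unicode strings inside the script, and read and write the files with an explicit encoding instead of trusting the defaults.  Here is a minimal, untested sketch of that approach (the filenames are made up):

# -*- coding: utf-8 -*-
import io
import re

def add_ash(text):
    return re.sub(u'&aelig', u'æ', text)    # unicode in, unicode out

# read and write with an explicit encoding instead of the ASCII default
with io.open('widsith_raw.txt', encoding='utf-8') as infile:        # made-up filenames
    source = infile.read()

with io.open('widsith_clean.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(add_ash(source))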

Scraping Web Pages (But Not Off of the Kitchen Floor)*

Baby Steps by marisag on Flickr. Creative Commons license

Okay, so this is going to be a middle-of-the-process post as I attempt to make something useful using Python.  Specifically, I’m working on learning how to scrape web pages for content without going crazy or devoting more time to such a project than I would already use simply by going to each of these web pages and manually copying and pasting the text.  I say that it’s the middle of the process because I’ve already done some of the work (success!), although there’s still a lot to do and several technological problems to overcome first (not-yet-failures).

First, though, it’s probably a good idea for me to clarify the project so that I can talk lucidly about what I have not yet done.

Ultimately, this scraping project is about getting some texts together so that I can create some useful topic models for my own research on the Exeter Book. I’ve mentioned before that the corpus of surviving Anglo-Saxon poetry is fairly limited; we have a mere 30,000 lines or so, and those lines exist, for the most part, in four collections: the Junius Manuscript, the Vercelli Book, the Nowell Codex, and the Exeter Book.  Each of these books has been published several times, both as complete collections and as editions of individual poems.  This presents a problem insofar as the creation of topic models is dependent on great numbers of texts which, in this case at least, are not extant.

Well, the best thing I can do is work with what I have.  Jessica came across a pretty good source for the texts online, too: http://www.sacred-texts.org/neu/ascp/ seems to be pretty okay, and the fact is that I’m not at a point in the semester where I can go through and check their editing job, anyway.  This site certainly meets one criterion, though: the text isn’t bogged down with a lot of other material.  With relatively straightforward HTML, then, I should be able to scrape away with no issues!

With that in mind, I needed to develop a plan with definable goals so I could start getting done what needs to get done.  I came up with the following list of steps:

  1. Scrape the URLs of the child pages from the landing/parent page
  2. Remove the small number of unwanted URLs from the results
  3. Scrape the child pages
  4. Write the “scrapings” to a text file
  5. Profit!

So far, this is what I’ve done:

  1. Toss my current Python environment and start running Linux
  2. Install BeautifulSoup and a few other modules into Python
  3. Scrape the URLs of the child pages from the landing/parent page
  4. Write the scraped URLs to a text file

The first step was a little more work than I really wanted it to be; I’ve been running Python inside a Linux-like environment on Windows called Cygwin, but that has started to be more trouble than it’s worth.  For example, every tutorial I read on BeautifulSoup told me to start the script with the line

from bs4 import BeautifulSoup

The problem there is that Cygwin’s version of Python needed something different, specifically:

from BeautifulSoup import BeautifulSoup
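
In hindsight, one way to paper over the difference, assuming it simply comes down to an older BeautifulSoup package rather than anything deeper in Cygwin’s Python, would be a fallback import so the same script runs in both environments:

try:
    from bs4 import BeautifulSoup               # the current package name
except ImportError:
    from BeautifulSoup import BeautifulSoup     # the older name Cygwin wanted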

I arrived at the correct import through extensive trial and error, not through any documentation, which seems to be a real problem in Cygwin, because there are other differences in its Python, as well, such as module names written in camel case (capitalizing the first letter of each subsequent word) instead of the underscore-separated names used almost everywhere else.  There needed to be a better way.

Indeed there was; rather than give up on Linux and its advantages for doing things in Python, and rather than continuing to suffer through the Cygwin problem, I decided to install Ubuntu on a flash drive and run from that.  This solution is great because it doesn’t wipe my hard drive, it doesn’t cause the compatibility problems that Windows and Cygwin do, and this way I get to play with Linux while scraping away.

Once in my new environment, I was happy to see that I could install and run the various necessary modules with the same common commands listed in much of the documentation we’ve seen linked in class.  The next problem would be to use the modules to scrape the code from the parent page.  Cobbling together code from two or three tutorials, I ended up with a script that looks like this:

from bs4 import BeautifulSoup
import requests

url = "http://www.sacred-texts.org/neu/ascp"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")   # name the parser explicitly

# open the output file once and let the with-block close it when we're done
with open("target/targetURLs.txt", 'a') as target:   # forward slash: "\t" in "target\targetURLs.txt" is a tab character
    for link in soup.find_all('a'):
        address_text = link.get('href')
        if address_text:                    # skip anchors that have no href at all
            target.write(address_text)
            target.write("\n")

This worked straight off the bat.  It also provided me with a text file of the results, which means that I can go through and manually clean up the two or three additional links that I don’t need to follow when building my catalog of Old English poetry.

There you have it.  My next goal is to learn how to read through the text file and loop over each line so that it can function as a full URL.  Then it’s off to scraping each child page into another file and ultimately to cleaning up the text, although I may have a lead on the latter that will make things a lot easier, even if it is a little bit like cheating.  More later!
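
In case it helps anyone following along, here is a rough sketch of what I think that next step will look like: read targetURLs.txt line by line, glue each relative link onto the base address, pull down the child page, and dump its visible text into its own file.  Treat it as untested scaffolding rather than a finished script; the “scraped” folder is something I’d have to create first.

from bs4 import BeautifulSoup
import requests
import io
try:
    from urllib.parse import urljoin      # Python 3
except ImportError:
    from urlparse import urljoin          # Python 2

base_url = "http://www.sacred-texts.org/neu/ascp/"

# assumes the cleaned-up URL list from the last script and an existing scraped/ folder
with open("target/targetURLs.txt") as url_file:
    for line in url_file:
        relative = line.strip()
        if not relative:
            continue
        full_url = urljoin(base_url, relative)          # turn the relative link into a full URL
        page = requests.get(full_url)
        soup = BeautifulSoup(page.text, "html.parser")
        page_text = soup.get_text()                     # drop the tags, keep the words
        out_name = "scraped/" + relative.replace("/", "_") + ".txt"
        with io.open(out_name, "w", encoding="utf-8") as out_file:
            out_file.write(page_text)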

*In case you’re wondering, the title is a reference to one of the recent cartoons that marks the triumphant return of Homestar Runner, or in this case, the triumphant return of Strong Bad and the Cheat making music videos.

Mixing Methodology and Interdisciplinarity: Text Mining and the Study of Literature

As I’ve mentioned earlier, I enrolled in Digital Methods this semester for many reasons, but one of the big ones was to learn the technology necessary to do some serious text analysis.  Text Mining, and specifically Topic Modeling, is exactly what I want to deal with, and I’m quite excited to be able to start working with the tools that, at least hopefully, will allow me to perform a remarkable form of analysis that will inform my dissertation in new and compelling ways.  The trouble is that, as a student of literature, my method is usually set to focus on a very small number of texts (indeed, usually a single text); topic modeling, however, requires thousands of texts in order to be effective.  As a result, I need to deal with some pretty hairy problems.

They come in two basic groups: problems with acquiring texts and problems with rendering the texts.  Neither problem is unique to Anglo-Saxon studies, but there are some particular difficulties in both arenas that I think I’m going to have to ask for help with if I’m going to bring all this to a successful conclusion.

As far as acquiring texts is concerned, the fact is that one would expect such a thing to be easy.  After all, most of what we study as Anglo-Saxonists has been published over and over again for centuries.  Thorkelin’s first edition of Beowulf was published in 1815, almost 200 years ago, so one would expect that sort of thing to be readily available as a part of the public domain, right?

Well, for the most part, yes.  There are a lot of documents out there where people have edited one poem or another, and there are even websites that feature all of Anglo-Saxon poetry (approximately 30,000 lines, a rather meager sample size).  The problem here comes in the fact that it’s also all poetry.  If topic modeling works on the principle of words being found in association with one another, then the fact that Anglo-Saxon verse is built on variation might cause some problems, as it would for any analysis that attempted to build a series of categories around poetry alone.  Thus, in order to understand poetry, a large number of prose texts must also be found.

This leads into the second difficulty.  While I have no doubt that various software tools can learn to handle the special characters used in Anglo-Saxon manuscripts (ash “æ”, eth “ð”, thorn “þ”, wynn “ƿ”, and others), one must also allow for the fact that different editors have different editorial practices, including the use of diacritic marks (many editors differentiate between the short “a” and the long “ā”, for example, as it helps modern readers, who would not know the differences in pronunciation).  Since marks like the macron do not appear in manuscripts, there is no widespread standard as to their use, nor is there a way to ensure that decisions made by editors are perfectly reliable.  There is likewise no way to ensure that the texts even represent their native language in a standard way since spelling would not be normalized for several centuries.  Thus, any decision I make with regards to cleaning and standardizing the texts I include in my sample has the very real potential to cause false associations and false results.
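
To make at least one of these decisions concrete, here is the kind of normalization pass I have in mind, sketched in Python: decompose the text so that editorial diacritics like the macron become separate combining marks, strip those marks, and lowercase everything, while leaving æ, ð, þ, and ƿ alone because they are letters in their own right rather than accented forms.  Whether throwing away vowel length is defensible is exactly the kind of choice that could skew the model.

# -*- coding: utf-8 -*-
import unicodedata

def normalize_oe(text):
    decomposed = unicodedata.normalize('NFD', text)       # "ā" becomes "a" + combining macron
    stripped = u''.join(ch for ch in decomposed
                        if not unicodedata.combining(ch))  # drop the combining marks
    return stripped.lower()

print(normalize_oe(u'Hwæt! Wē Gār-Dena in geārdagum'))
# hwæt! we gar-dena in geardagum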

I have no idea how to deal with the complexities of this problem in a way that will work for my dissertation.  I can only sit back and learn about the process and its workings one step at a time, hoping that some sort of answer will occur to me.  It’s certainly possible that I’m just missing the obvious right now, too.  As I apply myself to tools like MALLET, I’ll be sure to be on the lookout.

Why I Hate to Wait: The Real Cost of Less-Than-Immediate Computing

"Lost" by Jose Maria Cuellar . Used under CC-BY-NC license.

“Lost” by Jose Maria Cuellar . Used under CC-BY-NC license.

If there is a single force on this Earth that unites us all, making clear that we are all the same regardless of our culture, gender, age, socioeconomic background, or language, it is that we hate to wait for technology.  Unfortunately, this is not going to be the force that eliminates war or poverty, but it does speak volumes about how we use these miraculous machines we’ve created and what we expect of them in our day-to-day interactions.  When we click, we want to see evidence that we clicked.  When we type, we expect characters to show up on the screen simultaneously.  Even in my earliest computer-related activities, I wondered why it took so long for the computer to do some things, while it seemed like it could handle anything I threw at it as fast as I could throw it in other situations (those of you who have used a BBS know exactly to which pain I am referring).

Interestingly, though, my problem isn’t one of simple, aggravated frustration.  Again, this is probably because I cut my digital teeth in a world where online speed was measured in baud and most data moved around in 720K chunks via sneakernet.  In those days, the expectations may have been low, but there were still basic things we expected computers to do, such as reflect our input.  When something happened and we were forced to wait (which happened a lot), we sometimes got frustrated with the machine itself and banged on it as if it were some sort of mechanical device (perhaps one of Grace Hopper’s bugs got in there somewhere?), but we never really walked away.  We never gave up because computers were just like that and we accepted them in all of their fickle imperfection because we didn’t know any better.

Now, that may sound like an advantage at first, and indeed it probably is when dealing with magic boxes one barely understands, but it does come with its own set of problems, especially given the current ubiquity and pervasiveness of the web.  In particular, when I have to wait for login information to be e-mailed to me by an automated system, I tend to wait for a few minutes, but invariably if it isn’t in my mailbox in under 180 seconds, I get distracted by something else and then completely forget that I ever signed up for an e-mail in the first place.  This is precisely the problem I’m having with a number of the services to which we were directed for text mining.  In the case of Bookworm, it’s been several days, and the only reason I remember that I signed up in the first place is that I wanted to be sure to write this blog post about my frustrations surrounding text mining.

So yes, in short, sometimes the biggest frustration we face in the realm of technology is not the software itself, but simply getting access to it.  I know that’s been my struggle on a number of occasions outside of the work we’re doing for Digital Methods, and it has popped up occasionally in #digimeth, as well.  As in everything else, persistence and purpose will win out in the end, but sometimes I wonder how much I haven’t learned just because I forgot I was interested in the first place.

A Minor Victory: OE, OCR, and Frequency Clouds Through Voyant

I thought it only right to post a counterpoint to my last little rant because “text mining” is one of the biggest reasons I decided to enroll in Digital Methods in the first place.  Although visualizing word frequencies and other elements is not necessarily what I would have put down as my most important priority, it is certainly satisfying to be able to share with you the very quick and interesting, if completely unscientific, results of dumping the entire plain text scan of Gollancz’s 1895 edition of the Exeter Book into Voyant.

As this edition has both Old English and 19th-century Modern English in play, we see an interesting mix here.  It seems that the most common OE words are, as one would expect, pronouns: se, ic, etc., although the negator “ne” is also “popular.”  As I continue to play around with this text, I’ll post anything new I discover.

Word cloud generated at voyant-tools.org from the Gollancz 1895 edition of the Exeter Book
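
If I ever want to double-check Voyant’s cloud, a few lines of Python will give me the same raw counts.  This is only a sketch; “exeter_gollancz.txt” is a stand-in for wherever the plain-text scan ends up living on my drive.

import io
import re
from collections import Counter

with io.open('exeter_gollancz.txt', encoding='utf-8') as f:     # placeholder filename
    text = f.read().lower()

words = re.findall(r'\w+', text, re.UNICODE)    # crude tokenization, good enough for a sanity check
counts = Counter(words)

for word, count in counts.most_common(20):
    print("%s\t%d" % (word, count))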

I just put a new portfolio project idea on the map

No, the jokes don’t get any better than that.

For those of you who don’t know, I’m working with the Exeter Book for my dissertation, which is a remarkable privilege and something that I hope I don’t hate doing after this whole process is over. As a result, I guess it’s a surprise to no one that I try to apply new digital tricks to what I read in the Book, and I think I may have finally found a way to make the study of KML worthwhile to me.

I’m going to try and map out the travels of Ƿidsið (“Widsith”).

The very name “Widsith” means “wide(ly) travel(ed),” which is as apt a description as one can give of the main character of the Anglo-Saxon poem. As a wandering scop, Widsith performs in numerous courts and halls, but he never stays put for long.  The poem itself is a catalog of these visits, often giving the names of the tribe and its leader.  My proposed mapping project will be to collect as much information as I can about the places mentioned in the poem and then put that information on a map in order to get a clear idea of the scope of his wanderings.

This really will be a great deal of work, as there are a lot of locations.  I’ll be doing a lot of normal, everyday research for it, such as digging in books or on the web for good information, but I’ll also be doing a great deal of research into the use of QGIS and KML.  It’s entirely possible that all of that work will end up producing nothing interesting, but I suspect otherwise: it will make for a very interesting map once the research has been done and I have a set of coordinates for each of the kingdoms in the list.
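
Just to give myself a target, here is a sketch of what the finished product might look like under the hood: a short Python script that takes a list of tribes with coordinates and writes them out as KML placemarks.  The two entries below are only placeholders; the zeros are not research results, just stand-ins for coordinates I have yet to find.

import io

# placeholder data: real coordinates will come from the research itself
tribes = [
    (u'Huns', 0.0, 0.0),
    (u'Goths', 0.0, 0.0),
]

placemarks = []
for name, lat, lon in tribes:
    placemarks.append(
        u'  <Placemark>\n'
        u'    <name>%s</name>\n'
        u'    <Point><coordinates>%f,%f</coordinates></Point>\n'
        u'  </Placemark>' % (name, lon, lat)       # KML wants longitude first
    )

kml = (u'<?xml version="1.0" encoding="UTF-8"?>\n'
       u'<kml xmlns="http://www.opengis.net/kml/2.2">\n'
       u'<Document>\n' + u'\n'.join(placemarks) + u'\n</Document>\n</kml>\n')

with io.open('widsith_travels.kml', 'w', encoding='utf-8') as f:
    f.write(kml)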

Now who wants to help me learn how to run the necessary software?