Mixing Methodology and Interdisciplinarity: Text Mining and the Study of Literature

As I’ve mentioned earlier, I enrolled in Digital Methods this semester for many reasons, but one of the biggest was to learn the technology necessary to do some serious text analysis. Text mining, and specifically topic modeling, is exactly what I want to work with, and I’m excited to start using the tools that will, I hope, allow me to perform a kind of analysis that informs my dissertation in new and compelling ways. The trouble is that, as a student of literature, my method usually focuses on a very small number of texts (indeed, usually a single text); topic modeling, however, requires thousands of texts in order to be effective. As a result, I need to deal with some pretty hairy problems.

They come in two basic groups: problems with acquiring texts and problems with rendering them. Neither is unique to Anglo-Saxon studies, but there are particular difficulties in both arenas that I think I’ll have to ask for help with if I’m going to bring all of this to a successful conclusion.

As far as acquiring texts is concerned, one would expect the task to be easy. After all, most of what we study as Anglo-Saxonists has been published over and over again for centuries. Thorkelin’s first edition of Beowulf appeared in 1815, almost 200 years ago, so one would expect that sort of thing to be readily available in the public domain, right?

Well, for the most part, yes. There are plenty of editions of one poem or another out there, and there are even websites that feature the whole of Anglo-Saxon poetry (approximately 30,000 lines, a rather meager sample size). The problem is that it is all poetry. Since topic modeling works on the principle of words being found in association with one another, the fact that Anglo-Saxon verse is built on variation might cause problems, as it would for any analysis that tried to build a set of categories around poetry alone. Thus, in order to understand the poetry, a large number of prose texts must also be found.

This leads into the second difficulty. While I have no doubt that various software tools can learn to handle the special characters used in Anglo-Saxon manuscripts (ash “æ”, eth “ð”, thorn “þ”, wynn “ƿ”, and others), one must also allow for the fact that different editors have different editorial practices, including the use of diacritics (many editors differentiate between the short “a” and the long “ā”, for example, as an aid to modern readers who would not otherwise know the difference in pronunciation). Since marks like the macron do not appear in the manuscripts, there is no widespread standard for their use, nor is there a way to ensure that editors’ decisions are perfectly reliable. There is likewise no way to ensure that the texts represent their language in a standard way, since spelling would not be normalized for several centuries. Thus, any decision I make with regard to cleaning and standardizing the texts in my sample has the very real potential to produce false associations and false results.
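To make the problem concrete, here is a minimal Python sketch of one possible cleaning decision (my own assumption about a reasonable first pass, not an established standard): strip the editorial macrons while leaving the genuinely manuscript characters like æ, ð, þ, and ƿ untouched.

```python
import unicodedata

def strip_editorial_macrons(text: str) -> str:
    """Remove long-vowel macrons (an editorial convention) from OE text."""
    # Decompose so that "ā" becomes "a" followed by a combining macron (U+0304);
    # æ, ð, þ, and ƿ have no decompositions, so they pass through untouched.
    decomposed = unicodedata.normalize("NFD", text)
    cleaned = "".join(ch for ch in decomposed if ch != "\u0304")
    return unicodedata.normalize("NFC", cleaned)

print(strip_editorial_macrons("hū ðā æþelingas ellen fremedon"))
# -> hu ða æþelingas ellen fremedon
```

Whether that is the right decision is exactly the open question; one could just as easily argue for keeping the macrons and normalizing the unmarked texts toward them.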

I have no idea how to deal with the complexities of this problem in a way that will work for my dissertation.  I can only sit back and learn about the process and its workings one step at a time, hoping that some sort of answer will occur to me.  It’s certainly possible that I’m just missing the obvious right now, too.  As I apply myself to tools like MALLET, I’ll be sure to be on the lookout.
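Just so I have something concrete to poke at while I learn MALLET proper, here is a toy topic-modeling sketch that uses Python’s gensim library as a stand-in (my assumption being that its LDA implementation is close enough to MALLET’s for a first experiment). The handful of made-up lines below obviously cannot substitute for the thousands of real texts the method demands.

```python
from gensim import corpora, models

# A toy corpus standing in for a large, mixed prose-and-verse collection.
documents = [
    "se cyning het þa beornas faran ofer sæ",
    "þa comon þa beornas ofer þone sæ to lande",
    "drihten is min hyrde and ic nan þing ne wanige",
    "god is min hyrde ne wanað me nan god",
]
texts = [doc.split() for doc in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Two topics over four tiny documents is absurd; the point is only the workflow.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=42)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```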

Why I Hate to Wait: The Real Cost of Less-Than-Immediate Computing

"Lost" by Jose Maria Cuellar . Used under CC-BY-NC license.

“Lost” by Jose Maria Cuellar . Used under CC-BY-NC license.

If there is a single force on this Earth that unites us all, making clear that we are all the same regardless of our culture, gender, age, socioeconomic background, or language, it is that we hate to wait for technology. Unfortunately, this is not going to be the force that eliminates war or poverty, but it does speak volumes about how we use these miraculous machines we’ve created and what we expect of them in our day-to-day interactions. When we click, we want to see evidence that we clicked. When we type, we expect characters to show up on the screen instantly. Even in my earliest computer-related activities, I wondered why it took the computer so long to do some things, while in other situations it seemed able to handle anything I threw at it as fast as I could throw it (those of you who have used a BBS know exactly to which pain I am referring).

Interestingly, though, my problem isn’t one of simple, aggravated frustration. Again, this is probably because I cut my digital teeth in a world where online speed was measured in baud and most data moved around in 720 K chunks via sneakernet. In those days, expectations may have been low, but there were still basic things we expected computers to do, such as reflect our input. When something happened and we were forced to wait (which happened a lot), we sometimes got frustrated with the machine itself and banged on it as if it were some sort of mechanical device (perhaps one of Grace Hopper’s bugs got in there somewhere?), but we never really walked away. We never gave up, because computers were just like that, and we accepted them in all of their fickle imperfection because we didn’t know any better.

Now, that may sound like an advantage at first, and it probably is when dealing with magic boxes one barely understands, but it comes with its own set of problems, especially given the current ubiquity of the web. In particular, when I have to wait for login information to be e-mailed to me by an automated system, I tend to wait for a few minutes, but if it isn’t in my mailbox within 180 seconds, I invariably get distracted by something else and then completely forget that I ever signed up in the first place. This is precisely the problem I’m having with a number of the services to which we were directed for text mining. In the case of Bookworm, it’s been several days, and the only reason I remember that I signed up at all is that I wanted to be sure to write this blog post about my frustrations surrounding text mining.

So yes, in short, sometimes the biggest frustration we face in the realm of technology is not the software itself, but simply getting access to it. I know that’s been my struggle on a number of occasions outside of the work we’re doing for Digital Methods, and it has popped up occasionally in #digimeth as well. As in everything else, persistence and purpose will win out in the end, but sometimes I wonder how much I haven’t learned just because I forgot I was interested in the first place.

A Minor Victory: OE, OCR, and Frequency Clouds Through Voyant

I thought it only right to post a counterpoint to my last little rant because “text mining” is one of the biggest reasons I decided to enroll in Digital Methods in the first place.  Although visualizing word frequencies and other elements is not necessarily what I would have put down as my most important priority, it is certainly satisfying to be able to share with you the very quick and interesting, if completely unscientific, results of dumping the entire plain text scan of Gollancz’s 1895 edition of the Exeter Book into Voyant.

As this edition has both Old English and 19th-century Modern English in play, we see an interesting mix here. It seems that the most common OE words are, as one would expect, pronouns: “se,” “ic,” etc., although the negator “ne” is also “popular.” As I continue to play around with this text, I’ll post anything new I discover.

Word cloud generated at voyant-tools.org from the Gollancz 1895 edition of the Exeter Book.
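For anyone who wants a rough, offline version of what Voyant shows, a simple frequency count in Python does much the same job; the file name below is just a placeholder for whatever plain-text scan you happen to have.

```python
from collections import Counter
import re

# Placeholder file name for a plain-text scan of the edition.
with open("exeter_gollancz_1895.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Keep runs of Latin letters plus the Old English characters æ, ð, þ, ƿ.
words = re.findall(r"[a-zæðþƿ]+", text)

for word, count in Counter(words).most_common(20):
    print(f"{word:>10}  {count}")
```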

I just put a new portfolio project idea on the map

No, the jokes don’t get any better than that.

For those of you who don’t know, I’m working with the Exeter Book for my dissertation, which is a remarkable privilege and something that I hope I don’t hate doing after this whole process is over. As a result, I guess it’s a surprise to no one that I try to apply new digital tricks to what I read in the Book, and I think I may have finally found a way to make the study of KML worthwhile to me.

I’m going to try and map out the travels of Ƿidsið (“Widsith”).

The very name “Widsith” means “wide(ly) travel(ed),” which is as apt a description as one can give of the main character of the Anglo-Saxon poem. A traveling scop, Widsith performs in numerous courts and halls but never stays put for long. The poem itself is a catalog of these visits, often giving the names of the tribe and its leader. My proposed mapping project is to collect as much information as I can about the places mentioned in the poem and then plot that information on a map in order to get a clear idea of the scope of his wanderings.

This really will be a great deal of work, as there are a lot of locations. I’ll be doing plenty of normal, everyday research for it, such as digging in books or on the web for good information, but I’ll also be doing a great deal of research into the use of QGIS and KML. It’s entirely possible that all of that work will produce nothing interesting, but I suspect otherwise: once the research is done and I have a set of coordinates for each of the kingdoms in the list, it should make for a very interesting map.
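To convince myself that the output format, at least, is not magic, here is a bare-bones Python sketch that writes a couple of placemarks into a KML file that QGIS or Google Earth can open. The names and coordinates are placeholders I invented for illustration, not research results.

```python
# Each stop is (name, longitude, latitude); these values are invented placeholders.
stops = [
    ("Hypothetical hall A", 9.5, 55.5),
    ("Hypothetical hall B", 12.1, 54.1),
]

placemarks = "\n".join(
    f"""  <Placemark>
    <name>{name}</name>
    <Point><coordinates>{lon},{lat},0</coordinates></Point>
  </Placemark>"""
    for name, lon, lat in stops
)

kml = f"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
{placemarks}
</Document>
</kml>"""

with open("widsith_travels.kml", "w", encoding="utf-8") as f:
    f.write(kml)
```

The real work, of course, is filling that list with defensible coordinates for tribes that have been gone for well over a millennium.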

Now who wants to help me learn how to run the necessary software?

The Best Laid Plans of Mice and Medievalists…

There can really be no doubt about which of the #digimeth technologies I’m weakest with; I’ve messed around with all sorts of programming languages and I’ve used many different types of software over the years with varying levels of success, but I’ve never been good with geospatial stuff. As I tweeted before, I’m bad enough when I have to read a map; I’m pretty sure it’s a bad idea to ask me to make one. As a result, I find that I’m falling behind in my plan to finish most of the portfolio work two or three weeks before the end of the semester (necessary because I have a number of other important projects on which I’ll need to focus at that point). Being behind on that, as well as on my various other responsibilities, like grading, hasn’t been good for my blood pressure, which my doctors have already been pushing me to reduce. In essence, although it’s happening a bit later than I thought it would, my semester is certainly now spiraling out of control.

That isn’t my big failure, though.

If I’ve messed up on anything this semester, it’s that I’ve not capitalized on the excellent resources readily at hand; indeed, if I could change anything about the class so far, it would be the fact that I’ve not yet asked one of my fellow students for help. There are, of course, myriad reasons for this, but none supersedes the simple fact that I have yet to reach out to the others who seem to really “get” that particular technology. My sad attempt at putting together a work group via Twitter didn’t even get off the ground (I’m not aware that I received a single response), so I let it fade into the obscurity of the feed’s past and did nothing to revive the effort later. I could have been light-years ahead of where I am now if I had just followed through and asked for help again.

This may be a general pattern in the way I run my life; I wouldn’t be the first academic who bit off more than he or she could chew in a semester, nor would I be the first to try to keep things together far beyond the point when it became clear that doing so was not just impossible but actively destructive.

Thus, I’m changing my tactics.  Tonight, I post this first “celebrate failure” post along with a tweet that will call, once again, for a study or work group of some sort.  I will list what I’d like to accomplish for myself instead of just vaguely and weakly suggesting we all get together and poke at our computers until they do something interesting.  I’m sure I’m not the only one who has problems that could be better solved by having two or three fresh sets of eyes looking over my code, and I hope that I can be of service helping the rest of the class.