Mixing Methodology and Interdisciplinarity: Text Mining and the Study of Literature

As I’ve mentioned earlier, I enrolled in Digital Methods this semester for many reasons, but one of the big ones was to learn the technology necessary to do some serious text analysis.  Text mining, and specifically topic modeling, are exactly what I want to work with, and I’m excited to start using the tools that, I hope, will allow me to perform a kind of analysis that can inform my dissertation in new and compelling ways.  The trouble is that, as a student of literature, my method usually focuses on a very small number of texts (indeed, usually a single text); topic modeling, however, requires thousands of texts in order to be effective.  As a result, I need to deal with some pretty hairy problems.

They fall into two basic groups: problems with acquiring texts and problems with preparing those texts for analysis.  Neither problem is unique to Anglo-Saxon studies, but both present particular difficulties that I think I’m going to have to ask for help with if I’m going to bring all this to a successful conclusion.

As far as acquiring texts is concerned, one would expect such a thing to be easy.  After all, most of what we study as Anglo-Saxonists has been published over and over again for centuries.  Thorkelin’s first edition of Beowulf was published in 1815, almost 200 years ago, so one would expect that sort of thing to be readily available as part of the public domain, right?

Well, for the most part, yes.  There are plenty of editions of individual poems out there, and there are even websites that feature the whole of Anglo-Saxon poetry (approximately 30,000 lines, a rather meager sample size).  The problem is that it is all poetry.  Topic modeling works on the principle of words being found in association with one another, but Anglo-Saxon verse is built on variation: the same referent is restated in strings of synonyms and epithets, so a model trained on verse alone risks clustering poetic synonyms rather than genuine subjects.  That would be a problem for any analysis that attempted to build a series of categories around poetry alone.  Thus, in order to understand the poetry, a large number of prose texts must also be found.
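
To make the co-occurrence principle concrete, here is a minimal sketch of the kind of bag-of-words topic model I have in mind, written in Python with the gensim library.  The toy documents, the number of topics, and every parameter value are placeholders of my own, not anything drawn from a real corpus:

```python
from gensim import corpora, models

# Toy corpus: each document is a list of tokens. A real run would use
# tokenized Old English prose and verse, one list per text.
texts = [
    ["cyning", "beag", "gifan", "heall"],
    ["beag", "gold", "hord", "cyning"],
    ["sae", "scip", "segl", "flod"],
    ["scip", "sae", "flod", "segl"],
]

# Map each word to an integer id and count occurrences per document.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# LDA infers topics from which words tend to co-occur; with four tiny
# documents this is purely illustrative.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=42)

for topic_id, topic_words in lda.print_topics(num_words=4):
    print(topic_id, topic_words)
```

Even in this toy, the model can only group words that actually appear together, which is exactly why a verse-only corpus full of variation worries me.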

This leads to the second difficulty.  While I have no doubt that various software tools can be made to handle the special characters used in Anglo-Saxon manuscripts (ash “æ”, eth “ð”, thorn “þ”, wynn “ƿ”, and others), one must also allow for the fact that different editors follow different editorial practices, including in their use of diacritics (many editors differentiate between the short “a” and the long “ā”, for example, as it helps modern readers who would not otherwise know the difference in pronunciation).  Since marks like the macron do not appear in the manuscripts, there is no widespread standard for their use, nor is there a way to ensure that editors’ decisions are perfectly reliable.  There is likewise no way to ensure that the texts represent their language in a standard way, since spelling would not be normalized for several centuries after these texts were written.  Thus, any decision I make with regard to cleaning and standardizing the texts in my sample has the very real potential to introduce false associations and false results.
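
To show what one such cleaning decision might look like, here is a sketch of a possible (and very debatable) normalization pass in Python: strip editorial diacritics like the macron, leave ash, eth, and thorn alone, and fold wynn into “w”.  Every one of these choices is an assumption of exactly the kind I’m worried about, so this is an illustration, not a recommendation:

```python
import unicodedata

# Fold wynn into "w"; ash, eth, and thorn are left untouched on the
# (debatable) assumption that they are meaningful letters rather than
# editorial marks.
CHAR_MAP = {"ƿ": "w", "Ƿ": "W"}

def normalize(text: str) -> str:
    """Strip combining diacritics (e.g. macrons) and fold wynn to w."""
    # NFD splits a long vowel like "ā" into "a" plus a combining
    # macron, but leaves single-codepoint letters like "æ", "ð",
    # and "þ" as they are.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(CHAR_MAP.get(c, c) for c in stripped)

print(normalize("Hƿæt! Wē Gār-Dena in ġeārdagum"))
# -> Hwæt! We Gar-Dena in geardagum
```

Note that this collapses the long/short vowel distinction entirely, which is precisely the sort of flattening that could create false associations across differently edited texts.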

I have no idea how to deal with the complexities of this problem in a way that will work for my dissertation.  I can only sit back and learn about the process and its workings one step at a time, hoping that some sort of answer will occur to me.  It’s certainly possible that I’m just missing the obvious right now, too.  As I apply myself to tools like MALLET, I’ll be sure to be on the lookout.
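
For reference as I start experimenting, here is a sketch of what I understand the standard MALLET workflow to be, driven from Python with subprocess.  The directory names, file names, topic count, and iteration count are all placeholders I’ve assumed, not values from any real run:

```python
import subprocess

# Step 1: import a directory of plain-text files (one text per file)
# into MALLET's binary format. --keep-sequence preserves word order,
# which the topic trainer requires. MALLET's default stopword list is
# Modern English, so an Old English stoplist would have to be supplied
# separately via --stoplist-file.
subprocess.run([
    "bin/mallet", "import-dir",
    "--input", "texts/",              # placeholder corpus directory
    "--output", "corpus.mallet",
    "--keep-sequence",
], check=True)

# Step 2: train an LDA topic model, writing the top words per topic
# and the per-document topic proportions to text files.
subprocess.run([
    "bin/mallet", "train-topics",
    "--input", "corpus.mallet",
    "--num-topics", "20",             # placeholder; needs tuning
    "--num-iterations", "1000",
    "--output-topic-keys", "topic_keys.txt",
    "--output-doc-topics", "doc_topics.txt",
], check=True)
```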
