Encode This!

What is simultaneously a terrifically frustrating programming language to learn and a nigh-unredeemable horror flick starring Robert Englund (but without the really cool razorglove)? Give up?

It’s Python…er…Python!

The truth is, it’s not usually so bad. I’ve had much worse times learning things like Visual Basic and C++, so I guess I shouldn’t complain. I’m just frustrated because I don’t have time to deal with something piddly like this: I have to figure out how to change the encoding on my files so I can use extended characters.

Why, you ask? Well, as I’m working with Old English in my webscraping/topic modeling project, I need to process texts that use characters that have disappeared from the language, specifically Æ (ash), Þ (thorn), and Ð (eth).  I also need to support both upper and lower case versions of these letter forms.  In the good news department, ᚹ (wynn, apparently not supported here, either) has already been converted to W in the source files I’ll be using.

So what now?  Well, I get to run a bunch of functions in Python that look like this:

import re

def add_ash(text):
    # Swap the HTML entity for ash (with or without its trailing semicolon) for the real character
    return re.sub('&aelig;?', 'æ', text)
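
Since I need thorn and eth as well, in both cases, one function per letter gets old fast. Here is a minimal sketch of doing all six in one pass, assuming Python 3 and assuming the source files use the standard HTML entity names. The names ENTITY_TO_CHAR and replace_old_english_entities are made up for illustration.

```python
import html
import re

# Hypothetical lookup table: standard HTML entity names mapped to the
# Old English characters, upper and lower case.
ENTITY_TO_CHAR = {
    '&AElig;': 'Æ', '&aelig;': 'æ',
    '&THORN;': 'Þ', '&thorn;': 'þ',
    '&ETH;': 'Ð', '&eth;': 'ð',
}

# One case-sensitive pattern that matches any entity in the table.
ENTITY_PATTERN = re.compile('|'.join(re.escape(e) for e in ENTITY_TO_CHAR))

def replace_old_english_entities(text):
    # Look up each matched entity and substitute the real character.
    return ENTITY_PATTERN.sub(lambda m: ENTITY_TO_CHAR[m.group(0)], text)

sample = '&thorn;&aelig;t &ETH;&aelig;r'
converted = replace_old_english_entities(sample)

# The standard library can also decode every HTML entity at once,
# which may be more than you want if other entities should stay put.
converted_all = html.unescape(sample)
```

The explicit table only touches the six entities I care about, while html.unescape converts anything it recognizes; which one fits depends on what else is lurking in the source files.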

The problem, of course, is that Python (the 2.x line, at any rate) assumes ASCII by default, meaning it only understands 128 characters out of the box (a move that I understand since the language is designed to be useful and low-overhead); I have to figure out how to change the encoding to UTF-8 or somesuch.

I honestly have no idea WHAT I need to change the encoding in (the entire script? the Python environment itself? Time and Space, perhaps?), nor how to go about doing so. All I know is that the word “þæt” should not come out looking like “Ã¾Ã¦t” when I dump my data into Mallet.
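
As far as I can tell, the likely answer is "two places": the script itself (a coding declaration on the first line, which Python 2 needs for non-ASCII literals and Python 3 treats as the default anyway) and every spot where a file is read or written. What follows is only a sketch under those assumptions. The helpers write_utf8 and read_text and the file name sample.txt are hypothetical, and it says nothing about how Mallet itself is configured.

```python
# -*- coding: utf-8 -*-
# The line above declares the encoding of this source file (PEP 263).
import io
import os
import tempfile

def write_utf8(path, text):
    # Hypothetical helper: always encode output as UTF-8.
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)

def read_text(path, encoding='utf-8'):
    # Hypothetical helper: decode the bytes on disk with the stated encoding.
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()

# Demonstration with a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
write_utf8(demo_path, u'þæt')

good = read_text(demo_path)                 # decoded as UTF-8
garbled = read_text(demo_path, 'latin-1')   # decoded with the wrong encoding
```

Here good comes back as “þæt”, while garbled comes back as “Ã¾Ã¦t”: the bytes on disk are identical, and the garbage appears only because they were decoded as Latin-1 instead of UTF-8. So the file may be fine and the reader on the other end may simply be guessing the wrong encoding.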
