Scraping Web Pages (But Not Off of the Kitchen Floor)*

Baby Steps by marisag on Flickr

Baby Steps by marisag on Flickr. Creative Commons license

Okay, so this is going to be a middle-of-the-process post as I attempt to make something useful using Python.  Specifically, I’m working on learning how to scrape web pages for content without going crazy or devoting more time to such a project than I would already use simply by going to each of these web pages and manually copying and pasting the text.  I say that it’s the middle of the process because I’ve already done some of the work (success!), although there’s still a lot to do and several technological problems to overcome first (not-yet-failures).

First, though, it’s probably a good idea for me to clarify the project so that I can talk lucidly about what I have not yet done.

Ultimately, this scraping project is about getting some texts together so that I can create some useful topic models for my own research on the Exeter Book. I’ve mentioned before that the corpus of surviving Anglo-Saxon poetry is fairly limited; we have a mere 30,000 lines or so, and those lines exist, for the most part, in four collections: the Junius Manuscript, the Vercelli Book, the Nowell Codex, and the Exeter Book.  Each of these books has been published several times, both as complete collections and as editions of individual poems.  This presents a problem insofar as the creation of topic models are dependent on great numbers of texts which, in this case at least, are not extant.

Well, the best thing I can do is work with what I have.  Jessica came across a pretty good source for the texts online, too: seems to be pretty okay, and the fact is that I’m not at a point in the semester where I can go through and check their editing job, anyway.  This site certainly meets one criterion, though: the text isn’t bogged down with a lot of other material.  With relatively straightforward HTML, then, I should be able to scrape away with no issues!

With that in mind, I needed to develop a plan with definable goals so I could start getting done what needs to get done.  I came up with the following list of steps:

  1. Scrape the URLs of the child pages from the landing/parent page
  2. Remove the small number of unwanted URLs from the results
  3. Scrape the child pages
  4. Write the “scrapings” to a text file
  5. Profit!

So far, this is what I’ve done:

  1. Toss my current Python environment and start running Linux
  2. Install BeautifulSoup and a few other modules into Python
  3. Scrape the URLs of the child paes from the landing/parent page
  4. Write the scraped URLs to a text file

The first step was a little more work than I really wanted it to be; I’ve been running Python inside a Linux-like environment on Windows called Cygwin, but that has started to be more trouble than it’s worth.  For example, every tutorial I read on BeautifulSoup told me to start the script with the line

from bs4 import BeautifulSoup

The problem there is that Cygwin’s version of Python needed something different, specifically:

from BeautifulSoup import BeautifulSoup

I arrived at this information through extensive trial and error, not through any documentation, which is a real problem in Cygwin, it seems, because there are other differences in Python, as well, such as the use of module names written in camel-caps (capitalizing the first letter of the second word instead of using a space or underscore) instead of the otherwise-universally used underscore method of calling modules.  There needed to be a better way.

Indeed there was; rather than give up on Linux and its advantages for doing things in Python, and rather than continuing to suffer through the Cygwin problem, I decided to install Ubuntu on a flash drive and run from that.  This solution is great because it doesn’t wipe my hard drive at all, it doesn’t cause any problems as far as compatibility is concerned (as Windows or Cygwin do), and this way I get to play with Linux while scraping away.

Once in my new environment, I was happy to see that I could install and run the various necessary modules with the same common commands listed in much of the documentation we’ve seen linked in class.  The next problem would be to use the modules to scrape the code from the parent page.  Cobbling together code from two or three tutorials, I ended up with a script that looks like this:

from bs4 import BeautifulSoup
import requests

url = ""
r  = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

for link in soup.find_all('a'):
	address_text = link.get('href')
	target = open("target\targetURLs.txt", 'a')

This worked straight off the bat.  It also provided me with a text file of the results, which means that I can go through and manually clean up the two or three additional links that I don’t need to follow when building my catalog of Old English poetry.

There you have it.  My next goal is to learn how to read through the text file and load each line into a loop so that it can function as a full URL.  Then it’s off to scraping to another file and ultimately to cleaning up the text, although I may have a lead on the latter that will make things a lot easier, even it it is a little bit like cheating.  More later!

*In case you’re wondering, the title is a reference to one of the recent cartoons that marks the triumphant return of Homestar Runner, or in this case, the triumphant return of Strong Bad and the Cheat making music videos.

Leave a Reply

Your email address will not be published. Required fields are marked *