Web Scraping for Fun
This post is all about how to scrape the web with Ruby. I’ll be covering the four main ways to interact with a web server and get the data you want.
Best Case Scenario
This is the absolute dream: the site gives you a nice JSON API, so you don’t need anything outside of the standard library (though something like Curb could also be used to handle cases where the HTTP requests go bad).
Sadly not all websites give you a nice API like this, or sometimes it’s only accessible for a fee (lots of sports data is like this).
So, let’s look at a bit of code that fetches a random image from Reddit:
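Based on the walkthrough that follows, a minimal sketch of the script might look something like this (the subreddit URL and output file name are placeholders rather than whatever the original used):

```ruby
require 'json'
require 'open-uri'

# Placeholder subreddit; any image-heavy subreddit's .json listing works the same way.
SUBREDDIT_URL = 'https://www.reddit.com/r/earthporn.json'.freeze

# Fetch the listing, waiting a minute and retrying if Reddit rate limits us.
def get_threads
  # On older Rubies (which this post targets) you could call plain open(url);
  # URI.open is the modern spelling.
  JSON.parse(URI.open(SUBREDDIT_URL).read)
rescue OpenURI::HTTPError => e
  raise unless e.message.start_with?('429')
  sleep 60
  retry
end

# Pick a random thread that hosts its image directly on i.redd.it.
def get_random_thread(threads)
  threads['data']['children']
    .map { |thread| thread['data']['url'] if thread['data']['domain'] == 'i.redd.it' }
    .compact
    .sample
end

# Copy the response out to disk in chunks instead of slurping it into one big string.
def download(url, filename)
  URI.open(url) do |remote|
    File.open(filename, 'wb') do |file|
      IO.copy_stream(remote, file)
    end
  end
end

download(get_random_thread(get_threads), 'background.jpg')
```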
The first thing you’ll notice is that we require `json` and `open-uri`, which are both part of the standard library. `json` obviously gives us all the tools for parsing and creating JSON objects. `open-uri` allows us to pass URLs to `open` and read them just like we would a local file, which is really cool because it leads to some very clean and simple code.
`get_threads` is a funny little method; it was initially called `try_really_hard_to_get_threads` but I thought that wasn’t very professional, so I renamed it. This is a very simplistic implementation that trusts I’m always going to get back the JSON I expect or that there will be an `HTTPError` I can respond to. The reason I catch the 429 here is that Reddit will very frequently return `429 Too Many Requests`, so we just wait a minute and try again.
The next method, `get_random_thread`, is all about pulling out the relevant data from the API response. The parts of the JSON we care about are structured like this (trimmed down to just the fields the script uses):
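```json
{
  "data": {
    "children": [
      {
        "data": {
          "domain": "i.redd.it",
          "url": "https://i.redd.it/example.jpg"
        }
      }
    ]
  }
}
```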
We loop through every thread and keep only the ones with the domain `i.redd.it`, because that’s the one we know we can download really easily. When we match on the domain we return the url from the `map` block. At the end the result of `map` looks like `[nil, url, nil, url]`, so we call `compact` on it to get rid of the `nil` results. We then call `sample` to get a single random result to return.
`download` is slightly over-engineered to give you an example of how to stream a download to disk; this function was originally just:
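```ruby
# A naive version (reconstructed from the description): read everything, then write it out.
def download(url, filename)
  File.binwrite(filename, URI.open(url).read)
end
```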
but that requires reading the entire download into memory before flushing it to disk, which is very bad if you want to download anything of a decent size.
For the full version of this script, which does some nice naming of the file and actually sets it as the background, you can check it out on GitHub here.
Usual Scenario
This is when there is no API but all you need to do is parse some HTML and turn it into data. 90% of the scripts I write fall into this and the previous category. I very rarely need to touch the last two but they’re still a great learning experience.
OpenGraph data / Twitter card
We’ll start off with a really simple example to get the Open Graph data from sites.
This time we’re using the Nokogiri gem, which will handle parsing the HTML and let us navigate it using CSS selectors and Ruby itself.
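A compact sketch of the kind of script being described might look like this (the URL is just an example; stripping the `og:` prefix matches the output shown further down):

```ruby
require 'nokogiri'
require 'open-uri'

# Fetch a page and return its Open Graph data as a hash.
def open_graph_data(url)
  page = Nokogiri::HTML(URI.open(url).read)

  page.css('meta[property^="og:"]').map do |meta|
    # Strip the "og:" prefix so the keys read nicely, e.g. "title" => "..."
    [meta['property'].sub('og:', ''), meta['content']]
  end.to_h
end

puts open_graph_data('https://example.com/some-article')
```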
The first thing we do is download the webpage and pass it into the Nokogiri HTML parser. This gives us an object which we can query against in really nice ways.
In the example we use a single CSS query, `meta[property^="og:"]`, which, if you’re not very familiar with CSS, means “all `meta` elements with a `property` attribute that starts with ‘og:’”.
So if our HTML looks like this:
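```html
<!-- Illustrative head: the og: values are made up to match the example output below. -->
<head>
  <meta charset="utf-8">
  <meta name="description" content="A blog post">
  <meta property="og:title" content="A Twitter for My Sister">
  <meta property="og:type" content="article">
  <meta property="og:url" content="https://example.com/a-twitter-for-my-sister">
  <meta property="og:image" content="https://example.com/images/sister.jpg">
</head>
```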
It will only return the last four `meta` tags.
Next we loop over all the results and turn them into an array. For each element we return an array that looks like `['title', 'A Twitter for My Sister']`, so the result of the map is:
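```ruby
# Using the example head above (the values are illustrative):
[
  ['title', 'A Twitter for My Sister'],
  ['type', 'article'],
  ['url', 'https://example.com/a-twitter-for-my-sister'],
  ['image', 'https://example.com/images/sister.jpg']
]
```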
Now, when you call `to_h` on an array of two-element arrays it turns it into a hash like this:
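```ruby
# The same pairs, now keyed by name.
{
  'title' => 'A Twitter for My Sister',
  'type'  => 'article',
  'url'   => 'https://example.com/a-twitter-for-my-sister',
  'image' => 'https://example.com/images/sister.jpg'
}
```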
So this little method has given us a pretty hash of all the Open Graph data for this page. This is useful for things like forums, where you can include a little more information when a user submits a link.
I’ve got an example of this script which also handles Twitter card data, you can find it on GitHub here.
When the data isn’t very pretty
A lot of the time you’ll be working with fairly awkward data. For instance, this little script gets the front page of Hacker News and returns hashes of each submission.
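Based on the walkthrough below, the script would be shaped roughly like this; the CSS classes (`.rank`, `.title`, `.hnuser`) reflect my reading of the Hacker News markup, so treat this as a sketch rather than the exact original:

```ruby
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(URI.open('https://news.ycombinator.com/').read)

submissions = page.css('.itemlist tr').each_slice(3).map do |rows|
  # The final slice is the "more" link and a spacer, not a real submission.
  next unless rows.length == 3

  # Grab the user link up front -- it won't exist for job postings.
  user_element = rows[1].at_css('.hnuser')
  story_link   = rows[0].at_css('.title a')

  {
    rank:     rows[0].at_css('.rank').text.delete('.').to_i,
    title:    story_link.text,
    url:      story_link['href'],
    user:     user_element ? user_element.text : nil,
    comments: rows[1].css('a').last.text.to_i
  }
end.compact

submissions.each { |submission| p submission }
```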
Let’s break this down piece by piece. The first thing we do is fetch the HTML and initialize a Nokogiri object.
Secondly we select all elements which match `.itemlist tr` and we loop over them three at a time. The reason for this is that all the data for a single submission is contained in three table rows. This is done using the lovely method `each_slice` from Ruby core.
So the first thing we do is check that we actually have three elements; the reason for this is that the last two rows are actually the “more” link at the bottom, so we want to ignore that and skip over it.
Next we grab the user element. The reason I’ve done it like this is that sometimes this element won’t exist, and defining it here makes it a bit cleaner later on. The `at_css` method returns the first element that matches the selector, which is handy when you know you only have one or know you want the first.
After that we start populating the hash for this submission. I’m going to go through this quickly as it’s all pretty self-explanatory, with only minor differences between the fields. For the rank we look for an element with the `rank` class on it and get its text contents; we then remove any `.` from it and turn it into an integer.
Next up we get the story details; we only care about the `a` element under the element with the `title` class. We grab both the text and the link. When you have a single element, Nokogiri lets you access the attributes on the element like a hash, which is really nice.
This is why we got the `user_element` earlier: we don’t have `try` since we haven’t included `ActiveSupport`, so we just do a simple ternary. I could have done `user_element&.text`, which was introduced in Ruby 2.3, but I wanted to remain compatible with Ruby 2.2 since it’s still supported.
And lastly we want to get information about the comments; here we use the `css` method so we can get the last matching element. Here I use a bit of trickery with `to_i`: if you pass in something like `123blah456` you’ll get `123`. This is because `to_i` will stop converting to an integer at the very first non-digit character, and if the first non-whitespace character it encounters is not a digit, it’ll return zero. For example:
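```ruby
'123blah456'.to_i    # => 123
'  42 comments'.to_i # => 42  (leading whitespace is fine)
'blah123'.to_i       # => 0   (first non-whitespace character isn't a digit)
```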
Posting and Sessions
When you need to keep your session data and cookies it can be troublesome to use the more lightweight approaches above. Using Mechanize is a good way to handle it.
Let’s have a look at the code below:
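Roughly along these lines; the demo URL, form field names, and flash selector are placeholders you’d want to check against the real Thredded markup:

```ruby
require 'mechanize'

session = Mechanize.new

# Assumed sign-in URL for the Thredded demo -- adjust to the real page.
session.get('https://thredded.org/')

sign_in_form = session.current_page.forms.last
sign_in_form.fields.first.value = 'Jane'                 # set our name; the admin box is already ticked
page = sign_in_form.submit(sign_in_form.buttons.first)   # click the 'Sign in' button

# Mechanize uses Nokogiri under the hood, so we can query the result with CSS.
puts page.css('.flash-message').map(&:text)

# The 9th link on the page takes us into the Off-Topic area.
off_topic = page.links[8].click

topic_form = off_topic.forms.last
topic_form['topic[title]']   = 'Hello from Mechanize'          # illustrative field names
topic_form['topic[content]'] = 'Just testing posting from a script.'
topic_form.submit
```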
For this demo we’re using the demo site of Thredded, which is a simple Rails-backed forum that is mobile friendly. The reason for choosing Thredded is that its demo site doesn’t require email verification, a captcha, or even a password!
We start off by requiring Mechanize and then go on to initialize a new instance of it.
Next we use `session.get` to go to the login page of the Thredded demo. From there we use `session.forms.last` to get a `Mechanize::Form` object we can populate with our data. For this example we set our name to ‘Jane’ and we don’t touch the admin checkbox since it’s already set to true. Then we click the ‘Sign in’ button.
Now, just to confirm we’ve logged in, we spit out what we find in the flash messages. You may have recognised the `css` method used here; that’s because behind the scenes Mechanize uses Nokogiri for HTML parsing.
So you can treat it like my examples up above once you’ve gotten to the page you’re interested in.
We then click the 9th link on the page to take us into the Off-Topic area of the forum.
And finally we get the form responsible for creating a topic and populate it with a bit of data, just like we did the login form, then we submit. If you go to the Off-Topic category you should be able to find the thread created, but only for a little while since the demo site refreshes its database regularly.
Worst Case Scenario
I consider this the worst case scenario, and the only time I see it as actually necessary is when you are running a JavaScript testing framework from RSpec.
The gem used here is Selenium and it lets you interact with websites using an actual web browser. Chrome, Firefox, Safari, and Internet Explorer are all supported. About a month ago I would have recommended using PhantomJS, but since it’s been deprecated I can no longer suggest it.
I’m going to keep this section quite short since I really don’t condone using it as it’s been very flaky for me.
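For reference, a sketch of the same sort of login flow with selenium-webdriver might look like this (the URL and selectors are placeholders):

```ruby
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :chrome

# The URL and CSS selectors here are placeholders -- adjust them to whatever
# page you're actually driving.
driver.navigate.to 'https://thredded.org/'
sleep 2 # let the page render; you can't interact with elements that aren't visible yet

driver.find_element(css: 'input[name="user[name]"]').send_keys('Jane')
driver.find_element(css: 'input[type="submit"]').click
sleep 2 # wait for the sign-in redirect to finish

puts driver.find_element(css: '.flash-message').text

driver.quit
```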
You’ll notice this is pretty much the same as the Mechanize example but with quite a few `sleep` statements; this is because if something isn’t visible you can’t interact with it.
The basic way of interacting using Selenium is to select the expected element and send whatever keys you’d like to it. There are also ways of faking mouse interaction if you need to drag and drop or hover.
You can see a full run of the demo below, you’ll also notice it took me a few tries to get this recording right!
Protecting Yourself from Users
Users are the worst, they’ll do strange things that’ll break whatever code you write but thankfully there are some things you can do to protect yourself!
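A rough sketch of that kind of guard, leaning on open-uri’s `content_length_proc` and `progress_proc` hooks (the method name and error handling are illustrative):

```ruby
require 'open-uri'

MAX_BYTES = 5 * 1024 * 1024 # 5MB cap on anything a user asks us to fetch

# Hypothetical helper: fetch a user-supplied URL only if it's HTTP(S),
# under 5MB, and actually an HTML document.
def fetch_html(url)
  uri = URI.parse(url)
  raise ArgumentError, 'only HTTP and HTTPS URLs are allowed' unless uri.is_a?(URI::HTTP)

  body = uri.open(
    # Bail out early if the server declares an over-sized response up front...
    content_length_proc: ->(size) { raise 'response too large' if size && size > MAX_BYTES },
    # ...and keep checking as data arrives, for servers that send no length.
    progress_proc: ->(bytes) { raise 'response too large' if bytes > MAX_BYTES }
  )

  raise 'not an HTML document' unless body.content_type == 'text/html'

  body.read
end
```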
Basically all this does is check that we are requesting a site over HTTP or HTTPS, that the amount of data is under 5MB, and that the data we are getting back is HTML.
It’s pretty rudimentary but should help protect you at a pretty basic level.
Closing Thoughts
So in this article you’ve learnt how to read data from sources all over the web, but keep in mind people pay good money to keep those sites up. Don’t hammer them too hard, and if you’re going to build a spider for a search engine, make sure to respect robots.txt.