A Simple Approach to Webscraping - Part 1

Written By Leo Yorke, Edited By Lewis Fogden
Thu 23 November 2017, in category Data science

Python, Webscraping


The Problem

A common problem many data science projects face is where to source data from. One of the richest sources of external data is the web, but visiting websites and copying data from them by hand quickly becomes impractical. Anybody wanting such information either at scale or at frequent intervals needs a way of extracting it programmatically; this is what we will cover in today's post.

Our worked example uses currency exchange rates. Any project involving more than one currency will need access to accurate exchange rates at reasonably frequent time intervals. Luckily, exchange rates are published live across many websites and all we need to do is figure out how to get them. While we are going to do this using Python 3, the same approach works in almost any language you want.

The first thing we have to do is find a website with the required information. For our example we are going to use Yahoo Finance. The reason we're going to use this website is that all the data we want is on one page and is stored in a relatively simple table - hopefully this will make the job of extracting it much easier.

Inspecting Web Pages

To be able to fetch the desired information from a web page, we need to know what the HTML of that page looks like. This is easy to find out by using Chrome's DevTools (most other browsers have an equivalent). We can open the developer tools by navigating to our chosen URL in Chrome, right-clicking on the page, and clicking Inspect. We will then see something like this:


While this looks fairly complex, all we really care about in our example is the left-hand pane, which contains the structure of our page. As you mouse over sections of the HTML, the corresponding parts of the page will highlight. This way we can quickly find which piece of HTML refers to the data we wish to retrieve.

We want to find the lowest-level HTML tag (a tag looks like this: <tag> ... </tag>) that contains all of the information we want and preferably has nothing else contained within it. A quick scan through the page shows that the following line looks like a good place to start:

<table class="yfinlist-table W(100%) BdB Bdc($tableBorderGray)" data-reactid="15">

Mousing over this <table> tag highlights the exchange rate table and nothing else, so it looks like this is what we need. So how can we use Python to first access this part of the page, and then get some useful information out of it?

Choosing our Libraries

Now that we have our target website and understand its structure, let's start by importing the Python libraries we are going to need.

import bs4          # The most important library for us, see the note below
import requests     # Requests will allow us to access the website via HTTP requests
import pandas as pd # A standard tabular data manipulation library

The library bs4 is more commonly known as Beautiful Soup. It is the go-to library for parsing web pages and HTML documents in Python, and is what we will be using today. Once you learn how to use it, it makes scraping even poorly designed websites relatively simple. I highly recommend its use in any project involving scraping online data.

Due to its frequent use in the Python community, you may come across references to soup variables in scripts you are reading. These are generally references to the entire HTML document of a page, parsed into a BeautifulSoup object. Think of them as top level containers from which you can extract the desired information.
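As a minimal sketch of what a soup variable looks like in practice, the snippet below parses a tiny hand-written document (not a real page) and pulls a couple of values out of it. The variable names and the HTML are illustrative assumptions.

```python
import bs4

# A tiny hand-written document standing in for a downloaded page
html = (
    "<html><head><title>Rates</title></head>"
    "<body><p>GBP/USD</p><p>GBP/EUR</p></body></html>"
)

# Parse the raw HTML string into a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")

page_title = soup.title.text   # text inside the <title> tag
first_paragraph = soup.p.text  # text inside the first <p> tag

print(type(soup).__name__)
print(page_title)
print(first_paragraph)
```

Everything else in this post starts from an object like soup and narrows down from there.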

The requests library provides a simple, intuitive interface for making HTTP requests. When trying to access any online site using Python, it should be your first stop. However, note that Beautiful Soup is only able to parse the content available in the response it is given. As such, content that is added after the initial load of a page using asynchronous JavaScript may be missed when using simple HTTP GET requests. How to deal with this is, however, another blog article in itself.
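To make that limitation concrete, here is a small sketch using a hand-written page (an assumption for illustration, not the Yahoo Finance HTML). The div is empty in the HTML itself and would only be filled in once a browser ran the script.

```python
import bs4

# Hand-written page: the <div> is empty in the HTML source and would only
# be populated by the script when a browser executes it.
html = """
<html><body>
  <div id="rates"></div>
  <script>
    document.getElementById("rates").textContent = "GBP/USD 1.33";
  </script>
</body></html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
rates_div = soup.find(id="rates")

# Beautiful Soup parses the markup but never runs the script,
# so the div still has no text in it.
print(repr(rates_div.text))
```

If the data you are after only appears in DevTools and not in the raw HTML, this is the likely cause.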

Our final imported library is Pandas, Python's favourite library for manipulating panel data (think tables or Excel-style data).

Requesting Web Pages in Python

Next, let's write a function that will allow us to download the web page. We can do that easily using an HTTP GET request from the requests library. What we actually want from the page is the entire HTML document, so that we can use Beautiful Soup to parse our response into an easily navigable object.

URL = 'https://uk.finance.yahoo.com/currencies'

def get_webpage(url):
    response = requests.get(url)  # Send an HTTP GET request to the URL
    return bs4.BeautifulSoup(response.text, 'html.parser')  # Parse the returned HTML into a BeautifulSoup object

If you were to print the object returned by this function to the console, it should look much like the HTML we saw earlier in the DevTools window. Try this for yourself to verify that your request has worked.
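Printing the page is a manual check. A slightly more defensive variant, sketched below, fails loudly when the request itself goes wrong. The name get_webpage_checked, the 10 second timeout, and the injectable get argument are all illustrative choices rather than part of the original script.

```python
import bs4
import requests


def get_webpage_checked(url, timeout=10, get=requests.get):
    """Fetch a page and parse it, raising if the request fails.

    The function name, the default timeout, and the `get` parameter
    (handy for testing without a network) are illustrative assumptions.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return bs4.BeautifulSoup(response.text, "html.parser")


# Usage (requires network access):
# page = get_webpage_checked("https://uk.finance.yahoo.com/currencies")
# print(page.title)
```

Without the raise_for_status call, an error page would be parsed just like any other HTML, and the problem would only surface later when the table could not be found.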

Navigating HTML using Beautiful Soup

Beautiful Soup has many ways of navigating HTML documents, as covered in great detail in its documentation. We will cover a very small subset of the commands available to fetch data.

If we look at the HTML in Chrome's DevTools we can see that inside the <table> tag, there is a <tbody> tag (the 'table body'), and inside that there is a series of <tr> tags (the table rows). Inside each of them are several <td> tags (table data), and finally inside these tags is the actual data we want to retrieve.
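The nesting is easier to see on a cut-down example. The sketch below uses a hand-written table with illustrative values (not scraped data) and steps down through the same levels one at a time.

```python
import bs4

# Hand-written table mirroring the <table> / <tbody> / <tr> / <td> nesting;
# the values are illustrative, not scraped.
html = """
<table>
  <tbody>
    <tr><td>GBP/USD</td><td>1.33</td></tr>
    <tr><td>GBP/EUR</td><td>1.1232</td></tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

body = soup.find("table").find("tbody")    # step down one level at a time
rows = body.find_all("tr")                 # every row in the table body
first_row_cells = rows[0].find_all("td")   # every cell in the first row
first_row_text = [cell.text for cell in first_row_cells]

print(len(rows))
print(first_row_text)
```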

Now looking at the table, all we really want is the Name and Last price columns. Building a set of logical steps to get these two columns, we will need to:

  1. Locate the <table> tag
    • We can do this using bs4's find method, which will search through the soup object and return the first instance of the specified tag
  2. Get all the <tr> tags within the table
    • We can do this using the find_all method, which will search through the soup object and return all instances of the specified tag that it finds.
  3. For every <tr> tag, get all the <td> tags
    • For this we will need to use a for loop along with the find_all method
  4. For each <td> tag, check it is in the right column, and then get the data inside
    • Here we will need a second for loop and can use list indexing to select the correct columns
  5. Store the text from each row of the table
    • The text attribute allows us to access the data, and returns the text within the tag as a string
  6. Combine all the rows into one results table
    • We can use pd.DataFrame to combine our results into a DataFrame object

Using those steps we can construct the following function:

COLUMNS = ['cy-pair', 'rate']

def scrape(webpage):
    table = webpage.find("table")  # Find the first "table" tag in the page
    rows = table.find_all("tr")  # Find all the "tr" tags in the table
    cy_data = []
    for row in rows:
        cells = row.find_all("td")  # Find all the "td" tags in each row
        cells = cells[1:3]  # Select the correct columns (1 & 2 as Python is 0-indexed)
        cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
    # Drop row 0, which corresponds to the table's header row
    return pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0)
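To see what the function does without hitting the network, it can be run against a hand-written table. The sketch below repeats the same logic; the sample HTML and its column layout (symbol, name, last price, change) are assumptions made for this example, not a copy of the real page.

```python
import bs4
import pandas as pd

COLUMNS = ['cy-pair', 'rate']

# Hand-written stand-in for the page. The column layout (symbol, name,
# last price, change) is an assumption made for this example.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Symbol</th><th>Name</th><th>Last price</th><th>Change</th></tr>
  </thead>
  <tbody>
    <tr><td>GBPUSD=X</td><td>GBP/USD</td><td>1.33</td><td>+0.001</td></tr>
    <tr><td>GBPEUR=X</td><td>GBP/EUR</td><td>1.1232</td><td>-0.002</td></tr>
  </tbody>
</table>
"""


def scrape(webpage):
    # Same logic as the function above
    table = webpage.find("table")
    rows = table.find_all("tr")
    cy_data = []
    for row in rows:
        cells = row.find_all("td")[1:3]  # keep the name and last price cells
        cy_data.append([cell.text for cell in cells])
    return pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0)


page = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
data = scrape(page)
print(data)
```

In this sample the header row contains only <th> cells, so find_all("td") returns an empty list for it and the first row of the DataFrame is empty. That is presumably why the function drops row 0, and why the index of the result starts at 1.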

Now that we have written our functions, all we need is a little routine to execute them and print the result to our console. This way, we can check we are actually getting what we want.

if __name__ == "__main__":
    page = get_webpage(URL)
    data = scrape(page)
    print(data)  # Print the resulting table to the console

Finally we can save our script as currency-webscraper.py and execute it using bash.

$ python3 currency-webscraper.py
    cy-pair       rate
1   GBP/USD       1.33
2   GBP/EUR     1.1232
3   EUR/USD     1.1847
4   GBP/JPY  148.05121
5   USD/JPY    111.259

Success! We've built a webscraper using only 21 lines of simple code and can now get access to up-to-date exchange rates whenever we need them. In the next blog post I will cover how we can extend this method to extract a bigger list of currency pairs and access historical exchange rates by using HTTP requests.

The full code for this example is available here