A common problem many data science projects face is where to source data from. One of the richest sources of external data is the web, but accessing websites and retrieving data from them manually is a huge challenge. Anybody wanting such information either at scale or at frequent intervals needs a way of extracting it programmatically; this is what we will cover in today's post.
The example we will go through is using currency exchange rates. Any project involving more than one currency will need access to accurate exchange rates at reasonably frequent time intervals. Luckily, exchange rates are published live across many websites and all we need to do is figure out how to get them. While we are going to do this using Python 3, this can be done in almost any language you want.
The first thing we have to do is find a website with the required information. For our example we are going to use Yahoo Finance. The reason we're going to use this website is that all the data we want is on one page and is stored in a relatively simple table - hopefully this will make the job of extracting it much easier.
To be able to fetch the desired information from a web page, we need to know what the HTML of that page looks like. This is easy to find out by using Chrome's DevTools (most other browsers have a similar equivalent). We can open the developer tools up by navigating to our chosen URL in Chrome, right clicking on the page, and clicking Inspect. We will then see something like this:
While this looks fairly complex, all we really care about in our example is the left hand pane - this contains the structure of our page. As you mouse over sections of the HTML, the corresponding parts of the page will highlight. This way we can quickly find which piece of HTML refers to the data we wish to retrieve.
We want to find the lowest-level HTML tag (a tag looks like this: `<tag> ... </tag>`) that contains all of the information we want and preferably has nothing else contained within it. A quick scan through the page shows that the following line looks like a good place to start:
```html
<table class="yfinlist-table W(100%) BdB Bdc($tableBorderGray)" data-reactid="15">
```
Mousing over this `<table>` tag highlights the exchange rate table and nothing else, so it looks like this is what we need. So how can we use Python to first access this part of the page, and then get some useful information out of it?
Now we have our target website and understand its structure, let's start by importing the Python libraries we are going to need.
```python
import bs4           # The most important library for us, see the note below
import requests      # Requests will allow us to access the website via HTTP requests
import pandas as pd  # A standard tabular data manipulation library
```
bs4 is more commonly known as Beautiful Soup. It is the go-to library for parsing web pages and HTML documents in Python, and is indeed what we will be using today. Once you learn how to use it, it makes scraping even poorly designed websites relatively easy and simple. I highly recommend its use in any project involving scraping online data.
Due to its frequent use in the Python community, you may come across references to `soup` variables in scripts you are reading. These are generally references to the entire HTML document of a page, parsed into a `BeautifulSoup` object. Think of them as top-level containers from which you can extract the desired information.
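As a minimal sketch of what that looks like (using a tiny made-up HTML snippet rather than a real page), creating a `soup` object is a one-liner:

```python
import bs4

# A hypothetical stand-in for a real page's HTML, invented for illustration
html = "<html><head><title>Rates</title></head><body><p>GBP/USD</p></body></html>"

# Parse the whole document into a BeautifulSoup object - our top-level container
soup = bs4.BeautifulSoup(html, "html.parser")

print(soup.title.text)  # -> Rates
print(soup.p.text)      # -> GBP/USD
```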
Our final imported library is Pandas, Python's favourite for manipulating panel data (think tables or Excel style data).
Next, let's write a function that will allow us to download the web page. We can do that easily using an HTTP GET request from the `requests` library. What we actually want from the page is the entire HTML document, so that we can use Beautiful Soup to parse the response into an easily navigable object.
```python
URL = 'https://uk.finance.yahoo.com/currencies'

def get_webpage(url):
    response = requests.get(url)  # Get the URL
    return bs4.BeautifulSoup(response.text, 'html.parser')  # Turn the response into a BeautifulSoup object
```
If you were to print this response to the console, it would look exactly like the HTML we saw earlier in the DevTools window. Try this for yourself to verify that your request has worked.
Beautiful Soup has many ways of navigating HTML documents, as covered in great detail in its documentation. We will cover only a small subset of the commands available to fetch data.
If we look at the HTML in Chrome's DevTools, we can see that inside the `<table>` tag there is a `<tbody>` tag (the table body), and inside that there is a series of `<tr>` tags (the table rows). Inside each of these are several `<td>` tags (table data), and finally inside these tags is the actual data we want to retrieve.
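To see this navigation in isolation before applying it to the real page, here is a toy table (invented for illustration, and much simpler than Yahoo's) being picked apart with Beautiful Soup:

```python
import bs4

# A hypothetical two-row table, mimicking the structure described above
html = """
<table>
  <tbody>
    <tr><td>GBP/USD</td><td>1.33</td></tr>
    <tr><td>GBP/EUR</td><td>1.12</td></tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find("table")        # First (and here, only) <table> tag in the document
for row in table.find_all("tr"):  # Every <tr> tag within the table
    cells = row.find_all("td")    # Every <td> tag within this row
    print([cell.text for cell in cells])
# -> ['GBP/USD', '1.33']
# -> ['GBP/EUR', '1.12']
```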
Now looking at the table, all we really want is the Name and Last price columns. Building a set of logical steps to get these two columns, we will need to:

1. Find the table - we can use the `find` method, which will search through the soup object and return the first instance of the specified tag.
2. Get all the `<tr>` tags within the table - we can use the `find_all` method, which will search through the soup object and return all instances of the specified tag that it finds.
3. For each `<tr>` tag, get all the `<td>` tags inside - we can do this with a for loop, along with the `find_all` method.
4. For each `<td>` tag, check it is in the right column, and then get the data inside - we are already inside the for loop and can use list indexing to select the correct columns. The `text` attribute allows us to access the data, and returns the text within the tag as a string.
5. Use `pd.DataFrame` to combine our results into a single table.
Using those steps we can construct the following function:
```python
COLUMNS = ['cy-pair', 'rate']

def scrape(webpage):
    table = webpage.find("table")  # Find the "table" tag in the page
    rows = table.find_all("tr")    # Find all the "tr" tags in the table
    cy_data = []
    for row in rows:
        cells = row.find_all("td")  # Find all the "td" tags in each row
        cells = cells[1:3]          # Select the correct columns (1 & 2 as Python is 0-indexed)
        cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
    return pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0)
```
Now we have written our functions, all we need is a little routine to execute them and print the result to our console. This way, we can check we are actually getting what we want.
```python
if __name__ == "__main__":
    page = get_webpage(URL)
    data = scrape(page)
    print(data.head())
```
Finally, we can save our script as `currency-webscraper.py` and execute it using bash.
```
$ python3 currency-webscraper.py
   cy-pair       rate
1  GBP/USD       1.33
2  GBP/EUR     1.1232
3  EUR/USD     1.1847
4  GBP/JPY  148.05121
5  USD/JPY    111.259
```
Success! We've built a web scraper using only 21 lines of simple code, and can now access up-to-date exchange rates whenever we need them. In the next blog post I will cover how we can extend this method to extract a bigger list of currency pairs and access historical exchange rates using HTTP requests.
The full code for this example is available here.