A Simple Approach to Webscraping - Part 2

By Leo Yorke, Thu 07 December 2017, in category Data science

Python, Webscraping

  

In the first part of this series of blog posts we created a simple webscraper in Python that would go to Yahoo Finance and fetch a table of currency exchange rates. There are a couple of drawbacks to our previous approach. Firstly, we don't have access to any historical exchange rates - only current ones - and secondly we have a limited list of currency pairs to choose from. In this post we are going to try to address these limitations and explore how we can take our webscraper a little bit further. All the code in this example and the previous blog post can be found here.

Looking at Yahoo's website, they have a currency converter page that gives us the option to convert between a much more extensive list of currencies than were available to us previously. It also displays a plot of historical exchange rates for all of the currency pairs. If we can access the data behind that graph, we'll have exactly what we need. The problem is that this is no longer in a neat and tidy table for us to access so we will need a different approach.

Inspecting the Page

If we inspect the page (right click -> inspect) we'll discover that the HTML tag that contains the currency converter is actually an iframe - essentially a web page embedded within a web page.

<iframe 
    name="currencyCalculator-iframe" 
    allowtransparency="true" 
    scrolling="no" 
    frameborder="0" 
    referrerpolicy="no-referrer" 
    class="W(100%) Mb(5px) Bd(0) H(305px)--cclg H(560px)--ccmd H(305px)--ccsm H(560px)--ccxs" 
    src="https://widget-yahoo.ofx.com/?locale=UK&amp;base=GBP&amp;term=EUR" 
    id="Col1-0-IFrame" 
    data-reactid="2">
</iframe>

We can actually go to the embedded web page and take a look for ourselves to verify it is the currency converter widget used in the original page. Isolating this widget might make writing our webscraper a bit simpler, but how are we actually going to get data out of it?

One option would be to have a program that can open the page in a web browser such as Chrome, click through all the options in the dropdown, and scrape the information as it goes. This is certainly feasible, but it is likely to be fairly slow due to the overhead of driving the browser. Selenium is a popular browser automation framework, with APIs available in many languages including Python. However, for this post we're going to take a different approach.
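For illustration, here is a minimal sketch of what that browser-driven approach might look like with Selenium (assuming Chrome and chromedriver are installed; the .rate selector is hypothetical - the widget's real markup would need inspecting):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires chromedriver on your PATH
try:
    # Load the converter widget directly via the iframe's src from above
    driver.get("https://widget-yahoo.ofx.com/?locale=UK&base=GBP&term=EUR")
    # ".rate" is a hypothetical selector for wherever the widget displays the rate
    print(driver.find_element(By.CSS_SELECTOR, ".rate").text)
finally:
    driver.quit()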

Instead what we can do is get the data the same way that the widget does. Behind the scenes, every time we click on one of the dropdowns the website is going off and fetching the corresponding exchange rate from somewhere, so if we can duplicate that process we can fetch all the exchange rates we want.

We can use Chrome's DevTools to monitor what requests the website makes and where it sends them when we click on something. If we right click -> inspect, go to the Network tab, and then click on the dropdown we will see something similar to the following:

[Image: the Chrome DevTools Network tab showing the requests made when we click on the dropdown]

Each 1 in the Name panel is a request made by the website, and by clicking on each of them we can see what the website is requesting. We can see that a POST request is being made, and by examining the Headers and Response tabs we can see exactly what is being sent and what is being returned.

At the bottom of the Headers tab there is a section called Request Payload, which shows us exactly what the website sends out to request data. In this case it is sending the following piece of JSON to the URL https://adsynth-ofx-quotewidget-prod.herokuapp.com/api/1:

{
    "method": "spotRateHistory", 
    "data": {
        "base": "ARS", 
        "term": "EUR", 
        "period": "day"
    }
}

If we switch to the Response tab we can see what is being sent back. In return the website is receiving JSON data that looks like this:

{
    "data": {
        "CurrentInterbankRate": 0.050245055371495,
        "HistoricalPoints":[
            {
                "PointInTime": 1511352900000,
                "InterbankRate": 0.050556151155792685
            }, 
            ...
        ]
    }
}

This looks suspiciously like the current exchange rate and the historical data points for the currency pair we want. Now that we have figured out where to get our data from, how do we write a program that can make these POST requests for us?
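Before writing a full program, we can sanity-check the endpoint with a quick one-off request that simply copies the payload we captured above:

import requests

payload = {
    "method": "spotRateHistory",
    "data": {"base": "GBP", "term": "EUR", "period": "day"}
}
url = "https://adsynth-ofx-quotewidget-prod.herokuapp.com/api/1"
print(requests.post(url, json=payload).json()["data"]["CurrentInterbankRate"])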

Now let's build this into a proper program. First, let's import the libraries that we will need.

import requests  # Third-party library for making HTTP requests
import pandas as pd  # The de facto library for tabular data manipulation
from datetime import datetime  # Standard library module for working with dates

Looking at the format of the outgoing POST request, it seems we will need a list of currency codes. We could just write the desired codes out manually, but ideally we should fetch the complete list of available ones from the website. Luckily the website itself requests a list of currencies as a JSON file, which shows up as USA.json in the Network tab, so we can use that too. If it doesn't appear, just refresh the page.

URL = 'https://widget-yahoo.ofx.com/resources/1500309750700/data/localizations/USA.json'

def get_webpage(url):
    return requests.get(url).json()

We can now fetch the list of currencies, but examining USA.json reveals that it's not quite in the format we want. We need 3 letter codes to match the request we are copying; however, the data we have at the moment looks like this:

{
    "Registration Button A": "Transfer Now with OFX",
    "Registration URL": "https://www.ofx.com/en-us/yf/?pid=5330",
    "Disclaimer": "Currency converter displays [Market Rates] and is not indicative of [OFX Customer Rates]",
    "Disclaimer Market Rates URL": "https://www.ofx.com/en-us/faqs/what-is-the-market-rate/?pid=5330",
    "Disclaimer Customer Rates URL": "https://www.ofx.com/en-us/faqs/what-is-the-customer-rate/?pid=5330",
    "Currency Converter": "Currency Converter",
    "Conversion Unavailable": "temporarily unavailable",
    "Chart Page Link": "[More Charts]",
    "Chart Page URL": "https: //www.ofx.com/en-us/forex-news/?pid=5330",
    "Chart Title Post Currencies": "Exchange Rates",
    "Brought to you by": "Sponsored and brought to you by",
    "Monday": "Monday", "Tuesday": "Tuesday", "Wednesday": "Wednesday",
    "Thursday": "Thursday", "Friday": "Friday", "Saturday": "Saturday",
    "Sunday": "Sunday", "Mon": "Mon", "Tues": "Tues", "Wed": "Wed",
    "Thurs": "Thurs", "Fri": "Fri", "Sat": "Sat", "Sun": "Sun",
    "1 Minute": "Minute", "1 Hour": "Hour", "1 Day": "Day", 
    "1 Week":  "Week", "1 Month": "Month", "1 Year": "Year", 
    "All time": "All time", "Month": "Month", "Jan": "Jan", "Feb": "Feb",
    "Mar": "Mar", "April": "April", "May": "May", "June": "June",
    "July": "July", "Aug": "Aug", "Sept": "Sept", "Oct": "Oct",
    "Nov": "Nov", "Dec": "Dec",
    "AED - Description":"United Arab Emirates Dirham", 
    ...
}

We need to write a function that keeps only the keys ending in "- Description", leaving us with just the 3 letter currency codes:

def get_list_of_cys(response):
    # Currency entries have keys like "AED - Description";
    # keep those and strip each down to the three-letter code
    return [key.split("-")[0].strip()
            for key in response
            if key.endswith("- Description")]
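
A quick check against a stub of the data above shows what we get back:

sample = {
    "Currency Converter": "Currency Converter",
    "AED - Description": "United Arab Emirates Dirham",
}
print(get_list_of_cys(sample))  # ['AED']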

Now that we have a list of currency codes, we have enough to construct our request. All we need to do is copy the format of the JSON payload we saw earlier and submit it to the same URL the website does, and we will get the same response. Since the API only handles one currency pair at a time, we will need to loop over the combinations of our currencies. Finally, we convert the results from JSON into a tabular format so that we can write them out to a .csv file or relational database if we want to (there is a quick example of this after the function below).

POST_URL = 'https://adsynth-ofx-quotewidget-prod.herokuapp.com/api/1'
COLUMNS = ['cy-pair', 'rate', 'date']

def post_loop(currencies):
    results = []
    # Request the current rate for every combination of base and term currency
    for base in currencies:
        for term in currencies:
            payload = {
                "method": "spotRateHistory",
                "data": {"base": base, "term": term, "period": "day"}
            }
            response = requests.post(POST_URL, json=payload).json()
            try:
                results.append(
                    [base + term, response['data']['CurrentInterbankRate'], datetime.now()])
            except Exception as e:
                # Skip any pair the API fails to return a rate for
                print(e)
    return pd.DataFrame(results, columns=COLUMNS)
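
As mentioned above, once the results are in a DataFrame, writing them out is straightforward (rates.csv is just an example filename):

data = post_loop(['GBP', 'USD', 'EUR'])
data.to_csv('rates.csv', index=False)  # or data.to_sql(...) for a database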

One thing to note is that we are only returning the current exchange rate for each currency pair and ignoring the historical data points. The next step in expanding our program would be to extract all of the historical information as well; we won't build that in for now, but the sketch below shows the idea.
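This is a minimal sketch, assuming the response format we saw earlier, where PointInTime is a Unix timestamp in milliseconds:

def get_historical_points(response):
    # Turn the HistoricalPoints list into a two-column DataFrame
    df = pd.DataFrame(response['data']['HistoricalPoints'])
    # PointInTime is a millisecond Unix timestamp
    df['PointInTime'] = pd.to_datetime(df['PointInTime'], unit='ms')
    return df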

At this point let's also add a function to allow us to control which currency pairs we actually want to retrieve rather than just blindly fetching them all. If we ever do want to fetch them all we can just not call this function.

def filter_cys(currencies):
    # Keep only the currency codes we are interested in
    allowed = {'GBP', 'USD', 'EUR'}
    return list(allowed & set(currencies))
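
For example (note that set intersection does not preserve order):

print(filter_cys(['AED', 'EUR', 'GBP', 'USD']))  # e.g. ['USD', 'EUR', 'GBP']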

Now the final piece of the puzzle is to string all of our functions together and check everything is working.

if __name__ == "__main__":
    webpage = get_webpage(URL)
    currencies = get_list_of_cys(webpage)
    filtered = filter_cys(currencies)
    data = post_loop(filtered)
    print(data)

If we run it, we should see output similar to the following:

$ python3 yf-post.py
  cy-pair      rate                       date
0  USDUSD  1.000000 2017-11-24 10:40:41.857866
1  USDEUR  1.000000 2017-11-24 10:40:42.207901
2  USDGBP  0.751300 2017-11-24 10:40:42.557936
3  EURUSD  1.184300 2017-11-24 10:40:42.903970
4  EUREUR  1.000000 2017-11-24 10:40:43.254005
5  EURGBP  0.889744 2017-11-24 10:40:43.601040
6  GBPUSD  1.331100 2017-11-24 10:40:43.948075
7  GBPEUR  1.123919 2017-11-24 10:40:44.296109
8  GBPGBP  1.000000 2017-11-24 10:40:44.645144

Success! We have managed to write a program that makes HTTP requests to an API endpoint and fetches a table of all the currency exchange rates we want. We could also extend this to cover historical data points going back up to a year if we wanted to. Get the full code here and try it out for yourself.