Web Scraping with Selenium

Written By Leo Yorke, Edited By Mae Semakula-Buuza
Fri 24 August 2018, in category Data science

Python, Webscraping

  

Introduction

In previous blog articles, we covered how to use the Python library BeautifulSoup to programmatically navigate through websites and extract useful data. However, one common issue we didn't cover is that the data we want is sometimes hidden behind JavaScript elements that need to be clicked before they reveal it. Other scenarios that require interacting with the page, such as filling in a search form, pose the same problem. Unfortunately for us, BeautifulSoup can't get around this - it only lets us navigate through the HTML we give it - so we will need something else... enter Selenium.

What is Selenium?

Selenium is a web testing library that is primarily used to automate browsers, essentially allowing you, the user, to merrily click through, navigate and browse websites. It can also parse data from HTML and XML documents, functionality it shares with the well-known Python package BeautifulSoup; both are popular web scraping toolkits. For the parsing itself, however, I would personally recommend sticking with BeautifulSoup, as it is more robust and recovers from errors more gracefully.

Setup

Installation is easy enough - just use your favourite method for installing Python libraries (mine is pip):

pip install selenium

You will also need to install a driver, which provides the software interface Selenium uses to control a real browser. Each browser has its own driver; for Chrome, download ChromeDriver from the official ChromeDriver site. Save the download to your desired folder and then add that file location to your PATH variable - this will ensure that Selenium knows where to locate it.
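
Alternatively, if you would rather not edit your PATH variable, you can point Selenium directly at the driver when you create the browser (the path below is just a placeholder - use wherever you saved the file), which is the same approach we use later in this article:

from selenium import webdriver

# Tell Selenium exactly where the driver executable lives
# (replace the placeholder path with your own download location)
browser = webdriver.Chrome(executable_path="C:/path/to/chromedriver.exe")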

Now, let's check if everything worked correctly by running the below script:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.google.com')
browser.quit()

The above script will:

  1. Open the Chrome web browser
  2. Navigate to google.com
  3. Close Chrome

Hopefully that script ran successfully (without errors)! Looks like we're ready to try and scrape something.

Interacting with web pages

Now that we can open a webpage, the next step is to be able to interact with it in some way. A good use case is extracting comments from a website, so let's use this article for that purpose: https://www.theguardian.com/commentisfree/2018/jun/21/matteo-salvini-threatening-minister-of-interior-police-protection-mafia.

If we scroll right down to the comments section, we can see that some comments are hidden behind a clickable "+ View more comments" button.

If we further inspect the web page (CTRL + SHIFT + I), we can also see that those comments do not appear in the HTML of the page unless we press this button, which makes them unavailable for webscraping. So our task is simple: load the web page, click the button to display the extra comments, and then do the webscraping.

So, let's dive into Selenium and discover how we can do this.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome(executable_path="C:/Users/user_name/Desktop/Projects/rugby-webscraper/chromedriver.exe")
browser.get('https://www.theguardian.com/commentisfree/2018/jun/21/matteo-salvini-threatening-minister-of-interior-police-protection-mafia')

# XPath of the "+ View more comments" button
COMMENTS_XPATH = '//*[@id="comments"]/div/div/div[2]/button'
timeout = 15

# Wait until the button is clickable, then find it and click it
wait = WebDriverWait(browser, timeout)
element = wait.until(EC.element_to_be_clickable((By.XPATH, COMMENTS_XPATH)))
element = browser.find_element_by_xpath(COMMENTS_XPATH)
element.click()

The code above will load the website using browser.get(), locate the correct button via its XPath (COMMENTS_XPATH) and then click on it (element.click()). However, when we run this code, a common problem appears: a rather persistent banner pops up asking us to accept cookies, which blocks our access to the comments button.

So, how can we solve this problem? The first thing we need to do when the page loads is clear this banner. Once that is out of the way, we can click the button we actually want.

See below for the extended code with changes made:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome(executable_path="C:/Users/leo.yorke/Desktop/Projects/rugby-webscraper/chromedriver.exe")
browser.get('https://www.theguardian.com/commentisfree/2018/jun/21/matteo-salvini-threatening-minister-of-interior-police-protection-mafia')

# XPaths of the comments button and the cookie-consent button
COMMENTS_XPATH = '//*[@id="comments"]/div/div/div[2]/button'
COOKIES_XPATH = '//*[@id="top"]/div[7]/div/div/div[2]/div[2]/div/button'
timeout = 15

# First, wait for the cookies banner to become clickable and dismiss it
wait = WebDriverWait(browser, timeout)
element = wait.until(EC.element_to_be_clickable((By.XPATH, COOKIES_XPATH)))
element = browser.find_element_by_xpath(COOKIES_XPATH)
element.click()

# Now we can click the button that expands the comments
element = browser.find_element_by_xpath(COMMENTS_XPATH)
element.click()

Now, if we execute this script, we should find that our browser automatically clears the cookies banner, jumps straight to the comments section and expands the comments for us, leaving them ready to scrape.
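
At this point the expanded comments are in the page's HTML, so we can parse them with BeautifulSoup, as recommended earlier. A minimal sketch (the class name below is only a guess - inspect the page to find the element that actually wraps each comment):

from bs4 import BeautifulSoup

# Hand the fully expanded page over to BeautifulSoup for parsing
soup = BeautifulSoup(browser.page_source, 'html.parser')

# 'd-comment__body' is a hypothetical class name - check the page's HTML for the real one
comments = [c.get_text(strip=True) for c in soup.find_all('div', class_='d-comment__body')]
print(comments[:5])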

So, how does our new and improved code work?

  1. Firstly, we use the .get() method to fetch the web page.
  2. The WebDriverWait tells the webdriver to wait for up to 15 seconds (the timeout variable) for some condition to be met.
  3. That condition is specified with the .until() method, which tells the webdriver to look for a specific element and to throw an exception if it can't find it within the 15-second limit.
  4. Finally, find_element_by_xpath() and .click() work together to locate the button and then click on it.

Simple, right?

We could easily extend this script to collect a variety of data, or indeed wrap our code in a try-except block to catch common exceptions such as timeouts; see the sketch below.
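
As a rough sketch, the wait for the cookies banner could be wrapped like this (reusing the wait, COOKIES_XPATH and timeout variables from above):

from selenium.common.exceptions import TimeoutException

try:
    # Wait for the cookies banner; give up after `timeout` seconds
    element = wait.until(EC.element_to_be_clickable((By.XPATH, COOKIES_XPATH)))
    element.click()
except TimeoutException:
    # The banner never appeared (or took too long to load) - carry on without it
    print("Timed out waiting for the cookies banner")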

Conclusion

To wrap up, we have learned how to use Selenium to handle slightly more complex web scraping scenarios, where the data we want only appears after interacting with the page. Hopefully you found this tutorial easy to follow along with and can put the code to use in your own projects.

Don't forget to stay tuned to the Keyrus blog for more content like this!