In previous blog articles, we covered how to use the Python library BeautifulSoup to programmatically navigate through websites and extract useful data. However, one common issue that we didn't cover is that sometimes the data we want to extract is hidden behind interactive elements, such as buttons that must be clicked before the content appears on the page. BeautifulSoup doesn't allow us to get around this - it only lets us parse the HTML that is already there - so we will need something else... enter Selenium.
Selenium is a web testing library that is primarily used to automate webpages, essentially allowing you to click through, navigate and browse websites just as a user would. It shares some functionality with the well-known Python package BeautifulSoup, which enables you to parse data from HTML and XML documents; both are prevalent web scraping toolkits. However, for the parsing itself I would personally recommend sticking to BeautifulSoup, due to its robustness and ease of error recovery.
Installation is easy enough - just use your favourite method for installing Python libraries (mine is pip):
pip install selenium
You will also need to install a driver; this provides the software interface between Selenium and the web browser. Each browser has its own driver - for Chrome, which we use in this tutorial, that is ChromeDriver. Save the download to your desired folder and then add that file location to your PATH variable - this ensures that selenium knows where to locate it.
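Before launching Selenium, it can save some head-scratching to confirm that the driver really is discoverable via PATH. Here is a minimal sketch using only the standard library; the name chromedriver is an assumption (on Windows the file is chromedriver.exe):

```python
import shutil

def driver_on_path(name="chromedriver"):
    """Return the driver's full path if the operating system can
    locate it via the PATH variable, otherwise None."""
    return shutil.which(name)

# Quick sanity check before launching Selenium:
if driver_on_path() is None:
    print("chromedriver not found - check your PATH variable")
```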
Now, let's check that everything worked correctly by running the script below:
```python
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.google.com')
browser.quit()
```
The script above will:
1. Open the Chrome web browser
2. Navigate to google.com
3. Close Chrome
Hopefully that script ran successfully (without errors)! Looks like we're ready to try and scrape something.
Now that we can open a webpage, the next step is to be able to interact with it in some way. A good use case is extracting comments from a website, so let's use this article for that purpose: https://www.theguardian.com/commentisfree/2018/jun/21/matteo-salvini-threatening-minister-of-interior-police-protection-mafia.
If we scroll right down to the comments section, we can see that there are comments hidden behind a clickable button e.g. + View more comments.
If we further inspect the web page (CTRL + SHIFT + I), we can also see that the comments section does not appear within the HTML of the page unless we press this button, making this information unavailable for web scraping. So our task is simple: load the web page, click the button to display more comments, and then do the web scraping.
So, let's dive into
selenium and discover how we can do this.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome(executable_path="C:/Users/user_name/Desktop/Projects/rugby-webscraper/chromedriver.exe")
browser.get('https://www.theguardian.com/commentisfree/2018/jun/21/matteo-salvini-threatening-minister-of-interior-police-protection-mafia')

COMMENTS_XPATH = '//*[@id="comments"]/div/div/div/button'

timeout = 15
wait = WebDriverWait(browser, timeout)
element = wait.until(EC.element_to_be_clickable((By.XPATH, COMMENTS_XPATH)))
element = browser.find_element_by_xpath(COMMENTS_XPATH)
element.click()
```
The code above will attempt to load the website using browser.get(), navigate to the correct button using its xpath (COMMENTS_XPATH) and then click on it (element.click()). However, when we execute this code a common problem appears: a rather persistent banner pops up asking us to accept cookies, preventing us from reaching the comments button.
So, how can we solve this problem? The first thing we need to do on arrival at the webpage is clear this banner. Once that is done, we can click our desired button.
See below for the extended code with changes made:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome(executable_path="C:/Users/user_name/Desktop/Projects/rugby-webscraper/chromedriver.exe")
browser.get('https://www.theguardian.com/commentisfree/2018/jun/21/matteo-salvini-threatening-minister-of-interior-police-protection-mafia')

COMMENTS_XPATH = '//*[@id="comments"]/div/div/div/button'
COOKIES_XPATH = '//*[@id="top"]/div/div/div/div/div/div/button'

timeout = 15
wait = WebDriverWait(browser, timeout)

# Wait for the cookie banner's button to become clickable, then dismiss it...
element = wait.until(EC.element_to_be_clickable((By.XPATH, COOKIES_XPATH)))
element = browser.find_element_by_xpath(COOKIES_XPATH)
element.click()

# ...then click the button that expands the comments
element = browser.find_element_by_xpath(COMMENTS_XPATH)
element.click()
```
Now, if we execute this script, we should find that our browser automatically clears the cookies banner, jumps straight to the comments section and then expands the comments for us, now enabling us to scrape them.
Let's break down the key steps:
- The .get() method fetches the web page.
- WebDriverWait tells the webdriver to wait, either for 15 seconds (the timeout variable) or until some condition is met.
- The .until() method tells the webdriver to look for a specific object on the page, and to throw an exception if it can't find it (or if we take longer than 15 seconds and time out).
- find_element_by_xpath() and .click() work together to navigate to the correct button and then click on it.
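With the comments expanded, browser.page_source now contains their HTML, which you could hand to BeautifulSoup as in previous articles. As a dependency-free sketch of that parsing step, here is the same idea using the standard library's html.parser; the comment-body class name and the sample HTML are assumptions for illustration - inspect the live page to find the real markup:

```python
from html.parser import HTMLParser

class CommentExtractor(HTMLParser):
    """Collect the text of every <div class="comment-body"> element.
    The "comment-body" class name is assumed for illustration."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many <div>s deep inside a comment we are
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1     # a nested div inside a comment
        elif "comment-body" in dict(attrs).get("class", "").split():
            self.depth = 1      # entering a new comment
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.comments[-1] += data.strip()

# A static snippet stands in for browser.page_source here:
extractor = CommentExtractor()
extractor.feed('<div class="comment-body"><p>First comment</p></div>'
               '<div class="comment-body">Second comment</div>')
print(extractor.comments)
```

In the real script you would call extractor.feed(browser.page_source) after the button click rather than feeding in a static snippet.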
We could easily extend this script to collect a variety of data, or indeed wrap our code in a try-except block to catch common exceptions such as timeouts.
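WebDriverWait raises Selenium's TimeoutException when .until() gives up, and that is exactly what such a try-except block would catch. The poll-until-deadline idea behind it can be sketched in plain Python; poll_until and PollTimeout below are hypothetical stand-ins, not part of Selenium:

```python
import time

class PollTimeout(Exception):
    """Raised when the condition is not met before the deadline."""

def poll_until(condition, timeout=15, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value,
    raising PollTimeout after `timeout` seconds - the same pattern
    WebDriverWait.until() applies to page elements."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise PollTimeout(f"condition not met within {timeout}s")

# The try-except wrapper then looks like:
try:
    poll_until(lambda: False, timeout=1, interval=0.2)
except PollTimeout:
    print("Timed out - the element never appeared")
```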
To wrap up, we have learned how to use
selenium to perform slightly more complex web scraping, applied to a new data collection scenario. Hopefully you found this tutorial easy to follow along with and will be able to utilise the code in your own projects.
Don't forget to stay tuned to the Keyrus blog for more content like this!