This article is part of a series that goes through all the steps needed to write a script that reads information from a website and saves it locally. Make sure that all the pre-requisites (at the end of this article) are in place before continuing.
Installing Selenium and other requirements
Selenium setup requires two steps:
- Install the Selenium library using the command: pip install selenium
- Download the Selenium WebDriver for your browser (the driver version must match your browser version exactly)
Chrome drivers can be found on chromium.org
Scraper setup requires two commands:
- pip install requests
- pip install beautifulsoup4
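Once the commands above have run, you can confirm that the packages are importable with a short check. This is a minimal sketch: it only inspects the current environment and does not install anything.

```python
# Check that the scraping dependencies are importable before going further.
import importlib.util

required = ("selenium", "requests", "bs4")
available = {pkg: importlib.util.find_spec(pkg) is not None for pkg in required}
for pkg, ok in available.items():
    print(f"{pkg}: {'installed' if ok else 'MISSING - install it with pip'}")
```

Note that Beautiful Soup is installed as `beautifulsoup4` but imported as `bs4`.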
Scraping a website
What is web scraping?
Scraping is like browsing to a website and copying some content, but it is done programmatically (e.g. using Python), which means that it is much faster. The limit to how fast you can scrape is basically your bandwidth and computing power (and how much the web server allows). Technically this process can be divided into two parts:
- Crawling is the first part, which basically involves opening a page and finding all the interesting links in it, e.g. shops listed in a section of the yellow pages.
- Scraping comes next, where all the links from the previous step are visited to extract specific parts of the web page, e.g. the address or phone number.
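The two steps above can be sketched as follows. The HTML snippets, URLs and CSS classes here are made up for illustration; on a live site each string would instead come from `requests.get(url).text`.

```python
# A minimal crawl-then-scrape sketch using inline HTML stand-ins for real pages.
from bs4 import BeautifulSoup

listing_html = """
<div class="results">
  <a class="shop" href="/shops/acme">Acme Hardware</a>
  <a class="shop" href="/shops/brix">Brix Bakery</a>
</div>
"""
detail_pages = {
    "/shops/acme": '<p class="phone">555-0101</p>',
    "/shops/brix": '<p class="phone">555-0102</p>',
}

# Crawling: open the listing page and collect the interesting links.
soup = BeautifulSoup(listing_html, "html.parser")
links = [a["href"] for a in soup.select("a.shop")]

# Scraping: visit each link and extract a specific field (here, the phone number).
phones = {}
for href in links:
    detail_soup = BeautifulSoup(detail_pages[href], "html.parser")
    phones[href] = detail_soup.select_one("p.phone").get_text()

print(phones)
```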
Challenges of Scraping
One main challenge is that websites tend to be varied, and you will likely end up writing a scraper specific to every site you are dealing with. Even if you stick with the same websites, updates and re-designs will likely break your scraper in some way (you will be using the browser's developer tools, opened with F12, frequently).
Some websites do not tolerate being scraped and will employ different techniques to slow or stop scraping. Another aspect to consider is the legality of this process, which depends on where the server is located, the terms of service and what you do once you have the data, amongst other things.
An alternative to web scraping, when available, is an Application Programming Interface (API), which offers a way to access structured data directly (using formats like JSON and XML) without dealing with the visual presentation of the web pages. Hence it is always a good idea to check if the website offers an API before investing time and effort in a scraper.
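To see why an API is so much easier to work with, consider the snippet below. The JSON string stands in for a response body from a hypothetical endpoint; with a real API you would obtain it via `requests.get(url).json()`. No HTML parsing is needed.

```python
# Structured data from an API: parse JSON directly, no scraping required.
import json

response_body = '{"name": "Acme Hardware", "phone": "555-0101", "address": "1 Main St"}'
record = json.loads(response_body)
print(record["phone"])  # prints 555-0101
```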
While there are many ways to get data from web pages (e.g. using Excel, browser plugins or other tools), this article will focus on how to do it with Python. The flexibility of a programming language makes it a very powerful approach, and there are very good libraries available such as Beautiful Soup, which will be used in the sample below. There is a very good write-up on how to Build a Web Scraper With Beautiful Soup. Another framework to consider is Scrapy.
What is Selenium? Why is it needed?
Selenium is a tool that automates a real browser (such as Chrome) through a WebDriver. It is needed when a page builds its content dynamically with JavaScript: a plain HTTP library like requests only sees the raw HTML returned by the server, while Selenium sees the page as the browser renders it, and can also interact with it (clicking buttons, filling in forms). Beautiful Soup and Selenium can also be used together, as shown in this interesting article at freecodecamp.org.
Building a first scraper
This first scraper will perform the following steps:
- Visit the page and parse source HTML
- Check that the page title is as expected
- Perform a search
- Look for expected result
- Get the link URL
Two implementations, one using Beautiful Soup and one using Selenium, can be found below.
Scraper using Beautiful Soup
```python
import requests
from bs4 import BeautifulSoup

# Visit page and parse source HTML
page = requests.get("http://www.python.org")
soup = BeautifulSoup(page.content, 'html.parser')

# Check title is as expected
assert "Python" in soup.title.string

# Perform the search using an HTTP GET request
page = requests.get("https://www.python.org/search/?q=pip")
soup = BeautifulSoup(page.content, 'html.parser')

# Look for expected result
link = soup.select_one('a:contains("PEP 439")')

# Get the link URL
print(link['href'])
```
Scraper using Selenium
```python
from selenium import webdriver

# Open browser and visit page
driver = webdriver.Chrome()
driver.get("http://www.python.org")

# Check title is as expected
assert "Python" in driver.title

# Find search field
search_field = driver.find_element_by_id("id-search-field")

# Enter search term
search_field.send_keys("pip")

# Find and click the Search button
search_button = driver.find_element_by_id("submit")
search_button.click()

# Look for expected result
link = driver.find_element_by_partial_link_text("PEP 439")

# Get the link URL
print(link.get_attribute("href"))

# Close browser
driver.close()
```
This is part of a series that goes through all the steps needed to write a script that reads information from a website and saves it locally. This section lists all the technologies you should be familiar with and all the tools that need to be installed.
Basic knowledge of HTML
This article series assumes familiarity with Python 3.x and a basic understanding of web page source code, including:
- HTML document structure
- Attributes of common HTML elements
- CSS classes
- HTTP request parameters
- Awareness of lazy-loading techniques
A good place to start is W3Schools.
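The HTML concepts listed above appear in miniature in the snippet below: a tag (`a`), its attributes (`href`, `class`) and a CSS class (`external`) that a scraper could target. It is parsed here with the standard library's `html.parser`, so nothing needs to be installed yet; the HTML itself is made up for illustration.

```python
# Collect the attributes of every <a> tag using only the standard library.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs))

parser = LinkCollector()
parser.feed('<p>See <a href="https://example.com" class="external">the docs</a>.</p>')
print(parser.links)
```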
Installing Python
As part of the pre-requisites, installing the correct version of Python and pip is required. This setup section assumes a Windows operating system, but it should be easily transferable to macOS or Linux.
Which Python version should one use: Python 2 or 3? This might have been a point of discussion in the past: since the two are not compatible, one had to pick a version (Python 2.7, the last major release of Python 2.x, came out in 2010). Today (2020), however, it is safe to go with Python 3, the latest stable version at the time of writing being 3.8.
Start by downloading the latest version of Python 3 from the official website. Install it as you would with any other software. Make sure you add python to the PATH as shown below.
To confirm that it was successfully installed, open a Command Prompt window and type python; you should see something like the following:
```
C:\WINDOWS\System32>python
Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:20:19) [MSC v.1925 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
```
Installing and using pip
pip is the package installer for Python. It is very likely that it came along with your Python installation. You can check by entering pip -V in a Command Prompt window, and you should see something like the following:
```
C:\WINDOWS\System32>pip -V
pip 20.1.1 from c:\path\to\python\python38-32\lib\site-packages\pip (python 3.8)
```
If pip is not available, it needs to be installed by following these steps:
- Download get-pip.py to a folder on your computer.
- Open a command prompt
- Navigate to the folder where get-pip.py was saved
- Run the following command: python get-pip.py
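After running get-pip.py, you can confirm that pip is reachable from Python itself. Invoking it as `python -m pip` sidesteps any PATH issues with the standalone pip command:

```python
# Confirm that pip is available in the current Python installation.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True, text=True,
)
print(result.stdout.strip())
```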