This article is part of a series that goes through all the steps needed to write a script that automatically scrapes information from a website. Other articles in this series so far covered:
Introduction to Python scrapingA Selenium scraper for TripAdvisor reviews
This last article will focus on how to improve the first version of the scraper script by addressing its limitations.
The final version of the scraper code can be found here.
Script functional limitations
Getting reviews written in all languages
By default, TripAdvisor only shows English reviews. To view all languages, one needs to click All languages and wait for the JavaScript to download the other reviews. The relevant HTML parts are shown below:
<div class="item" data-value="ALL" data-tracker="All languages">
<label for="filters_detail_language_filterLang_ALL" class="label container cx_brand_refresh_phase2">
<!-- All Languages radio button -->
<input id="filters_detail_language_filterLang_ALL" type="radio" name="filters_detail_language_filterLang_0"
value="ALL" onchange=" .. JavaScript ..">
<span class="checkmark"></span>
All languages
</label>
</div>
<div class="item" data-value="en" data-tracker="English"> .. </div>
<div class="item" data-value="it" data-tracker="Italian"> .. </div>
<div class="item" data-value="fr" data-tracker="French"> .. </div>
The following Python code clicks the radio button to display reviews in all languages:
# Get reviews in all languages
all_langs = driver.find_element_by_id("filters_detail_language_filterLang_ALL")
all_langs.click()
Recording the language
While there is no visible indication of the language of the review text, it is possible to get this information from the link of the Google Translation button. The URL has two parameters sl=it
and tl=en
, where sl indicates the starting language. In the example below the review is in Italian:
<div class="ui_column is-9"> <!-- REVIEW -->
..
<div class="prw_rup prw_reviews_google_translate_button_hsx" .. >
<span .. data-url="/MachineTranslation? .. sl=it&tl=en">
Google Translation
</span>
</div>
..
</div>
The following function does all the work needed to get the review language:
def get_language(review):
lang = "en" #assume English
lang_divs = find_elements(review_div, REVIEW_LANG)
if len(lang_divs) > 0:
button = lang_divs[0]
span = button.find_elements_by_tag_name("span")[0]
url = span.get_attribute("data-url")
index = url.find("sl=")
if index != -1:
lang = url[index+3: index+5]
return lang
Going through all the pages
Every restaurant will have a different number of reviews; hence we cannot assume how many pages there are. The scraper needs to look for the last page and go through them all, one by one. From the HTML code below one can see that the last page link uses specific CSS classes pageNum last
and has an attribute data-page-number
with a value indicating the number of pages:
<div class="pageNumbers">
<a data-page-number="1" .. >1</a>
<a data-page-number="2" .. >2</a>
<a data-page-number="3" .. >3</a>
..
<a class="separator cx_brand_refresh_phase2">…</a>
<!-- Link to the last page -->
<a data-page-number="41" data-offset="400" class="pageNum last cx_brand_refresh_phase2" .. >41</a>
</div>
The following code gets the number of pages, and stores it as an int value that can be used as a loop counter:
# Find the last page number
last_page = driver.find_element(By.CSS_SELECTOR, 'a.pageNum.last');
pages = last_page.get_attribute("data-page-number")
Reading all the reviews in one page
The initial scrips uses the CSS class ui_column is-9
to identity a review. The problem with this approach is that restaurant owner responses and questions asked by users also use that same CSS class. This means that there is an unknown number of elements with that CSS class and not all of them are reviews. The simplified HTML structure can be found below:
<div id="taplc_location_reviews_list_resp_rr_resp_0" .. > <!-- List of reviews -->
..
<div class="ui_column is-9"> <!-- A single review -->
..
<p class="partial_entry"> Actual review text </p>
</div>
</div>
The scraper needs to get DIV elements with class=ui_column is-9
under a specific element (list of reviews) to avoid the user questions. It also needs to use the first P element with class=partial_entry
to avoid restaurant owner responses.
Scraper improvements
Waits
Use time.sleep()
to pause the script execution for a number of seconds, until a page loads is not considered good practice. If the number of seconds is too small there is the risk that the loading is not ready yet, while if the scripts sleeps for too long it will degrade the performance of the scraper. Selenium offers a solution to this in the form of implicit and explicit waits.
When using implicit waits, the WebDriver will poll the HTML document a number of times when an element is not immediately available. By default, implicit waits are disabled, but they can go a long way in making our scraper less brittle. To enable implicit waits, the following line of code is added after creating the driver:
driver.implicitly_wait(10) # implicitly wait 10 seconds
Occasionally the scraper needs to wait for something very specific to happen. This is the case when having to wait for the loading DIV id=taplc_hotels_loading_box_rr_resp_0
to disappear after clicking All languages. The code below achieves this by using explicit waits:
# Wait for reviews to load
WebDriverWait(driver, 30).until(
EC.invisibility_of_element_located((By.ID, "taplc_hotels_loading_box_rr_resp_0"))
)
The More button
The More button to load full reviews proved to be quite challenging. On pressing the More button some work is done, the HTML element becomes stale and is replaced with a new one saying Show less. Apart from the test, the elements are identical. It becomes a timing game, where clicking the button at different times might result in:
- the expected behaviour: showing the full reviews
- stale element reference: element is not attached to the page document
- element click intercepted: Element is not clickable. Other element would receive the click
Hence a custom wait has been created that will find the button and try to press it. The custom wait will only stop when the Show less text appears. It will also ignore any exceptions in the process. The code below shows the custom wait and its usage:
class more_loads_full_reviews(object):
def __init__(self, locator):
self.locator = locator
def __call__(self, driver):
try:
button = find_element(driver, self.locator)
text = button.text
if text == 'Show less':
return True
button.click()
return False
except Exception as e:
return False
# Usage
WebDriverWait(driver, 30).until(more_loads_full_reviews(MORE_BTN))
General script maintainability and efficiency
All the locators have been moved to the top of the code and declared as ‘constants’. This makes the code more readable, enables reusability and makes future maintenance easier. One could also consider the Page Object Model design pattern but that would be beyond the scope of this article. As can be seen below, the use of XPath locators has been minimised:
CURR_PAGE_NUM = (By.CSS_SELECTOR, "a.pageNum.current")
MORE_BTN = (By.CSS_SELECTOR, "span.taLnk.ulBlueLinks")
NEXT_BTN = (By.CSS_SELECTOR, "a.nav.next")
LAST_PAGE = (By.CSS_SELECTOR, "a.pageNum.last")
ALL_LANGS = (By.ID, "filters_detail_language_filterLang_ALL")
LOADING_DIV = (By.ID, "taplc_hotels_loading_box_rr_resp_0")
REVIEW_LIST = (By.ID, "taplc_location_reviews_list_resp_rr_resp_0")
REVIEWS = (By.CSS_SELECTOR, "div.ui_column.is-9")
SCORE = (By.XPATH, ".//span[contains(@class, 'ui_bubble_rating bubble_')]")
DATE = (By.CSS_SELECTOR, "span.ratingDate")
TITLE = (By.CSS_SELECTOR, "span.noQuotes")
REVIEW_TEXT = (By.CSS_SELECTOR, "p.partial_entry")
REVIEW_LANG = (By.CSS_SELECTOR, "div.prw_reviews_google_translate_button_hsx")
Other minor improvements include:
- Saving CSV in utf-16 encoding to support emoji and not western characters
- Use functions to group related lines of code
- Adding comments and helpful traces
- Asserts to make sure that all assumptions actually hold