This article continues the series on automatically scraping information from TripAdvisor. The other articles in the series can be found here: Introduction to Web Scraping
The data used in this demonstration was obtained using the techniques from the previous articles in the series. The file used in this article can be downloaded from here.
The main aim of this article is to start exploring the data found in the scraped reviews. Various techniques will be used to obtain insights.
Loading Required Libraries
The first step is to load all the required libraries; you will need to install some of them. You can get the environment file from here.
```python
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import nltk
import unicodedata
import re
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
```
For this example, all reviews for a particular restaurant were downloaded into a CSV file; this data will now be loaded into a pandas DataFrame.
This dataset is made up of five columns:
- Score – the score given to the restaurant per review
- Date – when the review was submitted
- Title – the title for the review
- Review – the review text
- Language – the language for the review
By using shape one can observe that, besides the five columns, there are 1250 records.
```python
# encoding utf-16 is used to cater for a large variety of characters, including emojis
data = pd.read_csv("./reviews.csv", encoding='utf-16')
print(data.head())  # first 5 records
print(data.shape)   # structure of the dataframe (rows, columns)
```
```
   Score             Date                         Title  \
0     50  August 15, 2020  Tappa culinaria obbligatoria
1     40  August 13, 2020                   Resto sympa
2     50   August 9, 2020          Storie e VERI Sapori
3     50   August 8, 2020          We love your pizzas!
4     50   August 7, 2020                        #OSEMA

                                              Review Language
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it
1  Resto sympa tout comme les serveurs. Seul peti...       fr
2  Abbiamo cenato presso questo ristorante in una...       it
3  I went to this restaurant yesterday evening wi...       en
4  Cena di coppia molto piacevole, staff cordiale...       it
```
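Before cleaning anything, it can be worth checking how the reviews split by language, since the Language column will matter later when analysing text. A minimal sketch (using a toy DataFrame in place of the real reviews.csv; the column names are assumed to match the ones above):

```python
import pandas as pd

# toy stand-in for the scraped dataset; the real file has the same columns
sample = pd.DataFrame({
    'Review': ['Great pizza', 'Resto sympa', 'Ottimo cibo', 'Lovely staff'],
    'Language': ['en', 'fr', 'it', 'en'],
})

# value_counts() tallies how many reviews were written in each language
lang_counts = sample['Language'].value_counts()
print(lang_counts)
```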
Cleaning the Date
The review date’s format (e.g. August 15, 2020) is not suitable for our analysis, so the first step is to re-format the date.
For this task, a function will be created. The function will read the date, identify the different parts (month, day, and year) and then extract the desired parts, in this case only the month and year.
```python
def formatDate(val):
    return datetime.datetime.strptime(val, '%B %d, %Y').strftime('%m/%Y')

# lambda is used to apply a function to all rows in a data frame
data['Date'] = data['Date'].apply(lambda x: formatDate(x))
print(data['Date'].head(2))

# the date must be converted to a date type so that it is easier to plot
data['Date'] = pd.to_datetime(data['Date'])
print(data['Date'].head(2))
```
```
0    08/2020
1    08/2020
Name: Date, dtype: object
0   2020-08-01
1   2020-08-01
Name: Date, dtype: datetime64[ns]
```
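As a side note, the same result can be reached without a custom function: pandas can parse the TripAdvisor-style date strings directly, and to_period('M') truncates them to month precision. A sketch of this alternative (not the approach used in the article):

```python
import pandas as pd

dates = pd.Series(['August 15, 2020', 'August 13, 2020'])

# parse the "Month day, Year" strings in one step
parsed = pd.to_datetime(dates, format='%B %d, %Y')

# keep only month and year; the day component is dropped
months = parsed.dt.to_period('M')
print(months)
```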
In certain cases plotting will be done based on the year, so a new column will be created with just the year.
```python
data['Year'] = data['Date'].dt.year
print(data.head(2))
```
```
   Score       Date                         Title  \
0     50 2020-08-01  Tappa culinaria obbligatoria
1     40 2020-08-01                   Resto sympa

                                              Review Language  Year
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  2020
1  Resto sympa tout comme les serveurs. Seul peti...       fr  2020
```
The score must also be divided by ten to obtain a number from 1 to 5, since it currently ranges from 10 to 50.
```python
data['Score'] = (data['Score'] / 10).astype(int)
print(data.head(2))
```
```
   Score       Date                         Title  \
0      5 2020-08-01  Tappa culinaria obbligatoria
1      4 2020-08-01                   Resto sympa

                                              Review Language  Year
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  2020
1  Resto sympa tout comme les serveurs. Seul peti...       fr  2020
```
In order to get an idea of how this restaurant fares, a chart will be generated showing the total number of reviews per score.
The function value_counts() will be used to produce a count for each of the unique values, so in this case, it will count how many reviews there are per score.
sort_index() is used so that the result is shown sorted by the score.
```python
# seaborn here refers to a styling setting that makes charts look nicer
# (on Matplotlib 3.6 and later the style is named 'seaborn-v0_8')
plt.style.use('seaborn')
data['Score'].value_counts().sort_index().plot(kind='bar')
plt.title('Score Distribution')
plt.xlabel('Score')
plt.ylabel('Total Reviews')
```
One can also observe how many reviews were done in each year.
```python
data['Year'].value_counts().sort_index().plot()
plt.title('Number of Reviews Per Year')
plt.xlabel('Year')
plt.ylabel('Reviews')
```
We can refine this and see the number of reviews per month.
```python
data['Date'].value_counts().sort_index().plot.line(figsize=(10, 5))
plt.title('Reviews')
plt.xlabel('Date')
plt.ylabel('Review Count')
plt.show()
```
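The value_counts approach works here because every date was already truncated to the first of its month. When the dates still carry full day precision, the rows can be bucketed by calendar month with pd.Grouper instead. A small sketch on toy data:

```python
import pandas as pd

# toy dates at full day precision
dates = pd.to_datetime(['2020-08-15', '2020-08-13', '2020-07-02', '2020-07-30'])
df = pd.DataFrame({'Date': dates, 'Score': [5, 4, 5, 3]})

# pd.Grouper buckets the rows by month ('MS' = month start) without
# modifying the Date column itself
per_month = df.groupby(pd.Grouper(key='Date', freq='MS')).size()
print(per_month)
```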
Another interesting observation would be the average score per year, to see how it fared over the years.
For this, we first calculate the total number of reviews, along with the average score, for each year.
```python
# group by the column Year and, for each group, count how many reviews
# there were and calculate their average score (only the Score column
# is aggregated, since count/mean make no sense on the text columns)
data_score = data.groupby("Year")[['Score']]\
    .agg(['count', 'mean'])\
    .reset_index()
print(data_score)
```
```
   Year Score
        count      mean
0  2015     6  4.000000
1  2016   219  3.716895
2  2017   269  3.579926
3  2018   287  3.494774
4  2019   397  4.622166
5  2020    72  4.736111
```
The next step is to plot the year against the mean score. We will also add a reference line showing the mean of the yearly averages, so that each year can be compared against it.
```python
plt.plot(data_score['Year'], data_score['Score']['mean'])
plt.title('Average Score per Year')
plt.xlabel('Year')
plt.ylabel('Score')
# reference line: the mean of the yearly averages
plt.axhline(data_score["Score"]['mean'].mean(), color='green', linestyle='--')
plt.show()
```
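It is worth noting that the reference line is the mean of the yearly averages, which is not the same as the overall average across all reviews whenever the years contain different numbers of reviews. A small sketch of the difference on toy data:

```python
import pandas as pd

df = pd.DataFrame({'Year':  [2019, 2019, 2019, 2020],
                   'Score': [5, 5, 5, 1]})

overall_mean = df['Score'].mean()              # weights every review equally
yearly = df.groupby('Year')['Score'].mean()    # one average per year
mean_of_yearly = yearly.mean()                 # weights every year equally

print(overall_mean, mean_of_yearly)
```

Here the overall mean is 4.0 but the mean of the yearly averages is 3.0, because the single low 2020 review carries as much weight as the whole of 2019.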
It can also be interesting to analyse the length of each review. Since review length varies a lot, we will group the reviews into three bins using the pandas function cut.
You can see that most of the reviews fall under 1100 characters.
```python
print(pd.cut(data['Review'].str.len(), 3, include_lowest=True).value_counts())
```
```
(57.855000000000004, 1109.0]    1212
(1109.0, 2157.0]                  33
(2157.0, 3205.0]                   5
Name: Review, dtype: int64
```
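How pd.cut slices a numeric range into equal-width bins can be seen on a toy series (a sketch; the real call above runs on the review lengths):

```python
import pandas as pd

lengths = pd.Series([10, 20, 30, 40, 50, 60])

# three equal-width bins spanning the min..max range; include_lowest
# makes sure the smallest value falls inside the first interval
bins = pd.cut(lengths, 3, include_lowest=True)
counts = bins.value_counts().sort_index()
print(counts)
```

Each bin covers a third of the 10–60 range, so every bin ends up with two values.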
To try to obtain better insights, we will ignore the longer reviews and focus on the shorter ones.
One can see that most of the reviews still have less than 404 characters.
```python
shorter_reviews = data[data['Review'].str.len() < 1100]
shorter_reviews = pd.cut(shorter_reviews['Review'].str.len(), 3).value_counts()
print(shorter_reviews)
```
```
(59.97, 404.333]      908
(404.333, 747.667]    241
(747.667, 1091.0]      63
Name: Review, dtype: int64
```
We can plot the data to have a visual indication.
```python
shorter_reviews = data[data['Review'].str.len() < 1100]
# in this case labels are passed to the cut function to display proper labels
shorter_reviews = pd.cut(shorter_reviews['Review'].str.len(), 3,
                         labels=['Shortest', 'Short', 'Medium']).value_counts()
# rot=0 keeps the labels on the x-axis horizontal
shorter_reviews.plot.bar(rot=0)
plt.title('Review Character Distribution')
plt.xlabel('Number of Characters')
plt.ylabel('Total Reviews')
```
Here we can see the average score by review length. One can observe that the longer the review, the lower the score tends to be.
```python
# a copy of the original data is made
data2 = data.copy()
# a new column Bins stores which length category each review falls under
data2['Bins'] = pd.cut(data['Review'].str.len(), 6, include_lowest=True,
                       labels=['Shortest', 'Short', 'Medium', 'Long', 'Longer', 'Longest'])
print(data2.head(2))
# each Bin is then grouped to find its count and average score
data_score = data2.groupby("Bins")['Score'].agg(['count', 'mean'])
print(data_score)
```
```
   Score       Date                         Title  \
0      5 2020-08-01  Tappa culinaria obbligatoria
1      4 2020-08-01                   Resto sympa

                                              Review Language  Year      Bins
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  2020  Shortest
1  Resto sympa tout comme les serveurs. Seul peti...       fr  2020  Shortest

          count      mean
Bins
Shortest   1072  4.182836
Short       140  2.985714
Medium       29  2.172414
Long          4  2.000000
Longer        2  1.500000
Longest       3  1.333333
```
The next part of the series will deal with analysing the review text.