Further Exploration

The data used in this demonstration was obtained using the scraping techniques from the previous articles in this series. The file used in this article can be downloaded from here.

The main aim of this article is to start exploring the data found in the scraped reviews. Various techniques will be used to obtain insights.

Loading Required Libraries

The first step is to load the required libraries; you may need to install some of them first. You can get the environment file from here.

import pandas as pd
import datetime
import matplotlib.pyplot as plt
import nltk
import unicodedata
import re
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

Loading the Data

We will load the same data file that was used for the previous article.

# The file is read as UTF-16, an encoding that can represent a large variety of characters, including emojis
data = pd.read_csv("./reviews.csv", encoding='utf-16')
print(data.head()) #first 5 records
   Score             Date                         Title  \
0     50  August 15, 2020  Tappa culinaria obbligatoria   
1     40  August 13, 2020                   Resto sympa   
2     50   August 9, 2020          Storie e VERI Sapori   
3     50   August 8, 2020          We love your pizzas!   
4     50   August 7, 2020                        #OSEMA   

                                              Review Language  
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  
1  Resto sympa tout comme les serveurs. Seul peti...       fr  
2  Abbiamo cenato presso questo ristorante in una...       it  
3  I went to this restaurant yesterday evening wi...       en  
4  Cena di coppia molto piacevole, staff cordiale...       it  

Cleaning the Data

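For this example, only the score will be cleaned. The scores are stored as multiples of 10 (50, 40 and so on), so they are divided by 10 to obtain a single-digit score.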

data['Score'] = (data['Score'] / 10).astype(int)
print(data.head(2))
   Score             Date                         Title  \
0      5  August 15, 2020  Tappa culinaria obbligatoria   
1      4  August 13, 2020                   Resto sympa   

                                              Review Language  
0  Abbiamo soggiornato a Malta 4 giorni e tre vol...       it  
1  Resto sympa tout comme les serveurs. Seul peti...       fr 

Text Analysis

In this case, the file already has a column showing the language of each review. The following is an example of what can be done if this column were not present.

The langdetect library can be used to identify the languages.

from langdetect import detect
data['Detected Lang'] = data['Review'].apply(detect)
data['Detected Lang'].value_counts()
it    732
en    406
fr     37
de     24
es     21
el      6
nl      6
pl      4
ja      4
no      3
sv      2
ru      2
cs      1
pt      1
ko      1
Name: Detected Lang, dtype: int64

The following shows the languages recorded in the original data. The totals are almost identical; they differ by a single review (733 it and 405 en in the original data, against 732 it and 406 en from langdetect).

data['Language'].value_counts()
it    733
en    405
fr     37
de     24
es     21
el      6
nl      6
pl      4
ja      4
no      3
sv      2
ru      2
cs      1
pt      1
ko      1
Name: Language, dtype: int64
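
Comparing totals only shows that the overall counts are close; it does not check individual reviews. The following is a minimal sketch of a row-by-row comparison. It uses a handful of made-up label pairs (not taken from the real file) and plain Python, so it runs without the data or langdetect.

```python
from collections import Counter

# Hypothetical (original label, detected label) pairs; not taken from the real file.
pairs = [("it", "it"), ("en", "en"), ("it", "en"), ("fr", "fr"), ("en", "it")]

# Totals per column: what value_counts() reports for each column separately.
original_counts = Counter(orig for orig, _ in pairs)
detected_counts = Counter(det for _, det in pairs)

# Row-by-row disagreements, which matching totals can hide.
mismatches = Counter((orig, det) for orig, det in pairs if orig != det)
agreement = sum(orig == det for orig, det in pairs) / len(pairs)

print(original_counts == detected_counts)  # the totals are identical here
print(dict(mismatches))                    # yet two rows disagree
print(f"{agreement:.0%} row-level agreement")
```

With the DataFrame itself, the same idea could be expressed as (data['Language'] == data['Detected Lang']).mean(), or as pd.crosstab(data['Language'], data['Detected Lang']) to see which languages get confused.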

For the following exploration, only English reviews will be used.

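# .copy() gives an independent DataFrame, so columns can be added later without a SettingWithCopyWarning
data = data[data['Language']=='en'].copy()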

Word Clouds

Using word clouds is an easy way of seeing the most frequently used words.

First, we combine all the reviews into a single string using the join method.

Then all the text is lowercased, so that words written with different capitalisation are counted as the same word.

Stop words (common words such as the, when, was) are removed so that they do not appear in the cloud.

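# Join all reviews into one string, then lowercase it
text = " ".join(review for review in data.Review)
text = text.lower()
# Stop words shipped with the wordcloud package
stopwords = set(STOPWORDS)
# collocations=False counts single words only, not word pairs (bigrams)
wordcloud = WordCloud(stopwords=stopwords, collocations=False).generate(text)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()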
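
A word cloud is a picture of word frequencies, so it can help to look at the numbers directly. The following is a rough sketch of the steps described above, using made-up reviews and a tiny hypothetical stop word list. WordCloud does its own tokenising, so its exact counts may differ.

```python
import re
from collections import Counter

# Hypothetical reviews and a tiny stop word list, for illustration only.
reviews = ["The pizza was great", "Great staff and great pizza", "The service was slow"]
stop_words = {"the", "was", "and"}

# Same steps as in the text: join everything, lowercase, drop stop words.
text = " ".join(reviews).lower()
words = [w for w in re.findall(r"[a-z']+", text) if w not in stop_words]

# The most frequent words are the ones a word cloud would draw largest.
frequencies = Counter(words)
print(frequencies.most_common(2))  # [('great', 3), ('pizza', 2)]
```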

You can limit the number of words as well.

wordcloud = WordCloud(stopwords=stopwords, collocations=False, max_words=8).generate(text)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Sometimes, it makes sense to generate a word cloud for the negative reviews, and another one for the positive reviews.

# Negative reviews: scores of 1 or 2
reviews_bad = data[data['Score']<3]
text = " ".join(review for review in reviews_bad.Review)
text = text.lower()
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, collocations=False, max_words=12).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

We can repeat the same process for the positive reviews (scores above 3). Reviews with a score of exactly 3 fall into neither group.

# Positive reviews: scores of 4 or 5
reviews_good = data[data['Score']>3]
text = " ".join(review for review in reviews_good.Review)
text = text.lower()
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, collocations=False, max_words=12).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Another way of seeing the most frequent terms is to use CountVectorizer from the scikit-learn package.

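# Re-import: the name stopwords was reused for the word cloud stop words above
from nltk.corpus import stopwords
# CountVectorizer expects a list of stop words (recent scikit-learn versions reject a set)
stops = stopwords.words('english')
co = CountVectorizer(stop_words=stops)
counts = co.fit_transform(data.Review)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(counts.sum(axis=0),columns=co.get_feature_names_out()).T.sort_values(0,ascending=False).head(20)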
              0
food        260
good        226
pizza       204
staff       176
service     169
restaurant  143
place       140
us          133
one         127
great       120
nice        110
italian     101
waiter      100
friendly     96
ordered      94
pasta        87
would        85
came         77
wine         76
table        75

Using CountVectorizer, we can also obtain n-grams (sequences of adjacent words) rather than single words. In this case, we are looking at the most frequent bigrams (two words).

You can increase the ngram_range to obtain longer sequences of words.

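# CountVectorizer expects a list of stop words (recent scikit-learn versions reject a set)
stops = stopwords.words('english')
co = CountVectorizer(ngram_range=(2,2), stop_words=stops)
counts = co.fit_transform(data.Review)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(counts.sum(axis=0),columns=co.get_feature_names_out()).T.sort_values(0,ascending=False).head(20)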
                     0
storie sapori       30
food good           24
friendly staff      18
good service        16
italian food        16
barrakka gardens    15
really good         15
staff friendly      15
would recommend     13
good food           13
upper barrakka      13
food great          12
italian restaurant  12
pizza good          11
excellent service   11
go back             11
well done           11
great location      11
food service        11
best pizza          11
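
To make the idea of an n-gram concrete, here is a small plain-Python sketch of bigram counting. It imitates the general approach (tokenise each review, drop stop words, then pair up adjacent tokens) but it is not scikit-learn's implementation, and the reviews and stop words are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical reviews and stop words, for illustration only.
reviews = ["The staff were friendly", "Friendly staff and good food", "Good food, friendly staff"]
stop_words = {"the", "were", "and"}

def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigram_counts = Counter()
for review in reviews:
    # Lowercase, keep tokens of two or more word characters, and drop
    # stop words before the n-grams are formed.
    tokens = [t for t in re.findall(r"\b\w\w+\b", review.lower()) if t not in stop_words]
    # Each review is handled separately, so no bigram spans two reviews.
    bigram_counts.update(ngrams(tokens, 2))

print(bigram_counts["friendly staff"], bigram_counts["staff friendly"])  # 2 1
```

Because stop words are removed before the words are paired, a bigram can join two words that were not adjacent in the original sentence, and word order matters: "friendly staff" and "staff friendly" are counted separately, as in the table above.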

The text can also be cleaned to whatever level is required. Aggressive cleaning is not always desirable, since it can remove details that matter.

In this case, a function is created that does the following:

  • Loads a lemmatizer, which reduces words to their dictionary form (for instance, runs becomes run). With the default settings every word is treated as a noun, so a form such as running is left unchanged
  • Loads a list of English stop words
  • Normalizes the text (NFKD), meaning that compatibility characters are replaced with their equivalents and accents are separated from their base letters
  • Encodes the text to ASCII, ignoring errors, and decodes it again. This drops anything that is not an ASCII character, such as accent marks and emojis
  • Lowercases the text
  • Removes punctuation and splits the sentence into a list of words
  • Removes stop words and lemmatizes the remaining words
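def basic_clean(sentence):
    # Requires the NLTK 'wordnet' and 'stopwords' data (see nltk.download)
    wnl = nltk.stem.WordNetLemmatizer()
    stop_words = set(nltk.corpus.stopwords.words('english'))  # set for fast lookups
    # Decompose characters, drop anything that is not ASCII, then lowercase
    text_norm = (unicodedata.normalize('NFKD', sentence)
                 .encode('ascii', 'ignore')
                 .decode('utf-8', 'ignore')
                 .lower())
    # Remove punctuation and split into words
    words = re.sub(r'[^\w\s]', '', text_norm).split()
    # Drop stop words and lemmatize what is left
    txt = [wnl.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(txt)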
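
To see what the normalisation steps do on their own, here is a short sketch that uses only the standard library. Lemmatisation and stop word removal are left out because they need the NLTK data. The sample sentence is made up and is written with escape codes so the special characters are unambiguous.

```python
import re
import unicodedata

# Made-up sentence: "pizzà" (accented a), a pizza emoji and the "ﬁ" ligature.
sentence = "The pizz\u00e0 was SUPERB!!! \U0001F355 \ufb01ve stars"

# NFKD splits the accented letter into 'a' + combining accent and turns the ligature into 'fi'.
decomposed = unicodedata.normalize("NFKD", sentence)
# Encoding to ASCII with errors ignored drops the accent mark and the emoji.
ascii_only = decomposed.encode("ascii", "ignore").decode("utf-8")
lowered = ascii_only.lower()
# Strip punctuation, then split on whitespace.
words = re.sub(r"[^\w\s]", "", lowered).split()

print(words)  # ['the', 'pizza', 'was', 'superb', 'five', 'stars']
```

Note that the accented letter survives as a plain "a" only because of the NFKD step; encoding to ASCII without normalising first would drop the letter entirely.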

Applying basic_clean() to every review gives a new column containing the cleaned text: the retained words joined into a single string.

data['Cleaned'] = data['Review'].apply(basic_clean)
print(data.head(3))
    Score            Date                         Title  \
3       5  August 8, 2020          We love your pizzas!   
9       5  August 6, 2020  Most enjoyable meal in Malta   
32      5   March 6, 2020                        #Osema   

                                               Review Language Detected Lang  \
3   I went to this restaurant yesterday evening wi...       en            en   
9   Came to this little restaurant by accident, ve...       en            en   
32  Good and reasonable food nice ambient highly r...       en            en   

                                              Cleaned  
3   went restaurant yesterday evening friend order...  
9   came little restaurant accident glad busy luck...  
32  good reasonable food nice ambient highly recom...  

The cleaned column can then be used to generate word clouds or n-grams.

text = " ".join(review for review in data.Cleaned)
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

That concludes our scraping and exploration series. See you for the next one.