The data used in this demonstration was obtained using the techniques covered in the previous articles of this web scraping series. The file used in this article can be downloaded from here.
The main aim of this article is to start exploring the data found in the scraped reviews. Various techniques will be used to obtain insights.
Loading Required Libraries
The first step is to load all the required libraries; you may need to install some of them. You can get the environment file from here.
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import nltk
import unicodedata
import re
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
Loading the Data
We will load the same data file that was used for the previous article.
# encoding utf-16 is used to cater for a large variety of characters, including emojis
data = pd.read_csv("./reviews.csv", encoding='utf-16')
print(data.head()) #first 5 records
Score Date Title \
0 50 August 15, 2020 Tappa culinaria obbligatoria
1 40 August 13, 2020 Resto sympa
2 50 August 9, 2020 Storie e VERI Sapori
3 50 August 8, 2020 We love your pizzas!
4 50 August 7, 2020 #OSEMA
Review Language
0 Abbiamo soggiornato a Malta 4 giorni e tre vol... it
1 Resto sympa tout comme les serveurs. Seul peti... fr
2 Abbiamo cenato presso questo ristorante in una... it
3 I went to this restaurant yesterday evening wi... en
4 Cena di coppia molto piacevole, staff cordiale... it
Cleaning the Data
For this example, only the score will be cleaned. The raw values are on a 10–50 scale, so they are converted to the more familiar 1–5 scale.
data['Score'] = (data['Score'] / 10).astype(int)
print(data.head(2))
Score Date Title \
0 5 August 15, 2020 Tappa culinaria obbligatoria
1 4 August 13, 2020 Resto sympa
Review Language
0 Abbiamo soggiornato a Malta 4 giorni e tre vol... it
1 Resto sympa tout comme les serveurs. Seul peti... fr
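The Date column is stored as plain text (for example, August 15, 2020). If you want to analyse reviews over time, it can be parsed into proper datetime values. A minimal sketch, assuming all dates follow the format shown above:
# Parse the textual dates so reviews can be sorted and grouped chronologically
data['Date'] = pd.to_datetime(data['Date'], format='%B %d, %Y')
# For example, count the number of reviews per month
print(data.groupby(data['Date'].dt.to_period('M')).size().head())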
Text Analysis
In this case, the file already has a column showing the language of each review. The following is an example of what can be done if this column were not present.
The langdetect library can be used to identify the languages.
from langdetect import detect
data['Detected Lang'] = data['Review'].apply(lambda x: detect(x))
data['Detected Lang'].value_counts()
it 732
en 406
fr 37
de 24
es 21
el 6
nl 6
pl 4
ja 4
no 3
sv 2
ru 2
cs 1
pt 1
ko 1
Name: Detected Lang, dtype: int64
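Note that detect() can raise an exception (LangDetectException) if a review is empty or contains no usable text. The data used here is clean enough, but on messier data a defensive wrapper may be safer. A minimal sketch, not part of the original code:
from langdetect import detect, LangDetectException

def safe_detect(text):
    # Return the detected language, or None when detection fails
    try:
        return detect(text)
    except LangDetectException:
        return None

data['Detected Lang'] = data['Review'].apply(safe_detect)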
The following shows the language recorded with the original reviews. Comparing the two counts, langdetect agrees with the original labels almost perfectly; the totals differ by just one review.
data['Language'].value_counts()
it 733
en 405
fr 37
de 24
es 21
el 6
nl 6
pl 4
ja 4
no 3
sv 2
ru 2
cs 1
pt 1
ko 1
Name: Language, dtype: int64
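Rather than comparing the two frequency tables by eye, the columns can also be compared row by row. A short sketch, not part of the original code, that lists the reviews where the detected language differs from the original label:
# Find reviews where langdetect disagrees with the original Language column
mismatches = data[data['Language'] != data['Detected Lang']]
print(len(mismatches))
print(mismatches[['Title', 'Language', 'Detected Lang']])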
For the following exploration, only English reviews will be used.
data = data[data['Language']=='en']
Word Clouds
Using word clouds is an easy way of seeing the most frequently used words.
First, all the reviews are joined into a single string using the join function, so the whole corpus can be passed to the word cloud at once.
Then all the text is lowercased, so that the same word written with different capitalisation is counted as one word.
Stop words (common words such as the, when, was) are removed so that they do not dominate the cloud.
text = " ".join(review for review in data.Review)
text = text.lower()
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, collocations=False).generate(text)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

You can limit the number of words as well.
wordcloud = WordCloud(stopwords=stopwords, collocations=False, max_words=8).generate(text)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Sometimes, it makes sense to generate a word cloud for the negative reviews, and another one for the positive reviews.
reviews_bad = data[data['Score']<3]
text = " ".join(review for review in reviews_bad.Review)
text = text.lower()
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, collocations=False, max_words=12).generate(text.lower())
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

We can repeat the same process for the positive ones.
reviews_good = data[data['Score']>3]
text = " ".join(review for review in reviews_good.Review)
text = text.lower()
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, collocations=False, max_words=12).generate(text.lower())
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Another way of seeing the most frequent terms is to use CountVectorizer from the scikit-learn package.
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
co = CountVectorizer(stop_words=stops)
counts = co.fit_transform(data.Review)
pd.DataFrame(counts.sum(axis=0),columns=co.get_feature_names()).T.sort_values(0,ascending=False).head(20)
| Term | Count |
|---|---|
| food | 260 |
| good | 226 |
| pizza | 204 |
| staff | 176 |
| service | 169 |
| restaurant | 143 |
| place | 140 |
| us | 133 |
| one | 127 |
| great | 120 |
| nice | 110 |
| italian | 101 |
| waiter | 100 |
| friendly | 96 |
| ordered | 94 |
| pasta | 87 |
| would | 85 |
| came | 77 |
| wine | 76 |
| table | 75 |
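Since matplotlib is already imported, the same counts can also be plotted. A minimal sketch, not part of the original code, showing the top 20 terms as a horizontal bar chart:
# Build a frequency table from the CountVectorizer output and plot the top 20 terms
freq = pd.DataFrame(counts.sum(axis=0), columns=co.get_feature_names()).T
top20 = freq.sort_values(0, ascending=False).head(20)
top20.sort_values(0).plot(kind='barh', legend=False)
plt.xlabel("Frequency")
plt.show()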
Using CountVectorizer we can also obtain n-grams (sequences of words) rather than single words. In this case, we are looking at the most frequent bigrams (pairs of words).
You can increase the ngram_range to obtain longer sequences of words, as shown in the sketch after the table below.
stops = set(stopwords.words('english'))
co = CountVectorizer(ngram_range=(2,2), stop_words=stops)
counts = co.fit_transform(data.Review)
pd.DataFrame(counts.sum(axis=0),columns=co.get_feature_names()).T.sort_values(0,ascending=False).head(20)
| Bigram | Count |
|---|---|
| storie sapori | 30 |
| food good | 24 |
| friendly staff | 18 |
| good service | 16 |
| italian food | 16 |
| barrakka gardens | 15 |
| really good | 15 |
| staff friendly | 15 |
| would recommend | 13 |
| good food | 13 |
| upper barrakka | 13 |
| food great | 12 |
| italian restaurant | 12 |
| pizza good | 11 |
| excellent service | 11 |
| go back | 11 |
| well done | 11 |
| great location | 11 |
| food service | 11 |
| best pizza | 11 |
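For instance, changing the range to (3, 3) returns the most frequent trigrams instead. A minimal sketch following the same pattern as above:
# Same approach, but counting three-word sequences (trigrams)
co = CountVectorizer(ngram_range=(3,3), stop_words=stops)
counts = co.fit_transform(data.Review)
pd.DataFrame(counts.sum(axis=0),columns=co.get_feature_names()).T.sort_values(0,ascending=False).head(20)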
You can also clean the text to whatever level suits your analysis. Keep in mind that cleaning is a trade-off: the more aggressively you clean, the more detail you lose.
In this case a function is created that does the following:
- A lemmatizer is loaded (words are reduced to their root form, for instance running and runs both become run)
- A list of stop words is loaded
- The text is normalized, meaning compatibility characters are replaced with their standard equivalents
- The text is then encoded to ASCII and decoded back to UTF-8, which strips out any characters we do not need
- The text is lowercased
- A list of words is created from the sentence, and only valid characters are retained
- All the words are lemmatized and any stop words are removed
def basic_clean(sentence):
    # Lemmatizer reduces words to their dictionary form (e.g. running -> run)
    wnl = nltk.stem.WordNetLemmatizer()
    stop_words = nltk.corpus.stopwords.words('english')
    # Normalize, strip non-ASCII characters and lowercase the text
    text_norm = (unicodedata.normalize('NFKD', sentence)
                 .encode('ascii', 'ignore')
                 .decode('utf-8', 'ignore')
                 .lower())
    # Remove punctuation, then split the text into individual words
    words = re.sub(r'[^\w\s]', '', text_norm).split()
    # Lemmatize each word and drop stop words
    txt = [wnl.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(txt)
Calling the basic_clean() function creates a new column containing the cleaned text for each review.
data['Cleaned'] = data['Review'].apply(lambda x: basic_clean(x))
print(data.head(3))
Score Date Title \
3 5 August 8, 2020 We love your pizzas!
9 5 August 6, 2020 Most enjoyable meal in Malta
32 5 March 6, 2020 #Osema
Review Language Detected Lang \
3 I went to this restaurant yesterday evening wi... en en
9 Came to this little restaurant by accident, ve... en en
32 Good and reasonable food nice ambient highly r... en en
Cleaned
3 went restaurant yesterday evening friend order...
9 came little restaurant accident glad busy luck...
32 good reasonable food nice ambient highly recom...
This new column can be used to generate word clouds or n-grams, as shown below.
text = " ".join(review for review in data.Cleaned)
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

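The cleaned column can also be fed back into CountVectorizer. A minimal sketch, not part of the original code, showing the most frequent bigrams after cleaning; no stop word list is needed here since basic_clean() already removed the stop words:
# Count bigrams on the cleaned text
co = CountVectorizer(ngram_range=(2,2))
counts = co.fit_transform(data.Cleaned)
pd.DataFrame(counts.sum(axis=0),columns=co.get_feature_names()).T.sort_values(0,ascending=False).head(10)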
That concludes our scraping and exploration series. See you for the next one.