
Recommendation Systems

Customer feedback is a critical source of information for improving operations in the hotel industry, but capturing an accurate and complete picture of the customer experience has always been challenging.

We’ll look at reviews from a hotel dataset provided by Datafiniti’s Business Database. The goal of this study is to train a machine learning system to predict the star rating of a review based only on its text. For example, if the text says “Great! Best stay ever!!” we would expect a 5-star rating. If the text says “Scared to stay there again”, we would expect a 1-star rating. Rather than writing many rules to check whether some text is positive or negative, we can train a machine learning classifier to “learn” the difference between positive and negative reviews by giving it labeled examples.

Executive Summary

Our analysis considered the relationship between the sentiment of the review text and the rating score, and extracted keywords with several methods (TF-IDF + SVD, TextRank, NLTK RAKE). We also used LDA topic modeling to generate 6 topics from the full set of reviews. Additionally, we combined topic modeling with several machine learning algorithms to find our best sentiment classifier and rating predictor. Finally, we constructed a recommendation system to suggest a suitable hotel for our customers.

Sentiment analysis and topic modeling are Natural Language Processing (NLP) techniques that extract emotions and main topics from raw text. They are commonly applied to social media posts and customer reviews to automatically determine whether users’ comments are positive or negative, and why. We used the dataset to better understand how sentiment relates to rating, and to explore what customers care about most when they choose a hotel.

Moreover, we applied bag-of-words methods, Non-negative Matrix Factorization (NMF), and Latent Semantic Analysis (LSA) with several machine learning algorithms to compare and construct the best classifier for predicting hotel ratings and sentiment. We also excluded non-English sentences and tuned hyperparameters to improve the accuracy of our analyses.

We also compared different keyword extraction methods to see which works better. With Latent Dirichlet Allocation (LDA) topic modeling, we generated the top 7 topics and visualized them to see what consumers care about most in a hotel.

Exploratory Data Analysis

Shuffle Data: After loading the data, we shuffled it. Shuffling ensures that each data point creates an “independent” change on the model, without being biased by the points that come before it.
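As a rough illustration of the shuffling step, here is a minimal sketch using only the Python standard library. The `shuffle_rows` helper, the seed value and the sample records are assumptions made for this example, not the code used in the study.

```python
import random

def shuffle_rows(rows, seed=42):
    """Return a shuffled copy of rows; the fixed seed makes the order reproducible."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Hypothetical (text, rating) records, sorted by rating as they might be on disk.
reviews = [
    ("Scared to stay there again", 1),
    ("Room was dirty", 1),
    ("Average stay, nothing special", 3),
    ("Great! Best stay ever!!", 5),
]
shuffled_reviews = shuffle_rows(reviews)
```

Fixing the seed is a design choice: it keeps later accuracy comparisons between models repeatable.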

CountVectorizer: We preprocess the text and remove stop words. After cleaning the text, we can apply the bag-of-words method using Scikit-learn’s CountVectorizer and TfidfVectorizer. CountVectorizer transforms a corpus of text into a vector of term (token) counts.

For example, doc=[“One Cent, Two Cents, Old Cent, New Cent: All About Money”]. The text will be transformed into a sparse matrix.

Figure 1: Representation in Practice (Accessible in PDF Version) 

From the above example, we can see that there are 9 unique words, so the matrix has 9 columns. Each column represents a unique word in the vocabulary, while each row represents a document in our dataset.
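To make the counts concrete, the sketch below reproduces the bag-of-words row for the example title with the standard library only. It assumes the same tokenization as CountVectorizer's default (lowercasing, tokens of two or more word characters) and no stop-word removal; the `count_tokens` helper is a name introduced for this illustration.

```python
import re
from collections import Counter

def count_tokens(document):
    """Lowercase the text and count tokens of two or more word characters."""
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    return Counter(tokens)

doc = "One Cent, Two Cents, Old Cent, New Cent: All About Money"
counts = count_tokens(doc)

vocabulary = sorted(counts)                   # one column per unique word
row = [counts[word] for word in vocabulary]   # one row for the single document

for word, count in zip(vocabulary, row):
    print(f"{word}: {count}")
```

Note that "cent" and "cents" are counted as separate words, because no stemming is applied.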

In our example, we have one book title (i.e. the document), so we have only one row. The value in each cell is a word count, which is 0 when the word does not appear in the corresponding document. We also experimented with CountVectorizer’s min_df parameter, which omits words that occur too rarely to be considered meaningful. For instance, there might be names of people that appear in only one or two documents. In some applications, this may be considered noise and can be eliminated from further analysis. Conversely, max_df ignores words that are too common. Another option is n-grams, which let us build not just unigram models but also bigram and trigram models, so we can compare which model performs better.
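The effect of document-frequency filtering and n-grams can be sketched in plain Python. This is an illustrative approximation, not Scikit-learn's implementation: `ngrams`, `build_vocabulary` and the mini-corpus are hypothetical, with min_df treated as an absolute document count and max_df as a fraction of the corpus.

```python
import re
from collections import Counter

def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max found in text."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def build_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms that appear in at least min_df documents and in at most
    a max_df fraction of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(ngrams(doc)))  # count each term once per document
    limit = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= limit)

# Hypothetical mini-corpus of review snippets.
docs = [
    "great room and great staff",
    "great location",
    "great view but dirty room",
    "staff was rude",
]
vocab = build_vocabulary(docs, min_df=2, max_df=0.5)
print(vocab)
```

In this toy corpus, "great" appears in three of four documents and is dropped by max_df, while every bigram appears only once and is dropped by min_df.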

Term Frequency – Inverse Document Frequency (TF-IDF)

TF-IDF stands for “Term Frequency – Inverse Document Frequency”, a statistical measure of how important a word is to a document in a collection or corpus. It is computed by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents. In other words, TF-IDF is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf).

The term frequency (tf) measures how often a word appears in a document. There are several ways of calculating it, the simplest being a raw count of the times the word appears in the document. The count can also be adjusted, for example by the length of the document or by the raw frequency of the most frequent word in the document.

The inverse document frequency (idf) measures how common or rare a word is across the entire document set. The closer it is to 0, the more common the word is. It can be calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm. So, if the word is very common and appears in many documents, this number will approach 0; the rarer the word, the larger the number becomes. Thus, terms with higher weights are considered more important. TF-IDF can also be combined with min_df, max_df and n-grams.
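The calculation described above can be worked through on a toy corpus. The sketch below assumes the simplest variant (raw counts for tf and the natural logarithm of N / df for idf); the `tf_idf` helper and the three documents are made up for illustration. Scikit-learn's TfidfVectorizer applies a smoothed idf and length normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # number of documents containing each term
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)          # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

# Hypothetical corpus: "hotel" appears in every document, "pool" in only one.
docs = ["hotel clean room", "hotel dirty room", "hotel great pool"]
weights = tf_idf(docs)

for doc, doc_weights in zip(docs, weights):
    print(doc, {term: round(w, 3) for term, w in doc_weights.items()})
```

A word that appears in every document gets a weight of 0, while a word unique to one document gets the largest weight.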

 

Sentiment Analysis

After converting the textual data into numerical data, we can perform sentiment analysis. We applied Valence Aware Dictionary and sEntiment Reasoner (VADER) sentiment analysis to the review text. VADER is a sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It relies on a sentiment lexicon, a list of lexical features (e.g., words) that are labeled according to their semantic orientation as either positive or negative. VADER reports not only positivity and negativity scores but also how positive or negative a sentiment is.

Applying VADER sentiment analysis generates a compound score, a metric computed by summing all the lexicon ratings and normalizing the result to lie between -1 (most extreme negative) and +1 (most extreme positive). We can then write a function that labels each review text as positive, negative or neutral. In our study, we focus on positive and negative.

Positive sentiment: (compound score >= 0.05)

Negative sentiment: (compound score < -0.05)
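A labeling function along these lines could look like the sketch below. It assumes the compound score has already been computed for each review; the `label_sentiment` name and the sample scores are hypothetical.

```python
def label_sentiment(compound):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= 0.05:
        return "positive"
    if compound < -0.05:
        return "negative"
    return "neutral"

# Hypothetical compound scores, for illustration only.
scores = [0.82, -0.61, 0.0]
labels = [label_sentiment(score) for score in scores]

# Keep only the positive and negative reviews, as in this study.
binary = [(score, label) for score, label in zip(scores, labels) if label != "neutral"]
print(binary)
```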

Logistic Regression with CV/TF-IDF

After labeling the review texts as positive or negative, we started to build our classifier. In this study, we applied TF-IDF/CountVectorizer with Logistic Regression, Bernoulli Naive Bayes, Linear Support Vector (LSV) and Multinomial Naive Bayes. We compared the models with and without language detection.

To sum up, after implementing language detection, Logistic Regression still performs the best, even though its accuracy decreased by 0.5%. The chosen model is CountVectorizer(ngram_range=(1,2), stop_words='english') with Logistic Regression.
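One way such a model might be assembled with Scikit-learn is sketched below. The CountVectorizer settings are the ones named above; the pipeline wrapper, `max_iter=1000` and the tiny training set are assumptions for illustration and do not reflect the study's data or tuning.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled reviews; the study uses VADER-labeled hotel reviews.
texts = [
    "Great stay, friendly staff and clean room",
    "Wonderful location and lovely breakfast",
    "Dirty bathroom and rude staff",
    "Terrible noise, broken shower, awful service",
]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(max_iter=1000),  # max_iter is an assumed setting
)
model.fit(texts, labels)

predictions = model.predict(["The staff was friendly", "The bathroom was dirty"])
print(list(predictions))
```

Wrapping both steps in a pipeline keeps the vectorizer's vocabulary tied to the training data, so new text is transformed the same way at prediction time.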

Figure 2: Accuracy Table (Accessible in PDF Version) 

Logistic Regression is a linear model whose output is passed through a sigmoid function.

Figure 3: Sigmoid Function (Accessible in PDF Version) 

We can see that the value of the sigmoid function always lies between 0 and 1. The value is exactly 0.5 at X=0. We can use 0.5 as the probability threshold to determine the classes. If the probability is greater than 0.5, we classify it as Class-1 (Y=1) or else as Class-0 (Y=0).

Hotel Recommendations

Another interesting part is hotel recommendation. We built a function to help customers choose a specific hotel. They only need to describe the hotel they want, and the system automatically suggests a hotel that matches their description.

First, we transform the cleaned (preprocessed) text with CountVectorizer, using stop words and n-grams, then extract the feature names and join them back to our data. Second, we choose Logistic Regression as the classifier. Third, we collect descriptions from customers of the hotel they want to stay in. To clarify, these customer descriptions did not come from the original data; they came from our friends. Once the model predicts which hotel suits a customer, we return the hotel name, address, country and city.
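The steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the study's actual code: the hotels, addresses, reviews, `recommend` helper and `max_iter` setting are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hotels and reviews, invented for illustration.
hotels = {
    "Seaside Inn": {"address": "1 Beach Road", "city": "Miami", "country": "US"},
    "City Lodge": {"address": "9 Main Street", "city": "Chicago", "country": "US"},
}
reviews = [
    ("Quiet beach view and a lovely pool", "Seaside Inn"),
    ("Steps from the beach, relaxing pool area", "Seaside Inn"),
    ("Close to downtown offices and the subway", "City Lodge"),
    ("Perfect for business trips downtown", "City Lodge"),
]
texts = [text for text, _ in reviews]
names = [name for _, name in reviews]

# Train a classifier that maps review text to the hotel it describes.
recommender = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(max_iter=1000),  # max_iter is an assumed setting
)
recommender.fit(texts, names)

def recommend(description):
    """Predict a hotel for a free-text description and return its details."""
    name = str(recommender.predict([description])[0])
    return {"name": name, **hotels[name]}

suggestion = recommend("I want a relaxing hotel with a pool near the beach")
print(suggestion)
```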

For example, below are a few hotel recommendations for our customers.

Figure 4: Recommendations for our customers (Accessible in PDF Version) 

Machine Learning Algorithms to Predict Hotel Ratings

Another challenging part is predicting the actual hotel rating for hotel owners, because customers sometimes mis-click the star rating when they leave a review. For example, they may write a good comment but mis-click a 1-star rating, or write a bad comment but mis-click a 5-star rating. This happens, but not frequently. We still do not want to ignore this information, because each review and star rating affects the image of the hotel.

We experimented with CountVectorizer/TF-IDF along with Logistic Regression, Random Forest, Linear Support Vector and Bernoulli Naive Bayes to find the classifier with the best accuracy. We compared the models both before and after implementing language detection.

Figure 5: Accuracy Table (Accessible in PDF Version) 

To sum up, after implementing language detection, we excluded non-English sentences and kept only English text. Logistic Regression still outperforms the other machine learning algorithms, with an accuracy of 53.35%.