NATURAL LANGUAGE PROCESSING

Natural Language Processing is about analyzing text, which may include, but is not limited to, books, reviews, and HTML pages extracted through web scraping. It is the branch of machine learning that deals with making predictions from text.

In this article, we will walk through a simple application of Natural Language Processing: analyzing written restaurant reviews to determine whether each review is positive or negative.

The data used can be found here

The Jupyter notebook for reference can be found on my GitHub

Importing the necessary libraries and reading data

The data is a TSV (Tab-Separated Values) file rather than the more common CSV (Comma-Separated Values). We import it with the pandas read_csv function, setting the delimiter to a tab.

import pandas as pd
import numpy as np
df = pd.read_csv("data/Restaurant_Reviews.tsv", delimiter="\t")
df.head()

From our data, positive reviews have a tag of 1 while negative reviews have a tag of 0.
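
Before cleaning anything, it is worth checking that the two classes are reasonably balanced, since a heavily skewed dataset would make accuracy misleading:

# Count how many reviews carry each label (1 = positive, 0 = negative)
df["Liked"].value_counts()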

Cleaning the text

The text is cleaned in order to keep only relevant words and get rid of words that do not help the machine learning model predict whether a review is positive or negative; these include pronouns (e.g. it, them, us) and prepositions (e.g. on, at, off), to mention a few. The text cleaning process is first carried out on a single review so that we can visualize the steps one after the other. In the code below we select one of the reviews and use it for our illustration.

# Selecting a review from reviews 
df["Review"][999]

The letters are selected using the re (regular expressions) library, which provides regular expression matching and is used here to keep only lower and upper case characters from A to Z. Everything else, such as punctuation, is replaced with white space. To learn more about the re library click here

import re
example = re.sub("[^A-Za-z]",
                 " ",
                 df["Review"][999])
example

The next step is to make sure all the letters in the review are in lower case.

example = example.lower()
example

Once the letters are in lower case, we split the review so that each word becomes an item in a list.

example = example.split()
example

A list comprehension is then used to keep only words of length greater than two. Words of length two or less are usually not significant in helping the model predict whether a review is positive or negative.

example = [x for x in example if len(x) > 2]
example

Using a list comprehension and the stop words from the nltk library, we remove stop words from our list of words. nltk, which stands for Natural Language Toolkit, is a library for building Python programs that work with human language data. Stop words are very common words that carry little meaning on their own, such as pronouns (they, them, us) and conjunctions (but, while), to mention a few. We start by importing the stop words. You need to have the nltk library installed and the stop word list downloaded to do this. To learn more about the nltk library click here

# importing the stop words
import nltk
from nltk.corpus import stopwords

# Run the line below if you don't have the stop word list downloaded
nltk.download('stopwords')
# printing the stopwords
print(stopwords.words("english"))
# Checking the number of stopwords we have in total 
len(stopwords.words("english"))

There are 179 stop words available for us to use.

# Selecting only words not included in stop words
example = [word for word in example if word not in stopwords.words("english")]
example
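
One caveat worth noting: nltk's English stop word list includes negations such as "not", and dropping them can flip the meaning of a review ("not good" loses its "not"). If that matters for your data, the filter can be adapted to keep a hand-picked set of negation words; the set below is a hypothetical choice, shown on a toy sentence.

# Variant of the filter above that keeps negation words, since they carry sentiment
negations = {"not", "nor"}                        # hypothetical set of words to keep
stops = set(stopwords.words("english")) - negations
[w for w in ["food", "was", "not", "good"] if w not in stops]
# -> ['food', 'not', 'good']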

Stemming is then applied to each word to keep only its root. For example, words like running and runs are both replaced with the root word run. Two common normalization techniques are stemming and lemmatization; here we use the PorterStemmer, a stemming algorithm provided by the nltk library.

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
example = [ps.stem(word) for word in example]
example
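
Note that stems are not always dictionary words; the Porter algorithm strips suffixes mechanically. A quick check on a few words:

# Stems are truncated forms, not necessarily real words
[ps.stem(w) for w in ["loved", "loving", "tasty"]]
# -> ['love', 'love', 'tasti']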

The stemmed words are then joined to form a new sentence containing only key words.

" ".join(example)

A function is created that carries out all these steps one after the other, and it is applied to the Review column of the dataset. This cleans every review in our dataset. The collection of cleaned reviews is called a corpus.

# English stop words, stored in a set for fast membership checks
stop_words = set(stopwords.words("english"))

def clean_text(data):
    # keep letters only, lower-case the text, and split it into words
    text = re.sub("[^A-Za-z]", " ", data).lower().split()
    # drop very short words and stop words, then stem what is left
    text = [x for x in text if len(x) > 2 and x not in stop_words]
    text = [ps.stem(word) for word in text]
    return " ".join(text)

df["Review"] = df["Review"].apply(clean_text)
df["Review"]

Using the corpus created, a bag-of-words model is built from all the unique words in the reviews. That is, no word appears more than once in our bag of words.

Each unique word becomes a column. What we end up with is a matrix with one row per review and as many columns as there are words in our bag of words.

Each row then records, for every column, how many times that word appears in the review (mostly 0 or 1). The result is a sparse matrix consisting mostly of zeros.

Tokenization is the step that creates this sparse matrix, and it can be done with scikit-learn's CountVectorizer.
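
As a minimal illustration of the idea, here is a bag of words built from two made-up reviews (get_feature_names_out requires scikit-learn 1.0 or newer; older versions call it get_feature_names):

# Toy corpus: two made-up reviews to show the bag-of-words layout
from sklearn.feature_extraction.text import CountVectorizer
toy = ["good food good service", "bad food"]
cv = CountVectorizer()
print(cv.fit_transform(toy).toarray())  # [[0 1 2 1]
                                        #  [1 1 0 0]]
print(cv.get_feature_names_out())       # ['bad' 'food' 'good' 'service']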

The sparse matrix is then set as our feature matrix X, and the Liked column is set as our target y.

from sklearn.feature_extraction.text import CountVectorizer
tokenizer = CountVectorizer(max_features = 1500)
X = tokenizer.fit_transform(df["Review"]).toarray()

y = df["Liked"]

Train test split and building the model

The data is split into training and test sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33,
                                                    random_state=42)
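
One optional refinement, not used in the split above: passing stratify=y keeps the proportion of positive and negative reviews the same in both splits, which helps on a dataset this small.

# Stratified variant of the split: class balance is preserved in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)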

Naive Bayes Classifier

The Naive Bayes classifier is one of the most frequently used classifiers in Natural Language Processing, and it is the one used in this article to classify the restaurant reviews. It is fit on the training set and then used to predict on the test set.

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train,y_train)
predictions = model.predict(X_test)
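
A side note on the choice of model: GaussianNB assumes continuous, normally distributed features, while MultinomialNB is designed for count data such as our bag of words and is often the stronger choice here. Swapping it in is a small change:

from sklearn.naive_bayes import MultinomialNB

# MultinomialNB models word counts directly, a natural fit for bag-of-words features
nb = MultinomialNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))  # mean accuracy on the test set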

After the predictions have been made, the model's performance is evaluated using the confusion matrix and the classification report.

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test, predictions))
print("========================================")
print(classification_report(y_test, predictions))

If the accuracy of the model is not satisfactory, the model can be tweaked or another model can be tried.
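
For example, a logistic regression classifier is a common drop-in alternative for bag-of-words features; a minimal sketch:

from sklearn.linear_model import LogisticRegression

# A linear baseline that often performs well on sparse text features
alt_model = LogisticRegression(max_iter=1000)
alt_model.fit(X_train, y_train)
print(classification_report(y_test, alt_model.predict(X_test)))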

I hope this was able to answer some of the questions you have concerning Natural Language Processing.

Should you have any questions, you can send me an email at