AI-Guided Drug Recommendation

The use of data analytics (DA) and artificial intelligence (AI) has been increasing in the pharmaceutical industry, in areas including drug R&D, drug repurposing, pharmaceutical productivity, and clinical trials. DA uses algorithms that recognize patterns in a data set so that the data can be classified further. Examples include drug recommendation, discovery, and validation using knowledge graphs and Natural Language Processing (NLP) algorithms.

Leading biopharmaceutical companies are placing their confidence in AI because of growing awareness of AI in the pharmaceutical sector and rising investment in drug development.

Photo by Sharon McCutcheon on Unsplash

Following the earlier DA-based study (cf. research paper and the source code), the objective of this project is to build an AI-guided drug review system that recommends the most effective drug for a given condition, based on the available reviews of the drugs used to treat that condition.

E2E Workflow

  • Set up a Jupyter notebook within the Anaconda IDE
  • Import/install relevant Python libraries
  • Download the Kaggle UCI ML Drug Review dataset
  • Overview of the train/test dataset
  • Reset the index after data concatenation
  • Exploratory Data Analysis (EDA) using SNS plots
  • Plot the WordCloud images for positive/negative reviews
  • Loading stop words from NLTK
  • Text data NLP Pre-Processing (removing digits, extra spaces, lower case, etc.)
  • nltk.sentiment.vader sentiment analysis using SentimentIntensityAnalyzer
  • Adding the sentiment scores for reviews, preprocessed reviews as new features
  • Feature Engineering – check the Pearson correlation matrix of various features
  • Adding the word count, stopword count, char length, unique word count, mean word length, and punctuation count
  • Named entity recognition (NER) using spacy
  • LDA topic modelling – prepare cleaned reviews
  • Splitting the data into train, test and cross-validation (CV) datasets
  • Encoding categorical, text and numerical features by applying LabelEncoder
  • Vectorizing the cleaned reviews using BoW, TF-IDF (1 gram)
  • Word2Vec Vectorization for reviews using pretrained glove model
  • Compute the LDA confusion, precision, and recall matrices for the test data

Import Libraries

Let’s set the working directory YOURPATH

import os
os.getcwd()

os.chdir('YOURPATH')

and import/install relevant Python libraries

import re
import string
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from wordcloud import WordCloud, STOPWORDS

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import spacy
import gensim

warnings.filterwarnings('ignore')

# One-time downloads, if the NLTK data is not installed yet:
# nltk.download('stopwords'); nltk.download('vader_lexicon')

Input Data

Let’s read the Kaggle train and test data as csv files

data_train = pd.read_csv('drugsComTrain_raw.csv')
data_test = pd.read_csv('drugsComTest_raw.csv')

and check the dataset structure

print('Size of Train dataset is:', data_train.shape)
print('Size of Test dataset is:', data_test.shape)

Size of Train dataset is: (161297, 7)
Size of Test dataset is: (53766, 7)

print('Columns of the dataset are:\n', data_train.columns)

Columns of the dataset are:
 Index(['uniqueID', 'drugName', 'condition', 'review', 'rating', 'date',
       'usefulCount'],
      dtype='object')

print('Overview of Train dataset:\n')
data_train.head(10)

Overview of Train dataset:
Input Kaggle train dataset

Similarly, let’s print the test Kaggle dataset

print('Overview of Test dataset:\n')
data_test.head(10)

Overview of Test dataset:
Input test Kaggle dataset




Let’s concatenate the two datasets

data = pd.concat([data_train, data_test])
print('The size of the combined data is:', data.shape)

The size of the combined data is: (215063, 7)

while resetting the index after concatenation

data.reset_index(inplace=True,drop=True)
data.tail()

Reset index tab after input data concatenation

Let’s verify data types
data.dtypes

uniqueID        int64
drugName       object
condition      object
review         object
rating          int64
date           object
usefulCount     int64
dtype: object

The descriptive statistics of the numerical features are given by

data.describe()

descriptive statistics of numerical values in input data

and the data info is

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215063 entries, 0 to 215062
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   uniqueID     215063 non-null  int64 
 1   drugName     215063 non-null  object
 2   condition    213869 non-null  object
 3   review       215063 non-null  object
 4   rating       215063 non-null  int64 
 5   date         215063 non-null  object
 6   usefulCount  215063 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 11.5+ MB

Raw Data Cleaning

Let’s check the above data for null values
data.isnull().any()

uniqueID       False
drugName       False
condition       True
review         False
rating         False
date           False
usefulCount    False
dtype: bool

and the percentage of null values is

null_size = data.isnull().sum()['condition']
print('Total null values are:', null_size)
data_size = data.shape[0]
print('Percentage of null values are:', (null_size/data_size)*100)

Total null values are: 1194
Percentage of null values are: 0.5551861547546533

Let’s drop every row that contains a null value

data = data.dropna(axis=0)
print('Size of the dataset after dropping null values:', data.shape)

Size of the dataset after dropping null values: (213869, 7)

and check for number of unique conditions

print('Number of unique conditions are:', data['condition'].unique().shape[0])

Number of unique conditions are: 916

Exploratory Data Analysis (EDA)

Following the earlier DA study, let’s apply EDA to our data after dropping null values.

Let’s plot the top 10 conditions

conditions = dict(data['condition'].value_counts())
top_conditions = list(conditions.keys())[0:10]
values = list(conditions.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
b = sns.barplot(x=top_conditions, y=values, palette='spring')
b.set_xlabel("Conditions", fontsize=20)
b.set_ylabel("Count", fontsize=20)
b.tick_params(labelsize=12)
b.axes.set_title("Top 10 Conditions", fontsize=20)
# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugbartop10conditions.png')
plt.show()

top 10 conditions barchart

It is clear that Birth Control is the most frequently occurring condition in our dataset, followed by Depression, Pain, Anxiety, etc.

Let’s count the number of drugs provided or prescribed as a treatment for top 10 conditions

val=[]
for c in list(conditions.keys()):
    val.append(data[data['condition']==c]['drugName'].nunique())

drug_cond = dict(zip(list(conditions.keys()),val))

Let’s plot the number of drugs provided or prescribed as a treatment for top 10 conditions
top_conditions = list(drug_cond.keys())[0:10]
values = list(drug_cond.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
b = sns.barplot(x=top_conditions, y=values, palette='spring')

# The axis labels match the plotted data: conditions on x, drug counts on y
b.set_xlabel("Conditions", fontsize=20)
b.set_ylabel("Count of Drugs", fontsize=20)
b.tick_params(labelsize=12)
b.axes.set_title("Number of Drugs for each Top 10 Conditions", fontsize=20)

# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugbarnumberofdrugs.png')
plt.show()

Number of drugs for each top 10 conditions barchart

This plot shows that each of the top 10 conditions is treated with many different drugs. There is unlikely to be a single “silver bullet” medication for conditions such as Pain and Birth Control.

Let’s find the most frequently used Birth Control drug by plotting the top 10 drugs used as a treatment for this condition

drugs_birth = dict(data[data['condition']=='Birth Control']['drugName'].value_counts())

top_drugs = list(drugs_birth.keys())[0:10]
values = list(drugs_birth.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
b = sns.barplot(x=values, y=top_drugs, palette='spring')

b.set_xlabel("Count of Patients used", fontsize=20)
b.set_ylabel("Drug Names", fontsize=20)
b.tick_params(labelsize=12)
b.axes.set_title("Top 10 Drugs used for Birth Control", fontsize=20)

# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugbarnumbermostuseddrugs.png')
plt.show()

top 10 drugs used for Birth Control

It is clear that Etonogestrel is the most frequently used Birth Control drug, followed by Ethinyl estradiol, Levonorgestrel and Nexplanon.

Similarly, we can plot top 10 drugs used as a treatment for Pain

drugs_pain = dict(data[data['condition']=='Pain']['drugName'].value_counts())

top_drugs = list(drugs_pain.keys())[0:10]
values = list(drugs_pain.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
b = sns.barplot(x=values, y=top_drugs, palette='spring')

b.set_xlabel("Count of Patients used", fontsize=20)
b.set_ylabel("Drug Names", fontsize=20)
b.tick_params(labelsize=12)
b.axes.set_title("Top 10 Drugs used for Pain", fontsize=20)

# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugbarnumbermostusedpain.png')
plt.show()

top 10 drugs used for Pain barchart

We can see that Tramadol is the most frequently used Pain drug.

Let’s plot the top 10 drugs rated as 10
drugs_rating = dict(data[data['rating']==10]['drugName'].value_counts())

top_drugs = list(drugs_rating.keys())[0:10]
values = list(drugs_rating.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
b = sns.barplot(x=values, y=top_drugs, palette='spring')

b.set_xlabel("Count of Ratings", fontsize=20)
b.set_ylabel("Drug Names", fontsize=20)
b.tick_params(labelsize=12)
b.axes.set_title("Top 10 Drugs rated as 10", fontsize=20)

# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugbartop10rate10.png')
plt.show()

top 10 drugs rated as 10 barchart

It appears that Birth Control drugs are top rated, so Etonogestrel and Levonorgestrel should be the top recommended drugs as they are (a) most frequently used and (b) rated as 10 by patients.

Likewise, let’s plot the top 10 drugs rated as 1

drugs_rating = dict(data[data['rating']==1]['drugName'].value_counts())

top_drugs = list(drugs_rating.keys())[0:10]
values = list(drugs_rating.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
b = sns.barplot(x=values, y=top_drugs, palette='spring')

b.set_xlabel("Count of Ratings", fontsize=20)
b.set_ylabel("Drug Names", fontsize=20)
b.tick_params(labelsize=12)
b.axes.set_title("Top 10 Drugs rated as 1", fontsize=20)

# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugbartop10rate1.png')
plt.show()

top 10 drugs rated as 1 barchart

One can see that Levonorgestrel and Etonogestrel are among the top 10 drugs for both rating ‘10’ and rating ‘1’. The lowest rating may imply that these two drugs caused side effects and/or were not effective for certain patients.

Let’s plot the density distributions of ratings 1-10
f, ax = plt.subplots(1, 2, figsize=(16,8))
ax1 = sns.histplot(data['rating'], ax=ax[0])
# sns.distplot is deprecated; histplot with kde=True is the closest replacement
ax2 = sns.histplot(data['rating'], kde=True, stat='density', color="r", ax=ax[1])
ax2.set_title("Distribution of Ratings Density", fontsize=20)
ax2.set_xlabel("Rating", fontsize=20)
ax2.set_ylabel("Density", fontsize=20)
ax2.tick_params(labelsize=12)

ax1.set_xlabel("Rating", fontsize=20)
ax1.set_ylabel("Count", fontsize=20)
ax1.tick_params(labelsize=12)
ax1.set_title("Count of Ratings", fontsize=20)

# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugdistplotratingsdens.png')
plt.show()

density distributions of ratings 1-10

We can see that rating 10 has the maximum density or frequency count.

Let’s look at the percentage distribution of ratings using a pie chart

ratings_count = dict(data['rating'].value_counts())
count = list(ratings_count.values())
labels = list(ratings_count.keys())
plt.figure(figsize=(18,9))
plt.pie(count, labels=labels, autopct='%1.1f%%', textprops={'fontsize': 14})
plt.title('Pie Chart Representation of Ratings', fontsize=20)
plt.legend(title='Ratings', fontsize=12)
# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugspiechartratings.png')
plt.show()

Pie chart: percentage representation of ratings

We can see that 31.6% and 13.5% of the reviews carry ratings of 10 and 1, respectively.

Let’s change the date format using to_datetime

data['date'] = pd.to_datetime(data['date'])

and count ratings per year

year_ratings = dict(data['date'].dt.year.value_counts())
years = list(year_ratings.keys())
values = list(year_ratings.values())
plt.figure(figsize=(18,9))
b = sns.barplot(x=years, y=values, palette='spring')
b.set_xlabel("Year", fontsize=20)
b.set_ylabel("Count of Ratings", fontsize=20)
b.tick_params(labelsize=12)
b.axes.set_title("Count of Ratings in each Year", fontsize=20)
# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugsbarratingsyear.png')
plt.show()

Count rating per year barchart

Notably, the number of ratings per year peaked in 2016.

Let’s check the distribution of usefulCount

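plt.figure(figsize=(16,8))
# sns.distplot is deprecated; histplot with kde=True is the closest replacement
ax1 = sns.histplot(data['usefulCount'], bins=256, kde=True, stat='density', color='r')
plt.xlim(0, 200)
ax1.set_xlabel("usefulCount", fontsize=20)
ax1.set_ylabel("Density", fontsize=20)
ax1.tick_params(labelsize=12)
plt.title('Distribution of usefulCount', fontsize=20)
# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugdistusefulcount.png')
plt.show()

distribution of usefulCount feature

The x-axis stops at 200 only because of plt.xlim(0, 200); this is a display limit and does not show that the usefulCount feature itself is capped at 200.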

Review Sentiments

Let’s create the target feature review_sentiment using ratings

data['review_sentiment'] = data['rating'].apply(lambda x: 1 if x > 5 else 0)

Here 1 and 0 represent positive and negative reviews, respectively.
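
The labelling rule can be checked on a few sample ratings. This is a minimal sketch; the helper name rating_to_sentiment and its threshold parameter are illustrative and do not appear in the original notebook.

```python
def rating_to_sentiment(rating, threshold=5):
    """Return 1 (positive) for ratings above the threshold, else 0 (negative)."""
    return 1 if rating > threshold else 0

sample_ratings = [1, 5, 6, 10]
labels = [rating_to_sentiment(r) for r in sample_ratings]
print(labels)
```

Note that a middle rating of 5 falls into the negative class under this rule.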

data.head(10)

review sentiment data table

Let’s plot the pie chart for review sentiments

plt.figure(figsize=(14,7))
plt.pie(data['review_sentiment'].value_counts(), labels=['Positive','Negative'], autopct='%1.1f%%', textprops={'fontsize': 16})
plt.title('Pie Chart Representation of Review Sentiment', fontsize=20)

# Save before plt.show(), under its own name so the ratings pie chart is not overwritten
plt.savefig('drugspiechartsentiment.png')
plt.show()

pie chart representation of review sentiments

According to this chart, 70.1% of the reviews are positive. Hence, this is an imbalanced dataset.

WordCloud Images

Let’s turn our attention to WordCloud – a technique for visualising frequent words in a text where the size of the words represents their frequency.
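
The word sizes in such an image are driven by term frequencies after stop-word removal. The sketch below shows that counting step with the standard library only; the two reviews and the TOY_STOP_WORDS set are made up for illustration, and the wordcloud package additionally counts frequent two-word phrases by default.

```python
from collections import Counter

# Hypothetical stop-word set; the wordcloud package ships its own STOPWORDS
TOY_STOP_WORDS = {"the", "and", "i", "no", "a"}

reviews = ["No side effects and the pain is gone",
           "I had side effects and a bad headache"]

# Lower-case, split on whitespace, and drop stop words
tokens = [w for r in reviews for w in r.lower().split() if w not in TOY_STOP_WORDS]
frequencies = Counter(tokens)
print(frequencies.most_common(3))
```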

Let’s look at WordCloud for positive reviews

positive_reviews = " ".join([review for review in data['review'][data['review_sentiment'] == 1]])

stop_words = set(STOPWORDS)

wordcloud = WordCloud(width = 1200, height = 800, background_color ='white', stopwords = stop_words, min_font_size = 10).generate(positive_reviews)

while plotting the WordCloud image

plt.figure(figsize = (12, 8), facecolor = None)
plt.imshow(wordcloud)
plt.title('WordCloud for Positive Reviews')
plt.axis("off")
plt.tight_layout(pad = 0)
# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugswordcloudpos.png')
plt.show()

WordCloud image for positive reviews

Similarly, we can apply WordCloud to negative reviews

negative_reviews = " ".join([review for review in data['review'][data['review_sentiment'] == 0]])

wordcloud = WordCloud(width = 1200, height = 800, background_color ='white', stopwords = stop_words, min_font_size = 10).generate(negative_reviews)

and plot the resulting image

plt.figure(figsize = (12, 8), facecolor = None)
plt.imshow(wordcloud)
plt.title('WordCloud for Negative Reviews')
plt.axis("off")
plt.tight_layout(pad = 0)
# Save before plt.show(); saving afterwards can write an empty figure
plt.savefig('drugswordcloudneg.png')
plt.show()

WordCloud image for negative reviews

We can see that the most frequent terms in both positive and negative reviews include “side effect”, “day”, and “period”.

Review Sentiment Editing

Let’s remove some unwanted conditions in the review sentiment data

del_index = []
conds = []
for c in data['condition']:
    if ('helpful' in c) or ('Listed' in c):
        f = list(data[data['condition']==c].index)
        del_index.extend(f)
        conds.append(c)

print('Size of the data before removing the conditions:', data.shape)

Size of the data before removing the conditions: (213869, 8)

print('The removable conditions count is:', len(conds))

The removable conditions count is: 1763

data.drop(del_index, inplace=True)
print('Size of the data after dropping the conditions:', data.shape)

Size of the data after dropping the conditions: (212106, 8)

data.reset_index(inplace=True,drop=True)
data.tail()

Review sentiment data after removing unwanted conditions

Text Data Pre-Processing

We need several helper functions to clean and preprocess the review text by removing HTML tags, punctuation, special characters, stop words, tabs, new lines, and extra spaces, and by stemming the words:

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

def preprocess_text(text_data):

    text_data = decontracted(text_data)

    text_data = text_data.replace('\n',' ')
    text_data = text_data.replace('\r',' ')
    text_data = text_data.replace('\t',' ')
    text_data = text_data.replace('-',' ')
    text_data = text_data.replace("/",' ')
    text_data = text_data.replace(">",' ')
    text_data = text_data.replace('"',' ')
    text_data = text_data.replace('?',' ')
    return text_data

Let’s load stop words from the nltk library

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

Let’s remove ‘no’ from the stop words list due to the importance of ‘side effects’ and ‘no side effects’ in reviews
stop_words.remove('no')

Let’s define the function that processes the text data by removing digits, extra spaces, and stop words, while converting words to lower case and stemming them

def nlp_preprocessing(review):
    # Non-string inputs (e.g. stray integers) yield an empty string
    if not isinstance(review, str):
        return ""
    review = preprocess_text(review)
    review = re.sub('[^a-zA-Z]', ' ', review)   # keep letters only
    review = re.sub(r'\s+', ' ', review)        # collapse repeated whitespace
    review = review.lower()

    # Collect stemmed non-stop-words; a local list avoids shadowing the string module
    words = []
    for word in review.split():
        if word not in stop_words:
            words.append(stemmer.stem(word))

    return " ".join(words)
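
To see what the cleaning steps do to a single review, here is a minimal standard-library sketch. It is a simplified stand-in: TOY_STOP_WORDS is a hypothetical stop-word set (not the NLTK list), only a few contraction rules are included, and stemming is omitted.

```python
import re

def expand_contractions(phrase):
    # Specific cases first, so "won't" does not become "wo not"
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    # General suffix rules
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase

# Hypothetical stop-word set; 'no' is deliberately left out, as in the text above
TOY_STOP_WORDS = {"i", "it", "the", "a", "and", "have"}

def clean(review):
    review = expand_contractions(review)
    review = re.sub(r"[^a-zA-Z]", " ", review)            # drop digits and punctuation
    review = re.sub(r"\s+", " ", review).strip().lower()  # collapse whitespace
    return " ".join(w for w in review.split() if w not in TOY_STOP_WORDS)

cleaned = clean("I won't take it again: 2 weeks and I've had NO relief!")
print(cleaned)
```

Keeping "no" and "not" in the output preserves the negation that separates "no relief" from "relief".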

Let’s apply the NLP processing

data['cleaned_review'] = data['review'].apply(nlp_preprocessing)

and convert the content of drugName and condition to lower case

data['drugName'] = data['drugName'].apply(lambda x: x.lower())

data['condition'] = data['condition'].apply(lambda x: x.lower())

data.head()

NLP cleaned and transformed review table

data.shape

(212106, 9)

Let’s add the sentiment scores for reviews and preprocessed reviews as new features using SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
data['sentiment_score'] = [sid.polarity_scores(v)['compound'] for v in data['review']]
data['sentiment_score_clean'] = [sid.polarity_scores(v)['compound'] for v in data['cleaned_review']]

data.head()

These text features and sentiment scores are the basic extracted features for the text classification problem.
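
The compound score used above is a bounded value between -1 and 1. The sketch below illustrates the squashing step, assuming the normalization x / sqrt(x² + α) with α = 15 that the VADER reference implementation applies to the summed word valences; the function name normalize_compound and the raw sums are illustrative.

```python
import math

def normalize_compound(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Made-up raw valence sums, from clearly negative to clearly positive
for raw in (-4.0, 0.0, 1.5, 8.0):
    print(raw, round(normalize_compound(raw), 4))
```

Because of this squashing, long reviews with many sentiment-bearing words tend to saturate near -1 or 1.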

Sentiment Correlations

Let’s check the correlation of the features present in the sentiment data

# numeric_only=True (pandas 1.5+) skips the text columns
data.corr(numeric_only=True)

correlation features in data

Let’s plot the lower triangle of the Pearson correlation heatmap with Seaborn

corr_data = data.corr(method='pearson', numeric_only=True)

# np.bool was removed from NumPy; the built-in bool does the same job
mask_ut = np.triu(np.ones(corr_data.shape)).astype(bool)

f, ax = plt.subplots(figsize=(11, 9))
sns.set(font_scale=1)
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Keep the returned Axes so the figure can be saved
hmap = sns.heatmap(corr_data, mask=mask_ut, cmap=cmap, annot=True, annot_kws={"size":16})
hmap.figure.savefig("Correlation_Heatmap_Lower_Triangle_with_Seaborn.png",
                    format='png',
                    dpi=150)

Correlation matrix lower triangle heatmap

Let’s save the data for future reference

data.to_csv('new_data_processed.csv', index=False)

Review Data Editing

Let’s check the content of data

data.head(10)

Review Sentiment data table

Let’s check for any NaN values in the cleaned_review feature

data['cleaned_review'].isna().sum()

0

and drop all rows containing NaN values

print('The data size before:', data.shape)
data = data.dropna(axis=0)
data.reset_index(inplace=True, drop=True)
print('The data size after dropping:', data.shape)

The data size before: (212106, 11)
The data size after dropping: (212106, 11)

We can see that there are no nan values in our processed dataset

Let’s add the year as feature and check the data structure

data['date'] = pd.to_datetime(data['date'])
data['year'] = data['date'].dt.year
data.head(10)

Processed review data with year feature

Let’s add the word count, stopword count, char length, unique word count, mean word length, and punctuation count

stop_words = set(stopwords.words('english'))

data['word_count'] = data["cleaned_review"].apply(lambda x: len(str(x).split()))

data['unique_word_count'] = data["cleaned_review"].apply(lambda x: len(set(str(x).split())))

data['char_length'] = data["cleaned_review"].apply(lambda x: len(str(x)))

data["count_punctuations"] = data["review"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))

data["stopword_count"] = data["review"].apply(lambda x: len([w for w in str(x).lower().split() if w in stop_words]))

data["mean_word_len"] = data["cleaned_review"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
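
The same statistics can be traced on a single string. This is a minimal sketch with a hypothetical helper, text_stats, that computes all values from one input, whereas the notebook takes some from the raw review and some from the cleaned review. It also returns 0.0 for the mean word length of an empty text, where a plain mean would be undefined.

```python
import string

def text_stats(text):
    words = text.split()
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "char_length": len(text),
        "count_punctuations": sum(ch in string.punctuation for ch in text),
        # Guard against empty reviews, for which a mean is undefined
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

stats = text_stats("no side effects, no pain!")
print(stats)
```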

Let’s check the updated content

data.head(10)

Review data updated content including word count, stopword count, char length, unique word count, mean word length, and punctuation count

Feature Engineering

Let’s compute data correlations

Review sentiment correlations including the word count, stopword count, char length, unique word count, mean word length, and punctuation count

Let’s compute the correlation heatmap lower triangle with seaborn

corr1_data = data.corr(method='pearson', numeric_only=True)
# np.bool was removed from NumPy; the built-in bool does the same job
mask_ut = np.triu(np.ones(corr1_data.shape)).astype(bool)

f, ax = plt.subplots(figsize=(11, 9))
sns.set(font_scale=1)

# Keep the returned Axes so the figure can be saved
hmap = sns.heatmap(corr1_data, mask=mask_ut, cmap="Spectral", annot=True, annot_kws={"size":12})
hmap.figure.savefig("Correlation_Heatmap_feature_Lower_Triangle_with_Seaborn.png",
                    format='png',
                    dpi=150)

Lower triangle correlation matrix of 12 features

NER Analysis

Let’s count grammatical subjects and objects with the spaCy dependency parser, and then perform Named Entity Recognition (NER)

nlp = spacy.load("en_core_web_sm")

def subj_obj_count(review):
    doc = nlp(review)
    # Distinct tokens tagged as nominal subject / direct object
    sub_words = set([str(word) for word in doc if (word.dep_ == "nsubj")])
    obj_words = set([str(word) for word in doc if (word.dep_ == "dobj")])
    return len(sub_words), len(obj_words)

count = []

for r in tqdm(data['review']):
    count.append(subj_obj_count(r))

100%|██████████| 212106/212106 [41:58<00:00, 84.22it/s] 

sub_obj = pd.DataFrame(count, columns=['subj_count','obj_count'])
sub_obj.head(10)

Subject and object count table

sub_obj.to_csv('sub_obj.csv', index=False)

sub_obj = pd.read_csv('sub_obj.csv')
sub_obj.shape

(212106, 2)

ner_lst = nlp.pipe_labels['ner']

def ner(review):
    doc = nlp(review)
    # Start every entity label at zero, then count the entities found
    dic = dict.fromkeys(ner_lst, 0)
    for word in doc.ents:
        dic[word.label_] += 1
    return dic

entity = pd.DataFrame([ner(r) for r in tqdm(data['cleaned_review'])])

100%|██████████| 212106/212106 [33:50<00:00, 104.47it/s]   

entity.to_csv('entities.csv', index=False)

entity = pd.read_csv('entities.csv')
print(entity.shape)
entity.head(10)

(212106, 18)
Entity table with 18 columns and 212106 rows

Topic Modelling

Topic modelling in NLP analyzes text data and finds the topics that best describe a set of documents. It is an unsupervised approach: like clustering algorithms, it recognizes and extracts topics by detecting hidden patterns, and it requires neither predefined tags nor labelled training data.

Let’s pre-process the feature

corpus = data['cleaned_review']

lst_corpus = []
# "text" is the loop variable, so the imported string module is not shadowed
for text in tqdm(corpus):
    lst_words = text.split()
    lst_grams = [" ".join(lst_words[i:i + 1]) for i in range(0, len(lst_words), 1)]
    lst_corpus.append(lst_grams)

and map each word to an id
id2word = gensim.corpora.Dictionary(lst_corpus)

while converting each review to a bag-of-words list of (word id, frequency) pairs
dic_corpus = [id2word.doc2bow(word) for word in lst_corpus]
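
The bag-of-words representation that the LDA model consumes can be mimicked with the standard library. The following is an illustrative sketch only: the two tokenized reviews are made up, and the integer ids are assigned in order of first appearance, which need not match the ids that gensim assigns.

```python
from collections import Counter

docs = [["pill", "work", "no", "side", "effect"],
        ["side", "effect", "bad", "side"]]

# Assign integer ids in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

bows = [to_bow(d) for d in docs]
print(token2id)
print(bows)
```

Word order is discarded in this representation; only the counts per document remain.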

Train LDA

Let’s train the LDA model

lda_model = gensim.models.ldamodel.LdaModel(corpus=dic_corpus, id2word=id2word, num_topics=20, chunksize=100, passes=10, alpha='auto', per_word_topics=True)

100%|██████████| 212106/212106 [00:02<00:00, 74286.58it/s]

Let’s store the topic vector of each review in a list

train_vecs = []
for i in range(len(corpus)):
    top_topics = lda_model.get_document_topics(dic_corpus[i], minimum_probability=0.0)
    # One probability per topic, in topic-id order
    topic_vec = [top_topics[k][1] for k in range(20)]
    train_vecs.append(topic_vec)

topics = pd.DataFrame(train_vecs)
print(topics.shape)
topics.head(10)

(212106, 20)
Topics table 212106 rows and 20 columns

Let’s export this table to csv

topics.to_csv('topics.csv', index=False)

Topic Correlations

Let’s read the topics data

topics = pd.read_csv('topics.csv')
topics.shape

(212106, 20)

while concatenating sub_obj, entity and topics

data = pd.concat([data,sub_obj,entity,topics],axis=1)
print(data.shape)
data.tail(10)

(212106, 58)
Concatenating sub_obj, entity and topics 
data table as 212106 by 58

Let’s calculate the correlations, which pandas reports as 53 rows × 53 columns because only the numeric columns are included

data.corr(numeric_only=True)

Correlations as 53x53 matrix

Let’s plot the corresponding correlation heatmap lower triangle with seaborn

corr2_data = data.corr(method='pearson', numeric_only=True)
# np.bool was removed from NumPy; the built-in bool does the same job
mask_ut = np.triu(np.ones(corr2_data.shape)).astype(bool)

f, ax = plt.subplots(figsize=(22, 18))
sns.set(font_scale=1)

# Keep the returned Axes so the figure can be saved
hmap = sns.heatmap(corr2_data, mask=mask_ut, cmap="Spectral", annot=True, annot_kws={"size":6})
hmap.figure.savefig("Correlation_Heatmap_topics_Lower_Triangle_with_Seaborn.png",
                    format='png',
                    dpi=150)

Correlation heatmap 53x53 matrix lower triangle with seaborn

Let’s export the newly processed data

data.to_csv('final_new_data_processed.csv', index=False)

NLP Modelling

Finally, we are ready to train models for classifying reviews. This requires the following libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from wordcloud import WordCloud
from wordcloud import STOPWORDS
import nltk
import regex as re

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

Let’s read the data

data = pd.read_csv('final_new_data_processed.csv')
print(data.shape)
data.head(10)

(212106, 58)
Input to modelling training data 212106 rows and 58 columns

Let’s drop 5 unwanted columns

X = data.drop(['uniqueID','review','rating','date','review_sentiment'], axis=1)
y = data[‘review_sentiment’].values
print(X.shape)

(212106, 53)

Let’s split our data into 70% training and 30% test datasets, and then hold out 30% of the training portion (21% of all data) as cross-validation (CV) data using train_test_split. This leaves 49% of the data for training.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,stratify=y,test_size=0.30,random_state=42)

X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train,stratify=y_train,test_size=0.30,random_state=42)

print('Train data size is:', X_train.shape)
print('Cross Validation data size is:', X_cv.shape)
print('Test data size is:', X_test.shape)

Train data size is: (103931, 53)
Cross Validation data size is: (44543, 53)
Test data size is: (63632, 53)
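
The printed sizes follow from applying the 30% hold-out twice. The arithmetic can be reproduced with the sketch below, which assumes that the held-out count is rounded up; under that assumption it matches the shapes printed above. The helper name split_sizes is illustrative.

```python
import math

def split_sizes(n_samples, test_size):
    # Assumes the held-out count is rounded up, which reproduces the sizes above
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 212106
n_rest, n_test = split_sizes(n_total, 0.30)   # first split: train+CV vs test
n_train, n_cv = split_sizes(n_rest, 0.30)     # second split: train vs CV

for name, n in [("train", n_train), ("cv", n_cv), ("test", n_test)]:
    print(f"{name}: {n} ({n / n_total:.1%})")
```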

Let’s encode the categorical condition feature using LabelEncoder

from sklearn.preprocessing import LabelEncoder

lab_enc_cond = LabelEncoder()

lab_enc_cond.fit(X['condition'].values)

X_train_condition = lab_enc_cond.transform(X_train['condition'].values).reshape(-1,1)
X_test_condition = lab_enc_cond.transform(X_test['condition'].values).reshape(-1,1)
X_cv_condition = lab_enc_cond.transform(X_cv['condition'].values).reshape(-1,1)

print('After Encoding')
print('Train data shape', X_train_condition.shape)
print('Test data shape', X_test_condition.shape)
print('CV data shape', X_cv_condition.shape)

After Encoding
Train data shape (103931, 1)
Test data shape (63632, 1)
CV data shape (44543, 1)
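
Label encoding maps each distinct category to an integer. The sketch below imitates that mapping with a hypothetical helper, fit_label_encoder, on made-up condition names; like the notebook, it fits on all values so that no category in the test or CV split is unseen at transform time.

```python
def fit_label_encoder(values):
    """Map each distinct value to an integer, in sorted order."""
    classes = sorted(set(values))
    return {label: idx for idx, label in enumerate(classes)}

all_conditions = ["pain", "acne", "depression", "pain", "birth control"]
mapping = fit_label_encoder(all_conditions)

train_conditions = ["pain", "acne"]
test_conditions = ["birth control", "depression"]
train_codes = [mapping[c] for c in train_conditions]
test_codes = [mapping[c] for c in test_conditions]
print(mapping)
print(train_codes, test_codes)
```

The integer codes carry an arbitrary alphabetical order, so they suit tree-based models better than models that treat the code as a magnitude.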

Let’s export the condition encoder

import joblib
print('Saving condition encoder..')
joblib.dump(lab_enc_cond, 'condition_encoder.pkl')

Saving condition encoder..

['condition_encoder.pkl']

Let’s apply the label encoder to the year feature

lab_enc_year = LabelEncoder()

lab_enc_year.fit(X['year'].values)

X_train_year = lab_enc_year.transform(X_train['year'].values).reshape(-1,1)
X_test_year = lab_enc_year.transform(X_test['year'].values).reshape(-1,1)
X_cv_year = lab_enc_year.transform(X_cv['year'].values).reshape(-1,1)

print('After Encoding')
print('Train data shape', X_train_year.shape)
print('Test data shape', X_test_year.shape)
print('CV data shape', X_cv_year.shape)

After Encoding
Train data shape (103931, 1)
Test data shape (63632, 1)
CV data shape (44543, 1)

Let’s export the year encoder

print('Saving year encoder..')
joblib.dump(lab_enc_year, 'year_encoder.pkl')

Saving year encoder..

['year_encoder.pkl']

BoW Vectorizer

The text data can be encoded in different forms.

The ‘cleaned review’ text data can be vectorized using Bag of Words (BoW) as 1 gram tokens

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.preprocessing import Normalizer,StandardScaler,MinMaxScaler
import joblib

vect_bow_1 = CountVectorizer(min_df=10,ngram_range=(1,1))

# Empty cleaned reviews are read back from the CSV as NaN, so drop those rows.
# Note: the matching entries of y_train and y_test must be filtered the same way.
X_train1 = X_train.dropna()

vect_bow_1.fit(X_train1['cleaned_review'].values)

CountVectorizer(min_df=10)

X_train_review_bow_1 = vect_bow_1.transform(X_train1['cleaned_review'].values)

X_test1 = X_test.dropna()

X_test_review_bow_1 = vect_bow_1.transform(X_test1['cleaned_review'].values)

X_cv1 = X_cv.dropna()

X_cv_review_bow_1 = vect_bow_1.transform(X_cv1['cleaned_review'].values)

print('After Vectorization')
print('Train data shape:', X_train_review_bow_1.shape)
print('Test data shape:', X_test_review_bow_1.shape)
print('CV data shape:', X_cv_review_bow_1.shape)

After Vectorization
Train data shape: (103927, 7294)
Test data shape: (63628, 7294)
CV data shape: (44543, 7294)

print('Vectorizer for BoW is saved..')
joblib.dump(vect_bow_1, 'vectorizer_bow.pkl')

Vectorizer for BoW is saved..

['vectorizer_bow.pkl']
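
To make the role of min_df concrete, here is a minimal standard-library sketch of unigram Bag of Words. It is a simplification: the helper names and the three training reviews are made up, and tokens are split on whitespace, whereas CountVectorizer uses its own token pattern.

```python
from collections import Counter

def fit_vocabulary(docs, min_df=1):
    """Keep tokens that occur in at least min_df documents; index them alphabetically."""
    doc_freq = Counter(token for doc in docs for token in set(doc.split()))
    kept = sorted(t for t, df in doc_freq.items() if df >= min_df)
    return {token: idx for idx, token in enumerate(kept)}

def transform(docs, vocab):
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.split():
            if token in vocab:          # out-of-vocabulary tokens are ignored
                row[vocab[token]] += 1
        rows.append(row)
    return rows

train_docs = ["no side effect", "bad side effect", "work great no pain"]
vocab = fit_vocabulary(train_docs, min_df=2)
bow_rows = transform(["no no effect headache"], vocab)
print(vocab)
print(bow_rows)
```

Fitting the vocabulary on the training reviews only, as the notebook does, keeps test-set words from leaking into the feature space.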

TF-IDF Vectorizer

Let’s vectorize our cleaned review data using the Term Frequency and Inverse Document Frequency (TF-IDF) as 1 gram tokens fitted on training data only

vect_tfidf_1 = TfidfVectorizer(min_df=10,ngram_range=(1,1))

vect_tfidf_1.fit(X_train1['cleaned_review'].values)

X_train_review_tfidf_1 = vect_tfidf_1.transform(X_train1['cleaned_review'].values)
X_test_review_tfidf_1 = vect_tfidf_1.transform(X_test1['cleaned_review'].values)
X_cv_review_tfidf_1 = vect_tfidf_1.transform(X_cv1['cleaned_review'].values)

print('After Vectorization')
print('Train data shape:', X_train_review_tfidf_1.shape)
print('Test data shape:', X_test_review_tfidf_1.shape)
print('CV data shape:', X_cv_review_tfidf_1.shape)

After Vectorization
Train data shape: (103927, 7294)
Test data shape: (63628, 7294)
CV data shape: (44543, 7294)

print('Vectorizer for TF-IDF is saved..')
joblib.dump(vect_tfidf_1, 'vectorizer_tfidf.pkl')
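
The TF-IDF weighting can be traced by hand on a toy corpus. The sketch below assumes the smoothed form idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which corresponds to the default settings of TfidfVectorizer; the helper names and the three reviews are illustrative.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc.split()))
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

def tfidf_vector(doc, idf):
    # Term counts for known tokens, weighted by idf, then scaled to unit length
    tf = Counter(t for t in doc.split() if t in idf)
    weights = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train_docs = ["no side effect", "bad side effect", "work great no pain"]
idf = fit_idf(train_docs)
vec = tfidf_vector("bad side effect", idf)
print({t: round(w, 3) for t, w in vec.items()})
```

In this toy corpus the rarer word "bad" receives a larger weight than "side" or "effect", which appear in two of the three reviews.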

Word2Vec Vectorizer

The cleaned review text is now vectorized using pre-trained word embeddings (a Word2Vec-style lookup built from GloVe vectors).

from tqdm import tqdm
cleaned_reviews = data['cleaned_review'].values

To do so, we load the pre-trained GloVe model from the file glove.6B.300d.txt

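def loadGloveModel(gloveFile):
    """Load GloVe vectors from a text file into a {word: vector} dictionary."""
    print("Loading Glove Model")
    model = {}
    # Each line holds a word followed by the components of its embedding;
    # the with-block makes sure the file is closed after reading
    with open(gloveFile, 'r', encoding="utf8") as f:
        for line in tqdm(f):
            splitLine = line.split()
            word = splitLine[0]
            embedding = np.array([float(val) for val in splitLine[1:]])
            model[word] = embedding
    print("Done.", len(model), " words loaded!")
    return model

model = loadGloveModel('glove.6B.300d.txt')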

Loading Glove Model
400000it [00:34, 11432.30it/s]
Done. 400000  words loaded!

words = []
for i in cleaned_reviews:
    words.extend(str(i).split(' '))

Let’s check the number of words that are present in both the GloVe vectors and our corpus

print("all the words in the corpus", len(words))
words = set(words)
print("the unique words in the corpus", len(words))

inter_words = set(model.keys()).intersection(words)
print("The number of words that are present in both glove vectors and our corpus",
      len(inter_words), "(", np.round(len(inter_words)/len(words)*100, 3), "%)")

# Keep only the embeddings of words that actually occur in the corpus
words_courpus = {}
words_glove = set(model.keys())
for i in words:
    if i in words_glove:
        words_courpus[i] = model[i]
print("word 2 vec length", len(words_courpus))

all the words in the corpus 8868702
the unique words in the corpus 34667
The number of words that are present in both glove vectors and our corpus 13620 ( 39.288 %)
word 2 vec length 13620

Let’s invoke the pickle binary protocol for serializing/de-serializing our vectors

import pickle
with open('glove_vectors', 'wb') as f:
    pickle.dump(words_courpus, f)

with open('glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words = set(model.keys())
print(len(glove_words))

13620
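The walkthrough stops at building the word-to-vector lookup. One common way to turn such a lookup into a fixed-length feature per review is to average the vectors of the words it covers. The sketch below illustrates that idea under stated assumptions: the helper average_vector and the toy 3-dimensional embeddings are hypothetical stand-ins for the 300-dimensional GloVe lookup, and this step is not shown in the original notebook.

```python
def average_vector(review, embeddings, dim):
    """Average the embeddings of the words in `review` that exist in the lookup."""
    total = [0.0] * dim
    n_found = 0
    for word in review.split():
        vector = embeddings.get(word)
        if vector is None:
            continue  # out-of-vocabulary words are skipped
        total = [t + v for t, v in zip(total, vector)]
        n_found += 1
    if n_found == 0:
        return total  # all-zero vector when no word is covered
    return [t / n_found for t in total]

# Toy 3-dimensional stand-in for the GloVe lookup (hypothetical values)
toy_embeddings = {
    "pain": [1.0, 0.0, 2.0],
    "relief": [3.0, 2.0, 0.0],
}

review_vector = average_vector("pain relief xyzzy", toy_embeddings, dim=3)
empty_vector = average_vector("xyzzy", toy_embeddings, dim=3)
```

With only about 39% of the unique corpus words covered by GloVe, skipped words are frequent, so handling the no-coverage case explicitly matters.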

Let’s create the list of columns

columns = ['usefulCount','word_count','unique_word_count','char_length','count_punctuations','stopword_count',
           'mean_word_len','subj_count','obj_count','CARDINAL','DATE','EVENT','FAC','GPE','LANGUAGE','LAW',
           'LOC','MONEY','NORP','ORDINAL','ORG','PERCENT','PERSON','PRODUCT','QUANTITY','TIME','WORK_OF_ART',
           '0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19']

and apply Normalizer() to the train, test and CV data

normalizer = Normalizer()

X_train_num_1 = normalizer.fit_transform(X_train1[columns])
# Normalizer rescales each row independently and learns nothing in fit(),
# so transform() is sufficient for the test and CV sets
X_test_num_1 = normalizer.transform(X_test1[columns])
X_cv_num_1 = normalizer.transform(X_cv1[columns])

print("After vectorizations")
print(X_train_num_1.shape, y_train.shape)
print(X_test_num_1.shape, y_test.shape)
print(X_cv_num_1.shape, y_cv.shape)

After vectorizations
(103927, 47) (103931,)
(63628, 47) (63632,)
(44543, 47) (44543,)
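Normalizer() works per sample rather than per feature: with its default setting, every row is divided by its own Euclidean (L2) norm. The plain-Python sketch below mimics that behaviour on toy data; the helper l2_normalize_rows is hypothetical and only meant to show the arithmetic.

```python
import math

def l2_normalize_rows(rows):
    """Scale every row to unit Euclidean length; all-zero rows are left unchanged."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(value * value for value in row))
        normalized.append([value / norm for value in row] if norm else list(row))
    return normalized

# Hypothetical toy feature rows
features = [[3.0, 4.0], [0.0, 0.0], [10.0, 0.0]]
scaled = l2_normalize_rows(features)
```

Because each row is scaled on its own, a feature with a large range (for example char_length) can dominate the normalized row. Column-wise scalers such as the StandardScaler or MinMaxScaler imported above behave differently: they would be fit on the training data and then applied to the test and CV data.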

Let’s create the subsets with sentiment_score and sentiment_score_clean values

X_train_sent_score = X_train1[['sentiment_score','sentiment_score_clean']].values
X_test_sent_score = X_test1[['sentiment_score','sentiment_score_clean']].values
X_cv_sent_score = X_cv1[['sentiment_score','sentiment_score_clean']].values

print("After vectorizations")
print(X_train_sent_score.shape, y_train.shape)
print(X_test_sent_score.shape, y_test.shape)
print(X_cv_sent_score.shape, y_cv.shape)

After vectorizations
(103927, 2) (103931,)
(63628, 2) (63632,)
(44543, 2) (44543,)

Let’s concatenate all encoded features (all extracted features + sentiment scores)

from scipy.sparse import hstack
X_tr_1 = np.concatenate((X_train_num_1,X_train_sent_score),axis=1)

X_te_1 = np.concatenate((X_test_num_1,X_test_sent_score),axis=1)

X_cv_1 = np.concatenate((X_cv_num_1,X_cv_sent_score),axis=1)

print("Final Data matrix")
print(X_tr_1.shape, y_train.shape)
print(X_te_1.shape, y_test.shape)
print(X_cv_1.shape, y_cv.shape)

Final Data matrix
(103927, 49) (103931,)
(63628, 49) (63632,)
(44543, 49) (44543,)
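Note that the printed shapes show four fewer feature rows than labels for the train and test sets (103927 vs 103931 and 63628 vs 63632), because dropna() was applied to the features only. Metric functions expect features and labels of equal length, so the labels need to be filtered in the same way. A minimal sketch of one way to do this, assuming the labels are pandas Series that share the index of the feature frames (an assumption; the toy data below is hypothetical):

```python
import pandas as pd

# Toy stand-ins for X_train / y_train (hypothetical data)
X_toy = pd.DataFrame(
    {"cleaned_review": ["good", None, "bad", "ok"], "usefulCount": [3, 1, 4, 2]},
    index=[10, 11, 12, 13],
)
y_toy = pd.Series([1, 0, 0, 1], index=[10, 11, 12, 13])

# Drop rows with missing values, then keep only the labels of the surviving rows
X_toy_clean = X_toy.dropna()
y_toy_clean = y_toy.loc[X_toy_clean.index]
```

Applied to the data above, the same pattern would read y_train.loc[X_train1.index], and likewise for the test labels.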

NLP Modelling

We need a few performance metrics

from sklearn.metrics import log_loss, accuracy_score,confusion_matrix, f1_score,roc_auc_score,roc_curve

and the following functions

def plot_confusion_matrix(test_y, predict_y):
    # C[i, j] = number of points of class i that were predicted as class j
    C = confusion_matrix(test_y, predict_y)
    # The printed value is the percentage of misclassified points
    print("Number of misclassified points ",(len(test_y)-np.trace(C))/len(test_y)*100)

    # Recall matrix: each row of C divided by its row sum
    A =(((C.T)/(C.sum(axis=1))).T)

    # Precision matrix: each column of C divided by its column sum
    B =(C/C.sum(axis=0))

    plt.figure(figsize=(20,4))

    labels = [0,1]
    cmap=sns.light_palette("green")
    plt.subplot(1, 3, 1)
    sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Confusion matrix")

    plt.subplot(1, 3, 2)
    sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Precision matrix")

    plt.subplot(1, 3, 3)
    sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Recall matrix")

    plt.show()

def model_metrics(clf,train_data,test_data,cv_data):
    # Uses the global label arrays y_train, y_cv and y_test

    print('**LogLoss**')
    predict_y = clf.predict_proba(train_data)
    print ("The train log loss is:",log_loss(y_train, predict_y))
    predict_y = clf.predict_proba(cv_data)
    print( "The cross validation log loss is:",log_loss(y_cv, predict_y))
    predict_y = clf.predict_proba(test_data)
    print( "The test log loss is:",log_loss(y_test, predict_y))

    print(50*'-')

    print('**Accuracy**')
    y_pred_tr = clf.predict(train_data)
    print ("The train Accuracy is:",accuracy_score(y_train, y_pred_tr))
    y_pred_cv = clf.predict(cv_data)
    print( "The cross validation Accuracy is:",accuracy_score(y_cv, y_pred_cv))
    y_pred_te = clf.predict(test_data)
    print( "The test Accuracy is:",accuracy_score(y_test, y_pred_te))

    print(50*'-')

    print('**F1 Score**')
    print ("The train F1 score is:",f1_score(y_train, y_pred_tr))
    print( "The cross validation F1 score is:",f1_score(y_cv, y_pred_cv))
    print( "The test F1 score is:",f1_score(y_test, y_pred_te))

    print(50*'-')

    # Note: AUC is computed here from hard class predictions, not from probabilities
    print('**AUC**')
    print ("The train AUC is:",roc_auc_score(y_train, y_pred_tr))
    print( "The cross validation AUC is:",roc_auc_score(y_cv, y_pred_cv))
    print( "The test AUC is:",roc_auc_score(y_test, y_pred_te))

    print(50*'-')
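The AUC section of model_metrics passes hard class predictions to roc_auc_score. AUC is a ranking metric, so thresholded 0/1 predictions discard most of the ranking information. The plain-Python sketch below computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as one half, and compares both inputs on toy data. The helper pairwise_auc and the numbers are hypothetical.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
probabilities = [0.2, 0.3, 0.4, 0.45]  # positives always rank above negatives
hard_labels = [int(p >= 0.5) for p in probabilities]  # all zeros at a 0.5 threshold

auc_from_probabilities = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)
```

Here the probabilities rank every positive above every negative, yet the thresholded labels make the model look no better than chance. In model_metrics, the probability-based variant would correspond to passing clf.predict_proba(data)[:, 1] to roc_auc_score.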

As a baseline, let’s evaluate a random model that assigns random class probabilities to every test point

test_len = len(y_test)
predicted_y = np.zeros((test_len,2))
for i in range(test_len):
    # Draw two random numbers and rescale them so that they sum to 1
    rand_probs = np.random.rand(1,2)
    predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
# The eps argument is omitted: it has been removed from log_loss in recent scikit-learn releases
print("Log loss on Test Data using Random Model",log_loss(y_test, predicted_y))

predicted_y =np.argmax(predicted_y, axis=1)

print("Accuracy on Test Data using Random Model",accuracy_score(y_test, predicted_y))
print("F1 score on Test Data using Random Model",f1_score(y_test, predicted_y))
print("AUC on Test Data using Random Model",roc_auc_score(y_test, predicted_y))

Log loss on Test Data using Random Model 0.8817010788104614
Accuracy on Test Data using Random Model 0.5019644204174001
F1 score on Test Data using Random Model 0.5859171860504618
AUC on Test Data using Random Model 0.5014991363225234
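To put the random model's log loss in context, it helps to compare it with an uninformed constant prediction of 0.5, whose binary log loss is ln 2 ≈ 0.693 regardless of the labels. The sketch below computes binary log loss in plain Python; the helper binary_log_loss and the toy labels are hypothetical.

```python
import math

def binary_log_loss(y_true, prob_positive):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, prob_positive):
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(y_true)

labels = [1, 0, 1, 1, 0]

# Uninformed constant guess of 0.5 for every point
constant_loss = binary_log_loss(labels, [0.5] * len(labels))

# Confidently wrong predictions are penalized much more heavily
confident_wrong_loss = binary_log_loss(labels, [0.1, 0.9, 0.1, 0.1, 0.9])
```

Random probabilities are sometimes confidently wrong, which is in line with the random model's reported log loss of about 0.88 being above ln 2. Any trained classifier should be expected to beat both values.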

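plot_confusion_matrix(y_test,predicted_y)
# plot_confusion_matrix() already calls plt.show(); with most backends the figure
# is cleared after show(), so this savefig call may write an empty image.
# Moving savefig before plt.show() inside the function avoids that.
plt.savefig('drugsconfmatrix.png')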

Number of misclassified points  49.80355795826

Figure: confusion matrix, precision matrix and recall matrix for the random model

Conclusion

Following recent ML and DA studies, we have addressed the problem of building an NLP-based drug recommendation system in Python. Sentiment analysis, topic modelling and Word2Vec-style word embeddings appear to play a major role in classifying drug reviews and, in turn, in recommending effective drugs.

A simple way to deploy the trained LDA model is to wrap it in a web service using the Flask web framework. The resulting web app can then be deployed in the multi-cloud environment discussed in the Appendix.

Appendix:

Multi-Cloud NLP Drug Recommendation

AWS:

One can use Amazon Comprehend Medical to extract medication names and medical conditions in order to monitor drug safety and adverse events. Amazon Comprehend Medical is an NLP service that uses AI to extract relevant medical information from unstructured text. One can then query the OpenFDA API (an open-source API published by the FDA) and the Clinicaltrials.gov API (another open-source API, published by the National Library of Medicine (NLM) at the National Institutes of Health (NIH)) to get information on past adverse events, recalls, and clinical trials for the drug or medical condition in question. This data can be used in population-scale studies to further analyze the drug’s safety and efficacy.

GCP

Several case studies walk through the process of creating an ERC20 token recommendation system built with TensorFlow, Cloud Machine Learning Engine, Cloud Endpoints, and App Engine on Google Cloud. The key ingredient is collaborative filtering, a technique for generating user recommendations that relies only on observed user behavior; no profile data or content access is necessary. The entire workflow looks like this:

  • Create and train the model for the token recommendation system
  • Query token ratings from BigQuery
  • Train the model locally and in Cloud ML Engine
  • Tune hyperparameters in Cloud ML Engine
  • Deploy the recommendation system to Google App Engine
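The collaborative filtering idea behind this workflow can be sketched in a few lines of plain Python. The example below is a user-based variant with cosine similarity on toy ratings; it is not the TensorFlow model from the case studies, and the function names, user names and item names are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two rating dicts over the items both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores if weights[item] > 0}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]

# Toy ratings: only observed behavior is used, no profile or content data
ratings = {
    "ann": {"drug_a": 5, "drug_b": 1},
    "bob": {"drug_a": 5, "drug_b": 1, "drug_c": 4, "drug_d": 2},
    "cat": {"drug_a": 1, "drug_b": 5, "drug_c": 1, "drug_d": 5},
}
top_for_ann = recommend("ann", ratings, top_n=1)
```

Since "ann" rates the shared items exactly like "bob", bob's preferences carry the most weight and his favourite unrated item is recommended first.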

With the development of e-commerce, a growing number of people prefer to purchase medicine online. To provide online medication guidance, the cloud-assisted drug recommendation system CADRE can recommend the top-N related medicines to users according to their symptoms.

Azure

Healthcare organizations are using Azure products and services—including hybrid cloud, mixed reality, AI, and IoT—to drive better health outcomes, improve security, scale faster, and enhance data interoperability.

Text Analytics is now part of the Azure Language Service. Specifically, Sentiment Analysis and Opinion Mining, Named Entity Recognition (NER), Entity Linking, Key Phrase Extraction, Language Detection, Text Analytics for health, and Text Summarization are all part of the Language Service as they exist today.

Using the Text Analytics NER, users can extract known types such as person names, geographical locations, datetimes, and organizations. However, much of the information of interest is more specific than these standard types. The custom entity extraction capability allows you to build your own extractors by providing labelled examples of text to train models.
