Drug Review Data Analytics

A medication recommender framework is truly vital with the goal that it can assist specialists and help patients to build their knowledge of drugs on specific health conditions [1-4].

Contents:

Drug Review Image
Photo by pina messina on
Unsplash — Photo by pina messina on Unsplash

Objective: Build a Drug Recommendation System that recommends the most effective drug for a certain condition based on available reviews of various drugs used to treat this condition.

The Kaggle UCI ML Drug Review dataset provides over 200000 patient reviews on specific drugs along with related conditions and a 10-star patient rating system reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites [1]. This data was published in a study on sentiment analysis of drug experience over multiple facets, ex. sentiments learned on specific aspects such as effectiveness and side effects.

The input dataset contains 7 columns: uniqueID, drugName, condition, review, rating, date, and usefulCount.

Workflow

Importing relevant libraries
Reading input raw data
Check data content/statistics
Plotting top 10 conditions
Drug count for each condition
Top 10 drugs used for the top condition
Plotting the top 10 drugs rated as 1 or 10
Plotting the percentage distribution of ratings using pie chart
Checking the distribution of usefulCount feature
Data analytics report summary

Importing Libraries

Let’s set the working directory YOURPATH

import os

os.chdir(‘YOURPATH’) # Set working directory

Let’s import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.plotting import plot_decision_regions

plt.style.use(‘ggplot’)
%config InlineBackend.figure_format = ‘svg’
%matplotlib inline
np.set_printoptions(suppress=True)

!pip install nltk

import re # Regular expression library
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download(‘stopwords’)
nltk.download(‘wordnet’)
nltk.download(‘punkt’)

!pip install spacy

!python -m spacy download en_core_web_sm

Read Raw Data

and read the input train and test datasets

df_train = pd.read_csv(‘drugsComTrain_raw.csv’)
df_test = pd.read_csv(‘drugsComTest_raw.csv’)
print(‘Overview of Train dataset:\n’)
df_train.head(5)

Overview of Train dataset:

Similarly,

print(‘Overview of Test dataset:\n’)
df_test.head(5)

Overview of Test dataset:

Data Content

Let’s look at the training data summary info

df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161297 entries, 0 to 161296
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   uniqueID     161297 non-null  int64 
 1   drugName     161297 non-null  object
 2   condition    160398 non-null  object
 3   review       161297 non-null  object
 4   rating       161297 non-null  int64 
 5   date         161297 non-null  object
 6   usefulCount  161297 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 8.6+ MB

Let’s check the dimensions of these two datasets

data_train.shape

(161297, 7)

data_test.shape

(53766, 7)

let’s look at the descriptive statistics of the training dataset

df_train.describe()

Let’s check for missing values

df_train.isnull().any()

uniqueID       False
drugName       False
condition       True
review         False
rating         False
date           False
usefulCount    False
dtype: bool

and calculate the percentage of missing values

percent_missing = df_train.isnull().sum() * 100 / len(df_train)

print(percent_missing)

uniqueID       0.000000
drugName       0.000000
condition      0.557357
review         0.000000
rating         0.000000
date           0.000000
usefulCount    0.000000
dtype: float64

Plotting Top 10 Conditions

data=df_train
conditions = dict(data[‘condition’].value_counts())
top_conditions = list(conditions.keys())[0:10]
values = list(conditions.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style=’darkgrid’)
sns.barplot(x=top_conditions,y=values,palette=’spring’)
plt.title(‘Top 10 Conditions’)
plt.xlabel(‘Conditions’)
plt.ylabel(‘Count’)

plt.show()

plt.savefig(‘conditionstop10.png’)

Drug Count for Each Condition

val=[]
for c in list(conditions.keys()):
val.append(data[data[‘condition’]==c][‘drugName’].nunique())
drug_cond = dict(zip(list(conditions.keys()),val))
top_conditions = list(drug_cond.keys())[0:10]
values = list(drug_cond.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style=’darkgrid’)
sns.barplot(x=top_conditions,y=values,palette=’spring’)
plt.title(‘Number of Drugs for each Top 10 Conditions’)
plt.xlabel(‘Conditions’)
plt.ylabel(‘Count of Drugs’)

plt.show()

plt.savefig(‘drugcounttop10.png’)

Barchart count of drugs for each top 10 conditions

Top 10 Drugs Used for the Top Condition

drugs_birth = dict(data[data[‘condition’]==’Birth Control’][‘drugName’].value_counts())

top_drugs = list(drugs_birth.keys())[0:10]
values = list(drugs_birth.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style=’darkgrid’)
sns.barplot(x=values,y=top_drugs,palette=’spring’)
plt.title(‘Top 10 Drugs used for Birth Control’)
plt.ylabel(‘Drug Names’)
plt.xlabel(‘Count of Patients used’)

plt.show()

plt.savefig(‘drugcountbirthcontrol.png’)

barchart top 10 drugs used for top condition - birth control

Plotting Top 10 Drugs Rated as 10

drugs_rating = dict(data[data[‘rating’]==10][‘drugName’].value_counts())

top_drugs = list(drugs_rating.keys())[0:10]
values = list(drugs_rating.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style=’darkgrid’)
sns.barplot(x=values,y=top_drugs,palette=’spring’)
plt.title(‘Top 10 Drugs rated as 10’)
plt.ylabel(‘Drug Names’)
plt.xlabel(‘Count of Ratings’)

plt.show()

plt.savefig(‘drugstop10rate10.png’)

Plotting Top 10 Drugs Rated as 1

drugs_rating = dict(data[data[‘rating’]==1][‘drugName’].value_counts())

top_drugs = list(drugs_rating.keys())[0:10]
values = list(drugs_rating.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style=’darkgrid’)
sns.barplot(x=values,y=top_drugs,palette=’spring’)
plt.title(‘Top 10 Drugs rated as 1’)
plt.ylabel(‘Drug Names’)
plt.xlabel(‘Count of Ratings’)

plt.show()

plt.savefig(‘drugstop10rate1.png’)

Plotting Percentage of Ratings using Pie Chart

ratings_count = dict(data[‘rating’].value_counts())
count = list(ratings_count.values())
labels = list(ratings_count.keys())
plt.figure(figsize=(18,9))
plt.pie(count,labels=labels, autopct=’%1.1f%%’)
plt.title(‘Pie Chart Representation of Ratings’)
plt.legend(title=’Ratings’)

plt.show()

plt.savefig(‘drugsratingsall.png’)

Distribution of usefulCount Feature

plt.figure(figsize=(16,8))
ax =sns.distplot(data[‘usefulCount’])

plt.title(‘Distribution of usefulCount’)

plt.show()

plt.savefig(‘drugsusefulcontent.png’)

Distribution of Ratings

f,ax = plt.subplots(1,2,figsize=(16,8))
ax1= sns.histplot(data[‘rating’],ax=ax[0])
ax1.set_title(‘Count of Ratings’)
ax2= sns.distplot(data[‘rating’],ax=ax[1])
ax2.set_title(‘Distribution of Ratings Density’)

plt.show()

plt.savefig(‘distributionratingsdens.png’)

Summary

The categorical feature condition has 0.55% missing values to be dropped.
Birth Control is the topmost condition followed by Depression, Pain, and Anxiety.
Patients use several drugs to treat their conditions.
Pain and Birth Control conditions have the highest drug count.
Top drugs used for Birth Control are Etonogestrel Ethinyl Estradiol, Levonorgestrel and Nexplanon.
Birth Control and Weight Loss/Obesity drugs are top rated.
Etonogestrel and Levonorgestrel should be the top 2 recommended drugs as they are mostly frequently used and also rated as 10.
It appears that ~75% of drugs are rated with 10,9,8 and 1 ratings.
The maximum number of drug reviews is less than 200 upvotes.

References

The dataset was originally published on the UCI Machine Learning repository. Citation:

[1] Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH ’18). ACM, New York, NY, USA, 121-125.

[2] Sumit Mishra 2020 Python code

[3] Ruthvik Marshetty Drug Recommendation System: Medium June 4, 2022. Ref.

Github Repository, Research paper Garg (2021)

[4] Github jisong316

← Back

Drug Review Data Analytics

Workflow

Importing Libraries

Read Raw Data

Data Content

Plotting Top 10 Conditions

plt.show()

Drug Count for Each Condition

plt.show()

Top 10 Drugs Used for the Top Condition

plt.show()

Plotting Top 10 Drugs Rated as 10

plt.show()

Plotting Top 10 Drugs Rated as 1

plt.show()

Plotting Percentage of Ratings using Pie Chart

plt.show()

Distribution of usefulCount Feature

plt.show()

Distribution of Ratings

plt.show()

Summary

References

Thank you for your response. ✨

Share this:

Discover more from Our Blogs

Leave a comment Cancel reply

Discover more from Our Blogs