• EdTech for All: Free/Paid IoT Courses ’22

    Featured Photo by Jorge Ramirez on Unsplash.

    IoT = Device + Gateway + Cloud

    The Internet of Things, or IoT, is a system of interrelated computing devices, mechanical and digital machines, objects, animals or people that are provided with unique identifiers (UIDs) and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction.

    IoT Statistics – Key Findings

    • In 2021, there were more than 10 billion active IoT devices.
    • It’s estimated that the number of active IoT devices will surpass 25.4 billion in 2030.
    • By 2025, there will be 152,200 IoT devices connecting to the internet per minute.
    • IoT solutions have the potential to generate $4-11 trillion in economic value by 2025.
    • 83% of organizations have improved their efficiency by introducing IoT technology.
    • It’s estimated that global IoT spending will total $15 trillion in the six-year period between 2019 and 2025.
    • The consumer IoT market is estimated to reach $142 billion by 2026 at a CAGR of 17%.
    • 94% of retailers agree that the benefits of implementing IoT outweigh the risk.
    • The amount of data generated by IoT devices is expected to reach 73.1 ZB (zettabytes) by 2025.

    In 2021, IoT Analytics expected the global number of connected IoT devices to grow 9%, to 12.3 billion active endpoints. By 2025, there will likely be more than 27 billion IoT connections.

    State of IoT 2022: Number of connected IoT devices growing 18% to 14.4 billion globally.

    Read more about the number of connected IoT devices 2010-2025.

    IoT connections market update—May 2022

    Despite booming demand for IoT solutions and positive sentiment in the IoT community as well as in most IoT end markets, IoT Analytics expects the chip shortage’s impact on the number of connected IoT devices to last well beyond 2023. Other headwinds for IoT markets include the ongoing COVID-19 pandemic and general supply chain disruptions. In 2022, the market for the Internet of Things is expected to grow 18% to 14.4 billion active connections. By 2025, as supply constraints ease and growth accelerates further, there will likely be approximately 27 billion connected IoT devices.

    Most Popular MOOC Platforms in 2022:




    Explore More

  • Invest in AI via Macroaxis Sep ’22 Update

    SeekingAlpha Opinion: Want To Rule The World? Invest In AI

    Following our recent study, let’s look at investing opportunities in Artificial Intelligence (AI) using the Macroaxis Wealth Optimization platform. Here is why:

    4 AI pillars
    Source: Deloitte

    Thematic Idea

    This theme comprises firms and funds developing tools for AI: tech companies, funds, and ETFs across multiple industries involved in R&D on reasoning, learning, NLP, and perception, as well as their application to science and e-commerce. It may also include entities working in cybernetics and cognitive brain simulation.

    AI thematic idea
    AI vs DOW stocks
    09-21-2022 DOW -3.99%, AI +3.12%

    Asset Allocation

    The AI investing theme is composed of its constituents, equally weighted against each other.

    MC %

    AI theme market capitalization (MC) usually refers to the total value of a theme’s positions broken down into specific market cap categories. To manage market risk and economic uncertainty, many investors today build portfolios that are diversified across equities with different MCs. However, as a general rule, conservative investors tend to hold large-cap stocks, and those looking for more risk prefer small-cap and mid-cap equities.

    Market Capitalization (%)

    Instrument Composition

    By diversifying AI theme assets across categories whose investment returns move up and down under different market conditions, an investor can protect against significant losses. Historically, the returns of the major asset categories, such as stocks, funds, ETFs, and cryptocurrencies, have not moved up and down simultaneously. Market conditions that cause one asset classification to do well often cause another to have average or poor returns. By investing in more than one asset classification, investors reduce the risk of losing money, and the portfolio’s overall returns show lower volatility. If one asset category’s return falls, better returns in another category can help counteract the losses.

    Instrument Composition And Concentration

    Market Elasticity

    The market elasticity of a theme measures how responsive the resulting portfolio will be to changes in market or economic conditions. Most investing themes are subject to two types of risk – systematic (i.e., market) and unsystematic (i.e., nonmarket or company-specific) risk. Unsystematic risk is the risk that events specific to the Artificial Intelligence theme will adversely affect the performance of its constituents. This type of risk can be diversified away by optimizing the themed equities into an efficient portfolio with different positions weighted according to their correlations. On the other hand, systematic risk is the risk that the theme constituents’ prices will be affected by overall market movements and cannot be diversified. Below are essential risk-adjusted performance indicators that can help to measure the overall market elasticity of the AI theme.
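    A standard way to quantify the systematic part of this risk is beta: the covariance of an asset’s returns with the market’s, scaled by the market’s variance. The sketch below is illustrative only; both return series are made-up numbers, not Macroaxis data.

```python
import numpy as np

# Hypothetical daily returns for one AI-theme asset and the market index
asset = np.array([0.012, -0.004, 0.008, -0.011, 0.005, 0.009, -0.002])
market = np.array([0.010, -0.006, 0.007, -0.009, 0.004, 0.008, -0.003])

# Beta measures systematic (market) risk: covariance with the market
# scaled by market variance. Beta > 1 means amplified market moves.
beta = np.cov(asset, market, ddof=1)[0, 1] / np.var(market, ddof=1)
```

    A beta near 1 moves in lockstep with the market; a diversified theme cannot eliminate this component, only the asset-specific part.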

    Market Elasticity

    Risk/Return Ratio

    An investing theme such as AI should be diversified across asset classifications. So, in addition to allocating your investments among stocks, funds, ETFs, cash, and possibly cryptocurrencies, you will also need to spread out your investments within each asset category. The key is to identify investments in segments of each asset category that may perform differently under different market conditions. One way of diversifying your investments within an asset category is investing in a wide range of entities and industry sectors with different risk-return characteristics, as shown below.

    AI portfolio risk vs daily expected return

    Asset Ratings

    Many investors optimize their portfolios to maintain a risk-return balance that meets their personal investing preferences and liquidity needs. Understanding the relationship between the Sharpe ratio, risk, and expected return will help you build an optimal portfolio out of your selected theme. The Sharpe ratio describes how much excess return you receive for the extra volatility you endure for holding a position in a themed portfolio. Below are the essential efficiency ratios that can help you quickly create a reliable input to your portfolio optimization process.
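    The Sharpe ratio described above is simply the mean excess return divided by the volatility of those excess returns. A minimal numerical sketch, with a hypothetical return series and an assumed daily risk-free rate:

```python
import numpy as np

# Hypothetical daily returns for a themed position
returns = np.array([0.010, -0.003, 0.007, 0.002, -0.005, 0.012, 0.004])
risk_free = 0.0001  # assumed daily risk-free rate

excess = returns - risk_free
# Sharpe ratio: mean excess return per unit of volatility endured
sharpe = excess.mean() / excess.std(ddof=1)
```

    In practice the daily ratio is often annualized (multiplied by the square root of the number of trading periods per year) before comparing positions.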

    Asset Ratings
    Asset Ratings (continued)

    Technical Analysis

    The table below shows technical indicators of the assets in the current unweighted theme. These indicators can be significantly improved after the theme is optimized. By diversifying this theme into an optimal portfolio, it is possible not only to reduce its total risk but also to increase alpha, improve the information ratio, and increase the potential upside.

    Technical Analysis Of Total Daily Returns

    Correlation Matrix

    The AI theme correlation table is a 2D matrix that shows the Pearson correlation coefficient between every pair of the theme’s securities. The cells are color-coded to highlight significantly positive and negative relationships, representing the degree to which the price movements of the theme’s assets track each other.
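    A Pearson correlation matrix like this takes only a few lines of NumPy. The three synthetic return series below are illustrative stand-ins for theme constituents; the third is deliberately constructed to correlate with the first.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical daily returns for three theme constituents (rows = assets)
a = rng.normal(0, 0.01, 250)
b = rng.normal(0, 0.01, 250)
c = 0.8 * a + rng.normal(0, 0.005, 250)  # built to correlate with `a`

# Pearson correlation matrix between all pairs of potential positions
corr = np.corrcoef([a, b, c])
```

    Each diagonal entry is 1.0 (every series correlates perfectly with itself); off-diagonal entries near zero, like the `a`/`b` pair, are the kind of low-correlation pairings diversification looks for.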

    Portfolio correlation matrix
    Portfolio correlation matrix (continued)

    Read more: Protecting Portfolios Using Correlation Diversification.


    • Using investing ideas such as the AI theme to originate optimal portfolios saves a lot of time and completely automates your asset selection decisions. The framework behind a single and multiple investing theme optimization is designed to address the most technical part of the wealth optimization process, including asset allocation, equity research, portfolio diversification, portfolio rebalancing, and portfolio suggestion.
    • Macroaxis AI ideas are bundles of up to 20 equally weighted funds, stocks, ETFs, or cryptocurrencies that are programmatically selected from a pool of about 70 equities. The Macroaxis Investing Theme typically reflects a particular investment outlook based on shared economic or social characteristics, a joint business domain, or an essential financial categorization feature such as industry, growth potential, capitalization, locality, volatility, or market segment. It is an excellent tool to take your emotions out of your investing decisions.


  • ML/AI Diamond Price Prediction with R

    Inspired by the recent study, this post covers an analysis of diamonds with RStudio. On the demand side, customers in the market for a less expensive, smaller diamond are probably more price-sensitive than more well-to-do buyers. Therefore, it makes perfect sense to use a pre-trained ML/AI model to get an idea of whether you are overpaying.

    Preparation Phase

    Let’s change our working directory to YOURPATH

    setwd("YOURPATH")

    and install the following packages

    install.packages("GGally", type = "binary")
    install.packages("tidymodels", type = "binary")
    install.packages("tictoc", type = "binary")
    install.packages("caTools", type = "binary")
    install.packages("xgboost", type = "binary")
    install.packages("e1071", type = "binary")
    install.packages("rpart", type = "binary")
    install.packages("randomForest", type = "binary")
    install.packages("Metrics", type = "binary")

    We need a few libraries

    library(GGally)
    library(tidymodels)
    library(tictoc)
    library(caTools)
    library(xgboost)
    library(e1071)
    library(rpart)
    library(randomForest)
    library(Metrics)
    library(modelr) # provides spread_residuals() used below
    and call the diamond data (the diamonds dataset ships with ggplot2)

    data(diamonds)

    Here is the summary

    summary(diamonds)
    Diamond data summary table

    Exploratory Data Analysis (EDA)

    Let’s look at the pair correlation plot produced by GGally

    ggpairs(diamonds)

    Diamond data pair-correlation plot

    Let’s look at the Q-Q plot of log Price

    qqnorm(log(diamonds$price)); qqline(log(diamonds$price))

    Normal Q-Q plot of log Price

    We can see that the multi-modal histogram of log Price does not follow the normal distribution

    hist_norm <- ggplot(diamonds, aes(log(price))) +
    geom_histogram(aes(y = ..density..), colour = "black", fill = "yellow", bins = 55) +
    stat_function(fun = dnorm, args = list(mean = mean(log(diamonds$price)), sd = sd(log(diamonds$price))))

    Histogram of log Price

    Let’s consider log Price as the target variable, drop price and the x, y, z columns (so that log_price ends up as the seventh and last column, which the matrix indexing below relies on), and split the input data with SplitRatio = 0.7

    diamonds <- diamonds %>%
    mutate(log_price = log(price)) %>%
    select(-price, -x, -y, -z)

    set.seed(42) # for reproducibility
    split = sample.split(diamonds$log_price, SplitRatio = 0.7)
    diamonds_train = subset(diamonds, split == TRUE)
    diamonds_test = subset(diamonds, split == FALSE)

    Let’s prepare the data for ML

    diamonds_train <- diamonds_train %>%
    mutate_at(c('table', 'depth'), ~(scale(.) %>% as.vector))
    diamonds_test <- diamonds_test %>%
    mutate_at(c('table', 'depth'), ~(scale(.) %>% as.vector))

    Training Model

    Let’s call the lm() function to fit linear models

    mlm <- lm(log_price ~ carat + color + cut + clarity + table + depth, diamonds_train)

    mlm: 0.04 sec elapsed

    lm summary: residuals, coefficients, signif. codes, residual standard error 0.345, R² = 0.88, p-value < 2.2e-16, and F-statistic 1.51e+4.

    Let’s fit a 3rd-order polynomial model with lm()

    poly <- lm(log_price ~ poly(carat,3) + color + cut + clarity + poly(table,3) + poly(depth,3), diamonds_train)

    lm poly 3 summary: residuals, coefficients, signif. codes, residual standard error 0.1332, R² = 0.98, p-value < 2.2e-16, and F-statistic 8.64e+4.

    Let’s apply the XGBoost algorithm to both training and test data

    diamonds_train_xgb <- diamonds_train %>%
    mutate_if(is.factor, as.numeric)
    diamonds_test_xgb <- diamonds_test %>%
    mutate_if(is.factor, as.numeric)

    xgb <- xgboost(data = as.matrix(diamonds_train_xgb[-7]), label = diamonds_train_xgb$log_price, nrounds = 6166, verbose = 0)


    XGBoost summary

    xgb_pred = predict(xgb, as.matrix(diamonds_test_xgb[-7]))

    y_actual <- diamonds_test_xgb$log_price
    y_predicted <- xgb_pred

    test <- data.frame(cbind(y_actual, y_predicted))

    Let’s look at the test actual-predicted data cross-plot (back-transforming from log price to dollar price)

    xgb_scatter <- ggplot(test, aes(exp(y_actual), exp(y_predicted))) + geom_point(colour = 'black', alpha = 0.3) + geom_smooth(method = lm)

    XGBoost test actual-predicted data cross-plot

    Let’s apply the SVM algorithm to both training and test data using kernel = 'radial'

    svr <- svm(formula = log_price ~ .,
    data = diamonds_train,
    type = 'eps-regression',
    kernel = 'radial')

    svm(formula = log_price ~ ., data = diamonds_train, type = “eps-regression”, kernel = “radial”)

    SVM-Type: eps-regression
    SVM-Kernel: radial
    cost: 1
    gamma: 0.04761905
    epsilon: 0.1

    Number of Support Vectors: 13718

    Let’s switch to the Decision Tree algorithm

    tree <- rpart(formula = log_price ~ .,
    data = diamonds_train,
    method = 'anova',
    model = TRUE)

    Decision Tree summary

    Let’s consider the Random Forest approach

    rf <- randomForest(log_price ~ .,
    data = diamonds_train,
    ntree = 500)

    randomForest(formula = log_price ~ ., data = diamonds_train, ntree = 500)
    Type of random forest: regression
    Number of trees: 500
    No. of variables tried at each split: 2

          Mean of squared residuals: 0.01172975
                    % Var explained: 98.89

    Model Performance

    Let’s make predictions and compare model performance

    mlm_pred <- predict(mlm, diamonds_test)
    poly_pred <- predict(poly, diamonds_test)
    svr_pred <- predict(svr, diamonds_test)
    tree_pred <- predict(tree, diamonds_test)
    rf_pred <- predict(rf, diamonds_test)
    xgb_pred <- predict(xgb, as.matrix(diamonds_test_xgb[-7]))

    xgb_resid <- diamonds_test_xgb$log_price - xgb_pred

    resid <- diamonds_test %>%
    spread_residuals(mlm, poly, svr, tree, rf) %>%
    select(mlm, poly, svr, tree, rf) %>%
    rename_with( ~ paste0(.x, '_resid')) %>%
    cbind(xgb_resid)

    predictions <- diamonds_test %>%
    select(log_price) %>%
    cbind(mlm_pred) %>%
    cbind(poly_pred) %>%
    cbind(svr_pred) %>%
    cbind(tree_pred) %>%
    cbind(rf_pred) %>%
    cbind(xgb_pred) %>%
    cbind(resid)

    mean_log_price <- mean(diamonds_test$log_price)
    tss = sum((diamonds_test_xgb$log_price - mean_log_price)^2)

    square <- function(x) {x**2}
    r2 <- function(x) {1 - x/tss}

    r2_df <- resid %>%
    mutate_all(square) %>%
    summarize_all(sum) %>%
    mutate_all(r2) %>%
    gather(key = 'model', value = 'r2') %>%
    mutate(model = str_replace(model, '_resid', ''))

    model r2
    1 mlm 0.8916784
    2 poly 0.9813779
    3 svr 0.9862363
    4 tree 0.9103298
    5 rf 0.9882366
    6 xgb 0.9860271

    xgb_rmse = sqrt(mean(xgb_resid^2))

    Let’s look at the model performance plot

    r2_plot <- ggplot(r2_df, aes(x = model, y = r2, colour = model, fill = model)) + geom_bar(stat = 'identity')
    r2_plot + ggtitle('R-squared Values for each Model') + coord_cartesian(ylim = c(0.75, 1))

    Model performance in terms of R-squared values

    Let’s look at the predicted vs actual price of the test data

    diamonds_test_sample <- diamonds_test %>%
    left_join(predictions, by = 'log_price') %>%
    slice_sample(n = 1000)
    ggplot(diamonds_test_sample, aes(x = exp(log_price), y = exp(rf_pred), size = abs(rf_resid))) +
    geom_point(alpha = 0.1) + labs(title = 'Predicted vs Actual Cost of Diamonds in USD', x = 'Price', y = 'Predicted Price', size = 'Residuals')

    Predicted vs actual price in USD of the test data
    Linear fit (red line): Predicted vs actual price in USD of the test data

    Key Takeaways

    • The Random Forest model performed best according to the R2 value.
    • The critical factor driving price is the size or the carat weight of the diamond. 
    • Recall that we applied the log transformation to our long-tailed dollar variable before using it in regression.
    • You can use the above pre-trained model to get a sense of whether you are overpaying, and to establish a lasting business relationship with a jeweler you can trust.


    Best Day of My Life — Regression Analysis to Predict the Price of Diamonds

    Diamonds Price EDA and Prediction

    100% ML: Diamond price prediction using machine learning, python, SVM, KNN, Neural networks

    How machine learning can predict the price of the diamond you desire to buy

    Diamond Price Prediction with Machine Learning

    “Diamonds are forever” — price prediction using Machine Learning regression models and neural networks.

    Diamond Price Prediction

    Building Regression Models in R using Support Vector Regression

  • All Eyes on ETFs Sep ’22

    Is Now A Good Time To Buy ETFs?

    S&P 500: This Bear Market Is About To End (Technical Analysis).

    An exchange-traded fund (ETF) is a type of pooled investment security that operates much like a mutual fund. Typically, ETFs will track a particular index, sector, commodity, or other assets, but unlike mutual funds, ETFs can be purchased or sold on a stock exchange the same way that a regular stock can. An ETF can be structured to track anything from the price of an individual commodity to a large and diverse collection of securities. ETFs can even be structured to track specific investment strategies. The first ETF was the SPDR S&P 500 ETF (SPY), which tracks the S&P 500 Index.

    ETFs are versatile investment securities offering a wide range of benefits for investors. Whether you want to passively track a broad market index or invest in a niche area of the market, ETFs provide a low-cost, simple means of accessing a basket of securities in one fund.

    When you buy an ETF, you’re buying a basket of securities wrapped into one investment that trades on an exchange. Most ETFs passively track an underlying index, which is a representation of other securities or asset types, such as stocks, bonds, commodities, or currencies.

    Pros of ETFs:

    • Diversification: ETFs provide exposure to dozens, or even hundreds, of securities in just one basket.
    • Specialization: Certain specialty ETFs enable access to niche areas of the market.
    • Low cost: Because ETFs are passively managed, the operational costs are extremely low compared to actively managed portfolios. 
    • Tax-efficiency: For the ETFs that track a benchmark index, there is very little turnover.
    • Market orders: One of the stock-like aspects that can be a benefit for investors is the ability to place market orders.

    Cons of ETFs:

    • Trading costs: Many ETFs can be traded at zero commission and with no transaction fee. However, some brokers will charge commissions to trade certain ETFs on their platform.
    • Illiquidity: ETFs that have low trading volumes can have wide bid-ask spreads.
    • Settlement: As is the case with stocks, ETF settlement is T+2.

    Types of ETFs:

    • Equity ETFs such as S&P 500, Dividend, and International ETFs.
    • Fixed-Income or bond ETFs (Government Bonds, corporate bond, etc.)
    • Commodity ETFs (Gold, Silver, Oil, Copper, etc.)
    • Currency ETFs (Forex)
    • Real Estate ETFs (REIT) – high-yielding investments.
    • Specialty ETFs – semiconductor, etc.

    The Pup’s Weekend Dig – Does the oversold bounce continue?

    Sectors On Watch: $XBI – Biotech, $XLY – Consumer Discretionary, $XLV – Healthcare, $TAN – Solar, and $XLU – Utilities (Sep 11, 2022, 8:52 PM).

    • Wed, Sep 14, 2:08 PM 5 Reasons to Buy Commodities ETFs Right Now: After a decade of underperformance, commodities are experiencing a huge rally due to the Russia-Ukraine war, sky-high inflation, pent-up demand after the COVID-19 pandemic, widespread vaccination, chances of more oncoming COVID-19 antiviral pills and still-moderate rates.
    • Mon, 12 Sept at 14:04 Stocks are well off their June lows with the Dow up 7.43%, the S&P up 10.9%, the Nasdaq up 13.8%, and the Russell 2000 up 14.1%.
    • September 10, 2022 Right now, it’s easier than ever to own crypto thanks to the launch of Bitcoin ETFs. In total, 38 publicly traded companies are holding over $5 billion in Bitcoin. Between ETFs, countries like Nicaragua, and public and private companies, there is over $28 billion in Bitcoin being held on balance sheets like treasuries.
    • Thu, Sep 8, 2:05 PM. Stocks Soar As Oil Prices Fall: the Fed is widely expected to raise rates by another 75 basis points at their next FOMC meeting on September 20-21. In fact, even with rates climbing, Q3 GDP estimates are forecast at 1.4%. That’s a big improvement from Q2’s -0.6% and Q1’s -1.6%. In other news, MBA Mortgage Applications fell -0.8% w/w, with purchases down -0.7%, and refi’s down -1.1%.
    • This year’s first half performance (down nearly -21%), was strikingly similar to that of 1970 (also down -21%). And in both periods, high inflation was an issue. But in the second half of 1970, the S&P was up 27%.
    • Many dividend-paying stocks have held up well this year as the major indices entered a bear market. Contrary to conventional wisdom, the opportunity still exists for investors to create a reliable stream of income from the equity markets. Dividends are a fantastic way to generate sizeable returns from stocks.
    • While blockchain was put on the map for its use in the cryptocurrency market, it’s evolved into an indispensable business tool for processing all types of transactions and data transfers – from financial, to shipping, to health records, and more. It’s truly revolutionizing virtually all industries that rely on security, cost efficiency, and speed.



    Unlike mutual funds, an ETF trades like a common stock on a stock exchange. ETFs experience price changes throughout the day as they are bought and sold. ETFs typically have higher daily liquidity and lower fees than mutual fund shares, making them an attractive alternative for individual investors.

    Because it trades like a stock, an ETF does not have its net asset value (NAV) calculated once at the end of every day like a mutual fund does.
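    The NAV a mutual fund computes at the end of each day is simply net assets divided by shares outstanding; an ETF’s market price, by contrast, floats throughout the session. A minimal sketch with hypothetical figures:

```python
# Sketch of the end-of-day NAV calculation a mutual fund performs
# (all figures are hypothetical)
total_assets = 505_000_000.0      # market value of holdings plus cash
total_liabilities = 5_000_000.0   # accrued fees and other liabilities
shares_outstanding = 10_000_000

nav_per_share = (total_assets - total_liabilities) / shares_outstanding
print(nav_per_share)  # 50.0
```

    An ETF trading above this figure is said to be at a premium to NAV, and below it, at a discount.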

    TradingView SPY S&P 500 etf Head and Shoulders Chart Pattern


    ETFs market overview

    ETF Funds By Asset Class

    Use this ETF Screener to search for ETFs by Asset Class, including Equity ETFs, Commodity ETFs, Currency ETFs, Fixed Income ETFs, and more.


     [HYG] iShares iBoxx $ High Yield Corporate Bond ETF
source: AIolux


    Monday September 19, 2022 World Market Overview

    world stock markets
source: macroaxis
    Correlation Matchups

    The Correlation Coefficient is a useful tool for identifying correlated or non-correlated securities, which is essential in developing a diversified ETF portfolio. It tells us the relationship between two positions you hold in your ETF portfolio or are considering acquiring. Over a given time period, two securities move together when the Correlation Coefficient is positive. Conversely, two ETF assets move in opposite directions when the Correlation Coefficient is negative. Determining your positions’ relationship to each other is valuable for analyzing and projecting your portfolio’s future expected return and risk.

    Overexposure to a single market brings diversification risk into a portfolio, leaving it vulnerable to losses in that economy and underexposed to markets in other parts of the world. For the same reason, if you are currently managing a portfolio composed of equities from multiple ETFs, you don’t want your positions to be highly correlated, even at the expense of accepting lower expected returns. Generally speaking, low correlation across different ETFs is the main idea behind global portfolio diversification, and without it, there is no benefit to rebalancing internationally exposed ETF portfolios.

    market correlations
source: macroaxis

    S&P 500 Etf Profile

    XLU market performance
source: macroaxis
    S&P 500 Utilities (XLU) Summary:

    S&P 500 Utilities is selling for under 74.47 as of the 19th of September 2022; that is a 0.37 percent decrease since the beginning of the trading day. The ETF’s last reported lowest price was 74.34. S&P 500 Utilities has a very small chance of experiencing financial distress in the next few years and performed well during the last 90 days. Equity ratings for S&P 500 Utilities are calculated daily based on our scoring framework. The performance scores are derived for the period starting the 21st of June 2022 and ending today, the 19th of September 2022. Click here to learn more.

    In seeking to track the performance of the index, the fund employs a replication strategy. S&P 500 is traded on NYSEArca Exchange in the United States. More on S&P 500 Utilities

    Explore More

    Exchange Traded Fund (ETF): Explained

    What Is an Exchange-Traded Fund (ETF)?

    Trending ETF Content

    10 ETF Concerns That Investors Shouldn’t Overlook

    Swissquote Bank Europe – Simply Invest in ETFs Online

    Related Content


    Macroaxis Wealth Optimization

    Macroaxis AI Investment Opportunity

    Upswing Resilient Investor Guide

    Stocks on Watch Tomorrow

    Investment Risk Management Study


    Inflation-Resistant Stocks to Buy

    Track All Markets with TradingView

    S&P 500 Algorithmic Trading with FBProphet

    Are Blue-Chips Perfect for This Bear Market?

    SeekingAlpha Risk/Reward July Rundown

    Gulf’s Oil Price Web Scraping in R


    Commodity trends, cryptocurrency, AI optimization, and more below:

    market trends Q1 2022
    Crypto market (MC) bln USD 2022-2023
    Crypto market (MC) trln USD 2016-2030
    Blockchain, crypto markets
    ML/AI cryptocurrency prediction
  • US Real Estate – Harnessing the Power of AI

    This is the continuation of our recent use-case series dedicated to real estate (RE) monitoring, trend analysis, and forecasting. In this series, the focus is on US house prices, invoking supervised machine learning (ML) and artificial intelligence (AI) algorithms available in Python, as it is the language with the largest variety of libraries on the subject (Scikit-learn, TensorFlow, PyTorch, Keras, Spark MLlib, etc.). Our objective is to incorporate these algorithms into the real estate decision-making process in a supporting role. Recall that decision-making is a critical part of a typical real estate property valuation, which aims to quantify the market value of a property according to its qualitative characteristics. Since visualization is a prominent feature of this kind of problem, ML/AI ETL pipelines are commonly used to support RE decision analysis. Within the context of testing and validation strategies, it is important to understand the training errors and limitations of ML/AI due to its inherent pattern-recognizing nature.

    ML/AI - the soul of tech

    ML/AI is defined as follows: a program learns from experience E with respect to a task T and a performance measure P if its performance on T, as measured by P, improves with E. ML is a part of AI. ML algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. ML is an important subset of data science. Through the use of statistical methods, data science algorithms are trained to make classifications or predictions, uncovering key insights within data mining projects. These insights subsequently drive decision-making within applications and businesses, ideally impacting key BI/fintech metrics.

    Bottom Line: ML is the science of getting computers to learn, without being explicitly programmed.

    Housing Crunch

    • The August 2022 NAHB measure of homebuilder confidence fell below 50 for the first time since May 2020. Housing starts for July dropped 9.6%, more than expected, (although permits dropped less than forecast). And most recently the NAR reported that July existing home sales fell 5.9%, more than anticipated.
    • The August 2022 housing data did much to confirm a slowdown sought by the Federal Reserve. Along with what may have been peak inflation last week, cooler housing data is another piece in the puzzle as the FOMC tightens conditions.
    • “Existing home sales have now fallen for 6 months in a row, and are 26% lower than the January 2022 peak,” Pantheon Macro Economist Ian Shepherdson said. “But the bottom is still some way off, given the degree to which demand has been crushed by rising rates; the required monthly mortgage payment for a new purchaser of an existing single-family home is no longer rising, but it was still up by 51% year-over-year in July 2022.”
    • “Home sales likely have further to fall,” Odeta Kushi, deputy chief economist at First American Financial, tweeted. “Mortgage applications so far in August 2022 point to another decline in existing-home sales. This month’s number of 4.81 million puts us at about 2014 sales level.”
    • “Fed officials pay particularly close attention to the housing market and are monitoring how higher mortgage rates are impacting home sales and housing prices in order to gauge how tighter monetary policy is affecting the broader economy,” Wells Fargo economists wrote.

    This post provides an optimized solution to the problem of unclear RE market changes by allowing brokerages and clients to have access to an ML-backed RE solution that draws upon different housing data sources that are updated to close recency.


    The post is divided into the following sections: Business Case (see above), supervised ML Methodology, IDE and learning Prerequisites, ETL Python Workflow & Pipeline, multi-scale RE Use Cases using comprehensive open-source housing datasets (US states and beyond), and Conclusions. Sections contain related links listed in References. Due to the scale of the case studies, the entire ML project is split into several Jupyter notebooks: EDA and data cleaning, preprocessing and feature engineering, and model tuning and insights. Each input dataset is limited in scope both in terms of the time frame captured and location. Each training model is also specific to houses in a city or county and may not be as accurate when applied to data from another US state, where house prices may be affected by different factors. The aim of the specific training models is not to give a perfect prediction but to act as a guideline to inform RE decisions. In reality, house prices may be difficult to predict, as they are also affected by buyers’ psychology, the economic climate, and other factors not included in the dataset.


    We consider supervised ML techniques (see charts below) when we are given a (training) dataset and already know what our correct output should look like, reflecting the idea that there is an intrinsic relationship between the input and output data. In this study, house price prediction is regarded as a regression problem, meaning that we are trying to map input variables or features (the size of houses, area, etc.) to a continuous function (house price).

    ML ETL Pipeline Flowchart
    ML/AI Flowchart: Input, Exploratory Data Analysis (EDA), Training, Testing, Validation, Prediction, Inference, Deployment, and Tuning or Optimization.

    The supervised ML algorithm consists of the following steps:

    • Create labeled data (label is the true answer for a given input, the house price $ is the label)
    • Perform model training, testing and cross-validation
    • Deploy trained models
    • Evaluate and tune deployed models
    • Avoid creating high bias/variance

    Model training and evaluation are performed using chosen metrics and objectives. For example, the loss metric can be the sum of squares between observed and predicted house prices.
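    As a minimal sketch of such a loss metric, the sum of squares (and the derived MSE/RMSE) can be computed with NumPy; the prices below are made-up illustration values, not data from this study:

```python
import numpy as np

# Hypothetical observed and predicted house prices (illustration only)
observed = np.array([250_000.0, 310_000.0, 180_000.0, 420_000.0])
predicted = np.array([240_000.0, 325_000.0, 175_000.0, 400_000.0])

# Sum-of-squares loss between observed and predicted prices
sse = np.sum((observed - predicted) ** 2)

# The same loss averaged (MSE) and expressed in price units (RMSE)
mse = sse / len(observed)
rmse = np.sqrt(mse)
```

    The RMSE form is usually the most interpretable, since it is in the same units (dollars) as the target.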

    ML/AI Assisted Real Estate Chart
    House price prediction RCA chart: under-fit, best-fit, and over-fit KPIs.
    Three-step ML/AI house price prediction methodology

    The above three-step ML methodology is a way to use regression algorithms to derive predictive insights from housing data and make repeated RE decisions. Qualities of good data (the output of EDA): coverage, cleanliness, and completeness.

    The broader your data’s coverage, the more robust your training model will be. Dirty data can make ML hard in terms of goodness-of-fit. Incomplete data can limit performance.

    Here is the list of 10 popular ML regression algorithms:

    1. Linear Regression
    2. Ridge Regression
    3. Neural Network Regression 
    4. Lasso Regression 
    5. Decision Tree Regression 
    6. Random Forest
    7. KNN Model 
    8. Support Vector Machines (SVM)
    9. Gaussian Regression
    10. Polynomial Regression
    • Conventionally, the Exploratory Data Analysis (EDA) of the dataframe df is carried out using histograms df.plot(kind='hist') and pairplots sns.pairplot().
    • The Feature Engineering (FE) phase consists of the following steps: Log Transform np.log() or Square Root Transform np.sqrt(), Feature Importance analysis coef_.ravel(), and Feature Scaling using StandardScaler() (most common option), RobustScaler() (less widely used option), and MinMaxScaler() (least robust choice).
    • The typical regression algorithm is the linear/polynomial regression with/without regularization (Lasso, Ridge, etc.) and/or Hyper-Parameter Optimization (HPO).
    • The Model Evaluation phase may represent (optionally) the following comparisons: Ridge vs Lasso and Normal vs Polynomial.
    • The cross-validation metrics utilities can be used to compute some useful statistics of the prediction performance. Some statistics computed are mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percent error (MAPE), and median absolute percent error (MDAPE). 
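    A minimal sketch of these prediction-performance statistics using plain NumPy (the helper name regression_metrics and the sample values are ours, for illustration only):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, MAPE, and MDAPE; a plain-NumPy sketch."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    pct = np.abs(err) / y_true  # assumes strictly positive targets (prices)
    return {
        "MSE": np.mean(err ** 2),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.mean(np.abs(err)),
        "MAPE": 100 * np.mean(pct),
        "MDAPE": 100 * np.median(pct),
    }

m = regression_metrics([100.0, 200.0, 400.0], [110.0, 190.0, 380.0])
```

    In practice these would be computed per cross-validation fold and averaged; scikit-learn exposes equivalents in sklearn.metrics.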


    We begin by setting up the Python-based IDE using the latest version of Anaconda, which contains the Jupyter Notebook coupled with (optionally) Datapane. The latter allows you to share an HTML link in which you can lay out your analysis as a report. When started, the Jupyter Notebook App can access only files within its start-up folder (including any sub-folder). No configuration is necessary if you place your notebooks in your home folder or subfolders. Otherwise, you need to choose a Jupyter Notebook App start-up folder which will contain all the notebooks. Read more here.

    Check ML learning prerequisites here.


    The general workflow to create the model will be as follows:

    • Data handling (loading, cleaning, editing or preprocessing)
    • Exploratory Data Analysis (EDA)/Feature Engineering (FE)

    We use Feature Engineering to deal with missing values, outliers, and categorical features

    • Model training & hyperparameter tuning

    We train various ML models on the training data and, after tuning all the hyperparameters, evaluate them on the test data

    • Model testing, QC diagnostics, evaluation and final deployment
    • Prediction, result interpretation, visualization and export.

    Below is the more detailed sequence of steps:

    • Import Libraries and Loading Dataset

    Example: use Python with opendatasets to load the data from the Kaggle platform, pandas to read and manipulate the data, seaborn, matplotlib, plotly, and geopandas for data visualizations, and sklearn for data preprocessing and training algorithms.

    • EDA & Data Visualization/Overview

    Use a variety of data visualization tools to analyze the tabular data and discover the data cleaning procedures needed to fix it (e.g., looking for missing values and outliers, removing unnecessary values or columns, dropping duplicate rows, and fixing errors such as human-made mistakes introduced when recording).

    • Feature Engineering & Selection to improve a model’s predictive performance

    Use feature selection techniques such as Feature Importance (using ML algorithms such as Lasso and Random Forest), Correlation Matrix with Heatmap, or Univariate Selection. For example, we may choose the Heatmap correlation matrix technique to select features with correlations higher than zero.
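    A hedged sketch of the correlation-threshold idea on a small synthetic frame (the column names, threshold value, and data are our assumptions, not the actual housing dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a housing table (columns are assumptions)
rng = np.random.default_rng(0)
size = rng.uniform(50, 250, 200)
df = pd.DataFrame({
    "size_sqm": size,
    "rooms": (size / 30).round(),          # strongly tied to size
    "noise": rng.normal(size=200),         # unrelated to price
    "price": size * 1_000 + rng.normal(0, 5_000, 200),
})

# Keep features whose absolute correlation with the target exceeds a threshold
corr_with_price = df.corr()["price"].drop("price")
selected = corr_with_price[corr_with_price.abs() > 0.5].index.tolist()
```

    Here size_sqm and rooms survive the cut while the unrelated noise column is dropped; in practice the threshold is a tuning choice.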

    • Data preparation/preprocessing using features scaling, encoding, and imputing

    For example, the function preprocess_data(data) consists of remove_duplicates(), check_missing(), resolve_missing(), and change_types(); it takes in raw data and converts it into data that is ready for making predictions. Here are the steps to be done: 

    Identify the input and target column(s) for training the model.

    Identify numeric and categorical input columns.

    Impute (fill) missing values in numeric columns

    Scale values in numeric columns to a (0,1) range.

    Encode categorical data into one-hot vectors.

    Split the dataset into training and validation sets.
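    The steps above can be sketched with scikit-learn’s ColumnTransformer on a tiny made-up table (the column names and values are assumptions, not the actual data):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data standing in for the raw housing table (an assumption)
raw = pd.DataFrame({
    "area": [120.0, np.nan, 90.0, 150.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
    "price": [200_000, 310_000, 150_000, 420_000],
})

# Identify the target column and the input columns
target = raw["price"]
inputs = raw.drop(columns="price")

# Identify numeric and categorical input columns
numeric_cols = inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = inputs.select_dtypes(exclude=np.number).columns.tolist()

# Impute + scale numeric columns to (0, 1); one-hot encode categoricals
preprocess = ColumnTransformer(
    [("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                       ("scale", MinMaxScaler())]), numeric_cols),
     ("cat", OneHotEncoder(), categorical_cols)],
    sparse_threshold=0.0,  # force a dense array for easy inspection
)
X = preprocess.fit_transform(inputs)

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, target, test_size=0.25, random_state=0)
```

    The same fitted transformer would then be reused (transform only, no refit) on any held-out data to avoid leakage.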

    • Robust model training and hyperparameter tuning

    For example, we may decide to train the data on the scikit-learn models Random Forest, Gradient Boosting, and ExtraTree, along with LightGBM and CatBoost.

    The predictions from the model can be evaluated using a loss function like the Root Mean Squared Error (RMSE).

    • We can use the trained model to generate predictions for the training, testing and validation inputs and evaluate them by calculating the R-squared in each case. The final score can be the model score and the training/testing accuracy.
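    For instance, a minimal sketch of this R-squared evaluation on synthetic data (not the actual housing dataset; the single "living area" feature is an assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic feature/price data standing in for a housing table
rng = np.random.default_rng(42)
X = rng.uniform(50, 250, size=(300, 1))           # e.g. living area
y = 1_000 * X[:, 0] + rng.normal(0, 5_000, 300)   # price with noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# R-squared on each split; model.score() is equivalent to r2_score here
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = model.score(X_test, y_test)
```

    Comparing the train and test R-squared values is also a quick check for over-fitting: a large gap between the two is a warning sign.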

    Case 1: US

    Let’s set the working directory YOURPATH

    import os
    os.chdir('YOURPATH')
    os.getcwd()

    and import the following libraries

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    from sklearn.model_selection import train_test_split

    from sklearn.linear_model import LinearRegression

    Let’s read the Kaggle dataset

    houseDF = pd.read_csv('USA_Housing.csv')

    and check the file content

    houseDF.shape

    (5000, 7)

    houseDF.columns

    Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
           'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
          dtype='object')

    houseDF.dtypes

    Avg. Area Income                float64
    Avg. Area House Age             float64
    Avg. Area Number of Rooms       float64
    Avg. Area Number of Bedrooms    float64
    Area Population                 float64
    Price                           float64
    Address                          object
    dtype: object

    The info is

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 5000 entries, 0 to 4999
    Data columns (total 7 columns):
     #   Column                        Non-Null Count  Dtype  
    ---  ------                        --------------  -----  
     0   Avg. Area Income              5000 non-null   float64
     1   Avg. Area House Age           5000 non-null   float64
     2   Avg. Area Number of Rooms     5000 non-null   float64
     3   Avg. Area Number of Bedrooms  5000 non-null   float64
     4   Area Population               5000 non-null   float64
     5   Price                         5000 non-null   float64
     6   Address                       5000 non-null   object 
    dtypes: float64(6), object(1)
    memory usage: 273.6+ KB

    and the first 5 rows are

    houseDF.head()

    Input data table

    while the input data descriptive statistics are

    houseDF.describe()

    Input data descriptive statistics

    The input data pairplot is

    sns.pairplot(houseDF)

    and the correlation heatmap is

    swarm_plot=sns.heatmap(houseDF.corr(), annot=True)
    fig = swarm_plot.get_figure()

    Correlation heatmap

    Let’s separate features and target variables

    X = houseDF[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]

    Y = houseDF['Price']

    Let’s split the data into train and test subsets in a 70:30 ratio, respectively,

    from sklearn.model_selection import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

    Let’s apply LinearRegression() to the training data

    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()
    lm.fit(X_train, Y_train)

    Let’s make predictions

    predictions = lm.predict(X_test)

    and plot the result

    plt.scatter(Y_test, predictions)
    plt.xlabel('Observed Test Data')
    plt.ylabel('Predicted Test Data')

    Linear Regression applied to test data

    Let’s compare it with the xgboost algorithm

    import xgboost as xg
    reg = xg.XGBRegressor(objective='reg:linear',
                          n_estimators=1000, seed=123)
    reg.fit(X_train, Y_train)
    predictions = reg.predict(X_test)

    XGBoost Regression applied to test data

    We can see that LinearRegression() yields a more accurate prediction than XGBRegressor(). The same considerations apply to the other sklearn algorithms (SVR, TweedieRegressor, RandomForestRegressor, etc.).
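    One hedged way to run such side-by-side comparisons is to loop over several regressors and report each model’s test RMSE. The sketch below uses synthetic linear data rather than USA_Housing, so the ranking here merely illustrates the mechanics:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic linearly generated data (an assumption, not the Kaggle set)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.1, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Fit each candidate model and record its test RMSE
rmse = {}
for name, model in [("linear", LinearRegression()),
                    ("svr", SVR()),
                    ("forest", RandomForestRegressor(n_estimators=50,
                                                     random_state=1))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = np.sqrt(mean_squared_error(y_test, pred))
```

    On genuinely linear data the linear model wins, which mirrors the result above; on more complex data the ranking can easily flip, so this comparison should be rerun per dataset.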

    Case 2: CA

    Let’s look at the median house prices for California districts derived from the 1990 census. This is the dataset used in the second chapter of Aurélien Géron’s book ‘Hands-On Machine Learning with Scikit-Learn and TensorFlow’. The ultimate goal of end-to-end ML is to build a RE prediction engine capable of minimizing the RMSE (Root Mean Square Error), MAE (Mean Absolute Error), or any other metric of interest.

    Let’s set the working directory YOURPATH

    import os
    os.chdir('YOURPATH')
    os.getcwd()

    and import libraries

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np

    Let’s read the data

    housing_data = pd.read_csv("housing.csv")

    Input table data from CA

    representing 20640 rows × 10 columns.

    The data info is

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 20640 entries, 0 to 20639
    Data columns (total 10 columns):
     #   Column              Non-Null Count  Dtype  
    ---  ------              --------------  -----  
     0   longitude           20640 non-null  float64
     1   latitude            20640 non-null  float64
     2   housing_median_age  20640 non-null  float64
     3   total_rooms         20640 non-null  float64
     4   total_bedrooms      20433 non-null  float64
     5   population          20640 non-null  float64
     6   households          20640 non-null  float64
     7   median_income       20640 non-null  float64
     8   median_house_value  20640 non-null  float64
     9   ocean_proximity     20640 non-null  object 
    dtypes: float64(9), object(1)
    memory usage: 1.6+ MB

    Let’s plot the ocean proximity bar chart

    housing_data["ocean_proximity"].value_counts().plot(kind="bar")

    ocean proximity bar chart

    We can see that “ISLAND” value_counts is negligible compared to “1H OCEAN”.

    The descriptive statistics of input data are

    housing_data.describe()

    Descriptive statistics of input data table

    Let’s plot the histogram of median income

    housing_data["median_income"].hist()

    Histogram of median income

    Let’s introduce 5 categories of median income

    housing_data["income_cat"] = pd.cut(housing_data["median_income"],
                                        bins=[0, 1.5, 3.0, 4.5, 6, np.inf],
                                        labels=[1, 2, 3, 4, 5])
    housing_data["income_cat"].value_counts()
    3    7236
    2    6581
    4    3639
    5    2362
    1     822
    Name: income_cat, dtype: int64

    and plot histograms of these categories

    housing_data["income_cat"].hist()

    Histograms of 5 median income categories

    Let’s introduce the target variable median_house_value and the model features

    y = housing_data["median_house_value"]
    X = housing_data.drop("median_house_value", axis=1)

    The model features table X

    with 20640 rows × 10 columns.

    Let’s split the data into 67% and 33% for training and testing, respectively

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    Let’s select StratifiedShuffleSplit, which provides train/test indices to split data into train/test sets. This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

    from sklearn.model_selection import StratifiedShuffleSplit
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in split.split(housing_data, housing_data["income_cat"]):
        strat_train_set = housing_data.loc[train_index]
        strat_test_set = housing_data.loc[test_index]

    Let’s check strat_test_set value count in terms of income_cat as a fraction

    strat_test_set["income_cat"].value_counts() / len(strat_test_set)

    3    0.350533
    2    0.318798
    4    0.176357
    5    0.114341
    1    0.039971
    Name: income_cat, dtype: float64

    We can see only 4% of strat_test_set belongs to income_cat=1 as compared to 35% of strat_test_set that belongs to income_cat=3.
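    The stratification property can be checked directly: the test-set category fractions should track the full-dataset fractions almost exactly. Below is a sketch on synthetic income categories (the class proportions are our assumptions, chosen to mimic the counts above):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced income categories (proportions are assumptions)
rng = np.random.default_rng(0)
cats = rng.choice([1, 2, 3, 4, 5], size=10_000,
                  p=[0.04, 0.32, 0.35, 0.18, 0.11])
df = pd.DataFrame({"income_cat": cats})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df["income_cat"]))

# Compare per-category fractions: full dataset vs stratified test set
full_frac = df["income_cat"].value_counts(normalize=True)
strat_frac = df.loc[test_idx, "income_cat"].value_counts(normalize=True)
max_gap = (full_frac - strat_frac).abs().max()
```

    A plain random split would show gaps an order of magnitude larger for the rare category, which is precisely why stratified sampling is preferred here.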

    Let’s plot the histograms of training data

    Histograms of training data

    Let’s plot the geo-location map population and housing median age vs median house value

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["population"]/100, label="population", figsize=(10,7),
                 c="median_house_value", cmap="jet", colorbar=True)

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["housing_median_age"], label="housing_median_age", figsize=(10,7),
                 c="median_house_value", cmap="jet", colorbar=True)

    Geo-location map population vs median house value
    Geo-location map housing median age vs median house value

    Let’s look at the housing correlation matrix


    Housing correlation matrix

    and plot the corresponding annotated heatmap

    import seaborn as sns
    corr = housing.corr()
    mask = np.triu(np.ones_like(corr,dtype=bool))

    f, ax = plt.subplots(figsize= (11, 9))
    cmap = sns.diverging_palette(230, 20, as_cmap = True)
    sns_plot = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot=True,
                           square=True, linewidths=0.5, cbar_kws={"shrink": .5})
    fig = sns_plot.get_figure()

    Housing correlation matrix heatmap

    We can see that median_income is the most dominant factor affecting median_house_value.

    Let’s check rows for missing values

    sample_incomplete_rows= housing[housing.isnull().any(axis=1)].head()

    Sample table of rows with missing values

    while dropping the rows with missing values leaves only the table header


    longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity

    Let’s fill NaN with median values

    median = housing['total_bedrooms'].median()
    sample_incomplete_rows['total_bedrooms'].fillna(median, inplace=True)

    Incomplete rows filled in with median values

    Let’s apply the SimpleImputer method with strategy =’median’

    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy='median')
    housing_num = housing.select_dtypes(include=(np.number))

    housing data after applying SimpleImputer method with strategy ='median'

    imputer.fit(housing_num)
    X = imputer.transform(housing_num)
    housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

    Result of applying the imputer transform to the housing data




    Let’s encode categorical variables to convert non-numerical data into numerical data to create inferences

    housing_cat = housing[['ocean_proximity']]

    Ocean_proximity table for encoding

    Let’s apply OrdinalEncoder to this variable

    from sklearn.preprocessing import OrdinalEncoder

    ordinal_encoder = OrdinalEncoder()
    housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

    Let’s apply OneHotEncoder to housing_cat

    from sklearn.preprocessing import OneHotEncoder
    cat_encoder = OneHotEncoder(sparse=False)
    housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

    array([[0., 1., 0., 0., 0.],
           [0., 0., 0., 0., 1.],
           [0., 1., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [0., 1., 0., 0., 0.]])

    Let’s define the feature_engineering function

    def feature_engineering(data):
        data['bedrooms_per_household'] = data['total_bedrooms']/data['households']
        data['population_per_households'] = data['population']/data['households']
        data['rooms_per_households'] = data['total_rooms']/data['households']
        return data

    and apply this function to the housing data

    housing_feature_engineered = feature_engineering(housing_num)

    The feature engineering function applied to the housing data

    Let’s scale our data

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    housing_scaled = scaler.fit_transform(housing_feature_engineered)

    array([[-0.94135046,  1.34743822,  0.02756357, ...,  0.05896205,
             0.00622264,  0.01739526],
           [ 1.17178212, -1.19243966, -1.72201763, ...,  0.02830837,
            -0.04081077,  0.56925554],
           [ 0.26758118, -0.1259716 ,  1.22045984, ..., -0.1286475 ,
            -0.07537122, -0.01802432],
           [-1.5707942 ,  1.31001828,  1.53856552, ..., -0.26257303,
            -0.03743619, -0.5092404 ],
           [-1.56080303,  1.2492109 , -1.1653327 , ...,  0.11548226,
            -0.05915604,  0.32814891],
           [-1.28105026,  2.02567448, -0.13148926, ...,  0.05505203,
             0.00657083,  0.01407228]])

    Let’s create the ML input data

    ml_input_data = np.hstack([housing_cat_1hot, housing_scaled])

    array([[ 0.        ,  1.        ,  0.        , ...,  0.05896205,
             0.00622264,  0.01739526],
           [ 0.        ,  0.        ,  0.        , ...,  0.02830837,
            -0.04081077,  0.56925554],
           [ 0.        ,  1.        ,  0.        , ..., -0.1286475 ,
            -0.07537122, -0.01802432],
           [ 1.        ,  0.        ,  0.        , ..., -0.26257303,
            -0.03743619, -0.5092404 ],
           [ 1.        ,  0.        ,  0.        , ...,  0.11548226,
            -0.05915604,  0.32814891],
           [ 0.        ,  1.        ,  0.        , ...,  0.05505203,
             0.00657083,  0.01407228]])

    Let’s define the entire ETL pipeline to be applied to the housing data

    housing = strat_train_set.drop(“median_house_value”, axis=1)
    housing_labels = strat_train_set[“median_house_value”].copy()

    def data_transformations(data):
        ### Separate Labels if they Exist ###
        if "median_house_value" in data.columns:
            labels = data["median_house_value"]
            data = data.drop("median_house_value", axis=1)
        else:
            labels = None
        ### Feature Engineering ###
        feature_engineered_data = feature_engineering(data)
        features = list(feature_engineered_data.columns) # Creating a list of our features for future use
        ### Imputing Data ###
        from sklearn.impute import SimpleImputer
        imputer = SimpleImputer(strategy="median")
        housing_num = feature_engineered_data.select_dtypes(include=[np.number])
        imputed = imputer.fit_transform(housing_num)
        ### Encoding Categorical Data ###
        housing_cat = feature_engineered_data.select_dtypes(exclude=[np.number])
        from sklearn.preprocessing import OneHotEncoder
        cat_encoder = OneHotEncoder(sparse=False)
        housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
        features = features + cat_encoder.categories_[0].tolist()
        features.remove("ocean_proximity") # We're encoding this variable, so we don't need it in our list anymore
        ### Scaling Numerical Data ###
        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        housing_scaled = scaler.fit_transform(imputed)
        ### Concatenating all Data ###
        output = np.hstack([housing_scaled, housing_cat_1hot])
        return output, labels, features



    Let’s select and train the model

    train_data, train_labels, features = data_transformations(strat_train_set)

    array([[-0.94135046,  1.34743822,  0.02756357, ...,  0.        ,
             0.        ,  0.        ],
           [ 1.17178212, -1.19243966, -1.72201763, ...,  0.        ,
             0.        ,  1.        ],
           [ 0.26758118, -0.1259716 ,  1.22045984, ...,  0.        ,
             0.        ,  0.        ],
           [-1.5707942 ,  1.31001828,  1.53856552, ...,  0.        ,
             0.        ,  0.        ],
           [-1.56080303,  1.2492109 , -1.1653327 , ...,  0.        ,
             0.        ,  0.        ],
           [-1.28105026,  2.02567448, -0.13148926, ...,  0.        ,
             0.        ,  0.        ]])

    Let’s test the model

    test_data, test_labels, features = data_transformations(strat_test_set)

    array([[ 0.57507019, -0.69657252,  0.0329564 , ...,  0.        ,
             0.        ,  0.        ],
           [-0.43480141, -0.33466769, -0.36298077, ...,  0.        ,
             0.        ,  0.        ],
           [ 0.54522177, -0.63547171,  0.58726843, ...,  0.        ,
             0.        ,  0.        ],
           [-0.08656982, -0.54617051,  1.14158047, ...,  0.        ,
             0.        ,  0.        ],
           [ 0.81385757, -0.92687559,  0.11214383, ...,  0.        ,
             0.        ,  0.        ],
           [ 0.49049967, -0.66367208,  0.58726843, ...,  0.        ,
             0.        ,  0.        ]])

    We have got the train labels


    12655     72100.0
    15502    279600.0
    2908      82700.0
    14053    112500.0
    20496    238300.0
    15174    268500.0
    12661     90400.0
    19263    140400.0
    19140    258100.0
    19773     62700.0
    Name: median_house_value, Length: 16512, dtype: float64

    and the features


     '<1H OCEAN',
     'NEAR BAY',
     'NEAR OCEAN']

    Following Case 1 (see above), let’s apply the Linear Regression

    from sklearn.linear_model import LinearRegression
    lin_reg = LinearRegression()
    lin_reg.fit(train_data, train_labels)


    Let’s compare original and predicted values

    original_values = test_labels[:5]
    predicted_values = lin_reg.predict(test_data[:5])
    comparison_dataframe = pd.DataFrame(data={"Original Values": original_values, "Predicted Values": predicted_values})

    comparison_dataframe["Differences"] = comparison_dataframe["Original Values"] - comparison_dataframe["Predicted Values"]


    Difference between Original and Predicted Values

    Let’s check the MSE metric

    from sklearn.metrics import mean_squared_error

    lin_mse = mean_squared_error(original_values,predicted_values)
    lin_rmse = np.sqrt(lin_mse)


    Let’s check the MAE metric

    from sklearn.metrics import mean_absolute_error

    lin_mae = mean_absolute_error(original_values, predicted_values)


    Let’s apply the Decision Tree algorithm

    from sklearn.tree import DecisionTreeRegressor
    tree_reg = DecisionTreeRegressor(random_state=42)
    tree_reg.fit(train_data, train_labels)


    train_predictions = tree_reg.predict(train_data)
    tree_mse = mean_squared_error(train_labels, train_predictions)
    tree_rmse = np.sqrt(tree_mse)


    Let’s compute the cross-validation score

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(tree_reg, train_data, train_labels, scoring="neg_mean_squared_error", cv=10)
    tree_rmse_scores = np.sqrt(-scores)

    def display_scores(scores):
        print("Scores:", scores)
        print("Mean:", scores.mean())
        print("Standard deviation:", scores.std())


    Scores: [70819.83674558 70585.09139446 69861.50467212 73083.46385442
     66246.62162221 74093.76616605 77298.21284135 70265.05374821
     70413.46481703 72693.02785945]
    Mean: 71536.00437208822
    Standard deviation: 2802.723447985299

    Let’s apply the Random Forest Regressor

    from sklearn.ensemble import RandomForestRegressor

    forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
    forest_reg.fit(train_data, train_labels)


    train_predictions = forest_reg.predict(train_data)
    forest_mse = mean_squared_error(train_labels, train_predictions)
    forest_rmse = np.sqrt(forest_mse)


    Let’s select the corresponding cross_val_score

    from sklearn.model_selection import cross_val_score

    forest_scores = cross_val_score(forest_reg, train_data, train_labels,
                                    scoring="neg_mean_squared_error", cv=10)
    forest_rmse_scores = np.sqrt(-forest_scores)

    Scores: [51667.47890087 49581.77674843 46845.77133522 52127.48739086
     48082.89639917 51050.84681689 53027.94987383 50218.59780997
     48609.03966622 54669.97457167]
    Mean: 50588.18195131385
    Standard deviation: 2273.9929947683154

    Let’s try 12 (3×4) combinations of hyperparameters and then 6 (2×3) combinations with bootstrap set as False using GridSearchCV

    from sklearn.model_selection import GridSearchCV

    param_grid = [
        # try 12 (3×4) combinations of hyperparameters
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
        # then try 6 (2×3) combinations with bootstrap set as False
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]

    forest_reg = RandomForestRegressor(random_state=42)

    Let’s train across 5 folds, that’s a total of (12+6)*5=90 rounds of training

    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error',
                               return_train_score=True)
    grid_search.fit(train_data, train_labels)

    GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                 param_grid=[{'max_features': [2, 4, 6, 8],
                              'n_estimators': [3, 10, 30]},
                             {'bootstrap': [False], 'max_features': [2, 3, 4],
                              'n_estimators': [3, 10]}],
                 return_train_score=True, scoring='neg_mean_squared_error')

    Let’s see the best estimator

    grid_search.best_estimator_

    RandomForestRegressor(max_features=6, n_estimators=30, random_state=42)

    The results of grid search cv are as follows

    cvres = grid_search.cv_results_
    for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
        print(np.sqrt(-mean_score), params)

    64441.33583774864 {'max_features': 2, 'n_estimators': 3}
    55010.78729315784 {'max_features': 2, 'n_estimators': 10}
    52756.90743676946 {'max_features': 2, 'n_estimators': 30}
    60419.95105027927 {'max_features': 4, 'n_estimators': 3}
    52548.760723492225 {'max_features': 4, 'n_estimators': 10}
    50475.03023921768 {'max_features': 4, 'n_estimators': 30}
    58658.87553276854 {'max_features': 6, 'n_estimators': 3}
    51688.259845013825 {'max_features': 6, 'n_estimators': 10}
    49602.83903888296 {'max_features': 6, 'n_estimators': 30}
    57764.545176887186 {'max_features': 8, 'n_estimators': 3}
    51906.606161086886 {'max_features': 8, 'n_estimators': 10}
    49851.77165193962 {'max_features': 8, 'n_estimators': 30}
    63137.43571927858 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
    54419.40582754731 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
    58195.29390064867 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
    52168.74519952844 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
    59520.17602710436 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
    51828.25647287002 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

    The corresponding dataframe is


    Grid search CV results

    representing 18 rows × 23 columns.

    Let’s compare it to RandomizedSearchCV

    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import randint

    param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

    forest_reg = RandomForestRegressor(random_state=42)
    rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                    n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                    random_state=42)
    rnd_search.fit(train_data, train_labels)

    RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                       param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001669BCE8220>,
                                            'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001669BCE0640>},
                       random_state=42, scoring='neg_mean_squared_error')

    The results are as follows

    cvres = rnd_search.cv_results_
    for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
        print(np.sqrt(-mean_score), params)

    48881.00597871309 {'max_features': 7, 'n_estimators': 180}
    51634.61963021687 {'max_features': 5, 'n_estimators': 15}
    50312.55245794906 {'max_features': 3, 'n_estimators': 72}
    50952.54821857023 {'max_features': 5, 'n_estimators': 21}
    49063.34454115586 {'max_features': 7, 'n_estimators': 122}
    50317.63324666772 {'max_features': 3, 'n_estimators': 75}
    50173.504527094505 {'max_features': 3, 'n_estimators': 88}
    49248.29804214526 {'max_features': 5, 'n_estimators': 100}
    50054.94886918995 {'max_features': 3, 'n_estimators': 150}
    64847.94779269648 {'max_features': 5, 'n_estimators': 2}

    Let’s look at the feature importances

    feature_importances = grid_search.best_estimator_.feature_importances_

    array([8.46978272e-02, 7.69983975e-02, 4.08715796e-02, 1.67325719e-02,
           1.71418340e-02, 1.73518185e-02, 1.56303531e-02, 3.39824215e-01,
           2.30528104e-02, 1.04033701e-01, 8.64983594e-02, 1.29273143e-02,
           1.54663950e-01, 7.22217547e-05, 3.62205279e-03, 5.88099358e-03])

    The corresponding list is as follows

    feature_importance_list = list(zip(features, feature_importances.tolist()))

    [('longitude', 0.0846978271965227),
     ('latitude', 0.07699839747855737),
     ('housing_median_age', 0.040871579612884096),
     ('total_rooms', 0.016732571900462085),
     ('total_bedrooms', 0.01714183399184058),
     ('population', 0.0173518184721046),
     ('households', 0.015630353131298083),
     ('median_income', 0.3398242154869636),
     ('bedrooms_per_household', 0.023052810363875926),
     ('population_per_households', 0.10403370064780083),
     ('rooms_per_households', 0.08649835942626646),
     ('<1H OCEAN', 0.012927314349565632),
     ('INLAND', 0.15466394981681342),
     ('ISLAND', 7.222175467748088e-05),
     ('NEAR BAY', 0.003622052794433035),
     ('NEAR OCEAN', 0.005880993575933963)]

    We can plot this list as a horizontal bar chart of the 16 features

    plt.barh(y=features, width=feature_importances.tolist())

    Feature importances plotted as a horizontal bar chart of the 16 features

    The final model RMSE is given by

    final_model = grid_search.best_estimator_

    final_predictions = final_model.predict(test_data)

    final_mse = mean_squared_error(test_labels, final_predictions)
    final_rmse = np.sqrt(final_mse)



    This can be modified further using various feature selection methods.

    Thus, median_income is the most important feature. The best result is achieved using RandomForestRegressor + RandomizedSearchCV. The trained RandomForestRegressor yields rmse = 18797.8 +/- 2274 on the training data, whereas min(mean_test_score) yields rmse ≈ 48881 with 'max_features': 7, 'n_estimators': 180.

    Case 3: IA

    For this case study, the primary objective was to create and assess advanced ML/AI models to accurately predict house prices based on the Ames dataset. It was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

    The data set includes around 3,000 records of house sales in Ames, Iowa between 2006 and 2010 and contains 79 explanatory variables detailing various aspects of residential homes, such as square footage, number of rooms, and sale year. The data is split into a training set, which will be used to create the model, and a test set, which will be used to test model performance.

    Results can provide insights on the pricing of real estate assets just by plugging in the house characteristics and letting the model return a price. In addition, the ML/AI output can provide information on which features of a new house are more valuable for potential house buyers. Source code: GitHub.

    The general ETL Python workflow to create the model is as follows:

    1. Data preprocessing
    2. Exploratory data analysis/Feature Engineering
    3. Model training & hyperparameter tuning
    4. Model diagnostics & evaluation
    5. Result interpretation
    Let’s set the working directory YOURPATH

    import os
    os.chdir('YOURPATH')
    os.getcwd()

    Let’s import libraries and download train/test Ames datasets
    %matplotlib inline
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    import sklearn.linear_model as linear_model
    import seaborn as sns
    import xgboost as xgb
    from sklearn.model_selection import KFold
    from IPython.display import HTML, display
    from sklearn.manifold import TSNE
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    pd.options.display.max_rows = 1000
    pd.options.display.max_columns = 20

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    Let’s get the dimensions of the train and test data
    print("Training data set dimension : {}".format(train.shape))
    print("Testing data set dimension : {}".format(test.shape))

    Training data set dimension : (2051, 81)
    Testing data set dimension : (879, 80)

    Let’s look at the continuous features
    numerical_cols = [col for col in train.columns if train.dtypes[col] != 'object']
    print("Continuous features")
    print("count of continuous features:", len(numerical_cols))

    Continuous features
    ['PID', 'MS SubClass', 'Lot Frontage', 'Lot Area', 'Overall Qual', 'Overall Cond', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Yr Blt', 'Garage Cars', 'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold']
    count of continuous features: 37

    Let’s look at the categorical features
    categorical_cols = [col for col in train.columns if train.dtypes[col] == 'object']
    print("categorical features")
    print("count of categorical features:", len(categorical_cols))

    categorical features
    ['MS Zoning', 'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating', 'Heating QC', 'Central Air', 'Electrical', 'Kitchen Qual', 'Functional', 'Fireplace Qu', 'Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond', 'Paved Drive', 'Pool QC', 'Fence', 'Misc Feature', 'Sale Type']
    count of categorical features: 42

    and check unique column values below

    print('unique column values')
    train.apply(lambda x: len(x.unique())).sort_values(ascending=False).head(10)

    unique column values


    Id               2051
    PID              2051
    Lot Area         1476
    Gr Liv Area      1053
    Bsmt Unf SF       968
    1st Flr SF        915
    Total Bsmt SF     893
    SalePrice         828
    BsmtFin SF 1      822
    Garage Area       515
    dtype: int64

    Let’s check the cardinality of the categorical train columns, sorted in descending order

    cardinality = train[categorical_cols].apply(lambda x: len(x.unique())).sort_values(ascending=False)

    Neighborhood      28
    Exterior 2nd      15
    Exterior 1st      15
    Sale Type          9
    Condition 1        9
    House Style        8
    Functional         8
    Condition 2        8
    Garage Type        7
    BsmtFin Type 2     7
    BsmtFin Type 1     7
    MS Zoning          7
    Bsmt Qual          6
    Roof Matl          6
    Misc Feature       6
    Garage Cond        6
    Garage Qual        6
    Foundation         6
    Fireplace Qu       6
    Bsmt Cond          6
    Roof Style         6
    Heating            5
    Fence              5
    Pool QC            5
    Electrical         5
    Bldg Type          5
    Bsmt Exposure      5
    Exter Cond         5
    Mas Vnr Type       5
    Lot Config         5
    dtype: int64

    and the cardinality of the categorical test columns

    cardinality = test[categorical_cols].apply(lambda x: len(x.unique())).sort_values(ascending=False)

    Neighborhood      26
    Exterior 2nd      16
    Exterior 1st      13
    Sale Type         10
    Condition 1        9
    House Style        8
    Garage Type        7
    BsmtFin Type 2     7
    BsmtFin Type 1     7
    Garage Cond        6
    Fireplace Qu       6
    Functional         6
    Foundation         6
    Mas Vnr Type       6
    MS Zoning          6
    Roof Matl          6
    Roof Style         6
    Bsmt Qual          6
    Kitchen Qual       5
    Exter Cond         5
    Fence              5
    Garage Qual        5
    Bsmt Exposure      5
    Lot Config         5
    Bldg Type          5
    Electrical         5
    Misc Feature       4
    Garage Finish      4
    Lot Shape          4
    Land Contour       4
    Exter Qual         4
    Heating QC         4
    Heating            4
    Bsmt Cond          4
    Condition 2        4
    Land Slope         3
    Alley              3
    Paved Drive        3
    Pool QC            3
    Utilities          2
    dtype: int64

    Let’s check good and bad train+test column lists

    good_label_cols = [col for col in categorical_cols if set(test[col]).issubset(set(train[col]))]


    bad_label_cols = list(set(categorical_cols)-set(good_label_cols))

    ['Sale Type',
     'Exterior 1st',
     'Roof Matl',
     'Exterior 2nd',
     'Mas Vnr Type',
     'Kitchen Qual']
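    Columns in bad_label_cols contain test categories never seen in training, so label encoding them would fail. A common remedy is simply to drop them from both splits before encoding; a minimal sketch with toy frames (the column values here are illustrative):

```python
import pandas as pd

# Toy stand-ins for the train/test frames
train = pd.DataFrame({"Sale Type": ["WD", "New"], "Street": ["Pave", "Grvl"]})
test = pd.DataFrame({"Sale Type": ["WD", "COD"], "Street": ["Pave", "Grvl"]})

categorical_cols = ["Sale Type", "Street"]
# Columns safe to label-encode: every test category also appears in train
good_label_cols = [c for c in categorical_cols
                   if set(test[c]).issubset(set(train[c]))]
bad_label_cols = list(set(categorical_cols) - set(good_label_cols))

# Drop the problematic columns from both splits
train = train.drop(columns=bad_label_cols)
test = test.drop(columns=bad_label_cols)
print(bad_label_cols)
```

    The alternative to dropping is one-hot encoding with handle_unknown="ignore", which tolerates unseen categories.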

    Let’s plot the count of missing values in the training data column features

    cols_with_missing = train.isnull().sum()
    cols_with_missing = cols_with_missing[cols_with_missing>0]
    fig, ax = plt.subplots(figsize=(7,6))
    width = 0.70 # the width of the bars
    ind = np.arange(len(cols_with_missing)) # the x locations for the groups
    ax.barh(ind, cols_with_missing, width, color="blue")
    ax.set_yticklabels(cols_with_missing.index, minor=False)

    The count of missing values in the training data column features

    Let’s count the percentage of missing values in training data
    print('Percentage of missing values in each columns')

    total = train.isnull().sum().sort_values(ascending=False)
    percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
    missing_data_tr = pd.concat([total, round(percent*100,2)], axis=1, keys=['Total', 'Percent'])

    Percentage of missing values in each columns
    The percentage of missing values in training data

    Similarly, we plot the count of missing values in the test data column features

    cols_with_missing = test.isnull().sum()
    cols_with_missing = cols_with_missing[cols_with_missing>0]
    fig, ax = plt.subplots(figsize=(7,6))
    width = 0.70 # the width of the bars
    ind = np.arange(len(cols_with_missing)) # the x locations for the groups
    ax.barh(ind, cols_with_missing, width, color="blue")
    ax.set_yticklabels(cols_with_missing.index, minor=False)

    and the percentage of missing values in test data columns

    print('Percentage of missing values in each columns')

    total = test.isnull().sum().sort_values(ascending=False)
    percent = (test.isnull().sum()/test.isnull().count()).sort_values(ascending=False)
    missing_data_te = pd.concat([total, round(percent*100,2)], axis=1, keys=['Total', 'Percent'])

    Percentage of missing values in each columns
    The percentage of missing values in test data columns

    Let’s prepare the data for ML.

    Separate features and target variable SalePrice
    X_train = train_data.drop(['SalePrice'], axis=1)
    y = train_data.SalePrice

    and concatenate train and test data
    X = pd.concat([X_train, test_data], axis=0)

    let’s apply SimpleImputer to deal with missing values

    from sklearn.impute import SimpleImputer

    group_1 = [
        'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageType',
        'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtQual', 'BsmtCond',
        'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'MasVnrType'
    ]
    X[group_1] = X[group_1].fillna("None")

    group_2 = [
        'GarageArea', 'GarageCars', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
        'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'MasVnrArea'
    ]

    X[group_2] = X[group_2].fillna(0)

    group_3a = [
        'Functional', 'MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st',
        'Exterior2nd', 'SaleType', 'Utilities'
    ]

    imputer = SimpleImputer(strategy='most_frequent')
    X[group_3a] = pd.DataFrame(imputer.fit_transform(X[group_3a]), index=X.index, columns=group_3a)

    X.LotFrontage = X.LotFrontage.fillna(X.LotFrontage.mean())
    X.GarageYrBlt = X.GarageYrBlt.fillna(X.YearBuilt)

    Let’s check that there are no remaining missing values
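    The check itself is not shown above; one possible form, assuming X is the combined feature frame (a toy frame stands in here), is:

```python
import pandas as pd

# Toy stand-in for the combined feature frame X after imputation
X = pd.DataFrame({"LotFrontage": [60.0, 70.0], "GarageYrBlt": [2001, 1998]})

# The total count of missing entries should be zero after imputation
missing_total = X.isnull().sum().sum()
assert missing_total == 0, f"{missing_total} missing values remain"
print("No missing values remain.")
```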



    Let’s drop outliers in GrLivArea and SalePrice (based on Ames EDA)

    outlier_index = train_data[(train_data.GrLivArea > 4000)
    & (train_data.SalePrice < 200000)].index
    X.drop(outlier_index, axis=0, inplace=True)
    y.drop(outlier_index, axis=0, inplace=True)

    Let’s apply label encoding to the categorical columns

    from sklearn.preprocessing import LabelEncoder

    label_encoding_cols = [
        "Alley", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2",
        "BsmtQual", "ExterCond", "ExterQual", "FireplaceQu", "Functional",
        "GarageCond", "GarageQual", "HeatingQC", "KitchenQual", "LandSlope",
        "LotShape", "PavedDrive", "PoolQC", "Street", "Utilities"
    ]

    label_encoder = LabelEncoder()

    for col in label_encoding_cols:
        X[col] = label_encoder.fit_transform(X[col])

    Let’s transform numerical variables to categorical variables

    to_factor_cols = ['YrSold', 'MoSold', 'MSSubClass']

    for col in to_factor_cols:
        X[col] = X[col].apply(str)

    Let’s apply feature scaling using RobustScaler
    from sklearn.preprocessing import RobustScaler
    numerical_cols = list(X.select_dtypes(exclude=['object']).columns)
    scaler = RobustScaler()
    X[numerical_cols] = scaler.fit_transform(X[numerical_cols])

    followed by one-hot encoding
    X = pd.get_dummies(X, drop_first=True)
    print("X.shape:", X.shape)

    X.shape: (2917, 237)

    Let’s define the train and test columns

    ntest = len(test_data)
    X_train = X.iloc[:-ntest, :]
    X_test = X.iloc[-ntest:, :]
    print("X_train.shape:", X_train.shape)
    print("X_test.shape:", X_test.shape)

    X_train.shape: (1458, 237)
    X_test.shape: (1459, 237)

    Let’s perform modeling
    from sklearn.model_selection import KFold, cross_val_score

    n_folds = 5

    def getRMSLE(model):
        """Return the average RMSLE over all folds of the training data."""
        # Set KFold to shuffle the data before the split
        kf = KFold(n_folds, shuffle=True, random_state=42)

        # Get the RMSLE score
        rmse = np.sqrt(-cross_val_score(
            model, X_train, y, scoring="neg_mean_squared_error", cv=kf))
        return rmse.mean()

    Let’s apply regularized regressions
    from sklearn.linear_model import Ridge, Lasso

    lambda_list = list(np.linspace(20, 25, 101))

    rmsle_ridge = [getRMSLE(Ridge(alpha=lambda_)) for lambda_ in lambda_list]
    rmsle_ridge = pd.Series(rmsle_ridge, index=lambda_list)

    rmsle_ridge.plot(title="RMSLE by lambda")
    print("Best lambda:", rmsle_ridge.idxmin())
    print("RMSLE:", rmsle_ridge.min())

    Ridge lambda:

    Best lambda: 22.9
    RMSLE: 0.11409306668450883
    Ridge lambda regularization RMSLE

    ridge = Ridge(alpha=22.9)

    The Lasso Regression is
    lambda_list = list(np.linspace(0.0006, 0.0007, 11))
    rmsle_lasso = [
        getRMSLE(Lasso(alpha=lambda_, max_iter=100000)) for lambda_ in lambda_list
    ]
    rmsle_lasso = pd.Series(rmsle_lasso, index=lambda_list)

    rmsle_lasso.plot(title="RMSLE by lambda")
    print("Best lambda:", rmsle_lasso.idxmin())
    print("RMSLE:", rmsle_lasso.min())

    Best lambda: 0.00065
    RMSLE: 0.11335701578061286
    Lasso RMSLE lambda regularization

    lasso = Lasso(alpha=0.00065, max_iter=100000)

    Let’s apply the XGBoost algorithm

    from xgboost import XGBRegressor

    xgb = XGBRegressor(learning_rate=0.05)  # remaining hyperparameters truncated in the original

    Let’s apply the LightGBM algorithm

    from lightgbm import LGBMRegressor

    lgb = LGBMRegressor(objective='regression')  # remaining hyperparameters truncated in the original

    Let’s design the averaging model

    from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone

    class AveragingModel(BaseEstimator, RegressorMixin, TransformerMixin):
        def __init__(self, models):
            self.models = models

        def fit(self, X, y):
            # Create clones of the base models
            self.models_ = [clone(x) for x in self.models]
            # Train the cloned models
            for model in self.models_:
                model.fit(X, y)
            return self

        def predict(self, X):
            # Stack predictions from the trained clones
            predictions = np.column_stack(
                [model.predict(X) for model in self.models_])
            # Return the average prediction
            return np.mean(predictions, axis=1)

    avg_model = AveragingModel(models=(ridge, lasso, xgb, lgb))


    Let’s compare the X-plots

    X-plot observed vs predicted test data: Ridge regularization
    X-plot observed vs predicted test data: lasso regularization
    X-plot observed vs predicted test data: XGBoost
    X-plot observed vs predicted test data: LightGBM
    X-plot observed vs predicted test data: Average Model

    We can see that both XGBoost and LightGBM methods result in relatively similar X-plots and corresponding RMSLEs.
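    The X-plots above come from saved figures; this is a hedged sketch of how such an observed-vs-predicted scatter can be drawn, with synthetic predictions standing in for a trained model’s output.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
observed = rng.uniform(50_000, 500_000, size=200)      # stand-in sale prices
predicted = observed * rng.normal(1.0, 0.1, size=200)  # stand-in model output

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(observed, predicted, alpha=0.5)
lims = [observed.min(), observed.max()]
ax.plot(lims, lims, "r--", label="perfect prediction")  # 45-degree reference line
ax.set_xlabel("Observed SalePrice")
ax.set_ylabel("Predicted SalePrice")
ax.legend()
fig.savefig("xplot.png")
```

    The closer the cloud of points hugs the 45-degree line, the lower the corresponding RMSLE.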

    Case 4: MA

    Let’s visualize ML model performance using Scikit-Plot evaluation metrics. The public dataset we’ll use is the Boston housing price dataset, which contains information about houses in Boston and the prices at which they were sold. We’ll split it into train and test sets with train_size=0.8. Let’s import the libraries and load the data:

    import scikitplot as skplt

    import sklearn
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    import matplotlib.pyplot as plt

    import sys
    import warnings

    print("Scikit Plot Version : ", skplt.__version__)
    print("Scikit Learn Version : ", sklearn.__version__)
    print("Python Version : ", sys.version)

    %matplotlib inline

    Scikit Plot Version :  0.3.7
    Scikit Learn Version :  1.0.2
    Python Version :  3.9.12 (main, Apr  4 2022, 05:22:27)

    boston = load_boston()
    X_boston, Y_boston = boston.data, boston.target

    print("Boston Dataset Size : ", X_boston.shape, Y_boston.shape)

    print("Boston Dataset Features : ", boston.feature_names)

    Boston Dataset Size :  (506, 13) (506,)
    Boston Dataset Features :  ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
     'B' 'LSTAT']

    X_boston_train, X_boston_test, Y_boston_train, Y_boston_test = train_test_split(X_boston, Y_boston, train_size=0.8)

    print("Boston Train/Test Sizes : ", X_boston_train.shape, X_boston_test.shape, Y_boston_train.shape, Y_boston_test.shape)

    Boston Train/Test Sizes :  (404, 13) (102, 13) (404,) (102,)

    Let’s plot the cross-validation performance of ML models by passing it the Boston dataset. Scikit-plot provides a method named plot_learning_curve() as a part of the estimators module which accepts estimator, X, Y, cross-validation info, and scoring metric for plotting performance of cross-validation on the dataset.

    skplt.estimators.plot_learning_curve(LinearRegression(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston Linear Regression Learning Curve");

    Boston Linear Regression Learning Curve

    skplt.estimators.plot_learning_curve(RandomForestRegressor(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston RandomForestRegressor Learning Curve");

    Boston Random Forest Regression Learning Curve

    from xgboost import XGBRegressor
    skplt.estimators.plot_learning_curve(XGBRegressor(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston XGBRegressor Learning Curve");

    Boston XGBoost Regression Learning Curve

    from lightgbm import LGBMRegressor
    skplt.estimators.plot_learning_curve(LGBMRegressor(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston LGBMRegressor Learning Curve");

    Boston LightGBM regression learning Curve

    from sklearn.linear_model import Ridge, Lasso
    skplt.estimators.plot_learning_curve(Ridge(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston Ridge Regression Learning Curve");

    Boston Ridge Regression Learning Curve

    skplt.estimators.plot_learning_curve(Lasso(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston Lasso Regression Learning Curve");

    Boston Lasso Regression Learning Curve

    from sklearn import linear_model
    reg = linear_model.BayesianRidge()
    skplt.estimators.plot_learning_curve(reg, X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston BayesianRidge Regression Learning Curve");

    Boston bayesian Ridge Regression Learning Curve

    from sklearn.linear_model import TweedieRegressor
    reg = TweedieRegressor(power=1, alpha=0.5, link='log')
    skplt.estimators.plot_learning_curve(reg, X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston TweedieRegressor Learning Curve");

    Boston Tweedie Regression Learning Curve

    It is clear that RandomForestRegressor, XGBRegressor, and LGBMRegressor yield the best training and cross-validation scores for training examples > 420 compared to other ML algorithms.


    Key Takeaways

    • We predict/estimate US house prices in order to allocate a valuation expert over a period of time.
    • We need a fast AI to address rapidly increasing populations and the number of dwelling houses in the country.
    • We use a region-dependent pre-trained ML model to predict prices of new houses.
    • We import key Python libraries (pandas, scikit-learn, etc.) and download public-domain housing datasets from Kaggle or GitHub.
    • We gather and clean, edit, scale and transform data so it can be used for model training and test predictions. Specifically, we identify the target variable (SalePrice), impute missing values, perform label encoding, standardization, splitting and (optional) balancing of training and testing datasets. For example, we can look at scatter plots to detect outliers to be dropped.
    • The input data consists of a home’s features, including its eventual selling price and various descriptive features such as location, remodeling, age, size, type of sale (single family, commercial, etc).
    • These features will be analyzed in determining a home’s value and what the shopper is most likely to buy.
    • Feature engineering can determine the most important model features; there may be one feature that stands out, or there may be several. For example, a larger living or basement area is linked to a higher house price.
    • We perform model training using different linear and non-linear regression algorithms (Ridge, Lasso, Random Forest, Decision Tree, SVM, XGBoost, etc.).
    • The model performance is evaluated using a user-defined loss function (RMSE, MSE, OHMSE, etc.).
    • The pre-trained model is then used to generate predictions for both training and validation inputs.
    • Cross-validation of different ML algorithms has proven to be a suitable method to find an acceptable best fitting algorithm for the given set of features.
    • It appears that location and square feet area play an important role in deciding the price of a property. This is helpful information for sellers and buyers.
    • Results provide a primer on advanced ML real estate techniques as well as several best practices for ML readiness. 
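    The workflow summarized above can be sketched as a compact scikit-learn pipeline. This is a minimal illustration with a synthetic dataset; the column names and hyperparameters are placeholders, not the values used in the case studies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Tiny synthetic housing frame (placeholder columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 4000, 100),
    "Neighborhood": rng.choice(["A", "B", "C"], 100),
    "SalePrice": rng.uniform(50_000, 500_000, 100),
})

X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]

# Impute + scale numeric columns, one-hot encode categoricals
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()),
                      ("scale", RobustScaler())]), ["GrLivArea"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
])
model = Pipeline([("prep", preprocess),
                  ("rf", RandomForestRegressor(n_estimators=50, random_state=0))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
model.fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print(f"RMSE: {rmse:,.0f}")
```

    Wrapping preprocessing and the regressor in one Pipeline keeps the train/test transformations consistent and makes the whole chain usable inside cross-validation or grid search.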


    Housing prices are an important reflection of the US real estate market, and housing price ranges are of great interest to both buyers and sellers. Real estate is the world’s largest asset class, worth $277 trillion, three times the total value of all publicly traded companies. ML/AI applications have been accompanying the sector’s growth.

    One of the most popular AI applications in the industry is intelligent investing. This application helps answer questions like:

    • Which house should I buy or build to maximize my return?
    • Where or when should I do so?
    • What is its optimum rent or sale price?

    In this blog post, we have reviewed how ML leverages the power of housing data to tackle these important questions. We have also explored the pros and cons of ML algorithms and how optimizing various steps of actual Python workflows can help improve their performance.


    Using Data to Predict Ames, Iowa Housing Price

    Using linear regression and feature engineering to predict housing prices in Ames, Iowa

    GitHub Rep Ames-housing-price-prediction


    Boston House Price Prediction Using Machine Learning

    House Price Prediction using Linear Regression from Scratch

    House price prediction – Austin, TX

    GitHub 137 public repositories matching housing-prices

    Predicting House Prices with Linear Regression | Machine Learning from Scratch (Part II)

    California Housing Prices

    California Housing

    Machine Learning Project: House Price Prediction

    Machine learning


    Real Estate Supervised ML/AI Linear Regression Revisited – USA House Price Prediction

    Supervised Machine Learning Use Case: Prediction of House Prices

  • Gulf’s Oil Price Web Scraping in R

    Gulf’s Oil Price Web Scraping in R

    Gulf states to gain $1.3 trillion in additional oil revenue by 2026: IMF.

    • The gains, due to high oil prices, are expected to provide ‘firepower’ to the region’s sovereign wealth funds (SWFs), which are among the largest in the world.
    • Saudi Arabia’s PIF is chaired by Crown Prince Mohammed bin Salman. It has invested over $620 billion in total, of which $7.5 billion was invested in US stocks during the second quarter, when share prices were relatively low. PIF bought stocks of Amazon, PayPal, and BlackRock, among others, according to the report.

    Let’s discuss the basics of sourcing oil market price data for free online. Web scraping in R is a technique to retrieve large amounts of data from the Internet.

    We’ll be scraping data on Gulf’s oil prices from the Oil Price website and converting it into a usable format.

    Let’s install R version 4.1.2 (2021-11-01) from the R Foundation for Statistical Computing. It is available within the Anaconda IDE.

    Let’s set up the working directory YOURPATH

    setwd("YOURPATH")
    and install the following packages via CRAN

    type = "binary"
    install.packages("rvest", type = "binary")

    install.packages("tidyverse", type = "binary")

    install.packages("ggpubr", type = "binary")

    We need the following libraries

    library(rvest)
    library(tidyverse)
    library(ggpubr)

    The relevant R webscraping script is given by

    url <- ""  # the Oil Price website URL (truncated in the original)
    webpage <- read_html(url)
    print(webpage)
    value <- webpage %>%
      html_nodes(css = ".last_price") %>%
      html_text()
    name <- webpage %>%
      html_nodes(., css = "td:nth-child(2)") %>%
      html_text() %>%
      .[c(41:72, 74, 76:77, 79, 81:82, 84:85, 87:89, 91, 94:96, 98:99, 101:103, 105:107, 109:112, 114:116, 118:120, 122:124,
        126:128, 130:131, 133, 135:142, 145:160, 162:164, 166:167, 169:170, 172, 174,
        176:178, 180:182, 184:186, 188:189, 191, 193:196, 198:199, 201:213)]

    name1 <- append(name, NA, after=138)
    name2 <- append(name1, NA, after=139)
    name3 <- append(name2, NA, after=140)

    prices <- tibble(name = name3,
    last_price = value)


    prices10 <- tibble(namex = name10,
    last_pricex = value10)

    tabprices10=table(prices10$namex, prices10$last_pricex)

    Let’s create the basic scatter plot

    df <- data.frame(brand=prices10$namex,
                     price=prices10$last_pricex)
    brand price
    1 Al Shaheen – Qatar 88.77
    2 Iraq 95.02
    3 Basrah Heavy 96.07
    4 Basrah Medium 8.361
    5 Saudi Arabia 3.617
    6 Arab Extra Light 86.19
    7 Arab Heavy 95.28
    8 Arab Medium 66.49
    9 Nigeria 93.53
    10 Brass River 82.50

    ggbarplot(df, "brand", "price",
    fill = "steelblue", color = "steelblue",
    label = TRUE, lab.pos = "in", lab.col = "white")

    Gulf's oil price chart

    This chart is crucial for competitive pricing. In order to keep prices of your products competitive and attractive, you need to monitor and keep track of prices set by your competitors. If you know what your competitors’ pricing strategy is, you can accordingly align your pricing strategy to get an edge over them.

    Explore More

    Webscraping in R – IMDb ETL Showcase

    Firsthand Data Visualization in R: Examples

    Tutorial: Web Scraping in R with rvest

    Web Scrape Text from ANY Website – Web Scraping in R (Part 1)

  • Cloud-Native Tech Status Update Q3 2022

    Cloud-Native Tech Status Update Q3 2022

    Following our June 2022 post, let’s dive even deeper into the cloud computing (CC) trends we plan to follow in the coming months.

    CC 2022 = AI Ops


    • CC Market
    • Infographic
    • Key Services
    • Serverless Functions
    • Microservices
    • DevOps CI/CD
    • ML/AI Products
    • IoT Technology
    • Cybersecurity
    • Use-Cases
    • E-Training
    • Events
    • Explore More

    CC Market

    • Chicago, Aug. 30, 2022 (GLOBE NEWSWIRE) — Cloud Performance Management Market to grow from USD 1.5 billion in 2022 to USD 3.9 billion by 2027, at a Compound Annual Growth Rate (CAGR) of 17.6% during the forecast period, according to a new report by MarketsandMarkets™. 
    • Global cloud services spend up 33% to hit $62.3 billion in Q2 2022. The top three vendors in Q2 2022, namely Amazon Web Services (AWS), Microsoft Azure and Google Cloud, together accounted for 63% of global spending in Q2 2022 and collectively grew 42%.
    • AWS accounted for 31% of total cloud infrastructure services spend in Q2 2022, making it the leading cloud service provider. It grew 33% on an annual basis. Azure was the second largest cloud service provider in Q2, with a 24% market share after growing 40% annually. Google Cloud grew 45% in the latest quarter and accounted for an 8% market share.
    • Canalys VP Alex Smith said: “Cloud remains the strong growth segment in tech. While opportunities abound for providers large and small, the interesting battle remains right at the top between AWS and Microsoft. The race to invest in infrastructure to keep pace with demand will be intense and test the nerves of the companies’ CFOs as both inflation and rising interest rates create cost headwinds.”
    • Both AWS and Microsoft are continuing to roll out infrastructure. AWS has plans to launch 24 availability zones across eight regions, while Microsoft plans to launch 10 new regions over the next year. In both cases, the providers are increasing investment outside of the US as they look to capture global demand and ensure they can provide low-latency and high data sovereignty solutions.

    Key Services

    Types of CC architecture: public cloud, private cloud, and hybrid cloud.

    Why database migration to cloud?

    Types of cloud services: IaaS, PaaS, FaaS, and SaaS.

    • With IaaS, you rent IT infrastructure—servers and virtual machines (VMs), storage, networks, operating systems—from a cloud provider on a pay-as-you-go basis.
    • PaaS refers to cloud computing services that supply an on-demand environment for developing, testing, delivering, and managing software applications.
    • Overlapping with PaaS, FaaS focuses on building app functionality without spending time continually managing the servers and infrastructure required to do so. 
    • With SaaS, cloud providers host and manage the software application and underlying infrastructure, and handle any maintenance, like software upgrades and security patching.

    IaaS delivered improvements in cost, agility, scalability, and reliability.

    Cloud service models: IaaS, PaaS, and SaaS. Courtesy of David Chou.

    The diagram above provides a simplified/generalized view of choices we have from a hosting perspective:

    • On-premises: represents the traditional model of purchasing/licensing software, installing it, and managing it in our own data centers
    • Hosted: represents the co-location or managed outsourced hosting services. For example, GoGrid, Amazon EC2, etc.
    • Cloud: represents cloud fabric that provides higher-level application containers and services. For example, Google App Engine, Amazon S3/SimpleDB/SQS, etc.

    FaaS is a type of serverless cloud computing service that allows executing code in response to events without maintaining the complex infrastructure typically associated with building and launching microservices applications. With FaaS, users manage only functions and data while the cloud provider manages the application. This allows developers to get the functions they need without paying for services when code isn’t running. Some popular FaaS examples include Amazon’s AWS Lambda, Google Cloud Functions, and Microsoft Azure Functions.
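    The event-driven model above can be illustrated with a minimal Python handler in the style of AWS Lambda (the event shape and field names here are hypothetical; real triggers supply provider-specific payloads):

```python
import json

def handler(event, context):
    # React to an incoming event without managing any server infrastructure;
    # the provider invokes this function and scales it automatically.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Simulate an event invocation locally
print(handler({"name": "IoT"}, None))
```

    Billing in this model is per invocation and duration, which is why idle code costs nothing.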

    CC = Heart of Global Digital Transformation


    GCP Digital transformation Funnel

    Serverless Functions

    Going Cloud-Native


    What Serverless Does Not Mean

    FaaS executes logic in response to events. All methods are grouped into a single deployable unit – a stateless Function. Scaling is handled automatically.

    FaaS use cases

    Traditional vs Serverless Architecture

    Serverless Offerings


    Microservices: an architectural pattern that breaks down large application structures into smaller, independent services that are not dependent upon a specific coding language; derived from service-oriented architecture (SOA).

    Top 6 microservices patterns:

    Top 6 microservices patterns. Source: MuleSoft

    Microservices Use-Cases


    DevOps CI/CD

    Jenkins CI/CD Automation with OpsMX

    Jenkins is widely adopted for continuous integration (CI). Many organizations extend Jenkins with scripts and plug-ins to perform continuous delivery (CD) and deployments. Whether deploying to Kubernetes, VM, hybrid or multi-cloud environments, there is a better approach to implementing CD on top of Jenkins than using scripts and manual processes.

    Automating CD processes using AI can increase the velocity and accuracy of releases, improve the productivity of DevOps teams and eliminate manual Jenkins deployment scripts. In addition, many organizations face challenges with achieving regulatory and security compliance and performing audits. Automating CD with AI and deploying a central policy engine can help to enforce regulatory and security requirements and enable organizations to conduct audits easily.

    Modern CD solutions—those that incorporate AI and include open tool integration layers—can allow developers and DevOps teams to continue using the tools they know and love. Deep toolchain integrations provide real-time insights and diagnostics correlated with pipelines, which speeds approvals, verifications and triage activities. Modernizing your Jenkins CI/CD processing with CD automation offers key capabilities your teams need for software delivery success.

    Codefresh vs Jenkins CI/CD tooling:


    Codefresh has a bigger scope than Jenkins. In fact, one of the most crucial points of the comparison is that Jenkins is only a Continuous Integration (CI) solution, while Codefresh covers both Continuous Integration and Continuous Delivery.

    Cloud Modernization with CircleCI

    GKE Autopilot is a new mode of operation in Google Kubernetes Engine (GKE) designed to reduce operational costs around managing clusters, optimizing production time and driving higher workload availability. As a Google Cloud Platform (GCP) partner, CircleCI makes it easy to integrate CI/CD workflows with GCP and utilize modes of operation like Autopilot.


    GitOps = IaC + MRs + CI/CD

    • IaC – GitOps uses a Git repository as the single source of truth for infrastructure definitions. A Git repository is a .git folder in a project that tracks all changes made to files in a project over time. Infrastructure as code (IaC) is the practice of keeping all infrastructure configuration stored as code.

    • MRs – GitOps uses merge requests (MRs) as the change mechanism for all infrastructure updates. The MR is where teams can collaborate via reviews and comments and where formal approvals take place. A merge commits to your master (or trunk) branch and serves as a changelog for auditing and troubleshooting.

    • CI/CD – GitOps automates infrastructure updates using a Git workflow with continuous integration and continuous delivery (CI/CD). When new code is merged, the CI/CD pipeline enacts the change in the environment. Any configuration drift, such as manual changes or errors, is overwritten by GitOps automation so the environment converges on the desired state defined in Git. GitLab uses CI/CD pipelines to manage and implement GitOps automation, but other forms of automation, such as definitions operators, can be used as well.
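    The convergence-on-desired-state behavior can be illustrated with a toy reconciliation loop in Python (all names here are illustrative; real GitOps operators such as Flux or Argo CD reconcile against the Kubernetes API):

```python
def reconcile(desired: dict, actual: dict) -> dict:
    """One reconciliation pass: return the changes needed to make the
    live environment match the desired state stored in Git."""
    changes = {}
    # Create or update anything that differs from the Git-defined state
    for key, value in desired.items():
        if actual.get(key) != value:
            changes[key] = value
    # Remove drift: anything present live but absent from Git
    for key in actual:
        if key not in desired:
            changes[key] = None  # None marks a deletion
    return changes

# Desired state as defined in the Git repository
desired = {"replicas": 3, "image": "app:v2"}
# Live state after an out-of-band manual change
actual = {"replicas": 5, "image": "app:v2", "debug": "on"}

print(reconcile(desired, actual))  # the manual edits are detected as drift
```

    Running this pass repeatedly is what makes the environment converge: manual edits survive at most one reconciliation cycle.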

    ML/AI Products

    AI Ops

    Why is AI Cloud important to you?

    MLOps vs DevOps:

    MLOps is a key aspect of ML engineering that focuses on simplifying and accelerating the process of delivering ML models to production and maintaining and monitoring them. MLOps involves collaboration between different teams including data scientists, DevOps engineers, IT specialists and others.

    DevOps combines the concepts of development and operations to describe a collaborative approach to performing the tasks usually associated with separate application development and IT operations teams. 

    DevOps vs MLOps across the technology deployment stages:

    Development
    • DevOps: Usually, the code creates an interface or application. The code is wrapped into an executable or artifact before being deployed and tested with a set of checks. Ideally, this automated cycle continues until the final product is ready.
    • MLOps: The code enables the team to build or train machine learning models. The output artifacts include serialized files that can receive data inputs to generate inferences. Validation involves checking the trained model’s performance on the test data. This cycle should also continue until the model reaches a specified performance threshold.

    Version Control
    • DevOps: Version control typically only tracks changes to code and artifacts; there are few metrics to track.
    • MLOps: Pipelines usually have more factors to track. Building and training an ML model involves an iterative experimentation cycle, requiring tracking of various metrics and components for each experiment (essential for later audits). Additional components to track include training datasets, model building code and model artifacts; metrics include hyperparameters and model performance indicators, such as error rates.

    Reusability
    • DevOps: Pipelines focus on repeatable processes; teams can mix and match processes without following a specific workflow.
    • MLOps: Pipelines repeatedly apply the same workflows. The common framework across projects helps improve consistency and allows teams to progress faster because they start with familiar processes. Project templates offer structure, enabling customization to address the unique requirements of each use case. Centralized data management consolidates the organization’s data to accelerate the discovery and training processes; common approaches include a single source of truth and data warehouses.

    Continuous Monitoring
    • DevOps: Site reliability engineering (SRE) has been trending over the past few years, emphasizing the need for monitoring software from development through to production deployment. Software does not degrade in the way an ML model does.
    • MLOps: Machine learning models can degrade quickly, requiring constant monitoring and updating. Conditions in the production environment affect the model’s accuracy: after deployment, the model generates predictions on new real-world data that is constantly changing, reducing model performance. MLOps ensures that algorithms remain production-ready by incorporating procedures for continuous monitoring and model retraining.

    Infrastructure
    • DevOps: Infrastructure-as-code (IaC), build servers, CI/CD automation tools.
    • MLOps: Deep learning and machine learning frameworks, cloud storage for large datasets, GPUs for deep learning and computationally-intensive ML models.
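    The monitoring contrast above can be made concrete: a deployed model's error rate is tracked over a rolling window, and retraining is triggered once it degrades past a threshold. A minimal sketch, not tied to any MLOps platform (the class, window size and threshold are illustrative):

```python
from collections import deque

class DriftMonitor:
    """Flag model degradation from a rolling window of prediction errors."""

    def __init__(self, window: int = 100, error_threshold: float = 0.2):
        self.errors = deque(maxlen=window)  # only the most recent outcomes
        self.error_threshold = error_threshold

    def record(self, predicted, actual) -> None:
        # 1.0 marks a wrong prediction, 0.0 a correct one
        self.errors.append(0.0 if predicted == actual else 1.0)

    def needs_retraining(self) -> bool:
        # Trigger retraining once the rolling error rate breaches the threshold
        if not self.errors:
            return False
        return sum(self.errors) / len(self.errors) > self.error_threshold

monitor = DriftMonitor(window=10, error_threshold=0.2)
for predicted, actual in [(1, 1)] * 7 + [(1, 0)] * 3:  # 30% recent error rate
    monitor.record(predicted, actual)
print(monitor.needs_retraining())  # → True
```

    In a real pipeline the retraining signal would kick off the training workflow described in the Development row rather than just returning a flag.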
    DevOps Multi-Cloud Configuration Management (CM)

    IoT Technology

    The Internet of Things, or IoT, is a system of interrelated computing devices, mechanical and digital machines, objects, animals or people that are provided with unique identifiers (UIDs) and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction.

    IoT ETL pipeline = data gathering, processing, analytics and ML

    IoT data gathering, processing, analytics and ML
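    The gather → process → analyze stages of such a pipeline can be sketched end-to-end in a few lines of Python (the sensor payloads and plausibility thresholds here are made up for illustration):

```python
from statistics import mean

# Gather: raw readings as they might arrive from devices via a gateway
readings = [
    {"device_id": "sensor-1", "temp_c": 21.4},
    {"device_id": "sensor-1", "temp_c": 21.9},
    {"device_id": "sensor-1", "temp_c": 85.0},   # spurious spike
    {"device_id": "sensor-1", "temp_c": 22.1},
]

# Process (the "T" in ETL): drop readings outside a plausible range
valid = [r for r in readings if -40.0 <= r["temp_c"] <= 60.0]

# Analyze: aggregate the cleaned stream, ready for storage or an ML model
avg_temp = mean(r["temp_c"] for r in valid)
print(f"{len(valid)} valid readings, mean {avg_temp:.1f} C")
```

    In a cloud deployment each stage typically becomes its own component: the gateway batches device messages, a serverless function does the cleaning, and a warehouse or ML service consumes the aggregates.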

    GCP IIoT Streaming Analytics with FaaS

    IoT = Device + Gateway + Cloud

    Industry-4 GCP IoT Workflow

    Google Cloud IoT solutions

    Top Cloud IoT Platforms

    1. Thingworx 8 IoT Platform
    2. Microsoft Azure IoT Suite
    3. Google Cloud’s IoT Platform
    4. IBM Watson IoT Platform
    5. AWS IoT Platform
    6. Cisco IoT Cloud Connect
    7. Salesforce IoT Cloud
    8. Kaa IoT Platform
    9. Oracle IoT Platform
    10. Thingspeak IoT Platform
    11. GE Predix IoT Platform

    Top 3 cloud-native IoT vendors

    • Dynatrace – Monitor the performance, availability, and health of your IoT devices through a single all-in-one platform powered by AI.
    • 2smart – No-code software for bringing smart devices to the internet and to market.
    • Emnify – Intuitive IoT Monitoring Dashboard.


    • AT&T Managed Threat Detection and Response (MDR) is a sophisticated managed detection and response service that helps you to detect and respond to advanced threats before they impact your business. It builds on the unified security management (USM) platform for threat detection and response, and AT&T Alien Labs™ threat intelligence.
    • Multifactor authentication (MFA) is becoming increasingly standard within software development organizations, with GitHub recently announcing that two-factor authentication (2FA) will be mandatory for all code contributors by the end of 2023.
    • Cloud computing platforms are rife with misconfigurations that cybercriminals regularly exploit. Developers using infrastructure-as-code tools simply lack the cybersecurity expertise required to make sure cloud application environments are secure. It’s up to the cybersecurity team to make sure that the policies and guardrails created to secure cloud platforms are observed, especially when it comes to APIs.
    • The advent of cloud-native, container-based architecture and microservices-based applications running on platforms like Kubernetes has sharpened the focus on API security and the software supply chain—from both security teams and cyberattackers. Software supply chains and APIs have become the new attack surfaces of choice, and with everyone from the White House to entry-level developers talking SBOMs, open source security and APIs, this is an area that’s getting lots of attention. 
    • BrightTalk Webinars of Interest

    Cut Through Cybersecurity Complexity by Converging Key Capabilities at the Edge

    Through the Eyes of the Wolf: Insights into the Chain of Cyber Threats

    5 Signs the World Isn’t Paying Enough Attention to 5G Security

    Managing User Identity in a Cloud-First World

    How to Launch an Effective Zero Trust Initiative


    Stock Markets

    ML/AI Stock Forecasting

    AI based TSLA stock forecast using AWS Forecast

    Legal-as-a-Service (LaaS)

    Deloitte Legal as a service 16 AI projects


    HealthTech Pilots

    HealthTech ML/AI Use-Cases

    The Application of ML/AI in Diabetes

    AI-Powered Stroke Prediction

    HR Automation

    GCP architecture - HR chatbot


    edX c/o IBM MicroBachelors®

    Full Stack Cloud Application Development


    Coursera: Containerized Applications on AWS


    AWS Startup Showcase: Cloud Storage & Security

    If you’re interested in strategies to help Detect and Protect Against Threats in the cloud, this AWS Startup Showcase is for you.

    Explore More

    Cloud Tech Trends June 2022

    Cloud computing in 2025

    How AI and Cloud Computing help enterprises scale up in 2022?



    Udemy Course: Serverless Concepts

    Edge, Cloud, Core: Building an Affordable Hybrid Cloud

    Scale, Build, and Manage Kubernetes Deployments


    move to cloud database
    Cloud Computing Virtualization

    Gartner cloud security report
    GCP Computing Infrastructure
    IIoT Google BigQuery
    Comparing Gitlab CI/CD to Jenkins CI
    Version 4 of the industry’s most popular DevOps market landscape tool, the Periodic Table of DevOps. Selected Vendors: Snowflake, Moogsoft, Instana, DataDog, GitLab, among others.
    Tweets by @xebialabs
    Cloud security report 2022 by PaloAlto and PrismaCloud
    Key cybersecurity trends Q1 2022
  • 10 AI-Powered Websites for Content Writers

    10 AI-Powered Websites for Content Writers

    Following the recent blog, I recommend the following 10 mind-blowing AI websites you probably didn’t know existed:

    Remove unwanted things from images in seconds.

    AI model drawing images from any prompt! #craiyon

    #craiyon desert images

    A better, 10x faster way to write:

    • interview questions
    • profile bio
    • landing page copies
    • YouTube descriptions
    • SEO titles
    • testimonials & reviews

    • Take a picture of the item
    • Get the word in one of 10 languages
    • Start sketching
    • It guesses what you’re trying to draw
    • And offers better alternatives for you.

    Font pairing made simple

    Generate font combinations with deep learning


    Ask your question to the collective wisdom of >100,000 books

    Talk to Books

    Find a person that’s not a person.

    Need to use an image of a person that’s copyright-free?

    • It generates a new face each time.
    • The face is AI-generated: it’s of a person who does not exist.
    • Enter key words
    • Choose what type of name you want

    Namelix generates hundreds of ideas and logos for you.

    Let’s try “food”:

    Namelix example

    Image enhancer & upscaler.

    Upscale pics with AI

    Fix pixelation and blur

    Correct colors and lighting

    Remove JPEG artifacts.

    Lets Enhance Image before
    Lets Enhance Image after
  • The Qullamaggie’s OXY Swing Breakouts

    The Qullamaggie’s OXY Swing Breakouts

    Featured Photo by @Nate_Dumlao at @unsplash

    This post was inspired by Qullamaggie’s trading journey and its application to TSLA swing breakouts. Read more about breakouts here.

    Our current goal is to extend the above breakout analysis to the $OXY stock.


    TradingView OXY Analysis:

    OXY advanced price chart: candlesticks, trading volume, Bollinger bands, Awesome Oscillator (AO), and Chaikin Oscillator.


    OXY advanced price chart 1D
candlesticks, trading volume, Bollinger bands, Awesome Oscillator (AO), and Chaikin Oscillator


    OXY advanced price chart 5D
candlesticks, trading volume, Bollinger bands, Awesome Oscillator (AO), and Chaikin Oscillator

    The summary of Occidental Petroleum Corporation is based on the most popular technical indicators, such as Moving Averages, Oscillators and Pivots. 

    TradingView: OXY price target

    OXY technical analysis

    OXY STRONG BUY technical summary based on the most popular technical indicators, such as Moving Averages, Oscillators and Pivots. 


    Oscillators:

    • Relative Strength Index (14): 65.42 (Neutral)
    • Stochastic %K (14, 3, 3): 79.62 (Neutral)
    • Commodity Channel Index (20): 165.12 (Neutral)
    • Average Directional Index (14): 33.40 (Neutral)
    • Awesome Oscillator: 9.95 (Neutral)
    • Momentum (10): 18.11 (Buy)
    • MACD Level (12, 26): 5.36 (Buy)
    • Stochastic RSI Fast (3, 3, 14, 14): 45.74 (Neutral)
    • Williams Percent Range (14): −5.11 (Neutral)
    • Bull Bear Power: 13.54 (Neutral)
    • Ultimate Oscillator (7, 14, 28): 65.77 (Neutral)


    Moving Averages:

    • Exponential Moving Average (10): 65.43 (Buy)
    • Simple Moving Average (10): 63.40 (Buy)
    • Exponential Moving Average (20): 61.64 (Buy)
    • Simple Moving Average (20): 62.90 (Buy)
    • Exponential Moving Average (30): 57.77 (Buy)
    • Simple Moving Average (30): 58.97 (Buy)
    • Exponential Moving Average (50): 51.24 (Buy)
    • Simple Moving Average (50): 47.96 (Buy)
    • Exponential Moving Average (100): 43.37 (Buy)
    • Simple Moving Average (100): 35.35 (Buy)
    • Exponential Moving Average (200): 43.63 (Buy)
    • Simple Moving Average (200): 38.43 (Buy)
    • Ichimoku Base Line (9, 26, 52, 26): 56.54 (Neutral)
    • Volume Weighted Moving Average (20): 62.64 (Buy)
    • Hull Moving Average (9): 70.74 (Buy)

    E2E Workflow

    The E2E algorithm is implemented as a sequence of the following steps:

    • trend_filter – takes in a pandas Series of prices and outputs a binary array indicating whether the stock fits the growth criteria (1) or not (0). Optional parameters: growth_4_min (the minimum 4-week growth, default 25), growth_12_min (the minimum 12-week growth, default 50) and growth_24_min (the minimum 24-week growth, default 80).

    • explicit_heat_smooth – smoothens out a time series using an explicit finite difference method. It takes the prices (np.array) and t_end, the time at which to terminate the smoothing (e.g. t = 2), and returns the smoothened time-series P. The time spacing must be < 1 for numerical stability. The scheme sets the prices as the initial condition, solves the finite difference update for the interior points at each time-step, and re-attaches the fixed boundary conditions.

    • check_consolidation – smoothens the time series and checks for consolidation; see the docstring of find_consolidation for the parameters.

    • find_consolidation – returns a binary array indicating whether each data point is classed as consolidating (1) or not (0). Parameters: prices (the price time series to check for consolidation), days_to_smooth (the length of the time-series to smoothen, in days, default 50), perc_change_days (the days back to compare the % change against, default 5), perc_change_thresh (the range-trading % criterion for consolidation, default 0.015) and check_days (the number of lookback days to check for any consolidation: if any of the last check_days days is consolidating, then the last data point is said to be consolidating; default 5).

    • We download the Yahoo Finance $OXY data and call the above functions
    • Data visualizations using a matplotlib scatter plot – original Close price + Volume size (red) vs breakouts (green).


    Set working directory YOURPATH

    import os
    os.chdir('YOURPATH')  # set the working directory
    os.getcwd()

    Import libraries

    import numpy as np
    import pandas as pd
    import yfinance as yf

    Define functions

    def trend_filter(prices: pd.core.series.Series,
                     growth_4_min: float = 25.,
                     growth_12_min: float = 50.,
                     growth_24_min: float = 80.) -> np.array:
        '''
        Take in a pandas series and output a binary array to indicate if a stock
        fits the growth criteria (1) or not (0)

        Parameters
        ----------
        prices : pd.core.series.Series
            The prices we are using to check for growth
        growth_4_min : float, optional
            The minimum 4 week growth. The default is 25
        growth_12_min : float, optional
            The minimum 12 week growth. The default is 50
        growth_24_min : float, optional
            The minimum 24 week growth. The default is 80

        Returns
        -------
        A binary array showing the positions where the growth criteria is met
        '''
        # Growth from the rolling-window low to the latest value, in percent
        growth_func = lambda x: 100*(x.values[-1]/x.min() - 1)
        growth_4 = prices.rolling(20).apply(growth_func) > growth_4_min
        growth_12 = prices.rolling(60).apply(growth_func) > growth_12_min
        growth_24 = prices.rolling(120).apply(growth_func) > growth_24_min
        return np.where(growth_4 | growth_12 | growth_24, 1, 0)
    if __name__ == '__main__':

        df ='OXY')
        df.loc[:, 'trend_filter'] = trend_filter(df['Close'])

    [*********************100%***********************]  1 of 1 completed

    df_trending = df[df['trend_filter'] == 1]

    def explicit_heat_smooth(prices: np.array,
                             t_end: float = 5.0) -> np.array:
        '''
        Smoothen out a time series using an explicit finite difference method.

        Parameters
        ----------
        prices : np.array
            The price to smoothen
        t_end : float
            The time at which to terminate the smoothing (i.e. t = 2)

        Returns
        -------
        P : np.array
            The smoothened time-series
        '''
        k = 0.1  # Time spacing, must be < 1 for numerical stability

        # Set up the initial condition
        P = prices
        t = 0
        while t < t_end:
            # Solve the finite difference scheme for the next time-step
            P = k*(P[2:] + P[:-2]) + P[1:-1]*(1 - 2*k)
            # Add the fixed boundary conditions since the above solves the
            # interior points only
            P = np.hstack((prices[0], P, prices[-1]))
            t += k
        return P

    def check_consolidation(prices: np.array,
                            perc_change_days: int,
                            perc_change_thresh: float,
                            check_days: int) -> int:
        '''
        Smoothen the time-series and check for consolidation, see the
        docstring of find_consolidation for the parameters
        '''
        # Find the smoothed representation of the time series
        prices = explicit_heat_smooth(prices)
        # Perc change of the smoothed time series to perc_change_days days prior
        perc_change = prices[perc_change_days:]/prices[:-perc_change_days] - 1
        consolidating = np.where(np.abs(perc_change) < perc_change_thresh, 1, 0)
        # Provided one entry in the last n days passes the consolidation check,
        # we say that the financial instrument is in consolidation on the end day
        if np.sum(consolidating[-check_days:]) > 0:
            return 1
        return 0

    def find_consolidation(prices: np.array,
                           days_to_smooth: int = 50,
                           perc_change_days: int = 5,
                           perc_change_thresh: float = 0.015,
                           check_days: int = 5) -> np.array:
        '''
        Return a binary array to indicate whether each of the data-points are
        classed as consolidating or not

        Parameters
        ----------
        prices : np.array
            The price time series to check for consolidation
        days_to_smooth : int, optional
            The length of the time-series to smoothen (days). The default is 50.
        perc_change_days : int, optional
            The days back to % change compare against (days). The default is 5.
        perc_change_thresh : float, optional
            The range trading % criteria for consolidation. The default is 0.015.
        check_days : int, optional
            The number of lookback days to check for any consolidation.
            If any days in check_days back is consolidating, then the last data
            point is said to be consolidating. The default is 5.

        Returns
        -------
        res : np.array
            The binary array indicating consolidation (1) or not (0)
        '''
        res = np.full(prices.shape, np.nan)
        for idx in range(days_to_smooth, prices.shape[0]):
            res[idx] = check_consolidation(
                prices=prices[idx-days_to_smooth:idx],
                perc_change_days=perc_change_days,
                perc_change_thresh=perc_change_thresh,
                check_days=check_days,
            )
        return res

    Let’s proceed with main

    if __name__ == '__main__':

        df ='TSLA')
        df.loc[:, 'consolidating'] = find_consolidation(df['Close'].values)
        df.loc[:, 'trend_filter'] = trend_filter(df['Close'])
        df.loc[:, 'filtered'] = np.where(
            df['consolidating'] + df['trend_filter'] == 2, True, False)

    [*********************100%***********************]  1 of 1 completed

    Our dataframe df looks as follows

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 3051 entries, 2010-06-29 to 2022-08-10
    Data columns (total 9 columns):
     #   Column         Non-Null Count  Dtype  
    ---  ------         --------------  -----  
     0   Open           3051 non-null   float64
     1   High           3051 non-null   float64
     2   Low            3051 non-null   float64
     3   Close          3051 non-null   float64
     4   Adj Close      3051 non-null   float64
     5   Volume         3051 non-null   int64  
     6   consolidating  3001 non-null   float64
     7   trend_filter   3051 non-null   int32  
     8   filtered       3051 non-null   bool   
    dtypes: bool(1), float64(6), int32(1), int64(1)
    memory usage: 205.6 KB

    import numpy as np
    import seaborn as sns
    import matplotlib
    import matplotlib.pyplot as plt
    import pandas as pd

    df.index = pd.DatetimeIndex(data=df.index, tz='US/Eastern')

    matplotlib.rcParams.update({'font.size': 18})
    plt.plot(df.index, df.Close, 'r')
    plt.ylabel("Price $")

    OXY close share price

    # Split on the breakout flag (assumed definitions: df0 holds the ordinary
    # days, df1 the breakout days; scal scales volume to a sensible marker size)
    df0 = df[~df['filtered']]
    df1 = df[df['filtered']]
    scal = 1e6

    plt.scatter(df0.index, df0["Close"], color='red', s=df0["Volume"]/scal, alpha=0.4)
    plt.scatter(df1.index, df1["Close"], color='green', s=df1["Volume"]/scal, alpha=0.4)
    plt.ylabel("Price $")

    plt.legend(["Close", "Filtered"], facecolor='bisque',
               loc='upper center', bbox_to_anchor=(0.5, -0.08))

    OXY share price (red), volume (scatter size), and breakout points (green)
    OXY close share price (red), volume (scatter size), and breakout points (green)


    print(df.loc['2021-01-01 00:00:00-04:00':'2022-08-22 00:00:00-04:00'])

                        Adj Close    Volume  consolidating  trend_filter  \
    2021-01-04 00:00:00-05:00  17.351271  18497800            0.0             1   
    2021-01-05 00:00:00-05:00  19.101311  37293800            1.0             1   
    2021-01-06 00:00:00-05:00  19.886841  37156400            1.0             1   
    2021-01-07 00:00:00-05:00  20.453615  24299300            1.0             1   
    2021-01-08 00:00:00-05:00  19.966389  18277900            0.0             1   
    ...                              ...       ...            ...           ...   
    2022-08-16 00:00:00-04:00  63.509998  16662200            1.0             0   
    2022-08-17 00:00:00-04:00  62.970001  14889800            1.0             0   
    2022-08-18 00:00:00-04:00  64.879997  16818000            1.0             0   
    2022-08-19 00:00:00-04:00  71.290001  79840900            0.0             0   
    2022-08-22 00:00:00-04:00  69.029999  47888500            0.0             0   
                              filtered  
    2021-01-04 00:00:00-05:00     False  
    2021-01-05 00:00:00-05:00      True  
    2021-01-06 00:00:00-05:00      True  
    2021-01-07 00:00:00-05:00      True  
    2021-01-08 00:00:00-05:00     False  
    ...                             ...  
    2022-08-16 00:00:00-04:00     False  
    2022-08-17 00:00:00-04:00     False  
    2022-08-18 00:00:00-04:00     False  
    2022-08-19 00:00:00-04:00     False  
    2022-08-22 00:00:00-04:00     False  

    We focus on the “True” trading signals and ignore “False”.

    The above scatter plot and the table help identify the setups. You need to have a watchlist ready before the market open. You should also probably have alerts set, and know how many shares you want to buy.

    A swing trader can use the daily chart to find these setups, but it also works on the weekly chart and the intraday (1- and 5-minute) charts. 

    Read More


    A trading journey

    OXY Stock Update Wednesday, 25 May 2022

    OXY Stock Analysis, Thursday, 23 June 2022

    Track All Markets with TradingView

    Predicting Trend Reversal in Algorithmic Trading using Stochastic Oscillator in Python

    S&P 500 Algorithmic Trading with FBProphet

  • Towards min(Risk/Reward) – SeekingAlpha August Bear Market Update

    Towards min(Risk/Reward) – SeekingAlpha August Bear Market Update

    Featured Photo by Nick Chong on Unsplash

    Towards min(Risk/Reward)

    Let’s look at the latest SA market update as of Sun, Aug 21, 2022.

    Cryptocurrency Digest:

    Cryptocurrency Daily Digest as of Aug 21, 2022 08:00 ET
    • Investing In The Metaverse, Not Just For Individual Investors

    As NFTs are increasingly recognised as assets, they also present a conundrum for the investment community.

    • Litecoin: Still A Valuable Coin

    LTC has a robust and growing ecosystem.

    However, it is under pressure from competitors and regulators.

    I estimate LTC could be worth over $150.

    • Bitcoin: Black Swans Are Lurking

    Bitcoin’s blow-off top at $25k on August 14th signifies the end of a reflexive rally, representing the “return to normal” stage of a bubble.

    We anticipate Bitcoin is entering “phase 2” of its first-ever bear market, which can decrease BTC by another 60% to 80%.

    Tech Daily:

    • Intel Corporation Yields Over 4%, Now’s The Time To Buy
    • Oracle: The Dividend Growth Stock Your Portfolio Needs
    • Apple: Upside Catalyst Watch – Is The VR/AR Headset Coming Next Month?
    • Twilio: Earnings Beat, Rapid Growth And Undervalued

    Market Outlook:

    Stocks have been living in the land of make-believe for the past 4 weeks.

    Futures, bond, and currency markets have a message that cannot be ignored.

    There will be no dovish pivot, and the Fed is going to raise rates much higher and keep them there for some time.

    • Weekly Commentary: Inaugural Squeeze

    Risk off had attained powerful momentum globally back in July. De-risking/deleveraging dynamics were increasingly fomenting illiquidity, contagion and instability across global markets.

    The S&P500 ended the session with a year-to-date loss of 20%. The Nasdaq100 was down 28%, while the Banks ended the session with a 2022 loss of 25%.

    These days, bullish markets, luxuriating in newfound liquidity abundance, face an unfamiliar policy backdrop. Rather than a dovish pivot, the Fed is poised to plow ahead with its first real tightening cycle in 28 years.

    • The Week On Wall Street – A Very Complex Situation

    Earnings season winds down after producing the best S&P performance since 2009.

    The recently passed tax and spend green energy bill does not contain a scintilla of “growth”. That means any recovery remains rocky at best.

    China’s zero COVID policy continues to take its toll on its economy, adding more fuel to the global recession talk.

    Economic data continues to come in weak and isn’t aligned with the “no recession” commentary.


    • Any discussions about the Federal Reserve and the stock market seem to include concern over the radical uncertainty that exists in the world today.
    • What About That Recession?

    The yield curve is inverted from the 6-month T-bill to the 10-year Treasury note, which would suggest that the bond market’s message is “recession ahead, batten down the hatches”.

    Relatively tight risk spreads suggest that nobody is thinking too seriously about the potential for corporate defaults, which of course tend to be higher during a recession.

    Those consecutive negative GDP reports notwithstanding, our economy is neither currently in nor close to being in the kind of conditions normally associated with a recession.

    Real Estate:

    SA Morning Briefing:

    Alibaba: Fortunes Will Be Made
    Ford: Lightning Charge To $20s – Get, Set, Go!

    Exxon Mobil Has An Ace Up Its Sleeve

    Home Sales Are Crashing Faster Than The Bursting Of The 2005 Housing Bubble

    Wall Street Breakfast: Housing Crunch

    Wall Street Breakfast

    • The stock market’s four-week winning streak came to an end, in reaction to an overbought market that was due for a pullback. After Wall Street’s impressive recent rally and with central bank tightening in the pipeline, traders saw an opportunity to trim back positions.
    • Alex King spent over three decades on the buy-side. Among his recent picks are stocks such as Fortinet (+110%), HubSpot (+127%) and CrowdStrike (+63%).
    • And recently, ProShares UltraPro QQQ (TQQQ) was +27% in just two weeks, and Microsoft (MSFT) was +10% in three days.
    • Growth stocks were among the favorites for hedge fund whales in the second quarter. Monday was the deadline for hedge funds with more than $100M in assets under management, as well as other institutional investors and endowments, to report certain stock holdings through 13F filings. The 13F season gives investors a glimpse into where the big players are betting, albeit with dated information.
    • A number of hedge funds and money managers looked to pick up beaten-down growth stocks, such as tech, in Q2. From April to June, the Nasdaq 100 (NDX) (QQQ) fell more than 22%, while the S&P 500 (SP500) (SPY) was down about 16.5%.
    • Among the big-name disclosures, Warren Buffett’s Berkshire-Hathaway (BRK.A) (BRK.B) boosted its stake in Activision (ATVI) to ~68.4M shares from 64.3M. It also exited its stake in Verizon (VZ).
    • Retail resilience? While housing is struggling there were also signs for some optimism. Results from Home Depot (HD) and Lowe’s (LOW) showed that the home improvement consumer is “holding up quite well,” according to Oppenheimer analyst Brian Nagel.
    • Apple (AAPL) is reportedly looking at holding its annual fall product event, including the announcement of its iPhone 14 product line, on Sept. 7.

    ETF & Portfolio Strategy

    SPY: Targeting $428 For The Last Dance

    BUZZ Investing: Meme-Stocks Make A Comeback

    VIG: Put This Solid Dividend ETF On Your Radar

    Global Investing

    Glencore: Deeply Undervalued Dividend Gem

    Centerra Gold: The Most Extreme Discount In A Mid-Tier Gold Producer That I’ve Ever Seen

    Dividend Ideas

    Blackstone Is Buying REITs Hand Over Fist

    Lincoln National Is A No-Brainer Buy, Here’s Why

    Stock Ideas

    Occidental Petroleum: Is Warren Buffett Taking Over?

    Cleveland-Cliffs: We Simply Can’t Remember A Stock Like This

    Explore More


    Zacks Insights into this High Inflation/Rising Rate Market

    Macroaxis AI Investment Opportunity

    Bear Market Similarity Analysis using Nasdaq 100 Index Data

    Basic Stock Price Analysis in Python

    Stocks on Watch Tomorrow

    Investment Risk Management Study

    Inflation-Resistant Stocks to Buy

    20 Top Social Media Sites

    Track All Markets with TradingView

    Algorithmic Trading using Monte Carlo Predictions and 62 AI-Assisted Trading Technical Indicators (TTI)

    Macroaxis Wealth Optimization

    The Pup’s Weekend Dig – Tech Top & Energy Rotation