https://hiscidatmarketradetips.blogspot.com/2022/03/content-marketing-guide-for-absolute.html
  • Stocks to Watch in 2023: MarketBeat Ideas

    Stocks to Watch in 2023: MarketBeat Ideas

    Let’s review the 10 best stocks to own in 2023 (brought to you by Marketbeat.com).

    Featured Photo by Alesia Kozik on Pexels.

    Uncertainty is the Only Certainty for 2023:

    • Inflation and rising interest rates can pave the way for a lot of risk of recession in 2023.
    • The U.S. economy is facing a reckoning in terms of interest rates and the cost of business but it is a strong and resilient economy. It will bounce back, and when it does, it will give off a signal that cannot be ignored.
    • Long-term investors who build solid portfolios of dividend-paying stocks and reinvest those dividends will come out on top.
    • Until then, the best choices for investors include fundamentally sound blue chip companies with sustainable dividends and share repurchase programs. These stocks won’t be immune from downturns and dips but they should be insulated from them and they will likely keep on paying dividends as well. When the dips come, add to those positions and build a lever for portfolio growth for when the rebound begins, then keep adding to it.

    It’s Hard to Own Stocks When the Market is Falling:

    • The S&P 500 has already corrected more than 20% and it may fall further.
    • Mortgage rates have catapulted to their highest level since before the housing bubble.
    • Credit card rates remain at their highest levels in 30 years.
    • The labor market is still strong and U.S. consumers will most likely suck it up and pay the bills.

    Focus on Opportunity, Value and Dividends in 2023:

    On a sector basis, the energy sector looks like the best one to own for two

    primary reasons:

    • Valuation: The energy sector was valued at less than 10x its earnings going

    into the fourth quarter of 2022.Energy prices, while down from their peak,

    are still quite high relative to the prior year.

    • Earnings outlook: The energy sector is on track to produce 150% earnings

    growth in fiscal 2022. Analysts expect another 40% in 2023.

    Financials, reflected in the Financial Select Sector SPDR Fund (NYSEARCA: XLF), also looks attractive as a group. The sector struggles with earnings growth in 2022 but should still post positive results — the outlook for next year is much better. The group is expected to post 13.8% earnings growth and come in fourth place among the other 10 S&P 500 sectors.

    These are the 10 best stocks to own in 2023:

    • Occidental Petroleum: Energy is Still the Top Pick for Earnings
    • Schlumberger: An Oilfield Services Rebound is Brewing
    • Kraft Heinz – shares of Kraft Heinz are among the cheapest if not the cheapest stock in the consumer staples, represented by the Consumer Staples Select SPDR Fund (NYSEARCA: XLP) universe.
    • PepsiCo: A Diversified King of Consumer Staples
    • Lowe’s Companies: Another Crown Jewel for Dividend Investors
    • Levi Strauss: A Good Fit with Long-Wearing Potential
    • Duke Energy: Electrify Your Returns
    • Jabil Inc: got its start manufacturing PCBs or printed circuit boards in 1966.
    • Intel: A Deep Value/High Yield Combination
    • Camping World – The pandemic boosted the RV industry, which more than doubled over the next two years and then saw demand for new RVs peak.

    Explore More

    Energy:

    Stocks:

    Embed Socials


    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • GIS ML/AI: Multi-Label Classification of Satellite Images with Fast.AI

    GIS ML/AI: Multi-Label Classification of Satellite Images with Fast.AI

    Satellite image classification is the most significant technique used in remote sensing (GIS) for the computerized study and pattern recognition of satellite GIS, which is based on diversity structures of the image that involve rigorous validation of the training samples depending on the used ML/AI classification algorithm.

    Satellite imagery is important for many applications including disaster response, law enforcement, and environmental monitoring. These GIS applications require the automated AI-powered identification of objects and facilities in the imagery.

    In this post, we focus on the satellite image segmentation using Fast.Ai.

    Satellite imagery is being used together with AI and deep learning in many areas to produce stunning insights and discoveries. Today we look at applying this approach to recognising buildings, woodlands & water areas from satellite images.

    Conventionally, we use 4 classes for identifying objects in GIS images:

    • Building
    • Woodland
    • Water
    • Background (i.e. everything else).

    For this multi-label image classification problem, we will use the Planet dataset, where it’s a collection of satellite images with multiple labels describing the scene. 

    The entire workflow consists of the following steps:

    • Grab our input data
    • Train a model with fastai
    • QC with fastai metrics.

    Let’s set the working directory YOURPATH

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    and import the following libraries

    from fastai.vision.all import *

    import pandas as pd

    import torch
    from torch import nn

    from fastcore.meta import use_kwargs_dict

    from fastai.callback.fp16 import to_fp16
    from fastai.callback.progress import ProgressCallback
    from fastai.callback.schedule import lr_find, fit_one_cycle

    from fastai.data.block import MultiCategoryBlock, DataBlock
    from fastai.data.external import untar_data, URLs
    from fastai.data.transforms import RandomSplitter, ColReader

    from fastai.metrics import accuracy_multi, BaseLoss

    from fastai.vision.augment import aug_transforms
    from fastai.vision.data import ImageBlock
    from fastai.vision.learner import cnn_learner

    from torchvision.models import resnet34

    Let’s import the input dataset

    planet_source = untar_data(URLs.PLANET_SAMPLE)
    df = pd.read_csv(planet_source/’labels.csv’)

    Let’s check the content

    df.head()

    Let’s edit the data columns

    df = df[df[‘tags’] != ‘blow_down clear primary road’]

    batch_tfms = aug_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)

    planet = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
    get_x=ColReader(0, pref=f'{planet_source}/train/’, suff=’.jpg’),
    splitter=RandomSplitter(),
    get_y=ColReader(1, label_delim=’ ‘),
    batch_tfms = batch_tfms)

    dls = planet.dataloaders(df)

    and plot the first 9 selected images

    dls.show_batch(max_n=9, figsize=(12,9))

    We can also invoke the lambda function

    blocks = (ImageBlock, MultiCategoryBlock)

    get_x = lambda x:planet_source/’train’/f'{x[0]}.jpg’

    val = df.values[0]; val

    array(['train_21983', 'partly_cloudy primary'], dtype=object)

    get_x(df.values[0])

    get_y = lambda x:x[1].split(‘ ‘)

    planet = DataBlock(blocks=blocks,
    get_x=get_x,
    splitter=RandomSplitter(),
    get_y=get_y,
    batch_tfms=batch_tfms)

    dls = planet.dataloaders(df)
    dls.show_batch(max_n=9, figsize=(12,9))

    Let’s invoke planet.dataloaders

    def _planet_items(x): return (
    f'{planet_source}/train/’+x.image_name+’.jpg’, x.tags.str.split())

    planet = DataBlock.from_columns(blocks=(ImageBlock, MultiCategoryBlock),
    get_items = _planet_items,
    splitter=RandomSplitter(),
    batch_tfms=batch_tfms)

    dls = planet.dataloaders(df)
    dls.show_batch(max_n=9, figsize=(12,9))

    Let’s train the model

    from torchvision.models import resnet34

    from fastai.metrics import accuracy_multi

    learn = cnn_learner(dls, resnet34, pretrained=True, metrics=[accuracy_multi])

    class BCEWithLogitsLossFlat(BaseLoss):
    “Same as nn.BCEWithLogitsLoss, but flattens input and target.”
    @use_kwargs_dict(keep=True, weight=None, reduction=’mean’, pos_weight=None)
    def init(self, *args, axis=-1, floatify=True, thresh=0.5, **kwargs):
    if kwargs.get(‘pos_weight’, None) is not None and kwargs.get(‘flatten’, None) is True:
    raise ValueError(“flatten must be False when using pos_weight to avoid a RuntimeError due to shape mismatch”)
    if kwargs.get(‘pos_weight’, None) is not None: kwargs[‘flatten’] = False
    super().init(nn.BCEWithLogitsLoss, *args, axis=axis, floatify=floatify, is_2d=False, **kwargs)
    self.thresh = thresh

    def decodes(self, x):    return x>self.thresh
    def activation(self, x): return torch.sigmoid(x)
    

    learn.loss_func = BCEWithLogitsLossFlat()

    learn.lr_find()

    SuggestedLRs(valley=0.0020892962347716093)

    lr = 1e-2
    learn = learn.to_fp16()

    learn.fit_one_cycle(5, slice(lr))

    learn.save(‘stage-1’)

    Path('models/stage-1.pth')

    learn.unfreeze()
    learn.lr_find()

    SuggestedLRs(valley=7.585775892948732e-05)

    learn.fit_one_cycle(5, slice(1e-5, lr/5))

    learn.show_results(figsize=(15,15))

    Summary

    • Instead of cats & dogs, the Planet Competition Dataset consists of satellite images from the Amazonian region.
    • The task here consists of classifying which types of land covers are present on each image. We can have multiple landcovers types present on one image.
    • Here the task is a multi-label classification problem, where each image can belong to multiple classes.
    • Using pre-trained models is a good practice in general.

    Explore More

    fast.ai’s superresolution model on satellite imagery.

    Multi-label classification using fastai

    Multi-Label Keras CNN Image Classification of MNIST Fashion Clothing

    ML/AI Wildfire Prediction using Remote Sensing Data

    ML/AI Wildfire Prediction


    Embed Socials

    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • Multi-Label Keras CNN Image Classification of MNIST Fashion Clothing

    Multi-Label Keras CNN Image Classification of MNIST Fashion Clothing
    • Machine Learning (ML) and Deep Learning (DL) play a crucial role in managing efficient supply chain operations in the fashion retail industry.
    • Apart from large e-commerce brands like Amazon, even small-time fashion retailers are now using ML algorithms to understand fast-changing customer needs and expectations.
    • Neural network (NN) models are considered the most efficient and accurate forecasting methods, as they have demonstrated high performance in various business applications involving fashion e-commerce digital platforms.  
    • In fact, fashion clothing DL is the familiar problem of multi-label image classification. The key benefit of CNN is that the number of training model parameters is independent of the size of the original image. 
    • Following earlier DL studies, we train a Feedforward CNN model to classify images of clothing on train data and make predictions on test data. We use tf.keras throughout the project, a high-level API to build and train models in TensorFlow.

    The Fashion MNIST Dataset

    Fashion-MNIST is a dataset of Zalando’s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes:

    • 0 T-shirt/top
    • 1 Trouser
    • 2 Pullover
    • 3 Dress
    • 4 Coat
    • 5 Sandal
    • 6 Shirt
    • 7 Sneaker
    • 8 Bag
    • 9 Ankle boot

    Each image pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. 

    We aim to feed a 28 x 28 image (784 bytes) as an input to CNN, so that CNN can classify the image as one of the item labels.

    Model Version 1

    Let’s set the working directory YOURPATH

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    and import the following key libraries

    import tensorflow as tf

    import numpy as np
    import matplotlib.pyplot as plt

    print(tf.version)

    2.10.0

    Let’s load the input data

    fashion_mnist = tf.keras.datasets.fashion_mnist

    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

    and check the data structure

    train_images.shape

    (60000, 28, 28)

    len(train_labels)

    60000

    test_images.shape

    (10000, 28, 28)

    len(test_labels)

    10000

    Let’s plot a single image

    plt.figure()
    plt.imshow(train_images[1])
    plt.colorbar()
    plt.grid(False)
    plt.show()

    Example input image

    Let’s scale the images

    train_images = train_images / 255.0

    test_images = test_images / 255.0

    and plot 25 selected grayscale labeled images

    plt.figure(figsize=(10,10))
    for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(train_labels[i])
    plt.savefig(‘example25grayscaleimages.png’)

    Example 25 grayscale images

    Let’s design a simple CNN model, compile and train the model with optimizer=’adam’ and epochs=20

    model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation=’relu’),
    tf.keras.layers.Dense(10)])

    model.compile(optimizer=’adam’,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[‘accuracy’])

    model.fit(train_images, train_labels, epochs=20)

    Epoch 1/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.5018 - accuracy: 0.8240
    Epoch 2/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.3791 - accuracy: 0.8625
    Epoch 3/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.3360 - accuracy: 0.8773
    Epoch 4/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.3139 - accuracy: 0.8862
    Epoch 5/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2936 - accuracy: 0.8909
    Epoch 6/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2817 - accuracy: 0.8954
    Epoch 7/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2692 - accuracy: 0.8990
    Epoch 8/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2573 - accuracy: 0.9045
    Epoch 9/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2480 - accuracy: 0.9074
    Epoch 10/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2390 - accuracy: 0.9099
    Epoch 11/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2316 - accuracy: 0.9134
    Epoch 12/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2233 - accuracy: 0.9159
    Epoch 13/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2168 - accuracy: 0.9195
    Epoch 14/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2121 - accuracy: 0.9212
    Epoch 15/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2066 - accuracy: 0.9217
    Epoch 16/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2005 - accuracy: 0.9251
    Epoch 17/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1939 - accuracy: 0.9267
    Epoch 18/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1913 - accuracy: 0.9278
    Epoch 19/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1848 - accuracy: 0.9305
    Epoch 20/20
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1806 - accuracy: 0.9312

    Let’s check the CNN loss/accuracy

    test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)

    print(‘\nTest accuracy:’, test_acc)

    313/313 - 0s - loss: 0.3805 - accuracy: 0.8783 - 441ms/epoch - 1ms/step
    
    Test accuracy: 0.8783000111579895

    Let’s make predictions using test images

    probability_model = tf.keras.Sequential([model,
    tf.keras.layers.Softmax()])

    predictions = probability_model.predict(test_images)

    313/313 [==============================] - 0s 645us/step

    predictions[1]

    array([4.9210530e-06, 3.8770574e-15, 9.9937904e-01, 3.7104611e-11,
           5.7768257e-04, 1.3045125e-12, 3.8408274e-05, 1.2573967e-21,
           8.8773615e-12, 2.4979118e-16], dtype=float32)

    We can see that

    np.argmax(predictions[1])

    2

    which is consistent with the true test label

    test_labels[1]

    2

    Let’s invoke a couple of image plot functions

    def plot_image(i, predictions_array, true_label, img):
    true_label, img = true_label[i], img[i]
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])

    plt.imshow(img, cmap=plt.cm.binary)

    predicted_label = np.argmax(predictions_array)
    if predicted_label == true_label:
    color = ‘blue’
    else:
    color = ‘red’

    plt.xlabel(“{} {:2.0f}% ({})”.format(predicted_label,
    100*np.max(predictions_array),
    true_label),
    color=color)

    def plot_value_array(i, predictions_array, true_label):
    true_label = true_label[i]
    plt.grid(False)
    plt.xticks(range(10))
    plt.yticks([])
    thisplot = plt.bar(range(10), predictions_array, color=”#777777″)
    plt.ylim([0, 1])
    predicted_label = np.argmax(predictions_array)

    thisplot[predicted_label].set_color(‘red’)
    thisplot[true_label].set_color(‘blue’)

    Let’s plot a couple of selected images

    i = 1
    plt.figure(figsize=(6,3))
    plt.subplot(1,2,1)
    plot_image(i, predictions[i], test_labels, test_images)
    plt.subplot(1,2,2)
    plot_value_array(i, predictions[i], test_labels)
    plt.show()

    true label =2

    Predicted test image label=2

    i = 12
    plt.figure(figsize=(6,3))
    plt.subplot(1,2,1)
    plot_image(i, predictions[i], test_labels, test_images)
    plt.subplot(1,2,2)
    plot_value_array(i, predictions[i], test_labels)
    plt.show()

    true label =7

    Predicted test image label=7

    Let’s plot several test images, their predicted labels, and the true labels (recall that we color correct predictions in blue and incorrect predictions in red)

    num_rows = 5
    num_cols = 3
    num_images = num_rowsnum_cols plt.figure(figsize=(22num_cols, 2num_rows))
    for i in range(num_images):
    plt.subplot(num_rows, 2num_cols, 2i+1)
    plot_image(i, predictions[i], test_labels, test_images)
    plt.subplot(num_rows, 2num_cols, 2i+2)
    plot_value_array(i, predictions[i], test_labels)
    plt.tight_layout()

    plt.savefig(‘clothestrainpredict.png’)

    Plots of several test images, their predicted labels, and the true labels (recall that we color correct predictions in blue and incorrect predictions in red)

    We can grab an image from the test dataset
    img = test_images[1]

    print(img.shape)

    Add the image to a batch where it’s the only member
    img = (np.expand_dims(img,0))

    print(img.shape)
    predictions_single = probability_model.predict(img)

    print(predictions_single)

    (28, 28)
    (1, 28, 28)
    1/1 [==============================] - 0s 16ms/step
    [[4.9210530e-06 3.8770574e-15 9.9937904e-01 3.7104611e-11 5.7768257e-04
      1.3045125e-12 3.8408274e-05 1.2573967e-21 8.8773615e-12 2.4979118e-16]]

    plot_value_array(1, predictions_single[0], test_labels)

    plt.show()

    Probability of label=2 for test_images[1]

    Model Version 2

    Recall that we need to import the key libraries and prepare the input data

    import tensorflow as tf
    import numpy as np
    import matplotlib.pyplot as plt

    clothing_fashion_mnist = tf.keras.datasets.fashion_mnist

    while loading the dataset from tensorflow
    (x_train, y_train),(x_test, y_test) = clothing_fashion_mnist.load_data()

    and displaying the shapes of training and testing datasets
    print(‘Shape of training cloth images: ‘,
    x_train.shape)

    print(‘Shape of training label: ‘,
    y_train.shape)

    print(‘Shape of test cloth images: ‘,
    x_test.shape)

    print(‘Shape of test labels: ‘,
    y_test.shape)

    Shape of training cloth images:  (60000, 28, 28)
    Shape of training label:  (60000,)
    Shape of test cloth images:  (10000, 28, 28)
    Shape of test labels:  (10000,)

    Let’s store the class names

    label_class_names = [‘T-shirt/top’, ‘Trouser’,
    ‘Pullover’, ‘Dress’, ‘Coat’,
    ‘Sandal’, ‘Shirt’, ‘Sneaker’,
    ‘Bag’, ‘Ankle boot’]

    and display the selected image ii=2 with the colorbar

    plt.imshow(x_train[ii])
    plt.colorbar()
    plt.show()

    The selected image ii=2 with the colorbar

    Let’s normalize both training and testing datasets

    x_train = x_train / 255.0
    x_test = x_test / 255.0

    let’s plot the first 20 training images

    plt.figure(figsize=(15, 5)) # figure size
    i = 0
    while i < 20:
    plt.subplot(2, 10, i+1)

    plt.imshow(x_train[i], cmap=plt.cm.binary)
    plt.xlabel(label_class_names[y_train[i]])
    i = i+1
    

    plt.show()

    20 selected input images

    Let’s build the model
    model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation=’relu’),
    tf.keras.layers.Dense(10)
    ])

    compile the model
    model.compile(optimizer=’adam’,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True),
    metrics=[‘accuracy’])

    and fit the model to the training data
    model.fit(x_train, y_train, epochs=20)

    Epoch 1/20
    1875/1875 [==============================] - 10s 5ms/step - loss: 0.4958 - accuracy: 0.8256
    Epoch 2/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.3726 - accuracy: 0.8647
    Epoch 3/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.3324 - accuracy: 0.8790
    Epoch 4/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.3115 - accuracy: 0.8865
    Epoch 5/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2932 - accuracy: 0.8913
    Epoch 6/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2772 - accuracy: 0.8974
    Epoch 7/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2662 - accuracy: 0.9006
    Epoch 8/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2559 - accuracy: 0.9043
    Epoch 9/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2471 - accuracy: 0.9082
    Epoch 10/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2377 - accuracy: 0.9115
    Epoch 11/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2305 - accuracy: 0.9125
    Epoch 12/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2236 - accuracy: 0.9163
    Epoch 13/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2156 - accuracy: 0.9190
    Epoch 14/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2106 - accuracy: 0.9215
    Epoch 15/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.2064 - accuracy: 0.9224
    Epoch 16/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.1977 - accuracy: 0.9251
    Epoch 17/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.1925 - accuracy: 0.9283
    Epoch 18/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.1882 - accuracy: 0.9295
    Epoch 19/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.1845 - accuracy: 0.9306
    Epoch 20/20
    1875/1875 [==============================] - 9s 5ms/step - loss: 0.1796 - accuracy: 0.9328

    Let’s calculate loss/accuracy score
    test_loss, test_acc = model.evaluate(x_test,
    y_test,
    verbose=2)
    print(‘\nTest loss:’, test_loss)
    print(‘\nTest accuracy:’, test_acc)

    313/313 - 1s - loss: 0.3656 - accuracy: 0.8845 - 1s/epoch - 4ms/step
    
    Test loss: 0.3656046390533447
    
    Test accuracy: 0.8845000267028809

    We use Softmax() function to convert linear output logits to probability

    prediction_model = tf.keras.Sequential(
    [model, tf.keras.layers.Softmax()])

    and make predictions of test data

    prediction = prediction_model.predict(x_test)

    Let’s look at the test image with ii=1

    print(‘Predicted test label:’, np.argmax(prediction[ii]))

    print(label_class_names[np.argmax(prediction[ii])])

    print(‘Actual test label:’, y_test[ii])

    313/313 [==============================] - 0s 1ms/step
    Predicted test label: 1
    Trouser
    Actual test label: 1
    

    Let’s plot 24 selected test images

    plt.figure(figsize=(15, 6))
    i = 0

    while i < 24:
    image, actual_label = x_test[i], y_test[i]
    predicted_label = np.argmax(prediction[i])
    plt.subplot(3, 8, i+1)
    plt.tight_layout()
    plt.xticks([])
    plt.yticks([])

    # display plot
    plt.imshow(image)
    
    # if else condition to distinguish right and
    # wrong
    if predicted_label == actual_label:color, label = ('green', 'Correct Prediction')
    if predicted_label != actual_label:color, label = ('red', 'Incorrect Prediction')
    
    # plotting labels and giving color to it
    # according to its correctness
    plt.title(label, color=color)
    
    # labelling the images in x-axis to see
    # the correct and incorrect results
    plt.xlabel(" {} ~ {} ".format(
        label_class_names[actual_label],
      label_class_names[predicted_label]))
    
    # labelling the images orderwise in y-axis
    plt.ylabel(i)
    # incrementing counter variable
    i += 1    
    
    Plot 24 test images with correct (green) and incorrect (red) predicted labels.

    Model Version 3

    Let’s import the key libraries and load the input dataset

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import keras
    import tensorflow as tf
    print(tf.version)

    2.10.0

    fashion_mnist = tf.keras.datasets.fashion_mnist

    (X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

    Let’s explore the dataset

    Check the shape and size of X_train, X_test, y_train, y_test
    print (“Number of observations in training data: ” + str(len(X_train)))
    print (“Number of labels in training data: ” + str(len(y_train)))
    print (“Dimensions of a single image in X_train:” + str(X_train[0].shape))
    print(“————————————————————-\n”)
    print (“Number of observations in test data: ” + str(len(X_test)))
    print (“Number of labels in test data: ” + str(len(y_test)))
    print (“Dimensions of single image in X_test:” + str(X_test[0].shape))

    Number of observations in training data: 60000
    Number of labels in training data: 60000
    Dimensions of a single image in X_train:(28, 28)
    -------------------------------------------------------------
    
    Number of observations in test data: 10000
    Number of labels in test data: 10000
    Dimensions of single image in X_test:(28, 28)

    Let’s set the label list

    class_labels = [‘T-shirt/top’,’Trouser’,’Pullover’,’Dress’,’Coat’,’Sandal’,’Shirt’,’Sneakers’,’Bag’,’Ankle boot’]

    and plot the selected training image

    ii=1

    plt.figure(figsize = (8,8))
    plt.imshow(X_train[ii],cmap = ‘Greys’);

    The selected training image

i=1

    We can also plot the next image

    ii1=ii+1

    plt.figure(figsize = (8,8))
    plt.imshow(X_train[ii1],cmap = ‘Greys’);

    The selected training image

i=2

    Let’s plot first 25 images from the training set and display the class name below each image

    plt.figure(figsize=(20,16))
    for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(X_train[i], cmap=’Greys’)
    plt.xlabel(class_labels[y_train[i]])

    plt.savefig(‘clothesgrey.png’)

    First 25 images from the training set with the class name below each image

    Let’s scale the data

    X_train = X_train / 255.0

    X_test = X_test / 255.0

    and check the shape

    X_train.shape , y_train.shape

    ((60000, 28, 28), (60000,))

    Let’s build, compile and train the CNN model

    model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation=’relu’),
    tf.keras.layers.Dense(10)])

    model.compile(optimizer=’adam’,loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),metrics=[‘accuracy’])

    model.fit(X_train, y_train, epochs=50)

    Epoch 1/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.4999 - accuracy: 0.8251
    Epoch 2/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.3732 - accuracy: 0.8652
    Epoch 3/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.3342 - accuracy: 0.8793
    Epoch 4/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.3111 - accuracy: 0.8856
    Epoch 5/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2923 - accuracy: 0.8928
    Epoch 6/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2795 - accuracy: 0.8971
    Epoch 7/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2678 - accuracy: 0.9009
    Epoch 8/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2548 - accuracy: 0.9048
    Epoch 9/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2473 - accuracy: 0.9081
    Epoch 10/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2379 - accuracy: 0.9113
    Epoch 11/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2308 - accuracy: 0.9126
    Epoch 12/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2228 - accuracy: 0.9168
    Epoch 13/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2160 - accuracy: 0.9185
    Epoch 14/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2119 - accuracy: 0.9211
    Epoch 15/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.2039 - accuracy: 0.9234
    Epoch 16/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1991 - accuracy: 0.9255
    Epoch 17/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1923 - accuracy: 0.9282
    Epoch 18/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1877 - accuracy: 0.9291
    Epoch 19/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1838 - accuracy: 0.9329
    Epoch 20/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1769 - accuracy: 0.9342
    Epoch 21/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1728 - accuracy: 0.9363
    Epoch 22/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1679 - accuracy: 0.9374
    Epoch 23/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1655 - accuracy: 0.9376
    Epoch 24/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1610 - accuracy: 0.9391
    Epoch 25/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1565 - accuracy: 0.9419
    Epoch 26/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1525 - accuracy: 0.9433
    Epoch 27/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1505 - accuracy: 0.9438
    Epoch 28/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1474 - accuracy: 0.9446
    Epoch 29/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1431 - accuracy: 0.9470
    Epoch 30/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1410 - accuracy: 0.9473
    Epoch 31/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1400 - accuracy: 0.9470
    Epoch 32/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1358 - accuracy: 0.9491
    Epoch 33/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1302 - accuracy: 0.9518
    Epoch 34/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1301 - accuracy: 0.9515
    Epoch 35/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1275 - accuracy: 0.9522
    Epoch 36/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1234 - accuracy: 0.9542
    Epoch 37/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1218 - accuracy: 0.9545
    Epoch 38/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1195 - accuracy: 0.9559
    Epoch 39/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1188 - accuracy: 0.9557
    Epoch 40/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1161 - accuracy: 0.9559
    Epoch 41/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1122 - accuracy: 0.9578
    Epoch 42/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1120 - accuracy: 0.9578
    Epoch 43/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1081 - accuracy: 0.9591
    Epoch 44/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1102 - accuracy: 0.9593
    Epoch 45/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1053 - accuracy: 0.9612
    Epoch 46/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1030 - accuracy: 0.9620
    Epoch 47/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.1017 - accuracy: 0.9622
    Epoch 48/50
    1875/1875 [==============================] - 4s 2ms/step - loss: 0.1012 - accuracy: 0.9627
    Epoch 49/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.0988 - accuracy: 0.9635
    Epoch 50/50
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.0978 - accuracy: 0.9639

    Model Accuracy Results:
    print(“Results:”)
    print(“———————“)
    scores_train = model.evaluate(X_train, y_train, verbose= 2)
    print(“Training Accuracy: %.2f%%\n” % (scores_train[1] * 100))
    scores_test = model.evaluate(X_test, y_test, verbose= 2)
    print(“Testing Accuracy: %.2f%%\n” % (scores_test[1] * 100))

    Results:
    ---------------------
    1875/1875 - 2s - loss: 0.0837 - accuracy: 0.9688 - 2s/epoch - 1ms/step
    Training Accuracy: 96.88%
    
    313/313 - 0s - loss: 0.4888 - accuracy: 0.8870 - 461ms/epoch - 1ms/step
    Testing Accuracy: 88.70%

    HPO

    Let’s import the necessary packages
    from sklearn.model_selection import GridSearchCV, KFold
    from keras.models import Sequential
    from keras.layers import Dense,Flatten
    from keras.wrappers.scikit_learn import KerasClassifier

    and start defining the model
    def create_model():
    model=Sequential()
    model.add(Flatten(input_shape=(28,28)))
    model.add(Dense(128,kernel_initializer=’normal’,activation=’relu’))
    model.add(Dense(8,kernel_initializer=’normal’,activation=’relu’))
    model.add(Dense(10,activation=’softmax’))
    model.compile(loss = ‘sparse_categorical_crossentropy’, optimizer = ‘Adam’, metrics = [‘accuracy’])
    return model

    Let’s create the Keras model
    model= KerasClassifier(build_fn=create_model, verbose=0)

    Define the grid search parameters
    epochs = [5,10,50,100]

    Make a dictionary of the grid search parameters
    param_grid = dict(epochs=epochs)

    Build and fit the GridSearchCV
    grid = GridSearchCV(estimator=model, param_grid=param_grid, cv = KFold(3), verbose=10)
    grid_result = grid.fit(X_train, y_train)

    Summarize the results
    print(“Best: {0}, using {1}”.format(grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_[‘mean_test_score’]
    stds = grid_result.cv_results_[‘std_test_score’]
    params = grid_result.cv_results_[‘params’]
    for mean, stdev, param in zip(means, stds, params):
    print(‘{0} ({1}) with: {2}’.format(mean, stdev, param))

    Fitting 3 folds for each of 4 candidates, totalling 12 fits
    [CV 1/3; 1/4] START epochs=5....................................................
    [CV 1/3; 1/4] END .....................epochs=5;, score=0.868 total time=  12.0s
    [CV 2/3; 1/4] START epochs=5....................................................
    [CV 2/3; 1/4] END .....................epochs=5;, score=0.878 total time=  11.6s
    [CV 3/3; 1/4] START epochs=5....................................................
    [CV 3/3; 1/4] END .....................epochs=5;, score=0.874 total time=  11.3s
    [CV 1/3; 2/4] START epochs=10...................................................
    [CV 1/3; 2/4] END ....................epochs=10;, score=0.873 total time=  22.2s
    [CV 2/3; 2/4] START epochs=10...................................................
    [CV 2/3; 2/4] END ....................epochs=10;, score=0.886 total time=  22.4s
    [CV 3/3; 2/4] START epochs=10...................................................
    [CV 3/3; 2/4] END ....................epochs=10;, score=0.876 total time=  22.8s
    [CV 1/3; 3/4] START epochs=50...................................................
    [CV 1/3; 3/4] END ....................epochs=50;, score=0.882 total time= 1.8min
    [CV 2/3; 3/4] START epochs=50...................................................
    [CV 2/3; 3/4] END ....................epochs=50;, score=0.890 total time= 1.8min
    [CV 3/3; 3/4] START epochs=50...................................................
    [CV 3/3; 3/4] END ....................epochs=50;, score=0.885 total time= 1.8min
    [CV 1/3; 4/4] START epochs=100..................................................
    [CV 1/3; 4/4] END ...................epochs=100;, score=0.882 total time= 3.8min
    [CV 2/3; 4/4] START epochs=100..................................................
    [CV 2/3; 4/4] END ...................epochs=100;, score=0.891 total time= 3.6min
    [CV 3/3; 4/4] START epochs=100..................................................
    [CV 3/3; 4/4] END ...................epochs=100;, score=0.885 total time= 3.7min
    Best: 0.8861166636149088, using {'epochs': 100}
    0.8736000061035156 (0.00386415461356267) with: {'epochs': 5}
    0.8786500096321106 (0.005455429510998301) with: {'epochs': 10}
    0.8859000205993652 (0.0032629225436568263) with: {'epochs': 50}
    0.8861166636149088 (0.0034991337511122603) with: {'epochs': 100}

    Let’s find the best optimizer:

    from keras.layers import Dropout

    taken from previous results
    epochs= 50
    batch_size=50
    learn_rate = 0.001
    dropout_rate = 0.1
    init = ‘normal’
    activation = ‘tanh’

    Start defining the model
    def create_model(optimizer=’adam’):
    model=Sequential()
    model.add(Flatten(input_shape=(28, 28)))
    model.add(Dense(16, kernel_initializer = init, activation = activation))
    model.add(Dropout(dropout_rate))
    model.add(Dense(8, kernel_initializer = init, activation = activation))
    model.add(Dropout(dropout_rate))
    model.add(Dense(10,activation=’softmax’))

    import tensorflow as tf
    opt = tf.keras.optimizers.Adam(learning_rate = learn_rate)
    model.compile(loss = ‘sparse_categorical_crossentropy’, optimizer = ‘Adam’, metrics = [‘accuracy’])
    return model

    Create the model
    model = KerasClassifier(build_fn = create_model, epochs=epochs, batch_size=batch_size, verbose = 0) # This comes from the previous best

    Define the grid search parameters
    optimizer = [‘SGD’, ‘RMSprop’, ‘Adagrad’, ‘Adadelta’, ‘Adam’, ‘Adamax’, ‘Nadam’]

    Make a dictionary of the grid search parameters
    param_grid = dict(optimizer=optimizer)

    Build and fit the GridSearchCV
    grid = GridSearchCV(estimator=model, param_grid=param_grid, cv = KFold(3), verbose=10)
    grid_result = grid.fit(X_train, y_train)

    Summarize the results
    print(“Best: {0}, using {1}”.format(grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_[‘mean_test_score’]
    stds = grid_result.cv_results_[‘std_test_score’]
    params = grid_result.cv_results_[‘params’]
    for mean, stdev, param in zip(means, stds, params):
    print(‘{0} ({1}) with: {2}’.format(mean, stdev, param))

    Best: 0.8654166658719381, using {'optimizer': 'Nadam'}

    Train Test Split the Training Data to 70% and Validation Data to 30%
    import sklearn
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X_train,X_val,y_train,y_val = train_test_split(X_train,y_train,test_size=0.3,random_state= 100)

    defining input neurons

    input_neurons = X_train.shape[1]

    define number of output neurons

    output_neurons = 10

    importing the sequential model

    from keras.models import Sequential

    importing different layers from keras

    from keras.layers import InputLayer, Dense
    from keras.layers import Dropout

    Number of hidden layers and hidden neurons
    Applying hyperparameters obtained using GridSearch CV
    Define hidden layers and neuron in each layer

    number_of_hidden_layers = 2
    neuron_hidden_layer_1 = 16
    neuron_hidden_layer_2 = 8

    Defining the CNN architecture of the model

    model_final = Sequential()
    model_final.add(Flatten(input_shape=(28, 28)))
    model_final.add(Dense(units=neuron_hidden_layer_1, kernel_initializer = ‘normal’, activation=’tanh’))
    model_final.add(Dropout(0.1))
    model_final.add(Dense(units=neuron_hidden_layer_2,kernel_initializer = ‘normal’, activation=’tanh’))
    model_final.add(Dropout(0.1))
    model_final.add(Dense(units=output_neurons,activation=’softmax’))

    Summary of the neural network model

    model_final.summary()

    Model: "sequential_190"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #   
    =================================================================
     flatten_190 (Flatten)       (None, 784)               0         
                                                                     
     dense_569 (Dense)           (None, 16)                12560     
                                                                     
     dropout_332 (Dropout)       (None, 16)                0         
                                                                     
     dense_570 (Dense)           (None, 8)                 136       
                                                                     
     dropout_333 (Dropout)       (None, 8)                 0         
                                                                     
     dense_571 (Dense)           (None, 10)                90        
                                                                     
    =================================================================
    Total params: 12,786
    Trainable params: 12,786
    Non-trainable params: 0
    Compiling the model:
    loss as “sparse_categorical_crossentropy”, since we have multi class classification problem
    defining the optimizer as “Nadam” obtained in GridSearhCV
    Evaluation metric as “accuracy”
    Define learning rate obtained in GridSearhCV

    learn_rate = 0.001
    import tensorflow as tf
    opt = tf.keras.optimizers.Adam(learning_rate = learn_rate)
    model_final.compile(loss=’sparse_categorical_crossentropy’,optimizer=’Nadam’,metrics=[‘accuracy’])

    training the model with best hyperparamters obtained in GridSearchCV

    passing the independent and dependent features for training set for training the model
    validation data will be evaluated at the end of each epoch
    storing the trained model in model_history variable which will be used to visualize the training process

    model_history = model_final.fit(X_train, y_train, validation_data=(X_val, y_val), epochs= 50,batch_size = 50)

    Epoch 1/50
    840/840 [==============================] - 6s 6ms/step - loss: 1.0740 - accuracy: 0.6631 - val_loss: 0.6646 - val_accuracy: 0.7894
    Epoch 2/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.6450 - accuracy: 0.7909 - val_loss: 0.5029 - val_accuracy: 0.8327
    Epoch 3/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.5550 - accuracy: 0.8142 - val_loss: 0.4626 - val_accuracy: 0.8408
    Epoch 4/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.5227 - accuracy: 0.8204 - val_loss: 0.4363 - val_accuracy: 0.8471
    Epoch 5/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.5020 - accuracy: 0.8282 - val_loss: 0.4309 - val_accuracy: 0.8478
    Epoch 6/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4891 - accuracy: 0.8310 - val_loss: 0.4329 - val_accuracy: 0.8443
    Epoch 7/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4795 - accuracy: 0.8325 - val_loss: 0.4295 - val_accuracy: 0.8448
    Epoch 8/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4725 - accuracy: 0.8368 - val_loss: 0.4131 - val_accuracy: 0.8558
    Epoch 9/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4603 - accuracy: 0.8408 - val_loss: 0.4056 - val_accuracy: 0.8573
    Epoch 10/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4584 - accuracy: 0.8416 - val_loss: 0.4056 - val_accuracy: 0.8581
    Epoch 11/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4552 - accuracy: 0.8448 - val_loss: 0.4048 - val_accuracy: 0.8583
    Epoch 12/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4480 - accuracy: 0.8440 - val_loss: 0.4021 - val_accuracy: 0.8592
    Epoch 13/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4430 - accuracy: 0.8461 - val_loss: 0.4057 - val_accuracy: 0.8594
    Epoch 14/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4446 - accuracy: 0.8440 - val_loss: 0.3988 - val_accuracy: 0.8597
    Epoch 15/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4388 - accuracy: 0.8476 - val_loss: 0.4053 - val_accuracy: 0.8598
    Epoch 16/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4339 - accuracy: 0.8492 - val_loss: 0.4028 - val_accuracy: 0.8624
    Epoch 17/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4294 - accuracy: 0.8500 - val_loss: 0.3943 - val_accuracy: 0.8641
    Epoch 18/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4295 - accuracy: 0.8515 - val_loss: 0.4026 - val_accuracy: 0.8603
    Epoch 19/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4271 - accuracy: 0.8509 - val_loss: 0.4087 - val_accuracy: 0.8575
    Epoch 20/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4220 - accuracy: 0.8532 - val_loss: 0.4023 - val_accuracy: 0.8597
    Epoch 21/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4214 - accuracy: 0.8539 - val_loss: 0.3942 - val_accuracy: 0.8612
    Epoch 22/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4184 - accuracy: 0.8544 - val_loss: 0.3901 - val_accuracy: 0.8631
    Epoch 23/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4191 - accuracy: 0.8547 - val_loss: 0.3995 - val_accuracy: 0.8612
    Epoch 24/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4122 - accuracy: 0.8547 - val_loss: 0.3901 - val_accuracy: 0.8639
    Epoch 25/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4138 - accuracy: 0.8561 - val_loss: 0.3982 - val_accuracy: 0.8597
    Epoch 26/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4097 - accuracy: 0.8594 - val_loss: 0.3943 - val_accuracy: 0.8635
    Epoch 27/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4116 - accuracy: 0.8563 - val_loss: 0.3976 - val_accuracy: 0.8593
    Epoch 28/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4089 - accuracy: 0.8572 - val_loss: 0.3948 - val_accuracy: 0.8617
    Epoch 29/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4078 - accuracy: 0.8586 - val_loss: 0.3871 - val_accuracy: 0.8669
    Epoch 30/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4040 - accuracy: 0.8593 - val_loss: 0.3954 - val_accuracy: 0.8616
    Epoch 31/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4013 - accuracy: 0.8623 - val_loss: 0.3928 - val_accuracy: 0.8633
    Epoch 32/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4004 - accuracy: 0.8612 - val_loss: 0.3894 - val_accuracy: 0.8659
    Epoch 33/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3993 - accuracy: 0.8605 - val_loss: 0.4030 - val_accuracy: 0.8590
    Epoch 34/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3986 - accuracy: 0.8612 - val_loss: 0.4017 - val_accuracy: 0.8603
    Epoch 35/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3985 - accuracy: 0.8608 - val_loss: 0.3932 - val_accuracy: 0.8644
    Epoch 36/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.4004 - accuracy: 0.8608 - val_loss: 0.3909 - val_accuracy: 0.8640
    Epoch 37/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3950 - accuracy: 0.8630 - val_loss: 0.3978 - val_accuracy: 0.8603
    Epoch 38/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3935 - accuracy: 0.8626 - val_loss: 0.3922 - val_accuracy: 0.8643
    Epoch 39/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3910 - accuracy: 0.8630 - val_loss: 0.3865 - val_accuracy: 0.8649
    Epoch 40/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3932 - accuracy: 0.8618 - val_loss: 0.3873 - val_accuracy: 0.8664
    Epoch 41/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3890 - accuracy: 0.8640 - val_loss: 0.4033 - val_accuracy: 0.8602
    Epoch 42/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3916 - accuracy: 0.8626 - val_loss: 0.3934 - val_accuracy: 0.8642
    Epoch 43/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3889 - accuracy: 0.8646 - val_loss: 0.3925 - val_accuracy: 0.8613
    Epoch 44/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3898 - accuracy: 0.8644 - val_loss: 0.3942 - val_accuracy: 0.8623
    Epoch 45/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3860 - accuracy: 0.8651 - val_loss: 0.3838 - val_accuracy: 0.8672
    Epoch 46/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3834 - accuracy: 0.8640 - val_loss: 0.3920 - val_accuracy: 0.8636
    Epoch 47/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3833 - accuracy: 0.8658 - val_loss: 0.3885 - val_accuracy: 0.8633
    Epoch 48/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3854 - accuracy: 0.8653 - val_loss: 0.3885 - val_accuracy: 0.8665
    Epoch 49/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3814 - accuracy: 0.8678 - val_loss: 0.3935 - val_accuracy: 0.8648
    Epoch 50/50
    840/840 [==============================] - 5s 6ms/step - loss: 0.3811 - accuracy: 0.8682 - val_loss: 0.3949 - val_accuracy: 0.8650

    Let’s evaluate the model

    model_final.evaluate(X_train, y_train)

    1313/1313 [==============================] - 2s 2ms/step - loss: 0.3010 - accuracy: 0.8934
    

    [0.30102264881134033, 0.8934047818183899]

    model_final.evaluate(X_val, y_val)

    563/563 [==============================] - 1s 2ms/step - loss: 0.3949 - accuracy: 0.8650
    

    Out[34]:

    [0.3948652148246765, 0.8650000095367432]

    Evaluation Report

    Model-3 Accuracy Results:
    print(“Results:”)
    print(“——–“)
    scores_train = model_final.evaluate(X_train, y_train, verbose=2)
    print(“Training Accuracy: %.2f%%\n” % (scores_train[1] * 100))
    scores_val = model_final.evaluate(X_val, y_val, verbose= 2)
    print(“Validation Accuracy: %.2f%%\n” % (scores_val[1] * 100))

    Results:
    --------
    1313/1313 - 2s - loss: 0.3010 - accuracy: 0.8934 - 2s/epoch - 1ms/step
    Training Accuracy: 89.34%
    
    563/563 - 1s - loss: 0.3949 - accuracy: 0.8650 - 731ms/epoch - 1ms/step
    Validation Accuracy: 86.50%

    Summarize history for loss
    plt.figure(figsize = (15,7))
    plt.plot(model_history.history[‘loss’])
    plt.plot(model_history.history[‘val_loss’])
    plt.title(‘Model loss’)
    plt.ylabel(‘Loss’)
    plt.xlabel(‘Epoch’)
    plt.legend([‘Train Loss’, ‘Validation Loss’], loc=’upper right’)
    plt.xlim(0,50)
    plt.ylim(0.1,1.0)
    plt.show()

    Model train/validation loss

    Summarize history for accuracy
    plt.figure(figsize = (15,7))
    plt.plot(model_history.history[‘accuracy’])
    plt.plot(model_history.history[‘val_accuracy’])
    plt.title(‘Model Accuracy’)
    plt.ylabel(‘Accuracy’)
    plt.xlabel(‘Epoch’)
    plt.legend([‘Train Accuracy’, ‘Validation Accuracy’], loc=’upper right’)
    plt.xlim(0,50)
    plt.ylim(0.5,1.0)
    plt.show()

    Model train/validation accuracy

    Let’s evaluate the test score

    scores_test = model_final.evaluate(X_test,y_test)

    313/313 [==============================] - 0s 2ms/step - loss: 0.4227 - accuracy: 0.8564

    scores_test = model_final.evaluate(X_test, y_test, verbose=2)
    print(“Testing Accuracy: %.2f%%\n” % (scores_test[1] * 100))

    313/313 - 0s - loss: 0.4227 - accuracy: 0.8564 - 424ms/epoch - 1ms/step
    Testing Accuracy: 85.64%

    Add a softmax layer to convert the model’s linear outputs logits to probabilities, which should be easier to interpret
    probability_model = tf.keras.Sequential([model_final, tf.keras.layers.Softmax()])

    Let’s make predictions

    predictions = probability_model.predict(X_test)

    313/313 [==============================] - 0s 709us/step

    Model has predicted the label for each image in the testing set. Let’s take a look at the first prediction:

    ii=1
    predictions[ii]

    array([0.08596796, 0.08593319, 0.22232993, 0.08594155, 0.08826148,
           0.08593146, 0.08782255, 0.08593164, 0.08594636, 0.08593389],
          dtype=float32)

    np.argmax(predictions[ii])

    2

    y_test[ii]

    2

    Let’s plot the test image

    plt.figure(figsize = (8,8))
    plt.imshow(X_test[ii],cmap = ‘Greys’);

    Test image y_test[2]

    Let’s look at the multi-label confusion matrix

    from sklearn.metrics import confusion_matrix
    plt.figure(figsize = (16,8))
    y_pred_labels = [np.argmax(label) for label in predictions]
    cm = confusion_matrix(y_test,y_pred_labels)

    HeatMap:

    sns.heatmap(cm , annot = True,fmt = ‘d’,xticklabels = class_labels,yticklabels = class_labels,cmap = ‘viridis’);

    The multilabel confusion matrix

    Let’s print the multi-label classification report

    from sklearn.metrics import classification_report
    report = classification_report (y_test,y_pred_labels,target_names = class_labels)
    print(report)

            precision    recall  f1-score   support
    
     T-shirt/top       0.81      0.80      0.80      1000
         Trouser       0.98      0.96      0.97      1000
        Pullover       0.77      0.76      0.76      1000
           Dress       0.82      0.90      0.86      1000
            Coat       0.75      0.80      0.77      1000
          Sandal       0.95      0.93      0.94      1000
           Shirt       0.68      0.59      0.63      1000
        Sneakers       0.90      0.96      0.93      1000
             Bag       0.94      0.96      0.95      1000
      Ankle boot       0.96      0.92      0.94      1000
    
        accuracy                           0.86     10000
       macro avg       0.86      0.86      0.86     10000
    weighted avg       0.86      0.86      0.86     10000
    
    f1-score

    Conclusions

    • In this post, we discussed how to address the multi-label image classification problem by implementing a CNN model using Keras, TensorFlow and GridSearchCV.
    • Specifically, we discovered how to develop a CNN for clothing classification from scratch.
    •  We looked at the entire process of implementing a feedforward CNN model on the Fashion-MNIST dataset to classify images of clothing apparel on train data and make predictions on test data using GridSearchCV Hyperparameter tuning technique to achieve the best accuracy and performance.
    • In this tutorial, you discovered how to develop a convolutional neural network for clothing classification from scratch.
    • We have learned:
    • How to develop a robust evaluation of a DL model and establish a baseline of performance for a multi-label image classification task.
    • How to explore extensions to a baseline model to improve learning and model capacity via hyper-parameter tuning.
    • How to develop a finalized model, evaluate the performance of the final model, and use it to make predictions on new images.

    Explore More

    Short-Term Stock Market Price Prediction using Deep Learning Models

    Supervised ML/AI Stock Prediction using Keras LSTM Models

    Start Your Shopify Business

    E-Commerce ML/AI Classification

    E-Commerce Data Science Use-Case


    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • Top Digital Marketing Trends 2022-Q1’23

    Top Digital Marketing Trends 2022-Q1’23

    Following Hubspot, let’s summarize top digital marketing trends 2022-Q1’23 with the focus on social media marketing, influencer marketing (#1 emerging trend), short-form video content, investments into Twitter Spaces (TS), and mobile marketing strategy.

    Key points to consider:

    • The top source of website traffic is direct — showing the importance of brand awareness.
    • New/emerging platforms – TS, YouTube Shorts, Instagram Live Rooms, Facebook Live Shopping, and Spotify Green Room
    • Top marketing channels – social media, website/blog, email marketing, content marketing, influencer marketing, SEO marketing, and virtual events.
    • Website Device Traffic Breakdown – mobile 41%, desktop 38%, and tablet 19%.
    • Top Website Traffic Sources. Source: Hubspot.
    Top Website Traffic Sources. Source: Hubspot.
    • Content marketing: videos, blogs, images, infographics, and case studies.
    • Social media platforms. Source: Hubspot.
    Social media platforms. Source: Hubspot
    • Email marketing: B2C brands are more likely to find email marketing to be impactful. Top 10 email clients overall (Source: Litmus May 2022 Email Client Market Share)
     Top 10 email clients overall (Source: Litmus May 2022 Email Client Market Share)
    • Digital Advertising Trends – Which channel has the best ROI on paid social media campaigns. Source: Hubspot.
    Which channel has the best ROI on paid social media campaigns. Source: Hubspot.
    • Marketers Are Ramping Up Their Video Efforts in 2023

    The most common video marketing goals brands hope to achieve this year are: driving brand awareness, educating consumers about their products, increasing customer engagement, and generating leads. This indicates that brands are seeing
    video impact at every stage of their customers’ journey.

    • Influencer marketing is changing rapidly, even since 2021. This year, marketers are increasing influencer marketing efforts on Facebook, Instagram, TikTok, and YouTube, and decreasing investment on Snapchat and Twitch. Marketers are facing challenges measuring ROI of campaigns, balancing the cost of working with influencers, developing a creative strategy for campaigns, and maintaining brand safety.
    • Which types of influencers are marketers working with? Source: Hubspot.
    Which types of influencers 
are marketers working with? Source: Hubspot

    Influencer Marketing Spend by Influencer Audience Size. Source: Hubspot.

    Influencer Marketing Spend by Influencer Audience Size. Source: Hubspot.
    • B2B (52%) and B2C (48%) Marketing Trends

    B2B and B2C brands are both finding success on TikTok.

    B2B brands are more likely to use LinkedIn and find it effective, but B2C brands are more likely to feel that billboards and physical ads are more successful.

    B2C marketers are more likely to expect their marketing budget to increase in 2023.

    What’s working for B2B brands vs. What’s working for B2C brands. Source: Hubspot.

    What’s working for B2B brands vs. What’s working for B2C brands. Source: Hubspot.
    • The Road Ahead – 6 Marketing Trends
    1. More B2B brands on TikTok

    2. Rising B2C investment in SEO, content marketing, and podcasting

    3. AI and machine learning in marketing

    4. Automation and growing investment in marketing and revenue operations

    5. The short-form video boom will continue, and lead to more short-form video ads

    6. Influencer marketing will grow and evolve, with continued growth in the creator economy.

    Bottom Line: The future of marketing is bright, and powered by creatives.

    Explore More

    The ABC Guide to WordPress SEO/SEM

    The 5-Step Tech Blogging Roadmap

    10 AI-Powered Websites for Content Writers

    Marketing Summer 2022 Update

    Marketing Q3 ’22 Round-Up

    HubSpot TikTok Marketing 2022

    20 Top Social Media Sites

    How to Build a MarTech Stack from Scratch

    Content Marketing


    Featured Photo by Negative Space on Pexels

    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • Top E-Commerce Trends in Q1’23

    Top E-Commerce Trends in Q1’23

    Featured Photo by PhotoMIX Company on Pexels

    • E-commerce has grown to become a giant pillar in the global economy.
    • The current global e-commerce industry is measured as $5.7 trillion in 2022. The global ecommerce growth rate for 2023 is forecast at 10.4%, bringing global ecommerce sales worldwide to $6.3 trillion. This marks a 0.7 percentage point increase from 2022’s growth rate, which followed a massive dip from 2021.
    • For instance, ecommerce sales growth worldwide is expected at 9.6% in 2024 and 8.9% in 2025. In 2026, online retail growth is forecast at 8.2%.
    • Despite the falling growth rates, the ecommerce share of retail sales is expected to increase. In 2023, this figure is forecast at 20.8% and will increase to 24% by 2026. 

    Best E-Commerce Platforms (January 2023):

    Top e-commerce platforms make it both easy and affordable to build a successful online store. Of course, with so many good options on the market, choosing the right system for your needs can be a challenge. To help, we put together this list of the 10 best e-commerce platforms available in 2023.

    Here are some things to look out for in the e-commerce industry in 2023:

    1. Increased focus on customer experience
    2. The continued growth of mobile commerce
    3. The rise of voice search
    4. The increasing use of artificial intelligence (AI) and AI-based upsell
    5. Greater emphasis on sustainability
    6. Social and Livestream shopping
    7. AR window shopping
    8. Personalization that respects the privacy
    9. Omnichannel shopping
    10. More ways to pay

    Increased focus on customer experience

    Consumers are becoming increasingly accustomed to high levels of convenience and personalization when shopping online. 

    Here are a few suggestions:

    1. Make it easy for customers to find what they are looking for: This can involve things like having a clear and intuitive website design, providing detailed product descriptions, and making it easy for customers to navigate through the site.
    2. Offer excellent customer service: This can include things like responding quickly to customer inquiries and complaints, providing helpful and knowledgeable support, and being willing to go the extra mile to solve customer problems.
    3. Personalize the shopping experience: This can involve things like providing personalized product recommendations, offering personalized communication and marketing efforts, and allowing customers to customize their orders.
    4. Make the checkout process seamless: This can involve things like offering multiple payment options, providing clear and detailed checkout instructions, and offering fast and reliable shipping options.
    5. Seek customer feedback and act on it: Asking for and listening to customer feedback can help businesses understand what they are doing well and where they can improve. By acting on this feedback, businesses can continually improve the customer experience.

    The continued growth of mobile commerce

    Mobile commerce (or m-commerce) is growing rapidly, with more and more people using their smartphones to shop online. In fact, according to Statista, global e-commerce sales are expected to reach $4.8 trillion by 2023. Shopping on a mobile device is often more convenient for customers, as it allows them to shop from anywhere, at any time. This means that businesses that offer a good mobile shopping experience are likely to be more successful in attracting and retaining customers. 

    Businesses can take the following steps:

    1. Optimize their website for mobile: This can involve things like making sure that the website is responsive, which means that it adjusts to fit the screen size of the device being used, and ensuring that the website loads quickly on mobile devices.
    2. Offer a mobile app: A mobile app can provide a more seamless shopping experience for customers, as it allows them to shop and access their account information directly from their device.
    3. Make it easy for customers to make purchases on mobile: This can involve things like offering multiple payment options, including mobile payments and making the checkout process as simple and straightforward as possible.
    4. Utilize mobile marketing: Businesses can use mobile marketing techniques, such as SMS marketing and push notifications, to reach and engage with customers on their mobile devices.

    The rise of voice search

    Voice search is becoming increasingly popular, with many people using voice assistants like Amazon’s Alexa, Google Home, Siri, Google Assistant, and many more to search for products and information. In fact, according to a survey by PwC52% of consumers say they use voice search at least once a week. The increasing popularity of smart home devices, such as smart speakers and smart thermostats, means that more people are using voice assistants to control their homes and access information.

    Businesses can take the following steps:

    1. Optimize for long-tail keywords and natural language: When optimizing content for voice search, it is important to use long-tail keywords and phrases that are commonly used in natural language. This can help improve the chances of being found by voice search.
    2. Use structured data: Structured data, such as schema markup, can help improve the chances of being found by voice search by providing additional context about a business or website.
    3. Use a clear and concise writing style: When optimizing content for voice search, it is important to use clear and concise language that is easy for voice assistants to understand.
    4. Optimize for featured snippets: Featured snippets are short summaries of information that are often displayed at the top of search results and can be read out by voice assistants. Optimizing for featured snippets can increase the chances of being found by voice search.
    5. Use local SEO: Voice search is often used to find local businesses, so it is important to optimize for local SEO in order to be found by voice search.

    The increasing use of artificial intelligence (AI) and AI-based upsell

    According to a survey by Econsultancy, 72% of businesses that use AI in their operations report an increase in customer satisfaction.

    Amazon is the king of AI. 35% of Amazon’s revenue comes from upselling or cross-selling.

    Businesses can take the following steps:

    1. Implement chatbots: Chatbots can provide efficient and personalized customer service, helping to resolve customer inquiries and issues quickly. This can improve the customer experience and reduce the workload of customer-facing employees.
    2. Use automated email campaigns: AI-powered automated email campaigns can help businesses nurture leads and build customer relationships by sending personalized and targeted emails.
    3. Implement AI-based upsell techniques: AI-based upsell techniques, such as suggesting related or complementary products to customers during the checkout process, can help businesses increase their revenue.
    4. Use AI to analyze customer data: By using AI to analyze customer data, businesses can identify patterns and trends that can help them better understand their customers and improve their operations.

    Greater emphasis on sustainability

    According to a survey by Accenture66% of consumers say they are willing to pay more for products that are sustainable or environmentally friendly.

    There are a few steps that businesses can take to adapt to the greater emphasis on sustainability in the e-commerce industry:

    1. Offer sustainable products: This can involve sourcing products that are made from sustainable materials, such as recycled or organic materials, or that are produced using environmentally-friendly methods.
    2. Implement sustainable practices: This can involve things like reducing waste, using eco-friendly packaging, and using sustainable transportation methods.
    3. Communicate sustainability efforts: It is important for businesses to communicate their sustainability efforts to customers. This can involve things like providing information about the sustainability of products on the website and sharing sustainability-related updates on social media and taking recommendations from customers directly.
    4. Consider the entire product lifecycle: When thinking about sustainability, it is important to consider the entire product lifecycle, from sourcing and production to disposal. Businesses can look for ways to minimize the environmental impact of their products at every stage.

    Social and Livestream shopping

    A study by LiveStream found that 80% of people would rather watch a live video from a brand than read a blog, and 82% of people prefer live video from a brand to social media posts. This shows the power of live streaming as a way to engage with customers and build a connection with a brand.

    Live streaming technologies, such as YouTube Live, Instagram Live, Facebook Live, and many more allow businesses to engage with customers in real-time and provide a more interactive shopping experience. 

    Businesses can take the following steps:

    1. Establish a presence on social media: This can involve creating profiles on popular social media platforms, such as Facebook and Instagram, and regularly posting updates and engaging with followers.
    2. Use social media to promote products: Businesses can use social media to promote products and make it easier for customers to discover and purchase products. This can involve things like using sponsored posts or creating shoppable posts that allow customers to purchase products directly from the platform.
    3. Utilize live streaming technologies: Businesses can use live streaming technologies to engage with customers in real-time and provide a more interactive shopping experience. This can involve things like hosting live Q&A sessions or showcasing products during a live stream.

    AR window shopping

    There are a few reasons why Augmented Reality (AR) window shopping will be important in the e-commerce industry. First and foremost, it can greatly improve the customer experience by allowing them to virtually try on products and see how they would look in their own space. 

    In addition to improving the customer experience, AR window shopping also offers increased convenience. Customers can shop from the comfort of their own homes.

    Another benefit of AR window shopping is that it can enhance the visual appeal of products by allowing customers to see them in a more realistic and interactive way. 

    Businesses can take the following steps:

    1. Invest in AR technology: In order to offer AR window shopping, businesses will need to invest in AR technology. This can involve things like creating AR experiences using AR development tools or integrating AR technology into their e-commerce platform.
    2. Offer AR try-on features: By offering AR try-on features, businesses can allow customers to virtually try on products and see how they would look in their own space. This can be especially useful for products like clothing, shoes, and accessories.
    3. Use AR to enhance product visuals: Businesses can use AR to create more interactive and visually appealing product listings and displays, which can help attract and engage customers.
    4. Promote AR features: It is important for businesses to promote their AR features to customers in order to raise awareness and drive adoption. This can involve things like promoting AR experiences on social media or including information about AR features in marketing materials.

    Personalization that respects the privacy

    Personalization that respects privacy is important in the e-commerce industry for a few reasons. Firstly, more and more consumers are becoming concerned about their privacy and the potential for their data to be mishandled or misused.

    In addition to consumer concerns, there are also legal requirements to consider. Governments around the world are implementing privacy regulations that require businesses to be transparent about how they collect and use customer data. Businesses that fail to comply with these regulations could face fines and other penalties.

    By respecting customers’ privacy, businesses can protect their reputations and avoid negative publicity that could result from a privacy breach or mishandling of customer data. 

    Businesses can take the following steps:

    1. Implement a privacy policy: It is important for businesses to have a clear and comprehensive privacy policy that explains how they collect, use, and protect customer data.
    2. Obtain consent: In order to collect and use customer data for personalization, businesses should obtain consent from customers. This can involve things like providing a clear opt-in form or obtaining explicit consent when collecting sensitive data.
    3. Use privacy-preserving technologies: There are a number of privacy-preserving technologies that can help businesses collect and use customer data in a way that respects privacy. Additionally, using first-party cookies, such as those provided by Enhencer, can help businesses collect customer data in a privacy-preserving way as these cookies do not share user data with third parties.
    4. Be transparent: It is important for businesses to be transparent about their data collection and use practices. This can involve things like providing clear and easy-to-understand information about how customer data is collected and used, and responding promptly to customer inquiries about privacy.

    Omnichannel shopping

    Omnichannel shopping is going to be important in the e-commerce industry for a few reasons. Firstly, it can greatly improve the customer experience by allowing them to seamlessly shop across different channels, such as online, in-store, and through social media.

    In addition to improving the customer experience, omnichannel shopping also offers increased convenience. Customers can shop in the way that is most convenient for them, whether that be online, in-store, or through a mobile app. 

    Another benefit of omnichannel shopping is that it increases the visibility of products, making it more likely that customers will discover and purchase them.

    Finally, omnichannel shopping provides enhanced customer insights. By tracking customer behavior across multiple channels, businesses can gain a more complete understanding of their customers and use this information to improve the shopping experience and increase sales.

    Businesses can take the following steps:

    1. Offer multiple channels for customers to shop: Businesses can offer multiple channels for customers to shop, such as online, in-store, and through social media.
    2. Make it easy for customers to switch between channels: Businesses can make it easy for customers to switch between channels by offering a consistent brand experience and providing a seamless transition between channels. This can involve things like offering the same products and prices across all channels and allowing customers to start their shopping journey on one channel and complete it on another.
    3. Use customer data to improve the omnichannel experience: By tracking customer behavior across multiple channels, businesses can gain a more complete understanding of their customers and use this information to improve the omnichannel experience. This can involve things like offering personalized recommendations or providing personalized content based on a customer’s interests.

    More ways to pay

    Firstly, it can greatly improve the customer experience by making it easier for them to complete their purchases.

    In addition to improving the customer experience, offering more ways to pay also increases the convenience of shopping. Customers can choose the payment method that is most convenient for them, whether that be a credit card, debit card, or mobile payment.

    Another benefit of offering more ways to pay is that it makes products more accessible to a wider range of customers, including those who may not have access to traditional payment methods.

    Finally, offering more ways to pay enhances the security of the payment system and reduces the risk of fraud.

    Businesses can take the following steps:

    1. Use payment gateways: Payment gateways can help businesses securely process and accept payments from customers. By using payment gateways, businesses can offer more ways to pay without having to handle sensitive payment information themselves.

    Explore More

    (S)ARIMA(X) TSA Forecasting, QC and Visualization of E-Commerce Food Delivery Sales

    Build A Simple NLP/NLTK Chatbot

    Simple E-Commerce Sales BI Analytics

    Brazilian E-Commerce Showcase

    A K-means Cluster Cohort E-Commerce

    E-Commerce Cohort Analysis in Python

    E-Commerce Data Science Use-Case

    E-Commerce ML/AI Classification

    Brand Architecture: Google vs. Amazon

    Start Your E-Commerce with Shopify

    Start Your Shopify Business


    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • (S)ARIMA(X) TSA Forecasting, QC and Visualization of E-Commerce Food Delivery Sales

    (S)ARIMA(X) TSA Forecasting, QC and Visualization of E-Commerce Food Delivery Sales

    Featured Photo by Ella Olsson on Pexels

    Inspired by the recent TSA e-commerce use-case, this article is a beginner-friendly guide to help you understand and evaluate ARIMA-based time-series forecasting models such as SARIMA and SARIMAX.

    Objective: To understand the basic concepts of ARIMA, SARIMA and SARIMAX in terms of Time Series Forecasting QC.

    Application: We will focus on an QC-optimized SARIMA(X) model in order to forecast the e-commerce sales of a food delivery company based in Helsinki, Finland.

    Insights: We will assume that delivery companies get its revenue from two main sources:

    1. Up to 30% cut from every order.
    2. Delivery fees.

    The first revenue stream depends upon the order volume and the value of each order. The second revenue is linked to multiple sources such as delivery distance, order size, order value, day of the week, etc.

    Concepts:

    • ACF, PACF
    • Seasonal Decomposition
    • Stationarity of time-series
    • ADF & KPSS Tests
    • Hyper-Parameter Optimization (HPO)
    • SARIMA/SARIMAX: Model QC Comparisons
    • Evaluation Metrics: AIC, BIC, MSE, SSE, and RMSE

    Table of Contents:

    1. Libraries
    2. Input Data
    3. Feature Engineering
    4. Temporal Patterns
    5. ADF & KPSS Tests
    6. ACF & PACF
    7. SARIMA Model
    8. SARIMAX Model
    9. Model Comparison
    10. Summary
    11. Explore More
    12. Embed Socials
    13. Infographic

    You can go through the below articles for more details on ARIMA-related topics:

    Libraries

    Let’s set the working directory YOURPATH

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    and import the following libraries

    import itertools
    import warnings
    from datetime import datetime, timedelta

    import numpy as np
    import pandas as pd

    import matplotlib.pyplot as plt
    from pandas.plotting import lag_plot
    import seaborn as sns
    %matplotlib inline

    from statsmodels.tsa.seasonal import seasonal_decompose
    from statsmodels.tsa.stattools import adfuller, kpss
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    from statsmodels.tsa.statespace.sarimax import SARIMAX
    from statsmodels.tools.sm_exceptions import ConvergenceWarning

    Input Data

    Let’s load the input dataset

    source_df = pd.read_csv(‘orders.csv’)
    df = source_df.copy()
    df.tail(5)

    Input sales dataset

    df.shape

    (18706, 13)
    

    df[‘TIMESTAMP’] = pd.to_datetime(df[‘TIMESTAMP’])

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 18706 entries, 0 to 18705
    Data columns (total 13 columns):
     #   Column                     Non-Null Count  Dtype         
    ---  ------                                                --------------  -----         
     0   TIMESTAMP                              18706 non-null  datetime64[ns]
     1   ACTUAL_DELIVERY_MINUTES - ESTIMATED_DELIVERY_MINUTES 18706 non-null  int64         
     2   ITEM_COUNT                                      18706 non-null  int64         
     3   USER_LAT                                       18706 non-null  float64       
     4   USER_LONG                                      18706 non-null  float64       
     5   VENUE_LAT                                      18706 non-null  float64       
     6   VENUE_LONG                                     18706 non-null  float64       
     7   ESTIMATED_DELIVERY_MINUTES                     18706 non-null  int64         
     8   ACTUAL_DELIVERY_MINUTES                        18706 non-null  int64         
     9   CLOUD_COVERAGE                                 18429 non-null  float64       
     10  TEMPERATURE                                    18429 non-null  float64       
     11  WIND_SPEED                                     18429 non-null  float64       
     12  PRECIPITATION                                 18706 non-null  float64       
    dtypes: datetime64[ns](1), float64(8), int64(4)
    memory usage: 1.9 MB

    df.describe().T

    Input dataset descriptive statistics

    We can see that there are no extreme deviations between mean and median values of each column. This suggests that we can expect little skewness in the distributions.

    Let’s look at the spatial content of our input data

    plt.scatter(df[‘USER_LAT’],df[‘USER_LONG’],c=df[‘ACTUAL_DELIVERY_MINUTES’])
    plt.colorbar()
    plt.title(“ACTUAL_DELIVERY_MINUTES”)
    plt.xlabel(“USER_LAT”)
    plt.ylabel(“USER_LONG”)
    plt.savefig(‘inputdeliverymin.png’)

    Actual delivery minutes vs user lat and long

    plt.scatter(df[‘VENUE_LAT’],df[‘VENUE_LONG’],c=df[‘ITEM_COUNT’])
    plt.colorbar()
    plt.title(“ITEM_COUNT”)
    plt.xlabel(“VENUE_LAT”)
    plt.ylabel(“VENUE_LONG”)
    plt.savefig(‘inputvenueitemcount.png’)

    Item count vs venue lat and long.

    Feature Engineering

    Let’s plot the sns Pearson correlation heatmap

    def correlation_check(df: pd.DataFrame) -> None:
    “””
    Plots a Pearson Correlation Heatmap.

    Args:
    df (pd.DataFrame): dataframe to plot

    Returns: None
    """
    
    # Pretty Name
    df.rename(columns={"ACTUAL_DELIVERY_MINUTES - ESTIMATED_DELIVERY_MINUTES": "ACTUAL-ESTIMATED"},
              inplace=True)
    
    # Figure
    fig, ax = plt.subplots(figsize=(16,12), facecolor='w')
    correlations_df = df.corr(method='pearson', min_periods=1)
    sns.heatmap(correlations_df, cmap="Oranges", annot=True, linewidth=.1)
    
    # Labels
    ax.set_title("Pearson Correlation Heatmap", fontsize=15, pad=10)
    ax.set_facecolor(color='white')
    

    correlation_check(df)
    plt.savefig(‘salesheatmap.png’)

    sns Pearson correlation heatmap

    We can also check feature correlations via the sns pairplot

    sns.pairplot(data=df)
    plt.savefig(‘salespairplot.png’)

    sns pairplot

    As with the heatmap, the pair-plot didn’t reveal any underlying, strong relationship between the variables. Exceptions:

    • ACTUAL_DELIVERY_MINUTES – ESTIMATED_DELIVERY_MINUTES is strongly correlated with ACTUAL_DELIVERY_MINUTES and ESTIMATED_DELIVERY_MINUTES.
    • There are correlations between spatial coordinates.

    Temporal Patterns

    Let’s create a new DataFrame with daily frequency and number of orders
    daily_df = df.groupby(pd.Grouper(key=’TIMESTAMP’, freq=’D’)).size().reset_index(name=’ORDERS’)
    daily_df.set_index(‘TIMESTAMP’, inplace=True)
    daily_df.index.freq = ‘D’ # To keep pandas inference in check!

    print(daily_df.head())
    print(daily_df.describe())

    RDERS
    TIMESTAMP         
    2020-08-01     299
    2020-08-02     328
    2020-08-03     226
    2020-08-04     228
    2020-08-05     256
               ORDERS
    count   61.000000
    mean   306.655738
    std     58.949381
    min    194.000000
    25%    267.000000
    50%    294.000000
    75%    346.000000
    max    460.000000

    Let’s plot the number of orders per day

    def orders_per_day(df: pd.DataFrame) -> None:

    # Figure
    fig, ax = plt.subplots(figsize=(16,9), facecolor='w')
    ax.plot(df.index, df['ORDERS'])
    
    # Labels
    ax.set_title("Number of Orders Each Day", fontsize=15, pad=10)
    ax.set_ylabel("Number of orders", fontsize=12)
    ax.set_xlabel("Date", fontsize=12)
    
    # Grid & Legend
    plt.grid(linestyle=":", color='grey')
    plt.legend(["Orders"])
    plt.savefig('salesnumberofdays.png')
    

    orders_per_day(daily_df)
    plt.savefig(‘salesnumberoforders.png’)

    Number of orders per day

    Let’s check the series for trends and seasonality

    def decompose_series(df: pd.DataFrame) -> None:

    # Decomposition
    decomposition = seasonal_decompose(df)
    trend = decomposition.trend
    seasonal = decomposition.seasonal
    residual = decomposition.resid
    
    #Figure
    fig, (ax1,ax2,ax3,ax4) = plt.subplots(4,1, figsize=(16,10), facecolor='w')
    
    ax1.plot(df, label='Original')
    ax2.plot(trend, label='Trend')
    ax3.plot(seasonal,label='Seasonality')
    ax4.plot(residual, label='Residuals')
    
    # Legend
    ax1.legend(loc='upper right')
    ax2.legend(loc='upper right')
    ax3.legend(loc='upper right')
    ax4.legend(loc='upper right')
    
    ax1.grid(linestyle=":", color='grey')
    ax2.grid(linestyle=":", color='grey')
    ax3.grid(linestyle=":", color='grey')
    ax4.grid(linestyle=":", color='grey')
    
    plt.title('Decomposed Daily Orders (2020-08-01 - 2020-10-01')
    plt.tight_layout()
    plt.savefig('salesdecomposeseries.png')
    

    decompose_series(daily_df)

    Decomposition of the original time-series data into  the trend, seasonality, and residuals.
    • Trend: There is a general rising trend for the given time period. However, it is not constant. Number of orders decreases around 9th of september but then recovers resulting in an approx. 7% overall increase for the observed time period.
    • Seasonality: There are clear weekly seasonal patterns. Number of orders is low in the beginning of the week and grows towards the weekend.
    • Residuals: No observable patterns left in the residuals.

    Let’s plot a heatmap with days of the week and hours of the day vs the number of orders

    def orders_weekdays_hours(dataframe: pd.DataFrame) -> None:

    # Data
    df = dataframe.copy(deep=False)
    
    # Reshaping data for the plot
    df["hour"] = pd.DatetimeIndex(df['TIMESTAMP']).hour
    df["weekday"] = pd.DatetimeIndex(df['TIMESTAMP']).weekday
    daily_activity = df.groupby(by=['weekday','hour']).count()['TIMESTAMP'].unstack()
    
    # Figure Object
    fig, ax = plt.subplots(figsize=(10,10), facecolor='w')
    yticklabels = ["Mon", "Tue","Wed", "Thu", "Fri", "Sat", "Sun"]
    sns.heatmap(daily_activity, robust=True, cmap="Oranges", yticklabels=yticklabels)
    
    # Labeling
    ax.set_title("Ordering Patterns", fontsize=15, pad=10)
    ax.set_xlabel("Hours of the day (Hours)", fontsize=12, x=.5)
    ax.set_ylabel("Day of the week", fontsize=12, y=.5)
    
    plt.savefig('salesorderingpatterns.png')
    

    orders_weekdays_hours(df)

    A heatmap with days of the week and hours of the day vs the number of orders
    • Each day seems to have two peaks in number of orders.
    • Hottest ordering times are slightly different for workdays and weekends.
    • During the workdays number of orders peaks at 8am and 16pm with decrease in orders during the lunchtime.
    • The weekends exhibit a similar behavior but higher overall number of orders and with different peaks at 10-11am and 15-16pm.

    Let’s creating a new DataFrame with hourly frequency number of orders
    hourly_df = df.groupby(pd.Grouper(key=’TIMESTAMP’, freq=’1h’)).size().reset_index(name=’ORDERS’)

    hourly_df.set_index(‘TIMESTAMP’, inplace=True)
    hourly_df.index.freq = ‘H’

    print(hourly_df, hourly_df.describe())

                        ORDERS
    TIMESTAMP                  
    2020-08-01 06:00:00       3
    2020-08-01 07:00:00       6
    2020-08-01 08:00:00      15
    2020-08-01 09:00:00      20
    2020-08-01 10:00:00      26
    ...                     ...
    2020-09-30 16:00:00      42
    2020-09-30 17:00:00      26
    2020-09-30 18:00:00      19
    2020-09-30 19:00:00       8
    2020-09-30 20:00:00       1
    
    [1455 rows x 1 columns]             ORDERS
    count  1455.000000
    mean     12.856357
    std      13.733086
    min       0.000000
    25%       0.000000
    50%       8.000000
    75%      24.000000
    max      53.000000

    Let’s plot the number of orders per hour

    def orders_per_hour(df: pd.DataFrame, start: datetime, end: datetime ) -> None:

    # Figure
    fig, ax = plt.subplots(figsize=(16,9), facecolor='w')
    ax.plot(df.index, df['ORDERS'])
    
    # Labels
    ax.set_title(f"Number of Orders Each Hour ({start.date()} - {end.date()})", fontsize=15, pad=10)
    ax.set_ylabel("Number of orders", fontsize=12)
    ax.set_xlabel("Date", fontsize=12)
    
    # Axis Limit
    ax.set_xlim([start, end])
    
    # Legend & Grid
    plt.grid(linestyle=":", color='grey')
    plt.legend(["Orders"])
    plt.savefig('salesnumberofeachhours.png')
    

    start = datetime(2020, 8, 1)
    end = datetime(2020, 8, 15)
    orders_per_hour(hourly_df, start, end)

    Number of orders per hour

    Let’s decompose these time series

    decompose_series(hourly_df)

    Number of orders per hour: original, trend, seasonality, and residuals.
    • Trend: There doesn’t seem to be definite upwards or downwards trend. Recall from the daily plot that there is a trend.
    • Seasonality: There is very strong daily seasonal pattern. Recall from the daily plot that there is also a weekly seasonality.
    • Residuals: No observable patterns left in the residuals.

    Let’s plot rolling mean and STD with window=48

    def plot_rolling_mean_and_std(dataframe: pd.DataFrame, window: int) -> None:

    df = dataframe.copy()
    # Get Things Rolling
    roll_mean = df.rolling(window=window).mean()
    roll_std = df.rolling(window=window).std()

    # Figure
    fig, ax = plt.subplots(figsize=(16,9), facecolor='w')
    ax.plot(df, label='Original')
    ax.plot(roll_mean, label='Rolling Mean')
    ax.plot(roll_std,  label='Rolling STD')
    
    # Legend & Grid
    ax.legend(loc='upper right')
    plt.grid(linestyle=":", color='grey')
    plt.savefig('salesrollingmean.png')
    

    plot_rolling_mean_and_std(hourly_df, window=48)

    Original time series, rolling mean, and STD
    • We can see that the mean and the variance of the time series are time-variant.
    • Mean and variance seem to follow weekly seasons.

    ADF & KPSS Tests

    Let’s perform the Augmented Dickey Fuller (ADF) Test:
    – The null hypothesis for this test is that there is a unit root.
    – The alternate hypothesis is that there is no unit root in the series.

    def perform_adf_test(df: pd.DataFrame) -> None:

    adf_stat, p_value, n_lags, n_observ, crit_vals, icbest = adfuller(df)
    
    print('\nAugmented Dickey Fuller Test')
    print('---'*15)
    print('ADF Statistic: %f' % adf_stat)
    print('p-value: %f' % p_value)
    print(f'Number of lags used: {n_lags}')
    print(f'Number of observations used: {n_observ}')
    print(f'T values corresponding to adfuller test:')
    for key, value in crit_vals.items():
        print(key, value)
    

    We also perform the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test for stationarity:
    – The null hypothesis for the test is that the data is stationary.
    – The alternate hypothesis for the test is that the data is not stationary.

    def perform_kpss_test(df: pd.DataFrame) -> None:

    kpss_stat, p_value, n_lags, crit_vals = kpss(df, nlags='auto', store=False)
    print('\nKwiatkowski-Phillips-Schmidt-Shin test')
    print('---'*15)
    print('KPSS Statistic: %f' % kpss_stat)
    print('p-value: %f' % p_value)
    print(f'Number of lags used: {n_lags}')
    print(f'Critical values of KPSS test:')
    for key, value in crit_vals.items():
        print(key, value)
    

    Let’s call these two functions

    perform_adf_test(hourly_df)
    perform_kpss_test(hourly_df)

    Augmented Dickey Fuller Test
    ---------------------------------------------
    ADF Statistic: -3.464712
    p-value: 0.008941
    Number of lags used: 24
    Number of observations used: 1430
    T values corresponding to adfuller test:
    1% -3.434931172941245
    5% -2.8635632730206857
    10% -2.567847177857108
    
    Kwiatkowski-Phillips-Schmidt-Shin test
    ---------------------------------------------
    KPSS Statistic: 0.396855
    p-value: 0.078511
    Number of lags used: 20
    Critical values of KPSS test:
    10% 0.347
    5% 0.463
    2.5% 0.574
    1% 0.739
    • Since ADF Statistic -3.46 < -3.43 and p-value: 0.0089 < 0.05 we can reject the N0 hypothesis in the favor of NA. According to the ADF test, our time series have no unit root.
    • Since KPSS Statistic 0.396 < 0.463 and 0.078 > 0.05 we fail to reject the N0 hypothesis. According to the KPSS test, our time series are trend-stationary.

    This is consistent with our observation that rolling mean and STD follow a weekly trend.

    ACF & PACF

    Let’s invoke ACF and PACF to identify the lags that have high correlations

    def plot_acf_pacf(df: pd.DataFrame, acf_lags: int, pacf_lags: int) -> None:

    # Figure
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,9), facecolor='w')
    
    # ACF & PACF
    plot_acf(df['ORDERS'], ax=ax1, lags=acf_lags)
    plot_pacf(df['ORDERS'], ax=ax2, lags=pacf_lags, method='ywm')
    
    # Labels
    ax1.set_title("Autocorrelation", fontsize=15, pad=10)
    ax1.set_ylabel("Number of orders", fontsize=12)
    ax1.set_xlabel("Lags (Hours)", fontsize=12)
    
    ax2.set_title("Partial Autocorrelation", fontsize=15, pad=10)
    ax2.set_ylabel("Number of orders", fontsize=12)
    ax2.set_xlabel("Lags (Hours)", fontsize=12)
    
    # Legend & Grid
    ax1.grid(linestyle=":", color='grey')
    ax2.grid(linestyle=":", color='grey')
    
    plt.savefig('salesautocorrelation.png')
    

    plot_acf_pacf(hourly_df, acf_lags=72, pacf_lags= 72)

    ACF & PACF plots
    • ACF
      • As we already knew our series are seasonal and our ACF plot confirms this pattern. If we plot more lags we will also observe that significance of the lags is gradually declining.
      • First significant lag is lag 1. The number of daily orders raises/decreases gradually from hour to hour. Hence the orders during the previous hour might tell us something about orders during the current hour.
      • Next important lags are 12 and 24. These are deterministic seasonal patterns connected with day/night cycles. 12 hour lag is negatively correlated because when at 8:00 am number of orders starts to increase at 20:00 pm the number of orders is already decreasing. However, 24 hour lag shows that number of orders made today at 16:00 pm might tell us about the number of orders to be made tomorrow at 16:00 pm.
    • PACF
      • We can see that lag 1 and 24 have the highest correlations. This means that seasons 24 hours apart are directly inter-correlated regardless of what is happening in between.

    We can take a detailed look at the above Lags of interest

    def lag_plots(df: pd.DataFrame) -> None:

    # Figure
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16,9), facecolor='w')
    
    # Lags
    lag_plot(df['ORDERS'], lag=1, ax=ax1, c='#187bcd')
    lag_plot(df['ORDERS'], lag=12, ax=ax2, c='grey')
    lag_plot(df['ORDERS'], lag=24, ax=ax3, c='#187bcd')
    
    # Labels
    ax1.set_title("y(t+1)", fontsize=15, pad=10)
    ax2.set_title("y(t+12)", fontsize=15, pad=10)
    ax3.set_title("y(t+24)", fontsize=15, pad=10)
    
    # Legend & Grid
    ax1.grid(linestyle=":", color='grey')
    ax2.grid(linestyle=":", color='grey')
    ax3.grid(linestyle=":", color='grey')
    

    lag_plots(hourly_df)

    A detailed look at Lags 1, 12, and 24

    Lags 1, 12, and 24 follow the ACF correlation trends: both high positive linear correlations (Lags 1 and 24) and a strongly negative non-linear correlation (Lag 12) are confirmed.

    SARIMA Model

    The SARIMA model is specified as follows

    (p,d,q) x (P,D,Q)s

    where

    • Trend Elements are:
      • p: Autoregressive order
      • d: Difference order
      • q: Moving average order
    • Seasonal Elements are:
      • P: Seasonal autoregressive order.
      • D: Seasonal difference order. D=1 would calculate a first order seasonal difference
      • Q: Seasonal moving average order. Q=1 would use a first order errors in the model
      • s Single seasonal period

    We will use the Box–Jenkins method for SARIMA parameter tuning.

    Let’s split the input dataframe into train/test sets as 75% / 15%.

    def train_test_split(df: pd.DataFrame, train_set, test_set):

    train_set = df[df.index <= train_end]
    test_set = df[df.index > train_end]
    return train_set, test_set
    

    warnings.simplefilter(‘ignore’, ConvergenceWarning)

    train_end = datetime(2020,9,15)
    test_end = datetime(2020,9,30)

    train_df, test_df = train_test_split(hourly_df, train_end, test_end)

    Let's set the hyper-parameters
    p, d, q = 1, 1, 1
    P, D, Q = 2, 1, 1
    s = 24
    and fit the SARIMA model
    

    sarima_model = SARIMAX(train_df, order=(p, d, q), seasonal_order=(P, D, Q, s))
    sarima_model_fit = sarima_model.fit(disp=0)
    print(sarima_model_fit.summary())

    SARIMAX Results                                      
    ==========================================================================================
    Dep. Variable:                             ORDERS   No. Observations:                 1075
    Model:             SARIMAX(1, 1, 1)x(2, 1, 1, 24)   Log Likelihood               -3158.989
    Date:                            Sat, 14 Jan 2023   AIC                           6329.979
    Time:                                    11:21:55   BIC                           6359.718
    Sample:                                08-01-2020   HQIC                          6341.255
                                         - 09-15-2020                                         
    Covariance Type:                              opg                                         
    ==============================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------
    ar.L1          0.3729      0.027     13.720      0.000       0.320       0.426
    ma.L1         -0.9415      0.012    -75.980      0.000      -0.966      -0.917
    ar.S.L24       0.1375      0.024      5.666      0.000       0.090       0.185
    ar.S.L48      -0.1325      0.025     -5.272      0.000      -0.182      -0.083
    ma.S.L24      -0.9997      2.136     -0.468      0.640      -5.186       3.187
    sigma2        21.9775     46.728      0.470      0.638     -69.607     113.562
    ===================================================================================
    Ljung-Box (L1) (Q):                   0.64   Jarque-Bera (JB):               317.47
    Prob(Q):                              0.42   Prob(JB):                         0.00
    Heteroskedasticity (H):               1.27   Skew:                             0.46
    Prob(H) (two-sided):                  0.03   Kurtosis:                         5.53
    ===================================================================================
    
    Warnings:
    [1] Covariance matrix calculated using the outer product of gradients (complex-step).

    Let’s plot the SARIMA diagnostics
    sarima_model_fit.plot_diagnostics(figsize=(16, 9))
    plt.savefig(‘salesarimaxdiagplot.png’)

    SARIMA diagnostics plot
    • The standardize residual plot: The residuals over time don’t display any obvious patterns. They appear as white noise.
    • The Normal Q-Q-plot: Shows that the ordered distribution of residuals follows the linear trend of the samples taken from a standard normal distribution with N(0, 1). However, the slight curving indicates that our distribution has heavier tails.
    • Histogram and estimated density plot: The KDE follows the N(0,1) line however with noticeable differences. As mentioned before our distribution has heavier tails.
    • The Correlogram plot: Shows that the time series residuals have low correlation with lagged versions of itself. Meaning there are no patterns left to extract in the residuals.

    Let’s compare the test data vs SARIMA predictions

    pred_start_date = test_df.index[0]
    pred_end_date = test_df.index[-1]

    sarima_predictions = sarima_model_fit.predict(start=pred_start_date, end=pred_end_date)
    sarima_residuals = test_df[‘ORDERS’] – sarima_predictions

    def plot_test_predictions(test_df: pd.DataFrame, predictions) -> None:

    # Figure
    fig, ax = plt.subplots(figsize=(16,9), facecolor='w')
    
    ax.plot(test_df, label='Testing Set')
    ax.plot(predictions, label='Forecast')
    
    # Labels
    ax.set_title("Test vs Predictions", fontsize=15, pad=10)
    ax.set_ylabel("Number of orders", fontsize=12)
    ax.set_xlabel("Date", fontsize=12)
    
    # Legend & Grid
    ax.grid(linestyle=":", color='grey')
    ax.legend()
    
    plt.savefig('salesarimaxtestpredictions.png')
    

    plot_test_predictions(test_df, sarima_predictions)

    Test data vs SARIMA predictions.

    Let’s get the SARIMA evaluation data
    sarima_aic = sarima_model_fit.aic
    sarima_bic = sarima_model_fit.bic
    sarima_mean_squared_error = sarima_model_fit.mse
    sarima_sum_squared_error=sarima_model_fit.sse
    sarima_root_mean_squared_error = np.sqrt(np.mean(sarima_residuals**2))

    print(f’Akaike information criterion | AIC: {sarima_aic}’)
    print(f’Bayesian information criterion | BIC: {sarima_bic}’)
    print(f’Mean Squared Error | MSE: {sarima_mean_squared_error}’)
    print(f’Sum Squared Error | SSE: {sarima_sum_squared_error}’)
    print(f’Root Mean Squared Error | RMSE: {sarima_root_mean_squared_error}’)

    Akaike information criterion | AIC: 6329.978772260314
    Bayesian information criterion | BIC: 6359.718044919224
    Mean Squared Error | MSE: 23.867974361783705
    Sum Squared Error | SSE: 25658.072438917483
    Root Mean Squared Error | RMSE: 5.6716207290283

    Let’s perform the SARIMA forecast of hourly orders

    Forecast Window
    days = 24
    hours = days * 24

    sarima_forecast = sarima_model_fit.forecast(hours)
    sarima_forecast_series = pd.Series(sarima_forecast, index=sarima_forecast.index)

    Since negative orders are not possible, we can trim them

    sarima_forecast_series[sarima_forecast_series < 0] = 0

    Let’s plot the test, train and forecast values

    def plot_sarima_forecast(train_df: pd.DataFrame, test_df: pd.DataFrame, fc_series: pd.Series) -> None:

    fig, ax = plt.subplots(figsize=(16,9), facecolor='w')
    
    # Plot Train, Test and Forecast.
    ax.plot(train_df['ORDERS'], label='Training')
    ax.plot(test_df['ORDERS'], label='Actual')
    ax.plot(fc_series, label='Forecast')
    
    # Labels
    ax.set_title("SARIMA Hourly Orders Forecast", fontsize=15, pad=20)
    ax.set_ylabel("Number of orders", fontsize=12)
    ax.set_xlabel("Date", fontsize=12)
    
    xmin = datetime(2020, 9, 10)
    xmax = datetime(2020, 10, 7)
    ax = plt.gca()
    ax.set_xlim([xmin, xmax])
    
    # Legend & Grid
    ax.grid(linestyle=":", color='grey')
    ax.legend()
    
    plt.savefig('salesarimahorlyforecast.png')
    

    plot_sarima_forecast(train_df, test_df, sarima_forecast_series)

    SARIMA hourly orders forecast vs training and test data.
    • Our SARIMA model predicts the overall hourly ordering patterns pretty well. It captures the two daily peaks and the lunchtime valley between them.
    • However, it fails to predict weekly patterns. Every day in our forecast seem to be the same. If this model was sent to production we would be able to predict the number of orders every hour but that prediction would be based on the assumption that the number of orders is the same every day.
    • While this model doesn’t have a great long term predictive power it can serve as a solid baseline for our next models.

    SARIMAX Model

    Let’s prepare the data

    hour_weekday_df = hourly_df.copy()
    hour_weekday_df[‘weekday_exog’] = hour_weekday_df.index.weekday
    print(hour_weekday_df.head(10))

    ORDERS  weekday_exog
    TIMESTAMP                                
    2020-08-01 06:00:00       3             5
    2020-08-01 07:00:00       6             5
    2020-08-01 08:00:00      15             5
    2020-08-01 09:00:00      20             5
    2020-08-01 10:00:00      26             5
    2020-08-01 11:00:00      29             5
    2020-08-01 12:00:00      30             5
    2020-08-01 13:00:00      21             5
    2020-08-01 14:00:00      23             5
    2020-08-01 15:00:00      29             5

    weekday_exog = hour_weekday_df[(hour_weekday_df != 0).all(1)]
    weekday_exog = weekday_exog.groupby(weekday_exog.index.weekday)[‘ORDERS’].mean()
    print(weekday_exog.head(7))

    TIMESTAMP
    1    17.386364
    2    18.591241
    3    18.628099
    4    19.891304
    5    21.939597
    6    25.715385
    Name: ORDERS, dtype: float64

    weekday_exog = {key: (weekday_exog[key] / weekday_exog[1]) for key in weekday_exog.keys()}
    print(weekday_exog)

    {1: 1.0, 2: 1.0693001288106483, 3: 1.0714200831847889, 4: 1.144075021312873, 5: 1.2618853357897968, 6: 1.4790548014077425}

    hour_weekday_df.replace({“weekday_exog”: weekday_exog})

    1455 rows × 2 columns

    Let’s perform a grid search for the SARIMAX HPO

    p = range(1, 3)
    d = range(1, 2)
    q = range(1, 3)
    s = 24

    pdq = list(itertools.product(p, d, q))
    seasonal_pdq = [(x[0], x[1], x[2], s) for x in pdq]

    def grid_search_sarimax(train_set: pd.DataFrame) -> None:

    # Supress UserWarnings
    warnings.simplefilter('ignore', category=UserWarning)
    
    # Grid Search
    for order in pdq:
        for seasonal_order in seasonal_pdq:
            model = SARIMAX(train_set['ORDERS'],
                            order=order,
                            seasonal_order=seasonal_order,
                            exog=train_set['weekday_exog']
                            )
            results = model.fit(disp=0)
            print(f'ARIMA{order}x{seasonal_order} -> AIC: {results.aic}, BIC:{results.bic},  MSE: {results.mse}')
    

    train_end = datetime(2020,9,15)
    test_end = datetime(2020,9,30)
    train_df, test_df = train_test_split(hour_weekday_df, train_end, test_end)

    Set Hyper-Parameters

    p, d, q = 1, 1, 2
    P, D, Q = 2, 1, 2
    s = 24
    exog = train_df[‘weekday_exog’]

    Fit SARIMAX

    sarimax_model = SARIMAX(train_df[‘ORDERS’],
    order=(p, d, q),
    seasonal_order=(P, D, Q, s),
    exog=exog)

    sarimax_model_fit = sarimax_model.fit(disp=0)

    Print the summary report

    print(sarimax_model_fit.summary())

              SARIMAX Results                                      
    ==========================================================================================
    Dep. Variable:                             ORDERS   No. Observations:                 1075
    Model:             SARIMAX(1, 1, 2)x(2, 1, 2, 24)   Log Likelihood               -3118.452
    Date:                            Sat, 14 Jan 2023   AIC                           6254.904
    Time:                                    11:29:13   BIC                           6299.513
    Sample:                                08-01-2020   HQIC                          6271.818
                                         - 09-15-2020                                         
    Covariance Type:                              opg                                         
    ================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
    --------------------------------------------------------------------------------
    weekday_exog     0.8400      0.138      6.074      0.000       0.569       1.111
    ar.L1            0.4934      0.063      7.777      0.000       0.369       0.618
    ma.L1           -1.1571      0.080    -14.553      0.000      -1.313      -1.001
    ma.L2            0.1576      0.070      2.261      0.024       0.021       0.294
    ar.S.L24         0.8878      0.052     17.064      0.000       0.786       0.990
    ar.S.L48        -0.2623      0.024    -10.922      0.000      -0.309      -0.215
    ma.S.L24        -1.7497      0.052    -33.831      0.000      -1.851      -1.648
    ma.S.L48         0.7717      0.051     15.081      0.000       0.671       0.872
    sigma2          20.6048      0.887     23.226      0.000      18.866      22.344
    ===================================================================================
    Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):               339.16
    Prob(Q):                              0.98   Prob(JB):                         0.00
    Heteroskedasticity (H):               1.34   Skew:                             0.54
    Prob(H) (two-sided):                  0.01   Kurtosis:                         5.57
    ===================================================================================
    
    Warnings:
    [1] Covariance matrix calculated using the outer product of gradients (complex-step).

    Let’s plot the SARIMAX diagnostics
    sarimax_model_fit.plot_diagnostics(figsize=(16, 9))
    plt.savefig(‘salesarimaxmodelfit.png’)

    SARIMAX diagnostic

    Let’s plot the predictions

    pred_start_date = test_df.index[0]
    pred_end_date = test_df.index[-1]

    exog = test_df[‘weekday_exog’]
    predictions = sarimax_model_fit.predict(start=pred_start_date, end=pred_end_date, exog=exog)
    residuals = test_df[‘ORDERS’] – predictions

    def plot_sarimax_test(test_set: pd.DataFrame, predictions: pd.Series) -> None:

    # Figure
    fig, ax = plt.subplots(figsize=(16,9), facecolor='w')
    
    ax.plot(test_set['ORDERS'], label='Testing Set')
    ax.plot(predictions, label='Forecast')
    
    # Labels
    ax.set_title("Test vs Forecast", fontsize=15, pad=15)
    ax.set_ylabel("Number of orders", fontsize=12)
    ax.set_xlabel("Date", fontsize=12)
    
    # Legend & Grid
    ax.grid(linestyle=":", color='grey')
    ax.legend()
    
    plt.savefig('salessarimaxtestforecastplot.png')
    

    plot_sarimax_test(test_df, predictions)

    SARIMAX test vs forecast orders

    Let’s get the SARIMAX evaluation data
    sarimax_aic = sarimax_model_fit.aic
    sarimax_bic = sarimax_model_fit.bic
    sarimax_mean_squared_error = sarimax_model_fit.mse
    sarimax_sum_squared_error=sarimax_model_fit.sse
    sarimax_root_mean_squared_error = np.sqrt(np.mean(residuals**2))

    print(f’Akaike information criterion | AIC: {sarimax_aic}’)
    print(f’Bayesian information criterion | BIC: {sarimax_bic}’)
    print(f’Mean Squared Error | MSE: {sarimax_mean_squared_error}’)
    print(f’Sum Squared Error | SSE: {sarimax_sum_squared_error}’)
    print(f’Root Mean Squared Error | RMSE: {sarimax_root_mean_squared_error}’)

    Akaike information criterion | AIC: 6254.90400948669
    Bayesian information criterion | BIC: 6299.512918475054
    Mean Squared Error | MSE: 22.10675398176249
    Sum Squared Error | SSE: 23764.760530394677
    Root Mean Squared Error | RMSE: 5.132961570576301

    Let’s plot the train, test, and predicted values

    Forecast Window
    days = 24
    hours = (days * 24)+1
    exog = pd.date_range(start=’2020-10-01′, end=’2020-10-25′, freq=’1H’)

    fc = sarimax_model_fit.forecast(hours, exog=exog.weekday)
    fc_series = pd.Series(fc, index=fc.index)

    Since negative orders are not possible we can trim them.

    fc_series[fc_series < 0] = 0

    def plot_sarimax_forecast(train_df: pd.DataFrame, test_df: pd.DataFrame, fc_series: pd.Series) -> None:

    # Figure
    fig, ax = plt.subplots(figsize=(16,9), facecolor='w')
    
    # Plot Train, Test and Forecast.
    ax.plot(train_df['ORDERS'], label='Training')
    ax.plot(test_df['ORDERS'], label='Testing')
    ax.plot(fc_series, label='Forecast')
    
    # Labels
    ax.set_title("SARIMAX Hourly Orders Forecast", fontsize=15, pad=10)
    ax.set_ylabel("Number of orders", fontsize=12)
    ax.set_xlabel("Date", fontsize=12)
    
    # Axis Limits
    xmin = datetime(2020, 9, 10)
    xmax = datetime(2020, 10, 8)
    ax = plt.gca()
    ax.set_xlim([xmin, xmax])
    
    # Legend & Grid
    ax.grid(linestyle=":", color='grey')
    ax.legend()
    
    plt.savefig('salesarimaxhourlyordersforecast.png')
    

    plot_sarimax_forecast(train_df, test_df, fc_series)

    SARIMAX hourly orders forecast, training, and test data

    Model Comparison

    Let’s compare the two models

    model_comparison = pd.DataFrame({‘Model’:[‘SARIMA(1, 1, 1)(2, 1, 1)24′,’SARIMAX(1, 1, 2)(2, 1, 2)24’],
    ‘AIC’:[sarima_aic, sarimax_aic],
    ‘BIC’:[sarima_bic, sarimax_bic],
    ‘MSE’: [sarima_mean_squared_error, sarimax_mean_squared_error],
    ‘SSE’: [sarima_sum_squared_error, sarimax_sum_squared_error],
    ‘RMSE’: [sarima_root_mean_squared_error, sarimax_root_mean_squared_error]})

    model_comparison.head()

    SARIMA vs SARIMAX model comparison table

    Summary

    • SARIMAX performs better than SARIMA in terms of AIC, BIC, MSE, SSE, and RMSE
    •  There is a general rising trend for the given time period.
    •  ADF Statistic -3.46 < -3.43 and p-value: 0.0089 < 0.05, so we can reject the null hypophysis N0 in favour of NA, i.e. our series have no unit root.
    • Since KPSS Statistic 0.396 < 0.463 and 0.078 > 0.05 we fail to reject the null hypothesis N0, and so our series are trend-stationary.
    •  ACF confirmed deterministic seasonal patterns connected with day/night cycles.
    • PACF shows that lag 1 and 24 have the highest correlation. This means that seasons 24 hours apart are directly correlated regardless of what is happening in between.
    • SARIMAX: Log Likelihood -3118.452,
    • Ljung-Box (L1) (Q): 0.00
    • Prob(Q): 0.98
    • Prob(H) (two-sided): 0.01
    • Heteroskedasticity (H): 1.34
    • Jarque-Bera (JB): 339.23
    • Skew: 0.54
    • Kurtosis: 5.57
    • SARIMA:
    • Log Likelihood -3158.989
    • Ljung-Box (L1) (Q): 0.64
    • Heteroskedasticity (H): 1.27
    • Prob(H) (two-sided): 0.03
    • Jarque-Bera (JB): 317.41
    • Skew: 0.46
    • Kurtosis: 5.53
    • The log-likelihood value of a regression model is a way to measure the goodness of fit for a model. The higher the value of the log-likelihood, the better a model fits a dataset. In our case, Log Likelihood SARIMA < Log Likelihood SARIMAX.
    • Heteroscedasticity is what you have in your data when the conditional variance in your data is not constant. In our case, Heteroskedasticity SARIMAX > Heteroskedasticity SARIMA.
    • The Jarque-Bera test is a goodness-of-fit test that determines whether or not sample data have skewness/kurtosis that matches a normal distribution. The test statistic of the Jarque-Bera (JB) test is always a positive number and if it’s far from zero, it indicates that the sample data do not have a normal distribution. In our case, JB SARIMAX > JB SARIMA.
    • A two-tailed test, in statistics, is a method in which the critical area of a distribution is two-sided and tests whether a sample is greater than or less than a certain range of values. In our case, Prob(H) (two-sided) SARIMA > Prob(H) (two-sided) SARIMAX.
    • Ljung-Box (L1) (Q) SARIMAX << Ljung-Box (L1) (Q) SARIMA.  Essentially, it is a test of lack of fit: if the autocorrelations of the residuals are very small, we say that the model doesn’t show ‘significant lack of fit’.
    • The Normal Q-Q-plots shows that the ordered distribution of residuals follows the linear trend for both models. However, the slight curving indicates that our distribution has heavier tails.
    • Histogram and estimated density plot: The KDE follows the N(0,1) line with slight differences.
    • SARIMA: Model succeeded in capturing the underlying hourly ordering patterns with limited accuracy. However, it failed to capture patterns related to days of the week.
    • SARIMAX: Model did a better job! It improved the accuracy of the previous model. However, its accuracy is still limited. This model captured the two daily spikes but not when these spikes crossed the mark of 40 orders. Additionally the model predicts orders during the night hours that are very unlikely to occur. It seems that the model would benefit from another exog variable that would apply hourly weights for each hour of the day.

    Explore More

    Stock Forecasting with FBProphet

    S&P 500 Algorithmic Trading with FBProphet

    E-Commerce Cohort Analysis in Python

    E-Commerce Data Science Use-Case

    E-Commerce ML/AI Classification

    Blogs

    Embed Socials

    Infographic

    Box-Jenkins Method Schema:

    Box-Jenkins Method Schema

    How do we calculate HQIC information criteria for time series data when we fit SARIMA model?

    HQC and RSS calculations

    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • A Simple YouTube Download NLP GUI

    A Simple YouTube Download NLP GUI

    Featured Photo by Cottonbro Studio on Pexels

    Here is the simplest NLP example – the Python based YouTube video Downloader GUI Application by Naem Azam.

    Let’s try to download the following YouTube video

    Let’s set the working directory YOURPATH

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    while importing, cloning and/or installing relevant libraries

    !pip install gitpython

    !pip install git

    git clone https://github.com/naemazam/Youtube-video-Downloader.git

    import tkinter as tk
    from tkinter import *
    from PIL import ImageTk, Image
    from tkinter import messagebox

    import pytube
    import time

    !pip install –upgrade pytube

    Adding Window Components

    root = tk.Tk()
    root.title(“Youtube Downloader”)
    root.geometry(“700×300”)
    root.maxsize(700,250)
    root.minsize(700,300)

    and the GUI function code

    def download():
    link = text.get(“1.0″,”end-1c”)

        if link == '':
                messagebox.showerror("YouTube Downloader", "Please paste a link here") 
        else:
                yt = pytube.YouTube(link)
                stream = yt.streams.first()
                time.sleep(2)
                text.delete(1.0,'end') 
                text.insert('end','Wait Downloading ......')
                time.sleep(5)
                stream.download()
                messagebox.showinfo("YouTube Downloader",'Video has been download successfully')
    

    The main design code is
    header = Label(root,bg=”black”,width=”300″,height=”2″)
    header.place(x=0,y=0)

    with the youtube logo png image
    yt_logo = ImageTk.PhotoImage(Image.open(‘youtube.png’))
    logo = Label(root, image = yt_logo,borderwidth=0)
    logo.place(x=10,y=10)

    by adding the caption label
    caption = Label(root,text=”YouTube Downloader”,font=(‘verdana’,10,’bold’))
    caption.place(x=50,y=10)

    and the youtube logo image
    yt1_logo = ImageTk.PhotoImage(Image.open(‘yt.png’))
    logo1 = Label(root, image = yt1_logo,borderwidth=0)
    logo1.place(x=300,y=60)

    Let’s get the url
    text = Text(root,width=60,height=2,font=(‘verdana’,10,’bold’))
    text.place(x=90,y=180)
    text.insert(‘end’,’Paste your video link here’)

    Download Buttons
    button = Button(root,text=”Download”,relief=RIDGE,font=(‘verdana’,10,’bold’),bg=”red”,fg=”white”,command=download)
    button.place(x=330,y=220)

    and load the window
    root.mainloop()

    Let’s run the GUI code as follows:

    YouTube downloader menu

    Let’s paste the above URL link and hit Download

    Youtube downloader waiting
    Video has been downloaded successfully.

    Outcome: You should see the 3GPP File in YOURPATH. One can view the content with Movies & TV.

    Explore More

    Build A Simple NLP/NLTK Chatbot


    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • ECG Early Warning System (EWS) in Terms of Time-Variant Deformations and Creep-Recovery Strain Tests

    ECG Early Warning System (EWS) in Terms of Time-Variant Deformations and Creep-Recovery Strain Tests

    Featured Photo by Hernan Pauccara on Pexels

    Referring to an earlier stress-strain case study, the objective of this risk management project is to develop the ECG Early Warning System (EWS) based upon time-dependent viscoelastic deformations and observed creep-recovery mechanisms in the cardiac muscle.

    The creep-recovery test involves loading a material at constant stress, holding that stress for some length of time and then removing the load. The response of a typical viscoelastic material to this test is shown below.

    Strain response to the creep-recovery test
    • First there is an instantaneous straining (IS), followed by an ever-increasing strain over time known as creep strain (CS). Elastic recovery (ER) stage: when unloaded, the elastic strain is recovered immediately. There is then anelastic recovery (AR) – strain recovered over time due to the viscoelastic time memory effect; this anelastic strain may be significant in some materials. A permanent strain (PS) may then be left in the material.
    • For viscoelastic materials, a time-dependent function is used instead of a single value of Young’s modulus, and this is called the Young’s relaxation modulus E(t).
    • The creep compliance function is used to describe creep J(t) behavior and can be related to the Young’s modulus E(t).

    This analysis leads to the following CVD risk management chart to be discussed below:

    • Stage 1: The creep deformation is recovered almost entirely when the load is released, i.e. ER >> AR and PS ~ 0.
    • Stage 2: A significant AR effect comparable to CS in terms of both magnitude and duration, whereas PS effect is still negligible.
    • Stage 3: The creep deformation is not recovered when the load is released due to the joint effect of AR and PS comparable to CS.
    • Stage 4: Long-term AR and significant PS effects similar to that of CS are observed.

    Summary

    Stages 1 and 2 support the following experimental observations:

    Cardiac muscle undergoes creep deformation from 2 to 3 % of its original length in 100 s. Large loads that stretch the muscle beyond 15% of its original length produce negligible PS effects, whereas the time course of AR is nearly identical to that of creep.

    Stage 3 is characterized by the long-term AR and PS magnitudes comparable to that of CS, the creep deformation suffered under maintained loading is partially recovered when the load is released.

    Stage 4 is characterized by the significant level of PS ~ CS, the creep deformation suffered under maintained loading is not recovered when the load is released.

    Explore More

    ECG Early Warning System (EWS) in Terms of the Heart Stress-Strain Failure Curve

    AI-Based ECG Recognition – EOY ’22 Status

    ML-Assisted ECG/EKG Anomaly Detection using LSTM Autoencoder

    HealthTech ML/AI Use-Cases

    Heart Failure Prediction using Supervised ML/AI Technique

    Embed Socials


    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • ECG Early Warning System (EWS) in Terms of the Heart Stress-Strain Failure Curve

    ECG Early Warning System (EWS) in Terms of the Heart Stress-Strain Failure Curve

    Featured Photo by Anna Shvets on Pexels

    • The objective of this post is to develop the ECG Early Warning System (EWS) by interpreting (sub-)critical deformations of heart tissues in terms of the stress-strain failure curve. Stress and strain are important concepts in materials engineering and can be related through this curve.
    • If you apply some stress to the heart muscle and measure the resulting strain, or vice versa, you can create a stress vs. strain curve like the one shown below for a typical human heart.
    Stress-strain failure curve
    • Typical stress-strain plot for a heart: The graph begins with elastic deformations (stage 1) and ends at the fracture point (stage 4). 
    • We see that the heart elasticity zone OA starts off with stress being proportional to strain, which means that the heart muscle is operating in its linear region.
    • Heart tissues deform when pushed, pulled, and twisted. Elasticity zone OA is the measure of the amount that the heart can return to its original shape after these external forces and pressures stop.
    • The two parameters that determine the elasticity of a material are its elastic modulus and its elastic limit. A low elastic modulus is typical for materials that are easily deformed under a load; for example, a rubber band. If the stress under a load becomes too high, then when the load is removed, the material no longer comes back to its original shape and size, but relaxes to a different shape and size: The material becomes permanently deformed. The elastic limit A is the stress value beyond which the material no longer behaves elastically but becomes permanently deformed (cf. zone AB).
    • For stresses beyond the elastic limit A, a material exhibits plastic behavior. This means the material deforms irreversibly and does not return to its original shape and size, even when the load is removed. When stress is gradually increased beyond the elastic limit, the material undergoes plastic deformation. Rubber-like materials show an increase in stress with the increasing strain, which means they become more difficult to stretch and, eventually, they reach a fracture point where they break.
    • One of the plastic deformation stages in the stress-strain curve is the strain hardening region BC. This region starts as the strain goes beyond the yield point and ends at the ultimate strength point, the maximal stress shown in the stress-strain curve. In this region, the stress mainly increases as the material elongates, except that there is a nearly flat region at the beginning. The strain hardening region that occurs when the specimen is subjected to the maximum stress it can sustain (also called the ultimate tensile strength or UTS).
    • The necking region CD where the neck forms. At this point, the stress that the material can sustain decreases rapidly as it approaches fracture D.
    • Here is a sketch of stress-strain curves for brittle, ductile materials and rubber network on one graph.
    • It is clear that brittle < the heart tissues < rubber network.
    A sketch of stress-strain curves for brittle, ductile materials and rubber network on one graph
    • Think of a paperclip. If you bend it just a little, it will bounce back every time. This is the elastic region. If you bend it far, however, you will permanently bend the clip.
    • At a certain stress, the material will leave the elastic region. This stress is called the “yield strength.”
    • At any point past the yield strength, the material will suffer permanent deformation (stage 2).
    Think of a paperclip. If you bend it just a little, it will bounce back every time. This is the elastic region. If you bend it far, however, you will permanently bend the clip.

    Explore More

    AI-Based ECG Recognition – EOY ’22 Status

    ML-Assisted ECG/EKG Anomaly Detection using LSTM Autoencoder

    Embed Socials

    Infographic

    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • E2E NETFLIX Visualization: EDA & Plotly UI

    E2E NETFLIX Visualization: EDA & Plotly UI

    Featured Photo by Roberto Nickson on Pexels

    This project consists in the implementation of Python-3 Exploratory Data Analysis (EDA), streaming data visualization and highly interactive Plotly UI for reviewing Netflix movies and TV shows.

    Objectives:

    1. Understanding what content is available in different countries
    2. Identifying similar content by matching text-based features
    3. Network analysis of Actors / Directors to find interesting insights
    4. Does Netflix has more focus on TV Shows than movies in recent years?

    The end-to-end workflow has a purpose to informed the movie enthusiasts to discover the Netflix contents which are presented in several data visualizations consistent with AWS dashboards in R

    The Kaggle Netflix dataset consists of various of TV shows and movies that are available in Netflix platform. To briefly describe the contents of the dataset, the descriptions of each variables are described as follows:

    • show_id: unique id represents the contents (TV Shows/Movies)
    • type: The type of the contents whether it is a Movie or Tv Show
    • title: The title of the contents
    • director: name of the director(s) of the content
    • cast: name of the cast(s) of the content
    • country: Country of which contents was produced
    • date_added: the date of the contents added into the platform
    • release_year: the actual year of the contents release
    • rating: the ratings of the content (viewer ratings)
    • duration: length of duration for the contents (num of series for TV Shows and num of minutes for Movies)
    • listed_in: the list of genres of which the contents was listed in
    • description: full descriptions and synopses of the contents.

    About

    • Netflix is one of the world’s leading entertainment services with 204 million paid memberships in over 190 countries enjoying TV series, documentaries and feature films across a wide variety of genres and languages.
    • Since Netflix began its worldwide expansion in 2016, the streaming service has rewritten the playbook for global entertainment — from TV to film, and, more recently, video games.
    • In this post we will explore the data on TV Shows and Movies available on Netflix worldwide. 

    Input Data

    Beforehand, the working directory YOURPATH and Python libraries that are required for the project are to be loaded as below:

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    from nltk.corpus import stopwords
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    from wordcloud import WordCloud,STOPWORDS

    warnings.filterwarnings(“ignore”)

    netflix_dataset = pd.read_csv(‘netflix_titles.csv’)

    netflix_dataset.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 7787 entries, 0 to 7786
    Data columns (total 12 columns):
     #   Column        Non-Null Count  Dtype 
    ---  ------        --------------  ----- 
     0   show_id       7787 non-null   object
     1   type          7787 non-null   object
     2   title         7787 non-null   object
     3   director      5398 non-null   object
     4   cast          7069 non-null   object
     5   country       7280 non-null   object
     6   date_added    7777 non-null   object
     7   release_year  7787 non-null   int64 
     8   rating        7780 non-null   object
     9   duration      7787 non-null   object
     10  listed_in     7787 non-null   object
     11  description   7787 non-null   object
    dtypes: int64(1), object(11)
    memory usage: 730.2+ KB

    Let’s identify the unique values
    dict = {}
    for i in list(netflix_dataset.columns):
    dict[i] = netflix_dataset[i].value_counts().shape[0]

    print(pd.DataFrame(dict, index=[“Unique counts”]).transpose())

    Unique counts
    show_id                7787
    type                      2
    title                  7787
    director               4049
    cast                   6831
    country                 681
    date_added             1565
    release_year             73
    rating                   14
    duration                216
    listed_in               492
    description            7769

    Let’s identify the missing values

    temp = netflix_dataset.isnull().sum()
    uniq = pd.DataFrame({‘Columns’: temp.index, ‘Numbers of Missing Values’: temp.values})
    uniq

    Number of missing values per column.

    Movies vs TV Shows

    Analysis of Movies vs TV Shows:

    netflix_shows=netflix_dataset[netflix_dataset[‘type’]==’TV Show’]
    netflix_movies=netflix_dataset[netflix_dataset[‘type’]==’Movie’]

    plt.figure(figsize=(8,6))
    ax= sns.countplot(x = “type”, data = netflix_dataset,palette=”Set1″)
    ax.set_title(“TV Shows VS Movies”)

    plt.savefig(‘barcharttvmovies.png’)

    Bar chart Movies vs TV Shows

    It appears that there are more Movies than TV Shows on Netflix.

    Heatmap Year-Month

    Let’s plot the following SNS year-Month heatmap

    netflix_date= netflix_shows[[‘date_added’]].dropna()
    netflix_date[‘year’] = netflix_date[‘date_added’].apply(lambda x: x.split(‘,’)[-1])
    netflix_date[‘month’] = netflix_date[‘date_added’].apply(lambda x: x.split(‘ ‘)[0])
    month_order = [‘January’, ‘February’, ‘March’, ‘April’, ‘May’, ‘June’, ‘July’, ‘August’, ‘September’, ‘October’, ‘November’, ‘December’] #::-1 just reverse this nigga

    df = netflix_date.groupby(‘year’)[‘month’].value_counts().unstack().fillna(0)[month_order].T
    plt.subplots(figsize=(10,10))
    sns.heatmap(df,cmap=’Blues’) #heatmap
    plt.savefig(“heatmapyear.png”)

    Heatmap Year-Month

    This heatmap shows frequencies of TV shows added to Netflix throughout the years 2008-2020.

    Historical Analysis

    Year-by-year analysis since 2006:

    Last_fifteen_years = netflix_dataset[netflix_dataset[‘release_year’]>2005 ]
    Last_fifteen_years.head()

    Input data table: last 15 years.

    plt.figure(figsize=(12,10))
    sns.set(style=”darkgrid”)
    ax = sns.countplot(y=”release_year”, data=Last_fifteen_years, palette=”Set2″, order=netflix_dataset[‘release_year’].value_counts().index[0:15])

    plt.savefig(‘releaseyearcount.png’)

    SNS barchart release year 2006-2018 vs count

    TV Shows

    Analysis of duration of TV shows:

    features=[‘title’,’duration’]
    durations= netflix_shows[features]
    durations[‘no_of_seasons’]=durations[‘duration’].str.replace(‘ Season’,”)
    durations[‘no_of_seasons’]=durations[‘no_of_seasons’].str.replace(‘s’,”)

    durations[‘no_of_seasons’]=durations[‘no_of_seasons’].astype(str).astype(int)

    TV shows with the largest number of seasons:
    t=[‘title’,’no_of_seasons’]
    top=durations[t]

    top=top.sort_values(by=’no_of_seasons’, ascending=False)

    top20=top[0:20]
    print(top20)
    plt.figure(figsize=(80,60))
    top20.plot(kind=’bar’,x=’title’,y=’no_of_seasons’, color=’blue’)
    plt.savefig(‘tvshowsmaxseasons.png’)

    title  no_of_seasons
    2538                      Grey's Anatomy             16
    4438                                NCIS             15
    5912                        Supernatural             15
    1471              COMEDIANS of the world             13
    5137                        Red vs. Blue             13
    1537                      Criminal Minds             12
    7169                   Trailer Park Boys             12
    2678                           Heartland             11
    1300                              Cheers             11
    2263                             Frasier             11
    3592  LEGO Ninjago: Masters of Spinjitzu             10
    5538                    Shameless (U.S.)             10
    1577                          Dad's Army             10
    5795                       Stargate SG-1             10
    2288                             Friends             10
    1597    Danger Mouse: Classic Collection             10
    6983                    The Walking Dead              9
    6718                   The Office (U.S.)              9
    1431            Club Friday The Series 6              9
    2237                      Forensic Files              9
    
    <Figure size 8000x6000 with 0 Axes>
    TV shows with the largest number of seasons

    WordCloud

    Let’s plot the WordCloud of ‘description’

    new_df = netflix_dataset[‘description’]
    words = ‘ ‘.join(new_df)
    cleaned_word = ” “.join(word for word in words.split() )
    wordcloud = WordCloud(stopwords=STOPWORDS,
    background_color=’black’,
    width=3000,
    height=2500
    ).generate(cleaned_word)
    plt.figure(1,figsize=(12, 12))
    plt.imshow(wordcloud)
    plt.axis(‘off’)
    plt.savefig(‘netflixwordcloud.png’)

    Wordcloud of description column

    Recommendations

    Filling null values with empty string
    filledna=netflix_dataset.fillna(”)
    filledna.head()

    Cleaning the data – making all the words lower case
    def clean_data(x):
    return str.lower(x.replace(” “, “”))

    Identifying features on which the model is to be filtered.
    features=[‘title’,’director’,’cast’,’listed_in’,’description’]
    filledna=filledna[features]

    for feature in features:
    filledna[feature] = filledna[feature].apply(clean_data)

    filledna.head()

    def create_soup(x):
    return x[‘title’]+ ‘ ‘ + x[‘director’] + ‘ ‘ + x[‘cast’] + ‘ ‘ +x[‘listed_in’]+’ ‘+ x[‘description’]

    filledna[‘soup’] = filledna.apply(create_soup, axis=1)

    Import CountVectorizer and create the count matrix
    from sklearn.feature_extraction.text import CountVectorizer

    count = CountVectorizer(stop_words=’english’)
    count_matrix = count.fit_transform(filledna[‘soup’])

    Compute the Cosine Similarity matrix based on the count_matrix

    from sklearn.metrics.pairwise import cosine_similarity

    cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

    Reset index of our main DataFrame and construct reverse mapping as before
    filledna=filledna.reset_index()
    indices = pd.Series(filledna.index, index=filledna[‘title’])

    Let’s define the cos similarity based recommendation function

    def get_recommendations_new(title, cosine_sim = cosine_sim2):
    title=title.replace(‘ ‘,”).lower()
    idx = indices[title]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return netflix_dataset['title'].iloc[movie_indices]
    

    Let’s check recommendations for NCIS

    recommendations = get_recommendations_new(‘NCIS’, cosine_sim2)
    print(recommendations)

    4109                     MINDHUNTER
    6876                     The Sinner
    2282                      Frequency
    6524                    The Keepers
    6900                  The Staircase
    1537                 Criminal Minds
    5459                    Secret City
    1772                     Dirty John
    2844    How to Get Away with Murder
    5027                       Quantico
    Name: title, dtype: object

    Countries

    Let’s examine the country list

    country=df[“country”]
    country=country.dropna()

    country=”, “.join(country)
    country=country.replace(‘,, ‘,’, ‘)

    country=country.split(“, “)
    country= list(Counter(country).items())
    country.remove((‘Vatican City’, 1))
    country.remove((‘East Germany’, 1))
    print(country)

    [('Brazil', 88), ('Mexico', 154), ('Singapore', 39), ('United States', 3297), ('Turkey', 108), ('Egypt', 110), ('India', 990), ('Poland', 36), ('Thailand', 65), ('Nigeria', 76), ('Norway', 29), ('Iceland', 9), ('United Kingdom', 723), ('Japan', 287), ('South Korea', 212), ('Italy', 90), ('Canada', 412), ('Indonesia', 80), ('Romania', 12), ('Spain', 215), ('South Africa', 54), ('France', 349), ('Portugal', 4), ('Hong Kong', 102), ('China', 147), ('Germany', 199), ('Argentina', 82), ('Serbia', 7), ('Denmark', 44), ('Kenya', 5), ('New Zealand', 28), ('Pakistan', 24), ('Australia', 144), ('Taiwan', 85), ('Netherlands', 45), ('Philippines', 78), ('United Arab Emirates', 34), ('Iran', 4), ('Belgium', 85), ('Israel', 26), ('Uruguay', 14), ('Bulgaria', 9), ('Chile', 26), ('Russia', 27), ('Mauritius', 1), ('Lebanon', 26), ('Colombia', 45), ('Algeria', 2), ('Soviet Union', 3), ('Sweden', 39), ('Malaysia', 26), ('Ireland', 40), ('Luxembourg', 11), ('Finland', 11), ('Austria', 11), ('Peru', 10), ('Senegal', 3), ('Switzerland', 17), ('Ghana', 4), ('Saudi Arabia', 10), ('Armenia', 1), ('Jordan', 8), ('Mongolia', 1), ('Namibia', 2), ('Qatar', 7), ('Vietnam', 5), ('Syria', 1), ('Kuwait', 7), ('Malta', 3), ('Czech Republic', 20), ('Bahamas', 1), ('Sri Lanka', 1), ('Cayman Islands', 2), ('Bangladesh', 3), ('Zimbabwe', 3), ('Hungary', 9), ('Latvia', 1), ('Liechtenstein', 1), ('Venezuela', 3), ('Morocco', 6), ('Cambodia', 5), ('Albania', 1), ('Cuba', 1), ('Nicaragua', 1), ('Greece', 10), ('Croatia', 4), ('Guatemala', 2), ('West Germany', 5), ('Slovenia', 3), ('Dominican Republic', 1), ('Nepal', 2), ('Samoa', 1), ('Azerbaijan', 1), ('Bermuda', 1), ('Ecuador', 1), ('Georgia', 2), ('Botswana', 1), ('Puerto Rico', 1), ('Iraq', 2), ('Angola', 1), ('Ukraine', 3), ('Jamaica', 1), ('Belarus', 1), ('Cyprus', 1), ('Kazakhstan', 1), ('Malawi', 1), ('Slovakia', 1), ('Lithuania', 1), ('Afghanistan', 1), ('Paraguay', 1), ('Somalia', 1), ('Sudan', 1), ('Panama', 1), ('Uganda', 1), ('Montenegro', 1)]

    Let’s look at the top 10 countries vs show count

    max_show_country=country[0:11]
    max_show_country = pd.DataFrame(max_show_country)
    max_show_country= max_show_country.sort_values(1)

    fig, ax = plt.subplots(1, figsize=(8, 6))
    fig.suptitle(‘Plot of country vs shows’)
    ax.barh(max_show_country[0],max_show_country[1],color=’blue’)
    plt.grid(b=True, which=’major’, color=’#666666′, linestyle=’-‘)

    plt.savefig(‘plotcountryshow.png’)

    Top 10 countries vs show count bar plot

    let’s load the list of country codes

    df1=pd.read_csv(‘country_code.csv’)
    df1=df1.drop(columns=[‘Unnamed: 2’])
    df1.head()

    Country codes

    Let’s define country-based geo-locations as follows

    country_map = pd.DataFrame(country)
    country_map=country_map.sort_values(1,ascending=False)
    location = pd.DataFrame(columns = [‘CODE’])
    search_name=df1[‘COUNTRY’]

    for i in country_map[0]:
    x=df1[search_name.str.contains(i,case=False)]
    x[‘CODE’].replace(‘ ‘,”)
    location=location.append(x)

    print(location)

    CODE         COUNTRY
    211   USA   united states
    92    IND           india
    210   GBR  united kingdom
    37    CAN          canada
    70    FRA          france
    ..    ...             ...
    3     ASM  american samoa
    171   WSM           samoa
    13    AZE      azerbaijan
    22    BMU         bermuda
    137   MNE      montenegro
    
    [115 rows x 2 columns]

    Let’s edit locations

    locations=[]
    temp=location[‘CODE’]
    for i in temp:
    locations.append(i.replace(‘ ‘,”))

    Genres

    Let’s look at the listed genres

    genre=df[“listed_in”]
    genre=”, “.join(genre)
    genre=genre.replace(‘,, ‘,’, ‘)
    genre=genre.split(“, “)
    genre= list(Counter(genre).items())
    print(genre)

    max_genre=genre[0:11]
    max_genre = pd.DataFrame(max_genre)
    max_genre= max_genre.sort_values(1)

    plt.figure(figsize=(40,20))
    plt.xlabel(‘COUNT’)
    plt.ylabel(‘GENRE’)
    plt.barh(max_genre[0],max_genre[1], color=’red’)

    [('International TV Shows', 1199), ('TV Dramas', 704), ('TV Sci-Fi & Fantasy', 76), ('Dramas', 2106), ('International Movies', 2437), ('Horror Movies', 312), ('Action & Adventure', 721), ('Independent Movies', 673), ('Sci-Fi & Fantasy', 218), ('TV Mysteries', 90), ('Thrillers', 491), ('Crime TV Shows', 427), ('Docuseries', 353), ('Documentaries', 786), ('Sports Movies', 196), ('Comedies', 1471), ('Anime Series', 148), ('Reality TV', 222), ('TV Comedies', 525), ('Romantic Movies', 531), ('Romantic TV Shows', 333), ('Science & Nature TV', 85), ('Movies', 56), ('British TV Shows', 232), ('Korean TV Shows', 150), ('Music & Musicals', 321), ('LGBTQ Movies', 90), ('Faith & Spirituality', 57), ("Kids' TV", 414), ('TV Action & Adventure', 150), ('Spanish-Language TV Shows', 147), ('Children & Family Movies', 532), ('TV Shows', 12), ('Classic Movies', 103), ('Cult Movies', 59), ('TV Horror', 69), ('Stand-Up Comedy & Talk Shows', 52), ('Teen TV Shows', 60), ('Stand-Up Comedy', 329), ('Anime Features', 57), ('TV Thrillers', 50), ('Classic & Cult TV', 27)]
    Top 11 listed genres bar chart

    Plotly UI

    Let’s look at the data columns in terms of null values

    df.isnull().sum()

    show_id            0
    type               0
    title              0
    director        2389
    cast             718
    country          507
    date_added        10
    release_year       0
    rating             7
    duration           0
    listed_in          0
    description        0
    dtype: int64

    Let’s edit our data as follows:

    df = df.dropna(how=’any’,subset=[‘cast’, ‘director’])

    df = df.dropna()

    df[“date_added”] = pd.to_datetime(df[‘date_added’])
    df[‘year_added’] = df[‘date_added’].dt.year
    df[‘month_added’] = df[‘date_added’].dt.month

    df[‘season_count’] = df.apply(lambda x : x[‘duration’].split(” “)[0] if “Season” in x[‘duration’] else “”, axis = 1)
    df[‘duration’] = df.apply(lambda x : x[‘duration’].split(” “)[0] if “Season” not in x[‘duration’] else “”, axis = 1)

    df = df.rename(columns={“listed_in”:”genre”})
    df[‘genre’] = df[‘genre’].apply(lambda x: x.split(“,”)[0])

    Let’s plot the most watched content as a donut

    fig_donut = px.pie(df, names=’type’, height=300, width=600, hole=0.7,
    title=’Most watched on Netflix’,
    color_discrete_sequence=[‘#b20710’, ‘#221f1f’])
    fig_donut.update_traces(hovertemplate=None, textposition=’outside’,
    textinfo=’percent+label’, rotation=90)
    fig_donut.update_layout(showlegend=False,plot_bgcolor=’#8a8d93′, paper_bgcolor=’#FAEBD7′)

    Most watched content on Netflix as a donut

    Let’s plot the content vs year

    d1 = df[df[“type”] == “TV Show”]
    d2 = df[df[“type”] == “Movie”]

    col = “year_added”

    vc1 = d1[col].value_counts().reset_index().rename(columns = {col : “count”, “index” : col})
    vc1[‘percent’] = vc1[‘count’].apply(lambda x : 100*x/sum(vc1[‘count’]))
    vc1 = vc1.sort_values(col)

    vc2 = d2[col].value_counts().reset_index().rename(columns = {col : “count”, “index” : col})
    vc2[‘percent’] = vc2[‘count’].apply(lambda x : 100*x/sum(vc2[‘count’]))
    vc2 = vc2.sort_values(col)

    trace1 = go.Scatter(x=vc1[col], y=vc1[“count”], name=”TV Shows”)
    trace2 = go.Scatter(x=vc2[col], y=vc2[“count”], name=”Movies”)
    data = [trace1, trace2]
    fig_line = go.Figure(data)
    fig_line.update_traces(hovertemplate=None)
    fig_line.update_xaxes(showgrid=False)
    fig_line.update_yaxes(showgrid=False)

    Plot TV Shows and Movies vs year 2008-2021.

    Let’s plot the global map of the content distribution worldwide

    df_country = df.groupby(‘year_added’)[‘country’].value_counts().reset_index(name=’counts’)

    fig = px.choropleth(df_country, locations=”country”, color=”counts”,
    locationmode=’country names’,
    title=’Country ‘,
    range_color=[0,200],
    color_continuous_scale=px.colors.sequential.OrRd
    )
    fig.show()

    Global country map vs content count .

    We can examine this global distribution as a function of year

    df_country = df.groupby(‘year_added’)[‘country’].value_counts().reset_index(name=’counts’)

    fig = px.choropleth(df_country, locations=”country”, color=”counts”,
    locationmode=’country names’,
    animation_frame=’year_added’,
    title=’Country Vs Year’,
    range_color=[0,200],
    color_continuous_scale=px.colors.sequential.OrRd
    )
    fig.show()

    Country vs year global map

    Let’s compare ratings for TV Shows and Movies

    Making a copy of df

    dff = df.copy()

    Making 2 df one for tv show and another for movie with rating

    df_tv_show = dff[dff[‘type’]==’TV Show’][[‘rating’, ‘type’]].rename(columns={‘type’:’tv_show’})
    df_movie = dff[dff[‘type’]==’Movie’][[‘rating’, ‘type’]].rename(columns={‘type’:’movie’})
    df_movie = pd.DataFrame(df_movie.rating.value_counts()).reset_index().rename(columns={‘index’:’movie’})

    df_tv_show = pd.DataFrame(df_tv_show.rating.value_counts()).reset_index().rename(columns={‘index’:’tv_show’})
    df_tv_show[‘rating_final’] = df_tv_show[‘rating’]

    Making rating column value negative

    df_tv_show[‘rating’] *= -1

    Chart

    fig = make_subplots(rows=1, cols=2, specs=[[{}, {}]], shared_yaxes=True, horizontal_spacing=0)

    Bar plot for tv shows

    fig.append_trace(go.Bar(x=df_tv_show.rating, y=df_tv_show.tv_show, orientation=’h’, showlegend=True,
    text=df_tv_show.rating_final, name=’TV Show’, marker_color=’#221f1f’), 1, 1)

    Bar plot for movies

    fig.append_trace(go.Bar(x=df_movie.rating, y=df_movie.movie, orientation=’h’, showlegend=True, text=df_movie.rating,
    name=’Movie’, marker_color=’#b20710′), 1, 2)

    fig.show()

    Ratings TV shows vs Movie bar plots

    Let’s plot top 5 most preferred genres for movies

    df_m = df[df[‘type’]==’Movie’]
    df_m = pd.DataFrame(df_m[‘genre’].value_counts()).reset_index()

    fig_bars = px.bar(df_m[:5], x=’genre’, y=’index’, text=’index’,
    title=’Most preferd Genre for Movies’,
    color_discrete_sequence=[‘#b20710’])
    fig_bars.update_traces(hovertemplate=None)
    fig_bars.update_xaxes(visible=False)
    fig_bars.update_yaxes(visible=False, categoryorder=’total ascending’)

    Top 5 most preferred genres for movies

    Let’s plot top 5 TV shows

    df_tv = df[df[‘type’]==’TV Show’]
    df_tv = pd.DataFrame(df_tv[‘genre’].value_counts()).reset_index()

    fig_tv = px.bar(df_tv[:5], x=’genre’, y=’index’, text=’index’,
    color_discrete_sequence=[‘#FAEBD7’])
    fig_tv.update_traces(hovertemplate=None)
    fig_tv.update_xaxes(visible=False)
    fig_tv.update_yaxes(visible=False, categoryorder=’total ascending’)
    fig_tv.update_layout(height=300,

                  hovermode="y unified", 
                  plot_bgcolor='#333', paper_bgcolor='#333')
    

    fig_tv.show()

    Top 5 TV shows

    Let’s plot increasing (red) /decreasing (orange) movies vs year_added

    d2 = df[df[“type”] == “Movie”]
    col = “year_added”

    vc2 = d2[col].value_counts().reset_index().rename(columns = {col : “count”, “index” : col})
    vc2[‘percent’] = vc2[‘count’].apply(lambda x : 100*x/sum(vc2[‘count’]))
    vc2 = vc2.sort_values(col)

    fig2 = go.Figure(go.Waterfall(
    name = “Movie”, orientation = “v”,
    x = [“2008”, “2009”, “2010”, “2011”, “2012”, “2013”, “2014”, “2015”, “2016”, “2017”, “2018”, “2019”, “2020”, “2021”],
    textposition = “auto”,
    text = [“1”, “2”, “1”, “13”, “3”, “6”, “14”, “48”, “204”, “743”, “1121”, “1366”, “1228”, “84”],
    y = [1, 2, -1, 13, -3, 6, 14, 48, 204, 743, 1121, 1366, -1228, -84],
    connector = {“line”:{“color”:”#b20710″}},
    increasing = {“marker”:{“color”:”#b20710″}},
    decreasing = {“marker”:{“color”:”orange”}}

    ))
    fig2.show()

    Bar plot of increasing (red) /decreasing (orange) movies vs year_added

    Trend Detection

    Let’s look at our original input dataset

    Data Shape:  (7787, 12)

    data.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 7787 entries, 0 to 7786
    Data columns (total 12 columns):
     #   Column        Non-Null Count  Dtype 
    ---  ------        --------------  ----- 
     0   show_id       7787 non-null   object
     1   type          7787 non-null   object
     2   title         7787 non-null   object
     3   director      5398 non-null   object
     4   cast          7069 non-null   object
     5   country       7280 non-null   object
     6   date_added    7777 non-null   object
     7   release_year  7787 non-null   int64 
     8   rating        7780 non-null   object
     9   duration      7787 non-null   object
     10  listed_in     7787 non-null   object
     11  description   7787 non-null   object
    dtypes: int64(1), object(11)
    memory usage: 730.2+ KB

    data.isnull().sum()

    show_id            0
    type               0
    title              0
    director        2389
    cast             718
    country          507
    date_added        10
    release_year       0
    rating             7
    duration           0
    listed_in          0
    description        0
    dtype: int64

    Let’s fill in NaNs

    data[‘date_added’] = data[‘date_added’].fillna(‘NaN Data’)
    data[‘year’] = data[‘date_added’].apply(lambda x: x[-4: len(x)])
    data[‘month’] = data[‘date_added’].apply(lambda x: x.split(‘ ‘)[0])

    display(data.sample(3))

    Input data table after filling NaNs

    Let’s plot the source distribution

    val = data[‘type’].value_counts().index
    cnt = data[‘type’].value_counts().values

    fig = go.Figure([go.Bar(x=val, y=cnt, marker_color=’darkturquoise’)])
    fig.update_layout(title_text=’Netflix Sources Distribution’, title_x=0.5)
    fig.show()

    bar plot movie vs TV show

    Let’s plot Trend Movies vs TV Shows in recent years

    from collections import defaultdict

    dict = data.groupby([‘type’, ‘year’]).groups
    dict2 = {}
    dict2 = defaultdict(lambda: 0, dict2)
    for key, values in dict.items():
    val = key[0]+’,’+key[1]
    dict2[val] = len(values)

    x = list(np.arange(2008, 2022, 1))

    y1, y2= [], []
    for i in x:
    y1.append(dict2[‘Movie,’+str(i)])
    y2.append(dict2[‘TV Show,’+str(i)])

    fig = go.Figure(data = [
    go.Bar(name=’Movie’, x=x, y=y1, marker_color=’mediumpurple’),
    go.Bar(name=’TV Show’, x=x, y=y2, marker_color=’lightcoral’)
    ])
    fig.update_layout(title_text=’Trend Movies vs TV Shows in recent years’, title_x=0.5)
    fig.show()

     Trend Movies vs TV Shows in recent years

    Let’s plot the monthly Trend Movies vs TV Shows

    dict = data.groupby([‘type’, ‘month’]).groups
    dict2 = {}
    dict2 = defaultdict(lambda: 0, dict2)
    for key, values in dict.items():
    val = key[0]+’,’+key[1]
    dict2[val] = len(values)

    x = [‘January’, ‘February’, ‘March’, ‘April’, ‘May’, ‘June’, ‘July’,
    ‘August’, ‘September’, ‘October’, ‘November’, ‘December’]

    y1, y2= [], []
    for i in x:
    y1.append(dict2[‘Movie,’+str(i)])
    y2.append(dict2[‘TV Show,’+str(i)])

    fig = go.Figure(data = [
    go.Bar(name=’Movie’, x=x, y=y1, marker_color=’mediumpurple’),
    go.Bar(name=’TV Show’, x=x, y=y2, marker_color=’lightcoral’)
    ])
    fig.update_layout(title_text=’Trend Movies vs TV Shows during Months’, title_x=0.5)
    fig.show()

    Trend Movies vs TV Shows during Months

    Let’s plot Trend Movies vs TV Shows in recent years

    data_movie = data[data[‘type’]==’Movie’].groupby(‘release_year’).count()
    data_tv = data[data[‘type’]==’TV Show’].groupby(‘release_year’).count()
    data_movie.reset_index(level=0, inplace=True)
    data_tv.reset_index(level=0, inplace=True)

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=data_movie[‘release_year’], y=data_movie[‘show_id’],
    mode=’lines’,
    name=’Movies’, marker_color=’mediumpurple’))
    fig.add_trace(go.Scatter(x=data_tv[‘release_year’], y=data_tv[‘show_id’],
    mode=’lines’,
    name=’TV Shows’, marker_color=’lightcoral’))
    fig.update_layout(title_text=’Trend Movies vs TV Shows in recent years’, title_x=0.5)
    fig.show()

    Trend Movies vs TV Shows in recent years

    Top Countries

    Let’s plot top countries where the content was released

    import collections
    import string

    dict1 = {}
    dict1 = defaultdict(lambda: 0, dict1)
    dict2 = {}
    dict2 = defaultdict(lambda: 0, dict2)

    data[‘country’] = data[‘country’].fillna(‘ ‘)

    for i in range(len(data)):
    if data[‘type’][i] == ‘Movie’:
    val = data[‘country’][i].split(‘,’)
    for j in val:
    x = j.lower()
    x = x.strip()
    if x!=”:
    dict1[x]+=1
    else:
    val = data[‘country’][i].split(‘,’)
    for j in val:
    x = j.lower()
    x = x.strip()
    if x!=”:
    dict2[x]+=1
    dict1 = collections.OrderedDict(sorted(dict1.items(), key=lambda x: x[1], reverse=True))
    dict2 = collections.OrderedDict(sorted(dict2.items(), key=lambda x: x[1], reverse=True))

    x1 = list(dict1.keys())[:20]
    x2 = list(dict2.keys())[:20]
    y1 = list(dict1.values())[:20]
    y2 = list(dict2.values())[:20]

    fig = go.Figure([go.Bar(x=x1, y=y1, marker_color=’mediumpurple’)])
    fig.update_layout(title_text=’Top Countries where Movies are released’, title_x=0.5)
    fig.show()

    fig = go.Figure([go.Bar(x=x2, y=y2, marker_color=’lightcoral’)])
    fig.update_layout(title_text=’Top Countries where TV Shows are released’, title_x=0.5)
    fig.show()

    Top Countries where Movies are released
    Top Countries where TV Shows are released

    Let’s look at the global maps

    import plotly.offline as py
    py.offline.init_notebook_mode()
    import pycountry

    df1 = pd.DataFrame(dict1.items(), columns=[‘Country’, ‘Count’])
    df2 = pd.DataFrame(dict2.items(), columns=[‘Country’, ‘Count’])

    total = set(list(df1[‘Country’].append(df2[‘Country’])))

    d_country_code = {} # To hold the country names and their ISO
    for country in total:
    try:
    country_data = pycountry.countries.search_fuzzy(country)
    # country_data is a list of objects of class pycountry.db.Country
    # The first item ie at index 0 of list is best fit
    # object of class Country have an alpha_3 attribute
    country_code = country_data[0].alpha_3
    d_country_code.update({country: country_code})
    except:
    #print(‘could not add ISO 3 code for ->’, country)
    # If could not find country, make ISO code ‘ ‘
    d_country_code.update({country: ‘ ‘})
    for k, v in d_country_code.items():
    df1.loc[(df1.Country == k), ‘iso_alpha’] = v
    df2.loc[(df2.Country == k), ‘iso_alpha’] = v

    fig = px.scatter_geo(df1, locations=”iso_alpha”,
    hover_name=”Country”, # column added to hover information
    size=”Count”, # size of markers, “pop” is one of the columns of gapminder
    )
    fig.update_layout(title_text=’Top Countries where Movie are released’, title_x=0.5)
    fig.show()

    fig = px.scatter_geo(df2, locations=”iso_alpha”,
    hover_name=”Country”, # column added to hover information
    size=”Count”, # size of markers, “pop” is one of the columns of gapminder
    )
    fig.update_layout(title_text=’Top Countries where TV Shows are released’, title_x=0.5)
    fig.show()

    Global map of Top Countries where Movie are released
    Global map of Top Countries where TV Shows are released

    Cast Distributions

    Let’s compare most appeared Cast Globally in Movies vs TV Shows

    dict1 = {}
    dict1 = defaultdict(lambda: 0, dict1)
    dict2 = {}
    dict2 = defaultdict(lambda: 0, dict2)

    data[‘cast’] = data[‘cast’].fillna(‘ ‘)

    for i in range(len(data)):
    if data[‘type’][i] == ‘Movie’:
    val = data[‘cast’][i].split(‘,’)
    for j in val:
    x = j.lower()
    x = x.strip()
    if x!=”:
    dict1[x]+=1
    else:
    val = data[‘cast’][i].split(‘,’)
    for j in val:
    x = j.lower()
    x = x.strip()
    if x!=”:
    dict2[x]+=1

    dict1 = collections.OrderedDict(sorted(dict1.items(), key=lambda x: x[1], reverse=True))
    dict2 = collections.OrderedDict(sorted(dict2.items(), key=lambda x: x[1], reverse=True))

    x1 = list(dict1.keys())[:20]
    x2 = list(dict2.keys())[:20]
    y1 = list(dict1.values())[:20]
    y2 = list(dict2.values())[:20]

    fig = go.Figure([go.Bar(x=x1, y=y1, marker_color=’mediumpurple’)])
    fig.update_layout(title_text=’Most appeared Cast Globally in Movies’, title_x=0.5)
    fig.show()

    fig = go.Figure([go.Bar(x=x2, y=y2, marker_color=’lightcoral’)])
    fig.update_layout(title_text=’Most appeared Cast Globally in TV Shows’, title_x=0.5)
    fig.show()

    Most appeared Cast Globally in Movies
    Most appeared Cast Globally in TV Shows

    NLTK Classifier

    Let’s apply NaiveBayesClassifier to examine the gender ratio in Movies and TV Shows

    import nltk
    import random
    from nltk.corpus import names

    def gender_features(word):
    return {‘last_letter’: word[-1]}

    labeled_names = ([(name, ‘male’) for name in names.words(‘male.txt’)] +
    [(name, ‘female’) for name in names.words(‘female.txt’)])

    random.shuffle(labeled_names)

    featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

    trainset, testset = featuresets[500:], featuresets[:500]

    classifier = nltk.NaiveBayesClassifier.train(trainset)

    dict1 = {}
    dict1 = defaultdict(lambda: 0, dict1)
    dict2 = {}
    dict2 = defaultdict(lambda: 0, dict2)

    df1 = pd.DataFrame(columns = [‘Gender’, ‘Count’])
    df2 = pd.DataFrame(columns = [‘Gender’, ‘Count’])

    data[‘cast’] = data[‘cast’].fillna(‘ ‘)

    for i in range(len(data)):
    if data[‘type’][i] == ‘Movie’:
    val = data[‘cast’][i].split(‘,’)
    for j in val:
    x = j.lower()
    x = x.strip()
    if x!=”:
    if classifier.classify(gender_features(x)) == ‘male’:
    df1.loc[len(df1)] = [‘male’, 1]
    else:
    df1.loc[len(df1)] = [‘female’, 1]
    else:
    val = data[‘cast’][i].split(‘,’)
    for j in val:
    x = j.lower()
    x = x.strip()
    if x!=”:
    if classifier.classify(gender_features(x)) == ‘male’:
    df2.loc[len(df2)] = [‘male’, 1]
    else:
    df2.loc[len(df2)] = [‘female’, 1]

    fig = px.pie(df1, values=’Count’, names=’Gender’, color=’Gender’,
    color_discrete_map={‘female’:’lightcyan’,
    ‘male’:’darkblue’})
    fig.update_layout(title_text=’Gender Ratio in Movies’, title_x=0.5)
    fig.show()

    fig = px.pie(df2, values=’Count’, names=’Gender’, color=’Gender’,
    color_discrete_map={‘female’:’lightcyan’,
    ‘male’:’darkblue’})
    fig.update_layout(title_text=’Gender Ratio in TV Shows’, title_x=0.5)
    fig.show()

    Gender ratio in movies
    Gender ratio in TV shows

    Top Genres

    Let’s look at the highest occurring genres Globally in Movies vs TV Shows

    dict1 = {}
    dict1 = defaultdict(lambda: 0, dict1)
    dict2 = {}
    dict2 = defaultdict(lambda: 0, dict2)

    data[‘listed_in’] = data[‘listed_in’].fillna(‘ ‘)

    for i in range(len(data)):
    if data[‘type’][i] == ‘Movie’:
    val = data[‘listed_in’][i].split(‘,’)
    for j in val:
    x = j.lower()
    x = x.strip()
    if x!=”:
    dict1[x]+=1
    else:
    val = data[‘listed_in’][i].split(‘,’)
    for j in val:
    x = j.lower()
    x = x.strip()
    if x!=”:
    dict2[x]+=1

    dict1 = collections.OrderedDict(sorted(dict1.items(), key=lambda x: x[1], reverse=True))
    dict2 = collections.OrderedDict(sorted(dict2.items(), key=lambda x: x[1], reverse=True))
    x1 = list(dict1.keys())[:20]
    x2 = list(dict2.keys())[:20]
    y1 = list(dict1.values())[:20]
    y2 = list(dict2.values())[:20]

    fig = go.Figure([go.Bar(x=x1, y=y1, marker_color=’mediumpurple’)])
    fig.update_layout(title_text=’Highest occurring genres Globally in Movies’, title_x=0.5)
    fig.show()

    fig = go.Figure([go.Bar(x=x2, y=y2, marker_color=’lightcoral’)])
    fig.update_layout(title_text=’Highest occurring genres Globally in TV Shows’, title_x=0.5)
    fig.show()

    Highest occurring genres Globally in Movies
    Highest occurring genres Globally in TV Shows

    Let’s review the overall country-based genre counts

    dict2 = {}
    dict2 = defaultdict(lambda: 0, dict2)

    data2 = data
    data2[‘country’] = data2[‘country’].apply(lambda x: x.lower())
    data2[‘listed_in’] = data2[‘listed_in’].apply(lambda x: x.lower())

    df1 = pd.DataFrame(columns=[‘Country’, ‘Genre’, ‘Count’])

    for i in range(len(data2)):
    for j in data2[‘country’][i].split(‘,’):
    for k in data2[‘listed_in’][i].split(‘,’):
    val = j+’,’+k
    dict2[val]+=1

    dict2 = collections.OrderedDict(sorted(dict2.items(), key=lambda x: x[1], reverse=True))

    a, b, c = 0, 0, 0
    for k,v in dict2.items():
    if k.split(‘,’)[0] == ‘india’ and a<5:
    df1.loc[len(df1)] = [k.split(‘,’)[0], k.split(‘,’)[1],v]
    a+=1
    elif k.split(‘,’)[0] == ‘united states’ and b<5:
    df1.loc[len(df1)] = [k.split(‘,’)[0], k.split(‘,’)[1],v]
    b+=1
    elif k.split(‘,’)[0] == ‘united kingdom’ and c<5:
    df1.loc[len(df1)] = [k.split(‘,’)[0], k.split(‘,’)[1],v]
    c+=1

    df1

    Country-based genre count

    Let’s compare Distribution of Genres in India, US, UK

    fig = px.sunburst(df1, path = [‘Country’, ‘Genre’], values = ‘Count’, color = ‘Country’,
    color_discrete_map = {‘united states’: ‘#85e0e0’, ‘india’: ‘#99bbff’, ‘united kingdom’: ‘#bfff80’})
    fig.update_layout(title_text=’Distribution of Genres in India, US, UK’, title_x=0.5)
    fig.show()

    Distribution of Genres in India, US, UK

    Age Group

    Let’s plot Age Group Distribution

    data.iloc[67, 8] = ‘R’
    data.iloc[2359, 8] = ‘TV-14’
    data.iloc[3660, 8] = ‘TV-PG’
    data.iloc[3736, 8] = ‘R’
    data.iloc[3737, 8] = ‘R’
    data.iloc[3738, 8] = ‘R’
    data.iloc[4323, 8] = ‘PG-13’

    data[‘age_group’] = data[‘rating’]
    MR_age = {‘TV-MA’: ‘Adults’,
    ‘R’: ‘Adults’,
    ‘PG-13’: ‘Teens’,
    ‘TV-14’: ‘Young Adults’,
    ‘TV-PG’: ‘Older Kids’,
    ‘NR’: ‘Adults’,
    ‘TV-G’: ‘Kids’,
    ‘TV-Y’: ‘Kids’,
    ‘TV-Y7’: ‘Older Kids’,
    ‘PG’: ‘Older Kids’,
    ‘G’: ‘Kids’,
    ‘NC-17’: ‘Adults’,
    ‘TV-Y7-FV’: ‘Older Kids’,
    ‘UR’: ‘Adults’}
    data[‘age_group’] = data[‘age_group’].map(MR_age)

    val = data[‘age_group’].value_counts().index
    cnt = data[‘age_group’].value_counts().values

    fig = go.Figure([go.Bar(x=val, y=cnt, marker_color=’darkturquoise’)])
    fig.update_layout(title_text=’Age Group Distribution’, title_x=0.5)
    fig.show()

    Age Group Distribution

    Duration

    Let’s plot Distribution of Duration across Movies and TV Show in the past years

    data_movie = data[data[‘type’] == ‘Movie’]
    data_tv = data[data[‘type’] == ‘TV Show’]

    create trace 1 that is 3d scatter

    trace1 = go.Scatter3d(
    x=data_movie.duration,
    y=data_tv.duration,
    z=data.release_year,
    mode=’markers’,
    marker_color=’darkturquoise’
    )

    data2 = [trace1]
    layout = go.Layout(
    )
    fig = go.Figure(data=data2, layout=layout)
    fig.update_layout(title_text=’Distribution of Duration across Movies and TV Show in the past years’, title_x=0.5)
    iplot(fig)

    Distribution of Duration across Movies and TV Show in the past years

    Let’s compare duration of movies vs TV shows as boxplots

    data_movie = data[data[‘type’] == ‘Movie’]
    data_tv = data[data[‘type’] == ‘TV Show’]

    trace0 = go.Box(
    y = data_movie.duration,
    name = “Duration of Movies”,
    marker_color=’mediumpurple’
    )

    trace1 = go.Box(
    y = data_tv.duration,
    name = “Duration of TV Shows”,
    marker_color=’lightcoral’
    )

    data2 = [trace0,trace1]
    iplot(data2)

     Duration of movies vs TV shows as boxplots

    Link to AWS

    This post is linked to the AWS Netflix visualization dashboard in R. It consists of the following 3 steps discussed above:

    • Data Preparation
    • Creating Visualization
    • Trend Detection

    In fact, the Netflix data set has a lot of information that could be explored. In this article, several information that has been explored including the growth of the contents over the year, the distribution of contents by countries, the common genres in the selected countries, the age of contents distributions by each countries, and network of casts in the Netflix contents worldwide.

    Interestingly, the contents of Netflix platform are dramatically increase from 2015-2019 which also shows the possibility of traction gains of the platform during the periods. The contents themselves were mostly derives from US, India, and UK as three of those countries have a high numbers of contents in the world. Likewise, the common genres and age of contents distributions for each of those countries are varied.

    Overall, the visualizations of the data set eases the exploration of the data set which would then be processed for ML purpose. The type of the visualizations would be depended on which of the insights or information that would want to be presented.

    Summary

    • Entertainment companies today are swamped with data stored and collected from various mediums and sources.
    • To gain insights from this data, we use Python EDA and advanced data visualization algorithms and make predictions about future events, and plan necessary strategies. 
    • Learnings gained through data mining can be used further within prescriptive analytics to drive actions based on predictive insights.
    • As a recommendation for this data set, a recommender ML could be deployed here which would classify the contents and movies that have similar context in descriptions, directors, genres, and other variables in the data set.

    Explore More

    Webscraping in R – IMDb ETL Showcase

    ML/AI Prediction of Wine Quality

    Textual Genres Analysis using the Carloto’s NLP Algorithm

    Embed Socials


    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
%d bloggers like this: