• EdTech for All: Free/Paid IoT Courses ’22

    Featured Photo by Jorge Ramirez on Unsplash.

    IoT = Device + Gateway + Cloud

    The Internet of Things, or IoT, is a system of interrelated computing devices, mechanical and digital machines, objects, animals or people that are provided with unique identifiers (UIDs) and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction.

    IoT Statistics – Key Findings

    • In 2021, there were more than 10 billion active IoT devices.
    • It’s estimated that the number of active IoT devices will surpass 25.4 billion in 2030.
    • By 2025, there will be 152,200 IoT devices connecting to the internet per minute.
    • IoT solutions have the potential to generate $4-11 trillion in economic value by 2025.
    • 83% of organizations have improved their efficiency by introducing IoT technology.
    • It’s estimated that global IoT spending will total $15 trillion in the six-year period between 2019 and 2025.
    • The consumer IoT market is estimated to reach $142 billion by 2026 at a CAGR of 17%.
    • 94% of retailers agree that the benefits of implementing IoT outweigh the risk.
    • The amount of data generated by IoT devices is expected to reach 73.1 ZB (zettabytes) by 2025.

    In 2021, IoT Analytics expected the global number of connected IoT devices to grow 9%, to 12.3 billion active endpoints. By 2025, there will likely be more than 27 billion IoT connections.

    State of IoT 2022: Number of connected IoT devices growing 18% to 14.4 billion globally.

    Read more about the number of connected IoT devices 2010-2025.

    IoT connections market update—May 2022

    Despite booming demand for IoT solutions and positive sentiment in the IoT community as well as in most IoT end markets, IoT Analytics expects the chip shortage’s impact on the number of connected IoT devices to last well beyond 2023. Other headwinds for IoT markets include the ongoing COVID-19 pandemic and general supply chain disruptions. In 2022, the market for the Internet of Things is expected to grow 18% to 14.4 billion active connections. By 2025, as supply constraints ease and growth accelerates further, there will likely be approximately 27 billion connected IoT devices.

    Most Popular MOOC Platforms in 2022:




    Explore More

  • Invest in AI via Macroaxis Sep ’22 Update

    SeekingAlpha Opinion: Want To Rule The World? Invest In AI

    Following our recent study, let’s look at investing opportunities in Artificial Intelligence (AI) using the Macroaxis Wealth Optimization platform. Here is why:

    4 AI pillars
    Source: Deloitte

    Thematic Idea

    This theme comprises firms and funds developing tools for AI: tech companies, funds, and ETFs across multiple industries involved in R&D on reasoning, learning, NLP, and perception, as well as their application to science and e-commerce. It may also include entities working in cybernetics and cognitive brain simulation.

    AI thematic idea
    AI vs DOW stocks
    09-21-2022 DOW -3.99%, AI +3.12%

    Asset Allocation

    The AI investing theme is composed of its constituents, equally weighted against each other.

    MC %

    AI theme market capitalization (MC) usually refers to the total value of a theme’s positions broken down into specific market cap categories. To manage market risk and economic uncertainty, many investors today build portfolios that are diversified across equities with different MCs. However, as a general rule, conservative investors tend to hold large-cap stocks, and those looking for more risk prefer small-cap and mid-cap equities.

    Market Capitalization (%)

    Instrument Composition

    By diversifying AI theme assets across categories whose investment returns move up and down under different market conditions, an investor can protect against significant losses. Historically, the returns of the major asset categories, such as stocks, funds, ETFs, and cryptocurrencies, have not moved up and down simultaneously. Market conditions that cause one asset classification to do well often cause another to have average or poor returns. By investing in more than one asset classification, investors reduce the risk of losing money, and the portfolio’s overall returns show lower volatility. If one asset category’s return falls, better returns in another category can help counteract the losses.

    Instrument Composition And Concentration

    Market Elasticity

    The market elasticity of a theme measures how responsive the resulting portfolio will be to changes in market or economic conditions. Most investing themes are subject to two types of risk – systematic (i.e., market) and unsystematic (i.e., nonmarket or company-specific) risk. Unsystematic risk is the risk that events specific to the Artificial Intelligence theme will adversely affect the performance of its constituents. This type of risk can be diversified away by optimizing the themed equities into an efficient portfolio with different positions weighted according to their correlations. On the other hand, systematic risk is the risk that the theme constituents’ prices will be affected by overall market movements and cannot be diversified. Below are essential risk-adjusted performance indicators that can help to measure the overall market elasticity of the AI theme.
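    A standard way to quantify the systematic part of this risk is beta: the covariance of an asset’s returns with the market’s, scaled by the market’s variance. The sketch below is illustrative only; both return series are made-up numbers, not Macroaxis data.

```python
import numpy as np

# Hypothetical daily returns for one AI-theme asset and the market index
asset = np.array([0.012, -0.004, 0.008, -0.011, 0.005, 0.009, -0.002])
market = np.array([0.010, -0.006, 0.007, -0.009, 0.004, 0.008, -0.003])

# Beta measures systematic (market) risk: covariance with the market
# scaled by market variance. Beta > 1 means amplified market moves.
beta = np.cov(asset, market, ddof=1)[0, 1] / np.var(market, ddof=1)
```

    A beta near 1 moves in lockstep with the market; a diversified theme cannot eliminate this component, only the asset-specific part.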

    Market Elasticity

    Risk/Return Ratio

    An investing theme such as AI should be diversified across asset classifications. So, in addition to allocating your investments among stocks, funds, ETFs, cash, and possibly cryptocurrencies, you will also need to spread out your investments within each asset category. The key is to identify investments in segments of each asset category that may perform differently under different market conditions. One way of diversifying your investments within an asset category is investing in a wide range of entities and industry sectors with different risk-return characteristics, as shown below.

    AI portfolio risk vs daily expected return

    Asset Ratings

    Many investors optimize their portfolios to maintain a risk-return balance that meets their personal investing preferences and liquidity needs. Understanding the relationship between the Sharpe ratio, risk, and expected return will help you build an optimal portfolio out of your selected theme. The Sharpe ratio describes how much excess return you receive for the extra volatility you endure for holding a position in a themed portfolio. Below are the essential efficiency ratios that can help you quickly create a reliable input to your portfolio optimization process.
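    The Sharpe ratio described above is simply the mean excess return divided by the volatility of those excess returns. A minimal numerical sketch, with a hypothetical return series and an assumed daily risk-free rate:

```python
import numpy as np

# Hypothetical daily returns for a themed position
returns = np.array([0.010, -0.003, 0.007, 0.002, -0.005, 0.012, 0.004])
risk_free = 0.0001  # assumed daily risk-free rate

excess = returns - risk_free
# Sharpe ratio: mean excess return per unit of volatility endured
sharpe = excess.mean() / excess.std(ddof=1)
```

    In practice the daily ratio is often annualized (multiplied by the square root of the number of trading periods per year) before comparing positions.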

    Asset Ratings
    Asset Ratings (continued)

    Technical Analysis

    The table below shows technical indicators of the assets in the current unweighted theme. These indicators can be significantly improved after the theme is optimized. By diversifying this theme into an optimal portfolio, it is possible not only to reduce its total risk but also to increase alpha, improve the information ratio, and increase the potential upside.

    Technical Analysis Of Total Daily Returns

    Correlation Matrix

    The AI theme correlation table is a 2D matrix that shows the Pearson correlation coefficient between every pair of the theme’s securities. The cells are color-coded to highlight significantly positive and negative relationships, representing the degree to which the price movements of the theme’s assets track each other.
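    A Pearson correlation matrix like this takes only a few lines of NumPy. The three synthetic return series below are illustrative stand-ins for theme constituents; the third is deliberately constructed to correlate with the first.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical daily returns for three theme constituents (rows = assets)
a = rng.normal(0, 0.01, 250)
b = rng.normal(0, 0.01, 250)
c = 0.8 * a + rng.normal(0, 0.005, 250)  # built to correlate with `a`

# Pearson correlation matrix between all pairs of potential positions
corr = np.corrcoef([a, b, c])
```

    Each diagonal entry is 1.0 (every series correlates perfectly with itself); off-diagonal entries near zero, like the `a`/`b` pair, are the kind of low-correlation pairings diversification looks for.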

    Portfolio correlation matrix
    Portfolio correlation matrix (continued)

    Read more: Protecting Portfolios Using Correlation Diversification.


    • Using investing ideas such as the AI theme to originate optimal portfolios saves a lot of time and completely automates your asset selection decisions. The framework behind a single and multiple investing theme optimization is designed to address the most technical part of the wealth optimization process, including asset allocation, equity research, portfolio diversification, portfolio rebalancing, and portfolio suggestion.
    • Macroaxis AI ideas are bundles of up to 20 equally weighted funds, stocks, ETFs, or cryptocurrencies that are programmatically selected from a pool of about 70 equities. The Macroaxis Investing Theme typically reflects a particular investment outlook based on shared economic or social characteristics, a joint business domain, or an essential financial categorization feature such as industry, growth potential, capitalization, locality, volatility, or market segment. It is an excellent tool to take your emotions out of your investing decisions.


  • ML/AI Diamond Price Prediction with R

    Inspired by the recent study, this post covers an analysis of diamonds with RStudio. On the demand side, customers in the market for a less expensive, smaller diamond are probably more price-sensitive than more well-to-do buyers. Therefore, it makes perfect sense to use a pre-trained ML/AI model to get an idea of whether you are overpaying.

    Preparation Phase

    Let’s change our working directory to YOURPATH

    setwd("YOURPATH")

    and install the following packages

    install.packages("GGally", type = "binary")
    install.packages("tidymodels", type = "binary")
    install.packages("tictoc", type = "binary")
    install.packages("caTools", type = "binary")
    install.packages("xgboost", type = "binary")
    install.packages("e1071", type = "binary")
    install.packages("rpart", type = "binary")
    install.packages("randomForest", type = "binary")
    install.packages("Metrics", type = "binary")

    We need a few libraries

    library(GGally)
    library(tidymodels)
    library(tictoc)
    library(caTools)
    library(xgboost)
    library(e1071)
    library(rpart)
    library(randomForest)
    library(Metrics)
    library(modelr) # provides spread_residuals() used below
    and call the diamond data (the diamonds dataset ships with ggplot2)

    data(diamonds)

    Here is the summary

    summary(diamonds)
    Diamond data summary table

    Exploratory Data Analysis (EDA)

    Let’s look at the pair correlation plot produced by GGally

    ggpairs(diamonds)

    Diamond data pair-correlation plot

    Let’s look at the Q-Q plot of log Price

    qqnorm(log(diamonds$price)); qqline(log(diamonds$price))

    Normal Q-Q plot of log Price

    We can see that the multi-modal histogram of log Price does not follow the normal distribution

    hist_norm <- ggplot(diamonds, aes(log(price))) +
    geom_histogram(aes(y = ..density..), colour = "black", fill = "yellow", bins = 55) +
    stat_function(fun = dnorm, args = list(mean = mean(log(diamonds$price)), sd = sd(log(diamonds$price))))

    Histogram of log Price

    Let’s consider log Price as the target variable, drop price and the x, y, z columns (so that log_price ends up as the seventh and last column, which the matrix indexing below relies on), and split the input data with SplitRatio = 0.7

    diamonds <- diamonds %>%
    mutate(log_price = log(price)) %>%
    select(-price, -x, -y, -z)

    set.seed(42) # for reproducibility
    split = sample.split(diamonds$log_price, SplitRatio = 0.7)
    diamonds_train = subset(diamonds, split == TRUE)
    diamonds_test = subset(diamonds, split == FALSE)

    Let’s prepare the data for ML

    diamonds_train <- diamonds_train %>%
    mutate_at(c('table', 'depth'), ~(scale(.) %>% as.vector))
    diamonds_test <- diamonds_test %>%
    mutate_at(c('table', 'depth'), ~(scale(.) %>% as.vector))

    Training Model

    Let’s call the lm() function to fit linear models

    mlm <- lm(log_price ~ carat + color + cut + clarity + table + depth, diamonds_train)

    mlm: 0.04 sec elapsed

    lm summary: residuals, coefficients, signif. codes, residual standard error 0.345, R² = 0.88, p-value < 2.2e-16, and F-statistic 1.51e+4.

    Let’s fit a 3rd-order polynomial model with lm()

    poly <- lm(log_price ~ poly(carat,3) + color + cut + clarity + poly(table,3) + poly(depth,3), diamonds_train)

    lm poly 3 summary: residuals, coefficients, signif. codes, residual standard error 0.1332, R² = 0.98, p-value < 2.2e-16, and F-statistic 8.64e+4.

    Let’s apply the XGBoost algorithm to both training and test data

    diamonds_train_xgb <- diamonds_train %>%
    mutate_if(is.factor, as.numeric)
    diamonds_test_xgb <- diamonds_test %>%
    mutate_if(is.factor, as.numeric)

    xgb <- xgboost(data = as.matrix(diamonds_train_xgb[-7]), label = diamonds_train_xgb$log_price, nrounds = 6166, verbose = 0)


    XGBoost summary

    xgb_pred = predict(xgb, as.matrix(diamonds_test_xgb[-7]))

    y_actual <- diamonds_test_xgb$log_price
    y_predicted <- xgb_pred

    test <- data.frame(cbind(y_actual, y_predicted))

    Let’s look at the test actual-predicted data cross-plot (back-transforming from log price to dollar price)

    xgb_scatter <- ggplot(test, aes(exp(y_actual), exp(y_predicted))) + geom_point(colour = 'black', alpha = 0.3) + geom_smooth(method = lm)

    XGBoost test actual-predicted data cross-plot

    Let’s apply the SVM algorithm to both training and test data using kernel = 'radial'

    svr <- svm(formula = log_price ~ .,
    data = diamonds_train,
    type = 'eps-regression',
    kernel = 'radial')

    svm(formula = log_price ~ ., data = diamonds_train, type = “eps-regression”, kernel = “radial”)

    SVM-Type: eps-regression
    SVM-Kernel: radial
    cost: 1
    gamma: 0.04761905
    epsilon: 0.1

    Number of Support Vectors: 13718

    Let’s switch to the Decision Tree algorithm

    tree <- rpart(formula = log_price ~ .,
    data = diamonds_train,
    method = 'anova',
    model = TRUE)

    Decision Tree summary

    Let’s consider the Random Forest approach

    rf <- randomForest(log_price ~ .,
    data = diamonds_train,
    ntree = 500)

    randomForest(formula = log_price ~ ., data = diamonds_train, ntree = 500)
    Type of random forest: regression
    Number of trees: 500
    No. of variables tried at each split: 2

          Mean of squared residuals: 0.01172975
                    % Var explained: 98.89

    Model Performance

    Let’s make predictions and compare model performance

    mlm_pred <- predict(mlm, diamonds_test)
    poly_pred <- predict(poly, diamonds_test)
    svr_pred <- predict(svr, diamonds_test)
    tree_pred <- predict(tree, diamonds_test)
    rf_pred <- predict(rf, diamonds_test)
    xgb_pred <- predict(xgb, as.matrix(diamonds_test_xgb[-7]))

    xgb_resid <- diamonds_test_xgb$log_price - xgb_pred

    resid <- diamonds_test %>%
    spread_residuals(mlm, poly, svr, tree, rf) %>%
    select(mlm, poly, svr, tree, rf) %>%
    rename_with( ~ paste0(.x, '_resid')) %>%
    cbind(xgb_resid)

    predictions <- diamonds_test %>%
    select(log_price) %>%
    cbind(mlm_pred) %>%
    cbind(poly_pred) %>%
    cbind(svr_pred) %>%
    cbind(tree_pred) %>%
    cbind(rf_pred) %>%
    cbind(xgb_pred) %>%
    cbind(resid)

    mean_log_price <- mean(diamonds_test$log_price)
    tss = sum((diamonds_test_xgb$log_price - mean_log_price)^2)

    square <- function(x) {x**2}
    r2 <- function(x) {1 - x/tss}

    r2_df <- resid %>%
    mutate_all(square) %>%
    summarize_all(sum) %>%
    mutate_all(r2) %>%
    gather(key = 'model', value = 'r2') %>%
    mutate(model = str_replace(model, '_resid', ''))

    model r2
    1 mlm 0.8916784
    2 poly 0.9813779
    3 svr 0.9862363
    4 tree 0.9103298
    5 rf 0.9882366
    6 xgb 0.9860271

    xgb_rmse = sqrt(mean(xgb_resid^2))

    Let’s look at the model performance plot

    r2_plot <- ggplot(r2_df, aes(x = model, y = r2, colour = model, fill = model)) + geom_bar(stat = 'identity')
    r2_plot + ggtitle('R-squared Values for each Model') + coord_cartesian(ylim = c(0.75, 1))

    Model performance in terms of R-squared values

    Let’s look at the predicted vs actual price of the test data

    diamonds_test_sample <- diamonds_test %>%
    left_join(predictions, by = 'log_price') %>%
    slice_sample(n = 1000)
    ggplot(diamonds_test_sample, aes(x = exp(log_price), y = exp(rf_pred), size = abs(rf_resid))) +
    geom_point(alpha = 0.1) + labs(title = 'Predicted vs Actual Cost of Diamonds in USD', x = 'Price', y = 'Predicted Price', size = 'Residuals')

    Predicted vs actual price in USD of the test data
    Linear fit (red line): Predicted vs actual price in USD of the test data

    Key Takeaways

    • The Random Forest model performed best according to the R2 value.
    • The critical factor driving price is the size or the carat weight of the diamond. 
    • Recall that we applied the log transformation to our long-tailed dollar variable before using it in regression.
    • You can use the above pre-trained model to get a sense of whether you are overpaying, and to establish a lasting business relationship with a jeweler you can trust.


    Best Day of My Life — Regression Analysis to Predict the Price of Diamonds

    Diamonds Price EDA and Prediction

    100% ML: Diamond price prediction using machine learning, python, SVM, KNN, Neural networks

    How machine learning can predict the price of the diamond you desire to buy

    Diamond Price Prediction with Machine Learning

    “Diamonds are forever” — price prediction using Machine Learning regression models and neural networks.

    Diamond Price Prediction

    Building Regression Models in R using Support Vector Regression

  • All Eyes on ETFs Sep ’22

    Is Now A Good Time To Buy ETFs?

    S&P 500: This Bear Market Is About To End (Technical Analysis).

    An exchange-traded fund (ETF) is a type of pooled investment security that operates much like a mutual fund. Typically, ETFs will track a particular index, sector, commodity, or other assets, but unlike mutual funds, ETFs can be purchased or sold on a stock exchange the same way that a regular stock can. An ETF can be structured to track anything from the price of an individual commodity to a large and diverse collection of securities. ETFs can even be structured to track specific investment strategies. The first ETF was the SPDR S&P 500 ETF (SPY), which tracks the S&P 500 Index.

    ETFs are versatile investment securities offering a wide range of benefits for investors. Whether you want to passively track a broad market index or invest in a niche area of the market, ETFs provide a low-cost, simple means of accessing a basket of securities in one fund.

    When you buy an ETF, you’re buying a basket of securities wrapped into one investment that trades on an exchange. Most ETFs passively track an underlying index, which is a representation of other securities or asset types, such as stocks, bonds, commodities, or currencies.

    Pros of ETFs:

    • Diversification: ETFs provide exposure to dozens, or even hundreds, of securities in just one basket.
    • Specialization: Certain specialty ETFs enable access to niche areas of the market.
    • Low cost: Because ETFs are passively managed, the operational costs are extremely low compared to actively managed portfolios. 
    • Tax-efficiency: For the ETFs that track a benchmark index, there is very little turnover.
    • Market orders: One of the stock-like aspects that can be a benefit for investors is the ability to place market orders.

    Cons of ETFs:

    • Trading costs: Many ETFs can be traded at zero commission and with no transaction fee. However, some brokers will charge commissions to trade certain ETFs on their platform.
    • Illiquidity: ETFs that have low trading volumes can have wide bid-ask spreads.
    • Settlement: As is the case with stocks, ETF settlement is T+2.

    Types of ETFs:

    • Equity ETFs such as S&P 500, Dividend, and International ETFs.
    • Fixed-Income or bond ETFs (Government Bonds, corporate bond, etc.)
    • Commodity ETFs (Gold, Silver, Oil, Copper, etc.)
    • Currency ETFs (Forex)
    • Real Estate ETFs (REIT) – high-yielding investments.
    • Specialty ETFs – semiconductor, etc.

    The Pup’s Weekend Dig – Does the oversold bounce continue?

    Sectors On Watch: $XBI – Biotech, $XLY – Consumer Discretionary, $XLV – Healthcare, $TAN – Solar, and $XLU – Utilities (Sep 11, 2022, 8:52 PM).

    • Wed, Sep 14, 2:08 PM 5 Reasons to Buy Commodities ETFs Right Now: After a decade of underperformance, commodities are experiencing a huge rally due to the Russia-Ukraine war, sky-high inflation, pent-up demand after the COVID-19 pandemic, widespread vaccination, chances of more oncoming COVID-19 antiviral pills and still-moderate rates.
    • Mon, 12 Sept at 14:04 Stocks are well off their June lows with the Dow up 7.43%, the S&P up 10.9%, the Nasdaq up 13.8%, and the Russell 2000 up 14.1%.
    • September 10, 2022 Right now, it’s easier than ever to own crypto thanks to the launch of Bitcoin ETFs. In total, 38 publicly traded companies are holding over $5 billion in Bitcoin. Between ETFs, countries like Nicaragua, and public and private companies, there is over $28 billion in Bitcoin being held on balance sheets like treasuries.
    • Thu, Sep 8, 2:05 PM. Stocks Soar As Oil Prices Fall: the Fed is widely expected to raise rates by another 75 basis points at their next FOMC meeting on September 20-21. In fact, even with rates climbing, Q3 GDP estimates are forecast at 1.4%. That’s a big improvement from Q2’s -0.6% and Q1’s -1.6%. In other news, MBA Mortgage Applications fell -0.8% w/w, with purchases down -0.7%, and refi’s down -1.1%.
    • This year’s first half performance (down nearly -21%), was strikingly similar to that of 1970 (also down -21%). And in both periods, high inflation was an issue. But in the second half of 1970, the S&P was up 27%.
    • Many dividend-paying stocks have held up well this year as the major indices entered a bear market. Contrary to conventional wisdom, the opportunity still exists for investors to create a reliable stream of income from the equity markets. Dividends are a fantastic way to generate sizeable returns from stocks.
    • While blockchain was put on the map for its use in the cryptocurrency market, it’s evolved into an indispensable business tool for processing all types of transactions and data transfers – from financial, to shipping, to health records, and more. It’s truly revolutionizing virtually all industries that rely on security, cost efficiency, and speed.



    Unlike mutual funds, an ETF trades like a common stock on a stock exchange. ETFs experience price changes throughout the day as they are bought and sold. ETFs typically have higher daily liquidity and lower fees than mutual fund shares, making them an attractive alternative for individual investors.

    Because it trades like a stock, an ETF does not have its net asset value (NAV) calculated once at the end of every day like a mutual fund does.
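    The NAV a mutual fund computes at the end of each day is simply net assets divided by shares outstanding; an ETF’s market price, by contrast, floats throughout the session. A minimal sketch with hypothetical figures:

```python
# Sketch of the end-of-day NAV calculation a mutual fund performs
# (all figures are hypothetical)
total_assets = 505_000_000.0      # market value of holdings plus cash
total_liabilities = 5_000_000.0   # accrued fees and other liabilities
shares_outstanding = 10_000_000

nav_per_share = (total_assets - total_liabilities) / shares_outstanding
print(nav_per_share)  # 50.0
```

    An ETF trading above this figure is said to be at a premium to NAV, and below it, at a discount.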

    TradingView SPY S&P 500 etf Head and Shoulders Chart Pattern


    ETFs market overview

    ETF Funds By Asset Class

    Use this ETF Screener to search for ETFs by Asset Class, including Equity ETFs, Commodity ETFs, Currency ETFs, Fixed Income ETFs, and more.


     [HYG] iShares iBoxx $ High Yield Corporate Bond ETF
source: AIolux


    Monday September 19, 2022 World Market Overview

    world stock markets
source: macroaxis
    Correlation Matchups

    The Correlation Coefficient is a useful tool for identifying correlated or non-correlated securities, which is essential in developing a diversified ETF portfolio. It tells us the relationship between two positions you hold in your ETF portfolio or are considering acquiring. Over a given time period, two securities move together when the Correlation Coefficient is positive. Conversely, two ETF assets move in opposite directions when the Correlation Coefficient is negative. Determining your positions’ relationship to each other is valuable for analyzing and projecting your portfolio’s future expected return and risk.

    Overexposure to a single market brings diversification risk into a portfolio, leaving it vulnerable to losses in that economy and underexposed to markets in other parts of the world. For the same reason, if you are currently managing a portfolio composed of equities from multiple ETFs, you don’t want your positions to be highly correlated, even at the expense of accepting lower expected returns. Generally speaking, low correlation across different ETFs is the main idea behind global portfolio diversification, and without it, there is no benefit to rebalancing internationally exposed ETF portfolios.

    market correlations
source: macroaxis

    S&P 500 Etf Profile

    XLU market performance
source: macroaxis
    S&P 500 Utilities (XLU) Summary:

    S&P 500 Utilities is selling for under 74.47 as of the 19th of September 2022; that is a 0.37 percent decrease since the beginning of the trading day. The ETF’s last reported lowest price was 74.34. S&P 500 Utilities has a very small chance of experiencing financial distress in the next few years and performed well during the last 90 days. Equity ratings for S&P 500 Utilities are calculated daily based on our scoring framework. The performance scores are derived for the period starting the 21st of June 2022 and ending today, the 19th of September 2022. Click here to learn more.

    In seeking to track the performance of the index, the fund employs a replication strategy. S&P 500 is traded on NYSEArca Exchange in the United States. More on S&P 500 Utilities

    Explore More

    Exchange Traded Fund (ETF): Explained

    What Is an Exchange-Traded Fund (ETF)?

    Trending ETF Content

    10 ETF Concerns That Investors Shouldn’t Overlook

    Swissquote Bank Europe – Simply Invest in ETFs Online

    Related Content


    Macroaxis Wealth Optimization

    Macroaxis AI Investment Opportunity

    Upswing Resilient Investor Guide

    Stocks on Watch Tomorrow

    Investment Risk Management Study


    Inflation-Resistant Stocks to Buy

    Track All Markets with TradingView

    S&P 500 Algorithmic Trading with FBProphet

    Are Blue-Chips Perfect for This Bear Market?

    SeekingAlpha Risk/Reward July Rundown

    Gulf’s Oil Price Web Scraping in R


    Commodity trends, cryptocurrency, AI optimization, and more below:

    market trends Q1 2022
    Crypto market (MC) bln USD 2022-2023
    Crypto market (MC) trln USD 2016-2030
    Blockchain, crypto markets
    ML/AI cryptocurrency prediction
  • US Real Estate – Harnessing the Power of AI

    This is the continuation of our recent use-case series dedicated to real estate (RE) monitoring, trend analysis, and forecasting. In this series, the focus is on US house prices, invoking supervised machine learning (ML) and artificial intelligence (AI) algorithms available in Python, as it is the language with the largest variety of libraries on the subject (Scikit-learn, TensorFlow, PyTorch, Keras, Spark MLlib, etc.). Our objective is to incorporate these algorithms into the real estate decision-making process in a supporting role. Recall that decision-making is a critical part of a typical real estate property valuation, which aims to quantify the market value of a property according to its qualitative characteristics. Since visualization is a prominent feature of this kind of problem, ML/AI ETL pipelines are commonly used to support RE decision analysis. Within the context of testing and validation strategies, it is important to understand the training errors and limitations of ML/AI due to its inherent pattern-recognizing nature.

    ML/AI - the soul of tech

    ML/AI is defined as follows: a program learns from experience E with respect to a task T and a performance measure P if its performance on T, as measured by P, improves with E. ML is a part of AI. ML algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. ML is an important subset of data science. Through the use of statistical methods, data science algorithms are trained to make classifications or predictions, uncovering key insights within data mining projects. These insights subsequently drive decision-making within applications and businesses, ideally impacting key BI/fintech metrics.

    Bottom Line: ML is the science of getting computers to learn, without being explicitly programmed.

    Housing Crunch

    • The August 2022 NAHB measure of homebuilder confidence fell below 50 for the first time since May 2020. Housing starts for July dropped 9.6%, more than expected, (although permits dropped less than forecast). And most recently the NAR reported that July existing home sales fell 5.9%, more than anticipated.
    • The August 2022 housing data did much to confirm a slowdown sought by the Federal Reserve. Along with what may have been peak inflation last week, cooler housing data is another piece in the puzzle as the FOMC tightens conditions.
    • “Existing home sales have now fallen for 6 months in a row, and are 26% lower than the January 2022 peak,” Pantheon Macro Economist Ian Shepherdson said. “But the bottom is still some way off, given the degree to which demand has been crushed by rising rates; the required monthly mortgage payment for a new purchaser of an existing single-family home is no longer rising, but it was still up by 51% year-over-year in July 2022.”
    • “Home sales likely have further to fall,” Odeta Kushi, deputy chief economist at First American Financial, tweeted. “Mortgage applications so far in August 2022 point to another decline in existing-home sales. This month’s number of 4.81 million puts us at about 2014 sales level.”
    • “Fed officials pay particularly close attention to the housing market and are monitoring how higher mortgage rates are impacting home sales and housing prices in order to gauge how tighter monetary policy is affecting the broader economy,” Wells Fargo economists wrote.

    This post provides an optimized solution to the problem of unclear RE market changes by allowing brokerages and clients to have access to an ML-backed RE solution that draws upon different housing data sources that are updated to close recency.


    The post is divided into the following sections: Business Case (see above), supervised ML Methodology, IDE and learning Prerequisites, ETL Python Workflow & Pipeline, multi-scale RE Use Cases using comprehensive open-source housing datasets (US states and beyond), and Conclusions. Sections contain related links listed in References. Due to the scale of the case studies, the entire ML project is split into several Jupyter notebooks: EDA and data cleaning, preprocessing and feature engineering, and model tuning and insights. Each input dataset is limited in scope both in terms of the time frame captured and location. Each training model is also specific to houses in a city or county and may not be as accurate when applied to data from another US state, where house prices may be affected by different factors. The aim of the specific training models is not to give a perfect prediction but to act as a guideline to inform RE decisions. In reality, house prices may be difficult to predict, as they are also affected by buyers’ psychology, the economic climate, and other factors not included in the dataset.


    We consider supervised ML techniques (see charts below) when we are given a (training) dataset and already know what our correct output should look like, reflecting the idea that there is an intrinsic relationship between the input and output data. In this study, house price prediction is regarded as a regression problem, meaning that we are trying to map input variables or features (the size of houses, area, etc.) to a continuous function (house price).

    ML ETL Pipeline Flowchart
    ML/AI Flowchart: Input, Exploratory Data Analysis (EDA), Training, Testing, Validation, Prediction, Inference, Deployment, and Tuning or Optimization.

    The supervised ML algorithm consists of the following steps:

    • Create labeled data (label is the true answer for a given input, the house price $ is the label)
    • Perform model training, testing and cross-validation
    • Deploy trained models
    • Evaluate and tune deployed models
    • Avoid creating high bias/variance

    Model training and evaluation are performed using chosen metrics and objectives. For example, the loss metric can be the sum of squares between observed and predicted house prices.
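    As a minimal sketch of such a loss metric, the sum of squares (and the derived MSE/RMSE) can be computed with NumPy; the prices below are made-up illustration values, not data from this study:

```python
import numpy as np

# Hypothetical observed and predicted house prices (illustration only)
observed = np.array([250_000.0, 310_000.0, 180_000.0, 420_000.0])
predicted = np.array([240_000.0, 325_000.0, 175_000.0, 400_000.0])

# Sum-of-squares loss between observed and predicted prices
sse = np.sum((observed - predicted) ** 2)

# The same loss averaged (MSE) and expressed in price units (RMSE)
mse = sse / len(observed)
rmse = np.sqrt(mse)
```

    The RMSE form is usually the most interpretable, since it is in the same units (dollars) as the target.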

    ML/AI Assisted Real Estate Chart
    House price prediction RCA chart: under-fit, best-fit, and over-fit KPIs.
    Three-step ML/AI house price prediction methodology

    The above three-step ML methodology is a way to use regression algorithms to derive predictive insights from housing data and make repeated RE decisions. Qualities of good data (the output of EDA): coverage, cleanliness, and completeness.

    The broader your data’s coverage, the more robust your training model will be. Dirty data can make ML hard in terms of goodness-of-fit. Incomplete data can limit performance.

    Here is the list of 10 popular ML regression algorithms:

    1. Linear Regression
    2. Ridge Regression
    3. Neural Network Regression 
    4. Lasso Regression 
    5. Decision Tree Regression 
    6. Random Forest
    7. KNN Model 
    8. Support Vector Machines (SVM)
    9. Gaussian Regression
    10. Polynomial Regression
    • Conventionally, the Exploratory Data Analysis (EDA) of the dataframe df is carried out using histograms df.plot(kind='hist') and pairplots sns.pairplot().
    • The Feature Engineering (FE) phase consists of the following steps: Log Transform np.log() or Square Root Transform np.sqrt(), Feature Importance analysis coef_.ravel(), and Feature Scaling using StandardScaler() (most common option), RobustScaler() (less widely used option), and MinMaxScaler() (least robust choice).
    • The typical regression algorithm is the linear/polynomial regression with/without regularization (Lasso, Ridge, etc.) and/or Hyper-Parameter Optimization (HPO).
    • The Model Evaluation phase may represent (optionally) the following comparisons: Ridge vs Lasso and Normal vs Polynomial.
    • The cross-validation metrics utilities can be used to compute some useful statistics of the prediction performance. Some statistics computed are mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percent error (MAPE), and median absolute percent error (MDAPE). 
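    A minimal sketch of these prediction-performance statistics using plain NumPy (the helper name regression_metrics and the sample values are ours, for illustration only):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, MAPE, and MDAPE; a plain-NumPy sketch."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    pct = np.abs(err) / y_true  # assumes strictly positive targets (prices)
    return {
        "MSE": np.mean(err ** 2),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.mean(np.abs(err)),
        "MAPE": 100 * np.mean(pct),
        "MDAPE": 100 * np.median(pct),
    }

m = regression_metrics([100.0, 200.0, 400.0], [110.0, 190.0, 380.0])
```

    In practice these would be computed per cross-validation fold and averaged; scikit-learn exposes equivalents in sklearn.metrics.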


    We begin by setting up the Python-based IDE using the latest version of Anaconda, which contains the Jupyter Notebook coupled with (optionally) Datapane. The latter allows you to share an HTML link in which you can lay out your analysis as a report. When started, the Jupyter Notebook App can access only files within its start-up folder (including any sub-folder). No configuration is necessary if you place your notebooks in your home folder or subfolders. Otherwise, you need to choose a Jupyter Notebook App start-up folder which will contain all the notebooks. Read more here.

    Check ML learning prerequisites here.


    The general workflow to create the model will be as follows:

    • Data handling (loading, cleaning, editing or preprocessing)
    • Exploratory Data Analysis (EDA)/Feature Engineering (FE)

    We use Feature Engineering to deal with missing values, outliers, and categorical features

    • Model training & hyperparameter tuning

    We train various ML models on the training data and, after tuning all the hyperparameters, evaluate them on the test data

    • Model testing, QC diagnostics, evaluation and final deployment
    • Prediction, result interpretation, visualization and export.

    Below is the more detailed sequence of steps:

    • Import Libraries and Loading Dataset

    Example: use Python with opendatasets to load the data from the Kaggle platform, pandas to read and manipulate the data, seaborn, matplotlib, plotly, and geopandas for data visualizations, and sklearn for data preprocessing and training algorithms.

    • EDA & Data Visualization/Overview

    Use a variety of data visualization tools to analyze the tabular data and discover the data cleaning procedures needed to fix it (e.g., looking for missing values and outliers, removing unnecessary values or columns, dropping duplicate rows, and fixing errors such as human-made mistakes introduced when recording).

    • Feature Engineering & Selection to improve a model’s predictive performance

    Use feature selection techniques such as Feature Importance (using ML algorithms such as Lasso and Random Forest), Correlation Matrix with Heatmap, or Univariate Selection. For example, we may choose the Heatmap correlation matrix technique to select features with correlations higher than zero.
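    A hedged sketch of the correlation-threshold idea on a small synthetic frame (the column names, threshold value, and data are our assumptions, not the actual housing dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a housing table (columns are assumptions)
rng = np.random.default_rng(0)
size = rng.uniform(50, 250, 200)
df = pd.DataFrame({
    "size_sqm": size,
    "rooms": (size / 30).round(),          # strongly tied to size
    "noise": rng.normal(size=200),         # unrelated to price
    "price": size * 1_000 + rng.normal(0, 5_000, 200),
})

# Keep features whose absolute correlation with the target exceeds a threshold
corr_with_price = df.corr()["price"].drop("price")
selected = corr_with_price[corr_with_price.abs() > 0.5].index.tolist()
```

    Here size_sqm and rooms survive the cut while the unrelated noise column is dropped; in practice the threshold is a tuning choice.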

    • Data preparation/preprocessing using features scaling, encoding, and imputing

    For example, the function preprocess_data(data) consists of remove_duplicates(), check_missing(), resolve_missing(), and change_types(); it takes in raw data and converts it into data that is ready for making predictions. Here are the steps to be done: 

    Identify the input and target column(s) for training the model.

    Identify numeric and categorical input columns.

    Impute (fill) missing values in numeric columns

    Scale values in numeric columns to a (0,1) range.

    Encode categorical data into one-hot vectors.

    Split the dataset into training and validation sets.
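    The steps above can be sketched with scikit-learn’s ColumnTransformer on a tiny made-up table (the column names and values are assumptions, not the actual data):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data standing in for the raw housing table (an assumption)
raw = pd.DataFrame({
    "area": [120.0, np.nan, 90.0, 150.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
    "price": [200_000, 310_000, 150_000, 420_000],
})

# Identify the target column and the input columns
target = raw["price"]
inputs = raw.drop(columns="price")

# Identify numeric and categorical input columns
numeric_cols = inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = inputs.select_dtypes(exclude=np.number).columns.tolist()

# Impute + scale numeric columns to (0, 1); one-hot encode categoricals
preprocess = ColumnTransformer(
    [("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                       ("scale", MinMaxScaler())]), numeric_cols),
     ("cat", OneHotEncoder(), categorical_cols)],
    sparse_threshold=0.0,  # force a dense array for easy inspection
)
X = preprocess.fit_transform(inputs)

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, target, test_size=0.25, random_state=0)
```

    The same fitted transformer would then be reused (transform only, no refit) on any held-out data to avoid leakage.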

    • Robust model training and hyperparameter tuning

    For example, we may decide to train the data on the scikit-learn models Random Forest, Gradient Boosting, and ExtraTree, along with LightGBM and CatBoost.

    The predictions from the model can be evaluated using a loss function like the Root Mean Squared Error (RMSE).

    • We can use the trained model to generate predictions for the training, testing and validation inputs and evaluate them by calculating the R-squared in each case. The final score can be the model score and the training/testing accuracy.
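    For instance, a minimal sketch of this R-squared evaluation on synthetic data (not the actual housing dataset; the single "living area" feature is an assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic feature/price data standing in for a housing table
rng = np.random.default_rng(42)
X = rng.uniform(50, 250, size=(300, 1))           # e.g. living area
y = 1_000 * X[:, 0] + rng.normal(0, 5_000, 300)   # price with noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# R-squared on each split; model.score() is equivalent to r2_score here
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = model.score(X_test, y_test)
```

    Comparing the train and test R-squared values is also a quick check for over-fitting: a large gap between the two is a warning sign.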

    Case 1: US

    Let’s set the working directory YOURPATH

    import os
    os.chdir('YOURPATH')
    os.getcwd()

    and import the following libraries

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    from sklearn.model_selection import train_test_split

    from sklearn.linear_model import LinearRegression

    Let’s read the Kaggle dataset

    houseDF = pd.read_csv('USA_Housing.csv')

    and check the file content

    houseDF.shape

    (5000, 7)

    houseDF.columns

    Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
           'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
          dtype='object')

    houseDF.dtypes

    Avg. Area Income                float64
    Avg. Area House Age             float64
    Avg. Area Number of Rooms       float64
    Avg. Area Number of Bedrooms    float64
    Area Population                 float64
    Price                           float64
    Address                          object
    dtype: object

    The info is

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 5000 entries, 0 to 4999
    Data columns (total 7 columns):
     #   Column                        Non-Null Count  Dtype  
    ---  ------                        --------------  -----  
     0   Avg. Area Income              5000 non-null   float64
     1   Avg. Area House Age           5000 non-null   float64
     2   Avg. Area Number of Rooms     5000 non-null   float64
     3   Avg. Area Number of Bedrooms  5000 non-null   float64
     4   Area Population               5000 non-null   float64
     5   Price                         5000 non-null   float64
     6   Address                       5000 non-null   object 
    dtypes: float64(6), object(1)
    memory usage: 273.6+ KB

    and the first 5 rows are

    houseDF.head()

    Input data table

    while the input data descriptive statistics are

    houseDF.describe()

    Input data descriptive statistics

    The input data pairplot is

    sns.pairplot(houseDF)

    and the correlation heatmap is

    swarm_plot=sns.heatmap(houseDF.corr(), annot=True)
    fig = swarm_plot.get_figure()

    Correlation heatmap

    Let’s separate features and target variables

    X = houseDF[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]

    Y = houseDF['Price']

    Let’s split the data into train and test subsets in a 70:30 ratio, respectively,

    from sklearn.model_selection import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

    Let’s apply LinearRegression() to the training data

    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()
    lm.fit(X_train, Y_train)

    Let’s make predictions

    predictions = lm.predict(X_test)

    and plot the result

    plt.scatter(Y_test, predictions)
    plt.xlabel('Observed Test Data')
    plt.ylabel('Predicted Test Data')

    Linear Regression applied to test data

    Let’s compare it with the xgboost algorithm

    import xgboost as xg
    reg = xg.XGBRegressor(objective='reg:linear',
                          n_estimators=1000, seed=123)
    reg.fit(X_train, Y_train)
    predictions = reg.predict(X_test)

    XGBoost Regression applied to test data

    We can see that LinearRegression() yields a more accurate prediction than XGBRegressor(). The same considerations apply to the other sklearn algorithms (SVR, TweedieRegressor, RandomForestRegressor, etc.).
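    One hedged way to run such side-by-side comparisons is to loop over several regressors and report each model’s test RMSE. The sketch below uses synthetic linear data rather than USA_Housing, so the ranking here merely illustrates the mechanics:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic linearly generated data (an assumption, not the Kaggle set)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.1, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Fit each candidate model and record its test RMSE
rmse = {}
for name, model in [("linear", LinearRegression()),
                    ("svr", SVR()),
                    ("forest", RandomForestRegressor(n_estimators=50,
                                                     random_state=1))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = np.sqrt(mean_squared_error(y_test, pred))
```

    On genuinely linear data the linear model wins, which mirrors the result above; on more complex data the ranking can easily flip, so this comparison should be rerun per dataset.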

    Case 2: CA

    Let’s look at the median house prices for California districts derived from the 1990 census. This is the dataset used in the second chapter of Aurélien Géron’s book ‘Hands-On Machine Learning with Scikit-Learn and TensorFlow’. The ultimate goal of end-to-end ML is to build a RE prediction engine capable of minimizing the RMSE (Root Mean Square Error), MAE (Mean Absolute Error), or any other metric of interest.

    Let’s set the working directory YOURPATH

    import os
    os.chdir('YOURPATH')
    os.getcwd()

    and import libraries

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np

    Let’s read the data

    housing_data = pd.read_csv("housing.csv")

    Input table data from CA

    representing 20640 rows × 10 columns.

    The data info is

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 20640 entries, 0 to 20639
    Data columns (total 10 columns):
     #   Column              Non-Null Count  Dtype  
    ---  ------              --------------  -----  
     0   longitude           20640 non-null  float64
     1   latitude            20640 non-null  float64
     2   housing_median_age  20640 non-null  float64
     3   total_rooms         20640 non-null  float64
     4   total_bedrooms      20433 non-null  float64
     5   population          20640 non-null  float64
     6   households          20640 non-null  float64
     7   median_income       20640 non-null  float64
     8   median_house_value  20640 non-null  float64
     9   ocean_proximity     20640 non-null  object 
    dtypes: float64(9), object(1)
    memory usage: 1.6+ MB

    Let’s plot the ocean proximity bar chart

    housing_data["ocean_proximity"].value_counts().plot(kind="bar")

    ocean proximity bar chart

    We can see that “ISLAND” value_counts is negligible compared to “1H OCEAN”.

    The descriptive statistics of input data are

    housing_data.describe()

    Descriptive statistics of input data table

    Let’s plot the histogram of median income

    housing_data["median_income"].hist()

    Histogram of median income

    Let’s introduce 5 categories of median income

    housing_data["income_cat"] = pd.cut(housing_data["median_income"],
                                        bins=[0, 1.5, 3.0, 4.5, 6, np.inf],
                                        labels=[1, 2, 3, 4, 5])
    housing_data["income_cat"].value_counts()
    3    7236
    2    6581
    4    3639
    5    2362
    1     822
    Name: income_cat, dtype: int64

    and plot histograms of these categories

    housing_data["income_cat"].hist()

    Histograms of 5 median income categories

    Let’s introduce the target variable median_house_value and the model features

    y = housing_data["median_house_value"]
    X = housing_data.drop("median_house_value", axis=1)

    The model features table X

    with 20640 rows × 10 columns.

    Let’s split the data into 67% and 33% for training and testing, respectively

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    Let’s select StratifiedShuffleSplit, which provides train/test indices to split data into train/test sets. This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

    from sklearn.model_selection import StratifiedShuffleSplit
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in split.split(housing_data, housing_data["income_cat"]):
        strat_train_set = housing_data.loc[train_index]
        strat_test_set = housing_data.loc[test_index]

    Let’s check strat_test_set value count in terms of income_cat as a fraction

    strat_test_set["income_cat"].value_counts() / len(strat_test_set)

    3    0.350533
    2    0.318798
    4    0.176357
    5    0.114341
    1    0.039971
    Name: income_cat, dtype: float64

    We can see only 4% of strat_test_set belongs to income_cat=1 as compared to 35% of strat_test_set that belongs to income_cat=3.
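    The stratification property can be checked directly: the test-set category fractions should track the full-dataset fractions almost exactly. Below is a sketch on synthetic income categories (the class proportions are our assumptions, chosen to mimic the counts above):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced income categories (proportions are assumptions)
rng = np.random.default_rng(0)
cats = rng.choice([1, 2, 3, 4, 5], size=10_000,
                  p=[0.04, 0.32, 0.35, 0.18, 0.11])
df = pd.DataFrame({"income_cat": cats})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df["income_cat"]))

# Compare per-category fractions: full dataset vs stratified test set
full_frac = df["income_cat"].value_counts(normalize=True)
strat_frac = df.loc[test_idx, "income_cat"].value_counts(normalize=True)
max_gap = (full_frac - strat_frac).abs().max()
```

    A plain random split would show gaps an order of magnitude larger for the rare category, which is precisely why stratified sampling is preferred here.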

    Let’s plot the histograms of training data

    Histograms of training data

    Let’s plot the geo-location map population and housing median age vs median house value

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["population"]/100, label="population", figsize=(10,7),
                 c="median_house_value", cmap="jet", colorbar=True)

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["housing_median_age"], label="housing_median_age", figsize=(10,7),
                 c="median_house_value", cmap="jet", colorbar=True)

    Geo-location map population vs median house value
    Geo-location map housing median age vs median house value

    Let’s look at the housing correlation matrix


    Housing correlation matrix

    and plot the corresponding annotated heatmap

    import seaborn as sns
    corr = housing.corr()
    mask = np.triu(np.ones_like(corr,dtype=bool))

    f, ax = plt.subplots(figsize= (11, 9))
    cmap = sns.diverging_palette(230, 20, as_cmap = True)
    sns_plot = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot=True,
                           square=True, linewidths=0.5, cbar_kws={"shrink": .5})
    fig = sns_plot.get_figure()

    Housing correlation matrix heatmap

    We can see that median_income is the most dominant factor affecting median_house_value.

    Let’s check rows for missing values

    sample_incomplete_rows= housing[housing.isnull().any(axis=1)].head()

    Sample table of rows with missing values

    while dropping the rows with missing values leaves only the table header


    longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity

    Let’s fill NaN with median values

    median = housing['total_bedrooms'].median()
    sample_incomplete_rows['total_bedrooms'].fillna(median, inplace=True)

    Incomplete rows filled in with median values

    Let’s apply the SimpleImputer method with strategy =’median’

    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy='median')
    housing_num = housing.select_dtypes(include=(np.number))

    housing data after applying SimpleImputer method with strategy ='median'

    imputer.fit(housing_num)
    X = imputer.transform(housing_num)
    housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

    Result of applying the imputer transform to the housing data




    Let’s encode categorical variables to convert non-numerical data into numerical data to create inferences

    housing_cat = housing[['ocean_proximity']]

    Ocean_proximity table for encoding

    Let’s apply OrdinalEncoder to this variable

    from sklearn.preprocessing import OrdinalEncoder

    ordinal_encoder = OrdinalEncoder()
    housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

    Let’s apply OneHotEncoder to housing_cat

    from sklearn.preprocessing import OneHotEncoder
    cat_encoder = OneHotEncoder(sparse=False)
    housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

    array([[0., 1., 0., 0., 0.],
           [0., 0., 0., 0., 1.],
           [0., 1., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [0., 1., 0., 0., 0.]])

    Let’s define the feature_engineering function

    def feature_engineering(data):
        data['bedrooms_per_household'] = data['total_bedrooms']/data['households']
        data['population_per_households'] = data['population']/data['households']
        data['rooms_per_households'] = data['total_rooms']/data['households']
        return data

    and apply this function to the housing data

    housing_feature_engineered = feature_engineering(housing_num)

    The feature engineering function applied to the housing data

    Let’s scale our data

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    housing_scaled = scaler.fit_transform(housing_feature_engineered)

    array([[-0.94135046,  1.34743822,  0.02756357, ...,  0.05896205,
             0.00622264,  0.01739526],
           [ 1.17178212, -1.19243966, -1.72201763, ...,  0.02830837,
            -0.04081077,  0.56925554],
           [ 0.26758118, -0.1259716 ,  1.22045984, ..., -0.1286475 ,
            -0.07537122, -0.01802432],
           [-1.5707942 ,  1.31001828,  1.53856552, ..., -0.26257303,
            -0.03743619, -0.5092404 ],
           [-1.56080303,  1.2492109 , -1.1653327 , ...,  0.11548226,
            -0.05915604,  0.32814891],
           [-1.28105026,  2.02567448, -0.13148926, ...,  0.05505203,
             0.00657083,  0.01407228]])

    Let’s create the ML input data

    ml_input_data = np.hstack([housing_cat_1hot, housing_scaled])

    array([[ 0.        ,  1.        ,  0.        , ...,  0.05896205,
             0.00622264,  0.01739526],
           [ 0.        ,  0.        ,  0.        , ...,  0.02830837,
            -0.04081077,  0.56925554],
           [ 0.        ,  1.        ,  0.        , ..., -0.1286475 ,
            -0.07537122, -0.01802432],
           [ 1.        ,  0.        ,  0.        , ..., -0.26257303,
            -0.03743619, -0.5092404 ],
           [ 1.        ,  0.        ,  0.        , ...,  0.11548226,
            -0.05915604,  0.32814891],
           [ 0.        ,  1.        ,  0.        , ...,  0.05505203,
             0.00657083,  0.01407228]])

    Let’s define the entire ETL pipeline to be applied to the housing data

    housing = strat_train_set.drop(“median_house_value”, axis=1)
    housing_labels = strat_train_set[“median_house_value”].copy()

    def data_transformations(data):
        ### Separate Labels if they Exist ###
        if "median_house_value" in data.columns:
            labels = data["median_house_value"]
            data = data.drop("median_house_value", axis=1)
        else:
            labels = None
        ### Feature Engineering ###
        feature_engineered_data = feature_engineering(data)
        features = list(feature_engineered_data.columns) # Creating a list of our features for future use
        ### Imputing Data ###
        from sklearn.impute import SimpleImputer
        imputer = SimpleImputer(strategy="median")
        housing_num = feature_engineered_data.select_dtypes(include=[np.number])
        imputed = imputer.fit_transform(housing_num)
        ### Encoding Categorical Data ###
        housing_cat = feature_engineered_data.select_dtypes(exclude=[np.number])
        from sklearn.preprocessing import OneHotEncoder
        cat_encoder = OneHotEncoder(sparse=False)
        housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
        features = features + cat_encoder.categories_[0].tolist()
        features.remove("ocean_proximity") # We're encoding this variable, so we don't need it in our list anymore
        ### Scaling Numerical Data ###
        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        housing_scaled = scaler.fit_transform(imputed)
        ### Concatenating all Data ###
        output = np.hstack([housing_scaled, housing_cat_1hot])
        return output, labels, features



    Let’s select and train the model

    train_data, train_labels, features = data_transformations(strat_train_set)

    array([[-0.94135046,  1.34743822,  0.02756357, ...,  0.        ,
             0.        ,  0.        ],
           [ 1.17178212, -1.19243966, -1.72201763, ...,  0.        ,
             0.        ,  1.        ],
           [ 0.26758118, -0.1259716 ,  1.22045984, ...,  0.        ,
             0.        ,  0.        ],
           [-1.5707942 ,  1.31001828,  1.53856552, ...,  0.        ,
             0.        ,  0.        ],
           [-1.56080303,  1.2492109 , -1.1653327 , ...,  0.        ,
             0.        ,  0.        ],
           [-1.28105026,  2.02567448, -0.13148926, ...,  0.        ,
             0.        ,  0.        ]])

    Let’s test the model

    test_data, test_labels, features = data_transformations(strat_test_set)

    array([[ 0.57507019, -0.69657252,  0.0329564 , ...,  0.        ,
             0.        ,  0.        ],
           [-0.43480141, -0.33466769, -0.36298077, ...,  0.        ,
             0.        ,  0.        ],
           [ 0.54522177, -0.63547171,  0.58726843, ...,  0.        ,
             0.        ,  0.        ],
           [-0.08656982, -0.54617051,  1.14158047, ...,  0.        ,
             0.        ,  0.        ],
           [ 0.81385757, -0.92687559,  0.11214383, ...,  0.        ,
             0.        ,  0.        ],
           [ 0.49049967, -0.66367208,  0.58726843, ...,  0.        ,
             0.        ,  0.        ]])

    We have got the train labels


    12655     72100.0
    15502    279600.0
    2908      82700.0
    14053    112500.0
    20496    238300.0
    15174    268500.0
    12661     90400.0
    19263    140400.0
    19140    258100.0
    19773     62700.0
    Name: median_house_value, Length: 16512, dtype: float64

    and the features


     '<1H OCEAN',
     'NEAR BAY',
     'NEAR OCEAN']

    Following Case 1 (see above), let’s apply the Linear Regression

    from sklearn.linear_model import LinearRegression
    lin_reg = LinearRegression()
    lin_reg.fit(train_data, train_labels)


    Let’s compare original and predicted values

    original_values = test_labels[:5]
    predicted_values = lin_reg.predict(test_data[:5])
    comparison_dataframe = pd.DataFrame(data={"Original Values": original_values, "Predicted Values": predicted_values})

    comparison_dataframe["Differences"] = comparison_dataframe["Original Values"] - comparison_dataframe["Predicted Values"]


    Difference between Original and Predicted Values

    Let’s check the MSE metric

    from sklearn.metrics import mean_squared_error

    lin_mse = mean_squared_error(original_values,predicted_values)
    lin_rmse = np.sqrt(lin_mse)


    Let’s check the MAE metric

    from sklearn.metrics import mean_absolute_error

    lin_mae = mean_absolute_error(original_values, predicted_values)


    Let’s apply the Decision Tree algorithm

    from sklearn.tree import DecisionTreeRegressor
    tree_reg = DecisionTreeRegressor(random_state=42)
    tree_reg.fit(train_data, train_labels)


    train_predictions = tree_reg.predict(train_data)
    tree_mse = mean_squared_error(train_labels, train_predictions)
    tree_rmse = np.sqrt(tree_mse)


    Let’s compute the cross-validation score

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(tree_reg, train_data, train_labels, scoring="neg_mean_squared_error", cv=10)
    tree_rmse_scores = np.sqrt(-scores)

    def display_scores(scores):
        print("Scores:", scores)
        print("Mean:", scores.mean())
        print("Standard deviation:", scores.std())


    Scores: [70819.83674558 70585.09139446 69861.50467212 73083.46385442
     66246.62162221 74093.76616605 77298.21284135 70265.05374821
     70413.46481703 72693.02785945]
    Mean: 71536.00437208822
    Standard deviation: 2802.723447985299

    Let’s apply the Random Forest Regressor

    from sklearn.ensemble import RandomForestRegressor

    forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
    forest_reg.fit(train_data, train_labels)


    train_predictions = forest_reg.predict(train_data)
    forest_mse = mean_squared_error(train_labels, train_predictions)
    forest_rmse = np.sqrt(forest_mse)


    Let’s select the corresponding cross_val_score

    from sklearn.model_selection import cross_val_score

    forest_scores = cross_val_score(forest_reg, train_data, train_labels,
                                    scoring="neg_mean_squared_error", cv=10)
    forest_rmse_scores = np.sqrt(-forest_scores)

    Scores: [51667.47890087 49581.77674843 46845.77133522 52127.48739086
     48082.89639917 51050.84681689 53027.94987383 50218.59780997
     48609.03966622 54669.97457167]
    Mean: 50588.18195131385
    Standard deviation: 2273.9929947683154

    Let’s try 12 (3×4) combinations of hyperparameters and then 6 (2×3) combinations with bootstrap set as False using GridSearchCV

    from sklearn.model_selection import GridSearchCV

    param_grid = [
        # try 12 (3×4) combinations of hyperparameters
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
        # then try 6 (2×3) combinations with bootstrap set as False
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]

    forest_reg = RandomForestRegressor(random_state=42)

    Let’s train across 5 folds, that’s a total of (12+6)*5=90 rounds of training

    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error',
                               return_train_score=True)
    grid_search.fit(train_data, train_labels)

    GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                 param_grid=[{'max_features': [2, 4, 6, 8],
                              'n_estimators': [3, 10, 30]},
                             {'bootstrap': [False], 'max_features': [2, 3, 4],
                              'n_estimators': [3, 10]}],
                 return_train_score=True, scoring='neg_mean_squared_error')

    Let’s see the best estimator

    grid_search.best_estimator_

    RandomForestRegressor(max_features=6, n_estimators=30, random_state=42)

    The results of grid search cv are as follows

    cvres = grid_search.cv_results_
    for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
        print(np.sqrt(-mean_score), params)

    64441.33583774864 {'max_features': 2, 'n_estimators': 3}
    55010.78729315784 {'max_features': 2, 'n_estimators': 10}
    52756.90743676946 {'max_features': 2, 'n_estimators': 30}
    60419.95105027927 {'max_features': 4, 'n_estimators': 3}
    52548.760723492225 {'max_features': 4, 'n_estimators': 10}
    50475.03023921768 {'max_features': 4, 'n_estimators': 30}
    58658.87553276854 {'max_features': 6, 'n_estimators': 3}
    51688.259845013825 {'max_features': 6, 'n_estimators': 10}
    49602.83903888296 {'max_features': 6, 'n_estimators': 30}
    57764.545176887186 {'max_features': 8, 'n_estimators': 3}
    51906.606161086886 {'max_features': 8, 'n_estimators': 10}
    49851.77165193962 {'max_features': 8, 'n_estimators': 30}
    63137.43571927858 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
    54419.40582754731 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
    58195.29390064867 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
    52168.74519952844 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
    59520.17602710436 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
    51828.25647287002 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

    The corresponding dataframe is


    Grid search CV results

    representing 18 rows × 23 columns.

    Let’s compare it to RandomizedSearchCV

    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import randint

    param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

    forest_reg = RandomForestRegressor(random_state=42)
    rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                    n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                    random_state=42)
    rnd_search.fit(train_data, train_labels)

    RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                       param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001669BCE8220>,
                                            'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001669BCE0640>},
                       random_state=42, scoring='neg_mean_squared_error')

    The results are as follows

    cvres = rnd_search.cv_results_
    for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
        print(np.sqrt(-mean_score), params)

    48881.00597871309 {'max_features': 7, 'n_estimators': 180}
    51634.61963021687 {'max_features': 5, 'n_estimators': 15}
    50312.55245794906 {'max_features': 3, 'n_estimators': 72}
    50952.54821857023 {'max_features': 5, 'n_estimators': 21}
    49063.34454115586 {'max_features': 7, 'n_estimators': 122}
    50317.63324666772 {'max_features': 3, 'n_estimators': 75}
    50173.504527094505 {'max_features': 3, 'n_estimators': 88}
    49248.29804214526 {'max_features': 5, 'n_estimators': 100}
    50054.94886918995 {'max_features': 3, 'n_estimators': 150}
    64847.94779269648 {'max_features': 5, 'n_estimators': 2}

    Let’s look at the feature importances

    feature_importances = grid_search.best_estimator_.feature_importances_

    array([8.46978272e-02, 7.69983975e-02, 4.08715796e-02, 1.67325719e-02,
           1.71418340e-02, 1.73518185e-02, 1.56303531e-02, 3.39824215e-01,
           2.30528104e-02, 1.04033701e-01, 8.64983594e-02, 1.29273143e-02,
           1.54663950e-01, 7.22217547e-05, 3.62205279e-03, 5.88099358e-03])

    The corresponding list is as follows

    feature_importance_list = list(zip(features, feature_importances.tolist()))

    [('longitude', 0.0846978271965227),
     ('latitude', 0.07699839747855737),
     ('housing_median_age', 0.040871579612884096),
     ('total_rooms', 0.016732571900462085),
     ('total_bedrooms', 0.01714183399184058),
     ('population', 0.0173518184721046),
     ('households', 0.015630353131298083),
     ('median_income', 0.3398242154869636),
     ('bedrooms_per_household', 0.023052810363875926),
     ('population_per_households', 0.10403370064780083),
     ('rooms_per_households', 0.08649835942626646),
     ('<1H OCEAN', 0.012927314349565632),
     ('INLAND', 0.15466394981681342),
     ('ISLAND', 7.222175467748088e-05),
     ('NEAR BAY', 0.003622052794433035),
     ('NEAR OCEAN', 0.005880993575933963)]

    We can plot this list as a horizontal bar chart of the 16 features

    plt.barh(y=features, width=feature_importances.tolist())

    Feature importances plotted as a horizontal bar chart of the 16 features

    The final model RMSE is given by

    final_model = grid_search.best_estimator_

    final_predictions = final_model.predict(test_data)

    final_mse = mean_squared_error(test_labels, final_predictions)
    final_rmse = np.sqrt(final_mse)



    This can be modified further using various feature selection methods.

    Thus, median_income is the most important feature. The best result is achieved using RandomForestRegressor + RandomizedSearchCV. The trained RandomForestRegressor yields rmse = 18797.8 +/- 2274 on the training data, whereas min(mean_test_score) yields rmse ≈ 48881 with 'max_features': 7, 'n_estimators': 180.

    Case 3: IA

    For this case study, the primary objective was to create and assess advanced ML/AI models to accurately predict house prices based on the Ames dataset. It was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

    The data set includes around 3,000 records of house sales in Ames, Iowa between 2006 and 2010 and contains 79 explanatory variables detailing various aspects of residential homes, such as square footage, number of rooms, and sale year. The data is split into a training set, which will be used to create the model, and a test set, which will be used to test model performance.

    Results can provide insights on the pricing of real estate assets just by plugging in the house characteristics and letting the model return a price. In addition, the ML/AI output can provide information on which features of a new house are more valuable for potential house buyers. Source code: GitHub.

    The general ETL Python workflow to create the model is as follows:

    1. Data preprocessing
    2. Exploratory data analysis/Feature Engineering
    3. Model training & hyperparameter tuning
    4. Model diagnostics & evaluation
    5. Result interpretation
    Let’s set the working directory YOURPATH

    import os
    os.chdir('YOURPATH')
    os.getcwd()

    Let’s import libraries and download train/test Ames datasets
    %matplotlib inline
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    import sklearn.linear_model as linear_model
    import seaborn as sns
    import xgboost as xgb
    from sklearn.model_selection import KFold
    from IPython.display import HTML, display
    from sklearn.manifold import TSNE
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    pd.options.display.max_rows = 1000
    pd.options.display.max_columns = 20

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    Let’s get the dimensions of the train and test data
    print("Training data set dimension : {}".format(train.shape))
    print("Testing data set dimension : {}".format(test.shape))

    Training data set dimension : (2051, 81)
    Testing data set dimension : (879, 80)

    Let’s look at the continuous features
    numerical_cols = [col for col in train.columns if train.dtypes[col] != 'object']
    print("Continuous features")
    print("count of continuous features:", len(numerical_cols))

    Continuous features
    ['PID', 'MS SubClass', 'Lot Frontage', 'Lot Area', 'Overall Qual', 'Overall Cond', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Yr Blt', 'Garage Cars', 'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold']
    count of continuous features: 37

    Let’s look at the categorical features
    categorical_cols = [col for col in train.columns if train.dtypes[col] == 'object']
    print("categorical features")
    print("count of categorical features:", len(categorical_cols))

    categorical features
    ['MS Zoning', 'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating', 'Heating QC', 'Central Air', 'Electrical', 'Kitchen Qual', 'Functional', 'Fireplace Qu', 'Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond', 'Paved Drive', 'Pool QC', 'Fence', 'Misc Feature', 'Sale Type']
    count of categorical features: 42

    and check unique column values below

    print('unique column values')
    train.apply(lambda x: len(x.unique())).sort_values(ascending=False).head(10)

    unique column values


    Id               2051
    PID              2051
    Lot Area         1476
    Gr Liv Area      1053
    Bsmt Unf SF       968
    1st Flr SF        915
    Total Bsmt SF     893
    SalePrice         828
    BsmtFin SF 1      822
    Garage Area       515
    dtype: int64

    Let’s check the cardinality of the categorical train columns, sorted in descending order

    cardinality = train[categorical_cols].apply(lambda x: len(x.unique())).sort_values(ascending=False)

    Neighborhood      28
    Exterior 2nd      15
    Exterior 1st      15
    Sale Type          9
    Condition 1        9
    House Style        8
    Functional         8
    Condition 2        8
    Garage Type        7
    BsmtFin Type 2     7
    BsmtFin Type 1     7
    MS Zoning          7
    Bsmt Qual          6
    Roof Matl          6
    Misc Feature       6
    Garage Cond        6
    Garage Qual        6
    Foundation         6
    Fireplace Qu       6
    Bsmt Cond          6
    Roof Style         6
    Heating            5
    Fence              5
    Pool QC            5
    Electrical         5
    Bldg Type          5
    Bsmt Exposure      5
    Exter Cond         5
    Mas Vnr Type       5
    Lot Config         5
    dtype: int64

    and the cardinality of the categorical test columns

    cardinality = test[categorical_cols].apply(lambda x: len(x.unique())).sort_values(ascending=False)

    Neighborhood      26
    Exterior 2nd      16
    Exterior 1st      13
    Sale Type         10
    Condition 1        9
    House Style        8
    Garage Type        7
    BsmtFin Type 2     7
    BsmtFin Type 1     7
    Garage Cond        6
    Fireplace Qu       6
    Functional         6
    Foundation         6
    Mas Vnr Type       6
    MS Zoning          6
    Roof Matl          6
    Roof Style         6
    Bsmt Qual          6
    Kitchen Qual       5
    Exter Cond         5
    Fence              5
    Garage Qual        5
    Bsmt Exposure      5
    Lot Config         5
    Bldg Type          5
    Electrical         5
    Misc Feature       4
    Garage Finish      4
    Lot Shape          4
    Land Contour       4
    Exter Qual         4
    Heating QC         4
    Heating            4
    Bsmt Cond          4
    Condition 2        4
    Land Slope         3
    Alley              3
    Paved Drive        3
    Pool QC            3
    Utilities          2
    dtype: int64

    Let’s check good and bad train+test column lists

    good_label_cols = [col for col in categorical_cols if set(test[col]).issubset(set(train[col]))]


    bad_label_cols = list(set(categorical_cols)-set(good_label_cols))

    ['Sale Type',
     'Exterior 1st',
     'Roof Matl',
     'Exterior 2nd',
     'Mas Vnr Type',
     'Kitchen Qual']
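    Columns in bad_label_cols contain test categories never seen in training, so label encoding them would fail. A common remedy is simply to drop them from both splits before encoding; a minimal sketch with toy frames (the column values here are illustrative):

```python
import pandas as pd

# Toy stand-ins for the train/test frames
train = pd.DataFrame({"Sale Type": ["WD", "New"], "Street": ["Pave", "Grvl"]})
test = pd.DataFrame({"Sale Type": ["WD", "COD"], "Street": ["Pave", "Grvl"]})

categorical_cols = ["Sale Type", "Street"]
# Columns safe to label-encode: every test category also appears in train
good_label_cols = [c for c in categorical_cols
                   if set(test[c]).issubset(set(train[c]))]
bad_label_cols = list(set(categorical_cols) - set(good_label_cols))

# Drop the problematic columns from both splits
train = train.drop(columns=bad_label_cols)
test = test.drop(columns=bad_label_cols)
print(bad_label_cols)
```

    The alternative to dropping is one-hot encoding with handle_unknown="ignore", which tolerates unseen categories.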

    Let’s plot the count of missing values in the training data column features

    cols_with_missing = train.isnull().sum()
    cols_with_missing = cols_with_missing[cols_with_missing>0]
    fig, ax = plt.subplots(figsize=(7,6))
    width = 0.70 # the width of the bars
    ind = np.arange(len(cols_with_missing)) # the x locations for the groups
    ax.barh(ind, cols_with_missing, width, color="blue")
    ax.set_yticklabels(cols_with_missing.index, minor=False)

    The count of missing values in the training data column features

    Let’s count the percentage of missing values in training data
    print('Percentage of missing values in each columns')

    total = train.isnull().sum().sort_values(ascending=False)
    percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
    missing_data_tr = pd.concat([total, round(percent*100,2)], axis=1, keys=['Total', 'Percent'])

    Percentage of missing values in each columns
    The percentage of missing values in training data

    Similarly, we plot the count of missing values in the test data column features

    cols_with_missing = test.isnull().sum()
    cols_with_missing = cols_with_missing[cols_with_missing>0]
    fig, ax = plt.subplots(figsize=(7,6))
    width = 0.70 # the width of the bars
    ind = np.arange(len(cols_with_missing)) # the x locations for the groups
    ax.barh(ind, cols_with_missing, width, color="blue")
    ax.set_yticklabels(cols_with_missing.index, minor=False)

    and the percentage of missing values in test data columns

    print('Percentage of missing values in each columns')

    total = test.isnull().sum().sort_values(ascending=False)
    percent = (test.isnull().sum()/test.isnull().count()).sort_values(ascending=False)
    missing_data_te = pd.concat([total, round(percent*100,2)], axis=1, keys=['Total', 'Percent'])

    Percentage of missing values in each columns
    The percentage of missing values in test data columns

    Let’s prepare the data for ML.

    Separate features and target variable SalePrice
    X_train = train_data.drop(['SalePrice'], axis=1)
    y = train_data.SalePrice

    and concatenate train and test data
    X = pd.concat([X_train, test_data], axis=0)

    let’s apply SimpleImputer to deal with missing values

    from sklearn.impute import SimpleImputer

    group_1 = [
        'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageType',
        'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtQual', 'BsmtCond',
        'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'MasVnrType'
    ]
    X[group_1] = X[group_1].fillna("None")

    group_2 = [
        'GarageArea', 'GarageCars', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
        'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'MasVnrArea'
    ]

    X[group_2] = X[group_2].fillna(0)

    group_3a = [
        'Functional', 'MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st',
        'Exterior2nd', 'SaleType', 'Utilities'
    ]

    imputer = SimpleImputer(strategy='most_frequent')
    X[group_3a] = pd.DataFrame(imputer.fit_transform(X[group_3a]), index=X.index, columns=group_3a)

    X.LotFrontage = X.LotFrontage.fillna(X.LotFrontage.mean())
    X.GarageYrBlt = X.GarageYrBlt.fillna(X.YearBuilt)

    Let’s check that there are no remaining missing values
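    The check itself is not shown above; one possible form, assuming X is the combined feature frame (a toy frame stands in here), is:

```python
import pandas as pd

# Toy stand-in for the combined feature frame X after imputation
X = pd.DataFrame({"LotFrontage": [60.0, 70.0], "GarageYrBlt": [2001, 1998]})

# The total count of missing entries should be zero after imputation
missing_total = X.isnull().sum().sum()
assert missing_total == 0, f"{missing_total} missing values remain"
print("No missing values remain.")
```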



    Let’s drop outliers in GrLivArea and SalePrice (based on Ames EDA)

    outlier_index = train_data[(train_data.GrLivArea > 4000)
    & (train_data.SalePrice < 200000)].index
    X.drop(outlier_index, axis=0, inplace=True)
    y.drop(outlier_index, axis=0, inplace=True)

    Let’s apply label encoding to the categorical columns

    from sklearn.preprocessing import LabelEncoder

    label_encoding_cols = [
        "Alley", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2",
        "BsmtQual", "ExterCond", "ExterQual", "FireplaceQu", "Functional",
        "GarageCond", "GarageQual", "HeatingQC", "KitchenQual", "LandSlope",
        "LotShape", "PavedDrive", "PoolQC", "Street", "Utilities"
    ]

    label_encoder = LabelEncoder()

    for col in label_encoding_cols:
        X[col] = label_encoder.fit_transform(X[col])

    Let’s transform numerical variables to categorical variables

    to_factor_cols = ['YrSold', 'MoSold', 'MSSubClass']

    for col in to_factor_cols:
        X[col] = X[col].apply(str)

    Let’s apply feature scaling using RobustScaler
    from sklearn.preprocessing import RobustScaler
    numerical_cols = list(X.select_dtypes(exclude=['object']).columns)
    scaler = RobustScaler()
    X[numerical_cols] = scaler.fit_transform(X[numerical_cols])

    followed by one-hot encoding
    X = pd.get_dummies(X, drop_first=True)
    print("X.shape:", X.shape)

    X.shape: (2917, 237)

    Let’s define the train and test columns

    ntest = len(test_data)
    X_train = X.iloc[:-ntest, :]
    X_test = X.iloc[-ntest:, :]
    print("X_train.shape:", X_train.shape)
    print("X_test.shape:", X_test.shape)

    X_train.shape: (1458, 237)
    X_test.shape: (1459, 237)

    Let’s perform modeling
    from sklearn.model_selection import KFold, cross_val_score

    n_folds = 5

    def getRMSLE(model):
        """Return the average RMSLE over all folds of the training data."""
        # Set KFold to shuffle the data before the split
        kf = KFold(n_folds, shuffle=True, random_state=42)

        # Get the RMSLE score
        rmse = np.sqrt(-cross_val_score(
            model, X_train, y, scoring="neg_mean_squared_error", cv=kf))
        return rmse.mean()

    Let’s apply regularized regressions
    from sklearn.linear_model import Ridge, Lasso

    lambda_list = list(np.linspace(20, 25, 101))

    rmsle_ridge = [getRMSLE(Ridge(alpha=lambda_)) for lambda_ in lambda_list]
    rmsle_ridge = pd.Series(rmsle_ridge, index=lambda_list)

    rmsle_ridge.plot(title="RMSLE by lambda")
    print("Best lambda:", rmsle_ridge.idxmin())
    print("RMSLE:", rmsle_ridge.min())

    Ridge lambda:

    Best lambda: 22.9
    RMSLE: 0.11409306668450883
    Ridge lambda regularization RMSLE

    ridge = Ridge(alpha=22.9)

    The Lasso Regression is
    lambda_list = list(np.linspace(0.0006, 0.0007, 11))
    rmsle_lasso = [
        getRMSLE(Lasso(alpha=lambda_, max_iter=100000)) for lambda_ in lambda_list
    ]
    rmsle_lasso = pd.Series(rmsle_lasso, index=lambda_list)

    rmsle_lasso.plot(title="RMSLE by lambda")
    print("Best lambda:", rmsle_lasso.idxmin())
    print("RMSLE:", rmsle_lasso.min())

    Best lambda: 0.00065
    RMSLE: 0.11335701578061286
    Lasso RMSLE lambda regularization

    lasso = Lasso(alpha=0.00065, max_iter=100000)

    Let’s apply the XGBoost algorithm

    from xgboost import XGBRegressor

    xgb = XGBRegressor(learning_rate=0.05)  # remaining hyperparameters truncated in the original

    Let’s apply the LightGBM algorithm

    from lightgbm import LGBMRegressor

    lgb = LGBMRegressor(objective='regression')  # remaining hyperparameters truncated in the original

    Let’s design the averaging model

    from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone

    class AveragingModel(BaseEstimator, RegressorMixin, TransformerMixin):
        def __init__(self, models):
            self.models = models

        def fit(self, X, y):
            # Create clones of the base models
            self.models_ = [clone(x) for x in self.models]
            # Train the cloned models
            for model in self.models_:
                model.fit(X, y)
            return self

        def predict(self, X):
            # Stack predictions from the trained clones
            predictions = np.column_stack(
                [model.predict(X) for model in self.models_])
            # Return the average prediction
            return np.mean(predictions, axis=1)

    avg_model = AveragingModel(models=(ridge, lasso, xgb, lgb))


    Let’s compare the X-plots

    X-plot observed vs predicted test data: Ridge regularization
    X-plot observed vs predicted test data: lasso regularization
    X-plot observed vs predicted test data: XGBoost
    X-plot observed vs predicted test data: LightGBM
    X-plot observed vs predicted test data: Average Model

    We can see that both XGBoost and LightGBM methods result in relatively similar X-plots and corresponding RMSLEs.
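    The X-plots above come from saved figures; this is a hedged sketch of how such an observed-vs-predicted scatter can be drawn, with synthetic predictions standing in for a trained model’s output.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
observed = rng.uniform(50_000, 500_000, size=200)      # stand-in sale prices
predicted = observed * rng.normal(1.0, 0.1, size=200)  # stand-in model output

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(observed, predicted, alpha=0.5)
lims = [observed.min(), observed.max()]
ax.plot(lims, lims, "r--", label="perfect prediction")  # 45-degree reference line
ax.set_xlabel("Observed SalePrice")
ax.set_ylabel("Predicted SalePrice")
ax.legend()
fig.savefig("xplot.png")
```

    The closer the cloud of points hugs the 45-degree line, the lower the corresponding RMSLE.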

    Case 4: MA

    Let’s visualize ML model performance using Scikit-Plot evaluation metrics. The public dataset we’ll use is the Boston housing price dataset, which contains information about houses in Boston and the prices at which they were sold. We’ll split it into train and test sets with train_size=0.8. Let’s import the libraries and load the data:

    import scikitplot as skplt

    import sklearn
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    import matplotlib.pyplot as plt

    import sys
    import warnings

    print("Scikit Plot Version : ", skplt.__version__)
    print("Scikit Learn Version : ", sklearn.__version__)
    print("Python Version : ", sys.version)

    %matplotlib inline

    Scikit Plot Version :  0.3.7
    Scikit Learn Version :  1.0.2
    Python Version :  3.9.12 (main, Apr  4 2022, 05:22:27)

    boston = load_boston()
    X_boston, Y_boston = boston.data, boston.target

    print("Boston Dataset Size : ", X_boston.shape, Y_boston.shape)

    print("Boston Dataset Features : ", boston.feature_names)

    Boston Dataset Size :  (506, 13) (506,)
    Boston Dataset Features :  ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
     'B' 'LSTAT']

    X_boston_train, X_boston_test, Y_boston_train, Y_boston_test = train_test_split(X_boston, Y_boston, train_size=0.8)

    print("Boston Train/Test Sizes : ", X_boston_train.shape, X_boston_test.shape, Y_boston_train.shape, Y_boston_test.shape)

    Boston Train/Test Sizes :  (404, 13) (102, 13) (404,) (102,)

    Let’s plot the cross-validation performance of ML models by passing it the Boston dataset. Scikit-plot provides a method named plot_learning_curve() as a part of the estimators module which accepts estimator, X, Y, cross-validation info, and scoring metric for plotting performance of cross-validation on the dataset.

    skplt.estimators.plot_learning_curve(LinearRegression(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston Linear Regression Learning Curve");

    Boston Linear Regression Learning Curve

    skplt.estimators.plot_learning_curve(RandomForestRegressor(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston RandomForestRegressor Learning Curve");

    Boston Random Forest Regression Learning Curve

    from xgboost import XGBRegressor
    skplt.estimators.plot_learning_curve(XGBRegressor(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston XGBRegressor Learning Curve");

    Boston XGBoost Regression Learning Curve

    from lightgbm import LGBMRegressor
    skplt.estimators.plot_learning_curve(LGBMRegressor(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston LGBMRegressor Learning Curve");

    Boston LightGBM regression learning Curve

    from sklearn.linear_model import Ridge, Lasso
    skplt.estimators.plot_learning_curve(Ridge(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston Ridge Regression Learning Curve");

    Boston Ridge Regression Learning Curve

    skplt.estimators.plot_learning_curve(Lasso(), X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston Lasso Regression Learning Curve");

    Boston Lasso Regression Learning Curve

    from sklearn import linear_model
    reg = linear_model.BayesianRidge()
    skplt.estimators.plot_learning_curve(reg, X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston BayesianRidge Regression Learning Curve");

    Boston bayesian Ridge Regression Learning Curve

    from sklearn.linear_model import TweedieRegressor
    reg = TweedieRegressor(power=1, alpha=0.5, link='log')
    skplt.estimators.plot_learning_curve(reg, X_boston, Y_boston,
    cv=7, shuffle=True, scoring="r2", n_jobs=-1,
    figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Boston TweedieRegressor Learning Curve");

    Boston Tweedie Regression Learning Curve

    It is clear that RandomForestRegressor, XGBRegressor, and LGBMRegressor yield the best training and cross-validation scores for training examples > 420 compared to other ML algorithms.


    Key Takeaways

    • We predict/estimate US house prices in order to allocate a valuation expert over a period of time.
    • We need a fast AI to address rapidly increasing populations and the number of dwelling houses in the country.
    • We use a region-dependent pre-trained ML model to predict prices of new houses.
    • We import key Python libraries (pandas, scikit-learn, etc.) and download public-domain housing datasets from Kaggle or GitHub.
    • We gather and clean, edit, scale and transform data so it can be used for model training and test predictions. Specifically, we identify the target variable (SalePrice), impute missing values, perform label encoding, standardization, splitting and (optional) balancing of training and testing datasets. For example, we can look at scatter plots to detect outliers to be dropped.
    • The input data consists of a home’s features, including its eventual selling price and various descriptive features such as location, remodeling, age, size, type of sale (single family, commercial, etc).
    • These features will be analyzed in determining a home’s value and what the shopper is most likely to buy.
    • Feature engineering can determine the most important model features; there may be one feature that stands out, or there may be several. For example, a larger living or basement area is linked to a higher house price.
    • We perform model training using different linear and non-linear regression algorithms (Ridge, Lasso, Random Forest, Decision Tree, SVM, XGBoost, etc.).
    • The model performance is evaluated using a user-defined loss function (RMSE, MSE, OHMSE, etc.).
    • The pre-trained model is then used to generate predictions for both training and validation inputs.
    • Cross-validation of different ML algorithms has proven to be a suitable method to find an acceptable best fitting algorithm for the given set of features.
    • It appears that location and square feet area play an important role in deciding the price of a property. This is helpful information for sellers and buyers.
    • Results provide a primer on advanced ML real estate techniques as well as several best practices for ML readiness. 
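    The workflow summarized above can be sketched as a compact scikit-learn pipeline. This is a minimal illustration with a synthetic dataset; the column names and hyperparameters are placeholders, not the values used in the case studies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Tiny synthetic housing frame (placeholder columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 4000, 100),
    "Neighborhood": rng.choice(["A", "B", "C"], 100),
    "SalePrice": rng.uniform(50_000, 500_000, 100),
})

X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]

# Impute + scale numeric columns, one-hot encode categoricals
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()),
                      ("scale", RobustScaler())]), ["GrLivArea"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
])
model = Pipeline([("prep", preprocess),
                  ("rf", RandomForestRegressor(n_estimators=50, random_state=0))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
model.fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print(f"RMSE: {rmse:,.0f}")
```

    Wrapping preprocessing and the regressor in one Pipeline keeps the train/test transformations consistent and makes the whole chain usable inside cross-validation or grid search.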


    Housing prices are an important reflection of the US real estate market, and housing price ranges are of great interest to both buyers and sellers. Real estate is the world’s largest asset class, worth $277 trillion, three times the total value of all publicly traded companies. ML/AI applications have been accompanying the sector’s growth.

    One of the most popular AI applications in the industry is intelligent investing. This application helps answer questions like:

    • Which house should I buy or build to maximize my return?
    • Where or when should I do so?
    • What is its optimum rent or sale price?

    In this blog post, we have reviewed how ML leverages the power of housing data to tackle these important questions. We have also explored the pros and cons of ML algorithms and how optimizing various steps of actual Python workflows can help improve their performance.


    Using Data to Predict Ames, Iowa Housing Price

    Using linear regression and feature engineering to predict housing prices in Ames, Iowa

    GitHub Rep Ames-housing-price-prediction


    Boston House Price Prediction Using Machine Learning

    House Price Prediction using Linear Regression from Scratch

    House price prediction – Austin, TX

    GitHub 137 public repositories matching housing-prices

    Predicting House Prices with Linear Regression | Machine Learning from Scratch (Part II)

    California Housing Prices

    California Housing

    Machine Learning Project: House Price Prediction

    Machine learning


    Real Estate Supervised ML/AI Linear Regression Revisited – USA House Price Prediction

    Supervised Machine Learning Use Case: Prediction of House Prices

  • Gulf’s Oil Price Web Scraping in R

    Gulf’s Oil Price Web Scraping in R

    Gulf states to gain $1.3 trillion in additional oil revenue by 2026: IMF.

    • The gains, due to high oil prices, are expected to provide ‘firepower’ to the region’s sovereign wealth funds (SWFs), which are among the largest in the world.
    • Saudi Arabia’s PIF is chaired by Crown Prince Mohammed bin Salman. It has invested over $620 billion in total, of which $7.5 billion was invested in US stocks during the second quarter, when share prices were relatively low. PIF bought stocks of Amazon, PayPal, and BlackRock, among others, according to the report.

    Let’s discuss the basics of sourcing oil market price data for free online. Web scraping in R is a technique to retrieve large amounts of data from the Internet.

    We’ll be scraping data on Gulf’s oil prices from the Oil Price website and converting it into a usable format.

    Let’s install R version 4.1.2 (2021-11-01) from the R Foundation for Statistical Computing. It is available within the Anaconda IDE.

    Let’s set up the working directory YOURPATH

    setwd("YOURPATH")
    and install the following packages via CRAN

    type = "binary"
    install.packages("rvest", type = "binary")

    install.packages("tidyverse", type = "binary")

    install.packages("ggpubr", type = "binary")

    We need the following libraries

    library(rvest)
    library(tidyverse)
    library(ggpubr)

    The relevant R webscraping script is given by

    url <- ""  # the Oil Price website URL (truncated in the original)
    webpage <- read_html(url)
    print(webpage)
    value <- webpage %>%
      html_nodes(css = ".last_price") %>%
      html_text()
    name <- webpage %>%
      html_nodes(., css = "td:nth-child(2)") %>%
      html_text() %>%
      .[c(41:72, 74, 76:77, 79, 81:82, 84:85, 87:89, 91, 94:96, 98:99, 101:103, 105:107, 109:112, 114:116, 118:120, 122:124,
        126:128, 130:131, 133, 135:142, 145:160, 162:164, 166:167, 169:170, 172, 174,
        176:178, 180:182, 184:186, 188:189, 191, 193:196, 198:199, 201:213)]

    name1 <- append(name, NA, after=138)
    name2 <- append(name1, NA, after=139)
    name3 <- append(name2, NA, after=140)

    prices <- tibble(name = name3,
    last_price = value)


    prices10 <- tibble(namex = name10,
    last_pricex = value10)

    tabprices10=table(prices10$namex, prices10$last_pricex)

    Let’s create the basic scatter plot

    df <- data.frame(brand=prices10$namex,
                     price=prices10$last_pricex)
    brand price
    1 Al Shaheen – Qatar 88.77
    2 Iraq 95.02
    3 Basrah Heavy 96.07
    4 Basrah Medium 8.361
    5 Saudi Arabia 3.617
    6 Arab Extra Light 86.19
    7 Arab Heavy 95.28
    8 Arab Medium 66.49
    9 Nigeria 93.53
    10 Brass River 82.50

    ggbarplot(df, "brand", "price",
    fill = "steelblue", color = "steelblue",
    label = TRUE, lab.pos = "in", lab.col = "white")

    Gulf's oil price chart

    This chart is crucial for competitive pricing. In order to keep prices of your products competitive and attractive, you need to monitor and keep track of prices set by your competitors. If you know what your competitors’ pricing strategy is, you can accordingly align your pricing strategy to get an edge over them.

    Explore More

    Webscraping in R – IMDb ETL Showcase

    Firsthand Data Visualization in R: Examples

    Tutorial: Web Scraping in R with rvest

    Web Scrape Text from ANY Website – Web Scraping in R (Part 1)

  • Cloud-Native Tech Status Update Q3 2022

    Cloud-Native Tech Status Update Q3 2022

    Following our June 2022 post, let’s dive even deeper into the cloud computing (CC) trends we plan to follow in the coming months.

    CC 2022 = AI Ops


    • CC Market
    • Infographic
    • Key Services
    • Serverless Functions
    • Microservices
    • DevOps CI/CD
    • ML/AI Products
    • IoT Technology
    • Cybersecurity
    • Use-Cases
    • E-Training
    • Events
    • Explore More

    CC Market

    • Chicago, Aug. 30, 2022 (GLOBE NEWSWIRE) — Cloud Performance Management Market to grow from USD 1.5 billion in 2022 to USD 3.9 billion by 2027, at a Compound Annual Growth Rate (CAGR) of 17.6% during the forecast period, according to a new report by MarketsandMarkets™. 
    • Global cloud services spend up 33% to hit $62.3 billion in Q2 2022. The top three vendors in Q2 2022, namely Amazon Web Services (AWS), Microsoft Azure and Google Cloud, together accounted for 63% of global spending in Q2 2022 and collectively grew 42%.
    • AWS accounted for 31% of total cloud infrastructure services spend in Q2 2022, making it the leading cloud service provider. It grew 33% on an annual basis. Azure was the second largest cloud service provider in Q2, with a 24% market share after growing 40% annually. Google Cloud grew 45% in the latest quarter and accounted for an 8% market share.
    • Canalys VP Alex Smith said: “Cloud remains the strong growth segment in tech. While opportunities abound for providers large and small, the interesting battle remains right at the top between AWS and Microsoft. The race to invest in infrastructure to keep pace with demand will be intense and test the nerves of the companies’ CFOs as both inflation and rising interest rates create cost headwinds.”
    • Both AWS and Microsoft are continuing to roll out infrastructure. AWS has plans to launch 24 availability zones across eight regions, while Microsoft plans to launch 10 new regions over the next year. In both cases, the providers are increasing investment outside of the US as they look to capture global demand and ensure they can provide low-latency and high data sovereignty solutions.

    Key Services

    Types of CC architecture: public cloud, private cloud, and hybrid cloud.

    Why database migration to cloud?

    Types of cloud services: IaaS, PaaS, FaaS, and SaaS.

    • With IaaS, you rent IT infrastructure—servers and virtual machines (VMs), storage, networks, operating systems—from a cloud provider on a pay-as-you-go basis.
    • PaaS refers to cloud computing services that supply an on-demand environment for developing, testing, delivering, and managing software applications.
    • Overlapping with PaaS, FaaS focuses on building app functionality without spending time continually managing the servers and infrastructure required to do so. 
    • With SaaS, cloud providers host and manage the software application and underlying infrastructure, and handle any maintenance, like software upgrades and security patching.

    IaaS delivered improvements in cost, agility, scalability, and reliability.

    Cloud service models: IaaS, PaaS, and SaaS. Courtesy of David Chou.

    The diagram above provides a simplified/generalized view of choices we have from a hosting perspective:

    • On-premises: represents the traditional model of purchasing/licensing software, installing it, and managing it in our own data centers
    • Hosted: represents the co-location or managed outsourced hosting services. For example, GoGrid, Amazon EC2, etc.
    • Cloud: represents cloud fabric that provides higher-level application containers and services. For example, Google App Engine, Amazon S3/SimpleDB/SQS, etc.

    FaaS is a type of serverless cloud computing service that allows executing code in response to events without maintaining the complex infrastructure typically associated with building and launching microservices applications. With FaaS, users manage only functions and data while the cloud provider manages the application. This allows developers to get the functions they need without paying for services when code isn’t running. Some popular FaaS examples include Amazon’s AWS Lambda, Google Cloud Functions, and Microsoft Azure Functions.
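    The event-driven model above can be illustrated with a minimal Python handler in the style of AWS Lambda (the event shape and field names here are hypothetical; real triggers supply provider-specific payloads):

```python
import json

def handler(event, context):
    # React to an incoming event without managing any server infrastructure;
    # the provider invokes this function and scales it automatically.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Simulate an event invocation locally
print(handler({"name": "IoT"}, None))
```

    Billing in this model is per invocation and duration, which is why idle code costs nothing.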

    CC = Heart of Global Digital Transformation


    GCP Digital transformation Funnel

    Serverless Functions

    Going Cloud-Native


    What Serverless Does Not Mean

    FaaS executes logic in response to events. All methods are grouped into a single deployable unit – a stateless Function. Scaling is handled automatically.

    FaaS use cases

    Traditional vs Serverless Architecture

    Serverless Offerings


    Microservices: an architectural pattern that breaks down large application structures into smaller, independent services that are not dependent upon a specific coding language; derived from service-oriented architecture (SOA).

    Top 6 microservices patterns:

    Top 6 microservices patterns. Source: MuleSoft

    Microservices Use-Cases


    DevOps CI/CD

    Jenkins CI/CD Automation with OpsMX

    Jenkins is widely adopted for continuous integration (CI). Many organizations extend Jenkins with scripts and plug-ins to perform continuous delivery (CD) and deployments. Whether deploying to Kubernetes, VM, hybrid or multi-cloud environments, there is a better approach to implementing CD on top of Jenkins than using scripts and manual processes.

    Automating CD processes using AI can increase the velocity and accuracy of releases, improve the productivity of DevOps teams and eliminate manual Jenkins deployment scripts. In addition, many organizations face challenges with achieving regulatory and security compliance and performing audits. Automating CD with AI and deploying a central policy engine can help to enforce regulatory and security requirements and enable organizations to conduct audits easily.

    Modern CD solutions—those that incorporate AI and include open tool integration layers—can allow developers and DevOps teams to continue using the tools they know and love. Deep toolchain integrations provide real-time insights and diagnostics correlated with pipelines, which speeds approvals, verifications and triage activities. Modernizing your Jenkins CI/CD processing with CD automation offers key capabilities your teams need for software delivery success.

    Codefresh vs Jenkins CI/CD tooling:


    Codefresh has a bigger scope than Jenkins. In fact, one of the most crucial points of the comparison is that Jenkins is only a Continuous Integration (CI) solution, while Codefresh covers both Continuous Integration and Continuous Delivery.

    Cloud Modernization with CircleCI

    GKE Autopilot is a new mode of operation in Google Kubernetes Engine (GKE) designed to reduce operational costs around managing clusters, optimizing production time and driving higher workload availability. As a Google Cloud Platform (GCP) partner, CircleCI makes it easy to integrate CI/CD workflows with GCP and utilize modes of operation like Autopilot.


    GitOps = IaC + MRs + CI/CD

    • IaC – GitOps uses a Git repository as the single source of truth for infrastructure definitions. A Git repository is a .git folder in a project that tracks all changes made to files in a project over time. Infrastructure as code (IaC) is the practice of keeping all infrastructure configuration stored as code.

    • MRs – GitOps uses merge requests (MRs) as the change mechanism for all infrastructure updates. The MR is where teams can collaborate via reviews and comments and where formal approvals take place. A merge commits to your master (or trunk) branch and serves as a changelog for auditing and troubleshooting.

    • CI/CD – GitOps automates infrastructure updates using a Git workflow with continuous integration and continuous delivery (CI/CD). When new code is merged, the CI/CD pipeline enacts the change in the environment. Any configuration drift, such as manual changes or errors, is overwritten by GitOps automation so the environment converges on the desired state defined in Git. GitLab uses CI/CD pipelines to manage and implement GitOps automation, but other forms of automation, such as definitions operators, can be used as well.
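    The convergence-on-desired-state behavior can be illustrated with a toy reconciliation loop in Python (all names here are illustrative; real GitOps operators such as Flux or Argo CD reconcile against the Kubernetes API):

```python
def reconcile(desired: dict, actual: dict) -> dict:
    """One reconciliation pass: return the changes needed to make the
    live environment match the desired state stored in Git."""
    changes = {}
    # Create or update anything that differs from the Git-defined state
    for key, value in desired.items():
        if actual.get(key) != value:
            changes[key] = value
    # Remove drift: anything present live but absent from Git
    for key in actual:
        if key not in desired:
            changes[key] = None  # None marks a deletion
    return changes

# Desired state as defined in the Git repository
desired = {"replicas": 3, "image": "app:v2"}
# Live state after an out-of-band manual change
actual = {"replicas": 5, "image": "app:v2", "debug": "on"}

print(reconcile(desired, actual))  # the manual edits are detected as drift
```

    Running this pass repeatedly is what makes the environment converge: manual edits survive at most one reconciliation cycle.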

    ML/AI Products

    AI Ops

    Why is AI Cloud important to you?

    MLOps vs DevOps:

    MLOps is a key aspect of ML engineering that focuses on simplifying and accelerating the process of delivering ML models to production and maintaining and monitoring them. MLOps involves collaboration between different teams including data scientists, DevOps engineers, IT specialists and others.

    DevOps combines the concepts of development and operations to describe a collaborative approach to performing the tasks usually associated with separate application development and IT operations teams. 

    DevOps vs MLOps across the technology deployment stages:

    Development
    • DevOps: Usually, the code creates an interface or application. The code is wrapped into an executable or artifact before being deployed and tested with a set of checks. Ideally, this automated cycle continues until the final product is ready.
    • MLOps: The code enables the team to build or train machine learning models. The output artifacts include serialized files that can receive data inputs to generate inferences. Validation involves checking the trained model’s performance on the test data. This cycle should also continue until the model reaches a specified performance threshold.

    Version Control
    • DevOps: Version control typically only tracks changes to code and artifacts; there are few metrics to track.
    • MLOps: Pipelines usually have more factors to track. Building and training an ML model involves an iterative experimentation cycle, requiring tracking of various metrics and components for each experiment (essential for later audits). Additional components to track include training datasets, model building code and model artifacts; metrics include hyperparameters and model performance indicators, such as error rates.

    Reusability
    • DevOps: Pipelines focus on repeatable processes; teams can mix and match processes without following a specific workflow.
    • MLOps: Pipelines repeatedly apply the same workflows. The common framework across projects helps improve consistency and allows teams to progress faster because they start with familiar processes. Project templates offer structure, enabling customization to address the unique requirements of each use case. Centralized data management consolidates the organization’s data to accelerate the discovery and training processes; common approaches include a single source of truth and data warehouses.

    Continuous Monitoring
    • DevOps: Site reliability engineering (SRE) has been trending over the past few years, emphasizing the need for monitoring software from development through to production deployment. Software does not degrade in the way an ML model does.
    • MLOps: Machine learning models can degrade quickly, requiring constant monitoring and updating. Conditions in the production environment affect the model’s accuracy: after deployment, the model generates predictions on new real-world data that is constantly changing, reducing model performance. MLOps ensures that algorithms remain production-ready by incorporating procedures for continuous monitoring and model retraining.

    Infrastructure
    • DevOps: Infrastructure-as-code (IaC), build servers, CI/CD automation tools.
    • MLOps: Deep learning and machine learning frameworks, cloud storage for large datasets, GPUs for deep learning and computationally-intensive ML models.
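    The monitoring contrast above can be made concrete: a deployed model's error rate is tracked over a rolling window, and retraining is triggered once it degrades past a threshold. A minimal sketch, not tied to any MLOps platform (the class, window size and threshold are illustrative):

```python
from collections import deque

class DriftMonitor:
    """Flag model degradation from a rolling window of prediction errors."""

    def __init__(self, window: int = 100, error_threshold: float = 0.2):
        self.errors = deque(maxlen=window)  # only the most recent outcomes
        self.error_threshold = error_threshold

    def record(self, predicted, actual) -> None:
        # 1.0 marks a wrong prediction, 0.0 a correct one
        self.errors.append(0.0 if predicted == actual else 1.0)

    def needs_retraining(self) -> bool:
        # Trigger retraining once the rolling error rate breaches the threshold
        if not self.errors:
            return False
        return sum(self.errors) / len(self.errors) > self.error_threshold

monitor = DriftMonitor(window=10, error_threshold=0.2)
for predicted, actual in [(1, 1)] * 7 + [(1, 0)] * 3:  # 30% recent error rate
    monitor.record(predicted, actual)
print(monitor.needs_retraining())  # → True
```

    In a real pipeline the retraining signal would kick off the training workflow described in the Development row rather than just returning a flag.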
    DevOps Multi-Cloud Configuration Management (CM)

    IoT Technology

    The Internet of Things, or IoT, is a system of interrelated computing devices, mechanical and digital machines, objects, animals or people that are provided with unique identifiers (UIDs) and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction.

    IoT ETL pipeline = data gathering, processing, analytics and ML

    IoT data gathering, processing, analytics and ML
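    The gather → process → analyze stages of such a pipeline can be sketched end-to-end in a few lines of Python (the sensor payloads and plausibility thresholds here are made up for illustration):

```python
from statistics import mean

# Gather: raw readings as they might arrive from devices via a gateway
readings = [
    {"device_id": "sensor-1", "temp_c": 21.4},
    {"device_id": "sensor-1", "temp_c": 21.9},
    {"device_id": "sensor-1", "temp_c": 85.0},   # spurious spike
    {"device_id": "sensor-1", "temp_c": 22.1},
]

# Process (the "T" in ETL): drop readings outside a plausible range
valid = [r for r in readings if -40.0 <= r["temp_c"] <= 60.0]

# Analyze: aggregate the cleaned stream, ready for storage or an ML model
avg_temp = mean(r["temp_c"] for r in valid)
print(f"{len(valid)} valid readings, mean {avg_temp:.1f} C")
```

    In a cloud deployment each stage typically becomes its own component: the gateway batches device messages, a serverless function does the cleaning, and a warehouse or ML service consumes the aggregates.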

    GCP IIoT Streaming Analytics with FaaS

    IoT = Device + Gateway + Cloud

    Industry-4 GCP IoT Workflow

    Google Cloud IoT solutions

    Top Cloud IoT Platforms

    1. Thingworx 8 IoT Platform
    2. Microsoft Azure IoT Suite
    3. Google Cloud’s IoT Platform
    4. IBM Watson IoT Platform
    5. AWS IoT Platform
    6. Cisco IoT Cloud Connect
    7. Salesforce IoT Cloud
    8. Kaa IoT Platform
    9. Oracle IoT Platform
    10. Thingspeak IoT Platform
    11. GE Predix IoT Platform

    Top 3 cloud-native IoT vendors

    • Dynatrace – Monitor the performance, availability, and health of your IoT devices through a single all-in-one platform powered by AI.
    • 2smart – No-code software for bringing smart devices to the internet and to market.
    • Emnify – Intuitive IoT Monitoring Dashboard.


    • AT&T Managed Threat Detection and Response (MDR) is a sophisticated managed detection and response service that helps you to detect and respond to advanced threats before they impact your business. It builds on the unified security management (USM) platform for threat detection and response, and AT&T Alien Labs™ threat intelligence.
    • Multifactor authentication (MFA) is becoming increasingly standard within software development organizations, with GitHub recently announcing that two-factor authentication (2FA) will be mandatory for all code contributors by the end of 2023.
    • Cloud computing platforms are rife with misconfigurations that cybercriminals regularly exploit. Developers using infrastructure-as-code tools simply lack the cybersecurity expertise required to make sure cloud application environments are secure. It’s up to the cybersecurity team to make sure that the policies and guardrails created to secure cloud platforms are observed, especially when it comes to APIs.
    • The advent of cloud-native, container-based architecture and microservices-based applications running on platforms like Kubernetes has sharpened the focus on API security and the software supply chain—from both security teams and cyberattackers. Software supply chains and APIs have become the new attack surfaces of choice, and with everyone from the White House to entry-level developers talking SBOMs, open source security and APIs, this is an area that’s getting lots of attention. 
    • BrightTalk Webinars of Interest

    Cut Through Cybersecurity Complexity by Converging Key Capabilities at the Edge

    Through the Eyes of the Wolf: Insights into the Chain of Cyber Threats

    5 Signs the World Isn’t Paying Enough Attention to 5G Security

    Managing User Identity in a Cloud-First World

    How to Launch an Effective Zero Trust Initiative


    Stock Markets

    ML/AI Stock Forecasting

    AI based TSLA stock forecast using AWS Forecast

    Legal-as-a-Service (LaaS)

    Deloitte Legal as a service 16 AI projects


    HealthTech Pilots

    HealthTech ML/AI Use-Cases

    The Application of ML/AI in Diabetes

    AI-Powered Stroke Prediction

    HR Automation

    GCP architecture - HR chatbot


    edX c/o IBM MicroBachelors®

    Full Stack Cloud Application Development


    Coursera: Containerized Applications on AWS


    AWS Startup Showcase: Cloud Storage & Security

    If you’re interested in strategies to help Detect and Protect Against Threats in the cloud, this AWS Startup Showcase is for you.

    Explore More

    Cloud Tech Trends June 2022

    Cloud computing in 2025

    How AI and Cloud Computing help enterprises scale up in 2022?



    Udemy Course: Serverless Concepts

    Edge, Cloud, Core: Building an Affordable Hybrid Cloud

    Scale, Build, and Manage Kubernetes Deployments


    move to cloud database
    Cloud Computing Virtualization

    Gartner cloud security report
    GCP Computing Infrastructure
    IIoT Google BigQuery
    Comparing Gitlab CI/CD to Jenkins CI
    Version 4 of the industry’s most popular DevOps market landscape tool, the Periodic Table of DevOps. Selected Vendors: Snowflake, Moogsoft, Instana, DataDog, GitLab, among others.
    Tweets by @xebialabs
    Cloud security report 2022 by PaloAlto and PrismaCloud
    Key cybersecurity trends Q1 2022
  • 10 AI-Powered Websites for Content Writers

    10 AI-Powered Websites for Content Writers

    Following the recent blog, I recommend the following 10 mind-blowing AI websites you probably didn’t know existed:

    Remove unwanted things from images in seconds.

    AI model drawing images from any prompt! #craiyon

    #craiyon desert images

    A better, 10x faster way to write:

    • interview questions
    • profile bio
    • landing page copies
    • YouTube descriptions
    • SEO titles
    • testimonials & reviews

    • Take a picture of the item
    • Get the word in one of 10 languages
    • Start sketching
    • It guesses what you’re trying to draw
    • And offers better alternatives for you.

    Font pairing made simple

    Generate font combinations with deep learning


    Ask your question to the collective wisdom of >100,000 books

    Talk to Books

    Find a person that’s not a person.

    Need to use an image of a person that’s copyright-free?

    • It generates a new face each time.
    • The face is AI-generated: it’s of a person who does not exist.
    • Enter key words
    • Choose what type of name you want

    Namelix generates hundreds of ideas and logos for you.

    Let’s try “food”:

    Namelix example

    Image enhancer & upscaler.

    Upscale pics with AI

    Fix pixelation and blur

    Correct colors and lighting

    Remove JPEG artifacts.

    Lets Enhance Image before
    Lets Enhance Image after
  • The Qullamaggie’s OXY Swing Breakouts

    The Qullamaggie’s OXY Swing Breakouts

    Featured Photo by @Nate_Dumlao at @unsplash

    This post was inspired by Qullamaggie’s trading journey and its application to TSLA swing breakouts. Read more about breakouts here.

    Our current goal is to extend the above breakout analysis to the $OXY stock.


    TradingView OXY Analysis:

    OXY advanced price chart: candlesticks, trading volume, Bollinger bands, Awesome Oscillator (AO), and Chaikin Oscillator.


    OXY advanced price chart 1D
candlesticks, trading volume, Bollinger bands, Awesome Oscillator (AO), and Chaikin Oscillator


    OXY advanced price chart 5D
candlesticks, trading volume, Bollinger bands, Awesome Oscillator (AO), and Chaikin Oscillator

    The summary of Occidental Petroleum Corporation is based on the most popular technical indicators, such as Moving Averages, Oscillators and Pivots. 

    TradingView: OXY price target

    OXY technical analysis

    OXY STRONG BUY technical summary based on the most popular technical indicators, such as Moving Averages, Oscillators and Pivots. 


    Oscillators:

    • Relative Strength Index (14): 65.42 (Neutral)
    • Stochastic %K (14, 3, 3): 79.62 (Neutral)
    • Commodity Channel Index (20): 165.12 (Neutral)
    • Average Directional Index (14): 33.40 (Neutral)
    • Awesome Oscillator: 9.95 (Neutral)
    • Momentum (10): 18.11 (Buy)
    • MACD Level (12, 26): 5.36 (Buy)
    • Stochastic RSI Fast (3, 3, 14, 14): 45.74 (Neutral)
    • Williams Percent Range (14): −5.11 (Neutral)
    • Bull Bear Power: 13.54 (Neutral)
    • Ultimate Oscillator (7, 14, 28): 65.77 (Neutral)


    Moving Averages:

    • Exponential Moving Average (10): 65.43 (Buy)
    • Simple Moving Average (10): 63.40 (Buy)
    • Exponential Moving Average (20): 61.64 (Buy)
    • Simple Moving Average (20): 62.90 (Buy)
    • Exponential Moving Average (30): 57.77 (Buy)
    • Simple Moving Average (30): 58.97 (Buy)
    • Exponential Moving Average (50): 51.24 (Buy)
    • Simple Moving Average (50): 47.96 (Buy)
    • Exponential Moving Average (100): 43.37 (Buy)
    • Simple Moving Average (100): 35.35 (Buy)
    • Exponential Moving Average (200): 43.63 (Buy)
    • Simple Moving Average (200): 38.43 (Buy)
    • Ichimoku Base Line (9, 26, 52, 26): 56.54 (Neutral)
    • Volume Weighted Moving Average (20): 62.64 (Buy)
    • Hull Moving Average (9): 70.74 (Buy)

    E2E Workflow

    The E2E algorithm is implemented as a sequence of the following steps:

    • trend_filter – takes in a pandas Series of prices and outputs a binary array indicating whether the stock fits the growth criteria (1) or not (0). Optional parameters: growth_4_min (the minimum 4-week growth, default 25), growth_12_min (the minimum 12-week growth, default 50) and growth_24_min (the minimum 24-week growth, default 80).

    • explicit_heat_smooth – smoothens out a time series using an explicit finite difference method. It takes the prices (np.array) and t_end, the time at which to terminate the smoothing (e.g. t = 2), and returns the smoothened time-series P. The time spacing must be < 1 for numerical stability. The scheme sets the prices as the initial condition, solves the finite difference update for the interior points at each time-step, and re-attaches the fixed boundary conditions.

    • check_consolidation – smoothens the time series and checks for consolidation; see the docstring of find_consolidation for the parameters.

    • find_consolidation – returns a binary array indicating whether each data point is classed as consolidating (1) or not (0). Parameters: prices (the price time series to check for consolidation), days_to_smooth (the length of the time-series to smoothen, in days, default 50), perc_change_days (the days back to compare the % change against, default 5), perc_change_thresh (the range-trading % criterion for consolidation, default 0.015) and check_days (the number of lookback days to check for any consolidation: if any of the last check_days days is consolidating, then the last data point is said to be consolidating; default 5).

    • We download the Yahoo Finance $OXY data and call the above functions
    • Data visualizations using a matplotlib scatter plot – original Close price + Volume size (red) vs breakouts (green).


    Set working directory YOURPATH

    import os
    os.chdir('YOURPATH')  # set the working directory
    os.getcwd()

    Import libraries

    import numpy as np
    import pandas as pd
    import yfinance as yf

    Define functions

    def trend_filter(prices: pd.core.series.Series,
                     growth_4_min: float = 25.,
                     growth_12_min: float = 50.,
                     growth_24_min: float = 80.) -> np.array:
        '''
        Take in a pandas series and output a binary array to indicate if a stock
        fits the growth criteria (1) or not (0)

        Parameters
        ----------
        prices : pd.core.series.Series
            The prices we are using to check for growth
        growth_4_min : float, optional
            The minimum 4 week growth. The default is 25
        growth_12_min : float, optional
            The minimum 12 week growth. The default is 50
        growth_24_min : float, optional
            The minimum 24 week growth. The default is 80

        Returns
        -------
        A binary array showing the positions where the growth criteria is met
        '''
        # Growth from the rolling-window low to the latest value, in percent
        growth_func = lambda x: 100*(x.values[-1]/x.min() - 1)
        growth_4 = prices.rolling(20).apply(growth_func) > growth_4_min
        growth_12 = prices.rolling(60).apply(growth_func) > growth_12_min
        growth_24 = prices.rolling(120).apply(growth_func) > growth_24_min
        return np.where(growth_4 | growth_12 | growth_24, 1, 0)
    if __name__ == '__main__':

        df ='OXY')
        df.loc[:, 'trend_filter'] = trend_filter(df['Close'])

    [*********************100%***********************]  1 of 1 completed

    df_trending = df[df['trend_filter'] == 1]

    def explicit_heat_smooth(prices: np.array,
                             t_end: float = 5.0) -> np.array:
        '''
        Smoothen out a time series using an explicit finite difference method.

        Parameters
        ----------
        prices : np.array
            The price to smoothen
        t_end : float
            The time at which to terminate the smoothing (i.e. t = 2)

        Returns
        -------
        P : np.array
            The smoothened time-series
        '''
        k = 0.1  # Time spacing, must be < 1 for numerical stability

        # Set up the initial condition
        P = prices
        t = 0
        while t < t_end:
            # Solve the finite difference scheme for the next time-step
            P = k*(P[2:] + P[:-2]) + P[1:-1]*(1 - 2*k)
            # Add the fixed boundary conditions since the above solves the
            # interior points only
            P = np.hstack((prices[0], P, prices[-1]))
            t += k
        return P

    def check_consolidation(prices: np.array,
                            perc_change_days: int,
                            perc_change_thresh: float,
                            check_days: int) -> int:
        '''
        Smoothen the time-series and check for consolidation, see the
        docstring of find_consolidation for the parameters
        '''
        # Find the smoothed representation of the time series
        prices = explicit_heat_smooth(prices)
        # Perc change of the smoothed time series to perc_change_days days prior
        perc_change = prices[perc_change_days:]/prices[:-perc_change_days] - 1
        consolidating = np.where(np.abs(perc_change) < perc_change_thresh, 1, 0)
        # Provided one entry in the last n days passes the consolidation check,
        # we say that the financial instrument is in consolidation on the end day
        if np.sum(consolidating[-check_days:]) > 0:
            return 1
        return 0

    def find_consolidation(prices: np.array,
                           days_to_smooth: int = 50,
                           perc_change_days: int = 5,
                           perc_change_thresh: float = 0.015,
                           check_days: int = 5) -> np.array:
        '''
        Return a binary array to indicate whether each of the data-points are
        classed as consolidating or not

        Parameters
        ----------
        prices : np.array
            The price time series to check for consolidation
        days_to_smooth : int, optional
            The length of the time-series to smoothen (days). The default is 50.
        perc_change_days : int, optional
            The days back to % change compare against (days). The default is 5.
        perc_change_thresh : float, optional
            The range trading % criteria for consolidation. The default is 0.015.
        check_days : int, optional
            The number of lookback days to check for any consolidation.
            If any days in check_days back is consolidating, then the last data
            point is said to be consolidating. The default is 5.

        Returns
        -------
        res : np.array
            The binary array indicating consolidation (1) or not (0)
        '''
        res = np.full(prices.shape, np.nan)
        for idx in range(days_to_smooth, prices.shape[0]):
            res[idx] = check_consolidation(
                prices=prices[idx-days_to_smooth:idx],
                perc_change_days=perc_change_days,
                perc_change_thresh=perc_change_thresh,
                check_days=check_days,
            )
        return res

    Let’s proceed with main

    if __name__ == '__main__':

        df ='TSLA')
        df.loc[:, 'consolidating'] = find_consolidation(df['Close'].values)
        df.loc[:, 'trend_filter'] = trend_filter(df['Close'])
        df.loc[:, 'filtered'] = np.where(
            df['consolidating'] + df['trend_filter'] == 2, True, False)

    [*********************100%***********************]  1 of 1 completed

    Our dataframe df looks as follows

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 3051 entries, 2010-06-29 to 2022-08-10
    Data columns (total 9 columns):
     #   Column         Non-Null Count  Dtype  
    ---  ------         --------------  -----  
     0   Open           3051 non-null   float64
     1   High           3051 non-null   float64
     2   Low            3051 non-null   float64
     3   Close          3051 non-null   float64
     4   Adj Close      3051 non-null   float64
     5   Volume         3051 non-null   int64  
     6   consolidating  3001 non-null   float64
     7   trend_filter   3051 non-null   int32  
     8   filtered       3051 non-null   bool   
    dtypes: bool(1), float64(6), int32(1), int64(1)
    memory usage: 205.6 KB

    import numpy as np
    import seaborn as sns
    import matplotlib
    import matplotlib.pyplot as plt
    import pandas as pd

    df.index = pd.DatetimeIndex(data=df.index, tz='US/Eastern')

    matplotlib.rcParams.update({'font.size': 18})
    plt.plot(df.index, df.Close, 'r')
    plt.ylabel("Price $")

    OXY close share price

    # Split on the breakout flag (assumed definitions: df0 holds the ordinary
    # days, df1 the breakout days; scal scales volume to a sensible marker size)
    df0 = df[~df['filtered']]
    df1 = df[df['filtered']]
    scal = 1e6

    plt.scatter(df0.index, df0["Close"], color='red', s=df0["Volume"]/scal, alpha=0.4)
    plt.scatter(df1.index, df1["Close"], color='green', s=df1["Volume"]/scal, alpha=0.4)
    plt.ylabel("Price $")

    plt.legend(["Close", "Filtered"], facecolor='bisque',
               loc='upper center', bbox_to_anchor=(0.5, -0.08))

    OXY share price (red), volume (scatter size), and breakout points (green)
    OXY close share price (red), volume (scatter size), and breakout points (green)


    print(df.loc['2021-01-01 00:00:00-04:00':'2022-08-22 00:00:00-04:00'])

                        Adj Close    Volume  consolidating  trend_filter  \
    2021-01-04 00:00:00-05:00  17.351271  18497800            0.0             1   
    2021-01-05 00:00:00-05:00  19.101311  37293800            1.0             1   
    2021-01-06 00:00:00-05:00  19.886841  37156400            1.0             1   
    2021-01-07 00:00:00-05:00  20.453615  24299300            1.0             1   
    2021-01-08 00:00:00-05:00  19.966389  18277900            0.0             1   
    ...                              ...       ...            ...           ...   
    2022-08-16 00:00:00-04:00  63.509998  16662200            1.0             0   
    2022-08-17 00:00:00-04:00  62.970001  14889800            1.0             0   
    2022-08-18 00:00:00-04:00  64.879997  16818000            1.0             0   
    2022-08-19 00:00:00-04:00  71.290001  79840900            0.0             0   
    2022-08-22 00:00:00-04:00  69.029999  47888500            0.0             0   
                              filtered  
    2021-01-04 00:00:00-05:00     False  
    2021-01-05 00:00:00-05:00      True  
    2021-01-06 00:00:00-05:00      True  
    2021-01-07 00:00:00-05:00      True  
    2021-01-08 00:00:00-05:00     False  
    ...                             ...  
    2022-08-16 00:00:00-04:00     False  
    2022-08-17 00:00:00-04:00     False  
    2022-08-18 00:00:00-04:00     False  
    2022-08-19 00:00:00-04:00     False  
    2022-08-22 00:00:00-04:00     False  

    We focus on the “True” trading signals and ignore “False”.

    The above scatter plot and the table help identify the setups. You need to have a watchlist ready before the market open. You should also probably have alerts set, and know how many shares you want to buy.

    A swing trader can use the daily chart to find these setups, but it also works on the weekly chart and the intraday (1- and 5-minute) charts. 

    Read More


    A trading journey

    OXY Stock Update Wednesday, 25 May 2022

    OXY Stock Analysis, Thursday, 23 June 2022

    Track All Markets with TradingView

    Predicting Trend Reversal in Algorithmic Trading using Stochastic Oscillator in Python

    S&P 500 Algorithmic Trading with FBProphet

  • Towards min(Risk/Reward) – SeekingAlpha August Bear Market Update

    Towards min(Risk/Reward) – SeekingAlpha August Bear Market Update

    Featured Photo by Nick Chong on Unsplash

    Towards min(Risk/Reward)

    Let’s look at the latest SA market update as of Sun, Aug 21, 2022.

    Cryptocurrency Digest:

    Cryptocurrency Daily Digest as of Aug 21, 2022 08:00 ET
    • Investing In The Metaverse, Not Just For Individual Investors

    As NFTs are increasingly recognised as assets, they also present a conundrum for the investment community.

    • Litecoin: Still A Valuable Coin

    LTC has a robust and growing ecosystem.

    However, it is under pressure from competitors and regulators.

    I estimate LTC could be worth over $150.

    • Bitcoin: Black Swans Are Lurking

    Bitcoin’s blow-off top at $25k on August 14th signifies the end of a reflexive rally, representing the “return to normal” stage of a bubble.

    We anticipate Bitcoin is entering “phase 2” of its first-ever bear market, which can decrease BTC by another 60% to 80%.

    Tech Daily:

    • Intel Corporation Yields Over 4%, Now’s The Time To Buy
    • Oracle: The Dividend Growth Stock Your Portfolio Needs
    • Apple: Upside Catalyst Watch – Is The VR/AR Headset Coming Next Month?
    • Twilio: Earnings Beat, Rapid Growth And Undervalued

    Market Outlook:

    Stocks have been living in the land of make-believe for the past 4 weeks.

    Futures, bond, and currency markets have a message that cannot be ignored.

    There will be no dovish pivot, and the Fed is going to raise rates much higher and keep them there for some time.

    • Weekly Commentary: Inaugural Squeeze

    Risk off had attained powerful momentum globally back in July. De-risking/deleveraging dynamics were increasingly fomenting illiquidity, contagion and instability across global markets.

    The S&P500 ended the session with a year-to-date loss of 20%. The Nasdaq100 was down 28%, while the Banks ended the session with a 2022 loss of 25%.

    These days, bullish markets, luxuriating in newfound liquidity abundance, face an unfamiliar policy backdrop. Rather than a dovish pivot, the Fed is poised to plow ahead with its first real tightening cycle in 28 years.

    • The Week On Wall Street – A Very Complex Situation

    Earnings season winds down after producing the best S&P performance since 2009.

    The recently passed tax and spend green energy bill does not contain a scintilla of “growth”. That means any recovery remains rocky at best.

    China’s zero COVID policy continues to take its toll on its economy, adding more fuel to the global recession talk.

    Economic data continues to come in weak and isn’t aligned with the “no recession” commentary.


    • Any discussions about the Federal Reserve and the stock market seem to include concern over the radical uncertainty that exists in the world today.
    • What About That Recession?

    The yield curve is inverted from the 6-month T-bill to the 10-year Treasury note, which would suggest that the bond market’s message is “recession ahead, batten down the hatches”.

    Relatively tight risk spreads suggest that nobody is thinking too seriously about the potential for corporate defaults, which of course tend to be higher during a recession.

    Those consecutive negative GDP reports notwithstanding, our economy is neither currently in nor close to being in the kind of conditions normally associated with a recession.

    Real Estate:

    SA Morning Briefing:

    Alibaba: Fortunes Will Be Made
    Ford: Lightning Charge To $20s – Get, Set, Go!

    Exxon Mobil Has An Ace Up Its Sleeve

    Home Sales Are Crashing Faster Than The Bursting Of The 2005 Housing Bubble

    Wall Street Breakfast: Housing Crunch

    Wall Street Breakfast

    • The stock market’s four-week winning streak came to an end, in reaction to an overbought market that was due for a pullback. After Wall Street’s impressive recent rally and with central bank tightening in the pipeline, traders saw an opportunity to trim back positions.
    • Alex King spent over three decades on the buy-side. Among his recent picks are stocks such as Fortinet (+110%), HubSpot (+127%) and CrowdStrike (+63%).
    • And recently, ProShares UltraPro QQQ (TQQQ) was +27% in just two weeks, and Microsoft (MSFT) was +10% in three days.
    • Growth stocks were among the favorites for hedge fund whales in the second quarter. Monday was the deadline for hedge funds with more than $100M in assets under management, as well as other institutional investors and endowments, to report certain stock holdings through 13F filings. The 13F season gives investors a glimpse into where the big players are betting, albeit with dated information.
    • A number of hedge funds and money managers looked to pick up beaten-down growth stocks, such as tech, in Q2. From April to June, the Nasdaq 100 (NDX) (QQQ) fell more than 22%, while the S&P 500 (SP500) (SPY) was down about 16.5%.
    • Among the big-name disclosures, Warren Buffett’s Berkshire-Hathaway (BRK.A) (BRK.B) boosted its stake in Activision (ATVI) to ~68.4M shares from 64.3M. It also exited its stake in Verizon (VZ).
    • Retail resilience? While housing is struggling there were also signs for some optimism. Results from Home Depot (HD) and Lowe’s (LOW) showed that the home improvement consumer is “holding up quite well,” according to Oppenheimer analyst Brian Nagel.
    • Apple (AAPL) is reportedly looking at holding its annual fall product event, including the announcement of its iPhone 14 product line, on Sept. 7.

    ETF & Portfolio Strategy

    SPY: Targeting $428 For The Last Dance

    BUZZ Investing: Meme-Stocks Make A Comeback

    VIG: Put This Solid Dividend ETF On Your Radar

    Global Investing

    Glencore: Deeply Undervalued Dividend Gem

    Centerra Gold: The Most Extreme Discount In A Mid-Tier Gold Producer That I’ve Ever Seen

    Dividend Ideas

    Blackstone Is Buying REITs Hand Over Fist

    Lincoln National Is A No-Brainer Buy, Here’s Why

    Stock Ideas

    Occidental Petroleum: Is Warren Buffett Taking Over?

    Cleveland-Cliffs: We Simply Can’t Remember A Stock Like This

    Explore More


    Zacks Insights into this High Inflation/Rising Rate Market

    Macroaxis AI Investment Opportunity

    Bear Market Similarity Analysis using Nasdaq 100 Index Data

    Basic Stock Price Analysis in Python

    Stocks on Watch Tomorrow

    Investment Risk Management Study

    Inflation-Resistant Stocks to Buy

    20 Top Social Media Sites

    Track All Markets with TradingView

    Algorithmic Trading using Monte Carlo Predictions and 62 AI-Assisted Trading Technical Indicators (TTI)

    Macroaxis Wealth Optimization

    The Pup’s Weekend Dig – Tech Top & Energy Rotation