Tag: scikit-learn

Titanic Benchmark Hypothesis Testing in Disaster Risk Management: (Auto)EDA, ML, HPO & SHAP

This project applies the Titanic benchmark to hypothesis testing in disaster risk management. Using the Titanic dataset on Kaggle, a Machine Learning (ML) analysis was performed to determine whether a person’s death has a statistically significant relationship with their passenger class, age, sex, and port of embarkation. The project involved comprehensive ML pipeline implementation…

29th Mar 2024
Malware Detection & Interpretation – PCA, T-SNE & ML

This post discusses the application of PCA, T-SNE, and supervised ML algorithms for malware detection using a benchmark dataset. Techniques such as Logistic Regression, SVC, KNN, and XGBoost are implemented, achieving high performance metrics. Results show potential for improving malware detection using ML while reducing false positives and enhancing cyber defense.

22nd Feb 2024
H2O AutoML Malware Detection

This study explores AI-powered malware detection using the H2O AutoML algorithm for effective and rapid classification of PE files. The optimized Stacked Ensemble model achieved high precision, recall, and F1 score. The research validates the H2O AutoML workflow’s accurate malware identification and supports related R&D products and solutions in the field of information security.

13th Feb 2024
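
Precision, recall, and F1 score, the metrics quoted for the Stacked Ensemble model above, all derive from confusion-matrix counts. The following is a minimal sketch of that arithmetic; the helper name `precision_recall_f1` and the counts are hypothetical, chosen only for illustration, and are not results from the study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: malware samples correctly flagged (true positives)
    fp: benign samples wrongly flagged (false positives)
    fn: malware samples missed (false negatives)
    """
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Reducing false positives raises precision, while catching more malware raises recall; F1 rewards a balance of the two.
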
Leveraging Predictive Uncertainties of Time Series Forecasting Models

Featured Image via Canva. Table of Contents: Introduction; Random Simulation Tests; TSLA Stock 43 Days; TSLA Stock 300 Days; Housing in the United States; Industrial Production; Federal Funds Rate Data; S&P 500 Absolute Returns; Number of Airline Passengers 1: Holt-Winters; Number of Airline Passengers 2: Prophet; Average Temperature in India; Monthly Sales Data Analysis QC…

5th Jan 2024
Prediction of NASA Turbofan Jet Engine RUL: OLS, SciKit-Learn & LSTM

We predict the Remaining Useful Life (RUL) of NASA turbofan jet engines by comparing statsmodels OLS and scikit-learn regression against a Keras LSTM in Python. The input dataset is the Kaggle version of NASA's public dataset for asset degradation modeling. It includes simulated run-to-failure data from turbofan jet engines.

8th Dec 2023
Health Insurance Cross Sell Prediction with ML Model Tuning & Validation

This post discusses the use of AI and Machine Learning (ML) for insurance cross-selling. It covers data preparation, model training with different algorithms, parameter optimization, and model evaluation. The study showcases the ability of ML models (HGBM, XGBoost, Random Forest) to predict cross-sell customers in the insurance sector, providing potential for improved…

2nd Dec 2023
Weather Forecasting & Flood De-Risking using Machine Learning, Markov Chain & Geospatial Plotly EDA

Photo by Pok Rie. Scope: Business Value: Table of Contents: U.S.A. Weather Forecast; Australian Rainfall Prediction; Kerala Flood Prediction. Squares are categorical associations (uncertainty coefficient and correlation ratio) from 0 to 1. The uncertainty coefficient is asymmetrical (i.e., ROW LABEL values indicate how much INFORMATION they PROVIDE to each LABEL at the TOP). • Circles are the symmetrical numerical…

26th Nov 2023
Hugging Face NLP, Streamlit, PyGWalker, TF & Gradio App

Table of Contents: Streamlit/Dash/Jupyter PyGWalker EDA Demo; PyGWalker and Dash: Creating a Data Visualization Dashboard In Less Than 20 Lines of Code; PyGWalker Test; PyGWalker Tutorial: A Tableau-Like Python Library for Interactive Data Exploration and Visualization; PyGWalker: A Python Library for Visualizing Pandas Dataframes; You’ll Never Walk Alone: Use Pygwalker to Visualize Data in…

13th Nov 2023
Real-Time Anomaly Detection of NAB Ambient Temperature Readings using the TensorFlow/Keras Autoencoder

This post is a detailed guide to implementing anomaly detection in time series data using autoencoders. The tutorial uses Python and a real-world temperature dataset from the Numenta Anomaly Benchmark (NAB). The Python workflow imports the required libraries, performs anomaly detection, and visualizes the anomalies. A trained autoencoder model identifies anomalies, with Precision, Recall, and F1…

23rd Oct 2023
Supervised ML Room Occupancy IoT

The article presents a study on applying machine learning (ML) to IoT sensor data for workspace occupancy detection. The study compares 14 popular scikit-learn classifiers; the resulting ML systems use the gathered IoT sensor data to predict room occupancy with high certainty. The results suggest temperature and light are the significant factors affecting occupancy detection. The study…

10th Aug 2023
WA House Price Prediction: EDA-ML-HPO

A predictive model of house sale prices in King County, Washington, was developed using multiple supervised machine learning (ML) regression models, including LinearRegression, SGDRegressor, RandomForestRegressor, XGBRegressor, and AdaBoostRegressor. The best-performing model, XGBRegressor, explained 90.6% of the price variance, with an RMSE of $18,472.7. These results, valuable to local realtors, indicate houses with a waterfront are…

11th Jul 2023
ML Prediction of High/Low Video Game Hits with Data Resampling and Model Tuning

The post outlines an ML-based approach to forecasting video game sales, using several techniques to enhance training, accuracy, and prediction. Kaggle’s VGChartz dataset, containing sales data and other game-specific information, was used to build and refine the model. Several ML techniques including RandomForestClassifier and Logistic Regression yielded top predictors, with the critic’s score deemed…

21st Jun 2023
Comparison of 20 ML + NLP Algorithms for SMS Spam-Ham Binary Classification

This post analyzes a public-domain SMS text message dataset to compare various machine learning algorithms’ abilities to classify spam and ham messages. After implementing a Python workflow that includes data preparation, exploratory analysis, natural language processing, supervised machine learning binary classification, and a model performance analysis, the author finds that MLP, Logistic Regression CV, Linear…

8th Jun 2023
Improved Multiple-Model ML/DL Credit Card Fraud Detection: F1=88% & ROC=91%

In 2023, the global card industry is projected to suffer $36.13 billion in fraud losses. This has necessitated a priority focus on enhancing credit card fraud detection by banks and financial organizations. AI-based techniques are making fraud detection easier and more accurate, with models able to recognize unusual transactions and fraud. The post discusses a…

27th May 2023
Unsupervised ML, K-Means Clustering & Customer Segmentation

Table of Clickable Contents: Motivation; Methods; Open-Source Datasets. This file contains the basic information (ID, age, gender, income, and spending score) about the customers. Online retail is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion…

22nd May 2023
Dabl Auto EDA-ML

Dabl, short for Data Analysis Baseline Library, is a high-level data exploration library in Python that automates repetitive data wrangling tasks in the early stages of supervised machine learning model development. Developed by Andreas Mueller and the scikit-learn community, it facilitates data preprocessing, advanced integrated visualization, exploratory data analysis (EDA), and ML model development, demonstrated…

19th Apr 2023
About Face Recognition ML Algorithms

Facial Recognition (FR) involves mapping an individual’s facial features mathematically and storing the data as a faceprint. This case study outlines the process of Exploratory Data Analysis (EDA) and performance QC analysis for ML/AI workflows using public-domain datasets and real-time webcam GUI. The study includes the use of SVM for FR, dataset splitting, ML model…

8th Mar 2023
Comparative ML/AI Performance Analysis of 13 Handwritten Digit Recognition (HDR) Scikit-Learn Algorithms with PCA+HPO

Featured Photo by Torsten Dettlaff on Pexels. The article consists of the following three parts: 3. Unsupervised ML using Principal Component Analysis (PCA) for dimensionality reduction within Parts 1 and 2. Our main goal is to build a text and graphics report comparing the main scikit-learn classification metrics: accuracy_score, classification_report (precision, recall, and…

4th Feb 2023
The Power of AIHealth: Comparison of 12 ML Breast Cancer Classification Models

AI Health is leveraging Machine Learning (ML) and Artificial Intelligence (AI) for early diagnosis and prediction of breast cancer (BC), utilizing different ML techniques for binary classification of the disease. A comparative analysis demonstrated that Linear Regression was the most effective classifier based on various performance metrics. This research aims to integrate ML in public…

3rd Dec 2022
A Comparison of Scikit Learn Algorithms for Breast Cancer Classification – 2. Cross Validation vs Performance

The post is a continuation of a previous breast cancer study comparing Scikit-Learn binary classifiers for cross validation and model performance. The classifiers compared include Logistic Regression, GaussianNB, SVC, KNN, Random Forest, Extra Trees, and Gradient Boosting. Learning curves show the comparison of classifier performance. Results indicate GaussianNB is more efficient than SVC in terms…

25th Nov 2022
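
The cross-validation comparison described in the last post can be sketched in a few lines of scikit-learn. This is a minimal illustration under stated assumptions, not the post's actual workflow: it assumes the Wisconsin breast cancer dataset bundled with scikit-learn, compares only two of the listed classifiers, and picks 5-fold stratified splitting, standard scaling, and accuracy scoring for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the dataset bundled with scikit-learn stands in for the post's data.
X, y = load_breast_cancer(return_X_y=True)

# Two of the classifiers named in the excerpt; scaling is an illustrative choice.
models = {
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "GaussianNB": GaussianNB(),
}

# Fixed seed so the fold assignment is reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, model in models.items():
    # One accuracy value per fold; the spread hints at model stability.
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Wrapping the scaler and classifier in one pipeline keeps the scaling fitted on the training folds only, which avoids leaking test-fold statistics into the comparison.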