Tag: scikit-learn
-
Titanic Benchmark Hypothesis Testing in Disaster Risk Management: (Auto)EDA, ML, HPO & SHAP

This project applies the Titanic benchmark to hypothesis testing in disaster risk management. Using the Titanic dataset on Kaggle, a Machine Learning (ML) analysis was performed to determine whether a passenger’s death was statistically related to their passenger class, age, sex, and port of embarkation. The project involved comprehensive ML pipeline implementation…
-
Malware Detection & Interpretation – PCA, T-SNE & ML

This post discusses the application of PCA, T-SNE, and supervised ML algorithms for malware detection using a benchmark dataset. Techniques such as Logistic Regression, SVC, KNN, and XGBoost are implemented, achieving high performance metrics. Results show potential for improving malware detection using ML while reducing false positives and enhancing cyber defense.
-
H2O AutoML Malware Detection

This study explores AI-powered malware detection using the H2O AutoML algorithm for effective and rapid classification of PE files. The optimized Stacked Ensemble model achieved high precision, recall, and F1 score. The research validates the H2O AutoML workflow’s accurate malware identification and supports related R&D products and solutions in the field of information security.
-
Leveraging Predictive Uncertainties of Time Series Forecasting Models

Featured Image via Canva. Table of Contents: Introduction; Random Simulation Tests; TSLA Stock 43 Days; TSLA Stock 300 Days; Housing in the United States; Industrial Production; Federal Funds Rate Data; S&P 500 Absolute Returns; Number of Airline Passengers: 1. Holt-Winters; Number of Airline Passengers: 2. Prophet; Average Temperature in India; Monthly Sales Data Analysis QC…
-
Health Insurance Cross Sell Prediction with ML Model Tuning & Validation

This post discusses the use of AI and Machine Learning (ML) for insurance cross-selling. It covers data preparation, model training with different algorithms, parameter optimization, and model evaluation. The study shows that ML models (HGBM, XGBoost, Random Forest) can predict cross-sell customers in the insurance sector, providing potential for improved…
-
Weather Forecasting & Flood De-Risking using Machine Learning, Markov Chain & Geospatial Plotly EDA

Photo by Pok Rie. Scope: Business Value: Table of Contents: U.S.A. Weather Forecast; Australian Rainfall Prediction; Kerala Flood Prediction. Squares are categorical associations (uncertainty coefficient and correlation ratio) from 0 to 1. The uncertainty coefficient is asymmetrical (i.e., ROW LABEL values indicate how much they PROVIDE INFORMATION to each LABEL at the TOP). • Circles are the symmetrical numerical…
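The asymmetry of the uncertainty coefficient mentioned in this excerpt can be illustrated with a minimal sketch. It assumes the standard definition U(X|Y) = (H(X) - H(X|Y)) / H(X); the function names and the toy `station` and `rained` columns are hypothetical and are not taken from the post.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy H(X) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Conditional entropy H(X | Y) from paired observations."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum(
        (c / n) * log(c / y_counts[y_val])
        for (_, y_val), c in xy_counts.items()
    )


def uncertainty_coefficient(x, y):
    """U(X | Y): share of the uncertainty in X that is removed by knowing Y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # X is constant, so nothing is left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


# Hypothetical toy data: the station fully determines the rain flag,
# but the rain flag only narrows the station down to two candidates.
station = ["A", "B", "C", "D"]
rained = ["no", "no", "yes", "yes"]

u_station_given_rain = uncertainty_coefficient(station, rained)
u_rain_given_station = uncertainty_coefficient(rained, station)
print(u_station_given_rain, u_rain_given_station)
```

The two values differ because knowing the station removes all uncertainty about the rain flag, while knowing the rain flag removes only part of the uncertainty about the station.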
-
Hugging Face NLP, Streamlit, PyGWalker, TF & Gradio App

Table of Contents Streamlit/Dash/Jupyter PyGWalker EDA Demo PyGWalker and Dash — Creating a Data Visualization Dashboard In Less Than 20 Lines of Code PyGWalker Test PyGWalker Tutorial: A Tableau-Like Python Library for Interactive Data Exploration and Visualization PyGWalker: A Python Library for Visualizing Pandas Dataframes You’ll Never Walk Alone: Use Pygwalker to Visualize Data in…
-
Real-Time Anomaly Detection of NAB Ambient Temperature Readings using the TensorFlow/Keras Autoencoder

This post is a detailed guide to implementing anomaly detection in time series data using autoencoders. The tutorial uses Python and a real-world temperature dataset from the Numenta Anomaly Benchmark (NAB). The Python workflow imports the required libraries, performs anomaly detection, and visualizes the anomalies. A trained autoencoder model identifies anomalies, with Precision, Recall, and F1…
-
Supervised ML Room Occupancy IoT

The article presents a study on applying machine learning (ML) to IoT sensor data for workspace occupancy detection. Comparing 14 popular scikit-learn classifiers, the ML systems built use the gathered IoT sensor data to predict room occupancy with high certainty. The results suggest temperature and light are the significant factors affecting occupancy detection. The study…
-
WA House Price Prediction: EDA-ML-HPO

A predictive model of house sale prices in King County, Washington, was developed using multiple supervised machine learning (ML) regression models, including LinearRegression, SGDRegressor, RandomForestRegressor, XGBRegressor, and AdaBoostRegressor. The best-performing model, XGBRegressor, explained 90.6% of the price variance, with an RMSE of $18,472.7. These results, valuable to local realtors, indicate houses with a waterfront are…
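For readers unfamiliar with the two regression metrics quoted in this excerpt, here is a minimal sketch of how explained variance (R²) and RMSE are usually computed. The helper names and the toy prices are hypothetical and are not the King County data; in practice scikit-learn's `r2_score` and `mean_squared_error` would typically be used instead.

```python
from math import sqrt


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return sqrt(mse)


# Hypothetical sale prices (USD) and model predictions.
prices = [200_000, 300_000, 400_000, 500_000]
predicted = [210_000, 290_000, 410_000, 490_000]

print(f"R^2  = {r_squared(prices, predicted):.3f}")
print(f"RMSE = ${rmse(prices, predicted):,.1f}")
```

R² is unitless, while RMSE is expressed in dollars, which is why the two are reported together.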
-
ML Prediction of High/Low Video Game Hits with Data Resampling and Model Tuning

The post outlines an ML-based approach to forecast video game sales, using several techniques to enhance training, accuracy, and prediction. Kaggle’s VGChartz dataset, containing sales data and other game-specific information, was used to build and refine the model. Several ML techniques including RandomForestClassifier and Logistic Regression yielded top predictors, with the critic’s score deemed…
-
Comparison of 20 ML + NLP Algorithms for SMS Spam-Ham Binary Classification

This post analyzes a public-domain SMS text message dataset to compare various machine learning algorithms’ abilities to classify spam and ham messages. After implementing a Python workflow that includes data preparation, exploratory analysis, natural language processing, supervised machine learning binary classification, and a model performance analysis, the author finds that MLP, Logistic Regression CV, Linear…
-
Improved Multiple-Model ML/DL Credit Card Fraud Detection: F1=88% & ROC=91%

In 2023, the global card industry is projected to suffer $36.13 billion in fraud losses. This has necessitated a priority focus on enhancing credit card fraud detection by banks and financial organizations. AI-based techniques are making fraud detection easier and more accurate, with models able to recognize unusual transactions and fraud. The post discusses a…
-
Unsupervised ML, K-Means Clustering & Customer Segmentation

Table of Clickable Contents: Motivation; Methods; Open-Source Datasets. This file contains the basic information (ID, age, gender, income, and spending score) about the customers. Online retail is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion…
-
Dabl Auto EDA-ML

Dabl, short for Data Analysis Baseline Library, is a high-level data exploration library in Python that automates repetitive data wrangling tasks in the early stages of supervised machine learning model development. Developed by Andreas Mueller and the scikit-learn community, it facilitates data preprocessing, advanced integrated visualization, exploratory data analysis (EDA), and ML model development, demonstrated…
-
About Face Recognition ML Algorithms

Facial Recognition (FR) involves mapping an individual’s facial features mathematically and storing the data as a faceprint. This case study outlines the process of Exploratory Data Analysis (EDA) and performance QC analysis for ML/AI workflows using public-domain datasets and real-time webcam GUI. The study includes the use of SVM for FR, dataset splitting, ML model…
-
Comparative ML/AI Performance Analysis of 13 Handwritten Digit Recognition (HDR) Scikit-Learn Algorithms with PCA+HPO

Featured Photo by Torsten Dettlaff on Pexels. The article consists of the following three parts: 3. Unsupervised ML using Principal Component Analysis (PCA) for dimensionality reduction within Parts 1 and 2. Our main goal is to build a text and graphics report comparing the main scikit-learn classification metrics: accuracy_score, classification_report (precision, recall, and…
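The metrics named in this excerpt can be sketched in plain Python for a single positive class (for example, one digit treated one-vs-rest). This is an illustration under stated assumptions, not the post's code: the `binary_metrics` helper and the toy labels are hypothetical, and scikit-learn's `accuracy_score` and `classification_report` report the same quantities per class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 marks the digit of interest, 0 marks every other digit.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```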
-
The Power of AIHealth: Comparison of 12 ML Breast Cancer Classification Models

AI Health is leveraging Machine Learning (ML) and Artificial Intelligence (AI) for early diagnosis and prediction of breast cancer (BC), utilizing different ML techniques for binary classification of the disease. A comparative analysis demonstrated that Linear Regression was the most effective classifier based on various performance metrics. This research aims to integrate ML in public…
-
A Comparison of Scikit Learn Algorithms for Breast Cancer Classification – 2. Cross Validation vs Performance

The post is a continuation of a previous breast cancer study comparing Scikit-Learn binary classifiers for cross validation and model performance. The classifiers compared include Logistic Regression, GaussianNB, SVC, KNN, Random Forest, Extra Trees, and Gradient Boosting. Learning curves show the comparison of classifier performance. Results indicate GaussianNB is more efficient than SVC in terms…
