ML/AI Diamond Price Prediction with R

Inspired by the recent study, this post will cover an analysis of diamonds with R Studio. On the demand side, customers in the market for a less expensive, smaller diamond are probably more sensitive to price than more well-to-do buyers. Therefore, it makes a perfect sense to use a pre-trained ML/AI model to get an idea of whether you are overpaying. 


  1. Preparation Phase
  2. Exploratory Data Analysis (EDA)
  3. Training Model
  4. Model Performance
  5. Key Takeaways
  6. References

Preparation Phase

Let’s change our working directory to YOURPATH


and install the following packages

type = “binary”
install.packages(“GGally”, type = “binary”)
install.packages(“tidymodels”, type = “binary”)
install.packages(“tictoc”, type = “binary”)
install.packages(“caTools”, type = “binary”)
install.packages(“xgboost”, type = “binary”)
install.packages(“e1071”, type = “binary”)
install.packages(“rpart”, type = “binary”)
install.packages(“randomForest”, type = “binary”)
install.packages(“Metrics”, type = “binary”)

We need a few libraries









and call the diamond data


Here is the summary


Diamond data summary table

Exploratory Data Analysis (EDA)

Let’s look at the pair correlation plot

Diamond data pair-correlation plot

Let’s look at the Q-Q plot of log Price

Normal Q-Q plot of log Price

We can see that the multi-modal histogram of log Price does not follows the normal distribution

hist_norm <- ggplot(diamonds, aes(log(price))) +
geom_histogram(aes(y = ..density..), colour = “black”, fill = ‘yellow’, bins = 55) +
stat_function(fun = dnorm, args = list(mean = mean(log(diamonds$price)), sd = sd(log(diamonds$price))))

Histogram of log Price

Let’s consider log Price as the target variable and split the input data with SplitRatio = 0.7

split = sample.split(diamonds$log_price, SplitRatio = 0.7)
diamonds_train = subset(diamonds, split == TRUE)
diamonds_test = subset(diamonds, split == FALSE)

Let’s prepare the data for ML

diamonds_train <- diamonds_train %>%
mutate_at(c(‘table’, ‘depth’), ~(scale(.) %>% as.vector))
diamonds_test <- diamonds_test %>%
mutate_at(c(‘table’, ‘depth’), ~(scale(.) %>% as.vector))

Training Model

Let’s call the lm() function to fit linear models

mlm <- lm(log_price ~ carat + color + cut + clarity + table + depth, diamonds_train)

mlm: 0.04 sec elapsed

lm summary: residuals, coefficients, signif. codes, residual standard error 0.345, R2=0.88, p-value 2.2.e-16, and F-statistic 1.51e+4.
lm summary: residuals, coefficients, signif. codes, residual standard error 0.345, R2=0.88, p-value 2.2.e-16, and F-statistic 1.51e+4.

Let’s call the 3rd order polynomial function lm()

poly <- lm(log_price ~ poly(carat,3) + color + cut + clarity + poly(table,3) + poly(depth,3), diamonds_train)

lm poly 3 summary: residuals, coefficients, signif. codes, residual standard error 0.1332, R2=0.98, p-value 2.2.e-16, and F-statistic 8.64e+4.

Let’s apply the XGBoost algorithm to both training and test data

diamonds_train_xgb <- diamonds_train %>%
mutate_if(is.factor, as.numeric)
diamonds_test_xgb <- diamonds_test %>%
mutate_if(is.factor, as.numeric)

xgb <- xgboost(data = as.matrix(diamonds_train_xgb[-7]), label = diamonds_train_xgb$log_price, nrounds = 6166, verbose = 0)


XGBooster summary
XGBoost summary

xgb_pred = predict(xgb, as.matrix(diamonds_test_xgb[-7]))

y_actual <- diamonds_test_xgb$log_price
y_predicted <- xgb_pred

test <- data.frame(cbind(y_actual, y_predicted))

let’s look at the test actual-predicted data cross-plot

xgb_scatter <- ggplot(test, aes(10y_actual, 10y_predicted)) + geom_point(colour = ‘black’, alpha = 0.3) + geom_smooth(method = lm)

XGBoost test actual-predicted data cross-plot
XGBoost test actual-predicted data cross-plot

Let’s apply the SVM algorithm to both training and test data using kernel = ‘radial’

svr <- svm(formula = log_price ~ .,
data = diamonds_train,
type = ‘eps-regression’,
kernel = ‘radial’)

svm(formula = log_price ~ ., data = diamonds_train, type = “eps-regression”, kernel = “radial”)

SVM-Type: eps-regression
SVM-Kernel: radial
cost: 1
gamma: 0.04761905
epsilon: 0.1

Number of Support Vectors: 13718

Let’s switch to the Decision Tree algorithm

tree <- rpart(formula = log_price ~ .,
data = diamonds_train,
method = ‘anova’,
model = TRUE)

Decision Tree summary
Decision Tree summary

Let’s consider the Random Forest approach

rf <- randomForest(log_price ~ .,
data = diamonds_train,
ntree = 500)

randomForest(formula = log_price ~ ., data = diamonds_train, ntree = 500)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 2

      Mean of squared residuals: 0.01172975
                % Var explained: 98.89

Model Performance

Let’s make predictions and compare model performance

mlm_pred <- predict(mlm, diamonds_test)
poly_pred <- predict(poly, diamonds_test)
svr_pred <- predict(svr, diamonds_test)
tree_pred <- predict(tree, diamonds_test)
rf_pred <- predict(rf, diamonds_test)
xgb_pred <- predict(xgb, as.matrix(diamonds_test_xgb[-7]))

xgb_resid <- diamonds_test_xgb$log_price – xgb_pred

resid <- diamonds_test %>%
spread_residuals(mlm, poly, svr, tree, rf) %>%
select(mlm, poly, svr, tree, rf) %>%
rename_with( ~ paste0(.x, ‘_resid’)) %>%

predictions <- diamonds_test %>%
select(log_price) %>%
cbind(mlm_pred) %>%
cbind(poly_pred) %>%
cbind(svr_pred) %>%
cbind(tree_pred) %>%
cbind(rf_pred) %>%
cbind(xgb_pred) %>%

mean_log_price <- mean(diamonds_test$log_price)
tss = sum((diamonds_test_xgb$log_price – mean_log_price)^2 )

square <- function(x) {x**2}
r2 <- function(x) {1 – x/tss}

r2_df <- resid %>%
mutate_all(square) %>%
summarize_all(sum) %>%
mutate_all(r2) %>%
gather(key = ‘model’, value = ‘r2’) %>%
mutate(model = str_replace(model, ‘_resid’, ”))

model r2
1 mlm 0.8916784
2 poly 0.9813779
3 svr 0.9862363
4 tree 0.9103298
5 rf 0.9882366
6 xgb 0.9860271

xgb_rmse = sqrt(mean(residuals^2))

Let’s look at the model performance plot

r2_plot <- ggplot(r2_df, aes(x = model, y = r2, colour = model, fill = model)) + geom_bar(stat = ‘identity’)
r2_plot + ggtitle(‘R-squared Values for each Model’) + coord_cartesian(ylim = c(0.75, 1))

Model performance in terms of R-squared values
Model performance in terms of R-squared values

Let’s look at the predicted vs actual price of the test data

diamonds_test_sample <- diamonds_test %>%
left_join(predictions, by = ‘log_price’) %>%
slice_sample(n = 1000)
ggplot(diamonds_test_sample, aes(x = exp(log_price), y = exp(rf_pred), size = abs(rf_resid))) +
geom_point(alpha = 0.1) + labs(title = ‘Predicted vs Actual Cost of Diamonds in USD’, x = ‘Price’, y = ‘Predicted Price’, size = ‘Residuals’)

Predicted vs actual price in USD of the test data
Predicted vs actual price in USD of the test data
Linear fit: Predicted vs actual price in USD of the test data
Linear fit (red line): Predicted vs actual price in USD of the test data

Key Takeaways

  • The Random Forest model performed best according to the R2 value.
  • The critical factor driving price is the size or the carat weight of the diamond. 
  • Recall that we applied the log transformation to our long-tailed dollar variable before using it in regression.
  • You can use the above pre-trained model to get a sense of whether you are overpaying for establishing a lasting business relationship with a jeweler you can trust.


Best Day of My Life — Regression Analysis to Predict the Price of Diamonds

Diamonds Price EDA and Prediction

100% ML: Diamond price prediction using machine learning, python, SVM, KNN, Neural networks

How machine learning can predict the price of the diamond you desire to buy

Diamond Price Prediction with Machine Learning

“Diamonds are forever” — price prediction using Machine Learning regression models and neural networks.

Diamond Price Prediction

Building Regression Models in R using Support Vector Regression


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: