Predictive Risk Model for Customer Contracts
Assessing risk effectively is crucial for organizations to make informed decisions, and nowhere more so than in managing client contracts, where mitigating potential risks is paramount. Every contract carries uncertainties, from regulatory non-compliance to unexpected financial liabilities. To address this, organizations often rely on risk assessment models to systematically evaluate and categorize contracts based on their risk profiles.
This article walks you through building a comprehensive risk assessment model in R. I am going to use synthetic data and custom-designed input features to simulate the problem and showcase the journey from data preparation to model evaluation; you can tune it further to your requirements. By the end, you’ll understand how data-driven insights can empower businesses to make smarter decisions. Let’s dive in!
1. Setting Up the Environment
Before we begin, let’s load essential R libraries.
library(dplyr)
library(randomForest)
library(caret)
library(ggplot2)
These libraries help with data manipulation, model building, and visualization.
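If any of these packages are missing, you can install them once up front (along with writexl, smotefamily, and pROC, which appear later in the walkthrough):
install.packages(c("dplyr", "randomForest", "caret", "ggplot2",
                   "writexl", "smotefamily", "pROC"))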
2. Creating Synthetic Data
To simulate a real-world scenario, I am generating synthetic data representing 1000 contracts, each evaluated on 12 risk factors. (In my real project I used actual contract data, which I cannot share for confidentiality reasons.)
# Set seed for reproducibility
set.seed(123)
# Define weights for each risk factor
weights <- c(0.1, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05, 0.1, 0.1, 0.05, 0.1, 0.05)
# Generate synthetic data for 1000 contracts
n <- 1000
data <- data.frame(
  contract_id = 1:n,
  replicate(12, sample(1:5, n, replace = TRUE))
)
names(data)[2:13] <- paste0("risk_factor_", 1:12)
Each contract is scored on a scale of 1 to 5 for each risk factor. These scores reflect the perceived severity of each risk.
3. Calculating Risk Components
The model assesses 12 critical risk factors across a portfolio of contracts. The ultimate goal is to help organizations prioritize resources by categorizing contracts into Low, Moderate, or High risk levels.
Risk assessment here involves three elements: likelihood, impact, and the net score. Likelihood represents the probability of a risk occurring, and impact represents its severity. For each risk factor we compute a likelihood and an impact, multiply them to get that factor's risk score, and then sum the twelve factor scores into a net score for each contract:
$$\text{Risk Factor Score}_i = \text{Likelihood}_i \times \text{Impact}_i$$
$$\text{Net Score} = \sum_{i=1}^{12} \text{Risk Factor Score}_i$$
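For example, if a contract scores 4 on risk factor 1 (weight 0.1) and that risk materializes (probability 1), the likelihood is 4, the impact is 0.4, and the factor contributes 4 × 0.4 = 1.6 to the contract's net score.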
These are computed as follows:
# Each risk materializes with 70% probability (probability_i is 1 or 0)
for (i in 1:12) {
  data[[paste0("probability_", i)]] <- sample(c(1, 0), n, replace = TRUE, prob = c(0.7, 0.3))
  # Compute likelihood and impact for each risk factor
  data[[paste0("likelihood_", i)]] <- data[[paste0("probability_", i)]] * data[[paste0("risk_factor_", i)]]
  data[[paste0("impact_", i)]] <- weights[i] * data[[paste0("risk_factor_", i)]]
}
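A quick sanity check on the first few contracts confirms the derived columns line up as expected:
# Inspect the derived columns for the first risk factor
head(data[, c("contract_id", "risk_factor_1", "probability_1", "likelihood_1", "impact_1")])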
Next, we calculate the net risk score by multiplying likelihood and impact for each risk factor and summing the products across all 12 factors:
# Compute net risk score for each contract
data <- data %>%
  rowwise() %>%
  mutate(net_score = sum(c_across(starts_with("likelihood")) * c_across(starts_with("impact")))) %>%
  ungroup()
min_data <- min(data$net_score)
max_data <- max(data$net_score)
4. Categorizing Risk Levels
Contracts are categorized into “Low”, “Moderate”, or “High” risk based on their net scores. This categorization helps prioritize contracts requiring attention.
data <- data %>%
  mutate(risk_level = case_when(
    net_score <= 0.30 * max_data ~ "Low",
    net_score > 0.30 * max_data & net_score <= 0.70 * max_data ~ "Moderate",
    TRUE ~ "High"
  ))
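To see how these thresholds split the portfolio, a quick proportion check:
# Share of contracts in each risk bucket
round(prop.table(table(data$risk_level)), 2)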
5. Visualizing the Data
Two visualizations help us understand the data distribution:
1. Histogram of Net Scores:
ggplot(data, aes(x = net_score)) +
  geom_histogram(binwidth = 0.05, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Net Scores", x = "Net Score", y = "Frequency") +
  theme_minimal()
2. Bar Chart of Risk Levels:
ggplot(data, aes(x = risk_level, fill = risk_level)) +
  geom_bar() +
  labs(title = "Count of Contracts by Risk Level", x = "Risk Level", y = "Count") +
  scale_fill_manual(values = c("Low" = "green", "Moderate" = "orange", "High" = "red")) +
  theme_minimal()
# Write data to an Excel file (requires the writexl package)
library(writexl)
write_xlsx(data, "output.xlsx")
table(data$risk_level)
6. Balancing the Data
The dataset is imbalanced, as certain risk levels dominate. To address this, I am using SMOTE (Synthetic Minority Oversampling Technique).
library(smotefamily)

# 'data' is the dataset with 'risk_level' as the target variable
# Step 1: Split data by risk_level
data_low <- filter(data, risk_level == "Low")
data_moderate <- filter(data, risk_level == "Moderate")
data_high <- filter(data, risk_level == "High")

# Step 2: Determine the target size (count of the Moderate category)
target_size <- nrow(data_moderate)

# Step 3: Balance the Low and High categories
# Note: dup_size is the number of synthetic copies generated per original
# case, so the oversampled class may slightly overshoot the target size.
# (a) Balance the Low category
smote_features_low <- data_low %>% select(-risk_level)
smote_target_low <- as.factor(data_low$risk_level)
set.seed(123)
smote_result_low <- SMOTE(smote_features_low, smote_target_low, K = 5,
                          dup_size = ceiling(target_size / nrow(data_low)))
data_low_balanced <- smote_result_low$data
data_low_balanced$risk_level <- "Low"
data_low_balanced$class <- NULL

# (b) Keep the Moderate category as is
data_moderate_balanced <- data_moderate

# (c) Balance the High category
smote_features_high <- data_high %>% select(-risk_level)
smote_target_high <- as.factor(data_high$risk_level)
set.seed(123)
smote_result_high <- SMOTE(smote_features_high, smote_target_high, K = 5,
                           dup_size = ceiling(target_size / nrow(data_high)))
data_high_balanced <- smote_result_high$data
data_high_balanced$risk_level <- "High"
data_high_balanced$class <- NULL

# Step 4: Combine all balanced categories
data_balanced <- bind_rows(data_low_balanced, data_moderate_balanced, data_high_balanced)

# Step 5: Verify the distribution of risk_level
table(data_balanced$risk_level)

# Step 6: Visualize the balanced distribution
ggplot(data_balanced, aes(x = risk_level)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Class Distribution After Balancing", x = "Risk Level", y = "Count") +
  theme_minimal()

# Step 7: Save the balanced dataset
write.csv(data_balanced, "balanced_data.csv", row.names = FALSE)
7. Building the Model
I am training a Random Forest model to classify risk levels, with a 70/30 split between training and test data.
# Partition the data
set.seed(123)
trainIndex <- createDataPartition(data_balanced$risk_level, p = 0.7, list = FALSE)

# Create training and testing sets
trainData <- data_balanced[trainIndex, ]
testData <- data_balanced[-trainIndex, ]
trainData$risk_level <- as.factor(trainData$risk_level)
testData$risk_level <- as.factor(testData$risk_level)

# Check the distribution
table(trainData$risk_level)
table(testData$risk_level)

# Feature selection: the 12 risk factors are the predictors, risk level is the target
train_x <- trainData %>% select(starts_with("risk_factor"))
train_y <- trainData$risk_level
test_x <- testData %>% select(starts_with("risk_factor"))
test_y <- testData$risk_level

# Build a Random Forest model
model <- randomForest(x = train_x, y = train_y, ntree = 500, mtry = 3, importance = TRUE)

# Predict classes on the test data
predictions <- predict(model, test_x)

# Predict class probabilities on the test data
probabilities <- predict(model, test_x, type = "prob")

# Ensure predictions and test_y share the same factor levels
test_y <- factor(test_y, levels = levels(train_y))
predictions <- factor(predictions, levels = levels(test_y))

table(predictions)
table(test_y)
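Here mtry = 3 is close to the usual sqrt(p) default for 12 predictors. If you would rather tune it than fix it, randomForest ships a small helper, tuneRF; a minimal sketch (the stepFactor and improve values are illustrative):
# Search for the mtry value with the lowest out-of-bag error
set.seed(123)
tuned <- tuneRF(train_x, train_y, ntreeTry = 500,
                stepFactor = 1.5, improve = 0.01, trace = TRUE)
print(tuned)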
8. Evaluating the Model
I am evaluating the model’s performance using a confusion matrix, accuracy, precision, recall, and feature importance:
# Evaluate the model
conf_matrix <- confusionMatrix(predictions, test_y)
accuracy <- conf_matrix$overall["Accuracy"]

# For a multi-class problem, byClass is a matrix with one row per class
precision <- conf_matrix$byClass[, "Pos Pred Value"]
recall <- conf_matrix$byClass[, "Sensitivity"]

# Print evaluation metrics
print(conf_matrix)
cat("Accuracy:", accuracy, "\n")
cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")

# Alternatively, compute precision and recall per class directly from the
# confusion matrix table (rows are predictions, columns are reference labels)
conf_table <- conf_matrix$table
precision <- diag(conf_table) / rowSums(conf_table)
recall <- diag(conf_table) / colSums(conf_table)
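If you also want a single per-class summary, the F1 score (the harmonic mean of precision and recall) follows directly from the two vectors just computed:
# F1 score per class from the per-class precision and recall
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 3)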
# Feature importance visualization
# Use the overall mean decrease in accuracy rather than a single class column
importance <- importance(model)
importance_df <- data.frame(
  Feature = rownames(importance),
  Importance = importance[, "MeanDecreaseAccuracy"]
)

# Plot feature importance
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(title = "Feature Importance", x = "Risk Factors", y = "Importance")
9. Generating ROC Curves
Finally, we assess the model using one-vs-rest ROC curves and AUC values for each class.
library(pROC)

# Generate a one-vs-rest ROC curve and AUC value for each class
roc_curves <- lapply(levels(test_y), function(cls) {
  roc_response <- roc(test_y == cls, probabilities[, cls])
  plot(roc_response, main = paste("ROC Curve -", cls), col = "#1c61b6")
  auc(roc_response)
})

# Print AUC values for each class
names(roc_curves) <- levels(test_y)
print(roc_curves)
10. Interpreting Results
It’s crucial to assess the model’s performance to ensure it reliably predicts risk levels. Here’s how we evaluate our Random Forest model.
Confusion Matrix
A confusion matrix summarizes the model’s predictions compared to the actual labels. It shows:
- True Positives (TP): Correctly predicted contracts in each risk category.
- True Negatives (TN): Correctly excluded contracts from each category.
- False Positives (FP): Contracts incorrectly classified into a category.
- False Negatives (FN): Contracts incorrectly excluded from a category.
R Code:
conf_matrix <- confusionMatrix(predictions, as.factor(testData$risk_level))
print(conf_matrix)
From the confusion matrix, we derive the following metrics:
1. Accuracy: Proportion of correct predictions across all risk levels.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
2. Precision (Positive Predictive Value): How many contracts predicted as a specific risk level are correctly classified.
$$\text{Precision} = \frac{TP}{TP + FP}$$
3. Recall (Sensitivity or True Positive Rate): Proportion of actual risk-level contracts that the model correctly identifies.
$$\text{Recall} = \frac{TP}{TP + FN}$$
4. Specificity: Proportion of non-risk-level contracts that are correctly excluded.
$$\text{Specificity} = \frac{TN}{TN + FP}$$
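These one-vs-rest quantities can be read straight off the confusion matrix table. Here is a minimal sketch using the conf_matrix object computed earlier (rows of the table are predictions, columns are reference labels):
# Derive per-class TP, FP, FN, TN from the confusion matrix table
conf_table <- conf_matrix$table
total <- sum(conf_table)
tp <- diag(conf_table)
fp <- rowSums(conf_table) - tp
fn <- colSums(conf_table) - tp
tn <- total - tp - fp - fn
# One-vs-rest metrics per class
metrics <- data.frame(
  precision   = tp / (tp + fp),
  recall      = tp / (tp + fn),
  specificity = tn / (tn + fp)
)
round(metrics, 3)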
Feature Importance
Understanding which factors contribute most to the predictions is critical. Random Forest provides a measure of importance for each feature based on its contribution to reducing classification error.
The feature-importance code from Section 8 above produces the plot discussed here.
Key insights from feature importance:
- Features with higher importance scores have a stronger influence on the model’s predictions.
- Stakeholders can focus on these critical factors for mitigating risks more effectively.
ROC Curves and AUC Scores
ROC (Receiver Operating Characteristic) curves and AUC (Area Under Curve) scores provide a graphical and numerical representation of model performance:
- ROC Curve: Plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various thresholds.
- AUC Score: Measures the area under the ROC curve; higher values (close to 1) indicate better model discrimination between classes.
R Code:
library(pROC)
# Generate AUC curve
roc_curves <- lapply(levels(test_y), function(cls)
{ roc_response <- roc(test_y == cls, probabilities[, cls])
plot(roc_response, main = paste("ROC Curve -", cls), col = "#1c61b6")
auc_value <- auc(roc_response)
return(auc_value) })
# Print AUC values for each class
names(roc_curves) <- levels(test_y)
print(roc_curves)
Key points about ROC and AUC:
- Each risk level has its own ROC curve.
- A high AUC score (>0.9) signifies excellent model performance.
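If you also want a single summary number across all three classes, pROC provides multiclass.roc, which computes the Hand and Till multiclass AUC. A minimal sketch, assuming a pROC version recent enough to accept the matrix of class probabilities directly:
# Multiclass AUC (Hand & Till) across the three risk levels
multi_auc <- multiclass.roc(test_y, probabilities)
print(multi_auc$auc)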
Overall Results
- High Accuracy: Indicates the model is generally reliable in predicting risk levels.
- Balanced Precision and Recall: Ensures the model minimizes both false positives and false negatives.
- Feature Importance Analysis: Highlights which risk factors to prioritize for risk mitigation strategies.
- ROC and AUC: Demonstrates strong model performance and its ability to distinguish between risk levels effectively.
With these results, businesses can confidently use this model to prioritize contracts requiring attention and allocate resources strategically.
Conclusion
This end-to-end walkthrough shows how to simulate, analyze, and model risk data using R. From data preparation to model evaluation, each step builds toward a robust risk assessment tool. Feel free to adapt the code to your specific use case, and let us know your thoughts!