Overview
Rows: 278,860
Columns: 22
Missing Values: 0% (0 total)
Duplicate Rows: 0
Memory Usage: 210.71 MB
Data Quality
Missing Values by Column
No missing values detected
Outlier Summary
No outliers detected (IQR method)
Distributions
Figures: distribution plot and box plot for each numeric column (Age, Annual Income, Number of Dependents, Health Score, Previous Claims, Vehicle Age, Credit Score, Insurance Duration, Premium Amount).
Summary Statistics
| Column | Count | Mean | Std | Min | Median | Max |
|---|---|---|---|---|---|---|
| Age | 278,860 | 41.02 | 13.44 | 18.00 | 41.00 | 64.00 |
| Annual Income | 278,860 | 41321.80 | 33867.38 | 0.00 | 32191.00 | 128453.88 |
| Number of Dependents | 278,860 | 2.00 | 1.34 | 0.00 | 2.00 | 4.00 |
| Health Score | 278,860 | 28.46 | 15.53 | 0.04 | 26.45 | 71.08 |
| Previous Claims | 278,860 | 0.95 | 0.72 | 0.00 | 1.00 | 2.50 |
| Vehicle Age | 278,860 | 9.52 | 5.77 | 0.00 | 10.00 | 19.00 |
| Credit Score | 278,860 | 574.43 | 150.64 | 300.00 | 575.00 | 849.00 |
| Insurance Duration | 278,860 | 5.01 | 2.58 | 1.00 | 5.00 | 9.00 |
| Premium Amount | 278,860 | 933.49 | 814.70 | 0.00 | 688.00 | 2968.00 |
Correlations
Pearson Correlation Matrix
Top Correlated Pairs
| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| Credit Score | Premium Amount | 0.0045 |
| Insurance Duration | Premium Amount | 0.0037 |
| Annual Income | Credit Score | -0.0032 |
| Annual Income | Number of Dependents | 0.0030 |
| Health Score | Previous Claims | 0.0030 |
| Age | Premium Amount | 0.0023 |
| Annual Income | Previous Claims | -0.0023 |
| Number of Dependents | Health Score | 0.0023 |
| Health Score | Vehicle Age | -0.0022 |
| Number of Dependents | Credit Score | -0.0020 |
Cramer's V (Categorical Associations)
| Variable 1 | Variable 2 | Cramer's V |
|---|---|---|
| Smoking Status | Risk Category | 0.9429 |
| Exercise Frequency | Risk Category | 0.2884 |
| Smoking Status | Exercise Frequency | 0.0053 |
| Age Group | Risk Category | 0.0049 |
| Marital Status | Age Group | 0.0047 |
| Education Level | Occupation | 0.0047 |
| Gender | Property Type | 0.0045 |
| Marital Status | Income Bracket | 0.0045 |
| Education Level | Age Group | 0.0043 |
| Income Bracket | Risk Category | 0.0043 |
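The report does not show how the Cramér's V values above were computed. A minimal sketch of the standard uncorrected formula, V = sqrt(chi² / (n · (min(rows, cols) − 1))), is given below. The function name `cramers_v` and the toy data are illustrative assumptions, and the tool may apply a bias correction that this sketch omits.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    row = Counter(xs)              # row marginals
    col = Counter(ys)              # column marginals

    # Pearson chi-square statistic over every cell of the contingency table.
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Toy data (hypothetical): risk fully determined by smoking gives V = 1,
# while two unrelated columns give V = 0.
smoking = ["Yes", "No"] * 50
risk = ["High" if s == "Yes" else "Low" for s in smoking]
v_dependent = cramers_v(smoking, risk)

left = ["Yes", "Yes", "No", "No"] * 25
right = ["A", "B", "A", "B"] * 25
v_independent = cramers_v(left, right)
```

A value near 1, such as the 0.9429 reported for Smoking Status and Risk Category, indicates that one variable is almost fully determined by the other.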
Data Preprocessing
Missing Values Filled: 272,794
Outliers Capped: 38,186
Features Engineered: 3
Imputation Strategies: 3
Missing Values: Before vs After
Imputation Log
| Column | Strategy | Fill Value | Count |
|---|---|---|---|
| Age | median | 41 | 4,685 |
| Annual Income | median | 32,191 | 13,955 |
| Number of Dependents | median | 2 | 27,886 |
| Health Score | median | 26.451 | 10,597 |
| Previous Claims | median | 1 | 81,288 |
| Credit Score | median | 575 | 27,886 |
| Premium Amount | median | 688 | 1,841 |
| Marital Status | mode | Single | 5,019 |
| Occupation | constant (Unknown) | Unknown | 81,288 |
| Customer Feedback | constant (Unknown) | Unknown | 18,349 |
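The three strategies in the log (median, mode, constant) can be sketched with the standard library as follows. This is an illustration under assumptions, not the tool's implementation: the helper `impute`, the use of `None` as the missing marker, and the sample values are all hypothetical.

```python
import statistics


def impute(values, strategy, constant="Unknown"):
    """Fill None entries; return (filled_values, fill_value, n_filled)."""
    present = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    n_filled = sum(1 for v in values if v is None)
    filled = [fill if v is None else v for v in values]
    return filled, fill, n_filled


# Hypothetical sample columns, one per strategy.
ages, age_fill, age_count = impute([25, None, 41, 60, None], "median")
marital, marital_fill, _ = impute(["Single", "Married", None, "Single"], "mode")
occupation, occ_fill, _ = impute([None, "Employed"], "constant")
```

Each call returns the fill value and the number of filled cells, which correspond to the Fill Value and Count columns of the log.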
Engineered Features
Binned into 18-25/26-35/36-45/46-55/56-65/65+
Quartile-based: Low/Medium/High/Very High
Composite of Smoking + Exercise + Health Score
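The first two rules listed above (fixed age bins and quartile-based labels) can be sketched as below. The function names are hypothetical, and two details are assumptions the report does not settle: age 65 is assigned to "56-65" rather than "65+", and quartile cut points use the inclusive method with right-closed intervals. The composite score is not sketched because its formula is not given.

```python
import bisect
import statistics

AGE_BINS = [(25, "18-25"), (35, "26-35"), (45, "36-45"),
            (55, "46-55"), (65, "56-65")]


def age_group(age):
    """Map an age to its bin label; upper edges are treated as inclusive."""
    for upper, label in AGE_BINS:
        if age <= upper:
            return label
    return "65+"


def quartile_brackets(values):
    """Label each value Low/Medium/High/Very High by quartile of the sample."""
    cuts = statistics.quantiles(values, n=4, method="inclusive")  # Q1, Q2, Q3
    labels = ["Low", "Medium", "High", "Very High"]
    # bisect_left places a value equal to a cut point in the lower bracket.
    return [labels[bisect.bisect_left(cuts, v)] for v in values]


groups = [age_group(a) for a in (18, 26, 45, 64, 70)]
brackets = quartile_brackets([10, 20, 30, 40, 50, 60, 70, 80])
```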
Outlier Treatment (IQR Capping)
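A minimal sketch of IQR capping, assuming the conventional 1.5 × IQR fences (the report names the method but not the multiplier). Values outside the fences are clipped to the nearest fence rather than dropped. The function names and sample data are illustrative.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(values, k=1.5):
    """Clip values to the IQR fences; return (capped_values, n_capped)."""
    lower, upper = iqr_bounds(values, k)
    capped = [min(max(v, lower), upper) for v in values]
    n_capped = sum(1 for v in values if v < lower or v > upper)
    return capped, n_capped


# Hypothetical sample with one extreme value.
sample = [10, 12, 11, 13, 12, 11, 95]
capped, n_capped = cap_outliers(sample)
```

Because capped values sit exactly on the fences, a second IQR pass over the capped data would typically flag nothing, which is consistent with an outlier summary of zero after preprocessing.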
Premium Prediction Model
Model: Ridge Regression
R-squared: -0.0003 (higher is better)
RMSE: 814.64 (lower is better)
MAE: 658.59 (lower is better)
Test Size: 55,772
Model Comparison (5-Fold Cross-Validation)
| Model | R² | RMSE | MAE | CV Mean (R²) | CV Std |
|---|---|---|---|---|---|
| Random Forest | -0.0042 | 816.22 | 660.14 | -0.0080 | 0.0016 |
| Gradient Boosting | -0.0132 | 819.86 | 662.17 | -0.0223 | 0.0044 |
| XGBoost | -0.0633 | 839.90 | 675.46 | -0.0838 | 0.0076 |
| Ridge Regression (Best) | -0.0003 | 814.64 | 658.59 | -0.0016 | 0.0010 |
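The three metrics in the table follow standard definitions, sketched below. R² is 1 − SS_res / SS_tot, so a model that always predicts the mean of the targets scores exactly 0 and anything worse scores below 0. The helper name and the toy values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for paired true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual SS
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total SS
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Toy premiums (hypothetical) with mean 1000.
y = [100, 500, 900, 1300, 2200]
mean_baseline = regression_metrics(y, [1000] * len(y))  # predicts the mean
off_baseline = regression_metrics(y, [900] * len(y))    # slightly wrong constant
```

Read this way, R² values between -0.0003 and -0.0633 mean that none of the four models beats a constant mean prediction on the test set. The best RMSE (814.64) is correspondingly close to the standard deviation of Premium Amount (814.70) in the summary statistics.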
Feature Importance (Top 15)
Actual vs Predicted
Residual Distribution
Customer Segmentation
Method: K-Means
Clusters: 4
Features Used: 6
PCA Variance Explained: 33.9%
Elbow Method
Cluster Distribution
PCA 2D Visualization
Cluster Profiles
| Cluster | Size | Age | Annual Income | Health Score | Credit Score | Premium Amount | Insurance Duration |
|---|---|---|---|---|---|---|---|
| Young Budget | 9,665 (32.2%) | 29.03 | 27,001.83 | 27.69 | 567.67 | 591.54 | 4.93 |
| Mid-Age Standard | 5,425 (18.1%) | 40.44 | 98,065.92 | 28.4 | 576.98 | 767.53 | 5.2 |
| Senior Premium | 5,310 (17.7%) | 41 | 33,647.98 | 28.58 | 579.38 | 2,335.49 | 5.04 |
| High-Value | 9,600 (32%) | 53.4 | 27,909.26 | 29.22 | 578.42 | 600.51 | 4.97 |
Hypothesis Testing
Smokers vs Non-Smokers Premium
Result: Not Significant
Smokers: 935.4 (n=139,635)
Non-Smokers: 931.58 (n=139,225)
Test: Welch's T-Test | t = 1.2405 | p = 0.214795
Smokers have 0.4% higher premiums than Non-Smokers (p=0.2148, not significant)
High vs Low Health Score Premium
Result: Not Significant
Health Score >= 26.5: 935.23 (n=144,729)
Health Score < 26.5: 931.62 (n=134,131)
Test: Welch's T-Test | t = 1.1665 | p = 0.243426
No significant difference in premiums between high and low health score groups (p=0.2434)
Comprehensive vs Premium Policy Premium
Result: Not Significant
Comprehensive: 934.35 (n=92,479)
Premium: 931.58 (n=93,298)
Test: Welch's T-Test | t = 0.7351 | p = 0.462256
Comprehensive policies have 0.3% higher premiums than Premium policies (p=0.4623, not significant)
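The per-group standard deviations are not shown, so the t statistics above cannot be recomputed from this report alone. The sketch below shows how Welch's t and its degrees of freedom are defined, and how a two-sided p-value follows from t when the samples are large. Using the normal distribution in place of Student's t is an assumption that is reasonable only at sample sizes like the ones here. Function names and the small sample are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance, ddof=1).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


def two_sided_p_large_sample(t):
    """Two-sided p-value using the normal approximation to Student's t."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Small hypothetical samples to exercise the formula.
t_small, df_small = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# p-value implied by the t statistic reported for the smoker comparison.
p_smokers = two_sided_p_large_sample(1.2405)
```

For t = 1.2405 the approximation gives a p-value of roughly 0.215, in line with the reported 0.2148.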
Customer Default Rate Prediction
Overall Default Rate: 73.9%
Best Model: Random Forest
Train Size: 16,000
Test Size: 4,000
Model Comparison (5-Fold CV)
| Model | Accuracy | Precision | Recall | F1 | CV Mean | CV Std |
|---|---|---|---|---|---|---|
| Random Forest (Best) | 74.2% | 74.2% | 100.0% | 85.2% | 74.2% | 0.0001 |
| Gradient Boosting | 73.9% | 74.2% | 99.4% | 84.9% | 73.9% | 0.0020 |
| XGBoost | 73.0% | 74.5% | 96.9% | 84.2% | 72.5% | 0.0025 |
| Logistic Regression | 74.2% | 74.2% | 100.0% | 85.2% | 74.2% | 0.0001 |
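The Random Forest and Logistic Regression rows (accuracy equal to precision, recall of 100%) are consistent with a classifier that labels every customer as a default. The report does not include a confusion matrix, so the counts below are an assumption derived from the 4,000-row test set and the 74.2% accuracy (0.742 × 4,000 = 2,968 positives). The sketch shows that these counts reproduce the reported metrics.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Assumed counts for an "always predict default" classifier on 4,000 rows:
# every actual default is a true positive, every non-default a false positive.
always_positive = classification_metrics(tp=2968, fp=1032, fn=0, tn=0)
```

If that reading holds, these two models carry no more information than the base rate, and accuracy or F1 alone would not distinguish them from a trivial baseline.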