Insurance Premium Prediction

Overview

Rows

278,860

Columns

22

Missing Values

0%

0 total

Duplicate Rows

0

210.71 MB in memory

Data Quality

Missing Values by Column

No missing values detected

Outlier Summary

No outliers detected (IQR method)

Distributions

Age — Distribution

Age — Box Plot

Min: 18.00 | Q1: 30.00 | Med: 41.00 | Q3: 53.00 | Max: 64.00

Annual Income — Distribution

Annual Income — Box Plot

Min: 0.00 | Q1: 14412.00 | Med: 32191.00 | Q3: 60028.75 | Max: 128453.88

Number of Dependents — Distribution

Number of Dependents — Box Plot

Min: 0.00 | Q1: 1.00 | Med: 2.00 | Q3: 3.00 | Max: 4.00

Health Score — Distribution

Health Score — Box Plot

Min: 0.04 | Q1: 16.55 | Med: 26.45 | Q3: 38.36 | Max: 71.08

Previous Claims — Distribution

Previous Claims — Box Plot

Min: 0.00 | Q1: 0.00 | Med: 1.00 | Q3: 1.00 | Max: 2.50

Vehicle Age — Distribution

Vehicle Age — Box Plot

Min: 0.00 | Q1: 5.00 | Med: 10.00 | Q3: 15.00 | Max: 19.00

Credit Score — Distribution

Credit Score — Box Plot

Min: 300.00 | Q1: 452.00 | Med: 575.00 | Q3: 697.00 | Max: 849.00

Insurance Duration — Distribution

Insurance Duration — Box Plot

Min: 1.00 | Q1: 3.00 | Med: 5.00 | Q3: 7.00 | Max: 9.00

Premium Amount — Distribution

Premium Amount — Box Plot

Min: 0.00 | Q1: 288.00 | Med: 688.00 | Q3: 1360.00 | Max: 2968.00

Summary Statistics

Column | Count | Mean | Std | Min | Median | Max
Age | 278,860 | 41.02 | 13.44 | 18.00 | 41.00 | 64.00
Annual Income | 278,860 | 41321.80 | 33867.38 | 0.00 | 32191.00 | 128453.88
Number of Dependents | 278,860 | 2.00 | 1.34 | 0.00 | 2.00 | 4.00
Health Score | 278,860 | 28.46 | 15.53 | 0.04 | 26.45 | 71.08
Previous Claims | 278,860 | 0.95 | 0.72 | 0.00 | 1.00 | 2.50
Vehicle Age | 278,860 | 9.52 | 5.77 | 0.00 | 10.00 | 19.00
Credit Score | 278,860 | 574.43 | 150.64 | 300.00 | 575.00 | 849.00
Insurance Duration | 278,860 | 5.01 | 2.58 | 1.00 | 5.00 | 9.00
Premium Amount | 278,860 | 933.49 | 814.70 | 0.00 | 688.00 | 2968.00

Correlations

Pearson Correlation Matrix

Heatmap axes (rows and columns): Age, Annual Income, Number of Dependents, Health Score, Previous Claims, Vehicle Age, Credit Score, Insurance Duration, Premium Amount. Color scale: -1 to +1.

Top Correlated Pairs

Variable 1 | Variable 2 | Correlation
Credit Score | Premium Amount | 0.0045
Insurance Duration | Premium Amount | 0.0037
Annual Income | Credit Score | -0.0032
Annual Income | Number of Dependents | 0.0030
Health Score | Previous Claims | 0.0030
Age | Premium Amount | 0.0023
Annual Income | Previous Claims | -0.0023
Number of Dependents | Health Score | 0.0023
Health Score | Vehicle Age | -0.0022
Number of Dependents | Credit Score | -0.0020
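
Every coefficient in the table above is smaller than 0.005 in magnitude, which indicates essentially no linear relationship between any pair of numeric columns, including Premium Amount. For reference, here is a minimal pure-Python sketch of the Pearson coefficient. The function name `pearson` is illustrative and is not taken from the report's pipeline.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 or -1 means the two columns move together along a line; values as close to 0 as those reported here mean a linear model has almost nothing to work with.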

Cramer's V (Categorical Associations)

Variable 1 | Variable 2 | Cramer's V
Smoking Status | Risk Category | 0.9429
Exercise Frequency | Risk Category | 0.2884
Smoking Status | Exercise Frequency | 0.0053
Age Group | Risk Category | 0.0049
Marital Status | Age Group | 0.0047
Education Level | Occupation | 0.0047
Gender | Property Type | 0.0045
Marital Status | Income Bracket | 0.0045
Education Level | Age Group | 0.0043
Income Bracket | Risk Category | 0.0043
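
The two large values in the table above both involve Risk Category, which the report describes as a composite of Smoking, Exercise, and Health Score, so those associations are expected by construction. The sketch below shows one common way to compute Cramer's V from two categorical columns (chi-squared statistic, no bias correction). Whether the report applies a bias correction is not stated, and the function name `cramers_v` is illustrative.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramer's V (uncorrected) for two equal-length categorical sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    joint = Counter(zip(xs, ys))   # observed cell counts
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        raise ValueError("each variable needs at least two categories")
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    # Normalise so the result lies between 0 (independent) and 1 (fully determined).
    return math.sqrt(chi2 / (n * k))
```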

Data Preprocessing

Missing Values Filled

272,794

Outliers Capped

38,186

Features Engineered

3

Imputation Strategies

3

Missing Values: Before vs After

Imputation Log

Column | Strategy | Fill Value | Count
Age | median | 41 | 4,685
Annual Income | median | 32,191 | 13,955
Number of Dependents | median | 2 | 27,886
Health Score | median | 26.451 | 10,597
Previous Claims | median | 1 | 81,288
Credit Score | median | 575 | 27,886
Premium Amount | median | 688 | 1,841
Marital Status | mode | Single | 5,019
Occupation | constant (Unknown) | Unknown | 81,288
Customer Feedback | constant (Unknown) | Unknown | 18,349
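
The three strategies in the log above (median, mode, constant) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the report's actual code: missing entries are represented as `None`, and the helper name `impute` is hypothetical.

```python
from statistics import median, mode


def impute(values, strategy, constant="Unknown"):
    """Fill missing entries (None) and return (filled, fill_value, n_filled)."""
    present = [v for v in values if v is not None]
    if strategy == "median":
        fill = median(present)   # numeric columns such as Age or Credit Score
    elif strategy == "mode":
        fill = mode(present)     # most frequent category, e.g. Marital Status
    elif strategy == "constant":
        fill = constant          # placeholder label, e.g. Occupation
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    filled = [fill if v is None else v for v in values]
    return filled, fill, len(values) - len(present)
```

Returning the fill value and the number of filled entries alongside the data makes it straightforward to build a log with the same columns as the table above.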

Engineered Features

Age Group: 5 unique values

Binned into 18-25/26-35/36-45/46-55/56-65/65+

Income Bracket: 4 unique values

Quartile-based: Low/Medium/High/Very High

Risk Category: 3 unique values

Composite of Smoking + Exercise + Health Score
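
The two binned features can be sketched as simple lookups. The labels come from the descriptions above and the income cut points are the Q1, median, and Q3 values reported for Annual Income. Everything else is an assumption: the function names, treating "65+" as ages above 65, and placing a value that sits exactly on a cut point in the lower bracket. The Risk Category rule is not spelled out in the report, so it is not reproduced here.

```python
import bisect

# Inclusive upper edge of each age bin, read off the bin labels.
AGE_EDGES = [25, 35, 45, 55, 65]
AGE_LABELS = ["18-25", "26-35", "36-45", "46-55", "56-65", "65+"]

# Q1, median, Q3 of Annual Income as reported in the Distributions section.
INCOME_EDGES = [14412.00, 32191.00, 60028.75]
INCOME_LABELS = ["Low", "Medium", "High", "Very High"]


def age_group(age):
    """Map an age to its bin label; ages below 18 are rejected."""
    if age < 18:
        raise ValueError("age below the first bin")
    return AGE_LABELS[bisect.bisect_left(AGE_EDGES, age)]


def income_bracket(income):
    """Map an income to its quartile bracket (boundary values go to the lower bracket)."""
    return INCOME_LABELS[bisect.bisect_left(INCOME_EDGES, income)]
```

With a maximum reported Age of 64, the "65+" bin would stay empty, which would be consistent with Age Group showing 5 unique values even though six bins are listed.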

Outlier Treatment (IQR Capping)

Annual Income: 7,553 capped [-54,013.125 ~ 128,453.875]
Health Score: 2,457 capped [-16.16 ~ 71.076]
Previous Claims: 15,871 capped [-1.5 ~ 2.5]
Premium Amount: 12,305 capped [-1,320 ~ 2,968]
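
The bounds listed above agree with Q1 - 1.5 x IQR and Q3 + 1.5 x IQR computed from the quartiles in the Distributions section (exactly for Annual Income, Previous Claims, and Premium Amount, and up to rounding for Health Score). The 1.5 multiplier is inferred from those numbers rather than stated in the report. A minimal sketch of the capping step, with illustrative function names:

```python
def iqr_bounds(q1, q3, k=1.5):
    """Lower and upper fences for IQR-based outlier capping."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap(values, lower, upper):
    """Clamp values into [lower, upper]; return (capped_values, n_capped)."""
    capped = [min(max(v, lower), upper) for v in values]
    n_capped = sum(1 for v, c in zip(values, capped) if v != c)
    return capped, n_capped


# Annual Income quartiles from the report: Q1 = 14412.00, Q3 = 60028.75
income_lower, income_upper = iqr_bounds(14412.00, 60028.75)
```

Capping explains why the reported maxima for Annual Income (128453.88), Health Score (71.08), Previous Claims (2.50), and Premium Amount (2968.00) coincide with the upper fences.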

Premium Prediction Model

Model

Ridge Regression

R-squared

-0.0003

Higher is better

RMSE

814.64

Lower is better

MAE

658.59

Lower is better

Test Size

55,772

Model Comparison (5-Fold Cross-Validation)

Model | R² | RMSE | MAE | CV Mean (R²) | CV Std
Random Forest | -0.0042 | 816.22 | 660.14 | -0.0080 | 0.0016
Gradient Boosting | -0.0132 | 819.86 | 662.17 | -0.0223 | 0.0044
XGBoost | -0.0633 | 839.9 | 675.46 | -0.0838 | 0.0076
Ridge Regression (Best) | -0.0003 | 814.64 | 658.59 | -0.0016 | 0.0010
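
All four models score a slightly negative R², which means none of them predicts the premium better than a constant equal to the mean. The sketch below uses the standard definition, R² = 1 - SS_res / SS_tot, to show why a mean-only predictor scores exactly 0 and anything worse goes negative. The function name is illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot
```

The best model's RMSE (814.64) is almost identical to the standard deviation of Premium Amount in the summary statistics (814.70), which is what an R² of about zero looks like in the target's own units.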

Feature Importance (Top 15)

Actual vs Predicted

Residual Distribution

Customer Segmentation

Method

K-Means

Clusters

4

Features Used

6

PCA Var Explained

33.9%

Elbow Method

Cluster Distribution

PCA 2D Visualization

Cluster Profiles

Cluster | Size | Age | Annual Income | Health Score | Credit Score | Premium Amount | Insurance Duration
Young Budget | 9,665 (32.2%) | 29.03 | 27,001.83 | 27.69 | 567.67 | 591.54 | 4.93
Mid-Age Standard | 5,425 (18.1%) | 40.44 | 98,065.92 | 28.4 | 576.98 | 767.53 | 5.2
Senior Premium | 5,310 (17.7%) | 41 | 33,647.98 | 28.58 | 579.38 | 2,335.49 | 5.04
High-Value | 9,600 (32%) | 53.4 | 27,909.26 | 29.22 | 578.42 | 600.51 | 4.97

Hypothesis Testing

Smokers vs Non-Smokers Premium

Not Significant

Smokers

935.4

n=139,635

Non-Smokers

931.58

n=139,225

Test: Welch's T-Test | t = 1.2405 | p = 0.214795

Smokers have 0.4% higher premiums than Non-Smokers (p=0.2148, not significant)
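
The reported statistic can be roughly reproduced from the summary numbers. The report gives group means and sizes but not group standard deviations, so the sketch below assumes both groups share the overall Premium Amount standard deviation (814.70). That is an assumption made only for illustration, as are the function names. Because the degrees of freedom are very large, the p-value is approximated with the normal distribution instead of the t distribution.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof


def two_sided_p_normal(t_stat):
    """Two-sided p-value via the normal approximation (reasonable for very large df)."""
    return math.erfc(abs(t_stat) / math.sqrt(2))


# Means and group sizes are from the report; both standard deviations are ASSUMED
# equal to the overall Premium Amount Std (814.70), which the report does not confirm.
t, df = welch_t(935.4, 814.70, 139_635, 931.58, 814.70, 139_225)

# p-value for the t statistic as reported (1.2405).
p = two_sided_p_normal(1.2405)
```

Under that assumption `t` comes out close to the reported 1.2405, and the normal approximation for the reported statistic gives a p-value of about 0.2148, in line with the report.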

High vs Low Health Score Premium

Not Significant

Health Score >= 26.5

935.23

n=144,729

Health Score < 26.5

931.62

n=134,131

Test: Welch's T-Test | t = 1.1665 | p = 0.243426

No significant difference in premiums between high and low health score groups (p=0.2434)

Comprehensive vs Premium Policy Premium

Not Significant

Comprehensive

934.35

n=92,479

Premium

931.58

n=93,298

Test: Welch's T-Test | t = 0.7351 | p = 0.462256

Comprehensive policies have 0.3% higher premiums than Premium policies (p=0.4623, not significant)

Customer Default Rate Prediction

Overall Default Rate

73.9%

Best Model

Random Forest

Train Size

16,000

Test Size

4,000

Model Comparison (5-Fold CV)

Model | Accuracy | Precision | Recall | F1 | CV Mean | CV Std
Random Forest (Best) | 74.2% | 74.2% | 100.0% | 85.2% | 74.2% | 0.0001
Gradient Boosting | 73.9% | 74.2% | 99.4% | 84.9% | 73.9% | 0.0020
XGBoost | 73.0% | 74.5% | 96.9% | 84.2% | 72.5% | 0.0025
Logistic Regression | 74.2% | 74.2% | 100.0% | 85.2% | 74.2% | 0.0001
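
Random Forest and Logistic Regression report identical metrics, with 100% recall and precision equal to accuracy. That pattern is what a classifier produces when it predicts "default" for every customer, so it is worth checking against a majority-class baseline. The sketch below computes the metrics for such a baseline. The class counts are an assumption: 74.2% of the 4,000 test rows (2,968 defaults) is inferred from the reported accuracy, and the report does not list them.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# ASSUMED test-set composition: 2,968 defaults and 1,032 non-defaults.
positives, negatives = 2968, 1032

# A baseline that labels every customer as "default":
# all positives become true positives, all negatives become false positives.
acc, prec, rec, f1 = binary_metrics(tp=positives, fp=negatives, fn=0, tn=0)
```

If the baseline matches the table, the two top-ranked models add no information beyond the class prior, and a metric that is less sensitive to class imbalance (for example ROC AUC or balanced accuracy) would give a clearer comparison.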

Default Rate by Age Group

Default Risk Factors