Acko Health Insurance: Data-Driven Insurance Premium Pricing Strategy¶

Acko is an online insurance company in India that offers a variety of insurance products, including car, bike, health, and travel insurance.

Acko, a digital insurance provider, is launching a new health insurance product and requires a data-driven approach to determine the optimal insurance premium pricing for different customer segments.

Objective¶

  • To determine an optimal premium pricing strategy that balances business profitability with customer affordability, leveraging available demographic, financial, and behavioral data.
  • The primary objective is to develop a rule-based logical formula to calculate the final premium price for each customer based on their demographic, risk and affordability profile.

Business Impact¶

  • Increased Profitability:Ensures premiums reflect customer risk levels, reducing losses from underpricing and maximizing revenue.
  • Better Customer Acquisition & Retention: Affordable, customized premiums attract new customers and improve loyalty by offering fair pricing.
  • Faster Market Launch:A simple rule-based model allows quick product rollout and easy updates for new customer segments.
  • Enhanced Trust & Compliance:Transparent pricing builds customer trust and ensures alignment with regulatory standards.

Dataset Information¶

  • Dataset: Acko Dataset
  • Count of Rows : 1200000
  • Count of Columns: 20

Explanation of Data Columns¶

  • id (Customer ID): A unique identifier assigned to each customer in the dataset.

  • Age (Customer Age): The age of the customer in years at the time of policy purchase.

  • Gender (Customer Gender): The gender identity of the customer, which can be "Man" or "Woman."

  • Annual Income (Yearly Earnings in INR): The total income earned by the customer in a year, measured in Indian Rupees (INR).

  • Marital Status (Customer’s Marital Condition): The marital status of the customer, such as "Spouse Present," "Not Married," or "Formerly Married."

  • Number of Dependents (People Financially Dependent on Customer): The number of dependents (such as children, parents, or others) that rely on the customer financially.

  • Education Level (Highest Education Attained): The highest level of education completed by the customer, such as "Undergraduate" or "Post Graduate."

  • Occupation (Customer's Job Type): The profession or employment category of the customer. In some cases, this data may be missing.

  • Health Score (Overall Health Indicator): A numerical score representing the customer’s health condition based on lifestyle factors and medical history.Less score mean fit, more means unfit.

  • Location (Customer’s Residence Tier): he classification of the customer's residence area into tiers such as Tier-1, Tier-2, or Tier-3 cities.

  • Policy Type (Type of Insurance Policy Chosen): The category of insurance policy purchased by the customer, such as "Basic," "Premium," or "Comprehensive."

  • Previous Claims (Number of Past Insurance Claims): The total number of insurance claims the customer has made before purchasing the current policy.

  • Credit Score (Financial Responsibility Indicator): A numerical representation of the customer’s creditworthiness, indicating their ability to manage finances and make timely payments.

  • Insurance Duration (Policy Tenure in Years):: The number of years the customer has held the insurance policy.

  • Policy Start Date (Date When Policy Became Active): The date when the insurance policy was purchased and became active.

  • Customer Feedback (Customer Satisfaction Rating): The rating or feedback provided by the customer about their experience with the insurance policy.

  • Smoking Status (Whether the Customer Smokes): Indicates whether the customer is a smoker or not, as smoking impacts health risks and insurance premiums.

  • Exercise Frequency (How Often the Customer Exercises):The frequency of physical exercise performed by the customer, which can impact their health score.

  • Property Type (Type of Residence): The type of home the customer resides in, such as a detached home or an apartment.

  • Premium Amount (Final Insurance Premium in INR): The amount the customer pays for their health insurance policy, measured in Indian Rupees (INR).

Import Data¶

In [ ]:
# Step 1: Install gdown
!pip install gdown

# Step 2: Import necessary libraries
import gdown
import pandas as pd

# Step 3: Set the file ID and create a download URL
file_id = "1i4ia9ZNfAXgu6JGTXCUgFn7Pb8wltzLH"
download_url = f"https://drive.google.com/uc?id={file_id}"

# Step 4: Set the output file name
output_file = "acko_dataset.csv"

# Step 5: Download the file
gdown.download(download_url, output_file, quiet=False)

# Step 6: Load the CSV file into a Pandas DataFrame
data = pd.read_csv(output_file)
Requirement already satisfied: gdown in /usr/local/lib/python3.11/dist-packages (5.2.0)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.11/dist-packages (from gdown) (4.13.3)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from gdown) (3.17.0)
Requirement already satisfied: requests[socks] in /usr/local/lib/python3.11/dist-packages (from gdown) (2.32.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from gdown) (4.67.1)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown) (2.6)
Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (2025.1.31)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (1.7.1)
Downloading...
From (original): https://drive.google.com/uc?id=1i4ia9ZNfAXgu6JGTXCUgFn7Pb8wltzLH
From (redirected): https://drive.google.com/uc?id=1i4ia9ZNfAXgu6JGTXCUgFn7Pb8wltzLH&confirm=t&uuid=58922796-c0e5-406b-bc6a-07492803306f
To: /content/acko_dataset.csv
100%|██████████| 219M/219M [00:04<00:00, 50.2MB/s]
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
In [ ]:
df = pd.read_csv('acko_dataset.csv')
In [ ]:
df.head(10)
Out[ ]:
id Age Gender Annual Income Marital Status Number of Dependents Education Level Occupation Health Score Location Policy Type Previous Claims Credit Score Insurance Duration Policy Start Date Customer Feedback Smoking Status Exercise Frequency Property Type Premium Amount
0 0 19.0 Woman 8.642140e+05 Spouse Present 1.0 Undergraduate Business 26.598761 Tier-1 Premium 2.0 372.0 5.0 2023-12-23 15:21:39.134960 Poor No Weekly Detached Home 1945.913327
1 1 39.0 Woman 8.927012e+05 Spouse Present 3.0 Post Graduate Missing 21.569731 Tier-2 Comprehensive 1.0 694.0 2.0 2023-06-12 15:21:39.111551 Average Yes Monthly Detached Home 10908.896072
2 2 23.0 Man 2.201772e+06 Formerly Married 3.0 Undergraduate Business 50.177549 Tier-3 Premium 1.0 NaN 3.0 2023-09-30 15:21:39.221386 Good Yes Weekly Detached Home 21563.135198
3 3 21.0 Man 3.997542e+06 Spouse Present 2.0 Undergraduate Missing 16.938144 Tier-2 Basic 1.0 367.0 1.0 2024-06-12 15:21:39.226954 Poor Yes Daily Flat 2653.539143
4 4 21.0 Man 3.409986e+06 Not Married 1.0 Undergraduate Business 24.376094 Tier-2 Premium 0.0 598.0 4.0 2021-12-01 15:21:39.252145 Poor Yes Weekly Detached Home 1269.243463
5 5 29.0 Man 1.295259e+06 Spouse Present 1.0 Post Graduate Missing 36.053198 Tier-1 Premium 2.0 614.0 5.0 2022-05-20 15:21:39.207847 Average No Weekly Detached Home 1252.689000
6 6 41.0 Man 1.136688e+06 Spouse Present 0.0 Undergraduate Missing NaN Tier-2 Basic 2.0 807.0 6.0 2020-02-21 15:21:39.219432 Poor No Weekly Detached Home NaN
7 7 48.0 Woman 8.942382e+06 Spouse Present 2.0 Secondary Education Full-Time Worker 13.769783 Tier-3 Comprehensive 1.0 398.0 5.0 2022-08-08 15:21:39.181605 Average No Rarely Apartment 21833.802349
8 8 21.0 Man 1.490380e+05 Formerly Married 3.0 Undergraduate Missing 24.869551 Tier-1 Premium 1.0 685.0 8.0 2020-12-14 15:21:39.198406 Average No Monthly Apartment 10997.945102
9 9 44.0 Man 2.510442e+06 Spouse Present 2.0 Undergraduate Full-Time Worker 22.473718 Tier-1 Comprehensive 1.0 635.0 3.0 2020-08-02 15:21:39.144722 Poor No Daily Apartment 10580.678238

Data Cleaning¶

In [ ]:
# Check the shape of dataset.
rows, columns = df.shape
print(f"No.of rows :{rows}\nNo.of columns: {columns}")
No.of rows :1200000
No.of columns: 20

Creating New Columns¶

In [ ]:
def age_to_group(age):
    if age <= 30:
        return '18-30'
    elif age <= 50:
        return '31-50'
    else:
        return '51+'


# Apply function to create Age Group
df['Age Group'] = df['Age'].apply(age_to_group)

print(df['Age Group'].value_counts())
Age Group
31-50    506170
51+      374386
18-30    319444
Name: count, dtype: int64
In [ ]:
# Define Health Score categories based on distribution
def categorize_health_score(score):
    if score <= 20:
        return "Low"
    elif 21 <= score <= 40:
        return "Medium"
    else:
        return "High"
df["Health Score Category"] = df["Health Score"].apply(categorize_health_score)
print(df['Health Score Category'].value_counts())
Health Score Category
Medium    568593
High      411738
Low       219669
Name: count, dtype: int64
In [ ]:
# Dataset information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 22 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   id                     1200000 non-null  int64  
 1   Age                    1181295 non-null  float64
 2   Gender                 1200000 non-null  object 
 3   Annual Income          1155051 non-null  float64
 4   Marital Status         1200000 non-null  object 
 5   Number of Dependents   1090328 non-null  float64
 6   Education Level        1200000 non-null  object 
 7   Occupation             1200000 non-null  object 
 8   Health Score           1125924 non-null  float64
 9   Location               1200000 non-null  object 
 10  Policy Type            1200000 non-null  object 
 11  Previous Claims        835971 non-null   float64
 12  Credit Score           1062118 non-null  float64
 13  Insurance Duration     1199999 non-null  float64
 14  Policy Start Date      1200000 non-null  object 
 15  Customer Feedback      1122176 non-null  object 
 16  Smoking Status         1200000 non-null  object 
 17  Exercise Frequency     1200000 non-null  object 
 18  Property Type          1200000 non-null  object 
 19  Premium Amount         784968 non-null   float64
 20  Age Group              1200000 non-null  object 
 21  Health Score Category  1200000 non-null  object 
dtypes: float64(8), int64(1), object(13)
memory usage: 201.4+ MB
In [ ]:
# Checking for Duplicate Values.
duplicate = df.duplicated().sum()
duplicate
Out[ ]:
0
In [ ]:
#Checking Null Values
df.isnull().sum()
Out[ ]:
0
id 0
Age 18705
Gender 0
Annual Income 44949
Marital Status 0
Number of Dependents 109672
Education Level 0
Occupation 0
Health Score 74076
Location 0
Policy Type 0
Previous Claims 364029
Credit Score 137882
Insurance Duration 1
Policy Start Date 0
Customer Feedback 77824
Smoking Status 0
Exercise Frequency 0
Property Type 0
Premium Amount 415032
Age Group 0
Health Score Category 0

Age, Annual Income, Number of Dependents, Health Score, Previous Claims, Credit Score, Insurance Duration, Customer Feedback, Premium Amount contain null values.

Data Types¶

In [ ]:
  df.dtypes
Out[ ]:
0
id int64
Age float64
Gender object
Annual Income float64
Marital Status object
Number of Dependents float64
Education Level object
Occupation object
Health Score float64
Location object
Policy Type object
Previous Claims float64
Credit Score float64
Insurance Duration float64
Policy Start Date object
Customer Feedback object
Smoking Status object
Exercise Frequency object
Property Type object
Premium Amount float64
Age Group object
Health Score Category object

In [ ]:
#Convert the data type of the columns
# Convert 'Policy Start Date' to datetime (handling errors)
df['Policy Start Date'] = pd.to_datetime(df['Policy Start Date'], errors='coerce')
In [ ]:
# Convert Number of Dependents to integer (nullable Int64 to handle NaN)
df['Number of Dependents'] = df['Number of Dependents'].astype('Int64')
In [ ]:
#Convert age datatype into int
if df['Age'].dropna().apply(float.is_integer).all():
    df['Age'] = df['Age'].astype('Int64')
In [ ]:
pd.options.display.float_format = '{:.2f}'.format
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 22 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   id                     1200000 non-null  int64         
 1   Age                    1181295 non-null  Int64         
 2   Gender                 1200000 non-null  object        
 3   Annual Income          1155051 non-null  float64       
 4   Marital Status         1200000 non-null  object        
 5   Number of Dependents   1090328 non-null  Int64         
 6   Education Level        1200000 non-null  object        
 7   Occupation             1200000 non-null  object        
 8   Health Score           1125924 non-null  float64       
 9   Location               1200000 non-null  object        
 10  Policy Type            1200000 non-null  object        
 11  Previous Claims        835971 non-null   float64       
 12  Credit Score           1062118 non-null  float64       
 13  Insurance Duration     1199999 non-null  float64       
 14  Policy Start Date      1200000 non-null  datetime64[ns]
 15  Customer Feedback      1122176 non-null  object        
 16  Smoking Status         1200000 non-null  object        
 17  Exercise Frequency     1200000 non-null  object        
 18  Property Type          1200000 non-null  object        
 19  Premium Amount         784968 non-null   float64       
 20  Age Group              1200000 non-null  object        
 21  Health Score Category  1200000 non-null  object        
dtypes: Int64(2), datetime64[ns](1), float64(6), int64(1), object(12)
memory usage: 203.7+ MB
In [ ]:
df.head(10)
Out[ ]:
id Age Gender Annual Income Marital Status Number of Dependents Education Level Occupation Health Score Location ... Credit Score Insurance Duration Policy Start Date Customer Feedback Smoking Status Exercise Frequency Property Type Premium Amount Age Group Health Score Category
0 0 19 Woman 864214.00 Spouse Present 1 Undergraduate Business 26.60 Tier-1 ... 372.00 5.00 2023-12-23 15:21:39.134960 Poor No Weekly Detached Home 1945.91 18-30 Medium
1 1 39 Woman 892701.25 Spouse Present 3 Post Graduate Missing 21.57 Tier-2 ... 694.00 2.00 2023-06-12 15:21:39.111551 Average Yes Monthly Detached Home 10908.90 31-50 Medium
2 2 23 Man 2201772.00 Formerly Married 3 Undergraduate Business 50.18 Tier-3 ... NaN 3.00 2023-09-30 15:21:39.221386 Good Yes Weekly Detached Home 21563.14 18-30 High
3 3 21 Man 3997541.99 Spouse Present 2 Undergraduate Missing 16.94 Tier-2 ... 367.00 1.00 2024-06-12 15:21:39.226954 Poor Yes Daily Flat 2653.54 18-30 Low
4 4 21 Man 3409986.00 Not Married 1 Undergraduate Business 24.38 Tier-2 ... 598.00 4.00 2021-12-01 15:21:39.252145 Poor Yes Weekly Detached Home 1269.24 18-30 Medium
5 5 29 Man 1295259.40 Spouse Present 1 Post Graduate Missing 36.05 Tier-1 ... 614.00 5.00 2022-05-20 15:21:39.207847 Average No Weekly Detached Home 1252.69 18-30 Medium
6 6 41 Man 1136687.84 Spouse Present 0 Undergraduate Missing NaN Tier-2 ... 807.00 6.00 2020-02-21 15:21:39.219432 Poor No Weekly Detached Home NaN 31-50 High
7 7 48 Woman 8942382.00 Spouse Present 2 Secondary Education Full-Time Worker 13.77 Tier-3 ... 398.00 5.00 2022-08-08 15:21:39.181605 Average No Rarely Apartment 21833.80 31-50 Low
8 8 21 Man 149038.00 Formerly Married 3 Undergraduate Missing 24.87 Tier-1 ... 685.00 8.00 2020-12-14 15:21:39.198406 Average No Monthly Apartment 10997.95 18-30 Medium
9 9 44 Man 2510442.00 Spouse Present 2 Undergraduate Full-Time Worker 22.47 Tier-1 ... 635.00 3.00 2020-08-02 15:21:39.144722 Poor No Daily Apartment 10580.68 31-50 Medium

10 rows × 22 columns

Filling Missing Data¶

In [ ]:
df.isnull().sum()
Out[ ]:
0
id 0
Age 18705
Gender 0
Annual Income 44949
Marital Status 0
Number of Dependents 109672
Education Level 0
Occupation 0
Health Score 74076
Location 0
Policy Type 0
Previous Claims 364029
Credit Score 137882
Insurance Duration 1
Policy Start Date 0
Customer Feedback 77824
Smoking Status 0
Exercise Frequency 0
Property Type 0
Premium Amount 415032
Age Group 0
Health Score Category 0

In [ ]:
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Data Heatmap")
plt.show()
No description has been provided for this image
In [ ]:
missing_counts = df.isnull().sum().sort_values(ascending=False)
missing_counts = missing_counts[missing_counts > 0]  # Only show missing columns

missing_counts.plot(kind='barh', figsize=(10, 6), color='skyblue')
plt.xlabel("Number of Missing Values")
plt.title("Missing Values Per Column")
plt.show()
No description has been provided for this image
In [ ]:
#Drop the rows where 5 or more values are missing

Drop the rows where 5 or more values are missing

In [ ]:
# We are removing rows where more than 4 key columns have missing values.

# Key columns to check for missing values
key_columns = ['Age', 'Annual Income', 'Health Score', 'Previous Claims',
               'Number of Dependents', 'Credit Score', 'Customer Feedback','Insurance Duration','Premium Amount']

# Count missing values in each row for the key columns
df['missing_count'] = df[key_columns].isnull().sum(axis=1)

# Filter rows where most key columns are missing (e.g., more than 4 out of 7)
missing_rows = df[df['missing_count'] > 4]

# Drop rows where more than 4 key columns are missing
df = df[df['missing_count'] <= 4].drop(columns='missing_count')

# Check the new shape of the dataset
print(df.shape)
(1198440, 22)

Fill the missing values

In [ ]:
#1Age
print(f"Missing values after filling: {df['Age'].isnull().sum()}")

# 1. Calculate overall median age (fallback value)
overall_median_age = df['Age'].median()

# 2. Fill missing ages using group medians (Occupation + Policy Type)
df['Age'] = df.groupby(['Occupation', 'Policy Type'])['Age'].transform(
    lambda x: x.fillna(x.median() if len(x) >= 50 else overall_median_age)
)

# 3. Fill any remaining nulls with overall media
df['Age'] = df['Age'].fillna(overall_median_age)

# Verify result
print(f"Missing values after filling: {df['Age'].isnull().sum()}")
Missing values after filling: 18391
Missing values after filling: 0
In [ ]:
#Annual Income

print(f"Missing values before filling: {df['Annual Income'].isnull().sum()}")

# 1. Calculate overall median income (fallback value)
overall_median_income = df['Annual Income'].median()

# 2. Fill missing income using group medians (Occupation + Education Level)
df['Annual Income'] = df.groupby(['Occupation', 'Education Level'])['Annual Income'].transform(
    lambda x: x.fillna(x.median() if len(x) >= 50 else overall_median_income)
)

# 3. Fill any remaining nulls with overall median
df['Annual Income'] = df['Annual Income'].fillna(overall_median_income)

# Verify result
print(f"Missing values after filling: {df['Annual Income'].isnull().sum()}")
Missing values before filling: 44421
Missing values after filling: 0
In [ ]:
# Health Score
print(f"Missing values before filling: {df['Health Score'].isnull().sum()}")

# 1. Define function to get mode (with tie-breaker: worst-case score)
def get_mode(series):
    modes = series.mode()
    return modes.min() if not modes.empty else np.nan

# 2. Fill missing values with group mode (Smoking Status + Exercise Frequency)
group_modes = df.groupby(['Smoking Status', 'Exercise Frequency'])['Health Score'].transform(get_mode)
df['Health Score'] = df['Health Score'].fillna(group_modes)

# 3. Fill remaining nulls with overall mode
overall_mode = df['Health Score'].mode()[0]
df['Health Score'] = df['Health Score'].fillna(overall_mode)

print(f"Missing values after filling: {df['Health Score'].isnull().sum()}")
Missing values before filling: 73231
Missing values after filling: 0
In [ ]:
#Previous Claims
print(f"Missing values before filling: {df['Previous Claims'].isnull().sum()}")

# Fill missing Previous Claims
# 1. If Insurance Duration < 1 year or missing, fill with 0 (assuming new customer)
df['Previous Claims'] = df.apply(
    lambda row: 0 if pd.isna(row['Insurance Duration']) or row['Insurance Duration'] < 1 else row['Previous Claims'],
    axis=1
)

# 2. For remaining missing values (existing customers), fill with median based on Policy Type
df['Previous Claims'] = df.groupby('Policy Type')['Previous Claims'].transform(
    lambda x: x.fillna(x.median())
)
print(f"Missing values after filling: {df['Previous Claims'].isnull().sum()}")
Missing values before filling: 362589
Missing values after filling: 0
In [ ]:
#Number of Dependents

print(df['Number of Dependents'].isnull().sum())

# Step 1: Median dependents for 'Spouse Present'
median_dependents_spouse_present = df.loc[df['Marital Status'] == 'Spouse Present', 'Number of Dependents'].median()

# Step 2: Fill missing values based on Marital Status
df.loc[(df['Marital Status'] == 'Spouse Present') & (df['Number of Dependents'].isnull()), 'Number of Dependents'] = median_dependents_spouse_present
df.loc[(df['Marital Status'] == 'Not Married') & (df['Number of Dependents'].isnull()), 'Number of Dependents'] = 0

# Optional: Fill remaining nulls (Unknown/Formerly Married) with overall median
overall_median_dependents = df['Number of Dependents'].median()
df['Number of Dependents'].fillna(overall_median_dependents, inplace=True)

# Check if missing values are filled
print(df['Number of Dependents'].isnull().sum())
108611
0
<ipython-input-27-1fcb0017b83d>:14: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Number of Dependents'].fillna(overall_median_dependents, inplace=True)
In [ ]:
# Credit Score
print(f"Missing values before filling: {df['Credit Score'].isnull().sum()}")

# 1. Calculate overall median (fallback)
overall_median = df['Credit Score'].median()

# 2. Group by Occupation, fill nulls with group median (if group has ≥50 rows) or overall median
group_medians = df.groupby('Occupation')['Credit Score'].transform(
    lambda x: x.median() if len(x) >= 50 else overall_median
)
df['Credit Score'] = df['Credit Score'].fillna(group_medians)

# 3. Fill any remaining nulls with overall median
df['Credit Score'] = df['Credit Score'].fillna(overall_median)

print(f"Missing values after filling: {df['Credit Score'].isnull().sum()}")
Missing values before filling: 136782
Missing values after filling: 0
In [ ]:
# Customer Feedback
print(f"Missing values before filling: {df['Customer Feedback'].isnull().sum()}")

# 1. Fill missing feedback with group mode (Policy Type + Location)
df['Customer Feedback'] = df.groupby(['Policy Type', 'Location'])['Customer Feedback'].transform(
    lambda x: x.fillna(x.mode()[0])  # Use first mode in case of ties
)

# 2. Fill remaining nulls with overall most common feedback
most_common_feedback = df['Customer Feedback'].mode()[0]
df['Customer Feedback'] = df['Customer Feedback'].fillna(most_common_feedback)

print(f"Missing values after filling: {df['Customer Feedback'].isnull().sum()}")
Missing values before filling: 76792
Missing values after filling: 0
In [ ]:
# Level 1: Policy Type + Location + Age Group
df['Premium Amount'] = df.groupby(
    ['Policy Type', 'Location', 'Age Group']
)['Premium Amount'].transform(lambda x: x.fillna(x.median()))

# Level 2: Policy Type + Location (fallback)
df['Premium Amount'] = df.groupby(
    ['Policy Type', 'Location']
)['Premium Amount'].transform(lambda x: x.fillna(x.median()))

# Level 3: Policy Type (fallback)
df['Premium Amount'] = df.groupby('Policy Type')['Premium Amount'].transform(
    lambda x: x.fillna(x.median())
)

# Final Fallback: Overall median
overall_median = df['Premium Amount'].median()
df['Premium Amount'] = df['Premium Amount'].fillna(overall_median)

# --------------------------------------------------------------------------
# Step 3: Verify results
# --------------------------------------------------------------------------
print(f"Missing values remaining: {df['Premium Amount'].isnull().sum()}")
Missing values remaining: 0

Outilers¶

In [ ]:
# List of numerical columns to check for outliers
columns_to_check = [
    "Age", "Annual Income", "Number of Dependents", "Health Score",
    "Previous Claims", "Credit Score", "Insurance Duration", "Premium Amount"
]

# Detect outliers using IQR method
for column in columns_to_check:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers
    outliers = (df[column] < lower_bound) | (df[column] > upper_bound)

    # Print outlier values
    print(f"Outliers in {column}:")
    print(df[outliers][column].values)
    print("-" * 50)
Outliers in Age:
<IntegerArray>
[]
Length: 0, dtype: Int64
--------------------------------------------------
Outliers in Annual Income:
[ 8942382. 10642586.  5102224. ...  4267852.  7627614.  4225024.]
--------------------------------------------------
Outliers in Number of Dependents:
<IntegerArray>
[]
Length: 0, dtype: Int64
--------------------------------------------------
Outliers in Health Score:
[]
--------------------------------------------------
Outliers in Previous Claims:
[3. 3. 3. ... 4. 3. 3.]
--------------------------------------------------
Outliers in Credit Score:
[]
--------------------------------------------------
Outliers in Insurance Duration:
[]
--------------------------------------------------
Outliers in Premium Amount:
[60067.38023978 47219.5120086  80505.37405629 ... 52714.2924196
 51794.31105174 67101.50990761]
--------------------------------------------------
In [ ]:
# List of columns to check
columns_to_check = ["Annual Income", "Premium Amount", "Previous Claims"] # previous clamins

# Dictionary to store outlier counts
outlier_counts = {}

# Detect outliers using IQR method
for column in columns_to_check:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Count outliers
    outliers = ((df[column] < lower_bound) | (df[column] > upper_bound)).sum()
    outlier_counts[column] = outliers

# Print the number of outliers in each column
print("Outlier Counts:")
print(outlier_counts)
Outlier Counts:
{'Annual Income': 117135, 'Premium Amount': 127320, 'Previous Claims': 62056}
In [ ]:
# Calculate the 5th and 95th percentiles for Annual Income
lower_bound_income = np.percentile(df['Annual Income'], 5)
upper_bound_income = np.percentile(df['Annual Income'], 95)

print(f"Annual Income Capping Range: ({lower_bound_income}, {upper_bound_income})")

# Apply 5%-95% capping to Annual Income
df['Annual Income'] = np.clip(df['Annual Income'], lower_bound_income, upper_bound_income)

# Calculate the 4th and 96th percentiles for Premium Amount
lower_bound_premium = np.percentile(df['Premium Amount'], 4)
upper_bound_premium = np.percentile(df['Premium Amount'], 96)

print(f"Premium Amount Capping Range: ({lower_bound_premium}, {upper_bound_premium})")

# Apply 4%-96% capping to Premium Amount
df['Premium Amount'] = np.clip(df['Premium Amount'], lower_bound_premium, upper_bound_premium)

print(f"After Capping - Min: {df['Premium Amount'].min()}, Max: {df['Premium Amount'].max()}")
Annual Income Capping Range: (88150.0, 6306990.600000008)
Premium Amount Capping Range: (2690.7025942529003, 81897.70396749469)
After Capping - Min: 2690.7025942529003, Max: 81897.70396749469
In [ ]:
# Cap Previous Claims at a maximum of 6
df['Previous Claims'] = np.where(df['Previous Claims'] > 6, 6, df['Previous Claims'])

# Check the updated value counts
print(df['Previous Claims'].value_counts().sort_index())
Previous Claims
0.00    305385
1.00    663368
2.00    167631
3.00     49003
4.00     10668
5.00      2016
6.00       369
Name: count, dtype: int64
  • Annual Income Capping (5th-95th Percentile): We calculate the 5th percentile (lower_bound_income) and 95th percentile (upper_bound_income) of the Annual Income column. Any value below the 5th percentile is capped to the 5th percentile, and above the 95th percentile is capped to the 95th percentile using np.clip(). Premium Amount Capping (4th-96th Percentile):

  • Premium AmountC apping (4th-96th Percentile): Values below the 4th percentile or above the 96th percentile are capped within this range.

  • Cap Previous Claims at 6:Any value in the Previous Claims column greater than 6 is replaced by 6 using np.where().

Key Metrics & Calculations¶

In [ ]:
# 1 Distribution of Policy Types
print("\n\033[1m Policy Type Distribution: \033[0m")
print(df['Policy Type'].value_counts())
 Policy Type Distribution: 
Policy Type
Premium          401345
Comprehensive    399087
Basic            398008
Name: count, dtype: int64
In [ ]:
# 2 Age Group Distribution
print("\n\033[1m Age Group Distribution: \033[0m")
print(df['Age Group'].value_counts())
 Age Group Distribution: 
Age Group
31-50    505676
51+      373681
18-30    319083
Name: count, dtype: int64
In [ ]:
# 3 Location Distribution
print("\n\033[1m Location Distribution: \033[0m")
print(df['Location'].value_counts())
 Location Distribution: 
Location
Tier-3    401023
Tier-2    400416
Tier-1    397001
Name: count, dtype: int64
In [ ]:
# 4 Smoking Status Distribution
print("\n\033[1m Smoking Status Distribution: \033[0m")
print(df['Smoking Status'].value_counts())
 Smoking Status Distribution: 
Smoking Status
Yes    601057
No     597383
Name: count, dtype: int64
In [ ]:
# 5 Previous Claims Count Distribution
print("\n\033[1m Previous Claims Count: \033[0m")
print(df['Previous Claims'].value_counts())
 Previous Claims Count: 
Previous Claims
1.00    663368
0.00    305385
2.00    167631
3.00     49003
4.00     10668
5.00      2016
6.00       369
Name: count, dtype: int64
In [ ]:
# 6 Previous Claims Count Distribution
print("\n\033[1m Previous Claims Count: \033[0m")
print(df['Exercise Frequency'].value_counts())
 Previous Claims Count: 
Exercise Frequency
Weekly     305735
Monthly    299454
Rarely     299050
Daily      294201
Name: count, dtype: int64

Data Analysis¶

1.Older Age Groups Pay Higher Health Insurance Premiums¶

In [ ]:
# Calculate average Premium Amount per Age Group
age_premium_summary = df.groupby('Age Group')['Premium Amount'].agg(['count', 'mean', 'median'])

# Display the results
print(age_premium_summary)
            count     mean   median
Age Group                          
18-30      319083 15375.54 10718.00
31-50      505676 16420.56 11247.49
51+        373681 31080.88 23690.35
In [ ]:
age_premium_summary['mean'].plot(kind='bar', color=['#4E79A7', '#F28E2B', '#76B7B2'], figsize=(10, 7))

# Labels and title
plt.xlabel("Age Group", fontsize=12)
plt.ylabel("Average Premium Amount", fontsize=12)
plt.title("Average Premium Amount by Age Group", fontsize=14)

# Show values on bars
for index, value in enumerate(age_premium_summary['mean']):
    plt.text(index, value + 1000, f"{value:.2f}", ha='center', fontsize=11, fontweight='bold')

# Show plot
plt.show()
No description has been provided for this image

Findings: Premiums increase with age, with the 51+ group paying 2x more than the 18-30 group, based on median values.

The 31-50 group holds the highest policy volume, making it a key segment for pricing adjustments.so our hypothesis is true

2.Health insurance premiums vary across locations due to differences in healthcare access and costs¶

In [ ]:
# Calculate average Premium Amount per Location Tier
location_premium_summary = df.groupby('Location')['Premium Amount'].agg(['count', 'mean', 'median'])

# Display the results
print(location_premium_summary)
           count     mean   median
Location                          
Tier-1    397001 20878.53 12756.78
Tier-2    400416 20801.86 12952.51
Tier-3    401023 20461.92 12915.80
In [ ]:
# Define a refined color palette (modern theme)
colors = ['#4B71A5', '#F28E2B', '#76B7B2']

# Plot the bar chart
location_premium_summary['mean'].plot(kind='bar', color=colors, figsize=(10, 7), edgecolor='black')

# Labels and title with padding
plt.xlabel("Location", fontsize=12, fontweight='bold')
plt.ylabel("Average Premium Amount", fontsize=12, fontweight='bold')
plt.title("Average Premium Amount by Location", fontsize=14, fontweight='bold', pad=30)  # Increased padding

# Show values on bars
for index, value in enumerate(location_premium_summary['mean']):
    plt.text(index, value + 500, f"{value:.2f}", ha='center', fontsize=11, fontweight='bold')

# Show plot
plt.show()
No description has been provided for this image

Findings: The average premium amount is relatively similar across Tier-1 (20,873.54), Tier-2 (20,799.85), and Tier-3 (20,459.86), showing minimal variation in location-based pricing.

Hypothesis Status: Location influences health insurance premiums" – Partially True

3.Policy Type Influences Health Insurance Premium Amounts¶

In [ ]:
# Calculate average Premium Amount per Policy Type Tier
Policy_Type_summary = df.groupby('Policy Type')['Premium Amount'].agg(['count', 'mean', 'median'])

# Display the results
print(Policy_Type_summary)
                count     mean   median
Policy Type                            
Basic          398008 19324.86 10825.14
Comprehensive  399087 20251.81 11397.64
Premium        401345 22549.71 12952.51
In [ ]:
Policy_Type_summary['mean'].plot(kind='bar', color=['#4B6A9B', '#F4A261', '#2A9D8F'], figsize=(10, 7))

# Labels and title
plt.xlabel("Policy Type", fontsize=12)
plt.ylabel("Average Premium Amount", fontsize=12)
plt.title("Average Premium Amount by Policy Type", fontsize=14)

# Show values on bars
for index, value in enumerate(Policy_Type_summary['mean']):
    plt.text(index, value + 1000, f"{value:.2f}", ha='center', fontsize=11, fontweight='bold')

# Show plot
plt.show()
No description has been provided for this image

Findings:Premium amounts increase with policy coverage, with Basic being the lowest (19,321.36) and Premium the highest (22,548.70).

Hypothesis Validation:Hypothesis: "Policy Type Influences Health Insurance Premium Amounts" – TRUE

4.Smokers tend to have higher insurance premiums compared to non-smokers.¶

In [ ]:
smoking_premium_comparison = df.groupby("Smoking Status")["Premium Amount"].mean().reset_index()

# Display results
print(smoking_premium_comparison)
  Smoking Status  Premium Amount
0             No        19405.28
1            Yes        22013.74
In [ ]:
smoking_premium_comparison = pd.DataFrame({
    "Smoking Status": ["No", "Yes"],
    "Premium Amount": [20182.85, 24006.33]
})

# Plot the bar chart
plt.figure(figsize=(8, 5))
sns.barplot(x="Smoking Status", y="Premium Amount", data=smoking_premium_comparison, palette="Blues")

# Labels and title
plt.xlabel("Smoking Status")
plt.ylabel("Average Premium Amount")
plt.title("Impact of Smoking on Insurance Premiums")
plt.show()
<ipython-input-48-e7e12934dc1d>:8: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x="Smoking Status", y="Premium Amount", data=smoking_premium_comparison, palette="Blues")
No description has been provided for this image

Finding: Smokers pay higher health insurance premiums (₹22,011.66) compared to non-smokers (₹19,401.32).

Hypothesis Status: True Smoking status increases the premium amount.

5.Individuals who exercise more frequently (Daily/Weekly) tend to have lower health insurance premiums compared to those who exercise less frequently (Monthly/Rarely).¶

In [ ]:
# Group by Exercise Frequency and calculate mean Premium Amount
exercise_premium = df.groupby("Exercise Frequency")["Premium Amount"].mean().reset_index()

# Sort the categories in the expected order (Daily → Weekly → Monthly → Rarely)
exercise_premium["Exercise Frequency"] = pd.Categorical(exercise_premium["Exercise Frequency"],
                                                        categories=["Daily", "Weekly", "Monthly", "Rarely"],
                                                        ordered=True)
exercise_premium = exercise_premium.sort_values("Exercise Frequency")

# Plot the data
plt.figure(figsize=(8, 5))
sns.barplot(x="Exercise Frequency", y="Premium Amount", data=exercise_premium, palette="viridis")

# Add labels and title
plt.xlabel("Exercise Frequency")
plt.ylabel("Average Premium Amount")
plt.title("Impact of Exercise Frequency on Premium Amount")
plt.show()
<ipython-input-49-51bfd71790d1>:12: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x="Exercise Frequency", y="Premium Amount", data=exercise_premium, palette="viridis")
No description has been provided for this image
In [ ]:
exercise_premium
Out[ ]:
Exercise Frequency Premium Amount
0 Daily 18960.46
3 Weekly 18968.28
1 Monthly 19879.45
2 Rarely 25057.54

Findings: Individuals who exercise less frequently (Monthly or Rarely) have higher average health insurance premiums, with the "Rarely" group paying the most.

Hypothesis Status: True – More frequent exercise (Daily/Weekly) is associated with lower health insurance premiums.

6.Higher the Health Score, higher the Premium Amount.¶

In [ ]:
df1 = df
health_premium_analysis = df1.groupby("Health Score Category")["Premium Amount"].mean().reset_index()

# Display results
print(health_premium_analysis)
  Health Score Category  Premium Amount
0                  High        22530.12
1                   Low        17419.91
2                Medium        20672.94
In [ ]:
# Set plot style and size
plt.figure(figsize=(8, 6))
sns.set_style("whitegrid")

# Create bar plot using health_premium_analysis
sns.barplot(x="Health Score Category", y="Premium Amount", hue="Health Score Category", data=health_premium_analysis, palette="crest", legend=False)

# Add labels and title
plt.xlabel("Health Score Category", fontsize=12)
plt.ylabel("Average Premium Amount", fontsize=12)
plt.title("Average Premium Amount by Health Score Category", fontsize=14)

# Display values on each bar
for index, value in enumerate(health_premium_analysis["Premium Amount"]):
    plt.text(index, value + 500, f"{value:.2f}", ha='center', fontsize=11, fontweight='bold')

plt.show()
No description has been provided for this image

7.Higher Previous Claims Lead to Increased Premium Prices.¶

In [ ]:
# Group by Previous Claims and calculate average premium
df_grouped = df.groupby("Previous Claims")["Premium Amount"].mean().reset_index()

# Convert Previous Claims to categorical if it's numeric
df_grouped["Previous Claims"] = df_grouped["Previous Claims"].astype(str)

# Barplot for visualization
plt.figure(figsize=(8, 5))
sns.barplot(x=df_grouped["Previous Claims"], y=df_grouped["Premium Amount"], palette="viridis")
plt.xlabel("Previous Claims")
plt.ylabel("Average Premium Amount")
plt.title("Impact of Previous Claims on Premium Amount")
plt.show()

df_grouped
<ipython-input-53-6a6d2f53920a>:9: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=df_grouped["Previous Claims"], y=df_grouped["Premium Amount"], palette="viridis")
No description has been provided for this image
Out[ ]:
Previous Claims Premium Amount
0 0.0 23097.32
1 1.0 18835.50
2 2.0 22888.42
3 3.0 23065.29
4 4.0 23487.99
5 5.0 24176.75
6 6.0 24572.97

Finding:Premium amounts generally increase as the number of previous claims rises, with a noticeable upward trend from 1 to 6 claims.

Hypothesis Status:Partially True – While premium amounts increase with higher claims, the premium for 1 claim is lower than for 0 claims, indicating some variation in the trend.

Formula Making¶

The code is calculating adjustment factors for Age and Location under the Basic Policy type. These factors help quantify how much the premium amount increases or decreases relative to the base premium. These adjustment factors can later be used to build a pricing formula for calculating premiums based on age and location.

For Basic value b we take medain of each policy type,The Base Premium serves as the reference point to measure how much premiums vary by Age and Location.

Groups the each Policy data by Age Group. Calculates the mean premium for each Age Group. Computes the Age Factor by comparing each age group's mean premium to the Base Premium using this formula:

Age Factor¶

( Mean Premium for Age Group − Base Premium ) Base Premium Age Factor= Base Premium (Mean Premium for Age Group−Base Premium) ​

Groups the each Policy data by Location. Calculates the mean premium for each location (e.g., Tier-1, Tier-2, Tier-3). Computes the Location Factor using the same formula as Age Factor.

In [ ]:
# Assuming df_full contains all 120,000 records
base_premium_basic = df[df['Policy Type'] == 'Basic']['Premium Amount'].median()
print(f"✅ Base Premium (Basic Policy from Full Data): ₹{base_premium_basic:.2f}")

# Filter only Basic policies
df_basic = df[df['Policy Type'] == 'Basic']

# Calculate Age Factors (mean method)
age_factors_basic = df_basic.groupby('Age Group')['Premium Amount'].mean().reset_index()
age_factors_basic['Age Factor'] = (age_factors_basic['Premium Amount'] - base_premium_basic) / base_premium_basic

print("\n✅ Age Adjustment Factors (Basic Policy - Full Data):")
print(age_factors_basic)

# Calculate Location Factors (mean method)
location_factors_basic = df_basic.groupby('Location')['Premium Amount'].mean().reset_index()
location_factors_basic['Location Factor'] = (location_factors_basic['Premium Amount'] - base_premium_basic) / base_premium_basic

print("\n✅ Location Adjustment Factors (Basic Policy - Full Data):")
print(location_factors_basic)
✅ Base Premium (Basic Policy from Full Data): ₹10825.14

✅ Age Adjustment Factors (Basic Policy - Full Data):
  Age Group  Premium Amount  Age Factor
0     18-30        14259.28        0.32
1     31-50        15300.27        0.41
2       51+        29131.29        1.69

✅ Location Adjustment Factors (Basic Policy - Full Data):
  Location  Premium Amount  Location Factor
0   Tier-1        19542.62             0.81
1   Tier-2        19393.83             0.79
2   Tier-3        19040.39             0.76
In [ ]:
# Base Premium (Comprehensive Policy from Full Data)
base_premium_comprehensive = df[df['Policy Type'] == 'Comprehensive']['Premium Amount'].median()
print(f"✅ Base Premium (Comprehensive Policy from Full Data): ₹{base_premium_comprehensive:.2f}")

# Filter only Comprehensive policies
df_comprehensive = df[df['Policy Type'] == 'Comprehensive']

# Calculate Age Factors (mean method)
age_factors_comprehensive = df_comprehensive.groupby('Age Group')['Premium Amount'].mean().reset_index()
age_factors_comprehensive['Age Factor'] = (age_factors_comprehensive['Premium Amount'] - base_premium_comprehensive) / base_premium_comprehensive

print("\n✅ Age Adjustment Factors (Comprehensive Policy - Full Data):")
print(age_factors_comprehensive)

# Calculate Location Factors (mean method)
location_factors_comprehensive = df_comprehensive.groupby('Location')['Premium Amount'].mean().reset_index()
location_factors_comprehensive['Location Factor'] = (location_factors_comprehensive['Premium Amount'] - base_premium_comprehensive) / base_premium_comprehensive

print("\n✅ Location Adjustment Factors (Comprehensive Policy - Full Data):")
print(location_factors_comprehensive)
✅ Base Premium (Comprehensive Policy from Full Data): ₹11397.64

✅ Age Adjustment Factors (Comprehensive Policy - Full Data):
  Age Group  Premium Amount  Age Factor
0     18-30        14953.29        0.31
1     31-50        15951.82        0.40
2       51+        30567.09        1.68

✅ Location Adjustment Factors (Comprehensive Policy - Full Data):
  Location  Premium Amount  Location Factor
0   Tier-1        20323.39             0.78
1   Tier-2        20375.34             0.79
2   Tier-3        20056.35             0.76
In [ ]:
# Base Premium (Premium Policy from Full Data)
base_premium_premium = df[df['Policy Type'] == 'Premium']['Premium Amount'].median()
print(f"✅ Base Premium (Premium Policy from Full Data): ₹{base_premium_premium:.2f}")

# Filter only Premium policies
df_premium = df[df['Policy Type'] == 'Premium']

# Calculate Age Factors (mean method)
age_factors_premium = df_premium.groupby('Age Group')['Premium Amount'].mean().reset_index()
age_factors_premium['Age Factor'] = (age_factors_premium['Premium Amount'] - base_premium_premium) / base_premium_premium

print("\n✅ Age Adjustment Factors (Premium Policy - Full Data):")
print(age_factors_premium)

# Calculate Location Factors (mean method)
location_factors_premium = df_premium.groupby('Location')['Premium Amount'].mean().reset_index()
location_factors_premium['Location Factor'] = (location_factors_premium['Premium Amount'] - base_premium_premium) / base_premium_premium

print("\n✅ Location Adjustment Factors (Premium Policy - Full Data):")
print(location_factors_premium)
✅ Base Premium (Premium Policy from Full Data): ₹12952.51

✅ Age Adjustment Factors (Premium Policy - Full Data):
  Age Group  Premium Amount  Age Factor
0     18-30        16899.40        0.30
1     31-50        18001.85        0.39
2       51+        33519.55        1.59

✅ Location Adjustment Factors (Premium Policy - Full Data):
  Location  Premium Amount  Location Factor
0   Tier-1        22759.68             0.76
1   Tier-2        22628.89             0.75
2   Tier-3        22264.68             0.72

For previous_claims_factor, exercise_factors, health_score_factors, smoking_factor we get the facors from Hypothisis-

previous_claims_factors = {
    0: 0.00,
    1: 0.05,
    2: 0.10,
    3: 0.15,
    4: 0.20,
    5: 0.25
}


smoking_factor = 0.1312 if smoking_status == 'Yes' else 0.00


    exercise_factors = {
    'Daily': 0.00,
    'Weekly': 0.0002,
    'Monthly': 0.0482,
    'Rarely': 0.3151
}

health_score_factors = {
    'High': 0.2930,
    'Medium': 0.1863,
    'Low': 0.00
}
In [ ]:
# Risk Adjustment Data
data = {
   'Basic': {
       'Age': {
           '18-30': (14259.28, 0.32),
           '31-50': (15300.27, 0.41),
           '51+': (29131.29, 1.69)
    },
    'Location': {
           'Tier-1': (19542.62, 0.81),
           'Tier-2': (19393.83, 0.79),
           'Tier-3': (19040.39, 0.76)
    }
    },
  'Comprehensive': {
    'Age': {
          '18-30': (14953.29, 0.31),
          '31-50': (15951.82, 0.40),
          '51+': (30567.09, 1.68)
    },
    'Location': {
          'Tier-1': (20323.39, 0.78),
          'Tier-2': (20375.34, 0.79),
          'Tier-3': (20056.35, 0.76)
    }
    },
  'Premium': {
    'Age': {
        '18-30': (16899.40, 0.30),
        '31-50': (18001.85, 0.39),
        '51+': (33519.55, 1.59)
    },
    'Location': {
        'Tier-1': (22759.68, 0.76),
        'Tier-2': (22628.89, 0.75),
        'Tier-3': (22264.68, 0.72)
    }
    }
}

# Risk Adjustment Factors
def get_risk_factors(smoking_status, exercise_freq, health_score, previous_claims):
    smoking_factor = 0.1312 if smoking_status == 'Yes' else 0.00

    # Previous Claims Adjustment
    previous_claims_factors = {
        0: 0.00,
        1: 0.05,
        2: 0.10,
        3: 0.15,
        4: 0.20,
        5: 0.25
    }
    # Get factor, default to 0.25 if claims exceed 5
    previous_claims_factor = previous_claims_factors.get(previous_claims, 0.25)



    exercise_factors = {
        'Daily': 0.00,
        'Weekly': 0.0002,
        'Monthly': 0.0482,
        'Rarely': 0.3151
    }
    exercise_factor = exercise_factors.get(exercise_freq, 0.00)

    health_score_factors = {
        'High': 0.2930,
        'Medium': 0.1863,
        'Low': 0.00
    }
    health_score_factor = health_score_factors.get(health_score, 0.00)

    return smoking_factor, exercise_factor, health_score_factor , previous_claims_factor

# Function to calculate premium
def calculate_final_premium(row):
    policy_type = row['Policy Type']
    age_group = row['Age Group']
    location = row['Location']

    # Check if policy type, age group, location are valid
    if policy_type not in data or age_group not in data[policy_type]['Age'] or location not in data[policy_type]['Location']:
        return None

    # Get base premium and factors
    age_premium, age_factor = data[policy_type]['Age'][age_group]
    loc_premium, loc_factor = data[policy_type]['Location'][location]

    # Get risk factors
    smoking_factor, exercise_factor, health_score_factor, previous_claims_factor = get_risk_factors(row['Smoking Status'], row['Exercise Frequency'], row['Health Score'],row['Previous Claims'])

    # Calculate final premium
    base_premium = age_premium
    total_factor = loc_factor + smoking_factor + exercise_factor + health_score_factor + previous_claims_factor
    final_premium = base_premium * (1 + total_factor)

    return round(final_premium, 2)

# Assuming your DataFrame is called df
df['Final Premium'] = df.apply(calculate_final_premium, axis=1)  # Apply function row-wise

# Check the DataFrame with the new column
print(df.head())
   id  Age Gender  Annual Income    Marital Status  Number of Dependents  \
0   0   19  Woman      864214.00    Spouse Present                     1   
1   1   39  Woman      892701.25    Spouse Present                     3   
2   2   23    Man     2201772.00  Formerly Married                     3   
3   3   21    Man     3997541.99    Spouse Present                     2   
4   4   21    Man     3409986.00       Not Married                     1   

  Education Level Occupation  Health Score Location  ... Insurance Duration  \
0   Undergraduate   Business         26.60   Tier-1  ...               5.00   
1   Post Graduate    Missing         21.57   Tier-2  ...               2.00   
2   Undergraduate   Business         50.18   Tier-3  ...               3.00   
3   Undergraduate    Missing         16.94   Tier-2  ...               1.00   
4   Undergraduate   Business         24.38   Tier-2  ...               4.00   

           Policy Start Date  Customer Feedback  Smoking Status  \
0 2023-12-23 15:21:39.134960               Poor              No   
1 2023-06-12 15:21:39.111551            Average             Yes   
2 2023-09-30 15:21:39.221386               Good             Yes   
3 2024-06-12 15:21:39.226954               Poor             Yes   
4 2021-12-01 15:21:39.252145               Poor             Yes   

  Exercise Frequency  Property Type Premium Amount Age Group  \
0             Weekly  Detached Home        2690.70     18-30   
1            Monthly  Detached Home       10908.90     31-50   
2             Weekly  Detached Home       21563.14     18-30   
3              Daily           Flat        2690.70     18-30   
4             Weekly  Detached Home        2690.70     18-30   

  Health Score Category  Final Premium  
0                Medium       31436.26  
1                Medium       32213.11  
2                  High       32132.52  
3                   Low       28107.89  
4                Medium       31794.53  

[5 rows x 23 columns]
In [ ]:
df['Final Premium']
Out[ ]:
Final Premium
0 31436.26
1 32213.11
2 32132.52
3 28107.89
4 31794.53
... ...
1199995 32583.35
1199996 56249.56
1199997 25783.63
1199998 59329.60
1199999 32605.70

1198440 rows × 1 columns


Affordability Of Price¶

In [ ]:
df['IPR'] = (df['Final Premium'] / df['Annual Income']) * 100

The Income-to-Premium Ratio (IPR) is a financial metric that measures how much of a person's Annual Income is spent on their Health Insurance Premium. It is expressed as a percentage and helps assess the affordability of the insurance policy.

The IPR (Insurance Premium to Income Ratio) helps measure how affordable a health insurance premium is relative to annual income. The thresholds (2, 4, 12) are based on financial guidelines:

≤ 2%: Higher Affordability – Minimal financial burden.

2% to 4%: High Affordability – Reasonable premium cost.

4% to 12%: Moderate Affordability – Manageable but higher expense.

12%: Low Affordability – Significant financial strain.

These values are chosen to reflect practical affordability levels and industry benchmarks.

In [ ]:
def categorize_affordability(ipr):
    if ipr <= 2:
        return 'Higher Affordability'
    elif 2 < ipr <= 4:
        return 'High Affordability'
    elif 4 < ipr <= 12:  # Increased range from 10 to 12
        return 'Moderate Affordability'
    else:
        return 'Low Affordability'

# Apply the updated affordability categories
df['Affordability_Category'] = df['IPR'].apply(categorize_affordability)

# Check the updated category distribution
print(df['Affordability_Category'].value_counts(normalize=True) * 100)
Affordability_Category
Moderate Affordability   33.09
Higher Affordability     25.35
High Affordability       21.42
Low Affordability        20.14
Name: proportion, dtype: float64

Overall Analysis:¶

  • Age directly influences premium amounts—older individuals pay higher premiums—making age a true factor for premium calculation adjustments.
  • Tier-1 cities have slightly higher premiums compared to Tier-2 **and **Tier-3, but the variation is small. This indicates location has a minor effect on premium pricing.
  • Premium policyholders pay the highest premiums, followed by Comprehensive, and Basic policies have the lowest premiums. Policy type directly affects premium calculations.
  • Smokers consistently pay higher premiums than non-smokers, confirming a direct impact.
  • Higher health scores result in higher premiums, while lower health scores correspond to lower premiums,
  • Zero claims are associated with higher premiums, while individuals with 1 to 6 prior claims have increasing premium amounts.
  • Regular exercise reduces health risks, leading to lower premiums, confirming that exercise frequency directly affects premium costs

Recommendations:¶

  • Target Low & Moderate Affordability Groups (53.23% Total):Offer customized discounts or flexible payment plans to make premiums more manageable.
  • Affordable Policy Variants: Design basic coverage plans tailored for Low and Moderate Affordability groups to capture a broader market.
  • Income-Based Premium Plans:Implement tiered pricing tied to income brackets to improve affordability and attract new customers.
  • Segment-Specific Marketing:Focus marketing efforts on the Low Affordability segment by highlighting cost-saving benefits and value-added services.
  • Focus Offers on the Low Affordability Segment: Discounts & Payment Flexibility: Provide limited-time discounts or EMI options to ease the financial burden.
  • Incentives for High Affordability Segment:Premium Upgrades: Offer value-added services (e.g., dental, vision, or wellness programs) to encourage policy upgrades.