Acko Health Insurance: Data-Driven Insurance Premium Pricing Strategy
Acko Health Insurance Product and Premium Pricing Analysis
Introduction:
Acko, a digital insurance provider, is launching a new health insurance product and requires a data-driven approach to determine the optimal insurance premium pricing for different customer segments.
Since no direct health insurance data is available (such as medical history, hospital visits, or past health insurance claims), the analysis will rely on demographic, financial, and health-related behavioural indicators to estimate risk levels and pricing.
1. Data Understanding
Loading and Inspecting Data
Import the dataset and inspect its structure.
Display the first few rows to understand the data layout.
Check for:
- Column names and data types.
- Missing values and their distribution.
- Basic statistics of numerical and categorical features.
2. Data Preparation
Handling Missing Values
- Identify columns with missing data.
- Implement strategies to handle them (e.g., imputation, removal). Justify selected strategies briefly.
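One possible imputation strategy is sketched below, assuming median fill for numeric columns and mode fill for categorical ones. The helper name `impute_missing` and the toy frame are illustrative only and are not part of the original analysis; whether a target such as Premium Amount should be imputed at all is a separate decision.

```python
import numpy as np
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and categorical gaps with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:  # a fully empty column is left untouched
                out[col] = out[col].fillna(modes.iloc[0])
    return out

# Toy frame standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 55.0],
    "Customer Feedback": ["Good", None, "Good", "Poor"],
})
imputed = impute_missing(toy)
print(imputed)
```

Median fill is less sensitive to skew than mean fill, which matters for columns such as income.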
Handling Duplicates:
- Remove duplicate entries based on customer ID or policy number.
Outlier Detection & Treatment:
- Identify extreme values in income, premium amounts, or risk indicators.
- Apply winsorization (capping extreme values) or log transformation.
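The two treatments named above can be sketched as follows. The quantile cut-offs and the helper name `winsorize` are assumptions chosen for illustration, not values taken from the analysis.

```python
import numpy as np
import pandas as pd

def winsorize(series: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Cap values below the `lower` quantile and above the `upper` quantile."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lower=lo, upper=hi)

# Hypothetical income values with one extreme entry
income = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Wide cut-offs are used here only so the effect is visible on five values
capped = winsorize(income, lower=0.25, upper=0.75)

# log1p compresses the right tail while keeping zero mapped to zero
logged = np.log1p(income)

print(capped.tolist())
print(logged.round(3).tolist())
```

Winsorization keeps every row but limits the influence of extremes; the log transform keeps the ordering of values and reduces skew.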
Standardizing Categorical Data:
- Convert gender, location, policy type, employment status, etc. into numeric values.
- Group location into risk categories (urban vs. rural, high-cost vs. low-cost).
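A minimal encoding sketch follows, assuming one-hot encoding for unordered categories and an ordinal rank for the location tiers. The rank mapping and the "Is Tier-1" flag are hypothetical choices; the source does not state which tiers count as high-cost.

```python
import pandas as pd

# Toy frame using category labels that appear in the dataset
toy = pd.DataFrame({
    "Gender": ["Man", "Woman", "Woman"],
    "Location": ["Tier-1", "Tier-3", "Tier-2"],
    "Policy Type": ["Basic", "Premium", "Comprehensive"],
})

# Hypothetical ordinal mapping for the residence tier
location_rank = {"Tier-1": 1, "Tier-2": 2, "Tier-3": 3}

encoded = toy.copy()
encoded["Location Rank"] = encoded["Location"].map(location_rank)
# Hypothetical binary grouping: treat Tier-1 as its own category
encoded["Is Tier-1"] = (encoded["Location"] == "Tier-1").astype(int)
# One-hot encode the unordered categories
encoded = pd.get_dummies(encoded, columns=["Gender", "Policy Type"], dtype=int)

print(encoded.columns.tolist())
```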
3. Exploratory Data Analysis (EDA)
- Goal: Identify trends, patterns, and relationships in the data.
- Structure the EDA using:
  - Univariate Analysis: Understand the distribution and characteristics of single variables (e.g., age, income, premium amount).
  - Bivariate Analysis: Identify correlations and trends between two variables (e.g., age vs. premium, income vs. affordability).
  - Multivariate Analysis: Identify complex patterns and dependencies among multiple factors affecting premium pricing.
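The three levels of analysis can each be expressed with a single pandas call. The sketch below uses a small toy frame with invented values; the choice of columns is an assumption for illustration.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    "Age": [22, 35, 47, 58, 63, 29],
    "Smoking Status": ["No", "Yes", "No", "Yes", "Yes", "No"],
    "Location": ["Tier-1", "Tier-2", "Tier-1", "Tier-3", "Tier-2", "Tier-3"],
    "Premium Amount": [3000.0, 9000.0, 8000.0, 20000.0, 24000.0, 4000.0],
})

# Univariate: distribution summary of a single variable
age_summary = toy["Age"].describe()

# Bivariate: linear correlation between two numeric variables
age_premium_corr = toy["Age"].corr(toy["Premium Amount"])

# Multivariate: median premium across two categorical factors at once
segment_medians = toy.pivot_table(
    index="Location", columns="Smoking Status",
    values="Premium Amount", aggfunc="median",
)

print(age_summary)
print(age_premium_corr)
print(segment_medians)
```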
4. Charting and Insights
- Visualization Strategy
  - Provide relevant charts with:
    - Titles, labels, legends, and annotations.
    - Brief markdown explanations of the purpose and insights.
- Key Insights
  - Highlight actionable insights:
    - Regions with high demand.
    - Younger customers need low-cost entry plans, while seniors require higher-coverage options.
    - Incentives and discounts can help maintain customer loyalty.
5. Insights and Recommendations
- Summary of Findings
  - Many young customers explore policies but hesitate to purchase due to affordability concerns.
  - Acko needs to balance risk adjustments without making insurance unaffordable.
  - The analysis suggests affordability constraints among customers.
  - Customers are twice as likely to renew if they receive a renewal bonus.
- Recommendations
  - Offer low-cost starter plans with essential coverage for younger customers.
  - Implement gradual premium adjustments for high-risk individuals instead of steep price hikes.
  - For higher premium policies, introduce value-add services (telemedicine, wellness programs).
  - Offer renewal bonuses or cashback for long-term customers.
Unlocking Customer Demand and Optimizing Pricing in the Health Insurance Market
Acko, a leading digital insurance provider in India, is transforming the health insurance industry by offering affordable, transparent, and customer-centric policies. As healthcare costs continue to rise, customers are actively seeking reliable and cost-effective insurance plans that cater to their specific needs.
However, success in the competitive health insurance market depends on two key factors:
- Understanding customer risk profiles and affordability concerns
- Offering personalized and competitive premium pricing
Customers evaluate multiple insurance options before making a decision, and high premium rates or rigid pricing structures can lead to potential dropouts. Missed conversions not only impact Acko's revenue growth but also hinder customer trust and long-term retention.
To optimize health insurance adoption, Acko must analyze key factors influencing customer preferences, including:
- Demographic trends (age, location, occupation, income levels, family size, etc.)
- Health risk indicators (BMI, lifestyle choices, pre-existing conditions, etc.)
- Affordability and price sensitivity across different customer segments
Why do some plans appeal more to customers than others? What factors drive policy purchases or cancellations? Understanding these behavioral patterns is critical to:
- Creating risk-adjusted, data-driven premium pricing strategies
- Reducing customer dropouts by offering tailored policies
- Enhancing retention through dynamic pricing and loyalty benefits
By leveraging data-driven insights and strategic pricing models, Acko can bridge the gap between affordability and profitability, ensuring sustainable growth in India's evolving health insurance market.
Objectives:
In this project, we aim to develop a structured, rule-based pricing model that balances:
- Profitability: Ensuring premiums are structured based on risk-adjusted factors.
- Affordability: Ensuring premiums align with customers' financial capacity and risk exposure.
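To make the idea of a rule-based model concrete, here is a minimal sketch of how risk loadings and an affordability cap could be combined. Every number in it (age bands, loadings, the claims cap, the 5% income share) and the function name `rule_based_premium` are hypothetical placeholders, not the pricing rules derived in this project.

```python
def rule_based_premium(base, age, smoker, previous_claims, annual_income, cap_share=0.05):
    """Illustrative rule-based premium. All thresholds and loadings are hypothetical."""
    multiplier = 1.0
    # Profitability side: add a loading for each risk factor
    if age >= 50:
        multiplier += 0.30
    elif age >= 35:
        multiplier += 0.15
    if smoker:
        multiplier += 0.20
    multiplier += 0.05 * min(previous_claims, 5)  # claims loading is capped

    risk_price = base * multiplier
    # Affordability side: never charge more than a fixed share of income
    affordability_cap = annual_income * cap_share
    return min(risk_price, affordability_cap)

# A 52-year-old smoker with two past claims, at two income levels
high_income_quote = rule_based_premium(10000, 52, True, 2, 1_000_000)
low_income_quote = rule_based_premium(10000, 52, True, 2, 200_000)
print(round(high_income_quote, 2), round(low_income_quote, 2))
```

The `min` at the end is where the two objectives meet: the risk-adjusted price applies unless it exceeds what the customer's income can plausibly support.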
Business Impact
Implementing a structured, rule-based premium pricing model will have significant business implications for Acko, driving both revenue growth and customer satisfaction.
- Improved Pricing Accuracy and Profitability: The structured formula ensures that premium prices are aligned with customer risk levels, minimizing underpricing (leading to losses) and overpricing (leading to dropouts).
- Personalized Premiums for Different Customer Segments: Dynamic pricing ensures that different income and demographic groups receive fair and affordable insurance plans.
- Better Risk Assessment Without Medical Data: Since direct health records are unavailable, leveraging alternative indicators (age, lifestyle, spending behavior, etc.) ensures effective risk evaluation.
- Strategic Policy Adjustments: The model enables data-driven recommendations for new discount strategies, dynamic pricing updates, and customer incentives to boost sign-ups.
By implementing a data-driven, rule-based premium pricing strategy, Acko ensures sustainable revenue growth, risk-adjusted profitability, and higher customer satisfaction, ultimately solidifying its position as a leader in India's digital health insurance market.
Dataset Overview
- Dataset Name: Acko Health Insurance Dataset
- Number of Rows: 1200000
- Number of Columns: 20
- Description: The objective of this analysis is to identify patterns in customer segmentation, assess risk factors, and refine Acko's pricing strategy to ensure a balance between business profitability and customer affordability. Insights from this dataset will help Acko enhance its product offerings, improve customer satisfaction, and optimize revenue generation.
Column Definitions
id - A unique identifier assigned to each customer in the dataset.
Age - The age of the customer in years at the time of policy purchase.
Gender - The gender identity of the customer, which can be "Man" or "Woman."
Annual Income (Yearly Earnings in INR) - The total income earned by the customer in a year, measured in Indian Rupees (INR).
Marital Status (Customer's Marital Condition) - The marital status of the customer, such as "Spouse Present," "Not Married," or "Formerly Married."
Number of Dependents (People Financially Dependent on Customer) - The number of dependents (such as children, parents, or others) that rely on the customer financially.
Education Level (Highest Education Attained) - The highest level of education completed by the customer, such as "Undergraduate" or "Post Graduate."
Occupation (Customer's Job Type) - The profession or employment category of the customer. In some cases, this data may be missing.
Health Score (Overall Health Indicator) - A numerical score representing the customer's health condition based on lifestyle factors and medical history.
Location (Customer's Residence Tier) - The classification of the customer's residence area into tiers such as Tier-1, Tier-2, or Tier-3 cities.
Policy Type (Type of Insurance Policy Chosen) - The category of insurance policy purchased by the customer, such as "Basic," "Premium," or "Comprehensive."
Previous Claims (Number of Past Insurance Claims) - The total number of insurance claims the customer has made before purchasing the current policy.
Credit Score (Financial Responsibility Indicator) - A numerical representation of the customer's creditworthiness, indicating their ability to manage finances and make timely payments.
Insurance Duration (Policy Tenure in Years) - The number of years the customer has held the insurance policy.
Policy Start Date (Date When Policy Became Active) - The date when the insurance policy was purchased and became active.
Customer Feedback (Customer Satisfaction Rating) - The rating or feedback provided by the customer about their experience with the insurance policy.
Smoking Status (Whether the Customer Smokes) - Indicates whether the customer is a smoker or not, as smoking impacts health risks and insurance premiums.
Exercise Frequency (How Often the Customer Exercises) - The frequency of physical exercise performed by the customer, which can impact their health score.
Property Type (Type of Residence) - The type of home the customer resides in, such as a detached home or an apartment.
Premium Amount (Final Insurance Premium in INR) - The amount the customer pays for their health insurance policy, measured in Indian Rupees (INR).
Analysis & Visualisation
1. Importing and Cleaning Data
Importing Necessary Libraries
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical computations
import matplotlib.pyplot as plt # For plotting and visualization
import seaborn as sns # For advanced visualizations
Loading the Dataset from Google Drive
# Step 1: Install gdown
!pip install gdown
# Step 2: Import necessary libraries
import gdown
import pandas as pd
# Step 3: Set the file ID and create a download URL
file_id = "1i4ia9ZNfAXgu6JGTXCUgFn7Pb8wltzLH"
download_url = f"https://drive.google.com/uc?id={file_id}"
# Step 4: Set the output file name
output_file = "acko_dataset.csv"
# Step 5: Download the file
gdown.download(download_url, output_file, quiet=False)
# Step 6: Load the CSV file into a Pandas DataFrame
data = pd.read_csv(output_file)
Downloading...
From (original): https://drive.google.com/uc?id=1i4ia9ZNfAXgu6JGTXCUgFn7Pb8wltzLH
From (redirected): https://drive.google.com/uc?id=1i4ia9ZNfAXgu6JGTXCUgFn7Pb8wltzLH&confirm=t&uuid=2a5d4fea-a2a4-4c0f-8796-8315bc58c66f
To: /content/acko_dataset.csv
100% 219M/219M [00:02<00:00, 78.8MB/s]
Viewing the First Few Rows of the Dataset
print("First 5 Rows of the Dataset:")
data.head(5)
First 5 Rows of the Dataset:
| | id | Age | Gender | Annual Income | Marital Status | Number of Dependents | Education Level | Occupation | Health Score | Location | Policy Type | Previous Claims | Credit Score | Insurance Duration | Policy Start Date | Customer Feedback | Smoking Status | Exercise Frequency | Property Type | Premium Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19.0 | Woman | 8.642140e+05 | Spouse Present | 1.0 | Undergraduate | Business | 26.598761 | Tier-1 | Premium | 2.0 | 372.0 | 5.0 | 2023-12-23 15:21:39.134960 | Poor | No | Weekly | Detached Home | 1945.913327 |
| 1 | 1 | 39.0 | Woman | 8.927012e+05 | Spouse Present | 3.0 | Post Graduate | Missing | 21.569731 | Tier-2 | Comprehensive | 1.0 | 694.0 | 2.0 | 2023-06-12 15:21:39.111551 | Average | Yes | Monthly | Detached Home | 10908.896072 |
| 2 | 2 | 23.0 | Man | 2.201772e+06 | Formerly Married | 3.0 | Undergraduate | Business | 50.177549 | Tier-3 | Premium | 1.0 | NaN | 3.0 | 2023-09-30 15:21:39.221386 | Good | Yes | Weekly | Detached Home | 21563.135198 |
| 3 | 3 | 21.0 | Man | 3.997542e+06 | Spouse Present | 2.0 | Undergraduate | Missing | 16.938144 | Tier-2 | Basic | 1.0 | 367.0 | 1.0 | 2024-06-12 15:21:39.226954 | Poor | Yes | Daily | Flat | 2653.539143 |
| 4 | 4 | 21.0 | Man | 3.409986e+06 | Not Married | 1.0 | Undergraduate | Business | 24.376094 | Tier-2 | Premium | 0.0 | 598.0 | 4.0 | 2021-12-01 15:21:39.252145 | Poor | Yes | Weekly | Detached Home | 1269.243463 |
Checking the Shape of the Dataset
rows, columns = data.shape
print(f"\nThe dataset contains {rows} rows and {columns} columns.")
The dataset contains 1200000 rows and 20 columns.
The dataset contains:
- Rows: 1200000
- Columns: 20
Random Sample
random_sample = data[data.notna().all(axis=1)].sample(n=10, random_state=42) # Randomly select 10 rows with no missing values
print(random_sample)
| | id | Age | Gender | Annual Income | Marital Status | Number of Dependents | Education Level | Occupation | Health Score | Location | Policy Type | Previous Claims | Credit Score | Insurance Duration | Policy Start Date | Customer Feedback | Smoking Status | Exercise Frequency | Property Type | Premium Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 940383 | 940383 | 34.0 | Man | 7.100460e+05 | Not Married | 4.0 | Secondary Education | Full-Time Worker | 36.694979 | Tier-3 | Premium | 2.0 | 784.0 | 5.0 | 2023-09-21 15:21:39.190215 | Poor | No | Daily | Flat | 24143.634352 |
| 384739 | 384739 | 27.0 | Man | 1.286560e+05 | Not Married | 1.0 | PhD | Business | 52.368069 | Tier-2 | Basic | 1.0 | 694.0 | 5.0 | 2021-03-05 15:21:39.217387 | Average | Yes | Weekly | Flat | 58629.569372 |
| 841062 | 841062 | 20.0 | Woman | 6.368160e+05 | Spouse Present | 2.0 | PhD | Full-Time Worker | 48.977416 | Tier-2 | Premium | 0.0 | 626.0 | 1.0 | 2022-06-23 15:21:39.279729 | Good | No | Weekly | Flat | 38107.625453 |
| 482497 | 482497 | 26.0 | Woman | 7.206594e+05 | Not Married | 4.0 | Undergraduate | Missing | 46.865991 | Tier-1 | Basic | 2.0 | 445.0 | 4.0 | 2020-06-22 15:21:39.134960 | Poor | Yes | Weekly | Flat | 2632.489722 |
| 105947 | 105947 | 35.0 | Man | 2.201772e+06 | Spouse Present | 4.0 | PhD | Business | 42.556413 | Tier-2 | Premium | 1.0 | 849.0 | 2.0 | 2022-10-14 15:21:39.167099 | Good | Yes | Daily | Flat | 3503.121200 |
| 836098 | 836098 | 26.0 | Woman | 4.540721e+05 | Spouse Present | 0.0 | Undergraduate | Missing | 31.726851 | Tier-3 | Premium | 0.0 | 761.0 | 4.0 | 2020-01-02 15:21:39.228521 | Poor | No | Weekly | Detached Home | 7425.437892 |
| 627498 | 627498 | 60.0 | Man | 8.426382e+06 | Spouse Present | 2.0 | Undergraduate | Full-Time Worker | 42.781384 | Tier-1 | Basic | 2.0 | 691.0 | 2.0 | 2023-09-01 15:21:39.173834 | Good | No | Monthly | Flat | 102616.812058 |
| 1122754 | 1122754 | 25.0 | Man | 5.733600e+05 | Not Married | 0.0 | Undergraduate | Missing | 33.049018 | Tier-2 | Comprehensive | 2.0 | 487.0 | 6.0 | 2024-06-19 15:21:39.124659 | Good | Yes | Monthly | Detached Home | 1238.769345 |
| 962397 | 962397 | 35.0 | Woman | 1.320255e+06 | Not Married | 1.0 | Post Graduate | Missing | 11.816854 | Tier-1 | Premium | 1.0 | 776.0 | 5.0 | 2021-02-17 15:21:39.155231 | Good | Yes | Daily | Apartment | 28459.080582 |
| 278381 | 278381 | 47.0 | Man | 6.327320e+05 | Spouse Present | 3.0 | Undergraduate | Full-Time Worker | 30.242150 | Tier-3 | Premium | 1.0 | 561.0 | 5.0 | 2020-08-29 15:21:39.219432 | Poor | Yes | Weekly | Apartment | 31378.963219 |
Displaying Dataset Information
print("\nDataset Information:")
data.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 20 columns):
 #   Column                Non-Null Count    Dtype
---  ------                --------------    -----
 0   id                    1200000 non-null  int64
 1   Age                   1181295 non-null  float64
 2   Gender                1200000 non-null  object
 3   Annual Income         1155051 non-null  float64
 4   Marital Status        1200000 non-null  object
 5   Number of Dependents  1090328 non-null  float64
 6   Education Level       1200000 non-null  object
 7   Occupation            1200000 non-null  object
 8   Health Score          1125924 non-null  float64
 9   Location              1200000 non-null  object
 10  Policy Type           1200000 non-null  object
 11  Previous Claims       835971 non-null   float64
 12  Credit Score          1062118 non-null  float64
 13  Insurance Duration    1199999 non-null  float64
 14  Policy Start Date     1200000 non-null  object
 15  Customer Feedback     1122176 non-null  object
 16  Smoking Status        1200000 non-null  object
 17  Exercise Frequency    1200000 non-null  object
 18  Property Type         1200000 non-null  object
 19  Premium Amount        784968 non-null   float64
dtypes: float64(8), int64(1), object(11)
memory usage: 183.1+ MB
Data Type Corrections
Policy Start Date (object → datetime): convert to datetime format for time-based analysis.
data["Policy Start Date"] = pd.to_datetime(data["Policy Start Date"])
Checking for Duplicate Values in the Dataset
duplicate_count = len(data[data.duplicated()])
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")
Number of Duplicate Rows in the Dataset: 0
- In this dataset there are no duplicate rows to be removed
Checking for Missing/Null Values
# Count missing values per column
missing_values = data.isnull().sum()
print("\nMissing Values in Each Column:")
print(missing_values)
Missing Values in Each Column:
id                           0
Age                      18705
Gender                       0
Annual Income            44949
Marital Status               0
Number of Dependents    109672
Education Level              0
Occupation                   0
Health Score             74076
Location                     0
Policy Type                  0
Previous Claims         364029
Credit Score            137882
Insurance Duration           1
Policy Start Date            0
Customer Feedback        77824
Smoking Status               0
Exercise Frequency           0
Property Type                0
Premium Amount          415032
dtype: int64
total_missing = data.isnull().sum().sum()  # total missing cells across all columns
print(total_missing)
1242170
- The dataset has 1,242,170 missing values in total.
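Raw counts are easier to compare as a share of rows. The helper below is a sketch (the name `missing_summary` and the toy frame are illustrative assumptions) that reports both the count and the percentage per column.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("percent", ascending=False)

# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, 1.0, 2.0],
    "C": ["x", "y", "z", "w"],
})
summary = missing_summary(toy)
print(summary)
```

Applied to the counts reported above, 415,032 missing Premium Amount values out of 1,200,000 rows is about 34.6% of that column.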
Chart for Missing values
missing_values = data.isnull().sum()
missing_values = missing_values[missing_values > 0]
plt.figure(figsize=(8, 5))
missing_values.plot(kind='bar', color='blue')
plt.title('Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Count')
plt.show()
Summary of Dataset Observations
print("\nObservations About the Dataset:")
if duplicate_count > 0:
    print(f"- There are {duplicate_count} duplicate rows in the dataset.")
else:
    print("- No duplicate rows found in the dataset.")
if missing_values.sum() > 0:
    print("- There are missing values in the dataset. Here's a summary:")
    print(missing_values[missing_values > 0])
else:
    print("- No missing values found in the dataset.")
print("- The dataset is ready for further analysis after handling duplicates and missing values.")
Observations About the Dataset:
- No duplicate rows found in the dataset.
- There are missing values in the dataset. Here's a summary:
Age                      18705
Annual Income            44949
Number of Dependents    109672
Health Score             74076
Previous Claims         364029
Credit Score            137882
Insurance Duration           1
Customer Feedback        77824
Premium Amount          415032
dtype: int64
- The dataset is ready for further analysis after handling duplicates and missing values.
2. Data Types
# Dataset Columns
print("Dataset Columns:")
print(data.columns)
Dataset Columns: Index(['id', 'Age', 'Gender', 'Annual Income', 'Marital Status', 'Number of Dependents', 'Education Level', 'Occupation', 'Health Score', 'Location', 'Policy Type', 'Previous Claims', 'Credit Score', 'Insurance Duration', 'Policy Start Date', 'Customer Feedback', 'Smoking Status', 'Exercise Frequency', 'Property Type', 'Premium Amount'], dtype='object')
# Dataset Describe
print("\nDataset Summary Statistics:")
print(data.describe(include='all'))
Dataset Summary Statistics: id Age Gender Annual Income Marital Status \ count 1.200000e+06 1.181295e+06 1200000 1.155051e+06 1200000 unique NaN NaN 2 NaN 4 top NaN NaN Man NaN Not Married freq NaN NaN 602571 NaN 552049 mean 5.999995e+05 4.114556e+01 NaN 1.664521e+06 NaN min 0.000000e+00 1.800000e+01 NaN 1.075000e+01 NaN 25% 2.999998e+05 3.000000e+01 NaN 3.968939e+05 NaN 50% 5.999995e+05 4.100000e+01 NaN 8.581660e+05 NaN 75% 8.999992e+05 5.300000e+01 NaN 1.990566e+06 NaN max 1.199999e+06 6.400000e+01 NaN 1.304357e+07 NaN std 3.464103e+05 1.353995e+01 NaN 2.115112e+06 NaN Number of Dependents Education Level Occupation Health Score \ count 1.090328e+06 1200000 1200000 1.125924e+06 unique NaN 4 4 NaN top NaN Undergraduate Full-Time Worker NaN freq NaN 627193 373716 NaN mean 2.009934e+00 NaN NaN 3.186879e+01 min 0.000000e+00 NaN NaN 2.391713e+00 25% 1.000000e+00 NaN NaN 2.209691e+01 50% 2.000000e+00 NaN NaN 3.096556e+01 75% 3.000000e+00 NaN NaN 4.114583e+01 max 4.000000e+00 NaN NaN 6.000000e+01 std 1.417338e+00 NaN NaN 1.239609e+01 Location Policy Type Previous Claims Credit Score \ count 1200000 1200000 835971.000000 1.062118e+06 unique 3 3 NaN NaN top Tier-3 Premium NaN NaN freq 401542 401846 NaN NaN mean NaN NaN 1.002689 5.929244e+02 min NaN NaN 0.000000 3.000000e+02 25% NaN NaN 0.000000 4.680000e+02 50% NaN NaN 1.000000 5.950000e+02 75% NaN NaN 2.000000 7.210000e+02 max NaN NaN 9.000000 8.490000e+02 std NaN NaN 0.982840 1.499819e+02 Insurance Duration Policy Start Date Customer Feedback \ count 1.199999e+06 1200000 1122176 unique NaN NaN 3 top NaN NaN Average freq NaN NaN 377905 mean 5.018219e+00 2022-02-13 05:06:30.972380672 NaN min 1.000000e+00 2019-08-17 15:21:39.080371 NaN 25% 3.000000e+00 2020-11-20 15:21:39.121168896 NaN 50% 5.000000e+00 2022-02-14 15:21:39.151731968 NaN 75% 7.000000e+00 2023-05-06 15:21:39.182597120 NaN max 9.000000e+00 2024-08-15 15:21:39.287115 NaN std 2.594331e+00 NaN NaN Smoking Status Exercise Frequency Property Type Premium Amount 
count 1200000 1200000 1200000 784968.000000 unique 2 4 3 NaN top Yes Weekly Detached Home NaN freq 601873 306179 400349 NaN mean NaN NaN NaN 25763.411424 min NaN NaN NaN 292.650059 25% NaN NaN NaN 6840.682284 50% NaN NaN NaN 14824.932460 75% NaN NaN NaN 31316.333081 max NaN NaN NaN 240000.000000 std NaN NaN NaN 30563.216524
Unique Values for Each Variable
# Unique Values for Each Variable
print("\n### Unique Values for Each Variable ###")
for column in data.columns.tolist():
    print(f"No. of unique values in {column}: {data[column].nunique()}.")
### Unique Values for Each Variable ###
No. of unique values in id: 1200000.
No. of unique values in Age: 47.
No. of unique values in Gender: 2.
No. of unique values in Annual Income: 247760.
No. of unique values in Marital Status: 4.
No. of unique values in Number of Dependents: 5.
No. of unique values in Education Level: 4.
No. of unique values in Occupation: 4.
No. of unique values in Health Score: 923518.
No. of unique values in Location: 3.
No. of unique values in Policy Type: 3.
No. of unique values in Previous Claims: 10.
No. of unique values in Credit Score: 550.
No. of unique values in Insurance Duration: 9.
No. of unique values in Policy Start Date: 167381.
No. of unique values in Customer Feedback: 3.
No. of unique values in Smoking Status: 2.
No. of unique values in Exercise Frequency: 4.
No. of unique values in Property Type: 3.
No. of unique values in Premium Amount: 784492.
3. Data Wrangling
Data Wrangling Code
# Copying the dataset for analysis
data = data.copy()
# Checking basic stats
print("Dataset Shape:", data.shape)
print("Dataset Columns:", data.columns)
Dataset Shape: (1200000, 20) Dataset Columns: Index(['id', 'Age', 'Gender', 'Annual Income', 'Marital Status', 'Number of Dependents', 'Education Level', 'Occupation', 'Health Score', 'Location', 'Policy Type', 'Previous Claims', 'Credit Score', 'Insurance Duration', 'Policy Start Date', 'Customer Feedback', 'Smoking Status', 'Exercise Frequency', 'Property Type', 'Premium Amount'], dtype='object')
Converting and Creating Columns
# Ensure 'Policy Start Date' is in datetime format
data["Policy Start Date"] = pd.to_datetime(data["Policy Start Date"], errors="coerce")
# 1. Marital & Dependents Status
# Note: rows with a missing dependents count fail the == 0 test and are labelled "Family".
data["Marital & Dependents Status"] = np.where(
    (data["Marital Status"] == "Not Married") & (data["Number of Dependents"] == 0),
    "Single",
    "Family",
)
# 2. Risk Score (Health Score divided by Credit Score); NaN when either input is missing
data["Risk Score"] = data["Health Score"] / data["Credit Score"]
# 3. Premium Category (categorizing based on Premium Amount)
# Rows with a missing Premium Amount stay NaN instead of being labelled "Low".
data["Premium Category"] = pd.Series(
    np.where(data["Premium Amount"] > 10000, "High", "Low"), index=data.index
).where(data["Premium Amount"].notna())
# 4. Policy Age (Years): whole years since the policy started (depends on the run date)
data["Policy Age (Years)"] = (pd.to_datetime("today") - data["Policy Start Date"]).dt.days // 365
# 5. Financial Responsibility Score (income divided by dependents + 1 to avoid division by zero)
data["Financial Responsibility Score"] = (data["Annual Income"] / (data["Number of Dependents"] + 1)).fillna(0)
# 6. Healthy Lifestyle Score (combining Exercise Frequency and Smoking Status)
# Labels absent from the maps fall back to 0 via fillna.
exercise_map = {"Daily": 5, "Weekly": 3, "Monthly": 1, "None": 0}
smoking_map = {"Yes": -2, "No": 0}
data["Healthy Lifestyle Score"] = (
    data["Exercise Frequency"].map(exercise_map).fillna(0)
    + data["Smoking Status"].map(smoking_map).fillna(0)
)
# 7. Claim Frequency (Previous Claims divided by Insurance Duration)
data["Claim Frequency"] = data["Previous Claims"] / data["Insurance Duration"]
# Display the first few rows to verify
print(data.head())
id Age Gender Annual Income Marital Status Number of Dependents \ 0 0 19.0 Woman 8.642140e+05 Spouse Present 1.0 1 1 39.0 Woman 8.927012e+05 Spouse Present 3.0 2 2 23.0 Man 2.201772e+06 Formerly Married 3.0 3 3 21.0 Man 3.997542e+06 Spouse Present 2.0 4 4 21.0 Man 3.409986e+06 Not Married 1.0 Education Level Occupation Health Score Location ... Exercise Frequency \ 0 Undergraduate Business 26.598761 Tier-1 ... Weekly 1 Post Graduate Missing 21.569731 Tier-2 ... Monthly 2 Undergraduate Business 50.177549 Tier-3 ... Weekly 3 Undergraduate Missing 16.938144 Tier-2 ... Daily 4 Undergraduate Business 24.376094 Tier-2 ... Weekly Property Type Premium Amount Marital & Dependents Status Risk Score \ 0 Detached Home 1945.913327 Family 0.071502 1 Detached Home 10908.896072 Family 0.031080 2 Detached Home 21563.135198 Family NaN 3 Flat 2653.539143 Family 0.046153 4 Detached Home 1269.243463 Family 0.040763 Premium Category Policy Age (Years) Financial Responsibility Score \ 0 Low 1 4.321070e+05 1 High 1 2.231753e+05 2 High 1 5.504430e+05 3 Low 0 1.332514e+06 4 Low 3 1.704993e+06 Healthy Lifestyle Score Claim Frequency 0 3.0 0.400000 1 -1.0 0.500000 2 1.0 0.333333 3 3.0 1.000000 4 1.0 0.000000 [5 rows x 27 columns]
# Dataset Columns
print("Dataset Columns:")
print(data.columns)
Dataset Columns: Index(['id', 'Age', 'Gender', 'Annual Income', 'Marital Status', 'Number of Dependents', 'Education Level', 'Occupation', 'Health Score', 'Location', 'Policy Type', 'Previous Claims', 'Credit Score', 'Insurance Duration', 'Policy Start Date', 'Customer Feedback', 'Smoking Status', 'Exercise Frequency', 'Property Type', 'Premium Amount', 'Marital & Dependents Status', 'Risk Score', 'Premium Category', 'Policy Age (Years)', 'Financial Responsibility Score', 'Healthy Lifestyle Score', 'Claim Frequency'], dtype='object')
data.columns = data.columns.str.strip()
print(data.head())
id Age Gender Annual Income Marital Status Number of Dependents \ 0 0 19.0 Woman 8.642140e+05 Spouse Present 1.0 1 1 39.0 Woman 8.927012e+05 Spouse Present 3.0 2 2 23.0 Man 2.201772e+06 Formerly Married 3.0 3 3 21.0 Man 3.997542e+06 Spouse Present 2.0 4 4 21.0 Man 3.409986e+06 Not Married 1.0 Education Level Occupation Health Score Location ... Exercise Frequency \ 0 Undergraduate Business 26.598761 Tier-1 ... Weekly 1 Post Graduate Missing 21.569731 Tier-2 ... Monthly 2 Undergraduate Business 50.177549 Tier-3 ... Weekly 3 Undergraduate Missing 16.938144 Tier-2 ... Daily 4 Undergraduate Business 24.376094 Tier-2 ... Weekly Property Type Premium Amount Marital & Dependents Status Risk Score \ 0 Detached Home 1945.913327 Family 0.071502 1 Detached Home 10908.896072 Family 0.031080 2 Detached Home 21563.135198 Family NaN 3 Flat 2653.539143 Family 0.046153 4 Detached Home 1269.243463 Family 0.040763 Premium Category Policy Age (Years) Financial Responsibility Score \ 0 Low 1 4.321070e+05 1 High 1 2.231753e+05 2 High 1 5.504430e+05 3 Low 0 1.332514e+06 4 Low 3 1.704993e+06 Healthy Lifestyle Score Claim Frequency 0 3.0 0.400000 1 -1.0 0.500000 2 1.0 0.333333 3 3.0 1.000000 4 1.0 0.000000 [5 rows x 27 columns]
Outliers
- Detect outliers using the IQR method
def detect_outliers_iqr(data, column):
    # Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers[[column]]
# Detecting outliers for key columns
outliers_dict = {}
for col in ["Age", "Annual Income", "Number of Dependents", "Health Score",
            "Previous Claims", "Credit Score", "Insurance Duration", "Premium Amount"]:
    outliers_dict[col] = detect_outliers_iqr(data, col)
# Display outliers count for each column
for col, outliers in outliers_dict.items():
    print(f"Column: {col} - Outliers Count: {len(outliers)}")
Column: Age - Outliers Count: 0
Column: Annual Income - Outliers Count: 108267
Column: Number of Dependents - Outliers Count: 0
Column: Health Score - Outliers Count: 0
Column: Previous Claims - Outliers Count: 369
Column: Credit Score - Outliers Count: 0
Column: Insurance Duration - Outliers Count: 0
Column: Premium Amount - Outliers Count: 68621
from scipy.stats import zscore
# Compute Z-scores
# Note: zscore defaults to nan_policy="propagate". Every column below contains at least
# one NaN, so all z-scores come out as NaN and the "> 3" test is False for every row.
data_numeric = data[["Age", "Annual Income", "Number of Dependents", "Health Score",
                     "Previous Claims", "Credit Score", "Insurance Duration", "Premium Amount"]]
z_scores = np.abs(zscore(data_numeric))
# Get rows where any value has a Z-score above 3
outlier_rows = data[(z_scores > 3).any(axis=1)]
print("Total Outliers Detected (Z-Score Method):", len(outlier_rows))
Total Outliers Detected (Z-Score Method): 0
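A count of 0 from the z-score method sits oddly next to the IQR results, and missing values are the likely cause: `scipy.stats.zscore` propagates NaN by default, so a column with any gap yields NaN z-scores for every row. One NaN-aware alternative, sketched here with pandas on toy data (the helper name `zscore_outlier_mask` is an illustrative assumption), relies on `mean` and `std` skipping NaN:

```python
import numpy as np
import pandas as pd

def zscore_outlier_mask(df: pd.DataFrame, threshold: float = 3.0) -> pd.Series:
    """Flag rows where any column's |z-score| exceeds the threshold, ignoring NaN."""
    # pandas mean/std skip NaN by default; ddof=0 matches scipy's zscore default
    z = (df - df.mean()) / df.std(ddof=0)
    # Comparisons against NaN are False, so missing cells are never flagged
    return (z.abs() > threshold).any(axis=1)

# Toy data: one extreme income and one missing income (hypothetical values)
toy = pd.DataFrame({
    "Age": list(range(20, 41)),
    "Annual Income": [10.0] * 19 + [1000.0, np.nan],
})
mask = zscore_outlier_mask(toy)
print("Rows flagged:", int(mask.sum()))
```

Passing `nan_policy="omit"` to `scipy.stats.zscore` is another route to the same effect.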
Visualizing Outliers with Boxplots
# List of numerical columns
num_cols = ["Age", "Annual Income", "Number of Dependents", "Health Score",
"Previous Claims", "Credit Score", "Insurance Duration", "Premium Amount"]
# Plot boxplots
plt.figure(figsize=(12, 6))
for i, col in enumerate(num_cols, 1):
    plt.subplot(2, 4, i)
    sns.boxplot(y=data[col])
    plt.title(col)
plt.tight_layout()
plt.show()
Analysis of Each Boxplot:
- Age
  - Symmetric distribution with no significant outliers.
  - Most individuals are between 20 and 60 years.
  - The median is around 40 years, meaning half of the customers are below this age.
- Annual Income
  - Highly skewed distribution with many outliers.
  - The median is quite low compared to the maximum, indicating that a few people earn significantly more.
  - A large number of outliers (black dots) indicate extreme high-income values.
  - Possible Action: Consider a log transformation to normalize the income distribution.
- Number of Dependents
  - A relatively balanced distribution with values ranging from 0 to 4.
  - No significant outliers.
  - The median is around 2, so about half of the customers have two or fewer dependents.
- Health Score
  - Data is evenly spread between 0 and 60.
  - The median is around 30-40.
  - No major outliers, indicating a smooth distribution.
- Previous Claims
  - The presence of outliers (above 6 claims) suggests some individuals have significantly more claims than others.
  - The majority of people have 0-2 claims.
  - The median is quite low, meaning most customers have very few claims.
  - Possible Action: Investigate whether these high-claim customers represent fraud cases or legitimate claims.
- Credit Score
  - The data follows a smooth, symmetric distribution.
  - Most values range between 400 and 800.
  - No extreme outliers.
  - The median is around 600, indicating that a typical customer has an average-to-good credit score.
- Insurance Duration
  - Most values fall within 1 to 8 years.
  - The median is around 4-5 years.
  - The distribution is even with no significant outliers.
- Premium Amount
  - Highly skewed with multiple extreme outliers.
  - The median is very low compared to the maximum, indicating a few customers pay very high premiums.
  - Outliers indicate some customers pay significantly more than the majority.
  - Possible Action: Investigate whether high-premium customers are in different policy categories or whether there is an issue in the pricing strategy.
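The two treatments suggested for the skewed columns (log transformation and winsorization) can be sketched as follows. This is a minimal illustration on invented values, not the notebook's actual treatment; `winsorize_series` is a hypothetical helper and the 5%/95% caps are an arbitrary choice.

```python
import numpy as np
import pandas as pd

def winsorize_series(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Cap values outside the [lower, upper] quantiles (hypothetical helper)."""
    lo, hi = s.quantile([lower, upper])
    return s.clip(lower=lo, upper=hi)

# Invented incomes with one extreme value, standing in for data["Annual Income"].
toy_income = pd.Series(
    [20_000, 35_000, 40_000, 52_000, 60_000, 75_000, 9_000_000], dtype=float
)

# log1p compresses the long right tail and is defined at zero.
income_log = np.log1p(toy_income)

# Winsorization keeps every row but caps the extremes at chosen quantiles.
income_capped = winsorize_series(toy_income, lower=0.05, upper=0.95)

print(income_log.round(2).tolist())
print(income_capped.round(0).tolist())
```

A log transform preserves the ordering of customers, while winsorization changes only the extreme values, so the choice depends on whether the extremes are believed to be genuine.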
Metrics
# Creating a dictionary to store metrics
metrics = {
"Total Records": [data.shape[0]],
"Total Columns": [data.shape[1]],
"Missing Values (%)": [data.isnull().sum().sum() / (data.shape[0] * data.shape[1]) * 100],
"Unique Customers": [data["id"].nunique()],
"Average Age": [data["Age"].mean()],
"Median Annual Income": [data["Annual Income"].median()],
"Average Credit Score": [data["Credit Score"].mean()],
"Premium Amount (Mean)": [data["Premium Amount"].mean()],
"Premium Amount (Median)": [data["Premium Amount"].median()],
"Health Score (Avg)": [data["Health Score"].mean()],
"Average Insurance Duration": [data["Insurance Duration"].mean()],
"Total Previous Claims": [data["Previous Claims"].sum()],
"Claim Ratio": [(data["Previous Claims"].sum() / data["Insurance Duration"].sum())],
"Policy Type Distribution": [data["Policy Type"].value_counts(normalize=True).to_dict()],
"Smoking Rate (%)": [(data["Smoking Status"] == "Yes").mean() * 100],
"Exercise Frequency Distribution": [data["Exercise Frequency"].value_counts(normalize=True).to_dict()]
}
# Convert dictionary to DataFrame
metrics_data = pd.DataFrame.from_dict(metrics, orient='index', columns=["Value"])
# Display formatted metrics
print(metrics_data)
                                                                             Value
Total Records                                                              1200000
Total Columns                                                                   27
Missing Values (%)                                                        5.582448
Unique Customers                                                           1200000
Average Age                                                              41.145563
Median Annual Income                                                      858166.0
Average Credit Score                                                     592.92435
Premium Amount (Mean)                                                 25763.411424
Premium Amount (Median)                                                14824.93246
Health Score (Avg)                                                       31.868794
Average Insurance Duration                                                5.018219
Total Previous Claims                                                     838219.0
Claim Ratio                                                               0.139196
Policy Type Distribution         {'Premium': 0.3348716666666667, 'Comprehensive...
Smoking Rate (%)                                                         50.156083
Exercise Frequency Distribution  {'Weekly': 0.25514916666666665, 'Monthly': 0.2...
Explanation of Metrics
- Total Records: Number of rows in the dataset.
- Missing Values (%): Percentage of missing cells across all columns.
- Unique Customers: Number of distinct customer IDs; a value below Total Records would indicate duplicates.
- Average Age: Mean age of policyholders.
- Median Annual Income: Typical income (the median is used because it is robust to outliers).
- Average Credit Score: Important for risk analysis.
- Premium Amount (Mean/Median): Helps in policy pricing insights; a mean well above the median signals a right-skewed distribution.
- Health Score (Avg): Gives an idea of customer health.
- Total Previous Claims: Total number of claims made.
- Claim Ratio: Total claims divided by total insurance duration, i.e. claims per policy-year.
- Policy Type Distribution: Breakdown of different policy types.
- Smoking Rate (%): Percentage of smokers in the dataset.
- Exercise Frequency Distribution: Helps in risk analysis.
Understanding the Pricing Challenge
No direct health records are available. Instead, the available data includes:
- Demographic factors (age, gender, marital status, dependents)
- Financial indicators (annual income, credit score)
- Behavioral aspects (smoking, exercise frequency)
- Insurance history (previous claims, policy duration)
Column Segmentation for Key Insurance Factors
1. Customer Personal Information:
Customer ID - Unique number for each customer. (Example: 2)
Age - Customer's age in years. (Example: 23 years old)
Gender - Either Man or Woman. (Example: Woman)
Marital Status - Shows whether the customer is Married (Spouse Present), Not Married, or Formerly Married. (Example: Spouse Present)
Number of Dependents - How many people depend financially on the customer. (Example: 3 dependents)
Location - Tier classification of their city (Tier-1, Tier-2, Tier-3). (Example: Tier-1 city)
2. Financial & Professional Details:
Annual Income - The customer's yearly earnings in INR. (Example: ₹2,201,772)
Education Level - Highest degree achieved (Undergraduate, Postgraduate, etc.). (Example: Undergraduate)
Occupation - Job category (can be missing). (Example: Business)
Credit Score - A number showing the customer's financial reliability. (Example: 694)
3. Health & Lifestyle Factors
Health Score - A numerical health indicator (lower = unhealthy, higher = fit). (Example: 22.6)
Smoking Status - Yes or No, indicating if the customer smokes. (Example: Yes)
Exercise Frequency - How often they exercise (Daily, Weekly, Monthly, Rarely). (Example: Weekly)
4. Insurance Policy Details
Policy Type - Type of insurance chosen (Basic, Premium, Comprehensive). (Example: Premium)
Previous Claims - Number of past insurance claims made. (Example: 2 claims)
Insurance Duration - How long (in years) the customer has held their policy. (Example: 3 years)
Policy Start Date - Date when the insurance policy started. (Example: December 23, 2023)
Premium Amount - The final insurance cost in INR. (Example: ₹4,700.56)
5. Other Factors Affecting Premium
Customer Feedback - Satisfaction rating (Good, Poor, etc.). (Example: Poor)
Property Type - Type of home (Apartment, Detached Home, etc.). (Example: Detached Home)
Risk-Adjusted Premium Calculation: Rule-Based Approach
Understanding Base Premium
def calculate_base_premium(age, location_risk, policy_type):
    # The age band sets the starting premium (INR)
    if age <= 30:
        base = 5000
    elif age <= 50:
        base = 7500
    else:
        base = 12000
    # NOTE: only the label "High" triggers the location loading;
    # raw tier labels such as "Tier-1" do not match it
    location_factor = 1.2 if location_risk == "High" else 1.0
    policy_factor = 1.5 if policy_type == "Comprehensive" else 1.0
    return base * location_factor * policy_factor
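Risk Adjustment Factors
def calculate_risk_adjustment(health_score, smoking, exercise, claims, credit_score):
    # Start from a multiplier of 1.0 and add one surcharge per risk flag
    adjustment = 1.0
    if health_score < 40:
        adjustment += 0.15
    if smoking == "Yes":
        adjustment += 0.20
    # NOTE: Exercise Frequency is documented as Daily/Weekly/Monthly/Rarely,
    # so this "None" check never matches unless the values are recoded
    if exercise == "None":
        adjustment += 0.10
    if claims > 2:
        adjustment += 0.25
    if credit_score < 600:
        adjustment += 0.20
    return adjustment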
Final Risk-Adjusted Premium Calculation
Risk-Adjusted Premium = Base Premium × (1 + Risk Adjustment)
Here Risk Adjustment is the sum of the applicable surcharges. calculate_risk_adjustment already returns the full multiplier (1 + surcharges), so the code multiplies the base premium by it directly.
def calculate_total_premium(age, location_risk, policy_type, health_score, smoking, exercise, claims, credit_score):
    base_premium = calculate_base_premium(age, location_risk, policy_type)
    risk_adjustment = calculate_risk_adjustment(health_score, smoking, exercise, claims, credit_score)
    total_premium = base_premium * risk_adjustment
    return round(total_premium, 2)
# Apply the rule row by row; column names must match the DataFrame exactly
data['risk_adjusted_premium'] = data.apply(lambda row: calculate_total_premium(
    row['Age'],
    row['Location'],            # passed as location_risk
    row['Policy Type'],
    row['Health Score'],
    row['Smoking Status'],      # passed as smoking
    row['Exercise Frequency'],  # passed as exercise
    row['Previous Claims'],     # passed as claims
    row['Credit Score']
), axis=1)
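Row-wise `apply` over more than a million rows is slow, and the rules compare against labels ("High", "None") that the documented column values (Tier-1/2/3; Daily/Weekly/Monthly/Rarely) never take. The sketch below is one possible vectorized variant, not the notebook's method. It rests on two labeled assumptions that are not in the source: Tier-1 is treated as the high-risk location, and "Rarely" is treated as the no-exercise category. The function name and toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# ASSUMPTIONS (not from the source): which tiers count as high risk,
# and which exercise category earns the inactivity surcharge.
HIGH_RISK_TIERS = {"Tier-1"}
LOW_EXERCISE = {"Rarely"}

def risk_adjusted_premium(df: pd.DataFrame) -> pd.Series:
    """Vectorized version of the base-premium and surcharge rules (hypothetical helper)."""
    # Age bands: <=30 -> 5000, <=50 -> 7500, otherwise 12000
    base = np.select([df["Age"] <= 30, df["Age"] <= 50], [5000, 7500], default=12000)
    location_factor = np.where(df["Location"].isin(HIGH_RISK_TIERS), 1.2, 1.0)
    policy_factor = np.where(df["Policy Type"] == "Comprehensive", 1.5, 1.0)
    # Multiplier = 1 + sum of surcharges; boolean masks act as 0/1
    multiplier = (
        1.0
        + 0.15 * (df["Health Score"] < 40)
        + 0.20 * (df["Smoking Status"] == "Yes")
        + 0.10 * df["Exercise Frequency"].isin(LOW_EXERCISE)
        + 0.25 * (df["Previous Claims"] > 2)
        + 0.20 * (df["Credit Score"] < 600)
    )
    return (base * location_factor * policy_factor * multiplier).round(2)

# Two invented customers; the first loosely follows the example values
# given in the column descriptions.
toy = pd.DataFrame({
    "Age": [23, 55],
    "Location": ["Tier-1", "Tier-3"],
    "Policy Type": ["Premium", "Comprehensive"],
    "Health Score": [22.6, 45.0],
    "Smoking Status": ["Yes", "No"],
    "Exercise Frequency": ["Weekly", "Rarely"],
    "Previous Claims": [2, 3],
    "Credit Score": [694, 550],
})
premiums = risk_adjusted_premium(toy)
print(premiums.tolist())
```

For the first row the arithmetic is 5000 × 1.2 × 1.0 × (1 + 0.15 + 0.20) = 8100; for the second it is 12000 × 1.0 × 1.5 × (1 + 0.10 + 0.25 + 0.20) = 27900.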
# Boxplot: Distribution of risk-adjusted premiums
plt.figure(figsize=(10, 5))
# Use `color` rather than `palette`: seaborn deprecates `palette` without `hue`
sns.boxplot(y=data["risk_adjusted_premium"], color="lightcoral")
plt.title("Distribution of Risk-Adjusted Premiums")
plt.ylabel("Risk-Adjusted Premium")
plt.show()
# Histogram: Frequency distribution of risk-adjusted premiums
plt.figure(figsize=(10, 5))
sns.histplot(data["risk_adjusted_premium"], bins=30, kde=True, color="skyblue")
plt.title("Risk-Adjusted Premium Distribution")
plt.xlabel("Risk-Adjusted Premium")
plt.ylabel("Frequency")
plt.show()
Boxplot: Distribution of Risk-Adjusted Premiums
- The boxplot visualizes the spread and outliers in risk-adjusted premium values.
- The box represents the interquartile range (IQR), i.e. the middle 50% of the data.
- The line inside the box shows the median (middle value).
Insights:
- If the median line sits off-center in the box, the distribution of premiums is skewed.
- A taller box (larger IQR) indicates more variation among the middle 50% of premiums.
Histogram: Risk-Adjusted Premium Distribution
- The histogram shows how frequently different premium values occur.
- The x-axis represents premium amounts, while the y-axis shows the number of customers in each range.
- The KDE (Kernel Density Estimation) curve (smooth line) shows the overall trend of the data.
Insights:
- If the histogram is right-skewed, most customers have lower premiums, but a few pay much more.
- If it's left-skewed, most premiums are high.
- If it's bell-shaped, premiums are approximately normally distributed (balanced risk factors).
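The skew described above can also be quantified rather than judged by eye. A minimal sketch on invented premium values follows; the ±0.5 cut-offs are a common rule of thumb, not something the source specifies.

```python
import pandas as pd

# Invented premiums with one large value, standing in for a premium column.
toy_premiums = pd.Series([4000, 5000, 5500, 6000, 6500, 7000, 8000, 60000], dtype=float)

# Sample skewness: positive means a long right tail, negative a long left tail.
skewness = toy_premiums.skew()

if skewness > 0.5:
    shape = "right-skewed"
elif skewness < -0.5:
    shape = "left-skewed"
else:
    shape = "roughly symmetric"

print(f"skewness = {skewness:.2f} ({shape})")
# In a right-skewed distribution the mean is pulled above the median.
print(f"mean = {toy_premiums.mean():.0f}, median = {toy_premiums.median():.0f}")
```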
EDA (Exploratory Data Analysis)
1. Distribution of Premium Amounts
sns.histplot(data['Premium Amount'], bins=30, kde=True)
plt.title("Distribution of Premium Amounts")
plt.xlabel("Premium Amount")
plt.ylabel("Frequency")
plt.show()
Insight:
- The distribution is right-skewed (positively skewed): most customers pay low premiums.
- A few customers pay significantly higher premiums.
- The tall bars on the left suggest that many customers pay a premium below ₹50,000.
- The long tail on the right suggests some high-risk customers pay significantly more (above ₹100,000).
2. Age Distribution
sns.histplot(data['Age'], bins=20, kde=True)
plt.title("Age Distribution of Customers")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
Insight:
- The number of customers appears evenly spread across different age groups.
- There is no sharp decline in older age groups, suggesting a balanced mix of young and old customers.
- There is a slightly higher customer concentration in the 30+ age groups.
- This could be due to middle-aged individuals purchasing more insurance policies.
3. Health Score Distribution
sns.histplot(data['Health Score'], bins=20, kde=True)
plt.title("Health Score Distribution")
plt.xlabel("Health Score")
plt.ylabel("Frequency")
plt.show()
Insight:
- The distribution is roughly normal, with the highest concentration of customers at health scores of 25-35.
- This suggests that most customers have moderate health scores.
- This range might represent an average-risk group that isn't exceptionally healthy or unhealthy.
- Very few customers have health scores below 10 or above 50, with a noticeable drop above 50.
- Low health scores (<10) may indicate individuals with high-risk factors (e.g., chronic illnesses, smoking).
- High health scores (>50) suggest very fit individuals with minimal risk.
4. Age vs. Premium Amount
# Group data by Age and calculate the average Premium Amount
age_premium_avg = data.groupby("Age")["Premium Amount"].mean().reset_index()
plt.figure(figsize=(10, 5))
sns.lineplot(x=age_premium_avg["Age"], y=age_premium_avg["Premium Amount"], marker="o")
plt.title("Average Premium Amount by Age")
plt.xlabel("Age")
plt.ylabel("Average Premium Amount")
plt.grid(True)
plt.show()
Insights:
- Younger individuals (18-30) get relatively lower and stable premiums.
- Mid-life (30-45) sees a small increase in premium, possibly due to changing risk factors.
- Age 45 experiences a sudden drop, likely due to policy shifts or lower policy participation.
- Age 50+ sees a dramatic premium increase, reflecting higher health risks.
- Older individuals (50-65) experience stable but high premiums.
Premium Amounts by Policy Type
sns.boxplot(x=data['Policy Type'], y=data['Premium Amount'])
plt.title("Policy Type vs. Premium Amount")
plt.xlabel("Policy Type")
plt.ylabel("Premium Amount")
plt.show()
Insight:
- Similar median premiums across all policy types suggest base pricing is comparable.
- Significant variability in premium amounts indicates customized risk-based pricing.
- Outliers (very high premiums) suggest some individuals require expensive coverage.
- Premium policies might offer higher-value plans with broader coverage.
Location Tier vs. Premium Amount
sns.boxplot(x=data['Location'], y=data['Premium Amount'])
plt.title("Location Tier vs. Premium Amount")
plt.xlabel("Location Tier")
plt.ylabel("Premium Amount")
plt.show()
Insight:
- Tier-1 cities may have slightly more high-cost policies, possibly due to increased medical costs.
- Premiums vary widely within each tier, likely due to age, risk factors, and coverage levels.
Marital Status vs. Premium Amount
# Share of total premium collected from each marital-status group
data.groupby('Marital Status')['Premium Amount'].sum().plot(kind='pie', autopct='%1.1f%%')
plt.title("Premium Distribution by Marital Status")
plt.show()
Insights:
- "Not Married" and "Spouse Present" contribute nearly equally to total premium amounts, meaning marital status alone doesn't significantly impact premium spending.
- "Formerly Married" and "Unknown" groups contribute much less, possibly due to lower purchase rates or lower-premium policies.
- Recommendation: target marketing towards the Formerly Married segment to understand their insurance needs better.
Correlation Analysis (Feature Importance)
plt.figure(figsize=(12, 6))
sns.heatmap(data.select_dtypes(include=np.number).corr(), annot=True, cmap='coolwarm', fmt=".2f") # Select only numeric columns
plt.title("Feature Correlation Matrix")
plt.show()
Insight:
- Use credit scores as a risk indicator: Individuals with low credit scores have higher risk scores and claim frequencies.
- Monitor policyholders with past claims: A history of claims is a strong predictor of future claim frequency.
- Tailor premium structures by age and financial responsibility: Higher financial responsibility correlates with lower risk.
- Consider targeted discounts for financially responsible customers: Those with higher credit scores and fewer claims could be offered better policy rates.
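A full correlation matrix can be hard to read for feature importance. One option is to extract only the correlations with the target and rank them by absolute value. The sketch below uses randomly generated toy data in which premium depends on age by construction, so the ranking it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic toy data (seeded for repeatability); not the real dataset.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "Age": rng.integers(18, 65, n),
    "Credit Score": rng.integers(300, 850, n),
})
# Premium is built from Age plus noise, so Age should rank first.
toy["Premium Amount"] = 200 * toy["Age"] + rng.normal(0, 2000, n)

# Correlation of every numeric feature with the target, strongest first.
corr_with_target = (
    toy.select_dtypes(include=np.number)
    .corr()["Premium Amount"]
    .drop("Premium Amount")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target.round(2))
```

Pearson correlation only captures linear association, so a low value here does not rule out a non-linear effect.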
Dependents vs. Premium Amount
# Group by dependents and calculate the average premium
dependents_premium = data.groupby('Number of Dependents')['Premium Amount'].mean()
# Create trend line
z = np.polyfit(dependents_premium.index, dependents_premium.values, 1)
p = np.poly1d(z)
# Plot the line chart
plt.plot(dependents_premium.index, dependents_premium.values, marker='o', linestyle='-', color='blue', label="Avg Premium")
# Plot the trend line
plt.plot(dependents_premium.index, p(dependents_premium.index), linestyle='--', color='red', label="Trend Line")
# Labels and title
plt.title("Number of Dependents vs. Average Premium Amount")
plt.xlabel("Number of Dependents")
plt.ylabel("Average Premium Amount")
plt.legend()
plt.grid(True)
plt.show()
Insights:
- Higher dependents, higher risk pricing: Premiums increase with dependents, meaning insurers factor in greater financial responsibility.
- Review the pricing gap at 2 dependents: The sharp jump may indicate an opportunity to fine-tune risk assessment at this level.
- Consider family-based discounts: If premiums level out at 2-3 dependents, insurers might offer family plans to improve customer retention.
Health Score vs. Exercise Frequency
sns.boxplot(x=data['Exercise Frequency'], y=data['Health Score'])
plt.title("Exercise Frequency vs. Health Score")
plt.xlabel("Exercise Frequency")
plt.ylabel("Health Score")
plt.show()
Insight:
- Re-evaluate how health scores are measured: Since exercise frequency alone doesn't explain health scores in this chart, other factors should be included in the analysis.
- Consider subcategories for exercise types: More specific exercise details (e.g., intensity, duration) may reveal deeper insights.
- Look at additional lifestyle factors: Factors like diet, sleep, and medical history may be influencing health scores more than exercise frequency alone.
Credit Score vs. Previous Claims
# Bin credit scores into 5 equal-width ranges; observed=False keeps the current pandas behaviour explicit
data.groupby(pd.cut(data['Credit Score'], bins=5), observed=False)['Previous Claims'].mean().plot(kind='bar', color='green')
plt.title("Credit Score vs. Average Previous Claims")
plt.xlabel("Credit Score Range")
plt.ylabel("Average Number of Claims")
plt.show()
Insight:
- Investigate other factors: Combine credit scores with other variables (e.g., claim types, income levels, or insurance policies) for deeper insights.
- Examine claim severity: Higher credit score individuals may file more claims, but are they for minor or major incidents?
- Check policy differences: Higher credit score individuals might have better coverage, encouraging them to claim more frequently.
Marital Status vs. Policy Type Preference
pd.crosstab(data['Marital Status'], data['Policy Type']).plot(kind='bar', stacked=True, colormap='viridis')
plt.title("Policy Type Preference by Marital Status")
plt.xlabel("Marital Status")
plt.ylabel("Count of Customers")
plt.legend(title="Policy Type")
plt.show()
Insight:
Target Premium Policies to Not Married and Spouse Present Customers
- The Not Married and Spouse Present groups are high-value targets for premium upgrades, as they already show some interest.
Investigate Low Adoption Among Formerly Married Customers
- Lower preference for Comprehensive and Premium policies may indicate affordability concerns or different risk perceptions.
Review Data for "Unknown" Group
- Since this group is small, it may contain data entry errors or customers needing follow-ups.
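Because the groups differ in size, raw counts in a stacked bar chart can hide differences in preference. A row-normalized crosstab shows the share of each policy type within each marital-status group instead. The rows below are invented for illustration only.

```python
import pandas as pd

# Invented customers; group sizes are deliberately unequal.
toy = pd.DataFrame({
    "Marital Status": ["Not Married"] * 4 + ["Spouse Present"] * 4 + ["Formerly Married"] * 2,
    "Policy Type": [
        "Basic", "Premium", "Premium", "Comprehensive",   # Not Married
        "Basic", "Basic", "Premium", "Comprehensive",     # Spouse Present
        "Basic", "Basic",                                 # Formerly Married
    ],
})

# normalize="index" divides each row by its total, so every row sums to 1.
shares = pd.crosstab(toy["Marital Status"], toy["Policy Type"], normalize="index")
print(shares.round(2))
```

Plotting `shares` with `.plot(kind="bar", stacked=True)` gives bars of equal height, which makes preferences comparable across groups.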
Policy Tenure Analysis
plt.figure(figsize=(8, 5))
sns.histplot(data["Policy Age (Years)"], bins=20, kde=True, color="red")
plt.title("Distribution of Policy Age")
plt.xlabel("Policy Age (Years)")
plt.ylabel("Number of Customers")
plt.show()
Insights:
- If the bars are concentrated on the left (lower policy ages), i.e. the distribution is right-skewed, most customers have relatively new policies. This could indicate high recent acquisition rates or low long-term retention.
- If there is a gradual decline in the number of customers as policy age increases, it may indicate customers are not renewing policies over time.
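`Policy Age (Years)` is not among the raw columns described earlier, so it is presumably derived from `Policy Start Date`. The sketch below shows one way such a column could be computed. The reference date is an assumption chosen for illustration, and the start dates are invented apart from the example date quoted in the column description.

```python
import pandas as pd

# Invented policy start dates standing in for data["Policy Start Date"].
toy = pd.DataFrame({"Policy Start Date": ["2023-12-23", "2020-06-01", "2019-01-15"]})

# ASSUMPTION: policy age is measured against a fixed reference date.
reference_date = pd.Timestamp("2025-01-01")

start = pd.to_datetime(toy["Policy Start Date"])
# Days elapsed divided by the average year length (365.25 days).
toy["Policy Age (Years)"] = (reference_date - start).dt.days / 365.25

print(toy.round(2))
```

Using a fixed reference date rather than today's date keeps the derived column reproducible between runs.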
Healthy Lifestyle Score vs. Premium Amount
plt.figure(figsize=(8, 5))
# Assign the x variable to `hue` so `palette` is applied without a deprecation warning
sns.violinplot(data=data, x="Healthy Lifestyle Score", y="Premium Amount",
               hue="Healthy Lifestyle Score", palette="coolwarm", legend=False)
plt.title("Healthy Lifestyle Score vs. Premium Amount")
plt.xlabel("Healthy Lifestyle Score")
plt.ylabel("Premium Amount")
plt.show()
Insights:
Encourage Healthier Lifestyles to Reduce Risk and Premium Variability
- Since healthier individuals generally pay more stable premiums, insurers can offer incentives (discounts, rewards) for maintaining a healthy lifestyle.
Investigate High-Premium Outliers
- Some customers pay very high premiums even with good lifestyle scores; this warrants further analysis to see if they qualify for discounts or adjustments.
Standardize Premiums for Lower Lifestyle Scores with Personalized Adjustments
- The wider spread in lower scores suggests that premium calculations may not be uniform.
- Insurers could refine risk assessment models to reduce premium inconsistencies for lower-scoring individuals.
Hypothesis
1. Customer Demographics & Premium Pricing
Hypothesis-1 - Impact of Age & Credit Score on Insurance Premiums
from scipy.stats import pearsonr
# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Age', 'Premium Amount', 'Credit Score'])
# Scatter plot with regression line (Age vs. Premium Amount)
plt.figure(figsize=(8, 5))
sns.regplot(data=data, x="Age", y="Premium Amount", scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
plt.title("Age vs. Premium Amount")
plt.xlabel("Customer Age")
plt.ylabel("Premium Amount")
plt.show()
# Compute Pearson correlation (Age & Premium Amount)
corr_age, p_age = pearsonr(data['Age'], data['Premium Amount'])
print(f"Correlation between Age and Premium Amount: {corr_age:.2f}, p-value: {p_age:.4f}")
# Scatter plot (Credit Score vs. Premium Amount)
plt.figure(figsize=(8, 5))
sns.regplot(data=data, x="Credit Score", y="Premium Amount", scatter_kws={'alpha': 0.5}, line_kws={'color': 'green'})
plt.title("Credit Score vs. Premium Amount")
plt.xlabel("Credit Score")
plt.ylabel("Premium Amount")
plt.show()
# Compute Pearson correlation (Credit Score & Premium Amount)
corr_credit, p_credit = pearsonr(data['Credit Score'], data['Premium Amount'])
print(f"Correlation between Credit Score and Premium Amount: {corr_credit:.2f}, p-value: {p_credit:.4f}")
# Grouped analysis: Average premium per age group & credit score category
# include_lowest=True keeps values exactly at the lowest edge (age 18, score 300) in the first bin
data["Age Group"] = pd.cut(data["Age"], bins=[18, 30, 40, 50, 60, 80],
                           labels=["18-30", "31-40", "41-50", "51-60", "61-80"], include_lowest=True)
data["Credit Category"] = pd.cut(data["Credit Score"], bins=[300, 500, 650, 750, 850],
                                 labels=["Poor", "Fair", "Good", "Excellent"], include_lowest=True)
# Average Premium by Age Group & Credit Score Category
age_credit_premium = data.groupby(["Age Group", "Credit Category"], observed=False)["Premium Amount"].mean().unstack()
# Heatmap for Age Group vs. Credit Score & Premium Amount
plt.figure(figsize=(8, 5))
sns.heatmap(age_credit_premium, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Average Premium Amount by Age Group and Credit Score")
plt.xlabel("Credit Score Category")
plt.ylabel("Age Group")
plt.show()
Correlation between Age and Premium Amount: 0.25, p-value: 0.0000
Correlation between Credit Score and Premium Amount: -0.07, p-value: 0.0000
Key Insights:
1. Age & Premium Amount Relationship:
- The Pearson correlation is 0.25: a weak positive relationship, so older customers tend to have higher premiums.
- The p-value is below 0.05 (printed as 0.0000), so the relationship is statistically significant. With 1.2 million records even small correlations reach significance, so the size of the correlation matters more than the p-value.
2. Credit Score & Premium Amount Relationship:
- The Pearson correlation is -0.07: a very weak negative relationship. Higher credit scores are associated with marginally lower premiums, consistent with the idea that financially responsible customers are lower-risk, but credit score explains very little of the variation on its own.
- The p-value is again below 0.05, so the relationship is statistically significant despite its small size.
3. Combined Impact of Age & Credit Score on Premiums:
- The heatmap shows how average premiums vary across age groups and credit score categories.
- Given the correlations above, the larger differences are expected along the age axis (higher averages in older groups), with smaller differences across credit categories (slightly lower averages for better credit).
- Both age and credit score are associated with premium pricing, but age shows the stronger linear association.
Conclusion:
- Both Age and Credit Score have a statistically significant association with Premium Amount (p < 0.05).
- Age has the stronger association (r = 0.25); Credit Score has only a very weak inverse correlation with premiums (r = -0.07), meaning customers with better credit pay marginally less.
- Older customers and, to a much smaller degree, those with poor credit tend to be charged higher premiums, likely due to risk assessments by insurance providers.
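Premium Amount is strongly right-skewed, and Pearson's r is sensitive to extreme values. Reporting Spearman's rank correlation alongside it, together with r² as an effect size, is one way to check how robust the conclusions are. The sketch below uses synthetic, seeded data with a built-in positive relationship, so its numbers are illustrative only.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic toy data: a skewed "premium" that rises with age by construction.
rng = np.random.default_rng(42)
age = rng.integers(18, 65, 1000)
premium = 1000 * np.exp(0.03 * age + rng.normal(0, 0.5, 1000))

# Pearson measures linear association; r**2 is the share of variance explained.
r, p_r = pearsonr(age, premium)
# Spearman works on ranks, so it is unaffected by how extreme the large values are.
rho, p_rho = spearmanr(age, premium)

print(f"Pearson r = {r:.2f} (r^2 = {r**2:.2f}), p = {p_r:.3g}")
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.3g}")
```

Applied to the figures reported above, r = 0.25 corresponds to r² ≈ 0.06 and r = -0.07 to r² ≈ 0.005, which puts the "statistically significant" results in proportion.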
Hypothesis-2 - Impact of Education Level on Premium Amount: Distribution, Averages, and Statistical Analysis
from scipy.stats import f_oneway
# Data Cleaning: Remove missing or invalid values
data = data.dropna(subset=['Education Level', 'Occupation', 'Premium Amount'])
# Box Plot: Premium Distribution by Education Level & Occupation
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x="Education Level", y="Premium Amount", hue="Occupation", palette="Set2")
plt.title("Premium Amount Distribution by Education Level & Occupation")
plt.xlabel("Education Level")
plt.ylabel("Premium Amount")
plt.xticks(rotation=45)
plt.legend(title="Occupation", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()
# Compute Average Premium per Education Level & Occupation
education_occupation_premium = data.groupby(["Education Level", "Occupation"])["Premium Amount"].mean().reset_index()
# Bar Plot: Average Premium by Education Level & Occupation
plt.figure(figsize=(12, 6))
sns.barplot(data=education_occupation_premium, x="Education Level", y="Premium Amount", hue="Occupation", palette="Blues_d")
plt.title("Average Premium Amount by Education Level & Occupation")
plt.xlabel("Education Level")
plt.ylabel("Average Premium Amount")
plt.xticks(rotation=45)
plt.legend(title="Occupation", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()
Insights:
- Customers with higher education levels may have lower premium amounts due to perceived financial stability.
- Certain occupations (e.g., high-risk jobs) may lead to increased insurance premiums, even among highly educated individuals.
- Occupation plays a significant role in premium pricing, as jobs with higher risk profiles may result in higher insurance costs.
- White-collar professionals might receive lower premium rates compared to blue-collar workers.
Recommendations:
Dynamic Pricing Strategy:
- Implement a more customized pricing model based on both education level and occupation, rather than considering only one factor.
Policy Adjustments for High-Risk Occupations:
- Offer premium discounts for stable jobs while considering risk mitigation for hazardous occupations.
Personalized Insurance Plans:
- Introduce flexible policy plans that cater to different education and occupation groups to attract more customers.
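The code for this hypothesis imports `f_oneway` but stops at the plots. A one-way ANOVA across education levels would supply the statistical test the heading refers to. The sketch below runs it on synthetic, seeded data; the education labels and premium values are invented, and one group is given a higher mean on purpose so the test has something to detect.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic toy data; labels and values are invented for illustration.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Education Level": np.repeat(["High School", "Undergraduate", "Postgraduate"], 100),
    "Premium Amount": np.concatenate([
        rng.normal(15000, 3000, 100),
        rng.normal(15200, 3000, 100),
        rng.normal(18000, 3000, 100),  # deliberately higher mean
    ]),
})

# One array of premiums per education level
groups = [g["Premium Amount"].to_numpy() for _, g in toy.groupby("Education Level")]

# H0: all education levels share the same mean premium
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

ANOVA assumes roughly normal groups with similar variances; for the skewed premium data a log transform or the Kruskal-Wallis test (`scipy.stats.kruskal`) may be a safer choice.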
Hypothesis-3 - Impact of Annual Income on Premium Amount and Pricing Strategies
from scipy.stats import pearsonr, f_oneway
# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Annual Income', 'Premium Amount', 'Credit Score', 'Health Score'])
# Scatter plot: Annual Income vs. Premium Amount with regression line
plt.figure(figsize=(8,5))
sns.regplot(data=data, x="Annual Income", y="Premium Amount", scatter_kws={'alpha':0.5}, line_kws={'color':'red'})
plt.title("Annual Income vs. Premium Amount")
plt.xlabel("Annual Income")
plt.ylabel("Premium Amount")
plt.show()
# Compute Pearson correlation for multiple variables
corr_income, p_income = pearsonr(data['Annual Income'], data['Premium Amount'])
corr_credit, p_credit = pearsonr(data['Credit Score'], data['Premium Amount'])
corr_health, p_health = pearsonr(data['Health Score'], data['Premium Amount'])
print(f"Correlation between Annual Income and Premium Amount: {corr_income:.2f}, p-value: {p_income:.4f}")
print(f"Correlation between Credit Score and Premium Amount: {corr_credit:.2f}, p-value: {p_credit:.4f}")
print(f"Correlation between Health Score and Premium Amount: {corr_health:.2f}, p-value: {p_health:.4f}")
# Grouped analysis: Average premium per income group
# include_lowest=True keeps an income of exactly 0 in the first bin
data["Income Group"] = pd.cut(data["Annual Income"], bins=[0, 30000, 60000, 100000, 200000, np.inf],
                              labels=["<30K", "30K-60K", "60K-100K", "100K-200K", "200K+"], include_lowest=True)
income_premium = data.groupby("Income Group", observed=False)["Premium Amount"].mean().reset_index()
# Bar plot for Income Groups vs. Average Premium
plt.figure(figsize=(8,5))
# Assign the x variable to `hue` so `palette` is applied without a deprecation warning
sns.barplot(data=income_premium, x="Income Group", y="Premium Amount",
            hue="Income Group", palette="Blues_d", legend=False)
plt.title("Average Premium Amount by Income Group")
plt.xlabel("Income Group")
plt.ylabel("Average Premium Amount")
plt.show()
Correlation between Annual Income and Premium Amount: 0.01, p-value: 0.0000
Correlation between Credit Score and Premium Amount: -0.07, p-value: 0.0000
Correlation between Health Score and Premium Amount: 0.17, p-value: 0.0000
Insights:
1. Income and Premium Amount:
- The Pearson correlation between Annual Income and Premium Amount is 0.01. It is statistically significant because of the sample size, but practically negligible: income has almost no linear relationship with the premium paid.
2. Credit Score and Premium Amount:
- A very weak negative correlation (-0.07) exists between Credit Score and Premium Amount, implying that individuals with higher credit scores pay marginally lower premiums.
3. Health Score and Premium Amount:
- A weak positive correlation (0.17) is observed between Health Score and Premium Amount: customers with higher health scores tend to pay slightly higher premiums. Since higher scores are described as fitter, this is the opposite of what pure risk-based pricing would produce, and may reflect healthier customers choosing better coverage plans.
4. Income Group Analysis:
- The bar chart compares average premiums across income bands; given the near-zero correlation, differences between bands should be read as small.
- The band edges (30K to 200K) are low relative to the median annual income of 858,166, so more than half of all customers fall into the 200K+ band.
Conclusion:
- Income, Credit Score, and Health Score are all statistically significant (p < 0.05), but their effect sizes differ widely.
- Income has virtually no linear association with premium (r = 0.01).
- Better credit scores are associated with marginally lower premiums (r = -0.07).
- Higher health scores are associated with slightly higher premiums (r = 0.17), not lower ones, so health-based discounts are not visible in the existing premium amounts.
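Fixed income bands only work when the edges match the scale of the data. Quantile-based bands from `pd.qcut` adapt to the data and give groups of roughly equal size. This is a sketch of an alternative binning, not the notebook's method; the incomes are invented and the quartile labels are arbitrary.

```python
import pandas as pd

# Invented annual incomes (INR) standing in for data["Annual Income"].
toy_income = pd.Series(
    [150_000, 420_000, 610_000, 858_000, 1_200_000, 2_200_000, 3_500_000, 9_000_000]
)

# q=4 splits at the 25th, 50th and 75th percentiles of the data itself.
income_band = pd.qcut(toy_income, q=4, labels=["Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"])

# Each band holds about a quarter of the customers, regardless of the income scale.
print(income_band.value_counts(sort=False))
```

Grouping `Premium Amount` by such bands makes the group means comparable, since no band is nearly empty.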
2. Risk Factors & Adjustments
Hypothesis-1 - Box Plot for Premium Distribution by Credit Score & Health Score Groups
# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Credit Score', 'Health Score', 'Premium Amount'])
# Create categorical bins for Credit Score and Health Score
# include_lowest=True keeps values exactly at the lowest edge (300 or 0) in the first bin
data["Credit Score Group"] = pd.cut(data["Credit Score"], bins=[300, 500, 700, 900],
                                    labels=["Low (300-500)", "Medium (500-700)", "High (700-900)"],
                                    include_lowest=True)
data["Health Score Group"] = pd.cut(data["Health Score"], bins=[0, 40, 70, 100],
                                    labels=["Poor (0-40)", "Average (40-70)", "Good (70-100)"],
                                    include_lowest=True)
# Box Plot: Premium Amount Distribution across Credit Score & Health Score Groups
plt.figure(figsize=(10,6))
sns.boxplot(data=data, x="Credit Score Group", y="Premium Amount", hue="Health Score Group", palette="coolwarm")
plt.title("Premium Amount Distribution by Credit Score & Health Score Group")
plt.xlabel("Credit Score Group")
plt.ylabel("Premium Amount")
plt.legend(title="Health Score Group")
plt.show()
Insights:
1. Premium Amount Distribution:
- The median premium amount is relatively similar across all credit score groups.
- However, there are significant variations in premium amounts, with a large number of outliers extending towards high premium values.
2. Impact of Credit Score on Premium Amount:
- Customers with low (300-500), medium (500-700), and high (700-900) credit scores do not show major differences in premium amounts.
- This suggests that credit score alone may not be a strong determining factor for premium pricing.
3. Influence of Health Score:
- Poor health score (0-40) is associated with a slightly lower median premium compared to Average (40-70) and Good (70-100) health scores.
- Higher health scores seem to be linked with slightly higher premium amounts, possibly due to better coverage plans.
Conclusion:
- Credit score alone is not a strong determinant of premium amount variations.
- Health score influences premium amounts, with better health leading to slightly higher premium values.
- The presence of many high outliers suggests that some customers opt for high-premium plans regardless of their credit or health score.
- Further analysis could explore additional factors like age, policy type, or coverage amount to better understand premium distribution.
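Reading medians off a box plot is subjective, so a rank-based test can back up the observation that credit score groups look alike. The sketch below uses the Kruskal-Wallis test from `scipy.stats`; the synthetic frame `demo` is an assumption, built so that premiums are independent of credit score.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in: premiums drawn independently of credit score (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Credit Score": rng.integers(300, 901, 600),
    "Premium Amount": rng.lognormal(mean=6.8, sigma=0.6, size=600),
})
demo["Credit Score Group"] = pd.cut(
    demo["Credit Score"],
    bins=[300, 500, 700, 900],
    labels=["Low (300-500)", "Medium (500-700)", "High (700-900)"],
    include_lowest=True,  # keep scores of exactly 300
)

grouped = demo.groupby("Credit Score Group", observed=True)["Premium Amount"]
group_medians = grouped.median()

# Kruskal-Wallis: do the groups come from the same distribution?
samples = [values.to_numpy() for _, values in grouped]
h_stat, p_value = stats.kruskal(*samples)

print(group_medians.round(1))
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")
```

A large p-value on the real data would support the conclusion that credit score alone does not separate premium levels; a small one would call for a closer look than the box plot provides.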
Hypothesis-2 - Smoking Status and Age on Insurance Premium Amounts
import scipy.stats as stats
# Remove NaN values
data = data.dropna(subset=['Smoking Status', 'Premium Amount', 'Age'])
# Create Age Groups
bins = [18, 30, 45, 60, np.inf] # Age brackets
labels = ["18-30", "30-45", "45-60", "60+"]
data["Age Group"] = pd.cut(data["Age"], bins=bins, labels=labels, right=False)
# Box Plot: Premium Amount by Smoking Status & Age Group
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x="Smoking Status", y="Premium Amount", hue="Age Group", palette="coolwarm")
plt.title("Premium Amount Distribution by Smoking Status & Age Group")
plt.xlabel("Smoking Status (Non-Smoker vs. Smoker)")
plt.ylabel("Premium Amount")
plt.legend(title="Age Group")
plt.show()
# Bar Plot: Average Premium by Smoking Status & Age Group
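# observed=True keeps only category combinations present in the data and avoids the pandas FutureWarning
avg_premium = data.groupby(["Smoking Status", "Age Group"], observed=True)["Premium Amount"].mean().reset_index()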
plt.figure(figsize=(10, 6))
sns.barplot(data=avg_premium, x="Smoking Status", y="Premium Amount", hue="Age Group", palette="Reds_d")
plt.title("Average Premium Amount by Smoking Status & Age Group")
plt.xlabel("Smoking Status (Non-Smoker vs. Smoker)")
plt.ylabel("Average Premium Amount")
plt.legend(title="Age Group")
plt.show()
Insights:
1. Premium Amount Distribution by Smoking Status & Age Group (Box Plot)
- Overall Trend: Smokers generally have higher premium amounts compared to non-smokers across all age groups.
- Age-Wise Variability: Premium amounts tend to increase with age, with the 60+ group having the highest premiums.
- Significant outliers exist across all categories, especially in older age groups.
- Older age groups (45-60, 60+) show higher variability in premium amounts, likely due to health conditions and risk factors.
- The distributions seem right-skewed, indicating a few individuals with very high premiums.
2. Average Premium Amount by Smoking Status & Age Group (Bar Plot)
- Smokers vs. Non-Smokers: Smokers consistently pay more than non-smokers across all age groups.
- Age Effect: The average premium amount increases with age, regardless of smoking status.
- The difference between smokers and non-smokers is most pronounced in older age groups (45-60, 60+).
- In younger age groups (18-30, 30-45), the impact of smoking on premium amounts is less drastic, likely due to lower health risks in younger individuals.
Recommendations:
- Adjust premiums based on age and smoking status, with steeper increases for older smokers.
- Introduce health and wellness programs to encourage non-smoking habits, potentially offering discounts for quitting smoking.
- Consider additional risk factors beyond age and smoking to refine premium calculations.
- Younger individuals may not see significant premium differences, but the gap widens with age, making non-smoking a cost-saving choice.
- Quitting smoking at an earlier age can significantly reduce future premium costs.
- Older individuals (especially 45+) should consider health programs or premium discounts for maintaining good health.
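The first recommendation, steeper increases for older smokers, can be expressed as a multiplicative loading on a base premium. This is a hypothetical sketch: `AGE_FACTORS`, `SMOKER_SURCHARGE`, and `adjusted_premium` are names and values invented for illustration, not rates derived from the analysis.

```python
# Hypothetical age loadings relative to the youngest bracket (assumed values)
AGE_FACTORS = {"18-30": 1.00, "30-45": 1.15, "45-60": 1.40, "60+": 1.80}

# Hypothetical smoker surcharge that grows with age (assumed values)
SMOKER_SURCHARGE = {"18-30": 0.10, "30-45": 0.15, "45-60": 0.25, "60+": 0.35}


def adjusted_premium(base_premium: float, age_group: str, smoker: bool) -> float:
    """Apply the age loading, then the age-dependent smoker surcharge."""
    factor = AGE_FACTORS[age_group]
    if smoker:
        factor *= 1.0 + SMOKER_SURCHARGE[age_group]
    return round(base_premium * factor, 2)


# Compare the smoker vs. non-smoker gap in each age bracket for a base of 1000
for group in AGE_FACTORS:
    non_smoker = adjusted_premium(1000, group, smoker=False)
    smoker = adjusted_premium(1000, group, smoker=True)
    print(group, non_smoker, smoker, round(smoker - non_smoker, 2))
```

Because the surcharge multiplies an age factor that is already higher, the absolute gap between smokers and non-smokers widens with age, which mirrors the pattern described in the insights.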
Hypothesis-3 - Exercise Frequency on Insurance Premium Discounts
import scipy.stats as stats
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Remove NaN values
data = data.dropna(subset=['Exercise Frequency', 'Age Group', 'Premium Amount'])
# Box plot to visualize premium differences based on exercise frequency and age group
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x="Exercise Frequency", y="Premium Amount", hue="Age Group", palette="coolwarm")
plt.title("Distribution of Premium Amount by Exercise Frequency & Age Group")
plt.xlabel("Exercise Frequency (Low to High)")
plt.ylabel("Premium Amount")
plt.legend(title="Age Group")
plt.show()
# Bar plot for average premium by exercise frequency and age group
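# observed=True keeps only category combinations present in the data and avoids the pandas FutureWarning
avg_premium = data.groupby(["Exercise Frequency", "Age Group"], observed=True)["Premium Amount"].mean().reset_index()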
plt.figure(figsize=(10, 6))
sns.barplot(data=avg_premium, x="Exercise Frequency", y="Premium Amount", hue="Age Group", palette="Greens_d")
plt.title("Average Premium Amount by Exercise Frequency & Age Group")
plt.xlabel("Exercise Frequency (Low to High)")
plt.ylabel("Average Premium Amount")
plt.legend(title="Age Group")
plt.show()
Insights:
Box Plot Insights:
- Across all exercise frequency categories, premium amounts vary widely.
- Higher age groups (especially 45-60 and 60+) generally have higher premium amounts, suggesting that older individuals may be paying more for insurance.
- Exercise frequency alone does not seem to dramatically change premium amounts across all age groups, though individuals who exercise rarely or not at all tend to have slightly higher premiums.
Bar Chart Insights:
- The "Rarely" exercise group has the highest average premium across all age groups, reinforcing the idea that less frequent exercise correlates with higher insurance premiums.
- The premium increases as age progresses, regardless of exercise frequency.
- Among those who exercise regularly (Daily, Weekly, Monthly), premiums are relatively lower compared to those who rarely exercise.
- The age group 60+ consistently pays the highest premiums, which suggests that insurance companies consider age a significant factor in premium calculations.
Conclusion:
- Age is a key determinant in premium amount calculations, with older individuals paying significantly more.
- Exercise frequency does impact premiums, but the effect is more noticeable in older age groups. Those who rarely exercise pay the highest premiums.
- Insurance companies may be factoring in lifestyle choices (such as exercise frequency) along with age to determine premium rates.
- Encouraging a more active lifestyle could be beneficial for individuals looking to reduce insurance costs in the long run.
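To quantify how much more the "Rarely" group pays within each age bracket, the group means can be arranged in a pivot table and compared directly. The sketch below runs on synthetic data; `demo` and the derived `gap_vs_rarely` series are assumptions for illustration, and the premiums here are random, so the gaps it prints carry no meaning beyond showing the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's data (illustrative values only)
rng = np.random.default_rng(7)
n = 800
demo = pd.DataFrame({
    "Exercise Frequency": rng.choice(["Daily", "Weekly", "Monthly", "Rarely"], n),
    "Age": rng.integers(18, 80, n),
    "Premium Amount": rng.lognormal(6.8, 0.6, n),
})
demo["Age Group"] = pd.cut(
    demo["Age"],
    bins=[18, 30, 45, 60, np.inf],
    labels=["18-30", "30-45", "45-60", "60+"],
    right=False,
)

# Mean premium per age group (rows) and exercise frequency (columns)
pivot = demo.pivot_table(
    index="Age Group",
    columns="Exercise Frequency",
    values="Premium Amount",
    aggfunc="mean",
    observed=True,
)

# Gap between the "Rarely" group and the average of the more active groups
active_mean = pivot[["Daily", "Weekly", "Monthly"]].mean(axis=1)
gap_vs_rarely = (pivot["Rarely"] - active_mean).rename("Rarely minus active mean")

print(pivot.round(1))
print(gap_vs_rarely.round(1))
```

A consistently positive gap that grows with age on the real data would support the conclusion that exercise matters more for older customers.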
3. Policy Type & Affordability
Hypothesis-1 - Premium Distribution Across Policy Types & Age Groups
# Set figure size
plt.figure(figsize=(10, 6))
# Create a violin plot with three columns: Policy Type, Premium Amount, and Age Group
sns.violinplot(data=data, x="Policy Type", y="Premium Amount", hue="Age Group", palette="muted", inner="quartile", split=True)
# Add labels and title
plt.title("Premium Distribution by Policy Type & Age Group")
plt.xlabel("Policy Type")
plt.ylabel("Premium Amount")
plt.legend(title="Age Group")
# Show plot
plt.show()
Insights:
Premium Variation by Policy Type:
- Different policy types exhibit varying premium distributions.
- Some policy types have a wider spread, indicating a greater range of premium values.
Influence of Age Group:
- Older age groups (e.g., 45-60, 60+) generally show higher premiums across all policy types. Younger age groups (18-30, 30-45) tend to have lower premiums but with some overlap in distributions.
Policy Types & Premium Spread:
- Certain policy types have tighter distributions, suggesting more standardized pricing.
- Others have wider spreads, indicating varied pricing based on customer profiles.
Conclusion:
- Age plays a crucial role in determining premium amounts across different policy types.
- Some policy types have more variability in premium amounts, possibly due to factors like coverage, risk, or additional benefits.
- Older individuals generally pay higher premiums, reinforcing the need for tailored policy structures.
- The split distribution in violin plots highlights key variations across policy types and age groups, which can help in designing personalized insurance pricing strategies.
KDE Plot: Premium Distribution by Policy Type & Age Group
# Set figure size
plt.figure(figsize=(10, 6))
# Loop through age groups to plot separate KDEs
age_groups = ["18-30", "30-45", "45-60", "60+"]
colors = ["blue", "green", "orange", "red"]
for age, color in zip(age_groups, colors):
    # Dashed curve: Comprehensive policies for this age group
    sns.kdeplot(data=data[(data["Policy Type"] == "Comprehensive") & (data["Age Group"] == age)]["Premium Amount"],
                label=f"Comprehensive - {age}", fill=True, color=color, linestyle="dashed")
    # Solid curve: Basic policies for this age group
    sns.kdeplot(data=data[(data["Policy Type"] == "Basic") & (data["Age Group"] == age)]["Premium Amount"],
                label=f"Basic - {age}", fill=True, color=color)
# Add labels and title
plt.title("Density Plot of Premium Amounts by Policy Type & Age Group")
plt.xlabel("Premium Amount")
plt.ylabel("Density")
plt.legend()
plt.show()
Insights:
Different Age Groups Have Distinct Premium Distributions:
- Younger age groups (18-30, 30-45) tend to have lower premium amounts with a sharper peak at the lower end.
- Older age groups (45-60, 60+) show higher premium amounts with a wider spread.
Age Impact on Premium Distributions:
- For all policy types, premiums increase with age, but the density spread is more significant for comprehensive policies.
- The dashed KDE lines (Comprehensive) show a wider distribution compared to solid Basic lines, suggesting that comprehensive policies allow for more customized pricing.
Conclusion:
- Older individuals pay higher premiums across both policy types, reflecting risk-based pricing in insurance.
- Comprehensive policies have more varied pricing, especially for older age groups, due to additional coverage and customization options.
- Basic policies maintain a more uniform structure, with lower premium variability.
- The KDE plot visually confirms the premium differences across age groups and policy types, helping insurance providers refine pricing strategies.
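The claim that Comprehensive policies have more varied pricing than Basic ones can be backed by spread statistics rather than curve width alone. The following sketch computes the median, interquartile range, and coefficient of variation per policy type. The synthetic data deliberately gives Comprehensive premiums a larger spread; that is an assumption made to show the mechanics, not a result.

```python
import numpy as np
import pandas as pd


def iqr(s: pd.Series) -> float:
    """Interquartile range: distance between the 75th and 25th percentiles."""
    return s.quantile(0.75) - s.quantile(0.25)


rng = np.random.default_rng(1)
n = 1000
policy = rng.choice(["Basic", "Comprehensive"], n)

# Assumption for the demo: Comprehensive premiums are drawn with a larger sigma
sigma = np.where(policy == "Comprehensive", 0.8, 0.3)
demo = pd.DataFrame({
    "Policy Type": policy,
    "Premium Amount": np.exp(rng.normal(6.8, sigma)),
})

# Spread statistics per policy type
spread = demo.groupby("Policy Type")["Premium Amount"].agg(
    median="median",
    iqr=iqr,
    cv=lambda s: s.std() / s.mean(),  # coefficient of variation
)
print(spread.round(3))
```

Adding "Age Group" to the `groupby` keys would give the same statistics per policy type and age bracket, matching the panels of the KDE plot.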
Hypothesis-2 - Distribution of Premiums by Insurance Duration
# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Insurance Duration', 'Premium Amount', 'Customer Feedback'])
# Violin Plot: Premium Distribution by Insurance Duration & Customer Feedback
plt.figure(figsize=(10, 6))
sns.violinplot(data=data, x="Insurance Duration", y="Premium Amount", hue="Customer Feedback",
palette="coolwarm", split=True, inner="quartile")
# Add labels and title
plt.title("Premium Distribution by Insurance Duration & Customer Feedback")
plt.xlabel("Insurance Duration (Years)")
plt.ylabel("Premium Amount")
plt.legend(title="Customer Feedback")
plt.show()
Insights:
Premiums Vary Across Insurance Duration:
- Shorter insurance durations (1-3 years) tend to have lower premium amounts.
- Longer durations (5+ years) exhibit higher premiums, suggesting that long-term policies are priced higher due to extended coverage.
Customer Feedback Correlation with Premiums:
- Positive feedback is more concentrated in lower premium ranges, implying that customers are generally satisfied with affordable insurance.
- Negative feedback is spread across higher premiums, indicating potential dissatisfaction with costlier policies.
Higher Premiums Show Greater Variability:
- The distribution of premium amounts is wider for longer insurance durations, meaning that pricing is more flexible for extended policies.
- Customers selecting longer durations may negotiate or receive different premium structures depending on risk factors.
Conclusion:
- Premiums increase with insurance duration, but customer satisfaction varies depending on pricing.
- Higher premium policies tend to have more varied customer feedback, suggesting that factors other than price (e.g., coverage, service quality) impact satisfaction.
- Insurers should optimize pricing strategies for longer-duration policies to balance affordability and customer satisfaction.
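Whether positive feedback concentrates in lower premium ranges can be checked by tabulating feedback shares per premium quartile. This is a minimal sketch on synthetic data: the feedback labels "Poor", "Average", and "Good" are assumed, and feedback is drawn independently of premium, so the shares it prints are roughly flat.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (feedback labels are assumed, not taken from the dataset)
rng = np.random.default_rng(3)
n = 600
demo = pd.DataFrame({
    "Premium Amount": rng.lognormal(6.8, 0.6, n),
    "Customer Feedback": rng.choice(["Poor", "Average", "Good"], n),
})

# Split premiums into four equally sized bands
demo["Premium Quartile"] = pd.qcut(
    demo["Premium Amount"],
    q=4,
    labels=["Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"],
)

# Row-normalised crosstab: share of each feedback label within a quartile
feedback_share = pd.crosstab(
    demo["Premium Quartile"],
    demo["Customer Feedback"],
    normalize="index",
)
print(feedback_share.round(3))
```

On the real data, a "Good" share that falls from Q1 to Q4 would support the insight; flat shares would suggest the violin plot shapes are driven by something other than satisfaction.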
4. Pricing Strategy & Business Insights
Hypothesis-1 - Income Levels on Premium Sensitivity and Dynamic Pricing
import scipy.stats as stats
# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Annual Income', 'Premium Amount'])
# Creating Income Groups for better analysis
data["Income Group"] = pd.cut(data["Annual Income"], bins=[0, 50000, 100000, 200000, 500000],
labels=["Low (0-50K)", "Mid (50K-100K)", "High (100K-200K)", "Very High (200K-500K)"])
# KDE Plot: Density of Premiums by Income Group
plt.figure(figsize=(8, 5))
sns.kdeplot(data=data, x="Premium Amount", hue="Income Group", fill=True, alpha=0.5)
plt.title("Density of Premium Amounts Across Income Groups")
plt.xlabel("Premium Amount")
plt.ylabel("Density")
plt.show()
Insights:
- The KDE plot suggests that higher-income groups have a longer right tail, meaning a larger share of them pay high premiums.
- Lower-income groups may have a more concentrated premium range, indicating greater price sensitivity.
- If premium density is spread out in higher-income segments, it suggests that they accept different premium rates without switching policies.
Conclusion:
- High-income customers are less sensitive to premium changes, making them ideal for dynamic pricing strategies.
- Insurers can optimize revenue by offering tailored premium rates based on income segmentation.
- Lower-income customers exhibit more price sensitivity, requiring competitive pricing to retain them.
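The shape statements about each income group (concentrated versus spread out, right-skewed or not) can be summarised numerically next to the KDE plot. The sketch below computes count, median, standard deviation, and skewness per income group on synthetic data; the frame `demo` is an assumption and its premiums are independent of income.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's data (illustrative values only)
rng = np.random.default_rng(11)
n = 1000
demo = pd.DataFrame({
    "Annual Income": rng.uniform(1_000, 500_000, n),
    "Premium Amount": rng.lognormal(6.8, 0.6, n),
})
demo["Income Group"] = pd.cut(
    demo["Annual Income"],
    bins=[0, 50_000, 100_000, 200_000, 500_000],
    labels=["Low (0-50K)", "Mid (50K-100K)", "High (100K-200K)", "Very High (200K-500K)"],
)

# Per-group shape summary: a larger std means a wider spread,
# and a positive skew means a longer right tail
summary = demo.groupby("Income Group", observed=True)["Premium Amount"].agg(
    ["count", "median", "std", "skew"]
)
print(summary.round(2))
```

Comparing `std` and `skew` across rows on the real data gives a direct check of the "more concentrated premium range" claim for lower-income customers. Note that spread alone does not measure price sensitivity; that would need data on how customers respond to price changes.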
Hypothesis-2 - Property Type on Premium Amounts
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Property Type', 'Premium Amount', 'Location'])
# Boxen Plot: Premium Distribution by Property Type & Location
plt.figure(figsize=(10, 6))
sns.boxenplot(data=data, x="Property Type", y="Premium Amount", hue="Location", palette="coolwarm")
plt.title("Premium Distribution by Property Type & Location")
plt.xlabel("Property Type")
plt.ylabel("Premium Amount")
plt.xticks(rotation=45)
plt.legend(title="Location", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
# KDE Plot: Density of Premium Amounts by Property Type & Location
plt.figure(figsize=(10, 6))
# Map each location to a line style (any extra locations fall back to a solid line)
location_linestyle_map = dict(zip(data['Location'].unique(), ['-', '--', ':']))
for location in data['Location'].unique():
    sns.kdeplot(
        data=data[data['Location'] == location],
        x="Premium Amount",
        hue="Property Type",
        fill=True,
        alpha=0.5,
        linestyle=location_linestyle_map.get(location, '-'),  # Use get with a default
        label=f"{location}"  # Add labels to the legend
    )
plt.title("Density of Premium Amounts by Property Type & Location")
plt.xlabel("Premium Amount")
plt.ylabel("Density")
plt.legend(title="Property Type & Location", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
Insights:
Variability in Premium Amounts Across Property Types:
- The box plot shows that all three property types (Detached Home, Flat, Apartment) have a similar distribution in premium amounts.
- Premiums have a wide range with many outliers, suggesting that high-value properties drive up the insurance costs.
Impact of Location on Premium Amounts:
- The box plot indicates that properties in Tier-1 locations generally have higher median premiums than Tier-2 and Tier-3.
- The spread of premium values is also larger in Tier-1 locations, indicating more variation in high-value properties.
Skewed Distribution of Premium Amounts:
- The KDE plot shows a highly right-skewed distribution, meaning most properties have lower premiums, while a few high-value properties have extremely high premiums.
- This suggests that while the majority of insured properties fall within an affordable premium range, some outliers contribute significantly to total premium revenue.
Overlapping Premium Distributions Across Property Types & Locations:
- The KDE plot indicates that property types and locations have overlapping distributions, meaning there isn't a stark difference between their density curves.
- This suggests that other factors (e.g., property size, risk factors) could be influencing premium amounts more than just property type or location alone.
Conclusion:
- The analysis of premium distributions across different property types and locations reveals key trends in pricing. While Detached Homes, Flats, and Apartments exhibit similar premium structures, Tier-1 locations tend to have higher median premiums and greater variability, indicating a broader range of property values and associated risks. The right-skewed distribution of premium amounts suggests that while most properties fall within a lower premium range, a small subset of high-value properties significantly increases premium costs.
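Because the premium distribution is strongly right-skewed, group comparisons are easier after taming the tail. The data preparation plan mentions winsorization and log transformation; the sketch below applies both to a synthetic premium series and compares the resulting skewness. The series and the 99th-percentile cap are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed premiums (illustrative values only)
rng = np.random.default_rng(5)
premium = pd.Series(rng.lognormal(6.8, 0.9, 2000), name="Premium Amount")

# Option 1: winsorize by capping values above the 99th percentile (assumed cap)
upper_cap = premium.quantile(0.99)
winsorized = premium.clip(upper=upper_cap)

# Option 2: log transform; log1p is safe for zero-valued premiums
log_premium = np.log1p(premium)

skew_report = pd.Series({
    "raw": premium.skew(),
    "winsorized (99th pct cap)": winsorized.skew(),
    "log1p": log_premium.skew(),
})
print(skew_report.round(3))
```

Winsorizing keeps the original units and only limits the influence of extreme policies, while the log transform reshapes the whole distribution; which one suits the pricing model depends on how the high-premium policies should be treated.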
Hypothesis-3 - Seasonal Trends in Policy Purchases: Analyzing Monthly Sales Patterns
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Convert 'Policy Start Date' to datetime format
data["Policy Start Date"] = pd.to_datetime(data["Policy Start Date"], errors="coerce")
# Extract Month from 'Policy Start Date'
data["Policy Start Month"] = data["Policy Start Date"].dt.month
# Aggregate policy counts per month, property type, and location
monthly_sales = data.groupby(["Policy Start Month", "Property Type", "Location"]).size().reset_index(name="Policy Count")
# Line Plot: Monthly Policy Sales Trend by Property Type
plt.figure(figsize=(12, 6))
sns.lineplot(x="Policy Start Month", y="Policy Count", hue="Property Type", data=monthly_sales, marker="o", palette="tab10")
plt.xticks(ticks=range(1, 13), labels=["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
plt.title("Monthly Policy Purchases Trend by Property Type")
plt.xlabel("Month")
plt.ylabel("Number of Policies Sold")
plt.grid(True)
plt.legend(title="Property Type")
plt.show()
# Bar Plot: Monthly Policy Sales by Location
plt.figure(figsize=(12, 6))
sns.barplot(x="Policy Start Month", y="Policy Count", hue="Location", data=monthly_sales, palette="coolwarm")
plt.xticks(ticks=range(1, 13), labels=["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
plt.title("Policy Purchases by Month and Location")
plt.xlabel("Month")
plt.ylabel("Number of Policies Sold")
plt.legend(title="Location")
plt.show()
Insights:
Seasonal Trends:
- Policy purchases peak in certain months, potentially due to renewal cycles or promotional periods.
- There may be seasonal dips in sales, requiring further investigation into consumer behavior.
Property Type Influence:
- Different property types exhibit unique trends in policy sales.
- For example, apartments may see steady demand, while detached homes may show spikes during specific months.
Location-Based Differences:
- Tier-1 locations may consistently lead in policy purchases.
- Some locations might show strong seasonal patterns, possibly due to weather conditions, real estate cycles, or economic factors.
Conclusion:
- The analysis of monthly policy purchases across property types and locations highlights key trends in seasonal demand, regional variations, and consumer behavior. Understanding these trends can help insurance providers optimize pricing, marketing campaigns, and customer engagement strategies to align with demand fluctuations. Further analysis into policyholder demographics and economic conditions could provide deeper insights into the observed patterns.
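Naming the peak and dip months requires the monthly counts themselves. The sketch below shows one way to compute them on a tiny hand-made frame: unparseable dates become `NaT` through `errors="coerce"` and are dropped, and months with no sales are kept as zeros. The sample dates are invented for illustration.

```python
import pandas as pd

# Tiny hand-made sample, including one unparseable date (illustrative only)
raw = pd.DataFrame({
    "Policy Start Date": ["2023-01-15", "2023-01-20", "2023-06-03", "not a date", "2024-12-31"]
})

# Invalid strings become NaT instead of raising
raw["Policy Start Date"] = pd.to_datetime(raw["Policy Start Date"], errors="coerce")
valid = raw.dropna(subset=["Policy Start Date"])

# Count policies per calendar month, keeping months with zero sales
month_counts = (
    valid["Policy Start Date"].dt.month
    .value_counts()
    .reindex(range(1, 13), fill_value=0)
    .sort_index()
)
month_share = month_counts / month_counts.sum()

print(month_counts.to_dict())
print("Peak month:", month_counts.idxmax())
```

Reporting `month_share` alongside the line and bar plots would let the insights name specific months instead of referring to "certain months".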
Key Findings & Recommendations: Insurance Premium Analysis
Key Findings:
1. Demographic Factors (Age, Education, Smoking, Exercise) Impact Premium Pricing
- Age: Premiums rise with age. The 60+ group pays the most, while younger customers (18-30, 30-45) pay less.
- Education Level: Higher education correlates with lower premiums, likely due to better financial awareness and health management.
- Smoking & Health Score: Smokers pay significantly higher premiums across all age groups. A better health score lowers insurance costs.
- Exercise Frequency: Physically active customers receive lower premiums due to reduced health risks.
2. Financial & Credit-Related Factors on Premium Costs
- Credit Score: A low credit score significantly increases premiums, reflecting financial instability and higher risk.
- Annual Income: High-income individuals opt for comprehensive policies, while low-income groups show price sensitivity and prefer essential coverage.
- Premium Sensitivity: Lower-income customers are more responsive to discounts, while higher-income individuals value policy benefits over cost.
3. Policy Type & Duration: How Customers Choose Their Plans
- Whole-life insurance premiums show higher variability, while term insurance maintains consistent pricing.
- Long-term policies offer lower total costs but require higher initial payments, leading some customers to prefer short-term plans.
4. Property & Lifestyle Factors Affecting Premium Pricing
- Property Type: Commercial properties have higher premiums due to higher replacement costs.
- Older homes attract higher insurance rates due to structural risks.
- Lifestyle Choices: Healthier customers and non-smokers receive substantial premium discounts.
Recommendations:
1. Personalize Pricing & Offer Incentives for Healthier Lifestyles
- Introduce premium discounts for active, non-smoking, and healthier individuals.
- Leverage wearable technology (fitness tracking) to reward active customers.
2. Implement Tiered Pricing for Different Customer Segments
- Low-income policyholders: Offer basic plans with flexible payments.
- High-income policyholders: Promote premium plans with add-ons.
3. Introduce Credit Score-Based Premium Adjustments
- Encourage financial literacy programs to help customers improve credit scores for lower premiums.
- Offer gradual premium reductions for customers improving their credit over time.
4. Optimize Seasonal Sales Strategies
- Launch mid-year promotional campaigns to counter sales dips.
- Offer end-of-year tax-saving bundles to drive financial planning-based sales.
5. Provide Flexible Payment & Duration Options
- Offer monthly, quarterly, and annual payment plans for better affordability.
- Introduce early renewal benefits to retain customers.
6. Develop Property-Specific Risk Assessment Models
- Adjust premiums based on property type and age.
- Provide bundled property & life insurance packages for homeowners.
Expected Business Impact:
- Increased Policy Sales & Renewals: Through seasonal offers & customer loyalty strategies.
- Higher Customer Retention: Personalized pricing & discounts reduce churn rates.
- Improved Risk Management: Data-driven premium adjustments optimize profitability & claim costs.
- Higher Customer Engagement: Targeted campaigns based on health & financial insights.