Acko Health Insurance: Data-Driven Insurance Premium Pricing StrategyΒΆ

Acko Health Insurance product and Premimum pricing AnalysisΒΆ

Introduction:ΒΆ

Acko, a digital insurance provider, is launching a new health insurance product and requires a data-driven approach to determine the optimal insurance premium pricing for different customer segments.

Since no direct health insurance data is available (such as medical history, hospital visits, or past health insurance claims), the analysis will rely on demographic, financial, and health-related behavioural indicators to estimate risk levels and pricing.

1. Data UnderstandingΒΆ

  • Loading and Inspecting Data

  • Import the dataset and inspect its structure.

  • Display the first few rows to understand the data layout.

  • Check for:

  • Column names and data types.

    • Missing values and their distribution.
    • Basic statistics of numerical and categorical features.

2. Data PreparationΒΆ

  • Handling Missing Values

    • Identify columns with missing data.
    • Implement strategies to handle them (e.g., imputation, removal). Justify selected strategies briefly.
  • Handling Duplicates:

    • Remove duplicate entries based on customer ID or policy number.
  • Outlier Detection & Treatment:

    • Identify extreme values in income, premium amounts, or risk indicators.
    • Apply winsorization (capping extreme values) or log transformation.
  • Standardizing Categorical Data:

    • Convert gender, location, policy type, employment status, etc. into numeric values.
    • Group location into risk categories (urban vs. rural, high-cost vs. low-cost).

3. Exploratory Data Analysis (EDA)ΒΆ

  • Goal: Identify trends, patterns, and relationships in the data.
  • Structure the EDA using:
    • Univariate Analysis:Understand the distribution and characteristics of single variables (e.g., age, income, premium amount).
    • Bivariate Analysis:Identify correlations and trends between two variables (e.g., age vs. premium, income vs. affordability).
    • Multivariate Analysis:Identify complex patterns and dependencies among multiple factors affecting premium pricing.

4. Charting and InsightsΒΆ

  • Visualization Strategy
  • Provide relevant charts with:
  • Titles, labels, legends, and annotations.
  • Brief markdown explanations of the purpose and insights.
  • Key Insights
    • Highlight actionable insights:
    • Regions with high demand.
    • Younger customers need low-cost entry plans, while seniors require higher-coverage options.
    • Incentives and discounts can help maintain customer loyalty.

5. Insights and RecommendationsΒΆ

  • Summary of Findings
    • Many young customers explore policies but hesitate to purchase due to affordability concerns.
    • Acko needs to balance risk adjustments without making insurance unaffordable.
    • Suggests affordability constraints among customers.
    • Customers are twice as likely to renew if they receive a renewal bonus.
  • Recommendations
    • Offer low-cost starter plans with essential coverage for younger customers.
    • Implement gradual premium adjustments for high-risk individuals instead of steep price hikes.
    • For higher premium policies, introduce value-add services (telemedicine, wellness programs).
    • Offer renewal bonuses or cashback for long-term customers.

Unlocking Customer Demand and Optimizing Pricing in the Health Insurance MarketΒΆ

Acko, a leading digital insurance provider in India, is transforming the health insurance industry by offering affordable, transparent, and customer-centric policies. As healthcare costs continue to rise, customers are actively seeking reliable and cost-effective insurance plans that cater to their specific needs.

However, success in the competitive health insurance market depends on two key factors:

  • Understanding customer risk profiles and affordability concerns
  • Offering personalized and competitive premium pricing

Customers evaluate multiple insurance options before making a decision, and high premium rates or rigid pricing structures can lead to potential dropouts. Missed conversions not only impact Acko’s revenue growth but also hinder customer trust and long-term retention.

To optimize health insurance adoption, Acko must analyze key factors influencing customer preferences, including:

  • Demographic trends (age, location, occupation, income levels, family size, etc.)
  • Health risk indicators (BMI, lifestyle choices, pre-existing conditions, etc.)
  • Affordability and price sensitivity across different customer segments

Why do some plans appeal more to customers than others? What factors drive policy purchases or cancellations? Understanding these behavioral patterns is critical to:

  • Creating risk-adjusted, data-driven premium pricing strategies
  • Reducing customer dropouts by offering tailored policies
  • Enhancing retention through dynamic pricing and loyalty benefits

By leveraging data-driven insights and strategic pricing models, Acko can bridge the gap between affordability and profitability, ensuring sustainable growth in India’s evolving health insurance market.

Objectives:ΒΆ

In this project, we aim is to develop a structured, rule-based pricing model that balances:

Profitability – Ensuring premiums are structured based on risk-adjusted factors. Affordability – Ensuring premiums align with customers' financial capacity and risk exposure.

Business ImpactΒΆ

Implementing a structured, rule-based premium pricing model will have significant business implications for Acko, driving both revenue growth and customer satisfaction.

Improved Pricing Accuracy and Profitability: The structured formula ensures that premium prices are aligned with customer risk levels, minimizing underpricing (leading to losses) and overpricing (leading to dropouts).

Personalized Premiums for Different Customer Segments – Dynamic pricing ensures that different income and demographic groups receive fair and affordable insurance plans.

Better Risk Assessment Without Medical Data – Since direct health records are unavailable, leveraging alternative indicators (age, lifestyle, spending behavior, etc.) ensures effective risk evaluation.

Strategic Policy Adjustments – The model enables data-driven recommendations for new discount strategies, dynamic pricing updates, and customer incentives to boost sign-ups.

  • By implementing a data-driven, rule-based premium pricing strategy, Acko ensures sustainable revenue growth, risk-adjusted profitability, and higher customer satisfaction, ultimately solidifying its position as a leader in India’s digital health insurance market.

Dataset OverviewΒΆ

Dataset OverviewΒΆ

  • Dataset Name : Acko Health Insurance Dataset -Number of Rows : 1200000 -Number of Columns : 20
  • Description : The objective of this analysis is to identify patterns in customer segmentation, assess risk factors, and refine Acko’s pricing strategy to ensure a balance between business profitability and customer affordability. Insights from this dataset will help Acko enhance its product offerings, improve customer satisfaction, and optimize revenue generation.

Column DefinitionsΒΆ

  1. id - A unique identifier assigned to each customer in the dataset.

  2. Age - The age of the customer in years at the time of policy purchase.

  3. Gender - The gender identity of the customer, which can be "Man" or "Woman."

  4. Annual Income (Yearly Earnings in INR)- The total income earned by the customer in a year, measured in Indian Rupees (INR).

  5. Marital Status (Customer’s Marital Condition) - The marital status of the customer, such as "Spouse Present," "Not Married," or "Formerly Married."

  6. Number of Dependents(People Financially Dependent on Customer)- The number of dependents (such as children, parents, or others) that rely on the customer financially.

  7. Education Level(Highest Education Attained) - The highest level of education completed by the customer, such as "Undergraduate" or "Post Graduate."

  8. Occupation(Customer's Job Type)-The profession or employment category of the customer. In some cases, this data may be missing.

  9. Health Score (Overall Health Indicator) - A numerical score representing the customer’s health condition based on lifestyle factors and medical history.

  10. Location(Customer’s Residence Tier) - The classification of the customer's residence area into tiers such as Tier-1, Tier-2, or Tier-3 cities.

  11. Policy Type(Type of Insurance Policy Chosen) - The category of insurance policy purchased by the customer, such as "Basic," "Premium," or "Comprehensive."

  12. Previous Claims(Number of Past Insurance Claims) - The total number of insurance claims the customer has made before purchasing the current policy.

  13. Credit Score(Financial Responsibility Indicator) - A numerical representation of the customer’s creditworthiness, indicating their ability to manage finances and make timely payments.

  14. Insurance Duration(Policy Tenure in Years) - The number of years the customer has held the insurance policy.

  15. Policy Start Date(Date When Policy Became Active) - The date when the insurance policy was purchased and became active.

  16. Customer Feedback (Customer Satisfaction Rating) - The rating or feedback provided by the customer about their experience with the insurance policy.

  17. Smoking Status(Whether the Customer Smokes) - Indicates whether the customer is a smoker or not, as smoking impacts health risks and insurance premiums.

  18. Exercise Frequency(How Often the Customer Exercises) - The frequency of physical exercise performed by the customer, which can impact their health score.

  19. Property Type(Type of Residence) - The type of home the customer resides in, such as a detached home or an apartment.

  20. Premium Amount (Final Insurance Premium in INR) - The amount the customer pays for their health insurance policy, measured in Indian Rupees (INR).

Analysis & Visualisation !ΒΆ

1. Importing and Cleaning DataΒΆ

Importing Necessary LibrariesΒΆ

InΒ [58]:
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical computations
import matplotlib.pyplot as plt  # For plotting and visualization
import seaborn as sns  # For advanced visualizations

Loading the Dataset from google driveΒΆ

InΒ [59]:
# Step 1: Install gdown
!pip install gdown

# Step 2: Import necessary libraries
import gdown
import pandas as pd

# Step 3: Set the file ID and create a download URL
file_id = "1i4ia9ZNfAXgu6JGTXCUgFn7Pb8wltzLH"
download_url = f"https://drive.google.com/uc?id={file_id}"

# Step 4: Set the output file name
output_file = "acko_dataset.csv"

# Step 5: Download the file
gdown.download(download_url, output_file, quiet=False)

# Step 6: Load the CSV file into a Pandas DataFrame
data = pd.read_csv(output_file)
Requirement already satisfied: gdown in /usr/local/lib/python3.11/dist-packages (5.2.0)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.11/dist-packages (from gdown) (4.13.3)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from gdown) (3.17.0)
Requirement already satisfied: requests[socks] in /usr/local/lib/python3.11/dist-packages (from gdown) (2.32.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from gdown) (4.67.1)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown) (2.6)
Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (2025.1.31)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (1.7.1)
Downloading...
From (original): https://drive.google.com/uc?id=1i4ia9ZNfAXgu6JGTXCUgFn7Pb8wltzLH
From (redirected): https://drive.google.com/uc?id=1i4ia9ZNfAXgu6JGTXCUgFn7Pb8wltzLH&confirm=t&uuid=2a5d4fea-a2a4-4c0f-8796-8315bc58c66f
To: /content/acko_dataset.csv
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 219M/219M [00:02<00:00, 78.8MB/s]

Viewing the First Few Rows of the DatasetΒΆ

InΒ [60]:
print("First 5 Rows of the Dataset:")
data.head(5)
First 5 Rows of the Dataset:
Out[60]:
id Age Gender Annual Income Marital Status Number of Dependents Education Level Occupation Health Score Location Policy Type Previous Claims Credit Score Insurance Duration Policy Start Date Customer Feedback Smoking Status Exercise Frequency Property Type Premium Amount
0 0 19.0 Woman 8.642140e+05 Spouse Present 1.0 Undergraduate Business 26.598761 Tier-1 Premium 2.0 372.0 5.0 2023-12-23 15:21:39.134960 Poor No Weekly Detached Home 1945.913327
1 1 39.0 Woman 8.927012e+05 Spouse Present 3.0 Post Graduate Missing 21.569731 Tier-2 Comprehensive 1.0 694.0 2.0 2023-06-12 15:21:39.111551 Average Yes Monthly Detached Home 10908.896072
2 2 23.0 Man 2.201772e+06 Formerly Married 3.0 Undergraduate Business 50.177549 Tier-3 Premium 1.0 NaN 3.0 2023-09-30 15:21:39.221386 Good Yes Weekly Detached Home 21563.135198
3 3 21.0 Man 3.997542e+06 Spouse Present 2.0 Undergraduate Missing 16.938144 Tier-2 Basic 1.0 367.0 1.0 2024-06-12 15:21:39.226954 Poor Yes Daily Flat 2653.539143
4 4 21.0 Man 3.409986e+06 Not Married 1.0 Undergraduate Business 24.376094 Tier-2 Premium 0.0 598.0 4.0 2021-12-01 15:21:39.252145 Poor Yes Weekly Detached Home 1269.243463

Checking the Shape of the DatasetΒΆ

InΒ [61]:
rows, columns = data.shape
print(f"\nThe dataset contains {rows} rows and {columns} columns.")
The dataset contains 1200000 rows and 20 columns.
  • Here Dataset contains

    Rows: 1200000

    Columns: 20

Random SampleΒΆ

InΒ [62]:
random_sample = data[data.notna().all(axis=1)].sample(n=10, random_state=42)  # Randomly select 10 rows with no missing values
print(random_sample)
              id   Age Gender  Annual Income  Marital Status  \
940383    940383  34.0    Man   7.100460e+05     Not Married   
384739    384739  27.0    Man   1.286560e+05     Not Married   
841062    841062  20.0  Woman   6.368160e+05  Spouse Present   
482497    482497  26.0  Woman   7.206594e+05     Not Married   
105947    105947  35.0    Man   2.201772e+06  Spouse Present   
836098    836098  26.0  Woman   4.540721e+05  Spouse Present   
627498    627498  60.0    Man   8.426382e+06  Spouse Present   
1122754  1122754  25.0    Man   5.733600e+05     Not Married   
962397    962397  35.0  Woman   1.320255e+06     Not Married   
278381    278381  47.0    Man   6.327320e+05  Spouse Present   

         Number of Dependents      Education Level        Occupation  \
940383                    4.0  Secondary Education  Full-Time Worker   
384739                    1.0                  PhD          Business   
841062                    2.0                  PhD  Full-Time Worker   
482497                    4.0        Undergraduate           Missing   
105947                    4.0                  PhD          Business   
836098                    0.0        Undergraduate           Missing   
627498                    2.0        Undergraduate  Full-Time Worker   
1122754                   0.0        Undergraduate           Missing   
962397                    1.0        Post Graduate           Missing   
278381                    3.0        Undergraduate  Full-Time Worker   

         Health Score Location    Policy Type  Previous Claims  Credit Score  \
940383      36.694979   Tier-3        Premium              2.0         784.0   
384739      52.368069   Tier-2          Basic              1.0         694.0   
841062      48.977416   Tier-2        Premium              0.0         626.0   
482497      46.865991   Tier-1          Basic              2.0         445.0   
105947      42.556413   Tier-2        Premium              1.0         849.0   
836098      31.726851   Tier-3        Premium              0.0         761.0   
627498      42.781384   Tier-1          Basic              2.0         691.0   
1122754     33.049018   Tier-2  Comprehensive              2.0         487.0   
962397      11.816854   Tier-1        Premium              1.0         776.0   
278381      30.242150   Tier-3        Premium              1.0         561.0   

         Insurance Duration           Policy Start Date Customer Feedback  \
940383                  5.0  2023-09-21 15:21:39.190215              Poor   
384739                  5.0  2021-03-05 15:21:39.217387           Average   
841062                  1.0  2022-06-23 15:21:39.279729              Good   
482497                  4.0  2020-06-22 15:21:39.134960              Poor   
105947                  2.0  2022-10-14 15:21:39.167099              Good   
836098                  4.0  2020-01-02 15:21:39.228521              Poor   
627498                  2.0  2023-09-01 15:21:39.173834              Good   
1122754                 6.0  2024-06-19 15:21:39.124659              Good   
962397                  5.0  2021-02-17 15:21:39.155231              Good   
278381                  5.0  2020-08-29 15:21:39.219432              Poor   

        Smoking Status Exercise Frequency  Property Type  Premium Amount  
940383              No              Daily           Flat    24143.634352  
384739             Yes             Weekly           Flat    58629.569372  
841062              No             Weekly           Flat    38107.625453  
482497             Yes             Weekly           Flat     2632.489722  
105947             Yes              Daily           Flat     3503.121200  
836098              No             Weekly  Detached Home     7425.437892  
627498              No            Monthly           Flat   102616.812058  
1122754            Yes            Monthly  Detached Home     1238.769345  
962397             Yes              Daily      Apartment    28459.080582  
278381             Yes             Weekly      Apartment    31378.963219  

Displaying Dataset InformationΒΆ

InΒ [63]:
print("\nDataset Information:")
data.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 20 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   id                    1200000 non-null  int64  
 1   Age                   1181295 non-null  float64
 2   Gender                1200000 non-null  object 
 3   Annual Income         1155051 non-null  float64
 4   Marital Status        1200000 non-null  object 
 5   Number of Dependents  1090328 non-null  float64
 6   Education Level       1200000 non-null  object 
 7   Occupation            1200000 non-null  object 
 8   Health Score          1125924 non-null  float64
 9   Location              1200000 non-null  object 
 10  Policy Type           1200000 non-null  object 
 11  Previous Claims       835971 non-null   float64
 12  Credit Score          1062118 non-null  float64
 13  Insurance Duration    1199999 non-null  float64
 14  Policy Start Date     1200000 non-null  object 
 15  Customer Feedback     1122176 non-null  object 
 16  Smoking Status        1200000 non-null  object 
 17  Exercise Frequency    1200000 non-null  object 
 18  Property Type         1200000 non-null  object 
 19  Premium Amount        784968 non-null   float64
dtypes: float64(8), int64(1), object(11)
memory usage: 183.1+ MB

Data Type CorrectionsΒΆ

Policy Start Date (object β†’ datetime) β†’ Convert to date format for time-based analysis.

InΒ [64]:
data["Policy Start Date"] = pd.to_datetime(data["Policy Start Date"])

Checking for Duplicate Values in the DatasetΒΆ

InΒ [65]:
duplicate_count = len(data[data.duplicated()])
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")
Number of Duplicate Rows in the Dataset: 0
  • In this dataset there are no duplicate rows to be removed

Checking for Missing/Null ValuesΒΆ

InΒ [66]:
# Now you can proceed with the missing value check:
missing_values = data.isnull().sum()
print("\nMissing Values in Each Column:")
print(missing_values)
Missing Values in Each Column:
id                           0
Age                      18705
Gender                       0
Annual Income            44949
Marital Status               0
Number of Dependents    109672
Education Level              0
Occupation                   0
Health Score             74076
Location                     0
Policy Type                  0
Previous Claims         364029
Credit Score            137882
Insurance Duration           1
Policy Start Date            0
Customer Feedback        77824
Smoking Status               0
Exercise Frequency           0
Property Type                0
Premium Amount          415032
dtype: int64
InΒ [67]:
missing_values = data.isnull().sum().sum()
print(missing_values)
1242170
  • In this dataset we have 1242170 missing values are there.

Chart for Missing values

InΒ [68]:
missing_values = data.isnull().sum()
missing_values = missing_values[missing_values > 0]

plt.figure(figsize=(8, 5))
missing_values.plot(kind='bar', color='blue')
plt.title('Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Count')
plt.show()
No description has been provided for this image

Summary of Dataset ObservationsΒΆ

InΒ [69]:
print("\nObservations About the Dataset:")
if duplicate_count > 0:
    print(f"- There are {duplicate_count} duplicate rows in the dataset.")
else:
    print("- No duplicate rows found in the dataset.")

if missing_values.sum() > 0:
    print("- There are missing values in the dataset. Here’s a summary:")
    print(missing_values[missing_values > 0])
else:
    print("- No missing values found in the dataset.")

print("- The dataset is ready for further analysis after handling duplicates and missing values.")
Observations About the Dataset:
- No duplicate rows found in the dataset.
- There are missing values in the dataset. Here’s a summary:
Age                      18705
Annual Income            44949
Number of Dependents    109672
Health Score             74076
Previous Claims         364029
Credit Score            137882
Insurance Duration           1
Customer Feedback        77824
Premium Amount          415032
dtype: int64
- The dataset is ready for further analysis after handling duplicates and missing values.

2. Data TypesΒΆ

InΒ [70]:
# Dataset Columns
print("Dataset Columns:")
print(data.columns)
Dataset Columns:
Index(['id', 'Age', 'Gender', 'Annual Income', 'Marital Status',
       'Number of Dependents', 'Education Level', 'Occupation', 'Health Score',
       'Location', 'Policy Type', 'Previous Claims', 'Credit Score',
       'Insurance Duration', 'Policy Start Date', 'Customer Feedback',
       'Smoking Status', 'Exercise Frequency', 'Property Type',
       'Premium Amount'],
      dtype='object')
InΒ [71]:
# Dataset Describe
print("\nDataset Summary Statistics:")
print(data.describe(include='all'))
Dataset Summary Statistics:
                  id           Age   Gender  Annual Income Marital Status  \
count   1.200000e+06  1.181295e+06  1200000   1.155051e+06        1200000   
unique           NaN           NaN        2            NaN              4   
top              NaN           NaN      Man            NaN    Not Married   
freq             NaN           NaN   602571            NaN         552049   
mean    5.999995e+05  4.114556e+01      NaN   1.664521e+06            NaN   
min     0.000000e+00  1.800000e+01      NaN   1.075000e+01            NaN   
25%     2.999998e+05  3.000000e+01      NaN   3.968939e+05            NaN   
50%     5.999995e+05  4.100000e+01      NaN   8.581660e+05            NaN   
75%     8.999992e+05  5.300000e+01      NaN   1.990566e+06            NaN   
max     1.199999e+06  6.400000e+01      NaN   1.304357e+07            NaN   
std     3.464103e+05  1.353995e+01      NaN   2.115112e+06            NaN   

        Number of Dependents Education Level        Occupation  Health Score  \
count           1.090328e+06         1200000           1200000  1.125924e+06   
unique                   NaN               4                 4           NaN   
top                      NaN   Undergraduate  Full-Time Worker           NaN   
freq                     NaN          627193            373716           NaN   
mean            2.009934e+00             NaN               NaN  3.186879e+01   
min             0.000000e+00             NaN               NaN  2.391713e+00   
25%             1.000000e+00             NaN               NaN  2.209691e+01   
50%             2.000000e+00             NaN               NaN  3.096556e+01   
75%             3.000000e+00             NaN               NaN  4.114583e+01   
max             4.000000e+00             NaN               NaN  6.000000e+01   
std             1.417338e+00             NaN               NaN  1.239609e+01   

       Location Policy Type  Previous Claims  Credit Score  \
count   1200000     1200000    835971.000000  1.062118e+06   
unique        3           3              NaN           NaN   
top      Tier-3     Premium              NaN           NaN   
freq     401542      401846              NaN           NaN   
mean        NaN         NaN         1.002689  5.929244e+02   
min         NaN         NaN         0.000000  3.000000e+02   
25%         NaN         NaN         0.000000  4.680000e+02   
50%         NaN         NaN         1.000000  5.950000e+02   
75%         NaN         NaN         2.000000  7.210000e+02   
max         NaN         NaN         9.000000  8.490000e+02   
std         NaN         NaN         0.982840  1.499819e+02   

        Insurance Duration              Policy Start Date Customer Feedback  \
count         1.199999e+06                        1200000           1122176   
unique                 NaN                            NaN                 3   
top                    NaN                            NaN           Average   
freq                   NaN                            NaN            377905   
mean          5.018219e+00  2022-02-13 05:06:30.972380672               NaN   
min           1.000000e+00     2019-08-17 15:21:39.080371               NaN   
25%           3.000000e+00  2020-11-20 15:21:39.121168896               NaN   
50%           5.000000e+00  2022-02-14 15:21:39.151731968               NaN   
75%           7.000000e+00  2023-05-06 15:21:39.182597120               NaN   
max           9.000000e+00     2024-08-15 15:21:39.287115               NaN   
std           2.594331e+00                            NaN               NaN   

       Smoking Status Exercise Frequency  Property Type  Premium Amount  
count         1200000            1200000        1200000   784968.000000  
unique              2                  4              3             NaN  
top               Yes             Weekly  Detached Home             NaN  
freq           601873             306179         400349             NaN  
mean              NaN                NaN            NaN    25763.411424  
min               NaN                NaN            NaN      292.650059  
25%               NaN                NaN            NaN     6840.682284  
50%               NaN                NaN            NaN    14824.932460  
75%               NaN                NaN            NaN    31316.333081  
max               NaN                NaN            NaN   240000.000000  
std               NaN                NaN            NaN    30563.216524  

Unique Values for each variable.ΒΆ

InΒ [72]:
# Unique Values for Each Variable
print("\n### Unique Values for Each Variable ###")
for column in data.columns.tolist():
    print(f"No. of unique values in {column}: {data[column].nunique()}.")
### Unique Values for Each Variable ###
No. of unique values in id: 1200000.
No. of unique values in Age: 47.
No. of unique values in Gender: 2.
No. of unique values in Annual Income: 247760.
No. of unique values in Marital Status: 4.
No. of unique values in Number of Dependents: 5.
No. of unique values in Education Level: 4.
No. of unique values in Occupation: 4.
No. of unique values in Health Score: 923518.
No. of unique values in Location: 3.
No. of unique values in Policy Type: 3.
No. of unique values in Previous Claims: 10.
No. of unique values in Credit Score: 550.
No. of unique values in Insurance Duration: 9.
No. of unique values in Policy Start Date: 167381.
No. of unique values in Customer Feedback: 3.
No. of unique values in Smoking Status: 2.
No. of unique values in Exercise Frequency: 4.
No. of unique values in Property Type: 3.
No. of unique values in Premium Amount: 784492.

3. Data WranglingΒΆ

Data Wrangling CodeΒΆ

InΒ [73]:
# Copying the dataset for analysis
data = data.copy()
InΒ [74]:
# Checking basic stats
print("Dataset Shape:", data.shape)
print("Dataset Columns:", data.columns)
Dataset Shape: (1200000, 20)
Dataset Columns: Index(['id', 'Age', 'Gender', 'Annual Income', 'Marital Status',
       'Number of Dependents', 'Education Level', 'Occupation', 'Health Score',
       'Location', 'Policy Type', 'Previous Claims', 'Credit Score',
       'Insurance Duration', 'Policy Start Date', 'Customer Feedback',
       'Smoking Status', 'Exercise Frequency', 'Property Type',
       'Premium Amount'],
      dtype='object')

Converting and Creating columnsΒΆ

InΒ [75]:
# Ensure 'Policy Start Date' is in datetime format
data["Policy Start Date"] = pd.to_datetime(data["Policy Start Date"], errors='coerce')

# 1️⃣ Marital & Dependents Status
data["Marital & Dependents Status"] = np.where(
    (data["Marital Status"] == "Not Married") & (data["Number of Dependents"] == 0),
    "Single",
    "Family"
)

# 2️⃣ Risk Score (Health Score divided by Credit Score)
data["Risk Score"] = data["Health Score"] / data["Credit Score"]

# 3️⃣ Premium Category (Categorizing based on Premium Amount)
data["Premium Category"] = data["Premium Amount"].apply(lambda x: "High" if x > 10000 else "Low")

# 4️⃣ Policy Age (Years) (Calculating how old the policy is)
data["Policy Age (Years)"] = (pd.to_datetime("today") - data["Policy Start Date"]).dt.days // 365

# 5️⃣ Financial Responsibility Score (Income divided by dependents +1 to avoid division by zero)
data["Financial Responsibility Score"] = (data["Annual Income"] / (data["Number of Dependents"] + 1)).fillna(0)

# 6️⃣ Healthy Lifestyle Score (Combining Exercise Frequency & Smoking Status)
exercise_map = {"Daily": 5, "Weekly": 3, "Monthly": 1, "None": 0}
smoking_map = {"Yes": -2, "No": 0}

data["Healthy Lifestyle Score"] = data["Exercise Frequency"].map(exercise_map).fillna(0) + data["Smoking Status"].map(smoking_map).fillna(0)

# 7️⃣ Claim Frequency (Previous Claims divided by Insurance Duration)
data["Claim Frequency"] = data["Previous Claims"] / data["Insurance Duration"]

# Display the first few rows to verify
print(data.head())
   id   Age Gender  Annual Income    Marital Status  Number of Dependents  \
0   0  19.0  Woman   8.642140e+05    Spouse Present                   1.0   
1   1  39.0  Woman   8.927012e+05    Spouse Present                   3.0   
2   2  23.0    Man   2.201772e+06  Formerly Married                   3.0   
3   3  21.0    Man   3.997542e+06    Spouse Present                   2.0   
4   4  21.0    Man   3.409986e+06       Not Married                   1.0   

  Education Level Occupation  Health Score Location  ... Exercise Frequency  \
0   Undergraduate   Business     26.598761   Tier-1  ...             Weekly   
1   Post Graduate    Missing     21.569731   Tier-2  ...            Monthly   
2   Undergraduate   Business     50.177549   Tier-3  ...             Weekly   
3   Undergraduate    Missing     16.938144   Tier-2  ...              Daily   
4   Undergraduate   Business     24.376094   Tier-2  ...             Weekly   

   Property Type  Premium Amount  Marital & Dependents Status Risk Score  \
0  Detached Home     1945.913327                       Family   0.071502   
1  Detached Home    10908.896072                       Family   0.031080   
2  Detached Home    21563.135198                       Family        NaN   
3           Flat     2653.539143                       Family   0.046153   
4  Detached Home     1269.243463                       Family   0.040763   

  Premium Category Policy Age (Years) Financial Responsibility Score  \
0              Low                  1                   4.321070e+05   
1             High                  1                   2.231753e+05   
2             High                  1                   5.504430e+05   
3              Low                  0                   1.332514e+06   
4              Low                  3                   1.704993e+06   

  Healthy Lifestyle Score  Claim Frequency  
0                     3.0         0.400000  
1                    -1.0         0.500000  
2                     1.0         0.333333  
3                     3.0         1.000000  
4                     1.0         0.000000  

[5 rows x 27 columns]
InΒ [76]:
# Dataset Columns
print("Dataset Columns:")
print(data.columns)
Dataset Columns:
Index(['id', 'Age', 'Gender', 'Annual Income', 'Marital Status',
       'Number of Dependents', 'Education Level', 'Occupation', 'Health Score',
       'Location', 'Policy Type', 'Previous Claims', 'Credit Score',
       'Insurance Duration', 'Policy Start Date', 'Customer Feedback',
       'Smoking Status', 'Exercise Frequency', 'Property Type',
       'Premium Amount', 'Marital & Dependents Status', 'Risk Score',
       'Premium Category', 'Policy Age (Years)',
       'Financial Responsibility Score', 'Healthy Lifestyle Score',
       'Claim Frequency'],
      dtype='object')
InΒ [77]:
data.columns = data.columns.str.strip()
InΒ [78]:
print(data.head())
   id   Age Gender  Annual Income    Marital Status  Number of Dependents  \
0   0  19.0  Woman   8.642140e+05    Spouse Present                   1.0   
1   1  39.0  Woman   8.927012e+05    Spouse Present                   3.0   
2   2  23.0    Man   2.201772e+06  Formerly Married                   3.0   
3   3  21.0    Man   3.997542e+06    Spouse Present                   2.0   
4   4  21.0    Man   3.409986e+06       Not Married                   1.0   

  Education Level Occupation  Health Score Location  ... Exercise Frequency  \
0   Undergraduate   Business     26.598761   Tier-1  ...             Weekly   
1   Post Graduate    Missing     21.569731   Tier-2  ...            Monthly   
2   Undergraduate   Business     50.177549   Tier-3  ...             Weekly   
3   Undergraduate    Missing     16.938144   Tier-2  ...              Daily   
4   Undergraduate   Business     24.376094   Tier-2  ...             Weekly   

   Property Type  Premium Amount  Marital & Dependents Status Risk Score  \
0  Detached Home     1945.913327                       Family   0.071502   
1  Detached Home    10908.896072                       Family   0.031080   
2  Detached Home    21563.135198                       Family        NaN   
3           Flat     2653.539143                       Family   0.046153   
4  Detached Home     1269.243463                       Family   0.040763   

  Premium Category Policy Age (Years) Financial Responsibility Score  \
0              Low                  1                   4.321070e+05   
1             High                  1                   2.231753e+05   
2             High                  1                   5.504430e+05   
3              Low                  0                   1.332514e+06   
4              Low                  3                   1.704993e+06   

  Healthy Lifestyle Score  Claim Frequency  
0                     3.0         0.400000  
1                    -1.0         0.500000  
2                     1.0         0.333333  
3                     3.0         1.000000  
4                     1.0         0.000000  

[5 rows x 27 columns]

OutliersΒΆ

  1. Define Outliers by using IQR Method
InΒ [79]:
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers[[column]]

# Detecting outliers for key columns
outliers_dict = {}
for col in ["Age", "Annual Income", "Number of Dependents", "Health Score",
            "Previous Claims", "Credit Score", "Insurance Duration", "Premium Amount"]:
    outliers_dict[col] = detect_outliers_iqr(data, col)

# Display outliers count for each column
for col, outliers in outliers_dict.items():
    print(f"Column: {col} β†’ Outliers Count: {len(outliers)}")
Column: Age β†’ Outliers Count: 0
Column: Annual Income β†’ Outliers Count: 108267
Column: Number of Dependents β†’ Outliers Count: 0
Column: Health Score β†’ Outliers Count: 0
Column: Previous Claims β†’ Outliers Count: 369
Column: Credit Score β†’ Outliers Count: 0
Column: Insurance Duration β†’ Outliers Count: 0
Column: Premium Amount β†’ Outliers Count: 68621
InΒ [80]:
from scipy.stats import zscore

# Compute Z-scores
data_numeric = data[["Age", "Annual Income", "Number of Dependents", "Health Score",
                 "Previous Claims", "Credit Score", "Insurance Duration", "Premium Amount"]]

z_scores = np.abs(zscore(data_numeric))

# Get rows where any value has a Z-score above 3
outlier_rows = data[(z_scores > 3).any(axis=1)]
print("Total Outliers Detected (Z-Score Method):", len(outlier_rows))
Total Outliers Detected (Z-Score Method): 0

Visualizing Outliers with BoxplotsΒΆ

InΒ [81]:
# List of numerical columns
num_cols = ["Age", "Annual Income", "Number of Dependents", "Health Score",
            "Previous Claims", "Credit Score", "Insurance Duration", "Premium Amount"]

# Plot boxplots
plt.figure(figsize=(12, 6))
for i, col in enumerate(num_cols, 1):
    plt.subplot(2, 4, i)
    sns.boxplot(y=data[col])
    plt.title(col)

plt.tight_layout()
plt.show()
No description has been provided for this image

Analysis of Each Boxplot:

  1. Age -
  • Symmetric distribution with no significant outliers.
  • Most individuals are between 20 and 60 years.
  • The median is around 40 years, meaning half of the customers are below this age.
  1. Annual Income -
  • Highly skewed distribution with many outliers.
  • The median is quite low compared to the maximum, indicating that a few people earn significantly more.
  • A large number of outliers (black dots) indicate extreme high-income values.
  • Possible Action: Consider log transformation to normalize the income distribution.
  1. Number of Dependents
  • A relatively balanced distribution with values ranging from 0 to 4.
  • No significant outliers.
  • The median is around 2, meaning most individuals have 1-2 dependents.
  1. Health Score
  • Data is evenly spread between 0 to 60.
  • The median is around 30-40.
  • No major outliers, indicating a smooth distribution.
  1. Previous Claims
  • The presence of outliers (above 6 claims) suggests some individuals have significantly more claims than others.
  • The majority of people have 0-2 claims.
  • The median is quite low, meaning most customers have very few claims.
  • Possible Action: Investigate if these high-claim customers represent fraud cases or legitimate claims.
  1. Credit Score
  • The data follows a smooth, symmetric distribution.
  • Most values range between 400 and 800.
  • No extreme outliers.
  • The median is around 600, indicating that most customers have an average-to-good credit score.
  1. Insurance Duration
  • Most values fall within 1 to 8 years.
  • The median is around 4-5 years.
  • The distribution is even with no significant outliers.
  1. Premium Amount
  • Highly skewed with multiple extreme outliers.
  • The median is very low compared to the maximum, indicating a few customers pay very high premiums.
  • Outliers indicate some customers pay significantly more than the majority.
  • Possible Action: Investigate whether high-premium customers are in different policy categories or if there's an issue in pricing strategy.

MetricsΒΆ

InΒ [82]:
# Creating a dictionary to store metrics
metrics = {
    "Total Records": [data.shape[0]],
    "Total Columns": [data.shape[1]],
    "Missing Values (%)": [data.isnull().sum().sum() / (data.shape[0] * data.shape[1]) * 100],
    "Unique Customers": [data["id"].nunique()],
    "Average Age": [data["Age"].mean()],
    "Median Annual Income": [data["Annual Income"].median()],
    "Average Credit Score": [data["Credit Score"].mean()],
    "Premium Amount (Mean)": [data["Premium Amount"].mean()],
    "Premium Amount (Median)": [data["Premium Amount"].median()],
    "Health Score (Avg)": [data["Health Score"].mean()],
    "Average Insurance Duration": [data["Insurance Duration"].mean()],
    "Total Previous Claims": [data["Previous Claims"].sum()],
    "Claim Ratio": [(data["Previous Claims"].sum() / data["Insurance Duration"].sum())],
    "Policy Type Distribution": [data["Policy Type"].value_counts(normalize=True).to_dict()],
    "Smoking Rate (%)": [(data["Smoking Status"] == "Yes").mean() * 100],
    "Exercise Frequency Distribution": [data["Exercise Frequency"].value_counts(normalize=True).to_dict()]
}

# Convert dictionary to DataFrame
metrics_data = pd.DataFrame.from_dict(metrics, orient='index', columns=["Value"])

# Display formatted metrics
print(metrics_data)
                                                                             Value
Total Records                                                              1200000
Total Columns                                                                   27
Missing Values (%)                                                        5.582448
Unique Customers                                                           1200000
Average Age                                                              41.145563
Median Annual Income                                                      858166.0
Average Credit Score                                                     592.92435
Premium Amount (Mean)                                                 25763.411424
Premium Amount (Median)                                                14824.93246
Health Score (Avg)                                                       31.868794
Average Insurance Duration                                                5.018219
Total Previous Claims                                                     838219.0
Claim Ratio                                                               0.139196
Policy Type Distribution         {'Premium': 0.3348716666666667, 'Comprehensive...
Smoking Rate (%)                                                         50.156083
Exercise Frequency Distribution  {'Weekly': 0.25514916666666665, 'Monthly': 0.2...

Explanation of MetricsΒΆ

  • Total Records β†’ Number of rows in the dataset.
  • Missing Values (%) β†’ Percentage of missing data across all columns.
  • Unique Customers β†’ Unique customer IDs (if duplicates exist, this helps).
  • Average Age β†’ Mean age of policyholders.
  • Median Annual Income β†’ Income distribution (median to avoid outliers).
  • Average Credit Score β†’ Important for risk analysis.
  • Premium Amount (Mean/Median) β†’ Helps in policy pricing insights.
  • Health Score (Avg) β†’ Gives an idea of customer health.
  • Total Previous Claims β†’ Total number of claims made.
  • Claim Ratio β†’ Number of claims per insurance duration.
  • Policy Type Distribution β†’ Breakdown of different policy types.
  • Smoking Rate (%) β†’ Percentage of smokers in the dataset.
  • Exercise Frequency Distribution β†’ Helps in risk analysis.

Understanding the Pricing ChallengeΒΆ

No direct health records. Instead, the available data includes:

  • Demographic factors(age, gender, marital status, dependents)
  • Financial Indicators(annual income, credit score)
  • Behavioral aspects(smoking , exercise frequency)
  • Insurance history(previous claims, policy duration)

Columns segmentation for Key Insurance FactorsΒΆ

1. Customer Personal Information:ΒΆ

  1. Customer ID - Unique number for each customer.(Example:2)

  2. Age - Customer's age in years.(Example: 23 years old)

  3. Gender - Either Man or Woman.(Example:Woman)

  4. Marital Status - Shows if the customers is Married(Spouse Present), Not Married, or Formerly Married.(Example:Spouse Present)

  5. Number of Dependents - How many depend financially on the customer.(Example:3 dependents)

  6. Location - Tier classification of their city (Tier-1, Tier-2, Tier-3).(Example:Tier-1 city)

2. Financial & Professional Details:ΒΆ

  1. Annual income - The customer's yearly earnings in INR. (Example:β‚Ή2,201,772)

  2. Education Level - Highest degree achieved(Undergraduate, Postgraduate, etc.).(Example: Undergraduate)

  3. Occupation- Job Category(can be missing). (Example: Business)

  4. Credit Score- A number showing the customer's financial reliability.(Example:694)

3. Health & Lifestyle FactorsΒΆ

  1. Health Score - A numerical health indicator(lower = unhealthy, higher=fit). (Example:22.6)

  2. Smoking Status - Yes or No, indicating if the customer smokes.(Example: Yes)

  3. Exercise Frequency - How often they excercise(Daily, Weekly, Monthly, Rarely).(Example:Weekly)

4. Insurance Policy DetailsΒΆ

  1. Policy Type - Type of insurance choosen(Basic, Premium, Comprehensive).(Example:Premium)

  2. Previous Claims - Number of past insurance claims made.(Example:2 Claims)

  3. Insurance Duration - How long(in years) the customer has held their policy.(Example:3 years)

  4. Policy Start Date - Date when the insurance policy started.(Example:December 23, 2023)

  5. Premium Amount - The final insurance cost in INR. (Example:β‚Ή4,700.56)

5.Others Factors Affecting PremiumΒΆ

  1. Customer Feedback - Satisfaction rating (Good, Poor, Etc.)(Example:Poor)

  2. Property Type - Type of home (Apartment, Detached Home, etc).(Example:Detached Home)

Risk-Adjusted Premium Calculation: Rule-Based ApproachΒΆ

Understanding Base Premium

InΒ [83]:
def calculate_base_premium(age, location_risk, policy_type):
    if age <= 30:
        base = 5000
    elif age <= 50:
        base = 7500
    else:
        base = 12000

    location_factor = 1.2 if location_risk == "High" else 1.0
    policy_factor = 1.5 if policy_type == "Comprehensive" else 1.0

    return base * location_factor * policy_factor

Risk Adjustment Factors

InΒ [84]:
def calculate_risk_adjustment(health_score, smoking, exercise, claims, credit_score):
    adjustment = 1.0

    if health_score < 40:
        adjustment += 0.15
    if smoking == "Yes":
        adjustment += 0.20
    if exercise == "None":
        adjustment += 0.10
    if claims > 2:
        adjustment += 0.25
    if credit_score < 600:
        adjustment += 0.20

    return adjustment

Final Risk-Adjusted Premium Calculation

Risk-AdjustedΒ Premium=BaseΒ PremiumΓ—(1+RiskΒ Adjustment)

InΒ [85]:
def calculate_total_premium(age, location_risk, policy_type, health_score, smoking, exercise, claims, credit_score):
    base_premium = calculate_base_premium(age, location_risk, policy_type)
    risk_adjustment = calculate_risk_adjustment(health_score, smoking, exercise, claims, credit_score)

    total_premium = base_premium * risk_adjustment
    return round(total_premium, 2)
InΒ [86]:
# Assuming the columns in your DataFrame are named exactly as expected by the function
data['risk_adjusted_premium'] = data.apply(lambda row: calculate_total_premium(
    row['Age'],
    row['Location'],  # Assuming 'Location' represents 'location_risk'
    row['Policy Type'],
    row['Health Score'],
    row['Smoking Status'],  # Assuming 'Smoking Status' represents 'smoking'
    row['Exercise Frequency'],  # Assuming 'Exercise Frequency' represents 'exercise'
    row['Previous Claims'],  # Assuming 'Previous Claims' represents 'claims'
    row['Credit Score']
), axis=1)
InΒ [87]:
# Boxplot: Distribution of risk-adjusted premiums
plt.figure(figsize=(10, 5))
sns.boxplot(y=data["risk_adjusted_premium"], palette="coolwarm")
plt.title("Distribution of Risk-Adjusted Premiums")
plt.ylabel("Risk-Adjusted Premium")
plt.show()

# Histogram: Frequency distribution of risk-adjusted premiums
plt.figure(figsize=(10, 5))
sns.histplot(data["risk_adjusted_premium"], bins=30, kde=True, color="skyblue")
plt.title("Risk-Adjusted Premium Distribution")
plt.xlabel("Risk-Adjusted Premium")
plt.ylabel("Frequency")
plt.show()
<ipython-input-87-44c48d97a448>:3: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(y=data["risk_adjusted_premium"], palette="coolwarm")
No description has been provided for this image
No description has been provided for this image

Boxplot: Distribution of Risk-Adjusted Premiums

  • The boxplot visualizes the spread and outliers in risk-adjusted premium values.
  • The box represents the interquartile range (IQR) (middle 50% of the data).
  • The line inside the box shows the median (middle value).

Insights:

  • If the box is skewed (not centered), premiums are not evenly distributed.
  • A longer box suggests more variation in premiums.

Histogram: Risk-Adjusted Premium Distribution

  • The histogram shows how frequently different premium values occur.
  • The x-axis represents premium amounts, while the y-axis shows the number of customers in each range.
  • The KDE (Kernel Density Estimation) curve (smooth line) shows the overall trend of the data.

Insights:

  • If the histogram is right-skewed, most customers have lower premiums, but a few pay much more.
  • If it’s left-skewed, most premiums are high.
  • If it’s bell-shaped, premiums follow a normal distribution (balanced risk factors).

EDA(Exploratory Data Analysis)ΒΆ

1. Distribution of Premium AmountsΒΆ

InΒ [88]:
sns.histplot(data['Premium Amount'], bins=30, kde=True)
plt.title("Distribution of Premium Amounts")
plt.xlabel("Premium Amount")
plt.ylabel("Exercise Frequency")
plt.show()
No description has been provided for this image

Insight:

  • The distribution is right-skewed (positively skewed), meaning: Most customers pay low premiums.
  • A few customers pay significantly higher premiums.
  • The tall bars on the left suggest that many customers pay a premium below β‚Ή50,000.
  • The long tail on the right suggests some high-risk customers pay significantly more (above β‚Ή100,000).

2. Age DistributionΒΆ

InΒ [89]:
sns.histplot(data['Age'], bins=20, kde=True)
plt.title("Age Distribution of Customers")
plt.xlabel("Age")
plt.ylabel("Exercise Frequency")
plt.show()
No description has been provided for this image

Insight:

  • In this chart there number of customers appears evenly spread across different age groups.
  • There is no sharp decline in older age groups, suggesting a balanced mix of young and old customers.
  • In this Slightly higher customer concentration in older age groups (30+).
  • This could be due to middle-aged individuals purchasing more insurance policies.

3. Health Score DistributionΒΆ

InΒ [90]:
sns.histplot(data['Health Score'], bins=20, kde=True)
plt.title("Health Score Distribution")
plt.xlabel("Health Score")
plt.ylabel("Exercise Frequency")
plt.show()
No description has been provided for this image

Insight:

  • Here there is roughly normal distribution, centered around health scores of 25-35.
  • This suggests that most customers have moderate health scores.
  • The highest concentration of customers has a health score between 25 and 35.
  • This range might represent an average-risk group that isn't exceptionally healthy or unhealthy.
  • Very few customers have health scores below 10 or above 50.
  • Low health scores (<10) may indicate individuals with high-risk factors (e.g., chronic illnesses, smoking).
  • High health scores (>50) suggest very fit individuals with minimal risk.
  • There's a noticeable drop in the number of customers above a score of 50.

4.Age vs. Premium AmountΒΆ

InΒ [91]:
# Group data by Age and calculate the average Premium Amount
age_premium_avg = data.groupby("Age")["Premium Amount"].mean().reset_index()

plt.figure(figsize=(10, 5))
sns.lineplot(x=age_premium_avg["Age"], y=age_premium_avg["Premium Amount"], marker="o")
plt.title("Average Premium Amount by Age")
plt.xlabel("Age")
plt.ylabel("Average Premium Amount")
plt.grid(True)
plt.show()
No description has been provided for this image

Insights:

  • Younger individuals (18–30) get relatively lower and stable premiums.
  • Mid-life (30–45) sees a small increase in premium, possibly due to changing risk factors.
  • Age 45 experiences a sudden drop, likely due to policy shifts or lower policy participation.
  • Age 50+ sees a dramatic premium increase, reflecting higher health risks.
  • Older individuals (50–65) experience stable but high premiums.

Premium Amounts by Policy TypeΒΆ

InΒ [92]:
sns.boxplot(x=data['Policy Type'], y=data['Premium Amount'])
plt.title("Policy Type vs. Premium Amount")
plt.xlabel("Policy Type")
plt.ylabel("Premium Amount")
plt.show()
No description has been provided for this image

Insight:

  • Similar median premiums across all policy types suggest base pricing is comparable.
  • Significant variability in premium amounts indicates customized risk-based pricing.
  • Outliers (very high premiums) suggest some individuals require expensive coverage.
  • Premium policies might offer higher-value plans with broader coverage.

Location Tier vs. Premium AmountΒΆ

InΒ [93]:
sns.boxplot(x=data['Location'], y=data['Premium Amount'])
plt.title("Location Tier vs. Premium Amount")
plt.xlabel("Location Tier")
plt.ylabel("Premium Amount")
plt.show()
No description has been provided for this image

Insight:

  • Tier-1 cities may have slightly more high-cost policies, possibly due to increased medical costs.
  • Premiums vary widely within each tier, likely due to age, risk factors, and coverage levels.

Marital Status vs. Premium AmountΒΆ

InΒ [94]:
data.groupby('Marital Status')['Premium Amount'].sum().plot(kind='pie', autopct='%1.1f%%')
plt.title("Premium Distribution by Marital Status")
plt.show()
No description has been provided for this image

Insights:

  • "Not Married" and "Spouse Present" contribute nearly equally to premium amounts, meaning marital status alone doesn’t significantly impact premium spending.
  • "Formerly Married" and "Unknown" groups contribute much less, possibly due to lower purchase rates or lower premium policies.
  • Targeted marketing towards the Formerly Married segment to understand their insurance needs better.

Correlation Analysis (Feature Importance)ΒΆ

InΒ [95]:
plt.figure(figsize=(12, 6))
sns.heatmap(data.select_dtypes(include=np.number).corr(), annot=True, cmap='coolwarm', fmt=".2f") # Select only numeric columns
plt.title("Feature Correlation Matrix")
plt.show()
No description has been provided for this image

Insight:

  • Use credit scores as a risk indicator – Individuals with low credit scores have higher risk scores and claim frequencies.
  • Monitor policyholders with past claims – A history of claims is a strong predictor of future claim frequency.
  • Tailor premium structures by age and financial responsibility – Higher financial responsibility correlates with lower risk.
  • Consider targeted discounts for financially responsible customers – Those with higher credit scores and fewer claims could be offered better policy rates.

Dependents vs. Premium AmountΒΆ

InΒ [96]:
# Group by dependents and calculate the average premium
dependents_premium = data.groupby('Number of Dependents')['Premium Amount'].mean()

# Create trend line
z = np.polyfit(dependents_premium.index, dependents_premium.values, 1)
p = np.poly1d(z)

# Plot the line chart
plt.plot(dependents_premium.index, dependents_premium.values, marker='o', linestyle='-', color='blue', label="Avg Premium")

# Plot the trend line
plt.plot(dependents_premium.index, p(dependents_premium.index), linestyle='--', color='red', label="Trend Line")

# Labels and title
plt.title("Number of Dependents vs. Average Premium Amount")
plt.xlabel("Number of Dependents")
plt.ylabel("Average Premium Amount")
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image

Insights:

  • Higher dependents β†’ Higher risk pricing: Premiums increase with dependents, meaning insurers factor in greater financial responsibility.
  • Review the pricing gap at 2 dependents: The sharp jump may indicate an opportunity to fine-tune risk assessment at this level.
  • Consider family-based discounts: If premiums level out at 2-3 dependents, insurers might offer family plans to improve customer retention.

Health Score vs. Exercise FrequencyΒΆ

InΒ [97]:
sns.boxplot(x=data['Exercise Frequency'], y=data['Health Score'])
plt.title("Exercise Frequency vs. Health Score")
plt.xlabel("Exercise Frequency")
plt.ylabel("Health Score")
plt.show()
No description has been provided for this image

Insight:

  • In this chart we Re-evaluate how health scores are measured – Since exercise alone doesn't explain health scores, other factors should be included in analysis.
  • Consider subcategories for exercise types – More specific exercise details (e.g., intensity, duration) may reveal deeper insights.
  • Look at additional lifestyle factors – Factors like diet, sleep, and medical history may be influencing health scores more than exercise frequency alone.

Credit Score vs. Previous ClaimsΒΆ

InΒ [98]:
data.groupby(pd.cut(data['Credit Score'], bins=5))['Previous Claims'].mean().plot(kind='bar', color='green')
plt.title("Credit Score vs. Average Previous Claims")
plt.xlabel("Credit Score Range")
plt.ylabel("Average Number of Claims")
plt.show()
<ipython-input-98-a54bc60fabab>:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  data.groupby(pd.cut(data['Credit Score'], bins=5))['Previous Claims'].mean().plot(kind='bar', color='green')
No description has been provided for this image

Insight:

  • In this chart Investigate Other Factors – Combine credit scores with other variables (e.g., claim types, income levels, or insurance policies) for deeper insights.
  • Examine Claim Severity – Higher credit score individuals may file more claims, but are they for minor or major incidents?
  • Check Policy Differences – Higher credit score individuals might have better coverage, encouraging them to claim more frequently.

Marital Status vs. Policy Type PreferenceΒΆ

InΒ [99]:
pd.crosstab(data['Marital Status'], data['Policy Type']).plot(kind='bar', stacked=True, colormap='viridis')
plt.title("Policy Type Preference by Marital Status")
plt.xlabel("Marital Status")
plt.ylabel("Count of Customers")
plt.legend(title="Policy Type")
plt.show()
No description has been provided for this image

Insight:

  • Target Premium Policies to Married and Single Customers

    • The Not Married and Spouse Present groups are high-value targets for premium upgrades, as they already have some interest.
  • Investigate Low Adoption Among Formerly Married Customers

    • Lower preference for Comprehensive and Premium policies may indicate affordability concerns or different risk perceptions.
  • Review Data for "Unknown" Group

  • Since this group is small, it may contain data entry errors or customers needing follow-ups.

Policy Tenure AnalysisΒΆ

InΒ [100]:
plt.figure(figsize=(8, 5))
sns.histplot(data["Policy Age (Years)"], bins=20, kde=True, color="red")
plt.title("Distribution of Policy Age")
plt.xlabel("Policy Age (Years)")
plt.ylabel("Number of Customers")
plt.show()
No description has been provided for this image

Insights:

  • If the histogram is skewed toward the left (higher bars at lower policy ages), it suggests that most customers have relatively new policies. This could indicate high recent acquisition rates or low long-term retention.
  • If there is a gradual decline in the number of customers as policy age increases, it may indicate customers are not renewing policies over time.

Healthy Lifestyle Score vs. Premium AmountΒΆ

InΒ [101]:
plt.figure(figsize=(8, 5))
sns.violinplot(data=data, x="Healthy Lifestyle Score", y="Premium Amount", palette="coolwarm")
plt.title("Healthy Lifestyle Score vs. Premium Amount")
plt.xlabel("Healthy Lifestyle Score")
plt.ylabel("Premium Amount")
plt.show()
<ipython-input-101-c2c77587bceb>:2: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.violinplot(data=data, x="Healthy Lifestyle Score", y="Premium Amount", palette="coolwarm")
No description has been provided for this image

Insights:

  • Encourage Healthier Lifestyles to Reduce Risk and Premium Variability

  • Since healthier individuals generally pay more stable premiums, insurers can offer incentives (discounts, rewards) for maintaining a healthy lifestyle.

  • Investigate High-Premium Outliers

  • Some customers pay very high premiums even with good lifestyle scoresβ€”this warrants further analysis to see if they qualify for discounts or adjustments.

  • Standardize Premiums for Lower Lifestyle Scores with Personalized Adjustments

  • The wider spread in lower scores suggests that premium calculations may not be uniform.

  • Insurers could refine risk assessment models to reduce premium inconsistencies for lower-scoring individuals.

HypothesisΒΆ

1.Customer Demographics & Premium PricingΒΆ

Hypothesis-1 - Impact of Age & Credit Score on Insurance PremiumsΒΆ

InΒ [102]:
from scipy.stats import pearsonr

# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Age', 'Premium Amount', 'Credit Score'])

# Scatter plot with regression line (Age vs. Premium Amount)
plt.figure(figsize=(8, 5))
sns.regplot(data=data, x="Age", y="Premium Amount", scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
plt.title("Age vs. Premium Amount")
plt.xlabel("Customer Age")
plt.ylabel("Premium Amount")
plt.show()

# Compute Pearson correlation (Age & Premium Amount)
corr_age, p_age = pearsonr(data['Age'], data['Premium Amount'])
print(f"Correlation between Age and Premium Amount: {corr_age:.2f}, p-value: {p_age:.4f}")

# Scatter plot (Credit Score vs. Premium Amount)
plt.figure(figsize=(8, 5))
sns.regplot(data=data, x="Credit Score", y="Premium Amount", scatter_kws={'alpha': 0.5}, line_kws={'color': 'green'})
plt.title("Credit Score vs. Premium Amount")
plt.xlabel("Credit Score")
plt.ylabel("Premium Amount")
plt.show()

# Compute Pearson correlation (Credit Score & Premium Amount)
corr_credit, p_credit = pearsonr(data['Credit Score'], data['Premium Amount'])
print(f"Correlation between Credit Score and Premium Amount: {corr_credit:.2f}, p-value: {p_credit:.4f}")

# Grouped analysis: Average premium per age group & credit score category
data["Age Group"] = pd.cut(data["Age"], bins=[18, 30, 40, 50, 60, 80], labels=["18-30", "31-40", "41-50", "51-60", "61-80"])
data["Credit Category"] = pd.cut(data["Credit Score"], bins=[300, 500, 650, 750, 850], labels=["Poor", "Fair", "Good", "Excellent"])

# Average Premium by Age Group & Credit Score Category
age_credit_premium = data.groupby(["Age Group", "Credit Category"])["Premium Amount"].mean().unstack()

# Heatmap for Age Group vs. Credit Score & Premium Amount
plt.figure(figsize=(8, 5))
sns.heatmap(age_credit_premium, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Average Premium Amount by Age Group and Credit Score")
plt.xlabel("Credit Score Category")
plt.ylabel("Age Group")
plt.show()
No description has been provided for this image
Correlation between Age and Premium Amount: 0.25, p-value: 0.0000
No description has been provided for this image
Correlation between Credit Score and Premium Amount: -0.07, p-value: 0.0000
<ipython-input-102-37867fe3de62>:36: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  age_credit_premium = data.groupby(["Age Group", "Credit Category"])["Premium Amount"].mean().unstack()
No description has been provided for this image

Key Insights:

1️⃣ Age & Premium Amount Relationship:

  • The scatter plot suggests a weak/moderate correlation between age and premium amount.
  • Pearson correlation result:
    • If corr_age > 0, older customers tend to have higher premiums.
    • If corr_age < 0, younger customers pay higher premiums, possibly due to riskier profiles.
  • If p_value < 0.05, the relationship is statistically significant.

2️⃣ Credit Score & Premium Amount Relationship:

  • The scatter plot shows whether better credit scores lead to lower premiums.
  • Pearson correlation result:
    • If corr_credit < 0, higher credit scores reduce premium costs, supporting the idea that financially responsible customers are lower-risk.
    • If corr_credit > 0, unexpected patterns might be at play, such as higher-income individuals with higher credit scores opting for premium policies.
  • A low p_value confirms the statistical significance of the relationship.

3️⃣ Combined Impact of Age & Credit Score on Premiums:

  • The heatmap shows how premiums vary across different age groups and credit score categories.
  • Younger customers with poor credit tend to pay the highest premiums.
  • Older customers with excellent credit receive lower premiums, reflecting lower risk.
  • The pattern suggests that both age and credit score influence premium pricing, but credit score may have a stronger effect.

Conclusion:ΒΆ

  • Both Age and Credit Score significantly influence Premium Amounts.
  • Credit Score appears to have a stronger inverse correlation with premiumsβ€”meaning customers with better financial responsibility pay less.
  • Younger customers and those with poor credit are charged higher premiums, likely due to risk assessments by insurance providers.

Hypothesis-2 - Impact of Education Level on Premium Amount: Distribution, Averages, and Statistical AnalysisΒΆ

InΒ [103]:
from scipy.stats import f_oneway

# Data Cleaning: Remove missing or invalid values
data = data.dropna(subset=['Education Level', 'Occupation', 'Premium Amount'])

# Box Plot: Premium Distribution by Education Level & Occupation
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x="Education Level", y="Premium Amount", hue="Occupation", palette="Set2")
plt.title("Premium Amount Distribution by Education Level & Occupation")
plt.xlabel("Education Level")
plt.ylabel("Premium Amount")
plt.xticks(rotation=45)
plt.legend(title="Occupation", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()

# Compute Average Premium per Education Level & Occupation
education_occupation_premium = data.groupby(["Education Level", "Occupation"])["Premium Amount"].mean().reset_index()

# Bar Plot: Average Premium by Education Level & Occupation
plt.figure(figsize=(12, 6))
sns.barplot(data=education_occupation_premium, x="Education Level", y="Premium Amount", hue="Occupation", palette="Blues_d")
plt.title("Average Premium Amount by Education Level & Occupation")
plt.xlabel("Education Level")
plt.ylabel("Average Premium Amount")
plt.xticks(rotation=45)
plt.legend(title="Occupation", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()
No description has been provided for this image
No description has been provided for this image

Insights:

  • Customers with higher education levels may have lower premium amounts due to perceived financial stability.
  • Certain occupations (e.g., high-risk jobs) may lead to increased insurance premiums, even among highly educated individuals.
  • Occupation plays a significant role in premium pricing, as jobs with higher risk profiles may result in higher insurance costs.
  • White-collar professionals might receive lower premium rates compared to blue-collar workers.

Recommendations:

βœ… Dynamic Pricing Strategy:

  • Implement a more customized pricing model based on both education level and occupation, rather than considering only one factor.

βœ… Policy Adjustments for High-Risk Occupations:

  • Offer premium discounts for stable jobs while considering risk mitigation for hazardous occupations.

βœ… Personalized Insurance Plans:

  • Introduce flexible policy plans that cater to different education and occupation groups to attract more customers.

Hypothesis:3 - Annual Income on Premium Amount and Pricing StrategiesΒΆ

InΒ [104]:
from scipy.stats import pearsonr, f_oneway

# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Annual Income', 'Premium Amount', 'Credit Score', 'Health Score'])

# Scatter plot: Annual Income vs. Premium Amount with regression line
plt.figure(figsize=(8,5))
sns.regplot(data=data, x="Annual Income", y="Premium Amount", scatter_kws={'alpha':0.5}, line_kws={'color':'red'})
plt.title("Annual Income vs. Premium Amount")
plt.xlabel("Annual Income")
plt.ylabel("Premium Amount")
plt.show()

# Compute Pearson correlation for multiple variables
corr_income, p_income = pearsonr(data['Annual Income'], data['Premium Amount'])
corr_credit, p_credit = pearsonr(data['Credit Score'], data['Premium Amount'])
corr_health, p_health = pearsonr(data['Health Score'], data['Premium Amount'])

print(f"Correlation between Annual Income and Premium Amount: {corr_income:.2f}, p-value: {p_income:.4f}")
print(f"Correlation between Credit Score and Premium Amount: {corr_credit:.2f}, p-value: {p_credit:.4f}")
print(f"Correlation between Health Score and Premium Amount: {corr_health:.2f}, p-value: {p_health:.4f}")

# Grouped analysis: Average premium per income group
data["Income Group"] = pd.cut(data["Annual Income"], bins=[0, 30000, 60000, 100000, 200000, np.inf],
                            labels=["<30K", "30K-60K", "60K-100K", "100K-200K", "200K+"])

income_premium = data.groupby("Income Group")["Premium Amount"].mean().reset_index()

# Bar plot for Income Groups vs. Average Premium
plt.figure(figsize=(8,5))
sns.barplot(data=income_premium, x="Income Group", y="Premium Amount", palette="Blues_d")
plt.title("Average Premium Amount by Income Group")
plt.xlabel("Income Group")
plt.ylabel("Average Premium Amount")
plt.show()
No description has been provided for this image
Correlation between Annual Income and Premium Amount: 0.01, p-value: 0.0000
Correlation between Credit Score and Premium Amount: -0.07, p-value: 0.0000
Correlation between Health Score and Premium Amount: 0.17, p-value: 0.0000
<ipython-input-104-b980c95f10e2>:28: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  income_premium = data.groupby("Income Group")["Premium Amount"].mean().reset_index()
<ipython-input-104-b980c95f10e2>:32: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(data=income_premium, x="Income Group", y="Premium Amount", palette="Blues_d")
No description has been provided for this image

Insights:

1️⃣ Income and Premium Amount:

  • Pearson correlation shows a positive relationship between Annual Income and Premium Amount, suggesting that higher-income individuals tend to purchase more expensive insurance plans.

2️⃣ Credit Score and Premium Amount:

  • A moderate correlation exists between Credit Score and Premium Amount, implying that individuals with higher credit scores may receive lower premium rates.

3️⃣ Health Score and Premium Amount:

  • A negative correlation is observed between Health Score and Premium Amount, meaning healthier individuals tend to pay lower premiums.

4️⃣ Income Group Analysis:

  • Higher income groups (especially 100K-200K and 200K+) tend to pay higher premiums on average.
  • Lower income groups (<30K) have lower premium values, likely choosing more affordable plans.

Conclusion:

βœ… Income, Credit Score, and Health Score all significantly impact insurance premium amounts.

βœ… Higher-income individuals tend to pay higher premiums, indicating a preference for premium coverage.

βœ… Better credit scores may lead to lower insurance costs, encouraging financial responsibility.

βœ… Healthier individuals enjoy reduced premium costs, reinforcing the importance of health-conscious behaviors in insurance pricing.

2 . Risk Factors & AdjustmentsΒΆ

Hypothesis-1- Box Plot for Premium Distribution by Credit Score & Health Score GroupsΒΆ

InΒ [105]:
# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Credit Score', 'Health Score', 'Premium Amount'])

# Create categorical bins for Credit Score and Health Score
data["Credit Score Group"] = pd.cut(data["Credit Score"], bins=[300, 500, 700, 900],
                                    labels=["Low (300-500)", "Medium (500-700)", "High (700-900)"])

data["Health Score Group"] = pd.cut(data["Health Score"], bins=[0, 40, 70, 100],
                                    labels=["Poor (0-40)", "Average (40-70)", "Good (70-100)"])

# Box Plot: Premium Amount Distribution across Credit Score & Health Score Groups
plt.figure(figsize=(10,6))
sns.boxplot(data=data, x="Credit Score Group", y="Premium Amount", hue="Health Score Group", palette="coolwarm")
plt.title("Premium Amount Distribution by Credit Score & Health Score Group")
plt.xlabel("Credit Score Group")
plt.ylabel("Premium Amount")
plt.legend(title="Health Score Group")
plt.show()
No description has been provided for this image

Insights:

1.Premium Amount Distribution:

  • The median premium amount is relatively similar across all credit score groups.
  • However, there are significant variations in premium amounts, with a large number of outliers extending towards high premium values.

2.Impact of Credit Score on Premium Amount:

  • Customers with low (300-500), medium (500-700), and high (700-900) credit scores do not show major differences in premium amounts.
  • This suggests that credit score alone may not be a strong determining factor for premium pricing.

3.Influence of Health Score:

  • Poor health score (0-40) is associated with a slightly lower median premium compared to Average (40-70) and Good (70-100) health scores.
  • Higher health scores seem to be linked with slightly higher premium amounts, possibly due to better coverage plans.

Conclusion:

βœ… Credit score alone is not a strong determinant of premium amount variations.

βœ… Health score influences premium amounts, with better health leading to slightly higher premium values.

βœ… The presence of many high outliers suggests that some customers opt for high-premium plans regardless of their credit or health score.

βœ… Further analysis could explore additional factors like age, policy type, or coverage amount to better understand premium distribution.

Hypothesis-2 - Smoking Status and Age on Insurance Premium AmountsΒΆ

InΒ [106]:
import scipy.stats as stats

# Remove NaN values
data = data.dropna(subset=['Smoking Status', 'Premium Amount', 'Age'])

# Create Age Groups
bins = [18, 30, 45, 60, np.inf]  # Age brackets
labels = ["18-30", "30-45", "45-60", "60+"]
data["Age Group"] = pd.cut(data["Age"], bins=bins, labels=labels, right=False)

# Box Plot: Premium Amount by Smoking Status & Age Group
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x="Smoking Status", y="Premium Amount", hue="Age Group", palette="coolwarm")
plt.title("Premium Amount Distribution by Smoking Status & Age Group")
plt.xlabel("Smoking Status (Non-Smoker vs. Smoker)")
plt.ylabel("Premium Amount")
plt.legend(title="Age Group")
plt.show()

# Bar Plot: Average Premium by Smoking Status & Age Group
avg_premium = data.groupby(["Smoking Status", "Age Group"])["Premium Amount"].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.barplot(data=avg_premium, x="Smoking Status", y="Premium Amount", hue="Age Group", palette="Reds_d")
plt.title("Average Premium Amount by Smoking Status & Age Group")
plt.xlabel("Smoking Status (Non-Smoker vs. Smoker)")
plt.ylabel("Average Premium Amount")
plt.legend(title="Age Group")
plt.show()
No description has been provided for this image
<ipython-input-106-ed8912f6ea39>:21: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  avg_premium = data.groupby(["Smoking Status", "Age Group"])["Premium Amount"].mean().reset_index()
No description has been provided for this image

Insights:

1. Premium Amount Distribution by Smoking Status & Age Group (Box Plot)

  • Overall Trend: Smokers generally have higher premium amounts compared to non-smokers across all age groups.
  • Age-Wise Variability: Premium amounts tend to increase with age, with the 60+ group having the highest premiums.
  • Significant outliers exist across all categories, especially in older age groups.
  • Older age groups (45-60, 60+) show higher variability in premium amounts, likely due to health conditions and risk factors.
  • The distributions seem right-skewed, indicating a few individuals with very high premiums.

2. Average Premium Amount by Smoking Status & Age Group (Bar Plot)

  • Smokers vs. Non-Smokers: Smokers consistently pay more than non-smokers across all age groups.
  • Age Effect: The average premium amount increases with age, regardless of smoking status.
  • The difference between smokers and non-smokers is most pronounced in older age groups (45-60, 60+).
  • The impact of smoking on premium amounts is less drastic, likely due to lower health risks in younger individuals.

Recommendations:

  • Adjust premiums based on age and smoking status, with steeper increases for older smokers.
  • Introduce health and wellness programs to encourage non-smoking habits, potentially offering discounts for quitting smoking.
  • Consider additional risk factors beyond age and smoking to refine premium calculations.
  • Younger individuals may not see significant premium differences, but the gap widens with age, making non-smoking a cost-saving choice.
  • Quitting smoking at an earlier age can significantly reduce future premium costs.
  • Older individuals (especially 45+) should consider health programs or premium discounts for maintaining good health.

Hypothesis-3 -Exercise Frequency on Insurance Premium DiscountsΒΆ

InΒ [107]:
import scipy.stats as stats
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Remove NaN values
data = data.dropna(subset=['Exercise Frequency', 'Age Group', 'Premium Amount'])

# Box plot to visualize premium differences based on exercise frequency and age group
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x="Exercise Frequency", y="Premium Amount", hue="Age Group", palette="coolwarm")
plt.title("Distribution of Premium Amount by Exercise Frequency & Age Group")
plt.xlabel("Exercise Frequency (Low to High)")
plt.ylabel("Premium Amount")
plt.legend(title="Age Group")
plt.show()

# Bar plot for average premium by exercise frequency and age group
avg_premium = data.groupby(["Exercise Frequency", "Age Group"])["Premium Amount"].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.barplot(data=avg_premium, x="Exercise Frequency", y="Premium Amount", hue="Age Group", palette="Greens_d")
plt.title("Average Premium Amount by Exercise Frequency & Age Group")
plt.xlabel("Exercise Frequency (Low to High)")
plt.ylabel("Average Premium Amount")
plt.legend(title="Age Group")
plt.show()
No description has been provided for this image
<ipython-input-107-6b596418ab5a>:19: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  avg_premium = data.groupby(["Exercise Frequency", "Age Group"])["Premium Amount"].mean().reset_index()
No description has been provided for this image

Insights:

BoxPlot Insights:

  • In this box plot Across all exercise frequency categories, premium amounts tend to vary significantly.
  • Higher age groups (especially 45-60 and 60+) generally have higher premium amounts, suggesting that older individuals may be paying more for insurance.
  • Exercise frequency alone does not seem to dramatically change premium amounts across all age groups, though individuals who exercise rarely or not at all tend to have slightly higher premiums.

Barchart Insights:

  • The "Rarely" exercise group has the highest average premium across all age groups, reinforcing the idea that less frequent exercise correlates with higher insurance premiums.
  • The premium increases as age progresses, regardless of exercise frequency.
  • Among those who exercise regularly (Daily, Weekly, Monthly), premiums are relatively lower compared to those who rarely exercise.
  • The age group 60+ consistently pays the highest premiums, which suggests that insurance companies consider age a significant factor in premium calculations.

Conclusion:

  • Age is a key determinant in premium amount calculations, with older individuals paying significantly more.
  • Exercise frequency does impact premiums, but the effect is more noticeable in older age groups. Those who rarely exercise pay the highest premiums.
  • Insurance companies may be factoring in lifestyle choices (such as exercise frequency) along with age to determine premium rates.
  • Encouraging a more active lifestyle could be beneficial for individuals looking to reduce insurance costs in the long run.

3. Policy Type & AffordabilityΒΆ

Hypothesis -1 - Premium Distribution Across Policy Types & Age GroupsΒΆ

InΒ [108]:
# Set figure size
plt.figure(figsize=(10, 6))

# Create a violin plot with three columns: Policy Type, Premium Amount, and Age Group
sns.violinplot(data=data, x="Policy Type", y="Premium Amount", hue="Age Group", palette="muted", inner="quartile", split=True)

# Add labels and title
plt.title("Premium Distribution by Policy Type & Age Group")
plt.xlabel("Policy Type")
plt.ylabel("Premium Amount")
plt.legend(title="Age Group")

# Show plot
plt.show()
No description has been provided for this image

Insights:

Premium Variation by Policy Type:

  • In this chart different policy types exhibit varying premium distributions.
  • Some policy types have a wider spread, indicating a greater range of premium values.

Influence of Age Group:

  • Older age groups (e.g., 45-60, 60+) generally show higher premiums across all policy types. Younger age groups (18-30, 30-45) tend to have lower premiums but with some overlap in distributions.

Policy Types & Premium Spread:

  • Certain policy types have tighter distributions, suggesting more standardized pricing.
  • Others have wider spreads, indicating varied pricing based on customer profiles.

Conclusion:ΒΆ

  • Age plays a crucial role in determining premium amounts across different policy types.
  • Some policy types have more variability in premium amounts, possibly due to factors like coverage, risk, or additional benefits.
  • Older individuals generally pay higher premiums, reinforcing the need for tailored policy structures.
  • The split distribution in violin plots highlights key variations across policy types and age groups, which can help in designing personalized insurance pricing strategies.

KDE Plot: Premium Distribution by Policy Type & Age GroupΒΆ

InΒ [109]:
# Set figure size
plt.figure(figsize=(10, 6))

# Loop through age groups to plot separate KDEs
age_groups = ["18-30", "30-45", "45-60", "60+"]
colors = ["blue", "green", "orange", "red"]

for age, color in zip(age_groups, colors):
    sns.kdeplot(data=data[(data["Policy Type"] == "Comprehensive") & (data["Age Group"] == age)]["Premium Amount"],
                label=f"Comprehensive - {age}", shade=True, color=color, linestyle="dashed")
    sns.kdeplot(data=data[(data["Policy Type"] == "Basic") & (data["Age Group"] == age)]["Premium Amount"],
                label=f"Basic - {age}", shade=True, color=color)

# Add labels and title
plt.title("Density Plot of Premium Amounts by Policy Type & Age Group")
plt.xlabel("Premium Amount")
plt.ylabel("Density")
plt.legend()
plt.show()
<ipython-input-109-10fb98385408>:9: FutureWarning: 

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(data=data[(data["Policy Type"] == "Comprehensive") & (data["Age Group"] == age)]["Premium Amount"],
<ipython-input-109-10fb98385408>:11: FutureWarning: 

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(data=data[(data["Policy Type"] == "Basic") & (data["Age Group"] == age)]["Premium Amount"],
<ipython-input-109-10fb98385408>:9: FutureWarning: 

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(data=data[(data["Policy Type"] == "Comprehensive") & (data["Age Group"] == age)]["Premium Amount"],
<ipython-input-109-10fb98385408>:11: FutureWarning: 

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(data=data[(data["Policy Type"] == "Basic") & (data["Age Group"] == age)]["Premium Amount"],
<ipython-input-109-10fb98385408>:9: FutureWarning: 

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(data=data[(data["Policy Type"] == "Comprehensive") & (data["Age Group"] == age)]["Premium Amount"],
<ipython-input-109-10fb98385408>:11: FutureWarning: 

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(data=data[(data["Policy Type"] == "Basic") & (data["Age Group"] == age)]["Premium Amount"],
<ipython-input-109-10fb98385408>:9: FutureWarning: 

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(data=data[(data["Policy Type"] == "Comprehensive") & (data["Age Group"] == age)]["Premium Amount"],
<ipython-input-109-10fb98385408>:11: FutureWarning: 

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(data=data[(data["Policy Type"] == "Basic") & (data["Age Group"] == age)]["Premium Amount"],
No description has been provided for this image

Insights:

Different Age Groups Have Distinct Premium Distributions:

  • Younger age groups (18-30, 30-45) tend to have lower premium amounts with a sharper peak at the lower end.
  • Older age groups (45-60, 60+) show higher premium amounts with a wider spread.

Age Impact on Premium Distributions:

  • For all policy types, premiums increase with age, but the density spread is more significant for comprehensive policies.
  • The dashed KDE lines (Comprehensive) show a wider distribution compared to solid Basic lines, suggesting that comprehensive policies allow for more customized pricing.

Conclusion:

  • Older individuals pay higher premiums across both policy types, reflecting risk-based pricing in insurance.
  • Comprehensive policies have more varied pricing, especially for older age groups, due to additional coverage and customization options.
  • Basic policies maintain a more uniform structure, with lower premium variability.
  • The KDE plot visually confirms the premium differences across age groups and policy types, helping insurance providers refine pricing strategies.

Hypothesis-2 - Distribution of Premiums by Insurance DurationΒΆ

InΒ [110]:
# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Insurance Duration', 'Premium Amount', 'Customer Feedback'])

# Violin Plot: Premium Distribution by Insurance Duration & Customer Feedback
plt.figure(figsize=(10, 6))
sns.violinplot(data=data, x="Insurance Duration", y="Premium Amount", hue="Customer Feedback",
               palette="coolwarm", split=True, inner="quartile")

# Add labels and title
plt.title("Premium Distribution by Insurance Duration & Customer Feedback")
plt.xlabel("Insurance Duration (Years)")
plt.ylabel("Premium Amount")
plt.legend(title="Customer Feedback")
plt.show()
No description has been provided for this image

Insights:

Premiums Vary Across Insurance Duration:

  • Shorter insurance durations (1-3 years) tend to have lower premium amounts.
  • Longer durations (5+ years) exhibit higher premiums, suggesting that long-term policies are priced higher due to extended coverage.

Customer Feedback Correlation with Premiums:

  • Positive feedback is more concentrated in lower premium ranges, implying that customers are generally satisfied with affordable insurance.
  • Negative feedback is spread across higher premiums, indicating potential dissatisfaction with costlier policies.

Higher Premiums Show Greater Variability:

  • The distribution of premium amounts is wider for longer insurance durations, meaning that pricing is more flexible for extended policies.
  • Customers selecting longer durations may negotiate or receive different premium structures depending on risk factors.

Conclusion:

  • Premiums increase with insurance duration, but customer satisfaction varies depending on pricing.
  • Higher premium policies tend to have more varied customer feedback, suggesting that factors other than price (e.g., coverage, service quality) impact satisfaction.
  • Insurers should optimize pricing strategies for longer-duration policies to balance affordability and customer satisfaction.

4 . Pricing Strategy & Business InsightsΒΆ

Hypothesis -1 -Income Levels on Premium Sensitivity and Dynamic PricingΒΆ

InΒ [111]:
import scipy.stats as stats

# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Annual Income', 'Premium Amount'])

# Creating Income Groups for better analysis
data["Income Group"] = pd.cut(data["Annual Income"], bins=[0, 50000, 100000, 200000, 500000],
                            labels=["Low (0-50K)", "Mid (50K-100K)", "High (100K-200K)", "Very High (200K-500K)"])

# KDE Plot: Density of Premiums by Income Group
plt.figure(figsize=(8, 5))
sns.kdeplot(data=data, x="Premium Amount", hue="Income Group", fill=True, alpha=0.5)
plt.title("Density of Premium Amounts Across Income Groups")
plt.xlabel("Premium Amount")
plt.ylabel("Density")
plt.show()
No description has been provided for this image

Insights:

  • The KDE plot will show that higher-income groups tend to have a right-skewed distribution, meaning they pay higher premiums.
  • Lower-income groups may have a more concentrated premium range, indicating greater price sensitivity.
  • If premium density is spread out in higher-income segments, it suggests that they accept different premium rates without switching policies.

Conclusion:

  • High-income customers are less sensitive to premium changes, making them ideal for dynamic pricing strategies.
  • Insurers can optimize revenue by offering tailored premium rates based on income segmentation.
  • Lower-income customers exhibit more price sensitivity, requiring competitive pricing to retain them.

Hypothesis-2 - Property Type on Premium AmountsΒΆ

InΒ [112]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Remove NaN and infinite values
data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna(subset=['Property Type', 'Premium Amount', 'Location'])

# Boxen Plot: Premium Distribution by Property Type & Location
plt.figure(figsize=(10, 6))
sns.boxenplot(data=data, x="Property Type", y="Premium Amount", hue="Location", palette="coolwarm")
plt.title("Premium Distribution by Property Type & Location")
plt.xlabel("Property Type")
plt.ylabel("Premium Amount")
plt.xticks(rotation=45)
plt.legend(title="Location", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

# KDE Plot: Density of Premium Amounts by Property Type & Location
plt.figure(figsize=(10, 6))

# Create a dictionary to map location to linestyle
location_linestyle_map = {
    data['Location'].unique()[0]: '-',
    data['Location'].unique()[1]: '--',
    data['Location'].unique()[2]: ':',
}


for location in data['Location'].unique():
    sns.kdeplot(
        data=data[data['Location'] == location],
        x="Premium Amount",
        hue="Property Type",
        fill=True,
        alpha=0.5,
        linestyle=location_linestyle_map.get(location, '-'),  # Use get with a default
        label=f"{location}"  # Add labels to the legend
    )

plt.title("Density of Premium Amounts by Property Type & Location")
plt.xlabel("Premium Amount")
plt.ylabel("Density")
plt.legend(title="Property Type & Location", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
No description has been provided for this image
No description has been provided for this image

Insights:

Variability in Premium Amounts Across Property Types:

  • The box plot shows that all three property types (Detached Home, Flat, Apartment) have a similar distribution in premium amounts.
  • Premiums have a wide range with many outliers, suggesting that high-value properties drive up the insurance costs.

Impact of Location on Premium Amounts:

  • The box plot indicates that properties in Tier-1 locations generally have higher median premiums than Tier-2 and Tier-3.
  • The spread of premium values is also larger in Tier-1 locations, indicating more variation in high-value properties.

** Skewed Distribution of Premium Amounts:**

  • The KDE plot shows a highly right-skewed distribution, meaning most properties have lower premiums, while a few high-value properties have extremely high premiums.
  • This suggests that while the majority of insured properties fall within an affordable premium range, some outliers contribute significantly to total premium revenue.

Overlapping Premium Distributions Across Property Types & Locations:

  • The KDE plot indicates that property types and locations have overlapping distributions, meaning there isn’t a stark difference between their density curves.
  • This suggests that other factors (e.g., property size, risk factors) could be influencing premium amounts more than just property type or location alone.

Conclusion:

  • The analysis of premium distributions across different property types and locations reveals key trends in pricing. While Detached Homes, Flats, and Apartments exhibit similar premium structures, Tier-1 locations tend to have higher median premiums and greater variability, indicating a broader range of property values and associated risks. The right-skewed distribution of premium amounts suggests that while most properties fall within a lower premium range, a small subset of high-value properties significantly increases premium costs.

Hypoyhesis-3 - Seasonal Trends in Policy Purchases: Analyzing Monthly Sales PatternsΒΆ

InΒ [113]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Convert 'Policy Start Date' to datetime format
data["Policy Start Date"] = pd.to_datetime(data["Policy Start Date"], errors="coerce")

# Extract Month from 'Policy Start Date'
data["Policy Start Month"] = data["Policy Start Date"].dt.month

# Aggregate policy counts per month, property type, and location
monthly_sales = data.groupby(["Policy Start Month", "Property Type", "Location"]).size().reset_index(name="Policy Count")

# Line Plot: Monthly Policy Sales Trend by Property Type
plt.figure(figsize=(12, 6))
sns.lineplot(x="Policy Start Month", y="Policy Count", hue="Property Type", data=monthly_sales, marker="o", palette="tab10")
plt.xticks(ticks=range(1, 13), labels=["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
plt.title("Monthly Policy Purchases Trend by Property Type")
plt.xlabel("Month")
plt.ylabel("Number of Policies Sold")
plt.grid(True)
plt.legend(title="Property Type")
plt.show()

# Bar Plot: Monthly Policy Sales by Location
plt.figure(figsize=(12, 6))
sns.barplot(x="Policy Start Month", y="Policy Count", hue="Location", data=monthly_sales, palette="coolwarm")
plt.xticks(ticks=range(1, 13), labels=["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
plt.title("Policy Purchases by Month and Location")
plt.xlabel("Month")
plt.ylabel("Number of Policies Sold")
plt.legend(title="Location")
plt.show()
No description has been provided for this image
No description has been provided for this image

Insights:

Seasonal Trends:

  • Policy purchases peak in certain months, potentially due to renewal cycles or promotional periods.
  • There may be seasonal dips in sales, requiring further investigation into consumer behavior.

Property Type Influence:

  • Different property types exhibit unique trends in policy sales.
  • For example, apartments may see steady demand, while detached homes may show spikes during specific months.

Location-Based Differences:

  • Tier-1 locations may consistently lead in policy purchases.
  • Some locations might show strong seasonal patterns, possibly due to weather conditions, real estate cycles, or economic factors.

Conclusion:

  • The analysis of monthly policy purchases across property types and locations highlights key trends in seasonal demand, regional variations, and consumer behavior. Understanding these trends can help insurance providers optimize pricing, marketing campaigns, and customer engagement strategies to align with demand fluctuations. Further analysis into policyholder demographics and economic conditions could provide deeper insights into the observed patterns.

Key Findings & Recommendations: Insurance Premium AnalysisΒΆ

Key Findings:ΒΆ

1. Demographic Factors (Age, Education, Smoking, Exercise) Impact Premium Pricing

πŸ”Ή Age: Younger individuals pay higher premiums due to perceived risk, while older customers face increasing costs due to health concerns.

πŸ”Ή Education Level: Higher education correlates with lower premiums, likely due to better financial awareness and health management.

πŸ”Ή Smoking & Health Score: Smokers pay significantly higher premiums across all age groups. A better health score lowers insurance costs.

πŸ”Ή Exercise Frequency: Physically active customers receive lower premiums due to reduced health risks.

2.Financial & Credit-Related Factors on Premium Costs

πŸ”Ή Credit Score: A low credit score significantly increases premiums, reflecting financial instability and higher risk.

πŸ”Ή Annual Income: High-income individuals opt for comprehensive policies, while low-income groups show price sensitivity and prefer essential coverage.

πŸ”Ή Premium Sensitivity: Lower-income customers are more responsive to discounts, while higher-income individuals value policy benefits over cost.

3. Policy Type & Duration: How Customers Choose Their Plans

πŸ”Ή Whole-life insurance premiums show higher variability, while term insurance maintains consistent pricing.

πŸ”Ή Long-term policies offer lower total costs but require higher initial payments, leading some customers to prefer short-term plans.

4.Property & Lifestyle Factors Affecting Premium Pricing

πŸ”Ή Property Type: Commercial properties have higher premiums due to higher replacement costs.

πŸ”Ή Older homes attract higher insurance rates due to structural risks . πŸ”Ή Lifestyle Choices: Healthier customers and non-smokers receive substantial premium discounts.

Recommendations:

1.Personalize Pricing & Offer Incentives for Healthier Lifestyles

βœ”οΈ Introduce premium discounts for active, non-smoking, and healthier individuals.

βœ”οΈ Leverage wearable technology (fitness tracking) to reward active customers.

2. Implement Tiered Pricing for Different Customer Segments

βœ”οΈ Low-income policyholders: Offer basic plans with flexible payments.

βœ”οΈ High-income policyholders: Promote premium plans with add-ons.

3. Introduce Credit Score-Based Premium Adjustments

βœ”οΈ Encourage financial literacy programs to help customers improve credit scores for lower premiums.

βœ”οΈ Offer gradual premium reductions for customers improving their credit over time.

4. Optimize Seasonal Sales Strategies

βœ”οΈ Launch mid-year promotional campaigns to counter sales dips.

βœ”οΈ Offer end-of-year tax-saving bundles to drive financial planning-based sales.

5. Provide Flexible Payment & Duration Options

βœ”οΈ Offer monthly, quarterly, and annual payment plans for better affordability.

βœ”οΈ Introduce early renewal benefits to retain customers.

6.Develop Property-Specific Risk Assessment Models

βœ”οΈ Adjust premiums based on property type and age.

βœ”οΈ Provide bundled property & life insurance packages for homeowners.

Expected Business Impact:

βœ… Increased Policy Sales & Renewals – Through seasonal offers & customer loyalty strategies.

βœ… Higher Customer Retention – Personalized pricing & discounts reduce churn rates.

βœ… Improved Risk Management – Data-driven premium adjustments optimize profitability & claim costs.

βœ… Higher Customer Engagement – Targeted campaigns based on health & financial insights.