# Install seaborn before importing it
%pip install seaborn

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


df=pd.read_csv(r"C:\Users\avadh\Desktop\data analyst nextleap\Milestone 4\bumble.csv")

df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   status       59946 non-null  object 
 2   gender       59946 non-null  object 
 3   body_type    54650 non-null  object 
 4   diet         35551 non-null  object 
 5   drinks       56961 non-null  object 
 6   education    53318 non-null  object 
 7   ethnicity    54266 non-null  object 
 8   height       59943 non-null  float64
 9   income       59946 non-null  int64  
 10  job          51748 non-null  object 
 11  last_online  59946 non-null  object 
 12  location     59946 non-null  object 
 13  pets         40025 non-null  object 
 14  religion     39720 non-null  object 
 15  sign         48890 non-null  object 
 16  speaks       59896 non-null  object 
dtypes: float64(1), int64(2), object(14)
memory usage: 7.8+ MB

print("statistical analysis (numerical columns)")
df.describe()

statistical analysis (numerical columns)

df.shape

(59946, 17)

df.isnull().sum()

age                0
status             0
gender             0
body_type       5296
diet           24395
drinks          2985
education       6628
ethnicity       5680
height             3
income             0
job             8198
last_online        0
location           0
pets           19921
religion       20226
sign           11056
speaks            50
dtype: int64

df.isnull().sum().sum()

104438

missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100

missing_summary = pd.DataFrame({
    'Missing Values': missing_data,
    'Percentage (%)': missing_percentage
}).sort_values(by='Percentage (%)', ascending=False)

print(missing_summary)

             Missing Values  Percentage (%)
diet                  24395       40.694959
religion              20226       33.740366
pets                  19921       33.231575
sign                  11056       18.443266
job                    8198       13.675641
education              6628       11.056618
ethnicity              5680        9.475194
body_type              5296        8.834618
drinks                 2985        4.979482
speaks                   50        0.083408
height                    3        0.005005
last_online               0        0.000000
location                  0        0.000000
income                    0        0.000000
status                    0        0.000000
gender                    0        0.000000
age                       0        0.000000

df.duplicated().sum()

0

df.dropna(subset=['height'],inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59943 entries, 0 to 59945
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59943 non-null  int64  
 1   status       59943 non-null  object 
 2   gender       59943 non-null  object 
 3   body_type    54650 non-null  object 
 4   diet         35551 non-null  object 
 5   drinks       56961 non-null  object 
 6   education    53318 non-null  object 
 7   ethnicity    54264 non-null  object 
 8   height       59943 non-null  float64
 9   income       59943 non-null  int64  
 10  job          51747 non-null  object 
 11  last_online  59943 non-null  object 
 12  location     59943 non-null  object 
 13  pets         40024 non-null  object 
 14  religion     39720 non-null  object 
 15  sign         48889 non-null  object 
 16  speaks       59893 non-null  object 
dtypes: float64(1), int64(2), object(14)
memory usage: 8.2+ MB

# Does the last_online column need to be converted into a datetime format?

# Convert 'last_online' column to datetime with the correct format
df['last_online'] = pd.to_datetime(df['last_online'], format='%Y-%m-%d-%H-%M')

# Display the result
df['last_online'].head()

0   2012-06-28 20:30:00
1   2012-06-29 21:41:00
2   2012-06-27 09:10:00
3   2012-06-28 14:22:00
4   2012-06-27 21:26:00
Name: last_online, dtype: datetime64[ns]
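
# The conversion above raises an error if any value does not match the format.
# A minimal sketch, on made-up strings rather than bumble.csv, of how
# errors="coerce" could be used to turn unparseable values into NaT and count them.

```python
import pandas as pd

# Hypothetical raw values; the last one deliberately does not match the format
raw = pd.Series(["2012-06-28-20-30", "2012-06-29-21-41", "not-a-date"])

# errors="coerce" converts values that fail to parse into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d-%H-%M", errors="coerce")

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```

# Checking the NaT count after conversion is one way to confirm that the format string fits every row.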

df.describe()

# Are there any apparent outliers in numerical columns such as age, height, or income? What are the ranges of values in these columns?

# Any -1 values in numerical columns like income should be replaced with 0, as they may represent missing or invalid data.

# Calculate IQR-based outlier bounds for age and height (income is handled separately below)
columns_to_check = ['age', 'height']
outliers = {}

for col in columns_to_check:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    # Calculate lower and upper bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers
    outliers[col] = df[(df[col] < lower_bound) | (df[col] > upper_bound)]

    # Print results
    print(f"{col} - IQR: {IQR:.2f}, Lower Bound: {lower_bound:.2f}, Upper Bound: {upper_bound:.2f}")
    print(f"Number of outliers in {col}: {outliers[col].shape[0]}")

age - IQR: 11.00, Lower Bound: 9.50, Upper Bound: 53.50
Number of outliers in age: 2638
height - IQR: 5.00, Lower Bound: 58.50, Upper Bound: 78.50
Number of outliers in height: 285
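
# The IQR rule in the loop above can also be wrapped in a small helper.
# A sketch only: iqr_bounds and the toy values are illustrative names and data,
# not part of the original notebook.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages with one obviously extreme value
toy_ages = pd.Series([22, 25, 27, 30, 31, 33, 36, 95])

lower, upper = iqr_bounds(toy_ages)
outlier_mask = (toy_ages < lower) | (toy_ages > upper)

print(f"Lower: {lower:.3f}, Upper: {upper:.3f}")
print("Outliers:", toy_ages[outlier_mask].tolist())
```

# A larger k (for example 3.0) flags only more extreme values.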

# Step 1: Calculate the 10th and 90th percentiles for 'age' and 'height'
age_10th, age_90th = df['age'].quantile(0.10), df['age'].quantile(0.90)
height_10th, height_90th = df['height'].quantile(0.10), df['height'].quantile(0.90)

# Step 2: Calculate the median for 'age' and 'height' within the middle 80%
middle_80_data = df[(df['age'] >= age_10th) & (df['age'] <= age_90th) &
                     (df['height'] >= height_10th) & (df['height'] <= height_90th)]

median_age = middle_80_data['age'].median()
median_height = middle_80_data['height'].median()

print(f"Median Age (Middle 80%): {median_age}")
print(f"Median Height (Middle 80%): {median_height}")

Median Age (Middle 80%): 30.0
Median Height (Middle 80%): 68.0

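# Step 3: Replace values outside the 10th-90th percentile range of 'age' and 'height' with the calculated median
# (this range is narrower than the IQR bounds computed earlier, so more rows are changed than the IQR rule flagged)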
df['age'] = df['age'].apply(lambda x: median_age if x < age_10th or x > age_90th else x)
df['height'] = df['height'].apply(lambda x: median_height if x < height_10th or x > height_90th else x)

# Step 4: Verify the changes
print(df[['age', 'height']].describe())

                age        height
count  59943.000000  59943.000000
mean      30.897986     68.193367
std        5.419528      2.633236
min       23.000000     63.000000
25%       27.000000     66.000000
50%       30.000000     68.000000
75%       34.000000     70.000000
max       46.000000     73.000000
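
# Replacing every value outside the 10th-90th percentile range with the median piles those rows onto one value.
# A sketch on made-up ages comparing that approach with clipping to the percentile range (a different technique,
# shown only for contrast). The toy median is the plain median, not the "middle 80%" median used above.

```python
import pandas as pd

# Hypothetical ages, not taken from bumble.csv
toy = pd.Series([18, 24, 27, 30, 33, 40, 69], name="age")
p10, p90 = toy.quantile(0.10), toy.quantile(0.90)

# Approach used above: values outside [p10, p90] become the median
replaced = toy.where(toy.between(p10, p90), toy.median())

# Alternative: clip values to the percentile range, which preserves their ordering
clipped = toy.clip(lower=p10, upper=p90)

print(pd.DataFrame({"original": toy, "median_replaced": replaced, "clipped": clipped}))
```

# In the toy data the youngest and oldest users both end up at the median under the first approach,
# while clipping keeps them at the low and high ends of the range.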

# Replace -1 values with 0 in the 'income' column using replace method
df['income'] = df['income'].replace(-1, 0)

# Verify the changes
print(df['income'].describe())

count      59943.000000
mean       20035.033282
std        97348.358589
min            0.000000
25%            0.000000
50%            0.000000
75%            0.000000
max      1000000.000000
Name: income, dtype: float64
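
# With -1 mapped to 0, the unreported incomes dominate every summary statistic.
# A sketch on made-up values showing how mapping -1 to NaN instead would keep them out of the median.
# This is an alternative for comparison, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes where -1 marks "not reported"
toy_income = pd.Series([-1, -1, -1, 20000, 50000, 100000])

as_zero = toy_income.replace(-1, 0)          # approach used above
as_missing = toy_income.replace(-1, np.nan)  # alternative: treat as missing

# pandas skips NaN in median() by default, so only reported incomes count
print("Median with zeros:  ", as_zero.median())
print("Median with missing:", as_missing.median())
```

# The same effect explains why quartiles computed on the zero-filled column can all come out as 0.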

# Create a boolean mask where True indicates missing data
missing_data = df.isnull()

# Create a heatmap to visualize missing values
plt.figure(figsize=(8, 5))
sns.heatmap(missing_data, cbar=False, cmap=['black', '#FFCC00'], yticklabels=False, xticklabels=df.columns)

# Display the heatmap
plt.title("Missing Data Heatmap")
plt.show()

# Define the bin edges
bins = [18, 25, 35, 45, float('inf')]  # 'inf' for ages above 45
labels = ["18-25", "26-35", "36-45", "46+"]  # Labels for the bins

# Bin the 'age' column into the defined categories
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=True)

# Display the first few rows to check
print(df[['age', 'age_group']].head())

    age age_group
0  30.0     26-35
1  35.0     26-35
2  38.0     36-45
3  23.0     18-25
4  29.0     26-35
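
# With right=True, pd.cut builds intervals that are open on the left, so a value equal to the lowest edge (18)
# is not assigned to any bin. That does not affect the data above, whose minimum age is 23 after cleaning,
# but a sketch on made-up ages shows the edge behavior and the include_lowest option.

```python
import pandas as pd

# Hypothetical ages chosen to sit on the bin edges
toy_ages = pd.Series([18, 25, 26, 45, 46, 70])
bins = [18, 25, 35, 45, float('inf')]
labels = ["18-25", "26-35", "36-45", "46+"]

# Default: first interval is (18, 25], so 18 itself becomes NaN
default = pd.cut(toy_ages, bins=bins, labels=labels, right=True)

# include_lowest=True closes the first interval on the left, so 18 is kept
inclusive = pd.cut(toy_ages, bins=bins, labels=labels, right=True, include_lowest=True)

print(pd.DataFrame({"age": toy_ages, "default": default, "include_lowest": inclusive}))
```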

print(df['income'].value_counts())

income
0          48439
20000       2952
100000      1621
80000       1111
30000       1048
40000       1005
50000        975
60000        736
70000        707
150000       631
1000000      521
250000       149
500000        48
Name: count, dtype: int64

# Plot the income distribution 
plt.figure(figsize=(10, 6))
sns.histplot(df['income'], bins=30, kde=True, color='#FFCC00')  
plt.title('Income Distribution', fontsize=16, color='black')
plt.xlabel('Income', fontsize=12, color='black')
plt.ylabel('Frequency', fontsize=12, color='black')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()

# Calculate the quartiles
low_income_threshold = df['income'].quantile(0.25)  # 25th percentile
high_income_threshold = df['income'].quantile(0.75)  # 75th percentile

# Display the thresholds
print(f"Low Income Threshold: {low_income_threshold}")
print(f"High Income Threshold: {high_income_threshold}")

Low Income Threshold: 0.0
High Income Threshold: 0.0

# Exclude zeros for meaningful categorization
non_zero_income = df[df['income'] > 0]['income']

# Calculate quartile-based thresholds (ignoring 0 values)
low_income_threshold = non_zero_income.quantile(0.25)  # 25th percentile
high_income_threshold = non_zero_income.quantile(0.75)  # 75th percentile

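# Define bins for income categories
# With right=True the first bin excludes its lower edge; the minimum income is 0,
# so zero-income rows fall outside every bin and get NaN as their income_group
bins = [df['income'].min(), low_income_threshold, high_income_threshold, df['income'].max()]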
labels = ["Low Income", "Medium Income", "High Income"]

# Create the 'income_group' column
df['income_group'] = pd.cut(df['income'], bins=bins, labels=labels, right=True)

print(df['income_group'].value_counts(dropna=False))

income_group
NaN              48439
Medium Income     7203
Low Income        2952
High Income       1349
Name: count, dtype: int64

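# Calculate profile completeness as the percentage of non-missing values for each user profile
# Note: df already contains the derived columns age_group and income_group at this point,
# so they are counted too (income_group is NaN for every zero-income row)
df['profile_completeness'] = (df.notnull().sum(axis=1) / len(df.columns)) * 100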

# Analyze completeness by gender
profile_completeness_by_gender = df.groupby('gender')['profile_completeness'].mean().reset_index()

# Display results
print("\nProfile Completeness by Gender:")
print(profile_completeness_by_gender)

# Profile completeness by age
completeness_by_age = df.groupby('age')['profile_completeness'].mean().reset_index()
print("\nProfile Completeness by Age:")
print(completeness_by_age)

# Profile completeness by status
completeness_by_status = df.groupby('status')['profile_completeness'].mean().reset_index()
print("\nProfile Completeness by Status:")
print(completeness_by_status)

Profile Completeness by Gender:
  gender  profile_completeness
0      f             86.454941
1      m             86.662808

Profile Completeness by Age:
     age  profile_completeness
0   23.0             86.478639
1   24.0             86.269035
2   25.0             86.291934
3   26.0             86.314941
4   27.0             86.314361
5   28.0             86.105439
6   29.0             86.386071
7   30.0             86.930099
8   31.0             86.300394
9   32.0             86.373998
10  33.0             86.274276
11  34.0             86.258232
12  35.0             86.435747
13  36.0             86.557835
14  37.0             86.655848
15  38.0             86.909379
16  39.0             86.801689
17  40.0             86.882984
18  41.0             87.454350
19  42.0             87.043401
20  43.0             87.118145
21  44.0             87.265834
22  45.0             87.787509
23  46.0             87.297396

Profile Completeness by Status:
           status  profile_completeness
0       available             87.430507
1         married             86.943973
2  seeing someone             87.112403
3          single             86.530062
4         unknown             80.000000
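
# The completeness score above is computed over every column present in df, including derived ones.
# A sketch, on a made-up frame, of restricting the score to a fixed list of profile columns
# captured before any derived columns are added. All column names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical profiles with some missing answers
toy = pd.DataFrame({
    "age": [25, 31, 40],
    "diet": ["vegan", None, None],
    "pets": ["likes dogs", "has cats", None],
})

# Freeze the list of profile fields before adding derived columns
profile_cols = list(toy.columns)

# A derived column that is mostly missing (illustrative only)
toy["derived_group"] = [np.nan, np.nan, "x"]

# Share of answered profile fields per row, ignoring derived columns
toy["completeness"] = toy[profile_cols].notna().mean(axis=1) * 100

print(toy[["age", "completeness"]].round(2))
```

# Scoring only the frozen column list keeps later feature engineering from shifting the percentages.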

# Conversion factor from inches to centimeters
conversion_factor = 2.54

# Convert the height column to centimeters and store in a new column
df['height_cm'] = df['height'] * conversion_factor

# Display the updated dataframe or verify the new column
print(df[['height', 'height_cm']].head())

   height  height_cm
0    68.0     172.72
1    70.0     177.80
2    68.0     172.72
3    71.0     180.34
4    66.0     167.64

# Calculate the gender distribution
gender_distribution = df['gender'].value_counts()

# Calculate the percentage distribution for better insights
gender_percentage = df['gender'].value_counts(normalize=True) * 100

# Display the results with formatted percentage values
print("Gender Distribution (Count):")
print(gender_distribution)
print("\nGender Distribution (Percentage):")
print(gender_percentage.apply(lambda x: f"{x:.2f}"))

Gender Distribution (Count):
gender
m    35827
f    24116
Name: count, dtype: int64

Gender Distribution (Percentage):
gender
m    59.77
f    40.23
Name: proportion, dtype: object

# Calculate the status distribution
status_distribution = df['status'].value_counts()

# Calculate the percentage distribution for better insights
status_percentage = df['status'].value_counts(normalize=True) * 100

# Display the results with formatted percentage values
print("Status Distribution (Count):")
print(status_distribution)
print("\nStatus Distribution (Percentage):")
print(status_percentage.apply(lambda x: f"{x:.2f}"))

Status Distribution (Count):
status
single            55694
seeing someone     2064
available          1865
married             310
unknown              10
Name: count, dtype: int64

Status Distribution (Percentage):
status
single            92.91
seeing someone     3.44
available          3.11
married            0.52
unknown            0.02
Name: proportion, dtype: object

# Calculate the count and percentage of status by gender
status_gender = df.groupby(['gender', 'status']).size().unstack()

# Normalize by gender for percentage calculation
status_gender_percentage = status_gender.div(status_gender.sum(axis=1), axis=0) * 100

# Display the results with formatted percentage values
print(f"Status Distribution by Gender (Count):\n{status_gender}")
print(f"\nStatus Distribution by Gender (Percentage):\n{status_gender_percentage.apply(lambda x: x.map(lambda v: f'{v:.2f}'))}")  # Format each percentage value

Status Distribution by Gender (Count):
status  available  married  seeing someone  single  unknown
gender                                                     
f             656      135            1003   22318        4
m            1209      175            1061   33376        6

Status Distribution by Gender (Percentage):
status available married seeing someone single unknown
gender                                                
f           2.72    0.56           4.16  92.54    0.02
m           3.37    0.49           2.96  93.16    0.02

# Encode gender as numeric (1 for male, 0 for female)
df['gender_encoded'] = df['gender'].map({'m': 1, 'f': 0})

# Calculate the correlation matrix
correlation_matrix = df[['age', 'income', 'gender_encoded']].corr()

# Display the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)

Correlation Matrix:
                     age    income  gender_encoded
age             1.000000  0.008459       -0.016291
income          0.008459  1.000000        0.074604
gender_encoded -0.016291  0.074604        1.000000

# Calculate the count and percentage of diet preferences
diet_distribution = df['diet'].value_counts()
diet_percentage = df['diet'].value_counts(normalize=True) * 100

# Combine count and percentage into a single DataFrame for visualization
diet_combined = pd.DataFrame({
    'Count': diet_distribution,
    'Percentage': diet_percentage.apply(lambda x: f"{x:.2f}")  # Format percentage to 2 decimal places
})

# Display the combined count and formatted percentage values
print(diet_combined)

                     Count Percentage
diet                                 
mostly anything      16585      46.65
anything              6183      17.39
strictly anything     5113      14.38
mostly vegetarian     3444       9.69
mostly other          1007       2.83
strictly vegetarian    875       2.46
vegetarian             667       1.88
strictly other         452       1.27
mostly vegan           338       0.95
other                  331       0.93
strictly vegan         228       0.64
vegan                  136       0.38
mostly kosher           86       0.24
mostly halal            48       0.14
strictly halal          18       0.05
strictly kosher         18       0.05
halal                   11       0.03
kosher                  11       0.03

# Group by diet and drinks, then calculate the count
diet_drinks_count = df.groupby(['diet', 'drinks']).size().unstack()

# Calculate percentage distribution by diet
diet_drinks_percentage = diet_drinks_count.apply(lambda x: x / x.sum() * 100, axis=1)

# Display the results
print("Diet vs Drinking Habits (Count):")
print(diet_drinks_count)
print("\nDiet vs Drinking Habits (Percentage):")
print(diet_drinks_percentage.apply(lambda x: x.map(lambda v: f"{v:.2f}")))  # Format percentage to 2 decimal places

Diet vs Drinking Habits (Count):
drinks               desperately  not at all   often  rarely  socially  \
diet                                                                     
anything                    21.0       295.0   574.0   510.0    4534.0   
halal                        NaN         4.0     1.0     NaN       4.0   
kosher                       NaN         1.0     NaN     2.0       7.0   
mostly anything             66.0       842.0  1386.0  1537.0   12277.0   
mostly halal                 3.0        10.0     2.0     8.0      16.0   
mostly kosher                1.0         7.0     4.0    17.0      50.0   
mostly other                 8.0        91.0    49.0   176.0     646.0   
mostly vegan                 2.0        40.0    22.0    62.0     193.0   
mostly vegetarian           19.0       195.0   233.0   466.0    2388.0   
other                        1.0        35.0    23.0    60.0     192.0   
strictly anything           66.0       182.0   685.0   318.0    3677.0   
strictly halal               1.0         7.0     2.0     1.0       5.0   
strictly kosher              6.0         4.0     2.0     3.0       1.0   
strictly other              14.0        67.0    34.0    63.0     244.0   
strictly vegan               8.0        58.0    19.0    37.0      94.0   
strictly vegetarian          9.0        85.0    82.0   121.0     543.0   
vegan                        2.0        23.0    19.0    17.0      65.0   
vegetarian                   6.0        37.0    59.0    67.0     446.0   

drinks               very often  
diet                             
anything                   58.0  
halal                       NaN  
kosher                      1.0  
mostly anything           129.0  
mostly halal                4.0  
mostly kosher               5.0  
mostly other                5.0  
mostly vegan                3.0  
mostly vegetarian          20.0  
other                       1.0  
strictly anything          59.0  
strictly halal              2.0  
strictly kosher             2.0  
strictly other             10.0  
strictly vegan              3.0  
strictly vegetarian         9.0  
vegan                       1.0  
vegetarian                  7.0  

Diet vs Drinking Habits (Percentage):
drinks              desperately not at all  often rarely socially very often
diet                                                                        
anything                   0.35       4.92   9.58   8.51    75.67       0.97
halal                       nan      44.44  11.11    nan    44.44        nan
kosher                      nan       9.09    nan  18.18    63.64       9.09
mostly anything            0.41       5.19   8.54   9.47    75.61       0.79
mostly halal               6.98      23.26   4.65  18.60    37.21       9.30
mostly kosher              1.19       8.33   4.76  20.24    59.52       5.95
mostly other               0.82       9.33   5.03  18.05    66.26       0.51
mostly vegan               0.62      12.42   6.83  19.25    59.94       0.93
mostly vegetarian          0.57       5.87   7.02  14.03    71.91       0.60
other                      0.32      11.22   7.37  19.23    61.54       0.32
strictly anything          1.32       3.65  13.74   6.38    73.73       1.18
strictly halal             5.56      38.89  11.11   5.56    27.78      11.11
strictly kosher           33.33      22.22  11.11  16.67     5.56      11.11
strictly other             3.24      15.51   7.87  14.58    56.48       2.31
strictly vegan             3.65      26.48   8.68  16.89    42.92       1.37
strictly vegetarian        1.06      10.01   9.66  14.25    63.96       1.06
vegan                      1.57      18.11  14.96  13.39    51.18       0.79
vegetarian                 0.96       5.95   9.49  10.77    71.70       1.13
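
# The count and row-percentage tables above can also be produced with pd.crosstab.
# A sketch on made-up rows; note that crosstab fills empty combinations with 0,
# whereas groupby(...).size().unstack() leaves them as NaN.

```python
import pandas as pd

# Hypothetical diet/drinks answers, including one missing diet
toy = pd.DataFrame({
    "diet": ["vegan", "vegan", "anything", "anything", "anything", None],
    "drinks": ["rarely", "socially", "socially", "socially", "often", "socially"],
})

# Counts per (diet, drinks) pair; rows with a missing diet are dropped
counts = pd.crosstab(toy["diet"], toy["drinks"])

# normalize="index" divides each row by its row total
row_pct = pd.crosstab(toy["diet"], toy["drinks"], normalize="index") * 100

print(counts)
print(row_pct.round(2))
```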

# Splitting the location column into city and state
df[['city', 'state']] = df['location'].str.split(',', n=1, expand=True)

# Cleaning up extra spaces in city and state
df['city'] = df['city'].str.strip()
df['state'] = df['state'].str.strip()

# Calculating the top 5 cities and states
top_cities = df['city'].value_counts().head(5)
top_states = df['state'].value_counts().head(5)

# Displaying the results
print("Top 5 Cities with the highest number of users:")
print(top_cities)

print("\nTop 5 States with the highest number of users:")
print(top_states)

Top 5 Cities with the highest number of users:
city
san francisco    31064
oakland           7214
berkeley          4210
san mateo         1331
palo alto         1064
Name: count, dtype: int64

Top 5 States with the highest number of users:
state
california       59853
new york            17
illinois             8
massachusetts        5
texas                4
Name: count, dtype: int64

# Group data by city and calculate age statistics
age_stats = df[df['city'].isin(['san francisco', 'oakland', 'berkeley', 'san mateo', 'palo alto'])]
age_summary = (
    age_stats.groupby('city')['age']
    .agg(['mean', 'median', 'min', 'max'])
    .reset_index()
    .rename(columns={'mean': 'Mean Age', 'median': 'Median Age', 'min': 'Min Age', 'max': 'Max Age'})
)

print(age_summary)

            city   Mean Age  Median Age  Min Age  Max Age
0       berkeley  30.062233        30.0     23.0     46.0
1        oakland  31.518991        30.0     23.0     46.0
2      palo alto  29.953008        30.0     23.0     46.0
3  san francisco  30.891386        30.0     23.0     46.0
4      san mateo  31.147258        30.0     23.0     46.0

# Filter top cities and states
top_cities = ['san francisco', 'oakland', 'berkeley', 'san mateo', 'palo alto']
top_states = ['california', 'new york', 'illinois', 'massachusetts', 'texas']

# Calculate mean income for top cities
city_income = df[df['city'].isin(top_cities)].groupby('city')['income'].mean().reset_index()
city_income.rename(columns={'income': 'Average Income'}, inplace=True)

# Calculate mean income for top states
state_income = df[df['state'].isin(top_states)].groupby('state')['income'].mean().reset_index()
state_income.rename(columns={'income': 'Average Income'}, inplace=True)

# Display results
print("Average Income by Top Cities:")
print(city_income)

print("\nAverage Income by Top States:")
print(state_income)

Average Income by Top Cities:
            city  Average Income
0       berkeley    17372.921615
1        oakland    22586.637095
2      palo alto    19332.706767
3  san francisco    20150.012877
4      san mateo    22779.864763

Average Income by Top States:
           state  Average Income
0     california    20044.943445
1       illinois        0.000000
2  massachusetts     6000.000000
3       new york    31764.705882
4          texas     5000.000000

# Group by gender and calculate average height in cm
gender_height_cm = df.groupby('gender')['height_cm'].mean().reset_index()
gender_height_cm.rename(columns={'height_cm': 'Average Height (cm)'}, inplace=True)

# Display results with two decimal places
print("Average Height by Gender (in cm):")
print(gender_height_cm.round(2))

Average Height by Gender (in cm):
  gender  Average Height (cm)
0      f               168.45
1      m               176.42

# Calculate average height by age group with observed=False to avoid future warnings
height_by_age_group = df.groupby('age_group', observed=False)['height_cm'].mean().reset_index()

# Rename column for clarity
height_by_age_group.rename(columns={'height_cm': 'Average Height (cm)'}, inplace=True)

# Display results with rounded values
print("Average Height by Age Group (in cm):")
print(height_by_age_group.round(2))

Average Height by Age Group (in cm):
  age_group  Average Height (cm)
0     18-25               173.43
1     26-35               173.15
2     36-45               173.23
3       46+               173.44

# Calculate the mean height for each body_type category
height_by_body_type = df.groupby('body_type')['height_cm'].mean().reset_index()
height_by_body_type.rename(columns={'height_cm': 'Average Height (cm)'}, inplace=True)

# Display results with rounded values
print("Average Height by Body Type (in cm):")
print(height_by_body_type.round(2))

Average Height by Body Type (in cm):
         body_type  Average Height (cm)
0   a little extra               173.85
1         athletic               175.27
2          average               173.04
3            curvy               168.71
4              fit               173.57
5     full figured               169.99
6           jacked               174.12
7       overweight               174.05
8   rather not say               171.90
9           skinny               173.49
10            thin               172.45
11         used up               174.47

# Filter out rows where income is 0
df_nonzero_income = df[df['income'] > 0]

# Calculate the income distribution
income_distribution = df_nonzero_income['income'].describe()

# Display the result
print(income_distribution.round(2))

count      11504.00
mean      104394.99
std       201433.53
min        20000.00
25%        20000.00
50%        50000.00
75%       100000.00
max      1000000.00
Name: income, dtype: float64

# Define bins and labels for income brackets
income_bins = [0, 25000, 50000, 75000, 100000, float('inf')]
income_labels = ['<25K', '25K-50K', '50K-75K', '75K-100K', '100K+']

# Filter out zero income and create income brackets
df_nonzero_income = df[df['income'] > 0].copy()
df_nonzero_income['income_bracket'] = pd.cut(df_nonzero_income['income'], bins=income_bins, labels=income_labels, right=False)

# Get the count of users in each income bracket
income_distribution = df_nonzero_income['income_bracket'].value_counts().sort_index()
print(income_distribution)

income_bracket
<25K        2952
25K-50K     2053
50K-75K     2418
75K-100K    1111
100K+       2970
Name: count, dtype: int64

# Group by age group and gender and calculate the mean income
income_by_age_gender = df.groupby(['age_group', 'gender'],observed=True)['income'].mean().reset_index()

# Pivot the table to have separate columns for male and female income
income_by_age_gender_pivot = income_by_age_gender.pivot(index='age_group', columns='gender', values='income')

# Display the result
print(income_by_age_gender_pivot.round(2))

gender            f         m
age_group                    
18-25      11249.66  25219.91
26-35      11089.33  25623.37
36-45      11319.26  27787.15
46+        13865.55  30676.47

# Calculate mean age
mean_age = df['age'].mean()

# Plot the histogram with Bumble theme color
plt.figure(figsize=(8, 5))
plt.hist(df['age'], bins=15, color='#FFCC00', edgecolor='black', alpha=0.9)
plt.axvline(mean_age, color='red', linestyle='dashed', linewidth=2, label=f'Mean Age: {mean_age:.2f}')

# Add labels and title
plt.title('Age Distribution of Bumble Users', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)
plt.legend()
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()

# Set the size of the plot
plt.figure(figsize=(8, 5))

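ax = sns.histplot(data=df, x='age', hue='gender', hue_order=['f', 'm'], bins=15, kde=False,
                  multiple="dodge", palette={'f': '#87CEEB', 'm': '#FFCC00'},
                  edgecolor='black', alpha=0.8)

# Add labels and title
plt.title('Age Distribution by Gender', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)
# Relabel seaborn's own legend; plt.legend(labels=...) can pair labels with the wrong colors
legend = ax.get_legend()
legend.set_title('Gender')
for text, label in zip(legend.get_texts(), ['Female', 'Male']):
    text.set_text(label)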
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()

# Set the size of the plot
plt.figure(figsize=(8, 5))

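ax = sns.histplot(data=df, x='age', hue='gender', hue_order=['f', 'm'], bins=15, kde=False,
                  multiple="stack", palette={'f': '#87CEEB', 'm': '#FFCC00'},
                  edgecolor='black', alpha=0.8)

# Add labels and title
plt.title('Age Distribution by Gender', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)
# Relabel seaborn's own legend; plt.legend(labels=...) can pair labels with the wrong colors
legend = ax.get_legend()
legend.set_title('Gender')
for text, label in zip(legend.get_texts(), ['Female', 'Male']):
    text.set_text(label)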
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()

# Set plot size
plt.figure(figsize=(8, 5))

# Scatterplot and trend line
sns.scatterplot(data=df, x='age', y='income', color='#FFCC00', alpha=0.8, edgecolor='black')
sns.regplot(data=df, x='age', y='income', scatter=False, color='black', line_kws={"linewidth": 2})

# Add title and labels
plt.title('Income vs. Age', fontsize=14, color='black')
plt.xlabel('Age', fontsize=12, color='black')
plt.ylabel('Income', fontsize=12, color='black')

# Add grid
plt.grid(axis='y', alpha=0.5, linestyle='--', color='gray')

# Show plot
plt.show()

# Filter out rows with zero or negative income
df_filtered = df[df['income'] > 0]

# Create the boxplot
plt.figure(figsize=(8, 5))
sns.boxplot(data=df_filtered, x='age_group', y='income', color='#FFCC00')  # Bumble yellow

# Add title and labels
plt.title('Income by Age Group', fontsize=14, color='black')
plt.xlabel('Age Group', fontsize=12, color='black')
plt.ylabel('Income', fontsize=12, color='black')

# Show the plot
plt.show()

# Create the boxplot to compare income by gender and status
plt.figure(figsize=(8, 5))
sns.boxplot(data=df_filtered, x='status', y='income', hue='gender', palette={'f': '#87CEEB', 'm': '#FFCC00'})

# Add title and labels
plt.title('Income by Gender and Status (Excluding Zero/Negative Income)', fontsize=14)
plt.xlabel('Status', fontsize=12)
plt.ylabel('Income', fontsize=12)

# Show the plot
plt.show()

# Count the occurrences of each pet category
pet_counts = df['pets'].value_counts()

# Plot the bar chart
plt.figure(figsize=(8, 5))
sns.barplot(x=pet_counts.index, y=pet_counts.values, color='#FFCC00', edgecolor='black')

# Add labels and title
plt.title('Distribution of Pet Preferences', fontsize=14)
plt.xlabel('Pet Preferences', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)

# Rotate x-axis labels for better readability
plt.xticks(rotation=90)

# Show the plot
plt.show()

# Create a boolean column indicating whether the user likes pets.
# Match 'likes' as a whole word so that 'dislikes' is not counted; missing values become False.
df['likes_pets'] = df['pets'].str.contains(r'\blikes\b', na=False)

# Filter out rows where pets preference is not specified
df_filtered = df[df['pets'] != 'Not Specified']

# Group by age_group and gender and count the users who like pets.
# observed=False keeps empty category combinations and avoids the pandas FutureWarning.
age_gender_pets = (
    df_filtered[df_filtered['likes_pets']]
    .groupby(['age_group', 'gender'], observed=False)
    .size()
    .reset_index(name='count')
)

# Plotting the data using a simple bar plot
plt.figure(figsize=(8, 5))
sns.barplot(data=age_gender_pets, x='age_group', y='count', hue='gender', palette={'f': '#87CEEB', 'm': '#FFCC00'})

# Adding labels and title
plt.title('Pets Preferences Across Gender and Age Groups', fontsize=16)
plt.xlabel('Age Group', fontsize=12)
plt.ylabel('Count of Users Liking Pets', fontsize=12)

# Show the plot
plt.show()

# Extract the main zodiac sign using a regular expression to match the first word
df['cleaned_sign'] = df['sign'].str.extract(r'(\b\w+\b)', expand=False)

# Filter out "Not Specified" (its first word, 'Not', is what the extraction above keeps)
df_filtered = df[df['cleaned_sign'] != 'Not']

# Count the occurrences of each zodiac sign
sign_counts = df_filtered['cleaned_sign'].value_counts()

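# Create a fading color palette: value_counts() sorts descending and reverse=True starts
# at the full color, so higher counts get darker bars and lower counts get lighter ones
palette = sns.light_palette("#FFCC00", n_colors=len(sign_counts), reverse=True)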

# Plot the data as a horizontal bar chart with the fading Bumble color palette
plt.figure(figsize=(8, 5))
sns.barplot(x=sign_counts.values, y=sign_counts.index, palette=palette, hue=sign_counts.index,
            legend=False, edgecolor='black')  # hue repeats the y variable, so the legend is redundant

# Add labels and title
plt.title('Distribution of Zodiac Signs on the Platform (Excluding "Not Specified")', fontsize=16)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Zodiac Sign', fontsize=12)

# Show the plot
plt.show()

# Group by 'gender', 'status', and 'cleaned_sign' to count occurrences
sign_gender_status = df_filtered.groupby(['gender', 'status', 'cleaned_sign']).size().reset_index(name='count')

# Plot the grouped counts as a bar chart
plt.figure(figsize=(8, 5))

# Bars are colored by gender. Each bar averages the counts over the status values for that
# sign and gender (barplot's default estimator is the mean); errorbar=None hides the interval
sns.barplot(data=sign_gender_status, x='cleaned_sign', y='count', hue='gender', palette=['#87CEEB', '#FFCC00'], errorbar=None)

# Adding labels and title
plt.title('Zodiac Sign Distribution Across Gender and Status', fontsize=16)
plt.xlabel('Zodiac Sign', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')

# Show the plot
plt.show()

	age	height	income
count	59946.000000	59943.000000	59946.000000
mean	32.340290	68.295281	20033.222534
std	9.452779	3.994803	97346.192104
min	18.000000	1.000000	-1.000000
25%	26.000000	66.000000	-1.000000
50%	30.000000	68.000000	-1.000000
75%	37.000000	71.000000	-1.000000
max	110.000000	95.000000	1000000.000000
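
In the summary above, income is -1 at the minimum, at every quartile, and at the median, which suggests that -1 stands in for unreported income rather than a real value. A minimal sketch of how the placeholder could be turned into NaN so it does not distort the statistics; the frame `toy` and its values are made up for illustration and stand in for `df`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are illustrative, not taken from the dataset
toy = pd.DataFrame({'income': [-1, 80000, -1, 20000, -1]})

# Assumption: -1 marks "income not reported"
toy['income_clean'] = toy['income'].replace(-1, np.nan)

# Share of rows without a reported income
missing_share = toy['income_clean'].isna().mean()

# Mean of the reported incomes only (NaN is skipped by default)
reported_mean = toy['income_clean'].mean()

print(f"missing share: {missing_share:.2f}, mean reported income: {reported_mean:.0f}")
```

Keeping the cleaned values in a separate column leaves the raw data available for comparison.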

Introduction and Purpose

1.1 Load the libraries

1.2 Import the dataset

1.3 Check the information about the data and the data type of each attribute

1.4 Statistical Analysis

1.5 Check the dimensions of the data

1.6 Check null values

Part 1: Data Cleaning

1. Inspecting Missing Data

Conclusion: the rows with missing height values were dropped, because fewer than 1% of the values in that column were missing.
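
The check behind this conclusion can be sketched as follows: compute the share of missing height values and drop those rows only when the share is small. The 1% threshold mirrors the conclusion above; the frame `toy` and its data are made up for illustration and stand in for `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: 200 rows, one of them with a missing height (0.5%)
toy = pd.DataFrame({
    'height': [68.0] * 199 + [np.nan],
    'age': list(range(200)),
})

# Percentage of missing values in the height column
missing_pct = toy['height'].isna().mean() * 100

# Drop the affected rows only if they are a negligible share of the data
if missing_pct < 1:
    toy = toy.dropna(subset=['height'])

print(f"{missing_pct:.2f}% missing, {len(toy)} rows kept")
```

Passing `subset=['height']` restricts the drop to rows where height itself is missing, so gaps in other columns do not remove additional rows.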

2. Data Types

3. Outliers

4. Missing Data Visualization

Part 2: Data Processing

1. Binning and Grouping
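
The plots in this notebook rely on an `age_group` column. One way such a column could be built is with `pd.cut`; the bin edges and labels below are assumptions chosen for illustration, not necessarily the ones used in this notebook, and `toy` is a made-up frame standing in for `df`.

```python
import pandas as pd

# Toy stand-in for df with ages at the edges of each bin
toy = pd.DataFrame({'age': [18, 25, 26, 35, 36, 45, 46, 110]})

# Hypothetical edges and labels; right=True makes each bin include its upper edge
bins = [17, 25, 35, 45, 120]
labels = ['18-25', '26-35', '36-45', '46+']
toy['age_group'] = pd.cut(toy['age'], bins=bins, labels=labels, right=True)

# pd.cut returns a categorical column, so groupby is given observed explicitly
group_sizes = toy.groupby('age_group', observed=False).size()
print(group_sizes)
```

Because the result is categorical, the groups keep their bin order on plot axes instead of being sorted alphabetically.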

2. Derived Features

3. Unit Conversion

Part 3: Data Analysis

1. Demographic Analysis

2. Correlation Analysis

3. Diet and Lifestyle Analysis

Overall Conclusion:

4. Geographical Insights

Questions:

5. Height Analysis

6. Income Analysis

Part 4: Data Visualization

1. Age Distribution

2. Income and Age

3. Pets and Preferences

4. Signs and Personality

	age	status	gender	body_type	diet	drinks	education	ethnicity	height	income	job	last_online	location	pets	religion	sign	speaks
0	22	single	m	a little extra	strictly anything	socially	working on college/university	asian, white	75.0	-1	transportation	2012-06-28-20-30	south san francisco, california	likes dogs and likes cats	agnosticism and very serious about it	gemini	english
1	35	single	m	average	mostly other	often	working on space camp	white	70.0	80000	hospitality / travel	2012-06-29-21-41	oakland, california	likes dogs and likes cats	agnosticism but not too serious about it	cancer	english (fluently), spanish (poorly), french (...
2	38	available	m	thin	anything	socially	graduated from masters program	NaN	68.0	-1	NaN	2012-06-27-09-10	san francisco, california	has cats	NaN	pisces but it doesn’t matter	english, french, c++
3	23	single	m	thin	vegetarian	socially	working on college/university	white	71.0	20000	student	2012-06-28-14-22	berkeley, california	likes cats	NaN	pisces	english, german (poorly)
4	29	single	m	athletic	NaN	socially	graduated from college/university	asian, black, other	66.0	-1	artistic / musical / writer	2012-06-27-21-26	san francisco, california	likes dogs and likes cats	NaN	aquarius	english