%pip install pandas

Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.13/site-packages (2.3.2)
Requirement already satisfied: numpy>=1.26.0 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2.2.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2025.2)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.13/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
Note: you may need to restart the kernel to use updated packages.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
# Both files are in the same folder, so we just use the filename
Bumble_Dataset = pd.read_csv('bumble.csv')
Bumble_Dataset

missing_values = Bumble_Dataset.isnull().sum()
missing_values

age                0
status             0
gender             0
body_type       5296
diet           24395
drinks          2985
education       6628
ethnicity       5680
height             3
income             0
job             8198
last_online        0
location           0
pets           19921
religion       20226
sign           11056
speaks            50
dtype: int64

percentage_of_missing_values = ((missing_values) / len(Bumble_Dataset)) * 100
percentage_of_missing_values

age             0.000000
status          0.000000
gender          0.000000
body_type       8.834618
diet           40.694959
drinks          4.979482
education      11.056618
ethnicity       9.475194
height          0.005005
income          0.000000
job            13.675641
last_online     0.000000
location        0.000000
pets           33.231575
religion       33.740366
sign           18.443266
speaks          0.083408
dtype: float64

# Calculate median height by gender
calculate_median = Bumble_Dataset.groupby(["gender"])["height"].transform("median")
# Fill missing height values with median
Bumble_Dataset["height"] = Bumble_Dataset["height"].fillna(calculate_median)
print(calculate_median)

0        70.0
1        70.0
2        70.0
3        70.0
4        70.0
         ... 
59941    65.0
59942    70.0
59943    70.0
59944    70.0
59945    70.0
Name: height, Length: 59946, dtype: float64
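The fill above relies on `transform("median")` returning one value per row rather than one value per group. A minimal sketch of that behaviour, using a made-up toy frame (the values are hypothetical and are not rows from `bumble.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from bumble.csv
toy = pd.DataFrame({
    "gender": ["m", "m", "m", "f", "f", "f"],
    "height": [70.0, 72.0, np.nan, 64.0, np.nan, 66.0],
})

# transform() broadcasts each group's median back onto every row,
# so the result shares the index of the original column
group_median = toy.groupby("gender")["height"].transform("median")

# Only missing heights are replaced; existing values are left untouched
toy["height_filled"] = toy["height"].fillna(group_median)
print(toy)
```

Because the medians are aligned row by row, `fillna` can take them directly without a merge or a loop over genders.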

Bumble_Dataset.dtypes

age              int64
status          object
gender          object
body_type       object
diet            object
drinks          object
education       object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
pets            object
religion        object
sign            object
speaks          object
dtype: object

# The "last_online" column has the object dtype, so it needs converting to datetime
# pd.to_datetime() performs the conversion
# errors='coerce' turns values that cannot be parsed into NaT (Not a Time)

import warnings
warnings.filterwarnings('ignore')  # Suppress warnings

Bumble_Dataset["last_online"] = pd.to_datetime(Bumble_Dataset["last_online"], errors='coerce')
Bumble_Dataset["last_online"]

0                              NaT
1                              NaT
2        2012-06-27 09:00:00-10:00
3        2012-06-28 14:00:00-22:00
4                              NaT
                   ...            
59941                          NaT
59942    2012-06-29 11:00:00-01:00
59943                          NaT
59944    2012-06-23 13:00:00-01:00
59945                          NaT
Name: last_online, Length: 59946, dtype: object

Bumble_Dataset.dtypes

age              int64
status          object
gender          object
body_type       object
diet            object
drinks          object
education       object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
pets            object
religion        object
sign            object
speaks          object
dtype: object
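The listing above still reports `last_online` as `object`, and many rows came back as `NaT`, which suggests the automatic parser did not recognise the raw strings. The dataset preview shows values such as `2012-06-28-20-30`. A hedged sketch of parsing with an explicit format follows; it assumes every value uses year-month-day-hour-minute, which has not been verified against the full file:

```python
import pandas as pd

# Two sample strings as they appear in the last_online column,
# plus one deliberately invalid value to show the coerce behaviour
raw = pd.Series(["2012-06-28-20-30", "2012-06-29-21-41", "not a date"])

# Assumption: the layout is YYYY-MM-DD-HH-MM for all rows
parsed = pd.to_datetime(raw, format="%Y-%m-%d-%H-%M", errors="coerce")

print(parsed)
print(parsed.dtype)
```

With an explicit `format`, the last two fields are read as hour and minute instead of being guessed, and the resulting column has a datetime dtype.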

Bumble_Dataset.describe()

# The numerical columns (age, height and income) contain outliers, as their ranges show
age_range = (Bumble_Dataset['age'].min(), Bumble_Dataset['age'].max())
height_range = (Bumble_Dataset['height'].min(), Bumble_Dataset['height'].max())
income_range = (Bumble_Dataset['income'].min(), Bumble_Dataset['income'].max())
print(f'Age range is: {age_range}')
print(f'Height range is: {height_range}')
print(f'Income range is: {income_range}')

Age range is: (18, 110)
Height range is: (1.0, 95.0)
Income range is: (-1, 1000000)

Bumble_Dataset['income'] = Bumble_Dataset['income'].replace(-1, 0)
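Replacing `-1` with `0` treats a placeholder as a real income of zero, which pulls every later average down. An alternative, sketched here on hypothetical toy values and assuming that `-1` means "income not reported", is to mark those rows as missing so that `mean()` skips them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy incomes; -1 is assumed to stand for "not reported"
income = pd.Series([-1, 20000, -1, 80000, -1])

as_zero = income.replace(-1, 0)          # placeholder counted as 0
as_missing = income.replace(-1, np.nan)  # placeholder treated as missing

print(as_zero.mean())     # averages over all five rows
print(as_missing.mean())  # averages over the two reported rows only
```

Which treatment is appropriate depends on the question being asked; the rest of this notebook keeps the zero replacement.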

# Summarize each numeric column using only values strictly between its 10th and 90th percentiles
column_names = ["income", "age", "height"]
for column in column_names:
    lower = Bumble_Dataset[column].quantile(0.10)  # 10th percentile
    upper = Bumble_Dataset[column].quantile(0.90)  # 90th percentile
    filtered_values = Bumble_Dataset[(Bumble_Dataset[column] > lower) & (Bumble_Dataset[column] < upper)]
    mean_values = filtered_values[column].mean()
    median_values = filtered_values[column].median()
    print(f"\n{column}: filtered_median: {median_values}, filtered_mean: {mean_values}")

income: filtered_median: 20000.0, filtered_mean: 26109.89010989011

age: filtered_median: 30.0, filtered_mean: 31.357685563997663

height: filtered_median: 68.0, filtered_mean: 68.25442646465552
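The trimming above uses the 10th and 90th percentiles. A common alternative for flagging outliers is the 1.5 × IQR rule, which is a different technique from the one used in this notebook. A minimal sketch on hypothetical toy ages (the helper name `iqr_bounds` is made up for illustration):

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences using the k * IQR rule."""
    q1 = values.quantile(0.25)  # first quartile
    q3 = values.quantile(0.75)  # third quartile
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy ages, including one implausible value
ages = pd.Series([22, 25, 27, 30, 31, 35, 38, 110])

lower, upper = iqr_bounds(ages)
outliers = ages[(ages < lower) | (ages > upper)]

print(lower, upper)
print(outliers.tolist())
```

The fences adapt to the spread of each column, so the same helper could be applied to `age`, `height` and `income` in turn.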

missing_values = Bumble_Dataset.isnull().sum()
print(missing_values)
plt.figure(figsize=(12, 8))
sns.heatmap(Bumble_Dataset.isnull(), cbar=False, cmap="inferno")
plt.title('Heatmap of Missing Data')
plt.show()

age                0
status             0
gender             0
body_type       5296
diet           24395
drinks          2985
education       6628
ethnicity       5680
height             0
income             0
job             8198
last_online    35895
location           0
pets           19921
religion       20226
sign           11056
speaks            50
dtype: int64

def get_age_grp(age):
    if age <= 25:
        return '18-25'
    elif age <= 35:
        return "26-35"
    elif age <= 45:
        return "36-45"
    else:
        return "46+"

# Apply the function to create a new column
Bumble_Dataset["age_group"] = Bumble_Dataset["age"].apply(get_age_grp)
age_group_counts = Bumble_Dataset['age_group'].value_counts().sort_index()

# Convert Series to DataFrame
age_group_df = age_group_counts.reset_index()
age_group_df.columns = ['Age Group', 'User Count']

# Display results
print(age_group_df)

  Age Group  User Count
0     18-25       14454
1     26-35       28621
2     36-45       10803
3       46+        6068
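The same age bins can also be expressed with `pd.cut`, which avoids a Python-level function call per row. This is a sketch of an equivalent approach on toy ages, not the method used above; the bin edges are chosen so that each upper edge is inclusive, matching the `<=` comparisons in `get_age_grp`:

```python
import pandas as pd

# Toy ages sitting on both sides of every bin edge
ages = pd.Series([18, 25, 26, 35, 36, 45, 46, 110])

# Right-inclusive intervals: (17, 25], (25, 35], (35, 45], (45, inf]
age_group = pd.cut(
    ages,
    bins=[17, 25, 35, 45, float("inf")],
    labels=["18-25", "26-35", "36-45", "46+"],
)

print(age_group.tolist())
```

The result is an ordered categorical, so `value_counts().sort_index()` lists the groups in bin order rather than alphabetical order.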

# Visualize age group distribution
plt.figure(figsize=(10, 6))
plt.bar(age_group_df['Age Group'], age_group_df['User Count'], color='skyblue', edgecolor='black')
plt.xlabel('Age Group')
plt.ylabel('Number of Users')
plt.title('Distribution of Users Across Age Groups')
plt.grid(axis='y', alpha=0.3)
plt.show()

# Calculate the 10th and 90th percentiles of income (used as category cut-offs)
low_cutoff = Bumble_Dataset["income"].quantile(0.10)
high_cutoff = Bumble_Dataset["income"].quantile(0.90)

# Define function to categorize income
def categorize_income(x):
    if x <= low_cutoff:
        return "Low Income"
    elif x <= high_cutoff:
        return "Medium Income"
    else:
        return "High Income"

# Apply function
Bumble_Dataset["income_category"] = Bumble_Dataset["income"].apply(categorize_income)

# Count occurrences
income_distribution_count = Bumble_Dataset['income_category'].value_counts()
income_distribution_count_df = income_distribution_count.reset_index()
income_distribution_count_df.columns = ['Income Category', 'User Count']
print(income_distribution_count_df)

  Income Category  User Count
0      Low Income       48442
1   Medium Income        5980
2     High Income        5524

# Visualize income categories
plt.figure(figsize=(10, 6))
plt.bar(income_distribution_count_df['Income Category'], 
        income_distribution_count_df['User Count'], 
        color='lightcoral', edgecolor='black')
plt.xlabel('Income Category')
plt.ylabel('Number of Users')
plt.title('Distribution of Users Across Income Categories')
plt.grid(axis='y', alpha=0.3)
plt.show()

# Total number of columns (this count includes the derived age_group and income_category columns)
total_columns = Bumble_Dataset.shape[1]
# Percentage of non-missing values per row
Bumble_Dataset['profile_completeness'] = round(Bumble_Dataset.notnull().sum(axis=1) / total_columns * 100, 2)
print(Bumble_Dataset['profile_completeness'])

0         94.74
1         94.74
2         84.21
3         94.74
4         84.21
          ...  
59941     78.95
59942    100.00
59943     89.47
59944    100.00
59945     89.47
Name: profile_completeness, Length: 59946, dtype: float64

Bumble_Dataset['profile_completeness'].value_counts()

profile_completeness
94.74     15924
89.47     15135
84.21     10483
78.95      6080
100.00     5851
73.68      3200
68.42      1589
63.16       919
57.89       534
52.63       182
47.37        49
Name: count, dtype: int64
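The scores above move in steps of about 5.26 points (1/19), which suggests the derived `age_group` and `income_category` columns are being counted as profile fields even though they are never missing. A hedged sketch of restricting the calculation to the original fields, shown on a hypothetical three-column toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy profiles, not rows from bumble.csv
toy = pd.DataFrame({
    "age": [22, 35, 38],
    "diet": ["anything", np.nan, np.nan],
    "pets": ["has cats", "likes dogs", np.nan],
})

# Fix the list of profile fields before any derived column is added
profile_fields = ["age", "diet", "pets"]

# A derived column that is never missing
toy["age_group"] = ["18-25", "26-35", "36-45"]

# The row-wise mean of a boolean frame is the share of filled fields
toy["profile_completeness"] = (
    toy[profile_fields].notnull().mean(axis=1).mul(100).round(2)
)
print(toy[["profile_completeness"]])
```

Keeping an explicit field list also means the score does not change when further derived columns are added later in the notebook.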

# Analyze completeness by gender
Gender_anly = Bumble_Dataset.groupby('profile_completeness')['gender'].value_counts()
Gender_anly

profile_completeness  gender
47.37                 m           33
                      f           16
52.63                 m          125
                      f           57
57.89                 m          358
                      f          176
63.16                 m          568
                      f          351
68.42                 m          972
                      f          617
73.68                 m         1934
                      f         1266
78.95                 m         3657
                      f         2423
84.21                 m         6279
                      f         4204
89.47                 m         9081
                      f         6054
94.74                 m         9344
                      f         6580
100.00                m         3478
                      f         2373
Name: count, dtype: int64

# Creating a new column with height in centimeters
Bumble_Dataset["height_cm"] = Bumble_Dataset["height"] * 2.54
print(Bumble_Dataset[["height", "height_cm"]])

       height  height_cm
0        75.0     190.50
1        70.0     177.80
2        68.0     172.72
3        71.0     180.34
4        66.0     167.64
...       ...        ...
59941    62.0     157.48
59942    72.0     182.88
59943    71.0     180.34
59944    73.0     185.42
59945    68.0     172.72

[59946 rows x 2 columns]

# Count the number of users by gender
gender_distribution = Bumble_Dataset['gender'].value_counts()
# Calculate percentage
gender_percentage = round(gender_distribution / gender_distribution.sum() * 100, 2)
# Display results
gender_distribution_df = pd.DataFrame({
    "Gender": gender_percentage.index,
    "Count": gender_distribution,
    "Percentage": gender_percentage
})
print(gender_distribution_df)

       Gender  Count  Percentage
gender                          
m           m  35829       59.77
f           f  24117       40.23
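In the table above the gender label appears twice, once as the index and once as the `Gender` column. A small sketch of one way to build the same summary with a single label column, using hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy column
gender = pd.Series(["m", "m", "m", "f", "f"], name="gender")

counts = gender.value_counts()

summary = (
    pd.DataFrame({
        "Count": counts,
        "Percentage": (counts / counts.sum() * 100).round(2),
    })
    .rename_axis("Gender")  # name the index, then turn it into a column
    .reset_index()
)
print(summary)
```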

# Visualize gender distribution
plt.figure(figsize=(10, 6))
plt.bar(gender_distribution_df['Gender'], gender_distribution_df['Count'], 
        color=['skyblue', 'pink'], edgecolor='black')
plt.xlabel('Gender')
plt.ylabel('Number of Users')
plt.title('Gender Distribution on Bumble')
plt.grid(axis='y', alpha=0.3)
plt.show()

# Count the number of users by status
status_distribution = Bumble_Dataset['status'].value_counts()
# Calculate percentage
status_percentage = round(status_distribution / len(Bumble_Dataset["status"]) * 100, 2)
relationship_distribution_df = pd.DataFrame({
    'Status': status_distribution.index,
    'count': status_distribution,
    'Percentage': status_percentage
})
relationship_distribution_df

# Count relationship status within each gender
status_by_gender = Bumble_Dataset.groupby('gender')['status'].value_counts(normalize=True) * 100
# Convert to DataFrame
status_by_gender = status_by_gender.unstack()
# Display results
print(status_by_gender)

status  available   married  seeing someone     single   unknown
gender                                                          
f        2.720073  0.559771        4.158892  92.544678  0.016586
m        3.374362  0.488431        2.961288  93.159173  0.016746
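The row percentages above can also be produced in one step with `pd.crosstab` and `normalize="index"`. This is an alternative sketch on hypothetical toy data rather than the approach used in the cell above:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "gender": ["f", "f", "f", "f", "m", "m", "m", "m"],
    "status": ["single", "single", "single", "available",
               "single", "single", "seeing someone", "available"],
})

# normalize="index" divides each row by its own total,
# so every gender's percentages sum to 100
status_share = pd.crosstab(toy["gender"], toy["status"], normalize="index") * 100
print(status_share.round(2))
```

One practical difference: `crosstab` fills combinations that never occur with 0, whereas `groupby(...).value_counts().unstack()` leaves them as `NaN`.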

# Selecting numerical columns
numerical_columns = Bumble_Dataset[['age', 'height', 'income']]
# Compute correlation matrix
correlation = numerical_columns.corr()
# Print result
print(correlation)

             age    height    income
age     1.000000 -0.022253 -0.001004
height -0.022253  1.000000  0.065048
income -0.001004  0.065048  1.000000

# Visualize correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()

correlation = Bumble_Dataset["age"].corr(Bumble_Dataset["income"])
print(correlation)

-0.0010038681910053968
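This coefficient is computed over all rows, including the large share whose income is the former `-1` placeholder (the 75th percentile of `income` in `describe()` is still -1). A hedged sketch, on hypothetical toy values, of how restricting the calculation to reported incomes can change the result:

```python
import pandas as pd

# Hypothetical toy data: every other row has no reported income (stored as 0)
toy = pd.DataFrame({
    "age":    [20, 25, 30, 35, 40, 45],
    "income": [0, 30000, 0, 50000, 0, 70000],
})

# Correlation over all rows, placeholders included
all_rows = toy["age"].corr(toy["income"])

# Correlation over rows with a reported income only
reported = toy[toy["income"] > 0]
reported_only = reported["age"].corr(reported["income"])

print(all_rows)
print(reported_only)
```

The toy numbers are constructed to make the effect visible and say nothing about what the real dataset would show.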

# Count occurrences of each dietary preference
diet_distribution = Bumble_Dataset['diet'].value_counts().reset_index()
diet_distribution.columns = ['Diet Type', 'Count']
# Calculate percentages
diet_distribution['Percentage'] = round((diet_distribution['Count'] / diet_distribution['Count'].sum()) * 100, 2)
# Display top 10
print(diet_distribution.head(10))

             Diet Type  Count  Percentage
0      mostly anything  16585       46.65
1             anything   6183       17.39
2    strictly anything   5113       14.38
3    mostly vegetarian   3444        9.69
4         mostly other   1007        2.83
5  strictly vegetarian    875        2.46
6           vegetarian    667        1.88
7       strictly other    452        1.27
8         mostly vegan    338        0.95
9                other    331        0.93

# Visualize top 10 diet preferences
plt.figure(figsize=(12, 6))
top_diets = diet_distribution.head(10)
plt.barh(range(len(top_diets)), top_diets['Count'], color='lightgreen', edgecolor='black')
plt.yticks(range(len(top_diets)), top_diets['Diet Type'])
plt.xlabel('Number of Users')
plt.title('Top 10 Diet Preferences on Bumble')
plt.grid(axis='x', alpha=0.3)
plt.show()

# Analyze drinking habits across different diet categories
drink_diet_distribution = Bumble_Dataset.groupby("diet")["drinks"].value_counts(normalize=True)
# Display results
drink_diet_distribution

diet        drinks     
anything    socially       0.756676
            often          0.095794
            rarely         0.085113
            not at all     0.049232
            very often     0.009680
                             ...   
vegetarian  rarely         0.107717
            often          0.094855
            not at all     0.059486
            very often     0.011254
            desperately    0.009646
Name: proportion, Length: 103, dtype: float64

# Extract city and state from the location column
# .copy() makes the filtered frame independent, so the column assignments below are unambiguous
Bumble_Dataset = Bumble_Dataset[Bumble_Dataset['location'].notna()].copy()
# Split on the first ", "; the part before it is the city, the rest is the state
Bumble_Dataset[['City', 'State']] = Bumble_Dataset['location'].str.split(', ', n=1, expand=True)
# Remove surrounding whitespace
Bumble_Dataset['City'] = Bumble_Dataset['City'].str.strip()
Bumble_Dataset['State'] = Bumble_Dataset['State'].str.strip()

# Count users per city and state
top_cities = Bumble_Dataset['City'].value_counts().head(5)
top_states = Bumble_Dataset['State'].value_counts().head(5)

# Convert to DataFrame
top_cities_df = top_cities.reset_index()
top_cities_df.columns = ['City', 'User Count']
top_states_df = top_states.reset_index()
top_states_df.columns = ['State', 'User Count']

# Display results
print(f"Top 5 Cities:\n{top_cities_df}")
print(f"\nTop 5 States:\n{top_states_df}")

Top 5 Cities:
            City  User Count
0  san francisco       31064
1        oakland        7214
2       berkeley        4212
3      san mateo        1331
4      palo alto        1064

Top 5 States:
           State  User Count
0     california       59855
1       new york          17
2       illinois           8
3  massachusetts           5
4          texas           4

# Visualize top cities
plt.figure(figsize=(10, 6))
plt.barh(range(len(top_cities_df)), top_cities_df['User Count'], color='coral', edgecolor='black')
plt.yticks(range(len(top_cities_df)), top_cities_df['City'])
plt.xlabel('Number of Users')
plt.title('Top 5 Cities by User Count')
plt.grid(axis='x', alpha=0.3)
plt.show()

# Find top 5 cities
top_cities = Bumble_Dataset['City'].value_counts().head(5).index
# Filter dataset
df_top_cities = Bumble_Dataset[Bumble_Dataset['City'].isin(top_cities)]
# Calculate age statistics
age_stats = df_top_cities.groupby('City')['age'].mean().sort_values(ascending=True).reset_index()
age_stats.columns = ['City', 'Age Average']
print(age_stats)

            City  Age Average
0       berkeley    31.391738
1  san francisco    31.614312
2      palo alto    31.980263
3        oakland    33.178819
4      san mateo    33.437265

# Calculate average income by city
df_top_cities = Bumble_Dataset[Bumble_Dataset['City'].isin(top_cities)]
avg_income_by_city = round(df_top_cities.groupby('City')['income'].mean().sort_values(ascending=False), 2)
print("Average Income in Top Cities:")
print(avg_income_by_city)

# Calculate average income by state
top_states = Bumble_Dataset['State'].value_counts().head(5).index
df_top_states = Bumble_Dataset[Bumble_Dataset['State'].isin(top_states)]
avg_income_by_state = round(df_top_states.groupby('State')['income'].mean().sort_values(ascending=False), 2)
print("\nAverage Income in Top States:")
print(avg_income_by_state)

Average Income in Top Cities:
City
san mateo        22779.86
oakland          22586.64
san francisco    20150.01
palo alto        19332.71
berkeley         17364.67
Name: income, dtype: float64

Average Income in Top States:
State
new york         31764.71
california       20044.27
massachusetts     6000.00
texas             5000.00
illinois             0.00
Name: income, dtype: float64
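Apart from California, the states listed above have only a handful of users each, so their averages rest on very small samples. A minimal sketch, on hypothetical toy data, of reporting the group size alongside the mean so that both can be read together:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "State": ["california"] * 4 + ["new york"] * 2 + ["texas"],
    "income": [20000, 0, 40000, 0, 100000, 0, 5000],
})

# Named aggregation: one output column per (name=function) pair
summary = (
    toy.groupby("State")["income"]
    .agg(users="count", avg_income="mean")
    .sort_values("users", ascending=False)
)
print(summary)
```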

# Calculate average height for each gender
avg_height_by_gender = Bumble_Dataset.groupby('gender')['height'].mean().reset_index()
avg_height_by_gender.columns = ['Gender', 'Average Height']
print(avg_height_by_gender)

  Gender  Average Height
0      f       65.103869
1      m       70.443468

# Visualize height by gender
plt.figure(figsize=(8, 6))
plt.bar(avg_height_by_gender['Gender'], avg_height_by_gender['Average Height'],
        color=['skyblue', 'pink'], edgecolor='black')
plt.xlabel('Gender')
plt.ylabel('Average Height (inches)')
plt.title('Average Height by Gender')
plt.grid(axis='y', alpha=0.3)
plt.show()

# Calculate average height for each age group
avg_height_by_age_group = Bumble_Dataset.groupby('age_group')['height'].mean().reset_index()
avg_height_by_age_group.columns = ['Age Group', 'Average Height']
print(avg_height_by_age_group)

  Age Group  Average Height
0     18-25       68.200913
1     26-35       68.406764
2     36-45       68.325095
3       46+       67.941167

# Group by body_type and calculate height
height_by_body_type = Bumble_Dataset.groupby('body_type')['height'].mean().sort_values(ascending=False).reset_index()
height_by_body_type.columns = ['Body Type', 'Average Height']
print(height_by_body_type.head(10))

        Body Type  Average Height
0        athletic       69.707336
1          jacked       69.292162
2         used up       69.180282
3      overweight       68.948198
4  a little extra       68.820084
5             fit       68.546062
6          skinny       68.544176
7         average       68.100805
8            thin       67.866058
9  rather not say       67.272727

# Keep only users with a non-zero income (0 is the replaced -1 placeholder)
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] != 0]
# Calculate income distribution
income_distribution = filtered_income['income'].value_counts().sort_values(ascending=False).head(12).reset_index()
income_distribution.columns = ['Income', 'User Count']
print(income_distribution)

     Income  User Count
0     20000        2952
1    100000        1621
2     80000        1111
3     30000        1048
4     40000        1005
5     50000         975
6     60000         736
7     70000         707
8    150000         631
9   1000000         521
10   250000         149
11   500000          48

# Visualize income distribution
plt.figure(figsize=(10, 6))
plt.hist(filtered_income['income'], bins=30, color='gold', edgecolor='black')
plt.xlabel('Income')
plt.ylabel('Number of Users')
plt.title('Income Distribution on Bumble')
plt.grid(axis='y', alpha=0.3)
plt.show()

# Filter users with income > 0
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Group by age group and gender
income_by_age_gender = (
    filtered_income.groupby(['age_group', 'gender'])['income']
    .mean()
    .reset_index()
    .sort_values(by=['age_group', 'income'], ascending=[True, False])
    .round(2)
)
income_by_age_gender.columns = ['Age Group', 'Gender', 'Average Income']
print(income_by_age_gender)

  Age Group Gender  Average Income
1     18-25      m       106618.77
0     18-25      f        86066.35
3     26-35      m       114944.80
2     26-35      f        90398.13
5     36-45      m       112680.61
4     36-45      f        87302.98
7       46+      m       100156.63
6       46+      f        75299.76

# Calculate mean age
mean_age = Bumble_Dataset['age'].mean()
# Plot histogram
plt.figure(figsize=(10, 8))
sns.histplot(Bumble_Dataset['age'], bins=30, color='skyblue')
# Add vertical line for mean age
plt.axvline(mean_age, color='red', linestyle='dashed', linewidth=2, label=f'Mean Age: {mean_age:.2f}')
# Labels and title
plt.xlabel('Age')
plt.ylabel('User Count')
plt.title('Age Distribution of Users')
plt.legend()
plt.show()

# Plot histograms for each gender
plt.figure(figsize=(10, 8))
plt.hist(Bumble_Dataset[Bumble_Dataset['gender'] == 'm']['age'], bins=30, alpha=0.6, label='Male', color='blue')
plt.hist(Bumble_Dataset[Bumble_Dataset['gender'] == 'f']['age'], bins=30, alpha=0.6, label='Female', color='pink')
# Add labels and legend
plt.xlabel("Age")
plt.ylabel("Number of Users")
plt.title("Age Distribution by Gender")
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Filter out zero or missing income values
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Set figure size
plt.figure(figsize=(10, 8))
# Create scatter plot
sns.scatterplot(x=filtered_income['age'], y=filtered_income['income'], alpha=0.5)
# Add trend line using regression plot
sns.regplot(x=filtered_income['age'], y=filtered_income['income'], scatter=False, color='red')
# Labels and title
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Scatter Plot of Income vs. Age with Trend Line")
plt.show()

# Filter income > 0
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Create boxplot
plt.figure(figsize=(10, 8))
sns.boxplot(x='age_group', y='income', data=filtered_income, hue='age_group', legend=False)
# Labels
plt.xlabel("Age Group")
plt.ylabel("Income")
plt.title("Income Distribution by Age Group")
plt.show()

# Filter income > 0
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Boxplot for income by gender and status
plt.figure(figsize=(12, 8))
sns.boxplot(x='status', y='income', hue='gender', data=filtered_income, palette='Set2')
# Labels
plt.xlabel("Status")
plt.ylabel("Income")
plt.title("Income Distribution by Gender and Status")
plt.legend(title="Gender")
plt.show()

# Count pet preferences
pet_counts = Bumble_Dataset['pets'].value_counts().reset_index()
pet_counts.columns = ['Pet Preference', 'Count']
# Plot top 10
plt.figure(figsize=(10, 8))
top_pets = pet_counts.head(10)
sns.barplot(x='Pet Preference', y='Count', data=top_pets, hue='Pet Preference', legend=False)
plt.xlabel("Pet Preferences")
plt.ylabel("Number of Users")
plt.title("Distribution of Pet Preferences on Bumble")
plt.xticks(rotation=90)
plt.show()

# Group by age_group and pets
pet_preferences = Bumble_Dataset.groupby(['age_group', 'pets']).size().reset_index(name='Count')
# Plot
plt.figure(figsize=(10, 8))
top_pets_by_age = pet_preferences.sort_values('Count', ascending=False).head(15)
sns.barplot(x='age_group', y='Count', hue='pets', data=top_pets_by_age, palette='Set3')
plt.xlabel("Age Group")
plt.ylabel("Number of Users")
plt.title("Pet Preferences Across Age Groups")
plt.legend(title="Pet Preference", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

# Count zodiac signs
zodiac_counts = Bumble_Dataset['sign'].value_counts()
# Create bar chart
plt.figure(figsize=(20, 8))
sns.barplot(x=zodiac_counts.index, y=zodiac_counts.values, hue=zodiac_counts.index, legend=False)
plt.xlabel("Zodiac Sign")
plt.ylabel("Number of Users")
plt.title("Distribution of Zodiac Signs on the Platform")
plt.xticks(rotation=90)
plt.show()

# Group by gender, status, and sign
gender_status = Bumble_Dataset.groupby(["gender", "status", "sign"])["status"].count().reset_index(name="count")
stacked_data = gender_status.pivot_table(index="sign", columns=["gender", "status"], values="count", fill_value=0)
# Create stacked chart
stacked_data.plot(kind="bar", stacked=True, figsize=(20, 8), colormap="coolwarm")
plt.xlabel("Zodiac Sign")
plt.ylabel("Number of Users")
plt.title("Gender & Status Counts Across Zodiac Signs")
plt.xticks(rotation=90)
plt.legend(title="Gender, Status", bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.show()

                age        height          income
count  59946.000000  59946.000000    59946.000000
mean      32.340290     68.295282    20033.222534
std        9.452779      3.994738    97346.192104
min       18.000000      1.000000       -1.000000
25%       26.000000     66.000000       -1.000000
50%       30.000000     68.000000       -1.000000
75%       37.000000     71.000000       -1.000000
max      110.000000     95.000000  1000000.000000

	age	status	gender	body_type	diet	drinks	education	ethnicity	height	income	job	last_online	location	pets	religion	sign	speaks
0	22	single	m	a little extra	strictly anything	socially	working on college/university	asian, white	75.0	-1	transportation	2012-06-28-20-30	south san francisco, california	likes dogs and likes cats	agnosticism and very serious about it	gemini	english
1	35	single	m	average	mostly other	often	working on space camp	white	70.0	80000	hospitality / travel	2012-06-29-21-41	oakland, california	likes dogs and likes cats	agnosticism but not too serious about it	cancer	english (fluently), spanish (poorly), french (...
2	38	available	m	thin	anything	socially	graduated from masters program	NaN	68.0	-1	NaN	2012-06-27-09-10	san francisco, california	has cats	NaN	pisces but it doesn’t matter	english, french, c++
3	23	single	m	thin	vegetarian	socially	working on college/university	white	71.0	20000	student	2012-06-28-14-22	berkeley, california	likes cats	NaN	pisces	english, german (poorly)
4	29	single	m	athletic	NaN	socially	graduated from college/university	asian, black, other	66.0	-1	artistic / musical / writer	2012-06-27-21-26	san francisco, california	likes dogs and likes cats	NaN	aquarius	english
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
59941	59	single	f	NaN	NaN	socially	graduated from college/university	NaN	62.0	-1	sales / marketing / biz dev	2012-06-12-21-47	oakland, california	has dogs	catholicism but not too serious about it	cancer and it’s fun to think about	english
59942	24	single	m	fit	mostly anything	often	working on college/university	white, other	72.0	-1	entertainment / media	2012-06-29-11-01	san francisco, california	likes dogs and likes cats	agnosticism	leo but it doesn’t matter	english (fluently)
59943	42	single	m	average	mostly anything	not at all	graduated from masters program	asian	71.0	100000	construction / craftsmanship	2012-06-27-23-37	south san francisco, california	NaN	christianity but not too serious about it	sagittarius but it doesn’t matter	english (fluently)
59944	27	single	m	athletic	mostly anything	socially	working on college/university	asian, black	73.0	-1	medicine / health	2012-06-23-13-01	san francisco, california	likes dogs and likes cats	agnosticism but not too serious about it	leo and it’s fun to think about	english (fluently), spanish (poorly), chinese ...
59945	39	single	m	average	NaN	socially	graduated from masters program	white	68.0	-1	medicine / health	2012-06-29-00-42	san francisco, california	likes dogs and likes cats	catholicism and laughing about it	gemini and it’s fun to think about	english

                        Status  count  Percentage
status
single                  single  55697       92.91
seeing someone  seeing someone   2064        3.44
available            available   1865        3.11
married                married    310        0.52
unknown                unknown     10        0.02

Analyzing Bumble Profiles Using Python

Installing Pandas

Importing pandas, numpy, matplotlib, seaborn

Importing the Dataset as CSV File

Part 1: Data Cleaning

1. Inspecting Missing Data

Que 1:

Que 2:

Que 3:

2. Data Types

Que 1:

Que 2:

Que 3:

3. Outliers

Que 1:

Que 2:

Que 3:

4. Missing Data Visualization

Part 2: Data Processing

1. Binning and Grouping

Que 1:

Que 2:

2. Derived Features

Que 1:

3. Unit Conversion

Que 1:

Part 3: Data Analysis

1. Demographic Analysis

Que 1:

Que 2:

Que 3:

2. Correlation Analysis

Que 1:

Que 2:

3. Diet and Lifestyle Analysis

Que 1:

Que 2:

4. Geographical Insights

Que 1:

Que 2:

Que 3:

5. Height Analysis

Que 1:

Que 2:

Que 3:

6. Income Analysis

Que 1:

Que 2:

Part 4: Data Visualization

1. Age Distribution

Que 1:

Que 2:

2. Income and Age

Que 1:

Que 2:

Que 3:

3. Pets and Preferences

Que 1:

Que 2:

4. Signs and Personality

Que 1:

Que 2:

Summary of Analysis

The analysis of Bumble user data revealed several key insights:

Data Completeness:

User Demographics:

Profile Completeness:

Dietary Habits & Drinking:

Geographic Trends:

Height Differences:

Income Distribution:

User Preferences:

Recommendations

To enhance user engagement and experience, Bumble should consider:

Targeting the 26-35 Age Group & Male Users:

Improving Data Completeness:

Enhancing Personalization Through Diet & Drinking Habits:

Integrating Pet Preferences into Matchmaking:

Leveraging Geographic Data:

Income-Based Features: