Analyzing Bumble Profiles Using Python¶

This project focuses on analyzing user data from Bumble, a well-known dating app, to uncover valuable insights about its users.

By exploring the information users share in their profiles, such as demographics, lifestyle habits, and preferences, we aim to better understand their behavior and trends.

These insights will help Bumble's product and marketing teams make informed decisions to improve user engagement, fine-tune the matchmaking algorithms, and offer personalized features.

Ultimately, the goal is to create a more satisfying experience for Bumble users and support the platform's growth.

Installing Pandas¶

In [53]:
%pip install pandas
Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.13/site-packages (2.3.2)
Requirement already satisfied: numpy>=1.26.0 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2.2.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2025.2)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.13/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
Note: you may need to restart the kernel to use updated packages.

Importing pandas, NumPy, Matplotlib, and seaborn¶

In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Importing the Dataset as CSV File¶

In [55]:
# Load the dataset
# Both files are in the same folder, so we just use the filename
Bumble_Dataset = pd.read_csv('bumble.csv')
Bumble_Dataset
Out[55]:
|  | age | status | gender | body_type | diet | drinks | education | ethnicity | height | income | job | last_online | location | pets | religion | sign | speaks |
| 0 | 22 | single | m | a little extra | strictly anything | socially | working on college/university | asian, white | 75.0 | -1 | transportation | 2012-06-28-20-30 | south san francisco, california | likes dogs and likes cats | agnosticism and very serious about it | gemini | english |
| 1 | 35 | single | m | average | mostly other | often | working on space camp | white | 70.0 | 80000 | hospitality / travel | 2012-06-29-21-41 | oakland, california | likes dogs and likes cats | agnosticism but not too serious about it | cancer | english (fluently), spanish (poorly), french (... |
| 2 | 38 | available | m | thin | anything | socially | graduated from masters program | NaN | 68.0 | -1 | NaN | 2012-06-27-09-10 | san francisco, california | has cats | NaN | pisces but it doesn’t matter | english, french, c++ |
| 3 | 23 | single | m | thin | vegetarian | socially | working on college/university | white | 71.0 | 20000 | student | 2012-06-28-14-22 | berkeley, california | likes cats | NaN | pisces | english, german (poorly) |
| 4 | 29 | single | m | athletic | NaN | socially | graduated from college/university | asian, black, other | 66.0 | -1 | artistic / musical / writer | 2012-06-27-21-26 | san francisco, california | likes dogs and likes cats | NaN | aquarius | english |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59941 | 59 | single | f | NaN | NaN | socially | graduated from college/university | NaN | 62.0 | -1 | sales / marketing / biz dev | 2012-06-12-21-47 | oakland, california | has dogs | catholicism but not too serious about it | cancer and it’s fun to think about | english |
| 59942 | 24 | single | m | fit | mostly anything | often | working on college/university | white, other | 72.0 | -1 | entertainment / media | 2012-06-29-11-01 | san francisco, california | likes dogs and likes cats | agnosticism | leo but it doesn’t matter | english (fluently) |
| 59943 | 42 | single | m | average | mostly anything | not at all | graduated from masters program | asian | 71.0 | 100000 | construction / craftsmanship | 2012-06-27-23-37 | south san francisco, california | NaN | christianity but not too serious about it | sagittarius but it doesn’t matter | english (fluently) |
| 59944 | 27 | single | m | athletic | mostly anything | socially | working on college/university | asian, black | 73.0 | -1 | medicine / health | 2012-06-23-13-01 | san francisco, california | likes dogs and likes cats | agnosticism but not too serious about it | leo and it’s fun to think about | english (fluently), spanish (poorly), chinese ... |
| 59945 | 39 | single | m | average | NaN | socially | graduated from masters program | white | 68.0 | -1 | medicine / health | 2012-06-29-00-42 | san francisco, california | likes dogs and likes cats | catholicism and laughing about it | gemini and it’s fun to think about | english |

59946 rows × 17 columns

Part 1: Data Cleaning¶

1. Inspecting Missing Data¶

Missing data is a common issue in real-world datasets. On a platform like Bumble, missing user information might reflect gaps in the user profile setup process, incomplete data collection, or users intentionally leaving certain fields blank. As a data analyst, your role is to assess the extent of missing data, understand its potential impact, and decide the most appropriate methods to address it.

Que 1:¶

Which columns in the dataset have missing values, and what percentage of data is missing in each column?

In [56]:
missing_values = Bumble_Dataset.isnull().sum()
missing_values
Out[56]:
age                0
status             0
gender             0
body_type       5296
diet           24395
drinks          2985
education       6628
ethnicity       5680
height             3
income             0
job             8198
last_online        0
location           0
pets           19921
religion       20226
sign           11056
speaks            50
dtype: int64

Here we used isnull().sum() to count the missing values in each column.

Que 2:¶

Are there columns where more than 50% of the data is missing? Would you drop the columns where more than 50% of the values are missing?

Ans - No column has more than 50% missing data (the highest is diet, at about 40.7%), so no columns need to be dropped.

In [57]:
percentage_of_missing_values = ((missing_values) / len(Bumble_Dataset)) * 100
percentage_of_missing_values
Out[57]:
age             0.000000
status          0.000000
gender          0.000000
body_type       8.834618
diet           40.694959
drinks          4.979482
education      11.056618
ethnicity       9.475194
height          0.005005
income          0.000000
job            13.675641
last_online     0.000000
location        0.000000
pets           33.231575
religion       33.740366
sign           18.443266
speaks          0.083408
dtype: float64
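
The counts and percentages from the two cells above can also be collected in one table. The sketch below is one possible way to do that; it uses a small made-up frame (`sample`) rather than `bumble.csv`, and the names `missing_summary`, `threshold_pct`, and `columns_over_threshold` are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Bumble data (not the real file)
sample = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["anything", np.nan, np.nan, np.nan],
    "pets": ["has cats", np.nan, "likes dogs", "has dogs"],
})

# isnull().mean() gives the fraction of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(2),
}).sort_values("missing_pct", ascending=False)

# Columns above the 50% threshold discussed in Que 2
threshold_pct = 50
columns_over_threshold = missing_summary.index[
    missing_summary["missing_pct"] > threshold_pct
].tolist()

print(missing_summary)
print(columns_over_threshold)
```

In this toy frame only `diet` crosses the threshold; on the real data the same check would return an empty list, in line with the answer to Que 2.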

Que 3:¶

How would you handle the missing numerical data (e.g., height, income)? Would you impute the missing data with the median or average value of height and income for the corresponding category, such as gender, age group, or location?

In [58]:
# Calculate median height by gender
calculate_median = Bumble_Dataset.groupby(["gender"])["height"].transform("median")
# Fill missing height values with median
Bumble_Dataset["height"] = Bumble_Dataset["height"].fillna(calculate_median)
print(calculate_median)
0        70.0
1        70.0
2        70.0
3        70.0
4        70.0
         ... 
59941    65.0
59942    70.0
59943    70.0
59944    70.0
59945    70.0
Name: height, Length: 59946, dtype: float64

Used groupby() with transform("median") to compute the median height for each gender, then fillna() to replace the 3 missing height values with the median for that user's gender.

Income has no NaN values, so nothing is imputed for it here; its -1 placeholders are handled in the Outliers section.
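
The question also mentions finer categories such as age group. As a rough sketch of how that could look, the example below imputes by gender and an age band first and falls back to the gender median when a finer group has no observed heights. The frame and the `age_band` column are invented for illustration; the notebook itself only groups by gender.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; age_band is an assumed helper column
sample = pd.DataFrame({
    "gender":   ["m", "m", "m", "f", "f", "f"],
    "age_band": ["18-25", "18-25", "26-35", "18-25", "18-25", "26-35"],
    "height":   [70.0, np.nan, np.nan, 64.0, 66.0, np.nan],
})

# Median within gender + age band, aligned row by row with the frame
fine_median = sample.groupby(["gender", "age_band"])["height"].transform("median")

# Coarser fallback: median within gender only
coarse_median = sample.groupby("gender")["height"].transform("median")

# Fill from the finer group first, then from the coarser one
sample["height_filled"] = sample["height"].fillna(fine_median).fillna(coarse_median)
print(sample)
```

Observed heights are left untouched; only the missing entries are filled.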

2. Data Types¶

Que 1:¶

Are there any inconsistencies in the data types across columns (e.g., numerical data stored as strings)?

In [59]:
Bumble_Dataset.dtypes
Out[59]:
age              int64
status          object
gender          object
body_type       object
diet            object
drinks          object
education       object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
pets            object
religion        object
sign            object
speaks          object
dtype: object

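Used dtypes to check the data type of each column.

The numeric fields (age, height, income) are already stored as numeric types. The one mismatch is last_online, which holds timestamps but is stored as object (string); it is addressed under Que 3.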

Que 2:¶

Which columns require conversion to numerical data types for proper analysis (e.g., income)?

Ans - No column needs conversion to a numerical data type: age and income are already int64 and height is float64.

Que 3:¶

Does the last_online column need to be converted into a datetime format? What additional insights can be gained by analyzing this as a date field?

In [60]:
# "Last_online" column is in object data type, hence it requires conversion
# Used to_datetime() for data type conversion with error handling
# The 'errors="coerce"' parameter will convert invalid dates to NaT (Not a Time)

import warnings
warnings.filterwarnings('ignore')  # Suppress warnings

Bumble_Dataset["last_online"] = pd.to_datetime(Bumble_Dataset["last_online"], errors='coerce')
Bumble_Dataset["last_online"]
Out[60]:
0                              NaT
1                              NaT
2        2012-06-27 09:00:00-10:00
3        2012-06-28 14:00:00-22:00
4                              NaT
                   ...            
59941                          NaT
59942    2012-06-29 11:00:00-01:00
59943                          NaT
59944    2012-06-23 13:00:00-01:00
59945                          NaT
Name: last_online, Length: 59946, dtype: object
In [61]:
Bumble_Dataset.dtypes
Out[61]:
age              int64
status          object
gender          object
body_type       object
diet            object
drinks          object
education       object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
pets            object
religion        object
sign            object
speaks          object
dtype: object
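
The output above still reports `object` for last_online and contains many `NaT` values, which suggests the default parser did not recognise the `YYYY-MM-DD-HH-MM` layout (the trailing minutes appear to have been read as a UTC offset, as in `2012-06-27 09:00:00-10:00`). A minimal sketch of one way around this, assuming the layout seen in the dataset preview holds for every row, is to pass an explicit `format`. The series `raw` below is a small sample, not the full column.

```python
import pandas as pd

# First three values are copied from the dataset preview
raw = pd.Series([
    "2012-06-28-20-30",
    "2012-06-29-21-41",
    "2012-06-27-09-10",
    "not a timestamp",   # hypothetical bad value to show errors="coerce"
])

# Spell out the layout: year-month-day-hour-minute
parsed = pd.to_datetime(raw, format="%Y-%m-%d-%H-%M", errors="coerce")
print(parsed)
print(parsed.dtype)

# Once parsed, date parts are available through the .dt accessor
print(parsed.dt.hour.tolist())
print(parsed.dt.day_name().tolist())
```

With a real datetime column, the extra insights the question asks about become straightforward: activity by hour of day or weekday, and how recently each user was online relative to the latest timestamp in the data.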

3. Outliers¶

Que 1:¶

Are there any apparent outliers in numerical columns such as age, height, or income? What are the ranges of values in these columns?

In [62]:
Bumble_Dataset.describe()
Out[62]:
                age        height          income
count  59946.000000  59946.000000    59946.000000
mean      32.340290     68.295282    20033.222534
std        9.452779      3.994738    97346.192104
min       18.000000      1.000000       -1.000000
25%       26.000000     66.000000       -1.000000
50%       30.000000     68.000000       -1.000000
75%       37.000000     71.000000       -1.000000
max      110.000000     95.000000  1000000.000000
In [63]:
# Yes: age has a maximum of 110, height a minimum of 1 inch, and income runs from the -1 placeholder up to 1,000,000
age_range = (Bumble_Dataset['age'].min(), Bumble_Dataset['age'].max())
height_range = (Bumble_Dataset['height'].min(), Bumble_Dataset['height'].max())
income_range = (Bumble_Dataset['income'].min(), Bumble_Dataset['income'].max())
print(f'Age range is: {age_range}')
print(f'Height range is: {height_range}')
print(f'Income range is: {income_range}')
Age range is: (18, 110)
Height range is: (1.0, 95.0)
Income range is: (-1, 1000000)

Que 2:¶

Any -1 values in numerical columns like income should be replaced with blank, as they may represent missing or invalid data.

In [64]:
Bumble_Dataset['income'] = Bumble_Dataset['income'].replace(-1, 0)
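
The question asks for -1 to become blank, while the cell above substitutes 0. As a hedged comparison of the two choices, the sketch below uses a short made-up income series: with 0 the placeholder rows still count toward averages, whereas with `np.nan` pandas skips them by default. The names `as_zero` and `as_blank` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; -1 marks "not reported"
income = pd.Series([-1, 80000, -1, 20000, -1], name="income")

as_zero = income.replace(-1, 0)        # what the cell above does
as_blank = income.replace(-1, np.nan)  # "blank" in the pandas sense

print(as_zero.mean())            # placeholder rows pull the mean toward 0
print(as_blank.mean())           # NaN rows are skipped by default
print(as_blank.isnull().sum())   # number of unreported incomes
```

Note that switching to NaN would change later results in this notebook that rely on income being 0 for unreported users, such as the `income != 0` filter and the averages by city and state.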

Que 3:¶

For other outliers, how would you ensure that they do not disproportionately impact the analysis while retaining as much meaningful data as possible? Would you delete the data or, rather than deleting it, calculate the mean and median values using only the middle 80% of the data (removing extreme high and low values)? Provide appropriate reasons for every step.

In [65]:
column_names = ["income", "age", "height"]
for column in column_names:
    # The 10th and 90th percentiles bound the middle 80% of the values
    p10 = Bumble_Dataset[column].quantile(0.10)
    p90 = Bumble_Dataset[column].quantile(0.90)
    # Keep rows strictly between the two percentiles
    filtered_values = Bumble_Dataset[(Bumble_Dataset[column] > p10) & (Bumble_Dataset[column] < p90)]
    mean_value = filtered_values[column].mean()
    median_value = filtered_values[column].median()
    print(f"\n{column}: filtered_median: {median_value}, filtered_mean: {mean_value}")
income: filtered_median: 20000.0, filtered_mean: 26109.89010989011

age: filtered_median: 30.0, filtered_mean: 31.357685563997663

height: filtered_median: 68.0, filtered_mean: 68.25442646465552

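Rather than deleting rows, mean() and median() were calculated on the values strictly between the 10th and 90th percentiles (roughly the middle 80% of the data). Extreme entries such as an age of 110 or a height of 1 inch therefore do not distort the summary, while every row stays in the dataset for the rest of the analysis.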
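
To make the effect of trimming concrete, here is a small worked sketch on invented ages. The helper `middle_80_stats` is a hypothetical name, and unlike the cell above it keeps values equal to the percentile bounds (`Series.between` is inclusive by default), which is a design choice rather than the notebook's method.

```python
import pandas as pd

def middle_80_stats(values: pd.Series) -> dict:
    """Mean and median of the values between the 10th and 90th percentiles."""
    low, high = values.quantile([0.10, 0.90])
    kept = values[values.between(low, high)]  # bounds included
    return {"mean": kept.mean(), "median": kept.median(), "rows_kept": len(kept)}

# Hypothetical ages with one implausible entry
ages = pd.Series([18, 22, 25, 27, 29, 30, 31, 33, 36, 40, 110])

stats = middle_80_stats(ages)
print(ages.mean(), ages.median())  # the mean is pulled up by 110
print(stats)
```

The median barely moves between the full and trimmed data, while the mean drops once the extreme value is excluded, which is the reason for preferring trimmed or median-based summaries here.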

4. Missing Data Visualization¶

Visualizing missing data helps identify patterns of incompleteness in the dataset, which can guide data cleaning strategies. Understanding which columns have high levels of missing data ensures decisions about imputation or removal are well-informed.

Create a heatmap to visualize missing values across the dataset. Which columns show consistent missing data patterns?

In [66]:
missing_values = Bumble_Dataset.isnull().sum()
print(missing_values)
plt.figure(figsize=(12, 8))
sns.heatmap(Bumble_Dataset.isnull(), cbar=False, cmap="inferno")
plt.title('Heatmap of Missing Data')
plt.show()
age                0
status             0
gender             0
body_type       5296
diet           24395
drinks          2985
education       6628
ethnicity       5680
height             0
income             0
job             8198
last_online    35895
location           0
pets           19921
religion       20226
sign           11056
speaks            50
dtype: int64
[Figure: Heatmap of Missing Data]

Part 2: Data Processing¶

1. Binning and Grouping¶

Grouping continuous variables, such as age or income, into bins helps simplify analysis and identify trends among specific groups. For instance, grouping users into age ranges can reveal distinct patterns in behavior or preferences across demographics.

Que 1:¶

How would you bin the age column into categories (e.g. "18-25", "26-35", "36-45", and "46+") to create a new column, age_group? How does the distribution of users vary across these age ranges?

In [67]:
def get_age_grp(age):
    if age <= 25:
        return '18-25'
    elif age <= 35:
        return "26-35"
    elif age <= 45:
        return "36-45"
    else:
        return "46+"

# Apply the function to create a new column
Bumble_Dataset["age_group"] = Bumble_Dataset["age"].apply(get_age_grp)
age_group_counts = Bumble_Dataset['age_group'].value_counts().sort_index()

# Convert Series to DataFrame
age_group_df = age_group_counts.reset_index()
age_group_df.columns = ['Age Group', 'User Count']

# Display results
print(age_group_df)
  Age Group  User Count
0     18-25       14454
1     26-35       28621
2     36-45       10803
3       46+        6068
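
The same bins can also be expressed with `pd.cut`, which avoids the row-by-row `apply`. This is only a sketch on a few made-up ages; the bin edges are chosen to reproduce the labels used above and assume every age is at least 18.

```python
import pandas as pd

# Hypothetical ages sitting on the bin boundaries
ages = pd.Series([18, 25, 26, 35, 36, 45, 46, 110], name="age")

# Right-closed bins: (17, 25], (25, 35], (35, 45], (45, inf]
bins = [17, 25, 35, 45, float("inf")]
labels = ["18-25", "26-35", "36-45", "46+"]
age_group = pd.cut(ages, bins=bins, labels=labels)

print(age_group.value_counts().sort_index())
```

Because the result is an ordered categorical, `sort_index()` returns the groups in age order rather than alphabetical order.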
In [68]:
# Visualize age group distribution
plt.figure(figsize=(10, 6))
plt.bar(age_group_df['Age Group'], age_group_df['User Count'], color='skyblue', edgecolor='black')
plt.xlabel('Age Group')
plt.ylabel('Number of Users')
plt.title('Distribution of Users Across Age Groups')
plt.grid(axis='y', alpha=0.3)
plt.show()
[Figure: Distribution of Users Across Age Groups]

Que 2:¶

Group income into categories like "Low Income," "Medium Income," and "High Income" based on meaningful thresholds (e.g., quartiles). What insights can be derived from these groups?

In [69]:
# Use the 10th and 90th percentiles of income as thresholds
p10 = Bumble_Dataset["income"].quantile(0.10)
p90 = Bumble_Dataset["income"].quantile(0.90)

# Define function to categorize income
def categorize_income(x):
    if x <= p10:
        return "Low Income"
    elif x <= p90:
        return "Medium Income"
    else:
        return "High Income"

# Apply function
Bumble_Dataset["income_category"] = Bumble_Dataset["income"].apply(categorize_income)

# Count occurrences
income_distribution_count = Bumble_Dataset['income_category'].value_counts()
income_distribution_count_df = income_distribution_count.reset_index()
income_distribution_count_df.columns = ['Income Category', 'User Count']
print(income_distribution_count_df)
  Income Category  User Count
0      Low Income       48442
1   Medium Income        5980
2     High Income        5524
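
Since income is 0 for every user who did not report it, the 48,442 "Low Income" users above are largely unreported rather than low earners. One possible alternative, sketched below on invented values, is to build the three tiers from reported incomes only with `pd.qcut` and keep a separate "Not Reported" label. That label and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical incomes; 0 stands for "not reported"
income = pd.Series([0, 0, 0, 20000, 30000, 40000, 50000, 80000, 1000000])

reported = income[income > 0]

# Three equal-sized groups based on reported incomes only
category = pd.qcut(reported, q=3, labels=["Low Income", "Medium Income", "High Income"])

# Rows without a reported income stay outside the three tiers
income_category = category.reindex(income.index).astype(object).fillna("Not Reported")
print(income_category.value_counts())
```

Separating the unreported group keeps the tiers comparable in size and makes it clear how few users disclose an income at all.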
In [70]:
# Visualize income categories
plt.figure(figsize=(10, 6))
plt.bar(income_distribution_count_df['Income Category'], 
        income_distribution_count_df['User Count'], 
        color='lightcoral', edgecolor='black')
plt.xlabel('Income Category')
plt.ylabel('Number of Users')
plt.title('Distribution of Users Across Income Categories')
plt.grid(axis='y', alpha=0.3)
plt.show()
[Figure: Distribution of Users Across Income Categories]

2. Derived Features¶

Derived features are new columns created based on the existing data to add depth to the analysis. These features often reveal hidden patterns or provide new dimensions to explore.

Que 1:¶

Create a new feature, profile_completeness, by calculating the percentage of non-missing values for each user profile. How complete are most user profiles, and how does completeness vary across demographics?

In [71]:
# Calculating total columns
total_columns = Bumble_Dataset.shape[1]
# Calculate non-missing values per row
Bumble_Dataset['profile_completeness'] = round(Bumble_Dataset.notnull().sum(axis=1) / total_columns * 100, 2)
print(Bumble_Dataset['profile_completeness'])
0         94.74
1         94.74
2         84.21
3         94.74
4         84.21
          ...  
59941     78.95
59942    100.00
59943     89.47
59944    100.00
59945     89.47
Name: profile_completeness, Length: 59946, dtype: float64
In [72]:
Bumble_Dataset['profile_completeness'].value_counts()
Out[72]:
profile_completeness
94.74     15924
89.47     15135
84.21     10483
78.95      6080
100.00     5851
73.68      3200
68.42      1589
63.16       919
57.89       534
52.63       182
47.37        49
Name: count, dtype: int64
In [73]:
# Analyze completeness by gender
Gender_anly = Bumble_Dataset.groupby('profile_completeness')['gender'].value_counts()
Gender_anly
Out[73]:
profile_completeness  gender
47.37                 m           33
                      f           16
52.63                 m          125
                      f           57
57.89                 m          358
                      f          176
63.16                 m          568
                      f          351
68.42                 m          972
                      f          617
73.68                 m         1934
                      f         1266
78.95                 m         3657
                      f         2423
84.21                 m         6279
                      f         4204
89.47                 m         9081
                      f         6054
94.74                 m         9344
                      f         6580
100.00                m         3478
                      f         2373
Name: count, dtype: int64
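
Two details of the calculation above are worth keeping in mind: the column count includes the derived age_group and income_category columns, which are never missing, and the grouping lists counts per completeness value rather than a single figure per gender. The sketch below shows one way to address both on a made-up frame; `profile_fields` is a hypothetical list standing in for the original profile columns.

```python
import numpy as np
import pandas as pd

# Hypothetical profiles (not the real data)
profiles = pd.DataFrame({
    "gender": ["m", "f", "m", "f"],
    "diet": ["anything", np.nan, np.nan, "vegan"],
    "pets": ["has cats", "has dogs", np.nan, "likes dogs"],
    "sign": ["gemini", np.nan, np.nan, "leo"],
})

# Fix the list of profile fields before adding any derived columns
profile_fields = ["gender", "diet", "pets", "sign"]

# Row-wise share of filled fields, as a percentage
profiles["profile_completeness"] = (
    profiles[profile_fields].notnull().mean(axis=1) * 100
).round(2)

# Average completeness per gender
completeness_by_gender = profiles.groupby("gender")["profile_completeness"].mean()
print(completeness_by_gender)
```

A per-gender mean (or median) answers the demographic part of the question more directly than a table of counts.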

3. Unit Conversion¶

Standardizing units across datasets is essential for consistency, especially when working with numerical data. In the context of the Bumble dataset, users' heights are given in inches, which may not be intuitive for all audiences.

Que 1:¶

Convert the height column from inches to centimeters using the conversion factor (1 inch = 2.54 cm). Store the converted values in a new column, height_cm.

In [74]:
# Creating a new column with height in centimeters
Bumble_Dataset["height_cm"] = Bumble_Dataset["height"] * 2.54
print(Bumble_Dataset[["height", "height_cm"]])
       height  height_cm
0        75.0     190.50
1        70.0     177.80
2        68.0     172.72
3        71.0     180.34
4        66.0     167.64
...       ...        ...
59941    62.0     157.48
59942    72.0     182.88
59943    71.0     180.34
59944    73.0     185.42
59945    68.0     172.72

[59946 rows x 2 columns]

Part 3: Data Analysis¶

1. Demographic Analysis¶

Understanding the demographics of users is essential for tailoring marketing strategies, improving user experience, and designing features that resonate with the platform's audience. Insights into gender distribution, orientation, and relationship status can help Bumble refine its matchmaking algorithms and engagement campaigns.

Que 1:¶

What is the gender distribution (gender) across the platform? Are there any significant imbalances?

In [75]:
# Count the number of users by gender
gender_distribution = Bumble_Dataset['gender'].value_counts()
# Calculate percentage
gender_percentage = round(gender_distribution / gender_distribution.sum() * 100, 2)
# Display results
gender_distribution_df = pd.DataFrame({
    "Gender": gender_percentage.index,
    "Count": gender_distribution,
    "Percentage": gender_percentage
})
print(gender_distribution_df)
       Gender  Count  Percentage
gender                          
m           m  35829       59.77
f           f  24117       40.23
In [76]:
# Visualize gender distribution
plt.figure(figsize=(10, 6))
plt.bar(gender_distribution_df['Gender'], gender_distribution_df['Count'], 
        color=['skyblue', 'pink'], edgecolor='black')
plt.xlabel('Gender')
plt.ylabel('Number of Users')
plt.title('Gender Distribution on Bumble')
plt.grid(axis='y', alpha=0.3)
plt.show()
[Figure: Gender Distribution on Bumble]

Que 2:¶

What are the proportions of users in different status categories (e.g., single, married, seeing someone)? What does this suggest about the platform's target audience?

In [77]:
# Count the number of users by status
status_distribution = Bumble_Dataset['status'].value_counts()
# Calculate percentage
status_percentage = round(status_distribution / len(Bumble_Dataset["status"]) * 100, 2)
relationship_distribution_df = pd.DataFrame({
    'Status': status_distribution.index,
    'count': status_distribution,
    'Percentage': status_percentage
})
relationship_distribution_df
Out[77]:
                        Status  count  Percentage
status                                           
single                  single  55697       92.91
seeing someone  seeing someone   2064        3.44
available            available   1865        3.11
married                married    310        0.52
unknown                unknown     10        0.02

Que 3:¶

How does status vary by gender? For example, what proportion of men and women identify as single?

In [78]:
# Count relationship status within each gender
status_by_gender = Bumble_Dataset.groupby('gender')['status'].value_counts(normalize=True) * 100
# Convert to DataFrame
status_by_gender = status_by_gender.unstack()
# Display results
print(status_by_gender)
status  available   married  seeing someone     single   unknown
gender                                                          
f        2.720073  0.559771        4.158892  92.544678  0.016586
m        3.374362  0.488431        2.961288  93.159173  0.016746

2. Correlation Analysis¶

Correlation analysis helps uncover relationships between variables, guiding feature engineering and hypothesis generation. For example, understanding how age correlates with income or word count in profiles can reveal behavioral trends that inform platform design.

Que 1:¶

What are the correlations between numerical columns such as age, height, and income? Are there any strong positive or negative relationships?

In [79]:
# Selecting numerical columns
numerical_columns = Bumble_Dataset[['age', 'height', 'income']]
# Compute correlation matrix
correlation = numerical_columns.corr()
# Print result
print(correlation)
             age    height    income
age     1.000000 -0.022253 -0.001004
height -0.022253  1.000000  0.065048
income -0.001004  0.065048  1.000000
In [80]:
# Visualize correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()
[Figure: Correlation Matrix]

Que 2:¶

How does age correlate with income? Are older users more likely to report higher income levels?

In [81]:
correlation = Bumble_Dataset["age"].corr(Bumble_Dataset["income"])
print(correlation)
-0.0010038681910053968
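
The coefficient above is computed over all users, including those whose income is the 0 placeholder, so it mostly reflects who reports an income rather than how much they earn. As a hedged illustration on invented numbers, restricting the calculation to reported incomes can give a very different picture:

```python
import pandas as pd

# Hypothetical users; income 0 means "not reported"
users = pd.DataFrame({
    "age":    [22, 25, 30, 35, 40, 45, 50, 55],
    "income": [0, 20000, 0, 40000, 0, 60000, 0, 80000],
})

# Pearson correlation over every row, placeholders included
all_rows = users["age"].corr(users["income"])

# Same calculation on users who reported an income
reported = users[users["income"] > 0]
reported_only = reported["age"].corr(reported["income"])

print(round(all_rows, 3), round(reported_only, 3))
```

The toy data is deliberately constructed so the reported incomes rise with age; whether the real data shows any such trend would need to be checked the same way.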

3. Diet and Lifestyle Analysis¶

Lifestyle attributes such as diet and drinks provide insights into user habits and preferences. Analyzing these factors helps identify compatibility trends and inform product features like filters or match recommendations.

Que 1:¶

How do dietary preferences (diet) distribute across the platform? For example, what percentage of users identify as vegetarian, vegan, or follow "anything" diets?

In [82]:
# Count occurrences of each dietary preference
diet_distribution = Bumble_Dataset['diet'].value_counts().reset_index()
diet_distribution.columns = ['Diet Type', 'Count']
# Calculate percentages
diet_distribution['Percentage'] = round((diet_distribution['Count'] / diet_distribution['Count'].sum()) * 100, 2)
# Display top 10
print(diet_distribution.head(10))
             Diet Type  Count  Percentage
0      mostly anything  16585       46.65
1             anything   6183       17.39
2    strictly anything   5113       14.38
3    mostly vegetarian   3444        9.69
4         mostly other   1007        2.83
5  strictly vegetarian    875        2.46
6           vegetarian    667        1.88
7       strictly other    452        1.27
8         mostly vegan    338        0.95
9                other    331        0.93
In [83]:
# Visualize top 10 diet preferences
plt.figure(figsize=(12, 6))
top_diets = diet_distribution.head(10)
plt.barh(range(len(top_diets)), top_diets['Count'], color='lightgreen', edgecolor='black')
plt.yticks(range(len(top_diets)), top_diets['Diet Type'])
plt.xlabel('Number of Users')
plt.title('Top 10 Diet Preferences on Bumble')
plt.grid(axis='x', alpha=0.3)
plt.show()
[Figure: Top 10 Diet Preferences on Bumble]

Que 2:¶

How do drinking habits (drinks) vary across different diet categories? Are users with stricter diets (e.g., vegan) less likely to drink?

In [84]:
# Analyze drinking habits across different diet categories
drink_diet_distribution = Bumble_Dataset.groupby("diet")["drinks"].value_counts(normalize=True)
# Display results
drink_diet_distribution
Out[84]:
diet        drinks     
anything    socially       0.756676
            often          0.095794
            rarely         0.085113
            not at all     0.049232
            very often     0.009680
                             ...   
vegetarian  rarely         0.107717
            often          0.094855
            not at all     0.059486
            very often     0.011254
            desperately    0.009646
Name: proportion, Length: 103, dtype: float64
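
With 103 rows, the output above is hard to compare across diets. One way to make the question answerable at a glance, sketched here on made-up rows, is to collapse each diet to its base category and build a row-normalised `pd.crosstab`. The `diet_base` column is an assumed helper, not part of the dataset.

```python
import pandas as pd

# Hypothetical users (not the real data)
users = pd.DataFrame({
    "diet":   ["mostly vegan", "strictly vegan", "vegan", "anything",
               "mostly anything", "strictly anything", "vegetarian", "mostly vegetarian"],
    "drinks": ["not at all", "rarely", "socially", "socially",
               "often", "socially", "rarely", "socially"],
})

# Drop the "mostly"/"strictly" qualifier to get the base diet
users["diet_base"] = users["diet"].str.replace(r"^(mostly|strictly)\s+", "", regex=True)

# Row-normalised table: each row sums to 100
drinks_by_diet = pd.crosstab(users["diet_base"], users["drinks"], normalize="index") * 100
print(drinks_by_diet.round(1))
```

Reading across a row then shows directly what share of, say, vegan users drink "not at all" compared with users who eat anything.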

4. Geographical Insights¶

Analyzing geographical data helps Bumble understand its user base distribution, enabling targeted regional campaigns and feature localization. For instance, identifying the top cities with active users can guide marketing efforts in those areas.

Que 1:¶

Extract city and state information from the location column. What are the top 5 cities and states with the highest number of users?

In [85]:
# Keep rows that have a location; .copy() avoids chained-assignment warnings
Bumble_Dataset = Bumble_Dataset[Bumble_Dataset['location'].notna()].copy()
# Split on the first ", " into city and state
Bumble_Dataset[['City', 'State']] = Bumble_Dataset['location'].str.split(', ', n=1, expand=True)
# Remove surrounding white space
Bumble_Dataset['City'] = Bumble_Dataset['City'].str.strip()
Bumble_Dataset['State'] = Bumble_Dataset['State'].str.strip()

# Count users per city and state
top_cities = Bumble_Dataset['City'].value_counts().head(5)
top_states = Bumble_Dataset['State'].value_counts().head(5)

# Convert to DataFrame
top_cities_df = top_cities.reset_index()
top_cities_df.columns = ['City', 'User Count']
top_states_df = top_states.reset_index()
top_states_df.columns = ['State', 'User Count']

# Display results
print(f"Top 5 Cities:\n{top_cities_df}")
print(f"\nTop 5 States:\n{top_states_df}")
Top 5 Cities:
            City  User Count
0  san francisco       31064
1        oakland        7214
2       berkeley        4212
3      san mateo        1331
4      palo alto        1064

Top 5 States:
           State  User Count
0     california       59855
1       new york          17
2       illinois           8
3  massachusetts           5
4          texas           4
In [86]:
# Visualize top cities
plt.figure(figsize=(10, 6))
plt.barh(range(len(top_cities_df)), top_cities_df['User Count'], color='coral', edgecolor='black')
plt.yticks(range(len(top_cities_df)), top_cities_df['City'])
plt.xlabel('Number of Users')
plt.title('Top 5 Cities by User Count')
plt.grid(axis='x', alpha=0.3)
plt.show()
[Figure: Top 5 Cities by User Count]

Que 2:¶

How does age vary across the top cities? Are certain cities dominated by younger or older users?

In [87]:
# Find top 5 cities
top_cities = Bumble_Dataset['City'].value_counts().head(5).index
# Filter dataset
df_top_cities = Bumble_Dataset[Bumble_Dataset['City'].isin(top_cities)]
# Calculate age statistics
age_stats = df_top_cities.groupby('City')['age'].mean().sort_values(ascending=True).reset_index()
age_stats.columns = ['City', 'Age Average']
print(age_stats)
            City  Age Average
0       berkeley    31.391738
1  san francisco    31.614312
2      palo alto    31.980263
3        oakland    33.178819
4      san mateo    33.437265

Que 3:¶

What are the average income levels in the top states or cities? Are there regional patterns in reported income?

In [88]:
# Calculate average income by city
df_top_cities = Bumble_Dataset[Bumble_Dataset['City'].isin(top_cities)]
avg_income_by_city = round(df_top_cities.groupby('City')['income'].mean().sort_values(ascending=False), 2)
print("Average Income in Top Cities:")
print(avg_income_by_city)

# Calculate average income by state
top_states = Bumble_Dataset['State'].value_counts().head(5).index
df_top_states = Bumble_Dataset[Bumble_Dataset['State'].isin(top_states)]
avg_income_by_state = round(df_top_states.groupby('State')['income'].mean().sort_values(ascending=False), 2)
print("\nAverage Income in Top States:")
print(avg_income_by_state)
Average Income in Top Cities:
City
san mateo        22779.86
oakland          22586.64
san francisco    20150.01
palo alto        19332.71
berkeley         17364.67
Name: income, dtype: float64

Average Income in Top States:
State
new york         31764.71
california       20044.27
massachusetts     6000.00
texas             5000.00
illinois             0.00
Name: income, dtype: float64
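
These averages include the 0 placeholder for unreported income, and several states have only a handful of users (Illinois has 8, all apparently without a reported income). A rough sketch of a more cautious summary, on invented rows, reports the number of users and of reporting users next to the mean of reported incomes. The column names in `summary` are illustrative.

```python
import pandas as pd

# Hypothetical users; income 0 means "not reported"
users = pd.DataFrame({
    "State":  ["california"] * 5 + ["illinois"] * 2,
    "income": [0, 0, 20000, 80000, 0, 0, 0],
})

# Turn the 0 placeholder into NaN so it is ignored by mean/count
reported = users["income"].where(users["income"] > 0)

summary = users.assign(reported_income=reported).groupby("State").agg(
    users=("income", "size"),
    reporting=("reported_income", "count"),
    mean_reported_income=("reported_income", "mean"),
)
print(summary)
```

Showing the group sizes alongside the mean makes it easier to see which regional figures rest on too few users to interpret.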

5. Height Analysis¶

Physical attributes like height are often considered important in dating preferences. Analyzing height patterns helps Bumble understand user demographics and preferences better.

Que 1:¶

What is the average height of users across different gender categories?

In [89]:
# Calculate average height for each gender
avg_height_by_gender = Bumble_Dataset.groupby('gender')['height'].mean().reset_index()
avg_height_by_gender.columns = ['Gender', 'Average Height']
print(avg_height_by_gender)
  Gender  Average Height
0      f       65.103869
1      m       70.443468
In [90]:
# Visualize height by gender
plt.figure(figsize=(8, 6))
plt.bar(avg_height_by_gender['Gender'], avg_height_by_gender['Average Height'],
        color=['skyblue', 'pink'], edgecolor='black')
plt.xlabel('Gender')
plt.ylabel('Average Height (inches)')
plt.title('Average Height by Gender')
plt.grid(axis='y', alpha=0.3)
plt.show()
[Figure: Average Height by Gender]

Que 2:¶

How does height vary by age_group? Are there noticeable trends among younger vs. older users?

In [91]:
# Calculate average height for each age group
avg_height_by_age_group = Bumble_Dataset.groupby('age_group')['height'].mean().reset_index()
avg_height_by_age_group.columns = ['Age Group', 'Average Height']
print(avg_height_by_age_group)
  Age Group  Average Height
0     18-25       68.200913
1     26-35       68.406764
2     36-45       68.325095
3       46+       67.941167

Que 3:¶

What is the distribution of height within body_type categories (e.g., athletic, curvy, thin)? Do the distributions align with expectations?

In [92]:
# Group by body_type and calculate height
height_by_body_type = Bumble_Dataset.groupby('body_type')['height'].mean().sort_values(ascending=False).reset_index()
height_by_body_type.columns = ['Body Type', 'Average Height']
print(height_by_body_type.head(10))
        Body Type  Average Height
0        athletic       69.707336
1          jacked       69.292162
2         used up       69.180282
3      overweight       68.948198
4  a little extra       68.820084
5             fit       68.546062
6          skinny       68.544176
7         average       68.100805
8            thin       67.866058
9  rather not say       67.272727
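
The question asks about the distribution of height, and a mean alone does not show spread. As a sketch of how the spread could be inspected, `describe()` on the grouped column returns the count, standard deviation, and selected percentiles per body type. The rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical users (not the real data)
users = pd.DataFrame({
    "body_type": ["athletic"] * 4 + ["thin"] * 4,
    "height":    [68.0, 70.0, 71.0, 73.0, 62.0, 66.0, 68.0, 72.0],
})

# Count, mean, std, min, 10%/50%/90% percentiles and max per body type
height_spread = users.groupby("body_type")["height"].describe(percentiles=[0.1, 0.9])
print(height_spread.round(2))
```

Because body type and gender are likely related, it may also help to repeat the same summary separately for each gender before drawing conclusions.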

6. Income Analysis¶

Income is often an important factor for users on dating platforms. Understanding its distribution and relationship with other variables helps refine features like user search filters or personalized recommendations.

Que 1:¶

What is the distribution of income across the platform? Are there specific income brackets that dominate? How would you handle cases where income is blank or 0?

In [93]:
# Keep only users who reported a positive income (0 or negative values are treated as unreported)
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Count users per reported income value (value_counts already sorts in descending order)
income_distribution = filtered_income['income'].value_counts().head(12).reset_index()
income_distribution.columns = ['Income', 'User Count']
print(income_distribution)
     Income  User Count
0     20000        2952
1    100000        1621
2     80000        1111
3     30000        1048
4     40000        1005
5     50000         975
6     60000         736
7     70000         707
8    150000         631
9   1000000         521
10   250000         149
11   500000          48
In [94]:
# Visualize income distribution
plt.figure(figsize=(10, 6))
plt.hist(filtered_income['income'], bins=30, color='gold', edgecolor='black')
plt.xlabel('Income')
plt.ylabel('Number of Users')
plt.title('Income Distribution on Bumble')
plt.grid(axis='y', alpha=0.3)
plt.show()
[Figure: histogram of reported income]
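
The question also asks how to handle blank or zero incomes. One option, sketched here under the assumption that non-positive values mean "not reported" (the first row of the dataset shows an income of -1), is to convert them to missing values and report how much of the column they make up. The values in `income` are hypothetical.

```python
import pandas as pd

# Hypothetical income values; -1 and 0 are assumed to mean "not reported"
income = pd.Series([-1, 0, 20000, 20000, 100000, -1, 50000, 0])

# Replace non-positive values with NaN so they drop out of later statistics
income_clean = income.where(income > 0)

# Share of users with no usable income, and counts for the rest
share_missing = income_clean.isna().mean()
reported_counts = income_clean.dropna().astype(int).value_counts()

print(f"Share without a reported income: {share_missing:.0%}")
print(reported_counts)
```

Stating the missing share alongside the distribution makes it clear how many users the income findings actually describe.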

Que 2:¶

How does income vary by age_group and gender? Are older users more likely to report higher incomes?

In [95]:
# Filter users with income > 0
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Group by age group and gender
income_by_age_gender = (
    filtered_income.groupby(['age_group', 'gender'])['income']
    .mean()
    .reset_index()
    .sort_values(by=['age_group', 'income'], ascending=[True, False])
    .round(2)
)
income_by_age_gender.columns = ['Age Group', 'Gender', 'Average Income']
print(income_by_age_gender)
  Age Group Gender  Average Income
1     18-25      m       106618.77
0     18-25      f        86066.35
3     26-35      m       114944.80
2     26-35      f        90398.13
5     36-45      m       112680.61
4     36-45      f        87302.98
7       46+      m       100156.63
6       46+      f        75299.76
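
Averages near or above $100,000 sit oddly beside the income counts, where $20,000 is the most common value; the 521 users reporting $1,000,000 likely pull the means upward. A hedged sketch of comparing mean and median on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy rows; one extreme value is included on purpose
toy = pd.DataFrame({
    "age_group": ["18-25"] * 6 + ["26-35"] * 6,
    "gender":    (["m"] * 3 + ["f"] * 3) * 2,
    "income":    [20000, 30000, 1000000, 20000, 40000, 60000,
                  50000, 60000, 70000, 40000, 50000, 60000],
})

# Mean, median, and group size side by side
income_summary = (
    toy.groupby(["age_group", "gender"])["income"]
    .agg(["mean", "median", "count"])
    .reset_index()
)
print(income_summary)
```

In the toy data, a single $1,000,000 entry lifts the 18-25 male mean far above its median, which is the kind of gap to look for in the real groups.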

Part 4: Data Visualization¶

1. Age Distribution¶

Understanding the distribution of user ages can reveal whether the platform caters to specific demographics or age groups. This insight is essential for targeted marketing and user experience design.

Que 1:¶

Plot a histogram of age with a vertical line indicating the mean age. What does the distribution reveal about the most common age group on the platform?

In [96]:
# Calculate mean age
mean_age = Bumble_Dataset['age'].mean()
# Plot histogram
plt.figure(figsize=(10, 8))
sns.histplot(Bumble_Dataset['age'], bins=30, color='skyblue')
# Add vertical line for mean age
plt.axvline(mean_age, color='red', linestyle='dashed', linewidth=2, label=f'Mean Age: {mean_age:.2f}')
# Labels and title
plt.xlabel('Age')
plt.ylabel('User Count')
plt.title('Age Distribution of Users')
plt.legend()
plt.show()
[Figure: histogram of user age with a dashed line at the mean age]
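
To answer the "most common age group" part numerically rather than by reading the histogram, ages can be binned and counted. The sketch below assumes the bin edges behind the `age_group` labels used in this notebook; both the edges and the sample ages are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for Bumble_Dataset['age']
ages = pd.Series([19, 22, 27, 29, 31, 34, 38, 47, 52])

# Assumed edges: (17, 25], (25, 35], (35, 45], (45, inf)
bins = [17, 25, 35, 45, np.inf]
labels = ["18-25", "26-35", "36-45", "46+"]

age_group = pd.cut(ages, bins=bins, labels=labels)
group_counts = age_group.value_counts().sort_index()
most_common = group_counts.idxmax()

print(group_counts)
print("Most common age group:", most_common)
```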

Que 2:¶

How does the age distribution differ by gender? Are there age groups where one gender is more prevalent?

In [97]:
# Plot histograms for each gender
plt.figure(figsize=(10, 8))
plt.hist(Bumble_Dataset[Bumble_Dataset['gender'] == 'm']['age'], bins=30, alpha=0.6, label='Male', color='blue')
plt.hist(Bumble_Dataset[Bumble_Dataset['gender'] == 'f']['age'], bins=30, alpha=0.6, label='Female', color='pink')
# Add labels and legend
plt.xlabel("Age")
plt.ylabel("Number of Users")
plt.title("Age Distribution by Gender")
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
[Figure: overlaid histograms of age for male and female users]
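
Each `plt.hist` call with `bins=30` computes its own bin edges from the data it receives, so the male and female bars may not line up exactly. A minimal sketch of counting both groups against one shared set of edges, using hypothetical toy ages:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for Bumble_Dataset
toy = pd.DataFrame({
    "gender": ["m", "m", "m", "f", "f", "f"],
    "age":    [22, 27, 41, 24, 33, 58],
})

# One set of edges for both groups: 20, 30, 40, 50, 60
bin_edges = np.arange(20, 61, 10)

male_counts, _ = np.histogram(toy.loc[toy["gender"] == "m", "age"], bins=bin_edges)
female_counts, _ = np.histogram(toy.loc[toy["gender"] == "f", "age"], bins=bin_edges)

print("Male counts per bin:  ", male_counts)
print("Female counts per bin:", female_counts)
```

The same `bin_edges` array can be passed as `bins=bin_edges` to both `plt.hist` calls so the overlaid bars share boundaries.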

2. Income and Age¶

Visualizing the relationship between income and age helps uncover patterns in reported income levels across age groups, which could inform user segmentation strategies.

Que 1:¶

Use a scatterplot to visualize the relationship between income and age, with a trend line indicating overall patterns. Are older users more likely to report higher incomes?

In [98]:
# Filter out zero or missing income values
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Set figure size
plt.figure(figsize=(10, 8))
# Create scatter plot
sns.scatterplot(x=filtered_income['age'], y=filtered_income['income'], alpha=0.5)
# Add trend line using regression plot
sns.regplot(x=filtered_income['age'], y=filtered_income['income'], scatter=False, color='red')
# Labels and title
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Scatter Plot of Income vs. Age with Trend Line")
plt.show()
[Figure: scatter plot of income against age with a linear trend line]
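
`sns.regplot` fits a linear regression by default, but the plot alone does not say how steep or how strong the trend is. As a rough sketch, the slope and the Pearson correlation can be computed directly; the arrays below are hypothetical and perfectly linear purely for illustration.

```python
import numpy as np

# Hypothetical, perfectly linear toy data (real incomes will be far noisier)
age = np.array([20, 30, 40, 50, 60], dtype=float)
income = np.array([30000, 50000, 70000, 90000, 110000], dtype=float)

# Least-squares line: income ≈ slope * age + intercept
slope, intercept = np.polyfit(age, income, deg=1)

# Pearson correlation between age and income
r = np.corrcoef(age, income)[0, 1]

print(f"Slope: {slope:.1f} per year of age, correlation: {r:.2f}")
```

A slope close to zero or a small correlation would suggest that older users do not report meaningfully higher incomes, whatever the line looks like on the chart.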

Que 2:¶

Create boxplots of income grouped by age_group. Which age group reports the highest median income?

In [99]:
# Filter income > 0
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Create boxplot
plt.figure(figsize=(10, 8))
sns.boxplot(x='age_group', y='income', data=filtered_income, hue='age_group', legend=False)
# Labels
plt.xlabel("Age Group")
plt.ylabel("Income")
plt.title("Income Distribution by Age Group")
plt.show()
[Figure: boxplots of income by age group]
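
The boxes are hard to read when a few very large incomes stretch the axis, so the quartiles that each box encodes can also be tabulated. This is a sketch on hypothetical toy values, not output from the real dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for the filtered income data
toy = pd.DataFrame({
    "age_group": ["18-25", "18-25", "18-25", "26-35", "26-35", "26-35"],
    "income":    [20000, 30000, 1000000, 40000, 60000, 80000],
})

# 25th, 50th, and 75th percentiles per age group (the box edges and median line)
quartiles = (
    toy.groupby("age_group")["income"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Age group with the highest median income
top_group = quartiles[0.5].idxmax()

print(quartiles)
print("Highest median income:", top_group)
```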

Que 3:¶

Analyze income levels within gender and status categories. For example, are single men more likely to report higher incomes than single women?

In [100]:
# Filter income > 0
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Boxplot for income by gender and status
plt.figure(figsize=(12, 8))
sns.boxplot(x='status', y='income', hue='gender', data=filtered_income, palette='Set2')
# Labels
plt.xlabel("Status")
plt.ylabel("Income")
plt.title("Income Distribution by Gender and Status")
plt.legend(title="Gender")
plt.show()
[Figure: boxplots of income by status, split by gender]

3. Pets and Preferences¶

Pets are often a key lifestyle preference and compatibility factor. Analyzing how pet preferences distribute across demographics can provide insights for filters or recommendations.

Que 1:¶

Create a bar chart showing the distribution of pets categories (e.g., likes dogs, likes cats). Which preferences are most common?

In [101]:
# Count pet preferences
pet_counts = Bumble_Dataset['pets'].value_counts().reset_index()
pet_counts.columns = ['Pet Preference', 'Count']
# Plot top 10
plt.figure(figsize=(10, 8))
top_pets = pet_counts.head(10)
sns.barplot(x='Pet Preference', y='Count', data=top_pets, hue='Pet Preference', legend=False)
plt.xlabel("Pet Preferences")
plt.ylabel("Number of Users")
plt.title("Distribution of Pet Preferences on Bumble")
plt.xticks(rotation=90)
plt.show()
[Figure: bar chart of the 10 most common pet preferences]
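
Raw counts answer "which preference is most common", but percentages are easier to compare and can include users who left the field blank. A minimal sketch with hypothetical values (whether the real `pets` column contains blanks is an assumption here):

```python
import pandas as pd

# Hypothetical pet preferences; None marks a profile with no answer
pets = pd.Series([
    "likes dogs and likes cats", "likes dogs", None,
    "likes dogs and likes cats", "has cats", None,
    "likes dogs", "likes dogs and likes cats",
])

# Percentage of all users per category, keeping missing answers as their own row
pet_share = pets.value_counts(normalize=True, dropna=False).mul(100).round(1)
print(pet_share)
```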

Que 2:¶

How do pet preferences vary across gender and age_group? Are younger users more likely to report liking pets compared to older users?

In [102]:
# Group by age_group and pets
pet_preferences = Bumble_Dataset.groupby(['age_group', 'pets']).size().reset_index(name='Count')
# Plot
plt.figure(figsize=(10, 8))
top_pets_by_age = pet_preferences.sort_values('Count', ascending=False).head(15)
sns.barplot(x='age_group', y='Count', hue='pets', data=top_pets_by_age, palette='Set3')
plt.xlabel("Age Group")
plt.ylabel("Number of Users")
plt.title("Pet Preferences Across Age Groups")
plt.legend(title="Pet Preference", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
[Figure: grouped bar chart of pet preferences by age group]
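
Because age groups differ in size, raw counts mostly show which group is largest. Comparing the share of each preference within an age group is one way around that; the sketch below uses `pd.crosstab` on hypothetical toy rows.

```python
import pandas as pd

# Hypothetical toy rows standing in for Bumble_Dataset
toy = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "26-35", "26-35", "26-35"],
    "pets":      ["likes dogs", "likes cats", "likes dogs",
                  "likes dogs", "likes dogs", "likes cats"],
})

# Each row sums to 1: the share of every pet preference within that age group
pet_share_by_age = pd.crosstab(toy["age_group"], toy["pets"], normalize="index")
print(pet_share_by_age)
```

To bring gender into the comparison as the question asks, a list of two columns (gender and age group) can be passed as the first argument to `pd.crosstab`.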

4. Signs and Personality¶

Users' self-reported zodiac signs (sign) can offer insights into personality preferences or trends. While not scientifically grounded, analyzing this data helps explore fun and engaging patterns.

Que 1:¶

Create a bar chart showing the distribution of zodiac signs (sign) across the platform. Which signs are most and least represented?

In [103]:
# Count zodiac signs
zodiac_counts = Bumble_Dataset['sign'].value_counts()
# Create bar chart
plt.figure(figsize=(20, 8))
sns.barplot(x=zodiac_counts.index, y=zodiac_counts.values, hue=zodiac_counts.index, legend=False)
plt.xlabel("Zodiac Sign")
plt.ylabel("Number of Users")
plt.title("Distribution of Zodiac Signs on the Platform")
plt.xticks(rotation=90)
plt.show()
[Figure: bar chart of user counts per zodiac sign]
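
The most and least represented signs can also be read straight from the counts instead of the chart. A short sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical signs standing in for Bumble_Dataset['sign']
signs = pd.Series(["leo", "gemini", "leo", "virgo", "gemini", "leo"])

sign_counts = signs.value_counts()
most_common = sign_counts.idxmax()
least_common = sign_counts.idxmin()

print(f"Most represented: {most_common} ({sign_counts.max()} users)")
print(f"Least represented: {least_common} ({sign_counts.min()} users)")
```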

Que 2:¶

How does sign vary across gender and status? Are there noticeable patterns or imbalances?

In [104]:
# Count users for each combination of gender, status, and sign
gender_status = Bumble_Dataset.groupby(["gender", "status", "sign"]).size().reset_index(name="count")
stacked_data = gender_status.pivot_table(index="sign", columns=["gender", "status"], values="count", fill_value=0)
# Create stacked bar chart of user counts (one segment per gender/status pair)
stacked_data.plot(kind="bar", stacked=True, figsize=(20, 8), colormap="coolwarm")
plt.xlabel("Zodiac Sign")
plt.ylabel("Number of Users")
plt.title("Gender & Status Counts Across Zodiac Signs")
plt.xticks(rotation=90)
plt.legend(title="Gender, Status", bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.show()
[Figure: stacked bar chart of user counts per zodiac sign, segmented by gender and status]
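
The stacked chart shows counts, so imbalances between signs are easier to judge after converting each row to proportions. A minimal sketch, assuming a table shaped like `stacked_data` (signs as rows, gender/status pairs as columns) with hypothetical numbers:

```python
import pandas as pd

# Hypothetical counts in the same layout as stacked_data
columns = pd.MultiIndex.from_tuples(
    [("f", "single"), ("m", "single")], names=["gender", "status"]
)
stacked = pd.DataFrame(
    [[2, 6], [1, 3]],
    index=pd.Index(["gemini", "leo"], name="sign"),
    columns=columns,
)

# Divide each row by its total so every sign sums to 1
proportions = stacked.div(stacked.sum(axis=1), axis=0)
print(proportions)
```

Plotting `proportions` with `kind="bar", stacked=True` gives bars of equal height, which makes differences in gender and status mix between signs directly visible.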

Summary of Analysis¶

The analysis of Bumble user data revealed several key insights:¶

Data Completeness:¶

Age and income contain no null values, although income includes non-positive placeholder entries that the income analysis filters out; diet and religion have significant missing values.

User Demographics:¶

The largest user group falls within the 26-35 age range, with more male users than female.

Profile Completeness:¶

Most users have well-completed profiles, with completeness rates varying across demographics.

Dietary Habits & Drinking:¶

Users who identify as "mostly anything" for diet are the most common, with varied drinking habits.

Geographic Trends:¶

San Francisco and other California cities dominate the user base, making California the state with the highest concentration of users.

Height Differences:¶

On average, male users are taller (about 70.4 inches) than female users (about 65.1 inches).

Income Distribution:¶

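Among users who report an income, most fall between $20,000 and $100,000, with $20,000 the single most common value. The 26-35 age group reports the highest average income for both men and women, and men report higher average incomes than women in every age group.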

User Preferences:¶

Users who like pets, especially dogs and cats, are common. Zodiac signs show a relatively even distribution across the platform.

Recommendations¶

To enhance user engagement and experience, Bumble should consider:¶

Targeting the 26-35 Age Group & Male Users:¶

This group is the largest in the dataset and represents a significant portion of the user base.

Improving Data Completeness:¶

Address missing values in diet and religion to enhance recommendation accuracy and matching.

Enhancing Personalization Through Diet & Drinking Habits:¶

Introduce features tailored to dietary preferences and drinking habits to improve user compatibility.

Integrating Pet Preferences into Matchmaking:¶

Many users like pets, especially dogs and cats. Incorporating this into the matchmaking algorithm could improve compatibility.

Leveraging Geographic Data:¶

Focus marketing efforts on high-concentration areas like California while exploring growth opportunities in underrepresented regions.

Income-Based Features:¶

Consider income-based filtering or recommendations to help users find compatible matches.