Analyzing Bumble Profiles Using Python¶
This project focuses on analyzing user data from Bumble, a well-known dating app, to uncover valuable insights about its users.
By exploring the information users share in their profiles, such as demographics, lifestyle habits, and preferences, we aim to better understand their behavior and trends.
These insights will help Bumble's product and marketing teams make informed decisions to improve user engagement, fine-tune the matchmaking algorithms, and offer personalized features.
Ultimately, the goal is to create a more satisfying experience for Bumble users and support the platform's growth.
Installing Pandas¶
%pip install pandas
Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.13/site-packages (2.3.2) Requirement already satisfied: numpy>=1.26.0 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2.2.5) Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2024.1) Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.13/site-packages (from pandas) (2025.2) Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.13/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0) Note: you may need to restart the kernel to use updated packages.
Importing pandas, numpy, matplotlib, seaborn.¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Importing the Dataset as CSV File¶
# Load the dataset
# Both files are in the same folder, so we just use the filename
Bumble_Dataset = pd.read_csv('bumble.csv')
Bumble_Dataset
| | age | status | gender | body_type | diet | drinks | education | ethnicity | height | income | job | last_online | location | pets | religion | sign | speaks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | single | m | a little extra | strictly anything | socially | working on college/university | asian, white | 75.0 | -1 | transportation | 2012-06-28-20-30 | south san francisco, california | likes dogs and likes cats | agnosticism and very serious about it | gemini | english |
| 1 | 35 | single | m | average | mostly other | often | working on space camp | white | 70.0 | 80000 | hospitality / travel | 2012-06-29-21-41 | oakland, california | likes dogs and likes cats | agnosticism but not too serious about it | cancer | english (fluently), spanish (poorly), french (... |
| 2 | 38 | available | m | thin | anything | socially | graduated from masters program | NaN | 68.0 | -1 | NaN | 2012-06-27-09-10 | san francisco, california | has cats | NaN | pisces but it doesn’t matter | english, french, c++ |
| 3 | 23 | single | m | thin | vegetarian | socially | working on college/university | white | 71.0 | 20000 | student | 2012-06-28-14-22 | berkeley, california | likes cats | NaN | pisces | english, german (poorly) |
| 4 | 29 | single | m | athletic | NaN | socially | graduated from college/university | asian, black, other | 66.0 | -1 | artistic / musical / writer | 2012-06-27-21-26 | san francisco, california | likes dogs and likes cats | NaN | aquarius | english |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59941 | 59 | single | f | NaN | NaN | socially | graduated from college/university | NaN | 62.0 | -1 | sales / marketing / biz dev | 2012-06-12-21-47 | oakland, california | has dogs | catholicism but not too serious about it | cancer and it’s fun to think about | english |
| 59942 | 24 | single | m | fit | mostly anything | often | working on college/university | white, other | 72.0 | -1 | entertainment / media | 2012-06-29-11-01 | san francisco, california | likes dogs and likes cats | agnosticism | leo but it doesn’t matter | english (fluently) |
| 59943 | 42 | single | m | average | mostly anything | not at all | graduated from masters program | asian | 71.0 | 100000 | construction / craftsmanship | 2012-06-27-23-37 | south san francisco, california | NaN | christianity but not too serious about it | sagittarius but it doesn’t matter | english (fluently) |
| 59944 | 27 | single | m | athletic | mostly anything | socially | working on college/university | asian, black | 73.0 | -1 | medicine / health | 2012-06-23-13-01 | san francisco, california | likes dogs and likes cats | agnosticism but not too serious about it | leo and it’s fun to think about | english (fluently), spanish (poorly), chinese ... |
| 59945 | 39 | single | m | average | NaN | socially | graduated from masters program | white | 68.0 | -1 | medicine / health | 2012-06-29-00-42 | san francisco, california | likes dogs and likes cats | catholicism and laughing about it | gemini and it’s fun to think about | english |
59946 rows × 17 columns
Part 1: Data Cleaning¶
1. Inspecting Missing Data¶
Missing data is a common issue in real-world datasets. On a platform like Bumble, missing user information might reflect gaps in the user profile setup process, incomplete data collection, or users intentionally leaving certain fields blank. As a data analyst, your role is to assess the extent of missing data, understand its potential impact, and decide the most appropriate methods to address it.
Que 1:¶
Which columns in the dataset have missing values, and what percentage of data is missing in each column?
missing_values = Bumble_Dataset.isnull().sum()
missing_values
age                0
status             0
gender             0
body_type       5296
diet           24395
drinks          2985
education       6628
ethnicity       5680
height             3
income             0
job             8198
last_online        0
location           0
pets           19921
religion       20226
sign           11056
speaks            50
dtype: int64
Here we used isnull().sum() to find out the total missing values in each column.
Que 2:¶
Are there columns where more than 50% of the data is missing? Would you drop columns where more than 50% of the values are missing?
Ans - No column exceeds 50% missing data (the highest is diet at about 40.7%), so no columns need to be dropped.
percentage_of_missing_values = ((missing_values) / len(Bumble_Dataset)) * 100
percentage_of_missing_values
age             0.000000
status          0.000000
gender          0.000000
body_type       8.834618
diet           40.694959
drinks          4.979482
education      11.056618
ethnicity       9.475194
height          0.005005
income          0.000000
job            13.675641
last_online     0.000000
location        0.000000
pets           33.231575
religion       33.740366
sign           18.443266
speaks          0.083408
dtype: float64
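The 50% check from Que 2 can also be done in code rather than by reading the table. The sketch below is a minimal illustration on a made-up toy frame (`toy` and its values are assumptions, not the Bumble data); it relies on `isnull().mean()`, which returns the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Bumble_Dataset (values are made up for illustration)
toy = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["anything", np.nan, np.nan, np.nan],
    "height": [75.0, 70.0, np.nan, 71.0],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = toy.isnull().mean() * 100

# Columns above the 50% threshold would be candidates for dropping
mostly_missing = missing_pct[missing_pct > 50].index.tolist()

print(missing_pct.round(2))
print(mostly_missing)
```

On the real dataset the list would be empty, since no column crosses the threshold.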
Que 3:¶
How would you handle the missing numerical data (e.g., height, income)? Would you impute the missing data by the median or average value of height and income for the corresponding category, such as gender, age group, or location.
# Calculate median height by gender
calculate_median = Bumble_Dataset.groupby(["gender"])["height"].transform("median")
# Fill missing height values with median
Bumble_Dataset["height"] = Bumble_Dataset["height"].fillna(calculate_median)
print(calculate_median)
0 70.0
1 70.0
2 70.0
3 70.0
4 70.0
...
59941 65.0
59942 70.0
59943 70.0
59944 70.0
59945 70.0
Name: height, Length: 59946, dtype: float64
Used groupby() with transform("median") to compute the median height within each gender, then fillna() to replace the 3 missing height values with the median for that user's gender.
Income has no NaN values; its missing entries are encoded as -1 and are handled in the Outliers section.
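To make the group-wise imputation concrete, here is a small sketch on made-up data (the `toy` frame and the `height_filled` column are illustrative assumptions). It shows that each missing height receives the median of its own gender group rather than a single global median.

```python
import numpy as np
import pandas as pd

# Toy data: one missing height per gender (values are made up)
toy = pd.DataFrame({
    "gender": ["m", "m", "m", "f", "f", "f"],
    "height": [70.0, 72.0, np.nan, 64.0, np.nan, 66.0],
})

# transform("median") returns a Series aligned with the original rows,
# holding the median of each row's gender group
group_median = toy.groupby("gender")["height"].transform("median")

# Only the NaN entries are replaced; observed heights are left untouched
toy["height_filled"] = toy["height"].fillna(group_median)

print(toy)
```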
2. Data Types¶
Que 1:¶
Are there any inconsistencies in the data types across columns (e.g., numerical data stored as strings)?
Bumble_Dataset.dtypes
age              int64
status          object
gender          object
body_type       object
diet            object
drinks          object
education       object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
pets            object
religion        object
sign            object
speaks          object
dtype: object
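Used dtypes to check the data type of each column.
No numerical data is stored as strings: age and income are int64 and height is float64. The one mismatch is last_online, which holds timestamps but is stored as object.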
Que 2:¶
Which columns require conversion to numerical data types for proper analysis (e.g., income)?
Ans - There are no columns that require conversion to numerical data type.
Que 3:¶
Does the last_online column need to be converted into a datetime format? What additional insights can be gained by analyzing this as a date field?
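# last_online is stored as object (strings such as "2012-06-28-20-30"), so it needs conversion
# errors="coerce" converts unparseable values to NaT (Not a Time)
# Caution: no explicit format is passed here, so pandas has to guess the layout.
# As the output below shows, many rows become NaT and the trailing "-MM" is read as a UTC offset.
# Passing format="%Y-%m-%d-%H-%M" to pd.to_datetime() parses these strings as intended.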
import warnings
warnings.filterwarnings('ignore') # Suppress warnings
Bumble_Dataset["last_online"] = pd.to_datetime(Bumble_Dataset["last_online"], errors='coerce')
Bumble_Dataset["last_online"]
0 NaT
1 NaT
2 2012-06-27 09:00:00-10:00
3 2012-06-28 14:00:00-22:00
4 NaT
...
59941 NaT
59942 2012-06-29 11:00:00-01:00
59943 NaT
59944 2012-06-23 13:00:00-01:00
59945 NaT
Name: last_online, Length: 59946, dtype: object
Bumble_Dataset.dtypes
age              int64
status          object
gender          object
body_type       object
diet            object
drinks          object
education       object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
pets            object
religion        object
sign            object
speaks          object
dtype: object
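In the output above, most rows are NaT, the remaining ones carry a spurious UTC offset, and the column is still object. That suggests the default parser did not recognize the YYYY-MM-DD-HH-MM layout. The sketch below uses two strings copied from the table shown earlier plus one deliberately invalid value, and passes an explicit `format`. The derived fields (`hour`, `days_since`) are illustrative examples of what a date field makes possible, not part of the original analysis.

```python
import pandas as pd

# Two values copied from the last_online column, plus one invalid entry
raw = pd.Series(["2012-06-28-20-30", "2012-06-29-21-41", "not a date"])

# An explicit format removes the guesswork; invalid entries still become NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d-%H-%M", errors="coerce")

# Examples of insights a real datetime column allows
hour = parsed.dt.hour                          # time of day of last activity
days_since = (parsed.max() - parsed).dt.days   # whole days before the latest login

print(parsed)
print(hour)
print(days_since)
```

With a true datetime column, questions such as "what time of day are users most active?" or "how many users have been inactive for more than a week?" become simple aggregations.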
3. Outliers¶
Que 1:¶
Are there any apparent outliers in numerical columns such as age, height, or income? What are the ranges of values in these columns?
Bumble_Dataset.describe()
| | age | height | income |
|---|---|---|---|
| count | 59946.000000 | 59946.000000 | 59946.000000 |
| mean | 32.340290 | 68.295282 | 20033.222534 |
| std | 9.452779 | 3.994738 | 97346.192104 |
| min | 18.000000 | 1.000000 | -1.000000 |
| 25% | 26.000000 | 66.000000 | -1.000000 |
| 50% | 30.000000 | 68.000000 | -1.000000 |
| 75% | 37.000000 | 71.000000 | -1.000000 |
| max | 110.000000 | 95.000000 | 1000000.000000 |
# Yes: age reaches 110, height drops to 1 inch, and income runs from the -1 placeholder up to 1,000,000
age_range = (Bumble_Dataset['age'].min(), Bumble_Dataset['age'].max())
height_range = (Bumble_Dataset['height'].min(), Bumble_Dataset['height'].max())
income_range = (Bumble_Dataset['income'].min(), Bumble_Dataset['income'].max())
print(f'Age range is: {age_range}')
print(f'Height range is: {height_range}')
print(f'Income range is: {income_range}')
Age range is: (18, 110) Height range is: (1.0, 95.0) Income range is: (-1, 1000000)
Que 2:¶
Any -1 values in numerical columns like income should be replaced with blank, as they may represent missing or invalid data.
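# Replace the -1 placeholder with 0; later steps filter on income > 0 to exclude users who did not report income
Bumble_Dataset['income'] = Bumble_Dataset['income'].replace(-1, 0)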
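The question asks for a blank value, while the cell above uses 0. The difference matters for averages: pandas skips NaN in `mean()`, whereas a 0 is counted as a real income. The following sketch on a made-up series (the `income` values are assumptions) contrasts the two choices.

```python
import numpy as np
import pandas as pd

# Made-up incomes where -1 marks "not reported"
income = pd.Series([-1, 80000, -1, 20000, -1])

income_nan = income.replace(-1, np.nan)   # blank: ignored by mean()
income_zero = income.replace(-1, 0)       # zero: pulls the mean down

mean_nan = income_nan.mean()
mean_zero = income_zero.mean()

print(mean_nan, mean_zero)
```

Using NaN would make the later `income > 0` filters unnecessary, at the cost of turning the column into float64.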
Que 3:¶
For other outliers, how would you ensure that they do not disproportionately impact the analysis while retaining as much meaningful data as possible. Would you delete the data or rather than deleting them, calculate the mean and median values using only the middle 80% of the data (removing extreme high and low values). Provide appropriate reasons for every step.
column_names = ["income", "age", "height"]
for column in column_names:
    # The 10th and 90th percentiles bound the middle 80% of the values
    p10 = Bumble_Dataset[column].quantile(0.10)
    p90 = Bumble_Dataset[column].quantile(0.90)
    # Strict inequalities: rows equal to either percentile are excluded
    filtered_values = Bumble_Dataset[(Bumble_Dataset[column] > p10) & (Bumble_Dataset[column] < p90)]
    mean_values = filtered_values[column].mean()
    median_values = filtered_values[column].median()
    print(f"\n{column}: filtered_median: {median_values}, filtered_mean: {mean_values}")
income: filtered_median: 20000.0, filtered_mean: 26109.89010989011
age: filtered_median: 30.0, filtered_mean: 31.357685563997663
height: filtered_median: 68.0, filtered_mean: 68.25442646465552
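Calculated mean() and median() for the middle 80% of the data using the 10th and 90th percentiles. No rows are deleted from the dataset; the extreme values are only left out when computing these summary statistics, so every profile stays available for later analysis. For income, the 10th percentile is 0, so the filter also leaves out users who did not report an income.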
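As a variation on the approach above, the sketch below wraps the trimming in a helper and uses `Series.between()`, which keeps values equal to the percentile bounds (the cell above excludes them). The function name `middle_80_stats` and the sample ages are assumptions made for illustration.

```python
import pandas as pd


def middle_80_stats(values: pd.Series) -> tuple:
    """Return (mean, median) of the values between the 10th and 90th percentiles."""
    low = values.quantile(0.10)
    high = values.quantile(0.90)
    # between() is inclusive on both ends by default
    trimmed = values[values.between(low, high)]
    return trimmed.mean(), trimmed.median()


# Made-up ages with one implausible value (110)
ages = pd.Series([18, 25, 26, 28, 30, 31, 33, 37, 45, 110])
mean_80, median_80 = middle_80_stats(ages)

print(ages.mean(), mean_80, median_80)
```

The trimmed mean sits well below the raw mean because the single extreme age no longer contributes.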
4. Missing Data Visualization¶
Visualizing missing data helps identify patterns of incompleteness in the dataset, which can guide data cleaning strategies. Understanding which columns have high levels of missing data ensures decisions about imputation or removal are well-informed.
Create a heatmap to visualize missing values across the dataset. Which columns show consistent missing data patterns?
missing_values = Bumble_Dataset.isnull().sum()
print(missing_values)
plt.figure(figsize=(12, 8))
sns.heatmap(Bumble_Dataset.isnull(), cbar=False, cmap="inferno")
plt.title('Heatmap of Missing Data')
plt.show()
age 0 status 0 gender 0 body_type 5296 diet 24395 drinks 2985 education 6628 ethnicity 5680 height 0 income 0 job 8198 last_online 35895 location 0 pets 19921 religion 20226 sign 11056 speaks 50 dtype: int64
Part 2: Data Processing¶
1. Binning and Grouping¶
Grouping continuous variables, such as age or income, into bins helps simplify analysis and identify trends among specific groups. For instance, grouping users into age ranges can reveal distinct patterns in behavior or preferences across demographics.
Que 1:¶
How would you bin the age column into categories (e.g. "18-25", "26-35", "36-45", and "46+") to create a new column, age_group. How does the distribution of users vary across these age ranges?
def get_age_grp(age):
    if age <= 25:
        return '18-25'
    elif age <= 35:
        return "26-35"
    elif age <= 45:
        return "36-45"
    else:
        return "46+"
# Apply the function to create a new column
Bumble_Dataset["age_group"] = Bumble_Dataset["age"].apply(get_age_grp)
age_group_counts = Bumble_Dataset['age_group'].value_counts().sort_index()
# Convert Series to DataFrame
age_group_df = age_group_counts.reset_index()
age_group_df.columns = ['Age Group', 'User Count']
# Display results
print(age_group_df)
Age Group User Count 0 18-25 14454 1 26-35 28621 2 36-45 10803 3 46+ 6068
# Visualize age group distribution
plt.figure(figsize=(10, 6))
plt.bar(age_group_df['Age Group'], age_group_df['User Count'], color='skyblue', edgecolor='black')
plt.xlabel('Age Group')
plt.ylabel('Number of Users')
plt.title('Distribution of Users Across Age Groups')
plt.grid(axis='y', alpha=0.3)
plt.show()
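The same binning can be expressed with `pd.cut`, which avoids a row-by-row `apply` and produces an ordered categorical. This is an alternative sketch, not the method used above; the sample ages are made up, and the bin edges are chosen so that the intervals match the labels from the question.

```python
import pandas as pd

# Made-up ages sitting on each bin boundary
ages = pd.Series([18, 25, 26, 35, 36, 45, 46, 110])

# Intervals are right-inclusive by default: (17, 25], (25, 35], (35, 45], (45, inf)
bins = [17, 25, 35, 45, float("inf")]
labels = ["18-25", "26-35", "36-45", "46+"]

age_group = pd.cut(ages, bins=bins, labels=labels)

# sort_index() follows the category order, so the groups stay in age order
counts = age_group.value_counts().sort_index()
print(counts)
```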
Que 2:¶
Group income into categories like "Low Income," "Medium Income," and "High Income" based on meaningful thresholds (e.g., quartiles). What insights can be derived from these groups?
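# Calculate the 10th and 90th percentiles of income (these are not quartiles)
p10 = Bumble_Dataset["income"].quantile(0.10)
p90 = Bumble_Dataset["income"].quantile(0.90)
# Define function to categorize income
# Note: the 10th percentile is 0, so users with unreported income (0) fall into "Low Income"
def categorize_income(x):
    if x <= p10:
        return "Low Income"
    elif p10 < x <= p90:
        return "Medium Income"
    else:
        return "High Income"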
# Apply function
Bumble_Dataset["income_category"] = Bumble_Dataset["income"].apply(categorize_income)
# Count occurrences
income_distribution_count = Bumble_Dataset['income_category'].value_counts()
income_distribution_count_df = income_distribution_count.reset_index()
income_distribution_count_df.columns = ['Income Category', 'User Count']
print(income_distribution_count_df)
Income Category User Count 0 Low Income 48442 1 Medium Income 5980 2 High Income 5524
# Visualize income categories
plt.figure(figsize=(10, 6))
plt.bar(income_distribution_count_df['Income Category'],
income_distribution_count_df['User Count'],
color='lightcoral', edgecolor='black')
plt.xlabel('Income Category')
plt.ylabel('Number of Users')
plt.title('Distribution of Users Across Income Categories')
plt.grid(axis='y', alpha=0.3)
plt.show()
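Because most users did not report an income, the "Low Income" bar above is dominated by zeros. One possible alternative, sketched here on made-up values, is to bin only the reported incomes into three equal-sized groups with `pd.qcut` and give non-reporters their own label. The "Not Reported" label and the choice of three quantile groups are assumptions for illustration.

```python
import pandas as pd

# Made-up incomes; 0 marks users who did not report one
income = pd.Series([0, 0, 0, 20000, 30000, 40000, 50000,
                    60000, 80000, 100000, 1000000])

reported = income[income > 0]

# qcut splits the reported incomes into three groups of roughly equal size
category = pd.qcut(reported, q=3,
                   labels=["Low Income", "Medium Income", "High Income"])

# Re-align with the full index and label the non-reporters explicitly
income_category = (
    category.reindex(income.index)
    .cat.add_categories("Not Reported")
    .fillna("Not Reported")
)

print(income_category.value_counts())
```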
2. Derived Features¶
Derived features are new columns created based on the existing data to add depth to the analysis. These features often reveal hidden patterns or provide new dimensions to explore.
Que 1:¶
Create a new feature, profile_completeness, by calculating the percentage of non-missing values for each user profile. How complete are most user profiles, and how does completeness vary across demographics?
# Total number of columns at this point (the 17 original columns plus the derived age_group and income_category)
total_columns = Bumble_Dataset.shape[1]
# Calculate non-missing values per row
Bumble_Dataset['profile_completeness'] = round(Bumble_Dataset.notnull().sum(axis=1) / total_columns * 100, 2)
print(Bumble_Dataset['profile_completeness'])
0 94.74
1 94.74
2 84.21
3 94.74
4 84.21
...
59941 78.95
59942 100.00
59943 89.47
59944 100.00
59945 89.47
Name: profile_completeness, Length: 59946, dtype: float64
Bumble_Dataset['profile_completeness'].value_counts()
profile_completeness 94.74 15924 89.47 15135 84.21 10483 78.95 6080 100.00 5851 73.68 3200 68.42 1589 63.16 919 57.89 534 52.63 182 47.37 49 Name: count, dtype: int64
# Analyze completeness by gender
Gender_anly = Bumble_Dataset.groupby('profile_completeness')['gender'].value_counts()
Gender_anly
profile_completeness gender
47.37 m 33
f 16
52.63 m 125
f 57
57.89 m 358
f 176
63.16 m 568
f 351
68.42 m 972
f 617
73.68 m 1934
f 1266
78.95 m 3657
f 2423
84.21 m 6279
f 4204
89.47 m 9081
f 6054
94.74 m 9344
f 6580
100.00 m 3478
f 2373
Name: count, dtype: int64
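The grouping above lists gender counts at each completeness level. To compare the genders directly, it can be easier to group the other way round and summarize completeness per gender. The sketch below does this on a made-up frame; restricting the calculation to user-entered fields (`profile_fields`) is an assumption made here so that derived columns do not inflate the score.

```python
import numpy as np
import pandas as pd

# Made-up profiles with a few optional fields
toy = pd.DataFrame({
    "gender": ["m", "m", "f", "f"],
    "diet": ["anything", np.nan, "vegan", "vegetarian"],
    "pets": [np.nan, np.nan, "has cats", np.nan],
    "sign": ["leo", "gemini", "cancer", "pisces"],
})

# Only fields the user fills in count towards completeness (assumption)
profile_fields = ["diet", "pets", "sign"]
toy["profile_completeness"] = toy[profile_fields].notnull().mean(axis=1) * 100

# One row per gender, with the average and the median completeness
summary = toy.groupby("gender")["profile_completeness"].agg(["mean", "median"])
print(summary.round(2))
```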
3. Unit Conversion¶
Standardizing units across datasets is essential for consistency, especially when working with numerical data. In the context of the Bumble dataset, users' heights are given in inches, which may not be intuitive for all audiences.
Que 1:¶
Convert the height column from inches to centimeters using the conversion factor (1 inch = 2.54 cm). Store the converted values in a new column, height_cm.
# Creating a new column with height in centimeters
Bumble_Dataset["height_cm"] = Bumble_Dataset["height"] * 2.54
print(Bumble_Dataset[["height", "height_cm"]])
height height_cm 0 75.0 190.50 1 70.0 177.80 2 68.0 172.72 3 71.0 180.34 4 66.0 167.64 ... ... ... 59941 62.0 157.48 59942 72.0 182.88 59943 71.0 180.34 59944 73.0 185.42 59945 68.0 172.72 [59946 rows x 2 columns]
Part 3: Data Analysis¶
1. Demographic Analysis¶
Understanding the demographics of users is essential for tailoring marketing strategies, improving user experience, and designing features that resonate with the platform's audience. Insights into gender distribution, orientation, and relationship status can help Bumble refine its matchmaking algorithms and engagement campaigns.
Que 1:¶
What is the gender distribution (gender) across the platform? Are there any significant imbalances?
# Count the number of users by gender
gender_distribution = Bumble_Dataset['gender'].value_counts()
# Calculate percentage
gender_percentage = round(gender_distribution / gender_distribution.sum() * 100, 2)
# Display results
gender_distribution_df = pd.DataFrame({
"Gender": gender_percentage.index,
"Count": gender_distribution,
"Percentage": gender_percentage
})
print(gender_distribution_df)
Gender Count Percentage gender m m 35829 59.77 f f 24117 40.23
# Visualize gender distribution
plt.figure(figsize=(10, 6))
plt.bar(gender_distribution_df['Gender'], gender_distribution_df['Count'],
color=['skyblue', 'pink'], edgecolor='black')
plt.xlabel('Gender')
plt.ylabel('Number of Users')
plt.title('Gender Distribution on Bumble')
plt.grid(axis='y', alpha=0.3)
plt.show()
Que 2:¶
What are the proportions of users in different status categories (e.g., single, married, seeing someone)? What does this suggest about the platform's target audience?
# Count the number of users by status
status_distribution = Bumble_Dataset['status'].value_counts()
# Calculate percentage
status_percentage = round(status_distribution / len(Bumble_Dataset["status"]) * 100, 2)
relationship_distribution_df = pd.DataFrame({
'Status': status_distribution.index,
'count': status_distribution,
'Percentage': status_percentage
})
relationship_distribution_df
| status | Status | count | Percentage |
|---|---|---|---|
| single | single | 55697 | 92.91 |
| seeing someone | seeing someone | 2064 | 3.44 |
| available | available | 1865 | 3.11 |
| married | married | 310 | 0.52 |
| unknown | unknown | 10 | 0.02 |
Que 3:¶
How does status vary by gender? For example, what proportion of men and women identify as single?
# Count relationship status within each gender
status_by_gender = Bumble_Dataset.groupby('gender')['status'].value_counts(normalize=True) * 100
# Convert to DataFrame
status_by_gender = status_by_gender.unstack()
# Display results
print(status_by_gender)
status available married seeing someone single unknown gender f 2.720073 0.559771 4.158892 92.544678 0.016586 m 3.374362 0.488431 2.961288 93.159173 0.016746
2. Correlation Analysis¶
Correlation analysis helps uncover relationships between variables, guiding feature engineering and hypothesis generation. For example, understanding how age correlates with income or word count in profiles can reveal behavioral trends that inform platform design.
Que 1:¶
What are the correlations between numerical columns such as age, income, gender? Are there any strong positive or negative relationships?
# Selecting numerical columns
numerical_columns = Bumble_Dataset[['age', 'height', 'income']]
# Compute correlation matrix
correlation = numerical_columns.corr()
# Print result
print(correlation)
age height income age 1.000000 -0.022253 -0.001004 height -0.022253 1.000000 0.065048 income -0.001004 0.065048 1.000000
# Visualize correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()
Que 2:¶
How does age correlate with income? Are older users more likely to report higher income levels?
correlation = Bumble_Dataset["age"].corr(Bumble_Dataset["income"])
print(correlation)
-0.0010038681910053968
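The near-zero correlation above is computed over all users, including those whose income was set to 0 because they did not report one. The sketch below uses made-up numbers to show how such placeholder zeros can mask a relationship that exists among users who did report an income; it does not claim that the same holds in the Bumble data.

```python
import pandas as pd

# Made-up data: the first four users report income, the last four do not (0)
toy = pd.DataFrame({
    "age":    [22, 30, 38, 45, 25, 33, 41, 50],
    "income": [20000, 40000, 60000, 80000, 0, 0, 0, 0],
})

# Pearson correlation over everyone, zeros included
corr_all = toy["age"].corr(toy["income"])

# Pearson correlation restricted to users with a reported income
reported = toy[toy["income"] > 0]
corr_reported = reported["age"].corr(reported["income"])

print(round(corr_all, 3), round(corr_reported, 3))
```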
3. Diet and Lifestyle Analysis¶
Lifestyle attributes such as diet and drinks provide insights into user habits and preferences. Analyzing these factors helps identify compatibility trends and inform product features like filters or match recommendations.
Que 1:¶
How do dietary preferences (diet) distribute across the platform? For example, what percentage of users identify as vegetarian, vegan, or follow "anything" diets?
# Count occurrences of each dietary preference
diet_distribution = Bumble_Dataset['diet'].value_counts().reset_index()
diet_distribution.columns = ['Diet Type', 'Count']
# Calculate percentages
diet_distribution['Percentage'] = round((diet_distribution['Count'] / diet_distribution['Count'].sum()) * 100, 2)
# Display top 10
print(diet_distribution.head(10))
Diet Type Count Percentage 0 mostly anything 16585 46.65 1 anything 6183 17.39 2 strictly anything 5113 14.38 3 mostly vegetarian 3444 9.69 4 mostly other 1007 2.83 5 strictly vegetarian 875 2.46 6 vegetarian 667 1.88 7 strictly other 452 1.27 8 mostly vegan 338 0.95 9 other 331 0.93
# Visualize top 10 diet preferences
plt.figure(figsize=(12, 6))
top_diets = diet_distribution.head(10)
plt.barh(range(len(top_diets)), top_diets['Count'], color='lightgreen', edgecolor='black')
plt.yticks(range(len(top_diets)), top_diets['Diet Type'])
plt.xlabel('Number of Users')
plt.title('Top 10 Diet Preferences on Bumble')
plt.grid(axis='x', alpha=0.3)
plt.show()
Que 2:¶
How do drinking habits (drinks) vary across different diet categories? Are users with stricter diets (e.g., vegan) less likely to drink?
# Analyze drinking habits across different diet categories
drink_diet_distribution = Bumble_Dataset.groupby("diet")["drinks"].value_counts(normalize=True)
# Display results
drink_diet_distribution
diet drinks
anything socially 0.756676
often 0.095794
rarely 0.085113
not at all 0.049232
very often 0.009680
...
vegetarian rarely 0.107717
often 0.094855
not at all 0.059486
very often 0.011254
desperately 0.009646
Name: proportion, Length: 103, dtype: float64
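With 103 rows, the output above is hard to compare across diets. One way to condense it, sketched below on made-up rows, is to strip the "strictly" and "mostly" modifiers so that each diet has one base category, then build a diet-by-drinks table with `pd.crosstab`. The `diet_base` column is an illustrative name, and collapsing the modifiers is an assumption about what level of detail is useful.

```python
import pandas as pd

# Made-up rows covering two base diets with different modifiers
toy = pd.DataFrame({
    "diet": ["strictly vegan", "mostly vegan", "vegan",
             "mostly anything", "anything", "strictly anything"],
    "drinks": ["not at all", "socially", "not at all",
               "socially", "often", "socially"],
})

# Remove a leading "strictly " or "mostly " to get the base diet
toy["diet_base"] = toy["diet"].str.replace(r"^(strictly|mostly)\s+", "", regex=True)

# normalize="index" makes each row (diet) sum to 1
share = pd.crosstab(toy["diet_base"], toy["drinks"], normalize="index")
print(share.round(2))
```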
4. Geographical Insights¶
Analyzing geographical data helps Bumble understand its user base distribution, enabling targeted regional campaigns and feature localization. For instance, identifying the top cities with active users can guide marketing efforts in those areas.
Que 1:¶
Extract city and state information from the location column. What are the top 5 cities and states with the highest number of users?
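# Keep rows with a non-missing location; .copy() avoids a SettingWithCopyWarning when adding columns below
Bumble_Dataset = Bumble_Dataset[Bumble_Dataset['location'].notna()].copy()
# Split on the first ", " into city and state (State is None when the separator is absent)
Bumble_Dataset[['City', 'State']] = Bumble_Dataset['location'].str.split(', ', n=1, expand=True)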
# Remove white space
Bumble_Dataset['City'] = Bumble_Dataset['City'].str.strip()
Bumble_Dataset['State'] = Bumble_Dataset['State'].str.strip()
# Count users per city and state
top_cities = Bumble_Dataset['City'].value_counts().head(5)
top_states = Bumble_Dataset['State'].value_counts().head(5)
# Convert to DataFrame
top_cities_df = top_cities.reset_index()
top_cities_df.columns = ['City', 'User Count']
top_states_df = top_states.reset_index()
top_states_df.columns = ['State', 'User Count']
# Display results
print(f"Top 5 Cities:\n{top_cities_df}")
print(f"\nTop 5 States:\n{top_states_df}")
Top 5 Cities:
City User Count
0 san francisco 31064
1 oakland 7214
2 berkeley 4212
3 san mateo 1331
4 palo alto 1064
Top 5 States:
State User Count
0 california 59855
1 new york 17
2 illinois 8
3 massachusetts 5
4 texas 4
# Visualize top cities
plt.figure(figsize=(10, 6))
plt.barh(range(len(top_cities_df)), top_cities_df['User Count'], color='coral', edgecolor='black')
plt.yticks(range(len(top_cities_df)), top_cities_df['City'])
plt.xlabel('Number of Users')
plt.title('Top 5 Cities by User Count')
plt.grid(axis='x', alpha=0.3)
plt.show()
Que 2:¶
How does age vary across the top cities? Are certain cities dominated by younger or older users?
# Find top 5 cities
top_cities = Bumble_Dataset['City'].value_counts().head(5).index
# Filter dataset
df_top_cities = Bumble_Dataset[Bumble_Dataset['City'].isin(top_cities)]
# Calculate age statistics
age_stats = df_top_cities.groupby('City')['age'].mean().sort_values(ascending=True).reset_index()
age_stats.columns = ['City', 'Age Average']
print(age_stats)
City Age Average 0 berkeley 31.391738 1 san francisco 31.614312 2 palo alto 31.980263 3 oakland 33.178819 4 san mateo 33.437265
Que 3:¶
What are the average income levels in the top states or cities? Are there regional patterns in reported income?
# Calculate average income by city
df_top_cities = Bumble_Dataset[Bumble_Dataset['City'].isin(top_cities)]
avg_income_by_city = round(df_top_cities.groupby('City')['income'].mean().sort_values(ascending=False), 2)
print("Average Income in Top Cities:")
print(avg_income_by_city)
# Calculate average income by state
top_states = Bumble_Dataset['State'].value_counts().head(5).index
df_top_states = Bumble_Dataset[Bumble_Dataset['State'].isin(top_states)]
avg_income_by_state = round(df_top_states.groupby('State')['income'].mean().sort_values(ascending=False), 2)
print("\nAverage Income in Top States:")
print(avg_income_by_state)
Average Income in Top Cities: City san mateo 22779.86 oakland 22586.64 san francisco 20150.01 palo alto 19332.71 berkeley 17364.67 Name: income, dtype: float64 Average Income in Top States: State new york 31764.71 california 20044.27 massachusetts 6000.00 texas 5000.00 illinois 0.00 Name: income, dtype: float64
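The state averages above rest on very different sample sizes (59,855 users in California versus a handful elsewhere) and include the zeros of users who did not report an income. A hedged sketch of a more cautious summary follows: it keeps only reported incomes, shows the group size next to the mean and median, and drops groups below a minimum size. The toy values and the `min_users` threshold are assumptions.

```python
import pandas as pd

# Made-up data: a large group and two very small ones
toy = pd.DataFrame({
    "State": ["california"] * 5 + ["new york"] * 2 + ["texas"],
    "income": [20000, 0, 80000, 0, 100000, 0, 150000, 0],
})

# Exclude users who did not report an income (encoded as 0)
reported = toy[toy["income"] > 0]

# Mean and median side by side with the number of users behind them
summary = reported.groupby("State")["income"].agg(["mean", "median", "count"])

# Only keep groups with enough users to be meaningful (threshold is arbitrary)
min_users = 3
reliable = summary[summary["count"] >= min_users]

print(summary)
print(reliable)
```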
5. Height Analysis¶
Physical attributes like height are often considered important in dating preferences. Analyzing height patterns helps Bumble understand user demographics and preferences better.
Que 1:¶
What is the average height of users across different gender categories?
# Calculate average height for each gender
avg_height_by_gender = Bumble_Dataset.groupby('gender')['height'].mean().reset_index()
avg_height_by_gender.columns = ['Gender', 'Average Height']
print(avg_height_by_gender)
Gender Average Height 0 f 65.103869 1 m 70.443468
# Visualize height by gender
plt.figure(figsize=(8, 6))
plt.bar(avg_height_by_gender['Gender'], avg_height_by_gender['Average Height'],
color=['skyblue', 'pink'], edgecolor='black')
plt.xlabel('Gender')
plt.ylabel('Average Height (inches)')
plt.title('Average Height by Gender')
plt.grid(axis='y', alpha=0.3)
plt.show()
Que 2:¶
How does height vary by age_group? Are there noticeable trends among younger vs. older users?
# Calculate average height for each age group
avg_height_by_age_group = Bumble_Dataset.groupby('age_group')['height'].mean().reset_index()
avg_height_by_age_group.columns = ['Age Group', 'Average Height']
print(avg_height_by_age_group)
Age Group Average Height 0 18-25 68.200913 1 26-35 68.406764 2 36-45 68.325095 3 46+ 67.941167
Que 3:¶
What is the distribution of height within body_type categories (e.g., athletic, curvy, thin)? Do the distributions align with expectations?
# Group by body_type and calculate height
height_by_body_type = Bumble_Dataset.groupby('body_type')['height'].mean().sort_values(ascending=False).reset_index()
height_by_body_type.columns = ['Body Type', 'Average Height']
print(height_by_body_type.head(10))
Body Type Average Height 0 athletic 69.707336 1 jacked 69.292162 2 used up 69.180282 3 overweight 68.948198 4 a little extra 68.820084 5 fit 68.546062 6 skinny 68.544176 7 average 68.100805 8 thin 67.866058 9 rather not say 67.272727
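The question asks about the distribution of height, while the table above reports only the mean. Quartiles per body type give a fuller picture of the spread. The sketch below shows one way to compute them on made-up heights; the values and the choice of the 25th, 50th, and 75th percentiles are assumptions.

```python
import pandas as pd

# Made-up heights (inches) for two body types
toy = pd.DataFrame({
    "body_type": ["athletic"] * 4 + ["thin"] * 4,
    "height": [68.0, 70.0, 71.0, 73.0, 64.0, 66.0, 68.0, 72.0],
})

# One row per body type, one column per quantile (0.25, 0.5, 0.75)
quartiles = (
    toy.groupby("body_type")["height"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Interquartile range as a simple measure of spread
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```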
6. Income Analysis¶
Income is often an important factor for users on dating platforms. Understanding its distribution and relationship with other variables helps refine features like user search filters or personalized recommendations.
Que 1:¶
What is the distribution of income across the platform? Are there specific income brackets that dominate? How would you handle cases where income is blank or 0?
# Keep only users who reported an income (0 marks unreported income)
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Calculate income distribution
income_distribution = filtered_income['income'].value_counts().sort_values(ascending=False).head(12).reset_index()
income_distribution.columns = ['Income', 'User Count']
print(income_distribution)
Income User Count 0 20000 2952 1 100000 1621 2 80000 1111 3 30000 1048 4 40000 1005 5 50000 975 6 60000 736 7 70000 707 8 150000 631 9 1000000 521 10 250000 149 11 500000 48
# Visualize income distribution
plt.figure(figsize=(10, 6))
plt.hist(filtered_income['income'], bins=30, color='gold', edgecolor='black')
plt.xlabel('Income')
plt.ylabel('Number of Users')
plt.title('Income Distribution on Bumble')
plt.grid(axis='y', alpha=0.3)
plt.show()
Que 2:¶
How does income vary by age_group and gender? Are older users more likely to report higher incomes?
# Filter users with income > 0
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Group by age group and gender
income_by_age_gender = (
filtered_income.groupby(['age_group', 'gender'])['income']
.mean()
.reset_index()
.sort_values(by=['age_group', 'income'], ascending=[True, False])
.round(2)
)
income_by_age_gender.columns = ['Age Group', 'Gender', 'Average Income']
print(income_by_age_gender)
Age Group Gender Average Income 1 18-25 m 106618.77 0 18-25 f 86066.35 3 26-35 m 114944.80 2 26-35 f 90398.13 5 36-45 m 112680.61 4 36-45 f 87302.98 7 46+ m 100156.63 6 46+ f 75299.76
Part 4: Data Visualization¶
1. Age Distribution¶
Understanding the distribution of user ages can reveal whether the platform caters to specific demographics or age groups. This insight is essential for targeted marketing and user experience design.
Que 1:¶
Plot a histogram of age with a vertical line indicating the mean age. What does the distribution reveal about the most common age group on the platform?
# Calculate mean age
mean_age = Bumble_Dataset['age'].mean()
# Plot histogram
plt.figure(figsize=(10, 8))
sns.histplot(Bumble_Dataset['age'], bins=30, color='skyblue')
# Add vertical line for mean age
plt.axvline(mean_age, color='red', linestyle='dashed', linewidth=2, label=f'Mean Age: {mean_age:.2f}')
# Labels and title
plt.xlabel('Age')
plt.ylabel('User Count')
plt.title('Age Distribution of Users')
plt.legend()
plt.show()
Que 2:¶
How does the age distribution differ by gender? Are there age groups where one gender is more prevalent?
# Plot histograms for each gender
plt.figure(figsize=(10, 8))
plt.hist(Bumble_Dataset[Bumble_Dataset['gender'] == 'm']['age'], bins=30, alpha=0.6, label='Male', color='blue')
plt.hist(Bumble_Dataset[Bumble_Dataset['gender'] == 'f']['age'], bins=30, alpha=0.6, label='Female', color='pink')
# Add labels and legend
plt.xlabel("Age")
plt.ylabel("Number of Users")
plt.title("Age Distribution by Gender")
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
2. Income and Age¶
Visualizing the relationship between income and age helps uncover patterns in reported income levels across age groups, which could inform user segmentation strategies.
Que 1:¶
Use a scatterplot to visualize the relationship between income and age, with a trend line indicating overall patterns. Are older users more likely to report higher incomes?
# Keep only positive incomes (drops missing values and placeholders such as -1)
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Set figure size
plt.figure(figsize=(10, 8))
# Create scatter plot
sns.scatterplot(x=filtered_income['age'], y=filtered_income['income'], alpha=0.5)
# Add trend line using regression plot
sns.regplot(x=filtered_income['age'], y=filtered_income['income'], scatter=False, color='red')
# Labels and title
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Scatter Plot of Income vs. Age with Trend Line")
plt.show()
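The trend line can be hard to judge by eye when the points overlap heavily. As a rough complement, the sketch below computes the Pearson correlation and the slope of a straight-line fit on the same filtered rows. The helper name `age_income_trend` and the toy data are assumptions; a linear fit is only one possible summary and may be dominated by a few very high incomes.

```python
import numpy as np
import pandas as pd

def age_income_trend(df: pd.DataFrame) -> tuple[float, float]:
    """Return (Pearson r, slope in income units per year of age) for positive incomes."""
    reported = df[df['income'] > 0]
    r = reported['age'].corr(reported['income'])  # Pearson by default
    slope, _intercept = np.polyfit(reported['age'], reported['income'], deg=1)
    return float(r), float(slope)

# Hypothetical toy rows; the last one mimics an unreported income (-1)
toy = pd.DataFrame({
    'age':    [20, 30, 40, 50, 25],
    'income': [20000, 40000, 60000, 80000, -1],
})
r, slope = age_income_trend(toy)
print(f"r = {r:.2f}, slope = {slope:.0f} per year")
```

A value of `r` near zero on the real data would suggest that age alone says little about reported income, whatever the direction of the fitted line.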
Que 2:¶
Create boxplots of income grouped by age_group. Which age group reports the highest median income?
# Filter income > 0
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Create boxplot
plt.figure(figsize=(10, 8))
sns.boxplot(x='age_group', y='income', data=filtered_income, hue='age_group', legend=False)
# Labels
plt.xlabel("Age Group")
plt.ylabel("Income")
plt.title("Income Distribution by Age Group")
plt.show()
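Reading medians off a boxplot is imprecise when the income axis is stretched by outliers. The sketch below tabulates the medians directly, assuming the same `age_group` and `income` columns; the helper name and toy values are hypothetical.

```python
import pandas as pd

def median_income_by_group(df: pd.DataFrame, group_col: str = 'age_group') -> pd.Series:
    """Median of positive incomes per group, highest first."""
    reported = df[df['income'] > 0]
    return (
        reported.groupby(group_col, observed=True)['income']
        .median()
        .sort_values(ascending=False)
    )

# Hypothetical toy rows standing in for Bumble_Dataset
toy = pd.DataFrame({
    'age_group': ['18-25', '18-25', '26-35', '26-35', '26-35', '36-45'],
    'income':    [20000,   40000,   50000,   70000,   -1,      55000],
})
medians = median_income_by_group(toy)
print(medians)
```

The first index label of the result is the group with the highest median, which is the quantity the question asks about.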
Que 3:¶
Analyze income levels within gender and status categories. For example, are single men more likely to report higher incomes than single women?
# Filter income > 0
filtered_income = Bumble_Dataset[Bumble_Dataset['income'] > 0]
# Boxplot for income by gender and status
plt.figure(figsize=(12, 8))
sns.boxplot(x='status', y='income', hue='gender', data=filtered_income, palette='Set2')
# Labels
plt.xlabel("Status")
plt.ylabel("Income")
plt.title("Income Distribution by Gender and Status")
plt.legend(title="Gender")
plt.show()
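To compare men and women within each status directly, the medians can be arranged side by side and differenced. This is a sketch under the assumption that `gender` takes the values `m` and `f`, as in the filters used earlier in the notebook; the helper and column name `gap_m_minus_f` are made up for illustration.

```python
import pandas as pd

def income_gap_by_status(df: pd.DataFrame) -> pd.DataFrame:
    """Median positive income per status and gender, plus the male-minus-female gap."""
    reported = df[df['income'] > 0]
    medians = reported.pivot_table(
        index='status', columns='gender', values='income', aggfunc='median'
    )
    medians['gap_m_minus_f'] = medians['m'] - medians['f']
    return medians

# Hypothetical toy rows standing in for Bumble_Dataset
toy = pd.DataFrame({
    'status': ['single', 'single', 'single', 'single', 'single', 'married', 'married'],
    'gender': ['m',      'm',      'f',      'f',      'f',      'm',       'f'],
    'income': [50000,    70000,    40000,    60000,    -1,       80000,     75000],
})
gaps = income_gap_by_status(toy)
print(gaps)
```

Statuses with very few users can produce large gaps by chance, so it is worth checking group sizes before drawing conclusions from any single row.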
3. Pets and Preferences¶
Pets are often a key lifestyle preference and compatibility factor. Analyzing how pet preferences distribute across demographics can provide insights for filters or recommendations.
Que 1:¶
Create a bar chart showing the distribution of pets categories (e.g., likes dogs, likes cats). Which preferences are most common?
# Count pet preferences
pet_counts = Bumble_Dataset['pets'].value_counts().reset_index()
pet_counts.columns = ['Pet Preference', 'Count']
# Plot top 10
plt.figure(figsize=(10, 8))
top_pets = pet_counts.head(10)
sns.barplot(x='Pet Preference', y='Count', data=top_pets, hue='Pet Preference', legend=False)
plt.xlabel("Pet Preferences")
plt.ylabel("Number of Users")
plt.title("Distribution of Pet Preferences on Bumble")
plt.xticks(rotation=90)
plt.show()
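Because values in `pets` combine dog and cat attitudes in one string (for example, "likes dogs and likes cats"), the bar chart splits dog-friendly users across several categories. A possible way to get per-animal totals is to derive boolean flags from the text. The sketch below assumes labels of the form "likes X", "has X", and "dislikes X"; only the "likes" forms are visible in this notebook, so the other two are assumptions, as are the helper and column names.

```python
import pandas as pd

def pet_flags(pets: pd.Series) -> pd.DataFrame:
    """Flag users whose pets text is positive about dogs or cats."""
    text = pets.fillna('').str.lower()
    # \b keeps 'dislikes dogs' from matching the 'likes dogs' pattern
    return pd.DataFrame({
        'positive_on_dogs': text.str.contains(r'\b(?:likes|has) dogs\b', regex=True),
        'positive_on_cats': text.str.contains(r'\b(?:likes|has) cats\b', regex=True),
    })

# Hypothetical toy values standing in for Bumble_Dataset['pets']
toy_pets = pd.Series([
    'likes dogs and likes cats',
    'likes dogs',
    'has cats',
    'dislikes dogs and likes cats',
    None,
])
flags = pet_flags(toy_pets)
print(flags.mean())  # share of all users, with missing values counted as False
```

Missing entries are treated as "not positive" here; dropping them first would instead give shares among users who filled in the field.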
Que 2:¶
How do pet preferences vary across gender and age_group? Are younger users more likely to report liking pets compared to older users?
# Group by age_group and pets
pet_preferences = Bumble_Dataset.groupby(['age_group', 'pets']).size().reset_index(name='Count')
# Plot
plt.figure(figsize=(10, 8))
top_pets_by_age = pet_preferences.sort_values('Count', ascending=False).head(15)
sns.barplot(x='age_group', y='Count', hue='pets', data=top_pets_by_age, palette='Set3')
plt.xlabel("Age Group")
plt.ylabel("Number of Users")
plt.title("Pet Preferences Across Age Groups")
plt.legend(title="Pet Preference", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
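The chart above uses raw counts grouped by `age_group` only, so larger age groups dominate and gender is not shown. A rate per age group and gender is one way to address both points. The sketch below measures the share of users who filled in the `pets` field at all, which is a simplifying assumption about what "reporting" a preference means; the helper name and toy rows are hypothetical.

```python
import pandas as pd

def pet_reporting_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Share of users with a non-missing pets value, by age group (rows) and gender (columns)."""
    return (
        df.assign(has_pet_info=df['pets'].notna())
        .groupby(['age_group', 'gender'])['has_pet_info']
        .mean()
        .unstack('gender')
    )

# Hypothetical toy rows standing in for Bumble_Dataset
toy = pd.DataFrame({
    'age_group': ['18-25', '18-25', '18-25', '26-35', '26-35', '26-35'],
    'gender':    ['m',     'm',     'f',     'm',     'f',     'f'],
    'pets':      ['likes dogs', None, 'likes cats', None, 'likes dogs',
                  'likes dogs and likes cats'],
})
rates = pet_reporting_rate(toy)
print(rates)
```

Comparing rows of the result shows whether younger groups report pet preferences more often than older ones, independently of how many users each group contains.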
4. Signs and Personality¶
Users' self-reported zodiac signs (sign) can offer insights into personality preferences or trends. While not scientifically grounded, analyzing this data helps explore fun and engaging patterns.
Que 1:¶
Create a bar chart showing the distribution of zodiac signs (sign) across the platform. Which signs are most and least represented?
# Count zodiac signs
zodiac_counts = Bumble_Dataset['sign'].value_counts()
# Create bar chart
plt.figure(figsize=(20, 8))
sns.barplot(x=zodiac_counts.index, y=zodiac_counts.values, hue=zodiac_counts.index, legend=False)
plt.xlabel("Zodiac Sign")
plt.ylabel("Number of Users")
plt.title("Distribution of Zodiac Signs on the Platform")
plt.xticks(rotation=90)
plt.show()
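To put a number on how evenly signs are represented, each sign's share can be compared with the share it would have if all observed signs were equally common. The sketch below is a simple illustration of that idea; the helper name and toy values are assumptions, and no statistical test is implied.

```python
import pandas as pd

def sign_share_vs_uniform(signs: pd.Series) -> pd.Series:
    """Each sign's share minus the uniform share, sorted from most under- to most over-represented."""
    shares = signs.dropna().value_counts(normalize=True)
    expected = 1 / shares.size  # equal share across the signs that actually appear
    return (shares - expected).sort_values()

# Hypothetical toy values standing in for Bumble_Dataset['sign']
toy_signs = pd.Series(['leo', 'leo', 'leo', 'gemini', 'gemini', 'virgo', None])
deviation = sign_share_vs_uniform(toy_signs)
print(deviation)
```

The first and last entries of the result are the least and most represented signs; deviations close to zero across the board would support describing the distribution as roughly even.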
Que 2:¶
How does sign vary across gender and status? Are there noticeable patterns or imbalances?
# Group by gender, status, and sign
gender_status = Bumble_Dataset.groupby(["gender", "status", "sign"])["status"].count().reset_index(name="count")
stacked_data = gender_status.pivot_table(index="sign", columns=["gender", "status"], values="count", fill_value=0)
# Create stacked chart
stacked_data.plot(kind="bar", stacked=True, figsize=(20, 8), colormap="coolwarm")
plt.xlabel("Zodiac Sign")
plt.ylabel("Number of Users")
plt.title("User Counts by Gender & Status Across Zodiac Signs")
plt.xticks(rotation=90)
plt.legend(title="Gender, Status", bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.show()
Summary of Analysis¶
The analysis of Bumble user data revealed several key insights:¶
Data Completeness:¶
Age and income have no empty entries, although income uses a placeholder value (such as -1) where users did not report it. Diet and religion have significant missing values.
User Demographics:¶
The largest user group falls within the 26-35 age range, with more male users than female.
Profile Completeness:¶
Most users have well-completed profiles, with completeness rates varying across demographics.
Dietary Habits & Drinking:¶
Users who report a "mostly anything" diet are the most common, and drinking habits vary.
Geographic Trends:¶
San Francisco and other California cities dominate the user base, making California the state with the highest concentration of users.
Height Differences:¶
Males are generally taller (around 70 inches) than females (around 65 inches).
Income Distribution:¶
Most users earn between $20,000 and $100,000. The 26-35 age group has significant representation across income levels.
User Preferences:¶
Positive pet preferences, especially toward dogs and cats, are common. Zodiac signs show a relatively even distribution across the platform.
Recommendations¶
To enhance user engagement and experience, Bumble should consider:¶
Targeting the 26-35 Age Group & Male Users:¶
This group is the largest in the dataset and represents a significant portion of the user base.
Improving Data Completeness:¶
Address missing values in diet and religion to enhance recommendation accuracy and matching.
Enhancing Personalization Through Diet & Drinking Habits:¶
Introduce features tailored to dietary and drinking preferences to improve user compatibility.
Integrating Pet Preferences into Matchmaking:¶
Many users like pets, especially dogs and cats. Incorporating this into the matchmaking algorithm could improve compatibility.
Leveraging Geographic Data:¶
Focus marketing efforts on high-concentration areas like California while exploring growth opportunities in underrepresented regions.
Income-Based Features:¶
Consider income-based filtering or recommendations to help users find compatible matches.