Introduction and Purpose

Bumble is a popular dating platform where users connect based on mutual interests and compatibility. To provide a better matchmaking experience, Bumble collects user information through profiles, which include details about demographics, lifestyle habits, and personal preferences.

This dataset represents user-generated profiles, offering a rich source of information to understand user behavior, preferences, and trends.

The purpose of this analysis is to leverage the dataset to answer key business and user behavior questions. By cleaning, processing, exploring, and visualizing the data, the goal is to uncover actionable insights that can help improve Bumble's matchmaking experience and user engagement.

1.1 Load the libraries

In [3]:
# Install seaborn first if it is not already available
%pip install seaborn
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

1.2 Import the dataset

In [5]:
df=pd.read_csv(r"C:\Users\avadh\Desktop\data analyst nextleap\Milestone 4\bumble.csv")
In [6]:
df.head()
Out[6]:
  | age | status | gender | body_type | diet | drinks | education | ethnicity | height | income | job | last_online | location | pets | religion | sign | speaks
0 | 22 | single | m | a little extra | strictly anything | socially | working on college/university | asian, white | 75.0 | -1 | transportation | 2012-06-28-20-30 | south san francisco, california | likes dogs and likes cats | agnosticism and very serious about it | gemini | english
1 | 35 | single | m | average | mostly other | often | working on space camp | white | 70.0 | 80000 | hospitality / travel | 2012-06-29-21-41 | oakland, california | likes dogs and likes cats | agnosticism but not too serious about it | cancer | english (fluently), spanish (poorly), french (...
2 | 38 | available | m | thin | anything | socially | graduated from masters program | NaN | 68.0 | -1 | NaN | 2012-06-27-09-10 | san francisco, california | has cats | NaN | pisces but it doesn’t matter | english, french, c++
3 | 23 | single | m | thin | vegetarian | socially | working on college/university | white | 71.0 | 20000 | student | 2012-06-28-14-22 | berkeley, california | likes cats | NaN | pisces | english, german (poorly)
4 | 29 | single | m | athletic | NaN | socially | graduated from college/university | asian, black, other | 66.0 | -1 | artistic / musical / writer | 2012-06-27-21-26 | san francisco, california | likes dogs and likes cats | NaN | aquarius | english

1.3 Check the dataset information and the data type of each attribute

This step provides an overview of the dataset by displaying the structure, data types, and non-null counts for each column. It helps identify data types, missing values, and memory usage, laying the foundation for data cleaning and further analysis.

In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   status       59946 non-null  object 
 2   gender       59946 non-null  object 
 3   body_type    54650 non-null  object 
 4   diet         35551 non-null  object 
 5   drinks       56961 non-null  object 
 6   education    53318 non-null  object 
 7   ethnicity    54266 non-null  object 
 8   height       59943 non-null  float64
 9   income       59946 non-null  int64  
 10  job          51748 non-null  object 
 11  last_online  59946 non-null  object 
 12  location     59946 non-null  object 
 13  pets         40025 non-null  object 
 14  religion     39720 non-null  object 
 15  sign         48890 non-null  object 
 16  speaks       59896 non-null  object 
dtypes: float64(1), int64(2), object(14)
memory usage: 7.8+ MB

1.4 Statistical Analysis

In [11]:
print("statistical analysis (numerical columns)")
df.describe()
statistical analysis (numerical columns)
Out[11]:
                age        height          income
count  59946.000000  59943.000000    59946.000000
mean      32.340290     68.295281    20033.222534
std        9.452779      3.994803    97346.192104
min       18.000000      1.000000       -1.000000
25%       26.000000     66.000000       -1.000000
50%       30.000000     68.000000       -1.000000
75%       37.000000     71.000000       -1.000000
max      110.000000     95.000000  1000000.000000

1.5 Check the dimensions of the data

In [13]:
df.shape
Out[13]:
(59946, 17)

1.6 Check null values

In [15]:
df.isnull().sum()
Out[15]:
age                0
status             0
gender             0
body_type       5296
diet           24395
drinks          2985
education       6628
ethnicity       5680
height             3
income             0
job             8198
last_online        0
location           0
pets           19921
religion       20226
sign           11056
speaks            50
dtype: int64
In [16]:
df.isnull().sum().sum()
Out[16]:
104438
  • Which columns in the dataset have missing values, and what percentage of data is missing in each column?
In [18]:
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100
  • Are there columns where more than 50% of the data is missing? Drop those columns where missing values are >50%.
In [20]:
missing_summary = pd.DataFrame({
    'Missing Values': missing_data,
    'Percentage (%)': missing_percentage
}).sort_values(by='Percentage (%)', ascending=False)

print(missing_summary)
             Missing Values  Percentage (%)
diet                  24395       40.694959
religion              20226       33.740366
pets                  19921       33.231575
sign                  11056       18.443266
job                    8198       13.675641
education              6628       11.056618
ethnicity              5680        9.475194
body_type              5296        8.834618
drinks                 2985        4.979482
speaks                   50        0.083408
height                    3        0.005005
last_online               0        0.000000
location                  0        0.000000
income                    0        0.000000
status                    0        0.000000
gender                    0        0.000000
age                       0        0.000000
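In the summary above, no column exceeds the 50% threshold (diet is the highest at about 40.7%), so nothing needs to be dropped. If a column did cross the threshold, the drop step could be sketched roughly as follows. The toy frame, the `mostly_empty` column, and the `threshold` name are illustrative assumptions and are not part of the Bumble dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 'mostly_empty' is 75% missing
toy = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["anything", np.nan, "vegan", "other"],
    "mostly_empty": [np.nan, np.nan, np.nan, "x"],
})

threshold = 50  # percent of missing values above which a column is dropped

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Select and drop the columns above the threshold
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()
toy_clean = toy.drop(columns=cols_to_drop)

print("Dropped:", cols_to_drop)
print("Remaining:", toy_clean.columns.tolist())
```

Computing the list first and dropping second keeps the decision visible, which helps when the threshold is later tuned.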
In [21]:
df.duplicated().sum()
Out[21]:
0

Part 1: Data Cleaning

1. Inspecting Missing Data

Inspecting missing data is crucial to identify gaps in the dataset that could impact analysis or model performance. It helps decide how to handle missing values, such as imputation, removal, or flagging, ensuring the integrity and accuracy of the analysis.

In [26]:
df.dropna(subset=['height'],inplace=True)
Conclusion: rows with a missing height were dropped, since only 3 of 59,946 rows (well under 1%) lacked a height value.
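The three handling options mentioned above (removal, imputation, flagging) can be compared side by side on a small example. This is only a sketch on made-up data; the "not specified" label and the `diet_missing` column name are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing height, two missing diets
toy = pd.DataFrame({
    "height": [70.0, np.nan, 66.0, 68.0],
    "diet": ["anything", np.nan, "vegan", np.nan],
})

# 1. Removal: drop rows missing a value in a nearly complete column
removed = toy.dropna(subset=["height"])

# 2. Imputation: fill a numeric gap with the column median
imputed = toy.assign(height=toy["height"].fillna(toy["height"].median()))

# 3. Flagging: record that the value was missing, then label the gap explicitly
flagged = toy.assign(
    diet_missing=toy["diet"].isnull(),
    diet=toy["diet"].fillna("not specified"),
)

print(removed, imputed, flagged, sep="\n\n")
```

Removal suits columns such as height where almost nothing is missing, while flagging may suit columns such as diet where the gap itself could be informative.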

2. Data Types

Reviewing data types ensures the data is properly structured for analysis. Correct data types are essential for performing valid operations, applying appropriate transformations, and optimizing both performance and memory usage.

In [30]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 59943 entries, 0 to 59945
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59943 non-null  int64  
 1   status       59943 non-null  object 
 2   gender       59943 non-null  object 
 3   body_type    54650 non-null  object 
 4   diet         35551 non-null  object 
 5   drinks       56961 non-null  object 
 6   education    53318 non-null  object 
 7   ethnicity    54264 non-null  object 
 8   height       59943 non-null  float64
 9   income       59943 non-null  int64  
 10  job          51747 non-null  object 
 11  last_online  59943 non-null  object 
 12  location     59943 non-null  object 
 13  pets         40024 non-null  object 
 14  religion     39720 non-null  object 
 15  sign         48889 non-null  object 
 16  speaks       59893 non-null  object 
dtypes: float64(1), int64(2), object(14)
memory usage: 8.2+ MB
In [31]:
#Does the last_online column need to be converted into a datetime format?
In [32]:
# Convert 'last_online' column to datetime with the correct format
df['last_online'] = pd.to_datetime(df['last_online'], format='%Y-%m-%d-%H-%M')

# Display the result
df['last_online'].head()
Out[32]:
0   2012-06-28 20:30:00
1   2012-06-29 21:41:00
2   2012-06-27 09:10:00
3   2012-06-28 14:22:00
4   2012-06-27 21:26:00
Name: last_online, dtype: datetime64[ns]
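Once the column is a datetime, it can feed simple activity features. The sketch below uses three made-up strings in the same `%Y-%m-%d-%H-%M` layout, shows how `errors="coerce"` handles a malformed entry, and derives a hypothetical `days_inactive` value relative to the latest login in the sample. None of these names come from the notebook.

```python
import pandas as pd

# Toy strings (hypothetical) in the same layout as last_online, plus one bad value
raw = pd.Series(["2012-06-28-20-30", "2012-06-29-21-41", "not-a-date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d-%H-%M", errors="coerce")

# Whole days between each login and the most recent login in the sample
reference = parsed.max()
days_inactive = (reference - parsed).dt.days

print(parsed)
print(days_inactive)
```

Using the latest timestamp in the data as the reference point avoids comparing 2012 logins with today's date.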

3. Outliers

Outliers can significantly skew results and affect the accuracy of analysis or models. Identifying and addressing outliers helps ensure the data accurately represents trends and patterns, leading to more reliable insights and decisions.

In [35]:
df.describe()
Out[35]:
                age        height          income                    last_online
count  59943.000000  59943.000000    59943.000000                          59943
mean      32.340140     68.295281    20034.225197  2012-05-22 06:40:43.551373568
min       18.000000      1.000000       -1.000000            2011-06-27 01:52:00
25%       26.000000     66.000000       -1.000000            2012-05-29 20:34:00
50%       30.000000     68.000000       -1.000000            2012-06-27 14:30:00
75%       37.000000     71.000000       -1.000000            2012-06-30 01:09:00
max      110.000000     95.000000  1000000.000000            2012-07-01 08:57:00
std        9.452723      3.994803    97348.524902                            NaN
In [36]:
#Are there any apparent outliers in numerical columns such as age, height, or income? What are the ranges of values in these columns?

#Any -1 values in numerical columns like income should be replaced with 0, as they may represent missing or invalid data.
In [37]:
# Calculate IQR bounds for age and height (income is handled separately below)
columns_to_check = ['age', 'height']
outliers = {}

for col in columns_to_check:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    # Calculate lower and upper bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers
    outliers[col] = df[(df[col] < lower_bound) | (df[col] > upper_bound)]

    # Print results
    print(f"{col} - IQR: {IQR:.2f}, Lower Bound: {lower_bound:.2f}, Upper Bound: {upper_bound:.2f}")
    print(f"Number of outliers in {col}: {outliers[col].shape[0]}")
age - IQR: 11.00, Lower Bound: 9.50, Upper Bound: 53.50
Number of outliers in age: 2638
height - IQR: 5.00, Lower Bound: 58.50, Upper Bound: 78.50
Number of outliers in height: 285
  • Replacing Outliers with Median:

Outliers can distort summary statistics. Replacing them with the median is useful because the median, unlike the mean, is not pulled by extreme values. Note that the cells below use the 10th and 90th percentiles as cut-offs rather than the IQR bounds computed above, so every age or height outside that range (up to about 20% of the values in each column) is replaced, not only the IQR outliers.

In [39]:
# Step 1: Calculate the 10th and 90th percentiles for 'age' and 'height'
age_10th, age_90th = df['age'].quantile(0.10), df['age'].quantile(0.90)
height_10th, height_90th = df['height'].quantile(0.10), df['height'].quantile(0.90)

# Step 2: Calculate the median for 'age' and 'height' within the middle 80%
middle_80_data = df[(df['age'] >= age_10th) & (df['age'] <= age_90th) &
                     (df['height'] >= height_10th) & (df['height'] <= height_90th)]

median_age = middle_80_data['age'].median()
median_height = middle_80_data['height'].median()

print(f"Median Age (Middle 80%): {median_age}")
print(f"Median Height (Middle 80%): {median_height}")
Median Age (Middle 80%): 30.0
Median Height (Middle 80%): 68.0
In [40]:
# Step 3: Replace outliers in 'age' and 'height' with the calculated median
df['age'] = df['age'].apply(lambda x: median_age if x < age_10th or x > age_90th else x)
df['height'] = df['height'].apply(lambda x: median_height if x < height_10th or x > height_90th else x)

# Step 4: Verify the changes
print(df[['age', 'height']].describe())
                age        height
count  59943.000000  59943.000000
mean      30.897986     68.193367
std        5.419528      2.633236
min       23.000000     63.000000
25%       27.000000     66.000000
50%       30.000000     68.000000
75%       34.000000     70.000000
max       46.000000     73.000000
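The describe output above shows how strongly the percentile rule narrows both columns (age now spans 23 to 46). To see how the two rules differ, the sketch below counts what each one flags on a small made-up sample. The helper names `iqr_mask` and `percentile_mask` are hypothetical and the ages are toy values, not dataset rows.

```python
import pandas as pd

# Toy ages (hypothetical): mostly typical values plus one extreme entry
ages = pd.Series([18, 22, 24, 26, 27, 29, 30, 31, 33, 35, 37, 40, 45, 52, 110])

def iqr_mask(s, k=1.5):
    """True where a value lies outside the Q1 - k*IQR .. Q3 + k*IQR fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def percentile_mask(s, low=0.10, high=0.90):
    """True where a value lies outside the [low, high] percentile range."""
    return (s < s.quantile(low)) | (s > s.quantile(high))

n_iqr = int(iqr_mask(ages).sum())
n_pct = int(percentile_mask(ages).sum())
print(f"IQR rule flags {n_iqr} value(s); percentile rule flags {n_pct} value(s)")
```

On this sample the percentile rule flags more values than the IQR rule, which is the trade-off to keep in mind: it is simpler, but it also overwrites legitimate ages and heights near the edges of the distribution.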
  • -1 values in numerical column income should be replaced with 0, as they may represent missing or invalid data:

-1 values in the income column likely represent missing or invalid data. Replacing them with 0 helps maintain consistency in the dataset and ensures that these values don’t distort the analysis or modeling, as they are not meaningful income figures.

In [42]:
# Replace -1 values with 0 in the 'income' column using replace method
df['income'] = df['income'].replace(-1, 0)

# Verify the changes
print(df['income'].describe())
count      59943.000000
mean       20035.033282
std        97348.358589
min            0.000000
25%            0.000000
50%            0.000000
75%            0.000000
max      1000000.000000
Name: income, dtype: float64
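Even after the replacement, the 25th, 50th, and 75th percentiles are all 0, because most users did not report an income. An alternative worth considering (it is not what this notebook does) is to map -1 to NaN, so that pandas skips those rows in statistics instead of averaging them in as zeros. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy incomes (hypothetical): -1 marks "not reported"
income = pd.Series([-1, 80000, -1, 20000, -1])

as_zero = income.replace(-1, 0)       # the notebook's approach
as_nan = income.replace(-1, np.nan)   # alternative: treat as missing

print("Mean with zeros:", as_zero.mean())  # zeros pull the mean down
print("Mean with NaN:  ", as_nan.mean())   # NaN values are skipped
```

Which option is appropriate depends on the question: zeros keep every row in counts, while NaN keeps unreported incomes out of averages and correlations.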

4. Missing Data Visualization

Visualizing missing data helps quickly identify patterns and distributions of missing values across columns. This step aids in understanding the extent of missing data, making it easier to decide on appropriate handling methods, such as imputation or removal, for more accurate analysis.

In [45]:
# Create a boolean mask where True indicates missing data
missing_data = df.isnull()

# Create a heatmap to visualize missing values
plt.figure(figsize=(8, 5))
sns.heatmap(missing_data, cbar=False, cmap=['black', '#FFCC00'], yticklabels=False, xticklabels=df.columns)

# Display the heatmap
plt.title("Missing Data Heatmap")
plt.show()
[Figure: "Missing Data Heatmap", one column per attribute, missing values shown in yellow on black]

Part 2: Data Processing

1. Binning and Grouping

Grouping continuous variables, such as age or income, into bins helps simplify analysis and identify trends among specific groups. For instance, grouping users into age ranges can reveal distinct patterns in behavior or preferences across demographics.

Questions:

  1. Bin the age column into categories such as "18-25", "26-35", "36-45", and "46+" to create a new column, age_group. How does the distribution of users vary across these age ranges?
In [50]:
# Define the bin edges
bins = [18, 25, 35, 45, float('inf')]  # 'inf' for ages above 45
labels = ["18-25", "26-35", "36-45", "46+"]  # Labels for the bins

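# Bin the 'age' column into the defined categories
# include_lowest=True keeps age 18 in the first bin; otherwise (18, 25] excludes it
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=True, include_lowest=True)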

# Display the first few rows to check
print(df[['age', 'age_group']].head())
    age age_group
0  30.0     26-35
1  35.0     26-35
2  38.0     36-45
3  23.0     18-25
4  29.0     26-35
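The cell above assigns the labels but does not show the distribution the question asks about. A sketch of that step on made-up ages follows. It also passes `include_lowest=True`, because with `right=True` alone the first interval is (18, 25] and an age of exactly 18 would fall outside every bin.

```python
import pandas as pd

# Toy ages (hypothetical), including the edge values 18, 25, 35, 45, and 46
ages = pd.Series([18, 23, 25, 26, 30, 35, 38, 45, 46])

bins = [18, 25, 35, 45, float("inf")]
labels = ["18-25", "26-35", "36-45", "46+"]

# include_lowest=True makes the first bin closed on the left: [18, 25]
age_group = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)

# Count and percentage share per age group, in bin order
counts = age_group.value_counts().sort_index()
shares = (counts / counts.sum() * 100).round(1)
print(pd.DataFrame({"count": counts, "share_%": shares}))

# Without include_lowest, the age-18 row is left unassigned (NaN)
unassigned = pd.cut(ages, bins=bins, labels=labels, right=True).isna().sum()
print("Unassigned without include_lowest:", unassigned)
```

Applied to the real `age_group` column, the same `value_counts` call would give the distribution across the four ranges.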
  2. Group income into categories like "Low Income," "Medium Income," and "High Income" based on meaningful thresholds (e.g., quartiles). What insights can be derived from these groups?
In [52]:
print(df['income'].value_counts())
income
0          48439
20000       2952
100000      1621
80000       1111
30000       1048
40000       1005
50000        975
60000        736
70000        707
150000       631
1000000      521
250000       149
500000        48
Name: count, dtype: int64
In [53]:
# Plot the income distribution 
plt.figure(figsize=(10, 6))
sns.histplot(df['income'], bins=30, kde=True, color='#FFCC00')  
plt.title('Income Distribution', fontsize=16, color='black')
plt.xlabel('Income', fontsize=12, color='black')
plt.ylabel('Frequency', fontsize=12, color='black')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
[Figure: "Income Distribution" histogram; x-axis: Income, y-axis: Frequency]
In [54]:
# Calculate the quartiles
low_income_threshold = df['income'].quantile(0.25)  # 25th percentile
high_income_threshold = df['income'].quantile(0.75)  # 75th percentile

# Display the thresholds
print(f"Low Income Threshold: {low_income_threshold}")
print(f"High Income Threshold: {high_income_threshold}")
Low Income Threshold: 0.0
High Income Threshold: 0.0

Conclusion: Because 48,439 of the 59,943 users (about 81%) have an income of 0, both the 25th and 75th percentiles are 0.0. Quartiles computed over the full column therefore cannot separate "Low Income," "Medium Income," and "High Income," so the thresholds below are computed from non-zero incomes only.

In [56]:
# Exclude zeros for meaningful categorization
non_zero_income = df[df['income'] > 0]['income']

# Calculate quartile-based thresholds (ignoring 0 values)
low_income_threshold = non_zero_income.quantile(0.25)  # 25th percentile
high_income_threshold = non_zero_income.quantile(0.75)  # 75th percentile

# Define bins for income categories
bins = [df['income'].min(), low_income_threshold, high_income_threshold, df['income'].max()]
labels = ["Low Income", "Medium Income", "High Income"]

# Create the 'income_group' column
df['income_group'] = pd.cut(df['income'], bins=bins, labels=labels, right=True)

print(df['income_group'].value_counts(dropna=False))
income_group
NaN              48439
Medium Income     7203
Low Income        2952
High Income       1349
Name: count, dtype: int64
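The 48,439 NaN entries are the zero-income rows: with `right=True` the first bin excludes its lower edge, so an income of 0 falls outside every bin. If those users should stay visible in group-level summaries, one option is an explicit label. The sketch below uses made-up incomes, and the "Not Reported" label is an assumption, not a category from the notebook.

```python
import pandas as pd

# Toy incomes (hypothetical): 0 stands for "not reported"
income = pd.Series([0, 20000, 50000, 100000, 250000, 0])

# Quartile thresholds from non-zero incomes only
nonzero = income[income > 0]
low, high = nonzero.quantile(0.25), nonzero.quantile(0.75)

# Zero falls outside (0, low], so it becomes NaN here
group = pd.cut(
    income,
    bins=[0, low, high, income.max()],
    labels=["Low Income", "Medium Income", "High Income"],
)

# Add an explicit category for the unreported rows
group = group.cat.add_categories("Not Reported").fillna("Not Reported")
print(group.value_counts())
```

Keeping the unreported rows under their own label makes it clear in later group-by tables that most profiles carry no income information.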

2. Derived Features

Derived features are new columns created based on the existing data to add depth to the analysis. These features often reveal hidden patterns or provide new dimensions to explore.

Questions:

  1. Create a new feature, profile_completeness, by calculating the percentage of non-missing values for each user profile. How complete are most user profiles, and how does completeness vary across demographics?
In [60]:
# Calculate profile completeness as the percentage of non-missing values for each user profile
df['profile_completeness'] = (df.notnull().sum(axis=1) / len(df.columns)) * 100
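One caveat: if the cells are run in the order shown, `df` already contains the derived `age_group` and `income_group` columns at this point, so the formula above counts them too, and `income_group` is NaN for every user with an income of 0. A possible variant restricts the calculation to the fields users actually fill in. The sketch below uses a made-up frame, and `profile_cols` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy profiles (hypothetical data)
toy = pd.DataFrame({
    "age": [22, 35],
    "diet": ["anything", np.nan],
    "pets": [np.nan, np.nan],
    "income": [80000, 0],
})

# A derived column that should not count toward completeness
toy["income_group"] = pd.cut(
    toy["income"], bins=[0, 50000, 100000], labels=["Low Income", "High Income"]
)

# Restrict the calculation to user-entered fields only
profile_cols = ["age", "diet", "pets", "income"]
toy["profile_completeness"] = toy[profile_cols].notnull().mean(axis=1) * 100

print(toy[["profile_completeness"]])
```

Taking the row-wise mean of `notnull()` is equivalent to dividing the non-missing count by the number of selected columns.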
In [61]:
# Analyze completeness by gender
profile_completeness_by_gender = df.groupby('gender')['profile_completeness'].mean().reset_index()

# Display results
print("\nProfile Completeness by Gender:")
print(profile_completeness_by_gender)

# Profile completeness by age
completeness_by_age = df.groupby('age')['profile_completeness'].mean().reset_index()
print("\nProfile Completeness by Age:")
print(completeness_by_age)

# Profile completeness by status
completeness_by_status = df.groupby('status')['profile_completeness'].mean().reset_index()
print("\nProfile Completeness by Status:")
print(completeness_by_status)
Profile Completeness by Gender:
  gender  profile_completeness
0      f             86.454941
1      m             86.662808

Profile Completeness by Age:
     age  profile_completeness
0   23.0             86.478639
1   24.0             86.269035
2   25.0             86.291934
3   26.0             86.314941
4   27.0             86.314361
5   28.0             86.105439
6   29.0             86.386071
7   30.0             86.930099
8   31.0             86.300394
9   32.0             86.373998
10  33.0             86.274276
11  34.0             86.258232
12  35.0             86.435747
13  36.0             86.557835
14  37.0             86.655848
15  38.0             86.909379
16  39.0             86.801689
17  40.0             86.882984
18  41.0             87.454350
19  42.0             87.043401
20  43.0             87.118145
21  44.0             87.265834
22  45.0             87.787509
23  46.0             87.297396

Profile Completeness by Status:
           status  profile_completeness
0       available             87.430507
1         married             86.943973
2  seeing someone             87.112403
3          single             86.530062
4         unknown             80.000000

3. Unit Conversion

Standardizing units across datasets is essential for consistency, especially when working with numerical data. In the context of the Bumble dataset, users’ heights are given in inches, which may not be intuitive for all audiences.

Question:

  1. Convert the height column from inches to centimeters using the conversion factor (1 inch = 2.54 cm). Store the converted values in a new column, height_cm.
In [65]:
# Conversion factor from inches to centimeters
conversion_factor = 2.54

# Convert the height column to centimeters and store in a new column
df['height_cm'] = df['height'] * conversion_factor

# Display the updated dataframe or verify the new column
print(df[['height', 'height_cm']].head())
   height  height_cm
0    68.0     172.72
1    70.0     177.80
2    68.0     172.72
3    71.0     180.34
4    66.0     167.64

Part 3: Data Analysis

1. Demographic Analysis

Understanding the demographics of users is essential for tailoring marketing strategies, improving user experience, and designing features that resonate with the platform’s audience. Insights into gender distribution and relationship status can help Bumble refine its matchmaking algorithms and engagement campaigns.

Questions:

  1. What is the gender distribution (gender) across the platform? Are there any significant imbalances?
In [70]:
# Calculate the gender distribution
gender_distribution = df['gender'].value_counts()

# Calculate the percentage distribution for better insights
gender_percentage = df['gender'].value_counts(normalize=True) * 100

# Display the results with formatted percentage values
print("Gender Distribution (Count):")
print(gender_distribution)
print("\nGender Distribution (Percentage):")
print(gender_percentage.apply(lambda x: f"{x:.2f}"))
Gender Distribution (Count):
gender
m    35827
f    24116
Name: count, dtype: int64

Gender Distribution (Percentage):
gender
m    59.77
f    40.23
Name: proportion, dtype: object

Insights:

  • Male users make up 59.77% of profiles and female users 40.23%.
  • That is roughly 1.5 male users for every female user (35,827 vs 24,116), a substantial imbalance.

Recommendations:

  • Explore ways to attract more female users to balance the gender ratio.
  • Tailor features or marketing strategies to better engage the female audience.
  • Consider developing campaigns that promote gender equality and inclusivity on the platform.
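Whether the gender imbalance is "significant" in the statistical sense can be checked with a one-sample proportion z-test against an even 50/50 split. The sketch below uses the counts printed above and the normal approximation. The choice of test is an assumption made for illustration, not part of the original analysis.

```python
import math

# Counts taken from the gender distribution output
n_male, n_female = 35827, 24116
n = n_male + n_female
p_male = n_male / n

# One-sample z-test of H0: the male share equals 0.5 (normal approximation)
standard_error = math.sqrt(0.5 * 0.5 / n)
z = (p_male - 0.5) / standard_error

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"male share = {p_male:.4f}, z = {z:.1f}, p-value = {p_value:.3g}")
```

With about 60,000 profiles, even small gaps would test as significant, so the practical size of the gap (about 1.5 men per woman) is the more useful figure.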
  2. What are the proportions of users in different status categories (e.g., single, married, seeing someone)? What does this suggest about the platform’s target audience?
In [73]:
# Calculate the status distribution
status_distribution = df['status'].value_counts()

# Calculate the percentage distribution for better insights
status_percentage = df['status'].value_counts(normalize=True) * 100

# Display the results with formatted percentage values
print("Status Distribution (Count):")
print(status_distribution)
print("\nStatus Distribution (Percentage):")
print(status_percentage.apply(lambda x: f"{x:.2f}"))
Status Distribution (Count):
status
single            55694
seeing someone     2064
available          1865
married             310
unknown              10
Name: count, dtype: int64

Status Distribution (Percentage):
status
single            92.91
seeing someone     3.44
available          3.11
married            0.52
unknown            0.02
Name: proportion, dtype: object

Insights:

  • Single (92.91%): The majority of users are "single," highlighting that Bumble primarily serves individuals looking for relationships or connections.
  • Seeing Someone (3.44%): A small percentage of users are "seeing someone," suggesting some are in relationships but remain active on the platform for other interactions.
  • Available (3.11%): Users marked as "available" are likely open to connections, though not necessarily seeking long-term commitments.
  • Married (0.52%): A very small portion of users are "married," possibly using the platform for non-romantic purposes.
  • Unknown (0.02%): A negligible amount of users have an "unknown" status, likely due to incomplete profiles.

Recommendations:

  • Focus on engaging the majority of single users (92.91%) with tailored features.
  • Create tools for users who are seeing someone (3.44%) to enhance relationships.
  • Promote visibility for available users (3.11%) to attract more attention.
  • Optimize for married and unknown users, but with less priority.
  3. How does status vary by gender? For example, what proportion of men and women identify as single?
In [76]:
# Calculate the count and percentage of status by gender
status_gender = df.groupby(['gender', 'status']).size().unstack()

# Normalize by gender for percentage calculation
status_gender_percentage = status_gender.div(status_gender.sum(axis=1), axis=0) * 100

# Display the results with formatted percentage values
print(f"Status Distribution by Gender (Count):\n{status_gender}")
print(f"\nStatus Distribution by Gender (Percentage):\n{status_gender_percentage.apply(lambda x: x.map(lambda v: f'{v:.2f}'))}")  # Format each percentage value
Status Distribution by Gender (Count):
status  available  married  seeing someone  single  unknown
gender                                                     
f             656      135            1003   22318        4
m            1209      175            1061   33376        6

Status Distribution by Gender (Percentage):
status available married seeing someone single unknown
gender                                                
f           2.72    0.56           4.16  92.54    0.02
m           3.37    0.49           2.96  93.16    0.02

Insights:

  • The majority of users, both male (93.16%) and female (92.54%), identify as single, reflecting the platform's primary target audience.
  • Men have a slightly higher proportion of single users compared to women.
  • Women are more likely to be "seeing someone" (4.16% vs 2.96% for men), suggesting more relationship activity or engagement.
  • Few users report being available (2.72% for women, 3.37% for men), indicating a smaller portion is actively seeking a partner, which may highlight either intentional disengagement or potential limitations in profile activity.
  • Married and unknown statuses are minimal, reinforcing that Bumble is predominantly focused on non-married, single individuals seeking connections.

Recommendations:

  • Focus marketing efforts on the single users, as they make up the largest group.
  • Consider adding features that help users who are seeing someone engage more, such as relationship tools or connections.
  • Improve the available status section to encourage more users to indicate if they are looking for a partner.
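The group-by, unstack, and divide steps used for the status-by-gender table can also be written with `pd.crosstab`, which normalizes each row in one call and fills absent combinations with 0 rather than NaN. A sketch on made-up rows:

```python
import pandas as pd

# Toy profiles (hypothetical data)
toy = pd.DataFrame({
    "gender": ["f", "f", "f", "m", "m", "m", "m"],
    "status": ["single", "single", "seeing someone",
               "single", "single", "single", "available"],
})

# normalize="index" makes each gender's row sum to 1; multiply for percentages
status_by_gender = pd.crosstab(toy["gender"], toy["status"], normalize="index") * 100

print(status_by_gender.round(2))
```

Row normalization answers "what share of each gender has this status", which is the comparison made in the insights above.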

2. Correlation Analysis

Correlation analysis helps uncover relationships between variables, guiding feature engineering and hypothesis generation. For example, understanding how age correlates with income or word count in profiles can reveal behavioral trends that inform platform design.

Questions:

  1. What are the correlations between numerical columns such as age, income, and gender? Are there any strong positive or negative relationships?
In [81]:
# Encode gender as numeric (1 for male, 0 for female)
df['gender_encoded'] = df['gender'].map({'m': 1, 'f': 0})

# Calculate the correlation matrix
correlation_matrix = df[['age', 'income', 'gender_encoded']].corr()

# Display the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)
Correlation Matrix:
                     age    income  gender_encoded
age             1.000000  0.008459       -0.016291
income          0.008459  1.000000        0.074604
gender_encoded -0.016291  0.074604        1.000000
  • Correlation near 1 or -1: Strong positive or negative relationship.
  • Correlation near 0: No significant linear relationship.
  • Intermediate values: Interpret based on the context (0.3 to 0.7 or -0.3 to -0.7 typically means moderate relationships).
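The rule of thumb above can be turned into a small helper and applied to the three coefficients from the matrix. The function name `correlation_strength` and the exact cut-offs (0.3 and 0.7, taken from the guideline) are illustrative assumptions, since the thresholds are conventions rather than fixed rules.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the 0.3 / 0.7 rule of thumb."""
    magnitude = abs(r)
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak or none"

# Coefficients copied from the correlation matrix output
pairs = {
    ("age", "income"): 0.008459,
    ("age", "gender_encoded"): -0.016291,
    ("income", "gender_encoded"): 0.074604,
}

labels = {pair: correlation_strength(r) for pair, r in pairs.items()}
for (a, b), label in labels.items():
    print(f"{a} vs {b}: {label}")
```

All three pairs fall in the lowest band, which matches the insights that follow.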

Insights:

  • Age vs Income: The correlation is 0.0085, showing no significant relationship between age and income on the platform.
  • Age vs Gender: The correlation is -0.0163, indicating no significant relationship between age and gender.
  • Income vs Gender: The correlation is 0.0746, a very weak positive relationship, suggesting no major income differences between genders.

Recommendations:

  • Targeting Age and Income: Since there's no significant correlation between age and income, targeting users based on these factors may not be effective for matchmaking or feature design.
  • Marketing Strategies: As gender and income have minimal correlation, Bumble could continue with gender-neutral strategies and focus on other aspects like behavior and preferences for more personalized marketing.
  • Feature Design: Age and gender do not show strong relationships with other features, suggesting the need to focus on user behavior and engagement for better user experience and feature development.
  • Further Exploration: Explore non-linear relationships or factors like user activity and location for more meaningful insights on user demographics.
  • Behavioral Segmentation: Behavioral patterns might provide more actionable insights than age and gender, allowing for better user segmentation and targeted engagement.

In summary, age, income, and gender show no meaningful linear correlation with one another, so exploring behavioral or activity-based features could do more to improve personalization and user experience on the platform.
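For the suggested follow-up on non-linear relationships, a rank-based (Spearman) correlation is one simple starting point, since it captures monotonic relationships that Pearson's coefficient understates. The sketch below computes it as Pearson's correlation on ranks, using made-up data where one variable grows exponentially with the other.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): y rises with x, but far from linearly
x = pd.Series(np.arange(1, 11), dtype=float)
y = np.exp(x)

# Pearson measures linear association only
pearson = x.corr(y)

# Spearman is Pearson's correlation applied to the ranks of the values
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients on real columns would hint at a monotonic but non-linear relationship worth plotting.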

3. Diet and Lifestyle Analysis

Lifestyle attributes such as diet and drinks provide insights into user habits and preferences. Analyzing these factors helps identify compatibility trends and inform product features like filters or match recommendations.

Questions:

  1. How do dietary preferences (diet) distribute across the platform? For example, what percentage of users identify as vegetarian, vegan, or follow "anything" diets?
In [86]:
# Calculate the count and percentage of diet preferences
diet_distribution = df['diet'].value_counts()
diet_percentage = df['diet'].value_counts(normalize=True) * 100

# Combine count and percentage into a single DataFrame for visualization
diet_combined = pd.DataFrame({
    'Count': diet_distribution,
    'Percentage': diet_percentage.apply(lambda x: f"{x:.2f}")  # Format percentage to 2 decimal places
})

# Display the combined count and formatted percentage values
print(diet_combined)
                     Count Percentage
diet                                 
mostly anything      16585      46.65
anything              6183      17.39
strictly anything     5113      14.38
mostly vegetarian     3444       9.69
mostly other          1007       2.83
strictly vegetarian    875       2.46
vegetarian             667       1.88
strictly other         452       1.27
mostly vegan           338       0.95
other                  331       0.93
strictly vegan         228       0.64
vegan                  136       0.38
mostly kosher           86       0.24
mostly halal            48       0.14
strictly halal          18       0.05
strictly kosher         18       0.05
halal                   11       0.03
kosher                  11       0.03

Insights:

1.Diet Preferences Distribution:

  • The largest group (46.65%) of users falls under the "mostly anything" category, indicating a broad preference for an unrestricted diet.
  • Other significant groups include "anything" (17.39%) and "strictly anything" (14.38%), so the three "anything" categories together cover about 78% of users who stated a diet.
  • Smaller groups are represented by users with vegetarian, vegan, kosher, halal, or other preferences, with each category having less than 10% of the user base.

2.Minority Diet Preferences:

  • The least common diets are "halal" and "kosher" (about 0.03% each), followed by "strictly halal" and "strictly kosher" (about 0.05% each), reflecting niche dietary choices.

3.Vegetarian and Vegan Preferences:

  • There is a noticeable portion of users who prefer vegetarian or vegan diets: 13.74% combined across "mostly vegetarian," "strictly vegetarian," "mostly vegan," and "strictly vegan," or about 16% when plain "vegetarian" and "vegan" are included. However, these categories still represent a smaller proportion compared to non-vegetarian preferences.

Recommendations:

  • Flexible Filters: Add filters that allow users to select different diet types easily, as many prefer a flexible diet.
  • Focus on Vegetarians/Vegans: Offer special features or matching options for users who follow vegetarian or vegan diets.
  • Support Special Diets: Include options for people who follow halal or kosher diets, even though they are less common.
  • Better Profile Information: Help users clearly state their diet preferences to improve match accuracy.
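
The modifier-level table above can also be collapsed into base diet categories by dropping the "mostly" and "strictly" prefixes. The following is a minimal sketch that works from the counts printed above rather than the full DataFrame; the names `diet_counts`, `base_counts`, and `base_share` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the diet table printed above
diet_counts = pd.Series({
    'mostly anything': 16585, 'anything': 6183, 'strictly anything': 5113,
    'mostly vegetarian': 3444, 'mostly other': 1007, 'strictly vegetarian': 875,
    'vegetarian': 667, 'strictly other': 452, 'mostly vegan': 338,
    'other': 331, 'strictly vegan': 228, 'vegan': 136,
    'mostly kosher': 86, 'mostly halal': 48, 'strictly halal': 18,
    'strictly kosher': 18, 'halal': 11, 'kosher': 11,
})

# The base category is the last word of each label ("mostly vegan" -> "vegan")
base = diet_counts.index.str.split().str[-1].to_numpy()

# Sum counts per base category and convert to percentages
base_counts = diet_counts.groupby(base).sum().sort_values(ascending=False)
base_share = (base_counts / base_counts.sum() * 100).round(2)

print(pd.DataFrame({'Count': base_counts, 'Percentage': base_share}))
```

On these counts, the three "anything" variants sum to 27,881 users, about 78% of those who stated a diet, which is easier to read than the 18-row breakdown when the modifier is not of interest.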
  2. How do drinking habits (drinks) vary across different diet categories? Are users with stricter diets (e.g., vegan) less likely to drink?
In [89]:
# Group by diet and drinks, then calculate the count
diet_drinks_count = df.groupby(['diet', 'drinks']).size().unstack()

# Calculate percentage distribution by diet
diet_drinks_percentage = diet_drinks_count.apply(lambda x: x / x.sum() * 100, axis=1)

# Display the results
print("Diet vs Drinking Habits (Count):")
print(diet_drinks_count)
print("\nDiet vs Drinking Habits (Percentage):")
print(diet_drinks_percentage.apply(lambda x: x.map(lambda v: f"{v:.2f}")))  # Format percentage to 2 decimal places
Diet vs Drinking Habits (Count):
drinks               desperately  not at all   often  rarely  socially  \
diet                                                                     
anything                    21.0       295.0   574.0   510.0    4534.0   
halal                        NaN         4.0     1.0     NaN       4.0   
kosher                       NaN         1.0     NaN     2.0       7.0   
mostly anything             66.0       842.0  1386.0  1537.0   12277.0   
mostly halal                 3.0        10.0     2.0     8.0      16.0   
mostly kosher                1.0         7.0     4.0    17.0      50.0   
mostly other                 8.0        91.0    49.0   176.0     646.0   
mostly vegan                 2.0        40.0    22.0    62.0     193.0   
mostly vegetarian           19.0       195.0   233.0   466.0    2388.0   
other                        1.0        35.0    23.0    60.0     192.0   
strictly anything           66.0       182.0   685.0   318.0    3677.0   
strictly halal               1.0         7.0     2.0     1.0       5.0   
strictly kosher              6.0         4.0     2.0     3.0       1.0   
strictly other              14.0        67.0    34.0    63.0     244.0   
strictly vegan               8.0        58.0    19.0    37.0      94.0   
strictly vegetarian          9.0        85.0    82.0   121.0     543.0   
vegan                        2.0        23.0    19.0    17.0      65.0   
vegetarian                   6.0        37.0    59.0    67.0     446.0   

drinks               very often  
diet                             
anything                   58.0  
halal                       NaN  
kosher                      1.0  
mostly anything           129.0  
mostly halal                4.0  
mostly kosher               5.0  
mostly other                5.0  
mostly vegan                3.0  
mostly vegetarian          20.0  
other                       1.0  
strictly anything          59.0  
strictly halal              2.0  
strictly kosher             2.0  
strictly other             10.0  
strictly vegan              3.0  
strictly vegetarian         9.0  
vegan                       1.0  
vegetarian                  7.0  

Diet vs Drinking Habits (Percentage):
drinks              desperately not at all  often rarely socially very often
diet                                                                        
anything                   0.35       4.92   9.58   8.51    75.67       0.97
halal                       nan      44.44  11.11    nan    44.44        nan
kosher                      nan       9.09    nan  18.18    63.64       9.09
mostly anything            0.41       5.19   8.54   9.47    75.61       0.79
mostly halal               6.98      23.26   4.65  18.60    37.21       9.30
mostly kosher              1.19       8.33   4.76  20.24    59.52       5.95
mostly other               0.82       9.33   5.03  18.05    66.26       0.51
mostly vegan               0.62      12.42   6.83  19.25    59.94       0.93
mostly vegetarian          0.57       5.87   7.02  14.03    71.91       0.60
other                      0.32      11.22   7.37  19.23    61.54       0.32
strictly anything          1.32       3.65  13.74   6.38    73.73       1.18
strictly halal             5.56      38.89  11.11   5.56    27.78      11.11
strictly kosher           33.33      22.22  11.11  16.67     5.56      11.11
strictly other             3.24      15.51   7.87  14.58    56.48       2.31
strictly vegan             3.65      26.48   8.68  16.89    42.92       1.37
strictly vegetarian        1.06      10.01   9.66  14.25    63.96       1.06
vegan                      1.57      18.11  14.96  13.39    51.18       0.79
vegetarian                 0.96       5.95   9.49  10.77    71.70       1.13

Insights:

  • General Trends: "Socially" is the most common drinking habit in nearly every diet category. The exceptions are the very small "strictly halal" and "strictly kosher" groups (18 users each) and "halal," where "socially" ties with "not at all." In the two largest groups, "anything" and "mostly anything," over 75% of users drink socially.
  • Stricter Diets and Drinking: Among "vegan," "mostly vegan," and "strictly vegan" users, a higher share drink "not at all" or "rarely" (roughly 31%-43% combined) than among "anything" and "mostly anything" users (about 13%-15%). "Strictly halal" and "strictly kosher" users also have a large share drinking "not at all" (about 39% and 22%, respectively), though these percentages rest on only 18 users per group.
  • Frequent Drinking: The combined share drinking "often" or "very often" is similar, roughly 9%-11%, for "anything," "mostly anything," "strictly vegan," and "strictly vegetarian" users. Among the large groups, "strictly anything" users are highest at about 15%.

Recommendations:

  • Tailored Marketing for Stricter Diets: Highlight non-alcoholic options or events catering to users with vegan, halal, or kosher preferences to increase inclusivity and engagement.
  • Social Events Targeting Majority Preferences: Focus on "social drinking" campaigns since it aligns with most users' preferences across diets.
  • Educational Campaigns: For stricter diet groups, consider awareness campaigns about low-alcohol or non-alcoholic beverages that align with their dietary restrictions.

Overall Conclusion:

People with stricter diets, like vegans or those following halal or kosher, are more likely to abstain from alcohol or drink only rarely. People with less restrictive diets, like "anything" or "mostly anything," overwhelmingly drink socially. This suggests that stricter diets are linked to lower alcohol consumption, although the halal and kosher groups are too small to support firm conclusions.
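
To make the comparison above concrete, the share of low-frequency drinkers ("not at all" plus "rarely") can be computed per diet alongside the group size. This is a rough sketch using a few rows copied from the count table above; the 30-user cut-off `MIN_GROUP_SIZE` is an assumption chosen for illustration, not a value from the original analysis.

```python
import pandas as pd

# Rows copied from the "Diet vs Drinking Habits (Count)" table above
columns = ['desperately', 'not at all', 'often', 'rarely', 'socially', 'very often']
counts = pd.DataFrame(
    [
        [21, 295, 574, 510, 4534, 58],
        [66, 842, 1386, 1537, 12277, 129],
        [8, 58, 19, 37, 94, 3],
        [1, 7, 2, 1, 5, 2],
        [6, 4, 2, 3, 1, 2],
    ],
    index=['anything', 'mostly anything', 'strictly vegan',
           'strictly halal', 'strictly kosher'],
    columns=columns,
)

MIN_GROUP_SIZE = 30  # assumed threshold for flagging small groups

users = counts.sum(axis=1)
low_drinking = counts[['not at all', 'rarely']].sum(axis=1)

summary = pd.DataFrame({
    'users': users,
    'low_drinking_pct': (low_drinking / users * 100).round(2),
})
# Flag groups whose percentages rest on very few users
summary['small_sample'] = summary['users'] < MIN_GROUP_SIZE

print(summary)
```

Reporting the group size next to each percentage keeps the 18-user halal and kosher groups from being read with the same confidence as the groups with thousands of users.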

4. Geographical Insights

Analyzing geographical data helps Bumble understand its user base distribution, enabling targeted regional campaigns and feature localization. For instance, identifying the top cities with active users can guide marketing efforts in those areas.

Questions:

  1. Extract city and state information from the location column. What are the top 5 cities and states with the highest number of users?
In [94]:
# Splitting the location column into city and state
df[['city', 'state']] = df['location'].str.split(',', n=1, expand=True)

# Cleaning up extra spaces in city and state
df['city'] = df['city'].str.strip()
df['state'] = df['state'].str.strip()

# Calculating the top 5 cities and states
top_cities = df['city'].value_counts().head(5)
top_states = df['state'].value_counts().head(5)

# Displaying the results
print("Top 5 Cities with the highest number of users:")
print(top_cities)

print("\nTop 5 States with the highest number of users:")
print(top_states)
Top 5 Cities with the highest number of users:
city
san francisco    31064
oakland           7214
berkeley          4210
san mateo         1331
palo alto         1064
Name: count, dtype: int64

Top 5 States with the highest number of users:
state
california       59853
new york            17
illinois             8
massachusetts        5
texas                4
Name: count, dtype: int64

Insights:

City-Level Insights:

  • San Francisco dominates with 31,064 users, far surpassing other cities, indicating it is a key hub for Bumble users.
  • Oakland (7,214 users) and Berkeley (4,210 users) are also significant, suggesting strong engagement within the Bay Area. Cities like San Mateo (1,331 users) and Palo Alto (1,064 users) show moderate user presence, representing potential areas for growth.

State-Level Insights:

  • California leads overwhelmingly with 59,853 users, showcasing it as the primary market for Bumble in this dataset.
  • States like New York (17 users), Illinois (8 users), Massachusetts (5 users), and Texas (4 users) have minimal user representation in the dataset, indicating under-penetration or data imbalance.

Recommendations:

  • Focus on California: Build on the strong presence in California by enhancing user engagement and launching new features in cities like San Francisco, Oakland, and Berkeley.
  • Grow in Other States: Increase marketing efforts in underrepresented states like New York, Texas, and Illinois to expand Bumble's reach.
  • Expand in Smaller Cities: Target mid-sized cities like San Mateo and Palo Alto for growth opportunities.
  • Check Data Gaps: Verify low user numbers in other states to ensure data accuracy and identify potential growth areas.
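
The "Check Data Gaps" point depends partly on how the location string is split. The sketch below shows, on a handful of made-up location strings (not rows from the dataset), how `str.split(',', n=1, expand=True)` behaves when a value has no comma or more than one.

```python
import pandas as pd

# Hypothetical location strings for illustration only
locations = pd.Series([
    'san francisco, california',
    'oakland, california',
    'new york, new york',
    'london, england, united kingdom',  # more than one comma
    'berkeley',                         # no comma at all
])

# Split on the first comma only, as in the cell above
parts = locations.str.split(',', n=1, expand=True)
city = parts[0].str.strip()
state = parts[1].str.strip()

result = pd.DataFrame({'location': locations, 'city': city, 'state': state})
print(result)

# Rows where no state could be extracted are worth inspecting separately
print(result[result['state'].isna()])
```

With `n=1`, everything after the first comma lands in the state column, so a three-part location keeps its country inside `state`, and a value without a comma yields a missing state instead of raising an error.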
  2. How does age vary across the top cities? Are certain cities dominated by younger or older users?
In [97]:
# Group data by city and calculate age statistics
age_stats = df[df['city'].isin(['san francisco', 'oakland', 'berkeley', 'san mateo', 'palo alto'])]
age_summary = (
    age_stats.groupby('city')['age']
    .agg(['mean', 'median', 'min', 'max'])
    .reset_index()
    .rename(columns={'mean': 'Mean Age', 'median': 'Median Age', 'min': 'Min Age', 'max': 'Max Age'})
)

print(age_summary)
            city   Mean Age  Median Age  Min Age  Max Age
0       berkeley  30.062233        30.0     23.0     46.0
1        oakland  31.518991        30.0     23.0     46.0
2      palo alto  29.953008        30.0     23.0     46.0
3  san francisco  30.891386        30.0     23.0     46.0
4      san mateo  31.147258        30.0     23.0     46.0

Insights:

Age Distribution:

  • The average and median age across the top cities are fairly consistent, around 30-31 years, with a minimum of 23 and a maximum of 46 years.
  • Berkeley and Palo Alto have slightly younger populations, with lower mean ages compared to other cities.
  • Oakland and San Mateo have marginally older users on average.
  • Homogeneity: There is minimal variation in age distribution across these cities, indicating a fairly uniform user demographic.

Recommendations:

  • Tailored Features: Develop features appealing to users in their late 20s to early 30s, as this is the dominant age range.
  • Localized Engagement: Focus on slightly different themes in cities like Berkeley (younger, student-friendly) and Oakland or San Mateo (professional, career-focused).
  • Broaden Demographics: Consider campaigns to attract a broader age range, especially older users, to diversify the platform further.
  3. What are the average income levels in the top states or cities? Are there regional patterns in reported income?
In [100]:
# Filter top cities and states
top_cities = ['san francisco', 'oakland', 'berkeley', 'san mateo', 'palo alto']
top_states = ['california', 'new york', 'illinois', 'massachusetts', 'texas']

# Calculate mean income for top cities
city_income = df[df['city'].isin(top_cities)].groupby('city')['income'].mean().reset_index()
city_income.rename(columns={'income': 'Average Income'}, inplace=True)

# Calculate mean income for top states
state_income = df[df['state'].isin(top_states)].groupby('state')['income'].mean().reset_index()
state_income.rename(columns={'income': 'Average Income'}, inplace=True)

# Display results
print("Average Income by Top Cities:")
print(city_income)

print("\nAverage Income by Top States:")
print(state_income)
Average Income by Top Cities:
            city  Average Income
0       berkeley    17372.921615
1        oakland    22586.637095
2      palo alto    19332.706767
3  san francisco    20150.012877
4      san mateo    22779.864763

Average Income by Top States:
           state  Average Income
0     california    20044.943445
1       illinois        0.000000
2  massachusetts     6000.000000
3       new york    31764.705882
4          texas     5000.000000

Insights:

Top Cities:

  • The highest average income is in San Mateo and Oakland, with average incomes of approximately 22,780 and 22,587 respectively.
  • Berkeley has the lowest average income among the top cities at around 17,373.

Top States:

  • New York has the highest average income (~31,765), well above California (~20,045), but it is based on only 17 users.
  • California's average (~20,045) is the only state-level figure backed by a large sample (59,853 users).
  • Texas, Massachusetts, and Illinois show very low average incomes (5,000, 6,000, and 0, respectively), which most likely reflects missing or unreported income among a handful of users rather than real income levels.

Recommendations:

  • Targeted Campaigns: Focus on California and New York for regions with higher average incomes. Tailor marketing or product offerings to meet the demands of users with higher incomes.
  • Data Quality: Investigate missing or erroneous income data for Illinois and Texas, as the reported averages seem off, possibly due to incomplete or missing entries.

Regional Patterns:

  • The highest average incomes are concentrated in California and New York, which may align with areas like the Bay Area (San Mateo, San Francisco, Palo Alto) that are known for high living costs and tech industries.
  • Texas and Illinois show unusually low income averages, which may indicate incomplete or erroneous data, rather than regional trends.
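
One way to check whether a low state average reflects unreported income is to report the group size and the number of users with a non-zero income next to the mean. The sketch below uses made-up rows and assumes, as the income section of this notebook does, that an income of 0 means "not reported"; the column names in `summary` are illustrative.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'state': ['california'] * 4 + ['illinois'] * 2 + ['texas'] * 2,
    'income': [0, 20000, 50000, 0, 0, 0, 0, 10000],
})

# Treat 0 as "not reported" by masking it to NaN (assumption)
toy['reported_income'] = toy['income'].where(toy['income'] > 0)

summary = toy.groupby('state').agg(
    users=('income', 'size'),                   # all users in the state
    reporting=('reported_income', 'count'),     # users with a non-zero income
    mean_all=('income', 'mean'),                # mean including zeros
    mean_reported=('reported_income', 'mean'),  # mean over reporters only
)

print(summary)
```

A state where `reporting` is 0 produces a `mean_all` of 0 and a missing `mean_reported`, which makes it clear that the average says nothing about actual income in that state.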

5. Height Analysis

Physical attributes like height are often considered important in dating preferences. Analyzing height patterns helps Bumble understand user demographics and preferences better.

Questions:

  1. What is the average height of users across different gender categories?
In [105]:
# Group by gender and calculate average height in cm
gender_height_cm = df.groupby('gender')['height_cm'].mean().reset_index()
gender_height_cm.rename(columns={'height_cm': 'Average Height (cm)'}, inplace=True)

# Display results with two decimal places
print("Average Height by Gender (in cm):")
print(gender_height_cm.round(2))
Average Height by Gender (in cm):
  gender  Average Height (cm)
0      f               168.45
1      m               176.42
  2. How does height vary by age_group? Are there noticeable trends among younger vs. older users?
In [107]:
# Calculate average height by age group with observed=False to avoid future warnings
height_by_age_group = df.groupby('age_group', observed=False)['height_cm'].mean().reset_index()

# Rename column for clarity
height_by_age_group.rename(columns={'height_cm': 'Average Height (cm)'}, inplace=True)

# Display results with rounded values
print("Average Height by Age Group (in cm):")
print(height_by_age_group.round(2))
Average Height by Age Group (in cm):
  age_group  Average Height (cm)
0     18-25               173.43
1     26-35               173.15
2     36-45               173.23
3       46+               173.44

Insights:

  • Consistent Average Height Across Age Groups: The average height across all age groups remains consistent, with minor variations (173.15 cm to 173.44 cm).
  • Slight Increase in Older Age Group: Users aged 46+ have the highest average height (173.44 cm), though the difference is negligible compared to other groups.
  • Stable Trend: There is no significant trend in height variation between younger and older users, indicating height is largely independent of age groups in this dataset.

Recommendations:

  • Focus on Broader Demographics: Since height shows minimal variation across age groups, it may not be a critical factor for age-specific insights or strategies.
  • Cross-Analyze with Other Factors: Combine height data with other attributes like location, gender, or income to explore more actionable trends.
  • Maintain Data Quality: Ensure accurate measurement and consistency of height data for future analyses.
  3. What is the distribution of height within body_type categories (e.g., athletic, curvy, thin)? Do the distributions align with expectations?
In [110]:
# Calculate the mean height for each body_type category
height_by_body_type = df.groupby('body_type')['height_cm'].mean().reset_index()
height_by_body_type.rename(columns={'height_cm': 'Average Height (cm)'}, inplace=True)

# Display results with rounded values
print("Average Height by Body Type (in cm):")
print(height_by_body_type.round(2))
Average Height by Body Type (in cm):
         body_type  Average Height (cm)
0   a little extra               173.85
1         athletic               175.27
2          average               173.04
3            curvy               168.71
4              fit               173.57
5     full figured               169.99
6           jacked               174.12
7       overweight               174.05
8   rather not say               171.90
9           skinny               173.49
10            thin               172.45
11         used up               174.47

Insights:

  • Athletic body type has the highest average height at 175.27 cm, suggesting that users with this body type tend to be slightly taller than others.
  • Curvy body type has the lowest average height at 168.71 cm, which could indicate a trend toward shorter users within this category.
  • Most other body types (a little extra, average, fit, jacked, overweight, skinny, and used up) have average heights between 173.04 cm and 174.47 cm, showing no significant variation in height across these categories.
  • The rather not say category averages 171.90 cm, slightly below most categories in which users named a specific body type. Users with no body type recorded ("Not Specified") are reported at 172.26 cm, although that group is not listed in the table above.

Recommendations:

  • Target Active Users: For users with an athletic body type, consider promoting fitness-related features or content.
  • Inclusive Features: Offer more inclusive options for users with a curvy body type to make them feel represented.
  • Equal Representation: Make sure all body types, like a little extra and full figured, are equally represented in your campaigns.
  • Clarify Body Types: Since many users didn’t specify their body type, encourage users to be more specific for better personalization.
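
Because average height differs by gender (168.45 cm vs. 176.42 cm above), part of the gap between body types may simply reflect which gender tends to choose each label. A possible check, sketched here on made-up rows rather than the real data, is to break the averages down by gender and look at the gender mix within each body type.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'gender':    ['f', 'f', 'f', 'm', 'm', 'm', 'm', 'f'],
    'body_type': ['curvy', 'curvy', 'athletic', 'athletic',
                  'athletic', 'average', 'average', 'average'],
    'height_cm': [165.0, 167.0, 170.0, 180.0, 178.0, 176.0, 177.0, 166.0],
})

# Overall mean height per body type, as in the cell above
overall = toy.groupby('body_type')['height_cm'].mean()

# Mean height per body type, split by gender
by_gender = toy.groupby(['body_type', 'gender'])['height_cm'].mean().unstack()

# Share of female users within each body type
share_female = toy['gender'].eq('f').groupby(toy['body_type']).mean()

print(overall.round(2))
print(by_gender.round(2))
print(share_female.round(2))
```

If a body type is chosen almost entirely by one gender, its overall average mostly reproduces that gender's average height, so the within-gender columns are the fairer comparison.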

6. Income Analysis

Income is often an important factor for users on dating platforms. Understanding its distribution and relationship with other variables helps refine features like user search filters or personalized recommendations.

Questions:

  1. What is the distribution of income across the platform? Are there specific income brackets that dominate? (don't count 0)
In [115]:
# Filter out rows where income is 0
df_nonzero_income = df[df['income'] > 0]

# Calculate the income distribution
income_distribution = df_nonzero_income['income'].describe()

# Display the result
print(income_distribution.round(2))
count      11504.00
mean      104394.99
std       201433.53
min        20000.00
25%        20000.00
50%        50000.00
75%       100000.00
max      1000000.00
Name: income, dtype: float64
In [116]:
# Define bins and labels for income brackets
income_bins = [0, 25000, 50000, 75000, 100000, float('inf')]
income_labels = ['<25K', '25K-50K', '50K-75K', '75K-100K', '100K+']

# Filter out zero income and create income brackets
df_nonzero_income = df[df['income'] > 0].copy()
df_nonzero_income['income_bracket'] = pd.cut(df_nonzero_income['income'], bins=income_bins, labels=income_labels, right=False)

# Get the count of users in each income bracket
income_distribution = df_nonzero_income['income_bracket'].value_counts().sort_index()
print(income_distribution)
income_bracket
<25K        2952
25K-50K     2053
50K-75K     2418
75K-100K    1111
100K+       2970
Name: count, dtype: int64

Insights:

  • The two largest brackets are 100K+ (2,970 users) and <25K (2,952 users), which together account for about 51% of users who reported an income. The platform is therefore used by both lower-income and high-income individuals.
  • 50K-75K has the third-highest number of users (2,418), indicating a significant portion of users fall in the middle-income range.
  • 25K-50K and 75K-100K have fewer users (2,053 and 1,111, respectively), so the distribution is concentrated at the two extremes rather than evenly spread.

Recommendations:

  • Target Marketing: Create campaigns that cater to both low and high-income users, offering exclusive features to high-income users and affordable options to lower-income ones.
  • Personalized Filters: Offer search filters that let users find others within similar income groups, helping them connect with like-minded individuals.
  • Premium Features: Consider adding extra features or subscription plans for the high-income users (100K+), as this group is quite large.
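
Because the brackets are built with `right=False`, each bin includes its lower edge and excludes its upper edge, so an income of exactly 50,000 falls in 50K-75K rather than 25K-50K. The short sketch below checks this on a few boundary values (chosen for illustration) and converts the bracket counts printed above into percentages.

```python
import pandas as pd

income_bins = [0, 25000, 50000, 75000, 100000, float('inf')]
income_labels = ['<25K', '25K-50K', '50K-75K', '75K-100K', '100K+']

# Boundary values chosen to show which side of each edge is included
sample = pd.Series([20000, 25000, 50000, 75000, 100000, 1000000])
brackets = pd.cut(sample, bins=income_bins, labels=income_labels, right=False)
print(pd.DataFrame({'income': sample, 'bracket': brackets}))

# Bracket counts copied from the output above, converted to shares
bracket_counts = pd.Series([2952, 2053, 2418, 1111, 2970], index=income_labels)
bracket_share = (bracket_counts / bracket_counts.sum() * 100).round(2)
print(bracket_share)
```

Since reported incomes in this dataset appear to take round values (the minimum and first quartile are both 20,000), the choice of `right=False` versus the default `right=True` would move every user sitting exactly on a bin edge into a different bracket.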
  2. How does income vary by age_group and gender? Are older users more likely to report higher incomes?
In [119]:
# Group by age group and gender and calculate the mean income
income_by_age_gender = df.groupby(['age_group', 'gender'],observed=True)['income'].mean().reset_index()

# Pivot the table to have separate columns for male and female income
income_by_age_gender_pivot = income_by_age_gender.pivot(index='age_group', columns='gender', values='income')

# Display the result
print(income_by_age_gender_pivot.round(2))
gender            f         m
age_group                    
18-25      11249.66  25219.91
26-35      11089.33  25623.37
36-45      11319.26  27787.15
46+        13865.55  30676.47

Insights:

  • Gender Difference: Across all age groups, male users report significantly higher average incomes compared to female users.
  • Age Trends: Average income is highest in the 46+ age group for both genders. Male income rises steadily with age, while female income is roughly flat until the 46+ group.
  • Female Income Stability: Female users show relatively stable average incomes across age groups, with only a slight increase in the 46+ age group.
  • Male Income Growth: Male users experience a steady increase in income as they age, with a substantial jump in the 46+ category.

Recommendations:

  • Targeted Campaigns: Consider age-based campaigns emphasizing financial stability for older users, particularly males.
  • Gender-Specific Insights: Develop initiatives that address income gaps, focusing on career growth opportunities or financial education for female users.
  • Premium Features: Offer premium features or services tailored to older users who may have higher disposable incomes, especially in the 46+ age group.
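
The averages above are computed over all users, including those with an income of 0, so a group's mean equals its reporting rate multiplied by the mean among users who reported an income. The sketch below separates the two on made-up rows; it assumes that 0 means "not reported", and the column names are illustrative.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'gender': ['f', 'f', 'f', 'f', 'm', 'm', 'm', 'm'],
    'income': [0, 0, 0, 40000, 0, 0, 50000, 30000],
})

reported = toy['income'] > 0  # assumption: 0 means income was not reported

by_gender = pd.DataFrame({
    # Mean over everyone, zeros included (what the cell above computes)
    'mean_all_users': toy.groupby('gender')['income'].mean(),
    # Fraction of users in each group who reported an income
    'reporting_rate': reported.groupby(toy['gender']).mean(),
    # Mean over users who reported an income
    'mean_reported': toy[reported].groupby('gender')['income'].mean(),
})

print(by_gender)
```

In this toy example both groups have the same mean among reporters, and the gap in `mean_all_users` comes entirely from the reporting rate. Running the same breakdown on the real data would show how much of the observed gender gap is due to reporting behavior rather than income itself.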

Part 4: Data Visualization

1. Age Distribution

Understanding the distribution of user ages can reveal whether the platform caters to specific demographics or age groups. This insight is essential for targeted marketing and user experience design.

Questions:

  1. Plot a histogram of age with a vertical line indicating the mean age. What does the distribution reveal about the most common age group on the platform?
In [125]:
# Calculate mean age
mean_age = df['age'].mean()

# Plot the histogram with Bumble theme color
plt.figure(figsize=(8, 5))
plt.hist(df['age'], bins=15, color='#FFCC00', edgecolor='black', alpha=0.9)
plt.axvline(mean_age, color='red', linestyle='dashed', linewidth=2, label=f'Mean Age: {mean_age:.2f}')

# Add labels and title
plt.title('Age Distribution of Bumble Users', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)
plt.legend()
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()
[Figure: histogram "Age Distribution of Bumble Users" with a dashed line at the mean age]

Insights:

  • The age distribution reveals that the majority of Bumble users fall within the 25–35 age range, as indicated by the peak of the histogram.
  • The mean age of 30.90 suggests that the platform primarily caters to millennials and younger adults, aligning with the target demographic for dating apps.

Recommendations:

  • Focus ads and promotions on users aged 25–35, as they are the largest group.
  • Add features and content that appeal to people in their 20s and early 30s.
  • Create offers or events for older users (35+) to attract more variety in age groups.
  2. How does the age distribution differ by gender? Are there age groups where one gender is more prevalent?
In [128]:
# Set the size of the plot
plt.figure(figsize=(8, 5))

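# hue_order fixes the order of the legend entries so they can be relabeled safely
ax = sns.histplot(data=df, x='age', hue='gender', hue_order=['f', 'm'], bins=15, kde=False, 
             multiple="dodge", palette={'f': '#87CEEB', 'm': '#FFCC00'}, 
             edgecolor='black', alpha=0.8)

# Add labels and title
plt.title('Age Distribution by Gender', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)

# Relabel seaborn's own legend; plt.legend(labels=...) can attach labels to the wrong colors
legend = ax.get_legend()
legend.set_title('Gender')
for text, label in zip(legend.get_texts(), ['Female', 'Male']):
    text.set_text(label)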
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()
[Figure: dodged histogram "Age Distribution by Gender"]
In [129]:
# Set the size of the plot
plt.figure(figsize=(8, 5))

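# hue_order fixes the order of the legend entries so they can be relabeled safely
ax = sns.histplot(data=df, x='age', hue='gender', hue_order=['f', 'm'], bins=15, kde=False, 
             multiple="stack", palette={'f': '#87CEEB', 'm': '#FFCC00'}, 
             edgecolor='black', alpha=0.8)

# Add labels and title
plt.title('Age Distribution by Gender', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)

# Relabel seaborn's own legend; plt.legend(labels=...) can attach labels to the wrong colors
legend = ax.get_legend()
legend.set_title('Gender')
for text, label in zip(legend.get_texts(), ['Female', 'Male']):
    text.set_text(label)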
plt.grid(axis='y', alpha=0.75)

# Show the plot
plt.show()
[Figure: stacked histogram "Age Distribution by Gender"]

2. Income and Age

Visualizing the relationship between income and age helps uncover patterns in reported income levels across age groups, which could inform user segmentation strategies.

Questions:

  1. Use a scatterplot to visualize the relationship between income and age, with a trend line indicating overall patterns. Are older users more likely to report higher incomes?
In [133]:
# Set plot size
plt.figure(figsize=(8, 5))

# Scatterplot and trend line
sns.scatterplot(data=df, x='age', y='income', color='#FFCC00', alpha=0.8, edgecolor='black')
sns.regplot(data=df, x='age', y='income', scatter=False, color='black', line_kws={"linewidth": 2})

# Add title and labels
plt.title('Income vs. Age', fontsize=14, color='black')
plt.xlabel('Age', fontsize=12, color='black')
plt.ylabel('Income', fontsize=12, color='black')

# Add grid
plt.grid(axis='y', alpha=0.5, linestyle='--', color='gray')

# Show plot
plt.show()
[Figure: scatterplot "Income vs. Age" with a linear trend line]

Observations and Patterns:

  • Scatterplot: Shows individual data points of age and income.
  • Trend Line: Indicates whether income increases, decreases, or remains constant with age.
  • Grid Lines: Subtle gray dashed lines improve readability.

Insight:

  • The trend line is nearly flat, suggesting that age does not significantly influence reported income levels among users in this dataset. There is no clear upward or downward trend linking older users to higher or lower incomes.

Recommendation:

  • Focus on other factors like education or location for income-based user segmentation.
  • Collect additional data to identify better predictors of income.
  • Avoid age-based targeting for income-related strategies.
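
A flat trend line can be backed up with numbers by computing correlation coefficients and the fitted slope. The sketch below does this on synthetic data in which age and income are generated independently, so it illustrates the method only and does not reproduce the notebook's results. The Spearman coefficient is obtained by correlating ranks, which avoids any dependency beyond pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: age and income are drawn independently
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.integers(23, 47, size=500),
    'income': rng.choice([20000, 40000, 60000, 80000, 100000], size=500),
})

# Pearson correlation: strength of the linear relationship
pearson_r = toy['age'].corr(toy['income'])

# Spearman correlation: Pearson correlation of the ranks
spearman_r = toy['age'].rank().corr(toy['income'].rank())

# Slope of the least-squares line (income change per year of age)
slope, intercept = np.polyfit(toy['age'], toy['income'], deg=1)

print(f"Pearson r:  {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
print(f"Slope:      {slope:.1f} per year of age")
```

On the real data, the same three numbers could be computed for all users and again for users with a non-zero income, since the many zero incomes may flatten the fitted line.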
  2. Create boxplots of income grouped by age_group. Which age group reports the highest median income?
In [136]:
# Filter out rows with zero or negative income
df_filtered = df[df['income'] > 0]

# Create the boxplot
plt.figure(figsize=(8, 5))
sns.boxplot(data=df_filtered, x='age_group', y='income', color='#FFCC00')  # Bumble yellow

# Add title and labels
plt.title('Income by Age Group', fontsize=14, color='black')
plt.xlabel('Age Group', fontsize=12, color='black')
plt.ylabel('Income', fontsize=12, color='black')

# Show the plot
plt.show()
[Figure: boxplot "Income by Age Group"]

Observations and Patterns:

  • Box: Displays the interquartile range (IQR) of income for each age group.
  • Line Inside Box: Median income for the group.
  • Whiskers and Points: Show the range and outliers.

Insights:

  • Median Income: The 36-45 age group has the highest median income. The 18-25 age group has the lowest median income.
  • Outliers: The outliers shown on the plot are significantly higher than the typical incomes for their respective age groups.
  • Distribution: Income distributions across age groups are similar, but the 36-45 group shows slightly higher incomes overall.

Recommendations:

  • Focus on 36-45 Age Group: This group is in their peak earning years and is the best target for high-value products or services.
  • Support Young Earners: Offer training or career development to the 18-25 age group to boost their earning potential.
  • Analyze Outliers: Investigate high-income individuals (outliers) to find trends in skills, industries, or education that lead to such incomes.
  • Plan for Older Groups: Provide financial planning and retirement services for the 46+ group as their incomes stabilize.
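
The median shown in each box can also be read off numerically instead of from the plot. The sketch below uses made-up rows with the same age_group labels as the notebook; the counts and medians it prints are not results from the real data.

```python
import pandas as pd

# Hypothetical rows for illustration only
age_levels = ['18-25', '26-35', '36-45', '46+']
toy = pd.DataFrame({
    'age_group': pd.Categorical(
        ['18-25', '18-25', '18-25', '26-35', '26-35', '26-35',
         '36-45', '36-45', '46+', '46+'],
        categories=age_levels, ordered=True,
    ),
    'income': [20000, 30000, 0, 40000, 60000, 50000, 80000, 60000, 50000, 0],
})

# Exclude zero incomes, as in the boxplot cell above
reported = toy[toy['income'] > 0]

# Number of reporting users and median income per age group
stats = reported.groupby('age_group', observed=True)['income'].agg(['count', 'median'])
top_group = stats['median'].idxmax()

print(stats)
print("Highest median income:", top_group)
```

Printing the count next to the median shows how many users each box is based on, which helps when one age group has far fewer reported incomes than the others.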
  3. Analyze income levels within gender and status categories. For example, are single men more likely to report higher incomes than single women?
In [139]:
# Create the boxplot to compare income by gender and status
plt.figure(figsize=(8, 5))
sns.boxplot(data=df_filtered, x='status', y='income', hue='gender', palette={'f': '#87CEEB', 'm': '#FFCC00'})

# Add title and labels
plt.title('Income by Gender and Status (Excluding Zero/Negative Income)', fontsize=14)
plt.xlabel('Status', fontsize=12)
plt.ylabel('Income', fontsize=12)

# Show the plot
plt.show()
[Figure: boxplot "Income by Gender and Status (Excluding Zero/Negative Income)"]

Observations:

  • Male Median is Higher: Males report slightly higher median incomes across all status categories.
  • No Lower Whiskers for Females: Female income distributions lack lower whiskers, indicating low variability at the lower end.
  • "Seeing Someone" Status: The box for females is visible, but the median line is faint or overlaps with the Q1 line, showing concentrated incomes.
  • Outliers in "Single" Status: The "single" category, especially for males, has the highest number of outliers, indicating wide income variability.

Insights:

  • Males generally report higher median incomes than females across all categories.
  • Female incomes are tightly concentrated, with minimal variability in most cases.
  • The "single" category, particularly for males, has significant high-income outliers.

Recommendations:

  • Focus on single males for premium offerings due to high-income variability.
  • Investigate structural factors influencing the concentration of female incomes.
  • Analyze high-income outliers in the "single" category for potential targeting.

3. Pets and Preferences

Pets are often a key lifestyle preference and compatibility factor. Analyzing how pets preferences distribute across demographics can provide insights for filters or recommendations.

Questions:

  1. Create a bar chart showing the distribution of pets categories (e.g., likes dogs, likes cats). Which preferences are most common?
In [144]:
# Count the occurrences of each pet category
pet_counts = df['pets'].value_counts()

# Plot the bar chart
plt.figure(figsize=(8, 5))
sns.barplot(x=pet_counts.index, y=pet_counts.values, color='#FFCC00', edgecolor='black')

# Add labels and title
plt.title('Distribution of Pet Preferences', fontsize=14)
plt.xlabel('Pet Preferences', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)

# Rotate x-axis labels for better readability
plt.xticks(rotation=90)

# Show the plot
plt.show()
[Figure: bar chart of the distribution of pet preferences]

Insights:

  • Most Users Did Not Specify a Preference: The highest bar corresponds to users who did not mention their pet preference (missing values filled with "Not Specified").
  • Second Most Popular Preference: Users who like both cats and dogs are the next largest group.
  • Third Place: Users who specifically like dogs.
  • Fourth Preference: Users who like dogs but also have cats.
  • Fifth Preference: Users who specifically have dogs.
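
Raw counts do not show how large each group is relative to the whole. The following is a minimal sketch, assuming the `pets` column holds one category string per user with missing values filled as "Not Specified"; the `toy_pets` series is hypothetical. `value_counts(normalize=True)` returns each category's share, which helps quantify a ranking like the one above.

```python
import pandas as pd

# Toy data (assumption): not taken from the Bumble dataset
toy_pets = pd.Series(
    ['Not Specified'] * 4
    + ['likes dogs and likes cats'] * 3
    + ['likes dogs'] * 2
    + ['has dogs']
)

# Share of all users in each category, in percent
shares = toy_pets.value_counts(normalize=True).mul(100).round(1)

# Share among users who actually stated a preference
specified = toy_pets[toy_pets != 'Not Specified']
shares_specified = specified.value_counts(normalize=True).mul(100).round(1)

print(shares)
print(shares_specified)
```

Reporting both views separates how many users skipped the field from how preferences split among those who answered.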

Recommendations:

  • Target Messaging for the "Not Specified" Group: Encourage users to update their profiles with pet preferences to improve personalization.
  • Focus on Popular Groups: Tailor features or content for users who like both cats and dogs, as they form a significant segment.
  • Engage Dog Lovers: Consider features specifically for dog lovers, given their notable representation.
  2. How do pet preferences vary across gender and age_group? Are younger users more likely to report liking pets compared to older users?
In [147]:
# Create a new binary column indicating whether the user likes pets.
# The word-boundary regex keeps a value such as "dislikes dogs" from matching "likes".
df['likes_pets'] = df['pets'].astype(str).str.contains(r'\blikes\b')

# Filter out rows where the pet preference is not specified
df_filtered = df[df['pets'] != 'Not Specified']

# Group by age_group and gender and count the users who like pets
# observed=False is passed explicitly to avoid the FutureWarning
age_gender_pets = df_filtered[df_filtered['likes_pets']].groupby(['age_group', 'gender'], observed=False).size().reset_index(name='count')

# Plotting the data using a simple bar plot
plt.figure(figsize=(8, 5))
sns.barplot(data=age_gender_pets, x='age_group', y='count', hue='gender', palette=['#87CEEB','#FFCC00'])

# Adding labels and title
plt.title('Pets Preferences Across Gender and Age Groups', fontsize=16)
plt.xlabel('Age Group', fontsize=12)
plt.ylabel('Count of Users Liking Pets', fontsize=12)

# Show the plot
plt.show()
[Figure: bar chart of the count of users liking pets by age group and gender]

Insights Based on Observations:

  • Gender Preference for Pets: In every age group, more males than females report liking pets. Because the chart shows raw counts, this may partly reflect the gender mix of the user base rather than a stronger preference among males.
  • Age Group Preference: The 26-35 age group stands out with the highest count of users reporting a liking for pets. This group seems to have a strong affinity for pets compared to other age groups.
  • Similar Preferences for Younger and Older Groups: The 18-25 and 36-45 age groups have almost identical pet preferences, with only a minor difference in the number of users who like pets. This suggests that both younger and slightly older users have comparable attitudes toward pets.
  • Less Preference in Older Age Groups: The 46+ age group shows a much lower count of users who like pets. This could mean interest declines with age, but it may also reflect fewer users in that age group.
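
Because the chart plots raw counts, larger groups show taller bars even if their members are no more likely to like pets. The sketch below is a minimal rate-based comparison, assuming a boolean `likes_pets` column alongside `age_group` and `gender`; the `toy` DataFrame is hypothetical. The mean of a boolean column within a group is the share of users in that group who like pets.

```python
import pandas as pd

# Toy data (assumption): not taken from the Bumble dataset
toy = pd.DataFrame({
    'age_group': ['18-25'] * 4 + ['26-35'] * 6,
    'gender': ['f', 'f', 'm', 'm', 'f', 'f', 'm', 'm', 'm', 'm'],
    'likes_pets': [True, True, True, False, True, False, True, True, False, False],
})

# Count, group size, and share of users who like pets in each group
rates = (
    toy.groupby(['age_group', 'gender'])['likes_pets']
    .agg(count_likes='sum', n_users='size', share_likes='mean')
    .reset_index()
)
print(rates)
```

In the toy data, males aged 26-35 have twice the count of females in the same age group but an identical share, which shows why counts and rates can tell different stories.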

Recommendations:

  • Target Males More: More males than females report liking pets in every age group, so male-targeted pet campaigns reach the larger audience.
  • Focus on 26-35 Age Group: This group has the highest count of pet lovers, making it the most valuable target market.
  • Group 18-25 and 36-45 Together: These groups show similar pet preferences, so they can be combined in strategies.
  • Minimize Focus on 46+: The 46+ age group shows less interest in pets, so pet-related campaigns may be less effective for them.

4. Signs and Personality¶

Users’ self-reported zodiac signs (sign) can offer insights into personality preferences or trends. While not scientifically grounded, analyzing this data helps explore fun and engaging patterns.

Questions:

  1. Create a pie chart showing the distribution of zodiac signs (sign) across the platform. Which signs are most and least represented? Is this the right chart? If not, replace it with the right chart.
  • With twelve zodiac signs, a pie chart has too many slices and becomes cluttered and hard to interpret. A bar chart is the better option because it allows easy comparison across categories.
In [153]:
# Extract the main zodiac sign using a regular expression to match the first word
df['cleaned_sign'] = df['sign'].str.extract(r'(\b\w+\b)', expand=False)

# Filter out "Not Specified"
df_filtered = df[df['cleaned_sign'] != 'Not']

# Count the occurrences of each zodiac sign
sign_counts = df_filtered['cleaned_sign'].value_counts()

# Create a fading color palette: value_counts() sorts descending, so with reverse=True
# the highest counts get the strongest color and lower counts get lighter shades
palette = sns.light_palette("#FFCC00", n_colors=len(sign_counts), reverse=True)

# Plot the data as a horizontal bar chart with the fading Bumble color palette
plt.figure(figsize=(8, 5))
sns.barplot(x=sign_counts.values, y=sign_counts.index, palette=palette, hue=sign_counts.index, edgecolor='black')

# Add labels and title
plt.title('Distribution of Zodiac Signs on the Platform (Excluding "Not Specified")', fontsize=16)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Zodiac Sign', fontsize=12)

# Show the plot
plt.show()
[Figure: horizontal bar chart of the distribution of zodiac signs, excluding "Not Specified"]

Insights on Zodiac Signs Distribution:

  • Most Represented Signs: Leo and Gemini
  • Least Represented Signs: Capricorn and Aquarius
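
The first-word extraction used above can be checked on a few sample strings. This is a minimal sketch, assuming `sign` values begin with the sign name and may carry trailing free text; the `toy_sign` values are hypothetical. It shows why "Not Specified" is reduced to "Not" and must be filtered under that name.

```python
import pandas as pd

# Toy data (assumption): not taken from the Bumble dataset
toy_sign = pd.Series([
    'leo but it matters a lot',
    'gemini',
    'leo',
    'Not Specified',
    'capricorn and it is fun to think about',
])

# Keep only the first word of each entry
cleaned = toy_sign.str.extract(r'(\b\w+\b)', expand=False)

# "Not Specified" has become "Not", so filter on that value
counts = cleaned[cleaned != 'Not'].value_counts()

print(cleaned.tolist())
print(counts)
print('Most represented:', counts.idxmax())
```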
  2. How does sign vary across gender and status? Are there noticeable patterns or imbalances?
In [156]:
# Group by 'gender', 'status', and 'cleaned_sign' to count occurrences
sign_gender_status = df_filtered.groupby(['gender', 'status', 'cleaned_sign']).size().reset_index(name='count')

# Plot the data as a grouped bar plot
plt.figure(figsize=(8, 5))

# hue separates the genders; each sign/gender pair has one row per status, so
# estimator='sum' adds those rows up (the default estimator would plot their mean)
sns.barplot(data=sign_gender_status, x='cleaned_sign', y='count', hue='gender', palette=['#87CEEB', '#FFCC00'], estimator='sum', errorbar=None)

# Adding labels and title
plt.title('Zodiac Sign Distribution Across Gender and Status', fontsize=16)
plt.xlabel('Zodiac Sign', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')

# Show the plot
plt.show()
[Figure: grouped bar chart of zodiac sign counts by gender]

Insights (How signs vary across gender and status):

  • Popular Signs: Zodiac signs like Gemini, Cancer, and Libra are the most represented for both males and females, showing that these signs dominate the user base.
  • Underrepresented Signs: Among females, Capricorn and Scorpio are less represented. Among males, Aquarius, Aries, and Pisces show relatively lower counts.
  • Gender Imbalance: Males consistently outnumber females across all zodiac signs, with some signs (e.g., Capricorn for females, Aquarius for males) showing a sharper imbalance.
  • Consistent Patterns: Signs like Aquarius, Pisces, and Aries have consistently lower representation across both genders.
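
The gender imbalance per sign can also be expressed as a table and a ratio. The sketch below is a minimal example, assuming `cleaned_sign`, `gender`, and `status` columns as in `df_filtered`; the `toy` DataFrame and the `m_to_f` column name are hypothetical. `pd.crosstab` counts users per sign and gender, and passing two row keys breaks the same counts down by status.

```python
import pandas as pd

# Toy data (assumption): not taken from the Bumble dataset
toy = pd.DataFrame({
    'cleaned_sign': ['leo', 'leo', 'leo', 'gemini', 'gemini', 'gemini', 'gemini'],
    'gender': ['m', 'm', 'f', 'm', 'f', 'f', 'm'],
    'status': ['single', 'single', 'single', 'single',
               'seeing someone', 'single', 'single'],
})

# Users per sign and gender, plus the male-to-female ratio
by_gender = pd.crosstab(toy['cleaned_sign'], toy['gender'])
by_gender['m_to_f'] = by_gender['m'] / by_gender['f']

# The same counts broken down by status as well
by_status = pd.crosstab([toy['status'], toy['cleaned_sign']], toy['gender'])

print(by_gender)
print(by_status)
```

A ratio well above 1 for a sign points to a sharper male skew than the platform average, and the status breakdown shows whether that skew is driven by a single status category.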

Recommendations (Addressing patterns and imbalances):

  • Attract Female Users: Run campaigns for women, especially Capricorn and Scorpio.
  • Highlight Less Visible Signs: Promote Aquarius, Aries, and Pisces.
  • Engage Popular Signs: Focus on Gemini, Cancer, and Libra.
  • Add Zodiac Features: Include compatibility-based matching.
  • Fix Low Counts: Study and address why some signs have fewer users.