Introduction and Purpose¶
Bumble is a popular dating platform where users connect based on mutual interests and compatibility. To provide a better matchmaking experience, Bumble collects user information through profiles, which include details about demographics, lifestyle habits, and personal preferences. This dataset represents user-generated profiles, offering a rich source of information to understand user behavior, preferences, and trends.
The purpose of this analysis is to leverage the dataset to answer key business and user behavior questions. By cleaning, processing, exploring, and visualizing the data, the goal is to uncover actionable insights that can help improve Bumble's matchmaking experience and user engagement.
1.1 Load the libraries:¶
import pandas as pd
import numpy as np
%pip install seaborn
import seaborn as sns
import matplotlib.pyplot as plt
1.2. Import the dataset¶
df=pd.read_csv(r"C:\Users\avadh\Desktop\data analyst nextleap\Milestone 4\bumble.csv")
df.head()
 | age | status | gender | body_type | diet | drinks | education | ethnicity | height | income | job | last_online | location | pets | religion | sign | speaks |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22 | single | m | a little extra | strictly anything | socially | working on college/university | asian, white | 75.0 | -1 | transportation | 2012-06-28-20-30 | south san francisco, california | likes dogs and likes cats | agnosticism and very serious about it | gemini | english |
1 | 35 | single | m | average | mostly other | often | working on space camp | white | 70.0 | 80000 | hospitality / travel | 2012-06-29-21-41 | oakland, california | likes dogs and likes cats | agnosticism but not too serious about it | cancer | english (fluently), spanish (poorly), french (... |
2 | 38 | available | m | thin | anything | socially | graduated from masters program | NaN | 68.0 | -1 | NaN | 2012-06-27-09-10 | san francisco, california | has cats | NaN | pisces but it doesn’t matter | english, french, c++ |
3 | 23 | single | m | thin | vegetarian | socially | working on college/university | white | 71.0 | 20000 | student | 2012-06-28-14-22 | berkeley, california | likes cats | NaN | pisces | english, german (poorly) |
4 | 29 | single | m | athletic | NaN | socially | graduated from college/university | asian, black, other | 66.0 | -1 | artistic / musical / writer | 2012-06-27-21-26 | san francisco, california | likes dogs and likes cats | NaN | aquarius | english |
1.3 Check the information about the data and the data type of each attribute¶
This step provides an overview of the dataset by displaying the structure, data types, and non-null counts for each column. It helps identify data types, missing values, and memory usage, laying the foundation for data cleaning and further analysis.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   age          59946 non-null  int64
 1   status       59946 non-null  object
 2   gender       59946 non-null  object
 3   body_type    54650 non-null  object
 4   diet         35551 non-null  object
 5   drinks       56961 non-null  object
 6   education    53318 non-null  object
 7   ethnicity    54266 non-null  object
 8   height       59943 non-null  float64
 9   income       59946 non-null  int64
 10  job          51748 non-null  object
 11  last_online  59946 non-null  object
 12  location     59946 non-null  object
 13  pets         40025 non-null  object
 14  religion     39720 non-null  object
 15  sign         48890 non-null  object
 16  speaks       59896 non-null  object
dtypes: float64(1), int64(2), object(14)
memory usage: 7.8+ MB
1.4 Statistical Analysis¶
print(f"statistical analysis (numerical columns)")
df.describe()
statistical analysis (numerical columns)
 | age | height | income |
---|---|---|---|
count | 59946.000000 | 59943.000000 | 59946.000000 |
mean | 32.340290 | 68.295281 | 20033.222534 |
std | 9.452779 | 3.994803 | 97346.192104 |
min | 18.000000 | 1.000000 | -1.000000 |
25% | 26.000000 | 66.000000 | -1.000000 |
50% | 30.000000 | 68.000000 | -1.000000 |
75% | 37.000000 | 71.000000 | -1.000000 |
max | 110.000000 | 95.000000 | 1000000.000000 |
1.5 Check the dimensions of the data¶
df.shape
(59946, 17)
1.6 Check null values¶
df.isnull().sum()
age                0
status             0
gender             0
body_type       5296
diet           24395
drinks          2985
education       6628
ethnicity       5680
height             3
income             0
job             8198
last_online        0
location           0
pets           19921
religion       20226
sign           11056
speaks            50
dtype: int64
df.isnull().sum().sum()
104438
- Which columns in the dataset have missing values, and what percentage of data is missing in each column?
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100
- Are there columns where more than 50% of the data is missing? Drop those columns where missing values are >50%.
missing_summary = pd.DataFrame({
'Missing Values': missing_data,
'Percentage (%)': missing_percentage
}).sort_values(by='Percentage (%)', ascending=False)
print(missing_summary)
             Missing Values  Percentage (%)
diet                  24395       40.694959
religion              20226       33.740366
pets                  19921       33.231575
sign                  11056       18.443266
job                    8198       13.675641
education              6628       11.056618
ethnicity              5680        9.475194
body_type              5296        8.834618
drinks                 2985        4.979482
speaks                   50        0.083408
height                    3        0.005005
last_online               0        0.000000
location                  0        0.000000
income                    0        0.000000
status                    0        0.000000
gender                    0        0.000000
age                       0        0.000000
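The second question above asks for columns with more than 50% missing values to be dropped. In the summary, the highest share is diet at about 40.7%, so no column crosses that threshold and nothing needs to be dropped here. As a minimal sketch of how such a check could be written, the example below uses a small made-up frame; `toy`, `threshold`, and `cols_to_drop` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["anything", np.nan, np.nan, np.nan],               # 75% missing
    "pets": ["has cats", np.nan, "likes dogs", "likes cats"],  # 25% missing
})

threshold = 50  # percent; assumed cutoff, taken from the question text

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean() * 100

# Names of the columns whose missing share exceeds the threshold.
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()

cleaned = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(cleaned.columns.tolist())
```

Applied to the real data, the same filter would return an empty list, which leaves the DataFrame unchanged.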
df.duplicated().sum()
0
Part 1: Data Cleaning¶
1. Inspecting Missing Data¶
Inspecting missing data is crucial to identify gaps in the dataset that could impact analysis or model performance. It helps decide how to handle missing values, such as imputation, removal, or flagging, ensuring the integrity and accuracy of the analysis.
df.dropna(subset=['height'],inplace=True)
Conclusion: dropped the 3 rows with a missing height value, since they make up far less than 1% of the data (about 0.005%).¶
2. Data Types¶
Reviewing data types ensures the data is properly structured for analysis. Correct data types are essential for performing valid operations, applying appropriate transformations, and optimizing both performance and memory usage.
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 59943 entries, 0 to 59945
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   age          59943 non-null  int64
 1   status       59943 non-null  object
 2   gender       59943 non-null  object
 3   body_type    54650 non-null  object
 4   diet         35551 non-null  object
 5   drinks       56961 non-null  object
 6   education    53318 non-null  object
 7   ethnicity    54264 non-null  object
 8   height       59943 non-null  float64
 9   income       59943 non-null  int64
 10  job          51747 non-null  object
 11  last_online  59943 non-null  object
 12  location     59943 non-null  object
 13  pets         40024 non-null  object
 14  religion     39720 non-null  object
 15  sign         48889 non-null  object
 16  speaks       59893 non-null  object
dtypes: float64(1), int64(2), object(14)
memory usage: 8.2+ MB
#Does the last_online column need to be converted into a datetime format?
# Convert 'last_online' column to datetime with the correct format
df['last_online'] = pd.to_datetime(df['last_online'], format='%Y-%m-%d-%H-%M')
# Display the result
df['last_online'].head()
0   2012-06-28 20:30:00
1   2012-06-29 21:41:00
2   2012-06-27 09:10:00
3   2012-06-28 14:22:00
4   2012-06-27 21:26:00
Name: last_online, dtype: datetime64[ns]
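The explicit format string is needed because the raw timestamps separate every component with a hyphen. The sketch below repeats the conversion on a few made-up strings. The `errors="coerce"` argument is an assumption added for illustration (the notebook does not use it): it turns values that do not match the format into `NaT` instead of raising an error.

```python
import pandas as pd

# Made-up sample strings in the same layout as last_online,
# plus one malformed entry to show the effect of errors="coerce".
raw = pd.Series(["2012-06-28-20-30", "2012-06-29-21-41", "not a date"])

parsed = pd.to_datetime(raw, format="%Y-%m-%d-%H-%M", errors="coerce")

print(parsed)
print("Unparseable values:", parsed.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to see whether any rows failed to parse.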
3. Outliers¶
Outliers can significantly skew results and affect the accuracy of analysis or models. Identifying and addressing outliers helps ensure the data accurately represents trends and patterns, leading to more reliable insights and decisions.
df.describe()
 | age | height | income | last_online |
---|---|---|---|---|
count | 59943.000000 | 59943.000000 | 59943.000000 | 59943 |
mean | 32.340140 | 68.295281 | 20034.225197 | 2012-05-22 06:40:43.551373568 |
min | 18.000000 | 1.000000 | -1.000000 | 2011-06-27 01:52:00 |
25% | 26.000000 | 66.000000 | -1.000000 | 2012-05-29 20:34:00 |
50% | 30.000000 | 68.000000 | -1.000000 | 2012-06-27 14:30:00 |
75% | 37.000000 | 71.000000 | -1.000000 | 2012-06-30 01:09:00 |
max | 110.000000 | 95.000000 | 1000000.000000 | 2012-07-01 08:57:00 |
std | 9.452723 | 3.994803 | 97348.524902 | NaN |
# Are there any apparent outliers in numerical columns such as age, height, or income? What are the ranges of values in these columns?
# Any -1 values in numerical columns like income should be replaced with 0, as they may represent missing or invalid data.
# Calculate the IQR for age and height (income is handled separately below because of its -1 placeholder values)
columns_to_check = ['age', 'height']
outliers = {}
for col in columns_to_check:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    # Calculate lower and upper bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Identify outliers
    outliers[col] = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    # Print results
    print(f"{col} - IQR: {IQR:.2f}, Lower Bound: {lower_bound:.2f}, Upper Bound: {upper_bound:.2f}")
    print(f"Number of outliers in {col}: {outliers[col].shape[0]}")
age - IQR: 11.00, Lower Bound: 9.50, Upper Bound: 53.50
Number of outliers in age: 2638
height - IQR: 5.00, Lower Bound: 58.50, Upper Bound: 78.50
Number of outliers in height: 285
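- Replacing Outliers with Median:
Outliers can distort the data and affect results. Replacing them with the median is useful because the median, unlike the mean, is not influenced by extreme values. Note that the code below does not reuse the IQR bounds computed above: it treats every value outside the 10th to 90th percentile range as an outlier, so it can replace up to roughly 20% of the values in each column rather than only the 2,638 age and 285 height values flagged by the IQR rule.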
# Step 1: Calculate the 10th and 90th percentiles for 'age' and 'height'
age_10th, age_90th = df['age'].quantile(0.10), df['age'].quantile(0.90)
height_10th, height_90th = df['height'].quantile(0.10), df['height'].quantile(0.90)
# Step 2: Calculate the median for 'age' and 'height' within the middle 80%
middle_80_data = df[(df['age'] >= age_10th) & (df['age'] <= age_90th) &
(df['height'] >= height_10th) & (df['height'] <= height_90th)]
median_age = middle_80_data['age'].median()
median_height = middle_80_data['height'].median()
print(f"Median Age (Middle 80%): {median_age}")
print(f"Median Height (Middle 80%): {median_height}")
Median Age (Middle 80%): 30.0 Median Height (Middle 80%): 68.0
# Step 3: Replace outliers in 'age' and 'height' with the calculated median
df['age'] = df['age'].apply(lambda x: median_age if x < age_10th or x > age_90th else x)
df['height'] = df['height'].apply(lambda x: median_height if x < height_10th or x > height_90th else x)
# Step 4: Verify the changes
print(df[['age', 'height']].describe())
                age        height
count  59943.000000  59943.000000
mean      30.897986     68.193367
std        5.419528      2.633236
min       23.000000     63.000000
25%       27.000000     66.000000
50%       30.000000     68.000000
75%       34.000000     70.000000
max       46.000000     73.000000
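The percentile cutoffs used above narrow the ranges considerably (age now spans 23 to 46). For comparison, here is a minimal sketch of a narrower alternative that replaces only values outside the 1.5 × IQR bounds. This is not what the notebook does; the helper name `replace_iqr_outliers_with_median` and the sample ages are made up for illustration.

```python
import pandas as pd

def replace_iqr_outliers_with_median(s: pd.Series) -> pd.Series:
    """Replace values outside the 1.5 * IQR bounds with the median of the inliers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # True for values within the bounds; NaN also fails this check
    # and would therefore be replaced by the median as well.
    inside = s.between(lower, upper)
    median = s[inside].median()
    # Keep inliers as they are, substitute the median everywhere else.
    return s.where(inside, median)

# Made-up ages with one implausible value.
ages = pd.Series([22, 25, 27, 30, 31, 33, 36, 110])
cleaned_ages = replace_iqr_outliers_with_median(ages)
print(cleaned_ages.tolist())
```

With this rule only the value 110 is replaced, while ages near the edges of the distribution are left untouched.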
- -1 values in numerical column income should be replaced with 0, as they may represent missing or invalid data:
-1 values in the income column likely represent missing or invalid data. Replacing them with 0 helps maintain consistency in the dataset and ensures that these values don’t distort the analysis or modeling, as they are not meaningful income figures.
# Replace -1 values with 0 in the 'income' column using replace method
df['income'] = df['income'].replace(-1, 0)
# Verify the changes
print(df['income'].describe())
count      59943.000000
mean       20035.033282
std        97348.358589
min            0.000000
25%            0.000000
50%            0.000000
75%            0.000000
max      1000000.000000
Name: income, dtype: float64
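After the replacement, every summary statistic treats the former -1 entries as a real income of 0, which is why the median and the quartiles are all 0. A small sketch on made-up numbers shows how the choice of placeholder affects a statistic. Replacing with `NaN` is shown only as an alternative for comparison; the notebook itself uses 0.

```python
import numpy as np
import pandas as pd

# Made-up incomes where -1 marks "not reported".
income = pd.Series([-1, 80000, -1, 20000, -1])

as_zero = income.replace(-1, 0)       # the approach used in this notebook
as_nan = income.replace(-1, np.nan)   # alternative: mark as missing

# pandas skips NaN by default, so only reported incomes enter the second median.
print("Median with zeros:", as_zero.median())
print("Median with NaN:  ", as_nan.median())
```

Which option is appropriate depends on whether later steps should count unreported incomes as zero or ignore them.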
4. Missing Data Visualization¶
Visualizing missing data helps quickly identify patterns and distributions of missing values across columns. This step aids in understanding the extent of missing data, making it easier to decide on appropriate handling methods, such as imputation or removal, for more accurate analysis.
# Create a boolean mask where True indicates missing data
missing_data = df.isnull()
# Create a heatmap to visualize missing values
plt.figure(figsize=(8, 5))
sns.heatmap(missing_data, cbar=False, cmap=['black', '#FFCC00'], yticklabels=False, xticklabels=df.columns)
# Display the heatmap
plt.title("Missing Data Heatmap")
plt.show()
Part 2: Data Processing¶
1. Binning and Grouping¶
Grouping continuous variables, such as age or income, into bins helps simplify analysis and identify trends among specific groups. For instance, grouping users into age ranges can reveal distinct patterns in behavior or preferences across demographics.
Questions:
- Bin the age column into categories such as "18-25", "26-35", "36-45", and "46+" to create a new column, age_group. How does the distribution of users vary across these age ranges?
# Define the bin edges
bins = [18, 25, 35, 45, float('inf')] # 'inf' for ages above 45
labels = ["18-25", "26-35", "36-45", "46+"] # Labels for the bins
# Bin the 'age' column into the defined categories; include_lowest=True keeps age 18 in the first bin
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=True, include_lowest=True)
# Display the first few rows to check
print(df[['age', 'age_group']].head())
    age age_group
0  30.0     26-35
1  35.0     26-35
2  38.0     36-45
3  23.0     18-25
4  29.0     26-35
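The question also asks how users are distributed across the age ranges, which the preview of the first rows does not show. The sketch below counts the share per group on made-up ages. It also illustrates an edge case: with `right=True`, `pd.cut` leaves the left edge of the first bin open unless `include_lowest=True` is passed, so an age of exactly 18 would otherwise become `NaN`.

```python
import pandas as pd

# Made-up ages that include every bin edge.
ages = pd.Series([18, 22, 25, 26, 35, 36, 45, 46, 60])

bins = [18, 25, 35, 45, float("inf")]
labels = ["18-25", "26-35", "36-45", "46+"]

# include_lowest=True closes the first interval on the left, so 18 is kept.
age_group = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)

# Percentage of users per age group, in bin order.
share = age_group.value_counts(normalize=True).sort_index() * 100
print(share.round(2))
```

On the notebook's DataFrame, the equivalent call would be `df['age_group'].value_counts(normalize=True)`.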
- Group income into categories like "Low Income," "Medium Income," and "High Income" based on meaningful thresholds (e.g., quartiles). What insights can be derived from these groups?
print(df['income'].value_counts())
income
0          48439
20000       2952
100000      1621
80000       1111
30000       1048
40000       1005
50000        975
60000        736
70000        707
150000       631
1000000      521
250000       149
500000        48
Name: count, dtype: int64
# Plot the income distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['income'], bins=30, kde=True, color='#FFCC00')
plt.title('Income Distribution', fontsize=16, color='black')
plt.xlabel('Income', fontsize=12, color='black')
plt.ylabel('Frequency', fontsize=12, color='black')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
# Calculate the quartiles
low_income_threshold = df['income'].quantile(0.25) # 25th percentile
high_income_threshold = df['income'].quantile(0.75) # 75th percentile
# Display the thresholds
print(f"Low Income Threshold: {low_income_threshold}")
print(f"High Income Threshold: {high_income_threshold}")
Low Income Threshold: 0.0
High Income Threshold: 0.0
Conclusion: About 81% of users (48,439 of 59,943) have an income of 0, so both the 25th and 75th percentiles are 0. With this skew, quartiles computed over the full column cannot separate "Low Income," "Medium Income," and "High Income"; the zero values have to be excluded before the thresholds are calculated.
# Exclude zeros for meaningful categorization
non_zero_income = df[df['income'] > 0]['income']
# Calculate quartile-based thresholds (ignoring 0 values)
low_income_threshold = non_zero_income.quantile(0.25) # 25th percentile
high_income_threshold = non_zero_income.quantile(0.75) # 75th percentile
# Define bins for income categories
bins = [df['income'].min(), low_income_threshold, high_income_threshold, df['income'].max()]
labels = ["Low Income", "Medium Income", "High Income"]
# Create the 'income_group' column; incomes of 0 fall outside the first interval (0, low threshold] and become NaN
df['income_group'] = pd.cut(df['income'], bins=bins, labels=labels, right=True)
print(df['income_group'].value_counts(dropna=False))
income_group
NaN              48439
Medium Income     7203
Low Income        2952
High Income       1349
Name: count, dtype: int64
2. Derived Features¶
Derived features are new columns created based on the existing data to add depth to the analysis. These features often reveal hidden patterns or provide new dimensions to explore.
Questions:
- Create a new feature, profile_completeness, by calculating the percentage of non-missing values for each user profile. How complete are most user profiles, and how does completeness vary across demographics?
# Calculate profile completeness as the percentage of non-missing values for each user profile
df['profile_completeness'] = (df.notnull().sum(axis=1) / len(df.columns)) * 100
# Analyze completeness by gender
profile_completeness_by_gender = df.groupby('gender')['profile_completeness'].mean().reset_index()
# Display results
print("\nProfile Completeness by Gender:")
print(profile_completeness_by_gender)
# Profile completeness by age
completeness_by_age = df.groupby('age')['profile_completeness'].mean().reset_index()
print("\nProfile Completeness by Age:")
print(completeness_by_age)
# Profile completeness by status
completeness_by_status = df.groupby('status')['profile_completeness'].mean().reset_index()
print("\nProfile Completeness by Status:")
print(completeness_by_status)
Profile Completeness by Gender:
  gender  profile_completeness
0      f             86.454941
1      m             86.662808

Profile Completeness by Age:
     age  profile_completeness
0   23.0             86.478639
1   24.0             86.269035
2   25.0             86.291934
3   26.0             86.314941
4   27.0             86.314361
5   28.0             86.105439
6   29.0             86.386071
7   30.0             86.930099
8   31.0             86.300394
9   32.0             86.373998
10  33.0             86.274276
11  34.0             86.258232
12  35.0             86.435747
13  36.0             86.557835
14  37.0             86.655848
15  38.0             86.909379
16  39.0             86.801689
17  40.0             86.882984
18  41.0             87.454350
19  42.0             87.043401
20  43.0             87.118145
21  44.0             87.265834
22  45.0             87.787509
23  46.0             87.297396

Profile Completeness by Status:
           status  profile_completeness
0       available             87.430507
1         married             86.943973
2  seeing someone             87.112403
3          single             86.530062
4         unknown             80.000000
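As the cells are ordered here, `df` already contains the derived columns `age_group` and `income_group` when completeness is computed, and `income_group` is `NaN` for every user with an income of 0, so those columns enter the percentage as well. A minimal sketch of a variant that counts only fields the user fills in is shown below. The frame `toy` and the list `profile_cols` are hypothetical; on the real data the list would name the original profile columns.

```python
import numpy as np
import pandas as pd

# Made-up profiles; income_group stands in for a derived column.
toy = pd.DataFrame({
    "age": [22, 35, 38],
    "diet": ["anything", np.nan, np.nan],
    "pets": ["has cats", np.nan, "likes dogs"],
    "job": ["student", "other", np.nan],
    "income_group": [np.nan, np.nan, np.nan],  # derived, not a profile field
})

# Hypothetical list of the fields that count toward completeness.
profile_cols = ["age", "diet", "pets", "job"]

# Row-wise share of non-missing values among the selected fields, in percent.
toy["profile_completeness"] = toy[profile_cols].notnull().mean(axis=1) * 100
print(toy["profile_completeness"].tolist())
```

Restricting the calculation this way keeps the score independent of any helper columns added later in the analysis.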
3. Unit Conversion¶
Standardizing units across datasets is essential for consistency, especially when working with numerical data. In the context of the Bumble dataset, users’ heights are given in inches, which may not be intuitive for all audiences.
Question:
- Convert the height column from inches to centimeters using the conversion factor (1 inch = 2.54 cm). Store the converted values in a new column, height_cm.
# Conversion factor from inches to centimeters
conversion_factor = 2.54
# Convert the height column to centimeters and store in a new column
df['height_cm'] = df['height'] * conversion_factor
# Display the updated dataframe or verify the new column
print(df[['height', 'height_cm']].head())
   height  height_cm
0    68.0     172.72
1    70.0     177.80
2    68.0     172.72
3    71.0     180.34
4    66.0     167.64
Part 3: Data Analysis¶
1. Demographic Analysis¶
Understanding the demographics of users is essential for tailoring marketing strategies, improving user experience, and designing features that resonate with the platform’s audience. Insights into gender distribution and relationship status can help Bumble refine its matchmaking algorithms and engagement campaigns.
Questions:
- What is the gender distribution (gender) across the platform? Are there any significant imbalances?
# Calculate the gender distribution
gender_distribution = df['gender'].value_counts()
# Calculate the percentage distribution for better insights
gender_percentage = df['gender'].value_counts(normalize=True) * 100
# Display the results with formatted percentage values
print("Gender Distribution (Count):")
print(gender_distribution)
print("\nGender Distribution (Percentage):")
print(gender_percentage.apply(lambda x: f"{x:.2f}"))
Gender Distribution (Count):
gender
m    35827
f    24116
Name: count, dtype: int64

Gender Distribution (Percentage):
gender
m    59.77
f    40.23
Name: proportion, dtype: object
Insights:
- The platform has a higher proportion of male users (59.77%) than female users (40.23%).
- Male users outnumber female users by roughly 3 to 2 (35,827 vs. 24,116), a gap of about 19.5 percentage points.
Recommendations:
- Explore ways to attract more female users to balance the gender ratio.
- Tailor features or marketing strategies to better engage the female audience.
- Consider developing campaigns that promote gender equality and inclusivity on the platform.
- What are the proportions of users in different status categories (e.g., single, married, seeing someone)? What does this suggest about the platform’s target audience?
# Calculate the status distribution
status_distribution = df['status'].value_counts()
# Calculate the percentage distribution for better insights
status_percentage = df['status'].value_counts(normalize=True) * 100
# Display the results with formatted percentage values
print("Status Distribution (Count):")
print(status_distribution)
print("\nStatus Distribution (Percentage):")
print(status_percentage.apply(lambda x: f"{x:.2f}"))
Status Distribution (Count):
status
single            55694
seeing someone     2064
available          1865
married             310
unknown              10
Name: count, dtype: int64

Status Distribution (Percentage):
status
single            92.91
seeing someone     3.44
available          3.11
married            0.52
unknown            0.02
Name: proportion, dtype: object
Insights:
- Single (92.91%): The majority of users are "single," highlighting that Bumble primarily serves individuals looking for relationships or connections.
- Seeing Someone (3.44%): A small percentage of users are "seeing someone," suggesting some are in relationships but remain active on the platform for other interactions.
- Available (3.11%): Users marked as "available" are likely open to connections, though not necessarily seeking long-term commitments.
- Married (0.52%): A very small portion of users are "married," possibly using the platform for non-romantic purposes.
- Unknown (0.02%): A negligible amount of users have an "unknown" status, likely due to incomplete profiles.
Recommendations:
- Focus on engaging the majority of single users (92.91%) with tailored features.
- Create tools for users who are seeing someone (3.44%) to enhance relationships.
- Promote visibility for available users (3.11%) to attract more attention.
- Optimize for married and unknown users, but with less priority.
- How does status vary by gender? For example, what proportion of men and women identify as single?
# Calculate the count and percentage of status by gender
status_gender = df.groupby(['gender', 'status']).size().unstack()
# Normalize by gender for percentage calculation
status_gender_percentage = status_gender.div(status_gender.sum(axis=1), axis=0) * 100
# Display the results with formatted percentage values
print(f"Status Distribution by Gender (Count):\n{status_gender}")
print(f"\nStatus Distribution by Gender (Percentage):\n{status_gender_percentage.apply(lambda x: x.map(lambda v: f'{v:.2f}'))}") # Format each percentage value
Status Distribution by Gender (Count):
status  available  married  seeing someone  single  unknown
gender
f             656      135            1003   22318        4
m            1209      175            1061   33376        6

Status Distribution by Gender (Percentage):
status available married seeing someone single unknown
gender
f           2.72    0.56           4.16  92.54    0.02
m           3.37    0.49           2.96  93.16    0.02
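The same row-wise percentages can also be produced in one step with `pd.crosstab`. The sketch below is an alternative to the `groupby` and `div` combination above, run on a small made-up frame rather than the real data.

```python
import pandas as pd

# Made-up sample: 4 women and 5 men with their relationship status.
toy = pd.DataFrame({
    "gender": ["f", "f", "f", "f", "m", "m", "m", "m", "m"],
    "status": ["single", "single", "single", "seeing someone",
               "single", "single", "single", "single", "available"],
})

# normalize="index" divides each row by its own total, so every
# gender's percentages sum to 100. Missing combinations become 0.
status_by_gender = pd.crosstab(toy["gender"], toy["status"], normalize="index") * 100
print(status_by_gender.round(2))
```

One difference from `unstack()` is that combinations with no users appear as 0 rather than `NaN`.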
Insights:
- The majority of users, both male (93.16%) and female (92.54%), identify as single, reflecting the platform's primary target audience.
- Men have a slightly higher proportion of single users compared to women.
- Women are more likely to be "seeing someone" (4.16% vs 2.96% for men), suggesting more relationship activity or engagement.
- Few users report being available (2.72% for women, 3.37% for men), indicating a smaller portion is actively seeking a partner, which may highlight either intentional disengagement or potential limitations in profile activity.
- Married and unknown statuses are minimal, reinforcing that Bumble is predominantly focused on non-married, single individuals seeking connections.
Recommendations:
- Focus marketing efforts on the single users, as they make up the largest group.
- Consider adding features that help users who are seeing someone engage more, such as relationship tools or connections.
- Improve the available status section to encourage more users to indicate if they are looking for a partner.
2. Correlation Analysis¶
Correlation analysis helps uncover relationships between variables, guiding feature engineering and hypothesis generation. For example, understanding how age correlates with income or word count in profiles can reveal behavioral trends that inform platform design.
Questions:
- What are the correlations between numerical columns such as age, income, and gender? Are there any strong positive or negative relationships?
# Encode gender as numeric (1 for male, 0 for female)
df['gender_encoded'] = df['gender'].map({'m': 1, 'f': 0})
# Calculate the correlation matrix
correlation_matrix = df[['age', 'income', 'gender_encoded']].corr()
# Display the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)
Correlation Matrix:
                     age    income  gender_encoded
age             1.000000  0.008459       -0.016291
income          0.008459  1.000000        0.074604
gender_encoded -0.016291  0.074604        1.000000
- Correlation near 1 or -1: Strong positive or negative relationship.
- Correlation near 0: No significant linear relationship.
- Intermediate values: Interpret based on the context (0.3 to 0.7 or -0.3 to -0.7 typically means moderate relationships).
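The values in the matrix are Pearson coefficients, which is the default method of `DataFrame.corr()`. To make the interpretation rules above more concrete, the sketch below computes the coefficient by hand on made-up numbers and compares it with the pandas result. The frame `toy` is illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up ages and incomes.
toy = pd.DataFrame({
    "age": [22, 35, 38, 23, 29],
    "income": [20000, 80000, 60000, 20000, 40000],
})

# Pearson's r: covariance of the centered values divided by the
# product of their standard deviations (the 1/n factors cancel).
x = toy["age"] - toy["age"].mean()
y = toy["income"] - toy["income"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

r_pandas = toy["age"].corr(toy["income"])
print(round(r_manual, 4), round(r_pandas, 4))
```

Because the coefficient only measures linear association, a value near 0 does not rule out a non-linear relationship.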
Insights:
- Age vs Income: The correlation is 0.0085, showing no significant relationship between age and income on the platform.
- Age vs Gender: The correlation is -0.0163, indicating no significant relationship between age and gender.
- Income vs Gender: The correlation is 0.0746, a very weak positive relationship, suggesting no major income differences between genders.
Recommendations:
- Targeting Age and Income: Since there's no significant correlation between age and income, targeting users based on these factors may not be effective for matchmaking or feature design.
- Marketing Strategies: As gender and income have minimal correlation, Bumble could continue with gender-neutral strategies and focus on other aspects like behavior and preferences for more personalized marketing.
- Feature Design: Age and gender do not show strong relationships with other features, suggesting the need to focus on user behavior and engagement for better user experience and feature development.
- Further Exploration: Explore non-linear relationships or factors like user activity and location for more meaningful insights on user demographics.
- Behavioral Segmentation: Behavioral patterns might provide more actionable insights than age and gender, allowing for better user segmentation and targeted engagement.
In summary, age, income, and gender show no meaningful linear relationship with one another, and exploring behavioral or activity-based features could enhance personalization and user experience on the platform.
3. Diet and Lifestyle Analysis¶
Lifestyle attributes such as diet and drinks provide insights into user habits and preferences. Analyzing these factors helps identify compatibility trends and inform product features like filters or match recommendations.
Questions:
- How do dietary preferences (diet) distribute across the platform? For example, what percentage of users identify as vegetarian, vegan, or follow "anything" diets?
# Calculate the count and percentage of diet preferences
diet_distribution = df['diet'].value_counts()
diet_percentage = df['diet'].value_counts(normalize=True) * 100
# Combine count and percentage into a single DataFrame for visualization
diet_combined = pd.DataFrame({
'Count': diet_distribution,
'Percentage': diet_percentage.apply(lambda x: f"{x:.2f}") # Format percentage to 2 decimal places
})
# Display the combined count and formatted percentage values
print(diet_combined)
                     Count Percentage
diet
mostly anything      16585      46.65
anything              6183      17.39
strictly anything     5113      14.38
mostly vegetarian     3444       9.69
mostly other          1007       2.83
strictly vegetarian    875       2.46
vegetarian             667       1.88
strictly other         452       1.27
mostly vegan           338       0.95
other                  331       0.93
strictly vegan         228       0.64
vegan                  136       0.38
mostly kosher           86       0.24
mostly halal            48       0.14
strictly halal          18       0.05
strictly kosher         18       0.05
halal                   11       0.03
kosher                  11       0.03
Insights:
1. Diet Preferences Distribution:
- The largest group (46.65% of the 35,551 users who reported a diet) falls under the "mostly anything" category, indicating a broad preference for a variety of diets.
- Other significant groups include "anything" (17.39%) and "strictly anything" (14.38%), showing a general inclination towards a non-restricted diet. Together, the three "anything" categories account for about 78% of reported diets.
- Smaller groups are represented by users with vegetarian, vegan, kosher, or halal preferences. Apart from "mostly vegetarian" (9.69%), each of these categories covers less than 3% of users who reported a diet.
2. Minority Diet Preferences:
- The least common diets are "halal" and "kosher" (0.03% each, 11 users each), followed by "strictly halal" and "strictly kosher" (0.05% each, 18 users each), reflecting niche dietary choices.
3. Vegetarian and Vegan Preferences:
- There is a noticeable portion of users (about 16.00% combined) who prefer vegetarian or vegan diets across the six related categories ("mostly vegetarian," "strictly vegetarian," "vegetarian," "mostly vegan," "strictly vegan," "vegan"). However, these categories still represent a smaller proportion compared to non-vegetarian preferences.
Recommendations:
- Flexible Filters: Add filters that allow users to select different diet types easily, as many prefer a flexible diet.
- Focus on Vegetarians/Vegans: Offer special features or matching options for users who follow vegetarian or vegan diets.
- Support Special Diets: Include options for people who follow halal or kosher diets, even though they are less common.
- Better Profile Information: Help users clearly state their diet preferences to improve match accuracy.
- How do drinking habits (drinks) vary across different diet categories? Are users with stricter diets (e.g., vegan) less likely to drink?
# Group by diet and drinks, then calculate the count
diet_drinks_count = df.groupby(['diet', 'drinks']).size().unstack()
# Calculate percentage distribution by diet
diet_drinks_percentage = diet_drinks_count.apply(lambda x: x / x.sum() * 100, axis=1)
# Display the results
print("Diet vs Drinking Habits (Count):")
print(diet_drinks_count)
print("\nDiet vs Drinking Habits (Percentage):")
print(diet_drinks_percentage.apply(lambda x: x.map(lambda v: f"{v:.2f}"))) # Format percentage to 2 decimal places
Diet vs Drinking Habits (Count):
drinks               desperately  not at all   often  rarely  socially  very often
diet
anything                    21.0       295.0   574.0   510.0    4534.0        58.0
halal                        NaN         4.0     1.0     NaN       4.0         NaN
kosher                       NaN         1.0     NaN     2.0       7.0         1.0
mostly anything             66.0       842.0  1386.0  1537.0   12277.0       129.0
mostly halal                 3.0        10.0     2.0     8.0      16.0         4.0
mostly kosher                1.0         7.0     4.0    17.0      50.0         5.0
mostly other                 8.0        91.0    49.0   176.0     646.0         5.0
mostly vegan                 2.0        40.0    22.0    62.0     193.0         3.0
mostly vegetarian           19.0       195.0   233.0   466.0    2388.0        20.0
other                        1.0        35.0    23.0    60.0     192.0         1.0
strictly anything           66.0       182.0   685.0   318.0    3677.0        59.0
strictly halal               1.0         7.0     2.0     1.0       5.0         2.0
strictly kosher              6.0         4.0     2.0     3.0       1.0         2.0
strictly other              14.0        67.0    34.0    63.0     244.0        10.0
strictly vegan               8.0        58.0    19.0    37.0      94.0         3.0
strictly vegetarian          9.0        85.0    82.0   121.0     543.0         9.0
vegan                        2.0        23.0    19.0    17.0      65.0         1.0
vegetarian                   6.0        37.0    59.0    67.0     446.0         7.0

Diet vs Drinking Habits (Percentage):
drinks              desperately not at all  often rarely socially very often
diet
anything                   0.35       4.92   9.58   8.51    75.67       0.97
halal                       nan      44.44  11.11    nan    44.44        nan
kosher                      nan       9.09    nan  18.18    63.64       9.09
mostly anything            0.41       5.19   8.54   9.47    75.61       0.79
mostly halal               6.98      23.26   4.65  18.60    37.21       9.30
mostly kosher              1.19       8.33   4.76  20.24    59.52       5.95
mostly other               0.82       9.33   5.03  18.05    66.26       0.51
mostly vegan               0.62      12.42   6.83  19.25    59.94       0.93
mostly vegetarian          0.57       5.87   7.02  14.03    71.91       0.60
other                      0.32      11.22   7.37  19.23    61.54       0.32
strictly anything          1.32       3.65  13.74   6.38    73.73       1.18
strictly halal             5.56      38.89  11.11   5.56    27.78      11.11
strictly kosher           33.33      22.22  11.11  16.67     5.56      11.11
strictly other             3.24      15.51   7.87  14.58    56.48       2.31
strictly vegan             3.65      26.48   8.68  16.89    42.92       1.37
strictly vegetarian        1.06      10.01   9.66  14.25    63.96       1.06
vegan                      1.57      18.11  14.96  13.39    51.18       0.79
vegetarian                 0.96       5.95   9.49  10.77    71.70       1.13
Insights:
- General Trends: "Socially" is the most common drinking habit in nearly every diet category; the exceptions are the very small "strictly halal" and "strictly kosher" groups (18 users each). "Anything" and "Mostly Anything" are by far the largest diet groups, and over 75% of users in each drink socially.
- Stricter Diets and Drinking: Vegan groups are more likely to drink "not at all" or "rarely": about 32% of "vegan" and "mostly vegan" users and 43% of "strictly vegan" users, compared with 13% to 15% for "anything" and "mostly anything". Among "strictly halal" and "strictly kosher" users, 39% and 22% respectively drink "not at all", although with only 18 users per group these percentages are unstable.
- Occasional and Frequent Drinking: The table does not show less restrictive diets drinking more frequently. The combined share drinking "often" or "very often" is roughly 9% to 11% for "Anything", "Mostly Anything", "Strictly Vegan", and "Strictly Vegetarian" alike.
Recommendations:
- Tailored Marketing for Stricter Diets: Highlight non-alcoholic options or events catering to users with vegan, halal, or kosher preferences to increase inclusivity and engagement.
- Social Events Targeting Majority Preferences: Focus on "social drinking" campaigns since it aligns with most users' preferences across diets.
- Educational Campaigns: For stricter diet groups, consider awareness campaigns about low-alcohol or non-alcoholic beverages that align with their dietary restrictions.
Overall Conclusion:¶
People with stricter diets, like vegans or those following halal or kosher, are more likely to abstain from alcohol or drink only rarely. People with less restrictive diets, like "anything" or "mostly anything," are more likely to drink socially. This suggests that stricter diets are associated with lower alcohol consumption, although the halal and kosher groups are too small to support firm conclusions.
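The percentages above are row-normalised, so a diet with only a handful of users can show extreme values. The sketch below is one way to compute the same kind of table and set aside small groups before interpreting it. The toy data and the `MIN_USERS` threshold are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`
toy = pd.DataFrame({
    "diet": ["anything"] * 6 + ["strictly kosher"] * 2,
    "drinks": ["socially"] * 4 + ["often", "rarely"] + ["not at all", "socially"],
})

# Row-normalised percentages: every diet row sums to 100
pct = pd.crosstab(toy["diet"], toy["drinks"], normalize="index") * 100

# Group sizes, used to flag diets with too few users to trust
group_size = toy["diet"].value_counts()
MIN_USERS = 5  # assumed cut-off, chosen only for this illustration
reliable = pct[group_size.reindex(pct.index) >= MIN_USERS]

print(pct.round(2))
print(reliable.round(2))
```

On the real data, the same mask would drop or flag groups such as "strictly halal" and "strictly kosher" before any conclusions are drawn from them.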
4. Geographical Insights¶
Analyzing geographical data helps Bumble understand its user base distribution, enabling targeted regional campaigns and feature localization. For instance, identifying the top cities with active users can guide marketing efforts in those areas.
Questions:¶
- Extract city and state information from the location column. What are the top 5 cities and states with the highest number of users?
# Splitting the location column into city and state
df[['city', 'state']] = df['location'].str.split(',', n=1, expand=True)
# Cleaning up extra spaces in city and state
df['city'] = df['city'].str.strip()
df['state'] = df['state'].str.strip()
# Calculating the top 5 cities and states
top_cities = df['city'].value_counts().head(5)
top_states = df['state'].value_counts().head(5)
# Displaying the results
print("Top 5 Cities with the highest number of users:")
print(top_cities)
print("\nTop 5 States with the highest number of users:")
print(top_states)
Top 5 Cities with the highest number of users:
city
san francisco    31064
oakland           7214
berkeley          4210
san mateo         1331
palo alto         1064
Name: count, dtype: int64

Top 5 States with the highest number of users:
state
california       59853
new york            17
illinois             8
massachusetts        5
texas                4
Name: count, dtype: int64
Insights:
City-Level Insights:
- San Francisco dominates with 31,064 users, far surpassing other cities, indicating it is a key hub for Bumble users.
- Oakland (7,214 users) and Berkeley (4,210 users) are also significant, suggesting strong engagement within the Bay Area. Cities like San Mateo (1,331 users) and Palo Alto (1,064 users) show moderate user presence, representing potential areas for growth.
State-Level Insights:
- California leads overwhelmingly with 59,853 users, showcasing it as the primary market for Bumble in this dataset.
- States like New York (17 users), Illinois (8 users), Massachusetts (5 users), and Texas (4 users) have minimal user representation in the dataset, indicating under-penetration or data imbalance.
Recommendations:
- Focus on California: Build on the strong presence in California by enhancing user engagement and launching new features in cities like San Francisco, Oakland, and Berkeley.
- Grow in Other States: Increase marketing efforts in underrepresented states like New York, Texas, and Illinois to expand Bumble's reach.
- Expand in Smaller Cities: Target mid-sized cities like San Mateo and Palo Alto for growth opportunities.
- Check Data Gaps: Verify low user numbers in other states to ensure data accuracy and identify potential growth areas.
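To make the data imbalance concrete, the state counts can be expressed as shares of all users. The following is a minimal sketch on hypothetical toy rows, reusing the same `str.split(',', n=1, expand=True)` approach as the cell above; `toy` and `state_share` are illustrative names only.

```python
import pandas as pd

# Hypothetical rows in the same "city, state" format as the location column
toy = pd.DataFrame({"location": [
    "san francisco, california",
    "oakland, california",
    "new york, new york",
    "san francisco, california",
]})

# Split once on the first comma, then trim whitespace
parts = toy["location"].str.split(",", n=1, expand=True)
toy["city"] = parts[0].str.strip()
toy["state"] = parts[1].str.strip()

# Share of users per state, in percent
state_share = toy["state"].value_counts(normalize=True).mul(100).round(1)
print(state_share)
```

Applied to the full dataset, a share close to 100% for California would indicate that the other states are better treated as a data artifact than as separate markets.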
- How does age vary across the top cities? Are certain cities dominated by younger or older users?
# Group data by city and calculate age statistics
age_stats = df[df['city'].isin(['san francisco', 'oakland', 'berkeley', 'san mateo', 'palo alto'])]
age_summary = (
age_stats.groupby('city')['age']
.agg(['mean', 'median', 'min', 'max'])
.reset_index()
.rename(columns={'mean': 'Mean Age', 'median': 'Median Age', 'min': 'Min Age', 'max': 'Max Age'})
)
print(age_summary)
            city   Mean Age  Median Age  Min Age  Max Age
0       berkeley  30.062233        30.0     23.0     46.0
1        oakland  31.518991        30.0     23.0     46.0
2      palo alto  29.953008        30.0     23.0     46.0
3  san francisco  30.891386        30.0     23.0     46.0
4      san mateo  31.147258        30.0     23.0     46.0
Insights:
Age Distribution:
- The average and median age across the top cities are fairly consistent, around 30-31 years, with a minimum of 23 and a maximum of 46 years.
- Berkeley and Palo Alto have slightly younger populations, with lower mean ages compared to other cities.
- Oakland and San Mateo have marginally older users on average.
Homogeneity:
- There is minimal variation in age distribution across these cities, indicating a fairly uniform user demographic.
Recommendations:
- Tailored Features: Develop features appealing to users in their late 20s to early 30s, as this is the dominant age range.
- Localized Engagement: Focus on slightly different themes in cities like Berkeley (younger, student-friendly) and Oakland or San Mateo (professional, career-focused).
- Broaden Demographics: Consider campaigns to attract a broader age range, especially older users, to diversify the platform further.
- What are the average income levels in the top states or cities? Are there regional patterns in reported income?
# Filter top cities and states
top_cities = ['san francisco', 'oakland', 'berkeley', 'san mateo', 'palo alto']
top_states = ['california', 'new york', 'illinois', 'massachusetts', 'texas']
# Calculate mean income for top cities
city_income = df[df['city'].isin(top_cities)].groupby('city')['income'].mean().reset_index()
city_income.rename(columns={'income': 'Average Income'}, inplace=True)
# Calculate mean income for top states
state_income = df[df['state'].isin(top_states)].groupby('state')['income'].mean().reset_index()
state_income.rename(columns={'income': 'Average Income'}, inplace=True)
# Display results
print("Average Income by Top Cities:")
print(city_income)
print("\nAverage Income by Top States:")
print(state_income)
Average Income by Top Cities:
            city  Average Income
0       berkeley    17372.921615
1        oakland    22586.637095
2      palo alto    19332.706767
3  san francisco    20150.012877
4      san mateo    22779.864763

Average Income by Top States:
           state  Average Income
0     california    20044.943445
1       illinois        0.000000
2  massachusetts     6000.000000
3       new york    31764.705882
4          texas     5000.000000
Insights:
Top Cities:
- The highest average income is in San Mateo and Oakland, with average incomes of approximately 22,780 and 22,587 respectively.
- Berkeley has the lowest average income among the top cities at around 17,373.
Top States:
- California, by far the largest group, has an average income of about 20,045.
- New York has the highest average income (about 31,765), above California, but this figure is based on only 17 users.
- Texas and Illinois show very low average incomes, with Texas at 5,000 and Illinois at 0, which may point to missing or unreported data for these states.
Recommendations:
- Targeted Campaigns: Focus on California and New York for regions with higher average incomes. Tailor marketing or product offerings to meet the demands of users with higher incomes.
- Data Quality: Investigate missing or erroneous income data for Illinois and Texas, as the reported averages seem off, possibly due to incomplete or missing entries.
Regional Patterns:
- The highest average incomes are concentrated in California and New York, which may align with areas like the Bay Area (San Mateo, San Francisco, Palo Alto) that are known for high living costs and tech industries.
- Texas and Illinois show unusually low income averages, which may indicate incomplete or erroneous data, rather than regional trends.
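The averages above include users whose income is 0, and a later question in this notebook excludes those zeros, which suggests 0 stands for "not reported" (this is an assumption, not something the data confirms). The sketch below shows, on hypothetical toy data, how group size and the number of users who reported an income can be placed next to each mean. `n_reported` and `mean_of_reported` are helper names introduced only for this example.

```python
import pandas as pd

# Hypothetical toy data; 0 is assumed to mean "income not reported"
toy = pd.DataFrame({
    "state": ["california"] * 4 + ["illinois"] * 2 + ["texas"] * 2,
    "income": [0, 0, 40000, 80000, 0, 0, 0, 10000],
})

def n_reported(s):
    """Number of users with a non-zero income."""
    return int((s > 0).sum())

def mean_of_reported(s):
    """Mean income among users with a non-zero income (NaN if none)."""
    return s[s > 0].mean()

summary = toy.groupby("state")["income"].agg(
    users="size",
    mean_incl_zero="mean",
    reported=n_reported,
    mean_reported=mean_of_reported,
)
print(summary)
```

A state with `reported` equal to 0 or 1 says little about regional income, whichever mean is used.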
5. Height Analysis¶
Physical attributes like height are often considered important in dating preferences. Analyzing height patterns helps Bumble understand user demographics and preferences better.
Questions:
- What is the average height of users across different gender categories?
# Group by gender and calculate average height in cm
gender_height_cm = df.groupby('gender')['height_cm'].mean().reset_index()
gender_height_cm.rename(columns={'height_cm': 'Average Height (cm)'}, inplace=True)
# Display results with two decimal places
print("Average Height by Gender (in cm):")
print(gender_height_cm.round(2))
Average Height by Gender (in cm):
  gender  Average Height (cm)
0      f               168.45
1      m               176.42
- How does height vary by age_group? Are there noticeable trends among younger vs. older users?
# Calculate average height by age group with observed=False to avoid future warnings
height_by_age_group = df.groupby('age_group', observed=False)['height_cm'].mean().reset_index()
# Rename column for clarity
height_by_age_group.rename(columns={'height_cm': 'Average Height (cm)'}, inplace=True)
# Display results with rounded values
print("Average Height by Age Group (in cm):")
print(height_by_age_group.round(2))
Average Height by Age Group (in cm):
  age_group  Average Height (cm)
0     18-25               173.43
1     26-35               173.15
2     36-45               173.23
3       46+               173.44
Insights:
- Consistent Average Height Across Age Groups: The average height across all age groups remains consistent, with minor variations (173.15 cm to 173.44 cm).
- Slight Increase in Older Age Group: Users aged 46+ have the highest average height (173.44 cm), though the difference is negligible compared to other groups.
- Stable Trend: There is no significant trend in height variation between younger and older users, indicating height is largely independent of age groups in this dataset.
Recommendations:
- Focus on Broader Demographics: Since height shows minimal variation across age groups, it may not be a critical factor for age-specific insights or strategies.
- Cross-Analyze with Other Factors: Combine height data with other attributes like location, gender, or income to explore more actionable trends.
- Maintain Data Quality: Ensure accurate measurement and consistency of height data for future analyses.
- What is the distribution of height within body_type categories (e.g., athletic, curvy, thin)? Do the distributions align with expectations?
# Calculate the mean height for each body_type category
height_by_body_type = df.groupby('body_type')['height_cm'].mean().reset_index()
height_by_body_type.rename(columns={'height_cm': 'Average Height (cm)'}, inplace=True)
# Display results with rounded values
print("Average Height by Body Type (in cm):")
print(height_by_body_type.round(2))
Average Height by Body Type (in cm):
         body_type  Average Height (cm)
0   a little extra               173.85
1         athletic               175.27
2          average               173.04
3            curvy               168.71
4              fit               173.57
5     full figured               169.99
6           jacked               174.12
7       overweight               174.05
8   rather not say               171.90
9           skinny               173.49
10            thin               172.45
11         used up               174.47
Insights:
- Athletic body type has the highest average height at 175.27 cm, suggesting that users with this body type tend to be slightly taller than others.
- Curvy body type has the lowest average height at 168.71 cm, which could indicate a trend toward shorter users within this category.
- Other body types such as average, a little extra, fit, jacked, overweight, skinny, and used up have average heights ranging from 173.04 cm to 174.47 cm, showing no significant variation in height across these categories.
- The rather not say category has an average height of 171.90 cm, slightly below most categories in which users specified a body type.
Recommendations:
- Target Active Users: For users with an athletic body type, consider promoting fitness-related features or content.
- Inclusive Features: Offer more inclusive options for users with a curvy body type to make them feel represented.
- Equal Representation: Make sure all body types, like a little extra and full figured, are equally represented in your campaigns.
- Clarify Body Types: Since many users didn’t specify their body type, encourage users to be more specific for better personalization.
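The question asks about the distribution of height within each body type, while the cell above reports only the mean. A minimal sketch of how quartiles and a per-gender breakdown could be added is shown below. The toy data is hypothetical, and the gender split is included only as one possible check on whether category means reflect gender composition.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`
toy = pd.DataFrame({
    "body_type": ["athletic"] * 4 + ["curvy"] * 4,
    "gender": ["m", "m", "m", "f", "f", "f", "f", "m"],
    "height_cm": [180, 178, 176, 170, 165, 167, 169, 175],
})

# Quartiles per body type describe the spread, not just the centre
spread = toy.groupby("body_type")["height_cm"].describe()[
    ["count", "mean", "25%", "50%", "75%"]
]

# Mean height per body type, split by gender
by_gender = toy.groupby(["body_type", "gender"])["height_cm"].mean().unstack()

print(spread)
print(by_gender)
```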
6. Income Analysis¶
Income is often an important factor for users on dating platforms. Understanding its distribution and relationship with other variables helps refine features like user search filters or personalized recommendations.
Questions:
- What is the distribution of income across the platform? Are there specific income brackets that dominate? (don't count 0)
# Filter out rows where income is 0
df_nonzero_income = df[df['income'] > 0]
# Calculate the income distribution
income_distribution = df_nonzero_income['income'].describe()
# Display the result
print(income_distribution.round(2))
count      11504.00
mean      104394.99
std       201433.53
min        20000.00
25%        20000.00
50%        50000.00
75%       100000.00
max      1000000.00
Name: income, dtype: float64
# Define bins and labels for income brackets
income_bins = [0, 25000, 50000, 75000, 100000, float('inf')]
income_labels = ['<25K', '25K-50K', '50K-75K', '75K-100K', '100K+']
# Filter out zero income and create income brackets
df_nonzero_income = df[df['income'] > 0].copy()
df_nonzero_income['income_bracket'] = pd.cut(df_nonzero_income['income'], bins=income_bins, labels=income_labels, right=False)
# Get the count of users in each income bracket
income_distribution = df_nonzero_income['income_bracket'].value_counts().sort_index()
print(income_distribution)
income_bracket
<25K        2952
25K-50K     2053
50K-75K     2418
75K-100K    1111
100K+       2970
Name: count, dtype: int64
Insights:
- The majority of users fall into the <25K and 100K+ income brackets, with 2952 and 2970 users, respectively. This suggests that the platform is used by both lower-income and high-income individuals.
- 50K-75K has the third-highest number of users (2418), indicating a significant portion of users fall in the middle-income range.
- The 25K-50K and 75K-100K income brackets have fewer users (2053 and 1111, respectively), so reported incomes cluster at the low and high ends rather than around the middle.
Recommendations:
- Target Marketing: Create campaigns that cater to both low and high-income users, offering exclusive features to high-income users and affordable options to lower-income ones.
- Personalized Filters: Offer search filters that let users find others within similar income groups, helping them connect with like-minded individuals.
- Premium Features: Consider adding extra features or subscription plans for the high-income users (100K+), as this group is quite large.
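Because the brackets are built with `right=False`, each bin includes its lower edge and excludes its upper edge, which matters for round values such as 25,000 or 100,000. The sketch below applies the same bins and labels to a few hypothetical incomes to show where boundary values land.

```python
import pandas as pd

# Same bins and labels as in the cell above
income_bins = [0, 25000, 50000, 75000, 100000, float("inf")]
income_labels = ["<25K", "25K-50K", "50K-75K", "75K-100K", "100K+"]

# Hypothetical incomes, including values that sit exactly on a bin edge
sample = pd.Series([20000, 25000, 49999, 50000, 100000, 1000000])

# right=False makes each interval closed on the left: [low, high)
brackets = pd.cut(sample, bins=income_bins, labels=income_labels, right=False)

print(pd.DataFrame({"income": sample, "bracket": brackets}))
```

An income of exactly 100,000 is therefore counted in 100K+, not in 75K-100K.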
- How does income vary by age_group and gender? Are older users more likely to report higher incomes?
# Group by age group and gender and calculate the mean income
income_by_age_gender = df.groupby(['age_group', 'gender'],observed=True)['income'].mean().reset_index()
# Pivot the table to have separate columns for male and female income
income_by_age_gender_pivot = income_by_age_gender.pivot(index='age_group', columns='gender', values='income')
# Display the result
print(income_by_age_gender_pivot.round(2))
gender            f         m
age_group
18-25      11249.66  25219.91
26-35      11089.33  25623.37
36-45      11319.26  27787.15
46+        13865.55  30676.47
Insights:
- Gender Difference: Across all age groups, male users report significantly higher average incomes compared to female users.
- Age Trends: Income tends to increase with age for both genders, with the highest average incomes observed in the 46+ age group.
- Female Income Stability: Female users show relatively stable average incomes across age groups, with only a slight increase in the 46+ age group.
- Male Income Growth: Male users experience a steady increase in income as they age, with a substantial jump in the 46+ category.
Recommendations:
- Targeted Campaigns: Consider age-based campaigns emphasizing financial stability for older users, particularly males.
- Gender-Specific Insights: Develop initiatives that address income gaps, focusing on career growth opportunities or financial education for female users.
- Premium Features: Offer premium features or services tailored to older users who may have higher disposable incomes, especially in the 46+ age group.
Part 4: Data Visualization¶
1. Age Distribution¶
Understanding the distribution of user ages can reveal whether the platform caters to specific demographics or age groups. This insight is essential for targeted marketing and user experience design.
Questions:
- Plot a histogram of age with a vertical line indicating the mean age. What does the distribution reveal about the most common age group on the platform?
# Calculate mean age
mean_age = df['age'].mean()
# Plot the histogram with Bumble theme color
plt.figure(figsize=(8, 5))
plt.hist(df['age'], bins=15, color='#FFCC00', edgecolor='black', alpha=0.9)
plt.axvline(mean_age, color='red', linestyle='dashed', linewidth=2, label=f'Mean Age: {mean_age:.2f}')
# Add labels and title
plt.title('Age Distribution of Bumble Users', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)
plt.legend()
plt.grid(axis='y', alpha=0.75)
# Show the plot
plt.show()
Insights:
- The age distribution reveals that the majority of Bumble users fall within the 25–35 age range, as indicated by the peak of the histogram.
- The mean age of 30.90 suggests that the platform primarily caters to millennials and younger adults, aligning with the target demographic for dating apps.
Recommendations:
- Focus ads and promotions on users aged 25–35, as they are the most active group.
- Add features and content that appeal to people in their 20s and early 30s.
- Create offers or events for older users (35+) to attract more variety in age groups.
- How does the age distribution differ by gender? Are there age groups where one gender is more prevalent?
# Set the size of the plot
plt.figure(figsize=(8, 5))
# Map gender codes to readable labels so seaborn builds the legend itself
# (overriding labels through plt.legend() can pair them with the wrong colors)
sns.histplot(data=df.assign(Gender=df['gender'].map({'f': 'Female', 'm': 'Male'})),
x='age', hue='Gender', bins=15, kde=False,
multiple="dodge", palette={'Female': '#87CEEB', 'Male': '#FFCC00'},
edgecolor='black', alpha=0.8)
# Add labels and title
plt.title('Age Distribution by Gender', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)
plt.grid(axis='y', alpha=0.75)
# Show the plot
plt.show()
# Set the size of the plot
plt.figure(figsize=(8, 5))
# Same plot with stacked bars; gender codes are mapped to readable labels
# so that seaborn's own legend matches the bar colors
sns.histplot(data=df.assign(Gender=df['gender'].map({'f': 'Female', 'm': 'Male'})),
x='age', hue='Gender', bins=15, kde=False,
multiple="stack", palette={'Female': '#87CEEB', 'Male': '#FFCC00'},
edgecolor='black', alpha=0.8)
# Add labels and title
plt.title('Age Distribution by Gender', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)
plt.grid(axis='y', alpha=0.75)
# Show the plot
plt.show()
2. Income and Age¶
Visualizing the relationship between income and age helps uncover patterns in reported income levels across age groups, which could inform user segmentation strategies.
Questions:
- Use a scatterplot to visualize the relationship between income and age, with a trend line indicating overall patterns. Are older users more likely to report higher incomes?
# Set plot size
plt.figure(figsize=(8, 5))
# Scatterplot and trend line
sns.scatterplot(data=df, x='age', y='income', color='#FFCC00', alpha=0.8, edgecolor='black')
sns.regplot(data=df, x='age', y='income', scatter=False, color='black', line_kws={"linewidth": 2})
# Add title and labels
plt.title('Income vs. Age', fontsize=14, color='black')
plt.xlabel('Age', fontsize=12, color='black')
plt.ylabel('Income', fontsize=12, color='black')
# Add grid
plt.grid(axis='y', alpha=0.5, linestyle='--', color='gray')
# Show plot
plt.show()
Observations and Patterns:
- Scatterplot: Shows individual data points of age and income.
- Trend Line: Indicates whether income increases, decreases, or remains constant with age.
- Grid Lines: Subtle gray dashed lines improve readability.
Insight:
- The trend line indicates that income remains constant across different age groups, suggesting that age does not significantly influence reported income levels among users in this dataset. There may be no clear upward or downward trend linking older users to higher or lower incomes.
Recommendation:
- Focus on other factors like education or location for income-based user segmentation.
- Collect additional data to identify better predictors of income.
- Avoid age-based targeting for income-related strategies.
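Reading a trend off a plot can be backed up with numbers. The sketch below computes the slope of a least-squares line (the kind of line `sns.regplot` draws by default) and the Pearson correlation. The toy data is hypothetical and deliberately linear so the result is easy to check; it does not reflect the dataset's actual relationship.

```python
import numpy as np
import pandas as pd

# Hypothetical, perfectly linear toy data (not the real dataset)
toy = pd.DataFrame({
    "age": [23, 27, 31, 35, 39, 43],
    "income": [20000, 30000, 40000, 50000, 60000, 70000],
})

# Least-squares straight line: income = slope * age + intercept
slope, intercept = np.polyfit(toy["age"], toy["income"], deg=1)

# Pearson correlation between age and income
r = toy["age"].corr(toy["income"])

print(f"slope: {slope:.1f} per year of age, correlation: {r:.3f}")
```

On the real data, a slope near zero together with a correlation near zero would support the "no clear trend" reading of the plot.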
- Create boxplots of income grouped by age_group. Which age group reports the highest median income?
# Filter out rows with zero or negative income
df_filtered = df[df['income'] > 0]
# Create the boxplot
plt.figure(figsize=(8, 5))
sns.boxplot(data=df_filtered, x='age_group', y='income', color='#FFCC00') # Bumble yellow
# Add title and labels
plt.title('Income by Age Group', fontsize=14, color='black')
plt.xlabel('Age Group', fontsize=12, color='black')
plt.ylabel('Income', fontsize=12, color='black')
# Show the plot
plt.show()
Observations and Patterns:
- Box: Displays the interquartile range (IQR) of income for each age group.
- Line Inside Box: Median income for the group.
- Whiskers and Points: Show the range and outliers.
Insights:
- Median Income: The 36-45 age group has the highest median income. The 18-25 age group has the lowest median income.
- Outliers: The outliers shown on the plot are significantly higher than the typical incomes for their respective age groups.
- Distribution: Income distributions across age groups are similar, but the 36-45 group shows slightly higher incomes overall.
Recommendations:
- Focus on 36-45 Age Group: This group is in their peak earning years and is the best target for high-value products or services.
- Support Young Earners: Offer training or career development to the 18-25 age group to boost their earning potential.
- Analyze Outliers: Investigate high-income individuals (outliers) to find trends in skills, industries, or education that lead to such incomes.
- Plan for Older Groups: Provide financial planning and retirement services for the 46+ group as their incomes stabilize.
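The medians and interquartile ranges drawn in the boxplot can also be computed directly, which avoids reading values off the chart. The following is a minimal sketch on hypothetical toy data; the `stats` and `top_group` names are illustrative.

```python
import pandas as pd

# Hypothetical toy data; zero incomes are filtered out as in the cell above
toy = pd.DataFrame({
    "age_group": ["18-25"] * 3 + ["26-35"] * 3 + ["36-45"] * 3,
    "income": [0, 20000, 30000, 20000, 50000, 60000, 40000, 80000, 0],
})
reported = toy[toy["income"] > 0]

# Quartiles per age group: the box edges and the median line of a boxplot
stats = reported.groupby("age_group")["income"].quantile([0.25, 0.5, 0.75]).unstack()
stats.columns = ["q1", "median", "q3"]
stats["iqr"] = stats["q3"] - stats["q1"]

# Age group with the highest median income
top_group = stats["median"].idxmax()

print(stats)
print("Highest median:", top_group)
```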
- Analyze income levels within gender and status categories. For example, are single men more likely to report higher incomes than single women?
# Create the boxplot to compare income by gender and status
plt.figure(figsize=(8, 5))
sns.boxplot(data=df_filtered, x='status', y='income', hue='gender', palette={'f': '#87CEEB', 'm': '#FFCC00'})
# Add title and labels
plt.title('Income by Gender and Status (Excluding Zero/Negative Income)', fontsize=14)
plt.xlabel('Status', fontsize=12)
plt.ylabel('Income', fontsize=12)
# Show the plot
plt.show()
Observations:
- Male Median is Higher: Males report slightly higher median incomes across all status categories.
- No Lower Whiskers for Females: Female income distributions lack lower whiskers, indicating low variability at the lower end.
- "Seeing Someone" Status: The box for females is visible, but the median line is faint or overlaps with the Q1 line, showing concentrated incomes.
- Outliers in "Single" Status: The "single" category, especially for males, has the highest number of outliers, indicating wide income variability.
Insights:
- Males generally report higher median incomes than females across all categories.
- Female incomes are tightly concentrated, with minimal variability in most cases.
- The "single" category, particularly for males, has significant high-income outliers.
Recommendations:
- Focus on single males for premium offerings due to high-income variability.
- Investigate structural factors influencing the concentration of female incomes.
- Analyze high-income outliers in the "single" category for potential targeting.
3. Pets and Preferences¶
Pets are often a key lifestyle preference and compatibility factor. Analyzing how pets preferences distribute across demographics can provide insights for filters or recommendations.
Questions:
- Create a bar chart showing the distribution of pets categories (e.g., likes dogs, likes cats). Which preferences are most common?
# Count the occurrences of each pet category
pet_counts = df['pets'].value_counts()
# Plot the bar chart
plt.figure(figsize=(8, 5))
sns.barplot(x=pet_counts.index, y=pet_counts.values, color='#FFCC00', edgecolor='black')
# Add labels and title
plt.title('Distribution of Pet Preferences', fontsize=14)
plt.xlabel('Pet Preferences', fontsize=12)
plt.ylabel('Number of Users', fontsize=12)
# Rotate x-axis labels for better readability
plt.xticks(rotation=90)
# Show the plot
plt.show()
Insights:
- Most Users Did Not Specify a Preference: The highest bar corresponds to users who did not mention their pet preference (filled with "Not Specified").
- Second Most Popular Preference: Users who like both cats and dogs are the next largest group.
- Third Place: Users who specifically like dogs.
- Fourth Preference: Users who like dogs but also have cats.
- Fifth Preference: Users who specifically have dogs.
Recommendation:
- Target Messaging for "Unknown" Group: Encourage users to update their profiles with pet preferences to improve personalization.
- Focus on Popular Groups: Tailor features or content for users who like both cats and dogs, as they form a significant segment.
- Engage Dog Lovers: Consider features specifically for dog lovers, given their notable representation.
- How do pets preferences vary across gender and age_group? Are younger users more likely to report liking pets compared to older users?
# Create a new binary column indicating whether the user likes pets
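# Match the whole word "likes" so that values such as "dislikes dogs" are not counted as liking pets
df['likes_pets'] = df['pets'].str.contains(r'\blikes\b', na=False)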
# Filter out rows where pets preference is not specified
df_filtered = df[df['pets'] != 'Not Specified']
# Group by age_group and gender and count the users who like pets
# Adding observed=False to avoid the FutureWarning
age_gender_pets = df_filtered[df_filtered['likes_pets'] == True].groupby(['age_group', 'gender'], observed=False).size().reset_index(name='count')
# Plotting the data using a simple bar plot
plt.figure(figsize=(8, 5))
sns.barplot(data=age_gender_pets, x='age_group', y='count', hue='gender', palette=['#87CEEB','#FFCC00'])
# Adding labels and title
plt.title('Pets Preferences Across Gender and Age Groups', fontsize=16)
plt.xlabel('Age Group', fontsize=12)
plt.ylabel('Count of Users Liking Pets', fontsize=12)
# Show the plot
plt.show()
Insights Based on Observations:
- Gender Preference for Pets: In every age group, more males than females report liking pets. The chart shows raw counts, and males outnumber females on the platform, so this alone does not show that males are more likely to like pets.
- Age Group Preference: The 26-35 age group has the highest count of users reporting a liking for pets, which partly reflects that it is the most common age range on the platform.
- Similar Counts for Younger and Older Groups: The 18-25 and 36-45 age groups have almost identical counts, with only a minor difference in the number of users who like pets.
- Lower Counts in Older Age Groups: The 46+ age group has a much lower count of users who like pets. This may simply reflect fewer users in that group, so the proportion should be checked before concluding that liking pets declines with age.
Recommendations:
- Target Males More: Males account for the largest number of pet-liking users in every age group, so male-targeted campaigns reach the largest audience.
- Focus on 26-35 Age Group: This group has the highest count of pet lovers, making it the most valuable target market.
- Group 18-25 and 36-45 Together: These groups show similar pet preferences, so they can be combined in strategies.
- Minimize Focus on 46+: The 46+ age group shows less interest in pets, so pet-related campaigns may be less effective for them.
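Counts of pet-liking users depend on how many users each group contains, so a group can have the higher count and still the lower rate. The sketch below compares counts with shares on hypothetical toy data. It uses a word-boundary match on "likes" (a choice made for this example) so that "dislikes" is not counted.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`
toy = pd.DataFrame({
    "gender": ["m"] * 6 + ["f"] * 3,
    "pets": ["likes dogs", "likes cats", "has dogs", "Not Specified",
             "likes dogs and likes cats", "dislikes dogs",
             "likes dogs", "likes cats", "has cats"],
})

# Whole-word match, so "dislikes dogs" is not flagged as liking pets
toy["likes_pets"] = toy["pets"].str.contains(r"\blikes\b", na=False)

# Keep only users who stated a preference
specified = toy[toy["pets"] != "Not Specified"]

# Count and share of pet-liking users per gender
rate = specified.groupby("gender")["likes_pets"].agg(
    users="size", likers="sum", share="mean"
)
print(rate)
```

In this toy example males have more pet-liking users (3 versus 2) but a lower share (60% versus about 67%), which is why the share is the safer basis for "more likely" statements.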
4. Signs and Personality¶
Users’ self-reported zodiac signs (sign) can offer insights into personality preferences or trends. While not scientifically grounded, analyzing this data helps explore fun and engaging patterns.
Questions:
- Create a pie chart showing the distribution of zodiac signs (sign) across the platform. Which signs are most and least represented? Is this the right chart? If not, replace with right chart.
- With twelve zodiac signs, a pie chart has too many slices of similar size and becomes cluttered and hard to interpret. A bar chart is the better option here, as it allows easy comparison across categories.
# Extract the main zodiac sign using a regular expression to match the first word
df['cleaned_sign'] = df['sign'].str.extract(r'(\b\w+\b)', expand=False)
# Filter out "Not Specified"
df_filtered = df[df['cleaned_sign'] != 'Not']
# Count the occurrences of each zodiac sign
sign_counts = df_filtered['cleaned_sign'].value_counts()
# Create a fading color palette where higher counts have darker colors (reverse=True starts at the full color)
palette = sns.light_palette("#FFCC00", n_colors=len(sign_counts), reverse=True)
# Plot the data as a horizontal bar chart with the fading Bumble color palette
plt.figure(figsize=(8, 5))
sns.barplot(x=sign_counts.values, y=sign_counts.index, palette=palette, hue=sign_counts.index, edgecolor='black')
# Add labels and title
plt.title('Distribution of Zodiac Signs on the Platform (Excluding "Not Specified")', fontsize=16)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Zodiac Sign', fontsize=12)
# Show the plot
plt.show()
Insights on Zodiac Signs Distribution:
- Most Represented Signs: Leo and Gemini
- Least Represented Signs: Capricorn and Aquarius
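The cleaning step relies on `str.extract` returning the first word of each `sign` value, which is why "Not Specified" ends up as "Not". A small sketch on hypothetical example strings shows the behaviour.

```python
import pandas as pd

# Hypothetical sign values; the real column may contain other phrasings
signs = pd.Series([
    "gemini and it's fun to think about",
    "leo but it doesn't matter",
    "cancer",
    "Not Specified",
])

# First run of word characters in each string
cleaned = signs.str.extract(r"(\b\w+\b)", expand=False)

# "Not Specified" has been reduced to "Not", so that is the value to filter on
counts = cleaned[cleaned != "Not"].value_counts()

print(cleaned.tolist())
print(counts)
```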
- How does sign vary across gender and status? Are there noticeable patterns or imbalances?
# Group by 'gender', 'status', and 'cleaned_sign' to count occurrences
sign_gender_status = df_filtered.groupby(['gender', 'status', 'cleaned_sign']).size().reset_index(name='count')
# Plotting the data using barplot
plt.figure(figsize=(8, 5))
# hue is set to 'gender'; barplot's default estimator is the mean, so each bar shows the average count across status categories
sns.barplot(data=sign_gender_status, x='cleaned_sign', y='count', hue='gender', palette=['#87CEEB', '#FFCC00'], errorbar=None)
# Adding labels and title
plt.title('Zodiac Sign Distribution Across Gender and Status', fontsize=16)
plt.xlabel('Zodiac Sign', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
# Show the plot
plt.show()
Insights (How signs vary across gender and status):
- Popular Signs: Zodiac signs like Gemini, Cancer, and Libra are the most represented for both males and females, showing that these signs dominate the user base.
- Underrepresented Signs: Females: Capricorn and Scorpio are less represented. Males: Aquarius, Aries, and Pisces show relatively lower counts.
- Gender Imbalance: Males consistently outnumber females across all zodiac signs, with some signs (e.g., Capricorn for females, Aquarius for males) showing a sharper imbalance.
- Consistent Patterns: Signs like Aquarius, Pisces, and Aries have consistently lower representation across both genders.
Recommendations (Addressing patterns and imbalances):
- Attract Female Users: Run campaigns for women, especially Capricorn and Scorpio.
- Highlight Less Visible Signs: Promote Aquarius, Aries, and Pisces.
- Engage Popular Signs: Focus on Gemini, Cancer, and Libra.
- Add Zodiac Features: Include compatibility-based matching.
- Fix Low Counts: Study and address why some signs have fewer users.
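Since `sns.barplot` aggregates with the mean by default, bars built from per-status counts show an average across status categories, not a total. The sketch below uses hypothetical toy data to show how the mean and the sum can rank genders differently, and how totals could be computed before plotting.

```python
import pandas as pd

# Hypothetical toy data: three male and two female users, all with sign "leo"
toy = pd.DataFrame({
    "gender": ["m", "m", "m", "f", "f"],
    "status": ["single", "single", "married", "single", "single"],
    "cleaned_sign": ["leo"] * 5,
})

# Same grouping as in the cell above: one row per gender/status/sign combination
counts = toy.groupby(["gender", "status", "cleaned_sign"]).size().reset_index(name="count")

# Total users per sign and gender (sum over status categories)
totals = counts.groupby(["cleaned_sign", "gender"])["count"].sum()

# What a mean-based bar would show (average over status categories)
means = counts.groupby(["cleaned_sign", "gender"])["count"].mean()

print(totals)
print(means)
```

Here males have the larger total (3 versus 2) but the smaller mean (1.5 versus 2.0), so plotting `totals`, or one panel per status, is a safer basis for statements about gender imbalance.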