Introduction¶
Bumble is a widely used dating platform that allows individuals to connect based on mutual interests and compatibility. Unlike traditional dating apps, Bumble empowers women to initiate conversations, fostering meaningful interactions. Users create profiles that include demographic details, lifestyle choices, and personal preferences, providing a rich dataset for analyzing behavioral patterns and trends.
Purpose¶
The goal of this analysis is to explore and interpret the Bumble dataset to extract meaningful insights into user behavior and preferences.
import numpy as np
import pandas as pd
data = pd.read_csv(r"C:\Users\msrav\OneDrive\Desktop\nextleap course documents\python\bumble.csv")
data.head()
age | status | gender | body_type | diet | drinks | education | ethnicity | height | income | job | last_online | location | pets | religion | sign | speaks | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22 | single | m | a little extra | strictly anything | socially | working on college/university | asian, white | 75.0 | -1 | transportation | 2012-06-28-20-30 | south san francisco, california | likes dogs and likes cats | agnosticism and very serious about it | gemini | english |
1 | 35 | single | m | average | mostly other | often | working on space camp | white | 70.0 | 80000 | hospitality / travel | 2012-06-29-21-41 | oakland, california | likes dogs and likes cats | agnosticism but not too serious about it | cancer | english (fluently), spanish (poorly), french (... |
2 | 38 | available | m | thin | anything | socially | graduated from masters program | NaN | 68.0 | -1 | NaN | 2012-06-27-09-10 | san francisco, california | has cats | NaN | pisces but it doesn’t matter | english, french, c++ |
3 | 23 | single | m | thin | vegetarian | socially | working on college/university | white | 71.0 | 20000 | student | 2012-06-28-14-22 | berkeley, california | likes cats | NaN | pisces | english, german (poorly) |
4 | 29 | single | m | athletic | NaN | socially | graduated from college/university | asian, black, other | 66.0 | -1 | artistic / musical / writer | 2012-06-27-21-26 | san francisco, california | likes dogs and likes cats | NaN | aquarius | english |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   age          59946 non-null  int64
 1   status       59946 non-null  object
 2   gender       59946 non-null  object
 3   body_type    54650 non-null  object
 4   diet         35551 non-null  object
 5   drinks       56961 non-null  object
 6   education    53318 non-null  object
 7   ethnicity    54266 non-null  object
 8   height       59943 non-null  float64
 9   income       59946 non-null  int64
 10  job          51748 non-null  object
 11  last_online  59946 non-null  object
 12  location     59946 non-null  object
 13  pets         40025 non-null  object
 14  religion     39720 non-null  object
 15  sign         48890 non-null  object
 16  speaks       59896 non-null  object
dtypes: float64(1), int64(2), object(14)
memory usage: 7.8+ MB
data.describe()
age | height | income | |
---|---|---|---|
count | 59946.000000 | 59943.000000 | 59946.000000 |
mean | 32.340290 | 68.295281 | 20033.222534 |
std | 9.452779 | 3.994803 | 97346.192104 |
min | 18.000000 | 1.000000 | -1.000000 |
25% | 26.000000 | 66.000000 | -1.000000 |
50% | 30.000000 | 68.000000 | -1.000000 |
75% | 37.000000 | 71.000000 | -1.000000 |
max | 110.000000 | 95.000000 | 1000000.000000 |
data.shape
(59946, 17)
Part 1: Data Cleaning¶
1. Which columns in the dataset have missing values, and what percentage of data is missing in each column?
missing_values = data.isnull().sum()
# Multiply (not divide) by 100 to turn the missing fraction into a percentage
missing_percentage = (missing_values / len(data) * 100).round(2)
# Create a DataFrame to display the results
missing_data_df = pd.DataFrame({'Missing Values': missing_values, 'Percentage Missing': missing_percentage})
missing_data_df
Missing Values | Percentage Missing | |
---|---|---|
age | 0 | 0.00 |
status | 0 | 0.00 |
gender | 0 | 0.00 |
body_type | 5296 | 8.83 |
diet | 24395 | 40.69 |
drinks | 2985 | 4.98 |
education | 6628 | 11.06 |
ethnicity | 5680 | 9.48 |
height | 3 | 0.01 |
income | 0 | 0.00 |
job | 8198 | 13.68 |
last_online | 0 | 0.00 |
location | 0 | 0.00 |
pets | 19921 | 33.23 |
religion | 20226 | 33.74 |
sign | 11056 | 18.44 |
speaks | 50 | 0.08 |
2. Are there columns where more than 50% of the data is missing? Would you drop those columns where missing values are >50%? If yes, why?
Solution: No column has more than 50% of its data missing (the highest is diet, with 24,395 of 59,946 values missing, about 40.7%). Therefore, no columns need to be dropped.
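The 50% check can also be made explicit in code rather than read off the table. The following is a minimal sketch on a made-up toy frame (the values and the names toy, missing_share, and cols_over_half are illustrative assumptions, not part of the Bumble data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["anything", np.nan, np.nan, np.nan],
    "height": [75.0, 70.0, np.nan, 71.0],
})

# isnull().mean() is the fraction of missing values per column
missing_share = toy.isnull().mean() * 100

# Columns whose missing share exceeds the 50% threshold
cols_over_half = missing_share[missing_share > 50].index.tolist()

print(missing_share.round(2))
print(cols_over_half)
```

On the real data the same filter would return an empty list, which matches the conclusion above.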
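3. How would you handle the missing numerical data (e.g., height, income)? Would you impute the missing data by the median or average value of height and income for the corresponding category, such as gender, age group, or location? If yes, why?
Solution: Missing heights are filled with the median height of the user's gender. The median is robust to extreme values (the height column has a minimum of 1 inch), and height differs systematically by gender. The income column has no NaN values; unreported incomes are coded as -1 and are handled in the Outliers section.
# Median height of each user's gender, aligned row by row with the original data
calculate_median = data.groupby(["gender"])["height"].transform("median")
# Fill only the missing heights with the gender-level median
data["height"] = data["height"].fillna(calculate_median)
print(calculate_median)
0        70.0
1        70.0
2        70.0
3        70.0
4        70.0
         ...
59941    65.0
59942    70.0
59943    70.0
59944    70.0
59945    70.0
Name: height, Length: 59946, dtype: float64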
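To make the effect of the group-wise fill easier to see, here is a small sketch on a made-up frame (the values and the height_filled column are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy data: one missing height per gender (values are made up)
toy = pd.DataFrame({
    "gender": ["m", "m", "m", "f", "f", "f"],
    "height": [70.0, 72.0, np.nan, 64.0, np.nan, 66.0],
})

# transform("median") returns one value per row: the median of that row's group
group_median = toy.groupby("gender")["height"].transform("median")

# Only the NaN entries are replaced; observed heights stay untouched
toy["height_filled"] = toy["height"].fillna(group_median)
print(toy)
```

The missing male height becomes 71.0 and the missing female height becomes 65.0, the medians of the observed values in each group.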
Data Types¶
1. Are there any inconsistencies in the data types across columns (e.g., numerical data stored as strings)?
data.dtypes
age             int64
status         object
gender         object
body_type      object
diet           object
drinks         object
education      object
ethnicity      object
height        float64
income          int64
job            object
last_online    object
location       object
pets           object
religion       object
sign           object
speaks         object
dtype: object
The numerical columns are correctly stored as int64 or float64, and the categorical/text columns are stored as object. The one inconsistency is last_online, which holds timestamps as strings and should be converted to datetime.
2. Which columns require conversion to numerical data types for proper analysis (e.g., income)?
Solution: No columns need to be converted to a numerical type; age, height, and income are already numeric.
3. Does the last_online column need to be converted into a datetime format? What additional insights can be gained by analyzing this as a date field?
data["last_online"]= pd.to_datetime(data["last_online"] , format= "%Y-%m-%d-%H-%M")
data["last_online"]
0 2012-06-28 20:30:00 1 2012-06-29 21:41:00 2 2012-06-27 09:10:00 3 2012-06-28 14:22:00 4 2012-06-27 21:26:00 ... 59941 2012-06-12 21:47:00 59942 2012-06-29 11:01:00 59943 2012-06-27 23:37:00 59944 2012-06-23 13:01:00 59945 2012-06-29 00:42:00 Name: last_online, Length: 59946, dtype: datetime64[ns]
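Once last_online is a datetime, features such as hour of day, weekday, and days since last activity become available. A minimal sketch, using three timestamps copied from the head of the dataset and measuring inactivity against the most recent timestamp in the sample (that reference point is an assumption made for illustration):

```python
import pandas as pd

# Three raw timestamps in the dataset's "%Y-%m-%d-%H-%M" format
stamps = pd.Series(["2012-06-28-20-30", "2012-06-29-21-41", "2012-06-27-09-10"])
last_online = pd.to_datetime(stamps, format="%Y-%m-%d-%H-%M")

# Time-of-day and day-of-week activity patterns
hour = last_online.dt.hour
weekday = last_online.dt.day_name()

# Whole days of inactivity relative to the latest timestamp in the sample
days_inactive = (last_online.max() - last_online).dt.days

print(pd.DataFrame({"hour": hour, "weekday": weekday, "days_inactive": days_inactive}))
```

Counting users per hour or weekday, or flagging users inactive for more than a chosen number of days, follows directly from these columns.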
3. Outliers¶
1. Are there any apparent outliers in numerical columns such as age, height, or income? What are the ranges of values in these columns?
data.describe()
age | height | income | |
---|---|---|---|
count | 59946.000000 | 59946.000000 | 59946.000000 |
mean | 32.340290 | 68.295282 | 20033.222534 |
std | 9.452779 | 3.994738 | 97346.192104 |
min | 18.000000 | 1.000000 | -1.000000 |
25% | 26.000000 | 66.000000 | -1.000000 |
50% | 30.000000 | 68.000000 | -1.000000 |
75% | 37.000000 | 71.000000 | -1.000000 |
max | 110.000000 | 95.000000 | 1000000.000000 |
Yes, all three numerical columns contain outliers: age has a maximum of 110, height has a minimum of 1 inch, and income ranges from the placeholder -1 up to 1,000,000.
age_range = (data['age'].min() , data['age'].max())
height_range = (data['height'].min() , data['height'].max())
income_range = (data['income'].min() , data['income'].max())
print(f"The age range is : {age_range} ")
print(f"The height range is : {height_range} ")
print(f"The income range is : {income_range} ")
The age range is : (18, 110)
The height range is : (1.0, 95.0)
The income range is : (-1, 1000000)
1. I used the .describe() method to review the distribution of the numerical columns and spot potential outliers by comparing the minimum and maximum with the quartiles: a minimum far below the 25th percentile or a maximum far above the 75th percentile signals an extreme value.
2. Then I used .min() and .max() to get the range of each column, which exposes anomalies such as an age of 110, a height of 1 inch, and an income of -1.
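A more systematic complement to reading describe() is the 1.5 × IQR rule. This is a different technique from the one used above, sketched here on made-up ages that mimic the quartiles of the age column; the helper iqr_bounds and the multiplier k=1.5 are assumptions for illustration:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages: quartiles of 26 and 37, plus one extreme value
ages = pd.Series([18, 26, 30, 37, 110])
low, high = iqr_bounds(ages)

# Values outside the fences are flagged as potential outliers
flagged = ages[(ages < low) | (ages > high)]
print(low, high)
print(flagged.tolist())
```

Values outside the fences are candidates for review, not automatic deletions.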
2. Any -1 values in numerical columns like income should be replaced with 0, as they may represent missing or invalid data.
data['income'] = data['income'].replace(-1, 0)
data['income'].min()
0
As shown above, the -1 values in the income column have been replaced with 0.
3. For other outliers, how would you ensure that they do not disproportionately impact the analysis while retaining as much meaningful data as possible? Would you delete the data or, rather than deleting them, calculate the mean and median values using only the middle 80% of the data (removing extreme high and low values)? Provide appropriate reasons for every step.
numerical_cols = ['age', 'height', 'income']
lower_bound = data[numerical_cols].quantile(0.10)
upper_bound = data[numerical_cols].quantile(0.90)
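# Keep values strictly inside the middle 80% of each column. The mask is applied
# cell by cell, so out-of-range values become NaN instead of dropping whole rows
data_filtered = data[(data[numerical_cols] > lower_bound) & (data[numerical_cols] < upper_bound)]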
# Calculate mean and median on the middle 80% of data
trimmed_mean = data_filtered[numerical_cols].mean()
trimmed_median = data_filtered[numerical_cols].median()
# Display results
print("Trimmed Mean:\n", trimmed_mean)
print("Trimmed Median:\n", trimmed_median)
Trimmed Mean:
 age          31.357686
height       68.254426
income    26109.890110
dtype: float64
Trimmed Median:
 age          30.0
height       68.0
income    20000.0
dtype: float64
Steps:
- Calculate the 10th percentile (lower bound) and 90th percentile (upper bound) for each numerical column. This identifies the extreme low and high values.
- Keep only the values strictly between the 10th and 90th percentiles, then compute the mean (average) and the median (middle value) on this middle 80% of the data.
- No rows are deleted: the original data stays intact, and only the summary statistics are computed on the trimmed values, so extreme values such as a height of 1 inch or an income of 1,000,000 do not distort them.
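The same idea can be wrapped in a small reusable helper. This is a sketch under assumptions: the function name trimmed_stats is hypothetical, the heights are made up, and it uses inclusive bounds via Series.between, whereas the code above uses strict inequalities:

```python
import pandas as pd

def trimmed_stats(series, lower=0.10, upper=0.90):
    """Mean and median of the values between two quantiles (inclusive)."""
    lo, hi = series.quantile(lower), series.quantile(upper)
    kept = series[series.between(lo, hi)]
    return kept.mean(), kept.median()

# Made-up heights with one implausibly low and one implausibly high value
heights = pd.Series([1.0, 64.0, 66.0, 68.0, 68.0, 70.0, 70.0, 72.0, 74.0, 95.0])

mean_trim, median_trim = trimmed_stats(heights)
print(heights.mean(), mean_trim, median_trim)
```

Here the two extreme values fall outside the bounds, so the trimmed mean reflects only the eight plausible heights.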
4. Missing Data Visualization¶
1. Create a heatmap to visualize missing values across the dataset. Which columns show consistent missing data patterns?
import seaborn as sns
import matplotlib.pyplot as plt
# Set figure size
plt.figure(figsize=(12, 6))
# Create a heatmap to visualize missing values
sns.heatmap(data.isnull(), cmap='viridis', cbar=True)
# Add title
plt.title("Missing Values Heatmap", fontsize=14)
plt.xlabel('Columns', fontsize=14)
plt.ylabel('Rows', fontsize=14)
# Show plot
plt.show()
Analysis: Columns with many missing values appear as dense bands of highlighted cells running down the heatmap. The diet, pets, and religion columns show the most missing data (about 41%, 33%, and 34% of rows, respectively).
1. How would you bin the age column into categories (e.g. "18-25", "26-35", "36-45", and "46+" ) to create a new column, age_group. How does the distribution of users vary across these age ranges?
import numpy as np
data['age_group'] = np.where(data['age'] <= 25,"18-25",
np.where(data['age'] <=35,"26-35",
np.where(data['age'] <=45,"36-45","46+")))
age_distribution_count = data['age_group'].value_counts()
age_distribution_count
26-35    28621
18-25    14454
36-45    10803
46+       6068
Name: age_group, dtype: int64
Steps:
- Create a new column, age_group, based on the values in the age column. The goal is to categorize ages into four groups:
18-25 for ages ≤ 25
26-35 for ages between 26 and 35
36-45 for ages between 36 and 45
46+ for ages above 45
- Nested np.where calls assign the labels: data['age'] <= 25 gives "18-25", and so on.
- Count the occurrences of each age group using value_counts().
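As an alternative to nested np.where calls, pandas offers pd.cut for this kind of binning. A minimal sketch on made-up ages chosen to sit on the bin edges (the lower edge of 17 is an assumption so that age 18 is included):

```python
import pandas as pd

# Made-up ages on and around the bin edges
ages = pd.Series([18, 25, 26, 35, 36, 45, 46, 110])

# Bins are right-inclusive by default: (17, 25], (25, 35], (35, 45], (45, inf]
age_group = pd.cut(
    ages,
    bins=[17, 25, 35, 45, float("inf")],
    labels=["18-25", "26-35", "36-45", "46+"],
)

# sort_index() keeps the groups in their natural order
counts = age_group.value_counts().sort_index()
print(counts)
```

pd.cut returns an ordered categorical, so the groups sort in age order rather than alphabetically or by frequency.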
2. Group income into categories like "Low Income," "Medium Income," and "High Income" based on meaningful thresholds (e.g., quartiles). What insights can be derived from these groups?
lower_quantile = data['income'].quantile(0.25)
middle_quantile = data['income'].quantile(0.50)
higher_quantile = data['income'].quantile(0.90)
data['income_category'] = np.where(data['income'] <= lower_quantile,"Low Income",
np.where(data['income'] <=middle_quantile,"Medium income","High income"))
income_distribution = data['income_category'].value_counts()
income_distribution
Low Income     48442
High income    11504
Name: income_category, dtype: int64
data[['income','income_category']]
income | income_category | |
---|---|---|
0 | 0 | Low Income |
1 | 80000 | High income |
2 | 0 | Low Income |
3 | 20000 | High income |
4 | 0 | Low Income |
... | ... | ... |
59941 | 0 | Low Income |
59942 | 0 | Low Income |
59943 | 100000 | High income |
59944 | 0 | Low Income |
59945 | 0 | Low Income |
59946 rows × 2 columns
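Steps:
- First, calculate the income thresholds: the 25th percentile (lower_quantile) for low income, the 50th percentile (middle_quantile) for medium income, and the 90th percentile (higher_quantile), though the latter is not used in the categorization.
- Then assign income categories based on these thresholds: values below or equal to the 25th percentile are labeled "Low Income," those between the 25th and 50th percentiles are classified as "Medium income," and the rest are considered "High income."
- Use the value_counts() function to determine the distribution of individuals across these income groups.
- Because more than half of all incomes are 0 after the -1 replacement, the 25th and 50th percentiles are both 0. As a result, no user falls into "Medium income": the 48,442 users with income 0 are labeled "Low Income" and the 11,504 users with a reported income are labeled "High income." The split therefore separates users who did not report an income from users who did, rather than three income levels.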
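One way to get three populated groups is to compute the thresholds only on reported (non-zero) incomes and keep unreported incomes in their own category. The sketch below is an illustration under assumptions, not the method used above: the incomes are made up, the cut points are terciles of reported income, and the "Not reported" label is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up incomes; 0 stands for "not reported" (the former -1 placeholder)
income = pd.Series([0, 0, 0, 20000, 30000, 50000, 80000, 100000, 150000])

# Thresholds come from reported incomes only
reported = income[income > 0]
low_cut, high_cut = reported.quantile([1 / 3, 2 / 3])

# Conditions are checked in order; the first match wins
category = pd.Series(
    np.select(
        [income <= 0, income <= low_cut, income <= high_cut],
        ["Not reported", "Low income", "Medium income"],
        default="High income",
    ),
    index=income.index,
)
print(category.value_counts())
```

Keeping "Not reported" separate avoids treating a missing answer as the lowest income level.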
2. Derived Features:¶
1. Create a new feature, profile_completeness, by calculating the percentage of non-missing values for each user profile. How complete are most user profiles, and how does completeness vary across demographics?
non_missing_counts = data.notnull().sum(axis=1)
total_columns = data.shape[1]
data['profile_completeness'] = (non_missing_counts / total_columns) * 100
data['profile_completeness']
0 100.000000 1 100.000000 2 84.210526 3 94.736842 4 89.473684 ... 59941 84.210526 59942 100.000000 59943 94.736842 59944 100.000000 59945 94.736842 Name: profile_completeness, Length: 59946, dtype: float64
data.groupby('profile_completeness')['gender'].value_counts()
profile_completeness  gender
52.631579             m           52
                      f           21
57.894737             m          165
                      f           81
63.157895             m          488
                      f          262
68.421053             m          639
                      f          394
73.684211             m         1234
                      f          780
78.947368             m         2381
                      f         1600
84.210526             m         4541
                      f         2942
89.473684             m         7507
                      f         5021
94.736842             m         9998
                      f         6865
100.000000            m         8824
                      f         6151
Name: gender, dtype: int64
Steps:
- Use data.notnull().sum(axis=1) to count the number of filled (non-null) fields for each user.
- Use data.shape[1] to get the total number of columns in the dataset. At this point data also contains the derived columns age_group and income_category, so the total is 19 columns; for example, 84.21% corresponds to 16 of 19 fields.
- Divide the filled count by the total and multiply by 100 to get the profile completeness percentage.
- Group by profile_completeness and count the users of each gender at each completeness level.
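Derived columns are never missing, so including them shifts every completeness score. A minimal sketch of restricting the calculation to the original profile fields, on a made-up frame (the column subset and values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy profiles (values are made up)
toy = pd.DataFrame({
    "age": [22, 35, 38],
    "diet": ["anything", np.nan, np.nan],
    "pets": ["has cats", "likes dogs", np.nan],
    "sign": ["gemini", "cancer", np.nan],
})

# Fix the list of profile fields before any derived columns are added
profile_cols = ["age", "diet", "pets", "sign"]

# A derived column that is always filled and should not count toward completeness
toy["age_group"] = ["18-25", "26-35", "36-45"]

# notnull().mean(axis=1) is the share of filled fields per row
toy["profile_completeness"] = toy[profile_cols].notnull().mean(axis=1) * 100
print(toy["profile_completeness"])
```

With this approach a user who filled in every original field scores 100%, regardless of how many engineered columns are added later.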
3. Unit Conversion¶
1. Convert the height column from inches to centimeters using the conversion factor (1 inch = 2.54 cm). Store the converted values in a new column, height_cm.
# Convert height from inches to centimeters (1 inch = 2.54 cm)
data['height_cm'] = data['height'] * 2.54
# Display first few rows to verify the new column
data[['height', 'height_cm']].head()
height | height_cm | |
---|---|---|
0 | 75.0 | 190.50 |
1 | 70.0 | 177.80 |
2 | 68.0 | 172.72 |
3 | 71.0 | 180.34 |
4 | 66.0 | 167.64 |
Steps: Convert height from inches to centimeters by multiplying data['height'] by 2.54 and storing the result in height_cm.
1. What is the gender distribution (gender) across the platform? Are there any significant imbalances?
# Analyze gender distribution
gender_distribution = data['gender'].value_counts(normalize=True) * 100
# Display results
gender_distribution
m    59.768792
f    40.231208
Name: gender, dtype: float64
Steps:
- value_counts() counts how many users belong to each gender category.
- Passing normalize=True converts the raw counts into proportions, and multiplying by 100 turns these proportions into percentages.
2. What are the proportions of users in different status categories (e.g., single, married, seeing someone)? What does this suggest about the platform’s target audience?
# Analyze relationship status distribution
status_distribution = data['status'].value_counts(normalize=True) * 100
# Display results
status_distribution
single            92.911954
seeing someone     3.443099
available          3.111133
married            0.517132
unknown            0.016682
Name: status, dtype: float64
steps:
- value_counts() calculates the number of users for each unique status (e.g., how many are "single," "married," etc.).
- Adding normalize=True inside value_counts() converts the raw counts into proportions
3. How does status vary by gender? For example, what proportion of men and women identify as single?
status_gender_distribution = data.groupby(["status", "gender"]).size() / len(data) * 100
status_gender_data = status_gender_distribution.reset_index(name="Percentage")
status_gender_data_sorted = status_gender_data.sort_values(by="Percentage", ascending=False)
print(status_gender_data_sorted)
           status gender  Percentage
7          single      m   55.680112
6          single      f   37.231842
1       available      m    2.016815
5  seeing someone      m    1.769926
4  seeing someone      f    1.673173
0       available      f    1.094318
3         married      m    0.291929
2         married      f    0.225203
9         unknown      m    0.010009
8         unknown      f    0.006673
Steps:
- Count the number of users for each (status, gender) combination.
- Divide by the total number of users and multiply by 100 to express each count as a percentage of all users.
- Convert the grouped results into a structured DataFrame with a column named "Percentage".
- Sort it so that the most common status-gender combinations appear at the top for easy interpretation.
Analysis/Recommendations:
- Relative to their share of the user base, more women than men identify as "seeing someone": about 4.2% of women (1.67 of 40.23 percentage points) versus about 3.0% of men (1.77 of 59.77). This may suggest that women update their status more frequently.
- Married users make up only about 0.5% of profiles, which is consistent with a platform used mainly for dating.
- Single men (55.68% of all users) clearly outnumber single women (37.23%). To narrow this gap, Bumble could target female users through marketing campaigns, safety features, and community-building initiatives to increase engagement and retention.
- Some users are "seeing someone" or in other relationship stages. Bumble could introduce optional status updates or new features to help users indicate evolving relationship stages.
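The percentages above are shares of all users. To answer "what proportion of men and women identify as single" directly, the shares can be normalized within each gender. A minimal sketch using pd.crosstab on made-up rows (the toy frame is an assumption for illustration):

```python
import pandas as pd

# Made-up users: 4 men and 2 women
toy = pd.DataFrame({
    "gender": ["m", "m", "m", "m", "f", "f"],
    "status": ["single", "single", "single", "seeing someone", "single", "seeing someone"],
})

# normalize="columns" makes each gender column sum to 100%
within_gender = pd.crosstab(toy["status"], toy["gender"], normalize="columns") * 100
print(within_gender)
```

Each column now reads as "of all users of this gender, what share has each status", which removes the effect of the overall gender imbalance.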
2. Correlation Analysis¶
1. What are the correlations between numerical columns such as age, income, gender? Are there any strong positive or negative relationships?
# Perform correlation analysis on numerical columns
correlation_matrix = data[['age', 'income', 'height']].corr()
# Display correlation matrix
correlation_matrix
age | income | height | |
---|---|---|---|
age | 1.000000 | -0.001004 | -0.022253 |
income | -0.001004 | 1.000000 | 0.065048 |
height | -0.022253 | 0.065048 | 1.000000 |
Steps:
- Select the age, income, and height columns as a DataFrame.
- Compute the pairwise Pearson correlations between them using .corr().
Analysis/Recommendations:
- There is almost no linear correlation between age and income (r ≈ -0.001), meaning income does not significantly increase or decrease with age in this dataset.
- This may mean that users across age groups have similar income distributions, but it more likely reflects the quality of the income data: 48,442 of 59,946 users (about 81%) have the placeholder income 0.
2. How does age correlate with income? Are older users more likely to report higher income levels?
age_income_correlation = data['age'].corr(data['income'])
age_income_correlation
-0.0010038681910053916
Analysis: It is extremely close to zero, indicating no meaningful relationship between age and income.
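Because most incomes are the placeholder 0, it can be informative to recompute the correlation on reported incomes only, and to use a rank-based correlation that is less sensitive to extreme values. The sketch below uses made-up numbers purely to show the mechanics; it says nothing about what the real data would show:

```python
import pandas as pd

# Made-up users; income 0 stands for "not reported"
toy = pd.DataFrame({
    "age":    [22, 25, 30, 35, 40, 45, 28, 33],
    "income": [0, 20000, 0, 60000, 80000, 100000, 0, 0],
})

# Correlation on all rows, placeholders included
corr_all = toy["age"].corr(toy["income"])

# Correlation on users who reported an income
reported = toy[toy["income"] > 0]
corr_reported = reported["age"].corr(reported["income"])

# Rank-based (Spearman-style) correlation: Pearson correlation of the ranks
rank_corr = reported["age"].rank().corr(reported["income"].rank())

print(corr_all, corr_reported, rank_corr)
```

In this toy example the placeholders weaken the measured relationship; whether that also holds for the Bumble data would need to be checked on the full dataset.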
3. Diet and Lifestyle Analysis¶
1. How do dietary preferences (diet) distribute across the platform? For example, what percentage of users identify as vegetarian, vegan, or follow "anything" diets?
diet_percentage = data['diet'].value_counts() * 100 / len(data['diet'])
diet_percentage
mostly anything        27.666567
anything               10.314283
strictly anything       8.529343
mostly vegetarian       5.745171
mostly other            1.679845
strictly vegetarian     1.459647
vegetarian              1.112668
strictly other          0.754012
mostly vegan            0.563841
other                   0.552164
strictly vegan          0.380342
vegan                   0.226871
mostly kosher           0.143462
mostly halal            0.080072
strictly halal          0.030027
strictly kosher         0.030027
halal                   0.018350
kosher                  0.018350
Name: diet, dtype: float64
Steps:
- value_counts() counts how many users fall into each dietary preference.
- Multiplying by 100 / len(data['diet']) turns the counts into percentages of all users, including the roughly 40.7% who left diet blank, so the percentages do not sum to 100.
Analysis/Recommendations:
- The most common categories are "mostly anything" (27.67%), "anything" (10.31%), and "strictly anything" (8.53%), indicating that most users who state a diet do not have strict dietary restrictions.
- Allow users to filter potential matches by dietary preference to improve compatibility.
- Consider adding dietary preference badges or profile highlights to make it easier for users to identify shared food habits.
2. How do drinking habits (drinks) vary across different diet categories? Are users with stricter diets (e.g., vegan) less likely to drink?
# Analyze drinking habits across different diet categories
drink_diet_distribution = data.groupby("diet")["drinks"].value_counts(normalize=True) * 100
# Display results
drink_diet_distribution
diet        drinks
anything    socially       75.667557
            often           9.579439
            rarely          8.511348
            not at all      4.923231
            very often      0.967957
                              ...
vegetarian  rarely         10.771704
            often           9.485531
            not at all      5.948553
            very often      1.125402
            desperately     0.964630
Name: drinks, Length: 103, dtype: float64
Steps:
- First, group the data by diet category.
- Within each diet category, value_counts(normalize=True) gives the share of users with each drinking habit, and multiplying by 100 converts the shares to percentages.
Analysis/Recommendations:
- Across all diet categories, "socially" is the most frequent drinking choice.
- Mostly anything (75.61%), mostly vegetarian (71.90%), and vegetarian (71.70%) users are more likely to drink socially.
- Allow users to filter matches based on drinking habits, especially for those with strict dietary restrictions.
- Create marketing campaigns tailored to sober communities, especially halal, kosher, and vegan users.
4. Geographical Insights¶
1. Extract city and state information from the location column. What are the top 5 cities and states with the highest number of users?
data[['city','state']] = data['location'].str.split(', ',expand = True, n=1)
data[['city','state']]
city | state | |
---|---|---|
0 | south san francisco | california |
1 | oakland | california |
2 | san francisco | california |
3 | berkeley | california |
4 | san francisco | california |
... | ... | ... |
59941 | oakland | california |
59942 | san francisco | california |
59943 | south san francisco | california |
59944 | san francisco | california |
59945 | san francisco | california |
59946 rows × 2 columns
# Ensure proper extraction by handling extra spaces
data[['city', 'state']] = data['location'].str.split(',', n=1, expand=True)
# Trim any leading/trailing spaces
data['city'] = data['city'].str.strip()
data['state'] = data['state'].str.strip()
# Count the number of users per city and state
top_cities = data['city'].value_counts().head(5) # Top 5 cities
print(f"TOP 5 CITIES WITH HIGHEST NUMBER OF USERS: \n\n{top_cities}")
top_states = data['state'].value_counts().head(5) # Top 5 states
print(f"TOP 5 STATES WITH HIGHEST NUMBER OF USERS: \n\n{top_states}")
TOP 5 CITIES WITH HIGHEST NUMBER OF USERS:

san francisco    31064
oakland           7214
berkeley          4212
san mateo         1331
palo alto         1064
Name: city, dtype: int64
TOP 5 STATES WITH HIGHEST NUMBER OF USERS:

california       59855
new york            17
illinois             8
massachusetts        5
texas                4
Name: state, dtype: int64
Steps:
- Extract city and state from the location column by splitting on the first comma and trimming spaces.
- Count the users in each city and state, for example with data['city'].value_counts().head(5).
- Finally, display the most active cities and states based on user count.
Analysis/Recommendations:
- With 31,064 users, San Francisco is the most active city on Bumble, followed by Oakland (7,214 users) and Berkeley (4,212 users).
- 59,855 users are from California, vastly outnumbering users in other states. The next most active states are New York (17 users), Illinois (8 users), Massachusetts (5 users), and Texas (4 users), which shows a major imbalance.
- Given the dominance of California-based users, Bumble should invest in marketing campaigns for states like Texas, New York, and Illinois to attract a more diverse audience.
- Since San Francisco, Oakland, and Berkeley have the highest engagement, Bumble could introduce localized events, exclusive promotions, or city-based filters to enhance the user experience.
2. How does age vary across the top cities? Are certain cities dominated by younger or older users?
city_with_average_age = data.groupby('city')['age'].mean()
print(f" Cities with high average age is :\n {city_with_average_age.sort_values(ascending = False).head(5)}")
print(f" Cities with low average age is :\n {city_with_average_age.sort_values(ascending = False).tail(5)}")
 Cities with high average age is :
 city
forest knolls     62.5
bellingham        59.0
port costa        53.0
seaside           50.0
redwood shores    47.0
Name: age, dtype: float64
 Cities with low average age is :
 city
fayetteville      20.0
isla vista        19.0
canyon            19.0
canyon country    19.0
long beach        19.0
Name: age, dtype: float64
Steps:
- Grouping the data: The dataset (data) is grouped by the "city" column, and the mean of the "age" column is calculated for each city.
- Sorting and displaying cities with high average age: The cities are sorted in descending order by average age, and the top 5 cities are displayed.
- Sorting and displaying cities with low average age: The cities are again sorted in descending order, and the bottom 5 cities (which have the lowest average age) are displayed.
Analysis/Recommendations:
- The highest and lowest averages come from small places such as Forest Knolls (62.5) and Isla Vista (19.0), none of which are among the top 5 cities by user count. Such averages likely rest on very few users each, so they may not reflect a real age pattern.
- In older cities, focus on features that appeal to more mature users, such as serious relationships, professional networking, or niche interests (e.g., travel, wellness).
- Introduce location-specific incentives, such as student discounts in younger cities or premium memberships in wealthier, older communities.
- Consider events and meetups in these cities to drive engagement based on the dominant age group.
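Since the question asks about the top cities, the comparison can be restricted to the cities with the most users and reported together with the group size. A minimal sketch on made-up data (the ages, the cutoff of two cities, and the name age_by_top_city are assumptions for illustration):

```python
import pandas as pd

# Made-up users: two larger cities and one city with a single user
toy = pd.DataFrame({
    "city": ["san francisco"] * 4 + ["oakland"] * 3 + ["forest knolls"],
    "age":  [25, 30, 35, 30, 28, 32, 36, 62],
})

# Cities with the most users (top 2 here; top 5 on the real data)
top_cities = toy["city"].value_counts().head(2).index

# Age summary for those cities only, with the user count alongside
age_by_top_city = (
    toy[toy["city"].isin(top_cities)]
    .groupby("city")["age"]
    .agg(["count", "mean", "median"])
    .sort_values("count", ascending=False)
)
print(age_by_top_city)
```

Showing the count next to the mean makes it clear how much data each average is based on.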
3. What are the average income levels in the top states or cities? Are there regional patterns in reported income?
# Calculate average income per city and state
city_income = data.groupby('city')['income'].mean()
state_income = data.groupby('state')['income'].mean()
# Get the 5 states and 5 cities with the highest average income
high_income_state = state_income.sort_values(ascending = False).head(5)
high_income_city = city_income.sort_values(ascending = False).head(5)
# Labels now match the variables they print
print("Average Income in Top 5 States:\n",high_income_state )
print("\nAverage Income in Top 5 Cities:\n",high_income_city)
Average Income in Top 5 States:
 state
new jersey                  150000.0
colorado                     75000.0
vietnam                      60000.0
british columbia, canada     60000.0
pennsylvania                 40000.0
Name: income, dtype: float64

Average Income in Top 5 Cities:
 city
petaluma        500000.000000
santa cruz      230000.000000
south orange    150000.000000
boulder         150000.000000
montara          85833.333333
Name: income, dtype: float64
steps:
- Calculate Average Income for Each City and State: The dataset is grouped by "city" and "state", and the mean income is calculated. This provides the average income level for each city and state.
- Identify the Top 5 Cities and States by Income: The sort_values(ascending=False).head(5) function is used to retrieve the top 5 cities and states with the highest average income.
Analysis/Recommendations:
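- The highest averages come from places outside the platform's core user base: New Jersey (150,000) and Colorado (75,000) among states, and Petaluma (500,000) and Santa Cruz (230,000) among cities. Since 59,855 of 59,946 users are in California and no other state has more than 17 users, the state averages outside California are driven by a handful of profiles rather than a regional pattern.
- The "state" values also include entries such as "vietnam" and "british columbia, canada", because everything after the first comma in location is treated as the state.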
- Invest more in digital marketing in high-income states where users are more likely to purchase subscriptions.
- Consider launching regional discounts or affordable subscription plans.
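Average income per city is easier to trust when it is based on reported incomes and on cities with enough users. The sketch below illustrates one way to do this on made-up data; the MIN_USERS threshold of 2 is a hypothetical value chosen for the toy example and would need to be much larger on the real dataset:

```python
import pandas as pd

# Made-up users; income 0 stands for "not reported"
toy = pd.DataFrame({
    "city":   ["petaluma", "oakland", "oakland", "oakland", "berkeley", "berkeley"],
    "income": [500000, 20000, 40000, 0, 30000, 50000],
})

# Use reported incomes only
reported = toy[toy["income"] > 0]

# Mean income per city together with the number of users behind it
summary = reported.groupby("city")["income"].agg(["count", "mean"])

# Hypothetical minimum group size for an average to be shown
MIN_USERS = 2
reliable = summary[summary["count"] >= MIN_USERS].sort_values("mean", ascending=False)
print(reliable)
```

In the toy data the single very high income no longer tops the ranking, because a city with one reporting user is excluded.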
5. Height Analysis¶
1. What is the average height of users across different gender categories?
# Compute average height per gender category
avg_height_by_gender = data.groupby("gender")["height"].mean().dropna()
avg_height_by_gender
gender
f    65.103869
m    70.443468
Name: height, dtype: float64
steps:
- We grouped the dataset by gender and calculated the mean height for each category.
Analysis/Recommendations:
- Conduct a location-based analysis to see if height preferences differ across different regions.
- Leverage height data to refine targeting strategies for different user demographics.
2. How does height vary by age_group? Are there noticeable trends among younger vs. older users?
avg_height_by_age_group = data.groupby("age_group")["height"].mean()
avg_height_by_age_group
age_group
18-25    68.200913
26-35    68.406764
36-45    68.325095
46+      67.941167
Name: height, dtype: float64
steps:
- We grouped the dataset by age_group and calculated the mean height for each category.
Analysis/Recommendations:
- The average height is highest in the 26-35 age group (68.41 inches) and is slightly lower in the older groups.
- Users aged 46+ are the shortest on average (67.94 inches), although the gap to the tallest group is under half an inch.
- The app could introduce dynamic height filters based on age groups.
- Break down height trends by gender within age groups to see if trends differ.
3. What is the distribution of height within body_type categories (e.g., athletic, curvy, thin)? Do the distributions align with expectations?
# Compute average height per body type category
avg_height_by_body_type = data.groupby("body_type")["height"].mean().sort_values()
avg_height_by_body_type
body_type
curvy             65.210245
full figured      66.464817
rather not say    67.272727
thin              67.866058
average           68.100805
skinny            68.544176
fit               68.546062
a little extra    68.820084
overweight        68.948198
used up           69.180282
jacked            69.292162
athletic          69.707336
Name: height, dtype: float64
Steps:
- We grouped the dataset by body_type and calculated the mean height for each category.
- Then we sorted the averages in ascending order using the sort_values() method.
Analysis/Recommendations:
- Users could be given height and body type-based preferences in their search filters.
- Compare these findings by gender to see if height-body type trends differ between men and women.
6. Income Analysis¶
1. What is the distribution of income across the platform? Are there specific income brackets that dominate? How would you handle case where income is blank or 0?
income_without_zero = data[data['income'] != 0]
income_without_zero['income'].value_counts().sort_values(ascending = False)
20000      2952
100000     1621
80000      1111
30000      1048
40000      1005
50000       975
60000       736
70000       707
150000      631
1000000     521
250000      149
500000       48
Name: income, dtype: int64
Steps:
- We filter the data to rows where income is not equal to zero, count the users in each income category using .value_counts(), and sort the counts using .sort_values(ascending = False).
Analysis/Recommendations:
- Offer an optional range selection instead of an exact number to encourage more responses
- Provide personalized recommendations based on income brackets.
2. How does income vary by age_group and gender? Are older users more likely to report higher incomes?
df_valid_income = data[data["income"] > 0]
avg_income_by_age_gender = df_valid_income.groupby(["age_group", "gender"])["income"].mean().unstack()
avg_income_by_age_gender
gender | f | m |
---|---|---|
age_group | ||
18-25 | 86066.350711 | 106618.773946 |
26-35 | 90398.126464 | 114944.801027 |
36-45 | 87302.977233 | 112680.608365 |
46+ | 75299.760192 | 100156.626506 |
Steps:
- Filter out invalid income data: this removes users with income ≤ 0, i.e. users who did not report an income.
- Group users by age group and gender and compute the average income for each group. Finally, use .unstack() to create a table with gender as columns and age groups as rows.
Analysis/Recommendations:
- Among users who report an income, the average peaks in the 26-35 group and declines in the older groups for both genders.
- The sharpest decline occurs in the 46+ group, possibly due to retirement.
- Allow users to filter by income range & age group to improve match quality.
- Provide financial compatibility insights for users prioritizing income in matches.
import matplotlib.pyplot as plt
import seaborn as sns
# Plot histogram of age distribution
plt.figure(figsize=(10, 6))
sns.histplot(data["age"], bins=30, kde=True, color="red")
# Add a vertical line for the mean age
mean_age = data["age"].mean()
plt.axvline(mean_age, color="blue", linestyle="dashed", linewidth=2, label=f"Mean Age: {mean_age:.1f}")
# Labels and title
plt.xlabel("Age")
plt.ylabel("User Count")
plt.title("Age Distribution of Users")
plt.legend()
plt.show()
Steps:
- Import the necessary libraries (matplotlib.pyplot and seaborn) for visualization. Set the figure size to ensure a clear and readable plot.
- Plot a histogram of the age column in red using seaborn.histplot(), with 30 bins to group ages into intervals. Then calculate the mean age using the .mean() function.
- Add a vertical dashed blue line at the mean age using plt.axvline(). Set the x-axis (Age) and y-axis (User Count) labels. Add a title ("Age Distribution of Users").
- Display the final plot using plt.show().
Analysis/Recommendations:
- The most common age group appears to be mid-20s to early 30s.
- The mean age (red dashed line) suggests the platform leans toward a younger user base
- Focus marketing efforts on millennials and Gen Z since they form the majority.
- Consider campaigns that attract older demographics to balance the user base.
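The visual impression of a younger user base can be backed with a few summary numbers. The sketch below uses a made-up list of ages and an arbitrary cutoff of 35, both assumptions for illustration.

```python
import pandas as pd

# Toy ages (made up), with a long tail toward older users
ages = pd.Series([22, 24, 25, 27, 28, 29, 31, 33, 38, 45, 60], name="age")

mean_age = ages.mean()
median_age = ages.median()

# Share of users younger than 35 (the cutoff is an arbitrary choice)
share_under_35 = (ages < 35).mean()

print(f"mean={mean_age:.1f}, median={median_age:.1f}, under 35: {share_under_35:.1%}")
```

When a few older users stretch the distribution to the right, the mean sits above the median, so reporting both gives a fairer picture than the mean line alone.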
2. How does the age distribution differ by gender? Are there age groups where one gender is more prevalent?
import seaborn as sns
sns.set_style("whitegrid")
plt.figure(figsize=(10, 6))
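# Map the gender codes to readable names so seaborn labels the legend itself;
# overriding labels via plt.legend(labels=[...]) can attach them to the wrong series.
plot_data = data.assign(gender=data["gender"].map({"m": "Male", "f": "Female"}))
sns.histplot(plot_data, x="age", hue="gender", bins=30, kde=True, element="step", common_norm=False)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age Distribution by Gender")
plt.gca().get_legend().set_title("Gender")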
plt.show()
steps:
- Set the plot style: use Seaborn’s whitegrid style to enhance readability.
- Use histplot to plot the distribution of ages separated by gender, with 30 bins for finer granularity and kernel density estimation (KDE) to smooth the visualization.
- Pass common_norm=False so that each gender is normalized independently when a normalized statistic is used; with the default count statistic the bars show raw counts. Label the x-axis (Age) and y-axis (Count), and add a title.
- Customize the legend so the genders are labeled clearly, then display the plot.
Analysis/Recommendations:
- The distribution appears to be roughly normal for both genders, with a concentration around young adulthood (20s and 30s).
- There may be slight differences in the frequency of males and females at certain age ranges.
- Focus on the most common age groups to tailor recommendations.
- Adjust strategies based on the dominant gender in key age brackets.
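To check which gender dominates a given age bracket without reading it off the plot, a normalized cross-tabulation is one option. The toy rows below are made up for illustration.

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({
    "age_group": ["18-25", "18-25", "18-25", "26-35", "26-35",
                  "26-35", "26-35", "46+", "46+", "46+"],
    "gender":    ["f", "f", "m", "m", "m", "m", "f", "f", "m", "m"],
})

# Share of each gender within every age group (each row sums to 1)
share = pd.crosstab(toy["age_group"], toy["gender"], normalize="index")

# Gender with the larger share in each age group
dominant = share.idxmax(axis=1)

print(share)
print(dominant)
```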
2. Income and Age¶
1. Use a scatterplot to visualize the relationship between income and age, with a trend line indicating overall patterns. Are older users more likely to report higher incomes?
# Filter out unrealistic income values (-1 likely represents missing data)
df_filtered = data[data["income"] >= 0]
# Create a scatterplot with a trend line
plt.figure(figsize=(10, 6))
sns.regplot(data=df_filtered, x="age", y="income", scatter_kws={"alpha": 0.3}, line_kws={"color": "orange"})
# Labels and title
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Relationship Between Income and Age")
# Show plot
plt.show()
steps:
- Remove unrealistic or missing income values (e.g., negative values like -1).
- Plot age on the x-axis and income on the y-axis, using transparent points to avoid excessive overlap.
- Add a regression trend line to identify patterns in the relationship.
- Label the x-axis as “Age” and the y-axis as “Income”, and add a clear, descriptive title.
Analysis/Recommendations:
- The scatterplot shows a wide range of incomes across different ages, with many users reporting lower incomes.
- The trend line (orange) suggests a slight upward trajectory, indicating that older users tend to report higher incomes on average.
- Instead of raw values, consider categorizing incomes into low, middle, and high brackets for clearer insights.
- Investigate how education level, job type, and location impact income trends.
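The strength of the age and income relationship can also be put into a number. The sketch below computes a Pearson correlation and a rank-based (Spearman-style) correlation on made-up data; the rank version is computed by ranking both columns first, which is an implementation choice here rather than something taken from the notebook.

```python
import pandas as pd

# Toy data (made up); the -1 row is an unreported income and is filtered out
toy = pd.DataFrame({
    "age":    [22, 25, 30, 38, 46, 54],
    "income": [20000, -1, 40000, 60000, 50000, 80000],
})
valid = toy[toy["income"] > 0]

# Linear (Pearson) correlation between age and income
pearson = valid["age"].corr(valid["income"])

# Rank correlation: Pearson correlation of the ranks, less sensitive to extreme incomes
spearman = valid["age"].rank().corr(valid["income"].rank())

print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

A value near zero would mean the upward slope of the trend line is weak, even if it is visible in the plot.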
2. Create boxplots of income grouped by age_group. Which age group reports the highest median income?
plt.figure(figsize=(12, 6))
sns.boxplot(data=data[data["income"] > 0], x="age_group", y="income", showfliers=False)
plt.xlabel("Age Group", fontsize=14)
plt.ylabel("Income", fontsize=14)
plt.title("Distribution of Income Over Age Group", fontsize=16)
plt.show()
steps:
- sns.boxplot creates a box plot that shows the distribution of income for each age group.
- showfliers=False hides the outliers, so only the distribution of the main data points is visible.
Analysis/Recommendations:
- Targeted financial advice for younger adults (18-29): programs or financial tools tailored to early-career individuals can help them plan savings, manage student loans, and budget effectively.
- Targeted financial advice for mid-career professionals (30-59): given the peak income levels in these groups, advice on investment strategies (retirement planning, higher-risk investments, or portfolio diversification) could be beneficial.
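The question of which age group has the highest median income can be answered directly with a groupby, rather than by reading the box plot. A minimal sketch on invented data:

```python
import pandas as pd

# Toy data (made up); -1 marks an unreported income
toy = pd.DataFrame({
    "age_group": ["18-25", "18-25", "18-25", "26-35", "26-35", "26-35",
                  "36-45", "36-45", "46+", "46+"],
    "income":    [20000, 30000, 100000, 50000, 60000, 70000,
                  40000, 70000, 50000, -1],
})
valid = toy[toy["income"] > 0]

# Median income per age group, highest first
medians = valid.groupby("age_group")["income"].median().sort_values(ascending=False)
top_group = medians.idxmax()

print(medians)
print("Highest median income:", top_group)
```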
3. Analyze income levels within gender and status categories. For example, are single men more likely to report higher incomes than single women?
import matplotlib.pyplot as plt
import seaborn as sns
palette = {"m": "blue", "f": "red"}
plt.figure(figsize=(12, 6))
sns.barplot(data=data, x="status", y="income", hue="gender", errorbar=None, palette=palette)
plt.xlabel("Status", fontsize=14)
plt.ylabel("Income", fontsize=14)
plt.title("Income by Gender and Status", fontsize=16)
plt.legend(title="Gender", fontsize=12)
plt.show()
steps:
- data=data → Uses the DataFrame data as the source of data.
- x="status" → Sets the x-axis to relationship status (e.g., single, available).
- y="income" → Sets the y-axis to income; each bar shows the mean income of its group, which is barplot's default estimator.
- hue="gender" → Differentiates bars by gender (m vs. f).
Analysis/Recommendations:
- Men tend to report higher mean incomes than women across most relationship statuses. The plot uses all rows, so the -1 placeholder for unreported income pulls every average down.
- Married individuals generally earn more, possibly due to dual incomes.
- Divorced individuals may experience income drops, indicating financial strain.
- Income variations exist within each group, suggesting other influencing factors like education and industry.
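Because a few very large incomes can dominate a mean, it may help to compare mean and median side by side after dropping the -1 placeholder. The toy rows below are made up to show the effect.

```python
import pandas as pd

# Toy data (made up); one very large income and one unreported (-1) income
toy = pd.DataFrame({
    "status": ["single"] * 6,
    "gender": ["m", "m", "m", "f", "f", "m"],
    "income": [20000, 40000, 1000000, 30000, 50000, -1],
})
valid = toy[toy["income"] > 0]

# Mean and median income for every status and gender combination
summary = valid.groupby(["status", "gender"])["income"].agg(["mean", "median"])
print(summary)
```

In this toy example the male mean is far above the female mean, while the two medians are identical, which shows how a single outlier can create an apparent gap.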
3. Pets and Preferences¶
1. Create a bar chart showing the distribution of pets categories (e.g., likes dogs, likes cats). Which preferences are most common?
pet_counts = data['pets'].value_counts()
plt.figure(figsize=(12, 6))
sns.barplot(x=pet_counts.index, y=pet_counts.values)
plt.xticks(rotation=90, ha='right')
plt.title('Distribution of Pet Preferences')
plt.xlabel('Pet Preference')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
steps:
- Select the pets column and use value_counts() to count the occurrences of each unique pet preference category.
- Create a bar plot: initialize a figure with a specified size using plt.figure(), then use sns.barplot() to draw a bar chart with the categories on the x-axis and the counts on the y-axis.
- Rotate the x-axis labels for better readability, and add a title and axis labels.
- Use plt.tight_layout() to optimize the layout and avoid overlapping elements.
Analysis/Recommendations:
- The most frequently reported pet preference is "likes dogs and likes cats" with approximately 24.7% of users indicating this, followed by "likes dogs" at about 12.1%.
- Some users report combinations of having pets and just liking them, and fewer users report disliking either.
- Develop filters that allow users to choose combinations, such as both liking and having pets.
- Use this clustering of preferences to segment users. For example, those who prefer “likes dogs and likes cats” might receive different content or feature suggestions compared to those with more unique or limited preferences.
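Percentages such as the ones quoted above can be computed with value_counts(normalize=True). The sketch below uses a made-up pets column and shows that the result depends on whether missing answers are counted in the denominator.

```python
import pandas as pd

# Toy pets column (made up), including two missing answers
pets = pd.Series(
    ["likes dogs and likes cats"] * 3 + ["likes dogs"] * 2 + ["has cats"] + [None] * 2,
    name="pets",
)

# Shares among users who answered (missing values are dropped by default)
share_of_answered = pets.value_counts(normalize=True)

# Shares among all users, counting missing answers in the denominator
share_of_all = pets.value_counts(normalize=True, dropna=False)

print(share_of_answered)
print(share_of_all)
```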
2. How do pets preferences vary across gender and age_group? Are younger users more likely to report liking pets compared to older users?
pet_distribution = data.groupby(['age_group', 'gender', 'pets']).size().reset_index(name='count')
pet_distribution
plt.figure(figsize=(12, 6))
# estimator="sum" adds the male and female counts so each bar is the number of users;
# the default estimator (mean) would average the two gender rows instead.
sns.barplot(data=pet_distribution, x="age_group", y="count", hue="pets", estimator="sum", errorbar=None)
plt.xlabel("Age Group", fontsize=14)
plt.ylabel("Number of Users", fontsize=14)
plt.title("Pet Preferences by Age Group and Gender", fontsize=16)
plt.legend(title="Pet Preference", fontsize=8)
plt.show()
steps:
- Set the figure size using plt.figure(figsize=(12, 6)) for better visualization.
- Use sns.barplot() from Seaborn to create a bar plot.
- x="age_group" displays the age groups on the x-axis, and y="count" shows the number of users for each pet preference.
- hue="pets" colors the bars by pet preference.
- Label the x-axis "Age Group" and the y-axis "Number of Users" with font size 14, and set the title to "Pet Preferences by Age Group and Gender" with font size 16.
Analysis/Recommendations:
- Younger individuals (e.g., 18-25 age group) might have a higher affinity for cats or dogs, whereas older groups may show a preference for low-maintenance pets (e.g., birds, fish) or no pets.
- Gender differences in pet preferences may also exist, for instance women may prefer cats or smaller pets and men may show a higher preference for dogs, but the chart does not split bars by gender, so it cannot confirm this.
- Pet brands can personalize advertisements based on age and gender.
- Animal shelters can design campaigns that match pet types with the most interested age group.
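One way to test such claims is to turn the free-text pets column into a flag and compare rates by age group and gender. The sketch below is an assumption about how this could be done, on made-up rows; the column name dog_positive is hypothetical. Note that a plain substring search for "likes dogs" would also match "dislikes dogs", so a word boundary is used.

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({
    "age_group": ["18-25", "18-25", "18-25", "26-35", "26-35", "26-35"],
    "gender":    ["f", "f", "m", "m", "f", "m"],
    "pets":      ["likes dogs and likes cats", "dislikes dogs and likes cats",
                  "has dogs", "likes cats", "has dogs and likes cats", None],
})

# Work only with users who answered the pets question
known = toy.dropna(subset=["pets"]).copy()

# True when the answer says the user likes or has dogs;
# \b stops "dislikes dogs" from matching "likes dogs"
known["dog_positive"] = known["pets"].str.contains(r"\b(?:likes|has) dogs\b")

# Naive substring match, kept only to show the pitfall
naive = known["pets"].str.contains("likes dogs", regex=False)

# Share of dog-positive users per age group and gender
rate = known.groupby(["age_group", "gender"])["dog_positive"].mean().unstack()
print(rate)
```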
4. Signs and Personality¶
1. Create a pie chart showing the distribution of zodiac signs (sign) across the platform. Which signs are most and least represented? Is this the right chart? If not, replace with right chart.
import matplotlib.pyplot as plt
zodiac_counts = data['sign'].value_counts()
plt.figure(figsize=(10, 6))
plt.pie(zodiac_counts, labels=zodiac_counts.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired.colors)
plt.title("Distribution of Zodiac Signs Among Users", fontsize=14)
plt.show()
steps:
- data['sign'].value_counts() calculates the frequency of each zodiac sign in the dataset.
- zodiac_counts supplies the slice values, and labels=zodiac_counts.index labels each slice with its zodiac sign.
- autopct='%1.1f%%' displays percentages with one decimal place, and startangle=140 rotates the chart for better visualization.
- colors=plt.cm.Paired.colors draws distinct colors from the Paired colormap.
Analysis/Recommendations:
- If the data is skewed toward specific zodiac signs, consider collecting more data to balance representation.
- Businesses can use the insights to tailor marketing strategies based on the most represented zodiac signs.
The pie chart above is not the right choice, because many slices of similar size are hard to compare. A bar chart is used instead for better comparison.
plt.figure(figsize=(12, 8))
sns.barplot(x=zodiac_counts.index, y=zodiac_counts.values, palette="viridis")
plt.xlabel("Zodiac Sign", fontsize=14)
plt.ylabel("Number of Users", fontsize=14)
plt.title("Zodiac Sign Distribution Among Users", fontsize=16)
plt.xticks(rotation=45)
plt.show()
steps:
- plt.figure(figsize=(12, 8)): Creates a figure with a width of 12 inches and a height of 8 inches.
- x=zodiac_counts.index: Zodiac signs as the x-axis labels. y=zodiac_counts.values: The count of each zodiac sign as the y-axis values.
- plt.xticks(rotation=45): Rotates x-axis labels by 45 degrees for better readability.
- plt.show(): Renders and displays the bar chart.
Analysis/Recommendations:
- The tallest bars indicate the most common zodiac signs among users, and the shortest bars indicate the least common.
- Businesses can use the most frequent zodiac signs to tailor products, services, or marketing strategies.
- Analyze correlations between zodiac signs and user behavior, preferences, or demographics.
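The sign column mixes the sign itself with a qualifier (for example "pisces but it doesn’t matter" in the data preview), so counting raw values splits one sign over several bars. A minimal sketch of collapsing each entry to its first word before counting, on a handful of sample values; whether this cleaning suits the full column is an assumption.

```python
import pandas as pd

# Sample values in the style of the sign column, plus one missing entry
signs = pd.Series(
    ["gemini", "cancer", "pisces but it doesn’t matter", "pisces", "aquarius", None],
    name="sign",
)

# Keep only the first word, which is the zodiac sign itself
base_sign = signs.str.split().str[0]

# Count users per sign; missing values are ignored
base_counts = base_sign.value_counts()
print(base_counts)
```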
2. How does sign vary across gender and status? Are there noticeable patterns or imbalances?
zodiac_distribution = data.groupby(['sign', 'gender', 'status']).size().reset_index(name='count')
plt.figure(figsize=(14, 7))
# estimator="sum" adds the counts over status so each bar is the number of users;
# the default estimator (mean) would average the per-status counts instead.
sns.barplot(data=zodiac_distribution, x='sign', y='count', hue='gender', estimator="sum", errorbar=None, palette='pastel')
plt.xlabel("Zodiac Sign", fontsize=14)
plt.ylabel("Number of Users", fontsize=14)
plt.title("Zodiac Sign Distribution Across Gender", fontsize=16)
plt.xticks(rotation=90)
plt.legend(title="Gender")
plt.show()
steps:
- Group the data by zodiac sign, gender, and status, compute the count of each unique combination, and convert the result into a DataFrame with a new column, 'count'.
- plt.figure(figsize=(14, 7)): Creates a figure with a width of 14 inches and a height of 7 inches.
- plt.xticks(rotation=90): Rotates x-axis labels by 90 degrees for readability. plt.legend(title="Gender"): Adds a legend with the title "Gender".
Analysis/Recommendations:
- Each zodiac sign has two or more bars depending on the number of genders in the dataset.
- The height of each bar indicates the number of users with that zodiac sign and gender.
- A significant gender difference for certain zodiac signs may indicate demographic trends or preferences.
- If a specific gender dominates certain zodiac signs, businesses can tailor content, products, or marketing strategies accordingly.
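When counts are grouped by three columns but plotted by two, the per-status counts have to be added up to get the number of users per sign and gender. seaborn's barplot aggregates with the mean by default, which gives a different number. The sketch below shows the difference on made-up rows.

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({
    "sign":   ["pisces", "pisces", "pisces", "pisces", "leo"],
    "gender": ["m", "m", "m", "f", "f"],
    "status": ["single", "single", "available", "single", "single"],
})

# Count every sign, gender, and status combination
grouped = toy.groupby(["sign", "gender", "status"]).size().reset_index(name="count")

# Number of users per sign and gender: sum the counts over status
totals = grouped.groupby(["sign", "gender"])["count"].sum().unstack(fill_value=0)

# Averaging the same counts gives the mean count per status, not the user total
means = grouped.groupby(["sign", "gender"])["count"].mean()

print(totals)
print(means)
```

Here male Pisces users total 3, while the mean over the two status rows is 1.5.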
sign_across_status = data.groupby(['sign', 'status']).size().reset_index(name='count')
plt.figure(figsize=(14,8))
sns.barplot(data=sign_across_status, x='sign', y='count', hue='status')
plt.title('Distribution of Zodiac Signs across Status', fontsize=14)
plt.xlabel('Zodiac Sign', fontsize=14)
plt.ylabel('Number of Users', fontsize=14)
plt.xticks(rotation=90)
plt.grid(True, axis= 'y', linestyle = 'dashed', alpha = 0.5)
plt.show()
steps:
- The count of each sign and status combination is computed and stored in sign_across_status.
- The groupby function categorizes the data by the two columns (sign and status).
- size() computes the count for each group, and reset_index(name='count') converts the grouped data into a structured DataFrame.
- plt.xticks(rotation=90) rotates x-axis labels for better readability.
- Dashed horizontal grid lines (plt.grid(True, axis='y', linestyle='dashed', alpha=0.5)) improve clarity.
Analysis/Recommendations:
- Some Zodiac signs might have significantly more users in certain status categories.
- The distribution might show biases or trends (e.g., certain signs having more active users or a specific relationship status).
- Check for external factors like cultural influences or demographics.
- Investigate why some signs have different distributions.
zodiac_pivot = data.groupby(['sign', 'status']).size().unstack().fillna(0)
zodiac_pivot.plot(kind='bar', stacked=True, figsize=(14, 7), colormap="viridis")
plt.xlabel("Zodiac Sign", fontsize=14)
plt.ylabel("Number of Users", fontsize=14)
plt.title("Zodiac Sign Distribution Across Relationship Status", fontsize=16)
plt.xticks(rotation=90)
plt.legend(title="Relationship Status")
plt.show()
steps:
- We first group by the two columns sign and status.
- size() calculates the count of users for each combination, unstack() reshapes the data so that status categories become separate columns, and fillna(0) replaces missing values (if any) with zero.
Analysis/Recommendations:
- Some Zodiac signs may have more users in a particular status than others.
- It visually shows how many users fall into each status category for every sign.
- Identify which Zodiac signs have a higher single status and target them with matchmaking services.
- Personalize recommendations based on sign-related preferences.
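To identify which signs have a higher share of single users, the pivot table can be converted from counts to row-wise proportions. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({
    "sign":   ["pisces", "pisces", "pisces", "leo", "gemini", "gemini"],
    "status": ["single", "single", "available", "single", "available", "single"],
})

# Same pivot as above: signs as rows, statuses as columns, counts as values
pivot = toy.groupby(["sign", "status"]).size().unstack().fillna(0)

# Divide each row by its total so every sign sums to 1
share = pivot.div(pivot.sum(axis=1), axis=0)

# Signs ordered by their share of single users
single_share = share["single"].sort_values(ascending=False)
print(single_share)
```

Comparing shares instead of raw counts avoids ranking a sign highly just because it has more users overall.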