Introduction¶

Bumble is a widely used dating platform that allows individuals to connect based on mutual interests and compatibility. Unlike traditional dating apps, Bumble empowers women to initiate conversations, fostering meaningful interactions. Users create profiles that include demographic details, lifestyle choices, and personal preferences, providing a rich dataset for analyzing behavioral patterns and trends.

Purpose¶

The goal of this analysis is to explore and interpret the Bumble dataset to extract meaningful insights into user behavior and preferences.

In [1]:
import numpy as np
import pandas as pd
In [2]:
data = pd.read_csv(r"C:\Users\msrav\OneDrive\Desktop\nextleap course documents\python\bumble.csv")
In [3]:
data.head()
Out[3]:
age status gender body_type diet drinks education ethnicity height income job last_online location pets religion sign speaks
0 22 single m a little extra strictly anything socially working on college/university asian, white 75.0 -1 transportation 2012-06-28-20-30 south san francisco, california likes dogs and likes cats agnosticism and very serious about it gemini english
1 35 single m average mostly other often working on space camp white 70.0 80000 hospitality / travel 2012-06-29-21-41 oakland, california likes dogs and likes cats agnosticism but not too serious about it cancer english (fluently), spanish (poorly), french (...
2 38 available m thin anything socially graduated from masters program NaN 68.0 -1 NaN 2012-06-27-09-10 san francisco, california has cats NaN pisces but it doesn’t matter english, french, c++
3 23 single m thin vegetarian socially working on college/university white 71.0 20000 student 2012-06-28-14-22 berkeley, california likes cats NaN pisces english, german (poorly)
4 29 single m athletic NaN socially graduated from college/university asian, black, other 66.0 -1 artistic / musical / writer 2012-06-27-21-26 san francisco, california likes dogs and likes cats NaN aquarius english
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   status       59946 non-null  object 
 2   gender       59946 non-null  object 
 3   body_type    54650 non-null  object 
 4   diet         35551 non-null  object 
 5   drinks       56961 non-null  object 
 6   education    53318 non-null  object 
 7   ethnicity    54266 non-null  object 
 8   height       59943 non-null  float64
 9   income       59946 non-null  int64  
 10  job          51748 non-null  object 
 11  last_online  59946 non-null  object 
 12  location     59946 non-null  object 
 13  pets         40025 non-null  object 
 14  religion     39720 non-null  object 
 15  sign         48890 non-null  object 
 16  speaks       59896 non-null  object 
dtypes: float64(1), int64(2), object(14)
memory usage: 7.8+ MB
In [5]:
data.describe()
Out[5]:
age height income
count 59946.000000 59943.000000 59946.000000
mean 32.340290 68.295281 20033.222534
std 9.452779 3.994803 97346.192104
min 18.000000 1.000000 -1.000000
25% 26.000000 66.000000 -1.000000
50% 30.000000 68.000000 -1.000000
75% 37.000000 71.000000 -1.000000
max 110.000000 95.000000 1000000.000000
In [6]:
data.shape
Out[6]:
(59946, 17)

Part 1: Data Cleaning¶

1. Which columns in the dataset have missing values, and what percentage of data is missing in each column?

In [7]:
missing_values = data.isnull().sum()
missing_percentage = (missing_values / len(data)) * 100

# Create a DataFrame to display the results
missing_data_df = pd.DataFrame({'Missing Values': missing_values, 'Percentage Missing': missing_percentage})
In [8]:
missing_data_df
Out[8]:
Missing Values Percentage Missing
age 0 0.000000
status 0 0.000000
gender 0 0.000000
body_type 5296 8.834618
diet 24395 40.694959
drinks 2985 4.979482
education 6628 11.056618
ethnicity 5680 9.475194
height 3 0.005005
income 0 0.000000
job 8198 13.675641
last_online 0 0.000000
location 0 0.000000
pets 19921 33.231575
religion 20226 33.740366
sign 11056 18.443266
speaks 50 0.083408

2. Are there columns where more than 50% of the data is missing? Would you drop the columns where missing values are >50%? If yes, why?

Solution: There are no columns in the dataset where more than 50% of the data is missing; the highest is diet at roughly 40.7%. Therefore, no columns need to be dropped.

3. How would you handle the missing numerical data (e.g., height, income)? Would you impute the missing data with the median or average value of height and income for the corresponding category, such as gender, age group, or location? If yes, why?

Solution: Yes — imputing with the group median (here, the median height per gender) is preferable, because the median is robust to outliers and height differs systematically between genders.
In [9]:
calculate_median = data.groupby(["gender"])["height"].transform("median")
data["height"] = data["height"].fillna(calculate_median)
print(calculate_median)
0        70.0
1        70.0
2        70.0
3        70.0
4        70.0
         ... 
59941    65.0
59942    70.0
59943    70.0
59944    70.0
59945    70.0
Name: height, Length: 59946, dtype: float64
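
The same grouped-median idea extends to income, where -1 acts as a missing-value sentinel rather than NaN. A minimal sketch on hypothetical toy data (not the Bumble file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Bumble data (hypothetical values)
df = pd.DataFrame({
    "gender": ["m", "m", "f", "f", "f"],
    "height": [70.0, np.nan, 65.0, 63.0, np.nan],
    "income": [80000, -1, 20000, -1, 40000],
})

# Height: fill NaN with the median height of the same gender
df["height"] = df["height"].fillna(
    df.groupby("gender")["height"].transform("median"))

# Income: treat -1 as missing, then impute the same way
df["income"] = df["income"].replace(-1, np.nan)
df["income"] = df["income"].fillna(
    df.groupby("gender")["income"].transform("median"))
```

This keeps the imputation consistent across both numeric columns while respecting gender-level differences.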

Data Types¶

  1. Are there any inconsistencies in the data types across columns (e.g., numerical data stored as strings)?
In [10]:
data.dtypes
Out[10]:
age              int64
status          object
gender          object
body_type       object
diet            object
drinks          object
education       object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
pets            object
religion        object
sign            object
speaks          object
dtype: object

There are no major inconsistencies in the data types: numerical columns are correctly stored as int64 or float64, and categorical/text columns as object. The one exception is last_online, which is stored as a string and should be converted to datetime.

2. Which columns require conversion to numerical data types for proper analysis (e.g., income)?

Solution: No columns require conversion to numerical types; age, income, and height are already numeric.

3. Does the last_online column need to be converted into a datetime format? What additional insights can be gained by analyzing this as a date field?

In [11]:
data["last_online"]= pd.to_datetime(data["last_online"] , format= "%Y-%m-%d-%H-%M")
data["last_online"]
Out[11]:
0       2012-06-28 20:30:00
1       2012-06-29 21:41:00
2       2012-06-27 09:10:00
3       2012-06-28 14:22:00
4       2012-06-27 21:26:00
                ...        
59941   2012-06-12 21:47:00
59942   2012-06-29 11:01:00
59943   2012-06-27 23:37:00
59944   2012-06-23 13:01:00
59945   2012-06-29 00:42:00
Name: last_online, Length: 59946, dtype: datetime64[ns]
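
Once parsed as datetime, last_online supports recency analysis: days since last login, share of recently active users, activity by day of week, and so on. A hedged sketch on a hypothetical sample (column values invented for illustration):

```python
import pandas as pd

# Hypothetical sample of last_online strings in the dataset's format
last_online = pd.to_datetime(pd.Series([
    "2012-06-28-20-30", "2012-06-29-21-41", "2012-06-12-21-47",
]), format="%Y-%m-%d-%H-%M")

# Days of inactivity, measured against the newest login in the sample
snapshot = last_online.max()
days_inactive = (snapshot - last_online).dt.days

# Share of users active within 7 days of the snapshot
active_week = (days_inactive <= 7).mean()
```

Such recency metrics help distinguish engaged users from dormant profiles.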

3. Outliers¶

1. Are there any apparent outliers in numerical columns such as age, height, or income? What are the ranges of values in these columns?

In [12]:
data.describe()
Out[12]:
age height income
count 59946.000000 59946.000000 59946.000000
mean 32.340290 68.295282 20033.222534
std 9.452779 3.994738 97346.192104
min 18.000000 1.000000 -1.000000
25% 26.000000 66.000000 -1.000000
50% 30.000000 68.000000 -1.000000
75% 37.000000 71.000000 -1.000000
max 110.000000 95.000000 1000000.000000

Yes, there are outliers in the numerical columns age, height, and income.

In [13]:
age_range = (data['age'].min() , data['age'].max())
height_range = (data['height'].min() , data['height'].max())
income_range = (data['income'].min() , data['income'].max())

print(f"The age range is : {age_range} ")
print(f"The height range is : {height_range} ")
print(f"The income range is : {income_range} ") 
The age range is : (18, 110) 
The height range is : (1.0, 95.0) 
The income range is : (-1, 1000000) 

1. I used the .describe() method to examine the distribution of the numerical columns; comparing the min and max against the quartiles flags potential outliers (e.g., a minimum height of 1 inch and a maximum age of 110).

2. Then I used .min() and .max() to state the exact range of each column, which makes the anomalies explicit.

2. Any -1 values in numerical columns like income should be replaced with 0, as they may represent missing or invalid data.

In [14]:
data['income'] = data['income'].replace(-1, 0)
In [15]:
data['income'].min()
Out[15]:
0

As shown above, the -1 values in the income column have been replaced with 0.

3. For other outliers, how would you ensure that they do not disproportionately impact the analysis while retaining as much meaningful data as possible? Would you delete the data, or rather than deleting it, calculate the mean and median using only the middle 80% of the data (removing extreme high and low values)? Provide appropriate reasons for every step.

In [16]:
numerical_cols = ['age', 'height', 'income']

lower_bound = data[numerical_cols].quantile(0.10)
upper_bound = data[numerical_cols].quantile(0.90)

# Keep only values strictly inside the 10th-90th percentile band;
# values outside the band become NaN and are skipped by mean/median
mask = (data[numerical_cols] > lower_bound) & (data[numerical_cols] < upper_bound)
data_filtered = data[numerical_cols][mask]

# Calculate mean and median on the middle 80% of data
trimmed_mean = data_filtered[numerical_cols].mean()
trimmed_median = data_filtered[numerical_cols].median()

# Display results
print("Trimmed Mean:\n", trimmed_mean)
print("Trimmed Median:\n", trimmed_median)
Trimmed Mean:
 age          31.357686
height       68.254426
income    26109.890110
dtype: float64
Trimmed Median:
 age          30.0
height       68.0
income    20000.0
dtype: float64

steps:

  1. Calculate the 10th percentile (lower bound) and 90th percentile (upper bound) for each numerical column; these mark the extreme low and high values.

  2. Keep only values between the two bounds, i.e., the middle 80% of the data.

  3. Compute the mean and median on this trimmed subset, so extreme values cannot drag the estimates.
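
The trimming logic can be sanity-checked on a single column with hypothetical values:

```python
import pandas as pd

s = pd.Series([1, 18, 22, 25, 26, 28, 30, 33, 37, 110])  # hypothetical ages

# Keep only values strictly inside the 10th-90th percentile band
lo, hi = s.quantile(0.10), s.quantile(0.90)
trimmed = s[(s > lo) & (s < hi)]

trimmed_mean = trimmed.mean()
trimmed_median = trimmed.median()
```

The extreme values 1 and 110 fall outside the band and no longer influence the mean, while the central observations are all retained.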

4. Missing Data Visualization¶

1. Create a heatmap to visualize missing values across the dataset. Which columns show consistent missing data patterns?

In [17]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set figure size
plt.figure(figsize=(12, 6))

# Create a heatmap to visualize missing values
sns.heatmap(data.isnull(), cmap='viridis', cbar=True)

# Add title
plt.title("Missing Values Heatmap", fontsize=14)
plt.xlabel('Columns', fontsize=14)
plt.ylabel('Rows', fontsize=14)

# Show plot
plt.show()
[Figure: Missing Values Heatmap]

Analysis: Solid vertical bands in the heatmap mark columns with heavy missingness; diet, pets, and religion show the most consistent missing-data patterns.

Part 2: Data Processing¶

Binning and Grouping¶

1. How would you bin the age column into categories (e.g. "18-25", "26-35", "36-45", and "46+" ) to create a new column, age_group. How does the distribution of users vary across these age ranges?

In [18]:
import numpy as np 

data['age_group'] = np.where(data['age'] <= 25,"18-25",
                    np.where(data['age'] <=35,"26-35",
                    np.where(data['age'] <=45,"36-45","46+")))

                          
age_distribution_count = data['age_group'].value_counts()
age_distribution_count
Out[18]:
26-35    28621
18-25    14454
36-45    10803
46+       6068
Name: age_group, dtype: int64

steps:

  1. Create a new column, age_group, based on the values in the age column, categorizing ages into four groups: "18-25" for ages ≤ 25, "26-35" for ages 26-35, "36-45" for ages 36-45, and "46+" for ages above 45.
  2. Nested np.where calls assign the label, e.g. np.where(data['age'] <= 25, "18-25", ...).
  3. Count the occurrences of each age group using value_counts().
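
The nested np.where calls can equivalently be written with pd.cut, which states the bin edges declaratively; a sketch under the same boundaries (toy ages, not the full dataset):

```python
import pandas as pd

ages = pd.Series([22, 35, 38, 23, 29, 50])  # hypothetical ages

# Right-closed bins: (17, 25], (25, 35], (35, 45], (45, 120]
age_group = pd.cut(
    ages,
    bins=[17, 25, 35, 45, 120],
    labels=["18-25", "26-35", "36-45", "46+"],
)
```

pd.cut also returns an ordered categorical, so value_counts() and plots keep the age groups in their natural order.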

2. Group income into categories like "Low Income," "Medium Income," and "High Income" based on meaningful thresholds (e.g., quartiles). What insights can be derived from these groups?

In [19]:
lower_quantile = data['income'].quantile(0.25)
middle_quantile = data['income'].quantile(0.50)
higher_quantile = data['income'].quantile(0.90)

data['income_category'] = np.where(data['income'] <= lower_quantile,"Low Income",
                          np.where(data['income'] <=middle_quantile,"Medium income","High income"))

                          
income_distribution = data['income_category'].value_counts()
income_distribution
Out[19]:
Low Income     48442
High income    11504
Name: income_category, dtype: int64
In [20]:
data[['income','income_category']]
Out[20]:
income income_category
0 0 Low Income
1 80000 High income
2 0 Low Income
3 20000 High income
4 0 Low Income
... ... ...
59941 0 Low Income
59942 0 Low Income
59943 100000 High income
59944 0 Low Income
59945 0 Low Income

59946 rows × 2 columns

steps:

  1. First calculate the income thresholds: the 25th percentile (lower_quantile), the 50th percentile (middle_quantile), and the 90th percentile (higher_quantile) — the latter is computed but never used in the categorization.
  2. Then assign categories: values at or below the 25th percentile are "Low Income", values between the 25th and 50th percentiles are "Medium income", and the rest are "High income".
  3. value_counts() shows the distribution. Note that no "Medium income" group appears: more than half of the incomes are 0 after the -1 replacement, so the 25th and 50th percentiles both equal 0 and no value can fall strictly between them.
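
One way to avoid the collapsed middle bracket is to compute the quantile thresholds only over reported (non-zero) incomes; a hypothetical sketch:

```python
import numpy as np
import pandas as pd

income = pd.Series([0, 0, 0, 20000, 30000, 50000, 80000, 100000])  # hypothetical

# Quantiles over reported (non-zero) incomes only
reported = income[income > 0]
q25, q50 = reported.quantile(0.25), reported.quantile(0.50)

category = np.where(income <= q25, "Low Income",
           np.where(income <= q50, "Medium Income", "High Income"))
```

With thresholds computed this way, the non-reporters still land in "Low Income", but reported incomes spread across all three brackets.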

2. Derived Features:¶

1. Create a new feature, profile_completeness, by calculating the percentage of non-missing values for each user profile. How complete are most user profiles, and how does completeness vary across demographics?

In [21]:
non_missing_counts = data.notnull().sum(axis=1)  
total_columns = data.shape[1]  
data['profile_completeness'] = (non_missing_counts / total_columns) * 100
In [22]:
data['profile_completeness']
Out[22]:
0        100.000000
1        100.000000
2         84.210526
3         94.736842
4         89.473684
            ...    
59941     84.210526
59942    100.000000
59943     94.736842
59944    100.000000
59945     94.736842
Name: profile_completeness, Length: 59946, dtype: float64
In [23]:
data.groupby('profile_completeness')['gender'].value_counts()
Out[23]:
profile_completeness  gender
52.631579             m           52
                      f           21
57.894737             m          165
                      f           81
63.157895             m          488
                      f          262
68.421053             m          639
                      f          394
73.684211             m         1234
                      f          780
78.947368             m         2381
                      f         1600
84.210526             m         4541
                      f         2942
89.473684             m         7507
                      f         5021
94.736842             m         9998
                      f         6865
100.000000            m         8824
                      f         6151
Name: gender, dtype: int64

steps:

  1. Use data.notnull().sum(axis=1) to count the number of filled (non-null) fields for each user.
  2. Use data.shape[1] to get the total number of columns in the dataset.
  3. Divide the two and multiply by 100 to get the completeness percentage.
  4. Group by completeness level and count genders to compare completeness between men and women.
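
Rather than listing counts for every completeness level, the average completeness per demographic group is often easier to read; a toy sketch (hypothetical profiles):

```python
import pandas as pd

df = pd.DataFrame({  # hypothetical profiles
    "gender": ["m", "f", "m", "f"],
    "diet": ["anything", None, None, "vegan"],
    "pets": [None, "likes dogs", None, "has cats"],
})

# Percentage of filled fields per row, then averaged per gender
df["profile_completeness"] = df.notnull().mean(axis=1) * 100
avg_by_gender = df.groupby("gender")["profile_completeness"].mean()
```

notnull().mean(axis=1) collapses the count-and-divide steps into one expression.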

3.Unit Conversion¶

1. Convert the height column from inches to centimeters using the conversion factor (1 inch = 2.54 cm). Store the converted values in a new column, height_cm.

In [24]:
# Convert height from inches to centimeters (1 inch = 2.54 cm)
data['height_cm'] = data['height'] * 2.54

# Display first few rows to verify the new column
data[['height', 'height_cm']].head()
Out[24]:
height height_cm
0 75.0 190.50
1 70.0 177.80
2 68.0 172.72
3 71.0 180.34
4 66.0 167.64

steps:

1. Multiply data['height'] (inches) by 2.54 to convert to centimeters, storing the result in the new height_cm column.

Part 3: Data Analysis¶

1. Demographic Analysis¶

1. What is the gender distribution (gender) across the platform? Are there any significant imbalances?

In [25]:
# Analyze gender distribution
gender_distribution = data['gender'].value_counts(normalize=True) * 100  

# Display results
gender_distribution
Out[25]:
m    59.768792
f    40.231208
Name: gender, dtype: float64

steps:

  1. First, count how many users belong to each gender category.
  2. normalize=True inside value_counts() converts the raw counts into proportions; multiplying by 100 turns these proportions into percentages.

2. What are the proportions of users in different status categories (e.g., single, married, seeing someone)? What does this suggest about the platform’s target audience?

In [26]:
# Analyze relationship status distribution
status_distribution = data['status'].value_counts(normalize=True) * 100 

# Display results
status_distribution
Out[26]:
single            92.911954
seeing someone     3.443099
available          3.111133
married            0.517132
unknown            0.016682
Name: status, dtype: float64

steps:

  1. value_counts() calculates the number of users for each unique status (e.g., how many are "single," "married," etc.).
  2. Adding normalize=True inside value_counts() converts the raw counts into proportions

3. How does status vary by gender? For example, what proportion of men and women identify as single?

In [27]:
status_gender_distribution = data.groupby(["status", "gender"]).size() / len(data) * 100
status_gender_data = status_gender_distribution.reset_index(name="Percentage")
status_gender_data_sorted = status_gender_data.sort_values(by="Percentage", ascending=False)
print(status_gender_data_sorted)
           status gender  Percentage
7          single      m   55.680112
6          single      f   37.231842
1       available      m    2.016815
5  seeing someone      m    1.769926
4  seeing someone      f    1.673173
0       available      f    1.094318
3         married      m    0.291929
2         married      f    0.225203
9         unknown      m    0.010009
8         unknown      f    0.006673

steps:

  1. Count the number of users for each (status, gender) combination with groupby(...).size().
  2. Dividing by the total number of users and multiplying by 100 turns the counts into percentages of the whole dataset.
  3. Then, convert the grouped result into a structured DataFrame with a column named "Percentage".
  4. Sort so that the most common status-gender combinations appear at the top for easy interpretation.
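
The percentages above are shares of the whole dataset; to answer "what proportion of men vs. women identify as single" directly, normalize within each gender, e.g. with pd.crosstab (toy data, not the full dataset):

```python
import pandas as pd

df = pd.DataFrame({  # hypothetical users
    "gender": ["m", "m", "m", "f", "f"],
    "status": ["single", "single", "married", "single", "seeing someone"],
})

# Row-normalized: each gender's statuses sum to 100%
status_within_gender = pd.crosstab(
    df["gender"], df["status"], normalize="index") * 100
```

normalize="index" is what switches the denominator from "all users" to "users of that gender".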

Analysis/Recommendations:

  1. Relative to their overall share of users, a higher proportion of women than men identify as "seeing someone", which may suggest they update their status more frequently.
  2. The proportion of married users is very low, reinforcing that Bumble is used overwhelmingly by singles seeking new relationships.
  3. If there is a gender imbalance (more single men than women), Bumble should target female users through marketing campaigns, safety features, and community-building initiatives to increase engagement and retention.
  4. Some users are "seeing someone" or in other relationship stages. Bumble could introduce optional status updates or new features to help users indicate evolving relationship stages.

2. Correlation Analysis¶

1. What are the correlations between numerical columns such as age, income, gender Are there any strong positive or negative relationships?

In [28]:
# Perform correlation analysis on numerical columns
correlation_matrix = data[['age', 'income', 'height']].corr()

# Display correlation matrix
correlation_matrix
Out[28]:
age income height
age 1.000000 -0.001004 -0.022253
income -0.001004 1.000000 0.065048
height -0.022253 0.065048 1.000000

steps:

  1. We are creating an array composed of aage, income and height.
  2. Then we find the correlation between them using .corr()

Analysis/Recommendations:

  1. There is almost no correlation between age and income, meaning income does not significantly increase or decrease with age in this dataset.
  2. This suggests that users across different age groups have similar income distributions or that income data is inconsistent.

2. How does age correlate with income? Are older users more likely to report higher income levels?

In [29]:
age_income_correlation = data['age'].corr(data['income'])
age_income_correlation
Out[29]:
-0.0010038681910053916

Analysis: It is extremely close to zero, indicating no meaningful relationship between age and income.

3. Diet and Lifestyle Analysis¶

1. How do dietary preferences (diet) distribute across the platform? For example, what percentage of users identify as vegetarian, vegan, or follow "anything" diets?

In [30]:
diet_percentage = data['diet'].value_counts() * 100 / len(data['diet'])
diet_percentage
Out[30]:
mostly anything        27.666567
anything               10.314283
strictly anything       8.529343
mostly vegetarian       5.745171
mostly other            1.679845
strictly vegetarian     1.459647
vegetarian              1.112668
strictly other          0.754012
mostly vegan            0.563841
other                   0.552164
strictly vegan          0.380342
vegan                   0.226871
mostly kosher           0.143462
mostly halal            0.080072
strictly halal          0.030027
strictly kosher         0.030027
halal                   0.018350
kosher                  0.018350
Name: diet, dtype: float64

steps:

  1. Count how many users fall into each dietary preference.
  2. Multiplying by 100 / len(data['diet']) turns the counts into percentages of all users — including the roughly 41% with a missing diet value, which is why the shares do not sum to 100.
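
Since entries like "mostly vegetarian" and "strictly vegetarian" describe the same base diet, stripping the modifier gives a cleaner headline split; a sketch on toy values (the modifier list is an assumption based on the categories seen above):

```python
import pandas as pd

diet = pd.Series(["mostly anything", "strictly vegetarian", "vegan",
                  "mostly vegan", "anything", None])  # hypothetical answers

# Collapse "mostly "/"strictly " modifiers into the base category
base_diet = diet.str.replace(r"^(mostly|strictly) ", "", regex=True)
share = base_diet.value_counts(normalize=True) * 100  # % of non-null answers
```

value_counts(normalize=True) excludes missing answers, so these shares sum to 100 over users who reported a diet.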

Analysis/Recommendations:

  1. The most common categories are "mostly anything" (27.67%), "anything" (10.31%), and "strictly anything" (8.53%), indicating that most users do not have strict dietary restrictions.
  2. Allow users to filter potential matches by dietary preference to improve compatibility.
  3. Consider adding dietary preference badges or profile highlights to make it easier for users to identify shared food habits.

2. How do drinking habits (drinks) vary across different diet categories? Are users with stricter diets (e.g., vegan) less likely to drink?

In [31]:
# Analyze drinking habits across different diet categories
drink_diet_distribution = data.groupby("diet")["drinks"].value_counts(normalize=True) * 100

# Display results
drink_diet_distribution
Out[31]:
diet        drinks     
anything    socially       75.667557
            often           9.579439
            rarely          8.511348
            not at all      4.923231
            very often      0.967957
                             ...    
vegetarian  rarely         10.771704
            often           9.485531
            not at all      5.948553
            very often      1.125402
            desperately     0.964630
Name: drinks, Length: 103, dtype: float64

steps:

  1. First, group the data by diet category.
  2. Within each diet category, value_counts(normalize=True) gives the share of users in each drinking habit.
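
To test the "stricter diets drink less" hypothesis directly, the share of "not at all" answers per diet can be computed in one pass; a toy sketch:

```python
import pandas as pd

df = pd.DataFrame({  # hypothetical users
    "diet": ["vegan", "vegan", "anything", "anything", "anything"],
    "drinks": ["not at all", "socially", "socially", "often", "not at all"],
})

# Percentage of non-drinkers within each diet category
nondrinker_rate = (df["drinks"].eq("not at all")
                     .groupby(df["diet"]).mean() * 100)
```

Sorting nondrinker_rate would then rank diets from most to least abstinent, a sharper answer than scanning the full diet x drinks table.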

Analysis/Recommendations:

  1. Across all diet categories, "socially" is the most frequent drinking choice.
  2. Mostly anything (75.61%), mostly vegetarian (71.90%), and vegetarian (71.70%) users are more likely to drink socially.
  3. Allow users to filter matches based on drinking habits, especially for those with strict dietary restrictions.
  4. Create marketing campaigns tailored to sober communities, especially halal, kosher, and vegan users.

4. Geographical Insights¶

1. Extract city and state information from the location column. What are the top 5 cities and states with the highest number of users?

In [32]:
data[['city','state']] = data['location'].str.split(', ',expand = True, n=1)
data[['city','state']]
Out[32]:
city state
0 south san francisco california
1 oakland california
2 san francisco california
3 berkeley california
4 san francisco california
... ... ...
59941 oakland california
59942 san francisco california
59943 south san francisco california
59944 san francisco california
59945 san francisco california

59946 rows × 2 columns

In [33]:
# Ensure proper extraction by handling extra spaces
data[['city', 'state']] = data['location'].str.split(',', n=1, expand=True)

# Trim any leading/trailing spaces
data['city'] = data['city'].str.strip()
data['state'] = data['state'].str.strip()

# Count the number of users per city and state
top_cities = data['city'].value_counts().head(5)  # Top 5 cities
print(f"TOP 5 CITIES WITH HIGHEST NUMBER OF USERS: \n\n{top_cities}")
top_states = data['state'].value_counts().head(5)  # Top 5 states
print(f"TOP 5 STATES WITH HIGHEST NUMBER OF USERS: \n\n{top_states}")
TOP 5 CITIES WITH HIGHEST NUMBER OF USERS: 

san francisco    31064
oakland           7214
berkeley          4212
san mateo         1331
palo alto         1064
Name: city, dtype: int64
TOP 5 STATES WITH HIGHEST NUMBER OF USERS: 

california       59855
new york            17
illinois             8
massachusetts        5
texas                4
Name: state, dtype: int64

steps:

  1. Extract city and state from the location column by splitting on the first comma.
  2. Count users in each city and state with data['city'].value_counts().head(5) and the equivalent for state.
  3. Finally, display the most active cities and states based on user count.

Analysis/Recommendations:

  1. With 31,064 users, San Francisco is the most active city on Bumble, followed by Oakland (7,214 users) and Berkeley (4,212 users).
  2. 59,855 users are from California, vastly outnumbering users in other states. The next most active states are New York (17 users), Illinois (8 users), Massachusetts (5 users), and Texas (4 users)—showing a major imbalance.
  3. Given the dominance of California-based users, Bumble should invest in marketing campaigns for states like Texas, New York, and Illinois to attract a more diverse audience.
  4. Since San Francisco, Oakland, and Berkeley have the highest engagement, Bumble could introduce localized events, exclusive promotions, or city-based filters to enhance the user experience.

2. How does age vary across the top cities? Are certain cities dominated by younger or older users?

In [34]:
city_with_average_age = data.groupby('city')['age'].mean()

print(f" Cities with high average age is :\n {city_with_average_age.sort_values(ascending = False).head(5)}")
print(f" Cities with low average age is :\n {city_with_average_age.sort_values(ascending = False).tail(5)}")
 Cities with high average age is :
 city
forest knolls     62.5
bellingham        59.0
port costa        53.0
seaside           50.0
redwood shores    47.0
Name: age, dtype: float64
 Cities with low average age is :
 city
fayetteville      20.0
isla vista        19.0
canyon            19.0
canyon country    19.0
long beach        19.0
Name: age, dtype: float64

steps:

  1. Grouping the data: the dataset is grouped by the "city" column and the mean of the "age" column is calculated for each city.

  2. Cities with a high average age: sorting in descending order and taking the top 5 gives the oldest cities.

  3. Cities with a low average age: taking the tail of the same descending sort gives the 5 cities with the lowest average age.

Analysis/Recommendations:

  1. Identifying cities with older and younger populations based on the average age.
  2. In older cities, focus on features that appeal to more mature users, such as serious relationships, professional networking, or niche interests (e.g., travel, wellness).
  3. Introduce location-specific incentives, such as student discounts in younger cities or premium memberships in wealthier, older communities.
  4. Consider events and meetups in these cities to drive engagement based on the dominant age group.

3. What are the average income levels in the top states or cities? Are there regional patterns in reported income?

In [35]:
# Calculate average income per city and state
city_income = data.groupby('city')['income'].mean()
state_income = data.groupby('state')['income'].mean()

# Get average income for the top 5 cities and states
high_income_state = state_income.sort_values(ascending = False).head(5)
high_income_city = city_income.sort_values(ascending = False).head(5)

print("Average Income in Top 5 Cities:\n", high_income_city)
print("\nAverage Income in Top 5 States:\n", high_income_state)
Average Income in Top 5 Cities:
 city
petaluma        500000.000000
santa cruz      230000.000000
south orange    150000.000000
boulder         150000.000000
montara          85833.333333
Name: income, dtype: float64

Average Income in Top 5 States:
 state
new jersey                  150000.0
colorado                     75000.0
vietnam                      60000.0
british columbia, canada     60000.0
pennsylvania                 40000.0
Name: income, dtype: float64

steps:

  1. Calculate Average Income for Each City and State: The dataset is grouped by "city" and "state", and the mean income is calculated.This provides the average income level for each city and state.
  2. Identify the Top 5 Cities and States by Income: The sort_values(ascending=False).head(5) function is used to retrieve the top 5 cities and states with the highest average income.
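
Averages from places with only one or two reporting users can dominate such rankings. Requiring a minimum group size before ranking gives more trustworthy leaders; a sketch on hypothetical data (the threshold of 3 is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({  # hypothetical city incomes
    "city": ["petaluma", "san francisco", "san francisco",
             "san francisco", "oakland", "oakland", "oakland"],
    "income": [500000, 80000, 100000, 90000, 40000, 50000, 60000],
})

stats = df.groupby("city")["income"].agg(["mean", "count"])

# Rank only cities with at least 3 reporting users
reliable = stats[stats["count"] >= 3].sort_values("mean", ascending=False)
```

The single $500,000 profile is excluded from the ranking, so the result reflects cities with enough data to trust.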

Analysis/Recommendations:

  1. The top averages come from locations with very few reporting users (e.g., Petaluma at $500,000), so these rankings reflect a handful of outlier profiles rather than regional wealth.
  2. Since the sample is overwhelmingly Bay Area-based, smaller affluent cities top the city list, and the state-level averages rest on too few out-of-state users to be reliable.
  3. Invest more in digital marketing in high-income areas where users are more likely to purchase subscriptions.
  4. Consider launching regional discounts or affordable subscription plans.

5. Height Analysis¶

1. What is the average height of users across different gender categories?

In [37]:
# Compute average height per gender category
avg_height_by_gender = data.groupby("gender")["height"].mean().dropna()
avg_height_by_gender
Out[37]:
gender
f    65.103869
m    70.443468
Name: height, dtype: float64

steps:

  1. We grouped the dataset by gender and calculated the mean height for each category.

Analysis/Recommendations:

  1. Conduct a location-based analysis to see if height preferences differ across different regions.
  2. Leverage height data to refine targeting strategies for different user demographics.

2. How does height vary by age_group? Are there noticeable trends among younger vs. older users?

In [40]:
avg_height_by_age_group = data.groupby("age_group")["height"].mean()
avg_height_by_age_group
Out[40]:
age_group
18-25    68.200913
26-35    68.406764
36-45    68.325095
46+      67.941167
Name: height, dtype: float64

steps:

  1. We grouped the dataset by age_group and calculated the mean height for each category.

Analysis/Recommendations:

  1. The average height is highest in the 25-34 age group (68.41 inches) and then gradually declines.
  2. Users aged 55+ are shorter on average (67.37 inches) compared to younger users.
  3. The app could introduce dynamic height filters based on age groups.
  4. Break down height trends by gender within age groups to see if trends differ.

3. What is the distribution of height within body_type categories (e.g., athletic, curvy, thin)? Do the distributions align with expectations?

In [41]:
# Compute average height per body type category
avg_height_by_body_type = data.groupby("body_type")["height"].mean().sort_values()
avg_height_by_body_type
Out[41]:
body_type
curvy             65.210245
full figured      66.464817
rather not say    67.272727
thin              67.866058
average           68.100805
skinny            68.544176
fit               68.546062
a little extra    68.820084
overweight        68.948198
used up           69.180282
jacked            69.292162
athletic          69.707336
Name: height, dtype: float64

steps:

  1. We grouped the dataset by age_group and calculated the mean height for each category.
  2. Then, use the sort_values method

Analysis/Recommendations:

  1. Users could be given height and body type-based preferences in their search filters.
  2. Compare these findings by gender to see if height-body type trends differ between men and women.
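Recommendation 2 could be sketched the same way as the earlier grouped means, adding gender to the key; again the DataFrame is a synthetic stand-in for the real data.

```python
import pandas as pd

# Synthetic stand-in for the Bumble data
data = pd.DataFrame({
    "body_type": ["athletic", "athletic", "curvy", "curvy"],
    "gender": ["m", "f", "f", "m"],
    "height": [71.0, 67.0, 65.0, 69.0],
})

# Mean height per body type, split by gender
height_by_body_gender = (
    data.groupby(["body_type", "gender"])["height"].mean().unstack()
)
print(height_by_body_gender)
```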

6. Income Analysis¶

1. What is the distribution of income across the platform? Are there specific income brackets that dominate? How would you handle cases where income is blank or 0?

In [42]:
income_without_zero = data[data['income'] != 0]

income_without_zero['income'].value_counts().sort_values(ascending = False)
Out[42]:
20000      2952
100000     1621
80000      1111
30000      1048
40000      1005
50000       975
60000       736
70000       707
150000      631
1000000     521
250000      149
500000       48
Name: income, dtype: int64

steps:

  1. We filter out rows where income equals zero, count the number of users in each income category with .value_counts(), and sort the counts in descending order with .sort_values(ascending=False).

Analysis/Recommendations:

  1. Offer an optional range selection instead of an exact number to encourage more responses
  2. Provide personalized recommendations based on income brackets.
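Recommendation 1 (ranges instead of exact numbers) can be sketched with pd.cut; the bin edges and labels below are illustrative assumptions, and the DataFrame is a synthetic stand-in for the Bumble data.

```python
import pandas as pd

# Synthetic stand-in; -1 marks missing income as in the Bumble data
data = pd.DataFrame({"income": [20000, 80000, 150000, 1000000, -1]})
valid = data[data["income"] > 0]

# Bucket exact incomes into coarse, more answer-friendly brackets
brackets = pd.cut(
    valid["income"],
    bins=[0, 50000, 100000, 250000, float("inf")],
    labels=["<50k", "50k-100k", "100k-250k", "250k+"],
)
print(brackets.value_counts())
```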

2. How does income vary by age_group and gender? Are older users more likely to report higher incomes?

In [43]:
df_valid_income = data[data["income"] > 0]

avg_income_by_age_gender = df_valid_income.groupby(["age_group", "gender"])["income"].mean().unstack()
avg_income_by_age_gender
Out[43]:
gender f m
age_group
18-25 86066.350711 106618.773946
26-35 90398.126464 114944.801027
36-45 87302.977233 112680.608365
46+ 75299.760192 100156.626506

steps:

  1. Filter out invalid income data, removing users with income ≤ 0.
  2. Group users by age group and gender, compute the average income for each group, and use .unstack() to pivot gender into columns with age groups as rows.

Analysis/Recommendations:

  1. Average reported income peaks in the 26-35 group and declines for both genders in the older groups.
  2. The lowest averages occur in the 46+ group, possibly reflecting retirement or reduced working hours.
  3. Allow users to filter by income range & age group to improve match quality.
  4. Provide financial compatibility insights for users prioritizing income in matches.

Part 4: Data Visualization¶

1. Age Distribution¶

1. Plot a histogram of age with a vertical line indicating the mean age. What does the distribution reveal about the most common age group on the platform?

In [47]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot histogram of age distribution
plt.figure(figsize=(10, 6))
sns.histplot(data["age"], bins=30, kde=True, color="red")

# Add a vertical line for the mean age
mean_age = data["age"].mean()
plt.axvline(mean_age, color="blue", linestyle="dashed", linewidth=2, label=f"Mean Age: {mean_age:.1f}")

# Labels and title
plt.xlabel("Age")
plt.ylabel("User Count")
plt.title("Age Distribution of Users")
plt.legend()
plt.show()

steps:

  1. Import the visualization libraries (matplotlib.pyplot and seaborn) and set the figure size to ensure a clear, readable plot.
  2. Plot a histogram of the age column with seaborn.histplot(), using 30 bins to group ages into intervals, then calculate the mean age with .mean().
  3. Add a vertical dashed blue line at the mean age using plt.axvline(), set the x-axis (Age) and y-axis (User Count) labels, and add the title ("Age Distribution of Users").
  4. Display the final plot using plt.show().

Analysis/Recommendations:

  1. The most common ages fall between the mid-20s and early 30s.
  2. The mean age (blue dashed line) suggests the platform leans toward a younger user base.
  3. Focus marketing efforts on millennials and Gen Z since they form the majority.
  4. Consider campaigns that attract older demographics to balance the user base.

2. How does the age distribution differ by gender? Are there age groups where one gender is more prevalent?

In [49]:
import seaborn as sns


sns.set_style("whitegrid")

plt.figure(figsize=(10, 6))
sns.histplot(data, x="age", hue="gender", bins=30, kde=True, element="step", common_norm=False)


plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age Distribution by Gender")
plt.legend(title="Gender", labels=["Male", "Female"])  # caution: verify these labels match seaborn's hue order, or the legend may be mislabeled

plt.show()

steps:

  1. Set the plot style: use Seaborn's whitegrid style to enhance readability.
  2. Use histplot to plot the distribution of ages, separated by gender, applying kernel density estimation (KDE) for a smoothed overlay and 30 bins for granularity.
  3. Set common_norm=False so each gender's distribution is plotted on its own scale rather than normalized jointly.
  4. Label the x-axis (Age) and y-axis (Count), add a title, customize the legend so the genders are clearly identified, and display the plot.

Analysis/Recommendations:

  1. The distribution appears to be roughly normal for both genders, with a concentration around young adulthood (20s and 30s).
  2. There may be slight differences in the frequency of males and females at certain age ranges.
  3. Focus on the most common age groups to tailor recommendations.
  4. Adjust strategies based on the dominant gender in key age brackets.

2. Income and Age¶

1. Use a scatterplot to visualize the relationship between income and age, with a trend line indicating overall patterns. Are older users more likely to report higher incomes?

In [56]:
# Filter out unrealistic income values (-1 likely represents missing data)
df_filtered = data[data["income"] >= 0]

# Create a scatterplot with a trend line
plt.figure(figsize=(10, 6))
sns.regplot(data=df_filtered, x="age", y="income", line_kws={"color": "orange"})

# Labels and title
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Relationship Between Income and Age")

# Show plot
plt.show()

steps:

  1. Remove missing income values (the placeholder -1 indicates missing data).
  2. Plot age on the x-axis and income on the y-axis with sns.regplot().
  3. The fitted regression line (drawn in orange) highlights the overall pattern in the relationship.
  4. Then label the x-axis as "Age" and the y-axis as "Income," and add a clear, descriptive title.

Analysis/Recommendations:

  1. The scatterplot shows a wide range of incomes across different ages, with many users reporting lower incomes.
  2. The trend line (orange) suggests a slight upward trajectory, indicating that older users tend to report higher incomes on average.
  3. Instead of raw values, consider categorizing incomes into low, middle, and high brackets for clearer insights.
  4. Investigate how education level, job type, and location impact income trends.
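One way to quantify the "slight upward trajectory" the trend line suggests is a Pearson correlation between age and income. This is a hedged sketch on a tiny synthetic stand-in for the Bumble data, not the notebook's own computation.

```python
import pandas as pd

# Synthetic stand-in; -1 marks missing income as in the Bumble data
data = pd.DataFrame({
    "age": [22, 30, 40, 50],
    "income": [-1, 40000, 60000, 80000],
})

# Drop missing incomes, then compute the Pearson correlation coefficient
valid = data[data["income"] >= 0]
corr = valid["age"].corr(valid["income"])
print(round(corr, 3))
```

A value near 0 would indicate almost no linear relationship; the real dataset likely sits far below the 1.0 this toy example produces.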

2. Create boxplots of income grouped by age_group. Which age group reports the highest median income?

In [62]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=data[data['income'] > 0], x='age_group', y='income', showfliers=False)
plt.xlabel('Age Group', fontsize=14)
plt.ylabel('Income', fontsize=14)
plt.title('Distribution of Income by Age Group', fontsize=16)
plt.show()

steps:

  1. sns.boxplot creates a box plot showing the distribution of income for each age_group category.
  2. showfliers=False removes the outliers from the plot, so only the distribution of the main data points is shown.

Analysis/Recommendations:

  1. Targeted financial features: for younger users (18-25), tools for early-career budgeting, student-loan management, and savings planning may resonate.
  2. For mid-career users (26-45), where reported incomes peak, content around investment strategies (retirement planning, portfolio diversification) could be more relevant.
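The question asks which age group reports the highest median income, which the boxplot shows only visually. A direct computation could look like this sketch, using a synthetic stand-in DataFrame for illustration.

```python
import pandas as pd

# Synthetic stand-in for the Bumble data
data = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "26-35"],
    "income": [20000, 40000, 80000, 100000],
})

# Median income per age group (excluding non-reported incomes)
valid = data[data["income"] > 0]
median_income = valid.groupby("age_group")["income"].median()
top_group = median_income.idxmax()  # age group with the highest median
print(median_income)
print(top_group)
```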

3. Analyze income levels within gender and status categories. For example, are single men more likely to report higher incomes than single women?

In [72]:
import matplotlib.pyplot as plt
import seaborn as sns


palette = {"m": "blue", "f": "red"}
plt.figure(figsize=(12, 6))
sns.barplot(data=data, x="status", y="income", hue="gender", errorbar=None, palette=palette)


plt.xlabel("Status", fontsize=14)
plt.ylabel("Income", fontsize=14)
plt.title("Income by Gender and Status", fontsize=16)
plt.legend(title="Gender", fontsize=12)
plt.show()

steps:

  1. data=data → Uses the DataFrame data as the source of the plot.
  2. x="status" → Sets the x-axis to relationship status (e.g., Single, In a relationship).
  3. y="income" → Sets the y-axis to income values.
  4. hue="gender" → Differentiates bars by gender (e.g., Male vs. Female).

Analysis/Recommendations:

  1. Men report higher average incomes than women across most relationship statuses.
  2. Users in some statuses (e.g., married) report relatively high incomes, possibly reflecting age and career stage.
  3. Income variation exists within each group, suggesting other influencing factors such as education and industry.

3. Pets and Preferences¶

1. Create a bar chart showing the distribution of pets categories (e.g., likes dogs, likes cats). Which preferences are most common?

In [75]:
pet_counts = data['pets'].value_counts()

plt.figure(figsize=(12, 6))
sns.barplot(x=pet_counts.index, y=pet_counts.values)
plt.xticks(rotation=90, ha='right')
plt.title('Distribution of Pet Preferences')
plt.xlabel('Pet Preference')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

steps:

  1. Selected the pets column. Used value_counts() to count the occurrences of each unique pet preference category.
  2. Then,Create a Bar Plot: Initialized a figure with a specified size using plt.figure().Used sns.barplot() to create a bar chart with the counted pet preferences on the y-axis and categories on the x-axis.
  3. Then, Rotated the x-axis labels for better readability. Customize Plot Appearance and Added a title and axis labels.
  4. Used plt.tight_layout() to optimize the layout (avoiding overlapping elements).

Analysis/Recommendations:

  1. The most frequently reported pet preference is "likes dogs and likes cats" with approximately 24.7% of users indicating this, followed by "likes dogs" at about 12.1%.
  2. Some users distinguish between having pets and merely liking them, and fewer users report disliking either.
  3. Develop filters that allow users to choose combinations, such as both liking and having pets.
  4. Use this clustering of preferences to segment users. For example, those who prefer “likes dogs and likes cats” might receive different content or feature suggestions compared to those with more unique or limited preferences
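The percentages quoted in point 1 can be computed directly with value_counts(normalize=True) rather than read off the chart. A minimal sketch on a synthetic stand-in DataFrame:

```python
import pandas as pd

# Synthetic stand-in for the Bumble data
data = pd.DataFrame({
    "pets": ["likes dogs and likes cats", "likes dogs",
             "likes dogs and likes cats", "has cats"],
})

# Share of users in each pet-preference category, as percentages
pet_share = data["pets"].value_counts(normalize=True) * 100
print(pet_share.round(1))
```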

2. How do pets preferences vary across gender and age_group? Are younger users more likely to report liking pets compared to older users?

In [78]:
pet_distribution = data.groupby(['age_group', 'gender', 'pets']).size().reset_index(name='count')
pet_distribution

plt.figure(figsize=(12, 6))
sns.barplot(data=pet_distribution, x="age_group", y="count", hue="pets", errorbar=None)
# note: counts are averaged across gender here; facet by gender or sum to show totals

plt.xlabel("Age Group", fontsize=14)
plt.ylabel("Number of Users", fontsize=14)
plt.title("Pet Preferences by Age Group and Gender", fontsize=16)
plt.legend(title="Pet Preference", fontsize=8)
plt.show()

steps:

  1. Set the figure size using plt.figure(figsize=(12, 6)) for better visualization.
  2. Use sns.barplot() from Seaborn to create a bar plot
  3. x="age_group" to display different age groups on the x-axis. y="count" to show the number of users who own different pets.
  4. hue="pets" to categorize users based on their pet preferences.
  5. Label the x-axis as "Age Group" and the y-axis as "Number of Users" with font size 14, and set the title to "Pet Preferences by Age Group and Gender" with font size 16.

Analysis/Recommendations:

  1. Younger users (e.g., the 18-25 group) may show a higher affinity for both cats and dogs, whereas older groups may more often report having no pets or disliking them.
  2. Gender differences in pet preferences can also be observed; for instance, women may lean toward cats or smaller pets, while men may show a stronger preference for dogs.
  3. Pet brands can personalize advertisements based on age and gender
  4. Animal shelters can design campaigns that match pet types with the most interested age group.

4. Signs and Personality¶

1. Create a pie chart showing the distribution of zodiac signs (sign) across the platform. Which signs are most and least represented? Is this the right chart? If not, replace with right chart.

In [80]:
import matplotlib.pyplot as plt

zodiac_counts = data['sign'].value_counts()

plt.figure(figsize=(10, 6))
plt.pie(zodiac_counts, labels=zodiac_counts.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired.colors)

plt.title("Distribution of Zodiac Signs Among Users", fontsize=14)
plt.show()

steps:

  1. data['sign'].value_counts() calculates the frequency of each zodiac sign in the dataset.
  2. zodiac_counts as the values. labels=zodiac_counts.index to label each segment with its zodiac sign.
  3. autopct='%1.1f%%' to display percentages with one decimal point; startangle=140 to rotate the chart for better visualization.
  4. colors=plt.cm.Paired.colors for distinct colors from the Paired colormap

Analysis/Recommendations:

  1. If the data is skewed toward specific zodiac signs, consider collecting more data to balance representation.
  2. Businesses can use the insights to tailor marketing strategies based on the most represented zodiac signs.

As the chart above shows, a pie chart with twelve similar-sized slices is hard to read, so a bar chart was selected for better comparison.

In [84]:
plt.figure(figsize=(12, 8))
sns.barplot(x=zodiac_counts.index, y=zodiac_counts.values, palette="viridis")

plt.xlabel("Zodiac Sign", fontsize=14)
plt.ylabel("Number of Users", fontsize=14)
plt.title("Zodiac Sign Distribution Among Users", fontsize=16)
plt.xticks(rotation=45)
plt.show()

steps:

  1. plt.figure(figsize=(12, 8)): Creates a figure with a width of 12 inches and a height of 8 inches.
  2. Then, x=zodiac_counts.index: Zodiac signs as the x-axis labels.y=zodiac_counts.values: The count of each zodiac sign as the y-axis values
  3. plt.xticks(rotation=45): Rotates x-axis labels by 45 degrees for better readability.
  4. plt.show(): Renders and displays the bar chart.

Analysis/Recommendations:

  1. The tallest bars indicate the most common zodiac signs among users, and the shortest bars the least common.
  2. Businesses can use the most frequent zodiac signs to tailor products, services, or marketing strategies.
  3. Analyze correlations between zodiac signs and user behavior, preferences, or demographic
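The most and least represented signs can also be read off programmatically with idxmax/idxmin on the value counts. A minimal sketch on a synthetic stand-in DataFrame:

```python
import pandas as pd

# Synthetic stand-in for the Bumble data
data = pd.DataFrame({
    "sign": ["gemini", "cancer", "gemini", "pisces", "gemini", "cancer"],
})

zodiac_counts = data["sign"].value_counts()
most_common = zodiac_counts.idxmax()   # sign with the highest count
least_common = zodiac_counts.idxmin()  # sign with the lowest count
print(most_common, least_common)
```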

2. How does sign vary across gender and status? Are there noticeable patterns or imbalances?

In [87]:
zodiac_distribution = data.groupby(['sign', 'gender', 'status']).size().reset_index(name='count')

plt.figure(figsize=(14, 7))
sns.barplot(data=zodiac_distribution, x='sign', y='count', hue='gender', errorbar=None, palette='pastel')
# note: counts are averaged across status categories here; use estimator=sum for totals

plt.xlabel("Zodiac Sign", fontsize=14)
plt.ylabel("Number of Users", fontsize=14)
plt.title("Zodiac Sign Distribution Across Gender", fontsize=16)
plt.xticks(rotation=90)
plt.legend(title="Gender")

plt.show()

Steps:

  1. Groups data by zodiac sign, gender, and status.Computes the count of each unique combination. Then, Converts the grouped data into a DataFrame with a new column, 'count'.
  2. plt.figure(figsize=(14, 7)): Creates a figure with a width of 14 inches and a height of 7 inches.
  3. plt.xticks(rotation=90): Rotates x-axis labels by 90 degrees for readability and plt.legend(title="Gender"): Adds a legend with the title "Gender".

Analysis/Recommendations:

  1. Each zodiac sign has two or more bars depending on the number of genders in the dataset.
  2. The height of each bar indicates the number of users with that zodiac sign and gender.
  3. A significant gender difference for certain zodiac signs may indicate demographic trends or preferences.
  4. If a specific gender dominates certain zodiac signs, businesses can tailor content, products, or marketing strategies accordingly.
In [89]:
sign_across_status = data.groupby(['sign', 'status']).size().reset_index(name='count')


plt.figure(figsize=(14,8))
sns.barplot(data=sign_across_status, x='sign', y='count', hue='status')

plt.title('Distribution of Zodiac Signs across Status', fontsize=14)
plt.xlabel('Zodiac Sign', fontsize=14)
plt.ylabel('Number of Users', fontsize=14)
plt.xticks(rotation=90)  
plt.grid(True, axis= 'y', linestyle = 'dashed', alpha = 0.5)
plt.show()

steps:

  1. Group the data by sign and status, compute the count of each combination, and store the result in sign_across_status.
  2. The groupby function categorizes the data based on the two columns (sign and status).
  3. size() computes the count for each group, and reset_index(name='count') converts the grouped result into a structured DataFrame.
  4. plt.xticks(rotation=90) rotates x-axis labels for better readability.
  5. Dashed horizontal grid lines (plt.grid(True, axis='y', linestyle='dashed', alpha=0.5)) improve clarity.

Analysis/Recommendations:

  1. Some Zodiac signs might have significantly more users in certain status categories.
  2. The distribution might show biases or trends (e.g., certain signs having more active users or a specific relationship status).
  3. Check for external factors like cultural influences or demographics.
  4. Investigate why some signs have different distributions.
In [92]:
zodiac_pivot = data.groupby(['sign', 'status']).size().unstack().fillna(0)
zodiac_pivot.plot(kind='bar', stacked=True, figsize=(14, 7), colormap="viridis")

plt.xlabel("Zodiac Sign", fontsize=14)
plt.ylabel("Number of Users", fontsize=14)
plt.title("Zodiac Sign Distribution Across Relationship Status", fontsize=16)
plt.xticks(rotation=90)
plt.legend(title="Relationship Status")
plt.show()

steps:

  1. We first group by the two columns sign and status.
  2. size() calculates the count of users for each combination.unstack() reshapes the data so that status categories become separate columns.fillna(0) replaces missing values (if any) with zero.

Analysis/Recommendations:

  1. Some Zodiac signs may have more users in a particular status than others.
  2. It visually shows how many users fall into each status category for every sign.
  3. Identify which Zodiac signs have a higher single status and target them with matchmaking services.
  4. Personalize recommendations based on sign-related preferences.
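Recommendation 3 (finding signs with a higher share of single users) can be sketched with a row-normalized crosstab instead of raw counts; the DataFrame below is a synthetic stand-in for the Bumble data.

```python
import pandas as pd

# Synthetic stand-in for the Bumble data
data = pd.DataFrame({
    "sign": ["gemini", "gemini", "cancer", "cancer"],
    "status": ["single", "available", "single", "single"],
})

# Share of each status within each sign (rows sum to 1.0)
status_share = pd.crosstab(data["sign"], data["status"], normalize="index")
print(status_share)
```

Sorting the "single" column of this table would rank signs by their proportion of single users, which is more comparable across signs than absolute counts.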