
Understanding Bumble Users: Data-Backed Insights¶

Introduction¶

This project analyzes user data from Bumble, a well-known dating app, to uncover insights about its users. By exploring the information users share in their profiles, such as demographics, lifestyle habits, and preferences, we aim to better understand their behavior and identify trends. These insights can help Bumble’s product and marketing teams make informed decisions to improve user engagement, fine-tune the matchmaking algorithms, and offer personalized features. Ultimately, the goal is to create a more satisfying experience for Bumble users and support the platform's growth.

Goal¶

Analyze the Bumble dataset to answer key business and user-behavior questions.

Columns Overview:¶

Demographics: age, status, gender

Physical Attributes: body_type, height

Lifestyle Preferences: diet, drinks

Educational & Financial Details: education, income

Location & Activity: location, last_online

Additional Details: pets, religion, sign, speaks

Dataset Summary¶

Name of Dataset: Bumble User Profiles

Number of Rows: 59,946

Number of Columns: 17

Numerical Columns: age, height, income

Categorical Columns: status, gender, body_type, diet, drinks, education, ethnicity, job, last_online, location, pets, religion, sign, speaks

Description: This dataset contains user-generated profiles from Bumble, a dating platform. It includes demographic details, physical attributes, lifestyle choices, education, income, and activity status. The data helps analyze user behavior, preferences, and matchmaking trends.

Columns Information¶

age: User’s age (numeric)

status: Relationship status (e.g., single, married, seeing someone)

gender: Gender of the user (e.g., m, f)

body_type: Description of physical appearance (e.g., athletic, curvy, thin)

diet: Dietary preference (e.g., vegetarian, vegan, anything)

drinks: Drinking habits (e.g., socially, often)

education: Level of education (e.g., college, masters)

ethnicity: Ethnic background of the user

height: Height of the user (in inches, numeric)

income: User-reported annual income (numeric)

job: User’s occupation or field of work

last_online: Last active timestamp (to analyze user activity trends)

location: City and state where the user resides

pets: User’s preference or ownership of pets (e.g., likes dogs, has cats)

religion: Religious beliefs of the user

sign: User’s zodiac sign (e.g., Aries, Leo)

speaks: Languages spoken by the user

Key Metrics to Focus On¶

Demographic Analysis: Understanding user distribution by gender, location, and relationship status.

Behavioral Insights: Examining drinking, diet, and pet preferences to identify compatibility trends.

Income and Age Trends: Correlating income levels with age and education.

Activity Patterns: Analyzing when users were last online to gauge engagement.

Regional Trends: Identifying top cities with active users and their characteristics.

Importing the libraries¶

In [2]:
# Importing necessary libraries
import pandas as pd                 # For data manipulation and analysis
import numpy as np                  # For numerical computations
import seaborn as sns               # For advanced data visualization
import matplotlib.pyplot as plt     # For creating static and interactive plots

Loading the Dataset¶

In [3]:
# Step 2: Import necessary libraries
import gdown
import pandas as pd

# Step 3: Define the file ID and create the download URL
file_id = "1v2_R5eIcgmv02XjhKfj1581oAhzZdLnU"
download_url = f"https://drive.google.com/uc?id={file_id}"

# Step 4: Set the output file name
output_file = "spinny_dataset.csv"

# Step 5: Download the file
gdown.download(download_url, output_file, quiet=False)

# Step 6: Load the CSV file into a Pandas DataFrame
data = pd.read_csv(output_file)
Downloading...
From: https://drive.google.com/uc?id=1v2_R5eIcgmv02XjhKfj1581oAhzZdLnU
To: /content/spinny_dataset.csv
100%|██████████| 13.7M/13.7M [00:00<00:00, 64.5MB/s]

Display the first 5 rows

In [4]:
data.head(5)
Out[4]:
age status gender body_type diet drinks education ethnicity height income job last_online location pets religion sign speaks
0 22 single m a little extra strictly anything socially working on college/university asian, white 75.0 -1 transportation 2012-06-28-20-30 south san francisco, california likes dogs and likes cats agnosticism and very serious about it gemini english
1 35 single m average mostly other often working on space camp white 70.0 80000 hospitality / travel 2012-06-29-21-41 oakland, california likes dogs and likes cats agnosticism but not too serious about it cancer english (fluently), spanish (poorly), french (...
2 38 available m thin anything socially graduated from masters program NaN 68.0 -1 NaN 2012-06-27-09-10 san francisco, california has cats NaN pisces but it doesn&rsquo;t matter english, french, c++
3 23 single m thin vegetarian socially working on college/university white 71.0 20000 student 2012-06-28-14-22 berkeley, california likes cats NaN pisces english, german (poorly)
4 29 single m athletic NaN socially graduated from college/university asian, black, other 66.0 -1 artistic / musical / writer 2012-06-27-21-26 san francisco, california likes dogs and likes cats NaN aquarius english

Part 1: Data Cleaning¶

1. Inspecting Missing Data¶

1.Which columns in the dataset have missing values, and what percentage of data is missing in each column?

In [5]:
data = pd.read_csv(output_file)
In [7]:
missing_percentage = data.isnull().mean() * 100
In [8]:
missing_percentage
Out[8]:
0
age 0.000000
status 0.000000
gender 0.000000
body_type 8.834618
diet 40.694959
drinks 4.979482
education 11.056618
ethnicity 9.475194
height 0.005005
income 0.000000
job 13.675641
last_online 0.000000
location 0.000000
pets 33.231575
religion 33.740366
sign 18.443266
speaks 0.083408

  • Conclusion: The columns with missing values are body_type (8.8%), diet (40.7%), drinks (5.0%), education (11.1%), ethnicity (9.5%), job (13.7%), pets (33.2%), religion (33.7%), and sign (18.4%). height and speaks are missing in under 0.1% of rows, and the remaining columns are complete.

2.Are there columns where more than 50% of the data is missing? Drop those columns where missing values are >50%

In [10]:
data = data.dropna(thresh=len(data) * 0.5, axis=1)
In [11]:
data
Out[11]:
age status gender body_type diet drinks education ethnicity height income job last_online location pets religion sign speaks
0 22 single m a little extra strictly anything socially working on college/university asian, white 75.0 -1 transportation 2012-06-28-20-30 south san francisco, california likes dogs and likes cats agnosticism and very serious about it gemini english
1 35 single m average mostly other often working on space camp white 70.0 80000 hospitality / travel 2012-06-29-21-41 oakland, california likes dogs and likes cats agnosticism but not too serious about it cancer english (fluently), spanish (poorly), french (...
2 38 available m thin anything socially graduated from masters program NaN 68.0 -1 NaN 2012-06-27-09-10 san francisco, california has cats NaN pisces but it doesn&rsquo;t matter english, french, c++
3 23 single m thin vegetarian socially working on college/university white 71.0 20000 student 2012-06-28-14-22 berkeley, california likes cats NaN pisces english, german (poorly)
4 29 single m athletic NaN socially graduated from college/university asian, black, other 66.0 -1 artistic / musical / writer 2012-06-27-21-26 san francisco, california likes dogs and likes cats NaN aquarius english
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
59941 59 single f NaN NaN socially graduated from college/university NaN 62.0 -1 sales / marketing / biz dev 2012-06-12-21-47 oakland, california has dogs catholicism but not too serious about it cancer and it&rsquo;s fun to think about english
59942 24 single m fit mostly anything often working on college/university white, other 72.0 -1 entertainment / media 2012-06-29-11-01 san francisco, california likes dogs and likes cats agnosticism leo but it doesn&rsquo;t matter english (fluently)
59943 42 single m average mostly anything not at all graduated from masters program asian 71.0 100000 construction / craftsmanship 2012-06-27-23-37 south san francisco, california NaN christianity but not too serious about it sagittarius but it doesn&rsquo;t matter english (fluently)
59944 27 single m athletic mostly anything socially working on college/university asian, black 73.0 -1 medicine / health 2012-06-23-13-01 san francisco, california likes dogs and likes cats agnosticism but not too serious about it leo and it&rsquo;s fun to think about english (fluently), spanish (poorly), chinese ...
59945 39 single m average NaN socially graduated from masters program white 68.0 -1 medicine / health 2012-06-29-00-42 san francisco, california likes dogs and likes cats catholicism and laughing about it gemini and it&rsquo;s fun to think about english

59946 rows × 17 columns

  • Conclusion: No column has more than 50% missing values (the highest is diet at about 40.7%), so nothing is dropped and all 17 columns are retained.

Copy of dataset 'df'

In [12]:
df = data.copy()

# Confirmation by printing the first few rows of the copied dataset
print("Copy of the dataset has been created with the name 'df'")
Copy of the dataset has been created with the name 'df'

3.Missing numerical data (e.g., height, income) should be handled by imputing the median value of height and income for the corresponding category, such as gender, age group, or location. This ensures that the imputed values are contextually relevant and reduce potential biases in the analysis.

In [14]:
for column in ['height', 'income']:
    df[column] = df[column].replace(-1, np.nan)
    df[column] = df.groupby(['gender'])[column].transform(lambda x: x.fillna(x.median()))

print(df)
       age     status gender       body_type               diet      drinks  \
0       22     single      m  a little extra  strictly anything    socially   
1       35     single      m         average       mostly other       often   
2       38  available      m            thin           anything    socially   
3       23     single      m            thin         vegetarian    socially   
4       29     single      m        athletic                NaN    socially   
...    ...        ...    ...             ...                ...         ...   
59941   59     single      f             NaN                NaN    socially   
59942   24     single      m             fit    mostly anything       often   
59943   42     single      m         average    mostly anything  not at all   
59944   27     single      m        athletic    mostly anything    socially   
59945   39     single      m         average                NaN    socially   

                               education            ethnicity  height  \
0          working on college/university         asian, white    75.0   
1                  working on space camp                white    70.0   
2         graduated from masters program                  NaN    68.0   
3          working on college/university                white    71.0   
4      graduated from college/university  asian, black, other    66.0   
...                                  ...                  ...     ...   
59941  graduated from college/university                  NaN    62.0   
59942      working on college/university         white, other    72.0   
59943     graduated from masters program                asian    71.0   
59944      working on college/university         asian, black    73.0   
59945     graduated from masters program                white    68.0   

         income                           job       last_online  \
0       60000.0                transportation  2012-06-28-20-30   
1       80000.0          hospitality / travel  2012-06-29-21-41   
2       60000.0                           NaN  2012-06-27-09-10   
3       20000.0                       student  2012-06-28-14-22   
4       60000.0   artistic / musical / writer  2012-06-27-21-26   
...         ...                           ...               ...   
59941   40000.0   sales / marketing / biz dev  2012-06-12-21-47   
59942   60000.0         entertainment / media  2012-06-29-11-01   
59943  100000.0  construction / craftsmanship  2012-06-27-23-37   
59944   60000.0             medicine / health  2012-06-23-13-01   
59945   60000.0             medicine / health  2012-06-29-00-42   

                              location                       pets  \
0      south san francisco, california  likes dogs and likes cats   
1                  oakland, california  likes dogs and likes cats   
2            san francisco, california                   has cats   
3                 berkeley, california                 likes cats   
4            san francisco, california  likes dogs and likes cats   
...                                ...                        ...   
59941              oakland, california                   has dogs   
59942        san francisco, california  likes dogs and likes cats   
59943  south san francisco, california                        NaN   
59944        san francisco, california  likes dogs and likes cats   
59945        san francisco, california  likes dogs and likes cats   

                                        religion  \
0          agnosticism and very serious about it   
1       agnosticism but not too serious about it   
2                                            NaN   
3                                            NaN   
4                                            NaN   
...                                          ...   
59941   catholicism but not too serious about it   
59942                                agnosticism   
59943  christianity but not too serious about it   
59944   agnosticism but not too serious about it   
59945          catholicism and laughing about it   

                                           sign  \
0                                        gemini   
1                                        cancer   
2            pisces but it doesn&rsquo;t matter   
3                                        pisces   
4                                      aquarius   
...                                         ...   
59941  cancer and it&rsquo;s fun to think about   
59942           leo but it doesn&rsquo;t matter   
59943   sagittarius but it doesn&rsquo;t matter   
59944     leo and it&rsquo;s fun to think about   
59945  gemini and it&rsquo;s fun to think about   

                                                  speaks  
0                                                english  
1      english (fluently), spanish (poorly), french (...  
2                                   english, french, c++  
3                               english, german (poorly)  
4                                                english  
...                                                  ...  
59941                                            english  
59942                                 english (fluently)  
59943                                 english (fluently)  
59944  english (fluently), spanish (poorly), chinese ...  
59945                                            english  

[59946 rows x 17 columns]

Conclusion: The -1 placeholders in height and income are first converted to NaN with replace(-1, np.nan); each missing value is then filled with the median of the user's gender group via df.groupby(['gender'])[column].transform(lambda x: x.fillna(x.median())).
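
The brief allows imputing within categories such as gender, age group, or location, while the cell above groups by gender only. Below is a minimal sketch of a finer-grained variant, assuming a temporary helper column age_bucket (not part of the original dataset) created just for this step:

# Sketch: impute height/income medians within gender + age-bucket groups.
# age_bucket is a hypothetical helper column used only for imputation.
df['age_bucket'] = pd.cut(df['age'], bins=[17, 25, 35, 45, 120],
                          labels=['18-25', '26-35', '36-45', '46+'])

for column in ['height', 'income']:
    df[column] = df[column].replace(-1, np.nan)
    df[column] = (df.groupby(['gender', 'age_bucket'], observed=True)[column]
                    .transform(lambda x: x.fillna(x.median())))
    # fall back to the overall median for any group that is entirely missing
    df[column] = df[column].fillna(df[column].median())

df = df.drop(columns='age_bucket')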

2. Data Types¶

Accurate data types are critical for meaningful analysis and visualization. For example, numeric fields like income or height must be stored as numbers for statistical computations, while dates like last_online must be converted to datetime format for time-based calculations.

In [15]:
#df['last_online'] = pd.to_datetime(df['last_online'])
df['last_online'] = pd.to_datetime(df['last_online'], format='%Y-%m-%d-%H-%M')
In [16]:
print(type(df['last_online'][0]))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
In [17]:
print(type(df['income'][0]))
<class 'numpy.float64'>
In [18]:
print(type(df['height'][0]))
<class 'numpy.float64'>

1.Are there any inconsistencies in the data types across columns (e.g., numerical data stored as strings)?

In [25]:
print(type(df['age'][0]))
print(type(df['status'][0]))
print(type(df['gender'][0]))
print(type(df['body_type'][0]))
print(type(df['diet'][0]))
print(type(df['drinks'][0]))
print(type(df['education'][0]))
print(type(df['ethnicity'][0]))
print(type(df['height'][0]))
print(type(df['income'][0]))
print(type(df['job'][0]))
print(type(df['location'][0]))
print(type(df['pets'][0]))
print(type(df['religion'][0]))
print(type(df['sign'][0]))
print(type(df['speaks'][0]))
<class 'numpy.int64'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>

Conclusion: No numerical data is stored as strings: age is int64, height and income are float64, and the remaining string columns are genuinely categorical.

2.Which columns require conversion to numerical data types for proper analysis (e.g., income)?

In [26]:
print("Column Data Types:")
print(df.dtypes)
Column Data Types:
age                     int64
status                 object
gender                 object
body_type              object
diet                   object
drinks                 object
education              object
ethnicity              object
height                float64
income                float64
job                    object
last_online    datetime64[ns]
location               object
pets                   object
religion               object
sign                   object
speaks                 object
dtype: object
In [27]:
numeric_columns = ['income', 'height']
for col in numeric_columns:
    if df[col].dtype == 'object':  # Check if the column is not already numeric
        print(f"Column '{col}' requires conversion to numerical data type.")
for col in numeric_columns:
    if df[col].dtype == 'object':
        df[col] = pd.to_numeric(df[col], errors='coerce')  # Convert to numeric, set invalid values to NaN
In [28]:
df[col]
Out[28]:
height
0 75.0
1 70.0
2 68.0
3 71.0
4 66.0
... ...
59941 62.0
59942 72.0
59943 71.0
59944 73.0
59945 68.0

59946 rows × 1 columns


In [29]:
print("\nAfter Conversion:")
print(df.dtypes)
After Conversion:
age                     int64
status                 object
gender                 object
body_type              object
diet                   object
drinks                 object
education              object
ethnicity              object
height                float64
income                float64
job                    object
last_online    datetime64[ns]
location               object
pets                   object
religion               object
sign                   object
speaks                 object
dtype: object

Conclusion: age, height, and income need to be numeric for proper analysis, and they already are (int64 and float64 respectively), so no further conversion is required.

3.Does the last_online column need to be converted into a datetime format? What additional insights can be gained by analyzing this as a date field?

In [31]:
df2 = pd.read_csv(output_file)
In [32]:
df2['last_online'] = pd.to_datetime(df2['last_online'], format='%Y-%m-%d-%H-%M')
In [33]:
df2['last_online']
Out[33]:
last_online
0 2012-06-28 20:30:00
1 2012-06-29 21:41:00
2 2012-06-27 09:10:00
3 2012-06-28 14:22:00
4 2012-06-27 21:26:00
... ...
59941 2012-06-12 21:47:00
59942 2012-06-29 11:01:00
59943 2012-06-27 23:37:00
59944 2012-06-23 13:01:00
59945 2012-06-29 00:42:00

59946 rows × 1 columns


In [34]:
# Calculate days since last online
df2['days_since_last_online'] = (pd.Timestamp.now() - df2['last_online']).dt.days
df2['days_since_last_online']
Out[34]:
days_since_last_online
0 4657
1 4656
2 4659
3 4658
4 4658
... ...
59941 4673
59942 4657
59943 4658
59944 4663
59945 4657

59946 rows × 1 columns


In [35]:
# Categorize users based on activity
def categorize_activity(days):
    if days <= 7:
        return 'Active'
    elif days <= 30:
        return 'Recently Active'
    else:
        return 'Inactive'

df2['activity_status'] = df2['days_since_last_online'].apply(categorize_activity)
df2['activity_status']
Out[35]:
activity_status
0 Inactive
1 Inactive
2 Inactive
3 Inactive
4 Inactive
... ...
59941 Inactive
59942 Inactive
59943 Inactive
59944 Inactive
59945 Inactive

59946 rows × 1 columns


Conclusion: last_online is stored as a string, so it is parsed into a Timestamp with pd.to_datetime. From it we derive days_since_last_online and an activity_status category (Active / Recently Active / Inactive); because the data dates from 2012, every user falls into the 'Inactive' bucket relative to today's date.
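
Treating last_online as a datetime also makes time-of-day and day-of-week patterns easy to extract; a small sketch reusing df2 from above:

# Sketch: when during the day / week were users last active?
df2['last_online_hour'] = df2['last_online'].dt.hour
df2['last_online_day'] = df2['last_online'].dt.day_name()

print(df2['last_online_hour'].value_counts().sort_index())   # activity by hour of day
print(df2['last_online_day'].value_counts())                 # activity by day of week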

3. Outliers¶

1.Are there any apparent outliers in numerical columns such as age, height, or income? What are the ranges of values in these columns?

In [36]:
# 1. View Summary Statistics
print("Summary Statistics:")
print(df.describe())
Summary Statistics:
                age        height          income  \
count  59946.000000  59946.000000    59946.000000   
mean      32.340290     68.295282    61512.027491   
min       18.000000      1.000000    20000.000000   
25%       26.000000     66.000000    40000.000000   
50%       30.000000     68.000000    60000.000000   
75%       37.000000     71.000000    60000.000000   
max      110.000000     95.000000  1000000.000000   
std        9.452779      3.994738    91116.607782   

                         last_online  
count                          59946  
mean   2012-05-22 06:43:35.300770560  
min              2011-06-27 01:52:00  
25%              2012-05-29 20:37:15  
50%              2012-06-27 14:30:00  
75%              2012-06-30 01:09:00  
max              2012-07-01 08:57:00  
std                              NaN  
In [37]:
import pandas as pd
import matplotlib.pyplot as plt

# Create the boxplot
df2[['age', 'height', 'income']].boxplot(figsize=(10, 5))
plt.title("Boxplot of Age, Height, and Income")
plt.show()
[Figure: boxplot of age, height, and income]
In [38]:
df = pd.read_csv(output_file)
In [39]:
df
Out[39]:
age status gender body_type diet drinks education ethnicity height income job last_online location pets religion sign speaks
0 22 single m a little extra strictly anything socially working on college/university asian, white 75.0 -1 transportation 2012-06-28-20-30 south san francisco, california likes dogs and likes cats agnosticism and very serious about it gemini english
1 35 single m average mostly other often working on space camp white 70.0 80000 hospitality / travel 2012-06-29-21-41 oakland, california likes dogs and likes cats agnosticism but not too serious about it cancer english (fluently), spanish (poorly), french (...
2 38 available m thin anything socially graduated from masters program NaN 68.0 -1 NaN 2012-06-27-09-10 san francisco, california has cats NaN pisces but it doesn&rsquo;t matter english, french, c++
3 23 single m thin vegetarian socially working on college/university white 71.0 20000 student 2012-06-28-14-22 berkeley, california likes cats NaN pisces english, german (poorly)
4 29 single m athletic NaN socially graduated from college/university asian, black, other 66.0 -1 artistic / musical / writer 2012-06-27-21-26 san francisco, california likes dogs and likes cats NaN aquarius english
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
59941 59 single f NaN NaN socially graduated from college/university NaN 62.0 -1 sales / marketing / biz dev 2012-06-12-21-47 oakland, california has dogs catholicism but not too serious about it cancer and it&rsquo;s fun to think about english
59942 24 single m fit mostly anything often working on college/university white, other 72.0 -1 entertainment / media 2012-06-29-11-01 san francisco, california likes dogs and likes cats agnosticism leo but it doesn&rsquo;t matter english (fluently)
59943 42 single m average mostly anything not at all graduated from masters program asian 71.0 100000 construction / craftsmanship 2012-06-27-23-37 south san francisco, california NaN christianity but not too serious about it sagittarius but it doesn&rsquo;t matter english (fluently)
59944 27 single m athletic mostly anything socially working on college/university asian, black 73.0 -1 medicine / health 2012-06-23-13-01 san francisco, california likes dogs and likes cats agnosticism but not too serious about it leo and it&rsquo;s fun to think about english (fluently), spanish (poorly), chinese ...
59945 39 single m average NaN socially graduated from masters program white 68.0 -1 medicine / health 2012-06-29-00-42 san francisco, california likes dogs and likes cats catholicism and laughing about it gemini and it&rsquo;s fun to think about english

59946 rows × 17 columns

In [40]:
# Function to find outliers using IQR for selected columns
def find_outliers_iqr(df, selected_columns):
    outliers = {}
    for col in selected_columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers[col] = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
    return outliers

selected_columns = ["age", "height", "income"]
outliers_iqr = find_outliers_iqr(df, selected_columns)
print("Outliers using IQR for selected columns:", outliers_iqr)
Outliers using IQR for selected columns: {'age': 117      55
141      59
158      54
172      62
210      59
         ..
59740    61
59746    55
59794    55
59837    66
59941    59
Name: age, Length: 2638, dtype: int64, 'height': 189      79.0
243      80.0
280      80.0
402      91.0
433      79.0
         ... 
58147    94.0
58286    84.0
59038    79.0
59067    79.0
59697    79.0
Name: height, Length: 285, dtype: float64, 'income': 1         80000
3         20000
11        40000
13        30000
14        50000
          ...  
59917    100000
59927     50000
59930     70000
59934     80000
59943    100000
Name: income, Length: 11504, dtype: int64}
In [41]:
ranges = df[["age", "height", "income"]].agg(["min", "max"])
print(ranges)
     age  height   income
min   18     1.0       -1
max  110    95.0  1000000

Conclusion: The summary statistics, boxplots, and IQR check all point to outliers: age ranges from 18 to 110, height from 1 to 95 inches, and income from -1 (the missing-value placeholder) to 1,000,000. The IQR method flags 2,638 age values, 285 height values, and 11,504 income values as outliers.

2.Any -1 values in numerical columns like income should be replaced with 0, as they may represent missing or invalid data.

In [42]:
# Replace -1 with 0 in the 'income' column
df['income'] = df['income'].replace(-1, 0)

# For multiple numerical columns
numerical_columns = ['age', 'height', 'income']  # Specify the columns to check
df[numerical_columns] = df[numerical_columns].replace(-1, 0)
df[numerical_columns]
Out[42]:
age height income
0 22 75.0 0
1 35 70.0 80000
2 38 68.0 0
3 23 71.0 20000
4 29 66.0 0
... ... ... ...
59941 59 62.0 0
59942 24 72.0 0
59943 42 71.0 100000
59944 27 73.0 0
59945 39 68.0 0

59946 rows × 3 columns

In [43]:
# Replace -1 with 0 in all numeric columns
numeric_cols = df.select_dtypes(include=['number']).columns
df[numeric_cols] = df[numeric_cols].replace(-1, 0)
df[numeric_cols]
Out[43]:
age height income
0 22 75.0 0
1 35 70.0 80000
2 38 68.0 0
3 23 71.0 20000
4 29 66.0 0
... ... ... ...
59941 59 62.0 0
59942 24 72.0 0
59943 42 71.0 100000
59944 27 73.0 0
59945 39 68.0 0

59946 rows × 3 columns

Conclusion: All -1 values in the numeric columns (age, height, income) are replaced with 0, since -1 is a placeholder for missing or invalid data rather than a real value.

3.For other outliers, rather than deleting them, calculate the mean and median values using only the middle 80% of the data (removing extreme high and low values). This approach ensures that outliers do not disproportionately impact the analysis while retaining as much meaningful data as possible.

In [44]:
def calculate_trimmed_mean_median(column):
    # Calculate 10th and 90th percentiles
    lower_bound = column.quantile(0.1)
    upper_bound = column.quantile(0.9)

    # Filter data to keep only middle 80%
    trimmed_data = column[(column >= lower_bound) & (column <= upper_bound)]

    # Calculate mean and median of trimmed data
    mean_value = trimmed_data.mean()
    median_value = trimmed_data.median()

    return mean_value, median_value

# Numerical columns to analyze
numerical_columns = ['age', 'height', 'income']

# Calculate trimmed mean and median for each column
trimmed_stats = {}
for col in numerical_columns:
    mean, median = calculate_trimmed_mean_median(df[col])
    trimmed_stats[col] = {'mean': mean, 'median': median}

# Print results
for col, stats in trimmed_stats.items():
    print(f"{col}:")
    print(f"  Trimmed Mean: {stats['mean']}")
    print(f"  Trimmed Median: {stats['median']}")
    print()
age:
  Trimmed Mean: 31.090303239005813
  Trimmed Median: 30.0

height:
  Trimmed Mean: 68.2309102137578
  Trimmed Median: 68.0

income:
  Trimmed Mean: 3297.01223769799
  Trimmed Median: 0.0

Conclusion: For each numerical column, the 10th and 90th percentiles define a lower and upper bound; only values inside those bounds (the middle 80% of the data) are kept, and the mean and median are computed on that trimmed subset. This prevents extreme values from dominating the statistics while retaining most of the data. Note that the trimmed income statistics are computed after -1 was replaced with 0, which is why the trimmed median income is 0.
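
For comparison, SciPy ships a trimmed mean that cuts a fixed 10% of observations from each tail, which is close to (though not identical to) the percentile filter above. A sketch assuming SciPy is installed:

# Sketch: trimmed mean via scipy.stats.trim_mean (10% removed from each tail)
from scipy import stats

for col in ['age', 'height', 'income']:
    values = df[col].dropna()                      # trim_mean does not handle NaN
    print(f"{col}: trimmed mean = {stats.trim_mean(values, proportiontocut=0.1):.2f}")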

4. Missing Data Visualization¶

1.Create a heatmap to visualize missing values across the dataset. Which columns show consistent missing data patterns?

In [45]:
%pip install seaborn
Requirement already satisfied: seaborn in /usr/local/lib/python3.11/dist-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in /usr/local/lib/python3.11/dist-packages (from seaborn) (2.0.2)
Requirement already satisfied: pandas>=1.2 in /usr/local/lib/python3.11/dist-packages (from seaborn) (2.2.2)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /usr/local/lib/python3.11/dist-packages (from seaborn) (3.10.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.56.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.8)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.2)
Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (11.1.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.2.1)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.11/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2->seaborn) (2025.1)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2->seaborn) (2025.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.17.0)
In [47]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.read_csv(output_file)
In [48]:
# 1. Check for missing values
missing_values = df.isnull().sum()
missing_values
Out[48]:
0
age 0
status 0
gender 0
body_type 5296
diet 24395
drinks 2985
education 6628
ethnicity 5680
height 3
income 0
job 8198
last_online 0
location 0
pets 19921
religion 20226
sign 11056
speaks 50

In [49]:
#  Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap="viridis", yticklabels=False)
plt.title("Heatmap of Missing Values")
plt.show()
[Figure: heatmap of missing values]
In [50]:
sorted_missing = missing_values[missing_values > 0].sort_values(ascending=False)
print("Columns with Missing Values:\n", sorted_missing)
Columns with Missing Values:
 diet         24395
religion     20226
pets         19921
sign         11056
job           8198
education     6628
ethnicity     5680
body_type     5296
drinks        2985
speaks          50
height           3
dtype: int64

Conclusion: The per-column missing-value counts are computed, visualized as a heatmap, and then sorted. diet (24,395), religion (20,226), pets (19,921), and sign (11,056) have by far the most missing entries, while speaks and height are almost complete.
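
To make the "consistent patterns" question concrete, the missing-value indicators themselves can be correlated: column pairs that tend to be blank in the same rows show up with high correlations. A minimal sketch:

# Sketch: correlation between missingness indicators across columns
# (columns with no missing values are constant and show up blank)
missing_corr = df.isnull().astype(int).corr()

plt.figure(figsize=(10, 6))
sns.heatmap(missing_corr, cmap="coolwarm", center=0)
plt.title("Correlation of Missing-Value Patterns")
plt.show()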

Part 2: Data Processing¶

1. Binning and Grouping¶

1.Bin the age column into categories such as "18-25", "26-35", "36-45", and "46+" to create a new column, age_group. How does the distribution of users vary across these age ranges?

In [51]:
# Define bins and labels for age ranges
bins = [0, 25, 35, 45, float("inf")]
labels = ["18-25", "26-35", "36-45", "46+"]

# Bin the age column into age ranges
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# Analyze the distribution of users across age ranges
age_distribution = df["age_group"].value_counts().sort_index()

# Output the results
print("Age Distribution:\n", age_distribution)
Age Distribution:
 age_group
18-25    10923
26-35    30397
36-45    11915
46+       6711
Name: count, dtype: int64

Conclusion: pd.cut bins the age column into an age_group column and value_counts gives the distribution: 26-35 is by far the largest group (30,397 users), followed by 36-45 (11,915) and 18-25 (10,923), with 46+ the smallest (6,711).
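
One caveat on the binning above: with right=False the intervals are left-inclusive, so the "18-25" bucket actually holds ages 18-24 and "26-35" holds 25-34. A sketch with edges that match the labels exactly:

# Sketch: bin edges aligned with the labels (18-25, 26-35, 36-45, 46+)
bins = [18, 26, 36, 46, float("inf")]
labels = ["18-25", "26-35", "36-45", "46+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

print(df["age_group"].value_counts().sort_index())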

2.Group income into categories like "Low Income," "Medium Income," and "High Income" based on meaningful thresholds (e.g., quartiles). What insights can be derived from these groups?

In [52]:
# Replace invalid incomes (-1) with NaN for meaningful categorization
df["income"] = df["income"].replace(-1, pd.NA)
df["income"]
Out[52]:
income
0 <NA>
1 80000
2 <NA>
3 20000
4 <NA>
... ...
59941 <NA>
59942 <NA>
59943 100000
59944 <NA>
59945 <NA>

59946 rows × 1 columns


In [53]:
# Calculate quartile thresholds
q1 = df["income"].quantile(0.25)
q2 = df["income"].quantile(0.50)
q3 = df["income"].quantile(0.75)
q1,q2,q3
Out[53]:
(20000.0, 50000.0, 100000.0)
In [54]:
# Define income categories based on thresholds
bins = [-float("inf"), q1, q2, q3, float("inf")]
labels = ["Low Income", "Medium Income", "High Income", "Very High Income"]
bins , labels
Out[54]:
([-inf, 20000.0, 50000.0, 100000.0, inf],
 ['Low Income', 'Medium Income', 'High Income', 'Very High Income'])
In [55]:
# Add a category for missing income
df["income_group"] = pd.cut(df["income"].fillna(-1), bins=[-float("inf"), 0, q1, q2, q3, float("inf")],
                            labels=["Unknown Income", "Low Income", "Medium Income", "High Income", "Very High Income"], right=False)

print(df["income_group"])
0          Unknown Income
1             High Income
2          Unknown Income
3           Medium Income
4          Unknown Income
               ...       
59941      Unknown Income
59942      Unknown Income
59943    Very High Income
59944      Unknown Income
59945      Unknown Income
Name: income_group, Length: 59946, dtype: category
Categories (5, object): ['Unknown Income' < 'Low Income' < 'Medium Income' < 'High Income' <
                         'Very High Income']
<ipython-input-55-f686f7740fa6>:2: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df["income_group"] = pd.cut(df["income"].fillna(-1), bins=[-float("inf"), 0, q1, q2, q3, float("inf")],

Conclusion: Low-income users (bottom quartile) are likely students or entry-level professionals, while high-income users (top quartile) likely hold senior or specialized roles. Profiles with missing income are placed in a separate "Unknown Income" bucket so reporting gaps can be analyzed on their own or excluded. This categorization helps relate user behavior, preferences, and trends to income group.
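
To quantify those groups, the new income_group column can simply be counted; a short sketch:

# Sketch: how many users fall into each income bracket (including unknown income)
income_group_summary = pd.DataFrame({
    "Count": df["income_group"].value_counts().sort_index(),
    "Percentage (%)": df["income_group"].value_counts(normalize=True).sort_index() * 100,
})
print(income_group_summary)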

2. Derived Features¶

Create a new feature, profile_completeness, by calculating the percentage of non-missing values for each user profile. How complete are most user profiles, and how does completeness vary across demographics?

In [56]:
# Calculate profile completeness for each user
df["profile_completeness"] = df.notna().sum(axis=1) / len(df.columns) * 100
df["profile_completeness"]
Out[56]:
profile_completeness
0 94.736842
1 100.000000
2 78.947368
3 94.736842
4 84.210526
... ...
59941 78.947368
59942 94.736842
59943 94.736842
59944 94.736842
59945 89.473684

59946 rows × 1 columns


In [57]:
# Analyze profile completeness distribution
completeness_distribution = df["profile_completeness"].value_counts(bins=[0, 25, 50, 75, 100], sort=False)
completeness_distribution
Out[57]:
count
(-0.001, 25.0] 0
(25.0, 50.0] 71
(50.0, 75.0] 7669
(75.0, 100.0] 52206

In [58]:
completeness_by_age = df.groupby("age")["profile_completeness"].mean()
print(completeness_by_age)
age
18     85.641288
19     85.313119
20     86.878003
21     86.665572
22     86.488325
23     86.478639
24     86.269035
25     86.282401
26     86.314941
27     86.314361
28     86.105439
29     86.386071
30     86.216176
31     86.300394
32     86.360955
33     86.274276
34     86.258232
35     86.435747
36     86.557835
37     86.655848
38     86.909379
39     86.801689
40     86.882984
41     87.454350
42     87.043401
43     87.118145
44     87.265834
45     87.787509
46     87.297396
47     87.483832
48     87.722946
49     87.719298
50     87.450319
51     87.624060
52     87.683599
53     86.695906
54     88.310664
55     87.845084
56     87.803457
57     88.384046
58     88.004275
59     87.997142
60     87.854251
61     89.114833
62     88.150016
63     88.138825
64     86.539357
65     88.652825
66     88.571429
67     88.118022
68     90.098127
69     86.757216
109    78.947368
110    47.368421
Name: profile_completeness, dtype: float64
  • Conclusion: df.notna() marks every non-missing cell as True and .sum(axis=1) counts the filled fields per row; dividing by len(df.columns) and multiplying by 100 gives each profile's completeness percentage. Binning the result with value_counts(bins=[0, 25, 50, 75, 100]) shows that most profiles (52,206 of 59,946) are more than 75% complete, and grouping by age shows average completeness rising gently with age, from roughly 85% at 18 to around 88-90% for users in their sixties.
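
Extending the comparison to another demographic is a one-liner; a sketch of average completeness by gender, reusing the profile_completeness column created above:

# Sketch: does profile completeness differ between genders?
completeness_by_gender = df.groupby("gender")["profile_completeness"].mean()
print(completeness_by_gender)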

3. Unit Conversion¶

1. Convert the height column from inches to centimeters using the conversion factor (1 inch = 2.54 cm). Store the converted values in a new column, height_cm.
In [59]:
# Conversion factor: 1 inch = 2.54 cm
conversion_factor = 2.54

# Convert height to centimeters and store in a new column
df["height_cm"] = df["height"] * conversion_factor
df["height_cm"]
Out[59]:
height_cm
0 190.50
1 177.80
2 172.72
3 180.34
4 167.64
... ...
59941 157.48
59942 182.88
59943 180.34
59944 185.42
59945 172.72

59946 rows × 1 columns


  • Conclusion: Since 1 inch equals 2.54 centimeters, multiplying the height column by 2.54 converts it to centimeters; the result is stored in the new height_cm column.

Part 3: Data Analysis¶

1. Demographic Analysis¶

1.What is the gender distribution (gender) across the platform? Are there any significant imbalances?¶

In [60]:
# Calculate gender distribution (counts and percentages)
gender_counts = df["gender"].value_counts()
gender_percentages = df["gender"].value_counts(normalize=True) * 100

# Combine counts and percentages into a single DataFrame for clarity
gender_distribution = pd.DataFrame({
    "Count": gender_counts,
    "Percentage (%)": gender_percentages
})

# Display the results
print(gender_distribution)
        Count  Percentage (%)
gender                       
m       35829       59.768792
f       24117       40.231208
  • Conclusion: df["gender"].value_counts() gives the count of each gender and value_counts(normalize=True) * 100 gives the percentages. Men account for about 59.8% of profiles and women for about 40.2%, a clear imbalance toward male users.

2.What are the proportions of users in different status categories (e.g., single, married, seeing someone)? What does this suggest about the platform’s target audience?¶

In [61]:
# Calculate status distribution (counts and percentages)
status_counts = df["status"].value_counts()
status_percentages = df["status"].value_counts(normalize=True) * 100

# Combine counts and percentages into a single DataFrame
status_distribution = pd.DataFrame({
    "Count": status_counts,
    "Percentage (%)": status_percentages
})

# Display the results
print(status_distribution)
                Count  Percentage (%)
status                               
single          55697       92.911954
seeing someone   2064        3.443099
available        1865        3.111133
married           310        0.517132
unknown            10        0.016682
  • Conclusion: Counts and percentages of each status are combined into one DataFrame. About 92.9% of users are single, with small shares seeing someone (3.4%), available (3.1%), or married (0.5%), suggesting the platform's audience overwhelmingly consists of people looking to date.

3.How does status vary by gender? For example, what proportion of men and women identify as single?¶

In [62]:
# Calculate the status distribution by gender
status_by_gender = df.groupby(["gender", "status"]).size().unstack(fill_value=0)

# Calculate the proportion of each status category by gender
status_percentage_by_gender = status_by_gender.divide(status_by_gender.sum(axis=1), axis=0) * 100

# Display the results
print(status_percentage_by_gender)
status  available   married  seeing someone     single   unknown
gender                                                          
f        2.720073  0.559771        4.158892  92.544678  0.016586
m        3.374362  0.488431        2.961288  93.159173  0.016746
  • Conclusion: Grouping by gender and status (size + unstack) and normalizing each row to percentages shows very similar patterns across genders: about 93.2% of men and 92.5% of women identify as single, with women slightly more likely to report "seeing someone" (4.2% vs. 3.0%).

2. Correlation Analysis¶

1. What are the correlations between numerical columns such as age, income, and gender? Are there any strong positive or negative relationships?¶

In [63]:
# Encode gender as numerical values (e.g., 0 for male, 1 for female)
df["gender_encoded"] = df["gender"].map({"m": 0, "f": 1})

# Select numerical columns
numerical_cols = ["age", "income", "gender_encoded"]

# Handle missing values by filling them with the column mean (alternatively, you can use df.dropna())
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].mean())

# Calculate correlation matrix
correlation_matrix = df[numerical_cols].corr()

# Display correlation matrix
print(correlation_matrix)
                     age    income  gender_encoded
age             1.000000 -0.004171        0.041481
income         -0.004171  1.000000       -0.021317
gender_encoded  0.041481 -0.021317        1.000000
<ipython-input-63-757a9fd1725b>:8: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].mean())
  • Conclusion: Gender is encoded as 0/1 in gender_encoded, missing values are filled with column means, and the correlation matrix is computed. All pairwise correlations are close to zero, so there are no strong linear relationships among age, income, and gender.
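
The same matrix is easier to scan as an annotated heatmap; a minimal sketch reusing correlation_matrix:

# Sketch: visualize the correlation matrix
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlations: age, income, gender (encoded)")
plt.show()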

2.How does age correlate with income? Are older users more likely to report higher income levels?¶

In [64]:
# Calculate the Pearson correlation between age and income
correlation = df["age"].corr(df["income"])
# Display the correlation
print(f"Correlation between age and income: {correlation:.2f}")
Correlation between age and income: -0.00
  • Conclusion: The Pearson correlation between age and income is approximately 0.00, so older users are not noticeably more likely to report higher incomes in this data.

3. Diet and Lifestyle Analysis¶

1.How do dietary preferences (diet) distribute across the platform? For example, what percentage of users identify as vegetarian, vegan, or follow "anything" diets?¶

In [65]:
# Calculate the distribution of dietary preferences
diet_distribution = df["diet"].value_counts(normalize=True) * 100

# Display the percentage distribution
print(diet_distribution)
diet
mostly anything        46.651290
anything               17.391916
strictly anything      14.382155
mostly vegetarian       9.687491
mostly other            2.832550
strictly vegetarian     2.461253
vegetarian              1.876178
strictly other          1.271413
mostly vegan            0.950747
other                   0.931057
strictly vegan          0.641332
vegan                   0.382549
mostly kosher           0.241906
mostly halal            0.135017
strictly halal          0.050631
strictly kosher         0.050631
halal                   0.030941
kosher                  0.030941
Name: proportion, dtype: float64
  • Conclusion: value_counts(normalize=True) gives each diet's share. "Mostly anything" dominates at about 46.7%, followed by "anything" (17.4%) and "strictly anything" (14.4%); vegetarian variants together make up roughly 14%, while the vegan, kosher, and halal categories each account for less than 1%.

2.How do drinking habits (drinks) vary across different diet categories? Are users with stricter diets (e.g., vegan) less likely to drink?¶

In [66]:
# Cross-tabulate diet and drinks to see how drinking habits vary across diet categories
diet_drinks_distribution = pd.crosstab(df['diet'], df['drinks'], normalize='index') * 100

# Display the percentage distribution
print(diet_drinks_distribution)
drinks               desperately  not at all      often     rarely   socially  \
diet                                                                            
anything                0.350467    4.923231   9.579439   8.511348  75.667557   
halal                   0.000000   44.444444  11.111111   0.000000  44.444444   
kosher                  0.000000    9.090909   0.000000  18.181818  63.636364   
mostly anything         0.406479    5.185687   8.536060   9.466034  75.611258   
mostly halal            6.976744   23.255814   4.651163  18.604651  37.209302   
mostly kosher           1.190476    8.333333   4.761905  20.238095  59.523810   
mostly other            0.820513    9.333333   5.025641  18.051282  66.256410   
mostly vegan            0.621118   12.422360   6.832298  19.254658  59.937888   
mostly vegetarian       0.572117    5.871725   7.015959  14.031918  71.906052   
other                   0.320513   11.217949   7.371795  19.230769  61.538462   
strictly anything       1.323441    3.649489  13.735713   6.376579  73.731702   
strictly halal          5.555556   38.888889  11.111111   5.555556  27.777778   
strictly kosher        33.333333   22.222222  11.111111  16.666667   5.555556   
strictly other          3.240741   15.509259   7.870370  14.583333  56.481481   
strictly vegan          3.652968   26.484018   8.675799  16.894977  42.922374   
strictly vegetarian     1.060071   10.011779   9.658422  14.252061  63.957597   
vegan                   1.574803   18.110236  14.960630  13.385827  51.181102   
vegetarian              0.964630    5.948553   9.485531  10.771704  71.704180   

drinks               very often  
diet                             
anything               0.967957  
halal                  0.000000  
kosher                 9.090909  
mostly anything        0.794482  
mostly halal           9.302326  
mostly kosher          5.952381  
mostly other           0.512821  
mostly vegan           0.931677  
mostly vegetarian      0.602228  
other                  0.320513  
strictly anything      1.183076  
strictly halal        11.111111  
strictly kosher       11.111111  
strictly other         2.314815  
strictly vegan         1.369863  
strictly vegetarian    1.060071  
vegan                  0.787402  
vegetarian             1.125402  
  • Conclusion: A row-normalized crosstab of diet versus drinks shows how drinking habits vary by diet. Users with stricter diets are indeed less likely to drink: about 26% of "strictly vegan" and 39% of "strictly halal" users report drinking "not at all", versus roughly 4-5% of users whose diet is "anything" or "strictly anything".
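
To answer the "stricter diets" part directly, the "not at all" column of the crosstab can be ranked; a small sketch reusing diet_drinks_distribution:

# Sketch: share of non-drinkers within each diet category, highest first
non_drinker_share = diet_drinks_distribution['not at all'].sort_values(ascending=False)
print(non_drinker_share)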

4. Geographical Insights¶

1.Extract city and state information from the location column. What are the top 5 cities and states with the highest number of users?¶

In [67]:
df['location'] = df['location'].dropna()
# Split the 'location' column into city and state columns, and handle missing or malformed entries
df[['city', 'state']] = df['location'].str.split(', ', n=1, expand=True)
# Count the number of users in each city and state
city_counts = df['city'].value_counts().head(5)
state_counts = df['state'].value_counts().head(5)
city_counts
state_counts
Out[67]:
count
state
california 59855
new york 17
illinois 8
massachusetts 5
oregon 4

  • Conclusion: The location column is split on ", " into city and state, and value_counts().head(5) gives the top 5 of each. Only state_counts is displayed above; California dominates with 59,855 users, far ahead of New York, Illinois, Massachusetts, and Oregon.
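
Because a notebook cell echoes only its last expression, city_counts is computed above but never shown; a sketch printing both rankings explicitly:

# Sketch: show the top 5 cities and top 5 states
print("Top 5 cities:\n", city_counts)
print("\nTop 5 states:\n", state_counts)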

2.How does age vary across the top cities? Are certain cities dominated by younger or older users?¶

In [68]:
# Split the 'location' column into city and state columns
df[['city', 'state']] = df['location'].str.split(', ', n=1, expand=True)

# Group by city and calculate the average age for each city
city_age_stats = df.groupby('city')['age'].mean().sort_values(ascending=False)

# Display the average age by city
print(city_age_stats)
city
forest knolls      62.5
bellingham         59.0
port costa         53.0
seaside            50.0
redwood shores     47.0
                   ... 
san luis obispo    20.0
canyon             19.0
canyon country     19.0
isla vista         19.0
long beach         19.0
Name: age, Length: 198, dtype: float64
  • Conclusion: Grouping by city and averaging age shows wide variation, from about 62.5 in Forest Knolls down to around 19 in the youngest cities, but many of these 198 cities contain only a handful of users, so the extremes should be interpreted carefully.
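
Since city_age_stats covers all 198 cities, including many with only a handful of users, restricting the comparison to the most populous cities gives a steadier picture; a sketch:

# Sketch: average age in the five cities with the most users
top_cities = df['city'].value_counts().head(5).index
print(df[df['city'].isin(top_cities)].groupby('city')['age'].mean().sort_values())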

3.What are the average income levels in the top states or cities? Are there regional patterns in reported income?¶

In [69]:
# Split the 'location' column into city and state columns
df[['city', 'state']] = df['location'].str.split(', ', n=1, expand=True)

# Group by state or city and calculate the average income
city_income_stats = df.groupby('city')['income'].mean().sort_values(ascending=False)
state_income_stats = df.groupby('state')['income'].mean().sort_values(ascending=False)

# Display the average income by city and state
print("Average Income by City:")
print(city_income_stats)
print(state_income_stats)
Average Income by City:
city
petaluma           552197.496523
santa cruz         292636.995828
montara            172829.160872
south orange       150000.000000
boulder            150000.000000
                       ...      
north hollywood     20000.000000
leander             20000.000000
pasadena            20000.000000
new orleans         20000.000000
rohnert park        20000.000000
Name: income, Length: 198, dtype: float64
state
new jersey                  150000.000000
colorado                    127197.496523
new york                    123877.935040
california                  104400.939498
germany                     104394.993046
georgia                     104394.993046
hawaii                      104394.993046
connecticut                 104394.993046
district of columbia        104394.993046
west virginia               104394.993046
wisconsin                   104394.993046
washington                  104394.993046
idaho                       104394.993046
mexico                      104394.993046
minnesota                   104394.993046
illinois                    104394.993046
ireland                     104394.993046
montana                     104394.993046
missouri                    104394.993046
nevada                      104394.993046
mississippi                 104394.993046
netherlands                 104394.993046
spain                       104394.993046
oregon                      104394.993046
north carolina              104394.993046
virginia                    104394.993046
tennessee                   104394.993046
utah                        104394.993046
rhode island                104394.993046
switzerland                 104394.993046
united kingdom              104394.993046
florida                     104394.993046
massachusetts                89515.994437
texas                        83296.244784
michigan                     83296.244784
ohio                         62197.496523
british columbia, canada     60000.000000
vietnam                      60000.000000
pennsylvania                 40000.000000
arizona                      33333.333333
louisiana                    20000.000000
Name: income, dtype: float64
  • Conclusion: Average income is computed per city and per state. Petaluma and Santa Cruz top the city ranking, but many locations have only a handful of users, and most states share the identical mean of 104,394.99 because missing incomes were filled with the overall mean earlier, so these regional figures should be read with caution.

5. Height Analysis¶

1.What is the average height of users across different gender categories?¶

In [70]:
gender_height_avg = df.groupby('gender')['height'].mean()
# Display the average height by gender
print(gender_height_avg)
gender
f    65.103873
m    70.443492
Name: height, dtype: float64
  • Conclusion: Grouping by gender gives an average height of about 65.1 inches for women and 70.4 inches for men.

2.How does height vary by age_group? Are there noticeable trends among younger vs. older users?¶

In [71]:
# Define age bins and labels
bins = [0, 25, 35, 45, 100]
labels = ["18-25", "26-35", "36-45", "46+"]
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

# Group by age_group and calculate the average height
age_group_height_avg = df.groupby('age_group')['height'].mean()

# Display the average height by age group
print(age_group_height_avg)
age_group
18-25    68.145931
26-35    68.410693
36-45    68.315317
46+      67.976148
Name: height, dtype: float64
<ipython-input-71-f9264f31cf66>:7: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  age_group_height_avg = df.groupby('age_group')['height'].mean()
  • Conclusion: Age is binned into age_group and average height is computed per group. Height is nearly constant across age groups at roughly 68 inches, with only a slight dip for users 46 and older, so there is no meaningful height trend by age.

3.What is the distribution of height within body_type categories (e.g., athletic, curvy, thin)? Do the distributions align with expectations?¶

In [72]:
# Group by body type and describe the height distribution
body_type_height_distribution = df.groupby('body_type')['height'].describe()
# Display the distribution statistics
print(body_type_height_distribution)
                  count       mean       std   min   25%   50%    75%   max
body_type                                                                  
a little extra   2629.0  68.820084  3.930905  55.0  66.0  69.0  72.00  85.0
athletic        11819.0  69.707336  3.593712   3.0  67.0  70.0  72.00  95.0
average         14652.0  68.100805  3.850883  36.0  65.0  68.0  71.00  95.0
curvy            3924.0  65.210245  3.021022  36.0  63.0  65.0  67.00  95.0
fit             12711.0  68.546062  3.767420   9.0  66.0  69.0  71.00  95.0
full figured     1009.0  66.464817  3.352475  58.0  64.0  66.0  68.00  80.0
jacked            421.0  69.292162  5.871453  37.0  66.0  70.0  72.00  95.0
overweight        444.0  68.948198  4.225529  59.0  66.0  69.0  72.00  95.0
rather not say    198.0  67.272727  4.756653  59.0  64.0  67.0  70.75  95.0
skinny           1777.0  68.544176  3.992023  59.0  66.0  69.0  72.00  84.0
thin             4711.0  67.866058  4.091067   6.0  65.0  68.0  71.00  80.0
used up           355.0  69.180282  5.789251  36.0  66.0  70.0  72.00  95.0
  • Conclusion: Describing height within each body_type shows broadly expected patterns: "athletic" and "jacked" users average the tallest (about 69-70 inches) and "curvy" users the shortest (about 65 inches), though implausible minimums such as 3, 6, and 9 inches confirm that some height entries are data-entry errors.

6. Income Analysis¶

1.What is the distribution of income across the platform? Are there specific income brackets that dominate? (don't count 0)¶

In [73]:
df_filtered = df[df['income'] > 0]
# Describe the income distribution (excluding zeros)
income_description = df_filtered['income'].describe()
# Display the income distribution summary
print(income_description)
count      59946.000000
mean      104394.993046
std        88239.052798
min        20000.000000
25%       104394.993046
50%       104394.993046
75%       104394.993046
max      1000000.000000
Name: income, dtype: float64
  • Conclusion: After filtering to income > 0, describe() summarizes the distribution. The 25th, 50th, and 75th percentiles all equal 104,394.99, the mean used earlier to fill missing incomes, so the filter removes nothing and most values are imputed rather than reported; genuinely reported incomes range from 20,000 to 1,000,000.
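
Because income in this dataset takes a small set of discrete values, counting them directly shows which brackets dominate; a sketch (the imputed mean of roughly 104,395 will likely appear as the most frequent "value" and should be ignored):

# Sketch: most frequent income values among users with income > 0
print(df_filtered['income'].value_counts().head(10))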

2.How does income vary by age_group and gender? Are older users more likely to report higher incomes?¶

In [74]:
# Create an 'age_group' column by categorizing ages
bins = [0, 18, 30, 40, 50, 100]  # Define age groups
labels = [ '0-18','18-30', '31-40', '41-50', '51+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)
# Group by 'age_group' and 'gender' and calculate average income
grouped_income = df.groupby(['age_group', 'gender'])['income'].mean().reset_index()
# Display the grouped data
print(grouped_income)
  age_group gender         income
0      0-18      f            NaN
1      0-18      m            NaN
2     18-30      f  101740.061003
3     18-30      m  105071.928791
4     31-40      f  103352.298007
5     31-40      m  107689.944669
6     41-50      f  101633.033214
7     41-50      m  106586.780425
8       51+      f   99848.000669
9       51+      m  101792.747280
  • Conclusion: Re-created age_group with pd.cut (bins and labels), then grouped by age_group and gender and averaged income. Men report slightly higher average incomes than women in every age group, and averages peak in the 31-40 bracket before declining for 51+, so older users are not clearly more likely to report higher incomes.
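
A pivot of the grouped result puts the two genders side by side, which makes the comparison easier to scan; this is an optional sketch built on the grouped_income frame above:

# Sketch: one row per age group, one column per gender, plus the male-female gap
income_pivot = grouped_income.pivot(index='age_group', columns='gender', values='income')
income_pivot['gap_m_minus_f'] = income_pivot['m'] - income_pivot['f']
print(income_pivot.round(0))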

Part 4: Data Visualization¶

1. Age Distribution¶

1. Plot a histogram of age with a vertical line indicating the mean age. What does the distribution reveal about the most common age group on the platform?¶

In [75]:
# Calculate the mean age
mean_age = df['age'].mean()

# Plot a histogram
plt.figure(figsize=(8, 6))
plt.hist(df['age'], bins=10, edgecolor='black', alpha=0.7, color='skyblue')
plt.axvline(mean_age, color='red', linestyle='dashed', linewidth=1.5, label=f"Mean Age: {mean_age:.1f}")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Number of Users")
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
[Figure: "Age Distribution" — histogram of age with a dashed line at the mean]
  • Conclusion: Calculated the mean age with df['age'].mean() and plotted a 10-bin histogram with a dashed vertical line marking the mean; the tallest bars relative to that line show the platform's most common age range.
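
To put a number on the most common age range instead of reading it off the plot, a small optional sketch could bin the ages and count them:

# Sketch: identify the most populated 5-year age band
age_bands = pd.cut(df['age'], bins=range(15, 75, 5))
band_counts = age_bands.value_counts().sort_index()
print(band_counts)
print("Most common band:", band_counts.idxmax())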

2. How does the age distribution differ by gender? Are there age groups where one gender is more prevalent?¶

In [76]:
# Plot age distribution by gender using Seaborn
plt.figure(figsize=(10, 6))
sns.histplot(
    data=df,
    x="age",
    hue="gender",
    bins=10,
    kde=True,
    palette="Set2",
    alpha=0.7
)
plt.title("Age Distribution by Gender")
plt.xlabel("Age")
plt.ylabel("Count")
plt.grid(axis='y', linestyle='--', alpha=0.7)
# seaborn's histplot already adds a legend for the hue variable, so no plt.legend() call is needed
plt.show()
[Figure: "Age Distribution by Gender" — overlaid histograms with KDE curves, split by gender]
  • Conclusion: Plotted overlapping age histograms with KDE curves, split by gender via sns.histplot(hue="gender"). Seaborn adds the gender legend automatically, so the extra plt.legend() call was dropped.
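
To check numerically where one gender is more prevalent, a crosstab of age bands by gender could complement the histogram; the bands below are an arbitrary choice for illustration:

# Sketch: user counts per age band and gender, plus the male share of each band
age_band = pd.cut(df['age'], bins=[18, 30, 40, 50, 60, 120], right=False)
band_gender = pd.crosstab(age_band, df['gender'])
band_gender['male_share'] = (band_gender['m'] / band_gender.sum(axis=1)).round(3)
print(band_gender)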

2. Income and Age¶

1. Use a scatterplot to visualize the relationship between income and age, with a trend line indicating overall patterns. Are older users more likely to report higher incomes?¶

In [77]:
# Filter out rows with income of 0 (if present)
df = df[df["income"] > 0]
# Create a scatterplot with a trend line
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x="age", y="income", color="blue", alpha=0.7)
sns.regplot(data=df, x="age", y="income", scatter=False, color="red", ci=None)
# Add titles and labels
plt.title("Relationship Between Age and Income", fontsize=14)
plt.xlabel("Age", fontsize=12)
plt.ylabel("Income", fontsize=12)
plt.grid(axis='both', linestyle='--', alpha=0.7)
plt.show()
[Figure: "Relationship Between Age and Income" — scatterplot of age vs. income with a linear trend line]
  • Conclusion: Drew a scatterplot of age against income with sns.scatterplot and overlaid a linear trend line with sns.regplot(scatter=False). Combined with the grouped averages above, the relationship between age and reported income appears weak at best.
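
A correlation coefficient quantifies what the trend line shows; a quick optional check:

# Sketch: Pearson correlation between age and reported income
corr = df[['age', 'income']].corr().loc['age', 'income']
print(f"Pearson correlation between age and income: {corr:.3f}")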

2. Create boxplots of income grouped by age_group. Which age group reports the highest median income?¶

In [78]:
# Define age groups
bins = [20, 30, 40, 50, 60]
labels = ["20-29", "30-39", "40-49", "50-59"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# Filter out rows with income of 0 (if present)
df = df[df["income"] > 0]

# Create a boxplot of income grouped by age_group
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x="age_group", y="income", hue="age_group", palette="Set2", legend=False)

# Add titles and labels
plt.title("Income Distribution by Age Group", fontsize=14)
plt.xlabel("Age Group", fontsize=12)
plt.ylabel("Income", fontsize=12)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()

# Identify the age group with the highest median income
median_income = df.groupby("age_group", observed=False)["income"].median()
print("Median income by age group:")
print(median_income)
print(f"The age group with the highest median income is: {median_income.idxmax()}")
[Figure: "Income Distribution by Age Group" — boxplots of income per age group]
Median income by age group:
age_group
20-29    104394.993046
30-39    104394.993046
40-49    104394.993046
50-59    104394.993046
Name: income, dtype: float64
The age group with the highest median income is: 20-29
  • Conclusion: Re-binned ages into decades, kept income > 0, and drew boxplots of income per age_group. Because the imputed mean value dominates the income column, every group shares the same median (~104,395), so idxmax() simply returns the first group (20-29); the medians are not informative here.
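
If the repeated value really is an imputation artefact (an assumption based on the earlier describe() output), excluding it before taking medians may give a more meaningful comparison across age groups:

# Sketch: medians per age group after dropping the (likely imputed) modal income
modal_income = df['income'].mode()[0]
reported = df[df['income'] != modal_income]
print(reported.groupby('age_group', observed=False)['income'].median())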

3. Analyze income levels within gender and status categories. For example, are single men more likely to report higher incomes than single women?¶

In [79]:
# Group by gender and status, calculate median income
grouped_income = df.groupby(["gender", "status"])["income"].median().reset_index()

# Create a bar plot for income comparison
plt.figure(figsize=(10, 6))
sns.barplot(data=grouped_income, x="status", y="income", hue="gender", palette="Set2")

# Add titles and labels
plt.title("Median Income by Gender and Status", fontsize=14)
plt.xlabel("Status", fontsize=12)
plt.ylabel("Median Income", fontsize=12)
plt.grid(axis="y", linestyle="--", alpha=0.7)

plt.show()
# Display the grouped income DataFrame
print("Median Income by Gender and Status:")
print(grouped_income)
[Figure: "Median Income by Gender and Status" — grouped bar plot of median income by status, split by gender]
Median Income by Gender and Status:
  gender          status         income
0      f       available  104394.993046
1      f         married  104394.993046
2      f  seeing someone  104394.993046
3      f          single  104394.993046
4      f         unknown   62197.496523
5      m       available  104394.993046
6      m         married  104394.993046
7      m  seeing someone  104394.993046
8      m          single  104394.993046
9      m         unknown  104394.993046
  • Conclusion: Grouped by gender and status, took the median income per group, and compared them with a grouped bar plot. Almost every combination shares the same median (~104,395, the imputed value); only women with 'unknown' status differ (~62,197), so this view does not show single men out-earning single women at the median.
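
Because the medians are saturated by the imputed value, a mean together with group sizes is an alternative view worth checking; an optional sketch:

# Sketch: mean income and group size per gender/status combination
income_by_group = (
    df.groupby(['gender', 'status'])['income']
      .agg(['mean', 'count'])
      .round(0)
)
print(income_by_group)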

3. Pets and Preferences¶

1. Create a bar chart showing the distribution of pets categories (e.g., likes dogs, likes cats). Which preferences are most common?¶

In [80]:
# Count the distribution of pet preferences
pet_distribution = df["pets"].value_counts()

# Plot a bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=pet_distribution.index, y=pet_distribution.values, hue=pet_distribution.index, palette="pastel", legend=False)

# Add titles and labels
plt.title("Distribution of Pet Preferences", fontsize=14)
plt.xlabel("Pet Preference", fontsize=12)
plt.ylabel("Number of Users", fontsize=12)
plt.xticks(rotation=15)
plt.grid(axis="y", linestyle="--", alpha=0.7)

plt.show()
# Display the distribution as a table
print("Pet Preferences Distribution:")
print(pet_distribution)
[Figure: "Distribution of Pet Preferences" — bar chart of the pets categories]
Pet Preferences Distribution:
pets
likes dogs and likes cats          14814
likes dogs                          7224
likes dogs and has cats             4313
has dogs                            4134
has dogs and likes cats             2333
likes dogs and dislikes cats        2029
has dogs and has cats               1474
has cats                            1406
likes cats                          1063
has dogs and dislikes cats           552
dislikes dogs and likes cats         240
dislikes dogs and dislikes cats      196
dislikes cats                        122
dislikes dogs and has cats            81
dislikes dogs                         44
Name: count, dtype: int64
  • Conclusion: Counted the pets categories with value_counts() and plotted them as a bar chart. 'likes dogs and likes cats' is by far the most common entry (14,814 users), followed by 'likes dogs' (7,224); categories expressing dislike are rare.
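
The 15 raw categories can be collapsed into simpler dog/cat flags to make the overall preference clearer; the flag definitions below are our own simplification (owning a pet is treated as liking it, and missing entries count as no preference):

# Sketch: derive dog/cat flags from the pets categories
pets = df['pets'].fillna('')
likes_dogs = pets.str.contains(r'\b(?:likes|has) dogs\b')
likes_cats = pets.str.contains(r'\b(?:likes|has) cats\b')
print("Share who like or have dogs:", round(likes_dogs.mean(), 3))
print("Share who like or have cats:", round(likes_cats.mean(), 3))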

2. How do pets preferences vary across gender and age_group? Are younger users more likely to report liking pets compared to older users?¶

In [81]:
# Group the data by gender, age_group, and pets preferences
pets_analysis = (
    df.groupby(["gender", "age_group", "pets"], observed=False)
    .size()
    .reset_index(name="count")
)

# Create a pivot table for a tabular view of the same counts
pivot_pets = pets_analysis.pivot_table(
    index=["age_group", "pets"],
    columns="gender",
    values="count",
    fill_value=0,
    observed=False
)

# Plot the pet preferences across gender and age groups
# (catplot creates its own figure, so no separate plt.figure() call is needed)
g = sns.catplot(
    data=pets_analysis,
    x="age_group",
    y="count",
    hue="pets",
    col="gender",
    kind="bar",
    height=5,
    aspect=1.2,
    palette="pastel"
)

# Add titles and labels at the figure level
g.set_axis_labels("Age Group", "Count of Users")
g.figure.subplots_adjust(top=0.85)
g.figure.suptitle("Pets Preferences by Gender and Age Group", fontsize=16)
plt.show()
[Figure: "Pets Preferences by Gender and Age Group" — grouped bar charts of pets counts per age group, one panel per gender]
  • Conclusion: Counted users per (gender, age_group, pets) combination, kept a pivot table as a tabular view, and plotted grouped bars per gender facet with sns.catplot. With 15 pet categories per panel the chart is busy, so collapsing the categories into simple dog/cat flags first makes age-related differences easier to read (see the sketch below).
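
To answer the age part of the question directly, proportions within each gender/age group are easier to compare than raw counts, since group sizes differ; an optional sketch reusing the flag idea above:

# Sketch: share of users per gender/age_group whose pets entry mentions liking or having a pet
pet_positive = df['pets'].str.contains(r'\b(?:likes|has)\b', na=False)
pet_share = (
    df.assign(pet_positive=pet_positive)
      .groupby(['gender', 'age_group'], observed=False)['pet_positive']
      .mean()
      .unstack('gender')
      .round(3)
)
print(pet_share)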

4. Signs and Personality¶

1. Create a pie chart showing the distribution of zodiac signs (sign) across the platform. Which signs are most and least represented? Is this the right chart? If not, replace it with the right chart.¶

In [82]:
# Calculate the distribution of zodiac signs
zodiac_distribution = df['sign'].value_counts()

# Plot a pie chart
plt.figure(figsize=(10, 8))
zodiac_distribution.plot.pie(
    autopct='%1.1f%%',
    startangle=140,
    colors=plt.cm.Pastel1.colors,
    wedgeprops={'edgecolor': 'black'}
)
plt.title("Zodiac Sign Distribution Across the Platform")
plt.ylabel("")  # Remove y-axis label for better visualization
plt.show()
[Figure: "Zodiac Sign Distribution Across the Platform" — pie chart of sign counts]
  • Conclusion: Counted zodiac signs with value_counts() and plotted them as a pie chart. With a dozen similarly sized slices a pie chart is hard to read, so a sorted bar chart is the better choice here; a sketch of that alternative follows.
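
A minimal sketch of the bar-chart alternative, which makes the most and least represented signs easier to rank:

# Sketch: sorted horizontal bar chart instead of the pie chart
plt.figure(figsize=(10, 8))
sns.barplot(x=zodiac_distribution.values, y=zodiac_distribution.index,
            hue=zodiac_distribution.index, palette='Pastel1', legend=False)
plt.title("Zodiac Sign Distribution Across the Platform")
plt.xlabel("Number of Users")
plt.ylabel("Zodiac Sign")
plt.show()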

2. How does sign vary across gender and status? Are there noticeable patterns or imbalances?¶

In [83]:
# Group by gender and status, then calculate the distribution of signs
sign_gender_status = df.groupby(['gender', 'status', 'sign']).size().reset_index(name='count')

# Create a heatmap to visualize the distribution of signs by gender and status
pivot_table = sign_gender_status.pivot_table(
    index='sign',
    columns=['gender', 'status'],
    values='count',
    aggfunc='sum',
    fill_value=0
)
# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(
    pivot_table,
    annot=True,
    fmt='d',
    cmap='coolwarm',
    cbar_kws={'label': 'Count'}
)
plt.title("Zodiac Sign Distribution by Gender and Status")
plt.xlabel("Gender and Status")
plt.ylabel("Zodiac Sign")
plt.xticks(rotation=45, ha='right')
plt.show()
[Figure: "Zodiac Sign Distribution by Gender and Status" — annotated heatmap of sign counts per gender/status column]
  • Conclusion: Counted users per (gender, status, sign) combination, pivoted to signs by (gender, status), and visualized the counts with an annotated heatmap. Raw counts are dominated by the larger status groups, so any sign-level imbalance is easier to see after normalizing each column (see the sketch below).
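
As an optional follow-up, normalizing each gender/status column to proportions removes the effect of group size and highlights over- or under-represented signs more clearly:

# Sketch: share of each sign within every gender/status column
sign_share = pivot_table.div(pivot_table.sum(axis=0), axis=1)

plt.figure(figsize=(12, 8))
sns.heatmap(sign_share, annot=True, fmt='.2f', cmap='coolwarm',
            cbar_kws={'label': 'Share within column'})
plt.title("Zodiac Sign Share by Gender and Status")
plt.xlabel("Gender and Status")
plt.ylabel("Zodiac Sign")
plt.xticks(rotation=45, ha='right')
plt.show()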
In [ ]: