Spinny's Revenue Growth Through Demand and Pricing Optimisation¶
1.1 - Given Problem Statement¶
Spinny, a leading Indian startup in the pre-owned car market, wants to increase revenue by boosting the number of completed transactions.
To achieve this, it is essential to understand the demand for cars across different regions and identify the most in-demand cars and their attributes. Additionally, determining the right pricing for each car, based on its attributes and market demand, is crucial for staying competitive.
Objective:
As a data analyst, your goal is to analyse used car transaction data to:
Assess the demand for pre-owned cars in each region.
Identify the demand for specific car models and their key attributes (e.g., condition, fuel type, mileage).
Determine the optimal pricing for each car, ensuring it is competitive and aligned with market trends.
1.2 - Simplified Breakdown Of Problem Statement¶
📜 Project Overview
Spinny, a pre-owned car marketplace, wants to increase revenue by optimizing pricing and understanding regional demand for used cars. The focus is on identifying high-demand cars, key attributes, and competitive pricing to improve transaction success.
🎯 Main Goal
Analyze car transaction data to:
Assess demand for used cars in different regions.
Identify high-demand models & key attributes that drive sales.
Optimize pricing strategy to balance competitiveness and profitability.
📊 Key Focus Areas
Demand Analysis: Identify regions with high/low demand for used cars.
Car Attribute Insights: Study factors like year, condition, fuel type, odometer readings that influence demand.
Pricing Optimization: Ensure pricing is competitive based on demand, location, and vehicle attributes.
🚀 Data-Driven Approach
Segmentation & Trend Analysis: Group cars by attributes and demand patterns.
Price Benchmarking: Compare pricing across similar vehicles in different regions.
Demand Forecasting: Identify seasonal trends and factors affecting used car sales.
📌 Summary
1️⃣ Develop a data-driven strategy for pricing and demand optimization.
2️⃣ Use historical transaction data to identify high-demand car attributes.
3️⃣ Ensure Spinny stays competitive with well-structured pricing based on market trends.
2.1 Dataset Summary¶
- Dataset source: The dataset for this project was provided as part of the Data Analyst course Graduation Project by Nextleap
- Dataset Name: Spinny Revenue Growth
- Number of Rows: 426880
- Number of Columns: 21
- Description: The dataset details pre-owned car transactions, covering attributes like price, model, fuel type, and location, to analyze demand and optimize Spinny's pricing strategy
2.2 Columns Description¶
- id (Transaction ID)
- Definition: A unique identifier for each car transaction in the dataset.
- Example: If a specific transaction has the ID 7222695916, it uniquely represents that sale.
- Price (INR)
- Definition: The final selling price of the vehicle in Indian Rupees (INR).
- Example: If the price is 770530, it indicates the car was sold for INR 770,530.
- Year
- Definition: The year of manufacture of the vehicle, indicating its age.
- Example: If the year is 2015, it means the car was manufactured in 2015.
- Manufacturer
- Definition: The brand or company that produced the vehicle.
- Example: If the manufacturer is Maruti, it means the car was manufactured by Maruti Suzuki.
- Model
- Definition: The specific model name of the vehicle produced by the manufacturer.
- Example: If the model is Swift, it refers to the Maruti Suzuki Swift model.
- Condition
- Definition: The physical and operational state of the vehicle at the time of sale.
- Example: If the condition is excellent, it indicates the car was in top-notch condition during the transaction.
- Cylinders
- Definition: The number of cylinders in the car's engine, influencing its power and performance.
- Example: If the cylinders is 4, it indicates the car has a 4-cylinder engine.
- Fuel
- Definition: The type of fuel the vehicle uses, such as petrol, diesel, CNG, or electric.
- Example: If the fuel is diesel, the car runs on diesel fuel.
- Odometer
- Definition: The total distance the vehicle has traveled, measured in kilometers.
- Example: If the odometer reads 50,000, the car has been driven for 50,000 km.
- Title_status
- Definition: The status of the car's title, indicating whether it is clean, has missing parts, or has liens.
- Example: If the title_status is clean, it means the vehicle has no outstanding issues.
- Transmission
- Definition: The type of gearbox system the vehicle uses for shifting gears (manual, automatic, other).
- Example: If the transmission is automatic, the car shifts gears automatically; if other, it uses a non-standard system.
- VIN (Vehicle Identification Number)
- Definition: A unique serial number assigned to each vehicle for identification.
- Example: If the VIN is 1HGCM82633A123456, it uniquely identifies the vehicle.
- Drive
- Definition: The drivetrain configuration of the vehicle (e.g., front-wheel drive, rear-wheel drive).
- Example: If the drive is FWD, the car is a front-wheel drive vehicle.
- Size
- Definition: The size classification of the vehicle, such as compact, midsize, or full-size.
- Example: If the size is compact, the car is categorized as a compact vehicle.
- Type
- Definition: The type of vehicle, such as sedan, SUV, hatchback, or truck.
- Example: If the type is SUV, the car belongs to the sport utility vehicle category.
- Paint_color
- Definition: The exterior color of the vehicle.
- Example: If the paint_color is white, the car's body is white in color.
- Image_url
- Definition: A link to an image of the vehicle for reference.
- Example: If the image_url is a valid web address, it displays the car's photo.
- State
- Definition: The Indian state where the car transaction took place.
- Example: If the state is Haryana, the car was sold in Haryana.
- Posting_date
- Definition: The date when the car was listed for sale.
- Example: If the posting_date is 2023-10-01, the car was listed on October 1, 2023.
- Latitude
- Definition: The geographical latitude coordinate of the transaction location.
- Example: If the latitude is 28.8446, it represents the latitude of the transaction in Haryana.
- Longitude
- Definition: The geographical longitude coordinate of the transaction location.
- Example: If the longitude is 75.1167, it represents the longitude of the transaction in Haryana.
3.1 Importing Required Libraries¶
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical computations
import matplotlib.pyplot as plt # For plotting and visualization
import seaborn as sns # For advanced visualizations
import scipy.stats as stats # for statistical analysis
3.2 Loading the dataset from Nextleap Google Drive¶
# Step 1: Install gdown
!pip install gdown
# Step 2: Import necessary libraries
import gdown
import pandas as pd
# Step 3: Set the file ID and create a download URL
file_id = "1QQaRZizuq3TGhUqHyIc701ChT-GgM1yR"
download_url = f"https://drive.google.com/uc?id={file_id}"
# Step 4: Set the output file name
output_file = "spinny_dataset.csv"
# Step 5: Download the file
gdown.download(download_url, output_file, quiet=False)
# Step 6: Load the CSV file into a Pandas DataFrame
data = pd.read_csv(output_file)
Downloading... From: https://drive.google.com/uc?id=1QQaRZizuq3TGhUqHyIc701ChT-GgM1yR To: /content/spinny_dataset.csv 100%|██████████| 99.9M/99.9M [00:00<00:00, 227MB/s]
3.3 Initial Dataset Exploration¶
3.3.1 Checking Shape of the dataset¶
rows, columns = data.shape
print(f"The dataset contains **{rows} rows** and **{columns} columns**.")
The dataset contains **426880 rows** and **21 columns**.
3.3.2 Viewing all columns in the Dataset¶
# Viewing all column names, as not all are visible in head or sample below
data.columns
Index(['id', 'price', 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url', 'state', 'posting_date', 'latitude', 'longitude'], dtype='object')
Here, I’ve categorized the columns according to the problem statement to simplify the analysis:
- Most Relevant: price, year, manufacturer, model, condition, fuel, odometer, state, transmission
- Relevant: type, drive, title_status, paint_color, cylinders, size, posting_date
- Less Relevant: id, VIN, image_url, latitude, longitude
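If helpful for later steps, these groupings can be kept as simple Python lists (a small sketch; the list names are my own, not part of the original notebook):
# Hypothetical helper lists mirroring the relevance groups above
most_relevant = ['price', 'year', 'manufacturer', 'model', 'condition', 'fuel', 'odometer', 'state', 'transmission']
relevant = ['type', 'drive', 'title_status', 'paint_color', 'cylinders', 'size', 'posting_date']
less_relevant = ['id', 'VIN', 'image_url', 'latitude', 'longitude']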
3.3.3 Understanding the dataset by looking at a few rows¶
# Checking the top 5 Rows to see how the dataset looks
data.head(5)
id | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | ... | VIN | drive | size | type | paint_color | image_url | state | posting_date | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7222695916 | 770530.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Haryana | NaN | 28.8446 | 75.1167 |
1 | 7218891961 | 1528210.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Haryana | NaN | 27.9026 | 77.0382 |
2 | 7221797935 | 2696820.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | West Bengal | NaN | 24.6838 | 85.9695 |
3 | 7222270760 | 192610.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Andhra Pradesh | NaN | 15.5158 | 81.9342 |
4 | 7210384030 | 629240.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | West Bengal | NaN | 23.7821 | 87.4861 |
5 rows × 21 columns
📌 Observation: Most of the column values are missing, except a few. It's important to check if this is intentional or an error.
- Next Step: Verify this by inspecting 5 randomly selected rows for better clarity.
# Randomly selecting 5 rows from the dataset
data.sample(5)
id | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | ... | VIN | drive | size | type | paint_color | image_url | state | posting_date | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6661 | 7308571224 | 3916750.0 | 2017.0 | bmw | 430i | NaN | 4 cylinders | gas | 32214.0 | clean | ... | WBA4F7C35HG787932 | rwd | NaN | sedan | blue | https://images.craigslist.org/00b0b_dEVueDQ2dN... | Bihar | 2021-04-18 01:02:14+00:00 | 26.0785 | 87.0355 |
192725 | 7309999168 | 2387620.0 | 2020.0 | kia | soul s wagon 4d | good | NaN | other | 27505.0 | clean | ... | KNDJ23AUXL7005901 | NaN | NaN | wagon | silver | https://images.craigslist.org/00909_cdPG9Py9E1... | Dadra and Nagar Haveli and Daman and Diu | 2021-04-20 20:00:30+00:00 | 20.1024 | 72.9349 |
193042 | 7307142093 | 2940840.0 | 2017.0 | bmw | x1 | NaN | NaN | gas | 56452.0 | clean | ... | WBXHT3Z36H4A56656 | NaN | NaN | NaN | NaN | https://images.craigslist.org/00L0L_bbPvpa1KvW... | Uttar Pradesh | 2021-04-15 13:50:42+00:00 | 24.9249 | 79.1547 |
420708 | 7311357255 | 4236880.0 | 2015.0 | ford | f150 supercrew cab xlt | good | 6 cylinders | gas | 65329.0 | clean | ... | 1FTEW1EG0FKE39342 | 4wd | NaN | pickup | blue | https://images.craigslist.org/00N0N_1xMPvfxRAI... | Telangana | 2021-04-23 15:20:44+00:00 | 16.3049 | 78.3479 |
223133 | 7302507866 | 0.0 | 2014.0 | gmc | sierra 1500 4x4 crew cab sle | good | 8 cylinders | other | 110494.0 | clean | ... | 3GTU2UEC7EG175242 | 4wd | NaN | pickup | blue | https://images.craigslist.org/00b0b_1T8Tm7RYOn... | Uttar Pradesh | 2021-04-06 12:00:35+00:00 | 24.6724 | 77.6303 |
5 rows × 21 columns
📌 Observation:
- After inspecting sample(5), very few values are missing in the randomly selected rows, suggesting that missing values are not uniformly distributed across the dataset.
- Need to convert text data to lowercase, since state names start with a capital letter and the VIN column is in uppercase.
3.3.4 Displaying dataset information¶
print("Dataset Information:\n")
data.info()
Dataset Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 426880 entries, 0 to 426879 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 426880 non-null int64 1 price 426880 non-null float64 2 year 425675 non-null float64 3 manufacturer 409234 non-null object 4 model 421603 non-null object 5 condition 252776 non-null object 6 cylinders 249202 non-null object 7 fuel 423867 non-null object 8 odometer 422480 non-null float64 9 title_status 418638 non-null object 10 transmission 424324 non-null object 11 VIN 265838 non-null object 12 drive 296313 non-null object 13 size 120519 non-null object 14 type 334022 non-null object 15 paint_color 296677 non-null object 16 image_url 426812 non-null object 17 state 426880 non-null object 18 posting_date 426812 non-null object 19 latitude 426880 non-null float64 20 longitude 426880 non-null float64 dtypes: float64(5), int64(1), object(15) memory usage: 68.4+ MB
📌 Observation:
- Datatype: The year column should be converted to int, and posting_date should be converted to datetime for accurate handling and analysis
- Not-Null: The columns with non-null values below 4 lakh, arranged in ascending order are: size, condition, cylinders, VIN, drive, paint_color, and type.
Next Step: Let's count the null values and their percentages to get a better understanding of the missing data and figure out how to handle them.
3.3.5 Missing values data summary¶
# Calculating the missing value for each column
missing_values = data.isnull().sum()
# calculating the missing percentage
missing_percentage = (missing_values / len(data)) * 100
# creating summary table
missing_summary = pd.DataFrame({"missing values in each column": missing_values,
"missing %": missing_percentage }).sort_values(by ="missing %",ascending=False)
# Displaying summary table
print("missing data summary for each column:\n\n", missing_summary)
missing data summary for each column: missing values in each column missing % size 306361 71.767476 cylinders 177678 41.622470 condition 174104 40.785232 VIN 161042 37.725356 drive 130567 30.586347 paint_color 130203 30.501078 type 92858 21.752717 manufacturer 17646 4.133714 title_status 8242 1.930753 model 5277 1.236179 odometer 4400 1.030735 fuel 3013 0.705819 transmission 2556 0.598763 year 1205 0.282281 posting_date 68 0.015930 image_url 68 0.015930 latitude 0 0.000000 id 0 0.000000 state 0 0.000000 price 0 0.000000 longitude 0 0.000000
# Creating a heatmap for visualisation missing values
plt.figure(figsize=(14, 6))
sns.heatmap(data.isnull(), cbar=False, cmap='crest', yticklabels=False)
plt.title("Missing Data Heatmap (blue = missing data)", fontsize=16)
plt.xlabel("Columns", fontsize=14)
plt.ylabel("", fontsize=14)
plt.tight_layout()
plt.show()
📌 Observation: The size column has over 70% missing values, so it's impractical to impute and will likely be dropped during data preparation, considering other factors
Observation: The size column has 71.77% missing values, making it unreliable for analysis due to the difficulty of imputing such a large proportion of missing data without introducing bias.
Relevance: The problem statement emphasizes pricing and demand analysis. While size could provide insights, it is less critical than columns like type, manufacturer, or fuel.
Alternative: The type column, with only 21.75% missing values, is a more robust alternative for categorization and aligns better with the analysis goals.
Recommendation: Drop the size column due to its high missing percentage and focus on the type column for meaningful categorization and insights.
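To sanity-check this recommendation, one can measure how often type is available where size is missing (a minimal sketch using the already-loaded data):
# Share of all rows where 'type' is present but 'size' is missing
mask = data['size'].isnull() & data['type'].notnull()
print(f"Rows with 'type' present but 'size' missing: {mask.mean() * 100:.2f}% of the dataset")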
3.3.6 Unique values count for each column¶
# Counting unique values in each column and presenting them in a DataFrame
unique_values_df = pd.DataFrame({
"Column Name": data.columns,
"Unique Value Count": [data[col].nunique() for col in data.columns]
})
# Display the DataFrame
print("Unique Value Counts:\n", unique_values_df)
Unique Value Counts: Column Name Unique Value Count 0 id 426880 1 price 6253 2 year 114 3 manufacturer 42 4 model 29667 5 condition 6 6 cylinders 8 7 fuel 5 8 odometer 104870 9 title_status 6 10 transmission 3 11 VIN 118264 12 drive 3 13 size 4 14 type 13 15 paint_color 12 16 image_url 241899 17 state 35 18 posting_date 333355 19 latitude 191354 20 longitude 162536
📌 Observation:
- Outliers might exist as the average car life is around 20 years, but 114 unique years are recorded, which seems unrealistic.
- The size column has only 4 unique values and, as previously observed, over 70% missing data, making it unfit for analysis and a good candidate for removal.
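To see which implausible manufacturing years are actually present, a quick peek at the extremes helps (a minimal sketch):
# Inspect the ten oldest manufacturing years recorded
print(np.sort(data['year'].dropna().unique())[:10])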
3.3.7 Statistical summary of price, year & odometer¶
# Checking summary statistics for specific numerical columns
print("\nSummary Statistics (Price, Year, and Odometer Only):")
print(data[['price', 'year', 'odometer']].describe())
Summary Statistics (Price, Year, and Odometer Only): price year odometer count 4.268800e+05 425675.000000 4.224800e+05 mean 9.657028e+06 2011.235191 9.804333e+04 std 1.564439e+09 9.452120 2.138815e+05 min 0.000000e+00 1900.000000 0.000000e+00 25% 7.576800e+05 2008.000000 3.770400e+04 50% 1.791970e+06 2013.000000 8.554800e+04 75% 3.401080e+06 2017.000000 1.335425e+05 max 4.798934e+11 2022.000000 1.000000e+07
⚠ Observation: These stats are just to get an idea, results may be affected by missing values or outliers
- Price: Goes from ₹0 to ₹479 billion, with some prices looking way too high or too low to be real.
- Year: Starts at 1900, which doesn’t make sense, so it needs a closer look.
- Odometer: Ranges from 0 to 10 million km, with some unrealistic readings and placeholder values like 0.
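Before any cleaning, extreme quantiles give a quick sense of how widespread these implausible values are (a minimal sketch):
# Tail quantiles for price and odometer; quantile() ignores NaNs by default
print(data[['price', 'odometer']].quantile([0.001, 0.01, 0.99, 0.999]))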
3.3.8 Getting the timeline of the dataset¶
# Calculating the timeline while ignoring missing values
valid_dates = pd.to_datetime(data['posting_date'], errors='coerce').dropna()
# Get the minimum and maximum dates from valid entries
timeline_start = valid_dates.min().date()
timeline_end = valid_dates.max().date()
print(f"Dataset timeline: {timeline_start} to {timeline_end}")
Dataset timeline: 2021-04-04 to 2021-05-05
4.1 Creating a Backup of the Dataset ( df ) to Work On¶
Creating a copy of the original dataset is a good practice because it:
- Prevents Data Loss: If cleaning goes wrong, the original data remains untouched for reference or rollback.
- Enables Experimentation: We can try different cleaning methods on the copy without risking the original data.
- Maintains Integrity: Keeps a record of the raw data for comparison or future use.
# Creating a copy of the dataset
df = data.copy()
# Confirmation by printing the first few rows of the copied dataset
print("Copy of the dataset has been created with the name 'df'")
Copy of the dataset has been created with the name 'df'
4.2 Structuring the Dataset¶
4.2.1 Ensuring Consistency by Lowercasing Text Data¶
# Convert only string columns to lowercase
for col in df.select_dtypes(include=['object']).columns:
df[col] = df[col].map(lambda x: x.lower() if isinstance(x, str) else x)
4.2.2 Renaming the 'year' Column to 'mfg_year'¶
Renaming the column "year" to "mfg_year" improves clarity and avoids confusion, making it clear that the column refers to the manufacturing year of the vehicle, especially for those unfamiliar with the dataset structure.
# Renaming the 'year' column to 'mfg_year'
df.rename(columns={'year': 'mfg_year'}, inplace=True)
# Confirmation
print("The 'year' column has been renamed to 'mfg_year' for better understanding.")
The 'year' column has been renamed to 'mfg_year' for better understanding.
4.2.3 Removing Special Characters and Keeping Only Letters, Numbers and Single Spaces¶
# Function to clean special characters in string columns
def clean_special_characters(column):
return column.str.replace(r'[^a-zA-Z0-9 ]', '', regex=True) # Keep only letters, numbers, and spaces
# List of columns to exclude
exclude_columns = ['image_url', 'posting_date', 'size']
# Apply the cleaning function to string columns except the excluded ones
for col in df.select_dtypes(include=['object']).columns:
if col not in exclude_columns:
df[col] = clean_special_characters(df[col])
4.3 Datatype Correction¶
Correcting data types ensures the information is stored in the right way, making it easier to work with and avoiding confusion or mistakes later.
4.3.1 Correcting mfg_year from 'float' to 'int64'¶
# Converting 'mfg_year' to an integer-compatible type
df['mfg_year'] = df['mfg_year'].astype('Int64')
# Confirmation message
print("The 'mfg_year' column has been converted from float to Int64")
The 'mfg_year' column has been converted from float to Int64
4.3.2 Correcting posting_date from 'object' to 'date'¶
# Convert posting_date to datetime
df['posting_date'] = pd.to_datetime(df['posting_date'])
# Confirmation
print("The 'posting_date' column has been successfully converted from object to date format.")
The 'posting_date' column has been successfully converted from object to date format.
4.4 Removed Duplicate Rows¶
Removing duplicate rows ensures the dataset is clean, avoids redundancy, and prevents repeated data from skewing the analysis
4.4.1 Checking duplicate rows¶
# Checking for duplicate rows
total_rows = df.shape[0]
num_duplicates = df.duplicated().sum()
# Printing the results
print(f"Total number of rows: {total_rows}")
print(f"Number of duplicate rows: {num_duplicates}")
Total number of rows: 426880 Number of duplicate rows: 0
4.4.2 Removing duplicate rows if present¶
# Removing duplicate rows
df = df.drop_duplicates()
# Confirmation
print(f"Duplicate rows removed: {num_duplicates}")
print(f"Remaining rows in the dataset: {df.shape[0]}")
Duplicate rows removed: 0 Remaining rows in the dataset: 426880
4.5 Dropping Unnecessary Columns¶
4.5.1 Dropping the image_url column¶
- The image_url column is dropped because it is not relevant for analysis related to demand or pricing.
- It serves as a visual reference rather than contributing to meaningful insights into car attributes or transactions.
- Retaining it would only increase data size without adding value to the problem statement's focus on optimizing demand and pricing.
- Dropping image_url helps streamline the dataset, focusing only on columns critical to analysis and decision-making.
# Dropping the 'image_url' column
df = df.drop(columns=['image_url'])
# Displaying confirmation
print("Column 'image_url' has been dropped.")
Column 'image_url' has been dropped.
4.5.2 Dropping the size column¶
- The size column is dropped because it has 71.77% missing values, making it unsuitable for reliable analysis.
- The type column provides more detailed and granular insights into vehicle categories (e.g., sedan, SUV) compared to size.
- The type column has only 21.75% missing values, making it more complete and reliable for analysis.
- Retaining type aligns better with the problem statement's focus on understanding demand and pricing based on car attributes.
- Dropping size helps streamline the dataset and ensures a cleaner, more actionable analysis.
# Dropping the 'size' column
df = df.drop(columns=['size'])
# Displaying confirmation
print("Column 'size' has been dropped.")
Column 'size' has been dropped.
4.5.3 Assessing and Deciding the Relevance of VIN (Vehicle Identification Number)¶
VIN (Vehicle Identification Number) is a 17-character alphanumeric code that is unique to a single vehicle and associated with only one manufacturer. It acts as the vehicle's fingerprint, providing details such as the manufacturer, model, year of production, and other specifications. The first three characters of the VIN (the World Manufacturer Identifier, or WMI) identify the manufacturer of the vehicle, so the WMI can be used to fill in missing manufacturer values (see the sketch below).
While VIN serves as a globally unique identifier, its relevance to the current analysis must be carefully evaluated based on its contribution to the objectives of pricing and demand optimisation.
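As an illustration of the WMI idea, a lookup like the sketch below could recover the manufacturer from the first three VIN characters. The wmi_map entries are a few illustrative examples, not a complete reference table:
# Hypothetical WMI lookup: first 3 VIN characters -> manufacturer
wmi_map = {
    '1hg': 'honda',  # illustrative entries only
    'wba': 'bmw',
    '1ft': 'ford',
}
def manufacturer_from_wmi(vin):
    # Return a manufacturer guess from the VIN's WMI, or None if unknown
    if isinstance(vin, str) and len(vin) == 17:
        return wmi_map.get(vin[:3].lower())
    return None
# Example usage: fill missing manufacturer values from valid VINs
# df['manufacturer'] = df['manufacturer'].fillna(df['VIN'].map(manufacturer_from_wmi))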
4.5.3.1 Counting Repeated VINs¶
Objective: To determine how frequently VINs are repeated in the dataset and assess their impact.
# Count repeated VIN values in the DataFrame
repeated_vin_count = df['VIN'].value_counts()[df['VIN'].value_counts() > 1].count()
# Print the result
print(f"Number of total repeated VINs in dataset is: {repeated_vin_count}")
Number of total repeated VINs in dataset is: 40280
Observation:
A high number of repeated VINs indicates that many vehicles have been listed multiple times. These repetitions might result from repostings, price updates, or placeholder entries. Understanding these repetitions is critical to evaluate their impact on data quality and insights.
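To see what these repetitions look like in practice, one can pull up every listing behind the most frequently repeated VIN (a minimal sketch):
# Inspect all listings sharing the most-repeated VIN
top_vin = df['VIN'].value_counts().idxmax()
print(df.loc[df['VIN'] == top_vin, ['price', 'posting_date', 'state']])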
4.5.3.2 Assessing Rows to Be Removed if Repeated VINs Are Excluded¶
Objective: To evaluate the extent of data loss if repeated VINs are excluded entirely.
# Get the unique repeated VINs
unique_repeated_vins = df['VIN'].value_counts()[df['VIN'].value_counts() > 1].index.tolist()
# Count how many rows will be removed
rows_to_remove = df[df['VIN'].isin(unique_repeated_vins)].shape[0]
# Print the result
print(f"Number of rows that will be removed if we removed repeated VINs will be: {rows_to_remove}")
Number of rows that will be removed if we removed repeated VINs will be: 187854
Observation:
Removing all repeated VINs would eliminate a significant portion of the dataset (over 40%), which could result in the loss of valuable information about market behaviour, price adjustments, and demand trends. This calls for further exploration to decide whether retaining or removing VIN is more suitable.
4.5.3.3 Examining Unique VINs and Patterns¶
Objective: To explore the uniqueness of VIN entries and their contribution to the dataset by finding patterns that identify unusual VINs.
# Display random rows from the DataFrame to inspect VINs
random_vins = df.sample(n=10, random_state=42) # Adjust n for the number of rows we want
print(random_vins[['VIN']]) # Display only the VIN column
VIN 100905 1ftfx1eg9hkd14814 143835 NaN 20235 NaN 300734 NaN 316249 1gnskgkc7kr124145 163902 2c4gp44392r547816 353675 NaN 265426 NaN 231608 1fa6p8th9g5224182 297925 1gc1kwe87ff570325
Observation: The sample shows a mix of plausible 17-character VINs and missing values. To assess VIN quality systematically, each entry will be flagged against the following checks:
- Non-String VIN: Flags VINs that are not strings.
- Invalid Length: Identifies VINs not exactly 17 characters long.
- Forbidden Letters: Detects presence of restricted characters (I, O, Q).
- Non-Alphanumeric Characters: Flags VINs containing symbols or spaces.
- All Numbers or Letters: Identifies VINs composed entirely of digits or letters.
- Imbalanced Letters-Numbers: Flags VINs lacking both letters and numbers.
- Excessive Repetition: Detects VINs with any character repeated more than 10 times.
4.5.3.4 Filtering Rows with Unusual VINs¶
Objective: To isolate rows containing invalid or problematic VINs for further analysis or removal, ensuring data quality
1. Identifying Unusual VIN Patterns
- Purpose: Detect VIN entries that deviate from standard rules using various pattern checks.
- Key Criteria: The function evaluates VINs based on properties like length, forbidden characters, alphanumeric composition, letter-number balance, and repetition.
2. Processing and Filtering Unusual VINs
- Function Used: identify_unusual_vin_patterns classifies VINs as "Valid" or provides the specific issue(s) based on the patterns.
- Implementation: The .apply() method is used to apply the function across all VIN entries, filtering rows with issues into a separate DataFrame (df_temp).
3. Summarising Unusual VINs
- Purpose: Count and summarise the occurrences of unusual VINs to understand their impact on the dataset.
- Methodology: The value_counts() function is used to aggregate occurrences, and the results are structured into a summary table, with an additional row for the total count.
def identify_unusual_vin_patterns(vin):
if not isinstance(vin, str):
return "Non-String"
patterns = [
"Invalid Length" if len(vin) != 17 else "",
"Forbidden Letters (I, O, Q)" if any(c in "IOQ" for c in vin) else "",
"Non-Alphanumeric Characters" if not vin.isalnum() else "",
"All Numbers" if vin.isdigit() else "",
"All Letters" if vin.isalpha() else "",
"Imbalanced Letters-Numbers" if sum(c.isalpha() for c in vin) == 0 or sum(c.isdigit() for c in vin) == 0 else "",
"Excessive Repetition" if max(vin.count(c) for c in set(vin)) > 10 else "",
]
return ", ".join(filter(bool, patterns)) or "Valid"
# Process VINs directly without modifying the original DataFrame
df_temp = df[df['VIN'].apply(identify_unusual_vin_patterns) != "Valid"]
# Summarise unusual VINs
unusual_vin_counts = df_temp['VIN'].value_counts()
unusual_vin_summary = unusual_vin_counts.rename_axis('VIN').reset_index(name='Row Count') # robust across pandas versions
# Add total row
unusual_vin_summary.loc[len(unusual_vin_summary)] = ['Total', len(df_temp)]
# Display results
print(unusual_vin_summary)
print(f"Total number of rows carried by unusual VINs: {len(df_temp)}")
print(f"Number of unique unusual VINs: {len(unusual_vin_counts)}")
Row Count count 0 13131313131313131 85 1 15151515151515151 81 2 14141414141414141 52 3 11111111111111111 24 4 cr315045444 23 ... ... ... 1068 124871n507703 1 1069 f8zs2993 1 1070 3n67k5m336901 1 1071 ccl449f505274 1 1072 Total 162803 [1073 rows x 2 columns] Total number of rows carried by unusual VINs: 162803 Number of unique unusual VINs: 1072
Observation:
A large number of VIN entries were invalid, with some being placeholders like "11111111111111111". These invalid entries highlight data quality issues and undermine the reliability of VIN for deriving meaningful insights.
This shows that much of the VIN data is incorrect and that most repeated VINs are not valid.
4.5.4 Dropping VIN Column Based On Observations¶
Dropping the VIN column ensures a cleaner, more focused dataset aligned with the project objectives. By removing noisy and irrelevant data, this decision enhances the accuracy of demand and pricing analyses while maintaining the integrity of the insights derived from the remaining columns.
Justification for Dropping VIN:
- Data Quality Issues: A significant proportion of VIN entries are invalid, placeholders, or duplicates, reducing their reliability for analysis.
- Limited Relevance: VIN is redundant for key insights as attributes like manufacturer, model, and mfg_year already provide sufficient detail.
- Simplified Dataset: Removing VIN streamlines the dataset, focusing on actionable variables related to pricing and demand.
# Dropping the VIN column
df = df.drop(columns=['VIN'])
# Confirm the removal
print("VIN column successfully removed from the dataset.")
VIN column successfully removed from the dataset.
4.6 Outliers Handling¶
Outlier: A value that is very different from the rest of the data (example - a car with 5,00,000 km running when most are around 50,000–1,00,000 km).
Why Remove:
- They can distort averages and trends.
- Ensure analysis is accurate and reflects real patterns in the data
4.6.1 Price Outlier Detection and Removal with IQR¶
The IQR (Interquartile Range) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data, basically capturing the range of the middle 50% of values. Outliers are values that are way too small (below Q1 − 1.5 × IQR) or way too big (above Q3 + 1.5 × IQR). By removing these, we get rid of extreme points that can mess up our analysis, keeping the focus on the main data trends.
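Since the same rule is applied to both price and odometer below, the bound calculation can be captured in a small reusable helper (a sketch; the notebook itself computes the bounds inline):
# Reusable IQR fence calculator; quantile() ignores NaNs by default
def iqr_bounds(series, k=1.5):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
# Example usage:
# low, high = iqr_bounds(df['price'])
# df = df[df['price'].between(low, high)]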
4.6.1.1 Spotting Outliers with Boxplot¶
Boxplots help identify extreme values by showing the range and central tendency of the data.
- Boxplots are superior for simplicity, clarity, and direct representation of key statistics
- summary of the data distribution (median, spread, and outliers)
# Boxplot without creating a new column
plt.figure(figsize=(10, 4))
sns.boxplot(x=df['price'] / 1e6, color='cyan')
plt.xlabel('Price (in Millions)')
plt.show()
4.6.1.2 Outlier Removal and Confirmation¶
We removed extreme values using the IQR method to clean the dataset. The smallest and largest outliers are shown as examples to confirm the process and ensure valid data is retained
# Calculating IQR for Outlier Detection
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identifying Outliers (Single Step)
outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]
# Displaying Extreme End Outliers
extreme_low_outliers = outliers.nsmallest(1, 'price')[['price', 'odometer', 'mfg_year']]
extreme_high_outliers = outliers.nlargest(1, 'price')[['price', 'odometer', 'mfg_year']]
print("Extreme low outlier:")
print(extreme_low_outliers)
print("\nExtreme high outlier:")
print(extreme_high_outliers)
print("\nDataset size before cleaning:", df.shape[0])
# Removing Outliers and Updating the Original Dataset
df = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]
# Printing Results
print("\nDataset size after cleaning:", df.shape[0])
print("Number of outliers removed:", outliers.shape[0])
Extreme low outlier: price odometer mfg_year 303716 7371250.0 435400.0 2004 Extreme high outlier: price odometer mfg_year 318592 4.798934e+11 164000.0 2007 Dataset size before cleaning: 426880 Dataset size after cleaning: 418703 Number of outliers removed: 8177
4.6.1.3 Validating Outlier Removal with Boxplot¶
# Boxplot without creating a new column
plt.figure(figsize=(10, 4))
sns.boxplot(x=df['price'] / 1e6, color='cyan')
plt.xlabel('Price (in Millions)')
plt.show()
📌 Some points near the boundaries still show up in the boxplot because they are within the acceptable range and not extreme enough to be flagged as outliers. By comparing with the boxplot before outlier removal, we can see that these values are valid and represent normal variations in the data
4.6.2 Odometer Outlier Detection and Removal with IQR¶
4.6.2.1 Spotting outliers with violin plot¶
The violin plot is used to confirm that extreme odometer values are effectively handled while providing a clearer view of the data distribution and density.
Why is a Violin Plot Better than a Boxplot?
- Distribution: A violin plot shows the spread and shape of the data, unlike a boxplot which only displays summary statistics.
- Visual Density: It reveals where the data points are concentrated, making patterns or skewness easier to detect.
- More Informative: Combines the benefits of a boxplot with a smooth visualization of the data distribution.
# Visualize the Violin Plot for Odometer (in lakh kilometers) excluding missing values
plt.figure(figsize=(12, 6))
sns.violinplot(x=df[df['odometer'].notnull()]['odometer'] / 100000, color='greenyellow') # Exclude missing values
plt.xlabel('Odometer (in lakh km)')
plt.title('Odometer Distribution (Before removing outliers)')
plt.show()
4.6.2.2 Outlier Removal and Confirmation¶
- We removed extreme odometer values using the IQR method to clean the dataset. The smallest and largest outliers are shown as examples to confirm the process and ensure valid data is retained.
- Also making sure that missing values don't affect the calculations.
# Exclude missing values for odometer for calculations
df_filtered = df[df['odometer'].notnull()]
# Calculate IQR for Odometer Outlier Detection
Q1_odometer = df_filtered['odometer'].quantile(0.25)
Q3_odometer = df_filtered['odometer'].quantile(0.75)
IQR_odometer = Q3_odometer - Q1_odometer
lower_bound_odometer = Q1_odometer - 1.5 * IQR_odometer
upper_bound_odometer = Q3_odometer + 1.5 * IQR_odometer
# Identify Odometer Outliers
odometer_outliers = df_filtered[(df_filtered['odometer'] < lower_bound_odometer) | (df_filtered['odometer'] > upper_bound_odometer)]
# Display Extreme End Odometer Outliers
extreme_low_odometer = odometer_outliers.nsmallest(1, 'odometer')[['price', 'odometer', 'mfg_year']]
extreme_high_odometer = odometer_outliers.nlargest(1, 'odometer')[['price', 'odometer', 'mfg_year']]
print("Extreme low odometer outlier:")
print(extreme_low_odometer)
print("\nExtreme high odometer outlier:")
print(extreme_high_odometer)
print("\nDataset size before odometer outlier removal:", df.shape[0])
# Remove Odometer Outliers and Retain Missing Values in the Original Dataset
df = df[(df['odometer'].isnull()) | ((df['odometer'] >= lower_bound_odometer) & (df['odometer'] <= upper_bound_odometer))]
# Print Results
print("\nDataset size after odometer outlier removal:", df.shape[0])
print("Number of odometer outliers removed:", odometer_outliers.shape[0])
Extreme low odometer outlier: price odometer mfg_year 132381 205460.0 277814.0 1992 Extreme high odometer outlier: price odometer mfg_year 9218 513660.0 10000000.0 1975 Dataset size before odometer outlier removal: 418703 Dataset size after odometer outlier removal: 414473 Number of odometer outliers removed: 4230
4.6.2.3 Validating outlier removal with violin plot¶
# Visualize the Violin Plot for Odometer (in lakh kilometers) excluding missing values
plt.figure(figsize=(12, 6))
sns.violinplot(x=df[df['odometer'].notnull()]['odometer'] / 100000, color='greenyellow') # Excluding missing values
plt.xlabel('Odometer (in lakh km)')
plt.title('Odometer Distribution (After outliers removal)')
plt.show()
4.7 Dropping Rows Based on Conditions¶
4.7.1 Dropping Rows With Missing, Irrelevant & Outdated mfg_year¶
In this step we'll focus on removing rows with outdated and future manufacturing years (mfg_year) that do not align with the project's goals of analyzing demand and pricing trends, ensuring a cleaner and more meaningful dataset.
4.7.1.1 Checking min and max of mfg_year¶
# Count occurrences of each year
year_counts = df['mfg_year'].value_counts().sort_index()
# Compute statistics
mean_year, median_year, std_year = df['mfg_year'].mean(), df['mfg_year'].median(), df['mfg_year'].std()
# Create figure
fig, ax1 = plt.subplots(figsize=(12,6))
# Bar chart for manufacturing year distribution
ax1.bar(year_counts.index, year_counts.values, color='skyblue', alpha=0.6, label="Car Count")
# Line plot for trend visualization
sns.lineplot(x=year_counts.index, y=year_counts.values, marker="o", color="royalblue", linewidth=2, ax=ax1)
# Add statistical markers (Mean, Median, Std Dev)
for val, color, label in [(mean_year, 'blue', 'Mean'), (median_year, 'red', 'Median')]:
ax1.axvline(val, color=color, linestyle='dashed', linewidth=2, label=f'{label}: {int(val)}')
ax1.axvline(mean_year + std_year, color='orange', linestyle='dashed', linewidth=2, label=f'Std Dev (+1σ)')
# Labels & Titles
ax1.set_title("Manufacturing Year Distribution", fontsize=14, fontweight='bold')
ax1.set_xlabel("Manufacturing Year")
ax1.set_ylabel("Number of Cars")
ax1.legend()
ax1.grid(axis="y", linestyle="--", alpha=0.5)
# Show chart
plt.show()
# Checking the minimum and maximum manufacturing year
min_year = df['mfg_year'].min()
max_year = df['mfg_year'].max()
# Displaying the results
print(f"The oldest car posted has mfg_year {min_year}")
print(f"The newest car posted has mfg_year {max_year}")
The oldest car posted has mfg_year 1900 The newest car posted has mfg_year 2022
4.7.1.2 Removing rows where mfg_year > posting_date¶
# Get the initial number of rows
initial_rows = df.shape[0]
# Remove rows where mfg_year is 2022
df = df[df['mfg_year'] != 2022]
# Get the updated number of rows
final_rows = df.shape[0]
# Calculate the total number of rows removed
rows_removed = initial_rows - final_rows
# Print the results
print(f"Total rows removed: {rows_removed}")
print(f"Updated DataFrame rows: {final_rows}")
Total rows removed: 1134 Updated DataFrame rows: 413339
Here we are removing future manufacturing years: every posting in this dataset is from 2021, so mfg_year 2022 is not possible.
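A more general guard (a sketch, assuming posting_date has already been converted to datetime as in 4.3.2) would compare each row's mfg_year with the year it was posted instead of hard-coding 2022:
# Hypothetical generalisation: drop rows whose mfg_year is after the posting year
posting_year = df['posting_date'].dt.year  # NaT becomes NaN
future_mfg = (df['mfg_year'] > posting_year).fillna(False)
df = df[~future_mfg]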
4.7.1.3 Assessing rows to drop: mfg_year < 2006 for relevant analysis¶
Assessing factors¶
- Market Relevance: Focuses exclusively on the core market (2006–2021)
- Regulatory Compliance: Fully aligned with Indian regulations for most cities (diesel ≤ 10 years, petrol ≤ 15 years). While a 5-year extension is possible, with the government's focus on promoting electric vehicles and reducing pollution, such extensions may stop being granted at any time.
- Business Goal Alignment: Highly aligned with Spinny’s focus on pricing and demand for in-demand, marketable cars.
- Insights for Pricing: Focused pricing insights for highly marketable and in-demand cars.
- Demand Analysis: Focuses entirely on core demand, which is Spinny’s primary objective.
- Data Quality: Clean, high-quality data from modern, marketable cars.
Why Not to Include Cars Before 2006? They contribute minimal value to the analysis as they represent outdated models with low demand and limited buyer interest. These vehicles often lack modern features and appeal, making them less competitive in today’s market. Including them may dilute insights, introduce noise in pricing optimization, and misalign with Spinny’s goal of identifying marketable cars and optimizing demand.
Impact on the dataset if we drop rows with mfg_year before 2006 & missing mfg_year¶
# Calculating the total number of rows in the dataset
total_rows = df.shape[0]
# Calculating the number of rows manufactured before 2006
before_2006 = df[df['mfg_year'] < 2006].shape[0]
# Calculating the number of rows manufactured in or after 2006
after_2006 = df[df['mfg_year'] >= 2006].shape[0]
# Counting the missing values in the 'mfg_year' column
missing_values = df['mfg_year'].isna().sum()
# Determining the number of rows remaining after removing 'before 2006' and 'missing'
remaining_rows = after_2006 # Rows from 2006 onwards (excluding missing values)
# Calculating percentages for each category
before_2006_pct = (before_2006 / total_rows) * 100
after_2006_pct = (after_2006 / total_rows) * 100
missing_values_pct = (missing_values / total_rows) * 100
# Printing the results
print(f"total rows in the dataset: {total_rows}")
print(f"percentage of rows manufactured before 2006: {before_2006_pct:.2f}%")
print(f"percentage of rows manufactured from 2006 onwards: {after_2006_pct:.2f}%")
print(f"percentage of missing values in 'mfg_year': {missing_values_pct:.2f}%")
print(f"rows remaining after removing 'before 2006' and 'missing': {remaining_rows}")
total rows in the dataset: 413339 percentage of rows manufactured before 2006: 15.13% percentage of rows manufactured from 2006 onwards: 84.87% percentage of missing values in 'mfg_year': 0.00% rows remaining after removing 'before 2006' and 'missing': 350793
Interpretation:
- Core Focus: After removing cars manufactured before 2006, 84.87% of rows remain, focusing on the core market (2006–2021) with modern, in-demand cars.
- Missing Values in mfg_year: The summary above shows 0.00% missing values in mfg_year at this stage (they were already removed by the earlier cleaning steps); any that remained would be dropped, as manufacturing year is a critical variable for demand and pricing analysis.
- Business Alignment: The refined dataset improves data quality and aligns with Spinny's goals of optimizing pricing, demand, and market trends effectively.
We decided to drop rows with mfg_year before 2006 because these cars are outdated, have low demand, and lack modern features that align with today's buyer preferences. Retaining only cars from 2006 onwards ensures a cleaner dataset, focusing on the core market (2006–2021) that represents 84.87% of the data. This decision improves data quality, aligns with Spinny's goals of optimizing pricing and demand for marketable cars, and ensures compliance with regulatory and market trends.¶
4.7.1.4 Dropping Rows: mfg_year < 2006 and missing values in mfg_year¶
# Step 1: Remove rows with missing values in 'mfg_year'
df = df[df['mfg_year'].notnull()]
# Step 2: Remove rows where 'mfg_year' is less than 2006
df = df[df['mfg_year'] >= 2006]
# Confirmation prints
print(f"Total rows remaining: {df.shape[0]}")
print(f"Minimum mfg_year in the dataset: {df['mfg_year'].min()}")
print(f"Rows with missing mfg_year: {df['mfg_year'].isna().sum()}")
Total rows remaining: 350793 Minimum mfg_year in the dataset: 2006 Rows with missing mfg_year: 0
# Count occurrences of each year
year_counts = df['mfg_year'].value_counts().sort_index()
# Compute statistics
mean_year, median_year, std_year = df['mfg_year'].mean(), df['mfg_year'].median(), df['mfg_year'].std()
# Create figure
fig, ax1 = plt.subplots(figsize=(12,6))
# Bar chart for manufacturing year distribution
ax1.bar(year_counts.index, year_counts.values, color='skyblue', alpha=0.6, label="Car Count")
# Line plot for trend visualization
sns.lineplot(x=year_counts.index, y=year_counts.values, marker="o", color="royalblue", linewidth=2, ax=ax1)
# Add statistical markers (Mean, Median, Std Dev)
for val, color, label in [(mean_year, 'blue', 'Mean'), (median_year, 'red', 'Median')]:
ax1.axvline(val, color=color, linestyle='dashed', linewidth=2, label=f'{label}: {int(val)}')
ax1.axvline(mean_year + std_year, color='orange', linestyle='dashed', linewidth=2, label=f'Std Dev (+1σ)')
# Labels & Titles
ax1.set_title("Manufacturing Year Distribution", fontsize=14, fontweight='bold')
ax1.set_xlabel("Manufacturing Year")
ax1.set_ylabel("Number of Cars")
ax1.legend()
ax1.grid(axis="y", linestyle="--", alpha=0.5)
# Show chart
plt.show()
4.7.2 Dropping Rows Where Price Is Zero¶
Why Dropping Rows with Price = 0 Makes Sense
Relevance to Revenue Growth: The objective explicitly focuses on revenue optimization. Entries with a price of 0 indicate that the car was either not sold or erroneously recorded. These entries cannot contribute to the revenue analysis or pricing models and hence are not useful for achieving the stated goal.
Demand and Pricing Optimization: A key focus is determining optimal pricing. Cars with a price of 0 cannot provide meaningful insights into pricing strategies because they lack valid price points to analyze trends, demand elasticity, or market competitiveness. Including such entries could skew any pricing analysis.
Data Cleaning and Preprocessing: One of the objectives is data cleaning and preprocessing. Handling missing or invalid data is a critical step in preparing the dataset for analysis. Since a price of 0 is not realistic in the context of car transactions, these rows represent anomalies or incomplete data and should be removed to ensure the quality and reliability of subsequent analyses.
Exploratory Data Analysis (EDA): Conducting EDA to uncover trends and patterns is another stated objective. Rows with price = 0 do not reflect valid transactions and would distort analyses of revenue, demand trends, and pricing distributions, leading to inaccurate insights.
Business Objectives and Goals Referenced
The decision to drop rows with price = 0 is directly tied to these stated objectives:
- Boost Revenue: Only transactions with valid prices contribute to understanding revenue growth.
- Optimize Pricing Strategies: Accurate pricing data is essential to identify trends, set competitive prices, and align with market demand.
- Understand Demand: Invalid price entries do not represent true demand patterns or buyer behavior and would compromise the demand analysis.
# Total number of rows in the dataset before cleaning
total_rows = df.shape[0]
# Counting rows where price is zero
rows_with_price_zero = df[df['price'] == 0].shape[0]
# Removing rows where price equals zero
# This modifies the original dataset
df = df[df['price'] != 0]
# Remaining rows after cleaning
remaining_rows = df.shape[0]
# Printing the results
print(f"Total rows in the dataset: {total_rows}")
print(f"Rows removed where price = 0: {rows_with_price_zero}")
print(f"Remaining rows after cleaning: {remaining_rows}")
Total rows in the dataset: 350793 Rows removed where price = 0: 32420 Remaining rows after cleaning: 318373
4.7.3 Dropping Rows Based On Missing Column Thresholds¶
Incomplete Analysis: Rows with more than 6 missing columns lack critical information needed for accurate analysis, such as demand trends, pricing strategies, and regional behavior. Missing data for key attributes (e.g., condition, fuel type, cylinders) limits the ability to identify factors influencing demand or optimize prices.
Distorted Insights: Retaining such rows introduces noise, leading to skewed patterns and unreliable insights during exploratory data analysis (EDA). For instance, demand or pricing trends could appear inaccurate due to incomplete data.
Impact on Statistical Models: Predictive models or hypothesis testing rely on complete data to generate accurate results. Rows with significant missing data can bias models, reduce performance, and lead to poor decisions.
Data Imputation Challenges: Filling or imputing values for so many missing columns is impractical and risks introducing assumptions that may not reflect real-world scenarios, leading to artificial patterns.
Focus on Usable Data: Dropping rows with excessive missing values ensures the dataset is focused on high-quality, reliable data. This aligns with the problem statement’s goal of generating actionable insights to optimize revenue and pricing strategies.
In summary, keeping rows with so many missing columns can compromise the quality, accuracy, and reliability of the analysis, making it more practical and beneficial to exclude them.
# Step 1: Counting the number of missing values for each row
missing_per_row = df.isnull().sum(axis=1) # Sum missing values across columns for each row
# Step 2: Counting the total rows for each unique count of missing columns
missing_summary = missing_per_row.value_counts().sort_index() # Count rows for each missing column count
# Step 3: Displaying results in the desired format
print("Row-wise missing values summary (number of missing columns per row)")
for num_missing, total_rows in missing_summary.items():
print(f"No of Missing Column: {num_missing:<3} Total Rows: {total_rows}")
Row-wise missing values summary (number of missing columns per row) No of Missing Column: 0 Total Rows: 85417 No of Missing Column: 1 Total Rows: 86510 No of Missing Column: 2 Total Rows: 66850 No of Missing Column: 3 Total Rows: 35933 No of Missing Column: 4 Total Rows: 14348 No of Missing Column: 5 Total Rows: 27017 No of Missing Column: 6 Total Rows: 1982 No of Missing Column: 7 Total Rows: 11 No of Missing Column: 8 Total Rows: 110 No of Missing Column: 9 Total Rows: 195
# Step 1: Identifying rows with more than 6 missing columns
rows_to_drop = df[df.isnull().sum(axis=1) > 6].index # Get indices of rows with > 6 missing columns
# Step 2: Dropping these rows from the dataframe
df = df.drop(index=rows_to_drop) # Drops rows from the original dataframe
# Step 3: Displaying confirmation
print(f"Rows dropped: {len(rows_to_drop)}")
print(f"Remaining rows in the dataset: {len(df)}")
Rows dropped: 316 Remaining rows in the dataset: 318057
4.7.4 Dropping Rows with Missing Values in Columns Having Less Than 2% Missing Data¶
# Calculating and summarising missing values
missing_data = df.isnull().sum().to_frame('missing values')
missing_data['missing %'] = (missing_data['missing values'] / len(df)) * 100
# Filtering and sorting columns with missing values
missing_data = missing_data[missing_data['missing values'] > 0].sort_values(by='missing %', ascending=False)
# Displaying the summary
print(missing_data)
missing values missing % cylinders 135686 42.660907 condition 124608 39.177883 drive 97081 30.523145 paint_color 91796 28.861493 type 61647 19.382375 manufacturer 8485 2.667761 title_status 6245 1.963485 model 2013 0.632905 fuel 1815 0.570652 odometer 1702 0.535124 transmission 1300 0.408732
Explanation:
- Calculate Missing Percentages: Calculate the percentage of missing values for all columns.
- Filter Columns: Identify columns where missing values are less than 2%.
- Drop Missing Rows: Remove rows with missing values only in those selected columns.
- Print Results: Display the impact of cleaning on the dataset.
# Dropping rows with missing values from columns with <2% missing values
original_rows = len(df)
columns_to_clean = df.isnull().mean()[df.isnull().mean() < 0.02].index
df = df.dropna(subset=columns_to_clean)
rows_after_cleaning = len(df)
# Printing the results
print(f"Original dataset rows: {original_rows}")
print(f"Rows after cleaning: {rows_after_cleaning}")
print(f"Total rows dropped: {original_rows - rows_after_cleaning}")
# Printing the missing data summary after cleaning with percentages
print("\nMissing data summary after cleaning:")
print(df.isnull().sum()[df.isnull().sum() > 0].apply(
lambda x: f"{x} ({(x / len(df)) * 100:.2f}%)"
))
Original dataset rows: 318057 Rows after cleaning: 305423 Total rows dropped: 12634 Missing data summary after cleaning: manufacturer 8319 (2.72%) condition 118875 (38.92%) cylinders 131377 (43.01%) drive 92149 (30.17%) type 60058 (19.66%) paint_color 84143 (27.55%) dtype: object
🔻 Resetting Index After Dropping Many Rows¶
# Reset the index after dropping rows
df = df.reset_index(drop=True)
# Confirm the index reset
print(df.index) # Display the new index
RangeIndex(start=0, stop=305423, step=1)
4.8 Missing Values Imputation¶
4.8.1 Reallocating missing values from "model" column ( one to many )¶
Some entries in the model column, like "toyota-fwd-sedan", combine values for manufacturer (toyota), drive (fwd), and type (sedan). This method extracts these values and places them in their correct columns, ensuring accurate data without changing the model column.
Unique Value Extraction: Builds a dictionary of valid entries for target columns (fuel, drive, size, type, manufacturer) to ensure accurate matching.
Logic: A row-by-row function checks the model column for matches against these unique values. If a match is found and the target column is empty, the value is copied to its correct column; the function then continues checking and assigns matches for the other columns as well.
Handling complex data entries: Matching and reallocating misplaced values.
Dynamic preprocessing: Can be used to automate tasks when assigning values from one column to multiple columns
# List of target columns
target_columns = ['manufacturer', 'fuel', 'transmission', 'drive', 'type', 'paint_color']
# Step 1: Build a dictionary of unique values for each column
unique_values = {col: df[col].dropna().unique().tolist() for col in target_columns}
# Step 2: Ensure the 'model' column is treated as a string
df['model'] = df['model'].astype(str)
# Step 3: Function to reallocate missing values
def reallocate_values(row):
model_data = row['model'] # Ensure it's a string
for col in target_columns:
for value in unique_values[col]:
# General substring matching for target columns
if isinstance(value, str) and value in model_data and pd.isna(row[col]):
row[col] = value
return row
# Calculate missing values before processing
missing_before = df[target_columns].isna().sum()
# Step 4: Apply the function row-wise
df = df.apply(reallocate_values, axis=1)
# Calculate missing values after processing
missing_after = df[target_columns].isna().sum()
# Combine results into a comparison DataFrame
comparison_df = pd.DataFrame({
'Missing Before': missing_before,
'Missing After': missing_after
})
# Display the comparison for easy review
comparison_df['Difference'] = comparison_df['Missing Before'] - comparison_df['Missing After']
print("Missing Value Comparison Before and After Reallocation:")
print(comparison_df)
Missing Value Comparison Before and After Reallocation: Missing Before Missing After Difference manufacturer 8319 7826 493 fuel 0 0 0 transmission 0 0 0 drive 92149 91723 426 type 60058 57245 2813 paint_color 84143 80715 3428
4.8.2 Reducing missing "manufacturer" values with confirmed "model" based lookup ( one to one )¶
Why Fill manufacturer Using model?
Each model uniquely belongs to a specific manufacturer ("Civic" → Honda), making it the most reliable and logical choice for imputation. Also, manufacturer is an important parameter for analyzing demand across locations.
Mapping Creation: Creates a dictionary (model_to_manufacturer) linking each model to its corresponding manufacturer based on non-missing rows.
Efficiency: Quickly retrieves the manufacturer for a given model using a dictionary lookup, avoiding complex computations.
Preservation: Updates only missing values in the manufacturer column, leaving existing non-missing values unchanged.
Flexibility: Handles missing or unknown model values gracefully by leaving manufacturer unchanged in such cases, making it robust for large datasets.
# Function for imputing missing 'manufacturer' values based on 'model'
def impute_manufacturer_by_model(df):
# Creating a dictionary to map 'model' to 'manufacturer' based on non-missing values
model_to_manufacturer = df.dropna(subset=['manufacturer', 'model']).set_index('model')['manufacturer'].to_dict()
# Imputing only missing values in 'manufacturer' using the mapping, leaving all other rows untouched
df['manufacturer'] = df['manufacturer'].fillna(df['model'].map(model_to_manufacturer))
# Returning the DataFrame with only the required changes
return df
# Checking missing values in 'manufacturer' before
print(f"Missing values in 'manufacturer' before imputing: {df['manufacturer'].isnull().sum()}")
# Applying the function to impute missing 'manufacturer'
df = impute_manufacturer_by_model(df)
# Checking missing values in 'manufacturer' after
print(f"Missing values in 'manufacturer' after imputing: {df['manufacturer'].isnull().sum()}")
Missing values in 'manufacturer' before imputing: 7826 Missing values in 'manufacturer' after imputing: 6828
4.8.3 Using a custom dictionary to further reduce missing manufacturer values ( one to one )¶
Objective: To fill missing values in the manufacturer column by using the model column, ensuring the data is complete for analysing car demand and pricing.
Why This is Important: The manufacturer column plays a key role in understanding which brands are in demand and how they affect pricing. Accurate manufacturer data helps Spinny make better decisions about inventory and pricing strategies.
How We Did It:
We manually created a dictionary linking car models to their manufacturers (e.g., 'camry' → 'toyota', 'civic' → 'honda').
Missing manufacturers were filled by checking the model column against this dictionary.
Business Impact: This step made the data more complete and reliable, helping Spinny gain valuable insights into brand-specific trends and make smarter pricing decisions to grow revenue.
# Dictionary mapping specific car models to manufacturers
models_to_manufacturer = {
'toyota': ['camry', 'corolla', 'rav4', 'tacoma', '4runner', 'highlander', 'land cruiser', 'prius'],
'honda': ['accord', 'civic', 'cr-v', 'pilot', 'odyssey', 'fit', 'hr-v'],
'ford': ['f-150', 'f-250', 'f-350', 'explorer', 'expedition', 'mustang', 'fusion', 'escape'],
'chevrolet': ['silverado', 'tahoe', 'suburban', 'impala', 'malibu', 'corvette', 'camaro'],
'nissan': ['altima', 'sentra', 'maxima', 'rogue', 'murano', 'pathfinder', 'xterra'],
'jeep': ['wrangler', 'cherokee', 'grand cherokee', 'renegade'],
'dodge': ['charger', 'challenger', 'durango', 'ram'],
'bmw': ['3 series', '5 series', '7 series', 'x5', 'x3'],
'mercedes-benz': ['c-class', 'e-class', 's-class', 'gla', 'glc-class', 'gle', 'benz s550', 'glk-class', 'ml350', 'clk 550'],
'subaru': ['outback', 'forester', 'impreza', 'legacy', 'wrx'],
'mazda': ['mazda3', 'mazda6', 'cx-5', 'cx-9', 'mx-5'],
'hyundai': ['elantra', 'sonata', 'tucson', 'santa fe'],
'kia': ['optima', 'sorento', 'sportage', 'soul'],
'volkswagen': ['jetta', 'passat', 'golf', 'tiguan'],
'lexus': ['rx', 'es', 'nx', 'gx', 'ls'],
'hummer': ['h1', 'h2', 'h3'],
'mini': ['cooper', 'clubman', 'countryman'],
'audi': ['a3', 'a4', 'a5', 'a6', 'q3', 'q5', 'q7'],
'scion': ['im hatchback 4d', 'xd hatchback 4d', 'fr-s coupe 2d', 'tc hatchback coupe 2d', 'xb base wagon', 'iq'],
'smart': ['fortwo passion hatchback', 'fortwo electric drive passion', 'fortwo pure'],
'maserati': ['levante', 'quattroporte', 'ghibli', 'granturismo'],
'suzuki': ['xl-7', 'reno', 'forenza', 'grand vitara', 'sx4 sport awd'],
'isuzu': ['npr hd', 'nrr box truck', 'npr crew'],
'freightliner': ['m2 106 medium duty', 'm-line walk-in van'],
}
# Function to fill the manufacturer column based on the model column
def fill_manufacturer(row):
# Check if model is valid and manufacturer is empty
model_value = row['model']
if pd.isnull(row['manufacturer']) and pd.notnull(model_value):
# Match the manufacturer using next() for efficiency
row['manufacturer'] = next(
(manufacturer for manufacturer, patterns in models_to_manufacturer.items()
if any(pattern in model_value for pattern in patterns)),
row['manufacturer'] # Default: keep the existing value (NaN) if no match is found
)
return row
# Checking missing values in 'manufacturer' before
print(f"Missing values in 'manufacturer' before imputing: {df['manufacturer'].isnull().sum()}")
# Applying the function row by row to fill the manufacturer column
df = df.apply(fill_manufacturer, axis=1)
# Checking missing values in 'manufacturer' after
print(f"Missing values in 'manufacturer' after imputing: {df['manufacturer'].isnull().sum()}")
Missing values in 'manufacturer' before imputing: 6828 Missing values in 'manufacturer' after imputing: 3491
4.8.4 Hierarchical Imputation of Missing Values in the 'Condition' Column using bfill and ffill (many to one)¶
This code ensures that the condition column is completely filled using a structured hierarchical imputation approach. By preserving data integrity and type consistency, it prepares the dataset for accurate demand analysis and pricing optimization. This preprocessing step aligns directly with Spinny's objectives of optimizing revenue through better data-driven insights.
Sorting for Logical Grouping:
Sorting by mfg_year and odometer organizes the dataset so that missing values can be imputed from rows with similar characteristics.
Hierarchical Filling:
- Step 2: Filling missing values within the most specific grouping (mfg_year, odometer, type) for maximum accuracy.
- Step 3: Falls back to a broader grouping (mfg_year, odometer) to fill gaps when finer grouping lacks data.
- Step 4: Broadens further to mfg_year for situations where previous groupings are insufficient.
- Step 5: Applies a global backfill and forward fill as a final fallback to ensure no missing values remain.
Type Consistency:
Ensuring that the filled column maintains consistent data types, which is crucial for downstream operations like visualization and modeling. Efficient Handling:
The use of explicit forward-fill (ffill) and backward-fill (bfill) ensures predictable and logical imputation without arbitrary assumptions.
# Function to fill missing values in the 'condition' column without modifying the original DataFrame
def fill_missing_condition(df):
# Set the pandas option to avoid FutureWarnings
pd.set_option('future.no_silent_downcasting', True)
# Create a copy of the DataFrame
df_copy = df.copy()
# Step 1: Sort data for logical filling
df_copy = df_copy.sort_values(by=['mfg_year', 'odometer'])
# Step 2: Fill within the most specific grouping (mfg_year, odometer, type)
# dropna=False keeps rows whose 'type' is missing in their own group, so their
# condition values are not blanked by the transform
df_copy['condition'] = (
df_copy.groupby(['mfg_year', 'odometer', 'type'], dropna=False)['condition']
.transform(lambda x: x.bfill().ffill())
)
# Step 3: Fallback to broader grouping (mfg_year, odometer)
df_copy['condition'] = (
df_copy.groupby(['mfg_year', 'odometer'])['condition']
.transform(lambda x: x.bfill().ffill()) # Use bfill and ffill explicitly
)
# Step 4: Fallback with mfg_year (broader group)
df_copy['condition'] = (
df_copy.groupby(['mfg_year'])['condition']
.transform(lambda x: x.bfill().ffill()) # Use bfill and ffill explicitly
)
# Step 5: Global bfill and ffill as the final fallback
df_copy['condition'] = df_copy['condition'].bfill().ffill()
# Extract the filled 'condition' column
filled_condition = df_copy['condition']
# Explicitly delete the temporary copy to free memory
del df_copy
# Reset the pandas option after processing
pd.reset_option('future.no_silent_downcasting')
# Return the filled 'condition' column
return filled_condition
# Call the function to fill missing values in the 'condition' column without modifying the original dataset
filled_condition = fill_missing_condition(df)
# Print the original and filled 'condition' columns for comparison
print("Original 'condition' column:")
print(df['condition'])
print("\nFilled 'condition' column:")
print(filled_condition)
Original 'condition' column: 0 good 1 good 2 good 3 good 4 excellent ... 305418 good 305419 good 305420 good 305421 good 305422 good Name: condition, Length: 305423, dtype: object Filled 'condition' column: 55215 good 79185 good 103123 good 103386 good 103974 good ... 69815 good 210694 good 254305 good 188408 excellent 142725 good Name: condition, Length: 305423, dtype: object
4.8.5 Filling All Remaining Missing Values In Categorical Columns With 'unknown'¶
# Calculate missing values count and percentage for each column
missing_info = pd.DataFrame({
'Missing Count': df.isnull().sum(),
'Missing Percentage (%)': (df.isnull().sum() / len(df)) * 100
})
# Display the result
print(missing_info)
Missing Count Missing Percentage (%) id 0 0.000000 price 0 0.000000 mfg_year 0 0.000000 manufacturer 3491 1.143005 model 0 0.000000 condition 118875 38.921430 cylinders 131377 43.014770 fuel 0 0.000000 odometer 0 0.000000 title_status 0 0.000000 transmission 0 0.000000 drive 91723 30.031465 type 57245 18.742858 paint_color 80715 26.427283 state 0 0.000000 posting_date 0 0.000000 latitude 0 0.000000 longitude 0 0.000000
Filling 'unknown' into all remaining missing values & printing confirmation
# List of columns to fill missing values with 'unknown'
columns_to_fill = ['paint_color', 'drive', 'condition', 'cylinders', 'type']
# Fill missing values with 'unknown' and apply the changes to the dataframe
df[columns_to_fill] = df[columns_to_fill].fillna('unknown')
# Verify that there are no missing values remaining in the specified columns
print("Number of remaining missing values in the dataset:", df[columns_to_fill].isnull().sum().sum())
Number of remaining missing values in the dataset: 0
4.9 Data Preparation For ML Models¶
4.9.1 Normalizing Price and Odometer Using Min-Max Method¶
Min-Max normalization rescales each feature value X to X' using:
X' = (X - min(X)) / (max(X) - min(X))
Where:
X: Original value
X': Normalized value
min(X): Minimum value of the feature
max(X): Maximum value of the feature
Why Min-Max Normalization?
- Preserves Relationships: Retains relative differences between data points without distortion.
- Machine Learning Compatibility: Essential for algorithms like KNN and gradient-based models, ensuring equal contribution of features.
- Interpretable Range: Scales values between [0, 1], making them easy to interpret.
Why Not Z-Score or Robust Normalization?
- Z-Score: Centers data around 0 with unit variance, suitable for Gaussian distributions but less effective for non-Gaussian data.
- Robust Scaling: Reduces the effect of outliers but does not produce a bounded range like Min-Max normalization.
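For contrast, here is a minimal sketch of the two alternatives above applied to the price column (illustrative only; these scaled series are not stored in the working DataFrame):
# Z-score standardisation: centres at 0 with unit variance, but the range is unbounded
price_zscore = (df['price'] - df['price'].mean()) / df['price'].std()
# Robust scaling: centres at the median and scales by the IQR, also unbounded
q1, q3 = df['price'].quantile([0.25, 0.75])
price_robust = (df['price'] - df['price'].median()) / (q3 - q1)
print(f"Z-score mean: {price_zscore.mean():.4f}, std: {price_zscore.std():.4f}")
print(f"Robust-scaled median: {price_robust.median():.4f}")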
Benefits of Min-Max Normalization
- Uniform Feature Contribution: Prevents large-magnitude features from dominating smaller ones.
- Maintains Proportions: Higher values remain higher after normalization.
- Improved Performance: Ensures faster convergence for gradient descent and better accuracy for distance-based models.
- Intuitive Interpretation: Outputs within [0, 1] are easy to analyze.
When to Avoid Min-Max Normalization
- Extreme Outliers: Can skew normalized values; robust scaling is preferred.
- Dynamic Data: Frequent recalculation may be required if min/max values change over time.
This method is ideal for non-Gaussian data without extreme outliers and when a bounded range is essential for model performance.
# Add new columns for Min-Max Normalisation
df['price_minmax'] = (df['price'] - df['price'].min()) / (df['price'].max() - df['price'].min())
df['odometer_minmax'] = (df['odometer'] - df['odometer'].min()) / (df['odometer'].max() - df['odometer'].min())
# Verify that only the new columns are added
print("Columns after adding normalised values:")
print(df.columns)
# Check the first few rows to confirm the new columns
print("First few rows with new columns:")
print(df[['price', 'price_minmax', 'odometer', 'odometer_minmax']].head())
Columns after adding normalised values: Index(['id', 'price', 'mfg_year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'paint_color', 'state', 'posting_date', 'latitude', 'longitude', 'price_minmax', 'odometer_minmax'], dtype='object') First few rows with new columns: price price_minmax odometer odometer_minmax 0 4313900.0 0.585802 57923.0 0.208574 1 2901280.0 0.393936 71229.0 0.256487 2 5084440.0 0.690459 19160.0 0.068993 3 3980090.0 0.540463 41124.0 0.148083 4 1926290.0 0.261510 128000.0 0.460912
4.9.2 Optimizing Object Columns: Converting to Categorical for Faster Processing¶
Why Convert Object Columns to Categorical?
Memory Efficiency:
- Object columns take more memory because they store strings as raw values.
- Converting to categorical reduces memory usage by storing unique values (categories) and mapping each row to an integer.
Faster Processing:
- Categorical data is numerically encoded internally, which is faster for processing in many machine learning libraries.
- Operations like grouping, filtering, or joining are more efficient on categorical data.
Improved Model Performance:
- Some machine learning models, like decision trees, gradient boosting, or random forests, can handle categorical data directly when encoded efficiently.
- Categorical conversion is a necessary step before encoding methods like label encoding or one-hot encoding.
# Convert object columns to categorical
for col in df.select_dtypes(include='object').columns:
df[col] = df[col].astype('category')
# Check memory usage after conversion
print("Memory usage after conversion:")
print(df.memory_usage(deep=True))
print(df.info())
Memory usage after conversion: Index 132 id 2443384 price 2443384 mfg_year 2443384 manufacturer 309549 model 2400803 condition 305972 cylinders 306263 fuel 305908 odometer 2443384 title_status 305977 transmission 305722 drive 305711 type 306794 paint_color 306468 state 308862 posting_date 2443384 latitude 2443384 longitude 2443384 price_minmax 2443384 odometer_minmax 2443384 dtype: int64 <class 'pandas.core.frame.DataFrame'> RangeIndex: 305423 entries, 0 to 305422 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 305423 non-null int64 1 price 305423 non-null float64 2 mfg_year 305423 non-null int64 3 manufacturer 301932 non-null category 4 model 305423 non-null category 5 condition 186548 non-null category 6 cylinders 174046 non-null category 7 fuel 305423 non-null category 8 odometer 305423 non-null float64 9 title_status 305423 non-null category 10 transmission 305423 non-null category 11 drive 213700 non-null category 12 type 248178 non-null category 13 paint_color 224708 non-null category 14 state 305423 non-null category 15 posting_date 305423 non-null datetime64[ns, UTC] 16 latitude 305423 non-null float64 17 longitude 305423 non-null float64 18 price_minmax 305423 non-null float64 19 odometer_minmax 305423 non-null float64 dtypes: category(11), datetime64[ns, UTC](1), float64(6), int64(2) memory usage: 25.1 MB None
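To illustrate the encoding step mentioned above, a minimal sketch using two of the converted columns (the choice of fuel and drive here is purely illustrative):
# Label encoding: each category maps to an integer code (missing values become -1)
fuel_codes = df['fuel'].cat.codes
# One-hot encoding: one indicator column per category
drive_dummies = pd.get_dummies(df['drive'], prefix='drive')
print(fuel_codes.head())
print(drive_dummies.head())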
4.10 Feature Engineering¶
4.10.1 cars_age: Added ( posting_date - mfg_year )¶
Purpose of the Code: The code calculates the car’s age (cars_age) by subtracting its manufacturing year (mfg_year) from the year of its posting_date. This provides a direct numeric measure of vehicle age, simplifying analysis.
Why Use cars_age Instead of mfg_year:
Direct Comparisons: cars_age standardizes vehicle age, making comparisons across posting years more intuitive.
Simplifies Trend Analysis: Age is easier to correlate with price depreciation, buyer preferences, and demand trends than mfg_year.
Better for Models: Numeric features like cars_age improve machine learning model accuracy and interpretation.
Efficiency: Precomputing age avoids repeated calculations and enhances workflow during analysis.
How This Helps in Analysis:
- Identifies price vs. age relationships and regional trends.
- Supports clear visualizations and comparisons.
Using cars_age simplifies workflows and aligns analysis with practical decision-making.
# Calculate cars_age and store it as a new column
df['cars_age'] = df['posting_date'].dt.year - df['mfg_year']
# Print the three columns in one row for each record
print(df[['posting_date', 'mfg_year', 'cars_age']])
posting_date mfg_year cars_age 0 2021-05-04 17:31:18+00:00 2014 7 1 2021-05-04 17:31:08+00:00 2010 11 2 2021-05-04 17:31:25+00:00 2020 1 3 2021-05-04 15:41:31+00:00 2017 4 4 2021-05-03 19:02:03+00:00 2013 8 ... ... ... ... 305418 2021-04-04 09:21:31+00:00 2019 2 305419 2021-04-04 09:21:29+00:00 2020 1 305420 2021-04-04 09:21:17+00:00 2020 1 305421 2021-04-04 09:21:11+00:00 2018 3 305422 2021-04-04 09:21:07+00:00 2019 2 [305423 rows x 3 columns]
4.10.2 cylinders_count: Converting Cylinders To Numeric Value As Int64¶
Purpose of the Code: This code creates a new column, cylinders_count, by extracting the numeric part of the cylinders column while preserving the original. Invalid or missing entries become NaN, ensuring the new column is clean and ready for analysis.
How This Helps in Analysis:
Facilitates Comparisons: Enables consistent comparisons of engine sizes across different car types or manufacturers.
Supports Exploratory Analysis: Allows analysis of trends and correlations with other variables (e.g., price, odometer).
Enhances Data Reliability: Ensures accurate calculations and interpretations by excluding invalid or inconsistent data entries.
Enables Flexibility: Having both the original and processed columns provides flexibility in choosing the appropriate format for different analytical tasks.
# Extract integers from the 'cylinders' column and create a new column named 'cylinders_count'
df['cylinders_count'] = df['cylinders'].astype(str).str.extract(r'(\d+)').astype('Int64')
# Confirmation print as a DataFrame with both columns
print(df[['cylinders', 'cylinders_count']].head(5))
cylinders cylinders_count 0 8 cylinders 8 1 8 cylinders 8 2 8 cylinders 8 3 8 cylinders 8 4 6 cylinders 6
4.10.3 state_income: Created a new categorical column grouping states by income level¶
Source of the data for categorisation: StudyIQ
Purpose of this column
- Analyse Regional Demand: Understand how state income levels influence car demand.
- Price Sensitivity & Market Segmentation: Adjust pricing strategies based on purchasing power.
- Optimise Revenue Strategy: Tailor pricing and marketing to different income categories.
- Enhance Model Performance: Use regional economic factors to improve demand forecasting.
- Align with Business Goals: Support Spinny’s objective of revenue growth through demand-based pricing.
# Dictionary mapping income categories to a list of states
state_income_category = {
'high': ['goa', 'sikkim', 'delhi', 'chandigarh', 'haryana', 'telangana',
'karnataka', 'gujarat', 'tamil nadu', 'puducherry'],
'medium': ['andaman and nicobar islands', 'kerala', 'arunachal pradesh', 'mizoram',
'uttarakhand', 'maharashtra', 'himachal pradesh', 'andhra pradesh',
'dadra and nagar haveli and daman and diu', 'ladakh'],
'low': ['punjab', 'jammu and kashmir', 'tripura', 'rajasthan', 'west bengal',
'chhattisgarh', 'madhya pradesh', 'odisha'],
'very low': ['assam', 'meghalaya', 'manipur', 'jharkhand', 'uttar pradesh', 'bihar']
}
# Reverse mapping: Assign each state its respective income category
state_income_mapping = {state: category for category, states in state_income_category.items() for state in states}
# Create a new column 'state_income' while keeping 'state' unchanged
df['state_income'] = df['state'].str.lower().map(state_income_mapping)
# Ensure state_income values are in lowercase
df['state_income'] = df['state_income'].str.lower()
# Display the columns to confirm the new column was added
df.columns
Index(['id', 'price', 'mfg_year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'paint_color', 'state', 'posting_date', 'latitude', 'longitude', 'price_minmax', 'odometer_minmax', 'cars_age', 'cylinders_count', 'state_income'], dtype='object')
Correlation Heatmap of cars_age, price, odometer & cylinders_count¶
What is Correlation Analysis?
Correlation analysis measures the relationship between two or more variables, showing how changes in one variable affect another.
Why is it Important?
- Identifies key relationships to understand how features like price, car age, and odometer reading are connected.
- Supports pricing strategy by showing how factors like car age or mileage impact resale price.
- Guides feature selection by helping choose relevant variables for predictive models.
- Detects multicollinearity to avoid highly correlated features that may distort machine learning models.
# Select the relevant numeric columns for correlation
columns_of_interest = ['cars_age', 'price', 'odometer', 'cylinders_count']
selected_df = df[columns_of_interest]
# Calculate the correlation matrix
correlation_matrix = selected_df.corr()
# Global setting for figure size (optional)
sns.set(rc={'figure.figsize': (14, 6)})
# Plot the heatmap
plt.figure(figsize=(16, 5))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True, square=False)
plt.title("Correlation Heatmap")
plt.show()
cars_age and price: Strong negative correlation (-0.59): Older cars tend to have lower prices.
odometer and price: Moderate negative correlation (-0.51): Higher mileage reduces the resale value of cars.
cars_age and odometer: Strong positive correlation (0.69): Older cars generally have higher mileage.
cylinders_count and price: Moderate positive correlation (0.45): Cars with more cylinders are often priced higher, reflecting engine capacity and performance.
Distributions Overview of Price, Odometer, Car Age, & Cylinders Count¶
plt.figure(figsize=(14, 5))
for col, color, scale in zip(
['price', 'odometer', 'cars_age', 'cylinders_count'],
['blue', 'purple', 'green', 'orange'],
[1e6, 1e5, 1, 1]
):
sns.kdeplot(df[col] / scale if scale != 1 else df[col],
label=col.replace("_", " ").title(),
fill=True, alpha=0.3, color=color, linewidth=3)
plt.title('Density Plot Price (in million), odometer (in lakh km), cars_age & cylinders')
plt.xlabel('Values')
plt.ylabel('Density')
plt.xlim(0, 16)
plt.legend(title='Features', loc='upper right')
plt.show()
Price: Prices range from ₹0 to ₹80 million, with the highest demand observed between ₹0.5 to ₹1.5 million, while price ranges from ₹4.5 to ₹8 million have the lowest demand.
Odometer: Odometer readings range from 0 to 3 lakh km, with most cars falling within 0 to 1 lakh km, and cars with 2 to 3 lakh km readings experiencing the lowest demand.
Car Age: Car ages range from 0 to 16 years, with the highest registrations observed for cars aged 3 to 9 years, and the lowest registrations for newer cars aged 0 to 1 year.
Cylinders: Cars have 3 to 12 cylinders, with the highest registrations for cars with 4, 6, and 8 cylinders, while cars with 3, 5, 10, and 12 cylinders have the lowest or negligible registrations.
5.1 UNIVARIATE ANALYSIS¶
Univariate analysis examines a single variable's distribution, central tendency, and dispersion to understand its characteristics. This process helps detect patterns, identify anomalies, and assess the presence of missing values. Additionally, it reveals dominant categories or frequent occurrences within the data, providing essential insights for subsequent analyses.
5.1.1 Cars with odometer readings between 0 to 0.5 lakh KM dominate the dataset.¶
Reason: Lower mileage is typically associated with less wear and tear, better performance, and higher resale value, making these cars more appealing to buyers
odo_lakh = df['odometer'] / 1e5
print({k: f"{v:,.2f}" for k, v in {
"Mean": odo_lakh.mean(), "Median": odo_lakh.median(),
"Var": odo_lakh.var(), "Std Dev": odo_lakh.std()}.items()})
plt.figure(figsize=(14,5))
sns.histplot(odo_lakh, bins=30, kde=True, color='indigo', alpha=0.6)
plt.title("Odometer Distribution (Lakh KM)"); plt.xlabel("Lakh KM")
plt.ylabel("Frequency"); plt.show()
{'Mean': '0.84', 'Median': '0.79', 'Var': '0.32', 'Std Dev': '0.57'}
Odometer Distribution:
- Most cars have odometer readings between 0.5 lakh KM and 1 lakh KM, indicating a preference for low-mileage vehicles.
- The distribution gradually decreases beyond 1 lakh KM, with very few cars having readings above 2 lakh KM
Skewness:
- The odometer distribution is right-skewed, showing that higher mileage cars are less common or less preferred by buyers.
Outliers:
- There is a small number of cars with odometer readings exceeding 2 lakh KM, indicating outliers or niche inventory segments.
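The skew and outlier observations above can be checked numerically; a small sketch using pandas' built-in skewness estimator:
# Quantify the right skew and the share of high-mileage outliers
odo_lakh = df['odometer'] / 1e5
print(f"Skewness: {odo_lakh.skew():.2f}") # a positive value confirms the right skew
print(f"Share of cars above 2 lakh KM: {(odo_lakh > 2).mean() * 100:.2f}%")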
Result:
Analysis confirms that cars with odometer readings between 0 to 0.5 lakh KM constitute the largest segment in the dataset. This aligns with the hypothesis that buyers prioritise low-mileage vehicles for reliability and long-term use.
📌 Insight: These observations suggest focusing on cars with odometer readings under 1 lakh KM, as they dominate the inventory. For higher-mileage cars (beyond 1 lakh KM), targeted strategies may be needed to address lower demand.
5.1.2 Cars aged between 3 to 8 years dominate the dataset.¶
Reason: Cars in the 3 to 10-year range are typically less expensive than newer models while still being reliable and equipped with relatively modern features. Buyers often prioritise such vehicles for their reduced depreciation and reasonable maintenance costs.
# Count occurrences of each manufacturing year
mfg_year_counts = df['mfg_year'].value_counts().sort_index()
# Convert counts to percentage
total_cars = mfg_year_counts.sum()
mfg_year_percent = (mfg_year_counts / total_cars) * 100 # Convert to percentage
# Plot bar chart with correct `hue` to remove warning
plt.figure(figsize=(14,5))
ax = sns.barplot(x=mfg_year_counts.index, y=mfg_year_counts.values, hue=mfg_year_counts.index, palette="icefire")
ax.legend_.remove() # Remove legend since hue is only for color
# Add percentage annotations on top of bars
for p, percent in zip(ax.patches, mfg_year_percent):
ax.annotate(f'{percent:.1f}%',
(p.get_x() + p.get_width() / 2, p.get_height()),
ha='center', va='bottom', fontsize=10, fontweight='bold')
# Titles and labels
plt.title("Mfg Year Distribution & % contribution", fontsize=14, fontweight='bold')
plt.xlabel("Manufacturing Year"); plt.ylabel("Car Count")
plt.xticks(rotation=45)
# Set proper y-axis ticks dynamically
ax.set_yticks(range(0, max(mfg_year_counts.values) + 5000, 5000))
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
Observations:
- Recent Models: Cars from 2013 to 2018 dominate the inventory, peaking around 2017-2018, showing a preference for vehicles aged 3-8 years at the time of posting. The highest count is for 2018 models, indicating strong demand or availability.
- Newer Models: Cars from 2020 and 2021 are limited in inventory, possibly reflecting supply constraints or lower demand for nearly new vehicles.
- Older Models: Cars manufactured before 2010 represent a smaller share, indicating reduced demand or availability for vehicles older than 11 years.
- Distribution: A bell-shaped distribution with a rise from 2006 to 2018, peaking in later years, followed by a decline from 2019 onwards.
Mean Year: 2013.95 (approximately 2014).
Median Year: 2014.
Standard Deviation: 3.93 (indicating most cars are within a 4-year range of the mean).
Variance: 15.47 (moderate spread in manufacturing years).
Skewness: The distribution is slightly left-skewed, with older models (pre-2010) having limited representation
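A one-line sketch of how these summary statistics can be reproduced from the mfg_year column:
# Reproduce the mean, median, spread and skew quoted above
print(df['mfg_year'].agg(['mean', 'median', 'std', 'var', 'skew']).round(2))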
Result:
Analysis confirms that cars aged between 3 to 10 years form the largest segment in the dataset, validating the hypothesis. This trend highlights the strong demand for mid-age vehicles due to their affordability and perceived value.
📌 Insights:
- Focus on 2013-2018 Models: Prioritise inventory and marketing for these models as they dominate demand.
- Expand Newer Models: Investigate and address low representation of 2020 and 2021 cars if there is unmet demand.
- Clearance Strategies for Older Models: Promote pre-2010 cars with discounts or added value to move low-demand inventory.
- 2019 and Beyond Decline: Likely tied to pandemic disruptions in production/sales, worth further exploration.
5.1.3 Cars with 4-cylinder and 6-cylinder configurations dominate the inventory.¶
Reason: 4-cylinder engines are typically found in compact and budget-friendly cars, while 6-cylinder engines cater to mid-size and performance-oriented buyers. These configurations are widely preferred due to their balance between power, efficiency, and affordability. Higher cylinder counts, like 8, 10, or 12, target niche markets, such as luxury or performance enthusiasts.
cyl_count = df['cylinders_count'].value_counts().sort_index()
plt.figure(figsize=(14,5))
sns.barplot(x=cyl_count.index, y=cyl_count.values, hue=cyl_count.index, palette="Set3", legend=False)
plt.title("Cylinders Count Distribution"); plt.xlabel("Cylinders"); plt.ylabel("Count")
for i, v in enumerate(cyl_count.values):
plt.text(i, v + 0.5, str(v), ha='center', fontsize=10)
plt.show()
Result:
The analysis confirms that:
- 4-cylinder cars (60,254) and 6-cylinder cars (67,835) make up the majority of the inventory.
- 8-cylinder cars (42,952) cater to a smaller but notable segment, likely targeting performance buyers.
- Rare configurations (3, 5, 10, and 12 cylinders) have minimal representation, indicating niche demand.
📌Insight: Focus on marketing and inventory for 4-cylinder and 6-cylinder cars to capture mainstream demand, while using targeted promotions for higher-cylinder cars to address niche segments and optimise inventory turnover
5.1.4 4WD and FWD drive types dominate the inventory.¶
Reason:
- 4WD vehicles are highly versatile, offering better performance in rugged, off-road, or adverse weather conditions. This makes them popular in regions requiring robust capabilities.
- FWD cars are known for their affordability, fuel efficiency, and reliability, making them ideal for urban buyers prioritising cost-effective and practical vehicles.
- RWD cars, while offering better performance and handling, are typically associated with premium or niche segments, reducing their overall market share.
5.1.5 Automatic transmission cars dominate the market.¶
Reason:
- Automatic transmissions are easier to operate and more comfortable, especially in congested urban areas or for less experienced drivers, driving their popularity.
- Manual transmissions are often cheaper and provide more control over vehicle performance, appealing to budget-conscious buyers or driving enthusiasts. Other transmission types have limited market appeal due to specialised use cases or higher costs.
# Count occurrences of each unique value, excluding 'unknown'
drive_counts = df['drive'].value_counts()
drive_counts_filtered = drive_counts[drive_counts.index != 'unknown']
transmission_counts = df['transmission'].value_counts()
transmission_counts_filtered = transmission_counts[transmission_counts.index != 'unknown']
# Define color palettes for better visualization
palette_drive = sns.color_palette("Set2", len(drive_counts_filtered))
palette_transmission = sns.color_palette("Set3_r", len(transmission_counts_filtered))
# Create side-by-side donut charts
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Drive Type Donut Chart
axes[0].pie(drive_counts_filtered, labels=drive_counts_filtered.index, autopct='%1.1f%%',
startangle=140, colors=palette_drive, wedgeprops={'edgecolor': 'white'})
axes[0].add_artist(plt.Circle((0, 0), 0.70, fc='white')) # Create the donut effect
axes[0].set_title("Drive Type Distribution")
# Transmission Type Donut Chart
axes[1].pie(transmission_counts_filtered, labels=transmission_counts_filtered.index, autopct='%1.1f%%',
startangle=140, colors=palette_transmission, wedgeprops={'edgecolor': 'white'})
axes[1].add_artist(plt.Circle((0, 0), 0.70, fc='white')) # Create the donut effect
axes[1].set_title("Transmission Type Distribution")
# Show the plot
plt.tight_layout()
plt.show()
# Free up memory by deleting unnecessary variables
del drive_counts, drive_counts_filtered, transmission_counts, transmission_counts_filtered
Observations:
- 4WD: 43.6%
- FWD: 39.2%
- RWD: 17.1%
Result: The analysis confirms that 4WD and FWD cars dominate the inventory, catering to the majority of buyer preferences.
📌Insight:
- Focus inventory management and marketing strategies on 4WD and FWD vehicles to maximise revenue.
- Target niche marketing for RWD cars to reach performance-oriented buyers or luxury segments.
Observations:
- Automatic: 77%
- Manual: 19.3%
- Other: 3.7%
Result: The analysis confirms the dominance of automatic transmissions, with manual transmissions forming a secondary but relevant market segment.
📌Insight:
- Prioritise procurement and pricing strategies around automatic vehicles to cater to the dominant demand.
- Use specialised promotions for manual transmission cars to appeal to specific buyer demographics, such as cost-sensitive buyers or driving enthusiasts.
5.1.6 Gasoline (gas) vehicles dominate the market.¶
Reason:
- Gasoline vehicles: These are widely preferred due to their affordability, availability, and lower initial cost compared to diesel or alternative fuel vehicles.
- Diesel vehicles: Typically cater to buyers seeking better fuel efficiency for high-mileage usage or rural regions where diesel is more practical.
- Electric and hybrid vehicles: Represent an emerging trend but face limited adoption due to high costs, lack of infrastructure, and slower consumer shift towards alternative fuel technologies.
# Count values
fuel_counts = df['fuel'].value_counts()
# Adjust x-positions with proper spacing
x_positions = [i * 5 for i in range(len(fuel_counts))] # Increased horizontal spacing
# Plot bubble chart
plt.figure(figsize=(14, 5)) # Increase figure size to accommodate full circles
plt.scatter(x_positions, [0] * len(fuel_counts), s=fuel_counts.values * 0.3, c=plt.cm.Set3(range(len(fuel_counts))), alpha=0.9)
# Add labels with count values
for x, (label, count) in zip(x_positions, zip(fuel_counts.index, fuel_counts.values)):
plt.text(x, 0, f"{label}\n{count}", ha='center', va='center', fontsize=10, fontweight='bold', color='black')
# Adjust axis limits to ensure full circles are visible
plt.xlim(min(x_positions) - 5, max(x_positions) + 5)
plt.ylim(-100, 100) # Increase vertical space for larger circles
# Final touches
plt.axis('off')
plt.title("Fuel Type Distribution", fontsize=16)
plt.tight_layout()
plt.show()
Observations:
- Gas vehicles dominate the dataset, accounting for the majority of entries. Diesel vehicles show moderate representation, suggesting specific demand segments.
- Electric and hybrid vehicles have minimal presence, reflecting low market penetration.
- Other fuel types contribute a small but notable portion, catering to niche requirements.
Result: The analysis confirms that gas vehicles dominate the fuel type distribution, while diesel vehicles occupy a smaller but significant segment. Electric and hybrid vehicles remain niche due to limited adoption.
📌Insight: Concentrate pricing and inventory management on gas vehicles to align with the dominant demand. Monitor growth in the electric and hybrid segments to identify emerging opportunities in alternative fuels.
5.1.7 Cars in good and excellent condition dominate the market.¶
Reason: Good and excellent condition cars offer a balance of quality and affordability, making them the most appealing to buyers. Other conditions cater to specific needs or budgets, limiting their demand.
# Count occurrences of each unique value in the 'condition' column (case insensitive)
condition_counts = df['condition'].str.lower().value_counts()
# Exclude 'unknown' from the counts for plotting purposes only
condition_counts_filtered = condition_counts[condition_counts.index != 'unknown']
# Plot the distribution
plt.figure(figsize=(14, 5))
sns.barplot(x=condition_counts_filtered.index, y=condition_counts_filtered.values, hue=condition_counts_filtered.index, palette="Set3", dodge=False, legend=False)
plt.title("Condition Distribution")
plt.xlabel("Condition")
plt.ylabel("Count")
plt.show()
# Free up memory by deleting unnecessary variables
del condition_counts, condition_counts_filtered
Observation:
- Good and excellent condition cars dominate, reflecting strong demand.
- Like new vehicles cater to a moderate, quality-focused segment.
- New cars have minimal listings, likely from rare resale cases.
- Fair and salvage cars show limited demand or supply.
Result: The analysis confirms that cars in good and excellent condition dominate the dataset, while other conditions represent smaller niche segments.
📌Insights: Focus on inventory and pricing for good and excellent condition cars to capture majority demand, while exploring niche opportunities for other conditions like new and salvage cars.
5.1.8 Mid-range to premium models dominate the dataset, reflecting their popularity due to brand reliability and affordability in the second-hand car market¶
Reason: People prefer branded second-hand cars as they are more affordable than new ones, and consistent upgrades by affluent buyers lead to a steady supply of popular models like the F-150 and Silverado 1500 in the used car market.
color_palette, figsize = "Set3", (14, 5) # Adjust color & chart size
top_models = df['model'].value_counts().nlargest(10)
colors = plt.get_cmap(color_palette)(np.linspace(0, 1, 10))
plt.figure(figsize=figsize)
plt.barh(top_models.index, top_models.values, color=colors)
plt.gca().invert_yaxis()
plt.xlabel("Count"), plt.ylabel("Model"), plt.title("Top 10 Most Repeated Car Models")
plt.show()
Result: The F-150 dominates the list, followed by Silverado 1500 and 1500, while models like Escape, Altima, and Camry show moderate demand.
📌Insight: Focus on acquiring and marketing branded second-hand cars like the F-150 and Silverado 1500, as their popularity and steady supply from affluent sellers present a significant opportunity to generate high revenue and meet market demand efficiently.
5.1.9 Mid-range and premium manufacturers dominate the dataset¶
Reason: Popular manufacturers like Ford and Chevrolet are known for offering reliable and well-performing models at competitive prices, making them highly preferred in the second-hand car market.
df['manufacturer'].value_counts().nlargest(10) \
.plot.barh(figsize=(14,5), width=0.6, color=plt.cm.cubehelix(np.linspace(0,1,10))) \
.invert_yaxis()
plt.show()
Result: Ford has the highest representation, followed by Chevrolet and Toyota, while brands like BMW and GMC cater to niche or premium segments.
📌Insight: Prioritise acquiring and marketing models from Ford, Chevrolet, and Toyota to generate high revenue, while exploring premium brands like BMW to cater to niche buyer segments.
5.1.10 Sedans and SUVs dominate the market due to their widespread appeal¶
Reason: Sedans and SUVs offer versatility and affordability, appealing to a wide range of buyers, while other types cater to specific needs or niche preferences.
# Filter the data to exclude 'unknown' and sort in ascending order without modifying the original DataFrame
filtered_types = df['type'].str.lower().value_counts()
filtered_types = filtered_types[filtered_types.index != 'unknown'].sort_values()
# Plot the distribution
plt.figure(figsize=(14, 5))
sns.barplot(x=filtered_types.index, y=filtered_types.values, hue=filtered_types.index, palette="Set3", dodge=False, legend=False)
plt.title("Car Type Distribution ")
plt.xlabel("Car Type")
plt.ylabel("Count")
plt.show()
Observation: Sedans dominate the market, followed by SUVs. Pickup trucks and hatchbacks show moderate demand. Other types, like off-road vehicles and minivans, cater to niche markets.
Result:
Analysis confirms that sedans and SUVs dominate the market, while pickups, hatchbacks, and niche vehicles like off-road cars and minivans cater to smaller, specific buyer segments.
📌Insight: Prioritise inventory for sedans and SUVs to capture majority demand, and explore niche segments for potential growth opportunities.
5.1.11 White and black cars dominate the market, as they are widely preferred for their neutral appeal and resale value.¶
Reason: Neutral colours like white and black are versatile, timeless, and maintain high resale value, making them the most popular choices.
# Filter the data to exclude 'unknown' and sort in ascending order without modifying the original DataFrame
filtered_color = df['paint_color'].str.lower().value_counts()
filtered_color = filtered_color[filtered_color.index != 'unknown'].sort_values()
# Plot the distribution
plt.figure(figsize=(14, 5))
sns.barplot(x=filtered_color.index, y=filtered_color.values, hue=filtered_color.index, palette="Set3", dodge=False, legend=False)
plt.title("Paint Color Distribution (Ascending Order)")
plt.xlabel("paint_color")
plt.ylabel("Count")
plt.show()
# Free up memory by deleting unnecessary variables
del filtered_color
Observations:
- White cars have the highest count, followed by black cars.
- Silver, red, and blue are moderately popular. Colours like purple, yellow, and orange are the least preferred.
Result: White and black cars dominate the dataset, while unique colours cater to niche markets.
📌Insight: Focus on inventory for white and black cars to meet majority demand, while leveraging niche colours for targeted campaigns.
5.1.12 Weekly post counts are growing rapidly, reflecting Spinny's expanding presence in the pre-owned car market¶
# Group by week and count posts without creating a new column
weekly_posts = df.groupby(df['posting_date'].dt.isocalendar().week).size()
# Plot with annotations
plt.figure(figsize=(14, 5))
plt.plot(weekly_posts.index, weekly_posts.values, marker='o')
for week, total_posts in zip(weekly_posts.index, weekly_posts.values):
plt.text(week, total_posts + max(weekly_posts) * 0.02, # Offset for visibility
f"{total_posts//1000}K", ha='center', fontsize=9)
plt.xticks(weekly_posts.index, [f"W{w}" for w in weekly_posts.index])
plt.xlabel('Week [April - May]')
plt.ylabel('Total Posts')
plt.title('Weekly Post Counts [2021]')
plt.grid(True)
plt.tight_layout()
plt.show()
Observation:
- Weekly posts increased from 2K in Week 13 to a peak of 102K in Week 17, showcasing Spinny's growth in car sourcing and listing activities.
- The drop to 46K in Week 18 indicates potential stabilisation or reduced inventory additions during that period.
Observation: States such as Maharashtra, Uttar Pradesh, and Tamil Nadu, which have significant urban populations, show the highest post counts, indicating the influence of larger cities and economic hubs.
📌Insight: Focus on marketing efforts in metro states to drive engagement and increase post counts.
5.1.13 Majority of the cars are priced below ₹3 million¶
Reason: In a price-sensitive market like India, buyers typically opt for affordable cars within a budget-friendly range, contributing to a higher concentration of transactions at lower price points. Luxury or premium cars, which are priced higher, cater to a smaller niche segment, resulting in a skewed distribution
# Work on a copy to preserve df
df_copy = df[['price']].dropna().copy()
df_copy['price'] = pd.to_numeric(df_copy['price']) / 1e6 # Convert price to millions
# Compute histogram data & statistics
hist_values, bin_edges = np.histogram(df_copy['price'], bins=30)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
mean_price, median_price, std_price = df_copy['price'].agg(['mean', 'median', 'std'])
# Plot histogram-like area chart
plt.figure(figsize=(14,6))
plt.fill_between(bin_centers, hist_values, color='darkgreen', alpha=0.6, step="mid")
# Add statistical markers
for val, color, label in [(mean_price, 'blue', 'Mean'), (median_price, 'red', 'Median'),
(mean_price + std_price, 'orange', 'Std Dev (+1σ)')]:
plt.axvline(val, color=color, linestyle='dashed', linewidth=2, label=f'{label}: {val:.2f}M')
# Titles, labels, and grid
plt.title("Price Distribution of Pre-Owned Cars (Million ₹)", fontsize=14, fontweight='bold')
plt.xlabel("Car Price (Million ₹)"); plt.ylabel("Frequency")
plt.legend(); plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.show()
# Free memory
del df_copy, hist_values, bin_edges, bin_centers, mean_price, median_price, std_price, val, color, label
Observation:
- The chart highlights that most cars are priced between ₹0.5 million and ₹3 million, with diminishing registrations as the price increases beyond ₹4 million.
- The price distribution is right-skewed, reflecting limited demand for higher-priced cars.
- Cars priced above ₹4.5 million have the lowest registrations, while cars below this threshold account for the majority of transactions.
Result:
Analysis confirms that the price distribution is right-skewed, with most cars priced below ₹3 million. This trend highlights the dominance of budget-friendly cars in the dataset, aligning with the typical demand pattern in cost-sensitive markets
📌Insight: These insights reinforce the importance of focusing on inventory and pricing strategies within the ₹0.5 million to ₹3 million range.
5.2 HYPOTHESIS TESTING¶
HYPOTHESIS 1 - Regional Demand Variation¶
H₀: Demand for pre-owned cars does not significantly differ across states.
H₁: There is a significant difference in demand for pre-owned cars across states.
Reason: The hypothesis tests whether demand for pre-owned cars is evenly distributed across different states or if certain states exhibit significantly higher demand than others. The reasoning behind this hypothesis includes:
Reason for Analysing ID Count (Transaction Volume per State)
The ID count represents the number of completed transactions, which is a direct indicator of demand for pre-owned cars in each state. Analysing it reveals which states have the highest and lowest demand, helping businesses like Spinny focus inventory, marketing, and pricing decisions where they matter most.
# Aggregate the data by state to count transactions (ID count per state)
state_transaction_data = df.groupby("state", observed=True)["id"].count().reset_index()
state_transaction_data.columns = ["State", "Transaction_Count"]
# Optional: Use palette with hue explicitly to remove warnings (Alternative version)
state_transaction_data["Hue_Group"] = state_transaction_data["State"]
plt.figure(figsize=(14, 8))
sns.barplot(x="State", y="Transaction_Count", hue="Hue_Group", data=state_transaction_data, palette="ocean_r", legend=False)
plt.xticks(rotation=90, fontsize=10)
plt.xlabel("State", fontsize=12)
plt.ylabel("Number of Transactions (ID Count)", fontsize=12)
plt.title("Demand for Pre-Owned Cars Across Different States (Based on Transaction Count)", fontsize=14)
plt.tight_layout()
plt.show()
# Step 3: Perform Chi-Square Test
chi2_stat, p_value = stats.chisquare(state_transaction_data["Transaction_Count"])
# Print results
print(f"Chi-Square Statistic: {chi2_stat}")
print(f"P-value: {p_value}")
Chi-Square Statistic: 299316.30432547646 P-value: 0.0
Interpreting the Chi-Square Test Results
Chi-Square Statistic:
- The calculated Chi-Square value is 299316.30, which is a very high number. This indicates a significant deviation between the observed transaction counts across states and a uniform distribution (if the null hypothesis assumed all states had equal demand).
P-value:
The P-value is 0.0, which is much smaller than any typical significance level (e.g., 0.05 or 0.01). This means the result is statistically significant.
Conclusion
Since the P-value is effectively 0, we reject the null hypothesis (H₀) in favour of the alternative hypothesis (H₁):
There is a significant difference in demand for pre-owned cars across different states/regions.
What This Means?
Some states have significantly higher demand for pre-owned cars, while others have much lower demand.
These insights can guide business decisions, such as:
- Focusing inventory in high-demand states.
- Improving marketing in low-demand regions.
- Adjusting pricing strategies regionally.
HYPOTHESIS 2 - Impact of Odometer Reading on Demand and Price¶
Objective: Understanding the relationship between odometer readings and demand (price as a proxy)
Null Hypothesis (H₀): The odometer reading of a vehicle does not affect its demand or price.
Alternate Hypothesis (H₁): Vehicles with lower odometer readings have higher demand and price than those with higher readings.
Reason:
- Consumer Perception: Lower mileage cars are perceived to be in better condition and less worn out, making them more desirable.
- Market Trends: Buyers are willing to pay a premium for cars with less usage.
- Price as Proxy for Demand: In economic terms, higher demand often translates into higher prices, so price serves as a reasonable proxy for measuring demand.
Violin Plot of Odometer vs Price
Purpose: To visually analyse how price varies across odometer categories.
Why Use a Violin Plot?
- It combines boxplot and kernel density estimates to show both data spread and concentration.
- Helps to identify trends (e.g., whether lower odometer readings correspond to higher prices).
plt.figure(figsize=(12, 6))
# Assign x to hue and disable legend
sns.violinplot(
x=pd.qcut(df['odometer'], q=4, labels=['Low', 'Medium', 'High', 'Very High']),
y=df['price'],
data=df,
hue=pd.qcut(df['odometer'], q=4, labels=['Low', 'Medium', 'High', 'Very High']),
dodge=False,
palette='coolwarm',
legend=False # Disable legend explicitly
)
plt.title('Odometer Reading vs Price Distribution', fontsize=14)
plt.xlabel('Odometer Reading (Mileage Category)', fontsize=12)
plt.ylabel('Price (INR)', fontsize=12)
plt.grid(True)
plt.show()
Interpretation of Chart
- Low Mileage Vehicles: Higher median prices and tighter data spread (indicating consistent high demand).
- High Mileage Vehicles: Wider price variation and lower medians suggest declining demand.
Correlation Heatmap: Odometer and Price
Why This Correlation?
The correlation heatmap is used to explore the linear relationship between odometer readings and price, two numerical variables. It helps us understand how the mileage of a vehicle impacts its price, providing a quantitative view of their relationship.
# Selecting only price and odometer columns
correlation_features = df[['price', 'odometer']]
# Compute the correlation matrix
correlation_matrix = correlation_features.corr()
# Create the heatmap
plt.figure(figsize=(13, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)
plt.title('Correlation Heatmap: Odometer and Price', fontsize=14)
plt.show()
Observation
- Price vs Odometer Correlation: The correlation coefficient is -0.51, indicating a moderate negative relationship between odometer readings and price.
- Negative Value: Higher odometer readings (more mileage) are associated with lower prices.
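As a quick significance check on this coefficient, a sketch that reuses scipy.stats (imported as stats in the earlier cells):
# Pearson correlation with its p-value for odometer vs price
corr_coef, corr_p = stats.pearsonr(df['odometer'], df['price'])
print(f"Pearson correlation: {corr_coef:.2f}, P-value: {corr_p:.4f}")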
✔ Conclusion:
The null hypothesis (H₀) that odometer readings do not affect price is rejected.
The alternate hypothesis (H₁) is supported, confirming that vehicles with lower odometer readings command higher prices (and, by proxy, higher demand).
Insights:
Increase Revenue: Focus on premium pricing for low-mileage cars.
Boost Sales: Improve marketing for high-mileage cars with value-added offers.
Customer Satisfaction: Align inventory with customer demand for well-maintained, low-mileage vehicles.
HYPOTHESIS 3 - Impact of Cylinders on Vehicle Price¶
Null and Alternative Hypothesis
Null Hypothesis (H₀): The number of cylinders does not affect vehicle price
Alternative Hypothesis (H₁): Vehicles with more cylinders have higher prices
Reason for This Hypothesis
Performance & Engine Power: Cars with more cylinders typically provide better performance, leading to higher pricing.
Luxury & Premium Cars: High-end vehicles (BMW, Mercedes, SUVs, sports cars) often have 6+ cylinders, making them more expensive.
Market Demand & Pricing Strategy: If a strong correlation exists, Spinny can optimise its pricing strategy for vehicles with different engine types.
Visualising Price Trends by Cylinder Count
Explanation
Before conducting statistical tests, we first visualise the relationship between the number of cylinders and vehicle prices using a box plot.
Compare median prices for different cylinder counts.
# Creating a copy of the dataset to avoid modifying the original data
df_copy = df.copy()
# Convert price to millions in a temporary column
df_copy["price_million"] = df_copy["price"] / 1e6 # Convert price to million INR
# Boxplot: Vehicle price distribution across different cylinder counts (in millions)
plt.figure(figsize=(13, 6))
sns.boxplot(data=df_copy, x="cylinders_count", y="price_million", hue="cylinders_count", palette="viridis", legend=False) # Fix warning
# Adding labels and title
plt.title("Vehicle Price Distribution (in Millions) by Number of Cylinders", fontsize=14, fontweight='bold')
plt.xlabel("Number of Cylinders", fontsize=12)
plt.ylabel("Vehicle Price (Million INR)", fontsize=12)
plt.grid(axis='y', linestyle="--", alpha=0.5)
# Show the plot
plt.show()
# Drop the temporary column to ensure original data remains unchanged
df_copy.drop(columns=["price_million"], inplace=True)
Observations
Median Price Trends: The median vehicle price increases as the number of cylinders increases. Vehicles with 12 cylinders have the highest median price, followed by 8-cylinder vehicles.
Price Variability:
Higher-cylinder vehicles (e.g., 6, 8, 12 cylinders) exhibit a wider price range, indicating more variability in pricing. 4-cylinder vehicles show a narrower price range and are more affordable.
Outliers:
4-cylinder vehicles have several outliers, indicating a few highly-priced models in this category.
Scatter Plot: Cylinders vs. Price (in Millions)
Explanation of the heading:
- Scatter Plot: A graphical representation showing individual data points for vehicle prices based on the number of cylinders.
- Cylinders vs. Price: Displays how vehicle prices (y-axis) vary with the number of cylinders (x-axis).
- Price in Millions: The y-axis is scaled to millions for better readability.
# Dropping rows with missing values in cylinders_count and price for a temporary DataFrame
df_corr = df.dropna(subset=["cylinders_count", "price"]).copy()
# Convert price to millions (temporary operation in df_corr)
df_corr["price_million"] = df_corr["price"] / 1e6 # Price in millions
# Scatter plot with regression line
plt.figure(figsize=(13, 6))
sns.regplot(
data=df_corr,
x="cylinders_count",
y="price_million",
scatter_kws={"alpha": 0.5},
line_kws={"color": "red"}
)
plt.title("Scatter Plot: Cylinders vs. Price (in Millions)")
plt.xlabel("Number of Cylinders")
plt.ylabel("Vehicle Price (Million INR)")
plt.show()
Observations
Positive Correlation: The red regression line shows a clear upward trend, indicating that vehicles with more cylinders generally have higher prices.
Price Variability: Higher cylinder counts (8, 12) exhibit greater price variability. Vehicles with fewer cylinders (e.g., 4) are mostly clustered in the lower price range.
Clusters: Noticeable clusters at specific cylinder counts (e.g., 4, 6, 8), reflecting popular engine configurations.
Regression Analysis with scipy.stats
Why Perform Regression? Regression helps us quantify the relationship between two variables, in this case, the number of cylinders (cylinders_count) and vehicle price (in millions). It allows us to:
- Understand how much price increases with each additional cylinder.
- Test if the relationship is statistically significant
# Convert price to millions for regression
df_corr["price_million"] = df_corr["price"] / 1e6
# Perform linear regression using scipy.stats
slope, intercept, r_value, p_value, std_err = stats.linregress(
df_corr["cylinders_count"], df_corr["price_million"]
)
# Display regression results
print(f"Slope (Coefficient): {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"R-squared: {r_value**2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Standard Error: {std_err:.4f}")
Slope (Coefficient): 0.4644 Intercept: -0.2783 R-squared: 0.2009 P-value: 0.0000 Standard Error: 0.0022
Regression Analysis Results
- Slope (0.4644): Each additional cylinder increases the price by ₹464,400 on average.
- Intercept (-0.2783): Defines the regression line but not practically meaningful here.
- R-squared (0.2009): About 20% of price variation is explained by cylinder count (moderate relationship).
- P-value (0.0000): The relationship is statistically significant.
- Standard Error (0.0022): Indicates the slope estimate is precise.
This test highlights that vehicles with more cylinders are significantly pricier, though other factors also influence the price.
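As a quick illustration of how the fitted line can be used (a sketch reusing the slope and intercept computed above):
# Predict the expected price (in millions) from the fitted line: price = intercept + slope * cylinders
cylinders = 6
predicted_million = intercept + slope * cylinders  # ≈ -0.2783 + 0.4644 * 6 ≈ 2.51
print(f"Expected price for a {cylinders}-cylinder vehicle: ₹{predicted_million:.2f} million")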
✔ We reject the null hypothesis (H₀) and conclude that the number of cylinders significantly affects vehicle prices. Vehicles with more cylinders tend to have higher prices.
📌 Higher cylinder count is a significant factor driving vehicle prices, with each additional cylinder increasing the price by ₹464,400 on average. This aligns with the objective of understanding how vehicle features influence pricing, enabling Spinny to optimise its pricing strategies and cater to premium market demand for high-performance vehicles in relevant regions.
HYPOTHESIS 4 - Relationship Between Income Levels and Average Car Prices¶
Objective: To analyze how state income levels influence the demand for premium versus budget-friendly cars. This will help identify regional preferences and optimize inventory and pricing strategies.
Null Hypothesis (H₀): There is no significant difference in the average price of cars sold between high-income and low-income states.
Alternative Hypothesis (H₁): High-income states have a significantly higher average car price compared to low-income states.
Reason
Understanding regional demand and its relation to income levels can help Spinny:
- Optimize inventory allocation by region.
- Tailor pricing strategies to match purchasing power.
- Identify potential markets for premium or budget-friendly cars.
# Calculate the average price for each state income level
average_price_by_income = df.groupby('state_income')['price'].mean()
# Display the result
print("Average price for each state income level:")
print(average_price_by_income)
Average price for each state income level:
state_income
high        2.533686e+06
low         2.533291e+06
medium      2.536091e+06
very low    2.530398e+06
Name: price, dtype: float64
# Calculate and plot
avg_price = df.groupby('state_income')['price'].mean()
sns.barplot(x=avg_price.index, y=avg_price.values, hue=avg_price.index, dodge=False, legend=False)
plt.title("Average Car Prices by State Income Levels")
plt.xlabel("State Income Levels")
plt.ylabel("Average Price (₹)")
plt.show()
✔ We fail to reject the null hypothesis (H₀).
The null hypothesis ("There is no significant difference in the average price of cars sold between high-income and low-income states") cannot be rejected because:
Small Difference in Averages: The average car prices across income levels are nearly identical (all around ₹2.53 million), with minimal variation.
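To back this visual comparison with a formal check, here is a minimal two-sample test sketch (Welch's t-test via scipy.stats, assuming the same df used above):
from scipy import stats
# Compare prices between high- and low-income states
high = df.loc[df["state_income"] == "high", "price"].dropna()
low = df.loc[df["state_income"] == "low", "price"].dropna()
# Welch's t-test: does not assume equal variances between the two groups
t_stat, p_val = stats.ttest_ind(high, low, equal_var=False)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_val:.4f}")
# A p-value above 0.05 would support failing to reject H₀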
Relationship Between Price and Odometer Readings
Null Hypothesis (H₀): Odometer readings do not significantly impact vehicle price.
Alternative Hypothesis (H₁): Vehicles with lower odometer readings command higher prices.
Reason: Lower mileage is associated with better condition, reliability, and reduced maintenance costs, making these vehicles more desirable.
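No dedicated analysis cell accompanies this hypothesis; a minimal correlation sketch (assuming the odometer column name from the dataset description) could look like:
from scipy import stats
# Pearson correlation between mileage and price
subset = df.dropna(subset=["odometer", "price"])
r, p = stats.pearsonr(subset["odometer"], subset["price"])
print(f"Pearson r: {r:.4f}, p-value: {p:.4f}")
# A significant negative r would support H₁ (lower mileage → higher price)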
HYPOTHESIS 5 - Luxury Cars Have Higher Demand in States With Tier-1 Cities Compared to Rural States¶
Null Hypothesis (H₀): There is no significant difference in the demand for luxury cars between metro and rural states.
Alternative Hypothesis (H₁): Luxury cars are in higher demand in metro states compared to rural states.
Reasoning:
- Higher Disposable Income: Metro residents have more purchasing power to afford luxury vehicles.
- Better Infrastructure: Well-maintained roads make luxury cars more practical in metro areas.
- Availability of Dealerships & Service Centres: Metro cities have more luxury car dealerships and service options.
- Financing & Leasing Benefits: Easier access to loans and EMI plans for high-end cars in metro states.
- Consumer Preferences: Metro buyers prefer premium features, while rural buyers focus on durability and cost-effectiveness.
Count of Cars Priced Above 5 Million INR in Metro vs Rural States
This analysis compares the number of pre-owned cars priced above ₹5 million (₹50 lakh) in metro states (Maharashtra, Karnataka, Delhi) versus rural states (Bihar, Odisha, Chhattisgarh). This helps in understanding the concentration of high-value vehicle sales across different regions.
Steps in the analysis:
- Filter cars where price > 5,000,000 INR.
- Segment data into metro and rural states.
- Count the number of transactions in each segment.
- Present findings in a structured summary.
# Create a copy of the dataset to ensure the original remains unchanged
df_copy = df.copy()
# Define metro and rural states
metro_states = ["maharashtra", "karnataka", "delhi"]
rural_states = ["bihar", "odisha", "chhattisgarh"]
# Filter cars priced above 5 million INR in each category
metro_count = df_copy[(df_copy["price"] > 5_000_000) & (df_copy["state"].isin(metro_states))].shape[0]
rural_count = df_copy[(df_copy["price"] > 5_000_000) & (df_copy["state"].isin(rural_states))].shape[0]
# Create a summary table
car_count_summary = pd.DataFrame({
"Region": ["Metro States (Maharashtra, Karnataka, Delhi)",
"Rural States (Bihar, Odisha, Chhattisgarh)"],
"Car Count (> 5M INR)": [metro_count, rural_count]
})
# Display the summary
print(car_count_summary)
                                         Region  Car Count (> 5M INR)
0  Metro States (Maharashtra, Karnataka, Delhi)                  6758
1    Rural States (Bihar, Odisha, Chhattisgarh)                  1526
# Create a pie chart to visualize the data
plt.figure(figsize=(14, 6))
plt.pie(car_count_summary["Car Count (> 5M INR)"],
labels=car_count_summary["Region"],
autopct='%1.1f%%',
colors=['pink', 'yellowgreen'],
startangle=140,
wedgeprops={'edgecolor': 'black'})
# Title for the pie chart
plt.title("Proportion of Cars Priced Above 5 Million INR in Metro vs Rural States")
# Show the chart
plt.show()
Conclusion:
✔ We reject the null hypothesis (H₀), as the data shows a significant difference in the demand for luxury cars between metro and rural states. The proportion of cars priced above ₹5 million is 81.6% in metro states (Maharashtra, Karnataka, Delhi) compared to only 18.4% in rural states (Bihar, Odisha, Chhattisgarh).
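As a quick formal check (a sketch, not part of the original analysis, and assuming metro and rural segments have comparable overall listing volumes), a binomial test against an even split of high-value listings:
from scipy import stats  # binomtest requires scipy >= 1.7
metro_count, rural_count = 6758, 1526  # counts from the summary table above
# Under H₀, a car priced above ₹5M is equally likely to come from either segment
result = stats.binomtest(metro_count, n=metro_count + rural_count, p=0.5)
print(f"p-value: {result.pvalue:.2e}")
# A vanishingly small p-value supports rejecting H₀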
Reasoning:
The data highlights a much higher demand for luxury cars in metro states, likely due to:
Higher Disposable Income: Metro residents have greater purchasing power, enabling them to afford luxury vehicles.
Better Infrastructure: Well-maintained roads in metro states make owning luxury cars more practical.
Insights:
- Target Market: Metro states are the primary markets for luxury car sales, and efforts to boost transactions in these regions could yield better results.
- Marketing Strategy: Luxury car brands should focus advertising, financing, and dealership presence in metro states to tap into the high demand.
HYPOTHESIS 6 - Fuel Price and Demand Analysis for Different Manufactured Year Cars¶
Objective: To optimise pricing strategies and inventory management by analyzing the average price of cars by fuel type across different manufacturing year brackets and evaluating the count of different fuel types in the pre-owned car market.
Hypothesis
H₀ (Null Hypothesis): The average price of vehicles does not significantly vary across different fuel types and manufacturing year brackets, and fuel type count does not influence pricing trends.
H₁ (Alternate Hypothesis): The average price of vehicles significantly varies across different fuel types and manufacturing year brackets, and fuel type count influences pricing trends.
Reasoning:
Fuel Type and Pricing: Different fuel types (Diesel, Petrol, Electric, Hybrid) have varying market values due to fuel efficiency, maintenance costs, and resale demand.
Manufacturing Year Influence: Older vehicles generally have lower prices due to depreciation, while newer ones retain higher value.
Market Demand Trends: The count of fuel types per manufacturing year bracket reflects shifting consumer preferences and emerging trends in the pre-owned car market.
fig, axes = plt.subplots(1, 3, figsize=(15, 6), sharey=True)
colors = sns.color_palette("pastel", n_colors=5) # Adjust palette size to match categories
# Define year ranges
year_ranges = {
'2006-2010': (2006, 2010),
'2011-2015': (2011, 2015),
'2016-2022': (2016, 2022)
}
# Generate individual charts for each year range
for idx, (label, (start_year, end_year)) in enumerate(year_ranges.items()):
subset = df[(df['mfg_year'] >= start_year) & (df['mfg_year'] <= end_year)]
grouped = subset.groupby('fuel', observed=True).agg(
Average_Price=('price', 'mean'),
fuel_Count=('fuel', 'count')
).reset_index()
grouped['Average_Price'] = grouped['Average_Price'] / 1e6 # Convert to million
sns.barplot(x='fuel', y='Average_Price', hue='fuel', data=grouped, ax=axes[idx], palette=colors, legend=False)
ax2 = axes[idx].twinx()
sns.lineplot(x='fuel', y='fuel_Count', data=grouped, ax=ax2, color='blue', marker='o')
# Add text labels for transaction count
for x, y in zip(grouped['fuel'], grouped['fuel_Count']):
ax2.text(x, y, f'{y:,}', ha='center', va='bottom', fontsize=10, color='purple')
axes[idx].set_title(f"Avg Price & Demand by Fuel ({label})")
axes[idx].set_xlabel("Fuel Type")
axes[idx].set_ylabel("Avg Price (Million INR)")
ax2.set_ylabel("Fuel Count")
plt.tight_layout()
plt.show()
Observations, Insights & Recommendations:
Diesel Vehicles Hold Strong Resale Value but Have Moderate Availability:
Diesel cars maintain higher average prices across all year brackets, indicating sustained demand.
Recommendation: Price diesel cars as premium listings and highlight their fuel efficiency and durability to attract buyers.
Gas Vehicles Are Most Available but Show Price Fluctuations:
Gas-powered vehicles have the highest count across all year brackets, but their prices vary.
Recommendation: Monitor gas vehicle price trends and dynamically adjust pricing based on supply and demand to maximize profitability.
Electric & Hybrid Cars Show Rising Adoption in 2016-2022:
The count of electric and hybrid vehicles has increased in recent years, indicating growing consumer interest.
Recommendation: Expand inventory in the electric & hybrid segment to meet future demand and market trends.
Inventory Optimization for Fuel Type Demand:
- Stock more Diesel cars in premium categories where demand remains steady.
- Regulate Gas vehicle inventory based on demand trends to avoid oversupply.
- Increase Electric & Hybrid vehicle stock to align with evolving buyer preferences.
CONCLUSION
We reject the null hypothesis (H₀):
There is a significant impact of fuel type and manufacturing year bracket on average vehicle price and fuel count. Spinny should strategically adjust pricing and inventory based on fuel-type trends to maximize revenue.
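To formalise the fuel-type part of this conclusion, here is a minimal one-way ANOVA sketch (assuming the same fuel and price columns used above):
from scipy import stats
# One-way ANOVA: does mean price differ across fuel types?
groups = [g["price"].dropna() for _, g in df.groupby("fuel", observed=True)]
f_stat, p_val = stats.f_oneway(*groups)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_val:.2e}")
# A p-value below 0.05 supports rejecting H₀ for fuel type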
1️⃣ Optimizing Pricing Strategy¶
- Set pricing within ₹0.5M–₹3M, as this range has the highest transaction volume.
- Adjust gas vehicle pricing dynamically, given their highest count but fluctuating demand.
- Price diesel cars at a premium, as they retain strong resale value and demand stability.
- Introduce tiered pricing for electric & hybrid vehicles, as their demand has risen post-2016.
- Reduce prices for high-mileage (>1 lakh km) and older (>10 years) cars to improve sales.
2️⃣ Enhancing Inventory Management for Market Demand¶
- Stock more low-mileage cars (<1 lakh km), as they dominate buyer preference.
- Prioritize mid-age (3–8 years old) vehicles, as they form the largest segment.
- Expand gasoline vehicle inventory, as they are the most available and in demand.
- Maintain a strong mix of 4-cylinder and 6-cylinder cars, as they dominate registrations.
- Balance stock for automatic transmissions, as they are the most preferred option.
3️⃣ Expanding Market Reach & Regional Growth Strategy¶
- Focus on Tier-1 cities for premium cars, as metro areas show higher luxury car demand.
- Increase budget-friendly vehicle availability in rural regions, where affordability is key.
- Expand listings for Ford, Chevrolet, and Toyota models, as they dominate demand.
- Leverage the rising popularity of SUVs & sedans, which are the most sought-after body types.
- Introduce region-specific pricing strategies, based on historical demand trends and buyer preferences.
4️⃣ Improving Customer Experience & Service Offerings¶
- Prioritize vehicles in "Good" and "Excellent" condition, as they have the highest demand.
- Use targeted promotions for high-mileage and older vehicles to clear low-demand stock.
- Focus on popular models like the Ford F-150 and Chevrolet Silverado 1500, as they dominate transactions.
- Monitor EV and hybrid adoption trends, gradually increasing their inventory.
- Leverage AI-based customer engagement tools, such as personalized recommendations and predictive buying trends.