Graduation Project -- Spinny Analysis¶

Business Problem Overview¶

Driving Revenue Growth Through Demand and Pricing Optimization for Spinny

Spinny, a trailblazer in India’s rapidly growing pre-owned car market, connects buyers and sellers with a seamless, customer-focused platform. With a mission to simplify car ownership, Spinny offers a curated selection of pre-owned vehicles, guaranteeing quality, transparency, and trust.

However, achieving sustainable growth in this competitive market requires a strategic approach to revenue generation. Success hinges on two critical factors: understanding regional demand and optimizing car pricing.

India's diverse regions exhibit varying preferences and purchasing behaviors for pre-owned cars. Factors such as local demand, vehicle condition, fuel type, and mileage significantly influence transaction volumes. Misaligned pricing strategies can lead to stagnant inventory, missed revenue opportunities, and diminished competitiveness.

For Spinny, unlocking higher revenue entails identifying the most sought-after cars and their defining attributes across regions. Additionally, optimizing pricing based on vehicle features and real-time market trends is essential to attract buyers while maximizing profitability.

By analyzing transaction data, uncovering patterns in demand, and fine-tuning pricing strategies, Spinny can not only increase completed transactions but also solidify its position as the go-to platform for pre-owned car buyers in India.

Objective¶

In this project, we aim to analyze Duolingo user activity data to:

  1. Understand regional demand: Analyze pre-owned car transaction trends across various regions to identify areas with high and low demand.
  2. Identify key car preferences: Determine the most in-demand car models and their defining attributes, such as condition, fuel type, mileage, and other features driving buyer interest.
  3. Optimize pricing strategies: Develop data-driven insights to establish competitive and market-aligned pricing for each car, ensuring higher transaction success rates and improved profitability.

Business Impact¶

This analysis will enable Spinny to:

  1. Boost transaction volumes: By understanding regional demand and buyer preferences, Spinny can strategically align its inventory with market needs, driving higher sales.
  2. Enhance customer satisfaction: Identifying key car attributes and optimizing pricing ensures buyers find vehicles that meet their expectations at competitive prices, fostering trust and loyalty.
  3. Maximize profitability: Data-driven pricing strategies will help Spinny achieve a balance between competitiveness and revenue growth, reducing stagnant inventory and increasing margins.
  4. Strengthen market position: Leveraging insights to cater to diverse regional preferences and optimize pricing will solidify Spinny’s reputation as a customer-first, data-savvy leader in the pre-owned car market.

Dataset Overview¶

Dataset Overview¶

  • Dataset Name: Spinny Analytics Dataset
  • Number of Rows: 4,26,880
  • Number of Columns: 21
  • Description: The dataset captures language learning session details, including user behavior, engagement metrics, and learning progress. It records recall probability, session performance, and lexeme interactions, enabling analysis of learning outcomes and activity patterns over days.

Column Definitions¶

  1. id (TransacThe year of manufacture of the vehicle, indicating its age.tion ID): A unique identifier for each car transaction in the dataset.
  2. Price: The final selling price of the vehicle in Indian Rupees (INR).
  3. Year: The time (in seconds) since the last lesson or practice where this specific word (lexeme) was encountered.
  4. Manufacturer: The brand or company that produced the vehicle.
  5. Model: The specific model name of the vehicle produced by the manufacturer.
  6. Condition: The physical and operational state of the vehicle at the time of sale.
  7. Cylinders: The number of cylinders in the car's engine, influencing its power and performance.
  8. Fuel: The type of fuel the vehicle uses, such as petrol, diesel, CNG, or electric.
  9. Odometer: The total distance the vehicle has traveled, measured in kilometers.
  10. Title_status: The condition of the car's , indicating if it is clean, or has missing parts or has liens.
  11. VIN (Vehicle Identification Number): A unique serial number assigned to each vehicle for identification.
  12. Drive: The drivetrain configuration of the vehicle (e.g., front-wheel drive, rear-wheel drive).
  13. Size: The size classification of the vehicle, such as compact, midsize, or full-size.
  14. Type: The type of vehicle, such as sedan, SUV, hatchback, or truck.
  15. Paint_color: The exterior color of the vehicle.
  16. Image_url: A link to an image of the vehicle for reference.
  17. State: The Indian state where the car transaction took place.
  18. Posting_date: The date when the car was listed for sale.
  19. Latitude: The geographical latitude coordinate of the transaction location.
  20. Longitude: The geographical longitude coordinate of the transaction location.

Analysis & Visualization¶

1. Importing and Cleaning Data¶

Importing Necessary Libraries¶

In [1]:
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical computations
import matplotlib.pyplot as plt  # For plotting and visualization
import seaborn as sns  # For advanced visualizations

Loading the Dataset from google drive¶

In [2]:
sp = pd.read_csv("final_dataset.csv") # Loading the Data
In [3]:
sp
Out[3]:
id price year manufacturer model condition cylinders fuel odometer title_status ... VIN drive size type paint_color image_url state posting_date latitude longitude
0 7222695916 770530.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Haryana NaN 28.8446 75.1167
1 7218891961 1528210.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Haryana NaN 27.9026 77.0382
2 7221797935 2696820.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN West Bengal NaN 24.6838 85.9695
3 7222270760 192610.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Andhra Pradesh NaN 15.5158 81.9342
4 7210384030 629240.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN West Bengal NaN 23.7821 87.4861
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
426875 7301591192 3029710.0 2019.0 nissan maxima s sedan 4d good 6 cylinders gas 32226.0 clean ... 1N4AA6AV6KC367801 fwd NaN sedan NaN https://images.craigslist.org/00o0o_iiraFnHg8q... Delhi 2021-04-04 09:21:31+00:00 28.7876 76.8644
426876 7301591187 3928680.0 2020.0 volvo s60 t5 momentum sedan 4d good NaN gas 12029.0 clean ... 7JR102FKXLG042696 fwd NaN sedan red https://images.craigslist.org/00x0x_15sbgnxCIS... Punjab 2021-04-04 09:21:29+00:00 32.0715 75.1915
426877 7301591147 4493750.0 2020.0 cadillac xt4 sport suv 4d good NaN diesel 4174.0 clean ... 1GYFZFR46LF088296 NaN NaN hatchback white https://images.craigslist.org/00L0L_farM7bxnxR... Madhya Pradesh 2021-04-04 09:21:17+00:00 21.5646 75.7247
426878 7301591140 3723220.0 2018.0 lexus es 350 sedan 4d good 6 cylinders gas 30112.0 clean ... 58ABK1GG4JU103853 fwd NaN sedan silver https://images.craigslist.org/00z0z_bKnIVGLkDT... Maharashtra 2021-04-04 09:21:11+00:00 21.4183 75.4719
426879 7301591129 3928680.0 2019.0 bmw 4 series 430i gran coupe good NaN gas 22716.0 clean ... WBA4J1C58KBM14708 rwd NaN coupe NaN https://images.craigslist.org/00Y0Y_lEUocjyRxa... Jharkhand 2021-04-04 09:21:07+00:00 24.2616 83.7688

426880 rows × 21 columns

Displaying Dataset Information¶

In [4]:
print("Dataset Information:")
sp.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 21 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   price         426880 non-null  float64
 2   year          425675 non-null  float64
 3   manufacturer  409234 non-null  object 
 4   model         421603 non-null  object 
 5   condition     252776 non-null  object 
 6   cylinders     249202 non-null  object 
 7   fuel          423867 non-null  object 
 8   odometer      422480 non-null  float64
 9   title_status  418638 non-null  object 
 10  transmission  424324 non-null  object 
 11  VIN           265838 non-null  object 
 12  drive         296313 non-null  object 
 13  size          120519 non-null  object 
 14  type          334022 non-null  object 
 15  paint_color   296677 non-null  object 
 16  image_url     426812 non-null  object 
 17  state         426880 non-null  object 
 18  posting_date  426812 non-null  object 
 19  latitude      426880 non-null  float64
 20  longitude     426880 non-null  float64
dtypes: float64(5), int64(1), object(15)
memory usage: 68.4+ MB
  • sp.info() gives concise information about the columns and its data types with their count.

Displaying Column Names¶

In [5]:
sp.columns
Out[5]:
Index(['id', 'price', 'year', 'manufacturer', 'model', 'condition',
       'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN',
       'drive', 'size', 'type', 'paint_color', 'image_url', 'state',
       'posting_date', 'latitude', 'longitude'],
      dtype='object')
  • Displays the Columns in the DataFrame

Describing Dataset Information¶

In [6]:
sp.describe()
Out[6]:
id price year odometer latitude longitude
count 4.268800e+05 4.268800e+05 425675.000000 4.224800e+05 426880.000000 426880.000000
mean 7.311487e+09 9.657028e+06 2011.235191 9.804333e+04 22.800888 79.012914
std 4.473170e+06 1.564439e+09 9.452120 2.138815e+05 6.532617 5.208436
min 7.207408e+09 0.000000e+00 1900.000000 0.000000e+00 6.800300 68.400200
25% 7.308143e+09 7.576800e+05 2008.000000 3.770400e+04 17.796000 75.719100
50% 7.312621e+09 1.791970e+06 2013.000000 8.554800e+04 24.058650 77.353900
75% 7.315254e+09 3.401080e+06 2017.000000 1.335425e+05 28.563525 80.927325
max 7.317101e+09 4.798934e+11 2022.000000 1.000000e+07 35.997600 97.398000
  • sp.describe() function describes the numerical columns from the DataFrame.
  • It Displays the count, mean, std, min, 25%, 50%, 75%, max which identifies potential outliers or data issues.
  • It Helps to understand the Numerical Data Columns.

Displaying Column Data Types¶

In [7]:
print("The Data Types of all Columns:")
sp.dtypes
The Data Types of all Columns:
Out[7]:
id                int64
price           float64
year            float64
manufacturer     object
model            object
condition        object
cylinders        object
fuel             object
odometer        float64
title_status     object
transmission     object
VIN              object
drive            object
size             object
type             object
paint_color      object
image_url        object
state            object
posting_date     object
latitude        float64
longitude       float64
dtype: object
  • Using sp.dtypes, the data types for each column has been Displayed.

Checking the Shape of the Dataset¶

In [8]:
rows, columns = sp.shape
print(f"The dataset contains {rows} rows and {columns} columns.")
The dataset contains 426880 rows and 21 columns.
  • sp.shape has been used for checking the size of the dataset

Checking the unique values in the Dataset¶

In [9]:
for column in sp.columns.tolist():
    print(f"No. of unique values in {column}:")
    print(sp[column].nunique())
No. of unique values in id:
426880
No. of unique values in price:
6253
No. of unique values in year:
114
No. of unique values in manufacturer:
42
No. of unique values in model:
29667
No. of unique values in condition:
6
No. of unique values in cylinders:
8
No. of unique values in fuel:
5
No. of unique values in odometer:
104870
No. of unique values in title_status:
6
No. of unique values in transmission:
3
No. of unique values in VIN:
118264
No. of unique values in drive:
3
No. of unique values in size:
4
No. of unique values in type:
13
No. of unique values in paint_color:
12
No. of unique values in image_url:
241899
No. of unique values in state:
35
No. of unique values in posting_date:
333355
No. of unique values in latitude:
191354
No. of unique values in longitude:
162536
  • sp.columns.tolist() converts that list of column names into a Python list which loops and iterates over all the columns in the DataFrame dl.
  • sp[column].nunique() returns the number of unique (distinct) values in the column.

Checking for the Value Counts in the Dataset¶

In [10]:
value_counts_dict = {}

for column in sp.columns:
    value_counts_dict[column] = sp[column].value_counts()
    print(value_counts_dict[column])
7222695916    1
7313139418    1
7313423023    1
7313423324    1
7313424533    1
             ..
7314655506    1
7314655833    1
7314657468    1
7314659947    1
7301591129    1
Name: id, Length: 426880, dtype: int64
0.0           34868
3851650.0      3683
897960.0       3623
1026400.0      3585
1154830.0      3512
              ...  
13329960.0        1
1819610.0         1
425710.0          1
2222640.0         1
611520.0          1
Name: price, Length: 6253, dtype: int64
2017.0    36420
2018.0    36369
2015.0    31538
2013.0    30794
2016.0    30434
          ...  
1943.0        1
1915.0        1
1902.0        1
1905.0        1
1909.0        1
Name: year, Length: 114, dtype: int64
ford               70985
chevrolet          55064
toyota             34202
honda              21269
nissan             19067
jeep               19014
ram                18342
gmc                16785
bmw                14699
dodge              13707
mercedes-benz      11817
hyundai            10338
subaru              9495
volkswagen          9345
kia                 8457
lexus               8200
audi                7573
cadillac            6953
chrysler            6031
acura               5978
buick               5501
mazda               5427
infiniti            4802
lincoln             4220
volvo               3374
mitsubishi          3292
mini                2376
pontiac             2288
rover               2113
jaguar              1946
porsche             1384
mercury             1184
saturn              1090
alfa-romeo           897
tesla                868
fiat                 792
harley-davidson      153
ferrari               95
datsun                63
aston-martin          24
land rover            21
morgan                 3
Name: manufacturer, dtype: int64
f-150                       8009
silverado 1500              5140
1500                        4211
camry                       3135
silverado                   3023
                            ... 
Huyndai Sante Fe Limited       1
astro awd 4x4                  1
escalade and                   1
cx 3                           1
Paige Glenbrook Touring        1
Name: model, Length: 29667, dtype: int64
good         121456
excellent    101467
like new      21178
fair           6769
new            1305
salvage         601
Name: condition, dtype: int64
6 cylinders     94169
4 cylinders     77642
8 cylinders     72062
5 cylinders      1712
10 cylinders     1455
other            1298
3 cylinders       655
12 cylinders      209
Name: cylinders, dtype: int64
gas         356209
other        30728
diesel       30062
hybrid        5170
electric      1698
Name: fuel, dtype: int64
100000.0    2263
1.0         2246
0.0         1965
200000.0    1728
150000.0    1603
            ... 
149468.0       1
154259.0       1
111887.0       1
213826.0       1
26892.0        1
Name: odometer, Length: 104870, dtype: int64
clean         405117
rebuilt         7219
salvage         3868
lien            1422
missing          814
parts only       198
Name: title_status, dtype: int64
automatic    336524
other         62682
manual        25118
Name: transmission, dtype: int64
1FMJU1JT1HEA52352    261
3C6JR6DT3KG560649    235
1FTER1EH1LLA36301    231
5TFTX4CN3EX042751    227
1GCHTCE37G1186784    214
                    ... 
2G4GP5EX5E9213578      1
1GCEK19J78Z219711      1
JA4AT3AW1AZ006543      1
4T1BF28B61U153724      1
SAJGX2749VCOO8376      1
Name: VIN, Length: 118264, dtype: int64
4wd    131904
fwd    105517
rwd     58892
Name: drive, dtype: int64
full-size      63465
mid-size       34476
compact        19384
sub-compact     3194
Name: size, dtype: int64
sedan          87056
SUV            77284
pickup         43510
truck          35279
other          22110
coupe          19204
hatchback      16598
wagon          10751
van             8548
convertible     7731
mini-van        4825
offroad          609
bus              517
Name: type, dtype: int64
white     79285
black     62861
silver    42970
blue      31223
red       30473
grey      24416
green      7343
custom     6700
brown      6593
yellow     2142
orange     1984
purple      687
Name: paint_color, dtype: int64
https://images.craigslist.org/00N0N_1xMPvfxRAIdz_0gw0co_600x450.jpg    7357
https://images.craigslist.org/00R0R_lwWjXSEWNa7z_0x20oM_600x450.jpg    2192
https://images.craigslist.org/01111_l590CdXzJSoz_001001_600x450.jpg     276
https://images.craigslist.org/01313_9Zne268YnWZz_0gw0co_600x450.jpg     231
https://images.craigslist.org/00X0X_1cMDo5xqSuAz_0gw0co_600x450.jpg     227
                                                                       ... 
https://images.craigslist.org/00a0a_gbpYPGPfZEwz_0lM0t2_600x450.jpg       1
https://images.craigslist.org/00e0e_g9RfXGmsjDaz_0CI0hq_600x450.jpg       1
https://images.craigslist.org/00Y0Y_4fRgRCeuUj2z_0CI0t2_600x450.jpg       1
https://images.craigslist.org/00K0K_fUdf8IWJOlmz_0t20CI_600x450.jpg       1
https://images.craigslist.org/00C0C_3D2bgxR1cysz_0ak07K_600x450.jpg       1
Name: image_url, Length: 241899, dtype: int64
Uttar Pradesh                               45573
Maharashtra                                 45263
Karnataka                                   30492
Punjab                                      30455
Delhi                                       30379
Haryana                                     30175
Tamil Nadu                                  18124
West Bengal                                 18052
Telangana                                   15356
Kerala                                      15304
Madhya Pradesh                              15205
Rajasthan                                   15131
Andhra Pradesh                              15070
Gujarat                                     14992
Odisha                                      12189
Jharkhand                                    6080
Bihar                                        6034
Chhattisgarh                                 6003
Assam                                        5987
Uttarakhand                                  5951
Andaman and Nicobar Islands                  3079
Sikkim                                       3065
Ladakh                                       3062
Puducherry                                   3046
Jammu and Kashmir                            3030
Meghalaya                                    3017
Himachal Pradesh                             3016
Mizoram                                      3010
Manipur                                      2990
Chandigarh                                   2986
Arunachal Pradesh                            2980
Dadra and Nagar Haveli and Daman and Diu     2980
Tripura                                      2961
Goa                                          2923
Lakshadweep                                  2920
Name: state, dtype: int64
2021-04-24 02:13:05+00:00    13
2021-04-30 14:01:13+00:00    12
2021-05-04 15:00:42+00:00    12
2021-05-03 13:41:54+00:00    11
2021-04-29 20:25:19+00:00    11
                             ..
2021-04-21 00:52:45+00:00     1
2021-04-20 20:41:24+00:00     1
2021-04-20 19:24:56+00:00     1
2021-04-20 19:20:54+00:00     1
2021-04-04 09:21:29+00:00     1
Name: posting_date, Length: 333355, dtype: int64
28.8329    20
28.8703    20
28.7504    19
28.6300    19
28.5907    19
           ..
19.5164     1
11.4463     1
27.3585     1
21.9371     1
17.0875     1
Name: latitude, Length: 191354, dtype: int64
77.2591    24
77.1361    24
77.2894    23
77.2735    23
76.9070    23
           ..
80.3373     1
68.9184     1
84.2857     1
89.4721     1
83.7688     1
Name: longitude, Length: 162536, dtype: int64
  • value_counts_dict = {} initializes an empty dictionary.
  • sp[column].value_counts() returns the counts of unique values in the column.

2. Data Preparation¶

Checking for Missing/Null Values¶

In [11]:
missing_value_count = sp.isnull().sum()
print("Missing Values in Each Column:")
missing_value_count
Missing Values in Each Column:
Out[11]:
id                   0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
VIN             161042
drive           130567
size            306361
type             92858
paint_color     130203
image_url           68
state                0
posting_date        68
latitude             0
longitude            0
dtype: int64
  • sp.isnull() represents whether the data contains null or not.
  • sp.isnull().sum() Provides the count of missing values for each column in the DataFrame sp.

Checking for Duplicate Values in the Dataset¶

In [12]:
duplicates = sp[sp.duplicated()]
duplicate_count = len(duplicates)
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")
Number of Duplicate Rows in the Dataset: 0
  • The sp.duplicated() function is used to identify duplicate rows in a DataFrame sp.

Checking Data Column types for Well Structured Analysis.¶

In [13]:
sp.dtypes
Out[13]:
id                int64
price           float64
year            float64
manufacturer     object
model            object
condition        object
cylinders        object
fuel             object
odometer        float64
title_status     object
transmission     object
VIN              object
drive            object
size             object
type             object
paint_color      object
image_url        object
state            object
posting_date     object
latitude        float64
longitude       float64
dtype: object
  • For the further analysis, the data types for all the column should be appropriate for a better directed analysis.
  • Here its noticed that data type of posting_date is not appropriate as its states 'object' but it has to be in 'datetime'.

Changing Datatype of Inappropriate Columns.¶

In [14]:
pd.options.mode.chained_assignment = None
sp["posting_date"]= pd.to_datetime(sp["posting_date"], format = '%Y-%m-%d %H:%M:%S')
  • Pandas might issue a warning. Setting pd.options.mode.chained_assignment = None stops that warning from appearing.
  • pd.to_datetime() function that converts column sp["posting_date"] to datetime format.
In [15]:
# Convert 2022.0 to 2022 while keeping NaN values
sp['year'] = pd.to_numeric(sp['year'], errors='coerce').astype('Int64')
sp['year']
Out[15]:
0         <NA>
1         <NA>
2         <NA>
3         <NA>
4         <NA>
          ... 
426875    2019
426876    2020
426877    2020
426878    2018
426879    2019
Name: year, Length: 426880, dtype: Int64
  • The method pd.to_numeric() attempts to convert the values in the year column to numeric format.
In [16]:
sp['price'] = pd.to_numeric(sp['price'], errors='coerce')
  • The method pd.to_numeric() attempts to convert the values in the price column to numeric format.
In [17]:
sp.dtypes
Out[17]:
id                            int64
price                       float64
year                          Int64
manufacturer                 object
model                        object
condition                    object
cylinders                    object
fuel                         object
odometer                    float64
title_status                 object
transmission                 object
VIN                          object
drive                        object
size                         object
type                         object
paint_color                  object
image_url                    object
state                        object
posting_date    datetime64[ns, UTC]
latitude                    float64
longitude                   float64
dtype: object
  • .dtypes retrieves the data types of all columns and ensure that all the column has the appropriate data type.

Segregating Numerical Columns for Devicing Outliers.¶

In [18]:
# Select columns with numeric types (either float64 or int64)
numerical_columns = sp.select_dtypes(include=['float64']).columns.tolist()
print(numerical_columns)
['price', 'odometer', 'latitude', 'longitude']
  • sp.select_dtypes(include=['float64', 'int64']) filters the DataFrame sp to select only the columns that are of type float64 or int64 which is numerical column.
  • .columns.tolist() converts the filtered columns into a list.

Devicing outliers from the DataFrame¶

In [19]:
outliers = {}
  • Initializing a Dictionary using outliers = {}.
In [20]:
for column in numerical_columns:
    Q1 = sp[column].quantile(0.10)
    Q3 = sp[column].quantile(0.90)
    IQR = Q3-Q1
    
    lower_bound = Q1 - 1.5*IQR
    upper_bound = Q3 + 1.5*IQR
    
    outliers[column] = sp[column][(sp[column]<lower_bound) | (sp[column]>upper_bound)]
    print(outliers[column])
280       1.268338e+11
346       1.270072e+07
461       1.283604e+07
732       1.284192e+07
1727      1.669450e+07
              ...     
420535    1.284100e+07
420693    1.605240e+07
424281    3.846158e+07
424840    1.258505e+07
426302    1.605240e+07
Name: price, Length: 874, dtype: float64
280        999999.0
455       9999999.0
562       2222222.0
615       1111111.0
616       1111111.0
            ...    
424388     468000.0
424840     550000.0
424962     447247.0
426106    9999999.0
426108     999999.0
Name: odometer, Length: 1619, dtype: float64
Series([], Name: latitude, dtype: float64)
Series([], Name: longitude, dtype: float64)
  • The code loops through all numeric columns (numerical_columns) in the DataFrame sp.
  • For each column, it calculates the 10th and 90th percentiles to determine the Inter Quartile Range.
  • lower_bound = Q1 - 1.5*IQR,upper_bound = Q3 + 1.5*IQR calculates the lower and upper bounds.
  • Outliers[column] = sp[column][(sp[column]<lower_bound) | (sp[column]>upper_bound)] identifies the outliers which is any data points outside this range.

Handling the Outliers.¶

In [21]:
for col, data in outliers.items():
    print(f"{col}: {len(data)}")
price: 874
odometer: 1619
latitude: 0
longitude: 0
  • Using the looping through conditions through outliers.items().
  • len(data) employs the length of the outliers.

Data Distribution in Outliers.¶

In [22]:
for columns in numerical_columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=sp[columns])
    plt.title('Boxplot for Numerical Columns')
    plt.xlabel(columns)
    plt.ylabel('Values')
    plt.xticks(rotation=45)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
  • The above code is generating separate boxplots for each of the numerical columns in the numerical_columns list as it allows visually inspect the distribution and potential outliers in each column.

Creating New columns for better Analysis¶

Create posting_year column¶

In [23]:
sp['posting_year'] = sp['posting_date'].dt.year
sp['posting_year'] = pd.to_numeric(sp['posting_year'], errors='coerce').astype('Int64')
sp['posting_year']
Out[23]:
0         <NA>
1         <NA>
2         <NA>
3         <NA>
4         <NA>
          ... 
426875    2021
426876    2021
426877    2021
426878    2021
426879    2021
Name: posting_year, Length: 426880, dtype: Int64
  • The .dt.year attribute extracts the year from the datetime column posting_date.
  • pd.to_numeric() Converts values in the posting_year column to numeric format.

Create posting_month column¶

In [24]:
sp['posting_month'] = sp['posting_date'].dt.month
sp['posting_month'] = pd.to_numeric(sp['posting_month'], errors='coerce').astype('Int64')
sp['posting_month']
Out[24]:
0         <NA>
1         <NA>
2         <NA>
3         <NA>
4         <NA>
          ... 
426875       4
426876       4
426877       4
426878       4
426879       4
Name: posting_month, Length: 426880, dtype: Int64
  • The .dt.month attribute extracts the month from the datetime column posting_date.
  • pd.to_numeric() Converts values in the posting_month column to numeric format.

Create posting_weekday column¶

In [25]:
sp['posting_weekday'] = sp['posting_date'].dt.day_name()
sp['posting_weekday']
Out[25]:
0            NaN
1            NaN
2            NaN
3            NaN
4            NaN
           ...  
426875    Sunday
426876    Sunday
426877    Sunday
426878    Sunday
426879    Sunday
Name: posting_weekday, Length: 426880, dtype: object
  • The .dt.day_name() attribute extracts the weekday from the datetime column posting_date.
  • pd.to_numeric() Converts values in the posting_weekday column to numeric format.

Ensuring Data Integrity¶

In [26]:
invalid_data = sp[sp['year'] > sp['posting_year']]
invalid_data
Out[26]:
id price year manufacturer model condition cylinders fuel odometer title_status ... type paint_color image_url state posting_date latitude longitude posting_year posting_month posting_weekday
9738 7303397975 770530.0 2022 NaN Suzuki Vitara NaN NaN gas 115000.0 clean ... NaN NaN https://images.craigslist.org/00n0n_9cYtTzxWFZ... Uttar Pradesh 2021-04-07 22:56:59+00:00 28.1913 84.3726 2021 4 Wednesday
32148 7314343550 0.0 2022 NaN Freightliner SportChassis RHA112 NaN NaN diesel 10.0 clean ... pickup black https://images.craigslist.org/00202_j8VlmOCW5g... Gujarat 2021-04-29 15:54:44+00:00 21.1472 70.9694 2021 4 Thursday
43183 7314165992 1540060.0 2022 toyota 4runner 4wd NaN NaN gas 207834.0 clean ... NaN custom https://images.craigslist.org/00404_8uTHOfbLml... Uttar Pradesh 2021-04-29 02:16:46+00:00 25.3863 82.3836 2021 4 Thursday
65611 7309051490 321050.0 2022 honda civic cvcc fair 4 cylinders gas 605000.0 clean ... NaN NaN https://images.craigslist.org/01414_7ghFSAOlNo... Maharashtra 2021-04-19 01:43:58+00:00 19.7144 75.0804 2021 4 Monday
65612 7309047664 449480.0 2022 toyota mighty max fair 4 cylinders gas 180000.0 clean ... truck white https://images.craigslist.org/00m0m_5XzKF5TFiE... Delhi 2021-04-19 01:31:30+00:00 28.6941 76.8817 2021 4 Monday
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
410935 7315558994 49390.0 2022 mitsubishi eclipse cross le excellent 4 cylinders gas 5.0 NaN ... SUV NaN https://images.craigslist.org/00C0C_mwIIiP2iUq... Puducherry 2021-05-01 23:06:10+00:00 12.2829 79.8404 2021 5 Saturday
410936 7315551108 52330.0 2022 mitsubishi eclipse cross se excellent 4 cylinders gas 3.0 NaN ... SUV NaN https://images.craigslist.org/00z0z_6xrjH3eEy3... Chandigarh 2021-05-01 22:46:56+00:00 30.7712 76.7766 2021 5 Saturday
412094 7301823858 55270.0 2022 mitsubishi eclipse cross sel excellent 4 cylinders gas 5.0 NaN ... SUV NaN https://images.craigslist.org/00v0v_c8464eYsPz... Maharashtra 2021-04-04 22:02:54+00:00 17.3951 77.1283 2021 4 Sunday
413805 7301823862 55270.0 2022 mitsubishi eclipse cross sel excellent 4 cylinders gas 5.0 NaN ... SUV NaN https://images.craigslist.org/00v0v_c8464eYsPz... Meghalaya 2021-04-04 22:02:54+00:00 25.9323 91.2154 2021 4 Sunday
423091 7314368209 31670.0 2022 NaN trailer fair NaN other 1111111.0 clean ... NaN NaN https://images.craigslist.org/00Y0Y_3AeJN6UYkh... Goa 2021-04-29 16:34:59+00:00 15.5037 73.8108 2021 4 Thursday

133 rows × 24 columns

  • This line creates a new DataFrame, invalid_data, by filtering rows in sp where the value in the year column is greater than the value in the posting_year column.
  • Here it is noticed that there are 133 entries which are invalid.

Dropping Invalid Data¶

In [27]:
sp = sp.drop(invalid_data.index)
sp
Out[27]:
id price year manufacturer model condition cylinders fuel odometer title_status ... type paint_color image_url state posting_date latitude longitude posting_year posting_month posting_weekday
0 7222695916 770530.0 <NA> NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN Haryana NaT 28.8446 75.1167 <NA> <NA> NaN
1 7218891961 1528210.0 <NA> NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN Haryana NaT 27.9026 77.0382 <NA> <NA> NaN
2 7221797935 2696820.0 <NA> NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN West Bengal NaT 24.6838 85.9695 <NA> <NA> NaN
3 7222270760 192610.0 <NA> NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN Andhra Pradesh NaT 15.5158 81.9342 <NA> <NA> NaN
4 7210384030 629240.0 <NA> NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN West Bengal NaT 23.7821 87.4861 <NA> <NA> NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
426875 7301591192 3029710.0 2019 nissan maxima s sedan 4d good 6 cylinders gas 32226.0 clean ... sedan NaN https://images.craigslist.org/00o0o_iiraFnHg8q... Delhi 2021-04-04 09:21:31+00:00 28.7876 76.8644 2021 4 Sunday
426876 7301591187 3928680.0 2020 volvo s60 t5 momentum sedan 4d good NaN gas 12029.0 clean ... sedan red https://images.craigslist.org/00x0x_15sbgnxCIS... Punjab 2021-04-04 09:21:29+00:00 32.0715 75.1915 2021 4 Sunday
426877 7301591147 4493750.0 2020 cadillac xt4 sport suv 4d good NaN diesel 4174.0 clean ... hatchback white https://images.craigslist.org/00L0L_farM7bxnxR... Madhya Pradesh 2021-04-04 09:21:17+00:00 21.5646 75.7247 2021 4 Sunday
426878 7301591140 3723220.0 2018 lexus es 350 sedan 4d good 6 cylinders gas 30112.0 clean ... sedan silver https://images.craigslist.org/00z0z_bKnIVGLkDT... Maharashtra 2021-04-04 09:21:11+00:00 21.4183 75.4719 2021 4 Sunday
426879 7301591129 3928680.0 2019 bmw 4 series 430i gran coupe good NaN gas 22716.0 clean ... coupe NaN https://images.craigslist.org/00Y0Y_lEUocjyRxa... Jharkhand 2021-04-04 09:21:07+00:00 24.2616 83.7688 2021 4 Sunday

426747 rows × 24 columns

  • sp.drop(invalid_data.index) removes these rows from the original DataFrame sp.
  • Since there is 133 rows which is quite insignificant compared to the whole data. The invalid_data is removed from the sp.

Changing the column name type to car_type¶

In [28]:
if 'type' in sp.columns:
    sp.rename(columns={'type': 'car_type'}, inplace=True)
  • Since the word type is a keyword in python, the column name type is changed to car_type
  • The .rename method is used to change the name of the column type to car_type.

Handling Inappropriate Data¶

Handling the price column¶

In [29]:
sp[sp['price']==0]
Out[29]:
id price year manufacturer model condition cylinders fuel odometer title_status ... car_type paint_color image_url state posting_date latitude longitude posting_year posting_month posting_weekday
10 7218893038 0.0 <NA> NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN Uttar Pradesh NaT 28.5810 83.4530 <NA> <NA> NaN
11 7218325704 0.0 <NA> NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN Maharashtra NaT 16.3906 73.9100 <NA> <NA> NaN
12 7217788283 0.0 <NA> NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN Maharashtra NaT 20.6117 79.7192 <NA> <NA> NaN
13 7217147606 0.0 <NA> NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN Uttar Pradesh NaT 29.8561 81.6407 <NA> <NA> NaN
14 7209027818 0.0 <NA> NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN Kerala NaT 8.5412 75.6084 <NA> <NA> NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
426764 7303347225 0.0 2018 NaN peterbilt 579 NaN NaN diesel 1.0 clean ... NaN NaN https://images.craigslist.org/00F0F_kfjCcZyncy... Kerala 2021-04-07 21:15:05+00:00 8.7133 74.9888 2021 4 Wednesday
426812 7302877815 0.0 2006 toyota scion tc excellent 4 cylinders gas 195000.0 clean ... NaN silver https://images.craigslist.org/00B0B_dqmLwdZw8Q... Tamil Nadu 2021-04-06 23:00:19+00:00 9.7042 79.7712 2021 4 Tuesday
426832 7302353283 0.0 2004 toyota prius excellent 4 cylinders hybrid 239000.0 clean ... NaN blue https://images.craigslist.org/00V0V_7Ap4arGS9S... Telangana 2021-04-05 23:00:55+00:00 16.8085 77.5499 2021 4 Monday
426836 7302301268 0.0 2018 ram 2500 excellent 6 cylinders diesel 20492.0 clean ... truck white https://images.craigslist.org/00i0i_3cSpgdy0qQ... Delhi 2021-04-05 21:18:42+00:00 28.6000 77.2163 2021 4 Monday
426868 7301843288 0.0 2010 toyota venza excellent 6 cylinders gas 155000.0 clean ... NaN blue https://images.craigslist.org/00808_hNEilrIb0i... Uttar Pradesh 2021-04-04 23:00:48+00:00 27.9450 79.3595 2021 4 Sunday

34832 rows × 24 columns

  • price column cannot have entries 0.
  • So it is attempted to replace with the Median of price column in the next step.
In [30]:
# Calculate the median of the column (ignoring zeros)
median_price = sp[sp['price'] > 0]['price'].median()

# Replace zeros with the median
sp.loc[sp['price'] == 0, 'price'] = median_price

# Display the updated dataframe and median
print(f"Median Price (excluding zeros): {median_price}")
print(sp[['price']].head())
Median Price (excluding zeros): 1990460.0
       price
0   770530.0
1  1528210.0
2  2696820.0
3   192610.0
4   629240.0
  • .median() calculates the median of the price column, ignoring zeros.
  • sp.loc[sp['price'] == 0, 'price'] = median_price replaces all zero values in the price column with the calculated median.

Handling the odometer column¶

In [31]:
sp['odometer'].isnull().sum()
Out[31]:
4400
  • sp['odometer'] cannot have null values.
  • So the null values will be replaced with median odometer in the next step.
In [32]:
# Calculate the median of the column (ignoring zeros)
median_odometer = sp[sp['odometer'] > 0]['odometer'].median()

# Replace missing with the median
sp['odometer'].fillna(median_odometer, inplace=True)

# Display the updated dataframe and median
print(f"Median odometer (excluding zeros): {median_odometer}")
print(sp[['odometer']].head())
Median odometer (excluding zeros): 86000.0
   odometer
0   86000.0
1   86000.0
2   86000.0
3   86000.0
4   86000.0
  • .median() calculates the median of the odometer column, ignoring zeros.
  • .fillna() replaces missing values (NaN) in the odometer column with the calculated median.

Handling the year column¶

In [33]:
sp['year'].isnull().sum()
Out[33]:
1205
  • sp['odometer'] cannot have null values.
  • So in the next step it is attempted to rectify.
In [34]:
# Replace empty or invalid values with NaN
sp['year'] = sp['year'].replace(['', 'NA'], pd.NA)

# Calculate median years for each manufacturer and model
median_years = sp.groupby(['manufacturer', 'model'])['year'].median()

# Fill missing years using the calculated medians
sp['year'] = sp.apply(
    lambda row: median_years.loc[(row['manufacturer'], row['model'])] if pd.isna(row['year']) and (row['manufacturer'], row['model']) in median_years.index else row['year'],
    axis=1
)

# For rows still missing 'year', fill with the overall median
overall_median_year = sp['year'].median()
sp['year'] = sp['year'].fillna(overall_median_year)

# View the result
print(sp)
                id      price    year manufacturer                     model  \
0       7222695916   770530.0  2013.0          NaN                       NaN   
1       7218891961  1528210.0  2013.0          NaN                       NaN   
2       7221797935  2696820.0  2013.0          NaN                       NaN   
3       7222270760   192610.0  2013.0          NaN                       NaN   
4       7210384030   629240.0  2013.0          NaN                       NaN   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

       condition    cylinders    fuel  odometer title_status  ...   car_type  \
0            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
1            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
2            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
3            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
4            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
...          ...          ...     ...       ...          ...  ...        ...   
426875      good  6 cylinders     gas   32226.0        clean  ...      sedan   
426876      good          NaN     gas   12029.0        clean  ...      sedan   
426877      good          NaN  diesel    4174.0        clean  ...  hatchback   
426878      good  6 cylinders     gas   30112.0        clean  ...      sedan   
426879      good          NaN     gas   22716.0        clean  ...      coupe   

       paint_color                                          image_url  \
0              NaN                                                NaN   
1              NaN                                                NaN   
2              NaN                                                NaN   
3              NaN                                                NaN   
4              NaN                                                NaN   
...            ...                                                ...   
426875         NaN  https://images.craigslist.org/00o0o_iiraFnHg8q...   
426876         red  https://images.craigslist.org/00x0x_15sbgnxCIS...   
426877       white  https://images.craigslist.org/00L0L_farM7bxnxR...   
426878      silver  https://images.craigslist.org/00z0z_bKnIVGLkDT...   
426879         NaN  https://images.craigslist.org/00Y0Y_lEUocjyRxa...   

                 state              posting_date latitude longitude  \
0              Haryana                       NaT  28.8446   75.1167   
1              Haryana                       NaT  27.9026   77.0382   
2          West Bengal                       NaT  24.6838   85.9695   
3       Andhra Pradesh                       NaT  15.5158   81.9342   
4          West Bengal                       NaT  23.7821   87.4861   
...                ...                       ...      ...       ...   
426875           Delhi 2021-04-04 09:21:31+00:00  28.7876   76.8644   
426876          Punjab 2021-04-04 09:21:29+00:00  32.0715   75.1915   
426877  Madhya Pradesh 2021-04-04 09:21:17+00:00  21.5646   75.7247   
426878     Maharashtra 2021-04-04 09:21:11+00:00  21.4183   75.4719   
426879       Jharkhand 2021-04-04 09:21:07+00:00  24.2616   83.7688   

       posting_year posting_month  posting_weekday  
0              <NA>          <NA>              NaN  
1              <NA>          <NA>              NaN  
2              <NA>          <NA>              NaN  
3              <NA>          <NA>              NaN  
4              <NA>          <NA>              NaN  
...             ...           ...              ...  
426875         2021             4           Sunday  
426876         2021             4           Sunday  
426877         2021             4           Sunday  
426878         2021             4           Sunday  
426879         2021             4           Sunday  

[426747 rows x 24 columns]
  • .replace()replaces empty or invalid values in the year column with NaN.
  • sp.groupby(['manufacturer', 'model'])['year'].median()calculates the median year for each manufacturer and model group.
  • apply() fills missing year values using the calculated medians for corresponding manufacturer and model.
  • With .filna() which fills any remaining missing year values with the overall median.

Handling model column¶

In [35]:
sp['model'].isnull().sum()
Out[35]:
5268
In [36]:
sp['model'].value_counts()
Out[36]:
f-150                      8008
silverado 1500             5140
1500                       4211
camry                      3135
silverado                  3022
                           ... 
escalade and                  1
cx 3                          1
1991 INTERNATIONAL 9800       1
nv 200 s                      1
Paige Glenbrook Touring       1
Name: model, Length: 29648, dtype: int64
In [37]:
# Step 1: Create a mapping of (manufacturer, year) to model
model_mapping = (
    sp[~sp['model'].isnull()]  # Filter rows where 'model' is not null
    .groupby(['manufacturer', 'year'])['model']
    .first()
    .reset_index()
)

# Convert the mapping to a dictionary
mapping_dict = model_mapping.set_index(['manufacturer', 'year'])['model'].to_dict()

# Step 2: Fill missing 'model' values
for index, row in sp.iterrows():      
    if pd.isnull(row['model']):  # Check if 'model' is missing
        sp.at[index, 'model'] = mapping_dict.get((row['manufacturer'], row['year']), None)

# Print the updated DataFrame
print(sp)
                id      price    year manufacturer                     model  \
0       7222695916   770530.0  2013.0          NaN                       NaN   
1       7218891961  1528210.0  2013.0          NaN                       NaN   
2       7221797935  2696820.0  2013.0          NaN                       NaN   
3       7222270760   192610.0  2013.0          NaN                       NaN   
4       7210384030   629240.0  2013.0          NaN                       NaN   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

       condition    cylinders    fuel  odometer title_status  ...   car_type  \
0            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
1            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
2            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
3            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
4            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
...          ...          ...     ...       ...          ...  ...        ...   
426875      good  6 cylinders     gas   32226.0        clean  ...      sedan   
426876      good          NaN     gas   12029.0        clean  ...      sedan   
426877      good          NaN  diesel    4174.0        clean  ...  hatchback   
426878      good  6 cylinders     gas   30112.0        clean  ...      sedan   
426879      good          NaN     gas   22716.0        clean  ...      coupe   

       paint_color                                          image_url  \
0              NaN                                                NaN   
1              NaN                                                NaN   
2              NaN                                                NaN   
3              NaN                                                NaN   
4              NaN                                                NaN   
...            ...                                                ...   
426875         NaN  https://images.craigslist.org/00o0o_iiraFnHg8q...   
426876         red  https://images.craigslist.org/00x0x_15sbgnxCIS...   
426877       white  https://images.craigslist.org/00L0L_farM7bxnxR...   
426878      silver  https://images.craigslist.org/00z0z_bKnIVGLkDT...   
426879         NaN  https://images.craigslist.org/00Y0Y_lEUocjyRxa...   

                 state              posting_date latitude longitude  \
0              Haryana                       NaT  28.8446   75.1167   
1              Haryana                       NaT  27.9026   77.0382   
2          West Bengal                       NaT  24.6838   85.9695   
3       Andhra Pradesh                       NaT  15.5158   81.9342   
4          West Bengal                       NaT  23.7821   87.4861   
...                ...                       ...      ...       ...   
426875           Delhi 2021-04-04 09:21:31+00:00  28.7876   76.8644   
426876          Punjab 2021-04-04 09:21:29+00:00  32.0715   75.1915   
426877  Madhya Pradesh 2021-04-04 09:21:17+00:00  21.5646   75.7247   
426878     Maharashtra 2021-04-04 09:21:11+00:00  21.4183   75.4719   
426879       Jharkhand 2021-04-04 09:21:07+00:00  24.2616   83.7688   

       posting_year posting_month  posting_weekday  
0              <NA>          <NA>              NaN  
1              <NA>          <NA>              NaN  
2              <NA>          <NA>              NaN  
3              <NA>          <NA>              NaN  
4              <NA>          <NA>              NaN  
...             ...           ...              ...  
426875         2021             4           Sunday  
426876         2021             4           Sunday  
426877         2021             4           Sunday  
426878         2021             4           Sunday  
426879         2021             4           Sunday  

[426747 rows x 24 columns]
  • model_mapping creates a mapping of (manufacturer, year) to the first non-null model.
  • .set_indexconverts the mapping into a dictionary for efficient lookup.
  • Using for loop which Iterate through the rows, filling missing model values using the dictionary based on manufacturer and year.
In [38]:
sp['model'].isnull().sum()
Out[38]:
148
  • Still it is noticed that there are some missing values in the column model.

Dropped rows where manufacturer and model is NAN¶

In [39]:
# Drop rows where both 'manufacturer' and 'model' are null
sp = sp.dropna(subset=['manufacturer', 'model'], how='all')

# Print the updated DataFrame
print(sp)
                id      price    year manufacturer                     model  \
27      7316814884  4313900.0  2014.0          gmc  sierra 1500 crew cab slt   
28      7316814758  2901280.0  2010.0    chevrolet            silverado 1500   
29      7316814989  5084440.0  2020.0    chevrolet       silverado 1500 crew   
30      7316743432  3980090.0  2017.0       toyota      tundra double cab sr   
31      7316356412  1926290.0  2013.0         ford                 f-150 xlt   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

        condition    cylinders    fuel  odometer title_status  ...   car_type  \
27           good  8 cylinders     gas   57923.0        clean  ...     pickup   
28           good  8 cylinders     gas   71229.0        clean  ...     pickup   
29           good  8 cylinders     gas   19160.0        clean  ...     pickup   
30           good  8 cylinders     gas   41124.0        clean  ...     pickup   
31      excellent  6 cylinders     gas  128000.0        clean  ...      truck   
...           ...          ...     ...       ...          ...  ...        ...   
426875       good  6 cylinders     gas   32226.0        clean  ...      sedan   
426876       good          NaN     gas   12029.0        clean  ...      sedan   
426877       good          NaN  diesel    4174.0        clean  ...  hatchback   
426878       good  6 cylinders     gas   30112.0        clean  ...      sedan   
426879       good          NaN     gas   22716.0        clean  ...      coupe   

       paint_color                                          image_url  \
27           white  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
28            blue  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
29             red  https://images.craigslist.org/01212_jjirIWa0y0...   
30             red  https://images.craigslist.org/00x0x_1y9kIOzGCF...   
31           black  https://images.craigslist.org/00404_l4loxHvdQe...   
...            ...                                                ...   
426875         NaN  https://images.craigslist.org/00o0o_iiraFnHg8q...   
426876         red  https://images.craigslist.org/00x0x_15sbgnxCIS...   
426877       white  https://images.craigslist.org/00L0L_farM7bxnxR...   
426878      silver  https://images.craigslist.org/00z0z_bKnIVGLkDT...   
426879         NaN  https://images.craigslist.org/00Y0Y_lEUocjyRxa...   

                 state              posting_date latitude longitude  \
27              Kerala 2021-05-04 17:31:18+00:00   9.8072   76.8361   
28       Uttar Pradesh 2021-05-04 17:31:08+00:00  29.2417   83.7824   
29               Delhi 2021-05-04 17:31:25+00:00  28.8053   76.8871   
30              Odisha 2021-05-04 15:41:31+00:00  19.2886   85.2667   
31             Tripura 2021-05-03 19:02:03+00:00  23.4924   91.1187   
...                ...                       ...      ...       ...   
426875           Delhi 2021-04-04 09:21:31+00:00  28.7876   76.8644   
426876          Punjab 2021-04-04 09:21:29+00:00  32.0715   75.1915   
426877  Madhya Pradesh 2021-04-04 09:21:17+00:00  21.5646   75.7247   
426878     Maharashtra 2021-04-04 09:21:11+00:00  21.4183   75.4719   
426879       Jharkhand 2021-04-04 09:21:07+00:00  24.2616   83.7688   

       posting_year posting_month  posting_weekday  
27             2021             5          Tuesday  
28             2021             5          Tuesday  
29             2021             5          Tuesday  
30             2021             5          Tuesday  
31             2021             5           Monday  
...             ...           ...              ...  
426875         2021             4           Sunday  
426876         2021             4           Sunday  
426877         2021             4           Sunday  
426878         2021             4           Sunday  
426879         2021             4           Sunday  

[426678 rows x 24 columns]
  • Using .dropna() function it is attempted to drop the rows where manufacturer and model both has null values.
  • There lies 69 rows which has this condition. Since it is insignificant. We opt to drop.
In [40]:
sp['model'].isnull().sum()
Out[40]:
79
  • Still it is noticed that there are 79 null values in th model column.

Categorized Unknown to the remaining missing values in model¶

In [41]:
# Fill missing values in the 'model' column with 'Unknown'
sp['model'] = sp['model'].fillna('Unknown')

# Print the updated DataFrame
print(sp)
                id      price    year manufacturer                     model  \
27      7316814884  4313900.0  2014.0          gmc  sierra 1500 crew cab slt   
28      7316814758  2901280.0  2010.0    chevrolet            silverado 1500   
29      7316814989  5084440.0  2020.0    chevrolet       silverado 1500 crew   
30      7316743432  3980090.0  2017.0       toyota      tundra double cab sr   
31      7316356412  1926290.0  2013.0         ford                 f-150 xlt   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

        condition    cylinders    fuel  odometer title_status  ...   car_type  \
27           good  8 cylinders     gas   57923.0        clean  ...     pickup   
28           good  8 cylinders     gas   71229.0        clean  ...     pickup   
29           good  8 cylinders     gas   19160.0        clean  ...     pickup   
30           good  8 cylinders     gas   41124.0        clean  ...     pickup   
31      excellent  6 cylinders     gas  128000.0        clean  ...      truck   
...           ...          ...     ...       ...          ...  ...        ...   
426875       good  6 cylinders     gas   32226.0        clean  ...      sedan   
426876       good          NaN     gas   12029.0        clean  ...      sedan   
426877       good          NaN  diesel    4174.0        clean  ...  hatchback   
426878       good  6 cylinders     gas   30112.0        clean  ...      sedan   
426879       good          NaN     gas   22716.0        clean  ...      coupe   

       paint_color                                          image_url  \
27           white  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
28            blue  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
29             red  https://images.craigslist.org/01212_jjirIWa0y0...   
30             red  https://images.craigslist.org/00x0x_1y9kIOzGCF...   
31           black  https://images.craigslist.org/00404_l4loxHvdQe...   
...            ...                                                ...   
426875         NaN  https://images.craigslist.org/00o0o_iiraFnHg8q...   
426876         red  https://images.craigslist.org/00x0x_15sbgnxCIS...   
426877       white  https://images.craigslist.org/00L0L_farM7bxnxR...   
426878      silver  https://images.craigslist.org/00z0z_bKnIVGLkDT...   
426879         NaN  https://images.craigslist.org/00Y0Y_lEUocjyRxa...   

                 state              posting_date latitude longitude  \
27              Kerala 2021-05-04 17:31:18+00:00   9.8072   76.8361   
28       Uttar Pradesh 2021-05-04 17:31:08+00:00  29.2417   83.7824   
29               Delhi 2021-05-04 17:31:25+00:00  28.8053   76.8871   
30              Odisha 2021-05-04 15:41:31+00:00  19.2886   85.2667   
31             Tripura 2021-05-03 19:02:03+00:00  23.4924   91.1187   
...                ...                       ...      ...       ...   
426875           Delhi 2021-04-04 09:21:31+00:00  28.7876   76.8644   
426876          Punjab 2021-04-04 09:21:29+00:00  32.0715   75.1915   
426877  Madhya Pradesh 2021-04-04 09:21:17+00:00  21.5646   75.7247   
426878     Maharashtra 2021-04-04 09:21:11+00:00  21.4183   75.4719   
426879       Jharkhand 2021-04-04 09:21:07+00:00  24.2616   83.7688   

       posting_year posting_month  posting_weekday  
27             2021             5          Tuesday  
28             2021             5          Tuesday  
29             2021             5          Tuesday  
30             2021             5          Tuesday  
31             2021             5           Monday  
...             ...           ...              ...  
426875         2021             4           Sunday  
426876         2021             4           Sunday  
426877         2021             4           Sunday  
426878         2021             4           Sunday  
426879         2021             4           Sunday  

[426678 rows x 24 columns]
  • The remaining null values in the model column has been replaced by Unknown
In [42]:
sp['model'].isnull().sum()
Out[42]:
0
In [43]:
sp['model'].value_counts()
Out[43]:
f-150                      8014
silverado 1500             5154
1500                       4403
camry                      3155
silverado                  3037
                           ... 
g35xsports                    1
escalade ext awd gas          1
big horn slt                  1
1500 4x4 sport                1
Paige Glenbrook Touring       1
Name: model, Length: 29649, dtype: int64

Handling manufacturer column values¶

In [44]:
sp['manufacturer'].isnull().sum()
Out[44]:
17536
In [45]:
sp['manufacturer'].value_counts()
Out[45]:
ford               70981
chevrolet          55061
toyota             34199
honda              21267
nissan             19063
jeep               19014
ram                18342
gmc                16785
bmw                14698
dodge              13707
mercedes-benz      11816
hyundai            10338
subaru              9495
volkswagen          9345
kia                 8457
lexus               8196
audi                7572
cadillac            6952
chrysler            6031
acura               5978
buick               5501
mazda               5427
infiniti            4802
lincoln             4220
volvo               3374
mitsubishi          3225
mini                2376
pontiac             2288
rover               2113
jaguar              1946
porsche             1383
mercury             1184
saturn              1090
alfa-romeo           897
tesla                868
fiat                 792
harley-davidson      153
ferrari               95
datsun                63
aston-martin          24
land rover            21
morgan                 3
Name: manufacturer, dtype: int64
In [46]:
# Step 1: Create a mapping of model to manufacturer
manufacturer_mapping = (
    sp[~sp['manufacturer'].isnull()]  # Filter rows where 'manufacturer' is not null
    .groupby('model')['manufacturer']
    .first()
    .reset_index()
)

# Convert the mapping to a dictionary
manufacturer_mapping_dict = manufacturer_mapping.set_index('model')['manufacturer'].to_dict()

# Step 2: Fill missing values in 'manufacturer'
def fill_manufacturer(row):
    if pd.isnull(row['manufacturer']):
        # Try model-only mapping
        return manufacturer_mapping_dict.get(row['model'], 'Unknown')  # Default to 'Unknown' if no match
    return row['manufacturer']

sp['manufacturer'] = sp.apply(fill_manufacturer, axis=1)

# Print the updated DataFrame
print(sp)
                id      price    year manufacturer                     model  \
27      7316814884  4313900.0  2014.0          gmc  sierra 1500 crew cab slt   
28      7316814758  2901280.0  2010.0    chevrolet            silverado 1500   
29      7316814989  5084440.0  2020.0    chevrolet       silverado 1500 crew   
30      7316743432  3980090.0  2017.0       toyota      tundra double cab sr   
31      7316356412  1926290.0  2013.0         ford                 f-150 xlt   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

        condition    cylinders    fuel  odometer title_status  ...   car_type  \
27           good  8 cylinders     gas   57923.0        clean  ...     pickup   
28           good  8 cylinders     gas   71229.0        clean  ...     pickup   
29           good  8 cylinders     gas   19160.0        clean  ...     pickup   
30           good  8 cylinders     gas   41124.0        clean  ...     pickup   
31      excellent  6 cylinders     gas  128000.0        clean  ...      truck   
...           ...          ...     ...       ...          ...  ...        ...   
426875       good  6 cylinders     gas   32226.0        clean  ...      sedan   
426876       good          NaN     gas   12029.0        clean  ...      sedan   
426877       good          NaN  diesel    4174.0        clean  ...  hatchback   
426878       good  6 cylinders     gas   30112.0        clean  ...      sedan   
426879       good          NaN     gas   22716.0        clean  ...      coupe   

       paint_color                                          image_url  \
27           white  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
28            blue  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
29             red  https://images.craigslist.org/01212_jjirIWa0y0...   
30             red  https://images.craigslist.org/00x0x_1y9kIOzGCF...   
31           black  https://images.craigslist.org/00404_l4loxHvdQe...   
...            ...                                                ...   
426875         NaN  https://images.craigslist.org/00o0o_iiraFnHg8q...   
426876         red  https://images.craigslist.org/00x0x_15sbgnxCIS...   
426877       white  https://images.craigslist.org/00L0L_farM7bxnxR...   
426878      silver  https://images.craigslist.org/00z0z_bKnIVGLkDT...   
426879         NaN  https://images.craigslist.org/00Y0Y_lEUocjyRxa...   

                 state              posting_date latitude longitude  \
27              Kerala 2021-05-04 17:31:18+00:00   9.8072   76.8361   
28       Uttar Pradesh 2021-05-04 17:31:08+00:00  29.2417   83.7824   
29               Delhi 2021-05-04 17:31:25+00:00  28.8053   76.8871   
30              Odisha 2021-05-04 15:41:31+00:00  19.2886   85.2667   
31             Tripura 2021-05-03 19:02:03+00:00  23.4924   91.1187   
...                ...                       ...      ...       ...   
426875           Delhi 2021-04-04 09:21:31+00:00  28.7876   76.8644   
426876          Punjab 2021-04-04 09:21:29+00:00  32.0715   75.1915   
426877  Madhya Pradesh 2021-04-04 09:21:17+00:00  21.5646   75.7247   
426878     Maharashtra 2021-04-04 09:21:11+00:00  21.4183   75.4719   
426879       Jharkhand 2021-04-04 09:21:07+00:00  24.2616   83.7688   

       posting_year posting_month  posting_weekday  
27             2021             5          Tuesday  
28             2021             5          Tuesday  
29             2021             5          Tuesday  
30             2021             5          Tuesday  
31             2021             5           Monday  
...             ...           ...              ...  
426875         2021             4           Sunday  
426876         2021             4           Sunday  
426877         2021             4           Sunday  
426878         2021             4           Sunday  
426879         2021             4           Sunday  

[426678 rows x 24 columns]
  • manufacturer_mapping creates a mapping of manufacturer to the first non-null manufacturer.
  • manufacturer_mapping_dict converts the mapping into a dictionary for quick lookup.
  • Fill missing manufacturer values by checking the model and using the dictionary, defaulting to 'Unknown' if no match is found.
In [47]:
sp['manufacturer'].isnull().sum()
Out[47]:
0
In [48]:
sp['manufacturer'].value_counts()
Out[48]:
ford               71073
chevrolet          55182
toyota             34449
honda              21270
nissan             19065
jeep               19023
ram                18364
gmc                16793
Unknown            16769
bmw                14721
dodge              13724
mercedes-benz      11818
hyundai            10340
subaru              9495
volkswagen          9350
kia                 8457
lexus               8196
audi                7572
cadillac            6952
chrysler            6052
acura               5978
buick               5504
mazda               5456
infiniti            4803
lincoln             4220
volvo               3377
mitsubishi          3227
mini                2376
pontiac             2291
rover               2175
jaguar              1946
porsche             1385
mercury             1187
saturn              1090
alfa-romeo           897
fiat                 868
tesla                868
harley-davidson      153
ferrari               95
datsun                69
aston-martin          24
land rover            21
morgan                 3
Name: manufacturer, dtype: int64

Handling condition column values¶

In [49]:
sp['condition'].isnull().sum()
Out[49]:
173984
In [50]:
sp['condition'].value_counts()
Out[50]:
good         121452
excellent    101430
like new      21175
fair           6765
new            1271
salvage         601
Name: condition, dtype: int64
In [51]:
sp['condition'] = sp.groupby(['manufacturer','model','year'])['condition'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
  • With sp.groupby which groups the data by manufacturer, model, and year.
  • Then using .transform(), fill missing values in the condition column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available.
  • Update the condition column with the filled values.
In [52]:
sp['condition'].isnull().sum()
Out[52]:
0
In [53]:
sp['condition'].value_counts()
Out[53]:
excellent    209179
good         143637
Unknown       36669
like new      27061
fair           7702
new            1781
salvage         649
Name: condition, dtype: int64

Handling cylinder column values¶

In [54]:
sp['cylinders'].isnull().sum()
Out[54]:
177542
In [55]:
sp['cylinders'].value_counts()
Out[55]:
6 cylinders     94165
4 cylinders     77588
8 cylinders     72060
5 cylinders      1712
10 cylinders     1455
other            1293
3 cylinders       655
12 cylinders      208
Name: cylinders, dtype: int64
In [56]:
sp['cylinders'] = sp.groupby(['manufacturer','model','year'])['cylinders'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
  • With sp.groupby which groups the data by manufacturer, model, and year.
  • Then using .transform(), fill missing values in the 'cylinders' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available.
  • Update the 'cylinders' column with the filled values.
In [57]:
sp['cylinders'].isnull().sum()
Out[57]:
0
In [58]:
sp['cylinders'].value_counts()
Out[58]:
6 cylinders     137826
4 cylinders     123886
8 cylinders     102218
Unknown          55441
5 cylinders       2450
10 cylinders      1912
other             1814
3 cylinders        907
12 cylinders       224
Name: cylinders, dtype: int64

Handling fuel column values¶

In [59]:
sp['fuel'].isnull().sum()
Out[59]:
2945
In [60]:
sp['fuel'].value_counts()
Out[60]:
gas         356115
other        30696
diesel       30055
hybrid        5170
electric      1697
Name: fuel, dtype: int64
In [61]:
sp['fuel'] = sp.groupby(['manufacturer','model','year'])['fuel'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
  • With sp.groupby which groups the data by manufacturer, model, and year.
  • Then using .transform(), fill missing values in the 'fuel' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available.
  • Update the 'fuel' column with the filled values.
In [62]:
sp['fuel'].isnull().sum()
Out[62]:
0
In [63]:
sp['fuel'].value_counts()
Out[63]:
gas         357266
other        30724
diesel       30101
hybrid        5195
electric      1701
Unknown       1691
Name: fuel, dtype: int64

Handling transmission column values¶

In [64]:
sp['transmission'].isnull().sum()
Out[64]:
2488
In [65]:
sp['transmission'].value_counts()
Out[65]:
automatic    336437
other         62642
manual        25111
Name: transmission, dtype: int64
In [66]:
sp['transmission'] = sp.groupby(['manufacturer','model','year'])['transmission'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
  • With sp.groupby which groups the data by manufacturer, model, and year.
  • Then using .transform(), fill missing values in the 'transmission' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available.
  • Update the 'transmission' column with the filled values.
In [67]:
sp['transmission'].isnull().sum()
Out[67]:
0
In [68]:
sp['transmission'].value_counts()
Out[68]:
automatic    338541
other         62660
manual        25135
Unknown         342
Name: transmission, dtype: int64

Handling title_status column values¶

In [69]:
sp['title_status'].isnull().sum()
Out[69]:
8139
In [70]:
sp['title_status'].value_counts()
Out[70]:
clean         405030
rebuilt         7219
salvage         3861
lien            1422
missing          810
parts only       197
Name: title_status, dtype: int64
In [71]:
sp['title_status'] = sp['title_status'].fillna('Unknown')
  • .fillna() fills missing values in the title_status column with the string 'Unknown'.
  • Update the title_status column with the filled values.
In [72]:
sp['title_status'].isnull().sum()
Out[72]:
0
In [73]:
sp['title_status'].value_counts()
Out[73]:
clean         405030
Unknown         8139
rebuilt         7219
salvage         3861
lien            1422
missing          810
parts only       197
Name: title_status, dtype: int64

Handling car_type column values¶

In [74]:
sp['car_type'].isnull().sum()
Out[74]:
92723
In [75]:
sp['car_type'].value_counts()
Out[75]:
sedan          87055
SUV            77233
pickup         43509
truck          35273
other          22105
coupe          19204
hatchback      16598
wagon          10750
van             8548
convertible     7731
mini-van        4825
offroad          609
bus              515
Name: car_type, dtype: int64
In [76]:
sp['car_type'] = sp.groupby(['manufacturer','model','year','transmission','fuel'])['car_type'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
  • With sp.groupby which groups the data by manufacturer, model, year, transmission, and fuel.
  • Then using .transform(), fill missing values in the car_type column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available.
  • Update the car_type column with the filled values.
In [77]:
sp['car_type'].isnull().sum()
Out[77]:
0
In [78]:
sp['car_type'].value_counts()
Out[78]:
sedan          106733
SUV             97421
pickup          50507
truck           43467
Unknown         24924
other           23014
coupe           22104
hatchback       18777
wagon           12186
van             10284
convertible      8910
mini-van         7007
offroad           725
bus               619
Name: car_type, dtype: int64

Handling paint column values¶

In [79]:
sp['paint_color'].isnull().sum()
Out[79]:
130030
In [80]:
sp['paint_color'].value_counts()
Out[80]:
white     79280
black     62854
silver    42963
blue      31220
red       30469
grey      24415
green      7343
custom     6699
brown      6593
yellow     2142
orange     1984
purple      686
Name: paint_color, dtype: int64
In [81]:
sp['paint_color'] = sp.groupby(['manufacturer','model','year'])['paint_color'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
  • With sp.groupby which groups the data by manufacturer, model, and year.
  • Then using .transform(), fill missing values in the 'paint_color' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available.
  • Update the 'paint_color' column with the filled values.
In [82]:
sp['paint_color'].isnull().sum()
Out[82]:
0
In [83]:
sp['paint_color'].value_counts()
Out[83]:
white      108720
black       91076
silver      57794
blue        40905
red         37135
grey        30740
Unknown     30617
green        8786
brown        7901
custom       7606
yellow       2346
orange       2282
purple        770
Name: paint_color, dtype: int64

Handling drive column values¶

In [84]:
sp['drive'].isnull().sum()
Out[84]:
130432
In [85]:
sp['drive'].value_counts()
Out[85]:
4wd    131853
fwd    105506
rwd     58887
Name: drive, dtype: int64
In [86]:
sp['drive'] = sp.groupby(['manufacturer','model'])['drive'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
  • With sp.groupby which groups the data by manufacturer, model, and year.
  • Then using .transform(), fill missing values in the 'drive' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available.
  • Update the 'drive' column with the filled values.
In [87]:
sp['drive'].isnull().sum()
Out[87]:
0
In [88]:
sp['drive'].value_counts()
Out[88]:
4wd        179672
fwd        145427
rwd         78620
Unknown     22959
Name: drive, dtype: int64

Handling size Column values¶

In [89]:
sp['size'].isnull().sum()
Out[89]:
306166
In [90]:
sp['size'].value_counts()
Out[90]:
full-size      63461
mid-size       34474
compact        19383
sub-compact     3194
Name: size, dtype: int64
In [91]:
sp['size'] = sp.groupby(['manufacturer','model','year'])['size'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
  • With sp.groupby which groups the data by manufacturer, model, and year.
  • Then using .transform(), fill missing values in the 'size' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available.
  • Update the 'size' column with the filled values.
In [92]:
sp['size'].isnull().sum()
Out[92]:
0
In [93]:
sp['size'].value_counts()
Out[93]:
full-size      160376
Unknown        121065
mid-size        87792
compact         51146
sub-compact      6299
Name: size, dtype: int64

Handling VIN Column values¶

In [94]:
sp['VIN'].isnull().sum()
Out[94]:
160908
In [95]:
sp['VIN'].value_counts()
Out[95]:
1FMJU1JT1HEA52352    261
3C6JR6DT3KG560649    235
1FTER1EH1LLA36301    231
5TFTX4CN3EX042751    227
1GCHTCE37G1186784    214
                    ... 
4S4BRBGC3B3332201      1
1N4AA5AP2EC448150      1
1J8GN58K88W233418      1
2FMDK3GC9ABA40182      1
SAJGX2749VCOO8376      1
Name: VIN, Length: 118226, dtype: int64
In [96]:
sp['VIN'] = sp.groupby(['manufacturer','model','year','price','condition','cylinders','fuel','odometer','title_status','transmission','drive','size','car_type','paint_color'])['VIN'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
  • With sp.groupby which groups the data by manufacturer, model, year, price, condition, cylinders, fuel, odometer, title_status, transmission, drive, size, car_type, and paint_color.
  • Then using .transform(), fill missing values in the VIN column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available.
  • Update the VIN column with the filled values.
In [97]:
sp['VIN'].isnull().sum()
Out[97]:
0
In [98]:
sp['VIN'].value_counts()
Out[98]:
Unknown              160210
1FMJU1JT1HEA52352       261
3C6JR6DT3KG560649       235
1FTER1EH1LLA36301       231
5TFTX4CN3EX042751       227
                      ...  
1C3EL75R35N639520         1
1GCHG35R211150470         1
3B7HC13YXXM530629         1
1GT21REG3GZ289767         1
SAJGX2749VCOO8376         1
Name: VIN, Length: 118227, dtype: int64

Checking the Cleaned Data-type missing values¶

In [99]:
missing_value_count = sp.isnull().sum()
print("Missing Values in Each Column:")
missing_value_count
Missing Values in Each Column:
Out[99]:
id                 0
price              0
year               0
manufacturer       0
model              0
condition          0
cylinders          0
fuel               0
odometer           0
title_status       0
transmission       0
VIN                0
drive              0
size               0
car_type           0
paint_color        0
image_url          0
state              0
posting_date       0
latitude           0
longitude          0
posting_year       0
posting_month      0
posting_weekday    0
dtype: int64

Create car_age and avg_yearly_mileage column¶

Create car_age column¶

In [100]:
sp['car_age'] = sp.apply(
    lambda row: row['posting_date'].year - row['year'] if pd.notna(row['year']) and pd.notna(row['posting_date']) else np.nan, 
    axis=1
)
# Display the DataFrame with the new 'car_age' column
print(sp[['year', 'posting_date', 'car_age']])
          year              posting_date  car_age
27      2014.0 2021-05-04 17:31:18+00:00      7.0
28      2010.0 2021-05-04 17:31:08+00:00     11.0
29      2020.0 2021-05-04 17:31:25+00:00      1.0
30      2017.0 2021-05-04 15:41:31+00:00      4.0
31      2013.0 2021-05-03 19:02:03+00:00      8.0
...        ...                       ...      ...
426875  2019.0 2021-04-04 09:21:31+00:00      2.0
426876  2020.0 2021-04-04 09:21:29+00:00      1.0
426877  2020.0 2021-04-04 09:21:17+00:00      1.0
426878  2018.0 2021-04-04 09:21:11+00:00      3.0
426879  2019.0 2021-04-04 09:21:07+00:00      2.0

[426678 rows x 3 columns]
  • Calculate the car_age by subtracting the year from the posting_date year, if both values are not null.
  • If either year or posting_date is missing, set the car_age as NaN.
  • Update the car_age column with the calculated values.

Create avg_yearly_mileage column¶

In [101]:
# Calculate 'avg_yearly_mileage'
sp['avg_yearly_mileage'] = np.where(
    sp['car_age'] == 0,  # Check if 'car_age' is zero
    sp['odometer'],      # If 'car_age' is zero, use 'odometer'
    sp['odometer'] / sp['car_age']  # Otherwise, calculate the formula
)

# Display the result
print(sp['avg_yearly_mileage'])
27         8274.714286
28         6475.363636
29        19160.000000
30        10281.000000
31        16000.000000
              ...     
426875    16113.000000
426876    12029.000000
426877     4174.000000
426878    10037.333333
426879    11358.000000
Name: avg_yearly_mileage, Length: 426678, dtype: float64
  • Calculate avg_yearly_mileage by dividing odometer by car_age, unless car_age is zero, in which case use the odometer value directly.
  • Update the avg_yearly_mileage column with the calculated values.

DataFrame sp¶

In [102]:
sp
Out[102]:
id price year manufacturer model condition cylinders fuel odometer title_status ... image_url state posting_date latitude longitude posting_year posting_month posting_weekday car_age avg_yearly_mileage
27 7316814884 4313900.0 2014.0 gmc sierra 1500 crew cab slt good 8 cylinders gas 57923.0 clean ... https://images.craigslist.org/00R0R_lwWjXSEWNa... Kerala 2021-05-04 17:31:18+00:00 9.8072 76.8361 2021 5 Tuesday 7.0 8274.714286
28 7316814758 2901280.0 2010.0 chevrolet silverado 1500 good 8 cylinders gas 71229.0 clean ... https://images.craigslist.org/00R0R_lwWjXSEWNa... Uttar Pradesh 2021-05-04 17:31:08+00:00 29.2417 83.7824 2021 5 Tuesday 11.0 6475.363636
29 7316814989 5084440.0 2020.0 chevrolet silverado 1500 crew good 8 cylinders gas 19160.0 clean ... https://images.craigslist.org/01212_jjirIWa0y0... Delhi 2021-05-04 17:31:25+00:00 28.8053 76.8871 2021 5 Tuesday 1.0 19160.000000
30 7316743432 3980090.0 2017.0 toyota tundra double cab sr good 8 cylinders gas 41124.0 clean ... https://images.craigslist.org/00x0x_1y9kIOzGCF... Odisha 2021-05-04 15:41:31+00:00 19.2886 85.2667 2021 5 Tuesday 4.0 10281.000000
31 7316356412 1926290.0 2013.0 ford f-150 xlt excellent 6 cylinders gas 128000.0 clean ... https://images.craigslist.org/00404_l4loxHvdQe... Tripura 2021-05-03 19:02:03+00:00 23.4924 91.1187 2021 5 Monday 8.0 16000.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
426875 7301591192 3029710.0 2019.0 nissan maxima s sedan 4d good 6 cylinders gas 32226.0 clean ... https://images.craigslist.org/00o0o_iiraFnHg8q... Delhi 2021-04-04 09:21:31+00:00 28.7876 76.8644 2021 4 Sunday 2.0 16113.000000
426876 7301591187 3928680.0 2020.0 volvo s60 t5 momentum sedan 4d good Unknown gas 12029.0 clean ... https://images.craigslist.org/00x0x_15sbgnxCIS... Punjab 2021-04-04 09:21:29+00:00 32.0715 75.1915 2021 4 Sunday 1.0 12029.000000
426877 7301591147 4493750.0 2020.0 cadillac xt4 sport suv 4d good Unknown diesel 4174.0 clean ... https://images.craigslist.org/00L0L_farM7bxnxR... Madhya Pradesh 2021-04-04 09:21:17+00:00 21.5646 75.7247 2021 4 Sunday 1.0 4174.000000
426878 7301591140 3723220.0 2018.0 lexus es 350 sedan 4d good 6 cylinders gas 30112.0 clean ... https://images.craigslist.org/00z0z_bKnIVGLkDT... Maharashtra 2021-04-04 09:21:11+00:00 21.4183 75.4719 2021 4 Sunday 3.0 10037.333333
426879 7301591129 3928680.0 2019.0 bmw 4 series 430i gran coupe good Unknown gas 22716.0 clean ... https://images.craigslist.org/00Y0Y_lEUocjyRxa... Jharkhand 2021-04-04 09:21:07+00:00 24.2616 83.7688 2021 4 Sunday 2.0 11358.000000

426678 rows × 26 columns

  • The above DataFrame sp represents the cleaned data set.
In [103]:
sp.describe()
Out[103]:
id price year odometer latitude longitude posting_year posting_month car_age avg_yearly_mileage
count 4.266780e+05 4.266780e+05 426678.000000 4.266780e+05 426678.000000 426678.000000 426678.0 426678.000000 426678.000000 4.266780e+05
mean 7.311495e+09 9.822740e+06 2011.236584 9.786411e+04 22.801159 79.012884 2021.0 4.266302 9.763416 1.230994e+04
std 4.381300e+06 1.564808e+09 9.439536 2.111762e+05 6.532559 5.208544 0.0 0.442024 9.439536 3.861835e+04
min 7.301583e+09 9.200000e+02 1900.000000 0.000000e+00 6.800300 68.400200 2021.0 4.000000 0.000000 0.000000e+00
25% 7.308145e+09 1.026400e+06 2008.000000 3.817000e+04 17.796600 75.719100 2021.0 4.000000 4.000000 7.063000e+03
50% 7.312622e+09 1.990460e+06 2013.000000 8.600000e+04 24.058950 77.353900 2021.0 4.000000 8.000000 1.072728e+04
75% 7.315254e+09 3.402170e+06 2017.000000 1.330000e+05 28.563600 80.927100 2021.0 5.000000 13.000000 1.500000e+04
max 7.317101e+09 4.798934e+11 2021.000000 1.000000e+07 35.997600 97.398000 2021.0 5.000000 121.000000 1.000000e+07
  • .describe() function generates descriptive statistics for the DataFrame sp.

3. Exploratory Data Analysis (EDA)¶

Hypothesis 1:- Vehicle pricing and demand vary based on condition, fuel type, mileage, and vehicle type, impacting transaction volumes and average prices.¶

Price Trends Analysis¶

Objective: Identify factors influencing car prices.¶

Pricing Analysis¶

In [104]:
# Correlation between price and numerical attributes
correlations = sp.corr()['price'].sort_values(ascending=False)
print("\nCorrelation with Price:")
print(correlations)
Correlation with Price:
price                 1.000000
odometer              0.010112
car_age               0.004899
avg_yearly_mileage    0.000166
latitude             -0.001279
longitude            -0.001498
posting_month        -0.001826
id                   -0.002846
year                 -0.004899
posting_year               NaN
Name: price, dtype: float64
  • Calculate the correlation between the price column and all other numerical attributes in the DataFrame.
  • Sort the correlations in descending order and display the result.

Price trends based on condition and fuel¶

In [105]:
# Summary of pricing trends by condition and fuel type
price_summary = sp.groupby(['condition', 'fuel']).agg(
    avg_price=('price', 'mean'),
    median_price=('price', 'median'),
    count=('price', 'size')
).reset_index()

print("\nOptimal Pricing Summary:")
print(price_summary)
Optimal Pricing Summary:
    condition      fuel     avg_price  median_price   count
0     Unknown   Unknown  2.657031e+06     2309580.0    1687
1     Unknown    diesel  4.588512e+06     4045190.0    6839
2     Unknown  electric  3.249356e+06     1990460.0     159
3     Unknown       gas  3.439129e+06     1990460.0   25470
4     Unknown    hybrid  2.589077e+06     2051780.0     447
5     Unknown     other  4.122992e+06     2824250.0    2067
6   excellent   Unknown  7.312450e+06     7312450.0       4
7   excellent    diesel  1.829421e+07     4036280.0   12867
8   excellent  electric  2.823478e+06     1990460.0     526
9   excellent       gas  1.226798e+07     1862110.0  187480
10  excellent    hybrid  1.543248e+06     1283270.0    2495
11  excellent     other  3.043773e+06     1990460.0    5807
12       fair    diesel  2.011004e+06     1014550.0     636
13       fair  electric  1.415820e+06      488000.0       4
14       fair       gas  7.225017e+07      372460.0    6960
15       fair    hybrid  5.042757e+05      475100.0      37
16       fair     other  2.440227e+09      449480.0      65
17       good    diesel  2.399393e+07     2824250.0    7752
18       good  electric  3.471733e+06     3851650.0     845
19       good       gas  3.916336e+06     1990460.0  110943
20       good    hybrid  2.389665e+06     2130830.0    1810
21       good     other  3.679336e+06     3594780.0   22287
22   like new    diesel  4.903614e+06     4993550.0    1877
23   like new  electric  3.671779e+06     2593540.0     154
24   like new       gas  4.450640e+06     1990460.0   24226
25   like new    hybrid  2.272601e+06     1862110.0     373
26   like new     other  3.952193e+06     3466340.0     431
27        new    diesel  7.498743e+06     8352665.0      88
28        new  electric  3.191558e+06      769610.0      11
29        new       gas  3.156952e+06     2117890.0    1604
30        new    hybrid  4.556500e+06     5842160.0      22
31        new     other  2.078390e+06     1990460.0      56
32    salvage    diesel  1.845805e+06     1284190.0      42
33    salvage  electric  1.653625e+06     1653625.0       2
34    salvage       gas  5.570343e+05      256870.0     583
35    salvage    hybrid  5.446691e+05      192610.0      11
36    salvage     other  1.692764e+05      128440.0      11
  • .groupby() function groups the data by condition and fuel, then calculate the average price, median price, and count of transactions for each group.
  • Display the summarized pricing trends by condition and fuel type.

Average Price by Mileage¶

In [106]:
sp['mileage_category'] = pd.cut(sp['avg_yearly_mileage'], 
                                bins=[0, 5000, 10000, 20000, 30000, 80000, 10000000], 
                                labels=['Very Low', 'Low', 'Moderate', 'Considerate', 'High', 'Very High'])
In [107]:
mileage_price = sp.groupby('mileage_category')['price'].mean()
print("Average Price by Mileage Category:")
print(mileage_price)
Average Price by Mileage Category:
mileage_category
Very Low       2.788424e+07
Low            7.393663e+06
Moderate       6.344824e+06
Considerate    2.800212e+06
High           6.066545e+06
Very High      1.310558e+08
Name: price, dtype: float64
  • .groupby() groups the data by mileage_category and calculate the average price for each category.
  • Display the resulting average price for each mileage category.

Pricing Strategy Based on Vehicle Type¶

In [108]:
# Analyze pricing by vehicle type
type_pricing = sp.groupby('car_type').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index()

print("\nPricing by Vehicle Type:")
print(type_pricing)
Pricing by Vehicle Type:
       car_type  total_transactions     avg_price  median_price
0           SUV               97421  1.291095e+07     1926290.0
1       Unknown               24924  1.656132e+07     1990460.0
2           bus                 619  2.206170e+06     1796930.0
3   convertible                8910  3.103358e+07     1990460.0
4         coupe               22104  2.993457e+06     2246410.0
5     hatchback               18777  1.832439e+06     1796930.0
6      mini-van                7007  1.283773e+06      924590.0
7       offroad                 725  2.090865e+06     1643800.0
8         other               23014  3.349307e+06     3209560.0
9        pickup               50507  1.652880e+07     3466340.0
10        sedan              106733  9.306231e+06     1412630.0
11        truck               43467  4.141187e+06     2632560.0
12          van               10284  2.396104e+06     1990460.0
13        wagon               12186  1.787653e+06     1669420.0
  • .groupby() groups the data by car_type, calculating total transactions, average price, and median price for each vehicle type.
  • Reset the index to create a clean DataFrame for pricing analysis by vehicle type.

Analysis¶

  1. Condition and Price: Vehicles in excellent condition have the highest prices across all fuel types, especially diesel (₹18.29M) and gas (₹12.27M). Fair and salvage conditions are priced significantly lower.

  2. Electric Vehicles: EVs in excellent condition are priced lower (₹2.82M) than diesel and gas counterparts, making them relatively affordable.

  3. Mileage and Pricing: Vehicles with very high mileage have the highest average prices (₹131.06M), likely luxury or high-performance models. Moderate and low mileage vehicles are priced in the ₹6M–₹7.4M range.

  4. Vehicle Type: SUVs and sedans dominate in transaction volume, while convertibles (₹31M) and pickup trucks (₹16.5M) have the highest average prices. Hatchbacks and mini-vans are the most affordable categories.


Key Insights¶

  1. Vehicle condition is the strongest determinant of price.
  2. EVs are emerging as affordable alternatives to diesel and gas vehicles.
  3. High mileage does not always lower price, especially for luxury vehicles.
  4. SUVs and sedans are the most in-demand segments, while hatchbacks appeal to budget-conscious buyers.

Recommendations¶

  1. Promote SUVs and Sedans: Focus marketing efforts on these high-demand segments.
  2. Leverage EV Affordability: Encourage EV adoption by emphasizing cost savings and environmental benefits.
  3. Refurbish Vehicles: Improve fair and salvage vehicles for resale to enhance their market value.
  4. Target Mileage Preferences: Highlight moderate mileage vehicles as the ideal choice for reliability and cost-effectiveness.
  5. Expand Affordable Options: Offer financing for hatchbacks and mini-vans to attract budget-conscious buyers.

Data Visualization¶

Compare Price and number of cars Across Car Types¶

In [109]:
car_type_stats = sp.groupby('car_type').agg(
    avg_price=('price', 'mean'),
    num_cars=('id', 'count')
).reset_index()

car_type_stats
Out[109]:
car_type avg_price num_cars
0 SUV 1.291095e+07 97421
1 Unknown 1.656132e+07 24924
2 bus 2.206170e+06 619
3 convertible 3.103358e+07 8910
4 coupe 2.993457e+06 22104
5 hatchback 1.832439e+06 18777
6 mini-van 1.283773e+06 7007
7 offroad 2.090865e+06 725
8 other 3.349307e+06 23014
9 pickup 1.652880e+07 50507
10 sedan 9.306231e+06 106733
11 truck 4.141187e+06 43467
12 van 2.396104e+06 10284
13 wagon 1.787653e+06 12186
In [110]:
# Data preparation
car_type_stats = sp.groupby('car_type').agg(
    avg_price=('price', 'mean'),
    num_cars=('id', 'count')
).reset_index()

# Sort by average price
car_type_stats = car_type_stats.sort_values('avg_price', ascending=False)

# Extract data for plotting
car_types = car_type_stats['car_type']
avg_price = car_type_stats['avg_price']
num_cars = car_type_stats['num_cars']

# Create the plot
fig, ax1 = plt.subplots(figsize=(12, 6))

# Bar chart for average price
bar_width = 0.4
x = np.arange(len(car_types))
ax1.bar(x - bar_width / 2, avg_price, width=bar_width, color='lightblue', label='Average Price (₹)')
ax1.set_ylabel('Average Price (₹)', fontsize=12, color='blue')
ax1.set_xlabel('Car Type', fontsize=12)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_xticks(x)
ax1.set_xticklabels(car_types, rotation=45, ha='right')

# Bar chart for number of cars
ax2 = ax1.twinx()
ax2.bar(x + bar_width / 2, num_cars, width=bar_width, color='salmon', label='Number of Cars')
ax2.set_ylabel('Number of Cars', fontsize=12, color='red')
ax2.tick_params(axis='y', labelcolor='red')

# Add title and legend
plt.title('Average Price and Number of Cars by Car Type', fontsize=14)
fig.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)

# Adjust layout
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  • Convertibles: Highest average price (~₹3,000,000), but low number of cars.
  • Sedans and SUVs: High numbers of cars with sedans leading (100,000) and SUVs (80,000). Both show strong market demand.
  • Trucks: Significant number of cars (~60,000) but lower average price compared to sedans and SUVs.
  • Unknown Category: Moderate average price with low number of cars.
  • Other Car Types: Vans, buses, offroad vehicles, hatchbacks, wagons, and mini-vans have lower average prices and low numbers of cars.

Recommendations¶

  1. Target Sedans and SUVs: Invest in marketing and sales strategies for sedans and SUVs due to their high demand.
  2. Premium Strategy for Convertibles: Focus on premium pricing for convertibles to capitalize on their high average price.
  3. Market Expansion for Trucks: Explore opportunities to increase the market share of trucks, considering their significant presence and moderate pricing.
  4. Investigate Unknown Category: Understand the characteristics and potential of the "unknown" category for market opportunities.
  5. Niche Marketing for Low-Volume Types: For car types with lower numbers and prices (vans, buses, offroad vehicles, hatchbacks, wagons, mini-vans), consider niche marketing strategies or cost reduction measures to improve profitability.

Hypothesis 2:- Newer cars have higher demand and prices with showing higher transaction volumes, while older cars see declining demand and value with decrease in market desirability.¶

Car Age and Usage Analysis¶

Objective: Understand how car age relates to odometer readings and price.¶

Age of Cars and Its Impact on Demand¶

In [111]:
age_analysis = sp.groupby('car_age').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index().sort_values('total_transactions', ascending=False)

print("\nCar Age Analysis (Demand and Pricing):")
print(age_analysis)
Car Age Analysis (Demand and Pricing):
     car_age  total_transactions     avg_price  median_price
4        4.0               36421  3.298168e+06     3067260.0
3        3.0               36370  3.596957e+06     3466340.0
8        8.0               31925  2.176866e+06     1990460.0
6        6.0               31538  3.245019e+06     2247340.0
5        5.0               30434  2.970425e+06     2503200.0
..       ...                 ...           ...           ...
104    106.0                   1  6.421000e+05      642100.0
108    116.0                   1  5.127400e+05      512740.0
107    112.0                   1  9.677000e+04       96770.0
102    103.0                   1  2.054720e+06     2054720.0
78      78.0                   1  1.252520e+06     1252520.0

[113 rows x 4 columns]
  • .groupby() function groups the data by car_age, then calculate the total number of transactions, average price, and median price for each car age.
  • Sort the results by the total number of transactions in descending order and display the car age analysis (demand and pricing).

Demand for Newer vs. Older Cars¶

In [112]:
car_age_demand = sp['car_age'].value_counts().reset_index()
car_age_demand.columns = ['car_age', 'Demand']  # Rename columns
car_age_demand = car_age_demand.sort_values('Demand', ascending=False)  # Sort by demand in descending order
print("Car Age Demand:")
print(car_age_demand.head(30))
Car Age Demand:
    car_age  Demand
0       4.0   36421
1       3.0   36370
2       8.0   31925
3       6.0   31538
4       5.0   30434
5       7.0   30287
6       2.0   25375
7       9.0   23898
8      10.0   20341
9       1.0   19298
10     13.0   17150
11     11.0   15829
12     14.0   14873
13     15.0   12763
14     12.0   12185
15     16.0   10621
16     17.0    8971
17     18.0    7151
18     19.0    5587
19     20.0    4443
20     21.0    3572
21     22.0    3094
22      0.0    2396
23     23.0    1988
24     24.0    1724
25     25.0    1302
26     26.0    1246
27     27.0     968
28     28.0     712
29     29.0     626
  • .value_counts().reset_index() counts the number of transactions for each unique car_age value.
  • Rename the columns to car_age and Demand.
  • Sort the results by Demand in descending order to display the car age demand.

Analysis¶

  1. Peak Demand for Mid-Age Cars (3–6 Years)

    • The highest demand is for cars aged 3 to 6 years, with peak transactions at 4 years (36,421) and 3 years (36,370).
    • These cars also have high pricing, with 4-year-old cars averaging ₹3.29M and 3-year-old cars at ₹3.60M.
  2. Gradual Decline in Demand After 10 Years

    • Demand drops significantly after 10 years, with 13-year-old cars at 17,150 transactions and cars aged 20+ years below 5,000 transactions.
    • Prices for older cars drop drastically, with many below ₹1M.
  3. Rare Demand for Vintage & Classic Cars

    • Cars aged 100+ years still have occasional transactions, with prices varying widely, indicating niche collectors’ interest.

Key Insights¶

  1. 3–6-Year-Old Cars Hold the Best Resale Value

    • These cars balance affordability and modern features, making them the most in-demand.
  2. After 10 Years, Depreciation Accelerates

    • A sharp decline in both demand and pricing indicates that cars beyond a decade struggle to attract buyers.
  3. Niche Market for Vintage Cars

    • Rare sales of 100+ year-old cars suggest collector demand, though limited.

Recommendations¶

  1. Target Mid-Age Cars for Sales & Promotions

    • Focus inventory acquisition on 3–6-year-old cars, as they offer the best resale potential.
  2. Offer Trade-In Incentives for Older Cars

    • Encourage sellers of 10+ year-old cars to trade in for newer models, increasing inventory turnover.
  3. Leverage Classic Car Auctions & Collectors’ Market

    • For cars older than 50+ years, explore specialized vintage auctions or collector marketplaces.
  4. Highlight Cost-Effective Options for Budget Buyers

    • Promote 8–10-year-old cars as budget-friendly alternatives for cost-conscious buyers.

Data Visualization¶

1. Correlate car age with odometer¶

In [113]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='car_age', y='odometer', data=sp, alpha=0.6)
plt.title('Car Age vs. Odometer Reading')
plt.xlabel('Car Age (years)')
plt.ylabel('Odometer Reading (km)')
plt.grid(True)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Car Age and Odometer Reading: The scatter plot shows a high concentration of cars aged between 0 and 20 years with odometer readings below 0.2 million kilometers.
  2. High Mileage Cars: There are several cars with odometer readings near 1.0 million kilometers, regardless of age, indicating high mileage.
  3. Older Cars: Beyond 20 years of age, the number of cars decreases significantly, and their odometer readings are generally lower.
  4. Outliers: There are a few outliers with high odometer readings and older ages, but these are rare.

Recommendations¶

  1. Focus on Younger Cars: When purchasing a used car, consider those within the 0-20 year range with odometer readings below 0.2 million kilometers, as they are most common and likely in better condition.
  2. Avoid High Mileage Cars: Be cautious of cars with extremely high odometer readings (near 1.0 million kilometers) regardless of age, as they may have undergone significant wear and tear.
  3. Consider Well-Maintained Older Cars: Older cars (beyond 20 years) with low odometer readings might be rare but could be worth considering if they have been well-maintained.

Hypothesis 3:- Geographic location influences car demand and pricing, States with stronger economies tend to have higher average car prices eventually drop in weaker economies.¶

Regional Insights¶

Objective: Explore geographical patterns.¶

Demand for Cars Across Regions¶

In [114]:
# Count transactions by state
state_demand = sp['state'].value_counts()
print("\nDemand by State:")
print(state_demand)
Demand by State:
Uttar Pradesh                               45552
Maharashtra                                 45236
Karnataka                                   30474
Punjab                                      30441
Delhi                                       30368
Haryana                                     30167
Tamil Nadu                                  18116
West Bengal                                 18039
Telangana                                   15352
Kerala                                      15293
Madhya Pradesh                              15199
Rajasthan                                   15121
Andhra Pradesh                              15065
Gujarat                                     14985
Odisha                                      12184
Jharkhand                                    6074
Bihar                                        6031
Chhattisgarh                                 6001
Assam                                        5986
Uttarakhand                                  5947
Andaman and Nicobar Islands                  3077
Sikkim                                       3065
Ladakh                                       3060
Puducherry                                   3045
Jammu and Kashmir                            3028
Himachal Pradesh                             3016
Meghalaya                                    3015
Mizoram                                      3009
Manipur                                      2990
Chandigarh                                   2984
Arunachal Pradesh                            2979
Dadra and Nagar Haveli and Daman and Diu     2978
Tripura                                      2960
Goa                                          2922
Lakshadweep                                  2919
Name: state, dtype: int64
  • .value_counts()counts the number of transactions for each unique value in the state column.
  • Display the demand (transaction count) by state.

Top States for High-Value Cars¶

In [115]:
state_avg_price = sp.groupby('state')['price'].mean().sort_values(ascending=False)
print("Average Price by State:")
print(state_avg_price)
Average Price by State:
state
Lakshadweep                                 1.667703e+08
Chandigarh                                  1.372737e+08
Goa                                         6.442383e+07
West Bengal                                 2.903243e+07
Haryana                                     2.018441e+07
Karnataka                                   1.617193e+07
Kerala                                      1.381462e+07
Uttar Pradesh                               8.677833e+06
Dadra and Nagar Haveli and Daman and Diu    7.327605e+06
Maharashtra                                 5.218439e+06
Madhya Pradesh                              3.720367e+06
Tamil Nadu                                  3.271037e+06
Delhi                                       3.019001e+06
Chhattisgarh                                2.672705e+06
Gujarat                                     2.533112e+06
Punjab                                      2.515638e+06
Rajasthan                                   2.509588e+06
Assam                                       2.472224e+06
Jharkhand                                   2.458025e+06
Sikkim                                      2.428607e+06
Andaman and Nicobar Islands                 2.422289e+06
Manipur                                     2.418222e+06
Jammu and Kashmir                           2.414838e+06
Arunachal Pradesh                           2.408561e+06
Tripura                                     2.407741e+06
Odisha                                      2.405977e+06
Telangana                                   2.404359e+06
Puducherry                                  2.402517e+06
Andhra Pradesh                              2.402467e+06
Uttarakhand                                 2.399425e+06
Meghalaya                                   2.393725e+06
Ladakh                                      2.388141e+06
Bihar                                       2.384134e+06
Himachal Pradesh                            2.368753e+06
Mizoram                                     2.345368e+06
Name: price, dtype: float64
  • .groupby() function groups the data by state and calculate the average price for each state.
  • Sort the results in descending order and display the average price by state.

Regional Demand Analysis¶

In [116]:
# Analyze demand by state
regional_demand = sp.groupby('state').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index().sort_values('total_transactions', ascending=False)

print("\nRegional Demand Analysis:")
print(regional_demand)
Regional Demand Analysis:
                                       state  total_transactions  \
32                             Uttar Pradesh               45552   
20                               Maharashtra               45236   
15                                 Karnataka               30474   
26                                    Punjab               30441   
8                                      Delhi               30368   
11                                   Haryana               30167   
29                                Tamil Nadu               18116   
34                               West Bengal               18039   
30                                 Telangana               15352   
16                                    Kerala               15293   
19                            Madhya Pradesh               15199   
27                                 Rajasthan               15121   
1                             Andhra Pradesh               15065   
10                                   Gujarat               14985   
24                                    Odisha               12184   
14                                 Jharkhand                6074   
4                                      Bihar                6031   
6                               Chhattisgarh                6001   
3                                      Assam                5986   
33                               Uttarakhand                5947   
0                Andaman and Nicobar Islands                3077   
28                                    Sikkim                3065   
17                                    Ladakh                3060   
25                                Puducherry                3045   
13                         Jammu and Kashmir                3028   
12                          Himachal Pradesh                3016   
22                                 Meghalaya                3015   
23                                   Mizoram                3009   
21                                   Manipur                2990   
5                                 Chandigarh                2984   
2                          Arunachal Pradesh                2979   
7   Dadra and Nagar Haveli and Daman and Diu                2978   
31                                   Tripura                2960   
9                                        Goa                2922   
18                               Lakshadweep                2919   

       avg_price  median_price  
32  8.677833e+06     1990460.0  
20  5.218439e+06     1990460.0  
15  1.617193e+07     1990460.0  
26  2.515638e+06     1990460.0  
8   3.019001e+06     1990460.0  
11  2.018441e+07     1990460.0  
29  3.271037e+06     1990460.0  
34  2.903243e+07     1990460.0  
30  2.404359e+06     1990460.0  
16  1.381462e+07     1990460.0  
19  3.720367e+06     1990460.0  
27  2.509588e+06     1990460.0  
1   2.402467e+06     1990460.0  
10  2.533112e+06     1990460.0  
24  2.405977e+06     1990460.0  
14  2.458025e+06     1990460.0  
4   2.384134e+06     1990460.0  
6   2.672705e+06     1990460.0  
3   2.472224e+06     1990460.0  
33  2.399425e+06     1990460.0  
0   2.422289e+06     1990460.0  
28  2.428607e+06     1990460.0  
17  2.388141e+06     1990460.0  
25  2.402517e+06     1990460.0  
13  2.414838e+06     1990460.0  
12  2.368753e+06     1990460.0  
22  2.393725e+06     1990460.0  
23  2.345368e+06     1990460.0  
21  2.418222e+06     1990460.0  
5   1.372737e+08     1990460.0  
2   2.408561e+06     1990460.0  
7   7.327605e+06     1990460.0  
31  2.407741e+06     1990460.0  
9   6.442383e+07     1990460.0  
18  1.667703e+08     1990460.0  
  • .groupby() groups the data by state, calculating total transactions, average price, and median price for each state.
  • Reset the index and sort the results by total_transactions in descending order to analyze regional demand.

Geographic Pricing Optimization¶

In [117]:
# Average price by state
state_pricing = sp.groupby('state').agg(
    avg_price=('price', 'mean')
).reset_index()

print("\nGeographic Pricing Analysis:")
print(state_pricing)
Geographic Pricing Analysis:
                                       state     avg_price
0                Andaman and Nicobar Islands  2.422289e+06
1                             Andhra Pradesh  2.402467e+06
2                          Arunachal Pradesh  2.408561e+06
3                                      Assam  2.472224e+06
4                                      Bihar  2.384134e+06
5                                 Chandigarh  1.372737e+08
6                               Chhattisgarh  2.672705e+06
7   Dadra and Nagar Haveli and Daman and Diu  7.327605e+06
8                                      Delhi  3.019001e+06
9                                        Goa  6.442383e+07
10                                   Gujarat  2.533112e+06
11                                   Haryana  2.018441e+07
12                          Himachal Pradesh  2.368753e+06
13                         Jammu and Kashmir  2.414838e+06
14                                 Jharkhand  2.458025e+06
15                                 Karnataka  1.617193e+07
16                                    Kerala  1.381462e+07
17                                    Ladakh  2.388141e+06
18                               Lakshadweep  1.667703e+08
19                            Madhya Pradesh  3.720367e+06
20                               Maharashtra  5.218439e+06
21                                   Manipur  2.418222e+06
22                                 Meghalaya  2.393725e+06
23                                   Mizoram  2.345368e+06
24                                    Odisha  2.405977e+06
25                                Puducherry  2.402517e+06
26                                    Punjab  2.515638e+06
27                                 Rajasthan  2.509588e+06
28                                    Sikkim  2.428607e+06
29                                Tamil Nadu  3.271037e+06
30                                 Telangana  2.404359e+06
31                                   Tripura  2.407741e+06
32                             Uttar Pradesh  8.677833e+06
33                               Uttarakhand  2.399425e+06
34                               West Bengal  2.903243e+07
  • .groupby() groups the data by state and calculate the average price for each state.
  • Reset the index to create a clean DataFrame for state-wise pricing analysis.

Analysis¶

  1. High Demand in Densely Populated and Economically Strong States

    • Top 3 states by transaction volume:
      • Uttar Pradesh (45,552 transactions)
      • Maharashtra (45,236 transactions)
      • Karnataka (30,474 transactions)
    • These states have high urbanization and purchasing power, leading to more car sales.
  2. Luxury Market in Chandigarh, Lakshadweep, and Goa

    • Lakshadweep: ₹166.7M avg price
    • Chandigarh: ₹137.2M avg price
    • Goa: ₹64.4M avg price
    • These regions exhibit significantly higher average prices, likely due to a strong luxury car segment.
  3. Affordable Car Markets in Mid-Tier States

    • States like Telangana, Andhra Pradesh, Odisha, and Rajasthan have an average price range between ₹2.4M to ₹2.5M, indicating higher demand for mid-range and budget cars.
  4. Outliers in Pricing

    • Some regions, despite low demand, have high average prices due to occasional high-value transactions (e.g., luxury and collector cars).

Key Insights¶

  1. Urban States Drive High Demand

    • Major economic hubs (e.g., Maharashtra, Karnataka, and Delhi) have the highest transaction volumes, reinforcing the connection between car demand and urban prosperity.
  2. Luxury Car Markets Are Region-Specific

    • High average prices in Chandigarh, Goa, and Lakshadweep indicate a preference for premium vehicles in these regions.
  3. Price Sensitivity Varies Across States

    • Budget-friendly states like Odisha, Jharkhand, and Bihar cater to cost-conscious buyers, while metro regions see higher average prices.
  4. Luxury Demand in Select States Creates Market Segmentation

    • West Bengal, Haryana, and Kerala also show higher-than-average pricing, indicating selective premium car purchases.

Recommendations¶

  1. Increase Inventory & Marketing in High-Demand States

    • Prioritize Uttar Pradesh, Maharashtra, and Karnataka for increased stock allocation and marketing.
  2. Expand Luxury Offerings in High-Value Markets

    • Focus on high-end vehicles in Lakshadweep, Chandigarh, Goa, and West Bengal to cater to premium buyers.
  3. Introduce Budget Car Campaigns in Cost-Sensitive Regions

    • Provide financing options and discounts in Odisha, Jharkhand, and Bihar to tap into the price-conscious segment.
  4. Localized Digital & Offline Marketing Strategies

    • Metro regions: Promote SUVs and luxury cars.
    • Tier 2 & 3 cities: Focus on affordable and fuel-efficient models.

Data Visualization¶

Compare Average Price by State¶

In [118]:
state_stats = sp.groupby('state').agg(
    avg_price=('price', 'mean'),
    most_common_type=('car_type', lambda x: x.mode()[0] if not x.isnull().all() else 'Unknown'),
    total_listings=('id', 'count')
).reset_index()

# Visualize Average Price by State
plt.figure(figsize=(12, 8))
sns.barplot(
    y='state', 
    x='avg_price', 
    data=state_stats.sort_values('avg_price', ascending=False),
    palette='coolwarm'
)
plt.title('Average Car Price by State', fontsize=14)
plt.xlabel('Average Price (₹)', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Lakshadweep: Has the highest average car price, significantly above any other state.
  2. Chandigarh and Goa: Also show relatively high average car prices.
  3. Moderate Average Prices: States like West Bengal, Haryana, Karnataka, and Kerala have moderate average car prices.
  4. Lower Average Prices: The majority of states have lower average car prices, with Mizoram having the lowest.

Recommendations¶

  1. Focus on High-Price States: Car manufacturers and dealers could target states like Lakshadweep, Chandigarh, and Goa due to their higher average car prices, indicating a wealthier customer base.
  2. Mid-Range Models: States with moderate average car prices, such as West Bengal and Haryana, can be ideal markets for mid-range car models.
  3. Budget Car Models: For states with lower average car prices, such as Mizoram and Himachal Pradesh, focus on promoting budget-friendly models.

Highlight Regions with High Prevalence of Specific Manufacturers or Models¶

In [119]:
top_manufacturers = sp['manufacturer'].value_counts().head(5).index
manufacturer_data = sp[sp['manufacturer'].isin(top_manufacturers)]

plt.figure(figsize=(14, 8))
sns.countplot(
    y='state', 
    hue='manufacturer', 
    data=manufacturer_data, 
    order=state_stats['state']
)
plt.title('Prevalence of Top Manufacturers by State', fontsize=14)
plt.xlabel('Count', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.legend(title='Manufacturer', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Ford Dominance: Ford (green) has the highest presence in many states, especially Maharashtra, Uttar Pradesh, and Tamil Nadu.
  2. Toyota's Significant Presence: Toyota (orange) is also prominent in states like Maharashtra, Karnataka, and Tamil Nadu.
  3. Chevrolet's Moderate Presence: Chevrolet (blue) has a consistent but moderate presence across most states.
  4. Nissan and Honda: Both Nissan (red) and Honda (purple) have lower counts compared to Ford and Toyota, with Honda having the least presence overall.
  5. Major Markets: Maharashtra, Uttar Pradesh, and Tamil Nadu have the highest overall counts of vehicles, indicating a larger market or higher vehicle ownership in these states.

Recommendations¶

  1. Expand in Low-Count States: Manufacturers looking to grow their market presence should focus on states with lower vehicle counts, such as Lakshadweep, Ladakh, and Arunachal Pradesh.
  2. Targeted Marketing for Nissan and Honda: These manufacturers should implement targeted marketing strategies in states where Ford and Toyota dominate to increase their market share.
  3. Capitalize on High-Count States: For states with high vehicle counts like Maharashtra, Uttar Pradesh, and Tamil Nadu, prioritize new product launches and dealership expansions to leverage the existing large market.

Hypothesis 4:- Fuel type affects demand and pricing with Diesel and gas vehicles may see higher sales and prices, while others can have minimal impact.¶

Fuel Type and Efficiency Analysis¶

Objective: Understand preferences and pricing around fuel types.¶

Yearly Mileage vs. Car Age¶

In [120]:
sp['mileage_category'] = pd.cut(sp['avg_yearly_mileage'], 
                                bins=[0, 5000, 10000, 20000, 30000, 80000, 10000000], 
                                labels=['Very Low', 'Low', 'Moderate', 'Considerate', 'High', 'Very High'])
mileage_vs_age = sp.groupby(['car_age', 'mileage_category'])['id'].count().unstack()
print("Mileage vs Car Age:")
print(mileage_vs_age)
Mileage vs Car Age:
mileage_category  Very Low    Low  Moderate  Considerate  High  Very High
car_age                                                                  
0.0                   1809    241       103           12    13         82
1.0                   5483   4037      5182         2591  1827        146
2.0                   4974   6738      9323         2792  1481         43
3.0                   4142  12372     14177         3882  1691         57
4.0                   3995  10768     14149         5603  1786         56
...                    ...    ...       ...          ...   ...        ...
116.0                    1      0         0            0     0          0
118.0                   12      0         0            0     0          0
119.0                    0      1         0            0     0          0
120.0                    3      0         0            0     0          0
121.0                   11      1         0            0     0          0

[113 rows x 6 columns]
  • sp['mileage_category'] creates a mileage_category column by binning avg_yearly_mileage into defined ranges with corresponding labels (Very Low, Low, etc.).
  • .groupby() groups the data by car_age and mileage_category, counting the number of transactions for each combination.
  • Use .unstack() to pivot the results for easier analysis of mileage categories across car ages.

Fuel Type Trends by Region¶

In [121]:
fuel_region_trends = sp.groupby(['state', 'fuel'])['id'].count().unstack()
print("Fuel Type Trends by Region:")
print(fuel_region_trends)
Fuel Type Trends by Region:
fuel                                      Unknown  diesel  electric    gas  \
state                                                                        
Andaman and Nicobar Islands                    10     226        15   2566   
Andhra Pradesh                                 64    1049        63  12595   
Arunachal Pradesh                              11     202         6   2516   
Assam                                          26     445        28   4993   
Bihar                                          24     398        18   5083   
Chandigarh                                     11     208        14   2504   
Chhattisgarh                                   20     420        19   5036   
Dadra and Nagar Haveli and Daman and Diu       10     222        16   2508   
Delhi                                         126    2151       126  25455   
Goa                                            11     230        10   2432   
Gujarat                                        64    1031        62  12522   
Haryana                                       139    2161       135  25178   
Himachal Pradesh                               13     235         8   2507   
Jammu and Kashmir                              16     200         6   2561   
Jharkhand                                      28     423        26   5072   
Karnataka                                     145    2097       108  25581   
Kerala                                         61    1067        59  12754   
Ladakh                                          8     235        12   2542   
Lakshadweep                                    12     201         8   2481   
Madhya Pradesh                                 50    1134        51  12647   
Maharashtra                                   168    3128       184  37964   
Manipur                                         9     220        13   2507   
Meghalaya                                      10     226         4   2523   
Mizoram                                        10     185        17   2543   
Odisha                                         42     846        46  10149   
Puducherry                                     13     232        14   2546   
Punjab                                        110    2137       126  25531   
Rajasthan                                      48    1083        63  12663   
Sikkim                                         10     229        10   2577   
Tamil Nadu                                     65    1284        74  15161   
Telangana                                      50    1091        70  12877   
Tripura                                         8     193        15   2510   
Uttar Pradesh                                 200    3228       187  38092   
Uttarakhand                                    20     444        20   4971   
West Bengal                                    79    1240        68  15119   

fuel                                      hybrid  other  
state                                                    
Andaman and Nicobar Islands                   35    225  
Andhra Pradesh                               191   1103  
Arunachal Pradesh                             29    215  
Assam                                         59    435  
Bihar                                         74    434  
Chandigarh                                    26    221  
Chhattisgarh                                  70    436  
Dadra and Nagar Haveli and Daman and Diu      28    194  
Delhi                                        340   2170  
Goa                                           31    208  
Gujarat                                      172   1134  
Haryana                                      360   2194  
Himachal Pradesh                              34    219  
Jammu and Kashmir                             36    209  
Jharkhand                                     79    446  
Karnataka                                    371   2172  
Kerala                                       185   1167  
Ladakh                                        45    218  
Lakshadweep                                   41    176  
Madhya Pradesh                               235   1082  
Maharashtra                                  558   3234  
Manipur                                       31    210  
Meghalaya                                     42    210  
Mizoram                                       30    224  
Odisha                                       153    948  
Puducherry                                    31    209  
Punjab                                       372   2165  
Rajasthan                                    178   1086  
Sikkim                                        47    192  
Tamil Nadu                                   228   1304  
Telangana                                    192   1072  
Tripura                                       35    199  
Uttar Pradesh                                549   3296  
Uttarakhand                                   76    416  
West Bengal                                  232   1301  
  • .groupby()Group the data by state and fuel, counting the number of transactions for each combination.
  • Use .unstack() to pivot the results, creating a table with fuel as columns and state as rows.

Influence of Fuel Type on Demand and Pricing¶

In [122]:
# Analyze demand and pricing by fuel type
fuel_analysis = sp.groupby('fuel').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index()

print("\nFuel Type Analysis (Demand and Pricing):")
print(fuel_analysis)
Fuel Type Analysis (Demand and Pricing):
       fuel  total_transactions     avg_price  median_price
0   Unknown                1691  2.668044e+06     2310590.0
1    diesel               30101  1.541457e+07     3594780.0
2  electric                1701  3.259814e+06     2901280.0
3       gas              357266  9.623518e+06     1925360.0
4    hybrid                5195  1.983751e+06     1757450.0
5     other               30724  8.743496e+06     3337990.0
  • .groupby() groups the data by manufacturer and model, calculating total transactions, average price, and median price for each group.
  • Reset the index and sort the results by total_transactions in descending order.
  • Display the top 10 most popular car models and brands based on transaction count.

Correlation Analysis¶

In [123]:
# Correlation matrix
correlation_matrix = sp[['price', 'odometer', 'car_age']].corr()
print("\nCorrelation Analysis:")
print(correlation_matrix)
Correlation Analysis:
             price  odometer   car_age
price     1.000000  0.010112  0.004899
odometer  0.010112  1.000000  0.158340
car_age   0.004899  0.158340  1.000000
  • .corr() calculates the correlation matrix for the price, odometer, and car_age columns to analyze the relationships between these variables.

Analysis¶

1. Mileage vs. Car Age¶

  • Newer Cars (0-1 years old): Most have Very Low to Moderate mileage, suggesting limited use.
  • 2-4-year-old Cars: A shift towards Moderate to Considerate mileage indicates regular use.
  • Older Cars (>10 years): Few transactions, mostly Very Low mileage, suggesting classic or less-used vehicles.

2. Fuel Type Trends by Region¶

  • High Gas Demand Nationwide: Most transactions are for gas cars, especially in Delhi, Maharashtra, Karnataka, and Uttar Pradesh.
  • Diesel is Strong in North & West: High diesel usage in Haryana, Punjab, Rajasthan, and Maharashtra, where long-distance travel is common.
  • Electric Cars Are Emerging: Limited presence but growing in metro states like Delhi, Karnataka, Maharashtra.

3. Fuel Type Pricing & Demand Analysis¶

  • Diesel Cars Command Higher Prices: ₹15.4M average price due to fuel efficiency & long-term savings.
  • Electric Vehicles Are Expensive: ₹3.26M average price, but demand remains low.
  • Hybrid Cars Are Affordable: ₹1.98M average price, appealing for fuel-conscious buyers.

4. Correlation Analysis¶

  • Weak correlation between price and odometer/car age: Car value isn't significantly impacted by mileage or age alone, suggesting brand, model, and condition play key roles in pricing.

Key Insights¶

  1. Newer cars have lower mileage, making them attractive for buyers looking for less-used vehicles.
  2. Gas cars dominate the market, with diesel and hybrid cars preferred in certain regions.
  3. Electric cars have high pricing but low demand, indicating a slow adoption rate.
  4. Car age and odometer reading do not strongly influence pricing, reinforcing the importance of brand reputation and vehicle condition.

Recommendations¶

  1. Target Gas Car Buyers in High-Demand States

    • Focus on Delhi, Maharashtra, and Uttar Pradesh to maximize gas car sales.
  2. Expand Diesel Inventory in North & West India

    • Promote diesel cars in Punjab, Haryana, and Rajasthan, where demand is high.
  3. Incentivize Electric Vehicle Adoption

    • Offer subsidies or financing options in metro states (Delhi, Karnataka, Maharashtra) to encourage EV sales.
  4. Emphasize Car Condition in Marketing

    • Since pricing isn't strongly linked to age or odometer, highlight certifications, maintenance records, and brand trust in listings.

Data Visualization¶

Price Distribution by Fuel Type¶

In [124]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='fuel', y='price', data=sp, showfliers=False, palette='Set3')
plt.title('Price Distribution by Fuel Type', fontsize=14)
plt.xlabel('Fuel Type', fontsize=12)
plt.ylabel('Price (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Diesel Vehicles: Show the highest price variability with a wide interquartile range and the highest maximum prices among all fuel types.
  2. Hybrid Vehicles: Exhibit the lowest median price and the smallest price range, indicating more affordability and consistency in pricing.
  3. Electric Vehicles: Have a relatively high median price with a moderate price range, reflecting the newer technology and higher costs.
  4. Gas and Other Fuel Types: Both categories show moderate median prices and similar price distributions, indicating average variability in prices.
  5. Unknown Category: Displays price distribution similar to gas and other fuel types, with moderate prices and variability.

Recommendations¶

  1. For Consumers Seeking Affordability: Hybrid vehicles are the best option due to their lower median price and smaller price range.
  2. For Electric Vehicle Enthusiasts: Be prepared for a higher median price with moderate price variability. This category suits those interested in cutting-edge technology and sustainability.
  3. For a Wide Range of Price Options: Diesel vehicles offer a broad spectrum of prices, making them suitable for consumers willing to pay a premium and looking for diverse pricing options.
  4. Market Strategies for Sellers:
    • Target Hybrids to Budget-Conscious Consumers: Emphasize the affordability and consistency in pricing of hybrid vehicles.
    • Promote Electric Vehicles to Tech-Savvy Buyers: Highlight the benefits of electric vehicles for environmentally conscious and technology-forward consumers.
    • Diversify Offerings with Diesel Vehicles: Cater to a wider audience with the varied price points available in diesel vehicles.

Correlation Between Car_age, Odometer, and Price¶

In [125]:
correlation_matrix = sp[['car_age', 'odometer', 'price', 'cylinders']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', cbar=True)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Car Age vs. Price: There is no significant linear relationship between car age and price, as indicated by the correlation coefficient of 0.00.
  2. Odometer vs. Price: Similarly, the correlation between odometer reading and price is very weak (0.01), suggesting minimal linear relationship.
  3. Car Age vs. Odometer: There is a weak positive linear relationship between car age and odometer readings (0.16), indicating that older cars tend to have higher mileage, but this is not a strong predictor.

Recommendations¶

  1. Explore Additional Factors: Given the lack of strong correlations with price, consider analyzing other variables (e.g., car make and model, condition, location) to better understand what influences car prices.
  2. Combine Predictors: While age and odometer readings alone are not strong predictors, combining them with other factors in a multivariate analysis might yield more accurate predictions.
  3. Market Segmentation: Focus on different market segments (e.g., newer cars with low mileage vs. older cars with high mileage) and tailor marketing strategies accordingly.
  4. Regular Maintenance: For buyers, prioritize cars with a history of regular maintenance over simply relying on age or mileage as indicators of value.
  5. Further Research: Conduct further research to identify key predictors of car prices that are not captured in the current dataset.

Regional Trends in Fuel Preferences¶

In [126]:
fuel_region = sp.groupby(['state', 'fuel']).size().unstack(fill_value=0)

# Normalize to get percentages
fuel_region_percentage = fuel_region.div(fuel_region.sum(axis=1), axis=0)

# Heatmap of fuel preferences by region
plt.figure(figsize=(14, 8))
sns.heatmap(
    fuel_region_percentage, 
    annot=True, 
    fmt=".1%", 
    cmap='coolwarm', 
    linewidths=0.5, 
    cbar_kws={'label': 'Fuel Preference (%)'}
)
plt.title('Fuel Preferences by State', fontsize=14)
plt.xlabel('Fuel Type', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Fuel Preferences: Gas is the most preferred fuel type across all states, with high percentages (83.1% to 85.0%). Diesel follows, with lower but still significant percentages (6.1% to 7.9%).
  2. Alternative Fuels: Electric and hybrid fuel types have low preferences, indicating limited adoption. Electric ranges from 0.1% to 0.6%, and hybrid from 0.9% to 1.5%.
  3. Minor Categories: The "Unknown" and "Other" categories also have low preferences, with "Unknown" ranging from 0.3% to 0.5% and "Other" from 6.0% to 7.8%.

Recommendations¶

  1. Promote Alternative Fuels: Focus on promoting electric and hybrid vehicles to reduce dependency on gas and diesel. Highlight their environmental benefits and cost savings over time.
  2. Incentives and Policies: Implement policies and incentives to encourage the adoption of electric vehicles. Subsidies, tax breaks, and charging infrastructure can make electric vehicles more attractive to consumers.
  3. Raise Awareness: Increase awareness and education about the benefits and availability of alternative fuel vehicles. Campaigns and informational sessions can help dispel myths and encourage adoption.
  4. Infrastructure Development: Invest in the necessary infrastructure, such as electric vehicle charging stations and hybrid vehicle service centers, to support the growing market for alternative fuel vehicles.

Hypothesis 5:- Top brands dominate regionally and through availability, while premium brands retain higher prices due to their positioning. Low market can grow with targeted strategies.¶

Manufacturer and Model Popularity¶

Objective: Determine the most popular car brands and models.¶

List of Manufacturer with corresponding Model¶

In [127]:
# Group by 'manufacturer' and find min and max year
manufacturer_year_stats = sp.groupby(['manufacturer','model'])['year'].agg(['min', 'max']).reset_index()

# Rename columns for clarity
manufacturer_year_stats.columns = ['manufacturer','model', 'min_year', 'max_year']

# Display the result
print(manufacturer_year_stats)
      manufacturer                           model  min_year  max_year
0          Unknown                               %    2002.0    2002.0
1          Unknown              '50 Business Coupe    1950.0    1950.0
2          Unknown                   '99 H1 Hummer    1999.0    1999.0
3          Unknown                         ,,,,,,,    2009.0    2009.0
4          Unknown    ,2012,2013, SOME 2014 MODELS    2014.0    2014.0
...            ...                             ...       ...       ...
30078        volvo  xc90 t8 inscription 1300 miles    2020.0    2020.0
30079        volvo                         xc90 v8    2008.0    2011.0
30080        volvo                     xc90 v8 awd    2007.0    2007.0
30081        volvo                  xc90 v8 awdsuv    2006.0    2006.0
30082        volvo          xc90t6 awd 24733 miles    2018.0    2018.0

[30083 rows x 4 columns]
  • Using .groupby() groups the data by manufacturer and model, then calculate the minimum and maximum year for each group.
  • Rename the columns for clarity.
  • Display the result with the min_year and max_year for each manufacturer and model.

Data Visualization¶

Top Manufacturers¶

In [128]:
# Data preparation
manufacturer_counts = sp['manufacturer'].value_counts()
manufacturer_stats = sp.groupby('manufacturer').agg(
    avg_price=('price', 'mean'),
    total_listings=('id', 'count')
).reset_index()

# Filter for the top 10 manufacturers by frequency
top_manufacturers = manufacturer_counts.head(10).index
manufacturer_stats_top = manufacturer_stats[manufacturer_stats['manufacturer'].isin(top_manufacturers)]
manufacturer_stats_top = manufacturer_stats_top.set_index('manufacturer').loc[top_manufacturers]  # Maintain order
avg_price = manufacturer_stats_top['avg_price']
total_listings = manufacturer_stats_top['total_listings']

# Create the plot
fig, ax1 = plt.subplots(figsize=(14, 6))

# Bar plot for total listings
bar_width = 0.4
x = np.arange(len(top_manufacturers))
ax1.bar(x - bar_width / 2, total_listings, width=bar_width, color='lightblue', label='Number of Listings')
ax1.set_ylabel('Number of Listings', fontsize=12, color='blue')
ax1.set_xlabel('Manufacturer', fontsize=12)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_xticks(x)
ax1.set_xticklabels(top_manufacturers, rotation=45, ha='right')

# Bar plot for average price
ax2 = ax1.twinx()
ax2.bar(x + bar_width / 2, avg_price, width=bar_width, color='salmon', label='Average Price (₹)')
ax2.set_ylabel('Average Price (₹)', fontsize=12, color='red')
ax2.tick_params(axis='y', labelcolor='red')

# Add title and legend
plt.title('Top 10 Manufacturers: Number of Listings vs. Average Price', fontsize=14)
fig.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)

# Adjust layout
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Ford Dominates Listings: Ford has the highest number of listings among the top 10 manufacturers, indicating its popularity and wide availability in the market.
  2. High Average Prices for Jeep: Jeep vehicles have the highest average price, suggesting that they are positioned as premium offerings compared to other brands.
  3. Chevrolet and Toyota: These brands also have a high number of listings, making them popular choices. However, their average prices are lower than Jeep, indicating a broader market appeal.
  4. Affordable Options: Brands like Honda, Nissan, and BMW show relatively lower average prices, making them more affordable options for buyers.
  5. Moderate Listings and Prices: GMC and Unknown manufacturers have a moderate number of listings and average prices, indicating a balanced presence in the market.
  6. Lower Listings for Ram: Ram has a lower number of listings but maintains a moderate average price, indicating niche market appeal.

Recommendations¶

  1. Target Diverse Market Segments:
    • For High Listings: Focus marketing efforts on popular brands like Ford, Chevrolet, and Toyota to leverage their high number of listings.
    • For Premium Buyers: Highlight Jeep's premium positioning to attract buyers looking for higher-end vehicles.
  2. Promote Affordable Brands: Emphasize the affordability of brands like Honda, Nissan, and BMW to attract budget-conscious buyers.
  3. Expand Niche Markets: Explore opportunities to increase the listings for brands like Ram and GMC to capture niche market segments.
  4. Increase Awareness for Moderate Brands: Enhance brand visibility and marketing for GMC and Unknown manufacturers to boost their presence and attract more buyers.
  5. Analyze Price Factors: Conduct further analysis to understand the factors driving high average prices for Jeep and Ford, and utilize these insights to optimize pricing strategies for other brands.

Study Regional Dominance of Manufacturers¶

In [129]:
manufacturer_region = sp.groupby(['state', 'manufacturer']).size().unstack(fill_value=0)

# Normalize to show percentages
manufacturer_region_percentage = manufacturer_region.div(manufacturer_region.sum(axis=1), axis=0)

# Heatmap for regional dominance of top manufacturers
plt.figure(figsize=(14, 8))
sns.heatmap(
    manufacturer_region_percentage[top_manufacturers], 
    annot=True, 
    fmt=".1%", 
    cmap='coolwarm', 
    linewidths=0.5, 
    cbar_kws={'label': 'Regional Market Share (%)'}
)
plt.title('Regional Dominance of Top Manufacturers', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Ford's Dominance: Ford has a significant market share across many states, with the highest in Arunachal Pradesh at 17.8%.
  2. Chevrolet and Toyota: Both brands have strong market shares in several states, with Chevrolet peaking in Meghalaya at 13.8% and Toyota in Manipur at 13.8%.
  3. Regional Leaders: Each state has distinct leaders among manufacturers, e.g., Honda in Ladakh (8.4%), Nissan in Mizoram (5.2%), and Jeep in Mizoram (5.0%).
  4. Consistent Presence: Ford, Chevrolet, and Toyota have consistent high presence across multiple states, while other brands like Jeep, Ram, GMC, and BMW have lower market shares.

Recommendations¶

  1. Expand Lesser-Known Brands: Manufacturers with lower market shares (e.g., Honda, Nissan, Jeep) should focus on marketing in states where they have relatively higher shares to consolidate and expand.
  2. Leverage Strong States: Ford, Chevrolet, and Toyota should leverage their strong market positions in states where they dominate, while exploring opportunities in states with lower shares.
  3. Analyze Success Factors: Study successful strategies in states with high market shares to replicate these in other regions. Factors such as local preferences, economic conditions, and competitive pricing should be considered.
  4. Targeted Campaigns: Implement targeted marketing campaigns tailored to the specific needs and preferences of consumers in each state to boost market presence and engagement.

Hypothesis 6:- Vehicle condition and title status drive pricing and demand trends where certain condition demand higher prices while other remain in niche markets.¶

Condition and Title Analysis¶

Objective: Investigate how car condition and title status impact pricing and desirability.¶

Impact of Condition on Price¶

In [130]:
condition_price = sp.groupby('condition')['price'].mean().sort_values(ascending=False)
print("Average Price by Condition:")
print(condition_price)
Average Price by Condition:
condition
fair         8.605288e+07
excellent    1.223083e+07
good         4.941285e+06
like new     4.439667e+06
Unknown      3.644879e+06
new          3.355070e+06
salvage      6.370346e+05
Name: price, dtype: float64
  • .groupby() groups the data by condition and calculate the average price for each condition.
  • Sort the results by the average price in descending order.

Title Status and Its Impact on Pricing¶

In [131]:
# Pricing analysis by title status
title_analysis = sp.groupby('title_status').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean')
).reset_index()
print("\nTitle Status Analysis (Demand and Pricing):")
print(title_analysis)
Title Status Analysis (Demand and Pricing):
  title_status  total_transactions     avg_price
0      Unknown                8139  2.029250e+06
1        clean              405030  1.024240e+07
2         lien                1422  2.888077e+06
3      missing                 810  2.874681e+06
4   parts only                 197  1.646425e+06
5      rebuilt                7219  1.966364e+06
6      salvage                3861  1.345844e+06
  • .groupby() groups the data by title_status and calculate the total number of transactions and the average price for each title status.
  • Reset the index to create a clean DataFrame for pricing analysis by title status.

Demand Analysis by Condition¶

In [132]:
# Analyze demand and pricing by condition
condition_analysis = sp.groupby('condition').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index()

print("\nDemand and Pricing by Condition:")
print(condition_analysis)
Demand and Pricing by Condition:
   condition  total_transactions     avg_price  median_price
0    Unknown               36669  3.644879e+06     2309580.0
1  excellent              209179  1.223083e+07     1926290.0
2       fair                7702  8.605288e+07      385220.0
3       good              143637  4.941285e+06     2310590.0
4   like new               27061  4.439667e+06     1990460.0
5        new                1781  3.355070e+06     2298740.0
6    salvage                 649  6.370346e+05      256870.0
  • .groupby() groups the data by condition, calculating total transactions, average price, and median price for each condition.
  • Reset the index to create a clean DataFrame for demand and pricing analysis by condition.

Analysis¶

1. Average Price by Condition¶

  • "Fair" Condition Cars Have the Highest Average Price (₹86M): This seems unusual and could indicate outliers or data anomalies.
  • "Excellent" and "Good" Condition Cars Are Priced High (₹12.2M and ₹4.9M, respectively): This aligns with expectations as well-maintained cars command premium pricing.
  • "Salvage" Cars Have the Lowest Price (₹637K): These vehicles have significant damage and require repairs.

2. Title Status Analysis (Demand and Pricing)¶

  • Clean Title Cars Dominate the Market (405K transactions, ₹10.24M avg price): Buyers prefer legally clear cars with no issues.
  • Lien & Missing Title Cars Have Lower Demand (1.4K-800 transactions): These indicate ownership or financial disputes, leading to lower market interest.
  • Salvage and Rebuilt Cars Are Cheaper (₹1.34M - ₹1.96M avg price): These vehicles attract budget-conscious buyers but require thorough inspection.

3. Demand and Pricing by Condition¶

  • Most Transactions Are for "Excellent" (209K) and "Good" (143K) Condition Cars: These are the most sought-after categories.
  • "New" Cars Have Low Transaction Volume (1.7K): This suggests the dataset primarily consists of used cars.
  • "Salvage" Cars Have the Lowest Median Price (₹256K): Buyers are unwilling to pay high amounts for vehicles that require repairs.

Key Insights¶

  1. "Fair" Condition Pricing Anomaly: The unusually high price (₹86M) suggests data errors or misclassified vehicles.
  2. Buyers Prefer "Excellent" or "Good" Condition Cars: Over 350K transactions occurred in these two categories, indicating high demand.
  3. Clean Title Vehicles Are the Market Standard: 405K transactions show a strong buyer preference for cars without ownership issues.
  4. Rebuilt and Salvage Cars Have Low Demand and Pricing: These cars sell for less and have lower transaction volumes, making them a niche market.

Recommendations¶

  1. Investigate "Fair" Condition Pricing Anomalies

    • Verify if data errors, extreme outliers, or misclassifications are inflating the price.
  2. Prioritize Listing "Excellent" and "Good" Condition Cars

    • Since these categories have the highest demand, sellers should focus on well-maintained vehicles to maximize sales.
  3. Encourage Buyers to Choose Clean Title Cars

    • Highlight the legal and financial safety of clean title cars in marketing campaigns.
  4. Offer Inspection and Financing Support for Salvage & Rebuilt Cars

    • Providing vehicle certification, repair history, and financing options can make these cars more attractive to budget-conscious buyers.

Data Visualization¶

Compare Average Prices for Different Condition Categories¶

In [133]:
condition_stats = sp.groupby('condition')['price'].mean().sort_values(ascending=False)

# Plotting average prices by condition
plt.figure(figsize=(12, 6))
sns.barplot(
    x=condition_stats.index, 
    y=condition_stats.values, 
    palette='Set2'
)
plt.title('Average Price by Condition', fontsize=14)
plt.xlabel('Condition', fontsize=12)
plt.ylabel('Average Price (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Fair Condition: Items in fair condition have the highest average price, nearing 9 million rupees. This is significantly higher than all other conditions, indicating a unique market value.
  2. Excellent Condition: Items in excellent condition come next with an average price around 2 million rupees, making them valuable but much lower than items in fair condition.
  3. Lower Prices for Other Conditions: The conditions labeled as good, like new, unknown, new, and salvage all have average prices below 1 million rupees, suggesting less market value compared to fair and excellent conditions.

Recommendations¶

  1. Focus on Fair Condition: Given the high average price, it may be beneficial to prioritize acquiring or maintaining items in fair condition. Understanding the specific factors driving this high valuation could provide strategic advantages.
  2. Promote Excellent Condition: Items in excellent condition also hold considerable value. Enhance marketing efforts for these items to attract buyers looking for high-quality goods.
  3. Investigate Pricing for Other Conditions: Explore why items in other conditions (good, like new, unknown, new, salvage) have lower average prices and consider strategies to enhance their market appeal, such as improvements in presentation or highlighting unique features.
  4. Market Analysis: Conduct further market analysis to understand the demand and supply dynamics for items in different conditions. This can help in setting more competitive pricing and optimizing inventory management.

Correlations Between Title Status and Price¶

In [134]:
title_stats = sp.groupby('title_status')['price'].mean().sort_values(ascending=False)

# Plotting average prices by title status
plt.figure(figsize=(12, 6))
sns.barplot(
    x=title_stats.index, 
    y=title_stats.values, 
    palette='Set3'
)
plt.title('Average Price by Title Status', fontsize=14)
plt.xlabel('Title Status', fontsize=12)
plt.ylabel('Average Price (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Clean Title Status: Vehicles with a clean title status have the highest average price, reaching approximately ₹10 million. This indicates a strong preference for vehicles with a clear and undisputed ownership history.
  2. Lien and Missing Title Statuses: These categories show the next highest average prices, around ₹2 million, suggesting some buyers are still willing to invest in vehicles with potential ownership complications.
  3. Other Title Statuses: Vehicles with unknown, rebuilt, parts only, and salvage title statuses have significantly lower average prices, all below ₹2 million. This reflects the reduced market value and potential risks associated with these categories.

Recommendations¶

  1. Prioritize Clean Title Vehicles: Focus on acquiring and marketing vehicles with clean title statuses to maximize revenue and appeal to buyers seeking reliable ownership history.
  2. Evaluate Lien and Missing Title Vehicles: Consider these vehicles as viable options for budget-conscious buyers, but ensure thorough checks and transparency about potential ownership issues.
  3. Cautious Approach for Other Titles: Be cautious with vehicles having unknown, rebuilt, parts only, or salvage titles. These should be marketed with clear disclaimers and possibly targeted to buyers looking for project cars or parts.

Combined Analysis of Condition and Title Status on Price¶

In [135]:
combined_stats = sp.groupby(['condition', 'title_status'])['price'].mean().unstack(fill_value=0)

# Heatmap for combined analysis
plt.figure(figsize=(14, 8))
sns.heatmap(
    combined_stats, 
    annot=True, 
    fmt=".0f", 
    cmap='coolwarm', 
    cbar_kws={'label': 'Average Price (₹)'},
    linewidths=0.5
)
plt.title('Average Price by Condition and Title Status', fontsize=14)
plt.xlabel('Title Status', fontsize=12)
plt.ylabel('Condition', fontsize=12)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Highest Average Price: The highest average price is ₹94,209,863 for items in "fair" condition with a "clean" title status. This suggests a significant market value for such items.
  2. Clean Title Advantage: Vehicles with a "clean" title status consistently have higher average prices across most conditions, indicating a strong preference for clear ownership.
  3. Fair Condition Premium: Items in "fair" condition have notably high average prices, especially when paired with a "clean" title.
  4. Lower Prices for Salvage Title: Vehicles with a "salvage" title status generally have the lowest average prices, regardless of condition, indicating reduced market value due to potential damage or issues.

Recommendations¶

  1. Focus on Clean Title Vehicles: Prioritize acquiring and marketing vehicles with a "clean" title status, as they command higher prices and appeal to buyers seeking reliable ownership history.
  2. Highlight Fair Condition: Emphasize the value of vehicles in "fair" condition, especially with a clean title, to attract buyers willing to pay a premium for such items.
  3. Cautious with Salvage Titles: Approach vehicles with "salvage" titles with caution. Ensure transparency about potential issues and target budget-conscious buyers or those looking for project cars.
  4. Optimize Pricing Strategies: Use the insights from average price distributions to set competitive and appealing prices for vehicles based on their condition and title status.

Hypothesis 7:- Transaction volumes show seasonal peaks, with higher volumes early in the week due to practical factors.¶

Posting Date Patterns¶

Objective: Examine temporal trends in car listings.¶

Time-Based Demand Trends¶

In [136]:
monthly_trends = sp.groupby(['posting_year', 'posting_month'])['id'].count().reset_index()
monthly_trends.columns = ['Year', 'Month', 'Transactions']
print("Monthly Transaction Trends:")
print(monthly_trends)
Monthly Transaction Trends:
   Year  Month  Transactions
0  2021      4        313053
1  2021      5        113625
  • .groupby() groups the data by posting_year and posting_month, then count the number of transactions for each month.
  • Rename the columns to Year, Month, and Transactions.
In [137]:
weekday_trends = sp['posting_weekday'].value_counts()
print("Weekday Trends:")
print(weekday_trends)
Weekday Trends:
Tuesday      78444
Monday       75067
Friday       66569
Saturday     61453
Thursday     52984
Wednesday    51080
Sunday       41081
Name: posting_weekday, dtype: int64
  • Count the number of transactions for each unique value in the posting_weekday column.
  • Display the transaction count by weekday.

Analysis¶

1. Monthly Transaction Trends¶

  • The dataset currently shows only two months (April & May 2021), which limits the ability to analyze seasonality or long-term trends.
  • April 2021 had a significantly higher number of transactions (313K) compared to May 2021 (113K), indicating a possible drop in demand or listing activity.

2. Weekday Transaction Trends¶

  • Most Listings Occur on Tuesdays (78K) and Mondays (75K): Sellers are more active at the beginning of the week.
  • Least Listings on Sundays (41K): People tend to be less engaged in selling cars on weekends.
  • Fridays & Saturdays Show Moderate Activity (~66K & 61K transactions, respectively).
  • Wednesdays & Thursdays Have the Lowest Mid-Week Activity (~51K-52K transactions).

Key Insights¶

  1. Transaction Volume Dropped From April to May 2021

    • This could be due to seasonal effects, economic factors, or market fluctuations. More historical data is needed to determine long-term trends.
  2. Tuesdays and Mondays Are the Best Days to List a Vehicle

    • These two days account for the highest number of postings, likely because buyers start searching at the beginning of the week.
  3. Sunday Is the Least Active Day for Transactions

    • Sellers and buyers are less engaged on Sundays, making it the worst day to list a vehicle.

Recommendations¶

  1. Analyze More Monthly Data to Identify Trends

    • Collect data from additional months to determine seasonal fluctuations and yearly patterns.
    • Look at festivals, holidays, and economic trends that may impact vehicle sales.
  2. Encourage Listings on High-Traffic Days (Monday & Tuesday)

    • Run promotions or ad campaigns encouraging sellers to list on these days for maximum visibility.
    • Offer discounted listing fees or highlighted ads to sellers who post on peak days.
  3. Optimize Pricing & Promotions on Sundays to Boost Engagement

    • Since Sundays have low activity, consider discounted pricing, free listing promotions, or exclusive Sunday deals to increase engagement.

Data Visualization¶

Weekly Trends in Car Postings¶

In [138]:
weekday_counts = sp['posting_weekday'].value_counts()

plt.figure(figsize=(10, 6))
sns.barplot(
    x=weekday_counts.index, 
    y=weekday_counts.values, 
    palette='viridis'
)
plt.title('Weekly Trends in Car Postings', fontsize=14)
plt.xlabel('Day of the Week', fontsize=12)
plt.ylabel('Number of Listings', fontsize=12)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights from the Bar Chart¶

  1. Peak Posting Day: Tuesday has the highest number of car listings, indicating that this is the most active day for car postings.
  2. Consistently High Days: Monday, Friday, and Saturday also see a high number of postings, following Tuesday.
  3. Mid-Range Activity: Thursday and Wednesday have moderate levels of activity.
  4. Lowest Activity: Sunday has the lowest number of car postings, suggesting it is the least active day for new listings.

Recommendations¶

  1. Optimize Posting Schedule: For maximum visibility and engagement, schedule car postings on Tuesdays and Mondays. This aligns with the peak activity days.
  2. Leverage High Activity Days: Utilize Fridays and Saturdays for additional postings to maintain high visibility and attract more potential buyers.
  3. Boost Sunday Activity: Consider promotions or special deals on Sundays to increase the number of postings and engage users on the platform during this low-activity day.
  4. Monitor Trends: Continuously monitor the weekly trends to adjust posting strategies and optimize timing for maximum impact.

Hypothesis 8:- Vehicle size and drive type influence pricing, with 4wd premium, fwd affordable, and rwd balanced.¶

Transmission, Drive and Size Analysis¶

Objective: Understand the impact of transmission type and drive system.¶

Revenue Contribution by Vehicle Size¶

In [139]:
# Revenue contribution by size
size_revenue = sp.groupby('size').agg(
    total_revenue=('price', 'sum'),
    avg_price=('price', 'mean')
).reset_index()

print("\nRevenue Contribution by Vehicle Size:")
print(size_revenue)
Revenue Contribution by Vehicle Size:
          size  total_revenue     avg_price
0      Unknown   7.099603e+11  5.864291e+06
1      compact   4.462154e+11  8.724345e+06
2    full-size   2.387425e+12  1.488642e+07
3     mid-size   6.333929e+11  7.214700e+06
4  sub-compact   1.415356e+10  2.246953e+06
  • .groupby() groups the data by size, calculating the total revenue and average price for each size category.
  • Reset the index to create a clean DataFrame for revenue analysis by size.

Analysis¶

1. Total Revenue by Vehicle Size¶

  • Full-size vehicles generate the highest total revenue (₹2.39 trillion), indicating strong demand and higher pricing.
  • Mid-size and compact cars also contribute significantly (₹633 billion & ₹446 billion, respectively).
  • Sub-compact cars contribute the least revenue (₹14 billion), suggesting lower demand or lower pricing.
  • The "Unknown" category accounts for ₹709 billion, which may include missing data or undefined vehicle sizes.

2. Average Price by Vehicle Size¶

  • Full-size vehicles have the highest average price (₹14.89 million), followed by compact cars (₹8.72 million).
  • Mid-size vehicles are priced lower (₹7.21 million), while sub-compact cars have the lowest average price (₹2.25 million).
  • The "Unknown" category has an average price of ₹5.86 million, which could skew overall pricing trends.

Key Insights¶

  1. Full-Size Vehicles Dominate Revenue Contribution

    • They generate the highest total revenue and have the highest average price, indicating strong demand among premium buyers.
  2. Compact & Mid-Size Vehicles Offer a Balance Between Demand and Affordability

    • These categories generate substantial revenue and have reasonable pricing, making them attractive for a wider consumer base.
  3. Sub-Compact Cars Have the Lowest Revenue Contribution

    • Lower average pricing and potentially lower demand contribute to their small share of total revenue.
  4. A Large Portion of the Data Falls Under "Unknown" Size

    • ₹709 billion worth of transactions have no defined size, which might indicate data quality issues or missing entries.

Recommendations¶

  1. Focus on High-Value Full-Size Vehicle Sales

    • Given their high revenue contribution and pricing, invest in marketing, premium listings, and financing options to boost full-size vehicle sales.
  2. Target Compact & Mid-Size Vehicle Buyers

    • These vehicles appeal to cost-conscious buyers, so offering loan options, exchange programs, and extended warranties can boost sales.
  3. Boost Demand for Sub-Compact Cars Through Discounts & Incentives

    • Since sub-compact cars contribute the least revenue, introduce budget-friendly financing options, trade-in offers, and discounts to attract buyers.
  4. Clean & Classify "Unknown" Data for Better Accuracy

    • Investigate why certain listings are categorized as "Unknown" and enforce better data collection practices to ensure accurate reporting.

Data Visualization¶

Average Price and Odometer by Drive Type.¶

In [140]:
# Calculate averages
drive_data = sp.groupby('drive').agg({'price': 'mean', 'odometer': 'mean'}).sort_index()

# Extract values
drive_types = drive_data.index
price_values = drive_data['price']
odometer_values = drive_data['odometer']

# Create the plot
fig, ax1 = plt.subplots(figsize=(14, 6))
x = np.arange(len(drive_types))
bar_width = 0.4

# Plot bars
ax1.bar(x - bar_width / 2, price_values, width=bar_width, color='skyblue', label='Average Price (₹)')
ax2 = ax1.twinx()
ax2.bar(x + bar_width / 2, odometer_values, width=bar_width, color='lightgreen', label='Average Odometer (km)')

# Label axes
ax1.set_ylabel('Average Price (₹)', color='blue', fontsize=12)
ax2.set_ylabel('Average Odometer (km)', color='green', fontsize=12)
ax1.set_xlabel('Drive Type', fontsize=12)
ax1.set_xticks(x)
ax1.set_xticklabels(drive_types, rotation=45, ha='right')

# Add title and legend
plt.title('Average Price and Odometer by Drive Type', fontsize=14)
fig.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights from the Bar Chart¶

  1. 4wd Vehicles:
    • Average Price: ₹1.25 crore (₹12.5 million).
    • Average Odometer: Around 200,000 km.
  2. Unknown Drive Type:
    • Average Price: ₹0.5 crore (₹5 million).
    • Average Odometer: Around 200,000 km.
  3. fwd Vehicles:
    • Average Price: Very low, close to ₹0 crore.
    • Average Odometer: Around 100,000 km.
  4. rwd Vehicles:
    • Average Price: ₹1 crore (₹10 million).
    • Average Odometer: Slightly above 100,000 km.

Recommendations¶

  1. Budget-Friendly Option: Consider purchasing fwd vehicles as they have the lowest average price. However, further investigation into why they are priced low might be needed to ensure quality and performance.
  2. Lower Mileage Preference: rwd vehicles are a good option if the goal is to purchase a vehicle with a relatively lower odometer reading compared to 4wd and unknown drive types.
  3. High Price-High Value: 4wd vehicles have the highest average price, which may indicate premium features or better performance. Evaluate if the additional cost aligns with your needs and budget.
  4. Investigate Unknown Drive Type: The vehicles with an unknown drive type have a moderate price and high odometer reading. More information about this category might be useful to make informed decisions.

Hypothesis 9:- Regional paint color preferences affect demand, pricing, and marketing strategies.¶

Paint Color Preferences¶

Objective: Study the impact of paint_color on desirability and price.¶

Data Visualization¶

Paint Color Preferences and Average Price¶

In [141]:
# Calculate paint color frequency and average price
paint_frequency = sp['paint_color'].value_counts()
paint_price = sp.groupby('paint_color')['price'].mean()

# Ensure the indices are aligned
paint_frequency = paint_frequency.sort_index()
paint_price = paint_price.sort_index()

# Create the figure and axes
fig, ax1 = plt.subplots(figsize=(14, 6))

# Plot frequency on the first y-axis
ax1.bar(
    paint_frequency.index, 
    paint_frequency.values, 
    color='lightblue', 
    alpha=0.7, 
    label='Frequency'
)
ax1.set_ylabel('Frequency', fontsize=12, color='blue')
ax1.set_xlabel('Paint Color', fontsize=12)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_xticks(range(len(paint_frequency.index)))  # Set tick positions
ax1.set_xticklabels(paint_frequency.index, rotation=45, ha='right')  # Set tick labels

# Create the second y-axis
ax2 = ax1.twinx()
ax2.plot(
    range(len(paint_price.index)), 
    paint_price.values, 
    color='red', 
    marker='o', 
    label='Average Price'
)
ax2.set_ylabel('Average Price (₹)', fontsize=12, color='red')
ax2.tick_params(axis='y', labelcolor='red')

# Add a title
plt.title('Paint Color Preferences and Average Price', fontsize=14)

# Show the legend
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

# Adjust layout for better appearance
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights¶

  1. Most Preferred Paint Color: White is the most preferred paint color with a frequency of around 100,000.
  2. Popular Colors: Black and silver are also highly preferred, with frequencies around 90,000 and 70,000, respectively.
  3. Least Preferred Colors: Custom, orange, purple, and yellow are the least preferred paint colors, with frequencies below 10,000.
  4. Highest Average Price: Green paint commands the highest average price, around ₹7,000,000.
  5. High Average Prices: Black, silver, and white paints have relatively high average prices, around ₹2,000,000 to ₹3,000,000.
  6. Lowest Average Prices: Custom, orange, purple, and yellow paint colors have the lowest average prices, below ₹1,000,000.

Recommendations¶

  1. Stock Popular Colors: Increase inventory for white, black, and silver paint colors to meet high demand and maximize sales.
  2. Market Green Paint: Investigate why green paint commands a premium price and explore strategies to market it more effectively to capitalize on its high value.
  3. Boost Low-Preference Colors: Consider promotional campaigns or discounts for custom, orange, purple, and yellow paints to increase their popularity and sales.
  4. Monitor Price Trends: Continuously monitor price trends and consumer preferences to adjust inventory and marketing strategies accordingly.

Regional Paint Color Preferences¶

In [142]:
regional_paint_preference = sp.groupby(['state', 'paint_color']).size().unstack(fill_value=0)

plt.figure(figsize=(15, 8))
sns.heatmap(
    regional_paint_preference,
    cmap='YlGnBu',
    cbar_kws={'label': 'Number of Listings'},
    linewidths=0.5
)
plt.title('Regional Preferences for Paint Colors', fontsize=16)
plt.xlabel('Paint Color', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()
No description has been provided for this image

Key Insights from the Heatmap¶

  1. White Paint Dominance: Maharashtra and Uttar Pradesh have a high number of listings for white paint, indicating a strong preference.
  2. Blue Paint Preference: Delhi and Haryana show a notable preference for blue paint.
  3. Black Paint Popularity: Karnataka and Punjab have a significant number of listings for black paint.
  4. Silver Paint Preference: Madhya Pradesh shows a clear preference for silver paint.
  5. Overall Trends: The color white is generally the most popular paint color across multiple states.

Recommendations¶

  1. Stock White Paint: Paint manufacturers and retailers should prioritize stocking white paint in Maharashtra and Uttar Pradesh to meet the high demand.
  2. Focus on Blue Paint: Ensure ample supply of blue paint in Delhi and Haryana to cater to the regional preference.
  3. Boost Black Paint Inventory: Increase inventory of black paint in Karnataka and Punjab to match the high number of listings.
  4. Emphasize Silver Paint: Consider promoting and stocking more silver paint in Madhya Pradesh to align with local preferences.
  5. Customized Marketing: Develop region-specific marketing strategies that highlight the most popular colors in each state to boost sales and customer satisfaction.

Hypothesis 10:- Vehicle demand varies regionally, with different models and fuel types appealing to specific states based on local trends.¶

In Demand Analysis¶

Objective: Analyze the factors influencing demand trends across various vehicle attributes, including condition, fuel type, and mileage.¶

In-Demand Car Manufacturer with Models¶

In [143]:
# Group by 'manufacturer' and 'model', summing the 'id' column
model_demand = sp.groupby(['manufacturer', 'model'])['id'].count().reset_index()

# Rename the column for clarity
model_demand.columns = ['Manufacturer', 'Model', 'Total_Demand']

# Sort by 'TotalDemand' in descending order
model_demand = model_demand.sort_values(by='Total_Demand', ascending=False)
model_demand
Out[143]:
Manufacturer Model Total_Demand
13769 ford f-150 8014
10401 chevrolet silverado 1500 5154
25302 ram 1500 4259
27474 toyota camry 3153
10395 chevrolet silverado 3037
... ... ... ...
12032 dodge charger police sxt 1
12031 dodge charger police inspector 1
12029 dodge charger plus rt 1
12027 dodge charger limousine 1
15041 ford f250xlt 1

30083 rows × 3 columns

  • Using .groupby() groups the data by manufacturer and model, then count the number of transactions (id) for each group.
  • Rename the columns for clarity (TotalDemand).
  • Sort the resulting DataFrame by TotalDemand in descending order and display the result.
In [144]:
# Get the top 10 in-demand cars
model_demand_top10 = model_demand.head(10)
model_demand_top10
Out[144]:
Manufacturer Model Total_Demand
13769 ford f-150 8014
10401 chevrolet silverado 1500 5154
25302 ram 1500 4259
27474 toyota camry 3153
10395 chevrolet silverado 3037
17821 honda accord 2983
20419 jeep wrangler 2863
23984 nissan altima 2805
18079 honda civic 2801
13300 ford escape 2780
  • Extract the top 10 rows with the highest TotalDemand from the model_demand DataFrame.
  • Display the top 10 in-demand cars.

Key Attributes Driving Demand¶

In [145]:
# Analyze demand by key attributes
attribute_columns = ['fuel', 'condition', 'paint_color']
for col in attribute_columns:
    print(f"\nDemand by {col.capitalize()}:")
    print(sp[col].value_counts())
Demand by Fuel:
gas         357266
other        30724
diesel       30101
hybrid        5195
electric      1701
Unknown       1691
Name: fuel, dtype: int64

Demand by Condition:
excellent    209179
good         143637
Unknown       36669
like new      27061
fair           7702
new            1781
salvage         649
Name: condition, dtype: int64

Demand by Paint_color:
white      108720
black       91076
silver      57794
blue        40905
red         37135
grey        30740
Unknown     30617
green        8786
brown        7901
custom       7606
yellow       2346
orange       2282
purple        770
Name: paint_color, dtype: int64
  • Loop through key attribute columns (fuel, condition, paint_color), and for each column, count the number of occurrences of each unique value.
  • Display the demand for each attribute.

High-Demand Models by Region¶

In [146]:
regional_model_demand = sp.groupby(['state', 'model'])['id'].count().reset_index()
regional_model_demand.columns = ['State', 'Model', 'Transactions']
top_models = regional_model_demand.sort_values(by=['Transactions'], ascending=[False])
print(top_models)
                State                 Model  Transactions
66843     Maharashtra                 f-150           886
110102  Uttar Pradesh                 f-150           847
20926           Delhi                 f-150           608
84926          Punjab                 f-150           587
70240     Maharashtra        silverado 1500           567
...               ...                   ...           ...
49574       Karnataka     silverado ext cab             1
49573       Karnataka      silverado ex-cab             1
49572       Karnataka   silverado durmax hd             1
49571       Karnataka  silverado duramax hd             1
122744    West Bengal                     ♿             1

[122745 rows x 3 columns]
  • .groupby() group the data by state and model, then count the number of transactions for each combination.
  • Rename the columns to State, Model, and Transactions.
  • Sort the resulting DataFrame by Transactions in descending order to identify the top models.

Popular Car Models and Brands¶

In [147]:
# Identify the most popular manufacturers and models
popular_cars = sp.groupby(['manufacturer', 'model']).agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index().sort_values('total_transactions', ascending=False)

print("\nPopular Car Models and Brands Analysis:")
print(popular_cars.head(10))  # Top 10 popular cars
Popular Car Models and Brands Analysis:
      manufacturer           model  total_transactions     avg_price  \
13769         ford           f-150                8014  2.864032e+06   
10401    chevrolet  silverado 1500                5154  2.886477e+06   
25302          ram            1500                4259  2.913927e+06   
27474       toyota           camry                3153  1.298709e+06   
10395    chevrolet       silverado                3037  2.699027e+06   
17821        honda          accord                2983  1.236092e+06   
20419         jeep        wrangler                2863  6.580196e+07   
23984       nissan          altima                2805  1.216228e+06   
18079        honda           civic                2801  1.208038e+06   
13300         ford          escape                2780  1.504665e+06   

       median_price  
13769     2511100.0  
10401     2568380.0  
25302     2695900.0  
27474     1027320.0  
10395     2041870.0  
17821     1007580.0  
20419     2375770.0  
23984     1136940.0  
18079     1014550.0  
13300     1488650.0  
  • .groupby() group the data by manufacturer and model, then count the number of transactions, average_price and median_price for each combination.
  • Sort the resulting DataFrame by Transactions in descending order to identify the top models.

Analysis: Car Model Demand & Popularity¶

1. Most In-Demand Car Models¶

  • Ford F-150 is the most popular model with 8,014 transactions, significantly ahead of competitors.
  • Chevrolet Silverado 1500 (5,154 transactions) and Ram 1500 (4,259 transactions) are also highly demanded.
  • Toyota Camry (3,153 transactions) and Honda Accord (2,983 transactions) dominate the sedan segment.
  • Jeep Wrangler (2,863 transactions) stands out in the SUV category.

2. Demand by Key Attributes¶

Fuel Type Demand¶

  • Gas-powered vehicles dominate (357,266 transactions), indicating continued reliance on traditional fuel.
  • Diesel (30,101 transactions) and hybrid (5,195 transactions) have niche demand.
  • Electric vehicles (1,701 transactions) have the lowest demand, showing slow adoption.

Condition-Based Demand¶

  • Most buyers prefer vehicles in ‘Excellent’ (209,179) or ‘Good’ (143,637) condition.
  • Only 1,781 transactions for ‘New’ vehicles suggest a preference for pre-owned cars.
  • Salvage vehicles have the lowest demand (649 transactions).

Paint Color Preference¶

  • White (108,720) and Black (91,076) are the most popular colors, followed by Silver (57,794).
  • Custom colors like Purple (770) and Yellow (2,346) have very low demand.

3. Regional Demand for Top Models¶

  • Maharashtra has the highest demand for Ford F-150 (886 transactions).
  • Uttar Pradesh (847), Delhi (608), and Punjab (587) also show strong demand for this model.
  • Chevrolet Silverado 1500 sees demand concentrated in Maharashtra (567 transactions).

Key Insights¶

  1. Ford F-150, Chevrolet Silverado, and Ram 1500 dominate the market.

    • These trucks have higher transaction volumes, indicating strong consumer preference.
  2. Gasoline vehicles remain dominant, but hybrid & electric adoption is slow.

    • Despite the EV push, traditional fuel types continue to lead demand.
  3. Used vehicles in excellent and good condition drive sales.

    • Consumers prefer high-quality pre-owned vehicles over new or salvage listings.
  4. White, Black, and Silver are the most preferred colors.

    • Custom or niche colors (Purple, Yellow, etc.) see limited demand.
  5. Regional demand varies, with Maharashtra leading for top models.

    • Understanding local preferences can help optimize inventory distribution.

Recommendations¶

1. Focus on Top-Selling Models for Inventory Optimization¶

  • Stock Ford F-150, Chevrolet Silverado, and Ram 1500 in high-demand regions like Maharashtra and Uttar Pradesh.
  • Ensure a steady supply of Toyota Camry, Honda Accord, and Jeep Wrangler to meet sedan and SUV demand.

2. Expand Hybrid & Electric Vehicle Promotions¶

  • Offer financing incentives, lower interest rates, or EV subsidies to drive adoption.
  • Promote hybrid & electric models in eco-conscious cities with better charging infrastructure.

3. Prioritize High-Quality Used Cars Over New or Salvage Vehicles¶

  • Given the low demand for new cars, focus on certified pre-owned (CPO) vehicles with warranty benefits.
  • Salvage cars have very limited demand; avoid overstocking them.

4. Stock Popular Color Choices & Offer Customization¶

  • Prioritize White, Black, and Silver models, as they sell the most.
  • Introduce custom paint options or wraps for buyers seeking personalization.

5. Use Regional Insights for Targeted Marketing¶

  • Maharashtra & Uttar Pradesh should be priority states for Ford F-150 and Chevrolet Silverado.
  • Target digital marketing campaigns based on regional demand trends.

Hypothesis 11:- Lower-mileage vehicles drive higher demand, while higher-mileage ones attract niche buyers at premium prices.¶

Odometer Trends & Analysis¶

Objective:- Study the impact of odometer readings on vehicle pricing trends and demand.¶

Impact on Odometer on Pricing¶

In [148]:
# Group transactions by odometer ranges
bins = [0, 20000, 50000, 100000, 150000, 200000, 300000,400000,600000,800000,1000000,10000000]
labels = ['0-20K', '20K-50K', '50K-100K', '100K-150K', '150K-200K', '200K-300K', '300K-400K', '400K-600K', '600K-800K', '800K-1000K','1000K+']
sp['mileage_range'] = pd.cut(sp['odometer'], bins=bins, labels=labels)

mileage_analysis = sp.groupby('mileage_range').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index()

print("\nMileage Range Analysis (Demand and Pricing):")
print(mileage_analysis)
Mileage Range Analysis (Demand and Pricing):
   mileage_range  total_transactions     avg_price  median_price
0          0-20K               53687  1.381985e+07     3788400.0
1        20K-50K               79690  3.421764e+06     3209560.0
2       50K-100K              116017  9.843183e+06     1990460.0
3      100K-150K               99096  5.119617e+06     1284190.0
4      150K-200K               53065  1.341860e+07      897960.0
5      200K-300K               20137  2.501741e+07      744910.0
6      300K-400K                1327  1.339640e+07      960200.0
7      400K-600K                 449  3.057920e+06     1990460.0
8      600K-800K                 229  2.832558e+06     2140660.0
9     800K-1000K                 445  2.875703e+08     1926290.0
10        1000K+                 571  2.518616e+08     1027320.0
  • sp['mileage_range'] create mileage ranges using the odometer values, then assign corresponding labels to each range.
  • Group the data by mileage_range and calculate the total number of transactions, average price, and median price for each mileage range.
  • Display the mileage range analysis (demand and pricing).

Impact of Yearly Mileage on Pricing¶

In [149]:
sp['mileage_bins'] = pd.cut(sp['avg_yearly_mileage'], 
                            bins=[0, 5000, 10000, 20000, 30000,40000,50000,100000,10000000], 
                            labels=['0-5K','5K-10K','10K-20K','20K-30K','30K-40K','40K-50K','50K-100K','100K+'])

mileage_demand = sp.groupby('mileage_bins').agg(
    total_transactions=('id','count'),
    avg_price=('price', 'mean'), 
    median_price=('price', 'median')
).reset_index()

print("Demand by Average Yearly Mileage:")
print(mileage_demand)
Demand by Average Yearly Mileage:
  mileage_bins  total_transactions     avg_price  median_price
0         0-5K               62413  2.788424e+07     2616850.0
1       5K-10K              126558  7.393663e+06     1990460.0
2      10K-20K              190936  6.344824e+06     1871940.0
3      20K-30K               34091  2.800212e+06     2182150.0
4      30K-40K                6774  3.418389e+06     3081120.0
5      40K-50K                2011  8.346382e+06     2889430.0
6     50K-100K                1307  1.545943e+07     2311510.0
7        100K+                 623  2.058125e+08     1990460.0
  • Creates mileage bins based on avg_yearly_mileage and assign corresponding labels.
  • Group the data by mileage_bins and calculate the total number of transactions, average price, and median price for each mileage bin.
  • Display the demand by average yearly mileage.

Analysis: Vehicle Demand & Pricing by Mileage¶

1. Mileage Range Analysis¶

  • Vehicles with lower mileage (0-20K) have high demand (53,687 transactions) and the highest average price (₹13.8M).
  • The demand peaks in the 50K-100K range (116,017 transactions), but the average price drops to ₹9.8M.
  • Older, high-mileage vehicles (100K+ miles) have significantly lower median prices, indicating depreciation.
  • Extreme mileage vehicles (800K+ miles) have very low transaction volumes.

2. Average Yearly Mileage Demand¶

  • The 10K-20K range has the highest demand (190,936 transactions), showing a preference for moderately used cars.
  • Cars with less than 5K yearly mileage have high value (₹27.8M avg. price), likely due to better condition.
  • Vehicles exceeding 100K yearly mileage are rare (623 transactions) but still have a surprisingly high avg. price (₹205.8M), possibly due to commercial or specialty use.

Key Insights¶

  1. Moderate mileage (10K-100K miles) has the highest demand.

    • Buyers prioritize cars with reasonable wear but not excessively high mileage.
  2. Price drops sharply for high-mileage vehicles.

    • Vehicles with over 150K miles see a steep decline in value.
  3. Extremely high mileage (800K+) vehicles have low demand but unpredictable pricing.

    • Some outliers have unusually high prices, possibly due to specialized vehicle types.
  4. Low yearly mileage (0-5K per year) retains the most value.

    • These cars attract premium pricing due to perceived better maintenance.

Recommendations¶

1. Prioritize Stocking Vehicles in the 10K-100K Mileage Range¶

  • These have the highest transaction volume and stable pricing.
  • Avoid very high-mileage cars (200K+) unless they serve niche markets.

2. Market Low-Mileage Vehicles as Premium Offerings¶

  • Highlight "low mileage, like new" features to justify higher pricing.
  • Offer extended warranties or certification for added trust.

3. Be Cautious with High-Mileage Vehicles Over 150K Miles¶

  • If selling these, focus on budget-conscious buyers or commercial fleets.
  • Price aggressively to move inventory faster.

4. Use Mileage Data to Guide Trade-In Valuations¶

  • Offer better trade-in values for vehicles under 100K miles to attract sellers.
  • Use historical price trends to ensure competitive but profitable resale prices.