Graduation Project -- Spinny Analysis¶
Business Problem Overview¶
Driving Revenue Growth Through Demand and Pricing Optimization for Spinny
Spinny, a trailblazer in India’s rapidly growing pre-owned car market, connects buyers and sellers with a seamless, customer-focused platform. With a mission to simplify car ownership, Spinny offers a curated selection of pre-owned vehicles, guaranteeing quality, transparency, and trust.
However, achieving sustainable growth in this competitive market requires a strategic approach to revenue generation. Success hinges on two critical factors: understanding regional demand and optimizing car pricing.
India's diverse regions exhibit varying preferences and purchasing behaviors for pre-owned cars. Factors such as local demand, vehicle condition, fuel type, and mileage significantly influence transaction volumes. Misaligned pricing strategies can lead to stagnant inventory, missed revenue opportunities, and diminished competitiveness.
For Spinny, unlocking higher revenue entails identifying the most sought-after cars and their defining attributes across regions. Additionally, optimizing pricing based on vehicle features and real-time market trends is essential to attract buyers while maximizing profitability.
By analyzing transaction data, uncovering patterns in demand, and fine-tuning pricing strategies, Spinny can not only increase completed transactions but also solidify its position as the go-to platform for pre-owned car buyers in India.
Objective¶
In this project, we aim to analyze Duolingo user activity data to:
- Understand regional demand: Analyze pre-owned car transaction trends across various regions to identify areas with high and low demand.
- Identify key car preferences: Determine the most in-demand car models and their defining attributes, such as condition, fuel type, mileage, and other features driving buyer interest.
- Optimize pricing strategies: Develop data-driven insights to establish competitive and market-aligned pricing for each car, ensuring higher transaction success rates and improved profitability.
Business Impact¶
This analysis will enable Spinny to:
- Boost transaction volumes: By understanding regional demand and buyer preferences, Spinny can strategically align its inventory with market needs, driving higher sales.
- Enhance customer satisfaction: Identifying key car attributes and optimizing pricing ensures buyers find vehicles that meet their expectations at competitive prices, fostering trust and loyalty.
- Maximize profitability: Data-driven pricing strategies will help Spinny achieve a balance between competitiveness and revenue growth, reducing stagnant inventory and increasing margins.
- Strengthen market position: Leveraging insights to cater to diverse regional preferences and optimize pricing will solidify Spinny’s reputation as a customer-first, data-savvy leader in the pre-owned car market.
Dataset Overview¶
Dataset Overview¶
- Dataset Name: Spinny Analytics Dataset
- Number of Rows: 4,26,880
- Number of Columns: 21
- Description: The dataset captures language learning session details, including user behavior, engagement metrics, and learning progress. It records recall probability, session performance, and lexeme interactions, enabling analysis of learning outcomes and activity patterns over days.
Column Definitions¶
- id (TransacThe year of manufacture of the vehicle, indicating its age.tion ID): A unique identifier for each car transaction in the dataset.
- Price: The final selling price of the vehicle in Indian Rupees (INR).
- Year: The time (in seconds) since the last lesson or practice where this specific word (lexeme) was encountered.
- Manufacturer: The brand or company that produced the vehicle.
- Model: The specific model name of the vehicle produced by the manufacturer.
- Condition: The physical and operational state of the vehicle at the time of sale.
- Cylinders: The number of cylinders in the car's engine, influencing its power and performance.
- Fuel: The type of fuel the vehicle uses, such as petrol, diesel, CNG, or electric.
- Odometer: The total distance the vehicle has traveled, measured in kilometers.
- Title_status: The condition of the car's , indicating if it is clean, or has missing parts or has liens.
- VIN (Vehicle Identification Number): A unique serial number assigned to each vehicle for identification.
- Drive: The drivetrain configuration of the vehicle (e.g., front-wheel drive, rear-wheel drive).
- Size: The size classification of the vehicle, such as compact, midsize, or full-size.
- Type: The type of vehicle, such as sedan, SUV, hatchback, or truck.
- Paint_color: The exterior color of the vehicle.
- Image_url: A link to an image of the vehicle for reference.
- State: The Indian state where the car transaction took place.
- Posting_date: The date when the car was listed for sale.
- Latitude: The geographical latitude coordinate of the transaction location.
- Longitude: The geographical longitude coordinate of the transaction location.
Analysis & Visualization¶
1. Importing and Cleaning Data¶
Importing Necessary Libraries¶
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical computations
import matplotlib.pyplot as plt # For plotting and visualization
import seaborn as sns # For advanced visualizations
Loading the Dataset from google drive¶
sp = pd.read_csv("final_dataset.csv") # Loading the Data
sp
id | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | ... | VIN | drive | size | type | paint_color | image_url | state | posting_date | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7222695916 | 770530.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Haryana | NaN | 28.8446 | 75.1167 |
1 | 7218891961 | 1528210.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Haryana | NaN | 27.9026 | 77.0382 |
2 | 7221797935 | 2696820.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | West Bengal | NaN | 24.6838 | 85.9695 |
3 | 7222270760 | 192610.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Andhra Pradesh | NaN | 15.5158 | 81.9342 |
4 | 7210384030 | 629240.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | West Bengal | NaN | 23.7821 | 87.4861 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
426875 | 7301591192 | 3029710.0 | 2019.0 | nissan | maxima s sedan 4d | good | 6 cylinders | gas | 32226.0 | clean | ... | 1N4AA6AV6KC367801 | fwd | NaN | sedan | NaN | https://images.craigslist.org/00o0o_iiraFnHg8q... | Delhi | 2021-04-04 09:21:31+00:00 | 28.7876 | 76.8644 |
426876 | 7301591187 | 3928680.0 | 2020.0 | volvo | s60 t5 momentum sedan 4d | good | NaN | gas | 12029.0 | clean | ... | 7JR102FKXLG042696 | fwd | NaN | sedan | red | https://images.craigslist.org/00x0x_15sbgnxCIS... | Punjab | 2021-04-04 09:21:29+00:00 | 32.0715 | 75.1915 |
426877 | 7301591147 | 4493750.0 | 2020.0 | cadillac | xt4 sport suv 4d | good | NaN | diesel | 4174.0 | clean | ... | 1GYFZFR46LF088296 | NaN | NaN | hatchback | white | https://images.craigslist.org/00L0L_farM7bxnxR... | Madhya Pradesh | 2021-04-04 09:21:17+00:00 | 21.5646 | 75.7247 |
426878 | 7301591140 | 3723220.0 | 2018.0 | lexus | es 350 sedan 4d | good | 6 cylinders | gas | 30112.0 | clean | ... | 58ABK1GG4JU103853 | fwd | NaN | sedan | silver | https://images.craigslist.org/00z0z_bKnIVGLkDT... | Maharashtra | 2021-04-04 09:21:11+00:00 | 21.4183 | 75.4719 |
426879 | 7301591129 | 3928680.0 | 2019.0 | bmw | 4 series 430i gran coupe | good | NaN | gas | 22716.0 | clean | ... | WBA4J1C58KBM14708 | rwd | NaN | coupe | NaN | https://images.craigslist.org/00Y0Y_lEUocjyRxa... | Jharkhand | 2021-04-04 09:21:07+00:00 | 24.2616 | 83.7688 |
426880 rows × 21 columns
Displaying Dataset Information¶
print("Dataset Information:")
sp.info()
Dataset Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 426880 entries, 0 to 426879 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 426880 non-null int64 1 price 426880 non-null float64 2 year 425675 non-null float64 3 manufacturer 409234 non-null object 4 model 421603 non-null object 5 condition 252776 non-null object 6 cylinders 249202 non-null object 7 fuel 423867 non-null object 8 odometer 422480 non-null float64 9 title_status 418638 non-null object 10 transmission 424324 non-null object 11 VIN 265838 non-null object 12 drive 296313 non-null object 13 size 120519 non-null object 14 type 334022 non-null object 15 paint_color 296677 non-null object 16 image_url 426812 non-null object 17 state 426880 non-null object 18 posting_date 426812 non-null object 19 latitude 426880 non-null float64 20 longitude 426880 non-null float64 dtypes: float64(5), int64(1), object(15) memory usage: 68.4+ MB
sp.info()
gives concise information about the columns and its data types with their count.
Displaying Column Names¶
sp.columns
Index(['id', 'price', 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url', 'state', 'posting_date', 'latitude', 'longitude'], dtype='object')
- Displays the Columns in the DataFrame
Describing Dataset Information¶
sp.describe()
id | price | year | odometer | latitude | longitude | |
---|---|---|---|---|---|---|
count | 4.268800e+05 | 4.268800e+05 | 425675.000000 | 4.224800e+05 | 426880.000000 | 426880.000000 |
mean | 7.311487e+09 | 9.657028e+06 | 2011.235191 | 9.804333e+04 | 22.800888 | 79.012914 |
std | 4.473170e+06 | 1.564439e+09 | 9.452120 | 2.138815e+05 | 6.532617 | 5.208436 |
min | 7.207408e+09 | 0.000000e+00 | 1900.000000 | 0.000000e+00 | 6.800300 | 68.400200 |
25% | 7.308143e+09 | 7.576800e+05 | 2008.000000 | 3.770400e+04 | 17.796000 | 75.719100 |
50% | 7.312621e+09 | 1.791970e+06 | 2013.000000 | 8.554800e+04 | 24.058650 | 77.353900 |
75% | 7.315254e+09 | 3.401080e+06 | 2017.000000 | 1.335425e+05 | 28.563525 | 80.927325 |
max | 7.317101e+09 | 4.798934e+11 | 2022.000000 | 1.000000e+07 | 35.997600 | 97.398000 |
sp.describe()
function describes the numerical columns from the DataFrame.- It Displays the count, mean, std, min, 25%, 50%, 75%, max which identifies potential outliers or data issues.
- It Helps to understand the Numerical Data Columns.
Displaying Column Data Types¶
print("The Data Types of all Columns:")
sp.dtypes
The Data Types of all Columns:
id int64 price float64 year float64 manufacturer object model object condition object cylinders object fuel object odometer float64 title_status object transmission object VIN object drive object size object type object paint_color object image_url object state object posting_date object latitude float64 longitude float64 dtype: object
- Using
sp.dtypes
, the data types for each column has been Displayed.
Checking the Shape of the Dataset¶
rows, columns = sp.shape
print(f"The dataset contains {rows} rows and {columns} columns.")
The dataset contains 426880 rows and 21 columns.
sp.shape
has been used for checking the size of the dataset
Checking the unique values in the Dataset¶
for column in sp.columns.tolist():
print(f"No. of unique values in {column}:")
print(sp[column].nunique())
No. of unique values in id: 426880 No. of unique values in price: 6253 No. of unique values in year: 114 No. of unique values in manufacturer: 42 No. of unique values in model: 29667 No. of unique values in condition: 6 No. of unique values in cylinders: 8 No. of unique values in fuel: 5 No. of unique values in odometer: 104870 No. of unique values in title_status: 6 No. of unique values in transmission: 3 No. of unique values in VIN: 118264 No. of unique values in drive: 3 No. of unique values in size: 4 No. of unique values in type: 13 No. of unique values in paint_color: 12 No. of unique values in image_url: 241899 No. of unique values in state: 35 No. of unique values in posting_date: 333355 No. of unique values in latitude: 191354 No. of unique values in longitude: 162536
sp.columns.tolist()
converts that list of column names into a Python list which loops and iterates over all the columns in the DataFrame dl.sp[column].nunique()
returns the number of unique (distinct) values in the column.
Checking for the Value Counts in the Dataset¶
value_counts_dict = {}
for column in sp.columns:
value_counts_dict[column] = sp[column].value_counts()
print(value_counts_dict[column])
7222695916 1 7313139418 1 7313423023 1 7313423324 1 7313424533 1 .. 7314655506 1 7314655833 1 7314657468 1 7314659947 1 7301591129 1 Name: id, Length: 426880, dtype: int64 0.0 34868 3851650.0 3683 897960.0 3623 1026400.0 3585 1154830.0 3512 ... 13329960.0 1 1819610.0 1 425710.0 1 2222640.0 1 611520.0 1 Name: price, Length: 6253, dtype: int64 2017.0 36420 2018.0 36369 2015.0 31538 2013.0 30794 2016.0 30434 ... 1943.0 1 1915.0 1 1902.0 1 1905.0 1 1909.0 1 Name: year, Length: 114, dtype: int64 ford 70985 chevrolet 55064 toyota 34202 honda 21269 nissan 19067 jeep 19014 ram 18342 gmc 16785 bmw 14699 dodge 13707 mercedes-benz 11817 hyundai 10338 subaru 9495 volkswagen 9345 kia 8457 lexus 8200 audi 7573 cadillac 6953 chrysler 6031 acura 5978 buick 5501 mazda 5427 infiniti 4802 lincoln 4220 volvo 3374 mitsubishi 3292 mini 2376 pontiac 2288 rover 2113 jaguar 1946 porsche 1384 mercury 1184 saturn 1090 alfa-romeo 897 tesla 868 fiat 792 harley-davidson 153 ferrari 95 datsun 63 aston-martin 24 land rover 21 morgan 3 Name: manufacturer, dtype: int64 f-150 8009 silverado 1500 5140 1500 4211 camry 3135 silverado 3023 ... Huyndai Sante Fe Limited 1 astro awd 4x4 1 escalade and 1 cx 3 1 Paige Glenbrook Touring 1 Name: model, Length: 29667, dtype: int64 good 121456 excellent 101467 like new 21178 fair 6769 new 1305 salvage 601 Name: condition, dtype: int64 6 cylinders 94169 4 cylinders 77642 8 cylinders 72062 5 cylinders 1712 10 cylinders 1455 other 1298 3 cylinders 655 12 cylinders 209 Name: cylinders, dtype: int64 gas 356209 other 30728 diesel 30062 hybrid 5170 electric 1698 Name: fuel, dtype: int64 100000.0 2263 1.0 2246 0.0 1965 200000.0 1728 150000.0 1603 ... 149468.0 1 154259.0 1 111887.0 1 213826.0 1 26892.0 1 Name: odometer, Length: 104870, dtype: int64 clean 405117 rebuilt 7219 salvage 3868 lien 1422 missing 814 parts only 198 Name: title_status, dtype: int64 automatic 336524 other 62682 manual 25118 Name: transmission, dtype: int64 1FMJU1JT1HEA52352 261 3C6JR6DT3KG560649 235 1FTER1EH1LLA36301 231 5TFTX4CN3EX042751 227 1GCHTCE37G1186784 214 ... 2G4GP5EX5E9213578 1 1GCEK19J78Z219711 1 JA4AT3AW1AZ006543 1 4T1BF28B61U153724 1 SAJGX2749VCOO8376 1 Name: VIN, Length: 118264, dtype: int64 4wd 131904 fwd 105517 rwd 58892 Name: drive, dtype: int64 full-size 63465 mid-size 34476 compact 19384 sub-compact 3194 Name: size, dtype: int64 sedan 87056 SUV 77284 pickup 43510 truck 35279 other 22110 coupe 19204 hatchback 16598 wagon 10751 van 8548 convertible 7731 mini-van 4825 offroad 609 bus 517 Name: type, dtype: int64 white 79285 black 62861 silver 42970 blue 31223 red 30473 grey 24416 green 7343 custom 6700 brown 6593 yellow 2142 orange 1984 purple 687 Name: paint_color, dtype: int64 https://images.craigslist.org/00N0N_1xMPvfxRAIdz_0gw0co_600x450.jpg 7357 https://images.craigslist.org/00R0R_lwWjXSEWNa7z_0x20oM_600x450.jpg 2192 https://images.craigslist.org/01111_l590CdXzJSoz_001001_600x450.jpg 276 https://images.craigslist.org/01313_9Zne268YnWZz_0gw0co_600x450.jpg 231 https://images.craigslist.org/00X0X_1cMDo5xqSuAz_0gw0co_600x450.jpg 227 ... https://images.craigslist.org/00a0a_gbpYPGPfZEwz_0lM0t2_600x450.jpg 1 https://images.craigslist.org/00e0e_g9RfXGmsjDaz_0CI0hq_600x450.jpg 1 https://images.craigslist.org/00Y0Y_4fRgRCeuUj2z_0CI0t2_600x450.jpg 1 https://images.craigslist.org/00K0K_fUdf8IWJOlmz_0t20CI_600x450.jpg 1 https://images.craigslist.org/00C0C_3D2bgxR1cysz_0ak07K_600x450.jpg 1 Name: image_url, Length: 241899, dtype: int64 Uttar Pradesh 45573 Maharashtra 45263 Karnataka 30492 Punjab 30455 Delhi 30379 Haryana 30175 Tamil Nadu 18124 West Bengal 18052 Telangana 15356 Kerala 15304 Madhya Pradesh 15205 Rajasthan 15131 Andhra Pradesh 15070 Gujarat 14992 Odisha 12189 Jharkhand 6080 Bihar 6034 Chhattisgarh 6003 Assam 5987 Uttarakhand 5951 Andaman and Nicobar Islands 3079 Sikkim 3065 Ladakh 3062 Puducherry 3046 Jammu and Kashmir 3030 Meghalaya 3017 Himachal Pradesh 3016 Mizoram 3010 Manipur 2990 Chandigarh 2986 Arunachal Pradesh 2980 Dadra and Nagar Haveli and Daman and Diu 2980 Tripura 2961 Goa 2923 Lakshadweep 2920 Name: state, dtype: int64 2021-04-24 02:13:05+00:00 13 2021-04-30 14:01:13+00:00 12 2021-05-04 15:00:42+00:00 12 2021-05-03 13:41:54+00:00 11 2021-04-29 20:25:19+00:00 11 .. 2021-04-21 00:52:45+00:00 1 2021-04-20 20:41:24+00:00 1 2021-04-20 19:24:56+00:00 1 2021-04-20 19:20:54+00:00 1 2021-04-04 09:21:29+00:00 1 Name: posting_date, Length: 333355, dtype: int64 28.8329 20 28.8703 20 28.7504 19 28.6300 19 28.5907 19 .. 19.5164 1 11.4463 1 27.3585 1 21.9371 1 17.0875 1 Name: latitude, Length: 191354, dtype: int64 77.2591 24 77.1361 24 77.2894 23 77.2735 23 76.9070 23 .. 80.3373 1 68.9184 1 84.2857 1 89.4721 1 83.7688 1 Name: longitude, Length: 162536, dtype: int64
value_counts_dict = {}
initializes an empty dictionary.sp[column].value_counts()
returns the counts of unique values in the column.
2. Data Preparation¶
Checking for Missing/Null Values¶
missing_value_count = sp.isnull().sum()
print("Missing Values in Each Column:")
missing_value_count
Missing Values in Each Column:
id 0 price 0 year 1205 manufacturer 17646 model 5277 condition 174104 cylinders 177678 fuel 3013 odometer 4400 title_status 8242 transmission 2556 VIN 161042 drive 130567 size 306361 type 92858 paint_color 130203 image_url 68 state 0 posting_date 68 latitude 0 longitude 0 dtype: int64
sp.isnull()
represents whether the data contains null or not.sp.isnull().sum()
Provides the count of missing values for each column in the DataFrame sp.
Checking for Duplicate Values in the Dataset¶
duplicates = sp[sp.duplicated()]
duplicate_count = len(duplicates)
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")
Number of Duplicate Rows in the Dataset: 0
- The
sp.duplicated()
function is used to identify duplicate rows in a DataFrame sp.
Checking Data Column types for Well Structured Analysis.¶
sp.dtypes
id int64 price float64 year float64 manufacturer object model object condition object cylinders object fuel object odometer float64 title_status object transmission object VIN object drive object size object type object paint_color object image_url object state object posting_date object latitude float64 longitude float64 dtype: object
- For the further analysis, the data types for all the column should be appropriate for a better directed analysis.
- Here its noticed that data type of
posting_date
is not appropriate as its states 'object' but it has to be in 'datetime'.
Changing Datatype of Inappropriate Columns.¶
pd.options.mode.chained_assignment = None
sp["posting_date"]= pd.to_datetime(sp["posting_date"], format = '%Y-%m-%d %H:%M:%S')
- Pandas might issue a warning. Setting
pd.options.mode.chained_assignment = None
stops that warning from appearing. pd.to_datetime()
function that converts columnsp["posting_date"]
to datetime format.
# Convert 2022.0 to 2022 while keeping NaN values
sp['year'] = pd.to_numeric(sp['year'], errors='coerce').astype('Int64')
sp['year']
0 <NA> 1 <NA> 2 <NA> 3 <NA> 4 <NA> ... 426875 2019 426876 2020 426877 2020 426878 2018 426879 2019 Name: year, Length: 426880, dtype: Int64
- The method
pd.to_numeric()
attempts to convert the values in the year column to numeric format.
sp['price'] = pd.to_numeric(sp['price'], errors='coerce')
- The method
pd.to_numeric()
attempts to convert the values in the price column to numeric format.
sp.dtypes
id int64 price float64 year Int64 manufacturer object model object condition object cylinders object fuel object odometer float64 title_status object transmission object VIN object drive object size object type object paint_color object image_url object state object posting_date datetime64[ns, UTC] latitude float64 longitude float64 dtype: object
.dtypes
retrieves the data types of all columns and ensure that all the column has the appropriate data type.
Segregating Numerical Columns for Devicing Outliers.¶
# Select columns with numeric types (either float64 or int64)
numerical_columns = sp.select_dtypes(include=['float64']).columns.tolist()
print(numerical_columns)
['price', 'odometer', 'latitude', 'longitude']
sp.select_dtypes(include=['float64', 'int64'])
filters the DataFrame sp to select only the columns that are of type float64 or int64 which is numerical column..columns.tolist()
converts the filtered columns into a list.
Devicing outliers from the DataFrame¶
outliers = {}
- Initializing a Dictionary using
outliers = {}
.
for column in numerical_columns:
Q1 = sp[column].quantile(0.10)
Q3 = sp[column].quantile(0.90)
IQR = Q3-Q1
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR
outliers[column] = sp[column][(sp[column]<lower_bound) | (sp[column]>upper_bound)]
print(outliers[column])
280 1.268338e+11 346 1.270072e+07 461 1.283604e+07 732 1.284192e+07 1727 1.669450e+07 ... 420535 1.284100e+07 420693 1.605240e+07 424281 3.846158e+07 424840 1.258505e+07 426302 1.605240e+07 Name: price, Length: 874, dtype: float64 280 999999.0 455 9999999.0 562 2222222.0 615 1111111.0 616 1111111.0 ... 424388 468000.0 424840 550000.0 424962 447247.0 426106 9999999.0 426108 999999.0 Name: odometer, Length: 1619, dtype: float64 Series([], Name: latitude, dtype: float64) Series([], Name: longitude, dtype: float64)
- The code loops through all numeric columns (
numerical_columns
) in the DataFrame sp. - For each column, it calculates the 10th and 90th percentiles to determine the Inter Quartile Range.
lower_bound = Q1 - 1.5*IQR
,upper_bound = Q3 + 1.5*IQR
calculates the lower and upper bounds.Outliers[column] = sp[column][(sp[column]<lower_bound) | (sp[column]>upper_bound)]
identifies the outliers which is any data points outside this range.
Handling the Outliers.¶
for col, data in outliers.items():
print(f"{col}: {len(data)}")
price: 874 odometer: 1619 latitude: 0 longitude: 0
- Using the looping through conditions through
outliers.items()
. len(data)
employs the length of the outliers.
Data Distribution in Outliers.¶
for columns in numerical_columns:
plt.figure(figsize=(10, 6))
sns.boxplot(data=sp[columns])
plt.title('Boxplot for Numerical Columns')
plt.xlabel(columns)
plt.ylabel('Values')
plt.xticks(rotation=45)
plt.show()
- The above code is generating separate boxplots for each of the numerical columns in the numerical_columns list as it allows visually inspect the distribution and potential outliers in each column.
sp['posting_year'] = sp['posting_date'].dt.year
sp['posting_year'] = pd.to_numeric(sp['posting_year'], errors='coerce').astype('Int64')
sp['posting_year']
0 <NA> 1 <NA> 2 <NA> 3 <NA> 4 <NA> ... 426875 2021 426876 2021 426877 2021 426878 2021 426879 2021 Name: posting_year, Length: 426880, dtype: Int64
- The
.dt.year
attribute extracts the year from the datetime columnposting_date
. pd.to_numeric()
Converts values in theposting_year
column to numeric format.
Create posting_month
column¶
sp['posting_month'] = sp['posting_date'].dt.month
sp['posting_month'] = pd.to_numeric(sp['posting_month'], errors='coerce').astype('Int64')
sp['posting_month']
0 <NA> 1 <NA> 2 <NA> 3 <NA> 4 <NA> ... 426875 4 426876 4 426877 4 426878 4 426879 4 Name: posting_month, Length: 426880, dtype: Int64
- The
.dt.month
attribute extracts the month from the datetime columnposting_date
. pd.to_numeric()
Converts values in theposting_month
column to numeric format.
Create posting_weekday
column¶
sp['posting_weekday'] = sp['posting_date'].dt.day_name()
sp['posting_weekday']
0 NaN 1 NaN 2 NaN 3 NaN 4 NaN ... 426875 Sunday 426876 Sunday 426877 Sunday 426878 Sunday 426879 Sunday Name: posting_weekday, Length: 426880, dtype: object
- The
.dt.day_name()
attribute extracts the weekday from the datetime columnposting_date
. pd.to_numeric()
Converts values in theposting_weekday
column to numeric format.
Ensuring Data Integrity¶
invalid_data = sp[sp['year'] > sp['posting_year']]
invalid_data
id | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | ... | type | paint_color | image_url | state | posting_date | latitude | longitude | posting_year | posting_month | posting_weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9738 | 7303397975 | 770530.0 | 2022 | NaN | Suzuki Vitara | NaN | NaN | gas | 115000.0 | clean | ... | NaN | NaN | https://images.craigslist.org/00n0n_9cYtTzxWFZ... | Uttar Pradesh | 2021-04-07 22:56:59+00:00 | 28.1913 | 84.3726 | 2021 | 4 | Wednesday |
32148 | 7314343550 | 0.0 | 2022 | NaN | Freightliner SportChassis RHA112 | NaN | NaN | diesel | 10.0 | clean | ... | pickup | black | https://images.craigslist.org/00202_j8VlmOCW5g... | Gujarat | 2021-04-29 15:54:44+00:00 | 21.1472 | 70.9694 | 2021 | 4 | Thursday |
43183 | 7314165992 | 1540060.0 | 2022 | toyota | 4runner 4wd | NaN | NaN | gas | 207834.0 | clean | ... | NaN | custom | https://images.craigslist.org/00404_8uTHOfbLml... | Uttar Pradesh | 2021-04-29 02:16:46+00:00 | 25.3863 | 82.3836 | 2021 | 4 | Thursday |
65611 | 7309051490 | 321050.0 | 2022 | honda | civic cvcc | fair | 4 cylinders | gas | 605000.0 | clean | ... | NaN | NaN | https://images.craigslist.org/01414_7ghFSAOlNo... | Maharashtra | 2021-04-19 01:43:58+00:00 | 19.7144 | 75.0804 | 2021 | 4 | Monday |
65612 | 7309047664 | 449480.0 | 2022 | toyota | mighty max | fair | 4 cylinders | gas | 180000.0 | clean | ... | truck | white | https://images.craigslist.org/00m0m_5XzKF5TFiE... | Delhi | 2021-04-19 01:31:30+00:00 | 28.6941 | 76.8817 | 2021 | 4 | Monday |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
410935 | 7315558994 | 49390.0 | 2022 | mitsubishi | eclipse cross le | excellent | 4 cylinders | gas | 5.0 | NaN | ... | SUV | NaN | https://images.craigslist.org/00C0C_mwIIiP2iUq... | Puducherry | 2021-05-01 23:06:10+00:00 | 12.2829 | 79.8404 | 2021 | 5 | Saturday |
410936 | 7315551108 | 52330.0 | 2022 | mitsubishi | eclipse cross se | excellent | 4 cylinders | gas | 3.0 | NaN | ... | SUV | NaN | https://images.craigslist.org/00z0z_6xrjH3eEy3... | Chandigarh | 2021-05-01 22:46:56+00:00 | 30.7712 | 76.7766 | 2021 | 5 | Saturday |
412094 | 7301823858 | 55270.0 | 2022 | mitsubishi | eclipse cross sel | excellent | 4 cylinders | gas | 5.0 | NaN | ... | SUV | NaN | https://images.craigslist.org/00v0v_c8464eYsPz... | Maharashtra | 2021-04-04 22:02:54+00:00 | 17.3951 | 77.1283 | 2021 | 4 | Sunday |
413805 | 7301823862 | 55270.0 | 2022 | mitsubishi | eclipse cross sel | excellent | 4 cylinders | gas | 5.0 | NaN | ... | SUV | NaN | https://images.craigslist.org/00v0v_c8464eYsPz... | Meghalaya | 2021-04-04 22:02:54+00:00 | 25.9323 | 91.2154 | 2021 | 4 | Sunday |
423091 | 7314368209 | 31670.0 | 2022 | NaN | trailer | fair | NaN | other | 1111111.0 | clean | ... | NaN | NaN | https://images.craigslist.org/00Y0Y_3AeJN6UYkh... | Goa | 2021-04-29 16:34:59+00:00 | 15.5037 | 73.8108 | 2021 | 4 | Thursday |
133 rows × 24 columns
- This line creates a new DataFrame, invalid_data, by filtering rows in sp where the value in the
year
column is greater than the value in theposting_year
column. - Here it is noticed that there are 133 entries which are invalid.
Dropping Invalid Data¶
sp = sp.drop(invalid_data.index)
sp
id | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | ... | type | paint_color | image_url | state | posting_date | latitude | longitude | posting_year | posting_month | posting_weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7222695916 | 770530.0 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | Haryana | NaT | 28.8446 | 75.1167 | <NA> | <NA> | NaN |
1 | 7218891961 | 1528210.0 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | Haryana | NaT | 27.9026 | 77.0382 | <NA> | <NA> | NaN |
2 | 7221797935 | 2696820.0 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | West Bengal | NaT | 24.6838 | 85.9695 | <NA> | <NA> | NaN |
3 | 7222270760 | 192610.0 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | Andhra Pradesh | NaT | 15.5158 | 81.9342 | <NA> | <NA> | NaN |
4 | 7210384030 | 629240.0 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | West Bengal | NaT | 23.7821 | 87.4861 | <NA> | <NA> | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
426875 | 7301591192 | 3029710.0 | 2019 | nissan | maxima s sedan 4d | good | 6 cylinders | gas | 32226.0 | clean | ... | sedan | NaN | https://images.craigslist.org/00o0o_iiraFnHg8q... | Delhi | 2021-04-04 09:21:31+00:00 | 28.7876 | 76.8644 | 2021 | 4 | Sunday |
426876 | 7301591187 | 3928680.0 | 2020 | volvo | s60 t5 momentum sedan 4d | good | NaN | gas | 12029.0 | clean | ... | sedan | red | https://images.craigslist.org/00x0x_15sbgnxCIS... | Punjab | 2021-04-04 09:21:29+00:00 | 32.0715 | 75.1915 | 2021 | 4 | Sunday |
426877 | 7301591147 | 4493750.0 | 2020 | cadillac | xt4 sport suv 4d | good | NaN | diesel | 4174.0 | clean | ... | hatchback | white | https://images.craigslist.org/00L0L_farM7bxnxR... | Madhya Pradesh | 2021-04-04 09:21:17+00:00 | 21.5646 | 75.7247 | 2021 | 4 | Sunday |
426878 | 7301591140 | 3723220.0 | 2018 | lexus | es 350 sedan 4d | good | 6 cylinders | gas | 30112.0 | clean | ... | sedan | silver | https://images.craigslist.org/00z0z_bKnIVGLkDT... | Maharashtra | 2021-04-04 09:21:11+00:00 | 21.4183 | 75.4719 | 2021 | 4 | Sunday |
426879 | 7301591129 | 3928680.0 | 2019 | bmw | 4 series 430i gran coupe | good | NaN | gas | 22716.0 | clean | ... | coupe | NaN | https://images.craigslist.org/00Y0Y_lEUocjyRxa... | Jharkhand | 2021-04-04 09:21:07+00:00 | 24.2616 | 83.7688 | 2021 | 4 | Sunday |
426747 rows × 24 columns
sp.drop(invalid_data.index)
removes these rows from the original DataFrame sp.- Since there is 133 rows which is quite insignificant compared to the whole data. The
invalid_data
is removed from thesp
.
Changing the column name type
to car_type
¶
if 'type' in sp.columns:
sp.rename(columns={'type': 'car_type'}, inplace=True)
- Since the word
type
is a keyword in python, the column nametype
is changed tocar_type
- The
.rename
method is used to change the name of the columntype
tocar_type
.
Handling Inappropriate Data¶
Handling the price
column¶
sp[sp['price']==0]
id | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | ... | car_type | paint_color | image_url | state | posting_date | latitude | longitude | posting_year | posting_month | posting_weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 7218893038 | 0.0 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | Uttar Pradesh | NaT | 28.5810 | 83.4530 | <NA> | <NA> | NaN |
11 | 7218325704 | 0.0 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | Maharashtra | NaT | 16.3906 | 73.9100 | <NA> | <NA> | NaN |
12 | 7217788283 | 0.0 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | Maharashtra | NaT | 20.6117 | 79.7192 | <NA> | <NA> | NaN |
13 | 7217147606 | 0.0 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | Uttar Pradesh | NaT | 29.8561 | 81.6407 | <NA> | <NA> | NaN |
14 | 7209027818 | 0.0 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | Kerala | NaT | 8.5412 | 75.6084 | <NA> | <NA> | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
426764 | 7303347225 | 0.0 | 2018 | NaN | peterbilt 579 | NaN | NaN | diesel | 1.0 | clean | ... | NaN | NaN | https://images.craigslist.org/00F0F_kfjCcZyncy... | Kerala | 2021-04-07 21:15:05+00:00 | 8.7133 | 74.9888 | 2021 | 4 | Wednesday |
426812 | 7302877815 | 0.0 | 2006 | toyota | scion tc | excellent | 4 cylinders | gas | 195000.0 | clean | ... | NaN | silver | https://images.craigslist.org/00B0B_dqmLwdZw8Q... | Tamil Nadu | 2021-04-06 23:00:19+00:00 | 9.7042 | 79.7712 | 2021 | 4 | Tuesday |
426832 | 7302353283 | 0.0 | 2004 | toyota | prius | excellent | 4 cylinders | hybrid | 239000.0 | clean | ... | NaN | blue | https://images.craigslist.org/00V0V_7Ap4arGS9S... | Telangana | 2021-04-05 23:00:55+00:00 | 16.8085 | 77.5499 | 2021 | 4 | Monday |
426836 | 7302301268 | 0.0 | 2018 | ram | 2500 | excellent | 6 cylinders | diesel | 20492.0 | clean | ... | truck | white | https://images.craigslist.org/00i0i_3cSpgdy0qQ... | Delhi | 2021-04-05 21:18:42+00:00 | 28.6000 | 77.2163 | 2021 | 4 | Monday |
426868 | 7301843288 | 0.0 | 2010 | toyota | venza | excellent | 6 cylinders | gas | 155000.0 | clean | ... | NaN | blue | https://images.craigslist.org/00808_hNEilrIb0i... | Uttar Pradesh | 2021-04-04 23:00:48+00:00 | 27.9450 | 79.3595 | 2021 | 4 | Sunday |
34832 rows × 24 columns
price
column cannot have entries0
.- So it is attempted to replace with the Median of
price
column in the next step.
# Calculate the median of the column (ignoring zeros)
median_price = sp[sp['price'] > 0]['price'].median()
# Replace zeros with the median
sp.loc[sp['price'] == 0, 'price'] = median_price
# Display the updated dataframe and median
print(f"Median Price (excluding zeros): {median_price}")
print(sp[['price']].head())
Median Price (excluding zeros): 1990460.0 price 0 770530.0 1 1528210.0 2 2696820.0 3 192610.0 4 629240.0
.median()
calculates the median of the price column, ignoring zeros.sp.loc[sp['price'] == 0, 'price'] = median_price
replaces all zero values in the price column with the calculated median.
Handling the odometer
column¶
sp['odometer'].isnull().sum()
4400
sp['odometer']
cannot have null values.- So the null values will be replaced with median odometer in the next step.
# Calculate the median of the column (ignoring zeros)
median_odometer = sp[sp['odometer'] > 0]['odometer'].median()
# Replace missing with the median
sp['odometer'].fillna(median_odometer, inplace=True)
# Display the updated dataframe and median
print(f"Median odometer (excluding zeros): {median_odometer}")
print(sp[['odometer']].head())
Median odometer (excluding zeros): 86000.0 odometer 0 86000.0 1 86000.0 2 86000.0 3 86000.0 4 86000.0
.median()
calculates the median of theodometer
column, ignoring zeros..fillna()
replaces missing values (NaN) in theodometer
column with the calculated median.
Handling the year
column¶
sp['year'].isnull().sum()
1205
sp['odometer']
cannot have null values.- So in the next step it is attempted to rectify.
# Replace empty or invalid values with NaN
sp['year'] = sp['year'].replace(['', 'NA'], pd.NA)
# Calculate median years for each manufacturer and model
median_years = sp.groupby(['manufacturer', 'model'])['year'].median()
# Fill missing years using the calculated medians
sp['year'] = sp.apply(
lambda row: median_years.loc[(row['manufacturer'], row['model'])] if pd.isna(row['year']) and (row['manufacturer'], row['model']) in median_years.index else row['year'],
axis=1
)
# For rows still missing 'year', fill with the overall median
overall_median_year = sp['year'].median()
sp['year'] = sp['year'].fillna(overall_median_year)
# View the result
print(sp)
id price year manufacturer model \ 0 7222695916 770530.0 2013.0 NaN NaN 1 7218891961 1528210.0 2013.0 NaN NaN 2 7221797935 2696820.0 2013.0 NaN NaN 3 7222270760 192610.0 2013.0 NaN NaN 4 7210384030 629240.0 2013.0 NaN NaN ... ... ... ... ... ... 426875 7301591192 3029710.0 2019.0 nissan maxima s sedan 4d 426876 7301591187 3928680.0 2020.0 volvo s60 t5 momentum sedan 4d 426877 7301591147 4493750.0 2020.0 cadillac xt4 sport suv 4d 426878 7301591140 3723220.0 2018.0 lexus es 350 sedan 4d 426879 7301591129 3928680.0 2019.0 bmw 4 series 430i gran coupe condition cylinders fuel odometer title_status ... car_type \ 0 NaN NaN NaN 86000.0 NaN ... NaN 1 NaN NaN NaN 86000.0 NaN ... NaN 2 NaN NaN NaN 86000.0 NaN ... NaN 3 NaN NaN NaN 86000.0 NaN ... NaN 4 NaN NaN NaN 86000.0 NaN ... NaN ... ... ... ... ... ... ... ... 426875 good 6 cylinders gas 32226.0 clean ... sedan 426876 good NaN gas 12029.0 clean ... sedan 426877 good NaN diesel 4174.0 clean ... hatchback 426878 good 6 cylinders gas 30112.0 clean ... sedan 426879 good NaN gas 22716.0 clean ... coupe paint_color image_url \ 0 NaN NaN 1 NaN NaN 2 NaN NaN 3 NaN NaN 4 NaN NaN ... ... ... 426875 NaN https://images.craigslist.org/00o0o_iiraFnHg8q... 426876 red https://images.craigslist.org/00x0x_15sbgnxCIS... 426877 white https://images.craigslist.org/00L0L_farM7bxnxR... 426878 silver https://images.craigslist.org/00z0z_bKnIVGLkDT... 426879 NaN https://images.craigslist.org/00Y0Y_lEUocjyRxa... state posting_date latitude longitude \ 0 Haryana NaT 28.8446 75.1167 1 Haryana NaT 27.9026 77.0382 2 West Bengal NaT 24.6838 85.9695 3 Andhra Pradesh NaT 15.5158 81.9342 4 West Bengal NaT 23.7821 87.4861 ... ... ... ... ... 426875 Delhi 2021-04-04 09:21:31+00:00 28.7876 76.8644 426876 Punjab 2021-04-04 09:21:29+00:00 32.0715 75.1915 426877 Madhya Pradesh 2021-04-04 09:21:17+00:00 21.5646 75.7247 426878 Maharashtra 2021-04-04 09:21:11+00:00 21.4183 75.4719 426879 Jharkhand 2021-04-04 09:21:07+00:00 24.2616 83.7688 posting_year posting_month posting_weekday 0 <NA> <NA> NaN 1 <NA> <NA> NaN 2 <NA> <NA> NaN 3 <NA> <NA> NaN 4 <NA> <NA> NaN ... ... ... ... 426875 2021 4 Sunday 426876 2021 4 Sunday 426877 2021 4 Sunday 426878 2021 4 Sunday 426879 2021 4 Sunday [426747 rows x 24 columns]
.replace()
replaces empty or invalid values in the year column with NaN.sp.groupby(['manufacturer', 'model'])['year'].median()
calculates the median year for each manufacturer and model group.apply()
fills missing year values using the calculated medians for corresponding manufacturer and model.- With
.filna()
which fills any remaining missing year values with the overall median.
Handling model
column¶
sp['model'].isnull().sum()
5268
sp['model'].value_counts()
f-150 8008 silverado 1500 5140 1500 4211 camry 3135 silverado 3022 ... escalade and 1 cx 3 1 1991 INTERNATIONAL 9800 1 nv 200 s 1 Paige Glenbrook Touring 1 Name: model, Length: 29648, dtype: int64
# Step 1: Create a mapping of (manufacturer, year) to model
model_mapping = (
sp[~sp['model'].isnull()] # Filter rows where 'model' is not null
.groupby(['manufacturer', 'year'])['model']
.first()
.reset_index()
)
# Convert the mapping to a dictionary
mapping_dict = model_mapping.set_index(['manufacturer', 'year'])['model'].to_dict()
# Step 2: Fill missing 'model' values
for index, row in sp.iterrows():
if pd.isnull(row['model']): # Check if 'model' is missing
sp.at[index, 'model'] = mapping_dict.get((row['manufacturer'], row['year']), None)
# Print the updated DataFrame
print(sp)
id price year manufacturer model \ 0 7222695916 770530.0 2013.0 NaN NaN 1 7218891961 1528210.0 2013.0 NaN NaN 2 7221797935 2696820.0 2013.0 NaN NaN 3 7222270760 192610.0 2013.0 NaN NaN 4 7210384030 629240.0 2013.0 NaN NaN ... ... ... ... ... ... 426875 7301591192 3029710.0 2019.0 nissan maxima s sedan 4d 426876 7301591187 3928680.0 2020.0 volvo s60 t5 momentum sedan 4d 426877 7301591147 4493750.0 2020.0 cadillac xt4 sport suv 4d 426878 7301591140 3723220.0 2018.0 lexus es 350 sedan 4d 426879 7301591129 3928680.0 2019.0 bmw 4 series 430i gran coupe condition cylinders fuel odometer title_status ... car_type \ 0 NaN NaN NaN 86000.0 NaN ... NaN 1 NaN NaN NaN 86000.0 NaN ... NaN 2 NaN NaN NaN 86000.0 NaN ... NaN 3 NaN NaN NaN 86000.0 NaN ... NaN 4 NaN NaN NaN 86000.0 NaN ... NaN ... ... ... ... ... ... ... ... 426875 good 6 cylinders gas 32226.0 clean ... sedan 426876 good NaN gas 12029.0 clean ... sedan 426877 good NaN diesel 4174.0 clean ... hatchback 426878 good 6 cylinders gas 30112.0 clean ... sedan 426879 good NaN gas 22716.0 clean ... coupe paint_color image_url \ 0 NaN NaN 1 NaN NaN 2 NaN NaN 3 NaN NaN 4 NaN NaN ... ... ... 426875 NaN https://images.craigslist.org/00o0o_iiraFnHg8q... 426876 red https://images.craigslist.org/00x0x_15sbgnxCIS... 426877 white https://images.craigslist.org/00L0L_farM7bxnxR... 426878 silver https://images.craigslist.org/00z0z_bKnIVGLkDT... 426879 NaN https://images.craigslist.org/00Y0Y_lEUocjyRxa... state posting_date latitude longitude \ 0 Haryana NaT 28.8446 75.1167 1 Haryana NaT 27.9026 77.0382 2 West Bengal NaT 24.6838 85.9695 3 Andhra Pradesh NaT 15.5158 81.9342 4 West Bengal NaT 23.7821 87.4861 ... ... ... ... ... 426875 Delhi 2021-04-04 09:21:31+00:00 28.7876 76.8644 426876 Punjab 2021-04-04 09:21:29+00:00 32.0715 75.1915 426877 Madhya Pradesh 2021-04-04 09:21:17+00:00 21.5646 75.7247 426878 Maharashtra 2021-04-04 09:21:11+00:00 21.4183 75.4719 426879 Jharkhand 2021-04-04 09:21:07+00:00 24.2616 83.7688 posting_year posting_month posting_weekday 0 <NA> <NA> NaN 1 <NA> <NA> NaN 2 <NA> <NA> NaN 3 <NA> <NA> NaN 4 <NA> <NA> NaN ... ... ... ... 426875 2021 4 Sunday 426876 2021 4 Sunday 426877 2021 4 Sunday 426878 2021 4 Sunday 426879 2021 4 Sunday [426747 rows x 24 columns]
model_mapping
creates a mapping of (manufacturer, year) to the first non-null model..set_index
converts the mapping into a dictionary for efficient lookup.- Using for loop which Iterate through the rows, filling missing model values using the dictionary based on manufacturer and year.
sp['model'].isnull().sum()
148
- Still it is noticed that there are some missing values in the column
model
.
Dropped rows where manufacturer
and model
is NAN¶
# Drop rows where both 'manufacturer' and 'model' are null
sp = sp.dropna(subset=['manufacturer', 'model'], how='all')
# Print the updated DataFrame
print(sp)
id price year manufacturer model \ 27 7316814884 4313900.0 2014.0 gmc sierra 1500 crew cab slt 28 7316814758 2901280.0 2010.0 chevrolet silverado 1500 29 7316814989 5084440.0 2020.0 chevrolet silverado 1500 crew 30 7316743432 3980090.0 2017.0 toyota tundra double cab sr 31 7316356412 1926290.0 2013.0 ford f-150 xlt ... ... ... ... ... ... 426875 7301591192 3029710.0 2019.0 nissan maxima s sedan 4d 426876 7301591187 3928680.0 2020.0 volvo s60 t5 momentum sedan 4d 426877 7301591147 4493750.0 2020.0 cadillac xt4 sport suv 4d 426878 7301591140 3723220.0 2018.0 lexus es 350 sedan 4d 426879 7301591129 3928680.0 2019.0 bmw 4 series 430i gran coupe condition cylinders fuel odometer title_status ... car_type \ 27 good 8 cylinders gas 57923.0 clean ... pickup 28 good 8 cylinders gas 71229.0 clean ... pickup 29 good 8 cylinders gas 19160.0 clean ... pickup 30 good 8 cylinders gas 41124.0 clean ... pickup 31 excellent 6 cylinders gas 128000.0 clean ... truck ... ... ... ... ... ... ... ... 426875 good 6 cylinders gas 32226.0 clean ... sedan 426876 good NaN gas 12029.0 clean ... sedan 426877 good NaN diesel 4174.0 clean ... hatchback 426878 good 6 cylinders gas 30112.0 clean ... sedan 426879 good NaN gas 22716.0 clean ... coupe paint_color image_url \ 27 white https://images.craigslist.org/00R0R_lwWjXSEWNa... 28 blue https://images.craigslist.org/00R0R_lwWjXSEWNa... 29 red https://images.craigslist.org/01212_jjirIWa0y0... 30 red https://images.craigslist.org/00x0x_1y9kIOzGCF... 31 black https://images.craigslist.org/00404_l4loxHvdQe... ... ... ... 426875 NaN https://images.craigslist.org/00o0o_iiraFnHg8q... 426876 red https://images.craigslist.org/00x0x_15sbgnxCIS... 426877 white https://images.craigslist.org/00L0L_farM7bxnxR... 426878 silver https://images.craigslist.org/00z0z_bKnIVGLkDT... 426879 NaN https://images.craigslist.org/00Y0Y_lEUocjyRxa... state posting_date latitude longitude \ 27 Kerala 2021-05-04 17:31:18+00:00 9.8072 76.8361 28 Uttar Pradesh 2021-05-04 17:31:08+00:00 29.2417 83.7824 29 Delhi 2021-05-04 17:31:25+00:00 28.8053 76.8871 30 Odisha 2021-05-04 15:41:31+00:00 19.2886 85.2667 31 Tripura 2021-05-03 19:02:03+00:00 23.4924 91.1187 ... ... ... ... ... 426875 Delhi 2021-04-04 09:21:31+00:00 28.7876 76.8644 426876 Punjab 2021-04-04 09:21:29+00:00 32.0715 75.1915 426877 Madhya Pradesh 2021-04-04 09:21:17+00:00 21.5646 75.7247 426878 Maharashtra 2021-04-04 09:21:11+00:00 21.4183 75.4719 426879 Jharkhand 2021-04-04 09:21:07+00:00 24.2616 83.7688 posting_year posting_month posting_weekday 27 2021 5 Tuesday 28 2021 5 Tuesday 29 2021 5 Tuesday 30 2021 5 Tuesday 31 2021 5 Monday ... ... ... ... 426875 2021 4 Sunday 426876 2021 4 Sunday 426877 2021 4 Sunday 426878 2021 4 Sunday 426879 2021 4 Sunday [426678 rows x 24 columns]
- Using
.dropna()
function it is attempted to drop the rows wheremanufacturer
andmodel
both has null values. - There lies 69 rows which has this condition. Since it is insignificant. We opt to drop.
sp['model'].isnull().sum()
79
- Still it is noticed that there are 79 null values in th
model
column.
Categorized Unknown
to the remaining missing values in model
¶
# Fill missing values in the 'model' column with 'Unknown'
sp['model'] = sp['model'].fillna('Unknown')
# Print the updated DataFrame
print(sp)
id price year manufacturer model \ 27 7316814884 4313900.0 2014.0 gmc sierra 1500 crew cab slt 28 7316814758 2901280.0 2010.0 chevrolet silverado 1500 29 7316814989 5084440.0 2020.0 chevrolet silverado 1500 crew 30 7316743432 3980090.0 2017.0 toyota tundra double cab sr 31 7316356412 1926290.0 2013.0 ford f-150 xlt ... ... ... ... ... ... 426875 7301591192 3029710.0 2019.0 nissan maxima s sedan 4d 426876 7301591187 3928680.0 2020.0 volvo s60 t5 momentum sedan 4d 426877 7301591147 4493750.0 2020.0 cadillac xt4 sport suv 4d 426878 7301591140 3723220.0 2018.0 lexus es 350 sedan 4d 426879 7301591129 3928680.0 2019.0 bmw 4 series 430i gran coupe condition cylinders fuel odometer title_status ... car_type \ 27 good 8 cylinders gas 57923.0 clean ... pickup 28 good 8 cylinders gas 71229.0 clean ... pickup 29 good 8 cylinders gas 19160.0 clean ... pickup 30 good 8 cylinders gas 41124.0 clean ... pickup 31 excellent 6 cylinders gas 128000.0 clean ... truck ... ... ... ... ... ... ... ... 426875 good 6 cylinders gas 32226.0 clean ... sedan 426876 good NaN gas 12029.0 clean ... sedan 426877 good NaN diesel 4174.0 clean ... hatchback 426878 good 6 cylinders gas 30112.0 clean ... sedan 426879 good NaN gas 22716.0 clean ... coupe paint_color image_url \ 27 white https://images.craigslist.org/00R0R_lwWjXSEWNa... 28 blue https://images.craigslist.org/00R0R_lwWjXSEWNa... 29 red https://images.craigslist.org/01212_jjirIWa0y0... 30 red https://images.craigslist.org/00x0x_1y9kIOzGCF... 31 black https://images.craigslist.org/00404_l4loxHvdQe... ... ... ... 426875 NaN https://images.craigslist.org/00o0o_iiraFnHg8q... 426876 red https://images.craigslist.org/00x0x_15sbgnxCIS... 426877 white https://images.craigslist.org/00L0L_farM7bxnxR... 426878 silver https://images.craigslist.org/00z0z_bKnIVGLkDT... 426879 NaN https://images.craigslist.org/00Y0Y_lEUocjyRxa... state posting_date latitude longitude \ 27 Kerala 2021-05-04 17:31:18+00:00 9.8072 76.8361 28 Uttar Pradesh 2021-05-04 17:31:08+00:00 29.2417 83.7824 29 Delhi 2021-05-04 17:31:25+00:00 28.8053 76.8871 30 Odisha 2021-05-04 15:41:31+00:00 19.2886 85.2667 31 Tripura 2021-05-03 19:02:03+00:00 23.4924 91.1187 ... ... ... ... ... 426875 Delhi 2021-04-04 09:21:31+00:00 28.7876 76.8644 426876 Punjab 2021-04-04 09:21:29+00:00 32.0715 75.1915 426877 Madhya Pradesh 2021-04-04 09:21:17+00:00 21.5646 75.7247 426878 Maharashtra 2021-04-04 09:21:11+00:00 21.4183 75.4719 426879 Jharkhand 2021-04-04 09:21:07+00:00 24.2616 83.7688 posting_year posting_month posting_weekday 27 2021 5 Tuesday 28 2021 5 Tuesday 29 2021 5 Tuesday 30 2021 5 Tuesday 31 2021 5 Monday ... ... ... ... 426875 2021 4 Sunday 426876 2021 4 Sunday 426877 2021 4 Sunday 426878 2021 4 Sunday 426879 2021 4 Sunday [426678 rows x 24 columns]
- The remaining null values in the
model
column has been replaced byUnknown
sp['model'].isnull().sum()
0
sp['model'].value_counts()
f-150 8014 silverado 1500 5154 1500 4403 camry 3155 silverado 3037 ... g35xsports 1 escalade ext awd gas 1 big horn slt 1 1500 4x4 sport 1 Paige Glenbrook Touring 1 Name: model, Length: 29649, dtype: int64
Handling manufacturer
column values¶
sp['manufacturer'].isnull().sum()
17536
sp['manufacturer'].value_counts()
ford 70981 chevrolet 55061 toyota 34199 honda 21267 nissan 19063 jeep 19014 ram 18342 gmc 16785 bmw 14698 dodge 13707 mercedes-benz 11816 hyundai 10338 subaru 9495 volkswagen 9345 kia 8457 lexus 8196 audi 7572 cadillac 6952 chrysler 6031 acura 5978 buick 5501 mazda 5427 infiniti 4802 lincoln 4220 volvo 3374 mitsubishi 3225 mini 2376 pontiac 2288 rover 2113 jaguar 1946 porsche 1383 mercury 1184 saturn 1090 alfa-romeo 897 tesla 868 fiat 792 harley-davidson 153 ferrari 95 datsun 63 aston-martin 24 land rover 21 morgan 3 Name: manufacturer, dtype: int64
# Step 1: Create a mapping of model to manufacturer
manufacturer_mapping = (
sp[~sp['manufacturer'].isnull()] # Filter rows where 'manufacturer' is not null
.groupby('model')['manufacturer']
.first()
.reset_index()
)
# Convert the mapping to a dictionary
manufacturer_mapping_dict = manufacturer_mapping.set_index('model')['manufacturer'].to_dict()
# Step 2: Fill missing values in 'manufacturer'
def fill_manufacturer(row):
if pd.isnull(row['manufacturer']):
# Try model-only mapping
return manufacturer_mapping_dict.get(row['model'], 'Unknown') # Default to 'Unknown' if no match
return row['manufacturer']
sp['manufacturer'] = sp.apply(fill_manufacturer, axis=1)
# Print the updated DataFrame
print(sp)
id price year manufacturer model \ 27 7316814884 4313900.0 2014.0 gmc sierra 1500 crew cab slt 28 7316814758 2901280.0 2010.0 chevrolet silverado 1500 29 7316814989 5084440.0 2020.0 chevrolet silverado 1500 crew 30 7316743432 3980090.0 2017.0 toyota tundra double cab sr 31 7316356412 1926290.0 2013.0 ford f-150 xlt ... ... ... ... ... ... 426875 7301591192 3029710.0 2019.0 nissan maxima s sedan 4d 426876 7301591187 3928680.0 2020.0 volvo s60 t5 momentum sedan 4d 426877 7301591147 4493750.0 2020.0 cadillac xt4 sport suv 4d 426878 7301591140 3723220.0 2018.0 lexus es 350 sedan 4d 426879 7301591129 3928680.0 2019.0 bmw 4 series 430i gran coupe condition cylinders fuel odometer title_status ... car_type \ 27 good 8 cylinders gas 57923.0 clean ... pickup 28 good 8 cylinders gas 71229.0 clean ... pickup 29 good 8 cylinders gas 19160.0 clean ... pickup 30 good 8 cylinders gas 41124.0 clean ... pickup 31 excellent 6 cylinders gas 128000.0 clean ... truck ... ... ... ... ... ... ... ... 426875 good 6 cylinders gas 32226.0 clean ... sedan 426876 good NaN gas 12029.0 clean ... sedan 426877 good NaN diesel 4174.0 clean ... hatchback 426878 good 6 cylinders gas 30112.0 clean ... sedan 426879 good NaN gas 22716.0 clean ... coupe paint_color image_url \ 27 white https://images.craigslist.org/00R0R_lwWjXSEWNa... 28 blue https://images.craigslist.org/00R0R_lwWjXSEWNa... 29 red https://images.craigslist.org/01212_jjirIWa0y0... 30 red https://images.craigslist.org/00x0x_1y9kIOzGCF... 31 black https://images.craigslist.org/00404_l4loxHvdQe... ... ... ... 426875 NaN https://images.craigslist.org/00o0o_iiraFnHg8q... 426876 red https://images.craigslist.org/00x0x_15sbgnxCIS... 426877 white https://images.craigslist.org/00L0L_farM7bxnxR... 426878 silver https://images.craigslist.org/00z0z_bKnIVGLkDT... 426879 NaN https://images.craigslist.org/00Y0Y_lEUocjyRxa... state posting_date latitude longitude \ 27 Kerala 2021-05-04 17:31:18+00:00 9.8072 76.8361 28 Uttar Pradesh 2021-05-04 17:31:08+00:00 29.2417 83.7824 29 Delhi 2021-05-04 17:31:25+00:00 28.8053 76.8871 30 Odisha 2021-05-04 15:41:31+00:00 19.2886 85.2667 31 Tripura 2021-05-03 19:02:03+00:00 23.4924 91.1187 ... ... ... ... ... 426875 Delhi 2021-04-04 09:21:31+00:00 28.7876 76.8644 426876 Punjab 2021-04-04 09:21:29+00:00 32.0715 75.1915 426877 Madhya Pradesh 2021-04-04 09:21:17+00:00 21.5646 75.7247 426878 Maharashtra 2021-04-04 09:21:11+00:00 21.4183 75.4719 426879 Jharkhand 2021-04-04 09:21:07+00:00 24.2616 83.7688 posting_year posting_month posting_weekday 27 2021 5 Tuesday 28 2021 5 Tuesday 29 2021 5 Tuesday 30 2021 5 Tuesday 31 2021 5 Monday ... ... ... ... 426875 2021 4 Sunday 426876 2021 4 Sunday 426877 2021 4 Sunday 426878 2021 4 Sunday 426879 2021 4 Sunday [426678 rows x 24 columns]
manufacturer_mapping
creates a mapping of manufacturer to the first non-null manufacturer.manufacturer_mapping_dict
converts the mapping into a dictionary for quick lookup.- Fill missing manufacturer values by checking the model and using the dictionary, defaulting to 'Unknown' if no match is found.
sp['manufacturer'].isnull().sum()
0
sp['manufacturer'].value_counts()
ford 71073 chevrolet 55182 toyota 34449 honda 21270 nissan 19065 jeep 19023 ram 18364 gmc 16793 Unknown 16769 bmw 14721 dodge 13724 mercedes-benz 11818 hyundai 10340 subaru 9495 volkswagen 9350 kia 8457 lexus 8196 audi 7572 cadillac 6952 chrysler 6052 acura 5978 buick 5504 mazda 5456 infiniti 4803 lincoln 4220 volvo 3377 mitsubishi 3227 mini 2376 pontiac 2291 rover 2175 jaguar 1946 porsche 1385 mercury 1187 saturn 1090 alfa-romeo 897 fiat 868 tesla 868 harley-davidson 153 ferrari 95 datsun 69 aston-martin 24 land rover 21 morgan 3 Name: manufacturer, dtype: int64
Handling condition
column values¶
sp['condition'].isnull().sum()
173984
sp['condition'].value_counts()
good 121452 excellent 101430 like new 21175 fair 6765 new 1271 salvage 601 Name: condition, dtype: int64
sp['condition'] = sp.groupby(['manufacturer','model','year'])['condition'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
- With
sp.groupby
which groups the data by manufacturer, model, and year. - Then using
.transform()
, fill missing values in the condition column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available. - Update the condition column with the filled values.
sp['condition'].isnull().sum()
0
sp['condition'].value_counts()
excellent 209179 good 143637 Unknown 36669 like new 27061 fair 7702 new 1781 salvage 649 Name: condition, dtype: int64
Handling cylinder
column values¶
sp['cylinders'].isnull().sum()
177542
sp['cylinders'].value_counts()
6 cylinders 94165 4 cylinders 77588 8 cylinders 72060 5 cylinders 1712 10 cylinders 1455 other 1293 3 cylinders 655 12 cylinders 208 Name: cylinders, dtype: int64
sp['cylinders'] = sp.groupby(['manufacturer','model','year'])['cylinders'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
- With
sp.groupby
which groups the data by manufacturer, model, and year. - Then using
.transform()
, fill missing values in the 'cylinders' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available. - Update the 'cylinders' column with the filled values.
sp['cylinders'].isnull().sum()
0
sp['cylinders'].value_counts()
6 cylinders 137826 4 cylinders 123886 8 cylinders 102218 Unknown 55441 5 cylinders 2450 10 cylinders 1912 other 1814 3 cylinders 907 12 cylinders 224 Name: cylinders, dtype: int64
Handling fuel
column values¶
sp['fuel'].isnull().sum()
2945
sp['fuel'].value_counts()
gas 356115 other 30696 diesel 30055 hybrid 5170 electric 1697 Name: fuel, dtype: int64
sp['fuel'] = sp.groupby(['manufacturer','model','year'])['fuel'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
- With
sp.groupby
which groups the data by manufacturer, model, and year. - Then using
.transform()
, fill missing values in the 'fuel' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available. - Update the 'fuel' column with the filled values.
sp['fuel'].isnull().sum()
0
sp['fuel'].value_counts()
gas 357266 other 30724 diesel 30101 hybrid 5195 electric 1701 Unknown 1691 Name: fuel, dtype: int64
Handling transmission
column values¶
sp['transmission'].isnull().sum()
2488
sp['transmission'].value_counts()
automatic 336437 other 62642 manual 25111 Name: transmission, dtype: int64
sp['transmission'] = sp.groupby(['manufacturer','model','year'])['transmission'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
- With
sp.groupby
which groups the data by manufacturer, model, and year. - Then using
.transform()
, fill missing values in the 'transmission' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available. - Update the 'transmission' column with the filled values.
sp['transmission'].isnull().sum()
0
sp['transmission'].value_counts()
automatic 338541 other 62660 manual 25135 Unknown 342 Name: transmission, dtype: int64
Handling title_status
column values¶
sp['title_status'].isnull().sum()
8139
sp['title_status'].value_counts()
clean 405030 rebuilt 7219 salvage 3861 lien 1422 missing 810 parts only 197 Name: title_status, dtype: int64
sp['title_status'] = sp['title_status'].fillna('Unknown')
.fillna()
fills missing values in thetitle_status
column with the string'Unknown'
.- Update the
title_status
column with the filled values.
sp['title_status'].isnull().sum()
0
sp['title_status'].value_counts()
clean 405030 Unknown 8139 rebuilt 7219 salvage 3861 lien 1422 missing 810 parts only 197 Name: title_status, dtype: int64
Handling car_type
column values¶
sp['car_type'].isnull().sum()
92723
sp['car_type'].value_counts()
sedan 87055 SUV 77233 pickup 43509 truck 35273 other 22105 coupe 19204 hatchback 16598 wagon 10750 van 8548 convertible 7731 mini-van 4825 offroad 609 bus 515 Name: car_type, dtype: int64
sp['car_type'] = sp.groupby(['manufacturer','model','year','transmission','fuel'])['car_type'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
- With
sp.groupby
which groups the data by manufacturer, model, year, transmission, and fuel. - Then using
.transform()
, fill missing values in thecar_type
column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available. - Update the
car_type
column with the filled values.
sp['car_type'].isnull().sum()
0
sp['car_type'].value_counts()
sedan 106733 SUV 97421 pickup 50507 truck 43467 Unknown 24924 other 23014 coupe 22104 hatchback 18777 wagon 12186 van 10284 convertible 8910 mini-van 7007 offroad 725 bus 619 Name: car_type, dtype: int64
Handling paint
column values¶
sp['paint_color'].isnull().sum()
130030
sp['paint_color'].value_counts()
white 79280 black 62854 silver 42963 blue 31220 red 30469 grey 24415 green 7343 custom 6699 brown 6593 yellow 2142 orange 1984 purple 686 Name: paint_color, dtype: int64
sp['paint_color'] = sp.groupby(['manufacturer','model','year'])['paint_color'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
- With
sp.groupby
which groups the data by manufacturer, model, and year. - Then using
.transform()
, fill missing values in the 'paint_color' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available. - Update the 'paint_color' column with the filled values.
sp['paint_color'].isnull().sum()
0
sp['paint_color'].value_counts()
white 108720 black 91076 silver 57794 blue 40905 red 37135 grey 30740 Unknown 30617 green 8786 brown 7901 custom 7606 yellow 2346 orange 2282 purple 770 Name: paint_color, dtype: int64
Handling drive
column values¶
sp['drive'].isnull().sum()
130432
sp['drive'].value_counts()
4wd 131853 fwd 105506 rwd 58887 Name: drive, dtype: int64
sp['drive'] = sp.groupby(['manufacturer','model'])['drive'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
- With
sp.groupby
which groups the data by manufacturer, model, and year. - Then using
.transform()
, fill missing values in the 'drive' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available. - Update the 'drive' column with the filled values.
sp['drive'].isnull().sum()
0
sp['drive'].value_counts()
4wd 179672 fwd 145427 rwd 78620 Unknown 22959 Name: drive, dtype: int64
Handling size
Column values¶
sp['size'].isnull().sum()
306166
sp['size'].value_counts()
full-size 63461 mid-size 34474 compact 19383 sub-compact 3194 Name: size, dtype: int64
sp['size'] = sp.groupby(['manufacturer','model','year'])['size'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
- With
sp.groupby
which groups the data by manufacturer, model, and year. - Then using
.transform()
, fill missing values in the 'size' column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available. - Update the 'size' column with the filled values.
sp['size'].isnull().sum()
0
sp['size'].value_counts()
full-size 160376 Unknown 121065 mid-size 87792 compact 51146 sub-compact 6299 Name: size, dtype: int64
Handling VIN
Column values¶
sp['VIN'].isnull().sum()
160908
sp['VIN'].value_counts()
1FMJU1JT1HEA52352 261 3C6JR6DT3KG560649 235 1FTER1EH1LLA36301 231 5TFTX4CN3EX042751 227 1GCHTCE37G1186784 214 ... 4S4BRBGC3B3332201 1 1N4AA5AP2EC448150 1 1J8GN58K88W233418 1 2FMDK3GC9ABA40182 1 SAJGX2749VCOO8376 1 Name: VIN, Length: 118226, dtype: int64
sp['VIN'] = sp.groupby(['manufacturer','model','year','price','condition','cylinders','fuel','odometer','title_status','transmission','drive','size','car_type','paint_color'])['VIN'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
- With
sp.groupby
which groups the data bymanufacturer
,model
,year
,price
,condition
,cylinders
,fuel
,odometer
,title_status
,transmission
,drive
,size
,car_type
, andpaint_color
. - Then using
.transform()
, fill missing values in theVIN
column with the most frequent value (mode) of that group, or 'Unknown' if the mode is not available. - Update the
VIN
column with the filled values.
sp['VIN'].isnull().sum()
0
sp['VIN'].value_counts()
Unknown 160210 1FMJU1JT1HEA52352 261 3C6JR6DT3KG560649 235 1FTER1EH1LLA36301 231 5TFTX4CN3EX042751 227 ... 1C3EL75R35N639520 1 1GCHG35R211150470 1 3B7HC13YXXM530629 1 1GT21REG3GZ289767 1 SAJGX2749VCOO8376 1 Name: VIN, Length: 118227, dtype: int64
Checking the Cleaned Data-type missing values¶
missing_value_count = sp.isnull().sum()
print("Missing Values in Each Column:")
missing_value_count
Missing Values in Each Column:
id 0 price 0 year 0 manufacturer 0 model 0 condition 0 cylinders 0 fuel 0 odometer 0 title_status 0 transmission 0 VIN 0 drive 0 size 0 car_type 0 paint_color 0 image_url 0 state 0 posting_date 0 latitude 0 longitude 0 posting_year 0 posting_month 0 posting_weekday 0 dtype: int64
Create car_age
and avg_yearly_mileage
column¶
Create car_age
column¶
sp['car_age'] = sp.apply(
lambda row: row['posting_date'].year - row['year'] if pd.notna(row['year']) and pd.notna(row['posting_date']) else np.nan,
axis=1
)
# Display the DataFrame with the new 'car_age' column
print(sp[['year', 'posting_date', 'car_age']])
year posting_date car_age 27 2014.0 2021-05-04 17:31:18+00:00 7.0 28 2010.0 2021-05-04 17:31:08+00:00 11.0 29 2020.0 2021-05-04 17:31:25+00:00 1.0 30 2017.0 2021-05-04 15:41:31+00:00 4.0 31 2013.0 2021-05-03 19:02:03+00:00 8.0 ... ... ... ... 426875 2019.0 2021-04-04 09:21:31+00:00 2.0 426876 2020.0 2021-04-04 09:21:29+00:00 1.0 426877 2020.0 2021-04-04 09:21:17+00:00 1.0 426878 2018.0 2021-04-04 09:21:11+00:00 3.0 426879 2019.0 2021-04-04 09:21:07+00:00 2.0 [426678 rows x 3 columns]
- Calculate the
car_age
by subtracting theyear
from theposting_date
year, if both values are not null. - If either
year
orposting_date
is missing, set thecar_age
asNaN
. - Update the
car_age
column with the calculated values.
Create avg_yearly_mileage
column¶
# Calculate 'avg_yearly_mileage'
sp['avg_yearly_mileage'] = np.where(
sp['car_age'] == 0, # Check if 'car_age' is zero
sp['odometer'], # If 'car_age' is zero, use 'odometer'
sp['odometer'] / sp['car_age'] # Otherwise, calculate the formula
)
# Display the result
print(sp['avg_yearly_mileage'])
27 8274.714286 28 6475.363636 29 19160.000000 30 10281.000000 31 16000.000000 ... 426875 16113.000000 426876 12029.000000 426877 4174.000000 426878 10037.333333 426879 11358.000000 Name: avg_yearly_mileage, Length: 426678, dtype: float64
- Calculate
avg_yearly_mileage
by dividingodometer
bycar_age
, unlesscar_age
is zero, in which case use theodometer
value directly. - Update the
avg_yearly_mileage
column with the calculated values.
DataFrame sp
¶
sp
id | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | ... | image_url | state | posting_date | latitude | longitude | posting_year | posting_month | posting_weekday | car_age | avg_yearly_mileage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
27 | 7316814884 | 4313900.0 | 2014.0 | gmc | sierra 1500 crew cab slt | good | 8 cylinders | gas | 57923.0 | clean | ... | https://images.craigslist.org/00R0R_lwWjXSEWNa... | Kerala | 2021-05-04 17:31:18+00:00 | 9.8072 | 76.8361 | 2021 | 5 | Tuesday | 7.0 | 8274.714286 |
28 | 7316814758 | 2901280.0 | 2010.0 | chevrolet | silverado 1500 | good | 8 cylinders | gas | 71229.0 | clean | ... | https://images.craigslist.org/00R0R_lwWjXSEWNa... | Uttar Pradesh | 2021-05-04 17:31:08+00:00 | 29.2417 | 83.7824 | 2021 | 5 | Tuesday | 11.0 | 6475.363636 |
29 | 7316814989 | 5084440.0 | 2020.0 | chevrolet | silverado 1500 crew | good | 8 cylinders | gas | 19160.0 | clean | ... | https://images.craigslist.org/01212_jjirIWa0y0... | Delhi | 2021-05-04 17:31:25+00:00 | 28.8053 | 76.8871 | 2021 | 5 | Tuesday | 1.0 | 19160.000000 |
30 | 7316743432 | 3980090.0 | 2017.0 | toyota | tundra double cab sr | good | 8 cylinders | gas | 41124.0 | clean | ... | https://images.craigslist.org/00x0x_1y9kIOzGCF... | Odisha | 2021-05-04 15:41:31+00:00 | 19.2886 | 85.2667 | 2021 | 5 | Tuesday | 4.0 | 10281.000000 |
31 | 7316356412 | 1926290.0 | 2013.0 | ford | f-150 xlt | excellent | 6 cylinders | gas | 128000.0 | clean | ... | https://images.craigslist.org/00404_l4loxHvdQe... | Tripura | 2021-05-03 19:02:03+00:00 | 23.4924 | 91.1187 | 2021 | 5 | Monday | 8.0 | 16000.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
426875 | 7301591192 | 3029710.0 | 2019.0 | nissan | maxima s sedan 4d | good | 6 cylinders | gas | 32226.0 | clean | ... | https://images.craigslist.org/00o0o_iiraFnHg8q... | Delhi | 2021-04-04 09:21:31+00:00 | 28.7876 | 76.8644 | 2021 | 4 | Sunday | 2.0 | 16113.000000 |
426876 | 7301591187 | 3928680.0 | 2020.0 | volvo | s60 t5 momentum sedan 4d | good | Unknown | gas | 12029.0 | clean | ... | https://images.craigslist.org/00x0x_15sbgnxCIS... | Punjab | 2021-04-04 09:21:29+00:00 | 32.0715 | 75.1915 | 2021 | 4 | Sunday | 1.0 | 12029.000000 |
426877 | 7301591147 | 4493750.0 | 2020.0 | cadillac | xt4 sport suv 4d | good | Unknown | diesel | 4174.0 | clean | ... | https://images.craigslist.org/00L0L_farM7bxnxR... | Madhya Pradesh | 2021-04-04 09:21:17+00:00 | 21.5646 | 75.7247 | 2021 | 4 | Sunday | 1.0 | 4174.000000 |
426878 | 7301591140 | 3723220.0 | 2018.0 | lexus | es 350 sedan 4d | good | 6 cylinders | gas | 30112.0 | clean | ... | https://images.craigslist.org/00z0z_bKnIVGLkDT... | Maharashtra | 2021-04-04 09:21:11+00:00 | 21.4183 | 75.4719 | 2021 | 4 | Sunday | 3.0 | 10037.333333 |
426879 | 7301591129 | 3928680.0 | 2019.0 | bmw | 4 series 430i gran coupe | good | Unknown | gas | 22716.0 | clean | ... | https://images.craigslist.org/00Y0Y_lEUocjyRxa... | Jharkhand | 2021-04-04 09:21:07+00:00 | 24.2616 | 83.7688 | 2021 | 4 | Sunday | 2.0 | 11358.000000 |
426678 rows × 26 columns
- The above DataFrame
sp
represents the cleaned data set.
sp.describe()
id | price | year | odometer | latitude | longitude | posting_year | posting_month | car_age | avg_yearly_mileage | |
---|---|---|---|---|---|---|---|---|---|---|
count | 4.266780e+05 | 4.266780e+05 | 426678.000000 | 4.266780e+05 | 426678.000000 | 426678.000000 | 426678.0 | 426678.000000 | 426678.000000 | 4.266780e+05 |
mean | 7.311495e+09 | 9.822740e+06 | 2011.236584 | 9.786411e+04 | 22.801159 | 79.012884 | 2021.0 | 4.266302 | 9.763416 | 1.230994e+04 |
std | 4.381300e+06 | 1.564808e+09 | 9.439536 | 2.111762e+05 | 6.532559 | 5.208544 | 0.0 | 0.442024 | 9.439536 | 3.861835e+04 |
min | 7.301583e+09 | 9.200000e+02 | 1900.000000 | 0.000000e+00 | 6.800300 | 68.400200 | 2021.0 | 4.000000 | 0.000000 | 0.000000e+00 |
25% | 7.308145e+09 | 1.026400e+06 | 2008.000000 | 3.817000e+04 | 17.796600 | 75.719100 | 2021.0 | 4.000000 | 4.000000 | 7.063000e+03 |
50% | 7.312622e+09 | 1.990460e+06 | 2013.000000 | 8.600000e+04 | 24.058950 | 77.353900 | 2021.0 | 4.000000 | 8.000000 | 1.072728e+04 |
75% | 7.315254e+09 | 3.402170e+06 | 2017.000000 | 1.330000e+05 | 28.563600 | 80.927100 | 2021.0 | 5.000000 | 13.000000 | 1.500000e+04 |
max | 7.317101e+09 | 4.798934e+11 | 2021.000000 | 1.000000e+07 | 35.997600 | 97.398000 | 2021.0 | 5.000000 | 121.000000 | 1.000000e+07 |
.describe()
function generates descriptive statistics for the DataFramesp
.
3. Exploratory Data Analysis (EDA)¶
Hypothesis 1:- Vehicle pricing and demand vary based on condition, fuel type, mileage, and vehicle type, impacting transaction volumes and average prices.¶
Pricing Analysis¶
# Correlation between price and numerical attributes
correlations = sp.corr()['price'].sort_values(ascending=False)
print("\nCorrelation with Price:")
print(correlations)
Correlation with Price: price 1.000000 odometer 0.010112 car_age 0.004899 avg_yearly_mileage 0.000166 latitude -0.001279 longitude -0.001498 posting_month -0.001826 id -0.002846 year -0.004899 posting_year NaN Name: price, dtype: float64
- Calculate the correlation between the price column and all other numerical attributes in the DataFrame.
- Sort the correlations in descending order and display the result.
Price trends based on condition and fuel¶
# Summary of pricing trends by condition and fuel type
price_summary = sp.groupby(['condition', 'fuel']).agg(
avg_price=('price', 'mean'),
median_price=('price', 'median'),
count=('price', 'size')
).reset_index()
print("\nOptimal Pricing Summary:")
print(price_summary)
Optimal Pricing Summary: condition fuel avg_price median_price count 0 Unknown Unknown 2.657031e+06 2309580.0 1687 1 Unknown diesel 4.588512e+06 4045190.0 6839 2 Unknown electric 3.249356e+06 1990460.0 159 3 Unknown gas 3.439129e+06 1990460.0 25470 4 Unknown hybrid 2.589077e+06 2051780.0 447 5 Unknown other 4.122992e+06 2824250.0 2067 6 excellent Unknown 7.312450e+06 7312450.0 4 7 excellent diesel 1.829421e+07 4036280.0 12867 8 excellent electric 2.823478e+06 1990460.0 526 9 excellent gas 1.226798e+07 1862110.0 187480 10 excellent hybrid 1.543248e+06 1283270.0 2495 11 excellent other 3.043773e+06 1990460.0 5807 12 fair diesel 2.011004e+06 1014550.0 636 13 fair electric 1.415820e+06 488000.0 4 14 fair gas 7.225017e+07 372460.0 6960 15 fair hybrid 5.042757e+05 475100.0 37 16 fair other 2.440227e+09 449480.0 65 17 good diesel 2.399393e+07 2824250.0 7752 18 good electric 3.471733e+06 3851650.0 845 19 good gas 3.916336e+06 1990460.0 110943 20 good hybrid 2.389665e+06 2130830.0 1810 21 good other 3.679336e+06 3594780.0 22287 22 like new diesel 4.903614e+06 4993550.0 1877 23 like new electric 3.671779e+06 2593540.0 154 24 like new gas 4.450640e+06 1990460.0 24226 25 like new hybrid 2.272601e+06 1862110.0 373 26 like new other 3.952193e+06 3466340.0 431 27 new diesel 7.498743e+06 8352665.0 88 28 new electric 3.191558e+06 769610.0 11 29 new gas 3.156952e+06 2117890.0 1604 30 new hybrid 4.556500e+06 5842160.0 22 31 new other 2.078390e+06 1990460.0 56 32 salvage diesel 1.845805e+06 1284190.0 42 33 salvage electric 1.653625e+06 1653625.0 2 34 salvage gas 5.570343e+05 256870.0 583 35 salvage hybrid 5.446691e+05 192610.0 11 36 salvage other 1.692764e+05 128440.0 11
.groupby()
function groups the data by condition and fuel, then calculate the average price, median price, and count of transactions for each group.- Display the summarized pricing trends by condition and fuel type.
Average Price by Mileage¶
sp['mileage_category'] = pd.cut(sp['avg_yearly_mileage'],
bins=[0, 5000, 10000, 20000, 30000, 80000, 10000000],
labels=['Very Low', 'Low', 'Moderate', 'Considerate', 'High', 'Very High'])
mileage_price = sp.groupby('mileage_category')['price'].mean()
print("Average Price by Mileage Category:")
print(mileage_price)
Average Price by Mileage Category: mileage_category Very Low 2.788424e+07 Low 7.393663e+06 Moderate 6.344824e+06 Considerate 2.800212e+06 High 6.066545e+06 Very High 1.310558e+08 Name: price, dtype: float64
.groupby()
groups the data by mileage_category and calculate the average price for each category.- Display the resulting average price for each mileage category.
Pricing Strategy Based on Vehicle Type¶
# Analyze pricing by vehicle type
type_pricing = sp.groupby('car_type').agg(
total_transactions=('id', 'count'),
avg_price=('price', 'mean'),
median_price=('price', 'median')
).reset_index()
print("\nPricing by Vehicle Type:")
print(type_pricing)
Pricing by Vehicle Type: car_type total_transactions avg_price median_price 0 SUV 97421 1.291095e+07 1926290.0 1 Unknown 24924 1.656132e+07 1990460.0 2 bus 619 2.206170e+06 1796930.0 3 convertible 8910 3.103358e+07 1990460.0 4 coupe 22104 2.993457e+06 2246410.0 5 hatchback 18777 1.832439e+06 1796930.0 6 mini-van 7007 1.283773e+06 924590.0 7 offroad 725 2.090865e+06 1643800.0 8 other 23014 3.349307e+06 3209560.0 9 pickup 50507 1.652880e+07 3466340.0 10 sedan 106733 9.306231e+06 1412630.0 11 truck 43467 4.141187e+06 2632560.0 12 van 10284 2.396104e+06 1990460.0 13 wagon 12186 1.787653e+06 1669420.0
.groupby()
groups the data bycar_type
, calculating total transactions, average price, and median price for each vehicle type.- Reset the index to create a clean DataFrame for pricing analysis by vehicle type.
Analysis¶
Condition and Price: Vehicles in excellent condition have the highest prices across all fuel types, especially diesel (₹18.29M) and gas (₹12.27M). Fair and salvage conditions are priced significantly lower.
Electric Vehicles: EVs in excellent condition are priced lower (₹2.82M) than diesel and gas counterparts, making them relatively affordable.
Mileage and Pricing: Vehicles with very high mileage have the highest average prices (₹131.06M), likely luxury or high-performance models. Moderate and low mileage vehicles are priced in the ₹6M–₹7.4M range.
Vehicle Type: SUVs and sedans dominate in transaction volume, while convertibles (₹31M) and pickup trucks (₹16.5M) have the highest average prices. Hatchbacks and mini-vans are the most affordable categories.
Key Insights¶
- Vehicle condition is the strongest determinant of price.
- EVs are emerging as affordable alternatives to diesel and gas vehicles.
- High mileage does not always lower price, especially for luxury vehicles.
- SUVs and sedans are the most in-demand segments, while hatchbacks appeal to budget-conscious buyers.
Recommendations¶
- Promote SUVs and Sedans: Focus marketing efforts on these high-demand segments.
- Leverage EV Affordability: Encourage EV adoption by emphasizing cost savings and environmental benefits.
- Refurbish Vehicles: Improve fair and salvage vehicles for resale to enhance their market value.
- Target Mileage Preferences: Highlight moderate mileage vehicles as the ideal choice for reliability and cost-effectiveness.
- Expand Affordable Options: Offer financing for hatchbacks and mini-vans to attract budget-conscious buyers.
Data Visualization¶
Compare Price and number of cars Across Car Types¶
car_type_stats = sp.groupby('car_type').agg(
avg_price=('price', 'mean'),
num_cars=('id', 'count')
).reset_index()
car_type_stats
car_type | avg_price | num_cars | |
---|---|---|---|
0 | SUV | 1.291095e+07 | 97421 |
1 | Unknown | 1.656132e+07 | 24924 |
2 | bus | 2.206170e+06 | 619 |
3 | convertible | 3.103358e+07 | 8910 |
4 | coupe | 2.993457e+06 | 22104 |
5 | hatchback | 1.832439e+06 | 18777 |
6 | mini-van | 1.283773e+06 | 7007 |
7 | offroad | 2.090865e+06 | 725 |
8 | other | 3.349307e+06 | 23014 |
9 | pickup | 1.652880e+07 | 50507 |
10 | sedan | 9.306231e+06 | 106733 |
11 | truck | 4.141187e+06 | 43467 |
12 | van | 2.396104e+06 | 10284 |
13 | wagon | 1.787653e+06 | 12186 |
# Data preparation
car_type_stats = sp.groupby('car_type').agg(
avg_price=('price', 'mean'),
num_cars=('id', 'count')
).reset_index()
# Sort by average price
car_type_stats = car_type_stats.sort_values('avg_price', ascending=False)
# Extract data for plotting
car_types = car_type_stats['car_type']
avg_price = car_type_stats['avg_price']
num_cars = car_type_stats['num_cars']
# Create the plot
fig, ax1 = plt.subplots(figsize=(12, 6))
# Bar chart for average price
bar_width = 0.4
x = np.arange(len(car_types))
ax1.bar(x - bar_width / 2, avg_price, width=bar_width, color='lightblue', label='Average Price (₹)')
ax1.set_ylabel('Average Price (₹)', fontsize=12, color='blue')
ax1.set_xlabel('Car Type', fontsize=12)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_xticks(x)
ax1.set_xticklabels(car_types, rotation=45, ha='right')
# Bar chart for number of cars
ax2 = ax1.twinx()
ax2.bar(x + bar_width / 2, num_cars, width=bar_width, color='salmon', label='Number of Cars')
ax2.set_ylabel('Number of Cars', fontsize=12, color='red')
ax2.tick_params(axis='y', labelcolor='red')
# Add title and legend
plt.title('Average Price and Number of Cars by Car Type', fontsize=14)
fig.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)
# Adjust layout
plt.tight_layout()
plt.show()
Key Insights¶
- Convertibles: Highest average price (~₹3,000,000), but low number of cars.
- Sedans and SUVs: High numbers of cars with sedans leading (100,000) and SUVs (80,000). Both show strong market demand.
- Trucks: Significant number of cars (~60,000) but lower average price compared to sedans and SUVs.
- Unknown Category: Moderate average price with low number of cars.
- Other Car Types: Vans, buses, offroad vehicles, hatchbacks, wagons, and mini-vans have lower average prices and low numbers of cars.
Recommendations¶
- Target Sedans and SUVs: Invest in marketing and sales strategies for sedans and SUVs due to their high demand.
- Premium Strategy for Convertibles: Focus on premium pricing for convertibles to capitalize on their high average price.
- Market Expansion for Trucks: Explore opportunities to increase the market share of trucks, considering their significant presence and moderate pricing.
- Investigate Unknown Category: Understand the characteristics and potential of the "unknown" category for market opportunities.
- Niche Marketing for Low-Volume Types: For car types with lower numbers and prices (vans, buses, offroad vehicles, hatchbacks, wagons, mini-vans), consider niche marketing strategies or cost reduction measures to improve profitability.
Hypothesis 2:- Newer cars have higher demand and prices with showing higher transaction volumes, while older cars see declining demand and value with decrease in market desirability.¶
Age of Cars and Its Impact on Demand¶
age_analysis = sp.groupby('car_age').agg(
total_transactions=('id', 'count'),
avg_price=('price', 'mean'),
median_price=('price', 'median')
).reset_index().sort_values('total_transactions', ascending=False)
print("\nCar Age Analysis (Demand and Pricing):")
print(age_analysis)
Car Age Analysis (Demand and Pricing): car_age total_transactions avg_price median_price 4 4.0 36421 3.298168e+06 3067260.0 3 3.0 36370 3.596957e+06 3466340.0 8 8.0 31925 2.176866e+06 1990460.0 6 6.0 31538 3.245019e+06 2247340.0 5 5.0 30434 2.970425e+06 2503200.0 .. ... ... ... ... 104 106.0 1 6.421000e+05 642100.0 108 116.0 1 5.127400e+05 512740.0 107 112.0 1 9.677000e+04 96770.0 102 103.0 1 2.054720e+06 2054720.0 78 78.0 1 1.252520e+06 1252520.0 [113 rows x 4 columns]
.groupby()
function groups the data bycar_age
, then calculate the total number of transactions, average price, and median price for each car age.- Sort the results by the total number of transactions in descending order and display the car age analysis (demand and pricing).
Demand for Newer vs. Older Cars¶
car_age_demand = sp['car_age'].value_counts().reset_index()
car_age_demand.columns = ['car_age', 'Demand'] # Rename columns
car_age_demand = car_age_demand.sort_values('Demand', ascending=False) # Sort by demand in descending order
print("Car Age Demand:")
print(car_age_demand.head(30))
Car Age Demand: car_age Demand 0 4.0 36421 1 3.0 36370 2 8.0 31925 3 6.0 31538 4 5.0 30434 5 7.0 30287 6 2.0 25375 7 9.0 23898 8 10.0 20341 9 1.0 19298 10 13.0 17150 11 11.0 15829 12 14.0 14873 13 15.0 12763 14 12.0 12185 15 16.0 10621 16 17.0 8971 17 18.0 7151 18 19.0 5587 19 20.0 4443 20 21.0 3572 21 22.0 3094 22 0.0 2396 23 23.0 1988 24 24.0 1724 25 25.0 1302 26 26.0 1246 27 27.0 968 28 28.0 712 29 29.0 626
.value_counts().reset_index()
counts the number of transactions for each unique car_age value.- Rename the columns to car_age and Demand.
- Sort the results by Demand in descending order to display the car age demand.
Analysis¶
Peak Demand for Mid-Age Cars (3–6 Years)
- The highest demand is for cars aged 3 to 6 years, with peak transactions at 4 years (36,421) and 3 years (36,370).
- These cars also have high pricing, with 4-year-old cars averaging ₹3.29M and 3-year-old cars at ₹3.60M.
Gradual Decline in Demand After 10 Years
- Demand drops significantly after 10 years, with 13-year-old cars at 17,150 transactions and cars aged 20+ years below 5,000 transactions.
- Prices for older cars drop drastically, with many below ₹1M.
Rare Demand for Vintage & Classic Cars
- Cars aged 100+ years still have occasional transactions, with prices varying widely, indicating niche collectors’ interest.
Key Insights¶
3–6-Year-Old Cars Hold the Best Resale Value
- These cars balance affordability and modern features, making them the most in-demand.
After 10 Years, Depreciation Accelerates
- A sharp decline in both demand and pricing indicates that cars beyond a decade struggle to attract buyers.
Niche Market for Vintage Cars
- Rare sales of 100+ year-old cars suggest collector demand, though limited.
Recommendations¶
Target Mid-Age Cars for Sales & Promotions
- Focus inventory acquisition on 3–6-year-old cars, as they offer the best resale potential.
Offer Trade-In Incentives for Older Cars
- Encourage sellers of 10+ year-old cars to trade in for newer models, increasing inventory turnover.
Leverage Classic Car Auctions & Collectors’ Market
- For cars older than 50+ years, explore specialized vintage auctions or collector marketplaces.
Highlight Cost-Effective Options for Budget Buyers
- Promote 8–10-year-old cars as budget-friendly alternatives for cost-conscious buyers.
Data Visualization¶
1. Correlate car age with odometer¶
plt.figure(figsize=(10, 6))
sns.scatterplot(x='car_age', y='odometer', data=sp, alpha=0.6)
plt.title('Car Age vs. Odometer Reading')
plt.xlabel('Car Age (years)')
plt.ylabel('Odometer Reading (km)')
plt.grid(True)
plt.tight_layout()
plt.show()
Key Insights¶
- Car Age and Odometer Reading: The scatter plot shows a high concentration of cars aged between 0 and 20 years with odometer readings below 0.2 million kilometers.
- High Mileage Cars: There are several cars with odometer readings near 1.0 million kilometers, regardless of age, indicating high mileage.
- Older Cars: Beyond 20 years of age, the number of cars decreases significantly, and their odometer readings are generally lower.
- Outliers: There are a few outliers with high odometer readings and older ages, but these are rare.
Recommendations¶
- Focus on Younger Cars: When purchasing a used car, consider those within the 0-20 year range with odometer readings below 0.2 million kilometers, as they are most common and likely in better condition.
- Avoid High Mileage Cars: Be cautious of cars with extremely high odometer readings (near 1.0 million kilometers) regardless of age, as they may have undergone significant wear and tear.
- Consider Well-Maintained Older Cars: Older cars (beyond 20 years) with low odometer readings might be rare but could be worth considering if they have been well-maintained.
Hypothesis 3:- Geographic location influences car demand and pricing, States with stronger economies tend to have higher average car prices eventually drop in weaker economies.¶
Demand for Cars Across Regions¶
# Count transactions by state
state_demand = sp['state'].value_counts()
print("\nDemand by State:")
print(state_demand)
Demand by State: Uttar Pradesh 45552 Maharashtra 45236 Karnataka 30474 Punjab 30441 Delhi 30368 Haryana 30167 Tamil Nadu 18116 West Bengal 18039 Telangana 15352 Kerala 15293 Madhya Pradesh 15199 Rajasthan 15121 Andhra Pradesh 15065 Gujarat 14985 Odisha 12184 Jharkhand 6074 Bihar 6031 Chhattisgarh 6001 Assam 5986 Uttarakhand 5947 Andaman and Nicobar Islands 3077 Sikkim 3065 Ladakh 3060 Puducherry 3045 Jammu and Kashmir 3028 Himachal Pradesh 3016 Meghalaya 3015 Mizoram 3009 Manipur 2990 Chandigarh 2984 Arunachal Pradesh 2979 Dadra and Nagar Haveli and Daman and Diu 2978 Tripura 2960 Goa 2922 Lakshadweep 2919 Name: state, dtype: int64
.value_counts()
counts the number of transactions for each unique value in the state column.- Display the demand (transaction count) by state.
Top States for High-Value Cars¶
state_avg_price = sp.groupby('state')['price'].mean().sort_values(ascending=False)
print("Average Price by State:")
print(state_avg_price)
Average Price by State: state Lakshadweep 1.667703e+08 Chandigarh 1.372737e+08 Goa 6.442383e+07 West Bengal 2.903243e+07 Haryana 2.018441e+07 Karnataka 1.617193e+07 Kerala 1.381462e+07 Uttar Pradesh 8.677833e+06 Dadra and Nagar Haveli and Daman and Diu 7.327605e+06 Maharashtra 5.218439e+06 Madhya Pradesh 3.720367e+06 Tamil Nadu 3.271037e+06 Delhi 3.019001e+06 Chhattisgarh 2.672705e+06 Gujarat 2.533112e+06 Punjab 2.515638e+06 Rajasthan 2.509588e+06 Assam 2.472224e+06 Jharkhand 2.458025e+06 Sikkim 2.428607e+06 Andaman and Nicobar Islands 2.422289e+06 Manipur 2.418222e+06 Jammu and Kashmir 2.414838e+06 Arunachal Pradesh 2.408561e+06 Tripura 2.407741e+06 Odisha 2.405977e+06 Telangana 2.404359e+06 Puducherry 2.402517e+06 Andhra Pradesh 2.402467e+06 Uttarakhand 2.399425e+06 Meghalaya 2.393725e+06 Ladakh 2.388141e+06 Bihar 2.384134e+06 Himachal Pradesh 2.368753e+06 Mizoram 2.345368e+06 Name: price, dtype: float64
.groupby()
function groups the data by state and calculate the average price for each state.- Sort the results in descending order and display the average price by state.
Regional Demand Analysis¶
# Analyze demand by state
regional_demand = sp.groupby('state').agg(
total_transactions=('id', 'count'),
avg_price=('price', 'mean'),
median_price=('price', 'median')
).reset_index().sort_values('total_transactions', ascending=False)
print("\nRegional Demand Analysis:")
print(regional_demand)
Regional Demand Analysis: state total_transactions \ 32 Uttar Pradesh 45552 20 Maharashtra 45236 15 Karnataka 30474 26 Punjab 30441 8 Delhi 30368 11 Haryana 30167 29 Tamil Nadu 18116 34 West Bengal 18039 30 Telangana 15352 16 Kerala 15293 19 Madhya Pradesh 15199 27 Rajasthan 15121 1 Andhra Pradesh 15065 10 Gujarat 14985 24 Odisha 12184 14 Jharkhand 6074 4 Bihar 6031 6 Chhattisgarh 6001 3 Assam 5986 33 Uttarakhand 5947 0 Andaman and Nicobar Islands 3077 28 Sikkim 3065 17 Ladakh 3060 25 Puducherry 3045 13 Jammu and Kashmir 3028 12 Himachal Pradesh 3016 22 Meghalaya 3015 23 Mizoram 3009 21 Manipur 2990 5 Chandigarh 2984 2 Arunachal Pradesh 2979 7 Dadra and Nagar Haveli and Daman and Diu 2978 31 Tripura 2960 9 Goa 2922 18 Lakshadweep 2919 avg_price median_price 32 8.677833e+06 1990460.0 20 5.218439e+06 1990460.0 15 1.617193e+07 1990460.0 26 2.515638e+06 1990460.0 8 3.019001e+06 1990460.0 11 2.018441e+07 1990460.0 29 3.271037e+06 1990460.0 34 2.903243e+07 1990460.0 30 2.404359e+06 1990460.0 16 1.381462e+07 1990460.0 19 3.720367e+06 1990460.0 27 2.509588e+06 1990460.0 1 2.402467e+06 1990460.0 10 2.533112e+06 1990460.0 24 2.405977e+06 1990460.0 14 2.458025e+06 1990460.0 4 2.384134e+06 1990460.0 6 2.672705e+06 1990460.0 3 2.472224e+06 1990460.0 33 2.399425e+06 1990460.0 0 2.422289e+06 1990460.0 28 2.428607e+06 1990460.0 17 2.388141e+06 1990460.0 25 2.402517e+06 1990460.0 13 2.414838e+06 1990460.0 12 2.368753e+06 1990460.0 22 2.393725e+06 1990460.0 23 2.345368e+06 1990460.0 21 2.418222e+06 1990460.0 5 1.372737e+08 1990460.0 2 2.408561e+06 1990460.0 7 7.327605e+06 1990460.0 31 2.407741e+06 1990460.0 9 6.442383e+07 1990460.0 18 1.667703e+08 1990460.0
.groupby()
groups the data by state, calculating total transactions, average price, and median price for each state.- Reset the index and sort the results by
total_transactions
in descending order to analyze regional demand.
Geographic Pricing Optimization¶
# Average price by state
state_pricing = sp.groupby('state').agg(
avg_price=('price', 'mean')
).reset_index()
print("\nGeographic Pricing Analysis:")
print(state_pricing)
Geographic Pricing Analysis: state avg_price 0 Andaman and Nicobar Islands 2.422289e+06 1 Andhra Pradesh 2.402467e+06 2 Arunachal Pradesh 2.408561e+06 3 Assam 2.472224e+06 4 Bihar 2.384134e+06 5 Chandigarh 1.372737e+08 6 Chhattisgarh 2.672705e+06 7 Dadra and Nagar Haveli and Daman and Diu 7.327605e+06 8 Delhi 3.019001e+06 9 Goa 6.442383e+07 10 Gujarat 2.533112e+06 11 Haryana 2.018441e+07 12 Himachal Pradesh 2.368753e+06 13 Jammu and Kashmir 2.414838e+06 14 Jharkhand 2.458025e+06 15 Karnataka 1.617193e+07 16 Kerala 1.381462e+07 17 Ladakh 2.388141e+06 18 Lakshadweep 1.667703e+08 19 Madhya Pradesh 3.720367e+06 20 Maharashtra 5.218439e+06 21 Manipur 2.418222e+06 22 Meghalaya 2.393725e+06 23 Mizoram 2.345368e+06 24 Odisha 2.405977e+06 25 Puducherry 2.402517e+06 26 Punjab 2.515638e+06 27 Rajasthan 2.509588e+06 28 Sikkim 2.428607e+06 29 Tamil Nadu 3.271037e+06 30 Telangana 2.404359e+06 31 Tripura 2.407741e+06 32 Uttar Pradesh 8.677833e+06 33 Uttarakhand 2.399425e+06 34 West Bengal 2.903243e+07
.groupby()
groups the data by state and calculate the average price for each state.- Reset the index to create a clean DataFrame for state-wise pricing analysis.
Analysis¶
High Demand in Densely Populated and Economically Strong States
- Top 3 states by transaction volume:
- Uttar Pradesh (45,552 transactions)
- Maharashtra (45,236 transactions)
- Karnataka (30,474 transactions)
- These states have high urbanization and purchasing power, leading to more car sales.
- Top 3 states by transaction volume:
Luxury Market in Chandigarh, Lakshadweep, and Goa
- Lakshadweep: ₹166.7M avg price
- Chandigarh: ₹137.2M avg price
- Goa: ₹64.4M avg price
- These regions exhibit significantly higher average prices, likely due to a strong luxury car segment.
Affordable Car Markets in Mid-Tier States
- States like Telangana, Andhra Pradesh, Odisha, and Rajasthan have an average price range between ₹2.4M to ₹2.5M, indicating higher demand for mid-range and budget cars.
Outliers in Pricing
- Some regions, despite low demand, have high average prices due to occasional high-value transactions (e.g., luxury and collector cars).
Key Insights¶
Urban States Drive High Demand
- Major economic hubs (e.g., Maharashtra, Karnataka, and Delhi) have the highest transaction volumes, reinforcing the connection between car demand and urban prosperity.
Luxury Car Markets Are Region-Specific
- High average prices in Chandigarh, Goa, and Lakshadweep indicate a preference for premium vehicles in these regions.
Price Sensitivity Varies Across States
- Budget-friendly states like Odisha, Jharkhand, and Bihar cater to cost-conscious buyers, while metro regions see higher average prices.
Luxury Demand in Select States Creates Market Segmentation
- West Bengal, Haryana, and Kerala also show higher-than-average pricing, indicating selective premium car purchases.
Recommendations¶
Increase Inventory & Marketing in High-Demand States
- Prioritize Uttar Pradesh, Maharashtra, and Karnataka for increased stock allocation and marketing.
Expand Luxury Offerings in High-Value Markets
- Focus on high-end vehicles in Lakshadweep, Chandigarh, Goa, and West Bengal to cater to premium buyers.
Introduce Budget Car Campaigns in Cost-Sensitive Regions
- Provide financing options and discounts in Odisha, Jharkhand, and Bihar to tap into the price-conscious segment.
Localized Digital & Offline Marketing Strategies
- Metro regions: Promote SUVs and luxury cars.
- Tier 2 & 3 cities: Focus on affordable and fuel-efficient models.
Data Visualization¶
Compare Average Price by State¶
state_stats = sp.groupby('state').agg(
avg_price=('price', 'mean'),
most_common_type=('car_type', lambda x: x.mode()[0] if not x.isnull().all() else 'Unknown'),
total_listings=('id', 'count')
).reset_index()
# Visualize Average Price by State
plt.figure(figsize=(12, 8))
sns.barplot(
y='state',
x='avg_price',
data=state_stats.sort_values('avg_price', ascending=False),
palette='coolwarm'
)
plt.title('Average Car Price by State', fontsize=14)
plt.xlabel('Average Price (₹)', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()
Key Insights¶
- Lakshadweep: Has the highest average car price, significantly above any other state.
- Chandigarh and Goa: Also show relatively high average car prices.
- Moderate Average Prices: States like West Bengal, Haryana, Karnataka, and Kerala have moderate average car prices.
- Lower Average Prices: The majority of states have lower average car prices, with Mizoram having the lowest.
Recommendations¶
- Focus on High-Price States: Car manufacturers and dealers could target states like Lakshadweep, Chandigarh, and Goa due to their higher average car prices, indicating a wealthier customer base.
- Mid-Range Models: States with moderate average car prices, such as West Bengal and Haryana, can be ideal markets for mid-range car models.
- Budget Car Models: For states with lower average car prices, such as Mizoram and Himachal Pradesh, focus on promoting budget-friendly models.
Highlight Regions with High Prevalence of Specific Manufacturers or Models¶
top_manufacturers = sp['manufacturer'].value_counts().head(5).index
manufacturer_data = sp[sp['manufacturer'].isin(top_manufacturers)]
plt.figure(figsize=(14, 8))
sns.countplot(
y='state',
hue='manufacturer',
data=manufacturer_data,
order=state_stats['state']
)
plt.title('Prevalence of Top Manufacturers by State', fontsize=14)
plt.xlabel('Count', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.legend(title='Manufacturer', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Key Insights¶
- Ford Dominance: Ford (green) has the highest presence in many states, especially Maharashtra, Uttar Pradesh, and Tamil Nadu.
- Toyota's Significant Presence: Toyota (orange) is also prominent in states like Maharashtra, Karnataka, and Tamil Nadu.
- Chevrolet's Moderate Presence: Chevrolet (blue) has a consistent but moderate presence across most states.
- Nissan and Honda: Both Nissan (red) and Honda (purple) have lower counts compared to Ford and Toyota, with Honda having the least presence overall.
- Major Markets: Maharashtra, Uttar Pradesh, and Tamil Nadu have the highest overall counts of vehicles, indicating a larger market or higher vehicle ownership in these states.
Recommendations¶
- Expand in Low-Count States: Manufacturers looking to grow their market presence should focus on states with lower vehicle counts, such as Lakshadweep, Ladakh, and Arunachal Pradesh.
- Targeted Marketing for Nissan and Honda: These manufacturers should implement targeted marketing strategies in states where Ford and Toyota dominate to increase their market share.
- Capitalize on High-Count States: For states with high vehicle counts like Maharashtra, Uttar Pradesh, and Tamil Nadu, prioritize new product launches and dealership expansions to leverage the existing large market.
Hypothesis 4:- Fuel type affects demand and pricing with Diesel and gas vehicles may see higher sales and prices, while others can have minimal impact.¶
Yearly Mileage vs. Car Age¶
sp['mileage_category'] = pd.cut(sp['avg_yearly_mileage'],
bins=[0, 5000, 10000, 20000, 30000, 80000, 10000000],
labels=['Very Low', 'Low', 'Moderate', 'Considerate', 'High', 'Very High'])
mileage_vs_age = sp.groupby(['car_age', 'mileage_category'])['id'].count().unstack()
print("Mileage vs Car Age:")
print(mileage_vs_age)
Mileage vs Car Age: mileage_category Very Low Low Moderate Considerate High Very High car_age 0.0 1809 241 103 12 13 82 1.0 5483 4037 5182 2591 1827 146 2.0 4974 6738 9323 2792 1481 43 3.0 4142 12372 14177 3882 1691 57 4.0 3995 10768 14149 5603 1786 56 ... ... ... ... ... ... ... 116.0 1 0 0 0 0 0 118.0 12 0 0 0 0 0 119.0 0 1 0 0 0 0 120.0 3 0 0 0 0 0 121.0 11 1 0 0 0 0 [113 rows x 6 columns]
sp['mileage_category']
creates amileage_category
column by binningavg_yearly_mileage
into defined ranges with corresponding labels (Very Low, Low, etc.)..groupby()
groups the data by car_age and mileage_category, counting the number of transactions for each combination.- Use
.unstack()
to pivot the results for easier analysis of mileage categories across car ages.
Fuel Type Trends by Region¶
fuel_region_trends = sp.groupby(['state', 'fuel'])['id'].count().unstack()
print("Fuel Type Trends by Region:")
print(fuel_region_trends)
Fuel Type Trends by Region: fuel Unknown diesel electric gas \ state Andaman and Nicobar Islands 10 226 15 2566 Andhra Pradesh 64 1049 63 12595 Arunachal Pradesh 11 202 6 2516 Assam 26 445 28 4993 Bihar 24 398 18 5083 Chandigarh 11 208 14 2504 Chhattisgarh 20 420 19 5036 Dadra and Nagar Haveli and Daman and Diu 10 222 16 2508 Delhi 126 2151 126 25455 Goa 11 230 10 2432 Gujarat 64 1031 62 12522 Haryana 139 2161 135 25178 Himachal Pradesh 13 235 8 2507 Jammu and Kashmir 16 200 6 2561 Jharkhand 28 423 26 5072 Karnataka 145 2097 108 25581 Kerala 61 1067 59 12754 Ladakh 8 235 12 2542 Lakshadweep 12 201 8 2481 Madhya Pradesh 50 1134 51 12647 Maharashtra 168 3128 184 37964 Manipur 9 220 13 2507 Meghalaya 10 226 4 2523 Mizoram 10 185 17 2543 Odisha 42 846 46 10149 Puducherry 13 232 14 2546 Punjab 110 2137 126 25531 Rajasthan 48 1083 63 12663 Sikkim 10 229 10 2577 Tamil Nadu 65 1284 74 15161 Telangana 50 1091 70 12877 Tripura 8 193 15 2510 Uttar Pradesh 200 3228 187 38092 Uttarakhand 20 444 20 4971 West Bengal 79 1240 68 15119 fuel hybrid other state Andaman and Nicobar Islands 35 225 Andhra Pradesh 191 1103 Arunachal Pradesh 29 215 Assam 59 435 Bihar 74 434 Chandigarh 26 221 Chhattisgarh 70 436 Dadra and Nagar Haveli and Daman and Diu 28 194 Delhi 340 2170 Goa 31 208 Gujarat 172 1134 Haryana 360 2194 Himachal Pradesh 34 219 Jammu and Kashmir 36 209 Jharkhand 79 446 Karnataka 371 2172 Kerala 185 1167 Ladakh 45 218 Lakshadweep 41 176 Madhya Pradesh 235 1082 Maharashtra 558 3234 Manipur 31 210 Meghalaya 42 210 Mizoram 30 224 Odisha 153 948 Puducherry 31 209 Punjab 372 2165 Rajasthan 178 1086 Sikkim 47 192 Tamil Nadu 228 1304 Telangana 192 1072 Tripura 35 199 Uttar Pradesh 549 3296 Uttarakhand 76 416 West Bengal 232 1301
.groupby()
Group the data by state and fuel, counting the number of transactions for each combination.- Use
.unstack()
to pivot the results, creating a table with fuel as columns and state as rows.
Influence of Fuel Type on Demand and Pricing¶
# Analyze demand and pricing by fuel type
fuel_analysis = sp.groupby('fuel').agg(
total_transactions=('id', 'count'),
avg_price=('price', 'mean'),
median_price=('price', 'median')
).reset_index()
print("\nFuel Type Analysis (Demand and Pricing):")
print(fuel_analysis)
Fuel Type Analysis (Demand and Pricing): fuel total_transactions avg_price median_price 0 Unknown 1691 2.668044e+06 2310590.0 1 diesel 30101 1.541457e+07 3594780.0 2 electric 1701 3.259814e+06 2901280.0 3 gas 357266 9.623518e+06 1925360.0 4 hybrid 5195 1.983751e+06 1757450.0 5 other 30724 8.743496e+06 3337990.0
.groupby()
groups the data by manufacturer and model, calculating total transactions, average price, and median price for each group.- Reset the index and sort the results by
total_transactions
in descending order. - Display the top 10 most popular car models and brands based on transaction count.
Correlation Analysis¶
# Correlation matrix
correlation_matrix = sp[['price', 'odometer', 'car_age']].corr()
print("\nCorrelation Analysis:")
print(correlation_matrix)
Correlation Analysis: price odometer car_age price 1.000000 0.010112 0.004899 odometer 0.010112 1.000000 0.158340 car_age 0.004899 0.158340 1.000000
.corr()
calculates the correlation matrix for the price, odometer, and car_age columns to analyze the relationships between these variables.
Analysis¶
1. Mileage vs. Car Age¶
- Newer Cars (0-1 years old): Most have Very Low to Moderate mileage, suggesting limited use.
- 2-4-year-old Cars: A shift towards Moderate to Considerate mileage indicates regular use.
- Older Cars (>10 years): Few transactions, mostly Very Low mileage, suggesting classic or less-used vehicles.
2. Fuel Type Trends by Region¶
- High Gas Demand Nationwide: Most transactions are for gas cars, especially in Delhi, Maharashtra, Karnataka, and Uttar Pradesh.
- Diesel is Strong in North & West: High diesel usage in Haryana, Punjab, Rajasthan, and Maharashtra, where long-distance travel is common.
- Electric Cars Are Emerging: Limited presence but growing in metro states like Delhi, Karnataka, Maharashtra.
3. Fuel Type Pricing & Demand Analysis¶
- Diesel Cars Command Higher Prices: ₹15.4M average price due to fuel efficiency & long-term savings.
- Electric Vehicles Are Expensive: ₹3.26M average price, but demand remains low.
- Hybrid Cars Are Affordable: ₹1.98M average price, appealing for fuel-conscious buyers.
4. Correlation Analysis¶
- Weak correlation between price and odometer/car age: Car value isn't significantly impacted by mileage or age alone, suggesting brand, model, and condition play key roles in pricing.
Key Insights¶
- Newer cars have lower mileage, making them attractive for buyers looking for less-used vehicles.
- Gas cars dominate the market, with diesel and hybrid cars preferred in certain regions.
- Electric cars have high pricing but low demand, indicating a slow adoption rate.
- Car age and odometer reading do not strongly influence pricing, reinforcing the importance of brand reputation and vehicle condition.
Recommendations¶
Target Gas Car Buyers in High-Demand States
- Focus on Delhi, Maharashtra, and Uttar Pradesh to maximize gas car sales.
Expand Diesel Inventory in North & West India
- Promote diesel cars in Punjab, Haryana, and Rajasthan, where demand is high.
Incentivize Electric Vehicle Adoption
- Offer subsidies or financing options in metro states (Delhi, Karnataka, Maharashtra) to encourage EV sales.
Emphasize Car Condition in Marketing
- Since pricing isn't strongly linked to age or odometer, highlight certifications, maintenance records, and brand trust in listings.
Data Visualization¶
Price Distribution by Fuel Type¶
plt.figure(figsize=(12, 6))
sns.boxplot(x='fuel', y='price', data=sp, showfliers=False, palette='Set3')
plt.title('Price Distribution by Fuel Type', fontsize=14)
plt.xlabel('Fuel Type', fontsize=12)
plt.ylabel('Price (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Key Insights¶
- Diesel Vehicles: Show the highest price variability with a wide interquartile range and the highest maximum prices among all fuel types.
- Hybrid Vehicles: Exhibit the lowest median price and the smallest price range, indicating more affordability and consistency in pricing.
- Electric Vehicles: Have a relatively high median price with a moderate price range, reflecting the newer technology and higher costs.
- Gas and Other Fuel Types: Both categories show moderate median prices and similar price distributions, indicating average variability in prices.
- Unknown Category: Displays price distribution similar to gas and other fuel types, with moderate prices and variability.
Recommendations¶
- For Consumers Seeking Affordability: Hybrid vehicles are the best option due to their lower median price and smaller price range.
- For Electric Vehicle Enthusiasts: Be prepared for a higher median price with moderate price variability. This category suits those interested in cutting-edge technology and sustainability.
- For a Wide Range of Price Options: Diesel vehicles offer a broad spectrum of prices, making them suitable for consumers willing to pay a premium and looking for diverse pricing options.
- Market Strategies for Sellers:
- Target Hybrids to Budget-Conscious Consumers: Emphasize the affordability and consistency in pricing of hybrid vehicles.
- Promote Electric Vehicles to Tech-Savvy Buyers: Highlight the benefits of electric vehicles for environmentally conscious and technology-forward consumers.
- Diversify Offerings with Diesel Vehicles: Cater to a wider audience with the varied price points available in diesel vehicles.
Correlation Between Car_age, Odometer, and Price¶
correlation_matrix = sp[['car_age', 'odometer', 'price', 'cylinders']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', cbar=True)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()
Key Insights¶
- Car Age vs. Price: There is no significant linear relationship between car age and price, as indicated by the correlation coefficient of 0.00.
- Odometer vs. Price: Similarly, the correlation between odometer reading and price is very weak (0.01), suggesting minimal linear relationship.
- Car Age vs. Odometer: There is a weak positive linear relationship between car age and odometer readings (0.16), indicating that older cars tend to have higher mileage, but this is not a strong predictor.
Recommendations¶
- Explore Additional Factors: Given the lack of strong correlations with price, consider analyzing other variables (e.g., car make and model, condition, location) to better understand what influences car prices.
- Combine Predictors: While age and odometer readings alone are not strong predictors, combining them with other factors in a multivariate analysis might yield more accurate predictions.
- Market Segmentation: Focus on different market segments (e.g., newer cars with low mileage vs. older cars with high mileage) and tailor marketing strategies accordingly.
- Regular Maintenance: For buyers, prioritize cars with a history of regular maintenance over simply relying on age or mileage as indicators of value.
- Further Research: Conduct further research to identify key predictors of car prices that are not captured in the current dataset.
Regional Trends in Fuel Preferences¶
fuel_region = sp.groupby(['state', 'fuel']).size().unstack(fill_value=0)
# Normalize to get percentages
fuel_region_percentage = fuel_region.div(fuel_region.sum(axis=1), axis=0)
# Heatmap of fuel preferences by region
plt.figure(figsize=(14, 8))
sns.heatmap(
fuel_region_percentage,
annot=True,
fmt=".1%",
cmap='coolwarm',
linewidths=0.5,
cbar_kws={'label': 'Fuel Preference (%)'}
)
plt.title('Fuel Preferences by State', fontsize=14)
plt.xlabel('Fuel Type', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()
Key Insights¶
- Fuel Preferences: Gas is the most preferred fuel type across all states, with high percentages (83.1% to 85.0%). Diesel follows, with lower but still significant percentages (6.1% to 7.9%).
- Alternative Fuels: Electric and hybrid fuel types have low preferences, indicating limited adoption. Electric ranges from 0.1% to 0.6%, and hybrid from 0.9% to 1.5%.
- Minor Categories: The "Unknown" and "Other" categories also have low preferences, with "Unknown" ranging from 0.3% to 0.5% and "Other" from 6.0% to 7.8%.
Recommendations¶
- Promote Alternative Fuels: Focus on promoting electric and hybrid vehicles to reduce dependency on gas and diesel. Highlight their environmental benefits and cost savings over time.
- Incentives and Policies: Implement policies and incentives to encourage the adoption of electric vehicles. Subsidies, tax breaks, and charging infrastructure can make electric vehicles more attractive to consumers.
- Raise Awareness: Increase awareness and education about the benefits and availability of alternative fuel vehicles. Campaigns and informational sessions can help dispel myths and encourage adoption.
- Infrastructure Development: Invest in the necessary infrastructure, such as electric vehicle charging stations and hybrid vehicle service centers, to support the growing market for alternative fuel vehicles.
Hypothesis 5:- Top brands dominate regionally and through availability, while premium brands retain higher prices due to their positioning. Low market can grow with targeted strategies.¶
List of Manufacturer with corresponding Model¶
# Group by 'manufacturer' and find min and max year
manufacturer_year_stats = sp.groupby(['manufacturer','model'])['year'].agg(['min', 'max']).reset_index()
# Rename columns for clarity
manufacturer_year_stats.columns = ['manufacturer','model', 'min_year', 'max_year']
# Display the result
print(manufacturer_year_stats)
manufacturer model min_year max_year 0 Unknown % 2002.0 2002.0 1 Unknown '50 Business Coupe 1950.0 1950.0 2 Unknown '99 H1 Hummer 1999.0 1999.0 3 Unknown ,,,,,,, 2009.0 2009.0 4 Unknown ,2012,2013, SOME 2014 MODELS 2014.0 2014.0 ... ... ... ... ... 30078 volvo xc90 t8 inscription 1300 miles 2020.0 2020.0 30079 volvo xc90 v8 2008.0 2011.0 30080 volvo xc90 v8 awd 2007.0 2007.0 30081 volvo xc90 v8 awdsuv 2006.0 2006.0 30082 volvo xc90t6 awd 24733 miles 2018.0 2018.0 [30083 rows x 4 columns]
- Using
.groupby()
groups the data bymanufacturer
andmodel
, then calculate the minimum and maximumyear
for each group. - Rename the columns for clarity.
- Display the result with the
min_year
andmax_year
for eachmanufacturer
andmodel
.
Data Visualization¶
Top Manufacturers¶
# Data preparation
manufacturer_counts = sp['manufacturer'].value_counts()
manufacturer_stats = sp.groupby('manufacturer').agg(
avg_price=('price', 'mean'),
total_listings=('id', 'count')
).reset_index()
# Filter for the top 10 manufacturers by frequency
top_manufacturers = manufacturer_counts.head(10).index
manufacturer_stats_top = manufacturer_stats[manufacturer_stats['manufacturer'].isin(top_manufacturers)]
manufacturer_stats_top = manufacturer_stats_top.set_index('manufacturer').loc[top_manufacturers] # Maintain order
avg_price = manufacturer_stats_top['avg_price']
total_listings = manufacturer_stats_top['total_listings']
# Create the plot
fig, ax1 = plt.subplots(figsize=(14, 6))
# Bar plot for total listings
bar_width = 0.4
x = np.arange(len(top_manufacturers))
ax1.bar(x - bar_width / 2, total_listings, width=bar_width, color='lightblue', label='Number of Listings')
ax1.set_ylabel('Number of Listings', fontsize=12, color='blue')
ax1.set_xlabel('Manufacturer', fontsize=12)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_xticks(x)
ax1.set_xticklabels(top_manufacturers, rotation=45, ha='right')
# Bar plot for average price
ax2 = ax1.twinx()
ax2.bar(x + bar_width / 2, avg_price, width=bar_width, color='salmon', label='Average Price (₹)')
ax2.set_ylabel('Average Price (₹)', fontsize=12, color='red')
ax2.tick_params(axis='y', labelcolor='red')
# Add title and legend
plt.title('Top 10 Manufacturers: Number of Listings vs. Average Price', fontsize=14)
fig.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)
# Adjust layout
plt.tight_layout()
plt.show()
Key Insights¶
- Ford Dominates Listings: Ford has the highest number of listings among the top 10 manufacturers, indicating its popularity and wide availability in the market.
- High Average Prices for Jeep: Jeep vehicles have the highest average price, suggesting that they are positioned as premium offerings compared to other brands.
- Chevrolet and Toyota: These brands also have a high number of listings, making them popular choices. However, their average prices are lower than Jeep, indicating a broader market appeal.
- Affordable Options: Brands like Honda, Nissan, and BMW show relatively lower average prices, making them more affordable options for buyers.
- Moderate Listings and Prices: GMC and Unknown manufacturers have a moderate number of listings and average prices, indicating a balanced presence in the market.
- Lower Listings for Ram: Ram has a lower number of listings but maintains a moderate average price, indicating niche market appeal.
Recommendations¶
- Target Diverse Market Segments:
- For High Listings: Focus marketing efforts on popular brands like Ford, Chevrolet, and Toyota to leverage their high number of listings.
- For Premium Buyers: Highlight Jeep's premium positioning to attract buyers looking for higher-end vehicles.
- Promote Affordable Brands: Emphasize the affordability of brands like Honda, Nissan, and BMW to attract budget-conscious buyers.
- Expand Niche Markets: Explore opportunities to increase the listings for brands like Ram and GMC to capture niche market segments.
- Increase Awareness for Moderate Brands: Enhance brand visibility and marketing for GMC and Unknown manufacturers to boost their presence and attract more buyers.
- Analyze Price Factors: Conduct further analysis to understand the factors driving high average prices for Jeep and Ford, and utilize these insights to optimize pricing strategies for other brands.
Study Regional Dominance of Manufacturers¶
manufacturer_region = sp.groupby(['state', 'manufacturer']).size().unstack(fill_value=0)
# Normalize to show percentages
manufacturer_region_percentage = manufacturer_region.div(manufacturer_region.sum(axis=1), axis=0)
# Heatmap for regional dominance of top manufacturers
plt.figure(figsize=(14, 8))
sns.heatmap(
manufacturer_region_percentage[top_manufacturers],
annot=True,
fmt=".1%",
cmap='coolwarm',
linewidths=0.5,
cbar_kws={'label': 'Regional Market Share (%)'}
)
plt.title('Regional Dominance of Top Manufacturers', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()
Key Insights¶
- Ford's Dominance: Ford has a significant market share across many states, with the highest in Arunachal Pradesh at 17.8%.
- Chevrolet and Toyota: Both brands have strong market shares in several states, with Chevrolet peaking in Meghalaya at 13.8% and Toyota in Manipur at 13.8%.
- Regional Leaders: Each state has distinct leaders among manufacturers, e.g., Honda in Ladakh (8.4%), Nissan in Mizoram (5.2%), and Jeep in Mizoram (5.0%).
- Consistent Presence: Ford, Chevrolet, and Toyota have consistent high presence across multiple states, while other brands like Jeep, Ram, GMC, and BMW have lower market shares.
Recommendations¶
- Expand Lesser-Known Brands: Manufacturers with lower market shares (e.g., Honda, Nissan, Jeep) should focus on marketing in states where they have relatively higher shares to consolidate and expand.
- Leverage Strong States: Ford, Chevrolet, and Toyota should leverage their strong market positions in states where they dominate, while exploring opportunities in states with lower shares.
- Analyze Success Factors: Study successful strategies in states with high market shares to replicate these in other regions. Factors such as local preferences, economic conditions, and competitive pricing should be considered.
- Targeted Campaigns: Implement targeted marketing campaigns tailored to the specific needs and preferences of consumers in each state to boost market presence and engagement.
Hypothesis 6:- Vehicle condition and title status drive pricing and demand trends where certain condition demand higher prices while other remain in niche markets.¶
Impact of Condition on Price¶
condition_price = sp.groupby('condition')['price'].mean().sort_values(ascending=False)
print("Average Price by Condition:")
print(condition_price)
Average Price by Condition: condition fair 8.605288e+07 excellent 1.223083e+07 good 4.941285e+06 like new 4.439667e+06 Unknown 3.644879e+06 new 3.355070e+06 salvage 6.370346e+05 Name: price, dtype: float64
.groupby()
groups the data by condition and calculate the average price for each condition.- Sort the results by the average price in descending order.
Title Status and Its Impact on Pricing¶
# Pricing analysis by title status
title_analysis = sp.groupby('title_status').agg(
total_transactions=('id', 'count'),
avg_price=('price', 'mean')
).reset_index()
print("\nTitle Status Analysis (Demand and Pricing):")
print(title_analysis)
Title Status Analysis (Demand and Pricing): title_status total_transactions avg_price 0 Unknown 8139 2.029250e+06 1 clean 405030 1.024240e+07 2 lien 1422 2.888077e+06 3 missing 810 2.874681e+06 4 parts only 197 1.646425e+06 5 rebuilt 7219 1.966364e+06 6 salvage 3861 1.345844e+06
.groupby()
groups the data by title_status and calculate the total number of transactions and the average price for each title status.- Reset the index to create a clean DataFrame for pricing analysis by title status.
Demand Analysis by Condition¶
# Analyze demand and pricing by condition
condition_analysis = sp.groupby('condition').agg(
total_transactions=('id', 'count'),
avg_price=('price', 'mean'),
median_price=('price', 'median')
).reset_index()
print("\nDemand and Pricing by Condition:")
print(condition_analysis)
Demand and Pricing by Condition: condition total_transactions avg_price median_price 0 Unknown 36669 3.644879e+06 2309580.0 1 excellent 209179 1.223083e+07 1926290.0 2 fair 7702 8.605288e+07 385220.0 3 good 143637 4.941285e+06 2310590.0 4 like new 27061 4.439667e+06 1990460.0 5 new 1781 3.355070e+06 2298740.0 6 salvage 649 6.370346e+05 256870.0
.groupby()
groups the data by condition, calculating total transactions, average price, and median price for each condition.- Reset the index to create a clean DataFrame for demand and pricing analysis by condition.
Analysis¶
1. Average Price by Condition¶
- "Fair" Condition Cars Have the Highest Average Price (₹86M): This seems unusual and could indicate outliers or data anomalies.
- "Excellent" and "Good" Condition Cars Are Priced High (₹12.2M and ₹4.9M, respectively): This aligns with expectations as well-maintained cars command premium pricing.
- "Salvage" Cars Have the Lowest Price (₹637K): These vehicles have significant damage and require repairs.
2. Title Status Analysis (Demand and Pricing)¶
- Clean Title Cars Dominate the Market (405K transactions, ₹10.24M avg price): Buyers prefer legally clear cars with no issues.
- Lien & Missing Title Cars Have Lower Demand (1.4K-800 transactions): These indicate ownership or financial disputes, leading to lower market interest.
- Salvage and Rebuilt Cars Are Cheaper (₹1.34M - ₹1.96M avg price): These vehicles attract budget-conscious buyers but require thorough inspection.
3. Demand and Pricing by Condition¶
- Most Transactions Are for "Excellent" (209K) and "Good" (143K) Condition Cars: These are the most sought-after categories.
- "New" Cars Have Low Transaction Volume (1.7K): This suggests the dataset primarily consists of used cars.
- "Salvage" Cars Have the Lowest Median Price (₹256K): Buyers are unwilling to pay high amounts for vehicles that require repairs.
Key Insights¶
- "Fair" Condition Pricing Anomaly: The unusually high price (₹86M) suggests data errors or misclassified vehicles.
- Buyers Prefer "Excellent" or "Good" Condition Cars: Over 350K transactions occurred in these two categories, indicating high demand.
- Clean Title Vehicles Are the Market Standard: 405K transactions show a strong buyer preference for cars without ownership issues.
- Rebuilt and Salvage Cars Have Low Demand and Pricing: These cars sell for less and have lower transaction volumes, making them a niche market.
Recommendations¶
Investigate "Fair" Condition Pricing Anomalies
- Verify if data errors, extreme outliers, or misclassifications are inflating the price.
Prioritize Listing "Excellent" and "Good" Condition Cars
- Since these categories have the highest demand, sellers should focus on well-maintained vehicles to maximize sales.
Encourage Buyers to Choose Clean Title Cars
- Highlight the legal and financial safety of clean title cars in marketing campaigns.
Offer Inspection and Financing Support for Salvage & Rebuilt Cars
- Providing vehicle certification, repair history, and financing options can make these cars more attractive to budget-conscious buyers.
Data Visualization¶
Compare Average Prices for Different Condition Categories¶
condition_stats = sp.groupby('condition')['price'].mean().sort_values(ascending=False)
# Plotting average prices by condition
plt.figure(figsize=(12, 6))
sns.barplot(
x=condition_stats.index,
y=condition_stats.values,
palette='Set2'
)
plt.title('Average Price by Condition', fontsize=14)
plt.xlabel('Condition', fontsize=12)
plt.ylabel('Average Price (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Key Insights¶
- Fair Condition: Items in fair condition have the highest average price, nearing 9 million rupees. This is significantly higher than all other conditions, indicating a unique market value.
- Excellent Condition: Items in excellent condition come next with an average price around 2 million rupees, making them valuable but much lower than items in fair condition.
- Lower Prices for Other Conditions: The conditions labeled as good, like new, unknown, new, and salvage all have average prices below 1 million rupees, suggesting less market value compared to fair and excellent conditions.
Recommendations¶
- Focus on Fair Condition: Given the high average price, it may be beneficial to prioritize acquiring or maintaining items in fair condition. Understanding the specific factors driving this high valuation could provide strategic advantages.
- Promote Excellent Condition: Items in excellent condition also hold considerable value. Enhance marketing efforts for these items to attract buyers looking for high-quality goods.
- Investigate Pricing for Other Conditions: Explore why items in other conditions (good, like new, unknown, new, salvage) have lower average prices and consider strategies to enhance their market appeal, such as improvements in presentation or highlighting unique features.
- Market Analysis: Conduct further market analysis to understand the demand and supply dynamics for items in different conditions. This can help in setting more competitive pricing and optimizing inventory management.
Correlations Between Title Status and Price¶
title_stats = sp.groupby('title_status')['price'].mean().sort_values(ascending=False)
# Plotting average prices by title status
plt.figure(figsize=(12, 6))
sns.barplot(
x=title_stats.index,
y=title_stats.values,
palette='Set3'
)
plt.title('Average Price by Title Status', fontsize=14)
plt.xlabel('Title Status', fontsize=12)
plt.ylabel('Average Price (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Key Insights¶
- Clean Title Status: Vehicles with a clean title status have the highest average price, reaching approximately ₹10 million. This indicates a strong preference for vehicles with a clear and undisputed ownership history.
- Lien and Missing Title Statuses: These categories show the next highest average prices, around ₹2 million, suggesting some buyers are still willing to invest in vehicles with potential ownership complications.
- Other Title Statuses: Vehicles with unknown, rebuilt, parts only, and salvage title statuses have significantly lower average prices, all below ₹2 million. This reflects the reduced market value and potential risks associated with these categories.
Recommendations¶
- Prioritize Clean Title Vehicles: Focus on acquiring and marketing vehicles with clean title statuses to maximize revenue and appeal to buyers seeking reliable ownership history.
- Evaluate Lien and Missing Title Vehicles: Consider these vehicles as viable options for budget-conscious buyers, but ensure thorough checks and transparency about potential ownership issues.
- Cautious Approach for Other Titles: Be cautious with vehicles having unknown, rebuilt, parts only, or salvage titles. These should be marketed with clear disclaimers and possibly targeted to buyers looking for project cars or parts.
Combined Analysis of Condition and Title Status on Price¶
combined_stats = sp.groupby(['condition', 'title_status'])['price'].mean().unstack(fill_value=0)
# Heatmap for combined analysis
plt.figure(figsize=(14, 8))
sns.heatmap(
combined_stats,
annot=True,
fmt=".0f",
cmap='coolwarm',
cbar_kws={'label': 'Average Price (₹)'},
linewidths=0.5
)
plt.title('Average Price by Condition and Title Status', fontsize=14)
plt.xlabel('Title Status', fontsize=12)
plt.ylabel('Condition', fontsize=12)
plt.tight_layout()
plt.show()
Key Insights¶
- Highest Average Price: The highest average price is ₹94,209,863 for items in "fair" condition with a "clean" title status. This suggests a significant market value for such items.
- Clean Title Advantage: Vehicles with a "clean" title status consistently have higher average prices across most conditions, indicating a strong preference for clear ownership.
- Fair Condition Premium: Items in "fair" condition have notably high average prices, especially when paired with a "clean" title.
- Lower Prices for Salvage Title: Vehicles with a "salvage" title status generally have the lowest average prices, regardless of condition, indicating reduced market value due to potential damage or issues.
Recommendations¶
- Focus on Clean Title Vehicles: Prioritize acquiring and marketing vehicles with a "clean" title status, as they command higher prices and appeal to buyers seeking reliable ownership history.
- Highlight Fair Condition: Emphasize the value of vehicles in "fair" condition, especially with a clean title, to attract buyers willing to pay a premium for such items.
- Cautious with Salvage Titles: Approach vehicles with "salvage" titles with caution. Ensure transparency about potential issues and target budget-conscious buyers or those looking for project cars.
- Optimize Pricing Strategies: Use the insights from average price distributions to set competitive and appealing prices for vehicles based on their condition and title status.
Hypothesis 7:- Transaction volumes show seasonal peaks, with higher volumes early in the week due to practical factors.¶
Time-Based Demand Trends¶
monthly_trends = sp.groupby(['posting_year', 'posting_month'])['id'].count().reset_index()
monthly_trends.columns = ['Year', 'Month', 'Transactions']
print("Monthly Transaction Trends:")
print(monthly_trends)
Monthly Transaction Trends: Year Month Transactions 0 2021 4 313053 1 2021 5 113625
.groupby()
groups the data by posting_year and posting_month, then count the number of transactions for each month.- Rename the columns to Year, Month, and Transactions.
weekday_trends = sp['posting_weekday'].value_counts()
print("Weekday Trends:")
print(weekday_trends)
Weekday Trends: Tuesday 78444 Monday 75067 Friday 66569 Saturday 61453 Thursday 52984 Wednesday 51080 Sunday 41081 Name: posting_weekday, dtype: int64
- Count the number of transactions for each unique value in the
posting_weekday
column. - Display the transaction count by weekday.
Analysis¶
1. Monthly Transaction Trends¶
- The dataset currently shows only two months (April & May 2021), which limits the ability to analyze seasonality or long-term trends.
- April 2021 had a significantly higher number of transactions (313K) compared to May 2021 (113K), indicating a possible drop in demand or listing activity.
2. Weekday Transaction Trends¶
- Most Listings Occur on Tuesdays (78K) and Mondays (75K): Sellers are more active at the beginning of the week.
- Least Listings on Sundays (41K): People tend to be less engaged in selling cars on weekends.
- Fridays & Saturdays Show Moderate Activity (~66K & 61K transactions, respectively).
- Wednesdays & Thursdays Have the Lowest Mid-Week Activity (~51K-52K transactions).
Key Insights¶
Transaction Volume Dropped From April to May 2021
- This could be due to seasonal effects, economic factors, or market fluctuations. More historical data is needed to determine long-term trends.
Tuesdays and Mondays Are the Best Days to List a Vehicle
- These two days account for the highest number of postings, likely because buyers start searching at the beginning of the week.
Sunday Is the Least Active Day for Transactions
- Sellers and buyers are less engaged on Sundays, making it the worst day to list a vehicle.
Recommendations¶
Analyze More Monthly Data to Identify Trends
- Collect data from additional months to determine seasonal fluctuations and yearly patterns.
- Look at festivals, holidays, and economic trends that may impact vehicle sales.
Encourage Listings on High-Traffic Days (Monday & Tuesday)
- Run promotions or ad campaigns encouraging sellers to list on these days for maximum visibility.
- Offer discounted listing fees or highlighted ads to sellers who post on peak days.
Optimize Pricing & Promotions on Sundays to Boost Engagement
- Since Sundays have low activity, consider discounted pricing, free listing promotions, or exclusive Sunday deals to increase engagement.
Data Visualization¶
Weekly Trends in Car Postings¶
weekday_counts = sp['posting_weekday'].value_counts()
plt.figure(figsize=(10, 6))
sns.barplot(
x=weekday_counts.index,
y=weekday_counts.values,
palette='viridis'
)
plt.title('Weekly Trends in Car Postings', fontsize=14)
plt.xlabel('Day of the Week', fontsize=12)
plt.ylabel('Number of Listings', fontsize=12)
plt.tight_layout()
plt.show()
Key Insights from the Bar Chart¶
- Peak Posting Day: Tuesday has the highest number of car listings, indicating that this is the most active day for car postings.
- Consistently High Days: Monday, Friday, and Saturday also see a high number of postings, following Tuesday.
- Mid-Range Activity: Thursday and Wednesday have moderate levels of activity.
- Lowest Activity: Sunday has the lowest number of car postings, suggesting it is the least active day for new listings.
Recommendations¶
- Optimize Posting Schedule: For maximum visibility and engagement, schedule car postings on Tuesdays and Mondays. This aligns with the peak activity days.
- Leverage High Activity Days: Utilize Fridays and Saturdays for additional postings to maintain high visibility and attract more potential buyers.
- Boost Sunday Activity: Consider promotions or special deals on Sundays to increase the number of postings and engage users on the platform during this low-activity day.
- Monitor Trends: Continuously monitor the weekly trends to adjust posting strategies and optimize timing for maximum impact.
Hypothesis 8:- Vehicle size and drive type influence pricing, with 4wd premium, fwd affordable, and rwd balanced.¶
Revenue Contribution by Vehicle Size¶
# Revenue contribution by size
size_revenue = sp.groupby('size').agg(
total_revenue=('price', 'sum'),
avg_price=('price', 'mean')
).reset_index()
print("\nRevenue Contribution by Vehicle Size:")
print(size_revenue)
Revenue Contribution by Vehicle Size: size total_revenue avg_price 0 Unknown 7.099603e+11 5.864291e+06 1 compact 4.462154e+11 8.724345e+06 2 full-size 2.387425e+12 1.488642e+07 3 mid-size 6.333929e+11 7.214700e+06 4 sub-compact 1.415356e+10 2.246953e+06
.groupby()
groups the data by size, calculating the total revenue and average price for each size category.- Reset the index to create a clean DataFrame for revenue analysis by size.
Analysis¶
1. Total Revenue by Vehicle Size¶
- Full-size vehicles generate the highest total revenue (₹2.39 trillion), indicating strong demand and higher pricing.
- Mid-size and compact cars also contribute significantly (₹633 billion & ₹446 billion, respectively).
- Sub-compact cars contribute the least revenue (₹14 billion), suggesting lower demand or lower pricing.
- The "Unknown" category accounts for ₹709 billion, which may include missing data or undefined vehicle sizes.
2. Average Price by Vehicle Size¶
- Full-size vehicles have the highest average price (₹14.89 million), followed by compact cars (₹8.72 million).
- Mid-size vehicles are priced lower (₹7.21 million), while sub-compact cars have the lowest average price (₹2.25 million).
- The "Unknown" category has an average price of ₹5.86 million, which could skew overall pricing trends.
Key Insights¶
Full-Size Vehicles Dominate Revenue Contribution
- They generate the highest total revenue and have the highest average price, indicating strong demand among premium buyers.
Compact & Mid-Size Vehicles Offer a Balance Between Demand and Affordability
- These categories generate substantial revenue and have reasonable pricing, making them attractive for a wider consumer base.
Sub-Compact Cars Have the Lowest Revenue Contribution
- Lower average pricing and potentially lower demand contribute to their small share of total revenue.
A Large Portion of the Data Falls Under "Unknown" Size
- ₹709 billion worth of transactions have no defined size, which might indicate data quality issues or missing entries.
Recommendations¶
Focus on High-Value Full-Size Vehicle Sales
- Given their high revenue contribution and pricing, invest in marketing, premium listings, and financing options to boost full-size vehicle sales.
Target Compact & Mid-Size Vehicle Buyers
- These vehicles appeal to cost-conscious buyers, so offering loan options, exchange programs, and extended warranties can boost sales.
Boost Demand for Sub-Compact Cars Through Discounts & Incentives
- Since sub-compact cars contribute the least revenue, introduce budget-friendly financing options, trade-in offers, and discounts to attract buyers.
Clean & Classify "Unknown" Data for Better Accuracy
- Investigate why certain listings are categorized as "Unknown" and enforce better data collection practices to ensure accurate reporting.
Data Visualization¶
Average Price and Odometer by Drive Type.¶
# Calculate averages
drive_data = sp.groupby('drive').agg({'price': 'mean', 'odometer': 'mean'}).sort_index()
# Extract values
drive_types = drive_data.index
price_values = drive_data['price']
odometer_values = drive_data['odometer']
# Create the plot
fig, ax1 = plt.subplots(figsize=(14, 6))
x = np.arange(len(drive_types))
bar_width = 0.4
# Plot bars
ax1.bar(x - bar_width / 2, price_values, width=bar_width, color='skyblue', label='Average Price (₹)')
ax2 = ax1.twinx()
ax2.bar(x + bar_width / 2, odometer_values, width=bar_width, color='lightgreen', label='Average Odometer (km)')
# Label axes
ax1.set_ylabel('Average Price (₹)', color='blue', fontsize=12)
ax2.set_ylabel('Average Odometer (km)', color='green', fontsize=12)
ax1.set_xlabel('Drive Type', fontsize=12)
ax1.set_xticks(x)
ax1.set_xticklabels(drive_types, rotation=45, ha='right')
# Add title and legend
plt.title('Average Price and Odometer by Drive Type', fontsize=14)
fig.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)
plt.tight_layout()
plt.show()
Key Insights from the Bar Chart¶
- 4wd Vehicles:
- Average Price: ₹1.25 crore (₹12.5 million).
- Average Odometer: Around 200,000 km.
- Unknown Drive Type:
- Average Price: ₹0.5 crore (₹5 million).
- Average Odometer: Around 200,000 km.
- fwd Vehicles:
- Average Price: Very low, close to ₹0 crore.
- Average Odometer: Around 100,000 km.
- rwd Vehicles:
- Average Price: ₹1 crore (₹10 million).
- Average Odometer: Slightly above 100,000 km.
Recommendations¶
- Budget-Friendly Option: Consider purchasing fwd vehicles as they have the lowest average price. However, further investigation into why they are priced low might be needed to ensure quality and performance.
- Lower Mileage Preference: rwd vehicles are a good option if the goal is to purchase a vehicle with a relatively lower odometer reading compared to 4wd and unknown drive types.
- High Price-High Value: 4wd vehicles have the highest average price, which may indicate premium features or better performance. Evaluate if the additional cost aligns with your needs and budget.
- Investigate Unknown Drive Type: The vehicles with an unknown drive type have a moderate price and high odometer reading. More information about this category might be useful to make informed decisions.
Hypothesis 9:- Regional paint color preferences affect demand, pricing, and marketing strategies.¶
Data Visualization¶
Paint Color Preferences and Average Price¶
# Calculate paint color frequency and average price
paint_frequency = sp['paint_color'].value_counts()
paint_price = sp.groupby('paint_color')['price'].mean()
# Ensure the indices are aligned
paint_frequency = paint_frequency.sort_index()
paint_price = paint_price.sort_index()
# Create the figure and axes
fig, ax1 = plt.subplots(figsize=(14, 6))
# Plot frequency on the first y-axis
ax1.bar(
paint_frequency.index,
paint_frequency.values,
color='lightblue',
alpha=0.7,
label='Frequency'
)
ax1.set_ylabel('Frequency', fontsize=12, color='blue')
ax1.set_xlabel('Paint Color', fontsize=12)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_xticks(range(len(paint_frequency.index))) # Set tick positions
ax1.set_xticklabels(paint_frequency.index, rotation=45, ha='right') # Set tick labels
# Create the second y-axis
ax2 = ax1.twinx()
ax2.plot(
range(len(paint_price.index)),
paint_price.values,
color='red',
marker='o',
label='Average Price'
)
ax2.set_ylabel('Average Price (₹)', fontsize=12, color='red')
ax2.tick_params(axis='y', labelcolor='red')
# Add a title
plt.title('Paint Color Preferences and Average Price', fontsize=14)
# Show the legend
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
# Adjust layout for better appearance
plt.tight_layout()
plt.show()
Key Insights¶
- Most Preferred Paint Color: White is the most preferred paint color with a frequency of around 100,000.
- Popular Colors: Black and silver are also highly preferred, with frequencies around 90,000 and 70,000, respectively.
- Least Preferred Colors: Custom, orange, purple, and yellow are the least preferred paint colors, with frequencies below 10,000.
- Highest Average Price: Green paint commands the highest average price, around ₹7,000,000.
- High Average Prices: Black, silver, and white paints have relatively high average prices, around ₹2,000,000 to ₹3,000,000.
- Lowest Average Prices: Custom, orange, purple, and yellow paint colors have the lowest average prices, below ₹1,000,000.
Recommendations¶
- Stock Popular Colors: Increase inventory for white, black, and silver paint colors to meet high demand and maximize sales.
- Market Green Paint: Investigate why green paint commands a premium price and explore strategies to market it more effectively to capitalize on its high value.
- Boost Low-Preference Colors: Consider promotional campaigns or discounts for custom, orange, purple, and yellow paints to increase their popularity and sales.
- Monitor Price Trends: Continuously monitor price trends and consumer preferences to adjust inventory and marketing strategies accordingly.
Regional Paint Color Preferences¶
regional_paint_preference = sp.groupby(['state', 'paint_color']).size().unstack(fill_value=0)
plt.figure(figsize=(15, 8))
sns.heatmap(
regional_paint_preference,
cmap='YlGnBu',
cbar_kws={'label': 'Number of Listings'},
linewidths=0.5
)
plt.title('Regional Preferences for Paint Colors', fontsize=16)
plt.xlabel('Paint Color', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()
Key Insights from the Heatmap¶
- White Paint Dominance: Maharashtra and Uttar Pradesh have a high number of listings for white paint, indicating a strong preference.
- Blue Paint Preference: Delhi and Haryana show a notable preference for blue paint.
- Black Paint Popularity: Karnataka and Punjab have a significant number of listings for black paint.
- Silver Paint Preference: Madhya Pradesh shows a clear preference for silver paint.
- Overall Trends: The color white is generally the most popular paint color across multiple states.
Recommendations¶
- Stock White Paint: Paint manufacturers and retailers should prioritize stocking white paint in Maharashtra and Uttar Pradesh to meet the high demand.
- Focus on Blue Paint: Ensure ample supply of blue paint in Delhi and Haryana to cater to the regional preference.
- Boost Black Paint Inventory: Increase inventory of black paint in Karnataka and Punjab to match the high number of listings.
- Emphasize Silver Paint: Consider promoting and stocking more silver paint in Madhya Pradesh to align with local preferences.
- Customized Marketing: Develop region-specific marketing strategies that highlight the most popular colors in each state to boost sales and customer satisfaction.
Hypothesis 10:- Vehicle demand varies regionally, with different models and fuel types appealing to specific states based on local trends.¶
In-Demand Car Manufacturer with Models¶
# Group by 'manufacturer' and 'model', summing the 'id' column
model_demand = sp.groupby(['manufacturer', 'model'])['id'].count().reset_index()
# Rename the column for clarity
model_demand.columns = ['Manufacturer', 'Model', 'Total_Demand']
# Sort by 'TotalDemand' in descending order
model_demand = model_demand.sort_values(by='Total_Demand', ascending=False)
model_demand
Manufacturer | Model | Total_Demand | |
---|---|---|---|
13769 | ford | f-150 | 8014 |
10401 | chevrolet | silverado 1500 | 5154 |
25302 | ram | 1500 | 4259 |
27474 | toyota | camry | 3153 |
10395 | chevrolet | silverado | 3037 |
... | ... | ... | ... |
12032 | dodge | charger police sxt | 1 |
12031 | dodge | charger police inspector | 1 |
12029 | dodge | charger plus rt | 1 |
12027 | dodge | charger limousine | 1 |
15041 | ford | f250xlt | 1 |
30083 rows × 3 columns
- Using
.groupby()
groups the data bymanufacturer
andmodel
, then count the number of transactions (id
) for each group. - Rename the columns for clarity (
TotalDemand
). - Sort the resulting DataFrame by
TotalDemand
in descending order and display the result.
# Get the top 10 in-demand cars
model_demand_top10 = model_demand.head(10)
model_demand_top10
Manufacturer | Model | Total_Demand | |
---|---|---|---|
13769 | ford | f-150 | 8014 |
10401 | chevrolet | silverado 1500 | 5154 |
25302 | ram | 1500 | 4259 |
27474 | toyota | camry | 3153 |
10395 | chevrolet | silverado | 3037 |
17821 | honda | accord | 2983 |
20419 | jeep | wrangler | 2863 |
23984 | nissan | altima | 2805 |
18079 | honda | civic | 2801 |
13300 | ford | escape | 2780 |
- Extract the top 10 rows with the highest TotalDemand from the model_demand DataFrame.
- Display the top 10 in-demand cars.
Key Attributes Driving Demand¶
# Analyze demand by key attributes
attribute_columns = ['fuel', 'condition', 'paint_color']
for col in attribute_columns:
print(f"\nDemand by {col.capitalize()}:")
print(sp[col].value_counts())
Demand by Fuel: gas 357266 other 30724 diesel 30101 hybrid 5195 electric 1701 Unknown 1691 Name: fuel, dtype: int64 Demand by Condition: excellent 209179 good 143637 Unknown 36669 like new 27061 fair 7702 new 1781 salvage 649 Name: condition, dtype: int64 Demand by Paint_color: white 108720 black 91076 silver 57794 blue 40905 red 37135 grey 30740 Unknown 30617 green 8786 brown 7901 custom 7606 yellow 2346 orange 2282 purple 770 Name: paint_color, dtype: int64
- Loop through key attribute columns (fuel, condition, paint_color), and for each column, count the number of occurrences of each unique value.
- Display the demand for each attribute.
High-Demand Models by Region¶
regional_model_demand = sp.groupby(['state', 'model'])['id'].count().reset_index()
regional_model_demand.columns = ['State', 'Model', 'Transactions']
top_models = regional_model_demand.sort_values(by=['Transactions'], ascending=[False])
print(top_models)
State Model Transactions 66843 Maharashtra f-150 886 110102 Uttar Pradesh f-150 847 20926 Delhi f-150 608 84926 Punjab f-150 587 70240 Maharashtra silverado 1500 567 ... ... ... ... 49574 Karnataka silverado ext cab 1 49573 Karnataka silverado ex-cab 1 49572 Karnataka silverado durmax hd 1 49571 Karnataka silverado duramax hd 1 122744 West Bengal ♿ 1 [122745 rows x 3 columns]
.groupby()
group the data by state and model, then count the number of transactions for each combination.- Rename the columns to State, Model, and Transactions.
- Sort the resulting DataFrame by Transactions in descending order to identify the top models.
Popular Car Models and Brands¶
# Identify the most popular manufacturers and models
popular_cars = sp.groupby(['manufacturer', 'model']).agg(
total_transactions=('id', 'count'),
avg_price=('price', 'mean'),
median_price=('price', 'median')
).reset_index().sort_values('total_transactions', ascending=False)
print("\nPopular Car Models and Brands Analysis:")
print(popular_cars.head(10)) # Top 10 popular cars
Popular Car Models and Brands Analysis: manufacturer model total_transactions avg_price \ 13769 ford f-150 8014 2.864032e+06 10401 chevrolet silverado 1500 5154 2.886477e+06 25302 ram 1500 4259 2.913927e+06 27474 toyota camry 3153 1.298709e+06 10395 chevrolet silverado 3037 2.699027e+06 17821 honda accord 2983 1.236092e+06 20419 jeep wrangler 2863 6.580196e+07 23984 nissan altima 2805 1.216228e+06 18079 honda civic 2801 1.208038e+06 13300 ford escape 2780 1.504665e+06 median_price 13769 2511100.0 10401 2568380.0 25302 2695900.0 27474 1027320.0 10395 2041870.0 17821 1007580.0 20419 2375770.0 23984 1136940.0 18079 1014550.0 13300 1488650.0
.groupby()
group the data by manufacturer and model, then count the number of transactions, average_price and median_price for each combination.- Sort the resulting DataFrame by Transactions in descending order to identify the top models.
Analysis: Car Model Demand & Popularity¶
1. Most In-Demand Car Models¶
- Ford F-150 is the most popular model with 8,014 transactions, significantly ahead of competitors.
- Chevrolet Silverado 1500 (5,154 transactions) and Ram 1500 (4,259 transactions) are also highly demanded.
- Toyota Camry (3,153 transactions) and Honda Accord (2,983 transactions) dominate the sedan segment.
- Jeep Wrangler (2,863 transactions) stands out in the SUV category.
2. Demand by Key Attributes¶
Fuel Type Demand¶
- Gas-powered vehicles dominate (357,266 transactions), indicating continued reliance on traditional fuel.
- Diesel (30,101 transactions) and hybrid (5,195 transactions) have niche demand.
- Electric vehicles (1,701 transactions) have the lowest demand, showing slow adoption.
Condition-Based Demand¶
- Most buyers prefer vehicles in ‘Excellent’ (209,179) or ‘Good’ (143,637) condition.
- Only 1,781 transactions for ‘New’ vehicles suggest a preference for pre-owned cars.
- Salvage vehicles have the lowest demand (649 transactions).
Paint Color Preference¶
- White (108,720) and Black (91,076) are the most popular colors, followed by Silver (57,794).
- Custom colors like Purple (770) and Yellow (2,346) have very low demand.
3. Regional Demand for Top Models¶
- Maharashtra has the highest demand for Ford F-150 (886 transactions).
- Uttar Pradesh (847), Delhi (608), and Punjab (587) also show strong demand for this model.
- Chevrolet Silverado 1500 sees demand concentrated in Maharashtra (567 transactions).
Key Insights¶
Ford F-150, Chevrolet Silverado, and Ram 1500 dominate the market.
- These trucks have higher transaction volumes, indicating strong consumer preference.
Gasoline vehicles remain dominant, but hybrid & electric adoption is slow.
- Despite the EV push, traditional fuel types continue to lead demand.
Used vehicles in excellent and good condition drive sales.
- Consumers prefer high-quality pre-owned vehicles over new or salvage listings.
White, Black, and Silver are the most preferred colors.
- Custom or niche colors (Purple, Yellow, etc.) see limited demand.
Regional demand varies, with Maharashtra leading for top models.
- Understanding local preferences can help optimize inventory distribution.
Recommendations¶
1. Focus on Top-Selling Models for Inventory Optimization¶
- Stock Ford F-150, Chevrolet Silverado, and Ram 1500 in high-demand regions like Maharashtra and Uttar Pradesh.
- Ensure a steady supply of Toyota Camry, Honda Accord, and Jeep Wrangler to meet sedan and SUV demand.
2. Expand Hybrid & Electric Vehicle Promotions¶
- Offer financing incentives, lower interest rates, or EV subsidies to drive adoption.
- Promote hybrid & electric models in eco-conscious cities with better charging infrastructure.
3. Prioritize High-Quality Used Cars Over New or Salvage Vehicles¶
- Given the low demand for new cars, focus on certified pre-owned (CPO) vehicles with warranty benefits.
- Salvage cars have very limited demand; avoid overstocking them.
4. Stock Popular Color Choices & Offer Customization¶
- Prioritize White, Black, and Silver models, as they sell the most.
- Introduce custom paint options or wraps for buyers seeking personalization.
5. Use Regional Insights for Targeted Marketing¶
- Maharashtra & Uttar Pradesh should be priority states for Ford F-150 and Chevrolet Silverado.
- Target digital marketing campaigns based on regional demand trends.
Hypothesis 11:- Lower-mileage vehicles drive higher demand, while higher-mileage ones attract niche buyers at premium prices.¶
Impact on Odometer on Pricing¶
# Group transactions by odometer ranges
bins = [0, 20000, 50000, 100000, 150000, 200000, 300000,400000,600000,800000,1000000,10000000]
labels = ['0-20K', '20K-50K', '50K-100K', '100K-150K', '150K-200K', '200K-300K', '300K-400K', '400K-600K', '600K-800K', '800K-1000K','1000K+']
sp['mileage_range'] = pd.cut(sp['odometer'], bins=bins, labels=labels)
mileage_analysis = sp.groupby('mileage_range').agg(
total_transactions=('id', 'count'),
avg_price=('price', 'mean'),
median_price=('price', 'median')
).reset_index()
print("\nMileage Range Analysis (Demand and Pricing):")
print(mileage_analysis)
Mileage Range Analysis (Demand and Pricing): mileage_range total_transactions avg_price median_price 0 0-20K 53687 1.381985e+07 3788400.0 1 20K-50K 79690 3.421764e+06 3209560.0 2 50K-100K 116017 9.843183e+06 1990460.0 3 100K-150K 99096 5.119617e+06 1284190.0 4 150K-200K 53065 1.341860e+07 897960.0 5 200K-300K 20137 2.501741e+07 744910.0 6 300K-400K 1327 1.339640e+07 960200.0 7 400K-600K 449 3.057920e+06 1990460.0 8 600K-800K 229 2.832558e+06 2140660.0 9 800K-1000K 445 2.875703e+08 1926290.0 10 1000K+ 571 2.518616e+08 1027320.0
sp['mileage_range']
create mileage ranges using the odometer values, then assign corresponding labels to each range.- Group the data by mileage_range and calculate the total number of transactions, average price, and median price for each mileage range.
- Display the mileage range analysis (demand and pricing).
Impact of Yearly Mileage on Pricing¶
sp['mileage_bins'] = pd.cut(sp['avg_yearly_mileage'],
bins=[0, 5000, 10000, 20000, 30000,40000,50000,100000,10000000],
labels=['0-5K','5K-10K','10K-20K','20K-30K','30K-40K','40K-50K','50K-100K','100K+'])
mileage_demand = sp.groupby('mileage_bins').agg(
total_transactions=('id','count'),
avg_price=('price', 'mean'),
median_price=('price', 'median')
).reset_index()
print("Demand by Average Yearly Mileage:")
print(mileage_demand)
Demand by Average Yearly Mileage: mileage_bins total_transactions avg_price median_price 0 0-5K 62413 2.788424e+07 2616850.0 1 5K-10K 126558 7.393663e+06 1990460.0 2 10K-20K 190936 6.344824e+06 1871940.0 3 20K-30K 34091 2.800212e+06 2182150.0 4 30K-40K 6774 3.418389e+06 3081120.0 5 40K-50K 2011 8.346382e+06 2889430.0 6 50K-100K 1307 1.545943e+07 2311510.0 7 100K+ 623 2.058125e+08 1990460.0
- Creates mileage bins based on
avg_yearly_mileage
and assign corresponding labels. - Group the data by
mileage_bins
and calculate the total number of transactions, average price, and median price for each mileage bin. - Display the demand by average yearly mileage.
Analysis: Vehicle Demand & Pricing by Mileage¶
1. Mileage Range Analysis¶
- Vehicles with lower mileage (0-20K) have high demand (53,687 transactions) and the highest average price (₹13.8M).
- The demand peaks in the 50K-100K range (116,017 transactions), but the average price drops to ₹9.8M.
- Older, high-mileage vehicles (100K+ miles) have significantly lower median prices, indicating depreciation.
- Extreme mileage vehicles (800K+ miles) have very low transaction volumes.
2. Average Yearly Mileage Demand¶
- The 10K-20K range has the highest demand (190,936 transactions), showing a preference for moderately used cars.
- Cars with less than 5K yearly mileage have high value (₹27.8M avg. price), likely due to better condition.
- Vehicles exceeding 100K yearly mileage are rare (623 transactions) but still have a surprisingly high avg. price (₹205.8M), possibly due to commercial or specialty use.
Key Insights¶
Moderate mileage (10K-100K miles) has the highest demand.
- Buyers prioritize cars with reasonable wear but not excessively high mileage.
Price drops sharply for high-mileage vehicles.
- Vehicles with over 150K miles see a steep decline in value.
Extremely high mileage (800K+) vehicles have low demand but unpredictable pricing.
- Some outliers have unusually high prices, possibly due to specialized vehicle types.
Low yearly mileage (0-5K per year) retains the most value.
- These cars attract premium pricing due to perceived better maintenance.
Recommendations¶
1. Prioritize Stocking Vehicles in the 10K-100K Mileage Range¶
- These have the highest transaction volume and stable pricing.
- Avoid very high-mileage cars (200K+) unless they serve niche markets.
2. Market Low-Mileage Vehicles as Premium Offerings¶
- Highlight "low mileage, like new" features to justify higher pricing.
- Offer extended warranties or certification for added trust.
3. Be Cautious with High-Mileage Vehicles Over 150K Miles¶
- If selling these, focus on budget-conscious buyers or commercial fleets.
- Price aggressively to move inventory faster.
4. Use Mileage Data to Guide Trade-In Valuations¶
- Offer better trade-in values for vehicles under 100K miles to attract sellers.
- Use historical price trends to ensure competitive but profitable resale prices.