import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical computations
import matplotlib.pyplot as plt  # For plotting and visualization
import seaborn as sns  # For advanced visualizations

sp = pd.read_csv("final_dataset.csv") # Loading the Data

sp

print("Dataset Information:")
sp.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 21 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   price         426880 non-null  float64
 2   year          425675 non-null  float64
 3   manufacturer  409234 non-null  object 
 4   model         421603 non-null  object 
 5   condition     252776 non-null  object 
 6   cylinders     249202 non-null  object 
 7   fuel          423867 non-null  object 
 8   odometer      422480 non-null  float64
 9   title_status  418638 non-null  object 
 10  transmission  424324 non-null  object 
 11  VIN           265838 non-null  object 
 12  drive         296313 non-null  object 
 13  size          120519 non-null  object 
 14  type          334022 non-null  object 
 15  paint_color   296677 non-null  object 
 16  image_url     426812 non-null  object 
 17  state         426880 non-null  object 
 18  posting_date  426812 non-null  object 
 19  latitude      426880 non-null  float64
 20  longitude     426880 non-null  float64
dtypes: float64(5), int64(1), object(15)
memory usage: 68.4+ MB

sp.columns

Index(['id', 'price', 'year', 'manufacturer', 'model', 'condition',
       'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN',
       'drive', 'size', 'type', 'paint_color', 'image_url', 'state',
       'posting_date', 'latitude', 'longitude'],
      dtype='object')

sp.describe()

print("The Data Types of all Columns:")
sp.dtypes

The Data Types of all Columns:

id                int64
price           float64
year            float64
manufacturer     object
model            object
condition        object
cylinders        object
fuel             object
odometer        float64
title_status     object
transmission     object
VIN              object
drive            object
size             object
type             object
paint_color      object
image_url        object
state            object
posting_date     object
latitude        float64
longitude       float64
dtype: object

rows, columns = sp.shape
print(f"The dataset contains {rows} rows and {columns} columns.")

The dataset contains 426880 rows and 21 columns.

for column in sp.columns.tolist():
    print(f"No. of unique values in {column}:")
    print(sp[column].nunique())

No. of unique values in id:
426880
No. of unique values in price:
6253
No. of unique values in year:
114
No. of unique values in manufacturer:
42
No. of unique values in model:
29667
No. of unique values in condition:
6
No. of unique values in cylinders:
8
No. of unique values in fuel:
5
No. of unique values in odometer:
104870
No. of unique values in title_status:
6
No. of unique values in transmission:
3
No. of unique values in VIN:
118264
No. of unique values in drive:
3
No. of unique values in size:
4
No. of unique values in type:
13
No. of unique values in paint_color:
12
No. of unique values in image_url:
241899
No. of unique values in state:
35
No. of unique values in posting_date:
333355
No. of unique values in latitude:
191354
No. of unique values in longitude:
162536

value_counts_dict = {}

for column in sp.columns:
    value_counts_dict[column] = sp[column].value_counts()
    print(value_counts_dict[column])

7222695916    1
7313139418    1
7313423023    1
7313423324    1
7313424533    1
             ..
7314655506    1
7314655833    1
7314657468    1
7314659947    1
7301591129    1
Name: id, Length: 426880, dtype: int64
0.0           34868
3851650.0      3683
897960.0       3623
1026400.0      3585
1154830.0      3512
              ...  
13329960.0        1
1819610.0         1
425710.0          1
2222640.0         1
611520.0          1
Name: price, Length: 6253, dtype: int64
2017.0    36420
2018.0    36369
2015.0    31538
2013.0    30794
2016.0    30434
          ...  
1943.0        1
1915.0        1
1902.0        1
1905.0        1
1909.0        1
Name: year, Length: 114, dtype: int64
ford               70985
chevrolet          55064
toyota             34202
honda              21269
nissan             19067
jeep               19014
ram                18342
gmc                16785
bmw                14699
dodge              13707
mercedes-benz      11817
hyundai            10338
subaru              9495
volkswagen          9345
kia                 8457
lexus               8200
audi                7573
cadillac            6953
chrysler            6031
acura               5978
buick               5501
mazda               5427
infiniti            4802
lincoln             4220
volvo               3374
mitsubishi          3292
mini                2376
pontiac             2288
rover               2113
jaguar              1946
porsche             1384
mercury             1184
saturn              1090
alfa-romeo           897
tesla                868
fiat                 792
harley-davidson      153
ferrari               95
datsun                63
aston-martin          24
land rover            21
morgan                 3
Name: manufacturer, dtype: int64
f-150                       8009
silverado 1500              5140
1500                        4211
camry                       3135
silverado                   3023
                            ... 
Huyndai Sante Fe Limited       1
astro awd 4x4                  1
escalade and                   1
cx 3                           1
Paige Glenbrook Touring        1
Name: model, Length: 29667, dtype: int64
good         121456
excellent    101467
like new      21178
fair           6769
new            1305
salvage         601
Name: condition, dtype: int64
6 cylinders     94169
4 cylinders     77642
8 cylinders     72062
5 cylinders      1712
10 cylinders     1455
other            1298
3 cylinders       655
12 cylinders      209
Name: cylinders, dtype: int64
gas         356209
other        30728
diesel       30062
hybrid        5170
electric      1698
Name: fuel, dtype: int64
100000.0    2263
1.0         2246
0.0         1965
200000.0    1728
150000.0    1603
            ... 
149468.0       1
154259.0       1
111887.0       1
213826.0       1
26892.0        1
Name: odometer, Length: 104870, dtype: int64
clean         405117
rebuilt         7219
salvage         3868
lien            1422
missing          814
parts only       198
Name: title_status, dtype: int64
automatic    336524
other         62682
manual        25118
Name: transmission, dtype: int64
1FMJU1JT1HEA52352    261
3C6JR6DT3KG560649    235
1FTER1EH1LLA36301    231
5TFTX4CN3EX042751    227
1GCHTCE37G1186784    214
                    ... 
2G4GP5EX5E9213578      1
1GCEK19J78Z219711      1
JA4AT3AW1AZ006543      1
4T1BF28B61U153724      1
SAJGX2749VCOO8376      1
Name: VIN, Length: 118264, dtype: int64
4wd    131904
fwd    105517
rwd     58892
Name: drive, dtype: int64
full-size      63465
mid-size       34476
compact        19384
sub-compact     3194
Name: size, dtype: int64
sedan          87056
SUV            77284
pickup         43510
truck          35279
other          22110
coupe          19204
hatchback      16598
wagon          10751
van             8548
convertible     7731
mini-van        4825
offroad          609
bus              517
Name: type, dtype: int64
white     79285
black     62861
silver    42970
blue      31223
red       30473
grey      24416
green      7343
custom     6700
brown      6593
yellow     2142
orange     1984
purple      687
Name: paint_color, dtype: int64
https://images.craigslist.org/00N0N_1xMPvfxRAIdz_0gw0co_600x450.jpg    7357
https://images.craigslist.org/00R0R_lwWjXSEWNa7z_0x20oM_600x450.jpg    2192
https://images.craigslist.org/01111_l590CdXzJSoz_001001_600x450.jpg     276
https://images.craigslist.org/01313_9Zne268YnWZz_0gw0co_600x450.jpg     231
https://images.craigslist.org/00X0X_1cMDo5xqSuAz_0gw0co_600x450.jpg     227
                                                                       ... 
https://images.craigslist.org/00a0a_gbpYPGPfZEwz_0lM0t2_600x450.jpg       1
https://images.craigslist.org/00e0e_g9RfXGmsjDaz_0CI0hq_600x450.jpg       1
https://images.craigslist.org/00Y0Y_4fRgRCeuUj2z_0CI0t2_600x450.jpg       1
https://images.craigslist.org/00K0K_fUdf8IWJOlmz_0t20CI_600x450.jpg       1
https://images.craigslist.org/00C0C_3D2bgxR1cysz_0ak07K_600x450.jpg       1
Name: image_url, Length: 241899, dtype: int64
Uttar Pradesh                               45573
Maharashtra                                 45263
Karnataka                                   30492
Punjab                                      30455
Delhi                                       30379
Haryana                                     30175
Tamil Nadu                                  18124
West Bengal                                 18052
Telangana                                   15356
Kerala                                      15304
Madhya Pradesh                              15205
Rajasthan                                   15131
Andhra Pradesh                              15070
Gujarat                                     14992
Odisha                                      12189
Jharkhand                                    6080
Bihar                                        6034
Chhattisgarh                                 6003
Assam                                        5987
Uttarakhand                                  5951
Andaman and Nicobar Islands                  3079
Sikkim                                       3065
Ladakh                                       3062
Puducherry                                   3046
Jammu and Kashmir                            3030
Meghalaya                                    3017
Himachal Pradesh                             3016
Mizoram                                      3010
Manipur                                      2990
Chandigarh                                   2986
Arunachal Pradesh                            2980
Dadra and Nagar Haveli and Daman and Diu     2980
Tripura                                      2961
Goa                                          2923
Lakshadweep                                  2920
Name: state, dtype: int64
2021-04-24 02:13:05+00:00    13
2021-04-30 14:01:13+00:00    12
2021-05-04 15:00:42+00:00    12
2021-05-03 13:41:54+00:00    11
2021-04-29 20:25:19+00:00    11
                             ..
2021-04-21 00:52:45+00:00     1
2021-04-20 20:41:24+00:00     1
2021-04-20 19:24:56+00:00     1
2021-04-20 19:20:54+00:00     1
2021-04-04 09:21:29+00:00     1
Name: posting_date, Length: 333355, dtype: int64
28.8329    20
28.8703    20
28.7504    19
28.6300    19
28.5907    19
           ..
19.5164     1
11.4463     1
27.3585     1
21.9371     1
17.0875     1
Name: latitude, Length: 191354, dtype: int64
77.2591    24
77.1361    24
77.2894    23
77.2735    23
76.9070    23
           ..
80.3373     1
68.9184     1
84.2857     1
89.4721     1
83.7688     1
Name: longitude, Length: 162536, dtype: int64

missing_value_count = sp.isnull().sum()
print("Missing Values in Each Column:")
missing_value_count

Missing Values in Each Column:

id                   0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
VIN             161042
drive           130567
size            306361
type             92858
paint_color     130203
image_url           68
state                0
posting_date        68
latitude             0
longitude            0
dtype: int64

duplicates = sp[sp.duplicated()]
duplicate_count = len(duplicates)
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")

Number of Duplicate Rows in the Dataset: 0

sp.dtypes

id                int64
price           float64
year            float64
manufacturer     object
model            object
condition        object
cylinders        object
fuel             object
odometer        float64
title_status     object
transmission     object
VIN              object
drive            object
size             object
type             object
paint_color      object
image_url        object
state            object
posting_date     object
latitude        float64
longitude       float64
dtype: object

pd.options.mode.chained_assignment = None
sp["posting_date"]= pd.to_datetime(sp["posting_date"], format = '%Y-%m-%d %H:%M:%S')

# Convert 2022.0 to 2022 while keeping NaN values
sp['year'] = pd.to_numeric(sp['year'], errors='coerce').astype('Int64')
sp['year']

0         <NA>
1         <NA>
2         <NA>
3         <NA>
4         <NA>
          ... 
426875    2019
426876    2020
426877    2020
426878    2018
426879    2019
Name: year, Length: 426880, dtype: Int64

sp['price'] = pd.to_numeric(sp['price'], errors='coerce')

sp.dtypes

id                            int64
price                       float64
year                          Int64
manufacturer                 object
model                        object
condition                    object
cylinders                    object
fuel                         object
odometer                    float64
title_status                 object
transmission                 object
VIN                          object
drive                        object
size                         object
type                         object
paint_color                  object
image_url                    object
state                        object
posting_date    datetime64[ns, UTC]
latitude                    float64
longitude                   float64
dtype: object

# Select columns with numeric types (either float64 or int64)
numerical_columns = sp.select_dtypes(include=['float64']).columns.tolist()
print(numerical_columns)

['price', 'odometer', 'latitude', 'longitude']

outliers = {}

for column in numerical_columns:
    Q1 = sp[column].quantile(0.10)
    Q3 = sp[column].quantile(0.90)
    IQR = Q3-Q1
    
    lower_bound = Q1 - 1.5*IQR
    upper_bound = Q3 + 1.5*IQR
    
    outliers[column] = sp[column][(sp[column]<lower_bound) | (sp[column]>upper_bound)]
    print(outliers[column])

280       1.268338e+11
346       1.270072e+07
461       1.283604e+07
732       1.284192e+07
1727      1.669450e+07
              ...     
420535    1.284100e+07
420693    1.605240e+07
424281    3.846158e+07
424840    1.258505e+07
426302    1.605240e+07
Name: price, Length: 874, dtype: float64
280        999999.0
455       9999999.0
562       2222222.0
615       1111111.0
616       1111111.0
            ...    
424388     468000.0
424840     550000.0
424962     447247.0
426106    9999999.0
426108     999999.0
Name: odometer, Length: 1619, dtype: float64
Series([], Name: latitude, dtype: float64)
Series([], Name: longitude, dtype: float64)

for col, data in outliers.items():
    print(f"{col}: {len(data)}")

price: 874
odometer: 1619
latitude: 0
longitude: 0

for columns in numerical_columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=sp[columns])
    plt.title('Boxplot for Numerical Columns')
    plt.xlabel(columns)
    plt.ylabel('Values')
    plt.xticks(rotation=45)
    plt.show()

sp['posting_year'] = sp['posting_date'].dt.year
sp['posting_year'] = pd.to_numeric(sp['posting_year'], errors='coerce').astype('Int64')
sp['posting_year']

0         <NA>
1         <NA>
2         <NA>
3         <NA>
4         <NA>
          ... 
426875    2021
426876    2021
426877    2021
426878    2021
426879    2021
Name: posting_year, Length: 426880, dtype: Int64

sp['posting_month'] = sp['posting_date'].dt.month
sp['posting_month'] = pd.to_numeric(sp['posting_month'], errors='coerce').astype('Int64')
sp['posting_month']

0         <NA>
1         <NA>
2         <NA>
3         <NA>
4         <NA>
          ... 
426875       4
426876       4
426877       4
426878       4
426879       4
Name: posting_month, Length: 426880, dtype: Int64

sp['posting_weekday'] = sp['posting_date'].dt.day_name()
sp['posting_weekday']

0            NaN
1            NaN
2            NaN
3            NaN
4            NaN
           ...  
426875    Sunday
426876    Sunday
426877    Sunday
426878    Sunday
426879    Sunday
Name: posting_weekday, Length: 426880, dtype: object

invalid_data = sp[sp['year'] > sp['posting_year']]
invalid_data

sp = sp.drop(invalid_data.index)
sp

if 'type' in sp.columns:
    sp.rename(columns={'type': 'car_type'}, inplace=True)

sp[sp['price']==0]

# Calculate the median of the column (ignoring zeros)
median_price = sp[sp['price'] > 0]['price'].median()

# Replace zeros with the median
sp.loc[sp['price'] == 0, 'price'] = median_price

# Display the updated dataframe and median
print(f"Median Price (excluding zeros): {median_price}")
print(sp[['price']].head())

Median Price (excluding zeros): 1990460.0
       price
0   770530.0
1  1528210.0
2  2696820.0
3   192610.0
4   629240.0

sp['odometer'].isnull().sum()

4400

# Calculate the median of the column (ignoring zeros)
median_odometer = sp[sp['odometer'] > 0]['odometer'].median()

# Replace missing with the median
sp['odometer'].fillna(median_odometer, inplace=True)

# Display the updated dataframe and median
print(f"Median odometer (excluding zeros): {median_odometer}")
print(sp[['odometer']].head())

Median odometer (excluding zeros): 86000.0
   odometer
0   86000.0
1   86000.0
2   86000.0
3   86000.0
4   86000.0

sp['year'].isnull().sum()

1205

# Replace empty or invalid values with NaN
sp['year'] = sp['year'].replace(['', 'NA'], pd.NA)

# Calculate median years for each manufacturer and model
median_years = sp.groupby(['manufacturer', 'model'])['year'].median()

# Fill missing years using the calculated medians
sp['year'] = sp.apply(
    lambda row: median_years.loc[(row['manufacturer'], row['model'])] if pd.isna(row['year']) and (row['manufacturer'], row['model']) in median_years.index else row['year'],
    axis=1
)

# For rows still missing 'year', fill with the overall median
overall_median_year = sp['year'].median()
sp['year'] = sp['year'].fillna(overall_median_year)

# View the result
print(sp)

                id      price    year manufacturer                     model  \
0       7222695916   770530.0  2013.0          NaN                       NaN   
1       7218891961  1528210.0  2013.0          NaN                       NaN   
2       7221797935  2696820.0  2013.0          NaN                       NaN   
3       7222270760   192610.0  2013.0          NaN                       NaN   
4       7210384030   629240.0  2013.0          NaN                       NaN   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

       condition    cylinders    fuel  odometer title_status  ...   car_type  \
0            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
1            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
2            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
3            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
4            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
...          ...          ...     ...       ...          ...  ...        ...   
426875      good  6 cylinders     gas   32226.0        clean  ...      sedan   
426876      good          NaN     gas   12029.0        clean  ...      sedan   
426877      good          NaN  diesel    4174.0        clean  ...  hatchback   
426878      good  6 cylinders     gas   30112.0        clean  ...      sedan   
426879      good          NaN     gas   22716.0        clean  ...      coupe   

       paint_color                                          image_url  \
0              NaN                                                NaN   
1              NaN                                                NaN   
2              NaN                                                NaN   
3              NaN                                                NaN   
4              NaN                                                NaN   
...            ...                                                ...   
426875         NaN  https://images.craigslist.org/00o0o_iiraFnHg8q...   
426876         red  https://images.craigslist.org/00x0x_15sbgnxCIS...   
426877       white  https://images.craigslist.org/00L0L_farM7bxnxR...   
426878      silver  https://images.craigslist.org/00z0z_bKnIVGLkDT...   
426879         NaN  https://images.craigslist.org/00Y0Y_lEUocjyRxa...   

                 state              posting_date latitude longitude  \
0              Haryana                       NaT  28.8446   75.1167   
1              Haryana                       NaT  27.9026   77.0382   
2          West Bengal                       NaT  24.6838   85.9695   
3       Andhra Pradesh                       NaT  15.5158   81.9342   
4          West Bengal                       NaT  23.7821   87.4861   
...                ...                       ...      ...       ...   
426875           Delhi 2021-04-04 09:21:31+00:00  28.7876   76.8644   
426876          Punjab 2021-04-04 09:21:29+00:00  32.0715   75.1915   
426877  Madhya Pradesh 2021-04-04 09:21:17+00:00  21.5646   75.7247   
426878     Maharashtra 2021-04-04 09:21:11+00:00  21.4183   75.4719   
426879       Jharkhand 2021-04-04 09:21:07+00:00  24.2616   83.7688   

       posting_year posting_month  posting_weekday  
0              <NA>          <NA>              NaN  
1              <NA>          <NA>              NaN  
2              <NA>          <NA>              NaN  
3              <NA>          <NA>              NaN  
4              <NA>          <NA>              NaN  
...             ...           ...              ...  
426875         2021             4           Sunday  
426876         2021             4           Sunday  
426877         2021             4           Sunday  
426878         2021             4           Sunday  
426879         2021             4           Sunday  

[426747 rows x 24 columns]

sp['model'].isnull().sum()

5268

sp['model'].value_counts()

f-150                      8008
silverado 1500             5140
1500                       4211
camry                      3135
silverado                  3022
                           ... 
escalade and                  1
cx 3                          1
1991 INTERNATIONAL 9800       1
nv 200 s                      1
Paige Glenbrook Touring       1
Name: model, Length: 29648, dtype: int64

# Step 1: Create a mapping of (manufacturer, year) to model
model_mapping = (
    sp[~sp['model'].isnull()]  # Filter rows where 'model' is not null
    .groupby(['manufacturer', 'year'])['model']
    .first()
    .reset_index()
)

# Convert the mapping to a dictionary
mapping_dict = model_mapping.set_index(['manufacturer', 'year'])['model'].to_dict()

# Step 2: Fill missing 'model' values
for index, row in sp.iterrows():      
    if pd.isnull(row['model']):  # Check if 'model' is missing
        sp.at[index, 'model'] = mapping_dict.get((row['manufacturer'], row['year']), None)

# Print the updated DataFrame
print(sp)

                id      price    year manufacturer                     model  \
0       7222695916   770530.0  2013.0          NaN                       NaN   
1       7218891961  1528210.0  2013.0          NaN                       NaN   
2       7221797935  2696820.0  2013.0          NaN                       NaN   
3       7222270760   192610.0  2013.0          NaN                       NaN   
4       7210384030   629240.0  2013.0          NaN                       NaN   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

       condition    cylinders    fuel  odometer title_status  ...   car_type  \
0            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
1            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
2            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
3            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
4            NaN          NaN     NaN   86000.0          NaN  ...        NaN   
...          ...          ...     ...       ...          ...  ...        ...   
426875      good  6 cylinders     gas   32226.0        clean  ...      sedan   
426876      good          NaN     gas   12029.0        clean  ...      sedan   
426877      good          NaN  diesel    4174.0        clean  ...  hatchback   
426878      good  6 cylinders     gas   30112.0        clean  ...      sedan   
426879      good          NaN     gas   22716.0        clean  ...      coupe   

       paint_color                                          image_url  \
0              NaN                                                NaN   
1              NaN                                                NaN   
2              NaN                                                NaN   
3              NaN                                                NaN   
4              NaN                                                NaN   
...            ...                                                ...   
426875         NaN  https://images.craigslist.org/00o0o_iiraFnHg8q...   
426876         red  https://images.craigslist.org/00x0x_15sbgnxCIS...   
426877       white  https://images.craigslist.org/00L0L_farM7bxnxR...   
426878      silver  https://images.craigslist.org/00z0z_bKnIVGLkDT...   
426879         NaN  https://images.craigslist.org/00Y0Y_lEUocjyRxa...   

                 state              posting_date latitude longitude  \
0              Haryana                       NaT  28.8446   75.1167   
1              Haryana                       NaT  27.9026   77.0382   
2          West Bengal                       NaT  24.6838   85.9695   
3       Andhra Pradesh                       NaT  15.5158   81.9342   
4          West Bengal                       NaT  23.7821   87.4861   
...                ...                       ...      ...       ...   
426875           Delhi 2021-04-04 09:21:31+00:00  28.7876   76.8644   
426876          Punjab 2021-04-04 09:21:29+00:00  32.0715   75.1915   
426877  Madhya Pradesh 2021-04-04 09:21:17+00:00  21.5646   75.7247   
426878     Maharashtra 2021-04-04 09:21:11+00:00  21.4183   75.4719   
426879       Jharkhand 2021-04-04 09:21:07+00:00  24.2616   83.7688   

       posting_year posting_month  posting_weekday  
0              <NA>          <NA>              NaN  
1              <NA>          <NA>              NaN  
2              <NA>          <NA>              NaN  
3              <NA>          <NA>              NaN  
4              <NA>          <NA>              NaN  
...             ...           ...              ...  
426875         2021             4           Sunday  
426876         2021             4           Sunday  
426877         2021             4           Sunday  
426878         2021             4           Sunday  
426879         2021             4           Sunday  

[426747 rows x 24 columns]

sp['model'].isnull().sum()

148

# Drop rows where both 'manufacturer' and 'model' are null
sp = sp.dropna(subset=['manufacturer', 'model'], how='all')

# Print the updated DataFrame
print(sp)

                id      price    year manufacturer                     model  \
27      7316814884  4313900.0  2014.0          gmc  sierra 1500 crew cab slt   
28      7316814758  2901280.0  2010.0    chevrolet            silverado 1500   
29      7316814989  5084440.0  2020.0    chevrolet       silverado 1500 crew   
30      7316743432  3980090.0  2017.0       toyota      tundra double cab sr   
31      7316356412  1926290.0  2013.0         ford                 f-150 xlt   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

        condition    cylinders    fuel  odometer title_status  ...   car_type  \
27           good  8 cylinders     gas   57923.0        clean  ...     pickup   
28           good  8 cylinders     gas   71229.0        clean  ...     pickup   
29           good  8 cylinders     gas   19160.0        clean  ...     pickup   
30           good  8 cylinders     gas   41124.0        clean  ...     pickup   
31      excellent  6 cylinders     gas  128000.0        clean  ...      truck   
...           ...          ...     ...       ...          ...  ...        ...   
426875       good  6 cylinders     gas   32226.0        clean  ...      sedan   
426876       good          NaN     gas   12029.0        clean  ...      sedan   
426877       good          NaN  diesel    4174.0        clean  ...  hatchback   
426878       good  6 cylinders     gas   30112.0        clean  ...      sedan   
426879       good          NaN     gas   22716.0        clean  ...      coupe   

       paint_color                                          image_url  \
27           white  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
28            blue  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
29             red  https://images.craigslist.org/01212_jjirIWa0y0...   
30             red  https://images.craigslist.org/00x0x_1y9kIOzGCF...   
31           black  https://images.craigslist.org/00404_l4loxHvdQe...   
...            ...                                                ...   
426875         NaN  https://images.craigslist.org/00o0o_iiraFnHg8q...   
426876         red  https://images.craigslist.org/00x0x_15sbgnxCIS...   
426877       white  https://images.craigslist.org/00L0L_farM7bxnxR...   
426878      silver  https://images.craigslist.org/00z0z_bKnIVGLkDT...   
426879         NaN  https://images.craigslist.org/00Y0Y_lEUocjyRxa...   

                 state              posting_date latitude longitude  \
27              Kerala 2021-05-04 17:31:18+00:00   9.8072   76.8361   
28       Uttar Pradesh 2021-05-04 17:31:08+00:00  29.2417   83.7824   
29               Delhi 2021-05-04 17:31:25+00:00  28.8053   76.8871   
30              Odisha 2021-05-04 15:41:31+00:00  19.2886   85.2667   
31             Tripura 2021-05-03 19:02:03+00:00  23.4924   91.1187   
...                ...                       ...      ...       ...   
426875           Delhi 2021-04-04 09:21:31+00:00  28.7876   76.8644   
426876          Punjab 2021-04-04 09:21:29+00:00  32.0715   75.1915   
426877  Madhya Pradesh 2021-04-04 09:21:17+00:00  21.5646   75.7247   
426878     Maharashtra 2021-04-04 09:21:11+00:00  21.4183   75.4719   
426879       Jharkhand 2021-04-04 09:21:07+00:00  24.2616   83.7688   

       posting_year posting_month  posting_weekday  
27             2021             5          Tuesday  
28             2021             5          Tuesday  
29             2021             5          Tuesday  
30             2021             5          Tuesday  
31             2021             5           Monday  
...             ...           ...              ...  
426875         2021             4           Sunday  
426876         2021             4           Sunday  
426877         2021             4           Sunday  
426878         2021             4           Sunday  
426879         2021             4           Sunday  

[426678 rows x 24 columns]

sp['model'].isnull().sum()

79

# Fill missing values in the 'model' column with 'Unknown'
sp['model'] = sp['model'].fillna('Unknown')

# Print the updated DataFrame
print(sp)

                id      price    year manufacturer                     model  \
27      7316814884  4313900.0  2014.0          gmc  sierra 1500 crew cab slt   
28      7316814758  2901280.0  2010.0    chevrolet            silverado 1500   
29      7316814989  5084440.0  2020.0    chevrolet       silverado 1500 crew   
30      7316743432  3980090.0  2017.0       toyota      tundra double cab sr   
31      7316356412  1926290.0  2013.0         ford                 f-150 xlt   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

        condition    cylinders    fuel  odometer title_status  ...   car_type  \
27           good  8 cylinders     gas   57923.0        clean  ...     pickup   
28           good  8 cylinders     gas   71229.0        clean  ...     pickup   
29           good  8 cylinders     gas   19160.0        clean  ...     pickup   
30           good  8 cylinders     gas   41124.0        clean  ...     pickup   
31      excellent  6 cylinders     gas  128000.0        clean  ...      truck   
...           ...          ...     ...       ...          ...  ...        ...   
426875       good  6 cylinders     gas   32226.0        clean  ...      sedan   
426876       good          NaN     gas   12029.0        clean  ...      sedan   
426877       good          NaN  diesel    4174.0        clean  ...  hatchback   
426878       good  6 cylinders     gas   30112.0        clean  ...      sedan   
426879       good          NaN     gas   22716.0        clean  ...      coupe   

       paint_color                                          image_url  \
27           white  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
28            blue  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
29             red  https://images.craigslist.org/01212_jjirIWa0y0...   
30             red  https://images.craigslist.org/00x0x_1y9kIOzGCF...   
31           black  https://images.craigslist.org/00404_l4loxHvdQe...   
...            ...                                                ...   
426875         NaN  https://images.craigslist.org/00o0o_iiraFnHg8q...   
426876         red  https://images.craigslist.org/00x0x_15sbgnxCIS...   
426877       white  https://images.craigslist.org/00L0L_farM7bxnxR...   
426878      silver  https://images.craigslist.org/00z0z_bKnIVGLkDT...   
426879         NaN  https://images.craigslist.org/00Y0Y_lEUocjyRxa...   

                 state              posting_date latitude longitude  \
27              Kerala 2021-05-04 17:31:18+00:00   9.8072   76.8361   
28       Uttar Pradesh 2021-05-04 17:31:08+00:00  29.2417   83.7824   
29               Delhi 2021-05-04 17:31:25+00:00  28.8053   76.8871   
30              Odisha 2021-05-04 15:41:31+00:00  19.2886   85.2667   
31             Tripura 2021-05-03 19:02:03+00:00  23.4924   91.1187   
...                ...                       ...      ...       ...   
426875           Delhi 2021-04-04 09:21:31+00:00  28.7876   76.8644   
426876          Punjab 2021-04-04 09:21:29+00:00  32.0715   75.1915   
426877  Madhya Pradesh 2021-04-04 09:21:17+00:00  21.5646   75.7247   
426878     Maharashtra 2021-04-04 09:21:11+00:00  21.4183   75.4719   
426879       Jharkhand 2021-04-04 09:21:07+00:00  24.2616   83.7688   

       posting_year posting_month  posting_weekday  
27             2021             5          Tuesday  
28             2021             5          Tuesday  
29             2021             5          Tuesday  
30             2021             5          Tuesday  
31             2021             5           Monday  
...             ...           ...              ...  
426875         2021             4           Sunday  
426876         2021             4           Sunday  
426877         2021             4           Sunday  
426878         2021             4           Sunday  
426879         2021             4           Sunday  

[426678 rows x 24 columns]

sp['model'].isnull().sum()

0

sp['model'].value_counts()

f-150                      8014
silverado 1500             5154
1500                       4403
camry                      3155
silverado                  3037
                           ... 
g35xsports                    1
escalade ext awd gas          1
big horn slt                  1
1500 4x4 sport                1
Paige Glenbrook Touring       1
Name: model, Length: 29649, dtype: int64

sp['manufacturer'].isnull().sum()

17536

sp['manufacturer'].value_counts()

ford               70981
chevrolet          55061
toyota             34199
honda              21267
nissan             19063
jeep               19014
ram                18342
gmc                16785
bmw                14698
dodge              13707
mercedes-benz      11816
hyundai            10338
subaru              9495
volkswagen          9345
kia                 8457
lexus               8196
audi                7572
cadillac            6952
chrysler            6031
acura               5978
buick               5501
mazda               5427
infiniti            4802
lincoln             4220
volvo               3374
mitsubishi          3225
mini                2376
pontiac             2288
rover               2113
jaguar              1946
porsche             1383
mercury             1184
saturn              1090
alfa-romeo           897
tesla                868
fiat                 792
harley-davidson      153
ferrari               95
datsun                63
aston-martin          24
land rover            21
morgan                 3
Name: manufacturer, dtype: int64

# Step 1: Create a mapping of model to manufacturer
manufacturer_mapping = (
    sp[~sp['manufacturer'].isnull()]  # Filter rows where 'manufacturer' is not null
    .groupby('model')['manufacturer']
    .first()
    .reset_index()
)

# Convert the mapping to a dictionary
manufacturer_mapping_dict = manufacturer_mapping.set_index('model')['manufacturer'].to_dict()

# Step 2: Fill missing values in 'manufacturer'
def fill_manufacturer(row):
    if pd.isnull(row['manufacturer']):
        # Try model-only mapping
        return manufacturer_mapping_dict.get(row['model'], 'Unknown')  # Default to 'Unknown' if no match
    return row['manufacturer']

sp['manufacturer'] = sp.apply(fill_manufacturer, axis=1)

# Print the updated DataFrame
print(sp)

                id      price    year manufacturer                     model  \
27      7316814884  4313900.0  2014.0          gmc  sierra 1500 crew cab slt   
28      7316814758  2901280.0  2010.0    chevrolet            silverado 1500   
29      7316814989  5084440.0  2020.0    chevrolet       silverado 1500 crew   
30      7316743432  3980090.0  2017.0       toyota      tundra double cab sr   
31      7316356412  1926290.0  2013.0         ford                 f-150 xlt   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

        condition    cylinders    fuel  odometer title_status  ...   car_type  \
27           good  8 cylinders     gas   57923.0        clean  ...     pickup   
28           good  8 cylinders     gas   71229.0        clean  ...     pickup   
29           good  8 cylinders     gas   19160.0        clean  ...     pickup   
30           good  8 cylinders     gas   41124.0        clean  ...     pickup   
31      excellent  6 cylinders     gas  128000.0        clean  ...      truck   
...           ...          ...     ...       ...          ...  ...        ...   
426875       good  6 cylinders     gas   32226.0        clean  ...      sedan   
426876       good          NaN     gas   12029.0        clean  ...      sedan   
426877       good          NaN  diesel    4174.0        clean  ...  hatchback   
426878       good  6 cylinders     gas   30112.0        clean  ...      sedan   
426879       good          NaN     gas   22716.0        clean  ...      coupe   

       paint_color                                          image_url  \
27           white  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
28            blue  https://images.craigslist.org/00R0R_lwWjXSEWNa...   
29             red  https://images.craigslist.org/01212_jjirIWa0y0...   
30             red  https://images.craigslist.org/00x0x_1y9kIOzGCF...   
31           black  https://images.craigslist.org/00404_l4loxHvdQe...   
...            ...                                                ...   
426875         NaN  https://images.craigslist.org/00o0o_iiraFnHg8q...   
426876         red  https://images.craigslist.org/00x0x_15sbgnxCIS...   
426877       white  https://images.craigslist.org/00L0L_farM7bxnxR...   
426878      silver  https://images.craigslist.org/00z0z_bKnIVGLkDT...   
426879         NaN  https://images.craigslist.org/00Y0Y_lEUocjyRxa...   

                 state              posting_date latitude longitude  \
27              Kerala 2021-05-04 17:31:18+00:00   9.8072   76.8361   
28       Uttar Pradesh 2021-05-04 17:31:08+00:00  29.2417   83.7824   
29               Delhi 2021-05-04 17:31:25+00:00  28.8053   76.8871   
30              Odisha 2021-05-04 15:41:31+00:00  19.2886   85.2667   
31             Tripura 2021-05-03 19:02:03+00:00  23.4924   91.1187   
...                ...                       ...      ...       ...   
426875           Delhi 2021-04-04 09:21:31+00:00  28.7876   76.8644   
426876          Punjab 2021-04-04 09:21:29+00:00  32.0715   75.1915   
426877  Madhya Pradesh 2021-04-04 09:21:17+00:00  21.5646   75.7247   
426878     Maharashtra 2021-04-04 09:21:11+00:00  21.4183   75.4719   
426879       Jharkhand 2021-04-04 09:21:07+00:00  24.2616   83.7688   

       posting_year posting_month  posting_weekday  
27             2021             5          Tuesday  
28             2021             5          Tuesday  
29             2021             5          Tuesday  
30             2021             5          Tuesday  
31             2021             5           Monday  
...             ...           ...              ...  
426875         2021             4           Sunday  
426876         2021             4           Sunday  
426877         2021             4           Sunday  
426878         2021             4           Sunday  
426879         2021             4           Sunday  

[426678 rows x 24 columns]

sp['manufacturer'].isnull().sum()

0

sp['manufacturer'].value_counts()

ford               71073
chevrolet          55182
toyota             34449
honda              21270
nissan             19065
jeep               19023
ram                18364
gmc                16793
Unknown            16769
bmw                14721
dodge              13724
mercedes-benz      11818
hyundai            10340
subaru              9495
volkswagen          9350
kia                 8457
lexus               8196
audi                7572
cadillac            6952
chrysler            6052
acura               5978
buick               5504
mazda               5456
infiniti            4803
lincoln             4220
volvo               3377
mitsubishi          3227
mini                2376
pontiac             2291
rover               2175
jaguar              1946
porsche             1385
mercury             1187
saturn              1090
alfa-romeo           897
fiat                 868
tesla                868
harley-davidson      153
ferrari               95
datsun                69
aston-martin          24
land rover            21
morgan                 3
Name: manufacturer, dtype: int64

sp['condition'].isnull().sum()

173984

sp['condition'].value_counts()

good         121452
excellent    101430
like new      21175
fair           6765
new            1271
salvage         601
Name: condition, dtype: int64

sp['condition'] = sp.groupby(['manufacturer','model','year'])['condition'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))

sp['condition'].isnull().sum()

0

sp['condition'].value_counts()

excellent    209179
good         143637
Unknown       36669
like new      27061
fair           7702
new            1781
salvage         649
Name: condition, dtype: int64

sp['cylinders'].isnull().sum()

177542

sp['cylinders'].value_counts()

6 cylinders     94165
4 cylinders     77588
8 cylinders     72060
5 cylinders      1712
10 cylinders     1455
other            1293
3 cylinders       655
12 cylinders      208
Name: cylinders, dtype: int64

sp['cylinders'] = sp.groupby(['manufacturer','model','year'])['cylinders'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))

sp['cylinders'].isnull().sum()

0

sp['cylinders'].value_counts()

6 cylinders     137826
4 cylinders     123886
8 cylinders     102218
Unknown          55441
5 cylinders       2450
10 cylinders      1912
other             1814
3 cylinders        907
12 cylinders       224
Name: cylinders, dtype: int64

sp['fuel'].isnull().sum()

2945

sp['fuel'].value_counts()

gas         356115
other        30696
diesel       30055
hybrid        5170
electric      1697
Name: fuel, dtype: int64

sp['fuel'] = sp.groupby(['manufacturer','model','year'])['fuel'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))

sp['fuel'].isnull().sum()

0

sp['fuel'].value_counts()

gas         357266
other        30724
diesel       30101
hybrid        5195
electric      1701
Unknown       1691
Name: fuel, dtype: int64

sp['transmission'].isnull().sum()

2488

sp['transmission'].value_counts()

automatic    336437
other         62642
manual        25111
Name: transmission, dtype: int64

sp['transmission'] = sp.groupby(['manufacturer','model','year'])['transmission'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))

sp['transmission'].isnull().sum()

0

sp['transmission'].value_counts()

automatic    338541
other         62660
manual        25135
Unknown         342
Name: transmission, dtype: int64

sp['title_status'].isnull().sum()

8139

sp['title_status'].value_counts()

clean         405030
rebuilt         7219
salvage         3861
lien            1422
missing          810
parts only       197
Name: title_status, dtype: int64

sp['title_status'] = sp['title_status'].fillna('Unknown')

sp['title_status'].isnull().sum()

0

sp['title_status'].value_counts()

clean         405030
Unknown         8139
rebuilt         7219
salvage         3861
lien            1422
missing          810
parts only       197
Name: title_status, dtype: int64

sp['car_type'].isnull().sum()

92723

sp['car_type'].value_counts()

sedan          87055
SUV            77233
pickup         43509
truck          35273
other          22105
coupe          19204
hatchback      16598
wagon          10750
van             8548
convertible     7731
mini-van        4825
offroad          609
bus              515
Name: car_type, dtype: int64

sp['car_type'] = sp.groupby(['manufacturer','model','year','transmission','fuel'])['car_type'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))

sp['car_type'].isnull().sum()

0

sp['car_type'].value_counts()

sedan          106733
SUV             97421
pickup          50507
truck           43467
Unknown         24924
other           23014
coupe           22104
hatchback       18777
wagon           12186
van             10284
convertible      8910
mini-van         7007
offroad           725
bus               619
Name: car_type, dtype: int64

sp['paint_color'].isnull().sum()

130030

sp['paint_color'].value_counts()

white     79280
black     62854
silver    42963
blue      31220
red       30469
grey      24415
green      7343
custom     6699
brown      6593
yellow     2142
orange     1984
purple      686
Name: paint_color, dtype: int64

sp['paint_color'] = sp.groupby(['manufacturer','model','year'])['paint_color'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))

sp['paint_color'].isnull().sum()

0

sp['paint_color'].value_counts()

white      108720
black       91076
silver      57794
blue        40905
red         37135
grey        30740
Unknown     30617
green        8786
brown        7901
custom       7606
yellow       2346
orange       2282
purple        770
Name: paint_color, dtype: int64

sp['drive'].isnull().sum()

130432

sp['drive'].value_counts()

4wd    131853
fwd    105506
rwd     58887
Name: drive, dtype: int64

sp['drive'] = sp.groupby(['manufacturer','model'])['drive'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))

sp['drive'].isnull().sum()

0

sp['drive'].value_counts()

4wd        179672
fwd        145427
rwd         78620
Unknown     22959
Name: drive, dtype: int64

sp['size'].isnull().sum()

306166

sp['size'].value_counts()

full-size      63461
mid-size       34474
compact        19383
sub-compact     3194
Name: size, dtype: int64

sp['size'] = sp.groupby(['manufacturer','model','year'])['size'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))

sp['size'].isnull().sum()

0

sp['size'].value_counts()

full-size      160376
Unknown        121065
mid-size        87792
compact         51146
sub-compact      6299
Name: size, dtype: int64

sp['VIN'].isnull().sum()

160908

sp['VIN'].value_counts()

1FMJU1JT1HEA52352    261
3C6JR6DT3KG560649    235
1FTER1EH1LLA36301    231
5TFTX4CN3EX042751    227
1GCHTCE37G1186784    214
                    ... 
4S4BRBGC3B3332201      1
1N4AA5AP2EC448150      1
1J8GN58K88W233418      1
2FMDK3GC9ABA40182      1
SAJGX2749VCOO8376      1
Name: VIN, Length: 118226, dtype: int64

sp['VIN'] = sp.groupby(['manufacturer','model','year','price','condition','cylinders','fuel','odometer','title_status','transmission','drive','size','car_type','paint_color'])['VIN'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))

sp['VIN'].isnull().sum()

0

sp['VIN'].value_counts()

Unknown              160210
1FMJU1JT1HEA52352       261
3C6JR6DT3KG560649       235
1FTER1EH1LLA36301       231
5TFTX4CN3EX042751       227
                      ...  
1C3EL75R35N639520         1
1GCHG35R211150470         1
3B7HC13YXXM530629         1
1GT21REG3GZ289767         1
SAJGX2749VCOO8376         1
Name: VIN, Length: 118227, dtype: int64

missing_value_count = sp.isnull().sum()
print("Missing Values in Each Column:")
missing_value_count

Missing Values in Each Column:

id                 0
price              0
year               0
manufacturer       0
model              0
condition          0
cylinders          0
fuel               0
odometer           0
title_status       0
transmission       0
VIN                0
drive              0
size               0
car_type           0
paint_color        0
image_url          0
state              0
posting_date       0
latitude           0
longitude          0
posting_year       0
posting_month      0
posting_weekday    0
dtype: int64

sp['car_age'] = sp.apply(
    lambda row: row['posting_date'].year - row['year'] if pd.notna(row['year']) and pd.notna(row['posting_date']) else np.nan, 
    axis=1
)
# Display the DataFrame with the new 'car_age' column
print(sp[['year', 'posting_date', 'car_age']])

          year              posting_date  car_age
27      2014.0 2021-05-04 17:31:18+00:00      7.0
28      2010.0 2021-05-04 17:31:08+00:00     11.0
29      2020.0 2021-05-04 17:31:25+00:00      1.0
30      2017.0 2021-05-04 15:41:31+00:00      4.0
31      2013.0 2021-05-03 19:02:03+00:00      8.0
...        ...                       ...      ...
426875  2019.0 2021-04-04 09:21:31+00:00      2.0
426876  2020.0 2021-04-04 09:21:29+00:00      1.0
426877  2020.0 2021-04-04 09:21:17+00:00      1.0
426878  2018.0 2021-04-04 09:21:11+00:00      3.0
426879  2019.0 2021-04-04 09:21:07+00:00      2.0

[426678 rows x 3 columns]

# Calculate 'avg_yearly_mileage'
sp['avg_yearly_mileage'] = np.where(
    sp['car_age'] == 0,  # Check if 'car_age' is zero
    sp['odometer'],      # If 'car_age' is zero, use 'odometer'
    sp['odometer'] / sp['car_age']  # Otherwise, calculate the formula
)

# Display the result
print(sp['avg_yearly_mileage'])

27         8274.714286
28         6475.363636
29        19160.000000
30        10281.000000
31        16000.000000
              ...     
426875    16113.000000
426876    12029.000000
426877     4174.000000
426878    10037.333333
426879    11358.000000
Name: avg_yearly_mileage, Length: 426678, dtype: float64

sp

sp.describe()

# Correlation between price and numerical attributes
correlations = sp.corr()['price'].sort_values(ascending=False)
print("\nCorrelation with Price:")
print(correlations)

Correlation with Price:
price                 1.000000
odometer              0.010112
car_age               0.004899
avg_yearly_mileage    0.000166
latitude             -0.001279
longitude            -0.001498
posting_month        -0.001826
id                   -0.002846
year                 -0.004899
posting_year               NaN
Name: price, dtype: float64

# Summary of pricing trends by condition and fuel type
price_summary = sp.groupby(['condition', 'fuel']).agg(
    avg_price=('price', 'mean'),
    median_price=('price', 'median'),
    count=('price', 'size')
).reset_index()

print("\nOptimal Pricing Summary:")
print(price_summary)

Optimal Pricing Summary:
    condition      fuel     avg_price  median_price   count
0     Unknown   Unknown  2.657031e+06     2309580.0    1687
1     Unknown    diesel  4.588512e+06     4045190.0    6839
2     Unknown  electric  3.249356e+06     1990460.0     159
3     Unknown       gas  3.439129e+06     1990460.0   25470
4     Unknown    hybrid  2.589077e+06     2051780.0     447
5     Unknown     other  4.122992e+06     2824250.0    2067
6   excellent   Unknown  7.312450e+06     7312450.0       4
7   excellent    diesel  1.829421e+07     4036280.0   12867
8   excellent  electric  2.823478e+06     1990460.0     526
9   excellent       gas  1.226798e+07     1862110.0  187480
10  excellent    hybrid  1.543248e+06     1283270.0    2495
11  excellent     other  3.043773e+06     1990460.0    5807
12       fair    diesel  2.011004e+06     1014550.0     636
13       fair  electric  1.415820e+06      488000.0       4
14       fair       gas  7.225017e+07      372460.0    6960
15       fair    hybrid  5.042757e+05      475100.0      37
16       fair     other  2.440227e+09      449480.0      65
17       good    diesel  2.399393e+07     2824250.0    7752
18       good  electric  3.471733e+06     3851650.0     845
19       good       gas  3.916336e+06     1990460.0  110943
20       good    hybrid  2.389665e+06     2130830.0    1810
21       good     other  3.679336e+06     3594780.0   22287
22   like new    diesel  4.903614e+06     4993550.0    1877
23   like new  electric  3.671779e+06     2593540.0     154
24   like new       gas  4.450640e+06     1990460.0   24226
25   like new    hybrid  2.272601e+06     1862110.0     373
26   like new     other  3.952193e+06     3466340.0     431
27        new    diesel  7.498743e+06     8352665.0      88
28        new  electric  3.191558e+06      769610.0      11
29        new       gas  3.156952e+06     2117890.0    1604
30        new    hybrid  4.556500e+06     5842160.0      22
31        new     other  2.078390e+06     1990460.0      56
32    salvage    diesel  1.845805e+06     1284190.0      42
33    salvage  electric  1.653625e+06     1653625.0       2
34    salvage       gas  5.570343e+05      256870.0     583
35    salvage    hybrid  5.446691e+05      192610.0      11
36    salvage     other  1.692764e+05      128440.0      11

sp['mileage_category'] = pd.cut(sp['avg_yearly_mileage'], 
                                bins=[0, 5000, 10000, 20000, 30000, 80000, 10000000], 
                                labels=['Very Low', 'Low', 'Moderate', 'Considerate', 'High', 'Very High'])

mileage_price = sp.groupby('mileage_category')['price'].mean()
print("Average Price by Mileage Category:")
print(mileage_price)

Average Price by Mileage Category:
mileage_category
Very Low       2.788424e+07
Low            7.393663e+06
Moderate       6.344824e+06
Considerate    2.800212e+06
High           6.066545e+06
Very High      1.310558e+08
Name: price, dtype: float64

# Analyze pricing by vehicle type
type_pricing = sp.groupby('car_type').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index()

print("\nPricing by Vehicle Type:")
print(type_pricing)

Pricing by Vehicle Type:
       car_type  total_transactions     avg_price  median_price
0           SUV               97421  1.291095e+07     1926290.0
1       Unknown               24924  1.656132e+07     1990460.0
2           bus                 619  2.206170e+06     1796930.0
3   convertible                8910  3.103358e+07     1990460.0
4         coupe               22104  2.993457e+06     2246410.0
5     hatchback               18777  1.832439e+06     1796930.0
6      mini-van                7007  1.283773e+06      924590.0
7       offroad                 725  2.090865e+06     1643800.0
8         other               23014  3.349307e+06     3209560.0
9        pickup               50507  1.652880e+07     3466340.0
10        sedan              106733  9.306231e+06     1412630.0
11        truck               43467  4.141187e+06     2632560.0
12          van               10284  2.396104e+06     1990460.0
13        wagon               12186  1.787653e+06     1669420.0

car_type_stats = sp.groupby('car_type').agg(
    avg_price=('price', 'mean'),
    num_cars=('id', 'count')
).reset_index()

car_type_stats

# Data preparation
car_type_stats = sp.groupby('car_type').agg(
    avg_price=('price', 'mean'),
    num_cars=('id', 'count')
).reset_index()

# Sort by average price
car_type_stats = car_type_stats.sort_values('avg_price', ascending=False)

# Extract data for plotting
car_types = car_type_stats['car_type']
avg_price = car_type_stats['avg_price']
num_cars = car_type_stats['num_cars']

# Create the plot
fig, ax1 = plt.subplots(figsize=(12, 6))

# Bar chart for average price
bar_width = 0.4
x = np.arange(len(car_types))
ax1.bar(x - bar_width / 2, avg_price, width=bar_width, color='lightblue', label='Average Price (₹)')
ax1.set_ylabel('Average Price (₹)', fontsize=12, color='blue')
ax1.set_xlabel('Car Type', fontsize=12)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_xticks(x)
ax1.set_xticklabels(car_types, rotation=45, ha='right')

# Bar chart for number of cars
ax2 = ax1.twinx()
ax2.bar(x + bar_width / 2, num_cars, width=bar_width, color='salmon', label='Number of Cars')
ax2.set_ylabel('Number of Cars', fontsize=12, color='red')
ax2.tick_params(axis='y', labelcolor='red')

# Add title and legend
plt.title('Average Price and Number of Cars by Car Type', fontsize=14)
fig.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)

# Adjust layout
plt.tight_layout()
plt.show()

age_analysis = sp.groupby('car_age').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index().sort_values('total_transactions', ascending=False)

print("\nCar Age Analysis (Demand and Pricing):")
print(age_analysis)

Car Age Analysis (Demand and Pricing):
     car_age  total_transactions     avg_price  median_price
4        4.0               36421  3.298168e+06     3067260.0
3        3.0               36370  3.596957e+06     3466340.0
8        8.0               31925  2.176866e+06     1990460.0
6        6.0               31538  3.245019e+06     2247340.0
5        5.0               30434  2.970425e+06     2503200.0
..       ...                 ...           ...           ...
104    106.0                   1  6.421000e+05      642100.0
108    116.0                   1  5.127400e+05      512740.0
107    112.0                   1  9.677000e+04       96770.0
102    103.0                   1  2.054720e+06     2054720.0
78      78.0                   1  1.252520e+06     1252520.0

[113 rows x 4 columns]

car_age_demand = sp['car_age'].value_counts().reset_index()
car_age_demand.columns = ['car_age', 'Demand']  # Rename columns
car_age_demand = car_age_demand.sort_values('Demand', ascending=False)  # Sort by demand in descending order
print("Car Age Demand:")
print(car_age_demand.head(30))

Car Age Demand:
    car_age  Demand
0       4.0   36421
1       3.0   36370
2       8.0   31925
3       6.0   31538
4       5.0   30434
5       7.0   30287
6       2.0   25375
7       9.0   23898
8      10.0   20341
9       1.0   19298
10     13.0   17150
11     11.0   15829
12     14.0   14873
13     15.0   12763
14     12.0   12185
15     16.0   10621
16     17.0    8971
17     18.0    7151
18     19.0    5587
19     20.0    4443
20     21.0    3572
21     22.0    3094
22      0.0    2396
23     23.0    1988
24     24.0    1724
25     25.0    1302
26     26.0    1246
27     27.0     968
28     28.0     712
29     29.0     626

plt.figure(figsize=(10, 6))
sns.scatterplot(x='car_age', y='odometer', data=sp, alpha=0.6)
plt.title('Car Age vs. Odometer Reading')
plt.xlabel('Car Age (years)')
plt.ylabel('Odometer Reading (km)')
plt.grid(True)
plt.tight_layout()
plt.show()

# Count transactions by state
state_demand = sp['state'].value_counts()
print("\nDemand by State:")
print(state_demand)

Demand by State:
Uttar Pradesh                               45552
Maharashtra                                 45236
Karnataka                                   30474
Punjab                                      30441
Delhi                                       30368
Haryana                                     30167
Tamil Nadu                                  18116
West Bengal                                 18039
Telangana                                   15352
Kerala                                      15293
Madhya Pradesh                              15199
Rajasthan                                   15121
Andhra Pradesh                              15065
Gujarat                                     14985
Odisha                                      12184
Jharkhand                                    6074
Bihar                                        6031
Chhattisgarh                                 6001
Assam                                        5986
Uttarakhand                                  5947
Andaman and Nicobar Islands                  3077
Sikkim                                       3065
Ladakh                                       3060
Puducherry                                   3045
Jammu and Kashmir                            3028
Himachal Pradesh                             3016
Meghalaya                                    3015
Mizoram                                      3009
Manipur                                      2990
Chandigarh                                   2984
Arunachal Pradesh                            2979
Dadra and Nagar Haveli and Daman and Diu     2978
Tripura                                      2960
Goa                                          2922
Lakshadweep                                  2919
Name: state, dtype: int64

state_avg_price = sp.groupby('state')['price'].mean().sort_values(ascending=False)
print("Average Price by State:")
print(state_avg_price)

Average Price by State:
state
Lakshadweep                                 1.667703e+08
Chandigarh                                  1.372737e+08
Goa                                         6.442383e+07
West Bengal                                 2.903243e+07
Haryana                                     2.018441e+07
Karnataka                                   1.617193e+07
Kerala                                      1.381462e+07
Uttar Pradesh                               8.677833e+06
Dadra and Nagar Haveli and Daman and Diu    7.327605e+06
Maharashtra                                 5.218439e+06
Madhya Pradesh                              3.720367e+06
Tamil Nadu                                  3.271037e+06
Delhi                                       3.019001e+06
Chhattisgarh                                2.672705e+06
Gujarat                                     2.533112e+06
Punjab                                      2.515638e+06
Rajasthan                                   2.509588e+06
Assam                                       2.472224e+06
Jharkhand                                   2.458025e+06
Sikkim                                      2.428607e+06
Andaman and Nicobar Islands                 2.422289e+06
Manipur                                     2.418222e+06
Jammu and Kashmir                           2.414838e+06
Arunachal Pradesh                           2.408561e+06
Tripura                                     2.407741e+06
Odisha                                      2.405977e+06
Telangana                                   2.404359e+06
Puducherry                                  2.402517e+06
Andhra Pradesh                              2.402467e+06
Uttarakhand                                 2.399425e+06
Meghalaya                                   2.393725e+06
Ladakh                                      2.388141e+06
Bihar                                       2.384134e+06
Himachal Pradesh                            2.368753e+06
Mizoram                                     2.345368e+06
Name: price, dtype: float64

# Analyze demand by state
regional_demand = sp.groupby('state').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index().sort_values('total_transactions', ascending=False)

print("\nRegional Demand Analysis:")
print(regional_demand)

Regional Demand Analysis:
                                       state  total_transactions  \
32                             Uttar Pradesh               45552   
20                               Maharashtra               45236   
15                                 Karnataka               30474   
26                                    Punjab               30441   
8                                      Delhi               30368   
11                                   Haryana               30167   
29                                Tamil Nadu               18116   
34                               West Bengal               18039   
30                                 Telangana               15352   
16                                    Kerala               15293   
19                            Madhya Pradesh               15199   
27                                 Rajasthan               15121   
1                             Andhra Pradesh               15065   
10                                   Gujarat               14985   
24                                    Odisha               12184   
14                                 Jharkhand                6074   
4                                      Bihar                6031   
6                               Chhattisgarh                6001   
3                                      Assam                5986   
33                               Uttarakhand                5947   
0                Andaman and Nicobar Islands                3077   
28                                    Sikkim                3065   
17                                    Ladakh                3060   
25                                Puducherry                3045   
13                         Jammu and Kashmir                3028   
12                          Himachal Pradesh                3016   
22                                 Meghalaya                3015   
23                                   Mizoram                3009   
21                                   Manipur                2990   
5                                 Chandigarh                2984   
2                          Arunachal Pradesh                2979   
7   Dadra and Nagar Haveli and Daman and Diu                2978   
31                                   Tripura                2960   
9                                        Goa                2922   
18                               Lakshadweep                2919   

       avg_price  median_price  
32  8.677833e+06     1990460.0  
20  5.218439e+06     1990460.0  
15  1.617193e+07     1990460.0  
26  2.515638e+06     1990460.0  
8   3.019001e+06     1990460.0  
11  2.018441e+07     1990460.0  
29  3.271037e+06     1990460.0  
34  2.903243e+07     1990460.0  
30  2.404359e+06     1990460.0  
16  1.381462e+07     1990460.0  
19  3.720367e+06     1990460.0  
27  2.509588e+06     1990460.0  
1   2.402467e+06     1990460.0  
10  2.533112e+06     1990460.0  
24  2.405977e+06     1990460.0  
14  2.458025e+06     1990460.0  
4   2.384134e+06     1990460.0  
6   2.672705e+06     1990460.0  
3   2.472224e+06     1990460.0  
33  2.399425e+06     1990460.0  
0   2.422289e+06     1990460.0  
28  2.428607e+06     1990460.0  
17  2.388141e+06     1990460.0  
25  2.402517e+06     1990460.0  
13  2.414838e+06     1990460.0  
12  2.368753e+06     1990460.0  
22  2.393725e+06     1990460.0  
23  2.345368e+06     1990460.0  
21  2.418222e+06     1990460.0  
5   1.372737e+08     1990460.0  
2   2.408561e+06     1990460.0  
7   7.327605e+06     1990460.0  
31  2.407741e+06     1990460.0  
9   6.442383e+07     1990460.0  
18  1.667703e+08     1990460.0

# Average price by state
state_pricing = sp.groupby('state').agg(
    avg_price=('price', 'mean')
).reset_index()

print("\nGeographic Pricing Analysis:")
print(state_pricing)

Geographic Pricing Analysis:
                                       state     avg_price
0                Andaman and Nicobar Islands  2.422289e+06
1                             Andhra Pradesh  2.402467e+06
2                          Arunachal Pradesh  2.408561e+06
3                                      Assam  2.472224e+06
4                                      Bihar  2.384134e+06
5                                 Chandigarh  1.372737e+08
6                               Chhattisgarh  2.672705e+06
7   Dadra and Nagar Haveli and Daman and Diu  7.327605e+06
8                                      Delhi  3.019001e+06
9                                        Goa  6.442383e+07
10                                   Gujarat  2.533112e+06
11                                   Haryana  2.018441e+07
12                          Himachal Pradesh  2.368753e+06
13                         Jammu and Kashmir  2.414838e+06
14                                 Jharkhand  2.458025e+06
15                                 Karnataka  1.617193e+07
16                                    Kerala  1.381462e+07
17                                    Ladakh  2.388141e+06
18                               Lakshadweep  1.667703e+08
19                            Madhya Pradesh  3.720367e+06
20                               Maharashtra  5.218439e+06
21                                   Manipur  2.418222e+06
22                                 Meghalaya  2.393725e+06
23                                   Mizoram  2.345368e+06
24                                    Odisha  2.405977e+06
25                                Puducherry  2.402517e+06
26                                    Punjab  2.515638e+06
27                                 Rajasthan  2.509588e+06
28                                    Sikkim  2.428607e+06
29                                Tamil Nadu  3.271037e+06
30                                 Telangana  2.404359e+06
31                                   Tripura  2.407741e+06
32                             Uttar Pradesh  8.677833e+06
33                               Uttarakhand  2.399425e+06
34                               West Bengal  2.903243e+07

state_stats = sp.groupby('state').agg(
    avg_price=('price', 'mean'),
    most_common_type=('car_type', lambda x: x.mode()[0] if not x.isnull().all() else 'Unknown'),
    total_listings=('id', 'count')
).reset_index()

# Visualize Average Price by State
plt.figure(figsize=(12, 8))
sns.barplot(
    y='state', 
    x='avg_price', 
    data=state_stats.sort_values('avg_price', ascending=False),
    palette='coolwarm'
)
plt.title('Average Car Price by State', fontsize=14)
plt.xlabel('Average Price (₹)', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()

top_manufacturers = sp['manufacturer'].value_counts().head(5).index
manufacturer_data = sp[sp['manufacturer'].isin(top_manufacturers)]

plt.figure(figsize=(14, 8))
sns.countplot(
    y='state', 
    hue='manufacturer', 
    data=manufacturer_data, 
    order=state_stats['state']
)
plt.title('Prevalence of Top Manufacturers by State', fontsize=14)
plt.xlabel('Count', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.legend(title='Manufacturer', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

sp['mileage_category'] = pd.cut(sp['avg_yearly_mileage'], 
                                bins=[0, 5000, 10000, 20000, 30000, 80000, 10000000], 
                                labels=['Very Low', 'Low', 'Moderate', 'Considerate', 'High', 'Very High'])
mileage_vs_age = sp.groupby(['car_age', 'mileage_category'])['id'].count().unstack()
print("Mileage vs Car Age:")
print(mileage_vs_age)

Mileage vs Car Age:
mileage_category  Very Low    Low  Moderate  Considerate  High  Very High
car_age                                                                  
0.0                   1809    241       103           12    13         82
1.0                   5483   4037      5182         2591  1827        146
2.0                   4974   6738      9323         2792  1481         43
3.0                   4142  12372     14177         3882  1691         57
4.0                   3995  10768     14149         5603  1786         56
...                    ...    ...       ...          ...   ...        ...
116.0                    1      0         0            0     0          0
118.0                   12      0         0            0     0          0
119.0                    0      1         0            0     0          0
120.0                    3      0         0            0     0          0
121.0                   11      1         0            0     0          0

[113 rows x 6 columns]

fuel_region_trends = sp.groupby(['state', 'fuel'])['id'].count().unstack()
print("Fuel Type Trends by Region:")
print(fuel_region_trends)

Fuel Type Trends by Region:
fuel                                      Unknown  diesel  electric    gas  \
state                                                                        
Andaman and Nicobar Islands                    10     226        15   2566   
Andhra Pradesh                                 64    1049        63  12595   
Arunachal Pradesh                              11     202         6   2516   
Assam                                          26     445        28   4993   
Bihar                                          24     398        18   5083   
Chandigarh                                     11     208        14   2504   
Chhattisgarh                                   20     420        19   5036   
Dadra and Nagar Haveli and Daman and Diu       10     222        16   2508   
Delhi                                         126    2151       126  25455   
Goa                                            11     230        10   2432   
Gujarat                                        64    1031        62  12522   
Haryana                                       139    2161       135  25178   
Himachal Pradesh                               13     235         8   2507   
Jammu and Kashmir                              16     200         6   2561   
Jharkhand                                      28     423        26   5072   
Karnataka                                     145    2097       108  25581   
Kerala                                         61    1067        59  12754   
Ladakh                                          8     235        12   2542   
Lakshadweep                                    12     201         8   2481   
Madhya Pradesh                                 50    1134        51  12647   
Maharashtra                                   168    3128       184  37964   
Manipur                                         9     220        13   2507   
Meghalaya                                      10     226         4   2523   
Mizoram                                        10     185        17   2543   
Odisha                                         42     846        46  10149   
Puducherry                                     13     232        14   2546   
Punjab                                        110    2137       126  25531   
Rajasthan                                      48    1083        63  12663   
Sikkim                                         10     229        10   2577   
Tamil Nadu                                     65    1284        74  15161   
Telangana                                      50    1091        70  12877   
Tripura                                         8     193        15   2510   
Uttar Pradesh                                 200    3228       187  38092   
Uttarakhand                                    20     444        20   4971   
West Bengal                                    79    1240        68  15119   

fuel                                      hybrid  other  
state                                                    
Andaman and Nicobar Islands                   35    225  
Andhra Pradesh                               191   1103  
Arunachal Pradesh                             29    215  
Assam                                         59    435  
Bihar                                         74    434  
Chandigarh                                    26    221  
Chhattisgarh                                  70    436  
Dadra and Nagar Haveli and Daman and Diu      28    194  
Delhi                                        340   2170  
Goa                                           31    208  
Gujarat                                      172   1134  
Haryana                                      360   2194  
Himachal Pradesh                              34    219  
Jammu and Kashmir                             36    209  
Jharkhand                                     79    446  
Karnataka                                    371   2172  
Kerala                                       185   1167  
Ladakh                                        45    218  
Lakshadweep                                   41    176  
Madhya Pradesh                               235   1082  
Maharashtra                                  558   3234  
Manipur                                       31    210  
Meghalaya                                     42    210  
Mizoram                                       30    224  
Odisha                                       153    948  
Puducherry                                    31    209  
Punjab                                       372   2165  
Rajasthan                                    178   1086  
Sikkim                                        47    192  
Tamil Nadu                                   228   1304  
Telangana                                    192   1072  
Tripura                                       35    199  
Uttar Pradesh                                549   3296  
Uttarakhand                                   76    416  
West Bengal                                  232   1301

# Analyze demand and pricing by fuel type
fuel_analysis = sp.groupby('fuel').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index()

print("\nFuel Type Analysis (Demand and Pricing):")
print(fuel_analysis)

Fuel Type Analysis (Demand and Pricing):
       fuel  total_transactions     avg_price  median_price
0   Unknown                1691  2.668044e+06     2310590.0
1    diesel               30101  1.541457e+07     3594780.0
2  electric                1701  3.259814e+06     2901280.0
3       gas              357266  9.623518e+06     1925360.0
4    hybrid                5195  1.983751e+06     1757450.0
5     other               30724  8.743496e+06     3337990.0

# Correlation matrix
correlation_matrix = sp[['price', 'odometer', 'car_age']].corr()
print("\nCorrelation Analysis:")
print(correlation_matrix)

Correlation Analysis:
             price  odometer   car_age
price     1.000000  0.010112  0.004899
odometer  0.010112  1.000000  0.158340
car_age   0.004899  0.158340  1.000000

plt.figure(figsize=(12, 6))
sns.boxplot(x='fuel', y='price', data=sp, showfliers=False, palette='Set3')
plt.title('Price Distribution by Fuel Type', fontsize=14)
plt.xlabel('Fuel Type', fontsize=12)
plt.ylabel('Price (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

correlation_matrix = sp[['car_age', 'odometer', 'price', 'cylinders']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', cbar=True)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

fuel_region = sp.groupby(['state', 'fuel']).size().unstack(fill_value=0)

# Normalize to get percentages
fuel_region_percentage = fuel_region.div(fuel_region.sum(axis=1), axis=0)

# Heatmap of fuel preferences by region
plt.figure(figsize=(14, 8))
sns.heatmap(
    fuel_region_percentage, 
    annot=True, 
    fmt=".1%", 
    cmap='coolwarm', 
    linewidths=0.5, 
    cbar_kws={'label': 'Fuel Preference (%)'}
)
plt.title('Fuel Preferences by State', fontsize=14)
plt.xlabel('Fuel Type', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()

# Group by 'manufacturer' and find min and max year
manufacturer_year_stats = sp.groupby(['manufacturer','model'])['year'].agg(['min', 'max']).reset_index()

# Rename columns for clarity
manufacturer_year_stats.columns = ['manufacturer','model', 'min_year', 'max_year']

# Display the result
print(manufacturer_year_stats)

      manufacturer                           model  min_year  max_year
0          Unknown                               %    2002.0    2002.0
1          Unknown              '50 Business Coupe    1950.0    1950.0
2          Unknown                   '99 H1 Hummer    1999.0    1999.0
3          Unknown                         ,,,,,,,    2009.0    2009.0
4          Unknown    ,2012,2013, SOME 2014 MODELS    2014.0    2014.0
...            ...                             ...       ...       ...
30078        volvo  xc90 t8 inscription 1300 miles    2020.0    2020.0
30079        volvo                         xc90 v8    2008.0    2011.0
30080        volvo                     xc90 v8 awd    2007.0    2007.0
30081        volvo                  xc90 v8 awdsuv    2006.0    2006.0
30082        volvo          xc90t6 awd 24733 miles    2018.0    2018.0

[30083 rows x 4 columns]

# Data preparation
manufacturer_counts = sp['manufacturer'].value_counts()
manufacturer_stats = sp.groupby('manufacturer').agg(
    avg_price=('price', 'mean'),
    total_listings=('id', 'count')
).reset_index()

# Filter for the top 10 manufacturers by frequency
top_manufacturers = manufacturer_counts.head(10).index
manufacturer_stats_top = manufacturer_stats[manufacturer_stats['manufacturer'].isin(top_manufacturers)]
manufacturer_stats_top = manufacturer_stats_top.set_index('manufacturer').loc[top_manufacturers]  # Maintain order
avg_price = manufacturer_stats_top['avg_price']
total_listings = manufacturer_stats_top['total_listings']

# Create the plot
fig, ax1 = plt.subplots(figsize=(14, 6))

# Bar plot for total listings
bar_width = 0.4
x = np.arange(len(top_manufacturers))
ax1.bar(x - bar_width / 2, total_listings, width=bar_width, color='lightblue', label='Number of Listings')
ax1.set_ylabel('Number of Listings', fontsize=12, color='blue')
ax1.set_xlabel('Manufacturer', fontsize=12)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_xticks(x)
ax1.set_xticklabels(top_manufacturers, rotation=45, ha='right')

# Bar plot for average price
ax2 = ax1.twinx()
ax2.bar(x + bar_width / 2, avg_price, width=bar_width, color='salmon', label='Average Price (₹)')
ax2.set_ylabel('Average Price (₹)', fontsize=12, color='red')
ax2.tick_params(axis='y', labelcolor='red')

# Add title and legend
plt.title('Top 10 Manufacturers: Number of Listings vs. Average Price', fontsize=14)
fig.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)

# Adjust layout
plt.tight_layout()
plt.show()

manufacturer_region = sp.groupby(['state', 'manufacturer']).size().unstack(fill_value=0)

# Normalize to show percentages
manufacturer_region_percentage = manufacturer_region.div(manufacturer_region.sum(axis=1), axis=0)

# Heatmap for regional dominance of top manufacturers
plt.figure(figsize=(14, 8))
sns.heatmap(
    manufacturer_region_percentage[top_manufacturers], 
    annot=True, 
    fmt=".1%", 
    cmap='coolwarm', 
    linewidths=0.5, 
    cbar_kws={'label': 'Regional Market Share (%)'}
)
plt.title('Regional Dominance of Top Manufacturers', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()

condition_price = sp.groupby('condition')['price'].mean().sort_values(ascending=False)
print("Average Price by Condition:")
print(condition_price)

Average Price by Condition:
condition
fair         8.605288e+07
excellent    1.223083e+07
good         4.941285e+06
like new     4.439667e+06
Unknown      3.644879e+06
new          3.355070e+06
salvage      6.370346e+05
Name: price, dtype: float64

# Pricing analysis by title status
title_analysis = sp.groupby('title_status').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean')
).reset_index()
print("\nTitle Status Analysis (Demand and Pricing):")
print(title_analysis)

Title Status Analysis (Demand and Pricing):
  title_status  total_transactions     avg_price
0      Unknown                8139  2.029250e+06
1        clean              405030  1.024240e+07
2         lien                1422  2.888077e+06
3      missing                 810  2.874681e+06
4   parts only                 197  1.646425e+06
5      rebuilt                7219  1.966364e+06
6      salvage                3861  1.345844e+06

# Analyze demand and pricing by condition
condition_analysis = sp.groupby('condition').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index()

print("\nDemand and Pricing by Condition:")
print(condition_analysis)

Demand and Pricing by Condition:
   condition  total_transactions     avg_price  median_price
0    Unknown               36669  3.644879e+06     2309580.0
1  excellent              209179  1.223083e+07     1926290.0
2       fair                7702  8.605288e+07      385220.0
3       good              143637  4.941285e+06     2310590.0
4   like new               27061  4.439667e+06     1990460.0
5        new                1781  3.355070e+06     2298740.0
6    salvage                 649  6.370346e+05      256870.0

condition_stats = sp.groupby('condition')['price'].mean().sort_values(ascending=False)

# Plotting average prices by condition
plt.figure(figsize=(12, 6))
sns.barplot(
    x=condition_stats.index, 
    y=condition_stats.values, 
    palette='Set2'
)
plt.title('Average Price by Condition', fontsize=14)
plt.xlabel('Condition', fontsize=12)
plt.ylabel('Average Price (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

title_stats = sp.groupby('title_status')['price'].mean().sort_values(ascending=False)

# Plotting average prices by title status
plt.figure(figsize=(12, 6))
sns.barplot(
    x=title_stats.index, 
    y=title_stats.values, 
    palette='Set3'
)
plt.title('Average Price by Title Status', fontsize=14)
plt.xlabel('Title Status', fontsize=12)
plt.ylabel('Average Price (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

combined_stats = sp.groupby(['condition', 'title_status'])['price'].mean().unstack(fill_value=0)

# Heatmap for combined analysis
plt.figure(figsize=(14, 8))
sns.heatmap(
    combined_stats, 
    annot=True, 
    fmt=".0f", 
    cmap='coolwarm', 
    cbar_kws={'label': 'Average Price (₹)'},
    linewidths=0.5
)
plt.title('Average Price by Condition and Title Status', fontsize=14)
plt.xlabel('Title Status', fontsize=12)
plt.ylabel('Condition', fontsize=12)
plt.tight_layout()
plt.show()

monthly_trends = sp.groupby(['posting_year', 'posting_month'])['id'].count().reset_index()
monthly_trends.columns = ['Year', 'Month', 'Transactions']
print("Monthly Transaction Trends:")
print(monthly_trends)

Monthly Transaction Trends:
   Year  Month  Transactions
0  2021      4        313053
1  2021      5        113625

weekday_trends = sp['posting_weekday'].value_counts()
print("Weekday Trends:")
print(weekday_trends)

Weekday Trends:
Tuesday      78444
Monday       75067
Friday       66569
Saturday     61453
Thursday     52984
Wednesday    51080
Sunday       41081
Name: posting_weekday, dtype: int64

weekday_counts = sp['posting_weekday'].value_counts()

plt.figure(figsize=(10, 6))
sns.barplot(
    x=weekday_counts.index, 
    y=weekday_counts.values, 
    palette='viridis'
)
plt.title('Weekly Trends in Car Postings', fontsize=14)
plt.xlabel('Day of the Week', fontsize=12)
plt.ylabel('Number of Listings', fontsize=12)
plt.tight_layout()
plt.show()

# Revenue contribution by size
size_revenue = sp.groupby('size').agg(
    total_revenue=('price', 'sum'),
    avg_price=('price', 'mean')
).reset_index()

print("\nRevenue Contribution by Vehicle Size:")
print(size_revenue)

Revenue Contribution by Vehicle Size:
          size  total_revenue     avg_price
0      Unknown   7.099603e+11  5.864291e+06
1      compact   4.462154e+11  8.724345e+06
2    full-size   2.387425e+12  1.488642e+07
3     mid-size   6.333929e+11  7.214700e+06
4  sub-compact   1.415356e+10  2.246953e+06

# Calculate averages
drive_data = sp.groupby('drive').agg({'price': 'mean', 'odometer': 'mean'}).sort_index()

# Extract values
drive_types = drive_data.index
price_values = drive_data['price']
odometer_values = drive_data['odometer']

# Create the plot
fig, ax1 = plt.subplots(figsize=(14, 6))
x = np.arange(len(drive_types))
bar_width = 0.4

# Plot bars
ax1.bar(x - bar_width / 2, price_values, width=bar_width, color='skyblue', label='Average Price (₹)')
ax2 = ax1.twinx()
ax2.bar(x + bar_width / 2, odometer_values, width=bar_width, color='lightgreen', label='Average Odometer (km)')

# Label axes
ax1.set_ylabel('Average Price (₹)', color='blue', fontsize=12)
ax2.set_ylabel('Average Odometer (km)', color='green', fontsize=12)
ax1.set_xlabel('Drive Type', fontsize=12)
ax1.set_xticks(x)
ax1.set_xticklabels(drive_types, rotation=45, ha='right')

# Add title and legend
plt.title('Average Price and Odometer by Drive Type', fontsize=14)
fig.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)
plt.tight_layout()
plt.show()

# Calculate paint color frequency and average price
paint_frequency = sp['paint_color'].value_counts()
paint_price = sp.groupby('paint_color')['price'].mean()

# Ensure the indices are aligned
paint_frequency = paint_frequency.sort_index()
paint_price = paint_price.sort_index()

# Create the figure and axes
fig, ax1 = plt.subplots(figsize=(14, 6))

# Plot frequency on the first y-axis
ax1.bar(
    paint_frequency.index, 
    paint_frequency.values, 
    color='lightblue', 
    alpha=0.7, 
    label='Frequency'
)
ax1.set_ylabel('Frequency', fontsize=12, color='blue')
ax1.set_xlabel('Paint Color', fontsize=12)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_xticks(range(len(paint_frequency.index)))  # Set tick positions
ax1.set_xticklabels(paint_frequency.index, rotation=45, ha='right')  # Set tick labels

# Create the second y-axis
ax2 = ax1.twinx()
ax2.plot(
    range(len(paint_price.index)), 
    paint_price.values, 
    color='red', 
    marker='o', 
    label='Average Price'
)
ax2.set_ylabel('Average Price (₹)', fontsize=12, color='red')
ax2.tick_params(axis='y', labelcolor='red')

# Add a title
plt.title('Paint Color Preferences and Average Price', fontsize=14)

# Show the legend
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

# Adjust layout for better appearance
plt.tight_layout()
plt.show()

regional_paint_preference = sp.groupby(['state', 'paint_color']).size().unstack(fill_value=0)

plt.figure(figsize=(15, 8))
sns.heatmap(
    regional_paint_preference,
    cmap='YlGnBu',
    cbar_kws={'label': 'Number of Listings'},
    linewidths=0.5
)
plt.title('Regional Preferences for Paint Colors', fontsize=16)
plt.xlabel('Paint Color', fontsize=12)
plt.ylabel('State', fontsize=12)
plt.tight_layout()
plt.show()

# Group by 'manufacturer' and 'model', summing the 'id' column
model_demand = sp.groupby(['manufacturer', 'model'])['id'].count().reset_index()

# Rename the column for clarity
model_demand.columns = ['Manufacturer', 'Model', 'Total_Demand']

# Sort by 'TotalDemand' in descending order
model_demand = model_demand.sort_values(by='Total_Demand', ascending=False)
model_demand

# Get the top 10 in-demand cars
model_demand_top10 = model_demand.head(10)
model_demand_top10

# Analyze demand by key attributes
attribute_columns = ['fuel', 'condition', 'paint_color']
for col in attribute_columns:
    print(f"\nDemand by {col.capitalize()}:")
    print(sp[col].value_counts())

Demand by Fuel:
gas         357266
other        30724
diesel       30101
hybrid        5195
electric      1701
Unknown       1691
Name: fuel, dtype: int64

Demand by Condition:
excellent    209179
good         143637
Unknown       36669
like new      27061
fair           7702
new            1781
salvage         649
Name: condition, dtype: int64

Demand by Paint_color:
white      108720
black       91076
silver      57794
blue        40905
red         37135
grey        30740
Unknown     30617
green        8786
brown        7901
custom       7606
yellow       2346
orange       2282
purple        770
Name: paint_color, dtype: int64

regional_model_demand = sp.groupby(['state', 'model'])['id'].count().reset_index()
regional_model_demand.columns = ['State', 'Model', 'Transactions']
top_models = regional_model_demand.sort_values(by=['Transactions'], ascending=[False])
print(top_models)

                State                 Model  Transactions
66843     Maharashtra                 f-150           886
110102  Uttar Pradesh                 f-150           847
20926           Delhi                 f-150           608
84926          Punjab                 f-150           587
70240     Maharashtra        silverado 1500           567
...               ...                   ...           ...
49574       Karnataka     silverado ext cab             1
49573       Karnataka      silverado ex-cab             1
49572       Karnataka   silverado durmax hd             1
49571       Karnataka  silverado duramax hd             1
122744    West Bengal                     ♿             1

[122745 rows x 3 columns]

# Identify the most popular manufacturers and models
popular_cars = sp.groupby(['manufacturer', 'model']).agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index().sort_values('total_transactions', ascending=False)

print("\nPopular Car Models and Brands Analysis:")
print(popular_cars.head(10))  # Top 10 popular cars

Popular Car Models and Brands Analysis:
      manufacturer           model  total_transactions     avg_price  \
13769         ford           f-150                8014  2.864032e+06   
10401    chevrolet  silverado 1500                5154  2.886477e+06   
25302          ram            1500                4259  2.913927e+06   
27474       toyota           camry                3153  1.298709e+06   
10395    chevrolet       silverado                3037  2.699027e+06   
17821        honda          accord                2983  1.236092e+06   
20419         jeep        wrangler                2863  6.580196e+07   
23984       nissan          altima                2805  1.216228e+06   
18079        honda           civic                2801  1.208038e+06   
13300         ford          escape                2780  1.504665e+06   

       median_price  
13769     2511100.0  
10401     2568380.0  
25302     2695900.0  
27474     1027320.0  
10395     2041870.0  
17821     1007580.0  
20419     2375770.0  
23984     1136940.0  
18079     1014550.0  
13300     1488650.0

# Group transactions by odometer ranges
bins = [0, 20000, 50000, 100000, 150000, 200000, 300000,400000,600000,800000,1000000,10000000]
labels = ['0-20K', '20K-50K', '50K-100K', '100K-150K', '150K-200K', '200K-300K', '300K-400K', '400K-600K', '600K-800K', '800K-1000K','1000K+']
sp['mileage_range'] = pd.cut(sp['odometer'], bins=bins, labels=labels)

mileage_analysis = sp.groupby('mileage_range').agg(
    total_transactions=('id', 'count'),
    avg_price=('price', 'mean'),
    median_price=('price', 'median')
).reset_index()

print("\nMileage Range Analysis (Demand and Pricing):")
print(mileage_analysis)

Mileage Range Analysis (Demand and Pricing):
   mileage_range  total_transactions     avg_price  median_price
0          0-20K               53687  1.381985e+07     3788400.0
1        20K-50K               79690  3.421764e+06     3209560.0
2       50K-100K              116017  9.843183e+06     1990460.0
3      100K-150K               99096  5.119617e+06     1284190.0
4      150K-200K               53065  1.341860e+07      897960.0
5      200K-300K               20137  2.501741e+07      744910.0
6      300K-400K                1327  1.339640e+07      960200.0
7      400K-600K                 449  3.057920e+06     1990460.0
8      600K-800K                 229  2.832558e+06     2140660.0
9     800K-1000K                 445  2.875703e+08     1926290.0
10        1000K+                 571  2.518616e+08     1027320.0

sp['mileage_bins'] = pd.cut(sp['avg_yearly_mileage'], 
                            bins=[0, 5000, 10000, 20000, 30000,40000,50000,100000,10000000], 
                            labels=['0-5K','5K-10K','10K-20K','20K-30K','30K-40K','40K-50K','50K-100K','100K+'])

mileage_demand = sp.groupby('mileage_bins').agg(
    total_transactions=('id','count'),
    avg_price=('price', 'mean'), 
    median_price=('price', 'median')
).reset_index()

print("Demand by Average Yearly Mileage:")
print(mileage_demand)

Demand by Average Yearly Mileage:
  mileage_bins  total_transactions     avg_price  median_price
0         0-5K               62413  2.788424e+07     2616850.0
1       5K-10K              126558  7.393663e+06     1990460.0
2      10K-20K              190936  6.344824e+06     1871940.0
3      20K-30K               34091  2.800212e+06     2182150.0
4      30K-40K                6774  3.418389e+06     3081120.0
5      40K-50K                2011  8.346382e+06     2889430.0
6     50K-100K                1307  1.545943e+07     2311510.0
7        100K+                 623  2.058125e+08     1990460.0

	id	price	year	odometer	latitude	longitude
count	4.268800e+05	4.268800e+05	425675.000000	4.224800e+05	426880.000000	426880.000000
mean	7.311487e+09	9.657028e+06	2011.235191	9.804333e+04	22.800888	79.012914
std	4.473170e+06	1.564439e+09	9.452120	2.138815e+05	6.532617	5.208436
min	7.207408e+09	0.000000e+00	1900.000000	0.000000e+00	6.800300	68.400200
25%	7.308143e+09	7.576800e+05	2008.000000	3.770400e+04	17.796000	75.719100
50%	7.312621e+09	1.791970e+06	2013.000000	8.554800e+04	24.058650	77.353900
75%	7.315254e+09	3.401080e+06	2017.000000	1.335425e+05	28.563525	80.927325
max	7.317101e+09	4.798934e+11	2022.000000	1.000000e+07	35.997600	97.398000

	id	price	year	odometer	latitude	longitude	posting_year	posting_month	car_age	avg_yearly_mileage
count	4.266780e+05	4.266780e+05	426678.000000	4.266780e+05	426678.000000	426678.000000	426678.0	426678.000000	426678.000000	4.266780e+05
mean	7.311495e+09	9.822740e+06	2011.236584	9.786411e+04	22.801159	79.012884	2021.0	4.266302	9.763416	1.230994e+04
std	4.381300e+06	1.564808e+09	9.439536	2.111762e+05	6.532559	5.208544	0.0	0.442024	9.439536	3.861835e+04
min	7.301583e+09	9.200000e+02	1900.000000	0.000000e+00	6.800300	68.400200	2021.0	4.000000	0.000000	0.000000e+00
25%	7.308145e+09	1.026400e+06	2008.000000	3.817000e+04	17.796600	75.719100	2021.0	4.000000	4.000000	7.063000e+03
50%	7.312622e+09	1.990460e+06	2013.000000	8.600000e+04	24.058950	77.353900	2021.0	4.000000	8.000000	1.072728e+04
75%	7.315254e+09	3.402170e+06	2017.000000	1.330000e+05	28.563600	80.927100	2021.0	5.000000	13.000000	1.500000e+04
max	7.317101e+09	4.798934e+11	2021.000000	1.000000e+07	35.997600	97.398000	2021.0	5.000000	121.000000	1.000000e+07

	id	price	year	manufacturer	model	condition	cylinders	fuel	odometer	title_status	...	VIN	drive	size	type	paint_color	image_url	state	posting_date	latitude	longitude
0	7222695916	770530.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Haryana	NaN	28.8446	75.1167
1	7218891961	1528210.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Haryana	NaN	27.9026	77.0382
2	7221797935	2696820.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	West Bengal	NaN	24.6838	85.9695
3	7222270760	192610.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Andhra Pradesh	NaN	15.5158	81.9342
4	7210384030	629240.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	West Bengal	NaN	23.7821	87.4861
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
426875	7301591192	3029710.0	2019.0	nissan	maxima s sedan 4d	good	6 cylinders	gas	32226.0	clean	...	1N4AA6AV6KC367801	fwd	NaN	sedan	NaN	https://images.craigslist.org/00o0o_iiraFnHg8q...	Delhi	2021-04-04 09:21:31+00:00	28.7876	76.8644
426876	7301591187	3928680.0	2020.0	volvo	s60 t5 momentum sedan 4d	good	NaN	gas	12029.0	clean	...	7JR102FKXLG042696	fwd	NaN	sedan	red	https://images.craigslist.org/00x0x_15sbgnxCIS...	Punjab	2021-04-04 09:21:29+00:00	32.0715	75.1915
426877	7301591147	4493750.0	2020.0	cadillac	xt4 sport suv 4d	good	NaN	diesel	4174.0	clean	...	1GYFZFR46LF088296	NaN	NaN	hatchback	white	https://images.craigslist.org/00L0L_farM7bxnxR...	Madhya Pradesh	2021-04-04 09:21:17+00:00	21.5646	75.7247
426878	7301591140	3723220.0	2018.0	lexus	es 350 sedan 4d	good	6 cylinders	gas	30112.0	clean	...	58ABK1GG4JU103853	fwd	NaN	sedan	silver	https://images.craigslist.org/00z0z_bKnIVGLkDT...	Maharashtra	2021-04-04 09:21:11+00:00	21.4183	75.4719
426879	7301591129	3928680.0	2019.0	bmw	4 series 430i gran coupe	good	NaN	gas	22716.0	clean	...	WBA4J1C58KBM14708	rwd	NaN	coupe	NaN	https://images.craigslist.org/00Y0Y_lEUocjyRxa...	Jharkhand	2021-04-04 09:21:07+00:00	24.2616	83.7688

	id	price	year	manufacturer	model	condition	cylinders	fuel	odometer	title_status	...	type	paint_color	image_url	state	posting_date	latitude	longitude	posting_year	posting_month	posting_weekday
9738	7303397975	770530.0	2022	NaN	Suzuki Vitara	NaN	NaN	gas	115000.0	clean	...	NaN	NaN	https://images.craigslist.org/00n0n_9cYtTzxWFZ...	Uttar Pradesh	2021-04-07 22:56:59+00:00	28.1913	84.3726	2021	4	Wednesday
32148	7314343550	0.0	2022	NaN	Freightliner SportChassis RHA112	NaN	NaN	diesel	10.0	clean	...	pickup	black	https://images.craigslist.org/00202_j8VlmOCW5g...	Gujarat	2021-04-29 15:54:44+00:00	21.1472	70.9694	2021	4	Thursday
43183	7314165992	1540060.0	2022	toyota	4runner 4wd	NaN	NaN	gas	207834.0	clean	...	NaN	custom	https://images.craigslist.org/00404_8uTHOfbLml...	Uttar Pradesh	2021-04-29 02:16:46+00:00	25.3863	82.3836	2021	4	Thursday
65611	7309051490	321050.0	2022	honda	civic cvcc	fair	4 cylinders	gas	605000.0	clean	...	NaN	NaN	https://images.craigslist.org/01414_7ghFSAOlNo...	Maharashtra	2021-04-19 01:43:58+00:00	19.7144	75.0804	2021	4	Monday
65612	7309047664	449480.0	2022	toyota	mighty max	fair	4 cylinders	gas	180000.0	clean	...	truck	white	https://images.craigslist.org/00m0m_5XzKF5TFiE...	Delhi	2021-04-19 01:31:30+00:00	28.6941	76.8817	2021	4	Monday
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
410935	7315558994	49390.0	2022	mitsubishi	eclipse cross le	excellent	4 cylinders	gas	5.0	NaN	...	SUV	NaN	https://images.craigslist.org/00C0C_mwIIiP2iUq...	Puducherry	2021-05-01 23:06:10+00:00	12.2829	79.8404	2021	5	Saturday
410936	7315551108	52330.0	2022	mitsubishi	eclipse cross se	excellent	4 cylinders	gas	3.0	NaN	...	SUV	NaN	https://images.craigslist.org/00z0z_6xrjH3eEy3...	Chandigarh	2021-05-01 22:46:56+00:00	30.7712	76.7766	2021	5	Saturday
412094	7301823858	55270.0	2022	mitsubishi	eclipse cross sel	excellent	4 cylinders	gas	5.0	NaN	...	SUV	NaN	https://images.craigslist.org/00v0v_c8464eYsPz...	Maharashtra	2021-04-04 22:02:54+00:00	17.3951	77.1283	2021	4	Sunday
413805	7301823862	55270.0	2022	mitsubishi	eclipse cross sel	excellent	4 cylinders	gas	5.0	NaN	...	SUV	NaN	https://images.craigslist.org/00v0v_c8464eYsPz...	Meghalaya	2021-04-04 22:02:54+00:00	25.9323	91.2154	2021	4	Sunday
423091	7314368209	31670.0	2022	NaN	trailer	fair	NaN	other	1111111.0	clean	...	NaN	NaN	https://images.craigslist.org/00Y0Y_3AeJN6UYkh...	Goa	2021-04-29 16:34:59+00:00	15.5037	73.8108	2021	4	Thursday

	id	price	year	manufacturer	model	condition	cylinders	fuel	odometer	title_status	...	type	paint_color	image_url	state	posting_date	latitude	longitude	posting_year	posting_month	posting_weekday
0	7222695916	770530.0	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	Haryana	NaT	28.8446	75.1167	<NA>	<NA>	NaN
1	7218891961	1528210.0	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	Haryana	NaT	27.9026	77.0382	<NA>	<NA>	NaN
2	7221797935	2696820.0	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	West Bengal	NaT	24.6838	85.9695	<NA>	<NA>	NaN
3	7222270760	192610.0	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	Andhra Pradesh	NaT	15.5158	81.9342	<NA>	<NA>	NaN
4	7210384030	629240.0	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	West Bengal	NaT	23.7821	87.4861	<NA>	<NA>	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
426875	7301591192	3029710.0	2019	nissan	maxima s sedan 4d	good	6 cylinders	gas	32226.0	clean	...	sedan	NaN	https://images.craigslist.org/00o0o_iiraFnHg8q...	Delhi	2021-04-04 09:21:31+00:00	28.7876	76.8644	2021	4	Sunday
426876	7301591187	3928680.0	2020	volvo	s60 t5 momentum sedan 4d	good	NaN	gas	12029.0	clean	...	sedan	red	https://images.craigslist.org/00x0x_15sbgnxCIS...	Punjab	2021-04-04 09:21:29+00:00	32.0715	75.1915	2021	4	Sunday
426877	7301591147	4493750.0	2020	cadillac	xt4 sport suv 4d	good	NaN	diesel	4174.0	clean	...	hatchback	white	https://images.craigslist.org/00L0L_farM7bxnxR...	Madhya Pradesh	2021-04-04 09:21:17+00:00	21.5646	75.7247	2021	4	Sunday
426878	7301591140	3723220.0	2018	lexus	es 350 sedan 4d	good	6 cylinders	gas	30112.0	clean	...	sedan	silver	https://images.craigslist.org/00z0z_bKnIVGLkDT...	Maharashtra	2021-04-04 09:21:11+00:00	21.4183	75.4719	2021	4	Sunday
426879	7301591129	3928680.0	2019	bmw	4 series 430i gran coupe	good	NaN	gas	22716.0	clean	...	coupe	NaN	https://images.craigslist.org/00Y0Y_lEUocjyRxa...	Jharkhand	2021-04-04 09:21:07+00:00	24.2616	83.7688	2021	4	Sunday

	id	price	year	manufacturer	model	condition	cylinders	fuel	odometer	title_status	...	car_type	paint_color	image_url	state	posting_date	latitude	longitude	posting_year	posting_month	posting_weekday
10	7218893038	0.0	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	Uttar Pradesh	NaT	28.5810	83.4530	<NA>	<NA>	NaN
11	7218325704	0.0	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	Maharashtra	NaT	16.3906	73.9100	<NA>	<NA>	NaN
12	7217788283	0.0	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	Maharashtra	NaT	20.6117	79.7192	<NA>	<NA>	NaN
13	7217147606	0.0	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	Uttar Pradesh	NaT	29.8561	81.6407	<NA>	<NA>	NaN
14	7209027818	0.0	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	Kerala	NaT	8.5412	75.6084	<NA>	<NA>	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
426764	7303347225	0.0	2018	NaN	peterbilt 579	NaN	NaN	diesel	1.0	clean	...	NaN	NaN	https://images.craigslist.org/00F0F_kfjCcZyncy...	Kerala	2021-04-07 21:15:05+00:00	8.7133	74.9888	2021	4	Wednesday
426812	7302877815	0.0	2006	toyota	scion tc	excellent	4 cylinders	gas	195000.0	clean	...	NaN	silver	https://images.craigslist.org/00B0B_dqmLwdZw8Q...	Tamil Nadu	2021-04-06 23:00:19+00:00	9.7042	79.7712	2021	4	Tuesday
426832	7302353283	0.0	2004	toyota	prius	excellent	4 cylinders	hybrid	239000.0	clean	...	NaN	blue	https://images.craigslist.org/00V0V_7Ap4arGS9S...	Telangana	2021-04-05 23:00:55+00:00	16.8085	77.5499	2021	4	Monday
426836	7302301268	0.0	2018	ram	2500	excellent	6 cylinders	diesel	20492.0	clean	...	truck	white	https://images.craigslist.org/00i0i_3cSpgdy0qQ...	Delhi	2021-04-05 21:18:42+00:00	28.6000	77.2163	2021	4	Monday
426868	7301843288	0.0	2010	toyota	venza	excellent	6 cylinders	gas	155000.0	clean	...	NaN	blue	https://images.craigslist.org/00808_hNEilrIb0i...	Uttar Pradesh	2021-04-04 23:00:48+00:00	27.9450	79.3595	2021	4	Sunday

	id	price	year	manufacturer	model	condition	cylinders	fuel	odometer	title_status	...	image_url	state	posting_date	latitude	longitude	posting_year	posting_month	posting_weekday	car_age	avg_yearly_mileage
27	7316814884	4313900.0	2014.0	gmc	sierra 1500 crew cab slt	good	8 cylinders	gas	57923.0	clean	...	https://images.craigslist.org/00R0R_lwWjXSEWNa...	Kerala	2021-05-04 17:31:18+00:00	9.8072	76.8361	2021	5	Tuesday	7.0	8274.714286
28	7316814758	2901280.0	2010.0	chevrolet	silverado 1500	good	8 cylinders	gas	71229.0	clean	...	https://images.craigslist.org/00R0R_lwWjXSEWNa...	Uttar Pradesh	2021-05-04 17:31:08+00:00	29.2417	83.7824	2021	5	Tuesday	11.0	6475.363636
29	7316814989	5084440.0	2020.0	chevrolet	silverado 1500 crew	good	8 cylinders	gas	19160.0	clean	...	https://images.craigslist.org/01212_jjirIWa0y0...	Delhi	2021-05-04 17:31:25+00:00	28.8053	76.8871	2021	5	Tuesday	1.0	19160.000000
30	7316743432	3980090.0	2017.0	toyota	tundra double cab sr	good	8 cylinders	gas	41124.0	clean	...	https://images.craigslist.org/00x0x_1y9kIOzGCF...	Odisha	2021-05-04 15:41:31+00:00	19.2886	85.2667	2021	5	Tuesday	4.0	10281.000000
31	7316356412	1926290.0	2013.0	ford	f-150 xlt	excellent	6 cylinders	gas	128000.0	clean	...	https://images.craigslist.org/00404_l4loxHvdQe...	Tripura	2021-05-03 19:02:03+00:00	23.4924	91.1187	2021	5	Monday	8.0	16000.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
426875	7301591192	3029710.0	2019.0	nissan	maxima s sedan 4d	good	6 cylinders	gas	32226.0	clean	...	https://images.craigslist.org/00o0o_iiraFnHg8q...	Delhi	2021-04-04 09:21:31+00:00	28.7876	76.8644	2021	4	Sunday	2.0	16113.000000
426876	7301591187	3928680.0	2020.0	volvo	s60 t5 momentum sedan 4d	good	Unknown	gas	12029.0	clean	...	https://images.craigslist.org/00x0x_15sbgnxCIS...	Punjab	2021-04-04 09:21:29+00:00	32.0715	75.1915	2021	4	Sunday	1.0	12029.000000
426877	7301591147	4493750.0	2020.0	cadillac	xt4 sport suv 4d	good	Unknown	diesel	4174.0	clean	...	https://images.craigslist.org/00L0L_farM7bxnxR...	Madhya Pradesh	2021-04-04 09:21:17+00:00	21.5646	75.7247	2021	4	Sunday	1.0	4174.000000
426878	7301591140	3723220.0	2018.0	lexus	es 350 sedan 4d	good	6 cylinders	gas	30112.0	clean	...	https://images.craigslist.org/00z0z_bKnIVGLkDT...	Maharashtra	2021-04-04 09:21:11+00:00	21.4183	75.4719	2021	4	Sunday	3.0	10037.333333
426879	7301591129	3928680.0	2019.0	bmw	4 series 430i gran coupe	good	Unknown	gas	22716.0	clean	...	https://images.craigslist.org/00Y0Y_lEUocjyRxa...	Jharkhand	2021-04-04 09:21:07+00:00	24.2616	83.7688	2021	4	Sunday	2.0	11358.000000

	car_type	avg_price	num_cars
0	SUV	1.291095e+07	97421
1	Unknown	1.656132e+07	24924
2	bus	2.206170e+06	619
3	convertible	3.103358e+07	8910
4	coupe	2.993457e+06	22104
5	hatchback	1.832439e+06	18777
6	mini-van	1.283773e+06	7007
7	offroad	2.090865e+06	725
8	other	3.349307e+06	23014
9	pickup	1.652880e+07	50507
10	sedan	9.306231e+06	106733
11	truck	4.141187e+06	43467
12	van	2.396104e+06	10284
13	wagon	1.787653e+06	12186

	Manufacturer	Model	Total_Demand
13769	ford	f-150	8014
10401	chevrolet	silverado 1500	5154
25302	ram	1500	4259
27474	toyota	camry	3153
10395	chevrolet	silverado	3037
...	...	...	...
12032	dodge	charger police sxt	1
12031	dodge	charger police inspector	1
12029	dodge	charger plus rt	1
12027	dodge	charger limousine	1
15041	ford	f250xlt	1

Graduation Project -- Spinny Analysis¶

Business Problem Overview¶

Objective¶

Business Impact¶

Dataset Overview¶

Dataset Overview¶

Column Definitions¶

Analysis & Visualization¶

1. Importing and Cleaning Data¶

Importing Necessary Libraries¶

Loading the Dataset from google drive¶

Displaying Dataset Information¶

Displaying Column Names¶

Describing Dataset Information¶

Displaying Column Data Types¶

Checking the Shape of the Dataset¶

Checking the unique values in the Dataset¶

Checking for the Value Counts in the Dataset¶

2. Data Preparation¶

Checking for Missing/Null Values¶

Checking for Duplicate Values in the Dataset¶

Checking Data Column types for Well Structured Analysis.¶

Changing Datatype of Inappropriate Columns.¶

Segregating Numerical Columns for Devicing Outliers.¶

Devicing outliers from the DataFrame¶

Handling the Outliers.¶

Data Distribution in Outliers.¶

Creating New columns for better Analysis¶

Create posting_year column¶

Create posting_month column¶

Create posting_weekday column¶

Ensuring Data Integrity¶

Dropping Invalid Data¶

Changing the column name type to car_type¶

Handling Inappropriate Data¶

Handling the price column¶

Handling the odometer column¶

Handling the year column¶

Handling model column¶

Dropped rows where manufacturer and model is NAN¶

Categorized Unknown to the remaining missing values in model¶

Handling manufacturer column values¶

Handling condition column values¶

Handling cylinder column values¶

Handling fuel column values¶

Handling transmission column values¶

Handling title_status column values¶

Handling car_type column values¶

Handling paint column values¶

Handling drive column values¶

Handling size Column values¶

Handling VIN Column values¶

Checking the Cleaned Data-type missing values¶

Create car_age and avg_yearly_mileage column¶

Create car_age column¶

Create avg_yearly_mileage column¶

DataFrame sp¶

3. Exploratory Data Analysis (EDA)¶

Hypothesis 1:- Vehicle pricing and demand vary based on condition, fuel type, mileage, and vehicle type, impacting transaction volumes and average prices.¶

Price Trends Analysis¶

Objective: Identify factors influencing car prices.¶

Pricing Analysis¶

Price trends based on condition and fuel¶

Average Price by Mileage¶

Pricing Strategy Based on Vehicle Type¶

Analysis¶

Key Insights¶

Recommendations¶

Data Visualization¶

Compare Price and number of cars Across Car Types¶

Key Insights¶

Recommendations¶

Hypothesis 2:- Newer cars have higher demand and prices with showing higher transaction volumes, while older cars see declining demand and value with decrease in market desirability.¶

Car Age and Usage Analysis¶

Objective: Understand how car age relates to odometer readings and price.¶

Age of Cars and Its Impact on Demand¶

Demand for Newer vs. Older Cars¶

Analysis¶

Key Insights¶

Recommendations¶

Create `posting_year` column¶

Create `posting_month` column¶

Create `posting_weekday` column¶

Changing the column name `type` to `car_type`¶

Handling the `price` column¶

Handling the `odometer` column¶

Handling the `year` column¶

Handling `model` column¶

Dropped rows where `manufacturer` and `model` is NAN¶

Categorized `Unknown` to the remaining missing values in `model`¶

Handling `manufacturer` column values¶

Handling `condition` column values¶

Handling `cylinder` column values¶

Handling `fuel` column values¶

Handling `transmission` column values¶

Handling `title_status` column values¶

Handling `car_type` column values¶

Handling `paint` column values¶

Handling `drive` column values¶

Handling `size` Column values¶

Handling `VIN` Column values¶

Create `car_age` and `avg_yearly_mileage` column¶

Create `car_age` column¶

Create `avg_yearly_mileage` column¶

DataFrame `sp`¶