Spinny's Revenue Growth Through Demand and Pricing Optimisation¶

Spinny Pre-Owned Car Market Analysis¶

Introduction:¶

Spinny, a leading Indian startup in the pre-owned car market, aims to increase revenue by boosting the number of completed transactions. To achieve this, we will analyze used car transaction data to:

  1. Assess the demand for pre-owned cars in each region.
  2. Identify the demand for specific car models and their key attributes (e.g., condition, fuel type, mileage).
  3. Determine the optimal pricing for each car to stay competitive and align with market trends.
  • The insights derived will support strategic decisions to optimize inventory, pricing, and marketing efforts.

1. Data Understanding¶

  • Loading and Inspecting Data
  • Import the dataset and inspect its structure.
  • Display the first few rows to understand the data layout.
  • Check for: Column names and data types.
  • Missing values and their distribution.
  • Basic statistics of numerical and categorical features.

2. Data Preparation¶

  • Handling Missing Values
    • Identify columns with missing data. Implement strategies to handle them (e.g., imputation, removal). Justify selected strategies briefly.
  • Dealing with Outliers
    • Use statistical methods (e.g., IQR) or visualizations (e.g., boxplots) to identify and address outliers.
    • Ensure the approach aligns with the business context.
  • Standardizing Data
  • Format categorical variables (e.g., fuel type, transmission) for consistency.
  • Scale numerical columns where needed for analysis (e.g., price, mileage).

3. Exploratory Data Analysis (EDA)¶

  • Univariate Analysis
    • Examine the distribution of key variables:
  • Visualize using histograms, bar charts, and boxplots.
  • Bivariate Analysis
    • Explore relationships between:
      • Car attributes.
      • Region and car demand. Use scatter plots, boxplots, and grouped bar charts.
  • Multivariate Analysis
    • Investigate interactions:
      • Region, car attributes, and price.
      • Visualize using heatmaps, pair plots, and correlation matrices.

4. Charting and Insights¶

  • Visualization Strategy
    • Provide relevant charts with:
    • Titles, labels, legends, and annotations.
    • Brief markdown explanations of the purpose and insights.
  • Key Insights
    • Highlight actionable insights:
    • Regions with high demand.
    • Attributes driving car demand (e.g., fuel type, age). Pricing trends by region and attribute.

5.Insights and Recommendations¶

  • Summary of Findings
    • Region-wise demand distribution.
    • Popular car models and attributes.
    • Optimal pricing strategies based on market demand.
  • Recommendations
    • Target Regions with High Demand:
    • Focus marketing efforts on regions showing peak demand.
      • Increase inventory in high-demand areas.

Unlocking Customer Demand and Transaction Growth in the Pre-Owned Car Market¶

Spinny, a leading Indian startup in the pre-owned car market, is revolutionizing the way consumers buy and sell used cars by providing a seamless and trustworthy platform. With the growing demand for affordable and reliable pre-owned vehicles, Spinny aims to cater to diverse customer needs across various regions in India.

However, success in the competitive pre-owned car market hinges on two critical factors: understanding customer demand and offering competitive pricing. Customers have numerous options to choose from, and this abundance can lead to missed opportunities for conversions, where potential buyers fail to complete transactions. High transaction dropout rates can negatively impact revenue growth, customer trust, and overall market share.

For Spinny, identifying the key factors that drive demand for cars—such as region-specific preferences, car attributes (e.g., fuel type, mileage, condition), and optimal pricing—is essential. Why do certain cars appeal more to buyers? What attributes influence purchase decisions? Understanding these patterns is crucial to crafting effective strategies that meet customer expectations, boost completed transactions, and drive sustainable revenue growth.

Objective:¶

As a data analyst, your goal is to analyze used car transaction data to:

  1. Assess the demand for pre-owned cars across different regions.
  2. Identify the demand for specific car models and their key attributes (e.g., condition, fuel type, mileage).
  3. Determine the optimal pricing for each car, ensuring it is competitive and aligned with market trends.

Business Impact¶

This analysis will empower Spinny as a leading platform in the pre-owned car market to:

Enhance Regional Market Understanding : Identify region-specific demand trends, enabling better inventory planning and targeted marketing strategies.

Increase Transaction Conversion Rates: Provide insights into the most in-demand car models and attributes, optimizing the car selection available to customers.

Optimize Pricing Strategies: Establish competitive and market-aligned pricing models to attract more buyers and drive completed transactions.
Boost Revenue and Customer Satisfaction: Addressing customer preferences and pricing gaps will contribute to increased customer satisfaction and higher transaction volumes, directly impacting revenue growth.

  • By addressing these key factors, this project supports Spinny's broader goal of building customer trust, driving sustainable growth, and reinforcing its position as a leader in the pre-owned car market. Let’s dive into the data to uncover these actionable insights!

Dataset Overview¶

Dataset Overview¶

  • Dataset Name : Spinny_Data -Number of Rows : 426880 -Number of Columns : 21
  • Description : Optimizing Revenue Growth for Spinny through Demand Analysis and Competitive Pricing.

Column Definitions¶

  1. id – A unique identifier assigned to each car listing.
  2. price – The listed selling price of the car (in the local currency).
  3. year – The manufacturing year of the car.
  4. manufacturer – The brand or company that produced the car (e.g., Toyota, Ford).
  5. model – The specific model name of the car (e.g., Corolla, Mustang).
  6. condition – The overall state of the car, typically classified as new, excellent, good, fair, etc.
  7. cylinders – The number of engine cylinders in the car, indicating engine size and power.
  8. fuel – The type of fuel the car uses (e.g., petrol, diesel, electric, hybrid).
  9. odometer – The total distance the car has traveled, measured in miles or kilometers.
  10. title_status – The legal ownership status of the vehicle (e.g., clean, salvage, rebuilt).
  11. transmission – This column indicates the type of transmission system in the car, which determines how the vehicle shifts gears.
  12. VIN (Vehicle Identification Number) – A unique code assigned to each car for identification and tracking.
  13. drive – The type of drivetrain (e.g., front-wheel drive, rear-wheel drive, all-wheel drive).
  14. size – The size classification of the car (e.g., compact, midsize, full-size, SUV).
  15. type – The category of the car (e.g., sedan, SUV, truck, coupe).
  16. paint_color – The exterior color of the car.
  17. image_url – A link to the image of the car listing.
  18. state – The location (state or region) where the car is being sold.
  19. posting_date – The date when the car was listed for sale.
  20. latitude – The geographical latitude of the car’s listing location.
  21. longitude – The geographical longitude of the car’s listing location.

Analysis & Visualisation !¶

1. Importing and Cleaning Data¶

Importing Necessary Libraries¶

In [1]:
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical computations
import matplotlib.pyplot as plt  # For plotting and visualization
import seaborn as sns  # For advanced visualizations

Loading the Dataset from google drive¶

In [2]:
# Step 1: Install gdown
!pip install gdown

# Step 2: Import necessary libraries
import gdown
import pandas as pd

# Step 3: Set the file ID and create a download URL
file_id = "1QQaRZizuq3TGhUqHyIc701ChT-GgM1yR"
download_url = f"https://drive.google.com/uc?id={file_id}"

# Step 4: Set the output file name
output_file = "spinny_dataset.csv"

# Step 5: Download the file
gdown.download(download_url, output_file, quiet=False)

# Step 6: Load the CSV file into a Pandas DataFrame
data = pd.read_csv(output_file)
Requirement already satisfied: gdown in /usr/local/lib/python3.11/dist-packages (5.2.0)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.11/dist-packages (from gdown) (4.12.3)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from gdown) (3.17.0)
Requirement already satisfied: requests[socks] in /usr/local/lib/python3.11/dist-packages (from gdown) (2.32.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from gdown) (4.67.1)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown) (2.6)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (2024.12.14)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (1.7.1)
Downloading...
From: https://drive.google.com/uc?id=1QQaRZizuq3TGhUqHyIc701ChT-GgM1yR
To: /content/spinny_dataset.csv
100%|██████████| 99.9M/99.9M [00:00<00:00, 166MB/s]

Viewing the First Few Rows of the Dataset¶

In [3]:
print("First 5 Rows of the Dataset:")
data.head(5)
First 5 Rows of the Dataset:
Out[3]:
id price year manufacturer model condition cylinders fuel odometer title_status ... VIN drive size type paint_color image_url state posting_date latitude longitude
0 7222695916 770530.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Haryana NaN 28.8446 75.1167
1 7218891961 1528210.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Haryana NaN 27.9026 77.0382
2 7221797935 2696820.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN West Bengal NaN 24.6838 85.9695
3 7222270760 192610.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Andhra Pradesh NaN 15.5158 81.9342
4 7210384030 629240.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN West Bengal NaN 23.7821 87.4861

5 rows × 21 columns

In [4]:
random_sample = data[data.notna().all(axis=1)].sample(n=10, random_state=42)  # Randomly select 10 rows with no missing values
print(random_sample)
                id      price    year manufacturer                   model  \
421994  7302010346  1399780.0  2014.0         ford         transit connect   
359017  7316155054        0.0  2017.0          kia               soul plus   
119389  7316241495  2786620.0  2010.0    chevrolet  silverado 1500 ltz 4x4   
4642    7309388743  1149880.0  2015.0   volkswagen    passat tsi wolfsburg   
381660  7313268847  1130140.0  2011.0        mazda                       3   
26098   7314924294  1540060.0  2014.0   volkswagen                   jetta   
278249  7316153678  2117980.0  2007.0         jeep                wrangler   
194984  7309831477   693500.0  2010.0       nissan                  sentra   
421770  7303638125  4108440.0  2017.0    chevrolet          silverado 1500   
308593  7313955519        0.0  2012.0        buick                lacrosse   

        condition    cylinders fuel  odometer title_status  ...  \
421994       good  4 cylinders  gas  143546.0        clean  ...   
359017   like new  4 cylinders  gas   63496.0        clean  ...   
119389  excellent  8 cylinders  gas  124248.0        clean  ...   
4642    excellent  4 cylinders  gas  100274.0        clean  ...   
381660       good  4 cylinders  gas   77312.0        clean  ...   
26098   excellent  4 cylinders  gas   63930.0        clean  ...   
278249       good  6 cylinders  gas  122236.0        clean  ...   
194984       good  4 cylinders  gas  104000.0        clean  ...   
421770  excellent  8 cylinders  gas  101870.0        clean  ...   
308593  excellent  6 cylinders  gas   98239.0        clean  ...   

                      VIN drive       size      type paint_color  \
421994  NM0LS7F75E1164637   fwd  full-size       van         red   
359017  KNDJP3A51H7457654   fwd   mid-size  mini-van        blue   
119389  3GCRKTE37AG118369   4wd  full-size    pickup         red   
4642    1VWAT7A31FC018187   fwd   mid-size     sedan       white   
381660  JM1BL1U59B1435270   fwd   mid-size     sedan        grey   
26098   3VW2K7AJ1EM439955   fwd   mid-size     sedan       black   
278249  1J4GA39137L155539   4wd  full-size       SUV        grey   
194984  3N1AB6AP9AL643919   fwd   mid-size     sedan        grey   
421770  3GCUKSEC8HG435107   4wd  full-size     truck       black   
308593  1G4GJ5E34CF175943   fwd   mid-size     sedan       white   

                                                image_url              state  \
421994  https://images.craigslist.org/00U0U_29yhnVuqee...      Uttar Pradesh   
359017  https://images.craigslist.org/00202_iYUv6Xesse...            Haryana   
119389  https://images.craigslist.org/00D0D_9lZxHlsHk2...      Uttar Pradesh   
4642    https://images.craigslist.org/00505_924hjCwSYz...             Ladakh   
381660  https://images.craigslist.org/01616_3dxG5z31YU...         Tamil Nadu   
26098   https://images.craigslist.org/00X0X_iN9WvC5LD5...  Jammu and Kashmir   
278249  https://images.craigslist.org/00D0D_3AXXo1TdE6...        Maharashtra   
194984  https://images.craigslist.org/00H0H_dVRfJUD77f...        West Bengal   
421770  https://images.craigslist.org/01212_9xi1bxfHaI...              Delhi   
308593  https://images.craigslist.org/00J0J_acZwS8hozH...          Rajasthan   

                     posting_date latitude  longitude  
421994  2021-04-05 13:42:27+00:00  26.7509    80.8497  
359017  2021-05-03 13:41:03+00:00  29.6172    75.6695  
119389  2021-05-03 16:06:49+00:00  28.2134    80.2300  
4642    2021-04-19 18:33:28+00:00  34.9460    75.3032  
381660  2021-04-27 13:22:15+00:00   9.3002    78.0696  
26098   2021-04-30 18:14:50+00:00  32.9228    74.9187  
278249  2021-05-03 13:38:08+00:00  17.3431    78.5934  
194984  2021-04-20 15:46:07+00:00  22.5864    86.7940  
421770  2021-04-08 15:05:30+00:00  28.5658    76.9453  
308593  2021-04-28 18:35:08+00:00  28.1765    75.0177  

[10 rows x 21 columns]

Checking the Shape of the Dataset¶

In [5]:
rows, columns = data.shape
print(f"\nThe dataset contains {rows} rows and {columns} columns.")
The dataset contains 426880 rows and 21 columns.

Displaying Dataset Information¶

In [6]:
print("\nDataset Information:")
data.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 21 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   price         426880 non-null  float64
 2   year          425675 non-null  float64
 3   manufacturer  409234 non-null  object 
 4   model         421603 non-null  object 
 5   condition     252776 non-null  object 
 6   cylinders     249202 non-null  object 
 7   fuel          423867 non-null  object 
 8   odometer      422480 non-null  float64
 9   title_status  418638 non-null  object 
 10  transmission  424324 non-null  object 
 11  VIN           265838 non-null  object 
 12  drive         296313 non-null  object 
 13  size          120519 non-null  object 
 14  type          334022 non-null  object 
 15  paint_color   296677 non-null  object 
 16  image_url     426812 non-null  object 
 17  state         426880 non-null  object 
 18  posting_date  426812 non-null  object 
 19  latitude      426880 non-null  float64
 20  longitude     426880 non-null  float64
dtypes: float64(5), int64(1), object(15)
memory usage: 68.4+ MB
In [7]:
# Check the data type of the 'cylinders' column
print(data['cylinders'].dtype)

# If it's not an object (string) type, convert it first:
if data['cylinders'].dtype != object:
    data['cylinders'] = data['cylinders'].astype(str)

# Now, proceed with the extraction:
data['cylinders'] = data['cylinders'].str.extract('(\d+)').fillna(0).astype(int)

# Display the updated data types
print(data.dtypes)
object
id                int64
price           float64
year            float64
manufacturer     object
model            object
condition        object
cylinders         int64
fuel             object
odometer        float64
title_status     object
transmission     object
VIN              object
drive            object
size             object
type             object
paint_color      object
image_url        object
state            object
posting_date     object
latitude        float64
longitude       float64
dtype: object
  • Here i converted Cylinders into 'int64' type.

Converstion of Data

In [8]:
# Convert specified columns to string type
columns_to_str = ['id', 'manufacturer', 'model', 'condition', 'fuel', 'title_status', 'transmission' ,
                  'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url',
                  'state']

# Apply conversion
data[columns_to_str] = data[columns_to_str].astype(str)

# Display the updated data types
print(data.dtypes)
id               object
price           float64
year            float64
manufacturer     object
model            object
condition        object
cylinders         int64
fuel             object
odometer        float64
title_status     object
transmission     object
VIN              object
drive            object
size             object
type             object
paint_color      object
image_url        object
state            object
posting_date     object
latitude        float64
longitude       float64
dtype: object

Time Stamp

In [9]:
# Convert 'posting_date' to datetime
data['posting_date'] = pd.to_datetime(data['posting_date'], errors='coerce')

# Display the DataFrame
print(data)
                id      price    year manufacturer                     model  \
0       7222695916   770530.0     NaN          nan                       nan   
1       7218891961  1528210.0     NaN          nan                       nan   
2       7221797935  2696820.0     NaN          nan                       nan   
3       7222270760   192610.0     NaN          nan                       nan   
4       7210384030   629240.0     NaN          nan                       nan   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

       condition  cylinders    fuel  odometer title_status  ...  \
0            nan          0     nan       NaN          nan  ...   
1            nan          0     nan       NaN          nan  ...   
2            nan          0     nan       NaN          nan  ...   
3            nan          0     nan       NaN          nan  ...   
4            nan          0     nan       NaN          nan  ...   
...          ...        ...     ...       ...          ...  ...   
426875      good          6     gas   32226.0        clean  ...   
426876      good          0     gas   12029.0        clean  ...   
426877      good          0  diesel    4174.0        clean  ...   
426878      good          6     gas   30112.0        clean  ...   
426879      good          0     gas   22716.0        clean  ...   

                      VIN drive size       type paint_color  \
0                     nan   nan  nan        nan         nan   
1                     nan   nan  nan        nan         nan   
2                     nan   nan  nan        nan         nan   
3                     nan   nan  nan        nan         nan   
4                     nan   nan  nan        nan         nan   
...                   ...   ...  ...        ...         ...   
426875  1N4AA6AV6KC367801   fwd  nan      sedan         nan   
426876  7JR102FKXLG042696   fwd  nan      sedan         red   
426877  1GYFZFR46LF088296   nan  nan  hatchback       white   
426878  58ABK1GG4JU103853   fwd  nan      sedan      silver   
426879  WBA4J1C58KBM14708   rwd  nan      coupe         nan   

                                                image_url           state  \
0                                                     nan         Haryana   
1                                                     nan         Haryana   
2                                                     nan     West Bengal   
3                                                     nan  Andhra Pradesh   
4                                                     nan     West Bengal   
...                                                   ...             ...   
426875  https://images.craigslist.org/00o0o_iiraFnHg8q...           Delhi   
426876  https://images.craigslist.org/00x0x_15sbgnxCIS...          Punjab   
426877  https://images.craigslist.org/00L0L_farM7bxnxR...  Madhya Pradesh   
426878  https://images.craigslist.org/00z0z_bKnIVGLkDT...     Maharashtra   
426879  https://images.craigslist.org/00Y0Y_lEUocjyRxa...       Jharkhand   

                    posting_date latitude  longitude  
0                            NaT  28.8446    75.1167  
1                            NaT  27.9026    77.0382  
2                            NaT  24.6838    85.9695  
3                            NaT  15.5158    81.9342  
4                            NaT  23.7821    87.4861  
...                          ...      ...        ...  
426875 2021-04-04 09:21:31+00:00  28.7876    76.8644  
426876 2021-04-04 09:21:29+00:00  32.0715    75.1915  
426877 2021-04-04 09:21:17+00:00  21.5646    75.7247  
426878 2021-04-04 09:21:11+00:00  21.4183    75.4719  
426879 2021-04-04 09:21:07+00:00  24.2616    83.7688  

[426880 rows x 21 columns]

Checking for Duplicate Values in the Dataset¶

In [10]:
duplicate_count = len(data[data.duplicated()])
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")
Number of Duplicate Rows in the Dataset: 0

Checking for Missing/Null Values¶

In [11]:
# Now you can proceed with the missing value check:
missing_values = data.isnull().sum()
print("\nMissing Values in Each Column:")
print(missing_values)
Missing Values in Each Column:
id                 0
price              0
year            1205
manufacturer       0
model              0
condition          0
cylinders          0
fuel               0
odometer        4400
title_status       0
transmission       0
VIN                0
drive              0
size               0
type               0
paint_color        0
image_url          0
state              0
posting_date      68
latitude           0
longitude          0
dtype: int64
In [12]:
missing_values = data.isnull().sum().sum()
print(missing_values)
5673

Summary of Dataset Observations¶

In [13]:
print("\nObservations About the Dataset:")
if duplicate_count > 0:
    print(f"- There are {duplicate_count} duplicate rows in the dataset.")
else:
    print("- No duplicate rows found in the dataset.")

if missing_values.sum() > 0:
    print("- There are missing values in the dataset. Here’s a summary:")
    print(missing_values[missing_values > 0])
else:
    print("- No missing values found in the dataset.")

print("- The dataset is ready for further analysis after handling duplicates and missing values.")
Observations About the Dataset:
- No duplicate rows found in the dataset.
- There are missing values in the dataset. Here’s a summary:
[5673]
- The dataset is ready for further analysis after handling duplicates and missing values.

2. Data Types¶

In [14]:
# Dataset Columns
print("Dataset Columns:")
print(data.columns)
Dataset Columns:
Index(['id', 'price', 'year', 'manufacturer', 'model', 'condition',
       'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN',
       'drive', 'size', 'type', 'paint_color', 'image_url', 'state',
       'posting_date', 'latitude', 'longitude'],
      dtype='object')
In [15]:
# Dataset Describe
print("\nDataset Summary Statistics:")
print(data.describe(include='all'))
Dataset Summary Statistics:
                id         price           year manufacturer   model  \
count       426880  4.268800e+05  425675.000000       426880  426880   
unique      426880           NaN            NaN           43   29668   
top     7222695916           NaN            NaN         ford   f-150   
freq             1           NaN            NaN        70985    8009   
mean           NaN  9.657028e+06    2011.235191          NaN     NaN   
min            NaN  0.000000e+00    1900.000000          NaN     NaN   
25%            NaN  7.576800e+05    2008.000000          NaN     NaN   
50%            NaN  1.791970e+06    2013.000000          NaN     NaN   
75%            NaN  3.401080e+06    2017.000000          NaN     NaN   
max            NaN  4.798934e+11    2022.000000          NaN     NaN   
std            NaN  1.564439e+09       9.452120          NaN     NaN   

       condition      cylinders    fuel      odometer title_status  ...  \
count     426880  426880.000000  426880  4.224800e+05       426880  ...   
unique         7            NaN       6           NaN            7  ...   
top          nan            NaN     gas           NaN        clean  ...   
freq      174104            NaN  356209           NaN       405117  ...   
mean         NaN       3.466222     NaN  9.804333e+04          NaN  ...   
min          NaN       0.000000     NaN  0.000000e+00          NaN  ...   
25%          NaN       0.000000     NaN  3.770400e+04          NaN  ...   
50%          NaN       4.000000     NaN  8.554800e+04          NaN  ...   
75%          NaN       6.000000     NaN  1.335425e+05          NaN  ...   
max          NaN      12.000000     NaN  1.000000e+07          NaN  ...   
std          NaN       3.188464     NaN  2.138815e+05          NaN  ...   

           VIN   drive    size    type paint_color  \
count   426880  426880  426880  426880      426880   
unique  118265       4       5      14          13   
top        nan     4wd     nan     nan         nan   
freq    161042  131904  306361   92858      130203   
mean       NaN     NaN     NaN     NaN         NaN   
min        NaN     NaN     NaN     NaN         NaN   
25%        NaN     NaN     NaN     NaN         NaN   
50%        NaN     NaN     NaN     NaN         NaN   
75%        NaN     NaN     NaN     NaN         NaN   
max        NaN     NaN     NaN     NaN         NaN   
std        NaN     NaN     NaN     NaN         NaN   

                                                image_url          state  \
count                                              426880         426880   
unique                                             241900             35   
top     https://images.craigslist.org/00N0N_1xMPvfxRAI...  Uttar Pradesh   
freq                                                 7357          45573   
mean                                                  NaN            NaN   
min                                                   NaN            NaN   
25%                                                   NaN            NaN   
50%                                                   NaN            NaN   
75%                                                   NaN            NaN   
max                                                   NaN            NaN   
std                                                   NaN            NaN   

                               posting_date       latitude      longitude  
count                                426812  426880.000000  426880.000000  
unique                                  NaN            NaN            NaN  
top                                     NaN            NaN            NaN  
freq                                    NaN            NaN            NaN  
mean    2021-04-24 00:23:42.976467200+00:00      22.800888      79.012914  
min               2021-04-04 07:00:25+00:00       6.800300      68.400200  
25%     2021-04-17 10:47:46.750000128+00:00      17.796000      75.719100  
50%        2021-04-26 01:08:31.500000+00:00      24.058650      77.353900  
75%     2021-05-01 13:32:26.249999872+00:00      28.563525      80.927325  
max               2021-05-05 04:24:09+00:00      35.997600      97.398000  
std                                     NaN       6.532617       5.208436  

[11 rows x 21 columns]
In [16]:
# Variables Description
print("\n### Variables Description ###")
variable_descriptions = {
"id": "A unique identifier for each listing or item.",
"price": "The price of the item or listing. It might be in a specific currency (e.g., INR, USD).",
"year": "The manufacturing year of the item (possibly for vehicles or other dated products).",
"manufacturer": "The name of the manufacturer or brand of the item.",
"model": "The specific model of the product.",
"condition": "The condition of the item (e.g., new, used, refurbished).",
"cylinders": "The number of cylinders (likely relevant if the dataset is related to vehicles).",
"fuel": "The type of fuel the item uses (e.g., petrol, diesel, electric).",
"odometer": "The mileage or distance traveled by the item (if vehicles are involved).",
"title_status": "The legal or ownership status of the item (e.g., clean, salvage, rebuilt).",
"VIN(Vehicle Identification Number)": "A unique code to identify individual vehicles.",
"drive": "The drivetrain type (e.g., front-wheel drive, rear-wheel drive, all-wheel drive).",
"size": "The size category of the item (e.g., small, medium, large).",
"type": "The type or category of the item (e.g., car, truck, bike, or other product types).",
"paint_color": "The color of the item (e.g., for vehicles or items with customizable color options).",
"image_url": "A URL to the image of the item or listing.",
"state": "The state in which the listing or item is located.",
"posting_date": "The date on which the item was listed or posted.",
"latitude": "The latitude coordinate of the item's location.",
"longitude": "The longitude coordinate of the item's location."
}

for col, desc in variable_descriptions.items():
    print(f"* **{col}**: {desc}")
### Variables Description ###
* **id**: A unique identifier for each listing or item.
* **price**: The price of the item or listing. It might be in a specific currency (e.g., INR, USD).
* **year**: The manufacturing year of the item (possibly for vehicles or other dated products).
* **manufacturer**: The name of the manufacturer or brand of the item.
* **model**: The specific model of the product.
* **condition**: The condition of the item (e.g., new, used, refurbished).
* **cylinders**: The number of cylinders (likely relevant if the dataset is related to vehicles).
* **fuel**: The type of fuel the item uses (e.g., petrol, diesel, electric).
* **odometer**: The mileage or distance traveled by the item (if vehicles are involved).
* **title_status**: The legal or ownership status of the item (e.g., clean, salvage, rebuilt).
* **VIN(Vehicle Identification Number)**: A unique code to identify individual vehicles.
* **drive**: The drivetrain type (e.g., front-wheel drive, rear-wheel drive, all-wheel drive).
* **size**: The size category of the item (e.g., small, medium, large).
* **type**: The type or category of the item (e.g., car, truck, bike, or other product types).
* **paint_color**: The color of the item (e.g., for vehicles or items with customizable color options).
* **image_url**: A URL to the image of the item or listing.
* **state**: The state in which the listing or item is located.
* **posting_date**: The date on which the item was listed or posted.
* **latitude**: The latitude coordinate of the item's location.
* **longitude**: The longitude coordinate of the item's location.

Unique Values for each variable.¶

In [17]:
# Unique Values for Each Variable
print("\n### Unique Values for Each Variable ###")
for column in data.columns.tolist():
    print(f"No. of unique values in {column}: {data[column].nunique()}.")
### Unique Values for Each Variable ###
No. of unique values in id: 426880.
No. of unique values in price: 6253.
No. of unique values in year: 114.
No. of unique values in manufacturer: 43.
No. of unique values in model: 29668.
No. of unique values in condition: 7.
No. of unique values in cylinders: 8.
No. of unique values in fuel: 6.
No. of unique values in odometer: 104870.
No. of unique values in title_status: 7.
No. of unique values in transmission: 4.
No. of unique values in VIN: 118265.
No. of unique values in drive: 4.
No. of unique values in size: 5.
No. of unique values in type: 14.
No. of unique values in paint_color: 13.
No. of unique values in image_url: 241900.
No. of unique values in state: 35.
No. of unique values in posting_date: 333355.
No. of unique values in latitude: 191354.
No. of unique values in longitude: 162536.
In [18]:
#import pandas as pd
#from geopy.geocoders import Nominatim

# Load your data (replace 'your_data.csv' with your actual file)
#data = pd.read_csv(output_file)

# Function to get the city
#def get_city(row):
    #geolocator = Nominatim(user_agent="my_geocoder")
    #location = geolocator.reverse(str(row['latitude']) + ", " + str(row['longitude']))
    #return location.address.split(',')[0] if location else None

# Apply the function and create a new column
#data['city'] = data.apply(get_city, axis=1)

# Save the modified DataFrame
#data.to_csv('output.csv', index=False)
In [20]:
# import math

# cities_df = pd.read_csv('India Cities LatLng.csv')
# def haversine(lat1, lon1, lat2, lon2):
#     R = 6371  # Radius of Earth in kilometers
#     phi1 = math.radians(lat1)
#     phi2 = math.radians(lat2)
#     delta_phi = math.radians(lat2 - lat1)
#     delta_lambda = math.radians(lon2 - lon1)

#     a = math.sin(delta_phi / 2)*2 + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2)*2
#     c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

#     return R * c  # Distance in kilometers

# def get_closest_city(row):
#     min_distance = float('inf')
#     closest_city = None

#     for _, city_row in cities_df.iterrows():
#         distance = haversine(row['latitude'], row['longitude'], city_row['lat'], city_row['lng'])
#         if distance < min_distance:
#             min_distance = distance
#             closest_city = city_row['city']

#     return closest_city


# df = pd.DataFrame(data)


# # Apply the function to get the closest city for each row in data
# df['city'] = df.apply(get_closest_city, axis=1)

3. Data Wrangling¶

Data Wrangling Code¶

In [19]:
# Copying the dataset for analysis
data = data.copy()
In [20]:
# Checking basic stats
print("Dataset Shape:", data.shape)
print("Dataset Columns:", data.columns)
Dataset Shape: (426880, 21)
Dataset Columns: Index(['id', 'price', 'year', 'manufacturer', 'model', 'condition',
       'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN',
       'drive', 'size', 'type', 'paint_color', 'image_url', 'state',
       'posting_date', 'latitude', 'longitude'],
      dtype='object')
In [21]:
!pip install geopy
Requirement already satisfied: geopy in /usr/local/lib/python3.11/dist-packages (2.4.1)
Requirement already satisfied: geographiclib<3,>=1.52 in /usr/local/lib/python3.11/dist-packages (from geopy) (2.0)
In [24]:
# #from geopy.geocoders import Nominatim
# from geopy.exc import GeocoderTimedOut

# def get_city_from_coordinates(lat, lon):
#     try:
#         geolocator = Nominatim(user_agent="geo_locator")
#         location = geolocator.reverse((lat, lon), exactly_one=True)
#         if location and 'address' in location.raw:
#             address = location.raw['address']
#             city = address.get('city') or address.get('town') or address.get('village')
#             state = address.get('state')
#             country = address.get('country')
#             return city, state, country
#         else:
#             return None, None, None
#     except GeocoderTimedOut:
#         print("Geocoder service timed out. Try again.")
#         return None, None, None

# # Example usage
# latitude = 28.6139   # Example: Latitude of New Delhi
# longitude = 77.2090  # Example: Longitude of New Delhi

# city, state, country = get_city_from_coordinates(latitude, longitude)

# if city:
#     print(f"City: {city}, State: {state}, Country: {country}")
# else:
#     print("City not found.")
  • By using latitude and longitude values we can see city , state and country.

OutLiers¶

  1. Define Outliers (IQR Method)
In [22]:
def detect_outliers_iqr(data, columns):
    outliers = {}
    for col in columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Find outliers
        outliers[col] = data[(data[col] < lower_bound) | (data[col] > upper_bound)][col]
    return outliers

# Apply to numeric columns
numeric_columns = ['price', 'year', 'odometer']
outliers = detect_outliers_iqr(data, numeric_columns)
for col, vals in outliers.items():
    print(f"Outliers in {col}:")
    print(vals)
Outliers in price:
204       7.573780e+06
280       1.268338e+11
346       1.270072e+07
365       9.361800e+06
461       1.283604e+07
              ...     
426694    7.692300e+06
426716    8.469800e+06
426778    8.474760e+06
426810    7.575790e+06
426840    8.475680e+06
Name: price, Length: 8177, dtype: float64
Outliers in year:
37        1992.0
101       1968.0
137       1966.0
208       1976.0
210       1969.0
           ...  
426729    1951.0
426762    1982.0
426809    1990.0
426831    1985.0
426838    1920.0
Name: year, Length: 15896, dtype: float64
Outliers in odometer:
138        281000.0
280        999999.0
411        279000.0
455       9999999.0
484        283356.0
            ...    
426508     284000.0
426590     291988.0
426643     287000.0
426671     291137.0
426690     295000.0
Name: odometer, Length: 4385, dtype: float64
  1. Define Outliers (Standard Deviation Method)
In [23]:
def detect_outliers_std(data, columns):
    outliers = {}
    for col in columns:
        mean = data[col].mean()
        std = data[col].std()
        lower_bound = mean - 3 * std
        upper_bound = mean + 3 * std

        # Find outliers
        outliers[col] = data[(data[col] < lower_bound) | (data[col] > upper_bound)][col]
    return outliers

# Apply to numeric columns
numeric_columns = ['price', 'year', 'odometer']
outliers_std = detect_outliers_std(data, numeric_columns)
for col, vals in outliers_std.items():
    print(f"Outliers in {col}:")
    print(vals)
Outliers in price:
280       1.268338e+11
1838      1.284192e+10
29386     1.426880e+11
37409     1.426880e+11
37410     3.864838e+11
91576     3.884607e+11
122470    1.268338e+11
137807    1.585422e+10
153082    1.585422e+11
184704    1.810795e+11
193736    1.585422e+10
207080    1.585422e+10
230753    1.733773e+10
241404    1.457003e+10
257840    3.884607e+11
288356    1.585422e+10
307488    1.585422e+10
318592    4.798934e+11
356716    4.798934e+11
358555    1.585422e+10
Name: price, dtype: float64
Outliers in year:
101       1968.0
137       1966.0
208       1976.0
210       1969.0
228       1978.0
           ...  
426575    1970.0
426593    1971.0
426729    1951.0
426762    1982.0
426838    1920.0
Name: year, Length: 9116, dtype: float64
Outliers in odometer:
280        999999.0
455       9999999.0
562       2222222.0
615       1111111.0
616       1111111.0
            ...    
423025     840000.0
423091    1111111.0
424165     900132.0
426106    9999999.0
426108     999999.0
Name: odometer, Length: 1089, dtype: float64
  1. Visualize Outliers (Boxplot)
In [24]:
import matplotlib.pyplot as plt

for col in ['price', 'year', 'odometer']:
    plt.figure(figsize=(8, 4))
    plt.boxplot(data[col].dropna(), vert=False, patch_artist=True)
    plt.title(f"Boxplot of {col}")
    plt.xlabel(col)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
  1. Treat or Handle Outliers
In [25]:
def remove_outliers_iqr(data, columns):
    for col in columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        data = data[(data[col] >= lower_bound) & (data[col] <= upper_bound)]
    return data

data_cleaned = remove_outliers_iqr(data, numeric_columns)

Demand for pre-owned cars by region¶

In [26]:
# Bar Chart: Listings by State
state_counts = data['state'].value_counts()

plt.figure(figsize=(10, 6))
state_counts.plot(kind='bar', color='skyblue')
plt.title('Count of Listings by State', fontsize=16)
plt.xlabel('State', fontsize=6)
plt.ylabel('Number of Listings', fontsize=14)
plt.xticks(rotation=45, fontsize=6)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
No description has been provided for this image

Bar Chart:

  • Here it shows uttarpradesh is showing the highest counts in state.
  • The states with the highest and lowest counts are evident, enabling identification of demand hotspots.

Car Attributes Analysis¶

In [27]:
# Set general style for plots
sns.set_style("whitegrid")

# Step 1: Distribution

# Histograms
numeric_columns = ['price', 'year', 'odometer']
plt.figure(figsize=(15, 5))
for i, col in enumerate(numeric_columns, 1):
    plt.subplot(1, 3, i)
    plt.hist(data[col], bins=10, color='skyblue', edgecolor='black')
    plt.title(f'Distribution of {col.capitalize()}', fontsize=14)
    plt.xlabel(col.capitalize(), fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
plt.tight_layout()
plt.show()

# Pie Charts
categorical_columns = ['fuel', 'condition', 'paint_color', 'drive']

plt.figure(figsize=(14, 8))

for i, col in enumerate(categorical_columns, 1):
    plt.subplot(2, 2, i)

    # Get top 5 categories, grouping the rest as "Other"
    top_categories = data[col].value_counts().nlargest(5)
    other_count = data[col].value_counts().sum() - top_categories.sum()

    # Append "Other" category if necessary
    if other_count > 0:
        top_categories['Other'] = other_count

    # Plot the filtered pie chart
    top_categories.plot.pie(
        autopct='%1.1f%%',
        colors=sns.color_palette("pastel"),
        startangle=90,
        wedgeprops={'edgecolor': 'black'}
    )

    plt.title(f'Distribution of {col.capitalize()} (Top 5)', fontsize=14)
    plt.ylabel('')

plt.tight_layout()
plt.show()
# Step 2: Relationships

# Scatterplots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(data=data, x='odometer', y='price', color='teal')
plt.title('Price vs. Odometer', fontsize=14)
plt.xlabel('Odometer', fontsize=12)
plt.ylabel('Price', fontsize=12)

plt.subplot(1, 2, 2)
sns.scatterplot(data=data, x='year', y='price', color='orange')
plt.title('Price vs. Year', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.tight_layout()
plt.show()

# Boxplots
categorical_columns = ['condition', 'fuel', 'drive']
plt.figure(figsize=(20, 5))
for i, col in enumerate(categorical_columns, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(data=data, x=col, y='price', palette='Set2')
    plt.title(f'Price by {col.capitalize()}', fontsize=14)
    plt.xlabel(col.capitalize(), fontsize=12)
    plt.ylabel('Price', fontsize=12)
plt.tight_layout()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
<ipython-input-27-647f5fa808f8>:70: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=data, x=col, y='price', palette='Set2')
<ipython-input-27-647f5fa808f8>:70: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=data, x=col, y='price', palette='Set2')
<ipython-input-27-647f5fa808f8>:70: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=data, x=col, y='price', palette='Set2')
No description has been provided for this image

Histograms:¶

  • Visualize the distribution of numerical attributes like price, year, and odometer.
  • Identify central tendencies, skewness, and outliers.

Pie Charts:¶

  • Show the proportions of categorical attributes (fuel, condition, etc.).
  • Highlight dominant categories, such as the most common fuel type or paint color.

Scatterplots:¶

  • Price vs. Odometer: Observe the trend of car prices decreasing as mileage increases.
  • Price vs. Year: Examine how car prices vary with the year of manufacture.

Boxplots:¶

  • Group price by categories like condition, fuel, and drive.
  • Highlight variations in price across these groups and detect outliers.

Manufacturer and Model Popularity¶

  1. Top 10 Maufacturers by count.
In [28]:
# General style
sns.set_style("whitegrid")

# Step 1: Top 10 Manufacturers by Count
manufacturer_counts = data['manufacturer'].value_counts().head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x=manufacturer_counts.index, y=manufacturer_counts.values, palette='viridis')
plt.title('Top 10 Manufacturers by Count', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
<ipython-input-28-8c4aee0bdc88>:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=manufacturer_counts.index, y=manufacturer_counts.values, palette='viridis')
No description has been provided for this image
  • Here bar chart shows the top 10 manufacturers with the most listings.
  • And ford shows the highest in top 10.
  1. Top 10 Models by Count
In [29]:
model_counts = data['model'].value_counts().head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x=model_counts.index, y=model_counts.values, palette='coolwarm')
plt.title('Top 10 Models by Count', fontsize=14)
plt.xlabel('Model', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
<ipython-input-29-3f2e3d584f3b>:3: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=model_counts.index, y=model_counts.values, palette='coolwarm')
No description has been provided for this image
  • Here we show most most popular models . Here f-150 is the most popular and top 10 model.
  1. Manufacturers Segmented by Condition
In [30]:
condition_segment = data.groupby(['manufacturer', 'condition']).size().unstack(fill_value=0)
condition_segment = condition_segment.loc[manufacturer_counts.index]  # Align manufacturers with top 10
condition_segment.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title('Manufacturers Segmented by Condition', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Condition', fontsize=10)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image
  • A stacked bar chart groups the top manufacturers by vehicle condition (e.g., excellent, good, fair). Reveals which manufacturers offer the most "like new" or "fair" condition cars.
  1. Manufacturers Segmented by Fuel
In [31]:
fuel_segment = data.groupby(['manufacturer', 'fuel']).size().unstack(fill_value=0)
fuel_segment = fuel_segment.loc[manufacturer_counts.index]  # Align manufacturers with top 10
fuel_segment.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='coolwarm')
plt.title('Manufacturers Segmented by Fuel', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Fuel', fontsize=10)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image
  • A stacked bar chart groups the top manufacturers by fuel type (e.g., gas, diesel, electric). Shows which brands lean towards specific fuel types, such as Toyota offering more electric cars.

Pricing Analysis¶

In [32]:
# General style
sns.set_style("whitegrid")

top_states = data['state'].value_counts().index[:10]
filtered_data = data[data['state'].isin(top_states)]

plt.figure(figsize=(12, 6))
sns.violinplot(data=filtered_data, x='state', y='price', palette='muted', inner='quartile')

plt.title('Distribution of Car Prices by Top 10 States', fontsize=14)
plt.xlabel('State', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# Step 2: Violin Plot of Price Grouped by Condition
plt.figure(figsize=(12, 6))
sns.violinplot(data=data, x='condition', y='price', palette='Set2')
plt.title('Distribution of Prices by Condition', fontsize=14)
plt.xlabel('Condition', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.tight_layout()
plt.show()

# Step 3: Line Chart Showing Average Price Trends Over Year
# Group data by year and calculate average price
avg_price_by_year = data.groupby('year')['price'].mean().reset_index()

plt.figure(figsize=(10, 5))
sns.lineplot(data=avg_price_by_year, x='year', y='price', marker='o', color='b')
plt.title('Average Price Trends Over Years', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Price', fontsize=12)
plt.tight_layout()
plt.show()
<ipython-input-32-5dc789e1b852>:8: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.violinplot(data=filtered_data, x='state', y='price', palette='muted', inner='quartile')
No description has been provided for this image
<ipython-input-32-5dc789e1b852>:21: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.violinplot(data=data, x='condition', y='price', palette='Set2')
No description has been provided for this image
No description has been provided for this image
  1. Violin Plot of Price Grouped by State:
  • It provides more details on the distribution shape, showing multimodal data (multiple peaks) if present.
  • It gives a clearer visual of pricing trends, especially in cases where the data is not normally distributed.
  1. Violin Plot of Price Grouped by Condition:
  • Analyze how vehicle prices vary based on their condition.
  • "Like new" vehicles are expected to have higher median prices and narrower distributions.
  • "Fair" vehicles may have a wider spread, reflecting variability in pricing.
  1. Line Chart of Average Price Trends Over Years:
  • Use a line chart to show year-over-year price changes.
  • Identify whether prices are increasing, decreasing, or fluctuating. Spot anomalies (e.g., sudden spikes or drops).

Regional Postings Distribution¶

In [33]:
# Step 1: Grouping the data by 'state' to count the number of postings per region
region_postings = data['state'].value_counts().reset_index()
region_postings.columns = ['Region', 'Postings']

# Step 2: Handling missing values (if any), for now we remove rows where state is NaN
region_postings = region_postings.dropna()

# Step 3: Sorting the data by 'Postings' in descending order for better visualization
region_postings = region_postings.sort_values(by='Postings', ascending=False)

# Step 4: Plotting the bar chart
plt.figure(figsize=(12, 8))
sns.barplot(x='Postings', y='Region', data=region_postings, palette='viridis')
plt.title('Car Postings Distribution by Region', fontsize=16)
plt.xlabel('Number of Postings', fontsize=12)
plt.ylabel('Region', fontsize=12)
plt.show()
<ipython-input-33-27050b1abf8c>:13: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='Postings', y='Region', data=region_postings, palette='viridis')
No description has been provided for this image
  • The bar chart will display the regions (states) on the y-axis and the number of car postings on the x-axis. The regions with the most car postings will be on the top, and regions with fewer postings will appear lower.

Regional Price Comparison: Analyzing Average Prices Across Different Regions¶

In [34]:
# Step 1: Calculate the average price for each model across regions
avg_price_per_region = data.groupby(['state', 'model'])['price'].mean().reset_index()

# Step 2: Create a pivot table for heatmap visualization
price_pivot = avg_price_per_region.pivot(index='model', columns='state', values='price')

# Step 3: Plotting heatmap for regional price comparison
plt.figure(figsize=(8, 4))
sns.heatmap(price_pivot, cmap='coolwarm', annot=True, fmt='.0f', linewidths=0.5)
plt.title('Average Car Price Comparison Across state')
plt.xlabel('State')
plt.ylabel('Car Model')
plt.xticks(rotation=30, ha='right')
plt.show()

# Bar chart for selected car models (e.g., top 5 most common models)
top_models = data['model'].value_counts().nlargest(5).index
filtered_data = data[data['model'].isin(top_models)]

# Group data again for selected models
avg_price_filtered = filtered_data.groupby(['state', 'model'])['price'].mean().reset_index()

plt.figure(figsize=(14, 6))
sns.barplot(x='state', y='price', hue='model', data=avg_price_filtered, palette='viridis')
plt.title('Average Price of Top Car Models Across Regions')
plt.xlabel('State')
plt.ylabel('Average Price')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Car Model')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
No description has been provided for this image
No description has been provided for this image
  • Heat map Visualization:

    • Indicates the average price—blue regions represent lower prices, while red regions represent higher prices.
  • Bar Chart Visualization:

    • We took Top 5 models for easier visualization. From all the states

Hypothesis¶

Hypothesis-1 : Better condition for cars result in higher prices

In [35]:
# Step 1: Calculate the average price for each condition
avg_price_condition = data.groupby('condition')['price'].mean().reset_index()

# Step 2: Sort values for better visualization
avg_price_condition = avg_price_condition.sort_values(by='price', ascending=False)

# Step 3: Bar Chart to visualize average price by condition
plt.figure(figsize=(10, 6))
sns.barplot(x='condition', y='price', data=avg_price_condition, palette='viridis')
plt.title('Average Car Price by Condition')
plt.xlabel('Condition')
plt.ylabel('Average Price')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

# Step 4: Box Plot to analyze price variability within each condition
plt.figure(figsize=(12, 6))
sns.boxplot(x='condition', y='price', data=data, palette='coolwarm')
plt.title('Price Distribution by Car Condition')
plt.xlabel('Condition')
plt.ylabel('Price')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
<ipython-input-35-ec038169399b>:9: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='condition', y='price', data=avg_price_condition, palette='viridis')
No description has been provided for this image
<ipython-input-35-ec038169399b>:19: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='condition', y='price', data=data, palette='coolwarm')
No description has been provided for this image
  • Insights from the Analysis:
    • Higher Prices for Better Conditions:Here Fair has the highest average price condtion.
  • Recjbgcfxdxfcg nnhgdywfrdexzdxf

Hypothesis-2 : Correlation Between Price, Year, and Odometer

In [36]:
# Step 1: Calculate the correlation matrix
correlation = data[['price', 'year', 'odometer']].corr()

# Step 2: Print the correlation matrix
print("Correlation Matrix:\n", correlation)

# Step 3: Heatmap to visualize correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix: Price, Year, and Odometer')
plt.show()

# Step 4: Pairplot to visualize relationships between variables
sns.pairplot(data[['price', 'year', 'odometer']], diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pairplot of Price, Year, and Odometer', y=1.02)
plt.show()

# Step 5: Scatter plots for individual analysis
plt.figure(figsize=(12, 5))

# Price vs Year
plt.subplot(1, 2, 1)
sns.scatterplot(x='year', y='price', data=data, alpha=0.6, color='blue')
plt.title('Car Price vs Manufacturing Year')
plt.xlabel('Year')
plt.ylabel('Price')

# Price vs Odometer
plt.subplot(1, 2, 2)
sns.scatterplot(x='odometer', y='price', data=data, alpha=0.6, color='red')
plt.title('Car Price vs Odometer Reading')
plt.xlabel('Odometer')
plt.ylabel('Price')

plt.tight_layout()
plt.show()
Correlation Matrix:
              price      year  odometer
price     1.000000 -0.004925  0.010032
year     -0.004925  1.000000 -0.157215
odometer  0.010032 -0.157215  1.000000
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
  • Insights for the Analysis:

    • A positive correlation indicates that newer cars have higher prices.
  • Recommendations:

    • Analyze correlations by segmenting data by brand or car type.

Hypothesis-3 : Geographic Analysis

In [37]:
# Step 1: Calculate average car price by state and sort in descending order
avg_price_state = data.groupby('state')['price'].mean().sort_values(ascending=False)

# Step 2: Print the result
print("Average Car Price by State:\n", avg_price_state)

# Step 3: Visualize using a bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=avg_price_state.index, y=avg_price_state.values, palette='viridis')
plt.xticks(rotation=90)
plt.title('Average Car Price by State')
plt.xlabel('State')
plt.ylabel('Average Price (USD)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Step 4: Visualize using a heatmap (if geographic data available)
plt.figure(figsize=(8, 12))
sns.heatmap(avg_price_state.to_frame().T, cmap='coolwarm', annot=True, fmt='.0f')
plt.title('Heatmap of Average Car Prices by State')
plt.ylabel('Price')
plt.show()
Average Car Price by State:
 state
Lakshadweep                                 1.665653e+08
Chandigarh                                  1.370311e+08
Goa                                         6.424109e+07
West Bengal                                 2.884930e+07
Haryana                                     2.001612e+07
Karnataka                                   1.600844e+07
Kerala                                      1.364536e+07
Uttar Pradesh                               8.511310e+06
Dadra and Nagar Haveli and Daman and Diu    7.138190e+06
Maharashtra                                 5.056206e+06
Madhya Pradesh                              3.556956e+06
Tamil Nadu                                  3.109092e+06
Delhi                                       2.860353e+06
Chhattisgarh                                2.503276e+06
Gujarat                                     2.369305e+06
Punjab                                      2.348216e+06
Rajasthan                                   2.346407e+06
Assam                                       2.314903e+06
Jharkhand                                   2.306235e+06
Andaman and Nicobar Islands                 2.268856e+06
Sikkim                                      2.268851e+06
Manipur                                     2.267773e+06
Puducherry                                  2.257328e+06
Arunachal Pradesh                           2.250119e+06
Odisha                                      2.247732e+06
Jammu and Kashmir                           2.240744e+06
Tripura                                     2.238199e+06
Telangana                                   2.237392e+06
Uttarakhand                                 2.237066e+06
Andhra Pradesh                              2.231818e+06
Bihar                                       2.225778e+06
Ladakh                                      2.218344e+06
Meghalaya                                   2.216664e+06
Himachal Pradesh                            2.197821e+06
Mizoram                                     2.188526e+06
Name: price, dtype: float64
<ipython-input-37-4b324341e25b>:9: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=avg_price_state.index, y=avg_price_state.values, palette='viridis')
No description has been provided for this image
No description has been provided for this image
  • Insights from Analysis:
    • States with a higher average price could indicate a preference for premium or newer cars.
  • Recommendations:
    • Check if prices in certain states are rising or falling over months/years.

Assessing the Demand for Pre-Owned Cars in Each Region¶

In [38]:
#  Number of Listings by Region (Supply Indicator)
listings_by_region = data['state'].value_counts()
print("Number of Listings by Region:\n", listings_by_region)

# Visualization of listings by region
plt.figure(figsize=(12, 6))
sns.barplot(x=listings_by_region.index, y=listings_by_region.values, palette='viridis')
plt.title('Number of Car Listings by Region')
plt.xlabel('Region')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()
Number of Listings by Region:
 state
Uttar Pradesh                               45573
Maharashtra                                 45263
Karnataka                                   30492
Punjab                                      30455
Delhi                                       30379
Haryana                                     30175
Tamil Nadu                                  18124
West Bengal                                 18052
Telangana                                   15356
Kerala                                      15304
Madhya Pradesh                              15205
Rajasthan                                   15131
Andhra Pradesh                              15070
Gujarat                                     14992
Odisha                                      12189
Jharkhand                                    6080
Bihar                                        6034
Chhattisgarh                                 6003
Assam                                        5987
Uttarakhand                                  5951
Andaman and Nicobar Islands                  3079
Sikkim                                       3065
Ladakh                                       3062
Puducherry                                   3046
Jammu and Kashmir                            3030
Meghalaya                                    3017
Himachal Pradesh                             3016
Mizoram                                      3010
Manipur                                      2990
Chandigarh                                   2986
Arunachal Pradesh                            2980
Dadra and Nagar Haveli and Daman and Diu     2980
Tripura                                      2961
Goa                                          2923
Lakshadweep                                  2920
Name: count, dtype: int64
<ipython-input-38-7fb1fec7a708>:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=listings_by_region.index, y=listings_by_region.values, palette='viridis')
No description has been provided for this image
  • In this region place i have given state
  • Bar chart represents counts the number of car postings in each region.

Identifying Demand for Specific Car Models and Their Key Attributes¶

In [39]:
# Analysis 1: Number of Listings by Fuel Type
fuel_listings = filtered_data['fuel'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=fuel_listings.index, y=fuel_listings.values, palette='coolwarm')
plt.title('Number of Listings by Fuel Type')
plt.xlabel('Fuel Type')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()

# Analysis 2:Number of Clean Cars by Fuel Type
clean_fuel = filtered_data[filtered_data['title_status'] == 'clean']['fuel'].value_counts().sort_values(ascending=False)
if not clean_fuel.empty:
    plt.figure(figsize=(12, 6))
    sns.barplot(x=clean_fuel.index, y=clean_fuel.values, palette='Set2')
    plt.title('Number of Clean Cars by Fuel Type')
    plt.xlabel('Fuel Type')
    plt.ylabel('Number of Listings')
    plt.xticks(rotation=45)
    plt.show()
else:
    print("No clean car data available for Fuel Type analysis.")

# Analysis 3: Number of Listings by Condition
condition_listings = filtered_data['condition'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=condition_listings.index, y=condition_listings.values, palette='Blues')
plt.title('Number of Listings by Condition')
plt.xlabel('Condition')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()

# Analysis 4: Number of Clean Cars by Condition
clean_condition = filtered_data[filtered_data['title_status'] == 'clean']['condition'].value_counts().sort_values(ascending=False)
if not clean_condition.empty:
    plt.figure(figsize=(12, 6))
    sns.barplot(x=clean_condition.index, y=clean_condition.values, palette='Oranges')
    plt.title('Number of Clean Cars by Condition')
    plt.xlabel('Condition')
    plt.ylabel('Number of Listings')
    plt.xticks(rotation=45)
    plt.show()
else:
    print("No clean car data available for Condition analysis.")
<ipython-input-39-e98a20952f3d>:4: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=fuel_listings.index, y=fuel_listings.values, palette='coolwarm')
No description has been provided for this image
<ipython-input-39-e98a20952f3d>:15: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=clean_fuel.index, y=clean_fuel.values, palette='Set2')
No description has been provided for this image
<ipython-input-39-e98a20952f3d>:27: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=condition_listings.index, y=condition_listings.values, palette='Blues')
No description has been provided for this image
<ipython-input-39-e98a20952f3d>:38: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=clean_condition.index, y=clean_condition.values, palette='Oranges')
No description has been provided for this image

Optimal Pricing Strategy for Cars¶

In [40]:
# Inspect the data
print(data.head())

# 1. Calculate the average and median price for each car model
price_stats = data.groupby('model')['price'].agg(['mean', 'median', 'count']).reset_index()
price_stats.columns = ['model', 'average_price', 'median_price', 'listing_count']

# 2. Identify outliers using Interquartile Range (IQR) for price
Q1 = data['price'].quantile(0.25)
Q3 = data['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
filtered_data = data[(data['price'] >= lower_bound) & (data['price'] <= upper_bound)]

# 3. Competitive analysis based on demand and supply
sold_count = data[data['title_status'] == 'sold'].groupby('model')['price'].count().reset_index()
sold_count.columns = ['model', 'sold_count']

# Merge sold count with price stats
optimal_pricing_data = pd.merge(price_stats, sold_count, on='model', how='left').fillna(0)

# 4. Analyzing condition impact on price
condition_price = data.groupby(['model', 'condition'])['price'].mean().unstack(fill_value=0)

# 5. Analyzing mileage impact on price
plt.figure(figsize=(12, 6))
sns.scatterplot(x='odometer', y='price', hue='condition', data=data, palette='coolwarm')
plt.title('Mileage vs Price Distribution')
plt.xlabel('Odometer (Mileage)')
plt.ylabel('Price')
plt.legend(title='Condition')
plt.show()

# 6. Adjusting pricing based on market trends
optimal_pricing_data['recommended_price'] = np.where(
    optimal_pricing_data['sold_count'] > 0,
    optimal_pricing_data['median_price'] * 1.05,  # Increase price slightly if demand is high
    optimal_pricing_data['median_price'] * 0.95   # Decrease price if demand is low
)

# 7. Display the recommended prices
print(optimal_pricing_data[['model', 'average_price', 'median_price', 'sold_count', 'recommended_price']])

# 8. Visualize pricing recommendations
plt.figure(figsize=(12, 6))
sns.barplot(x='model', y='recommended_price', data=optimal_pricing_data.sort_values(by='recommended_price', ascending=False).head(20), palette='viridis')
plt.title('Recommended Pricing for Top Car Models')
plt.xlabel('Car Model')
plt.ylabel('Recommended Price')
plt.xticks(rotation=90)
plt.show()
           id      price  year manufacturer model condition  cylinders fuel  \
0  7222695916   770530.0   NaN          nan   nan       nan          0  nan   
1  7218891961  1528210.0   NaN          nan   nan       nan          0  nan   
2  7221797935  2696820.0   NaN          nan   nan       nan          0  nan   
3  7222270760   192610.0   NaN          nan   nan       nan          0  nan   
4  7210384030   629240.0   NaN          nan   nan       nan          0  nan   

   odometer title_status  ...  VIN drive size type paint_color image_url  \
0       NaN          nan  ...  nan   nan  nan  nan         nan       nan   
1       NaN          nan  ...  nan   nan  nan  nan         nan       nan   
2       NaN          nan  ...  nan   nan  nan  nan         nan       nan   
3       NaN          nan  ...  nan   nan  nan  nan         nan       nan   
4       NaN          nan  ...  nan   nan  nan  nan         nan       nan   

            state posting_date latitude  longitude  
0         Haryana          NaT  28.8446    75.1167  
1         Haryana          NaT  27.9026    77.0382  
2     West Bengal          NaT  24.6838    85.9695  
3  Andhra Pradesh          NaT  15.5158    81.9342  
4     West Bengal          NaT  23.7821    87.4861  

[5 rows x 21 columns]
No description has been provided for this image
                                                   model  average_price  \
0                                                    "t"      3467350.0   
1      $362.47, $1000 down, oac, 2.9%apr $362.47,luxu...      3197630.0   
2                                                      %       128440.0   
3                                               & altima       391100.0   
4                                     '50 Business Coupe      1284190.0   
...                                                  ...            ...   
29663                               ♦ALL TADES WELCOME!♦        64180.0   
29664                                                  ♿      1412630.0   
29665                                              ♿ vmi      1412165.0   
29666                               𝓜𝓮𝓻𝓬𝓮𝓭𝓮𝓼 𝓫𝓮𝓷𝔃 𝓶𝓵 350       500890.0   
29667                        🔥GMC Sierra 1500 SLE🔥 4X4 🔥      1284190.0   

       median_price  sold_count  recommended_price  
0         3467350.0         0.0         3293982.50  
1         3197630.0         0.0         3037748.50  
2          128440.0         0.0          122018.00  
3          391100.0         0.0          371545.00  
4         1284190.0         0.0         1219980.50  
...             ...         ...                ...  
29663       64180.0         0.0           60971.00  
29664     1412630.0         0.0         1341998.50  
29665     1412165.0         0.0         1341556.75  
29666      500890.0         0.0          475845.50  
29667     1284190.0         0.0         1219980.50  

[29668 rows x 5 columns]
<ipython-input-40-0a1f70a6d962>:49: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='model', y='recommended_price', data=optimal_pricing_data.sort_values(by='recommended_price', ascending=False).head(20), palette='viridis')
No description has been provided for this image

By following this approach, car dealers can set optimal prices for each model by:

  • Ensuring competitiveness against market trends.
  • Balancing demand with supply to adjust pricing dynamically.
  • Accounting for key factors like condition and mileage.

To analyze the relationship between the posting_date and year of manufacture

In [41]:
# Convert posting_date to datetime
data['posting_date'] = pd.to_datetime(data['posting_date'])

# Extract the year from the posting date
data['posting_year'] = data['posting_date'].dt.year

# Calculate vehicle age at the time of posting
data['vehicle_age'] = data['posting_year'] - data['year']

# Display result
print(data[['year', 'posting_date', 'vehicle_age']])

# Conclusion
average_vehicle_age = data['vehicle_age'].mean()
print(f"On average, vehicles are {average_vehicle_age:.1f} years old at the time of posting.")
          year              posting_date  vehicle_age
0          NaN                       NaT          NaN
1          NaN                       NaT          NaN
2          NaN                       NaT          NaN
3          NaN                       NaT          NaN
4          NaN                       NaT          NaN
...        ...                       ...          ...
426875  2019.0 2021-04-04 09:21:31+00:00          2.0
426876  2020.0 2021-04-04 09:21:29+00:00          1.0
426877  2020.0 2021-04-04 09:21:17+00:00          1.0
426878  2018.0 2021-04-04 09:21:11+00:00          3.0
426879  2019.0 2021-04-04 09:21:07+00:00          2.0

[426880 rows x 3 columns]
On average, vehicles are 9.8 years old at the time of posting.
  • This is to show the relationship between posting_date and year of manufacture.
In [42]:
# Convert posting_date to datetime
data['posting_date'] = pd.to_datetime(data['posting_date'])

# Extract the year from the posting date
data['posting_year'] = data['posting_date'] .dt.year

# Calculate vehicle age at the time of posting
data['vehicle_age'] = data['posting_year'] - data['year']

# Plot a histogram to visualize vehicle age distribution
plt.figure(figsize=(8, 5))
sns.histplot(data['vehicle_age'], bins=10, kde=True, color='skyblue')
plt.title('Distribution of Vehicle Age at Posting')
plt.xlabel('Vehicle Age (years)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Plot a bar chart to see the average vehicle age by posting year
plt.figure(figsize=(8, 5))
sns.barplot(x=data['posting_year'], y=data['vehicle_age'], palette='viridis')
plt.title('Average Vehicle Age by Posting Year')
plt.xlabel('Posting Year')
plt.ylabel('Average Vehicle Age')
plt.show()

# Display box plot for age distribution
plt.figure(figsize=(8, 5))
sns.boxplot(x=data['vehicle_age'], color='lightcoral')
plt.title('Boxplot of Vehicle Age')
plt.xlabel('Vehicle Age (years)')
plt.grid(True)
plt.show()
No description has been provided for this image
<ipython-input-42-a73b8b06c96f>:21: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=data['posting_year'], y=data['vehicle_age'], palette='viridis')
No description has been provided for this image
No description has been provided for this image

To identify which car brands have a high sale value

In [43]:
# Group by manufacturer and calculate total sales value
brand_sales = data.groupby('manufacturer')['price'].sum().reset_index()

# Sort by sales value in descending order
brand_sales = brand_sales.sort_values(by='price', ascending=False)

# Display result
print(brand_sales)

# Identify the brand with the highest sales value
highest_sale_brand = brand_sales.iloc[0]
print(f"The brand with the highest sales value is {highest_sale_brand['manufacturer']} with a total sales value of {highest_sale_brand['price']}.")
       manufacturer         price
40           toyota  1.029069e+12
7         chevrolet  8.179780e+11
26    mercedes-benz  8.068871e+11
20             jeep  3.680175e+11
13             ford  3.319232e+11
42            volvo  1.662761e+11
31              nan  7.596019e+10
14              gmc  6.554087e+10
35              ram  6.531354e+10
4               bmw  5.047692e+10
32           nissan  4.610003e+10
16            honda  2.936577e+10
10            dodge  2.640941e+10
5             buick  2.598642e+10
3              audi  2.292789e+10
23            lexus  2.022540e+10
6          cadillac  1.735790e+10
38           subaru  1.591916e+10
0             acura  1.523405e+10
41       volkswagen  1.504611e+10
17          hyundai  1.423159e+10
18         infiniti  1.206377e+10
21              kia  1.203714e+10
24          lincoln  1.073178e+10
25            mazda  8.604323e+09
8          chrysler  8.087946e+09
36            rover  7.376169e+09
19           jaguar  6.635093e+09
29       mitsubishi  5.810281e+09
34          porsche  5.677958e+09
28             mini  4.321809e+09
39            tesla  4.275460e+09
1        alfa-romeo  3.252941e+09
33          pontiac  2.402481e+09
11          ferrari  1.310728e+09
12             fiat  1.209434e+09
37           saturn  9.632394e+08
27          mercury  8.335567e+08
15  harley-davidson  2.383198e+08
2      aston-martin  1.648740e+08
9            datsun  1.225653e+08
22       land rover  2.133458e+07
30           morgan  5.046890e+06
The brand with the highest sales value is toyota with a total sales value of 1029068765610.0.
  • The brand with the highest sales value is toyota with a total sales value of 1029068765610.0.
In [44]:
# Group by manufacturer and calculate total sales value
brand_sales = data.groupby('manufacturer')['price'].sum().reset_index()

# Sort by sales value in descending order
brand_sales = brand_sales.sort_values(by='price', ascending=False)

# Display result
print(brand_sales)

# Identify the brand with the highest sales value
highest_sale_brand = brand_sales.iloc[0]
print(f"The brand with the highest sales value is {highest_sale_brand['manufacturer']} with a total sales value of {highest_sale_brand['price']}.")

# Plot a bar chart to visualize brand sales
plt.figure(figsize=(8, 5))
sns.barplot(x='manufacturer', y='price', data=brand_sales, palette='viridis')
plt.title('Total Sales Value by Manufacturer')
plt.xlabel('Manufacturer')
plt.ylabel('Total Sales Value')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--')
plt.show()

# Step 1: Aggregate sales by manufacturer (if not already done)
brand_sales = data.groupby('manufacturer')['price'].sum().reset_index()

# Step 2: Sort and select the top 5 manufacturers by sales
top_brands = brand_sales.sort_values('price', ascending=False).head(5)

# Step 3: Plot the pie chart for the top 5 brands
plt.figure(figsize=(8, 8))
plt.pie(top_brands['price'], labels=top_brands['manufacturer'], autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Sales Distribution by Top 5 Manufacturer')
plt.show()
       manufacturer         price
40           toyota  1.029069e+12
7         chevrolet  8.179780e+11
26    mercedes-benz  8.068871e+11
20             jeep  3.680175e+11
13             ford  3.319232e+11
42            volvo  1.662761e+11
31              nan  7.596019e+10
14              gmc  6.554087e+10
35              ram  6.531354e+10
4               bmw  5.047692e+10
32           nissan  4.610003e+10
16            honda  2.936577e+10
10            dodge  2.640941e+10
5             buick  2.598642e+10
3              audi  2.292789e+10
23            lexus  2.022540e+10
6          cadillac  1.735790e+10
38           subaru  1.591916e+10
0             acura  1.523405e+10
41       volkswagen  1.504611e+10
17          hyundai  1.423159e+10
18         infiniti  1.206377e+10
21              kia  1.203714e+10
24          lincoln  1.073178e+10
25            mazda  8.604323e+09
8          chrysler  8.087946e+09
36            rover  7.376169e+09
19           jaguar  6.635093e+09
29       mitsubishi  5.810281e+09
34          porsche  5.677958e+09
28             mini  4.321809e+09
39            tesla  4.275460e+09
1        alfa-romeo  3.252941e+09
33          pontiac  2.402481e+09
11          ferrari  1.310728e+09
12             fiat  1.209434e+09
37           saturn  9.632394e+08
27          mercury  8.335567e+08
15  harley-davidson  2.383198e+08
2      aston-martin  1.648740e+08
9            datsun  1.225653e+08
22       land rover  2.133458e+07
30           morgan  5.046890e+06
The brand with the highest sales value is toyota with a total sales value of 1029068765610.0.
<ipython-input-44-e0c4ba509063>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='manufacturer', y='price', data=brand_sales, palette='viridis')
No description has been provided for this image
No description has been provided for this image
  • Here we can see what type of car purchase is happend and count of it.

To check if the same vehicle has been resold within a given time frame

In [45]:
# Define the time frame (e.g., 60 days)
time_frame = pd.Timedelta(days=60)

# Sort data by VIN and posting date
data= data.sort_values(by=['VIN', 'posting_date'])

# Check for resales within the time frame
data['resale_within_timeframe'] = data.duplicated(subset=['VIN'], keep=False) & (
    data['posting_date'].diff().fillna(pd.Timedelta(days=9999)) <= time_frame
)

# Filter records that were resold within the given timeframe
resold_vehicles = data[data['resale_within_timeframe']]

# Display the result
print("Vehicles resold within the given timeframe:\n", resold_vehicles[['VIN', 'posting_date']])
Vehicles resold within the given timeframe:
                       VIN              posting_date
32544               00000 2021-04-28 13:44:51+00:00
29955               00000 2021-05-04 15:37:19+00:00
288401  00000000000000000 2021-04-05 16:26:23+00:00
288400  00000000000000000 2021-04-05 16:28:23+00:00
303644  00000000000000000 2021-04-05 16:34:11+00:00
...                   ...                       ...
406984                nan 2021-05-05 03:56:25+00:00
406982                nan 2021-05-05 03:58:48+00:00
406981                nan 2021-05-05 03:59:22+00:00
422031                nan 2021-05-05 04:07:07+00:00
422030                nan 2021-05-05 04:24:09+00:00

[348828 rows x 2 columns]
  • Here we can see which vehicle have been resold within time.

Recommendations for Spinny¶

  1. Competitive Pricing Recommendations:
  • Increase Prices for High-Demand Models:Increase Prices for High-Demand Models
  • Reduce Prices for Low-Demand Models:Consider reducing prices for models like "Ford Fiesta" or "Nissan Altima" if demand is low.
  1. Condition-Based Pricing Adjustments:
  • Increase prices by 10-15% for cars in excellent condition to reflect their higher value perception.
  1. Mileage-Based Price Adjustments:
  • Lower mileage cars tend to attract higher prices.
  1. Regional Market Adaptation:
  • Pricing should be adjusted based on geographic trends where demand for specific models is higher.
  1. Fuel Type Preferences:
  • Increase pricing for EVs and hybrids by 5-10% to reflect market preferences and future demand trends.
  1. Visual Pricing Insights:
  • Use data visualizations to identify the sweet spot for pricing and communicate value propositions effectively to customers.
  1. Dynamic Pricing Implementation:
  • Regular monitoring and updates of pricing based on seasonal trends, fuel prices, and competitor adjustments.

Conclusion: By integrating these strategic pricing recommendations, Spinny can position itself as a market leader in the pre-owned car segment, achieving higher sales volumes, improved customer satisfaction, and sustained revenue growth.¶

In [46]:
data.to_csv('spinny_data.csv')