Spinny's Revenue Growth Through Demand and Pricing Optimisation¶
Spinny Pre-Owned Car Market Analysis¶
Introduction:¶
Spinny, a leading Indian startup in the pre-owned car market, aims to increase revenue by boosting the number of completed transactions. To achieve this, we will analyze used car transaction data to:
- Assess the demand for pre-owned cars in each region.
- Identify the demand for specific car models and their key attributes (e.g., condition, fuel type, mileage).
- Determine the optimal pricing for each car to stay competitive and align with market trends.
- The insights derived will support strategic decisions to optimize inventory, pricing, and marketing efforts.
1. Data Understanding¶
- Loading and Inspecting Data
- Import the dataset and inspect its structure.
- Display the first few rows to understand the data layout.
- Check for: Column names and data types.
- Missing values and their distribution.
- Basic statistics of numerical and categorical features.
2. Data Preparation¶
- Handling Missing Values
- Identify columns with missing data. Implement strategies to handle them (e.g., imputation, removal). Justify selected strategies briefly.
- Dealing with Outliers
- Use statistical methods (e.g., IQR) or visualizations (e.g., boxplots) to identify and address outliers.
- Ensure the approach aligns with the business context.
- Standardizing Data
- Format categorical variables (e.g., fuel type, transmission) for consistency.
- Scale numerical columns where needed for analysis (e.g., price, mileage).
3. Exploratory Data Analysis (EDA)¶
- Univariate Analysis
- Examine the distribution of key variables:
- Visualize using histograms, bar charts, and boxplots.
- Bivariate Analysis
- Explore relationships between:
- Car attributes.
- Region and car demand. Use scatter plots, boxplots, and grouped bar charts.
- Explore relationships between:
- Multivariate Analysis
- Investigate interactions:
- Region, car attributes, and price.
- Visualize using heatmaps, pair plots, and correlation matrices.
- Investigate interactions:
4. Charting and Insights¶
- Visualization Strategy
- Provide relevant charts with:
- Titles, labels, legends, and annotations.
- Brief markdown explanations of the purpose and insights.
- Key Insights
- Highlight actionable insights:
- Regions with high demand.
- Attributes driving car demand (e.g., fuel type, age). Pricing trends by region and attribute.
5.Insights and Recommendations¶
- Summary of Findings
- Region-wise demand distribution.
- Popular car models and attributes.
- Optimal pricing strategies based on market demand.
- Recommendations
- Target Regions with High Demand:
- Focus marketing efforts on regions showing peak demand.
- Increase inventory in high-demand areas.
Unlocking Customer Demand and Transaction Growth in the Pre-Owned Car Market¶
Spinny, a leading Indian startup in the pre-owned car market, is revolutionizing the way consumers buy and sell used cars by providing a seamless and trustworthy platform. With the growing demand for affordable and reliable pre-owned vehicles, Spinny aims to cater to diverse customer needs across various regions in India.
However, success in the competitive pre-owned car market hinges on two critical factors: understanding customer demand and offering competitive pricing. Customers have numerous options to choose from, and this abundance can lead to missed opportunities for conversions, where potential buyers fail to complete transactions. High transaction dropout rates can negatively impact revenue growth, customer trust, and overall market share.
For Spinny, identifying the key factors that drive demand for cars—such as region-specific preferences, car attributes (e.g., fuel type, mileage, condition), and optimal pricing—is essential. Why do certain cars appeal more to buyers? What attributes influence purchase decisions? Understanding these patterns is crucial to crafting effective strategies that meet customer expectations, boost completed transactions, and drive sustainable revenue growth.
Objective:¶
As a data analyst, your goal is to analyze used car transaction data to:
- Assess the demand for pre-owned cars across different regions.
- Identify the demand for specific car models and their key attributes (e.g., condition, fuel type, mileage).
- Determine the optimal pricing for each car, ensuring it is competitive and aligned with market trends.
Business Impact¶
This analysis will empower Spinny as a leading platform in the pre-owned car market to:
Enhance Regional Market Understanding : Identify region-specific demand trends, enabling better inventory planning and targeted marketing strategies.
Increase Transaction Conversion Rates: Provide insights into the most in-demand car models and attributes, optimizing the car selection available to customers.
Optimize Pricing Strategies: Establish competitive and market-aligned pricing models to attract more buyers and drive completed transactions.
Boost Revenue and Customer Satisfaction: Addressing customer preferences and pricing gaps will contribute to increased customer satisfaction and higher transaction volumes, directly impacting revenue growth.
- By addressing these key factors, this project supports Spinny's broader goal of building customer trust, driving sustainable growth, and reinforcing its position as a leader in the pre-owned car market. Let’s dive into the data to uncover these actionable insights!
Dataset Overview¶
Dataset Overview¶
- Dataset Name : Spinny_Data -Number of Rows : 426880 -Number of Columns : 21
- Description : Optimizing Revenue Growth for Spinny through Demand Analysis and Competitive Pricing.
Column Definitions¶
- id – A unique identifier assigned to each car listing.
- price – The listed selling price of the car (in the local currency).
- year – The manufacturing year of the car.
- manufacturer – The brand or company that produced the car (e.g., Toyota, Ford).
- model – The specific model name of the car (e.g., Corolla, Mustang).
- condition – The overall state of the car, typically classified as new, excellent, good, fair, etc.
- cylinders – The number of engine cylinders in the car, indicating engine size and power.
- fuel – The type of fuel the car uses (e.g., petrol, diesel, electric, hybrid).
- odometer – The total distance the car has traveled, measured in miles or kilometers.
- title_status – The legal ownership status of the vehicle (e.g., clean, salvage, rebuilt).
- transmission – This column indicates the type of transmission system in the car, which determines how the vehicle shifts gears.
- VIN (Vehicle Identification Number) – A unique code assigned to each car for identification and tracking.
- drive – The type of drivetrain (e.g., front-wheel drive, rear-wheel drive, all-wheel drive).
- size – The size classification of the car (e.g., compact, midsize, full-size, SUV).
- type – The category of the car (e.g., sedan, SUV, truck, coupe).
- paint_color – The exterior color of the car.
- image_url – A link to the image of the car listing.
- state – The location (state or region) where the car is being sold.
- posting_date – The date when the car was listed for sale.
- latitude – The geographical latitude of the car’s listing location.
- longitude – The geographical longitude of the car’s listing location.
Analysis & Visualisation !¶
1. Importing and Cleaning Data¶
Importing Necessary Libraries¶
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical computations
import matplotlib.pyplot as plt # For plotting and visualization
import seaborn as sns # For advanced visualizations
Loading the Dataset from google drive¶
# Step 1: Install gdown
!pip install gdown
# Step 2: Import necessary libraries
import gdown
import pandas as pd
# Step 3: Set the file ID and create a download URL
file_id = "1QQaRZizuq3TGhUqHyIc701ChT-GgM1yR"
download_url = f"https://drive.google.com/uc?id={file_id}"
# Step 4: Set the output file name
output_file = "spinny_dataset.csv"
# Step 5: Download the file
gdown.download(download_url, output_file, quiet=False)
# Step 6: Load the CSV file into a Pandas DataFrame
data = pd.read_csv(output_file)
Requirement already satisfied: gdown in /usr/local/lib/python3.11/dist-packages (5.2.0) Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.11/dist-packages (from gdown) (4.12.3) Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from gdown) (3.17.0) Requirement already satisfied: requests[socks] in /usr/local/lib/python3.11/dist-packages (from gdown) (2.32.3) Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from gdown) (4.67.1) Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown) (2.6) Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (3.4.1) Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (2.3.0) Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (2024.12.14) Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (1.7.1)
Downloading... From: https://drive.google.com/uc?id=1QQaRZizuq3TGhUqHyIc701ChT-GgM1yR To: /content/spinny_dataset.csv 100%|██████████| 99.9M/99.9M [00:00<00:00, 166MB/s]
Viewing the First Few Rows of the Dataset¶
print("First 5 Rows of the Dataset:")
data.head(5)
First 5 Rows of the Dataset:
id | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | ... | VIN | drive | size | type | paint_color | image_url | state | posting_date | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7222695916 | 770530.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Haryana | NaN | 28.8446 | 75.1167 |
1 | 7218891961 | 1528210.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Haryana | NaN | 27.9026 | 77.0382 |
2 | 7221797935 | 2696820.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | West Bengal | NaN | 24.6838 | 85.9695 |
3 | 7222270760 | 192610.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Andhra Pradesh | NaN | 15.5158 | 81.9342 |
4 | 7210384030 | 629240.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | West Bengal | NaN | 23.7821 | 87.4861 |
5 rows × 21 columns
random_sample = data[data.notna().all(axis=1)].sample(n=10, random_state=42) # Randomly select 10 rows with no missing values
print(random_sample)
id price year manufacturer model \ 421994 7302010346 1399780.0 2014.0 ford transit connect 359017 7316155054 0.0 2017.0 kia soul plus 119389 7316241495 2786620.0 2010.0 chevrolet silverado 1500 ltz 4x4 4642 7309388743 1149880.0 2015.0 volkswagen passat tsi wolfsburg 381660 7313268847 1130140.0 2011.0 mazda 3 26098 7314924294 1540060.0 2014.0 volkswagen jetta 278249 7316153678 2117980.0 2007.0 jeep wrangler 194984 7309831477 693500.0 2010.0 nissan sentra 421770 7303638125 4108440.0 2017.0 chevrolet silverado 1500 308593 7313955519 0.0 2012.0 buick lacrosse condition cylinders fuel odometer title_status ... \ 421994 good 4 cylinders gas 143546.0 clean ... 359017 like new 4 cylinders gas 63496.0 clean ... 119389 excellent 8 cylinders gas 124248.0 clean ... 4642 excellent 4 cylinders gas 100274.0 clean ... 381660 good 4 cylinders gas 77312.0 clean ... 26098 excellent 4 cylinders gas 63930.0 clean ... 278249 good 6 cylinders gas 122236.0 clean ... 194984 good 4 cylinders gas 104000.0 clean ... 421770 excellent 8 cylinders gas 101870.0 clean ... 308593 excellent 6 cylinders gas 98239.0 clean ... VIN drive size type paint_color \ 421994 NM0LS7F75E1164637 fwd full-size van red 359017 KNDJP3A51H7457654 fwd mid-size mini-van blue 119389 3GCRKTE37AG118369 4wd full-size pickup red 4642 1VWAT7A31FC018187 fwd mid-size sedan white 381660 JM1BL1U59B1435270 fwd mid-size sedan grey 26098 3VW2K7AJ1EM439955 fwd mid-size sedan black 278249 1J4GA39137L155539 4wd full-size SUV grey 194984 3N1AB6AP9AL643919 fwd mid-size sedan grey 421770 3GCUKSEC8HG435107 4wd full-size truck black 308593 1G4GJ5E34CF175943 fwd mid-size sedan white image_url state \ 421994 https://images.craigslist.org/00U0U_29yhnVuqee... Uttar Pradesh 359017 https://images.craigslist.org/00202_iYUv6Xesse... Haryana 119389 https://images.craigslist.org/00D0D_9lZxHlsHk2... Uttar Pradesh 4642 https://images.craigslist.org/00505_924hjCwSYz... Ladakh 381660 https://images.craigslist.org/01616_3dxG5z31YU... Tamil Nadu 26098 https://images.craigslist.org/00X0X_iN9WvC5LD5... Jammu and Kashmir 278249 https://images.craigslist.org/00D0D_3AXXo1TdE6... Maharashtra 194984 https://images.craigslist.org/00H0H_dVRfJUD77f... West Bengal 421770 https://images.craigslist.org/01212_9xi1bxfHaI... Delhi 308593 https://images.craigslist.org/00J0J_acZwS8hozH... Rajasthan posting_date latitude longitude 421994 2021-04-05 13:42:27+00:00 26.7509 80.8497 359017 2021-05-03 13:41:03+00:00 29.6172 75.6695 119389 2021-05-03 16:06:49+00:00 28.2134 80.2300 4642 2021-04-19 18:33:28+00:00 34.9460 75.3032 381660 2021-04-27 13:22:15+00:00 9.3002 78.0696 26098 2021-04-30 18:14:50+00:00 32.9228 74.9187 278249 2021-05-03 13:38:08+00:00 17.3431 78.5934 194984 2021-04-20 15:46:07+00:00 22.5864 86.7940 421770 2021-04-08 15:05:30+00:00 28.5658 76.9453 308593 2021-04-28 18:35:08+00:00 28.1765 75.0177 [10 rows x 21 columns]
Checking the Shape of the Dataset¶
rows, columns = data.shape
print(f"\nThe dataset contains {rows} rows and {columns} columns.")
The dataset contains 426880 rows and 21 columns.
Displaying Dataset Information¶
print("\nDataset Information:")
data.info()
Dataset Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 426880 entries, 0 to 426879 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 426880 non-null int64 1 price 426880 non-null float64 2 year 425675 non-null float64 3 manufacturer 409234 non-null object 4 model 421603 non-null object 5 condition 252776 non-null object 6 cylinders 249202 non-null object 7 fuel 423867 non-null object 8 odometer 422480 non-null float64 9 title_status 418638 non-null object 10 transmission 424324 non-null object 11 VIN 265838 non-null object 12 drive 296313 non-null object 13 size 120519 non-null object 14 type 334022 non-null object 15 paint_color 296677 non-null object 16 image_url 426812 non-null object 17 state 426880 non-null object 18 posting_date 426812 non-null object 19 latitude 426880 non-null float64 20 longitude 426880 non-null float64 dtypes: float64(5), int64(1), object(15) memory usage: 68.4+ MB
# Check the data type of the 'cylinders' column
print(data['cylinders'].dtype)
# If it's not an object (string) type, convert it first:
if data['cylinders'].dtype != object:
data['cylinders'] = data['cylinders'].astype(str)
# Now, proceed with the extraction:
data['cylinders'] = data['cylinders'].str.extract('(\d+)').fillna(0).astype(int)
# Display the updated data types
print(data.dtypes)
object id int64 price float64 year float64 manufacturer object model object condition object cylinders int64 fuel object odometer float64 title_status object transmission object VIN object drive object size object type object paint_color object image_url object state object posting_date object latitude float64 longitude float64 dtype: object
- Here i converted Cylinders into 'int64' type.
Converstion of Data
# Convert specified columns to string type
columns_to_str = ['id', 'manufacturer', 'model', 'condition', 'fuel', 'title_status', 'transmission' ,
'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url',
'state']
# Apply conversion
data[columns_to_str] = data[columns_to_str].astype(str)
# Display the updated data types
print(data.dtypes)
id object price float64 year float64 manufacturer object model object condition object cylinders int64 fuel object odometer float64 title_status object transmission object VIN object drive object size object type object paint_color object image_url object state object posting_date object latitude float64 longitude float64 dtype: object
Time Stamp
# Convert 'posting_date' to datetime
data['posting_date'] = pd.to_datetime(data['posting_date'], errors='coerce')
# Display the DataFrame
print(data)
id price year manufacturer model \ 0 7222695916 770530.0 NaN nan nan 1 7218891961 1528210.0 NaN nan nan 2 7221797935 2696820.0 NaN nan nan 3 7222270760 192610.0 NaN nan nan 4 7210384030 629240.0 NaN nan nan ... ... ... ... ... ... 426875 7301591192 3029710.0 2019.0 nissan maxima s sedan 4d 426876 7301591187 3928680.0 2020.0 volvo s60 t5 momentum sedan 4d 426877 7301591147 4493750.0 2020.0 cadillac xt4 sport suv 4d 426878 7301591140 3723220.0 2018.0 lexus es 350 sedan 4d 426879 7301591129 3928680.0 2019.0 bmw 4 series 430i gran coupe condition cylinders fuel odometer title_status ... \ 0 nan 0 nan NaN nan ... 1 nan 0 nan NaN nan ... 2 nan 0 nan NaN nan ... 3 nan 0 nan NaN nan ... 4 nan 0 nan NaN nan ... ... ... ... ... ... ... ... 426875 good 6 gas 32226.0 clean ... 426876 good 0 gas 12029.0 clean ... 426877 good 0 diesel 4174.0 clean ... 426878 good 6 gas 30112.0 clean ... 426879 good 0 gas 22716.0 clean ... VIN drive size type paint_color \ 0 nan nan nan nan nan 1 nan nan nan nan nan 2 nan nan nan nan nan 3 nan nan nan nan nan 4 nan nan nan nan nan ... ... ... ... ... ... 426875 1N4AA6AV6KC367801 fwd nan sedan nan 426876 7JR102FKXLG042696 fwd nan sedan red 426877 1GYFZFR46LF088296 nan nan hatchback white 426878 58ABK1GG4JU103853 fwd nan sedan silver 426879 WBA4J1C58KBM14708 rwd nan coupe nan image_url state \ 0 nan Haryana 1 nan Haryana 2 nan West Bengal 3 nan Andhra Pradesh 4 nan West Bengal ... ... ... 426875 https://images.craigslist.org/00o0o_iiraFnHg8q... Delhi 426876 https://images.craigslist.org/00x0x_15sbgnxCIS... Punjab 426877 https://images.craigslist.org/00L0L_farM7bxnxR... Madhya Pradesh 426878 https://images.craigslist.org/00z0z_bKnIVGLkDT... Maharashtra 426879 https://images.craigslist.org/00Y0Y_lEUocjyRxa... Jharkhand posting_date latitude longitude 0 NaT 28.8446 75.1167 1 NaT 27.9026 77.0382 2 NaT 24.6838 85.9695 3 NaT 15.5158 81.9342 4 NaT 23.7821 87.4861 ... ... ... ... 426875 2021-04-04 09:21:31+00:00 28.7876 76.8644 426876 2021-04-04 09:21:29+00:00 32.0715 75.1915 426877 2021-04-04 09:21:17+00:00 21.5646 75.7247 426878 2021-04-04 09:21:11+00:00 21.4183 75.4719 426879 2021-04-04 09:21:07+00:00 24.2616 83.7688 [426880 rows x 21 columns]
Checking for Duplicate Values in the Dataset¶
duplicate_count = len(data[data.duplicated()])
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")
Number of Duplicate Rows in the Dataset: 0
Checking for Missing/Null Values¶
# Now you can proceed with the missing value check:
missing_values = data.isnull().sum()
print("\nMissing Values in Each Column:")
print(missing_values)
Missing Values in Each Column: id 0 price 0 year 1205 manufacturer 0 model 0 condition 0 cylinders 0 fuel 0 odometer 4400 title_status 0 transmission 0 VIN 0 drive 0 size 0 type 0 paint_color 0 image_url 0 state 0 posting_date 68 latitude 0 longitude 0 dtype: int64
missing_values = data.isnull().sum().sum()
print(missing_values)
5673
Summary of Dataset Observations¶
print("\nObservations About the Dataset:")
if duplicate_count > 0:
print(f"- There are {duplicate_count} duplicate rows in the dataset.")
else:
print("- No duplicate rows found in the dataset.")
if missing_values.sum() > 0:
print("- There are missing values in the dataset. Here’s a summary:")
print(missing_values[missing_values > 0])
else:
print("- No missing values found in the dataset.")
print("- The dataset is ready for further analysis after handling duplicates and missing values.")
Observations About the Dataset: - No duplicate rows found in the dataset. - There are missing values in the dataset. Here’s a summary: [5673] - The dataset is ready for further analysis after handling duplicates and missing values.
2. Data Types¶
# Dataset Columns
print("Dataset Columns:")
print(data.columns)
Dataset Columns: Index(['id', 'price', 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url', 'state', 'posting_date', 'latitude', 'longitude'], dtype='object')
# Dataset Describe
print("\nDataset Summary Statistics:")
print(data.describe(include='all'))
Dataset Summary Statistics: id price year manufacturer model \ count 426880 4.268800e+05 425675.000000 426880 426880 unique 426880 NaN NaN 43 29668 top 7222695916 NaN NaN ford f-150 freq 1 NaN NaN 70985 8009 mean NaN 9.657028e+06 2011.235191 NaN NaN min NaN 0.000000e+00 1900.000000 NaN NaN 25% NaN 7.576800e+05 2008.000000 NaN NaN 50% NaN 1.791970e+06 2013.000000 NaN NaN 75% NaN 3.401080e+06 2017.000000 NaN NaN max NaN 4.798934e+11 2022.000000 NaN NaN std NaN 1.564439e+09 9.452120 NaN NaN condition cylinders fuel odometer title_status ... \ count 426880 426880.000000 426880 4.224800e+05 426880 ... unique 7 NaN 6 NaN 7 ... top nan NaN gas NaN clean ... freq 174104 NaN 356209 NaN 405117 ... mean NaN 3.466222 NaN 9.804333e+04 NaN ... min NaN 0.000000 NaN 0.000000e+00 NaN ... 25% NaN 0.000000 NaN 3.770400e+04 NaN ... 50% NaN 4.000000 NaN 8.554800e+04 NaN ... 75% NaN 6.000000 NaN 1.335425e+05 NaN ... max NaN 12.000000 NaN 1.000000e+07 NaN ... std NaN 3.188464 NaN 2.138815e+05 NaN ... VIN drive size type paint_color \ count 426880 426880 426880 426880 426880 unique 118265 4 5 14 13 top nan 4wd nan nan nan freq 161042 131904 306361 92858 130203 mean NaN NaN NaN NaN NaN min NaN NaN NaN NaN NaN 25% NaN NaN NaN NaN NaN 50% NaN NaN NaN NaN NaN 75% NaN NaN NaN NaN NaN max NaN NaN NaN NaN NaN std NaN NaN NaN NaN NaN image_url state \ count 426880 426880 unique 241900 35 top https://images.craigslist.org/00N0N_1xMPvfxRAI... Uttar Pradesh freq 7357 45573 mean NaN NaN min NaN NaN 25% NaN NaN 50% NaN NaN 75% NaN NaN max NaN NaN std NaN NaN posting_date latitude longitude count 426812 426880.000000 426880.000000 unique NaN NaN NaN top NaN NaN NaN freq NaN NaN NaN mean 2021-04-24 00:23:42.976467200+00:00 22.800888 79.012914 min 2021-04-04 07:00:25+00:00 6.800300 68.400200 25% 2021-04-17 10:47:46.750000128+00:00 17.796000 75.719100 50% 2021-04-26 01:08:31.500000+00:00 24.058650 77.353900 75% 2021-05-01 13:32:26.249999872+00:00 28.563525 80.927325 max 2021-05-05 04:24:09+00:00 35.997600 97.398000 std NaN 6.532617 5.208436 [11 rows x 21 columns]
# Variables Description
print("\n### Variables Description ###")
variable_descriptions = {
"id": "A unique identifier for each listing or item.",
"price": "The price of the item or listing. It might be in a specific currency (e.g., INR, USD).",
"year": "The manufacturing year of the item (possibly for vehicles or other dated products).",
"manufacturer": "The name of the manufacturer or brand of the item.",
"model": "The specific model of the product.",
"condition": "The condition of the item (e.g., new, used, refurbished).",
"cylinders": "The number of cylinders (likely relevant if the dataset is related to vehicles).",
"fuel": "The type of fuel the item uses (e.g., petrol, diesel, electric).",
"odometer": "The mileage or distance traveled by the item (if vehicles are involved).",
"title_status": "The legal or ownership status of the item (e.g., clean, salvage, rebuilt).",
"VIN(Vehicle Identification Number)": "A unique code to identify individual vehicles.",
"drive": "The drivetrain type (e.g., front-wheel drive, rear-wheel drive, all-wheel drive).",
"size": "The size category of the item (e.g., small, medium, large).",
"type": "The type or category of the item (e.g., car, truck, bike, or other product types).",
"paint_color": "The color of the item (e.g., for vehicles or items with customizable color options).",
"image_url": "A URL to the image of the item or listing.",
"state": "The state in which the listing or item is located.",
"posting_date": "The date on which the item was listed or posted.",
"latitude": "The latitude coordinate of the item's location.",
"longitude": "The longitude coordinate of the item's location."
}
for col, desc in variable_descriptions.items():
print(f"* **{col}**: {desc}")
### Variables Description ### * **id**: A unique identifier for each listing or item. * **price**: The price of the item or listing. It might be in a specific currency (e.g., INR, USD). * **year**: The manufacturing year of the item (possibly for vehicles or other dated products). * **manufacturer**: The name of the manufacturer or brand of the item. * **model**: The specific model of the product. * **condition**: The condition of the item (e.g., new, used, refurbished). * **cylinders**: The number of cylinders (likely relevant if the dataset is related to vehicles). * **fuel**: The type of fuel the item uses (e.g., petrol, diesel, electric). * **odometer**: The mileage or distance traveled by the item (if vehicles are involved). * **title_status**: The legal or ownership status of the item (e.g., clean, salvage, rebuilt). * **VIN(Vehicle Identification Number)**: A unique code to identify individual vehicles. * **drive**: The drivetrain type (e.g., front-wheel drive, rear-wheel drive, all-wheel drive). * **size**: The size category of the item (e.g., small, medium, large). * **type**: The type or category of the item (e.g., car, truck, bike, or other product types). * **paint_color**: The color of the item (e.g., for vehicles or items with customizable color options). * **image_url**: A URL to the image of the item or listing. * **state**: The state in which the listing or item is located. * **posting_date**: The date on which the item was listed or posted. * **latitude**: The latitude coordinate of the item's location. * **longitude**: The longitude coordinate of the item's location.
Unique Values for each variable.¶
# Unique Values for Each Variable
print("\n### Unique Values for Each Variable ###")
for column in data.columns.tolist():
print(f"No. of unique values in {column}: {data[column].nunique()}.")
### Unique Values for Each Variable ### No. of unique values in id: 426880. No. of unique values in price: 6253. No. of unique values in year: 114. No. of unique values in manufacturer: 43. No. of unique values in model: 29668. No. of unique values in condition: 7. No. of unique values in cylinders: 8. No. of unique values in fuel: 6. No. of unique values in odometer: 104870. No. of unique values in title_status: 7. No. of unique values in transmission: 4. No. of unique values in VIN: 118265. No. of unique values in drive: 4. No. of unique values in size: 5. No. of unique values in type: 14. No. of unique values in paint_color: 13. No. of unique values in image_url: 241900. No. of unique values in state: 35. No. of unique values in posting_date: 333355. No. of unique values in latitude: 191354. No. of unique values in longitude: 162536.
#import pandas as pd
#from geopy.geocoders import Nominatim
# Load your data (replace 'your_data.csv' with your actual file)
#data = pd.read_csv(output_file)
# Function to get the city
#def get_city(row):
#geolocator = Nominatim(user_agent="my_geocoder")
#location = geolocator.reverse(str(row['latitude']) + ", " + str(row['longitude']))
#return location.address.split(',')[0] if location else None
# Apply the function and create a new column
#data['city'] = data.apply(get_city, axis=1)
# Save the modified DataFrame
#data.to_csv('output.csv', index=False)
# import math
# cities_df = pd.read_csv('India Cities LatLng.csv')
# def haversine(lat1, lon1, lat2, lon2):
# R = 6371 # Radius of Earth in kilometers
# phi1 = math.radians(lat1)
# phi2 = math.radians(lat2)
# delta_phi = math.radians(lat2 - lat1)
# delta_lambda = math.radians(lon2 - lon1)
# a = math.sin(delta_phi / 2)*2 + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2)*2
# c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
# return R * c # Distance in kilometers
# def get_closest_city(row):
# min_distance = float('inf')
# closest_city = None
# for _, city_row in cities_df.iterrows():
# distance = haversine(row['latitude'], row['longitude'], city_row['lat'], city_row['lng'])
# if distance < min_distance:
# min_distance = distance
# closest_city = city_row['city']
# return closest_city
# df = pd.DataFrame(data)
# # Apply the function to get the closest city for each row in data
# df['city'] = df.apply(get_closest_city, axis=1)
3. Data Wrangling¶
Data Wrangling Code¶
# Copying the dataset for analysis
data = data.copy()
# Checking basic stats
print("Dataset Shape:", data.shape)
print("Dataset Columns:", data.columns)
Dataset Shape: (426880, 21) Dataset Columns: Index(['id', 'price', 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url', 'state', 'posting_date', 'latitude', 'longitude'], dtype='object')
!pip install geopy
Requirement already satisfied: geopy in /usr/local/lib/python3.11/dist-packages (2.4.1) Requirement already satisfied: geographiclib<3,>=1.52 in /usr/local/lib/python3.11/dist-packages (from geopy) (2.0)
# #from geopy.geocoders import Nominatim
# from geopy.exc import GeocoderTimedOut
# def get_city_from_coordinates(lat, lon):
# try:
# geolocator = Nominatim(user_agent="geo_locator")
# location = geolocator.reverse((lat, lon), exactly_one=True)
# if location and 'address' in location.raw:
# address = location.raw['address']
# city = address.get('city') or address.get('town') or address.get('village')
# state = address.get('state')
# country = address.get('country')
# return city, state, country
# else:
# return None, None, None
# except GeocoderTimedOut:
# print("Geocoder service timed out. Try again.")
# return None, None, None
# # Example usage
# latitude = 28.6139 # Example: Latitude of New Delhi
# longitude = 77.2090 # Example: Longitude of New Delhi
# city, state, country = get_city_from_coordinates(latitude, longitude)
# if city:
# print(f"City: {city}, State: {state}, Country: {country}")
# else:
# print("City not found.")
- By using latitude and longitude values we can see city , state and country.
OutLiers¶
- Define Outliers (IQR Method)
def detect_outliers_iqr(data, columns):
outliers = {}
for col in columns:
Q1 = data[col].quantile(0.25)
Q3 = data[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Find outliers
outliers[col] = data[(data[col] < lower_bound) | (data[col] > upper_bound)][col]
return outliers
# Apply to numeric columns
numeric_columns = ['price', 'year', 'odometer']
outliers = detect_outliers_iqr(data, numeric_columns)
for col, vals in outliers.items():
print(f"Outliers in {col}:")
print(vals)
Outliers in price: 204 7.573780e+06 280 1.268338e+11 346 1.270072e+07 365 9.361800e+06 461 1.283604e+07 ... 426694 7.692300e+06 426716 8.469800e+06 426778 8.474760e+06 426810 7.575790e+06 426840 8.475680e+06 Name: price, Length: 8177, dtype: float64 Outliers in year: 37 1992.0 101 1968.0 137 1966.0 208 1976.0 210 1969.0 ... 426729 1951.0 426762 1982.0 426809 1990.0 426831 1985.0 426838 1920.0 Name: year, Length: 15896, dtype: float64 Outliers in odometer: 138 281000.0 280 999999.0 411 279000.0 455 9999999.0 484 283356.0 ... 426508 284000.0 426590 291988.0 426643 287000.0 426671 291137.0 426690 295000.0 Name: odometer, Length: 4385, dtype: float64
- Define Outliers (Standard Deviation Method)
def detect_outliers_std(data, columns):
outliers = {}
for col in columns:
mean = data[col].mean()
std = data[col].std()
lower_bound = mean - 3 * std
upper_bound = mean + 3 * std
# Find outliers
outliers[col] = data[(data[col] < lower_bound) | (data[col] > upper_bound)][col]
return outliers
# Apply to numeric columns
numeric_columns = ['price', 'year', 'odometer']
outliers_std = detect_outliers_std(data, numeric_columns)
for col, vals in outliers_std.items():
print(f"Outliers in {col}:")
print(vals)
Outliers in price: 280 1.268338e+11 1838 1.284192e+10 29386 1.426880e+11 37409 1.426880e+11 37410 3.864838e+11 91576 3.884607e+11 122470 1.268338e+11 137807 1.585422e+10 153082 1.585422e+11 184704 1.810795e+11 193736 1.585422e+10 207080 1.585422e+10 230753 1.733773e+10 241404 1.457003e+10 257840 3.884607e+11 288356 1.585422e+10 307488 1.585422e+10 318592 4.798934e+11 356716 4.798934e+11 358555 1.585422e+10 Name: price, dtype: float64 Outliers in year: 101 1968.0 137 1966.0 208 1976.0 210 1969.0 228 1978.0 ... 426575 1970.0 426593 1971.0 426729 1951.0 426762 1982.0 426838 1920.0 Name: year, Length: 9116, dtype: float64 Outliers in odometer: 280 999999.0 455 9999999.0 562 2222222.0 615 1111111.0 616 1111111.0 ... 423025 840000.0 423091 1111111.0 424165 900132.0 426106 9999999.0 426108 999999.0 Name: odometer, Length: 1089, dtype: float64
- Visualize Outliers (Boxplot)
import matplotlib.pyplot as plt
for col in ['price', 'year', 'odometer']:
plt.figure(figsize=(8, 4))
plt.boxplot(data[col].dropna(), vert=False, patch_artist=True)
plt.title(f"Boxplot of {col}")
plt.xlabel(col)
plt.show()
- Treat or Handle Outliers
def remove_outliers_iqr(data, columns):
for col in columns:
Q1 = data[col].quantile(0.25)
Q3 = data[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
data = data[(data[col] >= lower_bound) & (data[col] <= upper_bound)]
return data
data_cleaned = remove_outliers_iqr(data, numeric_columns)
Demand for pre-owned cars by region¶
# Bar Chart: Listings by State
state_counts = data['state'].value_counts()
plt.figure(figsize=(10, 6))
state_counts.plot(kind='bar', color='skyblue')
plt.title('Count of Listings by State', fontsize=16)
plt.xlabel('State', fontsize=6)
plt.ylabel('Number of Listings', fontsize=14)
plt.xticks(rotation=45, fontsize=6)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Bar Chart:
- Here it shows uttarpradesh is showing the highest counts in state.
- The states with the highest and lowest counts are evident, enabling identification of demand hotspots.
Car Attributes Analysis¶
# Set general style for plots
sns.set_style("whitegrid")
# Step 1: Distribution
# Histograms
numeric_columns = ['price', 'year', 'odometer']
plt.figure(figsize=(15, 5))
for i, col in enumerate(numeric_columns, 1):
plt.subplot(1, 3, i)
plt.hist(data[col], bins=10, color='skyblue', edgecolor='black')
plt.title(f'Distribution of {col.capitalize()}', fontsize=14)
plt.xlabel(col.capitalize(), fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.tight_layout()
plt.show()
# Pie Charts
categorical_columns = ['fuel', 'condition', 'paint_color', 'drive']
plt.figure(figsize=(14, 8))
for i, col in enumerate(categorical_columns, 1):
plt.subplot(2, 2, i)
# Get top 5 categories, grouping the rest as "Other"
top_categories = data[col].value_counts().nlargest(5)
other_count = data[col].value_counts().sum() - top_categories.sum()
# Append "Other" category if necessary
if other_count > 0:
top_categories['Other'] = other_count
# Plot the filtered pie chart
top_categories.plot.pie(
autopct='%1.1f%%',
colors=sns.color_palette("pastel"),
startangle=90,
wedgeprops={'edgecolor': 'black'}
)
plt.title(f'Distribution of {col.capitalize()} (Top 5)', fontsize=14)
plt.ylabel('')
plt.tight_layout()
plt.show()
# Step 2: Relationships
# Scatterplots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(data=data, x='odometer', y='price', color='teal')
plt.title('Price vs. Odometer', fontsize=14)
plt.xlabel('Odometer', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.subplot(1, 2, 2)
sns.scatterplot(data=data, x='year', y='price', color='orange')
plt.title('Price vs. Year', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.tight_layout()
plt.show()
# Boxplots
categorical_columns = ['condition', 'fuel', 'drive']
plt.figure(figsize=(20, 5))
for i, col in enumerate(categorical_columns, 1):
plt.subplot(1, 3, i)
sns.boxplot(data=data, x=col, y='price', palette='Set2')
plt.title(f'Price by {col.capitalize()}', fontsize=14)
plt.xlabel(col.capitalize(), fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.tight_layout()
plt.show()
<ipython-input-27-647f5fa808f8>:70: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.boxplot(data=data, x=col, y='price', palette='Set2') <ipython-input-27-647f5fa808f8>:70: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.boxplot(data=data, x=col, y='price', palette='Set2') <ipython-input-27-647f5fa808f8>:70: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.boxplot(data=data, x=col, y='price', palette='Set2')
Histograms:¶
- Visualize the distribution of numerical attributes like price, year, and odometer.
- Identify central tendencies, skewness, and outliers.
Pie Charts:¶
- Show the proportions of categorical attributes (fuel, condition, etc.).
- Highlight dominant categories, such as the most common fuel type or paint color.
Scatterplots:¶
- Price vs. Odometer: Observe the trend of car prices decreasing as mileage increases.
- Price vs. Year: Examine how car prices vary with the year of manufacture.
Boxplots:¶
- Group price by categories like condition, fuel, and drive.
- Highlight variations in price across these groups and detect outliers.
Manufacturer and Model Popularity¶
- Top 10 Maufacturers by count.
# General style
sns.set_style("whitegrid")
# Step 1: Top 10 Manufacturers by Count
manufacturer_counts = data['manufacturer'].value_counts().head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x=manufacturer_counts.index, y=manufacturer_counts.values, palette='viridis')
plt.title('Top 10 Manufacturers by Count', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
<ipython-input-28-8c4aee0bdc88>:7: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=manufacturer_counts.index, y=manufacturer_counts.values, palette='viridis')
- Here bar chart shows the top 10 manufacturers with the most listings.
- And ford shows the highest in top 10.
- Top 10 Models by Count
model_counts = data['model'].value_counts().head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x=model_counts.index, y=model_counts.values, palette='coolwarm')
plt.title('Top 10 Models by Count', fontsize=14)
plt.xlabel('Model', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
<ipython-input-29-3f2e3d584f3b>:3: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=model_counts.index, y=model_counts.values, palette='coolwarm')
- Here we show most most popular models . Here f-150 is the most popular and top 10 model.
- Manufacturers Segmented by Condition
condition_segment = data.groupby(['manufacturer', 'condition']).size().unstack(fill_value=0)
condition_segment = condition_segment.loc[manufacturer_counts.index] # Align manufacturers with top 10
condition_segment.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title('Manufacturers Segmented by Condition', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Condition', fontsize=10)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
- A stacked bar chart groups the top manufacturers by vehicle condition (e.g., excellent, good, fair). Reveals which manufacturers offer the most "like new" or "fair" condition cars.
- Manufacturers Segmented by Fuel
fuel_segment = data.groupby(['manufacturer', 'fuel']).size().unstack(fill_value=0)
fuel_segment = fuel_segment.loc[manufacturer_counts.index] # Align manufacturers with top 10
fuel_segment.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='coolwarm')
plt.title('Manufacturers Segmented by Fuel', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Fuel', fontsize=10)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
- A stacked bar chart groups the top manufacturers by fuel type (e.g., gas, diesel, electric). Shows which brands lean towards specific fuel types, such as Toyota offering more electric cars.
Pricing Analysis¶
# General style
sns.set_style("whitegrid")
top_states = data['state'].value_counts().index[:10]
filtered_data = data[data['state'].isin(top_states)]
plt.figure(figsize=(12, 6))
sns.violinplot(data=filtered_data, x='state', y='price', palette='muted', inner='quartile')
plt.title('Distribution of Car Prices by Top 10 States', fontsize=14)
plt.xlabel('State', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# Step 2: Violin Plot of Price Grouped by Condition
plt.figure(figsize=(12, 6))
sns.violinplot(data=data, x='condition', y='price', palette='Set2')
plt.title('Distribution of Prices by Condition', fontsize=14)
plt.xlabel('Condition', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.tight_layout()
plt.show()
# Step 3: Line Chart Showing Average Price Trends Over Year
# Group data by year and calculate average price
avg_price_by_year = data.groupby('year')['price'].mean().reset_index()
plt.figure(figsize=(10, 5))
sns.lineplot(data=avg_price_by_year, x='year', y='price', marker='o', color='b')
plt.title('Average Price Trends Over Years', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Price', fontsize=12)
plt.tight_layout()
plt.show()
<ipython-input-32-5dc789e1b852>:8: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.violinplot(data=filtered_data, x='state', y='price', palette='muted', inner='quartile')
<ipython-input-32-5dc789e1b852>:21: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.violinplot(data=data, x='condition', y='price', palette='Set2')
- Violin Plot of Price Grouped by State:
- It provides more details on the distribution shape, showing multimodal data (multiple peaks) if present.
- It gives a clearer visual of pricing trends, especially in cases where the data is not normally distributed.
- Violin Plot of Price Grouped by Condition:
- Analyze how vehicle prices vary based on their condition.
- "Like new" vehicles are expected to have higher median prices and narrower distributions.
- "Fair" vehicles may have a wider spread, reflecting variability in pricing.
- Line Chart of Average Price Trends Over Years:
- Use a line chart to show year-over-year price changes.
- Identify whether prices are increasing, decreasing, or fluctuating. Spot anomalies (e.g., sudden spikes or drops).
Regional Postings Distribution¶
# Step 1: Grouping the data by 'state' to count the number of postings per region
region_postings = data['state'].value_counts().reset_index()
region_postings.columns = ['Region', 'Postings']
# Step 2: Handling missing values (if any), for now we remove rows where state is NaN
region_postings = region_postings.dropna()
# Step 3: Sorting the data by 'Postings' in descending order for better visualization
region_postings = region_postings.sort_values(by='Postings', ascending=False)
# Step 4: Plotting the bar chart
plt.figure(figsize=(12, 8))
sns.barplot(x='Postings', y='Region', data=region_postings, palette='viridis')
plt.title('Car Postings Distribution by Region', fontsize=16)
plt.xlabel('Number of Postings', fontsize=12)
plt.ylabel('Region', fontsize=12)
plt.show()
<ipython-input-33-27050b1abf8c>:13: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x='Postings', y='Region', data=region_postings, palette='viridis')
- The bar chart will display the regions (states) on the y-axis and the number of car postings on the x-axis. The regions with the most car postings will be on the top, and regions with fewer postings will appear lower.
Regional Price Comparison: Analyzing Average Prices Across Different Regions¶
# Step 1: Calculate the average price for each model across regions
avg_price_per_region = data.groupby(['state', 'model'])['price'].mean().reset_index()
# Step 2: Create a pivot table for heatmap visualization
price_pivot = avg_price_per_region.pivot(index='model', columns='state', values='price')
# Step 3: Plotting heatmap for regional price comparison
plt.figure(figsize=(8, 4))
sns.heatmap(price_pivot, cmap='coolwarm', annot=True, fmt='.0f', linewidths=0.5)
plt.title('Average Car Price Comparison Across state')
plt.xlabel('State')
plt.ylabel('Car Model')
plt.xticks(rotation=30, ha='right')
plt.show()
# Bar chart for selected car models (e.g., top 5 most common models)
top_models = data['model'].value_counts().nlargest(5).index
filtered_data = data[data['model'].isin(top_models)]
# Group data again for selected models
avg_price_filtered = filtered_data.groupby(['state', 'model'])['price'].mean().reset_index()
plt.figure(figsize=(14, 6))
sns.barplot(x='state', y='price', hue='model', data=avg_price_filtered, palette='viridis')
plt.title('Average Price of Top Car Models Across Regions')
plt.xlabel('State')
plt.ylabel('Average Price')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Car Model')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
Heat map Visualization:
- Indicates the average price—blue regions represent lower prices, while red regions represent higher prices.
Bar Chart Visualization:
- We took Top 5 models for easier visualization. From all the states
Hypothesis¶
Hypothesis-1 : Better condition for cars result in higher prices
# Step 1: Calculate the average price for each condition
avg_price_condition = data.groupby('condition')['price'].mean().reset_index()
# Step 2: Sort values for better visualization
avg_price_condition = avg_price_condition.sort_values(by='price', ascending=False)
# Step 3: Bar Chart to visualize average price by condition
plt.figure(figsize=(10, 6))
sns.barplot(x='condition', y='price', data=avg_price_condition, palette='viridis')
plt.title('Average Car Price by Condition')
plt.xlabel('Condition')
plt.ylabel('Average Price')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
# Step 4: Box Plot to analyze price variability within each condition
plt.figure(figsize=(12, 6))
sns.boxplot(x='condition', y='price', data=data, palette='coolwarm')
plt.title('Price Distribution by Car Condition')
plt.xlabel('Condition')
plt.ylabel('Price')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
<ipython-input-35-ec038169399b>:9: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x='condition', y='price', data=avg_price_condition, palette='viridis')
<ipython-input-35-ec038169399b>:19: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.boxplot(x='condition', y='price', data=data, palette='coolwarm')
- Insights from the Analysis:
- Higher Prices for Better Conditions:Here Fair has the highest average price condtion.
- Recjbgcfxdxfcg nnhgdywfrdexzdxf
Hypothesis-2 : Correlation Between Price, Year, and Odometer
# Step 1: Calculate the correlation matrix
correlation = data[['price', 'year', 'odometer']].corr()
# Step 2: Print the correlation matrix
print("Correlation Matrix:\n", correlation)
# Step 3: Heatmap to visualize correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix: Price, Year, and Odometer')
plt.show()
# Step 4: Pairplot to visualize relationships between variables
sns.pairplot(data[['price', 'year', 'odometer']], diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pairplot of Price, Year, and Odometer', y=1.02)
plt.show()
# Step 5: Scatter plots for individual analysis
plt.figure(figsize=(12, 5))
# Price vs Year
plt.subplot(1, 2, 1)
sns.scatterplot(x='year', y='price', data=data, alpha=0.6, color='blue')
plt.title('Car Price vs Manufacturing Year')
plt.xlabel('Year')
plt.ylabel('Price')
# Price vs Odometer
plt.subplot(1, 2, 2)
sns.scatterplot(x='odometer', y='price', data=data, alpha=0.6, color='red')
plt.title('Car Price vs Odometer Reading')
plt.xlabel('Odometer')
plt.ylabel('Price')
plt.tight_layout()
plt.show()
Correlation Matrix: price year odometer price 1.000000 -0.004925 0.010032 year -0.004925 1.000000 -0.157215 odometer 0.010032 -0.157215 1.000000
Insights for the Analysis:
- A positive correlation indicates that newer cars have higher prices.
Recommendations:
- Analyze correlations by segmenting data by brand or car type.
Hypothesis-3 : Geographic Analysis
# Step 1: Calculate average car price by state and sort in descending order
avg_price_state = data.groupby('state')['price'].mean().sort_values(ascending=False)
# Step 2: Print the result
print("Average Car Price by State:\n", avg_price_state)
# Step 3: Visualize using a bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=avg_price_state.index, y=avg_price_state.values, palette='viridis')
plt.xticks(rotation=90)
plt.title('Average Car Price by State')
plt.xlabel('State')
plt.ylabel('Average Price (USD)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
# Step 4: Visualize using a heatmap (if geographic data available)
plt.figure(figsize=(8, 12))
sns.heatmap(avg_price_state.to_frame().T, cmap='coolwarm', annot=True, fmt='.0f')
plt.title('Heatmap of Average Car Prices by State')
plt.ylabel('Price')
plt.show()
Average Car Price by State: state Lakshadweep 1.665653e+08 Chandigarh 1.370311e+08 Goa 6.424109e+07 West Bengal 2.884930e+07 Haryana 2.001612e+07 Karnataka 1.600844e+07 Kerala 1.364536e+07 Uttar Pradesh 8.511310e+06 Dadra and Nagar Haveli and Daman and Diu 7.138190e+06 Maharashtra 5.056206e+06 Madhya Pradesh 3.556956e+06 Tamil Nadu 3.109092e+06 Delhi 2.860353e+06 Chhattisgarh 2.503276e+06 Gujarat 2.369305e+06 Punjab 2.348216e+06 Rajasthan 2.346407e+06 Assam 2.314903e+06 Jharkhand 2.306235e+06 Andaman and Nicobar Islands 2.268856e+06 Sikkim 2.268851e+06 Manipur 2.267773e+06 Puducherry 2.257328e+06 Arunachal Pradesh 2.250119e+06 Odisha 2.247732e+06 Jammu and Kashmir 2.240744e+06 Tripura 2.238199e+06 Telangana 2.237392e+06 Uttarakhand 2.237066e+06 Andhra Pradesh 2.231818e+06 Bihar 2.225778e+06 Ladakh 2.218344e+06 Meghalaya 2.216664e+06 Himachal Pradesh 2.197821e+06 Mizoram 2.188526e+06 Name: price, dtype: float64
<ipython-input-37-4b324341e25b>:9: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=avg_price_state.index, y=avg_price_state.values, palette='viridis')
- Insights from Analysis:
- States with a higher average price could indicate a preference for premium or newer cars.
- Recommendations:
- Check if prices in certain states are rising or falling over months/years.
Assessing the Demand for Pre-Owned Cars in Each Region¶
# Number of Listings by Region (Supply Indicator)
listings_by_region = data['state'].value_counts()
print("Number of Listings by Region:\n", listings_by_region)
# Visualization of listings by region
plt.figure(figsize=(12, 6))
sns.barplot(x=listings_by_region.index, y=listings_by_region.values, palette='viridis')
plt.title('Number of Car Listings by Region')
plt.xlabel('Region')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()
Number of Listings by Region: state Uttar Pradesh 45573 Maharashtra 45263 Karnataka 30492 Punjab 30455 Delhi 30379 Haryana 30175 Tamil Nadu 18124 West Bengal 18052 Telangana 15356 Kerala 15304 Madhya Pradesh 15205 Rajasthan 15131 Andhra Pradesh 15070 Gujarat 14992 Odisha 12189 Jharkhand 6080 Bihar 6034 Chhattisgarh 6003 Assam 5987 Uttarakhand 5951 Andaman and Nicobar Islands 3079 Sikkim 3065 Ladakh 3062 Puducherry 3046 Jammu and Kashmir 3030 Meghalaya 3017 Himachal Pradesh 3016 Mizoram 3010 Manipur 2990 Chandigarh 2986 Arunachal Pradesh 2980 Dadra and Nagar Haveli and Daman and Diu 2980 Tripura 2961 Goa 2923 Lakshadweep 2920 Name: count, dtype: int64
<ipython-input-38-7fb1fec7a708>:7: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=listings_by_region.index, y=listings_by_region.values, palette='viridis')
- In this region place i have given state
- Bar chart represents counts the number of car postings in each region.
Identifying Demand for Specific Car Models and Their Key Attributes¶
# Analysis 1: Number of Listings by Fuel Type
fuel_listings = filtered_data['fuel'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=fuel_listings.index, y=fuel_listings.values, palette='coolwarm')
plt.title('Number of Listings by Fuel Type')
plt.xlabel('Fuel Type')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()
# Analysis 2:Number of Clean Cars by Fuel Type
clean_fuel = filtered_data[filtered_data['title_status'] == 'clean']['fuel'].value_counts().sort_values(ascending=False)
if not clean_fuel.empty:
plt.figure(figsize=(12, 6))
sns.barplot(x=clean_fuel.index, y=clean_fuel.values, palette='Set2')
plt.title('Number of Clean Cars by Fuel Type')
plt.xlabel('Fuel Type')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()
else:
print("No clean car data available for Fuel Type analysis.")
# Analysis 3: Number of Listings by Condition
condition_listings = filtered_data['condition'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=condition_listings.index, y=condition_listings.values, palette='Blues')
plt.title('Number of Listings by Condition')
plt.xlabel('Condition')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()
# Analysis 4: Number of Clean Cars by Condition
clean_condition = filtered_data[filtered_data['title_status'] == 'clean']['condition'].value_counts().sort_values(ascending=False)
if not clean_condition.empty:
plt.figure(figsize=(12, 6))
sns.barplot(x=clean_condition.index, y=clean_condition.values, palette='Oranges')
plt.title('Number of Clean Cars by Condition')
plt.xlabel('Condition')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()
else:
print("No clean car data available for Condition analysis.")
<ipython-input-39-e98a20952f3d>:4: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=fuel_listings.index, y=fuel_listings.values, palette='coolwarm')
<ipython-input-39-e98a20952f3d>:15: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=clean_fuel.index, y=clean_fuel.values, palette='Set2')
<ipython-input-39-e98a20952f3d>:27: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=condition_listings.index, y=condition_listings.values, palette='Blues')
<ipython-input-39-e98a20952f3d>:38: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=clean_condition.index, y=clean_condition.values, palette='Oranges')
Optimal Pricing Strategy for Cars¶
# Inspect the data
print(data.head())
# 1. Calculate the average and median price for each car model
price_stats = data.groupby('model')['price'].agg(['mean', 'median', 'count']).reset_index()
price_stats.columns = ['model', 'average_price', 'median_price', 'listing_count']
# 2. Identify outliers using Interquartile Range (IQR) for price
Q1 = data['price'].quantile(0.25)
Q3 = data['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter out outliers
filtered_data = data[(data['price'] >= lower_bound) & (data['price'] <= upper_bound)]
# 3. Competitive analysis based on demand and supply
sold_count = data[data['title_status'] == 'sold'].groupby('model')['price'].count().reset_index()
sold_count.columns = ['model', 'sold_count']
# Merge sold count with price stats
optimal_pricing_data = pd.merge(price_stats, sold_count, on='model', how='left').fillna(0)
# 4. Analyzing condition impact on price
condition_price = data.groupby(['model', 'condition'])['price'].mean().unstack(fill_value=0)
# 5. Analyzing mileage impact on price
plt.figure(figsize=(12, 6))
sns.scatterplot(x='odometer', y='price', hue='condition', data=data, palette='coolwarm')
plt.title('Mileage vs Price Distribution')
plt.xlabel('Odometer (Mileage)')
plt.ylabel('Price')
plt.legend(title='Condition')
plt.show()
# 6. Adjusting pricing based on market trends
optimal_pricing_data['recommended_price'] = np.where(
optimal_pricing_data['sold_count'] > 0,
optimal_pricing_data['median_price'] * 1.05, # Increase price slightly if demand is high
optimal_pricing_data['median_price'] * 0.95 # Decrease price if demand is low
)
# 7. Display the recommended prices
print(optimal_pricing_data[['model', 'average_price', 'median_price', 'sold_count', 'recommended_price']])
# 8. Visualize pricing recommendations
plt.figure(figsize=(12, 6))
sns.barplot(x='model', y='recommended_price', data=optimal_pricing_data.sort_values(by='recommended_price', ascending=False).head(20), palette='viridis')
plt.title('Recommended Pricing for Top Car Models')
plt.xlabel('Car Model')
plt.ylabel('Recommended Price')
plt.xticks(rotation=90)
plt.show()
id price year manufacturer model condition cylinders fuel \ 0 7222695916 770530.0 NaN nan nan nan 0 nan 1 7218891961 1528210.0 NaN nan nan nan 0 nan 2 7221797935 2696820.0 NaN nan nan nan 0 nan 3 7222270760 192610.0 NaN nan nan nan 0 nan 4 7210384030 629240.0 NaN nan nan nan 0 nan odometer title_status ... VIN drive size type paint_color image_url \ 0 NaN nan ... nan nan nan nan nan nan 1 NaN nan ... nan nan nan nan nan nan 2 NaN nan ... nan nan nan nan nan nan 3 NaN nan ... nan nan nan nan nan nan 4 NaN nan ... nan nan nan nan nan nan state posting_date latitude longitude 0 Haryana NaT 28.8446 75.1167 1 Haryana NaT 27.9026 77.0382 2 West Bengal NaT 24.6838 85.9695 3 Andhra Pradesh NaT 15.5158 81.9342 4 West Bengal NaT 23.7821 87.4861 [5 rows x 21 columns]
model average_price \ 0 "t" 3467350.0 1 $362.47, $1000 down, oac, 2.9%apr $362.47,luxu... 3197630.0 2 % 128440.0 3 & altima 391100.0 4 '50 Business Coupe 1284190.0 ... ... ... 29663 ♦ALL TADES WELCOME!♦ 64180.0 29664 ♿ 1412630.0 29665 ♿ vmi 1412165.0 29666 𝓜𝓮𝓻𝓬𝓮𝓭𝓮𝓼 𝓫𝓮𝓷𝔃 𝓶𝓵 350 500890.0 29667 🔥GMC Sierra 1500 SLE🔥 4X4 🔥 1284190.0 median_price sold_count recommended_price 0 3467350.0 0.0 3293982.50 1 3197630.0 0.0 3037748.50 2 128440.0 0.0 122018.00 3 391100.0 0.0 371545.00 4 1284190.0 0.0 1219980.50 ... ... ... ... 29663 64180.0 0.0 60971.00 29664 1412630.0 0.0 1341998.50 29665 1412165.0 0.0 1341556.75 29666 500890.0 0.0 475845.50 29667 1284190.0 0.0 1219980.50 [29668 rows x 5 columns]
<ipython-input-40-0a1f70a6d962>:49: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x='model', y='recommended_price', data=optimal_pricing_data.sort_values(by='recommended_price', ascending=False).head(20), palette='viridis')
By following this approach, car dealers can set optimal prices for each model by:
- Ensuring competitiveness against market trends.
- Balancing demand with supply to adjust pricing dynamically.
- Accounting for key factors like condition and mileage.
To analyze the relationship between the posting_date and year of manufacture
# Convert posting_date to datetime
data['posting_date'] = pd.to_datetime(data['posting_date'])
# Extract the year from the posting date
data['posting_year'] = data['posting_date'].dt.year
# Calculate vehicle age at the time of posting
data['vehicle_age'] = data['posting_year'] - data['year']
# Display result
print(data[['year', 'posting_date', 'vehicle_age']])
# Conclusion
average_vehicle_age = data['vehicle_age'].mean()
print(f"On average, vehicles are {average_vehicle_age:.1f} years old at the time of posting.")
year posting_date vehicle_age 0 NaN NaT NaN 1 NaN NaT NaN 2 NaN NaT NaN 3 NaN NaT NaN 4 NaN NaT NaN ... ... ... ... 426875 2019.0 2021-04-04 09:21:31+00:00 2.0 426876 2020.0 2021-04-04 09:21:29+00:00 1.0 426877 2020.0 2021-04-04 09:21:17+00:00 1.0 426878 2018.0 2021-04-04 09:21:11+00:00 3.0 426879 2019.0 2021-04-04 09:21:07+00:00 2.0 [426880 rows x 3 columns] On average, vehicles are 9.8 years old at the time of posting.
- This is to show the relationship between posting_date and year of manufacture.
# Convert posting_date to datetime
data['posting_date'] = pd.to_datetime(data['posting_date'])
# Extract the year from the posting date
data['posting_year'] = data['posting_date'] .dt.year
# Calculate vehicle age at the time of posting
data['vehicle_age'] = data['posting_year'] - data['year']
# Plot a histogram to visualize vehicle age distribution
plt.figure(figsize=(8, 5))
sns.histplot(data['vehicle_age'], bins=10, kde=True, color='skyblue')
plt.title('Distribution of Vehicle Age at Posting')
plt.xlabel('Vehicle Age (years)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
# Plot a bar chart to see the average vehicle age by posting year
plt.figure(figsize=(8, 5))
sns.barplot(x=data['posting_year'], y=data['vehicle_age'], palette='viridis')
plt.title('Average Vehicle Age by Posting Year')
plt.xlabel('Posting Year')
plt.ylabel('Average Vehicle Age')
plt.show()
# Display box plot for age distribution
plt.figure(figsize=(8, 5))
sns.boxplot(x=data['vehicle_age'], color='lightcoral')
plt.title('Boxplot of Vehicle Age')
plt.xlabel('Vehicle Age (years)')
plt.grid(True)
plt.show()
<ipython-input-42-a73b8b06c96f>:21: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=data['posting_year'], y=data['vehicle_age'], palette='viridis')
To identify which car brands have a high sale value
# Group by manufacturer and calculate total sales value
brand_sales = data.groupby('manufacturer')['price'].sum().reset_index()
# Sort by sales value in descending order
brand_sales = brand_sales.sort_values(by='price', ascending=False)
# Display result
print(brand_sales)
# Identify the brand with the highest sales value
highest_sale_brand = brand_sales.iloc[0]
print(f"The brand with the highest sales value is {highest_sale_brand['manufacturer']} with a total sales value of {highest_sale_brand['price']}.")
manufacturer price 40 toyota 1.029069e+12 7 chevrolet 8.179780e+11 26 mercedes-benz 8.068871e+11 20 jeep 3.680175e+11 13 ford 3.319232e+11 42 volvo 1.662761e+11 31 nan 7.596019e+10 14 gmc 6.554087e+10 35 ram 6.531354e+10 4 bmw 5.047692e+10 32 nissan 4.610003e+10 16 honda 2.936577e+10 10 dodge 2.640941e+10 5 buick 2.598642e+10 3 audi 2.292789e+10 23 lexus 2.022540e+10 6 cadillac 1.735790e+10 38 subaru 1.591916e+10 0 acura 1.523405e+10 41 volkswagen 1.504611e+10 17 hyundai 1.423159e+10 18 infiniti 1.206377e+10 21 kia 1.203714e+10 24 lincoln 1.073178e+10 25 mazda 8.604323e+09 8 chrysler 8.087946e+09 36 rover 7.376169e+09 19 jaguar 6.635093e+09 29 mitsubishi 5.810281e+09 34 porsche 5.677958e+09 28 mini 4.321809e+09 39 tesla 4.275460e+09 1 alfa-romeo 3.252941e+09 33 pontiac 2.402481e+09 11 ferrari 1.310728e+09 12 fiat 1.209434e+09 37 saturn 9.632394e+08 27 mercury 8.335567e+08 15 harley-davidson 2.383198e+08 2 aston-martin 1.648740e+08 9 datsun 1.225653e+08 22 land rover 2.133458e+07 30 morgan 5.046890e+06 The brand with the highest sales value is toyota with a total sales value of 1029068765610.0.
- The brand with the highest sales value is toyota with a total sales value of 1029068765610.0.
# Group by manufacturer and calculate total sales value
brand_sales = data.groupby('manufacturer')['price'].sum().reset_index()
# Sort by sales value in descending order
brand_sales = brand_sales.sort_values(by='price', ascending=False)
# Display result
print(brand_sales)
# Identify the brand with the highest sales value
highest_sale_brand = brand_sales.iloc[0]
print(f"The brand with the highest sales value is {highest_sale_brand['manufacturer']} with a total sales value of {highest_sale_brand['price']}.")
# Plot a bar chart to visualize brand sales
plt.figure(figsize=(8, 5))
sns.barplot(x='manufacturer', y='price', data=brand_sales, palette='viridis')
plt.title('Total Sales Value by Manufacturer')
plt.xlabel('Manufacturer')
plt.ylabel('Total Sales Value')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--')
plt.show()
# Step 1: Aggregate sales by manufacturer (if not already done)
brand_sales = data.groupby('manufacturer')['price'].sum().reset_index()
# Step 2: Sort and select the top 5 manufacturers by sales
top_brands = brand_sales.sort_values('price', ascending=False).head(5)
# Step 3: Plot the pie chart for the top 5 brands
plt.figure(figsize=(8, 8))
plt.pie(top_brands['price'], labels=top_brands['manufacturer'], autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Sales Distribution by Top 5 Manufacturer')
plt.show()
manufacturer price 40 toyota 1.029069e+12 7 chevrolet 8.179780e+11 26 mercedes-benz 8.068871e+11 20 jeep 3.680175e+11 13 ford 3.319232e+11 42 volvo 1.662761e+11 31 nan 7.596019e+10 14 gmc 6.554087e+10 35 ram 6.531354e+10 4 bmw 5.047692e+10 32 nissan 4.610003e+10 16 honda 2.936577e+10 10 dodge 2.640941e+10 5 buick 2.598642e+10 3 audi 2.292789e+10 23 lexus 2.022540e+10 6 cadillac 1.735790e+10 38 subaru 1.591916e+10 0 acura 1.523405e+10 41 volkswagen 1.504611e+10 17 hyundai 1.423159e+10 18 infiniti 1.206377e+10 21 kia 1.203714e+10 24 lincoln 1.073178e+10 25 mazda 8.604323e+09 8 chrysler 8.087946e+09 36 rover 7.376169e+09 19 jaguar 6.635093e+09 29 mitsubishi 5.810281e+09 34 porsche 5.677958e+09 28 mini 4.321809e+09 39 tesla 4.275460e+09 1 alfa-romeo 3.252941e+09 33 pontiac 2.402481e+09 11 ferrari 1.310728e+09 12 fiat 1.209434e+09 37 saturn 9.632394e+08 27 mercury 8.335567e+08 15 harley-davidson 2.383198e+08 2 aston-martin 1.648740e+08 9 datsun 1.225653e+08 22 land rover 2.133458e+07 30 morgan 5.046890e+06 The brand with the highest sales value is toyota with a total sales value of 1029068765610.0.
<ipython-input-44-e0c4ba509063>:16: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x='manufacturer', y='price', data=brand_sales, palette='viridis')
- Here we can see what type of car purchase is happend and count of it.
To check if the same vehicle has been resold within a given time frame
# Define the time frame (e.g., 60 days)
time_frame = pd.Timedelta(days=60)
# Sort data by VIN and posting date
data= data.sort_values(by=['VIN', 'posting_date'])
# Check for resales within the time frame
data['resale_within_timeframe'] = data.duplicated(subset=['VIN'], keep=False) & (
data['posting_date'].diff().fillna(pd.Timedelta(days=9999)) <= time_frame
)
# Filter records that were resold within the given timeframe
resold_vehicles = data[data['resale_within_timeframe']]
# Display the result
print("Vehicles resold within the given timeframe:\n", resold_vehicles[['VIN', 'posting_date']])
Vehicles resold within the given timeframe: VIN posting_date 32544 00000 2021-04-28 13:44:51+00:00 29955 00000 2021-05-04 15:37:19+00:00 288401 00000000000000000 2021-04-05 16:26:23+00:00 288400 00000000000000000 2021-04-05 16:28:23+00:00 303644 00000000000000000 2021-04-05 16:34:11+00:00 ... ... ... 406984 nan 2021-05-05 03:56:25+00:00 406982 nan 2021-05-05 03:58:48+00:00 406981 nan 2021-05-05 03:59:22+00:00 422031 nan 2021-05-05 04:07:07+00:00 422030 nan 2021-05-05 04:24:09+00:00 [348828 rows x 2 columns]
- Here we can see which vehicle have been resold within time.
Recommendations for Spinny¶
- Competitive Pricing Recommendations:
- Increase Prices for High-Demand Models:Increase Prices for High-Demand Models
- Reduce Prices for Low-Demand Models:Consider reducing prices for models like "Ford Fiesta" or "Nissan Altima" if demand is low.
- Condition-Based Pricing Adjustments:
- Increase prices by 10-15% for cars in excellent condition to reflect their higher value perception.
- Mileage-Based Price Adjustments:
- Lower mileage cars tend to attract higher prices.
- Regional Market Adaptation:
- Pricing should be adjusted based on geographic trends where demand for specific models is higher.
- Fuel Type Preferences:
- Increase pricing for EVs and hybrids by 5-10% to reflect market preferences and future demand trends.
- Visual Pricing Insights:
- Use data visualizations to identify the sweet spot for pricing and communicate value propositions effectively to customers.
- Dynamic Pricing Implementation:
- Regular monitoring and updates of pricing based on seasonal trends, fuel prices, and competitor adjustments.
Conclusion: By integrating these strategic pricing recommendations, Spinny can position itself as a market leader in the pre-owned car segment, achieving higher sales volumes, improved customer satisfaction, and sustained revenue growth.¶
data.to_csv('spinny_data.csv')