import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical computations
import matplotlib.pyplot as plt  # For plotting and visualization
import seaborn as sns  # For advanced visualizations

# Step 1: Install gdown
!pip install gdown

# Step 2: Import necessary libraries
import gdown
import pandas as pd

# Step 3: Set the file ID and create a download URL
file_id = "1QQaRZizuq3TGhUqHyIc701ChT-GgM1yR"
download_url = f"https://drive.google.com/uc?id={file_id}"

# Step 4: Set the output file name
output_file = "spinny_dataset.csv"

# Step 5: Download the file
gdown.download(download_url, output_file, quiet=False)

# Step 6: Load the CSV file into a Pandas DataFrame
data = pd.read_csv(output_file)

Requirement already satisfied: gdown in /usr/local/lib/python3.11/dist-packages (5.2.0)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.11/dist-packages (from gdown) (4.12.3)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from gdown) (3.17.0)
Requirement already satisfied: requests[socks] in /usr/local/lib/python3.11/dist-packages (from gdown) (2.32.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from gdown) (4.67.1)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown) (2.6)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (2024.12.14)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown) (1.7.1)

Downloading...
From: https://drive.google.com/uc?id=1QQaRZizuq3TGhUqHyIc701ChT-GgM1yR
To: /content/spinny_dataset.csv
100%|██████████| 99.9M/99.9M [00:00<00:00, 166MB/s]

print("First 5 Rows of the Dataset:")
data.head(5)

First 5 Rows of the Dataset:

random_sample = data[data.notna().all(axis=1)].sample(n=10, random_state=42)  # Randomly select 10 rows with no missing values
print(random_sample)

                id      price    year manufacturer                   model  \
421994  7302010346  1399780.0  2014.0         ford         transit connect   
359017  7316155054        0.0  2017.0          kia               soul plus   
119389  7316241495  2786620.0  2010.0    chevrolet  silverado 1500 ltz 4x4   
4642    7309388743  1149880.0  2015.0   volkswagen    passat tsi wolfsburg   
381660  7313268847  1130140.0  2011.0        mazda                       3   
26098   7314924294  1540060.0  2014.0   volkswagen                   jetta   
278249  7316153678  2117980.0  2007.0         jeep                wrangler   
194984  7309831477   693500.0  2010.0       nissan                  sentra   
421770  7303638125  4108440.0  2017.0    chevrolet          silverado 1500   
308593  7313955519        0.0  2012.0        buick                lacrosse   

        condition    cylinders fuel  odometer title_status  ...  \
421994       good  4 cylinders  gas  143546.0        clean  ...   
359017   like new  4 cylinders  gas   63496.0        clean  ...   
119389  excellent  8 cylinders  gas  124248.0        clean  ...   
4642    excellent  4 cylinders  gas  100274.0        clean  ...   
381660       good  4 cylinders  gas   77312.0        clean  ...   
26098   excellent  4 cylinders  gas   63930.0        clean  ...   
278249       good  6 cylinders  gas  122236.0        clean  ...   
194984       good  4 cylinders  gas  104000.0        clean  ...   
421770  excellent  8 cylinders  gas  101870.0        clean  ...   
308593  excellent  6 cylinders  gas   98239.0        clean  ...   

                      VIN drive       size      type paint_color  \
421994  NM0LS7F75E1164637   fwd  full-size       van         red   
359017  KNDJP3A51H7457654   fwd   mid-size  mini-van        blue   
119389  3GCRKTE37AG118369   4wd  full-size    pickup         red   
4642    1VWAT7A31FC018187   fwd   mid-size     sedan       white   
381660  JM1BL1U59B1435270   fwd   mid-size     sedan        grey   
26098   3VW2K7AJ1EM439955   fwd   mid-size     sedan       black   
278249  1J4GA39137L155539   4wd  full-size       SUV        grey   
194984  3N1AB6AP9AL643919   fwd   mid-size     sedan        grey   
421770  3GCUKSEC8HG435107   4wd  full-size     truck       black   
308593  1G4GJ5E34CF175943   fwd   mid-size     sedan       white   

                                                image_url              state  \
421994  https://images.craigslist.org/00U0U_29yhnVuqee...      Uttar Pradesh   
359017  https://images.craigslist.org/00202_iYUv6Xesse...            Haryana   
119389  https://images.craigslist.org/00D0D_9lZxHlsHk2...      Uttar Pradesh   
4642    https://images.craigslist.org/00505_924hjCwSYz...             Ladakh   
381660  https://images.craigslist.org/01616_3dxG5z31YU...         Tamil Nadu   
26098   https://images.craigslist.org/00X0X_iN9WvC5LD5...  Jammu and Kashmir   
278249  https://images.craigslist.org/00D0D_3AXXo1TdE6...        Maharashtra   
194984  https://images.craigslist.org/00H0H_dVRfJUD77f...        West Bengal   
421770  https://images.craigslist.org/01212_9xi1bxfHaI...              Delhi   
308593  https://images.craigslist.org/00J0J_acZwS8hozH...          Rajasthan   

                     posting_date latitude  longitude  
421994  2021-04-05 13:42:27+00:00  26.7509    80.8497  
359017  2021-05-03 13:41:03+00:00  29.6172    75.6695  
119389  2021-05-03 16:06:49+00:00  28.2134    80.2300  
4642    2021-04-19 18:33:28+00:00  34.9460    75.3032  
381660  2021-04-27 13:22:15+00:00   9.3002    78.0696  
26098   2021-04-30 18:14:50+00:00  32.9228    74.9187  
278249  2021-05-03 13:38:08+00:00  17.3431    78.5934  
194984  2021-04-20 15:46:07+00:00  22.5864    86.7940  
421770  2021-04-08 15:05:30+00:00  28.5658    76.9453  
308593  2021-04-28 18:35:08+00:00  28.1765    75.0177  

[10 rows x 21 columns]

rows, columns = data.shape
print(f"\nThe dataset contains {rows} rows and {columns} columns.")

The dataset contains 426880 rows and 21 columns.

print("\nDataset Information:")
data.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 21 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   price         426880 non-null  float64
 2   year          425675 non-null  float64
 3   manufacturer  409234 non-null  object 
 4   model         421603 non-null  object 
 5   condition     252776 non-null  object 
 6   cylinders     249202 non-null  object 
 7   fuel          423867 non-null  object 
 8   odometer      422480 non-null  float64
 9   title_status  418638 non-null  object 
 10  transmission  424324 non-null  object 
 11  VIN           265838 non-null  object 
 12  drive         296313 non-null  object 
 13  size          120519 non-null  object 
 14  type          334022 non-null  object 
 15  paint_color   296677 non-null  object 
 16  image_url     426812 non-null  object 
 17  state         426880 non-null  object 
 18  posting_date  426812 non-null  object 
 19  latitude      426880 non-null  float64
 20  longitude     426880 non-null  float64
dtypes: float64(5), int64(1), object(15)
memory usage: 68.4+ MB

# Check the data type of the 'cylinders' column
print(data['cylinders'].dtype)

# If it's not an object (string) type, convert it first:
if data['cylinders'].dtype != object:
    data['cylinders'] = data['cylinders'].astype(str)

# Now, proceed with the extraction:
data['cylinders'] = data['cylinders'].str.extract('(\d+)').fillna(0).astype(int)

# Display the updated data types
print(data.dtypes)

object
id                int64
price           float64
year            float64
manufacturer     object
model            object
condition        object
cylinders         int64
fuel             object
odometer        float64
title_status     object
transmission     object
VIN              object
drive            object
size             object
type             object
paint_color      object
image_url        object
state            object
posting_date     object
latitude        float64
longitude       float64
dtype: object

# Convert specified columns to string type
columns_to_str = ['id', 'manufacturer', 'model', 'condition', 'fuel', 'title_status', 'transmission' ,
                  'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url',
                  'state']

# Apply conversion
data[columns_to_str] = data[columns_to_str].astype(str)

# Display the updated data types
print(data.dtypes)

id               object
price           float64
year            float64
manufacturer     object
model            object
condition        object
cylinders         int64
fuel             object
odometer        float64
title_status     object
transmission     object
VIN              object
drive            object
size             object
type             object
paint_color      object
image_url        object
state            object
posting_date     object
latitude        float64
longitude       float64
dtype: object

# Convert 'posting_date' to datetime
data['posting_date'] = pd.to_datetime(data['posting_date'], errors='coerce')

# Display the DataFrame
print(data)

                id      price    year manufacturer                     model  \
0       7222695916   770530.0     NaN          nan                       nan   
1       7218891961  1528210.0     NaN          nan                       nan   
2       7221797935  2696820.0     NaN          nan                       nan   
3       7222270760   192610.0     NaN          nan                       nan   
4       7210384030   629240.0     NaN          nan                       nan   
...            ...        ...     ...          ...                       ...   
426875  7301591192  3029710.0  2019.0       nissan         maxima s sedan 4d   
426876  7301591187  3928680.0  2020.0        volvo  s60 t5 momentum sedan 4d   
426877  7301591147  4493750.0  2020.0     cadillac          xt4 sport suv 4d   
426878  7301591140  3723220.0  2018.0        lexus           es 350 sedan 4d   
426879  7301591129  3928680.0  2019.0          bmw  4 series 430i gran coupe   

       condition  cylinders    fuel  odometer title_status  ...  \
0            nan          0     nan       NaN          nan  ...   
1            nan          0     nan       NaN          nan  ...   
2            nan          0     nan       NaN          nan  ...   
3            nan          0     nan       NaN          nan  ...   
4            nan          0     nan       NaN          nan  ...   
...          ...        ...     ...       ...          ...  ...   
426875      good          6     gas   32226.0        clean  ...   
426876      good          0     gas   12029.0        clean  ...   
426877      good          0  diesel    4174.0        clean  ...   
426878      good          6     gas   30112.0        clean  ...   
426879      good          0     gas   22716.0        clean  ...   

                      VIN drive size       type paint_color  \
0                     nan   nan  nan        nan         nan   
1                     nan   nan  nan        nan         nan   
2                     nan   nan  nan        nan         nan   
3                     nan   nan  nan        nan         nan   
4                     nan   nan  nan        nan         nan   
...                   ...   ...  ...        ...         ...   
426875  1N4AA6AV6KC367801   fwd  nan      sedan         nan   
426876  7JR102FKXLG042696   fwd  nan      sedan         red   
426877  1GYFZFR46LF088296   nan  nan  hatchback       white   
426878  58ABK1GG4JU103853   fwd  nan      sedan      silver   
426879  WBA4J1C58KBM14708   rwd  nan      coupe         nan   

                                                image_url           state  \
0                                                     nan         Haryana   
1                                                     nan         Haryana   
2                                                     nan     West Bengal   
3                                                     nan  Andhra Pradesh   
4                                                     nan     West Bengal   
...                                                   ...             ...   
426875  https://images.craigslist.org/00o0o_iiraFnHg8q...           Delhi   
426876  https://images.craigslist.org/00x0x_15sbgnxCIS...          Punjab   
426877  https://images.craigslist.org/00L0L_farM7bxnxR...  Madhya Pradesh   
426878  https://images.craigslist.org/00z0z_bKnIVGLkDT...     Maharashtra   
426879  https://images.craigslist.org/00Y0Y_lEUocjyRxa...       Jharkhand   

                    posting_date latitude  longitude  
0                            NaT  28.8446    75.1167  
1                            NaT  27.9026    77.0382  
2                            NaT  24.6838    85.9695  
3                            NaT  15.5158    81.9342  
4                            NaT  23.7821    87.4861  
...                          ...      ...        ...  
426875 2021-04-04 09:21:31+00:00  28.7876    76.8644  
426876 2021-04-04 09:21:29+00:00  32.0715    75.1915  
426877 2021-04-04 09:21:17+00:00  21.5646    75.7247  
426878 2021-04-04 09:21:11+00:00  21.4183    75.4719  
426879 2021-04-04 09:21:07+00:00  24.2616    83.7688  

[426880 rows x 21 columns]

duplicate_count = len(data[data.duplicated()])
print(f"Number of Duplicate Rows in the Dataset: {duplicate_count}")

Number of Duplicate Rows in the Dataset: 0

# Now you can proceed with the missing value check:
missing_values = data.isnull().sum()
print("\nMissing Values in Each Column:")
print(missing_values)

Missing Values in Each Column:
id                 0
price              0
year            1205
manufacturer       0
model              0
condition          0
cylinders          0
fuel               0
odometer        4400
title_status       0
transmission       0
VIN                0
drive              0
size               0
type               0
paint_color        0
image_url          0
state              0
posting_date      68
latitude           0
longitude          0
dtype: int64

missing_values = data.isnull().sum().sum()
print(missing_values)

5673

print("\nObservations About the Dataset:")
if duplicate_count > 0:
    print(f"- There are {duplicate_count} duplicate rows in the dataset.")
else:
    print("- No duplicate rows found in the dataset.")

if missing_values.sum() > 0:
    print("- There are missing values in the dataset. Here’s a summary:")
    print(missing_values[missing_values > 0])
else:
    print("- No missing values found in the dataset.")

print("- The dataset is ready for further analysis after handling duplicates and missing values.")

Observations About the Dataset:
- No duplicate rows found in the dataset.
- There are missing values in the dataset. Here’s a summary:
[5673]
- The dataset is ready for further analysis after handling duplicates and missing values.

# Dataset Columns
print("Dataset Columns:")
print(data.columns)

Dataset Columns:
Index(['id', 'price', 'year', 'manufacturer', 'model', 'condition',
       'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN',
       'drive', 'size', 'type', 'paint_color', 'image_url', 'state',
       'posting_date', 'latitude', 'longitude'],
      dtype='object')

# Dataset Describe
print("\nDataset Summary Statistics:")
print(data.describe(include='all'))

Dataset Summary Statistics:
                id         price           year manufacturer   model  \
count       426880  4.268800e+05  425675.000000       426880  426880   
unique      426880           NaN            NaN           43   29668   
top     7222695916           NaN            NaN         ford   f-150   
freq             1           NaN            NaN        70985    8009   
mean           NaN  9.657028e+06    2011.235191          NaN     NaN   
min            NaN  0.000000e+00    1900.000000          NaN     NaN   
25%            NaN  7.576800e+05    2008.000000          NaN     NaN   
50%            NaN  1.791970e+06    2013.000000          NaN     NaN   
75%            NaN  3.401080e+06    2017.000000          NaN     NaN   
max            NaN  4.798934e+11    2022.000000          NaN     NaN   
std            NaN  1.564439e+09       9.452120          NaN     NaN   

       condition      cylinders    fuel      odometer title_status  ...  \
count     426880  426880.000000  426880  4.224800e+05       426880  ...   
unique         7            NaN       6           NaN            7  ...   
top          nan            NaN     gas           NaN        clean  ...   
freq      174104            NaN  356209           NaN       405117  ...   
mean         NaN       3.466222     NaN  9.804333e+04          NaN  ...   
min          NaN       0.000000     NaN  0.000000e+00          NaN  ...   
25%          NaN       0.000000     NaN  3.770400e+04          NaN  ...   
50%          NaN       4.000000     NaN  8.554800e+04          NaN  ...   
75%          NaN       6.000000     NaN  1.335425e+05          NaN  ...   
max          NaN      12.000000     NaN  1.000000e+07          NaN  ...   
std          NaN       3.188464     NaN  2.138815e+05          NaN  ...   

           VIN   drive    size    type paint_color  \
count   426880  426880  426880  426880      426880   
unique  118265       4       5      14          13   
top        nan     4wd     nan     nan         nan   
freq    161042  131904  306361   92858      130203   
mean       NaN     NaN     NaN     NaN         NaN   
min        NaN     NaN     NaN     NaN         NaN   
25%        NaN     NaN     NaN     NaN         NaN   
50%        NaN     NaN     NaN     NaN         NaN   
75%        NaN     NaN     NaN     NaN         NaN   
max        NaN     NaN     NaN     NaN         NaN   
std        NaN     NaN     NaN     NaN         NaN   

                                                image_url          state  \
count                                              426880         426880   
unique                                             241900             35   
top     https://images.craigslist.org/00N0N_1xMPvfxRAI...  Uttar Pradesh   
freq                                                 7357          45573   
mean                                                  NaN            NaN   
min                                                   NaN            NaN   
25%                                                   NaN            NaN   
50%                                                   NaN            NaN   
75%                                                   NaN            NaN   
max                                                   NaN            NaN   
std                                                   NaN            NaN   

                               posting_date       latitude      longitude  
count                                426812  426880.000000  426880.000000  
unique                                  NaN            NaN            NaN  
top                                     NaN            NaN            NaN  
freq                                    NaN            NaN            NaN  
mean    2021-04-24 00:23:42.976467200+00:00      22.800888      79.012914  
min               2021-04-04 07:00:25+00:00       6.800300      68.400200  
25%     2021-04-17 10:47:46.750000128+00:00      17.796000      75.719100  
50%        2021-04-26 01:08:31.500000+00:00      24.058650      77.353900  
75%     2021-05-01 13:32:26.249999872+00:00      28.563525      80.927325  
max               2021-05-05 04:24:09+00:00      35.997600      97.398000  
std                                     NaN       6.532617       5.208436  

[11 rows x 21 columns]

# Variables Description
print("\n### Variables Description ###")
variable_descriptions = {
"id": "A unique identifier for each listing or item.",
"price": "The price of the item or listing. It might be in a specific currency (e.g., INR, USD).",
"year": "The manufacturing year of the item (possibly for vehicles or other dated products).",
"manufacturer": "The name of the manufacturer or brand of the item.",
"model": "The specific model of the product.",
"condition": "The condition of the item (e.g., new, used, refurbished).",
"cylinders": "The number of cylinders (likely relevant if the dataset is related to vehicles).",
"fuel": "The type of fuel the item uses (e.g., petrol, diesel, electric).",
"odometer": "The mileage or distance traveled by the item (if vehicles are involved).",
"title_status": "The legal or ownership status of the item (e.g., clean, salvage, rebuilt).",
"VIN(Vehicle Identification Number)": "A unique code to identify individual vehicles.",
"drive": "The drivetrain type (e.g., front-wheel drive, rear-wheel drive, all-wheel drive).",
"size": "The size category of the item (e.g., small, medium, large).",
"type": "The type or category of the item (e.g., car, truck, bike, or other product types).",
"paint_color": "The color of the item (e.g., for vehicles or items with customizable color options).",
"image_url": "A URL to the image of the item or listing.",
"state": "The state in which the listing or item is located.",
"posting_date": "The date on which the item was listed or posted.",
"latitude": "The latitude coordinate of the item's location.",
"longitude": "The longitude coordinate of the item's location."
}

for col, desc in variable_descriptions.items():
    print(f"* **{col}**: {desc}")

### Variables Description ###
* **id**: A unique identifier for each listing or item.
* **price**: The price of the item or listing. It might be in a specific currency (e.g., INR, USD).
* **year**: The manufacturing year of the item (possibly for vehicles or other dated products).
* **manufacturer**: The name of the manufacturer or brand of the item.
* **model**: The specific model of the product.
* **condition**: The condition of the item (e.g., new, used, refurbished).
* **cylinders**: The number of cylinders (likely relevant if the dataset is related to vehicles).
* **fuel**: The type of fuel the item uses (e.g., petrol, diesel, electric).
* **odometer**: The mileage or distance traveled by the item (if vehicles are involved).
* **title_status**: The legal or ownership status of the item (e.g., clean, salvage, rebuilt).
* **VIN(Vehicle Identification Number)**: A unique code to identify individual vehicles.
* **drive**: The drivetrain type (e.g., front-wheel drive, rear-wheel drive, all-wheel drive).
* **size**: The size category of the item (e.g., small, medium, large).
* **type**: The type or category of the item (e.g., car, truck, bike, or other product types).
* **paint_color**: The color of the item (e.g., for vehicles or items with customizable color options).
* **image_url**: A URL to the image of the item or listing.
* **state**: The state in which the listing or item is located.
* **posting_date**: The date on which the item was listed or posted.
* **latitude**: The latitude coordinate of the item's location.
* **longitude**: The longitude coordinate of the item's location.

# Unique Values for Each Variable
print("\n### Unique Values for Each Variable ###")
for column in data.columns.tolist():
    print(f"No. of unique values in {column}: {data[column].nunique()}.")

### Unique Values for Each Variable ###
No. of unique values in id: 426880.
No. of unique values in price: 6253.
No. of unique values in year: 114.
No. of unique values in manufacturer: 43.
No. of unique values in model: 29668.
No. of unique values in condition: 7.
No. of unique values in cylinders: 8.
No. of unique values in fuel: 6.
No. of unique values in odometer: 104870.
No. of unique values in title_status: 7.
No. of unique values in transmission: 4.
No. of unique values in VIN: 118265.
No. of unique values in drive: 4.
No. of unique values in size: 5.
No. of unique values in type: 14.
No. of unique values in paint_color: 13.
No. of unique values in image_url: 241900.
No. of unique values in state: 35.
No. of unique values in posting_date: 333355.
No. of unique values in latitude: 191354.
No. of unique values in longitude: 162536.

#import pandas as pd
#from geopy.geocoders import Nominatim

# Load your data (replace 'your_data.csv' with your actual file)
#data = pd.read_csv(output_file)

# Function to get the city
#def get_city(row):
    #geolocator = Nominatim(user_agent="my_geocoder")
    #location = geolocator.reverse(str(row['latitude']) + ", " + str(row['longitude']))
    #return location.address.split(',')[0] if location else None

# Apply the function and create a new column
#data['city'] = data.apply(get_city, axis=1)

# Save the modified DataFrame
#data.to_csv('output.csv', index=False)

# import math

# cities_df = pd.read_csv('India Cities LatLng.csv')
# def haversine(lat1, lon1, lat2, lon2):
#     R = 6371  # Radius of Earth in kilometers
#     phi1 = math.radians(lat1)
#     phi2 = math.radians(lat2)
#     delta_phi = math.radians(lat2 - lat1)
#     delta_lambda = math.radians(lon2 - lon1)

#     a = math.sin(delta_phi / 2)*2 + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2)*2
#     c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

#     return R * c  # Distance in kilometers

# def get_closest_city(row):
#     min_distance = float('inf')
#     closest_city = None

#     for _, city_row in cities_df.iterrows():
#         distance = haversine(row['latitude'], row['longitude'], city_row['lat'], city_row['lng'])
#         if distance < min_distance:
#             min_distance = distance
#             closest_city = city_row['city']

#     return closest_city


# df = pd.DataFrame(data)


# # Apply the function to get the closest city for each row in data
# df['city'] = df.apply(get_closest_city, axis=1)

# Copying the dataset for analysis
data = data.copy()

# Checking basic stats
print("Dataset Shape:", data.shape)
print("Dataset Columns:", data.columns)

Dataset Shape: (426880, 21)
Dataset Columns: Index(['id', 'price', 'year', 'manufacturer', 'model', 'condition',
       'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN',
       'drive', 'size', 'type', 'paint_color', 'image_url', 'state',
       'posting_date', 'latitude', 'longitude'],
      dtype='object')

!pip install geopy

Requirement already satisfied: geopy in /usr/local/lib/python3.11/dist-packages (2.4.1)
Requirement already satisfied: geographiclib<3,>=1.52 in /usr/local/lib/python3.11/dist-packages (from geopy) (2.0)

# #from geopy.geocoders import Nominatim
# from geopy.exc import GeocoderTimedOut

# def get_city_from_coordinates(lat, lon):
#     try:
#         geolocator = Nominatim(user_agent="geo_locator")
#         location = geolocator.reverse((lat, lon), exactly_one=True)
#         if location and 'address' in location.raw:
#             address = location.raw['address']
#             city = address.get('city') or address.get('town') or address.get('village')
#             state = address.get('state')
#             country = address.get('country')
#             return city, state, country
#         else:
#             return None, None, None
#     except GeocoderTimedOut:
#         print("Geocoder service timed out. Try again.")
#         return None, None, None

# # Example usage
# latitude = 28.6139   # Example: Latitude of New Delhi
# longitude = 77.2090  # Example: Longitude of New Delhi

# city, state, country = get_city_from_coordinates(latitude, longitude)

# if city:
#     print(f"City: {city}, State: {state}, Country: {country}")
# else:
#     print("City not found.")

def detect_outliers_iqr(data, columns):
    outliers = {}
    for col in columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Find outliers
        outliers[col] = data[(data[col] < lower_bound) | (data[col] > upper_bound)][col]
    return outliers

# Apply to numeric columns
numeric_columns = ['price', 'year', 'odometer']
outliers = detect_outliers_iqr(data, numeric_columns)
for col, vals in outliers.items():
    print(f"Outliers in {col}:")
    print(vals)

Outliers in price:
204       7.573780e+06
280       1.268338e+11
346       1.270072e+07
365       9.361800e+06
461       1.283604e+07
              ...     
426694    7.692300e+06
426716    8.469800e+06
426778    8.474760e+06
426810    7.575790e+06
426840    8.475680e+06
Name: price, Length: 8177, dtype: float64
Outliers in year:
37        1992.0
101       1968.0
137       1966.0
208       1976.0
210       1969.0
           ...  
426729    1951.0
426762    1982.0
426809    1990.0
426831    1985.0
426838    1920.0
Name: year, Length: 15896, dtype: float64
Outliers in odometer:
138        281000.0
280        999999.0
411        279000.0
455       9999999.0
484        283356.0
            ...    
426508     284000.0
426590     291988.0
426643     287000.0
426671     291137.0
426690     295000.0
Name: odometer, Length: 4385, dtype: float64

def detect_outliers_std(data, columns):
    outliers = {}
    for col in columns:
        mean = data[col].mean()
        std = data[col].std()
        lower_bound = mean - 3 * std
        upper_bound = mean + 3 * std

        # Find outliers
        outliers[col] = data[(data[col] < lower_bound) | (data[col] > upper_bound)][col]
    return outliers

# Apply to numeric columns
numeric_columns = ['price', 'year', 'odometer']
outliers_std = detect_outliers_std(data, numeric_columns)
for col, vals in outliers_std.items():
    print(f"Outliers in {col}:")
    print(vals)

Outliers in price:
280       1.268338e+11
1838      1.284192e+10
29386     1.426880e+11
37409     1.426880e+11
37410     3.864838e+11
91576     3.884607e+11
122470    1.268338e+11
137807    1.585422e+10
153082    1.585422e+11
184704    1.810795e+11
193736    1.585422e+10
207080    1.585422e+10
230753    1.733773e+10
241404    1.457003e+10
257840    3.884607e+11
288356    1.585422e+10
307488    1.585422e+10
318592    4.798934e+11
356716    4.798934e+11
358555    1.585422e+10
Name: price, dtype: float64
Outliers in year:
101       1968.0
137       1966.0
208       1976.0
210       1969.0
228       1978.0
           ...  
426575    1970.0
426593    1971.0
426729    1951.0
426762    1982.0
426838    1920.0
Name: year, Length: 9116, dtype: float64
Outliers in odometer:
280        999999.0
455       9999999.0
562       2222222.0
615       1111111.0
616       1111111.0
            ...    
423025     840000.0
423091    1111111.0
424165     900132.0
426106    9999999.0
426108     999999.0
Name: odometer, Length: 1089, dtype: float64

import matplotlib.pyplot as plt

for col in ['price', 'year', 'odometer']:
    plt.figure(figsize=(8, 4))
    plt.boxplot(data[col].dropna(), vert=False, patch_artist=True)
    plt.title(f"Boxplot of {col}")
    plt.xlabel(col)
    plt.show()

def remove_outliers_iqr(data, columns):
    for col in columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        data = data[(data[col] >= lower_bound) & (data[col] <= upper_bound)]
    return data

data_cleaned = remove_outliers_iqr(data, numeric_columns)

# Bar Chart: Listings by State
state_counts = data['state'].value_counts()

plt.figure(figsize=(10, 6))
state_counts.plot(kind='bar', color='skyblue')
plt.title('Count of Listings by State', fontsize=16)
plt.xlabel('State', fontsize=6)
plt.ylabel('Number of Listings', fontsize=14)
plt.xticks(rotation=45, fontsize=6)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Set general style for plots
sns.set_style("whitegrid")

# Step 1: Distribution

# Histograms
numeric_columns = ['price', 'year', 'odometer']
plt.figure(figsize=(15, 5))
for i, col in enumerate(numeric_columns, 1):
    plt.subplot(1, 3, i)
    plt.hist(data[col], bins=10, color='skyblue', edgecolor='black')
    plt.title(f'Distribution of {col.capitalize()}', fontsize=14)
    plt.xlabel(col.capitalize(), fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
plt.tight_layout()
plt.show()

# Pie Charts
categorical_columns = ['fuel', 'condition', 'paint_color', 'drive']

plt.figure(figsize=(14, 8))

for i, col in enumerate(categorical_columns, 1):
    plt.subplot(2, 2, i)

    # Get top 5 categories, grouping the rest as "Other"
    top_categories = data[col].value_counts().nlargest(5)
    other_count = data[col].value_counts().sum() - top_categories.sum()

    # Append "Other" category if necessary
    if other_count > 0:
        top_categories['Other'] = other_count

    # Plot the filtered pie chart
    top_categories.plot.pie(
        autopct='%1.1f%%',
        colors=sns.color_palette("pastel"),
        startangle=90,
        wedgeprops={'edgecolor': 'black'}
    )

    plt.title(f'Distribution of {col.capitalize()} (Top 5)', fontsize=14)
    plt.ylabel('')

plt.tight_layout()
plt.show()
# Step 2: Relationships

# Scatterplots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(data=data, x='odometer', y='price', color='teal')
plt.title('Price vs. Odometer', fontsize=14)
plt.xlabel('Odometer', fontsize=12)
plt.ylabel('Price', fontsize=12)

plt.subplot(1, 2, 2)
sns.scatterplot(data=data, x='year', y='price', color='orange')
plt.title('Price vs. Year', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.tight_layout()
plt.show()

# Boxplots
categorical_columns = ['condition', 'fuel', 'drive']
plt.figure(figsize=(20, 5))
for i, col in enumerate(categorical_columns, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(data=data, x=col, y='price', palette='Set2')
    plt.title(f'Price by {col.capitalize()}', fontsize=14)
    plt.xlabel(col.capitalize(), fontsize=12)
    plt.ylabel('Price', fontsize=12)
plt.tight_layout()
plt.show()

<ipython-input-27-647f5fa808f8>:70: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=data, x=col, y='price', palette='Set2')
<ipython-input-27-647f5fa808f8>:70: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=data, x=col, y='price', palette='Set2')
<ipython-input-27-647f5fa808f8>:70: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=data, x=col, y='price', palette='Set2')

# General style
sns.set_style("whitegrid")

# Step 1: Top 10 Manufacturers by Count
manufacturer_counts = data['manufacturer'].value_counts().head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x=manufacturer_counts.index, y=manufacturer_counts.values, palette='viridis')
plt.title('Top 10 Manufacturers by Count', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

<ipython-input-28-8c4aee0bdc88>:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=manufacturer_counts.index, y=manufacturer_counts.values, palette='viridis')

model_counts = data['model'].value_counts().head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x=model_counts.index, y=model_counts.values, palette='coolwarm')
plt.title('Top 10 Models by Count', fontsize=14)
plt.xlabel('Model', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

<ipython-input-29-3f2e3d584f3b>:3: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=model_counts.index, y=model_counts.values, palette='coolwarm')

condition_segment = data.groupby(['manufacturer', 'condition']).size().unstack(fill_value=0)
condition_segment = condition_segment.loc[manufacturer_counts.index]  # Align manufacturers with top 10
condition_segment.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title('Manufacturers Segmented by Condition', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Condition', fontsize=10)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

fuel_segment = data.groupby(['manufacturer', 'fuel']).size().unstack(fill_value=0)
fuel_segment = fuel_segment.loc[manufacturer_counts.index]  # Align manufacturers with top 10
fuel_segment.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='coolwarm')
plt.title('Manufacturers Segmented by Fuel', fontsize=14)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Fuel', fontsize=10)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# General style
sns.set_style("whitegrid")

top_states = data['state'].value_counts().index[:10]
filtered_data = data[data['state'].isin(top_states)]

plt.figure(figsize=(12, 6))
sns.violinplot(data=filtered_data, x='state', y='price', palette='muted', inner='quartile')

plt.title('Distribution of Car Prices by Top 10 States', fontsize=14)
plt.xlabel('State', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# Step 2: Violin Plot of Price Grouped by Condition
plt.figure(figsize=(12, 6))
sns.violinplot(data=data, x='condition', y='price', palette='Set2')
plt.title('Distribution of Prices by Condition', fontsize=14)
plt.xlabel('Condition', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.tight_layout()
plt.show()

# Step 3: Line Chart Showing Average Price Trends Over Year
# Group data by year and calculate average price
avg_price_by_year = data.groupby('year')['price'].mean().reset_index()

plt.figure(figsize=(10, 5))
sns.lineplot(data=avg_price_by_year, x='year', y='price', marker='o', color='b')
plt.title('Average Price Trends Over Years', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Price', fontsize=12)
plt.tight_layout()
plt.show()

<ipython-input-32-5dc789e1b852>:8: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.violinplot(data=filtered_data, x='state', y='price', palette='muted', inner='quartile')

<ipython-input-32-5dc789e1b852>:21: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.violinplot(data=data, x='condition', y='price', palette='Set2')

# Step 1: Grouping the data by 'state' to count the number of postings per region
region_postings = data['state'].value_counts().reset_index()
region_postings.columns = ['Region', 'Postings']

# Step 2: Handling missing values (if any), for now we remove rows where state is NaN
region_postings = region_postings.dropna()

# Step 3: Sorting the data by 'Postings' in descending order for better visualization
region_postings = region_postings.sort_values(by='Postings', ascending=False)

# Step 4: Plotting the bar chart
plt.figure(figsize=(12, 8))
sns.barplot(x='Postings', y='Region', data=region_postings, palette='viridis')
plt.title('Car Postings Distribution by Region', fontsize=16)
plt.xlabel('Number of Postings', fontsize=12)
plt.ylabel('Region', fontsize=12)
plt.show()

<ipython-input-33-27050b1abf8c>:13: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='Postings', y='Region', data=region_postings, palette='viridis')

# Step 1: Calculate the average price for each model across regions
avg_price_per_region = data.groupby(['state', 'model'])['price'].mean().reset_index()

# Step 2: Create a pivot table for heatmap visualization
price_pivot = avg_price_per_region.pivot(index='model', columns='state', values='price')

# Step 3: Plotting heatmap for regional price comparison
plt.figure(figsize=(8, 4))
sns.heatmap(price_pivot, cmap='coolwarm', annot=True, fmt='.0f', linewidths=0.5)
plt.title('Average Car Price Comparison Across state')
plt.xlabel('State')
plt.ylabel('Car Model')
plt.xticks(rotation=30, ha='right')
plt.show()

# Bar chart for selected car models (e.g., top 5 most common models)
top_models = data['model'].value_counts().nlargest(5).index
filtered_data = data[data['model'].isin(top_models)]

# Group data again for selected models
avg_price_filtered = filtered_data.groupby(['state', 'model'])['price'].mean().reset_index()

plt.figure(figsize=(14, 6))
sns.barplot(x='state', y='price', hue='model', data=avg_price_filtered, palette='viridis')
plt.title('Average Price of Top Car Models Across Regions')
plt.xlabel('State')
plt.ylabel('Average Price')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Car Model')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

# Step 1: Calculate the average price for each condition
avg_price_condition = data.groupby('condition')['price'].mean().reset_index()

# Step 2: Sort values for better visualization
avg_price_condition = avg_price_condition.sort_values(by='price', ascending=False)

# Step 3: Bar Chart to visualize average price by condition
plt.figure(figsize=(10, 6))
sns.barplot(x='condition', y='price', data=avg_price_condition, palette='viridis')
plt.title('Average Car Price by Condition')
plt.xlabel('Condition')
plt.ylabel('Average Price')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

# Step 4: Box Plot to analyze price variability within each condition
plt.figure(figsize=(12, 6))
sns.boxplot(x='condition', y='price', data=data, palette='coolwarm')
plt.title('Price Distribution by Car Condition')
plt.xlabel('Condition')
plt.ylabel('Price')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

<ipython-input-35-ec038169399b>:9: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='condition', y='price', data=avg_price_condition, palette='viridis')

<ipython-input-35-ec038169399b>:19: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='condition', y='price', data=data, palette='coolwarm')

# Step 1: Calculate the correlation matrix
correlation = data[['price', 'year', 'odometer']].corr()

# Step 2: Print the correlation matrix
print("Correlation Matrix:\n", correlation)

# Step 3: Heatmap to visualize correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix: Price, Year, and Odometer')
plt.show()

# Step 4: Pairplot to visualize relationships between variables
sns.pairplot(data[['price', 'year', 'odometer']], diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pairplot of Price, Year, and Odometer', y=1.02)
plt.show()

# Step 5: Scatter plots for individual analysis
plt.figure(figsize=(12, 5))

# Price vs Year
plt.subplot(1, 2, 1)
sns.scatterplot(x='year', y='price', data=data, alpha=0.6, color='blue')
plt.title('Car Price vs Manufacturing Year')
plt.xlabel('Year')
plt.ylabel('Price')

# Price vs Odometer
plt.subplot(1, 2, 2)
sns.scatterplot(x='odometer', y='price', data=data, alpha=0.6, color='red')
plt.title('Car Price vs Odometer Reading')
plt.xlabel('Odometer')
plt.ylabel('Price')

plt.tight_layout()
plt.show()

Correlation Matrix:
              price      year  odometer
price     1.000000 -0.004925  0.010032
year     -0.004925  1.000000 -0.157215
odometer  0.010032 -0.157215  1.000000

# Step 1: Calculate average car price by state and sort in descending order
avg_price_state = data.groupby('state')['price'].mean().sort_values(ascending=False)

# Step 2: Print the result
print("Average Car Price by State:\n", avg_price_state)

# Step 3: Visualize using a bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=avg_price_state.index, y=avg_price_state.values, palette='viridis')
plt.xticks(rotation=90)
plt.title('Average Car Price by State')
plt.xlabel('State')
plt.ylabel('Average Price (USD)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Step 4: Visualize using a heatmap (if geographic data available)
plt.figure(figsize=(8, 12))
sns.heatmap(avg_price_state.to_frame().T, cmap='coolwarm', annot=True, fmt='.0f')
plt.title('Heatmap of Average Car Prices by State')
plt.ylabel('Price')
plt.show()

Average Car Price by State:
 state
Lakshadweep                                 1.665653e+08
Chandigarh                                  1.370311e+08
Goa                                         6.424109e+07
West Bengal                                 2.884930e+07
Haryana                                     2.001612e+07
Karnataka                                   1.600844e+07
Kerala                                      1.364536e+07
Uttar Pradesh                               8.511310e+06
Dadra and Nagar Haveli and Daman and Diu    7.138190e+06
Maharashtra                                 5.056206e+06
Madhya Pradesh                              3.556956e+06
Tamil Nadu                                  3.109092e+06
Delhi                                       2.860353e+06
Chhattisgarh                                2.503276e+06
Gujarat                                     2.369305e+06
Punjab                                      2.348216e+06
Rajasthan                                   2.346407e+06
Assam                                       2.314903e+06
Jharkhand                                   2.306235e+06
Andaman and Nicobar Islands                 2.268856e+06
Sikkim                                      2.268851e+06
Manipur                                     2.267773e+06
Puducherry                                  2.257328e+06
Arunachal Pradesh                           2.250119e+06
Odisha                                      2.247732e+06
Jammu and Kashmir                           2.240744e+06
Tripura                                     2.238199e+06
Telangana                                   2.237392e+06
Uttarakhand                                 2.237066e+06
Andhra Pradesh                              2.231818e+06
Bihar                                       2.225778e+06
Ladakh                                      2.218344e+06
Meghalaya                                   2.216664e+06
Himachal Pradesh                            2.197821e+06
Mizoram                                     2.188526e+06
Name: price, dtype: float64

<ipython-input-37-4b324341e25b>:9: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=avg_price_state.index, y=avg_price_state.values, palette='viridis')

#  Number of Listings by Region (Supply Indicator)
listings_by_region = data['state'].value_counts()
print("Number of Listings by Region:\n", listings_by_region)

# Visualization of listings by region
plt.figure(figsize=(12, 6))
sns.barplot(x=listings_by_region.index, y=listings_by_region.values, palette='viridis')
plt.title('Number of Car Listings by Region')
plt.xlabel('Region')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()

Number of Listings by Region:
 state
Uttar Pradesh                               45573
Maharashtra                                 45263
Karnataka                                   30492
Punjab                                      30455
Delhi                                       30379
Haryana                                     30175
Tamil Nadu                                  18124
West Bengal                                 18052
Telangana                                   15356
Kerala                                      15304
Madhya Pradesh                              15205
Rajasthan                                   15131
Andhra Pradesh                              15070
Gujarat                                     14992
Odisha                                      12189
Jharkhand                                    6080
Bihar                                        6034
Chhattisgarh                                 6003
Assam                                        5987
Uttarakhand                                  5951
Andaman and Nicobar Islands                  3079
Sikkim                                       3065
Ladakh                                       3062
Puducherry                                   3046
Jammu and Kashmir                            3030
Meghalaya                                    3017
Himachal Pradesh                             3016
Mizoram                                      3010
Manipur                                      2990
Chandigarh                                   2986
Arunachal Pradesh                            2980
Dadra and Nagar Haveli and Daman and Diu     2980
Tripura                                      2961
Goa                                          2923
Lakshadweep                                  2920
Name: count, dtype: int64

<ipython-input-38-7fb1fec7a708>:7: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=listings_by_region.index, y=listings_by_region.values, palette='viridis')

# Analysis 1: Number of Listings by Fuel Type
fuel_listings = filtered_data['fuel'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=fuel_listings.index, y=fuel_listings.values, palette='coolwarm')
plt.title('Number of Listings by Fuel Type')
plt.xlabel('Fuel Type')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()

# Analysis 2:Number of Clean Cars by Fuel Type
clean_fuel = filtered_data[filtered_data['title_status'] == 'clean']['fuel'].value_counts().sort_values(ascending=False)
if not clean_fuel.empty:
    plt.figure(figsize=(12, 6))
    sns.barplot(x=clean_fuel.index, y=clean_fuel.values, palette='Set2')
    plt.title('Number of Clean Cars by Fuel Type')
    plt.xlabel('Fuel Type')
    plt.ylabel('Number of Listings')
    plt.xticks(rotation=45)
    plt.show()
else:
    print("No clean car data available for Fuel Type analysis.")

# Analysis 3: Number of Listings by Condition
condition_listings = filtered_data['condition'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=condition_listings.index, y=condition_listings.values, palette='Blues')
plt.title('Number of Listings by Condition')
plt.xlabel('Condition')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45)
plt.show()

# Analysis 4: Number of Clean Cars by Condition
clean_condition = filtered_data[filtered_data['title_status'] == 'clean']['condition'].value_counts().sort_values(ascending=False)
if not clean_condition.empty:
    plt.figure(figsize=(12, 6))
    sns.barplot(x=clean_condition.index, y=clean_condition.values, palette='Oranges')
    plt.title('Number of Clean Cars by Condition')
    plt.xlabel('Condition')
    plt.ylabel('Number of Listings')
    plt.xticks(rotation=45)
    plt.show()
else:
    print("No clean car data available for Condition analysis.")

<ipython-input-39-e98a20952f3d>:4: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=fuel_listings.index, y=fuel_listings.values, palette='coolwarm')

<ipython-input-39-e98a20952f3d>:15: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=clean_fuel.index, y=clean_fuel.values, palette='Set2')

<ipython-input-39-e98a20952f3d>:27: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=condition_listings.index, y=condition_listings.values, palette='Blues')

<ipython-input-39-e98a20952f3d>:38: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=clean_condition.index, y=clean_condition.values, palette='Oranges')

# Inspect the data
print(data.head())

# 1. Calculate the average and median price for each car model
price_stats = data.groupby('model')['price'].agg(['mean', 'median', 'count']).reset_index()
price_stats.columns = ['model', 'average_price', 'median_price', 'listing_count']

# 2. Identify outliers using Interquartile Range (IQR) for price
Q1 = data['price'].quantile(0.25)
Q3 = data['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
filtered_data = data[(data['price'] >= lower_bound) & (data['price'] <= upper_bound)]

# 3. Competitive analysis based on demand and supply
sold_count = data[data['title_status'] == 'sold'].groupby('model')['price'].count().reset_index()
sold_count.columns = ['model', 'sold_count']

# Merge sold count with price stats
optimal_pricing_data = pd.merge(price_stats, sold_count, on='model', how='left').fillna(0)

# 4. Analyzing condition impact on price
condition_price = data.groupby(['model', 'condition'])['price'].mean().unstack(fill_value=0)

# 5. Analyzing mileage impact on price
plt.figure(figsize=(12, 6))
sns.scatterplot(x='odometer', y='price', hue='condition', data=data, palette='coolwarm')
plt.title('Mileage vs Price Distribution')
plt.xlabel('Odometer (Mileage)')
plt.ylabel('Price')
plt.legend(title='Condition')
plt.show()

# 6. Adjusting pricing based on market trends
optimal_pricing_data['recommended_price'] = np.where(
    optimal_pricing_data['sold_count'] > 0,
    optimal_pricing_data['median_price'] * 1.05,  # Increase price slightly if demand is high
    optimal_pricing_data['median_price'] * 0.95   # Decrease price if demand is low
)

# 7. Display the recommended prices
print(optimal_pricing_data[['model', 'average_price', 'median_price', 'sold_count', 'recommended_price']])

# 8. Visualize pricing recommendations
plt.figure(figsize=(12, 6))
sns.barplot(x='model', y='recommended_price', data=optimal_pricing_data.sort_values(by='recommended_price', ascending=False).head(20), palette='viridis')
plt.title('Recommended Pricing for Top Car Models')
plt.xlabel('Car Model')
plt.ylabel('Recommended Price')
plt.xticks(rotation=90)
plt.show()

           id      price  year manufacturer model condition  cylinders fuel  \
0  7222695916   770530.0   NaN          nan   nan       nan          0  nan   
1  7218891961  1528210.0   NaN          nan   nan       nan          0  nan   
2  7221797935  2696820.0   NaN          nan   nan       nan          0  nan   
3  7222270760   192610.0   NaN          nan   nan       nan          0  nan   
4  7210384030   629240.0   NaN          nan   nan       nan          0  nan   

   odometer title_status  ...  VIN drive size type paint_color image_url  \
0       NaN          nan  ...  nan   nan  nan  nan         nan       nan   
1       NaN          nan  ...  nan   nan  nan  nan         nan       nan   
2       NaN          nan  ...  nan   nan  nan  nan         nan       nan   
3       NaN          nan  ...  nan   nan  nan  nan         nan       nan   
4       NaN          nan  ...  nan   nan  nan  nan         nan       nan   

            state posting_date latitude  longitude  
0         Haryana          NaT  28.8446    75.1167  
1         Haryana          NaT  27.9026    77.0382  
2     West Bengal          NaT  24.6838    85.9695  
3  Andhra Pradesh          NaT  15.5158    81.9342  
4     West Bengal          NaT  23.7821    87.4861  

[5 rows x 21 columns]

                                                   model  average_price  \
0                                                    "t"      3467350.0   
1      $362.47, $1000 down, oac, 2.9%apr $362.47,luxu...      3197630.0   
2                                                      %       128440.0   
3                                               & altima       391100.0   
4                                     '50 Business Coupe      1284190.0   
...                                                  ...            ...   
29663                               ♦ALL TADES WELCOME!♦        64180.0   
29664                                                  ♿      1412630.0   
29665                                              ♿ vmi      1412165.0   
29666                               𝓜𝓮𝓻𝓬𝓮𝓭𝓮𝓼 𝓫𝓮𝓷𝔃 𝓶𝓵 350       500890.0   
29667                        🔥GMC Sierra 1500 SLE🔥 4X4 🔥      1284190.0   

       median_price  sold_count  recommended_price  
0         3467350.0         0.0         3293982.50  
1         3197630.0         0.0         3037748.50  
2          128440.0         0.0          122018.00  
3          391100.0         0.0          371545.00  
4         1284190.0         0.0         1219980.50  
...             ...         ...                ...  
29663       64180.0         0.0           60971.00  
29664     1412630.0         0.0         1341998.50  
29665     1412165.0         0.0         1341556.75  
29666      500890.0         0.0          475845.50  
29667     1284190.0         0.0         1219980.50  

[29668 rows x 5 columns]

<ipython-input-40-0a1f70a6d962>:49: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='model', y='recommended_price', data=optimal_pricing_data.sort_values(by='recommended_price', ascending=False).head(20), palette='viridis')

# Convert posting_date to datetime
data['posting_date'] = pd.to_datetime(data['posting_date'])

# Extract the year from the posting date
data['posting_year'] = data['posting_date'].dt.year

# Calculate vehicle age at the time of posting
data['vehicle_age'] = data['posting_year'] - data['year']

# Display result
print(data[['year', 'posting_date', 'vehicle_age']])

# Conclusion
average_vehicle_age = data['vehicle_age'].mean()
print(f"On average, vehicles are {average_vehicle_age:.1f} years old at the time of posting.")

          year              posting_date  vehicle_age
0          NaN                       NaT          NaN
1          NaN                       NaT          NaN
2          NaN                       NaT          NaN
3          NaN                       NaT          NaN
4          NaN                       NaT          NaN
...        ...                       ...          ...
426875  2019.0 2021-04-04 09:21:31+00:00          2.0
426876  2020.0 2021-04-04 09:21:29+00:00          1.0
426877  2020.0 2021-04-04 09:21:17+00:00          1.0
426878  2018.0 2021-04-04 09:21:11+00:00          3.0
426879  2019.0 2021-04-04 09:21:07+00:00          2.0

[426880 rows x 3 columns]
On average, vehicles are 9.8 years old at the time of posting.

# Convert posting_date to datetime
data['posting_date'] = pd.to_datetime(data['posting_date'])

# Extract the year from the posting date
data['posting_year'] = data['posting_date'] .dt.year

# Calculate vehicle age at the time of posting
data['vehicle_age'] = data['posting_year'] - data['year']

# Plot a histogram to visualize vehicle age distribution
plt.figure(figsize=(8, 5))
sns.histplot(data['vehicle_age'], bins=10, kde=True, color='skyblue')
plt.title('Distribution of Vehicle Age at Posting')
plt.xlabel('Vehicle Age (years)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Plot a bar chart to see the average vehicle age by posting year
plt.figure(figsize=(8, 5))
sns.barplot(x=data['posting_year'], y=data['vehicle_age'], palette='viridis')
plt.title('Average Vehicle Age by Posting Year')
plt.xlabel('Posting Year')
plt.ylabel('Average Vehicle Age')
plt.show()

# Display box plot for age distribution
plt.figure(figsize=(8, 5))
sns.boxplot(x=data['vehicle_age'], color='lightcoral')
plt.title('Boxplot of Vehicle Age')
plt.xlabel('Vehicle Age (years)')
plt.grid(True)
plt.show()

<ipython-input-42-a73b8b06c96f>:21: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=data['posting_year'], y=data['vehicle_age'], palette='viridis')

# Group by manufacturer and calculate total sales value
brand_sales = data.groupby('manufacturer')['price'].sum().reset_index()

# Sort by sales value in descending order
brand_sales = brand_sales.sort_values(by='price', ascending=False)

# Display result
print(brand_sales)

# Identify the brand with the highest sales value
highest_sale_brand = brand_sales.iloc[0]
print(f"The brand with the highest sales value is {highest_sale_brand['manufacturer']} with a total sales value of {highest_sale_brand['price']}.")

       manufacturer         price
40           toyota  1.029069e+12
7         chevrolet  8.179780e+11
26    mercedes-benz  8.068871e+11
20             jeep  3.680175e+11
13             ford  3.319232e+11
42            volvo  1.662761e+11
31              nan  7.596019e+10
14              gmc  6.554087e+10
35              ram  6.531354e+10
4               bmw  5.047692e+10
32           nissan  4.610003e+10
16            honda  2.936577e+10
10            dodge  2.640941e+10
5             buick  2.598642e+10
3              audi  2.292789e+10
23            lexus  2.022540e+10
6          cadillac  1.735790e+10
38           subaru  1.591916e+10
0             acura  1.523405e+10
41       volkswagen  1.504611e+10
17          hyundai  1.423159e+10
18         infiniti  1.206377e+10
21              kia  1.203714e+10
24          lincoln  1.073178e+10
25            mazda  8.604323e+09
8          chrysler  8.087946e+09
36            rover  7.376169e+09
19           jaguar  6.635093e+09
29       mitsubishi  5.810281e+09
34          porsche  5.677958e+09
28             mini  4.321809e+09
39            tesla  4.275460e+09
1        alfa-romeo  3.252941e+09
33          pontiac  2.402481e+09
11          ferrari  1.310728e+09
12             fiat  1.209434e+09
37           saturn  9.632394e+08
27          mercury  8.335567e+08
15  harley-davidson  2.383198e+08
2      aston-martin  1.648740e+08
9            datsun  1.225653e+08
22       land rover  2.133458e+07
30           morgan  5.046890e+06
The brand with the highest sales value is toyota with a total sales value of 1029068765610.0.

# Group by manufacturer and calculate total sales value
brand_sales = data.groupby('manufacturer')['price'].sum().reset_index()

# Sort by sales value in descending order
brand_sales = brand_sales.sort_values(by='price', ascending=False)

# Display result
print(brand_sales)

# Identify the brand with the highest sales value
highest_sale_brand = brand_sales.iloc[0]
print(f"The brand with the highest sales value is {highest_sale_brand['manufacturer']} with a total sales value of {highest_sale_brand['price']}.")

# Plot a bar chart to visualize brand sales
plt.figure(figsize=(8, 5))
sns.barplot(x='manufacturer', y='price', data=brand_sales, palette='viridis')
plt.title('Total Sales Value by Manufacturer')
plt.xlabel('Manufacturer')
plt.ylabel('Total Sales Value')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--')
plt.show()

# Step 1: Aggregate sales by manufacturer (if not already done)
brand_sales = data.groupby('manufacturer')['price'].sum().reset_index()

# Step 2: Sort and select the top 5 manufacturers by sales
top_brands = brand_sales.sort_values('price', ascending=False).head(5)

# Step 3: Plot the pie chart for the top 5 brands
plt.figure(figsize=(8, 8))
plt.pie(top_brands['price'], labels=top_brands['manufacturer'], autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Sales Distribution by Top 5 Manufacturer')
plt.show()

       manufacturer         price
40           toyota  1.029069e+12
7         chevrolet  8.179780e+11
26    mercedes-benz  8.068871e+11
20             jeep  3.680175e+11
13             ford  3.319232e+11
42            volvo  1.662761e+11
31              nan  7.596019e+10
14              gmc  6.554087e+10
35              ram  6.531354e+10
4               bmw  5.047692e+10
32           nissan  4.610003e+10
16            honda  2.936577e+10
10            dodge  2.640941e+10
5             buick  2.598642e+10
3              audi  2.292789e+10
23            lexus  2.022540e+10
6          cadillac  1.735790e+10
38           subaru  1.591916e+10
0             acura  1.523405e+10
41       volkswagen  1.504611e+10
17          hyundai  1.423159e+10
18         infiniti  1.206377e+10
21              kia  1.203714e+10
24          lincoln  1.073178e+10
25            mazda  8.604323e+09
8          chrysler  8.087946e+09
36            rover  7.376169e+09
19           jaguar  6.635093e+09
29       mitsubishi  5.810281e+09
34          porsche  5.677958e+09
28             mini  4.321809e+09
39            tesla  4.275460e+09
1        alfa-romeo  3.252941e+09
33          pontiac  2.402481e+09
11          ferrari  1.310728e+09
12             fiat  1.209434e+09
37           saturn  9.632394e+08
27          mercury  8.335567e+08
15  harley-davidson  2.383198e+08
2      aston-martin  1.648740e+08
9            datsun  1.225653e+08
22       land rover  2.133458e+07
30           morgan  5.046890e+06
The brand with the highest sales value is toyota with a total sales value of 1029068765610.0.

<ipython-input-44-e0c4ba509063>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='manufacturer', y='price', data=brand_sales, palette='viridis')

# Define the time frame (e.g., 60 days)
time_frame = pd.Timedelta(days=60)

# Sort data by VIN and posting date
data= data.sort_values(by=['VIN', 'posting_date'])

# Check for resales within the time frame
data['resale_within_timeframe'] = data.duplicated(subset=['VIN'], keep=False) & (
    data['posting_date'].diff().fillna(pd.Timedelta(days=9999)) <= time_frame
)

# Filter records that were resold within the given timeframe
resold_vehicles = data[data['resale_within_timeframe']]

# Display the result
print("Vehicles resold within the given timeframe:\n", resold_vehicles[['VIN', 'posting_date']])

Vehicles resold within the given timeframe:
                       VIN              posting_date
32544               00000 2021-04-28 13:44:51+00:00
29955               00000 2021-05-04 15:37:19+00:00
288401  00000000000000000 2021-04-05 16:26:23+00:00
288400  00000000000000000 2021-04-05 16:28:23+00:00
303644  00000000000000000 2021-04-05 16:34:11+00:00
...                   ...                       ...
406984                nan 2021-05-05 03:56:25+00:00
406982                nan 2021-05-05 03:58:48+00:00
406981                nan 2021-05-05 03:59:22+00:00
422031                nan 2021-05-05 04:07:07+00:00
422030                nan 2021-05-05 04:24:09+00:00

[348828 rows x 2 columns]

data.to_csv('spinny_data.csv')

	id	price	year	manufacturer	model	condition	cylinders	fuel	odometer	title_status	...	VIN	drive	size	type	paint_color	image_url	state	posting_date	latitude	longitude
0	7222695916	770530.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Haryana	NaN	28.8446	75.1167
1	7218891961	1528210.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Haryana	NaN	27.9026	77.0382
2	7221797935	2696820.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	West Bengal	NaN	24.6838	85.9695
3	7222270760	192610.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Andhra Pradesh	NaN	15.5158	81.9342
4	7210384030	629240.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	West Bengal	NaN	23.7821	87.4861

Spinny's Revenue Growth Through Demand and Pricing Optimisation¶

Spinny Pre-Owned Car Market Analysis¶

Introduction:¶

1. Data Understanding¶

2. Data Preparation¶

3. Exploratory Data Analysis (EDA)¶

4. Charting and Insights¶

5.Insights and Recommendations¶

Unlocking Customer Demand and Transaction Growth in the Pre-Owned Car Market¶

Objective:¶

Business Impact¶

Dataset Overview¶

Dataset Overview¶

Column Definitions¶

Analysis & Visualisation !¶

1. Importing and Cleaning Data¶

Importing Necessary Libraries¶

Loading the Dataset from google drive¶

Viewing the First Few Rows of the Dataset¶

Checking the Shape of the Dataset¶

Displaying Dataset Information¶

Checking for Duplicate Values in the Dataset¶

Checking for Missing/Null Values¶

Summary of Dataset Observations¶

2. Data Types¶

Unique Values for each variable.¶

3. Data Wrangling¶

Data Wrangling Code¶

OutLiers¶

Demand for pre-owned cars by region¶

Car Attributes Analysis¶

Histograms:¶

Pie Charts:¶

Scatterplots:¶

Boxplots:¶

Manufacturer and Model Popularity¶

Pricing Analysis¶

Regional Postings Distribution¶

Regional Price Comparison: Analyzing Average Prices Across Different Regions¶

Hypothesis¶

Assessing the Demand for Pre-Owned Cars in Each Region¶

Identifying Demand for Specific Car Models and Their Key Attributes¶

Optimal Pricing Strategy for Cars¶

Recommendations for Spinny¶

Conclusion: By integrating these strategic pricing recommendations, Spinny can position itself as a market leader in the pre-owned car segment, achieving higher sales volumes, improved customer satisfaction, and sustained revenue growth.¶