This project analyzes a car price dataset using the Python libraries NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn. The goal is to preprocess the dataset through feature engineering, feature selection, and exploratory data analysis. Model building is not part of this project.
- Project Description
- Dataset
- Installation
- Usage
- Project Structure
- Analysis Steps
- Conclusions
- Contributing
- License
The goal of this project is to analyze a car price dataset and prepare it for modeling through data preprocessing, including feature engineering, feature selection, and exploratory data analysis. The prepared data is intended to support machine learning models that predict car prices from features such as model, production year, category, brand, fuel type, engine volume, mileage, cylinders, and vehicle style.
The dataset consists of 19,237 samples and includes features such as:
- Model
- Production year
- Category
- Brand
- Fuel type
- Engine volume
- Mileage
- Cylinders
- Vehicle style
- Price (target variable)
To run this project, you'll need to install the following dependencies:
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Seaborn
- scikit-learn
- Jupyter Notebook
You can install the required packages using pip:
pip install numpy pandas matplotlib seaborn scikit-learn jupyter
git clone https://github.com/yourusername/car-price-prediction.git
cd car-price-prediction
python -m venv env
source env/bin/activate  # On Windows use env\Scripts\activate
pip install numpy pandas matplotlib seaborn scikit-learn jupyter
jupyter notebook
Open car_price_analysis.ipynb and run the cells sequentially to perform the data analysis.
The project directory contains the following files:
- car_price_analysis.ipynb: The Jupyter Notebook containing the data analysis and preprocessing steps.
- price_prediction_batch_23.csv: The dataset used for the analysis.
- README.md: The project documentation.
- Analyze data types of features and update if required.
- Process the 'Levy' column and convert it to an integer type.
- Process the 'Mileage' column and convert it to an integer type.
- Check for NaN values in the data and remove them if necessary.
- Check for duplicates and remove them.
- Check for outliers using boxplots and statistical methods, and remove them if necessary.
- Draw countplots for categorical features and write observations.
- Draw histograms for numeric features, compute skewness, and apply transformation functions if needed.
- Create a joint plot with the hue parameter and write observations.
- Apply scaling methods to independent features.
- Convert categorical features into numeric ones using appropriate encoding techniques.
- Combine results and compute the correlation among all independent features using a heatmap. Discard one of the variables if high correlation is detected (above 0.7).
- Split the dataset into training and testing sets using an 80-20 split.
- Compute the correlation of each independent feature with the dependent variable 'Price'.
- Select the seven most important independent features based on the correlation values.
- Apply the SelectKBest method to the dataset to reduce the feature set to the seven most important features.
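
As a rough illustration of the 'Levy' and 'Mileage' steps above, the sketch below strips non-digit characters and casts the result to integers. It assumes, without confirmation from the dataset, that 'Levy' marks missing values with a placeholder such as "-" and that 'Mileage' carries a unit suffix such as " km". The helper name `to_integer_column`, the fill value of 0, and the sample rows are made up for illustration.

```python
import pandas as pd


def to_integer_column(series: pd.Series, fill_value: int = 0) -> pd.Series:
    """Strip non-digit characters and return an int64 Series.

    Entries with no digits at all (for example a "-" placeholder) become
    `fill_value`. Assumes whole-number values; a decimal point would be
    stripped along with every other non-digit character.
    """
    digits = series.astype(str).str.replace(r"[^0-9]", "", regex=True)
    numeric = pd.to_numeric(digits, errors="coerce")  # empty string -> NaN
    return numeric.fillna(fill_value).astype("int64")


# Made-up sample rows, not taken from the real dataset.
sample = pd.DataFrame(
    {
        "Levy": ["1399", "-", "862"],
        "Mileage": ["186005 km", "192000 km", "0 km"],
    }
)
sample["Levy"] = to_integer_column(sample["Levy"])
sample["Mileage"] = to_integer_column(sample["Mileage"])
print(sample.dtypes)
```

Filling missing 'Levy' values with 0 is only one option; dropping those rows or imputing a median may suit the analysis better.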
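
The outlier and skewness steps above do not name a specific statistical method. One common choice, shown here as an assumption rather than the notebook's actual approach, is the 1.5 × IQR rule for outliers and a `log1p` transform for right-skewed columns. The function names, the skewness threshold of 1.0, and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    within = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[within]


def reduce_right_skew(series: pd.Series, threshold: float = 1.0) -> pd.Series:
    """Return log1p(series) if it is right-skewed beyond `threshold`.

    The series is returned unchanged when the skewness is at or below the
    threshold or when it contains negative values.
    """
    if series.skew() > threshold and series.min() >= 0:
        return np.log1p(series)
    return series


# Made-up values: one obvious outlier among otherwise similar prices.
prices = pd.DataFrame({"Price": [10, 12, 11, 13, 12, 11, 1000]})
cleaned = remove_iqr_outliers(prices, "Price")

# Made-up right-skewed column to show the transform.
skewed = pd.Series(np.exp(np.linspace(0, 6, 60)))
transformed = reduce_right_skew(skewed)
print(len(prices), "->", len(cleaned), "rows after outlier removal")
```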
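
The scaling and encoding steps above leave the exact techniques open. A minimal sketch, assuming standard scaling for numeric features and one-hot encoding for categorical ones, could use scikit-learn's `ColumnTransformer`. The column names come from the feature list in this README, while the sample values and the choice of scaler and encoder are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up sample rows, not taken from the real dataset.
sample = pd.DataFrame(
    {
        "Mileage": [186005, 192000, 200000, 168966],
        "Engine volume": [3.5, 3.0, 1.3, 2.5],
        "Fuel type": ["Hybrid", "Petrol", "Petrol", "Diesel"],
    }
)
numeric_cols = ["Mileage", "Engine volume"]
categorical_cols = ["Fuel type"]

preprocessor = ColumnTransformer(
    transformers=[
        # Zero mean and unit variance for numeric features.
        ("scale", StandardScaler(), numeric_cols),
        # One indicator column per category value.
        ("encode", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
features = preprocessor.fit_transform(sample)
print(features.shape)
```

In a full workflow the transformer would normally be fitted on the training split only and then applied to the test split, so that test data does not influence the scaling parameters.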
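
The last five steps above (correlation filter, 80-20 split, correlation with 'Price', and SelectKBest) can be sketched on synthetic data as follows. The score function is not specified in this README, so `f_regression` is an assumption. The helper `drop_highly_correlated`, the feature names `f0` to `f9`, and the generated data are hypothetical stand-ins for the real, already encoded feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split


def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Drop the later feature of each pair with |correlation| above `threshold`."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


# Synthetic stand-in data: nine independent features plus one near-copy of f0.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 9)), columns=[f"f{i}" for i in range(9)])
X["f9"] = 0.95 * X["f0"] + rng.normal(scale=0.1, size=n)
y = 3 * X["f0"] + 2 * X["f1"] - X["f2"] + rng.normal(scale=0.5, size=n)

X_reduced = drop_highly_correlated(X, threshold=0.7)

# 80-20 split; the selection below is fitted on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42
)

# Seven features with the largest absolute correlation with the target.
target_corr = X_train.corrwith(y_train).abs().sort_values(ascending=False)
top_by_corr = target_corr.head(7).index.tolist()

# SelectKBest with a univariate F-test (assumed score function).
selector = SelectKBest(score_func=f_regression, k=7).fit(X_train, y_train)
selected = X_train.columns[selector.get_support()].tolist()
print(selected)
```

With `f_regression`, the score grows with the squared Pearson correlation, so the two selections are expected to pick the same features. A different score function, such as `mutual_info_regression`, could give a different set.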
- The analysis identified key features influencing car prices and reduced the feature set to the most relevant ones for potential model building.
- Data preprocessing steps, including handling missing values, outliers, and feature scaling, were crucial in preparing the data for analysis.
Contributions are welcome! Please fork this repository and submit a pull request for any enhancements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.