This project focuses on statistical analysis to understand and prepare data for potential machine learning applications. The dataset, house_price.csv,
contains property prices in Bangalore. The analysis covers exploratory data analysis (EDA), outlier detection and handling, checks of data distribution and normality, and correlation analysis.
- Filename: house_price.csv
- Description: Contains property prices in Bangalore. The main focus is on the "price per square foot" column.
- Objective: Perform initial analysis to understand the dataset.
- Steps:
- Load the dataset.
- Examine basic statistics and structure.
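
The loading and inspection steps above could be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `load_and_summarize` is hypothetical, and the README does not state the exact column names in house_price.csv.

```python
import pandas as pd


def load_and_summarize(path):
    """Load a CSV file and return the DataFrame plus summary statistics.

    `path` may be a file path or any file-like object accepted by pandas.
    """
    df = pd.read_csv(path)
    # describe() reports count, mean, std, min, quartiles and max
    # for every numeric column.
    summary = df.describe()
    return df, summary
```

Typical use would be `df, summary = load_and_summarize("house_price.csv")`, followed by `df.info()` and `df.head()` to check column types and missing values.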
- Objective: Identify and manage outliers using various methods.
- Methods:
- a) Mean and Standard Deviation:
- Compute mean and standard deviation of the "price per square foot."
- Define outliers as values outside of mean ± (k * standard deviation), where k is typically 2 or 3.
- b) Percentile Method:
- Compute percentiles (e.g., 1st and 99th).
- Define outliers as values below the 1st percentile or above the 99th percentile.
- c) IQR (Interquartile Range) Method:
- Calculate the IQR (Q3 - Q1).
- Define outliers as values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
- d) Z-Score Method:
- Compute Z-scores for the "price per square foot."
- Define outliers as values with Z-scores beyond ±3 (the same rule as method (a) with k = 3).
- Handling: Remove or adjust outliers using trimming, capping, or imputation (mean/median).
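
One way the four detection rules and the handling options might look in pandas is sketched below. All function names are hypothetical, and the column name `price_per_sqft` used in the usage note is an assumption, since the README does not give the actual header. Note that pandas computes the sample standard deviation (ddof=1) by default, so Z-scores can differ slightly from `scipy.stats.zscore`, which uses ddof=0.

```python
import pandas as pd


def bounds_mean_std(s, k=3.0):
    """Lower/upper limits at mean ± k * standard deviation."""
    mu, sigma = s.mean(), s.std()
    return mu - k * sigma, mu + k * sigma


def bounds_percentile(s, lower=0.01, upper=0.99):
    """Lower/upper limits at the given quantiles (1st and 99th by default)."""
    return s.quantile(lower), s.quantile(upper)


def bounds_iqr(s, factor=1.5):
    """Lower/upper limits at Q1 - factor * IQR and Q3 + factor * IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def zscore_mask(s, threshold=3.0):
    """Boolean mask that is True where |Z| <= threshold (inliers)."""
    z = (s - s.mean()) / s.std()
    return z.abs() <= threshold


def trim(df, col, low, high):
    """Trimming: drop rows whose value lies outside [low, high]."""
    return df[df[col].between(low, high)]


def cap(df, col, low, high):
    """Capping: clip values to [low, high] and keep every row."""
    out = df.copy()
    out[col] = out[col].clip(lower=low, upper=high)
    return out


def impute_median(df, col, low, high):
    """Imputation: replace out-of-range values with the inlier median.

    between() is False for NaN, so missing values are imputed as well.
    """
    out = df.copy()
    inlier = out[col].between(low, high)
    out.loc[~inlier, col] = out.loc[inlier, col].median()
    return out
```

For example, `trim(df, "price_per_sqft", *bounds_iqr(df["price_per_sqft"]))` would drop IQR outliers, while `df[zscore_mask(df["price_per_sqft"])]` keeps rows within three standard deviations of the mean.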
- Objective: Evaluate which outlier removal method is most effective.
- Steps:
- Generate box plots for the "price per square foot" column before and after outlier removal for each method.
- Compare the box plots to determine the best method for removing outliers.
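
The before/after comparison could be drawn with a small helper like the one below. This is a sketch only: `plot_boxplots` is a hypothetical name, and the labels passed in are whatever the caller chooses. Each panel gets its own y-axis by default, so a heavily skewed "before" panel does not flatten the others.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_boxplots(series_by_label, sharey=False):
    """Draw one box plot per labelled Series, side by side.

    `series_by_label` maps a panel title (e.g. "original", "IQR")
    to the price-per-square-foot values for that variant.
    """
    labels = list(series_by_label)
    fig, axes = plt.subplots(
        1, len(labels), figsize=(4 * len(labels), 4),
        sharey=sharey, squeeze=False,
    )
    for ax, label in zip(axes[0], labels):
        # Drop missing values and pass a plain array to matplotlib.
        ax.boxplot(series_by_label[label].dropna().to_numpy())
        ax.set_title(label)
        ax.set_xticks([])
    axes[0][0].set_ylabel("price per square foot")
    fig.tight_layout()
    return fig
```

Passing a dictionary such as `{"original": s, "IQR": s_iqr, "percentile": s_pct}` produces one panel per method; call `plt.show()` or `fig.savefig(...)` afterwards to view the result.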
- Objective: Analyze the distribution of the "price per square foot" and apply transformations if necessary.
- Steps:
- Create a histogram (e.g., with seaborn's histplot) to visualize the distribution of the "price per square foot."
- Apply transformations (e.g., log transformation) if the data is skewed.
- Compute and compare skewness and kurtosis before and after the transformation.
- Objective: Analyze correlations between numerical columns and visualize them.
- Steps:
- Compute the correlation matrix for all numerical columns.
- Plot a heatmap to visualize these correlations.
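
A correlation matrix and heatmap could be produced along these lines. The function name `correlation_heatmap` is hypothetical; `DataFrame.corr()` computes Pearson correlation by default, and the README does not say whether a different coefficient was intended.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def correlation_heatmap(df):
    """Compute Pearson correlations of numeric columns and plot a heatmap."""
    # Restrict to numeric columns so text columns do not raise errors.
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(6, 5))
    # Fix the colour scale to [-1, 1] so colours are comparable across plots.
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax)
    ax.set_title("Correlation matrix")
    return corr, fig
```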
- Objective: Examine relationships between pairs of numerical variables.
- Steps:
- Create scatter plots to visualize the relationship between each pair of numerical variables.
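
A sketch of the pairwise scatter plots, using plain matplotlib, is shown below. The helper `scatter_pairs` is a hypothetical name; seaborn's `pairplot` would be a reasonable alternative that also draws the per-column distributions.

```python
from itertools import combinations

import matplotlib.pyplot as plt
import pandas as pd


def scatter_pairs(df):
    """Draw one scatter plot for every pair of numeric columns."""
    numeric = df.select_dtypes(include="number")
    pairs = list(combinations(numeric.columns, 2))
    if not pairs:
        raise ValueError("need at least two numeric columns")
    fig, axes = plt.subplots(
        1, len(pairs), figsize=(4 * len(pairs), 4), squeeze=False
    )
    for ax, (x, y) in zip(axes[0], pairs):
        ax.scatter(numeric[x], numeric[y], s=10, alpha=0.5)
        ax.set_xlabel(x)
        ax.set_ylabel(y)
    fig.tight_layout()
    return fig
```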
Ensure you have the following Python libraries installed:
- pandas
- numpy
- matplotlib
- seaborn
- scipy
Install these libraries using pip:
pip install pandas numpy matplotlib seaborn scipy