This project aims to analyze and predict energy consumption and greenhouse gas (GHG) emissions of buildings in Seattle. The primary goals are to provide insights into energy usage patterns and to identify key factors influencing GHG emissions. The dataset comprises various features related to building characteristics and energy consumption metrics.
The dataset used in this project includes detailed information on buildings in Seattle, encompassing residential and non-residential types. Key features include building size, location (latitude and longitude), and energy usage statistics. Two main target variables were analyzed:
- Site Energy Use WN (kBtu): Annual weather-normalized (WN) site energy use of the property, in thousand British thermal units.
- Total GHG Emissions: Total greenhouse gas emissions in metric tons of CO2 equivalent.
- Missing Data Analysis: Utilized `missingno` to visualize and address missing values.
- Distribution Analysis: Pie charts and bar plots were used to show the distribution of building types and energy consumption by neighborhood.
- Correlation Analysis: Scatter plots revealed a strong correlation between building size and energy consumption, with notable outliers such as medical and distribution centers.
- Handling Missing Data: Addressed missing values with appropriate imputation techniques.
- Outlier Detection: Used the interquartile range (IQR) method to identify and handle outliers.
- Categorical Encoding: Applied One Hot Encoding and Label Encoding based on the model requirements.
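
The IQR rule mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper name `iqr_outlier_mask`, the conventional fence multiplier `k = 1.5`, and the toy values are all assumptions made for the example.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical toy values standing in for an energy-use column.
energy = pd.Series(
    [100.0, 120.0, 110.0, 130.0, 125.0, 115.0, 5000.0],
    name="site_energy_use",
)

mask = iqr_outlier_mask(energy)
outliers = energy[mask]     # rows flagged as outliers
cleaned = energy[~mask]     # rows kept for modeling
```

Whether flagged rows are dropped, capped, or kept is a separate decision; the mask only identifies them.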
- New Features Creation: Created new variables to enhance model performance, including energy use per square meter and neighborhood-specific factors.
- Model Selection: Tested various regression models including Dummy Regressor, Random Forest Regressor, and Gradient Boosting Regressor.
- Hyperparameter Tuning: Employed GridSearchCV for hyperparameter optimization.
- Model Evaluation: Used cross-validation and performance metrics such as R2, RMSE, and MAE to evaluate models.
- Random Forest Regressor:
  - Criterion: Function used to measure the quality of a split
  - Max Depth: Maximum depth of each tree
  - Max Features: Number of features considered when looking for the best split
  - N_estimators: Number of trees in the forest
- Gradient Boosting Regressor:
  - Learning Rate: Shrinkage factor applied to the contribution of each tree
  - Loss: Loss function to be optimized
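
A grid search over some of the Random Forest hyperparameters listed above might look like the sketch below. It runs on synthetic data from `make_regression` rather than the Seattle dataset, and the grid values, the 5-fold setting, and the variable names are illustrative assumptions, not the values used in the notebook. The split criterion is left out of the grid to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the building features and target (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline: always predict the mean of the training target.
baseline_r2 = cross_val_score(
    DummyRegressor(strategy="mean"), X_train, y_train, cv=5, scoring="r2"
).mean()

# Illustrative grid; the values are assumptions, not the notebook's grid.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],
    "max_features": ["sqrt", 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring={
        "r2": "r2",
        "rmse": "neg_root_mean_squared_error",
        "mae": "neg_mean_absolute_error",
    },
    refit="r2",  # refit the best model according to cross-validated R2
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_cv_r2 = search.best_score_
test_r2 = search.score(X_test, y_test)  # R2 on the held-out split
```

scikit-learn reports RMSE and MAE as negated scores so that higher is always better; negate `search.cv_results_["mean_test_rmse"]` to read them as errors.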
Both ensemble models performed better than baseline models, with Random Forest showing slightly better generalization.
- Performance Metrics: Random Forest Regressor showed better performance with lower RMSE and higher R2 scores.
- Error Analysis: The models generally underestimated the targets, with the largest errors on high-consumption buildings. Log-transforming the targets made the errors easier to interpret.
- Variable Importance: Total building area and type/use of buildings were significant predictors. Energy Star Score was a crucial feature in improving model accuracy.
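
One way to apply a log transformation to the target, as mentioned in the error analysis above, is scikit-learn's `TransformedTargetRegressor`. The sketch below is an assumption about how this could be done, not the notebook's actual code; the synthetic `floor_area` and `energy` arrays are hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical data: one feature (floor area) and a right-skewed target.
floor_area = rng.uniform(1_000, 100_000, size=300)
energy = floor_area * 50 * rng.lognormal(mean=0.0, sigma=0.4, size=300)
X = floor_area.reshape(-1, 1)

# Fit on log1p(target); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, energy)
predictions = model.predict(X)  # back on the original scale

# Residuals on the log scale: positive values mean underestimation.
log_residuals = np.log1p(energy) - np.log1p(predictions)
```

On the log scale a residual reflects a relative error, so a very large building and a small one can be compared directly.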
The project successfully demonstrated the importance of feature engineering and model tuning in predicting energy consumption and GHG emissions. Future work could focus on expanding the dataset and exploring additional modeling techniques to further improve predictive performance.
- Clone the repository:
git clone <repo_url>
- Install dependencies:
pip install -r requirements.txt
- Run the Jupyter notebook:
jupyter notebook
- Follow the steps in the notebook to load data, perform analysis, and run models.
This project is licensed under the MIT License - see the LICENSE file for details.
- This project was completed as part of a data science training program.
- Special thanks to the instructors and peers for their valuable feedback and support.
Audrey Hohmann, OpenClassrooms student (Data Scientist training path)