diff --git a/Team JM 4-notebook .ipynb b/Team JM 4-notebook .ipynb new file mode 100644 index 00000000..0555eeac --- /dev/null +++ b/Team JM 4-notebook .ipynb @@ -0,0 +1,4787 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6c7e849a", + "metadata": { + "ExecuteTime": { + "end_time": "2021-06-11T09:24:53.643384Z", + "start_time": "2021-06-11T09:24:53.622385Z" + } + }, + "source": [ + "# Regression Predict Student Solution\n", + "\n", + "© Explore Data Science Academy\n", + "\n", + "---\n", + "### Honour Code\n", + "\n", + "I {**JOSEPH, OKONKWO**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).\n", + "\n", + "Non-compliance with the honour code constitutes a material breach of contract.\n", + "\n", + "### Predict Overview: Spain Electricity Shortfall Challenge\n", + "\n", + "The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:\n", + "\n", + "- 1. analyse the supplied data;\n", + "- 2. identify potential errors in the data and clean the existing data set;\n", + "- 3. determine if additional features can be added to enrich the data set;\n", + "- 4. build a model that is capable of forecasting the three-hourly demand shortfalls;\n", + "- 5. evaluate the accuracy of the best machine learning model;\n", + "- 6. determine what features were most important in the model’s prediction decision, and\n", + "- 7. 
explain the inner workings of the model to a non-technical audience.\n", + "\n", + "Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:\n", + "\n", + "> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.\n", + " \n", + "On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. " + ] + }, + { + "cell_type": "markdown", + "id": "05600c92", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Table of Contents\n", + "\n", + "1. Importing Packages\n", + "\n", + "2. Loading Data\n", + "\n", + "3. Exploratory Data Analysis (EDA)\n", + "\n", + "4. Data Engineering\n", + "\n", + "5. Modeling\n", + "\n", + "6. Model Performance\n", + "\n", + "7. Model Explanations" + ] + }, + { + "cell_type": "markdown", + "id": "997462e2", + "metadata": {}, + "source": [ + " \n", + "## 1. Importing Packages\n", + "Back to Table of Contents\n", + "\n", + "---\n", + " \n", + "| ⚡ Description: Importing Packages ⚡ |\n", + "| :--------------------------- |\n", + "| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. 
|\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "475dbe93", + "metadata": { + "ExecuteTime": { + "end_time": "2021-06-23T10:30:53.800892Z", + "start_time": "2021-06-23T10:30:50.215449Z" + } + }, + "outputs": [], + "source": [ + "# Libraries for data loading, data manipulation and data visualisation\n", + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib import rc\n", + "import seaborn as sns\n", + "from statsmodels.graphics.correlation import plot_corr\n", + "\n", + "\n", + "# Libraries for data preparation and model building\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.linear_model import Ridge\n", + "from sklearn.linear_model import Lasso\n", + "from sklearn.tree import DecisionTreeRegressor\n", + "from sklearn.ensemble import RandomForestRegressor\n", + "from sklearn.metrics import mean_squared_error\n", + "\n", + "# Setting global constants to ensure notebook results are reproducible\n", + "# PARAMETER_CONSTANT = ###" + ] + }, + { + "cell_type": "markdown", + "id": "f22a6718", + "metadata": {}, + "source": [ + "\n", + "## 2. Loading the Data\n", + "\n", + "Back to Table of Contents\n", + "\n", + "---\n", + " \n", + "| ⚡ Description: Loading the data ⚡ |\n", + "| :--------------------------- |\n", + "| In this section you are required to load the data from the `df_train` file into a DataFrame. 
|\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "fbbb6c18", + "metadata": { + "ExecuteTime": { + "end_time": "2021-06-28T08:49:35.311495Z", + "start_time": "2021-06-28T08:49:35.295494Z" + } + }, + "outputs": [], + "source": [ + "df_train = pd.read_csv(\"df_train.csv\")\n", + "df_test = pd.read_csv(\"df_test.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "81132ab3", + "metadata": {}, + "source": [ + "\n", + "## 3. Exploratory Data Analysis (EDA)\n", + "\n", + "Back to Table of Contents\n", + "\n", + "---\n", + " \n", + "| ⚡ Description: Exploratory data analysis ⚡ |\n", + "| :--------------------------- |\n", + "| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e6ef4be6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(8763, 49)" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# confirm the dimensions of the training dataset\n", + "df_train.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "f2b48d6e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['Unnamed: 0', 'time', 'Madrid_wind_speed', 'Valencia_wind_deg',\n", + " 'Bilbao_rain_1h', 'Valencia_wind_speed', 'Seville_humidity',\n", + " 'Madrid_humidity', 'Bilbao_clouds_all', 'Bilbao_wind_speed',\n", + " 'Seville_clouds_all', 'Bilbao_wind_deg', 'Barcelona_wind_speed',\n", + " 'Barcelona_wind_deg', 'Madrid_clouds_all', 'Seville_wind_speed',\n", + " 'Barcelona_rain_1h', 'Seville_pressure', 'Seville_rain_1h',\n", + " 'Bilbao_snow_3h', 'Barcelona_pressure', 'Seville_rain_3h',\n", + " 'Madrid_rain_1h', 'Barcelona_rain_3h', 'Valencia_snow_3h',\n", + " 'Madrid_weather_id', 'Barcelona_weather_id', 'Bilbao_pressure',\n", + " 'Seville_weather_id', 'Valencia_pressure', 'Seville_temp_max',\n", 
+ " 'Madrid_pressure', 'Valencia_temp_max', 'Valencia_temp',\n", + " 'Bilbao_weather_id', 'Seville_temp', 'Valencia_humidity',\n", + " 'Valencia_temp_min', 'Barcelona_temp_max', 'Madrid_temp_max',\n", + " 'Barcelona_temp', 'Bilbao_temp_min', 'Bilbao_temp',\n", + " 'Barcelona_temp_min', 'Bilbao_temp_max', 'Seville_temp_min',\n", + " 'Madrid_temp', 'Madrid_temp_min', 'load_shortfall_3h'],\n", + " dtype='object')" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# look at the column names \n", + "df_train.columns" + ] + }, + { + "cell_type": "markdown", + "id": "649decf4", + "metadata": {}, + "source": [ + "1. From the results of the preceding code, the dataset contains 8763 rows and 49 columns. The columns contain data on various weather features recorded every 3 hours in specific Spanish cities. The features, which include wind speed, wind direction, humidity, cloud cover, pressure, snow levels, weather ID, rain levels and temperature, determine to varying degrees the amount of renewable energy available for generation and, by extension, the load shortfall in relation to fossil fuel energy sources. To enable a much closer look at the data within each column, it is necessary to break up the dataframe along the column axis into a number of parts." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "a7bad589", + "metadata": {}, + "outputs": [], + "source": [ + "# transpose data to enable full view of data\n", + "df_train_trans = df_train.transpose(copy = True)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "93368695", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " count mean std min \\\n", + "Unnamed: 0 8763.0 4381.000000 2529.804538 0.000000 \n", + "Madrid_wind_speed 8763.0 2.425729 1.850371 0.000000 \n", + "Bilbao_rain_1h 8763.0 0.135753 0.374901 0.000000 \n", + "Valencia_wind_speed 8763.0 2.586272 2.411190 0.000000 \n", + "Seville_humidity 8763.0 62.658793 22.621226 8.333333 \n", + "Madrid_humidity 8763.0 57.414717 24.335396 6.333333 \n", + "Bilbao_clouds_all 8763.0 43.469132 32.551044 0.000000 \n", + "Bilbao_wind_speed 8763.0 1.850356 1.695888 0.000000 \n", + "Seville_clouds_all 8763.0 13.714748 24.272482 0.000000 \n", + "Bilbao_wind_deg 8763.0 158.957511 102.056299 0.000000 \n", + "Barcelona_wind_speed 8763.0 2.870497 1.792197 0.000000 \n", + "Barcelona_wind_deg 8763.0 190.544848 89.077337 0.000000 \n", + "Madrid_clouds_all 8763.0 19.473392 28.053660 0.000000 \n", + "Seville_wind_speed 8763.0 2.425045 1.672895 0.000000 \n", + "Barcelona_rain_1h 8763.0 0.128906 0.634730 0.000000 \n", + "Seville_rain_1h 8763.0 0.039439 0.175857 0.000000 \n", + "Bilbao_snow_3h 8763.0 0.031912 0.557264 0.000000 \n", + "Barcelona_pressure 8763.0 1377.964605 14073.140990 670.666667 \n", + "Seville_rain_3h 8763.0 0.000243 0.003660 0.000000 \n", + "Madrid_rain_1h 8763.0 0.037818 0.152639 0.000000 \n", + "Barcelona_rain_3h 8763.0 0.000439 0.003994 0.000000 \n", + "Valencia_snow_3h 8763.0 0.000205 0.011866 0.000000 \n", + "Madrid_weather_id 8763.0 773.527594 77.313315 211.000000 \n", + "Barcelona_weather_id 8763.0 765.979687 88.142235 200.666667 \n", + "Bilbao_pressure 8763.0 1017.739549 10.046124 971.333333 \n", + "Seville_weather_id 8763.0 774.658818 71.940009 200.000000 \n", + "Valencia_pressure 6695.0 1012.051407 9.506214 972.666667 \n", + "Seville_temp_max 8763.0 297.479527 8.875812 272.063000 \n", + "Madrid_pressure 8763.0 1010.316920 22.198555 927.666667 \n", + "Valencia_temp_max 8763.0 291.337233 7.565692 269.888000 \n", + "Valencia_temp 8763.0 290.592152 7.162274 269.888000 \n", + "Bilbao_weather_id 8763.0 
724.722362 115.846537 207.333333 \n", + "Seville_temp 8763.0 293.978903 7.920986 272.063000 \n", + "Valencia_humidity 8763.0 65.247727 19.262322 10.333333 \n", + "Valencia_temp_min 8763.0 289.867648 6.907402 269.888000 \n", + "Barcelona_temp_max 8763.0 291.157644 7.273538 272.150000 \n", + "Madrid_temp_max 8763.0 289.540309 9.752047 264.983333 \n", + "Barcelona_temp 8763.0 289.855459 6.528111 270.816667 \n", + "Bilbao_temp_min 8763.0 285.017973 6.705672 264.483333 \n", + "Bilbao_temp 8763.0 286.422929 6.818682 267.483333 \n", + "Barcelona_temp_min 8763.0 288.447422 6.102593 269.483333 \n", + "Bilbao_temp_max 8763.0 287.966027 7.105590 269.063000 \n", + "Seville_temp_min 8763.0 291.633356 8.178220 270.150000 \n", + "Madrid_temp 8763.0 288.419439 9.346796 264.983333 \n", + "Madrid_temp_min 8763.0 287.202203 9.206237 264.983333 \n", + "load_shortfall_3h 8763.0 10673.857612 5218.046404 -6618.000000 \n", + "\n", + " 25% 50% 75% max \n", + "Unnamed: 0 2190.500000 4381.000000 6571.500000 8.762000e+03 \n", + "Madrid_wind_speed 1.000000 2.000000 3.333333 1.300000e+01 \n", + "Bilbao_rain_1h 0.000000 0.000000 0.100000 3.000000e+00 \n", + "Valencia_wind_speed 1.000000 1.666667 3.666667 5.200000e+01 \n", + "Seville_humidity 44.333333 65.666667 82.000000 1.000000e+02 \n", + "Madrid_humidity 36.333333 58.000000 78.666667 1.000000e+02 \n", + "Bilbao_clouds_all 10.000000 45.000000 75.000000 1.000000e+02 \n", + "Bilbao_wind_speed 0.666667 1.000000 2.666667 1.266667e+01 \n", + "Seville_clouds_all 0.000000 0.000000 20.000000 9.733333e+01 \n", + "Bilbao_wind_deg 73.333333 147.000000 234.000000 3.593333e+02 \n", + "Barcelona_wind_speed 1.666667 2.666667 4.000000 1.266667e+01 \n", + "Barcelona_wind_deg 118.166667 200.000000 260.000000 3.600000e+02 \n", + "Madrid_clouds_all 0.000000 0.000000 33.333333 1.000000e+02 \n", + "Seville_wind_speed 1.000000 2.000000 3.333333 1.166667e+01 \n", + "Barcelona_rain_1h 0.000000 0.000000 0.000000 1.200000e+01 \n", + "Seville_rain_1h 0.000000 0.000000 
0.000000 3.000000e+00 \n", + "Bilbao_snow_3h 0.000000 0.000000 0.000000 2.130000e+01 \n", + "Barcelona_pressure 1014.000000 1018.000000 1022.000000 1.001411e+06 \n", + "Seville_rain_3h 0.000000 0.000000 0.000000 9.333333e-02 \n", + "Madrid_rain_1h 0.000000 0.000000 0.000000 3.000000e+00 \n", + "Barcelona_rain_3h 0.000000 0.000000 0.000000 9.300000e-02 \n", + "Valencia_snow_3h 0.000000 0.000000 0.000000 7.916667e-01 \n", + "Madrid_weather_id 800.000000 800.000000 800.666667 8.040000e+02 \n", + "Barcelona_weather_id 800.000000 800.333333 801.000000 8.040000e+02 \n", + "Bilbao_pressure 1013.000000 1019.000000 1024.000000 1.042000e+03 \n", + "Seville_weather_id 800.000000 800.000000 800.000000 8.040000e+02 \n", + "Valencia_pressure 1010.333333 1015.000000 1018.000000 1.021667e+03 \n", + "Seville_temp_max 291.312750 297.101667 304.150000 3.204833e+02 \n", + "Madrid_pressure 1012.333333 1017.333333 1022.000000 1.038000e+03 \n", + "Valencia_temp_max 285.550167 291.037000 297.248333 3.142633e+02 \n", + "Valencia_temp 285.150000 290.176667 296.056667 3.104267e+02 \n", + "Bilbao_weather_id 700.333333 800.000000 801.666667 8.040000e+02 \n", + "Seville_temp 288.282917 293.323333 299.620333 3.149767e+02 \n", + "Valencia_humidity 51.333333 67.000000 81.333333 1.000000e+02 \n", + "Valencia_temp_min 284.783333 289.550000 294.820000 3.102720e+02 \n", + "Barcelona_temp_max 285.483333 290.150000 296.855000 3.140767e+02 \n", + "Madrid_temp_max 282.150000 288.116177 296.816667 3.144833e+02 \n", + "Barcelona_temp 284.973443 289.416667 294.909000 3.073167e+02 \n", + "Bilbao_temp_min 280.085167 284.816667 289.816667 3.098167e+02 \n", + "Bilbao_temp 281.374167 286.158333 291.034167 3.107100e+02 \n", + "Barcelona_temp_min 284.150000 288.150000 292.966667 3.048167e+02 \n", + "Bilbao_temp_max 282.836776 287.630000 292.483333 3.179667e+02 \n", + "Seville_temp_min 285.816667 290.816667 297.150000 3.148167e+02 \n", + "Madrid_temp 281.404281 287.053333 295.154667 3.131333e+02 \n", + 
"Madrid_temp_min 280.299167 286.083333 293.884500 3.103833e+02 \n", + "load_shortfall_3h 7390.333333 11114.666667 14498.166667 3.190400e+04 " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# provide full view of summary statistics of data features \n", + "df_train.describe().transpose(copy = True)\n" + ] + }, + { + "cell_type": "markdown", + "id": "fd597dce", + "metadata": {}, + "source": [ + "2. It is clear from the views of the data above and the accompanying statistical summaries that some recorded features are cause for concern. First, the column 'Valencia_wind_deg', which like the corresponding data from other cities is supposed to carry wind direction measured in degrees, contains categorical data measured in levels. Second, the columns holding hourly rain data for various cities seem incompatible with the recording schedule, which is every 3 hours. Third, the weather_id columns need to be understood more clearly to establish how they relate to the corresponding combination of weather features for each city. We take a closer look at these features shortly. \n", + "\n", + "3. For other features such as rain, pressure and wind, we have to plot feature interactions to tell each feature's influence on load shortfall, if any. To do this, we will need to produce a condensed dataframe in which each column contains the average of one weather feature across all cities. Since we are concerned with the shortfall of the whole country and not of each city, it is reasonable to analyse, for instance, the mean rainfall, mean temperature or mean pressure against each observation of load shortfall for the whole of Spain. Moreover, we expect the values of a feature to correlate strongly across the given cities, which would cause multicollinearity. But before going into all that, we start with a little data engineering. 
We have to ensure there are no missing values in the datasets and that all features are numerical in dtype. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "ab18dffb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 8763 entries, 0 to 8762\n", + "Data columns (total 49 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 Unnamed: 0 8763 non-null int64 \n", + " 1 time 8763 non-null object \n", + " 2 Madrid_wind_speed 8763 non-null float64\n", + " 3 Valencia_wind_deg 8763 non-null object \n", + " 4 Bilbao_rain_1h 8763 non-null float64\n", + " 5 Valencia_wind_speed 8763 non-null float64\n", + " 6 Seville_humidity 8763 non-null float64\n", + " 7 Madrid_humidity 8763 non-null float64\n", + " 8 Bilbao_clouds_all 8763 non-null float64\n", + " 9 Bilbao_wind_speed 8763 non-null float64\n", + " 10 Seville_clouds_all 8763 non-null float64\n", + " 11 Bilbao_wind_deg 8763 non-null float64\n", + " 12 Barcelona_wind_speed 8763 non-null float64\n", + " 13 Barcelona_wind_deg 8763 non-null float64\n", + " 14 Madrid_clouds_all 8763 non-null float64\n", + " 15 Seville_wind_speed 8763 non-null float64\n", + " 16 Barcelona_rain_1h 8763 non-null float64\n", + " 17 Seville_pressure 8763 non-null object \n", + " 18 Seville_rain_1h 8763 non-null float64\n", + " 19 Bilbao_snow_3h 8763 non-null float64\n", + " 20 Barcelona_pressure 8763 non-null float64\n", + " 21 Seville_rain_3h 8763 non-null float64\n", + " 22 Madrid_rain_1h 8763 non-null float64\n", + " 23 Barcelona_rain_3h 8763 non-null float64\n", + " 24 Valencia_snow_3h 8763 non-null float64\n", + " 25 Madrid_weather_id 8763 non-null float64\n", + " 26 Barcelona_weather_id 8763 non-null float64\n", + " 27 Bilbao_pressure 8763 non-null float64\n", + " 28 Seville_weather_id 8763 non-null float64\n", + " 29 Valencia_pressure 6695 non-null float64\n", + " 30 Seville_temp_max 8763 non-null float64\n", + 
" 31 Madrid_pressure 8763 non-null float64\n", + " 32 Valencia_temp_max 8763 non-null float64\n", + " 33 Valencia_temp 8763 non-null float64\n", + " 34 Bilbao_weather_id 8763 non-null float64\n", + " 35 Seville_temp 8763 non-null float64\n", + " 36 Valencia_humidity 8763 non-null float64\n", + " 37 Valencia_temp_min 8763 non-null float64\n", + " 38 Barcelona_temp_max 8763 non-null float64\n", + " 39 Madrid_temp_max 8763 non-null float64\n", + " 40 Barcelona_temp 8763 non-null float64\n", + " 41 Bilbao_temp_min 8763 non-null float64\n", + " 42 Bilbao_temp 8763 non-null float64\n", + " 43 Barcelona_temp_min 8763 non-null float64\n", + " 44 Bilbao_temp_max 8763 non-null float64\n", + " 45 Seville_temp_min 8763 non-null float64\n", + " 46 Madrid_temp 8763 non-null float64\n", + " 47 Madrid_temp_min 8763 non-null float64\n", + " 48 load_shortfall_3h 8763 non-null float64\n", + "dtypes: float64(45), int64(1), object(3)\n", + "memory usage: 3.3+ MB\n" + ] + } + ], + "source": [ + "# Examine columns with missing values and data types\n", + "df_train.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "1146cc05", + "metadata": {}, + "outputs": [], + "source": [ + "# replace missing values with the most frequent value in 'Valencia_pressure'\n", + "df_train['Valencia_pressure'] = df_train['Valencia_pressure'].fillna(df_train['Valencia_pressure'].mode()[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "384e6694", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " time Valencia_wind_deg Seville_pressure\n", + "0 2015-01-01 03:00:00 level_5 sp25\n", + "1 2015-01-01 06:00:00 level_10 sp25\n", + "2 2015-01-01 09:00:00 level_9 sp25\n", + "3 2015-01-01 12:00:00 level_8 sp25\n", + "4 2015-01-01 15:00:00 level_7 sp25\n", + "... ... ... ...\n", + "8758 2017-12-31 09:00:00 level_6 sp23\n", + "8759 2017-12-31 12:00:00 level_6 sp23\n", + "8760 2017-12-31 15:00:00 level_9 sp22\n", + "8761 2017-12-31 18:00:00 level_8 sp23\n", + "8762 2017-12-31 21:00:00 level_9 sp25\n", + "\n", + "[8763 rows x 3 columns]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# examine the 3 non-numerical columns \n", + "cols = [col for col in df_train.columns if df_train[col].dtype not in ['int64', 'float64']]\n", + "df_train[cols]" + ] + }, + { + "cell_type": "markdown", + "id": "4b21baf9", + "metadata": {}, + "source": [ + "4. Following from the above results, the 'time' column can easily be converted to the 'datetime' data type and further features engineered from it. The other two features pose a bigger problem. They look like alternative notations for angle and pressure: perhaps the whole azimuth plane is divided into levels, and maybe 'sp' stands for static pressure or standard pressure, where 1 sp would be about 1 atm (roughly 100 kPa), especially since temperatures were most likely given in kelvin. Let us go ahead and examine the unique values of these features." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "cc3dc185", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),\n", + " [Text(0, 0, 'level_5'),\n", + " Text(1, 0, 'level_10'),\n", + " Text(2, 0, 'level_9'),\n", + " Text(3, 0, 'level_8'),\n", + " Text(4, 0, 'level_7'),\n", + " Text(5, 0, 'level_6'),\n", + " Text(6, 0, 'level_4'),\n", + " Text(7, 0, 'level_3'),\n", + " Text(8, 0, 'level_1'),\n", + " Text(9, 0, 'level_2')])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "sns.countplot(x = 'Valencia_wind_deg', data = df_train, palette=\"hls\")\n", + "plt.title(\"Distribution of wind direction\")\n", + "plt.xticks(rotation = 45)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "66d61bf1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", + " 17, 18, 19, 20, 21, 22, 23, 24]),\n", + " [Text(0, 0, 'sp25'),\n", + " Text(1, 0, 'sp23'),\n", + " Text(2, 0, 'sp24'),\n", + " Text(3, 0, 'sp21'),\n", + " Text(4, 0, 'sp16'),\n", + " Text(5, 0, 'sp9'),\n", + " Text(6, 0, 'sp15'),\n", + " Text(7, 0, 'sp19'),\n", + " Text(8, 0, 'sp22'),\n", + " Text(9, 0, 'sp11'),\n", + " Text(10, 0, 'sp8'),\n", + " Text(11, 0, 'sp4'),\n", + " Text(12, 0, 'sp6'),\n", + " Text(13, 0, 'sp13'),\n", + " Text(14, 0, 'sp17'),\n", + " Text(15, 0, 'sp20'),\n", + " Text(16, 0, 'sp18'),\n", + " Text(17, 0, 'sp14'),\n", + " Text(18, 0, 'sp12'),\n", + " Text(19, 0, 'sp5'),\n", + " Text(20, 0, 'sp10'),\n", + " Text(21, 0, 'sp7'),\n", + " Text(22, 0, 'sp3'),\n", + " Text(23, 0, 'sp2'),\n", + " Text(24, 0, 'sp1')])" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "sns.countplot(x = 'Seville_pressure', data = df_train, palette=\"hls\")\n", + "plt.title(\"Distribution of Seville_pressure\")\n", + "plt.xticks(rotation = 45)" + ] + }, + { + "cell_type": "markdown", + "id": "f9336086", + "metadata": {}, + "source": [ + "5. The above plots suggest dropping both columns altogether. At best, each level in the 'Valencia_wind_deg' column provides only a range of angles rather than a specific wind direction. Reading the 'sp' notation in 'Seville_pressure' as multiples of standard pressure would require multiplying each value by 100000 Pa, which gives unrealistic atmospheric pressure values (for example, sp25 would be 2.5 MPa). Thereafter, we can go ahead, as discussed earlier, and average the values of each weather feature across all cities, storing the resulting averages in a new dataframe. Averaging should also reduce the effect of the dropped columns. The new dataframe will also contain the time data from df_train but will leave out the 'Unnamed: 0' column, since it is just a replica of the index." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "4d8b9e34", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Unnamed: 0 time Madrid_wind_speed Bilbao_rain_1h \\\n", + "0 0 2015-01-01 03:00:00 0.666667 0.0 \n", + "1 1 2015-01-01 06:00:00 0.333333 0.0 \n", + "2 2 2015-01-01 09:00:00 1.000000 0.0 \n", + "3 3 2015-01-01 12:00:00 1.000000 0.0 \n", + "4 4 2015-01-01 15:00:00 1.000000 0.0 \n", + "... ... ... ... ... \n", + "8758 8758 2017-12-31 09:00:00 1.000000 0.0 \n", + "8759 8759 2017-12-31 12:00:00 5.000000 0.0 \n", + "8760 8760 2017-12-31 15:00:00 6.333333 0.4 \n", + "8761 8761 2017-12-31 18:00:00 7.333333 0.2 \n", + "8762 8762 2017-12-31 21:00:00 4.333333 0.0 \n", + "\n", + " Valencia_wind_speed Seville_humidity Madrid_humidity \\\n", + "0 0.666667 74.333333 64.000000 \n", + "1 1.666667 78.333333 64.666667 \n", + "2 1.000000 71.333333 64.333333 \n", + "3 1.000000 65.333333 56.333333 \n", + "4 1.000000 59.000000 57.000000 \n", + "... ... ... ... \n", + "8758 2.666667 89.000000 95.666667 \n", + "8759 2.000000 82.000000 85.000000 \n", + "8760 7.333333 67.666667 71.000000 \n", + "8761 7.333333 67.666667 79.000000 \n", + "8762 7.000000 78.666667 68.666667 \n", + "\n", + " Bilbao_clouds_all Bilbao_wind_speed Seville_clouds_all ... \\\n", + "0 0.000000 1.000000 0.000000 ... \n", + "1 0.000000 1.000000 0.000000 ... \n", + "2 0.000000 1.000000 0.000000 ... \n", + "3 0.000000 1.000000 0.000000 ... \n", + "4 2.000000 0.333333 0.000000 ... \n", + "... ... ... ... ... \n", + "8758 56.666667 4.333333 80.000000 ... \n", + "8759 26.666667 8.000000 75.000000 ... \n", + "8760 63.333333 8.333333 33.333333 ... \n", + "8761 63.333333 2.666667 51.666667 ... \n", + "8762 20.000000 1.666667 33.333333 ... \n", + "\n", + " Madrid_temp_max Barcelona_temp Bilbao_temp_min Bilbao_temp \\\n", + "0 265.938000 281.013000 269.338615 269.338615 \n", + "1 266.386667 280.561667 270.376000 270.376000 \n", + "2 272.708667 281.583667 275.027229 275.027229 \n", + "3 281.895219 283.434104 281.135063 281.135063 \n", + "4 280.678437 284.213167 282.252063 282.252063 \n", + "... ... ... 
... ... \n", + "8758 280.816667 281.276667 285.150000 287.573333 \n", + "8759 283.483333 287.483333 286.483333 288.616667 \n", + "8760 285.150000 289.816667 283.816667 285.330000 \n", + "8761 283.483333 287.523333 278.816667 281.410000 \n", + "8762 282.150000 287.483333 276.816667 281.020000 \n", + "\n", + " Barcelona_temp_min Bilbao_temp_max Seville_temp_min Madrid_temp \\\n", + "0 281.013000 269.338615 274.254667 265.938000 \n", + "1 280.561667 270.376000 274.945000 266.386667 \n", + "2 281.583667 275.027229 278.792000 272.708667 \n", + "3 283.434104 281.135063 285.394000 281.895219 \n", + "4 284.213167 282.252063 285.513719 280.678437 \n", + "... ... ... ... ... \n", + "8758 280.483333 290.150000 284.816667 279.686667 \n", + "8759 287.150000 291.150000 287.150000 282.400000 \n", + "8760 289.150000 286.816667 289.150000 283.956667 \n", + "8761 286.816667 284.150000 289.150000 282.666667 \n", + "8762 287.150000 285.150000 287.483333 281.396667 \n", + "\n", + " Madrid_temp_min load_shortfall_3h \n", + "0 265.938000 6715.666667 \n", + "1 266.386667 4171.666667 \n", + "2 272.708667 4274.666667 \n", + "3 281.895219 5075.666667 \n", + "4 280.678437 6620.666667 \n", + "... ... ... 
\n", + "8758 278.483333 -28.333333 \n", + "8759 280.150000 2266.666667 \n", + "8760 281.150000 822.000000 \n", + "8761 280.816667 -760.000000 \n", + "8762 280.483333 2780.666667 \n", + "\n", + "[8763 rows x 47 columns]" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# drop columns\n", + "df = df_train.drop(columns = ['Valencia_wind_deg','Seville_pressure'])\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "f0d8d9fc", + "metadata": {}, + "outputs": [], + "source": [ + "# create subsets of training data in terms of weather feature\n", + "df_wind_speed = df[['Madrid_wind_speed','Valencia_wind_speed','Bilbao_wind_speed','Barcelona_wind_speed','Seville_wind_speed']]\n", + "df_wind_deg = df[['Bilbao_wind_deg','Barcelona_wind_deg']]\n", + "df_humidity = df[['Seville_humidity','Madrid_humidity','Valencia_humidity']]\n", + "df_rain = df[['Bilbao_rain_1h','Barcelona_rain_1h','Seville_rain_1h','Seville_rain_3h','Madrid_rain_1h','Barcelona_rain_3h']]\n", + "df_clouds_all = df[['Bilbao_clouds_all','Seville_clouds_all','Madrid_clouds_all']]\n", + "df_pressure = df[['Barcelona_pressure','Bilbao_pressure','Valencia_pressure','Madrid_pressure']]\n", + "df_snow = df[['Bilbao_snow_3h','Valencia_snow_3h']]\n", + "df_weather_id = df[['Madrid_weather_id','Barcelona_weather_id','Seville_weather_id','Bilbao_weather_id']]\n", + "df_temp_min = df[['Valencia_temp_min','Bilbao_temp_min','Barcelona_temp_min','Seville_temp_min','Madrid_temp_min']]\n", + "df_temp = df[['Barcelona_temp','Bilbao_temp','Madrid_temp','Seville_temp','Valencia_temp']]\n", + "df_temp_max = df[['Barcelona_temp_max','Bilbao_temp_max','Madrid_temp_max','Seville_temp_max','Valencia_temp_max']]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "0b988212", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Madrid_wind_speed Valencia_wind_speed Bilbao_wind_speed \\\n", + "count 8763.000000 8763.000000 8763.000000 \n", + "mean 2.425729 2.586272 1.850356 \n", + "std 1.850371 2.411190 1.695888 \n", + "min 0.000000 0.000000 0.000000 \n", + "25% 1.000000 1.000000 0.666667 \n", + "50% 2.000000 1.666667 1.000000 \n", + "75% 3.333333 3.666667 2.666667 \n", + "max 13.000000 52.000000 12.666667 \n", + "\n", + " Barcelona_wind_speed Seville_wind_speed \n", + "count 8763.000000 8763.000000 \n", + "mean 2.870497 2.425045 \n", + "std 1.792197 1.672895 \n", + "min 0.000000 0.000000 \n", + "25% 1.666667 1.000000 \n", + "50% 2.666667 2.000000 \n", + "75% 4.000000 3.333333 \n", + "max 12.666667 11.666667 " + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_wind_speed.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "597c6592", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Bilbao_wind_deg Barcelona_wind_deg\n", + "count 8763.000000 8763.000000\n", + "mean 158.957511 190.544848\n", + "std 102.056299 89.077337\n", + "min 0.000000 0.000000\n", + "25% 73.333333 118.166667\n", + "50% 147.000000 200.000000\n", + "75% 234.000000 260.000000\n", + "max 359.333333 360.000000" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_wind_deg.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "9dd544b7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Seville_humidity Madrid_humidity Valencia_humidity\n", + "count 8763.000000 8763.000000 8763.000000\n", + "mean 62.658793 57.414717 65.247727\n", + "std 22.621226 24.335396 19.262322\n", + "min 8.333333 6.333333 10.333333\n", + "25% 44.333333 36.333333 51.333333\n", + "50% 65.666667 58.000000 67.000000\n", + "75% 82.000000 78.666667 81.333333\n", + "max 100.000000 100.000000 100.000000" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_humidity.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "970f53c6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Bilbao_rain_1h Barcelona_rain_1h Seville_rain_1h Seville_rain_3h \\\n", + "count 8763.000000 8763.000000 8763.000000 8763.000000 \n", + "mean 0.135753 0.128906 0.039439 0.000243 \n", + "std 0.374901 0.634730 0.175857 0.003660 \n", + "min 0.000000 0.000000 0.000000 0.000000 \n", + "25% 0.000000 0.000000 0.000000 0.000000 \n", + "50% 0.000000 0.000000 0.000000 0.000000 \n", + "75% 0.100000 0.000000 0.000000 0.000000 \n", + "max 3.000000 12.000000 3.000000 0.093333 \n", + "\n", + " Madrid_rain_1h Barcelona_rain_3h \n", + "count 8763.000000 8763.000000 \n", + "mean 0.037818 0.000439 \n", + "std 0.152639 0.003994 \n", + "min 0.000000 0.000000 \n", + "25% 0.000000 0.000000 \n", + "50% 0.000000 0.000000 \n", + "75% 0.000000 0.000000 \n", + "max 3.000000 0.093000 " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_rain.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "40e188c6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Barcelona_pressure Bilbao_pressure Valencia_pressure Madrid_pressure\n", + "count 8.763000e+03 8763.000000 8763.000000 8763.000000\n", + "mean 1.377965e+03 1017.739549 1013.455228 1010.316920\n", + "std 1.407314e+04 10.046124 8.684485 22.198555\n", + "min 6.706667e+02 971.333333 972.666667 927.666667\n", + "25% 1.014000e+03 1013.000000 1012.666667 1012.333333\n", + "50% 1.018000e+03 1019.000000 1017.000000 1017.333333\n", + "75% 1.022000e+03 1024.000000 1018.000000 1022.000000\n", + "max 1.001411e+06 1042.000000 1021.666667 1038.000000" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_pressure.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "def61105", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Bilbao_snow_3h Valencia_snow_3h\n", + "count 8763.000000 8763.000000\n", + "mean 0.031912 0.000205\n", + "std 0.557264 0.011866\n", + "min 0.000000 0.000000\n", + "25% 0.000000 0.000000\n", + "50% 0.000000 0.000000\n", + "75% 0.000000 0.000000\n", + "max 21.300000 0.791667" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_snow.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "7a0af639", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Barcelona_temp Bilbao_temp Madrid_temp Seville_temp Valencia_temp\n", + "count 8763.000000 8763.000000 8763.000000 8763.000000 8763.000000\n", + "mean 289.855459 286.422929 288.419439 293.978903 290.592152\n", + "std 6.528111 6.818682 9.346796 7.920986 7.162274\n", + "min 270.816667 267.483333 264.983333 272.063000 269.888000\n", + "25% 284.973443 281.374167 281.404281 288.282917 285.150000\n", + "50% 289.416667 286.158333 287.053333 293.323333 290.176667\n", + "75% 294.909000 291.034167 295.154667 299.620333 296.056667\n", + "max 307.316667 310.710000 313.133333 314.976667 310.426667" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_temp.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "d83775ae", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Barcelona_temp_max Bilbao_temp_max Madrid_temp_max Seville_temp_max \\\n", + "count 8763.000000 8763.000000 8763.000000 8763.000000 \n", + "mean 291.157644 287.966027 289.540309 297.479527 \n", + "std 7.273538 7.105590 9.752047 8.875812 \n", + "min 272.150000 269.063000 264.983333 272.063000 \n", + "25% 285.483333 282.836776 282.150000 291.312750 \n", + "50% 290.150000 287.630000 288.116177 297.101667 \n", + "75% 296.855000 292.483333 296.816667 304.150000 \n", + "max 314.076667 317.966667 314.483333 320.483333 \n", + "\n", + " Valencia_temp_max \n", + "count 8763.000000 \n", + "mean 291.337233 \n", + "std 7.565692 \n", + "min 269.888000 \n", + "25% 285.550167 \n", + "50% 291.037000 \n", + "75% 297.248333 \n", + "max 314.263333 " + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_temp_max.describe()" + ] + }, + { + "cell_type": "markdown", + "id": "4ac735f3", + "metadata": {}, + "source": [ + "6. We have paused here to examine these datasets above to examine how some weather features are distributed across the cities. This may help us observe any anormalies within the data or inform the decisions regarding possible location of renewable energy infrastructure. It can be observed for instance that the city of Seville is observed to have recorded the highest temperature levels; Seville provides the highest potential of supplying the highest amounts of solar power. The rain data collected from some cities are observed to be have been denoted as '1h' in constrast with the load shortfall which is denoted '3h'. We may take off this data as they may be a source of distortion in our machine learning process. We can also deduce that Valencia recorded the highest windspeed of 52 units. However, the value suggest the presence of a contextual or point outlier when compared to other maximums. 
In the same vein, Barcelona appears to have the highest pressure overall, but unlike Valencia's maximum wind speed, its maximum value is almost certainly a point outlier because a pressure of about 1 million units at sea level is physically implausible. More on these outliers later. For now, we may move ahead with putting all these mean feature data into one dataset, but not before removing the 1-hourly rain data." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "82e89cbf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Seville_rain_3h Barcelona_rain_3h\n", + "0 0.0 0.0\n", + "1 0.0 0.0\n", + "2 0.0 0.0\n", + "3 0.0 0.0\n", + "4 0.0 0.0\n", + "... ... ...\n", + "8758 0.0 0.0\n", + "8759 0.0 0.0\n", + "8760 0.0 0.0\n", + "8761 0.0 0.0\n", + "8762 0.0 0.0\n", + "\n", + "[8763 rows x 2 columns]" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# drop 1h rain feature data\n", + "df_rain = df_rain.drop(columns = ['Bilbao_rain_1h','Barcelona_rain_1h','Seville_rain_1h', 'Madrid_rain_1h'])\n", + "df_rain" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "8ba3a192", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 272.086456\n", + "1 272.799533\n", + "2 277.224046\n", + "3 283.351587\n", + "4 283.134500\n", + " ... \n", + "8758 284.216667\n", + "8759 288.550000\n", + "8760 288.927933\n", + "8761 287.550000\n", + "8762 286.750000\n", + "Length: 8763, dtype: float64" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# append extra column with mean value of weather feature across all cities\n", + "ave_wind_speed = df_wind_speed.mean(axis = 1)\n", + "ave_wind_deg = df_wind_deg.mean(axis = 1)\n", + "ave_humidity = df_humidity.mean(axis = 1)\n", + "ave_rain = df_rain.mean(axis = 1)\n", + "ave_clouds_all = df_clouds_all.mean(axis = 1)\n", + "ave_pressure = df_pressure.mean(axis = 1)\n", + "ave_snow = df_snow.mean(axis = 1)\n", + "ave_weather_id = df_weather_id.mean(axis = 1)\n", + "ave_temp_min = df_temp_min.mean(axis = 1)\n", + "ave_temp = df_temp.mean(axis = 1)\n", + "ave_temp_max = df_temp_max.mean(axis = 1)\n", + "ave_temp_max" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "84218394", + "metadata": {}, + "outputs": [], + "source": [ + " # creating a time feature and load_shortfall series\n", + "time = df['time']\n", + "load_shortfall_3h = df['load_shortfall_3h']" + ] + }, + { 
+ "cell_type": "code", + "execution_count": 25, + "id": "8951b17d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Time Ave_weather_id Ave_wind_speed Ave_wind_deg \\\n", + "0 2015-01-01 03:00:00 800.000000 2.400000 133.000000 \n", + "1 2015-01-01 06:00:00 800.000000 2.066667 180.000000 \n", + "2 2015-01-01 09:00:00 800.000000 1.533333 270.166667 \n", + "3 2015-01-01 12:00:00 800.000000 1.866667 236.333333 \n", + "4 2015-01-01 15:00:00 800.000000 1.933333 222.500000 \n", + "... ... ... ... ... \n", + "8758 2017-12-31 09:00:00 775.083333 2.133333 155.166667 \n", + "8759 2017-12-31 12:00:00 791.833333 3.933333 216.666667 \n", + "8760 2017-12-31 15:00:00 726.500000 6.200000 270.000000 \n", + "8761 2017-12-31 18:00:00 684.125000 5.400000 235.000000 \n", + "8762 2017-12-31 21:00:00 800.666667 3.800000 205.000000 \n", + "\n", + " Ave_humidity Ave_rain Ave_clouds_all Ave_pressure Ave_snow \\\n", + "0 71.333333 0.0 0.000000 1011.333333 0.0 \n", + "1 71.333333 0.0 0.000000 1012.500000 0.0 \n", + "2 67.111111 0.0 0.000000 1013.333333 0.0 \n", + "3 58.555556 0.0 0.000000 1019.166667 0.0 \n", + "4 58.111111 0.0 0.666667 1030.916667 0.0 \n", + "... ... ... ... ... ... \n", + "8758 85.333333 0.0 60.555556 1020.166667 0.0 \n", + "8759 69.111111 0.0 62.222222 1019.750000 0.0 \n", + "8760 61.111111 0.0 60.555556 1016.083333 0.0 \n", + "8761 63.888889 0.0 57.222222 1019.583333 0.0 \n", + "8762 65.777778 0.0 17.777778 1021.250000 0.0 \n", + "\n", + " Ave_temp_min Ave_temp Ave_temp_max load_shortfall_3h \n", + "0 272.086456 272.086456 272.086456 6715.666667 \n", + "1 272.799533 272.799533 272.799533 4171.666667 \n", + "2 277.224046 277.224046 277.224046 4274.666667 \n", + "3 283.351587 283.351587 283.351587 5075.666667 \n", + "4 283.134500 283.134500 283.134500 6620.666667 \n", + "... ... ... ... ... 
\n", + "8758 282.283333 283.219333 284.216667 -28.333333 \n", + "8759 286.550000 287.598000 288.550000 2266.666667 \n", + "8760 286.794600 287.975933 288.927933 822.000000 \n", + "8761 285.216667 286.430667 287.550000 -760.000000 \n", + "8762 284.283333 285.504667 286.750000 2780.666667 \n", + "\n", + "[8763 rows x 13 columns]" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# create a dictionary containing all feature series with their names and keys in a desirable order\n", + "features_ave = {'Time': time, 'Ave_weather_id': ave_weather_id, 'Ave_wind_speed': ave_wind_speed, 'Ave_wind_deg': ave_wind_deg, 'Ave_humidity': ave_humidity, \n", + " 'Ave_rain': ave_rain, 'Ave_clouds_all': ave_clouds_all, 'Ave_pressure': ave_pressure, 'Ave_snow': ave_snow, 'Ave_temp_min': ave_temp_min,\n", + " 'Ave_temp': ave_temp, 'Ave_temp_max': ave_temp_max, 'load_shortfall_3h': load_shortfall_3h}\n", + "\n", + "# convert dictionary of feature series into dataframe named df_train_clean\n", + "df_train_clean = pd.DataFrame(features_ave)\n", + "\n", + "df_train_clean\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f516a53", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "78f39e0f", + "metadata": {}, + "source": [ + "7. Having significantly condensed the dataset without necessarily impairing its ability to train our model effectively, we may now draw some inferences from each feature's interactions with the load shortfall and with one another. The essence of this is to check for possible linearity between a feature and the target variable (in this case, 'load_shortfall_3h') and possible collinearity between any pair of features. We'll make use of scatter plots and heat maps to reveal such interactions. But first, some column manipulation to make the response variable come first."
+ ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "2fb74182", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# plot relevant feature interactions\n", + "fig, axs = plt.subplots(3,4, figsize=(14,6),)\n", + "fig.subplots_adjust(hspace = 0.5, wspace=.2)\n", + "axs = axs.ravel()\n", + "\n", + "for index, column in enumerate(df_train_clean.columns):\n", + " axs[index-1].set_title(\"{} vs. load_shortfall_3h\".format(column),fontsize=12)\n", + " axs[index-1].scatter(x=df_train_clean[column],y=df_train_clean['load_shortfall_3h'],color='blue',edgecolor='k')\n", + " \n", + "fig.tight_layout(pad=1)\n" + ] + }, + { + "cell_type": "markdown", + "id": "545bd833", + "metadata": {}, + "source": [ + "8. From the foregoing plots , it is quite clear that no obvious linearity exists between any of the features and the load shortfall. This means that no feature exists which changes the shortfall by the same amount with each change in its own quantity. There is however a suggestion from the plots that the variance with respect to shortfall of the three temperature features are similar. Though this may be expected, we will still have to confirm their correlation as a potential source of multicollinearity using a heatmap. " + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "363b6bd2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Ave_weather_id Ave_wind_speed Ave_wind_deg Ave_humidity \\\n", + "Ave_weather_id 1.000000 -0.249671 -0.040275 -0.265155 \n", + "Ave_wind_speed -0.249671 1.000000 0.230740 -0.262193 \n", + "Ave_wind_deg -0.040275 0.230740 1.000000 -0.002791 \n", + "Ave_humidity -0.265155 -0.262193 -0.002791 1.000000 \n", + "Ave_rain -0.037892 -0.009704 -0.016676 0.042482 \n", + "Ave_clouds_all -0.596429 0.342227 0.119769 0.329542 \n", + "Ave_pressure -0.034199 0.021890 0.025140 0.018477 \n", + "Ave_snow -0.066925 0.078682 -0.046898 0.006639 \n", + "Ave_temp_min 0.176990 0.110464 -0.089915 -0.726230 \n", + "Ave_temp 0.195169 0.110667 -0.090755 -0.722368 \n", + "Ave_temp_max 0.209514 0.100607 -0.092286 -0.692962 \n", + "load_shortfall_3h 0.135499 -0.157644 -0.168674 -0.118548 \n", + "\n", + " Ave_rain Ave_clouds_all Ave_pressure Ave_snow \\\n", + "Ave_weather_id -0.037892 -0.596429 -0.034199 -0.066925 \n", + "Ave_wind_speed -0.009704 0.342227 0.021890 0.078682 \n", + "Ave_wind_deg -0.016676 0.119769 0.025140 -0.046898 \n", + "Ave_humidity 0.042482 0.329542 0.018477 0.006639 \n", + "Ave_rain 1.000000 0.041851 -0.002890 0.001751 \n", + "Ave_clouds_all 0.041851 1.000000 0.033503 0.064971 \n", + "Ave_pressure -0.002890 0.033503 1.000000 -0.001804 \n", + "Ave_snow 0.001751 0.064971 -0.001804 1.000000 \n", + "Ave_temp_min -0.045953 -0.211687 -0.027319 -0.085402 \n", + "Ave_temp -0.062838 -0.228930 -0.030635 -0.094102 \n", + "Ave_temp_max -0.080205 -0.241156 -0.033640 -0.102383 \n", + "load_shortfall_3h -0.037829 -0.147201 -0.034161 -0.031910 \n", + "\n", + " Ave_temp_min Ave_temp Ave_temp_max load_shortfall_3h \n", + "Ave_weather_id 0.176990 0.195169 0.209514 0.135499 \n", + "Ave_wind_speed 0.110464 0.110667 0.100607 -0.157644 \n", + "Ave_wind_deg -0.089915 -0.090755 -0.092286 -0.168674 \n", + "Ave_humidity -0.726230 -0.722368 -0.692962 -0.118548 \n", + "Ave_rain -0.045953 -0.062838 -0.080205 -0.037829 \n", + "Ave_clouds_all -0.211687 -0.228930 -0.241156 
-0.147201 \n", + "Ave_pressure -0.027319 -0.030635 -0.033640 -0.034161 \n", + "Ave_snow -0.085402 -0.094102 -0.102383 -0.031910 \n", + "Ave_temp_min 1.000000 0.986264 0.943946 0.194317 \n", + "Ave_temp 0.986264 1.000000 0.984565 0.184345 \n", + "Ave_temp_max 0.943946 0.984565 1.000000 0.168071 \n", + "load_shortfall_3h 0.194317 0.184345 0.168071 1.000000 " + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# evaluate correlation between the numerical columns ('Time' is not numeric, so drop it first)\n", + "correlation = df_train_clean.drop('Time', axis = 1).corr()\n", + "correlation" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "6722a031", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# view correlation with a heatmap\n", + "fig = plot_corr(correlation, xnames = correlation.columns)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "71afb40b", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# have a look at feature distributions\n", + "features = df_train_clean.drop('Time', axis = 1) # create a list of all numerical features\n", + "features.plot(kind='density', subplots=True, layout=(3, 4), sharex=False, figsize=(16, 12));" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "5f54ceb9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Ave_weather_id 3.633779\n", + "Ave_wind_speed 3.708942\n", + "Ave_wind_deg -0.640068\n", + "Ave_humidity -0.951951\n", + "Ave_rain 173.988275\n", + "Ave_clouds_all 0.228047\n", + "Ave_pressure 3688.345400\n", + "Ave_snow 799.972093\n", + "Ave_temp_min -0.667229\n", + "Ave_temp -0.678971\n", + "Ave_temp_max -0.645494\n", + "load_shortfall_3h -0.118999\n", + "dtype: float64" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# examine measure of presence of outliers\n", + "df_train_clean.drop('Time', axis = 1).kurtosis()" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "17e257b8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# create the subset for features with high kurtosis\n", + "df_outliers = df_train_clean[['Ave_weather_id','Ave_wind_speed','Ave_rain','Ave_pressure','Ave_snow']]\n", + "\n", + "#melt dataframe into long format\n", + "df_melted = pd.melt(df_outliers)\n", + "\n", + "#plot boxplots for variables\n", + "sns.boxplot(x='variable', y='value', data=df_melted) " + ] + }, + { + "cell_type": "markdown", + "id": "98061d97", + "metadata": {}, + "source": [ + "9. We can deduce from the plots above that no signiicant correlation exists between any independent variable and the target variable. However, we notice some interesting correlations elsewhere. We can deduce that the weather_id feature is majorly characterised by rain and cloud quantities which have weak negative correlations with the load_shortfall. We can also confirm as suggested by our earlier scatter plots that while the temperature features provide the highest correlation with demand shortfall, very strong correlation exists amongst them. As expected too, the temperature features together produce a strong negative correlation with the humidity levels. We also do not expect a significant effect of pressure and snow on the target variable seeing how thinly distributed their levels are as shown in the density plots and confirmed by the heat map. This is obviously as a result of strong presence of outliers in their data. These will be handled in the data engineering section." + ] + }, + { + "cell_type": "markdown", + "id": "acf5dcb2", + "metadata": {}, + "source": [ + "\n", + "## 4. Data Engineering\n", + "\n", + "Back to Table of Contents\n", + "\n", + "---\n", + " \n", + "| ⚡ Description: Data engineering ⚡ |\n", + "| :--------------------------- |\n", + "| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. 
|\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "764b5fba", + "metadata": {}, + "source": [ + "10. For our feature engineering, we need to derive time features, such as the hour of day and month of year, to give the model a view of how the shortfall varies with time. The time feature as it is given in the data cannot serve this purpose directly, so we derive from it an 'hour of day' feature, which reflects how the demand shortfall changes between day and night, and 'day of year' and 'week of year' features, which reflect how the demand shortfall varies across the seasons of the year. For better organisation, we then reindex the columns so that the target variable and all the derived time features come first in the dataframe." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "c313d4d3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2015-01-01 03:00:00\n", + "1 2015-01-01 06:00:00\n", + "2 2015-01-01 09:00:00\n", + "3 2015-01-01 12:00:00\n", + "4 2015-01-01 15:00:00\n", + " ... \n", + "8758 2017-12-31 09:00:00\n", + "8759 2017-12-31 12:00:00\n", + "8760 2017-12-31 15:00:00\n", + "8761 2017-12-31 18:00:00\n", + "8762 2017-12-31 21:00:00\n", + "Name: Time, Length: 8763, dtype: datetime64[ns]" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# convert time data type to datetime\n", + "df_train_clean['Time'] = pd.to_datetime(df_train_clean['Time'])\n", + "df_train_clean['Time']" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "84eea17b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Time Ave_weather_id Ave_wind_speed Ave_wind_deg \\\n", + "0 2015-01-01 03:00:00 800.0 2.400000 133.000000 \n", + "1 2015-01-01 06:00:00 800.0 2.066667 180.000000 \n", + "2 2015-01-01 09:00:00 800.0 1.533333 270.166667 \n", + "3 2015-01-01 12:00:00 800.0 1.866667 236.333333 \n", + "4 2015-01-01 15:00:00 800.0 1.933333 222.500000 \n", + "\n", + " Ave_humidity Ave_rain Ave_clouds_all Ave_pressure Ave_snow \\\n", + "0 71.333333 0.0 0.000000 1011.333333 0.0 \n", + "1 71.333333 0.0 0.000000 1012.500000 0.0 \n", + "2 67.111111 0.0 0.000000 1013.333333 0.0 \n", + "3 58.555556 0.0 0.000000 1019.166667 0.0 \n", + "4 58.111111 0.0 0.666667 1030.916667 0.0 \n", + "\n", + " Ave_temp_min Ave_temp Ave_temp_max load_shortfall_3h Hour_of_day \\\n", + "0 272.086456 272.086456 272.086456 6715.666667 3 \n", + "1 272.799533 272.799533 272.799533 4171.666667 6 \n", + "2 277.224046 277.224046 277.224046 4274.666667 9 \n", + "3 283.351587 283.351587 283.351587 5075.666667 12 \n", + "4 283.134500 283.134500 283.134500 6620.666667 15 \n", + "\n", + " Day_of_year Week_of_year \n", + "0 1 1 \n", + "1 1 1 \n", + "2 1 1 \n", + "3 1 1 \n", + "4 1 1 " + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# create new features\n", + "df_train_clean['Hour_of_day'] = df_train_clean['Time'].dt.hour # add hour of day feature\n", + "df_train_clean['Day_of_year'] = df_train_clean['Time'].dt.day_of_year # add day of the year feature\n", + "df_train_clean['Week_of_year'] = df_train_clean['Time'].dt.isocalendar().week # add week of year feature\n", + "df_train_clean.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "767f5645", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " load_shortfall_3h Time Hour_of_day Day_of_year \\\n", + "0 6715.666667 2015-01-01 03:00:00 3 1 \n", + "1 4171.666667 2015-01-01 06:00:00 6 1 \n", + "2 4274.666667 2015-01-01 09:00:00 9 1 \n", + "3 5075.666667 2015-01-01 12:00:00 12 1 \n", + "4 6620.666667 2015-01-01 15:00:00 15 1 \n", + "\n", + " Week_of_year Ave_weather_id Ave_wind_speed Ave_wind_deg Ave_humidity \\\n", + "0 1 800.0 2.400000 133.000000 71.333333 \n", + "1 1 800.0 2.066667 180.000000 71.333333 \n", + "2 1 800.0 1.533333 270.166667 67.111111 \n", + "3 1 800.0 1.866667 236.333333 58.555556 \n", + "4 1 800.0 1.933333 222.500000 58.111111 \n", + "\n", + " Ave_rain Ave_clouds_all Ave_pressure Ave_snow Ave_temp_min Ave_temp \\\n", + "0 0.0 0.000000 1011.333333 0.0 272.086456 272.086456 \n", + "1 0.0 0.000000 1012.500000 0.0 272.799533 272.799533 \n", + "2 0.0 0.000000 1013.333333 0.0 277.224046 277.224046 \n", + "3 0.0 0.000000 1019.166667 0.0 283.351587 283.351587 \n", + "4 0.0 0.666667 1030.916667 0.0 283.134500 283.134500 \n", + "\n", + " Ave_temp_max \n", + "0 272.086456 \n", + "1 272.799533 \n", + "2 277.224046 \n", + "3 283.351587 \n", + "4 283.134500 " + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# engineer existing fetures\n", + "\n", + "# create a list of features we need to bring forward\n", + "first_columns = ['load_shortfall_3h', 'Time', 'Hour_of_day', 'Day_of_year', 'Week_of_year']\n", + "\n", + "# create the order of columns\n", + "new_columns = first_columns + [col for col in df_train_clean.columns if col not in first_columns] \n", + "\n", + "# create a new dataframe named 'df_train' with the new column index \n", + "df_train = df_train_clean.reindex(columns = new_columns) \n", + "\n", + "# quick look at the resulting dataframe\n", + "df_train.head()" + ] + }, + { + "cell_type": "markdown", + "id": "dd30d47c", + "metadata": {}, + "source": [ + "10. 
We may fare better by removing features that are highly collinear with one another, such as the temperature features. However, we will choose to leave them, since we are considering a Lasso regressor, which inherently performs coefficient shrinkage and variable selection. As for features with a high measure of outliers (kurtosis), we shall use knowledge of their respective distributions and box plots from the EDA section to write a function that sets lower and upper boundaries for a dataframe column and then replaces each detected outlier with the nearest boundary value. An observation is classified as an outlier if it lies above the third quartile (75th percentile) or below the first quartile (25th percentile) by more than 1.5 times the interquartile range (1.5*IQR). The presence of outliers may mislead the training process of our machine learning model, leading to less accurate models and poorer results.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "31386e1d", + "metadata": {}, + "outputs": [], + "source": [ + "# define outlier handling function\n", + "def handle_outliers(dataframe, col):\n", + "    '''\n", + "    Take a dataframe and a column name as input, calculate the\n", + "    outlier boundaries for that column, replace each outlier by its\n", + "    closest boundary value, and return the (modified in place) dataframe.\n", + "    '''\n", + "    iqr = dataframe[col].quantile(0.75) - dataframe[col].quantile(0.25) # calculate interquartile range for column\n", + "    tail = dataframe[col].quantile(0.25) - (iqr*1.5) # set lower boundary\n", + "    head = dataframe[col].quantile(0.75) + (iqr*1.5) # set upper boundary\n", + "    dataframe.loc[dataframe[col] > head, col] = head # replace each high outlier with the upper boundary\n", + "    dataframe.loc[dataframe[col] < tail, col] = tail # replace each low outlier with the lower boundary\n", + "    return dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "b7c0eee8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Ave_weather_id 0.807169\n", + "Ave_wind_speed 0.117803\n", + "Ave_wind_deg 
-0.640068\n", + "Ave_humidity -0.951951\n", + "Ave_rain 0.000000\n", + "Ave_clouds_all 0.228047\n", + "Ave_pressure -0.102126\n", + "Ave_snow 0.000000\n", + "Ave_temp_min -0.667229\n", + "Ave_temp -0.678971\n", + "Ave_temp_max -0.645494\n", + "dtype: float64" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# call handle_outliers for features with high kurtosis \n", + "handle_outliers(df_train, 'Ave_pressure')\n", + "handle_outliers(df_train, 'Ave_rain')\n", + "handle_outliers(df_train, 'Ave_snow')\n", + "handle_outliers(df_train, 'Ave_wind_speed')\n", + "handle_outliers(df_train, 'Ave_weather_id')\n", + "\n", + "df_train.drop(first_columns, axis = 1).kurtosis()" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "8051d326", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " load_shortfall_3h Time Hour_of_day Day_of_year \\\n", + "0 6715.666667 2015-01-01 03:00:00 3 1 \n", + "1 4171.666667 2015-01-01 06:00:00 6 1 \n", + "2 4274.666667 2015-01-01 09:00:00 9 1 \n", + "3 5075.666667 2015-01-01 12:00:00 12 1 \n", + "4 6620.666667 2015-01-01 15:00:00 15 1 \n", + "\n", + " Week_of_year Ave_weather_id Ave_wind_speed Ave_wind_deg Ave_humidity \\\n", + "0 1 800.0 2.400000 133.000000 71.333333 \n", + "1 1 800.0 2.066667 180.000000 71.333333 \n", + "2 1 800.0 1.533333 270.166667 67.111111 \n", + "3 1 800.0 1.866667 236.333333 58.555556 \n", + "4 1 800.0 1.933333 222.500000 58.111111 \n", + "\n", + " Ave_rain Ave_clouds_all Ave_pressure Ave_snow Ave_temp_min Ave_temp \\\n", + "0 0.0 0.000000 1011.333333 0.0 272.086456 272.086456 \n", + "1 0.0 0.000000 1012.500000 0.0 272.799533 272.799533 \n", + "2 0.0 0.000000 1013.333333 0.0 277.224046 277.224046 \n", + "3 0.0 0.000000 1019.166667 0.0 283.351587 283.351587 \n", + "4 0.0 0.666667 1030.916667 0.0 283.134500 283.134500 \n", + "\n", + " Ave_temp_max \n", + "0 272.086456 \n", + "1 272.799533 \n", + "2 277.224046 \n", + "3 283.351587 \n", + "4 283.134500 " + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# last look of the engineered training dataset\n", + "df_train.head()" + ] + }, + { + "cell_type": "markdown", + "id": "43b2d523", + "metadata": {}, + "source": [ + "\n", + "## 4. Modelling\n", + "\n", + "Back to Table of Contents\n", + "\n", + "---\n", + " \n", + "| ⚡ Description: Modelling ⚡ |\n", + "| :--------------------------- |\n", + "| In this section, you are required to create one or more regression models that are able to accurately predict the thee hour load shortfall. 
|\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "2344b3e0", + "metadata": {}, + "outputs": [], + "source": [ + "# split data into predictor and response variables\n", + "\n", + "X = df_train.drop(columns = ['load_shortfall_3h','Time'])\n", + "y = df_train['load_shortfall_3h']" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "9c58df02", + "metadata": {}, + "outputs": [], + "source": [ + "# apply data scaling to predictor variables\n", + "scaler = StandardScaler()\n", + "X_scaled = scaler.fit_transform(X)" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "7bdd9689", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Hour_of_dayDay_of_yearWeek_of_yearAve_weather_idAve_wind_speedAve_wind_degAve_humidityAve_rainAve_clouds_allAve_pressureAve_snowAve_temp_minAve_tempAve_temp_max
0-1.090901-1.728991-1.7097030.7862190.001704-0.6212390.5138530.0-1.302834-0.5026530.0-2.340125-2.467583-2.576747
1-0.654451-1.728991-1.7097030.786219-0.2778580.0781000.5138530.0-1.302834-0.3611760.0-2.238048-2.368549-2.482082
2-0.218001-1.728991-1.7097030.786219-0.7251561.4197410.2868970.0-1.302834-0.2601210.0-1.604675-1.754058-1.894702
30.218449-1.728991-1.7097030.786219-0.4455950.916316-0.1729860.0-1.3028340.4472610.0-0.727512-0.903045-1.081237
40.654899-1.728991-1.7097030.786219-0.3896830.710482-0.1968760.0-1.2688431.8721310.0-0.758588-0.933195-1.110056
\n", + "
" + ], + "text/plain": [ + " Hour_of_day Day_of_year Week_of_year Ave_weather_id Ave_wind_speed \\\n", + "0 -1.090901 -1.728991 -1.709703 0.786219 0.001704 \n", + "1 -0.654451 -1.728991 -1.709703 0.786219 -0.277858 \n", + "2 -0.218001 -1.728991 -1.709703 0.786219 -0.725156 \n", + "3 0.218449 -1.728991 -1.709703 0.786219 -0.445595 \n", + "4 0.654899 -1.728991 -1.709703 0.786219 -0.389683 \n", + "\n", + " Ave_wind_deg Ave_humidity Ave_rain Ave_clouds_all Ave_pressure \\\n", + "0 -0.621239 0.513853 0.0 -1.302834 -0.502653 \n", + "1 0.078100 0.513853 0.0 -1.302834 -0.361176 \n", + "2 1.419741 0.286897 0.0 -1.302834 -0.260121 \n", + "3 0.916316 -0.172986 0.0 -1.302834 0.447261 \n", + "4 0.710482 -0.196876 0.0 -1.268843 1.872131 \n", + "\n", + " Ave_snow Ave_temp_min Ave_temp Ave_temp_max \n", + "0 0.0 -2.340125 -2.467583 -2.576747 \n", + "1 0.0 -2.238048 -2.368549 -2.482082 \n", + "2 0.0 -1.604675 -1.754058 -1.894702 \n", + "3 0.0 -0.727512 -0.903045 -1.081237 \n", + "4 0.0 -0.758588 -0.933195 -1.110056 " + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# revert to dataframe \n", + "X_standard = pd.DataFrame(X_scaled, columns = X.columns)\n", + "X_standard.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "c57e4a50", + "metadata": {}, + "outputs": [], + "source": [ + "# Split the data into train and test using the standardised predictors\n", + "X_train, X_test, y_train, y_test = train_test_split(X, \n", + " y, \n", + " test_size=0.2, \n", + " shuffle=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "20d073e0", + "metadata": {}, + "outputs": [], + "source": [ + "# create one or more ML models\n", + "lm = LinearRegression()\n", + "rr = Ridge()\n", + "lr = Lasso(alpha = 0.1)\n", + "dt = DecisionTreeRegressor(max_depth=10,random_state=15)\n", + "rf = RandomForestRegressor(n_estimators = 70, max_depth = 10, random_state = 25)" + ] + }, + { + "cell_type": 
"markdown", + "id": "6b530251", + "metadata": {}, + "source": [ + "\n", + "## 5. Model Performance\n", + "\n", + "Back to Table of Contents\n", + "\n", + "---\n", + " \n", + "| ⚡ Description: Model performance ⚡ |\n", + "| :--------------------------- |\n", + "| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "6a69b5a1", + "metadata": {}, + "outputs": [], + "source": [ + "# Compare model performance\n", + "# evaluate one or more ML models\n", + "def train_eval(model, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):\n", + " '''\n", + " define function that fits a model object to a data set and calculate root mean squared values\n", + " '''\n", + " model.fit(X_train, y_train)\n", + " rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))\n", + " rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))\n", + " result = {'rmse_train':rmse_train, 'rmse_test':rmse_test}\n", + " return result" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "8a46694a", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\comfo\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:530: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 80808284351.6825, tolerance: 18634441.993152138\n", + " model = cd_fast.enet_coordinate_descent(\n", + "C:\\Users\\comfo\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:530: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. 
Duality gap: 80808284351.6825, tolerance: 18634441.993152138\n", + " model = cd_fast.enet_coordinate_descent(\n" + ] + } + ], + "source": [ + "results_dict = {'lm_scores': train_eval(lm),\n", + " 'rr_scores': train_eval(rr),\n", + " 'lr_scores': train_eval(lr),\n", + " 'dt_scores': train_eval(dt),\n", + " 'rf_scores': train_eval(rf)\n", + " }\n", + "Results = pd.DataFrame(results_dict) " + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "3874a7c6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
lm_scoresrr_scoreslr_scoresdt_scoresrf_scores
rmse_train4894.3680414894.3680674894.4858434894.4858432945.016428
rmse_test4936.6148324936.6260814936.0679234936.0679234501.985059
\n", + "
" + ], + "text/plain": [ + " lm_scores rr_scores lr_scores dt_scores rf_scores\n", + "rmse_train 4894.368041 4894.368067 4894.485843 4894.485843 2945.016428\n", + "rmse_test 4936.614832 4936.626081 4936.067923 4936.067923 4501.985059" + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e19664f", + "metadata": {}, + "outputs": [], + "source": [ + "#Choose best model and motivate why it is the best choice\n", + "\n", + "'''The best model as may be observed from the results dataframe is the random forest regressor because since\n", + "it produces the lowest root mean squared value for an unseen test data, it is likely going to produce the\n", + "least amount of error form any other unseen dataset '''\n" + ] + }, + { + "cell_type": "markdown", + "id": "a8ad0c0d", + "metadata": {}, + "source": [ + "\n", + "## 6. Model Explanations\n", + "\n", + "Back to Table of Contents\n", + "\n", + "---\n", + " \n", + "| ⚡ Description: Model explanation ⚡ |\n", + "| :--------------------------- |\n", + "| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "5ff741c2", + "metadata": {}, + "outputs": [], + "source": [ + "# discuss chosen methods logic\n", + "'''The Random Forest algorithm is essentially a Machine Learning(ML for short) algorithm or model; a statistical way of teaching\n", + "a computer how to predict the outcome of an activity. It is a randomised combination of yet another ML algorithm called Decision\n", + "Tree. 
A Decision Tree starts at the base (root node) and divides the data into two branches according to the\n", + "answer to a question asked about one feature. Each branch keeps splitting in the same way until it reaches a point (a leaf)\n", + "where a single answer, the prediction, remains. In a Random Forest, the predictions from all the decision trees are then\n", + "averaged to produce a more reliable prediction than any single tree would give'''\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + }, + "latex_envs": { + "LaTeX_envs_menu_present": true, + "autoclose": false, + "autocomplete": true, + "bibliofile": "biblio.bib", + "cite_by": "apalike", + "current_citInitial": 1, + "eqLabelWithNumbers": true, + "eqNumInitial": 1, + "hotkeys": { + "equation": "Ctrl-E", + "itemize": "Ctrl-I" + }, + "labels_anchors": false, + "latex_user_defs": false, + "report_style_numbering": false, + "user_envs_cfg": false + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + }, + "varInspector": { + "cols": { + "lenName": 16, + "lenType": 16, + "lenVar": 40 + }, + "kernels_config": { + "python": { + "delete_cmd_postfix": "", + "delete_cmd_prefix": "del ", + "library": "var_list.py", + "varRefreshCmd": "print(var_dic_list())" + }, + "r": { + "delete_cmd_postfix": ") ", + "delete_cmd_prefix": "rm(", + "library": "var_list.r", + "varRefreshCmd": "cat(var_dic_list()) " + } + }, + 
"types_to_exclude": [ + "module", + "function", + "builtin_function_or_method", + "instance", + "_Feature" + ], + "window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}