diff --git a/energy-analysis/building-instinct/building-instinct-starter-notebook.ipynb b/energy-analysis/building-instinct/building-instinct-starter-notebook.ipynb new file mode 100644 index 0000000..6f8e815 --- /dev/null +++ b/energy-analysis/building-instinct/building-instinct-starter-notebook.ipynb @@ -0,0 +1,3406 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "767e112f-72ab-4806-a718-aa584106be92", + "metadata": {}, + "source": [ + "# Building Instinct: Where power meets predictions\n", + "\n", + "Welcome to the Building Instinct: Where power meets predictions Challenge, a building metadata inference problem based on electricity load profiles. This is a hierarchical multi-output multi-class classification problem in which you are tasked with predicting the metadata of different buildings based on their load profiles.\n", + "In this challenge, you will be given end-use load profiles for 7200 unique buildings (train data), along with their metadata (labels) to train your inference model. You will also be given load profiles for 1440 buildings (without their metadata) as your test data. Your leaderboard score will be calculated based on your model's performance on the test data. Your task is to infer the metadata of different hierarchical classes for each building. 
More information about this hierarchical multi-output multi-class classification problem is given later in this starter notebook.\n", + "\n", + "### Supplied Materials:\n", + " \n", + "* Starter Notebook\n", + "* Train dataset: 7200 `.parquet` files containing timestamped end-use load profiles for 7200 buildings\n", + "* `train_label.parquet` file containing labels (also referred to as metadata or attributes) for the 7200 buildings in the train dataset\n", + "* Test dataset: 1440 `.parquet` files containing timestamped end-use load profiles for 1440 buildings\n", + "* `utils.py`: containing some functions used in this starter notebook and to help you get started\n", + "* `requirements.txt` should contain all the required packages for your submission\n", + "\n", + "### Data:\n", + "\n", + "Each of the above-mentioned timestamped `.parquet` files (either train or test) contains a time series of electricity consumption for the corresponding building, starting from Jan 1, 2018 (`2018-01-01 00:15:00`) till the end of Dec 31, 2018 (`2019-01-01 00:00:00`), in 15-minute increments. All times are in Eastern Standard Time (EST). All energy consumption values are in `kWh`. The value for each timestamp (row) is the electricity consumed during the 15 minutes ending at that timestamp. For example, for the row corresponding to `2018-12-20 17:15:00` the energy load value is the energy consumed from `2018-12-20 17:00:00` till `2018-12-20 17:15:00`. Each of these files also contains a column providing the state in which the building is located.\n", + "\n", + "The `train_label.parquet` file contains the metadata (labels/classes to predict) for each building in the train dataset. Each building is either residential or commercial, as indicated in the `building_stock_type` column. If a building is commercial, it has 11 metadata attributes, stored in the columns whose names end with `_com`. 
On the other hand, if a building is residential, it has 13 metadata attributes, stored in the columns whose names end with `_res`.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "f3a3510e-1016-4d1f-afce-2762ab95f189", + "metadata": {}, + "source": [ + "### Data loading and exploration\n", + "\n", + "Below are a few code snippets that show you how to load and explore the data.\n", + "\n", + "Please remember to list any packages you use in a `requirements.txt` file and place it in the starter notebook folder." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "047c6393-f6f8-4457-9d20-661dc89eabcf", + "metadata": {}, + "outputs": [], + "source": [ + "# import the required libraries\n", + "\n", + "import os\n", + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.pipeline import Pipeline\n", + "from sklearn.compose import ColumnTransformer\n", + "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.metrics import f1_score\n", + "\n", + "from utils import (\n", + " calculate_average_hourly_energy_consumption,\n", + " train_model,\n", + " get_pred,\n", + " calculate_hierarchical_f1_score,\n", + " sample_submission_generator,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "0257e8f8-0ad8-4f44-9b5d-6ab924ac659a", + "metadata": {}, + "source": [ + "Below, the timestamped load profile for the building with ID 1 (`1.parquet`) is loaded as a pandas DataFrame and the first 10 rows are displayed. This building is located in Kentucky." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "77b61a4e-6df0-4525-b01d-3ef8d473d074", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " timestamp out.electricity.total.energy_consumption in.state\n", + "bldg_id \n", + "1 2018-01-01 00:15:00 2.288 KY\n", + "1 2018-01-01 00:30:00 2.190 KY\n", + "1 2018-01-01 00:45:00 2.101 KY\n", + "1 2018-01-01 01:00:00 2.016 KY\n", + "1 2018-01-01 01:15:00 2.027 KY\n", + "1 2018-01-01 01:30:00 2.050 KY\n", + "1 2018-01-01 01:45:00 2.074 KY\n", + "1 2018-01-01 02:00:00 2.097 KY\n", + "1 2018-01-01 02:15:00 2.129 KY\n", + "1 2018-01-01 02:30:00 2.162 KY" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "load_filepath_bldg = os.path.join(\n", + " os.getcwd(), \"building-instinct-train-data\", \"1.parquet\"\n", + ") # path to a file in the train dataset\n", + "df_bldg = pd.read_parquet(load_filepath_bldg, engine=\"pyarrow\")\n", + "\n", + "# show the first 10 rows of the df_bldg dataframe\n", + "df_bldg.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "9a06da5d-0a5a-4457-9905-b6fa7cc807a7", + "metadata": {}, + "source": [ + "Next we load the metadata (labels) for the buildings in the train dataset and display the first 10 rows. As shown in the dataframe below, residential buildings (e.g. building 1) have entries only for the columns with names ending with `_res` (in addition to the `building_stock_type` column, which specifies their residential building stock type). Similarly, commercial buildings (e.g. building 3) have entries only for the columns with names ending with `_com` (in addition to the `building_stock_type` column, which specifies their commercial building stock type).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d0b0e8b0-74d7-464a-9f90-833ea096ac88", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " building_stock_type in.comstock_building_type_group_com \\\n", + "bldg_id \n", + "1 residential None \n", + "2 residential None \n", + "3 commercial Warehouse and Storage \n", + "4 residential None \n", + "5 commercial Warehouse and Storage \n", + "6 commercial Mercantile \n", + "7 residential None \n", + "8 residential None \n", + "9 commercial Warehouse and Storage \n", + "10 residential None \n", + "\n", + " in.heating_fuel_com in.hvac_category_com \\\n", + "bldg_id \n", + "1 None None \n", + "2 None None \n", + "3 Electricity Small Packaged Unit \n", + "4 None None \n", + "5 NaturalGas Residential Style Central Systems \n", + "6 NaturalGas Small Packaged Unit \n", + "7 None None \n", + "8 None None \n", + "9 NaturalGas Small Packaged Unit \n", + "10 None None \n", + "\n", + " in.number_of_stories_com in.ownership_type_com in.vintage_com \\\n", + "bldg_id \n", + "1 None None None \n", + "2 None None None \n", + "3 2 owner_occupied 1990 to 1999 \n", + "4 None None None \n", + "5 1 owner_occupied 2000 to 2012 \n", + "6 1 leased 1960 to 1969 \n", + "7 None None None \n", + "8 None None None \n", + "9 1 leased 1980 to 1989 \n", + "10 None None None \n", + "\n", + " in.wall_construction_type_com in.tstat_clg_sp_f..f_com \\\n", + "bldg_id \n", + "1 None None \n", + "2 None None \n", + "3 WoodFramed 999 \n", + "4 None None \n", + "5 WoodFramed 999 \n", + "6 WoodFramed 72 \n", + "7 None None \n", + "8 None None \n", + "9 Mass 999 \n", + "10 None None \n", + "\n", + " in.tstat_htg_sp_f..f_com ... in.geometry_building_type_recs_res \\\n", + "bldg_id ... \n", + "1 None ... Multi-Family with 2 - 4 Units \n", + "2 None ... Multi-Family with 5+ Units \n", + "3 999 ... None \n", + "4 None ... Multi-Family with 5+ Units \n", + "5 999 ... None \n", + "6 67 ... None \n", + "7 None ... Single-Family Detached \n", + "8 None ... Multi-Family with 5+ Units \n", + "9 999 ... None \n", + "10 None ... 
Single-Family Detached \n", + "\n", + " in.geometry_floor_area_res in.geometry_foundation_type_res \\\n", + "bldg_id \n", + "1 1500-1999 Unheated Basement \n", + "2 750-999 Vented Crawlspace \n", + "3 None None \n", + "4 1000-1499 Vented Crawlspace \n", + "5 None None \n", + "6 None None \n", + "7 750-999 Heated Basement \n", + "8 500-749 Vented Crawlspace \n", + "9 None None \n", + "10 1000-1499 Slab \n", + "\n", + " in.geometry_wall_type_res in.heating_fuel_res in.income_res \\\n", + "bldg_id \n", + "1 Wood Frame Natural Gas 100000-119999 \n", + "2 Wood Frame Natural Gas 10000-14999 \n", + "3 None None None \n", + "4 Wood Frame Electricity 60000-69999 \n", + "5 None None None \n", + "6 None None None \n", + "7 Wood Frame Natural Gas 80000-99999 \n", + "8 Wood Frame Electricity 40000-44999 \n", + "9 None None None \n", + "10 Steel Frame Electricity 20000-24999 \n", + "\n", + " in.roof_material_res in.tenure_res in.vacancy_status_res \\\n", + "bldg_id \n", + "1 Composition Shingles Owner Occupied \n", + "2 Asphalt Shingles, Medium Renter Occupied \n", + "3 None None None \n", + "4 Asphalt Shingles, Medium Renter Occupied \n", + "5 None None None \n", + "6 None None None \n", + "7 Asphalt Shingles, Medium Owner Occupied \n", + "8 Asphalt Shingles, Medium Renter Occupied \n", + "9 None None None \n", + "10 Composition Shingles Renter Occupied \n", + "\n", + " in.vintage_res \n", + "bldg_id \n", + "1 <1940 \n", + "2 1970s \n", + "3 None \n", + "4 1980s \n", + "5 None \n", + "6 None \n", + "7 1940s \n", + "8 1990s \n", + "9 None \n", + "10 1950s \n", + "\n", + "[10 rows x 25 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "load_filepath_labels = os.path.join(\n", + " os.getcwd(), \"building-instinct-train-label\", \"train_label.parquet\"\n", + ") # path to the train label file\n", + "df_targets = pd.read_parquet(load_filepath_labels, engine=\"pyarrow\")\n", + "\n", + "# show the first 10 rows of 
the dataframe\n", + "df_targets.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "4a35e508-7a20-410d-9a5e-9a4acb0c51b7", + "metadata": {}, + "source": [ + "Below we print the list of commercial and residential metadata. Most of the entries are self-explanatory. For additional clarification, `in.tstat_clg_sp_f..f_com` and `in.tstat_htg_sp_f..f_com` refer to the cooling and heating thermostat setpoints (in Fahrenheit) for commercial buildings." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "9efc214c-5500-48a4-ad52-aebc6444fa05", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + " Metadata columns for commercial buildings: \n", + " ['in.comstock_building_type_group_com', 'in.heating_fuel_com', 'in.hvac_category_com', 'in.number_of_stories_com', 'in.ownership_type_com', 'in.vintage_com', 'in.wall_construction_type_com', 'in.tstat_clg_sp_f..f_com', 'in.tstat_htg_sp_f..f_com', 'in.weekday_opening_time..hr_com', 'in.weekday_operating_hours..hr_com']\n", + "\n", + " Metadata columns for residential buildings: \n", + " ['in.bedrooms_res', 'in.cooling_setpoint_res', 'in.heating_setpoint_res', 'in.geometry_building_type_recs_res', 'in.geometry_floor_area_res', 'in.geometry_foundation_type_res', 'in.geometry_wall_type_res', 'in.heating_fuel_res', 'in.income_res', 'in.roof_material_res', 'in.tenure_res', 'in.vacancy_status_res', 'in.vintage_res']\n" + ] + } + ], + "source": [ + "columns_com = df_targets.filter(like=\"_com\").columns.tolist()\n", + "print(\"\\n Metadata columns for commercial buildings: \\n\", columns_com)\n", + "\n", + "columns_res = df_targets.filter(like=\"_res\").columns.tolist()\n", + "print(\"\\n Metadata columns for residential buildings: \\n\", columns_res)" + ] + }, + { + "cell_type": "markdown", + "id": "015310d2-5e6b-4faf-8b65-a3fe3ad6ecdd", + "metadata": {}, + "source": [ + "### Hierarchical Multi-output Multi-class Classification\n", + "\n", + "Your task in this 
challenge is to classify (i.e., predict the classes for) each metadata target variable for a given building based on its electricity load profile. The process begins with classifying the `building_stock_type` (a binary classification: residential or commercial). If the building is classified as residential, you should then predict the classes for all 13 corresponding metadata target variables (columns ending with `_res`). Similarly, if the building is classified as commercial, you should predict the classes for all 11 corresponding metadata target variables (columns ending with `_com`).\n", + "\n", + "This is a hierarchical classification problem with two levels (hierarchies) of classification: The first level involves determining the building stock type, and the second level involves classifying the target variables specific to the first-level class. Additionally, this is a multi-output classification problem since there are multiple target variables to classify (13 for residential and 11 for commercial). Furthermore, it is a multi-class problem because many target variables have more than two classes to predict. To assess the performance of your classification model, a customized F1-score is used as the performance metric for this challenge. The details of this performance metric will be discussed later in the Starter Notebook.\n", + "\n", + "To help you better understand the problem and get you up to speed, we will create and train a simple classification model. Every effective machine learning model relies on proper data preprocessing and feature engineering/extraction. 
Therefore, before diving into the classification model, we will provide a brief discussion on feature engineering and extraction.\n" + ] + }, + { + "cell_type": "markdown", + "id": "850e6979-9722-446b-aa23-39c980cc6dce", + "metadata": {}, + "source": [ + "### Feature extraction/engineering\n", + "\n", + "The collected smart meter data for energy consumption are often processed to reduce the scale of the input data or\n", + "to define more meaningful features for the ML task at hand (classification in this case). This is often referred to as the feature extraction/engineering or data reduction stage. It should be noted that each recorded electricity consumption value for a given building is indeed a feature; however, using every single smart meter reading as a feature results in a very high-dimensional feature set. For the provided dataset, this results in a 365 x 24 x 4 = 35040 dimensional feature set for every building. Analyzing such massive sets of data can be a challenging task. Therefore, data size reduction and feature engineering/extraction methods are pivotal for reducing the size of load data sets. The proper use of these methods can reduce the input data size of classification algorithms, save computation time, and produce features that are suitable for a specific task or algorithm.\n", + "\n", + "Many different knowledge-based and automatic feature extraction techniques could be used for classification. For instance, one can reduce the full load profiles by aggregating and averaging over different date-time windows. For example, one can aggregate the energy consumption within every hour and then average over the entire year to get a 24-hour representative load profile (RLP) for the entire year. This reduces the size of the feature set from 35040 to 24. The averaging could also be done for different seasons or different months. If it is done for every month, it results in twelve 24-hour RLPs (12 x 24 = 288 features). 
One could engineer other features, such as the number of consumption peaks or the times of the consumption peaks. Furthermore, more sophisticated and automatic deep learning feature extraction techniques could be integrated into the classification pipeline. One can also combine several of these techniques. There is no hard and fast rule as to which technique works better; it very much depends on the dataset, the classification algorithm, the application of interest, and the type of labels to classify. \n", + "\n", + "As a starting point, we have provided a simple function in `utils.py` that calculates the average hourly energy consumption. The function `calculate_average_hourly_energy_consumption(folder_path, season_months_dict)` reads all the parquet files in the `folder_path` folder, calculates the hourly average energy consumption (as described above), and returns a pandas DataFrame with each row corresponding to one file (building) in the folder. The 15-minute energy consumption values are aggregated within each hour. The dictionary argument `season_months_dict` defines over which months the averaging takes place. The keys of this dictionary are season names (strings) and the values are lists of the corresponding month numbers. For example, if `season_months_dict` = {'cold': [1, 2, 12], 'hot': [6, 7, 8], 'mild': [3, 4, 5, 9, 10, 11]}, the averaging of energy consumption is done within 3 different seasons, resulting in 3 x 24 = 72 features for every building load profile. Below, this function is applied to the parquet files in the train dataset.\n", + "\n", + "NOTE: The provided feature engineering and feature-set size reduction method is\n", + "just one of many approaches, shared for illustrative purposes. We highly encourage experimentation with various traditional and sophisticated (e.g. deep learning) techniques for feature extraction, either independently or in combination. 
Choose methods that best align with your analysis and classification algorithm and objectives." + ] + }, + { + "cell_type": "markdown", + "id": "fe8878ae-f7d3-499c-adb8-a3d4af76fdfe", + "metadata": {}, + "source": [ + "Using the provided helper function in `utils.py`, below we calculate hourly energy consumption averaged over the entire year resulting in a 24-feature RLP for every building in the train dataset (`df_features`)." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "87a97c9e-0b18-49fd-9904-e653096877f8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " year \\\n", + " 1 2 3 4 5 \n", + "bldg_id \n", + "1 1.388449 1.394814 1.406436 1.505438 1.629721 \n", + "2 0.974589 0.873729 0.832753 0.815477 0.692959 \n", + "3 26.463855 26.684141 26.856077 27.022170 27.258033 \n", + "4 0.329827 0.341614 0.347764 0.356296 0.371395 \n", + "5 4.612172 4.537062 4.476187 4.424258 4.371671 \n", + "6 9.929333 8.987023 9.079168 9.226018 9.590823 \n", + "7 0.316586 0.299337 0.266301 0.260748 0.394123 \n", + "8 0.394104 0.345060 0.338258 0.329145 0.336849 \n", + "9 130.721933 131.979900 132.178927 132.830446 133.266893 \n", + "10 0.980586 0.849430 0.768293 0.753340 0.759430 \n", + "\n", + " ... \\\n", + " 6 7 8 9 10 ... \n", + "bldg_id ... \n", + "1 1.699849 1.763690 1.762970 1.692008 1.589521 ... \n", + "2 0.837660 1.058562 1.047055 0.944164 0.915151 ... \n", + "3 27.391528 27.572034 26.574916 25.369904 27.300352 ... \n", + "4 0.405652 0.517104 0.616778 0.564156 0.478921 ... \n", + "5 4.581824 4.845716 5.583372 6.002681 6.144763 ... \n", + "6 13.350907 20.614406 22.863366 26.527285 26.583311 ... \n", + "7 0.430542 0.399929 0.443529 0.461975 0.469151 ... \n", + "8 0.349704 0.388208 0.455132 0.544277 0.607230 ... \n", + "9 135.160795 146.462227 216.727471 264.917181 273.437032 ... \n", + "10 0.781893 0.891951 1.046551 1.180244 1.241507 ... 
\n", + "\n", + " \\\n", + " 15 16 17 18 19 \n", + "bldg_id \n", + "1 1.370995 1.464129 1.615696 1.780175 1.961877 \n", + "2 1.021211 1.033110 1.120825 1.379164 1.508359 \n", + "3 57.903417 57.760485 57.724989 57.738569 57.761025 \n", + "4 0.445170 0.493049 0.539337 0.576170 0.594879 \n", + "5 7.000456 6.965389 5.152788 3.310272 2.944129 \n", + "6 24.093264 23.897739 23.724209 24.506240 25.755247 \n", + "7 0.630934 0.636866 0.702737 0.799603 0.910874 \n", + "8 0.522068 0.547055 0.548811 0.608025 0.703126 \n", + "9 269.033935 267.266991 239.046661 154.014922 111.503383 \n", + "10 1.326090 1.449488 1.586414 1.707638 1.788099 \n", + "\n", + " \n", + " 20 21 22 23 24 \n", + "bldg_id \n", + "1 2.027405 1.939288 1.726951 1.463641 1.372170 \n", + "2 1.547868 1.505479 1.306167 1.119795 1.070608 \n", + "3 58.752627 54.264969 37.545003 27.301021 26.393634 \n", + "4 0.584274 0.562288 0.505312 0.401838 0.346137 \n", + "5 3.778194 4.762844 4.880666 4.761954 4.666943 \n", + "6 26.807600 22.786241 15.432445 12.726244 11.050264 \n", + "7 0.854578 0.816477 0.673989 0.459247 0.353238 \n", + "8 0.739375 0.791392 0.739055 0.612395 0.479586 \n", + "9 112.502299 125.288994 129.144772 129.320504 129.947422 \n", + "10 1.807647 1.771512 1.613830 1.392948 1.166989 \n", + "\n", + "[10 rows x 24 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "folder_path_train = os.path.join(\n", + " os.getcwd(), \"building-instinct-train-data\"\n", + ") # folder path for the train dataset\n", + "season_months_dict = {\"year\": [i for i in range(1, 13)]}\n", + "\n", + "df_features = calculate_average_hourly_energy_consumption(\n", + " folder_path=folder_path_train, season_months_dict=season_months_dict\n", + ")\n", + "\n", + "df_features.sort_index(inplace=True)\n", + "df_features.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "b0b9d8f4-5267-4bed-98a9-7bae2877726f", + "metadata": {}, + "source": [ + "Next, with the reduced-size 
feature set for the training data, we train a simple classification model and use it for prediction. To this end, we first split the train data (for which we have labels) into train and test sets (an 80/20 split). We use a customized hierarchical classification model that is provided to you in the `train_model()` function found in the `utils.py` module. This function instantiates and trains three separate classification models and returns the trained models:\n", + "\n", + "* A classifier to predict the `building_stock_type` (either 'commercial' or 'residential').\n", + "* A classifier for predicting attributes/metadata of commercial buildings.\n", + "* A classifier for predicting attributes/metadata of residential buildings.\n", + "\n", + "We have also provided a function (`get_pred()`) to make predictions using the above-mentioned trained models. This function takes a feature dataframe and the list of trained classifiers and generates predictions for the `building_stock_type` and its respective attributes based on the hierarchical structure. The predictions are populated in a new dataframe with the same index as the input features and the columns specified in `column_list`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "493cb03e-1898-4a6c-8c26-b604e73bd66a", + "metadata": {}, + "outputs": [], + "source": [ + "X_train, X_test, y_train, y_test = train_test_split(\n", + " df_features, df_targets, test_size=0.2, random_state=42\n", + ")\n", + "column_list = list(df_targets.columns)\n", + "\n", + "classifier_list = train_model(X=X_train, y=y_train)\n", + "y_pred = get_pred(X=X_test, classifier_list=classifier_list, column_list=column_list)" + ] + }, + { + "cell_type": "markdown", + "id": "666ba4fc-e843-4884-b9e9-30f46acc51da", + "metadata": {}, + "source": [ + "Below are the labels for the test portion of the training data (`y_test`) along with the predicted labels (`y_pred`)."
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "dd130a56-ea69-4b9a-8531-510368995df9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
bldg_id | building_stock_type | in.comstock_building_type_group_com | in.heating_fuel_com | in.hvac_category_com | in.number_of_stories_com | in.ownership_type_com | in.vintage_com | in.wall_construction_type_com | in.tstat_clg_sp_f..f_com | in.tstat_htg_sp_f..f_com | ... | in.geometry_building_type_recs_res | in.geometry_floor_area_res | in.geometry_foundation_type_res | in.geometry_wall_type_res | in.heating_fuel_res | in.income_res | in.roof_material_res | in.tenure_res | in.vacancy_status_res | in.vintage_res\n",
+ "3099 | commercial | Mercantile | NaturalGas | Small Packaged Unit | 2 | owner_occupied | Before 1946 | Mass | 73 | 68 | ... | None | None | None | None | None | None | None | None | None | None\n",
+ "2532 | commercial | Warehouse and Storage | NaturalGas | Small Packaged Unit | 1 | leased | 1980 to 1989 | SteelFramed | 999 | 999 | ... | None | None | None | None | None | None | None | None | None | None\n",
+ "4072 | residential | None | None | None | None | None | None | None | None | None | ... | Single-Family Detached | 500-749 | Slab | Wood Frame | Natural Gas | 15000-19999 | Composition Shingles | Owner | Occupied | 1950s\n",
+ "1288 | commercial | Office | Electricity | Small Packaged Unit | 2 | owner_occupied | 1946 to 1959 | Mass | 75 | 66 | ... | None | None | None | None | None | None | None | None | None | None\n",
+ "2541 | residential | None | None | None | None | None | None | None | None | None | ... | Single-Family Detached | 500-749 | Vented Crawlspace | Brick | Natural Gas | 10000-14999 | Metal, Dark | Owner | Occupied | 1960s\n",
+ "... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...\n",
+ "3791 | residential | None | None | None | None | None | None | None | None | None | ... | Single-Family Detached | 2000-2499 | Vented Crawlspace | Brick | Natural Gas | 70000-79999 | Composition Shingles | Owner | Occupied | 1970s\n",
+ "912 | commercial | Warehouse and Storage | Electricity | Multizone CAV/VAV | 1 | leased | Before 1946 | Mass | 999 | 999 | ... | None | None | None | None | None | None | None | None | None | None\n",
+ "6521 | commercial | Mercantile | Electricity | Small Packaged Unit | 1 | leased | 1980 to 1989 | WoodFramed | 73 | 70 | ... | None | None | None | None | None | None | None | None | None | None\n",
+ "2996 | residential | None | None | None | None | None | None | None | None | None | ... | Single-Family Detached | 3000-3999 | Vented Crawlspace | Wood Frame | Fuel Oil | 60000-69999 | Asphalt Shingles, Medium | Owner | Occupied | 1960s\n",
+ "3342 | residential | None | None | None | None | None | None | None | None | None | ... | Single-Family Detached | 1500-1999 | Slab | Wood Frame | Other Fuel | 100000-119999 | Asphalt Shingles, Medium | Owner | Occupied | 1970s\n",
+ "\n",
+ "1440 rows × 25 columns
" + ], + "text/plain": [ + " building_stock_type in.comstock_building_type_group_com \\\n", + "bldg_id \n", + "3099 commercial Mercantile \n", + "2532 commercial Warehouse and Storage \n", + "4072 residential None \n", + "1288 commercial Office \n", + "2541 residential None \n", + "... ... ... \n", + "3791 residential None \n", + "912 commercial Warehouse and Storage \n", + "6521 commercial Mercantile \n", + "2996 residential None \n", + "3342 residential None \n", + "\n", + " in.heating_fuel_com in.hvac_category_com in.number_of_stories_com \\\n", + "bldg_id \n", + "3099 NaturalGas Small Packaged Unit 2 \n", + "2532 NaturalGas Small Packaged Unit 1 \n", + "4072 None None None \n", + "1288 Electricity Small Packaged Unit 2 \n", + "2541 None None None \n", + "... ... ... ... \n", + "3791 None None None \n", + "912 Electricity Multizone CAV/VAV 1 \n", + "6521 Electricity Small Packaged Unit 1 \n", + "2996 None None None \n", + "3342 None None None \n", + "\n", + " in.ownership_type_com in.vintage_com in.wall_construction_type_com \\\n", + "bldg_id \n", + "3099 owner_occupied Before 1946 Mass \n", + "2532 leased 1980 to 1989 SteelFramed \n", + "4072 None None None \n", + "1288 owner_occupied 1946 to 1959 Mass \n", + "2541 None None None \n", + "... ... ... ... \n", + "3791 None None None \n", + "912 leased Before 1946 Mass \n", + "6521 leased 1980 to 1989 WoodFramed \n", + "2996 None None None \n", + "3342 None None None \n", + "\n", + " in.tstat_clg_sp_f..f_com in.tstat_htg_sp_f..f_com ... \\\n", + "bldg_id ... \n", + "3099 73 68 ... \n", + "2532 999 999 ... \n", + "4072 None None ... \n", + "1288 75 66 ... \n", + "2541 None None ... \n", + "... ... ... ... \n", + "3791 None None ... \n", + "912 999 999 ... \n", + "6521 73 70 ... \n", + "2996 None None ... \n", + "3342 None None ... 
\n", + "\n", + " in.geometry_building_type_recs_res in.geometry_floor_area_res \\\n", + "bldg_id \n", + "3099 None None \n", + "2532 None None \n", + "4072 Single-Family Detached 500-749 \n", + "1288 None None \n", + "2541 Single-Family Detached 500-749 \n", + "... ... ... \n", + "3791 Single-Family Detached 2000-2499 \n", + "912 None None \n", + "6521 None None \n", + "2996 Single-Family Detached 3000-3999 \n", + "3342 Single-Family Detached 1500-1999 \n", + "\n", + " in.geometry_foundation_type_res in.geometry_wall_type_res \\\n", + "bldg_id \n", + "3099 None None \n", + "2532 None None \n", + "4072 Slab Wood Frame \n", + "1288 None None \n", + "2541 Vented Crawlspace Brick \n", + "... ... ... \n", + "3791 Vented Crawlspace Brick \n", + "912 None None \n", + "6521 None None \n", + "2996 Vented Crawlspace Wood Frame \n", + "3342 Slab Wood Frame \n", + "\n", + " in.heating_fuel_res in.income_res in.roof_material_res \\\n", + "bldg_id \n", + "3099 None None None \n", + "2532 None None None \n", + "4072 Natural Gas 15000-19999 Composition Shingles \n", + "1288 None None None \n", + "2541 Natural Gas 10000-14999 Metal, Dark \n", + "... ... ... ... \n", + "3791 Natural Gas 70000-79999 Composition Shingles \n", + "912 None None None \n", + "6521 None None None \n", + "2996 Fuel Oil 60000-69999 Asphalt Shingles, Medium \n", + "3342 Other Fuel 100000-119999 Asphalt Shingles, Medium \n", + "\n", + " in.tenure_res in.vacancy_status_res in.vintage_res \n", + "bldg_id \n", + "3099 None None None \n", + "2532 None None None \n", + "4072 Owner Occupied 1950s \n", + "1288 None None None \n", + "2541 Owner Occupied 1960s \n", + "... ... ... ... 
\n", + "3791 Owner Occupied 1970s \n", + "912 None None None \n", + "6521 None None None \n", + "2996 Owner Occupied 1960s \n", + "3342 Owner Occupied 1970s \n", + "\n", + "[1440 rows x 25 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y_test" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "5e61dffd-13ab-47e7-b9aa-ebaef85c6b73", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
bldg_id | building_stock_type | in.comstock_building_type_group_com | in.heating_fuel_com | in.hvac_category_com | in.number_of_stories_com | in.ownership_type_com | in.vintage_com | in.wall_construction_type_com | in.tstat_clg_sp_f..f_com | in.tstat_htg_sp_f..f_com | ... | in.geometry_building_type_recs_res | in.geometry_floor_area_res | in.geometry_foundation_type_res | in.geometry_wall_type_res | in.heating_fuel_res | in.income_res | in.roof_material_res | in.tenure_res | in.vacancy_status_res | in.vintage_res\n",
+ "3099 | commercial | Mercantile | NaturalGas | Small Packaged Unit | 1 | leased | Before 1946 | Mass | 71 | 68 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "2532 | commercial | Warehouse and Storage | NaturalGas | Small Packaged Unit | 1 | owner_occupied | Before 1946 | SteelFramed | 999 | 999 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "4072 | residential | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | Multi-Family with 5+ Units | 500-749 | Slab | Wood Frame | Electricity | 40000-44999 | Asphalt Shingles, Medium | Renter | Occupied | 1990s\n",
+ "1288 | commercial | Office | Electricity | Small Packaged Unit | 1 | owner_occupied | Before 1946 | Mass | 74 | 69 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "2541 | residential | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | Single-Family Detached | 1500-1999 | Vented Crawlspace | Wood Frame | Natural Gas | <10000 | Composition Shingles | Owner | Occupied | 1960s\n",
+ "... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...\n",
+ "3791 | residential | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | Single-Family Detached | 1000-1499 | Slab | Wood Frame | Natural Gas | 80000-99999 | Composition Shingles | Owner | Occupied | 1940s\n",
+ "912 | commercial | Warehouse and Storage | Electricity | Small Packaged Unit | 1 | owner_occupied | 1980 to 1989 | Mass | 999 | 999 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "6521 | commercial | Mercantile | NaturalGas | Small Packaged Unit | 1 | owner_occupied | Before 1946 | SteelFramed | 72 | 68 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "2996 | residential | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | Single-Family Detached | 1500-1999 | Slab | Wood Frame | Electricity | 20000-24999 | Composition Shingles | Owner | Occupied | <1940\n",
+ "3342 | residential | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | Single-Family Detached | 1500-1999 | Slab | Wood Frame | Natural Gas | 80000-99999 | Composition Shingles | Owner | Occupied | 1960s\n",
+ "\n",
+ "1440 rows × 25 columns
" + ], + "text/plain": [ + " building_stock_type in.comstock_building_type_group_com \\\n", + "bldg_id \n", + "3099 commercial Mercantile \n", + "2532 commercial Warehouse and Storage \n", + "4072 residential NaN \n", + "1288 commercial Office \n", + "2541 residential NaN \n", + "... ... ... \n", + "3791 residential NaN \n", + "912 commercial Warehouse and Storage \n", + "6521 commercial Mercantile \n", + "2996 residential NaN \n", + "3342 residential NaN \n", + "\n", + " in.heating_fuel_com in.hvac_category_com in.number_of_stories_com \\\n", + "bldg_id \n", + "3099 NaturalGas Small Packaged Unit 1 \n", + "2532 NaturalGas Small Packaged Unit 1 \n", + "4072 NaN NaN NaN \n", + "1288 Electricity Small Packaged Unit 1 \n", + "2541 NaN NaN NaN \n", + "... ... ... ... \n", + "3791 NaN NaN NaN \n", + "912 Electricity Small Packaged Unit 1 \n", + "6521 NaturalGas Small Packaged Unit 1 \n", + "2996 NaN NaN NaN \n", + "3342 NaN NaN NaN \n", + "\n", + " in.ownership_type_com in.vintage_com in.wall_construction_type_com \\\n", + "bldg_id \n", + "3099 leased Before 1946 Mass \n", + "2532 owner_occupied Before 1946 SteelFramed \n", + "4072 NaN NaN NaN \n", + "1288 owner_occupied Before 1946 Mass \n", + "2541 NaN NaN NaN \n", + "... ... ... ... \n", + "3791 NaN NaN NaN \n", + "912 owner_occupied 1980 to 1989 Mass \n", + "6521 owner_occupied Before 1946 SteelFramed \n", + "2996 NaN NaN NaN \n", + "3342 NaN NaN NaN \n", + "\n", + " in.tstat_clg_sp_f..f_com in.tstat_htg_sp_f..f_com ... \\\n", + "bldg_id ... \n", + "3099 71 68 ... \n", + "2532 999 999 ... \n", + "4072 NaN NaN ... \n", + "1288 74 69 ... \n", + "2541 NaN NaN ... \n", + "... ... ... ... \n", + "3791 NaN NaN ... \n", + "912 999 999 ... \n", + "6521 72 68 ... \n", + "2996 NaN NaN ... \n", + "3342 NaN NaN ... 
\n", + "\n", + " in.geometry_building_type_recs_res in.geometry_floor_area_res \\\n", + "bldg_id \n", + "3099 NaN NaN \n", + "2532 NaN NaN \n", + "4072 Multi-Family with 5+ Units 500-749 \n", + "1288 NaN NaN \n", + "2541 Single-Family Detached 1500-1999 \n", + "... ... ... \n", + "3791 Single-Family Detached 1000-1499 \n", + "912 NaN NaN \n", + "6521 NaN NaN \n", + "2996 Single-Family Detached 1500-1999 \n", + "3342 Single-Family Detached 1500-1999 \n", + "\n", + " in.geometry_foundation_type_res in.geometry_wall_type_res \\\n", + "bldg_id \n", + "3099 NaN NaN \n", + "2532 NaN NaN \n", + "4072 Slab Wood Frame \n", + "1288 NaN NaN \n", + "2541 Vented Crawlspace Wood Frame \n", + "... ... ... \n", + "3791 Slab Wood Frame \n", + "912 NaN NaN \n", + "6521 NaN NaN \n", + "2996 Slab Wood Frame \n", + "3342 Slab Wood Frame \n", + "\n", + " in.heating_fuel_res in.income_res in.roof_material_res \\\n", + "bldg_id \n", + "3099 NaN NaN NaN \n", + "2532 NaN NaN NaN \n", + "4072 Electricity 40000-44999 Asphalt Shingles, Medium \n", + "1288 NaN NaN NaN \n", + "2541 Natural Gas <10000 Composition Shingles \n", + "... ... ... ... \n", + "3791 Natural Gas 80000-99999 Composition Shingles \n", + "912 NaN NaN NaN \n", + "6521 NaN NaN NaN \n", + "2996 Electricity 20000-24999 Composition Shingles \n", + "3342 Natural Gas 80000-99999 Composition Shingles \n", + "\n", + " in.tenure_res in.vacancy_status_res in.vintage_res \n", + "bldg_id \n", + "3099 NaN NaN NaN \n", + "2532 NaN NaN NaN \n", + "4072 Renter Occupied 1990s \n", + "1288 NaN NaN NaN \n", + "2541 Owner Occupied 1960s \n", + "... ... ... ... 
\n", + "3791 Owner Occupied 1940s \n", + "912 NaN NaN NaN \n", + "6521 NaN NaN NaN \n", + "2996 Owner Occupied <1940 \n", + "3342 Owner Occupied 1960s \n", + "\n", + "[1440 rows x 25 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y_pred" + ] + }, + { + "cell_type": "markdown", + "id": "dc2a30d2-5368-4c5f-8fc5-af2405bc7b4c", + "metadata": {}, + "source": [ + "Next, we describe our performance metric: the customized hierarchical $F1$-score that will be used to measure the performance of your classification model. The final $F1$-score is derived by first calculating the $F1$-scores at two hierarchical levels:\n", + "* The `building_stock_type` level, which is the first level of the hierarchy ($F1_{l1}$).\n", + "* The second level, which is conditional on the `building_stock_type` being either 'commercial' or 'residential' ($F1_{l2}$).\n", + "\n", + "The final $F1$-score is a weighted average of the first- and second-level $F1$-scores: $F1$-score = $\\alpha F1_{l1} + (1-\\alpha) F1_{l2}$, where $\\alpha$ is the weight. $F1_{l1}$ represents the $F1$-score for the binary classification of the `building_stock_type` column. $F1_{l2}$ is the arithmetic average of the $F1$-scores for the residential and commercial columns ($F1_{l2}^{res}$ and $F1_{l2}^{com}$): $F1_{l2} = 0.5 \\times (F1_{l2}^{res} + F1_{l2}^{com})$. To calculate $F1_{l2}^{res}$, the macro $F1$-score is first computed for each column whose name ends with `_res`. These scores are then averaged to yield $F1_{l2}^{res}$. Similarly, $F1_{l2}^{com}$ is calculated using the columns whose names end with `_com`. The function to calculate the final $F1$-score is provided for you in `utils.py` (`calculate_hierarchical_f1_score()`). This function (with the default parameter values) will also be used to calculate your predictive leaderboard score, i.e. 
the hierarchical $F1$-score on the test dataset.\n", + "\n", + "If you set the parameter `F1_list` to `True`, this function returns a tuple whose first element is the overall hierarchical $F1$-score and whose second element is a dictionary containing the $F1$-scores for all individual columns. This can help you explore which columns your model classifies well and which it does not. Below we apply this function to the predictions made by the trained model above." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "9ced35e8-a443-4533-9025-524817aed129", + "metadata": {}, + "outputs": [], + "source": [ + "F1, F1_dict = calculate_hierarchical_f1_score(\n", + " y_test, y_pred, alpha=0.4, average=\"macro\", F1_list=True\n", + ")\n", + "\n", + "df_f1_scores = pd.DataFrame(\n", + " list(F1_dict.items()), columns=[\"column name\", \"Macro F1-score\"]\n", + ")\n", + "\n", + "df_f1_scores.set_index(\"column name\", inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "f4923e73-8a16-4ce5-a4fa-547e95a88a10", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Hierarchical F1-score: 0.5257525961502534 \n", + "\n", + "Macro F1-scores for all the individual columns:\n", + "\n" + ] + }, + { + "data": { + "text/html": [ + "
column name | Macro F1-score\n",
+ "building_stock_type | 0.982550\n",
+ "in.vacancy_status_res | 0.923055\n",
+ "in.tenure_res | 0.672234\n",
+ "in.comstock_building_type_group_com | 0.410894\n",
+ "in.bedrooms_res | 0.273604\n",
+ "in.wall_construction_type_com | 0.265956\n",
+ "in.hvac_category_com | 0.256310\n",
+ "in.geometry_building_type_recs_res | 0.247342\n",
+ "in.geometry_wall_type_res | 0.219878\n",
+ "in.geometry_floor_area_res | 0.219610\n",
+ "in.ownership_type_com | 0.212479\n",
+ "in.heating_fuel_com | 0.206151\n",
+ "in.geometry_foundation_type_res | 0.173268\n",
+ "in.roof_material_res | 0.169418\n",
+ "in.heating_fuel_res | 0.162244\n",
+ "in.tstat_clg_sp_f..f_com | 0.156799\n",
+ "in.tstat_htg_sp_f..f_com | 0.140824\n",
+ "in.heating_setpoint_res | 0.135394\n",
+ "in.vintage_res | 0.128259\n",
+ "in.vintage_com | 0.122547\n",
+ "in.number_of_stories_com | 0.101989\n",
+ "in.cooling_setpoint_res | 0.072500\n",
+ "in.income_res | 0.055572\n",
+ "in.weekday_opening_time..hr_com | 0.042650\n",
+ "in.weekday_operating_hours..hr_com | 0.029024
" + ], + "text/plain": [ + " Macro F1-score\n", + "column name \n", + "building_stock_type 0.982550\n", + "in.vacancy_status_res 0.923055\n", + "in.tenure_res 0.672234\n", + "in.comstock_building_type_group_com 0.410894\n", + "in.bedrooms_res 0.273604\n", + "in.wall_construction_type_com 0.265956\n", + "in.hvac_category_com 0.256310\n", + "in.geometry_building_type_recs_res 0.247342\n", + "in.geometry_wall_type_res 0.219878\n", + "in.geometry_floor_area_res 0.219610\n", + "in.ownership_type_com 0.212479\n", + "in.heating_fuel_com 0.206151\n", + "in.geometry_foundation_type_res 0.173268\n", + "in.roof_material_res 0.169418\n", + "in.heating_fuel_res 0.162244\n", + "in.tstat_clg_sp_f..f_com 0.156799\n", + "in.tstat_htg_sp_f..f_com 0.140824\n", + "in.heating_setpoint_res 0.135394\n", + "in.vintage_res 0.128259\n", + "in.vintage_com 0.122547\n", + "in.number_of_stories_com 0.101989\n", + "in.cooling_setpoint_res 0.072500\n", + "in.income_res 0.055572\n", + "in.weekday_opening_time..hr_com 0.042650\n", + "in.weekday_operating_hours..hr_com 0.029024" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print(f\"Hierarchical F1-score: {F1} \\n\")\n", + "\n", + "print(\"Macro F1-scores for all the individual columns:\\n\")\n", + "\n", + "\n", + "df_f1_scores" + ] + }, + { + "cell_type": "markdown", + "id": "c95a5ae9-f4dc-4cb1-94c2-ba47208778b3", + "metadata": {}, + "source": [ + "Your task is to develop a model to predict the metadata for the 1440 buildings in the test dataset. Although you do not have access to the metadata/labels for the test dataset, the ground-truth labels will be used to calculate your leaderboard score. You must submit your predictions as a `.parquet` file. The `.parquet` file should have the same structure as the metadata dataframes in this notebook (e.g. 
`df_targets` or `y_pred`): it MUST have 25 columns with the exact column names: `building_stock_type`, the 11 column names ending with `_com`, and the 13 column names ending with `_res`. The index name MUST be `bldg_id`. Your `.parquet` file MUST have 1440 rows containing your predicted metadata for the 1440 buildings in the test dataset.\n", + "\n", + "As mentioned at the beginning of this notebook, the parquet files of energy consumption are named after the building ID, with a `.parquet` extension. Therefore, the index values (`bldg_id`) in your submission should start with 1 and end with 1440. Please note that the index values MUST be integers corresponding to the building IDs.\n", + "\n", + "To further help you understand the structure of the DataFrame for the submission file, we have provided a function called `sample_submission_generator(bldg_id_list, df_targets, path_to_save)` to generate a prediction DataFrame and save it as a sample `.parquet` file. This function takes a list of building IDs, which should be a list of integers from 1 to 1440. It also takes an input DataFrame (`df_targets`): the function samples from the distribution of classes for each attribute (column) of this DataFrame to populate the entries of the sample submission. Please note that it does not matter much how this function generates the values for the submission file; what matters is the structure of the generated DataFrame, and hence of the saved `.parquet` file: the number of rows and columns, their names, and its hierarchical nature (e.g. if a row is residential, there are no entries in the columns ending with `_com` for that row). The function also takes the file path at which to save the `.parquet` file. 
In addition to saving the `.parquet` file, the function also returns the generated sample DataFrame.\n", + "\n", + "We would like to emphasize again that your submission MUST be a `.parquet` file that meets all the above-mentioned requirements.\n", + "\n", + "Below we use `sample_submission_generator()` to generate and save a sample `.parquet` submission file." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "cc6e7add-1349-4d51-a12f-282cda3d75c8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The generated sample submission DataFrame that is saved as a .parquet file: \n", + "\n" + ] + }, + { + "data": { + "text/html": [ + "
bldg_id | building_stock_type | in.comstock_building_type_group_com | in.heating_fuel_com | in.hvac_category_com | in.number_of_stories_com | in.ownership_type_com | in.vintage_com | in.wall_construction_type_com | in.tstat_clg_sp_f..f_com | in.tstat_htg_sp_f..f_com | ... | in.geometry_building_type_recs_res | in.geometry_floor_area_res | in.geometry_foundation_type_res | in.geometry_wall_type_res | in.heating_fuel_res | in.income_res | in.roof_material_res | in.tenure_res | in.vacancy_status_res | in.vintage_res\n",
+ "1 | commercial | Healthcare | FuelOil | Multizone CAV/VAV | 1 | owner_occupied | 1960 to 1969 | SteelFramed | 999 | 68 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "2 | commercial | Mercantile | Electricity | Small Packaged Unit | 1 | owner_occupied | Before 1946 | Mass | 999 | 68 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "3 | commercial | Office | NaturalGas | Small Packaged Unit | 2 | owner_occupied | Before 1946 | Mass | 72 | 67 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "4 | commercial | Mercantile | DistrictHeating | Small Packaged Unit | 2 | leased | 1960 to 1969 | WoodFramed | 74 | 999 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "5 | residential | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | Multi-Family with 5+ Units | 1000-1499 | Ambient | Wood Frame | Fuel Oil | 15000-19999 | Composition Shingles | Owner | Occupied | <1940\n",
+ "... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...\n",
+ "1436 | residential | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | Multi-Family with 5+ Units | 1500-1999 | Slab | Wood Frame | Electricity | 200000+ | Asphalt Shingles, Medium | Owner | Occupied | 1960s\n",
+ "1437 | residential | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | Single-Family Detached | 0-499 | Unheated Basement | Wood Frame | Electricity | 10000-14999 | Composition Shingles | Owner | Vacant | 1970s\n",
+ "1438 | residential | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | Single-Family Attached | 500-749 | Unvented Crawlspace | Wood Frame | Natural Gas | <10000 | Composition Shingles | Renter | Occupied | 1980s\n",
+ "1439 | commercial | Warehouse and Storage | NaturalGas | Small Packaged Unit | 5 | state | 1946 to 1959 | Mass | 73 | 61 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "1440 | commercial | Mercantile | NaturalGas | Residential Style Central Systems | 1 | leased | 1990 to 1999 | SteelFramed | 72 | 68 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN\n",
+ "\n",
+ "1440 rows × 25 columns
" + ], + "text/plain": [ + " building_stock_type in.comstock_building_type_group_com \\\n", + "bldg_id \n", + "1 commercial Healthcare \n", + "2 commercial Mercantile \n", + "3 commercial Office \n", + "4 commercial Mercantile \n", + "5 residential NaN \n", + "... ... ... \n", + "1436 residential NaN \n", + "1437 residential NaN \n", + "1438 residential NaN \n", + "1439 commercial Warehouse and Storage \n", + "1440 commercial Mercantile \n", + "\n", + " in.heating_fuel_com in.hvac_category_com \\\n", + "bldg_id \n", + "1 FuelOil Multizone CAV/VAV \n", + "2 Electricity Small Packaged Unit \n", + "3 NaturalGas Small Packaged Unit \n", + "4 DistrictHeating Small Packaged Unit \n", + "5 NaN NaN \n", + "... ... ... \n", + "1436 NaN NaN \n", + "1437 NaN NaN \n", + "1438 NaN NaN \n", + "1439 NaturalGas Small Packaged Unit \n", + "1440 NaturalGas Residential Style Central Systems \n", + "\n", + " in.number_of_stories_com in.ownership_type_com in.vintage_com \\\n", + "bldg_id \n", + "1 1 owner_occupied 1960 to 1969 \n", + "2 1 owner_occupied Before 1946 \n", + "3 2 owner_occupied Before 1946 \n", + "4 2 leased 1960 to 1969 \n", + "5 NaN NaN NaN \n", + "... ... ... ... \n", + "1436 NaN NaN NaN \n", + "1437 NaN NaN NaN \n", + "1438 NaN NaN NaN \n", + "1439 5 state 1946 to 1959 \n", + "1440 1 leased 1990 to 1999 \n", + "\n", + " in.wall_construction_type_com in.tstat_clg_sp_f..f_com \\\n", + "bldg_id \n", + "1 SteelFramed 999 \n", + "2 Mass 999 \n", + "3 Mass 72 \n", + "4 WoodFramed 74 \n", + "5 NaN NaN \n", + "... ... ... \n", + "1436 NaN NaN \n", + "1437 NaN NaN \n", + "1438 NaN NaN \n", + "1439 Mass 73 \n", + "1440 SteelFramed 72 \n", + "\n", + " in.tstat_htg_sp_f..f_com ... in.geometry_building_type_recs_res \\\n", + "bldg_id ... \n", + "1 68 ... NaN \n", + "2 68 ... NaN \n", + "3 67 ... NaN \n", + "4 999 ... NaN \n", + "5 NaN ... Multi-Family with 5+ Units \n", + "... ... ... ... \n", + "1436 NaN ... Multi-Family with 5+ Units \n", + "1437 NaN ... 
Single-Family Detached \n", + "1438 NaN ... Single-Family Attached \n", + "1439 61 ... NaN \n", + "1440 68 ... NaN \n", + "\n", + " in.geometry_floor_area_res in.geometry_foundation_type_res \\\n", + "bldg_id \n", + "1 NaN NaN \n", + "2 NaN NaN \n", + "3 NaN NaN \n", + "4 NaN NaN \n", + "5 1000-1499 Ambient \n", + "... ... ... \n", + "1436 1500-1999 Slab \n", + "1437 0-499 Unheated Basement \n", + "1438 500-749 Unvented Crawlspace \n", + "1439 NaN NaN \n", + "1440 NaN NaN \n", + "\n", + " in.geometry_wall_type_res in.heating_fuel_res in.income_res \\\n", + "bldg_id \n", + "1 NaN NaN NaN \n", + "2 NaN NaN NaN \n", + "3 NaN NaN NaN \n", + "4 NaN NaN NaN \n", + "5 Wood Frame Fuel Oil 15000-19999 \n", + "... ... ... ... \n", + "1436 Wood Frame Electricity 200000+ \n", + "1437 Wood Frame Electricity 10000-14999 \n", + "1438 Wood Frame Natural Gas <10000 \n", + "1439 NaN NaN NaN \n", + "1440 NaN NaN NaN \n", + "\n", + " in.roof_material_res in.tenure_res in.vacancy_status_res \\\n", + "bldg_id \n", + "1 NaN NaN NaN \n", + "2 NaN NaN NaN \n", + "3 NaN NaN NaN \n", + "4 NaN NaN NaN \n", + "5 Composition Shingles Owner Occupied \n", + "... ... ... ... \n", + "1436 Asphalt Shingles, Medium Owner Occupied \n", + "1437 Composition Shingles Owner Vacant \n", + "1438 Composition Shingles Renter Occupied \n", + "1439 NaN NaN NaN \n", + "1440 NaN NaN NaN \n", + "\n", + " in.vintage_res \n", + "bldg_id \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "5 <1940 \n", + "... ... 
\n", + "1436 1960s \n", + "1437 1970s \n", + "1438 1980s \n", + "1439 NaN \n", + "1440 NaN \n", + "\n", + "[1440 rows x 25 columns]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bldg_id_list = [i for i in range(1, 1441)]\n", + "df_sample = sample_submission_generator(\n", + " bldg_id_list, df_targets, \"sample_submission.parquet\"\n", + ")\n", + "\n", + "print(\"The generated sample submission DataFrame that is saved as a .parquet file: \\n\")\n", + "df_sample" + ] + }, + { + "cell_type": "markdown", + "id": "c8c9562e-8c50-4692-9c89-aca603e8bdd2", + "metadata": {}, + "source": [ + "You can also read the saved `sample_submission.parquet` file as shown below (Note: this is how your submitted `.parquet` file will be read for scoring)." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "a9a76838-addb-4195-86ad-d884780618c2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The dataframe read from the sample_submission.parquet file: \n", + "\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "
bldg_id | building_stock_type | in.comstock_building_type_group_com | in.heating_fuel_com | in.hvac_category_com | in.number_of_stories_com | in.ownership_type_com | in.vintage_com | in.wall_construction_type_com | in.tstat_clg_sp_f..f_com | in.tstat_htg_sp_f..f_com | ... | in.geometry_building_type_recs_res | in.geometry_floor_area_res | in.geometry_foundation_type_res | in.geometry_wall_type_res | in.heating_fuel_res | in.income_res | in.roof_material_res | in.tenure_res | in.vacancy_status_res | in.vintage_res
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | commercial | Healthcare | FuelOil | Multizone CAV/VAV | 1 | owner_occupied | 1960 to 1969 | SteelFramed | 999 | 68 | ... | None | None | None | None | None | None | None | None | None | None
2 | commercial | Mercantile | Electricity | Small Packaged Unit | 1 | owner_occupied | Before 1946 | Mass | 999 | 68 | ... | None | None | None | None | None | None | None | None | None | None
3 | commercial | Office | NaturalGas | Small Packaged Unit | 2 | owner_occupied | Before 1946 | Mass | 72 | 67 | ... | None | None | None | None | None | None | None | None | None | None
4 | commercial | Mercantile | DistrictHeating | Small Packaged Unit | 2 | leased | 1960 to 1969 | WoodFramed | 74 | 999 | ... | None | None | None | None | None | None | None | None | None | None
5 | residential | None | None | None | None | None | None | None | None | None | ... | Multi-Family with 5+ Units | 1000-1499 | Ambient | Wood Frame | Fuel Oil | 15000-19999 | Composition Shingles | Owner | Occupied | <1940
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1436 | residential | None | None | None | None | None | None | None | None | None | ... | Multi-Family with 5+ Units | 1500-1999 | Slab | Wood Frame | Electricity | 200000+ | Asphalt Shingles, Medium | Owner | Occupied | 1960s
1437 | residential | None | None | None | None | None | None | None | None | None | ... | Single-Family Detached | 0-499 | Unheated Basement | Wood Frame | Electricity | 10000-14999 | Composition Shingles | Owner | Vacant | 1970s
1438 | residential | None | None | None | None | None | None | None | None | None | ... | Single-Family Attached | 500-749 | Unvented Crawlspace | Wood Frame | Natural Gas | <10000 | Composition Shingles | Renter | Occupied | 1980s
1439 | commercial | Warehouse and Storage | NaturalGas | Small Packaged Unit | 5 | state | 1946 to 1959 | Mass | 73 | 61 | ... | None | None | None | None | None | None | None | None | None | None
1440 | commercial | Mercantile | NaturalGas | Residential Style Central Systems | 1 | leased | 1990 to 1999 | SteelFramed | 72 | 68 | ... | None | None | None | None | None | None | None | None | None | None
\n", + "

1440 rows × 25 columns

\n", + "
" + ], + "text/plain": [ + " building_stock_type in.comstock_building_type_group_com \\\n", + "bldg_id \n", + "1 commercial Healthcare \n", + "2 commercial Mercantile \n", + "3 commercial Office \n", + "4 commercial Mercantile \n", + "5 residential None \n", + "... ... ... \n", + "1436 residential None \n", + "1437 residential None \n", + "1438 residential None \n", + "1439 commercial Warehouse and Storage \n", + "1440 commercial Mercantile \n", + "\n", + " in.heating_fuel_com in.hvac_category_com \\\n", + "bldg_id \n", + "1 FuelOil Multizone CAV/VAV \n", + "2 Electricity Small Packaged Unit \n", + "3 NaturalGas Small Packaged Unit \n", + "4 DistrictHeating Small Packaged Unit \n", + "5 None None \n", + "... ... ... \n", + "1436 None None \n", + "1437 None None \n", + "1438 None None \n", + "1439 NaturalGas Small Packaged Unit \n", + "1440 NaturalGas Residential Style Central Systems \n", + "\n", + " in.number_of_stories_com in.ownership_type_com in.vintage_com \\\n", + "bldg_id \n", + "1 1 owner_occupied 1960 to 1969 \n", + "2 1 owner_occupied Before 1946 \n", + "3 2 owner_occupied Before 1946 \n", + "4 2 leased 1960 to 1969 \n", + "5 None None None \n", + "... ... ... ... \n", + "1436 None None None \n", + "1437 None None None \n", + "1438 None None None \n", + "1439 5 state 1946 to 1959 \n", + "1440 1 leased 1990 to 1999 \n", + "\n", + " in.wall_construction_type_com in.tstat_clg_sp_f..f_com \\\n", + "bldg_id \n", + "1 SteelFramed 999 \n", + "2 Mass 999 \n", + "3 Mass 72 \n", + "4 WoodFramed 74 \n", + "5 None None \n", + "... ... ... \n", + "1436 None None \n", + "1437 None None \n", + "1438 None None \n", + "1439 Mass 73 \n", + "1440 SteelFramed 72 \n", + "\n", + " in.tstat_htg_sp_f..f_com ... in.geometry_building_type_recs_res \\\n", + "bldg_id ... \n", + "1 68 ... None \n", + "2 68 ... None \n", + "3 67 ... None \n", + "4 999 ... None \n", + "5 None ... Multi-Family with 5+ Units \n", + "... ... ... ... \n", + "1436 None ... 
Multi-Family with 5+ Units \n", + "1437 None ... Single-Family Detached \n", + "1438 None ... Single-Family Attached \n", + "1439 61 ... None \n", + "1440 68 ... None \n", + "\n", + " in.geometry_floor_area_res in.geometry_foundation_type_res \\\n", + "bldg_id \n", + "1 None None \n", + "2 None None \n", + "3 None None \n", + "4 None None \n", + "5 1000-1499 Ambient \n", + "... ... ... \n", + "1436 1500-1999 Slab \n", + "1437 0-499 Unheated Basement \n", + "1438 500-749 Unvented Crawlspace \n", + "1439 None None \n", + "1440 None None \n", + "\n", + " in.geometry_wall_type_res in.heating_fuel_res in.income_res \\\n", + "bldg_id \n", + "1 None None None \n", + "2 None None None \n", + "3 None None None \n", + "4 None None None \n", + "5 Wood Frame Fuel Oil 15000-19999 \n", + "... ... ... ... \n", + "1436 Wood Frame Electricity 200000+ \n", + "1437 Wood Frame Electricity 10000-14999 \n", + "1438 Wood Frame Natural Gas <10000 \n", + "1439 None None None \n", + "1440 None None None \n", + "\n", + " in.roof_material_res in.tenure_res in.vacancy_status_res \\\n", + "bldg_id \n", + "1 None None None \n", + "2 None None None \n", + "3 None None None \n", + "4 None None None \n", + "5 Composition Shingles Owner Occupied \n", + "... ... ... ... \n", + "1436 Asphalt Shingles, Medium Owner Occupied \n", + "1437 Composition Shingles Owner Vacant \n", + "1438 Composition Shingles Renter Occupied \n", + "1439 None None None \n", + "1440 None None None \n", + "\n", + " in.vintage_res \n", + "bldg_id \n", + "1 None \n", + "2 None \n", + "3 None \n", + "4 None \n", + "5 <1940 \n", + "... ... 
\n", + "1436 1960s \n", + "1437 1970s \n", + "1438 1980s \n", + "1439 None \n", + "1440 None \n", + "\n", + "[1440 rows x 25 columns]" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_from_parquet = pd.read_parquet(\"sample_submission.parquet\", engine=\"pyarrow\")\n", + "\n", + "print(\"The dataframe read from the sample_submission.parquet file: \\n\")\n", + "\n", + "df_from_parquet" + ] + }, + { + "cell_type": "markdown", + "id": "f46e0a0b-9131-4d2e-ac4b-e1cf36125411", + "metadata": {}, + "source": [ + "\n", + "## And finally, thank you for choosing to participate in this challenge!\n", + "## Best of luck and have fun!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/energy-analysis/building-instinct/utils.py b/energy-analysis/building-instinct/utils.py new file mode 100644 index 0000000..f3fd741 --- /dev/null +++ b/energy-analysis/building-instinct/utils.py @@ -0,0 +1,433 @@ +import os +import pandas as pd +import numpy as np +import random +from datetime import datetime +from sklearn.pipeline import Pipeline +from sklearn.compose import ColumnTransformer +from sklearn.preprocessing import StandardScaler, OneHotEncoder +from sklearn.ensemble import RandomForestClassifier +from sklearn.metrics import f1_score + + +def calculate_average_hourly_energy_consumption(folder_path, season_months_dict): + """ + Process multiple parquet files in a folder, calculate hourly average energy consumption, + and return a pandas DataFrame with each row corresponding to one file in the folder. 
+ + Parameters: + - folder_path (str): Path to the folder containing parquet files. + - season_months_dict (dict): A dictionary where keys are season names (strings) and values are lists + of corresponding month numbers. For example, {'cold': [1, 2, 12], 'hot': [6, 7, 8], 'mild': [3, 4, 5, 9, 10, 11]}. + + Returns: + - df_ave_hourly (pd.DataFrame): A pandas DataFrame with each row corresponding to one file in the folder (i.e. one building). + The columns are multi-layer with the first layer being the season and the second layer the hour of the day + Index ('bldg_id') contains building IDs. Column values are average hourly electricity energy consumption + """ + # Initialize an empty list to store individual DataFrames for each file + result_dfs = [] + + # Iterate through all files in the folder_path + for file_name in os.listdir(folder_path): + if file_name.endswith(".parquet"): + # Extract the bldg_id from the file name + bldg_id = int(file_name.split(".")[0]) + + # Construct the full file path + file_path = os.path.join(folder_path, file_name) + + # Read the original parquet file + df = pd.read_parquet(file_path) + + # Convert 'timestamp' column to datetime + df["timestamp"] = pd.to_datetime(df["timestamp"]) + + # Extract month and hour from 'timestamp' + df["month"] = df["timestamp"].dt.month + df["hour"] = df["timestamp"].dt.hour + + # Create a mapping from month to the corresponding season + month_to_season = { + month: season + for season, months_list in season_months_dict.items() + for month in months_list + } + + # Assign a season to each row based on the month + df["season"] = df["month"].map(month_to_season) + + # Calculate hourly average energy consumption for each row + df["hourly_avg_energy_consumption"] = 4 * df.groupby(["season", "hour"])[ + "out.electricity.total.energy_consumption" + ].transform("mean") + + # Pivot the dataframe to create the desired output format + result_df = df.pivot_table( + values="hourly_avg_energy_consumption", + 
index="bldg_id", + columns=["season", "hour"], + ) + + # Relabel hours from 0-23 to 1-24, keeping each season label from the pivoted columns + # (pivot_table sorts seasons alphabetically, which may differ from the order of season_months_dict) + result_df.columns = pd.MultiIndex.from_tuples( + [(season, hour + 1) for season, hour in result_df.columns] + ) + + # Add 'bldg_id' index with values corresponding to the names of the parquet files + result_df["bldg_id"] = bldg_id + result_df.set_index("bldg_id", inplace=True) + + # Append the result_df to the list + result_dfs.append(result_df) + + # Concatenate all individual DataFrames into a single DataFrame + df_ave_hourly = pd.concat(result_dfs, ignore_index=False) + + return df_ave_hourly + + +def train_model(X, y): + """ + Train hierarchical classification models for predicting building stock types and their respective attributes. + + This function trains three separate models: + 1. A classifier to predict the 'building_stock_type' (either 'commercial' or 'residential'). + 2. A classifier for predicting attributes of commercial buildings. + 3. A classifier for predicting attributes of residential buildings. + + The function preprocesses the input data using standard scaling and optional one-hot encoding before training the classifiers. + + Parameters: + ---------- + X : pd.DataFrame + The feature dataframe used for training. Each row represents a building, and each column represents a feature. + + y : pd.DataFrame + The target dataframe containing the labels. It includes the 'building_stock_type' column and other columns ending + with '_com' for commercial attributes and '_res' for residential attributes. + + Returns: + ------- + list + A list of three trained classifiers: + 1. classifier_type: A RandomForestClassifier model for predicting 'building_stock_type'. + 2. classifier_residential: A RandomForestClassifier model for predicting residential attributes. + 3. classifier_commercial: A RandomForestClassifier model for predicting commercial attributes. 
+ + """ + + # Define column transformers for commercial and residential buildings + transformer_commercial = ColumnTransformer( + [("scaler", StandardScaler(), X.columns), ("encoder", OneHotEncoder(), [])] + ) + + transformer_residential = ColumnTransformer( + [("scaler", StandardScaler(), X.columns), ("encoder", OneHotEncoder(), [])] + ) + + # Filter features and targets for commercial and residential buildings + X_commercial = X[y["building_stock_type"] == "commercial"] + X_residential = X[y["building_stock_type"] == "residential"] + y_commercial = y[y["building_stock_type"] == "commercial"].filter(like="_com") + y_residential = y[y["building_stock_type"] == "residential"].filter(like="_res") + + # Train classifier to predict 'building_stock_type' + classifier_type = Pipeline( + [ + ( + "preprocessor", + ColumnTransformer( + [ + ("scaler", StandardScaler(), X.columns), + ("encoder", OneHotEncoder(), []), + ] + ), + ), + ("classifier", RandomForestClassifier(random_state=42)), + ] + ) + classifier_type.fit(X, y["building_stock_type"]) + + # Train separate classifiers for commercial and residential buildings + classifier_commercial = Pipeline( + [ + ("preprocessor", transformer_commercial), + ("classifier", RandomForestClassifier(random_state=42)), + ] + ) + + classifier_residential = Pipeline( + [ + ("preprocessor", transformer_residential), + ("classifier", RandomForestClassifier(random_state=42)), + ] + ) + + # Train models + classifier_commercial.fit(X_commercial, y_commercial) + classifier_residential.fit(X_residential, y_residential) + + return [classifier_type, classifier_residential, classifier_commercial] + + +def get_pred(X, classifier_list, column_list): + """ + Generate predictions for a hierarchical multi-output multi-class classification problem. + + This function takes in a feature dataframe and a list of trained classifiers to generate predictions + for the 'building_stock_type' and its respective attributes based on the hierarchical structure. 
The predictions + are populated in a new dataframe with the same index as the input features and columns specified in the column list. + + Parameters: + ---------- + X : pd.DataFrame + The feature dataframe used for making predictions. Each row represents a building, and each column represents a feature. + + classifier_list : list + A list of three trained classifiers: + 1. classifier_type: A model for predicting 'building_stock_type'. + 2. classifier_residential: A model for predicting residential attributes. + 3. classifier_commercial: A model for predicting commercial attributes. + + column_list : list + A list of column names to be included in the output predictions dataframe. This should include 'building_stock_type' + and other columns ending with '_com' for commercial attributes and '_res' for residential attributes. + + Returns: + ------- + pd.DataFrame + A dataframe containing the predictions. The index matches the input feature dataframe's index, and the columns are + specified by the column list. The values are populated based on the hierarchical structure of the predictions. 
+ + """ + + classifier_type, classifier_residential, classifier_commercial = classifier_list + + # Predict 'building_stock_type' + y_pred_type = classifier_type.predict(X) + + # Predict relevant columns based on predicted 'building_stock_type' + y_pred_commercial = classifier_commercial.predict(X[y_pred_type == "commercial"]) + y_pred_residential = classifier_residential.predict(X[y_pred_type == "residential"]) + + y_pred = pd.DataFrame(index=X.index, columns=column_list) + + # Set all values in y_pred to np.nan + y_pred[:] = np.nan + + # Ensure the index name is the same + y_pred.index.name = X.index.name + + y_pred["building_stock_type"] = y_pred_type + y_pred.loc[ + y_pred_type == "commercial", y_pred.columns.str.endswith("_com") + ] = y_pred_commercial + y_pred.loc[ + y_pred_type == "residential", y_pred.columns.str.endswith("_res") + ] = y_pred_residential + + return y_pred + + +def calculate_hierarchical_f1_score( + df_targets, df_pred, alpha=0.4, average="macro", F1_list=False +): + """ + Calculate the hierarchical F1-score for a multi-level classification problem. + + This function computes the F1-score at two hierarchical levels: + 1. The 'building_stock_type' level, which is the first level of hierarchy. + 2. The second level, which is conditional on the 'building_stock_type' being either 'commercial' or 'residential'. + + The final F1-score is a weighted average of the first level and second level F1-scores. + + Parameters: + ---------- + df_targets : pd.DataFrame + The dataframe containing the true target values. It must include a column 'building_stock_type' and other + columns ending with '_com' or '_res' representing the second level of classification. + + df_pred : pd.DataFrame + The dataframe containing the predicted values. It must be structured similarly to `df_targets`. + + alpha : float, optional, default=0.4 + The weight given to the first level F1-score in the final score calculation. 
The weight for the second level + F1-score will be (1 - alpha). + + average : str, optional, default='macro' + The averaging method for calculating the F1-score. It is passed directly to the `f1_score` function from sklearn. + + F1_list : bool, optional, default=False + If True, the function returns a dictionary of F1-scores for all individual columns along with the overall F1-score. + + Returns: + ------- + float or tuple + If `F1_list` is False, returns a single float representing the overall hierarchical F1-score. + If `F1_list` is True, returns a tuple where the first element is the overall hierarchical F1-score and the second + element is a dictionary containing the F1-scores for all individual columns. + + """ + + def calculate_f1_l2(df_targets, df_pred, average): + """ + Calculate the F1-score for the second level of hierarchy. + + Parameters: + ---------- + df_targets : pd.DataFrame + The dataframe containing the true target values for the second level of hierarchy. + df_pred : pd.DataFrame + The dataframe containing the predicted values for the second level of hierarchy. + average : str + The averaging method for calculating the F1-score. + + Returns: + ------- + dict + A dictionary where keys are column names and values are the corresponding F1-scores. 
+ """ + F1_l2_dict = {column: 0 for column in df_targets.columns} + + # Find the intersection of indices + common_indices = df_targets.index.intersection(df_pred.index) + + # Check if the intersection is empty + if common_indices.empty: + return F1_l2_dict + else: + # Select only the rows with common indices + df_targets_common = df_targets.loc[common_indices] + df_pred_common = df_pred.loc[common_indices] + + # Calculate the F1-score for each column based on the common rows + for column in df_targets.columns: + F1_l2_dict[column] = f1_score( + df_targets_common[column], df_pred_common[column], average=average + ) + + return F1_l2_dict + + # Sort both dataframes based on index + df_targets = df_targets.sort_index() + df_pred = df_pred.sort_index() + + # Calculate F1 score for the first level of hierarchy + F1_l1 = f1_score( + df_targets["building_stock_type"], + df_pred["building_stock_type"], + average=average, + ) + F1_dict = {"building_stock_type": F1_l1} + + # Calculate F1 score for the second level of hierarchy (commercial buildings) + df_com_targets = df_targets[ + df_targets["building_stock_type"] == "commercial" + ].filter(like="_com") + df_com_pred = df_pred[df_pred["building_stock_type"] == "commercial"].filter( + like="_com" + ) + F1_l2_dict_com = calculate_f1_l2(df_com_targets, df_com_pred, average) + F1_l2_com = sum(F1_l2_dict_com.values()) / len(F1_l2_dict_com.values()) + + F1_l2_dict = {} + F1_l2_dict.update(F1_l2_dict_com) + + # Calculate F1 score for the second level of hierarchy (residential buildings) + df_res_targets = df_targets[ + df_targets["building_stock_type"] == "residential" + ].filter(like="_res") + df_res_pred = df_pred[df_pred["building_stock_type"] == "residential"].filter( + like="_res" + ) + F1_l2_dict_res = calculate_f1_l2(df_res_targets, df_res_pred, average) + F1_l2_res = sum(F1_l2_dict_res.values()) / len(F1_l2_dict_res.values()) + + F1_l2_dict.update(F1_l2_dict_res) + F1_l2_dict_sorted = sorted(F1_l2_dict.items(), key=lambda 
x: x[1], reverse=True) + F1_dict.update(F1_l2_dict_sorted) + + # Calculate F1 score for the second level of hierarchy + F1_l2 = (F1_l2_com + F1_l2_res) / 2 + + # Calculate overall F1 score + F1 = alpha * F1_l1 + (1 - alpha) * F1_l2 + + if F1_list: + return F1, F1_dict + + return F1 + + +def sample_submission_generator(bldg_id_list, df_targets, path_to_save): + """ + Generate a sample submission dataframe with a specified distribution and save it as a .parquet file. + + This function creates a dataframe with the same columns as `df_targets` and populates it with values + that resemble the distribution of values in `df_targets`. The index of the dataframe is given by + `bldg_id_list`, and the column values are sampled to match the distribution of the corresponding columns + in `df_targets`. The dataframe is then saved as a .parquet file to the specified path. + + Parameters: + ---------- + bldg_id_list : list of int + List of building IDs to be used as the index of the generated dataframe. + + df_targets : pd.DataFrame + The metadata dataframe used to sample the column values. It provides the distribution of values for + each column to be replicated in the generated dataframe. + + path_to_save : str + The path where the generated dataframe will be saved as a .parquet file. + + Returns: + ------- + pd.DataFrame + The generated dataframe with the same columns as `df_targets` and index values as `bldg_id_list`, + populated with values sampled from the distribution of `df_targets`. 
+ + """ + + # Create an empty dataframe with the same columns and index name as df_targets + df = pd.DataFrame(index=bldg_id_list, columns=df_targets.columns) + df.index.name = df_targets.index.name + + # Populate the first column 'building_stock_type' + building_stock_type_distribution = df_targets["building_stock_type"].value_counts( + normalize=True + ) + df["building_stock_type"] = np.random.choice( + building_stock_type_distribution.index, + size=len(bldg_id_list), + p=building_stock_type_distribution.values, + ) + + # Separate columns into residential and commercial + res_columns = [col for col in df_targets.columns if col.endswith("_res")] + com_columns = [col for col in df_targets.columns if col.endswith("_com")] + + # Populate the rest of the columns based on the value of 'building_stock_type' + for bldg_id in df.index: + if df.at[bldg_id, "building_stock_type"] == "residential": + df.loc[bldg_id, com_columns] = np.nan + for col in res_columns: + distribution = df_targets[col].value_counts(normalize=True) + df.at[bldg_id, col] = np.random.choice( + distribution.index, p=distribution.values + ) + else: # commercial + df.loc[bldg_id, res_columns] = np.nan + for col in com_columns: + distribution = df_targets[col].value_counts(normalize=True) + df.at[bldg_id, col] = np.random.choice( + distribution.index, p=distribution.values + ) + + # Save the dataframe as a parquet file + df.to_parquet(path_to_save) + return df
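The weighting that `calculate_hierarchical_f1_score` applies can be illustrated with a small, dependency-free sketch. The helper names `macro_f1` and `hierarchical_f1` are hypothetical and not part of `utils.py`; `macro_f1` is a pure-Python stand-in for `sklearn.metrics.f1_score(..., average="macro")`, and the per-branch second-level scores (0.5 and 0.25) are made-up numbers chosen only to show the arithmetic.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred.

    Assumption: this mimics macro averaging; a class with no true or
    predicted samples contributes an F1 of 0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def hierarchical_f1(f1_level1, f1_commercial, f1_residential, alpha=0.4):
    """Combine the two hierarchy levels as in utils.py:
    F1 = alpha * F1_l1 + (1 - alpha) * F1_l2, with F1_l2 the mean of the
    commercial and residential second-level scores.
    """
    f1_level2 = (f1_commercial + f1_residential) / 2
    return alpha * f1_level1 + (1 - alpha) * f1_level2


# Toy first-level labels: one commercial building is misclassified as residential.
y_true = ["commercial", "commercial", "residential", "residential"]
y_pred = ["commercial", "residential", "residential", "residential"]

f1_type = macro_f1(y_true, y_pred)
# Hypothetical second-level scores for each branch.
score = hierarchical_f1(f1_type, f1_commercial=0.5, f1_residential=0.25)
print(round(f1_type, 4), round(score, 4))
```

With `alpha=0.4`, a mistake at the `building_stock_type` level costs 40% of the score directly, and it also lowers the second-level scores because the affected rows drop out of the index intersection used in `calculate_f1_l2`.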