Merge pull request #17 from pascalhorton/dev-pascal

CNN development
pascalhorton · Aug 6, 2024 · a023c40
2 parents 02b0c33 + dc4453b
commit a023c40
Showing 21 changed files with 1,104 additions and 381 deletions.
diff --git a/README.md b/README.md
@@ -61,34 +61,50 @@ The damages correspond to insurance claims per cell (pixel of the precipitation
 The damages are managed by the `Damages` class in `swafi/damages.py` and are also handled internally as a Pandas dataframe.
 
 There are two classes of damages:
-- `DamagesMobiliar` from the file `swafi/damages_mobiliar.py`: handles the claims from the Swiss Mobiliar Insurance Company as GeoTIFF.
-  The dataset from the Mobiliar contains the following categories of claims:
-
-  | Name in swafi       | Client  | Ext/Int  | Object    | Flood type | Original file names               |
-  |---------------------|---------|----------|-----------|------------|-----------------------------------|
-  | sme_ext_cont_pluv   | SME     | external | content   | pluvial    | Ueberschwemmung_pluvial_KMU_FH    |
-  | sme_ext_cont_fluv   | SME     | external | content   | fluvial    | Ueberschwemmung_fluvial_KMU_FH    |
-  | sme_ext_struc_pluv  | SME     | external | structure | pluvial    | Ueberschwemmung_pluvial_KMU_GB    |
-  | sme_ext_struc_fluv  | SME     | external | structure | fluvial    | Ueberschwemmung_fluvial_KMU_GB    |
-  | sme_int_cont        | SME     | internal | content   |            | Wasser_KMU_FH                     |
-  | sme_int_struc       | SME     | internal | structure |            | Wasser_KMU_GB                     |
-  | priv_ext_cont_pluv  | Private | external | content   | pluvial    | Ueberschwemmung_pluvial_Privat_FH |
-  | priv_ext_cont_fluv  | Private | external | content   | fluvial    | Ueberschwemmung_fluvial_Privat_FH |
-  | priv_ext_struc_pluv | Private | external | structure | pluvial    | Ueberschwemmung_pluvial_Privat_GB |
-  | priv_ext_struc_fluv | Private | external | structure | fluvial    | Ueberschwemmung_fluvial_Privat_GB |
-  | priv_int_cont       | Private | internal | content   |            | Wasser_Privat_FH                  |
-  | priv_int_struc      | Private | internal | structure |            | Wasser_Privat_GB                  |
-
-- `DamagesGVZ` from the file `swafi/damages_gvz.py`: handles the claims from the GVZ (Building insurance Canton Zurich) as netCDF.
-  The dataset from the GVZ contains the following categories:
-
-  | Name in swafi       | Original tag  |
-  |---------------------|---------------|
-  | most_likely_pluvial | A             |
-  | likely_pluvial      | A, B          |
-  | fluvial_or_pluvial  | A, B, C, D, E |
-  | likely_fluvial      | D, E          |
-  | most_likely_fluvial | E             |
+
+#### DamagesMobiliar
+`DamagesMobiliar` from the file `swafi/damages_mobiliar.py`: handles the claims from the Swiss Mobiliar Insurance Company as GeoTIFF.
+The dataset from the Mobiliar contains the following categories of **exposure** (contracts):
+
+| Name in swafi       | Client  | Ext/Int  | Object    | Original file names         |
+|---------------------|---------|----------|-----------|-----------------------------|
+| sme_ext_cont        | SME     | external | content   | Vertraege_KMU_ES_FH_YYYY    |
+| sme_ext_struc       | SME     | external | structure | Vertraege_KMU_ES_GB_YYYY    |
+| sme_int_cont        | SME     | internal | content   | Vertraege_KMU_W_FH_YYYY     |
+| sme_int_struc       | SME     | internal | structure | Vertraege_KMU_W_GB_YYYY     |
+| priv_ext_cont       | Private | external | content   | Vertraege_Privat_ES_FH_YYYY |
+| priv_ext_struc      | Private | external | structure | Vertraege_Privat_ES_GB_YYYY |
+| priv_int_cont       | Private | internal | content   | Vertraege_Privat_W_FH_YYYY  |
+| priv_int_struc      | Private | internal | structure | Vertraege_Privat_W_GB_YYYY  |
+
+The dataset from the Mobiliar contains the following categories of **claims**:
+
+| Name in swafi       | Client  | Ext/Int  | Object    | Flood type | Original file names               |
+|---------------------|---------|----------|-----------|------------|-----------------------------------|
+| sme_ext_cont_pluv   | SME     | external | content   | pluvial    | Ueberschwemmung_pluvial_KMU_FH    |
+| sme_ext_cont_fluv   | SME     | external | content   | fluvial    | Ueberschwemmung_fluvial_KMU_FH    |
+| sme_ext_struc_pluv  | SME     | external | structure | pluvial    | Ueberschwemmung_pluvial_KMU_GB    |
+| sme_ext_struc_fluv  | SME     | external | structure | fluvial    | Ueberschwemmung_fluvial_KMU_GB    |
+| sme_int_cont        | SME     | internal | content   |            | Wasser_KMU_FH                     |
+| sme_int_struc       | SME     | internal | structure |            | Wasser_KMU_GB                     |
+| priv_ext_cont_pluv  | Private | external | content   | pluvial    | Ueberschwemmung_pluvial_Privat_FH |
+| priv_ext_cont_fluv  | Private | external | content   | fluvial    | Ueberschwemmung_fluvial_Privat_FH |
+| priv_ext_struc_pluv | Private | external | structure | pluvial    | Ueberschwemmung_pluvial_Privat_GB |
+| priv_ext_struc_fluv | Private | external | structure | fluvial    | Ueberschwemmung_fluvial_Privat_GB |
+| priv_int_cont       | Private | internal | content   |            | Wasser_Privat_FH                  |
+| priv_int_struc      | Private | internal | structure |            | Wasser_Privat_GB                  |
+
+#### DamagesGvz
+`DamagesGvz` from the file `swafi/damages_gvz.py`: handles the claims from the GVZ (building insurance of the Canton of Zurich) as netCDF.
+The dataset from the GVZ contains a single category of **exposure** (contracts): `all_buildings`, and the following categories of **claims**:
+
+| Name in swafi       | Original tag  |
+|---------------------|---------------|
+| most_likely_pluvial | A             |
+| likely_pluvial      | A, B          |
+| fluvial_or_pluvial  | A, B, C, D, E |
+| likely_fluvial      | D, E          |
+| most_likely_fluvial | E             |
 
 These classes are subclasses of the `Damages` class and implement the data loading according to the corresponding file format as well as their specific classification.
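
The naming scheme used in the claims tables above can be reproduced programmatically. The following is a minimal sketch, not code from `swafi`: the helper names `mobiliar_claim_categories` and `gvz_categories_for_tag` are hypothetical, and only the category names and original tags are taken from the tables.

```python
from itertools import product

# Components of the Mobiliar claim category names, as listed in the README table.
CLIENTS = ["sme", "priv"]
OBJECTS = ["cont", "struc"]
FLOOD_TYPES = ["pluv", "fluv"]

# GVZ claim categories and the original tags they include (from the README table).
GVZ_TAGS = {
    "most_likely_pluvial": {"A"},
    "likely_pluvial": {"A", "B"},
    "fluvial_or_pluvial": {"A", "B", "C", "D", "E"},
    "likely_fluvial": {"D", "E"},
    "most_likely_fluvial": {"E"},
}


def mobiliar_claim_categories():
    """Hypothetical helper: list the 12 Mobiliar claim category names."""
    names = []
    for client, obj in product(CLIENTS, OBJECTS):
        # External claims are split by flood type ...
        for flood in FLOOD_TYPES:
            names.append(f"{client}_ext_{obj}_{flood}")
        # ... whereas internal claims carry no flood type.
        names.append(f"{client}_int_{obj}")
    return names


def gvz_categories_for_tag(tag):
    """Hypothetical helper: GVZ categories that include a given original tag."""
    return [name for name, tags in GVZ_TAGS.items() if tag in tags]
```

Note that the GVZ categories overlap by construction: a claim tagged `A` belongs to `most_likely_pluvial`, `likely_pluvial`, and `fluvial_or_pluvial` at the same time.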
 
@@ -273,7 +289,7 @@ All the hyperparameters of the model can be set as options of the script.
 The model can be trained using the following command:
 
 ```bash
-train_dl_occurrence.py [-h] [--run-id RUN_ID] [--optimize-with-optuna]
+train_cnn_occurrence.py [-h] [--run-id RUN_ID] [--optimize-with-optuna]
                        [--target-type TARGET_TYPE]
                        [--factor-neg-reduction FACTOR_NEG_REDUCTION]
                        [--weight-denominator WEIGHT_DENOMINATOR]
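
For readers unfamiliar with this usage notation, the options shown so far map onto `argparse` roughly as sketched below. This is an illustration only, not the parser of `train_cnn_occurrence.py`: the option names come from the usage text, while the types and the `None` defaults are assumptions, and the usage text is truncated here so further options likely exist.

```python
import argparse


def build_parser():
    """Illustrative parser; only the option names come from the usage text."""
    parser = argparse.ArgumentParser(prog="train_cnn_occurrence.py")
    # Types and defaults below are assumptions made for this sketch.
    parser.add_argument("--run-id", type=int, default=None)
    parser.add_argument("--optimize-with-optuna", action="store_true")
    parser.add_argument("--target-type", type=str, default=None)
    parser.add_argument("--factor-neg-reduction", type=int, default=None)
    parser.add_argument("--weight-denominator", type=int, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--run-id", "3", "--optimize-with-optuna"])
print(args.run_id, args.optimize_with_optuna, args.target_type)
```

Square brackets in the usage text mark optional arguments, and an upper-case placeholder such as `RUN_ID` indicates that the option expects a value, whereas `--optimize-with-optuna` is a plain switch.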

diff --git a/config_example.yaml b/config_example.yaml
@@ -9,8 +9,8 @@ YEAR_END_MOBILIAR: 2022
 YEAR_START_GVZ: 2005
 YEAR_END_GVZ: 2022
 
-# CID (cells IDs) raster path
-CID_PATH: '..\files\cids.tif'
+# CID (cell IDs) raster path (not needed for Switzerland)
+CID_PATH: 'path/to/cids.tif'
 
 # Contract and damage data directories
 DIR_EXPOSURE_MOBILIAR: ''
@@ -19,7 +19,7 @@ DIR_EXPOSURE_GVZ: ''
 DIR_CLAIMS_GVZ: ''
 
 # Path to the events parquet file
-EVENTS_PATH: ''
+EVENTS_PATH: 'path/to/prec_events_2005-2023_no_smoothing.parquet'
 
 # Path to the precipitation directory
 DIR_PRECIP: ''
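
One practical effect of the `CID_PATH` change is portability, since a backslash is a path separator on Windows only. The commit does not state this motivation, so it is an assumption here; the sketch below simply shows how the two example values are parsed by the standard `pathlib` pure-path classes.

```python
from pathlib import PurePosixPath, PureWindowsPath

old_value = r"..\files\cids.tif"  # previous example value in config_example.yaml
new_value = "path/to/cids.tif"    # new example value

# On POSIX systems backslashes are ordinary characters, so the old value is
# parsed as a single file name rather than as a relative path.
print(PurePosixPath(old_value).parts)
print(PureWindowsPath(old_value).parts)

# Forward slashes are understood as separators by both path flavours.
print(PurePosixPath(new_value).parts)
print(PureWindowsPath(new_value).parts)
```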

diff --git a/requirements-optional.txt b/requirements-optional.txt
@@ -6,3 +6,4 @@ optuna
 asyncpg
 psycopg2
 psycopg2-binary
+plotly
diff --git a/scripts/data_analyses/analyze_damage_data.py b/scripts/data_analyses/analyze_damage_data.py
@@ -0,0 +1,186 @@
+"""
+This script analyzes the monthly distribution of the claims and the distribution
+of the number of contracts and claims per cell.
+"""
+
+from swafi.config import Config
+from swafi.damages_mobiliar import DamagesMobiliar
+from swafi.damages_gvz import DamagesGvz
+import pandas as pd
+import matplotlib.pyplot as plt
+
+config = Config(output_dir='analysis_damage_distribution')
+output_dir = config.output_dir
+
+PICKLES_DIR = config.get('PICKLES_DIR')
+DATASET = 'mobiliar'  # 'mobiliar' or 'gvz'
+
+if DATASET == 'mobiliar':
+    EXPOSURE_CATEGORIES = ['external']
+    CLAIM_CATEGORIES = ['external', 'pluvial']
+elif DATASET == 'gvz':
+    EXPOSURE_CATEGORIES = ['all_buildings']
+    CLAIM_CATEGORIES = ['likely_pluvial']
+
+
+def main():
+    if DATASET == 'mobiliar':
+        damages = DamagesMobiliar(dir_exposure=config.get('DIR_EXPOSURE_MOBILIAR'),
+                                  dir_claims=config.get('DIR_CLAIMS_MOBILIAR'),
+                                  year_start=config.get('YEAR_START_MOBILIAR'),
+                                  year_end=config.get('YEAR_END_MOBILIAR'))
+    elif DATASET == 'gvz':
+        damages = DamagesGvz(dir_exposure=config.get('DIR_EXPOSURE_GVZ'),
+                             dir_claims=config.get('DIR_CLAIMS_GVZ'),
+                             year_start=config.get('YEAR_START_GVZ'),
+                             year_end=config.get('YEAR_END_GVZ'))
+    else:
+        raise ValueError(f'Dataset {DATASET} not recognized.')
+
+    # Format the date of the claims
+    df_claims = damages.claims
+    df_claims['date_claim'] = pd.to_datetime(
+        df_claims['date_claim'], errors='coerce')
+
+    # Sum the claims per calendar month for each category
+    df_claims_month = df_claims.copy()
+    df_claims_month['month'] = df_claims_month['date_claim'].dt.month
+    df_claims_month = df_claims_month.drop(
+        columns=['date_claim', 'mask_index', 'cid', 'x', 'y'])
+    df_claims_month_sum = df_claims_month.groupby('month').sum()
+
+    # Plot the monthly distribution of the total # of claims for different categories
+    for category in damages.claim_categories:
+        sum_claims = df_claims_month_sum[category].sum()
+        plt.figure(figsize=(8, 4))
+        plt.title(f'Monthly distribution of the claims for category {category} '
+                  f'(total: {sum_claims})')
+        plt.xlabel('Month')
+        plt.ylabel('Percentage of claims [%]')
+        frac_monthly_claims = df_claims_month_sum[category] / sum_claims
+        plt.bar(df_claims_month_sum.index, 100 * frac_monthly_claims)
+        plt.xticks(range(1, 13))
+        plt.tight_layout()
+        plt.savefig(output_dir / f'monthly_distribution_tot_claims_{category}.png')
+        plt.savefig(output_dir / f'monthly_distribution_tot_claims_{category}.pdf')
+        plt.close()
+
+    # For the whole domain, aggregate by date (sum)
+    df_claims_date = df_claims.copy()
+    df_claims_date = df_claims_date.drop(
+        columns=['mask_index', 'cid', 'x', 'y'])
+    df_claims_date = df_claims_date.groupby('date_claim').sum()
+    df_claims_date['date_claim'] = pd.to_datetime(df_claims_date.index, errors='coerce')
+    df_claims_date['month'] = df_claims_date['date_claim'].dt.month
+
+    # Plot the monthly distribution of the mean # of claims for different categories
+    for category in damages.claim_categories:
+        df_claims_date_cat = df_claims_date.copy()
+        df_claims_date_cat = df_claims_date_cat[df_claims_date_cat[category] > 0]
+        df_claims_date_cat = df_claims_date_cat.groupby('month').mean()
+        plt.figure(figsize=(8, 4))
+        plt.title(f'Monthly distribution of the mean # of claims / event for category {category}')
+        plt.xlabel('Month')
+        plt.ylabel('Mean number of claims')
+        plt.bar(df_claims_date_cat.index, df_claims_date_cat[category])
+        plt.xticks(range(1, 13))
+        plt.tight_layout()
+        plt.savefig(output_dir / f'monthly_distribution_mean_claims_{category}.png')
+        plt.savefig(output_dir / f'monthly_distribution_mean_claims_{category}.pdf')
+        plt.close()
+
+    # Select the categories of interest
+    damages.select_categories_type(EXPOSURE_CATEGORIES, CLAIM_CATEGORIES)
+
+    # Analyze the occurrences of damages to the structure and/or content
+    if DATASET == 'mobiliar':
+        df_mobi = damages.claims
+
+        # Sum priv and sme
+        df_mobi['ext_struc_pluv'] = (
+            df_mobi['priv_ext_struc_pluv'] + df_mobi['sme_ext_struc_pluv'])
+        df_mobi['ext_cont_pluv'] = (
+            df_mobi['priv_ext_cont_pluv'] + df_mobi['sme_ext_cont_pluv'])
+        df_mobi['ext_both_pluv'] = (
+            df_mobi[['ext_struc_pluv', 'ext_cont_pluv']].min(axis=1))
+        df_mobi['ext_struc_only_pluv'] = (
+            df_mobi['ext_struc_pluv'] - df_mobi['ext_both_pluv'])
+        df_mobi['ext_cont_only_pluv'] = (
+            df_mobi['ext_cont_pluv'] - df_mobi['ext_both_pluv'])
+
+        nb_both = df_mobi['ext_both_pluv'].sum()
+        nb_struc = df_mobi['ext_struc_only_pluv'].sum()
+        nb_cont = df_mobi['ext_cont_only_pluv'].sum()
+        nb_tot = nb_both + nb_struc + nb_cont
+        pc_both = 100 * nb_both / nb_tot
+        pc_struc = 100 * nb_struc / nb_tot
+        pc_cont = 100 * nb_cont / nb_tot
+
+        print(f"Share of claims with both structure and content: {pc_both:.2f}%")
+        print(f"Share of claims with structure only: {pc_struc:.2f}%")
+        print(f"Share of claims with content only: {pc_cont:.2f}%")
+
+    # Analyze the distribution of the number of contracts and claims per cell
+    df_contracts = damages.exposure
+    df_contracts = df_contracts[['mask_index', 'selection', 'cid']]
+
+    # Average the number of annual contracts per location
+    df_contracts = df_contracts.groupby('mask_index').mean()
+
+    # Plot the histogram of the number of contracts per cell
+    plt.figure()
+    plt.title('Histogram of the number of contracts per cell')
+    plt.xlabel('Number of contracts')
+    plt.ylabel('Number of cells')
+    plt.hist(df_contracts['selection'], bins=100)
+    plt.yscale('log')
+    plt.xlim(0, None)
+    plt.tight_layout()
+    plt.savefig(output_dir / 'histogram_contracts.png')
+    plt.savefig(output_dir / 'histogram_contracts.pdf')
+
+    df_claims = damages.claims
+    df_claims = df_claims[['mask_index', 'selection']]
+
+    # Sum the number of claims per location and divide by the number of years
+    df_claims = df_claims.groupby('mask_index').sum()
+    n_years = damages.year_end - damages.year_start + 1
+    df_claims['selection'] = df_claims['selection'] / n_years
+
+    # Plot the histogram of the number of claims per cell
+    plt.figure()
+    plt.title('Histogram of the number of annual claims per cell')
+    plt.xlabel('Number of claims')
+    plt.ylabel('Number of cells')
+    plt.hist(df_claims['selection'], bins=50)
+    plt.yscale('log')
+    plt.xlim(0, None)
+    plt.tight_layout()
+    plt.savefig(output_dir / 'histogram_claims.png')
+    plt.savefig(output_dir / 'histogram_claims.pdf')
+
+    # Merge the contracts and claims dataframes on the index
+    df_merged = df_contracts.merge(df_claims, left_index=True,
+                                   right_index=True, how='left')
+
+    # Rename the columns
+    df_merged.columns = ['contracts', 'cid', 'claims']
+
+    # Replace nan values with 0
+    df_merged.fillna(0, inplace=True)
+
+    # Plot the relationship between the number of contracts and the number of claims
+    plt.figure()
+    plt.title('Relationship between the number of contracts and the claims')
+    plt.xlabel('Number of contracts (mean per cell)')
+    plt.ylabel('Mean number of annual claims (sum per cell)')
+    plt.scatter(df_merged['contracts'], df_merged['claims'],
+                facecolors='none', edgecolors='k')
+    plt.xscale('log')
+    plt.yscale('log')
+    plt.tight_layout()
+    plt.savefig(output_dir / 'scatter_contracts_claims.png')
+    plt.savefig(output_dir / 'scatter_contracts_claims.pdf')
+
+
+if __name__ == '__main__':
+    main()
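
The structure/content split computed in the script can be checked on a small example. The sketch below repeats the same arithmetic on a made-up dataframe; the numbers are invented for illustration, and the column names follow those created in the script.

```python
import pandas as pd

# Toy claim counts per row (made-up numbers, for illustration only).
df = pd.DataFrame({
    "ext_struc_pluv": [2, 0, 3, 1],
    "ext_cont_pluv": [1, 4, 3, 0],
})

# Claims attributed to both structure and content: the smaller of the two counts.
df["ext_both_pluv"] = df[["ext_struc_pluv", "ext_cont_pluv"]].min(axis=1)
df["ext_struc_only_pluv"] = df["ext_struc_pluv"] - df["ext_both_pluv"]
df["ext_cont_only_pluv"] = df["ext_cont_pluv"] - df["ext_both_pluv"]

nb_both = df["ext_both_pluv"].sum()
nb_struc = df["ext_struc_only_pluv"].sum()
nb_cont = df["ext_cont_only_pluv"].sum()
nb_tot = nb_both + nb_struc + nb_cont

# Shares in percent, mirroring pc_both, pc_struc and pc_cont in the script.
shares = {
    "both": 100 * nb_both / nb_tot,
    "structure only": 100 * nb_struc / nb_tot,
    "content only": 100 * nb_cont / nb_tot,
}
print(shares)
```

Taking the row-wise minimum pairs structure and content claims whenever both occur in the same row, so the "both" share should probably be read as an upper estimate of the true overlap rather than an exact count.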