---
title: "Incentive Classification Time-Series Dataset"
subtitle: "Assessing Data Collinearity and Relationships Over Time"
author: "Michael Van Hulle & Alea Wilbur-Mujtaba"
date: "July 8, 2024"
format: 
  html:
    embed-resources: true
    theme: lumen
    code-fold: true
    code-line-numbers: true
    code-overflow: wrap
    toc: true
    toc-location: left
    df-print: paged
knitr: 
  opts_chunk:
    warning: true
    message: false
---

```{r setup}
#| output: FALSE

# Load packages

library(tidyverse)
library(corrr)
library(glue)
library(DT)
library(flextable)
library(kableExtra)
library(crosstable)
library(scales)

# Set table formatting defaults

set_flextable_defaults(theme_fun = theme_vanilla, 
                       padding = 2,
                       #line_spacing = 1,
                       big.mark = ","
                       )

options(DT.options = list())

FitFlextableToPage <- function(ft, pgwidth = 6){
  ft_out <- ft %>% autofit()
  ft_out <- width(ft_out, width = dim(ft_out)$widths*pgwidth /(flextable_dim(ft_out)$widths))
  return(ft_out)
}


# Load the PIN-year time series and count the number of years each PIN appears
comm_ind <- read_csv("./Output/comm_ind_PINs_2006to2022_timeseries.csv") %>%
  group_by(pin) %>%
  mutate(years_existed = n()) %>%
  ungroup()
```


```{r runscript, eval = FALSE}

# Run helper file

source("./scripts/helper_pull_comm_ind_allyears.R")

# Remove distracting objects and values from R environment

helper_objects <- ls()

keep <- "comm_ind"

comm_ind <- comm_ind_pins_ever

rm(list = setdiff(helper_objects, keep))

rm(helper_objects, keep)

```

```{r alea_check}
comm_ind |>
 # filter(year %in% c("2006", "2011", "2022")) |>
  filter(land_use %in% c("Industrial", "Commercial")) |>
 # select(year, pin, fmv, incent_prop, land_use) |>
  group_by(year, incent_prop) |>
  summarize(fmv=sum(fmv)) |> 
  ggplot() + 
  geom_line(aes(x=year, y=fmv, color = incent_prop))

comm_ind |>
  filter(land_use %in% c("Industrial", "Commercial")) |>
  group_by(year, land_use) |>
  summarize(av_clerk=sum(av_clerk)) |> 
  ggplot() + 
  geom_line(aes(x=year, y=av_clerk, color = land_use))
```

# Data "Quirks"

## Missing Values

We want to confirm that key variables are not missing values and to check whether the number of missing values changes by year.

These checks bear directly on whether we can assemble a panel dataset and use fixed effects rather than some other form of time-series or spatial analysis.
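
As a rough illustration of the panel question, the sketch below checks whether each PIN is observed in every year. It uses a made-up `toy` table rather than `comm_ind`; the names `toy`, `n_years`, and `balance` are hypothetical and only the `pin`, `year`, and `fmv` column names are borrowed from our data.

```r
library(dplyr)

# Toy panel (made-up values): PIN "B" is missing 2012, PIN "C" enters in 2012
toy <- tibble(
  pin  = c("A", "A", "A", "B", "B", "C", "C"),
  year = c(2006, 2012, 2022, 2006, 2022, 2012, 2022),
  fmv  = c(100, 110, NA, 50, 60, 75, 80)
)

# Number of distinct years in the panel
n_years <- n_distinct(toy$year)

# One row per PIN: years observed, missing FMV count, and a balance flag
balance <- toy |>
  group_by(pin) |>
  summarize(
    years_observed = n_distinct(year),
    missing_fmv    = sum(is.na(fmv)),
    .groups = "drop"
  ) |>
  mutate(balanced = years_observed == n_years)

balance
```

A PIN flagged `balanced == FALSE` would make the panel unbalanced, which fixed-effects estimators can usually handle but which is worth knowing before choosing a model.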


_Note: 186 tax code rates were accidentally excluded before July 7, 2024. Cause: tax codes were pulled using only the tax codes that are taxed by municipalities. As a result, no PINs in unincorporated areas had tax code rates, which produced many NAs for those PINs._


```{r include = FALSE}
comm_ind %>% filter(is.na(tax_code_rate)) %>% distinct(tax_code_num) 
```


```{r}
#| label: tbl-missing_values_all_years
#| tbl-cap: "**Comparison of Select Missing Values Across Years**<br>ALL Commercial and Industrial PINs: 2006, 2012, 2022"
#| tbl-cap-location: top


# Count missing values for key variables in three benchmark years
missing_values_all_years <- comm_ind |>
  filter(year %in% c("2006", "2012", "2022")) |>
  group_by(year) |>
  summarize(
    "Observations (PIN-Year)" = n(),
    "Class Code" = sum(is.na(class)),
    "FMV" = sum(is.na(fmv)),
    "AV" = sum(is.na(av_clerk)),
    "Land Use" = sum(is.na(land_use)),
    "Incentive Dummy" = sum(is.na(incent_prop)),
   # missing_rate = sum(is.na(tax_code_rate)),
    "Tax Code" = sum(is.na(tax_code_num))
)

mvay_long <- missing_values_all_years |>
  pivot_longer(
    cols = -year,
    names_to = "variable",
    values_to = "count"
  )

mvay_wide = mvay_long |>
  pivot_wider(
    names_from = year,
    values_from = count
  )

kable(mvay_wide, format = "html", escape = FALSE) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "3cm")

```


Zeroes were imputed for missing values at one point in our analysis, so we also check for excess zero values.

```{r}
#| label: tbl-zero_values_all_years
#| tbl-cap: "**Comparison of Zero Values Across Years**<br>ALL Industrial and Commercial PINs: 2006, 2012, 2022"
#| tbl-cap-location: top

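all_year_zero_values <- comm_ind |>
  filter(year %in% c("2006", "2012", "2022")) |>
  group_by(year) |>
  mutate(year_obs = n()) |>   # PIN-year count before any rows are dropped
  ungroup() |>
  # Note: this drops rows with class 0 (and NA class), so the Class Code count below is always 0
  filter(class != 0) |>
  group_by(year) |>
  summarize(
    "Observations (PIN-Year)" = first(year_obs),
    # na.rm = TRUE keeps a single NA from turning the whole count into NA
    "Class Code" = sum(class == "0", na.rm = TRUE),
    "FMV" = sum(fmv == "0", na.rm = TRUE),
    "AV" = sum(av_clerk == "0", na.rm = TRUE),
    "Land Use" = sum(land_use == "0", na.rm = TRUE),
    "Incentive Dummy" = sum(incent_prop == "0", na.rm = TRUE),
    "Tax Code" = sum(tax_code_num == "0", na.rm = TRUE)
)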

ayz_long <- all_year_zero_values |>
  pivot_longer(
    cols = -year,
    names_to = "variable",
    values_to = "count"
  )

ayz_wide = ayz_long |>
  pivot_wider(
    names_from = year,
    values_from = count
  )

kable(ayz_wide, format = "html", escape = FALSE) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "3cm")

```
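
The missing-value and zero-value counts can also be produced in a single pass. The sketch below is one possible way to do that with `across()`, run on a made-up `toy` table; `toy`, `na_zero_counts`, and the `__` separator are hypothetical choices, not part of our pipeline.

```r
library(dplyr)
library(tidyr)

# Toy PIN-year data (made-up values)
toy <- tibble(
  year     = c(2006, 2006, 2006, 2022, 2022),
  fmv      = c(0, 250000, NA, 0, 0),
  av_clerk = c(0, 62500, 1000, NA, 0)
)

# For each year and variable, count NAs and exact zeros
na_zero_counts <- toy |>
  group_by(year) |>
  summarize(
    across(
      c(fmv, av_clerk),
      list(n_na   = ~ sum(is.na(.x)),
           n_zero = ~ sum(.x == 0, na.rm = TRUE)),
      .names = "{.col}__{.fn}"
    ),
    .groups = "drop"
  ) |>
  # Split "fmv__n_na" into variable = "fmv" and stat = "n_na"
  pivot_longer(
    cols      = -year,
    names_to  = c("variable", "stat"),
    names_sep = "__",
    values_to = "count"
  )

na_zero_counts
```

Keeping both counts side by side makes it easier to spot a variable whose NAs fall in the same year that its zeros rise, which is the pattern we would expect if zeros had been imputed.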

## Serial autocorrelations

Depending on our statistical model, we may need to account for serial autocorrelation in some variables over time. Serial autocorrelation seems likely for AV given the triennial reassessment cycle, and measuring it will help us determine the number of lags needed in our model.

[To be inserted]
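
A minimal sketch of one way to gauge this, on made-up data: demean AV within each PIN, then correlate it with its own value `k` years earlier. The `toy` table, the `av` column, and the `lag_corr()` helper are all hypothetical, and the sketch assumes every PIN has consecutive years with no gaps. It is a rough pooled lag correlation, not a formal test.

```r
library(dplyr)

# Made-up AV series that step every three years to mimic triennial reassessment
toy <- tibble(
  pin  = rep(c("A", "B"), each = 9),
  year = rep(2006:2014, times = 2),
  av   = c(10, 10, 10, 13, 13, 13, 17, 17, 17,
           40, 40, 40, 38, 38, 38, 45, 45, 45)
)

# Correlation between within-PIN demeaned AV and its k-year lag, pooled over PINs
lag_corr <- function(df, k) {
  df |>
    arrange(pin, year) |>
    group_by(pin) |>
    mutate(
      av_dm  = av - mean(av),         # remove each PIN's average level
      av_lag = dplyr::lag(av_dm, k)   # value k rows (years) earlier within the PIN
    ) |>
    ungroup() |>
    summarize(lag = k, r = cor(av_dm, av_lag, use = "complete.obs"))
}

# One row per lag length
autocorrs <- bind_rows(lapply(1:3, function(k) lag_corr(toy, k)))
autocorrs
```

Comparing `r` across lag lengths gives a first impression of how quickly the dependence fades, which could inform how many lags to carry into the model.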

# Correlations

## Correlations: 2006 - 2022 Time Period

Note: These correlations pool all PIN-years across the 2006-2022 period and are not stratified by year.

```{r}
#| label: tbl-all_year_corrs
#| tbl-cap: "**Correlations of Main Variables**<br>ALL Industrial & Commercial PINs: 2006-2022"
#| tbl-cap-location: top
#| warning: false

row_name_map <- c(
  av_clerk = "AV",
  fmv = "FMV",
  years_existed = "PIN<br>Age",
 base_year_fmv_2006 = "FMV<br>(2006)", 
  base_year_fmv_2011 = "FMV<br>(2011)"
)

col_name_map <- c(
  #term = "",
  av_clerk = "AV",
  years_existed = "PIN<br>Age",
  fmv = "FMV",
  base_year_fmv_2006 = "FMV<br>(2006)",  
  base_year_fmv_2011 = "FMV<br>(2011)"
)

vars_4fn <- c(
  #"n_pins", "min_fmv_all", "quant25_all_fmv", "quant50_all_fmv", "quant75_all_fmv", "max_fmv_all", "fmv_mean", "fmv", "cty_pins", 
            #  "term", 
              "av_clerk", "years_existed","fmv", "base_year_fmv_2006", 
              "base_year_fmv_2011" )

labels_4fn <- c(
  #"PIN Count", "Min. FMV", "25th Quantile", "50th Quantile", "75th Quantile", "Max. FMV", "Avg. FMV", "Total FMV", "PIN Count",
              #  "Variables", 
                "AV",  "PIN<br>Age", "FMV", "FMV<br>2006", 
                "FMV<br>2011"
                )

df_labels <- data.frame(vars_4fn, labels_4fn)


comm_ind$year <- as.factor(comm_ind$year)

corr_matrix <- comm_ind |>
  select(av_clerk, fmv, years_existed,
         base_year_fmv_2006, 
         base_year_fmv_2011, 
         incent_prop, land_use, year) |>
  correlate()

corr_matrix <- corr_matrix %>% 
#  rename(#`Variables` = term,
 #        `Final AV` = av_clerk, FMV = fmv, `Years PIN Existed` = years_existed) %>%
     # Match labels by variable name, not by position, so each row gets its own label
     mutate(Variables = coalesce(df_labels$labels_4fn[match(term, df_labels$vars_4fn)], "Need Name")) %>%
  select(Variables, everything(), -term)

corr_df_temp_1 <- corr_matrix |>
  shave() |>
  fashion(decimals = 3)

corr_df <- as.data.frame(corr_df_temp_1)
    
# Look up display names by column name so headers stay aligned with the data
new_col_names <- c("Variables", unname(col_name_map[colnames(corr_df)[-1]]))

kable(corr_df, format = "html", escape = FALSE, col.names = new_col_names) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "3cm")

```

```{r}
#| label: tbl-corr_2006
#| tbl-cap: "**Correlations of Main Variables**<br>ALL Industrial & Commercial PINs: 2006"
#| tbl-cap-location: top

corr_matrix_2006 <- comm_ind |>
  filter(year == 2006) |>
  select(av_clerk, fmv, years_existed, 
         base_year_fmv_2006, base_year_fmv_2011
         ) |>
  correlate()

corr_temp_2006 <- corr_matrix_2006 |>
  shave() |>
  fashion(decimals = 3) 

corr_2006 <- as.data.frame(corr_temp_2006)

new_2006_col_names <- c("Variable", col_name_map[colnames(corr_2006)[-1]])

kable(corr_2006, format = "html", escape = FALSE, col.names = new_2006_col_names) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "3cm")

```

```{r}
#| label: tbl-corr_2012
#| tbl-cap: "**Correlations of Main Variables**<br>ALL Industrial & Commercial PINs: 2012"
#| tbl-cap-location: top

corr_matrix_2012 <- comm_ind |>
  filter(year == 2012) |>
  select(av_clerk, fmv, years_existed, base_year_fmv_2006, base_year_fmv_2011) |>
  correlate()

corr_temp_2012 <- corr_matrix_2012 |>
  shave() |>
  fashion(decimals = 3) 

corr_2012 <- as.data.frame(corr_temp_2012)

new_2012_col_names <- c("Variable", col_name_map[colnames(corr_2012)[-1]])

kable(corr_2012, format = "html", escape = FALSE, col.names = new_2012_col_names) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "3cm")

```

```{r}
#| label: tbl-corr_2022
#| tbl-cap: "**Correlations of Main Variables**<br>ALL Industrial & Commercial PINs: 2022"
#| tbl-cap-location: top

corr_matrix_2022 <- comm_ind |>
  filter(year == 2022) |>
  select(av_clerk, fmv, years_existed, 
         base_year_fmv_2006, base_year_fmv_2011
         ) |>
  correlate()

corr_temp_2022 <- corr_matrix_2022 |>
  shave() |>
  fashion(decimals = 3) 

corr_2022 <- as.data.frame(corr_temp_2022)

new_2022_col_names <- c("Variable", col_name_map[colnames(corr_2022)[-1]])

kable(corr_2022, format = "html", escape = FALSE, col.names = new_2022_col_names) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "2cm")

```
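
The three single-year tables repeat the same steps. When only one pair of variables is of interest, a grouped summary can give the by-year correlations in one pass. The sketch below shows the idea on a made-up `toy` table; `toy` and `corr_by_year` are hypothetical names, and base `cor()` is used here in place of `correlate()`.

```r
library(dplyr)

# Toy PIN-year data (made-up values); in 2006 AV is exactly 25% of FMV
toy <- tibble(
  year     = rep(c(2006, 2012, 2022), each = 4),
  fmv      = c(100, 200, 300, 400,
               120, 210, 330, 390,
               150, 260, 310, 480),
  av_clerk = c(25, 50, 75, 100,
               30, 55, 80, 95,
               40, 60, 85, 120)
)

# One Pearson correlation per year instead of one pooled value
corr_by_year <- toy |>
  group_by(year) |>
  summarize(
    n        = n(),
    r_fmv_av = cor(fmv, av_clerk, use = "pairwise.complete.obs"),
    .groups  = "drop"
  )

corr_by_year
```

A pooled correlation can differ noticeably from the year-by-year values when levels shift over time, which is why the stratified tables are shown alongside the pooled one.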

```{r}
#| label: fig-fmv_baseline_year_strat
#| fig-cap: "Comparison of FMV in 2006 vs. 2022"
#| fig-cap-location: top

comm_ind |>
  filter(base_year_fmv_2006 > 0) |>
  filter(year == 2022) |>
  ggplot() +
  geom_point(aes(x = base_year_fmv_2006, y = fmv)) +
  theme_classic() +
  labs(x = "FMV (2006)", y = "FMV (2022)") +
  scale_x_continuous(labels = dollar_format(scale = 1e-6, suffix = "M")) +
  scale_y_continuous(labels = dollar_format(scale = 1e-6, suffix = "M")) +
  geom_abline(slope = 1, intercept = 0, color = "blue")

```

```{r}
#| label: fig-fmv_baseline_year_2006_incent_strat
#| fig-cap: "Comparison of FMV in 2006 vs. 2022<br>by Incentive Classification"
#| fig-cap-location: top

comm_ind |>
  filter(base_year_fmv_2006 > 0) |>
  filter(year == 2022) |>
  ggplot() +
  geom_point(aes(x = base_year_fmv_2006, y = fmv)) +
  theme_classic() +
  labs(x = "FMV (2006)", y = "FMV (2022)") +
  scale_x_continuous(labels = dollar_format(scale = 1e-6, suffix = "M")) +
  scale_y_continuous(labels = dollar_format(scale = 1e-6, suffix = "M")) +
  geom_abline(slope = 1, intercept = 0, color = "blue") +
  facet_grid(cols = vars(incent_prop))

```

```{r}
#| label: fig-fmv_baseline_year_2011_incent_strat
#| fig-cap: "Comparison of FMV in 2011 vs. 2022<br>by Incentive Classification"
#| fig-cap-location: top

comm_ind |>
  filter(base_year_fmv_2011 > 0) |>
  filter(year == 2022) |>
  ggplot() +
  geom_point(aes(x = base_year_fmv_2011, y = fmv)) +
  theme_classic() +
  labs(x = "FMV (2011)", y = "FMV (2022)") +
  scale_x_continuous(labels = dollar_format(scale = 1e-6, suffix = "M")) +
  scale_y_continuous(labels = dollar_format(scale = 1e-6, suffix = "M")) +
  geom_abline(slope = 1, intercept = 0, color = "blue") +
  facet_grid(cols = vars(incent_prop))

```
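
The scatterplots compare each PIN against the 45-degree line. A numeric companion, sketched below on made-up data, is the share of PINs that sit above that line in each group. The `toy` table, the group labels in `incent_prop`, and `growth_summary` are hypothetical; only the column names come from our data.

```r
library(dplyr)

# Toy 2022 cross-section (made-up values and hypothetical group labels)
toy <- tibble(
  incent_prop        = c("Incentive", "Incentive",
                         "Non-Incentive", "Non-Incentive", "Non-Incentive"),
  base_year_fmv_2006 = c(100, 200, 100, 300, 0),
  fmv                = c(150, 180, 120, 330, 50)
)

growth_summary <- toy |>
  filter(base_year_fmv_2006 > 0) |>   # same filter as the figures
  group_by(incent_prop) |>
  summarize(
    n_pins           = n(),
    share_above_line = mean(fmv > base_year_fmv_2006),    # points above the 45-degree line
    median_ratio     = median(fmv / base_year_fmv_2006),  # typical 2022-to-2006 FMV ratio
    .groups = "drop"
  )

growth_summary
```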