Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to get_SDA_* / SOD-like methods #185

Merged
merged 8 commits into from
Jun 11, 2021
Merged

Updates to get_SDA_* / SOD-like methods #185

merged 8 commits into from
Jun 11, 2021

Conversation

brownag
Copy link
Member

@brownag brownag commented May 31, 2021

  • Extends get_SDA_property(property = ...) and get_SDA_interpretation(rulename = ...) vectorization over property/rulename to work with any aggregation method. Now supports: Dominant Condition, Min/Max, Dominant Component, Weighted Average
    • Previously vector length > 1 properties or interpretations could be passed only for method = "NONE" (no aggregation)
  • Add query_string argument (default: FALSE). Set as TRUE to skip submitting query to SDA returning a string of the query that would have been sent instead of data.frame result
  • Standardizing MUKEY column name (and other keys) as lowercase in results (and subqueries)
  • More informative error messages for bad input / arguments inconsistent with specified method
  • Updated docs
  • get_SDA_property: Remove ISNULL(x, 0) logic that affects weighted averages in presence of missing data / miscellaneous areas that have horizon records
  • get_SDA_interpretation: review handling of NULL values and calculation of weighted averages
    • added argument not_rated_value with default value of NA_real_ -- this creates consistent, user-definable behavior for not rated values across methods/queries. For backwards compatibility with original SQL use not_rated_value = 99.0

@brownag
Copy link
Member Author

brownag commented Jun 2, 2021

Thanks @dylanbeaudette for the ISNULL(x,0) fix/discussion on #178.

We also trigger some similar ISNULL type logic in the get_SDA_interpretation() queries. Probably not as problematic given the sidebars on interpretations and the general idea that rating values are 1:1 with component. Worth reviewing this logic to make sure it is what we want.

AND (interphr) IS NOT NULL GROUP BY mapunit.mukey),2) AS [sum_com_%s],

function(x) sprintf("ISNULL(ROUND(([rating_%s] / [sum_com_%s]),2), 99) AS [rating_%s]",

"CASE WHEN rating IS NULL THEN 'Not Rated'

@brownag
Copy link
Member Author

brownag commented Jun 4, 2021

Here is a quick review of NULL logic for get_SDA_interpretation(method="weighted average"). I believe that everything that follows is "expected" behavior, but would welcome any thoughts.

Weighted average of an interpretation rating:

library(soilDB)
library(dplyr, warn.conflicts = FALSE)

res_wtdavg <- get_SDA_interpretation("FOR - Potential Seedling Mortality",
                                     method = "Weighted Average",
                                     mukeys = 2424959)
res_wtdavg
#>   areasymbol musym                                              muname   mukey
#> 1      CA630  8317 Beybek-Rock outcrop complex, 3 to 30 percent slopes 2424959
#>   rating_FORPotentialSeedlingMortality class_FORPotentialSeedlingMortality
#> 1                                 0.23                    Slightly limited
#>   reason_FORPotentialSeedlingMortality
#> 1                      Available water

No aggregation:

res_noagg <-  get_SDA_interpretation("FOR - Potential Seedling Mortality",
                                     method = "None", 
                                     mukeys = 2424959)
res_noagg
#>   areasymbol musym                                              muname   mukey
#> 1      CA630  8317 Beybek-Rock outcrop complex, 3 to 30 percent slopes 2424959
#> 2      CA630  8317 Beybek-Rock outcrop complex, 3 to 30 percent slopes 2424959
#> 3      CA630  8317 Beybek-Rock outcrop complex, 3 to 30 percent slopes 2424959
#> 4      CA630  8317 Beybek-Rock outcrop complex, 3 to 30 percent slopes 2424959
#>      cokey            compname comppct_r rating_FORPotentialSeedlingMortality
#> 1 19586584              Beybek        45                                  0.0
#> 2 19586585           Millvilla        10                                  0.5
#> 3 19586586        Rock outcrop        35                                   NA
#> 4 19586587 Mollic Haploxeralfs        10                                  1.0
#>   class_FORPotentialSeedlingMortality reason_FORPotentialSeedlingMortality
#> 1                                 Low                                 <NA>
#> 2                            Moderate                      Available water
#> 3                           Not rated                                 <NA>
#> 4                                High                      Available water

If all component ratings are considered, ignoring missing values, the weighted average cannot be calculated (is NA)

res_noagg %>% 
  group_by(mukey) %>% 
  summarize(sum(comppct_r / 100 * rating_FORPotentialSeedlingMortality))
#> # A tibble: 1 x 2
#>     mukey `sum(comppct_r/100 * rating_FORPotentialSeedlingMortality)`
#>     <int>                                                       <dbl>
#> 1 2424959                                                          NA

If NA ratings are removed from summation, but assume total component percentage of 100 the weighted average is based on denominator that includes weight for NA values.

res_noagg %>% 
  group_by(mukey) %>% 
  summarize(sum(comppct_r/100 * rating_FORPotentialSeedlingMortality, na.rm = TRUE))
#> # A tibble: 1 x 2
#>     mukey `sum(comppct_r/100 * rating_FORPotentialSeedlingMortality, na.rm = TR~
#>     <int>                                                                  <dbl>
#> 1 2424959                                                                   0.15

Assuming 100% only makes sense for things can be conceived of as "stocks" (on an area/volume basis) within a mapunit (i.e. does not make sense for interpretation ratings) and you want to "dilute" the result accordingly for misc areas/volumes that do not have soil.

In contrast when NA ratings are filtered as below, and remaining comppct_r summed for denominator, we match the method="Weighted Average" current behavior

res_noagg %>% 
  group_by(mukey) %>% 
  filter(!is.na(comppct_r) & !is.na(rating_FORPotentialSeedlingMortality)) %>% 
  summarize(sum(comppct_r / sum(comppct_r) * rating_FORPotentialSeedlingMortality, na.rm=TRUE))
#> # A tibble: 1 x 2
#>     mukey `sum(...)`
#>     <int>      <dbl>
#> 1 2424959      0.231

Finally, if all components are not rated, we obtain a value of "99" for the rating (domain is [0,1] for valid values)

The only thing I would consider changing here is that on the R side we could convert these values >1 to NA to limit the possibility of someone accidentally using them in math or other summaries. Thoughts?

res_99 <- get_SDA_interpretation("FOR - Potential Seedling Mortality",
                                 method = "Weighted Average", 
                                 areasymbols = "CA630")
subset(res_99, rating_FORPotentialSeedlingMortality == 99.0)
#>     areasymbol musym     muname   mukey rating_FORPotentialSeedlingMortality
#> 90       CA630  1012 Mined Land 2403709                                   99
#> 92       CA630   DAM       Dams 2924912                                   99
#> 104      CA630     W      Water 2462630                                   99
#>     class_FORPotentialSeedlingMortality reason_FORPotentialSeedlingMortality
#> 90                            Not Rated                                 <NA>
#> 92                            Not Rated                                 <NA>
#> 104                           Not Rated                                 <NA>

Note: The rating returns NA (not 99) when method = "NONE", so alternately convert "Not rated" ratings to 99 to match the aggregate behavior

get_SDA_interpretation("FOR - Potential Seedling Mortality",
                       method = "NONE", 
                       mukeys = 2924912)
#>   areasymbol musym muname   mukey    cokey compname comppct_r
#> 1      CA630   DAM   Dams 2924912 19586041     Dams       100
#>   rating_FORPotentialSeedlingMortality class_FORPotentialSeedlingMortality
#> 1                                   NA                           Not rated
#>   reason_FORPotentialSeedlingMortality
#> 1                                   NA

@brownag brownag marked this pull request as ready for review June 7, 2021 19:32
@brownag
Copy link
Member Author

brownag commented Jun 7, 2021

I propose that we use the new argument added in 2774148 "not_rated_value" to give the user control over what the numeric values associated with "Not Rated" records are. I suggest the default value to be NA_real_ which is a bit more "safe" in my opinion than 99, but users can readily set the not rated value to whatever they want as in example below

library(soilDB)

# default will use NA_real_ for the numeric ratings of "Not rated" components
res1 <- get_SDA_interpretation("FOR - Potential Seedling Mortality",
                       method = "Weighted Average", 
                       areasymbols = "CA630")

head(subset(res1, class_FORPotentialSeedlingMortality == "Not Rated"))
#>     areasymbol musym     muname   mukey rating_FORPotentialSeedlingMortality
#> 90       CA630  1012 Mined Land 2403709                                   NA
#> 92       CA630   DAM       Dams 2924912                                   NA
#> 104      CA630     W      Water 2462630                                   NA
#>     class_FORPotentialSeedlingMortality reason_FORPotentialSeedlingMortality
#> 90                            Not Rated                                 <NA>
#> 92                            Not Rated                                 <NA>
#> 104                           Not Rated                                 <NA>
     
# user can specify a custom not rated value, e.g. 9999
res2 <- get_SDA_interpretation("FOR - Potential Seedling Mortality",
                       method = "Weighted Average", 
                       areasymbols = "CA630",
                       not_rated_value = 9999)
head(subset(res2, class_FORPotentialSeedlingMortality == "Not Rated"))
#>    areasymbol musym     muname   mukey rating_FORPotentialSeedlingMortality
#> 30      CA630  1012 Mined Land 2403709                                 9999
#> 32      CA630   DAM       Dams 2924912                                 9999
#> 44      CA630     W      Water 2462630                                 9999
#>    class_FORPotentialSeedlingMortality reason_FORPotentialSeedlingMortality
#> 30                           Not Rated                                 <NA>
#> 32                           Not Rated                                 <NA>
#> 44                           Not Rated                                 <NA>

@brownag brownag merged commit dea0032 into master Jun 11, 2021
@brownag brownag deleted the SDA_methods1 branch August 25, 2021 22:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants