Updates to `get_SDA_*` / SOD-like methods #185

brownag · 2021-05-31T04:18:25Z

Extends get_SDA_property(property = ...) and get_SDA_interpretation(rulename = ...) vectorization over property/rulename to work with any aggregation method. Now supports: Dominant Condition, Min/Max, Dominant Component, Weighted Average
- Previously vector length > 1 properties or interpretations could be passed only for method = "NONE" (no aggregation)
Add query_string argument (default: FALSE). Set as TRUE to skip submitting query to SDA returning a string of the query that would have been sent instead of data.frame result
Standardizing MUKEY column name (and other keys) as lowercase in results (and subqueries)
More informative error messages for bad input / arguments inconsistent with specified method
Updated docs
get_SDA_property: Remove ISNULL(x, 0) logic that affects weighted averages in presence of missing data / miscellaneous areas that have horizon records
get_SDA_interpretation: review handling of NULL values and calculation of weighted averages
- added argument not_rated_value with default value of NA_real_ -- this creates consistent, user-definable behavior for not rated values across methods/queries. For backwards compatibility with original SQL use not_rated_value = 99.0

…t query string

…cleanRuleColumnName()

… names in results

brownag · 2021-06-02T23:58:10Z

Thanks @dylanbeaudette for the ISNULL(x,0) fix/discussion on #178.

We also trigger some similar ISNULL type logic in the get_SDA_interpretation() queries. Probably not as problematic given the sidebars on interpretations and the general idea that rating values are 1:1 with component. Worth reviewing this logic to make sure it is what we want.

soilDB/R/SDA_interpretations.R

Line 1155 in e8cc5d5

AND (interphr) IS NOT NULL GROUP BY mapunit.mukey),2) AS [sum_com_%s],

soilDB/R/SDA_interpretations.R

Line 1169 in e8cc5d5

    
           function(x) sprintf("ISNULL(ROUND(([rating_%s] / [sum_com_%s]),2), 99) AS [rating_%s]",

soilDB/R/SDA_interpretations.R

Line 1176 in e8cc5d5

"CASE WHEN rating IS NULL THEN 'Not Rated'

brownag · 2021-06-04T21:49:38Z

Here is a quick review of NULL logic for get_SDA_interpretation(method="weighted average"). I believe that everything that follows is "expected" behavior, but would welcome any thoughts.

Weighted average of an interpretation rating:

library(soilDB)
library(dplyr, warn.conflicts = FALSE)

res_wtdavg <- get_SDA_interpretation("FOR - Potential Seedling Mortality",
                                     method = "Weighted Average",
                                     mukeys = 2424959)
res_wtdavg
#>   areasymbol musym                                              muname   mukey
#> 1      CA630  8317 Beybek-Rock outcrop complex, 3 to 30 percent slopes 2424959
#>   rating_FORPotentialSeedlingMortality class_FORPotentialSeedlingMortality
#> 1                                 0.23                    Slightly limited
#>   reason_FORPotentialSeedlingMortality
#> 1                      Available water

No aggregation:

res_noagg <-  get_SDA_interpretation("FOR - Potential Seedling Mortality",
                                     method = "None", 
                                     mukeys = 2424959)
res_noagg
#>   areasymbol musym                                              muname   mukey
#> 1      CA630  8317 Beybek-Rock outcrop complex, 3 to 30 percent slopes 2424959
#> 2      CA630  8317 Beybek-Rock outcrop complex, 3 to 30 percent slopes 2424959
#> 3      CA630  8317 Beybek-Rock outcrop complex, 3 to 30 percent slopes 2424959
#> 4      CA630  8317 Beybek-Rock outcrop complex, 3 to 30 percent slopes 2424959
#>      cokey            compname comppct_r rating_FORPotentialSeedlingMortality
#> 1 19586584              Beybek        45                                  0.0
#> 2 19586585           Millvilla        10                                  0.5
#> 3 19586586        Rock outcrop        35                                   NA
#> 4 19586587 Mollic Haploxeralfs        10                                  1.0
#>   class_FORPotentialSeedlingMortality reason_FORPotentialSeedlingMortality
#> 1                                 Low                                 <NA>
#> 2                            Moderate                      Available water
#> 3                           Not rated                                 <NA>
#> 4                                High                      Available water

If all component ratings are considered, ignoring missing values, the weighted average cannot be calculated (is NA)

res_noagg %>% 
  group_by(mukey) %>% 
  summarize(sum(comppct_r / 100 * rating_FORPotentialSeedlingMortality))
#> # A tibble: 1 x 2
#>     mukey `sum(comppct_r/100 * rating_FORPotentialSeedlingMortality)`
#>     <int>                                                       <dbl>
#> 1 2424959                                                          NA

If NA ratings are removed from summation, but assume total component percentage of 100 the weighted average is based on denominator that includes weight for NA values.

res_noagg %>% 
  group_by(mukey) %>% 
  summarize(sum(comppct_r/100 * rating_FORPotentialSeedlingMortality, na.rm = TRUE))
#> # A tibble: 1 x 2
#>     mukey `sum(comppct_r/100 * rating_FORPotentialSeedlingMortality, na.rm = TR~
#>     <int>                                                                  <dbl>
#> 1 2424959                                                                   0.15

Assuming 100% only makes sense for things can be conceived of as "stocks" (on an area/volume basis) within a mapunit (i.e. does not make sense for interpretation ratings) and you want to "dilute" the result accordingly for misc areas/volumes that do not have soil.

In contrast when NA ratings are filtered as below, and remaining comppct_r summed for denominator, we match the method="Weighted Average" current behavior

res_noagg %>% 
  group_by(mukey) %>% 
  filter(!is.na(comppct_r) & !is.na(rating_FORPotentialSeedlingMortality)) %>% 
  summarize(sum(comppct_r / sum(comppct_r) * rating_FORPotentialSeedlingMortality, na.rm=TRUE))
#> # A tibble: 1 x 2
#>     mukey `sum(...)`
#>     <int>      <dbl>
#> 1 2424959      0.231

Finally, if all components are not rated, we obtain a value of "99" for the rating (domain is [0,1] for valid values)

The only thing I would consider changing here is that on the R side we could convert these values >1 to NA to limit the possibility of someone accidentally using them in math or other summaries. Thoughts?

res_99 <- get_SDA_interpretation("FOR - Potential Seedling Mortality",
                                 method = "Weighted Average", 
                                 areasymbols = "CA630")
subset(res_99, rating_FORPotentialSeedlingMortality == 99.0)
#>     areasymbol musym     muname   mukey rating_FORPotentialSeedlingMortality
#> 90       CA630  1012 Mined Land 2403709                                   99
#> 92       CA630   DAM       Dams 2924912                                   99
#> 104      CA630     W      Water 2462630                                   99
#>     class_FORPotentialSeedlingMortality reason_FORPotentialSeedlingMortality
#> 90                            Not Rated                                 <NA>
#> 92                            Not Rated                                 <NA>
#> 104                           Not Rated                                 <NA>

Note: The rating returns NA (not 99) when method = "NONE", so alternately convert "Not rated" ratings to 99 to match the aggregate behavior

get_SDA_interpretation("FOR - Potential Seedling Mortality",
                       method = "NONE", 
                       mukeys = 2924912)
#>   areasymbol musym muname   mukey    cokey compname comppct_r
#> 1      CA630   DAM   Dams 2924912 19586041     Dams       100
#>   rating_FORPotentialSeedlingMortality class_FORPotentialSeedlingMortality
#> 1                                   NA                           Not rated
#>   reason_FORPotentialSeedlingMortality
#> 1                                   NA

brownag · 2021-06-07T19:48:50Z

I propose that we use the new argument added in 2774148 "not_rated_value" to give the user control over what the numeric values associated with "Not Rated" records are. I suggest the default value to be NA_real_ which is a bit more "safe" in my opinion than 99, but users can readily set the not rated value to whatever they want as in example below

library(soilDB)

# default will use NA_real_ for the numeric ratings of "Not rated" components
res1 <- get_SDA_interpretation("FOR - Potential Seedling Mortality",
                       method = "Weighted Average", 
                       areasymbols = "CA630")

head(subset(res1, class_FORPotentialSeedlingMortality == "Not Rated"))
#>     areasymbol musym     muname   mukey rating_FORPotentialSeedlingMortality
#> 90       CA630  1012 Mined Land 2403709                                   NA
#> 92       CA630   DAM       Dams 2924912                                   NA
#> 104      CA630     W      Water 2462630                                   NA
#>     class_FORPotentialSeedlingMortality reason_FORPotentialSeedlingMortality
#> 90                            Not Rated                                 <NA>
#> 92                            Not Rated                                 <NA>
#> 104                           Not Rated                                 <NA>
     
# user can specify a custom not rated value, e.g. 9999
res2 <- get_SDA_interpretation("FOR - Potential Seedling Mortality",
                       method = "Weighted Average", 
                       areasymbols = "CA630",
                       not_rated_value = 9999)
head(subset(res2, class_FORPotentialSeedlingMortality == "Not Rated"))
#>    areasymbol musym     muname   mukey rating_FORPotentialSeedlingMortality
#> 30      CA630  1012 Mined Land 2403709                                 9999
#> 32      CA630   DAM       Dams 2924912                                 9999
#> 44      CA630     W      Water 2462630                                 9999
#>    class_FORPotentialSeedlingMortality reason_FORPotentialSeedlingMortality
#> 30                           Not Rated                                 <NA>
#> 32                           Not Rated                                 <NA>
#> 44                           Not Rated                                 <NA>

brownag and others added 6 commits May 30, 2021 16:42

get_SDA_property: Vectorizing property argument, add argument to ge…

951cc6e

…t query string

get_SDA_interpretation: Vectorizing rulename

950a253

get_SDA*: Standardize lower case MUKEY, COKEY, CHKEY, etc.

cd1687e

get_SDA_interpretations: consistent column names for all methods w/ .…

b5bacfd

…cleanRuleColumnName()

Demo of bottom_depth logic for weighted average, and comparing column…

b71d1cc

… names in results

removing conversion of NULL to 0, why was this in there

e8cc5d5

dylanbeaudette mentioned this pull request Jun 2, 2021

"soil data aggregation engine" vs. wrangling SQL statements or ad hoc R code #178

Closed

brownag marked this pull request as ready for review June 7, 2021 19:32

get_SDA_interpretation: add not_rated_value argument, default NA_real_

2774148

Update docs

4ea5b27

brownag merged commit dea0032 into master Jun 11, 2021

brownag deleted the SDA_methods1 branch August 25, 2021 22:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updates to `get_SDA_*` / SOD-like methods #185

Updates to `get_SDA_*` / SOD-like methods #185

brownag commented May 31, 2021 •

edited

Loading

brownag commented Jun 2, 2021

brownag commented Jun 4, 2021 •

edited

Loading

brownag commented Jun 7, 2021

Updates to get_SDA_* / SOD-like methods #185

Updates to get_SDA_* / SOD-like methods #185

Conversation

brownag commented May 31, 2021 • edited Loading

brownag commented Jun 2, 2021

brownag commented Jun 4, 2021 • edited Loading

brownag commented Jun 7, 2021

Updates to `get_SDA_*` / SOD-like methods #185

Updates to `get_SDA_*` / SOD-like methods #185

brownag commented May 31, 2021 •

edited

Loading

brownag commented Jun 4, 2021 •

edited

Loading