diff --git a/_freeze/slides/execute-results/html.json b/_freeze/slides/execute-results/html.json
index 295484a..1540957 100644
--- a/_freeze/slides/execute-results/html.json
+++ b/_freeze/slides/execute-results/html.json
@@ -1,7 +1,8 @@
{
- "hash": "f5d666dc1ba4ef53cda3f84ad8763789",
+ "hash": "9fb68042bd9d7467a217d5a3dffbda6c",
"result": {
- "markdown": "---\ntitle: \"Building an interpretable SDM from scratch\"\nsubtitle: \"using Julia 1.9\"\nauthor:\n name: \"Timothée Poisot\"\n email: timothee.poisot@umontreal.ca\ninstitute: \"Université de Montréal\"\ntitle-slide-attributes: \n data-background-image: https://cdn.pixabay.com/photo/2017/03/29/11/29/nepal-2184940_960_720.jpg\n data-background-opacity: \"0.15\"\nbibliography: references.bib\ncsl: https://www.zotero.org/styles/ecology-letters\n---\n\n## Overview\n\n- Build a *simple* classifier to predict the distribution of a species\n\n- No, I will not tell you which species, it's a large North American mammal\n\n- Use this as an opportunity to talk about interpretable ML\n\n- Discuss which biases are appropriate in a predictive model\n\n::: footer\nCC BY 4.0 - Timothée Poisot\n:::\n\n------------------------------------------------------------------------\n\n::: r-fit-text\nWe care a lot about the\n\n**process**\n\nand only a little about the\n\n**product**\n:::\n\n------------------------------------------------------------------------\n\n## Why...\n\n... think of SDMs as a ML problem?\n\n: They are (they really, really are, see @beery2021)\n\n... think of explainable ML for SDM?\n\n: Uptake of models *requires* transparent predictions\n\n... not tell us which species this is about?\n\n: Because this is the point (you'll see)\n\n## Do try this at home!\n\n💻 + 📔 + 🗺️ at `https://github.com/tpoisot/InterpretableSDMWithJulia/`\n\n::: {#include-the-packages-we-need .cell execution_count=1}\n``` {.julia .cell-code}\ninclude(joinpath(\"code\", \"pkg.jl\")); # Dependencies\ninclude(joinpath(\"code\", \"nbc.jl\")); # Naive Bayes Classifier\ninclude(joinpath(\"code\", \"bioclim.jl\")); # BioClim model\ninclude(joinpath(\"code\", \"confusion.jl\")); # Confusion matrix utilities\ninclude(joinpath(\"code\", \"splitters.jl\")); # Cross-validation (part one)\ninclude(joinpath(\"code\", \"crossvalidate.jl\")); # Cross-validation (part deux)\ninclude(joinpath(\"code\", \"variableselection.jl\")); # Variable selection\ninclude(joinpath(\"code\", \"shapley.jl\")); # Shapley values\ninclude(joinpath(\"code\", \"palettes.jl\")); # Accessible color palettes\n```\n:::\n\n\n## Species occurrences\n\n::: {#get-the-species-data .cell execution_count=2}\n``` {.julia .cell-code}\nsightings = CSV.File(\"occurrences.csv\")\nocc = [\n (record.longitude, record.latitude)\n for record in sightings\n if record.classification == \"Class A\"\n]\nfilter!(r -> -90 <= r[2] <= 90, occ)\nfilter!(r -> -180 <= r[1] <= 180, occ)\nboundingbox = (\n left = minimum(first.(occ)),\n right = maximum(first.(occ)),\n bottom = minimum(last.(occ)),\n top = maximum(last.(occ)),\n)\n```\n:::\n\n\n## Bioclimatic data\n\nWe collect BioClim data from CHELSA v1, using `SpeciesDistributionToolkit`\n\n::: {#download-the-bioclim-data-from-worldclim2 .cell execution_count=3}\n``` {.julia .cell-code}\nprovider = RasterData(WorldClim2, BioClim)\nopts = (; resolution=2.5)\ntemperature = SimpleSDMPredictor(provider, layer=1; opts..., boundingbox...)\n```\n:::\n\n\n::: footer\nBioClim data from @karger2020; see @dansereau2021 for more about the packages\n:::\n\n## Bioclimatic data\n\nWe set the pixels with only open water to `nothing`\n\n::: {#get-the-open-water-pixels .cell execution_count=4}\n``` {.julia .cell-code}\nwater = \n SimpleSDMPredictor(RasterData(EarthEnv, LandCover), layer=12; boundingbox...)\nland = similar(temperature, Bool)\nreplace!(land, false => true)\nfor k in keys(land)\n if !isnothing(water[k])\n if water[k] == 100\n land[k] = false\n end\n end\nend\ntemperature = mask(land, temperature)\n```\n:::\n\n\n::: footer\nLand-cover data from @tuanmu2014\n:::\n\n## Where are we so far?\n\n::: {#17885051 .cell execution_count=5}\n\n::: {.cell-output .cell-output-display execution_count=6}\n![](slides_files/figure-revealjs/cell-6-output-1.png){}\n:::\n:::\n\n\n## Spatial thinning\n\nWe limit the occurrences to one per grid cell, assigned to the center of the grid cell\n\n::: {#make-the-layer-for-presences .cell execution_count=6}\n``` {.julia .cell-code}\npresence_layer = similar(temperature, Bool)\nfor i in axes(occ, 1)\n if ~isnothing(presence_layer[occ[i]...])\n presence_layer[occ[i]...] = true\n end\nend\n```\n:::\n\n\n## Background points generation\n\nWe generate background points proportionally to the distance away from observations, with a 10km buffer around each point with no background point allowed:\n\n::: {#make-the-pseudo-absence-buffer .cell execution_count=7}\n``` {.julia .cell-code}\npossible_background = pseudoabsencemask(DistanceToEvent, presence_layer)\n```\n:::\n\n\nAnd then we sample three pseudo-absence for each occurrence:\n\n::: {#make-the-absence-layer .cell execution_count=8}\n``` {.julia .cell-code}\nabsence_layer = backgroundpoints(\n (x -> x^1.01).(possible_background), \n 3sum(presence_layer);\n replace=false\n)\n```\n:::\n\n\n::: footer\nSee @barbet-massin2012 for more on background points\n:::\n\n## Background points cleaning\n\nWe can remove all of the information that is neither a presence nor a pseudo-absence\n\n::: {#pseudo-absencepresence-remove .cell execution_count=9}\n``` {.julia .cell-code}\nreplace!(absence_layer, false => nothing)\nreplace!(presence_layer, false => nothing)\n```\n:::\n\n\n## Data overview\n\n::: {#e17987b1 .cell execution_count=10}\n\n::: {.cell-output .cell-output-display execution_count=11}\n![](slides_files/figure-revealjs/cell-11-output-1.png){}\n:::\n:::\n\n\n\n\n## Preparing the responses and variables\n\n::: {#assemble-y-and-x .cell execution_count=12}\n``` {.julia .cell-code}\nXpresence = hcat([bioclim_var[keys(presence_layer)] for bioclim_var in predictors]...)\nypresence = fill(true, length(presence_layer))\nXabsence = hcat([bioclim_var[keys(absence_layer)] for bioclim_var in predictors]...)\nyabsence = fill(false, length(absence_layer))\nX = vcat(Xpresence, Xabsence)\ny = vcat(ypresence, yabsence)\n```\n:::\n\n\n\n\n## The model -- Naive Bayes Classifier\n\nPrediction:\n\n$$\nP(+|x) = \\frac{P(+)}{P(x)}P(x|+)\n$$\n\nDecision rule:\n\n$$\n\\hat y = \\text{argmax}_j \\, P(\\mathbf{c}_j)\\prod_i P(\\mathbf{x}_i|\\mathbf{c}_j)\n$$\n\n::: footer\nWith $n$ instances and $f$ features, NBC trains *and* predicts in $\\mathcal{O}(n\\times f)$\n:::\n\n## The model -- Naive Bayes Classifier\n\nAssumption of Gaussian distributions:\n\n$$\nP(x|+) = \\text{pdf}(x, \\mathcal{N}(\\mu_+, \\sigma_+))\n$$\n\n## Cross-validation\n\nWe keep an **unseen** *testing* set -- this will be used at the very end to report expected model performance\n\n::: {#testing-set .cell execution_count=14}\n``` {.julia .cell-code}\nidx, tidx = holdout(y, X; permute=true)\n```\n:::\n\n\nFor *validation*, we will run k-folds\n\n::: {#k-folds .cell execution_count=15}\n``` {.julia .cell-code}\nty, tX = y[idx], X[idx,:]\nfolds = kfold(ty, tX; k=15, permute=true)\nk = length(folds)\n```\n:::\n\n\n::: footer\nSee @valavi2018 for more on cross-validation\n:::\n\n## A note on cross-validation\n\nAll models share the same folds\n\n: we can compare the validation performance across experiments to select the best model\n\nModel performance can be compared\n\n: we average the relevant summary statistics over each validation set\n\nTesting set is *only* for future evaluation\n\n: we can only use it once and report the expected performance *of the best model*\n\n## Baseline performance\n\nWe need to get a sense of how difficult the classification problem is:\n\n::: {#ce38c2a0 .cell execution_count=16}\n``` {.julia .cell-code}\nN_v0 = crossvalidate(naivebayes, ty, tX, folds)\nB_v0 = crossvalidate(bioclim, ty, tX, folds, eps())\n```\n:::\n\n\nThis uses an un-tuned model with all variables and reports the average over all validation sets. In addition, we will always use the BioClim model as a comparison.\n\n## Measures on the confusion matrix {.smaller}\n\n| | BioClim | NBC |\n|-----|-------------------------------|-------------------------------|\n| FPR | 0\\.32±0\\.0 | 0\\.11±0\\.0 |\n| FNR | 0\\.01±0\\.0 | 0\\.14±0\\.0 |\n| TPR | 0\\.99±0\\.0 | 0\\.86±0\\.0 |\n| TNR | 0\\.68±0\\.0 | 0\\.89±0\\.0 |\n| TSS | 0\\.66±0\\.0 | 0\\.75±0\\.0 |\n| MCC | 0\\.58±0\\.0 | 0\\.71±0\\.0 |\n\n::: footer\nIt's a good idea to check the values for the training sets too...\n:::\n\n## Variable selection\n\nWe add variables one at a time, until the Matthew's Correlation Coefficient stops increasing -- we keep annual temperature, isothermality, mean diurnal range, and annual precipitation\n\n::: {#a2c14deb .cell execution_count=17}\n``` {.julia .cell-code}\navailable_variables = forwardselection(ty, tX, folds, naivebayes, mcc)\n```\n:::\n\n\nThis method identifies 5 variables, some of which are:\n\n1. Mean Temp\\. of Coldest Quarter\n\n2. Mean Diurnal Range \n\n3. Annual Precip\\.\n\n## Variable selection?\n\n- Constrained variable selection\n\n- VIF threshold (over the extent or over document occurrences?)\n\n- PCA for dimensionality reduction *v.* Whitening for colinearity removal\n\n- Potential for data leakage: data transformations don't exist, they are just models we can train\n\n## Model with variable selection\n\n::: {#dfa977c2 .cell execution_count=18}\n``` {.julia .cell-code}\nN_v1 = crossvalidate(naivebayes, ty, tX[:,available_variables], folds)\nB_v1 = crossvalidate(bioclim, ty, tX[:,available_variables], folds, eps())\n```\n:::\n\n\n## Measures on the confusion matrix {.smaller}\n\n| | BioClim | NBC | BioClim (v.s.) | NBC (v.s.) |\n|---------------|---------------|---------------|---------------|---------------|\n| FPR | 0\\.32±0\\.0 | 0\\.11±0\\.0 | 0\\.57±0\\.1 | 0\\.07±0\\.0 |\n| FNR | 0\\.01±0\\.0 | 0\\.14±0\\.0 | 0\\.01±0\\.0 | 0\\.14±0\\.0 |\n| TPR | 0\\.99±0\\.0 | 0\\.86±0\\.0 | 0\\.99±0\\.0 | 0\\.86±0\\.0 |\n| TNR | 0\\.68±0\\.0 | 0\\.89±0\\.0 | 0\\.43±0\\.1 | 0\\.93±0\\.0 |\n| TSS | 0\\.66±0\\.0 | 0\\.75±0\\.0 | 0\\.42±0\\.1 | 0\\.79±0\\.0 |\n| MCC | 0\\.58±0\\.0 | 0\\.71±0\\.0 | 0\\.39±0\\.1 | 0\\.77±0\\.0 |\n\n## How do we make the model better?\n\nThe NBC is a *probabilistic classifier* returning $P(+|\\mathbf{x})$\n\nThe *decision rule* is to assign a presence when $P(\\cdot) > 0.5$\n\nBut $P(\\cdot) > \\tau$ is a far more general approach, and we can use learning curves to identify $\\tau$\n\n## Thresholding the model\n\n::: {#39f67e12 .cell execution_count=19}\n``` {.julia .cell-code}\nthr = LinRange(0.0, 1.0, 500)\nT = hcat([crossvalidate(naivebayes, ty, tX[:,available_variables], folds, t) for t in thr]...)\n```\n:::\n\n\n## But how do we pick the threshold?\n\n::: {#d8b1753b .cell execution_count=20}\n\n::: {.cell-output .cell-output-display execution_count=24}\n![](slides_files/figure-revealjs/cell-21-output-1.svg){}\n:::\n:::\n\n\n## Tuned model with selected variables\n\n::: {#51843c1a .cell execution_count=21}\n``` {.julia .cell-code}\nN_v2 = crossvalidate(naivebayes, ty, tX[:,available_variables], folds, thr[m])\n```\n:::\n\n\n## Measures on the confusion matrix {.smaller}\n\n| | BioClim | NBC | BioClim (v.s.) | NBC (v.s.) | NBC (v.s. + tuning) |\n|------------|------------|------------|------------|------------|------------|\n| FPR | 0\\.32±0\\.0 | 0\\.11±0\\.0 | 0\\.57±0\\.1 | 0\\.07±0\\.0 | 0\\.06±0\\.0 |\n| FNR | 0\\.01±0\\.0 | 0\\.14±0\\.0 | 0\\.01±0\\.0 | 0\\.14±0\\.0 | 0\\.15±0\\.0 |\n| TPR | 0\\.99±0\\.0 | 0\\.86±0\\.0 | 0\\.99±0\\.0 | 0\\.86±0\\.0 | 0\\.85±0\\.0 |\n| TNR | 0\\.68±0\\.0 | 0\\.89±0\\.0 | 0\\.43±0\\.1 | 0\\.93±0\\.0 | 0\\.94±0\\.0 |\n| TSS | 0\\.66±0\\.0 | 0\\.75±0\\.0 | 0\\.42±0\\.1 | 0\\.79±0\\.0 | 0\\.79±0\\.0 |\n| MCC | 0\\.58±0\\.0 | 0\\.71±0\\.0 | 0\\.39±0\\.1 | 0\\.77±0\\.0 | 0\\.77±0\\.0 |\n\n## How do we make the model better?\n\nThe NBC is a *Bayesian classifier* returning $P(+|\\mathbf{x})$\n\nThe *actual probability* depends on $P(+)$\n\nThere is no reason not to also tune $P(+)$ (jointly with other hyper-parameters)!\n\n## Joint tuning of hyper-parameters\n\n::: {#e5f0942b .cell execution_count=22}\n``` {.julia .cell-code}\nthr = LinRange(0.0, 1.0, 55)\npplus = LinRange(0.0, 1.0, 45)\nT = [crossvalidate(naivebayes, ty, tX[:,available_variables], folds, t; presence=prior) for t in thr, prior in pplus]\nbest_mcc, params = findmax(map(v -> mean(mcc.(v)), T))\nτ = thr[params.I[1]]\nppres = pplus[params.I[2]]\n```\n:::\n\n\n## Tuned (again) model with selected variables\n\n::: {#1d05da68 .cell execution_count=23}\n``` {.julia .cell-code}\nN_v3 = crossvalidate(naivebayes, ty, tX[:,available_variables], folds, τ; presence=ppres)\n```\n:::\n\n\n## Measures on the confusion matrix {.smaller}\n\n| | BioClim | NBC (v0) | NBC (v1) | NBC (v2) | NBC (v3) |\n|------------|------------|-------------|------------|------------|------------|\n| FPR | 0\\.32±0\\.0 | 0\\.11±0\\.0 | 0\\.07±0\\.0 | 0\\.06±0\\.0 | 0\\.06±0\\.0 |\n| FNR | 0\\.01±0\\.0 | 0\\.14±0\\.0 | 0\\.14±0\\.0 | 0\\.15±0\\.0 | 0\\.15±0\\.0 |\n| TPR | 0\\.99±0\\.0 | 0\\.86±0\\.0 | 0\\.86±0\\.0 | 0\\.85±0\\.0 | 0\\.85±0\\.0 |\n| TNR | 0\\.68±0\\.0 | 0\\.89±0\\.0 | 0\\.93±0\\.0 | 0\\.94±0\\.0 | 0\\.94±0\\.0 |\n| TSS | 0\\.66±0\\.0 | 0\\.75±0\\.0 | 0\\.79±0\\.0 | 0\\.79±0\\.0 | 0\\.79±0\\.0 |\n| MCC | 0\\.58±0\\.0 | 0\\.71±0\\.0 | 0\\.77±0\\.0 | 0\\.77±0\\.0 | 0\\.77±0\\.0 |\n\n## Tuned model performance\n\nWe can retrain over *all* the training data\n\n::: {#c490755c .cell execution_count=24}\n``` {.julia .cell-code}\nfinalmodel = naivebayes(ty, tX[:,available_variables]; presence=ppres)\nprediction = vec(mapslices(finalmodel, X[tidx,available_variables]; dims=2))\nC = ConfusionMatrix(prediction, y[tidx], τ)\n```\n:::\n\n\n## Estimated performance\n\n| | Final model |\n|-----|----------------------------|\n| FPR | 0\\.06 |\n| FNR | 0\\.15 |\n| TPR | 0\\.85 |\n| TNR | 0\\.94 |\n| TSS | 0\\.79 |\n| MCC | 0\\.78 |\n\n## Acceptable bias\n\n- false positives: we expect that our knowledge of the distribution is incomplete, and *this is why we train a model*\n\n- false negatives: wrong observations (positive in the data) may be identified by the model (negative prediction)\n\n## Prediction for each pixel\n\n\n\n::: {#f367cc92 .cell execution_count=26}\n``` {.julia .cell-code}\nprediction = similar(temperature, Float64)\nvariability = similar(temperature, Float64)\nuncertainty = similar(temperature, Float64)\nThreads.@threads for k in keys(prediction)\n pred_k = [p[k] for p in predictors[available_variables]]\n bootstraps = [\n samplemodel(pred_k)\n for samplemodel in samplemodels\n ]\n prediction[k] = finalmodel(pred_k)\n variability[k] = iqr(bootstraps)\n uncertainty[k] = entropy(prediction[k])\nend\n```\n:::\n\n\n## Tuned model - prediction\n\n::: {#a18273fd .cell execution_count=27}\n\n::: {.cell-output .cell-output-display execution_count=34}\n![](slides_files/figure-revealjs/cell-28-output-1.png){}\n:::\n:::\n\n\n## Tuned model - variability in output\n\n::: {#0e2dd0e6 .cell execution_count=28}\n\n::: {.cell-output .cell-output-display execution_count=35}\n![](slides_files/figure-revealjs/cell-29-output-1.png){}\n:::\n:::\n\n\n::: footer\nIQR for 50 bootstrap replicates\n:::\n\n## Tuned model - entropy in probability\n\n::: {#8bca6275 .cell execution_count=29}\n\n::: {.cell-output .cell-output-display execution_count=36}\n![](slides_files/figure-revealjs/cell-30-output-1.png){}\n:::\n:::\n\n\n::: footer\nEntropy (in bits) of the NBC probability\n:::\n\n## Tuned model - range\n\n::: {#755beabc .cell execution_count=30}\n\n::: {.cell-output .cell-output-display execution_count=37}\n![](slides_files/figure-revealjs/cell-31-output-1.png){}\n:::\n:::\n\n\n::: footer\nProbability \\> 0.759\n:::\n\n## Predicting the predictions?\n\nShapley values (Monte-Carlo approximation): if we mix the variables across two observations, how important is the $i$-th variable?\n\nExpresses \"importance\" as an additive factor on top of the *average* prediction (here: average prob. of occurrence)\n\n## Calculation of the Shapley values\n\n::: {#3a35709d .cell execution_count=31}\n``` {.julia .cell-code}\nshapval = [similar(first(predictors), Float64) for i in eachindex(available_variables)]\nThreads.@threads for k in keys(shapval[1])\n x = [p[k] for p in predictors[available_variables]]\n for i in axes(shapval, 1)\n shapval[i][k] = shapleyvalues(finalmodel, tX[:,available_variables], x, i; M=50)\n if isnan(shapval[i][k])\n shapval[i][k] = 0.0\n end\n end\nend\n```\n:::\n\n\n## Importance of variables\n\n::: {#a97fdc7e .cell execution_count=32}\n``` {.julia .cell-code}\nvarimp = sum.(map(abs, shapval))\nvarimp ./= sum(varimp)\nshapmax = mosaic(argmax, map(abs, shapval[sortperm(varimp; rev=true)]))\nfor v in sortperm(varimp, rev=true)\n vname = variables[available_variables[v]][2]\n vctr = round(Int, varimp[v]*100)\n println(\"$(vname) - $(vctr)%\")\nend\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nMean Temp. of Coldest Quarter - 46%\nAnnual Precip. - 16%\nPrecip. of Coldest Quarter - 14%\nMean Diurnal Range - 12%\nPrecip. Seasonality - 11%\n```\n:::\n:::\n\n\nThere is a difference between **contributing to model performance** and **contributing to model explainability**\n\n## Top three variables\n\n::: {#7e813121 .cell execution_count=33}\n\n::: {.cell-output .cell-output-display execution_count=41}\n![](slides_files/figure-revealjs/cell-34-output-1.png){}\n:::\n:::\n\n\n## Most determinant predictor\n\n::: {#a6ecd92b .cell execution_count=34}\n\n::: {.cell-output .cell-output-display execution_count=42}\n![](slides_files/figure-revealjs/cell-35-output-1.png){}\n:::\n:::\n\n\n## Future predictions\n\n- relevant variables will remain the same\n\n- relevant $P(+)$ will remain the same\n\n- relevant threshold for presences will remain the same\n\n## Future climate data (ca. 2070)\n\n::: {#19a5fbe7 .cell execution_count=35}\n\n::: {.cell-output .cell-output-display execution_count=43}\n```\n19-element Vector{SimpleSDMPredictor{Float32}}:\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n SDM predictor → 955×2048 grid with 1118001 Float32-valued cells\n```\n:::\n:::\n\n\n## Future climate prediction\n\n::: {#36a53444 .cell execution_count=36}\n``` {.julia .cell-code}\nfuture_prediction = similar(temperature, Float64)\nThreads.@threads for k in keys(future_prediction)\n pred_k = [p[k] for p in future_predictors[available_variables]]\n if any(isnothing.(pred_k))\n continue\n end\n future_prediction[k] = finalmodel(pred_k)\nend\n```\n:::\n\n\n## Tuned model - future prediction\n\n::: {#28223584 .cell execution_count=37}\n\n::: {.cell-output .cell-output-display execution_count=45}\n![](slides_files/figure-revealjs/cell-38-output-1.png){}\n:::\n:::\n\n\n## Loss and gain in distribution\n\n::: {#40f6d1c9 .cell execution_count=38}\n\n::: {.cell-output .cell-output-display execution_count=46}\n```\nSDM response → 955×2048 grid with 6822 Float64-valued cells\n Latitudes\t25.125 ⇢ 64.91666666666666\n Longitudes\t-149.79166666666669 ⇢ -64.45833333333334\n```\n:::\n:::\n\n\n| Change | Area (10⁶ km²) |\n|-----|------|\n| Expansion | 1.9194810307638006 | \n| No change | 4.868109022484933 | \n| Loss | 0.1294527148434877 |\n\n## Tuned model - future range change\n\n::: {#37781baa .cell execution_count=39}\n\n::: {.cell-output .cell-output-display execution_count=48}\n![](slides_files/figure-revealjs/cell-40-output-1.png){}\n:::\n:::\n\n\n## But wait...\n\n> What do you think the species was?\n\n## Take-home\n\n- building a model is *incremental*\n\n- each step adds arbitrary decisions we can control for, justify, or live with\n\n- we can provide explanations for every single prediction\n\n- free online textbook (in development) at `https://tpoisot.github.io/DataSciForBiodivSci/`\n\n## References\n\n",
+ "engine": "jupyter",
+ "markdown": "---\ntitle: \"Building an interpretable SDM from scratch\"\nsubtitle: \"using Julia 1.9\"\nauthor:\n name: \"Timothée Poisot\"\n email: timothee.poisot@umontreal.ca\ninstitute: \"Université de Montréal\"\ntitle-slide-attributes: \n data-background-image: https://cdn.pixabay.com/photo/2017/03/29/11/29/nepal-2184940_960_720.jpg\n data-background-opacity: \"0.15\"\nbibliography: references.bib\ncsl: https://www.zotero.org/styles/ecology-letters\n---\n\n## Overview\n\n- Build a *simple* classifier to predict the distribution of a species\n\n- No, I will not tell you which species, it's a large North American mammal\n\n- Use this as an opportunity to talk about interpretable ML\n\n- Discuss which biases are appropriate in a predictive model\n\n::: footer\nCC BY 4.0 - Timothée Poisot\n:::\n\n------------------------------------------------------------------------\n\n::: r-fit-text\nWe care a lot about the\n\n**process**\n\nand only a little about the\n\n**product**\n:::\n\n------------------------------------------------------------------------\n\n## Why...\n\n... think of SDMs as a ML problem?\n\n: Because they are\n\n... think of explainable ML for SDM?\n\n: Because model uptake *requires* transparency\n\n... not tell us which species this is about?\n\n: Because this is the point (you'll see)\n\n::: footer\nSee @beery2021 for more on SDM-as-ML\n:::\n\n## Do try this at home!\n\n💻 + 📔 + 🗺️ at `https://github.com/tpoisot/InterpretableSDMWithJulia/`\n\n::: {#include-the-packages-we-need .cell execution_count=1}\n``` {.julia .cell-code}\ninclude(joinpath(\"code\", \"pkg.jl\")); # Dependencies\ninclude(joinpath(\"code\", \"nbc.jl\")); # Naive Bayes Classifier\ninclude(joinpath(\"code\", \"bioclim.jl\")); # BioClim model\ninclude(joinpath(\"code\", \"confusion.jl\")); # Confusion matrix utilities\ninclude(joinpath(\"code\", \"splitters.jl\")); # Cross-validation (part one)\ninclude(joinpath(\"code\", \"crossvalidate.jl\")); # Cross-validation (part deux)\ninclude(joinpath(\"code\", \"variableselection.jl\")); # Variable selection\ninclude(joinpath(\"code\", \"shapley.jl\")); # Shapley values\ninclude(joinpath(\"code\", \"palettes.jl\")); # Accessible color palettes\n```\n:::\n\n\n## Species occurrences\n\n::: {#get-the-species-data .cell execution_count=2}\n``` {.julia .cell-code}\nsightings = CSV.File(\"occurrences.csv\")\nocc = [\n (record.longitude, record.latitude)\n for record in sightings\n if record.classification == \"Class A\"\n]\nfilter!(r -> -90 <= r[2] <= 90, occ)\nfilter!(r -> -180 <= r[1] <= 180, occ)\nboundingbox = (\n left = minimum(first.(occ)),\n right = maximum(first.(occ)),\n bottom = minimum(last.(occ)),\n top = maximum(last.(occ)),\n)\n```\n:::\n\n\n## Bioclimatic data\n\nWe collect BioClim data from CHELSA v1, using `SpeciesDistributionToolkit`\n\n::: {#download-the-bioclim-data-from-worldclim2 .cell execution_count=3}\n``` {.julia .cell-code}\nprovider = RasterData(WorldClim2, BioClim)\nopts = (; resolution=10.0)\ntemperature = SimpleSDMPredictor(provider, layer=1; opts..., boundingbox...)\n```\n:::\n\n\n::: footer\nBioClim data from @karger2020; see @dansereau2021 for more about the packages\n:::\n\n## Bioclimatic data\n\nWe set the pixels with only open water to `nothing`\n\n::: {#get-the-open-water-pixels .cell execution_count=4}\n``` {.julia .cell-code}\nwater = \n SimpleSDMPredictor(RasterData(EarthEnv, LandCover), layer=12; boundingbox...)\nland = similar(temperature, Bool)\nreplace!(land, false => true)\nfor k in keys(land)\n if !isnothing(water[k])\n if water[k] == 100\n land[k] = false\n end\n end\nend\ntemperature = mask(land, temperature)\n```\n:::\n\n\n::: footer\nLand-cover data from @tuanmu2014\n:::\n\n## Where are we so far?\n\n::: {#9283eb52 .cell execution_count=5}\n\n::: {.cell-output .cell-output-display execution_count=7}\n![](slides_files/figure-revealjs/cell-6-output-1.png){}\n:::\n:::\n\n\n## Spatial thinning\n\nWe limit the occurrences to one per grid cell, assigned to the center of the grid cell\n\n::: {#make-the-layer-for-presences .cell execution_count=6}\n``` {.julia .cell-code}\npresence_layer = similar(temperature, Bool)\nfor i in axes(occ, 1)\n if ~isnothing(presence_layer[occ[i]...])\n presence_layer[occ[i]...] = true\n end\nend\n```\n:::\n\n\n## Background points generation\n\nWe generate background points proportionally to the distance away from observations\n\n::: {#make-the-pseudo-absence-buffer .cell execution_count=7}\n``` {.julia .cell-code}\npossible_background = pseudoabsencemask(DistanceToEvent, presence_layer)\n```\n:::\n\n\nAnd then we sample three pseudo-absence for each occurrence:\n\n::: {#make-the-absence-layer .cell execution_count=8}\n``` {.julia .cell-code}\nabsence_layer = backgroundpoints(\n (x -> x^1.01).(possible_background), \n 3sum(presence_layer);\n replace=false\n)\n```\n:::\n\n\n::: footer\nSee @barbet-massin2012 for more on background points\n:::\n\n## Background points cleaning\n\nWe can remove all of the information that is neither a presence nor a pseudo-absence\n\n::: {#pseudo-absencepresence-remove .cell execution_count=9}\n``` {.julia .cell-code}\nreplace!(absence_layer, false => nothing)\nreplace!(presence_layer, false => nothing)\n```\n:::\n\n\n## Data overview\n\n::: {#1e68595c .cell execution_count=10}\n\n::: {.cell-output .cell-output-display execution_count=12}\n![](slides_files/figure-revealjs/cell-11-output-1.png){}\n:::\n:::\n\n\n\n\n## Preparing the responses and variables\n\n::: {#assemble-y-and-x .cell execution_count=12}\n``` {.julia .cell-code}\nXpresence = hcat([bioclim_var[keys(presence_layer)] for bioclim_var in predictors]...)\nypresence = fill(true, length(presence_layer))\nXabsence = hcat([bioclim_var[keys(absence_layer)] for bioclim_var in predictors]...)\nyabsence = fill(false, length(absence_layer))\nX = vcat(Xpresence, Xabsence)\ny = vcat(ypresence, yabsence)\n```\n:::\n\n\n\n\n## The model -- Naive Bayes Classifier\n\nPrediction:\n\n$$\nP(+|x) = \\frac{P(+)}{P(x)}P(x|+)\n$$\n\nDecision rule:\n\n$$\n\\hat y = \\text{argmax}_j \\, P(\\mathbf{c}_j)\\prod_i P(\\mathbf{x}_i|\\mathbf{c}_j)\n$$\n\n::: footer\nWith $n$ instances and $f$ features, NBC trains *and* predicts in $\\mathcal{O}(n\\times f)$\n:::\n\n## The model -- Naive Bayes Classifier\n\nAssumption of Gaussian distributions:\n\n$$\nP(x|+) = \\text{pdf}(x, \\mathcal{N}(\\mu_+, \\sigma_+))\n$$\n\n## Cross-validation\n\nWe keep an **unseen** *testing* set -- this will be used at the very end to report expected model performance\n\n::: {#testing-set .cell execution_count=14}\n``` {.julia .cell-code}\nidx, tidx = holdout(y, X; permute=true)\n```\n:::\n\n\nFor *validation*, we will run k-folds\n\n::: {#k-folds .cell execution_count=15}\n``` {.julia .cell-code}\nty, tX = y[idx], X[idx,:]\nfolds = kfold(ty, tX; k=15, permute=true)\nk = length(folds)\n```\n:::\n\n\n::: footer\nSee @valavi2018 for more on cross-validation\n:::\n\n## A note on cross-validation\n\nAll models share the same folds\n\n: we can compare the validation performance across experiments to select the best model\n\nModel performance can be compared\n\n: we average the relevant summary statistics over each validation set\n\nTesting set is *only* for future evaluation\n\n: we can only use it once and report the expected performance *of the best model*\n\n## Baseline performance\n\nWe need to get a sense of how difficult the classification problem is:\n\n::: {#38198fb6 .cell execution_count=16}\n``` {.julia .cell-code}\nN_v0 = crossvalidate(naivebayes, ty, tX, folds)\nB_v0 = crossvalidate(bioclim, ty, tX, folds, eps())\n```\n:::\n\n\nThis uses an un-tuned model with all variables and reports the average over all validation sets. In addition, we will always use the BioClim model as a comparison.\n\n## Measures on the confusion matrix {.smaller}\n\n| | BioClim | NBC |\n|-----|-------------------------------|-------------------------------|\n| FPR | 0\\.3274 | 0\\.1186 |\n| FNR | 0\\.0138 | 0\\.156 |\n| TPR | 0\\.9862 | 0\\.844 |\n| TNR | 0\\.6726 | 0\\.8814 |\n| TSS | 0\\.6588 | 0\\.7254 |\n| MCC | 0\\.5737 | 0\\.6872 |\n\n::: footer\nIt's a good idea to check the values for the training sets too...\n:::\n\n## Variable selection\n\nWe add variables one at a time, until the Matthew's Correlation Coefficient stops increasing -- we keep annual temperature, isothermality, mean diurnal range, and annual precipitation\n\n::: {#383eec55 .cell execution_count=17}\n``` {.julia .cell-code}\navailable_variables = forwardselection(ty, tX, folds, naivebayes, mcc)\n```\n:::\n\n\nThis method identifies 5 variables, some of which are:\n\n1. Mean Temp\\. of Coldest Quarter\n\n2. Mean Diurnal Range \n\n3. Annual Precip\\.\n\n## Variable selection?\n\n- Constrained variable selection\n\n- VIF threshold (over the extent or over document occurrences?)\n\n- PCA for dimensionality reduction *v.* Whitening for colinearity removal\n\n- Potential for data leakage: data transformations don't exist, they are just models we can train\n\n## Model with variable selection\n\n::: {#4a50f5ce .cell execution_count=18}\n``` {.julia .cell-code}\nN_v1 = crossvalidate(naivebayes, ty, tX[:,available_variables], folds)\nB_v1 = crossvalidate(bioclim, ty, tX[:,available_variables], folds, eps())\n```\n:::\n\n\n## Measures on the confusion matrix {.smaller}\n\n| | BioClim | NBC | BioClim (v.s.) | NBC (v.s.) |\n|---------------|---------------|---------------|---------------|---------------|\n| FPR | 0\\.3274 | 0\\.1186 | 0\\.5892 | 0\\.0908 |\n| FNR | 0\\.0138 | 0\\.156 | 0\\.0071 | 0\\.1411 |\n| TPR | 0\\.9862 | 0\\.844 | 0\\.9929 | 0\\.8589 |\n| TNR | 0\\.6726 | 0\\.8814 | 0\\.4108 | 0\\.9092 |\n| TSS | 0\\.6588 | 0\\.7254 | 0\\.4037 | 0\\.768 |\n| MCC | 0\\.5737 | 0\\.6872 | 0\\.3817 | 0\\.7401 |\n\n## How do we make the model better?\n\nThe NBC is a *probabilistic classifier* returning $P(+|\\mathbf{x})$\n\nThe *decision rule* is to assign a presence when $P(\\cdot) > 0.5$\n\nBut $P(\\cdot) > \\tau$ is a far more general approach, and we can use learning curves to identify $\\tau$\n\n## Thresholding the model\n\n::: {#8cabc328 .cell execution_count=19}\n``` {.julia .cell-code}\nthr = LinRange(0.0, 1.0, 500)\nT = hcat([crossvalidate(naivebayes, ty, tX[:,available_variables], folds, t) for t in thr]...)\n```\n:::\n\n\n## But how do we pick the threshold?\n\n::: {#c8def73c .cell execution_count=20}\n\n::: {.cell-output .cell-output-display execution_count=25}\n![](slides_files/figure-revealjs/cell-21-output-1.svg){}\n:::\n:::\n\n\n## Tuned model with selected variables\n\n::: {#c1239243 .cell execution_count=21}\n``` {.julia .cell-code}\nN_v2 = crossvalidate(naivebayes, ty, tX[:,available_variables], folds, thr[m])\n```\n:::\n\n\n## Measures on the confusion matrix {.smaller}\n\n| | BioClim | NBC | BioClim (v.s.) | NBC (v.s.) | NBC (v.s. + tuning) |\n|------------|------------|------------|------------|------------|------------|\n| FPR | 0\\.3274 | 0\\.1186 | 0\\.5892 | 0\\.0908 | 0\\.0805 |\n| FNR | 0\\.0138 | 0\\.156 | 0\\.0071 | 0\\.1411 | 0\\.1572 |\n| TPR | 0\\.9862 | 0\\.844 | 0\\.9929 | 0\\.8589 | 0\\.8428 |\n| TNR | 0\\.6726 | 0\\.8814 | 0\\.4108 | 0\\.9092 | 0\\.9195 |\n| TSS | 0\\.6588 | 0\\.7254 | 0\\.4037 | 0\\.768 | 0\\.7623 |\n| MCC | 0\\.5737 | 0\\.6872 | 0\\.3817 | 0\\.7401 | 0\\.7445 |\n\n## How do we make the model better?\n\nThe NBC is a *Bayesian classifier* returning $P(+|\\mathbf{x})$\n\nThe *actual probability* depends on $P(+)$\n\nThere is no reason not to also tune $P(+)$ (jointly with other hyper-parameters)!\n\n## Joint tuning of hyper-parameters\n\n::: {#9557f8c7 .cell execution_count=22}\n``` {.julia .cell-code}\nthr = LinRange(0.0, 1.0, 55)\npplus = LinRange(0.0, 1.0, 45)\nT = [crossvalidate(naivebayes, ty, tX[:,available_variables], folds, t; presence=prior) for t in thr, prior in pplus]\nbest_mcc, params = findmax(map(v -> mean(mcc.(v)), T))\nτ = thr[params.I[1]]\nppres = pplus[params.I[2]]\n```\n:::\n\n\n## Tuned (again) model with selected variables\n\n::: {#34b2f0d0 .cell execution_count=23}\n``` {.julia .cell-code}\nN_v3 = crossvalidate(naivebayes, ty, tX[:,available_variables], folds, τ; presence=ppres)\n```\n:::\n\n\n## Measures on the confusion matrix {.smaller}\n\n| | BioClim | NBC (v0) | NBC (v1) | NBC (v2) | NBC (v3) |\n|------------|------------|-------------|------------|------------|------------|\n| FPR | 0\\.3274 | 0\\.1186 | 0\\.0908 | 0\\.0805 | 0\\.0774 |\n| FNR | 0\\.0138 | 0\\.156 | 0\\.1411 | 0\\.1572 | 0\\.1632 |\n| TPR | 0\\.9862 | 0\\.844 | 0\\.8589 | 0\\.8428 | 0\\.8368 |\n| TNR | 0\\.6726 | 0\\.8814 | 0\\.9092 | 0\\.9195 | 0\\.9226 |\n| TSS | 0\\.6588 | 0\\.7254 | 0\\.768 | 0\\.7623 | 0\\.7594 |\n| MCC | 0\\.5737 | 0\\.6872 | 0\\.7401 | 0\\.7445 | 0\\.7448 |\n\n## Tuned model performance\n\nWe can retrain over *all* the training data\n\n::: {#5761b4ac .cell execution_count=24}\n``` {.julia .cell-code}\nfinalmodel = naivebayes(ty, tX[:,available_variables]; presence=ppres)\nprediction = vec(mapslices(finalmodel, X[tidx,available_variables]; dims=2))\nC = ConfusionMatrix(prediction, y[tidx], τ)\n```\n:::\n\n\n## Estimated performance\n\n| | Final model |\n|-----|----------------------------|\n| FPR | 0\\.0673 |\n| FNR | 0\\.1712 |\n| TPR | 0\\.8288 |\n| TNR | 0\\.9327 |\n| TSS | 0\\.7615 |\n| MCC | 0\\.7494 |\n\n## Acceptable bias\n\n- false positives: we expect that our knowledge of the distribution is incomplete, and *this is why we train a model*\n\n- false negatives: wrong observations (positive in the data) may be identified by the model (negative prediction)\n\n## Prediction for each pixel\n\n\n\n::: {#2fc86a64 .cell execution_count=26}\n``` {.julia .cell-code}\nprediction = similar(temperature, Float64)\nvariability = similar(temperature, Float64)\nuncertainty = similar(temperature, Float64)\nThreads.@threads for k in keys(prediction)\n pred_k = [p[k] for p in predictors[available_variables]]\n bootstraps = [\n samplemodel(pred_k)\n for samplemodel in samplemodels\n ]\n prediction[k] = finalmodel(pred_k)\n variability[k] = iqr(bootstraps)\n uncertainty[k] = entropy(prediction[k])\nend\n```\n:::\n\n\n## Tuned model - prediction\n\n::: {#85b5115b .cell execution_count=27}\n\n::: {.cell-output .cell-output-display execution_count=35}\n![](slides_files/figure-revealjs/cell-28-output-1.png){}\n:::\n:::\n\n\n## Tuned model - variability in output\n\n::: {#3adde8e9 .cell execution_count=28}\n\n::: {.cell-output .cell-output-display execution_count=36}\n![](slides_files/figure-revealjs/cell-29-output-1.png){}\n:::\n:::\n\n\n::: footer\nIQR for 50 bootstrap replicates\n:::\n\n## Tuned model - entropy in probability\n\n::: {#96e7bd6d .cell execution_count=29}\n\n::: {.cell-output .cell-output-display execution_count=37}\n![](slides_files/figure-revealjs/cell-30-output-1.png){}\n:::\n:::\n\n\n::: footer\nEntropy (in bits) of the NBC probability\n:::\n\n## Tuned model - range\n\n::: {#87628523 .cell execution_count=30}\n\n::: {.cell-output .cell-output-display execution_count=38}\n![](slides_files/figure-revealjs/cell-31-output-1.png){}\n:::\n:::\n\n\n::: footer\nProbability \\> 0.333\n:::\n\n## Predicting the predictions?\n\nShapley values (Monte-Carlo approximation): if we mix the variables across two observations, how important is the $i$-th variable?\n\nExpresses \"importance\" as an additive factor on top of the *average* prediction (here: average prob. of occurrence)\n\n## Calculation of the Shapley values\n\n::: {#9d60faf1 .cell execution_count=31}\n``` {.julia .cell-code}\nshapval = [similar(first(predictors), Float64) for i in eachindex(available_variables)]\nThreads.@threads for k in keys(shapval[1])\n x = [p[k] for p in predictors[available_variables]]\n for i in axes(shapval, 1)\n shapval[i][k] = shapleyvalues(finalmodel, tX[:,available_variables], x, i; M=50)\n if isnan(shapval[i][k])\n shapval[i][k] = 0.0\n end\n end\nend\n```\n:::\n\n\n## Importance of variables\n\n::: {#bcc40b02 .cell execution_count=32}\n``` {.julia .cell-code}\nvarimp = sum.(map(abs, shapval))\nvarimp ./= sum(varimp)\nshapmax = mosaic(argmax, map(abs, shapval[sortperm(varimp; rev=true)]))\nfor v in sortperm(varimp, rev=true)\n vname = variables[available_variables[v]][2]\n vctr = round(Int, varimp[v]*100)\n println(\"$(vname) - $(vctr)%\")\nend\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nMean Temp. of Coldest Quarter - 36%\nAnnual Precip. - 21%\nPrecip. of Coldest Quarter - 19%\nPrecip. Seasonality - 12%\nMean Diurnal Range - 11%\n```\n:::\n:::\n\n\nThere is a difference between **contributing to model performance** and **contributing to model explainability**\n\n## Top three variables\n\n::: {#854ec227 .cell execution_count=33}\n\n::: {.cell-output .cell-output-display execution_count=42}\n![](slides_files/figure-revealjs/cell-34-output-1.png){}\n:::\n:::\n\n\n## Most determinant predictor\n\n::: {#d9693994 .cell execution_count=34}\n\n::: {.cell-output .cell-output-display execution_count=43}\n![](slides_files/figure-revealjs/cell-35-output-1.png){}\n:::\n:::\n\n\n## Future predictions\n\n- relevant variables will remain the same\n\n- relevant $P(+)$ will remain the same\n\n- relevant threshold for presences will remain the same\n\n## Future climate data (ca. 2070)\n\n::: {#aa7c40a6 .cell execution_count=35}\n\n::: {.cell-output .cell-output-display execution_count=44}\n```\n19-element Vector{SimpleSDMPredictor{Float32}}:\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n SDM predictor → 240×513 grid with 69098 Float32-valued cells\n```\n:::\n:::\n\n\n## Future climate prediction\n\n::: {#66eb09a1 .cell execution_count=36}\n``` {.julia .cell-code}\nfuture_prediction = similar(temperature, Float64)\nThreads.@threads for k in keys(future_prediction)\n pred_k = [p[k] for p in future_predictors[available_variables]]\n if any(isnothing.(pred_k))\n continue\n end\n future_prediction[k] = finalmodel(pred_k)\nend\n```\n:::\n\n\n## Tuned model - future prediction\n\n::: {#93b954ab .cell execution_count=37}\n\n::: {.cell-output .cell-output-display execution_count=46}\n![](slides_files/figure-revealjs/cell-38-output-1.png){}\n:::\n:::\n\n\n## Loss and gain in distribution\n\n::: {#75a965db .cell execution_count=38}\n\n::: {.cell-output .cell-output-display execution_count=47}\n```\nSDM response → 240×513 grid with 941 Float64-valued cells\n Latitudes\t25.0 ⇢ 65.0\n Longitudes\t-149.83333333333334 ⇢ -64.33333333333334\n```\n:::\n:::\n\n\n| Change | Area (10⁶ km²) |\n|-----|------|\n| Expansion | 1.900361191915685 | \n| No change | 4.698755924606283 | \n| Loss | 0.2519090994649904 |\n\n## Tuned model - future range change\n\n::: {#5987d0d4 .cell execution_count=39}\n\n::: {.cell-output .cell-output-display execution_count=49}\n![](slides_files/figure-revealjs/cell-40-output-1.png){}\n:::\n:::\n\n\n## But wait...\n\n> What do you think the species was?\n\nHuman in the loop *v.* Algorithm in the loop\n\n## Take-home\n\n- building a model is *incremental*\n\n- each step adds arbitrary decisions we can control for, justify, or live with\n\n- we can provide explanations for every single prediction\n\n- free online textbook (in development) at `https://tpoisot.github.io/DataSciForBiodivSci/`\n\n## References\n\n",
"supporting": [
"slides_files/figure-revealjs"
],
diff --git a/_freeze/slides/figure-revealjs/cell-11-output-1.png b/_freeze/slides/figure-revealjs/cell-11-output-1.png
index 51f2269..c3cad2d 100644
Binary files a/_freeze/slides/figure-revealjs/cell-11-output-1.png and b/_freeze/slides/figure-revealjs/cell-11-output-1.png differ
diff --git a/_freeze/slides/figure-revealjs/cell-21-output-1.svg b/_freeze/slides/figure-revealjs/cell-21-output-1.svg
index 2ed77f6..abeb6a7 100644
--- a/_freeze/slides/figure-revealjs/cell-21-output-1.svg
+++ b/_freeze/slides/figure-revealjs/cell-21-output-1.svg
@@ -2,1140 +2,1176 @@