Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

createSSURGO: add support for creating DuckDB and other DBI-compatible databases #352

Merged
merged 4 commits into from
Jun 10, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ License: GPL (>= 3)
LazyLoad: yes
Depends: R (>= 3.5.0)
Imports: grDevices, graphics, stats, utils, methods, aqp (>= 2.0.2), data.table, DBI, curl
Suggests: jsonlite, xml2, httr, rvest, odbc, RSQLite, sf, wk, terra, raster, knitr, rmarkdown, testthat
Suggests: jsonlite, xml2, httr, rvest, odbc, RSQLite, duckdb, sf, wk, terra, raster, knitr, rmarkdown, testthat
Repository: CRAN
URL: https://ncss-tech.github.io/soilDB/, https://ncss-tech.github.io/AQP/
BugReports: https://github.com/ncss-tech/soilDB/issues
Expand Down
120 changes: 87 additions & 33 deletions R/createSSURGO.R
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
#' Several ESRI shapefiles are found in the _/spatial/_ folder extracted from a SSURGO ZIP. These have prefix `soilmu_` (mapunit), `soilsa_` (survey area), `soilsf_` (special features). There will also be a TXT file with prefix `soilsf_` describing any special features. Shapefile names then have an `a_` (polygon), `l_` (line), `p_` (point) followed by the soil survey area symbol.
#'
#' @return Character. Paths to downloaded ZIP files (invisibly). May not exist if `remove_zip = TRUE`.
#' @seealso [createSSURGO()]
downloadSSURGO <- function(WHERE = NULL,
areasymbols = NULL,
destdir = tempdir(),
Expand Down Expand Up @@ -111,10 +112,18 @@ downloadSSURGO <- function(WHERE = NULL,
invisible(paths2)
}

#' Create a SQLite database or GeoPackage from one or more SSURGO Exports
#' Create a database from SSURGO Exports
#'
#' The following database types are tested and fully supported:
#' - SQLite or Geopackage
#' - DuckDB
#' - Postgres or PostGIS
#'
#' In theory any other DBI-compatible data source can be used for output. See `conn` argument. If you encounter issues using specific DBI connection types, please report in the soilDB issue tracker.
#'
#' @param filename Output file name (e.g. `'db.sqlite'` or `'db.gpkg'`)
#' @param exdir Path containing containing SSURGO spatial (.shp) and tabular (.txt) files.
#' @param filename Output file name (e.g. `'db.sqlite'` or `'db.gpkg'`). Only used when `con` is not specified by the user.
#' @param exdir Path containing containing input SSURGO spatial (.shp) and tabular (.txt) files, downloaded and extracted by `downloadSSURGO()` or similar.
#' @param conn A _DBIConnection_ object. Default is a `SQLiteConnection` used for writing .sqlite or .gpkg files. Alternate options are any DBI connection types. When `include_spatial=TRUE`, the sf package is used to write spatial data to the database.
#' @param pattern Character. Optional regular expression to use to filter subdirectories of `exdir`. Default: `NULL` will search all subdirectories for SSURGO export files.
#' @param include_spatial Logical. Include spatial data layers in database? Default: `TRUE`.
#' @param overwrite Logical. Overwrite existing layers? Default `FALSE` will append to existing tables/layers.
Expand All @@ -123,6 +132,7 @@ downloadSSURGO <- function(WHERE = NULL,
#' @param ... Additional arguments passed to `write_sf()` for writing spatial layers.
#'
#' @return Character. Vector of layer/table names in `filename`.
#' @seealso [downloadSSURGO()]
#' @export
#' @examples
#' \dontrun{
Expand All @@ -131,29 +141,44 @@ downloadSSURGO <- function(WHERE = NULL,
#' }
createSSURGO <- function(filename,
exdir,
conn = DBI::dbConnect(DBI::dbDriver("SQLite"),
filename,
loadable.extensions = TRUE),
pattern = NULL,
include_spatial = TRUE,
overwrite = FALSE,
header = FALSE,
quiet = TRUE,
...) {

if (missing(filename) || length(filename) == 0) {
stop("`filename` should be a path to a .gpkg or .sqlite file to create or append to.")
if ((missing(filename) || length(filename) == 0) && missing(conn)) {
stop("`filename` should be a path to a .gpkg or .sqlite file to create or append to, or a DBIConnection should be provided via `conn`.")
} else {
filename <- NULL
}

IS_GPKG <- grepl("\\.gpkg$", filename, ignore.case = TRUE)[1]
# DuckDB has special spatial format, so it gets custom handling for
IS_DUCKDB <- inherits(conn, "duckdb_connection")

f <- list.files(exdir, recursive = TRUE, pattern = pattern, full.names = TRUE)

if (!requireNamespace("sf"))
stop("package `sf` is required to write spatial datasets to SSURGO SQLite databases", call. = FALSE)

if (!requireNamespace("RSQLite"))
stop("package `RSQLite` is required to write tabular datasets to SSURGO SQLite databases", call. = FALSE)
if (!IS_DUCKDB) {
if (!requireNamespace("sf"))
stop("package `sf` is required to write spatial datasets to DBI data sources", call. = FALSE)
}

if (isTRUE(overwrite) && file.exists(filename)) {
file.remove(filename)
if (missing(conn)) {
# delete existing file if overwrite=TRUE
if (isTRUE(overwrite) && file.exists(filename)) {
file.remove(filename)
}

# if user did not specify their own connection, close on exit
on.exit(DBI::dbDisconnect(conn))
} else {
if (isTRUE(overwrite)) {
message("`filename` and `overwrite` arguments ignored when `conn` is specified")
}
}

# create and add combined vector datasets:
Expand All @@ -167,14 +192,35 @@ createSSURGO <- function(filename,
if (nrow(shp.grp) >= 1 && ncol(shp.grp) == 3 && include_spatial) {
f.shp.grp <- split(f.shp, list(feature = shp.grp[,1], geom = shp.grp[,2]))

if (IS_DUCKDB) {
DBI::dbExecute(conn, "INSTALL spatial; LOAD spatial;")
}

lapply(seq_along(f.shp.grp), function(i) {
lapply(seq_along(f.shp.grp[[i]]), function(j){
lapply(seq_along(f.shp.grp[[i]]), function(j) {
lnm <- layer_names[match(gsub(".*soil([musfa]{2}_[apl])_.*", "\\1", f.shp.grp[[i]][j]),
names(layer_names))]

if (overwrite && j == 1) {
sf::write_sf(sf::read_sf(f.shp.grp[[i]][j]), filename, layer = lnm, overwrite = TRUE, ...)
} else sf::write_sf(sf::read_sf(f.shp.grp[[i]][j]), filename, layer = lnm, append = TRUE, ...)
if (IS_DUCKDB) {
if (overwrite && j == 1) {
DBI::dbExecute(conn, sprintf("DROP TABLE IF EXISTS %s;", lnm))
DBI::dbExecute(conn, sprintf("CREATE TABLE %s AS SELECT * FROM ST_Read('%s');",
lnm, f.shp.grp[[i]][j]))
} else {
DBI::dbExecute(conn, sprintf("INSERT INTO %s SELECT * FROM ST_Read('%s');",
lnm, f.shp.grp[[i]][j]))
}
} else {
shp <- sf::read_sf(f.shp.grp[[i]][j])

colnames(shp) <- tolower(colnames(shp))
sf::st_geometry(shp) <- "geometry"

if (overwrite && j == 1) {
sf::write_sf(shp, conn, layer = lnm, delete_layer = TRUE, ...)
} else {
sf::write_sf(shp, conn, layer = lnm, append = TRUE, ...)
}
}
NULL
})
})
Expand Down Expand Up @@ -211,9 +257,6 @@ createSSURGO <- function(filename,
msidxdet <- read.delim(msidxdn[1], sep = "|", stringsAsFactors = FALSE, header = header)
}

con <- RSQLite::dbConnect(RSQLite::SQLite(), filename, loadable.extensions = TRUE)
on.exit(RSQLite::dbDisconnect(con))

lapply(names(f.txt.grp), function(x) {

if (!is.null(mstabcol)) {
Expand All @@ -226,7 +269,7 @@ createSSURGO <- function(filename,
}

d <- try(as.data.frame(data.table::rbindlist(lapply(seq_along(f.txt.grp[[x]]), function(i) {

# print(f.txt.grp[[x]][i])
y <- read.delim(f.txt.grp[[x]][i], sep = "|", stringsAsFactors = FALSE, header = header)

if (length(y) == 1) {
Expand All @@ -252,37 +295,48 @@ createSSURGO <- function(filename,
# write tabular data to file
try({
if (overwrite) {
RSQLite::dbWriteTable(con, mstab_lut[x], d, overwrite = TRUE)
} else RSQLite::dbWriteTable(con, mstab_lut[x], d, append = TRUE)
DBI::dbWriteTable(conn, mstab_lut[x], d, overwrite = TRUE)
} else {
DBI::dbWriteTable(conn, mstab_lut[x], d, append = TRUE)
}
}, silent = quiet)

# create pkey indices
if (!is.null(indexPK) && length(indexPK) > 0) {
try({
RSQLite::dbExecute(con, sprintf("CREATE UNIQUE INDEX IF NOT EXISTS %s ON %s (%s)",
paste0('PK_', mstab_lut[x]), mstab_lut[x], paste0(indexPK, collapse = ",")))
q <- sprintf("CREATE UNIQUE INDEX IF NOT EXISTS %s ON %s (%s)",
paste0('PK_', mstab_lut[x]), mstab_lut[x],
paste0(indexPK, collapse = ","))
DBI::dbExecute(conn, q)
}, silent = quiet)
}

# create key indices
if (!is.null(indexDI) && length(indexDI) > 0) {
for (i in seq_along(indexDI)) {
try({
RSQLite::dbExecute(con, sprintf("CREATE INDEX IF NOT EXISTS %s ON %s (%s)",
paste0('DI_', mstab_lut[x]), mstab_lut[x], indexDI[i]))
q <- sprintf("CREATE INDEX IF NOT EXISTS %s ON %s (%s)",
paste0('DI_', mstab_lut[x]), mstab_lut[x], indexDI[i])
DBI::dbExecute(conn, q)
}, silent = quiet)
}
}

if (inherits(conn, 'SQLiteConnection')) {
IS_GPKG <- grepl("\\.gpkg$", conn@dbname, ignore.case = TRUE)[1]
} else {
IS_GPKG <- FALSE
}

# for GPKG output, add gpkg_contents (metadata for features and attributes)
if (IS_GPKG) {
if (!.gpkg_has_contents(con)) {
if (!.gpkg_has_contents(conn)) {
# if no spatial data inserted, there will be no gpkg_contents table initally
try(.gpkg_create_contents(con))
try(.gpkg_create_contents(conn))
}
# update gpkg_contents table entry
try(.gpkg_delete_contents(con, mstab_lut[x]))
try(.gpkg_add_contents(con, mstab_lut[x]))
try(.gpkg_delete_contents(conn, mstab_lut[x]))
try(.gpkg_add_contents(conn, mstab_lut[x]))
}

# TODO: other foreign keys/relationships? ALTER TABLE/ADD CONSTRAINT not available in SQLite
Expand All @@ -291,7 +345,7 @@ createSSURGO <- function(filename,
}
})

res <- RSQLite::dbListTables(con)
res <- DBI::dbListTables(conn)
invisible(res)
}

Expand Down
21 changes: 17 additions & 4 deletions man/createSSURGO.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 3 additions & 0 deletions man/downloadSSURGO.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

91 changes: 91 additions & 0 deletions misc/postgis-ssurgo.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
---
title: "`createSSURGO()` + PostGIS"
output: md_document
---

## Setup PostGIS Docker Container

First, install [Docker](https://docs.docker.com/engine/install/).

Then, pull the `postgis` image you want.

Here, we get the latest version using tag `"latest"`:

```sh
docker pull postgis/postgis:latest
```

Then run an instance of the image.

```sh
docker run --name steep-piano -e POSTGRES_PASSWORD=mypassword -d -p 5432:5432 postgis/postgis
```

Give it a name (e.g. `steep-piano`). The default username is `postgres`. Set a password (e.g. `"mypassword"`) and port to forward (e.g. `5432`) from your docker container to your public host network address.

In the simplest case you are just running `postgis` for local testing, so the host address will simply be `localhost`.

You can then make a connection to the PostGIS instance running in the Docker container.

To make sure everything is working open up the `psql` SQL prompt:

```sh
psql -h localhost -p 5432 -U postgres
```

Then use command `\l` to list databases then `\q` to quit.

# Connecting with R

Create a database connection to your local PostGIS instance. To do this we use DBI and RPostgres packages.

```{r}
library(soilDB)
library(RPostgres)

con <- DBI::dbConnect(DBI::dbDriver("Postgres"),
host = "localhost",
port = 5432L,
user = "postgres",
password = "password")
```

Download some SSURGO data, if needed, and extract the files/folder structure to `"SSURGO_test"` subfolder of current working directory.

```{r, eval = FALSE}
if (!dir.exists("SSURGO_test"))
downloadSSURGO(areasymbols = c("CA067", "CA077", "CA632"),
destdir = "SSURGO_test")
```

Pass the `con` argument to `createSSURGO()` to override the default SQLite connection that is created to `filename`. When `con` is not the default SQLite connection, the `filename` argument can take any value (including `NULL`).

```{r, eval = FALSE}
createSSURGO(exdir = "SSURGO_test",
con = con)
```

Once the tables have been written to the database, you can write queries involving spatial and tabular data.
Perhaps the easiest way to do this is to use `sf::st_read()`.

The `st_read()` function can take a DBIConnection as a data source, and takes a `query` argument. It will return an `sf` data frame if the result table contains a geometry column. Otherwise, it will return a data.frame (with a warning).

```{r}
res <- sf::st_read(con, query = "SELECT
ST_Centroid(geometry) AS centroid,
ST_Area(geometry) AS polygon_area,
muaggatt.*
FROM mupolygon
LEFT JOIN muaggatt ON muaggatt.mukey = CAST(mupolygon.mukey AS integer)
LIMIT 2")

res

plot(res["musym"], pch = "+")
```

Close the connection when you are done.

```{r}
DBI::dbDisconnect(con)
```
Loading