Changes to OSD formatting #64

Closed
dylanbeaudette opened this issue Apr 26, 2023 · 4 comments

Comments

@dylanbeaudette
Member

dylanbeaudette commented Apr 26, 2023

Several changes to the OSD formatting standards (NSSH) may cause further inconsistency among OSD formatting styles encountered within the entire collection.

  1. Conversion of doubled hyphen-minus delimiters (--) to em dash (—) in all sections. This is most likely to affect parsing of the TYPICAL PEDON section. See .extractHzData().

  2. Section headers will now use title case: "TYPICAL PEDON:" → "Typical Pedon:" . This may affect all parsing related to finding section headers, and the downstream use of list element names if those are changed to match.

  3. It is not clear if the encoding of the text or HTML files will change, will the new files be Unicode?
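As a rough illustration of item 2, header detection could be made insensitive to letter case. This is only a sketch: `find_section_header()` is a hypothetical helper, not a function in the existing parser, and the sample lines are made up for the example.

```r
# Hypothetical helper (not part of the existing parser): find the line(s)
# that start with a given section header, ignoring letter case.
find_section_header <- function(lines, header) {
  # anchored pattern such as "^\\s*TYPICAL PEDON:"
  pattern <- paste0("^\\s*", header, ":")
  grep(pattern, lines, ignore.case = TRUE)
}

# made-up sample lines in the old and new header styles
old_style <- c("TAXONOMIC CLASS: (text)", "TYPICAL PEDON: Gamma silt loam (text)")
new_style <- c("Taxonomic Class: (text)", "Typical Pedon:", "Gamma silt loam (text)")

# both calls return the index of the Typical Pedon header line
find_section_header(old_style, "TYPICAL PEDON")
find_section_header(new_style, "TYPICAL PEDON")
```

Matching this way would leave the downstream list element names untouched, since the name passed in does not have to follow the case used in the file.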

The TYPICAL PEDON section is also modified such that the short narrative is on its own line:

Typical Pedon:
Gamma silt loam with a north-facing, linear, 1 percent slope in an alfalfa field at an elevation of 210 meters. (Colors are for dry soil unless otherwise noted.) 

Ap—0 to 15 centimeters; grayish brown (10YR 5/2) silt loam, very dark grayish brown (10YR 3/2) moist; weak fine granular structure; slightly hard, friable; neutral (pH 6.7 in 1:1 water); abrupt smooth boundary. (10 to 23 centimeters thick)

C—15 to 33 centimeters; stratified grayish brown (10YR 5/2) and light brownish gray (10YR 6/2) silt loam, very dark grayish brown (10YR 3/2) and dark grayish brown (10YR 4/2) moist; massive with evident bedding planes; slightly hard, friable; few fine prominent reddish brown (5YR 4/4) masses of oxidized iron in the soil matrix; neutral (pH 6.7 in 1:1 water); abrupt smooth boundary. (15 to 30 centimeters thick)

Cg1—33 to 48 centimeters; stratified dark gray (10YR 4/1) and grayish brown (10YR 5/2) silt loam, very dark gray (10YR 3/1) and dark grayish brown (10YR 4/2) moist; massive with evident bedding planes; slightly hard, friable; few fine prominent reddish brown (5YR 4/4) masses of oxidized iron in the soil matrix; neutral (pH 6.8 in 1:1 water); abrupt smooth boundary. (10 to 25 centimeters thick)

Cg2—48 to 81 centimeters; stratified grayish brown (10YR 5/2) and light brownish gray (10YR 6/2) silt loam, very dark grayish brown (10YR 3/2) and dark grayish brown (10YR 4/2) moist; massive with evident bedding planes; slightly hard, friable; few fine prominent reddish brown (5YR 4/4) masses of oxidized iron in the soil matrix; neutral (pH 6.9 in 1:1 water); abrupt smooth boundary. (25 to 51 centimeters thick)

Agb1—81 to 112 centimeters; dark gray (10YR 4/1) silt loam, very dark gray (10YR 3/1) moist; massive; hard, friable; neutral (pH 6.8 in 1:1 water); gradual wavy boundary. (0 to 38 centimeters thick)

Agb2—112 to 153 centimeters; dark gray (N 4/) silt loam, black (N 2.5/) moist; massive; hard, friable; neutral (pH 6.8 in 1:1 water).

An idea for checking the encoding of the text files. I don't know whether the encoding will change, or whether the download process modifies it.

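# list the OSD text files in a local copy of OSDRegistry
f <- list.files(path = "e:/working_copies/OSDRegistry/OSD/D/", full.names = TRUE)

# guess the encoding of each file from its first 1000 lines
x <- lapply(f, function(i) {
  # one row per candidate encoding, possibly zero rows
  .e <- readr::guess_encoding(i, n_max = 1000)
  if (nrow(.e) < 1) return(NULL)
  # strip the file extension; fixed = TRUE so that "." is matched literally
  .osd <- gsub('.txt', '', basename(i), fixed = TRUE)
  .res <- data.frame(osd = .osd, encoding = .e$encoding, confidence = .e$confidence)
  
  return(.res)
})

# combine into a single data.frame; NULL elements are dropped
x <- do.call('rbind', x)

# tabulate the candidate encodings across all files
table(x$encoding)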
@brownag
Member

brownag commented Apr 26, 2023

Thanks for spelling this out here. I was aware of the changes to the Part 614 OSD section, but hadn't noticed these minor formatting changes.

Item 1 should be easily fixable right now by adding \\u2014 to the set of allowed separator characters.
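A minimal sketch of what that could look like, assuming a simplified horizon-line pattern. The regular expression and `parse_hz_line()` are illustrative only; the pattern actually used by .extractHzData() may differ.

```r
# Assumed pattern, for illustration only: horizon designation, a separator
# ("--" or em dash, U+2014), then top and bottom depths.
hz_pattern <- "^\\s*(\\S+)\\s*(--|\u2014)\\s*([0-9.]+) to ([0-9.]+)"

# Hypothetical helper: designation, top, and bottom depth for one line,
# or NULL when the line does not look like a horizon description.
parse_hz_line <- function(line) {
  m <- regmatches(line, regexec(hz_pattern, line))[[1]]
  # no match: regmatches() yields a zero-length vector
  if (length(m) == 0) return(NULL)
  # m[1] is the full match, m[3] is the separator
  data.frame(name = m[2], top = as.numeric(m[4]), bottom = as.numeric(m[5]))
}

# old style (doubled hyphen-minus) and new style (em dash) parse the same way
parse_hz_line("Ap--0 to 15 centimeters; grayish brown (10YR 5/2) silt loam")
parse_hz_line("Ap\u20140 to 15 centimeters; grayish brown (10YR 5/2) silt loam")
```

Accepting both separators in one alternation means files in the old and new styles can be handled by the same code path.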

  2. Section headers will now use title case: "TYPICAL PEDON:" → "Typical Pedon:" . This may affect all parsing related to finding section headers, and the downstream use of list element names if those are changed to match.

Seriously? We handle this in some instances (for Typical Pedon specifically). But this very well could break tons of things with little to no benefit.

I want to hold off on any changes to the codebase until we actually see these changes coming in via OSDRegistry. No need to change anything unless it is causing parsing problems.

@brownag
Member

brownag commented Apr 26, 2023

  3. It is not clear if the encoding of the text or HTML files will change, will the new files be Unicode?

Yes, these changes, if implemented, will change the encoding (or inferred encoding) of the files.

Currently the HTML has no declared encoding, but the W3C validator detects it as windows-1252, e.g. https://validator.w3.org/nu/?doc=https%3A%2F%2Fsoilseries.sc.egov.usda.gov%2FOSD_Docs%2Fb%2FBOOMER.html

@brownag
Member

brownag commented Apr 26, 2023

Oops, my mistake: if the encoding is indeed intended to be "windows-1252", then the em dash is included in that character set.
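A quick way to confirm this in R, as a sketch. It assumes the local iconv implementation recognizes the name "CP1252", which is the usual alias for windows-1252.

```r
# the single byte 0x97 is the em dash in windows-1252 (a.k.a. CP1252)
x <- rawToChar(as.raw(0x97))

# convert to UTF-8; assumes iconv() on this platform knows "CP1252"
y <- iconv(x, from = "CP1252", to = "UTF-8")

# TRUE when the converted character is U+2014 (em dash)
utf8ToInt(y) == 0x2014

# the same character is stored as several bytes in UTF-8
charToRaw(y)
```

So the em dash alone would not force a change of encoding, although tools that guess the encoding may still report something different once it appears.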

brownag added a commit that referenced this issue Apr 26, 2023
Anticipated changes in OSD Style for #64 item 1
@brownag
Member

brownag commented Apr 17, 2024

Closing this issue as there have been no significant systematic changes to OSD formatting. We can address specific problems if/when they trickle in.

@brownag brownag closed this as completed Apr 17, 2024