Tracking of media associated with pdftohtml #1

brownag · 2020-12-02T02:39:57Z

Need to develop a routine for identifying the relative position of image file references in the HTML files, so that they can be tagged to the "appropriate" NSSH part, subpart, section, clause, etc., as gleaned from pdftotext and JSON-ified.
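
A minimal sketch of what such a routine could look like, using base R only. The function name `find_image_refs()` and the `heading_pattern` argument are hypothetical, and the heading regex in the usage example is a placeholder, not the actual structure of the pdftohtml output. It records the line each `<img>` reference sits on, its relative position in the file, and the nearest preceding line that matches the heading pattern.

```r
# Hypothetical helper: locate <img> references in pdftohtml output.
# `html` is a character vector of lines, e.g. from readLines().
# `heading_pattern` is a regex (an assumption, supplied by the caller)
# that identifies the lines treated as section headings.
find_image_refs <- function(html, heading_pattern) {
  img_pattern <- "<img[^>]*src=\"([^\"]+)\""

  # All <img ... src="..."> matches, line by line
  hits <- regmatches(
    html,
    gregexpr(img_pattern, html, ignore.case = TRUE, perl = TRUE)
  )
  line_no <- rep(seq_along(html), lengths(hits))
  src <- sub(img_pattern, "\\1", unlist(hits), ignore.case = TRUE, perl = TRUE)

  # Nearest heading at or before each image line (NA if there is none)
  heading_lines <- grep(heading_pattern, html)
  idx <- findInterval(line_no, heading_lines)
  heading <- ifelse(idx > 0, html[heading_lines[pmax(idx, 1)]], NA_character_)

  data.frame(
    line = line_no,
    src = src,
    rel_pos = line_no / length(html),  # relative position within the file
    heading = heading,
    stringsAsFactors = FALSE
  )
}

# Usage with a toy document; the heading regex is a placeholder
html_example <- c(
  "<p>618.1 Policy</p>",
  "<p>text</p>",
  "<img src=\"618B-10_1.png\"/>",
  "<p>618.2 Something</p>",
  "<IMG width=\"10\" src=\"618B-11_1.png\"><img src=\"618B-11_2.png\">"
)
refs <- find_image_refs(html_example, "^<p>[0-9]{3}\\.[0-9]+")
```

The `heading` column could then be matched against the part/section keys in the JSON derived from pdftotext, which is the step that still needs a real design.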

The files listed below are currently produced by pdftohtml and are untracked.

Any reason not to track them as-is? Or should there be a pipeline specifically for this? Should individual NSSH parsers handle it, or is there a generic solution?

> list.files("inst/extdata","png|jpg",recursive = TRUE)
 [1] "NSSH/600/600B-4_1.png"  "NSSH/614/614B-1_1.png"  "NSSH/618/618B-10_1.png" "NSSH/618/618B-11_1.png" "NSSH/618/618B-11_2.png"
 [6] "NSSH/618/618B-11_3.png" "NSSH/618/618B-15_1.jpg" "NSSH/618/618B-8_1.png"  "NSSH/627/627B-13_1.jpg" "NSSH/627/627B-14_1.jpg"
[11] "NSSH/627/627B-4_1.png"  "NSSH/644/644B-5_1.jpg"  "NSSH/647/647B-2_1.jpg"
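
The file names above already carry some position information. A rough sketch of pulling it out, assuming the names follow a `<document>-<page>_<image index>.<ext>` pattern and that the parent directory is the NSSH part number (both are assumptions read off the listing, and `parse_media_path()` is a hypothetical name):

```r
# Hypothetical helper: split a media path into its components.
# Assumes names like "618B-11_2.png" mean document "618B", page 11,
# second image on that page; paths that do not match yield NA.
parse_media_path <- function(path) {
  pattern <- "^(.+)-([0-9]+)_([0-9]+)\\.(png|jpg)$"
  file <- basename(path)
  m <- regmatches(file, regexec(pattern, file))

  # Extract capture group i, or NA where the pattern did not match
  grab <- function(i) {
    vapply(m, function(x) if (length(x)) x[i] else NA_character_, character(1))
  }

  data.frame(
    path = path,
    part = basename(dirname(path)),   # e.g. "618"
    document = grab(2),               # e.g. "618B"
    page = as.integer(grab(3)),
    image = as.integer(grab(4)),
    stringsAsFactors = FALSE
  )
}

media <- parse_media_path(c("NSSH/618/618B-11_2.png", "NSSH/627/627B-13_1.jpg"))
```

With a page number per image, tagging could fall back to "whatever section that page belongs to" when the HTML position is ambiguous.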

brownag changed the title ~~Tracking of media associated with txttohtml~~ Tracking of media associated with pdftohtml Dec 2, 2020

brownag self-assigned this Dec 24, 2020

This was referenced Dec 24, 2020

Integrate label-studio for manual curation and annotation #4

Open

Untrack NSSH pdftohtml HTML in inst/extdata #7

Closed

brownag mentioned this issue Jan 18, 2021

Integrate Geomorphic Description System document #9

Open

4 tasks

brownag mentioned this issue Dec 22, 2021

Soil Taxonomy (1999) #42

Open

3 tasks

brownag mentioned this issue Apr 26, 2023

Changes to OSD formatting #64

Closed
