Update/align with paper #11

Open
wants to merge 124 commits into
base: main
8e06855
Remove previous mirror scripts
nick-j-roberts Jun 15, 2023
e27244c
Delete removed transfer scripts from readme
nick-j-roberts Jun 15, 2023
f857484
Start scaffolding structure of mirror job
nick-j-roberts Jun 15, 2023
fd7da25
Upgrade pip
nick-j-roberts Jun 16, 2023
653707b
Add pyshacl
nick-j-roberts Jun 16, 2023
a4f8058
Reorganize commonly used classes
nick-j-roberts Jun 16, 2023
8ccd345
Finish url verification and zip download functions
nick-j-roberts Jun 16, 2023
67c3a86
Finish url verificiation, download, and geospatial extent for RFCs
nick-j-roberts Jun 16, 2023
633531d
Delete cloud utils
nick-j-roberts Jun 16, 2023
b5b137b
Moved class and create source dataset skolemizer
nick-j-roberts Jun 16, 2023
94d719a
Temporarily change compose statement
nick-j-roberts Jun 19, 2023
daaceb9
Start on handwritten TTL ontology
nick-j-roberts Jun 19, 2023
bb623fe
Make transposition region and watershed region apply only to transpos…
nick-j-roberts Jun 20, 2023
b38544f
Add AORC namespace pointing to v0.9
nick-j-roberts Jun 20, 2023
f0728a3
Update aorc ontology
nick-j-roberts Jun 20, 2023
9a6025c
Create SHACL shapes for AORC ontology
nick-j-roberts Jun 20, 2023
cc83438
Get git and docker hashes and expose them to docker container
nick-j-roberts Jun 20, 2023
c984c0b
Fix reachable git url check
nick-j-roberts Jun 20, 2023
1ff1293
Add docker push before compose
nick-j-roberts Jun 20, 2023
e84d03a
use load_dotenv
nick-j-roberts Jun 20, 2023
cc4dee4
Add graphdb to docker compose
nick-j-roberts Jun 21, 2023
2c3cde2
Add guide on adding CKAN extension within docker environment
nick-j-roberts Jun 26, 2023
97f467c
Ignore graphdb temp data changes
nick-j-roberts Jul 1, 2023
07d39e1
Get rid of extension guide
nick-j-roberts Jul 1, 2023
b0311cd
Start on integration of CKAN extensions to mirror job
nick-j-roberts Jul 1, 2023
41ebcee
Progressed on mirror upload function
nick-j-roberts Jul 2, 2023
bfc078f
Add docker ignore file
nick-j-roberts Jul 2, 2023
d8d3072
Make compose which uses graphdb a dev version
nick-j-roberts Jul 2, 2023
45bd276
Revise docker compose statement
nick-j-roberts Jul 2, 2023
74deb31
Add env vars
nick-j-roberts Jul 2, 2023
85385a1
Reduce uncessesary copies
nick-j-roberts Jul 2, 2023
2992bbf
Change to safe requests version
nick-j-roberts Jul 2, 2023
a7096d8
Update structure and provenance resources
nick-j-roberts Jul 2, 2023
216d6f7
Update initiation shell script
nick-j-roberts Jul 2, 2023
03b6744
Attempt error fix
nick-j-roberts Jul 2, 2023
f29016c
End if block
nick-j-roberts Jul 2, 2023
460e9df
Include env vars in compose
nick-j-roberts Jul 2, 2023
447f6ff
Get rid of useless push statement
nick-j-roberts Jul 2, 2023
ec9bad5
Copy reqs
nick-j-roberts Jul 2, 2023
501985e
Put in pull statement
nick-j-roberts Jul 2, 2023
2983071
Took volume out of prod compose
nick-j-roberts Jul 2, 2023
a073e3a
Move timedelta translation
nick-j-roberts Jul 2, 2023
6654c77
Fix default arg
nick-j-roberts Jul 2, 2023
8baa31a
Fix namespace for IANA
nick-j-roberts Jul 2, 2023
69e7fd4
Fix return
nick-j-roberts Jul 2, 2023
53d7877
Comment unused graph creator class
nick-j-roberts Jul 2, 2023
a8434d9
Add TODO
nick-j-roberts Jul 2, 2023
ee62d31
Add argument validator to init.sh
nick-j-roberts Jul 2, 2023
bb449e1
Undo mistaken copy
nick-j-roberts Jul 2, 2023
d68f766
Narrow copy
nick-j-roberts Jul 2, 2023
b0a6fa8
Fixed equality comparison
nick-j-roberts Jul 2, 2023
a8abc15
Change to relative imports, fix upload
nick-j-roberts Jul 3, 2023
88e01c4
Simplify geoms to allow for CKAN upload
nick-j-roberts Jul 3, 2023
bb381b4
Undid relative imports
nick-j-roberts Jul 3, 2023
7ae1adc
Added geom simplification to allow CKAN upload
nick-j-roberts Jul 3, 2023
def8126
Replace simplifier with convex hull
nick-j-roberts Jul 3, 2023
f174b90
Change JSON-LD serialization to TTL
nick-j-roberts Jul 3, 2023
121a9e2
Get rid of sys.argv extensions
nick-j-roberts Jul 3, 2023
72e5537
Do modification of upload params in main mirror.py
nick-j-roberts Jul 3, 2023
4458b33
Rename preserved 'id'
nick-j-roberts Jul 3, 2023
f866cc6
Take out left over formatting in upload
nick-j-roberts Jul 3, 2023
e21c777
Bind namespaces
nick-j-roberts Jul 3, 2023
33bf793
Add resource creation to dataset upload
nick-j-roberts Jul 3, 2023
bfcc9e7
Add resources to upload
nick-j-roberts Jul 3, 2023
e8eaa8c
Re-enable upload of mirror
nick-j-roberts Jul 3, 2023
aed4377
Reorganize general utils
nick-j-roberts Jul 3, 2023
26c699b
Started structuring composite refactor
nick-j-roberts Jul 3, 2023
a47206f
Take metadata retrieval
nick-j-roberts Jul 5, 2023
973d4b5
Create pseudocode
nick-j-roberts Jul 5, 2023
d502045
First shot at composite job
nick-j-roberts Jul 6, 2023
ea952b8
Delete done TODO
nick-j-roberts Jul 6, 2023
d359e9a
Fix exit method
nick-j-roberts Jul 6, 2023
84a15f6
Fix temporal property path
nick-j-roberts Jul 6, 2023
f0d385f
Fix source url attribution
nick-j-roberts Jul 6, 2023
f683787
Reenable json writing
nick-j-roberts Jul 6, 2023
e194a6a
Fix import, take dev limit off urls
nick-j-roberts Jul 6, 2023
4b43252
Fix bucket parameter
nick-j-roberts Jul 6, 2023
1555138
Correct dataset id and description assignments
nick-j-roberts Jul 6, 2023
eca0af1
Align ids with hourly data
nick-j-roberts Jul 6, 2023
5482650
Unnest dataset creation
nick-j-roberts Jul 7, 2023
94e55a5
Fix key misname in upload
nick-j-roberts Jul 7, 2023
4c4f0d1
Make URIs lowercase
nick-j-roberts Jul 7, 2023
670c358
Move demo to full
nick-j-roberts Jul 7, 2023
61d8497
Fix json writing
nick-j-roberts Jul 7, 2023
cd7981a
Fix sqlite syntax
nick-j-roberts Jul 7, 2023
d7e2c5c
Add required compress type param
nick-j-roberts Jul 7, 2023
5da0551
Ensure all source mirrors are recorded
nick-j-roberts Jul 7, 2023
d11a2f8
Undo JSON writes for mirror uploads
nick-j-roberts Jul 7, 2023
422fdf3
Get rid of old methods for composite RDF creation
nick-j-roberts Jul 7, 2023
497bf14
Update psuedocode for transposition job
nick-j-roberts Jul 7, 2023
f1dd99e
Reorganize composite imports
nick-j-roberts Jul 7, 2023
b62fd8d
Add ms index and dss URI
nick-j-roberts Jul 7, 2023
373d240
Remove unused import
nick-j-roberts Jul 7, 2023
0792859
First draft of transposition metadata creator
nick-j-roberts Jul 7, 2023
e1572ef
First draft of transposition metadata creator
nick-j-roberts Jul 7, 2023
bca941d
Edit pseudocode
nick-j-roberts Jul 7, 2023
23d80a6
Fix broken import, convert list to dict
nick-j-roberts Jul 7, 2023
c027c14
Fix geojson streaming, convert to convex hull
nick-j-roberts Jul 7, 2023
3167afe
Remove unneeded imports
nick-j-roberts Jul 7, 2023
0519a37
Add meilisearch and dependencies
nick-j-roberts Jul 7, 2023
772c193
Remove irrelevant ckan scripts
nick-j-roberts Jul 7, 2023
009a4b5
Remove outdated DCAT-US extension work
nick-j-roberts Jul 7, 2023
3cf4b9c
Remove unused utils
nick-j-roberts Jul 7, 2023
e297240
Add # to aorc namespace
nick-j-roberts Jul 7, 2023
9a37f31
Fix syntax errors
nick-j-roberts Jul 7, 2023
096189a
Regenerate ontology HTML
nick-j-roberts Jul 7, 2023
fa8bd4e
Get rid of logs and mirrors directories
nick-j-roberts Jul 7, 2023
e83f4a5
Get rid of outdates rdf writing utils
nick-j-roberts Jul 7, 2023
8a76b31
modify rdf2py
nick-j-roberts Jul 7, 2023
bcfbe6e
Delete pseudocode
nick-j-roberts Jul 7, 2023
62ea943
Update readmes
nick-j-roberts Jul 7, 2023
c261941
Move bucket to .env
nick-j-roberts Jul 7, 2023
93c7761
Add .env example
nick-j-roberts Jul 7, 2023
f25440b
Update usage in readme
nick-j-roberts Jul 7, 2023
ef575f6
Remove broken dependency
nick-j-roberts Jul 7, 2023
044e407
Fix filter in sparql
nick-j-roberts Jul 7, 2023
1e9a864
Remove test prefix
nick-j-roberts Jul 7, 2023
516e13e
Get rid of unused class
nick-j-roberts Jul 7, 2023
bd03917
Move ckan AORC plugin to blobfish
nick-j-roberts Jul 11, 2023
254b1be
Add logging to mirror job
nick-j-roberts Jul 11, 2023
b024d5c
Add logging to composite and transposition jobs
nick-j-roberts Jul 11, 2023
2bd811e
Add script summary headers
nick-j-roberts Jul 11, 2023
94e6e90
Finish adding docstring, fix typo
nick-j-roberts Jul 11, 2023
824abbb
Update aorc.ttl
nick-j-roberts Aug 9, 2023
Update readmes
nick-j-roberts committed Jul 7, 2023
commit 62ea943c2646887162343f2d61ac16d3ac761359
24 changes: 15 additions & 9 deletions README.md
@@ -1,15 +1,21 @@
# blobfish

Repo used in experimentation with Resource Description Framework metadata creation, automation, and querying
## Summary
This repository is a proof of concept to evaluate the usability of Resource Description Framework (RDF) metadata in documenting extensible data pipelines. The data pipeline documented by this repository is the mirroring, transformation, and subsequent use of NOAA AORC gridded precipitation data for stochastic storm transposition (SST) modeling.

## AORC
Resources related to creating an ontology used in documenting a data pipeline for creating hourly composites of AORC precipitation data
## Pipeline Description
This pipeline can be broken into three stages:

## PYRDF
Resources used to translate python files to TTL files
### Mirror
The first stage takes the AORC data from an FTP server operated by NOAA and puts it on s3. This allows for easier access and fewer potential network interruptions. The data is zipped into packages covering one month each and is partitioned by river forecast center (RFC).
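
As a rough illustration of the "one zip per month, per RFC" partitioning described above, the sketch below enumerates one object key for each RFC and month in a date range. The RFC aliases, the `mirrors/aorc` prefix, and the key layout are all placeholders chosen for the example; they are not taken from the repository.

```python
from datetime import date
from typing import Iterator

# Placeholder RFC aliases; the real list is assumed to live in the repo's constants.
RFC_ALIASES = ["LM", "MA", "NE"]


def month_starts(start: date, end: date) -> Iterator[date]:
    """Yield the first day of each month from start to end, inclusive."""
    current = date(start.year, start.month, 1)
    while current <= end:
        yield current
        # Roll over to January of the next year after December.
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)


def mirror_keys(start: date, end: date, prefix: str = "mirrors/aorc") -> list[str]:
    """Build one hypothetical s3 key per RFC per month (layout is an assumption)."""
    return [
        f"{prefix}/{rfc}/{month:%Y%m}.zip"
        for rfc in RFC_ALIASES
        for month in month_starts(start, end)
    ]
```

Calling `mirror_keys(date(2022, 11, 15), date(2023, 1, 31))` yields three monthly keys for each of the three placeholder RFCs.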

## CKAN
Resources used to query TTL RDF files that have been uploaded to the Dewberry CKAN instance
### Composite
The second stage unzips the monthly data into series of hourly netCDF data, aligns the datasets temporally, and merges the data into a contiguous coverage of the United States, rather than keeping it separated by RFC region.
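
The temporal alignment step can be pictured as keeping only the periods that every RFC covers, since a contiguous composite needs data from all regions. The sketch below does this at month granularity over simple records; the record shape (`rfc` and `month` keys) is hypothetical, and the actual pipeline works on hourly data, so treat this as a simplified picture rather than the repository's logic.

```python
from collections import defaultdict


def shared_months(mirrors: list[dict]) -> list[str]:
    """Return the months (YYYYMM) covered by every RFC present in the input."""
    rfcs_by_month = defaultdict(set)
    all_rfcs = set()
    for record in mirrors:
        rfcs_by_month[record["month"]].add(record["rfc"])
        all_rfcs.add(record["rfc"])
    # A month is usable only when no RFC is missing from it.
    return sorted(
        month for month, rfcs in rfcs_by_month.items() if rfcs == all_rfcs
    )
```

Months that are missing even one RFC are dropped, so a composite is only attempted where full coverage exists.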

## SST
Resources demonstrating potential applications of extensions on the DCAT-US vocabulary proposed as a standard for federal data documentation
### Transposition
The third stage uses the data now that it has been transformed into a more convenient format. Our use case is SST hydrological modeling with the precipitation data. This repo does not perform that data processing; it consumes the metadata produced during the process and converts it to a format compatible with RDF.

## RDF Implementation
With the exception of the source datasets, the metadata created by this repo is not RDF, but rather plain JSON (not JSON-LD, which is an RDF format). This is because this repo works in tandem with extensions built on the CKAN API specifically for parsing the uploaded metadata and serializing it into an RDF format.

These extensions can be found [here](https://example.org)
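
To make the distinction between plain JSON and JSON-LD concrete, the sketch below takes a plain JSON record and attaches an `@context` that maps each key to a vocabulary IRI, which is what turns it into RDF-interpretable JSON-LD. The record fields and the choice of Dublin Core terms are assumptions for illustration only; the mapping actually performed by the CKAN extensions is not shown in this README.

```python
import json

# Plain JSON payload: the keys are just strings with no shared meaning.
plain = {
    "title": "AORC mirror, 2022-01",
    "source_url": "https://example.org/source.zip",
}


def to_json_ld(record: dict) -> dict:
    """Attach a hypothetical @context mapping each key to a vocabulary IRI."""
    context = {
        "title": "http://purl.org/dc/terms/title",
        # "@type": "@id" marks the value as an IRI rather than a string literal.
        "source_url": {"@id": "http://purl.org/dc/terms/source", "@type": "@id"},
    }
    return {"@context": context, **record}


if __name__ == "__main__":
    print(json.dumps(to_json_ld(plain), indent=2))
```

The data values are unchanged; only the added context lets an RDF processor interpret them as triples.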
6 changes: 6 additions & 0 deletions blobfish/README.md
@@ -0,0 +1,6 @@
## blobfish

Repo used in experimentation with Resource Description Framework metadata creation, automation, and querying

### /aorc
Holds the scripts utilized in the creation of metadata POST requests which interact with CKAN to create DCAT-compliant dataset instances and catalogs
51 changes: 21 additions & 30 deletions blobfish/aorc/README.md
Original file line number Diff line number Diff line change
@@ -1,40 +1,31 @@
## AORC
Holds the main scripts, along with the utility scripts and classes they use, for creating metadata and uploading it to CKAN

A collection of ontologies and scripts for generating mirrors for use in data pipelines to support the development of geospatial datasets used in risk and resilience studies.
### /classes
Holds classes used by individual processes as well as classes shared between processes in the AORC pipeline

### /composite_utils
Holds utility scripts used by the composite task of the AORC pipeline

---
### /general_utils
Holds general utility scripts used by multiple processes of the AORC pipeline

#### Ontologies:
### /mirror_utils
Holds utility scripts used by the mirror task of the AORC pipeline

[AORC](http://htmlpreview.github.io/?https://github.com/Dewberry/blobfish/blob/aorc/semantics/html/aorc/index.html): Analysis of Record for Calibration precipitation and temperature datasets.
### /transposition_utils
Holds utility scripts used by the transposition metadata collection task of the AORC pipeline

---
### composite.py
Main script for the composite creation task
Responsible for querying the available mirror datasets from CKAN, aligning the data based on shared temporal coverage, and converting the monthly zipped data to hourly zarr format data, as well as collecting and uploading relevant metadata from this process

#### Scripts:
### const.py
Holds constants used in the AORC pipeline, including URIs for data formats, RFC info, data portal URLs, etc.

[AORC](./blobfish/aorc/):
* [composite](./blobfish/aorc/composite.py) - Script for creating CONUS-level composite gridded datasets composed of mirrored AORC data which utilizes RDF metadata created during the mirroring process
* [const](./blobfish/aorc/const.py) - Constants used in AORC processing or parsing
* [parse_composite](./blobfish/aorc/parse_composite.py) - Script for parsing metadata from s3 objects of created composite gridded dataset to add onto existing RDF metadata created during the mirroring process in order to document the relationship between the mirrored datasets, the composited datasets, and the compositing process
### mirror.py
Main script for the mirror creation task
Responsible for verifying the total available data from NOAA, its asynchronous acquisition, and its upload to s3, as well as collecting and uploading relevant metadata not only for the source data but also the mirror datasets created in the process
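
One common way to structure asynchronous acquisition is to run downloads concurrently while capping how many are in flight at once. The sketch below shows that pattern with `asyncio` and a stand-in `fetch` coroutine that performs no network access; the function names and the concurrency limit are hypothetical and are not taken from `mirror.py`.

```python
import asyncio


async def fetch(url: str) -> bytes:
    """Stand-in for a real download; returns fake bytes after yielding control."""
    await asyncio.sleep(0)
    return url.encode()


async def fetch_all(urls: list[str], limit: int = 4) -> dict[str, bytes]:
    """Fetch every URL concurrently, with at most `limit` downloads in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> tuple[str, bytes]:
        # The semaphore blocks here once `limit` downloads are already running.
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(pairs)
```

Capping concurrency keeps the job from opening an unbounded number of connections to the source server.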

##### Setup

To run the scripts in this repo, you should have a .env file in the same directory as this repo which has the following keys:

```
AWS_ACCESS_KEY_ID=access_id_here
AWS_SECRET_ACCESS_KEY=access_key_here
AWS_DEFAULT_REGION=aws_region_here
TAG=docker_image_tag
HASH=docker_image_hash
```

The workflow during development was to launch a docker container using the specified tag and hash from the docker hub image https://hub.docker.com/layers/njroberts/blobfish-python/ and use the container to run the scripts in the following sequence in order to both complete the composite process and create RDF TTL files documenting the metadata for the jobs:

```
python -m blobfish.aorc.transfer
python -m blobfish.aorc.parse_transfer
python -m blobfish.aorc.composite
python -m blobfish.aorc.parse_composite
```
### transposition_meta.py
Main script for collecting, parsing, and submitting metadata created during stochastic storm transposition models to CKAN for serialization as RDF
12 changes: 12 additions & 0 deletions semantics/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
## semantics

Holds AORC ontology and SHACL validation materials used to describe and validate RDF metadata created to document the AORC pipeline

### /html
Holds the HTML copy of the current AORC ontology

### /rdf
Holds the turtle (.ttl) copy of the current AORC ontology

### /shacl
Holds the validation rules for the AORC datasets created