behavioral-data/bdatasets-backend

 
 


bdatasets-backend

This repository contains the infrastructure that provides a live status report for /projects/bdata/datasets by crawling its contents.

Features

Runs checks over the contents of all corpora (sub)directories.

Per-dataset:

  • owner
  • group
  • permissions
  • configurable access restrictions (groups and permissions)
  • dataset structure adherence
  • readme existence
  • readme project description
  • readme documentation (of processed variants)
  • size
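The owner, group, and permission checks listed above could look roughly like the following. This is a minimal sketch, not the code in check.py: the names `owner_name`, `group_name`, and `check_dataset_access`, and the choice of "world-writable" as the forbidden permission bit, are assumptions made for illustration.

```python
import grp
import os
import pwd
import stat


def owner_name(uid):
    """Resolve a uid to a user name, falling back to the number as text."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def group_name(gid):
    """Resolve a gid to a group name, falling back to the number as text."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def check_dataset_access(path, ok_owners, ok_groups, forbidden_bits=stat.S_IWOTH):
    """Return a list of problems found for one dataset directory.

    forbidden_bits is a hypothetical policy knob; the default flags
    directories that are writable by everyone.
    """
    problems = []
    info = os.stat(path)

    owner = owner_name(info.st_uid)
    if owner not in ok_owners:
        problems.append(f"owner {owner!r} is not an allowed owner")

    group = group_name(info.st_gid)
    if group not in ok_groups:
        problems.append(f"group {group!r} is not an allowed group")

    mode = stat.S_IMODE(info.st_mode)
    if mode & forbidden_bits:
        problems.append(f"mode {mode:o} sets forbidden bits {forbidden_bits:o}")

    return problems
```

An empty list means the directory passed all three checks; each string in a non-empty list could be written to the log as one error.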

Overall:

  • can fix permission errors automatically when run with the --fix-perms flag
  • total size checks (above a configurable drive limit)
  • log containing detailed status breakdown and all errors
  • report generation
    • copies readme into browsable index
    • concise summary per dataset (name, readme link, description, size, access, status)
    • donut plot of overall disk usage
  • configured cron usage:
    • self-updates backend and runs daily
    • pushes updated report to frontend
    • emails full error log on failures (configurable verbosity)
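As a rough illustration of the size checks above, the sketch below sums file sizes under each dataset and compares the total against a limit. The function names and the `limit_bytes` parameter are hypothetical, and the sketch measures apparent file sizes rather than allocated disk blocks, so its numbers can differ from what `du` reports.

```python
import os


def directory_size_bytes(root):
    """Sum the apparent sizes of all files below root, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                total += os.lstat(full).st_size
            except OSError:
                # Unreadable or vanished file: skip it rather than abort the crawl.
                continue
    return total


def over_limit(sizes_by_dataset, limit_bytes):
    """Return (exceeded, total) for a mapping of dataset name -> size in bytes."""
    total = sum(sizes_by_dataset.values())
    return total > limit_bytes, total
```

The per-dataset sizes can feed both the summary table and the plot, while the boolean from `over_limit` decides whether the total size check counts as a failure.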

Installation

# Create a fresh virtualenv (I use pyenv, but anything works) with
# Python >= 3.6.5. Then:
pip install -r requirements.txt

Running

Example usage:

# Also prints the log to stderr if any checks failed. (This is so that cron
# automatically emails you if anything fails, but not if everything passes.)
python check.py \
    --directory /projects/bdata/datasets \
    --out-file ~/bdatasets-repo/README.md \
    --log-file ~/bdatasets-repo/BUILD.txt \
    --doc-dir ~/bdatasets-repo/doc \
    --plot-dest ~/bdatasets-repo/disk-usage.svg

# The script can attempt to fix permission errors it finds. This isn't normally
# run in the cron job (though it could be). It can be enabled with a flag:
python check.py --fix-perms

# To run on the test directories (sorry Nelson, no automated tests yet), I use
# this, which discards the output markdown so that only the log is shown.
python check.py \
    --directory test/test-nlp-corpora/ \
    --ok-owners max \
    --group-config test/test-groups.json \
    --out-file /dev/null
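The "log to stderr only on failure" behavior from the first example relies on cron mailing any output a job produces. A minimal sketch of that pattern follows; `emit_log` and its parameters are hypothetical names, not functions from check.py.

```python
import sys


def emit_log(log_lines, n_failed, log_file=None, stderr=sys.stderr):
    """Write the log to log_file if given; echo it to stderr only on failure.

    Returns True when the log was echoed to stderr.
    """
    text = "\n".join(log_lines) + "\n"

    if log_file is not None:
        with open(log_file, "w", encoding="utf-8") as f:
            f.write(text)

    failed = n_failed > 0
    if failed:
        # Any output here is what makes cron send mail; staying silent
        # on success means no email on a clean run.
        stderr.write(text)
    return failed
```

With the report written to a file through --out-file, a passing run produces no output at all, so cron has nothing to mail.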

Full options:

python check.py --help
usage: check.py [-h] [--directory DIRECTORY] [--ok-owners OK_OWNERS]
                [--group-config GROUP_CONFIG] [--fix-perms] [--verbose]
                [--out-file OUT_FILE] [--log-file LOG_FILE]
                [--doc-dir DOC_DIR] [--plot-dest PLOT_DEST]

Tool to check bdatasets directory and output documentation.

optional arguments:
  -h, --help            show this help message and exit
  --directory DIRECTORY
                        path to top-level dataset directory (default:
                        /projects/bdata/bdatasets)
  --ok-owners OK_OWNERS
                        comma-separated list of allowed owners (default:
                        mbforbes)
  --group-config GROUP_CONFIG
                        json file containing group information (default:
                        groups.json)
  --fix-perms           whether this should attempt to fix permission errors
                        it finds (default: False)
  --verbose             whether to log error messages for every problematic
                        file (default: False)
  --out-file OUT_FILE   path to write output file. If not provided, writes to
                        stdout. (default: None)
  --log-file LOG_FILE   if provided, writes log to this path. If not 100% of
                        checks pass, always writes log to stderr. (default:
                        None)
  --doc-dir DOC_DIR     if provided, DESTROYS this dir if it exists, creates
                        it fresh, and then writes directories and readmes for
                        all corpora under it. (default: None)
  --plot-dest PLOT_DEST
                        if provided, writes a donut plot of corpora disk space
                        usage to this location. (default: None)
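Because --doc-dir deletes its target before writing, it is worth being clear about what "DESTROYS this dir" implies. The sketch below shows one way such a rebuild could work; `recreate_doc_dir`, the `readmes` mapping, and the `README.md` file name are assumptions for illustration, not the script's actual layout.

```python
import os
import shutil


def recreate_doc_dir(doc_dir, readmes):
    """Delete doc_dir if present, recreate it, and write one readme per dataset.

    readmes maps a dataset name to its readme text. Returns the paths written.
    """
    if os.path.isdir(doc_dir):
        # Everything under doc_dir is removed, including files that
        # were not generated by a previous run.
        shutil.rmtree(doc_dir)
    os.makedirs(doc_dir)

    written = []
    for name, text in sorted(readmes.items()):
        dest_dir = os.path.join(doc_dir, name)
        os.makedirs(dest_dir)
        dest = os.path.join(dest_dir, "README.md")
        with open(dest, "w", encoding="utf-8") as f:
            f.write(text)
        written.append(dest)
    return written
```

Pointing --doc-dir at a directory dedicated to generated output, as in the `~/bdatasets-repo/doc` example, avoids losing hand-written files.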

About

Staging grounds for bdatasets scripts and docs
