Metadata extraction/storage/update/basic query #49

Open
yarikoptic opened this issue Sep 3, 2021 · 0 comments

yarikoptic commented Sep 3, 2021

This is a "planning" issue on adding metadata extraction and "storage" to the registry, to formalize what we are to develop in the immediate next steps (setting aside details of "metadata usage" such as search, but keeping them in mind).

  • Metadata extraction per dataset should be done using the WiP https://github.com/datalad/datalad-metalad, which replaces the metadata interfaces/storage approach currently "built in" to DataLad
    • "autodeducing" of extractors based on the data found in a dataset should be added (datasets will likely not be configured for a specific extractor)
    • we will probably care to extract only for repositories which are DataLad datasets (i.e., have a non-None .id)
  • Since we cannot afford to fetch all data, and in many if not most cases it would be wasteful anyway, streamed access to file content should be implemented; see the idea in "datalad/git-annex aware streaming/random access access to files content", datalad#4003 (comment)
  • Introduce tracking of subdatasets to the registry
  • Decide on how we would store extracted JSON metadata records in the DB (or in some other "connected storage")
    • I do not think we should rely on having them purely in "instance/cache"d instances of those repositories using metalad
    • Keep in mind future needs for "simple search" or "syncing" (one way -- just export AFAIK) with an external specialized DB (e.g. graph DB)
    • Decide on "retention" of metadata across multiple versions of datasets (since the registry keeps updating them to reflect their current status)
      • most likely we would eventually add the ability to store history (commit hexshas) for the datasets the registry sees
      • we might want to implement some kind of GC and policy on what metadata versions to keep
  • Provide API endpoints for
    • GETing information on what metadata is available for a dataset(/version)
    • GETing specific metadata record(s)
    • a very basic (grep-like) "search" endpoint
    • "administrative" interfaces (or endpoints) to
      • trigger re-extraction of metadata for a specific dataset(/version)
      • add extractor(s) to be used for a specific dataset
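
To make the storage, retention, and grep-like search items above more concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the `metadata_record` table, its columns, and the helper names (`open_store`, `store_record`, `available`, `prune_versions`, `grep_records`) are hypothetical and not part of the registry or of metalad. It keys each JSON record by dataset id, commit hexsha, and extractor name, and treats "keep the N most recently stored versions" as one possible GC policy.

```python
import json
import sqlite3

# Hypothetical schema: one row per (dataset, version, extractor).
SCHEMA = """
CREATE TABLE IF NOT EXISTS metadata_record (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dataset_id TEXT NOT NULL,   -- DataLad dataset id
    version TEXT NOT NULL,      -- commit hexsha the record was extracted from
    extractor TEXT NOT NULL,    -- name of the extractor that produced it
    record TEXT NOT NULL,       -- extracted JSON, serialized
    UNIQUE (dataset_id, version, extractor)
)
"""


def open_store(path=":memory:"):
    """Open (and if needed initialize) the record store."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_record(conn, dataset_id, version, extractor, record):
    """Store one record; re-extraction overwrites the JSON in place."""
    conn.execute(
        "INSERT INTO metadata_record (dataset_id, version, extractor, record) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT(dataset_id, version, extractor) "
        "DO UPDATE SET record = excluded.record",
        (dataset_id, version, extractor, json.dumps(record, sort_keys=True)),
    )
    conn.commit()


def available(conn, dataset_id, version=None):
    """List (version, extractor) pairs stored for a dataset(/version)."""
    query = "SELECT version, extractor FROM metadata_record WHERE dataset_id = ?"
    args = [dataset_id]
    if version is not None:
        query += " AND version = ?"
        args.append(version)
    return conn.execute(query + " ORDER BY id", args).fetchall()


def prune_versions(conn, dataset_id, keep=2):
    """GC policy sketch: keep only the `keep` most recently stored versions."""
    rows = conn.execute(
        "SELECT version FROM metadata_record WHERE dataset_id = ? "
        "GROUP BY version ORDER BY MIN(id) DESC",
        (dataset_id,),
    ).fetchall()
    stale = [version for (version,) in rows[keep:]]
    for version in stale:
        conn.execute(
            "DELETE FROM metadata_record WHERE dataset_id = ? AND version = ?",
            (dataset_id, version),
        )
    conn.commit()
    return stale


def grep_records(conn, pattern):
    """Case-insensitive substring match over the serialized JSON."""
    needle = pattern.lower()
    rows = conn.execute(
        "SELECT dataset_id, version, extractor, record "
        "FROM metadata_record ORDER BY id"
    )
    return [(ds, ver, ext) for ds, ver, ext, rec in rows if needle in rec.lower()]
```

The "GET what is available" and "GET specific record(s)" endpoints could then be thin wrappers over queries like `available`. Note that the search here scans every row and matches the escaped JSON text, so it is only a placeholder for whatever "simple search" ends up being; the upsert syntax also assumes SQLite 3.24 or newer.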

Attn @jwodder and @datalad/developers: please point out any other high-level components or related aspects I have missed, or share any ideas you have.
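
As a rough illustration of the "autodeducing" extractors item in the list above, one option is to pick extractors from file names alone, which needs no file content and so works on a clone where annexed files have not been fetched. This is only a sketch under stated assumptions: the rule table, the extractor names (`example_core`, `example_bids`, `example_nifti`, `example_dicom`), and the `deduce_extractors` helper are all hypothetical and do not correspond to actual metalad extractor names or API.

```python
import fnmatch
import os

# Hypothetical rules: file-name glob -> extractor name (names are made up).
RULES = [
    ("dataset_description.json", "example_bids"),
    ("*.nii.gz", "example_nifti"),
    ("*.dcm", "example_dicom"),
]

# Hypothetical extractor(s) that would run for every dataset.
DEFAULT_EXTRACTORS = ["example_core"]


def deduce_extractors(dataset_path, rules=RULES, default=DEFAULT_EXTRACTORS):
    """Return extractor names suggested by the file names in a dataset."""
    selected = list(default)
    for _root, dirs, files in os.walk(dataset_path):
        # Do not descend into the repository internals.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            for pattern, extractor in rules:
                if extractor not in selected and fnmatch.fnmatchcase(name, pattern):
                    selected.append(extractor)
    return selected
```

The result could seed the per-dataset extractor list that the "add extractor(s)" administrative endpoint would later adjust. Rules that need to look inside files would instead depend on the streamed access discussed in the list.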
