explore benefits of azure data lake #112

Open
stevehadd opened this issue Nov 23, 2022 · 0 comments

Discussions at the prism sprint planning suggest that Azure Data Lake is what Microsoft recommends for data storage feeding into Azure ML. Currently we're using Blob storage, which is apparently less performant. It should be easy to:

  • copy some data into a data lake
  • create a data store to point to the data lake (or relevant subset of contents)
  • create a dataset from the datastore (this should be exactly the same as for Blob storage, since the datastore abstracts away where the data actually comes from)
  • create a notebook to compare data loading performance for the two options
  • STRETCH: it might be interesting to try using Azure Data Lake Analytics (ADLA), equivalent to AWS Athena, to query our tabular data files, and to compare getting data directly through ADLA with getting it from an AML dataset. It might also be interesting to think about a lazy-loading structure backed by ADLA, e.g. zarr
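
For the comparison notebook, a minimal timing harness might look something like the sketch below. It is only an illustration under assumptions: `time_loader`, `compare_loaders` and the two placeholder loaders are hypothetical names, not part of any existing code in this project. In practice each loader would wrap whatever call actually pulls the data (for example `to_pandas_dataframe()` on the Blob-backed and Data Lake-backed datasets).

```python
import statistics
import time
from typing import Callable, Dict, List


def time_loader(load: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call `load` `repeats` times and return the wall-clock seconds of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        load()
        durations.append(time.perf_counter() - start)
    return durations


def compare_loaders(
    loaders: Dict[str, Callable[[], object]], repeats: int = 3
) -> Dict[str, Dict[str, float]]:
    """Time each named loader and summarise it by median and minimum duration."""
    summary = {}
    for name, load in loaders.items():
        durations = time_loader(load, repeats)
        summary[name] = {
            "median_s": statistics.median(durations),
            "min_s": min(durations),
        }
    return summary


if __name__ == "__main__":
    # Placeholder loaders (assumptions): swap these for functions that load
    # the same data from the Blob-backed and the Data Lake-backed dataset.
    def load_from_blob() -> int:
        return sum(range(100_000))

    def load_from_data_lake() -> int:
        return sum(range(100_000))

    results = compare_loaders(
        {"blob": load_from_blob, "data_lake": load_from_data_lake},
        repeats=5,
    )
    for name, stats in results.items():
        print(f"{name}: median {stats['median_s']:.4f}s, min {stats['min_s']:.4f}s")
```

Reporting both the median and the minimum is a deliberate choice: the first load from either store may be slower or faster than later ones if anything is cached locally, so a single timing is probably not a fair comparison.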
@stevehadd stevehadd self-assigned this Nov 23, 2022