explore benefits of azure data lake #112

stevehadd · 2022-11-23T15:35:55Z

Discussions at the prism sprint planning suggest that Azure Data Lake is what Microsoft recommends for data storage feeding into Azure ML. We currently use Blob storage, which is apparently less performant. It should be easy to:

copy some data into a data lake
create a datastore to point to the data lake (or the relevant subset of its contents)
create a dataset from the datastore (this should be exactly the same as now, since the datastore abstracts away where the data is actually coming from)
create a notebook to compare data loading performance for the two options
STRETCH: it might be interesting to try using Azure Data Lake Analytics (ADLA) (equivalent to AWS Athena) to query our tabular data files, and to compare getting data directly through ADLA with getting it from an AML dataset. It might also be interesting to think about a lazy-loading structure backed by ADLA, e.g. a zarr.
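
For the comparison notebook, something like the following could be a starting point. It is only a sketch: `time_loader` and `compare_loaders` are hypothetical helper names, not part of any Azure library, and the loaders here are stand-ins so the code runs without a workspace. In the real notebook each loader would wrap whatever call fetches the dataset, as suggested in the comments (which assume the v1 `azureml-core` SDK and made-up dataset names).

```python
import statistics
import time
from typing import Callable, Dict, List


def time_loader(load: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call `load` `repeats` times and return the wall-clock seconds of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        load()
        durations.append(time.perf_counter() - start)
    return durations


def compare_loaders(
    loaders: Dict[str, Callable[[], object]], repeats: int = 3
) -> Dict[str, Dict[str, float]]:
    """Time each named loader and summarise it as min and median seconds."""
    summary = {}
    for name, load in loaders.items():
        durations = time_loader(load, repeats)
        summary[name] = {
            "min_s": min(durations),
            "median_s": statistics.median(durations),
        }
    return summary


if __name__ == "__main__":
    # Stand-in loaders so the sketch runs anywhere. In the notebook these
    # might instead be (ASSUMPTION: v1 SDK, hypothetical dataset names):
    #   lambda: Dataset.get_by_name(ws, "blob_dataset").to_pandas_dataframe()
    #   lambda: Dataset.get_by_name(ws, "datalake_dataset").to_pandas_dataframe()
    loaders = {
        "blob (stand-in)": lambda: sum(range(200_000)),
        "data lake (stand-in)": lambda: sum(range(100_000)),
    }
    for name, stats in compare_loaders(loaders, repeats=5).items():
        print(f"{name}: min={stats['min_s']:.4f}s median={stats['median_s']:.4f}s")
```

Reporting the minimum alongside the median helps separate one-off network or caching effects from the typical load time, which matters if the first read of a dataset is cached locally.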

stevehadd self-assigned this Nov 23, 2022
