POC of Version Store with pluggable backend #586
base: master
Conversation
First implementation in S3
@pablojim awesome - I'll take a look this week!
General implementation notes:
Random thoughts/possibilities for improvements:
arctic/pluggable/_parquet_store.py

    segment_keys = version['segment_keys']
    assert len(segment_keys) == 1, "should only be one segment for parquet"
    # TODO this is S3 functionality bleeding out of the backing store.
    # Currently reading a Pandas dataframe from a parquet bytes array fails as it only takes a file path.
    return sorted(dirs, key=mtime)

    def delete_symbol(self, library_name, symbol):
        """Soft deletes a symbol - no data is removed, snapshots still work.
Snapshots can be done with hard links, which would make deletions safe.
Quite a nice bit of work - any comments on performance of the implementations?
@jamesblackburn There would also be large improvements from being able to load partial frames, e.g. only loading selected columns and row groups. This may help cases such as #609. Still some work and implementation decisions to do though.
    " Does it exist and is versioning enabled?".format(bucket_name))


    class S3KeyValueStore(object):
Would it make sense to have an abstract class KeyValueStore which has the methods for the KV API, and then have e.g. S3KeyValueStore as one concrete implementation, maybe in a separate file?
So if e.g. someone wants to implement a KV store on different backing storage, they could simply extend the KeyValueStore class rather than the S3-specific one?
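The suggested split could look something like this. The method names are illustrative, inferred from the discussion rather than copied from the PR, and the in-memory class stands in for any non-S3 backend:

```python
from abc import ABC, abstractmethod


class KeyValueStore(ABC):
    """Backend-agnostic KV contract; concrete stores implement these."""

    @abstractmethod
    def write_segment(self, library_name, symbol, segment_data, version_id):
        """Persist one segment and return its storage key."""

    @abstractmethod
    def read_segment(self, library_name, segment_key):
        """Return the raw bytes stored under segment_key."""


class InMemoryKeyValueStore(KeyValueStore):
    """Example second backend alongside S3KeyValueStore: a new storage
    layer plugs in by extending KeyValueStore, not the S3-specific class."""

    def __init__(self):
        self._data = {}

    def write_segment(self, library_name, symbol, segment_data, version_id):
        key = "{}/{}/{}".format(library_name, symbol, version_id)
        self._data[key] = segment_data
        return key

    def read_segment(self, library_name, segment_key):
        return self._data[segment_key]
```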
    version['segment_count'] = len(segment_keys)  # on appends this value is incorrect but is updated later on
    version['append_size'] = 0
    version['append_count'] = 0
    version['segment_keys'] = segment_keys
aha, this is the forward-pointer implementation, where the version keeps all the segment keys, cool
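To spell out what forward pointers buy here: since the version document carries its own `segment_keys`, a read resolves segments directly instead of scanning storage for them. A minimal sketch under that assumption (the reader function and store interface are hypothetical):

```python
def read_symbol_segments(backing_store, library_name, version):
    """Resolve a forward-pointing version document: the version itself
    lists its segment keys, so a read is a direct multi-get."""
    segments = [backing_store.read_segment(library_name, key)
                for key in version['segment_keys']]
    assert len(segments) == version['segment_count']
    return b"".join(segments)
```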
    previous_version = self._backing_store.read_version(self.library_name, symbol)

    handler = self._write_handler(version, symbol, data, **kwargs)
    handler.write(self._backing_store, self.library_name, version, symbol, data, previous_version, **kwargs)
Would be nice to think about decoupling further and have:
- the VersionStore top-level API (looks great, as shown here)
- a read/write handler being a composite of:
  - version metadata handling. Metadata to be attached to the version is returned in the form of a dict by the handler. The handler is not aware of versions/version documents, only of handler-specific metadata. The write handler's write() expects metadata to be passed, and returns new metadata to the top level, to be attached to the version document.
  - a serialization handler. Can be a numpy recarray serializer, an Arrow serializer, anything.
  - a segmentation policy. How to segment data, and the size of segments (the upper size bound is probably dictated by the backing store).
  - a compression handler.
  - as you already have in the code, a backing_store handler, which is responsible for writing individual segments plus associated per-segment metadata to the underlying storage.

With such a well-separated model, one could create custom implementations of individual handlers for serialization/segmentation/compression/backing_store. The logic of e.g. converting a numpy array to a rec array, segmenting, producing byte arrays and finally writing chunks is currently very integrated, which makes it hard to add different/new implementations.
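To make the proposed decomposition concrete, here is one way the handlers could compose on the write path. Every interface below is hypothetical, sketched from the comment above rather than taken from this PR:

```python
class WritePipeline(object):
    """Illustrative composition of the proposed handlers; each collaborator
    is swappable independently of the others."""

    def __init__(self, serializer, segmenter, compressor, backing_store):
        self.serializer = serializer        # e.g. recarray or Arrow serializer
        self.segmenter = segmenter          # segmentation policy
        self.compressor = compressor        # e.g. lz4
        self.backing_store = backing_store  # writes individual segments

    def write(self, library_name, symbol, version, data):
        # Serializer returns raw bytes plus handler-specific metadata;
        # it knows nothing about version documents.
        raw, metadata = self.serializer.serialize(data)
        keys = []
        for chunk in self.segmenter.split(raw):
            keys.append(self.backing_store.write_segment(
                library_name, symbol,
                self.compressor.compress(chunk),
                version['version']))
        # The top level attaches handler metadata and forward pointers
        # to the version document.
        version['segment_keys'] = keys
        version.update(metadata)
        return version
```

This also illustrates the worry raised in the reply below: each of the four collaborators multiplies the configuration space, so every library needs a coherent combination to function.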
I thought about this approach. One worry is that it becomes very complicated, with an explosion of possible interactions between the different parts which may or may not make sense, e.g. MongoDB metadata with parquet serialisation with 2MB chunking on an S3 backend, etc.
Each library would then have to be configured with a particular config to work correctly. I think it might be an abstraction too far.
@pablojim Shame to let this bit-rot; can we discuss later this week with @willdealtry, @shashank88?
Yeah, this seems pretty good, will go through it tonight. Have fixed the merge conflict. Will see if the tests are fine.
If we don’t think this is prod-ready or a feature we want to support long term, maybe we can segregate it as an example and make sure our API allows this sort of flexibility.
Apart from some dependencies it is completely isolated from the rest of Arctic. It's all in the "pluggable" package and duplicates some code from the main APIs. One option would be to merge it but mark it in the code and documentation as Beta until it is deemed ready for wider use.
Will move to a contrib directory to unblock this PR, without committing this to be used as a production store.
First implementation of a Version Store with pluggable backends. POC in S3 with read, write, soft delete & snapshots using the existing VersionStore chunking and serialisation mechanisms. Append and hard deletes are not implemented.
This implementation stands alone and has no effect on existing functionality. It contains lots of code duplicated from the existing functionality, and has limited error checking and cleanup functionality. This PR is mostly for discussion purposes at this point.