-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Very simple LRU-based garbage collector (GC) #144
base: main
Are you sure you want to change the base?
Conversation
@@ -1479,6 +1531,8 @@ def __init__(self, environments: Collection[TaskEnvironment], n_threads: Optiona | |||
self._device_launched_datamove_task_counts = { | |||
dev: 0 for dev in self._available_resources.get_resources()} | |||
|
|||
self._lrum = LRUManager() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looking through this bc we'll be looking at it again soon and been thinking about memory counting.
It should be a separate LRUManager for each device object (memory space), shouldn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the design. My original idea was to have a single GC (not necessarily now but I thouhgt a daemon like a scheduler thread) and manage all memory regardless of its location. It might be simpler than a GC per a device? Could you please let me know your idea and its pros?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason we would call into the EvictionPolicy is when a specific device is full and we want to bring more data onto it. This would look at only data on that device to start evicting (which means it needs a device specific hash-list structure for LRU). I am not sure how this would be handled by a global one for all devices without evicting data from other devices unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also (just terminology), specifically I'm using "GC" for cleaning up data that will not be used in the future and "Eviction" for kicking out any data (possibly data that will be used in the future) due to space limitations on a device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am on board with you. But I think there is a different way to implement what you said. What I meant was the global memory manager can still manage the whole devices. For example, as you said, we maintain a list of arrays for each device and the global one can traverse and evicts/dellocates subset of them to get more memory. Or we can have a dedicated memory manager thread for each device and let each of them manage the owned device parrays. I think this is implementation detail?
This PR invokes an LRU-based GC for each task execution. Cholesky on various setting worked correctly. This PR is not optimized well performance yet, but aims to implement a baseline. This PR includes possible bug fixes on the runtime.