Rabbit allocations of equal or near-equal sizes #175
Comments
What's the requirement that's causing trouble currently? I can think of a few ways we can force fit this, but we may have to work at it a bit.
Currently we allocate rabbit storage local to the compute nodes we've chosen. So if Fluxion picks five nodes on rack A and one on rack B, it will also allocate five times as much storage on rabbit A as on rabbit B. [side note: we do it this way just because it works for XFS and GFS2 and Lustre, even though it's unnecessarily restrictive for Lustre. See #161] If we try to set up Lustre OSTs on both rabbits in a case like that where the storage isn't evenly distributed, Brian has said that he expects performance to be "badly wrecked":
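To make the imbalance concrete, here is a minimal sketch of the "storage follows compute nodes" behavior described above. It is not Fluxion code: the function name `local_storage_split` and the 100 GiB per-node figure are assumptions made up for illustration.

```python
from collections import Counter

def local_storage_split(node_racks, per_node_gib):
    """Storage taken from each rabbit when every rabbit only serves
    the compute nodes in its own rack (hypothetical helper)."""
    counts = Counter(node_racks)  # nodes chosen per rack
    return {rack: n * per_node_gib for rack, n in counts.items()}

# Five nodes on rack A and one on rack B, 100 GiB per node (assumed figure).
alloc = local_storage_split(["A"] * 5 + ["B"], per_node_gib=100)
print(alloc)  # {'A': 500, 'B': 100}
```

Rabbit A ends up with five times the storage of rabbit B, which is the uneven layout the Lustre concern is about.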
I'm not sure I entirely understand this logic, though, so I'm going to check back in with Brian about it.
That logic sounds right, though painful, since it's how uneven storage striping systems tend to work. You end up basically round-robining over however many stripes there are, which overloads a larger device in some cases (did this to myself with an uneven software parity setup once).

The simplest (though most annoying) thing I can think to do is to enact a version of the plan we talked about over the whiteboard a while ago and effectively "split" the rabbits into storage meant for NNL storage and Lustre-type storage. Make half available in the nodes you're already using, and half available from another "meta-node", maybe just hanging off the cluster at the top, that gets used for ephemeral Lustre, and request an even amount from each rabbit involved. Unfortunately that would mean tracking how much of each rabbit's Lustre storage is consumed some other way, which is kinda awful.

The more satisfying solution would be to have a way to express that a given resource type prefers even load, or greater distribution, or something such that we can actually get better behavior. The cheapest thing I can think of that's at least somewhat in this direction is that we could try to optimize for choosing as many different rabbits as possible to service the request. You have 10 nodes and ask for Lustre of at least 10TB? We try to give you slices of 10 rabbits. We don't have built-in support for that, but the hook we use to do "node-centric allocation" works similarly, such that we might be able to get it with a relatively small tweak.
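The "as many different rabbits as possible, in equal slices" idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Fluxion hook mentioned above: `even_slices`, its parameters, and the free-space numbers are all hypothetical, and sizes are plain integers in GB.

```python
import math

def even_slices(request_gb, free_gb_by_rabbit):
    """Spread a request over as many rabbits as possible using equal slices.

    Returns {rabbit: slice_gb}, or None if no equal split fits.
    Hypothetical helper; not part of Fluxion.
    """
    # Rabbits with the most free space first.
    candidates = sorted(free_gb_by_rabbit.items(),
                        key=lambda kv: kv[1], reverse=True)
    # Try the widest spread first, then fall back to fewer rabbits.
    for k in range(len(candidates), 0, -1):
        slice_gb = math.ceil(request_gb / k)
        chosen = candidates[:k]
        # If the k-th roomiest rabbit fits the slice, all k chosen ones do.
        if chosen[-1][1] >= slice_gb:
            return {name: slice_gb for name, _ in chosen}
    return None

# Ten rabbits with 2000 GB free each, request of 10 TB (10000 GB).
free = {f"rabbit{i}": 2000 for i in range(10)}
wide = even_slices(10_000, free)

# Same request, but one rabbit is nearly full, so it gets skipped.
free["rabbit3"] = 500
narrower = even_slices(10_000, free)
```

In the first call every rabbit contributes a 1000 GB slice. In the second, the nearly full rabbit cannot hold an equal share, so the sketch falls back to nine rabbits with equal (rounded-up) slices rather than producing an uneven layout.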
For ephemeral Lustre file systems on rabbits, Flux can choose rabbit storage irrespective of the location of compute nodes. Flux can also split the allocation across multiple rabbits (and will need to, depending on the size of the file system). However, if the allocation is split across multiple rabbits, @behlendorf has indicated that it is a performance requirement that the allocations all be equal or near-equal in size.
I don't know at the moment how to accomplish this. A new Fluxion match policy? @milroy, @zekemorton, or @trws, any ideas?
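Whatever mechanism ends up producing the allocation, the "equal or near-equal" requirement itself can be written down as a simple check. The sketch below is an assumption-laden illustration: the issue does not define "near-equal", so the 5% tolerance and the name `is_near_equal` are placeholders, not an agreed threshold.

```python
def is_near_equal(sizes, tolerance=0.05):
    """True if every per-rabbit allocation is within `tolerance`
    (as a fraction of the largest allocation) of the largest one.
    The 5% default is an assumed placeholder, not a stated requirement."""
    if not sizes:
        return True
    lo, hi = min(sizes), max(sizes)
    return hi == 0 or (hi - lo) / hi <= tolerance

uneven = is_near_equal([500, 100])         # five-to-one split across two rabbits
close = is_near_equal([1000, 1000, 990])   # slices differing by 1%
```

A check like this could serve as an acceptance test for any candidate match policy: the five-to-one split fails it, while slices that differ by about 1% pass.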