Reducing OST (and perhaps MDT) fragmentation #176

Open
jameshcorbett opened this issue Jul 3, 2024 · 1 comment

@jameshcorbett (Collaborator) commented:

Servers resources for Lustre, when filled in by Flux, can look something like this:

spec:
  allocationSets:
  - allocationSize: 824633720832
    label: ost
    storage:
    - allocationCount: 3
      name: elcap1
    - allocationCount: 1
      name: elcap2
  - allocationSize: 17179869184
    label: mdt
    storage:
    - allocationCount: 3
      name: elcap1
    - allocationCount: 1
      name: elcap2

What this gives us is 3 OSTs on elcap1 and 1 on elcap2. However, as @behlendorf noted,

For Lustre we realistically wouldn't want to ever create more than one OST or MDT per rabbit per workflow. It's good that HPE's software supports it since it'd be nice to experiment with, but it would be an odd configuration.

However, I think there is a disconnect between the way Flux allocates storage and the way Servers asks for that storage to be represented. At the moment Flux has no policy to allocate equal amounts of storage from each rabbit. Flux may allocate a huge chunk of storage (say N bytes) from elcap1 and a much smaller amount (M bytes) from elcap2 (as in the example above), while still wanting a single OST (and perhaps MDT) on each despite the size difference. But there is no good way to represent that in Servers without doing something like the above: take the greatest common divisor of N and M, make that the allocationSize, and then set the allocationCount for each to N / GCD(N, M) and M / GCD(N, M) respectively.
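
A minimal Python sketch of that GCD translation (the helper name and its dict interface are made up here for illustration; this is not existing Flux or Servers code):

import math

def to_allocation_set(label, bytes_per_rabbit):
    """Turn per-rabbit byte totals into one allocationSize plus
    per-rabbit allocationCounts, using the GCD of the totals."""
    size = math.gcd(*bytes_per_rabbit.values())
    return {
        "allocationSize": size,
        "label": label,
        "storage": [
            {"allocationCount": total // size, "name": name}
            for name, total in sorted(bytes_per_rabbit.items())
        ],
    }

# Reproduces the spec above: 3 x 824633720832 bytes on elcap1, 1 x on elcap2.
print(to_allocation_set("ost", {"elcap1": 2473901162496, "elcap2": 824633720832}))

The catch is that when N and M share no large common factor, the GCD shrinks toward a tiny allocationSize and the counts balloon into many small OSTs per rabbit, which collides with the one-OST-per-rabbit preference quoted above.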

@behlendorf also noted that imbalanced allocations may not be desirable, since

Lustre will do a better job of balancing their usage if [OSTs are] all close to the same capacity

So Flux may need to work on a policy to equalize the amount of storage allocated on each rabbit node (a sketch of one such policy follows the spec below). However, it might be nice if we could represent unequal sizes directly, with something like the following:

spec:
  allocationSets:
  - label: ost
    storage:
    - allocationCount: 1
      name: elcap1
      allocationSize: 2424633720832
    - allocationCount: 1
      name: elcap2
      allocationSize: 824633720832
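
For the other direction, a minimal sketch of the equalization policy mentioned above (the function, its interface, and the 1 GiB rounding granularity are all hypothetical, not anything Flux implements today):

def equalize(total_bytes, rabbits, granularity=1024**3):
    """Hypothetical policy: split a request evenly across rabbits, rounding
    each share up so every OST ends up the same size. With equal shares,
    GCD(N, M) == N == M, so the existing Servers schema trivially yields
    one OST per rabbit."""
    share = -(-total_bytes // len(rabbits))         # ceiling division
    return -(-share // granularity) * granularity   # round up to granularity

# A 3 TiB request across two rabbits -> 1.5 TiB on each, one OST apiece.
print(equalize(3 * 1024**4, ["elcap1", "elcap2"]))
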
@jameshcorbett (Collaborator, Author) commented:

Brian has indicated to me that OSTs of unequal sizes may be devastating to performance. So perhaps the problem is really in Flux's scheduling policy. I've opened flux-framework/flux-coral2#175.
