Reducing OST (and perhaps MDT) fragmentation #176

Open
jameshcorbett opened this issue Jul 3, 2024 · 1 comment

@jameshcorbett (Collaborator) commented:

Servers resources for Lustre, when filled in by Flux, can look something like this:

spec:
  allocationSets:
  - allocationSize: 824633720832
    label: ost
    storage:
    - allocationCount: 3
      name: elcap1
    - allocationCount: 1
      name: elcap2
  - allocationSize: 17179869184
    label: mdt
    storage:
    - allocationCount: 3
      name: elcap1
    - allocationCount: 1
      name: elcap2

What this gives us is 3 OSTs on elcap1 and 1 on elcap2. However, as @behlendorf noted,

For Lustre we realistically wouldn't want to ever create more than one OST or MDT per rabbit per workflow. It's good that HPE's software supports it since it'd be nice to experiment with, but it would be an odd configuration.

However, I think there is a disconnect between the way Flux allocates storage and the way Servers asks for that storage to be represented. At the moment Flux has no policy to allocate equal amounts of storage from each rabbit. Flux may allocate a huge chunk of storage (say N bytes) from elcap1 and a much smaller amount (M bytes) from elcap2 (as in the example above), while still wanting a single OST (and perhaps MDT) on each despite the size difference. But there is no good way to represent that in Servers without doing something like the above: take the greatest common divisor of N and M, make that the allocationSize, and then set the allocationCount for each to N / GCD(N, M) and M / GCD(N, M) respectively.
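
A minimal Python sketch of that GCD translation (the helper name and its dict interface are made up here for illustration; this is not existing Flux or Servers code):

import math

def to_allocation_set(label, bytes_per_rabbit):
    """Turn per-rabbit byte totals into one allocationSize plus
    per-rabbit allocationCounts, using the GCD of the totals."""
    size = math.gcd(*bytes_per_rabbit.values())
    return {
        "allocationSize": size,
        "label": label,
        "storage": [
            {"allocationCount": total // size, "name": name}
            for name, total in sorted(bytes_per_rabbit.items())
        ],
    }

# Reproduces the spec above: 3 x 824633720832 bytes on elcap1, 1 x on elcap2.
print(to_allocation_set("ost", {"elcap1": 2473901162496, "elcap2": 824633720832}))

The catch is that when N and M share no large common factor, the GCD shrinks toward a tiny allocationSize and the counts balloon into many small OSTs per rabbit, which collides with the one-OST-per-rabbit preference quoted above.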

@behlendorf also noted that imbalanced allocations may not be desirable, since

Lustre will do a better job of balancing their usage if [OSTs are] all close to the same capacity

So Flux may need to work on a policy to equalize the amount of storage allocated on each rabbit node (a sketch of one such policy follows the spec below). However, it might be nice if we could represent unequal sizes directly, with something like the following:

spec:
  allocationSets:
  - label: ost
    storage:
    - allocationCount: 1
      name: elcap1
      allocationSize: 2424633720832
    - allocationCount: 1
      name: elcap2
      allocationSize: 824633720832
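
For the other direction, a minimal sketch of the equalization policy mentioned above (the function, its interface, and the 1 GiB rounding granularity are all hypothetical, not anything Flux implements today):

def equalize(total_bytes, rabbits, granularity=1024**3):
    """Hypothetical policy: split a request evenly across rabbits, rounding
    each share up so every OST ends up the same size. With equal shares,
    GCD(N, M) == N == M, so the existing Servers schema trivially yields
    one OST per rabbit."""
    share = -(-total_bytes // len(rabbits))         # ceiling division
    return -(-share // granularity) * granularity   # round up to granularity

# A 3 TiB request across two rabbits -> 1.5 TiB on each, one OST apiece.
print(equalize(3 * 1024**4, ["elcap1", "elcap2"]))
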
@jameshcorbett (Collaborator, Author) commented:

Brian has indicated to me that OSTs of unequal sizes may be devastating to performance. So perhaps the problem is really in Flux's scheduling policy. I've opened flux-framework/flux-coral2#175.
