
ceph-config: fix calculation of num_osds #7502

Closed

Conversation

janhorstmann

The number of OSDs defined by the lvm_volumes variable is added to num_osds in task Count number of osds for lvm scenario. Therefore these devices must not be counted in task Set_fact num_osds (add existing osds).
There are currently three problems with the existing approach:

  1. Bluestore DB and WAL devices are counted as OSDs
  2. lvm_volumes supports a second notation to directly specify logical volumes instead of devices when the data_vg key exists. This scenario is not yet accounted for.
  3. The difference filter used to remove devices from lvm_volumes returns a list of unique elements, thus not accounting for multiple OSDs on a single device

The first problem is solved by filtering the list of logical volumes for those used as type block.
For the second and third problems, lists are created from lvm_volumes containing either paths to devices or logical volume devices. For the second problem the output of ceph-volume is simply filtered for lv_paths appearing in the list of logical volume devices described above.
To solve the third problem the remaining OSDs in the output are compiled into a list of their used devices, which is then filtered for devices appearing in the list of devices from lvm_volumes.
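
As a minimal sketch of the first point only (illustrative, not the actual diff of this PR), db and wal entries can be dropped with selectattr before counting; lvm_list is assumed to be the registered result of the ceph-volume lvm list --format json task:

- name: Count only 'block' LVs reported by ceph-volume (illustrative sketch)
  ansible.builtin.debug:
    msg: >-
      {{ lvm_list.stdout | default('{}') | from_json | dict2items
         | map(attribute='value') | flatten
         | selectattr('type', 'equalto', 'block') | list | length }}
  # With the sample 'ceph-volume lvm list' output shown later in this thread
  # (three block LVs plus one db LV), this evaluates to 3 instead of 4.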

Fixes: #7435

Signed-off-by: Jan Horstmann <[email protected]>
@clwluvw
Member

clwluvw commented Mar 16, 2024

jenkins test centos-container-external_clients

@clwluvw
Member

clwluvw commented Mar 16, 2024

jenkins test centos-container-rbdmirror

@guits
Collaborator

guits commented Mar 20, 2024

@clwluvw that's probably the kind of task which deserves to be converted into a module

@guits
Collaborator

guits commented Mar 20, 2024

@janhorstmann I might be missing something, but with the current implementation:

inventory

[osds]
osd0 lvm_volumes="[{'data': 'data-lv1', 'data_vg': 'test_group'},{'data': 'data-lv2', 'data_vg': 'test_group', 'db': 'journal1', 'db_vg': 'journals'}]"
osd1 lvm_volumes="[{'data': 'data-lv1', 'data_vg': 'test_group'},{'data': 'data-lv2', 'data_vg': 'test_group'}]" dmcrypt=true
osd2 lvm_volumes="[{'data': 'data-lv1', 'data_vg': 'test_group'},{'data': 'data-lv2', 'data_vg': 'test_group', 'db': 'journal1', 'db_vg': 'journals'}]"
osd3 lvm_volumes="[{'data': '/dev/sda'}, {'data': '/dev/sdb'}, {'data': '/dev/sdc'}]"

I see this in the log:

TASK [ceph-config : Set_fact num_osds (add existing osds)] *********************
task path: /home/guillaume/workspaces/ceph-ansible/7502/roles/ceph-config/tasks/main.yml:93
Wednesday 20 March 2024  16:52:31 +0100 (0:00:00.641)       0:01:59.236 *******
ok: [osd0] => changed=false
  ansible_facts:
    num_osds: '2'
ok: [osd1] => changed=false
  ansible_facts:
    num_osds: '2'
ok: [osd2] => changed=false
  ansible_facts:
    num_osds: '2'
ok: [osd3] => changed=false
  ansible_facts:
    num_osds: '3'

is there anything wrong here?

@janhorstmann
Author

Thanks for taking the time to look into this, @guits. The output shown is correct.

Is this by chance from a first run of ceph-ansible on a fresh install? In that case, at the time the ceph-config role is run, there won't be any OSDs provisioned. Thus the output of ceph-volume lvm list will be empty and num_osds is only counted from the devices defined in lvm_volumes.
On any subsequent run of ceph-ansible, OSDs will have been created and will show up in ceph-volume lvm list. The calculation in task Set_fact num_osds (add existing osds) will then:

  • Sum up the devices lists from all OSDs:
    lvm_list.stdout | default('{}') | from_json | dict2items | map(attribute='value') | flatten | map(attribute='devices') | sum(start=[])
    From what I have seen the devices of OSDs in the output of ceph-volume lvm list are always the underlying disks and never any logical volumes.
    At this point the resulting list will contain devices from block, db, and wal types, thus counting more OSDs than actually exist if db or wal types are listed
  • Create an iterable of the data values in lvm_volumes for the difference filter
    lvm_volumes | default([]) | map(attribute='data')
    This iterable now contains devices and logical volumes
  • Apply the difference filter to both items:
    [...] | difference([...])
    Counterintuitively, the difference filter returns a list of unique items, thus ignoring multiple OSDs provisioned on the same device (see the sketch after this list).
    The result will also still contain the OSDs provisioned from logical volumes in lvm_volumes, because the devices list only contains disks while the difference is taken against a list containing the logical volume names
  • Count the items in the resulting list and add it to the existing value in num_osds
    num_osds: "{{ num_osds | int + ([...] | difference([...]) | length | int) }}"
    The existing value already contains a count of all items in lvm_volumes
    This will be the correct value on the first run as it will only contain the count of lvm_volumes. On subsequent runs this number will differ according to the combination of osds_per_device, db devices, etc.
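
To make the deduplication problem concrete, here is a small, self-contained playbook sketch. The data is invented for illustration (it is not taken from any environment in this thread); it only mimics the per-LV devices lists that ceph-volume lvm list would report for two OSDs sharing one disk:

- name: Illustrate why the difference-based count drifts (hypothetical data)
  hosts: localhost
  gather_facts: false
  vars:
    # Simulated per-LV 'devices' lists for a host provisioned with osds_per_device: 2
    block_devices_per_osd:
      - ["/dev/sdb"]   # osd.0
      - ["/dev/sdb"]   # osd.1, second OSD on the same disk
    lvm_volumes: []
  tasks:
    - name: Count existing OSDs the way the main branch does
      ansible.builtin.debug:
        msg: >-
          {{ block_devices_per_osd | sum(start=[])
             | difference(lvm_volumes | default([]) | map(attribute='data'))
             | length }}
      # Prints 1 although two OSDs exist: difference() deduplicates /dev/sdb.
      # A db or wal LV in the listing would add its disk here as well and push
      # the count in the opposite direction.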

@guits
Collaborator

guits commented Mar 21, 2024

Is this by chance from a first run of ceph-ansible on a fresh install?

no, this is from a re-run after a fresh install.

@janhorstmann
Author

Is this by chance from a first run of ceph-ansible on a fresh install?

no, this is from a re-run after a fresh install.

Before I dive deeper into this, could you please confirm that the output is actually from the current implementation? I noticed the number 7502 in the task path, which is exactly the number of this PR:

TASK [ceph-config : Set_fact num_osds (add existing osds)] *********************
task path: /home/guillaume/workspaces/ceph-ansible/7502/roles/ceph-config/tasks/main.yml:93
[...]

Could this be a run with the version containing the fix? In that case I would hope that it is correct ;)

If that number is unrelated, could you show the output of ceph-volume lvm list --format json from an OSD node? Maybe that could help pinpoint the flaw in my logic.

@guits
Collaborator

guits commented Mar 21, 2024

Is this by chance from a first run of ceph-ansible on a fresh install?

no, this is from a re-run after a fresh install.

Before I dive deeper into this, could you please confirm that the output is actually from the current implementation? I noticed the number 7502 in the task path, which is exactly the number of this PR

I cloned the repo at a new path and named it with the id of your PR, but it was indeed on the main branch

@guits
Collaborator

guits commented Mar 21, 2024

I'm gonna do more tests and double-check I didn't miss a detail

@janhorstmann
Author

I cloned the repo at a new path and named it with the id of your PR, but it was indeed on the main branch

I'm gonna do more tests and double-check I didn't miss a detail

Thank you for bearing with me here.

I did not exactly reproduce your test environment, but set up a single instance with four logical volumes on four volume groups on four devices:

pvs && vgs && lvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sdb   vg_b lvm2 a--  <10.00g    0
  /dev/sdc   vg_c lvm2 a--  <10.00g    0
  /dev/sdd   vg_d lvm2 a--  <10.00g    0
  /dev/sde   vg_e lvm2 a--  <10.00g    0
  VG   #PV #LV #SN Attr   VSize   VFree
  vg_b   1   1   0 wz--n- <10.00g    0
  vg_c   1   1   0 wz--n- <10.00g    0
  vg_d   1   1   0 wz--n- <10.00g    0
  vg_e   1   1   0 wz--n- <10.00g    0
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_b vg_b -wi-ao---- <10.00g
  lv_c vg_c -wi-ao---- <10.00g
  lv_d vg_d -wi-ao---- <10.00g
  lv_e vg_e -wi-ao---- <10.00g

matching this config:

---
lvm_volumes:
  - data: lv_b
    data_vg: vg_b
  - data: lv_c
    data_vg: vg_c
  - data: lv_d
    data_vg: vg_d
    db: lv_e
    db_vg: vg_e

Using branch main I get the following diff between the first and second run in the output of the relevant parts of ceph-config:

TASK [ceph-config : Reset num_osds] *************************	TASK [ceph-config : Reset num_osds] *************************
ok: [localhost] => changed=false				ok: [localhost] => changed=false
  ansible_facts:						  ansible_facts:
    num_osds: 0							    num_osds: 0

TASK [ceph-config : Count number of osds for lvm scenario] **	TASK [ceph-config : Count number of osds for lvm scenario] **
ok: [localhost] => changed=false				ok: [localhost] => changed=false
  ansible_facts:						  ansible_facts:
    num_osds: '3'						    num_osds: '3'

TASK [ceph-config : Look up for ceph-volume rejected devices]	TASK [ceph-config : Look up for ceph-volume rejected devices]
skipping: [localhost] => changed=false				skipping: [localhost] => changed=false
  false_condition: devices | default([]) | length > 0		  false_condition: devices | default([]) | length > 0
  skip_reason: Conditional result was False			  skip_reason: Conditional result was False

TASK [ceph-config : Set_fact rejected_devices] **************	TASK [ceph-config : Set_fact rejected_devices] **************
skipping: [localhost] => changed=false				skipping: [localhost] => changed=false
  skipped_reason: No items in the list				  skipped_reason: No items in the list

TASK [ceph-config : Set_fact _devices] **********************	TASK [ceph-config : Set_fact _devices] **********************
skipping: [localhost] => changed=false				skipping: [localhost] => changed=false
  false_condition: devices | default([]) | length > 0		  false_condition: devices | default([]) | length > 0
  skip_reason: Conditional result was False			  skip_reason: Conditional result was False

TASK [ceph-config : Run 'ceph-volume lvm batch --report' to s	TASK [ceph-config : Run 'ceph-volume lvm batch --report' to s
skipping: [localhost] => changed=false				skipping: [localhost] => changed=false
  false_condition: devices | default([]) | length > 0		  false_condition: devices | default([]) | length > 0
  skip_reason: Conditional result was False			  skip_reason: Conditional result was False

TASK [ceph-config : Set_fact num_osds from the output of 'cep	TASK [ceph-config : Set_fact num_osds from the output of 'cep
skipping: [localhost] => changed=false				skipping: [localhost] => changed=false
  false_condition: devices | default([]) | length > 0		  false_condition: devices | default([]) | length > 0
  skip_reason: Conditional result was False			  skip_reason: Conditional result was False

TASK [ceph-config : Set_fact num_osds from the output of 'cep	TASK [ceph-config : Set_fact num_osds from the output of 'cep
skipping: [localhost] => changed=false				skipping: [localhost] => changed=false
  false_condition: devices | default([]) | length > 0		  false_condition: devices | default([]) | length > 0
  skip_reason: Conditional result was False			  skip_reason: Conditional result was False

TASK [ceph-config : Run 'ceph-volume lvm list' to see how man	TASK [ceph-config : Run 'ceph-volume lvm list' to see how man
ok: [localhost] => changed=false				ok: [localhost] => changed=false
  cmd:								  cmd:
  - docker							  - docker
  - run								  - run
  - --rm							  - --rm
  - --privileged						  - --privileged
  - --net=host							  - --net=host
  - --ipc=host							  - --ipc=host
  - -v								  - -v
  - /run/lock/lvm:/run/lock/lvm:z				  - /run/lock/lvm:/run/lock/lvm:z
  - -v								  - -v
  - /var/run/udev:/var/run/udev:z				  - /var/run/udev:/var/run/udev:z
  - -v								  - -v
  - /dev:/dev							  - /dev:/dev
  - -v								  - -v
  - /etc/ceph:/etc/ceph:z					  - /etc/ceph:/etc/ceph:z
  - -v								  - -v
  - /run/lvm:/run/lvm						  - /run/lvm:/run/lvm
  - -v								  - -v
  - /var/lib/ceph:/var/lib/ceph:ro				  - /var/lib/ceph:/var/lib/ceph:ro
  - -v								  - -v
  - /var/log/ceph:/var/log/ceph:z				  - /var/log/ceph:/var/log/ceph:z
  - --entrypoint=ceph-volume					  - --entrypoint=ceph-volume
  - quay.io/ceph/daemon-base:latest-main			  - quay.io/ceph/daemon-base:latest-main
  - --cluster							  - --cluster
  - ceph							  - ceph
  - lvm								  - lvm
  - list							  - list
  - --format=json						  - --format=json
  delta: '0:00:00.356457'				      |	  delta: '0:00:00.350386'
  end: '2024-03-25 09:43:04.401330'			      |	  end: '2024-03-25 09:45:04.428959'
  rc: 0								  rc: 0
  start: '2024-03-25 09:43:04.044873'			      |	  start: '2024-03-25 09:45:04.078573'
  stderr: ''							  stderr: ''
  stderr_lines: <omitted>					  stderr_lines: <omitted>
  stdout: '{}'						      |	  stdout: |-
							      >	    {
							      >	        "0": [
							      >	            {
							      >	                "devices": [
							      >	                    "/dev/sdb"
							      >	                ],
							      >	                "lv_name": "lv_b",
							      >	                "lv_path": "/dev/vg_b/lv_b",
							      >	                "lv_size": "10733223936",
							      >	                "lv_tags": "ceph.block_device=/dev/vg_b/lv_b,
							      >	                "lv_uuid": "E5KteH-nE2B-6n3p-jVzj-BHjN-kfON-6
							      >	                "name": "lv_b",
							      >	                "path": "/dev/vg_b/lv_b",
							      >	                "tags": {
							      >	                    "ceph.block_device": "/dev/vg_b/lv_b",
							      >	                    "ceph.block_uuid": "E5KteH-nE2B-6n3p-jVzj
							      >	                    "ceph.cephx_lockbox_secret": "",
							      >	                    "ceph.cluster_fsid": "c29aec7d-cf6c-4cd4-
							      >	                    "ceph.cluster_name": "ceph",
							      >	                    "ceph.crush_device_class": "",
							      >	                    "ceph.encrypted": "0",
							      >	                    "ceph.osd_fsid": "57d97201-db17-4927-839a
							      >	                    "ceph.osd_id": "0",
							      >	                    "ceph.osdspec_affinity": "",
							      >	                    "ceph.type": "block",
							      >	                    "ceph.vdo": "0"
							      >	                },
							      >	                "type": "block",
							      >	                "vg_name": "vg_b"
							      >	            }
							      >	        ],
							      >	        "1": [
							      >	            {
							      >	                "devices": [
							      >	                    "/dev/sdc"
							      >	                ],
							      >	                "lv_name": "lv_c",
							      >	                "lv_path": "/dev/vg_c/lv_c",
							      >	                "lv_size": "10733223936",
							      >	                "lv_tags": "ceph.block_device=/dev/vg_c/lv_c,
							      >	                "lv_uuid": "63g2QD-3l00-3mIt-YcoL-yfUs-GPPD-L
							      >	                "name": "lv_c",
							      >	                "path": "/dev/vg_c/lv_c",
							      >	                "tags": {
							      >	                    "ceph.block_device": "/dev/vg_c/lv_c",
							      >	                    "ceph.block_uuid": "63g2QD-3l00-3mIt-YcoL
							      >	                    "ceph.cephx_lockbox_secret": "",
							      >	                    "ceph.cluster_fsid": "c29aec7d-cf6c-4cd4-
							      >	                    "ceph.cluster_name": "ceph",
							      >	                    "ceph.crush_device_class": "",
							      >	                    "ceph.encrypted": "0",
							      >	                    "ceph.osd_fsid": "5deb4190-7b0d-4170-adcf
							      >	                    "ceph.osd_id": "1",
							      >	                    "ceph.osdspec_affinity": "",
							      >	                    "ceph.type": "block",
							      >	                    "ceph.vdo": "0"
							      >	                },
							      >	                "type": "block",
							      >	                "vg_name": "vg_c"
							      >	            }
							      >	        ],
							      >	        "2": [
							      >	            {
							      >	                "devices": [
							      >	                    "/dev/sdd"
							      >	                ],
							      >	                "lv_name": "lv_d",
							      >	                "lv_path": "/dev/vg_d/lv_d",
							      >	                "lv_size": "10733223936",
							      >	                "lv_tags": "ceph.block_device=/dev/vg_d/lv_d,
							      >	                "lv_uuid": "Rwef6N-ETHv-4TUd-9j2B-0N31-EAtp-c
							      >	                "name": "lv_d",
							      >	                "path": "/dev/vg_d/lv_d",
							      >	                "tags": {
							      >	                    "ceph.block_device": "/dev/vg_d/lv_d",
							      >	                    "ceph.block_uuid": "Rwef6N-ETHv-4TUd-9j2B
							      >	                    "ceph.cephx_lockbox_secret": "",
							      >	                    "ceph.cluster_fsid": "c29aec7d-cf6c-4cd4-
							      >	                    "ceph.cluster_name": "ceph",
							      >	                    "ceph.crush_device_class": "",
							      >	                    "ceph.db_device": "/dev/vg_e/lv_e",
							      >	                    "ceph.db_uuid": "mANY8b-MvVI-VaU9-3afv-N7
							      >	                    "ceph.encrypted": "0",
							      >	                    "ceph.osd_fsid": "9a470d84-f91a-4e59-b963
							      >	                    "ceph.osd_id": "2",
							      >	                    "ceph.osdspec_affinity": "",
							      >	                    "ceph.type": "block",
							      >	                    "ceph.vdo": "0"
							      >	                },
							      >	                "type": "block",
							      >	                "vg_name": "vg_d"
							      >	            },
							      >	            {
							      >	                "devices": [
							      >	                    "/dev/sde"
							      >	                ],
							      >	                "lv_name": "lv_e",
							      >	                "lv_path": "/dev/vg_e/lv_e",
							      >	                "lv_size": "10733223936",
							      >	                "lv_tags": "ceph.block_device=/dev/vg_d/lv_d,
							      >	                "lv_uuid": "mANY8b-MvVI-VaU9-3afv-N7Mw-KWED-m
							      >	                "name": "lv_e",
							      >	                "path": "/dev/vg_e/lv_e",
							      >	                "tags": {
							      >	                    "ceph.block_device": "/dev/vg_d/lv_d",
							      >	                    "ceph.block_uuid": "Rwef6N-ETHv-4TUd-9j2B
							      >	                    "ceph.cephx_lockbox_secret": "",
							      >	                    "ceph.cluster_fsid": "c29aec7d-cf6c-4cd4-
							      >	                    "ceph.cluster_name": "ceph",
							      >	                    "ceph.crush_device_class": "",
							      >	                    "ceph.db_device": "/dev/vg_e/lv_e",
							      >	                    "ceph.db_uuid": "mANY8b-MvVI-VaU9-3afv-N7
							      >	                    "ceph.encrypted": "0",
							      >	                    "ceph.osd_fsid": "9a470d84-f91a-4e59-b963
							      >	                    "ceph.osd_id": "2",
							      >	                    "ceph.osdspec_affinity": "",
							      >	                    "ceph.type": "db",
							      >	                    "ceph.vdo": "0"
							      >	                },
							      >	                "type": "db",
							      >	                "vg_name": "vg_e"
							      >	            }
							      >	        ]
							      >	    }
  stdout_lines: <omitted>					  stdout_lines: <omitted>

TASK [ceph-config : Set_fact num_osds (add existing osds)] **	TASK [ceph-config : Set_fact num_osds (add existing osds)] **
ok: [localhost] => changed=false				ok: [localhost] => changed=false
  ansible_facts:						  ansible_facts:
    num_osds: '3'					      |	    num_osds: '7'

So on the second run, in addition to the count of items in lvm_volumes, we get a count of all items in the output of ceph-volume lvm list --format json. As a result, the value for osd_memory_target is not calculated from the correct number of provisioned OSDs. In this case it gets reduced, so resources are not used efficiently.

If we start to bring osds_per_device > 1 into the equation, then memory might get overcommitted, resulting in OOM situations.
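
For a rough sense of scale, assuming (as a simplification) that osd_memory_target is derived by dividing a fixed share of host memory by num_osds: on a 64 GiB host with the 3 OSDs above, the correct num_osds gives roughly 64 / 3 ≈ 21 GiB per OSD, while the inflated value of 7 leaves only 64 / 7 ≈ 9 GiB. Conversely, with osds_per_device: 2 and an undercounted num_osds, the per-OSD targets would add up to more memory than the host actually has.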


github-actions bot commented Apr 9, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Apr 9, 2024
@janhorstmann
Author

janhorstmann commented Apr 10, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

I am still interested in landing this. Let me know if there is anything I can do to move this along.

@github-actions github-actions bot removed the stale label Apr 10, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Apr 26, 2024
@janhorstmann
Author

@guits did you have time to look into this yet?

@github-actions github-actions bot removed the stale label May 2, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label May 17, 2024

github-actions bot commented Jun 1, 2024

This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required.


Successfully merging this pull request may close these issues.

Calculated value for osd target memory too high for deployments with multiple OSDs per device