Cache Device Incomplete and Core Devices Inactive after reboot #1215

Open
nfronz opened this issue May 26, 2022 · 30 comments
Labels
question Further information is requested

Comments

@nfronz

nfronz commented May 26, 2022

Question

Why don't the cache and cores activate properly on reboot?

Motivation

I broke my Ceph Cluster as I have Open CAS running on all OSDs

Your Environment

  • OpenCAS version (commit hash or tag): 22.03.0.0685.master
  • Operating System: Debian 11/Proxmox
  • Kernel version: 5.13.19-6-pve
  • Cache device type (NAND/Optane/other): NAND
  • Core device type (HDD/SSD/other): HDD
  • Cache configuration:

[caches]
1 /dev/disk/by-id/nvme-Samsung_SSD_980_PRO_1TB_S5P2NG0R508005Y wb cache_line_size=4,ioclass_file=/etc/opencas/ansible/default.csv,cleaning_policy=alru,promotion_policy=always

[cores]
1 1 /dev/disk/by-id/wwn-0x5000cca255c01f9d-part1
1 2 /dev/disk/by-id/wwn-0x5000c500c93f187a-part1
1 3 /dev/disk/by-id/wwn-0x5000c50085e9d2eb-part1
1 4 /dev/disk/by-id/wwn-0x5000c50085b40c5b-part1
1 5 /dev/disk/by-id/wwn-0x5000c50085e47697-part1
1 6 /dev/disk/by-id/wwn-0x5000c50085e577ff-part1
1 7 /dev/disk/by-id/wwn-0x5000c50085dd1293-part1

    • Cache mode (default: wt): WB
    • Cache line size (default: 4): 4
    • Promotion policy (default: always): ALWAYS
    • Cleaning policy (default: alru): ALRU
    • Sequential cutoff policy (default: full): FULL
  • Other (e.g. lsblk, casadm -P, casadm -L)
@nfronz nfronz added the question Further information is requested label May 26, 2022
@mmichal10
Contributor

Hi @nfronz

Thank you for posting the question. Did Open CAS print any information to dmesg during booting to the OS? What is the state of the cache instance after the reboot? Is it not running at all? Or is it in an incomplete state? Could you share casadm -L output?

@brokoli18

@mmichal10 I have the same issue as the asker above. What I did was deploy opencas on several ceph nodes in a test lab using the instructions in https://open-cas.github.io/getting_started_open_cas_linux.html.

Casadm version: 22.12.0.0844.master
Operating system: Ubuntu 18.04
Kernel version: 5.4.0-139-generic
Cache device type: SSD
Core device type: HDD

While changing one of the config options the casadm command hung, and as I couldn't kill it I rebooted the node. I assume that because I had no opencas.conf configuration the opencas devices didn't come back, but after running casadm -S -d /dev/disk/by-id/ata-SSDSC2K... --load I am in this state:

# casadm -L
type    id   disk       status       write policy   device
cache   1    /dev/sdh   Incomplete   wb             -
+core   1    /dev/sda   Inactive     -              -
+core   2    /dev/sdb   Inactive     -              -
+core   3    /dev/sdc   Inactive     -              -
+core   4    /dev/sdd   Inactive     -              -

Looking at the actual block devices it seems that ceph is now trying to run on the raw disks instead of the cas devices as well:

# lsblk -o name
NAME
sda
└─sda1
  └─ceph--418e8367--4d48--42a0--89e9--0bed53fc705b-osd--block--418e8367--4d48--42a0--89e9--0bed53fc705b
sdb
└─sdb1
  └─ceph--4169945f--e1f1--44d5--8ba0--ddf9ebe42738-osd--block--4169945f--e1f1--44d5--8ba0--ddf9ebe42738
sdc
└─sdc1
  └─ceph--c818d7ca--9467--443c--941b--bf7df5ac8376-osd--block--c818d7ca--9467--443c--941b--bf7df5ac8376
sdd
└─sdd1
  └─ceph--5e92a9e5--9176--460e--950d--e0858ddef29d-osd--block--5e92a9e5--9176--460e--950d--e0858ddef29d

Is there anything that can be done to fix this situation?

P.S. Here are some (what I assume) relevant dmesg logs:

[   19.315794] cache1: Loading cache state...
[   19.379479] Thread cas_cl_cache1 started
[   19.379484] cache1: Policy 'always' initialized successfully
[   19.379509] cache1: Cannot open core 1. Cache is busy
[   19.379511] cache1.core1: Seqential cutoff init
[   19.379549] cache1: Cannot find core 1 in pool, core added as inactive
[   19.379555] cache1: Cannot open core 2. Cache is busy
[   19.379556] cache1.core2: Seqential cutoff init
[   19.379590] cache1: Cannot find core 2 in pool, core added as inactive
[   19.379595] cache1: Cannot open core 3. Cache is busy
[   19.379596] cache1.core3: Seqential cutoff init
[   19.379630] cache1: Cannot find core 3 in pool, core added as inactive
[   19.379635] cache1: Cannot open core 4. Cache is busy
[   19.379644] cache1.core4: Seqential cutoff init
[   19.379678] cache1: Cannot find core 4 in pool, core added as inactive
[   31.707718] cache1: Done loading cache state
[   46.325490] cache1: Done saving cache state!
[   46.362173] cache1: Cache attached
[   46.362176] cache1: Successfully loaded
[   46.362177] cache1: Cache mode : wb
[   46.362178] cache1: Cleaning policy : acp
[   46.362179] cache1: Promotion policy : always
[   46.362181] cache1.core1: Failed to initialize
[   46.362182] cache1.core2: Failed to initialize
[   46.362183] cache1.core3: Failed to initialize
[   46.362184] cache1.core4: Failed to initialize
[   46.362223] [Open-CAS] Adding device /dev/disk/by-id/ata-SSDSC2KB480G8R_PHYF8446032P480BGN as cache cache1
[   46.362235] [Open-CAS] [Classifier] Initialized IO classifier
[   46.362260] [Open-CAS] Adding device /dev/disk/by-id/wwn-0x5000039562708b05 as core core1 to cache cache1
[   46.362262] [Open-CAS] Adding device /dev/disk/by-id/wwn-0x5000039961a8734e as core core2 to cache cache1
[   46.362263] [Open-CAS] Adding device /dev/disk/by-id/wwn-0x5000039881608fe5 as core core3 to cache cache1
[   46.362264] [Open-CAS] Adding device /dev/disk/by-id/wwn-0x5000039673688f92 as core core4 to cache cache1
[   46.367193] cache1.core3: Inserting core
[   46.367214] cache1: Adding core core3 failed
[   46.371824] cache1.core4: Inserting core
[   46.371832] cache1: Adding core core4 failed

@mmichal10
Contributor

mmichal10 commented Jul 10, 2023

Hello @brokoli18,

The cache couldn't be loaded because CAS couldn't open the core devices exclusively. Would it be possible to stop the cache, detach the disks from ceph, load the cache again and attach the cas exported objects (/dev/cas1-X) to ceph?
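
For reference, roughly the sequence I have in mind; the OSD id, VG name and cache device path below are placeholders for your setup:

# detach the disks from ceph: stop the OSDs and deactivate the LVM volumes
# that currently sit directly on the raw disks
systemctl stop ceph-osd@0          # repeat for each affected OSD id
vgchange -an <ceph-vg-name>        # repeat for each ceph VG on a raw disk

# stop the incomplete cache instance and load it again
casadm --stop-cache --cache-id 1
casadm --start-cache --cache-device /dev/disk/by-id/<cache-ssd> --load

# re-attach the OSDs on top of the exported objects (/dev/cas1-X)
ceph-volume lvm activate --all
systemctl start ceph-osd@0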

@brokoli18

Thank you for your response. What exactly do you mean by "detach the disks from ceph"? Do you mean:

  1. Stop ceph/lvm from trying to autostart the disks on boot
  2. Remove the disks from ceph/lvm configuration completely and readd them

For your reference, I have tried option 1 by masking the ceph-osd@ service at startup, and I am still in much the same situation. Although I could try option 2 here since it is a lab environment, it would not be a good outcome, as I can't start wiping disks in our prod environment when this sort of situation occurs.

@fpeg26

fpeg26 commented Sep 4, 2024

Hi,

Just letting you know that we're hitting the same issue in our setup.

The only "fix" that I found was to add this line in the open-cas.service file:
ExecStartPre=/bin/sleep 30
But this is not a proper solution.
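
In case it helps anyone, the same stopgap can be applied as a drop-in override instead of editing the packaged unit file (the 30 seconds is an arbitrary value we picked):

# /etc/systemd/system/open-cas.service.d/override.conf (created with: systemctl edit open-cas.service)
[Service]
ExecStartPre=/bin/sleep 30

followed by systemctl daemon-reload.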

I've also been experimenting on the startup order of the open-cas, lvm and ceph services with no luck.

Any help on this would be greatly appreciated.
Regards

@fpeg26

fpeg26 commented Sep 6, 2024

It also seems that if you zap the ceph volumes attached to the /dev/sd{a..d}1 devices like this:
for i in /dev/sd{a..d}1; do ceph-volume lvm zap --destroy $i;done
(you will need to allow those devices in /etc/lvm/lvm.conf first, and you might also need to stop the corresponding osd services)
and then reboot your server, the cas devices will be initialized properly and the ceph osds will use them as you would expect.
Subsequent reboots also work properly, so you should only need to run the above command once.
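
Roughly, the full sequence on one node looks like this (OSD ids are examples, and keep in mind that zap --destroy wipes the OSD contents, so ceph has to rebuild them from replicas afterwards):

# stop the OSD daemons that sit on the raw /dev/sd{a..d}1 partitions
systemctl stop ceph-osd@0 ceph-osd@1 ceph-osd@2 ceph-osd@3

# after temporarily allowing those devices in /etc/lvm/lvm.conf,
# wipe the LVM/ceph metadata so they no longer race the cas devices at boot
for i in /dev/sd{a..d}1; do ceph-volume lvm zap --destroy $i; done

reboot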

@mmichal10
Contributor

Hello @fpeg26,
we're trying to reproduce the problem on our setup. Could you please share which version of ceph you are using?

@fpeg26

fpeg26 commented Oct 17, 2024

Hi @mmichal10,

Thanks a lot for taking the time to have a look at this issue.
We're running ceph 18.2.2 on proxmox:
ceph-base/stable,now 18.2.2-pve1

If that helps, we're running this version of OpenCAS:
╔═════════════════════════╤═════════════════════╗
║ Name                    │ Version             ║
╠═════════════════════════╪═════════════════════╣
║ CAS Cache Kernel Module │ 22.12.0.0852.master ║
║ CAS CLI Utility         │ 22.12.0.0852.master ║
╚═════════════════════════╧═════════════════════╝

Let me know if I can provide more details

Regards

@jfckm

jfckm commented Oct 18, 2024

Hi @fpeg26,
I tried to reproduce that with a simple cluster, but it started up fine multiple times. Can you provide journalctl, dmesg and lvm.conf from your failing config? Maybe there will be some clues that show us what might've gone wrong.

@fpeg26

fpeg26 commented Oct 18, 2024

Hi,

Thank you for taking the time to give this a try.

Here is the content of the lvm.conf file we're using (with drive wwn ids masked).
It's the same file on every node, with the corresponding wwns updated accordingly:

devices {
     global_filter=["r|/dev/zd.*|","r|/dev/rbd.*|","r|/dev/disk/by-id/wwn-xxxxxx|","r|/dev/sda|","r|/dev/disk/by-id/wwn-xxxxxx|","r|/dev/sdb|","r|/dev/disk/by-id/wwn-xxxxxx|","r|/dev/sdc|","r|/dev/disk/by-id/wwn-xxxxxx|","r|/dev/sdd|","r|/dev/disk/by-id/wwn-xxxxxx|","r|/dev/sde|","r|/dev/disk/by-id/wwn-xxxxxx|","r|/dev/sdf|","r|/dev/disk/by-id/wwn-xxxxxx|","r|/dev/sdg|","r|/dev/disk/by-id/wwn-xxxxxx|","r|/dev/sdh|",]
     types=["cas",16]
}

Unfortunately, I won't be able to upload the other 2 requested files in public.
Let me know if there is a way I can send them to you in private.

Our OpenCAS setup looks like this:

type    id   disk       status    write policy   device
cache   1    /dev/sdi   Running   wb             -
├core   1    /dev/sda   Active    -              /dev/cas1-1
├core   2    /dev/sdb   Active    -              /dev/cas1-2
├core   3    /dev/sdc   Active    -              /dev/cas1-3
└core   4    /dev/sdd   Active    -              /dev/cas1-4
cache   2    /dev/sdj   Running   wb             -
├core   1    /dev/sde   Active    -              /dev/cas2-1
├core   2    /dev/sdf   Active    -              /dev/cas2-2
├core   3    /dev/sdg   Active    -              /dev/cas2-3
└core   4    /dev/sdh   Active    -              /dev/cas2-4

When a node fails to attach the core devices to the cache on boot, we see those lines in journalctl (one for each core device):

Sep 17 09:31:11 xyz lvm[1748]: /dev/sda excluded: device is rejected by filter config.
Sep 17 09:31:12 xyz open-cas-loader.py[2684]: Unable to attach core /dev/disk/by-id/wwn-xxx from cache 1. Reason: Error while adding core device to cache instance 1
Sep 17 09:31:12 xyz (udev-worker)[1685]: sda: Process '/lib/opencas/open-cas-loader.py /dev/sda' failed with exit code 1.

Ceph then uses /dev/sdX as lvms before OpenCAS was able to attach the core devices.

Just out of curiosity, when you restart a node, do you do anything special like flushing the cache or outing the osds on ceph, or do you just issue a simple reboot command?

Shutting down a node that's going to fail on the next reboot results in these lines in journalctl:

Sep 17 09:27:48 xyz casctl[2328286]: Unable to detach core /dev/disk/by-id/wwn-xxxxxx. Reason:
Sep 17 09:27:48 xyz casctl[2328286]: Error while detaching core device 1 from cache instance 1
Sep 17 09:27:48 xyz  casctl[2328286]: Device opens or mount are pending to this cache

Also (but that's a separate issue), we're using OpenCAS for a workstation workload and we noticed that the cache never flushes with the ALRU policy because the cluster never idles, even with the Activity Threshold set to 1 ms. For now we're flushing it manually.
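
(In case it helps anyone, the manual flush is a single casadm call; cache id 1 is just an example:)

# flush all dirty cache lines of cache instance 1 to its core devices
casadm --flush-cache --cache-id 1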

Hopefully this will help you understand our issue a bit better.

Regards

@robertbaldyga
Member

Hi @fpeg26 ,

I did a little experiment and managed to get a reproduction of the problem.
I looked at journalctl and I found the following message:

Oct 19 17:42:30 localhost lvm[752]:   Please remove the lvm.conf filter, it is ignored with the devices file.

It turns out that lvm by default is configured to use a new mechanism (the devices file) instead of filters.
Disabling the devices file enables the filters. So I added this line to my /etc/lvm/lvm.conf (in the devices section):

use_devicesfile = 0
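
For context, the devices section then ends up looking something like this (the filter shown here is only an illustration, keep whatever global_filter you already use):

devices {
     # disable the devices file so that the filters below are honoured
     use_devicesfile = 0
     # example filter only: accept cas devices, reject the backend disks
     global_filter=["a|/dev/cas.*|", "r|/dev/sd.*|"]
     types=["cas",16]
}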

After the reboot the cores were added correctly and the lvms detected on top of Open CAS devices:

type    id   disk       status    write policy   device
cache   1    /dev/vdb   Running   wt             -
├core   1    /dev/sda   Active    -              /dev/cas1-1
└core   2    /dev/sdb   Active    -              /dev/cas1-2
NAME                              MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda                                 8:0    0    5G  0 disk 
└─cas1-1                          251:0    0    5G  0 disk 
  └─ceph--276b3665--98c5--486d--87b3--f2ca8807cb3e-osd--block--30a9f46b--5b78--4fe2--b2ec--de7d2102e69f
                                  253:3    0    5G  0 lvm  
sdb                                 8:16   0    5G  0 disk 
└─cas1-2                          251:256  0    5G  0 disk 
  └─ceph--acc15d17--b9d9--4402--bc77--bde81240d449-osd--block--bd617c4c--00e3--41d6--a201--5836ec1423a7
                                  253:4    0    5G  0 lvm  

Let me know if that solves the problem on your configuration.

@robertbaldyga
Member

I also managed to get the proper behavior with use_devicesfile = 1, but to make it work Open CAS needs to provide a serial for its block devices, so that lvm can distinguish them properly from the backend devices. I prepared a patch that does this: #1574.

@robertbaldyga
Member

Just out of curiosity, when you restart a node, do you do anything special like flushing the cache or outing the osds on ceph, or do you just issue a simple reboot command?

Shuting down a node that's gonna fail on next reboot results in those lines in journalctl:

Sep 17 09:27:48 xyz casctl[2328286]: Unable to detach core /dev/disk/by-id/wwn-xxxxxx. Reason:
Sep 17 09:27:48 xyz casctl[2328286]: Error while detaching core device 1 from cache instance 1
Sep 17 09:27:48 xyz  casctl[2328286]: Device opens or mount are pending to this cache

We have two redundant shutdown methods. One is the service that you just noticed failing; the other is a systemd-shutdown script that is executed later in the shutdown process. In my tests with Ceph I also see the first one fail, but the second one correctly stops the cache (you can see it in the kernel log). Not super elegant, but good enough.

@fpeg26

fpeg26 commented Oct 21, 2024

Hi @robertbaldyga,

Thank you for taking the time to look into this further

Unfortunately, we don't see the LVM filter message on any of the hosts in our cluster.
However, if I add the use_devicesfile = 0 line to lvm.conf and run lvmconfig --test, there are no complaints, which suggests that this feature might already be enabled.
We're also missing the lvmdevices command, which is typically used to manage the devices file (unless it's done manually).
Could it be that this new mechanism was only fully implemented in a newer LVM release? It seems possible that the error message we're expecting is not present in our version and was introduced later.
I will check the LVM Git repo and try to find more details.

This is our setup to compare with yours:

# Kernel
Linux localhost 6.5.13-6-pve

# LVM packages
libllvm15/stable,now 1:15.0.6-4+b1 amd64 [installed,automatic]
liblvm2cmd2.03/stable,now 2.03.16-2 amd64 [installed,automatic]
lvm2/stable,now 2.03.16-2 amd64 [installed]

I will apply the configuration change on each host to observe how it behaves with this setting in place.

Thanks as well for clarifying the shutdown methods. I just wanted to confirm that shutting down a node without flushing the cache is considered a good practice in a Ceph environment.

EDIT:
I checked the LVM config on each node before manually adding the use_devicesfile = 0 line, and they are all already set up like this by default:

lvmconfig --type full | grep use_devices
	use_devicesfile=0

So it feels like we're on the wrong track 😞

@robertbaldyga
Member

@fpeg26 Are you sure there is no other reference to /dev/sd* than /dev/disk/by-id/wwn-*? Maybe your filter does not cover those paths, and that's why LVM is able to start on the backend device. I even checked whether LVM on the Proxmox platform behaves differently, but it doesn't seem so.

@fpeg26

fpeg26 commented Oct 23, 2024

The lvm.conf I posted earlier is the exact one we have on each host. Note that I filtered out both /dev/disk/by-id/wwn-* and /dev/sd* in an attempt to fix this issue. Prior to that, I was only filtering /dev/disk/by-id/wwn-*.

Unless there is another file that got populated automatically that I'm not aware of, lvm.conf is the only place where I filtered them manually.

Also, I can tell that it's working, at least partially, because if I comment out the global_filter line, I get this warning for each filtered-out device when running lvm commands:
WARNING: Not using device /dev/sda for PV 1aSner-oik2-ad2c-fljc-Ynzg-pBkl-fjvieML.

Now, I noticed some strange behaviors while writing this comment:

  • on one node, commenting out the global_filter line doesn't display the above warnings at all
  • on another node, the warning only appears for 6/8 devices used by OpenCAS/Ceph. On another it's 4/8...
  • one node doesn't see the lvm at all. ceph-volume lvm list reports nothing even though the OSDs are in and up and ceph is "healthy"
  • only one node out of 9 in the cluster is showing the correct information: 8 warnings and pvs, lvs, vgs, ceph in the correct state

So it's really inconsistent across the cluster even though they all share the exact same configuration.
All nodes displayed the same problem after their first reboot: lvm using sdX instead of casX-X. They were all fixed the same way, using the method described earlier (#1215 (comment)), and matched what was expected (pvs, lvs, vgs, ceph in the correct state), but they got worse over time after a few reboots here and there.

@robertbaldyga
Member

My guess is that the gradual degradation is a result of using Write-Back. When the LVM metadata is stored as dirty data on the cache, LVM will not recognize the backend device. Once it gets flushed to the backend, after the next reboot LVM starts on the backend device, because now it can see its metadata there.

Can you verify that ls -l /dev/disk/* | grep /sd* shows only wwn-*?

@fpeg26

fpeg26 commented Oct 23, 2024

That would make a lot of sense.

No, ls -l /dev/disk/* | grep /sd* returns a whole bunch of different things like this (one example of each kind of result):

lrwxrwxrwx 1 root root  9 Oct 15 08:06 9 -> ../../sda
lrwxrwxrwx 1 root root  9 Oct 15 08:06 ata-DELLBOSS_VD_xyz -> ../../sdk
lrwxrwxrwx 1 root root  9 Oct 15 08:06 lvm-pv-uuid-xyz -> ../../sda
lrwxrwxrwx 1 root root  9 Oct 15 08:06 scsi-xyz -> ../../sda
lrwxrwxrwx 1 root root  9 Oct 15 08:06 wwn-xyz -> ../../sda
lrwxrwxrwx 1 root root 10 Oct 15 08:06 EFI\x20System\x20Partition -> ../../sdk1
lrwxrwxrwx 1 root root 10 Oct 15 08:06 6a803660-0b54-43bb-bd8a-2e43321459411c -> ../../sdk1
lrwxrwxrwx 1 root root  9 Oct 15 08:06 pci-0000:01:00.0-sas-phy0-lun-0 -> ../../sda
lrwxrwxrwx 1 root root  9 Oct 15 08:06 pci-0000:a1:00.0-ata-1 -> ../../sdk
lrwxrwxrwx 1 root root 10 Oct 15 08:06 pci-0000:a1:00.0-ata-1.0-part1 -> ../../sdk1
lrwxrwxrwx 1 root root 10 Oct 15 08:06 W4FI-F189 -> ../../sdk1

@robertbaldyga
Member

robertbaldyga commented Oct 25, 2024

Ok, so you'd need to filter out all of those links, or add one rule for the entire /dev/disk directory like "r|/dev/disk/.*|", so that LVM cannot match the backend device by any of those paths.

Alternatively you can try the current Open CAS master (https://github.com/Open-CAS/open-cas-linux/tree/588b7756a957417430d6eca17ccb66beae051365) with use_devicesfile = 1 and the backend devices removed from /etc/lvm/devices/system.devices. If that method suits your setup better, we can release the needed changes in a dot release within a few weeks.
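
If you go the devices-file route and your LVM build ships the lvmdevices tool, the entries can be managed roughly like this (device names are examples matching the setup discussed here):

# add the Open CAS exported objects to /etc/lvm/devices/system.devices
lvmdevices --adddev /dev/cas1-1
lvmdevices --adddev /dev/cas1-2

# remove the backend disks so LVM stops scanning them directly
lvmdevices --deldev /dev/sda
lvmdevices --deldev /dev/sdb

# print the resulting devices file entries
lvmdevices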

@fpeg26

fpeg26 commented Oct 25, 2024

Thanks @robertbaldyga it's really appreciated.
I will test both solutions and report back asap.

I have a few questions though:

  • If I filter out everything from /dev/disk, what will happen with the OS disk (sdk) that has LVs on it? (sda-sdh = cores, sdi-j = caches, sdk = OS disk)
  • What will happen if a cache disk fails and needs to be replaced? I guess I will have to un-filter the raw devices assigned to this drive so ceph can still function?

@robertbaldyga
Member

robertbaldyga commented Oct 25, 2024

Thanks @robertbaldyga it's really appreciated. I will test both solutions and report back asap.

I have a few questions though:

  • If I filter out everything from /dev/disk, what will happen with the OS disk (sdk) that has LVs on it? (sda-sdh = cores, sdi-j = caches, sdk = OS disk)

The base path /dev/sdk is not filtered out, so LVM should be able to recognize it. You can also add "a|/dev/sdk|" at the beginning of the filter list, to make sure it will not be affected by any other filter rules. That way you can even simplify your filter to something like this:

global_filter=["a|/dev/disk/by-id/wwn-of-OS-disk|", "r|/dev/zd.*|", "r|/dev/rbd.*|", "r|/dev/sd.*|", "r|/dev/disk/.*|"]

or even:

global_filter=["a|/dev/disk/by-id/wwn-of-OS-disk|", "a|/dev/cas.*|", "r|.*|"]
  • What will happen if a cache disk fails and needs to be replaced? I guess I will have to un-filter the raw devices assigned to this drive so ceph can still function?

Yes, if you want to move back to the backend devices, you need to allow them in the filter.
The major consideration when using Write-Back mode is that most of the time the data is not fully flushed to the backend devices, so you need to make sure to flush the cache before switching back to the backend devices. If that's impossible (like after a cache disk failure), then you would most likely have to reinitialize the OSDs and Ceph would recreate the content from the replicas. That may generate additional load on the cluster. If you want to avoid such a situation, you can set up the cache on two SSDs in a RAID1 configuration.
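
A minimal sketch of that RAID1 cache variant, assuming mdadm and two spare SSDs (device names, cache id and cache mode are examples):

# mirror the two SSDs and start the cache on top of the md device
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
casadm --start-cache --cache-device /dev/md0 --cache-id 1 --cache-mode wb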

@fpeg26

fpeg26 commented Oct 30, 2024

So I've tested the new global_filter you suggested and unfortunately, it didn't help.

devices {
     global_filter=["a|/dev/disk/by-id/ata-DELLBOSS_VD_ec320321312as2a9c1232130|", "a|/dev/cas.*|", "r|.*|"]
     types=["cas",16]
}

I rebooted the node and it came back with all OSDs down and empty folders under /var/lib/ceph/osd.
Both of my caches were in an incomplete state:

type    id   disk       status       write policy   device
cache   2    /dev/sdj   Incomplete   wb             -
├core   1    /dev/sde   Active       -              /dev/cas2-1
├core   2    /dev/sdf   Inactive     -              -
├core   3    /dev/sdg   Active       -              /dev/cas2-3
└core   4    /dev/sdh   Active       -              /dev/cas2-4
cache   1    /dev/sdi   Incomplete   wb             -
├core   1    /dev/sda   Inactive     -              -
├core   2    /dev/sdb   Inactive     -              -
├core   3    /dev/sdc   Active       -              /dev/cas1-3
└core   4    /dev/sdd   Active       -              /dev/cas1-4

Now, I can see that lvm is excluding the 3 inactive devices, and I am seeing entries like this in journalctl (only for those 3):

Oct 30 04:28:10 localhost lvm[1723]: /dev/sdf excluded: device is rejected by filter config.
Oct 30 04:28:10 localhost lvm[1722]: /dev/sdb excluded: device is rejected by filter config.
Oct 30 04:28:10 localhost lvm[1722]: /dev/sda excluded: device is rejected by filter config.

But Ceph is still using them as targets, which I don't understand:

NAME                                                                                      MAJ:MIN  RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                         8:0     0   2.2T  0 disk 
└─ceph--319da2ec--a844--44c5--be92--1a02e313cf08-osd--block--a328f07c--4ff7--4578--a914--aac07c344f9d
                                                                                          252:3     0   2.2T  0 lvm  
sdb                                                                                         8:16    0   2.2T  0 disk 
└─ceph--4ad2c2b7--b7fb--4b25--ba7b--029e400da73b-osd--block--b3d00f67--bd6d--45a2--9d44--2dc960d685fb
                                                                                          252:4     0   2.2T  0 lvm  
sdc                                                                                         8:32    0   2.2T  0 disk 
└─cas1-3                                                                                  251:0     0   2.2T  0 disk 
sdd                                                                                         8:48    0   2.2T  0 disk 
└─cas1-4                                                                                  251:256   0   2.2T  0 disk 
sde                                                                                         8:64    0   2.2T  0 disk 
└─cas2-1                                                                                  251:512   0   2.2T  0 disk 
sdf                                                                                         8:80    0   2.2T  0 disk 
└─ceph--1d9c8b4a--627a--463a--b357--49d61a9dfc84-osd--block--457c2716--7904--4c18--b57f--52c65b235322
                                                                                          252:2     0   2.2T  0 lvm  
sdg                                                                                         8:96    0   2.2T  0 disk 
└─cas2-3                                                                                  251:768   0   2.2T  0 disk 
sdh                                                                                         8:112   0   2.2T  0 disk 
└─cas2-4                                                                                  251:1024  0   2.2T  0 disk 

And Ceph is actually only using those 3; the cas devices were not selected to mount any OSDs.
The end result is the same for all OSDs as said earlier: all down, with empty folders and failing services.

I will fix this node, let Ceph recover it, test the use_devicesfile = 1 solution and report back.

@fpeg26

fpeg26 commented Oct 30, 2024

Also, on an unrelated note, I was trying to compile OpenCAS master and found a line that is missing a ";" at the end:
https://github.com/Open-CAS/open-cas-linux/blob/master/casadm/cas_lib.c#L1875

@fpeg26

fpeg26 commented Oct 30, 2024

Quick update about use_devicesfile = 1: my version of lvm doesn't have the lvmdevices command, so I'm not sure what system.devices should look like. Is it just a list of devices with one /dev/disk/by-id/xyz per line? I was not able to find much info about this online, but I will keep digging.

In the meantime, I updated Ceph (18.2.4-pve3) and OpenCAS (24.09.0.0909.master) and the behavior is the same as previously; Ceph might use the /dev/sdX devices after a reboot even if the devices are excluded by lvm filters:

  PV          VG                                        Fmt  Attr PSize   PFree
  /dev/cas1-4 ceph-b1c39c2a-508d-4c01-9d52-22a326ec9489 lvm2 a--    2.18t    0 
  /dev/cas2-4 ceph-6096de43-3a34-46aa-a023-2643c2b64707 lvm2 a--    2.18t    0 
  /dev/sda    ceph-79eb3207-f751-439e-81db-0fac4a0cbce4 lvm2 a--    2.18t    0 
  /dev/sdb    ceph-69b5a73a-211f-4023-a288-9caee1447fdb lvm2 a--    2.18t    0 
...

@fpeg26

fpeg26 commented Oct 31, 2024

I was able to "test" use_devicesfile = 1. I added a bunch of devices like this to /etc/lvm/devices/system.devices:

devices = [
    "/dev/sdk",
    "/dev/cas1-1",
    "/dev/cas1-2",
    "/dev/cas1-3",
    "/dev/cas1-4",
    "/dev/cas2-1",
    "/dev/cas2-2",
    "/dev/cas2-3",
    "/dev/cas2-4",
]

It appears that this is not the right syntax, but what I was able to demonstrate here is that even when forcing lvm to use a devices file, it ignores it and still attaches the Ceph vgs to random pvs after a reboot, like below:

  PV          VG                                        Fmt  Attr PSize   PFree
  /dev/cas1-2 ceph-ce0ca297-0916-4942-a859-b478c09d6106 lvm2 a--    2.18t    0 
  /dev/sda    ceph-0f4c7589-18ff-4502-9e77-d6e435f233fb lvm2 a--    2.18t    0 
  /dev/sdc    ceph-d9ba47f1-9fda-409b-b3cf-aef3df2a53d1 lvm2 a--    2.18t    0 
  /dev/sdd    ceph-d8c99cfa-3e49-4937-a12e-1e26652ac997 lvm2 a--    2.18t    0 
  /dev/sde    ceph-50c1eccf-0e46-47fe-b9a2-4044bfbba663 lvm2 a--    2.18t    0 
  /dev/sdf    ceph-9bcad917-7c6b-4421-bde6-3365edff92bd lvm2 a--    2.18t    0 
  /dev/sdg    ceph-f97c1d1e-6765-407c-b96a-0b82d1a3ca7b lvm2 a--    2.18t    0 
  /dev/sdh    ceph-b8850d2b-a7ac-4436-9b66-d20a6486db8b lvm2 a--    2.18t    0 

See here that it decided to use one CAS device and took raw devices for the rest of the OSDs.
On top of that, the pvs, lvs and vgs commands returned nothing until I commented out the use_devicesfile = 1 line in lvm.conf. So lvm knows that it should ignore those devices, but decides not to follow the directive on boot.

@fpeg26

fpeg26 commented Nov 1, 2024

It seems that adding open-cas.service to the After= and Requires= settings of the ceph-volume@.service unit fixes the issue:

/usr/lib/systemd/system# cat ceph-volume@.service
[Unit]
Description=Ceph Volume activation: %i
After=local-fs.target open-cas.service
Requires=open-cas.service
...
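
The same ordering can also be applied as a drop-in, so a ceph package update doesn't overwrite it, for example:

# /etc/systemd/system/ceph-volume@.service.d/override.conf (created with: systemctl edit ceph-volume@.service)
[Unit]
After=open-cas.service
Requires=open-cas.service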

I will do more investigation with this, but it's promising so far.
edit: the issue is back on one node, so I guess it was false hope. I was able to reboot a few times with no problem; the issue came back after the 3rd or 4th reboot.
edit2: got 6/6 working consecutive reboots by replacing the After/Wants settings with this:

ExecStartPre=/bin/sh -c 'until [ -e /dev/cas2-4 ]; do sleep 1; done'

edit3: tried again this morning (2 days after posting edit2) and the issue is back

Also, what's the best practice when it comes to updating OpenCAS? I've been doing it on a couple of hosts, but I had to re-create the caches manually and rebuild the OSDs after the update, which can take a while because of the rebalance process.

@fpeg26

fpeg26 commented Nov 6, 2024

One more update: I updated all nodes to Ceph 19.2.0 and I see the same behavior on restart.

@robertbaldyga
Member

robertbaldyga commented Nov 7, 2024

@fpeg26 I did some experiments on Proxmox and reproduced the same behavior. It seems that Proxmox initializes all the LVM volumes in the initramfs. After setting the filter in /etc/lvm/lvm.conf I called update-initramfs -u and it worked. Interestingly, I did not observe this problem on other distros, even when the rootfs was on LVM. Let me know if that resolves the issue on your setup.

@fpeg26

fpeg26 commented Nov 8, 2024

I've been running some tests all day and update-initramfs -u definitely seems to be helping.

One of my nodes was not happy about it though and was showing the same reboot behavior, but I think I narrowed it down to the lvm filters.

Filtering by device id is totally ignored on boot; I had to use /dev/sdX paths instead.

Using this:
global_filter=["a|/dev/disk/by-id/wwn-of-OS-disk|", "a|/dev/cas.*|", "r|.*|"]
or this:
global_filter=["a|/dev/disk/by-id/wwn-of-OS-disk.*|", "a|/dev/cas.*|", "r|.*|"]
resulted in the host booting into initramfs mode, which is an improvement in the sense that I now know the filters are actually being used.
But I had to set the filters like this:
global_filter=["a|/dev/sdk.*|", "a|/dev/cas.*|", "r|.*|"]
to get a proper boot and it looks like Ceph is now using the proper cas devices.

I will report back next week after I've done more tests but it looks promising.
Thanks a lot.

@fpeg26

fpeg26 commented Nov 20, 2024

Hi,
I have been testing extensively for the past 2 weeks and have not been able to reproduce the problem since!

Still have to use devices by "name" like this:
global_filter=["a|/dev/sdk.*|", "a|/dev/cas.*|", "r|.*|"]
And I'm not entirely sure why...

The only thing I noticed is that after a reboot the lettering of the devices sdA-sdJ can be shuffled, but it doesn't impact the cluster in any way because OpenCAS is set up to use the devices by id and Ceph uses the CAS devices.
A reboot usually fixes the order.

Quick TLDR if somebody passes by looking for a fix (a sketch follows the list):

  • set up the lvm filters properly using device names, not devices by id.
  • run update-initramfs -u
  • this only seems to be required in a Proxmox environment
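
Concretely, on this cluster (where /dev/sdk is the OS disk), the fix boiled down to:

# /etc/lvm/lvm.conf, devices section
global_filter=["a|/dev/sdk.*|", "a|/dev/cas.*|", "r|.*|"]
types=["cas",16]

# rebuild the initramfs so the filter is also honoured during early boot, then reboot
update-initramfs -u
reboot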

Thanks a lot for your help!
