
Apparent corruption with readonly cache?? #4

Open
alinsavix opened this issue Mar 19, 2019 · 4 comments

@alinsavix

Y'all seem to have the most active of the EnhanceIO forks, so I'm going to ask here to see if perhaps you can help me. I'm generally pretty clueful, but I've got no clue how to even start debugging a problem like this, so I'm reaching out. EnhanceIO seems like the only caching solution that actually meets my needs, so I really appreciate any help/pointers/etc you could give.

The setup:
Spinning rust: A Drobo 5C (think "raid array"), formatted NTFS (don't judge me!), connected via USB3, and mounted as /export/Drobo using the ntfs-3g FUSE module (this setup has worked flawlessly for about a decade, FWIW). The device shows up as a 64TB filesystem (despite not having that much actual disk) due to thin provisioning magic that is transparent to anything using the device.
Cache disk: Partition #3 on a SATA-attached Samsung 860 EVO, roughly 120GB in size
Kernel: The 4.4.176 "LTS" kernel, as provided by elrepo (specifically 4.4.176-1.el7.elrepo.x86_64)
enhanceio: current HEAD (009db3a)
userspace: RHEL 7.6, though this shouldn't matter much, since the kernel is a completely stock 4.4 kernel (and I've verified that none of the "Red Hat backported a bunch of stuff" fixes to the driver build are getting included)

Cache configured with:

# eiocli create -d /dev/disk/by-id/<drobo>-part2 -s /dev/disk/by-id/<ssd>-part3 -c DroboCache -p lru -m ro -b 4096
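As a sanity check right after creating the cache, I look at the cache config to confirm it really came up read-only with the LRU policy. This assumes the stock EnhanceIO /proc layout and that this fork's eiocli still provides an info subcommand; both are assumptions on my part:

# eiocli info
# cat /proc/enhanceio/DroboCache/config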

The problem:

I was testing this out by doing ls -alR /export/Drobo to prime the cache with at least the directory entries (the device has several hundred thousand files on it). After this has been running for 10 or 20 seconds, the system log starts to spam messages like this:

Mar 18 15:46:43 hostname ntfs-3g[5350]: ntfs_mst_post_read_fixup_warn: magic: 0x00000000  size: 1024   usa_ofs: 0  usa_count: 0: Invalid argument
Mar 18 15:46:43 hostname ntfs-3g[5350]: Record 404427 has no FILE magic (0x0)
Mar 18 15:46:44 hostname ntfs-3g[5350]: ntfs_mst_post_read_fixup_warn: magic: 0x00000000  size: 1024   usa_ofs: 0  usa_count: 0: Invalid argument
Mar 18 15:46:44 hostname ntfs-3g[5350]: Record 405305 has no FILE magic (0x0)
Mar 18 15:46:44 hostname ntfs-3g[5350]: ntfs_mst_post_read_fixup_warn: magic: 0x00000000  size: 1024   usa_ofs: 0  usa_count: 0: Invalid argument
Mar 18 15:46:44 hostname ntfs-3g[5350]: Record 405304 has no FILE magic (0x0)

Once this starts happening, attempting to just walk the filesystem results in I/O errors:

# find /export/Drobo -name zzzzzzz
find: '/export/Drobo/DigitalPhotos/2017/Projectname/Subjectname/Subdir': Input/output error

...and accompanying log entries...

Mar 18 18:42:09 hostname ntfs-3g[7525]: ntfs_mst_post_read_fixup_warn: magic: 0x51eea9f5  size: 4096   usa_ofs: 6229  usa_count: 57650: Invalid argument
Mar 18 18:42:09 hostname ntfs-3g[7525]: Actual VCN (0x9915f5e410ab58f2) of index buffer is different from expected VCN (0x1d8) in inode 0xee8e.
Mar 18 18:42:11 hostname ntfs-3g[7525]: ntfs_mst_post_read_fixup_warn: magic: 0x71fbe439  size: 4096   usa_ofs: 973  usa_count: 56360: Invalid argument
Mar 18 18:42:11 hostname ntfs-3g[7525]: Actual VCN (0x22cf393f4f92db55) of index buffer is different from expected VCN (0x1d8) in inode 0x10311.
Mar 18 18:42:12 hostname ntfs-3g[7525]: ntfs_mst_post_read_fixup_warn: magic: 0x71f8439a  size: 4096   usa_ofs: 7623  usa_count: 232: Invalid argument

(I'm assuming the specific ntfs-3g errors won't be meaningful to anyone except the ntfs-3g developers, but I'm including them for completeness.)

Once this starts happening, to get a completely functional filesystem again I need to disable the cache and remount the filesystem (I'm guessing some bad blocks are getting into the buffer cache).
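The recovery sequence I use is roughly the following (the device path is the same placeholder as above, and the drop_caches step is just a guess based on my buffer-cache suspicion):

# umount /export/Drobo
# eiocli delete -c DroboCache
# echo 3 > /proc/sys/vm/drop_caches
# mount -t ntfs-3g /dev/disk/by-id/<drobo>-part2 /export/Drobo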

At no point in this exercise does /proc/enhanceio/DroboCache/errors show any nonzero values.

Any ideas? Is there any debugging functionality available that I could enable to allow me to provide more information? Halp?
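For reference, these are the only per-cache proc entries I know how to check (assuming the standard EnhanceIO layout of stats, errors, io_hist and config; errors is the one quoted above):

# watch -n1 cat /proc/enhanceio/DroboCache/errors
# cat /proc/enhanceio/DroboCache/stats
# cat /proc/enhanceio/DroboCache/io_hist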

@NickPublica

I know it's been nearly a couple of years and everyone has probably moved on, but I'm getting the same issue. Did a write-back cache on one partition and that works flawlessly. Did a read-only cache on my boot drive and had instant corrupted blocks reading from it (tried verifying a completed torrent). From this point, no apps will load and everything errors out. Anything that gets touched (read/written) gets corrupted. I assumed running it off the boot drive caused it, but perhaps it's just the read-only mode. Might take the plunge and do a write-through cache and hope for the best.

Ubuntu 20, an external USB 3.1 SSD, and two partitions running off mdadm in RAID5 and RAID6 modes.

Wish me luck. If you never hear from me again, write-through either didn't work or I'm happily chugging away.

@NickPublica

Tried it on my RAID6 boot drive with write-through, and verifying a torrent immediately led to about 20% corrupt data, followed by the system going downhill quickly. Deleting the cache, rebooting, and reverifying the same corrupt torrent brought me back to around 99% good data. I assume the 1% was just the torrent running for a few seconds trying to overwrite the "bad" stuff. So it turns out this is semi-reversible if one is experiencing corruption.

The secondary partition running RAID5 and read-only still appears fine after all this time, so perhaps it just doesn't like caching boot drives.
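For a torrent-client-independent version of that check, a plain checksum comparison works too; everything below (device names, mount point, test file) is a placeholder for whatever actually lives on the cached array:

# sha256sum /mnt/raid5/testfile > /tmp/baseline.sum            # taken with no cache configured
# eiocli create -d /dev/md0 -s /dev/sdX1 -c TestCache -p lru -m ro -b 4096
# echo 3 > /proc/sys/vm/drop_caches                            # force reads back down to the block layer
# sha256sum /mnt/raid5/testfile | diff /tmp/baseline.sum -     # first pass populates the cache
# echo 3 > /proc/sys/vm/drop_caches
# sha256sum /mnt/raid5/testfile | diff /tmp/baseline.sum -     # second pass should be served from the SSD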

@elmystico
Owner

Hi, are you guys still interested in this subject?

@NickPublica

Yea, I think this project has something that the other caching solutions do not - you can add/remove/modify the cache without needing to wipe the whole partition.

I’m personally interested in it.
