Commit

Added comment regarding Windows implementation.
Mart-Bogdan committed Nov 19, 2022
1 parent ba0b162 commit f3c6c12
Showing 2 changed files with 94 additions and 31 deletions.
67 changes: 67 additions & 0 deletions src/lib.rs
@@ -80,6 +80,7 @@ impl FileInfo {
if md.is_dir() {
Ok(FileInfo::Directory {
volume_id: md.dev(),
// TODO: we can provide size for directories. Linux reports it, and `du` actually uses it in its calculations
})
} else {
let size = if apparent {
@@ -103,6 +104,71 @@ impl FileInfo {
use winapi::um::winnt::LARGE_INTEGER;
use winapi_util::Handle;

/*
Windows implementation notice.
File size is tricky on Windows.
Here is an article by Raymond Chen from Microsoft: https://devblogs.microsoft.com/oldnewthing/20160427-00/?p=93365
Quote from there:
```
The algorithm for “Size on disk” is as follows:
* If the file is sparse, then report the number of non-sparse bytes.
* If the file is compressed, then report the compressed size. The compressed size may be less than a full sector.
* If the file is neither sparse nor compressed, then report the nominal file size, rounded up to the nearest cluster.
Starting in Windows 8.1, the Size on disk calculation includes the sizes of alternate data streams
and sort-of-kind-of tries to guess which streams could be stored in the MFT
and not count them toward the size on disk. (Even though they really are on disk.
I mean, if they’re not on disk, then where are they?)
```
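A minimal sketch of the three rules quoted above, for illustration only.
The `FileFacts` struct and both function names are hypothetical and are not part of dirstat-rs or the Win API.
It ignores alternate data streams and the small-file quirks described in this notice.
```rust
// Hypothetical description of one file; the caller is assumed to have
// gathered these values from the file system already.
struct FileFacts {
    // Logical (nominal) file size in bytes.
    nominal_size: u64,
    // Some(n) if the file is sparse, where n is the number of non-sparse bytes.
    non_sparse_bytes: Option<u64>,
    // Some(n) if the file is compressed, where n is the compressed size.
    compressed_size: Option<u64>,
}

// Round `size` up to the nearest multiple of `cluster_size`.
fn round_up_to_cluster(size: u64, cluster_size: u64) -> u64 {
    assert!(cluster_size > 0, "cluster size must be non-zero");
    let rem = size % cluster_size;
    if rem == 0 {
        size
    } else {
        size + (cluster_size - rem)
    }
}

// Apply the quoted rules in order: sparse, then compressed, then plain.
fn size_on_disk(file: &FileFacts, cluster_size: u64) -> u64 {
    if let Some(non_sparse) = file.non_sparse_bytes {
        non_sparse
    } else if let Some(compressed) = file.compressed_size {
        compressed
    } else {
        round_up_to_cluster(file.nominal_size, cluster_size)
    }
}
```
With a 4096-byte cluster, a plain 4000-byte file rounds up to 4096 and a plain 8000-byte file to 8192.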
From my research and observations:
The Win API reports size on disk weirdly for small files.
The AllocationSize field of the structs is not always divisible by the FS cluster size.
AllocationSize can also be one byte bigger than the size reported by the GetCompressedFileSize function,
or the value obtained from the FILE_COMPRESSION_INFO struct.
But when we open the properties window for such a file, Explorer reports a size of 0.
That indicates that the file is actually stored inside the directory.
We can also get the size of a directory on disk using these APIs.
In my opinion (Bohdan Mart), the perfect solution would be to somehow detect that a file is stored inline,
and report a size of 0 for such a file, then add the directory's own size to the final result.
The current implementation just adds the reported physical file size if the -a flag is provided,
which is consistent with dirstat-rs versions 0.3.8 and earlier.
I have also noticed that we can read file sizes in bulk from a directory handle (fd).
Basically, we pass in a directory handle and receive an iterator of FILE_FULL_DIR_INFO,
which contains both AllocationSize and EndOfFile.
Whether this benefits performance needs to be tested.
The second problem is alternate data streams. On Windows each file can have multiple data streams,
and the main stream is called $DATA. A data stream can be opened by opening the file with a name ending
in `:<streamName>`, like "some_file.txt:stream1".
Experimentation shows that Windows Explorer counts the size of data streams as well.
On Windows each stream can have any length, even several TB. On Linux there is a similar feature,
called *extended attributes*, but it has a limited size.
It would be nice for dirstat to get the size of alternate data streams. Unfortunately that is not compatible
with FILE_FULL_DIR_INFO, so it should be tested whether that optimisation is actually needed.
Perhaps calculating the size of alternate data streams can be an optional flag, to maximize performance.
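A small sketch of the `:<streamName>` naming described above.
The helper name `stream_path` is hypothetical and is not part of dirstat-rs; it only builds the path and does not touch the file system.
```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

// Build the path that addresses a named data stream of `file`,
// e.g. "some_file.txt" + "stream1" -> "some_file.txt:stream1".
fn stream_path(file: &Path, stream_name: &str) -> PathBuf {
    let mut raw: OsString = file.as_os_str().to_os_string();
    raw.push(":");
    raw.push(stream_name);
    PathBuf::from(raw)
}
```
On Windows with NTFS, passing such a path to `std::fs::metadata` would presumably report the length of that stream; this is an assumption and should be verified before relying on it.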
Possible future work items:
1. Check if getting the list of files and their sizes in bulk would be beneficial.
2. Try to mimic the Windows Explorer algorithm to calculate file size.
3. Get the size of alternate data streams
(perhaps we don't need to get the regular size, as it is reported with the data streams as well)
More info in this SO question: https://stackoverflow.com/questions/51033508/how-do-i-get-the-size-of-file-in-disk
A playground I used to experiment with the API: https://gist.github.com/Mart-Bogdan/bda2995621911254f73f80d157f07622
*/

let h = Handle::from_path_any(path)?;
let std_info: FILE_STANDARD_INFO = get_file_information_by_handle_ex(&h)?;
// It's unfortunate that we have to make a second syscall just to get the volume serial number
@@ -112,6 +178,7 @@ impl FileInfo {
if std_info.Directory != 0 {
Ok(FileInfo::Directory {
volume_id: id_info.VolumeSerialNumber,
// TODO file size is actually provided for directories. We can use it to provide more precise info.
})
} else {
let size: LARGE_INTEGER = if apparent {
58 changes: 27 additions & 31 deletions src/tests.rs
@@ -5,6 +5,7 @@ use crate::{DiskItem, FileInfo};
// warn: don't remove `as &str` after macro invocation.
// It breaks type checker in Intellij Rust IDE
use const_format::concatcp;
#[cfg(windows)]
use once_cell::sync::Lazy;
use rstest::*;
use std::fs::File;
@@ -92,37 +93,32 @@ fn test_files_logical_size(#[case] file: &str, #[case] size: u64) {
assert_size(&file, false, size);
}

#[test]
fn test_files_physical_size() {
// Can't test top dir, as compressed files would mess the picture

// following are windows quirks/optimisations
if cfg!(windows) {
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b23_rand"), true, 24);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b23_zero"), true, 24);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b512_rand"), true, 512);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b512_zero"), true, 512);
} else {
// TODO this is really FS-dependent. On WSL with NTFS they would all be 0. With Ext4 they would be 4096.
// Either add FS-specific logic, or don't assert this. I guess the second option, as otherwise tests
// aren't reproducible.

// assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b23_rand"), true, 0);
// assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b23_zero"), true, 0);
// assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b512_rand"), true, 0);
// assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b512_zero"), true, 0);
}

assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b4000_rand"), true, 4096);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b4000_zero"), true, 4096);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b4096_rand"), true, 4096);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b4096_zero"), true, 4096);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b8000_rand"), true, 8192);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b8192_rand"), true, 8192);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "b8192_zero"), true, 8192);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "rand_1000"), true, 4096);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "text1.txt"), true, 4096);
assert_size(concatcp!(TEST_PRE_CREATED_DIR, "text2.txt"), true, 12288);
#[rstest]
// Can't test the top dir, as compressed files would skew the picture, so there is no case for ""
#[cfg_attr(windows, case("b23_rand", 24))]
#[cfg_attr(windows, case("b23_zero", 24))]
#[cfg_attr(windows, case("b512_rand", 512))]
#[cfg_attr(windows, case("b512_zero", 512))]
// TODO this is really FS-dependent. On WSL with NTFS they would all be 0. With Ext4 they would be 4096.
// Either add FS-specific logic, or don't assert this. I guess the second option, as otherwise tests
// aren't reproducible.
// #[cfg_attr(not(windows),case("b23_rand", 0))]
// #[cfg_attr(not(windows),case("b23_zero", 0))]
// #[cfg_attr(not(windows),case("b512_rand", 0))]
// #[cfg_attr(not(windows),case("b512_zero", 0))]
#[case("b4000_rand", 4096)]
#[case("b4000_zero", 4096)]
#[case("b4096_rand", 4096)]
#[case("b4096_zero", 4096)]
#[case("b8000_rand", 8192)]
#[case("b8192_rand", 8192)]
#[case("b8192_zero", 8192)]
#[case("rand_1000", 4096)]
#[case("text1.txt", 4096)]
#[case("text2.txt", 12288)]
fn test_files_physical_size(#[case] file: &str, #[case] size: u64) {
let file = String::from(TEST_PRE_CREATED_DIR) + file;
assert_size(&file, true, size);
}

#[cfg(windows)] // isn't supported on Unix (Theoretically possible on btrfs)