Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance - skip manifest read if no components/targets requested #2917

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

JoshMcguigan
Copy link

@JoshMcguigan JoshMcguigan commented Dec 4, 2021

I was looking into why rustup had a performance penalty over calling cargo directly in simple cases, and found most of the time was spent parsing the multirust-channel-manifest.toml file. Looking more closely, it seemed parsing that file was not really required in many cases. This PR tweaks the ordering of the code to skip parsing this file when it isn't needed.

I do not intend to make any behavioral change with this, other than potentially if some code paths handle an error while reading this file (and now they won't see that error in cases where we skip reading the file entirely). I believe this is the reason the heal_damaged_toolchain test is now failing. I need to look into this more closely (this PR is marked as draft until I figure this out), but I'd appreciate any insight into whether that test was intended to cover this behavior.

The performance changes were tested in a brand new hello world cargo project (cargo new hello), after building rustup in release mode and installing it into a rustup/home directory as described in the contributing guidelines. The before tests were done on the current master branch, because it seems to already be much faster than the most recent official release.

Before this change a repeated call to cargo build (through the rustup proxy) would take ~90ms. After this change it takes ~35ms. Running cargo directly takes ~22ms. So this change reduces the rustup overhead in this case from 68ms to 13ms (a reduction of 80%).

@JoshMcguigan
Copy link
Author

It looks like the heal_damaged_toolchain test is failing because it relies on rustup show active-toolchain having previously read/parsed the manifest, which is specifically what this PR removes (since reading/parsing the manifest seems to not be strictly required for that operation, and it takes meaningful time).

Is my understanding correct here? And is it reasonable to a command like rustup show active-toolchain to NOT heal the damaged toolchain manifest file (since it doesn't even need to look at that file)? If so, is there a command which would be more appropriate to run as part of this test to demonstrate the healing?

@kinnison
Copy link
Contributor

Personally I think the request to show the active toolchain (which implies checking for components in the rust-toolchain.toml being available) should be able to succeed at healing things.

I remain convinced that the correct way to ameliorate the incredible slowness of parsing the TOML is to, instead, keep a cached faster-to-parse copy alongside it, e.g. BINCODE.

@dralley
Copy link
Contributor

dralley commented Jan 12, 2022

Has any profiling been done on the TOML parsing itself? Just glancing through the file, the parsing code doesn't seem especially efficient, it could probably benefit from a little focused attention.

edit: I just created one and the main thing that stands is tons and tons of individual allocations and drops

@JoshMcguigan
Copy link
Author

JoshMcguigan commented Jan 12, 2022

I created a small one.

use std::fs::read_to_string;

fn main() {
    let s = read_to_string("/home/josh/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/multirust-channel-manifest.toml").unwrap();

    for _ in 0..10 {
        s.parse::<toml::Value>().unwrap();
    }
}

This runs in ~450ms on my machine. I found I can improve this by ~100ms by bypassing the \r\n windows style newline handling in src/tokens.rs, which I actually wonder if that is where you are seeing all the allocations. However I'm not sure how we could get this improvement while also handling windows style newlines (short of replacing all \r\n with \n before parsing, but there is a memory use cost there).

edit: I just created one and the main thing that stands is tons and tons of individual allocations and drops

I'd like to poke around at this. How did you gather this data?

@dralley
Copy link
Contributor

dralley commented Jan 12, 2022

I'd like to poke around at this. How did you gather this data?

If you pass the --reverse flag it groups together all of the tiny calls at the bottom of the call stack instead of the ones at the top, if you do it seems that about 25% of the total time is spent on allocs, frees and moves. Another 8% is spent computing hashes.

@JoshMcguigan
Copy link
Author

I've been using frida tracing to count malloc calls, and its about 100,000 allocations to parse the manifest file a single time (so the example above would be 10x that). I replaced Vec in one place with TinyVec and cut the allocations by ~10%, but the runtime didn't really change.

I'm probably not going to continue down this path right now, but I'd be very curious to see if anyone else comes up with anything fruitful here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants