Make performance counter #20

Open
foriequal0 opened this issue Feb 17, 2020 · 9 comments
@foriequal0
Owner

To diagnose operations, I need a performance counter.
How many branches are there? How big is the repository? How long does each operation take?
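
A minimal sketch of what such a counter could look like, using only the Rust standard library. The names `PerfCounter`, `time`, `calls`, and `report` are hypothetical and are not part of git-trim; the sketch only records how often each named operation ran and how much wall-clock time it accumulated.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Hypothetical per-operation counter: call count and accumulated wall time.
#[derive(Default)]
pub struct PerfCounter {
    stats: BTreeMap<&'static str, (u64, Duration)>,
}

impl PerfCounter {
    /// Runs `f`, charging its elapsed wall-clock time to `name`.
    pub fn time<T>(&mut self, name: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let result = f();
        let entry = self.stats.entry(name).or_insert((0, Duration::ZERO));
        entry.0 += 1;
        entry.1 += start.elapsed();
        result
    }

    /// Number of times `name` was timed (0 if it never ran).
    pub fn calls(&self, name: &str) -> u64 {
        self.stats.get(name).map_or(0, |(n, _)| *n)
    }

    /// One line per operation: name, call count, total time.
    pub fn report(&self) -> Vec<String> {
        self.stats
            .iter()
            .map(|(name, (n, total))| format!("{name}: {n} calls, {total:?} total"))
            .collect()
    }
}
```

Wrapping a suspected hot spot as `counter.time("merge_base", || ...)` and printing `report()` at exit would show which operation dominates, although it still would not explain why that operation is slow.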

@foriequal0 foriequal0 self-assigned this Feb 17, 2020
@foriequal0 foriequal0 added the enhancement New feature or request label Mar 9, 2020
@siedentop

Yet one more ticket that I wanted to comment on. ;)

From the description, it is not clear to me what you want to accomplish with these metrics. Repo size and number of branches are not, in and of themselves, performance counters.

However, I am using this on a repo and it basically does not work for performance reasons. I have ~3500 branches (git branch --all | wc -l). I don't know the repo size (I'm writing this on an iPad), but it is definitely very big (think LLVM repo size). Could you tell me how you would calculate repo size? Possibilities: number of git objects, size of the .git folder, size of the checked-out repo, depth of history, max width of the git commit tree.
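
As an illustration of one of these options, the size of the .git folder, here is a small sketch using only the Rust standard library. The function name `dir_size` is hypothetical, and the result is the sum of apparent file sizes, not disk usage. For object counts and pack sizes, `git count-objects -v` already reports those directly.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sums the sizes of all regular files under `dir`, recursively.
/// Symlinks are not followed, so they contribute nothing.
pub fn dir_size(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            total += dir_size(&entry.path())?;
        } else if file_type.is_file() {
            total += entry.metadata()?.len();
        }
    }
    Ok(total)
}
```

Calling `dir_size(Path::new(".git"))` from the repository root would give the on-disk size of the object store plus refs and metadata, in bytes.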

"How long it takes for each operation."

I tried this earlier on this particular repo, running it with env_logger set to the "debug" level. That provides some timings. As I said, it took too long, so I aborted it.

@foriequal0
Owner Author

Thank you for pointing it out. I've made this issue to leave some notes while closing this issue. The current log is quite tedious to inspect. I might be able to find the slow spot from the timings in the log, but it doesn't tell me why it is slow. I wanted to make it easier to analyze. But it was a vague idea, and I lost motivation since there hasn't been a performance issue after closing it.

By the way, can you tell me how many local branches you have? IIRC, total time should be proportional to the number of local branches by default, not total branches, unless the --delete remote flag is given.

@siedentop

Here's the data you requested in the linked ticket.

  • git rev-list --all --count ==> 369170
  • All CPU cores are in use
  • git branch | wc -l ==> 102
  • git branch --all | wc -l ==> ~3500

Feature Request: Provide output during the run showing which stage it is at (out of how many total stages), ideally in the form of a status bar.

I am going to do two things next: (1) Run cargo-flamegraph (I tried but couldn't get it to work in a different directory). (2) Run it overnight.
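
One way such a status line could be rendered is sketched below. The function `render_progress` and its parameters are hypothetical; nothing like it is confirmed to exist in git-trim.

```rust
/// Renders a line such as "[2/5] [####------] checking merges".
/// `width` is the bar width in characters. All names here are hypothetical.
pub fn render_progress(stage: usize, total: usize, width: usize, label: &str) -> String {
    // Clamp so that a stage past the end cannot overflow the bar.
    let done = if total == 0 { 0 } else { (stage.min(total) * width) / total };
    format!(
        "[{stage}/{total}] [{}{}] {label}",
        "#".repeat(done),
        "-".repeat(width - done),
    )
}
```

Printing this line to stderr after each stage would show which stage is running without interfering with the tool's normal output.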

@foriequal0
Owner Author

foriequal0 commented Nov 13, 2020

Thanks! However, requested features might take some time. I'm busy at work for months. Also I might be able to try to trim some branches even if it is aborted.

@siedentop

Results from running the command overnight:

The logs suggest it took 2:45h to run (i.e. last log message timestamp minus first timestamp).

       Command being timed: "git trim --no-update --dry-run"
        User time (seconds): 180559.03
        System time (seconds): 10047.93
        Percent of CPU this job got: 1926%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:44:52
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1151504
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3845
        Minor (reclaiming a frame) page faults: 446960188
        Voluntary context switches: 1063108901
        Involuntary context switches: 3131583
        Swaps: 0
        File system inputs: 655272
        File system outputs: 880
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

"Also I might be able to try to trim some branches even if it is aborted."

I have a feeling that the complexity is non-linear. If so, trimming only the first N branches would be beneficial. It would also make the proposed changes easier to review.

siedentop added a commit to siedentop/git-trim that referenced this issue Nov 15, 2020
git2::Repository::remotes() and find_remote(..) are incredibly slow.
Calling them every time for each branch is unnecessarily slow.

Please note that this is not yet the best way to implement this. We are
still iterating through all remote.refspecs() for each branch. Just that
all Remote structs are cached.

An alternative would be to work on `expand_refspec()` and create a
map/index of (branch name, remote branches) once.

foriequal0#20
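
The alternative mentioned in the commit message, building a (branch name, remote branches) index once, could look roughly like the sketch below. It is not git-trim's `expand_refspec()` and does not use git2. `RemoteConfig`, `expand`, and `build_index` are hypothetical, and the refspec handling is deliberately simplified to "src:dst" strings with at most one `*` per side.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a configured remote: a name plus
/// fetch refspecs of the form "src:dst" with at most one `*` on each side.
pub struct RemoteConfig {
    pub name: String,
    pub fetch_refspecs: Vec<String>,
}

/// Expands `reference` through one refspec, e.g. "refs/heads/*:refs/remotes/origin/*"
/// maps "refs/heads/x" to "refs/remotes/origin/x". Returns None if it does not match.
fn expand(refspec: &str, reference: &str) -> Option<String> {
    let spec = refspec.strip_prefix('+').unwrap_or(refspec);
    let (src, dst) = spec.split_once(':')?;
    match src.split_once('*') {
        Some((prefix, suffix)) => {
            let middle = reference.strip_prefix(prefix)?.strip_suffix(suffix)?;
            Some(dst.replacen('*', middle, 1))
        }
        None if src == reference => Some(dst.to_string()),
        None => None,
    }
}

/// Builds the branch -> remote-tracking refs index in a single pass, so that
/// later lookups are a hash probe instead of a scan over every remote.
pub fn build_index(
    remotes: &[RemoteConfig],
    branches: &[&str],
) -> HashMap<String, Vec<String>> {
    let mut index: HashMap<String, Vec<String>> = HashMap::new();
    for branch in branches {
        let reference = format!("refs/heads/{branch}");
        for remote in remotes {
            for spec in &remote.fetch_refspecs {
                if let Some(tracking) = expand(spec, &reference) {
                    index.entry(branch.to_string()).or_default().push(tracking);
                }
            }
        }
    }
    index
}
```

The build pass still visits every refspec for every branch, but it does so once; any code that repeatedly asks "which remote refs track this branch?" afterwards pays only for a map lookup.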
siedentop added a commit to siedentop/git-trim that referenced this issue Nov 16, 2020
@siedentop

@foriequal0 sorry to take your time again. I realize that you might be busy. If so, no need to respond.

I identified MergeTracker<T>::check_and_track as taking up all the time in the earlier stages of the runtime (after making fixes in 42a0874). In particular, it calls repo.merge_base which takes all the time.

Here's a list of things that I don't understand:

  1. What is the idea of MergeTracker? What is the meaning of merged_set, what does it contain?
  2. How does the check_and_track function work?

Many thanks and also totally fine if you don't have time to answer this.

@foriequal0
Owner Author

The basic idea hasn't changed from this merge-testing script:

MERGE_BASE=$(git merge-base "$BASE" "$BRANCH")
# Is the branch merged by cherry-pick or rebase?
git rev-list --cherry-pick --right-only --no-merges -n1 "$MERGE_BASE...$BRANCH" # empty if merged
# Is the branch merged by squash? https://stackoverflow.com/questions/43489303/how-can-i-delete-all-git-branches-which-have-been-squash-and-merge-via-github/56026209#56026209
TREE=$(git rev-parse "$BRANCH^{tree}")
SQUASH=$(git commit-tree "$TREE" -p "$MERGE_BASE" -m _)
git rev-list --cherry-pick --right-only --no-merges -n1 "$MERGE_BASE...$SQUASH" # empty if merged

Also, git branch --merged $BASE gives you a list of no-ff merged branches. I've tried to optimize further by avoiding rev-list and commit-tree, since rev-list is slow and commit-tree is slower, especially if you are using it on WSL (its disk operations are notoriously slow, even slower than non-optimized, non-cached Git for Windows).

This is the core idea of MergeTracker.

  1. merged_set contains the set of branches that are already merged into the bases. It starts from the set of bases plus the output of git branch --merged.
  2. Any branch that is an ancestor of a merged branch is trivially merged, so it needs no testing. (Are children of un-merged branches un-merged?)
    If A is an ancestor of B, we know that A == B or $(git merge-base A B) == A. This acts as a short-circuit, since it is much cheaper than running a series of rev-list and commit-tree.
  3. If the short-circuit fails, we fall back to the rev-list and commit-tree method. However, can we share the result with other tasks so they can short-circuit on it?
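
The three points above can be sketched on a toy commit graph as follows. This is not git-trim's actual `MergeTracker`: `Graph`, `is_ancestor`, and this `MergeTracker` are simplified, hypothetical stand-ins, and the sketch assumes the answer to the question in point 3 is "yes", i.e. that a branch found merged by the slow path can be added to the merged set for later short-circuits.

```rust
use std::collections::{HashMap, HashSet};

/// Toy commit graph: commit id -> parent ids. A stand-in for a real repository.
pub type Graph = HashMap<&'static str, Vec<&'static str>>;

/// True if `ancestor` is reachable from `descendant` (or they are equal).
pub fn is_ancestor(graph: &Graph, ancestor: &str, descendant: &str) -> bool {
    let mut stack = vec![descendant];
    let mut seen = HashSet::new();
    while let Some(commit) = stack.pop() {
        if commit == ancestor {
            return true;
        }
        if seen.insert(commit) {
            if let Some(parents) = graph.get(commit) {
                stack.extend(parents.iter().copied());
            }
        }
    }
    false
}

/// Hypothetical tracker: commits already known to be merged into a base.
pub struct MergeTracker {
    merged: HashSet<&'static str>,
}

impl MergeTracker {
    /// Starts from the base tips (point 1).
    pub fn new(bases: &[&'static str]) -> Self {
        Self { merged: bases.iter().copied().collect() }
    }

    /// Returns whether `tip` is merged. The cheap ancestor test runs first (point 2);
    /// `slow_check`, standing in for rev-list/commit-tree, runs only if it fails (point 3).
    pub fn check_and_track(
        &mut self,
        graph: &Graph,
        tip: &'static str,
        slow_check: impl FnOnce() -> bool,
    ) -> bool {
        let known = self.merged.iter().any(|m| is_ancestor(graph, tip, m));
        if known || slow_check() {
            // Assumption: remember the result so later branches can short-circuit on it.
            self.merged.insert(tip);
            true
        } else {
            false
        }
    }
}
```

In this sketch the order of checks matters: testing a long-lived branch first lets its ancestors skip the slow path entirely, whereas the reverse order runs the slow path for each of them.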

Also, I thought that calling the libgit2 API would be faster than executing git with std::process::Command. I thought MergeTracker would be the bottleneck, not repo.remotes()?.iter() and find_remote(), so I left them as they were for brevity and simplicity.

@jyn514

jyn514 commented Feb 25, 2023

FWIW, git branch --all --merged does roughly the same thing and runs several orders of magnitude faster. Maybe it would be simpler to do that and prune the branches instead of trying to determine them from scratch?

@foriequal0
Owner Author

As far as I know, git branch --all --merged doesn't count squash-merged or rebase-merged branches.
git-trim was created when I preferred rebase-merge.
