Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid reopening packfile on every object access #852

Merged
merged 4 commits into from
Oct 30, 2024
Merged

Conversation

jtibshirani
Copy link
Member

@jtibshirani jtibshirani commented Oct 29, 2024

By default, the go-git library will open the packfile on every call to Repository.BlobObject, then close it. During indexing, we collect the list of files to index, then iterate through each one calling Repository.BlobObject. So on every object access the packfile reopened, and go-git reallocates some in-memory buffers.

This PR bypasses git.PlainOpen to allow us to enable the KeepDescriptors option. This option keeps packfile files open, and caches wrappers for them. The files then need to be explicitly closed when done with the repo.

Benefits:

  • Avoid reallocating the memory buffers on every object access (see benchmark results below)
  • (Highly speculative) I suspect this could improve OS decisions around when to cache portions of the packfile. Maybe constantly reopening and seeking within the file makes it harder for the OS to determine the true access pattern, which is roughly random access. This can affect decisions like readahead and whether to consider pages 'active'.

Relates to SPLF-636

@cla-bot cla-bot bot added the cla-signed label Oct 29, 2024
@jtibshirani
Copy link
Member Author

I tested this against megarepo, and noticed a small improvement in allocations. This is due to us not reallocating the in-memory buffers for reading packfiles:

~/code/zoekt % go tool pprof -alloc_space -diff_base base.prof new.prof  
File: zoekt-git-index
Type: alloc_space
Time: Oct 29, 2024 at 12:12pm (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for -2.13GB, 4.12% of 51.61GB total
Dropped 276 nodes (cum <= 0.26GB)
Showing top 10 nodes out of 100
      flat  flat%   sum%        cum   cum%
   -1.54GB  2.98%  2.98%    -1.54GB  2.98%  bufio.NewReaderSize (inline)
   -0.29GB  0.55%  3.53%    -0.29GB  0.55%  github.com/sourcegraph/zoekt.(*postingsBuilder).newSearchableString
    0.16GB  0.32%  3.22%     0.16GB  0.32%  bytes.growSlice
   -0.12GB  0.24%  3.46%    -0.12GB  0.24%  strings.(*Builder).grow

Here is the web representation:

Screenshot 2024-10-29 at 12 18 56 PM

// It copies the relevant logic from git.PlainOpen, and enables the filesystem KeepDescriptors option. This
// caches the packfile handles, preventing the packfile from being opened then closed on every object access.
func openRepo(repoDir string) (*git.Repository, io.Closer, error) {
fs := osfs.New(repoDir)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included all the parts from git.PlainOpen that seem relevant. I also checked the go-git examples for KeepDescriptors, and this is almost identical: https://github.com/go-git/go-git/blob/3a210b5157e7d2a57eeac29e492efa510d616a24/_examples/ls/main.go#L32-L42

However git.PlainOpen is quite complex and I could be missing something. Here's what I purposefully omitted:

  • Special handling for user home shortcuts ... I tested and this was not necessary 🤔
  • Detecting when .git is a file that points to another git directory (like if you're trying to directly index submodule directory). I don't think we do this in indexing, or ever would.

I've tested zoekt-git-index with all the following:

  • Repos with and without .git subdirectories
  • Relative paths like ../megarepo
  • Shortcut for user home like ~/code/megarepo
  • Updating Sourcegraph to use this version and reindexing a bunch of repos locally

@jtibshirani jtibshirani requested a review from a team October 29, 2024 22:12
}

s := filesystem.NewStorageWithOptions(fs, cache.NewObjectLRUDefault(), filesystem.Options{
KeepDescriptors: true,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the point of the whole change :) This also sets me up to set LargeObjectThreshold in a follow-up PR, which could prove even more important.

gitindex/index.go Outdated Show resolved Hide resolved
Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This aligns with earlier research I did on go-git when analysing it.

On your macbook, could you maybe share the output of zoekt-git-index on the megarepo before and after this change, but prefixed with /usr/bin/time -al?

Copy link

@janhartman janhartman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great tests!

gitindex/index.go Outdated Show resolved Hide resolved
@jtibshirani
Copy link
Member Author

jtibshirani commented Oct 30, 2024

On your macbook, could you maybe share the output of zoekt-git-index on the megarepo before and after this change, but prefixed with /usr/bin/time -al?

This command is great, thanks for the pointer. My interpretation of the results below:

  • Indexing latency roughly unchanged -- differences are within usual variance
  • Lower maximum RSS (6.2GB -> 5.6GB) -- this looks real, I tried another run and saw the same difference

Before

          144.43 real       507.76 user        88.85 sys
          6206930944  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             4056713  page reclaims
               17123  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
               16032  signals received
              398288  voluntary context switches
             3328543  involuntary context switches
          3359314288  instructions retired
          2034483542  cycles elapsed
            31655552  peak memory footprint

After

      146.26 real       506.48 user        90.05 sys
          5686149120  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             4035750  page reclaims
               17140  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
               15796  signals received
              423029  voluntary context switches
             3327541  involuntary context switches
          4134626459  instructions retired
          2604012040  cycles elapsed
            30099008  peak memory footprint

@jtibshirani
Copy link
Member Author

jtibshirani commented Oct 30, 2024

Latest changes:

  • Avoid running git config --global in tests since this will change your git author/ email when running tests locally!
  • Only change root to .git if it's a directory, not a plain file
  • Put optimizations behind a temporary feature flag to de-risk testing this on dot com

@jtibshirani jtibshirani merged commit 6a4b615 into main Oct 30, 2024
9 checks passed
@jtibshirani jtibshirani deleted the jtibs/packfile branch October 30, 2024 18:10
Comment on lines +95 to +96
executeCommand(t, repoDir, exec.Command("git", "config", "user.name", "Thomas"))
executeCommand(t, repoDir, exec.Command("git", "config", "user.email", "[email protected]"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice way to do it in a repo. FYI in the sourcegraph repo what we often do is set these envvars, which also allow you to get deterministic commits since it controls the timestamps:

	c.Env = []string{
		"GIT_CONFIG=" + path.Join(dir, ".git", "config"),
		"GIT_COMMITTER_NAME=a",
		"[email protected]",
		"GIT_COMMITTER_DATE=2006-01-02T15:04:05Z",
		"GIT_AUTHOR_NAME=a",
		"[email protected]",
		"GIT_AUTHOR_DATE=2006-01-02T15:04:05Z",
	}

The setting of GIT_CONFIG is useful for the init command, since then it also avoids reading the users git config which can affect things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer, this is nicer.

jtibshirani added a commit that referenced this pull request Oct 31, 2024
By default go-git maintains an LRU cache of git objects of size 96MB. When an
object's contents are loaded, it's stored as a MemoryObject in this cache. This
cache is not super useful in the indexing access pattern, which accesses each
file only once. And in many profiles, we see a substantial number of
allocations from these memory objects.

This PR disables caching for most git objects by setting LargeObjectThreshold:
1. go-git still proactively caches packfile objects under 16KB (see
smallObjectThreshold here).

Follow up to #852. This change is also gated by the
ZOEKT_ENABLE_GOGIT_OPTIMIZATION feature flag.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants