Lots of people have filed issues against git-filter-repo, and many times their issue boils down to a question of "How do I?" or "Why doesn't this work?"
Below is a collection of example repository filterings in answer to their questions, which may be of interest to others.
- Adding files to root commits
- Purge a large list of files
- Extracting a library from a repo
- Replace words in all commit messages
- Only keep files from two branches
- Renormalize end-of-line characters and add a .gitattributes
- Remove spaces at the end of lines
- Having both exclude and include rules for filenames
- Removing paths with a certain extension
- Removing a directory
- Convert from NFD filenames to NFC
- Set the committer of the last few commits to myself
- Handling special characters, e.g. accents in names
- Handling repository corruption
- Removing all files with a backslash in them
- Replace a binary blob in history
- Remove commits older than N days
- Replacing pngs with compressed alternative
- Updating submodule hashes
- Using multi-line strings in callbacks
Here's an example that will take /path/to/existing/README.md and store it as README.md in the repository, and take /home/myusers/mymodule.gitignore and store it as src/.gitignore in the repository:
git filter-repo --commit-callback "if not commit.parents: commit.file_changes += [
FileChange(b'M', b'README.md', b'$(git hash-object -w '/path/to/existing/README.md')', b'100644'),
FileChange(b'M', b'src/.gitignore', b'$(git hash-object -w '/home/myusers/mymodule.gitignore')', b'100644')]"
Alternatively, you could also use the insert-beginning contrib script:
mv /path/to/existing/README.md README.md
mv /home/myusers/mymodule.gitignore src/.gitignore
insert-beginning --file README.md
insert-beginning --file src/.gitignore
List all the files you want to purge in some file (one per line), e.g. ../DELETED_FILENAMES.txt, and then run
git filter-repo --invert-paths --paths-from-file ../DELETED_FILENAMES.txt
If you want to pick out some subdirectory to keep (e.g. src/some-folder/some-feature/), but don't want it moved to the repository root (so that --subdirectory-filter isn't applicable) and instead want it to become some other higher-level directory (e.g. src/):
git filter-repo \
--path src/some-folder/some-feature/ \
--path-rename src/some-folder/some-feature/:src/
Replace "stuff" in any commit message with "task".
git filter-repo --message-callback 'return message.replace(b"stuff", b"task")'
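Note that a plain bytes replace also rewrites longer words that contain "stuff" (for instance "stuffing" becomes "tasking"). If that is not wanted, a word-boundary regex is one option. The following is a minimal standalone sketch of that logic, not part of filter-repo; the function name replace_word is made up for illustration, and inside --message-callback you would return the re.sub(...) result directly.

```python
import re

# Hypothetical helper mirroring what a --message-callback body could return.
def replace_word(message: bytes) -> bytes:
    # \b restricts the match to the whole word "stuff", so words that merely
    # contain it (e.g. "stuffing") are left untouched.
    return re.sub(rb"\bstuff\b", b"task", message)

print(replace_word(b"Fix stuff; stop stuffing the cache\n"))
```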
Let's say you know that the files currently present on two branches
are the only files that matter. Files that used to exist in either of
these branches, or files that only exist on some other branch, should
all be deleted from all versions of history. This can be accomplished
by getting a list of files from each branch, combining them, sorting
the list and picking out just the unique entries, then passing the
result to --paths-from-file:
git ls-tree -r --name-only ${BRANCH1} >../my-files
git ls-tree -r --name-only ${BRANCH2} >>../my-files
sort ../my-files | uniq >../my-relevant-files
git filter-repo --paths-from-file ../my-relevant-files
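The combine, sort and de-duplicate step can also be sketched in Python. This is only an illustration under the assumption that each listing holds one path per line; relevant_paths and the sample listings are hypothetical names and data.

```python
# Hypothetical helper: merge per-branch path listings into one sorted list
# without duplicates (the same job as `sort | uniq` above).
def relevant_paths(*listings):
    paths = set()
    for listing in listings:
        # Each listing is assumed to contain one path per line.
        paths.update(line for line in listing.splitlines() if line)
    return sorted(paths)

branch1 = "README.md\nsrc/main.c\n"
branch2 = "src/main.c\nsrc/util.c\n"
print("\n".join(relevant_paths(branch1, branch2)))
```

The printed list is what would be written to the file passed to --paths-from-file.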
contrib/filter-repo-demos/lint-history dos2unix
[edit .gitattributes]
contrib/filter-repo-demos/insert-beginning .gitattributes
Removing all spaces at the end of lines of non-binary files, including converting CRLF to LF:
git filter-repo --replace-text <(echo 'regex:[\r\t ]+(\n|$)==>\n')
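To see what that regex does, here is a small standalone sketch that applies the same pattern to bytes with Python's re module. It assumes the pattern is applied without extra regex flags, which is not verified here; strip_trailing_whitespace and the sample data are illustrative only.

```python
import re

# Same pattern as in the --replace-text rule above, applied to bytes.
TRAILING_WS = re.compile(rb"[\r\t ]+(\n|$)")

def strip_trailing_whitespace(data: bytes) -> bytes:
    # Any run of CR, tab or space right before a newline (or the end of the
    # data) is replaced by a single LF.
    return TRAILING_WS.sub(b"\n", data)

sample = b"foo  \r\nbar\t\nbaz\r\nqux  "
print(strip_trailing_whitespace(sample))
```

One side effect worth knowing: a final line that ends in whitespace but has no newline gains a trailing newline, because the replacement is always "\n".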
If you want to have rules to both include and exclude filenames, you can simply invoke git filter-repo multiple times. Alternatively, you can do it in one run if you dispense with --path arguments and instead use the more generic --filename-callback. For example, to include all files under src/ except for src/README.md:
git filter-repo --filename-callback '
  if filename == b"src/README.md":
    return None
  if filename.startswith(b"src/"):
    return filename
  return None'
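The rule order matters: the exclude check has to come before the broader include check. A standalone sketch of the same logic, runnable outside filter-repo, makes this easy to try out; keep_filename is a hypothetical name for the callback body.

```python
from typing import Optional

# Standalone version of the callback body; returning None drops the file.
def keep_filename(filename: bytes) -> Optional[bytes]:
    if filename == b"src/README.md":   # exclude rule, checked first
        return None
    if filename.startswith(b"src/"):   # include rule
        return filename
    return None                        # everything else is dropped

for name in (b"src/main.c", b"src/README.md", b"docs/index.md"):
    print(name, "->", keep_filename(name))
```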
git filter-repo --invert-paths --path-glob '*.xsa'
or
git filter-repo --filename-callback '
  if filename.endswith(b".xsa"):
    return None
  return filename'
git filter-repo --path node_modules/electron/dist/ --invert-paths
Given that Mac does utf-8 normalization of filenames, and has historically switched which kind of normalization it does, users may have committed files with alternative normalizations to their repository. If someone wants to convert filenames in NFD form to NFC, they could run
git filter-repo --filename-callback '
    try:
        return subprocess.check_output("iconv -f utf-8-mac -t utf-8".split(),
                                       input=filename)
    except:
        return filename
'
or, instead of relying on the system iconv utility and spawning separate processes, do it within Python (note the double quotes inside the callback, since single quotes would end the shell string early):
git filter-repo --filename-callback '
    import unicodedata
    try:
        return unicodedata.normalize("NFC", filename.decode("utf-8")).encode("utf-8")
    except:
        return filename
'
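To check what the in-Python normalization does to a name, the following standalone sketch converts an NFD-encoded filename to NFC. The helper name to_nfc and the sample filename are illustrative assumptions.

```python
import unicodedata

# Hypothetical helper with the same logic as the callback body.
def to_nfc(filename: bytes) -> bytes:
    try:
        text = filename.decode("utf-8")
    except UnicodeDecodeError:
        return filename  # not valid UTF-8: leave the name alone
    return unicodedata.normalize("NFC", text).encode("utf-8")

nfd = "cafe\u0301.txt".encode("utf-8")   # "e" followed by a combining acute accent
nfc = "caf\u00e9.txt".encode("utf-8")    # precomposed "é"
print(to_nfc(nfd) == nfc)
```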
git filter-repo --refs main~5..main --commit-callback '
  commit.committer_name = b"My Wonderful Self"
  commit.committer_email = b"my@self.org"
'
Since characters like ë and á are multi-byte characters and Python won't allow you to directly place those in a bytestring (e.g. b"Raphaël González" would result in a SyntaxError: bytes can only contain ASCII literal characters error from Python), you just need to make a normal (UTF-8) string and then convert to a bytestring to handle these. For example, changing the author name and email where the author email is currently example@test.com:
git filter-repo --refs main~5..main --commit-callback '
  if commit.author_email == b"example@test.com":
    commit.author_name = "Raphaël González".encode()
    commit.author_email = b"rgonzalez@test.com"
'
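Both halves of that explanation can be checked in a plain Python session. This is a small sketch, independent of filter-repo: it shows that a bytes literal with non-ASCII characters fails to compile, and what str.encode() (UTF-8 by default) produces instead.

```python
# A bytes literal may only contain ASCII, so compiling this source fails.
try:
    compile('b"Raphaël González"', "<example>", "eval")
except SyntaxError as err:
    print("SyntaxError:", err.msg)

# Encoding a normal str works; str.encode() defaults to UTF-8.
name = "Raphaël González".encode()
print(name)
```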
First, run fsck to get a list of the corrupt objects, e.g.:
$ git fsck --full
error in commit 166f57b3fbe31257100361ecaf735f305b533b21: missingSpaceBeforeDate: invalid author/committer line - missing space before date
error in tree c15680eae81cc8539af7e7de766a8a7c13bd27df: duplicateEntries: contains duplicate file entries
Checking object directories: 100% (256/256), done.
Odds are you'll only see one type of corruption, but if you see multiple, you can either do multiple filterings, or create replacement objects for all the corrupt objects (both commits and trees), and then do the filtering. Since the method for handling corrupt commits and corrupt trees is slightly different, I'll give examples below for each.
Print out the corrupt object literally to a temporary file:
$ git cat-file -p 166f57b3fbe31257100361ecaf735f305b533b21 >tmp
Taking a look at the file would show, for example:
$ cat tmp
tree e1d871155fce791680ec899fe7869067f2b4ffd2
author My Name <my@email.com>1673287380 -0800
committer My Name <my@email.com> 1673287380 -0800
Initial
Edit that file to fix the error (in this case, the missing space between author email and author date). After editing, it would look like this:
tree e1d871155fce791680ec899fe7869067f2b4ffd2
author My Name <my@email.com> 1673287380 -0800
committer My Name <my@email.com> 1673287380 -0800
Initial
Save the updated file, then use git replace to make a replace reference for it.
$ git replace -f 166f57b3fbe31257100361ecaf735f305b533b21 $(git hash-object -t commit -w tmp)
Then remove the temporary file tmp and run filter-repo to consume the replace reference and make it permanent:
$ rm tmp
$ git filter-repo --proceed
Note that if you have multiple corrupt objects, you only need to run filter-repo once; that is, so long as you create all the replacements before you run filter-repo.
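If many commits share the same missingSpaceBeforeDate problem, the manual edit of the author line described above could be scripted. The following is a minimal sketch under that assumption; fix_missing_space_before_date is a hypothetical helper, and it only touches the header lines, not the commit message.

```python
import re

# Hypothetical helper: insert the missing space between "<email>" and the
# timestamp on author/committer header lines of a raw commit object.
def fix_missing_space_before_date(raw_commit: bytes) -> bytes:
    # The header ends at the first blank line; the message is left alone.
    header, sep, message = raw_commit.partition(b"\n\n")
    header = re.sub(rb"^((?:author|committer) .*>)(\d)", rb"\1 \2",
                    header, flags=re.MULTILINE)
    return header + sep + message

broken = (b"tree e1d871155fce791680ec899fe7869067f2b4ffd2\n"
          b"author My Name <my@email.com>1673287380 -0800\n"
          b"committer My Name <my@email.com> 1673287380 -0800\n"
          b"\n"
          b"Initial\n")
print(fix_missing_space_before_date(broken).decode())
```

The fixed bytes would then be written to a temporary file and fed to git hash-object -t commit -w and git replace, as in the manual steps.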
Print out the corrupt object literally to a temporary file:
$ git cat-file -p c15680eae81cc8539af7e7de766a8a7c13bd27df >tmp
Taking a look at the file would show, for example:
$ cat tmp
100644 blob cd5ded43e86f80bfd384702e3f4cc7ce42de49f9 .gitignore
100644 blob 226febfcc91ec2c166a5a06834fb47c3553ec469 README.md
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 src
040000 tree df2b8fc99e1c1d4dbc0a854d9f72157f1d6ea078 src
040000 tree 99d732476808176bb9d73bcbfe2505e43d65cb4f t
Edit that file to fix the error (in this case, removing either the src file (blob) or the src directory (tree)). In this case, it might look like this after editing:
100644 blob cd5ded43e86f80bfd384702e3f4cc7ce42de49f9 .gitignore
100644 blob 226febfcc91ec2c166a5a06834fb47c3553ec469 README.md
040000 tree df2b8fc99e1c1d4dbc0a854d9f72157f1d6ea078 src
040000 tree 99d732476808176bb9d73bcbfe2505e43d65cb4f t
Save the updated file, then use git mktree to turn it into an actual tree object:
$ git mktree <tmp
ace04f50a5d13b43e94c12802d3d8a6c66a35b1d
Now use the output of that command to create a replacement object for the original corrupt object:
$ git replace -f c15680eae81cc8539af7e7de766a8a7c13bd27df ace04f50a5d13b43e94c12802d3d8a6c66a35b1d
Then remove the temporary file tmp and run filter-repo to consume the replace reference and make it permanent:
$ rm tmp
$ git filter-repo --proceed
As mentioned with corrupt commit objects, if you have multiple corrupt objects, as long as you create all the replacements for those objects first, you only need to run filter-repo once.
git filter-repo --filename-callback 'return None if b"\\" in filename else filename'
Let's say you committed a binary blob, perhaps an image file, with sensitive data, and never modified it. You want to replace it with the contents of some alternate file, currently found at ../alternative-file.jpg (it can have a different filename than what is stored in the repository). Let's also say the hash of the old file was f4ede2e944868b9a08401dafeb2b944c7166fd0a. You can replace it with either
git filter-repo --blob-callback '
  if blob.original_id == b"f4ede2e944868b9a08401dafeb2b944c7166fd0a":
    blob.data = open("../alternative-file.jpg", "rb").read()
'
or
git replace -f f4ede2e944868b9a08401dafeb2b944c7166fd0a $(git hash-object -w ../alternative-file.jpg)
git filter-repo --proceed
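Both variants need the hash of the old blob. If you still have a copy of the old file, git hash-object prints that hash; the sketch below shows the same computation in Python, assuming a repository that uses SHA-1 object IDs. The function name git_blob_id is made up for illustration.

```python
import hashlib

# Git's object ID for a blob is the SHA-1 of a "blob <size>\0" header
# followed by the raw file contents (assumes a SHA-1 repository).
def git_blob_id(data: bytes) -> str:
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

print(git_blob_id(b""))  # ID of the empty blob
```

For the empty blob this yields e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the same ID that appears for the empty src file in the corrupt tree listing earlier in this document. Inside a callback the comparison is against bytes, so the hex string would need .encode().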
This is such a bad use case. I'm tempted to leave it out, but it has come up multiple times, and there are people who are totally fine with changing every commit hash in their repository and throwing away history periodically. First, identify an ${OLD_COMMIT} that you want to be a new root commit, then run:
git replace --graft ${OLD_COMMIT}
git filter-repo --proceed
(The trick here is that git replace --graft takes a commit to replace, and a list of new parents for the commit. Since ${OLD_COMMIT} is the final positional argument, it means the list of new parents is an empty list, i.e. we are turning it into a new root commit.)
Let's say you committed thousands of pngs that were poorly compressed, but later aggressively recompressed the pngs and committed and pushed. Unfortunately, clones are slow because they still contain the poorly compressed pngs and you'd like to rewrite history to pretend that the aggressively compressed versions were used when the files were first introduced.
First, take a look at the commit that aggressively recompressed the pngs:
git log -1 --raw --no-abbrev ${COMMIT_WHERE_YOU_COMPRESSED_PNGS}
that will show output like
:100755 100755 edf570fde099c0705432a389b96cb86489beda09 9cce52ae0806d695956dcf662cd74b497eaa7b12 M resources/foo.png
:100755 100755 644f7c55e1a88a29779dc86b9ff92f512bf9bc11 88b02e9e45c0a62db2f1751b6c065b0c2e538820 M resources/bar.png
Use that to make a --file-info-callback to fix up the original versions:
git filter-repo --file-info-callback '
  if filename == b"resources/foo.png" and blob_id == b"edf570fde099c0705432a389b96cb86489beda09":
    blob_id = b"9cce52ae0806d695956dcf662cd74b497eaa7b12"
  if filename == b"resources/bar.png" and blob_id == b"644f7c55e1a88a29779dc86b9ff92f512bf9bc11":
    blob_id = b"88b02e9e45c0a62db2f1751b6c065b0c2e538820"
  return (filename, mode, blob_id)
'
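With thousands of pngs, writing one if statement per file by hand does not scale. One possible approach is to build a lookup table from the raw log output and consult it in the callback. The sketch below is illustrative only: parse_raw_changes and rewrite are hypothetical helpers, and paths that git quotes because of unusual characters are not handled.

```python
# Hypothetical helper: turn `git log -1 --raw --no-abbrev` lines into a
# {(filename, old_blob_id): new_blob_id} mapping.
def parse_raw_changes(raw_output: str) -> dict:
    mapping = {}
    for line in raw_output.splitlines():
        if not line.startswith(":"):
            continue  # skip the commit header and message
        # Fields: old mode, new mode, old id, new id, status, path
        _, _, old_id, new_id, status, path = line.split(None, 5)
        if status == "M":
            mapping[(path.encode(), old_id.encode())] = new_id.encode()
    return mapping

def rewrite(filename, mode, blob_id, mapping):
    # Same shape as the --file-info-callback return value above.
    return (filename, mode, mapping.get((filename, blob_id), blob_id))

raw = (":100755 100755 edf570fde099c0705432a389b96cb86489beda09 "
       "9cce52ae0806d695956dcf662cd74b497eaa7b12 M\tresources/foo.png\n")
table = parse_raw_changes(raw)
print(rewrite(b"resources/foo.png", b"100755",
              b"edf570fde099c0705432a389b96cb86489beda09", table))
```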
Let's say you have a repo with a submodule at src/my-submodule, and that you feel the wrong commit hashes of the submodule were committed within your project and you want them updated according to the following table:
old new
edf570fde099c0705432a389b96cb86489beda09 9cce52ae0806d695956dcf662cd74b497eaa7b12
644f7c55e1a88a29779dc86b9ff92f512bf9bc11 88b02e9e45c0a62db2f1751b6c065b0c2e538820
You could do this as follows:
git filter-repo --file-info-callback '
  if filename == b"src/my-submodule" and blob_id == b"edf570fde099c0705432a389b96cb86489beda09":
    blob_id = b"9cce52ae0806d695956dcf662cd74b497eaa7b12"
  if filename == b"src/my-submodule" and blob_id == b"644f7c55e1a88a29779dc86b9ff92f512bf9bc11":
    blob_id = b"88b02e9e45c0a62db2f1751b6c065b0c2e538820"
  return (filename, mode, blob_id)
'
Yes, blob_id is kind of a misnomer here since the file's hash actually refers to a commit from the sub-project. But blob_id is the name of the parameter passed to the --file-info-callback, so that is what must be used.
Since the text for callbacks has spaces inserted at the front of every line, multi-line strings are normally munged. For example, the command
git filter-repo --blob-callback '
blob.data = bytes("""\
This is the new
file that I am
replacing every blob
with. It is great.\n""", "utf-8")
'
would result in a file with extra spaces at the front of every line:
  This is the new
  file that I am
  replacing every blob
  with. It is great.
The two spaces at the beginning of every line were inserted into every line of the callback when trying to compile it as a function. However, you can use textwrap.dedent to fix this; in fact, using it will even allow you to add more leading space so that it looks nicely indented. For example:
git filter-repo --blob-callback '
  import textwrap
  blob.data = bytes(textwrap.dedent("""\
    This is the new
    file that I am
    replacing every blob
    with. It is great.\n"""), "utf-8")
'
That will result in a file with contents
This is the new
file that I am
replacing every blob
with. It is great.
which has no leading spaces on any lines.
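The effect of textwrap.dedent can be checked outside filter-repo. In this small sketch, munged stands for what the string looks like once two spaces have been prepended to every line of the callback; the variable names are illustrative.

```python
import textwrap

# The multi-line string as the callback sees it, with two leading spaces
# on every line.
munged = """\
  This is the new
  file that I am
  replacing every blob
  with. It is great.\n"""

# dedent strips the whitespace prefix common to all lines.
cleaned = textwrap.dedent(munged)
print(cleaned, end="")
```

Because dedent removes whatever leading whitespace the lines share, indenting the text further in the callback makes no difference to the result.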