I'm seeing numerous performance problems with etckeeper on my servers. They generally happen when there is a git repository underneath /etc or something else that generates lots of small files. Typically, what I try next is to make git ignore those files, which works for the git side of things but not for the etckeeper side, as all those files still get listed (and parsed) in etckeeper's .etckeeper metadata file.
The problem here is that the metadata "ignore" filter doesn't really work. We can see this in multiple bug reports here, like "etkeeper warning about hardlinks doesn't exclude ignored files", "Do not recreate ignored empty directory", or "what if there is a Git repo somewhere underneath /etc?". I think the approach taken to solve this problem is incorrect, which is why I am opening this issue to gather all of them in one place.
From what I can tell, the 30store-metadata script tries to ignore files ignored by git. But it tries to do that using grep, and gitignore files are not grep patterns at all. In fact, they're not even regular expressions: they are like glob patterns, but with many exceptions (like ! to negate a pattern, for example). I do not think there is a command that can faithfully reproduce this outside of git itself.
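To illustrate the negation case, here is a minimal sketch that lets git evaluate the ignore rules itself. The repository, file names and patterns are made up for the example; it only assumes that git is installed.

```shell
#!/bin/sh
# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d)
git init -q "$repo"

# Ignore every *.log file, except keep.log (negated pattern).
printf '%s\n' '*.log' '!keep.log' > "$repo/.gitignore"
touch "$repo/a.log" "$repo/keep.log" "$repo/notes.txt"

# Untracked files that git considers ignored.
ignored=$(git -C "$repo" ls-files --others --ignored --exclude-standard)
# Untracked files that git does not ignore.
kept=$(git -C "$repo" ls-files --others --exclude-standard | LC_ALL=C sort)

printf 'ignored:\n%s\nkept:\n%s\n' "$ignored" "$kept"
rm -rf "$repo"
```

Here keep.log ends up in the "kept" list even though it matches *.log, which a plain grep over the pattern file would not reproduce.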
Right now, we basically do this:
- list all empty directories, and add them to the metadata, regardless of gitignore (which is the "Do not recreate ignored empty directory" bug)
- then list all files and directories (empty or not), try to filter out ignored files, and try to ignore normal modes, which means:
  - find $NOVCS \( -type f -or -type d \) (basically skips .git and friends)
  - sed 's/^\.\///' | grep -xFvf $(git ls-files -oi --exclude-standard; git ls-files -oi --exclude-standard --directory | sort -u)
- some inline perl script which actually systematically chmods all files, and optionally changes the owner and group if they are not the EUID/EGID
Now, the problem that concerns us here is mostly in step 2.2 above (the sed | grep pipeline). That grep command is bad for many reasons, the first of which is -x: it does a full match on the entire line, so if you have, say, puppet in your .gitignore, that will match the puppet directory but nothing underneath, which is not how gitignore files work at all.
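A small sketch of that -x behaviour, assuming (as this paragraph does) that the pattern list holds the bare name puppet. The listing and the pattern file are hypothetical stand-ins, not etckeeper's actual data.

```shell
#!/bin/sh
# Hypothetical file listing, standing in for the output of find.
listing() {
    printf '%s\n' puppet puppet/manifests/site.pp hosts
}

# Hypothetical pattern list containing only the bare name "puppet".
patterns=$(mktemp)
printf '%s\n' puppet > "$patterns"

# -x: whole-line match, -F: fixed strings, -v: invert, -f: patterns from file.
filtered=$(listing | grep -xFvf "$patterns")
printf '%s\n' "$filtered"
rm -f "$patterns"
```

Only the line that is exactly puppet is dropped; puppet/manifests/site.pp survives the filter.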
I think the proper way to do this is to start from the files git actually tracks, since in that step we're trying to keep track of the modes and owners of files managed by git. It seems much better to me to make git list the files and then process that list than to try to reimplement gitignore outside of git.
I think the pseudocode then would be something like, for the git case:
git ls-files | sort -u | maybe_chmod_chown
git ls-files --others --exclude-standard --directory | sed -e "s/^/mkdir -p /"
... and that's it! The trick here is we separate the normal file tracking (first step) from the empty directory listing (second step). I haven't actually tested this because I'm out of battery in that third yak razor, but I wanted to brainstorm this here first to see if we could somehow make sense of this.
(Also, I don't think that the maybe_chmod_chown script should systematically change the mode on files, but that's a different story (and optimization)...)
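As a rough sketch of what "not systematically" could look like, the hypothetical helper below emits a line only for modes that differ from an assumed default. The name emit_mode, the choice of 644/755 as "normal" modes, and the "maybe chmod" line format are all assumptions made for illustration, not etckeeper's actual code.

```shell
#!/bin/sh
# Hypothetical helper: print a "maybe chmod" line only for unusual modes.
# The "normal" modes (644 for files, 755 for directories) are an assumption
# for this sketch, not taken from etckeeper.
emit_mode() {
    kind=$1 mode=$2 path=$3
    case "$kind:$mode" in
        f:644|d:755) return 0 ;;   # assumed defaults: emit nothing
    esac
    # Note: paths containing a single quote would need extra escaping.
    printf "maybe chmod %04d '%s'\n" "$mode" "$path"
}

emit_mode f 644 hosts          # default mode, nothing is printed
emit_mode f 600 shadow         # non-default mode, one line is printed
emit_mode d 700 ssl/private    # non-default directory mode
```

The mode would come from something like stat in a real script; it is passed in as an argument here to keep the sketch portable.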
-- anarcat
Update: I think the patch is something like this (see branch for latest). It doesn't quite work the way I expected, unfortunately. A few examples:
diff --git a/.etckeeper b/.etckeeper
index 71fb3188..8b34a087 100755
--- a/.etckeeper
+++ b/.etckeeper
@@ -3,34 +3,28 @@
mkdir -p './ImageMagick'
mkdir -p './X11/xkb'
mkdir -p './X11/xorg.conf.d'
-mkdir -p './apm/event.d'
-mkdir -p './apm/scripts.d'
+mkdir -p './apm'
[...]
in the above example, you see that apm is correctly added but not the underlying event.d and scripts.d ...
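The collapsing seen in that diff can be reproduced in a scratch repository. This is a minimal sketch with a made-up layout mimicking the apm directories; it assumes git is installed.

```shell
#!/bin/sh
# Throwaway repository; the layout mimics the apm example.
repo=$(mktemp -d)
git init -q "$repo"
mkdir -p "$repo/apm/event.d" "$repo/apm/scripts.d"
touch "$repo/hosts"
git -C "$repo" add hosts

# Untracked, non-ignored entries, with untracked directories collapsed.
others=$(git -C "$repo" ls-files --others --exclude-standard --directory)
printf '%s\n' "$others"    # the empty subdirectories are not listed separately
rm -rf "$repo"
```

With --directory, git reports a wholly untracked directory as a single entry, which would explain why the subdirectories disappear from the metadata.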
it does correctly follow ignores for most stuff however, which is an improvement. I did have to bend over backwards to remove symlinks from the listing, with that ungodly sed. and, in general, we have quoting problems here because we pipe filenames that might have newlines in them... thankfully, we might consider /etc trusted, but that's something that makes me uncomfortable about this whole thing in the first place...
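The newline concern can be shown with a short sketch. The scratch directory and file names are hypothetical, and -print0 is assumed to be available (it is in GNU and BSD find).

```shell
#!/bin/sh
# Scratch directory with one ordinary name and one name containing a newline.
tmp=$(mktemp -d)
touch "$tmp/plain" "$tmp/with
newline"

# Newline-delimited: the second name is split across two lines.
by_line=$(find "$tmp" -type f | wc -l | tr -d ' ')
# NUL-delimited: count the terminators instead, one per file.
by_nul=$(find "$tmp" -type f -print0 | tr -cd '\0' | wc -c | tr -d ' ')

printf 'lines: %s, files: %s\n' "$by_line" "$by_nul"
rm -rf "$tmp"
```

On the git side, git ls-files -z produces the same kind of NUL-terminated output, which a consumer such as xargs -0 can read safely.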
here at least, a bunch of stuff is cleaned up:
root@marcos:/etc# git diff --cached --stat
.etckeeper | 636 ++++------------------------------------------------------------------------------------------------------------------------------------------------
1 file changed, 17 insertions(+), 619 deletions(-)
root@marcos:/etc# wc -l .etckeeper
7419 .etckeeper
Anyways, let me know what you think. The main tradeoff here is that we lose empty subdirectories, which maybe is a big deal, but for my use cases, I don't expect etckeeper to recover everything like this, that's what backups are for.
You have misunderstood (and oddly, misquoted) the code.
30store-metadata already uses git ls-files -oi --exclude-standard to list gitignored files. The use of grep is only to search through the resulting list of ignored files, to find an exact match for a filename. That is indeed inefficient when there are a lot of ignored files. But it matches ignored files correctly. (30store-metadata does grep .darcsignore, I don't know if that is a good idea.)
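A minimal sketch of the filtering described in this reply, with hypothetical stand-ins for both lists: once git has expanded the ignore rules into full paths, an exact fixed-string match is enough.

```shell
#!/bin/sh
# Hypothetical stand-in for the output of find: every path under /etc.
all_paths=$(printf '%s\n' hosts puppet/site.pp puppet/modules/init.pp)

# Hypothetical stand-in for "git ls-files -oi --exclude-standard":
# git has already expanded the ignore rules into full paths.
ignored=$(mktemp)
printf '%s\n' puppet/site.pp puppet/modules/init.pp > "$ignored"

# Exact (-x), fixed-string (-F) matching removes each ignored path.
remaining=$(printf '%s\n' "$all_paths" | grep -xFvf "$ignored")
printf '%s\n' "$remaining"
rm -f "$ignored"
```

The cost grows with the number of ignored files, since every candidate path is compared against the whole list.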
But why do we list all the ignored files in the metadata store? Shouldn't we store data only about files tracked with git?
Maybe you are correct and I do not understand the purpose of this code, I thought the point of the metadata store was to restore modes when checking out files, and therefore that adding ignored files in there was not necessary.
that, i cannot say. i haven't used darcs in ages.
It doesn't. This code is what filters those ignored files out of the ones that are included. It's easy to show it works:
You have not yet given an example of it not working.
here's an example, from my workstation, which has a similar configuration to the server i was working on for this:
here you see the empty directory is managed in .etckeeper even though it's ignored.
i thought i provided clearer examples of this in the original article, but i guess it wasn't very explicit... here's an example:
right now, if i revert the changes I made to etckeeper/pre-commit.d/30store-metadata, and run the hook:... i get this:
it's mostly empty directories, but there's also other stuff:
with the patch, it looks like this instead:
the patch proposed here, together with the one on "metadata ignore filters do not work", improves etckeeper performance tremendously in my case.