Using strace to figure out how Git push over SSH works (kamalmarhubi.com)
111 points by luu on Nov 23, 2015 | 21 comments



Cool hack. Another way to solve the mystery is to read the book: https://git-scm.com/book/en/v2/Git-Internals-Transfer-Protoc...


As someone who implemented git over SSH in Go recently, I found that page helpful but frustratingly high-level as an overview. It didn't answer some lower-level questions I had.


You can try https://github.com/git/git/tree/master/Documentation/technic...

But sometimes you have to roll up your sleeves and read the source. I also found the git mailing list to be pretty helpful.


strace is very powerful, but probably not the tool I would jump to at first to solve this mystery. Instead, just go to the source code. Being able to read, interpret, understand, and, eventually, modify other people's code is a foundational skill both for industry and open source work. Reading code is often harder than writing code, so it's a worthwhile outlet for your natural curiosity, too. (For this particular mystery, you also could have tried Git's own tracing by setting GIT_TRACE=1 in your environment; many tools, especially ones that work over a network, offer some kind of debugging or tracing.)
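For anyone who wants to try the GIT_TRACE suggestion without an SSH server, here is a minimal sketch that pushes between two throwaway local repositories. The temporary paths, the placeholder identity, and the branch name `main` are assumptions made for illustration; over a real SSH remote the trace would show an ssh invocation instead of a local helper.

```shell
# Throwaway repositories; every path here is a temporary, made-up location.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
# Placeholder identity so the commit works on a machine with no Git config.
git -c user.name=example -c user.email=example@example.invalid \
    commit -q --allow-empty -m "initial commit"

# GIT_TRACE=1 makes Git log what it executes to stderr; capture it in a file.
GIT_TRACE=1 git push "$work/remote.git" HEAD:refs/heads/main 2>"$work/trace.log"

# Show only the trace lines that mention the receiving-side helper, if any.
grep 'receive-pack' "$work/trace.log" || true
cd - >/dev/null
```

Setting GIT_TRACE_PACKET=1 as well should add the individual protocol lines exchanged with the other side, which is closer to what the article digs out with strace.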

I tend to keep a copy of glibc, Python, coreutils, openssl, and binutils lying around for random grep'ing. I also sometimes keep other more specific tools and libraries lying around, depending on what I'm working on. And, of course, whatever code you use at your place of work.

If you want to have some fun, strace strace to see how it works. Or strace lsof, or ps, or df, or other tools that inspect the state of your system.


strace is a very good first tool to use. It's extremely quick and easy to run, and gives a wonderful view of how the process interacts with the system or other processes. You also notice quickly when strace can't give you the answer, because what you are interested in is not done with a syscall.

In the source code you are looking at layers over layers of abstraction, variables being passed around and mangled and substituted into strings, and maybe a dependency tree of various libraries.

Not to say that you are wrong: the source is the ultimate, most powerful tool, and you should absolutely learn to use it. But I disagree about which tool to use first if you have zero knowledge about the process.


> strace is very powerful, but probably not the tool I would jump to at first to solve this mystery. Instead, just go to the source code.

Though from a learning perspective, it might be useful to play with something like git using strace then compare with what you find in the code. This could help you learn how to interpret results when you don't have access to the code when diagnosing something else later.


Painfully hard for me. I end up mapping the whole thing to get a full view of it. Maybe efficiency comes with time.


strace is definitely a fun toy.

Here's my favourite hack: https://chris-lamb.co.uk/posts/can-you-get-cp-to-give-a-prog...


Or you could replace "cp" with "rsync --progress" for everything. rsync has become a habit for me and I only use cp these days when that isn't available (or in scripts that might one day run where rsync isn't installed) or for "cp -al" (which I could possibly do with rsync too).

But the strace-cp-progress hack is an interesting snippet for understanding some of the things you can do with strace.


Agreed! But rsync isn't nearly as fast over the network as something like this. rsync has to track tons of metadata (and the bigger the tree, the bigger the slowdown). Tar just screams (try it and compare):

    tar -zc /somedir/ | ssh remotebox tar -xzC /somedir/

(You can play with -z vs ssh -C, but my experience is that -z is faster because it compresses before passing across the wire; don't do both, though.)

I've also had really great luck doing this on a single filesystem (tar into tar), since it does a better job with things like /dev compared to cp.
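The "tar into tar" copy on a single machine can be sketched as below. The directories are temporary ones created just for the demo, and the file and symlink are made-up contents; the flags are the common ones for this idiom, not necessarily the exact ones the commenter uses.

```shell
# Made-up source and destination directories for the demonstration.
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'hello\n' > "$src/file.txt"
ln -s file.txt "$src/link"

# The first tar writes an archive of src to stdout; the second reads it from
# stdin and unpacks into dst. -p asks the extractor to keep permissions.
tar -C "$src" -cf - . | tar -C "$dst" -xpf -

# Record what arrived, then clean up.
copied=$(cat "$dst/file.txt")
if [ -L "$dst/link" ]; then link_kept=yes; else link_kept=no; fi
echo "content: $copied, symlink preserved: $link_kept"
rm -rf "$src" "$dst"
```

The symlink arrives as a symlink rather than as a copy of its target, which is the kind of special-entry handling the comment is pointing at.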

rsync has a lot of other awesomeness, though, especially with its hard-link capabilities. That can save a huge amount of space on remote backups over multiple days. (We use glacier for backups at Userify (ssh key management), but we also use this trick for remote pull backup to the backups without server access.)


The original posts were about progress indication on the file transfers, so I would add an invocation of pv in your example:

    tar -zc /somedir/ | pv | ssh remotebox tar -xzC /somedir/

That won't give a % done as it won't know the total size, but if you scan the source first and split out the tar and compress stages you can achieve that, though it'll be a little less efficient:

    tar -c somedir | pv --size $(du -bs somedir | cut -f1) | ssh -C remotebox tar -xC /tmp/

And if you want to keep the compression via gzip instead of ssh (I'm not sure that'll make any difference myself, as ssh IIRC uses the same algorithm, though I've not benchmarked it at all):

    tar -c somedir | pv --size $(du -bs somedir | cut -f1) | gzip | ssh remotebox tar -xzC /somedir/


Only when doing one file. Rsync will give you the progress for the file it's working on, not the overall progress over the copy job.


You can now do:

    rsync --info=progress2 --no-inc-recursive

...and see total transfer progress. This requires rsync to do a full recursive scan before beginning the transfer, but sometimes it's worth it.


Good point, this makes a big difference when copying files of different sizes (though if copying large files of roughly the same size it is still a useful indicator).


Definitely a cute hack if you don't have the pv tool around, but if you do: better to use pv than have an strace hackery dependency, imho.

Still, a very good way to get started with understanding what strace is doing ..


That's one of the most evil things I've seen for a long time (and I have done many evil things).


Most versions of cp also respond to Ctrl-T by printing current statistics.


On *BSD and OS X the keyboard command Ctrl-T sends a SIGINFO to all processes connected to the tty. Many tools such as cp, mv and even dd display their progress when they receive a SIGINFO.

This signal does not exist on Linux.


> Then at some point they wanted to add some metadata to the protocol without breaking compatibility with older versions of git. They came up with a nifty hack that exploits C’s null-terminated strings: add the metadata after a null byte but before the newline.

I believe this is the commit which made that change, very early in Git's history:

http://article.gmane.org/gmane.comp.version-control.git/1405...
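A rough illustration of why the null-byte trick quoted above is backwards compatible. This only simulates the effect with shell tools: a reader that treats the line as a C string stops at the first NUL, which is modelled here by splitting the stream on NUL. The function names, the object id, and the capability list are placeholders, not output captured from Git.

```shell
# Build a hypothetical advertisement line: "<id> <ref>" NUL "<metadata>" LF.
advertise() {
    printf '%s %s\0%s\n' "$1" "$2" "$3"
}

# What a C-string reader would see: everything before the first NUL.
# (Simulated by splitting the stream on NUL and keeping the first piece.)
old_client_view() {
    tr '\0' '\n' | sed -n '1p'
}

# What a newer reader can additionally pick up: the text after the NUL.
new_client_caps() {
    tr '\0' '\n' | sed -n '2p'
}

id=0123456789abcdef0123456789abcdef01234567   # placeholder object id
advertise "$id" refs/heads/master "report-status delete-refs" | old_client_view
advertise "$id" refs/heads/master "report-status delete-refs" | new_client_caps
```

The first pipeline prints only the id and ref name, so an older client's parsing is undisturbed; the second prints the extra metadata that a newer client knows to look for.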


The hack is cool, but the post's title could use a lot of work. The article doesn't really talk about how git push over SSH works, and strace isn't even mentioned outside of the first paragraph.


Terminating lines in streams with null bytes is not really a Git-specific hack, btw. Many shell tools have options for that as well, and you should always use them if you pipe stuff from one program to another. Don't remember the reason though.
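One commonly given reason for NUL-terminated records: a file name may legally contain a newline but never a NUL byte, so newline-delimited output can split one name into two records. A small sketch with a made-up file name in a temporary directory:

```shell
tmp=$(mktemp -d)
# A single file whose name contains a newline.
touch "$tmp/one
file"

# Newline-delimited output: the one file looks like two records.
newline_records=$(find "$tmp" -type f | wc -l)

# NUL-delimited output: exactly one terminator per file, so count the NULs.
nul_records=$(find "$tmp" -type f -print0 | tr -cd '\0' | wc -c)

echo "newline-delimited: $newline_records, NUL-delimited: $nul_records"
rm -rf "$tmp"
```

The same idea is behind pairing `find -print0` with `xargs -0` when passing file names through a pipe.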



