Characterizing secret leakage in public GitHub repositories (acolyer.org)
117 points by feross on April 8, 2019 | 36 comments



I accidentally published[1] my AWS secret key last year because I pushed an old project from college. At the time, I was very new to using source control and had little idea how to distinguish between what should and shouldn't be committed. I hope colleges and code boot camps go over that sort of info nowadays. The usefulness to effort to learn ratio seems exceptionally high.

[1]: https://www.dannyguo.com/blog/i-published-my-aws-secret-key-...


In my sophomore year of high school, I was trying to write a Discord (chat platform) bot for a server I shared with friends, and I unknowingly included the private key in a public repo I hoped to show them. A crawler written specifically to find Discord keys found it and started spamming the server with images of very, very undesirable things from the far corners of the internet at a rate of hundreds per second. Needless to say, I learned my lesson the hard way.


Couldn't we use Google's BigQuery to search for private keys?


Thanks for sharing! How do people write such crawlers? Do they specifically point them at GitHub repos?


The paper discussed in the article describes writing such a crawler. They simply use the GitHub search API.


When I attended Hack Reactor, they did tell us not to push secrets. However, since they didn't teach us git (they expected us to know it), many still pushed them up. You would know because they'd get an email from some random company/person letting them know that they had found their secret keys and that they should enroll in/buy their services if they didn't know what they were doing. Luckily, no one from my class got hosed, but others in the past had.


An increased use of SSH keys is also found in our SSH honeypot: https://pmcao.github.io/caudit/ ~ plugging our paper.


Looks like a great methodology and good results. Looking forward to reading the paper because I've been working around the GitHub API restrictions for the same purpose.

Specifically, I'm building a SaaS (https://www.locktower.com/) for organizations (or security teams) looking to have a managed solution for detecting leaked secrets in GitHub/BitBucket/etc. I'm in the process of building an on-prem version as well. Overall, I really hope to help drive down the number of unresolved leaks that the authors found.


I wrote a tool that scans all the new commits to our Org for passwords/secrets.

Webhook > AWS API Gateway > Lambda

The Lambda uses the new(ish) Layers feature so it can use Git. I then use the truffleHog[0] library to scan for entropy/regexes inside the commit.

If something is detected, it posts to an SNS topic, which is currently subscribed to by another Lambda that posts an alert to my team and the Security team's Slack channel.

It then calls the GitHub API to make the repo private to limit the exposure.

[0] - https://github.com/dxa4481/truffleHog
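
For anyone curious what the entropy half of that check looks like, here is a minimal sketch in Python. It is not truffleHog's code: the token regex, the 20-character minimum, and the 4.0 bits/char threshold are all assumptions I picked for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: candidate secrets are runs of 20+ base64/URL-safe characters.
TOKEN = re.compile(r"[A-Za-z0-9+/=_-]{20,}")


def shannon_entropy(s: str) -> float:
    """Bits per character of the empirical character distribution of s."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def high_entropy_tokens(text: str, threshold: float = 4.0) -> list:
    """Return (line_number, token) pairs whose entropy meets the threshold.

    The threshold is an illustrative guess, not a tuned value.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN.finditer(line):
            if shannon_entropy(match.group()) >= threshold:
                hits.append((lineno, match.group()))
    return hits
```

Random-looking keys score high because their characters are spread almost evenly; ordinary identifiers and repeated characters score low. In practice the threshold trades false positives against misses, so it would need tuning per key format.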


Why not have a client-side pre-commit hook that runs truffleHog and, if successful, generates some form of file indicating it was run, then have a server-side hook check for that file? This should be doable even with plain GitHub/etc., no?
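
The client-side half could be sketched roughly like this in Python. This is a minimal sketch, not truffleHog itself: the rule names and the two patterns are assumptions for illustration, and it only looks at added lines in the staged diff.

```python
#!/usr/bin/env python3
import re
import subprocess
import sys

# Hypothetical, deliberately small rule set; a real scanner ships many more.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_diff(diff: str) -> list:
    """Return (rule, line) pairs for added lines in a unified diff."""
    findings = []
    for line in diff.splitlines():
        # Added lines start with "+"; "+++" is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((rule, line[1:]))
    return findings


def main() -> int:
    # Only the staged changes matter for the commit about to be made.
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    findings = scan_diff(diff)
    for rule, line in findings:
        print(f"possible secret ({rule}): {line.strip()}", file=sys.stderr)
    return 1 if findings else 0
```

Saved as an executable .git/hooks/pre-commit with a final `sys.exit(main())` line, a non-zero exit makes git abort the commit. Client-side hooks are easy to skip (`git commit --no-verify`), which is why the server-side check would still matter.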


That was too simple to think of ;)


How do you install serverside hooks in GitHub?


I’m sorry! I didn’t realize my screen name said google


I'm sorry! I realized my question sounded as if I was asking instructions about how to do something that was possible but I didn't know how to do it.

Rephrasing:

What if you use a git server like GitHub that doesn't allow you to install server side hooks?


I assume you saw the note on truffleHog in the article? The paper found it to be rather inaccurate outside of the basics (mainly AWS keys). Hopefully the authors open source their stuff.


This highlights the difficulty of sharing secrets with your production code. How can you get secrets into production in a secure way?

Cloud providers have proprietary solutions, but those don't work on other providers (or your local dev env).

Rolling your own secrets server seems like an expensive centralized disaster waiting to happen.

It seems like putting a secret into source code is one of the least risky options. Just make sure it's not in a public git repo.


Plugging my own tool: https://github.com/zricethezav/gitleaks. You can enforce custom rules like entropy ranges plus custom regexes to get fewer false positives, similar to what is described under "Validity Filters" in this article.


The article quotes someone making this claim:

> we discovered that even if commit histories are rewritten, secrets can still be recovered…. we discovered we could recover the full contents of deleted commits from GitHub with only the commit’s SHA-1 ID.

Do repo cleaning tools such as https://rtyley.github.io/bfg-repo-cleaner/ leave the original commit's SHA-1 ID intact?


Shouldn't GitHub put in some sort of warning for potential leaks when they happen?



That's OK, I guess. But I was thinking more along the lines of blocking the push unless you disabled the feature for your repo.


I once pushed a GitHub token in a public repository. They immediately revoked the token and notified me by email.


I believe they have it - I've gotten notified in the past when I committed secrets on purpose, for test applications. I'm not 100% sure they were from GitHub, but I think they were.


Wait, do people not use .env files? I've aliased "gitinit" to create a .env file and a .gitignore that ignores .env, node_modules, etc., and then run git init.
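
Something like this, sketched in Python rather than as a shell alias. The function name, the `run_git_init` flag, and the exact ignore entries are my own choices for illustration.

```python
import subprocess
from pathlib import Path

# Assumed ignore entries; extend to taste.
IGNORED = [".env", "node_modules/"]


def scaffold(project_dir, run_git_init=True):
    """Create an empty .env, make sure .gitignore covers it, then git init."""
    root = Path(project_dir)
    root.mkdir(parents=True, exist_ok=True)
    (root / ".env").touch()

    gitignore = root / ".gitignore"
    text = gitignore.read_text() if gitignore.exists() else ""
    missing = [entry for entry in IGNORED if entry not in text.splitlines()]
    if missing:
        # Keep any existing entries and append only what is not there yet.
        if text and not text.endswith("\n"):
            text += "\n"
        gitignore.write_text(text + "\n".join(missing) + "\n")

    if run_git_init:
        subprocess.run(["git", "init"], cwd=root, check=True)
```

Writing .gitignore before the first commit is the point: once .env has been committed, ignoring it later does not remove it from history.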


There is no excuse to ever have AWS secret keys anywhere in your code or your settings.

If you are running locally, you should be using your own secret keys that are configured in your user directory with

  aws configure

If you are running on anything within AWS, you should be using a role attached to your EC2 instance or Lambda, and the SDK can retrieve your keys automatically.

Unfortunately, every single third party code sample on the internet has you including the secret keys in your code.


An employee of mine once committed a keypair for our company GSuite, clearly labeled, in a Python script. I asked her to remove it from the repo, and she simply pushed a new version of the file with the keypair gone. Plus, she hadn’t configured .gitignore, so all the binaries were there too.


The right request would have been to revoke it, not to try to remove it from the repo.


One must be careful when using Docker.

I have seen people do this in their Dockerfile:

    ADD . /src

The intent is to add all the sources to the image, but it inadvertently makes public all the secrets that were in the .env file.

Personally, I like to keep all my secrets very far from my repository.


Sometimes people use .env files and then share them with other developers using a publicly accessible paste service like gist (I'm not kidding).


If it's harder than putting something in a file, people usually don't use it...


If it's not the default behavior, even fewer people use it.


Exactly right. The default behavior has to change, but that's probably going to be an uphill battle. It's easier to protect users from their own mistakes than to change years of habits, though.


Since there's no standard, presumably different people use different methods, or sometimes none at all. Beyond that, people could still make a mistake and put something in code that belongs in an env file.


Honest question: where should I put that env file?


You can keep a .env.default or .env.sample in your repo, but never use it directly. It should only document what the available parameters are.

Using a .env file is a bit of an anti-pattern, partly because many applications expect it and thus will be affected by it in ways you might not want. Passing configuration to applications via the environment is also not great, because the values are all static and the only way to change them is to restart the app. Better to have a function that can reload a real data format (JSON, YAML, INI) at run time.

Each environment that runs your app will need to have its own 'env file' because every environment is slightly different, and they're coupled to deployments. So I'd keep your environment stuff wherever your deployment stuff is; with your terraform/ansible/puppet/chef configs, or etc/consul, or an S3 bucket, or SSM, etc. Create it at deploy time, pull it into the app at run time.
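
As a rough illustration of the "reload at run time" idea, here is a minimal Python sketch. The class name and the choice to detect changes via the file's modification time are my own assumptions; a real app might prefer a signal or file watcher.

```python
import json
import os


class ReloadableConfig:
    """Read settings from a JSON file and re-read it whenever it changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._values = {}

    def get(self, key, default=None):
        self._reload_if_changed()
        return self._values.get(key, default)

    def _reload_if_changed(self):
        # Compare modification times so an unchanged file is not re-parsed.
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            with open(self.path) as fh:
                self._values = json.load(fh)
            self._mtime = mtime
```

The file itself would be created at deploy time, outside the repo, and the app would call `cfg.get("some_key")` wherever it needs a value, so a changed setting takes effect without a restart.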


The title's a great example of Pun? Description. The pun gets the attention, the description tells you why you should click.



