Characterizing secret leakage in public GitHub repositories

dguo · on April 8, 2019

I accidentally published[1] my AWS secret key last year because I pushed an old project from college. At the time, I was very new to using source control and had little idea how to distinguish between what should and shouldn't be committed. I hope colleges and code boot camps go over that sort of info nowadays. The usefulness to effort to learn ratio seems exceptionally high.

[1]: https://www.dannyguo.com/blog/i-published-my-aws-secret-key-...

aiddun · on April 9, 2019

My sophomore year of high school, I was trying to write a Discord (chat platform) bot for a server I shared with friends and unknowingly included the private key in a public repo I hoped to show them. A crawler written specifically to find Discord keys found the key and started spamming the server with images of very, very undesirable things from the far corners of the internet at a rate of hundreds per second. Needless to say, I learned my lesson the hard way.

HNLurker2 · on April 9, 2019

Couldn't we use Google's BigQuery to search for private keys?

sbmthakur · on April 9, 2019

Thanks for sharing! How do people write such crawlers? Do they specifically point them at GitHub repos?

yorwba · on April 9, 2019

The paper discussed in the article describes writing such a crawler. They simply use the GitHub search API.

Liveanimalcams · on April 9, 2019

When I attended Hack Reactor they did tell us not to push them. However since they didn't teach us git (they expected us to know it) many still pushed them up. You would know because they'd get an email from some random company/person letting them know that they found their secret keys and that they should enroll/buy their services if they don't know what they're doing. Luckily no one from my class got hosed, but others in the past had.

pmc · on April 8, 2019

An increased use of SSH keys is also found in our SSH honeypot: https://pmcao.github.io/caudit/ ~ plugging our paper.

rsmolinski · on April 9, 2019

Looks like a great methodology and good results. Looking forward to reading the paper because I've been working around the GitHub API restrictions for the same purpose.

Specifically, I'm building a SaaS (https://www.locktower.com/) for organizations (or security teams) looking to have a managed solution for detecting leaked secrets in GitHub/BitBucket/etc. I'm in the process of building an on-prem version as well. Overall, I really hope to help drive down the number of unresolved leaks that the authors found.

7ewis · on April 8, 2019

I wrote a tool that scans all the new commits to our Org for passwords/secrets.

Webhook > AWS API Gateway > Lambda

The Lambda uses the new(ish) Layers feature so it can use Git. I then use the truffleHog[0] library to scan for entropy/regexes inside the commit.

If something is detected, it posts to an SNS topic, which is currently subscribed to by another Lambda that posts an alert to my team and the Security team's Slack channel.

It then calls the GitHub API to make the repo private to limit the exposure.

[0] - https://github.com/dxa4481/truffleHog
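A minimal sketch of what "scan for entropy/regexes" can look like. This is not truffleHog's actual implementation: the function names, the token pattern, and the 4.5-bit threshold are assumptions chosen for illustration. The `AKIA` pattern reflects the commonly documented shape of AWS access key IDs.

```python
import math
import re
from collections import Counter

# AWS access key IDs have a well-known shape: "AKIA" plus 16 uppercase alphanumerics.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")
# Candidate tokens for the entropy check: long runs of base64-like characters.
TOKEN = re.compile(r"[A-Za-z0-9+/=_-]{20,}")


def shannon_entropy(s: str) -> float:
    """Bits of entropy per character, from the string's own character frequencies."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def scan_text(text: str, threshold: float = 4.5) -> list[tuple[int, str, str]]:
    """Return (line number, reason, match) for each suspicious string."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in AWS_KEY_ID.finditer(line):
            findings.append((lineno, "aws-access-key-id", m.group(1)))
        for m in TOKEN.finditer(line):
            # Illustrative threshold: random-looking tokens score high,
            # ordinary identifiers and words score low.
            if shannon_entropy(m.group(0)) >= threshold:
                findings.append((lineno, "high-entropy", m.group(0)))
    return findings
```

In a pipeline like the one described, something along these lines would run over each commit's changed lines, and a non-empty result would trigger the alert.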

semi-extrinsic · on April 8, 2019

Why not have a pre-commit hook clientside that runs truffleHog AND if successful generates some form of file indicating it was run, then have a serverside hook checking for that file? This should be doable even with plain GitHub/etc, no?
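A rough sketch of the client-side half of that idea, assuming a Python script is used as the hook. It does not call truffleHog and does not write the marker file; the two patterns in `PATTERNS` are hypothetical stand-ins for a real scanner's rules.

```python
import re
import subprocess
import sys

# Hypothetical rule set; a real hook would delegate to a dedicated scanner.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def added_lines(diff: str) -> list[str]:
    """Lines a unified diff adds, without the leading '+'.

    '+++' file headers are skipped. As a simplification, an added line whose
    own content starts with '++' is skipped as well.
    """
    return [
        line[1:]
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def find_secrets(diff: str) -> list[tuple[str, str]]:
    """Return (rule name, offending line) pairs for the added lines of a diff."""
    return [
        (name, line)
        for line in added_lines(diff)
        for name, pattern in PATTERNS.items()
        if pattern.search(line)
    ]


def main() -> int:
    # `git diff --cached` shows what is staged for the commit being created.
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = find_secrets(diff)
    for name, line in hits:
        print(f"possible secret ({name}): {line.strip()}", file=sys.stderr)
    return 1 if hits else 0  # a non-zero exit status aborts the commit
```

To use it as a hook, the script would end with `sys.exit(main())` and be saved as an executable `.git/hooks/pre-commit`. Note that client-side hooks can be skipped with `git commit --no-verify`, which is presumably why the comment pairs this with a server-side check.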

waffleguy · on April 8, 2019

That was too simple to think of ;)

ithkuil · on April 9, 2019

How do you install serverside hooks in GitHub?

waffleguy · on April 9, 2019

I’m sorry! I didn’t realize my screen name said google

ithkuil · on April 10, 2019

I'm sorry! I realized my question sounded as if I was asking instructions about how to do something that was possible but I didn't know how to do it.

Rephrasing:

What if you use a git server like GitHub that doesn't allow you to install server side hooks?

pry_or · on April 9, 2019

I assume you saw the note on truffleHog in the article? The paper found it to be rather inaccurate outside of the basics (mainly AWS keys). Hopefully the authors open source their stuff.

novaleaf · on April 9, 2019

This highlights the difficulty of sharing secrets with your production code. How can you get secrets into production in a secure way?

Cloud providers have proprietary solutions, but those don't work on other providers (or your local dev env).

Rolling your own secrets server seems like an expensive centralized disaster waiting to happen.

It seems like putting a secret into source code is one of the least risky options. Just make sure it's not in a public git repo.

pr0tocol_7 · on April 8, 2019

https://github.com/zricethezav/gitleaks plugging my own tool. You can enforce custom rules like entropy ranges + custom regexes to get fewer false positives, similar to what is described under "Validity Filters" in this article.

torbjorn · on April 14, 2019

The article quotes someone making this claim:

> we discovered that even if commit histories are rewritten, secrets can still be recovered…. we discovered we could recover the full contents of deleted commits from GitHub with only the commit’s SHA-1 ID.

Do repo cleaning tools such as https://rtyley.github.io/bfg-repo-cleaner/ leave the original commit's SHA-1 ID intact?
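Some background that bears on this question: a git object ID is a SHA-1 over the object's content, so any tool that rewrites content necessarily produces objects with new IDs, while the original objects keep their old IDs for as long as a copy of them exists somewhere. The sketch below shows this for blobs; commits are hashed the same way with a `commit` header. The helper name `git_object_id` and the sample file contents are made up for illustration.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 object ID as git computes it: hash of '<kind> <size>\\0' + body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical file contents before and after scrubbing a secret.
before = git_object_id("blob", b"API_KEY=hunter2\n")
after = git_object_id("blob", b"API_KEY=\n")
print(before != after)  # True: changing the content changes the ID
```

So a cleaned history gets new IDs, and whether the old ID still resolves depends on whether the host has discarded the old objects.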

aflag · on April 8, 2019

Shouldn't GitHub put in some sort of warning for potential leaks when they happen?

gcommer · on April 8, 2019

From the paper: "GitHub recently introduced a beta version of Token Scanning"

https://help.github.com/en/articles/about-token-scanning

https://github.blog/2018-10-17-behind-the-scenes-of-github-t...

aflag · on April 8, 2019

That's ok, I guess. But I was thinking more along the lines of blocking the push unless you disabled the feature for your repo.

Pawamoy · on April 8, 2019

I once pushed a GitHub token in a public repository. They immediately revoked the token and notified me by email.

LyndsySimon · on April 8, 2019

I believe they have it - I've gotten notified in the past when I committed secrets on purpose, for test applications. I'm not 100% sure they were from GitHub, but I think they were.

herohamp · on April 8, 2019

Wait, do people not use .env files? I've aliased "gitinit" to make a .env file and a .gitignore that ignores .env, node_modules, etc., then run git init.
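For anyone wanting something similar, here is a small sketch of such a "gitinit" helper, assuming Python instead of a shell alias. The function name `scaffold`, the `run_git_init` flag, and the exact ignore entries are illustrative choices, not the commenter's actual alias.

```python
import subprocess
from pathlib import Path

# Entries taken from the comment above; extend to taste.
IGNORED = [".env", "node_modules/"]


def scaffold(project_dir: str, run_git_init: bool = True) -> Path:
    """Create an empty .env and a .gitignore that excludes it, then `git init`."""
    root = Path(project_dir)
    root.mkdir(parents=True, exist_ok=True)
    (root / ".env").touch()  # secrets go here and stay out of version control
    gitignore = root / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    # Only append entries that are not already listed, so reruns are harmless.
    missing = [entry for entry in IGNORED if entry not in existing]
    if missing:
        gitignore.write_text("\n".join(existing + missing) + "\n")
    if run_git_init:
        subprocess.run(["git", "init"], cwd=root, check=True)
    return root
```

The point of doing it at project creation is that the .gitignore exists before the first `git add`, so the .env file never enters history in the first place.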

scarface74 · on April 8, 2019

There is no excuse to ever have AWS secret keys anywhere in your code or your settings.

If you are running locally, you should be using your own secret keys that are configured in your user directory with

  aws configure

If you are running on anything within AWS, you should be using a role attached to your EC2 instance or Lambda, and the SDK can retrieve your keys automatically.

Unfortunately, every single third party code sample on the internet has you including the secret keys in your code.

i_am_nomad · on April 8, 2019

An employee of mine once committed a keypair for our company GSuite, clearly labeled, in a Python script. I asked her to remove it from the repo, and she simply pushed a new version of the file with the keypair gone. Plus, she hadn’t configured .gitignore, so all the binaries were there too.

inimino · on April 9, 2019

The right request would have been to revoke it, not to try to remove it from the repo.

alain_gilbert · on April 9, 2019

One must be careful if they use docker.

I have seen people doing this in their Dockerfile

    ADD . /src

This adds all the sources to the image, and inadvertently makes public every secret that was in the .env file.

Personally, I like to keep all my secrets very far from my repository.

33degrees · on April 8, 2019

sometimes people use .env files, and then share them with other developers using a publicly accessible paste service like gist (I'm not kidding)

bifrost · on April 8, 2019

If it's harder than putting something in a file, people usually don't use it...

giancarlostoro · on April 8, 2019

If it's not default behavior, even fewer people use it.

bifrost · on April 12, 2019

Exactly right. The default behavior has to change, but that's probably going to be an uphill battle. It's easier to protect users from their own mistakes than to change years of habits though.

sbov · on April 8, 2019

Since there's no standard, presumably different people use different methods, or sometimes none at all. Beyond that, people could still make a mistake and put something in code that belongs in an env file.

SwiftyBug · on April 8, 2019

Honest question: where should I put that env file?

0xbadcafebee · on April 9, 2019

You can keep a .env.default or .env.sample in your repo, but never use it directly. It should only document what the available parameters are.

Using a .env file is a bit of an anti-pattern, partly because many applications expect it and thus will be affected by it in ways you might not want. But also because passing configuration to applications via the environment is not great: the values are all static, and the only way to change them is to restart the app. Better to have a function that can reload a real data format (json, yaml, ini) at run-time.

Each environment that runs your app will need to have its own 'env file' because every environment is slightly different, and they're coupled to deployments. So I'd keep your environment stuff wherever your deployment stuff is; with your terraform/ansible/puppet/chef configs, or etc/consul, or an S3 bucket, or SSM, etc. Create it at deploy time, pull it into the app at run time.
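One possible shape for the "function that can reload a real data format at run-time", sketched for JSON. The class name `ReloadableConfig` and the choice of checking the file's modification time on each read are assumptions for illustration; other triggers (a signal, a timer, a file watcher) would work too.

```python
import json
import os


class ReloadableConfig:
    """Re-reads a JSON file whenever its modification time changes."""

    def __init__(self, path: str):
        self.path = path
        self._mtime = None
        self._values = {}

    def get(self, key: str, default=None):
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:  # first call, or the file was rewritten
            with open(self.path, encoding="utf-8") as f:
                self._values = json.load(f)
            self._mtime = mtime
        return self._values.get(key, default)
```

With this, the deploy step writes the file and the app calls something like `cfg.get("timeout")` wherever it needs a value, picking up changes without a restart. A production version would also want to handle a file that is only half-written at the moment it is read.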

lallysingh · on April 8, 2019

The title's a great example of Pun? Description. The pun gets the attention, the description tells you why you should click.