I hate to pile on to the complaints about sending usage metrics to a server, but this is pretty funky. The metrics include the instance id, account id, a list of the commands run with timestamps, the region, a bunch of metadata about the number of VPCs, Subnets, IAM users and policies, etc. Which is kind of a lot to get, and definitely isn't anonymous. Why do they need to know how many VPCs, Subnets, IAM users, and IAM roles I have?
Then there's how the data is sent. The metrics are converted to JSON, gzipped, then AES-encrypted with a random key. The random key is then encrypted with a constant public key. The encrypted key and encrypted payload are serialized into some JSON, which is then POSTed to an HTTPS URL. This seems unnecessarily convoluted, and even with my meager knowledge of crypto I already see some problems (compressing then encrypting is a no-no) which could spell trouble. Shouldn't you just upload the JSON of the metrics over an SSL connection?
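In code, the flow described above looks roughly like this. This is only a sketch to show the shape of it; the package name, AES mode, key size, and endpoint URL are my guesses, not the project's actual implementation:

```go
// Sketch only: gzip the metrics JSON, encrypt it with a random AES key,
// wrap that key with a fixed public key, then POST both over HTTPS.
package stats

import (
	"bytes"
	"compress/gzip"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"net/http"
)

func send(metricsJSON []byte, serverPub *rsa.PublicKey) error {
	// 1. gzip the metrics.
	var gz bytes.Buffer
	zw := gzip.NewWriter(&gz)
	zw.Write(metricsJSON)
	zw.Close()

	// 2. Encrypt the compressed payload with a random AES key
	// (AES-GCM here; the tool may use another mode).
	aesKey := make([]byte, 32)
	if _, err := rand.Read(aesKey); err != nil {
		return err
	}
	block, err := aes.NewCipher(aesKey)
	if err != nil {
		return err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return err
	}
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	payload := gcm.Seal(nonce, nonce, gz.Bytes(), nil)

	// 3. Wrap the random AES key with the constant public key.
	wrappedKey, err := rsa.EncryptOAEP(sha256.New(), rand.Reader, serverPub, aesKey, nil)
	if err != nil {
		return err
	}

	// 4. Serialize both parts as JSON and POST over HTTPS
	// (example.com is a placeholder endpoint).
	body, _ := json.Marshal(map[string]string{
		"key":  base64.StdEncoding.EncodeToString(wrappedKey),
		"data": base64.StdEncoding.EncodeToString(payload),
	})
	_, err = http.Post("https://example.com/stats", "application/json", bytes.NewReader(body))
	return err
}
```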
Thanks all for the feedback. We understand that the data collection should have been done with much more care. As a result, until we design a better way to send truly anonymous data, we have disabled the data collection (cf. https://github.com/wallix/awless/commit/f6389e75787390bd7797...). We will let you know when we have something better, keeping everything transparent, as we will always do.
Yeah, this is a huge trust violation. I'm glad people caught on to it quickly. You should not be phoning home with _anything_ from a tool like this without it being 100% opt-in and with huge warnings and alerts. Even if you do it with the most care and best of intentions, a bug could easily compromise security of the people using your tool by uploading too much or the wrong information to your servers. This is not a risk worth taking for you or any potential users of the software.
Because, whether or not we like it, the reality is that data collection is the #1 way for a product to eventually work towards profit? And even if it's a totally free, open-source project with no intention of profiting, too many developers feel like analytics is the only way to gather information to improve?
Metrics are converted to JSON, gzipped, and AES-encrypted with a random key. True.
The random key is then encrypted with our public key. True as well, and perfectly fine.
“This seems unnecessarily convoluted”
No, this is the proper way to do it!
We don’t want to send data in cleartext. We don’t want to store data in cleartext on cloud servers either. The statistics that we collected (PS: they are not collected anymore; we disabled the collection until we take the time to explain what we collect and, most importantly, why) were then downloaded and analysed locally. AES encryption is perfectly fine and necessary in that case. If we relied only on TLS, as some comments suggest, the statistics would be accessible to anyone who has access to our AWS instance and infrastructure.
And then:
“compressing then encrypting is a no-no”
This sounds like a recipe taken from one very specific attack (the CRIME attack, as explained in comments down the thread). However, that attack does not apply in this case. What ensues is a long discussion about how we are supposedly incapable of implementing crypto, although it rather seems the comment's author is mixing things up.
Take away here: Judge for yourself! We want to simplify considerably how AWS infrastructures are created and managed with awless, give it a look. From version 0.0.14 on, no data will ever be collected without your consent.
Perhaps it is just my own meager understanding of cryptography, but I didn't know of anything that would make this a bad idea. Can you explain why it is potentially a problem?
It's called a compression oracle attack https://en.wikipedia.org/wiki/Oracle_attack
Basically, you know roughly how big the plaintext is, and you can determine the size of the compressed payload by monitoring the traffic, so you can measure the amount of entropy in the plaintext.
An example is the CRIME attack. But that involves a chosen plaintext attack, so I'm not sure if something can be done with this method.
This is the problem everyday developers, like myself, have with attack vectors on encryption and cryptography. We read up on Oracle attacks, and specific proven attacks like CRIME (and every other attack against openssl and crypto in general). Yet the majority of us can barely understand the basic details. The majority of the research written about the topic is published from the perspective of the top 1000 cryptographers on the planet. Whether the concepts are honestly too complex for "normal developers" to comprehend or whether the top experts in the field enjoy the superiority and exclusivity of being "in the know" while labelling the rest of us "stupid", the fact is nobody dumbs down the information involved to the point where 95% of us who work in technology can apply the knowledge to counter these attacks.
The research is published with information that is far too low-level. Very few software developers, including the vast majority of engineers with degrees, understand the theory and math behind these issues. The best of the worst of us know not to roll our own crypto, but that is clearly the tip of the iceberg. Someone out there needs to figure out how to properly explain "Crypto for Dummies" if we ever want or expect the overall security of encryption to improve.
I would say that "compress, then encrypt is bad" is the wrong message to take away from this type of vulnerability. In the case of CRIME in particular, the issue was that:
1. The attacker provided part of the message.
2. The rest of the message contained a secret.
3. The entire message (attacker-provided part and secret) was compressed together.
We can stop there; the length of the compressed data now contains information about the similarity between the attacker-provided content and the secret.
The correct lesson to take away from this is "do not compress a combination of attacker-provided content and secrets". Compressing before encrypting is perfectly sensible. (And, by the way, compressing after encrypting isn't better, it is useless since your encrypted content ought to be incompressible.)
It also relies on the ability of the attacker to inject their own data into the compressed content. Without that vector there's nothing special about compressed and encrypted data vs. any other encrypted data.
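To make the length leak concrete, here is a tiny sketch; the "secret" and the guesses are made up, and encryption would preserve the compressed length, so an observer on the wire sees the same numbers:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedLen returns the gzip-compressed size of data.
func compressedLen(data []byte) int {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(data)
	w.Close()
	return buf.Len()
}

func main() {
	// Hypothetical secret that ends up in the compressed message.
	secret := "secret_token=4f9a2b7c8d1e03a6"

	// Attacker-controlled guesses that get compressed alongside it.
	for _, guess := range []string{
		"secret_token=zqwxvbnmkjhgfdsa", // wrong guess
		"secret_token=4f9a2b7c8d1e03a6", // right guess
	} {
		payload := []byte(guess + "&" + secret)
		fmt.Printf("%-32q -> %d compressed bytes\n", guess, compressedLen(payload))
	}
	// The right guess compresses noticeably better, because gzip can
	// back-reference the whole repeated string; length alone leaks how
	// similar the attacker's injected data is to the secret.
}
```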
I get that part, but I don't see why they are encrypting it in the first place. It's being sent over SSL, so why bother?
Plus, hybrid cryptosystems exist because symmetric encryption is much faster than asymmetric, which matters for large amounts of data. But this is (even before compression) probably only about a kilobyte of data. Why have the extra complexity?
This is kinda why I avoid so many packages, even probably-safe stuff like iterm. I try to go vanilla as much as possible and avoid having to think hard of which service or app may be transmitting my data.
The hash functions are irreversible, so it is impossible to get back to the original identifiers.
We added these anonymous ids in order to know which commands are the most used per user.
Anyway, if you have better ideas on how to manage this, feel free to make a pull request or create a GitHub issue. And if you prefer to disable it, you can also do that easily in the source code (you just need to comment out a few lines).
You don't need to break SHA256 to de-anonymize these values.
`awless` collects account number hashes. AWS account numbers are 12 decimal digits long, meaning there's a total of 10^12 unique values. Values are anonymized before submission using a single round of SHA256, so in ~2^40 hash operations, anyone with your database of hashes can invert every single account number.
For comparison, the bitcoin blockchain presently has a hash rate of ~2^61 SHA256 hashes per second. (Edit: I incorrectly stated 2^41 based on a hash rate of 3 TH/s, when it's actually 3 million TH/s.)
On my not-so-special spare server, I'm able to pregenerate the hashes with that fixed salt at 344,191 per second. So, it would take only about a month to compute them for every 12 digit AWS account number. And, as mentioned, that's on my not-so-fast spare server, running in one process, one thread.
acct [000003441910] has hash [d2a52833a6e434d2a55be0ce852c2dd9c5260c49a7c28ea4fa3fe2ac6d054d7e] (the last one it finished in 10 seconds)
A little effort with a decent GPU and hashcat, though, would take this exercise down to a few minutes.
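For reference, the enumeration itself is trivial, something along these lines; the salt value and the salt+id ordering here are placeholders, not the tool's real constant:

```go
// Sketch of pre-generating salted SHA256 hashes of every possible
// 12-digit AWS account number.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func main() {
	const salt = "some-fixed-salt" // placeholder; the real constant is in the awless source

	// Only 10^12 candidates: store hash -> account id, and any collected
	// hash maps straight back to an account number.
	for id := int64(0); id < 1_000_000_000_000; id++ {
		acct := fmt.Sprintf("%012d", id)
		sum := sha256.Sum256([]byte(salt + acct))
		_ = hex.EncodeToString(sum[:]) // insert into your lookup table here
	}
}
```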
Good point. Thanks for the advice, we will quickly look into how we can improve this.
Our goal is above all to make using AWS easier and, as a result, more secure. We do not want to expose CLI users to any new threat. We made the source code available to everyone (including the anonymous data collection), to be transparent, get feedback on our work, and correct it when needed.
PBKDF2, bcrypt, and scrypt are all used where a database needs to store something and check for equality, but where the values in the database need to not be reversible even if the database is breached. They might be suitable here.
None of those can deal with the case of having too limited an input range. Even if you use a million rounds, you've only multiplied the workload by about 2^20.
You can create a randomly generated cookie of sorts instead of doing anything with a user's credentials. The end result would be the same, and yet people would feel more comfortable.
Your claim that you are using an irreversible hash is not comforting.
Your forced data collection is also not comforting.
> You can create a randomly generated cookie of sorts instead of doing anything with a users' credentials.
That throws off their statistical analysis. Random cookies generate a new cookie for each install or re-install, inflating the "users" count. If someone installs this on five different servers, the stats under random cookies will show five separate streams of data, and they will draw the improper conclusion that a particular operation used on all of those servers is five times more popular than it really is. A configuration flag to disable the data collection is reasonable, but using a well-known hash like Whirlpool to anonymize the data stream is also reasonable.
If someone doesn't like data collection, then they shouldn't use cloud products, and they should decry cloud services just as vociferously. With cloud services, whether or not the usage data collection is anonymized is at the vendor's discretion, but here, you control the source. Using a utility for a cloud service while complaining about usage data collection is ironic, considering AWS surely collects the same data.
Well of course they do, since all of these commands send calls off to AWS servers. And if you're using AWS products you already trust Amazon, but that doesn't mean you trust a random person who put some code on GitHub.
This whole mess should be opt-in, but it's shocking that anyone thought uploading account IDs hashed with known salts was a good idea. How long did it take you to generate the rainbow table? What you did was more difficult than simply generating a random string as you should have done.
As the project is Apache-licensed, you're free to modify it if you don't want this. Also, if you're conscious about privacy you should use an application firewall on the client side, like Little Snitch, since a lot of the software you install on your machine does this too.
I like the look of this, so on the software side it's a thumbs up.
However, the fact that the code is active at all will rule it out for some companies (firewall or not).
Perhaps make it something users can turn off in a config file? Not everyone can code in Go, especially if their job is as a sysadmin, which isn't unlikely given that this is an infrastructure tool, so it might not be as simple as forking and editing the code for them.
Or make it possible to turn off with an environment variable. There are a couple of ways to make the tool report by default while letting locked-down environments opt out. The key thing is to make what is happening transparent.
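Something as small as this would do; AWLESS_NO_STATS is a name I just made up, not an existing option of the tool:

```go
package stats

import "os"

// enabled reports whether statistics should be sent; setting the
// hypothetical AWLESS_NO_STATS variable opts out entirely.
func enabled() bool {
	return os.Getenv("AWLESS_NO_STATS") == ""
}
```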
I appreciate that you folks released this OSS tool.
However:
Where I work, as long as the data collection code is in there, whether I can modify it or not, they won't allow it on our computers. I know this is not uncommon.
Dismissing this concern by saying "other software does this" while awless falls into a different category (small CLI tool) is also problematic.
What does the data payload look like? I'd like to see the actual data you're sending, even if it's just a mock. From digging around in the code, it looks like you're sending infra data, including instance IDs. How do I know you aren't sending my AWS access tokens[0]?
Bitching? Are you kidding me? This is user feedback. Someone posted here to promote the tool, we are asking them to remove it, and that becomes bitching? That's insulting on your part.
Maybe you're right and I'm being unfair. It just seems kind of dick-ish - what's wrong with even "Cool, but I don't like stats being collected, please make this opt in"?
This kind of functionality is generally frowned upon in the Free Software world. For example, in Debian, it'd be treated as a bug and patched out. So I disagree; calling it out to inform others is entirely appropriate.
The tool phones home. Their website doesn't have HTTPS. It's plausible that the tool phones home over an unencrypted channel (I didn't look, so I could be wrong).
My overall impression is that they don't do security very well.
@heartsucker If you want to judge based on previous work, we are the team that created http://opalang.org and have no tie at all with the company's static, outsourced web portal. Also, I will be in Berlin soon; contact me and I will gladly meet you there.
Sure, and if we imagine a hypothetical entity that has 10 products with security holes and then releases an 11th, it might be worth looking at the 11th more suspiciously. Things don't happen in a vacuum.
Just wondering here, but why would you use this vs Terraform? Given that I can define most of the stuff I need from AWS in Terraform and check the state of the infra via the plan command, what would be the use case for this CLI? I'm actively trying to break the habit of modifying infra without first writing a Terraform document for it. This way I can always be sure that I have no surprises when creating a new environment.
There is no state file, and it is a more "AWS-first" way to do queries and one-off tasks. Especially for destroying old stuff, there is value in a non-idempotent approach.
How complicated have your Terraform templates been? It gets pretty hairy pretty fast, because not everything fits the declarative Terraform model.
I'm considering learning Go, but the amount of 'return err' and 'return nil, 0, err' is an instant turn-off. Is this best-practice error handling in Go?
Thanks!
It's pretty normal and you will get used to it quickly.
Thanks to this pattern it's very hard to ignore errors.
The only thing that could be done better: instead of always blindly returning an error, one could wrap errors in higher-level errors and build a sort of error trace.
E.g.:
- task failed because
- authentication failed because
- could not load credentials because
- file xy.pem is not readable
But instead of the above, you often just receive a "permission error" without knowing where it came from, which can make debugging hard.
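For what it's worth, wrapped errors (fmt.Errorf with %w, available since Go 1.13) let you build exactly that kind of trace. The function names below are made up for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

func loadCredentials(path string) error {
	if _, err := os.ReadFile(path); err != nil {
		return fmt.Errorf("could not load credentials: %w", err)
	}
	return nil
}

func authenticate() error {
	if err := loadCredentials("xy.pem"); err != nil {
		return fmt.Errorf("authentication failed: %w", err)
	}
	return nil
}

func runTask() error {
	if err := authenticate(); err != nil {
		return fmt.Errorf("task failed: %w", err)
	}
	return nil
}

func main() {
	if err := runTask(); err != nil {
		// Prints the whole chain, e.g.:
		// task failed: authentication failed: could not load credentials:
		// open xy.pem: no such file or directory
		fmt.Println(err)
		// The root cause survives the wrapping and is still testable.
		fmt.Println(errors.Is(err, os.ErrNotExist))
	}
}
```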
Basically. It's a little different because you have more control. It's way too easy to assume exceptions won't happen and basically ignore them. But they happen in unexpected places (e.g. every time you deal with IO).
Exceptions only contain a function call trace (stack of function calls), while a logical error trace is more like an explicit try/catch/wrap/throw around every call and could be more informative to the end user if done properly.
This does not require exceptions. A very simple implementation could involve appending the additional context to the message from the lower error level. As far as I understand, exceptions were invented so that error handling code would not clutter up the "happy path"; they allow "exceptional condition handling" to be segregated. But in real-world applications you may need to, say, recover from an error. For example, if a connection over https fails, fall back to http. This makes code using exceptions look wonky, so you might as well use error codes everywhere.
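A rough sketch of that fallback with plain error values (the host and URL path are just for illustration):

```go
package main

import (
	"fmt"
	"net/http"
)

// fetchStatus tries https first and explicitly falls back to http,
// something that is awkward to express with a try/catch wrapped around
// the happy path but natural with error values.
func fetchStatus(host string) (*http.Response, error) {
	resp, errTLS := http.Get("https://" + host + "/status")
	if errTLS == nil {
		return resp, nil
	}
	resp, errPlain := http.Get("http://" + host + "/status")
	if errPlain != nil {
		return nil, fmt.Errorf("https failed (%v); http fallback failed too: %v", errTLS, errPlain)
	}
	return resp, nil
}

func main() {
	if resp, err := fetchStatus("example.com"); err == nil {
		fmt.Println("status:", resp.Status)
		resp.Body.Close()
	} else {
		fmt.Println(err)
	}
}
```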
This particular code would benefit greatly from named return values.
For example, BuildStats is defined as returning (*stats, int, error), when it could name those results and just use naked returns. In buildInstancesStats, they name the return values but then repeat them on all 3 return lines instead of just using a bare "return".
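Roughly the difference, with illustrative functions rather than the actual awless code:

```go
package stats

// Without named results, the values must be repeated on every return.
func buildStatsVerbose(cmds []string) (map[string]int, int, error) {
	counts := make(map[string]int)
	total := 0
	for _, c := range cmds {
		counts[c]++
		total++
	}
	return counts, total, nil
}

// With named results, a bare "return" hands back whatever the named
// variables currently hold.
func buildStatsNaked(cmds []string) (counts map[string]int, total int, err error) {
	counts = make(map[string]int)
	for _, c := range cmds {
		counts[c]++
		total++
	}
	return
}
```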
The easiest way to build awless from source is to use our release script (`go run release.go -tag 0.0.13`).
Or you can also get the full sources with `go get github.com/wallix/awless`. Then, `go build .` in `$GOPATH/src/github.com/wallix/awless/` should work.
If anyone is interested in a more minimalistic alternative to this, I have been designing an "as simple as possible, but no simpler" devops toolkit and CLI on top of the AWS CLI: https://github.com/kislyuk/aegea
Looks great. My main complaint is that the templates feature appears to be reinventing Terraform. Would have been cool if we could use the Terraform templates we already have, or even provide support for CloudFormation (given this is an AWS-only tool).
awless is currently in its early life. We also plan to support both CloudFormation (first) and maybe Terraform at some point. CF and TF are exhaustive but more complex than awless templates.
awless is meant to simplify how we can create and manage an AWS infrastructure (which is originally our own need at Wallix), and we wanted to have simple templates as part of the CLI.
For me, the output is friendlier, which is great if you're just trying to make a quick query about the running instances or something.
However, the whole thread above pointing out the data collection issues makes me far less likely to be moving away from the official cli + bash magic any time soon
One reason I can think of is this is written in Go while the official AWS CLI tools are written in Python (and some of the older ones are Java, which is often a terrible choice for CLIs given the startup time).
A Go CLI tool has some deployment advantages over Python.
Also, we're still at the beginning of the project. Since we build an RDF model of the infrastructure (stored locally), we will soon be able to answer many advanced queries easily, such as finding everything inside a VPC or the siblings of an instance.
awless also includes an easy-to-write template engine (vs. CloudFormation or Terraform, which we also plan to integrate).
See more features in the README. Note that, according to feedback since launch, awless seems noticeably faster than aws-cli. The latest version (which you build with go install) collects no statistics; try it for yourself!
This looks great. I use the boto tools for all my aws stuff but they're clumsy for working in an interactive fashion like this. Going to install now to have a play!
(I'm one of the core developers of awless.)
We are going to add filters and queries very soon. We built awless by relying on RDF to represent the cloud resources. As a result, in addition to simple filters on the properties, we can also represent more complex queries such as "everything inside a VPC", "the siblings of an instance", etc. Try the `awless show` command to get an idea of what we can do.
I like the use of RDF to sync state between local and remote. Nice to see an idea similar to one I have had for a while: a client for AWS that syncs state between AWS and local.
As of now, you can create a template to deploy an environment, and for instance create one master node (such as a subnet) for each environment.
The missing values in the template (aka "holes") will be prompted for by awless, so you can have staging and production deployments.
Note that we just released the project last Friday, and we have an ambitious roadmap for the templates in particular. For instance, we could password-protect access to some nodes, or prevent wrong actions on the production env.
To all those complaining about the collection of stats: pretty much _every_ SaaS company is collecting stats about your behavior. It seems a bit off that there is so much rage about a project whose code is out there in the open to collect stats.
Point me to at least a single popular SaaS product that does not have analytics in its pages.
What part of "Awsless: A Mighty CLI for AWS" sounds like SaaS?
"SaaS" means software as a service, something someone else runs, generally web based where metrics are relevant, not a command line tool on your own machine talking to your own infra.
From the README:
Install
Choose one of the following options:
- Download the latest awless binaries (Windows/Linux/macOS) from Github
- If you have Golang already installed, build the source with: go get github.com/wallix/awless
- On macOS, use homebrew: brew tap wallix/awless; brew install awless
Not sure how to respond. I wasn't comparing awsless to a SaaS at all. I was saying that SaaS collects info about you all the time and I don't see anyone going up in arms about that.
According to you, it's OK to collect data if the software is on another person's infra.
Yes, you compared them, or rather, compared the reactions of people to them. If you aren't implying by extension that SaaS and awsless are comparable then your argument makes no sense.