Totally possible. As an example, Codec2, used in ham-radio applications, can send pretty decent voice audio at a few kbit/s or less, and it stays perfectly understandable even at 1.2 kbit/s, as in this short sample:
http://www.rowetel.com/downloads/codec2/hts1a_1200.wav
To spell it out, that's about 13 megabytes every 24 hours (1.2 kbit/s is 150 bytes/s, times 86,400 seconds). If there are 10 million Echo devices, that's about 130 million megabytes a day, or roughly 130 terabytes a day. That's actually not that hard to handle for the company that runs AWS, so while extremely unlikely, it's not impossible. Just very, very costly.
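For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes decimal units (1 kbit = 1000 bits, 1 MB = 10^6 bytes) and the hypothetical fleet of 10 million devices from this comment; the names are made up for illustration.

```python
# Back-of-the-envelope storage estimate for continuous 1.2 kbit/s audio.
# Assumptions: decimal units, 24/7 recording, 10 million devices (hypothetical).
BITRATE_BPS = 1200                  # Codec2 mode from the linked sample
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
DEVICES = 10_000_000


def bytes_per_day(bitrate_bps: int, seconds: int = SECONDS_PER_DAY) -> int:
    """Bytes produced by a constant-bitrate stream over the given time."""
    return bitrate_bps * seconds // 8


per_device = bytes_per_day(BITRATE_BPS)
fleet = per_device * DEVICES

print(f"per device: {per_device / 1e6:.2f} MB/day")
print(f"fleet:      {fleet / 1e12:.1f} TB/day")
```

With binary units the per-device figure comes out a bit lower (around 12.4 MiB), which is why estimates in the 12 to 13 range all describe the same thing.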
You don't need to record 24/7. A well-planned spying operation would involve multiple devices and connections, location discovery, proximity to other devices, etc.
Phones, tablets, home and laptop PCs, car PCs, smart TVs, and pretty much every connected device can be hijacked into becoming a bug, or into cooperating with one if it is nearby.
The victim's cellphone could establish a secure connection via WiFi or Bluetooth with the Echo or any similar assistant and grab the audio data to transmit. It could then alert the user that some important upgrade is needed on the phone, start transmitting the data, and fake some random download so the link looks like it is just receiving something. That way those 13 or so megabytes of data would go totally unnoticed.
This is of course the product of tinfoilhattery at its finest level, until someone does it for real.
Sure, I was just providing an upper bound for requirements if they decided to store all audio all the time from every device. Of the companies that have the capability to do so with their own resources, Amazon is on the short list. Amazon could possibly pull it off and hide it in the rounding of numbers for their normal business.
Companies that interact (peer) with them would likely see something though, but possibly not as easily as it seems. The average home internet connection probably downloads far more than 13 MB of content from AWS-hosted services every day. The only question is whether the upload amount would trigger any alarms. I think in most cases not, as it would probably just go a very small way towards evening out those peering connections, which are likely very heavy in the other direction.
Easy answer. They may not have to send the audio. They could transcribe it locally on the client, encrypt it, and send text to store on the server. Consider that almost a decade ago, programs like Dragon NaturallySpeaking could run on a relatively inexpensive laptop. It's entirely possible that a dedicated device like the Echo could do this today.
EDIT: Original reply sounded too definitive
Totally not possible. 1.2 kbps * 10M devices is 12 Gbps, which is more than the bandwidth of an STM-64 link. Not practical to either receive or store, even for Amazon, and bandwidth consumption on that scale would certainly be extremely noticeable.
I'm not sure why you would assume one of the largest computational and datacenter service providers in the world with many datacenters in many regions would require all input to be over a single connection to a single location, and even if it was, why it wouldn't come across the many, many peering agreements they have.
There are many reasons why it doesn't make sense for them to do this, but this isn't one of them.
Edit: To clarify, and put this in perspective, 12 Gbps is 1.5 GB per second, which comes to roughly 130 terabytes a day. Amazon, through AWS in multiple regions, is entirely capable of adding 130 terabytes of storage a day, and already transfers MUCH more than 12 Gbps. This is not impossible, just very improbable.
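A rough sketch of the aggregate-bandwidth numbers being argued over here, assuming 10 million devices streaming 1.2 kbit/s nonstop and decimal units. The STM-64 figure is the nominal 9,953.28 Mbit/s line rate; the variable names are just for illustration.

```python
# Aggregate bandwidth if every device streamed 1.2 kbit/s around the clock.
# Assumptions: 10 million devices (hypothetical), decimal units throughout.
BITRATE_BPS = 1200
DEVICES = 10_000_000
STM64_BPS = 9_953_280_000           # nominal STM-64 / OC-192 line rate
SECONDS_PER_DAY = 86_400

aggregate_bps = BITRATE_BPS * DEVICES               # total bits per second
aggregate_bytes_per_sec = aggregate_bps // 8        # total bytes per second
tb_per_day = aggregate_bytes_per_sec * SECONDS_PER_DAY / 1e12
stm64_links = aggregate_bps / STM64_BPS             # links needed at full load

print(f"aggregate: {aggregate_bps / 1e9:.0f} Gbit/s "
      f"({aggregate_bytes_per_sec / 1e9:.1f} GB/s)")
print(f"per day:   {tb_per_day:.1f} TB")
print(f"STM-64 equivalents: {stm64_links:.2f}")
```

So the total is only a bit more than one STM-64's worth of traffic, and nothing requires it to arrive over one link at one location.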
Compression over an hour or a day won't do anything meaningful vs compression over, say, a minute. There's just not that much additional redundancy to eliminate. Uploading in a big burst doesn't save on overall bandwidth, either.
I see a lot of people in this thread saying that's not possible, so here's the math:
As given by squarefoot, you can record human voice at 1200 bit/sec = 150 bytes/sec. A day is 86400 seconds, assume people are talking (generously) 10% of the day, so 8640 seconds * 150 bytes = 1.3 megabytes per day uploaded to Amazon.
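The same estimate as a small sketch, so the talk-time assumption can be varied. The 10% figure is the guess from this comment; the other fractions and the function name are only illustrative.

```python
# Daily upload per device when audio is sent only while someone is talking.
# Assumption: speech is encoded at a constant 1200 bit/s, silence is not sent.
def daily_upload_bytes(bitrate_bps: float, talk_fraction: float,
                       seconds_per_day: int = 86_400) -> float:
    """Bytes uploaded per day for a given fraction of the day spent talking."""
    talk_seconds = seconds_per_day * talk_fraction
    return talk_seconds * bitrate_bps / 8


for fraction in (0.10, 0.25, 1.00):
    megabytes = daily_upload_bytes(1200, fraction) / 1e6
    print(f"{fraction:>4.0%} talk time: {megabytes:.2f} MB/day")
```

Even at 100% talk time, the result stays at about 13 MB per device per day.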
Does anyone doubt that _the company that runs AWS_ is capable of dealing with barely a megabyte per device per day?
You do realize this is the same company that runs AWS, one of the largest networks of cloud services, if not the largest. They can probably do that with the idle capacity of one region.
Practically, it would be unwieldy to have 10,000,000 devices sending audio to 10,000,000 audio decoding processes 24 hours a day, nonstop.
You'd also see a huge bandwidth hit at your ISP, which would certainly kill future adoption if word got out.