The post is good but just scratches the surface on running Kinesis Streams / Lambda at scale. Here are a few additional things I found while running Kinesis as a data ingestion pipeline:
- Only write out logs that matter. Searching logs in CloudWatch is already a major PITA. Half the time I just scan the logs manually because search never returns. Also, the fewer println statements you have, the quicker your function will be.
- Lambda is cheap; reporting function metrics to CloudWatch from a Lambda is not. Be very careful about doing this.
- Having metrics from within your Lambda is very helpful. We keep track of spout lag (the delta between when an event got to Kinesis and when it was read by the Lambda), source lag (the delta between when the event was emitted and when it was read by the Lambda), and the number of events processed (were any dropped due to validation errors?). See the sketch after this list.
- Avoid using the Kinesis auto scaler tool. In theory it's a great idea, but in practice we found that scaling a stream with 60+ shards causes issues with API limits. (Maybe this is fixed now...)
- Have plenty of disk space on whatever is emitting logs. You don't want to run into the scenario where you can't push logs to Kinesis (e.g. throttling) and they start filling up your disks.
- Keep in mind that you have to balance your emitters, Lambda, and your downstream targets. You don't want too few / too many shards. You don't want 100 Lambda instances hitting a service with 10 events per invocation.
- Lambda deployment tools are still young, but find one that works for you. All of them have tradeoffs in how they are configured and how they deploy.
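For the lag metrics mentioned above, here is a minimal sketch of what we track inside the function, assuming Python. The approximateArrivalTimestamp comes from Kinesis itself; the emitted_at field is a hypothetical attribute your own producers would have to put into the payload.

    import base64
    import json
    import time

    def handler(event, context):
        now = time.time()
        spout_lags, source_lags, processed = [], [], 0
        for record in event["Records"]:
            # Set by Kinesis: when the record was written to the stream (epoch seconds).
            arrival = record["kinesis"]["approximateArrivalTimestamp"]
            spout_lags.append(now - arrival)
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            # Hypothetical field added by the producer when the event was emitted.
            if "emitted_at" in payload:
                source_lags.append(now - payload["emitted_at"])
            processed += 1
        # One summary log line per invocation is far cheaper than pushing
        # custom CloudWatch metrics from inside the function.
        print(json.dumps({
            "processed": processed,
            "max_spout_lag_sec": max(spout_lags, default=0),
            "max_source_lag_sec": max(source_lags, default=0),
        }))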
There are some good tidbits in the Q&A section from my re:Invent talk [1]. Also, for anyone wanting to use Lambda but not wanting to reinvent things, check out Bender [2]. Note: I'm the author.
[1] https://www.youtube.com/watch?v=AaRawf9vcZ4
[2] https://github.com/Nextdoor/bender
> For us, increasing the memory for a Lambda from 128 megabytes to 2.5 gigabytes gave us a huge boost.
> The number of Lambda invocations shot up almost 40x.
One thing I've learned from talking to AWS support is that increasing memory also gets you more vCPUs per container.
-----
Serverless is great at scaling and handling bursts, but you may find it VERY difficult in terms of testing and debugging.
A while back I started using an open source tool called localstack[1] to mirror some AWS services locally. Despite some small discrepancies in certain APIs (which are totally expected), it's made testing a lot easier for me. Something worth looking into if testing serverless code is causing you headaches.
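The gist of it is just pointing your AWS client at the local endpoint. A rough boto3 sketch, assuming the single edge port newer localstack versions expose (older releases used per-service ports, so adjust accordingly):

    import boto3

    # Point the Kinesis client at localstack instead of real AWS.
    kinesis = boto3.client(
        "kinesis",
        endpoint_url="http://localhost:4566",  # assumed localstack edge port
        region_name="us-east-1",
        aws_access_key_id="test",               # dummy credentials
        aws_secret_access_key="test",
    )

    kinesis.create_stream(StreamName="test-stream", ShardCount=1)
    kinesis.get_waiter("stream_exists").wait(StreamName="test-stream")
    kinesis.put_record(StreamName="test-stream", Data=b'{"hello": "world"}', PartitionKey="1")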
Thanks! For testing on our local machines, we use SAM Local, https://aws.amazon.com/blogs/aws/new-aws-sam-local-beta-buil... .
This has very similar capabilities to localstack. Still, there is always a delay between a service being released by AWS and it becoming available in tools like Serverless or SAM Local.
The key benefit of pairing Kinesis with Lambda is controlled sequencing and parallelisation at the same time. Kinesis streams guarantee a single Lambda per shard iterator.
For example, at MindMup, we’re using a similar setup to the one described in the article for appending events to user files. It’s critical that we don’t get two updates going over each other in a single file, which is tricky to do with Lambdas alone because there’s no guarantee how many Lambdas will kick off if updates come in concurrently from different users for the same file. With Kinesis, we just use the file ID as the sharding key, so no more than a single Lambda ever works on a single user file, but we can have multiple Lambdas working in parallel on different files.
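For anyone wondering how that looks on the producer side: the partition key is all it takes. A minimal boto3 sketch (the stream name and helper are made up for illustration):

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def append_event(file_id, event):
        # Records with the same partition key land on the same shard, so all
        # updates to one file are handled in order by the single Lambda
        # reading that shard, while other files proceed in parallel.
        kinesis.put_record(
            StreamName="file-events",  # hypothetical stream name
            PartitionKey=file_id,
            Data=json.dumps(event).encode("utf-8"),
        )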
Your assumption is correct. One of our goals was to transform and restructure the data we have in our on-premise MySQL. This is where the Lambdas come in. The Kinesis streams trigger the Lambdas, and the Lambdas filter and process the records and save them to a database on AWS.
You could also have replicas in AWS replicating off your on prem DBs in real time. Is Kafka offering something as the message bus/data transfer mechanism that replication and ETL off the AWS replica can’t?
Let me try to come up with a TL;DR here:
trivago comes from a completely on-premise, central-database point of view. Change Data Capture via Debezium into Kafka enables a lot of migration strategies in different directions (e.g. cloud) in the first place, without the need to change everything on the spot.
It seems like a common pattern to compare Kafka with a pure MQ technology. Kafka can also serve as persistent data storage and a source of truth for data.
I hope this makes the picture a bit more clear to you. Feel free to ask if I missed something.
While the basic idea of using PG logical decoding for CDC is the same, Debezium is a completely different code base than Bottled Water. Also, we provide connectors for a variety of databases (MySQL, Postgres, MongoDB; connectors for further databases such as Oracle are in the works). If you like, you can also use Debezium independently of Kafka by embedding it as a library into your own application, e.g. if you don't need to persist change events or want to connect it to streaming solutions other than Kafka.
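Wiring up one of those connectors is mostly configuration. As a rough sketch, registering the MySQL connector against Kafka Connect's REST API looks something like this; hostnames, credentials and topic names are placeholders, and exact property names vary a bit between Debezium versions:

    import json
    import requests

    connector = {
        "name": "inventory-connector",  # placeholder connector name
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql.onprem.example.com",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "secret",
            "database.server.id": "184054",
            "database.server.name": "onprem",   # logical name, used as topic prefix
            "database.whitelist": "inventory",  # schema(s) to capture
            "database.history.kafka.bootstrap.servers": "kafka:9092",
            "database.history.kafka.topic": "schema-changes.inventory",
        },
    }

    # Kafka Connect exposes a REST API for managing connectors.
    resp = requests.post(
        "http://connect:8083/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
    )
    resp.raise_for_status()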
In terms of CDC use cases, I keep seeing more and more the longer I work on it. Besides replication, there are e.g. updates of full-text search indexes and caches, propagating data between microservices, facilitating the extraction of microservices from monoliths (by streaming changes from writes to the old monolith to the new microservices), maintaining read models in CQRS architectures, live-updating UIs (by streaming data changes to WebSocket clients), etc. I touch on a few in my Debezium talk (https://speakerdeck.com/gunnarmorling/data-streaming-for-mic...).
We have cases where we do this, yes. We create a cluster in the cloud mostly for traffic optimization between the cloud zones and the on-premise datacenters. The cost to value efficiency needs to be reconsidered on a case-by-case basis.
I'm curious about the economics of this design. Real-time stream consumption implies that the consumer is always running, and if you need to run software 24x7, running it on EC2 instances is likely to be far cheaper than running Lambda functions continuously.
The lambdas are only triggered when there is data in the stream, so the "consumer" in this case is not always running. Under the hood there is a process that polls Kinesis to check for records but it's completely managed by AWS and you aren't charged for it.
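Concretely, the "consumer" is just an event source mapping between the stream and the function; AWS runs the poller. A sketch with boto3 (the ARN and function name are placeholders):

    import boto3

    lambda_client = boto3.client("lambda")

    # Declare the mapping; AWS handles polling the shards and invoking the function.
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
        FunctionName="example-consumer",
        StartingPosition="LATEST",  # or TRIM_HORIZON to start from the oldest records
        BatchSize=100,              # max records handed to a single invocation
    )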
Plus, in any real world scenario, if you're running 24x7 your load isn't evenly distributed throughout the day. Which means you're setting up autoscaling (w/ added time+complexity) or provisioning for peak, wiping out your cost savings.
In my experience if your data volumes are low enough like in the article, a Kinesis+Lambda setup is stress free and quick to implement. That makes it worth the cost over raw instances.
If it's just doing simple ETL then it's probably OK, but if you need to do aggregations on the data, you're going to have a bad time, or end up implementing some sort of ersatz map-reduce framework in lambda.
Yeah, it all depends on how much load or how many incoming records you expect. In our case, we use the pipeline to import inventory, say images from our partners. So we might have a few million images coming in many times a month. Lambda is super efficient in this case.
Depending on the requirements, one can run the sink (we use Kafka Connect for bridging between Kafka and Kinesis) on-premise or as a Docker container in ECS or on EC2. That gives you a lot of flexibility to optimize parameters like cost, throughput, liveness, etc.
We used to run an architecture similar to this a few years ago. I work for a broadcaster, and unfortunately it failed badly during a big event.
The Kinesis stream was adequately scaled, but the poller between Kinesis -> Lambda just couldn't cope. This was discovered after lots of support calls with AWS.
It might be better these days, I don't know; we moved to using Apache Flink + Apache Beam, which have a lot more features and allow us to do things like grouping by a window, aggregation, etc.
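For anyone curious what that buys you, windowed aggregation in Beam's Python SDK is roughly this (a sketch on the direct runner, with a bounded source standing in for the real stream; the field names are made up):

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    def run(events):
        # events: iterable of dicts like {"partner_id": "acme", "ts": 1514764800.0}
        with beam.Pipeline() as p:
            (
                p
                | beam.Create(events)                               # stand-in for a real streaming source
                | beam.Map(lambda e: TimestampedValue(e, e["ts"]))  # use the event's own timestamp
                | beam.WindowInto(FixedWindows(60))                 # 60-second fixed windows
                | beam.Map(lambda e: (e["partner_id"], 1))
                | beam.CombinePerKey(sum)                           # events per partner per window
                | beam.Map(print)
            )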
I'm not being negative about this, it's a cool setup.
But just to be clear, the pattern (not marketing terms) is doing change data capture (essentially the database transaction log) to a message queue, with message/job processors that can take any action, including writing the messages to other databases.
Kinesis is SQLStream underneath, which is probably why the lifetime of messages is limited - it's not originally intended to be Kafka or a durable message queue.
EDIT: Note above, when SQLStream first came out it didn't seem intended as a long term store. That was like really early on when I saw it at Strata. It looks like they made the storage engine pluggable and Kafka is an option too, so my statement above is likely incorrect.
Lambda is being used as a distributed message/job processor, much like any worker process processing a queue would be scaled up.
That's the first I've heard of this. Any citations you can provide to substantiate this claim?
My understanding is that SQLStream is an event streaming processor, which would make it a potential Kafka consumer, not a basis for a durable message queue.
I don't have this on authority, but I believe Kinesis Streams is an AWS-maintained fork of Kafka around 0.7. Firehose is a wrapper around Lambda + Kinesis Streams. And Kinesis Analytics is a SQLStream wrapper with Kinesis Streams as its input. That's my best understanding, anyway.
I heard a rumor that Amazon wanted to offer Kafka as a managed service, couldn't get it to work the way they wanted, and then said "f--k this" and wrote something that is functionally similar.
That's wholly possible. They might have just copied the "rough design" of topics/consumers and queue-as-log, but clean-house implemented it. I wouldn't be surprised. Kafka was a pretty minimal codebase at that stage.
"Lambda is being used as a distributed message/job processor, much like any worker process processing a queue would be scaled up."
Is this just clarification of the architecture, ie how Lambda fits into pre-existing patterns, or are you suggesting that since Lambda is being used as a distributed job processor, there would be a better tool for that specific task?
Just clarification. Lately, I've found myself being asked to explain to others what it is that these new things do and found relating it to existing architectural patterns was helpful. Not saying there is or isn't a better tool than Lambda - it's a new option, just clarifying the role it fits.
With so much overlap in the functionality and use cases of Kafka and Kinesis, it's not clear why they increase their surface area by using both.
Is Kinesis' write latency better than it was? IIRC it wrote to 3 data centers synchronously, which led to some pretty bad performance. This was almost 2 years ago though.
Kinesis works better for people who have opted in to the whole AWS toolbox, esp. when you do not want to maintain your own Kafka cluster on EC2. Unfortunately, I cannot say anything about the write latency, because I have no in-depth knowledge of the inner workings or of what is "fast enough" for them.
Exactly. The biggest motivation to use Kinesis was the other services we could plug in from 'AWS ecosystem'. Lambdas have Kinesis triggers available natively. We did not have any concern with the latency of Kinesis. Now I'm curious, do you have any literature on the bad performance history?
Just past experience. In the end, the low data retention time was the real reason Kinesis didn't fit our needs, but the difference in write performance between the two was pretty stark. Read performance was fine for what we needed.
Copying a billion records screams for streaming? Do you mean something like a couple of hundred gigs that could be copied overnight (and maybe transformed with some script)?
Rather than doing a one-time copy of the data, we have applications continually writing to our on-premise MySQL. This data has to reactively reach our datastore on AWS. Also, this is not a 1-1 copy. There are several transformations and cross-validations in play.