Solving durable execution's immutability problem (restate.dev)
52 points by stsffap 12 months ago | 28 comments



The hardest problem in durable execution, as in many areas of infrastructure, is safe updates.


Not just in durable execution, but in control systems in general! Anything that operates on persisted state in a distributed environment has similar issues. Durable execution frameworks make them more apparent - and may also help us chart a path for how to deal with updates more rigorously.


Funny to see this continue to get so much energy. I was strongly opposed to making some AWS Step Functions workflows fully immutable in the first project I used it for. That feeling only grew with some of the gigantic state machines that teams around me were building. Not shockingly, situations would arise where we knew ahead of time that a large portion of outstanding state machines were cruising toward a crash at step $BIG_NUMBER.

Ironically, I'm still largely in favor of idempotent calls that should be able to trigger something. Those idempotency identifiers can be stored such that you can replay or serve up the relevant data as needed. But going all in and forcing a long compute chain that cannot be changed is something you do only in places where you have to. There's no reason to choose it everywhere. And certainly don't build larger and larger compute chains between durable writes to the backing data store.
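
To make that concrete, here is a minimal sketch of the idempotent-trigger idea, assuming a simple key-to-result store (all of the names here are hypothetical):

  // Minimal sketch only; the store and handler names are made up.
  // The identifier is persisted alongside the result, so a retry with the same
  // key replays / serves up the stored outcome instead of re-running the work.
  const results = new Map<string, unknown>(); // stand-in for a durable store

  async function triggerOnce<T>(idempotencyKey: string, work: () => Promise<T>): Promise<T> {
    if (results.has(idempotencyKey)) {
      return results.get(idempotencyKey) as T; // replay the previously stored result
    }
    const outcome = await work();
    results.set(idempotencyKey, outcome); // durable write before acknowledging
    return outcome;
  }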


That... just sounds like a horrible implementation/engine/execution/strategy, whatever you want to call it.

I don't see any reason whatsoever that a temporal workflow composed of activities would be subject to the things you've described. And I don't even think it's that impressive of an implementation of the idea.


And that sounds like a no true Scotsman criticism. :D

My complaint was less that it isn't a compelling idea. The problem seems to be that people want to expand the one system they are working on into a fully comprehensive system. I remember one team tried to get their entire state reflected in a Step Functions state machine that could run for as long as necessary. The idea was literally a state machine that would last for years. They compromised on one that would only last a week or so, and then faced the consequences rather quickly.

And note, given enough time, I think this can work. On a deadline, though, I've seen it fail too many times for it to be my go-to solution. Every time by people who are objectively smarter than I am.


> And that sounds like a no true Scotsman criticism. :D

Very fair, and point totally understood.


Can you dive deeper into the technical details of how Restate integrates with Lambda's versioning system to ensure in-flight requests keep running against the version they started on?


Every Lambda you deploy to AWS Lambda gets a version assigned. The corresponding ARN would look like `arn:aws:lambda:my-region:123456789101:function:my-function:my-version`. Once you have deployed the Lambda, you register it with Restate by providing this very same ARN. Now, when Restate invokes a Lambda function, it remembers the ARN with which it started the invocation, so on any subsequent retry Restate will always invoke the same ARN it started the invocation with.
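
Roughly speaking, the retry path behaves like the sketch below. This is not Restate's actual code, just an illustration using the AWS SDK for JavaScript; the journal record shape is hypothetical.

  // Sketch only, not Restate's implementation. The versioned ARN recorded when
  // the invocation first started is what every retry invokes, so newer
  // deployments of the function never see in-flight invocations.
  import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

  const lambda = new LambdaClient({ region: "my-region" });

  // journalEntry.arn was captured on the first attempt, e.g.
  // arn:aws:lambda:my-region:123456789101:function:my-function:my-version
  async function retryInvocation(journalEntry: { arn: string }, payload: unknown) {
    return lambda.send(new InvokeCommand({
      FunctionName: journalEntry.arn, // always the originally recorded version
      Payload: Buffer.from(JSON.stringify(payload)),
    }));
  }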


Curious how you think about state that changes outside of a workflow’s scope (like user info in a database)


In general, if you're changing database schemas, you need a code version running that supports both old and new. If you have some old code running to serve in-flight requests, you probably need to wait for those requests to complete. This is one of the main areas that makes me feel that requests running longer than a few hours are a real problem (and that we shouldn't be writing workflows this way).


I'm of the opinion that actual database schemas should be a private implementation detail of any given service within an organization, whereas workflows are not services themselves, but instead are clients of the services and so should only ever interact with services via their API (be it a web-service or otherwise; no love for CORBA, eh?) and never by talking directly to an RDBMS.

(obviously the rules are different for KV/object/blob-stores and don't apply here).

-----

That said, if you need a stopgap solution for a problem like that, then take advantage of existing compatibility-shim features in a decent RDBMS: things like VIEWs and SYNONYMs exist for this reason (and yes, you can INSERT INTO a VIEW).

-----

Another option is to constrain your DB designs such that all DDL/schema changes are always backwards-compatible (i.e. disallow dropping anything, disallow altering columns except to widen an existing column (e.g. int-to-bigint is okay but not string-to-UUID), require all domain-model attributes to be initially modelled with an m:m multiplicity wherever possible, etc.) - and you can enforce these rules using a DDL trigger too - and any changes made by a workflow's DML can be repaired by a background job.

-----

(Just throwing some ideas around)


+1 on views, for those things that need direct DB access.

One effective pattern I’ve seen in large DB deployments is to separate the write schema from the read schema. That is, treat what is allowed to be written into the DB separately from the shapes of any views that exist. The views themselves are a tightly coupled client of the DB log - by constraining writers, you can migrate/rebuild views, then point services to read from the migrated views, and retire old views.

This allows you to keep accepting writes - you never have to shut down the write path. If you're introducing new shapes to the DB, you'd prepare a new view, then widen the write schema and begin accepting writes in the new shape, and only then re-point clients to read from the new view.

To drop elements of your read schema, you do the dance in reverse. First, constrain your writes. Then, build new views that don’t require the elements you removed. Gradually, update application code to work on the new, reduced views. When you’re done reading from the original views, you can drop them.

This is inherently much less efficient than online or offline DB migrations. But it’s a sane strategy for wrangling very large systems with very low risk.

Versioned workflows are in practice distributed entities that interact with their peers, themselves defined by versioned interfaces. By tracking the versions of handlers touched by any given execution, we can imagine a similar experience for deployed code - including garbage-collecting unused versions.


Cool! Why is it crucial to keep old versions of code, and what are the risks of running outdated code?


If you don't keep old code around, you are forced to either drop pending requests that started on it, or try to replay them on the updated code, which can't be known to be safe. So it's better to keep it around, but this brings in a new set of infra and security challenges, e.g. where will it run, what will it cost, will it have a vulnerable dependency, etc.


Can't you replay pending requests? How can you mitigate the issue of differing side effects generated by the new / old versions?


If you replay the request on the new version, you might encounter new steps that don't match what you have in the journal. Temporal users know this pain well...


Rolling a task over to a new version should be "safe" in that you can detect conflicts and roll back if the sequence of calls does not match the old version.

For a post about "solving" durable execution I would expect both a scale-to-zero way to keep older versions around indefinitely - I guess the Lambda-based approach does qualify - and a safe and controlled way to upgrade task versions iff the execution history is compatible.


How do you imagine detection of conflicts working?


Each execution by design has a record of all calls with side-effects, with input and output.

If you replay history up to the newest call and all calls are identical, that specific execution instance is compatible with the new code and can be upgraded. If not, it should be rolled back, and you can either deploy a fixed version of the code with backwards compatibility, or delete executions that cannot be upgraded.
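
As a sketch (the journal and entry shapes here are hypothetical), the compatibility check is essentially a prefix comparison:

  // Run the new code in replay mode, record the calls it would make, and require
  // that they reproduce the existing journal call for call before upgrading.
  interface JournalEntry { call: string; input: string; output: string }

  function canUpgrade(journal: JournalEntry[], replayedByNewCode: JournalEntry[]): boolean {
    return journal.every((entry, i) => {
      const replayed = replayedByNewCode[i];
      return replayed !== undefined
        && replayed.call === entry.call
        && replayed.input === entry.input
        && replayed.output === entry.output;
    });
  }
  // canUpgrade(...) === false => roll back, then either ship a backwards-compatible
  // fix or delete the executions that cannot be upgraded.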

Backwards compatible code can be written as

  if (workflowVersion() >= FIX_VERSION) new_way() else old_way()

There should be two ways to get the version for backwards compatibility: workflowVersion() is replayed and can change between side-effect calls, e.g. executions will use the old retry logic until they reach the current point in time, when they will switch over to the new one.

originalWorkflowVersion() is constant, e.g. all executions that started before NEW_TAX_RULE will keep using the old tax rules for all calculations.
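
Spelled out with both accessors (hypothetical API, in the same spirit as the snippet above):

  // workflowVersion() is replayed, so history re-executes the old retry logic
  // and only progress made after the upgrade takes the new branch.
  if (workflowVersion() >= FIX_VERSION) { newRetryLogic() } else { oldRetryLogic() }

  // originalWorkflowVersion() is fixed at start, so executions that began before
  // NEW_TAX_RULE keep the old tax rules for every calculation they make.
  if (originalWorkflowVersion() >= NEW_TAX_RULE) { newTaxRules() } else { oldTaxRules() }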


Also, it would be great to see a deep comparison with Temporal and AWS Func.


Are there any aspects in particular that would be of interest?


I'd love a deep technical comparison, but it would also be great to understand if Restate is better than Temporal for specific use cases and vice versa, i.e. when someone should choose one of them over the other.


This is pretty important when you move to LLMs as microservices (e.g. calling GPT-4 for some plan, then 3.5 for an easier sub-component).

Would be keen for a Python implementation.


Is there a timeline for a Python SDK?


We are currently gathering feedback on which SDKs to prioritize next. Python has been asked for a couple of times already. Once we decide on the next SDK, we'll let you know.


Ok, this actually looks interesting


Will there be a Go SDK?


It's probably the next on the list :)



