When "letting it crash" is not enough (flawless.dev)
157 points by thunderbong on Feb 8, 2024 | 79 comments


Two big question marks after reading this and the linked home page (partially pointed out in other comments):

- If there's a flaw in your application code that causes a crash (as is the motivating example in the essay), then restoring the entire program into the state it was in just before the crash happened would just cause it to crash again ad infinitum. Sure, this model helps against "my VM instance got preempted", but that's a pretty different category of "crash" (and also notably unrelated to supervision trees).

- "External" state (like an API endpoint being down/returning gibberish) can be part of the reason why your program got into a bad state. In fact, that's disproportionately likely, since external "weirdness" is comparatively hard to cover exhaustively in tests. In such a situation, the suggested computation model would never be able to recover even when restarted, because it would forever retain the (bad) API response. Effectively, this is just caching all the non-pure effects of your computation, and we all know about cache invalidation being a hard problem...


Original article is now ironically crashed, but: my last job was working on a point-of-sale system which used this kind of append-only transaction system and a "crash and reboot on failure" model. Every button press got turned into one or more transactions. This had the nice property that if something failed, most of the time it was before anything got written, so the system would reboot and leave you back on the screen before your last button press. The state could also be shipped over to the developer's PC so you could repro that state under the debugger. There had to be a "detach account" get-out clause for cases where the sequence of transactions caused a crash on load, which was rare but possible.

The hardest part was of course managing external state and journaling exactly where you had got to with external transaction APIs. Further backend reconciliation was available to flag this (and avoid Post Office scenarios).

Note that French NF525 almost mandates this design, at least for point-of-sale systems: every financial transaction has to be durably written for tax auditing purposes.


Hi! I'm the author of the essay.

Durable execution is meant to complement your application. You will never want to model everything with it. It solves the problem of needing to decide how often to manually make snapshots of some important state; with durable execution, this becomes implicit. Workflows in flawless can still fail: you could call the `panic` function, or divide by zero. In the end it's arbitrary compute.

"External" state is one of the text book examples for using durable execution. If you are interacting with 5 different services and calling 5 different API endpoints, you sometimes want to have transactional behaviour. Leave all 5 systems in a consistent state after your interaction. You can't only call 2 and stop. Durable execution and patterns like saga [1] are one of the most straight forward ways (for me) to solve this.

In flawless specifically, I try to give the user enough context about why things failed. It's very easy to reconstruct the whole computation from the log and let the user decide if they want to re-run the workflow. If you charge someone's credit card, but the call to extend their subscription fails (service down), you can't safely just re-run this. You have two choices: either you continue progressing and roll back the charge, or you fail and have someone manually look at it. In general, you want to use flawless in scenarios where the "called exactly once" guarantee is important. If you can just throw away the state and it's safe to re-run from the start, then you don't need flawless for this part of the app. The less state you have to care about, the better.

EDIT: The alternative would be to manually construct a state machine with a database. "Check if the credit card was charged. Call Stripe. I finished charging the credit card, save this information. Call the subscription service, it failed, restart everything. Check if the credit card was charged ...". And depending on your workflow, this can be a very complicated process where 90% of your code is just dealing with possible failures. Especially when failures happen on the edge of some calls, it can become very tricky.
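Roughly, in code (a simplified sketch with stubbed effects, not the actual flawless API):

    // Sketch only: all effects are stubbed for illustration.
    #[derive(Debug)]
    struct ChargeId(u64);

    #[derive(Debug)]
    enum Failure { ServiceDown, RolledBack }

    fn charge_card() -> Result<ChargeId, Failure> {
        Ok(ChargeId(42)) // stub: pretend the payment provider succeeded
    }

    fn extend_subscription() -> Result<(), Failure> {
        Err(Failure::ServiceDown) // stub: the subscription service is down
    }

    fn refund_charge(_c: &ChargeId) -> Result<(), Failure> {
        Ok(()) // compensating action
    }

    // In a durable-execution runtime, each completed step is replayed from
    // the log on restart, so `charge_card` runs at most once for real.
    fn renew() -> Result<(), Failure> {
        let charge = charge_card()?;
        if extend_subscription().is_err() {
            refund_charge(&charge)?; // saga: roll back the completed step
            return Err(Failure::RolledBack);
        }
        Ok(())
    }

    fn main() {
        println!("{:?}", renew()); // Err(RolledBack)
    }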

[1]: https://medium.com/cloud-native-daily/microservices-patterns...


I feel like this approach might still pose some challenges or issues with regards to time or stale data. A couple of problematic scenarios:

- Application requests a JWT token. It then crashes and gets restarted. It gets past the problematic point, but later when trying to make a request, it crashes due to the cached token being expired.

- Application interacts with the current time in a meaningful manner. Due to the log replay, it will always live in the past, and when switching from the cache-sourced time to the current time, some issues might occur, like deltas being larger than expected.

- Application goes through a webshop checkout flow. After restart, some of the items in its cart have already been sold, but the app doesn't check this, since it already went through a (cached) check and got the result.


Funnily enough, this is actually a massive problem when working with cloud automation APIs. Terraform and the like kinda handle this problem by calculating / storing the “goal state” and then looking at the system’s current state, and coming up with a “plan” to reconcile it.

Unfortunately, cloud provider APIs are usually eventually consistent, and getting a full snapshot at scale is nigh impossible.

So, in order to work around this, I effectively built a write-ahead-log-style system atop Postgres. Something like Sagas would have been great, but as far as I can tell, there was no real pattern for multiple Sagas operating on global state doing coordination. This is where Postgres SSI came in handy: I could read the assumed state of the system, and if another worker came in and manipulated it, the write-ahead entry wouldn't get written, as the txn would fail to commit. The write-ahead entry would then get asynchronously processed by another worker, in case the first worker failed.
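A rough sketch of that write-ahead pattern, assuming the synchronous Rust postgres crate (the cloud_state and wal_entries tables here are hypothetical): under SERIALIZABLE isolation, a conflicting concurrent writer makes the commit fail, so the entry is never written.

    use postgres::error::SqlState;
    use postgres::{Client, IsolationLevel};

    // Sketch only: table names and schema are made up for illustration.
    fn enqueue_action(client: &mut Client, action: &str) -> Result<bool, postgres::Error> {
        let mut txn = client
            .build_transaction()
            .isolation_level(IsolationLevel::Serializable)
            .start()?;
        // Read the assumed state of the system inside the transaction.
        let ready = txn.query("SELECT id FROM cloud_state WHERE status = 'ready'", &[])?;
        if ready.is_empty() {
            txn.rollback()?;
            return Ok(false); // nothing to do against this state
        }
        // Record the intended action as a write-ahead entry; a separate
        // worker processes entries asynchronously in case we die here.
        txn.execute("INSERT INTO wal_entries (action) VALUES ($1)", &[&action])?;
        match txn.commit() {
            Ok(()) => Ok(true),
            // SSI detected a conflicting concurrent writer: entry not written.
            Err(e) if e.code() == Some(&SqlState::T_R_SERIALIZATION_FAILURE) => Ok(false),
            Err(e) => Err(e),
        }
    }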


This sort of architecture often shows up in actors, e.g. https://www.microsoft.com/en-us/research/publication/orleans...

In that world, what you're generally looking at is

local state + incoming message -> new state + outgoing messages

Outgoing messages are sent only after persisting, and will be retried until success. Unique message IDs are used for idempotency (also for incoming messages).

Important: the actor runs with no side effects -- that's what makes rerunning things safe. In that kind of world, side effects are often achieved via e.g. sending a message that updates a materialized view somewhere (see e.g. Kafka).
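A toy sketch of that shape in Rust (names are illustrative, not the Orleans API):

    use std::collections::HashSet;

    struct Message { id: u64, payload: String }

    struct Actor {
        state: i64,
        seen: HashSet<u64>, // processed message IDs, for idempotency
    }

    impl Actor {
        // local state + incoming message -> new state + outgoing messages
        fn step(&mut self, msg: Message) -> Vec<Message> {
            if !self.seen.insert(msg.id) {
                return vec![]; // duplicate delivery: already handled, ignore
            }
            self.state += msg.payload.len() as i64;
            // The runtime persists (state, seen) BEFORE sending these,
            // and retries the sends until they are acknowledged.
            vec![Message { id: msg.id + 1, payload: format!("state={}", self.state) }]
        }
    }

    fn main() {
        let mut a = Actor { state: 0, seen: HashSet::new() };
        assert_eq!(a.step(Message { id: 1, payload: "hello".into() }).len(), 1);
        // Redelivery of the same message ID is a no-op:
        assert!(a.step(Message { id: 1, payload: "hello".into() }).is_empty());
    }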

With this setup, the source of badness is often isolated to the incoming message, and after a few failures an incoming message can be moved into a "dead letter queue" for ops to look at. In many scenarios, this actually works remarkably well.

https://pmatseykanets.github.io/beanstalkd-docs/protocol/#bu...


> then restoring the entire program into the state it was in just before the crash happened would just cause it to crash again ad infinitum.

Hopefully!

But, sadly, this is not always the case. It's sad because it's incredibly hard to debug crashes that don't happen deterministically.


About your first point, in a Beam app with a supervision tree you don't necessarily need to restore your entire app state or the state "it was in before", you can restart with just a "workable" state.


> cache invalidation being a hard problem

For those unaware: "There are only two hard things in computer science: cache invalidation and naming things." Plus variants like, "Oh, and off-by-one errors."


> .. durable execution, and is so new that most developers never have heard of it.

It's called checkpoint/restart, and was a feature of some early operating systems. Mostly for programs whose run time exceeded the mean time before failure of the system.

Tandem's whole system concept was built around that.

Amusingly, Second Life, of all things, has durable execution of the little LSL programs that make in-world objects go. They're checkpointed every minute or two, and if a region crashes, they are restarted, stack, heap, and all. They survive machine crashes and ports to new hardware. Even migration from a dedicated data center to AWS. Some have been running for well over a decade. Internally, they are Mono programs.


There's a decent chance checkpoint/restart has been with us since the days of paper tape.

It's moderately amusing to see "save state somewhere" described as so new people haven't heard of it.


I've been in the industry for so long now as to see this pattern repeat often enough that I've come to accept it as a sort of universal truth: nearly everything that people think is "new" is actually an echo and (hopefully) refinement of something that has come before.

I am hard-pressed to think of anything in the software world today that is actually, truly, new. We all stand on the shoulders of giants.


Checkpoint/Restore I feel is a bigger concept than just saving state. At the zeroth level it's a system that can correctly stop and serialize a running process, in a way that can be initiated from within the process itself (criu https://github.com/checkpoint-restore/criu has shown what a huge pain in the ass that is, and it's still not perfect).

The 1st level more-work-but-easier way to do this is to build or use a heavily constrained VM/language you run from within your main application that doesn't allow for most of the hard problems to even exist.

I can't find any ready-made tools to do this that I wouldn't consider an endeavor. Emacs has to be the most famous application to utilize dump/restore state but it's not exactly turnkey.


I remember there was one for Unix back in the 90's, I think it was called Condor, that could be used to migrate long-running processes to other machines. (I think the tricky part was restoring external connections like file handles.)

And before that, there were TeX and emacs, which do a memory dump and restore to save time loading their initial state. One of those, I think emacs, would tweak a core dump file to be a valid executable again.


The challenge is purely in how to make it perform well.

Dumping the whole memory was a lot more viable when that was only a few kilobytes.


Good point about Tandem, but it wasn't limited to them.

Checkpoint/restart has also long been a high-end capability found in HPC (High Performance Computing) systems. I recall it even made it onto some commercial UNIXes, although I don't remember whether it was bundled in or took the form of some additional layered software such as LSF (Load Sharing Facility) (example: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=admini... ).

There are some people who have tried to bring some checkpoint/restart to Linux. For example, see the CRIU project: https://criu.org/Checkpoint/Restore

I acknowledge that the original poster is talking about all this being done at the application level, not outside-the-app at the OS level...

For more references on the topic, see the wikipedia entry for Application checkpointing: https://en.wikipedia.org/wiki/Application_checkpointing


Durable execution and state restoration is also a great way to get close to the crashing point again, though.

The entire point of crashing early is to reset the state of the program completely, hoping that the path that crashes it will not be taken too often (and it only works well in situations where crashing means negligible interruption of the service). By preserving the state between crashes, you diminish the pool of possible non-crashing states.


The approach of check-pointing computation such that it is restartable sounds similar to a time-traveling debugger, like rr or WinDbg:

https://rr-project.org/

https://learn.microsoft.com/windows-hardware/drivers/debugge...

Some Googling found Checkpoint/Restore In Userspace, or CRIU. It’s like Flawless, but for Linux processes:

https://criu.org/Main_Page

I bet that Flawless can make better guarantees about reliability due to constraints of the WebAssembly sandbox.


Our time travel debugger at undo.io also uses a similar approach.

But the world is different for time travel debug, in that you can replay the whole history of the recorded execution (my understanding is that this is not required for Durable Execution) but you cannot resume the real process.

In fact, since it's used for capturing faults, you wouldn't generally want to resume execution - it's going to fail the same way (whereas a higher level restart framework might allow you to throw away some bad state and continue the computation - at the cost of not being able to precisely duplicate the bug).


There's also an implementation of time-traveling debugging in QEMU[0], which sounds really nice but does not actually work reliably on larger use cases.

[0] https://www.linux-kvm.org/images/d/d0/02x06b-DeterministicRe...


Docker[0] supports CRIU, but it has always been experimental. I always assumed it would be a valuable tool for test cases, but I never used it.

[0]: https://docs.docker.com/engine/reference/commandline/checkpo...


I have been looking for an excuse to use RR: a problem that would have been solved much more easily if I had it. I haven't yet found one.

Does anyone here have experience with it? Maybe know the sort of problems RR excels at helping find?

EDIT: And is there some way to still use a better visual overlay, a la Visual Studio or Qt Creator?


I saw a tech demo somewhere that showcased that Qt applications on wayland could be checkpointed and restored with CRIU.


I implemented something similar at work for a phone handling system. We had a "workflow" system that let clients script how calls should be handled. All side effects that the script was allowed to make happened through an API with functionality like reading DTMF input, playing media, adding another call leg etc.

Like the article says, we would record a log of all the interactions with the outside world. The lower level parts of the system had the capability of transferring an in-progress call from one server to another. SIP and RTP traffic would be moved to another server, and finally the workflow log was sent. The server that took over the call would then replay the script and arrive at the same execution state that the first server had before the call was moved. Workflow authors didn't have to care about any of this. To them it looked like the call started and ended on the same server.


I thought the idea of "exactly once" was a very questionable claim given what we know about computing, especially distributed computing.

Then somewhere else you see this gem.

> Workflows in flawless are written in Rust, in fact they are just regular Rust functions. This means that they can contain arbitrary logic. But instead of native code, the functions are compiled to WebAssembly and executed in a completely deterministic environment.

As much as I love Rust, this sounds like, here is a problem, let me throw fancy Rust and WebAssembly and that should fix it.


The main page (https://flawless.dev/) has a diagram/video that shows how it would work: it basically writes any 'side effects' to a log which is used to track their existence. If they exist in the log, you read them; if they don't, you run the code that generates them and then log it.
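A toy version of that read-through log in Rust (illustrative only; a real implementation would persist the entries durably rather than keep them in memory):

    use std::collections::HashMap;

    // Side-effect results, keyed by their position in the execution.
    struct EffectLog {
        entries: HashMap<u64, String>,
        cursor: u64,
    }

    impl EffectLog {
        fn effect(&mut self, run: impl FnOnce() -> String) -> String {
            let idx = self.cursor;
            self.cursor += 1;
            if let Some(logged) = self.entries.get(&idx) {
                return logged.clone(); // replaying: reuse the logged result
            }
            let result = run(); // first run: perform the side effect...
            self.entries.insert(idx, result.clone()); // ...then log it
            result
        }
    }

    fn main() {
        let mut log = EffectLog { entries: HashMap::new(), cursor: 0 };
        let a = log.effect(|| "http response".to_string()); // executes for real
        log.cursor = 0; // simulate a restart replaying from the top
        let b = log.effect(|| unreachable!()); // served from the log instead
        assert_eq!(a, b);
    }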

It's interesting, but I can't imagine how it is going to work at scale, both in terms of managing state across a very large application and in terms of running thousands of instances concurrently.


As the saying goes, "it is turtles all the way down". How do you ensure the log is "written" and "synced" _exactly once_?


Just write the action of writing a log to a different log file. Duh. /s


That can absolutely work... if you are confined into a single machine. (In fact, that's roughly how cloud instances work.) I see no way to generalize that into a distributed system. In my previous job I worked on a game server engine which solves a very limited version of this problem via virtual actors and it was still way too hard.


Exactly Once is basically a solved problem. The CAP theorem says you can either get consistency or availability in a system that has network partitions.

If you give up consistency, you end up corrupting data every once in a while.

If you don’t, then you can have exactly once, but it might take a long time if there’s a network failure.

NFSv3 solved this back in the 1980’s. (V2 and V1 may have, but I don’t know.)

It did it without requiring deterministic execution or other programming language innovations, so I share your skepticism about rust and web assembly solving these problems.

(I really like rust, and recommend it for pretty much all new systems code. However, it is not a panacea.)


This comment has some errors in it:

1. Exactly Once is absolutely not a solved problem. CAP says nothing about 'exactly once', it's about design choices for your data. But you can do all sorts of side-effectful things when it comes to setting the data.

2. Giving up consistency does not mean you "corrupt your data every once in a while". I don't know why you would say that. Choosing availability means you have to make decisions about how your data becomes consistent, but nothing about that means it is corrupted.

3. Choosing consistency says nothing about "exactly once". Consider this basic workflow backed by a consistent database: request comes into service, service sends email, services goes to store that email was sent in consistent database and crashes, user gets crash back, reruns request, email sent again. Oh so send the email AFTER it's stored in the database, well service crashes after saving in database, same problem.


If your system provides strong consistency, it is possible to build exactly once processing over it. NFSv3 is an existence proof, and there are plenty of theoretical results showing it is possible. Roughly speaking, you run the job and install the result in the consistent store iff your output register is null. If you side effect outside of the exactly once system, and the receiver can’t suppress duplicate notifications then you’re screwed, but the system in the article precludes that, as do many existing systems.
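A sketch of that "output register" construction, with an in-memory map standing in for the consistent store (in a real system the check-and-install must be a single atomic conditional write):

    use std::collections::HashMap;

    // job id -> output register, absent ("null") until a result is installed.
    struct Store {
        registers: HashMap<String, String>,
    }

    impl Store {
        // Run the job and install the result iff the register is null.
        // Re-running after a crash is safe: the duplicate is suppressed,
        // so the job's effect is observed exactly once.
        fn run_once(&mut self, job: &str, work: impl FnOnce() -> String) -> String {
            if let Some(installed) = self.registers.get(job) {
                return installed.clone(); // a result was already installed
            }
            let output = work();
            // In a real store this insert would be conditional (CAS /
            // create-if-not-exists), so two racing workers can't both install.
            self.registers.insert(job.to_string(), output.clone());
            output
        }
    }

    fn main() {
        let mut s = Store { registers: HashMap::new() };
        assert_eq!(s.run_once("job-1", || "done".into()), "done");
        assert_eq!(s.run_once("job-1", || panic!("never re-run")), "done");
    }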

As for your other points, with eventual consistency you get weird problems like “I wrote X, then Y, but then read X and the system converged to X, except a week later I read Y, but just from half my fleet, and only for a 30 minute window”.

Unless you layer consistency on top of that (which is not always possible), then, tautologically, you don’t get consistency. In particular, most intuitive application-level invariants are going to be violated in all sorts of bizarre ways that take many pages to explain.

In practice, eventual consistency is closer to corrupting than not corrupting, because most application developers aren’t going to follow this conversation, and will use the database wrong.

I’m sure you or I could model our program and the storage in TLA+ and confirm we’re correctly maintaining application state, but that doesn’t help most developers.

Also, it’s not economically efficient.

We’re talking about “only” getting 6 nines instead of 7 because we chose CP instead of AP, but as a side benefit the system is easier to maintain, and it took less than 1/10th as much to implement, and has orders of magnitude fewer implementation bugs.


> If you side effect outside of the exactly once system, and the receiver can’t suppress duplicate notifications then you’re screwed

If you are getting duplicate notifications, then you don't have an "exactly once" system, by definition. Exactly-once systems are not possible, by your own explanation. You can store data in a durable way such that applying it multiple times is safe, but that is not "exactly once". Your options are at-least-once or at-most-once. Again, by your own statement: if you have duplicate notifications in your system, you do not have "exactly once".

> As for your other points, with eventual consistency you get weird problems like “I wrote X, then Y, but then read X and the system converged to X, except a week later I read Y, but just from half my fleet, and only for a 30 minute window”.

This is not data corruption, this is the semantics of an eventually consistent system. If you don't want those semantics, don't have an eventually consistent system, but that is entirely different than data corruption. Developers not understanding the system is different than corruption.

I agree that eventual consistency is probably not a good idea for most problems.


Many, many distributed systems provide exactly once semantics by giving up availability.

They don’t compose well with systems that don’t provide exactly once semantics, but that doesn’t somehow make them stop existing or stop working.

Also, you don’t need to arrange for data structures that can be modified multiple times safely.

You just need the exactly once system to support transactions, compare and swap, atomic rename, etc. Such things are usually pretty easy to retrofit.

For instance, Google and Microsoft S3 support conditional writes that check the etag for equality. That’s enough to let you layer consistency and exactly once on top. AWS S3 doesn’t support this, and people have been asking for it for a long time.

These things are extremely well understood and done at scale all the time.

Edit: I think they use etags for this. They might expose it as atomic rename or create if not exists instead. The expressive power of the two should be equivalent.
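For reference, an etag-gated write over HTTP looks roughly like this (a sketch in Rust with reqwest; the URL and etag are placeholders):

    use reqwest::blocking::Client;
    use reqwest::StatusCode;

    // Conditional write: only succeeds if the object still carries `etag`,
    // i.e. a compare-and-swap on the object's version.
    fn conditional_put(url: &str, etag: &str, body: Vec<u8>) -> reqwest::Result<bool> {
        let resp = Client::new()
            .put(url)
            .header("If-Match", etag)
            .body(body)
            .send()?;
        // 412 Precondition Failed: a concurrent writer won the race.
        Ok(resp.status() != StatusCode::PRECONDITION_FAILED)
    }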


Exactly once is a mathematical impossibility. It has been proven. See the two generals problem. You can get similar end results if you have idempotent calls and allow duplicate calls to be made. This gives the illusion or impression of exactly once.


> This brings me to a recent discovery I made, another approach to dealing with failure that completely blew my mind . It's commonly known under the name durable execution, and is so new that most developers never have heard of it.

Some feedback: this paragraph lands like a Reddit post from a teenager, who just took a big drag off a joint, and thinks they've invented a new field of science. (And also hasn't completed it so isn't ready to share it yet!)

Of course, everything's already been thought of. What you've described so far as durable computing sounds indistinguishable from event sourcing.

The code snippet is intriguing, though. It abstracts event sourcing away from the developer (it happens automatically for every non-purely-functional operation) so the programmer can just write normal code.

Though, you'd have a lot of events, and I'd wonder about performance and scalability for larger apps.

Also, assuming events are at the finest level of granularity, I imagine this starts to look like the equivalent of some kind of intermittent heap dump.


They're not claiming that they invented it; it's an existing term. This blog post lists 14 different projects in the space: https://www.golem.cloud/post/the-emerging-landscape-of-durab...


Flawless sounds a lot like https://temporal.io/ .

I'm wondering if it has the same scalability concerns - sticking everything in Postgres is fine at small-ish scale, but what happens when you outgrow Postgres, either because you have higher availability requirements (can't handle primary DB restarts) or because of the sheer volume of the workload?


Yeah the whole thing reminded me of the blog post they talk about regarding a time travelling debugger...

https://temporal.io/blog/time-travel-debugging-production-co...

Anyways I wish all the best for flawless as this problem is worth solving/popularizing.


What percentage of apps “outgrow” Postgres?


Very few. But there's a difference between having control over the codebase (where you can start to split up the monolith, adopt event-based architectures down the line, i.e. have a strategy for dealing with the problem if and when it becomes a problem) and tightly coupling your business logic to a vendor that is itself tightly coupled to Postgres, where your options for switching away will be extremely fragile and expensive at best and straight up impossible at worst.


Temporal can also run on Cassandra, which scales much larger than Postgres (if you put in enough effort). It can also be replicated across regions for high availability. It's already running some pretty huge use cases.

(I work at Temporal)


Oh come on, when you click through the setup docs to Cassandra, the documentation states that Cassandra support was deprecated in 1.21 and to migrate to a "supported" database: https://docs.temporal.io/self-hosted-guide/visibility#cassan...


You're looking at the docs for "visibility". Visibility is a separate eventually-consistent data store off to the side that's used for certain queries so it can be scaled independently of the main data store, and indexed in fancier ways. The main data store for all the stateful and transactional stuff has always, and probably will always, support Cassandra. For visibility, the recommendation for high scalability is currently Elasticsearch.

Temporal may have properties that make it not a good fit for a particular use case, but scalability is really not one.


As was mentioned, this sounds very much like time-traveling debuggers such as RR, which also record this kind of database.

But I don't exactly understand: If you exactly recover the same state, don't you end up with the same faulty undesirable state? (That's what you also get with RR, by intention, to debug the problem.)

Clearly, here this is not the intention, i.e. you don't want to recover exactly the same state. So, it means, some parts will not be recovered. How would it be decided what parts to recover and what parts not? This seems like an impossible problem to solve in general.

And then, the state will not be exactly the same. It might be now in a sane state, but is it the right state that the user wants?


I guess it insulates you against bugs in the VM implementation, plus against (transient) failures of the host system.


Ah, right, that would be an option to define a clear boundary, i.e. to recover exactly the VM state, but not the OS native state. I was thinking about a native app here, where there is no VM.

But this would cover only a specific set of failures, namely those where something goes wrong in the host, which you can easily recover from by resetting its state and retrying. This is only a very limited class of failures. I guess most failures come from bugs in the user code, which would all be state within the VM.

Also, maybe your app depends on some resources from the network, e.g. some NFS mount, or maybe some other remote server, or whatever, which you cannot control anyway. This is not easily recoverable then. (Tools like https://criu.org/Main_Page, which try to serialize an app's state so it can be recovered later, have the same problem.)


A big problem with this approach is doing the same thing twice, which is a big no-no in a lot of applications. For example, if you do something, then crash before you get the chance to update your durable execution object, then you’ll do it again after restarting.

We were dealing with such code in a TCB, and the only way is to audit your code and ensure that any action that may be performed at most once saves a record of that action before performing it.

I think you have the same concepts in queues, no? Some queues deliver at most once, some at least once, something like that. I can’t remember.


Hi! I'm the author of flawless.

Exactly-once execution is one of the guarantees I want to provide. Flawless should always give you peace of mind, and in cases where it can't guarantee exactly-once execution it will sacrifice progress. When interacting with the real world, hardware and software can fail in creative ways, and it's not always possible to automatically recover without manual intervention. Sometimes it's just not possible to solve issues without looping in a human. But flawless will not repeat a call unless it can guarantee that the call was never performed, or that the call is idempotent.

For many features, like HTTP requests, flawless uses a dual commit system. In cases where data was sent to an external server, but no response is received (timed out), we can't know what the external system observed, and will not allow progress to continue (fail the workflow). You can relax this requirement by marking the HTTP call as idempotent.
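Conceptually, the replay decision for one logged HTTP call then looks something like this (a simplified sketch, not the actual implementation):

    // Dual commit: intent is logged before the request goes out,
    // the response is logged after it arrives.
    enum LogEntry {
        Sent { call_id: u64 },                   // committed BEFORE sending
        Completed { call_id: u64, status: u16 }, // committed AFTER the response
    }

    fn replay_decision(log: &[LogEntry], call_id: u64, idempotent: bool) -> &'static str {
        let sent = log.iter().any(|e| matches!(e, LogEntry::Sent { call_id: c } if *c == call_id));
        let done = log.iter().any(|e| matches!(e, LogEntry::Completed { call_id: c, .. } if *c == call_id));
        match (sent, done) {
            (true, true) => "use the logged response",              // normal replay
            (true, false) if idempotent => "retry the call",        // safe to repeat
            (true, false) => "fail the workflow, loop in a human",  // effect may have happened
            (false, _) => "perform the call",                       // never attempted
        }
    }

    fn main() {
        // We crashed after sending but before logging a response:
        let log = vec![LogEntry::Sent { call_id: 7 }];
        assert_eq!(replay_decision(&log, 7, false), "fail the workflow, loop in a human");
        assert_eq!(replay_decision(&log, 7, true), "retry the call");
    }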


I'm not sure exactly once can be done in the presence of failure. When you come back online, if you're not sure whether a call happened or not, you either drop it or try again - and that gives you at most or at least once.


If your system refuses to execute an action that it can’t guarantee didn’t occur, isn’t that “at most once” semantics?


I guess you could then use something like undo-redo-logging or a database, if it is critical. But then you would need to store data and process it as well, writing the logic for that.


> The more complex your application is, the more variables you need to keep track of everything. Eventually it becomes impossible for developers to predict all combinations of state that these variables will form.

I discovered an interesting simple case of this on my utility's website. Autopay for some reason didn't trigger. It showed I had autopay enabled, but no payment was ever made.

Tried to pay manually, got "error: autopay enabled, no one-time payment allowed". The only way to pay my bill was to delete my autopay, and manually pay.

I now cannot re-enable autopay getting the error "autopay disabled for this account" (possibly because it had the past-due balance I had just paid?).

Crossing my fingers if I log back in next week all the states will reconcile themselves. Software!


AFAIK Erlang and Elixir have a way to save the state just before a process is stopped: set the `trap_exit` flag on a gen_server, which guarantees that the exit message will be handled. In that handler you can save all your state and restore it when the process restarts.


I’ve been doing this for a long time, except I just write basically a memoize function that I slap onto any old function.
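Something like this, roughly (a quick Rust sketch; the cache path is made up):

    use std::fs;
    use std::path::PathBuf;

    // Quick-and-dirty durable memoization: cache a step's output on disk so
    // a re-run of the script skips straight past completed steps.
    fn memo(key: &str, step: impl FnOnce() -> String) -> std::io::Result<String> {
        let path = PathBuf::from(format!("/tmp/memo-{key}")); // assumed cache location
        if let Ok(cached) = fs::read_to_string(&path) {
            return Ok(cached); // step already ran in an earlier attempt
        }
        let output = step(); // run the step for real...
        fs::write(&path, &output)?; // ...and persist before moving on
        Ok(output)
    }

    fn main() -> std::io::Result<()> {
        let x = memo("step1", || "expensive result".to_string())?;
        println!("{x}"); // a second run reads this from the cache
        Ok(())
    }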

It’s not a real error handling solution. I only use it for my quick and dirty projects.

Many errors are recoverable but the ones that are not fuck up this scheme because a human needed to have written a proper error recovery scheme that isn’t just “retry over and over from a previous point.”

At the end of the day, you gotta put in manual work for error handling. You can’t rely on your language or any library to create a generic error handler.


IMHO the beauty of "let it crash" is that you can code very tersely while maintaining data safety. In order for "let it crash" to work you need two things:

1. A tech stack that isolates crashes such that they do not affect the rest of the system. Example: you receive a malformed API response. The code responsible for parsing it crashes, but the rest of your application does not.

2. You use this to code in a declarative style.

Coding declaratively means your code has the shape of the data you expect to have.

Example (in elixir):

Let's say you are calling an external API where you expect to get back a list with a single element.

If you code non-declaratively, your code might look like this:

  thing_i_want = List.first(my_api_response)
Now imagine that for some reason the external API sent you a list with two elements. That code still runs! It doesn't crash, but you are now in a strange state. The data you are passing into the system is unexpected.

Coding declaratively you would write:

    [thing_i_want] = my_api_response
In that case, if for whatever reason the external API sends you a list with more than one element the code will crash. In BEAM languages that is fine--the rest of the application should be ok, and you should log why you crashed and when, so you can look into it, but you are not ingesting bad data into the system.

The alternative is to code defensively: check the length of the list before extracting the first/only element. That works, but it tends to be very verbose and, more importantly, it means the entire call stack needs to be aware that you might get exceptions bubbled up from anywhere.

Of course, for this to work you need software that has lightweight processes and a supervisor architecture... not usually worth it unless you're getting it for free! (as with elixir/erlang).


The erlang thing has bitten me in the past.

On my laptop, RabbitMQ would get into a state where it was unavailable and taking 100% CPU. I think it happened when I switched networks (moved the laptop home) and the spinning was from repeated crashing. I never sorted out the root cause, just wrote a script to kill and restart RabbitMQ. (The script to kill rabbit was named 'fudd'.)


A bit of an aside, but I really wish the animation on the flawless.dev homepage "crashed" at a non-ideal spot.

As it is, the animation crashes at the most opportune moment possible: after a side-effect statement has been persisted to the log, and even before the subsequent statement executes.

Where I want to see that animation crash is in the middle of that "HTTP request" at the bottom, where that request is an HTTP POST, and where the crash is "the server successfully received the request, and processed it, but we crashed before we could get the response."

The resuming execution's log is either "empty" … or "we sent it, but never got the result" — the former has you repeating the side-effect, and in the latter, you can at least sort of tell you're doomed.


In this scenario, the workflow is marked as "failed" and requires manual approval to continue. Sometimes you just can't resolve the issue without human input.


properly examined, this is another form of compilation, except slightly handicapped. i wonder if it would be possible to unroll/compile the function three-address code[0] style before execution, instead of the current model where you use specialized apis (eg temporal[1] and flawless[2]). i want to write my code like any program should be written, without special incantations, but still get durable execution as a non-negotiable benefit.

[0]: https://en.wikipedia.org/wiki/Three-address_code

[1]: https://temporal.io/

[2]: https://flawless.dev/


Sounds like Azure Durable Functions [0]

[0]: https://learn.microsoft.com/en-us/azure/azure-functions/dura...


I happen to have this exact need come up recently. I have a few Bash scripts that I run once in a while. They evolve each time I run them, because often I will discover some missing things, or improve them a bit as I use them. They also fail a lot, due to the experimental nature.

Now the problem is that every time they fail, I need to restart them, not from the beginning but from where they failed. So I comment out everything before the failure and rerun.

It would be nice to use durable execution for this.


Erlang and OTP, as ugly as it is (to me) got a lot right on this front already.


Indeed. I do wish there were a few more "extras" in OTP, because "let it crash" needs some more details in some circumstances.

For instance if you have a system with a user interface and some various components, like say, a database, and the database becomes unavailable, you don't want the entire system to crash. You want it to display an error message to the user and maybe go into some kind of diagnostic mode or other "things are not normal" state.

Something like https://github.com/jlouis/fuse is one approach. I had a small stab at creating something similar: https://github.com/davidw/hardcore


"Let it crash" is just the start of the process, really. Write the happy path, with pattern matching so that if things fail, it does crash, like ok = do_the_thing(). You're also expected to monitor crash reports and (live) update the system to handle crashes that need more specific handling than a simple restart.

In a distributed system you always need to handle the case where you sent a request and didn't get a response, and the request may or may not have been processed. Once you accept that, it's totally reasonable for the request processor to crash during a request, because the requester always needs to know what to do in case they get no response or a non-specific error.



Yes, and I don't think there have been many changes since then? Still the waitlist.


I wonder if this would end up with a bloated log. Imagine that you download a large file and then count the occurrences of a word. You wouldn't want to save that whole file to the log; after you count occurrences you don't need it, but it must stay in the log in case the execution is retried.

Maybe you'd want a wrapper around certain sections to say: only log the output of this group of statements.


I saw something similar back in 2010 for java but it was marketed as a different product. It was some kind of a debugger which would let me forward and rewind time and the state of the program with a single slider. It was marvelous and totally unusable for the monster java EE stuff we were working on but sounded awesome for sane, smaller projects.


But but... this will raise development costs.

I thought we're all supposed to program for "the browser" (in reality Chrome) to reduce development costs.


Isn’t a durable store the same thing as a transactional database where transactions are listed on a tape and then read in to reconstruct the state?


This could also double as an undo/redo queue.


The concept sounds like Redux with a persistence layer?


So, is this going to be a SaaS with a proprietary programming framework, or open source?


This reminds me of Kafka Streams


Isn’t this just Event Sourcing?


it applies to human civilization as well....


(2023)


Bit of a cliffhanger style ad. But well set up for the pitch as I know the whole Erlang deal and that part was covered well.

I stopped reading somewhere in the presentation of flawless. Ran out of care for the moment. Might revisit it though. Sounds interesting.



