The hidden complexity of scaling WebSockets (composehq.com)
193 points by atul-jalan 10 days ago | 68 comments





This is all true, but it also serves to remind us that Rails gives developers so much out of the box, even if you're not aware of it.

ActionCable is Rails' WebSockets wrapper library, and it addresses basically every pain point in the post. Better yet, it does so in a way that ensures all Rails developers are using the same battle-tested solution. There's no need for every project to hack together its own proprietary approach.

Thundering herds and heartbeat monitoring are both covered.

If you need a messaging schema, I strongly recommend that you check out CableReady. It's a powerful library for triggering outcomes on the client. It ships with a large set of operations, but adding custom operations is trivial.

https://cableready.stimulusreflex.com/hello-world/

While both ActionCable and CableReady are Rails libraries, other frameworks would score huge wins if they adopted their client libraries.


Elixir’s lightweight processes are also a good fit. Though I’ve seen some benchmarks that claim that goroutines can hit even lower overhead per connection.

That makes sense: Erlang/Elixir processes are a much higher-level construct than goroutines, and they trade off performance for fault tolerance and observability.

As an example, with a goroutine you have to be careful to handle all errors, because a panic would take down the whole service. In Elixir, a websocket handler can crash anywhere without impacting the rest of the application. This comes at a cost: to make this safe, Elixir has to isolate processes so they don't share memory, so each process has its own individual heap, and data gets copied around more often than in Go.


> As an example, with a goroutine you have to be careful to handle all errors, because a panic would take down the whole service.

Unless you're the default `net/http` library and simply recover from the panic: https://github.com/golang/go/blob/master/src/net/http/server...


You still need to be careful, as this won't catch panics from goroutines launched from your HTTP handler.

Yeah, I'm simplifying a bit. It may not cause an immediate exit, but it can leave the service broken in unpredictable ways. See this discussion for instance: https://iximiuz.com/en/posts/go-http-handlers-panic-and-dead...

Node has similar libraries, like Socket.IO, though Socket.IO over-abstracts things a bit in my opinion.

I've done my share of building websocket servers from scratch, but when you don't use libraries like ActionCable or socket.io, you have to build your own message ID reconciliation so that you can have request/response cycles, which is generally what you want (or will eventually want) in a websocket-heavy application.

    send(payload).then(reply => ...)
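
For what it's worth, a minimal sketch of what that reconciliation layer tends to look like (all names here are illustrative, not from any particular library):

    // Minimal sketch of message-ID reconciliation over a browser WebSocket.
    const ws = new WebSocket('wss://example.com/socket');
    let nextId = 0;
    const pending = new Map<number, (reply: unknown) => void>();

    function send(payload: object): Promise<unknown> {
      const id = ++nextId;
      ws.send(JSON.stringify({ id, ...payload }));
      // The matching reply, whenever it arrives, resolves this promise
      return new Promise((resolve) => pending.set(id, resolve));
    }

    ws.onmessage = (event) => {
      const msg = JSON.parse(event.data);
      pending.get(msg.id)?.(msg); // wake up the caller waiting on this id
      pending.delete(msg.id);
    };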

Yep, for our application, we have an `executionId` that is sent in essentially every single WebSocket message.

Both client and server use it to maintain a record of events.


Isn't this JSON-RPC's approach?

At this point, why even use a websocket vs a normal request/reply technology like gRPC or JSON-RPC?

For scenarios requiring a constant exchange of information, such as streaming data or real-time updates. After the initial handshake, data is exchanged directly over the connection with minimal overhead, and the lower latency is especially beneficial for high-frequency message exchanges. Gaming, live auctions, and real-time dashboards are well suited. I also think that real-time collaboration is under-explored.

JSON-RPC is request-response only; the server cannot send unsolicited messages. gRPC supports bidirectional streaming, but I understand that setting it up is more complex than WebSockets.

I will concede that horizontal scaling of RPC is easier because there's no connection overhead.

Ultimately, it really depends on what you're trying to build. I also don't underestimate the cultural aspect; fair or not, JSON-RPC feels very "enterprise microservices" to me. If you think in schemas, RPC might be a good fit.


Why can't the server send unsolicited messages in JSON-RPC? I've implemented bidirectional JSON-RPC in multiple projects and things work just fine, even going as far as sharing most of the code.

Yep, the web server and client can both act as JSON-RPC servers and clients. I've used this pattern before too with Web Workers, where the main thread acts as both client (sending requests to the worker) and server (fielding requests from the worker).

Or just use jsonrpc.

???

That solves none of the issues outlined in the post or the comments.


It solves the very limited problem of bike-shedding envelope shapes for request/reply protocols, which I think was all they meant to say.

At its core, JSON-RPC boils down to "use `id` and `method` and work the rest out", which is acceptably minimal but does leave you with a lot of other issues to deal with.


It's a bit of a misnomer because it defines RPCs _and_ notifications.

What people often seem to miss for some reason is that those two map naturally onto the existing semantics of the programming language they're already using.

What it means in practice is that you are exposing and consuming functions (i.e. on classes) – just like you do with ordinary libraries.

In a js/ts context it usually means async functions on classes, annotated with decorators (to register a method as an RPC and perform runtime assertions), that are also event emitters – all concepts already familiar to developers.
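
As a rough illustration of that pattern (a hypothetical `rpc` decorator and registry, not any specific library, using TS 5 / TC39 method decorators):

    // Hypothetical sketch: decorators register class methods as RPCs.
    const registry = new Map<string, (params: any) => Promise<unknown>>();

    function rpc(method: any, context: ClassMethodDecoratorContext) {
      registry.set(String(context.name), method); // expose it by name
      return method;
    }

    class TalkApi {
      @rpc
      async getLikes(params: { talkId: string }): Promise<number> {
        return 42; // stand-in for real logic
      }
    }

    // Requests (with an id) get a response; notifications (no id) don't.
    async function dispatch(msg: { id?: number; method: string; params?: any }) {
      const fn = registry.get(msg.method);
      if (!fn) {
        return msg.id !== undefined
          ? { jsonrpc: '2.0', id: msg.id, error: { code: -32601, message: 'Method not found' } }
          : undefined;
      }
      const result = await fn(msg.params);
      return msg.id !== undefined ? { jsonrpc: '2.0', id: msg.id, result } : undefined;
    }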

To summarize, you get a system that is easy to inspect, reason about, and use – almost like any other package in your dependency tree.

Introducing backward-compatible/incompatible changes also becomes straightforward for everybody, i.e. following semver on the API surface just like in any ordinary package you depend on.

These straightforward facts are often missed and largely underappreciated.

ps. in our systems we're introducing two deviations – error codes can also be strings, not just numbers (trivial); and we support async generators (emitting individual objects for array results), which helps with head-of-line blocking issues for large result sets (still compatible with JSON-RPC at the protocol level, although it would be nice if it were supported upstream as a dedicated semantic in JSON-RPC 2.1 or something). They could also specify registering and unregistering notification listeners at the spec level so everybody uses the same scheme.


What do you see as the difference between an RPC and a notification?

The terminology is not ideal, I grant, but a JSON-RPC "notification" (a request with no id) is just a request where the client cannot, and does not, expect any response, not even a confirmation that the request was received and understood by the server. It's like UDP versus TCP.
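
Concretely, per the JSON-RPC 2.0 spec, the only wire-level difference is the presence of the `id` member:

    A request (carries an id, so a response must follow):
      {"jsonrpc": "2.0", "id": 7, "method": "getLikes", "params": {"talkId": "t1"}}

    A notification (no id; no response will ever be sent):
      {"jsonrpc": "2.0", "method": "likesChanged", "params": {"talkId": "t1", "likes": 12}}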

> emitting individual objects for array results

This is interesting! How does this change the protocol? I assume it's more than just returning multiple responses for the same request?


Yes, that's all there is to the difference: a remote procedure call expects a response, a remote notification doesn't.

Our implementation emits notifications for the individual entries, and the RPC returns a "done" payload (whose contents are largely irrelevant; just the fact of completion matters).

As I said, it would be nice if they'd support generator functions at the protocol level.
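
For what it's worth, a sketch of how such a deviation can sit on top of stock JSON-RPC (the ".entry" method suffix and requestId field are invented conventions, not part of the spec):

    // Stream an async generator's items as per-entry notifications,
    // then answer the original request to signal completion.
    async function streamResult(
      send: (msg: object) => void,
      id: number,
      method: string,
      entries: AsyncIterable<unknown>,
    ) {
      for await (const entry of entries) {
        // One notification per entry; nothing buffers the whole result set
        send({ jsonrpc: '2.0', method: `${method}.entry`, params: { requestId: id, entry } });
      }
      // Final response: the payload barely matters, completion is the signal
      send({ jsonrpc: '2.0', id, result: { done: true } });
    }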


If you add Content-Negotiation it will have ALL the OSI layers! /s

Honestly, I'm a little surprised and more than a bit depressed how we effectively reinvent the OSI stack so often...


My SaaS has been using WebSockets for the last 9 years. I plan to stop using them and move to very simple HTTP-based polling.

I found that scalability isn't a problem (it rarely is these days). The real problem is crappy network equipment all over the world that will sometimes break websockets in strange and mysterious ways. I guess not all network equipment vendors test with long-lived websocket connections with plenty of data going over them.

At a certain scale, this results in support requests, and frustratingly, I can't do anything about the problems my customers encounter.

The other problems are smaller but still annoying; for example, it isn't easy to compress content transmitted through websockets.


I always recommend looking at Server-Sent Events [0] and EventSource [1]. It's a standardization of old-style long polling; it maps very well onto the HTTP paradigm and is built into the web standard.

It's so much easier to reason about than websockets, and a naive server side implementation is very simple.

A caveat is to only use them with HTTP/2, and/or add client-side logic to keep just one connection open to the server, because of browser limits on simultaneous requests to the same origin.

[0] https://developer.mozilla.org/en-US/docs/Web/API/Server-sent... [1] https://developer.mozilla.org/en-US/docs/Web/API/EventSource
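
To illustrate just how naive the server side can be, a sketch (Node server; the browser half is nothing but EventSource):

    // Naive SSE server: set the right content type, then keep writing
    // "data:" lines; each event is terminated by a blank line.
    import { createServer } from 'node:http';

    createServer((req, res) => {
      res.writeHead(200, {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
      });
      const timer = setInterval(() => {
        res.write(`data: ${JSON.stringify({ now: Date.now() })}\n\n`);
      }, 1000);
      req.on('close', () => clearInterval(timer));
    }).listen(3000);

    // Browser side; EventSource even reconnects automatically on its own.
    const source = new EventSource('/events');
    source.onmessage = (e) => console.log(JSON.parse(e.data));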


The last project I worked on went in the same direction.

Everything works great in local/qa/test, and then once we move to production we inevitably have customers with super weird network security arrangements. Users in branch offices on WiFi hardware installed in 2007. That kind of thing.

When you are building software for other businesses to use, you need to keep it simple or the customer will make your life absolutely miserable.


This is likely a misconfiguration or bugs on your end. Our products use WebSockets extensively, for both business logic and media delivery. We have significant traffic, including from third-world countries with extremely poor networks. When the server and the browser are implemented properly, the reliability of the WebSocket protocol and the software stack is basically no different from raw TCP.

When we do have issues, the bugs are always on our end, due to: 1) ingress software (firewalls, WAPs, reverse proxies, TLS termination); 2) HTTP server protocol parsing and processing of the WebSocket data stream; 3) web/HTTP framework issues. We never have issues due to networks on the users' end, apart from the quality of the connection itself.

Seen from the equipment side, a WebSocket connection is an opaque stream of data, no different from FPS gameplay or a livestream. The equipment can break it, but then the underlying HTTP breaks as well, giving very clear errors in browsers. Reconnection and keepalive for WebSockets in browsers are very robust, which you can actually prove with tests...

What are the typical payload sizes in your WebSocket messages? Could you share the median and p99 values?

I've also discovered similar networking issues in my own application while traveling. For example, in Vietnam right now, I've been facing recurring issues like long connection-establishment times and loss of responsiveness mid-operation. I thought I was losing my mind - I even configured Caddy not to use HTTP/3/QUIC (some networks don't like UDP).

I moved some chunkier messages in my app to HTTP requests, and it has become much more stable (though still iffy at times).


I transmit a lot over websockets. Large messages and large amounts of data. I don't think it makes sense to move bigger messages to HTTP requests while keeping the websockets — I heard that advice, but if I am to do that, I'd rather go all the way and stop using websockets altogether.

This is surprising to me as I would expect network equipment to just see a TCP connection given both HTTP and Websockets are an application layer protocol and that long lived TCP connections are quite ubiquitous (databases, streaming services, SSH, etc).

Found this same issue trying to scale Streamlit. It's just not a good idea.

The key to managing this complexity is to avoid mixing transport-level state with application-level state. The same approach for scaling HTTP requests also works for scaling WebSocket connections:

* Read, write and track all application-level state in a persistent data store.

* Identify sessions with a session token so that application-level sessions can span multiple WebSocket connections.

It's a lot easier to do this if your application-level protocol consists of a single discrete request and response (a la RPC). But you can also handle unidirectional/bidirectional streaming, as long as the stream states are tracked in your data store and on the client side.
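
A hedged sketch of the second bullet above - the session token, not the socket, is the identity (the Map here stands in for Redis, a database, or whatever persistent store you use):

    // Application-level sessions outlive individual WebSocket connections.
    import type { WebSocket } from 'ws';

    interface SessionState {
      userId: string;
      lastAckedSeq: number; // how far this session has been served
    }

    const sessions = new Map<string, SessionState>();

    function onConnection(ws: WebSocket, sessionToken: string): void {
      // Reconnecting with the same token resumes the same logical session
      const session = sessions.get(sessionToken) ?? { userId: 'anon', lastAckedSeq: 0 };
      sessions.set(sessionToken, session);
      // Tell the client where the session left off so it can reconcile
      ws.send(JSON.stringify({ type: 'resume', from: session.lastAckedSeq }));
    }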


Functional core, imperative shell makes testing and this fast iteration a lot easier. It’s best if your business logic knows very little about transport mechanisms.

I think part of the problem is that early systems wanted to eagerly process requests while they were still coming in. But in a system getting hundreds of requests per second, you get better concurrency if you wait for entire payloads before you waste cache lines attempting to make forward progress on incomplete data, which means you can divorce the concept of a payload entirely from how you acquired it.


> system getting 100s of requests per second you get better concurrency if you wait for entire payloads before you waste cache lines

At what point should one scale up & switch to chips with embedded DRAMs ("L4 cache")?


I haven’t been tracking price competitiveness on those. What cloud providers offer them?

But you don’t get credit for having three tasks halfway finished instead of one task done and two in flight. Any failover will have to start over with no forward progress having been made.

ETA: while the chip generation used for EC2 m7i instances can have L4 cache, I can’t find a straight answer about whether they do or not.

What I can say is that for most of the services I benchmarked at my last gig, M7i came out to be as expensive per request as the m6’s on our workload (AMD’s was more expensive). So if it has L4 it ain’t helping. Especially at those price points.


When you've profiled the code running in production and identified memory bottlenecks that can not be solved by algorithmic/datastructural optimizations.

Another thread going on right now[1] advocates very similar things in order to reduce complexity when dealing with distributed systems.

Then again, the frontend and backend are a distributed system, so it's not that weird that one comes to similar conclusions.

[1]: https://news.ycombinator.com/item?id=42813049 Every System is a Log: Avoiding coordination in distributed applications


Many years ago, we used to start a streaming session with an HTTP request, then upgrade to WebSockets after obtaining a response (this was our original "StreamSense" mechanism). In recent years, we changed StreamSense to go WebSocket first and fall back to HTTP streaming or HTTP long polling in case of issues.

At Lightstreamer, we started streaming data 25 years ago over HTTP, then moved to WebSockets. We've seen so many different behaviors in the wild internet and got so much feedback from the field over these decades that we believe the current version of Lightstreamer includes heuristics and mechanisms for virtually every aspect of WebSockets that could go wrong: from massive disconnections and reconnections, to enterprise proxies with deep inspection, to mobile users continuously switching networks.

I recall when a big customer required us to support one million live WebSocket connections per (mid-sized) server while keeping latency low. It was challenging, but it forced us to come up with a brand-new internal architecture. So many stories to tell covering 25 years of evolution...

Sounds like you have an interesting book to write :-)

E.g. “The 1M challenge”


Nice idea for my retirement (not in the very short term)!

I am really unsure why devs around the world keep defaulting to websockets for things that server-sent events were made for. In 90% of the use cases I see, websockets are just not the right fit. Everything is simpler and easier with SSE. Some exceptions are high-throughput >BI<directional data streams. But even if, say, your synced multiplayer cursors in something like Figma use websockets, don't use them for everything else, e.g. your notification updates.

I wrote about the way we handle WebSocket connections at Canva a while ago [1]. Even though some small things have changed here and there since the post was published, the overall approach has held up pretty well handling many millions of concurrent connections.

That said, even with great framework-level support, it's much, much harder to build streaming functionality compared to plain request/response if you've got some notion of a "session".

[1]: https://www.canva.dev/blog/engineering/enabling-real-time-co...


> it's much, much harder to build streaming functionality compared to plain request/response if you've got some notion of a "session"

This touches on something that I think is starting to become understood - the concept of a "session backend" to address this kind of use case.

See the complexity of disaggregating a live session backend on AWS versus Cloudflare: https://digest.browsertech.com/archive/browsertech-digest-cl...

I wrote about session backends as distinct from durable execution: https://crabmusket.net/2024/durable-execution-versus-session...


I recall another complication with websockets: IIRC it's with proxy load balancers, like binding a connection to a single connection server, even if the backend connection is using HTTP/2. I probably have the details wrong. I'm sure someone will correct my statement.

I think there is a way to do it, but it likely involves custom headers on the initial connection that the load balancer can read to route to the correct origin server.

I imagine the way it might go is that the client would first send an HTTP request to an endpoint that returns routing instructions, and then use that in the custom headers it sends when initiating the WebSocket connection.

Haven't tried this myself though.


I think it's more that WebSockets are held open for a long time, so if you're not careful, you can get "hot" backends with a lot of connections that you can't shift to a different instance. It can also be harder to rotate backends since you know you are disrupting a large number of active clients.

The trick to doing this efficiently is to arrange for the live session state to be available (through replication or some data bus) at the alternative back end before cut over.

Assuming you control the client code, you can periodically disconnect and reconnect. This could also simplify deployment.

Another option is to have the client automatically reconnect. That way your backend can just drop the connection when it needs to, and the load balancing infra will make sure the reconnection ends up on a different server.

Of course, you do want to make sure the client uses exponential backoff and jitter when reconnecting, so as to avoid thundering herd problems.

The relevant state will need to be available to all servers as well. Anything that's only known to the original server will be lost as it drops connections. On a modern deployment with a database and probably Redis available, this isn't too big of an ask.
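
A minimal client-side sketch of that reconnect loop, with capped exponential backoff and full jitter:

    // Reconnecting browser client: the randomized ("full jitter") delay
    // keeps a mass disconnect from reconnecting in lockstep.
    function connect(url: string, attempt = 0): void {
      const ws = new WebSocket(url);
      ws.onopen = () => { attempt = 0; }; // healthy again, reset the backoff
      ws.onclose = () => {
        const cap = Math.min(30_000, 1_000 * 2 ** attempt);
        const delay = Math.random() * cap;
        setTimeout(() => connect(url, attempt + 1), delay);
      };
    }

    connect('wss://example.com/socket');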


Horizontal scaling is certainly a challenge. With traditional load balancers, you don't control which instance your clients get routed to, so you end up needing to use message brokers or stateful routing to ensure message broadcasts work correctly with multiple websocket server instances.
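
As a rough sketch of the broker approach (here with ioredis as the pub/sub bus; the channel name is arbitrary):

    // Broadcast fan-out across WebSocket server instances via Redis pub/sub.
    import Redis from 'ioredis';
    import type { WebSocket } from 'ws';

    const pub = new Redis();
    const sub = new Redis();
    const localClients = new Set<WebSocket>();

    void sub.subscribe('broadcast');
    sub.on('message', (_channel, message) => {
      // Each instance delivers only to the sockets it holds locally
      for (const ws of localClients) ws.send(message);
    });

    // Any instance may publish; the broker reaches every instance.
    function broadcast(payload: object): void {
      void pub.publish('broadcast', JSON.stringify(payload));
    }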

The only complexity I have found with regard to scaling WebSockets is knowing the minimum delay between flush-event completion and actual message completion at the destination. It takes longer to process a message, even over IPC routing, than it does to kill a socket. That has upstream consequences when you consider redirection and message pipes between multiple sockets. If you kill a socket too early after a message is flushed from it, there is a good chance the destination sees the socket collapse before it has processed the final message off the socket, and that processing delay is not something a remote location is easily aware of.

I have found that, for safety, you need to allow an arbitrary delay of 100ms before killing sockets to ensure message completion, which is likely why the protocol imposes a round trip of control frame opcode 8 (Close) before closing the connection the right way.
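
In practice that maps to initiating the close handshake rather than destroying the socket. A sketch with the Node `ws` library, whose send callback fires once the data has been flushed:

    // Flush the final message, then start the close handshake (opcode 8)
    // instead of tearing the socket down immediately.
    import type { WebSocket } from 'ws';

    function gracefulClose(ws: WebSocket, finalMessage: string): void {
      ws.send(finalMessage, (err) => {
        if (err) return ws.terminate();
        ws.close(1000, 'done'); // wait for the peer to echo the close frame
      });
      ws.once('close', () => {
        // Peer acknowledged; only now is it safe to tear everything down
      });
    }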


> WebSocket connections can be unexpectedly blocked, especially on restrictive public networks.

What? How would a public network even know you’re running a websocket if you’re using TLS? I don’t think it’s really possible in the general case.

> Since SSE is HTTP-based, it's much less likely to be blocked, providing a reliable alternative in restricted environments.

And websockets are not HTTP-based?

What the article describes as challenges seems like very pedestrian things that any RPC-based backend needs to solve.

The real reason websockets are hard to scale is that they pin state to a particular backend replica, so if a whole bunch of them disconnect at scale, the system might run out of resources trying to re-load all that state.


I agree here. I have had the experience of scaling a WebSockets server to 20M connections on a single server (with this one: https://github.com/ITpC/LAppS.git). However, there are several issues with scaling WebSockets on the backend as well: mutex locking, non-parallel XOR of input stream, UTF-8 validation. I do not know the state of the above repository's code; it seems it has not been updated for at least 5 years. There were bugs in the HTTP parsing in the client part for some cases. Still, vertical scalability was excellent. Sad this thing never reached production state.

> non-parallel XOR of input stream

I remember this one in particular making me upset, simply because it's another extra buffer pass, for security reasons that I believe exist only to prevent proxies from doing shit they never should have done in the first place?
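
For context, the masking in question (RFC 6455 §5.3): every client-to-server frame payload is XORed byte-by-byte with a 4-byte key, which forces exactly that extra pass over the buffer. A sketch:

    // Unmasking a client frame payload per RFC 6455: each byte is XORed
    // with the corresponding byte of the 4-byte masking key.
    function unmask(payload: Uint8Array, maskingKey: Uint8Array): Uint8Array {
      const out = new Uint8Array(payload.length);
      for (let i = 0; i < payload.length; i++) {
        out[i] = payload[i] ^ maskingKey[i & 3]; // i & 3 === i % 4
      }
      return out;
    }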


The initial handshake will usually include an `Upgrade: websocket` header, which can be inspected by networks.

No, it literally cannot be, because by the time the Upgrade header appears, the connection is already encrypted.

Restricted environments in larger corporations can do a full MITM proxy.

It's not a very good man-in-the-middle if it can't handle a ubiquitous protocol from 2011 based on http/1.1. More like an incompetent bureaucrat in the middle.

Eh, if you're dealing with corporate network proxies, all bets are already off. They keep blocking connections for the most random reasons until everyone is opening SSH tunnels just to get work done. Websockets are a standard feature of the web; if you cut off your ears, don't complain about loss of hearing. Unless you're explicitly targeting such corporations as clients, in which case - my condolences.

The comment about Render/Railway gracefully transferring connections seems weird? I am pretty sure it just kills the service after the new one is alive, which will kill the connections. Not some fancy zero-downtime reconnect.

It does, but it will generally give a grace period for outstanding requests to resolve prior to killing the service.

It's a good idea for short-lived HTTP requests, but will cause problems for a persistent connection.


For me, the most important lesson I've learned when using websockets is to not use them whenever possible.

I don't hate them - they're great for what they are - but they're for real-time push of small messages only. Trying to use them for the rest of your API as well just throws out all the great things about HTTP, like caching, load balancing, and the normal request/response architecture. While you can use websockets for that, it's only going to cause you headaches that are already solved by simply using a normal HTTP API for the vast majority of your API.


Elixir will get you pretty far along this scaling journey without too many problems:

https://hexdocs.pm/phoenix/channels.html


> Elixir will get you pretty far along this scaling journey without too many problems:

Been running a Phoenix app in prod for 5 years. 1000+ paying customers. Heavy use of websockets. Never had an issue with the channels system. It does what it says on the tin and works great right out of the box.


Question for those in the know:

Why would I use websockets over SSE?


Websockets are bidirectional, while SSE is unidirectional (server to client). That said, there's nothing stopping you from facilitating client-to-server communication separately from the SSE stream - you just don't have to build that channel with websockets.
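
A minimal sketch of that split (the endpoint paths and render() are placeholders):

    // SSE for server -> client, plain HTTP POSTs for client -> server.
    declare function render(state: unknown): void; // placeholder UI hook

    const events = new EventSource('/events');
    events.onmessage = (e) => render(JSON.parse(e.data));

    async function sendAction(action: object): Promise<void> {
      await fetch('/actions', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(action),
      });
    }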

I have been working on an idea/Node.js library called vramework.dev recently, and a big part of it focuses on addressing the main complexities mentioned below.

For a bit of background: in order to tackle scalability, the initial approach was to explore serverless architecture. While there are both advantages and disadvantages to serverless, a notable issue with WebSockets on AWS* is that every time a message is received, it invokes a function. Similarly, sending a message to a WebSocket requires making an HTTP call to their gateway with the websocket/channel id.

The upside of this approach is that you get out-of-the-box scalability by dividing your code into functions and building things in a distributed fashion. The downside is latency, due to all the extra network hops.

This is where vramework comes in. It allows you to define a few functions (e.g., onConnect, onDisconnect, onCertainMessage) and provides the flexibility to run them locally using libraries like uws, ws, or socket.io, or deploy them in the cloud via AWS or Cloudflare (currently supported).

When running locally, the event bus operates locally as well, eliminating latency issues. If you apply the same framework to serverless, latency increases, but you gain scalability for free.

Additionally, vramework provides the following features:

- Standard Tooling

Each message is validated against its TypeScript signature at runtime. Any errors are caught and sent to the client. (Note: the error-handling mechanism has not yet been given much thought as an API.) Rate limiting is also incorporated as part of the permissioning system (each message can have permissions checked, and one of them could be rate limiting).

- Per-Message Authentication

It guards against abuse by ensuring that each message is valid for the user before processing it. For example, you can configure the framework to allow unauthenticated messages for certain actions like authentication or ping/pong, while requiring authentication for others.

- User Sessions

Another key feature is the ability to associate each message with a user session. This is essential not only for authentication but also for the actual functionality of the application. This is done by making a call to a cache (optionally), which returns the user session associated with the websocket. This session can be updated during the websocket's lifetime if needed (if your protocol deals with auth as part of its messages and not on connection).

Some doc links:

https://vramework.dev/docs/channels/channel-intro

A post that explains vramework.dev a bit more in depth (linked directly to a code example for websockets):

https://presentation.vramework.dev/#/33/0/5

And one last thing: it also produces a fully typed websocket client if you're using routes (where a property in your message indicates which function to use - the approach AWS uses for serverless).

Would love to get thoughts and feedback on this!

edit: *and potentially Cloudflare, though I’m not entirely sure of its internal workings, just the Hibernation server and optimising for cost saving


  const onConnect: ChannelConnection<'hello!'> = async (services, channel) => {
    // On connection (like onOpen)
    channel.send('hello') // This is checked against the input type
  }

  const onDisconnect: ChannelDisconnection = async (services, channel) => {
    // On close
    // This can't send anything since channel closed
  }

  const onMessage: ChannelMessage<'hello!' | { name: string }, 'hey'> = async (services, channel) => {
    channel.send('hey')
  }

  export const subscribeToLikes: ChannelMessage<
    { talkId: string; action: 'subscribeToLikes' },
    { action: string; likes: number }
  > = async (services, channel, { action, talkId }) => {
    const channelName = services.talks.getChannelName(talkId)
    // This is a service that implements a pubsub/eventhub interface
    await services.eventHub.subscribe(channelName, channel.channelId)
    // we return the action since the frontend can use it to route to specific listeners as well (this could be absorbed by vrameworks runtime in future)
    return { action, likes: await services.talks.getLikes(talkId) }
  }

  addChannel({
    name: 'talks',
    route: '/',
    auth: true,
    onConnect,
    onDisconnect,
    // Default message handler
    onMessage,
    // This will route the message to the correct function if a property action exists with the value subscribeToLikes (or otherwise)
    onMessageRoute: {
      action: {
        subscribeToLikes: {
          func: subscribeToLikes,
          permissions: {
            isTalkMember: [isTalkMember, isNotPresenter],
            isAdmin
          },
        },
      },
    },
  })

A code example.

Worth noting you can share functions across websockets as well, which allows you to compose logic across different ones if needed


> At Compose, every WebSocket message starts with a fixed 2-byte type prefix for categorizing messages.

Some of the complexity is self-inflicted by ignoring the KISS principle.


How would you make it simpler?




