
> Cloud Storage FUSE can only write whole objects at a time to Cloud Storage and does not provide a mechanism for patching. If you try to patch a file, Cloud Storage FUSE will reupload the entire file. The only exception to this behavior is that you can append content to the end of a file that's 2 MB and larger, where Cloud Storage FUSE will only reupload the appended content.

I didn’t know GCS supported appends efficiently. Correct me if I’m wrong, but I don’t think S3 has an equivalent way to append to a value, which makes it clunky to work with as a log sink.



Append workloads are common in distributed systems. It turns out that nearly every time you think you need random read/write access to a data structure (e.g. a hard drive/block device for a Windows/Linux VM), you can instead emulate it with an append-only log of changes and a set of append-only indexes.

Doing so has huge benefits: Write performance is way higher, you can do rollbacks easily (just ignore the tail of the files), you can do snapshotting easily (just make a new file and include by reference a byte range of the parent), etc.

The downside is that from time to time you need to make a new file and chuck out the dead data - but that compaction can be done 'online', during times of lower system load.
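
A minimal sketch of that pattern, with hypothetical names (nothing here is from any specific system): random-access writes to a key/value "device" are emulated as an append-only log of JSON records plus an in-memory index rebuilt by replay; a snapshot is just a recorded log length, and rollback is just truncating the tail.

    import json

    class AppendOnlyStore:
        def __init__(self, path: str):
            self.path = path
            self.index = {}          # key -> latest value, rebuilt by replaying the log
            self.offsets = []        # byte offset of each record, for snapshots/rollback
            open(path, "ab").close() # make sure the log file exists
            self._replay()

        def _replay(self):
            with open(self.path, "rb") as f:
                while True:
                    pos = f.tell()
                    line = f.readline()
                    if not line:
                        break
                    rec = json.loads(line)
                    self.offsets.append(pos)
                    self.index[rec["k"]] = rec["v"]

        def put(self, key, value):
            # Every write is an append; nothing is modified in place.
            with open(self.path, "ab") as f:
                f.seek(0, 2)
                self.offsets.append(f.tell())
                f.write(json.dumps({"k": key, "v": value}).encode() + b"\n")
            self.index[key] = value

        def get(self, key):
            return self.index.get(key)

        def snapshot(self) -> int:
            # A snapshot is just "the log up to here".
            return len(self.offsets)

        def rollback(self, snap: int):
            # Rollback: drop the tail, then rebuild the index from the remaining prefix.
            if snap < len(self.offsets):
                with open(self.path, "r+b") as f:
                    f.truncate(self.offsets[snap])
            self.index, self.offsets = {}, []
            self._replay()

A real system would also need the compaction step mentioned above: periodically write a new file containing only the live records and swap it in.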


> Append workloads are common in distributed systems.

Bringing the topic back to what the parent was saying: since S3 is a pretty common system, and a distributed system at that, are you saying that S3 does support appending data? AFAIK, S3 never supported any append operations.


I’ll add some nuance here. You can implement append yourself in a clunky way: create a new multipart upload for the object, copy its existing contents over as the first part, upload a new part with the data you want to append, and then complete the upload.

Not as elegant / fast as GCS’s and there may be other subtleties, but it’s possible to simulate.
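
For illustration, a hedged boto3 sketch of that simulation (bucket/key names are placeholders, and it assumes the existing object is at least 5 MiB, since every multipart part except the last must be >= 5 MiB):

    import boto3

    s3 = boto3.client("s3")

    def append_to_object(bucket: str, key: str, new_data: bytes) -> None:
        upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
        upload_id = upload["UploadId"]

        # Part 1: server-side copy of the object's existing bytes.
        part1 = s3.upload_part_copy(
            Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=1,
            CopySource={"Bucket": bucket, "Key": key},
        )

        # Part 2: the bytes we want to "append".
        part2 = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=2, Body=new_data,
        )

        # Completing the upload replaces the object with old + new content.
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": [
                {"PartNumber": 1, "ETag": part1["CopyPartResult"]["ETag"]},
                {"PartNumber": 2, "ETag": part2["ETag"]},
            ]},
        )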


It does not. Consider using services like Amazon Kinesis Firehose, which can buffer and batch logs, then periodically write them to S3.


You just described the WAL files of a database.


The generalized architecture is called event sourcing


Well, I'd argue that they're just two different names for similar concepts, applied at different levels. A WAL is a low-level implementation detail, usually there for durability, while event sourcing is an architecture applied to solve business problems.

A WAL would usually disappear or be truncated after a while, and you'd only rerun things from it if you absolutely have to. Changes in business requirements shouldn't require you to do anything with a WAL.

In contrast, an event sourcing log would be kept indefinitely, so when business requirements change, you could (if you want to, it's not required) re-run the previous N events and apply the new logic to old data in your data storage.

But, if you really want to, it's basically the same, but in the end, applied differently :)


The WAL is just one moving piece of a database, which might also use SSTables, an LSM tree, or some other append-only immutable data structure. Event sourcing, by contrast, is an enterprise architecture pattern for business applications that relies heavily on immutable, append-only data structures as its backbone.
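
A toy, hypothetical sketch of the event sourcing side of that contrast: the append-only event log is the source of truth, and current state is derived by replaying it, so changed business rules can be applied to old events simply by re-running the fold.

    from dataclasses import dataclass

    @dataclass
    class Deposited:
        amount: int

    @dataclass
    class Withdrew:
        amount: int

    def apply(balance: int, event) -> int:
        if isinstance(event, Deposited):
            return balance + event.amount
        if isinstance(event, Withdrew):
            return balance - event.amount
        return balance

    events = [Deposited(100), Withdrew(30), Deposited(5)]  # kept indefinitely
    balance = 0
    for e in events:                                       # replay to derive state
        balance = apply(balance, e)
    print(balance)  # 75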


Ha, I was just thinking how similar this is to the Postgres pg-audit approach of reusing the logs to reconstruct the correct state.


You can use S3 versioning, assuming you have it enabled on the bucket. It would be a little clunky, and it would be done in batches rather than as a continuous append.

Basically if your data is append only (such as a log), buffer whatever reasonable amount is needed, and then put a new version of the file with said data (recording the generated version ID AWS gives you). This gets added to the "stack" of versions of said S3 object. To read them all, you basically get each version from oldest to newest and concatenate them together on the application side.

Tracking versions would need to be done application side overall.

You could also do "random" byte ranges if you track the versioning and your object has the range embedded somewhere in it. You'd still need to read everything to find what is the most up to date as some byte ranges would overwrite others.

Definitely not the most efficient but it is doable.
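
A hedged boto3 sketch of that approach (placeholder names; assumes versioning is enabled on the bucket and that the whole version stack fits comfortably in memory when read back):

    import boto3

    s3 = boto3.client("s3")

    def append_chunk(bucket: str, key: str, chunk: bytes) -> str:
        # Each "append" is just a new version of the same object.
        resp = s3.put_object(Bucket=bucket, Key=key, Body=chunk)
        return resp["VersionId"]  # record this on the application side

    def read_log(bucket: str, key: str) -> bytes:
        # Gather every version of the key, oldest first, and concatenate them.
        versions = []
        paginator = s3.get_paginator("list_object_versions")
        for page in paginator.paginate(Bucket=bucket, Prefix=key):
            versions.extend(v for v in page.get("Versions", []) if v["Key"] == key)
        versions.sort(key=lambda v: v["LastModified"])
        parts = []
        for v in versions:
            obj = s3.get_object(Bucket=bucket, Key=key, VersionId=v["VersionId"])
            parts.append(obj["Body"].read())
        return b"".join(parts)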


What is the advantage of versioning versus just naming your objects log001, log002, etc., and opening them in order?


You can set up lifecycle policies, for example auto-deleting or auto-archiving versions older than X days with a single lifecycle rule. With custom naming schemes, it wouldn't scale as well.
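
For example, a hedged boto3 sketch of one such rule (bucket name, prefix, and retention are placeholders) that expires noncurrent versions after 30 days:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-log-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-old-log-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                # Delete versions 30 days after they stop being the current version.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }]
        },
    )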


OMG...


With S3 you can do something similar by misusing the multipart upload functionality, e.g.: https://github.com/fsspec/s3fs/blob/fa1c76a3b75c6d0330ed03c4...


gcsfuse uses compose [1] to append. Basically it uploads the new data to a temp object, then performs a compose operation to create a new object in place of the original with the combined content.

[1] https://cloud.google.com/storage/docs/composing-objects
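
A hedged sketch of that flow with the google-cloud-storage Python client (bucket and object names are placeholders; this illustrates the compose API, not gcsfuse's actual code):

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-bucket")

    def append_via_compose(object_name: str, new_data: bytes) -> None:
        # 1. Upload the new bytes as a temporary object.
        temp = bucket.blob(object_name + ".append-tmp")
        temp.upload_from_string(new_data)

        # 2. Compose original + temp back into the original object's name.
        original = bucket.blob(object_name)
        original.compose([bucket.blob(object_name), temp])

        # 3. Clean up the temporary object.
        temp.delete()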


Azure Blob Storage actually has explicit append support via append blobs (alongside block and page blobs).
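
A hedged sketch with the azure-storage-blob Python SDK (connection string, container, and blob names are placeholders):

    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        conn_str="<connection-string>",
        container_name="logs",
        blob_name="app.log",
    )

    blob.create_append_blob()           # create an empty append blob
    blob.append_block(b"first line\n")  # each call appends a block
    blob.append_block(b"second line\n")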


Building a storage service like these today and not having "append" would be very silly indeed. I guess S3 is kind of excused since it's so old by now. Although I haven't read anything about them adding it, so maybe less excused...


> Correct me if I’m wrong, but I don’t think S3 has an equivalent way to append to a value, which makes it clunky to work with as a log sink.

You are correct. (There are multipart uploads, but that's kinda different.)

ELB logs are delivered as separate objects every few minutes, FWIW.


I wonder how it handles potential conflicts.


For appends, the normal way is to apply the append operations in an arbitrary order if there are multiple concurrent writers. That way you can have 10 jobs all appending data to the same 'file', and you know every record will end up in that file when you later scan through it.

Obviously, you need to make sure no write operation breaks a record midway while doing that (unlike the POSIX write() API, which can be interrupted midway).


It's also sensible to have high-quality record markers that make identifying and skipping broken records easy. For example, recordio, a record-based container format used at Google in the early MapReduce days (and probably still used today), was explicitly designed to skip corruption efficiently (you don't want a huge index build to fail halfway through because a single document got corrupted).
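
A hedged sketch of that kind of record framing (the layout here is illustrative, not recordio's actual wire format): a magic marker, a length, and a CRC let a reader detect a corrupt record and resynchronize on the next marker instead of aborting.

    import struct
    import zlib

    MAGIC = b"\xde\xad\xbe\xef"

    def write_record(f, payload: bytes) -> None:
        header = MAGIC + struct.pack("<II", len(payload), zlib.crc32(payload))
        f.write(header + payload)

    def read_records(f):
        data = f.read()
        i = 0
        while i + 12 <= len(data):
            # Resynchronize on the next magic marker.
            j = data.find(MAGIC, i)
            if j < 0 or j + 12 > len(data):
                return
            length, crc = struct.unpack_from("<II", data, j + 4)
            payload = data[j + 12 : j + 12 + length]
            if len(payload) == length and zlib.crc32(payload) == crc:
                yield payload
                i = j + 12 + length
            else:
                i = j + 1  # corrupt record: skip ahead and keep scanning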

Even so, I avoid parallel appenders to a single file; it can be hard to reason about and debug in ways that having each process append to its own file isn't.


Objects have three fields for this: Version, Generation, and Metageneration. There's also a checksum. You can be sure that you were the writer / winner by checking these.


You can also send a x-goog-if-generation-match[0] header that instructs GCS to reject writes that would replace the wrong generation (sort of like a version) of a file. Some utilities use this for locking.

0: https://cloud.google.com/storage/docs/xml-api/reference-head...
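
A hedged sketch of the same precondition via the google-cloud-storage Python client (placeholder names); passing if_generation_match=0 instead only allows the write if the object does not exist yet, which is the usual building block for lock files:

    from google.cloud import storage
    from google.api_core.exceptions import PreconditionFailed

    client = storage.Client()
    blob = client.bucket("my-bucket").blob("state.json")

    blob.reload()  # fetch current metadata, including the generation number
    try:
        # Only succeeds if nobody has replaced the object since we read it.
        blob.upload_from_string(b"new contents", if_generation_match=blob.generation)
    except PreconditionFailed:
        # Another writer won the race; re-read and retry, or give up.
        pass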


That makes sense - if you keep data in something like ndjson and don't require any order.

If you need order then probably writing to separate files and having compaction jobs is still better.



