A new ProtoBuf generator for Go (vitess.io)
296 points by tanoku on June 3, 2021 | 73 comments



Using CPU utilization as a performance metric can be extremely misleading. My favorite article on the subject is from Brendan Gregg:

http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-...

A much better way to test the influence of the new compiler would be to measure the actual throughput at which saturation is achieved (which is what the benchmarks in the C++ gRPC library measure to assess performance).


There is a fairly robust set of benchmarks that are run to test out performance improvements [1], and macro benchmarks are the ultimate test of holistic improvement. CPU isn't a great proxy, but one of the biggest problems in real-world performance on this specific system (and databases in general) is latency. CPU time is a really good proxy for latency, so by taking a look at CPU time we can get an idea of how the system will respond under "normal" conditions.

[1]: https://benchmark.vitess.io/macrobench


In this case the regression also caused a 3% decrease in throughput.


I'm not sure that the phrasing in the article is particularly fair:

> The maintainers of Gogo, understandably, were not up to the gigantic task.

I'm 99% sure they are "up to" (as in "capable of") doing so, they are just not "up for" it (as in, "will not do it").


They could be "not up to" because of lack of resources, probably time and/or money. I think that's what is implied, rather than lack of technical knowledge.


I got the sense that they meant "not willing" but I agree that's one of those English phrases that can easily be misconstrued towards the more negative interpretation.

That said, I love the detailed post and the interesting solution, and the commitment to performance!


Yes, I assume the author meant “not up for”.


I hadn't realized that Gogo was in such a bad spot with the upstream Go protobuf changes. There was lots of drama when the changes were made, and I guess that overshadowed any visibility I had into Gogo.

Making vtprotobuf an additional protoc plugin seems like the Right Thing™, although it's a shame how complicated protoc commands end up becoming for mature projects. I'm pretty tempted to port Authzed over to this and run some benchmarks -- our entire service requires e2e latency under 20ms, so every little bit counts. The biggest performance win is likely just having an unintrusive interface for pooling allocated protos.
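For illustration, pooling could look roughly like this minimal sync.Pool sketch (the package path and message type are made up; any generated message works, since they all have a Reset() method):

    package pool

    import (
        "sync"

        pb "example.com/gen/itempb" // hypothetical generated package
    )

    // itemPool recycles decoded messages so steady-state request
    // handling stops hammering the allocator and the GC.
    var itemPool = sync.Pool{
        New: func() interface{} { return new(pb.Item) },
    }

    func GetItem() *pb.Item { return itemPool.Get().(*pb.Item) }

    // Callers must not retain m after handing it back.
    func PutItem(m *pb.Item) {
        m.Reset() // clear all fields before reuse
        itemPool.Put(m)
    }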


Proto message unmarshaling in Go for a small message should be five orders of magnitude below 20ms; it shouldn't even begin to matter until you are sweating individual microseconds.


That's true if your program only does a single unmarshal at a time, at a leisurely pace. In a steady-state situation, the memory churn left behind by each individual unmarshal call has to be paid for by some poor future request.

I agree it's unlikely the difference here will be solely responsible for tipping the GP's request above 20ms, but the memory problems could reasonably ruin tail latencies.


The significance of 20ms isn't clear so this is hard to judge.

Perhaps they have significant external (network) latency leaving only a few ms budget for the application stack - so they could easily be up against a wall.


Until the GC kicks in and steals a full 200usec + a bunch of your throughput...

(Holy shit, who is downvoting this? It's literally the whole article!)


If your path is sensitive to 200us of latency you should probably optimize your application and tune your GC. Typically 200us for freeing all unreachable memory is not a big deal.


> If your path is sensitive to 200us of latency you should probably optimize your application and tune your GC.

Okay, you've done this. Three years later it's the same thing again, since you need to accommodate the new features. Your users haven't upgraded their computers. What do you do?


Run a profiler and optimize again.


Your original code is already optimized as much as is possible outside of the things mentioned by OP


Don't guess, measure.

Then just like C, writing a tiny set of functions in Assembly is always an option.


Keep in mind that I am referring to this:

> If your path is sensitive to 200us of latency you should probably optimize your application and tune your GC.

> Don't guess, measure.

Yes, I am saying that the original code has already gone through a complete optimization process, everything is written in assembly, and you are at 970us on your 1ms time budget. (I'm not pulling those numbers out of thin air; I was literally at a client yesterday with some real-time code with a 1ms deadline on a desktop OS, and we have to cram as much as possible into that millisecond.)


Well, in that case there is no way around upgrading the hardware; a 486 won't play an MP4 no matter how hand-tuned the assembly code is.


Properly written Go code (or even Java, for that matter) will try to minimize allocations. For Java, unless I am mistaken, pauseless GC is only offered by Azul - $$


> or even Java

Just in case you may be unaware, the latest GCs for Java (Shenandoah, ZGC) are miles ahead of anything available for Go, due to sheer age and manpower. Parallel and pauseless collection is easily achievable in most cases.


> Latest GCs for Java (Shenandoah, ZGC) are miles ahead of anything available.

Beyond hyperbole, do you have any actual comparison of Go vs Java GC performance?


Java's GC is better but Go's GC is also parallel and "pauseless" - iirc ZGC is 50-500usec which is comparable to Go's target 200usec.

The point is, neither is "five orders of magnitude" below 20ms. And neither needs zero CPU even if it doesn't block other threads.


Yeah, the whole point of the article is that gRPC v2 (and frankly v1 for that matter) are not “properly written” to do this.


3% regression in QPS, 20% regression in CPU, and 5% regression in memory usage according to the article. Those are considerably worse than "5 orders of magnitude below".


GP meant 5 orders of magnitude below "20 ms". 20 ms is a lot of time.

There is nothing one can do to, say, a 1-kilobyte buffer that will cross 1ms in any language. My own Go code doesn't spend more than a few microseconds per message.


GP's root claim is that protobuf serialization/deserialization performance shouldn't matter, on an article where a user is specifically demonstrating that it does matter.


The use case described in the article and the one described in the top post in this thread aren't the same. If you aren't throughput-bound, a 5% regression in parse speed doesn't matter if your goal is to stay under 20ms and parsing takes 17us. Sure, it now takes 19us, but that's a regression of 2us out of 20ms, or 1/10,000th of your budget.


> our entire service requires e2e latency under 20ms

Why are you using Go then?


20ms is a pretty considerable amount of time WRT E2E transaction time in today's world. Can you expand on your concerns with Go?


It's not really suitable for latency-critical applications.

EDIT: Fixed unfortunate typo


You can 100% write services with P999 < 20ms in Go. Not even trying that hard. Go is entirely suitable for these kinds of constraints; I dare say that's Go's main target.

P99 < 1ms, that's when you're going to want to switch it up.


Depending on workload, Go also does sub-1ms p99 pretty easily. I'm getting sub-1ms p99.9.


What are the proposed solutions to get better than that? C/Rust code? Assembly?


Going fast is one thing. Making a program that responds consistently is different, and there's a continuum of choices. The last time I read about Go's GC they targeted 500us and for tons of applications that's more than sufficient; for some, it's not.

You could start with twiddling some of the GC knobs Go gives you, but you're still working against an SLO. If you need stronger guarantees you'll look at languages that completely eschew GC, because Go's GC still has STW bits. Climb the ladder further and you're reducing allocations, eventually avoiding any malloc() beyond what it takes to get an arena and doing your own bookkeeping. I've never been near the top of the ladder when you have hard real-time constraints, but I've heard it involves paying Wind River for VxWorks licenses ;)


was the double-negative intentional? I've used Go for sub-millisecond needs. So 20ms seems like it would be a reasonable choice from where I'm sitting.


It was not intentional, thanks for asking...very unfortunate typo ;)

Go doesn't give you control over inline vs indirect allocation, instead relying on escape analysis, which is notoriously finicky. Seemingly unrelated changes, along with compiler upgrades, can ruin your carefully optimized code.
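To see the finickiness for yourself, `go build -gcflags=-m` prints the compiler's escape-analysis decisions; a toy example (function names are mine):

    package main

    // Build with: go build -gcflags=-m
    // The compiler reports which values "escape to heap".

    type big struct{ buf [4096]byte }

    // Returning a pointer forces the allocation onto the heap.
    func onHeap() *big { return &big{} }

    // The same struct kept local stays on the stack: zero GC pressure.
    func onStack() byte {
        var b big
        b.buf[0] = 1
        return b.buf[0]
    }

    func main() {
        _ = onHeap()
        _ = onStack()
    }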

This is especially heinous because Go uses a GC; unnecessary allocations have a disproportionately large impact on your application's performance. One or the other wouldn't be nearly as bad.

Time and time again we see reports from organizations/projects with perfectly fine average latency but horrendous p95+ times when written in Go - some going as far as to do straight-up insane optimizations (see Dgraph) or rewrite in other languages.


But you think this impacts a 20ms budget? It's mostly trivial to get sub-20ms p99 in Go.


While escape analysis in Go is finicky, you can make it part of your CI/CD to keep it under control.

https://medium.com/a-journey-with-go/go-introduction-to-the-...

No different than running other kinds of static analysis for well-known languages that are unsafe by default.


I don't know; I'm able to get 150k gRPC q/sec with p99 sub-1ms. It's definitely better than G1 and CMS.


Funny timing: I've just written most of a TypeScript generator for protobufs. I learned about some fun corners of protobuf I didn't expect while trying to pass the protobuf conformance tests [1] (which this one passes; that's no mean feat!).

- If you write the same message multiple times, protobuf implementations should merge fields with a last write wins policy (repeated fields are concatenated). This includes messages in oneofs.

- For a boolean array, you're better off using a packed, repeated int64 (if wire size matters a lot). Protobuf bools use varint encoding, meaning you need at least 2 bytes for every boolean: 1+ for the tag and type and 1 byte for the 0 or 1 value. With a repeated int64, you'd encode the tag and length in 2 varints, and then you get 64 bools per 8 bytes (see the sketch after the footnotes below).

- Fun trivia: Varints take up a max of 10 bytes but could be implemented in 9 bytes. You get 7 bits per varint byte, so 9 bytes gets you 63 bits. Then you could use the most significant bit of the last byte to indicate if the last bit is 0 or 1. Learned by reading the Go varint implementation [2].

- Messages can be recursive. This is easy if you represent messages as pointers, since you can use nil. It's a fair bit harder if you want to always use a value object for each nested message, since you need to break cycles by marking fields as `T | undefined` to avoid blowing the stack. Figuring out the minimal number of fields to break cycles is an NP-hard problem called minimum feedback arc set [3].

- If you're writing a protobuf implementation, the conformance tests are a really nice way to check that you've done a good job. Be wary of implementations that don't implement the conformance tests.

[1]: https://github.com/protocolbuffers/protobuf/tree/master/conf...

[2]: https://github.com/golang/go/blob/master/src/encoding/binary...

[3]: https://en.wikipedia.org/wiki/Feedback_arc_set#Minimum_feedb...
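As referenced in the bool-packing bullet above, here's a quick sketch of the idea in Go (function names are mine; nothing protobuf-specific):

    package main

    import "fmt"

    // packBools packs a []bool into uint64 words; sent as a packed
    // repeated 64-bit field, that's 64 bools per word instead of
    // roughly 2 wire bytes per bool.
    func packBools(bits []bool) []uint64 {
        words := make([]uint64, (len(bits)+63)/64)
        for i, set := range bits {
            if set {
                words[i/64] |= 1 << (i % 64)
            }
        }
        return words
    }

    // bitAt reads bool i back out of the packed words.
    func bitAt(words []uint64, i int) bool {
        return words[i/64]&(1<<(i%64)) != 0
    }

    func main() {
        w := packBools([]bool{true, false, true})
        fmt.Println(bitAt(w, 0), bitAt(w, 1), bitAt(w, 2)) // true false true
    }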


The varint format also isn't as dense on average as it could be, and it allows for non-canonical encodings, i.e. you can encode any integer in multiple ways (padding it out to 9 or 10 bytes).

The solution is to subtract 1 from the remaining integer every time you emit a continuation byte (since the existence of the next byte already indicates that the remaining value isn't 0).
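A sketch of that trick in Go (sometimes called a bijective varint; function names are mine, not any library's API):

    package main

    import "fmt"

    // putDenseUvarint appends v using the subtract-1 trick: every byte
    // string now decodes to exactly one integer, so redundant encodings
    // like {0x80, 0x00} for 0 simply don't exist.
    func putDenseUvarint(buf []byte, v uint64) []byte {
        for v >= 0x80 {
            buf = append(buf, byte(v)|0x80)
            v = v>>7 - 1 // the continuation byte already says "at least 1 more"
        }
        return append(buf, byte(v))
    }

    // denseUvarint decodes, adding the 1 back per continuation byte.
    func denseUvarint(buf []byte) (v uint64, n int) {
        var shift uint
        for i, b := range buf {
            add := uint64(b & 0x7F)
            if i > 0 {
                add++ // undo the encoder's subtraction
            }
            v += add << shift
            if b < 0x80 {
                return v, i + 1 // value, bytes consumed
            }
            shift += 7
        }
        return 0, 0 // truncated input
    }

    func main() {
        enc := putDenseUvarint(nil, 128) // {0x80, 0x00}: would decode as 0 in LEB128
        v, _ := denseUvarint(enc)
        fmt.Printf("% x -> %d\n", enc, v)
    }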


> Arenas are, however, unfeasible to implement in Go because it is a garbage collected language.

If you are willing to use cgo, Google already implemented one for gapid.

https://github.com/google/gapid/tree/master/core/memory/aren...


Not only that, there are other garbage collected languages like D, Nim and C# that offer the language features to do arenas without having to touch any C code.

There is still so much education to do.


Aren't arenas old news in GC languages in general?

Most of the time their absence is due to general-purpose pools being just as good, or to people simply not needing them that much with modern GCs.


Yes, so I really did not get how such an assertion came to be made.

Probably lack of experience with machine-friendly code.


Do I misunderstand what arenas are? I thought it was just "allocate this big array as a single allocation rather than N little allocations"? If so, how is that not supported in Go? (e.g., `arena := make([]Foo, 1000000000)`)


An arena allocator allows you to store many allocations _of different types_ in the same single chunk of memory, and then free all of them at one point in time.


Why can't you do this in Go? I'm 99% sure we can allocate a massive array of bytes using safe Go and use unsafe to cast a chunk of bytes to an instance of a type. This isn't type safe, but neither would the equivalent C code.


That's what this whole thread is about: you can literally do just that.


> That's what this whole thread is about: you can literally do just that

I don't know how you get that from the thread:

> Arenas are, however, unfeasible to implement in Go because it is a garbage collected language.

> If you are willing to use cgo, google already implemented one for gapid.

> there are other garbage collected languages like D, Nim and C# that offer the language features to do arenas without having to touch any C code.

It seems like the above statements implicitly or explicitly claim that this isn't feasible in Go without C.


You are misunderstanding the thread. I just mentioned some of the languages I like (still waiting for Go's generics), and the comment I was replying to made an assertion about an implementation that uses cgo.

Both of us are dismissing the assertion that "Arenas are, however, unfeasible to implement in Go because it is a garbage collected language."

You can do manual memory allocation via a syscall into the host OS, use unsafe to cast memory blocks to the types that you want, and then clean it all up with defer, assuming the arena is only usable inside a lexical region; otherwise extra care is needed to avoid leaks.
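For instance, a toy bump arena along those lines (all names mine; a sketch of the idea, not production code):

    package arena

    import "unsafe"

    // Arena hands out chunks of one big allocation; Reset frees
    // everything at once by rewinding the bump offset. Caveat: values
    // stored here must not contain Go pointers, since the GC only sees
    // buf as a plain byte slice and won't scan through it.
    type Arena struct {
        buf []byte
        off uintptr
    }

    func New(size int) *Arena { return &Arena{buf: make([]byte, size)} }

    func (a *Arena) alloc(size, align uintptr) unsafe.Pointer {
        off := (a.off + align - 1) &^ (align - 1) // round up for alignment
        if off+size > uintptr(len(a.buf)) {
            panic("arena exhausted")
        }
        a.off = off + size
        return unsafe.Pointer(&a.buf[off])
    }

    func (a *Arena) Reset() { a.off = 0 }

    // Usage (the struct is arbitrary):
    //   type point struct{ x, y int64 }
    //   p := (*point)(a.alloc(unsafe.Sizeof(point{}), unsafe.Alignof(point{})))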


Fair enough.


I was proposing the cgo option because it's already implemented.

I _think_ allocating a slice of contiguous bytes and using unsafe pointers should work fine as long as you are very cautious about structs/vars with pointers into the buffer getting freed by the GC.


> I _think_ allocating a slice of contiguous bytes and using unsafe pointers should work fine as long as you are very cautious about structs/vars with pointers into the buffer getting freed by the GC

Go's GC is conservative, so I don't think you need to take any special caution in that regard. I would expect that you just need to take care that your casts are correct (e.g., that you aren't casting overlapping regions of memory as distinct objects).


Go went to a precise GC with version 1.3


Oh wow, I didn’t realize.


I can't believe we've managed to have this lengthy of a discussion about GC languages and speed without anyone mentioning rust. Has HN turned a corner?


Maybe, don't know.

As far as I am concerned, although I like Rust, I only see it for scenarios where any kind of memory allocation is very precious, Ada/SPARK and MISRA-C style.

I have been using GC languages with C++-like features, or polyglot codebases, for almost 20 years, which is long enough to think otherwise.

Most of the time developers learn about `new` and miss out on the low-level language features.

It is a matter of balance: either try to do everything in a single language, or eventually write a couple of functions in a lower-level language that are then used as building blocks for the rest of the application.

No need to throw away the ecosystem and developer tooling just to rewrite a data structure.


Would you consider codecs or heavy numerical simulations to fall under those memory allocation scenarios that you'd use Rust for as well?


That is a good scenario; however, you can still use languages like D, Nim, Swift, C#, F#, Go, among others, for such scenarios.

For example, you can do codecs in C# on WinRT with .NET Native,

https://docs.microsoft.com/en-us/windows/uwp/audio-video-cam...

In the context of protobuf,

https://devblogs.microsoft.com/aspnet/grpc-performance-impro...

Back to Rust: yes, it is a good option, I just wouldn't write the whole application in it, just specialized libraries.

Hence why I am looking forward to Rust/Windows efforts.


Rust has an arena allocator too[1], but it is implemented with 165(!!!) usages of unsafe. :)

[1] https://github.com/fitzgen/bumpalo


This is far from the only arena allocator written in Rust.

From the same author, a zero-unsafe arena allocator: https://github.com/fitzgen/generational-arena

There are many, many arena implementations available with varying characteristics. It's disingenuous to act like Rust requires the author of an arena library to write "unsafe" everywhere.


I wonder what Google thinks about the v2 performance. It's well known that protobuf processing is a heavy tax on their data centers [1]. It's hard to imagine they'd just leave it slow. Or do they?

[1] https://research.google/pubs/pub44271/


There was a project to develop an ASIC (probably bundled inside a NIC) to do protobuf parsing. At some point Sanjay made a change to the proto API that rendered that project less appealing.

Disclaimer: Google had a lot of internal stuff they considered important to their core tech competencies. For example, nothing open source about Google's Paxos APIs and infrastructure, networking, etc.


Maybe I'm missing something, but my read of golang/protobuf#364[1] was that part of the motivation for the re-organization in protobuf-go v2 was to allow for optimizations like gogoprotobuf to be developed without requiring a complete fork. I totally understand that the authors of gogoprotobuf do not have the time to re-architect their library to use these hooks, but best I can figure this generator does not use these hooks either. Instead it defines additional member functions, and wrappers that look for those specialized functions and fallback to the generic ones if not found.

For example, it looks like pooled decoders could be implemented by setting a custom unmarshaller through the ProtoMethods[2] API.

I wonder why not. Did the authors of the vtprotobuf extension not want to bite off that much work? Is the new API not sufficient to do what they want (thus failing some of the goals expressed in golang/protobuf#364)?

[1]: https://github.com/golang/protobuf/issues/364

[2]: https://pkg.go.dev/google.golang.org/protobuf@v1.26.0/reflec...


I haven't looked in more detail, but one blocker is that `ProtoMethods() *methods` returns a private type, making it effectively unimplementable outside this package.


So, I thought this at one point, too. But it turns out that methods is a type alias to an unnamed type, so there's no package level privacy issues: https://github.com/protocolbuffers/protobuf-go/blob/v1.26.0/...
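A minimal illustration of the aliasing trick (the Flags field and all names here are made up):

    package main

    // "methods" is an alias for an unnamed struct type, mirroring
    // what protoreflect does.
    type methods = struct{ Flags uint64 }

    type API interface{ ProtoMethods() *methods }

    // impl never names "methods", yet satisfies API: the unnamed
    // struct type it returns is identical to the alias target. Across
    // package boundaries, this is what lets an outside implementation
    // provide the method without being able to name the private alias.
    type impl struct{}

    func (impl) ProtoMethods() *struct{ Flags uint64 } {
        return &struct{ Flags uint64 }{Flags: 1}
    }

    func main() {
        var a API = impl{}
        _ = a.ProtoMethods().Flags
    }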


Oh huh, interesting, I've never seen that done before.

I'm struggling to understand what the rationale _for_ doing it is though. Maybe it's to avoid an import cycle?


Yes, to avoid an import cycle or polluting the protoreflect API documentation with a rather large non-user-facing API surface.


The biggest current problem with Go and protobuf is Swagger support when using it for API responses. Enums are not supported, for example. The leniency of protojson can't be matched in other languages that build on top of the Swagger docs.


Is there one for Kotlin yet? It's pretty pathetic that Google's own protocol lacks native support for its most popular operating system.




