
Interesting library, but I see it falls into the same trap as almost all SIMD libraries, which is that they hardcode the vector target completely and you can't mix/match feature levels within a build. The documentation recommends writing your kernels into DLLs and dynamic-loading them, which is a huge mess: https://jfalcou.github.io/eve/multiarch.html

Meanwhile xsimd (https://github.com/xtensor-stack/xsimd) has the feature level as a template parameter on its vector objects, which lets you branch at runtime between SIMD levels as you wish. I find it's a far better way of doing things if you actually want to ship the SIMD code to users.


100% agreed. This is the main reason ISPC is my go-to tool for explicit vectorization.

+1, dynamic dispatch is important. Our Highway library has extensive support for this.

Detailed intro by kfjahnke here: https://github.com/kfjahnke/zimt/blob/multi_isa/examples/mul...


Thanks, that's an important caveat!

> Meanwhile xsimd (https://github.com/xtensor-stack/xsimd) has the feature level as a template parameter on its vector objects

That's pretty cool because you can write function templates and instantiate different versions that you can select at runtime.


Yeah, that's the fun of it: you create your kernel/function so that the SIMD level is a template parameter, and then you can use simple branching like:

if (supports<avx512>) { myAlgo<avx512>(); } else { myAlgo<avx>(); }

I've also used it for benchmarking to see if my code scales well to different SIMD widths, and it's a huge help.


FYI: You don't want to do this. `supports<avx512>` is an expensive check. You really want to put this check in a static.

I guess this was just pseudo-code. Of course you don't want to do a runtime feature check over and over again.
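
A minimal sketch of how that could look, with the feature check cached in a static and the SIMD level as a template parameter; `cpu_supports_avx512()` and the `avx2`/`avx512` tags here are hypothetical stand-ins for illustration, not xsimd's actual API:

```
#include <cstddef>

// Hypothetical arch tags standing in for the library's own types.
struct avx2 {};
struct avx512 {};

// Hypothetical runtime detection helper; a real version would query cpuid
// (or the SIMD library's own architecture queries). Stubbed out here.
bool cpu_supports_avx512() { return false; }

// The kernel is templated on the SIMD level, so each instantiation is
// compiled against a specific feature set.
template <class Arch>
void myAlgo(float* data, std::size_t n) { /* width-specific implementation */ }

void run(float* data, std::size_t n) {
    // Do the (potentially expensive) feature check once and cache it in a static.
    static const bool has_avx512 = cpu_supports_avx512();
    if (has_avx512) {
        myAlgo<avx512>(data, n);
    } else {
        myAlgo<avx2>(data, n);
    }
}
```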

Our answer to this is dynamic dispatch. If you want multiple versions of the same kernel compiled, compile multiple DLLs.

The big problem here is ODR violations. We really didn't want to do the xsimd thing of forcing the user to pass an arch everywhere.

Also, that kind of defeats the purpose of "SIMD portability": any code written against AVX2 can't work on an ARM platform.

eve just works everywhere.

Example: https://godbolt.org/z/bEGd7Tnb3


It is possible to avoid ODR violations :) We put the per-target code into unique namespaces, and export a function pointer to them.

You can do many things with macros and inline namespaces, but I believe they run into problems when modules come into play. Can you compile the same code twice, with different flags, with modules?

We use pragma target instead of compiler flags :)
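
For context, a rough illustration of the generic per-function target mechanism in GCC/Clang; this is just the compilers' general facility (attribute and pragma forms), not necessarily how eve uses it internally:

```
#include <immintrin.h>

// Compile only this function with AVX2 enabled, while the rest of the
// translation unit keeps the default target flags. (Remainder handling omitted.)
__attribute__((target("avx2")))
void add_avx2(float* a, const float* b, int n) {
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(va, vb));
    }
}

// GCC also accepts a pragma form that applies to everything that follows:
// #pragma GCC target("avx512f")
```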

I don't think we understand each other.

We want to take one function and compile it twice:

```
namespace MEGA_MACRO {

void foo(std::span<int> s) { super_awesome_platform_specific_thing(s); }

} // namespace MEGA_MACRO
```

Whatever you do, the code above has to be written once but compiled twice. In one file or in many files, it doesn't matter.

My point is, I don't think you can compile that code twice if you support modules.


I think I do understand, this is exactly what we do. (MEGA_MACRO == HWY_NAMESPACE)

Then we have a table of function pointers to &AVX2::foo, &AVX3::foo etc. As long as the module exports one single thing, which either calls into or exports this table, I do not see how it is incompatible with building your project with modules enabled?

(The way we compile the code twice is to re-include our source file, taking care that only the SIMD parts are actually seen by the compiler, and stuff like the module exports would only be compiled once.)
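
A hedged sketch of the general pattern being described, with per-target namespaces and a function-pointer dispatch table; the macro, namespace, and helper names are made up for illustration and are not Highway's actual API (Highway does the "compile twice" part by re-including the source per target, as mentioned above):

```
#include <span>

// Kernel code written once; here we fake the "compiled twice" part by stamping
// the same body into two namespaces. In a real setup each copy would also be
// built with different target flags or target attributes.
#define KERNEL_BODY                                          \
    void foo(std::span<int> s) {                             \
        for (int& x : s) x += 1; /* pretend this is SIMD */  \
    }

namespace AVX2 { KERNEL_BODY }
namespace AVX3 { KERNEL_BODY }

// Dispatch table entry type: one compiled target is selected once at startup.
using FooFn = void (*)(std::span<int>);

// Hypothetical detection helper; a real version would query cpuid.
bool cpu_supports_avx3() { return false; }

FooFn choose_foo() {
    return cpu_supports_avx3() ? &AVX3::foo : &AVX2::foo;
}

// The only symbol the rest of the program (or a module) needs to see.
void foo_dispatch(std::span<int> s) {
    static const FooFn fn = choose_foo();
    fn(s);
}
```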


> is to re-include our source file

Yeah, that means your source file is never a module. We would really like eve to be modularized; the CI times are unbearable.

I'd love to be proven wrong here, that'd be amazing. But I don't think google highway can be modularized.


What leads you to that conclusion? It is still possible to use #include in module implementations. We can use that to make the module implementation look like your example.

Thus it ought to be possible, though I have not yet tried it.


Well.

You have a file, something like: load.h

You need to include it multiple times, compiled with different flags.

So - it's never going to be in load.cxx or whatever that's called.


As mentioned ("re-include our source file"), we are indeed able to put the SIMD code, as well as the self-#include of itself, in a load.cxx TU.

Here is an example: https://github.com/google/gemma.cpp/blob/9dfe2a76be63bcfe679...


I don't think this works if your files are modules.

Let's stop here, it doesn't seem like we understand each other.


Since you seem knowledgeable about this, what does this do differently from other SIMD libraries like xsimd / Highway? Is it the addition of algorithms similar to the standard library's that are explicitly SIMD-optimized?

The algorithms I tried to make as good as I knew how. Maybe 95% there. Nice tail handling. A lot of things supported. I like our interface over other alternatives, but I'm biased here. Really massive math library.

Game developers have been doing this since forever; it's one of their main reasons to avoid the STL.

EASTL has this as a feature by default, and the Unreal Engine container library has bounds checks enabled in most games. The performance cost of those bounds checks in practice is well worth the reduction in bugs, even in performance-sensitive code.
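
For illustration, a minimal sketch of what an always-on bounds check in a container's operator[] looks like (not EASTL's or Unreal's actual implementation):

```
#include <cstdio>
#include <cstdlib>

template <typename T, int N>
struct CheckedArray {
    T data[N]{};

    // The bounds check stays enabled even in optimized builds; the predictable
    // branch costs little compared to the bugs it catches.
    T& operator[](int i) {
        if (i < 0 || i >= N) {
            std::fprintf(stderr, "index %d out of range [0, %d)\n", i, N);
            std::abort();
        }
        return data[i];
    }
};

int main() {
    CheckedArray<int, 4> a;
    a[3] = 42;   // fine
    // a[4] = 7; // would abort instead of silently corrupting memory
    return a[3] == 42 ? 0 : 1;
}
```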


Which is yet another reason to assert (pun intended) how far from reality the anti-bounds-check folks are, when even the game industry takes them seriously.


A truly incredible profiler for the great price of free. Nothing comes close to this level of features and performance, even among paid software. Tracy could cost thousands of dollars a year and would still be the best profiler.

Tracy requires you to add macros to your codebase to log functions/scopes, so it's not an automatic sampling profiler like Superluminal, Very Sleepy, the VS profiler, or others. Each of those macros has around 50 nanoseconds of overhead, so you can use them liberally, in the millions. On the UI, it has a stats window that records the average, deviation, and min/max of those profiler zones, which can be used to profile functions at the level of single nanoseconds.

It's the main thing I use for all my profiling and optimization work. I combine it with Superluminal (a sampling profiler) to get a high-level overview of the program, then I put Tracy zones in the important places to get the detailed information.
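
For reference, the instrumentation is roughly this simple; ZoneScoped, ZoneScopedN and FrameMark are Tracy's zone/frame macros (the header path may vary between Tracy versions), and the functions here are just made-up examples:

```
#include <tracy/Tracy.hpp>  // Tracy client header (older releases use "Tracy.hpp")

void UpdatePhysics() {
    ZoneScoped;              // names the zone after the enclosing function
    // ... physics work ...
}

void RenderFrame() {
    ZoneScopedN("Render");   // explicitly named zone
    // ... rendering work ...
    FrameMark;               // marks the end of a frame on the timeline
}
```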


Doesn't Tracy have the capability to do sampling as well? I remember using it at some point, even if it was finicky to set up because of Windows.


It does, but I don't use it much because it's too slow and heavy on memory on my Ryzen 5950X (32 threads) on Windows. A couple of seconds of tracing runs into tens of gigabytes of RAM.


Yeah I had issues with the Tracy sampler. It didn’t “just work” the way Superluminal did.

My only issue with Superluminal is I can't get proper call stacks for interpreted languages like Python. It treats all the C++ call stacks as the same. Not sure if Tracy can handle that nicely or not…


Tracy and Superluminal are the way. Both are so good.


Hello! Going through your tutorial and it's been a great ride!

Thanks for the good work.


They are not slower than headers. I've been looking into it because the modular STL is such a big win. On my little toy project I have .cpp files compiling in 0.05 seconds while doing import std.

The downside is that at the moment you can't mix the normal header STL with the module STL in the same project (MSVC), so it's for clean-room small projects only. I expect the second you can reliably use that, almost everyone will switch overnight just from how big a speed boost it gives on the STL vs even precompiled headers.
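
A minimal sketch of what such a clean-room translation unit looks like, assuming a toolchain where the std module is already built (e.g. recent MSVC with /std:c++latest):

```
// main.cpp - no standard library headers included at all
import std;

int main() {
    std::vector<int> v{1, 2, 3};
    std::cout << "sum = " << (v[0] + v[1] + v[2]) << '\n';
}
```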


The one way in which they are slower than headers is that they create longer dependency chains of translation units, whereas with headers you unlock more parallelism at the beginning of the build process, but much of it is duplicated work.


Every post or exploration of modules (including this one) has found that modules are slower to compile.

> I expect the second you can reliably use that, almost everyone will switch overnight just from how big a speed boost it gives on the STL vs even precompiled headers

I look forward to that day, but it feels like we're a while off yet.


I assume templates can only be partly preprocessed (parsed?) but not fully precompiled, since the final code depends on the template types?


Depends on the compiler. Clang is able to pre-instantiate templates and generate debug info as part of its PCH system (for instance, most likely you have some std::vector<int> that can be instantiated somewhere in a transitively included header).

In my projects enabling the relevant flags gave pretty nice speedups.


Yes, but template code all lives in headers, so it gets parsed every single time it's included in some compilation unit. With modules this only happens once, so it's a huge speed upgrade in pretty much all cases.


Whenever I’ve profiled compile times, parsing accounts for relatively little of the time, while the vast majority of the time is spent in the optimizer.

So at least for my projects it’s a modest (maybe 10-20%) speed up, not the order of magnitude speed up I was hoping for.

Thus C++ compile times will remain abysmal.


For some template-heavy codebases I've been in, going to PCH has cut my compile times to less than half. I assume modules will have a similar benefit in those particular repositories, but obviously YMMV.


Jai has a few unique features that are quite interesting. You can think of it as C with extremely powerful metaprogramming, compile-time codegen and reflection, and strong template systems, plus a few extra niceties like native bump allocator support and easy custom iterators. There is nothing quite like it.


Thanks. I guess the "extremely powerful metaprogramming", "compile-time codegen" and "strong template systems" are essentially the same thing. So the unique selling point of Jai from your perspective would be "C with generic metaprogramming, iterators and reflection"? Besides reflection, which is quite limited in C++, this would essentially be a subset of C++, wouldn't it?


C++ goes nowhere near this level. Jai templates execute normal code (any code) at compile time, so there is no need for SFINAE tricks or the like; you just write code that directly deals with types.

You can do things like having structs remove or add members just by putting the member behind an if branch. You can also fully inspect any type and iterate its members through reflection. There is a macro system that lets you manipulate and insert "code" objects; the iterator system works this way: you create a for-loop operator overload where the body of the for loop is sent to the function as a parameter.

It also lets you convert a string (calculated at compile time) into code, either by feeding it into the macro system or just inserting it somewhere.

Some of the experiments I've done: a full mark-sweep GC by generating reflection code that implements the sweep logic; a database system that directly converts structs into table rows and codegens all the query-related code; automatic inspection of game objects via a templated "inspect<T>" type thing that exposes the object to ImGui or debug log output; fully automatic JSON serialization for everything; and an ECS engine that directly codegens the optimal gather code and prepares the object model storage from whatever loops you use in the code.

None of those were real projects, just experiments to play around with it, but they were done in just a few dozen/hundred lines of code each. While in theory you could do those in C++, it would take serious levels of template abominations to write. In Jai you write your compile-time templates as normal Jai. And it all compiles quickly with optimal codegen (it's not bloating your PDB the way a million type definitions would in C++ template metaprogramming).

The big downside of the language is that it's still a closed private beta, so tooling leaves much to be desired. There is syntax highlighting, but things like proper IDE autocomplete or a native debugger aren't there. I use RemedyBG to debug things and VS Code as a text editor for it. It's also at a relatively early stage, even if it's been a thing for years, so some of the features are still in flux between updates.


That sounds interesting; seems to be even more flexible than comptime in Zig, almost as powerful as Lisp and the MOP. How about compiler and runtime performance of these features?

EDIT: can you point to source locations in the referenced projects which demonstrate such features?


https://pastebin.com/VPypiitk This is a very small experiment I did to learn the metaprogramming features. It's an ECS library using the same model as entt (https://github.com/skypjack/entt). In 200 lines or so it does the equivalent of a few thousand lines of template-heavy C++ while compiling instantly and generating good debug code.

Some walkthrough:

Line 8 declares a SparseSet type as a fairly typical template. It's just a struct with arrays of type T inside. The next lines implement getters/setters for this data structure. Note how a std::vector-style dynamic array is just part of the language, with the [..] syntax declaring a dynamically sized array.

Line 46, Base_Registry: things get interesting. This is a struct that holds a bunch of SparseSets of different types, and provides getters/setters for them by type. It uses code generation to do this. The initial #insert at the start of the struct injects codegen that creates structure members from the type list the struct gets in its declaration. Note also how type lists are a native structure in the language, no need for variadics.

Line 99: I decide to do variadic-style tail templates anyway, for fun. I implement a function that takes a typelist and returns the tail, and the struct is created through recursion as one would do in C++. Getters and setters for the View struct are also implemented through recursion.

Line 143 has the for expansion. This is how you overload the for loop functionality to create custom iterators.

The rest of the code is just some basic test code that runs the thing.

The last line does #import basic, essentially the equivalent of importing the STL. Jai doesn't care about the order of declarations, so having the imports at the bottom is fairly common.


Great, thanks!


Dunno about runtime performance, but compilation speed is something that's regularly mentioned as an important aspect of Jai, and I think it's not an exaggeration to call it one of the fastest languages when it comes to compile speed.


Do you happen to have measurement results?


So, is it like Nim? Templates, metaprogramming, custom iterators; Nim's compile time is also powerful, but I heard that 100% of Jai is available at compile time? How good is interop with other languages? Does Jai's syntax have some nice features? Syntax sugar like Nim's UFCS or list comprehensions (the collect macro in Nim)? An effects system? Functional paradigm support?


Yup, if I had to pick the language that's closest to Jai I would choose Nim. Both have a bytecode-VM interpreter to run compile-time code (unlike Zig, which still tree-walks the AST), have similar template codegen systems (though Nim also has AST macros), and have good interop with existing C/C++ code.

The one issue I had with Nim while trying it out is that there is an incredibly rich set of features, but many of them seem half-baked and it's constantly being changed, so you're always a bit unsure which features are stable and safe to use and which are "experimental". But at least it's been public for quite some time and you can actually use it right away…


I once had a possible series of PRs that would increase the performance of the Godot renderer, fixing considerable CPU bottlenecks in scene rendering. Reduz didn't like the changes and it went into the "will be fixed for 4.0" deflection. To this day most of those performance fixes aren't there and 4.0 is slower than the prototype I had. My interaction was very much not positive.

Even then, I believe the Godot leadership is doing a great job. It's almost comparable to the amazing process Blender has. This post looks to me like a ridiculous statement. Reduz and the other Godot leads get constant pestering from hundreds of people daily; Godot even has thousands of issues on the GitHub repo for bug reports.

The Godot project and W4 spend their money wisely. I know some freelance developers who got hired to do some project features. Someone like reduz just does not need to scam anyone, because if he wanted money he could likely work for other companies as a low-level C++ engineer and get more money than what he pays himself from the W4 funding. That W4 funding is being used to make Godot into a "real" game engine, with console support, which is needed for the engine to be taken seriously by commercial projects and not just small indies or restricted projects. Setting up a physical office and a place to keep all those console development kits costs money, and hiring developers experienced in those platforms is not cheap.

The way I see it, the Godot project often develops features in a "marketing" fashion. This clashes quite directly with people using it for serious projects. The Unity engine has a very similar issue. We get development effort spent on fancy dynamic global illumination while classic static lightmaps (needed for lower-end hardware) are completely rejected, and even basic features like level-of-detail or occlusion culling, which are considered must-haves that every engine has, are missing. I think this is what makes the poster cyberreality in the linked forum so angry. It's one of the big faults of the engine, but it's not that bad as a development strategy. Those fancy big features attract a lot of users, who then provide feedback and bug reports, PR their fixes into the engine, and of course bring funding and hype. The main team makes sure the architecture of the engine is good, with some fancy big features, and the bugfixing and more niche/professional features are left to the community, which can PR them. GitHub is filled with technically "better" engines with a total of one user.


Vkguide deals with the basics, starting from scratch. The book above can be a great read after one completes vkguide, as it shows how to implement some advanced features. I helped review the book and think they should have been clearer that it's a book for people who already know graphics to a high level and want a refresher/info on some new state-of-the-art techniques.


It's a custom-made mirrored t-shirt; he is writing on one of those lightboards and then mirroring the video.


I wrote extensively about that as part of my Vulkan guide: https://vkguide.dev/docs/gpudriven . The TLDR of it is that you have the CPU upload scene information to the GPU, and then the GPU performs culling and batching by itself in a series of compute shaders. In most games the scene is 99% static, so you can upload the scene at the beginning and never touch it again. This decreases CPU-GPU traffic by orders of magnitude and, used right, gives you 10x or more performance improvements. Unreal 5's Nanite tech is based on this concept.
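
A rough sketch of the CPU side of that idea in Vulkan terms (buffer setup, barriers, and the culling compute dispatch are omitted; this just shows how little the CPU records once the GPU owns the draw list):

```
#include <vulkan/vulkan.h>

// After a compute shader has written culled draw commands into drawBuffer and
// the number of surviving draws into countBuffer, the CPU-side recording is tiny.
void recordIndirectDraws(VkCommandBuffer cmd,
                         VkBuffer drawBuffer,   // array of VkDrawIndexedIndirectCommand
                         VkBuffer countBuffer,  // single uint32_t written by the culling pass
                         uint32_t maxDraws) {
    // Requires Vulkan 1.2 (or VK_KHR_draw_indirect_count).
    vkCmdDrawIndexedIndirectCount(cmd,
                                  drawBuffer, 0,
                                  countBuffer, 0,
                                  maxDraws,
                                  sizeof(VkDrawIndexedIndirectCommand));
}
```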


Love your site btw!


An important detail about this movie: it's rendered using the EEVEE renderer, which is kind of a "game" real-time renderer. It's not using path tracing on a massive machine to calculate the final image.


For a four-minute clip, it looks great.

