pornel's comments | Hacker News

Hacker News new | past | comments | ask | show | jobs | submit | pornel's comments

login

pornel 2 days ago | parent | context | [–] | on: Fundamental flaws of SIMD ISAs (2021)

There are alternative universes where these wouldn't be a problem.

For example, if we didn't settle on executing compiled machine code exactly as-is, and had a instruction-updating pass (less involved than a full VM byte code compilation), then we could adjust SIMD width for existing binaries instead of waiting decades for a new baseline or multiversioning faff.

Another interesting alternative is SIMT. Instead of having a handful of special-case instructions combined with heavyweight software-switched threads, we could have had every instruction SIMDified. It requires structuring programs differently, but getting max performance out of current CPUs already requires SIMD + multicore + predictable branching, so we're doing it anyway, just in a roundabout way.

aengelke 2 days ago | | [–]

> if we didn't settle on executing compiled machine code exactly as-is, and had a instruction-updating pass (less involved than a full VM byte code compilation)

Apple tried something like this: they collected the LLVM bitcode of apps so that they could recompile and even port to a different architecture. To my knowledge, this was done exactly once (watchOS armv7->AArch64) and deprecated afterwards. Retargeting at this level is inherently difficult (different ABIs, target-specific instructions, intrinsics, etc.). For the same target with a larger feature set, the problems are smaller, but so are the gains -- better SIMD usage would only come from the auto-vectorizer and a better instruction selector that uses different instructions. The expectable gains, however, are low for typical applications and for math-heavy programs, using optimized libraries or simply recompiling is easier.

WebAssembly is a higher-level, more portable bytecode, but performance levels are quite a bit behind natively compiled code.

LegionMammal978 2 days ago | | | [–]

> Another interesting alternative is SIMT. Instead of having a handful of special-case instructions combined with heavyweight software-switched threads, we could have had every instruction SIMDified. It requires structuring programs differently, but getting max performance out of current CPUs already requires SIMD + multicore + predictable branching, so we're doing it anyway, just in a roundabout way.

Is that not where we're already going with the GPGPU trend? The big catch with GPU programming is that many useful routines are irreducibly very branchy (or at least, to an extent that removing branches slows them down unacceptably), and every divergent branch throws out a huge chunk of the GPU's performance. So you retain a traditional CPU to run all your branchy code, but you run into memory-bandwidth woes between the CPU and GPU.

It's generally the exception instead of the rule when you have a big block of data elements upfront that can all be handled uniformly with no branching. These usually have to do with graphics, physical simulation, etc., which is why the SIMT model was popularized by GPUs.

winwang 2 days ago | | | [–]

Fun fact which I'm 50%(?) sure of: a single branch divergence for integer instructions on current nvidia GPUs won't hurt perf, because there are only 16 int32 lanes anyway.

pornel 1 day ago | | | | [–]

CPUs are not good at branchy code either. Branch mispredictions cause costly pipeline stalls, so you have to make branches either predictable or use conditional moves. Trivially predictable branches are fast — but so are non-diverging warps on GPUs. Conditional moves and masked SIMD work pretty much exactly like on a GPU.

Even if you have a branchy divide-and-conquer problem ideal for diverging threads, you'll get hit by a relatively high overhead of distributing work across threads, false sharing, and stalls from cache misses.

My hot take is that GPUs will get more features to work better on traditionally-CPU-problems (e.g. AMD Shader Call proposal that helps processing unbalanced tree-structured data), and CPUs will be downgraded to being just a coprocessor for bootstrapping the GPU drivers.

janwas 9 hours ago | | | [–]

hm. Doesn't the existence of Vulkan subgroups and CUDA shuffle/ballot poke huge holes in their 'SIMT' model? From where I sit, that looks a lot like SIMD. The only difference seems to be that SIMT professes to hide (or use HW support for) divergence. Apart from that, reductions and shuffles are basically SIMD.

almostgotcaught 1 day ago | | | [–]

> There are alternative universes where these wouldn't be a problem

Do people that say these things have literally any experience of merit?

> For example, if we didn't settle on executing compiled machine code exactly as-is, and had a instruction-updating pass

You do understand that at the end of the day, hardware is hard (fixed) and software is soft (malleable) right? There will be always be friction at some boundary - it doesn't matter where you hide the rigidity of a literal rock, you eventually reach a point where you cannot reconfigure something that you would like to. And also the parts of that rock that are useful are extremely expensive (so no one is adding instruction-updating pass silicon just because it would be convenient). That's just physics - the rock is very small but fully baked.

> we could have had every instruction SIMDified

Tell me you don't program GPUs without telling me. Not only is SIMT a literal lie today (cf warp level primitives), there is absolutely no reason to SIMDify all instructions (and you better be a wise user of your scalar registers and scalar instructions if you want fast GPU code).

I wish people would just realize there's no grand paradigm shift that's coming that will save them from the difficult work of actually learning how the device works in order to be able to use it efficiently.

pornel 1 day ago | | | [–]

The point of updating the instructions isn't to have optimal behavior in all cases, or to reconfigure programs for wildly different hardware, but to be able to easily target contemporary hardware, without having to wait for the oldest hardware to die out first to be able to target a less outdated baseline without conditional dispatch.

Users are much more forgiving about software that runs a bit slower than software that doesn't run at all. ~95% of x86_64 CPUs have AVX2 support, but compiling binaries to unconditionally rely on it makes the remaining users complain. If it was merely slower on potato hardware, it'd be an easier tradeoff to make.

This is the norm on GPUs thanks to shader recompilation (they're far from optimal for all hardware, but at least get to use the instruction set of the HW they're running on, instead of being limited to the lowest common denominator). On CPUs it's happening in limited cases: Zen 3 added AVX-512 by executing two 256-bit operations serially, and plenty of less critical instructions are emulated in microcode, but it's done by the hardware, because our software isn't set up for that.

Compilers already need to make assumptions about pipeline widths and instruction latencies, so the code is tuned for specific CPU vendors/generations anyway, and that doesn't get updated. Less explicitly, optimized code also makes assumptions about cache sizes and compute vs memory trade-offs. Code may need L1 cache of certain size to work best, but it still runs on CPUs with a too-small L1 cache, just slower. Imagine how annoying it would be if your code couldn't take advantage of a larger L1 cache without crashing on older CPUs. That's where CPUs are with SIMD.

almostgotcaught 1 day ago | | | [–]

i have no idea what you're saying - i'm well aware that compilers do lots of things but this sentence in your original comment

> compiled machine code exactly as-is, and had a instruction-updating pass

implies there should be silicon that implements the instruction-updating - what else would be "executing" compiled machine code other than the machine itself...........

pornel 1 day ago | | | [–]

I was talking about a software pass. Currently, the machine code stored in executables (such as ELF or PE) is only slightly patched by the dynamic linker, and then expected to be directly executable by the CPU. The code in the file has to be already compatible with the target CPU, otherwise you hit illegal instructions. This is a simplistic approach, dating back to when running executables was just a matter of loading them into RAM and jumping to their start (old a.out or DOS COM).

What I'm suggesting is adding a translation/fixup step after loading a binary, before the code is executed, to make it more tolerant to hardware changes. It doesn’t have to be full abstract portable bytecode compilation, and not even as involved as PTX to SASS, but more like a peephole optimizer for the same OS on the same general CPU architecture. For example, on a pre-AVX2 x86_64 CPU, the OS could scan for AVX2 instructions and patch them to do equivalent work using SSE or scalar instructions. There are implementation and compatibility issues that make it tricky, but fundamentally it should be possible. Wilder things like x86_64 to aarch64 translation have been done, so let's do it for x86_64-v4 to x86_64-v1 too.

almostgotcaught 1 day ago | | | [–]

that's certainly more reasonable so i'm sorry for being so flippant. but even this idea i wager the juice is not worth the squeeze outside of stuff like Rosetta as you alluded, where the value was extremely high (retaining x86 customers).

pornel 2 days ago | parent | context | | [–] | on: EU Energy labelling will apply to phones and table...

I wonder if we get some malicious workarounds for this, like re-releasing the same phone under different SKUs to pretend they were different models on sale for a short time.

Y-bar 2 days ago | | [–]

Maybe, but the regulation tries to prevent this by separating "models" from "batches" from "individual items" and defaults to "model" when determining compatibility. Worth noting is that each new model requires a separate filing for both EcoDesign and other certifications like CE which could help reduce workarounds like model number inflation.

pornel 2 days ago | parent | context | | [–] | on: Pixel is a unit of length and area

We commonly use hardware like LCDs and printers that render a sharp transition between pixels without the Gibbs' phenomenon. CRT scanlines were close to an actual 1D signal (but not directly controlled by the pixels, which the video cards still tried to make square-ish), but AFAIK we've never had a display that is a 2D signal that we assume in image processing.

In signal processing you have a finite number of samples of an infinitely precise contiguous signal, but in image processing you have a discrete representation mapped to a discrete output. It's contiguous only when you choose to model it that way. Discrete → contiguous → discrete conversion is a useful tool in some cases, but it's not the whole story.

There are images designed for very specific hardware, like sprites for CRT monitors, or font glyphs rendered for LCD subpixels. More generally, nearly all bitmap graphics assumes that pixel alignment is meaningful (and that has been true even in the CRT era before the pixel grid could be aligned with the display's subpixels). Boxes and line widths, especially in GUIs, tend to be designed for integer multiples of pixels. Fonts have/had hinting for aligning to the pixel grid.

Lack of grid alignment, an equivalent of a phase shift that wouldn't matter in pure signal processing, is visually quite noticeable at resolutions where the hardware pixels are little squares to the naked eye.

grandempire 2 days ago | | [–]

I think you are saying there are other kinds of displays which are not typical monitors and those displays show different kinds of images - and I don’t disagree.

pornel 2 days ago | | | [–]

I'm saying "digital images" are captured by and created for hardware that has the "little squares". This defines what their pixels really are. Pixels in these digital images actually represent discrete units, and not infinitesimal samples of waveforms.

Since the pixels never were a waveform, never were sampled from such signal (even light in camera sensors isn't sampled along these axis), and don't get displayed as a 2D waveform, the pixels-as-points model from the article at the top of this thread is just an arbitrary abstract model, but it's not an accurate representation of what pixels are.

pornel 3 days ago | parent | context | | [–] | on: Apple and Meta fined millions for breaching EU law

If Apple is so bad at this that they have to charge 30%, they should have failed in the free market to a competitor that can do the same or better for 3%. However, Apple has prevented that, not by being better or cheaper, but by implementing DRM that locks users out from having a choice (and the market as a whole ended up being a duopoly with cartel-like pricing).

Whether Apple can be cheaper isn't really the point (they should be, digital services are a very high margin business). It's that they're anti-competitive to the point that the market for paid apps and in-app payments became inefficient (in a financial sense).

pornel 5 days ago | parent | context | | [–] | on: Pipelining might be my favorite programming langua...

Rust has such open extensibility through traits. The prime example is Itertools that already adds a bunch of extra pipelining helper methods.

pornel 5 days ago | parent | context | | [–] | on: Gemma 3 QAT Models: Bringing AI to Consumer GPUs

It is due to the risk of a leak.

Laundering of data through training makes it a more complicated case than a simple data theft or copyright infringement.

Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account. It's safer to have a complete ban on providers that may collect data for training.

6510 5 days ago | | [–]

Their entire business model based on taking other peoples stuff. I cant imagine someone would willingly drown with the sinking ship if the entire cargo is filled with lifeboats - just because they promised they would.

vbezhenar 5 days ago | | | [–]

How can you be sure that AWS will not use your data to train their models? They got enormous data, probably most data in the world.

simonw 5 days ago | | | [–]

Being caught doing they would be wildly harmful to their business - billions of dollars harmful, especially given the contracts they sign with their customers. The brand damage would be unimaginably expensive too.

There is no world in which training on customer data without permission would be worth it for AWS.

Your data really isn't that useful anyway.

mdp2021 5 days ago | | | [–]

> Your data really isn't that useful anyway

? One single random document, maybe, but as an aggregate, I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive, and is stored somewhere in the NN, it may come out in an output - in theory...

Actually I never researched the details of the potential phenomenon - that anything personal may be stored (not just George III but Random Randy) -, but it seems possible.

simonw 5 days ago | | | [–]

There's a pretty common misconception that training LLMs is about loading in as much data as possible no matter the source.

That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.

Andrej Karpathy said this last year: https://twitter.com/karpathy/status/1797313173449764933

> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.

mdp2021 5 days ago | | | [–]

Obviously the training data should be preferably high quality - but there you have a (pseudo-, I insisted also elsewhere citing the rights to have read whatever is in any public library) problem with "copyright".

If there exists some advantage on quantity though, then achieving high quality imposes questions about tradeoffs and workflows - sources where authors are "free participants" could have odd data sip in.

And the matter of whether such data may be reflected in outputs remains as a question (probably tackled by some I have not read... Ars longa, vita brevis).

pornel 6 days ago | parent | context | | [–] | on: Things Zig comptime won't do

There is a stark contrast in usability of self-contained/owning types vs types that are temporary views bound by a lifetime of the place they are borrowing from. But this is an inherent problem for all non-GC languages that allow saving pointers to data on the stack (Rust doesn't need lifetimes for by-reference heap types). In languages without lifetimes you just don't get any compiler help in finding places that may be affected by dangling pointers.

This is similar to creating a broadly-used data structure and realizing that some field has to be optional. Option<T> will require you to change everything touching it, and virally spread through all the code that wanted to use that field unconditionally. However, that's not the fault of the Option syntax, it's the fault of semantics of optionality. In languages that don't make this "miserable" at compile time, this problem manifests with a whack-a-mole of NullPointerExceptions at run time.

With experience, I don't get this "oh no, now there's a lifetime popping up everywhere" surprise in Rust any more. Whether something is going to be a temporary view or permanent storage can be known ahead of time, and if it can be both, it can be designed with Cow-like types.

I also got a sense for when using a temporary loan is a premature optimization. All data has to be stored somewhere (you can't have a reference to data that hasn't been stored). Designs that try to be ultra-efficient by allowing only temporary references often force data to be stored in a temporary location first, and then borrowed, which doesn't avoid any allocations, only adds dependencies on external storage. Instead, the design can support moving or collecting data into owned (non-temporary) storage directly. It can then keep it for an arbirary lifetime without lifetime annotations, and hand out temporary references to it whenever needed. The run-time cost can be the same, but the semantics are much easier to work with.

pornel 10 days ago | parent | context | | [–] | on: TLS certificate lifetimes will officially reduce t...

Browsers check the identity of the certificates every time. The host name is the identity.

There are lots of issues with trust and social and business identities in general, but for the purpose of encryption, the problem can be simplified to checking of the host name (it's effectively an out of band async check that the destination you're talking to is the same destination that independent checks saw, so you know your connection hasn't been intercepted).

You can't have effective TLS encryption without verifying some identity, because you're encrypting data with a key that you negotiate with the recipient on the other end of the connection. If someone inserts themselves into the connection during key exchange, they will get the decryption key (key exchange is cleverly done that a passive eavesdropper can't get the key, but it can't protect against an active eavesdropper — other than by verifying the active participant is "trusted" in a cryptographic sense, not in a social sense).

pornel 10 days ago | parent | context | | [–] | on: TLS certificate lifetimes will officially reduce t...

I copy the same certbot account settings and private key to all servers and they obtain the certs themselves.

It is a bit funny that LetsEncrypt has non-expiring private keys for their accounts.

pornel 10 days ago | parent | context | | [–] | on: TLS certificate lifetimes will officially reduce t...

DANE is a TLS with too-big-to-fail CAs that are tied to the top-level domains they own, and can't be replaced.

Separation between CAs and domains allows browsers to get rid of incompetent and malicious CAs with minimal user impact.

ryao 9 days ago | | [–]

DANE lets the domain owner manage the certificates issued for the domain.

pornel 8 days ago | | | [–]

This delegation doesn't play the same role as CAs in WebPKI.

Without DNSSEC's guarantees, the DANE TLSA records would be as insecure as self-signed certificates in WebPKI are.

It's not enough to have some certificate from some CA involved. It has to be a part of an unbroken chain of trust anchored to something that the client can verify. So you're dependent on the DNSSEC infrastructure and its authorities for security, and you can't ignore or replace that part in the DANE model.

Join us for AI Startup School this June 16-17 in San Francisco!
Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact