Interesting. I guess unlike software, these sorts of bugs can't be fixed without...

matthew-wegner · on March 21, 2017

In this case the fix is adding a resistor:

"If your system does not use SERIRQ and BIOS puts SERIRQ in Quiet-Mode, then the weak external pull up resistor is not required. All other cases must implement an external pull-up resistor, 8.2k to 10k, tied to 3.3V"

https://www-ssl.intel.com/content/dam/www/public/us/en/docum...

Depending on the motherboard, those pins may be exposed already. Here's what the fix looks like on a Synology unit: https://www.reddit.com/r/synology/comments/609u1l/c2538_cloc...

yuhong · on March 21, 2017

So I assume that the problem is that SERIRQ is left floating causing too much of a load on the LPC clock it depends on, right?

jjuhl · on March 21, 2017

Microcode is software running on your CPU. And many problems can be fixed with a microcode update, but not all - see the workarounds listed in the PDF I linked - note especially how many say "None identified".

yeukhon · on March 21, 2017

Okay, thanks for the clarification. I was thinking about the floating-point bug for which they had to recall the defective processors: https://en.wikipedia.org/wiki/Pentium_FDIV_bug

justsid · on March 22, 2017

Hard to say if something like this would be patchable via a microcode update. But regardless, back then your CPU didn’t run software to run your software, so a new hardware revision was in order.

rrdharan · on March 21, 2017

They do tremendous amounts of validation. I believe random generation of input data is part of that.

Here's an old heavily cited paper from Intel on the topic; I'm sure their state of the art has advanced considerably in the intervening 17 years since its publication:

http://dl.acm.org/citation.cfm?id=623013

wumpus · on March 21, 2017

"crashme" is one venerable program that does this kind of fuzzing -- I managed to find a bug in a cpu with it once, which does not speak well of their QA department.

xenadu02 · on March 22, 2017

A former coworker was a QA manager at Intel. He said it was an explicit decision to cut back on validation and QA, which is why he wasn't at Intel anymore. The general feeling was that they had "overreacted" to the Pentium FDIV bug and needed to move faster.

YMMV, I have no inside knowledge.

i336_ · on March 22, 2017

I'm very curious what CPU it was, and what the bug was.

wumpus · on March 22, 2017

Broadcom Sibyte 1250. I had pre-release silicon, and there was a known bug that prefetch would occasionally hang the chip. I wanted to have a little fun goofing around, so I modified crashme to replace prefetches in the randomly generated code with a noop. A few minutes later, I hung the chip.

If I identified the right erratum later, it was: if there is a branch in the delay slot of a branch in a delay slot which is mispredicted, the cpu hangs. It was fixed by making this sequence throw an illegal instruction exception. (It was undefined behavior already, I think.)
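For readers who haven't seen crashme, the modification described above can be sketched roughly as follows. This is only an illustration of the idea, not crashme's actual code: it generates random 32-bit words and swaps out prefetches, but does not execute anything. The function names are made up for this sketch, and the PREF opcode value (0x33 in bits 31..26) is an assumption about the MIPS encoding that should be checked against the architecture manual.

```python
import random

NOP = 0x00000000      # MIPS "sll $0, $0, 0", the conventional no-op encoding
PREF_OPCODE = 0x33    # assumption: major opcode (bits 31..26) of MIPS PREF


def random_words(count, seed):
    """Return `count` random 32-bit words, like a fuzzer filling a code buffer."""
    rng = random.Random(seed)  # seeded so a failing case can be reproduced
    return [rng.getrandbits(32) for _ in range(count)]


def strip_prefetches(words):
    """Replace every word whose major opcode is PREF with a no-op."""
    return [NOP if (w >> 26) == PREF_OPCODE else w for w in words]


def make_test_case(count, seed):
    """Hypothetical helper: random code with the known-bad instruction removed."""
    return strip_prefetches(random_words(count, seed))
```

A real harness would then copy the words into executable memory, jump to them, and catch the resulting signals (or notice the hang); that part is omitted here.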

i336_ · on March 22, 2017

Interesting. And... the nesting you describe is messing with my head! :)

How did the erratum get fixed? Hardware re-re-release?

Definitely filed crashme away, sounds like a useful tool.

The Sibyte 1250 sounds cute. There's a "650000000 Hz" in https://lists.freebsd.org/pipermail/freebsd-mips/2009-Septem... although I'm not sure if MIPS from circa 2002 (?) was that fast (completely ignorant; no field knowledge - definitely want to learn more though).

I also noted the existence of the BCM91250E via http://www.prnewswire.com/news-releases/broadcom-announces-n... sibytetm-bcm1250-mipsr-processor-to-accelerate-customers-software-and-hardware-development-75841437.html, which was kind of cool. I like how the chip is a supported target for QNX :)

Now I'm wondering what you were using it for. Some kind of network device? (I think I saw networking as one of its target markets.)

--

As an aside, I'm also very curious what HN notifier you use, if you use one. (I use http://hnreplies.com/ myself, but it's sometimes slow. I saw your message after 4 minutes in this case (fast for HN Replies); typing/research + IRL stuff = delay :) )

wumpus · on March 22, 2017

Our company was PathScale, and we were hoping that the Sibyte 1250 would make a great supercomputing chip. Opteron wasn't out yet, Intel was pricing 64-bit Itanium very high, and this dual-core Sibyte thing was projected to have a reasonable clock and it could do 2 64-bit fp mul-adds/cycle. We had Fred Chow and the GPLed SGI Compiler on our side. And the next revision of the chip had these great 10 gig networking features that I thought I could make work well with MPI, the scientific computing message-passing standard.

You can guess how it worked out: Sibyte was late, slow, and buggy. Even simple code sequences like hand-paired multiply-adds would start running at 1/2 speed after an interrupt. Our experienced compiler team was unable to get good perf on several SPECcpu benchmarks despite the code looking good. (Fred didn't have much hair left to pull out!)

Soon after we raised our A round we pivoted to using Opteron for compute, building an ASIC for a specialized MPI network, and Fred's team did an Opteron backend for the compiler.

The descendant of the network is now called Intel Omni-Path, and is on-package for Xeon Phi cpus.

i336_ · on March 22, 2017

I see. Always poignant to hear these kinds of stories.

I'm curious as to whether interrupt handling was done underneath the multiprocessing layer or whether interrupts were just hammering the pipeline design. (I assume by "an interrupt" you're referring to slowdown within the fraction of a second after a given interrupt occurred, within the context of floods of interrupts interspersed between instructions?)

Very cool to hear that Intel snapped up what you eventually managed to ship, FWIW - and that you were able to pivot in the way you did. Also interesting to hear about Opteron use in the field; my experience is only with tracking the consumer sector.

Also, the "GPLed SGI compiler" part you mentioned caught my eye and led me to the EKOpath compiler, and particularly its OSS release in 2011: http://web.archive.org/web/20110616135434/http://pathscale.c...

After some floating back and forth I found https://www.phoronix.com/forums/forum/software/general-linux... which a) mirrored my questioning exactly and b) contained a very nice and straightforward answer, so I guess that put closure on that. Really sad that it never really took off though; faster compilers (https://lwn.net/Articles/447541/ in/from https://lwn.net/Articles/447529/) are always something I'm looking for :)

monocasa · on March 21, 2017

CPU vendors definitely fuzz, and formally verify parts of their chips.

Unfortunately fuzzing ultimately has a random component, which doesn't really prove that you got all of these bugs.
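To put a rough number on that: if a bug is triggered by some fraction p of all possible inputs, the chance that n independent, uniformly random tests all miss it is (1 - p)^n. A minimal sketch, assuming uniform random inputs (real fuzzers are usually biased toward interesting cases) and using a made-up function name:

```python
import math


def miss_probability(bug_fraction, trials):
    """Chance that `trials` independent uniform random inputs all miss a bug
    that is triggered by `bug_fraction` of the input space."""
    if not 0.0 <= bug_fraction < 1.0:
        raise ValueError("bug_fraction must be in [0, 1)")
    # log1p keeps precision when bug_fraction is tiny (e.g. 2**-64),
    # where (1.0 - bug_fraction) would simply round to 1.0.
    return math.exp(trials * math.log1p(-bug_fraction))
```

For example, `miss_probability(2**-64, 10**12)` is still extremely close to 1: a trillion random tests almost certainly never touch a bug that only one input in 2^64 triggers.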

valarauca1 · on March 21, 2017

(To add on to the parent poster)

     Unfortunately fuzzing ultimately has a random component

An instruction which accepts 2 registers and returns 1 register has a 192-bit problem space to validate (three 64-bit registers). This complexity is present in an instruction as simple as `add`.

An AVX2 instruction which accepts 3 registers and outputs 1 has a 1024-bit problem space to validate.

This bug occurred in FMA3, with a ~512-bit problem space.

Repeat for _every_ instruction (HUNDREDS). You can see how a few bugs slip through the cracks. The problem space is as large as some cryptographic functions!! I'm honestly surprised we don't see more of them.
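A back-of-the-envelope sketch of why exhaustive testing is off the table. Note that this counts input operand bits only (the output is determined by the inputs), so it is a narrower count than the figures above; the function names and the test rate are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def input_space(operands, register_bits):
    """Number of distinct input combinations for `operands` input registers."""
    return 2 ** (operands * register_bits)


def years_to_exhaust(operands, register_bits, tests_per_second):
    """Whole years needed to try every input once at a fixed integer test rate."""
    # Integer arithmetic throughout, so very wide operands don't overflow a float.
    return input_space(operands, register_bits) // (tests_per_second * SECONDS_PER_YEAR)
```

Even for a plain two-input 64-bit `add`, `years_to_exhaust(2, 64, 10**9)` comes out above 10^21 years at a (hypothetical) billion tests per second, and that is before considering any interaction with surrounding instructions.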

cwzwarich · on March 21, 2017

The specific result produced by the data path is probably not very relevant in the case of a lockup. The control path involved with instruction decoding, register renaming, out-of-order execution, SMT, etc. is generally the cause of issues like this. With interactions between different blocks of the CPU and the size of some of the data structures involved, the full verification space is much, much larger.

makomk · on March 22, 2017

I don't know about that. As I understand it, interesting things can happen if a large number of data lines toggle at the same time, and this obviously depends on both the data and the control path. Huge space of possible states.

valarauca1 · on March 21, 2017

This isn't a lock up.

If you read the source (TechPowerUp is terrible; why it isn't banned is beyond me), it is actually an _illegal instruction_ error, which just crashes the current application.

Also, if SMT is not disabled the error doesn't occur. Also, if the chip isn't overclocked the error doesn't occur.

So it is very clearly power related.

wongarsu · on March 21, 2017

Quoting from http://forum.hwbot.org/showthread.php?t=167605:

> this always hard freezes the computer:

>- At all clock speeds.

>- When running single-threaded, it happens to any core that I pin it to.

It should be at worst an illegal instruction, but instead the whole core freezes, even on underclocked computers.

yeukhon · on March 21, 2017

Oh, I understand fuzzing won't catch all bugs. But I see; glad they do run random inputs over it.