
It goes a bit deeper than that. I work on POWER, which is a sibling of the Z platform. They share some architectural and microarchitectural elements.

To get a sense, have a look at the E980’s Reliability, Availability, and Serviceability Manual [1]. That system is NOT a mainframe, but it borrows ideas from one. Even in that system, every level is designed for failure, from customer-replaceable units like fans and power supplies to certain buses that have extra lanes or wires for parity or spares, depending on the design. Inside the processor logic there’s a large number of checkers that watch the architected state of the machine and notice when something abnormal happens. For several kinds of fault the processor core will unwind its state and retry the op; if the retry succeeds, the core flags itself as having had a “recoverable” error, which usually triggers a service action. If the system can’t recover, it halts either the core or the entire platform, and the service processor takes a dump, which it then offloads to the management console for diagnosis.
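
For anyone who wants the unwind-and-retry flow spelled out, here is a toy sketch in Python. It is only an analogy: the real mechanism lives in hardware and firmware, and every name here (`Core`, `CheckerFault`, `Outcome`, `max_retries`, `make_flaky_op`) is made up for illustration, not taken from any IBM documentation. The retry budget of one is also just an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Dict, List


class Outcome(Enum):
    CLEAN = auto()      # no checker fired
    RECOVERED = auto()  # a retry succeeded; flagged for service
    HALTED = auto()     # retry budget exhausted; stop and request a dump


class CheckerFault(Exception):
    """Raised by an op when a (simulated) internal checker trips."""


@dataclass
class Core:
    # Hypothetical stand-in for architected state; not a real register model.
    state: Dict[str, int] = field(default_factory=dict)
    service_log: List[str] = field(default_factory=list)

    def run(self, op: Callable[[Dict[str, int]], None], max_retries: int = 1) -> Outcome:
        checkpoint = dict(self.state)  # snapshot taken before the op starts
        for attempt in range(max_retries + 1):
            try:
                op(self.state)
            except CheckerFault as fault:
                self.state = dict(checkpoint)  # unwind to the snapshot
                self.service_log.append(f"attempt {attempt}: {fault}")
                continue
            return Outcome.CLEAN if attempt == 0 else Outcome.RECOVERED
        self.service_log.append("unrecoverable: halting core, dump requested")
        return Outcome.HALTED


def make_flaky_op(failures: int) -> Callable[[Dict[str, int]], None]:
    """Build an op that trips the simulated checker `failures` times, then works."""
    remaining = [failures]

    def op(state: Dict[str, int]) -> None:
        state["counter"] = state.get("counter", 0) + 1  # partial update before the fault
        if remaining[0] > 0:
            remaining[0] -= 1
            raise CheckerFault("parity mismatch (simulated)")

    return op


if __name__ == "__main__":
    core = Core()
    print(core.run(make_flaky_op(1)), core.state, core.service_log)
```

The point of the snapshot is that the partial update made before the fault never becomes visible: a recovered op leaves the same state a clean run would, and the only trace is the entry in the service log.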

Z boxes do all of the above, and more. There are pictures floating around the office of a system that went through an earthquake and fell over (someone didn’t buy the optional earthquake bracing for the rack), domino-ing along with several other racks as it went down. It was still running when the techs went to inspect the data center, though it did turn on its “service me” attention light. ;)

[1] https://www.ibm.com/downloads/cas/2RJYYJML



