It's notable that this report is so far completely consistent with the way I would design a cyber attack to take out a country's power grid.
> The causes provided by RWE for the initiation of the trip of Little Barford steam turbine (ST1C) was due to discrepancies on three independent safety critical speed measurement signals to the generator control system.
A cyber attack would most likely not take capacity offline directly, but instead make the system less stable by making safety systems not work properly. That has the benefit that the attack will be triggered at the exact same time as another incident, increasing the impact.
Injecting false values for a sensor reading is a very good way to hide your tracks as an attacker - just inject a bunch of "0" values, and then uninstall whatever malware did it, and probably nobody will ever find out why that sensor randomly acted up.
> Injecting false values for a sensor reading is a very good way to hide your tracks as an attacker - just inject a bunch of "0" values, and then uninstall whatever malware did it, and probably nobody will ever find out why that sensor randomly acted up.
That may be true for legacy machinery control systems (MCSs), but modern MCSs are increasingly wise to such threats: they monitor and archive all critical control system signals at subsecond sample rates, so sensor anomalies are detected in near real time and logged for future analysis. You would not only need to hack the individual sensor signal, you'd also need to know how to disrupt the MCS ladder logic that bypasses a sensor putting out bad signals in order to keep the machinery system stable. Not an easy feat.
Personally, I mainly get involved in this data after the fact for energy optimization. However, the folks that design the systems complete Failure Mode and Effects Analysis (FMEA) in an attempt to thwart failures and attacks at any conceivable level. Still further, there is greater protection in terms of first-line-of-defense firewalls and PKI certs between subsystems and components, making getting access to these bits in the first place increasingly difficult, whereas these systems used to be virtually unprotected for any ding dong to tap into.
Sure, there are still holes, and probably always will be, but the efforts of MCS designers to protect their systems grow, and the automation intelligence being developed to govern these control systems will provide an additional level of protection (and a known point of vulnerability).
An even stealthier method is to make the system work perfectly while everything is running fine. Like correctly reporting grid frequency in the range 49-51 Hz but reporting false values outside that range. That way there will be no detectable errors until everything starts behaving erratically.
The best way for a society to prepare is to have a few small incidents that put the safety systems to the test on a regular basis.
This. Very much this. In the presentations I've seen on the matter, the most vulnerable systems are those measuring and carrying telemetry back from various meters around the power system. One doesn't have to directly compromise the system to have an effect. Simply providing bad data to automated protection systems could be enough to cause instability.
”Like correctly reporting grid frequency in the range 49-51 Hz but reporting false values outside that range.”
You can’t fake the system frequency like that. Anyone connected to the grid can observe/measure the system frequency trivially - you could do so right now at the nearest 10A socket. The attack you describe would require compromising large numbers of diverse, distributed generation and frequency response assets.
It would be like trying to fool everyone into thinking it’s a hot day by hacking every single thermometer!
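(For illustration: a minimal sketch of how easily mains frequency can be estimated once you have a safely digitised sample of the waveform, e.g. from a plug-in sensor or sound-card ADC. The 49.6 Hz sine below is a synthetic stand-in, not a real measurement.)

```python
import numpy as np

def estimate_frequency(samples, sample_rate):
    """Estimate mains frequency from a sampled waveform via rising zero crossings."""
    samples = samples - np.mean(samples)                       # remove any DC offset
    signs = np.sign(samples)
    rising = np.where((signs[:-1] <= 0) & (signs[1:] > 0))[0]  # upward crossings
    if len(rising) < 2:
        raise ValueError("not enough cycles in the window")
    cycles = len(rising) - 1
    duration_s = (rising[-1] - rising[0]) / sample_rate
    return cycles / duration_s

# Synthetic stand-in for a real measurement: 1 s of a 49.6 Hz sine sampled at 10 kHz.
fs = 10_000
t = np.arange(0, 1.0, 1 / fs)
waveform = np.sin(2 * np.pi * 49.6 * t)
print(f"estimated frequency: {estimate_frequency(waveform, fs):.2f} Hz")
```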
Yep. Once one meter starts reporting the wrong frequency vs the thousands of other ones, it'll simply be taken offline and replaced, because it could just be a broken one. Depending on severity, the gig could be up if the meter is sent back to the manufacturer for "repair".
A decentralized grid will have less ability to deal with spikes in load, and will be more unstable as a result. The question in cases like this is often: do you want a lot of smaller outages all the time, or bigger but less common outages?
When I was discussing the two interruptions (one on the 9th and a smaller one on the 24th) with friends, I suggested that this could have been an attempt to map out the UK's infrastructure/SCADAs, and they thought that my sandwich had too much conspiracy sauce in it.
Remembering how Stuxnet interfered with Iran's nuclear centrifuges, this could very well be something similar. Just scratch it a bit, where it matters most, and let it sink on its own. No point blasting a huge hole in it.
I personally knew the chap auditing SCADA in power distribution back in 2001, so I'm fairly confident about the level of competence and which governmental bodies handle such issues.
But nothing is perfect, and whilst an edge case, space weather can and does do wondrous things to pylons. It's in planning for such outages, and for the layers of vulnerability that such an event can expose, that you want top people on the case, planning around such exposures. After all, you don't want your fallback site all secure only to find nobody put the electric security doors on their own redundant power. You might end up with such doors opening automatically and allowing anybody in, or not opening and preventing access to the big red switch to fail over onto some backup system. Things like that you just can't overlook.
But with power-grid distribution systems, there will always be a few weak points - cases where, if these three transformers suddenly went offline, the cascade effect would be horrific. It's those cases you need to plan for. After all, three lorries crashing into those transformers is not very likely to happen at the same time - but people win the lottery every week against much longer odds.
Except that you would have no control over when the outage happens as you would have to wait around for a random lightning strike to initiate the chain reaction. This outage lasted a very short time. An attacker would likely want to take advantage of the power loss in some way, meaning control over the timing would be pretty important.
This blackout destroyed our microwave. When the power came back on, so did the microwave, but I had already removed the rice. I guess being run for minutes with nothing in it burned it out.
To understand how/why these cascading failures can happen, Grady Hillhouse (Practical Engineering on YouTube) recently made a video about substations and what kind of faults can affect them, interesting if you're not super knowledgeable about our electric grids already: https://www.youtube.com/watch?v=7Q-aVBv7PWM
"Frequency drops below 48.8Hz which triggers the Low Frequency Demand Disconnection scheme (LFDD). Approximately 1GW (5% of total) of Great Britain electricity demand is turned off."
It's interesting as it highlights the importance of managed failure. Things fail, but the ability to see those failures and control how they fail is important. Had such an action not been taken, the result would have been a far greater outage/impact, and moreover an outage where you had no idea which parts it would affect.
The issue with this specific outage is that it was badly managed.
The main gripe people and authorities have with what happened is that key infrastructure immediately lost power. When (only) 5% of demand is turned off, this should not impact railways, airports, and hospitals.
Beyond the initial loss of power other problems amplified the consequences: For example, trains could not just restart but required specialist staff to attend each of them to get them through a specific procedure, which turned a short power loss into hours of delays.
This highlights the importance of looking at systems globally. The grid is just one part of the system.
A lot of the headline grabbing stuff wasn't the grid's fault though:
* the railways didn't have traction power disconnected -- but one type of (new) train saw the within-specification frequency drop, shut down, and in many cases needed a fitter to go out and restart it. Everything else behind a dead train was stuck. I imagine the train company are discussing this with their supplier...
* Ipswich hospital had a power outage, but that was because their own systems disconnected from the grid when they saw the frequency drop, and then they had a problem with one of their generators.
* One regional airport seems to have been accidentally not on the list of key sites not to disconnect -- sounds like an admin snafu, and has now been fixed.
None of these seem like "bad management" -- they are a collection of relatively minor, entirely unrelated, problems of the sort you're likely to run into if you have to activate an emergency procedure you haven't needed to use in a decade.
I read your list and see a bunch of 'technical debt.' Trains put into operation without the ability to survive a power transient without intervention. Life-critical backup power system failures. The set of critical sites is not being maintained properly. In my mind these failures are not entirely unrelated; reliable power is taken for granted and the testing and maintenance necessary to cope with power outages is neglected.
Do you mean that they don't meet the requirements of the tender spec ("not fit for purpose") or that the spec in the tender process was deficient ("shockingly poor procurement")?
> Trains put into operation without the ability to survive a power transient without intervention.
The trains were built to an environmental tolerance spec, and shutting down when that spec is exceeded seems like a reasonable thing to do. Having an expert test that all the safety systems are still working properly after they've been exposed to conditions they weren't designed for also seems prudent.
Do you think people should pay higher ticket prices / more tax to fund trains that can tolerate a wider range of input frequencies, given how rarely these sorts of events happen?
> The trains were built to an environmental tolerance spec and shutting down when that spec is exceeded seems like a reasonable thing to do.
Yes, they were built to a spec. According to the official technical report[1] these trains DID NOT experience anything outside that spec. The shut down was not supposed to occur. The trains are simply defective and both manufacturer and operator testing failed to reveal these defects. There is no other reasonable conclusion. Defective, inadequately tested trains.
> Having an expert test that all the safety systems are still working properly after they've been exposed to conditions they weren't designed for also seems prudent.
This bit of guesswork is also contradicted in the official report as the manufacturer has committed to providing a system update to permit operation to resume WITHOUT any of your supposed 'prudence.' Resuming operation on power recovery without intervention by technicians is the specified, expected behavior. This is a separate defect that also survived whatever obviously inadequate testing regime was supposedly in effect.
> Do you think people should pay higher ticket prices / more tax to fund trains that can tolerate a wider range of input frequencies, given how rarely these sorts of events happen?
Since the official conclusion and subsequent response by the manufacturer both show the trains were specified to tolerate these conditions the extraordinary costs you imagine are a fiction; the cost has already been paid. No higher ticket prices required. And since these conditions were specified we also know they were indeed anticipated despite their infrequency.
Both the manufacturer and the operator failed here. They got caught with their pants down. You may not like hearing that for whatever reason but that's the bottom line and inventing excuses won't change it.
Which specification, though? At the time they were being designed and procured, I think the specified tolerance for frequency change rate on the grid was 0.125Hz/s, but this was relaxed in 2014 to 1Hz/s (see https://www.nationalgrideso.com/document/10771/download). It's not clear how much the effect of relaxing this on existing equipment was considered.
So we have a power anomaly where the frequency moved >0.125Hz/s and dropped well below the 49.5Hz license threshold. I don't know what the procurement spec was, but I'd be very surprised if this event was "in-spec" for the trains, so in the absence of this it's probably reasonable to say that GP is incorrect.
Transformers get quite unhappily melty if you drop the frequency too much, so if your cooling system isn't designed to cope with this, it's probably better to shut down than have an exciting fire.
The grid's license only requires them to be within ±0.5Hz of the nominal 50Hz for at least 95% of the time. Anyone designing a system which needs to operate reliably ideally ought to design it to operate over the full 47.5-52Hz range, or at least not fail so completely it cannot be restarted. Low frequency demand disconnect only starts at 48.8 Hz, so any system that can't handle that is effectively setting itself up to be the first thing to fail when there's a power shortfall. This is utterly inexcusable for a train given the amount of disruption caused by multiple lines being clogged by failed trains in multiple places.
I don't disagree that a train that can tolerate a wider range is desirable, but lots of things are desirable. It's also desirable to give everyone their own carriage, or have a free bar on every train. But at what cost?
* Wider tolerance means heavier electrical and cooling systems, increasing both capital outlay and operating cost for both track wear and maintenance.
* The trains procured were essentially "off the shelf" from Siemens and would have been designed and manufactured for other operators before the DfT put the tender together. A wider operating tolerance would have therefore either ruled out the Siemens trains, or at least significantly upped the cost as they would have had to redesign them.
An annual season ticket from St Albans to London is about £3,600. How much more should that passenger (as well as the taxpayer) be paying so that they can travel on trains that work when this type of event happens? How often does this type of event happen?
> This is utterly inexcusable for a train given the amount of disruption caused by multiple lines being clogged by failed trains in multiple places.
People being stuck on trains for a few hours is pretty far down the severity list of "things that could go wrong" when the power goes out. That said, what I do think the outcome will be here is they'll figure out if/how the driver can self-reset this type of issue rather than requiring a fitter to travel out.
In a way these events become a blessing in disguise because of those things. We've had to deal with similar crisis situations in our supply chain. It really helps to test your procedures and operations fixing stuff along the way. The team also gains experience and becomes much more prepared for worse events should they come.
I would go as far as suggesting that these things should be routinely tested. In the UK most public buildings test fire alarms once a week; a programmed plan of localised, managed blackouts, to be carried out once every few years, would probably help in the long run - particularly as the grid is rapidly changing as described at the end of the post.
We didn't hear about all the other hospitals that easily switched over to generators, just the one that failed. There's room for investigation there, but it's on the site, not the system.
Definitely. We use seasonal demand peaks to learn how to cope with extremes. But if we didn't have those plus strikes I'd definitely be planning a few test runs.
I did mention the railways and indeed the importance of making sure the whole system can cope, which means what you list in addition to the grid in isolation.
> None of these seem like "bad management"
They do seem like bad management because to me it looks like these issues were not taken seriously.
> problems of the sort you're likely to run into if you have to activate an emergency procedure you haven't needed to use in a decade.
That's bad management right there if an organisation's emergency procedure has been collecting dust for a decade.
There are lots of reasons why hospitals lose power, which is why they all have backup generators.
Ipswich's backup generator had a faulty battery. This is nothing to do with the grid. This is entirely the fault of the hospital trust. Most of the hospital's backups worked.
> A spokeswoman said: "Our initial investigation has shown that at 4.52pm on Friday, August 9, we lost our electricity supply until 5pm. We have 11 generators at Ipswich Hospital and all of them worked immediately.
> "What we have found is that the battery which automatically operates one of the switch over systems from mains power to generator power was faulty despite being within its recommended life.
Here's the trust document talking about the new installation when it was put in.
> Electrical High Voltage Infrastructure was a critically important project to the Trust in providing a resilient high voltage electrical network to the hospital. Through a range of electrical projects, including the replacement and upgrading of outpatients generator ‘C’, the maternity generator project and the upgrading of the high voltage switchgear for the forthcoming biofuels project, this represented a total overall investment in excess of £2m. Virtually all the Trust’s electrical switchgear has now been improved and modernised. Along with the new backup generators the hospital is protected for years to come and a major significant risk to hospital has been dealt with.
It's clear that there were issues at hospitals (as expected, cynical me thinks, knowing British healthcare...) but the fact that key infrastructure lost power does point to issues at the grid level as well (because sites that lost power did so based on grid procedures).
When they cut the power to an area, they can't selectively keep isolated locations active unless those locations operate on their own spur. That is why hospitals, though deemed critical and important not to cut off, are handled by on-site failover power supplies in the form of backup generators.
In this case, it was like asking somebody if they can catch a ball; they say they are ready, and yet fail to catch the ball when you throw it to them. Is that the fault of the person who throws the ball, or of the person who was not prepared to catch it when they said they could?
In this instance it falls into the unfortunate category and, from what I can tell, you can't lay blame on anybody or any party for the failure; it is just one of those things. Sure, lessons will be learned, but on the grid-procedure side of this hospital case, there is no change I can see that would be learned. However, the hospital will be learning from this, may have more frequent testing and checks of their failover, and may make changes. So if blame is to be levelled at anybody, you could say it would be at the party that had to make changes. But in this instance, would that be fair? Hence I would call this unfortunate, and whilst lessons will be learned, you cannot lay blame out of those lessons all the time. However, if it happens again and the same mistakes transpire, then maybe you have some grounds to lay some blame.
The actual report says that Ipswich hospital wasn't part of the load shedding, so it's not yet clear why the generators had to cut in in the first place.
When the frequency is nosediving and below some threshold, it is a good sign that you want to switch to your backup power immediately. Even if there isn’t ultimately a cut, it is bad-quality power and a sign of system instability.
I'm sure that they will learn lessons that railways, hospitals, airports should perhaps not be the first to lose power, that trains should be able to restart quickly, and that generators should be comprehensively tested to respond to real situations, and it will be no-one's fault.
I fully agree with you that this is the likely outcome.
Reading this exchange as a third party, both sides make fair points, but one side is being excessively downvoted, which is either unfortunate or possibly a regrettable sign of emotional over intellectual engagement in the discussion somewhere.
You're misunderstanding how it went. The 5% mentioned as the Low Frequency Demand Disconnection scheme, which was disconnected in managing the incident, is a scheme where very high-volume users (those who have a meter reading taken every hour) voluntarily opt for an interruptible supply in exchange for better prices. When the grid needs to, they're given a power cut that they signed up for. Commonly used by big manufacturers, smelters etc. It's in their contract, expected, and gets them lower electricity bills.
It helps prevent the grid tipping into catastrophic shutdown when there's no backup generation left, or something major has tripped, as part of demand management. If it hadn't happened, the consequences would have been far more headline-worthy.
Railways, hospitals and the like that encountered issues were part of "the rest". That is not part of the 5% turned off, but the small part of the 95% "rest" that can't cope with some (larger than usual) drop of frequency. If they're buying trains that can't cope with a frequency slump or backup systems that aren't tested that's outwith grid control. That's a sign their systems need to be improved, and tested properly too.
Here in Germany, large industrial customers often get a cheaper electricity price in return for being the first to be cut in a load shedding scenario. If you are something like a steel mill then cheaper electricity quickly pays for slightly more frequent blackouts; and everyone else gets a more stable electricity network.
Most situations/responses to an issue can, in hindsight, always be managed better; having a cookie-cutter solution that fits most cases is better than none. Though I'm sure such plans will be refined - lessons learned.
Thing is, a solution today might not be best for tomorrow as the grid's usage is not static, new this and that are added so the impact changes.
It is interesting that you mention the inability of trains to restart. That was educational: it was down to some new trains, and showed that the drivers did not have the skills to, in effect, reboot the trains from a cold start. So it was less a grid fault and more a case of it never having happened before, with the result that people were less prepared. Interestingly, the previous generation of trains did not have this issue, which may explain why it got overlooked as something that needed attention until an event like this.
As for hospitals, they all have backup generators, though if a generator fails to kick in, as was the case in one I read about, that can and will cause issues. Again, it may be a case of never having needed it before, combined with the regular checks/tests not catching the problem.
Remember - nothing is perfect, and many lessons (as with any outage in all walks of life) will be learned from this, as with all previous outages. But equally you have to remember that whilst you can have the same combination of events happen in, say, 5 years' time, and have a response that works for that combination learned from the last time, things change: a new housing estate attached to the grid, small nuances like that could change everything or nothing. You just don't know. Heck, there could be newer issues - suddenly you find that thousands of electric cars being charged at the time you cut them off are now feeding small amounts of electricity back into the grid, as they never expected to lose power without a physical disconnect.
All the issues I mentioned are obvious, and so are the potential problems you describe. This is not a question of "it's easy to say in hindsight" as these are the key issues that must be covered by standard planning.
This blackout made a lot of noise because of this (apart from the actual disruption, obviously).
Do you have background in the field of giant power supply fault tolerance backup mishap management strategies, or something? Or are you just stating and assuming all of this is obvious, covered by standard planning, instead of easy to say in hindsight? Cause you're making a lot of statements, saying it is such, without backing it up. Maybe it is super obvious, but personally I'm impressed by the strategies that were in place.
"Would you not prioritise an airport, a hospital, the railway, over residential areas?" Yes and they do.
As mentioned, things get added to the grid all the time; in the case of the airport, it was not on a priority list. Yes, that is an oversight, but whose fault and accountability we do not know. The onus would have been on the airport to inform the power distributor, and if that failed to happen then the airport would be the one with its hand up; if they did inform them and the distributor failed to add it, then its hand would go up.
But hospitals, as I mentioned, are designed and planned for such issues, with backup power supplies. That is the solution for many installations, as many will not have, or cannot justify the cost of, being on their own distribution spur.
As for the "we never had to use it before" that's down the testing and covered in previous comments fairly well without repeating it again.
But if you wish to focus upon being able to point a finger at somebody and say it is their fault, then fine. But I and I'm sure many others would not be focusing upon that, though I can appreciate that mentality due to the way media/tabloids cover such and all issues in life. After all, power cuts today are not that common on this scale; decades past, and in less developed countries, they would be so common that they wouldn't even make the news.
But remember, if the weather forecaster says it will not rain and you go out and a light shower makes you wet - would the forecast be wrong, or would we have the mentality to say that mistake was within acceptable parameters and just one of those things? It gets down to what we are used to and what we accept. So yes, I get that blame needs to be laid in many eyes, but is that fair all the time?
But then - nothing is perfect - better to focus on getting closer to perfection than expending energy stagnating as too busy pointing the finger.
I'm pretty sure there are a lot of people who are interested to know if this is someone's fault, and I don't think it has anything to do with how the media covers these issues. I think it's a good thing, in fact.
This failure may be within acceptable parameters and it may be simply meaningless to try to find someone truly responsible for it. But you seem to be unreasonably reluctant to believe that maybe it's not.
I understand that it's interesting to analyze what happened from a more technical perspective, but of course a proper investigation has to be conducted to find the responsible party if there happens to be one (or more).
"But you seem to be unreasonably reluctant to believe that maybe it's not."
We have covered many areas: trains, planes and hospitals (maybe a movie there). Sure, I'd rather focus on the technical aspects and not presume this was something totally avoidable; rather, look at each nuance instead of the binary blame game, which is not clear cut. After all, we don't know what SLAs/supply contracts are in place now, do we? But nothing is ever 100%, and it is possible to argue both sides of the coin and, without ALL the details, go around in circles.
If we really want to blame anybody - why not just blame God for allowing it to happen and move on. Leave it to the report into the incident and its associated fallout, which will happen, to cover all that and discuss any blame then, once all the facts are in. But remember, we do not know what the SLA levels are, so all those fallouts take on a whole new perspective of blame. Equally, you can have a problem where nobody is to blame - though God works for those; at least "Act of God" is actually used in insurance contracts (yes, I worked in that field as well, disclaimer, though late 80s/early 90s). https://en.wikipedia.org/wiki/Act_of_God
I do not work for the grid, and my last involvement in that area was working for Eastern Electricity Board as an analyst/programmer on a COBOL DPS7 mainframe circa the first half of the 80s, mostly on a new plant management system (distribution level, so up to attaching to the main grid). I've had no involvement in that industry since then. I have also worked for the Department Of Health, RAND, and numerous other industries.
I have no axe to grind. Hope that adds some clarity.
Several people in this thread have said that you are wrong to place blame entirely on the grid. It's unlikely that they all work for power distribution companies.
I have never placed blame entirely on the grid... Can we not lower ourselves to this sort of sordid method of trying to put words into others' mouths?
You protest a bit too much at this point, methinks.
I am commenting that the scale of the disruption resulting from (only) a 5% drop does not seem reasonable and that some of the issues should have been better handled or avoided altogether.
> But I and I'm sure many others would not be focusing upon that, though I can appreciate that mentality due to the way media/tabloids cover such and all issues in life
> > But I and I'm sure many others would not be focusing upon that, though I can appreciate that mentality due to the way media/tabloids cover such and all issues in life
> That's a personal attack quite out of order.
Not really. It's just a societal commentary on how we've conditioned ourselves by the media we consume into passing quick blame instead of taking a slow, reasoned approach to analyzing the event, seeing the compounding factors at stake and only in the face of gross negligence assigning blame.
I am NOT a safety engineer, and I'm aware how complex that field is, so no, I don't believe that my personal hunches about what might have been a good idea intuitively are somehow better than these people whose actual job it is.
Hence my question about your background. But you implicitly answered that, too.
A potential argument against transit reliance, or at least trains relying on the grid to keep running (instead of using batteries or diesel). Personal transport options (cars, bikes, horses) are a decentralization success story when the grid goes down.
It may well be that cars with large batteries enable a smarter grid that can store surplus generation at off-peak times and buffer the peaks better. After all, not everybody drives their car at the same time, and access to a cheaper rate of electricity in return for allowing your car to be used in such a way while it is plugged in is certainly one avenue that may well open up more as we see more and more electric cars.
Though the swappable-battery offerings do seem better suited to accommodating such avenues, and they remove the worry about battery lifetime cycles from the owners - the worry that their car acting as a grid battery for a month while they are on holiday reduces its life beyond the cost saving. So it will be a case of getting the balance right. Hence I feel the swappable-battery offerings may be at an advantage here, which opens up making such solutions more cost-effective to run.
Most traditional generators (think CCGT, nuclear, hydro) run in synchronous mode, which means the generator spins at an rpm proportional to the network frequency. For 50Hz that would be 3000rpm (for a two-pole machine).
When the total electricity demand exceeds the supply, the kinetic energy of spinning generators is converted to electricity which slows down the generator rpm, which reduces the frequency. All synchronous generators on a given network spin at the same frequency, so when the demand is too high all of them slow down a bit.
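To put rough numbers on that relationship (every figure below is an illustrative assumption, not a value from the report): the initial rate of change of frequency after a loss is roughly the lost power, scaled by nominal frequency, divided by twice the system's stored kinetic energy.

```python
# Back-of-envelope rate of change of frequency (ROCOF) after losing generation.
# All numbers are illustrative assumptions, not figures from the report.
f0 = 50.0            # nominal frequency, Hz
kinetic_gws = 200.0  # assumed total kinetic energy stored in spinning machines, GW*s
lost_gw = 1.3        # assumed sudden generation loss, GW

# Aggregated swing equation: df/dt = -dP * f0 / (2 * E_kinetic)
rocof = -lost_gw * f0 / (2 * kinetic_gws)
print(f"initial ROCOF = {rocof:.2f} Hz/s")

# Time to sag from 50 Hz to the 48.8 Hz LFDD threshold if nothing responded:
print(f"about {(50.0 - 48.8) / abs(rocof):.0f} s to reach 48.8 Hz at that rate")
```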
There was a comment once on HN that once a power generator was hooked to the network without the phases being aligned, and the resulting mechanical forces were so large that the generator got unbolted from the floor and flew through the air. Not sure if true or not.
This is kinda unrelated (it involved a line frequency slowdown from 60 Hz to 51 Hz), but I always found this radio recording of the US northeast blackout of 1965[0] fascinating as the DJ is grappling with records/tape cartridges slowing down and lights dimming live on the air https://www.musicradio77.com/images/ing11-9-65blackout.mp3
If I listen carefully, I can tell the difference between 48 and 50 Hz using a sine generator (although it's very subtle), so I'd guess that you possibly could.
It actually sounds like this was really useful though, as a lot of the systems involved will be improved in the future.
I'm not sure if the grid is already doing this, but they should trigger failures randomly, including Low Frequency Demand Disconnection schemes from time to time, to ensure that all downstream systems are also ready (like the hospital someone mentioned that had an issue with the backup generator, or the trains that stopped working when the frequency dropped). A bit like chaos monkeys in dev environments.
Triggering demand disconnection randomly is saying "we should provide consumers a service including deliberate random blackouts". That's OK in contexts like Netflix where the downstream systems are supposed to be able to be resilient to faults and failures are comparatively low-stakes anyway. But almost all grid consumers aren't resilient to power loss (would your house continue to be fully functional without noticeable ill effects if power to it went out? Mine wouldn't...) and even for industrial scale consumers who do have generators the costs of finding out in a deliberate blackout that their generator was faulty could be anything from expensive to life-threatening.
Sorry, but the electrical grid is not built like a Netflix server farm. Its explicit purpose is to provide reliability to the customer, so the customer will not have to invest into mitigations for network failures, because it is much cheaper, more economical and technically much more feasible to make these investments in the network than individually on every single customer location. The only exception are some very specific customers for which the provided amount of reliability is not enough - these need to increase the reliability by investing in additional failsafes, and are of course responsible for testing these by themselves (if they'd like to have random network outages performed, they are free to do so, because any consequences from these harsh tests are in their responsibility anyway, just as any consequences from not doing proper tests and thus having the safeguards potentially fail in an emergency).
Netflix also would never intentionally break my FireTV box, which I use to watch Netflix, just to ensure that I have a backup in place (which I don't, because I don't care THAT much about Netflix availability).
[edit] Now that I think about it...the purpose of the Netflix server farm is also to provide reliability to the customer, so it's actually comparable to a certain degree. It's just that the possible outcome of a mitigation failure in the Netflix server farm is limited to some people not watching Netflix for a few minutes, while the possible outcome of a power outage even for residential customers are usually far worse, let alone industrial or health customers. This allows Netflix to take the limited risk of intentionally breaking internals in their infrastructure, because the worst that can happen is them having to excuse themselves to some customers for a service interruption. Also, no one expects them to submit a report describing how it came to those interruptions, so they don't need to admit that they were the result of intentional self-inflicted disruptions (which sounds really crazy and irresponsible to pretty much all consumers out there). With electrical grid operators, the stakes are infinitely higher, and they are legally required to investigate any bigger problems and report details about their origins to both government authorities and customers. And you can be sure that "oh we just intentionally dropped 5% of the network to see if we get a full blackout or if our failsafes work" would not be an acceptable excuse to any of these two.
Doesn't LFDD just mean cutting off large areas from the grid to improve the frequency elsewhere? I doubt it would be accepted by home users and businesses. Hospitals should be running their own local tests.
You would be setting up a Chaos Monkey (https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey) on your electrical grid. The problem is, while these kinds of things are an awesomely good idea early in a system's life, they are very difficult to put in after it has become both complex and mission-critical.
I assumed that the LFDD scheme was a small number of large power users such as factories that would get cheaper rates, much like AWS spot instances. Another commenter has said this is how it works in Germany.
Unfortunately, some quick research[1] suggests this isn’t the case, and actually it’s just a contractual obligation of the power companies to disconnect enough demand to bring power use back down. The amount they need to disconnect is banded by the frequency the grid is at, and apparently done at the substation level.
Some industries/applications, such as infrastructure, hospitals, airports, etc., are supposed to be set up to never disconnect in these scenarios, but it seems this failed.
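To make the "banded by frequency" idea concrete, here's a tiny sketch of a band-table lookup. Only the first stage (48.8 Hz, ~5%) comes from the report quoted earlier in the thread; the lower bands are made-up placeholders, not the real Grid Code values.

```python
# Illustrative LFDD-style band lookup. Only the first stage (48.8 Hz -> ~5%)
# matches the report quote earlier in the thread; the rest are hypothetical.
LFDD_BANDS = [
    (48.8, 5),    # first stage, per the interim report quote
    (48.6, 10),   # hypothetical
    (48.4, 15),   # hypothetical
    (48.2, 20),   # hypothetical
]

def demand_to_shed(frequency_hz):
    """Cumulative % of demand that relays set to these bands would disconnect."""
    return sum(block for threshold, block in LFDD_BANDS if frequency_hz <= threshold)

for f in (49.2, 48.8, 48.5):
    print(f"{f} Hz -> shed {demand_to_shed(f)}% of demand")
```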
The UK has the 'pay commercial users who are able to voluntarily reduce demand' system too, and that was part of the initial response to the frequency drop, I think. The problem is that the volume of demand you can get rid of this way isn't necessarily enough to be sufficient -- if you need to dump 10 or 20% of demand and especially if it needs to be spread evenly over the grid then it's going to be hard to do without disconnecting people involuntarily.
I wouldn't be surprised if we could do better in the volume of demand we could reduce on demand, especially with smarter technology, though. At some point the question is whether the extra complexity and infrastructure is cost-effective to avoid once-a-decade events on the scale of this one, which probably 90% of the UK would never have noticed unless they read the news...
That would be like AWS telling all customers that their services would randomly go offline to test disaster response. The responsibility for blackout testing should be at the client level, at each hospital, train station etc similar to fire drills.
The whole point of sophisticated grid management is to save us money by reducing the total generating capacity needed for a reliable supply. If this were to impose massive costs on the rest of society in the form of planned blackouts then what would be the point? Why not invest in a little extra capacity?
What an interesting read! I had no idea that a blackout happened in the UK recently, so I clicked out of curiosity and was rewarded with this delightful description of the initial events.
It's also interesting to see the interplay of automated and manual systems in such a complicated scenario.
The article says near the bottom "At the time [the GT1A generator tripped] there was 4000MW of unused generation in reserve, and if only a fraction of this reserve was instructed to dispatch within 60 seconds of the initial loss it’s very likely that load shedding would have not occurred."
My understanding was that having reserves available within 30-60s was pretty expensive, and that that would have been what was used to handle the initial frequency drop (the interim report PDF says there was 1000MW of this). Drax's page on this (https://www.drax.com/technology/power-systems-super-subs/) says that even spinning reserve can be on the order of two minutes, and the next tier below that (STOR) is on the order of 20 minutes. So does anybody know what the 4000MW available within 60s but not called on was?
Hey! (author of the article here). I'm not happy with how I phrased this with the 4GW of capacity -- and while I suspect generation could have increased by hundreds of MW within 60 seconds I may be wrong. I've updated how I've phrased it in the blog to hopefully give a bit more clarity to my position.
I don't mean to suggest that plants could ramp to full within 60 seconds; it's that 4GW means there's a bunch of headroom, and that there was at least the physical possibility that dispatch instructions to increase generation/start up (the reserves were "available") could have been sent out and plants could have started moving on an upward trajectory, creating even a few extra hundred MW of generation if many of the more flexible plants (gas, hydro) could increase their output even just a bit.
Here in Australia we dispatch on a 5 minute period, and there are some plants that are very slow moving and have to ramp over tens of minutes or hours, but moving grid-wide generation up and down many hundreds of MW within a 5 minute dispatch period is a common occurrence, and so this coupled with the relatively high amount of reserve makes me suspect it would have at least been physically possible to increase output and hasten the return to a normal operating frequency.
How you create both the systems and market structures/rules to oversee dispatch and coordination of something like that in a fast and automated way is very complex and infeasible to effectively do any time soon, but I think it's a system we should be thinking about and moving towards.
Overall though I'm just grumpy about the amount of manual, slow work that's still involved in grid operations (when we lose an inter-connector here in Australia it takes 3-15 minutes for the dispatch engine to become aware that there's no inter-connector anymore as it needs to be reconfigured manually).
I assume battery-based grid power stations like the Tesla one in Australia are among the best in class to help in this kind of crisis situation? Compared to a gas turbine the system is simpler (no mechanical/heat/fluid parts) and, depending on design, will likely never be completely off, since it is made of dozens or hundreds of independent pieces of equipment (batteries and inverters).
Batteries are great for this kind of stability support. I actually work for an Australian company that aggregates residential batteries together into MW-scale units and provides coordinated responses from 1,000s of kW-scale batteries that are just as capable as a MW-scale plant (at a much lower cost of capital).
Here's one of our plants in Canberra which is past 3MW now, and is the first of its kind to be approved to bid into our fast frequency stability markets (where you have less than a couple of seconds to respond).
https://www.youtube.com/watch?v=36UJMuC9-k0
Interesting! I've been following this through cleantechnica, electrek & other sites.
It's transforming a complex heat/fluid/mechanical system issue into a computer network and security issue. Usually the home network internet access market is very concentrated so a failure there could make the VPP unavailable. Solvable by putting control servers in various places and having each box connecting to multiple redundant control servers hosted with diversity and redundancy in mind.
Also ISM low bandwidth IoT radio technology like LoRa and sigfox might help redundancy too.
Might become important if VPPs grow beyond handling rare grid events.
Skinny, cheap comms such as LoRa would be perfect for redundancy. If you can at least get a simple message to the batteries (charge/discharge at xkW) you still have most of your effectiveness of the system.
One nice thing though is that many of the stability services (such as balancing frequency) can be detected locally, and therefore centralised dispatch and comms aren't necessary. All of our household systems have high-speed metering attached (sampling at 20 Hz) and when they make a frequency measurement outside the normal operating range (48.8 Hz - 50.2 Hz) they discharge/charge respectively. This means that when there's an event we have all the systems reacting in a way that looks coordinated, but is actually just synchronised (as they all essentially have the same input -- grid frequency), so services like this are extremely resilient.
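A stripped-down sketch of that local decision logic (the 48.8/50.2 Hz thresholds are from the comment above; the power figures are placeholder assumptions):

```python
# Minimal sketch of the local, comms-free frequency response described above.
# The 48.8/50.2 Hz thresholds are from the comment; power figures are assumed.
LOW_HZ, HIGH_HZ = 48.8, 50.2
MAX_EXPORT_KW, MAX_IMPORT_KW = 5.0, 5.0   # assumed per-battery capability

def frequency_response_kw(freq_hz):
    """Positive = discharge to the grid, negative = charge from the grid."""
    if freq_hz < LOW_HZ:
        return MAX_EXPORT_KW      # under-frequency: push energy into the grid
    if freq_hz > HIGH_HZ:
        return -MAX_IMPORT_KW     # over-frequency: absorb the surplus
    return 0.0                    # normal band: do nothing

# Each unit samples its own meter at ~20 Hz and acts on that local measurement,
# so a fleet looks coordinated without any central dispatch or comms.
for f in (50.0, 50.3, 48.7):
    print(f, "Hz ->", frequency_response_kw(f), "kW")
```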
Battery systems seem likely to play a greater part in the grid in future, but I think it's worth noting that the problem here was "sudden loss of generation exceeded the amount that the system operating specs require to be covered by having alternatives on standby". So unless we're collectively prepared to pay to raise the resilience by having more standby capacity, might we just find ourselves with more batteries and fewer turbines running inefficiently at half-load, so we take the benefit of battery tech in cheaper electricity rather than reduced chances of outage?
Yes, batteries are very fast acting, but that battery in South Australia is only 100MW and the grid lost over 1400 MW. https://www.news.com.au/technology/innovation/inventions/eve... Batteries have a lot of potential, but they are currently small in capacity.
> but that battery in South Australia is only 100MW and the grid lost over 1400 MW
I hear this kind of misconception fairly often and so it is worth noting that you don't need 1400 MW of battery to protect from this event.
The initial 735 MW drop happened, as I read TFA, when a major generating source took itself offline as a result of the frequency disturbance. The later generator drops are cascades of the frequency disturbance earlier in time.
If a suitable battery had been present and reacting in milliseconds it could have stabilized the frequency enough, for just the very short period required, to avoid the whole generator trip in the first place. Thus you can use much smaller amounts of fast-response frequency support to avoid much bigger drops. This is how batteries can pay for themselves so quickly when placed in the right topologies w.r.t. the grid demand and generation resources.
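A toy single-mass frequency model (every parameter invented for illustration, not taken from the GB event) shows the shape of the argument: a modest but millisecond-fast injection can hold the nadir above a trip threshold even though it's much smaller than the original loss.

```python
# Toy swing-equation model: does a fast 100 MW battery keep frequency above a
# 48.8 Hz trip threshold after a 200 MW loss? Every number is invented for
# illustration; this is not calibrated to the GB system or to this event.
F0 = 50.0                 # nominal frequency, Hz
KINETIC_MWS = 30_000.0    # assumed system kinetic energy, MW*s
DROOP_MW_PER_HZ = 150.0   # assumed aggregate governor (primary) response
LOSS_MW = 200.0           # assumed sudden generation loss
DT = 0.05                 # simulation time step, s

def lowest_frequency(battery_mw, battery_delay_s):
    f, t, lowest = F0, 0.0, F0
    while t < 60.0:
        battery = battery_mw if (t >= battery_delay_s and f < F0) else 0.0
        governor = DROOP_MW_PER_HZ * (F0 - f)          # slower conventional response
        imbalance = -LOSS_MW + battery + governor      # MW; negative = deficit
        f += imbalance * F0 / (2 * KINETIC_MWS) * DT   # swing-equation Euler step
        lowest = min(lowest, f)
        t += DT
    return lowest

print(f"no battery:          nadir = {lowest_frequency(0.0, 0.0):.2f} Hz")
print(f"100 MW after 0.2 s:  nadir = {lowest_frequency(100.0, 0.2):.2f} Hz")
```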
If 200 MW of generation trips instantly, which is typical when electrical equipment fails, you need 200 MW to replace it in the next few seconds, and then additional generation to bring the frequency back to Nominal.
So far the largest battery is 40 MW? So not really going to help more than 40 MW can, even if it comes online instantly, when 200 MW trips off.
Answering that question would take a detailed analysis which I am not aware of. It would be less than the instant load of the Eaton Socon - Wymondley transmission circuit that was struck by lightning.
It is basically a question of providing just enough frequency support, for long enough, for the remainder of the grid to respond to the frequency drop. In other words, let the frequency sag a bit so generators respond but not so much that they disconnect.
I think there's also some control system error implied in the wind farm's drop as I read the article. That perhaps shouldn't have happened at all and was an un-analyzed state that will be corrected.
>> but that battery in South Australia is only 100MW and the grid lost over 1400 MW
>I hear this kind of misconception fairly often and so it is worth noting that you don't need 1400 MW of battery to protect from this event.
I would be interested in any papers or literature that have led you to this conclusion.
As soon as (generation - load) <> 0, the frequency is going to change. If you lose 200MW of generation and have 100MW of batteries that are configured with zero or a very small amount of droop, they see some small change in frequency and crank out 100MW to try and stop it from falling further. Then they are maxed out; there is still 100MW of imbalance for conventional generation with governors to take up, which they will as the frequency falls, as if the trip had only been 100MW in the first place. How is this not a 1:1 relationship - 100 MW of batteries displaces 100MW of lost generation?
For this event in the UK in particular it appears the issue is that the large windfarm, and 500MW of small embedded or distributed generation are en masse improperly configured or by nature are not able to ride through any small transient wobble in frequency. TFA didn't show us what voltage these plants saw.
Although a 200MW steam turbine also tripped due to the same lightning/transmission line trip event, so clearly it was subjected to something that exceeded instantaneous thresholds and had to trip immediately. And then that steam turbine tripping caused the gas turbines at the same plant to trip which indicates there are some design or operational issues at that plant.
Also is 20s reclose time for a transmission line standard? I've only configured distribution line reclosers but it was more on the order of 1s for the first reclose and maybe 7s for the second?
According to the article it was a cascade of events, and the last one was the 244MW gas turbine going out that led to the collapse - so not that far from current-scale grid batteries.
Also, a battery reacting instantaneously to correct or mitigate a grid issue might prevent some other generator from going offline, or allow it to go offline for a much shorter amount of time, avoiding some of the cascading effects.
It seems to me that the value of grid batteries is not just the nameplate capacity compared to other generation technologies; they are a class of their own.
Maybe once batteries are incorporated in the grid at some scale, the software triggers commanding other generation capacity to disconnect/reconnect will have to be updated to "let the grid batteries do their job first and then act".
Pretty much the only plants you can get to full power at such short notice are battery or hydroelectric. Even pumped-storage hydro typically takes 10-30 seconds to full power because of inertia. Batteries are much more expensive, but as close to instant as it gets.
Other fast responders are supercapacitors, superconducting magnetic energy storage, flywheels, and compressed air, and there are probably a lot more I don't know about.
What conventional chemical batteries have is mostly that they are manufactured in vast quantities, so the unit costs are constantly improving.
I suspect it being the evening peak made it worse: at peak time, in normal operation, they're already discharging some of the batteries/pumped storage that took up spare capacity during the day.
I would encourage more people to read the actual interim report - it's quite detailed, with lots of information about the frequency measurements and trips.
Might be pumped water storage like "Electric Mountain" in North Wales (there's some in Scotland too, not sure what the total pumped storage on the grid is). They pump water to remove excess power from the grid and then run in reverse to generate. Their 0-100% is O(1minute) IIRC, actually on the website they say:
Electric Mountain are definitely used for frequency management. Again IIRC, they're the fastest response on the grid. If you get chance to do the tour (it's just below Mount Snowdon) I'd recommend it.
What effect does the frequency drop have on home users? Is there protection in the substations near homes that will cut off the power completely if the frequency goes out of range?
The grid frequency itself doesn't matter much as far as consumers are concerned, the deviations mentioned in this article would not cause any problems for a home user directly (or even generators, but as mentioned below if it is too far out of range generators start to need to cut out) but it is a far better indicator of the general balance of supply and demand of power in the grid than any other metric, because it can be measured very precisely, isn't subject to local effects like voltage, and correlates well with the state of the generators.
One quirk of synchronous generators is you can think of the whole grid like all of the generators in it are geared together. If demand rises, the torque on the generators starts to increase and if the power plants don't feed more mechanical power in then they will start to slow down in unison. Likewise their speed will increase as the demand drops. So the frequency is the main indicator of the health of the grid and so the grid operators will monitor it closely.
> The grid frequency itself doesn't matter much as far as consumers are concerned
So I thought and then we had a multi week frequency shift in the EU and various clocks started lagging behind. This does show up for consumers as well.
Generally grids try to average out to their target frequency over some period (24hrs, for example) because so many timekeeping systems rely on counting cycles.
So sometimes they will run slightly above target and sometimes they will let it run slightly below target. Ordinary daily demand cycles help.
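A rough worked example of the clock-drift effect (the 49.98 Hz daily average is an assumed figure, not from any real event):

```python
# How far a cycle-counting clock drifts if the grid runs slightly slow all day.
# The 49.98 Hz daily average is an assumed example, not a measured value.
nominal_hz, average_hz = 50.0, 49.98
seconds_per_day = 24 * 3600

drift_s = seconds_per_day * (average_hz - nominal_hz) / nominal_hz
print(f"clock error after one day: {drift_s:+.1f} s")   # about -34.6 s
```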
Most things in a modern home don't care about frequency.
The problem is mostly at the generation end.
The electrical grid is essentially distributed transmission locking all the generators across the country together. They will all rotate in sync at the same speed/phase; it might as well be a mechanical linkage.
The entire system is under too much load and the generators are all physically slowing down. Unfortunately the generators are all designed to operate over a very narrow frequency range. If the frequency continues to drop, the safety systems will kick in to protect the generator.
If the load isn't reduced, more and more generators will drop off the network and there will be less and less supply to meet the demand. The frequency will drop lower and lower. Eventually the whole network will drop to zero and nobody will have power.
It's basically the same thing as stalling an engine.
A small frequency drop doesn't have a major impact to home users as well, but it's a symptom of a generation/consumption mismatch - the frequency dropping from 50 to 48.8 is a sign that the consumption is draining energy that's not generated by essentially taking it from the inertia of the major generators, and temporarily reducing consumption as all the large phase-linked motors are suddenly slowing down slightly... both of which are temporary, short-term, unsustainable effects.
The emergency response kicks in not because 49 is innately very bad and 48 is some horrible evil by itself, it kicks in because if you do nothing, then in a few minutes you're going to see 40, 30 and the whole system is going to stop - or, more realistically, every connected subsystem (both generators and consumers) will have their emergency responses kicking in and either shutting down or automatically disconnecting from your system.
The main problem with frequency mismatch is that when two power sources "meet" with different frequency the effect can be catastrophic, because the faster generator pulls the slower (think of the "hares" in 3000-10000 m running competitions) and typically both end up breaking.
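A rough sketch of why out-of-phase connection is so violent (voltage and reactance values are illustrative, not from any real machine): the power surge between two stiff sources grows with the sine of the phase difference, so closing onto the grid tens of degrees out of phase produces a torque several times anything seen in normal operation.

```python
import math

# Illustrative synchronising-power surge between two stiff AC sources connected
# through a reactance. Voltage and reactance values are made up (per unit).
V = 1.0   # per-unit voltage on both sides
X = 0.2   # per-unit reactance between them

def transfer_power_pu(angle_deg):
    """Classic P = V1 * V2 * sin(delta) / X power-transfer relation."""
    return V * V * math.sin(math.radians(angle_deg)) / X

for delta in (1, 10, 30, 90):
    print(f"phase difference {delta:>2} deg -> roughly {transfer_power_pu(delta):.2f} pu power")
```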
Though I'm not sure of the exact details of a frequency drop and its impact on your house. Equally, I'm not up on substations (I did work in IT for a large utility company many decades ago, but not field engineering). But the first safety control from supply to the home would be the distribution transformer, and those back then had overheat protection. They also had a flag attached by wax at one spot, so if they did overheat and cut out for a while, you could visually see that due to the flag being visible. But that was more a legacy solution that still prevailed into the 80s; unsure about now. I would not be surprised if such tried and tested things still take place as some form of backup monitoring. I learned about it after coding for a database that stored such records, which had the field name SITESHIT, short for "SITES HIT" - yes, you can see how that gave a young humble COBOL programmer a laugh. It was a count of how many times that wax flag had been triggered.
>"In Japan, the western half of the country runs at 60Hz, and the eastern half of the country runs at 50Hz – a string of power stations across the middle of the country steps up and down the frequency of the electricity as it flows between the two grids." //
> Eastern Japan (including Tokyo, Kawasaki, Sapporo, Yokohama, and Sendai) runs at 50 Hz; Western Japan (including Okinawa, Osaka, Kyoto, Kobe, Nagoya, Hiroshima) runs at 60 Hz. This originates from the first purchases of generators from AEG for Tokyo in 1895 and from General Electric for Osaka in 1896.
For anything powered by a Switch Mode Power Supply/SMPS (TVs, computers etc) there's not much effect, as power is rectified immediately and then converted back at several tens of kHz, and so many SMPS devices will actually run from DC.
Heating elements (kettles etc), also don't care, and a lot of motors used in household items are universal motors[1], which will end up running at the same speed even as the frequency drops. Old alarm clocks do use the mains supply as a time reference though, so they will end up having the wrong time, as happened recently in the European grid [2].
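As a quick aside on the clock drift: a mains-referenced clock just counts cycles and assumes 50 per second, so any sustained deviation accumulates. The average frequency below is an assumed illustrative value, not the figure from that incident.

```python
# How mains-referenced clocks drift: they count cycles and assume 50 per
# second, so any sustained deviation accumulates. The average frequency
# below is an assumed illustrative value, not the real 2018 figure.

nominal_hz = 50.0
average_hz = 49.996          # assumed long-run average during the disturbance
days = 30

seconds_counted = average_hz * days * 86400 / nominal_hz
drift = days * 86400 - seconds_counted
print(f"clock runs slow by ~{drift:.0f} s over {days} days")
# ~207 seconds (about 3.5 minutes) with these numbers.
```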
For a transformer, efficiency decreases rapidly[3] with falling frequency. In household usage, transformers running directly on mains frequency are only used in linear power supplies (as opposed to the SMPS mentioned earlier), which are found in things like audiophile-level power amplifiers (adds a nice bit of weight), so your Class-A amplifier may get even hotter.
The power grid uses transformers of course, to step down the several-kV supply to the 240V used in the UK, and it's possible that a big enough frequency shift could cause these to overheat from the decreased efficiency (a rough sketch of why follows below). I would be very surprised if they would not turn off if they got too hot though; it seems like a sensible thing to design in.
In short, not much for small changes, but the frequency is falling as there is more consumption than supply, so the supply gets turned off to decrease demand.
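On the transformer point, the usual first-order reasoning is that peak core flux scales with V/f (from the transformer EMF equation V ≈ 4.44·f·N·A·B), so holding the voltage constant while the frequency falls pushes the core towards saturation and extra heating. A toy illustration with made-up ratings:

```python
# Toy illustration of why a mains transformer dislikes low frequency:
# from V = 4.44 * f * N * A * B_peak, the peak flux density B_peak
# scales as V / f for a fixed winding. Numbers are invented for the example.

V = 240.0        # applied RMS voltage, volts
N = 500          # primary turns
A = 2.0e-3       # core cross-section, m^2

def peak_flux_density(freq_hz: float) -> float:
    """Peak core flux density in tesla at the given supply frequency."""
    return V / (4.44 * freq_hz * N * A)

for f in (50.0, 48.8, 45.0, 40.0):
    print(f"{f:5.1f} Hz -> B_peak ~ {peak_flux_density(f):.2f} T")

# ~1.08 T at 50 Hz rising to ~1.35 T at 40 Hz with these numbers: a core
# designed to sit just below saturation at 50 Hz gets pushed harder as the
# frequency falls, so magnetising current and core losses rise.
```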
Not every substation, but the system is definitely protected against over- and under-frequency and voltage excursions, so that loads can't be fed 45 Hz for a minute.
I like this article, as it gives a lot of advanced insight into what seems at first like an easy problem.
Generators are tech from, what, two centuries ago? Keeping them in sync sounds boring - just talk to each other. And then, under the surface, looms an incredibly complex problem of working at scale, with engineering, political, and organizational aspects.
(1) A lightning strike creates a transient voltage on the grid.
(2) Some power generators remove themselves from the grid to prevent damage to themselves from the transient voltage.
(3) Now there is too much demand on the grid -- too many people using power and not enough generators to supply it. (The line frequency dropping indicates that supply and demand are not balanced.)
(4) So the grid operator begins to (deliberately) remove some people from the grid to keep the supply and demand balanced.
What's not clear to me is what would happen if you didn't do step #4.
That’s when you end up with a cascading failure (https://en.wikipedia.org/wiki/Cascading_failure). The remaining generators are overloaded by the demand and shut down to protect themselves as well, which further decreases the supply and causes more stations to shut off, etc.
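A very crude sketch of that cascade (every threshold and unit size below is invented, and real grids model this far more carefully):

```python
# Crude cascade sketch: generators trip on underfrequency, each trip deepens
# the shortfall, and with no load shedding the frequency spirals down.
# Every number here is invented; among other simplifications, the aggregate
# inertia is held constant even as units trip.

f0, f = 50.0, 50.0
H, S = 4.0, 30e9              # aggregate inertia constant (s) and rating (VA)
load = 30e9                   # demand, W
# (capacity W, underfrequency trip setting Hz) for a few notional units;
# a further 2 GW is assumed already lost in the initial disturbance (step 2).
gens = [[7e9, 47.5], [7e9, 48.0], [7e9, 48.3], [7e9, 48.6]]
online = [True] * len(gens)

dt = 0.1
for step in range(int(60 / dt)):                   # simulate up to a minute
    supply = sum(cap for (cap, _), up in zip(gens, online) if up)
    f += (supply - load) * f0 / (2 * H * S) * dt   # swing-equation approximation
    for i, (cap, f_trip) in enumerate(gens):
        if online[i] and f < f_trip:               # protection disconnects the unit
            online[i] = False
            print(f"t={step * dt:5.1f}s  f={f:5.2f} Hz  {cap / 1e9:.0f} GW unit trips")
    if not any(online):
        print(f"t={step * dt:5.1f}s  total collapse")
        break
```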
Step 4 needs to happen, and being able to do it is one of the benefits the UK gets from its massive country-wide electricity grid. As someone else said, otherwise you get cascading failures and a much worse time.
It was a few minutes of downtime affecting a fraction of people. The last large UK outage was 10 years ago. The uptime has a lot of 9s already, so even if they ensure this particular problem wouldn't cause downtime again, the next time it'd just be a different rare event that causes it.
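For a sense of scale, a quick bit of availability arithmetic; the outage length and affected share below are round placeholder numbers, not the actual figures from the incident:

```python
# Rough "how many 9s" arithmetic. The outage length and affected share are
# round placeholder numbers, not the actual figures from the incident.

outage_minutes = 45.0        # assumed length of the disruption
affected_fraction = 0.05     # assumed share of customers affected
window_minutes = 10 * 365.25 * 24 * 60   # ten years between large outages

availability = 1 - (outage_minutes * affected_fraction) / window_minutes
print(f"customer-weighted availability ~ {availability * 100:.5f}%")
# With these placeholders that is better than 99.9999% - "a lot of 9s".
```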
At one stage, they have to shed load (turn power off), and that sounds totally reasonable. How are these areas divided, how does one turn power off for a whole area, and how is it decided who to cut from power (temporarily)?
From the little I know UK settlements have substations that serve something like 1000 homes. Older ones are open air affairs, newer ones are prefabbed enclosed buildings. They're oil cooled transformers and switching gear. One serves the village where my parents live, in a city there are two within about 400m of me (serving different neighbourhoods). Sometimes a neighbourhood will trip out, like in a thunderstorm; or get turned off (maintenance). I assume that the central grid has a switching order for these to be turned off to maintain essential supply. Factories, hospitals, and such have their own substations.
https://www.westernpower.co.uk/downloads/4093 is a pdf for one of the local power distribution companies that includes an explanation. The short answer is that they have frequency-sensitive relays that automatically disconnect some or all of the feeders downstream of 132kV substations as frequency drops through various specified limits.
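A minimal sketch of how such staged low-frequency demand disconnection behaves; the stage frequencies and shed percentages below are placeholders, not the settings from the linked PDF:

```python
# Minimal sketch of staged low-frequency demand disconnection (LFDD):
# frequency-sensitive relays at distribution substations drop blocks of
# feeders as the frequency falls through preset stages. The stage
# settings below are placeholders, not the values from the linked PDF.

STAGES = [          # (trip frequency in Hz, fraction of total demand to shed)
    (48.8, 0.05),
    (48.6, 0.10),
    (48.4, 0.10),
    (48.2, 0.15),
    (48.0, 0.20),
]

def demand_to_shed(frequency_hz: float) -> float:
    """Cumulative fraction of demand disconnected at this frequency."""
    return sum(share for trip_f, share in STAGES if frequency_hz <= trip_f)

for f in (49.2, 48.8, 48.5, 47.9):
    print(f"{f:.1f} Hz -> shed {demand_to_shed(f) * 100:.0f}% of demand")
```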
Isn't this kind of fault a good argument for synchronizing the UK grid with the rest of Europe? They already seem to have an interconnect on the scale of the demand needed here, but it's a DC link.
If they were synchronized, it would mean that in the case of any larger failure that would draw too much current over the undersea link, the link itself would have to trip out.
That depends on the link size. But I guess it may very well be better to have a big enough DC link and have it automatically modulate power all the way up to maximum and then keep it there, instead of tripping.
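A sketch of that behaviour, i.e. a frequency-proportional import that saturates at the link rating instead of tripping; the gain and capacity are made-up figures:

```python
# Sketch of the behaviour described above: an HVDC interconnector that ramps
# its import in proportion to the frequency shortfall, then clamps at its
# rated capacity instead of tripping. Gain and rating are made-up figures.

LINK_CAPACITY_MW = 1000.0   # maximum import over the link
DROOP_MW_PER_HZ = 2000.0    # extra import per Hz of frequency shortfall
NOMINAL_HZ = 50.0

def link_import(frequency_hz: float, scheduled_mw: float = 0.0) -> float:
    """Import over the DC link, clamped to the link rating."""
    support = DROOP_MW_PER_HZ * max(0.0, NOMINAL_HZ - frequency_hz)
    return min(LINK_CAPACITY_MW, scheduled_mw + support)

for f in (50.0, 49.8, 49.5, 49.0, 48.5):
    print(f"{f:.1f} Hz -> import {link_import(f):7.1f} MW")
# Once the shortfall exceeds 0.5 Hz the link sits at full capacity
# rather than disconnecting.
```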
Imagine if you are running on a conveyor belt at 60 steps per second. This belt is turning a crank that, let's say, moves water. You are the energy input (the sun for solar, heat energy for steam turbines, movement of wind for windmills, and so on), the crank is the generator, and the water is the load.
Now imagine the water is suddenly turned into vegetable oil (more load has come on). It is more difficult to move. For a brief moment, your energy input is still the same but you can't push as much of the new liquid, so your pace slows down a bit, to 58 steps per second. You increase your exertion (add more energy input) and your pace climbs back to 60 steps per second.
Finally, imagine that you're running as hard as you can. That 60 steps per second to push that vegetable oil is as much as you can do. But the vegetable oil turns to heavy cream (even more load). You've put in as much energy as you can but your pace falters and drops to 49 steps per second.
For the same reason a car engine’s RPMs will drop when you start going up a hill (unless you increase the throttle).
Spinning up the generators, and keeping them spinning, takes energy on top of whatever the load is. If the load increases, but the input energy does not, the difference is robbed from the momentum of the spinning mass. This slows the generators down thus the frequency drops.
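The same point from an energy-balance angle (all figures invented): the deficit is paid for out of the stored rotational kinetic energy E = ½Jω², so the speed, and hence the frequency, has to fall.

```python
import math

# Energy-balance view of "robbing the momentum of the spinning mass".
# All figures are invented for illustration.

J = 2.0e6                 # combined rotational inertia, kg*m^2
f0 = 50.0                 # nominal frequency, Hz (treated as one big 2-pole machine)
omega0 = 2 * math.pi * f0
E0 = 0.5 * J * omega0**2  # stored kinetic energy, joules (~1e11 J here)

shortfall_w = 1e9         # demand being met from stored energy, W
dt = 1.0                  # after one second of that shortfall

E1 = E0 - shortfall_w * dt
omega1 = math.sqrt(2 * E1 / J)
print(f"frequency after {dt:.0f} s: {omega1 / (2 * math.pi):.2f} Hz")
# ~49.75 Hz with these numbers: the deficit comes straight out of rotational
# kinetic energy, which is exactly why the frequency sags.
```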
ELI5: When you're at a playground, and you're pushing the roundabout round and round, it gets harder and you get slower the more kids jump on it, right?
ELI-not-5, but I'm no expert either: in terms of phase, sources like generators lead [0] sinks by an amount proportional to their over-production; so the more that is sunk, the less they lead, and the lower the frequency.