With BP belching a gazillion barrels of crude into the Gulf, now might be a good moment to reflect on the impending catastrophe of our over-reliance on petroleum to power our world. Indeed, the only Americans who may be secretly licking their lips at the current disaster are the nuclear power lobbyists. Support for nuclear power had already been growing, in part because there seem to be few other practical short-term options. Even President Obama is talking nice about nuclear power. So I want to use this diary to discuss a subject that has not yet been openly debated, but which has been of growing concern within the Nuclear Regulatory Commission itself: software bugs.
There is a new generation of nuclear power plants currently being licensed in the United States. One of the “exciting” features of these new designs, pioneered by Toshiba in Japan, is their sophisticated safety software. Prior generations of reactors had hard-wired safety systems that have become hopelessly antiquated in our digital world, so clever software engineers have designed highly redundant software safety systems to replace the old Rube Goldberg bells and whistles. But there is one little problem worrying the regulators, and it ought to be worrying the president, and worrying us, too: all large software systems contain bugs.
The use of the word “bug” in this sense is often traced to the late Rear Admiral Grace Hopper, whose FLOW-MATIC language paved the way for COBOL. She liked to tell the story of a malfunction in the Harvard Mark II, one of the earliest computers. After an agonizing debugging session in 1947, technicians finally found the bug: an actual moth that had gotten trapped in one of the machine’s relays.
The real problem with software bugs can be put this way: one bug can crash the system. The software engineer Gerald Weinberg put it a little more colorfully:
If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.
Let us very briefly review a few real-world woodpecker strikes (mostly taken from Wikipedia).
--In 1999, NASA’s Mars Polar Lander was destroyed when its flight software mistook the jolt of its landing legs deploying for touchdown and shut off the descent engines about 40 meters above the Martian surface.
--Her sister spacecraft, the Mars Climate Orbiter, was also lost, though not to a bug per se: a Lockheed Martin engineering team delivered thruster data in English units (pound-force seconds) while the navigation software expected metric (newton-seconds), and nobody converted. (A code sketch of this failure mode appears after this list.)
--The European Space Agency’s Ariane 5 Flight 501, carrying a rocket and payload worth hundreds of millions of dollars, blew up about 40 seconds after liftoff because of a bug in its guidance software: a 64-bit floating-point velocity value was converted to a 16-bit integer, the value overflowed, and the unhandled exception shut down the inertial reference system. (See the second sketch below.)
--A software error in the MIM-104 Patriot missile system caused its internal clock to drift by about a third of a second over 100 hours of continuous operation, enough that the battery failed to track and intercept an incoming Scud. The Scud struck a U.S. Army barracks in Dhahran, Saudi Arabia (February 25, 1991), killing 28 Americans. (The third sketch below reproduces the arithmetic.)
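To make the Mars Climate Orbiter mishap concrete, here is a minimal Python sketch of the unit-mismatch failure mode. The function name and numbers are hypothetical stand-ins, not the actual ground software; the point is that nothing in the interface tells the caller which units it is getting.

```python
# One pound-force second = 4.448222 newton-seconds.
LBF_S_TO_N_S = 4.448222

def thruster_impulse_lbf_s() -> float:
    """Hypothetical stand-in for ground software that reports
    small-force thruster data in English units (lbf*s)."""
    return 10.0

# Buggy consumer: silently treats the number as newton-seconds.
impulse_wrong = thruster_impulse_lbf_s()                  # off by a factor of ~4.45

# Correct consumer: converts at the interface boundary.
impulse_right = thruster_impulse_lbf_s() * LBF_S_TO_N_S   # 44.48 N*s

print(impulse_wrong, impulse_right)
```

Every uncorrected maneuver was modeled with about 4.45 times too little impulse, and over months of small trajectory corrections the error compounded until the orbiter plunged into the Martian atmosphere instead of skimming above it.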
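The Ariane 501 failure is just as easy to sketch. This is not the actual Ada flight code, only a Python illustration under the assumption of a conversion into a signed 16-bit register:

```python
INT16_MIN, INT16_MAX = -32768, 32767

def store_as_int16(x: float) -> int:
    """Convert a float for storage in a signed 16-bit register,
    raising on overflow (as the Ada runtime did)."""
    if not INT16_MIN <= x <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a signed 16-bit integer")
    return int(x)

# Hypothetical magnitude: Ariane 4's flight profile kept this value in
# range; Ariane 5 accelerated faster, and the reused code did not check.
horizontal_bias = 40000.0

try:
    store_as_int16(horizontal_bias)
except OverflowError as e:
    # In the real system the exception went unhandled and shut down the
    # inertial reference unit, and also its backup, which ran the same code.
    print("inertial reference system down:", e)
```

The backup unit offered no protection because it ran identical software: redundancy does not help when every copy contains the same bug. That is precisely the worry behind the NRC’s insistence on “diversity,” quoted below.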
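And the Patriot clock drift takes only a few lines to reproduce. The sketch follows the published analyses of the failure (the GAO report and Robert Skeel’s write-up): time was counted in tenths of a second and converted to seconds by multiplying with a chopped 24-bit fixed-point approximation of 1/10, modeled here as 23 fractional bits:

```python
# 1/10 has no finite binary expansion; chopping it to fit the register
# leaves each tick about 9.5e-8 seconds short.
TENTH_CHOPPED = int(0.1 * 2**23) / 2**23   # 0.09999990463...

def uptime_seconds(ticks: int) -> float:
    """Convert a count of 0.1-second clock ticks to seconds, the buggy way."""
    return ticks * TENTH_CHOPPED

hours = 100                     # the battery's continuous uptime at Dhahran
ticks = hours * 3600 * 10       # one tick every tenth of a second
drift = hours * 3600 - uptime_seconds(ticks)
print(f"clock drift after {hours} hours: {drift:.4f} s")   # ~0.3433 s

# A Scud travels roughly 1.7 km/s, so a third of a second of drift moves
# the predicted intercept window by several hundred meters, far enough
# for the radar to conclude there was no incoming missile.
```

A third of a second sounds like nothing, which is exactly the point: the error only becomes lethal when a battery runs for days without a reboot, as the one at Dhahran had.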
On May 20, 2010, a subcommittee of the Nuclear Regulatory Commission met to consider the new Toshiba design proposed for a South Texas nuclear plant. First the staff made its presentations; then engineers from Toshiba and Westinghouse made theirs. The PowerPoint slides were stupefyingly complex. There were acronyms within acronyms. But the one point that a few of the more crotchety subcommittee members kept hammering the staff and the companies’ representatives on was software. Complex software systems have bugs. How can you possibly assure us that these new software-based safety systems will not have bugs?
Finally, toward the end of the exhausting, hours-long meeting, Gary Holahan, deputy director of the NRC’s Office of New Reactors, stepped into the fray and made clear what the NRC itself officially thinks about software bugs.
Digital systems can enhance safety and reliability, but they bring with them new and different issues and concerns, and we need to deal with those. We take these issues very seriously, we take the ACRS’ [Advisory Committee on Reactor Safeguards] concerns [about] these issues very seriously. We won’t always agree on all the details and it’s going to take a fair amount of dialog, but we take the ACRS issues seriously. We intend to address them, we have a very good I think working relationship with the committee, and certainly with the committee staff...
In the past the staff, the Commission, and the ACRS came to what we think is a practical approach for assuring the safety of plants with – based on digital technology. And that is really with two major aspects to the review. One is assuring that the design itself is a good design, that it’s based on up-to-date and state-of-the-art standards, that it is done through a well-planned and structured process, that it’s tested to the extent that systems can be tested, and it’s checked for things like independence and communication between channels and between various parts of the system, all of these things are normally done.
But we also recognize, and maybe this is an inherent characteristic of digital systems, that it is really hard to come to a conclusion that the system is without flaw. The systems are complex, the software is getting more complicated I think all the time, and recognizing how difficult it is to make a determination that a digital-based system is inherently reliable, the Commission took a position, and I think it was a very practical position to say, we need to push the systems to the state of the art that they should be as good as we know how to make them; they should not have any identifiable design flaws, and I think that our interactions with the committee and with the applicant are very good ways to identify areas of concern; but ultimately none of these systems will be good enough that it doesn’t need some kind of diverse -- either diversity within the system or a backup system.
BP is showing us how one shoddy decision can threaten a vast ecosystem. Are you willing to risk the “China syndrome,” hoping that the safety software in future nuclear reactors will never crash?