Pacific Northwest National Laboratory
Harnessing Hundreds of Thousands of Processors
Soft Landings
In addition to harnessing the power of healthy supercomputers, Nieplocha tries to prepare for the unavoidable failures. While the individual components of any supercomputer — processors, circuit boards, hard drives, power supplies — are highly reliable, the sheer numbers of components used in high-end systems make failure a certainty. Supercomputer failures carry a high price in time and money. A single simulation may take months of preparation. Researchers may wait weeks for computer time. And some big simulations take days or even weeks to run.
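The arithmetic behind that certainty can be sketched in a few lines. This is only an illustration: it assumes components fail independently, and the per-component failure probability and the component counts are made-up numbers, not figures from Nieplocha's team.

```python
def prob_any_failure(p_component: float, n_components: int) -> float:
    """Chance that at least one of n independent components fails.

    p_component is the (assumed) probability that a single component
    fails during one run; independence is a simplifying assumption.
    """
    # Probability that every component survives, subtracted from 1.
    return 1.0 - (1.0 - p_component) ** n_components


# Illustrative numbers only: the same very reliable part,
# used in a small machine and in a very large one.
small_system = prob_any_failure(1e-5, 100)
large_system = prob_any_failure(1e-5, 1_000_000)
```

With these assumed values the small system sees a failure in roughly 0.1 percent of runs, while the large system sees one in more than 99.9 percent of runs, even though each part is equally reliable in both.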
“Most applications running on supercomputers are not fault-tolerant,” Nieplocha says. “If a hard drive or memory chip fails, users typically lose their data and have to restart from scratch.”
One solution is to save the work periodically. PC word processors and spreadsheets do this automatically by making mirror images of documents. Saving a simulation that sprawls across thousands of different nodes and processors is far more complex. “You have to think about what to save and where and when to save it. The applications are very complex, going through different types of stages, and some points simply cannot be saved,” Nieplocha explains.
To resolve these issues, Nieplocha’s team uses virtualization, an approach that traces back to mainframe computer days. It involves slipping a layer of software, called a hypervisor, between the hardware and the operating system that controls it. Instead of delivering instructions directly to the machine, the operating system talks to the hypervisor, which whispers those instructions to the hardware.
This sounds inefficient, but recent advances have made hypervisors much more economical and, for the first time ever, worthwhile, Nieplocha says. “Today, using a modern hypervisor, we will periodically stop each node and map the application state and operating system memory. Our software writes the image to a hard drive, then moves on to the next node. If a node fails, we will sense the problem, retrieve the saved image, and mount it on a healthy node so that the application can continue from the last checkpoint,” Nieplocha says.
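The checkpoint-and-restart cycle Nieplocha describes can be sketched as a toy model. Everything named here (`Node`, `CheckpointManager`, `demo`) is hypothetical and is not the team's software: a dictionary stands in for a node's application state and operating system memory, another dictionary stands in for the hard drive, and failure detection is reduced to a flag.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    state: dict = field(default_factory=dict)  # stand-in for app + OS memory
    healthy: bool = True


class CheckpointManager:
    """Hypothetical sketch: saves node images and remounts them on spares."""

    def __init__(self):
        self.images = {}  # slot -> saved image (stand-in for the hard drive)

    def checkpoint_all(self, assignment):
        # Visit the nodes one at a time, saving each image before moving on.
        for slot, node in assignment.items():
            if node.healthy:
                self.images[slot] = copy.deepcopy(node.state)

    def recover(self, assignment, spares):
        # Remount the last saved image of every failed node on a spare node.
        # Assumes a spare and a saved image exist for each failed slot.
        moved = []
        for slot, node in list(assignment.items()):
            if not node.healthy:
                spare = spares.pop()
                spare.state = copy.deepcopy(self.images[slot])
                assignment[slot] = spare
                moved.append(slot)
        return moved


def demo():
    assignment = {0: Node("n0", {"step": 0}), 1: Node("n1", {"step": 0})}
    spares = [Node("spare0")]
    manager = CheckpointManager()

    for node in assignment.values():
        node.state["step"] = 10        # the application makes progress
    manager.checkpoint_all(assignment)  # periodic checkpoint

    for node in assignment.values():
        node.state["step"] = 13        # more progress, not yet saved
    assignment[1].healthy = False       # a node fails

    moved = manager.recover(assignment, spares)
    return assignment, moved
```

In `demo()`, the work in slot 1 resumes on the spare node from step 10, the last checkpoint, and the three unsaved steps are lost. The sketch leaves out what makes the real problem hard: choosing points at which the state can safely be saved, and keeping the surviving nodes consistent with the restored one.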
“People have worked on this problem for some time, but the results were specific to individual computers or basic academic research,” he adds. “People describe how they made it work, write a paper and move on to other challenges. We’re creating a more practical solution that can manage multiple nodes automatically and reconfigure the system if there is a failure.”
Nieplocha’s team acknowledges that virtualization slows computation, though they believe their technique makes the slowdown negligible. Given the high cost of failure, they expect most users would prefer a minor delay to a failure that leaves them without critical results for months at a time.
Fault tolerance was never much of a problem in a chariot, but then, chariots were relatively simple machines. Modern high-performance supercomputers are anything but simple. With their cluster-based architecture, they promise an unprecedented combination of flexibility and power. One day they may help us produce pollution-free fusion energy or develop new genetically based medications.
The goal of Jarek Nieplocha and his team is to make that power and flexibility more easily accessible to those who need it. It’s just a matter of harnessing all those processors so they pull in the same direction.
