Pacific Northwest National Laboratory
Harnessing Hundreds of Thousands of Processors
By Alan S. Brown
What do ancient chariots and modern high-performance supercomputers have in common? Both started with a single engine pulling the load. With chariots, that motor was a horse. To go faster, drivers bred larger, stronger horses.
With supercomputers, it was a high-performance processor. To boost performance, engineers created bigger, faster, and more expensive processors.
The big breakthrough in chariots came when drivers harnessed teams of horses. By distributing the load among two, four, or even six horses, chariots could go faster and farther than any one horse could pull them. For this to work, however, drivers had to train their horses to start, stop, and turn together. Otherwise, each animal would bolt in a different direction during the tumult of battle.
Supercomputer architects made a similar breakthrough. They learned to break big calculations into smaller parts that multiple processors could solve simultaneously. Instead of a single processor, computers were composed of many networked processors, pushing calculation speeds into the stratosphere.
Parallel computers also slashed costs, since they replaced expensive custom processors with the same off-the-shelf chips used to run servers and workstations. The world’s fastest computer, IBM’s BlueGene/L, uses more than 130,000 IBM PowerPC processors. A new model unveiled in 2007, the BlueGene/P, will use almost 900,000 processors.
This is where the similarity between chariots and high-performance supercomputers breaks down. The mind easily wraps around the task of training a handful of willful horses. It’s simply a matter of training one horse to lead and the others to follow. But how do you yoke together hundreds of thousands of computer processors in a single harness? How do they share data, store information, or recover when one node fails?
“Our goal is to find a practical solution for those problems,” says Jarek Nieplocha of the Department of Energy’s Pacific Northwest National Laboratory (PNNL) in Richland, Washington.
“In academia, most people do research, write a paper, and then forget about it and move on to another problem,” he explains. “In DOE national laboratories we’re developing real applications here, and our goal is to create software that has an impact on science. We want to create something that benefits everyone.”
Managing Memory
Benchmark results showing that the per-processor throughput using ScalaBLAST is essentially constant.
Memory is a key issue in supercomputers. In a conventional PC, each processor has its own local memory. In a parallel supercomputer, each processor still has its own local memory but also shares memory with the hundreds or thousands of other processors in the system. In large simulations, where the results of one set of calculations drive another and another and another, processors are constantly reading, writing, and exchanging data in memory.
Most supercomputers share information using a standard called the Message Passing Interface (MPI). It’s a complicated technique that Nieplocha compares to postal delivery.
“Let’s say I want to send data,” he begins. “So I put it in an envelope and drop it in the mailbox. The post office comes, picks up my letter, and delivers it to a post office in another city, which puts it in the right mailbox. Then someone goes to the mailbox and picks up the letter.”
In computers, “this two-sided process requires handshaking: both the sender and the receiver need to know what’s being moved and agree on when to do it. It is laborious, expensive, and difficult to program this type of communication.”
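To make the postal analogy concrete, here is a minimal sketch of a two-sided exchange. It is a toy model built on Python threads and a queue, not MPI itself, and the names sender, receiver, and two_sided_transfer are hypothetical, chosen only for illustration. The point is that the transfer completes only when one side posts the data and the other side explicitly asks for it.

```python
import queue
import threading


def sender(mailbox, payload):
    # The sending side must explicitly post the data (drop the letter in the mailbox).
    mailbox.put(payload)


def receiver(mailbox, results):
    # The receiving side must explicitly ask for the data.
    # get() blocks until a message has been posted (or 5 seconds pass).
    results.append(mailbox.get(timeout=5))


def two_sided_transfer(payload):
    """Move payload from one simulated process to another via a shared mailbox."""
    mailbox = queue.Queue(maxsize=1)  # stands in for the network between two nodes
    results = []
    recv = threading.Thread(target=receiver, args=(mailbox, results))
    send = threading.Thread(target=sender, args=(mailbox, payload))
    recv.start()
    send.start()
    send.join()
    recv.join()
    return results[0]


if __name__ == "__main__":
    print(two_sided_transfer([1.0, 2.0, 3.0]))
```

If either thread were left out, the other would stall or time out. That matched pairing of a send with a receive is the bookkeeping Nieplocha describes as handshaking.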
Nieplocha’s alternative is an approach called Global Arrays. “You don’t have to send data because it is visible to every processor without handshaking,” he says.
This works because users first define where all their data will be within a matrix, or array. The array itself is a logical object — that is, software sees it as a single entity even though portions of the array (and the data it holds) may reside on hundreds or even thousands of different memory chips. As long as users know which part of the array holds their data, they can access it without complex handshaking and tracking routines.
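The idea of a single logical array spread over many memories can be sketched in a few lines. The example below is an illustrative toy that runs in one Python process. It is not the Global Arrays software or its API, and the class name ToyGlobalArray and its methods are assumptions made for this sketch. Each inner list stands in for one processor's local memory, and a global index is translated into an owner and an offset behind the scenes.

```python
class ToyGlobalArray:
    """A 1-D array split into equal-sized blocks, one block per simulated process."""

    def __init__(self, length, nprocs):
        self.length = length
        self.block = -(-length // nprocs)  # ceiling division: elements per block
        # Each inner list stands in for one process's local memory.
        self.local = [
            [0.0] * max(0, min(self.block, length - p * self.block))
            for p in range(nprocs)
        ]

    def locate(self, i):
        """Map a global index to (owning process, offset in its local memory)."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        return divmod(i, self.block)

    def put(self, start, values):
        """Write values into the global index range beginning at start."""
        for k, v in enumerate(values):
            owner, offset = self.locate(start + k)
            self.local[owner][offset] = v

    def get(self, start, stop):
        """Read global indices start..stop-1, wherever they physically live."""
        out = []
        for i in range(start, stop):
            owner, offset = self.locate(i)
            out.append(self.local[owner][offset])
        return out


if __name__ == "__main__":
    ga = ToyGlobalArray(length=10, nprocs=4)
    ga.put(2, [1.0, 2.0, 3.0, 4.0])  # this write spans two simulated memories
    print(ga.get(1, 7))
    print(ga.locate(4))
```

The caller works only with global indices. The code that owns a block never has to post a matching receive, which is the sense in which the access is one-sided.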

