China Tianhe-2 Supercomputer will have over 55 petaflops this month

Dr. Jack Dongarra from Oak Ridge National Lab, one of the founders of the Top500, was on hand for the event in China and shared a draft document that offers deep detail on the full scope of the Tianhe-2, which will, barring any completely unexpected surprises, far surpass the Cray-built Titan.

The 16,000-node Inspur-built Tianhe-2 is based on Ivy Bridge (32,000 sockets) and 48,000 Xeon Phi boards, meaning a total of 3,120,000 cores. Each of the nodes sports 2 Ivy Bridge sockets and 3 Phi boards.
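
The core total can be checked with a little arithmetic. The sketch below is illustrative only: the per-chip core counts (12 cores per Ivy Bridge socket, 57 per Xeon Phi board) are not stated in this article and are assumptions here; the node, socket, and board counts are the ones quoted above.

```python
# Figures quoted in the article.
NODES = 16_000
CPU_SOCKETS_PER_NODE = 2
PHI_BOARDS_PER_NODE = 3

# ASSUMPTION: per-chip core counts are not given in the article.
CORES_PER_IVY_BRIDGE = 12
CORES_PER_XEON_PHI = 57


def total_cores(nodes, cpus_per_node, phis_per_node, cpu_cores, phi_cores):
    """Return the total core count across all nodes."""
    cores_per_node = cpus_per_node * cpu_cores + phis_per_node * phi_cores
    return nodes * cores_per_node


cpu_sockets = NODES * CPU_SOCKETS_PER_NODE
phi_boards = NODES * PHI_BOARDS_PER_NODE
cores = total_cores(
    NODES,
    CPU_SOCKETS_PER_NODE,
    PHI_BOARDS_PER_NODE,
    CORES_PER_IVY_BRIDGE,
    CORES_PER_XEON_PHI,
)

print(f"CPU sockets: {cpu_sockets:,}")
print(f"Phi boards:  {phi_boards:,}")
print(f"Total cores: {cores:,}")
```

Under those assumed core counts, each node contributes 195 cores, which matches the 3,120,000 total quoted above.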

According to Dongarra's report on his visit to the National University for Defense Technology in Changsha, China, there are some new and notable LINPACK results:

I was sent results showing a run of the HPL benchmark using 14,336 nodes. That run used 50 GB of the memory of each node and achieved 30.65 petaflops out of a theoretical peak of 49.19 petaflops, an efficiency of 62.3% of theoretical peak performance, and took a little over 5 hours to complete. The fastest result shown was using 90% of the machine. They are expecting to make improvements and increase the number of nodes used in the test.
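
The efficiency figure follows directly from the two performance numbers quoted above. The short sketch below recomputes it, along with the fraction of the machine used. The full-machine peak it prints is a simple linear extrapolation from the 14,336-node peak, which is an assumption of this sketch rather than a number taken from the report.

```python
# Figures quoted in the report excerpt.
HPL_NODES = 14_336
TOTAL_NODES = 16_000
ACHIEVED_PFLOPS = 30.65
PEAK_PFLOPS_USED_NODES = 49.19


def hpl_efficiency(achieved, peak):
    """Fraction of theoretical peak actually delivered by the benchmark."""
    return achieved / peak


efficiency = hpl_efficiency(ACHIEVED_PFLOPS, PEAK_PFLOPS_USED_NODES)
machine_fraction = HPL_NODES / TOTAL_NODES

# Peak per node in Tflop/s (1 Pflop/s = 1000 Tflop/s).
peak_per_node_tflops = PEAK_PFLOPS_USED_NODES * 1000 / HPL_NODES

# ASSUMPTION: peak scales linearly with node count.
full_machine_peak_pflops = peak_per_node_tflops * TOTAL_NODES / 1000

print(f"HPL efficiency:          {efficiency:.1%}")
print(f"Fraction of nodes used:  {machine_fraction:.1%}")
print(f"Peak per node:           {peak_per_node_tflops:.3f} Tflop/s")
print(f"Extrapolated full peak:  {full_machine_peak_pflops:.2f} Pflop/s")
```

The run used 89.6% of the nodes, which the report rounds to 90%.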

The system will be housed at the National Supercomputer Center in Guangzhou and is intended to provide an open platform for research and education and a high performance computing service for southern China. The Tianhe-2 (TH-2) is also called the Milkyway-2 supercomputer.

There are a number of features of the TH-2 that are Chinese in origin, unique and interesting, including the TH-Express 2 interconnection network, the Galaxy FT-1500 16-core processor, the OpenMC programming model, their high density package, and the apparent reliability and scalability of the system.
...
The Interconnect

The TH Express-2 uses a fat tree topology with 13 switches, each with 576 ports, at the top level. It is a proprietary network built on an optoelectronic hybrid transport technology, and the interconnect uses their own chip set. The high radix router ASIC, called NRC, has a 90 nm feature size with a 17.16x17.16 mm die and 2577 pins. The throughput of a single NRC is 2.56 Tbps. The network interface ASIC, called NIC, has the same feature size and package as the NRC; its die size is 10.76x10.76 mm, it has 675 pins, and it uses PCI-E Gen2 x16. A broadcast operation via MPI was running at 6.36 GB/s, and the latency measured with 1K of data within 12,000 nodes is about 9 us.
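
As a rough aid to reading those figures, the sketch below derives two quantities from the numbers quoted above: the total port count at the top level of the fat tree and the die area of each ASIC. It uses only values given in the excerpt and does not attempt to model how the lower levels of the tree are wired, since that is not described here.

```python
# Figures quoted in the report excerpt.
TOP_LEVEL_SWITCHES = 13
PORTS_PER_SWITCH = 576
NRC_DIE_SIDE_MM = 17.16   # router ASIC, square die
NIC_DIE_SIDE_MM = 10.76   # network interface ASIC, square die


def square_die_area(side_mm):
    """Area of a square die in mm^2."""
    return side_mm * side_mm


top_level_ports = TOP_LEVEL_SWITCHES * PORTS_PER_SWITCH
nrc_area = square_die_area(NRC_DIE_SIDE_MM)
nic_area = square_die_area(NIC_DIE_SIDE_MM)

print(f"Top-level switch ports: {top_level_ports:,}")
print(f"NRC die area: {nrc_area:.1f} mm^2")
print(f"NIC die area: {nic_area:.1f} mm^2")
```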
...
The Software Stack

The Tianhe-2 is using Kylin Linux as the operating system. Kylin is an operating system developed by the National University for Defense Technology, and successfully approved by China’s 863 Hi-tech Research and Development Program office in 2006. See http://en.wikipedia.org/wiki/Kylin_(operating_system) for additional details. Kylin is compatible with other mainstream operating systems and supports multiple microprocessors and computers of different structures. The Kylin packages all include standard open source and public packages. This is the same OS used in the Tianhe-1A. Resource management is based on SLURM. They have power-aware resource allocation and use multiple custom scheduling policies.
...
Energy Efficiency

To compute the flops/watt one can take the power under load for the whole system (processors, memory and interconnect) at 17.6 MW and multiply it by the fraction of the machine used to run the benchmark, in this case 14,336 nodes of the total 16,000 nodes, or about 90% of the machine. That gives roughly 15.84 MW for the nodes in the run. The performance achieved was 30.65 Pflop/s, which works out to 1.935 Gflop/s per watt.
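
The 1.935 figure can be reproduced by scaling the full-system power by the share of the machine used in the run and dividing the achieved performance by the result. The sketch below does this with the rounded 90% share used in the text and, for comparison, with the exact node ratio of 14,336/16,000. Treating power as proportional to node count is the simplification implied by the text, not a measured value.

```python
# Figures quoted in the text.
ACHIEVED_PFLOPS = 30.65
FULL_SYSTEM_POWER_MW = 17.6
HPL_NODES = 14_336
TOTAL_NODES = 16_000


def gflops_per_watt(pflops, full_power_mw, fraction_used):
    """Energy efficiency in Gflop/s per watt.

    ASSUMPTION: power drawn by the benchmark run scales linearly with
    the fraction of nodes used.
    """
    gflops = pflops * 1e6                          # Pflop/s -> Gflop/s
    watts = full_power_mw * 1e6 * fraction_used    # MW -> W, scaled to the run
    return gflops / watts


rounded = gflops_per_watt(ACHIEVED_PFLOPS, FULL_SYSTEM_POWER_MW, 0.90)
exact = gflops_per_watt(
    ACHIEVED_PFLOPS, FULL_SYSTEM_POWER_MW, HPL_NODES / TOTAL_NODES
)

print(f"Using 90% of the machine:   {rounded:.3f} Gflop/s per watt")
print(f"Using 14,336/16,000 nodes:  {exact:.3f} Gflop/s per watt")
```

With the exact node ratio (89.6%) the result is slightly higher, about 1.944 Gflop/s per watt, so the quoted 1.935 depends on rounding the share to 90%.
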
The Top 5 systems on the Top 500 list have the following Gflops/Watt efficiency:

http://nextbigfuture.com/2013/06/china-tianhe-2-supercomputer-based-on…