Pushing the physical limits of hardware is always a fun story, but I was definitely blown away when I saw this poster at Supercomputing. Preferred Networks, seemingly a spin-out from Tokyo University, is shoving multiple big chiplets into a single PCIe card for peak performance and peak power, and it looks like they’re ready to deploy over 10,000 of these cards into a custom supercomputer.

Preferred Networks: A 500 W Custom PCIe Card using 3000 mm2 Silicon

Let’s start with the package, which comes in at 7225 mm2. It’s a standard BGA package with 6457 pins. Within the package are four silicon dies, built on TSMC 12FFC, each of them 756.7 mm2 (32.2 mm x 23.5 mm), which means that this processor totals 3026.8 mm2 of silicon. That’s a far cry from the 800 mm2 of silicon used in high-end compute GPUs, or even the 1000 mm2+ used in high-end EPYC CPUs. Fundamentally this is an astonishing number to get your head around, especially for something that’s meant to go onto a PCIe card.

With the associated heatspreader, the chip sits on the PCB surrounded by 32 GiB of some type of memory. The whole device is a deep learning accelerator, aiming to hit key targets for performance and power. At 524 TeraFLOPs of half-precision (FP16) performance and a 500 W TDP, the chip meets its goal of 1.05 TFLOPs per watt. At 0.55 V, that means the chip is pulling close to 1000 amps at load, so a custom PCB design was required, but the card still connects over PCIe. The card is an extended PCIe design with forced cooling (even in a server), and will sit in a 7U rack-mount chassis. Each server is a dual-socket CPU system with up to four of these cards, providing 2 PetaFLOPs of half-precision DL compute. With the cooling on the card, we are now up to 600 W per card, and that is the figure used when calculating power within a server.
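The headline figures above can be checked with a few lines of arithmetic. This is a minimal sketch, not anything published by Preferred Networks: the constant and function names are made up for illustration, and the current estimate assumes the entire 500 W TDP is drawn at the 0.55 V core voltage, which is a simplification.

```python
# Figures quoted in the article for a single MN-Core chip/card.
PEAK_FP16_TFLOPS = 524.0   # half-precision peak per chip
TDP_WATTS = 500.0          # chip TDP
CORE_VOLTAGE = 0.55        # volts
CARDS_PER_SERVER = 4       # "up to four" cards per server


def tflops_per_watt(tflops: float, watts: float) -> float:
    """Peak compute efficiency in TFLOPs per watt."""
    return tflops / watts


def load_current_amps(watts: float, volts: float) -> float:
    """Current from I = P / V (assumes all power is drawn at one voltage)."""
    return watts / volts


efficiency = tflops_per_watt(PEAK_FP16_TFLOPS, TDP_WATTS)
current = load_current_amps(TDP_WATTS, CORE_VOLTAGE)
server_pflops = PEAK_FP16_TFLOPS * CARDS_PER_SERVER / 1000.0

print(f"Efficiency: {efficiency:.2f} TFLOPs/W")
print(f"Current at load: {current:.0f} A")
print(f"Per-server peak: {server_pflops:.2f} PFLOPs")
```

Under those assumptions the current works out to roughly 909 A, which is where the "close to 1000 amps" figure comes from, and a fully populated server lands at about 2.1 PetaFLOPs.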

The chip is built as part of the MN-Core family. Preferred Networks is a company that specialises in building private supercomputers with specific needs. Since its founding in 2014, it has raised $130m of funding, with almost $97m of it coming from Toyota. Preferred Networks has built three AI supercomputers for Tokyo University since 2017, mostly with P100 and V100 NVIDIA accelerators, with the latest, MN-2, using 1024 V100 SXM2 parts to hit 128 PetaFLOPs total. This new chip is at the heart of Preferred Networks’ latest MN-3 supercomputer, which will be the first with custom silicon.

MN-3 will have four of these chips per 7U server, giving ~2.1 PF of performance. There will be four servers to a rack, and about 300 racks, which comes to 4800 MN-Core boards. This would give 2.5 ExaFLOPs of total half-precision peak performance. David Schor at Wikichip estimates the total power consumption at around 3.36 megawatts, which is considerably more efficient than other systems in the market. MN-3 is expected to enter operation in 2020.
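The system-level numbers follow from the same per-card figures. The sketch below is illustrative only: the variable names are invented here, the rack count is the article's "about 300", and the final efficiency figure simply divides peak compute by the 3.36 MW estimate without assuming anything about how that estimate was derived.

```python
# Per-card figures and system layout as quoted in the article.
CARDS_PER_SERVER = 4
SERVERS_PER_RACK = 4
RACKS = 300                  # "about 300" racks
PEAK_FP16_TFLOPS = 524.0     # per card
CARD_POWER_WATTS = 600.0     # per card, including on-card cooling
ESTIMATED_SYSTEM_MW = 3.36   # Wikichip's whole-system estimate

# Total number of MN-Core boards in the machine.
cards = CARDS_PER_SERVER * SERVERS_PER_RACK * RACKS

# Aggregate peak compute: TFLOPs -> ExaFLOPs is a factor of 1e6.
total_eflops = cards * PEAK_FP16_TFLOPS / 1e6

# Power drawn by the accelerator cards alone, in megawatts.
card_power_mw = cards * CARD_POWER_WATTS / 1e6

# Whole-system efficiency implied by the 3.36 MW estimate.
system_tflops_per_watt = (cards * PEAK_FP16_TFLOPS) / (ESTIMATED_SYSTEM_MW * 1e6)

print(f"Cards: {cards}")
print(f"Peak FP16: {total_eflops:.2f} EFLOPs")
print(f"Card power only: {card_power_mw:.2f} MW")
print(f"System efficiency: {system_tflops_per_watt:.2f} TFLOPs/W")
```

The cards alone account for about 2.88 MW at 600 W each, so the 3.36 MW estimate leaves room for the host CPUs and the rest of the system.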

David has also done some digging into how this chip came to be. In our photos, we can clearly see the marking ‘GRAPE-PFN2’ on the silicon: GRAPE is the name of Tokyo University’s internal silicon projects, and PFN2 stands for Preferred Networks. Tokyo U has a number of custom silicon projects under the GRAPE banner: one for gravitational calculations, one for many-body calculations, one for molecular dynamics, and so on. It turns out that the Preferred Networks team was originally part of the GRAPE-DR physics co-processor project, which is why the diagram of the architecture shown at Supercomputing looks so similar.

Each die consists of two die-to-die interconnects, partnered with some scheduling engines and PCIe fabric, and the compute happens in four big ‘Level 2 Blocks (L2Bs)’. Each L2B features eight L1Bs and a shared cache, and inside each L1B are sixteen Matrix Arithmetic Blocks (MABs) as well as an L1 shared cache. Each MAB has four processing engines (PEs) as well as a matrix arithmetic unit (MAU), which seems built to perform matrix multiplication and addition. Altogether, one die has 512 MABs, consisting of 2048 PEs and 512 MAUs. A whole chip would therefore have 2048 MABs, 8192 PEs, and 2048 MAUs. Keep scaling up and it becomes apparent how the high performance numbers can be achieved. Normally all of these units work in 16-bit, although combining PEs means that higher precision can be achieved.
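The block hierarchy multiplies out as follows. This is a small sketch for checking the counts, with a hypothetical `MNCoreHierarchy` class whose field names are invented here; the fan-out values are the ones described above, and one MAU per MAB is inferred from the matching MAB and MAU totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MNCoreHierarchy:
    """Fan-out at each level of the compute hierarchy (illustrative names)."""
    dies_per_chip: int = 4
    l2b_per_die: int = 4
    l1b_per_l2b: int = 8
    mab_per_l1b: int = 16
    pe_per_mab: int = 4
    mau_per_mab: int = 1  # inferred: MAU totals equal MAB totals

    def mabs_per_die(self) -> int:
        # L2Bs x L1Bs per L2B x MABs per L1B
        return self.l2b_per_die * self.l1b_per_l2b * self.mab_per_l1b

    def pes_per_die(self) -> int:
        return self.mabs_per_die() * self.pe_per_mab

    def maus_per_die(self) -> int:
        return self.mabs_per_die() * self.mau_per_mab

    def per_chip(self, per_die_count: int) -> int:
        # A chip is four dies in one package.
        return per_die_count * self.dies_per_chip


h = MNCoreHierarchy()
print("Per die: ", h.mabs_per_die(), "MABs,", h.pes_per_die(), "PEs,",
      h.maus_per_die(), "MAUs")
print("Per chip:", h.per_chip(h.mabs_per_die()), "MABs,",
      h.per_chip(h.pes_per_die()), "PEs,",
      h.per_chip(h.maus_per_die()), "MAUs")
```

Changing any single fan-out value shows how quickly the unit counts grow, which is the scaling the architecture relies on.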

Preferred Networks has no plans to sell these chips or servers commercially. They have a specific customer in mind, and are likely building the software stack with deep integration with that partner. But still, 3000 mm2 of silicon per chip? 500 W per chip? Insane.
