The triple-FPGA E-Series: optimised for ultra-low latency inter-FPGA communication



Posted on August 28, 2018 by Matthew Knight

An Introduction to the Metamako E-Series

Metamako's E-Series devices support up to three Xilinx UltraScale™ or UltraScale™+ FPGAs in a one or two rack unit (RU) form factor. The UltraScale™/UltraScale™+ series from Xilinx used in the E-Series devices is a significant step forward from the K-Series devices. Key improvements over the K-Series include:

  • three times as many logic cells
  • eight times as many DSP slices
  • six times as much memory
  • significantly lower-latency Ethernet transceivers
  • support for higher logic clock frequencies than ever before
  • significantly more IO bandwidth
  • all the above while consuming less power than the previous generation

There are three E-Series chassis variants available with 32, 48 and 96 ports respectively, allowing clients to choose the right balance between port density and performance for their applications. Each device comes with a full Layer 1 switch fabric and contains a range of management and monitoring features including: per-port Ethernet statistics, monitoring telemetry and an x86 processor running MOS, Metamako's Linux-based operating system, offering enterprise-grade device and application management.

When ordered with three FPGAs, there are multiple dedicated communication paths between the FPGAs and the Ethernet network, offering flexibility and the ability to optimise communication latency. As there are currently a number of E-Series permutations with slightly different physical characteristics, the characteristics described below refer to the M48EP/ED variants (i.e. 48/96-port, 1 RU/2 RU devices with three UltraScale™ or UltraScale™+ FPGAs). Details of all variants, including the A32E and C96E, can be found here.

Network Connectivity

The "Central" FPGA is connected to the Layer 1 switch fabric with 56 Ethernet transceivers. Each "Leaf" FPGA is connected to the Layer 1 switch fabric with 14 Ethernet transceivers. The Layer 1 switch fabric is also connected to the front-panel ports of the device for external connectivity. This Ethernet connectivity provides enormous flexibility in communication between the FPGAs and the external network.
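As a quick sanity check, the transceiver counts above can be tallied in a few lines (the dictionary keys are illustrative labels, not product terms):

```python
# Layer 1 switch fabric connectivity for the M48EP/ED variants,
# using the transceiver counts quoted in the text.
fabric_links = {
    "central": 56,  # Central FPGA <-> Layer 1 switch fabric
    "leaf_a": 14,   # each Leaf FPGA <-> Layer 1 switch fabric
    "leaf_b": 14,
}

total_transceivers = sum(fabric_links.values())
print(total_transceivers)  # 84 transceivers between the FPGAs and the fabric
```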

Moreover, the Layer 1 switch can replicate data from incoming ports to multiple destination ports. For example, incoming UDP/IP multicast data coming into the Layer 1 switch from the network can be delivered to multiple transceivers (if desired) on one, two or all FPGAs in ~3 ns. One or multiple links can be configured between the FPGAs' ports in any required combination. Any of the FPGAs can therefore communicate with each other, or any of the front-panel ports via an Ethernet interface with a latency of ~50 ns (depending upon which MAC + PHY is used). Also read: Demystifying the MAC, PCS and PHY and how to measure their latency.

[Figure: E-Series FPGA network connectivity]


The E-Series Metamako Parallel Bus (MMP)

The MMP is a bus leveraging a large number of parallel IOs. Metamako has released the Metamako Parallel Bus IP cores package, allowing client applications to communicate between FPGAs with the absolute minimum of latency. The first IP core offers a unidirectional bandwidth of 10 Gbps per link and provides an AXI4-Stream interface.

There are four links connecting each Leaf FPGA to the Central FPGA and two links connecting the two Leaf FPGAs together. The direction of each MMP link can be configured individually. The key advantage that the MMP bus offers over Ethernet is that, being a parallel rather than a serial bus, its latency is significantly lower. MMP offers one-way transfer latencies between FPGAs of ~8 ns.
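The MMP topology described above can be sketched as follows. The link counts and the 10 Gbps per-link rate come from the text; the direction assignment in the example is arbitrary, since each link is individually configurable:

```python
# MMP link topology for the triple-FPGA E-Series (counts from the text).
# Each link carries 10 Gbps in one configurable direction.
LINK_GBPS = 10

mmp_links = {
    ("leaf_a", "central"): 4,  # four links between each Leaf and the Central FPGA
    ("leaf_b", "central"): 4,
    ("leaf_a", "leaf_b"): 2,   # two links between the two Leaf FPGAs
}

# Example configuration: half of a leaf-to-central bundle in each direction.
to_central_gbps = (mmp_links[("leaf_a", "central")] // 2) * LINK_GBPS
print(to_central_gbps)  # 20 Gbps from one Leaf towards the Central FPGA
```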


Inter-FPGA communication latencies compared

[Figure: E-Series FPGA connectivity]


The latency breakdown below (see also the graphic above) illustrates the differences relative to a single-FPGA reference. The latency benefit of inter-FPGA communication via MMP is significant in comparison to Ethernet.

In the following example, the latency savings of having FPGAs communicate locally via MMP links (~8 ns) versus having FPGAs on different devices communicating via 10GbE (~55 ns) are substantial. MMP offers an 85% latency reduction compared to 10GbE.
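The 85% figure follows directly from the two latencies quoted:

```python
# One-way latency figures quoted above (approximate)
mmp_ns = 8        # local inter-FPGA transfer via an MMP link
ethernet_ns = 55  # FPGA-to-FPGA on different devices via 10GbE

reduction = (ethernet_ns - mmp_ns) / ethernet_ns
print(f"{reduction:.0%}")  # 85%
```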


What about PCI Express?

Plugging an FPGA PCI Express board into a server remains popular. The rationale is sound: software running on the server's CPUs has an efficient, low-latency path with which to communicate with the FPGA. This rationale, however, is predicated on the communication between the host application and the FPGA being latency-critical; otherwise, why not simply use Ethernet to communicate between the host and the FPGA?

Architecturally, this need for a low-latency host path is often due to a lack of FPGA capacity: it has not always been possible to run all the latency-critical code on the FPGA, so part of it had to run on the host. The current generation of Xilinx UltraScale™/UltraScale™+ FPGAs now has sufficient capacity to implement, in many cases, the entirety of the latency-critical portion of a trade on the FPGA (or FPGAs, in the case of the E-Series).

The non-latency sensitive part of the trade can still reside on the server but now communicate with FPGAs over the Ethernet network with a significantly simpler interface. There is also the added benefit that all communication between the FPGA and the controlling host can be monitored and timestamped as it traverses the network. This can be extremely useful to meet requirements such as record-keeping and compliance.
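As a sketch of what a "significantly simpler interface" might look like, the host could exchange small fixed-layout control messages with the FPGA over UDP. The message layout, field names and opcodes below are entirely hypothetical, not a Metamako protocol:

```python
import struct

# Hypothetical host-to-FPGA control message. A fixed-layout struct is
# trivial for FPGA logic to parse, and every copy of it traversing the
# Layer 1 switch can be captured and timestamped for compliance.
MSG_FORMAT = "!HHIq"  # version, opcode, strategy_id, price (network byte order)

def encode(version, opcode, strategy_id, price):
    return struct.pack(MSG_FORMAT, version, opcode, strategy_id, price)

def decode(payload):
    return struct.unpack(MSG_FORMAT, payload)

msg = encode(1, 2, 7, 10150)  # e.g. "update limit price" (made-up opcode)
print(len(msg))  # 16-byte fixed-size payload
```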

Example use cases for the triple-FPGA E-Series in electronic trading

  • One "stable" FPGA trading image can be retained with the remaining two FPGAs available for proving out strategies or experimentation
  • Risk checks can be run on a different FPGA from the actual trading algorithms whilst communicating with significantly lower latency than previously possible
  • Higher FPGA density means that multiple trading strategies can be implemented in as little as 1 RU, with the option to segregate them across three FPGA images
  • Three FPGAs contain over 1 Gb of on-chip RAM, which is a real alternative to externally connected QDR-II/QDR-II+ (using the Xilinx UltraScale™+ VU9P as an example)
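The "over 1 Gb" claim can be sanity-checked against the VU9P's on-chip memory. The figures below are approximate datasheet values (block RAM plus UltraRAM) and should be verified against Xilinx's product tables:

```python
# Approximate on-chip memory of a Xilinx UltraScale+ VU9P, in megabits
# (assumed datasheet values: block RAM plus UltraRAM).
BRAM_MB = 75.9
ULTRARAM_MB = 270.0

per_fpga_mb = BRAM_MB + ULTRARAM_MB
three_fpgas_mb = 3 * per_fpga_mb
print(three_fpgas_mb)  # ~1037.7 Mb, i.e. just over 1 Gb (1024 Mb)
```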


  • Metamako offers 1 and 2 RU E-Series devices containing up to three Xilinx UltraScale™/UltraScale™+ FPGAs
  • The E-Series devices offer the same enterprise-grade management and core features offered by all Metamako devices
  • All three FPGAs are connected to the device's Layer 1 Ethernet switch with as many as 84 transceivers
  • The FPGAs are connected to each other via two (leaf-to-leaf) or four (leaf-to-centre) MMP links, offering applications inter-FPGA communication in ~8 ns
  • In an electronic trading use case benefiting from multiple FPGAs, the E-Series platform offers significantly lower latency communication between FPGAs than Ethernet-connected FPGAs

Further reading:

>> 5 Things to consider when choosing an FPGA platform

>> FPGA Platforms: Why Metamako is best of breed

>> 4 Key Trends in the networked use of FPGAs

Contact us to talk about an evaluation of the Metamako E-Series.