Demystifying the MAC, PCS and PHY & how to measure their latency

Posted on August 13, 2018 by Matthew Knight

Many have heard at least one of these three-letter labels and may be aware that they relate to something to do with computer networking. Surely though, they are not associated with things that only makers of network adapters or network switches worry about?

Until a few years ago, that assumption was quite correct. However, with the impressive growth of FPGA-based applications across multiple markets, connecting these applications to the network as efficiently as possible has become a priority. A number of commercial FPGA fabric offerings are available to interface FPGA applications to the Ethernet network with minimum latency. However, there are currently no generally accepted, standardised ways to measure and quote the latency of these offerings, and hence no way to make valid comparisons of their features or performance.

1.  An introduction - 10 Gigabit Ethernet

Computer scientists love layered models as they allow pipelined functionality to be implemented within the layers, abstracted away from the connections between them. In simple terms, implementations for a given layer can be swapped and, as long as the interfaces remain the same, everything should just work. These three-letter acronyms (TLAs) live right down in the low-level plumbing of computer networking. To carry this analogy further, they define roughly what can flow through a pipe, the specifications of the pipe and how pipes get linked together. For the purpose of this blog, we will focus on 10 Gigabit Ethernet (10GbE), which is widely used in electronic trading.

Wired Ethernet exists as a collection of standards defined and ratified by the 802.3 working group of the Institute of Electrical and Electronics Engineers (IEEE). The first 10GbE standard, 10GbE over fibre, was ratified in 2002 as 802.3ae, with support for other media following over subsequent years. Ethernet is defined as its own layered model, which contains layers with these and some other associated TLAs:

[Figure: 10GbE layered architecture diagram]

Source: 802.3-2015 IEEE Standard for Ethernet

 

The above is quite a busy architectural diagram that comes straight out of the 802.3-2015 IEEE Standard for Ethernet. It essentially covers the 10GbE specification's layered model, in which the FPGA application interfaces with, and sits above (in the diagram), the MAC layers. On the other side of the MAC, the incoming and outgoing data streams pass through the PCS, PMA and PMD layers to communicate with the physical fibre or cable.

1.1  What do the layers actually do? 

[Figure: 10GbE layers 1-2]

In the context of an FPGA application talking to a 10GbE switch, we start at the bottom and work our way up:

  1. The physical medium is quite likely to either be a fibre patchcord or a twinaxial copper cable.
  2. This is plugged into a transceiver which implements the PMD, converting whatever the physical medium carries (light or electrons) into a stream of electrons matching a defined interface to the PMA.
  3. The PMA is part of an FPGA chip and does a number of things. It works out where each bit begins and ends in the stream and recovers the clock from the stream. It then deserialises the incoming stream at 10.3125 Gbps into, typically, a 32 or 64 bit bus clocked at ~322 MHz or ~161 MHz respectively, which is passed to the PCS.
  4. The PCS first synchronises with the stream by identifying the sync header bits that mark the start of each line-encoded block. It then descrambles the blocks, which were scrambled with a polynomial scrambler designed to prevent DC bias and long runs of identical bits (too many zeros or ones in succession could cause the PMA and PCS to lose lock or introduce bit errors), and passes the descrambled stream to the MAC. Some models of FPGA have hard IP implementations of this layer.
  5. The MAC converts the decoded stream into Ethernet frames which it passes to the FPGA application. It also flags errors in the stream and maintains statistics. A PTP-aware MAC also timestamps Ethernet frames for PTP clock synchronisation.

Sending (TX) data from the application to the network is simply the reverse of receiving (RX) it with the exception that the PMA does not need to perform clock recovery. The term PHY is simply a collective term for those layers operating at the physical layer of the OSI model. In summary, for an FPGA application to communicate using 10GbE, an implementation containing the functionality in all these layers is required.
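
The descrambling step in item 4 can be illustrated with a short sketch. It assumes the self-synchronising scrambler polynomial 1 + x^39 + x^58 that 802.3 specifies for the 10GBASE-R PCS, and it models bits as a Python list rather than the 64-bit words a hardware implementation would process per clock. The function names are illustrative only.

```python
# Sketch of a self-synchronising scrambler/descrambler using the
# polynomial 1 + x^39 + x^58. Bits are modelled as lists of 0/1 integers.

MASK58 = (1 << 58) - 1  # the shift register holds the last 58 line bits


def _taps(state: int) -> int:
    """XOR of the line bits seen 39 and 58 bit-times ago."""
    return ((state >> 38) & 1) ^ ((state >> 57) & 1)


def scramble(data_bits, state=0):
    """Return (scrambled bits, final state). Bit 0 of state is the newest bit."""
    out = []
    for d in data_bits:
        s = d ^ _taps(state)
        out.append(s)
        state = ((state << 1) | s) & MASK58  # feed back the scrambled bit
    return out, state


def descramble(line_bits, state=0):
    """Return (recovered bits, final state)."""
    out = []
    for s in line_bits:
        out.append(s ^ _taps(state))
        state = ((state << 1) | s) & MASK58  # history is the received bits
    return out, state


if __name__ == "__main__":
    payload = [0] * 200                      # a long run of zeros
    seed = int("10" * 29, 2)                 # arbitrary 58-bit starting state
    line, _ = scramble(payload, state=seed)
    recovered, _ = descramble(line, state=0)  # receiver starts unsynchronised
    print("ones on the line:", sum(line))
    print("recovered after 58 bits:", recovered[58:] == payload[58:])
```

Because the descrambler's history consists only of received line bits, it recovers the data correctly once 58 bits have passed through it, whatever state it started in. That is why no explicit scrambler synchronisation is needed between the two ends of the link.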

1.2  How are the layers implemented? 

[Figure: 10GbE layers 1-2 (second diagram)]

All FPGAs with transceivers supporting 10GbE provide the PMA layer, almost always as hard IP on the chip. Some FPGA models also have the PCS in hard IP; however, they usually still offer direct access to the PMA from the FPGA fabric. MACs are often implemented as soft IP running on the FPGA fabric.

Metamako offers devices containing FPGAs from both Intel and Xilinx. Both of these FPGA vendors have their own libraries available implementing the PCS and MAC layers, as do Metamako and a number of other vendors.

Why this profusion of similar offerings?

In the world of electronic trading, less is more when it comes to latency. The key advantage of FPGAs over software, written in high-level languages such as Java or C++ running on a CPU, is that the FPGA fabric is connected directly to the network via the MAC and PHY. The FPGA can therefore talk to the network with significantly less latency than the software application running on the CPU. The reason for this is that Ethernet adapters (implementing the MAC and PHY) are connected to CPUs via a standard bus, usually PCI Express, which imposes latency on all communication across it. Even between FPGAs and vendors, there are differences in the latency of the PMA and MAC implementations (the PMD can essentially be bypassed via direct-attach twinax copper cable).

1.3  MAC and PHY implementation options

There are a number of different features that vendors may offer with their MAC and PHY implementation such as:

  • link traffic counters
  • RX (and TX) timestamping
  • support for Ethernet flow control
  • a dual-clock FIFO on the receive side (more below in section 2.2)
  • IEEE 1588 (PTP) support
  • support for options not strictly conformant with the 10GbE spec, such as shrinking the Ethernet preamble and/or interpacket gap to reduce latency between MACs and PHYs offering this support

In general, most of these options will have some impact on the latency through the MAC and PHY, though for some (e.g. link traffic counters) it may well be negligible.

2.  Measuring the latency of a MAC and PHY

Unfortunately, there is no standardised methodology for measuring the latency across a MAC and PHY. Though it may appear trivial to do so, there are a number of factors that need to be taken into account:

  1. Where is the latency measured from/to?
  2. How is it actually being measured from a practical point of view?
  3. How is the accuracy of the measurement ensured?
  4. What functionality and features must be implemented?

From an FPGA application programmer's perspective, the application is interfaced to the MAC via a parallel data bus. The type of data bus may conform to one of several common standards e.g. AXI4-Stream, Avalon-ST. The MAC will in turn talk to a PCS - either hard or soft IP - which will talk to the FPGA's PMA interface connected to a transceiver (or direct-attach copper). Given that the PMA implementation is provided by the specific FPGA chosen as an application platform, the latency of the MAC and PHY will essentially depend upon the MAC and PCS implementation. The latency of a PHY is therefore intrinsically coupled with the specific FPGA chosen.

The definition of the latency path is a key factor in determining what components are included. For example, it is possible to quote the latency through the MAC and PCS in isolation from the PMA + PMD, with or without a dual-clock FIFO on receive (section 2.2) or a transceiver. Vendors interpret the latency path used to quote their latency differently, sometimes making direct comparisons of different vendors' published latency numbers impossible.

2.1  More on MAC interfaces

At Layer 1 (PHY), the 10GbE specifications define a serial data rate of 10.3125 Gbps, encoded with a 64b/66b line code. The PMA, in the majority of cases, is configured to convert this serial stream to and from a 32 or 64 bit streaming bus clocked at 322.265625 and 161.1328125 MHz respectively. The PCS implementation will be designed to interface with this bus and, among its other functions, add (transmit) or remove (receive) the two-bit sync header on each 66-bit block of the data stream while interfacing with the MAC via an Ethernet media-independent synchronous interface called XGMII. A 32 bit XGMII bus is clocked at 312.5 MHz and a 64 bit XGMII bus is clocked at 156.25 MHz.
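
The clock frequencies quoted above follow directly from the line rate, the bus width and the 64b/66b overhead. The sketch below reproduces the arithmetic with exact fractions; the helper name `bus_clock_mhz` is illustrative only.

```python
from fractions import Fraction

LINE_RATE_BPS = Fraction(10_312_500_000)  # 10.3125 Gbps serial line rate
BLOCK_BITS = 66                           # bits per block on the line
PAYLOAD_BITS = 64                         # bits left once the sync header is removed


def bus_clock_mhz(bit_rate_bps, bus_width_bits):
    """Clock frequency (MHz) needed to carry bit_rate_bps on a parallel bus."""
    return bit_rate_bps / bus_width_bits / 1_000_000


# Removing the two-bit sync header from every 66-bit block leaves 10 Gbps.
payload_rate_bps = LINE_RATE_BPS * PAYLOAD_BITS / BLOCK_BITS

for width in (32, 64):
    pma = bus_clock_mhz(LINE_RATE_BPS, width)
    xgmii = bus_clock_mhz(payload_rate_bps, width)
    print(f"{width}-bit bus: PMA side {float(pma)} MHz, XGMII side {float(xgmii)} MHz")
```

The PMA-side and XGMII-side clocks differ by exactly the 66/64 ratio, which is why the PCS sits between two different clock rates even before any asynchronous clock domain crossing is considered.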

2.2  A tale of two clocks

At this point, it is important to understand that 10GbE does not mandate that any given PHY clock be synchronised. The 802.3 specification simply states that clocks must stay within ±100 ppm of their nominal frequency. When two different PHYs are connected to each other, unless they physically share the same clock, at any given moment all that can be assumed is that their frequencies will be within 200 ppm of each other. At each end of the connection, the PHY will transmit on its local clock and receive and recover the clock frequency of its counterpart so it knows where data bits start and end. This is important as an application running on an FPGA must be clocked locally, as recovered clocks are not generally of sufficient stability to run complex logic. Logic must therefore be placed between the MAC's receive interface and the FPGA application to allow the data stream to flow from one clock domain (recovered clock) to another (local clock). These two clock domains are asynchronous as each generally originates from a different source. One exception to this is where the RX recovered clock originates from the TX clock i.e. TX is looped back to RX. In this case, the clocks are frequency aligned, but not necessarily phase aligned.

The logic usually used to move the data between the clock domains is known as a dual-clock FIFO. Essentially it is a queue that has items put into it at one end at the sender's clock frequency and items pulled out of it at the receiver's clock frequency. If the sending and receiving PHYs' clock frequencies were the same, the FIFO utilisation would be constant. When the frequency of the clock recovered from the incoming stream by the receiving PHY is higher than its local clock, data is coming in faster than it can be consumed. The FIFO provides enough margin to serialise each incoming frame into the (slower) receiving clock domain without overrunning, with the difference in clock speeds being taken care of by essentially shrinking the interpacket gap (IPG) in the receiving clock domain. Conversely, when the receiving PHY's frequency is higher than that of the recovered clock from the incoming stream, the FIFO will allow the receiving clock domain to serialise incoming frames without under-running by effectively growing the interpacket gap as seen by the receiving clock domain. Clearly, the depth of the FIFO is a key factor in ensuring that overruns and underruns do not occur. This depth will depend upon the MTU size of the network to which these interfaces are connected. A larger MTU will require a deeper FIFO to accommodate buffering longer Ethernet frames.
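
A back-of-envelope estimate shows why the required FIFO margin grows with the MTU. The sketch below assumes the worst-case 200 ppm frequency difference mentioned above and only counts the drift accumulated while a single frame passes. A real FIFO also has to cover synchroniser latency and bus-word granularity, which this estimate ignores. The function name `drift_bits` is illustrative only.

```python
def drift_bits(frame_bytes: int, ppm_difference: float) -> float:
    """Bits by which the write and read sides diverge while one frame passes."""
    return frame_bytes * 8 * ppm_difference * 1e-6


for label, size in (("1518-byte frame", 1518), ("9000-byte jumbo frame", 9000)):
    print(f"{label}: {drift_bits(size, 200):.2f} bits of drift at 200 ppm")
```

The drift is proportional to frame length, so a network that allows jumbo frames needs several times the slack of one limited to standard frames.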

2.3  The FPGA Application

Keep in mind, all the bus clocks mentioned above come from the 10GbE specifications. FPGA application developers, on the other hand, may choose to clock their application, or pieces of it, at whatever frequency they feel will give them the best performance - with the caveat that the application must meet the FPGA's timing constraints. For example, an application may give its optimum performance clocked at 440 MHz. To interface with the MAC however, typically clocked at 156.25 or 312.5 MHz, a dual-clock FIFO or similar solution may need to be implemented.

2.4  What latency measurements are meaningful?

Does the individual latency of each of the layers really matter? Surely the key measurement is the latency between the FPGA application and the 10GbE network?

Though appearing straightforward, there is more to it than meets the eye. In many common use cases for a MAC and PHY IP block, the FPGA will receive an Ethernet frame, perform some processing and transmit another Ethernet frame. This is true for network switches, firewalls, tick-to-trade applications (where the critical path runs from market data reception to order transmission) etc.

Three main elements need to be addressed:

  1. What defines the start and stop endpoints?
  2. What path is being measured?
  3. How is this measurement to be obtained practically?

Let us look at these points one-by-one:

1.  The interface with the FPGA application is defined by the MAC. Different MACs are likely to mandate different combinations of bus width/frequency as well as different bus standards for the application connecting to it. Some applications may be capable of running in the same clock domain as the recovered RX clock or the local TX clock. Other applications will need a clock domain crossing to communicate with the MAC. A mandated bus standard used by the MAC may also impact latency.

On the network side, depending upon the device the FPGA is mounted in or on, most would agree that the network begins where the device physically connects to the network. Does this mean the pins in the SFP+/QSFP+ cage? Or should a transceiver and possibly a length of optical interconnect (if so with what latency characteristics) be included in the measurement? Or should this be a defined length of direct-attach twinax?

2.   Ideally, the TX and RX paths should be measured independently, however, one-way latency measurements between different mediums (FPGA fabric and Ethernet network) are often difficult to implement. A generally much simpler solution is to measure a round-trip through both the TX and RX paths. What constitutes a round-trip? Conventionally, a loopback is frequently used with the same frame arriving being sent back out. The fundamental problem with this approach is that it is not necessarily a particularly representative case of real-world applications. A loopback can also be implemented in a number of different ways making comparisons across different vendor implementations difficult.

Another option is to mandate that an incoming Ethernet frame received on RX trigger a separate outgoing frame on TX. The key advantage of this approach is that it leaves the actual implementation of this process to the vendor yet allows direct comparison of the latency performance across different implementations.

3.   One practical way to obtain one-way latency paths through the FPGA is via simulation, with the caveat that the simulator used contains an accurate model of the FPGA's hard IP. Even measuring round-trip latencies is not completely straightforward. In the following diagram, both figures show a round-trip path in and out of the FPGA.

In Fig (i), the latency path involves external loopback and starts and ends within the FPGA application, with the application performing the measurement. In Fig (ii), the latency path starts and ends in the network, which can be implemented as an Ethernet frame loopback or as the incoming Ethernet frame triggering a separate outgoing Ethernet frame, with an external component performing the measurement. In theory, both methods implementing loopback could certainly yield the same results; however, there is a key subtlety separating them:

In Fig (i), the TX link is looped back to the RX link and hence the recovered clock is precisely the same frequency as the local clock. There can therefore never be an overrun or underrun in the dual-clock FIFO between the recovered and local clock domains. On the other hand, in the case shown by Fig (ii), an external clock almost certainly running faster or slower than the local clock by up to 200 ppm will impact the path latency through the FPGA given the need to mitigate a FIFO underrun or overrun.

[Figure: MAC-PHY measurements, Fig (i) and Fig (ii)]

3.  Proposed Latency Measurement Methodologies

There are essentially three logically different latency measurements that fall out of the above analysis:

  1. simulation of both TX and RX paths through the MAC and PHY
  2. looping back TX to RX externally and measuring in an FPGA application by essentially counting clocks from the first bytes of a frame being sent out to the first bytes of it being returned - Fig (i)
  3. sending and receiving Ethernet frames from an external entity through the RX and TX paths (RX connected to TX via loopback or incoming Ethernet frame triggering outgoing Ethernet frame) as well as tapping and capturing the incoming and outgoing Ethernet frames while timestamping their arrival - Fig (ii)

3.1  The pros and cons of each type of measurement

Methodology: Simulation
Pros:
  • If the hard IP is accurately modelled, this method measures both directions individually and is extremely accurate.
Cons:
  • The vendor-provided model of the FPGA's hard IP may not reflect the actual hardware in every detail.
  • Simulation numbers do not take into account any additional elements in the device's network path such as board traces, transceivers or switching ASICs.
  • Generating simulation numbers requires access to the source code, making them not independently verifiable.

Methodology: Looping TX back to RX externally - Fig (i)
Pros:
  • The method measures TX + RX latency to the precision of one clock period in the clock domain of the timing application, making it self-contained.
  • This configuration does not require that the MAC implement an asynchronous dual-clock FIFO as both start and end timestamps can be taken independently.
Cons:
  • Measurement precision is likely to be limited by the period of the local 312.5/322 MHz clock (around 3 to 6 nanoseconds), though accuracy through averaging repeated measurements should be sub-nanosecond.
  • Validating the implementation of the timing application would require that each vendor's code be inspected.

Methodology: External measurement using loopback - Fig (ii)
Pros:
  • This method measures the RX + TX latency to the precision of an external Ethernet frame capture solution, which can be as low as a nanosecond with repeated measurements being averaged.
  • As latency measurements are external to the FPGA code, the measurements in this method are independent of the MAC + PHY implementation.
Cons:
  • This requires an external Ethernet frame sender/receiver, tap and a timestamped packet capture solution which will require post-processing.
  • The frequency difference between the sender's and FPGA's clock will vary across devices.
  • It requires external "plumbing" i.e. transceivers + fibre patches + optical tap + timestamped capture solution.
  • The dual-clock FIFO latency and associated interfacing latency is included in the measurement, which may not be required in some specific use-cases.

Methodology: External measurement using incoming triggering outgoing Ethernet frame - Fig (ii)
Pros:
  • This method measures the RX + TX latency to the precision of an external Ethernet frame capture solution, which can be as low as a nanosecond with repeated measurements being averaged.
  • As latency measurements are external to the FPGA code, the measurements in this method are independent of the MAC + PHY implementation, including how the vendor has chosen to trigger the outgoing Ethernet frame on receipt of the incoming Ethernet frame.
Cons:
  • This requires an external Ethernet frame sender/receiver, tap and a timestamped packet capture solution which will require post-processing.
  • The frequency difference between the sender's and FPGA's clock will vary across devices.
  • It requires external "plumbing" i.e. transceivers + fibre patches + optical tap + timestamped capture solution.
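
The claim that averaging repeated clock-quantised measurements can reach sub-nanosecond accuracy can be checked with a small simulation. This is a sketch under a strong assumption: the phase between the event and the counting clock is taken to be uniformly random from one measurement to the next, which may not hold when the two clocks are frequency-locked. The latency value, clock period and function names are all hypothetical.

```python
import math
import random


def measure_once(true_latency_ns, clock_period_ns, rng):
    """One measurement made by counting whole clock edges.

    The frame leaves at a random phase after a clock edge (assumption) and
    the counter reports the number of edges seen before it returns.
    """
    phase = rng.uniform(0.0, clock_period_ns)
    ticks = math.floor((phase + true_latency_ns) / clock_period_ns)
    return ticks * clock_period_ns


def averaged(true_latency_ns, clock_period_ns, samples, seed=0):
    """Mean of many quantised measurements."""
    rng = random.Random(seed)
    total = sum(measure_once(true_latency_ns, clock_period_ns, rng)
                for _ in range(samples))
    return total / samples


if __name__ == "__main__":
    TRUE_NS = 57.3    # hypothetical latency, not a measured figure
    PERIOD_NS = 3.2   # period of a 312.5 MHz clock
    print("single measurement:", measure_once(TRUE_NS, PERIOD_NS, random.Random(1)))
    print("average of 100000 :", averaged(TRUE_NS, PERIOD_NS, 100_000))
```

Each individual reading is a multiple of the clock period, yet under the random-phase assumption the mean converges on the true value, because the fraction of readings that round up is proportional to the fractional part of the latency.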


3.2  Weighing up the pros and cons

  • Simulation numbers are considered useful and indicative, but by definition they do not necessarily represent the real-world performance of a MAC/PHY accurately.
  • Simulation numbers can only be generated with source code which is unlikely to be available from most vendors.
  • External loopback - Fig (i) - is relatively self-contained; however, it can be run without an asynchronous dual-clock FIFO on the receive side, making it a poor analogue for interfacing with a real-world application in a local clock domain. As the measurement logic is in the FPGA application, validating it would require detailed code inspection, and its performance would vary across implementations, making direct comparisons between vendor offerings flawed. Potential inaccuracy of the measurement is also a factor. Significant technical skill with varied methods is required to measure and inspect the internals of these applications, making this method unsuitable for implementation by independent third parties (like STAC).
  • An external measurement - Fig (ii) - has two benefits: the measurement process is external to the MAC/PHY and, when an incoming Ethernet frame is used to trigger an outgoing one, vendors are free to implement this functionality independently. Its accuracy is also a function of the network capture device, which can be as low as 50 ps (see Measuring the absolute accuracy of 10GbE packet timestamping).

3.3  Metamako's recommendation of a standardised, fair MAC + PHY latency measurement methodology

To provide a well-defined measurement architecture that all vendors and clients can follow and that represents a real-world use case, Metamako believes the most practical, unambiguous solution is an external measurement in which the MAC and PHY's RX path triggers TX of a separate Ethernet frame only upon receipt of a frame containing a token that meets a predefined rule. The test would require that frames be generated externally and contain tokens that both meet and do not meet the predefined rule. Where RX frames do not contain tokens meeting the predefined rule, no TX frame (or partial frame) should be sent. To facilitate the implementation of this measurement, Metamako is submitting an IP core with its RTL implementation of this logic, interfacing with common bus standards.

a. The implementation of the IP core from RX to TX in more detail
  • The IP core will be clocked at 312.5 MHz in a locally sourced clock domain and interface with the MAC via a 32-bit bus (AXI4-Stream or Avalon-ST).
    • Any vendor's MAC requiring a different speed/width combination will therefore need to implement a gearbox and/or dual-clock FIFO to interface with this IP core.
  • The core's logic will parse incoming frames from the MAC and extract a 4-byte token (big-endian) contained within the 52nd to 56th byte of the Ethernet frame.
  • If the token's parity is odd, no outgoing frame will be triggered.
  • If the token's parity is even, an outgoing frame will be triggered containing the token from the incoming frame in the 60th to 64th byte of the Ethernet frame.
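
The decision logic described in the bullets above can be modelled in software as follows. This is a behavioural sketch, not the RTL of the IP core, and it rests on several assumptions that the text does not settle: byte positions are treated as zero-based offsets 52..55 (read) and 60..63 (write) into whatever buffer is handed in, "parity" is taken to mean the parity of the number of set bits in the token, and the outgoing frame is built from a caller-supplied template. All names are hypothetical.

```python
from typing import Optional

RX_TOKEN_OFFSET = 52   # assumption: zero-based offset of the token on RX
TX_TOKEN_OFFSET = 60   # assumption: zero-based offset of the token on TX
TOKEN_LEN = 4          # the token is 4 bytes, big-endian


def has_even_parity(token: int) -> bool:
    """True when the token contains an even number of 1 bits (assumed rule)."""
    return bin(token).count("1") % 2 == 0


def respond(rx_frame: bytes, tx_template: bytes) -> Optional[bytes]:
    """Return the frame to transmit, or None when nothing may be sent."""
    if len(tx_template) < TX_TOKEN_OFFSET + TOKEN_LEN:
        raise ValueError("TX template too short to hold the token")
    if len(rx_frame) < RX_TOKEN_OFFSET + TOKEN_LEN:
        return None  # frame too short to contain a token
    token_bytes = rx_frame[RX_TOKEN_OFFSET:RX_TOKEN_OFFSET + TOKEN_LEN]
    token = int.from_bytes(token_bytes, "big")
    if not has_even_parity(token):
        return None  # odd parity: no outgoing frame is triggered
    tx = bytearray(tx_template)
    tx[TX_TOKEN_OFFSET:TX_TOKEN_OFFSET + TOKEN_LEN] = token_bytes
    return bytes(tx)


if __name__ == "__main__":
    template = bytes(64)
    rx = bytes(52) + (0x00000003).to_bytes(4, "big") + bytes(8)
    print(respond(rx, template) is not None)
```

Because the decision depends on all four token bytes, a model like this cannot start building a reply until the whole token has been read, which is the property that makes the trigger a fair stand-in for real receive-then-transmit processing.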
b. Test procedure
  • A packet source and sink on a separate device (with an independent clock) is required, generating and receiving a stream of 10GbE-compliant Ethernet packets and capable of detecting malformed Ethernet packets.
  • The PMD sub-layer of the FPGA device under test is implemented as one or more fibre-connected 10GBASE-SR transceivers.
  • The packet replay/capture device and the FPGA device under test are connected to each other via either single or dual connections (RX and TX on separate ports/transceivers).
  • The latency of the fibre patch(es) between the packet replay/capture device and the transceiver of the FPGA device under test can be measured by putting an optical loopback in place of the FPGA device under test. Subtracting this from the measurement through the MAC + PHY gives the combined latency of the loopback, dual-clock FIFO, MAC and PHY (including the transceiver).
  • A stream of Ethernet frames containing unique tokens is sent into the FPGA device under test and the resultant Ethernet frames are captured, with both incoming and outgoing frames from the FPGA device under test timestamped.
  • Any outgoing frames from the FPGA device under test that are malformed, do not contain the same token as the incoming frames or were produced when the token in the incoming frame had odd parity will cause the MAC + PHY under test to fail the test.
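
The post-processing implied by this procedure can be sketched as below: subtract the optical-loopback baseline from each timestamp pair and apply the three failure conditions from the last bullet. The `Capture` record layout and the function names are hypothetical, and parity is again assumed to mean the parity of the number of set bits. The procedure gives no verdict for an even-parity token that produces no response, so this sketch simply records no latency for it.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Capture:
    token: int                       # token sent into the device under test
    t_in_ns: float                   # capture timestamp of the incoming frame
    t_out_ns: Optional[float]        # timestamp of the outgoing frame, if any
    out_token: Optional[int] = None  # token found in the outgoing frame
    malformed: bool = False          # outgoing frame failed frame checks


def even_parity(token: int) -> bool:
    """Assumed rule: an even number of 1 bits in the token."""
    return bin(token).count("1") % 2 == 0


def evaluate(captures: List[Capture],
             baseline_ns: float) -> Tuple[bool, Optional[float]]:
    """Return (passed, mean latency in ns with the fibre baseline removed)."""
    latencies = []
    for c in captures:
        if c.t_out_ns is None:
            continue  # no outgoing frame captured for this token
        if c.malformed or c.out_token != c.token or not even_parity(c.token):
            return False, None  # any one of the three failure conditions
        latencies.append(c.t_out_ns - c.t_in_ns - baseline_ns)
    return True, (mean(latencies) if latencies else None)


if __name__ == "__main__":
    run = [Capture(0b11, 100.0, 180.0, 0b11), Capture(0b1, 200.0, None)]
    print(evaluate(run, baseline_ns=20.0))
```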

4.  Summary

The MAC, PCS and PHY are all components mandated by the Ethernet specification that implement the low-level Ethernet protocol, allowing an application to send and receive Ethernet frames. The PHY incorporates the PCS (and PMA + PMD) and implements the physical layer of the OSI model whereas the MAC implements the data link layer. When implemented within an FPGA, the FPGA application implemented on the FPGA fabric typically communicates with the MAC.

Measuring the latency of a MAC and PHY implementation in a standardised way is not straightforward due to the fact that:

  • different MAC implementations may present different streaming bus interface widths/frequencies.
  • one-way latency measurement is only really practical using simulation, though potentially inaccurate unless FPGA hard IP is modelled accurately.
  • Ethernet clocks are not synchronised, which means that each side of the MAC is clocked from a different source, usually requiring a dual-clock FIFO which adds latency.
  • the location of the interface between the FPGA and the physical Ethernet network can vary based upon interpretation.

Metamako proposes that a fair, standardised methodology be adopted to measure the MAC + PHY latency via an FPGA IP core. This methodology involves extracting a token at a defined offset within an incoming Ethernet frame, validating that its parity is even, and then embedding it in an outgoing Ethernet frame only if the validation is successful.

 
