Back in 2017 I had a kinda esoteric question about Ethernet NICs with less-than-frame latency, based on my experience with networking software and hardware, and the evidence for something "atypical" being in use by High Frequency Traders. So I posted my question to Stack Overflow.
The question got a lot of pushback, which I found bizarre. It got bounced across 3 Stack Exchange sites... and finally landed in Stack Overflow proper, where the only answer it ever got was my own, posted some days later once I had found the necessary information myself.
Aaand a couple of years of inactivity later, the question just got deleted. 🤷‍♂️
I thought it was an interesting question (else I wouldn't have posted it!), it drew an interesting collection of pushback, and it got an interesting answer. So I saved them, and here they are for posterity.
Lowest-latency Ethernet: Byte-oriented NICs?
This is a question that was put on hold both in Network Engineering (because it
involves a host) and in Server Fault (because... I'm not really sure). I'm hoping that it
makes more sense here.
First, some context. High Frequency Traders depend on very high speed networking. In
2014 they were already working at the level of about 750 nanoseconds with 10Gb
Ethernet, and competing to reach lower latencies, with single nanoseconds being
important.
The surprising thing is: at 10GbE, each byte takes 0.8 ns on the wire (call it roughly 1 ns per byte). So, from the moment a typical NIC starts receiving a frame until it finishes and makes it available to the rest of the hardware, at least ~51 ns have passed for a minimal 64-byte frame. For a 1500-byte frame, more than a microsecond has passed. So if you wait for the full frame before you start working on it, you are wasting precious time.
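For concreteness, here is that back-of-the-envelope arithmetic as a tiny C program (it counts only the frame bytes themselves, ignoring preamble, SFD and inter-frame gap):

```c
#include <stdio.h>

int main(void) {
    const double line_rate_bps = 10e9;                       /* 10GbE */
    const double ns_per_byte = 8.0 / line_rate_bps * 1e9;    /* 0.8 ns */

    const int min_frame = 64;     /* minimal Ethernet frame, bytes */
    const int max_frame = 1500;   /* MTU-sized frame, bytes        */

    printf("per byte:        %.2f ns\n", ns_per_byte);
    printf("64-byte frame:   %.1f ns on the wire\n", min_frame * ns_per_byte);
    printf("1500-byte frame: %.1f ns on the wire\n", max_frame * ns_per_byte);
    return 0;
}
```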
But that is exactly what every Ethernet NIC I have seen does! In fact, even kernel-bypass frameworks like DPDK or NetMap work at the granularity of whole frames, so they will never reach less-than-frame latency.
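To make that frame granularity concrete, here is roughly what a kernel-bypass receive loop looks like with DPDK. This is a minimal sketch loosely based on DPDK's basic forwarding skeleton (the port number, ring sizes and stripped-down error handling are simplifications of mine); the point is simply that rte_eth_rx_burst() only ever hands back complete frames.

```c
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_RING_SIZE    1024
#define TX_RING_SIZE    1024
#define NUM_MBUFS       8191
#define MBUF_CACHE_SIZE 250
#define BURST_SIZE      32

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL",
            NUM_MBUFS, MBUF_CACHE_SIZE, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");

    uint16_t port = 0;                        /* assumption: first DPDK port */
    struct rte_eth_conf port_conf = { 0 };
    if (rte_eth_dev_configure(port, 1, 1, &port_conf) < 0 ||
        rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE,
                rte_eth_dev_socket_id(port), NULL, pool) < 0 ||
        rte_eth_tx_queue_setup(port, 0, TX_RING_SIZE,
                rte_eth_dev_socket_id(port), NULL) < 0 ||
        rte_eth_dev_start(port) < 0)
        rte_exit(EXIT_FAILURE, "port init failed\n");

    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {
        /* Busy-poll the RX queue: no interrupts involved. Crucially,
         * rte_eth_rx_burst() only returns *complete* frames: by the time
         * an mbuf shows up here, its last byte (and the FCS) has already
         * crossed the wire. */
        uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb; i++) {
            /* ... inspect rte_pktmbuf_mtod(bufs[i], void *) here ... */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```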
Consider that the payload of an Ethernet frame starts 14 bytes in, right after the destination MAC, source MAC and EtherType. If you started working on that data while the rest of the frame is still being received, you'd save a minimum of about 40 ns, and for a full-size frame well over a microsecond. That would be a BIG advantage when you are fighting for single nanoseconds.
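A quick sketch of that 14-byte header layout, and of how much wire time is still left once it has arrived (again at 0.8 ns per byte):

```c
#include <stdint.h>
#include <stdio.h>

/* Classic Ethernet II header: 6 + 6 + 2 = 14 bytes before the payload. */
struct eth_hdr {
    uint8_t  dst[6];        /* destination MAC            */
    uint8_t  src[6];        /* source MAC                 */
    uint16_t ethertype;     /* e.g. 0x0800 = IPv4         */
} __attribute__((packed));

int main(void) {
    const double ns_per_byte = 0.8;                 /* 10 Gbit/s */
    printf("header size: %zu bytes\n", sizeof(struct eth_hdr));
    /* Wire time still to go once the header is in: */
    printf("64-byte frame:   %.1f ns of frame still to arrive\n",
           (64 - 14) * ns_per_byte);
    printf("1500-byte frame: %.1f ns of frame still to arrive\n",
           (1500 - 14) * ns_per_byte);
    return 0;
}
```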
I see two possibilities:
- HFTs and the like are not counting the potential time gain of processing data before the frame is fully received. Seems absurd: they use FPGAs for speed, but are willing to waste all that waiting time?
- They are actually using some fairly specialist hardware, even by DPDK standards
Hence my questions: how do they do it? Does anyone know of "byte-oriented" Ethernet NICs, which would make the individual bytes of the frame available as they arrive? If so, how would such a NIC work: for example, would it present a stream à la stdout to the user? Probably also with its own network stack/subsystem?
Now, I have
already collected some typical "that's impossible/bad/unholy" comments
from the questions in NE and SF, so I will preemptively answer them
here:
You want a product recommendation.
No. If anything, it'd be interesting to see a product manual that explains the programming model or gives some insight into how they get sub-frame latency. But the goal is to learn how this is done.
Also, I know about the Hardware Recommendations SE. I'm not asking there because
that's not what I want.
You are asking "why doesn't someone do this".
No. I stated that there's talk of latencies that are impossibly low for traditional NICs, and I am asking how they achieve them. I have advanced the two possibilities I suspect: a byte-oriented NIC and/or custom hardware.
It could still be something else: for example, that the way they measure those ~750 ns
is actually from the moment the frame is made available by the (traditional) NIC. This
would moot all my questioning. Also, this would be a surprising waste, given that they
are already using FPGAs to shave nanoseconds.
You need the FCS at the end of the frame to know if the frame is valid.
The FCS is needed, but you don't need to wait for it.
Look at the speculative execution and branch prediction techniques used by mainstream CPUs for decades now: work starts speculatively on the instructions past a branch, and if the prediction turns out to be wrong, that work is simply discarded. This improves latency when the speculation was right. If not, well, the CPU would have been idle anyway.
This same speculative technique could be used on Ethernet frames: start working ASAP
on the data at the beginning of the frame and only commit to that work once the frame
is confirmed to be correct.
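No NIC I know of exposes bytes to host software this way, so the following is purely a conceptual sketch: a hypothetical byte-by-byte delivery is simulated from an in-memory buffer, speculative work starts as soon as the 14-byte header is in, and that work is committed or discarded only once the FCS at the end has been checked.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ETH_HDR_LEN 14

/* Standard CRC-32 (the polynomial used for the Ethernet FCS), bitwise. */
static uint32_t crc32_eth(const uint8_t *p, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return ~crc;
}

/* Speculative consumer: called once per "received" byte. */
static void on_byte(const uint8_t *frame, size_t received, size_t frame_len) {
    if (received == ETH_HDR_LEN) {
        /* Header complete: start working on it *now*, before the rest of
         * the frame (and the FCS) has even arrived. */
        printf("speculating: ethertype 0x%02x%02x, payload still streaming in\n",
               frame[12], frame[13]);
    }
    if (received == frame_len) {
        /* Last 4 bytes are the FCS: commit or discard the speculative work.
         * The FCS is stored and compared in host byte order purely for this
         * simulation. */
        uint32_t fcs;
        memcpy(&fcs, frame + frame_len - 4, 4);
        if (crc32_eth(frame, frame_len - 4) == fcs)
            printf("FCS ok: commit the speculative work\n");
        else
            printf("FCS bad: discard the speculative work\n");
    }
}

int main(void) {
    /* A fake 64-byte frame standing in for bytes arriving off the wire. */
    uint8_t frame[64] = {0};
    frame[12] = 0x08; frame[13] = 0x00;              /* EtherType: IPv4 */
    uint32_t fcs = crc32_eth(frame, sizeof frame - 4);
    memcpy(frame + sizeof frame - 4, &fcs, 4);

    for (size_t i = 1; i <= sizeof frame; i++)       /* byte-by-byte "arrival" */
        on_byte(frame, i, sizeof frame);
    return 0;
}
```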
In particular, note that modern Ethernet is essentially collision-free (full-duplex links, switches everywhere) and is specified to have a very low BER. Therefore, forcing every frame to pay a full-frame latency just in case the occasional frame turns out to be invalid is clearly wasteful when you care about nanoseconds.
Ethernet is frame-based, not byte-based. If you want to access bytes, use a
serial protocol instead of bastardizing Ethernet.
If accessing data in a frame before it's fully received is "bastardization", then I'm sorry
to break it to you, but this has been going on at least since 1989 with Cut-through switching.
In fact, note that the technique I described does drop bad frames, so it is
cleaner than cut-through switching, which does forward bad frames.
Getting an interrupt per byte would be horrible. If polling, you would need to
use full CPUs dedicated just to RX. NUMA latency would make it impossible.
All of these points are already mooted by DPDK and the like (NetMap?). Of course one has to configure the system carefully to make sure that you are working with the hardware, not against it. Interrupts are entirely avoided by polling. A single 3-GHz core is more than enough to receive at 10GbE without dropping frames. NUMA is a known problem, but one you just have to deal with carefully.
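For the record, the worst-case arithmetic behind the "single core" claim (minimal 64-byte frames, plus preamble, SFD and inter-frame gap on the wire):

```c
#include <stdio.h>

int main(void) {
    const double line_rate = 10e9;                  /* bit/s          */
    const double cpu_hz    = 3e9;                   /* one 3-GHz core */
    /* Worst case: 64-byte frames, plus 8 bytes preamble/SFD and
     * 12 bytes inter-frame gap = 84 bytes per frame on the wire. */
    const double slot_bits = 84 * 8;
    const double pps = line_rate / slot_bits;       /* ~14.88 Mpps */

    printf("max frame rate: %.2f Mpps\n", pps / 1e6);
    printf("time budget:    %.1f ns per frame\n", 1e9 / pps);
    printf("cycle budget:   %.0f cycles per frame at 3 GHz\n", cpu_hz / pps);
    return 0;
}
```

That works out to about 14.88 Mpps, i.e. a budget of roughly 200 cycles per frame on a 3-GHz core, which DPDK-class drivers are known to meet by processing frames in batches.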
You can move to a higher speed Ethernet standard to reduce latency.
Yes, 100GbE (for example) is faster than 10GbE. But that will not help if your provider
works at 10GbE and/or if your competition also moves to 100GbE. The goal is to reduce
latency on a given Ethernet standard.
And the answer...
OK, so I found the answer myself. SolarFlare, for example, does provide just this kind of streaming access to the incoming bytes, at least inside the FPGA on (some of?) their NICs.
This is used for example to split a long packet into smaller packets, each of which gets
directed to a VNIC. There is an example explained in this SolarFlare presentation, at
49:50:
Packets arrive off the wire – bear in mind that this is all cut-through so you don't know the length at this point – you just have one metadata word at the beginning.
This also means that the host still communicates with the NIC in the traditional way: it
just looks like various fully-formed packets arrived all of a sudden (and each can be
routed to different cores, etc). And so, the network stack is rather an independent
variable.
Good stuff!