Designing 100G optical connections (facebook.com)
95 points by ot on April 16, 2017 | 46 comments



http://www.cwdm4-msa.org/

Nowhere does this mention Facebook as a founding member. This dates back to at least 2015.

> We created a 100G single-mode optical transceiver solution, and we're sharing it through the Open Compute Project.

What gives? Looks to me like someone may be taking more credit than they deserve.

EDIT:

> The starting point for this specification is the CWDM4 MSA, a standard that was agreed upon in 2014 by several optical transceiver suppliers. It uses a wider wavelength grid (CWDM = 20 nm spacing) and, for many of the different technology approaches, does not require a cooler inside the module to keep the laser wavelength stable.

So Facebook took something that already existed and tweaked it slightly, relaxing the constraints so that it only has to work inside the data center.


4 x 25 Gbps has been around for years, but in larger form factors like CFP, CFP2, etc. The main difference here is that they're using commodity lasers from Chinese and Taiwanese sources with a prism to do 4 x 20 nm-spaced channels between QSFP28 optics. For reference: https://en.wikipedia.org/wiki/QSFP

The next step down in size will be to SFP/SFP+ dimensions. I can see it coming; a QSFP is already a lot smaller than a first-generation 100GbE CFP.


Facebook has really excellent marketing for their hardware engineering efforts.


The big difference is that it is lower cost than the 4x25G CWDM4 MSA because it is specced for 500 m reach through normal G.652.D single-mode fiber and a reasonable number of clean, properly terminated SC and LC/UPC patch panels. It is not specced to work at 2000 metres.


I guess the real question is: do these cost savings actually save them money in the long run, considering they had to do all the extra engineering work? Or is this just to thumb their nose at the big manufacturers? Just another case of NIH syndrome?


If you have hundreds of optics and the difference is $1100 vs. $2000 per unit, that could be $180,000 to $200,000 saved. That's enough to cover the cost of one serious core router (half of a twin pair) capable of 400 Gbps full duplex per slot. 500 m should be more than enough for any optical path within a flat, horizontal datacenter.
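
A quick back-of-the-envelope check of those figures; the per-unit prices are the parent's numbers, and the fleet sizes below are purely illustrative assumptions:

    /* Savings from relaxed-spec optics vs. full CWDM4 MSA parts,
     * using the $1100 vs. $2000 prices quoted above. */
    #include <stdio.h>

    int main(void)
    {
        const double price_relaxed = 1100.0;
        const double price_msa     = 2000.0;
        const int fleets[] = { 100, 200, 500 };   /* arbitrary fleet sizes */

        for (int i = 0; i < 3; i++)
            printf("%4d optics -> $%.0f saved\n",
                   fleets[i], fleets[i] * (price_msa - price_relaxed));
        return 0;   /* 200 optics -> $180,000, in line with the comment above */
    }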

Facebook didn't do any extra engineering work; they just specced less sensitive Rx parts with the optics OEMs, and a Tx that is 1 dB less powerful.


> Facebook didn't do any extra engineering work...

I guess that depends on how you define "engineering."

They're basically trying to create a new class of transceiver. It remains to be seen if this will take off or not, but since it is part of the OCP effort, the chances are good that it will be taken seriously by QSFP vendors.

OCP is generating a lot of activity and change on the networking side. Whether it just becomes a race to the bottom where only the giant suppliers survive, or whether it creates a new ecosystem with more players and interesting technology, remains to be seen.


I think it's great that a data center operator is willing to relax their requirements. For too long we've been designing against telecom specs and operating environments.

I think that in order to bring more OEM vendors in, we need to see the other big players also accept the relaxed specs.

Hopefully, we don't end up with another dozen different 100G or 200G MSAs that work from 15-55C.


I'm also curious what the pricing difference is for a CWDM4 transceiver and the OCP version.

I would guess the NRE to develop either is similar and that the design for either is almost the same. Perhaps Facebook is just trying to get the optics cost down by negotiating discounts on the non-yielding MSA parts that would otherwise have had to be thrown out?


If you're wondering what this kind of optic costs the rest of us: http://www.fs.com/products/65219.html (£1,081)

Although you'd pay at least 10-20x that if it had a big vendor name on the top.


Ha awesome! Is this switch-to-switch only, or are there 100 gig optical connectors for servers as well? I like the 2 km max cable run length.


My understanding is that this is just a QSFP28 transceiver using a new optical format. There have been 100GbE server NICs for almost two years (Mellanox ConnectX-4, using the mlx5 driver) that use QSFP28.

We've been using them with great success at Netflix in our flash storage appliances. We are able to serve well over 90Gb/s from single machines with these NICs using our tuned/enhanced FreeBSD and nginx.


> We are able to serve well over 90Gb/s from single machines with these NICs using our tuned/enhanced FreeBSD and nginx.

Incredible


Can you describe why you chose FreeBSD over Linux+DPDK? I realize FreeBSD has the fast path built in, but I would think it's lacking in other areas.


We're pretty much the opposite of DPDK. Rather than moving stuff out of the kernel, we move stuff into it. We're actually moving stuff like TLS encryption into the kernel -- see our kernel TLS papers at AsiaBSDCon.

We do all of our stack "traditionally", in the kernel, whereas DPDK moves things into userspace. By using a traditional stack with async sendfile in the kernel, we benefit from the VM page cache, and reduce IO requirements at peak capacity. There are no memory to memory copies, very little kernel/user boundary crossing, and no AIO. Using single-socket Intel Xeon E5-2697A v4, we serve at 90Gb/s using roughly 35-50% CPU (which will increase as more and more clients adopt HTTPS streaming).
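
For anyone curious what that in-kernel path looks like in practice, here's a minimal sketch (my own illustration, not Netflix's code) of serving a file over a connected socket with FreeBSD's sendfile(2); the file's pages go from the VM page cache to the socket without being copied through userspace:

    /* FreeBSD-specific: sendfile(2) takes the file fd, the socket,
     * and a byte count (0 = send until EOF). */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Send an entire file over an already-connected stream socket. */
    void
    serve_file(int sock, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd == -1)
            err(1, "open %s", path);

        off_t sent = 0;
        /* The kernel hands page-cache pages straight to the NIC;
         * no read()/write() copies, no userspace buffer. */
        if (sendfile(fd, sock, 0, 0, NULL, &sent, 0) == -1)
            err(1, "sendfile");

        close(fd);
    }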

There is no question that FreeBSD is lacking in a number of areas. For example, device support is a constant struggle.


Which 100G NIC are you using with FreeBSD?


Mellanox ConnectX-4 (mlx5 driver, mceX interface name)


Just wondering, what kind of things do you think FreeBSD is lacking in?

Regardless of those things, the OpenConnect boxes are doing a pretty small subset of possible server tasks: it's basically serving static content from disk and updating that static content occasionally. This is a task that FreeBSD has been excelling at since basically forever. FreeBSD is a pretty stable target to tweak on as well. Netflix moved TLS bulk encryption into sendfile(), which helps avoid the transitions from kernel to user space, by putting more stuff in kernel space rather than the DPDK method of putting more stuff in user space. They've continued to tweak sendfile, which I imagine helped them get up to nearly 100Gbps out.

I haven't had the pleasure of running a 100G network, but I had to do very little tuning to saturate 10G with FreeBSD on an HTTP download site, and TLS CPU usage was the primary thing holding me back when moving it to HTTPS. Bulk download was moved away from my team before I got new servers with 2x10G and fancier processors, so I was never able to see if I could saturate that too :(


You use CWDM4 for switch to switch. Cheaper/easier to use DAC cables for switch to server connections. Though I don't know anyone doing 100G to server yet, so you'd more likely be using DAC breakout cables to deliver 25G to the server.


Cool, thanks for the info. Wondering if this is remotely worth it for a home computing cluster yet.


100G, probably not, unless you are extremely wealthy and have a burning desire to be destitute. 40G is definitely doable for a homelab. Used 40GbE adapters are roughly $200 and DAC cables are < $100. QSFP+ switches are much harder to find used and/or cheap though. This means that p2p solutions or ring networks are generally preferred unless you are wealthy. If you want fiber, which you don't need for a homelab, you pay the prices listed above.


Agree with the other commenters; not sure what the 2.5x increase in traffic would open up for you in a home setting. I use 40 Gb fiber between switches on opposite ends of my place and 10 Gb connections between switch and computers, and I've always found something else to be my bottleneck, not the internal network itself. I can saturate my wifi network easily though (I've got a $20k-retail Aruba wifi network, yet it's still not enough). I buy most of my gear used so it's not that expensive (it would've been very expensive retail), but it's still overkill unless you're a network nerd (like me) or are doing something out of your home for which you should really be using a DC.


Yeah, I think 10 G will be more than good enough for now. Thanks for the info!


Linux can route traffic over 40G on regular hardware as long as packet sizes are fairly large (>~150 bytes). I would guess that serving static assets from RAM would be able to saturate that on a single machine.


150 bytes is a fairly small payload. I think you dropped a 0.

Kernel bypass will get you line rate at 40 Gbps with 64-byte packets.
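
That line-rate number falls out of the usual frame-overhead arithmetic; a small sketch, assuming the minimum 12-byte inter-frame gap plus 8 bytes of preamble/SFD per frame (20 extra bytes on the wire):

    #include <stdio.h>

    /* Packets per second at line rate = link_bps / ((frame + 20) * 8). */
    static double mpps(double gbps, int frame_bytes)
    {
        return gbps * 1e9 / ((frame_bytes + 20) * 8) / 1e6;
    }

    int main(void)
    {
        printf("40G  @  64B: %6.1f Mpps\n", mpps(40, 64));    /* ~59.5  */
        printf("40G  @ 150B: %6.1f Mpps\n", mpps(40, 150));   /* ~29.4  */
        printf("100G @  64B: %6.1f Mpps\n", mpps(100, 64));   /* ~148.8 */
        return 0;
    }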

We've been showing IPsec over 40g links during the past week.

https://twitter.com/netgateusa/status/853694461456646144


Neat! I've heard you can get line rate on slightly bigger packets without bypass using IPVS with direct return, since it bypasses part of the handling.

In a lot of scenarios the big pipe is right at the load balancer so this is useful.


You can do 40GbE even if packet sizes are small, using any kernel bypass technique and multiple cores. Mellanox claims up to 130 Mpps.


Is this just blind forwarding, or does that include some decent routing logic? I've looked into kernel bypass but I keep shying away because it seems too proprietary, especially the virtualized flavor, which seems to be Intel-only.


Blind forwarding. It's left to the application to do everything.


Looks like you can get 100Gb NICs for servers, e.g. https://www.hpe.com/uk/en/product-catalog/servers/server-ada... (Intel, Mellanox, etc. also seem to have their own).

On the distance, there are other specs that support 100Gb up to 20km on there!


Where do you see an Intel 100GbE NIC?

Edit: I think you're referring to Omni-Path. To be clear, Intel has no 100GbE card available. That link is also PCIe 3.0 x8, so it can do a max of ~60 Gbps to the host.


Haha for when you need another DC that's sort of across town, or you're trying to add artificial latency intra DC with enormous coils of fiber.

Not too unreasonably expensive - roughly in line with a GeForce 1080, and that's after the Official HP Hardware markup. I wonder how much the switch/cabling is...


> trying to add artificial latency intra DC with enormous coils of fiber.

Do people really do this?! (I don't have DC experience, serious question)


I believe that's a reference to IEX's magic shoebox. 61 kilometers of cable that provide 350 microseconds of delay for stock market voodoo purposes.


At least one stock exchange has: https://en.m.wikipedia.org/wiki/IEX


Heh like the peers said, it was a reference to what IEX does.


There are Intel chipset single and dual port 100GbE pci-express 3.0 NICs with good Linux kernel support in v4.8+.


Note that "100G optical connections" is physically implemented as "4x25G optical connections". The 4 wavelengths are coarsely spaced, which allows cheaper optical components to be employed. (As opposed to long-haul optics, where it is economical to squeeze many more wavelengths on the same fiber with costlier optics.)


I must say that I'm fascinated by the fact that each bit is around 1 cm long while in transit, at near the speed of light (in vacuum).

Single-channel 100 Gb would mean 2-3 mm long bits. Imagine that :-)
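
A rough check of that intuition, assuming light in silica travels at roughly two-thirds of c (about 2e8 m/s, an approximation); the bit length is simply the propagation speed divided by the line rate:

    #include <stdio.h>

    int main(void)
    {
        const double v_fiber = 2.0e8;            /* m/s in fiber, approx. */
        const double rates[] = { 25e9, 100e9 };  /* bits per second */

        for (int i = 0; i < 2; i++)
            printf("%3.0f Gbps -> %.1f mm per bit\n",
                   rates[i] / 1e9, v_fiber / rates[i] * 1000.0);
        return 0;   /* 25G -> 8 mm (~1.2 cm in vacuum), 100G -> 2 mm */
    }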


Single-channel 100GbE is already done, but not with OOK; it's coherently modulated QPSK, 16QAM or 64QAM on a single wavelength. But it occupies a wider channel (in THz) than a single-wavelength 1550 nm 10GbE OOK circuit.


And even 1cm at light speed is still a good 7 thousand cycles long at these frequencies.


Because it's 25 Gbps electrically per lane, this also allows for economies of scale with new 25 GbE optical and direct attach NICs for servers, where 10GbE is not enough, but 100 is too costly. There is some good documentation and reference links on 25GbE on Wikipedia for anyone who is curious. This also allows for one 100GbE top of rack switch port to be broken out into 4x25GbE individual connections.

So if you have a 1.5RU 32-port 100GbE top of rack switch it can serve up to 120 servers, leaving two 100GbE ports free for uplink.


The other issue that people miss here is the physical board that the switch ASIC sits on. People might want to google SerDes and XAUI. The simplified version: back in the day, first-generation 10G used 4 x 3.125G lanes. So, for example, for the Fulcrum FM4000 (24x10G) you had 96 traces on the board, 4 to each port. For something like the Broadcom Trident+, which could do 64 ports of 10G, you have 256 traces. For each 10G port there are 4, and they have to line up to the nm on the board unless you want to use a PHY and/or retimers. Arista shipped the first PHYless design of this kind. Fewer parts, higher MTBF, less heat, and also less latency (PHYs add latency). To do 40G that way you would need 16 of the 3.125G lanes (yuk). Things moved to 4x10G lanes and you got 40G ports. With 25G Ethernet the lanes moved to 4x25G, so you get 100G ports. Anyway, random stuff if anyone is interested. Yes, I know I am mixing the SerDes and XAUI stuff, but this is a simpler view, just talking about the lanes from the chip. Google will give great answers if you want to dig deeper.
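
A small sketch of that lane arithmetic; the lane rates are the standard XAUI/XLAUI/CAUI-4 figures, and the port counts are illustrative (the 24-port case matches the FM4000 example above):

    #include <stdio.h>

    /* Lanes that have to be routed from the switch ASIC = ports x lanes/port.
     * (Each lane is really a differential pair per direction, so the actual
     * trace count is higher still.) */
    struct generation {
        const char *name;
        int ports;
        int lanes_per_port;
        double gbaud_per_lane;
    };

    int main(void)
    {
        const struct generation gens[] = {
            { "10G  (XAUI,   4 x 3.125G) ", 24, 4,  3.125   },
            { "40G  (XLAUI,  4 x ~10G)   ", 32, 4, 10.3125  },
            { "100G (CAUI-4, 4 x ~25G)   ", 32, 4, 25.78125 },
        };

        for (int i = 0; i < 3; i++) {
            const struct generation *g = &gens[i];
            printf("%s: %2d ports x %d lanes @ %.5g Gbaud = %3d lanes off the ASIC\n",
                   g->name, g->ports, g->lanes_per_port, g->gbaud_per_lane,
                   g->ports * g->lanes_per_port);
        }
        return 0;
    }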


The OCP link loss is 3.5 dB @ 500m, which would be 9.5 dB @ 2000m.

The MSA is 5 dB @ 2000m, so where is the additional loss coming from in the OCP spec? Different connectors?


It's coming from loosened tolerances. Higher/lower temperatures may mean higher loss. I'm assuming there's a bit more tolerance in the alignment of the optics/connectors as well. So, if you get lucky it might well do 5 dB @ 2 km, but it's not guaranteed/required by the spec.
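
A toy link-budget sketch to put numbers on that; the fiber attenuation (~0.4 dB/km for G.652 at 1310 nm) and 0.5 dB per mated connector are typical textbook assumptions, not values from either spec, and they illustrate that at these distances the budget is dominated by connectors rather than fiber:

    #include <stdio.h>

    /* Total channel loss = fiber length * attenuation + connector losses. */
    static double link_loss_db(double km, int connectors)
    {
        const double fiber_db_per_km = 0.4;   /* assumed, typical G.652 @ 1310 nm  */
        const double connector_db    = 0.5;   /* assumed, typical mated connector  */
        return km * fiber_db_per_km + connectors * connector_db;
    }

    int main(void)
    {
        /* 500 m with a handful of patch-panel connections... */
        printf("0.5 km, 6 connectors: %.1f dB\n", link_loss_db(0.5, 6));
        /* ...versus the full 2 km MSA reach. */
        printf("2.0 km, 6 connectors: %.1f dB\n", link_loss_db(2.0, 6));
        return 0;
    }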


This high bandwidth enables some interesting consumer services. The latency is also very good: less than a millisecond over 10 km, which is ten times faster than from your computer to the monitor.



