> *At Stripe we have a network overlay using IPv6 and the Linux kernel's built-i...

jmillikin · on Nov 29, 2020

I posted briefly elsewhere in the thread, but am now back at a computer with a keyboard so can give a more detailed answer.

First, background reading:

https://en.wikipedia.org/wiki/6to4

https://en.wikipedia.org/wiki/IPv6_rapid_deployment

The basic idea is you create a SIT tunnel device and assign it an IPv6 /64 composed of two parts:

1. A network prefix between 32 and 56 bits long. This prefix is the same for all machines in the network.

2. A subnet derived from the machine's IPv4 address, minus the netmask.

For example, if your IPv4 addresses are allocated from 192.168.1.0/24 and the machine has 192.168.1.155, then the network prefix should be 56 bits long (64 - (32 - 24)) and the machine's prefix is `xxxx:xxxx:xxxx:xx9B::/64`.
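The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the `fd00:1234:5678:ab00::/56` prefix and the `machine_prefix` helper are made up for the example, standing in for the `xxxx:xxxx:xxxx:xx` placeholder.

```python
import ipaddress


def machine_prefix(network_prefix: str, ipv4_net: str, ipv4_addr: str) -> ipaddress.IPv6Network:
    """Derive a machine's /64 from the shared prefix and the host bits of its IPv4 address."""
    v6_net = ipaddress.ip_network(network_prefix)
    v4_net = ipaddress.ip_network(ipv4_net)
    addr = ipaddress.ip_address(ipv4_addr)
    if addr not in v4_net:
        raise ValueError(f"{addr} is not inside {v4_net}")

    # Bits of the IPv4 address left over once the netmask is removed.
    host_bits = 32 - v4_net.prefixlen
    if v6_net.prefixlen + host_bits != 64:
        raise ValueError(f"network prefix must be {64 - host_bits} bits long")

    # Place the host bits directly below the shared prefix, ending at bit 64.
    host = int(addr) & int(v4_net.hostmask)
    base = int(v6_net.network_address) | (host << 64)
    return ipaddress.IPv6Network((base, 64))


# Hypothetical prefix; 155 == 0x9b lands in the low byte of the fourth group.
print(machine_prefix("fd00:1234:5678:ab00::/56", "192.168.1.0/24", "192.168.1.155"))
# fd00:1234:5678:ab9b::/64
```

With a /24 IPv4 network there are 8 host bits, so the shared prefix has to be 64 - 8 = 56 bits, matching the numbers in the example.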

The Linux kernel knows how to wrap the IPv6 with IPv4 so it can route within your local network to any other machine with a similarly configured tunnel device. If you want to send packets to 192.168.1.200 then they get addressed to `xxxx:xxxx:xxxx:xxC8::1` or whatever, they'll transit the IPv4 network like normal, and on arrival the receiving machine's kernel will strip off the IPv4 wrapper and route the IPv6 locally.
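As a rough model of that lookup (not the kernel's actual code), the IPv4 address to wrap a packet toward can be recovered from the IPv6 destination alone. The `ipv4_endpoint` helper and the `fd00:1234:5678:ab00::/56` prefix below are assumptions for illustration.

```python
import ipaddress


def ipv4_endpoint(dest_v6: str, network_prefix: str, ipv4_net: str) -> ipaddress.IPv4Address:
    """Recover the IPv4 address of the machine that owns an IPv6 destination."""
    v6_net = ipaddress.ip_network(network_prefix)
    v4_net = ipaddress.ip_network(ipv4_net)
    dest = ipaddress.ip_address(dest_v6)
    if dest not in v6_net:
        raise ValueError(f"{dest} is outside the overlay prefix {v6_net}")

    # The bits between the shared prefix and bit 64 are the IPv4 host bits.
    host_bits = 64 - v6_net.prefixlen
    if host_bits != 32 - v4_net.prefixlen:
        raise ValueError("prefix lengths do not line up")

    host = (int(dest) >> 64) & ((1 << host_bits) - 1)
    return ipaddress.IPv4Address(int(v4_net.network_address) | host)


# 0xc8 == 200, so this destination maps back to 192.168.1.200.
print(ipv4_endpoint("fd00:1234:5678:abc8::1", "fd00:1234:5678:ab00::/56", "192.168.1.0/24"))
```

Because the mapping is pure bit manipulation, no per-destination routing table or lookup service is involved.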

How's this useful? Well, if each machine has a /64 prefix then each pod can be allocated an IPv6 within that prefix without coordinating with other machines. Let's say the pod gets `xxxx:xxxx:xxxx:xxC8::aaaa:bbbb:cccc`. Anything with a correctly configured tunnel and that pod IP can route it traffic, no proxy or iptables needed.
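A minimal sketch of that local allocation, assuming pods simply get a random 64-bit suffix inside the machine's /64. The `allocate_pod_address` helper, the example prefix, and the choice to reserve `::0` and `::1` are all assumptions, not a description of the actual setup.

```python
import ipaddress
import secrets


def allocate_pod_address(machine_net: str, in_use: set) -> ipaddress.IPv6Address:
    """Pick an unused address inside this machine's /64 without asking any other machine."""
    net = ipaddress.ip_network(machine_net)
    if net.prefixlen != 64:
        raise ValueError("expected a /64 machine prefix")
    while True:
        suffix = secrets.randbits(64)
        if suffix in (0, 1):
            continue  # assumption: keep ::0 and ::1 for the machine itself
        addr = ipaddress.IPv6Address(int(net.network_address) | suffix)
        if addr not in in_use:
            in_use.add(addr)  # only local state is consulted and updated
            return addr


used = set()
print(allocate_pod_address("fd00:1234:5678:abc8::/64", used))
```

The only state consulted is the local `in_use` set, which is why no cross-machine coordination is needed: two machines can never pick the same address because their /64 prefixes differ.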

derefr · on Nov 29, 2020

> Well, if each machine has a /64 prefix then each pod can be allocated an IPv6 within that prefix without coordinating with other machines. Let's say the pod gets `xxxx:xxxx:xxxx:xxC8::aaaa:bbbb:cccc`. Anything with a correctly configured tunnel and that pod IP can route it traffic, no proxy or iptables needed.

I’ve been meaning for a while now to experiment with this same idea in Erlang. I.e., hack up the Erlang runtime to use an IPv6 address as its PID type, such that each Erlang node running on a machine gets its own /64 subnet to hand out; and each Erlang actor-process on that node gets an IP allocated from its node’s /64 range.

This could just be a way of letting Erlang nodes talk to each other through tunnels. Or it could be a way of having Erlang “VMs” exposed directly to the Internet as their own little machines.

simonebrunozzi · on Nov 29, 2020

This deserves a blog post in itself. Please be kind and share it with the world! I bet you will hire a few engineers as a result of this blog post :)

bogomipz · on Nov 29, 2020

I think you might mean a SIT (Simple Internet Transition) interface and not SIP? In case anyone is interested, this is a quick read on setting this up:

https://kogitae.fr/debianipv6-debian-wiki.htm

jmillikin · on Nov 29, 2020

Yes, sorry, SIT -- it's been a while since I set it up and I forgot some of the details.

dang · on Nov 29, 2020

We've fixed that typo in the GP comment now.

rualca · on Nov 29, 2020

Outstanding post. Thank you for taking the time to share this gem.

ownagefool · on Nov 29, 2020

I haven't used it, but doesn't this suit the average use case?

https://www.cni.dev/plugins/main/macvlan/

Basically just do normal ipv4 via your dhcp server rather than an overlay.

-- Edit

For argument's sake, I just set this up:

root@nas:/opt/cni/bin# ./dhcp daemon

cat /etc/cni/net.d/01-macvlan.conf

    {
      "name": "mynet",
      "type": "macvlan",
      "master": "eno1",
      "ipam": {
        "type": "dhcp",
        "routes": [{ "dst": "192.168.1.0/24" }]
      }
    }

PODIP = 192.168.1.181:8096

Works in my browser; so routes correctly.

Got its DHCP lease from my Pi-hole.

daenney · on Nov 29, 2020

This is super useful for home networks; I do this for my k8s cluster hosted by a bunch of Pis.

But in production, I'd rather my ability to launch a new pod not be dependent on a DHCP server being reachable and functional. In that case, this particular trick is rather neat, since assignment of IP addresses is fully static/local (without having to agree upfront what range of IPs each node can use for bringing pods online), while retaining the benefit of everything being directly routable. You can now also run a ridiculous number of pods on a single node.

ownagefool · on Nov 30, 2020

Yeah, I don't do this in production.

Though to counter your point, you don't actually need to use an external DHCP server in my example either. You can just define the block you're giving the server via the macvlan/ipvlan plugin, and I presume, again, it works with either IPv4 or IPv6.

So I guess my wider point is, k8s probably doesn't need to be replaced to have the networking work how you like.