One does not simply go from a flat network to overlays. Overlays are slow, difficult, cause really odd failures and are often hilariously immature. They are the experimental graph database of the network world.
Just have a segregated network, and let the VPC/DHCP do all the hard stuff.
Have your hosts on the default VLAN (or interface if you're cloudy), with its own subnet (subnets should only exist in one VLAN). Then, if you are in cloud land, have a second network adaptor on a different subnet. If you are running real steel, then you can use a bonded network adaptor with multiple VLANs on the same interface. (The need for a VLAN in a VPC isn't that critical because there are other tools to impose network segregation.)
Then use macvtap, or macvlan (or whichever thing gives each container a MAC address), to give each container its own IP. This means that your container is visible on that entire subnet, both inside the host and outside it.
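A minimal sketch of what that looks like with Docker's macvlan driver, assuming a made-up parent interface, subnet and addresses (wrapped in Python purely to keep the commands in one place):

```python
import subprocess

# Hypothetical values -- use your own parent interface, subnet and gateway.
PARENT_IF = "eth0"
SUBNET = "10.0.1.0/24"
GATEWAY = "10.0.1.1"

# Create a macvlan-backed Docker network: every container attached to it gets
# its own MAC address and an IP taken straight from the host's subnet.
subprocess.run([
    "docker", "network", "create", "-d", "macvlan",
    "--subnet", SUBNET, "--gateway", GATEWAY,
    "-o", f"parent={PARENT_IF}", "pubnet",
], check=True)

# The container is now addressable from the whole subnet -- no port publishing,
# no NAT on the host, no overlay.
subprocess.run([
    "docker", "run", "-d", "--network", "pubnet",
    "--ip", "10.0.1.50",   # or leave this out and let the driver pick one
    "nginx",
], check=True)
```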
There is no need to faff with routing; it comes for free with your VPC/network or similar. Each container automatically has a hostname, IP and route. It will also be fast. As a bonus, it can all be created at the start using CloudFormation or Terraform.
You can have multiple adaptors on a host, so you can separate different classes of container.
Look, the more networking that you can offload to the actual network, the better.
If you are ever re-creating DHCP/routing/DNS in your project, you need to take a step back and think hard about how you got there.
70% of the networking modes in k8s are batshit insane. A large number are basically attempts at vendor lock-in, or worse, someone's experiment that's got out of hand. I know networking has always been really poor in docker land, but there are ways to beat the stupid out of it.
The golden rule is this:
Always. Avoid. Network. Overlays.
I will have to take the other side of that golden rule. Not sure where it came from. But when one has a decent handle on the tools at hand, they work wondrously well.
I have bare metal servers tied together with L3 routing via Free Range Routing running BGP/VxLAN. It Just Works.
No hard-coded VLANs between physical machines. Just point-to-point L3 links. VLANs are tortuous between machines as a layer 2 protocol, given spanning tree and all of its slow-to-converge madness.
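Not the parent's actual config, but a rough sketch of the FRR side of a setup like that: BGP unnumbered over the point-to-point links, with EVPN carrying the VXLAN VNIs. The ASN and interface names are invented:

```python
# Renders an frr.conf for one server; interface names and ASN are assumptions.
UPLINKS = ["eth1", "eth2"]   # the point-to-point L3 links into the fabric
ASN = 65001

lines = [f"router bgp {ASN}"]
for ifc in UPLINKS:
    # BGP unnumbered: peer over the interface's IPv6 link-local address,
    # so there is nothing per-link to hard-code (no VLANs, no neighbor IPs).
    lines.append(f" neighbor {ifc} interface remote-as external")
lines.append(" address-family l2vpn evpn")
lines += [f"  neighbor {ifc} activate" for ifc in UPLINKS]
lines.append("  advertise-all-vni")   # announce local VXLAN VNIs over EVPN
lines.append(" exit-address-family")

with open("/etc/frr/frr.conf", "w") as fh:
    fh.write("\n".join(lines) + "\n")
```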
OP was mostly talking about cloud + docker containers. Your use-case is unrelated and seems to make sense. But I still agree with OP, and I believe overlays in the cloud are generally an anti-pattern of unnecessary complexity.
Where I work we use overlays (flannel) and it just works. I don't think we've had issues. AFAIK the primary reason was that the network can be secure/encrypted. Otherwise you're running everything with TLS and managing all the certs can be more painful. Or you're running without encryption which is a potential security problem. You still need to do that for external facing stuff but that's a lot less.
Site is having issues atm... but I'll throw something out there I'd really like to see.
We encrypt 100% of our machine-to-machine traffic at the TCP level. There's a lot of shuffling of certs around to get some webapp to talk to postgres, then have that webapp serve https to haproxy, etc.
It'd be awesome if there was a way your cloud servers could just talk to each other using WireGuard by default. We looked at setting it up, but it'd need to be automated somehow for anything above a handful of systems :/
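For what it's worth, the automation mostly boils down to "render a full-mesh wg0.conf per host from an inventory and push it out with whatever config management you already have". A rough sketch, with placeholder keys, endpoints and tunnel addresses:

```python
# Hypothetical inventory: public keys, public endpoints and tunnel addresses
# below are placeholders, not real values.
HOSTS = {
    "web-1": {"pub": "<pubkey-web-1>", "wan": "203.0.113.10", "wg": "10.99.0.1"},
    "db-1":  {"pub": "<pubkey-db-1>",  "wan": "203.0.113.20", "wg": "10.99.0.2"},
}

def render_wg_conf(name: str, private_key: str) -> str:
    """Render /etc/wireguard/wg0.conf for one host: itself plus every peer."""
    me = HOSTS[name]
    out = [
        "[Interface]",
        f"PrivateKey = {private_key}",
        f"Address = {me['wg']}/24",
        "ListenPort = 51820",
    ]
    for peer, info in HOSTS.items():
        if peer == name:
            continue
        out += [
            "",
            "[Peer]",
            f"PublicKey = {info['pub']}",
            f"Endpoint = {info['wan']}:51820",
            f"AllowedIPs = {info['wg']}/32",  # route only that peer's tunnel IP
        ]
    return "\n".join(out) + "\n"

# Ship the rendered file, then `wg-quick up wg0` on each box; anything sent to
# a 10.99.0.0/24 address now goes over the encrypted tunnel.
```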
I agree with your viewpoint but I'm also aware of several security standards that explicitly specify all traffic between hosts needs to be encrypted. Sometimes it's easier to meet the standard verbatim than try and justify an exception. If you already use a configuration management tool it shouldn't be a lot more overhead to install some certificates.
If you think about these things like physical networks, you can do things like run an interface in promiscuous mode and sniff traffic.
Further, leaving your VM, you hit a shared NIC and network cables, so you start to worry about physical layer attacks.
Amazon specifically states they handle these issues, and indeed they likely do, but how do you know? If you're able to easily encrypt by using something like istio, then why not?
More specifically:
"Packet sniffing by other tenants: It is not possible for a virtual instance running in promiscuous mode to receive or“sniff” traffic that is intended for a different virtual instance. While customers can place their interfaces into promiscuous mode, the hypervisor will not deliver any traffic to them that is not addressed to them. This includes two virtual instances that are owned by the same customer, even if they are located on the same physical host. Attacks such as ARP cache poisoning do not work within EC2. While Amazon EC2 does provide ample protection against one customer inadvertently or maliciously attempting to view another’s data, as a standard practice customers should encrypt sensitive traffic."
This still has the same problems as distributing certs and setting everything up :/ was looking for something that "encrypts literally everything" when it goes out to another machine on the cloud
In my mind, a "layer 2 subnet" really doesn't mean anything. Subnets are things that happen in IP, that is, layer 3, and layer 2 is the physical connection, ie. Ethernet or WLAN, which don't have the concept of subnets.
Edit: also the OSI layer model was specified in the eighties, and isn't all that accurate in 2019 to describe how our networks actually work.
A VLAN will isolate MACs so that only those adaptors in that VLAN can see each other. Granted, there isn't really a concept of a netmask-based subnet, but then that's because you don't really have control over one's physical address.
Now, you can have an adaptor in more than one VLAN, which is the point of them. As I said it's not a perfect analogy, but then they are there to achieve different things based on different semantics.
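For example, putting one adaptor in two VLANs is just a pair of tagged sub-interfaces, each on its own subnet. A quick sketch with made-up interface names, VLAN IDs and addresses:

```python
import subprocess

# Made-up parent interface, VLAN IDs and addresses, purely for illustration.
PARENT = "bond0"
for vlan_id, addr in [(100, "10.100.0.5/24"), (200, "10.200.0.5/24")]:
    sub = f"{PARENT}.{vlan_id}"
    # One tagged sub-interface per VLAN, each living in its own subnet.
    subprocess.run(["ip", "link", "add", "link", PARENT, "name", sub,
                    "type", "vlan", "id", str(vlan_id)], check=True)
    subprocess.run(["ip", "addr", "add", addr, "dev", sub], check=True)
    subprocess.run(["ip", "link", "set", sub, "up"], check=True)
```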
IPv6 doesn't save you from any routing problems that IPv4 won't save you from. While IPv6 tries to hide the layer 2/layer 3 distinction from you, it doesn't actually make your physical network magically work differently. Internally IPv6 tries to implement this hiding using multicast - same as the VXLAN suggestion in the article. If you overload your network infrastructure's multicast support, at best you fall back to broadcast, which is just like reconfiguring your physical network to bridge all your layer 2 segments into one: if that won't work for you in IPv4, it won't work in IPv6. (And at worst, it stops routing correctly.) If you don't have multicast support at all in your network infrastructure, which as the article points out isn't common to have on cloud networks, then IPv6 won't be able to help you. You'll still need fancy routing and tunneling to make things work, whether you address machines with IPv4 or IPv6.
In my experience, IPv4 has the strong advantage of being familiar and well-supported, which means that when (not if) your network infrastructure starts to act up, it's easier to figure out what's going on. IPv6 works great if you have robust, reliable multicast support on all your devices and nothing ever goes wrong.
IPv4 numbering sucks. IPv6 lets you stop worrying about that.
In IPv4 you're going to need RFC1918 addresses, and then you're going to have to make sure that _your_ RFC1918 addresses don't conflict with any _other_ RFC1918 addresses that inevitably absolutely everything else is using or else you'll get hard-to-debug confusion. No need in IPv6, you should use globally unique addresses everywhere, there are plenty and you will not run out.
Everybody who has ever used a single byte to store a value they were convinced wouldn't need to be more than a few dozen, and then it blew up because somebody figured 300 ought to fit and it doesn't already knows in their heart that they shouldn't be using IPv4 in 2019.
Oh, yes, IPv6 saves you from worrying about addressing, which is a huge headache in IPv4. I agree with that and IPv4 address conflicts are a personal frustration. IPv6 doesn't save you from "fancy routing" and mostly does not save you from "nat," though. That's what I was responding to.
I'm hesitant to use IPv6 because it is not merely IPv4 + more addresses, it's IPv4 + more addresses + a very clever design that hides the L2 vs. L3 distinction by relying heavily on multicast groups + a replacement for ARP + a replacement for DHCP + etc. etc. etc. I know I shouldn't be using IPv4 in 2019, but I don't have a better option. I'm not excited about clever systems, hiding, the assumption that multicast works reliably, losing the last few decades of monitoring and debugging tools, happy eyeballs, etc., and I'm not willing to subject my users to the resulting outages simply because it'll save me the headache of thinking about numbering.
Why can't you just do all your v6 routing at L3 and skip L2 and NDP? If you were using WireGuard, which is L3, then all you would need is a way to manage the routes. Can you make BGP work without L2? I seem to think you can but I've never tried it.
ZeroTier supports a mode where it emulates NDP for v6 and works without having to do multicast or broadcast at all. It does this by embedding its cryptographic VL1 addresses into v6 addresses.
Well, I have hardware routers that know about L2, and I'd like to have them do as much routing as possible. I'm running Quagga to advertise my VXLAN routes to my hardware routers, so packets originated on bare metal can reach my virtualized infrastructure and vice versa. I want them to know that if the machine advertising this particular IPv6 subnet is in the same rack, packets can go there and don't have to go to a dedicated gateway for all my VXLAN traffic.
I could run IPv6 on the inside and IPv4 on the outside, sure. I worry this is going to trigger more edge cases than either running IPv6 the way it was intended or IPv4 the way it was intended.
> If you don't have multicast support at all in your network infrastructure, which as the article points out isn't common to have on cloud networks,
Huh? Are you assuming large flat L2 networks addressed with IPv6?
IPv6 works great at scale, just route everything everywhere, stick with unicast & anycast, and don't roll large L2 domains.
Multicast is entirely unnecessary aside from the small amount needed for ND/RA between host and ToR.
And, for operations, a routed IPv6 network without NAT, VXLAN, or VLANs spanned across switches is much easier to troubleshoot and generally has fewer moving parts to fail.
Yes, I am. The sort of network architecture you describe works great, but unless I'm seriously misunderstanding the article (which maybe I am!), all the use cases where you'd want VXLAN or Wireguard on IPv4 are incompatible with such an architecture on IPv6.
I will grant that IPv6 + ULAs + BGP + flat networks is easier to think about than IPv4 + 10.0.i++.0/24 + BGP + flat network because you have basically unlimited ULAs, but "You have to pick a unique 10.0.i++.0 for each machine, and that's annoying" doesn't seem like the primary thing the article is trying to forget. If you can do a hierarchical routed IPv6 network, you can almost certainly do it with IPv4, too.
This article uses Quagga - they really should be using FRRouting, which was forked from Quagga in 2017 by the core Quagga developers and has 4 times as many commits (16000[0] vs 4000[1]), far more features, bugfixes, etc. Quagga has been dead for over a year.
For a more full-fledged use case FRRouting may be the way to go. We are using Quagga here mainly to maintain internal routes with iBGP and route reflectors. This is a fairly simple use case.
Quagga is available in the default package managers of most distros, so it's a good place to start.
FRR is used in production and isn't academic, I don't know where you got that impression.
Microsoft runs FRR on SONiC.
Vyos runs FRR.
6wind runs FRR.
Cumulus Networks runs FRR.
Juniper runs FRR in certain products.
VMware runs FRR.
Broadcom is integrating it.
I don't think you are very familiar with the scope of changes that have gone in since Quagga. Not to detract from BIRD - which is a great, solid BGP implementation - but it is disingenuous to say FRR isn't used in production.
This comment completely misses the point. There is a distinction between "complete" and "dead", to whatever degree any software can be called "complete".
The Quagga source repo[1]'s certificate expired over 6 months ago. Looking at the Bugzilla[2] report (also with an expired certificate) there are 14 blockers, 49 critical and 69 issues that have not been resolved.
So no, I'd agree with the parent comment that using a project as seemingly dead as Quagga for something as critical as BGP routing is putting yourself on shaky ground at the very least.
You missed the point. It’s a demo doing trivial bgp stuff that hasn’t changed for 15 years.
It’s like someone doing a demo on some text processing where they use grep and the top comment is some jerk saying that map-reduce would be better because some new large systems use it and it’s being actively developed.
Yes, I would trust an actively developed fork of a TCP stack that is 2 years ahead of its forked project more than the original. Especially for something as critical as TCP, and equally so for BGP routing. Why use a dead project that hasn't gotten bug fixes for years?
It depends on the implementation of the control plane and how you maintain the mesh between the different servers (L2<=>L3 for ARP resolution, MAC learning).
Historically vxlan was a multicast thing, but not anymore.
Flannel (popular among the container networking solutions) will maintain its state in etcd by watching the Kubernetes resources, then program the Linux data plane with static unicast entries for the neighbors.
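Concretely, "static unicast entries" means roughly this per remote node, instead of multicast learning (flannel does the equivalent over netlink; the device name, MAC and addresses here are made up):

```python
import subprocess

# Made-up example values: the remote node's VTEP MAC, its underlay address,
# and the gateway of the pod subnet it hosts.
REMOTE_VTEP_MAC = "aa:bb:cc:dd:ee:ff"
REMOTE_NODE_IP = "192.168.1.20"
REMOTE_POD_GW = "10.244.2.0"

# Static neighbor entry: resolve the remote pod gateway without ARP flooding.
subprocess.run(["ip", "neigh", "replace", REMOTE_POD_GW,
                "lladdr", REMOTE_VTEP_MAC, "dev", "flannel.1"], check=True)

# Static FDB entry: tell the VXLAN device which underlay IP that MAC sits behind.
subprocess.run(["bridge", "fdb", "append", REMOTE_VTEP_MAC,
                "dev", "flannel.1", "dst", REMOTE_NODE_IP], check=True)
```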