Babel: a routing protocol with wireless support

Previously I wrote about resurrecting the old forgotten routing protocol, RIPng. In a small network of more than one router, you need a routing protocol to share information between the routers. I used RIPng for about six months, turned it on, and pretty much forgot that it was running. Worked like a charm in my wired network.

I moved to a new (to me) house this summer, and thought it was a good opportunity to try out a routing protocol which not only handles wired networks but also wireless. Babel seemed just the thing for this environment.

Enter Babel

Babel is a loop-avoiding distance-vector routing protocol that is robust and efficient both in ordinary wired networks and in wireless mesh networks. Based on the loss of hellos the cost of wireless links can be increased, making sketchy wireless links less preferred.
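
To make the idea of "cost goes up as hellos are lost" concrete, here is a minimal Python sketch of the ETX-style calculation described in Appendix A of RFC 6126. The function name etx_cost and the simple "fraction of hellos received" estimator are assumptions made for illustration; a real implementation such as bird or babeld may weight the hello history differently.

```python
def etx_cost(hello_history, txcost=256):
    """Estimate a link cost from a window of hello results (ETX-style sketch).

    hello_history: sequence of booleans, True if that hello was received.
    txcost: the cost the neighbour advertises for our hellos (256 = no loss).
    """
    history = list(hello_history)
    received = sum(history)
    if received == 0:
        return 0xFFFF  # nothing heard: treat the link as unusable (infinite cost)
    beta = received / len(history)          # estimated reception probability
    rxcost = 256 / beta                     # RFC 6126 A.2.2: rxcost = 256/beta
    cost = max(txcost, 256) * rxcost / 256  # RFC 6126 A.2.2 combined cost
    return min(int(round(cost)), 0xFFFF)
```

With every hello received the cost stays at the baseline of 256, and losing half of the hellos doubles it to 512, which is how a sketchy link ends up less preferred.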

RFC 6126 specifies the routing protocol. There are two implementations which are supported on OpenWrt routers: babeld and bird.

Creating a network with redundant paths

Like anything in networking, it starts with the physical layer (wireless is a form of physical layer). I attached the wireless links of the backup link router to the production and test routers, creating redundant paths of connectivity within my house.

Network Diagram

Running BIRD with Babel

I chose bird6 (the IPv6 version of bird on OpenWrt) because I already had it installed on the routers for RIPng. It was merely a matter of commenting out the RIP section in the /etc/bird6.conf file, and enabling Babel.

The Bird Documentation provides an example. Add the following to /etc/bird6.conf to get Babel running in bird6:

protocol babel {
    interface "wlan0", "wlan1" {
        type wireless;
        hello interval 1;
        rxcost 512;
    };
    interface "br-lan" {
        type wired;
    };
    import all;
    export all;
}

In the example above, wlan0 is the 2.4 GHz radio, and wlan1 is the 5 GHz radio.

Checking the path of connectivity

When determining the connectivity path, traceroute6 (the IPv6 version) is your friend. Checking between the laptop and the DNS server, the path is:

$ traceroute6 6dns
traceroute to 6lilikoi.hoomaha.net (2001:db8:ebbd:4118::1) from 2001:db8:ebbd:bac0:d999:cd8a:cd9b:2037, port 33434, from port 49819, 30 hops max, 60 bytes packets
 1  2001:db8:ebbd:bac0::1 (2001:db8:ebbd:bac0::1)  4.561 ms  0.510 ms  0.487 ms 
 2  2001:db8:ebbd:4118::1 (2001:db8:ebbd:4118::1)  2.562 ms  2.193 ms  1.927 ms 
$ 

The traceroute is showing the path going clockwise through the 2.4 GHz wireless link.

Network Failure!

To test how well Babel can automatically route around failed links, I started a ping to the DNS server from the laptop and disabled the 2.4 GHz radio, thus blocking the link the pings were using, and waited...

$ ping6 6dns
PING 6dns(2001:db8:ebbd:4118::1) 56 data bytes
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=1 ttl=63 time=3.54 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=2 ttl=63 time=1.64 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=3 ttl=63 time=2.02 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=4 ttl=63 time=1.64 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=5 ttl=63 time=1.51 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=6 ttl=63 time=1.65 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=7 ttl=63 time=1.58 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=8 ttl=63 time=5.80 ms
From 2001:db8:ebbd:bac0::1 icmp_seq=33 Destination unreachable: No route
From 2001:db8:ebbd:bac0::1 icmp_seq=34 Destination unreachable: No route
...
From 2001:db8:ebbd:bac0::1 icmp_seq=48 Destination unreachable: No route
From 2001:db8:ebbd:bac0::1 icmp_seq=49 Destination unreachable: No route
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=101 ttl=61 time=2.12 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=102 ttl=61 time=3.42 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=103 ttl=61 time=3.16 ms

As you can see the outage was 93 seconds (101 - 8). Not a record time, OSPF would converge much faster, but still it did fix itself without human intervention.
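
The subtraction above can be automated. The following is a rough Python sketch that scans saved ping6 output for the largest jump in icmp_seq between successful replies. The function name longest_outage is made up for this illustration, and reading the gap as seconds assumes ping's default interval of one packet per second.

```python
import re

# Matches successful replies only, e.g.
# "64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=8 ttl=63 time=5.80 ms"
REPLY_RE = re.compile(r"^\d+ bytes from \S+ icmp_seq=(\d+)\b")

def longest_outage(ping_lines):
    """Return (last_ok_seq, next_ok_seq, gap) for the largest gap, or None."""
    seqs = []
    for line in ping_lines:
        match = REPLY_RE.match(line)
        if match:
            seqs.append(int(match.group(1)))
    best = None
    for prev, cur in zip(seqs, seqs[1:]):
        gap = cur - prev
        if best is None or gap > best[2]:
            best = (prev, cur, gap)
    return best
```

Fed the capture above, it returns (8, 101, 93). The "Destination unreachable" lines are ignored because they do not start with a byte count.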

Checking the connectivity path with traceroute6:

$ traceroute6 6dns
traceroute to 6lilikoi.hoomaha.net (2001:db8:ebbd:4118::1) from 2001:db8:ebbd:bac0:d999:cd8a:cd9b:2037, port 33434, from port 47725, 30 hops max, 60 bytes packets
 1  2001:db8:ebbd:bac0::1 (2001:db8:ebbd:bac0::1)  0.541 ms  0.445 ms  0.437 ms 
 2  2001:db8:ebbd:2080::1 (2001:db8:ebbd:2080::1)  1.705 ms  1.832 ms  1.817 ms 
 3  2001:db8:ebbd:2000::1 (2001:db8:ebbd:2000::1)  2.273 ms  1.891 ms  2.584 ms 
 4  2001:db8:ebbd:4118::1 (2001:db8:ebbd:4118::1)  2.348 ms  2.822 ms  2.289 ms 
$

The path can now be seen to be traveling counter-clockwise around the circle via the 5 GHz link. The Babel routing protocol is routing packets around the failure.
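
Comparing two traceroutes by eye works for a network this small. For anything larger, a short script helps. The Python sketch below pulls the hop addresses out of saved traceroute6 output so that two runs can be compared. The function name hops is hypothetical, and the regular expression assumes the "N  name (address)  times" layout shown above. Hops that time out (printed as asterisks) are simply skipped.

```python
import re

# Matches hop lines such as
# " 2  2001:db8:ebbd:2080::1 (2001:db8:ebbd:2080::1)  1.705 ms ..."
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([0-9a-fA-F:]+)\)")

def hops(traceroute_lines):
    """Return the hop addresses, in order, from traceroute6 output lines."""
    result = []
    for line in traceroute_lines:
        match = HOP_RE.match(line)
        if match:
            result.append(match.group(3))
    return result
```

Calling hops() on the output before and after the failure, and comparing the two lists, shows the extra routers (2080::1 and 2000::1) that appear on the backup path.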

Restoring the Network

Starting a ping6 again from the laptop to the DNS server, and enabling the 2.4 GHz radio, one can measure the time of the outage while Babel recalculates the shortest path.

$ ping6 6dns
PING 6dns(2001:db8:ebbd:4118::1) 56 data bytes
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=1 ttl=61 time=2.56 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=2 ttl=61 time=2.20 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=3 ttl=61 time=2.09 ms
...
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=25 ttl=63 time=8.03 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=26 ttl=63 time=6.13 ms
From 2001:db8:ebbd:bac0::1 icmp_seq=27 Destination unreachable: No route
From 2001:db8:ebbd:bac0::1 icmp_seq=28 Destination unreachable: No route
From 2001:db8:ebbd:bac0::1 icmp_seq=29 Destination unreachable: Unknown code 5
From 2001:db8:ebbd:bac0::1 icmp_seq=30 Destination unreachable: Unknown code 5
From 2001:db8:ebbd:bac0::1 icmp_seq=31 Destination unreachable: Unknown code 5
From 2001:db8:ebbd:bac0::1 icmp_seq=32 Destination unreachable: Unknown code 5
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=33 ttl=63 time=4.06 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=34 ttl=63 time=4.21 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=35 ttl=63 time=1.50 ms

The outage was a quick 7 seconds (33 - 26) as Babel restored the lower cost path. Check the path again with traceroute6, and it can be seen that the 2.4 GHz link is being used again:

$ traceroute6 6dns
traceroute to 6lilikoi.hoomaha.net (2001:db8:ebbd:4118::1) from 2001:db8:ebbd:bac0:d999:cd8a:cd9b:2037, port 33434, from port 44936, 30 hops max, 60 bytes packets
 1  2001:db8:ebbd:bac0::1 (2001:db8:ebbd:bac0::1)  0.417 ms  0.405 ms  0.444 ms 
 2  2001:db8:ebbd:4118::1 (2001:db8:ebbd:4118::1)  1.575 ms  1.640 ms  1.925 ms 
$

Bird CLI

On OpenWrt there is also a Bird CLI package called birdc6. It offers a nice view of what Bird is doing under the covers. Since I was keeping things simple, only Babel was enabled, but bird can certainly redistribute routes between differing routing protocols such as RIPng, OSPF and Babel.

Looking at Babel running in bird, one can see the interfaces, neighbours, and entries. On the backup link router, the following shows that wlan0 has two neighbours (the production router and the DNS server, also running Babel).

# birdc6 
BIRD 1.6.3 ready.
bird> show babel interfaces
babel1:
Interface  State  RX cost   Nbrs  Timer
br-lan     Up          96      0      4
wlan1      Up         512      1      1
wlan0      Up         512      2      1
bird> 

Since this is IPv6, not surprisingly, the neighbours command displays the link-local addresses of the Babel peers.

bird> show babel neighbor
babel1:
IP address                Interface  Metric Routes Next hello
fe80::2ac6:8eff:fe16:19d7 wlan1          96     10          3
fe80::58ef:68ff:fe0d:51b7 wlan0          96     10          5
fe80::224:a5ff:fed7:3089  wlan0          96     10          5
bird> 

The entries command shows the routing entries which Babel will use to calculate paths:

bird> show babel entries
babel1:
Prefix                        Router ID               Metric Seqno Expires Sources
::/0                          00:00:00:00:00:00:00:00    256   290      44       1
2001:db8:ebbd:4118::/64       00:00:00:00:00:00:00:00    256 65533      44       1
2001:db8:ebbd:4110::/60       00:00:00:00:00:00:00:00    256 65533      44       1
2001:db8:ebbd:4110::/64       00:00:00:00:00:00:00:00    256 65533      44       1
2001:db8:ebbd:2080::/57       00:00:00:00:00:00:00:00    256 40740      54       1
2001:db8:ebbd:2080::/64       <self>                       0 45974       0       0
2001:db8:ebbd::/64            00:00:00:00:00:00:00:00    256 40740      54       1
2001:db8:ebbd::/48            00:00:00:00:00:00:00:00    256   290      44       1
2001:db8:ebbd:bac0::/64       <self>                       0 45974       0       0
bird> 

A note about RouterIDs

Unlike RIPng, which has no concept of RouterID, Babel uses the RouterID to identify the source of routes and avoid loops. Unfortunately, there is a bug in this version of bird (v1.6.3) which does not display the RouterID in the entries. Using wireshark to sniff the Babel packets (UDP port 6696), it can be seen that the RouterIDs are being transmitted.
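
For the curious, here is roughly what wireshark is decoding. The Python sketch below walks the TLVs in a single Babel UDP payload and collects any Router-Id TLVs, following the packet layout in RFC 6126 (a 4-byte header with magic 42 and version 2, then type/length/value records, with Router-Id as TLV type 6). The function name router_ids is an assumption for illustration, and capturing the payload from the wire is left out.

```python
import struct

BABEL_MAGIC = 42      # first header byte of every Babel packet
BABEL_VERSION = 2     # protocol version in the second header byte
TLV_PAD1 = 0          # single-byte padding TLV, has no length field
TLV_ROUTER_ID = 6     # Router-Id TLV

def router_ids(payload: bytes):
    """Return Router-IDs (colon-separated hex) found in one Babel UDP payload."""
    if len(payload) < 4:
        raise ValueError("too short for a Babel header")
    magic, version, body_len = struct.unpack("!BBH", payload[:4])
    if magic != BABEL_MAGIC or version != BABEL_VERSION:
        raise ValueError("not a Babel version 2 packet")
    body = payload[4:4 + body_len]
    found = []
    i = 0
    while i < len(body):
        tlv_type = body[i]
        if tlv_type == TLV_PAD1:
            i += 1
            continue
        if i + 2 > len(body):
            break  # truncated TLV header, stop parsing
        length = body[i + 1]
        value = body[i + 2:i + 2 + length]
        if tlv_type == TLV_ROUTER_ID and len(value) >= 10:
            # two reserved bytes, then the 8-byte Router-ID
            found.append(":".join(f"{b:02x}" for b in value[2:10]))
        i += 2 + length
    return found
```

All other TLV types are skipped using their length byte, so the sketch does not need to understand Hello, IHU or Update records to find the RouterID.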

Wireless is great, except ...

As more and more things come online using wireless, there will be more interference and contention for bandwidth, especially in the 2.4 GHz band. Babel enables routing of packets around wireless links made sketchy by interference in a crowded wifi environment.

Your Metric may vary

Because wireless is variable, Babel applies differing metrics to routes as the wireless signal changes. An unfortunate side effect of this is that the network is continuously converging (or changing). The route that was used a minute ago to reach a remote host may be invalid the next minute.

I noticed this as my previously very stable IPv6-only servers were now disconnecting, or worse, not reachable.

Route Flapping!

As I looked at the OpenWrt syslog (using the logread command) I could see that the routes were continually changing.

Tue Jul 24 14:46:45 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:46:45 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:46:46 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:47:01 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:01 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:02 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:47:33 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:33 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:34 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:47:49 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:49 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:50 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:48:53 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:48:53 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:48:54 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
...
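
To put a number on the churn, the odhcpd messages can be tallied per minute. This is a small Python sketch that assumes the logread timestamp format shown above (the first 24 characters of each line). The function name changes_per_minute is made up for this example.

```python
from collections import Counter
from datetime import datetime

MARKER = "default route change"

def changes_per_minute(log_lines):
    """Count 'default route change' messages per minute in logread output."""
    counts = Counter()
    for line in log_lines:
        if MARKER not in line:
            continue
        # The first 24 characters hold the timestamp,
        # e.g. "Tue Jul 24 14:46:45 2018"
        stamp = datetime.strptime(line[:24], "%a %b %d %H:%M:%S %Y")
        counts[stamp.strftime("%Y-%m-%d %H:%M")] += 1
    return dict(counts)
```

On a stable network the result should be empty or nearly so. Several changes in every minute, as in the log above, is a sign of route flapping.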

I could also see this churn using the ip monitor all command:

# ip monitor all
[nsid current]fe80::2ac6:8eff:fe16:19d7 dev wlan1 lladdr 28:c6:8e:16:19:d7 router PROBE
[nsid current]fe80::2ac6:8eff:fe16:19d7 dev wlan1 lladdr 28:c6:8e:16:19:d7 router REACHABLE
[nsid current]Deleted 2001:db8:ebbd:2080::/57 via fe80::58ef:68ff:fe0d:51b7 dev wlan0  proto bird  metric 1024  pref medium
[nsid current]2001:db8:ebbd:2080::/57 via fe80::2ac6:8eff:fe16:19d7 dev wlan1  proto bird  metric 1024  pref medium
[nsid current]Deleted 2001:db8:ebbd:2080::/57 via fe80::2ac6:8eff:fe16:19d7 dev wlan1  proto bird  metric 1024  pref medium
[nsid current]2001:db8:ebbd:2080::/57 via fe80::224:a5ff:fed7:3089 dev wlan0  proto bird  metric 1024  pref medium
[nsid current]Deleted default via fe80::58ef:68ff:fe0d:51b7 dev wlan0  proto bird  metric 1024  pref medium
[nsid current]default via fe80::224:a5ff:fed7:3089 dev wlan0  proto bird  metric 1024  pref medium
[nsid current]Deleted 2001:db8:ebbd::/64 via fe80::58ef:68ff:fe0d:51b7 dev wlan0  proto bird  metric 1024  pref medium
[nsid current]2001:db8:ebbd::/64 via fe80::224:a5ff:fed7:3089 dev wlan0  proto bird  metric 1024  pref medium
[nsid current]Deleted 2001:db8:ebbd::/48 via fe80::58ef:68ff:fe0d:51b7 dev wlan0  proto bird  metric 1024  pref medium
[nsid current]2001:db8:ebbd::/48 via fe80::224:a5ff:fed7:3089 dev wlan0  proto bird  metric 1024  pref medium
...

The problem with this route flapping is that it was being propagated to the other routers, which were busy adding and removing routes, making parts of my network unreachable. Not a desired behaviour.

Settling things down

To rid my network of the route churn, I changed the Babel wireless interfaces to wired, giving them a stable metric that is no longer tied to the variability of the wireless link quality (which Babel measures through lost hellos).

The /etc/bird6.conf now looks like:

protocol babel {
    interface "wlan0", "wlan1" {
        type wired;
        hello interval 5;
    };
    interface "br-lan" {
        type wired;
    };
    import all;
    export all;
}

After restarting bird6 and looking at the syslog, a brief burst of activity can be seen, then the route churn stops and the network is stable.

# logread | tail
Tue Jul 24 15:11:23 2018 daemon.info bird6: Shutting down
Tue Jul 24 15:11:23 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:23 2018 daemon.crit bird6: Shutdown completed
Tue Jul 24 15:11:23 2018 daemon.info bird6: Started
Tue Jul 24 15:11:24 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 15:11:28 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:29 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 15:11:31 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:31 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:32 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 15:11:36 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:36 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:37 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 15:16:33 2018 authpriv.info dropbear[28213]: Exit (root): Error reading: Connection reset by peer
Tue Jul 24 15:17:31 2018 authpriv.info dropbear[30176]: Child connection from 2001:db8:ebbd::1dd7:fb11:ad01:5ef4:33185
Tue Jul 24 15:17:34 2018 authpriv.notice dropbear[30176]: Password auth succeeded for 'root' from 2001:db8:ebbd::1dd7:fb11:ad01:5ef4:33185
Tue Jul 24 15:20:30 2018 daemon.notice netifd: wwan (8869): udhcpc: sending renew
Tue Jul 24 15:20:30 2018 daemon.notice netifd: wwan (8869): udhcpc: lease of 10.1.1.60 obtained, lease time 43200
Tue Jul 24 15:20:47 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan

My ssh connection was dropped as the network did an initial reconverge, and then I was able to log back in and examine the syslog.

Babel, still a work in progress

Babel is still being actively developed, and has a more modern approach to wireless links (something that was near non-existent when RIPng was being standardized back in 1997). Like RIPng, it is easy to set up on OpenWrt routers without having to understand the complexities of OSPF, and it provides redundancy in your network. That said, the wireless functionality as implemented by Bird (v1.6.3) is not quite there. Fortunately, Bird v2.0 is out, and I look forward to giving it a try when it comes to OpenWrt.

Postscript

Although the route churn had subsided, when I re-measured the convergence time for Babel it was quite long: 317 seconds, probably due to the hello timer being set to 5 seconds.

In the end, I reverted my house network to RIPng. Running the same convergence test yielded an outage of only 11 seconds with no route churn.

Perhaps many of the Babel issues are just Bird's implementation, and there may be tweaks to reduce network convergence times. I'd happily give Babel another chance, but for now, I'll stick with good ol' RIPng.

** If you are running a firewall (the default on OpenWrt/LEDE), you will need to add a rule to accept IPv6 UDP port 6696.


Craig Miller -- 26 July 2017