Previously I wrote about resurrecting the old forgotten routing protocol, RIPng. In a small network of more than one router, you need a routing protocol to share information between the routers. I used RIPng for about six months, turned it on, and pretty much forgot that it was running. Worked like a charm in my wired network.
I moved to a new (to me) house this summer, and thought it was a good opportunity to try out a routing protocol which not only handles wired networks but also wireless. Babel seemed just the thing for this environment.
Babel is a loop-avoiding distance-vector routing protocol that is robust and efficient both in ordinary wired networks and in wireless mesh networks. Based on the loss of hellos, the cost of a wireless link can be increased, making sketchy wireless links less preferred.
RFC 6126 specifies the routing protocol. There are two implementations which are supported on OpenWrt routers, babeld and bird.
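To make the idea of a loss-based link cost concrete, here is a minimal Python sketch loosely following the ETX-style computation described in Appendix A of RFC 6126. It is an illustration only, not how babeld or bird actually compute costs; the function name etx_cost and the hello-history list are assumptions made up for this example.

```python
INFINITY = 0xFFFF  # Babel's "unreachable" metric


def etx_cost(hello_history, txcost=256):
    """Estimate a link cost from recent hello reception.

    hello_history is a list of booleans (True = hello received).
    txcost is the cost the neighbour reports for our transmissions;
    256 is assumed here to mean a lossless link.
    """
    if not hello_history:
        return INFINITY
    # beta: estimated probability that a hello from the neighbour arrives
    beta = sum(hello_history) / len(hello_history)
    if beta == 0:
        return INFINITY
    rxcost = int(256 / beta)
    # Combine both directions; the cost grows as hellos are lost
    cost = max(txcost, 256) * rxcost // 256
    return min(cost, INFINITY)
```

With this sketch a link that receives every hello costs 256, while a link losing half of its hellos costs 512, so a route over the lossy link becomes less preferred.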
Like anything in networking, it starts with the physical layer (wireless is a form of physical layer). I attached the wireless links of the backup link router to the production and test routers, thus creating redundant paths of connectivity within my house.
I chose bird6 (the IPv6 version of bird on OpenWrt) because I already had it installed on the routers for RIPng. It was merely a matter of commenting out the RIP section in the /etc/bird6.conf file, and enabling Babel.
The Bird Documentation provides an example. Add the following to /etc/bird6.conf to get Babel running in bird6:
protocol babel {
    interface "wlan0", "wlan1" {
        type wireless;
        hello interval 1;
        rxcost 512;
    };
    interface "br-lan" {
        type wired;
    };
    import all;
    export all;
}
In the example above, wlan0 is the 2.4 GHz radio, and wlan1 is the 5 GHz radio.
When determining the connectivity path, traceroute6 (the IPv6 version) is your friend. Checking between the laptop and the DNS server, the path is:
$ traceroute6 6dns
traceroute to 6lilikoi.hoomaha.net (2001:db8:ebbd:4118::1) from 2001:db8:ebbd:bac0:d999:cd8a:cd9b:2037, port 33434, from port 49819, 30 hops max, 60 bytes packets
1 2001:db8:ebbd:bac0::1 (2001:db8:ebbd:bac0::1) 4.561 ms 0.510 ms 0.487 ms
2 2001:db8:ebbd:4118::1 (2001:db8:ebbd:4118::1) 2.562 ms 2.193 ms 1.927 ms
$
The traceroute is showing the path going clockwise through the 2.4 GHz wireless link.
To test how well Babel can automatically route around failed links, I started a ping to the DNS server from the laptop and disabled the 2.4 GHz radio, thus blocking the link the pings were using, and waited...
$ ping6 6dns
PING 6dns(2001:db8:ebbd:4118::1) 56 data bytes
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=1 ttl=63 time=3.54 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=2 ttl=63 time=1.64 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=3 ttl=63 time=2.02 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=4 ttl=63 time=1.64 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=5 ttl=63 time=1.51 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=6 ttl=63 time=1.65 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=7 ttl=63 time=1.58 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=8 ttl=63 time=5.80 ms
From 2001:db8:ebbd:bac0::1 icmp_seq=33 Destination unreachable: No route
From 2001:db8:ebbd:bac0::1 icmp_seq=34 Destination unreachable: No route
...
From 2001:db8:ebbd:bac0::1 icmp_seq=48 Destination unreachable: No route
From 2001:db8:ebbd:bac0::1 icmp_seq=49 Destination unreachable: No route
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=101 ttl=61 time=2.12 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=102 ttl=61 time=3.42 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=103 ttl=61 time=3.16 ms
As you can see, the outage was 93 seconds (101 - 8). Not a record time (OSPF would converge much faster), but it did fix itself without human intervention.
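The sequence-number subtraction used to measure the outage can be automated. Here is a small Python sketch that scans saved ping6 output for successful replies and reports the largest gap between them. It assumes the default ping interval of one second; longest_outage is a hypothetical helper name, not part of any tool.

```python
import re

# Matches only successful replies, e.g.
# "64 bytes from 2001:db8::1: icmp_seq=8 ttl=63 time=5.80 ms"
REPLY = re.compile(r"^\d+ bytes from .*\bicmp_seq=(\d+)")


def longest_outage(ping_output, interval=1.0):
    """Return the longest gap (in seconds) between successful replies."""
    seqs = []
    for line in ping_output.splitlines():
        match = REPLY.match(line.strip())
        if match:
            seqs.append(int(match.group(1)))
    if len(seqs) < 2:
        return 0.0
    # Difference between consecutive replies; 1 means nothing was lost
    gap = max(later - earlier for earlier, later in zip(seqs, seqs[1:]))
    return gap * interval if gap > 1 else 0.0
```

Note that it needs the complete ping output; lines elided with "..." would be counted as lost replies.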
Checking the connectivity path with traceroute6:
$ traceroute6 6dns
traceroute to 6lilikoi.hoomaha.net (2001:db8:ebbd:4118::1) from 2001:db8:ebbd:bac0:d999:cd8a:cd9b:2037, port 33434, from port 47725, 30 hops max, 60 bytes packets
1 2001:db8:ebbd:bac0::1 (2001:db8:ebbd:bac0::1) 0.541 ms 0.445 ms 0.437 ms
2 2001:db8:ebbd:2080::1 (2001:db8:ebbd:2080::1) 1.705 ms 1.832 ms 1.817 ms
3 2001:db8:ebbd:2000::1 (2001:db8:ebbd:2000::1) 2.273 ms 1.891 ms 2.584 ms
4 2001:db8:ebbd:4118::1 (2001:db8:ebbd:4118::1) 2.348 ms 2.822 ms 2.289 ms
$
The path can now be seen to be traveling counter-clockwise around the circle via the 5 GHz link. The Babel routing protocol is routing packets around the failure.
Starting a ping6 again from the laptop to the DNS server, and enabling the 2.4 GHz radio, one can measure the time of the outage while Babel recalculates the shortest path.
$ ping6 6dns
PING 6dns(2001:db8:ebbd:4118::1) 56 data bytes
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=1 ttl=61 time=2.56 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=2 ttl=61 time=2.20 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=3 ttl=61 time=2.09 ms
...
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=25 ttl=63 time=8.03 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=26 ttl=63 time=6.13 ms
From 2001:db8:ebbd:bac0::1 icmp_seq=27 Destination unreachable: No route
From 2001:db8:ebbd:bac0::1 icmp_seq=28 Destination unreachable: No route
From 2001:db8:ebbd:bac0::1 icmp_seq=29 Destination unreachable: Unknown code 5
From 2001:db8:ebbd:bac0::1 icmp_seq=30 Destination unreachable: Unknown code 5
From 2001:db8:ebbd:bac0::1 icmp_seq=31 Destination unreachable: Unknown code 5
From 2001:db8:ebbd:bac0::1 icmp_seq=32 Destination unreachable: Unknown code 5
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=33 ttl=63 time=4.06 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=34 ttl=63 time=4.21 ms
64 bytes from 2001:db8:ebbd:4118::1: icmp_seq=35 ttl=63 time=1.50 ms
The outage was a quick 7 seconds (33 - 26) as Babel restored the lower cost path. Check the path again with traceroute6, and it can be seen that the 2.4 GHz link is being used again:
$ traceroute6 6dns
traceroute to 6lilikoi.hoomaha.net (2001:db8:ebbd:4118::1) from 2001:db8:ebbd:bac0:d999:cd8a:cd9b:2037, port 33434, from port 44936, 30 hops max, 60 bytes packets
1 2001:db8:ebbd:bac0::1 (2001:db8:ebbd:bac0::1) 0.417 ms 0.405 ms 0.444 ms
2 2001:db8:ebbd:4118::1 (2001:db8:ebbd:4118::1) 1.575 ms 1.640 ms 1.925 ms
$
On OpenWrt there is also a Bird CLI package called birdc6. It offers a nice view of what Bird is doing under the covers. Since I was keeping things simple, only Babel was enabled, but bird can certainly redistribute routes between differing routing protocols such as RIPng, OSPF and Babel.
Looking at Babel running in bird, one can see the interfaces, neighbours, and entries. On the backup link router, the following shows that wlan0 has two neighbours (the production router and the DNS server, also running Babel).
# birdc6
BIRD 1.6.3 ready.
bird> show babel interfaces
babel1:
Interface  State  RX cost  Nbrs  Timer
br-lan     Up     96       0     4
wlan1      Up     512      1     1
wlan0      Up     512      2     1
bird>
Since this is IPv6, not surprisingly, the neighbours command displays the link-local addresses of the Babel peers.
bird> show babel neighbor
babel1:
IP address                 Interface  Metric  Routes  Next hello
fe80::2ac6:8eff:fe16:19d7  wlan1      96      10      3
fe80::58ef:68ff:fe0d:51b7  wlan0      96      10      5
fe80::224:a5ff:fed7:3089   wlan0      96      10      5
bird>
The entries command shows the routing entries which Babel will use to calculate paths:
bird> show babel entries
babel1:
Prefix                   Router ID                Metric  Seqno  Expires  Sources
::/0                     00:00:00:00:00:00:00:00  256     290    44       1
2001:db8:ebbd:4118::/64  00:00:00:00:00:00:00:00  256     65533  44       1
2001:db8:ebbd:4110::/60  00:00:00:00:00:00:00:00  256     65533  44       1
2001:db8:ebbd:4110::/64  00:00:00:00:00:00:00:00  256     65533  44       1
2001:db8:ebbd:2080::/57  00:00:00:00:00:00:00:00  256     40740  54       1
2001:db8:ebbd:2080::/64  <self>                   0       45974  0        0
2001:db8:ebbd::/64       00:00:00:00:00:00:00:00  256     40740  54       1
2001:db8:ebbd::/48       00:00:00:00:00:00:00:00  256     290    44       1
2001:db8:ebbd:bac0::/64  <self>                   0       45974  0        0
bird>
Unlike RIPng, which has no concept of RouterID, Babel uses RouterID to identify the source of routes and avoid loops. Unfortunately, there is a bug in this version of bird (v1.6.3) which does not display the RouterID in the entries. Using wireshark to sniff the Babel packets (UDP port 6696), it can be seen that the RouterIDs are being transmitted.
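For those who would rather script the check than read a capture by eye, here is a rough Python sketch that pulls Router-Id TLVs out of the raw UDP payload of a Babel packet. It follows my reading of the RFC 6126 packet format (magic 42, version 2, a two-byte body length, then TLVs, with Router-Id as TLV type 6); the function name router_ids is an assumption for this example, and getting the payload bytes out of a capture is left out.

```python
import struct

BABEL_MAGIC = 42      # first byte of every Babel packet
BABEL_VERSION = 2     # version defined by RFC 6126
TLV_PAD1 = 0          # single-byte padding TLV, has no length field
TLV_ROUTER_ID = 6     # Router-Id TLV


def router_ids(packet):
    """Return the Router-Ids found in one Babel UDP payload."""
    if len(packet) < 4:
        raise ValueError("too short to be a Babel packet")
    magic, version, body_len = struct.unpack("!BBH", packet[:4])
    if magic != BABEL_MAGIC or version != BABEL_VERSION:
        raise ValueError("not a Babel version 2 packet")
    body = packet[4:4 + body_len]
    found = []
    i = 0
    while i < len(body):
        tlv_type = body[i]
        if tlv_type == TLV_PAD1:
            i += 1
            continue
        if i + 2 > len(body):
            break  # truncated TLV header
        length = body[i + 1]
        payload = body[i + 2:i + 2 + length]
        # Router-Id payload: 2 reserved bytes, then the 8-byte Router-Id
        if tlv_type == TLV_ROUTER_ID and len(payload) >= 10:
            found.append(":".join("%02x" % b for b in payload[2:10]))
        i += 2 + length
    return found
```

The Router-Ids come back in the same colon-separated form that birdc6 uses in its Router ID column.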
As more and more things come online using wireless, there will be more interference and contention for bandwidth, especially in the 2.4 GHz band. Babel enables routing of packets around wireless links made sketchy by interference in a crowded wifi environment.
Because wireless is variable, Babel applies differing metrics to routes as the wireless signal changes. An unfortunate side effect of this is that the network is continuously converging (or changing). The route that was used to reach a remote host one minute may be invalid the next.
I noticed this as my previously very stable IPv6-only servers were now disconnecting, or worse, not reachable.
As I looked at the OpenWrt syslog (using the logread command) I could see that the routes were continually changing.
Tue Jul 24 14:46:45 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:46:45 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:46:46 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:47:01 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:01 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:02 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:47:33 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:33 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:34 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:47:49 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:49 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:47:50 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 14:48:53 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:48:53 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 14:48:54 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
...
I could also see this churn using the ip monitor all command:
# ip monitor all
[nsid current]fe80::2ac6:8eff:fe16:19d7 dev wlan1 lladdr 28:c6:8e:16:19:d7 router PROBE
[nsid current]fe80::2ac6:8eff:fe16:19d7 dev wlan1 lladdr 28:c6:8e:16:19:d7 router REACHABLE
[nsid current]Deleted 2001:db8:ebbd:2080::/57 via fe80::58ef:68ff:fe0d:51b7 dev wlan0 proto bird metric 1024 pref medium
[nsid current]2001:db8:ebbd:2080::/57 via fe80::2ac6:8eff:fe16:19d7 dev wlan1 proto bird metric 1024 pref medium
[nsid current]Deleted 2001:db8:ebbd:2080::/57 via fe80::2ac6:8eff:fe16:19d7 dev wlan1 proto bird metric 1024 pref medium
[nsid current]2001:db8:ebbd:2080::/57 via fe80::224:a5ff:fed7:3089 dev wlan0 proto bird metric 1024 pref medium
[nsid current]Deleted default via fe80::58ef:68ff:fe0d:51b7 dev wlan0 proto bird metric 1024 pref medium
[nsid current]default via fe80::224:a5ff:fed7:3089 dev wlan0 proto bird metric 1024 pref medium
[nsid current]Deleted 2001:db8:ebbd::/64 via fe80::58ef:68ff:fe0d:51b7 dev wlan0 proto bird metric 1024 pref medium
[nsid current]2001:db8:ebbd::/64 via fe80::224:a5ff:fed7:3089 dev wlan0 proto bird metric 1024 pref medium
[nsid current]Deleted 2001:db8:ebbd::/48 via fe80::58ef:68ff:fe0d:51b7 dev wlan0 proto bird metric 1024 pref medium
[nsid current]2001:db8:ebbd::/48 via fe80::224:a5ff:fed7:3089 dev wlan0 proto bird metric 1024 pref medium
...
The problem with this route flapping is that it was being propagated to the other routers, which were busy adding and removing routes, making parts of my network unreachable. Not a desired behaviour.
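To put a number on the churn, the ip monitor output can be tallied per prefix. Here is a minimal Python sketch, assuming the output was saved to a string and that the lines look like the ones shown above (routes installed with proto bird); count_route_changes is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches route add/delete events such as
# "[nsid current]Deleted 2001:db8::/48 via fe80::1 dev wlan0 proto bird ..."
# Neighbour (lladdr) lines have no "via" and are skipped.
ROUTE = re.compile(
    r"^(?:\[nsid \w+\])?(Deleted )?(\S+) via (\S+) dev (\S+) proto bird"
)


def count_route_changes(monitor_output):
    """Count route additions and deletions per prefix."""
    changes = Counter()
    for line in monitor_output.splitlines():
        match = ROUTE.match(line.strip())
        if match:
            changes[match.group(2)] += 1
    return changes
```

Calling count_route_changes(text).most_common(5) lists the five prefixes that flapped the most; on a stable network the counts should stay near zero.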
To rid my network of the route churn, I changed the Babel wireless interfaces to wired, giving them a stable metric, no longer tied to the variability of the wireless signal quality (signal to noise).
The /etc/bird6.conf now looks like:
protocol babel {
    interface "wlan0", "wlan1" {
        type wired;
        hello interval 5;
    };
    interface "br-lan" {
        type wired;
    };
    import all;
    export all;
}
Restarting bird6 and looking at the syslog, a brief burst of activity can be seen, then the route churn stops, and the network is stable.
# logread | tail
Tue Jul 24 15:11:23 2018 daemon.info bird6: Shutting down
Tue Jul 24 15:11:23 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:23 2018 daemon.crit bird6: Shutdown completed
Tue Jul 24 15:11:23 2018 daemon.info bird6: Started
Tue Jul 24 15:11:24 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 15:11:28 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:29 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 15:11:31 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:31 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:32 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 15:11:36 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:36 2018 daemon.info odhcpd[778]: Raising SIGUSR1 due to default route change
Tue Jul 24 15:11:37 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
Tue Jul 24 15:16:33 2018 authpriv.info dropbear[28213]: Exit (root): Error reading: Connection reset by peer
Tue Jul 24 15:17:31 2018 authpriv.info dropbear[30176]: Child connection from 2001:db8:ebbd::1dd7:fb11:ad01:5ef4:33185
Tue Jul 24 15:17:34 2018 authpriv.notice dropbear[30176]: Password auth succeeded for 'root' from 2001:db8:ebbd::1dd7:fb11:ad01:5ef4:33185
Tue Jul 24 15:20:30 2018 daemon.notice netifd: wwan (8869): udhcpc: sending renew
Tue Jul 24 15:20:30 2018 daemon.notice netifd: wwan (8869): udhcpc: lease of 10.1.1.60 obtained, lease time 43200
Tue Jul 24 15:20:47 2018 daemon.info odhcpd[778]: Using a RA lifetime of 1800 seconds on br-lan
My ssh connection was dropped as the network did an initial reconverge, and then I was able to log back in and examine the syslog.
Babel is still being actively developed, and has a more modern approach to wireless links (something that was nearly non-existent when RIPng was being standardized back in 1997). Like RIPng, it is easy to set up on OpenWrt routers without having to understand the complexities of OSPF, and it provides redundancy in your network. That said, the wireless functionality as implemented by Bird (v1.6.3) is not quite there. Fortunately, Bird v2.0 is out, and I look forward to giving it a try when it comes to OpenWrt.
Although the route churn had subsided, when I re-measured the convergence time for Babel it was quite long, 317 seconds, probably due to the hello timer being set to 5 seconds.
In the end, I reverted my house network to RIPng. Running the same convergence test yielded an outage of only 11 seconds with no route churn.
Perhaps many of the Babel issues are just Bird's implementation, and there may be tweaks to reduce network convergence times. I'd happily give Babel another chance, but for now, I'll stick with good ol' RIPng.
** If you are running a firewall (the default on OpenWrt/LEDE), you will need to add a rule to accept IPv6 UDP port 6696.
Craig Miller -- 26 July 2017