Sunday, October 20, 2019

Consumer-grade WiFi gear - when fixing the root cause is not within reach

Some time ago, I had to improve the performance and coverage of my home network so that the various devices around the house would work reliably regardless of location. Some of these devices need consistent bandwidth, such as the SmartTV used for IPTV and Netflix, as well as the smartphones and tablets.

As always, I tend to be frugal when spending money on hardware, going with what performs well and is just about enough for the job.

This led me to look for WiFi gear that was both reasonably popular and low cost, with some hope of being hackable and reflashable with OpenWRT in the future. That was the reasoning behind my decision to buy a couple of TP-LINK TL-WR841N routers (v9 hardware at the time).

At first I set these up and played with the stock firmware, configuring one to handle NAT, DHCP, DNS, firewall and so on, and the other to act solely as a WDS repeater, extending WiFi coverage to the rest of the house.

But later I wanted to play with home automation, and increasingly felt the need for this equipment to be more manageable. So I became adventurous and took on a twofold task: making the technical change itself, and managing the frustration of the household users during the service interruptions while I modified the routers.

Flashing these devices was a simple step, as the OpenWRT project provides firmware images that can be uploaded and flashed through the original Web UI. There is no need to set up a serial connection or anything of the sort.

Details can be found on the official page:

Given the hardware limitations of these devices (4 MB of flash and 32 MB of RAM), the sweet spot between features and stability was version 15.05.1.

The devices work correctly and the firmware performs quite well in this version, but I found that after a few days of operation, depending on network usage, both the router and the repeater reach a point where network performance drops dramatically. CPU and memory are not the problem: during the degradation it is still possible to patiently open an ssh session, monitor them (e.g. using the "top" and "free" commands) and confirm that they are nominal.

Once degradation sets in, the device no longer recovers on its own; the only way out is to force a restart.

With this in mind, and after exhausting my research on corrective solutions, I decided to write a rather generic watchdog script. All it does is ping a list of destinations; if none of them is reachable, or the minimum ping time is above a given threshold for all of them, it restarts the device. This minimizes the impact and reduces the need for manual intervention.
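To make the measurement step concrete, here is a minimal sketch of how the minimum ping time for one destination could be extracted. It is not the published script: the helper name `min_rtt_ms` is hypothetical, and it assumes ping prints the usual summary line (`round-trip min/avg/max = ...` on BusyBox, `rtt min/avg/max/mdev = ...` on iputils).

```shell
#!/bin/sh
# Hypothetical helper (not taken from the published script).
# Reads ping output on stdin and prints the minimum RTT, truncated to whole
# milliseconds. Prints nothing when there is no summary line, which is what
# happens when the destination did not answer at all.
min_rtt_ms() {
    # BusyBox:  round-trip min/avg/max = 1.234/2.345/3.456 ms
    # iputils:  rtt min/avg/max/mdev = 1.234/2.345/3.456/0.500 ms
    # The first number after "=" is the minimum; the decimals are dropped
    # so the result can be compared with plain integer tests in sh.
    sed -n 's|^.*min/avg/max[^=]*= *\([0-9][0-9]*\)[.0-9]*/.*$|\1|p'
}
```

On the router this would be fed by something like `ping -c 3 -q 192.0.2.1 2>/dev/null | min_rtt_ms`, assuming your ping build accepts `-c` and `-q`; the address is a placeholder. An empty result means "unreachable", and any other result can be compared against the threshold.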

I made this script available on GitHub. It is generic enough to be used on other systems where the same kind of problem needs to be dealt with (instead of the reboot command, you can configure whatever action is appropriate in your case):

Once the script is copied to the device, you only need to adjust the environment variables to your scenario:

RTT_THRESHOLD - the minimum (ICMP) ping RTT, in milliseconds, at which the connection to a given node is considered degraded;

IP_LIST - the list of IP addresses to be tested by the script;

MIN_DEGRADED - the minimum number of nodes from the list that must have a degraded connection before the problem is assumed to lie with the device running the script.
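How the threshold and the node count work together can be sketched in isolation from the pinging itself. The snippet below is an illustration, not the published script: the function name `device_degraded`, the "IP RTT" input format and the example values are all assumptions, and it treats an RTT equal to the threshold as degraded.

```shell
#!/bin/sh
# Example values only; adjust them to your scenario.
RTT_THRESHOLD=500   # milliseconds
MIN_DEGRADED=2      # number of nodes

# Hypothetical helper: reads one "IP RTT_MS" pair per line from stdin, where
# RTT_MS is a whole number of milliseconds, or "-" (or empty) when the node
# was unreachable. Succeeds (exit status 0) when at least MIN_DEGRADED nodes
# are unreachable or at/above RTT_THRESHOLD.
device_degraded() {
    degraded=0
    while read -r ip rtt; do
        [ -n "$ip" ] || continue            # skip blank lines
        case "$rtt" in
            ''|-)                           # no reply counts as degraded
                degraded=$((degraded + 1)) ;;
            *)
                if [ "$rtt" -ge "$RTT_THRESHOLD" ]; then
                    degraded=$((degraded + 1))
                fi ;;
        esac
    done
    [ "$degraded" -ge "$MIN_DEGRADED" ]
}
```

With measurements collected for every address in IP_LIST, the final step would then be along the lines of `device_degraded < measurements.txt && reboot`, where `measurements.txt` is a placeholder file name. Requiring several degraded nodes avoids rebooting just because a single remote host is slow or down.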

Lastly, you need to configure the crontab to run the script at the desired frequency (in this case, once every minute):

root@griffinnet-zh-router:~/scripts# crontab -l
* * * * * /root/scripts/ >/dev/null 2>&1
