At work, our app is hosted on a pair of Internet connections from different upstream providers. We have incoming and outgoing SIP calls, incoming web traffic, incoming and outgoing e-mail, and incoming and outgoing web service calls.
We wanted to load balance all of these functions with failover, and we have a philosophy of simultaneously using resources from all of our available routers and connections. That way we avoid situations in which a failover system or failover circuit does not function as intended when it finally becomes the active master.
We use a variety of load-balancing proxies and techniques to allow these various services to function reliably.
What we're not using is BGP anycast -- mainly because doing so requires a provider-independent /24 (a Class C's worth of IP space) and an AS number, and IP space is getting harder to come by. Instead, we utilize DNS-based load balancing and failover from DynECT for all of our inbound traffic.
Our network consists of Linux-based routers/load balancers running in parallel. In order to load balance our outbound Internet traffic, all of that traffic goes through proxy servers, which are looked up in our internal DNS. Here, we use a 5-second TTL for fast failover.
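For context, a record with a TTL that low might look like the following BIND-style zone fragment (names and addresses hypothetical):

```
; hypothetical internal zone fragment
; clients re-resolve every 5 seconds, so pulling a record takes effect quickly
proxy.internal.example.   5   IN   A   10.0.1.10   ; router/proxy on circuit A
proxy.internal.example.   5   IN   A   10.0.2.10   ; router/proxy on circuit B
```

With both records published, resolvers round-robin outbound traffic across the circuits; withdrawing one record shifts everything to the survivor within seconds.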
Finally, we have a load balancer probe which continually tests the uptime of our backend servers and our Internet circuits.
At one point, we discovered a flaw in this system. We monitored the uptime of our outbound Internet circuits by pinging each circuit's default gateway. In many cases this check was sufficient, but if an Internet provider lost connectivity somewhere beyond our next hop, the strategy failed.
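A minimal sketch of that flawed check, assuming the standard `ping` CLI is available (function names are our own):

```python
import subprocess

def ping_ok(host: str, count: int = 3, timeout_s: int = 2) -> bool:
    """True if `host` answers at least one ICMP echo (requires the ping CLI)."""
    try:
        rc = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ).returncode
    except FileNotFoundError:  # no ping binary on this system
        return False
    return rc == 0

def circuit_up_naive(gateway_ip: str) -> bool:
    # The flaw: this only proves the next hop is reachable, not that the
    # provider has working connectivity beyond it.
    return ping_ok(gateway_ip)
```

The check passes as long as the gateway itself answers, which is exactly why a provider-side failure further upstream goes undetected.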
A tempting solution would be to ping something out on the Internet, but that would tie our reliability to the uptime of whatever we chose to ping. We also weren't comfortable constantly pinging a host we hadn't made arrangements with.
It turns out there is a better way. DynECT is constantly requesting web pages from each of our external load balancers, in order to determine whether or not to publish that IP address for our domains. We realized that we could monitor the frequency of these requests, and if a particular load balancer stopped receiving them, that server could withdraw its own internal IP from our internal proxy server DNS A record.
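The decision itself reduces to a staleness check on the timestamp of the last health-check request seen. A sketch, with a hypothetical 30-second threshold (DynECT's actual probe interval will determine the right value):

```python
import time
from typing import Optional

# Hypothetical threshold: health-check requests normally arrive every few
# seconds, so 30 seconds of silence is taken to mean the circuit is down.
STALE_AFTER_S = 30.0

def circuit_alive(last_probe_ts: float, now: Optional[float] = None) -> bool:
    """True if a monitoring request has been seen recently on this circuit."""
    if now is None:
        now = time.time()
    return (now - last_probe_ts) <= STALE_AFTER_S
```

Each load balancer would update `last_probe_ts` from its web server's access log and withdraw its own A record when `circuit_alive` goes false.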
Of course, using this approach meant that if we stopped getting requests from DynECT due to a problem on their end, we could cause the very outage we were trying to prevent. In order to build in some redundancy, we upgraded our Pingdom account and created a check for each server/Internet circuit. Now, if we're not hearing from either DynECT or Pingdom on a given circuit, we consider that circuit offline.
Since implementing this solution, we have seen conditions in which the older "I can ping my default route" check did not trip even though the Internet circuit was offline. Our new WAN monitoring solution reliably catches the problem and takes the affected circuit out of service.