Failing Over Between pfSense Boxes
At work we have several buildings. Most of our buildings use pfSense for firewalling and splitting off subnets.
Say we have two buildings, "Building-Office" and "Building-Cafe", that are
physically near each other. Each of these buildings has its own
Internet connection and pfSense box: pfsense-office and pfsense-cafe.
Each building has a set of subnets associated with it, as follows:
Building-Office has the following subnets:
- STAFF: 192.168.100.0/24
- WORKSHOP: 10.10.10.0/24
- LAB: 10.20.0.0/24
Building-Cafe has the following subnets:
- STAFF: 192.168.100.0/24
- CAFE: 172.19.19.0/24
As you can see, both buildings have a STAFF subnet with the same IP
address range. There is a wireless bridge that connects the two
buildings on the STAFF subnet. On the STAFF subnet, pfsense-office has
an IP address of 192.168.100.254 and pfsense-cafe has an IP address of
192.168.100.253.
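As a quick sanity check of the addressing above, here is a minimal Python sketch. It assumes the shared STAFF range is 192.168.100.0/24, as the two pfSense addresses imply. The helper names are made up for illustration and have nothing to do with pfSense itself.

```python
import ipaddress

# Values taken from the text; the variable names are illustrative only.
STAFF = ipaddress.ip_network("192.168.100.0/24")
GATEWAYS = {
    "pfsense-office": ipaddress.ip_address("192.168.100.254"),
    "pfsense-cafe": ipaddress.ip_address("192.168.100.253"),
}

def on_shared_segment(network, hosts):
    """Return True if every host address falls inside the given network."""
    return all(addr in network for addr in hosts.values())

def has_duplicates(hosts):
    """Return True if two hosts were given the same address."""
    return len(set(hosts.values())) != len(hosts)

if __name__ == "__main__":
    print("both boxes on STAFF:", on_shared_segment(STAFF, GATEWAYS))
    print("address clash:", has_duplicates(GATEWAYS))
```

Both boxes need to sit in the same range with distinct addresses, because each one will later be used as a gateway by the other across the wireless bridge.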
I have the following goals:
- When both Building-Office and Building-Cafe have working Internet, the subnets associated with each building should use the closest Internet connection.
- When one Internet is down (for whatever reason) but the pfSense box is running, I want to route all traffic from the failing building to the other using the wireless bridge.
- When failing over I do not want to touch individual client settings (for example, I do not want to change the DHCP servers to set new gateways).
- Failing over should not otherwise decrease the security of isolating subnets (which is why we have different subnets in the first place).
- I should be able to launch the failover relatively easily (by "flipping a switch").
- I should be able to choose specific subnets to fail over if I want. For example, maybe I want to fail over LAB but not WORKSHOP. I think I always want to fail over STAFF, however.
I have a feeling that any competent network admin could set up pfSense to accomplish these goals within minutes (which is one reason I have felt intimidated about asking this question online). It took me YEARS to get something working properly, so I want to document the procedure that works for me in the hopes that other people can learn from my incompetence.
As it turns out there are a few more important considerations to our situation:
- In Firewall -> NAT -> Outbound, we use Manual Outbound NAT generation, and specify a rule for each subnet going out its WAN interface. We disable automatic NAT because it seems to interfere with some of our VPN software (namely Hamachi).
Non-Solutions
Most failovers for pfSense talk about CARP, but I think that applies when you have multiple pfSense boxes monitoring the same Internet connection. We have two pfSense boxes monitoring two different Internet connections, with different associated subnets.
Similarly there is some functionality called "Virtual IPs", but I never figured out how they worked or whether they would solve my problem.
I think that gateway groups (in System -> Routing) might be useful for automatic failover, but they do not solve the problem on their own.
Phase 1: Failing over STAFF
Failing over STAFF over the wireless link is relatively easy. The key
is to specify some new gateways in System -> Gateways of the pfSense
interface:
- On pfsense-office, make a gateway called GW_CAFE. This should use the STAFF interface, and have the gateway IP address of pfsense-cafe (in this example 192.168.100.253).
- Similarly, on pfsense-cafe, make a gateway called GW_OFFICE, also on the STAFF interface, with a gateway IP of 192.168.100.254.
At this point failover across the wireless for the STAFF interface should be possible. Say that the Internet connection goes down at Building-Office. Then to fail over STAFF to Building-Cafe, do the following:
- On pfsense-office, in System -> Gateways, change the default gateway from GW_WAN to GW_CAFE.
- If that is not enough to make failover work, also disable GW_WAN.
If for some reason you have different sets of firewall rules for the
STAFF interfaces, be aware that the rules for the pfsense-cafe STAFF
interface will apply during failover.
Phase 2: Failing over other subnets
This is where things get tricky. The wireless link is on the STAFF network, so we need to route other traffic via that interface. Here are the broad steps:
- Set up an alias with the subnets to fail over.
- Set up routes on the failover pfSense box.
- Set up rules to allow traffic from the STAFF subnet of the failover pfSense box.
- Fix manual outbound NAT rules.
Let's set up the failover for the LAB and WORKSHOP subnets over
pfsense-cafe.
On pfsense-cafe, set up an alias called office_subnet_failover. It should
consist of two networks: WORKSHOP: 10.10.10.0/24 and LAB: 10.20.0.0/24.
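The alias is effectively a named set of networks, and that is what makes it possible to fail over some subnets but not others. Here is a minimal Python sketch of that idea. The networks are from the example above, while `matching_subnet` is a made-up helper and not a pfSense feature.

```python
import ipaddress

# Alias contents from the text; the Python names are illustrative only.
OFFICE_SUBNET_FAILOVER = {
    "WORKSHOP": ipaddress.ip_network("10.10.10.0/24"),
    "LAB": ipaddress.ip_network("10.20.0.0/24"),
}

def matching_subnet(address, alias=OFFICE_SUBNET_FAILOVER):
    """Return the name of the alias entry containing address, or None."""
    addr = ipaddress.ip_address(address)
    for name, network in alias.items():
        if addr in network:
            return name
    return None

if __name__ == "__main__":
    for client in ("10.10.10.25", "10.20.0.7", "192.168.100.10"):
        print(client, "->", matching_subnet(client))
```

Dropping an entry from the alias (say WORKSHOP) would exclude that subnet from every rule that refers to the alias, which is how you would fail over LAB but not WORKSHOP.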
On pfsense-cafe, go to System -> Routing -> Routes. Make a new
static route with the destination network office_subnet_failover
and the gateway GW_OFFICE. Make the description descriptive:
"Fail over pfsense-office subnets."
On pfsense-cafe, in Firewall -> Rules -> STAFF, make a firewall
rule:
- Action: Pass
- Interface: STAFF
- Protocol: any
- Source: office_subnet_failover
- Destination: any
Be careful! If you are using subnet isolation then you want to put this rule after your isolation rules so that LAB and WORKSHOP clients cannot access STAFF resources.
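Rule order matters because pfSense evaluates the rules on an interface from the top down and the first match wins, with anything unmatched being blocked. Here is a minimal Python sketch of that behaviour. The isolation rule shown is hypothetical (your own isolation rules may look different), and the `Rule` type and `evaluate` function are made up for illustration.

```python
import ipaddress
from typing import NamedTuple, Optional

class Rule(NamedTuple):
    action: str                                   # "pass" or "block"
    sources: tuple                                # networks the rule applies to
    destination: Optional[ipaddress.IPv4Network]  # None means "any"

FAILOVER = (
    ipaddress.ip_network("10.10.10.0/24"),  # WORKSHOP
    ipaddress.ip_network("10.20.0.0/24"),   # LAB
)
STAFF = ipaddress.ip_network("192.168.100.0/24")

# Hypothetical isolation rule: keep failed-over subnets away from STAFF hosts.
isolate = Rule("block", FAILOVER, STAFF)
# The failover rule from the text: pass the alias to any destination.
allow_out = Rule("pass", FAILOVER, None)

def evaluate(rules, src, dst):
    """First matching rule wins; unmatched traffic is blocked."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for rule in rules:
        src_ok = any(s in net for net in rule.sources)
        dst_ok = rule.destination is None or d in rule.destination
        if src_ok and dst_ok:
            return rule.action
    return "block"

if __name__ == "__main__":
    good, bad = [isolate, allow_out], [allow_out, isolate]
    print(evaluate(good, "10.20.0.7", "192.168.100.10"))  # isolation rule hits first
    print(evaluate(bad, "10.20.0.7", "192.168.100.10"))   # pass rule hits first
```

With the pass rule on top, the isolation rule is never reached for LAB and WORKSHOP traffic, which is exactly the mistake to avoid.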
If you have manual NAT, go to Firewall -> NAT -> Outbound and make a
NAT rule:
- Interface: WAN
- Protocol: any
- Source: office_subnet_failover
- Destination: any
- Address: Interface Address
- Static-port: checked
If you do not have this rule then there will be no NAT for outgoing
packets on LAB and WORKSHOP, and the destinations will try to return
packets to your internal subnets instead of the actual IP address of
pfsense-cafe.
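To see why the NAT rule matters, here is a minimal Python sketch of what happens to the source address of an outgoing packet. It is a simplification that only looks at the source address. The WAN address 203.0.113.10 is a made-up documentation address, and `outbound_source` is a hypothetical helper, not how pfSense is implemented.

```python
import ipaddress

# Alias contents from the text.
OFFICE_SUBNET_FAILOVER = [
    ipaddress.ip_network("10.10.10.0/24"),  # WORKSHOP
    ipaddress.ip_network("10.20.0.0/24"),   # LAB
]
WAN_ADDRESS = "203.0.113.10"  # hypothetical WAN address of pfsense-cafe

def outbound_source(src, nat_sources, wan_address=WAN_ADDRESS):
    """Return the source address a packet carries after leaving the WAN.

    If src matches one of the NAT rule's source networks it is rewritten
    to the WAN address; otherwise it leaves unchanged.
    """
    addr = ipaddress.ip_address(src)
    if any(addr in net for net in nat_sources):
        return wan_address
    return src

if __name__ == "__main__":
    with_rule = outbound_source("10.20.0.7", OFFICE_SUBNET_FAILOVER)
    without_rule = outbound_source("10.20.0.7", [])
    print("with NAT rule:   ", with_rule)
    print("without NAT rule:", without_rule,
          "(private:", ipaddress.ip_address(without_rule).is_private, ")")
```

Without the rule the packet still carries a private source address, so a remote server has nowhere sensible to send its reply.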
As far as I know, this is sufficient to get failover working (in one
direction). You can set up a similar set of rules on pfsense-office
to fail over the CAFE subnet.
You then "flip the switch" on the failover in the same way as in Phase 1: Make the appropriate gateway the default, and disable the other one.
If you are paranoid you can also disable the failover firewall rules until it is failover time, and then enable them to make failover work. But this adds additional steps to flipping the switch.