Intermittent "failure trying master... operation canceled" on zone refresh

Discussion:

Rob Moser

2018-05-17 21:05:32 UTC

Hi All,

We're running a series of RHEL 7.4 machines (kernel version 3.10.0-693.1.1.el7.x86_64) running bind version 9.9.4-RedHat-9.9.4-51.el7. Our configuration consists of a hidden master and three hidden slave/recursive resolvers. I'm getting a LOT of errors on the slaves that look like:

17-May-2018 13:27:28.421 general: info: zone 34.22.10.in-addr.arpa/IN/internal-view: refresh: failure trying master 10.20.30.3#53 (source 0.0.0.0#0): operation canceled

Which a little digging shows me is often the result of network connectivity problems, firewall misconfigurations, etc. But in our case the failures are intermittent (but frequent; roughly 40% of our zone refreshes seem to end this way.) This includes failures on every one of our zones - forward and reverse - and refreshes on a single zone which succeed, then fail, then succeed again; so it's not keyed to a particular zone config. It happens on all three slaves intermittently - so a given refresh will make it to 2/3 slaves, and the next attempt will make it to a different set of 2... It's definitely affecting my users at this point; they'll add a new record and get results from an nslookup 2 tries out of 3 for an hour or so, until the final server finally gets the message.
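
To put numbers on which zones and which hours are affected, the failure lines can be tallied with a short script. This is only a rough sketch: the regular expression is modelled on the single log line quoted above, and the function names (parse_failure, summarize) are made up for illustration. Other log layouts (different categories, no view name) may need the pattern adjusted.

```python
import re
from collections import Counter

# Pattern modelled on the one failure line quoted in this post; the view
# component ("/internal-view") is treated as optional.
FAILURE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) general: info: "
    r"zone (?P<zone>\S+?)/IN(?:/(?P<view>[^\s:]+))?: "
    r"refresh: failure trying master (?P<master>[^\s#]+)#(?P<port>\d+) "
    r"\(source (?P<source>[^\s)]+)\): (?P<reason>.+)$"
)


def parse_failure(line):
    """Return the fields of a refresh-failure line, or None if it is not one."""
    match = FAILURE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count "operation canceled" failures per (zone, view) and per hour."""
    by_zone = Counter()
    by_hour = Counter()
    for line in lines:
        rec = parse_failure(line)
        if rec is None or rec["reason"] != "operation canceled":
            continue
        by_zone[(rec["zone"], rec["view"])] += 1
        # Bucket by date plus the two-digit hour of the timestamp.
        by_hour[rec["date"] + " " + rec["time"][:2]] += 1
    return by_zone, by_hour


if __name__ == "__main__":
    sample = (
        "17-May-2018 13:27:28.421 general: info: zone "
        "34.22.10.in-addr.arpa/IN/internal-view: refresh: failure trying "
        "master 10.20.30.3#53 (source 0.0.0.0#0): operation canceled"
    )
    zones, hours = summarize([sample])
    print(zones.most_common(5))
    print(hours.most_common(5))
```

Feeding it the lines of the real log file instead of the sample (for example summarize(open(path)), with path pointing at wherever named writes its general log) would show whether the failures cluster at particular times or are spread evenly.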

Any thoughts on what I can do to debug this issue?

I turned up this link from back in 2014:

https://kb.isc.org/article/AA-01213/0/What-causes-refresh%3A-failure-trying-master-...%3A-operation-canceled-error-messages.html

along with a couple of references to what appears to be the same issue in the archives of this list. All describe a problem with the netfilter kernel modules, but the links to the associated bug report(s) are long gone, and I wasn't able to find any signs of the issue being solved. Does anyone know if this problem was ever fixed or had a workaround (besides disabling netfilter, which my sysadmins would prefer not to do...)?

Appreciate any advice you can give me,

- rob.

Tony Finch

2018-05-18 13:31:10 UTC

Rob Moser <***@nau.edu> wrote:
>
> Which a little digging shows me is often the result of network
> connectivity problems, firewall misconfigurations, etc. But in our case
> the failures are intermittent (but frequent; roughly 40% of our zone
> refreshes seem to end this way.)

Have you made sure there is no connection tracking / stateful packet
filtering configured on port 53?

Tony.
--
f.anthony.n.finch <***@dotat.at> http://dotat.at/
Southeast Iceland: Westerly 6 to gale 8, backing southerly or southeasterly 5
to 7, perhaps gale 8 later. Rough occasionally very rough. Showers, rain
later. Moderate or good, becoming poor later.
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-***@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

Matthew Pounsett

2018-05-18 18:08:59 UTC

On 17 May 2018 at 17:05, Rob Moser <***@nau.edu> wrote:

> We're running a series of RHEL 7.4 machines (kernel version
> 3.10.0-693.1.1.el7.x86_64) running bind version 9.9.4-RedHat-9.9.4-51.el7.
> Our configuration consists of a hidden master and three hidden
> slave/recursive resolvers. I'm getting a LOT of errors on the slaves that
> look like:
>
> 17-May-2018 13:27:28.421 general: info: zone 34.22.10.in-addr.arpa/IN/internal-view:
> refresh: failure trying master 10.20.30.3#53 (source 0.0.0.0#0): operation
> canceled
>

In addition to checking for firewalls and other stateful network devices
as Tony mentions, you should also have a look at the condition of the
network in between the hosts. That feels a lot like moderate packet loss,
or extreme latency, to me.

Are these machines all on the same LAN? Are there multiple networks in
between them?
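
One way to look for loss or latency on the exact path a refresh takes is to send a burst of SOA queries over UDP from a slave to the master and count what comes back. The following is a minimal sketch, not a polished tool: it assumes the refresh check starts with a plain UDP SOA query, builds the packet by hand so nothing beyond the standard library is needed, and the function names, probe count and 2-second timeout are arbitrary choices for illustration.

```python
import socket
import struct
import time


def build_soa_query(zone, query_id):
    """Build a minimal DNS query packet asking for the SOA record of `zone`."""
    # Header: ID, flags (all clear, so no recursion requested),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in zone.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 6 = SOA, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 6, 1)


def probe(master, zone, count=20, timeout=2.0, port=53):
    """Send `count` SOA queries; return (number lost, list of round-trip times)."""
    lost = 0
    rtts = []
    for i in range(count):
        query = build_soa_query(zone, i + 1)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            start = time.monotonic()
            try:
                sock.sendto(query, (master, port))
                reply, _ = sock.recvfrom(4096)
            except OSError:
                # Covers timeouts as well as ICMP errors such as "refused".
                lost += 1
                continue
            if reply[:2] == query[:2]:
                rtts.append(time.monotonic() - start)
            else:
                # A reply whose ID does not match is counted as lost.
                lost += 1
    return lost, rtts
```

Run from one of the slaves, something like probe("10.20.30.3", "34.22.10.in-addr.arpa") (the master address and zone from the log line quoted above) returns the number of unanswered queries and the round-trip times of the rest. Comparing results with and without the host firewall active would help separate a network problem from a local filtering problem.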

Rob Moser

2018-05-25 23:23:42 UTC

I apologize for being so slow to respond; it has been "one of those weeks", but I do very much appreciate everyone's comments.

The machines are on the same LAN, and there is no evidence of unusual load or dropped packets on the network.

We do not have any firewall rules restricting DNS traffic. We _do_ have state-aware firewalld rules on the machines that should not apply to DNS traffic - the default baseline firewalld rules that come to us from Red Hat include rules that check state.

The reason that I mention rules that should not apply to DNS traffic is that I found one of my colleagues who worked on this problem a few years back. They found at the time that it was the fact of loading the conntrack module by _any_ rule that caused the fault, regardless of which rule actually used the data. We're talking back a major release or two of Red Hat and everything else, so we are not assuming that this is necessarily the exact same problem, but to test it on our current systems I'll have to temporarily pull a server out of prod (the problem is not reproducible under the load we get in test.) So next week, at this point.

(They solved the problem last time by disabling _all_ state-aware rules from iptables, but my sysadmins are resisting a similar approach this time, so I am attempting to find an alternate solution...)
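
For the test itself, it helps to confirm on each slave whether the conntrack module is actually loaded and how full its table is, before and after any rule change. Here is a small sketch that reads the standard Linux /proc entries (/proc/modules and the nf_conntrack_count and nf_conntrack_max files under /proc/sys/net/netfilter). The function name and the proc_root parameter are illustrative only; the parameter exists so the function can be pointed at a copy of the files.

```python
from pathlib import Path


def conntrack_status(proc_root="/proc"):
    """Report whether nf_conntrack is loaded and how many entries it tracks."""
    root = Path(proc_root)
    status = {"module_loaded": False, "count": None, "max": None}

    # /proc/modules lists one loaded module per line, name first.
    modules = root / "modules"
    if modules.exists():
        for line in modules.read_text().splitlines():
            fields = line.split()
            if fields and fields[0] == "nf_conntrack":
                status["module_loaded"] = True
                break

    # These files are only present while connection tracking is active.
    netfilter = root / "sys" / "net" / "netfilter"
    for key, name in (("count", "nf_conntrack_count"),
                      ("max", "nf_conntrack_max")):
        path = netfilter / name
        if path.exists():
            status[key] = int(path.read_text().strip())
    return status


if __name__ == "__main__":
    print(conntrack_status())
```

If module_loaded is still true after the state-aware rules have been removed, the module has not been unloaded and the test does not yet reflect a conntrack-free system. A count close to max would point at table exhaustion rather than the module's mere presence.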

Thanks,

- rob.
