BIND srtt algorithm not working as expected

Discussion:

Paul Roberts

2018-05-16 09:25:41 UTC

Hello,

I am researching an issue we are seeing with significant volumes of DNS traffic being sent to non-local forwarders. I think I understand how the srtt algorithm works, but I am seeing more traffic going to the non-local forwarders than I was expecting.

To give you some context, we have 2 forwarders in the UK and 2 in Hong Kong; all 4 servers are responsible for outbound internet resolution. We also have a number of resolving servers (in the UK and Hong Kong) that have these 4 servers listed in their local "forwarders" statement, so I am expecting the HK resolvers to forward mainly to the 2 local HK forwarders, with the occasional query out to the 2 UK forwarders so that the rtt can be measured.

When I do a packet capture on a Hong Kong resolver, over a 5 minute period, 22% of all packets captured are DNS queries being forwarded to the local HK forwarders, and 14% of the packets captured are being sent to the UK forwarders - this seems high to me. I had always believed that the number of queries sent to non-local forwarders would be a lot lower, but from looking into this in detail this doesn't seem to be the case.

When I do a ping from Hong Kong, the rtt to the UK forwarders is 180-190ms; in contrast, the local HK forwarder rtt is <1ms. I can see from dumping the cache on the HK resolver that the rtt is indeed much lower to the HK servers:

; 10.<HK IP> [srtt 478560] [flags 00004000] [edns 146/5/4/4/4] [plain 0/0] [udpsize 2448] [ttl -1033437]

; 10.<HK IP> [srtt 648550] [flags 00004000] [edns 153/4/4/4/4] [plain 0/0] [udpsize 2270] [ttl -1033437]

; 10.<UK IP> [srtt 2774590] [flags 00004000] [edns 133/4/4/4/2] [plain 0/0] [udpsize 1160] [ttl -1033437]

; 10.<UK IP> [srtt 3477510] [flags 00004000] [edns 170/6/6/6/4] [plain 0/0] [udpsize 1012] [ttl -1033437]
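
For anyone wanting to compare these entries across many resolvers, here is a minimal sketch that extracts the srtt values and ranks the servers. It assumes the dump lines look exactly like the ones above; `parse_srtt` and `ENTRY` are hypothetical names, and the script deliberately reports only relative values rather than assuming a unit for srtt.

```python
import re

# Lines copied from the dump above; the addresses are elided in the original.
DUMP = """\
; 10.<HK IP> [srtt 478560] [flags 00004000] [edns 146/5/4/4/4] [plain 0/0] [udpsize 2448] [ttl -1033437]
; 10.<HK IP> [srtt 648550] [flags 00004000] [edns 153/4/4/4/4] [plain 0/0] [udpsize 2270] [ttl -1033437]
; 10.<UK IP> [srtt 2774590] [flags 00004000] [edns 133/4/4/4/2] [plain 0/0] [udpsize 1160] [ttl -1033437]
; 10.<UK IP> [srtt 3477510] [flags 00004000] [edns 170/6/6/6/4] [plain 0/0] [udpsize 1012] [ttl -1033437]
"""

# Assumed line shape: "; <address> [srtt <integer>] ..."
ENTRY = re.compile(r"^;\s+(?P<addr>.+?)\s+\[srtt (?P<srtt>\d+)\]")


def parse_srtt(dump):
    """Return (address, srtt) pairs sorted by ascending srtt."""
    entries = []
    for line in dump.splitlines():
        match = ENTRY.match(line)
        if match:
            entries.append((match.group("addr"), int(match.group("srtt"))))
    return sorted(entries, key=lambda entry: entry[1])


ranked = parse_srtt(DUMP)
best = ranked[0][1]
for addr, srtt in ranked:
    # Show each server's srtt relative to the best one, without assuming a unit.
    print(f"{addr:12} srtt={srtt:8d}  x{srtt / best:.2f} of best")
```

Note that the srtt values differ by well under the ratio the ping times would suggest, which is consistent with something other than plain round-trip time (for example unanswered queries) influencing them.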

I did some digging and came across this presentation: https://www.nanog.org/meetings/nanog54/presentations/Tuesday/Yu.pdf

This seems to imply on slide 16 that with lower query rates, BIND 9.8 has a habit of sending fairly significant volumes to DNS servers with higher rtts. I am wondering if this is still the case in BIND 9.10 or 9.11 and whether there is anything that can be done about it?

In BIND 8 I think we could have used the topology statement to influence the behaviour but I gather that is not an option in BIND 9?

Is there a solution to this because the slow responses back from the UK are impacting application performance for users in HK?

We need to keep the UK servers as part of the configuration for failover/redundancy, removing them is not an option.

Thanks,

Paul

Paul Roberts
Calleva Networks Ltd.
Email: ***@callevanetworks.com

Tony Finch

2018-05-16 18:42:15 UTC

Paul Roberts <***@callevanetworks.com> wrote:
>
> This seems to imply on slide 16 that with lower query rates, BIND 9.8
> has a habit of sending fairly significant volumes to DNS servers with
> higher rtts. I am wondering if this is still the case in BIND 9.10 or
> 9.11 and whether there is anything that can be done about it?

The short answer is, 9.9 and later should be a lot better than 9.8.

There are a couple of obviously relevant entries in the CHANGES file:

Before the 9.6.0 release:

2423. [security] Randomize server selection on queries, so as to
make forgery a little more difficult. Instead of
always preferring the server with the lowest RTT,
pick a server with RTT within the same 128
millisecond band. [RT #18441]

Before the 9.9.0 release:

3024. [func] RTT Banding removed due to minor security increase
but major impact on resolver latency. [RT #23310]
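
To illustrate the difference between those two CHANGES entries, here is a toy model, not BIND's actual code. The function names, the RTT figures, and the use of integer division to define a "band" are all assumptions made for the sake of the example; the "inflated" case is a hypothetical situation in which the local servers' smoothed RTTs have drifted into the same band as the remote ones.

```python
import random


def pick_lowest(rtts_ms):
    """Model of change 3024: always prefer the server with the lowest RTT."""
    return min(rtts_ms, key=rtts_ms.get)


def pick_banded(rtts_ms, band_ms=128, rng=random):
    """Model of change 2423: choose at random among the servers whose RTT
    falls in the same band as the fastest server (band = rtt // band_ms,
    which is an assumption about how the banding was defined)."""
    best_band = min(rtts_ms.values()) // band_ms
    candidates = [s for s, rtt in rtts_ms.items() if rtt // band_ms == best_band]
    return rng.choice(candidates)


# Hypothetical RTTs in milliseconds.
clean = {"hk1": 1, "hk2": 2, "uk1": 185, "uk2": 190}
inflated = {"hk1": 140, "hk2": 150, "uk1": 185, "uk2": 190}

rng = random.Random(0)
for name, rtts in (("clean", clean), ("inflated", inflated)):
    picks = [pick_banded(rtts, rng=rng) for _ in range(1000)]
    uk_share = sum(p.startswith("uk") for p in picks) / len(picks)
    print(name, "lowest:", pick_lowest(rtts), "banded UK share:", uk_share)
```

In this model, banding only sends traffic to the UK once the local RTTs have been pushed into the same band, whereas lowest-RTT selection keeps preferring a local server in both cases.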

Tony.
--
f.anthony.n.finch <***@dotat.at> http://dotat.at/
justice and liberty cannot be confined by national boundaries
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-***@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

Paul Roberts

2018-05-17 08:32:16 UTC

After doing some more packet captures, it looks like a lot of the queries are related to Sophos live protection DNS lookups (lots of queries for sophosxl.net), so there are a lot of queries which don't get resolved. We see multiple queries for the same name and the resolver seems to retransmit to each forwarder when it doesn't get a response, including the non-local ones. So the behaviour may be being exacerbated by these non-resolvable queries. Eventually after about 10 seconds, the forwarder replies with a SERVFAIL response as it eventually gives up trying to get a response from the Sophos name servers.

So now I am not sure if the rtt algorithm is completely at fault here as BIND is simply trying additional forwarders in an attempt to resolve the name.

I have seen this live protection stuff going on in quite a few corporates now, and each time we have had to raise the recursive-client limit. I don't think it's just Sophos that do this; I'm pretty sure I saw this with McAfee a couple of years ago too. They seem to use DNS to transmit file name hashes so they can do a reputation lookup, but Sophos only reply if some kind of action is required. There must be many corporates out there that are experiencing issues with the way this works, i.e. all of a sudden their resolvers stop recursing because the recursive client limit is hit.

On one account I am working on, the resolvers regularly hit 20,000+ recursive clients when they kick off a scheduled virus scan. I wish the anti-virus vendors would consider the impact they are having on corporate DNS environments and re-think how they implement their reputation lookups; it must be the cause of some pretty serious outages. :-(

Cheers,

Paul

Matus UHLAR - fantomas

2018-05-17 08:47:25 UTC

please wrap your lines when possible. <76 characters ideally.

On 17.05.18 08:32, Paul Roberts wrote:
>After doing some more packet captures, it looks like a lot of the queries
> are related to Sophos live protection DNS lookups (lots of queries for
> sophosxl.net), so there are a lot of queries which don't get resolved. We
> see multiple queries for the same name and the resolver seems to
> retransmit to each forwarder when it doesn't get a response, including the
> non-local ones. So the behaviour may be being exacerbated by these
> non-resolvable queries. Eventually after about 10 seconds, the forwarder
> replies with a SERVFAIL response as it eventually gives up trying to get a
> response from the Sophos name servers.

do those forwarders respond?

Because if they don't return anything, or return SERVFAIL, it's expected and
logical for BIND to try again.

>So now I am not sure if the rtt algorithm is completely at fault here as
> BIND is simply trying additional forwarders in an attempt to resolve the
> name.

apparently not - I remember having such a problem years ago, and I
advised the client to turn those DNS lookups off. Those lookups were
overloading our DNS servers (not sure if it was Sophos).

>I have seen this live protection stuff going on in quite a few corporates
> now, and each time we have had to raise the recursive-client limit. I
> don't think it's just Sophos that do this, pretty sure I saw this with
> McAfee a couple years ago too, they seem to use DNS to transmit file name
> hashes so they can do a reputation lookup, but for Sophos they only reply
> if some kind of action is required. There must be many corporates out
> there that are experiencing issues with the way this works, i.e all of a
> sudden their resolvers stop recursing because the recursive client limit
> is hit.

>One account I am working on, the resolvers regularly hit 20,000+ recursive
> clients when they kick of a scheduled virus scan. I wish the anti-virus
> vendors would consider the impact they are having on corporate DNS
> environments and re-think how they implement their reputation lookups, it
> must be the cause of some pretty serious ouages. :-(

this kind of protection apparently should not be run on public DNS
infrastructure.

--
Matus UHLAR - fantomas, ***@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
One OS to rule them all, One OS to find them,
One OS to bring them all and into darkness bind them

Tony Finch

2018-05-17 11:34:44 UTC

Paul Roberts <***@callevanetworks.com> wrote:

> After doing some more packet captures, it looks like a lot of the
> queries are related to Sophos live protection DNS lookups (lots of
> queries for sophosxl.net), so there are a lot of queries which don't get
> resolved.

Good grief.

There are a few things you might do to mitigate this idiocy:

0. Block sophosxl.net. Your colleagues responsible for AV might not
appreciate this :-)

1. In BIND 9.11+ there are options `fetches-per-zone` and
`fetches-per-server` for helping a resolver to cope with overloaded
authoritative servers. When you are forwarding you'll have to rely on
fetches-per-zone since fetches-per-server will throttle everything.
I don't know how fetches-per-zone discovers zone cuts or how well that
works in the forwarding case when your resolver is relying on an
upstream to do the iteration.

2. Set up sacrificial forwarding IP addresses. These can be additional
addresses on your existing forwarders. Configure your resolvers to
forward queries for sophosxl.net to the sacrificial addresses instead
of the usual ones. Then BIND's address database entries used by most
queries won't get polluted by the non-responding servers.

You might profitably combine 1. and 2. to make the resolver eagerly drop
queries to the sacrificial forwarders.
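
A rough named.conf sketch of how 1. and 2. might be combined is shown here. All addresses are hypothetical placeholders from the documentation ranges, the limit of 100 is an arbitrary illustrative value rather than a recommendation, and, as noted above, how well fetches-per-zone behaves in a forwarding setup is not something this example settles.

```
options {
    // The usual forwarders (hypothetical addresses).
    forwarders { 192.0.2.1; 192.0.2.2; 198.51.100.1; 198.51.100.2; };

    // Mitigation 1 (BIND 9.11+): cap outstanding fetches per zone.
    // The value 100 is a guess for illustration; "drop" discards the
    // excess queries instead of answering SERVFAIL.
    fetches-per-zone 100 drop;
};

// Mitigation 2: forward sophosxl.net to sacrificial addresses, which
// could be extra addresses configured on the existing forwarders
// (hypothetical addresses).
zone "sophosxl.net" {
    type forward;
    forward only;
    forwarders { 192.0.2.53; 198.51.100.53; };
};
```

Because the sacrificial addresses get their own address database entries, timeouts on sophosxl.net lookups should not affect the entries used for the regular forwarders.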

Tony.
--
f.anthony.n.finch <***@dotat.at> http://dotat.at/
the quest for freedom and justice can never end

Paul Roberts

2018-05-17 14:54:33 UTC

Hey Nico, long time no speak, hope you are well! You still at Efficient IP?

Yes that would be a great idea in theory but in practice it would require a massive infrastructure change for this customer, we'd also have to migrate the anycast IPs to these new nodes (does dnsdist support anycast?), and ensure we can still meet the contracted SLAs. Basically it's a lot of work (+ cost) just to "sort out" this Sophos mess.

I'd rather Sophos did their stuff over a separate TCP or UDP port rather than hijacking DNS, but doubt they will listen to "little old me".

Cheers,

Paul

________________________________
From: Nico CARTRON <***@ncartron.org>
Sent: 17 May 2018 13:01
To: Paul Roberts
Cc: ML BIND Users
Subject: Re: BIND srtt algorithm not working as expected

Hi Paul,

On 17 May 2018, at 13:46, Paul Roberts <***@callevanetworks.com> wrote:

Good grief indeed!

I would love to implement 'fetches-per-zone' but we need to get them onto BIND 9.11 first, that's a few months away.

Unfortunately I can't just block this traffic else I'll have the security teams wanting to know why we are compromising their desktop security.

Even 'fetches-per-zone' is a bit contentious, if we are rate limiting and one of those queries happens to be for a malicious file which doesn't get quarantined (because we never got the actionable response code from Sophos) we'll be in big trouble.

So we are caught between a rock and a hard place. :-(

Why not put dnsdist in front of those BIND 9.8 servers, and have it redirect the DNS traffic destined for Sophos to dedicated BIND servers?
The other, non-Sophos DNS traffic would still be sent to the current BIND servers.

Cheers,
Nico
