Discussion:
Clients get DNS timeouts because IPv6 means more queries for each lookup
Jonathan Kamens
2011-07-11 18:11:57 UTC
Permalink
The number of DNS queries required for each address lookup requested by
a client has gone up considerably because of IPV6. The problem is being
exacerbated by the fact that many DNS servers on the net don't yet
support IPV6 queries. The result is that address lookups are frequently
taking so long that the client gives up before getting the result.

The example I am seeing this with most frequently is my RSS feed reader,
rss2email, trying to read a feed from en.wikipedia.org in a cron job
that runs every 15 minutes. I am regularly seeing this in the output of
the cron job:

W: Name or service not known [8]
http://en.wikipedia.org/w/index.php?title=/[elided]/&feed=atom&action=history

The wikipedia.org domain has three DNS servers. Let's assume that the
root and org. nameservers are cached already when rss2email does its
query. If so, then it has to do the following queries:

wikipedia.org DNS
en.wikipedia.org AAAA
en.wikipedia.org A

This is fine when the wikipedia.org nameservers are working, but let's
postulate for the moment that two of them are down, unreachable, or
responding slowly, which apparently happens pretty often. Then we end up
doing:

wikipedia.org DNS
en.wikipedia.org AAAA /times out/
en.wikipedia.org AAAA /times out/
en.wikipedia.org AAAA
en.wikipedia.org A /times out/
en.wikipedia.org A /times out/
en.wikipedia.org A

By the end of that sequence, the typical 30-second DNS request
timeout has been exceeded, and the client gives up.

I said above that the problem is exacerbated by the fact that many DNS
servers don't yet support IPV6 queries. This is because the AAAA queries
don't get NXDOMAIN responses, which would be cached, but rather FORMERR
responses, which are not cached. As a result, the scenario described
above happens much more frequently because the DNS server has to redo
the AAAA queries often.

One suggestion that I've seen on the net for how to mitigate this
problem is to treat FORMERR responses as negative and cache them just
like NXDOMAIN responses are cached. I took a brief look at the BIND code in
resolver.c to see how easy it would be to do this, and although it doesn't
look like it would be particularly difficult, I don't feel I know the ins
and outs of the DNS protocol and the BIND implementation well enough to be
confident that I'd get it right.
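
To be clear about what I mean, here is a purely illustrative toy sketch of
the idea -- this is NOT BIND's resolver.c, and every name in it
(neg_cache_formerr, neg_cache_hit, NEG_TTL, ...) is made up:

/* Toy sketch only: remember (qname, qtype) pairs that drew a FORMERR and
 * suppress re-queries for a short negative TTL, the same way
 * NXDOMAIN/NODATA answers are cached. */
#include <string.h>
#include <time.h>

#define NEG_TTL     600   /* seconds; arbitrary value for the sketch */
#define CACHE_SLOTS 256

struct neg_entry {
    char   qname[256];
    int    qtype;
    time_t expires;
};

static struct neg_entry neg_cache[CACHE_SLOTS];

static unsigned slot(const char *qname, int qtype)
{
    unsigned h = (unsigned)qtype;
    while (*qname)
        h = h * 31u + (unsigned char)*qname++;
    return h % CACHE_SLOTS;
}

/* Record that this (qname, qtype) just got a FORMERR. */
void neg_cache_formerr(const char *qname, int qtype)
{
    struct neg_entry *e = &neg_cache[slot(qname, qtype)];
    strncpy(e->qname, qname, sizeof(e->qname) - 1);
    e->qname[sizeof(e->qname) - 1] = '\0';
    e->qtype   = qtype;
    e->expires = time(NULL) + NEG_TTL;
}

/* Return 1 if we should skip re-asking for this (qname, qtype) for now. */
int neg_cache_hit(const char *qname, int qtype)
{
    const struct neg_entry *e = &neg_cache[slot(qname, qtype)];
    return e->qtype == qtype &&
           strcmp(e->qname, qname) == 0 &&
           e->expires > time(NULL);
}

A real implementation would obviously need correct TTL handling, locking
and eviction, which is exactly the part I don't trust myself to get right.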

I'm interested to hear if other people are encountering this problem and
if the developers who work on BIND have any thoughts about how to
mitigate it, short of getting everyone on the internet to upgrade to
nameservers that support IPV6.

Thanks,

Jonathan Kamens

Tony Finch
2011-07-11 19:10:35 UTC
Permalink
I said above that the problem is exacerbated by the fact that many DNS servers
don't yet support IPV6 queries. This is because the AAAA queries don't get
NXDOMAIN responses, which would be cached, but rather FORMERR responses, which
are not cached. As a result, the scenario described above happens much more
frequently because the DNS server has to redo the AAAA queries often.
Your upstream resolver is broken if it returns FORMERR responses to AAAA
queries. The behaviour you describe is not normal.

Have a look at bind's filter-aaaa-on-v4 and deny-answer-addresses options
which should allow you to prevent applications from trying to use IPv6. The
latter might also quell queries for IPv6 addresses of name servers (though
I haven't verified that). Also perhaps it'll help to declare all IPv6 name
servers bogus -- server ::/0 { bogus yes; };
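
Something like the following named.conf sketch would combine those
(assuming your BIND was built with --enable-filter-aaaa; I haven't tested
this exact combination, so adjust to taste):

options {
        // Strip AAAA records from answers returned to IPv4 clients.
        filter-aaaa-on-v4 yes;
        // Refuse to hand out any IPv6 addresses in answers.
        deny-answer-addresses { ::/0; };
};

// Treat all IPv6 name servers as unusable.
server ::/0 { bogus yes; };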

Tony.
--
f.anthony.n.finch <dot at dotat.at> http://dotat.at/
North Bailey: Variable becoming southeasterly 3 or 4. Slight or moderate.
Fair. Good.
Jonathan Kamens
2011-07-11 19:21:58 UTC
Permalink
Post by Tony Finch
I said above that the problem is exacerbated by the fact that many DNS servers
don't yet support IPV6 queries. This is because the AAAA queries don't get
NXDOMAIN responses, which would be cached, but rather FORMERR responses, which
are not cached. As a result, the scenario described above happens much more
frequently because the DNS server has to redo the AAAA queries often.
Your upstream resolver is broken if it returns FORMERR responses to AAAA
queries. The behaviour you describe is not normal.
There are people reporting all over the net that they're getting tons of
messages like this in their logs with recent BIND versions:

Jul 11 12:00:06 jik2 named[31354]: error (FORMERR) resolving
'en.wikipedia.org/AAAA/IN': 208.80.152.130#53

I've got 397 of them in my logs for just the last 24 hours.
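
If anyone wants to reproduce this, querying the authoritative server from
the log line above directly shows what it returns for the AAAA lookup
(assuming you have dig handy):

dig @208.80.152.130 en.wikipedia.org AAAA +norecurse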

I'm aware that this means the upstream DNS server is broken; isn't that
what I said, i.e., that it isn't responding properly to AAAA queries?

The problem is that I have no control over the upstream resolver. All I
have control over is my own name server.

I am not the only one who is going to encounter this problem. I've found
several reports of it on the net with a minimal amount of searching. I
think something more general has to be done than giving me advice about
what to change in my named.conf. I appreciate the advice for how to fix
the problem for myself, but I think it needs to be fixed for everyone.
Post by Tony Finch
Have a look at bind's filter-aaaa-on-v4 and deny-answer-addresses options
which should allow you to prevent applications from trying to use IPv6.
Neither of these options is documented in named.conf(5) or
resolv.conf(5). Is this a problem that is specific to the Fedora 15
versions of these man pages, or is the documentation distributed with
BIND out-of-date?

I tried to use the option and I get "is not configured" in my log when
named starts up and then "parsing failed," so I think my BIND must not
be compiled with --enable-filter-aaaa, right? That makes it difficult to
use this solution. Perhaps that's also why it isn't listed in the man page?

jik

Eivind Olsen
2011-07-11 19:26:29 UTC
Permalink
Post by Jonathan Kamens
I said above that the problem is exacerbated by the fact that many DNS
servers don't yet support IPV6 queries. This is because the AAAA queries
don't get NXDOMAIN responses, which would be cached, but rather FORMERR
responses, which are not cached. As a result, the scenario described
above happens much more frequently because the DNS server has to redo
the AAAA queries often.
I think the main issue here is - why is your nameserver thinking it has
IPv6 connectivity?
If you don't have working IPv6 connectivity, do one or both of these:

1) Disable or at least configure IPv6 properly on your server
2) Tell BIND not to use IPv6 transport, typically by starting "named" with
the command line option "-4". How to do that depends on your operating
system / distribution / packaging system etc. (a sketch for one packaging
style follows below).
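
As a sketch, on a Fedora/RHEL-style packaging (other systems differ, and
the file name and variable here are an assumption):

# /etc/sysconfig/named -- extra options the init script passes to named
OPTIONS="-4"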

Regards
Eivind Olsen
Jonathan Kamens
2011-07-11 19:59:39 UTC
Permalink
Post by Eivind Olsen
I think the main issue here is - why is your nameserver thinking it has
IPv6 connectivity?
No, this isn't the issue.

I see the FORMERR errors in syslog and the timeouts resolving host names
even when I start named with -4.

Named is querying for AAAA records even when it is started with -4, and
it is the querying, not the connectivity, that is the issue.

jik

Mark Andrews
2011-07-11 22:52:01 UTC
Permalink
Post by Jonathan Kamens
I think the main issue here is - why is your nameserver thinking it has
IPv6 connectivity?
No, this isn't the issue.
I see the FORMERR errors in syslog and the timeouts resolving host names
even when I start named with -4.
-4 and -6 affect what transport is used. They have no impact on data.
Post by Jonathan Kamens
Named is querying for AAAA records even when it is started with -4, and
it is the querying, not the connectivity, that is the issue.
jik
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: marka at isc.org
Bill Owens
2011-07-11 20:06:42 UTC
Permalink
Post by Jonathan Kamens
The number of DNS queries required for each address lookup requested by
a client has gone up considerably because of IPV6. The problem is being
exacerbated by the fact that many DNS servers on the net don't yet
support IPV6 queries. The result is that address lookups are frequently
taking so long that the client gives up before getting the result.
I've seen the same thing, and poked around enough to see that the Wikipedia name servers are returning the wrong authority info for these and other queries (it isn't just AAAA - try TXT, SRV, etc.). Some digging through the archives finds this:

https://lists.isc.org/pipermail/bind-users/2011-March/083109.html
in which the first sentence says it all: "The nameservers for wikipedia.org are broken."

And this followup:
https://lists.isc.org/pipermail/bind-users/2011-March/083113.html
"It's PowerDNS 2.9.22 that is breaking this, and it will be fixed by
PowerDNS 3.0 once that's released, and we get around to deploying it."

Looks like PowerDNS was in RC2 as of April 19, not released yet. . .

Bill.
Jonathan Kamens
2011-07-11 20:25:59 UTC
Permalink
Post by Bill Owens
https://lists.isc.org/pipermail/bind-users/2011-March/083109.html
in which the first sentence says it all: "The nameservers for wikipedia.org are broken."
It's not just wikipedia.org that's broken, obviously. I see this error
in my logs for 19 domains since July 3.

Even if PowerDNS is the only source of this issue, and even if the new
version of PowerDNS is released tomorrow, I'm sure there will still be
sites running the old version a year from now. So just relying on a
PowerDNS release to fix this problem seems unwise.

Users are experiencing this problem /now/ in the field, and more users
will be experiencing it as BIND is upgraded in more and more places.
Every single user relying on a Fedora 15 DNS server, for example, is
going to see occasional unnecessary DNS timeouts when trying to resolve
host names.

It seems clear to me that a generally available, generally applicable
fix to BIND is needed to avoid this issue and perhaps similar issues
like it.

jik

Tim Maestas
2011-07-11 22:27:28 UTC
Permalink
I'm unclear how BIND could be modified to fix this. The querying
client machines are asking BIND for AAAA records. BIND goes out to
the authoritative nameservers to attempt to resolve said AAAA records.
The broken nameservers (PowerDNS <3.0 etc.) time out or otherwise hand
out bad responses (FORMERR, NXDOMAIN). What would BIND do differently
to avoid this?

Even if BIND was modified, why would the responsibility fall on all
BIND administrators to implement this hack as opposed to the onus
being on the owners of the broken nameservers to upgrade their broken
authoritative servers?

-Tim
Post by Bill Owens
https://lists.isc.org/pipermail/bind-users/2011-March/083109.html
in which the first sentence says it all: "The nameservers for wikipedia.org are broken."
It's not just wikipedia.org that's broken, obviously. I see this error in my logs for 19 domains since July 3.
Even if PowerDNS is the only source of this issue, and even if the new
version of PowerDNS is released tomorrow, I'm sure there will still be sites
running the old version a year from now. So just relying on a PowerDNS
release to fix this problem seems unwise.
Users are experiencing this problem now in the field, and more users will be
experiencing it as BIND is upgraded in more and more places. Every single
user relying on a Fedora 15 DNS server, for example, is going to see
occasional unnecessary DNS timeouts when trying to resolve host names.
It seems clear to me that a generally available, generally applicable fix to
BIND is needed to avoid this issue and perhaps similar issues like it.
jik
Chuck Swiger
2011-07-11 22:35:05 UTC
Permalink
Post by Jonathan Kamens
Even if PowerDNS is the only source of this issue, and even if the new version of PowerDNS is released tomorrow, I'm sure there will still be sites running the old version a year from now. So just relying on a PowerDNS release to fix this problem seems unwise.
OK, but this same reasoning applies to making a change to BIND: even if we had such a change available tomorrow, there will be sites running older versions of BIND a year from now, also. :-)
Post by Jonathan Kamens
Users are experiencing this problem now in the field, and more users will be experiencing it as BIND is upgraded in more and more places. Every single user relying on a Fedora 15 DNS server, for example, is going to see occasional unnecessary DNS timeouts when trying to resolve host names.
It seems clear to me that a generally available, generally applicable fix to BIND is needed to avoid this issue and perhaps similar issues like it.
What you probably want is a change to your local implementation of getaddrinfo() for your libc / glibc so that it prefers to issue T_A queries before it issues T_AAAA queries, and will only issue T_AAAA queries if IPv6 networking is compiled into the system.

In my experience, not only does this significantly help resolver performance in the face of nameservers which break when facing IPv6 AAAA queries, it is also a solution which many people ignore.

http://www.netbsd.org/cgi-bin/query-pr-single.pl?number=42405
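
As an aside, an application that genuinely only cares about IPv4 can
already avoid the AAAA query entirely by constraining the getaddrinfo()
hints. A minimal sketch (the host and service names below are only
examples):

/* Restricting the hints to AF_INET means getaddrinfo() never issues an
 * AAAA query at all -- an application-level workaround, distinct from the
 * libc-level A-before-AAAA ordering change described above. */
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res, *p;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;      /* A records only; AF_UNSPEC would also ask for AAAA */
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo("en.wikipedia.org", "http", &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }
    for (p = res; p != NULL; p = p->ai_next) {
        char buf[INET_ADDRSTRLEN];
        struct sockaddr_in *sin = (struct sockaddr_in *)p->ai_addr;
        if (inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf)) != NULL)
            printf("%s\n", buf);
    }
    freeaddrinfo(res);
    return 0;
}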

Regards,
--
-Chuck
Bill Owens
2011-07-11 23:01:04 UTC
Permalink
Post by Jonathan Kamens
Post by Bill Owens
https://lists.isc.org/pipermail/bind-users/2011-March/083109.html
in which the first sentence says it all: "The nameservers for
wikipedia.org are broken."
It's not just wikipedia.org that's broken, obviously. I see this error in my logs for 19 domains since July 3.
I have FORMERR entries in my logs for 79 names since June 19, a total of 5185 error messages. 2247 of those are for wikipedia-related names. Spot-checking shows that the others appear to be unrelated issues; mostly bizarre-looking misconfigurations.
Post by Jonathan Kamens
Even if PowerDNS is the only source of this issue, and even if the new
version of PowerDNS is released tomorrow, I'm sure there will still be
sites running the old version a year from now. So just relying on a
PowerDNS release to fix this problem seems unwise.
A fix to the PowerDNS problem won't remove all the FORMERR messages, but a fixed version running the wikipedia-related domains would repair your original problem, and that seems like a reasonable thing to expect. More reasonable than asking BIND to ignore incorrect responses, IMO. . .

Bill.
Mark Andrews
2011-07-12 00:03:00 UTC
Permalink
Post by Bill Owens
Post by Bill Owens
https://lists.isc.org/pipermail/bind-users/2011-March/083109.html
in which the first sentence says it all: "The nameservers for wikipedia.org are broken."
It's not just wikipedia.org that's broken, obviously. I see this error in my logs for 19 domains since July 3.
Well, you haven't been looking at your logs, or you upgraded to a version
which logs the condition.
Post by Bill Owens
Even if PowerDNS is the only source of this issue, and even if the new
version of PowerDNS is released tomorrow, I'm sure there will still be
sites running the old version a year from now. So just relying on a
PowerDNS release to fix this problem seems unwise.
Sure, but it is a minor issue overall. FORMERR is a lot better
than what used to happen. Nameservers used to drop AAAA queries
so you got timeouts when all the nameservers were working instead
of when some are working.
Post by Bill Owens
Users are experiencing this problem /now/ in the field, and more users
will be experiencing it as BIND is upgraded in more and more places.
Every single user relying on a Fedora 15 DNS server, for example, is
going to see occasional unnecessary DNS timeouts when trying to resolve
host names.
Well complain to the owners of those zones. You have logs that tell you
which nameservers are broken.
Post by Bill Owens
It seems clear to me that a generally available, generally applicable
fix to BIND is needed to avoid this issue and perhaps similar issues
like it.
The DNS has multiple nameservers so that when one is down you can
ask another and be able to cache the answer. Here none of the
nameservers are giving answers that can be cached. FORMERR, NOTIMP,
REFUSED/timeout are per server not per query tuple <QNAME/QTYPE/QCLASS>.
Post by Bill Owens
jik
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: marka at isc.org
Michael Sinatra
2011-07-12 00:16:10 UTC
Permalink
Post by Jonathan Kamens
Users are experiencing this problem now in the field, and more users will
be experiencing it as BIND is upgraded in more and more places. Every
single user relying on a Fedora 15 DNS server, for example, is going to
see occasional unnecessary DNS timeouts when trying to resolve host
names.
It seems clear to me that a generally available, generally applicable fix
to BIND is needed to avoid this issue and perhaps similar issues like
it.

What is the fix you want? Negative caching of FORMERR responses? That
won't work in the wikipedia case, since the (incorrect) SOA minimum is
only 10 minutes, and your cron job runs every 15 minutes.
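
(The 10-minute figure is the SOA "minimum", i.e. the negative-caching TTL
the zone publishes; it shows up as the last field of something like

dig wikimedia.org SOA +short

which was 600 seconds the last time I looked.)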

There are millions of broken domains out there. Asking BIND to install
kludges to pave over them is probably not the best way to go.

michael

PS. BTW, it would be incorrect to state that queries for non-existent AAAA
records for a domain name for which other records exist (e.g. CNAME or A)
should get an NXDOMAIN response. They absolutely should not. They should
get an empty answer with a NOERROR RCODE. NXDOMAIN means that there are
no dns records whatsoever that have the domain name en.wikipedia.org,
which is certainly not the case.
Bill Owens
2011-07-21 21:01:52 UTC
Permalink
Post by Bill Owens
Post by Jonathan Kamens
The number of DNS queries required for each address lookup requested by
a client has gone up considerably because of IPV6. The problem is being
exacerbated by the fact that many DNS servers on the net don't yet
support IPV6 queries. The result is that address lookups are frequently
taking so long that the client gives up before getting the result.
https://lists.isc.org/pipermail/bind-users/2011-March/083109.html
in which the first sentence says it all: "The nameservers for wikipedia.org are broken."
https://lists.isc.org/pipermail/bind-users/2011-March/083113.html
"It's PowerDNS 2.9.22 that is breaking this, and it will be fixed by
PowerDNS 3.0 once that's released, and we get around to deploying it."
Looks like PowerDNS was in RC2 as of April 19, not released yet. . .
Updating that - according to Bert Hubert (via Twitter):
"Friday the 22nd is... PowerDNS Authoritiative Server 3.0 release day!"

Bill.

Phil Mayers
2011-07-11 20:26:41 UTC
Permalink
Post by Jonathan Kamens
The number of DNS queries required for each address lookup requested
by a client has gone up considerably because of IPV6. The problem is
being exacerbated by the fact that many DNS servers on the net don't yet
support IPV6 queries. The result is that address lookups are frequently
taking so long that the client gives up before getting the result.
Can you be more specific here? Do you mean "many DNS servers don't
support queries with qtype=AAAA" or "many DNS servers don't support
queries over IPv6/UDP or IPv6/TCP"?
Post by Jonathan Kamens
This is fine when the wikipedia.org nameservers are working, but let's
postulate for the moment that two of them are down, unreachable, or
responding slowly, which apparently happens pretty often. Then we end up
wikipedia.org DNS
en.wikipedia.org AAAA /times out/
en.wikipedia.org AAAA /times out/
en.wikipedia.org AAAA
en.wikipedia.org A /times out/
en.wikipedia.org A /times out/
en.wikipedia.org A
I don't quite see how you're getting this behaviour.

Every operating system that I know of recommends getaddrinfo or some
similar variant for doing multiprotocol IPv4/IPv6 lookups, and as far as
I'm aware, they all do something very similar - namely, send the A and
AAAA lookups in parallel. When I try this against a bind server, I see
this makes bind perform the A/AAAA lookups in parallel too. So, at worst
you should have something like:

0.0001 A query
0.0002 AAAA query
...
1.0000 A query timeout
1.0001 AAAA query timeout

...repeated X+1 times for X non-responding NS records.

That is, the lookups should happen in parallel, so the time taken should
not double.

If your app is doing its own DNS requests and it's doing them in series,
then it's broken, for exactly this reason, and should use the system
resolver.
Post by Jonathan Kamens
By the end of that sequence, the typical 30-second DNS request
timeout has been exceeded, and the client gives up.
I said above that the problem is exacerbated by the fact that many DNS
servers don't yet support IPV6 queries. This is because the AAAA queries
don't get NXDOMAIN responses, which would be cached, but rather FORMERR
Not in my observations. As Tony has said, you seem to have a broken
upstream resolver.
Post by Jonathan Kamens
I'm interested to hear if other people are encountering this problem and
No, we are not seeing this problem, and we have thousands of
IPv6-enabled clients making A/AAAA DNS requests constantly. It just
works (tm).

This is not to say -ve caching of FORMERR is a bad idea; it may well be
a good idea. But I think there is more going on here than simply a
failure of -ve caching.
Kevin Darcy
2011-07-11 21:50:21 UTC
Permalink
Post by Jonathan Kamens
The number of DNS queries required for each address lookup requested
by a client has gone up considerably because of IPV6. The problem is
being exacerbated by the fact that many DNS servers on the net don't
yet support IPV6 queries. The result is that address lookups are
frequently taking so long that the client gives up before getting the
result.
The example I am seeing this with most frequently is my RSS feed
reader, rss2email, trying to read a feed from en.wikipedia.org in a
cron job that runs every 15 minutes. I am regularly seeing this in the
W: Name or service not known [8]
http://en.wikipedia.org/w/index.php?title=/[elided]/&feed=atom&action=history
The wikipedia.org domain has three DNS servers. Let's assume that the
root and org. nameservers are cached already when rss2email does its
wikipedia.org DNS
en.wikipedia.org AAAA
en.wikipedia.org A
This is fine when the wikipedia.org nameservers are working, but let's
postulate for the moment that two of them are down, unreachable, or
responding slowly, which apparently happens pretty often. Then we end
wikipedia.org DNS
en.wikipedia.org AAAA /times out/
en.wikipedia.org AAAA /times out/
en.wikipedia.org AAAA
en.wikipedia.org A /times out/
en.wikipedia.org A /times out/
en.wikipedia.org A
By the end of that sequence, the typical 30-second DNS request
timeout has been exceeded, and the client gives up.
The math isn't working. I just ran a quick test and named (9.7.x) failed
over from a non-working delegated NS to a working delegated NS in less
than 30 milliseconds. How are you reaching a 30-*second* timeout
threshold in only 6 queries?

In practice, it would also be quite unlikely that named would pick
"dead" nameservers before live ones for *both* the AAAA and the A query.
At the very least, once the timeouts were encountered for the AAAA
query, those NSes would be penalized in terms of NS selection, so they
are unlikely to be chosen *again*, ahead of the working NS, for the A
query. Any en.wikipedia.org NSes which were found to be *persistently*
broken, would gravitate to the bottom of the selection list, and be
tried approximately never.

I think maybe you need to probe deeper and find out what _else_ is going on.



- Kevin



Doug Barton
2011-07-12 00:15:30 UTC
Permalink
Post by Jonathan Kamens
The number of DNS queries required for each address lookup requested by
a client has gone up considerably because of IPV6. The problem is being
exacerbated by the fact that many DNS servers on the net don't yet
support IPV6 queries.
I have to disagree with your premise here. It's true that DNS software
has a notoriously long deprecation cycle, but AAAA records have been
around for long enough that it's highly unlikely there are enough name
servers that don't handle them to make a noticeable difference. And even
if you can find one, it should be upgraded for a vast array of other
reasons.
Post by Jonathan Kamens
The result is that address lookups are frequently
taking so long that the client gives up before getting the result.
It sounds to me like you don't have IPv6 connectivity. If so, you've
already been given the advice to configure your OS to avoid asking for
AAAA at all, or at least to ask for A first. Heed this advice.
Post by Jonathan Kamens
The example I am seeing this with most frequently is my RSS feed reader,
rss2email, trying to read a feed from en.wikipedia.org in a cron job
that runs every 15 minutes. I am regularly seeing this in the output of
W: Name or service not known [8]
http://en.wikipedia.org/w/index.php?title=/[elided]/&feed=atom&action=history
The wikipedia.org domain has three DNS servers. Let's assume that the
root and org. nameservers are cached already when rss2email does its
wikipedia.org DNS
en.wikipedia.org AAAA
en.wikipedia.org A
This is fine when the wikipedia.org nameservers are working, but let's
postulate for the moment that two of them are down, unreachable, or
responding slowly, which apparently happens pretty often. Then we end up
wikipedia.org DNS
en.wikipedia.org AAAA /times out/
en.wikipedia.org AAAA /times out/
en.wikipedia.org AAAA
en.wikipedia.org A /times out/
en.wikipedia.org A /times out/
en.wikipedia.org A
By the end of that sequence, the typical 30-second DNS request
timeout has been exceeded, and the client gives up.
See above. YOU need to configure your software to not ask for AAAA, or
to ask for A first.
Post by Jonathan Kamens
I said above that the problem is exacerbated by the fact that many DNS
servers don't yet support IPV6 queries. This is because the AAAA queries
don't get NXDOMAIN responses, which would be cached, but rather FORMERR
responses, which are not cached. As a result, the scenario described
above happens much more frequently because the DNS server has to redo
the AAAA queries often.
Can you provide examples of specific name servers, on the network now,
that respond this way? The authoritative name servers for wikipedia.org
respond correctly (NOERROR/ANSWER=0) to AAAA queries for
en.wikipedia.org. If you are seeing a FORMERR response to these queries
the problem lies somewhere in your resolution chain.

Before mitigating steps in correctly functioning software are considered,
there needs to be substantial evidence that there are enough really,
really old name servers that behave the way you describe still online to
make the effort worthwhile.


hth,

Doug
--
Nothin' ever doesn't change, but nothin' changes much.
-- OK Go

Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price. :) http://SupersetSolutions.com/
Mark Andrews
2011-07-12 02:35:18 UTC
Permalink
Wikipedia have been told multiple times that their nameservers are
broken, that they fail to add the CNAME records, as required by RFC
1034, which results in garbage answers being returned. Those garbage
answers result in the FORMERR log messages.

Both of the answers below should have CNAME chains in them but only
the A query has them.

Now luckily this doesn't affect every AAAA lookup as the CNAME
records returned from the A lookup are cached, so every hour the
recursive nameserver needs to go through this dance. Asking for A
before AAAA just hides the problem by priming the cache.

Mark

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23606
;; flags: qr aa; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;en.wikipedia.org. IN A

;; ANSWER SECTION:
en.wikipedia.org. 3600 IN CNAME text.wikimedia.org.
text.wikimedia.org. 600 IN CNAME text.pmtpa.wikimedia.org.
text.pmtpa.wikimedia.org. 3600 IN A 208.80.152.2

;; Query time: 411 msec
;; SERVER: 91.198.174.4#53(ns2.wikimedia.org)
;; WHEN: Tue Jul 12 12:02:06 2011

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23260
;; flags: qr aa; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;en.wikipedia.org. IN AAAA

;; AUTHORITY SECTION:
wikimedia.org. 86400 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2011071119 43200 7200 1209600 600

;; Query time: 306 msec
;; SERVER: 208.80.152.142#53(ns1.wikimedia.org)
;; WHEN: Tue Jul 12 12:00:58 2011
;; MSG SIZE rcvd: 108
Post by Doug Barton
Post by Jonathan Kamens
The number of DNS queries required for each address lookup requested by
a client has gone up considerably because of IPV6. The problem is being
exacerbated by the fact that many DNS servers on the net don't yet
support IPV6 queries.
I have to disagree with your premise here. It's true that DNS software
has a notoriously long deprecation cycle, but AAAA records have been
around for long enough that it's highly unlikely there are enough name
servers that don't handle them to make a noticeable difference. And even
if you can find one, it should be upgraded for a vast array of other
reasons.
Post by Jonathan Kamens
The result is that address lookups are frequently
taking so long that the client gives up before getting the result.
It sounds to me like you don't have IPv6 connectivity. If so, you've
already been given the advice to configure your OS to avoid asking for
AAAA at all, or at least to ask for A first. Heed this advice.
Post by Jonathan Kamens
The example I am seeing this with most frequently is my RSS feed reader,
rss2email, trying to read a feed from en.wikipedia.org in a cron job
that runs every 15 minutes. I am regularly seeing this in the output of
W: Name or service not known [8]
http://en.wikipedia.org/w/index.php?title=/[elided]/&feed=atom&action=history
Post by Jonathan Kamens
The wikipedia.org domain has three DNS servers. Let's assume that the
root and org. nameservers are cached already when rss2email does its
wikipedia.org DNS
en.wikipedia.org AAAA
en.wikipedia.org A
This is fine when the wikipedia.org nameservers are working, but let's
postulate for the moment that two of them are down, unreachable, or
responding slowly, which apparently happens pretty often. Then we end up
wikipedia.org DNS
en.wikipedia.org AAAA /times out/
en.wikipedia.org AAAA /times out/
en.wikipedia.org AAAA
en.wikipedia.org A /times out/
en.wikipedia.org A /times out/
en.wikipedia.org A
By the end of that sequence, the typical 30-second DNS request
timeout has been exceeded, and the client gives up.
See above. YOU need to configure your software to not ask for AAAA, or
to ask for A first.
Post by Jonathan Kamens
I said above that the problem is exacerbated by the fact that many DNS
servers don't yet support IPV6 queries. This is because the AAAA queries
don't get NXDOMAIN responses, which would be cached, but rather FORMERR
responses, which are not cached. As a result, the scenario described
above happens much more frequently because the DNS server has to redo
the AAAA queries often.
Can you provide examples of specific name servers, on the network now,
that respond this way? The authoritative name servers for wikipedia.org
respond correctly (NOERROR/ANSWER=0) to AAAA queries for
en.wikipedia.org. If you are seeing a FORMERR response to these queries
the problem lies somewhere in your resolution chain.
Before taking mitigating steps in correctly functioning software is
considered there needs to be substantial evidence that there are enough
really really old name servers that behave the way you describe still on
line to make the effort worthwhile.
hth,
Doug
--
Nothin' ever doesn't change, but nothin' changes much.
-- OK Go
Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price. :) http://SupersetSolutions.com/
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: marka at isc.org
Jonathan Kamens
2011-07-13 04:45:18 UTC
Permalink
Well, all the prodding from people here prompted me to investigate
further exactly what's going on. The problem isn't what I thought it
was. It appears to be a bug in glibc, and I've filed a bug report and
found a workaround.

In a nutshell, the getaddrinfo function in glibc sends both A and AAAA
queries to the DNS server at the same time and then deals with the
responses as they come in. Unfortunately, if the responses to the two
queries come back in reverse order, /and/ the first one to come back is
a server failure, both of which are the case when you try to resolve
en.wikipedia.org immediately after restarting your DNS server so nothing
is cached, the glibc code screws up and decides it didn't get back a
successful response even though it did.

If you do the same lookup again, it works, because the CNAME that was
sent in response to the A query is cached, so both the A and AAAA
queries get back valid responses from the DNS server. And even if that
weren't the case, since the CNAME is cached it gets returned first,
since the server doesn't need to do a query to get it, whereas it does
need to do another query to get the AAAA record (which recall isn't
being cached because of the previously discussed FORMERR problem). It'll
keep working until the cached records time out, at which point it'll
happen again, and then be OK again until the records time out, etc.

The workaround is to put "options single-request" in /etc/resolv.conf to
prevent the glibc innards from sending out both the A and AAAA queries
at the same time.
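
For anyone else who wants to try the workaround, the resulting resolv.conf
is tiny (the nameserver address below is just an example pointing at a
local caching server):

# /etc/resolv.conf
nameserver 127.0.0.1
# Serialize the A and AAAA lookups instead of sending them in parallel.
options single-request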

FYI, here's the glibc bug I filed about this:

http://sourceware.org/bugzilla/show_bug.cgi?id=12994

Thank you for telling me I was full of it and making me dig deeper into
this until I located the actual cause of the issue. :-)

jik

Mark Andrews
2011-07-13 06:13:08 UTC
Permalink
No. The fix is to correct the nameservers. They are not correctly
following the DNS protocol and everything else is fallout from
that.
Post by Jonathan Kamens
Well, all the prodding from people here prompted me to investigate
further exactly what's going on. The problem isn't what I thought it
was. It appears to be a bug in glibc, and I've filed a bug report and
found a workaround.
There is no bug in glibc.
Post by Jonathan Kamens
In a nutshell, the getaddrinfo function in glibc sends both A and AAAA
queries to the DNS server at the same time and then deals with the
responses as they come in. Unfortunately, if the responses to the two
queries come back in reverse order, /and/ the first one to come back is
a server failure, both of which are the case when you try to resolve
en.wikipedia.org immediately after restarting your DNS server so nothing
is cached, the glibc code screws up and decides it didn't get back a
successful response even though it did.
There is *nothing* wrong with sending both queries at once.
Post by Jonathan Kamens
If you do the same lookup again, it works, because the CNAME that was
sent in response to the A query is cached, so both the A and AAAA
queries get back valid responses from the DNS server. And even if that
weren't the case, since the CNAME is cached it gets returned first,
since the server doesn't need to do a query to get it, whereas it does
need to do another query to get the AAAA record (which recall isn't
being cached because of the previously discussed FORMERR problem). It'll
keep working until the cached records time out, at which point it'll
happen again, and then be OK again until the records time out, etc.
The workaround is to put "options single-request" in /etc/resolv.conf to
prevent the glibc innards from sending out both the A and AAAA queries
at the same time.
http://sourceware.org/bugzilla/show_bug.cgi?id=12994
Thank you for telling me I was full of it and making me dig deeper into
this until I located the actual cause of the issue. :-)
jik
Note your "fix" won't help clients that only ask for AAAA records
because it is the authoritative servers that are broken, not the
resolver library or the recursive server.

Mark
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: marka at isc.org
Jonathan Kamens
2011-07-13 06:35:32 UTC
Permalink
Post by Mark Andrews
No. The fix is to correct the nameservers. They are not correctly
following the DNS protocol and everything else is a fall out from
that.
You're right that everything else is fallout from that.

But that doesn't do me much good, does it? It's my system that keeps
getting bogus name resolution errors. It's my RSS feed reader that keeps
failing on an hourly basis when the cached records for en.wikipedia.org
expire. It's all very well and good to say that the Wikipedia folks and
other people with this problem should fix their nameservers -- I totally
agree with that -- but it doesn't help me solve my problem /now/.

I'm a real user in the real world with a real problem. Yelling at
Wikipedia to fix their DNS servers may feel good, but it doesn't make my
DNS work. As far as I and all the other users who are being impacted
/now/ by this problem are concerned, it's just pissing into the wind.
Post by Mark Andrews
Post by Jonathan Kamens
Well, all the prodding from people here prompted me to investigate
further exactly what's going on. The problem isn't what I thought it
was. It appears to be a bug in glibc, and I've filed a bug report and
found a workaround.
There is no bug in glibc.
To be blunt, that's bullshit.

If glibc makes an A query and an AAAA query, and it gets back a valid
response to the A query and an invalid response to the AAAA query, then
it should ignore the invalid response to the AAAA query and return the
valid A response to the user as the IP address for the host.

Please note, furthermore, that as I explained in detail in my bug report
and in my last message, glibc behaves differently based on the /order/
in which the two responses are returned by the DNS server. Since there's
nothing that says a DNS server has to respond to two queries in the
order in which they were received, and that would be an impossible
requirement to impose in any case, since the queries and responses are
sent via UDP which doesn't guarantee order, it's perfectly clear that
glibc needs to be prepared to function the same regardless of the order
in which it receives the responses.

What's more, there's plenty of code in the glibc files I spent hours
poring over which is clearly an attempt to do exactly that. The people
who wrote the code just got it wrong. Which isn't surprising, given how
god-awful the code is.

This is not an either/or situation. The broken nameservers should be
fixed, /and/ glibc should be fixed to properly handle the case of when
it sends two queries and gets back one valid response and one server
error in reverse order.
Post by Mark Andrews
Post by Jonathan Kamens
In a nutshell, the getaddrinfo function in glibc sends both A and AAAA
queries to the DNS server at the same time and then deals with the
responses as they come in. Unfortunately, if the responses to the two
queries come back in reverse order, /and/ the first one to come back is
a server failure, both of which are the case when you try to resolve
en.wikipedia.org immediately after restarting your DNS server so nothing
is cached, the glibc code screws up and decides it didn't get back a
successful response even though it did.
There is *nothing* wrong with sending both queries at once.
I didn't say there was. You really don't seem to be paying very good
attention.

Do you understand what the word /workaround/ means?
Post by Mark Andrews
Note your "fix" won't help clients that only ask for AAAA records
because it is the authoritative servers that are broken, not the
resolver library or the recursive server.
I am aware of that. It is irrelevant, because it is not the problem I am
trying to solve. I, and 99.999999% of the users in the world, are /not/
"only ask[ing] for AAAA records." Nobody actually trying to use the
internet for day-to-day work is doing that right now, because to say
that IPv6 support is not yet ubiquitous would be a laughably momentous
understatement.

You seem to have a really big chip on your shoulder about people who run
broken DNS servers. I don't like them any more than you do. But I
learned "Be generous in what you accept and conservative in what you
generate" way back when I started playing with the Internet well over
two decades ago. It holds up now as well as it did back then, and
there's no good reason why it shouldn't apply in this case.

It's clear that this is a religious issue for you. I'm not here to
debate religion, I'm here to get help making my DNS work, and to help
other people, to whatever extent I can, make /their/ DNS work. If you
continue to send religious screeds on this topic while making no effort
to actually read and understand what I write, please do not expect me to
respond further.

Jonathan Kamens

Kevin Darcy
2011-07-13 17:06:13 UTC
Permalink
Post by Jonathan Kamens
Post by Mark Andrews
Post by Jonathan Kamens
Well, all the prodding from people here prompted me to investigate
further exactly what's going on. The problem isn't what I thought it
was. It appears to be a bug in glibc, and I've filed a bug report and
found a workaround.
There is no bug in glibc.
To be blunt, that's bullshit.
If glibc makes an A query and an AAAA query, and it gets back a valid
response to the A query and an invalid response to the AAAA query,
then it should ignore the invalid response to the AAAA query and
return the valid A response to the user as the IP address for the host.
Please note, furthermore, that as I explained in detail in my bug
report and in my last message, glibc behaves differently based on the
/order/ in which the two responses are returned by the DNS server.
Since there's nothing that says a DNS server has to respond to two
queries in the order in which they were received, and that would be an
impossible requirement to impose in any case, since the queries and
responses are sent via UDP which doesn't guarantee order, it's
perfectly clear that glibc needs to be prepared to function the same
regardless of the order in which it receives the responses.
I agree that the order of the A/AAAA responses shouldn't matter to the
result. The whole getaddrinfo() call should fail regardless of whether
the failure is seen first or the valid response is seen first. Why?
Because getaddrinfo() should, if it isn't already, be using the RFC 3484
algorithm (and/or whatever the successor to RFC 3484 ends up being) to
sort the addresses, and for that algorithm to work, one needs *both* the
IPv4 address(es) *and* the IPv6 address(es) available, in order to
compare their scopes, prefixes, etc.. If one of the lookups "fails", and
this failure is presented to the RFC 3484 algorithm as NODATA for a
particular address family, then the algorithm could make a bad selection
of the destination address, and this can lead to other sorts of
breakage, e.g. trying to use a tunneled connection where no tunnel
exists. The *safe* thing for glibc to do is to promote the failure of
either the A lookup or the AAAA lookup to a general lookup failure,
which prompts the user/administrator to find the source of the problem
and fix it.

It's rarely a good idea to mask undeniable errors as if there were no
error at all. It leads to unpredictable behavior and really tough
troubleshooting challenges. I think glibc is erring on the side of
openness and transparency here, rather than trying to cover up the fact
that something is horribly wrong.
Post by Jonathan Kamens
Post by Mark Andrews
Note your "fix" won't help clients that only ask for AAAA records
because it is the authoritative servers that are broken, not the
resolver library or the recursive server.
I am aware of that. It is irrelevant, because it is not the problem I
am trying to solve. I, and 99.999999% of the users in the world, are
/not/ "only ask[ing] for AAAA records." Nobody actually trying to use
the internet for day-to-day work is doing that right now, because to
say that IPv6 support is not yet ubiquitous would be a laughably
momentous understatement.
What about clients in a NAT64/DNS64 environment? They could be
configured as IPv6-only but normally able to access the IPv4 Internet
just fine. Even with your glibc "fix" in place, though, they'll
presumably break if the authoritative nameservers are giving garbage
responses to AAAA queries (could someone with practical experience in
DNS64 please confirm this?).

Another possibility you're not considering is that the invoking
application itself may make independent IPv4-specific and IPv6-specific
getaddrinfo() lookups. Why would it do this? Why not? Maybe IPv6
capability is something the user has to buy a separate license for, so
the IPv6 part is a slightly separate codepath, added in a later version,
than the base product, which is IPv4-only. When one of the getaddrinfo()
calls returns address records and the other returns garbage, your "fix"
doesn't prevent such an application from doing something unpredictable,
possibly catastrophic. So it's really not a general solution to the problem.




- Kevin
Kevin Darcy
2011-07-13 17:31:51 UTC
Permalink
Post by Kevin Darcy
Post by Jonathan Kamens
Post by Mark Andrews
Post by Jonathan Kamens
Well, all the prodding from people here prompted me to investigate
further exactly what's going on. The problem isn't what I thought it
was. It appears to be a bug in glibc, and I've filed a bug report and
found a workaround.
There is no bug in glibc.
To be blunt, that's bullshit.
If glibc makes an A query and an AAAA query, and it gets back a valid
response to the A query and an invalid response to the AAAA query,
then it should ignore the invalid response to the AAAA query and
return the valid A response to the user as the IP address for the host.
Please note, furthermore, that as I explained in detail in my bug
report and in my last message, glibc behaves differently based on the
/order/ in which the two responses are returned by the DNS server.
Since there's nothing that says a DNS server has to respond to two
queries in the order in which they were received, and that would be
an impossible requirement to impose in any case, since the queries
and responses are sent via UDP which doesn't guarantee order, it's
perfectly clear that glibc needs to be prepared to function the same
regardless of the order in which it receives the responses.
I agree that the order of the A/AAAA responses shouldn't matter to the
result. The whole getaddrinfo() call should fail regardless of whether
the failure is seen first or the valid response is seen first. Why?
Because getaddrinfo() should, if it isn't already, be using the RFC
3484 algorithm (and/or whatever the successor to RFC 3484 ends up
being) to sort the addresses, and for that algorithm to work, one
needs *both* the IPv4 address(es) *and* the IPv6 address(es)
available, in order to compare their scopes, prefixes, etc. If one of
the lookups "fails", and this failure is presented to the RFC 3484
algorithm as NODATA for a particular address family, then the
algorithm could make a bad selection of the destination address, and
this can lead to other sorts of breakage, e.g. trying to use a
tunneled connection where no tunnel exists. The *safe* thing for
glibc to do is to promote the failure of either the A lookup or the
AAAA lookup to a general lookup failure, which prompts the
user/administrator to find the source of the problem and fix it.
It's rarely a good idea to mask undeniable errors as if there were no
error at all. It leads to unpredictable behavior and really tough
troubleshooting challenges. I think glibc is erring on the side of
openness and transparency here, rather than trying to cover up the
fact that something is horribly wrong.
Post by Jonathan Kamens
Post by Mark Andrews
Note your "fix" won't help clients that only ask for AAAA records
because it is the authoritative servers that are broken, not the
resolver library or the recursive server.
I am aware of that. It is irrelevant, because it is not the problem I
am trying to solve. I, and 99.999999% of the users in the world, are
/not/ "only ask[ing] for AAAA records." Nobody actually trying to use
the internet for day-to-day work is doing that right now, because to
say that IPv6 support is not yet ubiquitous would be a laughably
momentous understatement.
What about clients in a NAT64/DNS64 environment? They could be
configured as IPv6-only but normally able to access the IPv4 Internet
just fine. Even with your glibc "fix" in place, though, they'll
presumably break if the authoritative nameservers are giving garbage
responses to AAAA queries (could someone with practical experience in
DNS64 please confirm this?).
Another possibility you're not considering is that the invoking
application itself may make independent IPv4-specific and
IPv6-specific getaddrinfo() lookups. Why would it do this? Why not?
Maybe IPv6 capability is something the user has to buy a separate
license for, so the IPv6 part is a slightly separate codepath, added
in a later version than the base product, which is IPv4-only. When
one of the getaddrinfo() calls returns address records and the other
returns garbage, your "fix" doesn't prevent such an application from
doing something unpredictable, possibly catastrophic. So it's really
not a general solution to the problem.
Oh, I should also point out that this brokenness by the
wikipedia/wikimedia nameservers *isn't* just specific to AAAA queries,
and therefore *isn't* "fixable" with getaddrinfo() alone. Try doing an
MX query of en.wikipedia.org. Or a PTR query. Or any of the other "old"
(yet non-deprecated) query types (e.g. NS, TXT, HINFO). The only QTYPEs
that are answered correctly are A, CNAME and (oddly enough) SOA. So they
don't even have the excuse of "well, AAAA queries are kinda new, we
haven't got around to handling them properly yet". This behavior has
failed to conform to the standard, for as long as the standard has
existed; it's not recent, IPv6-specific breakage.
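
(If anyone wants to reproduce this, something like the following should do
it. I'm assuming ns0.wikimedia.org is still one of the zone's delegated
nameservers; substitute whatever "dig wikipedia.org NS" gives you.)

    dig +norecurse @ns0.wikimedia.org en.wikipedia.org MX
    dig +norecurse @ns0.wikimedia.org en.wikipedia.org TXT
    dig +norecurse @ns0.wikimedia.org en.wikipedia.org A    # only this style of query comes back clean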



- Kevin

Jonathan Kamens
2011-07-13 18:39:52 UTC
Permalink
I agree that the order of the A/AAAA responses shouldn't matter to the
result. The whole getaddrinfo() call should fail regardless of whether the
failure is seen first or the valid response is seen first. Why? Because
getaddrinfo() should, if it isn't already, be using the RFC 3484 algorithm
(and/or whatever the successor to RFC 3484 ends up being) to sort the
addresses, and for that algorithm to work, one needs *both* the IPv4
address(es) *and* the IPv6 address(es) available, in order to compare their
scopes, prefixes, etc.



RFC 3484 tells you how to sort addresses you've got.



If you've only got one address, then bang! It's already sorted for you. You
don't need RFC 3484 to tell you how to sort it.



I have to say that some of the people on this list seem completely detached
from what real users in the real world want their computers to do.



If I am trying to connect to a site on the internet, then I want my computer
to do its best to try to connect to the site. I don't want it to throw up
its hands and say, "Oh, I'm sorry, one of my address lookups failed, so I'm
not going to let you use the other address lookup, the one that succeeded,
because some RFC somewhere could be interpreted as implying that's a bad
idea, if I wanted to do so." Please, that's ridiculous.



If one of the lookups "fails", and this failure is presented to the RFC 3484
algorithm as NODATA for a particular address family, then the algorithm
could make a bad selection of the destination address, and this can lead to
other sorts of breakage, e.g. trying to use a tunneled connection where no
tunnel exists.



If the address the client gets doesn't work, then the address doesn't work.
How is being unable to connect because the address turned out to not be
routable different from being unable to connect because the computer refused
to even try?



Another possibility you're not considering is that the invoking application
itself may make independent IPv4-specific and IPv6-specific getaddrinfo()
lookups. Why would it do this? Why not? Maybe IPv6 capability is something
the user has to buy a separate license for, so the IPv6 part is a slightly
separate codepath, added in a later version than the base product, which is
IPv4-only. When one of the getaddrinfo() calls returns address records and
the other returns garbage, your "fix" doesn't prevent such an application
from doing something unpredictable, possibly catastrophic. So it's really
not a general solution to the problem.



I have no idea what you're talking about. If the application makes
independent IPv4 and IPv6 getaddrinfo() lookups, then the change I'm
proposing to glibc is completely irrelevant and does not impact the existing
functionality in any way. The IPv4 lookup will succeed, the IPv6 lookup will
fail, and the application is then free to decide what to do.



In summary, getaddrinfo() with AF_UNSPEC has a very clear meaning - "Give me
whatever addresses you can." The man page says, and I am quoting, "The value
AF_UNSPEC indicates that getaddrinfo() should return socket addresses for
any address family (either IPv4 or IPv6, for example) that can be used with
node and service." I don't see how the language could be any more clear. To
suggest that it's reasonable and correct for it to refuse to return a
successfully fetched address is simply ludicrous.
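
The normal way callers consume an AF_UNSPEC result makes the point even
clearer: you walk the list and try each address until one connects. A
minimal sketch of that pattern (hostname and service are just examples):

    /* Sketch: the usual "try every address getaddrinfo() gave you" loop. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>

    static int connect_to(const char *host, const char *service)
    {
        struct addrinfo hints, *res, *ai;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_UNSPEC;   /* "any address family ... that can be used" */
        hints.ai_socktype = SOCK_STREAM;

        int rc = getaddrinfo(host, service, &hints, &res);
        if (rc != 0) {
            fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
            return -1;
        }
        int fd = -1;
        for (ai = res; ai != NULL; ai = ai->ai_next) {
            fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (fd == -1)
                continue;
            if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
                break;                   /* connected; use whichever family worked */
            close(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;                       /* -1 only if nothing at all worked */
    }

    int main(void)
    {
        return connect_to("en.wikipedia.org", "http") == -1 ? 1 : 0;
    }

Hand that loop one good A record and it connects. Let one of the two queries
behind the curtain come back mangled, though, and the whole call fails
outright, even though a perfectly usable address was sitting right there.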



I hope and pray that people who maintain the glibc code have more common
sense about what users want and expect from their software.



In the meantime, it's clear that I don't belong on this mailing list, so I'm
out of here.



Jonathan Kamens



Kevin Darcy
2011-07-13 22:27:59 UTC
Permalink
Post by Kevin Darcy
I agree that the order of the A/AAAA responses shouldn't matter to the
result. The whole getaddrinfo() call should fail regardless of whether
the failure is seen first or the valid response is seen first. Why?
Because getaddrinfo() should, if it isn't already, be using the RFC
3484 algorithm (and/or whatever the successor to RFC 3484 ends up
being) to sort the addresses, and for that algorithm to work, one
needs *both* the IPv4 address(es) *and* the IPv6 address(es)
available, in order to compare their scopes, prefixes, etc.
RFC 3484 tells you how to sort addresses you've got.
If you've only got one address, then bang! It's already sorted for
you. You don't need RFC 3484 to tell you how to sort it.
No, you've got one address, and one unspecified nameserver failure.
Garbage in, garbage out. To say that a nameserver failure is equivalent
to NODATA is not only technically incorrect, it leads to all sorts of
operational problems in the real world.
Post by Kevin Darcy
I have to say that some of the people on this list seem completely
detached from what real users in the real world want their computers
to do.
Really? Do you think I'm an academic? Do you think I sit and write
Internet Drafts and RFCs all day? No, I'm an implementor. I deal with
DNS operational problems and issues all day, every workday. And I can
tell you that I don't appreciate library routines making wild-ass
assumptions that, in the face of some questionable behavior by a
nameserver, maybe, possibly some quantity of addresses that I've
acquired from that dodgy nameserver are good enough for my clients to
try and connect to. No thanks. If there's a real problem I want to know
about it as clearly and unambiguously as possible. I can't deal
effectively with a problem if it's being masked by some library routine
doing something weird behind my back.
Post by Kevin Darcy
If I am trying to connect to a site on the internet, then I want my
computer to do its best to try to connect to the site. I don't want it
to throw up its hands and say, "Oh, I'm sorry, one of my address
lookups failed, so I'm not going to let you use the /other/ address
lookup, the one that succeeded, because some RFC somewhere could be
interpreted as implying that's a bad idea, if I wanted to do so."
Please, that's ridiculous.
No, what's more ridiculous is if users can't get to a site SOME OF THE
TIME, because someone's DNS is broken, a moronic library routine then
routes the traffic some unexpected way, and a whole raft of other
variables enter the picture, without anyone realizing or paying
attention to the dependencies and interconnectivity that is required to
keep the client working. There is a certain threshold of brokenness
where the infrastructure has to "throw up its hands", as you put it, and
say "nuh uh, not gonna happen", because to try to work around the
problem based on not enough information about the topology, the
environment, the dependencies, etc. you're likely to cause more harm
than good by making the failure modes way more complex than necessary.
Post by Kevin Darcy
If one of the lookups "fails", and this failure is presented to the
RFC 3484 algorithm as NODATA for a particular address family, then the
algorithm could make a bad selection of the destination address, and
this can lead to other sorts of breakage, e.g. trying to use a
tunneled connection where no tunnel exists.
If the address the client gets doesn't work, then the address doesn't
work. How is being unable to connect because the address turned out to
not be routable different from being unable to connect because the
computer refused to even try?
Because the failure modes are substantially different and it could take
significant man-hours to determine that the root cause of the problem is
actually DNS brokenness rather than something else in the network
infrastructure (routers, switches, VPN concentrators, firewalls, IPSes,
load-balancers, etc.) or in the client or server (OS, application,
middleware, etc.).

Have you ever actually troubleshot a difficult connectivity problem in a
complex networking environment? Trust me, you want clear symptoms, clear
failure modes. Not a bunch of components making dumb assumptions and/or
trying to be "helpful" outside of their defined scope of functionality.
That kind of "help" is like offering a glass of water to a drowning man.
Post by Kevin Darcy
Another possibility you're not considering is that the invoking
application itself may make independent IPv4-specific and
IPv6-specific getaddrinfo() lookups. Why would it do this? Why not?
Maybe IPv6 capability is something the user has to buy a separate
license for, so the IPv6 part is a slightly separate codepath, added
in a later version than the base product, which is IPv4-only. When
one of the getaddrinfo() calls returns address records and the other
returns garbage, your "fix" doesn't prevent such an application from
doing something unpredictable, possibly catastrophic. So it's really
not a general solution to the problem.
I have no idea what you're talking about. If the application makes
independent IPv4 and IPv6 getaddrinfo() lookups, then the change I'm
proposing to glibc is completely irrelevant and does not impact the
existing functionality in any way. The IPv4 lookup will succeed, the
IPv6 lookup will fail, and the application is then free to decide what
to do.
I wasn't saying the glibc change would break such an application, only
that your proposed "fix" doesn't help it either, so it shouldn't be
mistaken for a general solution to the problem. Your "fix" only applies
to AF_UNSPEC and one should not assume that getaddrinfo() is *always*
called with AF_UNSPEC. That is, after all, why the ai_family member of
the "hints" parameter struct takes values other than AF_UNSPEC.
Post by Kevin Darcy
In summary, getaddrinfo() with AF_UNSPEC has a very clear meaning --
"Give me whatever addresses you can." The man page says, and I am
quoting, "The value AF_UNSPEC undicates that getaddrinfo() should
return socket addresses for any address family (either IPv4 or IPv6,
for example) that can be used with node and service." I don't see how
the language could be any more clear.
Clear eh? That's because you're reading it through the tunnel vision of
your own preferences. The text you quoted says nothing about failure
modes, RFC 3484 or the full-/partial-answer distinction. Are you hanging
your hat completely on the phrase "can be used"? Really? You're reading
that much detail into such generic language? Well, as I've said before,
for RFC 3484 to work properly, one needs all available IPv4 and IPv6
addresses. If the list of addresses is short because of nameserver
brokenness, I'd say RFC 3484 couldn't do its job and the resulting
partial address list is "unusable" because it hasn't been properly
processed. That's my take on "can be used".

Since you decided to start playing the "man page semantic game", it's my
turn now.

I could just as easily point to this text from the Linux version of the
man page:

*EAI_AGAIN*
The name server returned a temporary failure indication. Try again
later.
*EAI_BADFLAGS*
/ai_flags/ contains invalid flags.
*EAI_FAIL*
The name server returned a permanent failure indication.

Notice that the descriptions of EAI_AGAIN and EAI_FAIL say "The
nameserver returned a [...] failure". Not "The nameserver returned [...]
failures for all lookups" or "The nameserver returned only failures". In
the English language, "a failure" is equivalent to "more than 0 failures".

Based on this text, slanted with my own biases and preferences, I'll say
then that *any* error return from a nameserver should cause
getaddrinfo() to fail, in order to be consistent with those error-code
descriptions. Are you going to argue differently? Shall we split hairs
over the meaning of the word "a"?
Post by Kevin Darcy
To suggest that it's reasonable and correct for it to refuse to return
a successfully fetched address is simply ludicrous.
By the same faulty reasoning, any addresses returned in a *truncated*
DNS response are still usable. We hashed that one over for years, and
finally decided it was a really stupid idea. Partial DNS results are not
reliably usable. And this is just as true for one "part" of a
getaddrinfo() lookup failing as it is for a truncated DNS response.
Unless you have all of the information which was requested, you can't
assume that the invoker is going to be able to sensibly use what you
hand back to it. If one detects a failure that materially affects the
completeness or trustworthiness of a response, one has to either a) try
to acquire the information another way (e.g. in the case of response
truncation, retry the query using TCP) or b) assume the worst and put
the onus on the invoker, or perhaps further up the responsibility chain
to the user, admin, implementor, or architect, to flag the error as an
actual error and fix what's really wrong.
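
(If you want to see the truncation analogue first-hand, dig will show both
behaviors side by side; the name below is just a placeholder for any record
set too large to fit in a 512-byte UDP reply:)

    dig +noedns +ignore big-rrset.example.net ANY   # keep the truncated UDP reply: TC bit set, partial data
    dig +noedns big-rrset.example.net ANY           # default: dig sees TC and retries the same query over TCP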




- Kevin
Mark Andrews
2011-07-15 00:57:47 UTC
Permalink
You seem to have a really big chip on your shoulder about people who run
broken DNS servers. I don't like them any more than you do. But I
learned "Be generous in what you accept and conservative in what you
generate" way back when I started playing with the Internet well over
two decades ago. It holds up now as well as it did back then, and
there's no good reason why it shouldn't apply in this case.
Perhaps I do, but it is with good justification. There is so much
garbage out there that it is hard to get answers back within the
2-4 seconds a client waits for a response.
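
For what it's worth the stub-side knobs exist, but stretching them just
papers over the garbage. A sketch of the relevant glibc resolv.conf
options (the values are only illustrative, and single-request needs a
reasonably recent glibc):

    # /etc/resolv.conf
    options timeout:3 attempts:2   # per-nameserver UDP timeout and retry count
    options single-request         # send the A and AAAA queries one after the
                                   # other instead of in parallel, for gear that
                                   # chokes on back-to-back queries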

There are broken servers out there.
There are misconfigured servers out there.
There are broken/misconfigured firewalls out there.
There are broken NAT boxes out there.
There are broken DNS proxies out there.
There are administrators out there who don't care.

What should be a clean, straightforward request/response protocol
no longer is.

There are lots of workarounds built into recursive servers. It has got
to the point that it's getting hard to add new workarounds without
breaking old workarounds or breaking good answer processing.

Mark
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: marka at isc.org