#487 DNS resolution of CNAMEs fails often (not always)
Reporter
Ge0rg
Owner
Zash
Created
Updated
Stars
★★ (2)
Tags
Type-Defect
Status-Fixed
Priority-Medium
Milestone-0.10
Ge0rG
on
Using prosody 0.9.8 on a Debian stable system, S2S connections involving CNAMEs often fail with "DNS resolution failed". Eventually, a link is constructed, though.
The problem happens with S2S connections involving CNAMEs, e.g. when joining conversations@conference.siacs.eu:
Apr 14 12:30:39 mod_s2s debug First attempt to connect to conference.siacs.eu, starting with SRV lookup...
Apr 14 12:30:39 adns debug Records for _xmpp-server._tcp.conference.siacs.eu. not in cache, sending query (thread: 0x2524940)...
Apr 14 12:30:39 adns debug Reply for _xmpp-server._tcp.conference.siacs.eu. (thread: 0x2524940)
Apr 14 12:30:39 mod_s2s debug conference.siacs.eu has no SRV records, falling back to A/AAAA
Apr 14 12:30:39 adns debug Records for conference.siacs.eu not in cache, sending query (thread: 0x4bb08c0)...
Apr 14 12:30:39 adns debug Records for conference.siacs.eu not in cache, sending query (thread: 0x83b4ed0)...
Apr 14 12:30:39 adns debug Reply for conference.siacs.eu (thread: 0x4bb08c0)
Apr 14 12:30:39 mod_s2s debug Looking up jabber.siacs.eu. (DNS depth is 3)
Apr 14 12:30:39 adns debug Reply for conference.siacs.eu (thread: 0x83b4ed0)
Apr 14 12:30:39 mod_s2s debug DNS lookup failed to get a response for conference.siacs.eu
Apr 14 12:30:39 s2sout9b972a0 info Out of connection options, can't connect to conference.siacs.eu
Apr 14 12:30:39 mod_s2s debug No other records to try for conference.siacs.eu - destroying
Apr 14 12:30:39 s2sout9b972a0 debug Destroying outgoing session yax.im->conference.siacs.eu: DNS resolution failed
And a later attempt succeeds:
Apr 14 12:34:04 mod_s2s debug First attempt to connect to conference.siacs.eu, starting with SRV lookup...
Apr 14 12:34:04 adns debug Records for _xmpp-server._tcp.conference.siacs.eu. not in cache, sending query (thread: 0x8764610)...
Apr 14 12:34:04 adns debug Reply for _xmpp-server._tcp.conference.siacs.eu. (thread: 0x8764610)
Apr 14 12:34:04 mod_s2s debug conference.siacs.eu has no SRV records, falling back to A/AAAA
Apr 14 12:34:04 adns debug Records for conference.siacs.eu not in cache, sending query (thread: 0x87677d0)...
Apr 14 12:34:04 adns debug Records for conference.siacs.eu not in cache, sending query (thread: 0x39663e0)...
Apr 14 12:34:04 adns debug Reply for conference.siacs.eu (thread: 0x87677d0)
Apr 14 12:34:04 mod_s2s debug Looking up jabber.siacs.eu. (DNS depth is 3)
Apr 14 12:34:04 mod_s2s debug DNS reply for conference.siacs.eu gives us 137.226.142.20
Apr 14 12:34:04 adns debug Reply for conference.siacs.eu (thread: 0x39663e0)
Apr 14 12:34:04 s2sout67711a0 info Beginning new connection attempt to conference.siacs.eu ([137.226.142.20]:5269)
For the record, the linked commit is a part of the solution but there are still some rare issues that need more investigation.
nicoo
on
I hit one of those edge-cases on Prosody 0.9.8, trying to connect to conference.yunohost.org.
The records in the yunohost.org changes, and conference.yunohost.org now has SRV records, but it should be possible to ask its operators for the earlier version of the zonefile.
Ge0rG
on
The problem is still happening on 0.10 (last time around the end of October 2016). Unfortunately it's very hard to time the exact moment and obtain pcaps from it.
crypto orchatno
on
is this complicated by Round Robin or similar strategy?
crypto orchatno
on
clicking star decreased from 3⭐ to 2
MattJ
on
We may consider backporting the changes from 0.10 (see issue #1001) to 0.9.
Kamigishi Rei
on
0.10.0, FreeBSD amd64, getting DNS resolution failures with CNAME records. The DNS software in use is unbound (caching, recursive, and validating resolver).
Here's the query dump:
reading from file dns.pcap, link-type NULL (BSD loopback)
10.0.0.1.53 > 10.0.0.7.38784: [bad udp cksum 0x14cc -> 0xf434!] 29380 q: SRV? _xmpp-server._tcp.myrl.net. 1/4/0 _xmpp-server._tcp.myrl.net. [10m34s] SRV xmpp.kraut.it.:5269 10 0 ns: myrl.net. [21h29m13s] NS ns2.1984hosting.com., myrl.net. [21h29m13s] NS ns2.1984.is., myrl.net. [21h29m13s] NS ns1.1984.is., myrl.net. [21h29m13s] NS ns0.1984.is. (171)
10.0.0.1.53 > 10.0.0.7.30954: [bad udp cksum 0x14cc -> 0xfcc0!] 48112 q: SRV? _xmpp-server._tcp.myrl.net. 1/4/0 _xmpp-server._tcp.myrl.net. [10m17s] SRV xmpp.kraut.it.:5269 10 0 ns: myrl.net. [21h28m56s] NS ns2.1984hosting.com., myrl.net. [21h28m56s] NS ns2.1984.is., myrl.net. [21h28m56s] NS ns1.1984.is., myrl.net. [21h28m56s] NS ns0.1984.is. (171)
10.0.0.7.25880 > 10.0.0.1.53: [bad udp cksum 0x1440 -> 0x753a!] 33219+ A? xmpp.kraut.it. (31)
10.0.0.1.53 > 10.0.0.7.25880: [bad udp cksum 0x14c8 -> 0x4e90!] 33219 q: A? xmpp.kraut.it. 2/4/0 xmpp.kraut.it. [21h28m56s] CNAME nsdap02.vss.kraut.it., nsdap02.vss.kraut.it. [3m4s] A 212.47.240.51 ns: kraut.it. [21h28m56s] NS ns0.1984.is., kraut.it. [21h28m56s] NS ns2.1984.is., kraut.it. [21h28m56s] NS ns2.1984hosting.com., kraut.it. [21h28m56s] NS ns1.1984.is. (167)
10.0.0.7.25880 > 10.0.0.1.53: [bad udp cksum 0x1440 -> 0xa865!] 13208+ AAAA? xmpp.kraut.it. (31)
10.0.0.1.53 > 10.0.0.7.25880: [bad udp cksum 0x1494 -> 0x7e7d!] 13208 q: AAAA? xmpp.kraut.it. 1/1/0 xmpp.kraut.it. [21h28m56s] CNAME nsdap02.vss.kraut.it. ns: kraut.it. [55m17s] SOA ns0.1984.is. hostmaster.1984.is. 2015092340 86400 7200 1209600 86400 (115)
The resolver returns the correct A record in the additional data section.
Oct 02 11:16:16 mod_s2s debug Best record found, will connect to xmpp.kraut.it.:5269
Oct 02 11:16:16 adns debug Records for xmpp.kraut.it. not in cache, sending query (thread: 0x004c5258)...
Oct 02 11:16:16 adns debug Sending DNS query to 10.0.0.1
Oct 02 11:16:16 adns debug Reply for xmpp.kraut.it. (thread: 0x004c5258)
Oct 02 11:16:16 mod_s2s debug DNS lookup failed to get a response for xmpp.kraut.it.
It seems like adns fails a lot. This also happens with simple IN A queries without CNAME results.
MattJ
on
@Kamigishi Rei - thanks for your additional data here, sorry it took a while to get back to you. Are you (or anyone) able to share a raw pcap file of the issue?
To debug we need the pcap and corresponding debug logs for the query from Prosody.
Changes
tags Milestone-0.11
Kamigishi Rei
on
I still have the pcap file (it only has the DNS traffic since it was taken with a port 53 filter) and the prosody debug log from back then. Would it be all right to edit everything except DNS-related stuff from the log before providing it?
The server I had issues with seems to no longer be using SRV records so right now I cannot test it properly unless I find someone else with a similar set-up…
MattJ
on
@Kamigishi Rei that would be fine, thanks. It's really just the DNS requests + responses I'm interested in. Feel free to email the pcap to developers@prosody.im (and reference this issue for context).
Kamigishi Rei
on
All right, just sent the pcap + log files. Note that the pcap was generated on FreeBSD (Microsoft Message Analyser, for one, seems to have issues reading it)
MattJ
on
Thanks, received - the file opens fine with Wireshark.
It looks like the domain in question actually has CNAME records listed as targets in the SRV records for the XMPP domain. Note that this is technically forbidden by the specifications, and so a domain configured this way cannot be expected to work reliably in all software.
However there are domains configured this way in the real world, and we did add code to Prosody to handle them. Therefore this remains a bug because it looks like at least in this case that code didn't work.
I'll work on a test case to try to reproduce the issue locally and hopefully pinpoint the problem.
Zash
on
A reasonable explanation may be that net.dns only remembers answers of the same type as the query, ie a query for A records will only add A answers to the cache. So if the answer consists of CNAME and A, the CNAME gets forgotten and when it peeks into the cache later it sees nothing and assumes the query returned zero answers.
Using prosody 0.9.8 on a Debian stable system, S2S connections involving CNAMEs often fail with "DNS resolution failed". Eventually, a link is constructed, though. The problem happens with S2S connections involving CNAMEs, e.g. when joining conversations@conference.siacs.eu: Apr 14 12:30:39 mod_s2s debug First attempt to connect to conference.siacs.eu, starting with SRV lookup... Apr 14 12:30:39 adns debug Records for _xmpp-server._tcp.conference.siacs.eu. not in cache, sending query (thread: 0x2524940)... Apr 14 12:30:39 adns debug Reply for _xmpp-server._tcp.conference.siacs.eu. (thread: 0x2524940) Apr 14 12:30:39 mod_s2s debug conference.siacs.eu has no SRV records, falling back to A/AAAA Apr 14 12:30:39 adns debug Records for conference.siacs.eu not in cache, sending query (thread: 0x4bb08c0)... Apr 14 12:30:39 adns debug Records for conference.siacs.eu not in cache, sending query (thread: 0x83b4ed0)... Apr 14 12:30:39 adns debug Reply for conference.siacs.eu (thread: 0x4bb08c0) Apr 14 12:30:39 mod_s2s debug Looking up jabber.siacs.eu. (DNS depth is 3) Apr 14 12:30:39 adns debug Reply for conference.siacs.eu (thread: 0x83b4ed0) Apr 14 12:30:39 mod_s2s debug DNS lookup failed to get a response for conference.siacs.eu Apr 14 12:30:39 s2sout9b972a0 info Out of connection options, can't connect to conference.siacs.eu Apr 14 12:30:39 mod_s2s debug No other records to try for conference.siacs.eu - destroying Apr 14 12:30:39 s2sout9b972a0 debug Destroying outgoing session yax.im->conference.siacs.eu: DNS resolution failed And a later attempt succeeds: Apr 14 12:34:04 mod_s2s debug First attempt to connect to conference.siacs.eu, starting with SRV lookup... Apr 14 12:34:04 adns debug Records for _xmpp-server._tcp.conference.siacs.eu. not in cache, sending query (thread: 0x8764610)... Apr 14 12:34:04 adns debug Reply for _xmpp-server._tcp.conference.siacs.eu. (thread: 0x8764610) Apr 14 12:34:04 mod_s2s debug conference.siacs.eu has no SRV records, falling back to A/AAAA Apr 14 12:34:04 adns debug Records for conference.siacs.eu not in cache, sending query (thread: 0x87677d0)... Apr 14 12:34:04 adns debug Records for conference.siacs.eu not in cache, sending query (thread: 0x39663e0)... Apr 14 12:34:04 adns debug Reply for conference.siacs.eu (thread: 0x87677d0) Apr 14 12:34:04 mod_s2s debug Looking up jabber.siacs.eu. (DNS depth is 3) Apr 14 12:34:04 mod_s2s debug DNS reply for conference.siacs.eu gives us 137.226.142.20 Apr 14 12:34:04 adns debug Reply for conference.siacs.eu (thread: 0x39663e0) Apr 14 12:34:04 s2sout67711a0 info Beginning new connection attempt to conference.siacs.eu ([137.226.142.20]:5269)
it works, please merge it
it == http://q.zash.se/7d8b86e6db1d.txt
That patch was commited as http://hg.prosody.im/0.9/rev/45222bfb206f
ChangesFor the record, the linked commit is a part of the solution but there are still some rare issues that need more investigation.
I hit one of those edge-cases on Prosody 0.9.8, trying to connect to conference.yunohost.org. The records in the yunohost.org changes, and conference.yunohost.org now has SRV records, but it should be possible to ask its operators for the earlier version of the zonefile.
The problem is still happening on 0.10 (last time around the end of October 2016). Unfortunately it's very hard to time the exact moment and obtain pcaps from it.
is this complicated by Round Robin or similar strategy?
clicking star decreased from 3⭐ to 2
We may consider backporting the changes from 0.10 (see issue #1001) to 0.9.
0.10.0, FreeBSD amd64, getting DNS resolution failures with CNAME records. The DNS software in use is unbound (caching, recursive, and validating resolver). Here's the query dump: reading from file dns.pcap, link-type NULL (BSD loopback) 10.0.0.1.53 > 10.0.0.7.38784: [bad udp cksum 0x14cc -> 0xf434!] 29380 q: SRV? _xmpp-server._tcp.myrl.net. 1/4/0 _xmpp-server._tcp.myrl.net. [10m34s] SRV xmpp.kraut.it.:5269 10 0 ns: myrl.net. [21h29m13s] NS ns2.1984hosting.com., myrl.net. [21h29m13s] NS ns2.1984.is., myrl.net. [21h29m13s] NS ns1.1984.is., myrl.net. [21h29m13s] NS ns0.1984.is. (171) 10.0.0.1.53 > 10.0.0.7.30954: [bad udp cksum 0x14cc -> 0xfcc0!] 48112 q: SRV? _xmpp-server._tcp.myrl.net. 1/4/0 _xmpp-server._tcp.myrl.net. [10m17s] SRV xmpp.kraut.it.:5269 10 0 ns: myrl.net. [21h28m56s] NS ns2.1984hosting.com., myrl.net. [21h28m56s] NS ns2.1984.is., myrl.net. [21h28m56s] NS ns1.1984.is., myrl.net. [21h28m56s] NS ns0.1984.is. (171) 10.0.0.7.25880 > 10.0.0.1.53: [bad udp cksum 0x1440 -> 0x753a!] 33219+ A? xmpp.kraut.it. (31) 10.0.0.1.53 > 10.0.0.7.25880: [bad udp cksum 0x14c8 -> 0x4e90!] 33219 q: A? xmpp.kraut.it. 2/4/0 xmpp.kraut.it. [21h28m56s] CNAME nsdap02.vss.kraut.it., nsdap02.vss.kraut.it. [3m4s] A 212.47.240.51 ns: kraut.it. [21h28m56s] NS ns0.1984.is., kraut.it. [21h28m56s] NS ns2.1984.is., kraut.it. [21h28m56s] NS ns2.1984hosting.com., kraut.it. [21h28m56s] NS ns1.1984.is. (167) 10.0.0.7.25880 > 10.0.0.1.53: [bad udp cksum 0x1440 -> 0xa865!] 13208+ AAAA? xmpp.kraut.it. (31) 10.0.0.1.53 > 10.0.0.7.25880: [bad udp cksum 0x1494 -> 0x7e7d!] 13208 q: AAAA? xmpp.kraut.it. 1/1/0 xmpp.kraut.it. [21h28m56s] CNAME nsdap02.vss.kraut.it. ns: kraut.it. [55m17s] SOA ns0.1984.is. hostmaster.1984.is. 2015092340 86400 7200 1209600 86400 (115) The resolver returns the correct A record in the additional data section. Oct 02 11:16:16 mod_s2s debug Best record found, will connect to xmpp.kraut.it.:5269 Oct 02 11:16:16 adns debug Records for xmpp.kraut.it. not in cache, sending query (thread: 0x004c5258)... Oct 02 11:16:16 adns debug Sending DNS query to 10.0.0.1 Oct 02 11:16:16 adns debug Reply for xmpp.kraut.it. (thread: 0x004c5258) Oct 02 11:16:16 mod_s2s debug DNS lookup failed to get a response for xmpp.kraut.it. It seems like adns fails a lot. This also happens with simple IN A queries without CNAME results.
@Kamigishi Rei - thanks for your additional data here, sorry it took a while to get back to you. Are you (or anyone) able to share a raw pcap file of the issue? To debug we need the pcap and corresponding debug logs for the query from Prosody.
ChangesI still have the pcap file (it only has the DNS traffic since it was taken with a port 53 filter) and the prosody debug log from back then. Would it be all right to edit everything except DNS-related stuff from the log before providing it? The server I had issues with seems to no longer be using SRV records so right now I cannot test it properly unless I find someone else with a similar set-up…
@Kamigishi Rei that would be fine, thanks. It's really just the DNS requests + responses I'm interested in. Feel free to email the pcap to developers@prosody.im (and reference this issue for context).
All right, just sent the pcap + log files. Note that the pcap was generated on FreeBSD (Microsoft Message Analyser, for one, seems to have issues reading it)
Thanks, received - the file opens fine with Wireshark. It looks like the domain in question actually has CNAME records listed as targets in the SRV records for the XMPP domain. Note that this is technically forbidden by the specifications, and so a domain configured this way cannot be expected to work reliably in all software. However there are domains configured this way in the real world, and we did add code to Prosody to handle them. Therefore this remains a bug because it looks like at least in this case that code didn't work. I'll work on a test case to try to reproduce the issue locally and hopefully pinpoint the problem.
A reasonable explanation may be that net.dns only remembers answers of the same type as the query, ie a query for A records will only add A answers to the cache. So if the answer consists of CNAME and A, the CNAME gets forgotten and when it peeks into the cache later it sees nothing and assumes the query returned zero answers.
Attempted fix in https://hg.prosody.im/0.10/rev/102242a7ee60 Please test
ChangesHaven't heard anyone having these kinds of DNS issues recently, so I'll assume this go fixed and the issue can be closed.
ChangesMattJZash