Statistics can be affected by how you collect data
Pavel Khramtsov, head of the DNS project at the MSK-IX traffic exchange platform and research project manager at the InData Foundation for Development of Networking Technologies, presented a report, Statistics of Open Resolvers: Comparison of Outside and Inside View, at the TLDCON 2023 conference.
He posed a classic statistical problem: determining which resolvers end users rely on. His example showed that the results of measurements depend heavily on whether the actual conditions of data collection match those described in the methodology.
A DNS resolver is a service that looks up the address for a requested name in the DNS, a distributed information system.
To analyze the use of DNS resolvers, APNIC has developed a measurement system that relies on advertising platforms. It is based on two protocols, HTTP and DNS. A script is placed on an advertising platform such as Google and loaded from APNIC’s HTTP server. The script can determine the end user’s IP address. The script then accesses an APNIC authoritative name server via the user’s DNS resolver, so this authoritative name server can determine the IP address of the resolver. Both sets of data are entered into a single database, where they are matched and analyzed.
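The matching step described above can be sketched in a few lines of Python. This is only an illustration, not APNIC's actual implementation: it assumes each ad impression carries a unique measurement ID that appears both in the HTTP log and as the first label of the DNS query name. The log layout, the zone `measure.example.`, the function names and the IP addresses (taken from documentation ranges) are all hypothetical.

```python
from collections import Counter

# Hypothetical HTTP-side records: (measurement ID, end user's IP address).
HTTP_LOG = [
    ("m001", "198.51.100.7"),
    ("m002", "203.0.113.20"),
    ("m003", "198.51.100.99"),
]

# Hypothetical records from the authoritative name server:
# (query name, IP address of the resolver that sent the query).
DNS_LOG = [
    ("m001.measure.example.", "192.0.2.53"),
    ("m002.measure.example.", "192.0.2.53"),
    ("m003.measure.example.", "192.0.2.99"),
]


def measurement_id(qname):
    """Assumption: the first label of the query name is the measurement ID."""
    return qname.split(".", 1)[0].lower()


def match_logs(http_log, dns_log):
    """Pair each end user's IP with the resolver IP seen for the same ID."""
    resolver_by_id = {measurement_id(q): ip for q, ip in dns_log}
    matched = []
    for mid, client_ip in http_log:
        resolver_ip = resolver_by_id.get(mid)
        if resolver_ip is not None:  # skip measurements with no DNS record
            matched.append((client_ip, resolver_ip))
    return matched


def resolver_shares(pairs):
    """Share of matched measurements handled by each resolver IP."""
    counts = Counter(resolver_ip for _, resolver_ip in pairs)
    total = sum(counts.values())
    return {ip: n / total for ip, n in counts.items()}


pairs = match_logs(HTTP_LOG, DNS_LOG)
shares = resolver_shares(pairs)
```

In this toy data set, two of the three matched measurements go through the same resolver. The point of the sketch is that the result depends entirely on which impressions make it into both logs: if ads stop being shown to some group of users, that group simply disappears from the matched set.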
“But if you look at the final picture, the statistics around the world and in Russia are markedly different. In February 2022, the amount of Russian traffic that reached the APNIC script plummeted to a fraction of what it was,” Pavel Khramtsov said.
Google’s popularity with Russian users did not change over that period. What changed was that Google suspended ads for Russian users, and that move sharply reduced the number of users the APNIC script could reach.
This raises a fair question: is the data that APNIC has collected since February 27, 2022, still relevant? If the goal is to find out which resolvers Russian users rely on, this data set is incomplete.
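One simple way to notice such a change in collection conditions is to watch the sample size itself. The sketch below is an assumption-laden illustration, not a method from the report: it compares each day's number of measurements with the median of an initial baseline window and flags days that fall below a chosen fraction of it. The function name, the seven-day baseline, the 0.5 threshold and the counts are all made up for the example.

```python
from statistics import median


def flag_low_volume(counts, baseline_days=7, threshold=0.5):
    """Return indices of days whose count is below threshold * baseline median.

    counts: daily numbers of measurements, oldest first.
    The first `baseline_days` entries form the baseline and are never flagged.
    """
    baseline = median(counts[:baseline_days])
    return [
        day
        for day, n in enumerate(counts)
        if day >= baseline_days and n < threshold * baseline
    ]


# Made-up daily counts: a stable week followed by a sharp drop.
daily_counts = [1000, 980, 1020, 1010, 990, 1005, 995, 400, 120, 110]
flagged = flag_low_volume(daily_counts)
```

Here the baseline median is 1000, so any later day with fewer than 500 measurements is flagged. A flag does not say the data is wrong, only that the sample may no longer represent the same population and that the methodology should be rechecked before drawing conclusions.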
“When we analyze any data, it is always necessary to ask ourselves whether the methods we use are applicable; whether the sample is representative; and whether we have enough sources of measurements to get reliable data,” Pavel Khramtsov summed up.