Measuring the Internet for fun and profit

Since March 2010, APNIC Research has been engaged in a web-based data collection, as part of an ongoing, wide-ranging measurement of the Internet, to try to understand how IPv6 deployment is taking shape.

This article discusses how we’ve been doing this collection, and explores some of the ideas this measurement is provoking as we look further into the data. If you’re interested in helping with the data collection, you’re welcome to join in.

We like to measure.

“Metrology” is a wonderful word, covering the science of measurement. Books like “Longitude” by Dava Sobel or “The Measure of All Things” by Ken Alder go to the heart of our fascination with basic measurements and how we’ve come to understand the ‘shape’ of the world. At the core, people appear to be tuned to want to understand the world, and to measure it. Even more exciting is to join in the measurement and contribute data to the sum of knowledge.

Some people measure for its own sake. A householder collecting rain gauge data in their garden might write down daily rainfall for years, and not be able to successfully predict next week’s rain, or even detect a pattern different from the national forecasts.

The pleasure is in seeing for yourself

Other people measure to a higher purpose: a community of people worldwide shares ‘first light’ data for new and full moons, so that the Islamic community can more perfectly understand its calendar (which is strongly driven by lunar cycles), and the implications for significant festivals and religious contemplation.

Here at APNIC Research we’ve been measuring aspects of the Internet as we see it for more prosaic reasons:

Can we understand the dynamics of Internet number resource deployment and the (not unassociated) global routing, to help inform the community for policy and operations?

What does it mean to ‘measure’ the Internet?

The Internet stands in need of its own kind of ‘metrology’. How do we measure the Internet? What does it mean when we say we’ve been ‘counting’ website hits, or how ‘big’ the Internet is (or was) at any point in time? Inevitably, it depends on who you ask, and what aspect of the network you are talking about.

Of particular interest at the moment, and for quite some time to come, is the measurement of how widely Internet Protocol Version 6 (IPv6) is deployed and how many Internet users are able to successfully use it.

Had deployment of IPv6 taken place as we imagined it would have by now, then most (if not all) of the global Internet would be using IPv6, and our discussion would be about how to measure the declining rump of IPv4, and when we could decide to “turn it off”. Alas, nothing has proved to be as simple as we had hoped, and the rate of uptake of IPv6 has been slow enough to raise serious concerns about whether the transition will successfully run to completion anytime soon.

This is not a situation we want to be in.

The more we can understand the dynamics of IPv6 deployment, through basic observations about the nature of that deployment, the better we can inform the policy and strategic planning processes which will direct the future Internet.

Some basic measurements about IPv6 are very easy to undertake. We know exactly how many IPv6 addresses have been handed out worldwide via the daily reports generated by the Regional Internet Registries. We know how much of the IPv6 address space handed out is being actively routed, because this too is visible in the Internet’s BGP routing data.

Unfortunately, one of the key problems in the deployment of IPv6 is a worldwide ‘air gap’ between what the ISPs have successfully deployed in routing and what they are able to deliver to the customer network.

In some economies, ISPs have been tasked to report back to central authorities on this kind of data, and so we have specific data for a given point in time which shows an overall low level of penetration. This level of reporting is not all that common, and for much of the Internet we are left to “infer” the amount of IPv6 delivery to customer networks from the amount of IPv6 activity we can see in other measures, such as the number of domain names that have IPv6 addresses, or the number of Autonomous Systems that announce IPv6 address prefixes into the global routing system. This inference is inherently weak, as it does not relate to levels of use of IPv6 by end clients.

What APNIC Research felt it needed was a method of data collection which reflected a broad cross-section of the Internet, and was indicative of end-to-end use of IPv6 by end clients.

How can we collect data which is wide-ranging, but publicly available?

A problem is the inherent bias in measurements taken on ‘things you own’, which can fall into the trap of believing you are yourself typical of the wider case. In some situations, such as basic biology, it’s true that one organism is much the same as any other organism of the same kind. However, for many human-centered measurements, it’s unsafe to assume self-experimentation avoids the trap of ‘observer bias’. For example, in the case of IPv6 deployment, the Regional Internet Registries are strong advocates of IPv6, and we have observed that the measurements of the rate of IPv6 usage at APNIC web services are skewed by the community which makes use of APNIC’s services.

APNIC Research was therefore interested in a data collection method which measured a broader cross-section of the Internet’s user population.

Is there some source of data which is large enough that we can get close to ‘random’ samples of users in the global Internet at large?

Air gap? Is that like a missile gap?

I refer to the problem as an ‘air gap’ because we have good reason to believe most of the more recent devices connecting to the Internet, in both home and work deployments, are in fact coming pre-configured to use IPv6. Microsoft operating systems since Vista have certainly been shipped with IPv6 enabled, as has Apple’s Mac OS X since Snow Leopard, and the many Android and iOS phones are also demonstrably capable of using IPv6. The problem we appear to see today is not a lack of IPv6 capability in the computers and phones we use; the problem is the ‘last mile’ delivery of Internet to those devices: the customer-facing network inside the ISP, and the home/office router which connects to it (often called the Customer Premises Equipment, or CPE).

This leads us to the natural question:

Can we measure the extent of IPv6 capability across the Internet at large, including the IPv6 capability of access networks and CPE devices?

An approach to answering this question was initially provided by embedding fetches of small (typically 1-pixel) images into the web markup, to log which images were, and were not, successfully retrieved. The critical element was to map the http://name/ behind each image to a different kind of name-to-number record: one name was IPv4-only, one name was both IPv4 and IPv6, and one name was IPv6-only. By comparing which named images were successfully fetched, and by matching the set of retrieval operations to each individual end user, a basic measure of the capability of the user to use IPv6 can be collated.
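In outline, the technique looks something like the following sketch. The hostnames are hypothetical placeholders for the test domains, not the names actually used in the experiment:

```javascript
// Minimal sketch of the three-image test. Each hypothetical hostname
// resolves differently: A record only, A + AAAA, or AAAA only.
var tests = {
  v4only: "http://v4only.example.net/1x1.png",
  dual:   "http://dual.example.net/1x1.png",
  v6only: "http://v6only.example.net/1x1.png"
};

function runTest(name) {
  var img = new Image();                  // fetched, but never displayed
  img.onload  = function () { report(name, "fetched"); };
  img.onerror = function () { report(name, "failed");  };
  img.src = tests[name];
}

function report(name, result) {
  // The real experiment returns results to the server; here we just log.
  console.log(name + ": " + result);
}

for (var t in tests) { runTest(t); }
```

A client which completes the IPv6-only fetch is IPv6-capable; which protocol it uses for the dual-stack fetch reveals its preference.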

It turns out that, in attempting to measure more micro-detailed questions about the capability of users visiting websites, a technique was born which is capable of collecting large amounts of data about users ‘at large’. This is a form of active measurement, using two different (but related) techniques:

  • Scripted fetches of unique, targeted web content, behind controlled IPv4, IPv6 and dual-stacked sources, embedded as JavaScript in participating websites; and
  • the same scripted fetches, recast in ActionScript and compiled into Flash, delivered inside web advertisements (described later in this article).

Scripted? Who wrote this script!

HTML was born in the early 1990s. Its roots lie far deeper, in the concepts of generalized markup languages from the 1970s and 1980s, which themselves owe much to the computer typesetting initiatives of the 1960s.

Like almost anything written for scientists, by computer scientists, it didn’t take long for people to suggest that converting markup into a program, and adding a programming language to the markup stream to be run on the client, would be generally useful. This is the kind of idea which led to the emergence of JavaScript. JavaScript is a complete interpreted language, which can not only change what is displayed in the local browser, but can also request objects from the network and choose to display, or not display, the results.

This ability to pass a script, written in JavaScript, to a user, and thereby direct their browser to fetch additional web objects but not display them, provided a mechanism to conduct much more complex measurements on the user:

  • Can the user even see the IPv6-backed URL? (some operating systems and browsers hide any record which includes IPv6)
  • If the user doesn’t have direct access to IPv6, do they have access to some kind of indirect ‘tunnel’ to use IPv6 via another path?
  • What’s the relative performance of IPv6 compared to IPv4, from the user’s perspective?
  • What influence does IPv6 have on the DNS, distinct from the transport used to get the web element?

Additional aspects of JavaScript provide for basic controls on the experiment: random number generation (to identify each tested user uniquely), timeouts, and an ability to track the execution time from start to stop on the user side (distinct from comparing log lines on the server).
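A rough sketch of those controls, under the same hypothetical naming assumptions as the previous example: a random per-user identifier, a bounded timeout, and client-side timing via getTime():

```javascript
// Sketch of the experiment controls (hostnames and timeout are hypothetical).
var experimentId = Math.floor(Math.random() * 0xFFFFFFFF).toString(16);
var TIMEOUT_MS = 10000;

function timedFetch(testDomain, done) {
  var start = new Date().getTime();       // milliseconds since 1970
  var finished = false;

  function finish(status) {
    if (finished) return;                 // report each test only once
    finished = true;
    done(status, new Date().getTime() - start);
  }

  var img = new Image();
  img.onload  = function () { finish("ok"); };
  img.onerror = function () { finish("error"); };
  setTimeout(function () { finish("timeout"); }, TIMEOUT_MS);

  // The random id in the leftmost label makes each user's fetch unique.
  img.src = "http://u" + experimentId + "." + testDomain + "/1x1.png";
}

timedFetch("v6only.example.net", function (status, elapsed) {
  // A final 'results' fetch carries the outcome back in its URL, so the
  // server log captures the client-side view of status and timing.
  (new Image()).src = "http://results.example.net/1x1.png?test=v6&status=" +
                      status + "&ms=" + elapsed;
});
```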

How to make gravy (adding value for the website owner)

Google has deployed a widely used mechanism for website tracking called ‘Google Analytics’. This is a JavaScript library, and an associated family of tracking options, which can be used by a website owner to understand the traffic coming to their website. It can categorize users, measure their access, and follow inter-page transitions, with a rich reporting framework on the web.

APNIC Research took a simple scripted mechanism, being explored by a small research community, for fetching specific IPv4, IPv6 and dual-stack images, combined it with Google Analytics, and constructed a mechanism which provided information for website owners on the capability of their web users to use IPv6, while additionally providing APNIC with a ‘feed’ of the measurements, to be combined into aggregated data.

This was designed to motivate website owners to include the JavaScript in their existing Google Analytics feed, providing APNIC with a data source at the same time as they measure for themselves.

To date, this mechanism has been explored by around 50 websites worldwide, and is contributing approximately 150,000 website measurements per day.

What drives a basic JavaScript measurement like this?

The JavaScript experiment depends on a few very simple elements:

  1. A sequence of configuration directives, included with the initial fetch of the .js code, to specify the experimental parameters. This allows us to stipulate the order of tests, the kind of tests, the timeout, randomization of the order (or a fixed order), and to give each website owner a unique identity, so their own logs can be distinguished from all others.

    This provided for some fine-tuning, per website.

  2. A mechanism for fetching arbitrary URLs from the web. JavaScript provides this, with the constraint that cross-site scripting has to be permitted and controlled. APNIC therefore deployed a suitable ‘crossdomain.xml’ control file on the webservers to enable this (as the called website, APNIC was the one having to give permission to be referenced by each website using the JavaScript).
  3. Wildcard DNS records provide a mechanism for taking unknown names to a DNS server, and returning specific known results. This permits the use of per-experiment unique names, which cannot have been seen in the network in advance, so cannot be subject to DNS or web caches, and so cannot mislead the experiment by being pre-fetched, or re-fetched subsequently (a sketch of this naming trick follows this list).

    By logging the DNS queries, and by careful construction of the specific DNS name being used, we are able to track each experiment’s elements: as the DNS query is made (proving the client has begun to run the JavaScript), as the web query is logged (proving the client tried to fetch the URL), and in the timing information returned by the client in a final ‘results’ request (of no significance as a test, but, by being queried, providing information back to us in the logfile in the form of data embedded in the requested name).

  4. Basic timing information. JavaScript provides the getTime() call, which returns a count in milliseconds since 1970, and therefore provides millisecond accuracy of events, within the limits of the JavaScript interpreter. For various reasons to do with the way JavaScript runs inside a constrained environment, and the limits on parallelism inside the browser, this timer has to be viewed with some skepticism; but, within limits, it provides the basis of a start..end time interval from the browser’s point of view, to compare to website logs.

    In the process of deploying the system, we noticed that website logging is usually limited to 1-second granularity in the Apache Web Server, despite millisecond-accurate timing being available on the server. A small modification was made to the Apache code to provide more fine-grained time; while this is not generally advisable, it appears to offer comparable granularity to the browser time, and helps provide a view of the microtiming of events in the web.

  5. A mechanism for deriving randomness. This doesn’t have to be cryptographically strong; it just has to minimize the chance of collision between individual experiments, assuming two are conducted at almost exactly the same time.

    JavaScript includes access to some of the basic operating-system functionality common to POSIX systems (which includes all modern operating systems, and is in turn exposed in all browsers which support JavaScript), and this is provided by the Math.random() function.
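As a sketch of the naming trick in point 3 (the zone and labels here are hypothetical, not the experiment’s real domains): a wildcard DNS record answers for any left-most label, so a freshly minted name can never already be sitting in a DNS or web cache, and the name itself can carry data home through the server logs:

```javascript
// Hypothetical sketch of per-experiment unique names (point 3 above).
// A wildcard DNS record, e.g.  *.v6.example.net.  IN AAAA 2001:db8::1
// answers for any left-most label, so these names are guaranteed cache-cold.
function uniqueTestName(experimentId, test) {
  var now = new Date().getTime();
  // e.g. "u1a2b3c4d.t1334567890123.v6.example.net"
  return "u" + experimentId + ".t" + now + "." + test + ".example.net";
}

function resultsName(experimentId, results) {
  // Results are smuggled home inside the queried name itself, so both the
  // DNS log and the web log record them, even if nothing else comes back.
  return "r" + experimentId + "." + results.join("-") + ".results.example.net";
}
```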

So this was perfect, right?

Life is never simple.

A university department exploring IPv6 connectivity on campus (a heavily controlled environment compared to a commercial ISP, and very probably able to deploy IPv6 natively across campus, wired and wireless) included the IPv6 measurements on their website, as did a telco in a central European economy. While we really appreciated this support, and the data it generated, we also had to recognize that not all user communities are alike, and there were some fascinating insights into the specific situations of each experimental measurement source. What we were able to learn from this data is that IPv6 capability is highly variable across Internet communities.

We felt we needed something more widely ranging, and less prone to this kind of community bias in our measurements.

Advertising for fun, profit and measurement.

A useful observation about embeddable interpreted languages in web objects is that you can take your pick, because there are so many. JavaScript is closely related to ‘ActionScript’ from Macromedia Inc (now owned by Adobe); both are dialects of the standardized ECMAScript.

ActionScript lies at the heart of “Flash”. Flash is the ubiquitous platform for dynamic elements on web pages, and for such beloved Internet memes as “punch the monkey” and other web graphics which appeared as banners across many websites.

These web banners do not appear simply to entertain: the banner is an advertisement, and advertising networks have grown from a small margin of the web to the main funding vehicle behind many of today’s Internet content providers, including Google, Yahoo! and Facebook. Placement of advertising is fundamental to the commercial Internet as we experience it.

Internet banner advertisements can, of course, be simple non-animated images, or even just text. But the ability to embed Flash into an advert, by embedding the image inside the Flash, has driven a marketplace of advertisement placement which is ready to accept Flash from the advertiser as a basic attribute.

Given that we had invested time and energy writing JavaScript that performed IPv6 measurement, we very quickly experimented with converting the JavaScript to ActionScript, and then compiling it into Flash. This proved successful, and invited the question:

Can we convince an advertising network to let us lodge image-in-flash advertisements with the IPv6 test code?

Yes we can.

Google runs one of the world’s largest web advertising clearing houses, and allows Flash to be used in advertisements to fetch additional ‘assets’ (as they are called) after placement, to enhance the advertisement. Perhaps the movie sequence is too big for the initial 50-kbyte load, or the Flash requires more current data, such as news headlines. Therefore the built-in ability of Flash to load assets over the net remains enabled under Google’s Flash acceptance rules. This mechanism is functionally identical to the URL-fetching mechanism of JavaScript, and therefore permits exactly the same method of multiple IPv4, IPv6 and dual-stack image fetches. And, as with JavaScript, there is no necessary display of the fetched data, and basic timing and parallelism are provided inside the Flash engine in the browser.

No random numbers allowed.

The main problem with Flash advertising is that the advertisement owner wants to know their advert is being placed, and the advertising agency wants to control placement of the advertisement, and prevent websites which are being paid from spuriously claiming the advert has been seen.

From these basic motivations flow a number of restrictions in the code libraries supported by Google for Flash code, including a prohibition on calls to random number generators.

Since the APNIC JavaScript used the Math.random() library call, we had to modify our code to use a simple version of the CRC32 checksum, run over the value passed into the flashcode in the clickTAG argument. Because of the mechanisms Google itself depends on to track advertising placement, this argument inherently includes a large amount of unique information, and so a simple CRC over the data produces a highly unique, per-user 32-bit number that acts as an acceptable source of randomness for the purposes of generating unique domain names to individually label each experiment.
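A sketch of that substitution, shown in JavaScript for consistency with the earlier examples (the deployed code was ActionScript, and the clickTAG value below is invented):

```javascript
// Standard bitwise CRC-32 (IEEE polynomial); no random number calls needed.
function crc32(str) {
  var crc = 0xFFFFFFFF;
  for (var i = 0; i < str.length; i++) {
    crc ^= str.charCodeAt(i) & 0xFF;
    for (var j = 0; j < 8; j++) {
      crc = (crc >>> 1) ^ (0xEDB88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;       // unsigned 32-bit result
}

// The clickTAG passed to an advert carries placement-tracking data that is
// effectively unique per impression, so its CRC serves as an experiment id.
var clickTAG = "http://ad.example.com/click?placement=12345&impression=67890";
var experimentId = crc32(clickTAG).toString(16);
```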

With this one simple conversion, an image advert was accepted by Google; it has been used in various forms since late May 2011, and has been viewed over 10,000,000 times.

Who sees the advertisement?

The prime purpose of the Google advertising network is to connect users with advertisers, and for most people paying for advertising, the primary driver is to harvest clicks: make users want to click on the advert, and convert that click into a sale, a promotion, or an event on your own website.

APNIC doesn’t want your clicks

APNIC Research, on the other hand, has next to no interest in click harvesting. We wanted to place the advert before your eyes, so that the flashcode wrapping the image would run, and, in the act of fetching the referenced objects, provide us with measurements of each client’s IPv6 capability. Therefore our advertising budget was tuned to the placement pricepoints, more than the click-through pricepoints.

By and large, this is a very ‘cheap’ form of advertising: the kind where you really just want the advert seen, and you don’t expect clicks to follow, is the cheapest form of advertising available. Typically, it is bid for as CPM, or cost-per-mille, meaning that a price is set per thousand views; the lower you bid, the less likely it is that a click eventuates.

If you bid too low, nobody wants to place your advert. Google realizes this, and so provides a ‘default’ advertising framework in the form of its own co-owned websites, such as YouTube, which act as the advertising placement of last resort. This permits Google to accept our money for placement of the advert, priced so low that no external party wants to take our CPM bid, and so we avoid placements which would, in turn, invite clicks.

As each Flash advert runs, it first fetches a control URL which detects the initial IPv4 address used, and, based on this, fine-tunes the specific set of URLs to be fetched by that user. This set directs the user to one of three sources of measurement test:

  • America, homed in the http://www.isc.org/ network on the west coast. This node takes measurements for the Americas, including LacNIC regional ISPs.
  • Europe, homed in a German racking/cloud service provider, with the assistance of the RIPE NCC. This node takes measurements for Europe, Eurasia, and Africa, including AfriNIC.
  • Australia, running from the APNIC research network. This node takes all other requests.

This localization has been designed to maximize data collection at three high-bandwidth locations, each with good IPv6 connectivity, but also to localize traffic as far as possible, to avoid excessive IPv4/IPv6 divergences from tunnels spanning continents.

YouTube is everywhere!

An additional and highly desirable property of this placement strategy is that the lowest common denominator is seen almost worldwide, and so provides us with both volume and coverage. Further tuning of the advert placement choices can target language, time of day, region and device.

Our post-measurement data analysis suggests that YouTube is being seen in a significant majority of Internet-enabled economies, worldwide.

Everybody goes to YouTube

It turns out that, in seeking a bid mechanism to de-preference the ‘click’ feed, we have embraced a placement strategy which is both worldwide and, we believe, inherently representative of the average user.

We are comfortable that YouTube is viewed worldwide because we can demonstrate presentation of the Flash advertisement to IP addresses in over 200 economies from the list of 249 codes at the ISO registry. At this time, some economies do not contribute sufficient hits to be statistically useful: for example, the number of visits from the Vatican City (ISO code VA) has never exceeded 10 in a month, and nor have those seen from the Åland Islands (ISO code AX) in the Baltic. Noting these extremely low samples in specific cases, the mechanism has successfully collected sufficient data to provide us with indicative IPv6 capability in over 150 economies.

Google distributes cheerfully

A review of the IP addresses presented by Google advertisement placement suggests that the distribution of IP addresses is being carefully managed to maximize unique views by different people. This is obviously a requirement for a neutral advertising placement strategy, and reinforces trust in the basic mechanism. By presenting ‘new’ viewers, Google can ensure the highest return on the advertiser’s investment in placement.

For our purposes, it also provides a good basis for ensuring that samples into a routed network are tests of more than one individual host or viewer. We accept that some amount of NAT deployment may collapse multiple individuals behind one public Internet address, but we believe Google’s basic advertisement placement mechanisms manage this fairly, and therefore provide a good distribution of new users (and thus addresses) to our test.

YouTube is everywhere, but Flash is not.

A major issue with Flash is that Apple iOS does not, at this time, display Flash, and so adverts placed inside Flash are not seen or executed by iPhone users.

Whilst we’d love to measure iPhone users, we are not aware of an advertising market yet emerging around HTML5, which includes functional elements analogous to ActionScript that we could use almost unchanged. Should one eventuate, we will explore its use.

In the meantime, we are able to measure iPhone IPv6 users via the prior JavaScript mechanism, and since at this time there is next to no IPv6 over cellphone radio (LTE is still in very early deployment, and LTE-enabled handsets are only just coming into use), we do not feel this has materially altered our counts. Households which have an iPhone typically have other devices also using WiFi or cabled access via DSL or cable, and these are measured as normal.

How do we handle the raw data?

Every day, APNIC Research takes the following raw data sources and secures them inside the research filestore. Raw data is not released.

  • The raw weblogs. These include the IP address of the requestor for every seen fetch of the crossdomain control file, each specific 1×1 image file, and the ‘results’ fetch which provides us with the client-side view of the delay and fetch status of each specific image. The loglines include the URL of the fetched object, which includes, as a passed argument, the DNS name being fetched. This provides us with a basic collation value to cross-match the IPv4 and IPv6 fetch requests.
  • Raw DNS logs of requests for the name-to-address map of the objects. These provide additional information on the capability of the DNS system used by the client to handle dual-stack and IPv6 DNS transport, and demonstrate that the client fetched the tests, for cross-checking against their presence in the weblogs.
  • TCP dumps of the fetches, especially noting the use of tunnelled IPv6. The TCP dumps also provide evidence of the maximum segment size (MSS) and the SYN/ACK flows during Teredo tunnel establishment.

The weblogs provide basic information and timing, but can be augmented by the TCP dumps to identify partially-attempted fetches, typically indicative of a broken Teredo or 6to4 tunnel.

The logs are combined into a single entry per IPv4 source, which records which tests succeeded, at what relative time delay from the basic IPv4 fetch, and which IPv6 address this cross-correlates to (for successful IPv6 fetches).
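A toy sketch of this collation step, to make the cross-matching concrete (the log format shown is invented for illustration; the real inputs are Apache weblogs and DNS query logs):

```javascript
// Hypothetical collation sketch: group log lines by experiment id, then emit
// one summary record per IPv4 source.
var records = {};

function addLogLine(line) {
  // e.g. "1334567890.123 192.0.2.1 GET /1x1.png?id=1a2b3c4d&test=v6 200"
  var f = line.split(" ");
  var m = /id=([0-9a-f]+)&test=(\w+)/.exec(f[3]);
  if (!m) return;
  var id = m[1];
  records[id] = records[id] || { tests: {} };
  records[id].tests[m[2]] = {
    time: parseFloat(f[0]),   // request timestamp
    addr: f[1],               // source address seen for this fetch
    ok:   f[4] === "200"      // did the fetch complete?
  };
}

function summarize(id) {
  var t = records[id].tests, v4 = t.v4, v6 = t.v6;
  return {
    v4addr:  v4 ? v4.addr : null,
    v6addr:  (v6 && v6.ok) ? v6.addr : null,                 // cross-correlated IPv6 address
    v6delay: (v4 && v6 && v6.ok) ? v6.time - v4.time : null  // delay relative to the IPv4 fetch
  };
}
```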

How do we collate the raw data into aggregated data?

From these daily files, the following subsequent data sets are collated:

  • Every day, the specific IP addresses are ‘combed’ against both BGP data, to map each address to the origin-AS and announced prefix for routing, and the per-RIR ‘delegated’ data, which maps the IP address to the economy code of registration.

This data represents the first view which we feel can be exposed more widely, since no specific end user can be identified from the prefix, and the prefix/origin-AS is a matter of public record in BGP.

Daily totals by economy code, origin-AS, and some regional and organizational groupings of economy codes are then produced. Each daily total is stored by year/month/day, and by its major key of economy code, region or AS.

From the daily totals, week-of-year and month-of-year totals are also aggregated.

For all three, we then produce data in both CSV and JSON formats, which can be plotted or used for post-processing analysis.

APNIC research is producing daily updates of the data for IPv6 capability, based on this post-processed state of the logs.

  • Per-Economy, recording the usage by economy code
  • Per-AS, recording the usage by origin-AS
  • Per-Region and Per-Organization, using collections of economy codes to determine regional and organizational membership, in line with UN regional groupings

For each case, a set of graphs and associated data is being provided, as both monthly and weekly summaries:

  1. IPv6 Preference. This shows the percentage, by month/week, of sources which prefer IPv6 when presented with a dual-stack (IPv4 and IPv6) URL. It is the lowest of the three measures collected, but the most reliable indicator of true delivery of IPv6 to end users, because the existence of non-infrastructure (ad hoc) tunnels will suppress IPv6 preference: the absence of IPv6 preference where IPv6 capability exists reveals users who depend on ‘hop over’ tunnels to gain access.

    This figure stands at 0.3% worldwide (on a weekly basis; monthly figures aggregate to slightly higher levels), and shows significant variation depending on the specific origin-AS, economy or region. Less developed, and more distant (in Internet terms), economies are lower, while economies with large and active IPv6 support, such as Sweden, the Netherlands and the USA, can be seen to be above average.

  2. IPv6 capability, coercibility and preference. This shows the percentage, by week and month, of measured end users who will use IPv6 if presented with no choice, and, in the extreme case, if forced to use ad-hoc tunnels.

    This is a noticeably higher figure in almost all cases, because of the larger number of end users who are capable of IPv6 but have it suppressed by the absence of native IPv6 in their CPE. Examples of ASes which have provided native IPv6, such as Free/Proxad, show that where a more embedded form of IPv6 (even if tunneled) is available, end users will exploit it. Another example is Internode, which has begun a native IPv6 deployment over ADSL, and has significantly better figures than the rest of Australia for IPv6 preference.

How accurate is the Economy data?

A common critique of the economy-tagged data is that it cannot possibly be accurate, because we don’t know where the ISPs actually deploy their network resources.

While this criticism is potentially valid, we have performed a basic analysis comparing the economy of registration to other sources of economy/IP data, such as MaxMind’s GeoIP data. We believe that there is better than 98% agreement between the data sources. This is hardly surprising, since GeoIP is fed, in part, by RIR delegation and whois data sources. However, we find that the specific overrides noted in MaxMind and other sources tend to refine location inside a given economy, rather than re-mapping it to another economy.

If you consider the primary consumer of IP address resources in recent times, it has been the telco. Overwhelmingly, this deployment has been to feed the emerging middle classes of China, India, Africa, and South and Central America, catching up with the historical deployments in the developed economies. The user is therefore the end user, at the end of the CPE, and not a deployment of one or two head-end caches in the continental USA or at European exchange points, out of region. Where consumption of Internet addresses has been to satisfy market demand at the edge, the usage is indeed in the economy of registration.

The second driver of consumption of addresses has been the content-provision side, with cloud and telco co-located services and content delivery. By and large, these are the sources of web content, not the consumers; they do not figure in web-use measurement on the consumption side.

Furthermore, they are almost universally associated with ISPs which have deployed IPv6 nearby, either in the same location or within short reach, and they do not lie ‘beyond’ the CPE. Therefore measurement of their IPv6 capability lies on the supply side; it is informed by other measurements, and informs other aspects of Internet development.

More data and more eyeballs on the data

APNIC Research has been delighted to collaborate with the RIPE NCC research group in this activity. The RIPE NCC has generously donated CPU and disk housed in Germany for basic test collection. This improved the local round-trip time for European test subjects by removing trans-Atlantic and trans-Pacific paths.

The RIPE NCC has also generously co-funded the Google Flash advertisement placement, doubling the data collection in 2012. Emile Aben, a research scientist at the RIPE NCC, has collaborated actively with us on the design of the JavaScript code, and in post-collection analysis and visualization. Products of this ongoing inter-RIR activity will be published on the labs.ripe.net web as well as at labs.apnic.net.

APNIC continues to run this measurement, and is actively exploring both new data sources for the JavaScript tracker, and sources of funding to extend the Flash advertisement reach. We hope the data can help inform the community on the rate of uptake of IPv6, and the dynamics of deployment at the ISP, economy and regional level. If you are interested in helping, please get in touch.

We believe this work is helping us to understand the dynamics of Internet number resources deployment and the associated global routing, to help inform the community for policy and operations.