March 20, 2015

The Dress: A Case Study of Flash Crowds and Invalid Traffic

Jeff Kline
Principal Data Scientist

On Feb. 26, 2015, a photo of this dress was posted to Tumblr. The author asked other visitors to help resolve a friendly dispute about whether the dress was white and gold or black and blue. It was just before nightfall in the U.S., and controversy about the perceived colors in the picture drew sudden and intense interest. Fifteen hours after the initial post, interest in the dress peaked in London. By then, the dress had acquired its own Wikipedia entry and the U.S. had woken up to a new viral phenomenon. This was a “flash crowd” in action.

Flash crowds are a phenomenon common to internet traffic. As the name suggests, a flash crowd consists of a large number of users accessing a relatively small number of web sites over a short period of time. The term is borrowed from science fiction writer Larry Niven, who coined it while pondering the darker consequences of free and instantaneous travel. Flash crowds can emerge either spontaneously (e.g., due to natural disasters) or around an announced event (e.g., the Apple Watch debut), but what sets them apart is the exceptionally high volume of traffic they generate in a very short period of time.

In the extreme, a flash crowd can wreak havoc on web sites that are not sufficiently provisioned for the influx of traffic. Less extreme events are much more common, yet they pose an equally serious threat to the security and integrity of reported web metrics: a malicious actor could use a flash crowd to cloak a surge of fake traffic.

We recently looked into whether this was happening and, if so, to what degree. Comscore has a unique view of internet behavior through its publisher census, ad instrumentation and worldwide user panel. What follows is a case study, based on these data, of the flash crowd event surrounding the dress.

Traffic associated with the first 24 hours of the dress event is plotted below on the left and is grouped by time zone. The phenomenon lasted about five days. Its full life cycle of global traffic is shown on the right.

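[Figure: dress-event traffic during the first 24 hours, grouped by time zone (left), and global traffic over the event's full life cycle (right)]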

This particular event was interesting in that it was a global phenomenon that spanned geographic as well as publisher boundaries. We drilled down into the level of non-human traffic (NHT) associated with the dress event and found that the event had lower levels of invalid traffic than is typical. More precisely, the fraction of dress traffic categorized as NHT was inversely related to the volume of that traffic.
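To make the "inversely related" claim concrete, here is a minimal sketch of one way to check such a relationship: compute the NHT fraction per time bucket and correlate it with the bucket's volume. The post does not say how the relationship was actually measured, so the use of a Pearson correlation, the function names and the hourly counts below are all assumptions made for illustration. The numbers are invented and are not Comscore data.

```python
from statistics import mean


def nht_fraction(total_hits, nht_hits):
    """Share of one time bucket's traffic that was flagged as non-human."""
    if total_hits <= 0:
        raise ValueError("total_hits must be positive")
    return nht_hits / total_hits


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical hourly buckets: (total hits, hits flagged as NHT).
# Invented for illustration only; not measurements from the dress event.
buckets = [(1_000, 80), (5_000, 300), (20_000, 800), (50_000, 1_500), (10_000, 500)]

volumes = [total for total, _ in buckets]
fractions = [nht_fraction(total, nht) for total, nht in buckets]

# A negative coefficient means the NHT share falls as volume rises.
r = pearson(volumes, fractions)
print(f"correlation between volume and NHT fraction: {r:.2f}")
```

With these invented buckets the coefficient is negative, which is the pattern described above. A rank-based measure such as Spearman correlation would be a reasonable alternative if the relationship is monotone but not linear.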

This is important because it means that a common feature of network traffic (a flash crowd) has now been associated with a desirable trait (low NHT). This is in stark contrast to other types of large and sudden shifts in traffic volume which are often the result of bugs, misconfigurations or maliciousness.

We next sought to broaden this statement to flash crowd events in general. As of this writing, we do not believe that NHT software currently adapts or attempts to cloak itself within the traffic of trending stories. We offer three supporting arguments.

First, the source code to a widely used botnet called Athena was leaked and immediately dissected by security experts. My colleague Kevin Springborn discussed its impact on the digital publishing world in a separate blog post. Athena is designed to do a number of things, but it does not have trend-tracking features, even though methods to identify emerging trends or current newsworthy stories are well known and some of them are simple. In fact, one currently active Twitterbot is trying to trend.

Second, adaptation to emerging trends would represent a natural evolution of domain laundering. Domain laundering occurs through the injection of faked publisher domains into URLs. It is a simple and widespread method employed to establish the “legitimacy” of a traffic stream. If similar techniques were used for trend tracking, the resulting traffic would share characteristics with laundered traffic, and if trend tracking were a significant phenomenon, we would have identified some evidence of it. Our detection systems show no such evidence.

Finally, we undertook an admittedly coarse and subjective manual review of the publishers and URLs associated with this event. The top volume URLs and a sample of those with lower volume appear to have been built by human hands and for human consumption. While it is possible that NHT activity tracks trends, our review suggests that if it exists at all, it is currently not a significant source of traffic.

So why might we not see NHT malware follow trends? The answer may be that simpler methods are already profitable. The ubiquity of domain laundering shows it to be a useful mechanism for misdirection.

Our broadest goal is to help assign value to web traffic. Identifying traffic that has low levels of invalid activity aligns well with this goal. The methods used in this analysis can be applied directly to other flash crowd events, and we have some evidence of positive results. Although malware is not currently adapting to trending stories, this is a market-driven choice that reflects the current environment. If the market changes and publishers start associating flash crowd events with value, then we can expect NHT malware to change too.
