Relative Errors in Television Audience Measurement: The Future is Now
Media measurements, whether television ratings or impressions, online campaign metrics, the reach of a given video, or the like, are typically reported as simple point estimates without an associated “margin of error” or “confidence interval”.
This tradition is appropriate when the measurements are used as currency, since in that case the estimates relate directly to payments which require a specific number, not a range. Other estimates used for planning, programming, or content acquisition strategies, however, should come with measures of precision and reliability. Users of the measurements should pay attention to such additional information to make informed decisions in the multi-billion-dollar television advertising industry. Without considering the statistical significance of measurements, many decisions in media placement are essentially equivalent to tossing a coin. The industry will only change as producers and consumers of audience measurement estimates become more aware of the importance of error estimates, and use those estimates to guide their decisions, including when selecting measurement methodologies and services. That’s the intent of this blog: to start a conversation about moving beyond point estimates in media measurement toward including considerations of the relative reliability of measurements and methodologies.
In any empirical science, there’s an assumption that there’s a true answer “out there” in the world, and the job of the researcher is to approximate that true answer as closely as possible. In the social sciences, the subject of interest is often the behavior or other characteristics of a population of humans. In almost all cases, it is impractical to measure every single member of the population, and so a sample is drawn from it. The characteristic of interest is then measured on the sample, and an inference is drawn about the corresponding characteristic of the full population. A well-developed, mature branch of statistics informs the calculation of these estimates under various circumstances, and moreover provides tools for estimating the accuracy and precision of the inferences.
The simplest case of a sample is a “probability sample”, in which every member of the population has a known, nonzero probability of being selected into the sample. A “simple random sample” is the special case in which those selection probabilities are all equal. Think of this as an ideal to aspire to, because in the real, messy world, no sample is truly and completely a probability sample. But, to the extent that you can approximate one, you get some very simple tools for understanding the relationship between your sample and the overall population. In particular, under some very broadly applicable assumptions about the nature of the population and the metric being considered, if you have a true probability sample of size N, then the expected difference between your sample-based estimate and the true answer “out there” in the population shrinks in proportion to one over the square root of N. This fact is a constant thorn in the side of social scientists, as it means that if you want to improve your accuracy by a factor of two, then you have to quadruple the size of your sample! That fact, alone, drives much of the cost of panel-based media measurement services. Errors due to sample size can be reduced, but only at a high price.
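To make the scaling concrete, here is a minimal simulation sketch. The 10% true rating and the panel sizes are hypothetical values chosen for illustration, not figures from any real measurement service:

```python
import numpy as np

rng = np.random.default_rng(42)
true_rating = 0.10     # hypothetical: 10% of households tuned to the program
n_trials = 20_000      # number of simulated "panels" per panel size

for n in (1_000, 4_000, 16_000):
    # Each trial draws n households at random and estimates the rating.
    estimates = rng.binomial(n, true_rating, size=n_trials) / n
    typical_error = estimates.std()   # approximately sqrt(p * (1 - p) / n)
    print(f"panel size {n:>6,}: typical error ~ {typical_error:.4f}")

# Each 4x increase in panel size cuts the typical error roughly in half,
# i.e. the error shrinks like 1 / sqrt(N).
```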
But this sample-size-driven error is only one part of a far more complex and interesting story. In the real world, no sample is truly random; there are always sources of bias that spoil the ideal of every member of the population having an equal probability of being selected into the sample or panel. In the natural sciences, experimental errors are divided into two categories: random errors and systematic errors. You can address random errors by repeating the experiment many times or selecting a larger sample. Systematic errors, which are things like instrumentation flaws, methodological approximations, or the effects of non-random sample selection, must be addressed by forever evaluating and reevaluating every step of your measurement and striving tirelessly to reduce sources of error and bias. In media measurement, this requirement is embodied in the MRC Minimum Standards for Media Rating Research, standard 1.A – and it’s no accident that it’s the first standard of the first section: “Each rating service shall try constantly to reduce the effects of bias, distortion, and human error in all phases of its activities”. Words to live by!
One reason it is so important to control sources of systematic error is that, unlike random error, there is no simple way to calculate their effect on the measurements themselves. This is, in fact, an ugly secret in media measurement: when errors are reported at all (which is fairly rare), only the random component is considered, and that is very misleading. Any time you see a media measurement metric quoted with an error bar or confidence interval, it is almost always just the one-over-the-square-root-of-N calculation described above. Systematic error, which can be substantial, is rarely mentioned and almost never estimated or reported.
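One way to see what gets left out is the textbook mean-squared-error decomposition, written here in generic notation (an estimate and the true population value it targets), not anything specific to a rating service. The confidence intervals typically quoted capture only the first term:

```latex
\[
\underbrace{\mathbb{E}\!\left[(\hat{\theta}-\theta)^2\right]}_{\text{total error (MSE)}}
\;=\;
\underbrace{\operatorname{Var}(\hat{\theta})}_{\text{random error: shrinks as }1/N}
\;+\;
\underbrace{\left(\mathbb{E}[\hat{\theta}]-\theta\right)^2}_{\text{systematic error (bias}^2\text{): independent of }N}
\]
```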
Why is this important? Population sampling as a discipline was developed in the early 20th century, at a time when many people in many countries were surprisingly (from our 21st-century vantage point) willing to be surveyed or empaneled or otherwise measured. In those days, survey studies or polls would contact people according to a carefully designed “sample frame” and almost always the persons selected for the sample would willingly comply. In those circumstances, the impact of sample bias or other deviations from the probability sample assumption could fairly be considered to be reasonably small; reporting the simple random error estimate would be a reasonable approximation to the total error.
This isn’t true today, by any stretch. Many people simply refuse to talk to or respond to research interview requests. It’s very difficult to know anything about such people, since they generally resist any direct measurement, including a measurement designed to evaluate non-participation bias. There have been studies that attempted to recontact panel refusers (e.g. [1]), but those evaluations are necessarily incomplete insofar as many of the initial refusers also refuse the refuser study. It is widely believed that panel participation rates in the US are in the 20 to 30% range. Put differently, around 70% or more of people contacted refuse to be a part of a panel. Consequently, it becomes harder and harder to believe that the resulting panel is anything like a probability sample of the population. While bias and other systematic errors are notoriously difficult to estimate, it seems plausible that their impact must be substantial. Moreover, the dynamics of panel refusal have almost certainly changed during the COVID-19 pandemic, particularly when empanelment requires in-home interviews or equipment installation. For example, the Nielsen Peoplemeter panel typically requires technicians to enter a household to connect the equipment. Refusal to participate in that kind of close contact is likely to be driven by health considerations, on top of whatever was driving the low participation rates before the pandemic. Thus, whatever understanding of nonparticipation bias existed before 2020, in the pandemic and post-pandemic worlds those conclusions, particularly for methods that include in-home visits, are largely inapplicable.
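To see why participation rates of 20 to 30% are so worrisome, it helps to write down the standard (deterministic) approximation for the bias of an estimate computed from participants only. The symbols are generic textbook notation, not figures from any particular service: the average of the quantity of interest among participants, the average among refusers, and the refuser share of the population.

```latex
\[
\operatorname{bias}(\bar{y}_{\text{participants}})
\;\approx\;
W_{\text{refusers}}\,\bigl(\bar{y}_{\text{participants}} - \bar{y}_{\text{refusers}}\bigr)
\]
```

With a refuser share around 0.7, even a modest behavioral difference between participants and refusers flows almost directly into the reported estimate, and no increase in sample size reduces it.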
Given the difficulty and expense of recruiting a high-touch panel, together with the inevitable effects of bias, there is value in looking to other methodologies for television audience measurement. To be sure, any experiment or research will always have its own sources of error. There is no silver bullet. But the panel approach is so terribly inadequate today that there is clearly a need for something else.
In the 21st century there is data everywhere. Practically everything we do, from surfing the web, to using our phones for almost anything, to paying with a credit card, creates a great deal of “data exhaust”. While using such data comes with serious privacy implications that have to be addressed, there is also an opportunity here for media measurement. Indeed, from the very start the digital advertising ecosystem has been built around unit-level data observations. This kind of ultra-granular, event-level data is a necessity for measuring online behavior, given the vast number of choices on the web. In the early days of television, when any given household could only watch a handful of local stations, measuring television with a panel wasn’t a terrible idea. Today, with many hundreds of networks and stations available, not to mention time-shifting, streaming platforms, video on demand, and on and on, the size of panel required to plausibly measure most content is becoming impractical, quite apart from the unavoidable bias errors discussed above. The obvious conclusion is that any “data exhaust” available for television viewing should be used, appropriately, for television audience measurement.
And, of course, there is a source of passively collected, large-scale data for linear television viewing: the event logging of digital set-top boxes (STBs). These STBs were not designed to be audience measurement meters. They exist to provide a satisfactory user experience for the subscribers to cable or satellite television services (collectively referred to as Multi-channel Video Programming Distributors or MVPDs). But they also record the channel changes, DVR activity, and other user-generated events and can be configured to send those data elements back to the MVPD. Comscore has spent more than a decade working with these data sets. They are complex and noisy, but at the same time rich with information encoding human television viewing behavior. The task of the data science practitioner is to extract the signal from all of that messiness, a task that has led to a completely different, and potentially far more complete and accurate, television measurement system.
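To give a flavor of the kind of processing involved, here is a minimal, purely illustrative sketch of turning raw tuning events into viewing sessions. The event schema, field names, and the four-hour cap are assumptions made for this example; they do not describe Comscore’s actual pipeline:

```python
from datetime import datetime, timedelta

# Hypothetical raw STB events: (device_id, timestamp, event_type, network),
# assumed sorted by time within a device. Real logs are far messier
# (power-state ambiguity, DVR playback, clock drift, and so on).
events = [
    ("stb-001", datetime(2021, 3, 1, 20, 0),  "tune", "ESPN"),
    ("stb-001", datetime(2021, 3, 1, 20, 42), "tune", "CNN"),
    ("stb-001", datetime(2021, 3, 1, 23, 5),  "power_off", None),
]

MAX_SESSION = timedelta(hours=4)   # assumption: cap tuning with no further activity

sessions = []
for (dev, start, kind, net), (_, next_time, _, _) in zip(events, events[1:]):
    if kind == "tune":
        end = min(next_time, start + MAX_SESSION)   # ends at next event or at the cap
        sessions.append((dev, net, start, end))

for dev, net, start, end in sessions:
    print(f"{dev} watched {net} from {start:%H:%M} to {end:%H:%M}")
```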
In addition to the messiness of the data, there is also the plain, unavoidable fact that STB data, along with practically all such “data exhaust” sources, comes from a biased, non-probability sample. This does not mean the data is unusable. It does mean that it cannot be treated “as though” it were a probability sample. The sources of bias must be evaluated, measured, and accounted for in order to turn non-representative input data into representative estimates of the underlying truth “out there” in the real world. One feature of the STB data that makes this endeavor possible is its sheer scale. Comscore today collects STB data from over 30 million television households in the US across four wired and two satellite MVPDs. Given this amount of data, even if it may in some ways over-represent or under-represent various populations, there is always some of every kind, which can be used to adjust the raw data to account for the known biases. And this is another difference: in the case of panel non-response bias, it is difficult or impossible to learn anything about the refusers. In the case of STB data, there is no question of participation; the differences between the “sample” (which is just the subscribers to the MVPDs) and the population are measurable and well-understood. We know, for example, the geographic location of every STB household, and we know the distribution of total population by geography. Thus, we can up-weight under-represented geographies and down-weight over-represented geographies. Similar considerations are applied to demographic skews, network coverage/availability differences, and so on.
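As a minimal illustration of the kind of adjustment involved, consider a simple post-stratification by geography. The market names, shares, and tuning counts below are made up for the example; they are not Comscore’s actual figures or methodology:

```python
# Hypothetical post-stratification by geography: up-weight under-represented
# markets so the weighted STB households match the known population distribution.
population_share = {"Market A": 0.40, "Market B": 0.35, "Market C": 0.25}  # known totals
stb_share        = {"Market A": 0.50, "Market B": 0.30, "Market C": 0.20}  # measured footprint

weights = {m: population_share[m] / stb_share[m] for m in population_share}
# Market A households count for less (0.80), Market C households for more (1.25).

tuned_households = {"Market A": 120_000, "Market B": 90_000, "Market C": 40_000}  # raw counts
weighted_total = sum(tuned_households[m] * weights[m] for m in tuned_households)

print({m: round(w, 2) for m, w in weights.items()})
print(f"weighted tuned households: {weighted_total:,.0f}")
```

In practice the adjustment cells are far finer, spanning geography, demographics, and network coverage, but the principle is the same: known population totals anchor the non-probability input.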
Indeed, the use of non-probability samples in statistical research has become a very relevant topic with an active research community studying it. The landmark report from AAPOR [2], the leading professional organization of public opinion and survey research in the United States, establishes the foundations of correctly handling non-probability inputs to measure the characteristics of populations under various recruitment and measurement scenarios.
To understand the various pros and cons of the two approaches, let’s be more specific. On the one hand, we have traditional panel-based measurement, recruited from a carefully designed sample frame, which may encompass tens of thousands of households nationally, or high hundreds to low thousands in local markets. The panel measurement includes direct observations of which persons, with stated demographics, are watching television at any point in time via metering technology that requires viewers to “log in” or otherwise indicate their presence in front of the screen. On the other hand, we have a “convenience sample” of STB households that are not recruited into the data set at all. Moreover, there are no persistent reminders to those households that their channel changes and other behaviors are being recorded. The STB dataset is tens of millions of households in size; in Comscore’s case, we directly measure about 1/3 of all television households in the US.
Recall the two categories of error: random, sample-size-driven errors, and systematic errors arising from bias in the observed set (along with other possible measurement errors). In the case of the small panel, the random errors aren’t terrible for popular programming that draws a very large audience. But with all the fragmentation in the media environment, such large-audience programs are becoming fewer and fewer. For a national panel of, say, 50,000 households, a 1 rating corresponds to 500 households, which isn’t so bad. But if you want to get down to the long tail, the numbers quickly fall apart. So, too, if you are trying to measure with any demographic specificity. By contrast, the STB dataset will have about 300,000 households watching a program with a 1 rating, many times the entire size of the full panel in the other case.
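For the random-error component alone, here is a back-of-the-envelope comparison using just the sample sizes above and the usual binomial approximation. It treats both datasets as if they were simple random samples, which of course neither is, and it says nothing about bias:

```python
import math

def relative_error_of_rating(rating_pct: float, n_households: int) -> float:
    """Approximate relative standard error of a rating, treating the measured
    set as a simple random sample (a simplification in both cases)."""
    p = rating_pct / 100.0
    return math.sqrt(p * (1 - p) / n_households) / p

for label, n in [("50,000-household panel", 50_000),
                 ("30-million-household STB dataset", 30_000_000)]:
    rel = relative_error_of_rating(1.0, n)
    print(f"{label}: ~{100 * rel:.1f}% relative error on a 1 rating")

# The panel's random error alone is about sqrt(600), roughly 24x, larger;
# and this comparison ignores the other error category entirely, namely bias.
```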
The situation gets much worse for panels in local market measurement. Take, as an example, Springfield MO, a mid-ranking market (it’s #70 in Comscore’s Market rankings) with about 380,000 television households. See Figure 1 for the geographic context of this market. Let’s suppose our small panel service has, generously, 1,000 households under measurement. Comscore has about 80,000 satellite and cable households under direct, passive measurement in this market. To get a sense of the effect of sample size on the ratings estimates, I took a random sample of 1,000 of the STB households and called them a “panel-like sample”. (In this case, it really was a probability sample of the STB households. If anything, this simulation of the panel is conservative in that I made no attempt to simulate nonresponse bias or sample churn within the STB households). And let’s ask, in these two datasets, what does the Univision primetime audience in Springfield look like over the course of 2020? See Figure 2. The blue points are from the full STB dataset. You can see that the Univision primetime audience is very consistent throughout the year. These are loyal viewers who keep coming back, and the data shows this, even in a market with a relatively small Latino or Spanish-language population, and even with the disruptions from the pandemic. The red points are from the 1,000-household subset. The stability is completely washed out by the small-sample-driven statistical fluctuations. Imagine trying to plan or post a media buy on the red points. The small sample is plainly inadequate to the task.
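The subsampling exercise itself is simple to reproduce in spirit. Here is a minimal sketch assuming a hypothetical table of weekly primetime tuning flags per household; the file name and column names are illustrative, not Comscore’s actual schema:

```python
import pandas as pd

# Hypothetical input: one row per (household_id, week) with a 0/1 flag for
# whether the household tuned Univision in primetime that week.
tuning = pd.read_csv("springfield_univision_primetime_2020.csv")  # illustrative file

full_weekly = tuning.groupby("week")["tuned"].mean()          # ~80,000 STB households

# Draw a 1,000-household "panel-like" subset and recompute the same weekly trend.
panel_ids = tuning["household_id"].drop_duplicates().sample(n=1_000, random_state=7)
panel_weekly = (
    tuning[tuning["household_id"].isin(panel_ids)]
    .groupby("week")["tuned"]
    .mean()
)

comparison = pd.DataFrame({"full_stb": full_weekly, "panel_like": panel_weekly})
print(comparison.describe())   # the panel-like column shows far larger weekly swings
```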
This example illustrates another feature of the STB data, regarding representation of minority groups. Because the households under measurement are not recruited in any sense, there is no opportunity for conscious or unconscious bias to creep into sample selection. The households under measurement are simply the households that subscribe to the MVPDs (and that also have a return path connection in the case of satellite MVPDs). The composition of the measured set simply mirrors the subscriber base of the MVPDs. The viewing trend shown in Figure 2 is for a Spanish-language network, Univision, and there is good Latino representation within the households under measurement. Moreover, a demographic match is available for almost all of the measured STB households, and it is used to adjust for any mismatch between the demographics of the measured households and those of the television market as a whole. For these reasons, STB data is substantially less likely to be biased against minority communities than a high-touch, recruited sample.
We can also look at the stability of the STB data in a large market over time. Take New York, for example. In the New York Comscore market, we have over 1.2 million satellite and cable households under direct measurement, representing nearly 20% of all television households in the market. This dataset is more than sufficient to measure local cable at the individual quarter-hour level, for every quarter-hour. Indeed, Figure 3 illustrates quarter-hour viewing across Q1 of 2021 for two very different cable networks, MSNBC and Investigation Discovery. The MSNBC audience is consistently lower on weekends, and you can see other temporal effects, such as the impact of the events of January 6, 2021 in Washington DC. Investigation Discovery, on the other hand, is smaller but has a very consistent audience over time. There is robust STB data in all 8,736 quarter-hours of this period for both networks.
Another consequence of a small sample is the presence of so-called “zero cells”: reported metrics for some unit of content and some specified audience where there happen to be zero panelists watching. How common is this? In Figure 2, you can see at least three points in the simulated panel data (in red) that are very close to zero, and those are all-week aggregates. If you go down to individual quarter-hours, zero cells abound in the small-panel case. Zero cells trouble users of television ratings data because those users do not believe there was truly no viewership in the marketplace; they see the zero as a failure of the measurement service to detect viewing that actually exists.
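The arithmetic behind zero cells is simple. Under the simplifying assumption of independently observed households, the chance that a cell comes up empty is roughly (1 − p) raised to the power n, for a true tuning proportion p and n measured households. The 0.3 rating used below is just an illustrative value:

```python
def zero_cell_probability(true_rating_pct: float, n_households: int) -> float:
    """Chance that none of n independently observed households is tuned,
    given an assumed true tuning proportion."""
    p = true_rating_pct / 100.0
    return (1 - p) ** n_households

# An illustrative 0.3 rating (0.3% of households tuned) in a single quarter-hour:
print(f"1,000-household panel: {zero_cell_probability(0.3, 1_000):.1%} chance of a zero cell")
print(f"80,000 STB households: {zero_cell_probability(0.3, 80_000):.0e} (effectively never)")
```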
The prevalence of “zero cells” can be evaluated across all of Comscore’s local-market reporting. In an analysis that includes over 1,800 local stations and dozens of cable networks (reported at the market level, that is), the incidence of zero cells is a fraction of a percent of the possible market/network-station/quarter-hours throughout Q1 of 2021 – a total of over 76 million cases. See Table 1 for more details. The few zero-cell cases that we do see tend to be small networks or stations in tiny markets in the middle of the night, like a MyNet affiliate in Fairbanks AK on January 27 during the 2:45am quarter-hour. In the New York case, the number of zero cells among the 33 cable networks in this analysis is zero. None. With over 1.2 million households under measurement, New York simply doesn’t exhibit zero cells for these networks. I don’t know what the comparable metrics are for a panel-based service, but if I were a user of such data, I would take a good, hard look at this problem. And zero cells are only the most extreme case; a cell with one or two or ten reporting units contributing to it will be quite unstable.
The second category of error is systematic error, including error due to sample bias. Here, panel-based services are largely in the position of the emperor with no clothes [3]. With participation rates well below 50%, the empaneled sample is surely biased, but these services still pretend they are dealing with a probability sample, or at most apply some stratification or demographic weighting to account for part of the bias. Without any direct knowledge of how the majority who refuse to participate actually behave, those adjustments are about as effective as the emperor putting on a hat: they don’t address the real issue.
By contrast, the differences between the people who are in Comscore’s STB dataset and the people who are not, largely boil down to their choice of MVPD. If one household subscribes to DIRECTV and their next-door neighbors subscribe to Altice, there’s no obvious reason to think their behavior will, ipso facto, be markedly different. Altice may offer different networks in different subscription packages, but we actually know those “network coverage” numbers and can adjust accordingly. Moreover, because there are Comscore-measured households in essentially every populated ZIP code in the US, we have very good insight into geographic variations in viewing behavior and can adjust accordingly. The one type of household that we don’t have under direct measurement is over-the-air (OTA) households. We spend quite a lot of effort, including a massive survey of hundreds of thousands of respondents per year, to understand and model these OTA households. But that’s a topic for another blog.
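For the network-coverage point specifically, here is a minimal sketch of the kind of adjustment involved. The MVPD labels, household counts, and tuning figures are invented for illustration and do not reflect Comscore’s actual coverage data or algorithm:

```python
# Hypothetical coverage adjustment: a network carried by "MVPD A" but not
# "MVPD B". Project its audience onto the footprint where it is actually
# available instead of assuming uniform availability across all measured homes.
measured_households = {"MVPD A": 18_000_000, "MVPD B": 12_000_000}
carries_network     = {"MVPD A": True,       "MVPD B": False}
tuned_households    = {"MVPD A": 180_000,    "MVPD B": 0}

covered = sum(n for mvpd, n in measured_households.items() if carries_network[mvpd])
rating_within_coverage = sum(tuned_households.values()) / covered

print(f"rating within the coverage footprint: {rating_within_coverage:.2%}")
```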
The bottom line is this: at a time when panel participation rates are at historic lows while simultaneously the number of viewing choices is exploding, a small panel is a recipe for failure. It simply costs too much to get the scale needed to cover the full range of viewing choices, and no reasonable amount of money can solve the nonparticipation bias problem. The only viable alternative is to use passively collected, event-level tuning data at a massive scale. With the growing acceptance of non-probability sampling techniques, the time has come to make the transition to a future-looking methodology. Otherwise, the emperor truly has no clothes.
Acknowledgements
This paper benefited greatly from detailed discussions with David Algranati, Chris Wilson, Carol Hinnant, and Bill Livek from Comscore’s Executive Leadership Team. Of course, any errors or misstatements are solely on me, and not them.
References
[1] Robert M. Groves, Ashley Bowers, Frauke Kreuter, Carolina Casas-Cordero, Peter V. Miller, “An Independent Analysis of the Nielsen Meter Nonresponse Bias Study, A Report to the Council for Research Excellence”, March 2009, http://www.researchexcellence.com/files/pdf/2015-02/id158_cre_report_meter_03_31_09_.pdf, accessed 24 April 2021.
[2] Reg Baker et al., “Report of the AAPOR Task Force on Non-Probability Sampling”, June 2013, https://www.aapor.org/Education-Resources/Reports/Non-Probability-Sampling.aspx, accessed 24 April 2021.
[3] Hans Christian Andersen, “The Emperor’s New Clothes”, in Fairy Tales Told for Children. First Collection. Third Booklet, April 1837.
Figure 1. Comscore Local Markets map, showing Springfield MO in context.
Figure 2. Scaled weekly primetime average audience on Univision in Springfield MO throughout 2020, comparing the full STB dataset of about 80,000 cable and satellite households to a 1,000-household randomly selected subset.
Figure 3. Local cable quarter-hour (QH) trends for two cable networks in New York, using the full STB dataset of more than 1.2 million households in this market.
| Type | Month | Total QHs | Non-Zero QHs | Zero-Cell QHs | Zero-Cell Fraction |
| --- | --- | ---: | ---: | ---: | ---: |
| Broadcast stations | Jan 2021 | 6,037,920 | 6,025,949 | 11,971 | 0.198% |
| Broadcast stations | Feb 2021 | 4,908,288 | 4,902,379 | 5,909 | 0.120% |
| Broadcast stations | March 2021 | 4,996,992 | 4,984,518 | 12,474 | 0.250% |
| Top-33 Cable Networks | Jan 2021 | 23,284,800 | 23,211,475 | 73,325 | 0.315% |
| Top-33 Cable Networks | Feb 2021 | 18,627,840 | 18,567,573 | 60,267 | 0.324% |
| Top-33 Cable Networks | March 2021 | 18,627,840 | 18,576,518 | 51,322 | 0.276% |
| Total Broadcast | 2021 Q1 | 15,943,200 | 15,912,846 | 30,354 | 0.190% |
| Total Cable | 2021 Q1 | 60,540,480 | 60,355,566 | 184,914 | 0.305% |
| Total | 2021 Q1 | 76,483,680 | 76,268,412 | 215,268 | 0.281% |
Table 1. Zero-cell counts for broadcast stations and market-level cable networks, across all quarter-hours in all 210 Comscore Markets during the first quarter of 2021. All Comscore-reported stations are included, as well as the “top-33” cable networks (exact list of networks available on request).