The availability of customer-level impression and conversion data for digital advertising, combined with the ability to observe online and sometimes offline conversions, has led to a new discipline in marketing analytics, often called MultiTouch Attribution (MTA). Conversion outcomes may take the form of sales, signups, requests for information, or deep engagement with the content of a website – whatever a marketer is attempting to influence. The ultimate goal of MTA is to properly measure, at a granular level, the impact of marketing activities on the metric associated with a conversion event, and to use these insights to guide decisions about future marketing spend. The fundamental question that MTA seeks to answer is ‘What is the expected change in a conversion metric that resulted from an impression (or any other form of interaction with the customer)?’ To answer this question, we need to identify a causal relationship, not just observe correlation.
A prerequisite for any MTA analysis is to create a mapping of distinct marketing impressions to individuals and to sequence these events into a history for each individual. This mapping can be accomplished in many different ways. All of them rely on identifiers for each event that can be cross-referenced between data sources. The most common identifiers are cookie IDs, IDs in ad server logs, site tags and, especially for mapping offline conversions to online activity, personally identifiable data like name, address, email address and phone number. We will assume in this paper that such mapping has been accomplished with sufficient quality.
There are a number of biases in the observed data that complicate the identification of causality described above. Not least among them is the task of accounting for existing customer intention, which may be caused by additional drivers that do not form a part of the individual-level dataset. For example, customers who receive paid search advertising tend to buy more than those who do not. But because they searched for a term related to the product, they were already more likely to be in the market for it. They may be in the market for our product because of an un-trackable marketing stimulus, such as a compelling TV ad. Neither the customer’s predisposition to buy, nor the fact that a customer has been exposed to a linear TV impression, would be part of a standard customer-level dataset of addressable impressions and individually observable conversions. Unfortunately, the granularity of data about different influencers varies.
The challenge is to devise an analysis that can provide an unbiased assessment of the impact of a subset of influencers of demand using the data available. An obvious way to deal with the biases that creep into this type of analysis is to run a controlled experiment with well-defined control and treatment groups, and to directly compare conversion rates. However, there are many reasons why it is infeasible or at least impractical to use A/B testing for every tactic or every campaign in digital marketing. Therefore, methods that use non-experimental data – the data that is collectable as a by-product of executing marketing activities – are widely used to answer these questions.
From an academic perspective, the problems presented by the new world, in which there are highly granular ‘Big Data’ sources for some effects but still only aggregate data for others, have been the subject of some but perhaps insufficient research. In a 2007 study, Marks [1] found that in some types of modeling, aggregate behavior can be relatively insensitive to the specific actions of individuals. Page [2] shows that, in principle, significant problems can occur when aggregate behavior is considered in isolation. Page’s example corresponds to a similar problem that can be found in MTA when variation in media activity is minimal at an aggregate level but very distinct at an individual level. In another study of the impact of levels of aggregation on analytical outcomes, specifically household buying behavior, Kahn [3] found that aggregation can be problematic, noting that aggregation of individual-level behavior to a household level can mask the decision-making criteria of the individual actors in a household.
From a marketer’s perspective, the need to cut through the challenge of data granularity is reflected in a desire to understand the performance of all marketing channels, and also to understand their role in the context of the market environment. To meet this need, there has been a recent convergence of traditional, aggregate-level analysis, such as Marketing Mix Modeling (MMM), and individual-level analysis, such as MTA, which uses individual customer impression-level data. As an example of these shifting industry practices and expectations, Forrester Inc. no longer reviews MMM or MTA solutions separately but provides market surveys on combined Marketing Measurement and Optimization Solutions only [4].
So what are the common ways to deal with offline effects in MTA models? Besides simply ignoring them, which is still common practice, we are going to take a closer look at four different ways to approach the challenge. This paper will investigate how including non-digital, aggregate-level data in the analysis of impression-level online data can help alleviate some of these problems, and compare the pros and cons of some of the ways this is currently done. Terminology has not settled yet, so many terms will be used loosely. In particular, the terms online, individual and addressable will be used interchangeably, as will offline, aggregate and not addressable.
The Need to Combine Customer-Level and Market-Level Data
But let’s, for the time being, assume that we are managers of digital marketing, and we care only about the performance of digital channels for which we have impression-level data at the individual (cookie, household, individual consumer) level. Do we really need to take into account influencers that are available at an aggregated level? What about those influencers only observable on a weekly basis at a market level? What would have to be true to expect that omitting them would not bias the analysis of the remaining influencers?
Standard statistical intuition would say that they would either have to be invariant across the dataset or, if they change, be uncorrelated with the influencers of interest. But in the real world of deliberately integrated marketing campaigns, either of these conditions would hold only if the analysis could be restricted to very short time frames of a few days to a few weeks.
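The two conditions above can be illustrated with a minimal sketch of omitted-variable bias. The data, the variable names and the true effect sizes (1.0 for online, 2.0 for offline) are illustrative assumptions chosen to make the arithmetic exact; they do not come from any real campaign.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var


def outcome(online, offline):
    """Hypothetical data-generating process: online effect 1.0, offline effect 2.0."""
    return [1.0 * a + 2.0 * b for a, b in zip(online, offline)]


# Case 1: offline activity moves together with online activity
# (as in a deliberately integrated campaign).
online_sync = [0, 1, 1, 2]
offline_sync = [0, 0, 1, 1]
slope_correlated = ols_slope(online_sync, outcome(online_sync, offline_sync))

# Case 2: offline activity varies but is uncorrelated with online activity.
online_indep = [0, 0, 1, 1]
offline_indep = [0, 1, 0, 1]
slope_uncorrelated = ols_slope(online_indep, outcome(online_indep, offline_indep))

# Both regressions omit the offline variable entirely.
print("online effect, offline correlated:  ", slope_correlated)
print("online effect, offline uncorrelated:", slope_uncorrelated)
```

In the first case the omitted offline effect is absorbed into the online coefficient, which comes out at twice its true value; in the second case the online coefficient is recovered even though the offline variable is ignored.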
Limiting the Analytical Time Frame
If we could make the observation period short enough to assume that offline effects are unchanging, and if we could assume no local differences in online marketing execution, we could ignore offline effects without introducing biases. So, given the large number of customers and impressions we are dealing with, can’t we just use a short time period for our analysis? Unfortunately, this presents a number of issues in the context of attribution.
Small Incremental Effects
The MTA practitioner is looking for a large number of very small changes in behavior; and the true bottleneck of the analysis is not the number of impressions, it is the number of conversions and the number of customers we interact with. A typical advertiser might generate millions of advertising impressions a day but only see hundreds or thousands of conversions. Now, a thousand conversions might sound like a large number but we are looking for very small changes of conversion probability as a result of an impression. Let’s illustrate that fact with an example: Assume an advertiser buys display impressions at a CPM of $1 or a cost of one tenth of a cent per impression. Let’s assume a conversion generates a profit of $10. If the impression increases conversion probability by 0.01 percent, the advertiser breaks even. However, this is only one advertising tactic. There might be a dozen or more of those going on at the same time with similar levels of lift. To establish solid estimates of multiple lifts of such a small magnitude requires a large number of observed conversions. This is why multi-week or multi-month datasets are needed for MTA despite the apparent glut of observations per day.
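The break-even arithmetic in this example can be checked with a few lines of Python. The function name and parameter names are hypothetical; the CPM of $1 and the profit of $10 per conversion are the figures from the text.

```python
def break_even_lift(cpm_dollars, profit_per_conversion):
    """Increase in conversion probability (as a fraction) at which
    a single impression exactly pays for itself."""
    cost_per_impression = cpm_dollars / 1000.0  # CPM is the cost per 1,000 impressions
    return cost_per_impression / profit_per_conversion


lift = break_even_lift(cpm_dollars=1.0, profit_per_conversion=10.0)
print(f"break-even lift per impression: {lift:.4%}")
print(f"impressions per incremental conversion at break-even: {1 / lift:,.0f}")
```

At break-even, roughly 10,000 impressions are needed to produce one incremental conversion, which is why the number of conversions, not the number of impressions, limits the analysis.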
Long Conversion Lags
Another limitation of looking at short time windows is the fact that many marketing stimuli take time to lead to conversions. For high-consideration purchases or purchases that typically have long decision cycles, the initial impression that led the consumer to start the research that ultimately led to the conversion might not be part of the short dataset. As a result, prospecting activities and other educational forms of advertising are likely to get lower attribution than “call-to-action” or closing activities. Upper-funnel advertising activities are already disadvantaged due to cookie deletion or device switching, which lead to broken sequences. In short time windows, they don’t have a chance.
Different Consumer Experiences
There is a subtler and less obvious reason why market-level data holds information that is lost in the online datasets: Market-level data describes the response of the total population. That population is largely unchanging during short- to mid-term time frames. Online datasets are different. Typically, for an individual to be registered in the online dataset, the individual must have received at least one impression or been involved in a conversion. So the composition of the observed population changes over time. In a week in which advertisers rely heavily on less-targeted prospecting campaigns, many new individuals will enter the dataset but their inherent probability to convert might be low – after all, prospecting campaigns aren’t highly targeted. In weeks with heavy reliance on search advertising or retargeting, fewer new individuals will enter the dataset but those who do are well qualified and have a higher inherent probability to convert. As a result, even without confounding influences external to the dataset, marketing activity will be correlated with the observed population’s average base propensity to convert. That correlation cannot be resolved from individual-to-individual variation; resolving it has to rely on variation over time.
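The composition effect described above can be sketched numerically. All numbers are illustrative assumptions: individuals who enter the dataset through prospecting are given a base conversion propensity of 1 percent, those who enter through search or retargeting 5 percent, and the weekly counts of new entrants are invented.

```python
PROSPECTING_BASE = 0.01   # assumed base propensity of prospecting entrants
RETARGETING_BASE = 0.05   # assumed base propensity of search/retargeting entrants


def observed_base_rate(new_from_prospecting, new_from_retargeting):
    """Average base propensity of the individuals entering the dataset in a week."""
    total = new_from_prospecting + new_from_retargeting
    expected_base_conversions = (
        new_from_prospecting * PROSPECTING_BASE
        + new_from_retargeting * RETARGETING_BASE
    )
    return expected_base_conversions / total


# week label -> (new entrants via prospecting, new entrants via search/retargeting)
weeks = {
    "prospecting-heavy": (9000, 500),
    "balanced": (4000, 1500),
    "retargeting-heavy": (500, 2000),
}

rates = {label: observed_base_rate(*mix) for label, mix in weeks.items()}
for label, rate in rates.items():
    print(f"{label:>18}: observed base rate {rate:.2%}")
```

No impression has any effect in this sketch, yet the observed base rate moves with the marketing mix. A model fitted only on the observed population could mistake that movement for a media effect.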
But, as soon as our dataset for analysis spans multiple weeks or multiple geographies, chances are that influencers changing week-by-week or market-by-market are correlated with digital activity. In many cases that is by design: Marketers execute online campaigns in sync with offline campaigns, modulate marketing spend by seasonality or even vary online execution by market. In addition, the more detail there is, the greater the chance of spurious correlations. As a result, we find on a regular basis that many marketing activities – online and offline – are positively correlated on a weekly time scale.
In a large-scale simulation study [5], which simulated a series of digital campaigns delivered to the entire US population over a period of a year, we quantified the confounding impact of existing propensity on accurate marketing attribution. The results indicated that even assuming low levels of propensity to buy, and a media deployment typical of many major campaigns we see, failing to account for the impact of propensity could reduce the overall level of accuracy across all channels to less than 70 percent.
Bringing Together Individual and Aggregate Level Data
In the previous sections, we argued that in typical marketing environments it is often impossible to produce an unbiased description of individual advertising response without incorporating offline data. The question arises, then, as to how we can accomplish a mapping between individual and aggregate. In most cases, the mapping will be based on the geography associated with the individual. If information is available to map individuals to geographies (based on CRM data, registration data, data associated with a purchase, or based on mappings of cookies or device IDs to individual households), offline effects for these geographies can be mapped to an individual online history. Given that the individual is a member of that geography, we can make an assumption that this individual was exposed to the average of the influences in the geography, or equivalently, interpret market-level numbers like impressions or GRPs as a measure of exposure probability. More sophisticated mappings could be devised based on known media consumption distributions – and applied, for example, at geodemographic segment level. In the end, the result is a dataset that combines the online history of an individual with some probabilistic measure of offline influencers at the individual level. The challenge now is how to use this dataset. We will look at three different approaches to this below.
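A minimal sketch of the geography-based mapping follows. The customer IDs, geographies, week labels, GRP values and the helper `append_market_data` are all hypothetical; in practice the geography lookup would come from CRM, registration or purchase data as described above.

```python
# Hypothetical individual-level touch history: (customer_id, week, channel).
touches = [
    ("c1", "2016-W01", "display"),
    ("c1", "2016-W02", "search"),
    ("c2", "2016-W01", "display"),
]

# Hypothetical lookup from customer to geography.
customer_geo = {"c1": "NYC", "c2": "CHI"}

# Hypothetical market-level TV weight (GRPs) by (geography, week).
tv_grps = {
    ("NYC", "2016-W01"): 120.0,
    ("NYC", "2016-W02"): 80.0,
    ("CHI", "2016-W01"): 40.0,
}


def append_market_data(touches, customer_geo, market_series, missing=None):
    """Attach the market-level value for each touch's geography and week."""
    rows = []
    for customer_id, week, channel in touches:
        geo = customer_geo.get(customer_id)
        rows.append({
            "customer_id": customer_id,
            "week": week,
            "channel": channel,
            "geo": geo,
            # Every member of a geography receives the same market-level value,
            # interpreted as an average or probabilistic exposure.
            "tv_grps": market_series.get((geo, week), missing),
        })
    return rows


augmented = append_market_data(touches, customer_geo, tv_grps)
for row in augmented:
    print(row)
```

The resulting rows combine each individual's own online history with a market-level quantity that is identical for everyone in the same geography and week.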
Mapping Market-level Data to Individual Customers and Treating Them as Customer Attributes in an Individual Response Model
One approach is to use the dataset described above as is, without a distinction between online data (i.e. data that describes the unique experience of the individual) and offline data (i.e. data that describes the shared experience of all members in a geography). All of these data points are describing influencers of an individual’s decision making process. Why not make them an equal part of a probabilistic model of consumer choice (e.g. logit model, decision trees, post-hoc A/B testing) and use the resulting choice model for attribution?
For one, this approach potentially creates a large number of additional variables in a model that is already limited by the number of conversions, which is likely to lead to over-fitting. More importantly, market-level variables do not vary across individuals in the same market. As a result, the power of the estimation to identify their coefficients comes from the overall variation across time and markets, not across individuals. Note that there are often millions of individuals but only dozens of markets or weeks in the dataset. The dataset, therefore, has the same power of identification and coefficient estimation as a traditional MMM for market-level effects.
For most business applications though, the fidelity of the individual-level dataset, if observed in the aggregate, is lower than aggregate-level datasets due to non-perfect match rates between sources, device shifting, losses in records in server logs and many different mapping issues when combining multiple tracking logs in high volume transactional systems.
In addition, establishing seasonalities and disentangling them from offline media effects like TV, radio, outdoor advertising or newspaper and magazine ads (which all have significant memory effects) requires many weeks – often multiple years – of data. In most cases, such long histories of impression-level data are simply not available.
As a result, coefficients of market-level effects in individual-level models can have low significance and often look counter-intuitive. But if those variables don’t reconcile with other sources of information in the model, it is likely that the online effects will be biased too.
As we discussed above, this setup also cannot recover the true base conversion rate for individuals due to the changing composition of the observed population. As a result, if there are meaningful shifts in the mix of online advertising tactics, the results might be counter-intuitive. As shown in Stratton & Beyer, 2016, when a baseline correction from a market-level model is omitted, nothing counteracts this effect: the impression type with the lowest incremental effect will compete with or be combined with base propensity (the two cannot be identified independently). Thus, models will often show negative effects of prospecting (display) activity.
Adjusting Outcome Variables Based on Incrementality Derived from Market-level Models
The long history of research into marketing time series models [6, 7] demonstrates that models based on data aggregated over time provide a valuable way to determine the overall incrementality of marketing effects within the context of all other influencers of demand in the market. They do not suffer from mapping issues of individuals’ histories or the fluctuations in observed populations. In addition, aggregated data is typically available for longer histories.
A common practical approach to addressing incrementality in individual-level models has been to first build an aggregate market-level model, then to use that model to determine for every week and geography what is the total incremental effect of digital advertising. This method then uses digital individual-level data only to create an individual response model. When attributions to events in a sequence are calculated, only a fraction of the outcome variable is attributed for each converting sequence. The fraction is chosen such that the incrementality percentage in the aggregated model and the attribution results match. This is equivalent to applying a consistent modifier to all of the digital media in the analysis.
In terms of capturing the overall context of marketing, this method can have significant advantages over rule-based attribution techniques like first/last click or equal attribution. It largely decouples the estimation of aggregate and disaggregate models, bringing the results together at the last stage and making for a simpler modeling process. In addition, the construction of the attribution guarantees that in the aggregate, the attribution model and the market-level model are in agreement.
Unfortunately, applying a single incrementality percentage to all converting sequences creates significant biases at the detailed attribution level. We will illustrate this with a simple example:
Assume all customers have a base conversion rate of 10 percent, and every display impression they receive will increase that conversion rate by 1 percentage point. Assume there are 100 customers that receive no impressions, 100 customers that receive one impression from publisher 1, and 100 distinct customers receiving 10 impressions from publisher 2. Let each conversion be worth $1 to the advertiser. The table above summarizes the assumptions.
A correct aggregate model would identify a base of 30 conversions and an incremental of 11 conversions due to 1,100 digital display impressions. There are a total of 31 conversions that are preceded by digital advertising, so the incrementality ratio to be applied is 11/31. As a result, Publisher 1’s impressions receive a credit of $11 × (11/31) / 100 = $0.039 each and Publisher 2’s a credit of $20 × (11/31) / 1,000 = $0.0071 each, even though by construction every impression generates the same $0.01 of incremental value. While the total incremental of $11 is in accordance with the assumed data generation mechanism, Publisher 2 gets penalized for extracting incrementality from fewer customers with more impressions. In general, this method penalizes impressions that occur in customer histories with higher than average incrementality.
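The arithmetic of this example can be reproduced with a short script. The constants are the assumptions stated in the text; the variable names and the dictionary layout are illustrative only.

```python
BASE_RATE = 0.10            # base conversion rate for every customer
LIFT_PER_IMPRESSION = 0.01  # one percentage point per display impression
VALUE_PER_CONVERSION = 1.0  # dollars

# group name -> (number of customers, impressions per customer)
groups = {
    "no ads": (100, 0),
    "publisher 1": (100, 1),
    "publisher 2": (100, 10),
}


def expected_conversions(customers, impressions_each):
    return customers * (BASE_RATE + LIFT_PER_IMPRESSION * impressions_each)


# Aggregate view: incremental conversions and conversions preceded by an ad.
total_incremental = sum(c * LIFT_PER_IMPRESSION * k for c, k in groups.values())
exposed_conversions = sum(
    expected_conversions(c, k) for c, k in groups.values() if k > 0
)
ratio = total_incremental / exposed_conversions  # 11 / 31

# Post-hoc adjustment: the same ratio is applied to every converting sequence.
credit_per_impression = {}
for name, (customers, k) in groups.items():
    if k == 0:
        continue
    revenue = expected_conversions(customers, k) * VALUE_PER_CONVERSION
    credit_per_impression[name] = revenue * ratio / (customers * k)

true_value_per_impression = LIFT_PER_IMPRESSION * VALUE_PER_CONVERSION

for name, credit in credit_per_impression.items():
    print(f"{name}: credited ${credit:.4f}, true value ${true_value_per_impression:.4f}")
```

The credits sum to the correct total of $11, but Publisher 1 is credited with almost four times the true value of an impression while Publisher 2 is credited with less than the true value.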
Admittedly, this is an extreme example, but the biases introduced by across-the-board post-hoc incrementality adjustments run counter to the goal of digital attribution, which is to identify tactics that generate above-average incrementality relative to their cost.
Using Market-level Models to Create Input Variables for Individual-level Response Models
Another approach is to start with a market-level model, and use that model to establish the relative impact of different offline influencers on customers, rather than applying a consistent percentage. Using decompositions of a market-level model, different influencers, which originally might be expressed in different units of measure (GRPs, TRPs, spend, impressions, etc.), can be mapped by the market-level model into the same additive unit, e.g., percentage sales lift. This allows for the creation of new time series variables based on the combined lift of groups of influencers in a given week and geography.
An example might be to group all media effects into one compound variable while grouping competitive activity or seasonal effects into another. The advantage of this approach is that within these variables, the relative impact of the constituent influencers is estimated based on long, aggregated time series and the number of compound variables can be very small.
Those compound variables can then be mapped to individuals based on geographies or other population breaks that were present in the aggregate models and appended to the individual-level dataset. That augmented dataset is then used to estimate an individual-level response model.
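The construction of compound variables might look like the following sketch. It assumes a simple additive decomposition in which each offline input contributes a fixed fractional sales lift per unit; the coefficients, groupings, geographies and input values are hypothetical, and a real market-level model would typically apply lags, ad stocks and saturation transformations before this step.

```python
# Hypothetical decomposition coefficients from a market-level model:
# fractional sales lift per unit of each (already transformed) input.
lift_per_unit = {
    "tv_grps": 0.0005,
    "radio_grps": 0.0002,
    "competitor_spend": -0.0001,
}

# Grouping of influencers into a small number of compound variables.
compound_groups = {
    "media_lift": ["tv_grps", "radio_grps"],
    "competitive_lift": ["competitor_spend"],
}

# Hypothetical market-level inputs by (geography, week).
market_inputs = {
    ("NYC", "W01"): {"tv_grps": 100.0, "radio_grps": 50.0, "competitor_spend": 200.0},
    ("CHI", "W01"): {"tv_grps": 20.0, "radio_grps": 0.0, "competitor_spend": 50.0},
}


def compound_variables(inputs):
    """Sum the modeled lift of each group's members into one variable per group."""
    return {
        group: sum(lift_per_unit[name] * inputs[name] for name in members)
        for group, members in compound_groups.items()
    }


compound_by_market = {
    key: compound_variables(inputs) for key, inputs in market_inputs.items()
}

# Append the compound variables to an individual via the individual's geography.
customer_geo = {"c1": "NYC", "c2": "CHI"}
individual_row = {
    "customer_id": "c1",
    "week": "W01",
    **compound_by_market[(customer_geo["c1"], "W01")],
}
print(individual_row)
```

Within each compound variable the relative weights of the members are fixed by the market-level model, so the individual-level model only has to estimate one coefficient per group.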
The approach has limitations because offline influencers can only be interpreted in a probabilistic sense as an expected level of influence applying to every member of a larger population equally; but there are many advantages of being able to bring these offline influences into an individual response model as independent variables:
- The compound variables reduce the degrees of freedom in the model by fixing the relative impact of its member offline influencers while allowing the response model to freely choose the absolute impact of both online and offline effects.
- Because these variables are part of the set of independent variables, modelers can choose different transformations of these variables as well as interaction terms with other variables in the model.
- Because the compound variables are derived from the variables in the market-level models, they inherit time lags, ad stocks and saturation behavior that were measured in the market-level model [8].
- As the calculations are dynamic, the resulting models may be used for forward-looking predictions of conversion behavior as a function not only of online but also offline conditions.
Synching Market-Level and Individual-Level Models
When separate market-level and individual-level models are combined in the way described above, both models will provide estimates of the impact of both online and offline influencers. However, each model typically provides the more reliable insights for a different set of influencers. It is therefore important to take advantage of these complementary strengths and synchronize the market-level and individual-level models. We typically use the lifts from online influencers resulting from individual-level models to derive Bayesian priors for a re-estimation of market-level models. We also use prior information from the market-level models to guide coefficients for the compound variables, ensuring consistency between both models. Typically, results from both market-level and individual-level models show consistency after very few re-estimations.
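The text does not specify how the priors enter the re-estimation. As one possible illustration only, the sketch below assumes a normal prior on a single coefficient and a normally distributed estimate, in which case the two combine by precision weighting (the standard normal-normal conjugate update). The function name and all numbers are hypothetical.

```python
import math


def combine_prior_and_estimate(prior_mean, prior_sd, estimate, std_error):
    """Precision-weighted combination of a normal prior and a normal estimate."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    posterior_precision = prior_precision + data_precision
    posterior_mean = (
        prior_precision * prior_mean + data_precision * estimate
    ) / posterior_precision
    posterior_sd = math.sqrt(1.0 / posterior_precision)
    return posterior_mean, posterior_sd


# Hypothetical example: the individual-level model suggests an online lift of
# 0.02 (sd 0.01); the market-level model alone estimates 0.05 (std. error 0.02).
posterior_mean, posterior_sd = combine_prior_and_estimate(
    prior_mean=0.02, prior_sd=0.01, estimate=0.05, std_error=0.02
)
print(f"posterior lift: {posterior_mean:.4f} (sd {posterior_sd:.4f})")
```

The more precise source dominates the result, which matches the idea that each model should inform the other mainly for the influencers it measures most reliably.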
Ideally, one would combine the estimations of both the market-level and individual-level models into a single joint estimation step. The complexity arises from the fact that both models typically use different model forms, differently structured datasets and different time horizons. There are trade-offs between the elegance of a unified estimation and the simplifying model assumption that has to be made to accomplish it. This generally is an area of continuing research.
Whatever method for digital attribution one uses, as long as the data doesn’t completely rely on randomized test and control groups, aggregate data is essential to establish causal links between online impressions and the probability of conversion for an individual customer. That causal link in turn allows us to reason about the true incremental effect of an online impression, that is, the degree to which that impression changed the behavior of that individual.
It matters how those data are included in the analysis, both in terms of the robustness of the estimation of the effects and in terms of the unwanted side effects that the introduction of the data can cause. It is therefore important to be transparent about the methodology and to approach the modeling and estimation exercise from the ground up with a good understanding and a defensible model of the data-generation mechanism.