Beyond Traditional A/B Testing: Long-Term Alternatives

A/B tests are usually treated as a way to establish causality between a decision and the outcome obtained. Correlation does not imply causation, but we have known for decades how to identify genuine causal effects. Given the gravity of claiming that event X causes Y, and the powerful implications of that knowledge, statistics has developed a variety of techniques in this area, helping with everything from handling small samples to dealing with missing data.

Much of this scientific development occurred, and still occurs, closely tied to medical and pharmaceutical research, where a wide range of constraints arise during experiments: patients may drop out, may have unrelated events that interrupt observation, or may enter the study at different stages of the disease. The medical literature has developed, over decades, robust tools to handle each of these situations.

More recently, these techniques have been applied online to validate new features. Because this use is newer, often only the most basic capabilities of these tools are used, and as a consequence we end up limiting both the experiments and the results we can obtain.


What most tech companies do today

The most common practice in A/B testing at large technology companies is to use short-term metrics to infer long-term behaviors: consumption, click-through rate, daily sessions, immediate conversion. While this can guide decisions and be useful in many contexts, these metrics were not built to understand churn or medium- and long-term behaviors. They are fragile proxies for what truly matters.

The reason for this choice is practical. As discussed in LinkedIn's work, keeping an experiment running for months has high costs: it endangers the user experience if the feature is bad, drastically reduces the speed of innovation, increases the chance of unexpected interactions with other experiments, and generates computational and product management overhead. You cannot simply "run the experiment longer until you observe the long-term outcome."

Spotify faces exactly this dilemma. The metric most aligned with the business health of a subscription platform is retention: does the user keep coming back? But observing retention over a window long enough to be informative means waiting weeks, and that wait is incompatible with the iteration speed the product demands.

LinkedIn faces the same problem in a different context. The north-star metric of the jobs platform is confirmed hires, meaning people who actually got a job with the platform's help. But a confirmed hire is only observed when the new employee updates their LinkedIn profile, which typically takes several months.

In both cases, the industry resorted for a long time to the same generic solution: use a top-of-funnel metric (views, applications, clicks) as an approximation. And pray that it is directionally aligned with the north-star metric.


Survival analysis: decades of development in medicine

A technique widely known in pharmaceutical and medical research is called survival analysis. The core idea is to handle data where the event of interest may not have occurred yet at the time of analysis, but the observation must remain valid. This is what we call right-censoring.

Classic examples come from medicine: a drug is being studied for heart problems, and one of the test patients dies in a car accident. Their data cannot simply be discarded (that would introduce bias) nor treated as if they had "failed" the treatment. Survival analysis provides the mathematical framework to treat this partial observation in a statistically valid way.

The parallel with digital products is direct. On a platform like Spotify, at the moment you run the experiment analysis, many users are still active. They have not "churned" yet, but that does not mean they never will. Treating them as retained forever is optimistic; discarding them is absurd; and simply waiting for all of them to "die" before ending the experiment is unfeasible. This is exactly the problem survival analysis was invented for.
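To make the three options concrete, here is a minimal sketch of the Kaplan-Meier estimator, the classic non-parametric way to use censored observations without discarding them or pretending they are final. The function name and the toy data are hypothetical, chosen only for illustration; this is not the model used in any of the papers discussed here.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where a churn event occurred.

    durations: days each user was followed
    observed:  True if the user churned at that time, False if the user was
               still active when the analysis ran (right-censored)
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the share of at-risk users who survive past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical toy data: 8 users, 4 of them still active at analysis time.
durations = [5, 8, 8, 12, 14, 14, 14, 14]
observed = [True, True, False, True, False, False, False, True]

curve = kaplan_meier(durations, observed)

# "Retained forever" reading: censored users never churn, so 4 of 8 remain.
naive_retained = 1 - sum(observed) / len(observed)
# Kaplan-Meier estimate of still being active at day 14: about 0.45.
km_day_14 = curve[-1][1]
```

On this toy data the optimistic reading gives 0.50, discarding the censored users leaves only users who churned, and the Kaplan-Meier estimate lands at roughly 0.45, because a censored user counts as "at risk" only for the days they were actually observed.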

Spotify's work (Chandar et al., 2022) proposes a metric called time to inactivity, which measures how long it takes for a user to become inactive for an entire week. They model this using a Cox Proportional Hazards model, fed by user characteristics and engagement metrics observable in the first 14 days of the experiment. The model output is a survival curve per user, and the mean time to inactivity restricted to the analysis horizon (the so-called restricted mean survival time) is the quantity that enters the A/B test. It is worth noting that the Cox model carries as a central assumption the proportionality of hazards over time, and this assumption does not always hold in engagement data; the paper discusses this limitation and the validation framework they propose applies to other survival models as well.
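The restricted mean survival time is simply the area under the survival curve up to the analysis horizon. The sketch below computes it for a step-shaped curve; the function name and the two curves are hypothetical, and in the setup described above the curves would come from the fitted model per user rather than being typed in by hand.

```python
def restricted_mean_survival_time(curve, horizon):
    """Area under a right-continuous step survival curve from 0 to horizon.

    curve: [(t, S(t))] sorted by t, with S = 1.0 before the first entry
    """
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)  # flat segment up to the next drop
        prev_t, prev_s = t, s
    area += prev_s * (horizon - prev_t)  # last segment, cut at the horizon
    return area

# Hypothetical survival curves for the two arms of an experiment.
control = [(5, 0.875), (8, 0.75), (12, 0.6), (14, 0.45)]
treatment = [(6, 0.9), (10, 0.8), (13, 0.7)]

rmst_control = restricted_mean_survival_time(control, horizon=14)
rmst_treatment = restricted_mean_survival_time(treatment, horizon=14)
lift_in_days = rmst_treatment - rmst_control
```

The appeal of this quantity for an A/B test is that it has a plain unit: expected active days within the horizon. A difference between arms reads as "days of activity gained per user," which is easier to reason about than a hazard ratio.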

The practical result is interesting. In a corpus of 51 experiments run between March and December 2020, the predictive metric showed a time-dependent AUC of 0.90 in the first weeks and still 0.83 at a 24-week horizon. More importantly, it showed discriminative power comparable to observed 4-week retention, but with greater statistical power, which means, in practice, detecting real changes faster and with fewer users.


Surrogate metrics and the type I error problem

LinkedIn tackled the problem from a conceptually different angle. Spotify's case is primarily a censoring problem: the event of interest (inactivity) will happen, but it has not happened yet when it is time to close the experiment, and survival analysis solves this. LinkedIn's case is primarily a substitution problem: the measured metric (PCH, a predicted confirmed hire) is a distinct quantity from the north-star metric (confirmed hire) and only serves as an approximation of it. Both ultimately share the use of a predictive metric in the A/B test, but the statistical pitfalls involved are different.

LinkedIn's solution was to build a metric called PCH (Predicted Confirmed Hire), a predicted probability, for each application, of it turning into a confirmed hire. The prediction uses signals of application quality, job segment, and distribution of applications. Unlike the actual confirmed hire, PCH is available within a few days after the application.

Here an important and counterintuitive technical point appears, which their paper formalizes very well: using a predictive metric directly in an A/B test, as if it were the truth, inflates the false positive rate.

The intuition is as follows. Under reasonable assumptions (unbiased prediction and model error uncorrelated with the prediction), the variance of the real north-star metric is approximately the variance of the prediction plus the variance of the model error. When you run the t-test directly on the predicted metric, you are underestimating the variance of what actually matters. The result is an artificially low p-value.

The numbers are striking. With a model that has a predictive R² of 0.85 (that is, a reasonably good model), a p-value of 0.05 on the surrogate metric corresponds to a p-value of approximately 0.07 on the real north-star metric, so the p-value is understated by nearly 30%. In a controlled simulation with 10,000 samples under the true null hypothesis, they obtained 560 false positives instead of the 500 expected at 5% significance.
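The 0.05 to 0.07 arithmetic can be reproduced in a few lines. The sketch below is my own reading of the intuition, not the paper's exact estimator: it assumes a large-sample normal approximation, treats R² as the ratio of prediction variance to north-star variance, and keeps the estimated effect unchanged, so the test statistic shrinks by a factor of sqrt(R²). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def adjusted_p_value(p_surrogate, r_squared):
    """Two-sided p-value after inflating the variance by the model error.

    Assumes Var(north star) = Var(prediction) / R^2, so the z-statistic
    computed on the surrogate is scaled down by sqrt(R^2).
    """
    normal = NormalDist()
    z_surrogate = normal.inv_cdf(1 - p_surrogate / 2)
    z_adjusted = z_surrogate * sqrt(r_squared)
    return 2 * (1 - normal.cdf(z_adjusted))

# With R^2 = 0.85, a surrogate p-value of 0.05 becomes roughly 0.07.
p_true = adjusted_p_value(0.05, 0.85)
```

A perfect model (R² of 1) leaves the p-value untouched, and the gap widens quickly as the model gets worse: the weaker the prediction, the more of the real variance the naive test ignores.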

In the corpus of 203 real experiments they analyzed, this becomes even clearer. 30 experiments appeared statistically significant when looking directly at PCH. After applying the variance correction (adding the model error variance to the variance before computing the t-statistic), only 2 remained. After additionally applying variance reduction techniques (CUPED, which uses each user's pre-experiment behavior to discount metric noise via a simple linear regression, without introducing bias under randomization), the count rose to 10. These 10 are consistent with what is observed in the real north-star metric months later.
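For readers who have not seen CUPED before, the adjustment itself is short. The sketch below is a minimal version under stated assumptions: one pre-experiment covariate per user, theta estimated on the pooled data from both arms, and simulated data that is entirely hypothetical. It is not LinkedIn's pipeline.

```python
import random
from statistics import mean, pvariance

def cuped_adjust(metric, pre_metric):
    """Return Y - theta * (X - mean(X)) with theta = cov(X, Y) / var(X).

    metric:     in-experiment metric Y, one value per user
    pre_metric: the same user's pre-experiment covariate X
    """
    mx, my = mean(pre_metric), mean(metric)
    cov = sum((x - mx) * (y - my) for x, y in zip(pre_metric, metric))
    var = sum((x - mx) ** 2 for x in pre_metric)
    theta = cov / var
    return [y - theta * (x - mx) for x, y in zip(pre_metric, metric)]

# Hypothetical users whose in-experiment behavior tracks their past behavior.
random.seed(0)
pre = [random.gauss(10, 3) for _ in range(2000)]
post = [x + random.gauss(0, 1) for x in pre]

adjusted = cuped_adjust(post, pre)

# The mean is preserved while the variance drops sharply.
variance_before = pvariance(post)
variance_after = pvariance(adjusted)
```

Because the covariate is measured before assignment, randomization makes it independent of the treatment, so subtracting it removes noise without shifting the estimated effect. That is what lets real effects resurface after the variance correction has made the test more conservative.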

The lesson goes beyond LinkedIn's case. Any company that uses a predictive-model-based metric (and this includes any "quality score," "predicted satisfaction score," or "predicted LTV") without making this correction is systematically shipping features that appear positive but are not.

And there is an even more fundamental problem, prior to the statistical issue of variance. For a surrogate metric to be formally valid, the treatment must affect the north-star metric exclusively through the surrogate. This is Prentice's criterion (1989): conditional on the surrogate, treatment and final outcome must be independent. In other words, the surrogate must capture the entire causal path. If the treatment influences the final outcome through a route that the surrogate does not see, even the corrected test will systematically deceive. The criterion is restrictive in practice: almost no surrogate satisfies it exactly, and the modern methodological evolution starts precisely from there. The work of Athey et al. (2016) proposes combining multiple short-term variables into a surrogate index that approximates full mediation, with statistical guarantees more robust than those of a single surrogate. This is the conceptual piece that LinkedIn and Spotify each check, in their own way, when statistically validating the surrogate before adopting it in decision-making.
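In symbols, the criterion and the surrogate index can be written as follows. The notation is my own shorthand (T for treatment assignment, S for the surrogate, Y for the north-star outcome), and the index is shown without pre-treatment covariates for brevity.

```latex
% Prentice's criterion: once S is known, T carries no further
% information about Y.
f(Y \mid S, T) = f(Y \mid S)
\quad\Longleftrightarrow\quad
Y \perp T \mid S

% Surrogate index: collapse several short-term outcomes
% S = (S_1, \dots, S_k) into the expected long-term outcome.
h(S) = \mathbb{E}\left[\, Y \mid S_1, \dots, S_k \,\right]
```

The first line is the "entire causal path" requirement; the second is why several surrogates help, since a vector of short-term outcomes has a better chance of blocking every path from T to Y than any single one.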


It is time to level up

So far, traditional A/B tests have worked wonders in digital products and will continue to be the dominant working tool. But increasingly we need to bring the cutting edge of scientific development to gain more robustness. It is necessary to know how to evaluate the problem and how to evaluate the tool, to understand how far it takes you and when it is time to look for something new or to improve what you already have. Statistics has already dealt with problems of extreme complexity in recent decades. Health research is expensive and can even cost lives, so whenever you see a limitation in your testing framework, check whether statistics has already attempted to solve that problem in another setting.

Using this knowledge can be the difference between a false positive dragged on for months or years and the real success of your metric. It is hard to evaluate the success of metrics and features when you are always looking at the next test. That is why an adequate validation and research stage before the experiment can save time, suffering, and mistakes.

For those who want to go deeper, Survival Analysis: Techniques for Censored and Truncated Data (Klein & Moeschberger) is the canonical reference on survival analysis, and Causal Inference for Statistics, Social, and Biomedical Sciences (Imbens & Rubin) is the foundation for the causality problems that support the rest. It is also worth remembering that survival analysis and surrogate metrics do not exhaust the range of alternatives to traditional A/B testing: long-term holdouts, sequential tests with always-valid p-values, Bayesian approaches, and layered designs solve variations of the same problem, each with its own assumptions and costs.


References

  1. Duan, W., Ba, S., & Zhang, C. (2021). Online Experimentation with Surrogate Metrics: Guidelines and a Case Study. Proceedings of WSDM '21.
  2. Chandar, P., St. Thomas, B., Maystre, L., Pappu, V., Sanchis-Ojeda, R., Wu, T., Carterette, B., Lalmas, M., & Jebara, T. (2022). Using Survival Models to Estimate User Engagement in Online Experiments. Proceedings of WWW '22.
  3. Athey, S., Chetty, R., Imbens, G., & Kang, H. (2016). Estimating Treatment Effects using Multiple Surrogates: The Role of the Surrogate Score and the Surrogate Index. arXiv.
  4. Prentice, R. L. (1989). Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine, 8(4), 431–440.
  5. Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. Proceedings of WSDM '13.
  6. Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society: Series B, 34(2), 187–202.
  7. Xu, Y., Duan, W., & Huang, S. (2018). SQR: Balancing Speed, Quality and Risk in Online Experiments. Proceedings of KDD '18.
  8. Hohnhold, H., O'Brien, D., & Tang, D. (2015). Focusing on the Long-term: It's Good for Users and Business. Proceedings of KDD '15.
  9. Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data. Springer.
  10. Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
