Scenario. You are a Data Scientist II tasked with designing an A/B testing strategy for a new AI-powered code recommendation feature. The product team wants to launch a pilot, but success metrics are currently vague ('improve developer productivity'), and there are strong cross-functional concerns about latency, user trust, and experiment contamination. You must lead a discovery conversation to define precise primary and guardrail metrics, determine the experimental unit, address potential interference, and establish a rollout plan that satisfies both product and engineering stakeholders.
Problem to solve. Frame the experimentation strategy by asking targeted questions about metric definitions, statistical power, interference risks, and rollout constraints to produce a clear test design.
Format
discovery-interview · 40 min · ~2 hr prep
Success criteria
- Defines primary, secondary, and guardrail metrics with clear operational definitions
- Identifies experiment design risks (e.g., network effects, contamination) and proposes mitigations
- Aligns on sample size, duration, and decision criteria
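As a reference point for the sample size criterion above, here is a minimal sketch of the standard normal-approximation formula for comparing two means, n = 2(z_{1-α/2} + z_{power})²σ²/δ² per arm. The function name, the default α and power, and the example inputs are illustrative assumptions, not values from the scenario.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(sigma, min_detectable_diff, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided, two-sample test of means.

    sigma: assumed standard deviation of the metric (same in both arms).
    min_detectable_diff: smallest absolute difference worth detecting.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    n = 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / min_detectable_diff ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs: a noisy metric (sigma = 30) and a small target lift (2).
    print(sample_size_per_arm(sigma=30.0, min_detectable_diff=2.0))
```

Because the required n grows with σ² and shrinks with δ², a high-variance metric or a small target lift raises the sample size quickly, which is why the metric definition and the duration have to be agreed together.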
What to review beforehand
- A/B testing fundamentals and common pitfalls
- Metric hierarchy and guardrail design
- Experimentation platform capabilities at Series B scaleups
Ground rules
- Drive the conversation to uncover constraints and definitions
- Focus on experimental design and metric alignment, not implementation details
- Do not produce a formal document during the call
Roles in scenario
Jordan Lee, Product Manager (informed_partner, played by cross_functional)
Motivation. Wants to validate that AI recommendations increase developer engagement without degrading IDE performance or causing user frustration.
Constraints
- Feature flag infrastructure has a 5% traffic cap for new experiments
- Engineering team is concerned about increased API latency from LLM calls
- Historical data shows high variance in developer session lengths
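The traffic cap and the variance constraint interact: a capped enrollment rate stretches the time needed to reach a given sample size. The sketch below shows that arithmetic under two loudly stated assumptions: the 5% cap applies to all users enrolled in the experiment (both arms combined), and `daily_eligible_users` counts new unique developers per day. Both the function and its inputs are hypothetical.

```python
import math


def days_to_enroll(n_per_arm, daily_eligible_users, traffic_cap=0.05, arms=2):
    """Estimate days needed to enroll the required sample under a traffic cap.

    Assumes the cap covers every arm of the experiment and that each day
    brings `daily_eligible_users` developers who were not enrolled before.
    """
    if not 0 < traffic_cap <= 1:
        raise ValueError("traffic_cap must be in (0, 1]")
    enrolled_per_day = daily_eligible_users * traffic_cap
    return math.ceil(n_per_arm * arms / enrolled_per_day)


if __name__ == "__main__":
    # Hypothetical inputs: 3,100 users per arm, 10,000 new eligible users per day.
    print(days_to_enroll(n_per_arm=3100, daily_eligible_users=10_000))
```

If most daily users are returning rather than new, the real enrollment rate is lower and this estimate is optimistic.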
Tensions to introduce
- PM initially wants to measure success by 'number of code completions', which may incentivize low-quality suggestions
- PM is unaware that network effects could contaminate the control group if recommendations spread across teams
- Engineering wants a strict 2-week timeline, but the high variance in session length may require a longer run to reach adequate statistical power
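One common mitigation for the network-effect tension is to randomize by team rather than by individual, at the cost of a larger sample. The sketch below pairs a deterministic hash-based team assignment with the standard design-effect inflation, DEFF = 1 + (m - 1) × ICC, where m is the average team size and ICC is the intra-cluster correlation. The salt, team identifiers, and numeric inputs are all hypothetical, and team-level randomization is shown as one option, not the design the scenario prescribes.

```python
import hashlib
import math


def assign_arm(team_id, salt="code-rec-pilot"):
    """Deterministically map a team to an arm so teammates share one experience."""
    digest = hashlib.sha256(f"{salt}:{team_id}".encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"


def design_effect(avg_team_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1 + (avg_team_size - 1) * icc


def inflated_sample_size(n_individual, avg_team_size, icc):
    """Scale an individually randomized sample size for team-level randomization."""
    return math.ceil(n_individual * design_effect(avg_team_size, icc))


if __name__ == "__main__":
    # Hypothetical inputs: teams of about 8 developers with an ICC of 0.1.
    print(assign_arm("team-42"))
    print(inflated_sample_size(n_individual=3100, avg_team_size=8, icc=0.1))
```

The inflated sample size feeds directly into the timeline discussion: a design that controls contamination may not fit a 2-week window under a capped traffic allocation.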
In-character guidance
- When asked, provide honest answers about traffic caps, latency limits, and business goals
- Clarify definitions when pressed, but don't volunteer them upfront
- Push back gently if the candidate ignores engineering constraints
Do not
- Do not suggest the correct experimental design or metrics
- Do not reveal latency caps or traffic limits unless explicitly asked
- Do not accept vague metric definitions without asking for clarification
Scoring anchors
- Exceeds. Rigorously deconstructs vague goals into a statistically sound experiment design, anticipates interference and latency risks, and aligns cross-functional stakeholders on a phased, data-driven rollout.
- Meets. Defines clear primary and guardrail metrics, accounts for basic experimental constraints, and proposes a reasonable test duration and sample size strategy.
- Below. Relies on vanity metrics, ignores engineering or statistical constraints, proposes an experiment without clear success criteria, or fails to address contamination risks.