Motivation: originally written in May 2020 to help me and the Growth team at Boardable put together a road-map of product experiments to optimize the user experience, in particular the first-time experience, leading new users to the value in our solution. This is based on many different sources and learnings I’ve garnered from other books, sites, and podcasts. Extra shout-out to Reforge.com and their excellent courses on Product-Led Growth and product experimentation.
Seeing Clearly
Experimentation helps us get a clearer understanding of how users find value in our product, what our users’ needs are, and how we can best help them address those needs through our product. We want to understand the good and the bad, the positive and the pain as best we can so that we can keep improving our product and increasing the impact we have through growing our customer base.
Themes
In order to keep building towards ‘big wins,’ experiments should be set up to address one big theme (or several, if the team has enough capacity to run overlapping themes). The danger is that many disconnected experiments cannot build on one another and will not paint a clear picture of how we are driving growth. Focusing on one problem area helps Growth teams keep driving towards collective, bigger wins that clearly show team success.
Winning
Any experiment that gives us insight into our users and our product is a success. The only ‘failures’ are inconclusive or badly executed tests. If we test a hypothesis and find out it is conclusively wrong we can learn and create better hypotheses!
A good experiment involves 4 steps:
Developing a Hypothesis,
Defining a Minimal Viable Test (MVT),
Analysis, and
Deciding and Communicating.
Assumptions
What is the one underlying assumption that must be true for your hypothesis to be true? Or, what is the one underlying assumption that if false makes everything else meaningless? Drive towards that assumption, because it is more likely to provide the level of insight that leads to ‘big wins.’ Lean into an inquiry model like the 5 Whys to dig deeper. Having clarity on the assumptions that underlie any hypothesis is critical to building a robust system of experimentation that can lead to meaningful improvements.
Developing Hypotheses
Start from a problem. With a clear problem statement, brainstorm and develop best-bet hypotheses to address it. Big problems (we are losing a lot of free trial users before they subscribe) can be broken down into smaller problems (lots of users are leaving after one session; it’s taking users too long to find value; too many users do not know what the product really does before signing up). If the problem statement combines many issues, break it down into sub-problems. For each hypothesis, keep the problem in focus and work from there, with the goal of creating the highest-value experiments.
Starting from a thing we know we can do is always messy to test and prove, and increases the chance of an inconclusive result. Be careful of starting from ‘this will be easy to do’ and working backwards to a problem. The same goes for measurement: starting from ‘this is easy to measure’ and working backwards to a problem is likely to miss the real problems that need to be solved.
One Solution for One Problem. A testable hypothesis should have 1 problem statement, and 1 hypothesis statement. Many hypotheses can be formed around 1 problem, and each will indicate separate tests. The problem and hypothesis need to be rooted in evidence (observations or data). The hypothesis should state an action to take, how it addresses the problem, and the positive result.
An example problem: “We see that 60% of people that start trial sign ups do not complete them.”
An example hypothesis: “Reducing the number of steps in the sign up process will make it quicker for people to sign up and increase the number of complete sign ups.”
The null hypothesis we are trying to reject: “The number of steps in the sign up process does not significantly impact the sign up completion rate.”
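To make this concrete, here is a minimal sketch (with invented counts, not real Boardable data) of how that null hypothesis could be tested once the experiment has run, using a two-proportion z-test from statsmodels:

```python
# Hypothetical A/B results: control = current sign-up flow,
# variant = the shortened flow. All numbers are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest

completions = [400, 470]   # completed sign-ups per group
starts = [1000, 1000]      # started sign-ups per group

z_stat, p_value = proportions_ztest(count=completions, nobs=starts)
print(f"control: {completions[0] / starts[0]:.0%}, variant: {completions[1] / starts[1]:.0%}")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# If p falls below a threshold chosen up front (commonly 0.05), we reject
# the null hypothesis that step count has no effect on completion rate.
if p_value < 0.05:
    print("Conclusive: reject the null hypothesis.")
else:
    print("Inconclusive: we cannot reject the null with this sample.")
```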
Aggressively test the problem. A gentle test is more likely to be inconclusive and thus a waste of time and effort. We need to balance this with the potential risk, while being as aggressive in our tests as we can in order to discover the biggest wins.
Value. How significant an impact could solving this problem have? If it will only increase X by 1% or 2% it may not be worth the time and effort, but if it increases X by 50% or 200% it could be worth a lot of investment and risk.
Measurable. Do we have the tools ready to measure the impacts of the test, and can they be measured with an appropriate investment of effort? Can we be really confident that what we measure will be accurate? (For example: Google Analytics only catches 95% of traffic, at best, so it introduces at least 5% error in everything we measure there.)
Trade-off metrics. Are there any potential negative impacts of carrying out this experiment? For example: making the trial sign-up process much easier will increase new sign-ups, but might also increase the number of inactive trials (because casual users will sign up more frequently).
Null hypothesis. In statistics we never prove our hypothesis conclusively (there’s always a chance that some unknown other forces caused the differences we see during our experiment), but we can often reject the null hypothesis: that our change and the outcome are unrelated. For example, a null hypothesis might be: there is no correlation between the text of the free trial sign-up buttons and the number of sign-ups. If we observe a strong enough correlation, we can reject that null hypothesis.
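As a rough illustration of what rejecting a null hypothesis looks like (the button-text example above, with invented numbers), a simple permutation test shuffles the group labels to see how often chance alone produces a gap as large as the one we observed:

```python
import random

random.seed(7)  # make the shuffles reproducible

# Hypothetical outcomes per visitor: 1 = signed up, 0 = did not.
button_a = [1] * 120 + [0] * 880   # 12.0% sign-up rate
button_b = [1] * 155 + [0] * 845   # 15.5% sign-up rate

observed = sum(button_b) / len(button_b) - sum(button_a) / len(button_a)

pooled = button_a + button_b
extreme, trials = 0, 10_000
for _ in range(trials):
    random.shuffle(pooled)  # sever any real link between button text and outcome
    a, b = pooled[:1000], pooled[1000:]
    if abs(sum(b) / 1000 - sum(a) / 1000) >= abs(observed):
        extreme += 1

# A small p-value means chance alone rarely produces a gap this large,
# so we can reject the null hypothesis of no relationship.
print(f"observed gap: {observed:.3f}, p ≈ {extreme / trials:.4f}")
```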
Defining a Minimal Viable Test
How minimal is minimal enough, but not so minimal that it is ineffective? We want to invest the optimal amount of time and energy in the test, and no more than we need to (hence ‘minimal’). The danger here is that we make the test too light, without enough impact to see any change. (The ‘make the button bigger’ type of test is rarely going to have the desired magnitude of impact.)
Consider how it fits with other active or planned tests, and ensure that it does not conflict (in terms of influencing the data of other tests).
Prioritize. Once defined, prioritize amongst all currently possible MVTs. Which one on the table is likely to give us the quickest, most valuable win? Is there an order to MVTs that will have the most impact (such as fixing problems that are earlier in the user experience to get more users further into the product where other tests can then have more impact)?
Use ExperimentCalculator.com to get an idea of how many people/sessions will need to be sampled and how long the experiment will need to run.
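For a feel of the arithmetic such a calculator performs under the hood, here is a sketch using statsmodels’ power analysis; the baseline rate, target lift, and daily traffic below are assumptions for illustration only:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.40   # current completion rate (assumed)
target = 0.46     # smallest lift worth detecting (assumed)

effect = proportion_effectsize(baseline, target)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,    # tolerated false-positive rate
    power=0.80,    # chance of detecting the lift if it is real
    alternative="two-sided",
)

daily_starts = 80  # assumed sign-up starts per day, split across two groups
print(f"~{n_per_group:.0f} users per group")
print(f"~{2 * n_per_group / daily_starts:.0f} days to run")
```

Smaller expected lifts demand much larger samples, which is one more reason to test aggressively rather than gently.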
Analysis
One success metric. Pick one metric that will determine success. Any number of other metrics can also be watched for potential insight, but a clear single point will indicate if the test was conclusive.
In small tests, watching the chosen metric move in the right direction can be sufficient, but in other cases calculate statistical significance and confidence before closing the test and drawing conclusions.
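One way to do that final check, sketched here with invented counts, is a confidence interval on the lift in the success metric:

```python
# A 95% Wald confidence interval for the difference between two rates;
# the counts below are hypothetical, not real results.
from math import sqrt

from scipy.stats import norm

def lift_confint(success_a, n_a, success_b, n_b, confidence=0.95):
    """Confidence interval for (rate_b - rate_a)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = norm.ppf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = lift_confint(400, 1000, 470, 1000)
print(f"95% CI for the lift: [{low:+.3f}, {high:+.3f}]")
# An interval entirely above zero means the metric conclusively moved up;
# one that straddles zero means the test is inconclusive.
```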
Deciding and Communicating
Decide. Based on the chosen success metric, decide to close the experiment at the point that a conclusion can be reached. Then make a clear conclusion: was the test conclusive, and if so, did the test move the metric in a positive or negative direction? Additional observations can be attached to this conclusion (if other metrics moved in similar or opposite directions).
Communicate. The team needs to be proactive in communicating successes and conclusive test results internally, to other teams, and to the wider company. What we do is not obvious to everyone, and they will not see our outcomes in their day-to-day work. It is up to us to share what we are learning with the people who can benefit the most from it!
On the Dangers of Inconclusivity
In terms of the momentum of the team, and growing a culture of experimentation, inconclusive results can be very harmful early on. Inconclusive tests are par for the course, but without conclusive up/down results it is hard to communicate impact. And it is easy to become disheartened without clear impact. From experience, in order to build a ‘test first’ muscle, impacts and ‘wins’ are needed. See above on testing aggressively, and driving towards positive/negative outcomes that are conclusive.
Inconclusive results are like coloring in the background first: useful for eventually seeing the entire picture, but coloring in the primary character first is much more fun and encouraging! Any 5-year-old will confirm as much. 🙂
May 2020
Dr. Ben Smith is a Data Scientist and thinker, fascinated by the appearance of computers in our daily lives, creativity, and human struggles. He has had the privilege to think, learn, and write at the University of Illinois, the National Center for Supercomputing Applications, the Cleveland Institute of Art, Case Western Reserve U., IUPUI, and at Boardable: Board Management Software, Inc.
If you have feedback or questions please use Contact to get in touch. I welcome thoughtful responses and constructive critique.