“What am I ever going to use this for? Why does this matter? What time is lunch?” you constantly found yourself asking.
Then you land yourself in a career in digital marketing and think… I really should have paid a bit more attention in that class.
Enter statistical significance. It’s certainly not the sexiest topic, but it can play a big role in your testing strategy in PPC campaigns. If you’re interested in testing pieces of your account in a data-backed way, statistical significance is the way to go.
Let’s dive into what statistical significance is, things to consider when incorporating it into your PPC accounts, some tools you can use, and other considerations.
What is Statistical Significance?
There are quite a few definitions out there that might only make things a bit more confusing for some. For example, here is what Wikipedia says:
“In statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis.”
Yeah…that just leads to more questions.
Here’s the definition I found most approachable:
“‘Statistical significance helps quantify whether a result is likely due to chance or to some factor of interest,’ says Redman. When a finding is significant, it simply means you can feel confident that it’s real, not that you just got lucky (or unlucky) in choosing the sample.”
This is from an article on Harvard Business Review that might be a great read for those of you who are at least familiar with the topic of statistical significance.
Basically, as marketers, we’re trying to determine whether a test we set up is showing a winner, and whether that win reflects an actual difference in performance between our variants or is just a fluke of sampling.
Why Does Statistical Significance Matter in PPC?
When running tests in your PPC campaigns, it’s important to not only understand the results of the test but also how confident you should be about them. The goal is to identify the winning variant and feel confident that if you were to run the test again, you would get the same result.
Being confident in testing is important because we as PPC pros rely on a compounding effect with our testing. We set up one test between two variants, let the test run, review the data, choose a winner, pause the loser, and pit a new challenger against the winner.
Each time we retain the winning variant and create a new challenger. Usually, the new challenger has some relation to the previous winner. For example, if we tested offering a $10 discount versus a 5% discount and the $10 message won, we would most likely keep using that $10 messaging and test another aspect of our copy. This new test would have $10 reflected in both variants, with the test being some other wording.
If we were to draw a false conclusion from a test and keep iterating off of a false winner, our performance could suffer in the long run.
When determining statistical significance, we’re looking for a confidence level in our results. Most often, you’ll see this represented as a percentage, such as 95% confidence. A common shorthand is that if we were to run this test again, we would get a similar set of results 95 times out of 100. Strictly speaking, it means something a bit narrower: if there were truly no difference between the variants, a gap this large would show up by chance only about 5 times out of 100. (If you’re interested in more of that nuance, definitely read that HBR article.)
The confidence level you choose is up to you, but keep in mind that the higher the percentage you choose, the more data you’ll need to reach it. More often than not, I see advertisers prefer either 90% or 95%, and it’s my personal preference to stick to that range.
Anything higher than 95% can be a very tough result to achieve. Drop much below 90% and you’re heading toward 80%, where you’re only confident that the test would win 4 out of 5 times instead of 9 out of 10, and that’s getting a little close for my liking.
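If you’d like to see where a confidence number comes from, here is a minimal sketch of one common approach, a two-sided two-proportion z-test on click-through rate. It is not necessarily what any particular tool does under the hood. The function name `ab_confidence` and the click and impression counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def ab_confidence(clicks_a, impressions_a, clicks_b, impressions_b):
    """Two-sided two-proportion z-test on CTR.

    Returns (ctr_a, ctr_b, confidence), where confidence = 1 - p-value.
    Hypothetical helper for illustration; assumes both impression counts > 0.
    """
    ctr_a = clicks_a / impressions_a
    ctr_b = clicks_b / impressions_b

    # Pooled rate: the best estimate of CTR if the two ads were truly identical.
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))
    if std_error == 0:
        # No clicks at all (or all clicks): nothing to distinguish the variants.
        return ctr_a, ctr_b, 0.0

    z = (ctr_b - ctr_a) / std_error
    # Chance of seeing a gap at least this large if there were no real difference.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return ctr_a, ctr_b, 1 - p_value


# Made-up numbers: ad A gets 200 clicks on 5,000 impressions, ad B gets 250 on 5,000.
ctr_a, ctr_b, confidence = ab_confidence(200, 5000, 250, 5000)
print(f"CTR A: {ctr_a:.2%}  CTR B: {ctr_b:.2%}  confidence: {confidence:.1%}")
```

With these example numbers the result clears the 95% bar. Cut both ads down to a tenth of the volume at the same rates and it no longer does, which is the volume problem in a nutshell.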
Setting Up Tests For Success
Now, I’m not going to turn this into a Stats class. That simply would be far too boring.
Instead, I’d like to briefly talk through three factors that I think hinder PPC pros in their testing more than anything: specificity, volume, and time. In my experience, these factors all play important roles in testing. Sometimes they work together, and sometimes they are at odds with each other.
Here are some things to keep in mind about each.
Many PPC pros like to control nearly everything. We can see the nuance between the keywords “running shoes” and “women’s blue running shoes” and typically want to craft ad messaging that speaks to each as opposed to a message that can work for both.
The same goes for nearly everything else we can test in our campaigns. We want to have individually crafted strategies for each aspect of our account because, in theory, that should give us the most control and best opportunity for success at the keyword, ad, bid modifier, etc. level.
But how should we balance this desire for specificity with volume? They are always going to be at odds with each other.
To have a statistically significant test, you’ll almost always need at least a couple hundred users to sample from. If you’re testing at an individual keyword level, that could take a while to achieve.
When setting up your tests, think about the amount of volume that normally passes through the keywords, ad groups, campaigns, etc. you want to test. Is it a lot? A little? Somewhere in between?
Each level of volume means something different to how you set up a test. If each keyword only has a small number of impressions, but your campaign on the whole normally has a few hundred impressions a month, it might be better to set up an aggregate ad test at the campaign level to ensure you have enough data to determine a winner within that month.
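To put rough numbers on the volume question, here is a sketch that estimates how many impressions each variant needs before a given lift can be detected. It uses the standard normal-approximation formula for comparing two proportions. The 80% power setting, the function name `sample_size_per_variant`, and the example rates are my assumptions, not figures from any specific account.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, confidence=0.95, power=0.80):
    """Approximate impressions needed per variant for a two-sided test.

    baseline_rate: current rate, e.g. 0.04 for a 4% CTR
    relative_lift: smallest lift worth detecting, e.g. 0.25 for +25%
    power: chance of catching a real lift of that size (assumed 80% here)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical example: 4% baseline CTR, hoping to detect a 25% relative lift.
needed = sample_size_per_variant(0.04, 0.25)
print(f"Impressions needed per variant: {needed}")
```

The required volume climbs quickly as the baseline rate or the expected lift gets smaller, so a keyword-level test on a low-traffic term can need far more impressions than it will see in a month. That is the case for aggregating at the ad group or campaign level.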
Time can also play a large role in how you set up tests. As mentioned above, the amount of volume flowing through a test can impact how quickly you’re able to determine a winner with any statistical significance. Sometimes it’s OK to have a test that runs for a while, but at other times, it can be a hindrance to progress in the account.
For high-volume areas, I like to run a test for at least a week, if not two. This allows me to rule out any day-of-week fluctuations that could cause a false result. Usually after two weeks, if I have a confidence level of 90% or higher, I’ll end the test and start a new one.
On the other end of the spectrum, having low volume through your campaigns will usually mean a longer testing cycle. I prefer not to have tests last for more than a month, but in certain instances I can let them run for two months if needed. That said, the longer you run a test without turning over a winner, the longer you’re delaying that compounding effect I mentioned earlier. If you’re able to, it might make sense to combine numerous portions of your account for aggregate testing so you can turn over tests more quickly.
In other instances, there’s simply no way around a long test. If your account is a low volume account and has no chance to scale up, you’re going to be fighting an uphill battle with statistical significance. In that case, there are a couple of things you can do: set your expectations for longer testing cycles or lower your standards away from the 90%-95% range down into the 80% range or so to help you turn over tests more frequently.
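To get a feel for that trade-off, the sketch below estimates how long a 50/50 split test would take at different confidence levels. It uses a simplified sample-size approximation, and the 300 impressions a day, 4% baseline rate, 25% lift, and 80% power are all made-up assumptions. `days_to_finish` is a hypothetical helper, not part of any ad platform.

```python
from math import ceil
from statistics import NormalDist


def days_to_finish(daily_impressions, baseline_rate, relative_lift, confidence, power=0.80):
    """Rough number of days for a two-variant test split evenly across traffic."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2) + NormalDist().inv_cdf(power)

    # Simplified per-variant sample size (unpooled variance approximation).
    per_variant = z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2

    # Both variants share the daily traffic, so the test needs 2x that volume.
    return ceil(2 * per_variant / daily_impressions)


# Hypothetical low-volume account: 300 impressions a day across the whole test.
for level in (0.95, 0.90, 0.80):
    days = days_to_finish(300, 0.04, 0.25, level)
    print(f"{level:.0%} confidence: about {days} days")
```

Under these assumptions, lowering the bar from 95% to 80% shortens the test considerably, at the cost of more false winners. Whether that trade is worth it depends on how much a wrong call would cost you.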
Tools for Determining Statistical Significance
There is a boatload of free and paid tools out there to help you with statistical significance.
Personally, I use the KISS Metrics tool. When it comes to A/B ad testing, we have an AdWords script created by one of our team members that helps quickly determine the statistically significant winner in your ad test.
Revisiting the HBR Article: Science vs. Business
One final takeaway I very much enjoyed in the HBR article is the expert’s differentiation between science and business uses for statistical significance. He even wrote a book on it.
I very much like the distinction that “science is meant to stand the test of time”. Science doesn’t change. The laws of physics are always the same. So using statistical significance here makes quite a bit of sense.
Marketing is never going to “withstand the test of time”. If you were to ask Don Draper about how you should market baked beans, I bet he wouldn’t have ever thought of a talking dog named Duke. Same goes for toothpaste. No one cared about whitening back in the day. They just liked that one of the brands had mint!
So what I’m saying is this: test all the factors you possibly can for statistical significance, choose your winner, then hurry and go out to find something new to test because this previous winner will not stand the test of time. Always be testing, always try to be improving, but also, always use data to help make the most informed and confident decisions.
What tools do you use for statistical significance? We’d love to hear about your processes for testing in the comments!