Have Hypothesis, Will Test

In this article I detail the approach we, the Shape Intelligence Center team, take when rigorously analyzing customers’ threat and automation data. In this example I’ll walk you through a two-sample statistical test that stands up to deep technical scrutiny and provides valuable insight to our customers.


When we provide quarterly threat briefings to customers, an inevitable question we get is: “How am I doing compared to my peers?” The easiest way to respond to this question is to provide aggregated comparisons for specific application and platform flows. For example, we can tell Customer X that 17% of their 2020 Q2 web login traffic was automated, while 13% of 2020 Q2 web login traffic for their peers in Industry Y was automated. So, one option for a response to this question is neatly summarized in a bar chart.
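For the curious, here’s a minimal matplotlib sketch of how a chart like Figure 1 could be produced; the labels and the hard-coded percentages simply mirror the aggregates quoted above, and the variable names are purely illustrative.

```python
import matplotlib.pyplot as plt

# Aggregated 2020 Q2 web login automation percentages quoted above
labels = ["Customer X", "Industry Y Peers"]
automation_pct = [17, 13]

fig, ax = plt.subplots()
ax.bar(labels, automation_pct)
ax.set_ylabel("Automated Login Traffic (%)")
ax.set_title("Login Automation, 2020 Q2")
plt.show()
```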

Figure 1: Login Automation for 2020 Q2 for Customer X and Peers in Industry Y


Although this graphic is a stunning example of the basics of Python’s matplotlib, I’m sure all math lovers like me took one look at it and immediately asked: “Is this difference significant?” Aggregations have a sneaky way of hiding the details, and Customer X’s overall automation might be influenced by one or two massive attacks. To answer this question, we need to dive into what happened during 2020 Q2 to decide if Customer X really has more automation than their peers. And this answer is usually what customers actually want to hear, even if their question wasn’t: “What are the results of the two-sample statistical tests that you ran to determine whether our automation is statistically different from that of our peers?”


In this post, I’ll walk through this example to explain how to decide if 17% is truly bigger than 13%, and discuss how this work can actually tell customers how they’re doing.

Step 1: Start at the beginning

A natural way to compare all of 2020 Q2 for Customer X and their peers is to look at the data on a daily basis. So the first step is to acquire the daily web login automation percentages.
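Here’s a rough sketch of that step, assuming the daily percentages live in a CSV with date, group, and automation_pct columns; the file name and column names are placeholders for whatever your data source actually provides.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical source: one row per day per group,
# with columns date, group, and automation_pct
df = pd.read_csv("daily_login_automation_2020q2.csv", parse_dates=["date"])

fig, ax = plt.subplots()
for group, daily in df.groupby("group"):
    ax.scatter(daily["date"], daily["automation_pct"], label=group, alpha=0.7)
ax.set_ylabel("Daily Automation (%)")
ax.set_title("Daily Login Automation, 2020 Q2")
ax.legend()
plt.show()
```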

Figure 2: Daily Login Automation % for 2020 Q2 for Customer X and Peers in Industry Y


Automation tends to be erratic and unpredictable, and the scatter plot certainly reflects that. Both sample sets appear to have some outliers, but Customer X does seem to have more automation during the first half of the quarter. Our next step is to review some summary statistics for both sample sets, namely the means, medians, and standard deviations.
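Continuing with the hypothetical DataFrame from the previous sketch, pandas makes those summary statistics a one-liner.

```python
# Mean, median, and standard deviation of the daily automation % for each group
summary = df.groupby("group")["automation_pct"].agg(["mean", "median", "std"])
print(summary)
```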

In both cases, we can see the mean is pulled upward by erratic spikes in automation, and the median automation percentage is consequently lower. This is further reflected in the high standard deviation for both Customer X and the Industry Y peers. But the similarity between those two standard deviations means it is reasonable to compare the two sets. More importantly, Customer X has both a higher mean and a higher median, indicating it’s worth formally testing whether their automation is genuinely higher. Whether we care about the mean or the median specifically depends on which test we select, which is our next step.

Step 2: Assumptions can make a bad test selection out of you and me

The basic statistical testing most of us are familiar with typically assumes the data is approximately normal. I could expound upon what normal means for a long time, but that’s not the purpose of this post, and I might accidentally start a war between parametric and nonparametric statisticians. For my work, I use two things: intuition about the data (usually supported by a scatter plot), and visual inspection of a probability plot. I intuitively believe daily automation percentages are non-normal, and the scatter plot from earlier appears to back this claim. The stats portion of scipy provides a built-in function to generate probability plots, so we can easily inspect them for Customer X and the peer data.
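That built-in function is scipy.stats.probplot. Here’s a minimal sketch, again reusing the hypothetical DataFrame from earlier to split out the two sample sets.

```python
import matplotlib.pyplot as plt
from scipy import stats

# Split the daily percentages into the two sample sets (hypothetical column names)
customer_x = df.loc[df["group"] == "Customer X", "automation_pct"]
peers = df.loc[df["group"] == "Industry Y Peers", "automation_pct"]

# One probability plot per sample set, each compared against a normal distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(customer_x, dist="norm", plot=ax1)
ax1.set_title("Customer X")
stats.probplot(peers, dist="norm", plot=ax2)
ax2.set_title("Industry Y Peers")
plt.show()
```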

Figure 3: Probability Plots for Customer X and Peers in Industry Y


These plots compare the quantiles of the two sample sets against the quantiles of a normal distribution. What we want to see, in order to claim the data is approximately normal, is a scatter plot that roughly falls along the best-fit line, shown in red. What we don’t want to see, in order to claim the data is approximately normal, is exactly what we do see: a clear shape to the scatter plot that is NOT on the best-fit line. The probability plots, combined with our initial idea about the data, indicate that we should not assume normality. As a result, we have to pick a test for non-normal data.


The second common assumption of many two-sample statistical tests is that the two sets of data have equal variances. My approach is to avoid theoretical musings on this assumption and test it directly using Levene’s test. Levene’s test is specifically designed to determine whether two or more groups have the same variance. It’s perfectly suited for figuring out which assumption is relevant for our data, and is readily available in scipy’s stats.
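Running it is a one-liner with scipy.stats.levene; this sketch reuses the customer_x and peers series defined in the probability plot snippet, and the exact p-value will of course depend on your data.

```python
from scipy import stats

# Levene's test: the null hypothesis is that the two sample sets have equal variances
stat, p_value = stats.levene(customer_x, peers)
print(f"Levene statistic: {stat:.3f}, p-value: {p_value:.3f}")
```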

The null hypothesis for Levene’s test is that the variances are equal, and with a p-value that large we fail to reject it. As a result, we now have the two assumptions we need to select our test: our data is non-normal, and our variances are equal.

Step 3: Determine where to insert “statistically significant” into your groundbreaking results

Given these assumptions, the test for us is Mann-Whitney U, which also goes by approximately 15 other names. Specifically, we want to test the hypothesis that Customer X’s login automation percentages are higher than their peers’. In a surprise to no one, we can run this test easily with scipy’s stats.
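Here’s a sketch of that call, once more using the customer_x and peers series from earlier; passing alternative="greater" makes it the one-sided test that Customer X’s percentages are higher.

```python
from scipy import stats

# One-sided Mann-Whitney U test: are Customer X's daily automation
# percentages stochastically greater than the peers'?
stat, p_value = stats.mannwhitneyu(customer_x, peers, alternative="greater")
print(f"Mann-Whitney U statistic: {stat:.1f}, p-value: {p_value:.5f}")
```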

Wow, look at that tiny p-value! Our groundbreaking result is that Customer X’s daily login automation percentages for 2020 Q2 are statistically significantly higher than their peers’ daily login automation percentages for 2020 Q2, based on two-sample testing using the Mann-Whitney U test. But perhaps this is not quite the groundbreaking result we want to convey to Customer X.

Step 4: Translate those groundbreaking results into a statement that actually makes sense

This translation can take many forms. My general approach when talking to Customer X is to say something along these lines: “Based on statistical testing, your login automation for 2020 Q2 was higher than your peers’.” That’s really the key point we need to convey. And they probably are not interested in the exact methods I used, although I’m always happy to explain. Usually I would expound upon that statement a little bit more to say, “We analyzed all of the data for 2020 Q2, and confirmed that your higher automation level was not just the product of a few large attacks.” These two simple sentences convey to Customer X that we thoroughly compared how they were performing in relation to their peers. Although we are likely to still show them the simple bar chart above, we have done the work to rigorously support any conclusions we draw from that basic graphic.


As you can see, two-sample statistical testing really lets you tell a customer how they are doing relative to their peers. It is important to select the right test, so checking your assumptions (normality, equal variances) helps make sure you provide the right results. Questions or comments are welcome.


