Big Data Analysis and Thought Diversity: Lessons From Anscombe's Quartet

By Aaron Ferguson, Technical Director-Cyber & Information Analytics Office, National Security Agency

According to Gartner Incorporated, the world's leading information technology research and advisory company, Big Data is defined as “high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” I submit that all organizations that collect, process, and analyze Big Data face similar insight and decision-making challenges. This is especially true in cyber. In light of the recent Office of Personnel Management (OPM) and Internal Revenue Service (IRS) data breaches, organizations are rapidly looking for innovative Big Data analysis and analytic development methods to extract value and actionable intelligence from their data. This value and actionable intelligence drive the planning, development, and deployment of effective mitigations. However, trying to meet these challenges with thought-homogeneous teams (teams with little to no variety of opinions, expertise, or perspectives) will result in: (1) a reduced ability to counter and/or neuter adversaries’ ability to deny, degrade, and destabilize the organization's computer networks; and (2) an inability to plan, develop, and deploy mitigations that matter.

According to a 2013 study by Deloitte, cultivating diversity of thought on analysis teams can boost innovation and creative problem solving. Francis Anscombe’s seminal 1973 paper, “Graphs in Statistical Analysis,” offers important lessons for how analysts approach Big Data analysis. The combination of thought diversity and a visualize-first, analyze-second (VFAS) approach can make Big Data analysis a more valuable investment for any organization.

Anscombe’s Quartet

Anscombe constructed four fictitious data sets that have nearly identical simple statistical properties yet appear very different when graphed. Each data set consists of eleven (x, y) points, shown in the accompanying table. The summary statistics for each data set are close to identical:
mean x value = 9
mean y value = 7.50
correlation between x and y = 0.816
variance for x = 11
variance for y = 4.12
trend line equation is y = 0.5x + 3
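
The summary statistics listed above can be reproduced with a short Python sketch. The data values are the commonly reproduced figures from Anscombe's 1973 paper. The names QUARTET and summarize are illustrative choices, not taken from the paper. Variances are sample variances (divisor n - 1), and the trend line is an ordinary least-squares fit.

```python
from statistics import mean, variance

# Anscombe's four data sets; sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I":   (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the mean, sample variance, correlation, and least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(xs),
        "mean_y": my,
        "var_y": variance(ys),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(f"{name:>3}: " + ", ".join(f"{k}={v:.3f}" for k, v in stats.items()))
```

All four rows print nearly the same numbers, which is exactly why the summary statistics alone cannot tell the data sets apart.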

Based on these summary statistics, analysts might conclude that these data sets, although numerically different, show the same statistical behavior and must therefore describe the same underlying behavior. However, as shown in Figure 1, Dataset I shows a linear relationship between x and y, while Dataset II shows a strong nonlinear relationship between x and y. The latter graph indicates that nonlinear regression may have been the proper tool to use. Dataset III shows a linear relationship between x and y, except for one large outlier, while Dataset IV shows x remaining constant, except for one outlier. This “quartet” shows that “things are not always what they seem.”
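
The contrast between Dataset I and Dataset II can be seen even without a plotting library. The following is a minimal sketch, assuming a plain character grid is an acceptable stand-in for a proper scatter plot. The function text_scatter, its parameters, and the axis ranges are hypothetical choices made for illustration. The data values are the commonly reproduced figures from Anscombe's paper.

```python
# x values shared by Dataset I and Dataset II, with their respective y values.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def text_scatter(xs, ys, width=40, height=12, x_range=(0, 20), y_range=(0, 14)):
    """Return a crude character-grid scatter plot of the (x, y) points."""
    grid = [[" "] * width for _ in range(height)]
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    for x, y in zip(xs, ys):
        # Scale each coordinate onto the grid; points must lie inside the ranges.
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y - y_lo) / (y_hi - y_lo) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


print("Dataset I")
print(text_scatter(X, Y1))
print("Dataset II")
print(text_scatter(X, Y2))
```

Even at this resolution, the points of Dataset I scatter around a rising line, while those of Dataset II trace a smooth arc that rises and then bends downward.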

While performing analysis on Big Data, analysts will often compute summary statistics (e.g., the mean, variance, correlation, and trend lines) to see what patterns emerge. Summary statistics are extremely useful because they allow analysts to describe Big Data with just a few numbers. One could also argue that summary statistics allow decision makers to assess risk. Or do they? Well, no, not really. Anscombe makes the point that analysts should visualize their data before applying any analysis tools. This matters all the more in the context of Big Data because different visualizations may offer competing or alternative hypotheses and, on thought-diverse analysis teams, inspire diversity of thought.

Thought Diversity

Thought diversity recognizes that an individual’s thought processes derive from their unique experiences and therefore provide unique perspectives on situations. It is important to note that cultural/ethnic diversity can spawn thought diversity. When teams bring together varying subject matter expertise and analytic approaches, their members can draw on one another's intuition and divergent perspectives. Having a Cyber Intelligence Analyst who understands malware behavior working side by side with a Political Scientist who understands open source intelligence and a Data Scientist who can glean adversary tradecraft using advanced analytic techniques can produce behavior-enriched insights. Figure 2 shows a fictitious characterization of the number of correctly classified (via a statistical classifier) malicious actor malware samples collected at the beginning of 2015. A statistician may not be able to explain the gap between mid-January and mid-February, and a malware analyst may surmise that the gap reflects a period in which the adversary is refining their malware tradecraft before redeploying it. However, a Political Scientist or Intelligence Analyst may notice that this period coincides with a Lunar New Year celebration, during which malware attacks would likely decrease.

The point of this illustration is that Anscombe’s Quartet is telling us to step back and look at our Big Data in its raw state (graph/visualize it) before applying any advanced analytic tools or capabilities, and to observe the patterns or behaviors that naturally emerge. This will allow thought-diverse analysis teams to provide objective perspectives on what is subjectively measured.
