We have entered the era of Big Data. Today, electronic devices are collecting data all around us and making it easily accessible to people the world over. Maybe more exciting is that the ability to use Big Data for business and personal ends is closer than ever. Easy-to-use commercial tools have become available and are just waiting for the right user to come along with the right dataset and crack open some really hard problems. From healthcare, to manufacturing, to environmental monitoring, there are many arenas in which these problems will be solved using data analytics. But Big Data is not the only data out there—there is insight to be discovered in data sets of just a few hundred records, though it takes savvy and some clever tricks to unlock it. What's more, there is a wealth of data that does not strictly fit the definition of Big Data, but that might be extensive in some particular dimension. For the purposes of this post, I will simplistically divide data into two categories: Big Data and Small Data.

What is Big Data? 

Many people have tried to define Big Data, focusing on velocity, variety, and volume. This debate is rather academic and is highly subjective. What, exactly, is enough of each thing? How do they interact? I have come up with a rule of thumb for myself: Big Data is uniform data that is collected faster than an expensive HD/SSD can write to disk, and won't fit in memory on a typical computer. Let's unpack what this means:

  • Uniform data — huge portions of the dataset form complete matrices. Data riddled with gaps does not qualify.

  • Collected faster than can be written — this is a simple way to say the data has to be coming in so rapidly that it needs special accommodation. But this has big implications. First, there is so much data that it must be processed in parts (e.g., using MapReduce). Second, it isn't just big. It is fast. A list of all data about every IPO in NYSE history is a lot of data, but it doesn't come in very fast. A temperature measurement from my front porch every 0.1s might be fast, but it isn't very large.

  • Larger than the memory of a normal computer can hold — memory is a computer's fast storage. If a whole dataset can be held in memory, it can be accessed very quickly and therefore processed easily. Large datasets that can't be held in memory need to be processed efficiently, or the analysis will take a very long time (or fail altogether).
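To make the last point concrete: a dataset too large for memory can still be analyzed in a single streaming pass that keeps only running totals. A minimal Python sketch, assuming a CSV of sensor readings (the column name and data are hypothetical stand-ins):

```python
import csv
import io

def streaming_mean(lines, column):
    """Compute a column's mean one row at a time, so the
    full dataset never has to fit in memory at once."""
    total, count = 0.0, 0
    for row in csv.DictReader(lines):
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")

# Toy stand-in for a file far too large to load whole.
data = io.StringIO("temp\n20.1\n20.3\n19.8\n")
print(streaming_mean(data, "temp"))
```

The same idea—accumulate partial results per chunk, then combine—is the heart of frameworks like MapReduce, just distributed across many machines.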

Such properties mean that specialized tools are the primary way to reliably extract meaningful insights from Big Data. Moreover, because of the volume and velocity of the data streams, there is real value in taking smaller windows of data and analyzing those—think Twitter, with its transient hashtags and social trends—rather than trying to gather all available data. More exciting, because there is such a vast amount of diverse data in a single format, subtle trends can be extracted.

As a thought experiment, imagine that you are the publisher of a successful online game where players are in charge of a mining operation and have to regularly click around to keep their workers motivated and resources flowing. You, as the publisher, get click data from 3,000,000 users a day. Your job is to optimize in-app purchases. In doing so, you might notice that people who click from the main menu screen to collect gold are 3% more likely to make an in-app purchase in the next 30 days than those who check on their workers first. Extracting a small trend from a huge number of degrees of freedom requires a huge amount of data. Only by collecting it in a uniform way from an automated system can you begin to look for these sorts of trends.
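The core of that kind of analysis is a cohort comparison: split players by their first action and compare purchase rates. A minimal sketch, using a hypothetical, simplified event log (the action names and data are illustrative, not real figures):

```python
def purchase_rates(events):
    """Compare purchase rates between cohorts of players,
    split by their first action from the main menu.
    `events` is a list of (first_action, purchased) pairs."""
    stats = {}  # action -> [purchases, total players]
    for action, purchased in events:
        bucket = stats.setdefault(action, [0, 0])
        bucket[0] += int(purchased)
        bucket[1] += 1
    return {action: p / n for action, (p, n) in stats.items()}

# Hypothetical toy log; a real one would have millions of rows.
log = [("collect_gold", True), ("collect_gold", False),
       ("check_workers", False), ("check_workers", False)]
print(purchase_rates(log))  # {'collect_gold': 0.5, 'check_workers': 0.0}
```

With only four rows the comparison is meaningless; detecting a real 3% difference between cohorts requires the millions of uniform records that only an automated pipeline can supply.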

Such analyses are becoming possible every day in a variety of fields, and I have no doubt that the next generations of technologies will both rely on and enable ever more advanced analyses of this sort.

The Promise of Small Data

[Embedded video: MIT OpenCourseWare 3.091SC, Lecture 2: The Periodic Table, taught by Prof. Donald Sadoway (http://ocw.mit.edu/3-091SCF10)]

But not all valuable data is Big Data. There is much promise in Small Data as well. I always like to talk about the periodic table as an example of Small Data mining. I particularly like MIT Prof. Don Sadoway's description of the insight that led Dmitry Mendeleev to propose it. The video is to the right, but in short: he saw that there were patterns in the masses of the elements, except where there weren't. He had the key insight that there might be undiscovered elements that would make the pattern hold. Indeed, he was correct. The amazing part is that he had only about half of today's periodic table in which to find a trend. Perhaps more stunning is the fact that the periodic table remains largely unchanged today, over 150 years later. While today we typically try to draw conclusions from far more than 60 data points at a time, looking for complex trends in Small Data sets can be a powerful tool for gaining insight into a variety of problems.

I am not just splitting hairs. Big Data is so big that you need a specific class of analytical tools to analyze it. We at Citrine use a number of data science tools to help our customers, and many of these tools are optimized specifically for Small Data. Indeed, we were able to use a Small Data set (several thousand measurements of materials properties) to create thermoelectrics.citrination.com, a machine-learning-based model for novel thermoelectrics, and to discover a totally new class of thermoelectric material (http://arxiv.org/abs/1502.07635). The trick with this sort of analysis is in the features you identify. Rather than throwing the kitchen sink at machine learning algorithms, you have to carefully choose which variables to include in the model and make sure each training row is clean. But with care, a few thousand rows of data or fewer can allow for the exploration of totally uncharted space: like Mendeleev with the periodic table.
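As a rough sketch of what "carefully choosing variables" can look like in practice, here is a toy first-pass ranking of candidate features by correlation with the target property. The descriptor names and values below are made up for illustration and are not the features from the thermoelectrics work:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    sx, sy = pstdev(xs), pstdev(ys)
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(rows, target):
    """Rank candidate features by |correlation| with the target --
    a simple screen for which variables belong in a Small Data
    model, instead of feeding every column to the algorithm."""
    features = [k for k in rows[0] if k != target]
    ys = [r[target] for r in rows]
    scores = {f: abs(pearson([r[f] for r in rows], ys)) for f in features}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy rows: two candidate descriptors, one measured property.
rows = [{"band_gap": 1.0, "density": 5.0, "zT": 0.2},
        {"band_gap": 1.5, "density": 5.2, "zT": 0.5},
        {"band_gap": 2.0, "density": 4.9, "zT": 0.8}]
print(rank_features(rows, "zT"))  # ['band_gap', 'density']
```

A real materials-informatics pipeline would use domain-informed descriptors and proper cross-validation, but the principle is the same: with a few thousand rows, every feature has to earn its place.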

I actually get really excited by Small Data analysis. I am inspired by what we could do if we reach back and collect humanity's accumulated knowledge into stores that allow us to mine it for new insights. Of course, I have a bias toward materials science. I imagine a day when we have every data point from Bessemer to modern nanotech under one roof. That day will lead to breakthroughs in energy, health, transportation, and every physical device we interact with each day. In other fields, though, the potential of data analysis is enormous even where Big Data specifically has not yet won the day. Environmental studies, genetic data, agricultural data, entertainment: in all of these fields Small Data is extremely powerful. That is not to say that Big Data won't provide value; it will. But the dream of analyzing data for breakthroughs with ever more powerful tools does not stop with Big Data; it extends to data of all sizes used in clever ways.

[1] Note: Small Data is only "small" in that it is not Big Data; it can come in all shapes and sizes.