Moneyball movie poster

This is a special time of year for me: the beginning of a new baseball season, and the hope against hope that the Chicago Cubs can finally win a World Series after a 107-year championship drought (here’s a realistic view of what that would look like).

While my research career and work at Citrine focus on materials informatics, I also do sports analytics as a hobby. Baseball stands out among American professional sports for being particularly data-obsessed, and the book and movie Moneyball have elevated baseball analytics to a pop culture phenomenon. Billy Beane, general manager of the Oakland Athletics, famously used advanced data analytics to gain a competitive edge against perennial titans such as the New York Yankees, despite having one of the smallest payrolls in baseball.

We founded Citrine because we want to help customers unlock a Moneyball edge in materials and manufacturing. Just as the Oakland Athletics became four times more efficient (in terms of payroll dollars per win) than the Boston Red Sox by harnessing the power of data, materials and manufacturing companies can make R&D and production dramatically more efficient by analyzing large-scale data about the materials and chemicals they use.

Comparing my passions for baseball stats and materials informatics, I am struck by how integral data analytics is to baseball compared with the status quo in materials. To make the point, I’ll give some examples of statistically derived facts that are commonplace in baseball, and analogous questions in materials that are impossible to answer without inordinate effort:

Baseball

Most and least efficient Major League teams in terms of payroll dollars per win in the 2012 baseball season. Source: fool.com

  • Colorado Rockies rookie Trevor Story became the first player in Major League history to hit home runs in each of his first three Major League games
  • As a rule of thumb, every 10 additional runs a team scores (or prevents) over a season is worth about one additional win
  • My childhood idol Mark Grace was the last Chicago Cubs player to hit for the cycle (a single, double, triple, and home run in the same game), and did so on May 9, 1993

Materials

  • What is the highest-reported superconducting critical temperature as a function of year? How about by journal and by year?
  • Which commercial aluminum alloys have elastic modulus above 75 GPa and yield strength above 250 MPa?
  • How does the adsorption energy of organic molecules on Au(111) vary according to the molecular masses of those molecules?
  • What chemical features are most important in governing the viscosity of paint?

Correlation between relative payroll and regular season win percent for all non-Oakland Major League teams from 2000-2013, where each point represents a binned average of 15 team-seasons. The Oakland Athletics’ performance is shown in green. Source: fivethirtyeight.com

Once a discipline becomes data-driven, it does not revert to the old pure-intuition way of doing things. Consider the quantitative revolution on Wall Street that upended the finance industry in the 1980s and 1990s. In baseball, Moneyball thinking has permanently transformed the game from the front office on down. Teams are locked in an arms race for analytics talent, managers are shifting their defensive formations and making frequent, subtle in-game adjustments, and star players like Zack Greinke openly aim to optimize their performance using advanced statistical analyses.

No such transformation has yet occurred in materials. The above materials questions, while comparable in complexity to the example baseball insights, would each take tens or hundreds of hours of manual data collection and analysis to answer satisfactorily. As a result, groundbreaking materials insights remain hidden in data, awaiting discovery. Our vision at Citrine is to instantly reveal these Moneyball insights to our customers using large-scale data aggregation and machine learning.
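To show how small these questions become once the data are structured, here is a minimal Python sketch of the superconducting critical temperature question. Everything in it is an assumption made for illustration: the `TcReport` record layout, the function names, and the placeholder journals, materials, and temperatures, none of which are real measurements. It is not a description of how the Citrine platform works.

```python
from typing import NamedTuple


class TcReport(NamedTuple):
    """One reported critical temperature (hypothetical record layout)."""
    year: int
    journal: str
    material: str
    tc_kelvin: float


# Placeholder values for illustration only; these are NOT real measurements.
REPORTS = [
    TcReport(2001, "Journal A", "material-1", 12.0),
    TcReport(2001, "Journal B", "material-2", 30.5),
    TcReport(2002, "Journal A", "material-3", 25.0),
    TcReport(2003, "Journal B", "material-4", 41.0),
    TcReport(2003, "Journal A", "material-5", 38.0),
]


def best_tc(reports, key):
    """Return {group: highest Tc in that group}, grouping reports by key(report)."""
    best = {}
    for report in reports:
        group = key(report)
        best[group] = max(best.get(group, float("-inf")), report.tc_kelvin)
    return best


def record_tc_by_year(reports):
    """Return {year: highest Tc reported up to and including that year}."""
    record = {}
    running = float("-inf")
    for year, tc in sorted(best_tc(reports, key=lambda r: r.year).items()):
        running = max(running, tc)
        record[year] = running
    return record


if __name__ == "__main__":
    # Highest-reported Tc as a function of year.
    print(record_tc_by_year(REPORTS))
    # The same question broken down by journal and by year.
    print(best_tc(REPORTS, key=lambda r: (r.journal, r.year)))
```

The analysis itself is a dozen lines. The tens or hundreds of hours go into producing something like `REPORTS` in the first place, which is exactly the part that is missing in materials today.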

The sharp contrast between the data-intensiveness of baseball discourse and the comparatively analytics-starved field of materials science raises an obvious question: What causes this difference? I outline some important distinctions below.

Centralization of data
  • Baseball (High): Exhaustive, clean historical data sets available from industry standard companies such as Elias Sports Bureau or Stats LLC
  • Materials (Low): No entity has collected all materials and chemical data, though Citrine is working toward this goal

Standardization of data
  • Baseball (High): The baseball community has agreed on which aspects of games should be recorded, and how to record them
  • Materials (Low): The materials community lacks data standards, but Citrine is working on changing this situation

Variability and gaps in data
  • Baseball (Low): Baseball data are unambiguous: a hit is a hit, no matter who is keeping score, and key in-game events are always recorded properly
  • Materials (High): All experiments have innate uncertainty, researchers may not fully document aspects of their work, and materials phenomena are best described as probability distributions, not scalar facts

Relational nature of data
  • Baseball (High): A finite set of entities such as players, positions, and teams have well-defined properties such as games, at-bats, and strikeouts that can be readily mined with simple SQL-style queries
  • Materials (Low): How similar is iron-deficient Fe0.95O to FeO? Which materials are sensibly described with the notion of a chemical formula? Across which dimensions could we compare polyethylene to Inconel?

Relevance of qualitative data
  • Baseball (Medium): Baseball teams still utilize scouting reports, which are qualitative evaluations of player performance by domain experts
  • Materials (Low): The idea of “chemical resistance” is incredibly important to polymers, even though quantifying it (how resistant to which chemicals?) is challenging
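
As a rough illustration of what "simple SQL-style queries" means on the baseball side, here is a small Python sketch using the standard-library `sqlite3` module. The two-table schema, the player and team names, and the numbers are all hypothetical placeholders, not real statistics and not the schema of any actual baseball data provider.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical players/batting schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE players (
            player_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            team      TEXT NOT NULL
        );
        CREATE TABLE batting (
            player_id  INTEGER NOT NULL REFERENCES players(player_id),
            season     INTEGER NOT NULL,
            at_bats    INTEGER NOT NULL,
            strikeouts INTEGER NOT NULL
        );
        """
    )
    # Placeholder rows for illustration only; not real players or statistics.
    conn.executemany(
        "INSERT INTO players VALUES (?, ?, ?)",
        [(1, "Player A", "Team X"), (2, "Player B", "Team X"), (3, "Player C", "Team Y")],
    )
    conn.executemany(
        "INSERT INTO batting VALUES (?, ?, ?, ?)",
        [(1, 2015, 500, 100), (2, 2015, 400, 120), (3, 2015, 450, 45)],
    )
    return conn


def strikeout_rates(conn, season):
    """Return (name, team, strikeouts per at-bat) for a season, highest rate first."""
    query = """
        SELECT p.name, p.team, CAST(b.strikeouts AS REAL) / b.at_bats AS k_rate
        FROM batting AS b
        JOIN players AS p ON p.player_id = b.player_id
        WHERE b.season = ? AND b.at_bats > 0
        ORDER BY k_rate DESC
    """
    return conn.execute(query, (season,)).fetchall()


if __name__ == "__main__":
    for name, team, rate in strikeout_rates(build_demo_db(), 2015):
        print(f"{name} ({team}): {rate:.3f}")
```

The query works because every row refers to a well-defined entity with well-defined columns. The materials questions in the same row of the comparison, such as how similar Fe0.95O is to FeO, have no obvious primary key or join condition, which is why this style of query does not transfer directly.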

To illustrate just how stark the difference is between baseball and materials, here is a hypothetical baseball data dystopia: Imagine if officials in different stadiums all chose to record different aspects of their games, using a range of non-standard nomenclatures (is a “home run” the same thing as a “four bagger” and a “round trip”?) and then published these data in idiosyncratic box score formats in hundreds of different newspapers, months or years after the game. Further, many key observations from the game would only be recorded privately and never published. Unfortunately, this is precisely the reality we face in the materials community today.
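
The nomenclature problem in that thought experiment can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the synonym table contains just the three terms mentioned above, and the names `CANONICAL_TERMS`, `normalize_term`, and `normalize_box_score` are invented for this example. Real materials vocabularies are far larger and more ambiguous than a lookup table can capture.

```python
# Hypothetical synonym table mapping each spelling to one canonical term.
CANONICAL_TERMS = {
    "home run": "home run",
    "four bagger": "home run",
    "round trip": "home run",
}


def normalize_term(term):
    """Return the canonical name for a recorded term, or None if it is unknown."""
    # Lowercase, treat hyphens as spaces, and collapse repeated whitespace.
    cleaned = " ".join(term.lower().replace("-", " ").split())
    return CANONICAL_TERMS.get(cleaned)


def normalize_box_score(terms):
    """Split recorded terms into (canonical terms, terms that need human review)."""
    recognized, unrecognized = [], []
    for term in terms:
        canonical = normalize_term(term)
        if canonical is None:
            unrecognized.append(term)
        else:
            recognized.append(canonical)
    return recognized, unrecognized


if __name__ == "__main__":
    print(normalize_box_score(["Home Run", "four-bagger", "Round  Trip", "dinger"]))
```

Returning the unrecognized terms separately, rather than dropping them, is a deliberate choice in this sketch: the unknown spellings are where a curator or a learned model would still have to step in.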

Given the above set of facts, the challenges to unleashing the Moneyball era in materials are daunting. But the opportunities for radical data-driven advancement are even more exciting, and that is precisely why we created the Citrine platform. We have developed a data extraction pipeline that turns documents about materials and chemicals into a highly structured, machine-readable database of facts and relationships. We are building grassroots support around our open MIF (Materials Information File) and next-generation PIF (Physical Information File) standards for representing materials data. We use a combination of heuristics and machine learning to resolve gaps and ambiguities in materials datasets. We have created a toolset that enables extremely powerful searches and predictive AI-based models of large-scale materials data in spite of the complex, non-relational nature of those data. And finally, we are engaging the global community of materials researchers to help us organize and curate the public data on our platform.

We are convinced that a combination of state-of-the-art software and a brilliant user community can bring about the data-driven future of materials and manufacturing that will launch entire industries forward. We’re proud to play our part in this transformation, and to put cutting-edge data analytics capabilities into the hands of visionary Beanesian materials scientists and engineers. Citrine can’t help the Cubs win the World Series, but our platform can crunch huge volumes of data to optimize the properties of the advanced materials in the helmets and shoes they’ll wear when they finally do break through. That’s enough for at least an honorary championship ring, right?
