*Edited on March 18 to clarify results of mean predictor approach. Thanks to Dogus Cubuk for suggesting the change.*

In my previous post, I talked about getting materials data ready for training machine learning-based models. Here, I will take the next step and actually use the dataset we built previously to create predictive models.

**Feature Engineering**

Let’s examine the top few band gaps in our dataset:

LiH,2.981 BeH2,5.326 B9H11,2.9118 B2H5,6.3448 BH3,5.3234 B5H7,3.5551 H34C19,5.4526 H3N,4.3287 H2O,5.5175 HF,6.7187

Our goal now is to construct a set of input features that a machine learning model could use to predict these band gaps. The features should contain patterns that a machine learning algorithm could identify, if provided with several thousand training examples of how changes in those features influence band gap. For example, what is it about the alkali hydride LiH that causes its band gap to be about 3 eV, whereas the alkaline earth hydride BeH2 has a much wider gap of over 5 eV? Feature engineering is precisely where our expertise and intuition as materials scientists enters the modeling process.

These features must be characteristics we either* know or can compute *for any material whose band gap we wish to predict. This statement implies some important constraints: For example, we might wish to use another property such as formation energy to predict band gap, but we only *know* formation energies for a relatively small number of compounds. Likewise, crystal structure is extraordinarily important in governing materials behavior, but if we want to use crystal structure features in our model, we can only model materials *whose crystal structures are known a priori*. We could imagine using machine learning or another approach to first predict crystal structure, and then use the predicted crystal structure as input for a band gap model. Here, rather than daisy-chaining models, we will keep things simple and attempt to simply map a chemical formula (e.g., NaCl) directly to a value for band gap.

Let us begin with an extremely simple and naive representation of a chemical compound for our band gap model, which we will later improve upon by injecting some physical insight. We will define a material using a vector wherein the *n*th component of the vector represents the atomic fraction of the element having atomic number *n*+1 in the material. Thus, pure H would be:

[1, 0, 0, 0, …]

Beryllium hydride (BeH2) would be:

[0.67, 0, 0, 0.33, 0, 0, …]

This representation has the advantage of simplicity, but the disadvantage that it is a rather “bloated” way to express a chemical compound. Clearly, for most materials, the vast majority of the components of the composition vector will be 0. However, this approach is good enough to illustrate next steps.

```
from pymatgen import Composition, Element
from numpy import zeros, mean
# Training file containing band gaps extracted from Materials Project
# created in previous blog post and linked here
trainFile = open("bandgapDFT.csv","r").readlines()
# Input: pymatgen Composition object
# Output: length-100 vector representing any chemical formula
def naiveVectorize(composition):
vector = zeros((MAX_Z))
for element in composition:
fraction = composition.get_atomic_fraction(element)
vector[element.Z - 1] = fraction
return(vector)
# Extract materials and band gaps into lists, and construct naive feature set
materials = []
bandgaps = []
naiveFeatures = []
MAX_Z = 100 # maximum length of vector to hold naive feature set
for line in trainFile:
split = str.split(line, ',')
material = Composition(split[0])
materials.append(material) #store chemical formulas
naiveFeatures.append(naiveVectorize(material)) #create features from chemical formula
bandgaps.append(float(split[1])) #store numerical values of band gaps
```

**Model Building**

We now have a (naive) way of converting each material in our training set into a vector for our machine learning model to crunch, and equivalently we can express any new chemical formula with this representation for the purposes of making predictions. Time for the fun part!

You’re probably salivating with anticipation as to which amazing machine learning algorithm we are going to use to model band gap. I have some anti-climactic news: Unless you work at Google and are training machine vision systems to autonomously recognize cats in millions of images, don’t worry too much about using the latest and greatest algorithms. While you may be on the cutting edge of materials informatics, you are likely not on the cutting edge of machine learning (nor should you be; you’re a materials scientist!).

Before we construct any real models, let us establish a baseline for accuracy with a trivial approach: simply guessing the mean of the band gap distribution.

```
# Establish baseline accuracy by "guessing the average" of the band gap set
# A good model should never do worse.
baselineError = mean(abs(mean(bandgaps) - bandgaps))
print("The MAE of always guessing the average band gap is: " + str(round(baselineError, 3)) + " eV")
```

The mean predictor gives the following result:

The MAE of always guessing the average band gap is: 0.728 eV

A sophisticated model should absolutely never do worse than this, or something is very, very wrong! With that in mind, we will start our real modeling effort with a straightforward approach; after all, in the words of Einstein, everything should be made as simple as possible, but not simpler. Let’s begin by attacking our data with a linear ridge regression model. Yes, you read that correctly: We are starting off with a *linear* model. Before we throw the kitchen sink at the problem and turn to more complex models, we should evaluate the efficacy of a basic approach. Indeed, linear techniques such as logistic regression can--despite their simplicity--perform strikingly well in comparison to more sophisticated algorithms, and offer the advantages of speed and interpretability.

Let’s see how a linear ridge regressor does in modeling band gaps using our naive feature set:

```
# Train linear ridge regression model using naive feature set
from sklearn import linear_model, cross_validation, metrics, ensemble
#alpha is a tuning parameter affecting how regression deals with collinear inputs
linear = linear_model.Ridge(alpha = 0.5)
cv = cross_validation.ShuffleSplit(len(bandgaps),\
n_iter=10, test_size=0.1, random_state=0)
scores = cross_validation.cross_val_score(linear, naiveFeatures,\
bandgaps, cv=cv, scoring='mean_absolute_error')
print("The MAE of the linear ridge regression band gap model using the naive feature set is: "\
+ str(round(abs(mean(scores)), 3)) + " eV")
```

You will see from the code that we are using an approach called tenfold cross-validation to determine the quality of our model. One of the most fundamental tenets of statistics and machine learning is that one cannot evaluate the quality of a model by examining the error it achieves on the data *to which it was fitted*. A simple way to understand this concept is to consider the fact that a polynomial regression with enough terms can perfectly fit *any* training data. However, the question then would be, what happens if we introduce new data and would like such an overfitted model to generalize? A polynomial regression with too many terms will almost certainly fail such a test. In this respect, machine learning models are no different.

Cross-validation is a rigorous approach to avoiding overfitting. In *n*-fold cross validation (in this case, *n*=10), we partition our training data into *n *equally-sized chunks, fit our model to the data in *n*-1 of those chunks, and use the resulting model to predict the held-out *n*th chunk. We perform this procedure *n* times and average the resulting error on the held-out data, reporting a chosen error statistic such as mean squared error or mean absolute error (our choice here). An overfitted model will perform poorly in cross-validation, as it will tend to give very large errors on the held-out data.

When we perform this cross-validation procedure using our linear ridge regression model, the linear model manages to outperform our guess-the-mean approach from above:

The MAE of the linear ridge regression band gap model using the naive feature set is: 0.47 eV

Let’s also take a look at the fitted regression coefficients (just a few of the 100 total are shown here):

```
# Let's see which features are most important for the linear model
print("Below are the fitted linear ridge regression coefficients for each feature (i.e., element) in our naive feature set")
linear.fit(naiveFeatures, bandgaps) # fit to the whole data set; we're not doing CV here
print("element: coefficient")
for i in range(MAX_Z):
element = Element.from_Z(i + 1)
print(element.symbol + ': ' + str(linear.coef_[i]))
```

Running the script gives the following output (excerpted):

O: 2.28865291048 F: 3.95035733949 Cl: 2.81918301294 Ti: -0.675957059589 V: -0.704965141051

Examining these coefficients reveals an intuitively satisfying result: electronegative elements such as O, Cl, and F tend to strongly increase compounds’ band gaps, whereas metallic elements such as V and Cr are generally associated with smaller band gaps.

**Feature Engineering Revisited**

We were honest with ourselves by calling our first approach to feature engineering naive; we can probably do better than an unwieldy 100-component vector for representing materials’ compositions. Let us now construct an alternative feature set, which builds some basic chemical concepts into the resulting model. This feature set will be far more compact than our 100-component composition vector. In particular, we will:

Order the two elements in the binary compound according to their atomic fraction abundance in the compound

Express the stoichiometry of a binary compound simply by the ratio of the more abundant element to the less abundant element

Calculate the difference in electronegativity between the two elements in the binary

Include the periodic table group numbers of each element in the binary

Within this feature space, BeH2 now becomes [ratio, electronegativity_difference, more_abundant_element_group, less_abundant_element_group]:

[2.0, 0.63, 1.0, 2.0]

Here is the code to create these features for each material in our training data:

```
# Create alternative feature set that is more physically-motivated
physicalFeatures = []
for material in materials:
theseFeatures = []
fraction = []
atomicNo = []
eneg = []
group = []
for element in material:
fraction.append(material.get_atomic_fraction(element))
atomicNo.append(float(element.Z))
eneg.append(element.X)
group.append(float(element.group))
# We want to sort this feature set
# according to which element in the binary compound is more abundant
mustReverse = False
if fraction[1] > fraction[0]:
mustReverse = True
for features in [fraction, atomicNo, eneg, group]:
if mustReverse:
features.reverse()
theseFeatures.append(fraction[0] / fraction[1])
theseFeatures.append(eneg[0] - eneg[1])
theseFeatures.append(group[0])
theseFeatures.append(group[1])
physicalFeatures.append(theseFeatures)
```

We have thus “compressed” our previous representation by a factor of 25. Let’s see how well this new feature set works with our same linear ridge regression from before:

The MAE of the linear ridge regression band gap model using the naive feature set is: 0.47 eV The MAE of the linear ridge regression band gap model using the physical feature set is: 0.664 eV

Our cross-validation MAE actually increased! How can this be? Well, we have changed from a linear regression with 100 coefficients to a linear regression with only four coefficients. It appears that five degrees of freedom (four coefficients plus a constant) are too few to reasonably model band gaps with a linear model. This outcome invites the obvious next step...

**Going Beyond a Linear Model: Black-Box Machine Learning**

As of yet, it is not clear whether our clever new feature set is actually an improvement over the naive composition vector. Indeed, the cross-validated MAE of our linear ridge regression got worse with the much smaller, physically-motivated feature vector. However, a major change in approach remains untested: switching to a nonlinear model that in fact has no functional form.

For the purposes of this exercise, we are going to construct a random forest--an ensemble of decision trees. It turns out that, while a *single* decision tree is often a poor classifier, a collection of many decision trees trained on different subsets of data can be very powerful for modeling data. Random forests have a number of tuning parameters (as with most machine learning algorithms, unfortunately), but here we highlight only the number of trees in the forest; we will choose 10 for computational expediency. Our code will thus construct 10 independent decision trees and average the band gap predictions from each of them. The code to do so is very simple:

`rfr = ensemble.RandomForestRegressor(n_estimators=10) #try 10 trees in the forest`

Here is the performance of a random forest-based band gap model:

The MAE of the nonlinear random forest band gap model using the naive feature set is: 0.366 eV The MAE of the nonlinear random forest band gap model using the physical feature set is: 0.265 eV

We observe that a random forest-based approach outperforms linear ridge regression with *both* feature sets, and the physically-motivated four-feature vector outperforms the naive 100-component composition vector. The random forest trained on the physical features is actually decent at estimating band gaps of materials!

**Conclusion**

At the conclusion of this exercise, we have created a random forest-based model that can predict the band gaps of binary compounds with a cross-validated MAE of under 0.3 eV. Not too shabby! From here, the two main levers we have to further enhance our model are: (1) adding more training data (read: email Materials Project and tell them to do more DFT calculations!) and (2) further feature engineering. Probably our simple four-feature approach is not yet optimized. Again, this is where your background as a materials scientist is essential!

We should also put our results here in context. Machine learning is indeed a potent tool for analyzing materials data; when used properly, it can produce powerful, original insights. But it is just that: another tool in the materials scientist’s toolbox. Machine learning (or any other computational technique) is not a substitute for scientific judgment or common sense. As with all computational models, garbage in equals garbage out. However, with a combination of robust training data and insightful features, you can build very fast and very effective models of materials behavior--without a supercomputer, and without a computer science degree.

*Have a new machine learning or statistics-based model of materials? Developed a great feature or descriptor for modeling a particular property? Found a useful materials data set for model-building purposes? Let us know—we’d love to host it on our platform, and reference it back to you!*