Data Science For The Greater Good


Connecting the Details to the Big Picture

AIC? BIC? WTF?

No one gets the perfect model on the first try. Model selection and tuning are an iterative process. But after you have tried every model and combination of hyperparameters, how do you know which one is the “best” model? Measures such as R-squared and root mean squared error work well for regression tasks, while accuracy and F1 scores are good starting points for classification problems. However, all of those measures simply assess how well a model has learned a particular sample of data. Unless they are evaluated against a held-out test set, those measures give no insight into overfitting or unjustified model complexity.
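To make that gap concrete, here is a minimal sketch (invented data, using statsmodels) comparing two nested linear models. R-squared can only improve as predictors are added, while information criteria such as AIC and BIC reward goodness of fit but penalize every extra parameter, so the needlessly complex model gets flagged.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
noise_features = rng.normal(size=(n, 5))      # irrelevant predictors
y = 2.0 * x1 + rng.normal(scale=1.0, size=n)  # y depends only on x1

X_small = sm.add_constant(x1)
X_big = sm.add_constant(np.column_stack([x1, noise_features]))

small = sm.OLS(y, X_small).fit()
big = sm.OLS(y, X_big).fit()

# R-squared never decreases when predictors are added, while AIC/BIC
# (lower is better) charge the bigger model for its unjustified parameters.
print(f"small model: R2={small.rsquared:.3f}  AIC={small.aic:.1f}  BIC={small.bic:.1f}")
print(f"big model:   R2={big.rsquared:.3f}  AIC={big.aic:.1f}  BIC={big.bic:.1f}")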


Stacked Generalization

One of the most influential papers in the field of data science was published in 1992 by David H. Wolpert, introducing the novel technique of “stacked generalization.” Informally known as “model stacking,” it was proposed as a strategy to “reduce the generalization error rate of a generalizer.” To understand this in layman’s terms, we need to define what a generalizer is and what the generalization error rate is. First, a generalizer is a mathematical function that describes a population from a subsample of that population. Essentially, it uses the subsample to describe (i.e., generalize) the characteristics of the total population. For any subsample that is smaller than the total population, it is reasonable to assume that some part of the total population will not be described well by the generalizer. This inability of the generalizer to describe some portion of the total population is what Wolpert refers to as the “generalization error rate.” His stacked generalization technique is thus a way to improve how well the mathematical function describes the total population.
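In Wolpert’s scheme, a set of level-0 generalizers is trained on the original sample, and a level-1 generalizer is then trained on their out-of-fold predictions, learning where each base model tends to err. The sketch below is a minimal illustration of that idea using scikit-learn’s StackingRegressor on synthetic data; the particular base models are placeholders, not a recommendation.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Level-0 generalizers: each one describes the population from the same subsample.
base_learners = [
    ("ridge", Ridge()),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("knn", KNeighborsRegressor()),
]

# Level-1 generalizer: fit on the base models' cross-validated predictions,
# so it learns to correct the portions of the population each one misses.
stack = StackingRegressor(estimators=base_learners, final_estimator=Ridge(), cv=5)

print("stacked R^2:", cross_val_score(stack, X, y, cv=5).mean())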


What Makes You Special?

Interpreting the composition of an anomaly using SHAP values
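As a rough illustration of the idea, the sketch below plants an obvious outlier in synthetic data, scores it with an IsolationForest, and then asks SHAP which features drive its anomaly score. The IsolationForest-plus-TreeExplainer pairing is an assumed setup for this sketch, not necessarily the one used in the article.

import numpy as np
import shap
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[0] = [8.0, 0.1, -0.2, 0.3]  # plant one point that is extreme in feature 0

iso = IsolationForest(random_state=0).fit(X)
explainer = shap.TreeExplainer(iso)
shap_values = explainer.shap_values(X[:1])  # explain only the planted anomaly

# Each SHAP value is one feature's contribution to this point's anomaly score,
# so the largest-magnitude entries describe what makes the point unusual.
for i, contribution in enumerate(shap_values[0]):
    print(f"feature {i}: {contribution:+.4f}")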


The Process of Model Selection

The global collection of, and reliance on, data is growing at a staggering rate with no indication of slowing down. Extracting knowledge and insights from this data in its various forms to solve problems is the main goal of every data scientist. That goal is accomplished by modeling the trends in the data, either to understand what factors lead to the generation of particular data (i.e., to understand the nature of the data for scientific discovery) or to predict future outcomes. However, because these goals are broadly defined, there is no set procedure or magic bullet that guarantees successful modeling on every dataset. Fortunately, there are a number of different models a data scientist can use to extract as much information as possible from the available data.
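In practice, that search often begins as a plain loop over candidate models scored with cross-validation. The sketch below is one hypothetical version of such a first pass (synthetic data, arbitrary candidates); it is a starting point for iteration, not a recipe.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "support vector machine": SVC(),
}

# Cross-validation keeps the comparison honest: every candidate is scored
# on data it has not seen rather than on the sample it was fit to.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")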


Feature Engineering with FeatureTools

As data scientists, we are told time and time again that a model is only as good as the data it is given. In short, garbage in results in garbage out. But what makes data “good”? Whether it is tabular, a collection of images, text documents, or any other form, data is simply the starting point of the project. It is neither inherently good nor bad; it either exists or it does not. And let’s all be honest with ourselves for a minute: given the explosion of data center construction and the formation of numerous companies (and occupations) around data management, the data exists. You may just have to be a little creative to find it.
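Once the raw tables are in hand, a library like Featuretools can manufacture candidate features from them automatically. The sketch below is a minimal, assumed example of its deep feature synthesis on two invented tables; the calls follow the Featuretools 1.x API (older releases use entity_from_dataframe and target_entity instead), so treat it as a sketch rather than a drop-in snippet.

import featuretools as ft
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 2],
    "amount": [100.0, 55.0, 20.0, 80.0],
    "transaction_time": pd.to_datetime(
        ["2023-01-01", "2023-01-05", "2023-01-02", "2023-01-07"]),
})

es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id", time_index="transaction_time")
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")

# Deep feature synthesis stacks primitives across the relationship, yielding
# per-customer features such as SUM(transactions.amount) and MEAN(transactions.amount).
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["sum", "mean", "count"])
print(feature_matrix)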