Preventing Underfitting and Overfitting


Image source: self-made; the marked point is the ideal spot at which to stop training.

In a job interview you may be asked when a model is underfitting or overfitting and how to actually avoid it. Here are the answers you can give when you get these questions:

Underfitting

First, what actually is underfitting? It means that the model you created cannot handle the data, or in other words: the model is not able to learn the (complex) patterns in your data. You can detect underfitting in several ways:

  • the loss function does not decrease at all over time
  • other metrics like accuracy or F1-score do not really change
  • your model is no better than a random prediction based on the distribution of your labels (see the baseline sketch after this list)
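As a quick sanity check for that last point, you can compare your model against a random baseline. Here is a minimal sketch using scikit-learn's DummyClassifier; the synthetic dataset and the LogisticRegression model are just placeholders for your own data and model.

```python
# Compare a real model against a baseline that predicts from the label distribution.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

baseline = DummyClassifier(strategy="stratified", random_state=42)
baseline.fit(X_train, y_train)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("model accuracy:   ", model.score(X_test, y_test))
# If the model barely beats the baseline, it is likely underfitting.
```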

Overfitting

Overfitting is the other extreme: you have a model that is too complex, handles the training data very well, but fails on the test data.
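The typical symptom is a large gap between training and test performance. Here is a minimal sketch of that check, using an unconstrained decision tree from scikit-learn on placeholder data as an example of a model that memorizes the training set.

```python
# A big gap between train and test accuracy is the usual sign of overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unconstrained tree can memorize the training data almost perfectly
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # typically ~1.0
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower
```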

To prevent overfitting, you have multiple options.

Simplify

Make your model simpler if possible. How exactly depends on the model you are using. In general it is a good pattern to start with the simplest model possible and use it as a baseline (or go even simpler by just randomly picking a value).

  • For ensemble methods, reduce the number of learners
  • For neural nets, reduce the number of layers, units, and epochs, or try a Dropout layer (see the sketch below)

In general, you can also reduce the number of features you pass to the model.
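As a minimal sketch of the Dropout layer mentioned above (assuming TensorFlow/Keras, a binary classification task, and 20 input features as placeholders):

```python
# A small network with a Dropout layer, which randomly disables units
# during training so the model cannot rely too much on individual units.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),            # 20 input features assumed
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                 # drop 50% of the units during training
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```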

Data

More data will help as well. If your dataset is (heavily) imbalanced, every additional datapoint in the minority class will help your model a lot. But remember that the new data has to be of the same quality as the existing dataset and must not introduce noise.
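A quick way to see how imbalanced your data actually is (a minimal sketch assuming pandas and a hypothetical "label" column):

```python
# Check the class distribution before deciding where more data is needed.
import pandas as pd

df = pd.DataFrame({"label": ["a"] * 950 + ["b"] * 50})  # toy, heavily imbalanced data
print(df["label"].value_counts(normalize=True))
# a    0.95
# b    0.05  -> every extra datapoint for class 'b' carries a lot of weight
```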

K-Fold Cross-validation

Say you split your data into the classical 5 buckets, where 1 bucket (about 20%) is used as the test set. This bucket rotates so that each one serves as the test set once, and all metrics are averaged to make sure the evaluation of the model is more stable.
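A minimal sketch of exactly this, assuming scikit-learn and placeholder data and model:

```python
# 5-fold cross-validation: each fold serves as the test set once,
# and the per-fold scores are averaged.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("per-fold accuracy:", scores)
print("mean accuracy:    ", scores.mean())
```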

Early Stopping

When your model no longer improves, interrupt the training. When you train with the Keras library, you can add a callback that runs after every epoch or batch and can terminate the training.
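A minimal sketch of Keras's EarlyStopping callback, assuming an already compiled `model` and existing `X_train`/`y_train` arrays:

```python
# Stop training once the validation loss stops improving.
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=3,                  # allow 3 epochs without improvement
    restore_best_weights=True,   # roll back to the weights of the best epoch
)

model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=100,                  # upper bound; early stopping usually ends sooner
    callbacks=[early_stopping],
)
```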

Regularization

Regularization will also help to reduce overfitting. The loss function is extended with a penalty on large coefficient values, which keeps the model from relying too heavily on any single feature.

  • L1 / Lasso: penalizes the absolute values of the coefficients and can shrink some of them to exactly zero, effectively selecting features
  • L2 / Ridge: penalizes the squared values of the coefficients and is a good general default (see the sketch after this list)
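A minimal sketch of both, assuming scikit-learn and placeholder regression data; `alpha` controls the strength of the penalty:

```python
# Lasso (L1) and Ridge (L2) regression on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: can shrink some coefficients to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients towards 0

print("Lasso non-zero coefficients:", (lasso.coef_ != 0).sum())
print("Ridge non-zero coefficients:", (ridge.coef_ != 0).sum())
```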

Data augmentation

This is the process of generating new data from the existing data. It is very common when you train on images: you add small modifications to an existing image, such as mirroring it or rotating it by a few degrees.
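A minimal sketch using Keras preprocessing layers; the image size and the small classification head are placeholders:

```python
# Augmentation layers that mirror and slightly rotate images on the fly during training.
from tensorflow import keras
from tensorflow.keras import layers

data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),   # mirror the image
    layers.RandomRotation(0.05),       # rotate by up to ~18 degrees
])

# Typically placed at the start of the model so it is only active during training:
model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),  # image size assumed
    data_augmentation,
    layers.Conv2D(16, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),
])
```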