3 ways to handle missing data

Image source: feathericons.com (help-circle, MIT)

In this blog post we will analyse how to handle missing values in your dataset. In general, three methods exist: let them stay (work around them), remove them, or impute them.

We will look at each of them in more detail. Let's start directly with "let them stay".

Let them stay (work around)

In case only a small amount of data for a particular column is missing (less than 5%), you can let it stay in the dataset and create features out of it.

Add a new column that contains "1" in case of a missing value, otherwise "0". This allows you to use the column for training. For the column containing the data itself, you can use any valid value (or impute one) - the model should not rely on this value for training (it will have a low influence on the final model). An example implementation is presented in the following code snippet:

import pandas as pd

def let_them_stay(df: pd.DataFrame, column: str) -> pd.DataFrame:
    # flag column: 1 if the value was missing, 0 otherwise
    df[f'{column}_NA'] = df[column].isna().astype("int")
    # replace the missing values themselves with a valid placeholder
    df[column] = df[column].fillna(0)
    return df

# usage
df = pd.DataFrame([{"a": pd.NA, "c": 15}])
transformed = let_them_stay(df, 'a')
print(transformed)

When you have missing / NaN values in several columns, add a new column that counts the missing values across these columns and use it as a feature.

import pandas as pd
from typing import List


def count_nan_cols(row, columns: List[str]) -> int:
    # count how many of the given columns are missing in this row
    result = 0
    for column in columns:
        if column in row and pd.isna(row[column]):
            result += 1
    return result

def let_them_stay_columns(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    df['number_of_nans'] = df.apply(lambda row: count_nan_cols(row, columns), axis='columns')
    return df

# usage
df = pd.DataFrame([{"a": pd.NA, "b": pd.NA, "c": 15}])
transformed = let_them_stay_columns(df, ["a", "b"])
print(transformed)

Remove them

Several approaches can be taken when you decide to remove data:

  • Remove them totally
  • Remove only outliers
  • Remove rows with invalid data
  • More criteria to remove data

Remove them totally

To remove an entire column, you can use the drop method on the DataFrame class, providing the column label and setting "axis" to "columns". Note that drop does not modify the DataFrame in place by default; it returns a new DataFrame instead of changing the input DataFrame.

import pandas as pd
data = pd.DataFrame([{"a":pd.NA}])
cleaned_df = data.drop("a", axis="columns")
print(cleaned_df)

Remove only outliers

Removing outliers from your dataset can be done, for example, by using the Tukey rule: we calculate the 0.25 and 0.75 quantiles of the data, calculate the IQR (interquartile range) from them, derive the minimum and maximum allowed values, and finally drop every row that does not fall within this range. The following code will do this for us:

import pandas as pd

def tukey_rule(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    # first and third quartile
    q1 = df[column_name].quantile(0.25)
    q3 = df[column_name].quantile(0.75)
    iqr = q3 - q1
    # Tukey fences: values outside [q1 - 1.5 * iqr, q3 + 1.5 * iqr] are outliers
    max_value = q3 + 1.5 * iqr
    min_value = q1 - 1.5 * iqr
    # keep only the rows whose value lies within the fences (inclusive)
    matches = df[column_name].between(min_value, max_value)
    return df[matches]

# usage
df = pd.DataFrame([-5,-1,0,1,5],columns=["a"])
tukey_df = tukey_rule(df, "a")
print(tukey_df)

Regarding the 1.5 value that is hard-coded in the example: you may need to adjust this factor for your needs. For normally distributed data, a higher factor will remove less data, as fewer values are considered "real" outliers. The provided value of 1.5 is common in the literature.
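If you want to experiment with that value, one option is to expose it as a parameter. This is just a sketch; the function name tukey_rule_with_factor and the factor parameter are my own additions, not part of the snippet above:

import pandas as pd

def tukey_rule_with_factor(df: pd.DataFrame, column_name: str, factor: float = 1.5) -> pd.DataFrame:
    # same Tukey rule as above, but with the fence factor as a tunable parameter
    q1 = df[column_name].quantile(0.25)
    q3 = df[column_name].quantile(0.75)
    iqr = q3 - q1
    matches = df[column_name].between(q1 - factor * iqr, q3 + factor * iqr)
    return df[matches]

# usage: a larger factor keeps more data
df = pd.DataFrame([-5, -1, 0, 1, 5], columns=["a"])
print(tukey_rule_with_factor(df, "a", factor=3.0))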

Remove rows with invalid data

You may also consider removing only the rows with the invalid data. Note that this can reduce your training set a lot.

To drop any row with a missing value, use the dropna method on the DataFrame class with no arguments.

df_clean  = df.dropna()

To drop columns in which every value is NaN, use the following snippet:

df_clean = df.dropna(axis='columns', how='all')

To drop only the rows with missing values in a particular column "col1", use the following code:

df_clean = df.dropna(subset=['col1'], how='any')
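As a small self-contained sketch (the column names and data here are made up for illustration), this shows the last two variants on actual data:

import pandas as pd

# made-up data: "col1" has one missing value, "empty" is entirely NaN
df = pd.DataFrame({"col1": [1, None, 3], "empty": [None, None, None]})

# drop only the rows where "col1" is missing (the "empty" column is kept)
print(df.dropna(subset=['col1'], how='any'))

# drop columns that are entirely NaN (all rows are kept)
print(df.dropna(axis='columns', how='all'))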

More criteria to remove data

To give you even more ideas on when you should (or should not) remove rows or columns from the data, think about the following situations:

  • data entry error: An application that was responsible for collecting the data crashed and wrote only part of the data. Can the data still be used?
  • mechanical error: A sensor is working out of specification and returns out-of-range values.
  • you don't need the data: You just don't need this data for your use case - drop it to speed up further processing.
  • all rows have the same value: A column where every row has the same value carries no information and can be removed (see the sketch after this list).
  • assertions fail: For example, some values are negative although all should be positive.
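As a rough sketch of the last two points (the column names, values, and the positivity check are made up for illustration):

import pandas as pd

# hypothetical data: "sensor" should always be positive, "constant" never changes
df = pd.DataFrame({"sensor": [3.2, -1.0, 4.8], "constant": [1, 1, 1]})

# drop columns where every row has the same value - they carry no information
constant_cols = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(constant_cols, axis="columns")

# keep only the rows that satisfy the assertion "sensor values are positive"
df = df[df["sensor"] > 0]
print(df)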

Impute

The final solution for handling missing values is to impute them. On the positive side of this approach, you don't lose any information, but it can also reduce your feature variability. In general there is no rule which value to use for imputation - just think about the business context and what matches it better, or try the imputation strategy out as a hyperparameter for your model. You can do it with a small amount of code.

Using (arithmetic) Mean

Defined as the sum of all values divided by the total number of values.

def mean(data):
    total_sum = float(sum(data))
    size = len(data)
    mean = total_sum / size
    return mean

print(mean([0,1,2,0,1,2,3,4,5,6,7,8,9,10]))

When you want to use (or try) the mean to impute missing values, just write the following lines and pandas will do the math for you:

import pandas as pd

df = pd.DataFrame(map(lambda x: {"a": x}, list(range(3)) + list(range(10)) + [pd.NA]))
df = df.apply(lambda col: col.fillna(col.mean()), axis=0)
print(df)
# Value with index 13 was imputed by 3.69

Using Median

Defined as the middle element of the data sorted in ascending order (from low to high). When the number of elements is odd, it is the element at the middle position; when it is even, it is the mean of the two middle elements.

In pure Python you can calculate it using the following snippet:

# code is just for illustration purpose
def median(data):
    sorted_data = sorted(data)
    size = len(data)

    if size % 2 == 0:
        # even number of elements: mean of the two middle elements
        return (sorted_data[size // 2 - 1] + sorted_data[size // 2]) / 2
    else:
        # odd number of elements: the single middle element
        return sorted_data[size // 2]

print(median([0,1,2,0,1,2,3,4,5,6,7,8,9,10]))
print(median([0,1,2,0,1,2,3,4,5,6,7,8,9]))

And by using pandas we can shrink this down to a one-liner:

import pandas as pd

df = pd.DataFrame(map(lambda x: {"a": x}, list(range(3)) + list(range(10)) + [pd.NA]))
df = df.apply(lambda col: col.fillna(col.median()), axis=0)
print(df)
# Value with index 13 was imputed by 3.0

Using Mode

Mode is defined as the value that appears most often.

def mode(data):
    temp = {}
    for e in data:
        if e in temp:
            temp[e] = temp[e] + 1
        else:
            temp[e] = 1

    current_max_count = 0
    current_max_value = None

    for k,v in temp.items():
        if v > current_max_count:
            current_max_count = v
            current_max_value = k
    return current_max_value

print(mode([10,1,2,10,1,2,3,4,5,6,7,8,9]))

If you are working with categorical data or with a variable that contains outliers, use the mode of the column. And here is the pandas code:

import pandas as pd

df = pd.DataFrame(map(lambda x: {"a": x}, list(range(3)) + list(range(10)) + [pd.NA]))
df = df.apply(lambda col: col.fillna(col.mode()[0]), axis=0)
print(df)
# Value with index 13 was imputed by 0 - the first of the tied modes 0, 1 and 2

Note that mode() can return multiple values (here 0, 1 and 2 are tied), so you have to pick one element from the resulting Series - here the first one.

Filling

Filling does not use a particular calculated value like we did before, but uses the values that are just before or after the NA value. In pandas it can be done with the following methods:

import pandas as pd

df = pd.DataFrame(map(lambda x: {"a": x}, list(range(3)) + list(range(10)) + [pd.NA]))
df = df.ffill()  # forward fill: propagate the last valid value forward
print(df)
# Value with index 13 was imputed with the value of index 12 - nine

To use the value that comes right after the NA, you can use bfill instead - but this will not work in the given example, as the NA is the last value and nothing follows it.
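A tiny sketch (the data is made up just to show the behaviour) where bfill does work, because a valid value follows the NA:

import pandas as pd

# made-up data: the NA sits in the middle, so a value follows it
df = pd.DataFrame({"a": [1, pd.NA, 3]})
print(df.bfill())
# the NA at index 1 is filled with the next value, 3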

Manually

As you know the business, you may think of even better ways to fill in missing values. Just take a moment to think about it.
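As an illustration only (the column names, the data, and the default value "DE" are made up), a domain-specific rule could look like this:

import pandas as pd

# hypothetical orders data: a missing shipping country defaults to the shop's home market
orders = pd.DataFrame({"order_id": [1, 2, 3], "country": ["DE", pd.NA, "FR"]})
orders["country"] = orders["country"].fillna("DE")
print(orders)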

Model

You can train a model to fill in your missing data as well. For example, KNN (k-nearest neighbors) can be used for a quantitative column. For the categorical case, you can one-hot encode the categories and train them with anything you like - then use the value with the highest score or probability to fill in the missing category.
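A minimal sketch for the quantitative case, assuming scikit-learn is installed (it provides a KNNImputer; the column names and data are made up):

import pandas as pd
from sklearn.impute import KNNImputer

# made-up numeric data with one missing value in column "b"
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, None, 40.0]})

# impute the missing value from the 2 nearest rows (neighbours are found via the other columns)
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)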

Thanks for Reading!