In this blog post we will analyse how to handle missing values in your dataset. In general, three methods exist:
* Let them stay and build features out of them
* Remove them
* Impute them
We will look at each method in more detail. Let's start directly with "let them stay".
In case only a small amount of data for a particular column is missing (less than 5%), you can leave the rows in the dataset and create features out of them.
Add a new column that contains "1" in case of a missing value, otherwise "0". This allows you to use the column for training. For the column containing the data itself, you can fill in any valid value (or impute one) - the model should barely use this value for training (it will have a low influence on the final model). An example implementation is presented in the following code snippet:
import pandas as pd


def let_them_stay(df: pd.DataFrame, column: str) -> pd.DataFrame:
    # indicator feature: 1 if the value is missing, 0 otherwise
    df[f'{column}_NA'] = df[column].isna().astype("int")
    # fill the original column with an arbitrary valid value
    df[column] = df[column].fillna(0)
    return df


# usage
df = pd.DataFrame([{"a": pd.NA, "c": 15}])
transformed = let_them_stay(df, 'a')
print(transformed)
When you have missing / NaN values across several columns, add a new column that counts the missing values per row and use it as a feature as well.
import pandas as pd
from typing import List


def count_nan_cols(row, columns: List[str]) -> int:
    # count how many of the given columns are missing in this row
    result = 0
    for column in columns:
        if column in row and pd.isna(row[column]):
            result += 1
    return result


def let_them_stay_columns(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    df['number_of_nans'] = df.apply(lambda row: count_nan_cols(row, columns), axis='columns')
    return df


# usage
df = pd.DataFrame([{"a": pd.NA, "b": pd.NA, "c": 15}])
transformed = let_them_stay_columns(df, ["a", "b"])
print(transformed)
Several approaches are possible when you decide to remove data:
* Remove a column completely
* Remove only outliers
* Remove rows with invalid data
* More criteria to remove data
To remove an entire column, you can use the drop method of the DataFrame class, passing the column name and setting "axis" to "columns". Note that drop is not in-place by default, so it returns a new DataFrame instead of changing the input DataFrame.
import pandas as pd
data = pd.DataFrame([{"a":pd.NA}])
cleaned_df = data.drop("a", axis="columns")
print(cleaned_df)
Removing outliers from your dataset can be done, for example, by using Tukey's rule. We calculate the 0.25 and 0.75 quantiles of the data, compute the IQR (interquartile range), derive the minimum and maximum allowed values and finally drop every row that does not fall within this range. The following code will do this for us:
import pandas as pd
import numpy as np


def tukey_rule(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    # quartiles and interquartile range
    q1 = df[column_name].quantile(0.25)
    q3 = df[column_name].quantile(0.75)
    iqr = q3 - q1
    # everything outside [q1 - 1.5 * iqr, q3 + 1.5 * iqr] is treated as an outlier
    max_value = q3 + 1.5 * iqr
    min_value = q1 - 1.5 * iqr
    matches = np.logical_and(df[column_name] < max_value, df[column_name] > min_value)
    return df[matches]
# usage
df = pd.DataFrame([-5,-1,0,1,5],columns=["a"])
tukey_df = tukey_rule(df, "a")
print(tukey_df)
Regarding the 1.5 factor that is hard coded in the example: you may need to adjust this value for your needs. With normally distributed data, a higher factor removes less data, since fewer points are considered "real" outliers. The value of 1.5 is common in the literature.
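If you want to experiment with this, you can make the factor a parameter. This is a minimal sketch; the tukey_rule_with_factor function and its factor argument are my own additions, not part of the snippet above:
import pandas as pd
import numpy as np


def tukey_rule_with_factor(df: pd.DataFrame, column_name: str, factor: float = 1.5) -> pd.DataFrame:
    # same logic as above, but the multiplier is configurable
    q1 = df[column_name].quantile(0.25)
    q3 = df[column_name].quantile(0.75)
    iqr = q3 - q1
    matches = np.logical_and(
        df[column_name] < q3 + factor * iqr,
        df[column_name] > q1 - factor * iqr,
    )
    return df[matches]


# usage: with factor=3.0 the rows -5 and 5 are kept, with 1.5 they would be dropped
df = pd.DataFrame([-5, -1, 0, 1, 5], columns=["a"])
print(tukey_rule_with_factor(df, "a", factor=3.0))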
You may also consider removing only the rows with invalid data. Note that this can reduce your training set a lot.
To drop any row with a missing value, use the dropna method on the DataFrame class with no arguments.
df_clean = df.dropna()
To drop columns in which every value is NaN, use the following snippet:
df_clean = df.dropna(axis='columns', how='all')
To drop only the rows with missing values in a particular column "col1", use the following code:
df_clean = df.dropna(subset=['col1'], how='any')
Before you remove rows or columns from your data, take a moment to think about what the missing values mean for your use case and whether they carry information you would rather keep.
The final option for handling missing values is to impute them. On the positive side of this approach, you don't lose any rows, but imputation can also reduce your feature variability. In general there is no rule for which imputation value to use - think about what matches the business context better, or try the options out as a hyperparameter for your model. You can do it with a small amount of code.
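Before we look at each statistic individually, here is one way to treat the imputation strategy as a hyperparameter. This is a sketch under the assumption that scikit-learn is available; the toy data, the model choice and the parameter grid are purely illustrative:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# toy data with missing values - replace with your own dataset
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [2.0, np.nan],
              [3.0, 5.0], [np.nan, 1.0], [6.0, 7.0], [5.0, 4.0]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("model", LogisticRegression()),
])

# the imputation strategy is searched like any other hyperparameter
grid = GridSearchCV(pipeline, {"imputer__strategy": ["mean", "median", "most_frequent"]}, cv=2)
grid.fit(X, y)
print(grid.best_params_)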
The mean is defined as the sum of all values divided by the number of values.
def mean(data):
    total_sum = float(sum(data))
    size = len(data)
    return total_sum / size


print(mean([0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
When you want to use (or try) the mean to impute missing values, just write the following line and pandas will do the math for you:
import pandas as pd
df = pd.DataFrame(map(lambda x: {"a": x}, list(range(3)) + list(range(10)) + [pd.NA]))
df = df.apply(lambda col: col.fillna(col.mean()), axis=0)
print(df)
# Value with index 13 was imputed by 3.69
The median is the middle element of the data after sorting it in ascending order (from low to high). When the number of elements is odd, it is the element at index length divided by two; when the number of elements is even, it is the mean of the two middle elements.
In pure Python you can calculate it using the following snippet:
# code is just for illustration purposes
def median(data):
    sorted_data = sorted(data)
    size = len(data)
    if size % 2 == 1:
        # odd number of elements: take the middle one
        return sorted_data[size // 2]
    else:
        # even number of elements: average the two middle ones
        return (sorted_data[size // 2 - 1] + sorted_data[size // 2]) / 2


print(median([0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 3.5
print(median([0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))  # 3
And by using pandas we can shrink this down to a one-liner:
import pandas as pd
df = pd.DataFrame(map(lambda x: {"a": x}, list(range(3)) + list(range(10)) + [pd.NA]))
df = df.apply(lambda col: col.fillna(col.median()), axis=0)
print(df)
# Value with index 13 was imputed by 3.0
The mode is defined as the value that appears most often.
def mode(data):
    # count how often every value appears
    temp = {}
    for e in data:
        if e in temp:
            temp[e] = temp[e] + 1
        else:
            temp[e] = 1
    # pick the value with the highest count
    current_max_count = 0
    current_max_value = None
    for k, v in temp.items():
        if v > current_max_count:
            current_max_count = v
            current_max_value = k
    return current_max_value


print(mode([10, 1, 2, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9]))  # 10
If you are working with categorical data or a variable with outliers, then use the mode of the column. And the pandas code:
import pandas as pd
df = pd.DataFrame(map(lambda x: {"a": x}, list(range(3)) + list(range(10)) + [pd.NA]))
df = df.apply(lambda col: col.fillna(col.mode()[0]), axis=0)
print(df)
# Value with index 13 was imputed by 0, the first of the most frequent values (0, 1 and 2 each appear twice)
Note that mode() returns a Series that can contain several values when there is a tie, so you have to pick one element from it (here the first).
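A quick illustration of why the indexing is necessary - a small sketch with a made-up column:
import pandas as pd

# 0, 1 and 2 all appear twice, so mode() returns all three of them
s = pd.Series([0, 1, 2, 0, 1, 2, 3])
print(s.mode())     # a Series containing 0, 1 and 2
print(s.mode()[0])  # 0 - the first of the tied values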
Filling does not use a single calculated value like we did before, but uses the values that sit just before or after the NA value. In pandas it can be done with the following methods:
import pandas as pd
df = pd.DataFrame(map(lambda x: {"a": x}, list(range(3)) + list(range(10)) + [pd.NA]))
df = df.ffill()  # forward fill; equivalent to fillna(method='ffill'), which is deprecated in recent pandas
print(df)
# The value at index 13 was imputed with the value at index 12 - nine.
To use the value that comes after the NA instead, you can use bfill - but this will not work in the given example, as the NA is the last value and nothing follows it.
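Here is a small sketch of backward fill with a column where the missing value sits in the middle; the data is made up for illustration:
import pandas as pd

# NA in the middle, so a following value exists to fill it with
df = pd.DataFrame({"a": [1, 2, pd.NA, 4, 5]})
df = df.bfill()  # backward fill: take the next valid value (4)
print(df)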
As you know the business, you may think of even better ways to fill in missing values. Just take a moment to think about it.
You can also train a model to fill in your missing data. For example, KNN (k-nearest neighbors) can be used for a quantitative column. For the categorical case, you can one-hot encode the values and train any classifier you like - then use the value with the highest score or probability to fill in the missing category.
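As a sketch of the quantitative case, scikit-learn ships a KNN-based imputer; the library choice and the toy data are my own, the idea above does not prescribe a specific implementation:
import numpy as np
from sklearn.impute import KNNImputer

# toy numeric data with one missing value - replace with your own columns
X = np.array([
    [1.0, 2.0],
    [2.0, 3.0],
    [4.0, 5.0],
    [3.0, np.nan],  # the missing entry to be imputed
])

# each missing value is replaced by the mean of that feature over the 2 nearest neighbours
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))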
Thanks for Reading!