Missing values
Data aggregation, extraction, and consolidation are rarely perfect and sometimes result in missing values. There are several common strategies for dealing with missing values in a dataset:
- Removing all the rows with missing values from the dataset. This is simple to apply, but you may end up throwing away a big chunk of information that would have been valuable to your model.
- Using models that can handle missing values natively, such as many decision tree-based models (random forests, boosted trees). Unfortunately, the linear regression model, and by extension the SGD algorithm used to train it, does not work with missing values (http://facweb.cs.depaul.edu/sjost/csc423/documents/missing_values.pdf).
- Imputing the missing data with replacement values; for example, replacing missing values with the median, the average, or the harmonic mean of all the existing values, or using clustering or linear regression to predict the missing values. It is often useful to also add an indicator column to the dataset that records which values were missing in the first place.
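The first and third strategies above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy records, the `age` field, and the helper names `drop_missing` and `impute_median` are hypothetical and do not come from Amazon ML or from any particular library.

```python
from statistics import median

# Hypothetical toy records; None marks a missing age.
rows = [
    {"name": "a", "age": 22.0},
    {"name": "b", "age": None},
    {"name": "c", "age": 38.0},
    {"name": "d", "age": 26.0},
    {"name": "e", "age": None},
]


def drop_missing(rows, key):
    """Row removal: keep only the rows where `key` is present."""
    return [r for r in rows if r[key] is not None]


def impute_median(rows, key):
    """Imputation: fill missing `key` values with the median of the
    observed values, and add a `<key>_missing` flag (1 or 0) recording
    whether the value was originally absent."""
    fill = median(r[key] for r in rows if r[key] is not None)
    out = []
    for r in rows:
        missing = r[key] is None
        new = dict(r)  # copy so the original records stay untouched
        new[key] = fill if missing else r[key]
        new[key + "_missing"] = 1 if missing else 0
        out.append(new)
    return out


dropped = drop_missing(rows, "age")
imputed = impute_median(rows, "age")
```

Here `dropped` keeps only the three complete records, while `imputed` keeps all five and lets a downstream model see, through the `age_missing` flag, which ages were filled in.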
In the end, the right strategy depends on the type of missing data and, of course, the context. While replacing missing blood pressure readings in a patient's medical record with some average may not be acceptable in a healthcare context, replacing missing age values with the average age in the Titanic dataset is perfectly acceptable in a data science competition.
Amazon ML's documentation, however, is not entirely clear on the strategy it uses to deal with missing values:
If the target attribute is present in the record, but a value for another numeric attribute is missing, then Amazon ML overlooks the missing value. In this case, Amazon ML creates a substitute attribute and sets it to 1 to indicate that this attribute is missing.
In other words, when a value is missing, a new column is created with a Boolean flag indicating that the value was missing in the first place. It is not clear, however, whether the whole row (sample) is discarded or only the missing cell is ignored. The documentation makes no mention of any type of imputation.