Top 5 Novice Mistakes in Machine Learning

Keerthana Durai
4 min read · May 9, 2022

Dealing with Missing Values

In data pre-processing, a key step is to treat missing data, because machine learning models won't accept NaN values as input. There are many ways to fill these NaN values, but to pick the best one we need to understand the significance of the missing values.

One option is to drop all rows with missing values, but before doing that, check the overall percentage of NaN values in the dataset. If it is less than about 1%, we can go ahead and drop them; otherwise we need to impute the data using other methods such as central tendency measures (mean/median), the KNN Imputer, and so on.
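Here is a minimal sketch of that decision in pandas and scikit-learn. The file name, the numeric-columns-only imputation, and the 1% threshold are just placeholder assumptions for illustration.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical dataset -- replace with your own file and columns.
df = pd.read_csv("data.csv")

# Overall percentage of missing cells in the dataset.
missing_pct = df.isna().sum().sum() / df.size * 100
print(f"Missing values: {missing_pct:.2f}%")

if missing_pct < 1:
    # Very few gaps: dropping the affected rows is usually safe.
    df = df.dropna()
else:
    # Otherwise impute, e.g. with KNN on the numeric columns.
    num_cols = df.select_dtypes(include="number").columns
    imputer = KNNImputer(n_neighbors=5)
    df[num_cols] = imputer.fit_transform(df[num_cols])
```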

When a feature is numeric, we usually impute with the mean or the median. The mean is the average: the sum of all the values divided by their count. The median is also a measure of the "typical" value, but it is the value that sits exactly in the middle of the sorted data.

I am sure most of us will choose the mean for imputation rather than the median, even when the distribution is skewed. But the median is usually the better choice.

Why is the median better than the mean?

The mean uses every value in the dataset, so a few abnormal or extreme values can drag it far away from the typical value. The median is not influenced by those extremes, which makes it more robust for skewed data.
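A tiny illustration with made-up salary numbers shows the difference: one extreme value dominates the mean but barely moves the median. In scikit-learn, the strategy parameter of SimpleImputer switches between the two.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy skewed feature: one extreme salary pulls the mean upward.
salary = pd.Series([30_000, 32_000, 35_000, 38_000, 40_000, 1_000_000, np.nan])
print(salary.mean())    # ~195,833 -- dominated by the outlier
print(salary.median())  # 36,500   -- barely affected

# The strategy flag picks which statistic fills the NaNs.
median_imputer = SimpleImputer(strategy="median")
filled = median_imputer.fit_transform(salary.to_frame())
```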

Ignoring Outliers

Outliers are abnormal values that deviate from the rest of the data. Sometimes these outliers carry important signal, so we can't ignore them without examining the dataset completely.

For instance:

  • Predicting water depth from observed rainfall: outliers (extreme rainfall events) are highly significant and should be kept.
  • Predicting house prices: outliers usually carry little significance and can be treated or removed.
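Before deciding which case you are in, it helps to flag the candidate outliers and look at them. One common rule of thumb is the 1.5 × IQR rule; below is a small sketch of it, using a hypothetical "price" column.

```python
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return a boolean mask marking values outside 1.5 * IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (series < lower) | (series > upper)

# Hypothetical house-price column: inspect the flagged rows before
# deciding whether to keep, cap, or drop them.
df = pd.DataFrame({"price": [120, 130, 125, 140, 135, 900]})
print(df[iqr_outliers(df["price"])])
```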

Data Leakage

What is Data Leakage problem in ML Models?

Data leakage happens when the data used to train a model contains information about the target the model is trying to predict, information that won't be available at prediction time. This results in over-optimistic scores during development and unreliable predictions after model deployment.

This problem often comes from standardization or normalization, because many of us apply these methods before splitting the data into train and test sets. The scaler then "sees" the test data, and the test set no longer plays its vital role as truly unseen data in the prediction phase.
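A minimal sketch of the right order of operations with scikit-learn: split first, fit the scaler on the training set only, then only transform the test set. The X and y here are random placeholders for your own features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data -- substitute your real features and target.
X = np.random.rand(100, 3)
y = np.random.rand(100)

# 1. Split first, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. ...then apply the already-fitted scaler to the test data.
X_test_scaled = scaler.transform(X_test)
```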

Choosing Appropriate Models


Model choice depends predominantly on the size and complexity of the dataset: if we are dealing with a complex problem, we may need more powerful models such as SVM, KNN, Random Forest, and so on.

Most of the time, the EDA phase helps us choose the model. If the visualizations show a roughly linear relationship (or linearly separable classes), we can go with a simple linear model. If we have no idea about the structure of the data, SVM and KNN are helpful starting points.
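When EDA alone doesn't settle it, a quick cross-validated comparison of a simple linear model against SVM and KNN can. This is just a sketch on a synthetic regression dataset; swap in your own X and y.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your own data.
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "SVM (RBF kernel)": SVR(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

# 5-fold cross-validated R2 score for each candidate.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f}")
```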

In real projects, I have felt that moving to complex models unnecessarily can create explainability issues with business-oriented people. For example, linear regression is much easier to explain than a neural network.

Validating with the Right Metrics

Metrics are quantitative measures of how well the model's predictions match the actual values, and they differ by problem type. For regression, the key metrics are the R2 score, MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error). For classification, the key metrics are precision, recall, F1-score and the confusion matrix.
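Here is a short sketch of how those metrics are computed with scikit-learn, on made-up predictions just to show the calls.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             precision_score, recall_score, f1_score,
                             confusion_matrix)

# Regression: made-up actuals and predictions.
y_true_reg = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_reg = np.array([2.8, 5.4, 2.9, 6.6])
print("R2:  ", r2_score(y_true_reg, y_pred_reg))
print("MAE: ", mean_absolute_error(y_true_reg, y_pred_reg))
print("RMSE:", np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)))

# Classification: made-up binary labels.
y_true_clf = [0, 1, 1, 0, 1, 1]
y_pred_clf = [0, 1, 0, 0, 1, 1]
print("Precision:", precision_score(y_true_clf, y_pred_clf))
print("Recall:   ", recall_score(y_true_clf, y_pred_clf))
print("F1:       ", f1_score(y_true_clf, y_pred_clf))
print(confusion_matrix(y_true_clf, y_pred_clf))
```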

In the beginning, I too was confused about choosing metrics, especially for deep learning networks used for regression. So, while designing the solution, make sure you have answered this question first: is the problem regression or classification? It is the basis for choosing both the model and the metrics. This is quite a small blog, but I hope it helps you somewhere.

I would love to hear some feedback. Thank you for your time!
