Model selection with cross-validation: A quest for an elite model


What do you call a prediction model that performs tremendously well on the same data it was trained on? Technically, a tosh! It will perform feebly on unseen data, a state known as overfitting.

To combat such a scenario, the dataset is split into a train set and a test set. The model is trained on the train set and kept deprived of the test set, which is then used to estimate how well the model generalises. Deciding on the best train-test split means balancing two competing concerns: less training data gives rise to greater variance in the parameter estimates, while less testing data leads to greater variance in the performance statistic. Conventionally, an 80/20 split is considered a suitable starting point at which neither variance is too high.
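As a minimal sketch, assuming scikit-learn and a generic bundled dataset (the article does not name one), the 80/20 split can be done with `train_test_split`, and the gap between train and test accuracy makes the overfitting described above visible:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 80% of the rows go to training, 20% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree can memorise the training data.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))  # typically close to 1.0
print("Test accuracy:", model.score(X_test, y_test))     # noticeably lower
```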

Yet another problem arises when we try to fine-tune the hyperparameters. The model can still overfit to the test data through data leakage, because knowledge about the test set seeps into the model with every tuning decision made against it. To prevent this, the dataset should typically be divided into three parts: a training set, a validation set, and a test set.
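A minimal sketch of that three-way split, again assuming scikit-learn and the same illustrative dataset and classifier as above: hyperparameters are chosen against the validation set only, so the test set is touched exactly once, for the final estimate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# First carve out the 20% test set, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Tune a hyperparameter using the validation set only.
best_depth, best_score = None, 0.0
for depth in [2, 4, 6, 8, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

# The test set is used once, after tuning is finished.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)
print("Chosen max_depth:", best_depth)
print("Test accuracy:", final_model.score(X_test, y_test))
```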