Model selection with cross-validation: A quest for an elite model

3 minutes, 13 seconds read

Published on Oct 23, 2020

Updated on Oct 28, 2020

What do you call a prediction model that performs tremendously well on the same data it was trained on? Technically, a tosh! It will perform feebly on unseen data, thus leading to a state called overfitting.

To combat such a scenario, the dataset is split into train set and test set. The model is then trained on the train set and is kept deprived of the test set. This test set is utilized to estimate the efficacy of the model. To decide on the best train-test split, two competing cornerstones need to be focused on. Firstly, less training data will give rise to greater variance in the parameter estimates, and secondly, less testing data will lead to greater variance in the performance statistic. Conventionally, an 80/20 split is considered to be a suitable starting point such that neither variance is too high.

Yet another problem arises when we try to fine-tune the hyperparameters. There is a possibility for the model to still overfit on the testing data due to data leakage. To prevent this, a dataset should typically be divided into train, validation, and test sets. The validation set acts as an intermediary between the training part and the final evaluation part. However, this indeed reduces the training examples, thus making it less likely for the model to generalize, and the performance rather depends merely on a random split.

Here’s where cross-validation comes to our rescue!

Cross-validation (CV) eliminates the explicit requirement of a validation set. It facilitates the model selection and aids in gauging the generalizing capability of a model. The rudimentary modus operandi is the k-fold CV, where the dataset is split into k groups/folds and k-1 folds are used to train the model, while the held out k^thfold is used to validate the model. Henceforth, each fold gets an opportunity to be used as a test set. This way, in each fold, the evaluation score is retained and the model is then discarded. The model’s skill is summarised by the mean of the evaluation scores. The variance of the evaluated scores is often expressed in terms of standard deviation.

But is it feasible when the dataset is imbalanced?

Probably not! In case of imbalanced data an extension to k-fold CV, called Stratified k-fold CV proves to be the magic bullet. It maintains the class proportion in all the folds as it was in the original dataset, thus making it available for the model to train on both, the minority as well as majority classes.

Determining the value of k

This is a baffling concern though! Taking into account the bias-variance trade-off, the value of k should be decided carefully. Consequently, the k value should be chosen such that each fold can act as a representative of the dataset. Jumping on the bandwagon, it is preferred to set the k value as 5 or 10 since experimental success is observed with these values.

There are some other variations of cross-validation viz.,

Leave One Out CV (LOOCV): Only one sample is held out for the validation part
Leave P Out CV (LPOCV): Similar to LOOCV, P samples are held out for the validation part
Nested CV: Each fold involves cross-validation, making it a double cross-validation. It is generally used when tuning hyperparameters

Finally yet importantly, some tidbits that shouldn’t be ignored:

It is important to shuffle the data before moving ahead with cross-validation
To avoid data leakage, any data preparation step should be carried out on the training data within the cross-validation loop
It is preferable to repeat the cross-validation procedure by using repeated k-fold or repeated stratified k-fold CV for more reliable results especially, the variance in the performance metrics.

Voila! We finally made it! If the model evaluation scores are acceptably high and have low variance, it’s time to party hard! Our mojo has worked!

The Rise of Platform Engineering

The rise of DevOps marked a significant shift in software development, bringing together development and operations teams for faster and more reliable deployments. As the complexity of applications and infrastructure grew, DevOps teams often found themselves overwhelmed with managing both code and infrastructure.

Platform engineering offers a solution by creating a dedicated team focused on building and maintaining a self-service platform for application development. By standardizing tools and processes, it reduces cognitive overload, improves efficiency, and accelerates time-to-market.

Platform engineers are the architects of the developer experience. They curate a set of tools and best practices, such as Kubernetes, Jenkins, Terraform, and cloud platforms, to create a self-service environment. This empowers developers to innovate while ensuring adherence to security and compliance standards.

Role of DevOps and Cloud Engineers

Platform engineering reshapes the traditional development landscape. While platform teams focus on building and managing self-service infrastructure, application teams handle the development of software. To bridge this gap and optimize workflows, DevOps engineers become essential on both sides.

Platform and cloud engineering are distinct but complementary disciplines. Cloud engineers are the architects of cloud infrastructure, managing services, migrations, and cost optimization. On the other hand, platform engineers build upon this foundation, crafting internal developer platforms that abstract away cloud complexity.

Key Features of Platform Engineering:

Let’s dissect the core features that make platform engineering a game-changer for software development:

Abstraction and User-Friendly Platforms:

An internal developer platform (IDP) is a one-stop shop for developers. This platform provides a user-friendly interface that abstracts away the complexities of the underlying infrastructure. Developers can focus on their core strength – building great applications – instead of wrestling with arcane tools.

But it gets better. Platform engineering empowers teams through self-service capabilities.This not only reduces dependency on other teams but also accelerates workflows and boosts overall developer productivity.

Collaboration and Standardization

Close collaboration with application teams helps identify bottlenecks and smooth integration and fosters a trust-based environment where communication flows freely.

Standardization takes center stage here. Equipping teams with a consistent set of tools for automation, deployment, and secret management ensures consistency and security.

Identifying the Current State

Before building a platform, it’s crucial to understand the existing technology landscape used by product teams. This involves performing a thorough audit of the tools currently in use, analyzing how teams leverage them, and identifying gaps where new solutions are needed. This ensures the platform we build addresses real-world needs effectively.

Security

Platform engineering prioritizes security by implementing mechanisms for managing secrets such as encrypted storage solutions. The platform adheres to industry best practices, including regular security audits, continuous vulnerability monitoring, and enforcing strict access controls. This relentless vigilance ensures all tools and processes are secure and compliant.

The Platform Engineer’s Toolkit For Building Better Software Delivery Pipelines

Platform engineering is all about streamlining and automating critical processes to empower your development teams. But how exactly does it achieve this? Let’s explore the essential tools that platform engineers rely on:

Building Automation Powerhouses:

Infrastructure as Code (IaC):

Tools like Terraform and Ansible are the popular choices for IaC. They automate infrastructure provisioning and configuration, eliminating manual errors and enabling consistent, repeatable deployments.

CI/CD Pipelines:

Tools like Jenkins and GitLab CI/CD are essential for automating testing and deployment processes, ensuring applications are built, tested, and delivered with speed and reliability.

Maintaining Observability:

Monitoring and Alerting:

Prometheus and Grafana is a powerful duo that provides comprehensive monitoring capabilities. Prometheus scrapes applications for valuable metrics, while Grafana transforms this data into easy-to-understand visualizations for troubleshooting and performance analysis.

All-in-one Monitoring Solutions:

Tools like New Relic and Datadog offer a broader feature set, including application performance monitoring (APM), log management, and real-time analytics. These platforms help teams to identify and resolve issues before they impact users proactively.

Site Reliability Tools To Ensure High Availability and Scalability:

Container Orchestration:

Kubernetes orchestrates and manages container deployments, guaranteeing high availability and seamless scaling for your applications.

Log Management and Analysis:

The ELK Stack (Elasticsearch, Logstash, Kibana) is the go-to tool for log aggregation and analysis. It provides valuable insights into system behavior and performance, allowing teams to maintain consistent and reliable operations.

Managing Infrastructure

Secret Management:

HashiCorp Vault protects secretes, centralizes, and manages sensitive data like passwords and API keys, ensuring security and compliance within your infrastructure.

Cloud Resource Management:

Tools like AWS CloudFormation and Azure Resource Manager streamline cloud deployments. They automate the creation and management of cloud resources, keeping your infrastructure scalable, secure, and easy to manage. These tools collectively ensure that platform engineering can handle automation scripts, monitor applications, maintain site reliability, and manage infrastructure smoothly.

The Future is AI-Powered:

The platform engineering landscape is constantly evolving, and AI is rapidly transforming how we build and manage software delivery pipelines. The tools like Terraform, Kubecost, Jenkins X, and New Relic AI facilitate AI capabilities like:

Enhance security
Predict infrastructure requirements
Optimize resource security
Predictive maintenance
Optimize monitoring process and cost

Conclusion

Platform engineering is becoming the cornerstone of modern software development. Gartner estimates that by 2026, 80% of development companies will have internal platform services and teams to improve development efficiency. This surge underscores the critical role platform engineering plays in accelerating software delivery and gaining a competitive edge.

With a strong foundation in platform engineering, organizations can achieve greater agility, scalability, and efficiency in the ever-changing software landscape. Are you ready to embark on your platform engineering journey?

Building a robust platform requires careful planning, collaboration, and a deep understanding of your team’s needs. At Mantra Labs, we can help you accelerate your software delivery. Connect with us to know more.

CX Consulting

Product Engineering

Technology Modernization

Web Application Development

Mobile Application Development

Robotic Process Automation

Testing

Mantra AI: AI Strategy and Implementation

BFSI

Digital Health

Consumer Internet

Our Story

How We Work

Projects

Partnerships & Certifications

News and Events

Press Releases

Case Studies

Industry Reports

Blogs

Industry Use Cases

Newsletters