Biggani Org, Technology Desk
Artificial Intelligence (AI) and Machine Learning (ML) are transforming today's world. AI is the broad field of making computers perform tasks that normally require human intelligence, and ML is a branch of it in which computers learn from data and make decisions without being explicitly programmed.
Have you ever built a machine learning model that performed wonderfully on training data but failed in the real world? If so, you are not alone. Machine learning is a complex process, and even small mistakes here can lead to major problems.
Wrong Data, Wrong Decisions
The entire foundation of machine learning rests on data. If the data itself is incorrect or misleading, the resulting model will be effectively useless. During the COVID-19 pandemic, numerous models were built that were later found not to work correctly. Such failures illustrate the old principle of "garbage in, garbage out": wrong data will always yield wrong results.
Often, datasets contain hidden confounding variables that mislead the model. For example, in COVID-19 chest imaging datasets, it was found that seriously ill patients were usually scanned lying down, whereas healthier patients were scanned standing up. As a result, models learned to identify the patient's position during the scan rather than the disease itself.
Spurious Correlations and Misleading Signals
Many times, models rely on factors that are unrelated to the actual problem. For example, the US Army once built a model to identify tanks in photos. But it turned out the model was actually making decisions based on the lighting conditions in the images, not on the tanks themselves. So, when the time of day changed, the model completely failed.
There are several ways to address such problems. One is Explainable AI, a family of techniques that reveal which parts of the input a model uses to make its decisions. If the model turns out to be focusing on the background instead of the main object, it cannot be trusted.
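One simple explainability technique is permutation importance: shuffle one feature and see how much the model's accuracy drops. Below is a minimal sketch with synthetic data inspired by the tank anecdote; the feature names and the toy "model" are hypothetical, invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "tank photos": two features per image (hypothetical data).
n = 1000
tank = rng.integers(0, 2, n)                # is a tank present?
obj_feature = tank + rng.normal(0, 0.3, n)  # genuine object signal
brightness = tank + rng.normal(0, 0.1, n)   # lighting confounder: tank photos
                                            # happened to be taken in brighter light
X = np.column_stack([obj_feature, brightness])
y = tank

# A toy "model" that secretly keys on brightness (column 1) only.
def predict(X):
    return (X[:, 1] > 0.5).astype(int)

def permutation_importance(X, y, col):
    """Accuracy drop when one column is shuffled; a large drop means
    the model relies on that column."""
    base = (predict(X) == y).mean()
    Xp = X.copy()
    Xp[:, col] = rng.permutation(Xp[:, col])
    return base - (predict(Xp) == y).mean()

imp_object = permutation_importance(X, y, 0)
imp_brightness = permutation_importance(X, y, 1)
print(f"object feature importance: {imp_object:.2f}")      # ~0: model ignores it
print(f"brightness importance:     {imp_brightness:.2f}")  # large: the red flag
```

A result like this, where the real object contributes nothing, is exactly the warning sign that the model has latched onto a spurious signal.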
Data Leakage and Testing Errors
Another major issue in machine learning is "data leakage": the model gains access to information during training that it shouldn't have. A common cause is preprocessing the entire dataset, test set included, before splitting it, so the model indirectly learns about the data it will later be evaluated on.
For example, data is sometimes normalized using statistics computed over the full dataset, which gives the model hints about the test data. Such issues often arise in stock market prediction models, where past data is analyzed to try to predict the future. If the model unintentionally gets access to future data, its predictions may look unusually accurate, yet it will fail in genuinely new situations.
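The normalization trap can be shown in a few lines. This is a minimal sketch with a made-up "price" series: the leaky version computes the mean and standard deviation over the whole series, while the clean version fits them on the training split only.

```python
import numpy as np

rng = np.random.default_rng(1)
prices = rng.normal(100, 10, 120)        # toy "price" series (hypothetical data)
train, test = prices[:100], prices[100:]

# Leaky: mean/std computed over the FULL series, so information
# about the test set bleeds into the training features.
mu_all, sd_all = prices.mean(), prices.std()
train_leaky = (train - mu_all) / sd_all

# Correct: fit the normalization on the training split only,
# then reuse those same parameters on the test split.
mu, sd = train.mean(), train.std()
train_clean = (train - mu) / sd
test_clean = (test - mu) / sd

# The two training sets differ: the leaky one was shaped by test data.
print(np.allclose(train_leaky, train_clean))
```

The same "fit on train, apply to test" rule holds for any preprocessing step: imputation, feature selection, encoding, and so on.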
Limitations of the Evaluation Process
Using the wrong metric to evaluate a model can also be a big problem. In many cases, decisions are made based solely on "accuracy", which can be misleading: a model that always predicts the majority class can score high accuracy while being useless in practice.
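Here is a minimal sketch of that trap, using a made-up medical screening scenario: a "model" that always predicts "healthy" gets 95% accuracy yet catches no sick patient at all.

```python
import numpy as np

# Imbalanced data: 95 healthy (0), 5 sick (1) -- hypothetical numbers.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)        # always predicts "healthy"

accuracy = (y_pred == y_true).mean()
tp = ((y_pred == 1) & (y_true == 1)).sum()
recall = tp / (y_true == 1).sum()

print(f"accuracy: {accuracy:.2f}")       # 0.95 -- looks impressive
print(f"recall:   {recall:.2f}")         # 0.00 -- misses every sick patient
```

Reporting recall (or precision, F1, and a confusion matrix) alongside accuracy exposes the problem immediately.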
Especially in time series forecasting models, using the wrong metric can lead to major issues. Research has shown that many complex models don’t actually perform better than very simple models. So, it is wise to use multiple metrics when evaluating models.
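One practical habit is to always measure a naive baseline before trusting a complex forecaster. The sketch below, on a made-up random-walk series, computes the error of the simplest possible forecast: tomorrow equals today. Any sophisticated model should beat this number before its results are taken seriously.

```python
import numpy as np

rng = np.random.default_rng(2)
series = np.cumsum(rng.normal(0, 1, 200))   # random-walk series (hypothetical data)

# Naive baseline: forecast that tomorrow's value equals today's.
actual = series[1:]
naive_pred = series[:-1]
mae_naive = np.abs(actual - naive_pred).mean()
print(f"naive baseline MAE: {mae_naive:.2f}")
```

If a deep, carefully tuned model only matches this baseline, the extra complexity is buying nothing.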
How Can These Mistakes Be Avoided?
To avoid such problems, many researchers now follow a guideline called the REFORMS checklist, which helps ensure that every stage of model building is done properly and that no important mistake slips through. In addition, experiment tracking and careful validation practices can reduce data leakage and overfitting.
When building machine learning models, rather than being overconfident, one should analyze performance with the mindset that mistakes are possible. It is crucial to verify each decision and data source, because a faulty model can create major problems not only in research but also in real-life decision-making.
To Learn More
- Machine Learning Pitfalls Explained
- Explainable AI Techniques
- REFORMS Checklist for Machine Learning
