
Why machine learning models can fail in new settings, and what we can do about it

Machine learning models have their fair share of admirers, mostly for their ability to dig into colossal datasets and churn out highly accurate results. But they're not invincible. Quite the contrary, according to recent findings from MIT scientists, who uncovered a chink in the otherwise resilient armor of top-rated models: their accuracy often fails to carry over from one setting to another.

One might assume high accuracy is a testament to generalizability. Not according to MIT researchers, though. Marzyeh Ghassemi, an associate professor in MIT's Department of Electrical Engineering and Computer Science, notes that a model that is a superstar in one context can crash in another; in the team's experiments, this happened for up to 75 percent of the datasets studied. She counsels caution against blindly relying on average performance metrics when putting models into service in real-world scenarios.

When Models Falter and What Lies Beneath

An enlightening paper presented by the team at the 2025 Neural Information Processing Systems (NeurIPS) conference reveals how deep this problem runs. In essence, they found that models trained to diagnose illnesses from chest X-rays in one hospital could perform deplorably in another. The hitch? Aggregated statistics largely overlook this discrepancy, camouflaging poor performance on specific patient groups, such as those with pleural disease or an enlarged cardiomediastinum.
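The masking effect described above is easy to see in miniature. The sketch below uses entirely synthetic numbers (the group names and accuracies are invented for illustration, not taken from the MIT study): an aggregate accuracy of 90 percent can coexist with a subgroup on which the model is always wrong.

```python
from collections import defaultdict

def accuracy_by_group(labels, preds, groups):
    """Return overall accuracy plus a per-group breakdown."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# 90 "typical" patients the model gets right, plus 10 patients with a
# pleural finding it gets entirely wrong (synthetic data).
labels = [1] * 90 + [1] * 10
preds  = [1] * 90 + [0] * 10
groups = ["typical"] * 90 + ["pleural"] * 10

overall, per_group = accuracy_by_group(labels, preds, groups)
print(overall)               # 0.9 -- the aggregate looks strong...
print(per_group["pleural"])  # 0.0 -- ...but one subgroup fails completely
```

A single averaged number would never surface the second line, which is exactly the discrepancy the researchers warn about.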

A key issue identified by the researchers was the presence of spurious correlations: relationships learned during training that do not hold in new environments. When a model relies on one, the fallout can be far-reaching, if not catastrophic. For instance, an imaging model may associate specific markings on X-rays from one hospital with a disease, then fail to detect the same disease in another hospital's scans where the marking is absent. Unlearning these spurious relationships is a genuine challenge.
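To make the X-ray example concrete, here is a deliberately tiny, fully synthetic sketch (the features, data, and decision-stump "model" are all invented for illustration). In the training hospital, a scanner-specific marking predicts disease perfectly, so a model choosing the single most predictive feature latches onto the marking instead of the weaker genuine signal, then collapses at a hospital without that marking.

```python
def best_stump(data):
    """Pick the single feature that best predicts the label on this data.
    Each row is (feature_0, ..., feature_k, label)."""
    best_f, best_acc = None, -1.0
    n_features = len(data[0]) - 1
    for f in range(n_features):
        acc = sum(row[f] == row[-1] for row in data) / len(data)
        if acc > best_acc:
            best_f, best_acc = f, acc
    return best_f

train = [  # (scanner_marking, genuine_signal, disease) -- synthetic
    (1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 1, 1),
    (0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0),
]
# The marking predicts disease 8/8 times here; the genuine signal only 6/8,
# so the stump picks the spurious feature.
f = best_stump(train)   # f == 0: the scanner marking wins

test = [  # new hospital: no marking, disease still present
    (0, 1, 1), (0, 1, 1), (0, 0, 0), (0, 1, 1),
]
acc = sum(row[f] == row[-1] for row in test) / len(test)
print(acc)  # 0.25: the stump misses every diseased patient
```

The stump is a stand-in for a real classifier, but the failure mode is the same one the paper describes: the learned shortcut works only where the shortcut exists.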

Poking Holes in Traditional Beliefs and Looking Ahead

The conventional wisdom was that if models ranked highly in one setting, they would shine equally in another. This premise, referred to as "accuracy-on-the-line," met its downfall at the hands of the MIT team's investigations. Their work showed that models crowned in one context could in fact be the laggards in another.

The researchers navigated this situation with a novel algorithm called OODSelect, spearheaded by MIT postdoc Olawale Salaudeen. The technique scrutinizes thousands of models trained on one dataset and then retested on different data, casting a spotlight on those that performed admirably in the initial setting but flunked significantly in a new one.
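The selection idea behind that scrutiny can be sketched in a few lines. This is not the actual OODSelect implementation (the team's released code should be consulted for that); it is a minimal illustration, with invented model names, accuracies, and thresholds, of flagging models whose in-distribution strength masks an out-of-distribution collapse.

```python
def select_surprising_models(results, id_threshold=0.9, drop_threshold=0.2):
    """Flag models that look strong in-distribution but fall sharply
    out of distribution. `results` maps model name -> (id_acc, ood_acc).
    Thresholds here are illustrative, not taken from the paper."""
    flagged = []
    for name, (id_acc, ood_acc) in results.items():
        if id_acc >= id_threshold and (id_acc - ood_acc) >= drop_threshold:
            flagged.append(name)
    return flagged

results = {  # synthetic accuracies for three hypothetical models
    "model_a": (0.95, 0.93),  # transfers well
    "model_b": (0.94, 0.60),  # strong in-distribution, collapses OOD
    "model_c": (0.80, 0.78),  # mediocre but stable
}
print(select_surprising_models(results))  # ['model_b']
```

Model B is exactly the kind of case aggregate leaderboards reward and deployment punishes, which is why surfacing such models is the algorithm's goal.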

What’s the way forward? The team has already put forth their code and the identified subsets for others to use, hoping that the machine learning community will embrace OODSelect. This way, organizations that stumble upon areas where their models are underperforming can course-correct by taking targeted steps to improve those specific areas.

“We hope the released code and OODSelect subsets serve as a bridge,” the researchers write, indicating their drive towards creating benchmarks and models that grapple with the adverse effects of spurious correlations.

To explore this topic in greater detail, check out the original article from MIT News: Why it's critical to move beyond overly aggregated machine-learning metrics.

Max Krawiec
