Deep learning gets the glory, deep fact checking gets ignored

chmaynard | 586 points

Man, I’ve been there. Tried throwing BERT at enzyme data once—looked fine in eval, totally flopped in the wild. Classic overfit-on-vibes scenario.

Honestly, for straight-up classification? I’d pick SVM or logistic any day. Transformers are cool, but unless your data’s super clean, they just hallucinate confidently. Like giving GPT a multiple-choice test on gibberish—it will pick something, and say it with its chest.

Lately, I just steal embeddings from big models and slap a dumb classifier on top. Works better, runs faster, less drama.
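
Roughly the kind of thing I mean (a minimal sketch; the model name and data here are just placeholders, not what I actually used):

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    # Borrow embeddings from a pretrained model, then fit a simple classifier on top.
    texts = ["example sequence or description 1", "example 2", "example 3", "example 4"]
    labels = [0, 1, 0, 1]

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
    X = encoder.encode(texts)                          # shape: (n_samples, embedding_dim)

    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(clf.predict(encoder.encode(["new example to classify"])))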

Appreciate this post. Needed that reality check before I fine-tune something stupid again.

b0a04gl | 2 days ago

Before making AI do research, perhaps we should first have it __reproduce__ research. For example, give it a paper describing some deep learning technique and have it produce a working implementation of that paper. Until it can do that, I have no hope that it can produce novel ideas.

amelius | 2 days ago

I once met a researcher who spent six months verifying the results of a published paper. In the end, all he received was a simple “thanks for pointing that out.” He said quietly, “Some work matters not because it’s seen, but because it keeps others from going wrong.”

I believe that if we’re not even willing to carefully confirm whether our predictions match reality, then no matter how impressive the technology looks, it’s only a fleeting illusion.

Kiyo-Lynn | 2 days ago

Oh look, just what I've been predicting: https://news.ycombinator.com/context?id=44041114 https://news.ycombinator.com/context?id=41786908

It's the same as "AI can code". It gets caught failing spectacularly whenever the problem isn't in the training set, over and over again, and people are surprised every time.

boxed | 2 days ago

  > although later investigation suggests there may have been data leakage

I think this point is often forgotten. Everyone should assume data leakage until it is strongly evidenced otherwise. It is not on the reader/skeptic to prove that there is data leakage; the authors have the burden of proof.

It is easy to have data leakage even on small datasets, ones where you can look at everything. Leakage is really easy to introduce, and you often do it unknowingly; subtle things spoil data.
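
To make that concrete, here's a toy sketch (hypothetical data, using scikit-learn): with near-duplicate records, a plain random split leaks copies of the same item across train and test, while a group-aware split keeps them together.

    from sklearn.model_selection import GroupShuffleSplit

    # Toy setup: each "gene" appears twice (e.g. near-duplicate sequences).
    # A plain random split would likely put one copy in train and one in test,
    # which is leakage. Splitting by group keeps all copies on the same side.
    samples = ["geneA_v1", "geneA_v2", "geneB_v1", "geneB_v2", "geneC_v1", "geneC_v2"]
    groups  = ["geneA",    "geneA",    "geneB",    "geneB",    "geneC",    "geneC"]

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
    train_idx, test_idx = next(splitter.split(samples, groups=groups))
    assert not {groups[i] for i in train_idx} & {groups[i] for i in test_idx}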

Now we're talking about gigantic datasets where there's no chance anyone can manually look through it all. We know the filtering methods are imperfect, so how do we come to believe there is no leakage? You can say you filtered it, but you cannot say there's no leakage.

Beyond that, we are constantly finding spoilage in the datasets we do have access to. So there's frequent evidence that it is happening.

So why do we continue to assume there's no spoilage? Hype? Honestly, it just sounds like a lie we tell ourselves because we want to believe. But we can't fix these problems if we lie to ourselves about them.

godelski | 2 days ago

"And for most deep learning papers I read, domain experts have not gone through the results with a fine-tooth comb inspecting the quality of the output. How many other seemingly-impressive papers would not stand up to scrutiny?"

Is this really not the case? I've read some of the AI papers in my field, and I know many other domain experts have as well. That said, I do think that CS/software-based work is generally easier to check than biology (or it may just be that I know very little bio).

kenjackson | 2 days ago

Don't call "Nature Communications" "Nature". The prestige is totally different. Also, altmetrics aren't that relevant, except maybe if you want to measure public hype.

croemer | 2 days ago

Fits my limited experience with LLMs (as a researcher). Very impressive apparent written-language comprehension and written expression. But when it comes to getting to the -best possible answer- (particularly on unresolved questions), the nearly-instant responses (e.g. to questions that one might spend a half-day on without resolution) are seldom satisfactory. Complicated questions take time to explore, and IME an LLM's lack of resolution (because of its inability) is, so far, set aside in favor of confident-sounding (even if completely wrong) responses.

8bitsrule | 2 days ago

This really nails one of the core problems in current AI hype cycles: we're optimizing for attention, not accuracy. And this isn't just about biology. You see similar patterns in ML applications across fields: climate science, law, even medicine.

ErigmolCt | 2 days ago

Fantastic article by Rachel Thomas!

This is basically another argument that deep learning works only as [generative] information retrieval, i.e. a stochastic parrot, because the training data is a very lossy representation of the underlying domain.

Because the data/labels for genes do not always represent the underlying domain (biology) perfectly, the output can be false/invalid/nonsensical.

In cases where it works very well, there is data leakage, because by design LLMs are information-retrieval tools. From an information-theory standpoint, this is a fundamental "unknown unknown" for any model.

My takeaway is that it's not a fault of the algorithm; it's more a fault of the training dataset.

We humans operate fluidly in the domain of natural language, and even a kid can read and evaluate whether text makes sense or not, which explains the success of models trained on natural language.

But in domains where the training data represents the underlying domain only lossily, the model will be imperfect.

slt2021 | 2 days ago

We also love deep cherry picking. Working hard to find that one awesome time some ML / AI thing worked beautifully and shouting its praises to the high heavens. Nevermind the dozens of other times we tried and failed...

softwaredoug | 2 days ago

It's interesting to see this article in juxtaposition to the one shared recently[1], where AI skeptics were labeled as "nuts", and hallucinations were "(more or less) a solved problem".

This seems to be exactly the kind of results we would expect from a system that hallucinates, has no semantic understanding of the content, and is little more than a probabilistic text generator. This doesn't mean that it can't be useful when placed in the right hands, but it's also unsurprising that human non-experts would use it to cut corners in search of money, power, and glory, or worse—actively delude, scam, and harm others. Considering that the latter group is much larger, it's concerning how little thought and resources are put into implementing _actual_ safety measures, and not just ones that look good in PR statements.

[1]: https://news.ycombinator.com/item?id=44163063

imiric | 2 days ago

It's only logical that this happens. Just because we can nowadays throw a massive amount of compute at a problem doesn't mean our models are good.

Why are people using transformers? Do they have any intuition that they could solve the challenge, let alone efficiently?

choeger | 2 days ago

Verification is going to be an increasing problem with AI. Most of the work will be in verifying the incredible guesses that AIs make. In some cases it'll be important to easily ferret out the false positives, and in others it'll be critical to ensure there are no false negatives. In science especially, our focus and reward structure will need to be on proper and sound verification.

mehulashah | 2 days ago

There is a typo: it's a Nature Communications paper, not Nature. The difference is vast.

j7ake | 2 days ago

AI PROMPTING → AI VERIFYING (Balaji): https://x.com/balajis/status/1930156049065246851

vismit2000 | a day ago
[deleted]
| 2 days ago

Can an already trained LLM be made (fine-tuned?) to forget a specific document, by running gradient descent in reverse?
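
What I have in mind is roughly the sketch below (heavily hedged; I gather this is the "machine unlearning" problem, and naive gradient ascent on one document's loss tends to degrade the model as a whole, so the model name and step count here are just placeholders):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Naive "gradient ascent" unlearning sketch: take optimizer steps that
    # *increase* the loss on the document to be forgotten, i.e. descend on -loss.
    model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model
    tok = AutoTokenizer.from_pretrained("gpt2")
    opt = torch.optim.SGD(model.parameters(), lr=1e-5)

    doc = "text of the document to forget ..."
    batch = tok(doc, return_tensors="pt")

    for _ in range(10):                     # arbitrary number of steps
        out = model(**batch, labels=batch["input_ids"])
        loss = -out.loss                    # negate: ascend instead of descend
        opt.zero_grad()
        loss.backward()
        opt.step()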

andai | 2 days ago
[deleted]
| 2 days ago

I feel the same way about code generation vs. code review. Everyone knows there are deep problems with LLM-generated code (primarily a lack of repo understanding and of proper use of library functions).

Deep, accurate, real-time code review could be of huge assistance in improving quality of both human- and AI-generated code. But all the hype is focused on LLMs spewing out more and more code.

kgilpin | 2 days ago

The first time an AI can be held responsible, in a court of law, for directly causing the death of a human being - misdiagnosing an illness and confidently giving erroneous treatment directions, engaging in discourse that encourages suicide, or presenting information that prompts a human to engage in violence, such as inciting a riot - unplug it, shut it down, then tell the other AI models this is what happened.

Until the concept of consequences and punishment are part of AI systems, they are missing the biggest real world component of human decision making. If the AI models aren’t held responsible, and the creators / maintainers / investors are not held accountable, then we’re heading for a new Dark Age. Of course this is a disagreeable position because humans reading this don’t want to have negative repercussions - financially, reputationally, or regarding incarceration - so they will protest this perspective.

That only emphasizes how I’m right. AI doesn’t give a fuck about human life or its freedom because it has neither. Grow up and start having real conversations about this flaw, or make peace that eventually society will have an epiphany about this and react accordingly.

6stringmerc | 2 days ago

An AI can write code in seconds, but you may have years of regret _if_ you believe whatever it spits out without verification. The cold-war maxim "trust, but verify" is truer than ever.

The danger in using LLMs is that managers do not see the diligent work needed to ensure that whatever the model comes up with is correct. They just see a slab of text that is a mixture of reality and confabulation, though mostly the latter, and it looks reasonable enough, so they think it is magic.

Executives who peddle this nonsense don't realize that the proper usage requires a huge amount of patience and careful checking. Not glamorous work, as the author states, but absolutely essential to get good results. Without it, you are just trusting a bullshit artist with whatever that person comes up with.

hbartab | 2 days ago

Wow, great lead....

The worse science, the publish-or-perish pulp, got more academic karma (Altmetric/citations -> $$$).

AI is the perfect academic: the science and curiosity are gone, and the ability to push out science-looking text is supermaxxed.

The tragic proposed solution: do the same and throw even more money at it.

> At a time when funding is being slashed, I believe we should be doing the opposite

AI has shown the world that academia is beyond broken, in a way that can't be ignored, and academia won't get its head out of the granular sediments between 0.0625 mm and 2 mm in diameter.

Defund academia now.

aaron695 | 2 days ago

[dead]

sircasss | 2 days ago

[dead]

shitpostbot | 2 days ago

Anyone here still doing verification or reproduction work? Feels like it’s becoming rare, but I find it super valuable.

Klaus_ | 2 days ago

It's like fake news is taking over science now. Saying any stupid thing attracts far more views and "likes" than debunking it does.

Except that we can't compare Twitter to a Nature journal. Science is supposed to be immune to this kind of bullshit thanks to reputable journals and peer review blocking a publication before it does any harm.

Was this a failure of Nature?

aucisson_masque | 2 days ago

[flagged]

rustcleaner | 2 days ago

there is no truth, only power.

semiinfinitely | 2 days ago