Fooled by "AI for Science" Hype? A Scientist's Painful Lesson


Pictured: Nick McGreivy.

Editor | Radish Skin

When people think of AI for Science, the headline achievements usually come to mind first: AlphaFold 3 and Evo 2, tools said to predict the structure and function of nearly all of life's molecules, or GNoME, which reportedly discovered 2.2 million new crystals... These results are held up as evidence of AI's progress in science.

But have these achievements been oversold? Setting theory aside, how well does AI actually perform on real scientific problems?

Today, I'd like to share an unusual story with you.

The protagonist of the story is Nick McGreivy, a physicist who received his PhD from Princeton University last year.

He was once enthusiastic about using AI to accelerate physics and shifted his research focus to machine learning accordingly. But when he tried to apply AI to actual physics problems, the results were deeply disappointing.

Unlike most people who get a silly answer from ChatGPT or another chatbot (they complain on social media, then keep using it anyway), Nick carefully analyzed the lessons from his attempts to solve partial differential equations with PINNs, dug into the easily overlooked methodological errors behind those results, examined the research settings where such errors are likely to arise, and drew some broader conclusions.

In plain language, his conclusions are these: the widespread use of AI in the scientific community benefits "scientists" more than "science." Moreover, because papers tend to report successes and hide failures, the field suffers from severe survivorship bias, making it resemble a feed of retouched social-media photos: behind the glamorous achievements lie filtered-out failures and prettified expectations.

So what led Nick, once so enthusiastic about AI, to this conclusion? Is "AI accelerating scientific discovery" a false premise? A recent essay by Nick offers some clues.


The following is ScienceAI's full translation and compilation of Nick McGreivy's article.

In 2018, as a second-year PhD student in plasma physics at Princeton University, I decided to shift my research focus to machine learning. I didn't have a specific research project at the time, but I believed that leveraging AI to accelerate physics research could have a greater impact. (Frankly, high-paying jobs in AI also motivated me.)

I ended up researching what AI pioneer Yann LeCun later called a "pretty hot topic": using AI to solve partial differential equations (PDEs). But when I tried to build on what I thought were great research results, I found that AI methods performed nowhere near as well as advertised.

Initially, I tried applying physics-informed neural networks (PINNs), a widely cited AI method, to some fairly simple partial differential equations, but found them surprisingly fragile.

Later, despite dozens of papers claiming that AI methods could solve PDEs much faster than standard numerical methods—in some cases, even a million times faster—I found that most of these comparisons were biased. When I compared these AI methods on an equal footing with state-of-the-art numerical methods, any narrow advantages AI possessed usually disappeared.

This experience made me question the claims that AI is about to "accelerate" or even "revolutionize" science. Are we really about to enter what DeepMind calls "a new golden age of AI-powered scientific discovery"? Or is the overall potential of AI in science being overstated—as it was in my own field?

Others have found similar problems. For example, in 2023, DeepMind claimed to have discovered 2.2 million crystal structures, marking "an order-of-magnitude expansion of the number of stable materials known to humanity." But when materials scientists analyzed these generated compounds, they found them to be "mostly rubbish" and politely noted that the paper "yielded no new materials."

Related links:

https://www.nature.com/articles/s41586-023-06735-9

https://journals.aps.org/prxenergy/abstract/10.1103/PRXEnergy.3.011002

Additionally, Princeton University computer scientists Arvind Narayanan and Sayash Kapoor compiled a list of 648 papers across 30 fields, every one of which committed a methodological error called data leakage, which leads to overly optimistic results. They argue that AI-based science is facing a "reproducibility crisis."

Related links:

https://reproducible.cs.princeton.edu/

https://arxiv.org/abs/2405.15828
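
To make the error concrete, here is a minimal sketch of one classic form of data leakage (my own Python illustration, not an example from their list): performing feature selection on the full dataset before splitting lets test-set information leak into training, so even pure noise can look predictive.

```python
# Minimal sketch of data leakage (illustrative; not an example from the list):
# selecting "predictive" features on the FULL dataset before splitting leaks
# test-set information into training, so even pure noise looks predictive.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))    # pure noise: no real signal
y = rng.integers(0, 2, size=200)

# Leaky pipeline: the feature selector sees every label, including test rows.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)
leaky = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Correct pipeline: split first, fit the selector on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
sel = SelectKBest(f_classif, k=20).fit(X_tr, y_tr)
clean = (LogisticRegression()
         .fit(sel.transform(X_tr), y_tr)
         .score(sel.transform(X_te), y_te))

print(f"leaky accuracy: {leaky:.2f}  clean accuracy: {clean:.2f}")
```

On noise data, the leaky pipeline reports well-above-chance accuracy while the correct pipeline hovers around 0.5, which is exactly the kind of inflated result Narayanan and Kapoor documented.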

Yet despite these problems, the use of AI in scientific research has increased dramatically over the past decade. The impact is most pronounced in computer science, of course, but other disciplines (physics, chemistry, biology, medicine, and the social sciences) have also adopted AI rapidly. Across all scientific publications, AI usage grew from 2% in 2015 to nearly 8% in 2022. Data for the most recent years is hard to find, but there is good reason to believe the upward trend has continued.


Pictured: More and more scientists are using AI for research.

To be clear, AI can drive scientific breakthroughs. What I'm concerned about is the scale and frequency of breakthroughs. Does AI truly show enough potential to justify such a massive investment of talent, training, time, and capital, shifting from existing research directions to a single paradigm?

Each scientific field has a different experience with AI, so we should be cautious in our discourse. However, I am convinced that some lessons from my experience can be broadly applied across science:

1. More and more scientists are enthusiastic about using AI for research, not because it "benefits science," but rather because its existence "benefits scientists" themselves.

2. Because AI researchers almost never publish negative results, AI-for-science research suffers from survivorship bias.

3. Published positive results are often overly optimistic about AI's potential.

Related links: https://arxiv.org/abs/2412.07727

Thus, I've come to believe that AI, overall, is not as successful or revolutionary in science as it appears.

Ultimately, I don't know if AI can reverse decades of declining scientific productivity and stagnating (or even decelerating) scientific progress. I don't think anyone can. But unless there's a major (and in my view, unlikely) breakthrough in advanced AI, I expect AI to be more of an incremental, uneven, conventional tool for scientific progress rather than a revolutionary one.

A Disappointing Experience with PINNs

In the summer of 2019, I first experienced what would later become the subject of my thesis: using AI to solve partial differential equations. PDEs are mathematical equations used to model various physical systems, and solving (i.e., simulating) PDEs is an extremely important task in computational physics and engineering. My lab uses PDEs to simulate the behavior of plasma, for example, inside fusion reactors and in the interstellar medium of outer space.

The AI models used to solve PDEs are custom deep learning models, more similar to ChatGPT than AlphaFold.

The first method I tried was PINNs, which had recently been introduced in an influential paper that had already garnered hundreds of citations.

Related links:

https://www.sciencedirect.com/science/article/abs/pii/S0021999118307125

https://github.com/maziarraissi/PINNs

Compared with standard numerical methods, PINNs take a completely different approach to solving PDEs. Standard methods represent a PDE solution as a set of pixels (as in an image or video) and derive algebraic equations that each pixel value must satisfy. PINNs instead represent the solution as a neural network and build the equations into the loss function.
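
To make the contrast concrete, here is a minimal PINN sketch for the one-dimensional Burgers equation u_t + u*u_x = nu*u_xx (my own PyTorch illustration, not the original paper's code). The network itself plays the role of the solution u(x, t), and the PDE residual, computed by automatic differentiation, is minimized as part of the loss; boundary terms are omitted for brevity.

```python
# A minimal PINN sketch (schematic; not Raissi et al.'s code) for the 1D
# Burgers equation u_t + u*u_x = nu*u_xx on x in [-1, 1], t in [0, 1],
# with initial condition u(x, 0) = -sin(pi * x).
import torch

# The neural network IS the solution: it maps (x, t) -> u(x, t).
net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
nu = 0.01 / torch.pi  # viscosity used in the paper's Burgers example

def pde_residual(xt):
    """PDE residual u_t + u*u_x - nu*u_xx via automatic differentiation."""
    xt = xt.requires_grad_(True)
    u = net(xt)
    du = torch.autograd.grad(u, xt, torch.ones_like(u), create_graph=True)[0]
    u_x, u_t = du[:, :1], du[:, 1:]
    u_xx = torch.autograd.grad(u_x, xt, torch.ones_like(u_x),
                               create_graph=True)[0][:, :1]
    return u_t + u * u_x - nu * u_xx

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(5000):
    # Random collocation points in the domain: x in [-1, 1], t in [0, 1].
    xt = torch.rand(256, 2) * torch.tensor([2.0, 1.0]) - torch.tensor([1.0, 0.0])
    x0 = torch.rand(64, 1) * 2.0 - 1.0
    u0 = net(torch.cat([x0, torch.zeros_like(x0)], dim=1))
    # Loss = PDE residual + initial-condition mismatch (boundary terms omitted).
    loss = (pde_residual(xt).pow(2).mean()
            + (u0 + torch.sin(torch.pi * x0)).pow(2).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note there is no grid anywhere: the equation enters only through a penalty term, which is precisely what makes the method elegant, and, as described below, what makes its failures hard to diagnose.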

As a naive graduate student who didn't even have an advisor yet, I found PINNs immensely appealing: they seemed so simple, elegant, and versatile.

They also seemed to get good results. The paper introducing PINNs stated that their effectiveness had been "demonstrated through a collection of classical problems in fluids, quantum mechanics, reaction-diffusion systems, and the propagation of nonlinear shallow-water waves." I figured that if PINNs could solve all these PDEs, they could surely solve the plasma physics PDEs my lab was interested in.

However, when I replaced an example in that influential paper (the one-dimensional Burgers equation) with another equally simple partial differential equation (the one-dimensional Vlasov equation), the results looked nothing like the exact solution.

Eventually, after much tuning, I got some seemingly correct results. However, when I tried slightly more complex partial differential equations (e.g., the one-dimensional Vlasov-Poisson equation), no matter how much I tuned, I couldn't get a suitable solution.

After weeks of failure, I messaged a friend at another university, who told me he had also tried using PINN but hadn't gotten good results.

Lessons Learned from PINN Experiments

Eventually, I realized what the problem was. The original authors of the PINN paper, like me, had observed that a given setup could produce great results for one equation yet fail on another. But to convince readers of how powerful PINNs were, they showed no examples of PINNs failing.

This experience taught me a few things.

First, be cautious about taking AI research at face value. Most scientists don't want to mislead anyone, but because they have strong incentives to present favorable results, there is still a risk of being misled. Going forward, I must be more cautious, and even (or especially) suspicious of impressive, influential papers.

Second, papers are rarely published when AI methods fail; they are published when AI methods succeed.

The original authors of the PINN paper did not publish the PDEs that their method could not solve. I also did not publish my failed experiments, only presenting a poster at a lesser-known conference. Therefore, few researchers heard about them. In fact, despite PINN's popularity, it took four years for a paper on its failure modes to be published. That paper has now been cited nearly a thousand times, indicating that many other scientists also tried PINN and found similar problems.

Related links:

https://github.com/nickmcgreivy/PINN/blob/master/APS-Poster-McGreivy-2019.pdf

https://proceedings.neurips.cc/paper/2021/hash/df438e5206f31600e6ae4af72f2725f1-Abstract.html

Third, I concluded that PINNs were not the method I wanted: simple and elegant, yes, but also too unreliable, too cumbersome, and too slow.

As of today, six years later, the original PINN paper has been cited 14,000 times, making it the most cited numerical methods paper of the 21st century.

Although it is now generally accepted that PINNs are often inferior to standard numerical methods for solving PDEs, their performance on another class of problems, called inverse problems, remains controversial. Proponents claim PINNs are "especially effective" for inverse problems, but some researchers strongly dispute this.

I don't know which side of the argument is correct. I'm willing to believe that all this PINN research has yielded some useful results, but I wouldn't be surprised if one day we look back at PINN and find it was just a big citation bubble.

Weak Baselines Lead to Overoptimism

My thesis focused on using deep learning models to solve partial differential equations, models that, similar to traditional solvers, treat the PDE solution as a set of pixels on a grid or graph.

Unlike PINNs, this approach showed great potential for the complex, time-dependent PDEs my lab was interested in. Most impressively, paper after paper demonstrated that it could solve PDEs much faster than standard numerical methods, often by several orders of magnitude.

The examples that excited my advisor and me most were PDEs in fluid mechanics, such as the Navier-Stokes equations. We believed we might see similar acceleration because the PDEs we cared about—for example, those describing plasma in fusion reactors—have a similar mathematical structure. Theoretically, this could allow scientists and engineers like us to simulate larger systems, optimize existing designs faster, and ultimately accelerate the pace of research.

By then, I was mature enough to know that in AI research, things aren't always as rosy as they seem. I knew reliability and robustness could be serious issues. If AI models could offer faster simulation speeds but these simulations were less reliable, would the trade-off be worth it? I didn't know the answer then, so I set out to find it.

But as I tried, and mostly failed, to make these models more reliable, I began to question how much potential AI models truly had for accelerating PDE solving.

According to some high-profile papers, AI could solve the Navier-Stokes equations orders of magnitude faster than standard numerical methods. However, I eventually found that the baselines used in these papers were not the fastest numerical methods available. When I compared the AI methods against more advanced numerical methods, the AI was no faster (or at most marginally faster) than the stronger baselines.


Pictured: When AI methods used to solve PDEs are compared to strong baselines, any narrow advantages AI might have often disappear.

My advisor and I eventually published a systematic review examining research using AI to solve fluid mechanics PDEs. We found that out of 76 papers claiming superiority over standard numerical methods, 60 (79%) used weaker baseline methods, either because they didn't compare to more advanced numerical methods or because they didn't compare on an equal footing. Papers with larger reported accelerations were all compared to weak baseline methods, suggesting that the more impressive the results, the more likely the paper's comparison was unfair.

Related links: https://www.nature.com/articles/s42256-024-00897-5


Pictured: Results from a systematic review comparing AI methods used to solve fluid mechanics PDEs to standard numerical methods. Few papers reported negative results, and most papers reporting positive results compared against weak baselines.

We again found evidence that researchers tend not to report negative results, an effect known as reporting bias. We ultimately concluded that AI for PDE solving research is overly optimistic: "Weak baselines lead to overly positive results, and reporting bias leads to underreporting of negative results."
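
To see how much the choice of baseline alone can matter, consider a deliberately simple, hypothetical comparison (my own illustration, not an example from the review): solving the same one-dimensional Poisson problem with a generic dense solver versus a sparse solver suited to the problem's structure. Any new method benchmarked against the dense solver inherits an enormous apparent speedup that says nothing about the method itself.

```python
# Hypothetical illustration of baseline sensitivity (not from the review):
# the same 1D Poisson problem, A u = f, solved with a weak baseline (generic
# dense solver) and a strong baseline (sparse solver suited to the structure).
import time
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 4000
# Tridiagonal finite-difference Laplacian for 1D Poisson.
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A_sparse = sp.diags([off, main, off], [-1, 0, 1], format="csc")
A_dense = A_sparse.toarray()
f = np.random.default_rng(0).normal(size=n)

t0 = time.perf_counter()
np.linalg.solve(A_dense, f)          # weak baseline: O(n^3) dense solve
t_weak = time.perf_counter() - t0

t0 = time.perf_counter()
spla.spsolve(A_sparse, f)            # strong baseline: ~O(n) sparse solve
t_strong = time.perf_counter() - t0

# Benchmarking any new method against the dense solver hands it this
# "speedup" for free, regardless of the method's actual merit.
print(f"dense: {t_weak:.4f}s  sparse: {t_strong:.4f}s  "
      f"baseline gap: {t_weak / t_strong:.0f}x")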

These findings sparked debate about AI in computational science and engineering:

1. Lorena Barba, a professor at George Washington University (GWU) who has discussed poor research practices in what she calls "scientific machine learning for fooling the masses," believes our findings are "conclusive evidence supporting our computational science community's concerns about AI hype and unscientific optimism."

2. Stephan Hoyer, head of an independent team at Google Research that reached similar conclusions, described our paper as "a great summary of why I switched from AI for PDEs to weather forecasting and climate modeling," which are applications where AI looks more promising.

3. Johannes Brandstetter, a professor at Johannes Kepler University Linz (JKU Linz) and co-founder of a startup providing "AI-driven physics simulations," believes that AI might yield better results in more complex industrial applications and that "the future of the field is undoubtedly promising and potentially impactful."

In my opinion, AI might eventually play a role in some applications related to solving PDEs, but at present, I don't see much reason for optimism. I'd like to see more focus on achieving the reliability of numerical methods and on red teaming AI methods; currently, they lack both theoretical guarantees and the experimentally verified robustness of standard numerical methods.

I also hope funding agencies will incentivize scientists to create challenging benchmark problems for PDE systems. A good example is CASP, the biennial protein-structure-prediction competition that has helped focus and incentivize research in that field for the past 30 years.

Will AI Accelerate Scientific Development?

In addition to protein structure (a typical example of AI achieving scientific breakthroughs), some examples of AI making scientific progress include:

1. Weather forecasting: AI forecasts have shown a 20% improvement in accuracy compared to traditional physics-based forecasts (though resolution is still lower).

2. Drug discovery: Preliminary data shows AI-discovered drugs succeeding more often in Phase I clinical trials (though not in Phase II). If the trend holds, it would mean a near doubling of the end-to-end drug approval rate.

But AI companies, academic and government organizations, and the media are increasingly viewing AI not just as a useful scientific tool, but as something that "will have a transformative impact on science."

I don't believe we should dismiss these claims out of hand. Although, according to DeepMind, current LLMs "still struggle to reach the deeper creativity and reasoning human scientists rely on," it is conceivable that advanced AI systems might one day fully automate the research process. I don't think that will happen anytime soon, if ever. But if such systems are ever created, they would no doubt change and accelerate scientific progress.

However, based on some lessons from my research experience, I believe we should be skeptical of the idea that more traditional AI techniques can significantly accelerate scientific progress.

Scientific Implications of AI

Most narratives about AI accelerating scientific development come from AI companies or scientists engaged in AI research who directly or indirectly benefit from these narratives. For example, NVIDIA CEO Jensen Huang has spoken about "AI driving scientific breakthroughs" and "increasing the pace of scientific development by a million times." Due to economic conflicts of interest, NVIDIA often makes exaggerated claims about AI's application in science.

You might think that the increasing adoption of AI by scientists proves its utility in scientific research. After all, if the use of AI in scientific research is growing exponentially, it must be because scientists find it useful, right?

I'm not so sure. In fact, I suspect scientists are turning to AI not so much because it benefits science, but because it benefits them personally.

Consider my motivation for switching to AI in 2018. While I genuinely believed AI could play a role in plasma physics, my main drivers were higher salaries, better job prospects, and academic prestige. I also noticed that senior figures in the lab were often more interested in AI's funding potential than in technical considerations.

Subsequent research found that scientists using AI were more likely to publish highly cited papers, averaging three times more citations than other scientists. Given such strong incentives to use AI, it's not surprising that so many scientists choose to do so.

Therefore, even impressive-looking adoption of AI in science does not necessarily mean AI has contributed to science; more often, it reflects what scientists stand to gain from working on AI.

This is because scientists working on AI (including myself) often adopt backward thinking. Instead of first identifying a problem and then trying to find a solution, we first assume AI is the solution and then look for problems that need solving.

But because it's hard to identify open scientific challenges that can be solved with AI, this "hammer looking for a nail" style of science means researchers often solve problems that are suitable for AI but have already been solved or won't create new scientific knowledge.

To accurately assess AI's impact on science, we need to genuinely examine science itself. Unfortunately, scientific literature is not a reliable source for evaluating AI's achievements in science.

One problem is survivorship bias. In the words of one researcher, because AI research "publishes almost no negative results," we typically see only AI's successes in science, never its failures. Without the negative results, our attempts to evaluate AI's impact on science become distorted.

Anyone who has studied the reproducibility crisis knows that survivorship bias is a major problem in science. The usual culprit is a filtering process in which statistically non-significant results never make it into the literature.

For example, consider the distribution of z-values in medical research, shown below. Z-values between -1.96 and 1.96 indicate statistically non-significant results. The sharp discontinuities at these two values suggest that many scientists either did not publish non-significant results or tweaked their data until it crossed the threshold for statistical significance.

The problem is that if researchers fail to publish negative results, it can lead doctors and the public to overestimate the effectiveness of medical treatments.


Pictured: Distribution of over one million z-values in medical research. Negative (statistically non-significant) results, with z-values between -1.96 and 1.96, are largely missing.
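
A toy simulation makes the mechanism visible (my own illustration, not the study's data): generate a large batch of z-values, "publish" every significant one but only a fraction of the non-significant ones, and the published distribution develops exactly this kind of gap between -1.96 and 1.96.

```python
# Toy simulation of publication filtering (my illustration, not the study's
# data): all significant z-values get published, non-significant ones rarely.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 2.0, size=500_000)        # z-values across many studies

significant = np.abs(z) > 1.96
# Non-significant results are published only 20% of the time in this model.
published = significant | (rng.random(z.size) < 0.2)

edges = np.arange(-4.0, 4.25, 0.5)
all_counts, _ = np.histogram(z, bins=edges)
pub_counts, _ = np.histogram(z[published], bins=edges)
for lo, n_all, n_pub in zip(edges, all_counts, pub_counts):
    print(f"z in [{lo:+.1f}, {lo + 0.5:+.1f}): all={n_all:6d} published={n_pub:6d}")
# The published histogram collapses inside (-1.96, 1.96), reproducing the
# discontinuities seen in the real distribution.
```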

A similar thing happens in AI science, although the selection process is not based on statistical significance, but on whether the proposed method outperforms others or successfully completes some new task. This means that researchers in AI science almost always report AI's successes and rarely publish results when AI fails.

The second problem is that, even among published successes, certain methodological pitfalls often lead to overly optimistic conclusions about AI's usefulness in science. The details and severity of these pitfalls vary across fields, but most fall into four categories: data leakage, weak baselines, cherry-picking, and misreporting.

While the reasons for this tendency towards overoptimism are complex, the core issue seems to be a conflict of interest, where those evaluating AI models also benefit from those evaluations.

These problems are serious enough that I encourage people to treat impressive results in AI for science with the same innate skepticism they would bring to a surprising result in nutrition science.

That's the end of the story.

Did it spark any thoughts for you? Feel free to share them in the comments.

Related content: https://www.understandingai.org/p/i-got-fooled-by-ai-for-science-hypeheres
