Synced Review
Editor: Xiniu
【Synced Review】Academic websites, originally treasure troves of knowledge, are now facing paralysis due to the relentless plundering by AI crawlers. From DiscoverLife to BMJ, millions of abnormal access attempts have overwhelmed servers, threatening the lifeline of open-access research. What is behind this "digital locust plague," and how should the academic community respond?
Imagine a quiet library, suddenly overrun by a crowd of uninvited guests. They don't browse, they don't contemplate; they just frantically photocopy every page of every book.
Such a commotion can hardly fail to disturb the readers quietly buried in their books.
Today, academic websites are experiencing a similar "digital invasion."
Recently, Nature published an article detailing these behaviors.
Article link: https://www.nature.com/articles/d41586-025-01661-4
Digital 'Locust Plague' Sweeps Academic World
DiscoverLife is an online image library holding nearly 3 million photographs of species, a research lifeline for many biologists.
Since February this year, however, the website has been inundated with millions of anomalous access attempts a day, slowing page loads to a crawl and at times paralyzing the site entirely.
When you try to open an image of a rare insect, you might only face a "server busy" message.
Who is the culprit?
Not hackers, nor viruses, but a swarm of silent AI crawlers, frantically "devouring" data to "feed" generative artificial intelligence.
These data-hungry crawlers are plaguing academic publishers and researchers, especially those who run websites hosting journal articles, databases, and other scholarly resources.
"It's like the Wild West out there right now," says Andrew Pitts, CEO of PSI, a company based in Oxford, UK, that provides a verified global IP address database for the scholarly communication community.
"The biggest problem is that the volume of access is simply too high, putting immense pressure on systems. This not only costs money but also disrupts genuine users."
Websites experiencing these issues are trying to block these crawler bots and reduce the interference they cause.
But this is by no means easy, especially for small institutions with limited resources.
"If these issues are not resolved, some smaller institutions may disappear entirely," says Michael Orr, a zoologist at the State Museum of Natural History Stuttgart in Germany.
Proliferation of Crawler Programs
Internet crawlers are not new.
For decades, crawlers from search engines like Google have been scanning web pages, aiding information retrieval.
However, the rise of generative AI has unleashed a flood of "bad bots."
This year, BMJ, a medical journal publisher based in London, found that crawler bot traffic on its website had exceeded that of real users.
Ian Mulvany, BMJ's Chief Technology Officer, stated that these aggressive bot behaviors led to server overload and disruption of services for legitimate customers.
It's not just BMJ. Jes Kainth, Director of Service Delivery at Highwire Press, an internet hosting provider that specializes in academic publishing, was blunt: "We've seen a surge in bad bot traffic, and it's become a serious problem."
The Confederation of Open Access Repositories (COAR) reported in April that more than 90% of the 66 members it surveyed had encountered AI crawlers scraping their content.
Roughly two-thirds of those members suffered service disruptions as a result.
Kathleen Shearer, COAR Executive Director, stated: "Our repositories are open access, so in a way we welcome content reuse. But some crawlers are too aggressive, causing severe operational issues such as downtime."
Why Target Academic Websites?
Data is the new oil.
The saying is vividly borne out in the AI era.
AI tools such as large language models (LLMs) and image generators rely on vast amounts of high-quality data for training, and academic websites, with their journal papers, databases, and open knowledge bases, have become "gold mines."
That is because the content on these sites is authoritative, up to date, and often well structured.
As Will Allen, Vice President at web service provider Cloudflare, said: "If your content is novel or highly relevant, it's invaluable to developers building AI chatbots."
These crawlers often operate from anonymous IP addresses, bypass paywalls, and even ignore the website's robots.txt file (the standard file that tells crawlers which parts of a site they may access).
Josh Jarrett, Senior Vice President at the publisher Wiley, said the company has found crawlers attempting to access subscription-only content. In April, Wiley issued a statement emphasizing that unauthorized and illegal scraping is unacceptable.
But clever bad bots are very adept at bypassing paywalls.
Struggle Amidst the Crisis
Facing the flood of crawlers, academic websites are striving to save themselves.
But in many cases, restricting bot access without affecting legitimate users is very difficult.
A common defence is to publish a robots.txt file that tells bots which parts of a site they may crawl and which are off limits.
But bad bots often simply ignore these rules.
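To see why this defence is so weak, note that robots.txt is purely advisory: a well-behaved crawler checks it before every fetch, but nothing forces a crawler to look at it at all. A minimal sketch using Python's standard urllib.robotparser, with hypothetical rules and paths not taken from any real site:

```python
# Minimal sketch: how a *well-behaved* crawler consults robots.txt.
# The rules and URL below are hypothetical.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Crawl-delay: 10
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks permission before each request...
print(parser.can_fetch("MyResearchBot", "/images/rare_insect.jpg"))  # False
print(parser.crawl_delay("MyResearchBot"))                           # 10

# ...but compliance is voluntary: a bad bot never runs this check
# and fetches the URL anyway.
```

Nothing in the protocol enforces the answer; honouring it is left entirely to the crawler's author.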
Another method is to block anything that looks like crawler activity outright, but this "one-size-fits-all" approach risks shutting out legitimate users.
Mulvany explained that scholars often access journals through institutional proxy servers, meaning many requests can arrive from a single IP address, an access pattern that looks very much like bot behavior.
"We have to find a balance between protecting the website from being crashed by traffic surges and not affecting users' normal access to these resources," Mulvany stated.
"This is really annoying and requires a lot of effort to mitigate these risks."
These websites can also block specific crawlers, but first they need to tell benign crawlers from malicious ones.
Cloudflare and PSI are working hard to identify bad bots, but new types of AI crawlers are constantly emerging, making them difficult to fully contain.
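The Nature piece does not detail how Cloudflare or PSI classify traffic, but one long-standing identification technique is forward-confirmed reverse DNS: a bot can freely forge its User-Agent string, yet it cannot make a DNS round trip succeed for an IP it does not control. A minimal sketch, assuming a crawler that claims to be Googlebot (Google publishes the hostname suffixes its crawlers resolve to):

```python
# Sketch of forward-confirmed reverse DNS, a common way to verify a
# self-declared crawler. The trusted suffixes are Google's published
# Googlebot domains; the technique generalizes to any operator that
# publishes matching DNS records.
import socket

TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

def verify_claimed_crawler(ip):
    """Trust the claim only if the IP reverse-resolves to a trusted
    hostname that in turn resolves forward back to the same IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]              # reverse lookup
        if not hostname.endswith(TRUSTED_SUFFIXES):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]   # forward confirm
    except OSError:                                          # lookup failed
        return False
```

The catch, and a reason bad bots remain hard to contain, is that this only verifies crawlers whose operators publish such DNS records; anonymous scrapers simply fail the check and blend into ordinary user traffic.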
"We urgently need international agreements on fair use of AI and respect for these types of resources," Orr stated.
"Otherwise, in the long run, these tools will not find available training resources."
References:
https://www.nature.com/articles/d41586-025-01661-4
https://coar-repositories.org/news-updates/open-repositories-are-being-profoundly-impacted-by-ai-bots-and-other-crawlers-results-of-a-coar-survey/