AI Bots Swarm Library, Cultural Heritage Sites, Causing Slowdowns and Crashes

Bots scraping data for artificial intelligence (AI) models have become a serious problem across the internet during the past several months, impacting libraries, cultural heritage websites, and other content important to these institutions.

The bots are designed to visit websites and retrieve data to feed large language model AIs such as ChatGPT, Gemini, and Claude. Libraries, cultural heritage institutions, and open source and open data sites have become ripe targets for these bots since they generally offer vetted, quality information and good metadata, and are often not shielded by registration or login requirements.

“In late 2024, isolated reports began to appear from individual online cultural heritage collections. Those reports described servers and collections straining—and sometimes breaking—under the load of swarming bots,” according to the report “Are AI Bots Knocking Cultural Heritage Offline?” by Michael Weinberg, published last month by the Engelberg Center on Innovation Law and Policy at the New York University School of Law. “The bots were reportedly scraping all of the data from collections to build datasets to train AI models. This activity was overwhelming the systems designed to keep those collections online.”

If an institution operates a site where “you don’t have to log in, you’re getting hammered,” Nathan Curulla, cofounder and CRO of ByWater Solutions, told LJ. ByWater, a provider of migration, hosting, training, support, and development services for open-source library systems including Koha ILS and Aspen Discovery, is in the process of implementing Cloudflare services for all of its Koha and Aspen customers to combat the sudden spike in bot traffic.

In a recent email to its library customers forwarded to LJ, Curulla said that Cloudflare has been implemented for 95 percent of its Aspen customers and about 60 percent of its Koha libraries “with our target for completion being August of this year.”

In a June article, Judy Panitch, director of library communications for the University of North Carolina (UNC) at Chapel Hill, wrote that on December 2 last year the library’s online catalog “was receiving so much traffic that it was periodically shutting out students, faculty, and staff, including the head of User Experience.” When David Romani, system administrator and the library’s security liaison, investigated, he found that “the searches were coming from addresses spread broadly across the United States using reputable ISPs such as AT&T, Spectrum, and Verizon. Each interaction looked exactly like something that happens thousands of times a day at a research library.”

Closer inspection revealed how unusual the searches were. For example, on a single day in December, there were 11,329 searches—from thousands of different internet addresses—for Finnish music. “In November, before we had this problem, we got something like 15 searches with the terms ‘Finnish’ and ‘music,’” Jason Casden, head of software development for UNC, told Panitch.

The bots were also conducting faceted searches, with specifications ranging from date to place of publication to language or even location at specific campus libraries. “A human might apply up to a half-dozen facets. We were seeing requests with 15, 20, 25 facets, which is almost impossible to do, even deliberately,” Casden told Panitch. UNC’s IT team set up rules to block addresses that made two complex queries in quick succession, but the fix worked for only a week.
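A rule of the kind UNC describes can be sketched in a few lines of Python. This is a minimal illustration, not UNC's actual configuration: the facet limit, the ten-second window, and the class and method names are all assumptions, since the article gives no exact thresholds.

```python
from collections import defaultdict, deque

# Hypothetical thresholds; the article does not give exact values.
FACET_LIMIT = 6            # a human "might apply up to a half-dozen facets"
WINDOW_SECONDS = 10.0      # assumed meaning of "in quick succession"
COMPLEX_HITS_TO_BLOCK = 2  # "two complex queries" from the same address


class ComplexQueryBlocker:
    """Block a client address that sends several complex queries in a short window."""

    def __init__(self):
        self._recent = defaultdict(deque)  # ip -> timestamps of complex queries
        self.blocked = set()

    def observe(self, ip: str, facet_count: int, timestamp: float) -> bool:
        """Record one search request; return True if the address is blocked."""
        if ip in self.blocked:
            return True
        if facet_count <= FACET_LIMIT:
            return False  # looks like an ordinary human search
        hits = self._recent[ip]
        hits.append(timestamp)
        # Drop complex queries that fall outside the sliding window.
        while hits and timestamp - hits[0] > WINDOW_SECONDS:
            hits.popleft()
        if len(hits) >= COMPLEX_HITS_TO_BLOCK:
            self.blocked.add(ip)
            return True
        return False
```

A per-address rule like this only catches a client that repeats itself. When requests arrive from thousands of different addresses, as the UNC searches did, each address may send a single complex query and never trip the threshold, which may help explain why such rules lose effectiveness quickly.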

Noting that old-style bots “were rarely a problem” for well-run websites, Eric Hellman, technologist, library advocate, and founder of the Unglue.it ebook project, recently posted on his blog that “the current generation of bots is mindless. They use as many connections as you have room for. If you add capacity, they just ramp up their requests. They use randomly generated user-agent strings. They come from large blocks of IP addresses. They get trapped in endless hallways. I observed one bot asking for 200,000 nofollow redirect links pointing at Onedrive, Google Drive, and Dropbox (which of course didn't work, but Onedrive decided to stop serving our Canadian human users). They use up server resources—one speaker at Code4lib [conference in Princeton, NJ, in March] described a bug where software they were running was using 32 bit integers for session identifiers, and it ran out!”
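The session identifier bug comes down to simple arithmetic. The sketch below assumes a signed 32-bit counter that hands out one new ID per session, and assumes bots that do not keep cookies start a fresh session on every request. Neither detail is confirmed in Hellman's post, and the request rates are hypothetical.

```python
# Largest value a signed 32-bit integer can hold (assumes a signed counter).
INT32_MAX = 2**31 - 1  # 2,147,483,647


def days_until_exhausted(sessions_per_second: float, max_id: int = INT32_MAX) -> float:
    """Days until a counter issuing one ID per new session reaches max_id."""
    seconds = max_id / sessions_per_second
    return seconds / 86_400  # seconds in a day


# Hypothetical rates: a bot that ignores cookies creates a new session per request.
for rate in (10, 1_000, 10_000):
    print(f"{rate:>6} sessions/s -> {days_until_exhausted(rate):8.1f} days")
```

At ten new sessions per second, the counter would last for years. At a sustained 1,000 per second, it would run out in under a month.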

Hellman noted that the Internet Archive is no longer saving snapshots on its Wayback Machine of one of the best open-access publishers in the academic library field, MIT Press, because of Cloudflare blocking, and that AI bots caused recent outages at Project Gutenberg—which hosts tens of thousands of public domain ebooks—and temporarily took down OAPEN for almost two days.

He expressed frustration with how poorly coded these AI bots are currently, writing that “Project Gutenberg makes all its content available with one click on a file in its feeds directory. OAPEN makes all its books available via an API. There’s no need to make a million requests to get this stuff!! Who (or what) is programming these idiot scraping bots? Have they never heard of a sitemap??? Are they summer interns using ChatGPT to write all their code? Who gave them infinite memory, CPUs, and bandwidth to run these monstrosities?”

As part of its report, the Engelberg Center conducted a small survey of galleries, libraries, archives, and museums, which indicated that a significant majority of respondents had experienced a recent spike in bot traffic. This increase had not been anticipated, and few institutions had been actively tracking bot traffic “prior to the bots triggering a crisis in their collection. Many respondents did not realize they were experiencing a growth in bot traffic until the traffic reached the point where it overwhelmed the service and knocked online collections offline.” While the report notes that the survey was circulated on targeted listservs and the results reflect “a strong response bias,” dozens of institutions reported being impacted.

Respondents indicated that the bots tend to “swarm” for relatively brief periods of time, but that the frequency seemed to be increasing, and robots.txt—a small file that usually tells bots which URLs they can access—does not appear to be an effective way to prevent these new bots from overwhelming, slowing, or even crashing a site. Respondents said that they are “deploying a range of home-grown and third-party firewall-based countermeasures to try to screen out bots based on IP address, geography, domain, and user agent string, [and that] some of these efforts appear to be effective, although few are confident that they will be sustainable in the long term.”
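To see why robots.txt offers so little protection, it helps to look at how a cooperative crawler uses it. The sketch below relies on Python's standard `urllib.robotparser` module. The robots.txt contents, the bot name `ExampleAIBot`, and the `catalog.example.org` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an online catalog; paths and bot name are illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 10

User-agent: ExampleAIBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """The check a cooperative crawler runs before requesting a URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("ExampleAIBot", "https://catalog.example.org/record/1"))
print(may_fetch("Mozilla/5.0", "https://catalog.example.org/search?q=finnish+music"))
print(may_fetch("Mozilla/5.0", "https://catalog.example.org/record/1"))
```

The check runs inside the crawler, not on the server. A bot that never calls it, or that sends a randomly generated user-agent string, is not slowed down at all, which is consistent with what the survey respondents reported.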

However, most respondents said they had been reluctant to take more aggressive steps, such as requiring registrations and logins to view their collections. They weren’t confident that such measures would be effective in the medium term, and were concerned that implementing those changes could have negative impacts on wanted web traffic from real users, with “login-based restrictions [running] counter to their larger goal of making the collections easily available online.” Respondents also expressed concern that this new generation of AI training bots could create an environment of unsustainably escalating costs for hosts of quality online collections on the open web.

Hellman predicts a disheartening future if the growing problem cannot be resolved, writing that “we are headed for a world in which all good information is locked up behind secure registration barriers and paywalls, and it won’t be to make money, it will be for survival. Captchas will only be solvable by advanced AIs, and only the wealthy will be able to use internet libraries.”

Matt Enis

menis@mediasourceinc.com

@MatthewEnis

Matt Enis (matthewenis.com) is Senior Editor, Technology for Library Journal.
