How to block AI crawlers (and keep them from melting your server)

There is a specific kind of dread that sets in when you check your server metrics and see the CPU pegged at a solid 100%.

That was my reality recently over at electronica.club. The site was struggling, SSHing into the terminal felt sluggish, and running top revealed a sea of lsphp processes eating every spare cycle the processor had to offer.

At first, I was convinced I had misconfigured my caching setup. Running a heavy, dynamic WooCommerce store on LiteSpeed usually means wrestling with cache purge loops and complex inventory relationships. I spent hours tweaking WooCommerce exclusion URIs, adjusting stock status triggers, and throttling the LiteSpeed Crawler to run low and slow.

But the CPU stayed maxed out.

It wasn’t until I finally dug into the raw server access logs that I found the real culprit. It wasn’t a broken WordPress plugin or a cache miss stampede. It was Meta.

Specifically, it was meta-externalagent/1.1.
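
If you want to run the same kind of check on your own logs, here is a minimal sketch in Python that tallies requests per User-Agent. It assumes your access log uses the common "combined" format, where the User-Agent is the last quoted field on each line. The sample lines and the count_user_agents helper are made up for illustration and are not taken from my real logs.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the User-Agent is the last
# double-quoted field on each line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each User-Agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample data; IPs, paths, and timestamps are invented.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /shop/?filter_color=red HTTP/1.1" 404 512 "-" "meta-externalagent/1.1"',
    '203.0.113.5 - - [01/Jan/2025:10:00:01 +0000] "GET /shop/?filter_color=orange HTTP/1.1" 404 512 "-" "meta-externalagent/1.1"',
    '203.0.113.5 - - [01/Jan/2025:10:00:02 +0000] "GET /shop/?filter_color=white HTTP/1.1" 404 512 "-" "meta-externalagent/1.1"',
    '198.51.100.7 - - [01/Jan/2025:10:00:03 +0000] "GET / HTTP/1.1" 200 2048 "-" "Googlebot/2.1"',
    '192.0.2.9 - - [01/Jan/2025:10:00:04 +0000] "GET /cart/ HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

# Print the busiest agents first.
for agent, hits in count_user_agents(SAMPLE_LOG).most_common():
    print(f"{hits:6d}  {agent}")
```

To use it on a real log, pass an open file object instead of SAMPLE_LOG; the function only needs something it can iterate over line by line.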

The AI Scraping Tax

Meta’s AI crawler was hammering my server, requesting thousands of old, dynamic WooCommerce filter combinations (like ?filter_color=red,orange,white). Because these obscure URLs returned 404 errors, LiteSpeed rightfully refused to cache them. This meant that for every single request the bot made, WordPress had to wake up, execute PHP, and query the database just to generate a “Page Not Found” response.
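
To see how much of a bot’s traffic falls into this uncacheable bucket, a rough sketch like the one below can help. It counts, for one User-Agent, how many requests returned 404 and how many of those carried a filter_ query parameter. The combined log format, the sample lines, and the summarize_bot helper are all assumptions for illustration.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Assumption: "combined" log format:
# ip - - [time] "METHOD target PROTO" status bytes "referer" "agent"
LINE_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize_bot(lines, agent_substring):
    """Count total, 404, and filter-parameter 404 requests for one bot."""
    summary = {"total": 0, "not_found": 0, "filter_not_found": 0}
    needle = agent_substring.lower()
    for line in lines:
        match = LINE_PATTERN.search(line)
        if not match or needle not in match.group("agent").lower():
            continue
        summary["total"] += 1
        if match.group("status") != "404":
            continue
        summary["not_found"] += 1
        # Look for WooCommerce-style filter parameters such as filter_color.
        query = urlsplit(match.group("target")).query
        params = parse_qsl(query, keep_blank_values=True)
        if any(key.startswith("filter_") for key, _ in params):
            summary["filter_not_found"] += 1
    return summary


# Hypothetical sample data; IPs, paths, and timestamps are invented.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /shop/?filter_color=red,orange,white HTTP/1.1" 404 512 "-" "meta-externalagent/1.1"',
    '203.0.113.5 - - [01/Jan/2025:10:00:01 +0000] "GET /shop/?filter_size=xl&orderby=price HTTP/1.1" 404 512 "-" "meta-externalagent/1.1"',
    '203.0.113.5 - - [01/Jan/2025:10:00:02 +0000] "GET /old-page/ HTTP/1.1" 404 512 "-" "meta-externalagent/1.1"',
    '203.0.113.5 - - [01/Jan/2025:10:00:03 +0000] "GET /product/example-synth/ HTTP/1.1" 200 4096 "-" "meta-externalagent/1.1"',
    '198.51.100.7 - - [01/Jan/2025:10:00:04 +0000] "GET /shop/?filter_color=red HTTP/1.1" 404 512 "-" "Googlebot/2.1"',
]

print(summarize_bot(SAMPLE_LOG, "meta-externalagent"))
```

A high share of filter-parameter 404s for a single agent is a strong hint that every one of its requests is going all the way through PHP and the database.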

Here is the harsh reality of the modern web: AI companies are aggressively scraping our data to train their massive language models. But unlike Googlebot, which crawls your site and rewards you with search rankings and human traffic, AI crawlers are strictly extractive. They consume your bandwidth, spike your CPU, and drive up your hosting bills, and they offer nothing in return.

As a solo founder, I treat server resources as literally money. I wasn’t about to let a trillion-dollar company melt my server for free training data.

Here is exactly how I blocked them and dropped my CPU usage back to near-zero instantly.

Step 1: The Immediate Server-Level Block (.htaccess)

If a bot is aggressive enough, a standard robots.txt file might not save you. It is only a request that crawlers can choose to ignore, and every request that still arrives costs some server effort. The most effective way to kill the spike immediately is to block the User-Agent at the web-server level.

I opened the .htaccess file in my public_html folder and dropped this exact rule right at the top:

Apache

<IfModule mod_rewrite.c>
RewriteEngine On
# Block Meta AI Training Crawler
RewriteCond %{HTTP_USER_AGENT} meta-externalagent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Meta-ExternalFetcher [NC]
RewriteRule ^.* - [F,L]
</IfModule>

The moment I saved that file, the web server began issuing instant 403 Forbidden errors to the bot before PHP even had a chance to wake up. The lsphp processes vanished from my terminal. The CPU dropped magically to idle.

(Note: This targets Meta’s AI training bots specifically, not the facebookexternalhit crawler, so your standard social media link previews won’t break.)
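
If you want to sanity-check which User-Agents a rule like this would catch before touching a live site, the matching logic can be approximated in a few lines of Python. This is only a sketch: it mirrors the two RewriteCond patterns as case-insensitive, unanchored regex searches (which is how [NC] and a bare pattern behave), and the would_be_blocked helper is a hypothetical name of my own.

```python
import re

# The same two patterns used in the RewriteCond lines. [NC] means
# case-insensitive, and an unanchored pattern matches anywhere in the string.
BLOCKED_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in ("meta-externalagent", "Meta-ExternalFetcher")
]


def would_be_blocked(user_agent):
    """Return True if any blocked pattern appears in the User-Agent."""
    return any(pattern.search(user_agent) for pattern in BLOCKED_PATTERNS)


# A few User-Agent strings to try; add the ones you see in your own logs.
CANDIDATES = [
    "meta-externalagent/1.1",
    "META-EXTERNALAGENT/1.1",
    "meta-externalfetcher/1.1",
    "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

for agent in CANDIDATES:
    verdict = "403" if would_be_blocked(agent) else "pass"
    print(f"{verdict}  {agent}")
```

This only models the pattern matching, not the web server itself, so it is worth confirming the real behavior by requesting a page with a spoofed User-Agent once the rule is live.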

Step 2: The Ultimate AI Blocklist (robots.txt)

With the immediate fire put out, I needed a permanent, polite-but-firm “Do Not Enter” sign for the rest of the AI ecosystem.

If you use an SEO plugin like Rank Math, you might find that your robots.txt file is “not writeable” from the WordPress dashboard. The fastest bypass is to just create a physical file named robots.txt directly in your server’s root directory (public_html). A physical file will always override the virtual one generated by WordPress.

I created the file and pasted in this comprehensive blocklist to keep Meta, OpenAI, Anthropic, Google Gemini, Common Crawl, Apple, and Amazon away from my server’s resources:

Plaintext

# Block Meta AI Training
User-agent: meta-externalagent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /

# Block OpenAI (ChatGPT) Training & Scraping
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /

# Block Anthropic (Claude) Training
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /

# Block Google Gemini Training (Does not affect normal Googlebot SEO)
User-agent: Google-Extended
Disallow: /

# Block Common Crawl (Massive open web crawl whose data is used by many AIs)
User-agent: CCBot
Disallow: /

# Block Apple & Amazon AI Training
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
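
Before relying on a file like this, it is worth checking that it parses the way you expect. Here is a minimal sketch using Python’s standard-library urllib.robotparser against a shortened copy of the rules above. The example.com URL and the allowed helper are placeholders of my own, and Python’s parser is only one interpretation of the format, so real crawlers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the blocklist, kept small for the example.
ROBOTS_TXT = """\
# Block Meta AI Training
User-agent: meta-externalagent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /

# Block OpenAI (ChatGPT) Training & Scraping
User-agent: GPTBot
Disallow: /

# Block Google Gemini Training
User-agent: Google-Extended
Disallow: /
"""


def allowed(robots_txt, user_agent, url="https://example.com/shop/"):
    """Return True if this parser thinks user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("meta-externalagent/1.1", "GPTBot", "Google-Extended", "Googlebot"):
    verdict = "allowed" if allowed(ROBOTS_TXT, agent) else "blocked"
    print(f"{agent}: {verdict}")
```

Because the file has no "User-agent: *" group, any crawler that is not listed, including the normal Googlebot, stays allowed. Keep in mind that this only tells you what a well-behaved crawler should do; it does not stop a bot that ignores robots.txt.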

Taking Back Your Hardware

If your site feels sluggish, don’t just assume you have a bloated plugin or need to upgrade your hosting plan. Check your access logs.

Blocking these bots isn’t just about saving CPU cycles; it is about taking back control of your infrastructure. We build sites to serve our customers and grow our businesses, not to act as free computing nodes for Big Tech’s latest AI models. Lock down your .htaccess, update your robots.txt, and let your server breathe again.
