Block Them All
The LAN Shall Know His Own
Not literally everything, but you may want to reduce the load on your systems as much as possible: less wasted CPU, less log noise if you have logging turned on. There have been mutterings on the internet about those nasty AI scrapers causing so much load that perhaps-not-maximally-efficient software such as fail2ban cannot keep up with the log spam. This is not new; as related in "The Practice of Programming", spammers caused so much load on a mail server operated by Brian W. Kernighan and Rob Pike that they had to write some new software. That was back in the now glorious 1990s, when mentioning AI in your grant would get your funding correctly cut, the good AI winter still being a thing back then.
server "example.org" {
...
location match "[.]aspx?$" {
block drop
}
location match "[.]bak$" {
block drop
}
location match "[.]dll$" {
block drop
}
location match "[.]jsp$" {
block drop
}
location match "[.]js$" {
block drop
}
location match "[.]sql$" {
block drop
}
location match "[.]php$" {
block drop
}
...
That's for httpd on OpenBSD, where patterns(7) does not appear to support alternations. It likely won't help much, but it will save some bytes of bandwidth, especially when the all-too-many security scanners go nuts. More on those annoying pests below. A major bummer would be if you forget that you added these blocks, and future you adds some javascript file for who knows what reason, maybe to show a "please remove javascript from your browser" public service notification, and then future future you has to spend time debugging why things Do Not Work™.
Nginx apparently supports the "block drop" treatment via its non-standard 444 code, which closes the connection without sending a response; the 444 in the access log should make it easier for subsequent log scanning to pick out naughty hosts for additional review and possible banning. Nginx can also do rate limiting, and there are various modules or other servers that have rate limiting built in.
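Something along these lines might work for nginx, though I have not tested it; the suffix list mirrors the httpd config above, and the zone name, rate, and paths are made up for illustration.

http {
    # shared memory zone for per-client-address request rate limiting
    limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

    server {
        listen 80;
        server_name example.org;

        # same suffixes as the httpd config; 444 closes the connection
        # without a response, and is what gets written to the access log
        location ~ \.(aspx?|bak|dll|jsp|js|sql|php)$ {
            return 444;
        }

        location / {
            limit_req zone=perip burst=20;
            root /var/www/example.org;
        }
    }
}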
A benefit here is reducing the log spam to a manageable level so that subsequent refinements are easier to see. Studying the logs can turn up "wait, why is that happening?" moments, which are otherwise difficult to spot when 99.98% of the log is annoying AI bots and other such noise. Those who do not like this sort of sysadmin gardening, pulling at the weeds, will probably need to have someone else run the service, or to eat the cost of all the extra untamed traffic.
Probably one should expect more Balkanization of the Internet as individuals, organizations, and countries take various countermeasures against bot spam and other such bad actors. Hopefully without an archduke being shot, though if it gets us a new AI winter it may well be worth it.
Another place to block or at least slow them down is in the firewall. This generally lacks application layer insight, but is very efficient. A downside here is false positives on the power users, the folks who know enough to make a lot of connections, but not enough to pipeline or otherwise send their requests over a single connection. Another risk of false positives is a lot of people hitting your services from a shared connection, e.g. a group at a coffee shop.
queue std on $pub_if bandwidth 1M max 1M
...
queue legacyweb parent std bandwidth 128K max 256K \
    flows 1024 qlimit 1024 default
...
pass in on $pub_if proto tcp from any to any port { 80, 443 } \
    modulate state (max 640, source-track rule, max-src-conn 4) \
    set queue legacyweb
The low bandwidth will also help keep too many systems from running up your bandwidth fees, with the downside that your system will be easier to Denial-of-Service, or to render too slow for everyone, by just having a few attacking systems download a few big files over and over. On the other hand, hasty AI scrapers will have to wait for all that big content to come over, which they may not want to do. Monitoring therefore becomes more important with low system limits, both for bad actors and for legitimate users who may be getting the short end of the anti-AI-derpbot stick.
A variation here, if your firewall supports it, is to rate limit by netblock: if you know that random Brazil cloud IPs are spamming your systems with SYN packets, only allow two or three connections at the same time from their entire netblock. Or you could just ban them all.
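With pf, one way to sketch this (netblock and table name invented for the example) is to cap the number of states a single rule may create, so the whole netblock shares a handful of connections:

table <scrapers> const { 203.0.113.0/24 }
pass in quick on $pub_if proto tcp from <scrapers> to any port { 80, 443 } \
    modulate state (max 3) set queue legacyweb
# or, to just ban them all
# block drop in quick on $pub_if from <scrapers>

The "max 3" here applies to the rule, not to each source address, which is what turns it into a cap on the entire netblock.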
Those Annoying Security Scanners Mentioned Above
These fine folks clog up your logs with noise. Opinions vary, but one option is to block them all, or at least their known addresses. Blocking this noise will help avoid filling up limited source connection slots with useless traffic, will keep your logs cleaner, etc. What's not to like?
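With pf that might look something like the following, with a made-up file holding their published or observed addresses:

table <scanners> persist file "/etc/pf-scanners"
block drop in quick on $pub_if from <scanners>
# the list can be updated without reloading the whole ruleset:
#   pfctl -t scanners -T replace -f /etc/pf-scanners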
Another option is to share the IPs of the bad actors with others, though I haven't tried this.
Consequent Problems
If you have a trap, perhaps a poisoned link that results in an auto-ban, malicious actors much in need of regulation will try to trick folks into visiting the link. This can be a problem with software that automatically requests additional resources by default; see e.g. any of those bad mail clients: send them a message with a bad URL, their client follows it, the client's address is blacklisted, whoops!
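For reference, such a trap might be nothing more than a made-up httpd location that nothing legitimate should ever request, with some log watcher (not shown) adding whoever asks for it to a pf table:

# hypothetical poisoned path; a hidden link could point here, and
# requests for it get the source added to a ban table, e.g. with
#   pfctl -t banned -T add 198.51.100.7
location "/pot-of-honey/*" {
    block return 404
}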
Source addresses can also be forged, or routes temporarily taken, abused, and returned to the common pool. This may suggest short-term bans, which in turn suggests an IP reputation database, a means to review why a particular IP address is banned (naughty server requests? sourced from blacklist X? manual entry?), and also a means to whitelist your known good addresses: what if one of those terrible techbros takes over spamhaus and puts 9.9.9.9 into all the blacklists, or a user goes to Brazil only to find that all of Brazil has been banned? Maybe still allow access to the VPN port from most everywhere, which in turn suggests different levels of blacklisting: everything, most services, not the SMTP ports, etc.
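In pf terms that might end up looking something like this, with invented table names and macros, a whitelist consulted first, and the ban tables split by severity:

table <goodguys> const { 192.0.2.10 }   # known good, never banned
table <banned_all> persist              # ban level: everything
table <banned_most> persist             # ban level: some services (here, the web ports)

pass in quick on $pub_if from <goodguys>
block drop in quick on $pub_if from <banned_all>
block drop in quick on $pub_if proto tcp from <banned_most> \
    to any port { 80, 443 }
# the VPN port (made-up macro) stays reachable from most everywhere
pass in on $pub_if proto udp from any to any port $vpn_port
# short-term bans: sweep old entries out periodically, e.g. from cron
#   pfctl -t banned_all -T expire 86400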