3 EN. Protecting a website from scraping bots
It is said that the web is plagued by rampant, DDoS-level scraping by AI bots.
Website owners end up blocking whole CIDR ranges to cope with it.
This often leads to legitimate users being blocked from the website as well.
Why is scraping by AI bots such a big issue?
Is it hard to spot a client that is scraping your website among the other clients?
I suppose a client that is actively scraping could be detected and then banned from accessing the website.
The client's IP address could be banned, for example.
Or is it hard to manage individual IP bans on the web server side?
The bandwidth of a malicious client could also be throttled.
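To make the idea concrete, here is a minimal sketch of per-IP request counting with a temporary ban, in Python; the thresholds and the ban duration are made-up numbers, and in practice this is usually done in the web server or a firewall rather than in application code:

    import time
    from collections import defaultdict, deque

    # Hypothetical thresholds -- real values depend on the site's traffic.
    MAX_REQUESTS = 100      # requests allowed ...
    WINDOW_SECONDS = 60     # ... within this sliding window
    BAN_SECONDS = 3600      # how long an offending IP stays banned


    class RateLimiter:
        """Tracks request timestamps per IP and bans IPs that exceed the limit."""

        def __init__(self):
            self.requests = defaultdict(deque)  # ip -> recent request timestamps
            self.banned_until = {}              # ip -> unix time the ban expires

        def allow(self, ip: str) -> bool:
            now = time.time()

            # Still banned?
            if self.banned_until.get(ip, 0) > now:
                return False

            # Record this request and drop timestamps outside the window.
            q = self.requests[ip]
            q.append(now)
            while q and q[0] < now - WINDOW_SECONDS:
                q.popleft()

            # Too many requests in the window: ban the IP for a while.
            if len(q) > MAX_REQUESTS:
                self.banned_until[ip] = now + BAN_SECONDS
                return False
            return True


    if __name__ == "__main__":
        limiter = RateLimiter()
        allowed = sum(limiter.allow("203.0.113.7") for _ in range(105))
        print("allowed before ban:", allowed)                   # 100
        print("still allowed?", limiter.allow("203.0.113.7"))   # False, banned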
On some IRC networks there are trap channels.
If you join one of these, you get temporarily banned from connecting to the network.
Also on IRC, upon connection to a network, the servers routinely check whether the client's IP appears in one of the blocklists commonly used across networks, such as the EFnet RBL and DroneBL.
Clients that are on these lists are not allowed to connect.
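As a side note, this is roughly how such a DNSBL check works at the DNS level: the client's IPv4 octets are reversed and looked up under the blocklist's zone, and a positive answer (usually a 127.0.0.x record) means the address is listed. A small Python sketch follows; the zone names are the ones I recall for DroneBL and the EFnet RBL, so double-check them before relying on this:

    import socket

    # DNSBL zones as I recall them -- verify the exact zone names and
    # usage policies before querying them from anything serious.
    DNSBL_ZONES = ["dnsbl.dronebl.org", "rbl.efnetrbl.org"]


    def is_listed(ip: str, zone: str) -> bool:
        """Check an IPv4 address against one DNSBL zone.

        The convention is to reverse the octets and query them as a
        subdomain of the zone: 192.0.2.1 -> 1.2.0.192.<zone>.
        An A record answer (usually 127.0.0.x) means "listed";
        NXDOMAIN means "not listed".
        """
        reversed_ip = ".".join(reversed(ip.split(".")))
        try:
            socket.gethostbyname(f"{reversed_ip}.{zone}")
            return True
        except socket.gaierror:
            return False


    if __name__ == "__main__":
        ip = "192.0.2.1"  # documentation address, should not be listed
        for zone in DNSBL_ZONES:
            print(zone, "->", "listed" if is_listed(ip, zone) else "not listed")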
There are also websites with a section where following the links on the presented page leads into an infinite loop of generated pages.
Bots generally get trapped in the loop, which makes this a successful method to spot them.
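For illustration, such a link labyrinth can be as simple as a handler that answers every request under some trap path with a page of freshly generated links back into the same path. A rough Python sketch, where the /trap/ path and the port are arbitrary choices for the example:

    import random
    import string
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Every page under /trap/ links to more randomly named pages under
    # /trap/, so a crawler that ignores the trap can wander here forever
    # while the rest of the site stays untouched.


    def random_slug(n=8):
        return "".join(random.choices(string.ascii_lowercase, k=n))


    class TrapHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            links = "".join(
                f'<li><a href="/trap/{random_slug()}">{random_slug()}</a></li>'
                for _ in range(10)
            )
            body = f"<html><body><ul>{links}</ul></body></html>".encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            # Anything that keeps landing here is a good candidate for a ban.
            print("trap hit from", self.client_address[0])


    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), TrapHandler).serve_forever()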
By the way, when using the HTTP protocol, are connections usually persistent, or is a new connection made for each request?
How is it with Gopher?
I know that for Gemini a new connection is made for each request.
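To illustrate that last point, a Gemini request really is just one short-lived TLS connection: open it, send the URL terminated by CRLF, and read the response until the server closes the connection. A rough Python sketch; the host is only an example, and certificate verification is skipped because Gemini servers typically use self-signed, trust-on-first-use certificates:

    import socket
    import ssl


    def gemini_fetch(host: str, path: str = "/") -> str:
        # Gemini commonly relies on trust-on-first-use rather than CA
        # certificates, so verification is disabled in this sketch.
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE

        with socket.create_connection((host, 1965)) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                # A Gemini request is just the absolute URL followed by CRLF.
                tls.sendall(f"gemini://{host}{path}\r\n".encode())
                chunks = []
                while True:
                    data = tls.recv(4096)
                    if not data:
                        break  # the server closes the connection after one response
                    chunks.append(data)
        return b"".join(chunks).decode(errors="replace")


    if __name__ == "__main__":
        print(gemini_fetch("geminiprotocol.net")[:200])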