GiM's gemlog - 2025-08-29 On fighting bots
I want to share my *opinion* on why I believe we must actively fight bots, and why blocking them is NOT GOOD ENOUGH. Be warned, though, that there might be quite some SHOUTING in this post.
I've seen both a post by Jeff:
and an earlier post by Jorge Sanz:
I also shared this earlier on bbs, an article about some modern Anti-AI frameworks (direct link to article):
First, I don't think arstechnica is right to call people "AI haters"; it should rather call them (us?) >concerned webcitizens<. It SHOULD, however, call what is basically CONTENT STEALING by its name, especially since they - the AI companies - are very likely stealing arstechnica's content too. But that's just a sidenote.
The whole article is worth reading, for many reasons:
- it quotes AI spokespeople claiming that bots respect robots.txt - we all know that is NOT the case
- it talks about tools that are made to sabotage collected content (more on that later)
- it contains some interesting comments from the authors
All three tools are garbage content generators, but not generators of just any garbage, as that wouldn't really work. What I think all of them do is take a learning corpus (to build a Markov chain) and a dictionary corpus, which is then used with the Markov chain to generate something that just >maybe< could've been written by a human being, but is really just babble.
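To make the Markov chain part a bit more concrete, here is a minimal sketch in Python. It is NOT how any of those tools is actually implemented; the function names `build_chain` and `babble` are made up for illustration, and the dictionary-corpus step is left out entirely, so it only shows the "learn from a corpus, then walk the chain" idea.

```python
import random
from collections import defaultdict


def build_chain(corpus, order=2):
    """Map each run of `order` words to the words seen right after it."""
    words = corpus.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def babble(chain, length, rng):
    """Walk the chain for `length` words, jumping to a random state at dead ends."""
    states = sorted(chain)  # sorted, so a seeded rng gives repeatable output
    state = rng.choice(states)
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end: no word ever followed this state in the corpus.
            state = rng.choice(states)
            out.extend(state)
            continue
        word = rng.choice(followers)
        out.append(word)
        state = state[1:] + (word,)
    return " ".join(out[:length])
```

Feeding it a larger corpus and a higher `order` makes the babble look more human, at the price of copying longer runs straight from the learning text.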
I have picked iocaine, mostly because it doesn't store any data: it requires some CPU time, but there's no storage requirement. And bots can go >really< deep. On one of the nodes where I'm running it, the URI path itself is ~1000 chars long and contains well over 100 elements.
Today alone (from 00:00 until 11:30), gptbot ingested around 20MB of crap data, and Petalbot ~800k.
I don't know if my actions have any meaning. I just truly hope they will somehow notice the bias and just stay the F*C out.
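One way a tarpit can go that deep without storing anything is to derive every page purely from the requested path. The sketch below is my assumption of how such a scheme could look, not iocaine's actual code; `WORDS` and `page_for` are hypothetical names. The path is hashed into a seed, so the same URI always yields the same links, and every link is just the current path plus one more element.

```python
import hashlib
import random

# Hypothetical word list; a real tarpit would use a much bigger one.
WORDS = ["amber", "birch", "cobalt", "delta", "ember", "fjord"]


def page_for(path, n_links=5):
    """Return child links computed from the request path alone (no storage)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # same path -> same seed -> same page
    base = path.rstrip("/")
    links = []
    for _ in range(n_links):
        element = f"{rng.choice(WORDS)}-{rng.randrange(1000)}"
        links.append(f"{base}/{element}/")
    return links
```

A crawler that keeps following the first link ends up one path element deeper per request, which is the kind of growth that leads to paths with 100+ elements.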
Now, the slight issue with those tools is that they are mostly "local", meaning all the generated links stay within a single domain. What I think will be the next step in the evolution of those tarpits is communities that share their hostnames, with tarpits linking to one another (making it much harder for the STEALING companies to filter the results out).
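As a rough illustration of that idea (nothing like this exists as far as this post goes; the host list, `outbound_links` and `peer_ratio` are all hypothetical), a tarpit could mix links to its own host with links to hosts taken from a shared list:

```python
import random

# Hypothetical shared list; a real community would have to publish and maintain its own.
PEER_HOSTS = ["tarpit-a.example", "tarpit-b.example", "tarpit-c.example"]


def outbound_links(own_host, path, rng, n_links=6, peer_ratio=0.5):
    """Build links where roughly `peer_ratio` of them point at other tarpits."""
    peers = [host for host in PEER_HOSTS if host != own_host]
    base = path.rstrip("/")
    links = []
    for _ in range(n_links):
        use_peer = bool(peers) and rng.random() < peer_ratio
        host = rng.choice(peers) if use_peer else own_host
        links.append(f"https://{host}{base}/{rng.randrange(10**6)}/")
    return links
```

Since every participating tarpit can answer any path, the links don't need any coordination beyond the shared list of hostnames.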
$ published: 2025-08-29 11:37:25 $
$ last-edit: 2025-08-29 11:37:25 $
---