Robots.txt is a file that gives search engine crawlers a polite hint about which pages shouldn’t be crawled. It’s not legally binding (I’m not a lawyer). It used to be beneficial for both webmasters and search engine crawlers: Google would occasionally take a site down by accident just by sending it too much traffic. (Obviously not a concern anymore.)
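For illustration, here's a minimal sketch of that polite-hint mechanism from the crawler's side, using Python's standard `urllib.robotparser`. The domain, paths, and bot name are placeholders, not any real crawler's behavior:

```python
import urllib.robotparser

# Fetch and parse the site's robots.txt (placeholder domain).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A well-behaved crawler checks before fetching; nothing enforces this.
if rp.can_fetch("ExampleBot", "https://example.com/private/report.html"):
    print("Allowed to crawl")
else:
    print("Politely asked not to crawl")
```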
How can sites tell LLMs what data shouldn’t be included in a training corpus? And are the incentives there for both data creators and consumers?
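There's no standard as entrenched as robots.txt for this yet, but several AI crawlers document user-agent tokens they say they honor in robots.txt (OpenAI's GPTBot, Google's Google-Extended, Common Crawl's CCBot). A sketch of what an opt-out might look like, assuming those tokens are still current:

```
# Ask AI-training crawlers to skip the whole site.
# These tokens are documented by their vendors but honored only voluntarily.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Like robots.txt itself, this only works if the incentives line up; nothing stops a crawler from ignoring it.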
* Avo
Forget about it. With all this nasty LLM stuff, companies take it for granted that they can steal everything, everywhere.