RSL (Really Simple Licensing) is a new initiative by a group of big internet publishers that seeks to define the conditions under which AI crawlers can harvest their content. Their guide describes the various ways content can be made available, including for free or for a paid royalty, but only by digging deeper into their reference material was I able to figure out how to prohibit all usage.
Your robots.txt needs to link to an XML file, like this:
License: https://your-domain.tld/rsl.xml
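In context, a complete robots.txt might look something like this (a hypothetical example; the User-agent rules are whatever you already have, and the domain is a placeholder):

User-agent: *
Allow: /

License: https://your-domain.tld/rsl.xml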
Then in that file you want this:
<rsl xmlns="https://rslstandard.org/rsl">
  <content url="/" server="https://rslcollective.org/api">
    <license>
      <prohibits type="usage">all</prohibits>
    </license>
  </content>
</rsl>
That’s it.
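To verify the file is actually reachable, a quick check from the command line (substitute your own domain):

curl -s https://your-domain.tld/rsl.xml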
If you want to be more liberal, you could change the <prohibits> line to:
<permits type="usage">search</permits>
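For context, assuming the same file as above, the whole thing would then read:

<rsl xmlns="https://rslstandard.org/rsl">
  <content url="/" server="https://rslcollective.org/api">
    <license>
      <permits type="usage">search</permits>
    </license>
  </content>
</rsl>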
That will let them use the content for search, which is roughly what traditional search engines have always done. More details are in their reference docs.
Optionally, to dispel any plausible deniability, you can also add a link to rsl.xml as a Link header on every HTTP response.
Link: <https://example.com/rsl.xml>; rel="license"; type="application/rsl+xml"
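If you serve your site with nginx, for example, a sketch of how you might add that header site-wide (nginx here is just an assumption about your stack):

add_header Link '<https://example.com/rsl.xml>; rel="license"; type="application/rsl+xml"' always;

The always flag makes nginx attach the header to error responses too, not just successful ones, and you can confirm it's being sent with curl -I https://example.com/.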
It’s still too early to say whether AI crawlers will respect the terms of the license publishers specify; it’ll probably take a court case or two to sort that out.
PieFed has just added RSL support to its code. Instance admins who wish to disable RSL can set the ALLOW_AI_CRAWLERS environment variable to any value.
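For example, in the environment file of a typical deployment (the value itself doesn't matter, only that the variable is set):

ALLOW_AI_CRAWLERS=1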