unrobotstxt
This is a D translation of Google's robots exclusion protocol (robots.txt) parser and matcher. It's derived from Google's open source project, but not affiliated with Google in any way.
Features
- Matches Google's (open source) implementation of the robots.txt standard
- Available as a library or a standalone test tool
- `@safe`
Standalone tool
Can be used to test a robots.txt file and see whether it blocks or allows the URLs you expect.
Usage example
```
$ wget https://dlang.org/robots.txt
$ cat robots.txt
User-agent: *
Disallow: /phobos-prerelease/
Disallow: /library-prerelease/
Disallow: /cutting-edge/
$ robotstxt robots.txt MyBotName /index.html
user-agent 'MyBotName' with URI '/index.html': ALLOWED
$ robotstxt robots.txt MyBotName /cutting-edge/index.html
user-agent 'MyBotName' with URI '/cutting-edge/index.html': DISALLOWED
```
Building
Run `dub build` from the repo root. You can put the resulting `robotstxt` binary in your `PATH`.
Alternatively, download, build, and run straight from the DUB registry with `dub run unrobotstxt`.
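For example, to test a file without installing anything (dub forwards everything after `--` to the program):

```
$ dub run unrobotstxt -- robots.txt MyBotName /index.html
user-agent 'MyBotName' with URI '/index.html': ALLOWED
```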
Library
Usage example
```d
import std;
import unrobotstxt;

void main()
{
    const robots_txt = readText("robots.txt");
    auto matcher = new RobotsMatcher();
    if (matcher.AllowedByRobots(robots_txt, ["MyBotName"], "/index.html"))
    {
        // Do bot stuff
    }
}
```
There's no API for parsing once and then making multiple URL checks.
For pure Google-style parsing (no matching), you can also implement the callbacks in the `RobotsParseHandler` abstract class and pass your handler to `ParseRobotsTxt`.
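Here's a minimal sketch of a handler that just logs each directive. The callback names and the `(int, string)` signatures are assumptions based on Google's C++ `RobotsParseHandler` interface (the API is said to match the original C++ code; see below), so check the generated docs for the exact ones:

```d
import std.file : readText;
import std.stdio : writefln, writeln;
import unrobotstxt;

// Logs every directive the parser reports. Callback names/signatures are
// assumed to mirror the C++ RobotsParseHandler; verify against the docs.
class LoggingHandler : RobotsParseHandler
{
    override void HandleRobotsStart() { writeln("-- start of robots.txt"); }
    override void HandleRobotsEnd()   { writeln("-- end of robots.txt"); }

    override void HandleUserAgent(int lineNum, string value)
    {
        writefln("%s: User-agent: %s", lineNum, value);
    }

    override void HandleAllow(int lineNum, string value)
    {
        writefln("%s: Allow: %s", lineNum, value);
    }

    override void HandleDisallow(int lineNum, string value)
    {
        writefln("%s: Disallow: %s", lineNum, value);
    }

    override void HandleSitemap(int lineNum, string value)
    {
        writefln("%s: Sitemap: %s", lineNum, value);
    }

    // Called for lines whose key isn't a recognised directive.
    override void HandleUnknownAction(int lineNum, string action, string value)
    {
        writefln("%s: %s: %s", lineNum, action, value);
    }
}

void main()
{
    ParseRobotsTxt(readText("robots.txt"), new LoggingHandler);
}
```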
Documentation
See the generated docs. The example above is pretty much what you get, though.
The code supports a `StrictSpelling` version that corresponds to the `kAllowFrequentTypos` global boolean in the original C++ version. It disables some typo permissiveness (e.g., accepting "Disalow" for "Disallow") but still allows various other quirks. Otherwise the API matches the original C++ code.
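D version identifiers like this are fixed at compile time, set with dmd's `-version=` flag or a `versions` directive in `dub.sdl`. An illustrative sketch of how such a switch gates behaviour (not the library's actual source):

```d
// Illustrative only -- not unrobotstxt's actual code. Shows how a
// compile-time version switch like StrictSpelling typically works.
// Build with: dmd -version=StrictSpelling app.d
bool recognisedAsDisallow(string key)
{
    version (StrictSpelling)
    {
        // Strict: only the exact spelling counts.
        return key == "disallow";
    }
    else
    {
        // Permissive: also accept the frequent typo "disalow".
        return key == "disallow" || key == "disalow";
    }
}
```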
Contributing
Bug fixes and misc. improvements are welcome, but make a fork if you want to extend/change the API in ways that don't match the original. I've named this project unrobotstxt to leave the robotstxt name available for a project with a more idiomatic API.
License
Apache 2.0. Copyright © 1999-2020, Google LLC.