Documentation for crwlr / crawler (v0.3)

Loaders

Loaders are an essential part of this library. As the name implies, they are in charge of loading resources. The package ships with two loaders: the HttpLoader and the PoliteHttpLoader. But you can also write your own loaders; you just have to implement the LoaderInterface.

use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyLoader implements LoaderInterface
{
    public function __construct(private UserAgentInterface $userAgent, private LoggerInterface $logger)
    {
    }

    public function load(mixed $subject): mixed
    {
        // Load something; if loading fails, return null.
    }

    public function loadOrFail(mixed $subject): mixed
    {
        // Load something; if loading fails, throw an exception.
    }
}

To use it in your crawler, add:

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends Crawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        return new MyLoader($userAgent, $logger);
    }

    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyBot'); // the Crawler class also requires you to define a user agent
    }
}

You add a loader to the crawler via the protected loader() method. It's called only once, in the constructor of the Crawler class, and the resulting loader is then automatically passed on to every step that has an addLoader() method.

HttpLoader

The HttpLoader needs an implementation of the PSR-18 ClientInterface. By default it uses the Guzzle client, but you can extend the class and use a different implementation if you want.
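Depending on your version of the package, the constructor may already accept a PSR-18 client. If so, swapping in, for example, Symfony's PSR-18 client could look like the following sketch. Note that the constructor signature used here is an assumption, so check it against your installed version:

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;
use Symfony\Component\HttpClient\Psr18Client;

class MyCrawler extends Crawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        // Assumption: HttpLoader takes (user agent, PSR-18 client, logger).
        return new HttpLoader($userAgent, new Psr18Client(), $logger);
    }

    // define user agent
}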

Sometimes crawling a page requires cookies that the page sends you via HTTP response headers. As PSR-18 clients don't persist cookies themselves, the HttpLoader has its own cookie jar. If your crawler shouldn't use cookies, you can deactivate them:

$loader = new HttpLoader();

$loader->dontUseCookies();
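In a crawler, this would typically happen in the loader() method. A minimal sketch (assuming, as above, that the HttpLoader constructor takes the user agent, an optional PSR-18 client, and an optional logger):

protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
{
    $loader = new HttpLoader($userAgent, null, $logger); // constructor arguments are an assumption

    $loader->dontUseCookies();

    return $loader;
}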

When you build your own loading step and the loader should at some point forget all the cookies it has persisted so far, you can access the loader via $this->loader and flush the cookie jar:

$this->loader->flushCookies();
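For example, a custom loading step could flush the cookie jar before each load. This is only a rough sketch: it assumes the Step base class and the LoadingStep trait (which provides addLoader() and the $this->loader property) work as in other 0.x versions of the package, and the class name is illustrative:

use Crwlr\Crawler\Steps\Loading\LoadingStep;
use Crwlr\Crawler\Steps\Step;
use Generator;

class MyLoadingStep extends Step
{
    use LoadingStep; // assumed to provide addLoader() and $this->loader

    protected function invoke(mixed $input): Generator
    {
        // Forget all cookies persisted so far, then load the next resource.
        $this->loader->flushCookies();

        if ($output = $this->loader->load($input)) {
            yield $output;
        }
    }
}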

PoliteHttpLoader

This loader just extends the HttpLoader and uses two traits:

CheckRobotsTxt

Gets the robots.txt file and sticks to its rules. This also means that this loader only works when you're using a BotUserAgent in your crawler.

WaitPolitely

Waits a little between two requests. The wait time depends on how long the previous request took to be answered: if the server starts to respond more slowly, the crawler also waits longer between requests.
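Using the PoliteHttpLoader therefore just means returning it from your crawler's loader() method and defining a BotUserAgent. A minimal sketch (the bot name is just an example, and the loader's constructor arguments are assumed to match the HttpLoader's):

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\Http\PoliteHttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyPoliteCrawler extends Crawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyBot'); // example bot name
    }

    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        return new PoliteHttpLoader($userAgent, null, $logger); // constructor args are an assumption
    }
}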

If you don't want to use a BotUserAgent in your crawler but would still like it to wait politely between requests, just make a loader like this:

use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\Http\Traits\WaitPolitely;

class MyLoader extends HttpLoader
{
    use WaitPolitely;
}
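
A loader like this waits between requests but doesn't check robots.txt (it doesn't use the CheckRobotsTxt trait), so it also works without a BotUserAgent. Return it from your crawler's loader() method just like any other loader.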