Documentation for crwlr / crawler (v1.9)

Loaders

Loaders are a fundamental part of this library. As the name suggests, they are responsible for loading resources. By default, the Crwlr\Crawler\HttpCrawler creates an instance of the Crwlr\Crawler\Loader\Http\HttpLoader with default settings and automatically passes it to all loading steps (those implementing Crwlr\Crawler\Steps\Loading\LoadingStepInterface).

Accessing the Loader

There are several ways to access the loader instance of your crawler, or even provide your own custom loader.

Crawler::getLoader()

When using the HttpCrawler::make() shortcut method to obtain a crawler instance, you can easily access and customize the created loader via the Crawler::getLoader() method.

$crawler = HttpCrawler::make()->withUserAgent('MyCrawler');

$loader = $crawler->getLoader();

// Customize loader settings here.

Inside Custom Crawler Class

If you extend the HttpCrawler, you can override the loader() method, call the parent::loader() method to get the default HttpLoader instance, and then customize it.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = parent::loader($userAgent, $logger);

        // Customize loader settings here.

        return $loader;
    }

    // Define user agent
}

Alternatively, you can make your crawler use your own custom loader instance.

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends Crawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        return new MyLoader($userAgent, $logger);
    }

    // Define user agent
}

From a Loading Step

When building your own loading step by extending the abstract Crwlr\Crawler\Steps\Loading\LoadingStep, you can access the loader via $this->loader from within the step.

For instance, if you want your step to use the headless browser for loading, even though the crawler’s loader is configured to use the (Guzzle) HTTP client, you can switch to the headless browser just for the step invocation and switch back afterward.

use Crwlr\Crawler\Steps\Loading\LoadingStep;
use Crwlr\Crawler\Steps\StepOutputType;
use Generator;
use GuzzleHttp\Psr7\Request;

class SomeLoadingStep extends LoadingStep
{
    public function outputType(): StepOutputType
    {
        return StepOutputType::AssociativeArrayOrObject;
    }

    protected function invoke(mixed $input): Generator
    {
        // This example assumes that the loader is an instance of the HttpLoader.
        $previouslyUsedBrowser = $this->loader->usesHeadlessBrowser();

        if (!$previouslyUsedBrowser) { // Switch to using the headless browser.
            $this->loader->useHeadlessBrowser();
        }

        // Load the input URL and yield the response.
        yield $this->loader->load(new Request('GET', $input));

        if (!$previouslyUsedBrowser) { // Switch back to using the (Guzzle) HTTP client.
            $this->loader->useHttpClient();
        }
    }
}

Note: This example is only meant to demonstrate how to access the loader within a LoadingStep. If you want to implement switching between using the headless browser and the HTTP client at the beginning and end of a step invocation, take a look at our browser extension package, where we have implemented this exact functionality for easy reuse.

The HttpLoader

The package currently includes one loader: the Crwlr\Crawler\Loader\Http\HttpLoader. It offers several methods that can be used to customize its behavior.

The following code examples assume that $loader is an instance of the Crwlr\Crawler\Loader\Http\HttpLoader class. To learn how to obtain your crawler’s loader instance, see the section on accessing the loader above.

Cookies

You can customize the loader's behavior regarding cookies.

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

// If you want to flush all previously saved cookies.
// This is probably mainly useful inside a custom step.
$loader->flushCookies();

// or

// If you don't want your crawler to use cookies at all.
$loader->dontUseCookies();

Max Redirects

You can set the maximum number of redirects.

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

$loader->setMaxRedirects(15);

Using a Headless Browser to Load Pages (Execute JavaScript)

You can make the HttpLoader use a headless browser to load pages by calling the useHeadlessBrowser() method. This method utilizes the chrome-php/chrome library under the hood, so you need to have Chrome or Chromium installed on your system.

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

$loader->useHeadlessBrowser();

If you need to provide the chrome-php browser factory with the name of your Chrome executable, or some customization options, you can use the methods $loader->browser()->setExecutable(), $loader->browser()->setOptions() and $loader->browser()->addOptions():

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

$loader->useHeadlessBrowser();

$loader->browser()->setExecutable('chromium');

$loader->browser()->setOptions([
    'windowSize' => [1024, 800],
    'enableImages' => false,
]);

// or
$loader->browser()->addOptions([
    'noSandbox' => true,
]);

Additionally, you can configure which event the headless browser should wait for before considering the page load complete (possible options can be found in the chrome-php readme). You can also define the maximum wait time before a timeout is triggered (the default is 30 seconds).

use Crwlr\Crawler\Loader\Http\HttpLoader;
use HeadlessChromium\Page;

/** @var HttpLoader $loader */

$loader
    ->browser()
    ->waitForNavigationEvent(Page::DOM_CONTENT_LOADED)
    ->setTimeout(60_000); // 60 seconds.

The chrome-php library offers a lot of additional functionality, such as taking screenshots, scrolling, clicking on elements, and more. The feature in the main crawler package is primarily intended to obtain HTML source code after JavaScript execution in the browser. If you're looking for more advanced usage of a headless browser, please take a look at our browser extension package.

Loader Events

The abstract Crwlr\Crawler\Loader\Loader class provides methods to register callback functions for specific events, which are triggered by the HttpLoader whenever they occur. The available events are: beforeLoad, onCacheHit, onSuccess, onError and afterLoad. These events can be very helpful, for instance, if you want to track the number of requests sent during your entire crawling procedure and how many of them received successful responses. Here’s how you can hook into these events:

use Crwlr\Crawler\Loader\Http\HttpLoader;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

/** @var HttpLoader $loader */

$loader->beforeLoad(function (RequestInterface $request) {
    // Called before sending a request.
});

$loader->onCacheHit(function (RequestInterface $request, ResponseInterface $response) {
    // Called when a response for the request is found in the cache.
});

$loader->onSuccess(function (RequestInterface $request, ResponseInterface $response) {
    // Called when a successful response is returned.
});

$loader->onError(function (RequestInterface $request, ResponseInterface $response) {
    // Called when an error response is returned.
    // Won't be called when using the loadOrFail() method.
});

$loader->afterLoad(function (RequestInterface $request) {
    // Called after loading a request, regardless of success or error.
    // Won't be called when using the loadOrFail() method.
});
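As a stand-alone illustration of the kind of bookkeeping these events enable, the following sketch counts all requests and how many of them succeeded. The callback registry here only simulates the loader's event mechanism; with a real HttpLoader you would register the same closures via beforeLoad(), onSuccess(), and onError() as shown above.

```php
<?php

// Counters shared with the callbacks via closures.
$stats = ['sent' => 0, 'success' => 0, 'error' => 0];

// Simulated event registry standing in for the loader's events.
$callbacks = [
    'beforeLoad' => function () use (&$stats) { $stats['sent']++; },
    'onSuccess' => function () use (&$stats) { $stats['success']++; },
    'onError' => function () use (&$stats) { $stats['error']++; },
];

// Simulate three loads: two successful, one failed.
foreach ([true, true, false] as $wasSuccessful) {
    $callbacks['beforeLoad']();

    $callbacks[$wasSuccessful ? 'onSuccess' : 'onError']();
}

echo $stats['sent'] . ' requests, ' . $stats['success'] . ' successful' . PHP_EOL;
// 3 requests, 2 successful
```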

Using Proxy Servers

If you want your loader to use proxy servers, you can utilize the HttpLoader::useProxy() and HttpLoader::useRotatingProxies() methods. With rotating proxies, the loader will automatically switch to the next proxy in the provided array for each subsequent request.

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

$loader->useProxy('http://1.2.3.4:8084');

// or

$loader->useRotatingProxies([
    'http://2.3.4.5:8085',
    'http://3.4.5.6:8086',
    'http://4.5.6.7:8087',
]);

Building a Custom Loader

If you want to build a custom loader for your crawler, such as an FTP loader, you can do so by implementing the Crwlr\Crawler\Loader\LoaderInterface or by extending the Crwlr\Crawler\Loader\Loader class, which provides some base functionality.

The following example is untested and may not work as-is; it serves only to illustrate how you can build a custom loader. If you are genuinely interested in an FTP loader, let us know on Twitter, GitHub, or via the contact form.

use Crwlr\Crawler\Loader\Loader;
use Crwlr\Crawler\Logger\CliLogger;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class FtpLoader extends Loader
{
    public function __construct(
        private string $server,
        private string $user,
        private string $password,
        private string $localBasePath,
        UserAgentInterface $userAgent,
        ?LoggerInterface $logger = null,
    ) {
        parent::__construct($userAgent, $logger ?? new CliLogger());
    }

    public function load(mixed $subject): mixed
    {
        $ftp = ftp_connect($this->server);

        ftp_login($ftp, $this->user, $this->password);

        $splitFilePath = explode('/', $subject);

        $fileName = end($splitFilePath);

        $localPath = $this->localBasePath . '/' . $fileName;

        if (ftp_get($ftp, $localPath, $subject)) {
            $this->logger->info('Loaded file ' . $subject);
        } else {
            $this->logger->error('Failed to load file ' . $subject);

            $localPath = null;
        }

        ftp_close($ftp);

        return $localPath;
    }

    public function loadOrFail(mixed $subject): mixed
    {
        // Same as load(), but throw an exception if loading fails.
    }
}

To use it in your crawler, you need to create a custom crawler class:

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyFtpCrawler extends Crawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        return new FtpLoader('ftp://some.example.com', 'foo', 'bar', '/my/local/path', $userAgent);
    }

    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent('FtpCrawler');
    }
}

The protected loader() method of a crawler is called only once in the constructor, and the loader it returns is automatically passed on to every step that has an addLoader() method.
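That wiring can be pictured with the following simplified sketch. The classes here are hypothetical stand-ins, not the library's actual code: a loader instance is created once and then handed to every step that exposes an addLoader() method, while other steps are left untouched.

```php
<?php

interface ExampleLoader {}

class DummyLoader implements ExampleLoader {}

class ExampleLoadingStep
{
    public ?ExampleLoader $loader = null;

    public function addLoader(ExampleLoader $loader): void
    {
        $this->loader = $loader;
    }
}

class ExamplePlainStep {} // Has no addLoader() method, so it receives no loader.

$loader = new DummyLoader(); // Created once, as in the crawler's constructor.

$steps = [new ExampleLoadingStep(), new ExamplePlainStep(), new ExampleLoadingStep()];

// Pass the single loader instance to every step that accepts one.
foreach ($steps as $step) {
    if (method_exists($step, 'addLoader')) {
        $step->addLoader($loader);
    }
}
```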