What's new in crwlr / crawler v0.5?

2022-09-03

We're already at v0.5 of the crawler package and this version comes with a lot of new features and improvements. Here's a quick overview of what's new.

Use a Headless Browser via HttpLoader

The HttpLoader has a new method useHeadlessBrowser(). If you call it, the loader uses the chrome-php/chrome library under the hood to load pages via your Chrome/Chromium installation.

class MyCrawler extends HttpCrawler
{
    public function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new PoliteHttpLoader($userAgent, logger: $logger);

        $loader->useHeadlessBrowser();

        return $loader;
    }

    // ...
}

If you need to pass some customization options to the library, you can use setHeadlessBrowserOptions() or addHeadlessBrowserOptions():

class MyCrawler extends HttpCrawler
{
    public function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new PoliteHttpLoader($userAgent, logger: $logger);

        $loader->useHeadlessBrowser();

        $loader->setHeadlessBrowserOptions([
            'windowSize' => [1024, 800],
            'enableImages' => false,
        ]);

        // or 
        $loader->addHeadlessBrowserOptions([
            'noSandbox' => true,
        ]);

        return $loader;
    }

    // ...
}

You can find more about the available options in the chrome-php/chrome docs.

Useful Helpers for Development

There are two new methods making your life easier while working on crawlers:

Step::maxOutputs()

Limits the maximum number of outputs a step will yield. Once the limit is reached, any further invocation of the step is stopped before it does anything, so no unnecessary work is done.

$crawler = new MyCrawler();

$crawler->input('https://www.crwlr.software/packages/crawler/v0.4/getting-started')
    ->addStep(Http::get())
    ->addStep(Html::getLinks('main nav a'))
    ->addStep(Http::get()->maxOutputs(10))
    ->addStep(Html::root()->extract(['title' => 'h1']));

So if you're building a crawler for a big source, you can test-run it with just a fraction of the work it will actually do once you finally remove the limits.

Crawler::outputHook()

When running your crawler, you get the composed results or just the outputs of the last step. If those are not what you expect them to be, you may want to inspect the outputs of the previous steps. To make this easier, you can pass a Closure to the new outputHook() method. It will be called with every output of every step. So you know which step a certain output is coming from, the Closure also receives the $stepIndex (an incrementing number starting at 0) and the step object itself ($step):

$crawler = new MyCrawler();

$crawler->input('https://www.crwlr.software/packages/crawler/v0.4/getting-started')
    ->addStep(Http::get())                      // stepIndex 0
    ->addStep(Html::getLinks('main nav a'))     // stepIndex 1
    ->addStep(Http::get())                      // stepIndex 2
    ->outputHook(function (Output $output, int $stepIndex, StepInterface $step) {
        if ($stepIndex === 1) {
            var_dump($output->get());
        }
    });

DomQuery::toAbsoluteUrl()

The Html::getLink() and Html::getLinks() steps mainly exist to get links to follow, so you can then extract data from the pages behind those URLs. But sometimes you might want to extract data and get links from the same page. To avoid having to solve this in an overly complicated way using a step group, you can now use the toAbsoluteUrl() method of the DomQuery class (the abstract base class behind Dom::cssSelector() and Dom::xPathQuery()):

$crawler = new MyCrawler();

$crawler->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        Html::each('#listing .row')
            ->extract([
                'title' => 'a.title',
                'url' => Dom::cssSelector('a.title')->attribute('href')->toAbsoluteUrl(),
            ])
    );

// Run crawler and process results

Just like with the Html::getLink() and Html::getLinks() steps, this only works when the step immediately before the extraction step is an HTTP loading step. Otherwise, it won't know the base URL to resolve the relative paths against.

Step::uniqueInput()

Besides Step::uniqueOutput() there is now also the Step::uniqueInput() method. You can use it on any step, and it works just the same as the uniqueOutput() method, ignoring any duplicate inputs the step receives.
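As a minimal sketch of how this could look in a crawler (the URL and selectors here are placeholders, not from the package docs):

```php
$crawler = new MyCrawler();

$crawler->input('https://www.example.com/categories')
    ->addStep(Http::get())
    // Multiple category pages may link to the same detail page.
    ->addStep(Html::getLinks('.category a.detail'))
    // With uniqueInput(), duplicate URLs arriving as inputs are
    // ignored, so each detail page is loaded only once.
    ->addStep(Http::get()->uniqueInput())
    ->addStep(Html::root()->extract(['title' => 'h1']));
```

This is handy whenever a previous step can produce the same value multiple times and you want to avoid repeating work for it.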