Crawler v1.8: Paving the Way to a Better v2.0

2024-06-05

Version 1.8 of the crwlr/crawler package is out now, featuring important new functions that will replace existing ones in v2.0. There was one thing I sometimes received negative feedback about and that I was unhappy with myself: the way composing crawling result data worked. I have now found a solution that I am quite happy with. The new functionality leads to better performance and further minimized memory usage, and it will hopefully be a lot easier to understand.

What Was Wrong with Composing Results?

[Visualization from the v1.7 docs: composing result data with the old addToResult() when a following step produces multiple outputs]

This visualization from the v1.7 documentation illustrates the issue pretty well: once you call the Step::addToResult() method on a step, a Result instance is created and attached to the outputs until the end of the crawling procedure. In a linear flow, where one input leads to one output, there is no problem. However, if a step produces multiple outputs from one received input, the attached Result object is not duplicated. Instead, all resulting outputs reference the same Result object, complicating things for the user as well as the internal logic of the crawler behind the scenes.

One main reason for this behavior was to enable use cases like the following code example, which is also shown below the visualization:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/authors')
    ->addStep(Http::get())
    ->addStep(Html::getLinks('#authors a'))
    ->addStep(Http::get())
    ->addStep(
        /* Get author data from the author detail page, including the book detail page URLs,
         * which are multiple per author */
        Html::root()
            ->extract([
                'name' => 'h1',
                'age' => '#author-data .age',
                'bornIn' => '#author-data .born-in',
                'bookUrls' => Dom::cssSelector('#author-data .books a.book')->link(),
            ])
            ->addToResult(['name', 'age', 'bornIn'])
    )
    ->addStep(Http::get()->useInputKey('bookUrls'))
    ->addStep(
        /* Add the book titles as property 'books' to the author Result object */
        Html::root()
            ->extract(['books' => 'h1'])
            ->addToResult()
    );

In the example, data about an author is extracted from an author detail page that contains links to the detail pages of all the books they have published. The goal is to obtain data in this form:

[
    'name' => 'John Example',
    'age' => '51',
    'bornIn' => 'Lisbon',
    'books' => [
        'Some Novel',
        'Another Novel',
    ]
]

This approach achieves the desired result, but it feels quite hacky, and I think it's also hard to understand why it works.

I had known for some time that I wanted to change how crawling results are composed. The goal was to still pass data along to the end of the crawling procedure, but in cases where multiple outputs stem from one input, to simply duplicate that data. Implementing this was quite easy. However, with the code example left unchanged, it would generate an individual result per book:

[
    'name' => 'John Example',
    'age' => '51',
    'bornIn' => 'Lisbon',
    'books' => 'Some Novel',
],
[
    'name' => 'John Example',
    'age' => '51',
    'bornIn' => 'Lisbon',
    'books' => 'Another Novel',
],

The reason: the author data is kept on the author detail page, where there is only one author. But from that one author input, we then load multiple books, which produces multiple outputs, so the author data is duplicated into each of them.

I experimented with several ideas until I finally arrived at the solution I now call:

Sub Crawlers

The example can now be solved like this:

$crawler
    ->input('https://www.example.com/authors')
    ->addStep(Http::get())
    ->addStep(Html::getLinks('#authors a'))
    ->addStep(Http::get())
    ->addStep(
        Html::root()
            ->extract([
                'name' => 'h1',
                'age' => '#author-data .age',
                'bornIn' => '#author-data .born-in',
                'books' => Dom::cssSelector('#author-data .books a.book')->link(),
            ])
            ->subCrawlerFor('books', function (Crawler $crawler) {
                // Behind the scenes, the sub crawler receives the book detail urls,
                // from the 'books' property above, as initial inputs.
                return $crawler
                    ->addStep(Http::get())
                    ->addStep(Html::root()->extract('h1'));
            }),
    );

I think it's a lot easier to understand what's going on there, and we can add even more nested data about the books if we want to, which isn't even possible with the old functionality. If you want to know more about it, read the new docs.
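For instance, here is a rough sketch (not from the original example; the '#book-data .year' selector is hypothetical) of how the sub crawler could return multiple properties per book instead of just the title:

->subCrawlerFor('books', function (Crawler $crawler) {
    return $crawler
        ->addStep(Http::get())
        ->addStep(
            Html::root()->extract([
                'title' => 'h1',
                // Hypothetical selector for additional nested book data.
                'year' => '#book-data .year',
            ])
        );
})

Each entry in the author's 'books' array would then be an associative array with the book's title and year, instead of just a title string.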

The New Methods to Keep Step Output Data

Here’s a brief overview of the new methods that allow step output data to be passed along:

Step::keep()

This is the replacement for Step::addToResult() and is used with steps that produce associative array or object outputs. It can be called without arguments to keep all output data, or with a string or an array of strings to pick specific output properties.
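For illustration, here is the author extraction step from the example above using keep() instead of addToResult() (a minimal sketch based on that example):

Html::root()
    ->extract([
        'name' => 'h1',
        'age' => '#author-data .age',
        'bornIn' => '#author-data .born-in',
    ])
    // Keep only selected output properties; calling ->keep() without
    // arguments would keep everything this step extracts.
    ->keep(['name', 'age', 'bornIn'])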

Step::keepAs()

This is an additional new method used with steps that produce scalar outputs. It defines the key that the scalar output value will be kept with.
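For example (a minimal sketch reusing the extraction from the sub crawler above; the key name 'title' is just illustrative), a step that extracts a single value produces a scalar output, and keepAs() defines the key it is kept under:

// extract('h1') yields a scalar string, so keepAs() assigns it a key.
Html::root()
    ->extract('h1')
    ->keepAs('title')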

Splitting this into two separate methods also helps the crawler detect incorrect usage before the run even starts: usage that could otherwise cause errors and prematurely terminate a crawling process.

Additionally, there are two similar methods to pass on data from step inputs. This can be useful for directly adding data from initial inputs to the crawling result data, or in the context of sub crawlers. These methods are Step::keepFromInput() and Step::keepInputAs().
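For instance (a hedged sketch; the key name 'listingUrl' is purely illustrative), the initial input of the first HTTP step is a scalar URL string, so Step::keepInputAs() can carry it along into the result data:

$crawler
    ->input('https://www.example.com/authors')
    // Keep the initial input URL (a scalar) under the key 'listingUrl'.
    ->addStep(Http::get()->keepInputAs('listingUrl'))
    ->addStep(Html::getLinks('#authors a'));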

For more info, see the documentation.

I hope you like this update. I am very excited about v2.0 and looking forward to implementing it 😊🥳🚀.