Documentation for crwlr / crawler (v1.3)


Composing Results

By default, the Result objects that you get from the Crawler::run() call are the last step's outputs. But often, the output of the last step alone is not all the data you want to get from your crawler. It may be necessary to compose the final result from data coming from different steps (or pages). This can be achieved using the methods Step::addToResult() and/or Step::addLaterToResult(), which are available on any step.

Step::addToResult()

With Linear Data Flow

[Figure: Result object attached to I/O objects]

Let's first have a look at the addToResult() method. The first step that you call this method on creates a Result object that receives either all or only parts of the data from the step's output. This Result object is then carried along by all the Input and Output objects that originate from the output the result was created from. The subsequent steps can then add data to that Result object.

Example Use Case:
You want to get data from job ads in a job listing. Most of the data about the jobs is found on the job posting detail pages, but the job location is only mentioned in the listing. The code for this use case could look like this:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/jobs')
    ->addStep(Http::get())
    ->addStep(
        /* Get all the job URLs along with their locations from the list page. */
        Html::each('#jobs .job')
            ->extract([
                'url' => Dom::cssSelector('a.job-link')->link(),
                'location' => '.job-location',
            ])
            ->addToResult(['location'])
    )
    ->addStep(
        /* Load the detail pages using the url key from the previous step's outputs. */
        Http::get()->useInputKey('url')
    )
    ->addStep(
        /* Get more data about each job from the detail pages. */
        Html::root()
            ->extract([
                'title' => 'h1',
                'text' => '.job-ad-content',
            ])
            ->addToResult()
    );

With Subsequent Steps with Multiple Outputs

[Figure: Result when the following step has multiple outputs]

If some step along the way yields multiple outputs, the Result object is passed on to all of those outputs, but only as a reference. So it remains one single Result object after it was created, and at the end the crawler returns only that one Result object. If data is added in the area where it is attached to multiple Output objects, all of them add to the same Result object instance and the added properties become arrays.

This behaviour can be useful. Imagine this case: you want to get all authors with a list of their book titles from the website of a publishing company. There is a page for every author, and his or her books are listed there as images with links to book detail pages. You can get the book titles only from the detail pages. You call addToResult() first on the author page and again on the book detail pages. This gets you one Result object per author with an array of books inside.

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/authors')
    ->addStep(Http::get())
    ->addStep(Html::getLinks('#authors a'))
    ->addStep(Http::get())
    ->addStep(
        /* Get author data from the author detail page, including the book detail page URLs,
         * which are multiple per author */
        Html::root()
            ->extract([
                'name' => 'h1',
                'age' => '#author-data .age',
                'bornIn' => '#author-data .born-in',
                'bookUrls' => Dom::cssSelector('#author-data .books a.book')->link(),
            ])
            ->addToResult(['name', 'age', 'bornIn'])
    )
    ->addStep(Http::get()->useInputKey('bookUrls'))
    ->addStep(
        /* Add the book titles as property 'books' to the author Result object */
        Html::root()
            ->extract(['books' => 'h1'])
            ->addToResult()
    );

Running this crawler will return Result objects with data like:

[
    'name' => 'John Example',
    'age' => '51',
    'bornIn' => 'Lisbon',
    'books' => [
        'Some Novel',
        'Another Novel',
    ]
]
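
The way several outputs fill one shared Result object can be modelled in plain PHP. The following is only a simplified sketch of the behaviour described above, not the library's implementation: the class SketchResult and its add() method are hypothetical names made up for this illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the library's Result class. It only models the
 * behaviour described above: a key that receives several values becomes an array.
 */
final class SketchResult
{
    /** @var array<string, mixed> */
    private array $data = [];

    public function add(string $key, mixed $value): void
    {
        if (!array_key_exists($key, $this->data)) {
            // First value for this key: store it as-is.
            $this->data[$key] = $value;

            return;
        }

        if (!is_array($this->data[$key])) {
            // Second value for this key: turn the stored scalar into a list.
            $this->data[$key] = [$this->data[$key]];
        }

        $this->data[$key][] = $value;
    }

    /** @return array<string, mixed> */
    public function toArray(): array
    {
        return $this->data;
    }
}

// The author step creates the one Result object.
$result = new SketchResult();
$result->add('name', 'John Example');

// Each book detail page yields its own output, but all of them hold the same instance.
$bookOutputs = [
    ['title' => 'Some Novel', 'result' => $result],
    ['title' => 'Another Novel', 'result' => $result],
];

foreach ($bookOutputs as $output) {
    $output['result']->add('books', $output['title']);
}

print_r($result->toArray());
```

Because PHP passes object handles around, both entries in $bookOutputs point to the same instance, so the two titles end up in one books array next to the author's name.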

Step::addLaterToResult() - Delay Result Object Creation

[Figure: Delay creating a final Result object by using the addLaterToResult method]

But there are also cases where you might want to add data to the final result without immediately creating the Result object. Instead, the data should be remembered and added to all Result objects that are created in a subsequent step that yields multiple outputs. In this case you can use the addLaterToResult() method.

Imagine the same case as above, with the authors and books, but this time you want to get one Result object per book. And for the sake of the example, let's say the author's name is not mentioned on the book detail page. In this case you can call addLaterToResult() on the step that parses the author detail page, and then addToResult() on the step that parses the book detail page, which actually creates the final Result objects.

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/authors')
    ->addStep(Http::get())
    ->addStep(Html::getLinks('#authors a'))
    ->addStep(Http::get())
    ->addStep(
        /* Get only the author's name and add it as property author
         * to all Results that are created later */
        Html::root()
            ->extract([
                'author' => 'h1',
                'bookUrls' => Dom::cssSelector('#author-data .books a.book')->link(),
            ])
            ->addLaterToResult(['author'])
    )
    ->addStep(Http::get()->useInputKey('bookUrls'))
    ->addStep(
        /* As we're now calling addToResult(), the final Result objects are created.
         * The author property from above is added to each Result created here. */
        Html::root()
            ->extract([
                'title' => 'h1',
                'releaseDate' => '#bookDetails .release .date',
                'description' => '#bookDetails p.description',
            ])
            ->addToResult()
    );

Running this crawler will return Result objects with data like:

[
    'title' => 'Some Novel',
    'releaseDate' => '2023-01-12',
    'description' => 'Santiago is an aging, experienced fisherman who...',
    'author' => 'John Example',
]
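
To make the difference to addToResult() more tangible, here is a small plain PHP model of the delayed behaviour. It is a sketch under assumptions, not the library's code: the function sketchCreateResults() is hypothetical, and the second book's values are made up for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the library.
 *
 * @param array<string, mixed>       $remembered data kept by addLaterToResult()
 * @param list<array<string, mixed>> $outputs    outputs of the step that calls addToResult()
 * @return list<array<string, mixed>> one result array per output
 */
function sketchCreateResults(array $remembered, array $outputs): array
{
    $results = [];

    foreach ($outputs as $output) {
        // Every output becomes its own result and receives the remembered data.
        $results[] = array_merge($output, $remembered);
    }

    return $results;
}

$results = sketchCreateResults(
    ['author' => 'John Example'],
    [
        ['title' => 'Some Novel', 'releaseDate' => '2023-01-12'],
        ['title' => 'Another Novel', 'releaseDate' => '2021-06-03'],
    ],
);

print_r($results);
```

In contrast to the shared Result object from the previous section, this yields one result per book, and each of them carries the same author value.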

Choosing/Naming Result Data

Steps producing Array Output

Calling either addToResult() or addLaterToResult() without any argument is meant for steps that produce array output. It will add all the keys with their values from the output to the result.

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Html::each('.jobAd')
        ->extract([
            'title' => 'h1',
            'location' => '.jobData .location',
            'salary' => '.jobData .salary',
            'applyUrl' => Dom::cssSelector('#apply a')->link(),
        ])
        ->addToResult()
);

This example adds the properties title, location, salary and applyUrl to the result.

If you extract some properties that are only needed for the next step, but not relevant for the result, you can pick the properties you want to add:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Html::each('.jobAd')
        ->extract([
            'title' => 'h1',
            'location' => '.jobData .location',
            'salary' => '.jobData .salary',
            'applyUrl' => Dom::cssSelector('#apply a')->link(),
        ])
        ->addToResult(['title', 'location', 'salary'])
);

Steps producing Scalar Value Outputs

For steps that produce a single scalar value (without a key) as output, you need to provide the key that the value should get in the result.

use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Html::getLink('#apply a')
        ->addToResult('applyUrl')
);

So, this will add the absolute URL behind the selected link element as applyUrl to the result.
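
The three ways of calling addToResult() shown in this section (no argument, a list of keys, a single key for a scalar output) can be summarized in a plain PHP model. This is a hedged sketch of the rules described here, not the library's implementation: sketchSelectResultData() is a hypothetical function, the sample values are invented, and cases this section does not cover (for example a string key combined with array output) are not modelled faithfully.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the library.
 *
 * @param mixed                    $output one output of a step
 * @param string|list<string>|null $keys   the argument passed to addToResult()
 * @return array<string, mixed> the data that would be added to the result
 */
function sketchSelectResultData(mixed $output, string|array|null $keys = null): array
{
    if (is_string($keys)) {
        // Scalar output: the given key names the value in the result.
        return [$keys => $output];
    }

    if (!is_array($output)) {
        throw new InvalidArgumentException('Non-array output needs a key.');
    }

    if ($keys === null) {
        // No argument: every key of the array output is added.
        return $output;
    }

    // List of keys: only the picked properties are added.
    return array_intersect_key($output, array_flip($keys));
}

// Invented sample output of a step that extracts job ad data.
$jobAd = [
    'title' => 'PHP Developer',
    'location' => 'Vienna',
    'salary' => '50,000',
    'applyUrl' => 'https://www.example.com/apply/1',
];

$all = sketchSelectResultData($jobAd);
$picked = sketchSelectResultData($jobAd, ['title', 'location', 'salary']);
$scalar = sketchSelectResultData('https://www.example.com/apply/1', 'applyUrl');

print_r($picked);
```

Here $all contains all four properties, $picked leaves out applyUrl, and $scalar wraps the plain URL string under the key applyUrl.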

Step::keepInputData() - Merge Input to Output

Another way to forward data to the next step is the Step::keepInputData() method. When called, the step merges the input data it receives into the outputs it produces. This can, for example, be useful when you want to add data from the initial inputs you're giving your crawler to the final results.

Example:
Let's say you want to crawl real estate ads. You get them from some real estate platform, and you want to group them by "home type" (house, apartment). This information is available as a search filter, but it is not mentioned on the detail pages. In this case you can

  • give your crawler the URLs of the filtered search result pages, along with the type value, in an array as initial inputs,
  • forward that data to the first step's output using the keepInputData() method,
  • and then add the type to the result from the first step's output.

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->inputs([
        [
            'type' => 'house',
            'listUrl' => 'https://www.example-real-estate.com/search?filter[type]=house',
        ],
        [
            'type' => 'apartment',
            'listUrl' => 'https://www.example-real-estate.com/search?filter[type]=apartment',
        ],
    ])
    ->addStep(
        Http::get()
            ->useInputKey('listUrl')        /* Call the Http::get() step with the listUrl
                                             * from the array input above */
            ->keepInputData()               /* Merge the original input data to the step's
                                             * output */
            ->outputKey('response')         /* As the Http step yields object output
                                             * without a key, define this key that it will
                                             * have in the merged array output. */
            ->addLaterToResult(['type'])    /* Remember the type that was forwarded from
                                             * the input to the output data and add it to
                                             * the results created later */
    )
    ->addStep(Html::getLinks('#searchResults .item a')->useInputKey('response'))
    ->addStep(Http::get())
    ->addStep(
        Html::root()
            ->extract([/* real estate data */])
            ->addToResult()
    );

Because it merges input and output data, the output of a step where keepInputData() is called is always an array. If the input or the output is not an array (e.g. a scalar value), you need to define a key that it will have in the output array.

For the step's output this can be done like above, using the outputKey() method.

To give a scalar input value a key, pass the key as argument to the keepInputData() method.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/something')
    ->addStep(
        Http::get()
            ->keepInputData('url')
            ->outputKey('response')
    )
    // ...
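
The merge rules described above can be sketched in plain PHP. This is a simplified model, not the library's code: the functions sketchToArray() and sketchKeepInputData() are hypothetical, the string '<response>' merely stands in for the object that the Http step yields, and letting the output win when both sides use the same key is an assumption of this sketch.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: wraps a non-array value in an array using the given key.
 *
 * @return array<string, mixed>
 */
function sketchToArray(mixed $value, ?string $key, string $what): array
{
    if (is_array($value)) {
        return $value;
    }

    if ($key === null) {
        throw new InvalidArgumentException("A key is required for non-array {$what}.");
    }

    return [$key => $value];
}

/**
 * Hypothetical model of keepInputData(): the merged output is always an array.
 *
 * @return array<string, mixed>
 */
function sketchKeepInputData(
    mixed $input,
    mixed $output,
    ?string $inputKey = null,
    ?string $outputKey = null,
): array {
    // Assumption of this sketch: on equal keys, the output value wins.
    return array_merge(
        sketchToArray($input, $inputKey, 'input'),
        sketchToArray($output, $outputKey, 'output'),
    );
}

// Scalar input and non-array output: both need a key.
$fromScalarInput = sketchKeepInputData(
    'https://www.example.com/something',
    '<response>',
    'url',
    'response',
);

// Array input: only the output needs a key.
$fromArrayInput = sketchKeepInputData(
    ['type' => 'house', 'listUrl' => 'https://www.example-real-estate.com/search?filter[type]=house'],
    '<response>',
    null,
    'response',
);

print_r($fromScalarInput);
print_r($fromArrayInput);
```

The first call corresponds to keepInputData('url') combined with outputKey('response'), the second one to the real estate example, where the input already is an array.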