Documentation for crwlr / crawler (v0.3)

Attention: You're currently viewing the documentation for v0.3 of the crawler package.
This is not the latest version of the package.

Composing Results

Sometimes the output of the last crawler step alone is not the complete result you want to get from your crawler. It may be necessary to compose the final result from several steps (or pages). For example, you may want to get jobs from a job listing where most of the data about the jobs is found on the job posting detail page, but the job location is only mentioned in the listing. This is why it's possible to compose results over multiple steps.

First, you should know that the Crawler internally wraps input and output data in Input and Output objects between the steps. What you finally receive from the Crawler::run() method, however, is a Result object. If you don't define what you want to get as the result, the crawler just converts the outputs of the last step to results.

Result object attached to I/O objects

When you explicitly define what a step should add to the final result, the crawler creates a Result object at the first step that adds something and carries it along with the Input and Output objects. The following steps can then add properties to the existing Result object.

Behaviour of Result objects in the data flow

Result when following step has multiple outputs

If a step along the way yields multiple outputs, the Result object is passed on to all of them, but only as a reference, so it remains a single Result object, and at the end the crawler gives you just that one Result object. If data is added while the Result object is attached to multiple outputs, the values are added to the corresponding result property as an array.
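
The behaviour described above can be modelled in a few lines of plain PHP. This is only a sketch under stated assumptions: SketchResult and its add() and toArray() methods are hypothetical stand-ins written for this illustration and are not the package's Result class. PHP 8.0 or later is assumed because of the mixed type. The sketch relies on the fact that PHP variables hold object handles, so every output below points to the same object.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for illustration only, NOT the package's Result class.
final class SketchResult
{
    /** @var array<string, list<mixed>> values collected per property */
    private array $data = [];

    public function add(string $key, mixed $value): void
    {
        // Every value is appended; a property may receive several values.
        $this->data[$key][] = $value;
    }

    public function toArray(): array
    {
        // A property with one value stays scalar, several values become an array.
        return array_map(
            static fn (array $values): mixed => count($values) === 1 ? $values[0] : $values,
            $this->data
        );
    }
}

$result = new SketchResult();

// Added while the result is attached to a single output.
$result->add('company', 'Example Corp');

// A later step yields three outputs; each one carries the same object handle.
$outputs = [
    ['value' => 'Job A', 'result' => $result],
    ['value' => 'Job B', 'result' => $result],
    ['value' => 'Job C', 'result' => $result],
];

foreach ($outputs as $output) {
    $output['result']->add('title', $output['value']);
}

// Prints one result: company is a string, title is an array of three strings.
print_r($result->toArray());
```

In this model there is still only one result at the end, and the title property holds all three values because it was filled while the result was shared by three outputs.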

How to define results

There are two different ways to tell a step that it should add data to the final Result object.

For steps yielding array output

Most steps that extract data yield arrays as output, so in most cases the way to go is the step's addKeysToResult() method.

$myCrawler->addStep(
    Html::each('.jobAd')
        ->extract(['title' => 'a', 'location' => '.location'])
        ->addKeysToResult()
);

This adds all keys of the step's output to the final result.
If you need to extract some data only for the next step but don't want it in the final result, you can pass only the keys that should be added:

$myCrawler->addStep(
    Html::each('.jobAd')
        ->extract([
            'title' => 'a',
            'location' => '.location',
            'salary' => '.salary',
        ])
        ->addKeysToResult(['title', 'location'])
);
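
Conceptually, the key selection above behaves like a filter on the step's output array. The following plain PHP sketch illustrates that idea. pickResultKeys() is a hypothetical helper written for this example, not a function of the package, the sample values are made up, and the package's actual implementation may differ.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper for illustration only, not part of the crawler package.
 * Returns the part of a step's array output that would go into the result.
 *
 * @param array<string, mixed> $output the array a step yields
 * @param list<string>|null    $keys   keys to keep, or null to keep all of them
 * @return array<string, mixed>
 */
function pickResultKeys(array $output, ?array $keys = null): array
{
    if ($keys === null) {
        // No keys given: every key of the output is added.
        return $output;
    }

    // Keep only the entries whose key is in the given list.
    return array_intersect_key($output, array_flip($keys));
}

// Made-up sample output of the extract step shown above.
$output = [
    'title' => 'PHP Developer',
    'location' => 'Vienna',
    'salary' => '50,000',
];

// salary stays available as step output, but is left out of the result.
print_r(pickResultKeys($output, ['title', 'location']));
```

The full output, including salary, would still be passed on to the next step. Only the selection of keys for the final result changes.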

For steps yielding a single value (non-array)

If a step yields only a single value that is not an array, you add it to the final result via the step's setResultKey() method.

$myCrawler->addStep(
    Html::getLink('#someLink')
        ->setResultKey('url')
);

Alternatively, you can pass the result key as the first argument to addStep():

$myCrawler->addStep('url', Html::getLink('#someLink'));