Sometimes the output of the last crawler step alone is not all the data you want to get from your crawler, and the final result has to be composed from different steps (or pages). For example, you may want to get jobs from a job listing where most of the data is found on the job posting detail page, but the job location is only mentioned in the listing. This is why it's possible to compose results over multiple steps.
First, you should know that the crawler internally wraps input and output data in Input and Output objects between the steps. What you finally receive from the Crawler::run() method, however, are Result objects. If you don't define what you want to get as the result, the crawler simply converts the outputs of the last step to results.
When you explicitly define what a step should add to the final result, the crawler creates a Result object at the first step that adds something and carries it along with the Output objects. The following steps can then add properties to the existing Result object.
Behaviour of Result objects in the data flow
If a step along the way yields multiple outputs, the Result object is passed on to all of them, but only as a reference, so it remains a single Result object. At the end, the crawler gives you only that one Result object. If data is added at a point where the Result object is attached to multiple outputs, the values are added to the result property as an array.
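The following is a minimal, self-contained sketch of this behaviour, not the library's implementation. `SketchResult` and its `add` method are hypothetical names invented for illustration; the sketch only relies on the fact that PHP passes objects around as handles, so several outputs can share one result, and it assumes scalar values.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a result object, for illustration only.
final class SketchResult
{
    /** @var array<string, mixed> */
    private array $data = [];

    // Store the value; if the key already holds a value, collect both in an array.
    // Assumes scalar values, which is enough for this sketch.
    public function add(string $key, mixed $value): void
    {
        if (!array_key_exists($key, $this->data)) {
            $this->data[$key] = $value;

            return;
        }

        if (!is_array($this->data[$key])) {
            $this->data[$key] = [$this->data[$key]];
        }

        $this->data[$key][] = $value;
    }

    /** @return array<string, mixed> */
    public function toArray(): array
    {
        return $this->data;
    }
}

$result = new SketchResult();
$result->add('company', 'Acme');

// One step yields two outputs; both carry a handle to the same result object.
$outputs = [
    ['value' => 'Vienna', 'result' => $result],
    ['value' => 'Berlin', 'result' => $result],
];

// A later step adds data while the result is attached to multiple outputs.
foreach ($outputs as $output) {
    $output['result']->add('location', $output['value']);
}

$final = $result->toArray();
// ['company' => 'Acme', 'location' => ['Vienna', 'Berlin']]
```

Because both outputs point at the same object, there is still only one result at the end, and the `location` property holds both values as an array.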
How to define results
There are two different ways to tell a step that it should add data to the final result.
For Steps with Array Output
Most steps that extract data yield arrays as output, so in most cases the way to go is the addKeysToResult method of the step.
use Crwlr\Crawler\Steps\Html;

$myCrawler->addStep(
    Html::each('.jobAd')
        ->extract(['title' => 'a', 'location' => '.location'])
        ->addKeysToResult()
);
This adds all keys of the step's output to the final result.
If you need to extract some data only for the next step, but don't want to add it to the final result, you can add only some keys:
use Crwlr\Crawler\Steps\Html;

$myCrawler->addStep(
    Html::each('.jobAd')
        ->extract([
            'title' => 'a',
            'location' => '.location',
            'salary' => '.salary',
        ])
        ->addKeysToResult(['title', 'location'])
);
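To make the difference between the two calls concrete, here is a small plain-PHP sketch of the key selection. `keysForResult` is a hypothetical helper written for this illustration and is not part of the library; the sample output values are made up as well.

```php
<?php

declare(strict_types=1);

// Hypothetical helper, not part of the library: decide which keys of a
// step's array output would end up in the final result.
function keysForResult(array $output, ?array $keys = null): array
{
    // No keys given: everything goes into the result.
    if ($keys === null) {
        return $output;
    }

    // Keys given: keep only those entries, in their original order.
    return array_intersect_key($output, array_flip($keys));
}

// Made-up output of an extraction step.
$output = ['title' => 'PHP Developer', 'location' => 'Vienna', 'salary' => '50k'];

$all = keysForResult($output);
$some = keysForResult($output, ['title', 'location']);
```

In this sketch `$all` contains all three keys, while `$some` contains only `title` and `location`; `salary` stays available in the step's output for the next step but is left out of the result.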
For Steps with Scalar Output
If a step yields only a single value that is not an array, you add it to the final result via the setResultKey method of the step.
use Crwlr\Crawler\Steps\Html;

$myCrawler->addStep(
    Html::getLink('#someLink')
        ->setResultKey('url')
);
Alternatively, you can pass the result key as the first argument to addStep:
use Crwlr\Crawler\Steps\Html;

$myCrawler->addStep('url', Html::getLink('#someLink'));