Grouping Steps
Groups let you call two or more different steps with the same input. When invoked, a group step calls all the steps it contains one after another and combines their outputs into a single group output array.
Example:
You may want to extract data from an HTML document using CSS selectors, and also get some data from JSON-LD structured data in a <script> block within the same document. No problem, just make a group:
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// $crawler is an already created crawler instance.
$crawler
    ->input('https://www.example.com/blog-post-with-json-ld')
    ->addStep(Http::get())
    ->addStep(
        Crawler::group()
            // First step: extract data via CSS selectors.
            ->addStep(
                Html::first('#content article.blog-post')
                    ->extract([
                        'title' => 'h1',
                        'date' => '.date',
                    ])
            )
            // Second step: extract data from the JSON-LD block.
            ->addStep(
                Html::schemaOrg()
                    ->onlyType('BlogPosting')
                    ->extract([
                        'description',
                        'author' => 'author.name',
                    ])
            )
            ->addToResult()
    );
Crawler::group() creates a Group object that you can add steps to, just as you would to the crawler itself. The Group object also implements the StepInterface, so it can be added to the crawler like any other step.
In the example above, both steps produce array output, which the group merges into a combined output array like:
[
    'title' => 'Blog post title',
    'date' => '2022-01-12',
    'description' => 'This is a very sophisticated blog post about rocket science.',
    'author' => 'Christian Olear',
]
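Conceptually, this merge resembles combining associative arrays in plain PHP. The following is only an illustration using array_merge() with the values from the example above. It is not the library's actual implementation, and the variable names are made up for this sketch.

```php
<?php

// Hypothetical outputs of the two steps in the group (values from the example above).
$htmlStepOutput = [
    'title' => 'Blog post title',
    'date' => '2022-01-12',
];

$schemaOrgStepOutput = [
    'description' => 'This is a very sophisticated blog post about rocket science.',
    'author' => 'Christian Olear',
];

// Plain PHP stand-in for what the group does with both outputs.
$combined = array_merge($htmlStepOutput, $schemaOrgStepOutput);

print_r($combined);
```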
Assigning an Output Key to Scalar Output Steps
If you want to use a step that produces scalar (non-array) output in a group, you need to assign the key that its output value will have in the combined output array. You can do so by calling the outputKey() method on the step.
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;

Crawler::group()
    ->addStep(
        Html::first('article.jobAd')
            ->extract([
                'title' => 'h1',
                'location' => '.location',
            ])
    )
    ->addStep(
        Html::getLink('#applyButton')
            ->outputKey('applyLink') /* Assign key to the output value */
    );
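To picture why the key is needed: a scalar value has no key of its own, so it has to be placed under one before it can sit next to the other outputs in the combined array. A rough plain PHP illustration follows. The helper wrapScalarOutput() and the sample values are hypothetical and not part of the library.

```php
<?php

// Hypothetical helper: puts a scalar output into an array under the given key.
// Array outputs are returned unchanged.
function wrapScalarOutput(mixed $output, string $key): array
{
    return is_array($output) ? $output : [$key => $output];
}

// Made-up example outputs of the two steps in the group.
$extractOutput = ['title' => 'PHP Developer', 'location' => 'Linz'];
$linkOutput = 'https://www.example.com/jobs/123/apply';

// The scalar link becomes ['applyLink' => '...'] and can then be merged.
$combined = array_merge(
    $extractOutput,
    wrapScalarOutput($linkOutput, 'applyLink')
);

print_r($combined);
```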
Prevent Steps from Adding to the Combined Output
You can call the dontCascade() method on any step, although it probably only makes sense within a group. The step still does what it usually does, but any output it yields is not added to the combined group output.
This is useful when you need to run a step whose output you don't need, but which is necessary as preparation for the step that produces the relevant output.
use Crwlr\Crawler\Crawler;

// StepToPrepareSomething and StepWithRelevantOutput are placeholder step classes.
Crawler::group()
    ->addStep(
        (new StepToPrepareSomething())->dontCascade()
    )
    ->addStep(
        (new StepWithRelevantOutput())
    );
If you don't need the output after the group, but you do need it in some form for the next step within the group, we've got you covered:
Manipulate/Prepare the Original Input for Further Steps
Another method that is only useful within a group is updateInputUsingOutput(). Most likely you will use it in combination with dontCascade(). Let's have a look:
use Crwlr\Crawler\Crawler;

// StepToPrepareSomething and StepWithRelevantOutput are placeholder step classes.
Crawler::group()
    ->addStep(
        (new StepToPrepareSomething())
            ->dontCascade()
            ->updateInputUsingOutput(function (mixed $input, mixed $output) {
                // do something with the $input data, e.g. based on $output

                return $input;
            })
    )
    ->addStep(
        (new StepWithRelevantOutput())
    );
As mentioned, by default all the steps in a group receive the same input, which is the input passed to the group. Calling updateInputUsingOutput() on a step that only exists to prepare something for the step that actually delivers the needed data lets you further prepare that original input for the following step(s): the callback receives the original input and the preparing step's output, and returns the input to use from then on.
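The flow can be sketched in plain PHP without the library. Everything here is an assumption made for illustration: the two closures stand in for real step classes, and the 'id' and 'token' keys are invented. Only the callback's shape (original input and step output in, new input out) follows the example above.

```php
<?php

// Hypothetical stand-ins for the two steps: plain closures instead of step classes.
$stepToPrepareSomething = fn (array $input): string => 'token-for-' . $input['id'];
$stepWithRelevantOutput = fn (array $input): array => [
    'requestedWith' => $input['token'] ?? null,
];

// Same shape as the callback passed to updateInputUsingOutput():
// it receives the original input and the preparing step's output
// and returns the input for the following step(s).
$updateInput = function (mixed $input, mixed $output): mixed {
    $input['token'] = $output;

    return $input;
};

$input = ['id' => 42];

// The preparing step runs first. Its output only feeds the callback and is
// not part of the combined output (the role dontCascade() plays in a group).
$input = $updateInput($input, $stepToPrepareSomething($input));

// The following step now receives the updated input.
$combined = $stepWithRelevantOutput($input);

print_r($combined);
```

Returning the modified input from the callback, rather than changing it in place, keeps it explicit which data the following steps will see.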