Steps and Data Flow
Steps are the building blocks of your crawlers. The package ships with a lot of ready-to-use steps, and you can also build your own custom ones. Crawlers accept any class that implements the StepInterface as a step.
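If you want to build a custom step, a convenient way is to extend the package's abstract Step class, which implements StepInterface for you. Here is a minimal sketch; the class ExtractDomain and its logic are invented for illustration, and the exact invoke() signature may differ between package versions:

use Crwlr\Crawler\Steps\Step;
use Generator;

class ExtractDomain extends Step
{
    // Called once per input; every yield produces one output.
    protected function invoke(mixed $input): Generator
    {
        // Yield the host part of the input URL as this step's output.
        yield parse_url($input, PHP_URL_HOST);
    }
}

You could then add it to a crawler like any built-in step: $crawler->addStep(new ExtractDomain());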
When a crawler is run, it calls one step after another with some input. Usually you define the initial inputs for the first step manually; most of the time these are one or more URLs that need to be loaded.
$crawler = new MyCrawler();

$crawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/robots-txt'
]);
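If there is only a single initial input, the crawler also has a singular input() method:

$crawler->input('https://www.crwlr.software/packages/url');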
Any step can produce zero, one, or multiple outputs from each input it is called with. Every further step that you add to the crawler is called with the outputs of the previous step as its inputs. So the data (inputs and outputs) cascades down the steps of the crawler.
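To make that cascade concrete, here is a sketch of a custom step that turns one input into a variable number of outputs. The class SplitCommaSeparatedList is invented for illustration; each yield hands one output on to the next step:

use Crwlr\Crawler\Steps\Step;
use Generator;

class SplitCommaSeparatedList extends Step
{
    protected function invoke(mixed $input): Generator
    {
        // One input can lead to zero, one, or many outputs: every
        // non-empty element is yielded as a separate output.
        foreach (explode(',', $input) as $element) {
            $element = trim($element);

            if ($element !== '') {
                yield $element;
            }
        }
    }
}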
Example
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/robots-txt'
]);

$crawler
    ->addStep(Http::get())                      // Load the two initial URLs.
    ->addStep(Html::getLinks('#versions a'))    // Extract the link URLs matching the CSS selector.
    ->addStep(Http::get())                      // Load each extracted link.
    ->addStep(
        Html::first('article')                  // Take the first article element
            ->extract(['title' => 'h1'])        // and extract its h1 text as "title".
    );
foreach ($crawler->run() as $result) {
    // do something with result
}
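What you do with a result depends on your use case. As a sketch, assuming the result object exposes the extracted properties via a toArray() method (check the Result class of the package version you are using):

foreach ($crawler->run() as $result) {
    // Assuming toArray() returns the extracted data as an
    // associative array, e.g. ['title' => '...'].
    echo $result->toArray()['title'] . PHP_EOL;
}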