Documentation for crwlr / crawler (v1.5)

Attention: You're currently viewing the documentation for v1.5 of the crawler package.
This is not the latest version of the package.

Steps and Data Flow

Steps are the building blocks of your crawlers. There are many ready-to-use steps, and you can also build your own custom ones. Crawlers accept any class that implements the StepInterface as a step.

When a crawler runs, it calls one step after another with some input. Usually you define the initial inputs for the first step manually. Most of the time, that will be one or more URLs that need to be loaded.

$crawler = new MyCrawler();

$crawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/robots-txt'
]);

For each input it is called with, a step can produce zero, one, or multiple outputs.

A step yielding one output
A step without output
A step yielding multiple outputs
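
The three cases can be sketched with plain PHP generators, independent of the crawler package. The function names below (`toUpper`, `onlyHttps`, `splitWords`) are hypothetical and exist only for illustration; they are not part of the library's API, and real steps are classes rather than plain functions.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-ins for steps: each takes one input and yields its outputs.

/** Always yields exactly one output for its input. */
function toUpper(string $input): Generator
{
    yield strtoupper($input);
}

/** Yields the input only if it is an https URL, otherwise yields nothing. */
function onlyHttps(string $input): Generator
{
    if (str_starts_with($input, 'https://')) {
        yield $input;
    }
}

/** Yields one output per whitespace-separated word, so usually several. */
function splitWords(string $input): Generator
{
    foreach (explode(' ', $input) as $word) {
        yield $word;
    }
}

// Passing false as the second argument discards generator keys, giving plain lists.
$one  = iterator_to_array(toUpper('abc'), false);                  // ['ABC']
$none = iterator_to_array(onlyHttps('http://example.com'), false); // []
$many = iterator_to_array(splitWords('a b c'), false);             // ['a', 'b', 'c']
```

A step that yields nothing simply ends the flow for that particular input; nothing is passed on to the next step for it.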

Any further steps that you add to the crawler after the first one are called with the outputs of the previous step as their inputs.

Animation showing how and when output is converted to input again for the next step

So the data (inputs and outputs) cascades down through the steps of the crawler.
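
As a rough mental model, the cascade can be sketched in plain PHP. The `cascade()` helper and the three closures below are hypothetical and written only for this illustration; they are not how the crawler package is implemented, and the order in which the real crawler processes items may differ.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: sends every input through the given steps in order.
 * Each step is a callable that takes one input and returns an iterable of outputs.
 */
function cascade(iterable $inputs, callable ...$steps): Generator
{
    // Take the first remaining step; null means there are no steps left.
    $step = array_shift($steps);

    foreach ($inputs as $input) {
        if ($step === null) {
            // No steps left: this value is a final result.
            yield $input;
            continue;
        }

        // The outputs of this step become the inputs of the remaining steps.
        yield from cascade($step($input), ...$steps);
    }
}

// Multiple outputs per input.
$splitWords = fn (string $line): array => explode(' ', $line);

// Zero or one output per input.
$dropShort = function (string $word): Generator {
    if (strlen($word) > 2) {
        yield $word;
    }
};

// Exactly one output per input.
$toUpper = fn (string $word): array => [strtoupper($word)];

// false discards the generator keys, which repeat when using "yield from".
$results = iterator_to_array(
    cascade(['a quick fox', 'jumps over it'], $splitWords, $dropShort, $toUpper),
    false
);
// $results is ['QUICK', 'FOX', 'JUMPS', 'OVER']
```

Because the helper is a generator, nothing is computed until the results are iterated, which mirrors the way the example below consumes results in a foreach loop.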

Example


use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/robots-txt'
]);

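$crawler
    // Load each of the initial input URLs.
    ->addStep(Http::get())
    // Get the links matching the selector from each loaded page.
    ->addStep(Html::getLinks('#versions a'))
    // Load each of the links found by the previous step.
    ->addStep(Http::get())
    // From the first article element of each page, extract the h1 text as 'title'.
    ->addStep(
        Html::first('article')
            ->extract(['title' => 'h1'])
    );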

foreach ($crawler->run() as $result) {
    // do something with result
}

Visualization showing the complete data flow through a whole crawler with multiple steps