Documentation for crwlr / crawler (v0.6)

Attention: You're currently viewing the documentation for v0.6 of the crawler package.
This is not the latest version of the package.

Looping Steps

There is one very typical use case that simple cascading steps can't solve without building a custom step. You may already have guessed: it's pagination. Don't worry, there's a simpler solution for crawling paginated listings, namely loops.

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Assuming $crawler is an instance of your crawler class.
$crawler->input('https://www.example.com/listing');

// Loop through the result pages as long as there is a link
// with id nextPage
$crawler->addStep(
    Crawler::loop(Http::get())
        ->withInput(Html::getLink('#nextPage'))
);

// Further parsing of the listing pages and extracting items
$crawler->addStep('url', Html::getLinks('#listing .item a'))
    ->addStep(Http::get())
    ->addStep(
        Html::first('.someElement')
            ->extract([
                'title' => 'h1',
                'id' => '#itemId'
            ])
            ->addKeysToResult()
    );

foreach ($crawler->run() as $result) {
    // Do something with results.
}
Animation showing how a loop step works by default

Let's take a closer look at the loop step in this example:
Wrapping any step in a loop step using Crawler::loop() will repeatedly call it with its own output until the step doesn't yield any output anymore.

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Loading\Http;

Crawler::loop(Http::get());

For the pagination case, looping the step with its own output alone is still not enough, because the Http step needs a URL as input.

The withInput() hook

Animation showing how a loop step works with a withInput hook

Here the withInput() hook comes to the rescue. It takes a Closure or even another step as its callback. Whenever the loop step yields output, that callback/step is first called with the output, and its return value is passed back to the loop step as input for the next iteration.

When the callback returns null, the loop stops.

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

Crawler::loop(Http::get())
    ->withInput(Html::getLink('#nextPage'))

So, the example uses a step to get a link with the id nextPage from the loaded page. If there is no such link, the loop stops.

As mentioned you can also just use a Closure like:

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Loading\Http;

Crawler::loop(Http::get())
    ->withInput(function (mixed $input, mixed $output) {
        // $input is the original input of the loop step,
        // and $output is the output it yielded.

        // The callback is also bound to the underlying
        // step that is being looped, so you can use
        // the logger via $this->logger.

        // Return whatever should be passed to the loop
        // step as its new input.
    })

As you can see, the withInput callback receives not only the output, but also the original input that the loop step was called with. This is useful, for example, when you want to keep some kind of state in the input.

Example: Let's assume you have a custom step before the loop step that reads a list of categories you then want to loop through. That custom step could yield a custom class with a __toString method returning the URL of the current category. In the withInput callback you advance the pointer of that class to the next category and pass it on to the next iteration of the loop step. If there is no next category, you return null from the withInput callback, so the loop stops.
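A minimal sketch of such a stateful input class could look like this. The class name, the category slugs, and the next() method are made up for illustration; only the pointer plus __toString pattern matters:

```php
// Hypothetical example class: holds a list of category slugs
// and a pointer to the current one.
class CategoryUrls
{
    private int $pointer = 0;

    public function __construct(private array $categories) {}

    // The looped Http step casts its input to a string,
    // so return the URL of the current category here.
    public function __toString(): string
    {
        return 'https://www.example.com/category/' . $this->categories[$this->pointer];
    }

    // Advance to the next category; return null when there
    // is none, which makes the loop stop.
    public function next(): ?self
    {
        $this->pointer++;

        return isset($this->categories[$this->pointer]) ? $this : null;
    }
}
```

Used in the hook, this boils down to `->withInput(fn (mixed $input, mixed $output) => $input->next())`.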

keepLoopingWithoutOutput()

Let's add a detail to the example above: some category pages can return 404 responses. Let's assume that's normal, because the categories you're getting from the previous step aren't guaranteed to exist. Normally, a 404 would make the loop stop, because when there is no output, it assumes the loop is finished. In this case, just use keepLoopingWithoutOutput():

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Loading\Http;

Crawler::loop(Http::get())
    ->withInput(function (mixed $input, mixed $output) {
        // When the iteration actually has no output,
        // this callback is still called. The $output
        // argument is null in this case.

        // When you return null from here, it still
        // stops the loop.
    })
    ->keepLoopingWithoutOutput();

stopIf()

As you already know, all the outputs of the step being looped are passed on as inputs to the next step, just like with any normal step. You can manually stop a loop by returning null from the withInput hook callback, but the last output that triggered the callback is still passed on to the next step, even though it isn't used for another loop iteration. If you want to prevent this, add a callback using the stopIf() method:

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Loading\Http;

Crawler::loop(Http::get())
    ->stopIf(function (mixed $input, mixed $output) {
        $responseContent = $output->response->getBody()->getContents();

        // Important: always rewind response body streams
        // after reading the content if you'll need it
        // again somewhere else. Reading it again without
        // rewinding just yields an empty string.
        $output->response->getBody()->rewind();

        return $responseContent === '{ "success": false }';
    })
    ->keepLoopingWithoutOutput();
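This rewind behavior isn't specific to the crawler; it's how PHP streams (and the PSR-7 streams wrapping them) work. A quick plain-PHP demonstration:

```php
// Write some content to an in-memory stream.
$stream = fopen('php://memory', 'r+');
fwrite($stream, '{ "success": false }');

// The write left the pointer at the end, so rewind before reading.
rewind($stream);
$first = stream_get_contents($stream);  // the full content

// Reading again without rewinding: the pointer is still at the end.
$second = stream_get_contents($stream); // empty string

// After rewinding, the content is readable again.
rewind($stream);
$third = stream_get_contents($stream);  // the full content again
```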

Prevent infinite loops with a max iterations limit

It's easy to somehow end up in an infinite loop, and it may not even be your fault. You can define that the loop should stop when there is no link with the id nextPage on a loaded page, and it currently works. Then the site owners decide to add a nextPage link on the last page that links to that same last page again, or something similar.

Therefore, it's good to set a limit defining how often the loop is allowed to iterate at most. If you don't set anything else, the default limit is 1000. You can set your own limit using maxIterations():

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

Crawler::loop(Http::get())
    ->withInput(Html::getLink('#nextPage'))
    ->maxIterations(40000);

Defer cascading outputs to the next step

As this library works with Generators, an output of one step may well be passed to the next step before the loop has finished. If you need the loop to finish all its iterations first, and only then pass all its outputs on to the next step, use cascadeWhenFinished():

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

Crawler::loop(Http::get())
    ->withInput(Html::getLink('#nextPage'))
    ->cascadeWhenFinished();
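To illustrate why outputs normally cascade before the loop is done: PHP generators are lazy, so a consumer receives each yielded value immediately, before the producer has finished. A small plain-PHP sketch (the "steps" here are invented stand-ins, not the library's classes):

```php
$log = [];

// Stands in for a looped step: yields one "page" per iteration.
$loopStep = function () use (&$log) {
    foreach (['page 1', 'page 2', 'page 3'] as $page) {
        $log[] = "loop yields $page";
        yield $page;
    }
};

// Stands in for the next step: consumes outputs as they arrive.
foreach ($loopStep() as $page) {
    $log[] = "next step handles $page";
}

// The two steps interleave: the next step handles page 1
// before the loop has yielded page 2.
```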