Documentation for crwlr / crawler (v1.7)

Unique Inputs and Outputs

Sometimes a data source contains the same items multiple times, but you don't want duplicates in your results. In that case, just call the uniqueOutputs() or uniqueInputs() method on any step:

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.item a')
            ->uniqueOutputs()
    );

// Run crawler and process results

And the same with uniqueInputs():

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(Html::getLinks('.item a'))
    ->addStep(
        Http::get()
            ->uniqueInputs()
    );

// Run crawler and process results

Using a key to check for array/object uniqueness

When the step output is an array (or object), you can improve performance by defining a key that is used for the uniqueness check:

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::each('.item')
            ->extract([
                'title' => 'h3',
                'price' => '.productPrice',
                'description' => '.text'
            ])
            ->uniqueOutputs('title')
    );

// Run crawler and process results

Defining a key improves performance because otherwise, for arrays (and objects), the crawler internally has to build a simple string key for the uniqueness check by serializing and hashing the whole array/object.
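
To illustrate, here is a minimal sketch of how such a string key could be built. This is not the library's actual implementation; the function name and the use of md5() are assumptions made for the example:

function uniquenessKey(array $output, ?string $key = null): string
{
    if ($key !== null && array_key_exists($key, $output)) {
        // With a defined key, only that single value is hashed.
        return md5((string) $output[$key]);
    }

    // Without a key, the whole array has to be serialized and
    // hashed, which is more expensive for large outputs.
    return md5(serialize($output));
}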

Those string keys are also the secret to how this works without bloating memory consumption: the step is still a Generator function, but internally it remembers the string keys of the outputs it has already yielded.
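
As a simplified sketch of that idea (again illustrative only, reusing the hypothetical uniquenessKey() helper from above): the generator keeps only the short string keys in memory, never the full outputs it has already yielded.

function yieldUniqueOnly(iterable $outputs, ?string $key = null): \Generator
{
    $seenKeys = [];

    foreach ($outputs as $output) {
        // Build a short string key for the current output.
        $keyString = is_array($output)
            ? uniquenessKey($output, $key)
            : md5((string) $output);

        if (isset($seenKeys[$keyString])) {
            continue; // An identical output was already yielded, skip it.
        }

        $seenKeys[$keyString] = true;

        yield $output;
    }
}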