Documentation for crwlr / crawler (v0.3)

Unique Output

Sometimes a data source contains the same items multiple times, but you don't want duplicates in your results. Just call the uniqueOutputs() method on any step:

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.item a')
            ->uniqueOutputs()
    );

// Run crawler and process results

Using a key to check for array/object uniqueness

When a step's output is an array (or object), you can improve performance by defining a key that is used for the uniqueness check:

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::each('.item')
            ->extract([
                'title' => 'h3',
                'price' => '.productPrice',
                'description' => '.text'
            ])
            ->uniqueOutputs('title')
    );

// Run crawler and process results

Providing a key improves performance, because for arrays (and objects) the crawler otherwise has to build a simple string key for the uniqueness check itself, by serializing and hashing the whole array/object.

That's also the secret to how this works without bloating memory consumption: the step is still a generator function, but internally it remembers the string keys of the outputs it has already yielded.
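To make that mechanism more concrete, here is a minimal, simplified sketch of what such a deduplicating generator could look like. It is not the library's actual implementation; the makeUniqueKey() and filterDuplicates() functions are just illustrative names for this example.

function makeUniqueKey(mixed $output, ?string $key = null): string
{
    // If a key was defined and the output is an array containing it,
    // use that value as the uniqueness key.
    if ($key !== null && is_array($output) && array_key_exists($key, $output)) {
        return (string) $output[$key];
    }

    // Otherwise serialize and hash the whole output to get a short string key.
    return md5(serialize($output));
}

function filterDuplicates(iterable $outputs, ?string $key = null): \Generator
{
    $alreadyYielded = [];

    foreach ($outputs as $output) {
        $uniqueKey = makeUniqueKey($output, $key);

        // Remember only the short string key, not the whole output,
        // so memory consumption stays low.
        if (!isset($alreadyYielded[$uniqueKey])) {
            $alreadyYielded[$uniqueKey] = true;

            yield $output;
        }
    }
}

Because only the string keys are kept, memory usage grows with the number of distinct outputs, not with the size of the outputs themselves.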