What's new in crwlr / crawler v0.2 and v0.3

2022-04-30

There are already two new 0.x versions of the crawler package. Here's a quick summary of what's new in versions 0.2 and 0.3.

v0.2.0

uniqueOutputs() Step Method

Sometimes you'll have data sources containing the same items multiple times, but you don't want duplicates in your results. By calling uniqueOutputs() on any step, you can now very easily prevent duplicate outputs, even though the steps are still generator functions.

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.item a')
            ->uniqueOutputs()
    );

// Run crawler and process results

When the output of a step is an array or an object, you can also define a key on that array/object that is used to check for uniqueness.

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::each('.item')
            ->extract([
                'title' => 'h3',
                'price' => '.productPrice',
                'description' => '.text'
            ])
            ->uniqueOutputs('title')
    );

// Run crawler and process results

Defining a key can improve performance, because otherwise the step creates a string key for each array/object output by serializing and hashing it, to check for uniqueness.

This is also how it works with generator functions: the step internally remembers the keys it has already yielded. This memory is reset when the crawler run is finished.
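
To illustrate the idea, here is a minimal sketch of such a deduplicating generator (simplified, not the library's actual implementation):

// Simplified sketch of the deduplication idea, not the library's actual code.
function uniqueOnly(iterable $outputs): \Generator
{
    $seenKeys = [];

    foreach ($outputs as $output) {
        // Scalars can serve as keys directly; arrays/objects are serialized
        // and hashed to get a comparable string key.
        $key = is_scalar($output) ? (string) $output : md5(serialize($output));

        if (!isset($seenKeys[$key])) {
            $seenKeys[$key] = true;

            yield $output;
        }
    }
}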

runAndTraverse() on the Crawler

As a result of using generators, you need to iterate the results that the run() method returns; otherwise nothing happens when you call it.
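
For example, if you just want the crawler to run but don't care about its results, you'd end up with something like this:

foreach ($myCrawler->run() as $result) {
    // Empty loop body, just so the generator is actually traversed.
}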

But often you won't actually need to do anything with the results where you call the crawler, because you've set a store that persists them, or the crawler just needs to call some URLs and you don't need any results at all. So, to avoid loops with an empty body, calls to PHP's iterator_to_array(), and the like, you can now use runAndTraverse():

$myCrawler->runAndTraverse();

v0.3.0

Monitoring memory usage

The library is built to be as memory efficient as possible, but as crawlers typically deal with vast amounts of data, you can still hit memory limits. When that happens and you're not sure why, you can now tell the crawler to log its current memory usage after every step invocation, which might give you a hint about what's causing it:

$crawler->monitorMemoryUsage();

Or, if it should only log messages once memory usage exceeds a certain number of bytes:

$crawler->monitorMemoryUsage(1000000000); // only log once usage exceeds 1 GB

This way it won't pollute your logs when it's not really necessary.
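
Under the hood, such monitoring presumably boils down to PHP's memory_get_usage(). A simplified sketch of the idea (the function name and logger wiring here are assumptions, not the library's API):

use Psr\Log\LoggerInterface;

// Simplified sketch, not the library's actual code: after each step
// invocation, log current memory usage, optionally only above a limit.
function logMemoryUsage(LoggerInterface $logger, ?int $limit = null): void
{
    $usage = memory_get_usage();

    if ($limit === null || $usage > $limit) {
        $logger->info('Current memory usage: ' . $usage . ' bytes');
    }
}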

Fixes

Both new versions also contain some fixes and improvements. In particular, v0.3 fixes how generators are used internally, to be as memory efficient as possible.