Documentation for crwlr / crawler (v1.10)

Defining a Crawling Procedure

When your crawler class is set up, you can instantiate it and start configuring the procedure it should run:

$myCrawler = new MyCrawler();

// Provide initial input, add steps and finally run it.

Provide initial Input

You can provide a single initial input by using the input() method:

$myCrawler->input('https://www.crwlr.software/packages');

Or provide multiple initial inputs by calling input() multiple times:

$myCrawler->input('https://www.crwlr.software/packages/url');

$myCrawler->input('https://www.crwlr.software/packages/crawler');

Or provide multiple initial inputs as an array, by using the inputs() method:

$myCrawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/crawler',
]);

Like input(), the inputs() method adds to the inputs already provided instead of replacing them, so you can call both methods multiple times, in any combination, and nothing will get lost.

Add Steps

Steps are the central building blocks for your crawlers. To understand how the data flows through the steps of your crawler, read Steps and Data Flow. Check out the Included Steps to see what the steps included in the package can do for you. If you need to build your own custom step, read the documentation on custom steps.

To add a step to your crawler, simply use the addStep() method:

use Crwlr\Crawler\Steps\Loading\Http;

$myCrawler->addStep(Http::get());

The method returns the crawler instance itself, so you can also chain addStep() calls:

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$myCrawler->addStep(Http::get())
    ->addStep(Html::each('#list .item')->extract(['url' => 'a']))
    ->addStep(new MyCustomStep());

Choosing a Key from Array Input

When the output from a previous step is an array, but the next step needs only a certain element from that array as its input, you can choose that array key by using the Step::useInputKey() method:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$myCrawler
    ->addStep(Http::get())
    ->addStep(
        Html::each('#list .item')
            ->extract([
                'title' => 'a.title',
                'url' => Dom::cssSelector('a.title')->link(),
            ])
    )
    ->addStep(
        Http::get()->useInputKey('url') 
    );

The Html step produces array outputs like ['title' => '...', 'url' => '...'] and the following Http::get() step uses only the url from those arrays as its input.
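To make the idea of choosing a key more concrete, here is a minimal plain-PHP sketch of what happens conceptually. It is not the library's implementation: pickInputKey() is a hypothetical helper, the titles are made-up sample values, and throwing on a missing key is an assumption made only for this sketch.

```php
<?php

// Hypothetical helper, not part of the crwlr API: picks a single element
// from an array output so it can be used as the next step's input.
function pickInputKey(array $output, string $key): mixed
{
    if (!array_key_exists($key, $output)) {
        // Assumption for this sketch only; the library may react differently.
        throw new InvalidArgumentException("Key '$key' does not exist in the output array.");
    }

    return $output[$key];
}

// Sample outputs shaped like the ones the Html step produces above.
$outputs = [
    ['title' => 'Url', 'url' => 'https://www.crwlr.software/packages/url'],
    ['title' => 'Crawler', 'url' => 'https://www.crwlr.software/packages/crawler'],
];

// Only the url element of each output is handed on to the next step.
$nextInputs = array_map(
    fn (array $output) => pickInputKey($output, 'url'),
    $outputs,
);
```

The following step then sees only plain URL strings, exactly as if they had been provided as initial inputs.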

Getting/Handling Result Data

When you've added the steps that your crawler should perform, you can finally run it using one of the methods run() or runAndTraverse(). One thing to know is that the Crawler class internally uses generators to be as memory efficient as possible. This means you need to iterate over the Generator that the run() method returns; otherwise it won't do anything.

foreach ($myCrawler->run() as $result) {
    // $result is an instance of Crwlr\Crawler\Result
}
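If you are wondering why a run() call that is never iterated does nothing, the following plain-PHP sketch shows the underlying generator behavior. It does not use the library at all: runSketch() is a hypothetical stand-in for a crawler run, and the uppercase transformation is just a placeholder for real work.

```php
<?php

// Hypothetical stand-in for a crawler run: yields one result per input.
function runSketch(array $inputs, array &$log): Generator
{
    foreach ($inputs as $input) {
        // This side effect only happens while the generator is iterated.
        $log[] = "processing $input";

        yield strtoupper($input);
    }
}

$log = [];

// Calling the function only creates the Generator object.
$generator = runSketch(['a', 'b'], $log);

// Copy of the log before iterating: no work has been done at this point.
$logBeforeIteration = $log;

$results = [];

// The function body runs step by step as the loop pulls values.
foreach ($generator as $result) {
    $results[] = $result;
}
```

Because each value is produced only when the loop asks for it, the results never have to be held in memory all at once, unless you collect them yourself as this sketch does.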

If you don't need to receive the results at the place where you call the crawler (e.g. because you defined a store), you can use runAndTraverse() instead:

$myCrawler->setStore(new MyStore());

$myCrawler->runAndTraverse();

And if you just want the results to be printed when you run your crawler script from the command line, you can use the runAndDump() method:

$myCrawler->runAndDump();

Memory Usage

Crawlers typically deal with large amounts of data, which is why the library uses generators wherever possible to keep memory usage low.

If your crawler still needs more memory than your current PHP configuration allows, the Crawler class contains two convenient helper methods to get the current memory limit and to set a higher limit, if the PHP installation allows it.

use Crwlr\Crawler\Crawler;

Crawler::getMemoryLimit();
// Wrapper for ini_get('memory_limit'), returns a string like '512M'.

Crawler::setMemoryLimit('1G');
// Wrapper for ini_set('memory_limit', <value>), returns either the
// previous limit as a string, or false on failure.
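Since both helpers are described as wrappers around PHP's own ini functions, this is roughly what happens underneath, written with ini_get() and ini_set() directly. It is a sketch of the wrapped calls under that assumption, not the library's code, and it shows how you might check the return value.

```php
<?php

// Current limit as a string, e.g. "128M", or "-1" for unlimited.
$before = ini_get('memory_limit');

// ini_set() returns the previous value as a string, or false on failure.
$previous = ini_set('memory_limit', '1G');

if ($previous === false) {
    echo "Could not change the memory limit, it is still $before." . PHP_EOL;
} else {
    echo "Memory limit changed from $previous to " . ini_get('memory_limit') . '.' . PHP_EOL;
}
```

Checking for false matters because some environments don't allow changing the limit at runtime.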

If you think your crawler is consuming too much memory, you can also monitor its memory usage while it's running via log messages:

$myCrawler->monitorMemoryUsage();

It will then write a log message, telling you the current memory usage in bytes (using memory_get_usage()), after every invocation of a step with a single input.

The method also has one optional parameter, a threshold in bytes, that tells it to only log messages when the usage exceeds that value:

$myCrawler->monitorMemoryUsage(1000000000);
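The threshold logic can be pictured with a small plain-PHP sketch. memoryLogMessage() is a hypothetical function and the message format is made up; the library's actual log output and its exact comparison may differ. Here, "exceeds" is interpreted as strictly greater than the threshold.

```php
<?php

// Hypothetical helper, not the library's code: returns the message to log,
// or null when the usage does not exceed the optional threshold.
function memoryLogMessage(int $usageInBytes, ?int $thresholdInBytes = null): ?string
{
    if ($thresholdInBytes !== null && $usageInBytes <= $thresholdInBytes) {
        return null; // At or below the threshold: stay quiet.
    }

    return 'memory usage: ' . $usageInBytes . ' bytes';
}

// Without a threshold, every call produces a message.
$always = memoryLogMessage(memory_get_usage());

// With a threshold of 1 GB, a message is produced only above that value.
$onlyWhenHigh = memoryLogMessage(memory_get_usage(), 1_000_000_000);
```

A threshold like this keeps the log readable on long runs, because you only see messages once the usage gets into a range you actually care about.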