Documentation for crwlr / crawler (v0.3)

Attention: You're currently viewing the documentation for v0.3 of the crawler package.
This is not the latest version of the package.

The Crawler

As pointed out on the getting started page, the first step in building a crawler is to create a class that extends the Crawler (or HttpCrawler) class.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

At a minimum, the HttpCrawler requires you to define a user agent. You can read more about user agents here.

If you extend the Crawler class directly, you also have to define a loader. The HttpCrawler uses the PoliteHttpLoader by default. Read more about loaders here.

Crawlers also depend on a logger. Any implementation of the PSR-3 LoggerInterface works; by default, the CliLogger shipped with the package is used. Read more about loggers here.

Configuring a Crawler Procedure

When your crawler class is set up, you can instantiate it and start configuring the procedure it should run:

$myCrawler = new MyCrawler();

// Provide initial input, add steps and finally run it.

Provide initial input

You can provide a single initial input by using the input() method:

$myCrawler->input('https://www.crwlr.software/packages');

Or provide multiple initial inputs by calling input() multiple times:

$myCrawler->input('https://www.crwlr.software/packages/url');

$myCrawler->input('https://www.crwlr.software/packages/crawler');

Or provide multiple initial inputs as an array, by using the inputs() method:

$myCrawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/crawler',
]);

The inputs() method appends to the inputs that are already registered instead of replacing them, so you can call both methods as often as you like and nothing gets lost.
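
The append behavior described above can be pictured with a small plain-PHP sketch. The class ToyInputs and its all() method are hypothetical and only model the bookkeeping; this is not how the package implements it.

```php
<?php

// Hypothetical stand-in for a crawler; only the input bookkeeping is modeled.
final class ToyInputs
{
    /** @var string[] */
    private array $inputs = [];

    // Adds a single input to the end of the list.
    public function input(string $input): void
    {
        $this->inputs[] = $input;
    }

    // Appends to what is already there instead of replacing it.
    public function inputs(array $inputs): void
    {
        $this->inputs = array_merge($this->inputs, $inputs);
    }

    // Hypothetical accessor, only here so the result can be inspected.
    public function all(): array
    {
        return $this->inputs;
    }
}

$toy = new ToyInputs();
$toy->input('https://www.crwlr.software/packages');
$toy->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/crawler',
]);

$registered = $toy->all();
```

After these calls, $registered holds all three URLs in the order they were added.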

Add steps

Steps are the central building blocks for your crawlers. To understand how the data flows through the steps of your crawler, read Steps and Data Flow. Check out the Included Steps to see what the steps included in the package can do for you. If you need to build your own custom step, read this.

To add a step to your crawler, simply use the addStep() method:

$myCrawler->addStep(Http::get());

The method returns the crawler instance itself, so you can also chain addStep() calls:

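// Namespaces assumed here; adjust them if they differ in your installed version.
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$myCrawler->addStep(Http::get())
    ->addStep(Html::each('#list .item')->extract(['url' => 'a']))
    ->addStep(new MyCustomStep());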
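
Chaining works because every call hands back the same object. The following is a minimal plain-PHP sketch of that pattern; the class ToyPipeline and its stepCount() method are hypothetical and are not part of the package.

```php
<?php

// Hypothetical stand-in for a crawler; it only illustrates the fluent pattern.
final class ToyPipeline
{
    /** @var callable[] */
    private array $steps = [];

    // Returning $this is what makes chained calls possible.
    public function addStep(callable $step): self
    {
        $this->steps[] = $step;

        return $this;
    }

    // Hypothetical accessor, only here so the result can be inspected.
    public function stepCount(): int
    {
        return count($this->steps);
    }
}

$pipeline = new ToyPipeline();

// Each addStep() call returns $pipeline itself, so the calls can be chained.
$returned = $pipeline
    ->addStep('trim')
    ->addStep('strtoupper');
```

Here $returned and $pipeline refer to the same object, and both steps have been registered on it.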

Getting/Handling Result Data

Once you've added the steps your crawler should perform, you can run it using either run() or runAndTraverse(). Note that the Crawler class uses generators internally to be as memory efficient as possible. This means you have to iterate over the Generator that run() returns; otherwise nothing is executed.

foreach ($myCrawler->run() as $result) {
    // $result is an instance of Crwlr\Crawler\Result
}
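
Why nothing happens without iteration is a property of PHP generators in general. The sketch below shows it with a hypothetical function toyRun() that stands in for run(); it does not use the package at all.

```php
<?php

// Hypothetical stand-in for run(): a generator function that records its work.
function toyRun(array $inputs, array &$log): Generator
{
    foreach ($inputs as $input) {
        $log[] = "processed {$input}";

        yield strtoupper($input);
    }
}

$log = [];

// Calling the function only creates the Generator; its body has not run yet.
$results = toyRun(['a', 'b'], $log);
$workDoneBeforeIteration = count($log);

// Iterating is what actually drives the work, one result at a time.
$collected = [];

foreach ($results as $result) {
    $collected[] = $result;
}
```

Before the loop, $log is still empty. Only the foreach makes the generator body execute, which is also why results never have to be held in memory all at once.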

If you don't need to handle the results where you call the crawler (e.g. because you defined a store), you can use runAndTraverse() instead:

$myCrawler->setStore(new MyStore());

$myCrawler->runAndTraverse();
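
Conceptually, this amounts to iterating the generator to completion and discarding the yielded values, while the store receives each result as a side effect. A rough plain-PHP sketch under that assumption follows; toyRunWithStore() and drain() are hypothetical names, and the store is modeled as a simple callable rather than the package's store class.

```php
<?php

// Hypothetical stand-in for a crawler run that hands each result to a store
// (here just a callable) before yielding it.
function toyRunWithStore(array $inputs, callable $store): Generator
{
    foreach ($inputs as $input) {
        $result = ['url' => $input];

        $store($result);

        yield $result;
    }
}

// Hypothetical helper: iterate the generator to the end and ignore the values.
function drain(Generator $results): void
{
    foreach ($results as $unused) {
        // Nothing to do here; iterating is what triggers the work.
    }
}

$stored = [];

drain(toyRunWithStore(
    ['https://www.example.com/a', 'https://www.example.com/b'],
    function (array $result) use (&$stored): void {
        $stored[] = $result['url'];
    }
));
```

The caller never touches a result, yet $stored ends up containing both URLs.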

Memory Usage

Crawlers typically deal with large amounts of data, which is why the library uses generators wherever it can to keep memory usage low.

If your crawler still needs more memory than your current PHP config allows, the Crawler class provides two convenient helper methods to get the current memory limit and to set a higher limit, if the PHP installation allows it.

Crawler::getMemoryLimit();
// Wrapper for ini_get('memory_limit'), returns a string like e.g. 512M

Crawler::setMemoryLimit('1G');
// Wrapper for ini_set('memory_limit', <value>), returns either the prev.
// limit as string or false on failure.
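
Because the limit is a string such as 512M, comparing two limits requires converting them to bytes first. The sketch below uses only PHP's own ini_get() and ini_set(), which the helpers above are described as wrapping. The function memoryLimitToBytes() is a hypothetical helper and not part of the package.

```php
<?php

// Hypothetical helper: convert a memory_limit string such as "512M" or "1G"
// to bytes. The special value "-1" (no limit) is returned as -1.
function memoryLimitToBytes(string $limit): int
{
    $limit = trim($limit);

    if ($limit === '-1') {
        return -1;
    }

    $number = (int) $limit; // the cast keeps the leading digits, "512M" gives 512
    $unit = strtoupper(substr($limit, -1));

    switch ($unit) {
        case 'G':
            return $number * 1024 ** 3;
        case 'M':
            return $number * 1024 ** 2;
        case 'K':
            return $number * 1024;
        default:
            return $number; // no unit suffix: the value is already in bytes
    }
}

$wanted = '1G';
$currentBytes = memoryLimitToBytes((string) ini_get('memory_limit'));

// Only raise the limit; never lower it, and leave "no limit" untouched.
if ($currentBytes !== -1 && $currentBytes < memoryLimitToBytes($wanted)) {
    // ini_set() returns the previous value as a string, or false on failure.
    $previous = ini_set('memory_limit', $wanted);
}
```

Checking before setting avoids accidentally lowering a limit that is already higher than the one you ask for.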

If you think your crawler is consuming too much memory, you can also monitor its memory usage while it's running via log messages:

$myCrawler->monitorMemoryUsage();

The crawler then logs a message with the current memory usage in bytes (using memory_get_usage()) after every invocation of a step with a single input.

The method also accepts one optional parameter: a threshold in bytes. When it is set, messages are only logged when the usage exceeds that value:

$myCrawler->monitorMemoryUsage(1000000000); // only log above 1 GB
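
The threshold logic can be sketched in a few lines of plain PHP. The function memoryUsageMessage() and the message format are hypothetical; the sketch only mirrors the behavior described above and is not the package's code.

```php
<?php

// Hypothetical helper mirroring the described behavior: build a log message
// with the current usage, or return null when a threshold is given and the
// usage does not exceed it.
function memoryUsageMessage(?int $onlyIfAboveBytes = null): ?string
{
    $usage = memory_get_usage();

    if ($onlyIfAboveBytes !== null && $usage <= $onlyIfAboveBytes) {
        return null;
    }

    return "memory usage: {$usage} bytes";
}

// Without a threshold there is always a message.
$always = memoryUsageMessage();

// With a threshold that can never be exceeded there is none.
$onlyWhenHigh = memoryUsageMessage(PHP_INT_MAX);
```

A threshold keeps the log quiet during normal operation and only surfaces the inputs or steps that push usage unusually high.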