Documentation for crwlr / crawler (v1.10)

The Crawler

As pointed out on the getting started page, the first step in building a crawler is to create a class that extends the Crawler class or, better, the HttpCrawler class.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

At a minimum, the HttpCrawler requires you to define a user agent. The Crawler class additionally requires you to define a loader, whereas the HttpCrawler uses the HttpLoader by default. You can read more about loaders here.

User Agents

User agents are very simple. The basic UserAgentInterface only requires implementations to have a __toString() method. The HttpLoader sends that string as the User-Agent HTTP header with every request.
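Because the interface only asks for __toString(), a custom user agent can be very small. The following is a minimal sketch, not something shipped with the package: the class name AppUserAgent and the string it builds are made up for illustration, and it assumes the package is installed via Composer and the autoloader is loaded.

```php
<?php

// Assumes Composer's autoloader has already been loaded,
// e.g. via require 'vendor/autoload.php';

use Crwlr\Crawler\UserAgents\UserAgentInterface;

// Hypothetical class for illustration, not part of the package.
final class AppUserAgent implements UserAgentInterface
{
    public function __construct(
        private string $name,
        private string $version,
    ) {}

    // The only method the interface asks for, according to the text above.
    public function __toString(): string
    {
        return $this->name . '/' . $this->version;
    }
}

$userAgent = new AppUserAgent('MyTool', '0.1');

// Converting the object to a string calls __toString().
echo $userAgent . PHP_EOL;
```

An instance of such a class could then be returned from the userAgent() method of a crawler, in the same way as the user agents shown in the sections below.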

Bot User Agents

If you want to be polite and identify as a bot, you can use the BotUserAgent. It requires at least the name (product token) of your bot. Optionally, you can also add a URL where you provide information about your crawler, and a version number.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyBot', 'https://www.example.com/my-bot', '1.2');
    }
}

The __toString() method of the BotUserAgent will return this user-agent string:

Mozilla/5.0 (compatible; MyBot/1.2; +https://www.example.com/my-bot)
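To make the structure of that string easier to see, here is a small standalone sketch that assembles it from the three parts. The function botUserAgentString() is hypothetical and is not the package's implementation. In particular, how the optional URL and version are left out when they are missing is an assumption of this sketch.

```php
<?php

// Hypothetical helper for illustration only, not the package's code.
function botUserAgentString(
    string $productToken,
    ?string $infoUri = null,
    ?string $version = null,
): string {
    $userAgent = 'Mozilla/5.0 (compatible; ' . $productToken;

    // The version follows the product token, separated by a slash.
    if ($version !== null) {
        $userAgent .= '/' . $version;
    }

    // The info URL is prefixed with "+" and separated by "; ".
    if ($infoUri !== null) {
        $userAgent .= '; +' . $infoUri;
    }

    return $userAgent . ')';
}

echo botUserAgentString('MyBot', 'https://www.example.com/my-bot', '1.2') . PHP_EOL;
// prints: Mozilla/5.0 (compatible; MyBot/1.2; +https://www.example.com/my-bot)
```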

Non Bot User Agents

If you are, for example, just crawling your own website to check it for broken links, or you want to see what a site returns for a certain browser user agent, just use the UserAgent class. You can provide any string as the user agent, and a crawler using it will ignore the robots.txt file.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent(
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0'
        );
    }
}

Simple Crawler Instance Shortcut

It may seem unnecessary to create a class just to define your user agent. Therefore, if you don’t need to customize anything else in your crawler, you can create your crawler instance with a single line of code:

use Crwlr\Crawler\HttpCrawler;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

// Or to get an instance with a regular (non-bot) user agent

$crawler = HttpCrawler::make()->withUserAgent('Mozilla/5.0 (Macintosh,...) ...');

// Or use
$crawler = HttpCrawler::make()->withMozilla5CompatibleUserAgent();
// to get the user agent "Mozilla/5.0 (compatible)", indicating an unspecified browser
// that is compatible with Mozilla 5.0.

Loggers

Another dependency for crawlers is a logger. A crawler accepts any implementation of the PSR-3 LoggerInterface and by default uses the CliLogger shipped with the package, which just echoes the log lines.

To use your own logger, just define the protected logger() method in your crawler:

use Crwlr\Crawler\HttpCrawler;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function logger(): LoggerInterface
    {
        return new MyLogger();
    }

    // user agent...
}
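The MyLogger class in the example above is not defined on this page. As a rough sketch of what such a class could look like, the following logger extends the AbstractLogger from the psr/log package and keeps the formatted lines in memory. The $lines property and the line format are assumptions chosen for illustration, and the code assumes psr/log is installed and the Composer autoloader is loaded.

```php
<?php

// Assumes Composer's autoloader has already been loaded,
// e.g. via require 'vendor/autoload.php';

use Psr\Log\AbstractLogger;

// Hypothetical logger that collects formatted log lines in an array.
final class MyLogger extends AbstractLogger
{
    /** @var string[] */
    public array $lines = [];

    // AbstractLogger routes info(), warning(), error() etc. to log().
    public function log($level, $message, array $context = []): void
    {
        // Replace PSR-3 style {placeholders} with string-like context values.
        $replacements = [];

        foreach ($context as $key => $value) {
            if (is_scalar($value) || $value instanceof \Stringable) {
                $replacements['{' . $key . '}'] = (string) $value;
            }
        }

        $this->lines[] = sprintf(
            '[%s] %s',
            strtoupper((string) $level),
            strtr((string) $message, $replacements)
        );
    }
}

$logger = new MyLogger();
$logger->info('Loaded {url}', ['url' => 'https://www.example.com']);

echo implode(PHP_EOL, $logger->lines) . PHP_EOL;
// prints: [INFO] Loaded https://www.example.com
```

Extending AbstractLogger means only log() has to be written. A logger that writes to a file or forwards to another PSR-3 logger would follow the same shape.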

The logger() method is called only once in the constructor of the Crawler class, and then the logger instance is automatically handed over to every step that you add to the crawler.

Some of the included steps log information about what they are doing, or about problems and errors that occur. In your custom steps you can use the logger via $this->logger. The same applies to all callbacks that are bound to a step, such as updateInputUsingOutput() in groups.