Documentation for crwlr / crawler (v0.6)


The Crawler

As pointed out on the getting started page, the first thing you need to do to build a crawler is create a class that extends the Crawler class or, better, the HttpCrawler class.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

The minimum the HttpCrawler requires you to define is a user agent. The Crawler class also requires you to define a loader; the HttpCrawler uses the HttpLoader by default. You can read more about loaders here.

User Agents

User agents are very simple. The basic UserAgentInterface only requires implementations to have a __toString() method. The HttpLoader sends that string as the User-Agent HTTP header with every request.
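
For illustration, a minimal custom implementation could look like this (just a sketch; the class name and the user-agent string are made up):

use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCustomUserAgent implements UserAgentInterface
{
    public function __toString(): string
    {
        // Whatever this method returns is sent as the User-Agent header.
        return 'MyCustomUserAgent/1.0';
    }
}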

Bot User Agents

If you want to be polite and identify as a bot, you can use the BotUserAgent to do so. It requires at least the name (product token) of your bot, but optionally you can also add a URL where you provide information about your crawler, as well as a version number.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyBot', 'https://www.example.com/my-bot', '1.2');
    }
}

The __toString() method of the BotUserAgent will return this user-agent string:

Mozilla/5.0 (compatible; MyBot/1.2; +https://www.example.com/my-bot)

Non-Bot User Agents

If you are, for example, just crawling your own website to check it for broken links, or you want to see what a site returns for a certain browser user agent, just use the UserAgent class. You can provide any string as the user agent, and the crawler will ignore the robots.txt file.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent(
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0'
        );
    }
}

Loggers

Another dependency for crawlers is a logger. The crawler accepts any implementation of the PSR-3 LoggerInterface and by default uses the CliLogger shipped with the package, which simply echoes the log lines.

To use your own logger, just define the protected logger() method in your crawler:

use Crwlr\Crawler\HttpCrawler;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function logger(): LoggerInterface
    {
        return new MyLogger();
    }

    // user agent...
}
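
For example, if you use Monolog (assuming it is installed in your project; any PSR-3 logger works), the method could return a logger that writes to a file instead of echoing to the CLI:

use Crwlr\Crawler\HttpCrawler;
use Monolog\Handler\StreamHandler;
use Monolog\Logger;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function logger(): LoggerInterface
    {
        // Write all log records to a file (the channel name and path are just examples).
        $logger = new Logger('my-crawler');

        $logger->pushHandler(new StreamHandler(__DIR__ . '/crawler.log'));

        return $logger;
    }

    // user agent...
}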

The logger() method is called only once in the constructor of the Crawler class, and then the logger instance is automatically handed over to every step that you add to the crawler.

Some of the included steps log information about what they are doing, or about problems and errors. In your custom steps you can use the logger via $this->logger. The same is true in all callbacks that are bound to a step, like the withInput() hook in loops and updateInputUsingOutput() in groups.
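
As a rough sketch of what that can look like in a custom step (assuming your step extends the package's abstract Step class and yields its outputs from an invoke() method; check the steps documentation for the exact signature in this version):

use Crwlr\Crawler\Steps\Step;
use Generator;

class MyStep extends Step
{
    protected function invoke(mixed $input): Generator
    {
        // The logger instance is handed to every step by the crawler and is
        // available via $this->logger.
        $this->logger->info('MyStep was invoked.');

        yield $input;
    }
}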