Documentation for crwlr / crawler (v0.4)

Attention: You're currently viewing the documentation for v0.4 of the crawler package.
This is not the latest version of the package.

User Agents

User agents are very simple. The UserAgentInterface only requires implementations to provide a __toString() method. The HttpLoader sends the resulting string as the User-Agent HTTP header with every request.
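Because the interface asks for nothing more than __toString(), a custom user agent can be very small. The following is a minimal sketch, not code from the package: the interface declared here is a local stand-in for Crwlr\Crawler\UserAgents\UserAgentInterface so the example runs on its own, and the ProductUserAgent class and its "product/version" format are hypothetical. It assumes PHP 8.0 or later.

```php
<?php

declare(strict_types=1);

// Local stand-in for Crwlr\Crawler\UserAgents\UserAgentInterface, declared here
// only so this sketch runs without the package installed. In a real project,
// import the package's interface instead of declaring your own.
interface UserAgentInterface
{
    public function __toString(): string;
}

// Hypothetical implementation: builds the header value from a product name
// and a version, e.g. "MyCrawler/1.0".
final class ProductUserAgent implements UserAgentInterface
{
    public function __construct(
        private string $product,
        private string $version,
    ) {
    }

    public function __toString(): string
    {
        return $this->product . '/' . $this->version;
    }
}

$userAgent = new ProductUserAgent('MyCrawler', '1.0');

// String conversion is what yields the value sent in the header.
echo 'User-Agent: ' . $userAgent . PHP_EOL; // prints "User-Agent: MyCrawler/1.0"
```

An object like this could presumably be returned from a crawler's userAgent() method in the same way as the built-in classes, provided it implements the package's own interface rather than the stand-in.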

If you just want to use a specific browser user agent, return it from the userAgent() method of your crawler class:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent(
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0'
        );
    }
}

Bot User Agent

If you want to be polite and identify as a bot, you can use the BotUserAgent. It requires at least the name of your bot, and you can optionally add a URL where you provide information about your crawler, as well as a version number.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyBot', 'https://www.example.com/my-bot', '1.2');
    }
}

The __toString() method of this BotUserAgent returns the following user agent string:

Mozilla/5.0 (compatible; MyBot/1.2; +https://www.example.com/my-bot)