Documentation for crwlr / crawler (v0.2)

Attention: You're currently viewing the documentation for v0.2 of the crwlr/crawler package.

User Agents

User agents are very simple. The basic UserAgentInterface only requires implementations to provide a __toString() method. The HttpLoader sends the resulting string as the User-Agent HTTP header with every request.

If you just want to send a specific browser user agent, return it from the userAgent() method of your crawler class:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent(
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0'
        );
    }
}
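
Because the interface only asks for a __toString() method, a custom user agent can be a small class of your own. The following is a minimal sketch, not code from the package: the ContactUserAgent class and its constructor arguments are hypothetical, and the interface is redeclared locally only so the example runs on its own. In a real project you would import Crwlr\Crawler\UserAgents\UserAgentInterface instead.

```php
<?php

declare(strict_types=1);

// Local stand-in for Crwlr\Crawler\UserAgents\UserAgentInterface, declared here
// only to keep the sketch self-contained (assumption: it requires nothing
// beyond __toString(), as described above).
interface UserAgentInterface
{
    public function __toString(): string;
}

// Hypothetical custom user agent: a product token plus a contact address.
final class ContactUserAgent implements UserAgentInterface
{
    public function __construct(
        private string $product,
        private string $contact,
    ) {
    }

    public function __toString(): string
    {
        return sprintf('%s (%s)', $this->product, $this->contact);
    }
}

$userAgent = new ContactUserAgent('MyCrawler/1.0', 'crawler@example.com');

// Casting or echoing the object calls __toString().
echo $userAgent, PHP_EOL; // prints "MyCrawler/1.0 (crawler@example.com)"
```

An instance of such a class could then be returned from userAgent() in the same way as the UserAgent object above.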

Bot User Agent

If you want to be polite and identify as a bot, you can use the BotUserAgent. It requires at least the name of your bot; optionally, you can also pass a URL where you provide information about your crawler, and a version number.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyBot', 'https://www.example.com/my-bot', '1.2');
    }
}

The __toString() method of the BotUserAgent will return this user-agent string:

Mozilla/5.0 (compatible; MyBot/1.2; +https://www.example.com/my-bot)
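
To make the format of that string explicit, here is a rough sketch of how it can be assembled from its three parts. The function formatBotUserAgent() is a hypothetical helper and not part of crwlr/crawler. It reproduces the documented string when all three values are given; how it drops a missing URL or version is an assumption and may differ from what BotUserAgent actually does.

```php
<?php

declare(strict_types=1);

// Hypothetical helper, not part of crwlr/crawler: builds a bot user-agent
// string in the "Mozilla/5.0 (compatible; name/version; +url)" format.
function formatBotUserAgent(string $name, ?string $infoUrl = null, ?string $version = null): string
{
    // Product token: "MyBot/1.2", or just "MyBot" without a version (assumption).
    $product = $version === null ? $name : $name . '/' . $version;

    $parts = ['compatible', $product];

    // The info URL is prefixed with "+"; it is left out entirely if not given (assumption).
    if ($infoUrl !== null) {
        $parts[] = '+' . $infoUrl;
    }

    return 'Mozilla/5.0 (' . implode('; ', $parts) . ')';
}

echo formatBotUserAgent('MyBot', 'https://www.example.com/my-bot', '1.2'), PHP_EOL;
// prints "Mozilla/5.0 (compatible; MyBot/1.2; +https://www.example.com/my-bot)"
```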