Documentation for crwlr / crawler-ext-browser (v1.3)

Attention: You're currently viewing the documentation for v1.3 of the crawler-ext-browser package.
This is not the latest version of the package.

Taking a Screenshot

Basic Usage

The Screenshot step, as its name implies, captures an image of the page associated with a given URL. A basic example is demonstrated below:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com')
    ->addStep(Screenshot::loadAndTake(__DIR__ . '/storepath'));

$crawler->runAndDump();

Upon executing the crawler, the screenshot image is saved as a file in the specified storage path. The output of this step is a Crwlr\CrawlerExtBrowser\Aggregates\RespondedRequestWithScreenshot object. This class extends the RespondedRequest class from the crawler package, enabling access not only to the screenshot image but also to the response itself in subsequent steps or for adding it to the result.

Properties that can be added to the result are: screenshotPath, plus the properties of the underlying responded request, namely url, status, headers and body.
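For example, a minimal sketch (reusing the crawler setup from the basic example above) that adds the screenshot path and some response data to the result:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Screenshot::loadAndTake(__DIR__ . '/storepath')
            // Add the screenshot file path plus the URL and the
            // response status code to the crawling result.
            ->addToResult(['screenshotPath', 'url', 'status'])
    );

$crawler->runAndDump();
```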

Timeout

The default timeout in the chrome-php library is 30 seconds. If you want to specify a different duration, you can use the timeout() method in your step definition. This allows you to set the maximum amount of time the browser should wait for a page to load.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Screenshot::loadAndTake(__DIR__ . '/storepath')
            ->timeout(120.0) // Seconds.
    );

$crawler->runAndDump();

Combining Screenshot Capture and Data Extraction

As previously mentioned, the screenshot step produces objects that extend the RespondedRequest class. Consequently, subsequent steps can access all the response data as if an Http step was used. Below is an example crawler that captures a screenshot and then extracts the page title from the loaded page in the next step.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$myStorePath = __DIR__ . '/storepath';

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Screenshot::loadAndTake($myStorePath)
            ->addToResult(['url', 'screenshotPath'])
    )
    ->addStep(
        Html::metaData()
            ->only(['title'])
            ->addToResult()
    );

$crawler->runAndDump();

Customizing the Request

The step shares functionality with the HTTP step from the crawler package. Therefore, you can also send custom HTTP headers, decide how to handle error responses (using stopOnErrorResponse() or yieldErrorResponses()), and specify certain keys from the input to be used as the URL or for HTTP headers (using useInputKeyAsUrl() and useInputKeyAsHeader() or useInputKeyAsHeaders()).
Please note: It's not possible to instruct the browser to use a different method than GET, thus sending a request body is also not supported.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$myStorePath = __DIR__ . '/storepath';

$crawler
    ->inputs([
        ['link' => 'https://www.example.com', 'someHeaderValue' => '123abc'],
        ['link' => 'https://example.com/error', 'someHeaderValue' => '123abc'],
    ])
    ->addStep(
        Screenshot::loadAndTake($myStorePath, ['x-some-header' => 'value'])
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeader('someHeaderValue', 'x-header-value')
            ->yieldErrorResponses()
            ->addToResult(['url', 'status', 'screenshotPath'])
    );

$crawler->runAndDump();
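If you'd rather have the whole crawler run stop when a page responds with an error status, you can use stopOnErrorResponse() instead of yieldErrorResponses(). A minimal sketch of the step definition:

```php
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

// With stopOnErrorResponse(), the crawler run is stopped when a loaded
// page responds with an error status code, instead of passing the error
// response on to subsequent steps or the result.
$step = Screenshot::loadAndTake(__DIR__ . '/storepath')
    ->stopOnErrorResponse();
```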

Waiting After Page Load Before Taking a Screenshot

Sometimes you may not want to capture the screenshot immediately after the page has loaded, but instead wait for a certain amount of time first (e.g. because you know that something happens after the page is rendered that you want to await). In such cases, you can use the waitAfterPageLoaded() method.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$myStorePath = __DIR__ . '/storepath';

$crawler
    ->input('https://www.crwlr.software')
    ->addStep(
        Screenshot::loadAndTake($myStorePath)
            ->waitAfterPageLoaded(1.5)
            ->addToResult(['url', 'screenshotPath'])
    );

$crawler->runAndDump();