Documentation for crwlr / crawler (v1.1)

Attention: You're currently viewing the documentation for v1.1 of the crawler package.
This is not the latest version of the package.

Building Custom Steps

When you need your crawler to do something that isn't covered by one of the included steps, you can build your own. A custom step needs to implement the StepInterface, but it's more convenient to extend the abstract Step class, because then you don't have to worry about the methods that the crawler needs internally. The only thing you have to define yourself is the protected invoke() method.

use Crwlr\Crawler\Steps\Step;

class MyStep extends Step
{
    protected function invoke(mixed $input): Generator
    {
        // Implement what the step should do.
    }
}

The value passed in as $input is either one of the initial inputs you defined manually (if this is the first step in your crawler) or one of the outputs of the step that is executed before this one.
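
To make this input/output flow concrete, here is a minimal plain-PHP sketch. It does not use the crawler package at all: runSteps() and the two closures are hypothetical names made up for this illustration, and the real crawler's internals may well work differently. It only shows the idea that every output of one step becomes a separate input of the next one.

```php
<?php

/**
 * Hypothetical helper, NOT part of crwlr/crawler.
 * Feeds every input to the first step and every output of a step
 * to the step that comes after it.
 *
 * @param iterable<mixed> $inputs
 * @param array<int, callable> $steps each callable takes one input and returns a Generator
 */
function runSteps(iterable $inputs, array $steps): Generator
{
    if ($steps === []) {
        // No steps left: whatever arrives here is a final output.
        foreach ($inputs as $input) {
            yield $input;
        }

        return;
    }

    $firstStep = array_shift($steps);

    foreach ($inputs as $input) {
        // Each output of $firstStep is handed on as input to the remaining steps.
        yield from runSteps($firstStep($input), $steps);
    }
}

// First "step": one input string, one output per word.
$splitWords = function (string $input): Generator {
    foreach (explode(' ', $input) as $word) {
        yield $word;
    }
};

// Second "step": one input word, one upper-cased output.
$toUpper = function (string $input): Generator {
    yield strtoupper($input);
};

// The manually defined inputs go to the first step only.
// Passing false as second argument discards the generator keys.
$outputs = iterator_to_array(runSteps(['foo bar', 'baz'], [$splitWords, $toUpper]), false);
```

With the two inputs 'foo bar' and 'baz', the second step is called three times (once per word), and $outputs ends up as ['FOO', 'BAR', 'BAZ'].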

Validating and Sanitizing Input

In theory, $input could be anything, which is why you can also add your own validateAndSanitizeInput() method. There you can check whether the step can deal with the input (and throw an InvalidArgumentException if it can't) and also sanitize it, so that in the invoke() method you know what's inside $input.

Let's assume the step does something with an HTML document and therefore wants an instance of the Symfony DomCrawler. The HTML source code could be delivered in various ways, e.g. in a PSR-7 Response object or simply as a string:

use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Step;
use Psr\Http\Message\ResponseInterface;
use Symfony\Component\DomCrawler\Crawler;

class MyStep extends Step
{
    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        if (is_string($input)) {
            return new Crawler($input);
        }

        if ($input instanceof ResponseInterface || $input instanceof RespondedRequest) {
            // Avoid using ->getBody()->getContents() directly, because if it
            // is used again at a later point you'd first have to rewind the
            // stream to get the body again.
            // Better always use this Http::getBodyString() helper method to
            // get the body as string from an HTTP message.
            return new Crawler(Http::getBodyString($input));
        }

        throw new InvalidArgumentException('Input must be string, PSR-7 Response or RespondedRequest.');
    }

    /**
     * @param Crawler $input
     * @return Generator
     */
    protected function invoke(mixed $input): Generator
    {
        // Implement what the step should do.
    }
}

The abstract Step class takes care of internally calling both methods and handing over the return value of the validateAndSanitizeInput() method to the invoke() method, when the crawler calls the step.
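
The order of those two calls can be pictured with a simplified stand-in class. This is only a sketch under assumptions: SketchStep, its handle() method and the TrimmedLength example step are invented for this illustration and are not the package's actual Step implementation, which does more than this.

```php
<?php

// Simplified stand-in for illustration only. NOT the library's Step class.
abstract class SketchStep
{
    /**
     * Hypothetical entry point, called once per input.
     */
    final public function handle(mixed $input): Generator
    {
        // 1. Validate and sanitize the raw input (may throw).
        $sanitized = $this->validateAndSanitizeInput($input);

        // 2. Hand the sanitized value to invoke() and pass on its outputs.
        yield from $this->invoke($sanitized);
    }

    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        // Default behavior in this sketch: accept anything unchanged.
        return $input;
    }

    abstract protected function invoke(mixed $input): Generator;
}

// Example step: accepts only strings and yields the trimmed length.
final class TrimmedLength extends SketchStep
{
    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        if (!is_string($input)) {
            throw new InvalidArgumentException('Input must be a string.');
        }

        return trim($input);
    }

    /**
     * @param string $input already trimmed by validateAndSanitizeInput()
     */
    protected function invoke(mixed $input): Generator
    {
        yield strlen($input);
    }
}

$step = new TrimmedLength();

$outputs = iterator_to_array($step->handle('  hello  '), false);
```

Here invoke() receives 'hello' rather than the padded original, so $outputs is [5]. Because handle() is itself a generator in this sketch, an invalid input such as 42 only triggers the InvalidArgumentException once the result is iterated.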

Yielding Output

If you're not familiar with PHP generators, you can read about them in the Generators chapter of the PHP manual.

Assuming you want to build a step that splits a string into separate lines and passes each line as a separate output (and thus as a separate input) to the next step, it would look like this:

use Crwlr\Crawler\Steps\Step;

class MyStep extends Step
{
    /**
     * @param string $input
     * @return Generator
     */
    protected function invoke(mixed $input): Generator
    {
        foreach (explode(PHP_EOL, $input) as $line) {
            yield $line;
        }
    }
}
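
One thing to keep in mind with the example above: PHP_EOL is the line ending of the platform PHP runs on, so explode(PHP_EOL, ...) splits only on that one sequence. If the input may contain mixed line endings (for example text loaded from the web), a variation along the following lines might be more robust. It is written as a plain function so it can be tried without the crawler package; the name splitLines() is made up for this sketch, and the same body could go into a step's invoke() method.

```php
<?php

/**
 * Hypothetical plain-PHP variation of the invoke() body above.
 * Yields every line of $input as a separate output.
 */
function splitLines(string $input): Generator
{
    // Split on Windows (\r\n), old Mac (\r) and Unix (\n) line endings.
    // \r\n is listed first so it is treated as a single line break.
    $lines = preg_split('/\r\n|\r|\n/', $input);

    if ($lines === false) {
        // preg_split() failed; yield nothing.
        return;
    }

    foreach ($lines as $line) {
        yield $line;
    }
}

// Passing false as second argument discards the generator keys.
$lines = iterator_to_array(splitLines("first\r\nsecond\nthird"), false);
```

For the mixed-ending string in the usage line, $lines is ['first', 'second', 'third'], with no stray carriage return left at the end of the first line.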