Documentation for crwlr / crawler (v1.10)

Custom Steps

Creating a Custom Step Class

When you need your crawler to perform a task not covered by any included step, you can easily build your own. Your custom step class needs to extend the abstract Crwlr\Crawler\Steps\Step class, and you need to implement the invoke() and outputType() methods.

use Crwlr\Crawler\Steps\Step;
use Crwlr\Crawler\Steps\StepOutputType;

class MyStep extends Step
{
    public function outputType(): StepOutputType
    {
        return StepOutputType::Scalar;
    }

    protected function invoke(mixed $input): Generator
    {
        // Implement what the step should do and yield output values.

        yield 'foo';
    }
}

Both methods are explained in more detail below.

Yielding Step Output Data Using Generators

As you can see in the invoke() method, instead of returning the output values, we use the yield keyword to pass them on. If you're not familiar with PHP generators, you can read our quickstart tutorial on PHP generators.

For instance, a step that splits a string into separate lines and passes each line on as a separate output (which becomes an input of the next step) would look like this:

use Crwlr\Crawler\Steps\Step;
use Crwlr\Crawler\Steps\StepOutputType;

class MyStep extends Step
{
    public function outputType(): StepOutputType
    {
        return StepOutputType::Scalar;
    }

    /**
     * @param string $input
     * @return Generator
     */
    protected function invoke(mixed $input): Generator
    {
        foreach (explode(PHP_EOL, $input) as $line) {
            yield $line;
        }
    }
}
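The yielding pattern can also be tried outside of a step class. The following is a minimal sketch in plain PHP, not part of the library: splitLines() is a hypothetical helper, and it swaps explode(PHP_EOL, ...) for preg_split() with \R, on the assumption that the input may use line endings other than the one of the platform the crawler runs on.

```php
<?php

/**
 * Hypothetical stand-alone helper, not part of crwlr/crawler.
 * Yields each line of $input, one value at a time.
 */
function splitLines(string $input): Generator
{
    // \R matches \n, \r\n and \r (among other newline sequences).
    // preg_split() returns false on failure, so fall back to an empty list.
    $lines = preg_split('/\R/', $input);

    foreach ($lines === false ? [] : $lines as $line) {
        yield $line;
    }
}

// The loop receives the lines one by one, just like the next step would.
foreach (splitLines("first\r\nsecond\nthird") as $line) {
    echo $line . PHP_EOL;
}
```

Inside a real step, the body of splitLines() would simply become the body of the invoke() method.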

Step Output Types

Each step must also implement the outputType() method, returning a Crwlr\Crawler\Steps\StepOutputType enum. There are three options:

  • StepOutputType::Scalar
  • StepOutputType::AssociativeArrayOrObject
  • StepOutputType::Mixed

Understanding the types of outputs a step can yield is important for the crawler to detect misconfigurations (such as using the wrong keep methods on steps) early on, before even starting to actually crawl. This helps prevent errors that might occur after the crawler has already been running for some time.

Decide the output type this way:

  • If you know your step will only yield associative arrays or objects, return StepOutputType::AssociativeArrayOrObject.
  • If it will only yield scalar values, like string, int, float, bool, return StepOutputType::Scalar.
  • If it could yield either scalar or non-scalar values depending on the instance's state, return the type that matches the current state of the instance (see example below).
  • If it's not possible to determine the output type, e.g., because it also depends on the inputs it is called with, return StepOutputType::Mixed.
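To make the distinction between the categories more tangible, here is a small sketch in plain PHP. describeOutputValue() is a hypothetical helper that is not part of the library, and its treatment of null and list-style arrays as "other" is an assumption made for illustration, not a statement about how the crawler classifies such values.

```php
<?php

/**
 * Hypothetical helper, not part of crwlr/crawler: names the category
 * a single output value would fall into.
 */
function describeOutputValue(mixed $value): string
{
    if (is_scalar($value)) {
        return 'scalar'; // string, int, float or bool
    }

    // array_is_list() requires PHP 8.1 or newer.
    if (is_object($value) || (is_array($value) && !array_is_list($value))) {
        return 'associative array or object';
    }

    return 'other'; // assumption: e.g. null or a list-style array
}

echo describeOutputValue('foo') . PHP_EOL;            // scalar
echo describeOutputValue(['foo' => 'bar']) . PHP_EOL; // associative array or object
```

If every value your step can yield lands in the same category, that category is the one to return from outputType(). If the category depends on the inputs, StepOutputType::Mixed is the safe choice.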

Here's an example of an outputType() implementation that determines the output type based on the state of the step instance:

use Crwlr\Crawler\Steps\Step;
use Crwlr\Crawler\Steps\StepOutputType;

class MyStep extends Step
{
    public bool $yieldsScalarValues = true;

    public function yieldScalarValues(): self
    {
        $this->yieldsScalarValues = true;

        return $this;
    }

    public function yieldAssociativeArrays(): self
    {
        $this->yieldsScalarValues = false;

        return $this;
    }

    public function outputType(): StepOutputType
    {
        if ($this->yieldsScalarValues) {
            return StepOutputType::Scalar;
        }

        return StepOutputType::AssociativeArrayOrObject;
    }

    protected function invoke(mixed $input): Generator
    {
        if ($this->yieldsScalarValues) {
            yield 'foo';
        } else {
            yield ['foo' => 'bar'];
        }
    }
}

Validating and Sanitizing Input

The $input argument of the invoke() method is either an initial input value that you defined manually (if this is the first step in your crawler) or an output value from the preceding step, so in theory it can be any value. To make your step reusable, you can implement a validateAndSanitizeInput() method. It lets you check whether the step can handle the input, throwing an InvalidArgumentException if it can't, and sanitize it so that the invoke() method receives a predictable input.

Let's assume the step processes an HTML document and requires an instance of the Symfony DomCrawler. The HTML source code string could be delivered in various formats, such as a PSR-7 Response object or a plain string.

Here’s an example:

use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Step;
use Psr\Http\Message\ResponseInterface;
use Symfony\Component\DomCrawler\Crawler;

class MyStep extends Step
{
    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        if (is_string($input)) {
            return new Crawler($input);
        }

        if ($input instanceof ResponseInterface || $input instanceof RespondedRequest) {
            // Avoid using ->getBody()->getContents() directly, as you would
            // need to rewind the stream to retrieve the body again later.
            // Instead, use the Http::getBodyString() helper method to get
            // the body as a string from an HTTP message.
            return new Crawler(Http::getBodyString($input));
        }

        throw new InvalidArgumentException('Input must be string, PSR-7 Response or RespondedRequest.');
    }

    /**
     * @param Crawler $input
     * @return Generator
     */
    protected function invoke(mixed $input): Generator
    {
        // Implement the step's functionality here.
    }
}

The abstract Step class ensures that both methods are called internally. It passes the return value of the validateAndSanitizeInput() method to the invoke() method when the crawler calls the step.
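The flow described above can be sketched without the library. In the following example, SketchStep and UppercaseStep are hypothetical stand-ins written for illustration only. The real Step class does considerably more, and its internals may differ from this simplified version.

```php
<?php

/**
 * Hypothetical stand-in for illustration only, not the library's Step class.
 */
abstract class SketchStep
{
    /** Plays the role of the crawler calling the step: sanitize first, then invoke. */
    final public function run(mixed $input): Generator
    {
        yield from $this->invoke($this->validateAndSanitizeInput($input));
    }

    // Default behavior in this sketch: pass the input through unchanged.
    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        return $input;
    }

    abstract protected function invoke(mixed $input): Generator;
}

class UppercaseStep extends SketchStep
{
    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        if (is_string($input) || $input instanceof Stringable) {
            return trim((string) $input);
        }

        throw new InvalidArgumentException('Input must be a string or Stringable.');
    }

    /**
     * @param string $input already trimmed by validateAndSanitizeInput()
     */
    protected function invoke(mixed $input): Generator
    {
        yield strtoupper($input);
    }
}

foreach ((new UppercaseStep())->run('  hello ') as $output) {
    echo $output . PHP_EOL; // HELLO
}
```

Because invoke() only ever sees the sanitized value, it can rely on the type documented in its @param annotation instead of checking it again.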