What's new in crwlr / crawler v0.4

2022-05-10

Last Friday, version 0.4 of the crawler package was released with some pretty useful improvements. Read on for what's shipped with this new minor update.

Step Output Filters

Any step that extends the Step class shipped with the package now has where() and orWhere() methods that you can use to filter the step's outputs. Here's a quick example from the docs:

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        // Keep only albums released after 1979 ...
        ->where('year', Filter::greaterThan(1979))
        // ... that reached number one in the US charts ...
        ->where('chartsUS', Filter::equal(1))
        // ... or in the UK charts.
        ->orWhere('chartsUK', Filter::equal(1))
);

There are not only simple filter methods like equal, greaterThan, lessThan, and so on, but also string filters like stringContains and stringStartsWith, and even filters made specifically to filter URLs by their components, like urlHost, urlDomain, urlPath, and so on.
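To sketch how these can be combined, here's a hypothetical example (the step key and property names are made up for illustration; the filter methods are the ones named above):

$crawler->addStep(
    Json::each('articles', ['title', 'url'])
        // Keep only articles hosted on www.example.com ...
        ->where('url', Filter::urlHost('www.example.com'))
        // ... or whose title contains the word "crawler".
        ->orWhere('title', Filter::stringContains('crawler'))
);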

Would you maybe like to contribute?

The list of available filters isn't very big yet, and I can think of a lot of useful filters to add here. If you have an idea and you'd consider contributing, adding some filter methods should be a rather easy task. Just reach out to me on Twitter if you have any questions about it.

New Constraints for Html::getLink() and Html::getLinks() Steps

These steps now have a few new methods to restrict the links they will find:

// Only links to urls on the same domain.
Html::getLinks()->onSameDomain();

// Only links to urls not on the same domain.
Html::getLinks()->notOnSameDomain();

// Only links to urls on (a) certain domain(s).
Html::getLinks()->onDomain('example.com');

Html::getLinks()->onDomain(['example.com', 'crwl.io']);

// Only links to urls on the same host (includes subdomain).
Html::getLinks()->onSameHost();

// Only links to urls not on the same host.
Html::getLinks()->notOnSameHost();

// Only links to urls on (a) certain host(s).
Html::getLinks()->onHost('blog.example.com');

Html::getLinks()->onHost(['blog.example.com', 'www.crwl.io']);

The steps know the url of the HTML document because they can only be used immediately after an Http step. This way you can get all the internal (same host/domain) or external (not same host/domain) links, or even all the links to any list of hosts/domains.
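A minimal sketch of such a setup (assuming a crawler class with an HTTP loader, as in the other examples):

$crawler = new MyCrawler();

$crawler->input('https://www.crwl.io/');

// The Http step loads the document, and the Html step after it
// gets the document's url from the response, so it can resolve
// relative links and compare hosts/domains.
$crawler->addStep(Http::get());

$crawler->addStep(Html::getLinks()->onSameDomain());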

If you're not sure whether you should filter by host (includes a subdomain like www) or by domain (only the registrable domain like example.com), consider the following:
Sometimes parts of what you'd consider one website live on separate subdomains, like jobs.example.com or blog.example.com. On the other hand, bigger organizations sometimes run actually different websites (e.g. for several companies of a group) on the same domain, which you may not want to crawl. So there is no general answer for this; just have a look at the pages you'd like to crawl.
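To illustrate the difference using the methods from above:

// For a link to https://jobs.example.com/open-positions:
// host:   jobs.example.com   (includes the subdomain)
// domain: example.com        (only the registrable domain)

// Matches links to jobs.example.com, blog.example.com, example.com, ...
Html::getLinks()->onDomain('example.com');

// Matches only links to exactly jobs.example.com.
Html::getLinks()->onHost('jobs.example.com');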

Stores now also get the Logger

The crawler automatically passes the logger on to all the steps you add, and from this version on it does the same for stores. This can be a breaking change (if you're wondering: 0.x versions may contain breaking changes, as defined in semver) because the StoreInterface now also requires the addLogger() method. The new abstract Store class already implements it, so you can just extend it.
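Here's a minimal sketch of a custom store built on top of the abstract class. The store's name and what it logs are illustrative, and I'm assuming the abstract Store class keeps the logger it receives via addLogger() in a protected $logger property:

use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class SimpleFileStore extends Store
{
    public function store(Result $result): void
    {
        // addLogger() is already implemented by the abstract Store class,
        // so the crawler's logger is available here (assumption: as a
        // protected, possibly nullable, $logger property).
        $this->logger?->info('Storing a crawling result');

        // ...persist $result, e.g. write it to a file or database...
    }
}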

Use Csv Step without Column Mapping

The Csv step can now also be used without defining a column mapping. In that case it uses the values from the first line (so this makes sense when the first line contains column headers) as output array keys.

$csv = <<<CSV
id,firstname,surname
1,john,doe
2,jane,doe
CSV;

$crawler = new MyCrawler();

$crawler->input($csv);

$crawler->addStep(
    Csv::parseString()
        // Without a column mapping, the skipped first line's values
        // ("id", "firstname", "surname") become the output array keys.
        ->skipFirstLine()
        ->addKeysToResult()
);

This gets you the following results:

array(3) {
  ["id"]=>
  string(1) "1"
  ["firstname"]=>
  string(4) "john"
  ["surname"]=>
  string(3) "doe"
}

array(3) {
  ["id"]=>
  string(1) "2"
  ["firstname"]=>
  string(4) "jane"
  ["surname"]=>
  string(3) "doe"
}