What's new in crwlr / crawler v0.6?

2022-10-03

Version 0.6 is probably the biggest update so far, with a lot of new features and steps: from crawling whole websites and working with sitemaps to extracting metadata and schema.org structured data from HTML. Here is an overview of everything that's new.

ℹ️ First, an important note if you're already using the library:
0.x versions are still "development versions" and can contain changes that break backwards compatibility. I try to avoid them and there won't be many, but this version contains one breaking change:
The PoliteHttpLoader (and the traits WaitPolitely and CheckRobotsTxt) have been removed. The politeness features are now baked into (dependencies of) the HttpLoader. Throttling (formerly WaitPolitely) is done by default and can be configured. Whether your crawler loads and respects robots.txt files depends on whether you're using a BotUserAgent. More on this on the new Documentation Page about Politeness.

Crawling whole Websites

The new Http::crawl() step allows you to easily crawl whole websites, and it also has a lot of options, like:

  • Only crawl to a certain depth.
  • Start from a sitemap.
  • Stay on the same domain or on the same host.
  • Only load URLs with certain paths.
  • ...

In this example you start with a sitemap and load only URLs with a path starting with /foo/:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(
        Http::crawl()
            ->inputIsSitemap()
            ->pathStartsWith('/foo/')
    );

You can read more about this feature here.

New Sitemap Steps

There are two new steps to work with sitemaps: one to get all sitemap URLs listed in a website's robots.txt file, and another to get all the URLs (optionally with additional data like priority) listed in a sitemap.

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

$crawler->input('https://www.example.com/')
    ->addStep(Sitemap::getSitemapsFromRobotsTxt())
    ->addStep(Http::get())
    ->addStep(Sitemap::getUrlsFromSitemap());

Read more about these steps here.
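
To illustrate what the first of these steps looks for: a robots.txt file can list sitemaps on lines starting with "Sitemap:". The following is only a rough sketch using plain PHP, not the library's implementation, and sitemapUrlsFromRobotsTxt() is a hypothetical helper name made up for this example.

```php
<?php

/**
 * Collect the URLs from all "Sitemap:" lines of a robots.txt body.
 * Hypothetical helper for illustration only, not part of the crwlr library.
 *
 * @return string[]
 */
function sitemapUrlsFromRobotsTxt(string $robotsTxt): array
{
    // Match the field name case-insensitively at the start of each line
    // and capture the URL that follows it.
    preg_match_all('/^[ \t]*sitemap:[ \t]*(\S+)/mi', $robotsTxt, $matches);

    return $matches[1];
}

$robotsTxt = <<<'TXT'
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
sitemap: https://www.example.com/blog/sitemap.xml
TXT;

$sitemaps = sitemapUrlsFromRobotsTxt($robotsTxt);

print_r($sitemaps);
```

In the crawler example above, each sitemap URL found this way is then loaded with Http::get() and handed to Sitemap::getUrlsFromSitemap().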

Extracting Metadata and schema.org structured data from HTML documents

There are two new HTML steps to easily extract Metadata and schema.org structured data (in JSON-LD format) from HTML documents:

Extracting Metadata

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Html;

$crawler->input('https://www.crwlr.software/')
    ->addStep(Http::get())
    ->addStep(
        Html::metaData()
            ->only(['title', 'description', 'og:image'])
    );

This step gets you all the data from <meta> tags which have a name or property attribute and also the title from the <title> tag. Read more here.
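
If you're wondering what that means in practice, here is a minimal sketch of the same idea using only PHP's DOM extension. It is not how the library implements the step; extractMetaData() is a hypothetical function and the HTML is made-up sample data.

```php
<?php

/**
 * Collect the <title> text and the content of all <meta> tags that have
 * a name or property attribute.
 * Hypothetical helper for illustration only, not part of the crwlr library.
 *
 * @return array<string, string>
 */
function extractMetaData(string $html): array
{
    // Real-world HTML is rarely perfectly valid, so collect parser
    // warnings internally instead of emitting them.
    libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($html);

    $data = [];

    $title = $document->getElementsByTagName('title')->item(0);

    if ($title !== null) {
        $data['title'] = trim($title->textContent);
    }

    foreach ($document->getElementsByTagName('meta') as $meta) {
        // Prefer the name attribute, fall back to property
        // (used, for example, by Open Graph tags).
        $key = $meta->getAttribute('name') ?: $meta->getAttribute('property');

        if ($key !== '') {
            $data[$key] = $meta->getAttribute('content');
        }
    }

    return $data;
}

$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
    <title>Example page</title>
    <meta charset="utf-8">
    <meta name="description" content="Just an example.">
    <meta property="og:image" content="https://www.example.com/image.png">
</head>
<body></body>
</html>
HTML;

$metaData = extractMetaData($html);

print_r($metaData);
```

The charset tag is skipped here because it has neither a name nor a property attribute.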

Extracting schema.org structured data

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Html;

$crawler->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        Html::schemaOrg()
            ->onlyType('JobPosting')
            ->extract([
                'title',
                'description',
                'company' => 'hiringOrganization.name',
            ])
    );

Read more about this here.
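
For readers who haven't worked with JSON-LD before: the structured data lives in <script type="application/ld+json"> tags, and a mapping like 'hiringOrganization.name' points to a nested property. The sketch below shows that idea with plain PHP (8.0 or newer). It is an assumption-laden illustration, not the library's code: schemaOrgObjects() and getByDotNotation() are hypothetical helpers, and it only handles a single top-level object per script tag.

```php
<?php

/**
 * Find JSON-LD objects of a certain schema.org type in an HTML document.
 * Hypothetical helper for illustration only, not part of the crwlr library.
 */
function schemaOrgObjects(string $html, string $type): array
{
    libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($html);

    $xpath = new DOMXPath($document);
    $objects = [];

    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);

        // Simplification: only look at a single top-level object.
        if (is_array($data) && ($data['@type'] ?? null) === $type) {
            $objects[] = $data;
        }
    }

    return $objects;
}

/**
 * Resolve a path like "hiringOrganization.name" in a nested array.
 * Returns null when any part of the path is missing.
 */
function getByDotNotation(array $data, string $path): mixed
{
    $value = $data;

    foreach (explode('.', $path) as $key) {
        if (!is_array($value) || !array_key_exists($key, $value)) {
            return null;
        }

        $value = $value[$key];
    }

    return $value;
}

$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
    <title>Job</title>
    <script type="application/ld+json">
    {
        "@context": "https://schema.org",
        "@type": "JobPosting",
        "title": "PHP Developer",
        "description": "Build crawlers.",
        "hiringOrganization": {"@type": "Organization", "name": "Example Inc."}
    }
    </script>
</head>
<body></body>
</html>
HTML;

$jobs = schemaOrgObjects($html, 'JobPosting');
$company = getByDotNotation($jobs[0], 'hiringOrganization.name');

echo $company, PHP_EOL;
```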

CSS selector first(), last(), nth(), even() and odd() methods

CSS has selectors like :first-child, :nth-child(n), :last-child and so on, but they are easily misunderstood. For example, #main a:first-child does not simply select the first link inside the element with id="main". It selects a link inside that element only if the link is the first child of its parent. So when the first child element is, for example, a <div>, the selector won't match the link that follows it.
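
To make the difference concrete, here is a small sketch with PHP's DOM extension. PHP's DOMXPath doesn't understand CSS selectors, so the two queries are XPath expressions that I'm assuming to be equivalent to the CSS behavior described above; the HTML is made-up sample data.

```php
<?php

$html = <<<'HTML'
<div id="main">
    <div>Intro</div>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
</div>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);

$xpath = new DOMXPath($document);

// Roughly what "#main a:first-child" means: links inside #main
// that have no preceding sibling element.
$firstChildLinks = $xpath->query('//*[@id="main"]//a[not(preceding-sibling::*)]');

// What is usually intended: the first of all links found inside #main.
$firstLink = $xpath->query('(//*[@id="main"]//a)[1]')->item(0);

echo $firstChildLinks->length, PHP_EOL;
echo $firstLink->getAttribute('href'), PHP_EOL;
```

The first query finds nothing because the <div> comes before the links, while the second one returns the link to /first.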

Now you can solve this like so:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

Html::root()->extract([
    'firstLink' => Dom::cssSelector('#main a')->first(),
]);

Read more about this feature here.