Documentation for crwlr / crawler (v0.7)

Sitemap Steps

The Sitemap Protocol is a convenient way for site owners to give an overview of all the (important) pages of a website that are available for crawling. The Sitemap step class provides simple methods to get all sitemap URLs from a site's robots.txt file and to extract all the URLs from the sitemap files.

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

$crawler->input('https://www.example.com/')
    ->addStep(Sitemap::getSitemapsFromRobotsTxt())
    ->addStep(Http::get())
    ->addStep(Sitemap::getUrlsFromSitemap());

The Sitemap::getSitemapsFromRobotsTxt() step gets all the sitemap URLs listed in the site's robots.txt file. The initial input for that step can be any URL of the site; it doesn't have to be the home page.
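
For example, a deep page URL works just as well as the home page as initial input; a minimal sketch (the path is made up for illustration):

use Crwlr\Crawler\Steps\Sitemap;

// Any URL of the site works, because the step only needs the host
// to resolve https://www.example.com/robots.txt.
$crawler->input('https://www.example.com/some/deep/page')
    ->addStep(Sitemap::getSitemapsFromRobotsTxt());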

The example then uses the Http::get() step to load all the sitemaps it found, and the Sitemap::getUrlsFromSitemap() step then extracts all the URLs from those sitemaps.
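
Because the Sitemap::getUrlsFromSitemap() step outputs plain URLs, you could, for example, append another Http::get() step to load every page listed in the sitemaps. A minimal sketch:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

$crawler->input('https://www.example.com/')
    ->addStep(Sitemap::getSitemapsFromRobotsTxt()) // sitemap URLs from robots.txt
    ->addStep(Http::get())                         // load the sitemap files
    ->addStep(Sitemap::getUrlsFromSitemap())       // URLs of all listed pages
    ->addStep(Http::get());                        // load every listed page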

If you also want to get the additional data about the URLs that a sitemap can provide (lastmod, changefreq and priority), you can use the step's withData() method:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

$crawler->input('https://www.example.com/')
    ->addStep(Sitemap::getSitemapsFromRobotsTxt())
    ->addStep(Http::get())
    ->addStep(
        Sitemap::getUrlsFromSitemap()
            ->withData()
    );
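
The step then yields that data along with each URL. How you consume it depends on your crawler setup; a minimal sketch, assuming the last step's outputs end up in the crawler's results and that each result can be converted to an array:

foreach ($crawler->run() as $result) {
    // Assumption: with withData(), each result should contain the URL plus
    // the lastmod, changefreq and priority values found in the sitemap.
    var_dump($result->toArray());
}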