Crwlr Recipes: How to Scan any Website for schema.org Structured Data Objects

2023-11-16

This is the first article of our "Crwlr Recipes" series, a collection of thoroughly explained code examples for specific crawling and scraping use-cases. This first article describes how you can crawl a website fully (all of its pages) and extract the schema.org structured data objects from all of those pages, with just a few lines of code.

The Use-Case

The actual use-case behind this could be that you want to build a metasearch site, for instance for news articles, cooking recipes, job ads, real estate ads, or other things that site owners pretty commonly provide via schema.org structured data objects in their website's source code. I'd recommend this method (loading all pages of a website) only if you plan to use it with websites whose structure you don't know anything about. If you want to do this with sites whose structure you know, there are definitely more efficient ways.

If you haven't heard of schema.org structured data yet: on schema.org you can find a standardized catalog of entities (things like those already mentioned above: articles, recipes, ads, organizations, people, ...), defining which properties each entity should or can provide, how entities can relate to each other, and a lot more. Let's say you provide cooking recipes on your website: you probably want to add structured data, because it makes it possible for Google to show so-called rich snippets for your content. If you want to know more about it, here's a video diving deeper.
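
To give you an idea of what the crawler will look for, here is a minimal, hand-written example of how a JobPosting object could be embedded as JSON-LD in a page's source. The values are made up, and real-world markup usually carries many more properties:

<script type="application/ld+json">
{
    "@context": "https://schema.org",
    "@type": "JobPosting",
    "title": "Backend Developer (PHP)",
    "datePosted": "2023-11-01",
    "hiringOrganization": { "@type": "Organization", "name": "Example Corp" },
    "jobLocation": {
        "@type": "Place",
        "address": { "@type": "PostalAddress", "addressLocality": "Vienna", "addressCountry": "AT" }
    },
    "baseSalary": { "@type": "MonetaryAmount", "currency": "EUR", "value": 55000 },
    "description": "We are looking for a backend developer..."
}
</script>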

The Code

For our example, let's say we want to look for job ads (schema.org type JobPosting).

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
// use Crwlr\Crawler\Cache\FileCache; // import needed if you enable the response cache below

$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

// During development when you want to try things and re-run your crawler
// repeatedly after changes, you can use a response cache.
// See https://www.crwlr.software/packages/crawler/v1.3/response-cache 
//
// $crawler->getLoader()->setCache(
//     (new FileCache(__DIR__ . '/responsecache'))->ttl(86400) // time to live in seconds
// );

$crawler
    ->input('https://www.example.com')
    ->addStep(Http::crawl())
    ->addStep(
        Html::schemaOrg()
            ->onlyType('JobPosting')
            ->toArray()
    );

$crawler->runAndDump();
// This will dump the result data when you run the crawler on the command line.
// To save the data, use a Store. More on this further below.

And that's it. This crawler will follow all links it finds, starting from https://www.example.com, and extract all schema.org objects of type JobPosting from each page's source. There is a lot that you can optionally configure, but this code does the job for a single website.

The crawler loads page after page and prints found schema.org objects.

Customizing the Crawl Step

The crawl step has a lot of options for customization. In our example, it may be helpful to limit how many pages the crawler will load from one single website because some websites are huge and consist of millions of pages, and it would take an unreasonable amount of time for your crawler to load all of them. There are two different ways to achieve such a limitation.

Depth

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Http::crawl()
            ->depth(2)
    )
    // Add the Html::schemaOrg() step and so on...

The crawl depth defines how many levels deep the crawler follows the link tree. With a depth of 2, it follows the links from the example.com homepage and then all the links it finds on those first-level sub-pages. After that, the crawler stops and won't load any newly found links from the second level.

The actual number of pages loaded with this technique may vary a lot, because it depends on how many links there are per page. That's why you can alternatively define a fixed limit by using the maxOutputs() method.

Max Outputs

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Http::crawl()
            ->maxOutputs(500)
    )
    // Add the Html::schemaOrg() step and so on...

The maxOutputs() method works with any step included in the library. When used with the Http::crawl() step, it stops loading further pages as soon as the limit is reached.

Stick to Domain instead of Host

By default, the Http::crawl() step loads all links to URLs on the same host; in our example, the host is www.example.com. This means it would not follow links to a different subdomain, like jobs.example.com. If you want to follow all links on the same domain (any subdomain of example.com, or example.com itself as the host), you can call the step's sameDomain() method.

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Http::crawl()
            ->maxOutputs(500)
            ->sameDomain()
    )
    // Add the Html::schemaOrg() step and so on...

There are more customization options for the crawl step, but these are probably the most useful for our use case.
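
One more option that can be handy for a job-ads crawler: if you already know that all postings live under a certain URL path, you can narrow the crawl down to it. The pathStartsWith() method used below is listed in the crawl step's documentation; treat this as a sketch and double-check the method against the library version you're using.

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Http::crawl()
            ->maxOutputs(500)
            // Assumption: only follow URLs whose path starts with /jobs.
            ->pathStartsWith('/jobs')
    )
    // Add the Html::schemaOrg() step and so on...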

Customizing the schema.org Step

There's not a lot to customize with this step. As you can see in the example, we choose to get only objects of type JobPosting. If you want, you can also tell it to use only certain properties from the JobPosting objects.

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Http::crawl()
            ->maxOutputs(500)
            ->sameDomain()
    )
    ->addStep(
        Html::schemaOrg()
            ->onlyType('JobPosting')
            ->extract([
                'title',
                'location' => 'jobLocation.address.addressLocality',
                'salary' => 'baseSalary',
                'description',
            ])
    );

The extract() method takes an array of property names to pick from the extracted schema.org objects. As demonstrated, you can get nested properties using dot notation (also see the JSON steps), and by using string keys you can map the property names to the keys that you'd like to have in your crawling result.
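
With the extract() call from above, the result for a single job posting could then look roughly like this. The values are made up, and the exact shape of nested properties like baseSalary depends on the markup of the crawled page:

[
    'title' => 'Backend Developer (PHP)',
    'location' => 'Vienna',
    'salary' => [
        '@type' => 'MonetaryAmount',
        'currency' => 'EUR',
        'value' => 55000,
    ],
    'description' => 'We are looking for a backend developer...',
]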

Storing the Scraped Data

In the example, we used $crawler->runAndDump(). With that method, the crawler just prints out the results. But of course, we'd like to store the extracted data somewhere. For this reason, the package comes with a concept called Stores. Stores are classes that implement the Crwlr\Crawler\Stores\StoreInterface (or better: extend the Crwlr\Crawler\Stores\Store class), so you can always easily build your own implementation. Just implement a store() method that receives a Crwlr\Crawler\Result object as its only argument. After adding your store to your crawler, it is automatically called with each crawling result (in our case, with each job posting).
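
To illustrate the idea, here is a minimal sketch of a custom store. The insertJobPosting() function is a hypothetical placeholder for your own persistence logic, and the call to $result->toArray() assumes the Result class hands you the result data as an associative array; check the Result class for the exact API.

use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class JobPostingStore extends Store
{
    public function store(Result $result): void
    {
        // Assumption: toArray() returns the result data as an associative array
        // (title, location, salary, description in our example).
        $jobPosting = $result->toArray();

        // Hypothetical placeholder for your own persistence logic
        // (database insert, message queue, API call, ...).
        insertJobPosting($jobPosting);
    }
}

// Add it to the crawler like any other store:
// $crawler->setStore(new JobPostingStore());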

The library ships with two very simple and ready-to-use store implementations: the JsonFileStore and the SimpleCsvFileStore. Let's use the JsonFileStore to store our crawling results in a .json file.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Stores\JsonFileStore;

$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$store = new JsonFileStore(__DIR__ . '/results', 'example-com-jobs');

$crawler
    // Define input and add the necessary steps
    ->setStore($store);

$crawler->runAndTraverse();

For the JsonFileStore to know where it should store the generated files, you need to provide it with a path and, optionally, a file prefix. Also, when using a store, you run your crawler with the runAndTraverse() method instead of runAndDump(). And that's it.

The crawler creates a .json file containing the scraped data when the JsonFileStore is used.

The Final Code

Now the full example with all the improvements looks like this:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Stores\JsonFileStore;

$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$store = new JsonFileStore(__DIR__ . '/results', 'example-com-jobs');

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Http::crawl()
            ->maxOutputs(500)
            ->sameDomain()
    )
    ->addStep(
        Html::schemaOrg()
            ->onlyType('JobPosting')
            ->extract([
                'title',
                'location' => 'jobLocation.address.addressLocality',
                'salary' => 'baseSalary',
                'description',
            ])
    )
    ->setStore($store);

$crawler->runAndTraverse();