What's new in crwlr / crawler v0.6?
Version 0.6 is probably the biggest update so far, with a lot of new features and steps: from crawling whole websites and working with sitemaps, to extracting metadata and schema.org structured data from HTML. Here is an overview of all the new stuff.
ℹ️ First, an important note if you're already using the library:
0.x versions are still "development versions" and can potentially contain changes that break backwards compatibility. I try to avoid them and there won't be many, but this version contains one breaking change:
`PoliteHttpLoader` (and the traits `WaitPolitely` and `CheckRobotsTxt`) have been removed. The politeness features are now baked into (dependencies of) the `HttpLoader`. Throttling (`WaitPolitely`) is done by default, but you can configure it, and your crawler loads and respects `robots.txt` files depending on whether you're using a `BotUserAgent`. More on this on the new Documentation Page about Politeness.
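For illustration, here's a minimal sketch of a crawler that identifies as a bot, so the loader fetches and respects `robots.txt` (class and method names as in the library's getting-started docs; treat the exact signatures as assumptions):

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        // With a BotUserAgent, the loader checks robots.txt before loading URLs.
        return new BotUserAgent('MyBot');
    }
}
```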
Crawling whole Websites
The new `Http::crawl()` step allows you to easily crawl whole websites, and it also has a lot of options, like:
- Only crawl to a certain depth.
- Start from a sitemap.
- Stay on the same domain or on the same host.
- Only load URLs with certain paths.
In this example, you start with a sitemap and load only URLs with a path starting with `/foo/`:
```php
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(
        Http::crawl()
            ->inputIsSitemap()
            ->pathStartsWith('/foo/')
    );
```
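And a sketch of some of the other options from the list above (assuming `maxDepth()` and `sameHost()` are the corresponding option methods):

```php
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->maxDepth(2)  // assumed method name: follow links only two levels deep
            ->sameHost()   // assumed method name: don't leave www.example.com
    );
```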
You can read more about this feature here.
New Sitemap Steps
There are two new steps to work with sitemaps: one gets all the sitemap URLs listed in the `robots.txt` file of a website, and the other gets all the URLs (optionally also with additional data like priority) listed in a sitemap.
```php
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

$crawler->input('https://www.example.com/')
    ->addStep(Sitemap::getSitemapsFromRobotsTxt())
    ->addStep(Http::get())
    ->addStep(Sitemap::getUrlsFromSitemap());
```
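To also get the additional data per URL, the step can be configured accordingly; a sketch, assuming a `withData()` method (treat the method name as an assumption):

```php
use Crwlr\Crawler\Steps\Sitemap;

// Assumed method name: also yield data like priority along with each URL.
$crawler->addStep(Sitemap::getUrlsFromSitemap()->withData());
```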
Extracting Metadata and schema.org structured data from HTML documents
There are two new HTML steps to easily extract Metadata and schema.org structured data (in JSON-LD format) from HTML documents:
```php
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Html;

$crawler->input('https://www.crwlr.software/')
    ->addStep(Http::get())
    ->addStep(
        Html::metaData()
            ->only(['title', 'description', 'og:image'])
    );
```
This step gets you all the data from `<meta>` tags that have a `name` or `property` attribute, and also the title from the `<title>` tag. Read more here.
Extracting schema.org structured data
The `Html::schemaOrg()` step gets you the schema.org objects contained in a document. You can restrict it to certain types and pick only the properties you need, using dot notation for nested properties:

```php
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Html;

$crawler->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        Html::schemaOrg()
            ->onlyType('JobPosting')
            ->extract([
                'title',
                'description',
                'company' => 'hiringOrganization.name',
            ])
    );
```
CSS selector first(), last(), nth(), even() and odd() methods
CSS has selectors like `:first-child`, `:last-child` and so on, but they are easily misunderstood. For example: `#main a:first-child` is not just the first link inside the element with `id="main"`, but the first child element inside that element, and only if it is a link. So, when the first child element inside that element is, for example, a `<div>`, the selector won't match anything.
Now you can solve this like this:
```php
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

Html::root()->extract([
    'firstLink' => Dom::cssSelector('#main a')->first(),
]);
```
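The `last()`, `nth()`, `even()` and `odd()` methods from the headline work the same way; here's a sketch (assuming `nth()` takes a one-based position, which is an assumption):

```php
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

Html::root()->extract([
    'lastLink' => Dom::cssSelector('#main a')->last(),
    'thirdLink' => Dom::cssSelector('#main a')->nth(3), // assumed: 1-based position
    'evenLinks' => Dom::cssSelector('#main a')->even(),
    'oddLinks' => Dom::cssSelector('#main a')->odd(),
]);
```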