Release of crwlr/crawler v0.1.0

2022-04-18

After months of hard work, today I'm finally releasing the first version (v0.1.0) of the crwlr/crawler package. Here is some information on what it is, its current state, and its current and planned features.

What it is

If you're wondering what crawling and scraping actually are, or how the two differ, there is an explanation in the package docs.

Building crawlers or scrapers in PHP can involve a lot of boilerplate work. The package provides a solid foundation, so you can build such programs faster. With the central Crawler class you build crawlers/scrapers in a modular way out of so-called steps. The package ships with some common, ready-to-use steps, and you can also write your own.
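To make the idea of steps a bit more tangible, here is a minimal sketch in plain PHP. It is not how the package implements it; the function `runSteps` and the two example steps are made up for this illustration. Each step takes one input and yields any number of outputs, and the outputs of one step become the inputs of the next.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of crwlr/crawler): run every input
 * through a chain of steps. A step is a callable that takes one input
 * and returns an iterable of zero or more outputs.
 */
function runSteps(iterable $inputs, array $steps): Generator
{
    $step = array_shift($steps);

    if ($step === null) {
        // No steps left: whatever arrives here is a final result.
        foreach ($inputs as $input) {
            yield $input;
        }

        return;
    }

    foreach ($inputs as $input) {
        // Feed each output of this step into the remaining steps.
        foreach (runSteps($step($input), $steps) as $output) {
            yield $output;
        }
    }
}

// Two toy steps: one input can fan out into several outputs.
$splitWords = function (string $line): Generator {
    foreach (explode(' ', $line) as $word) {
        yield $word;
    }
};

$toUpper = fn (string $word): array => [strtoupper($word)];

$results = iterator_to_array(
    runSteps(['hello world', 'foo'], [$splitWords, $toUpper]),
    false // ignore generator keys, just collect the values
);

print_r($results);
```

Because everything is a generator, results come out one by one, which is the same reason you can loop over a crawler's results with a plain foreach.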

A simple example of how to use it

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

$crawler = new MyCrawler();

$crawler->input('https://www.example.com/articles')     // Initial input to start with
    ->addStep(Http::get())                              // Load the listing page
    ->addStep(Html::getLinks('#artList .article a'))    // Get the links to the articles
    ->addStep(Http::get())                              // Load the article pages
    ->addStep(
        Html::first('article')                          // Extract the data
            ->extract([
                'title' => 'h1',
                'date' => '.date',
                'author' => '.articleAuthor'
            ])
            ->addKeysToResult()
    );

foreach ($crawler->run() as $result) {
    // Do something with the Results
}

The current state

There is still a lot to do, and this is a 0.x version in the semver sense: its API can still see bigger, breaking changes until v1, so keep that in mind. The library is also the foundation for the crawling/scraping tool that I'm working on at crwl.io, so I'll build a lot of crawlers and scrapers with it and improve it along the way. I plan to reach a stable v1 release in the second half of the year.

I believe a well-tested codebase is important for good software, so there are already about 370 unit tests, and I've started adding the first integration tests, which start a simple PHP web server for that purpose. I'll keep adding integration test examples as I build new crawlers with the library.

What's already included

- A few ready-to-use steps for
  - HTTP requests
  - extracting data from
    - HTML
    - XML
    - JSON
    - CSV
- Logging: The crawler takes any implementation of the PSR-3 LoggerInterface, and the included steps log information about what they are doing. There is also a simple CliLogger that just echoes the log messages for CLI usage.
- Stores for convenient handling of the final result objects.
- A response cache to cache responses while working on a crawler (still a bit half-baked).
- Using an actual bot user agent and, if you want to, sticking to the rules defined in a robots.txt file.
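To give a rough idea of what sticking to robots.txt rules means, here is a deliberately simplified sketch in plain PHP (8.0 or newer). It is not the package's implementation, and the functions `disallowedPaths` and `isAllowed` are hypothetical names for this example. It only handles `User-agent` and `Disallow` lines with plain path prefixes and matches the user agent name exactly; `Allow` rules and wildcards are left out.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: collect the Disallow path prefixes that apply
 * to a user agent. A group naming the agent wins over the "*" group.
 *
 * @return string[]
 */
function disallowedPaths(string $robotsTxt, string $userAgent): array
{
    $groups = [];          // lower-cased agent name => disallowed prefixes
    $currentAgents = [];   // agents of the group currently being read
    $lastWasAgent = false; // consecutive User-agent lines share one group

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments

        if ($line === '' || !str_contains($line, ':')) {
            continue;
        }

        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$lastWasAgent) {
                $currentAgents = []; // a new group starts here
            }

            $agent = strtolower($value);
            $currentAgents[] = $agent;
            $groups[$agent] ??= [];
            $lastWasAgent = true;
            continue;
        }

        $lastWasAgent = false;

        // An empty Disallow value means "nothing is disallowed".
        if ($field === 'disallow' && $value !== '') {
            foreach ($currentAgents as $agent) {
                $groups[$agent][] = $value;
            }
        }
    }

    return $groups[strtolower($userAgent)] ?? $groups['*'] ?? [];
}

/** Hypothetical helper: a path is allowed if no disallowed prefix matches. */
function isAllowed(string $path, array $disallowed): bool
{
    foreach ($disallowed as $prefix) {
        if (str_starts_with($path, $prefix)) {
            return false;
        }
    }

    return true;
}

$robotsTxt = <<<TXT
User-agent: *
Disallow: /private/

User-agent: MyBot
Disallow: /admin/   # keep bots out
Disallow: /tmp/
TXT;

$rules = disallowedPaths($robotsTxt, 'MyBot');

var_dump(isAllowed('/articles/1', $rules));
var_dump(isAllowed('/admin/users', $rules));
```

This is also why an honest bot user agent matters: site owners can only write rules for your bot if it identifies itself by name.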

What's already planned for future releases

- Getting the HTML source after running JavaScript. The plan is to use chrome-php/chrome for this.
- Steps for simple string manipulation.
- Async HTTP requests.
- More loaders: FTP, filesystem, maybe SOAP, ...
- Reading schema.org structured data from HTML documents.
- ...

I'd be very happy if you tried it out, and even happier if you told me what you think about it. And of course, please let me know if you discover any bugs or have trouble understanding something. If you like it, consider starring it on GitHub ;)