PHP Composer Packages for Crawler and Scraper Development
crwlr.software is a collection of open-source PHP Composer packages that provide the tools you need to build web crawlers and scrapers. The crawler package bundles everything and helps you build crawlers as quickly as possible. There are also sub-packages that you can use standalone.
The crawler package is the main package of this collection. It provides a lightweight framework and many ready-to-use building blocks, called steps, that you can combine to build your own web crawlers and scrapers.
The url package is the Swiss Army knife for URLs. It parses URLs into their components (scheme, host, domain, path, ...). You can access and modify URL components, compare components of different URLs, and resolve relative URLs to absolute ones. It also supports internationalized domain names.
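To illustrate what parsing a URL into components and resolving a relative reference means, here is a minimal sketch that uses only PHP built-ins. It does not use the url package's API. The helper name `resolveSimple` and the example.com URLs are assumptions made for this illustration, and the helper deliberately ignores ports, "../" segments and protocol-relative links.

```php
<?php
// Sketch with PHP built-ins only; this is NOT the url package's API.

$parts = parse_url('https://www.example.com/articles/index.html?page=2#top');

// parse_url() returns the components as array keys.
$scheme = $parts['scheme'];   // "https"
$host   = $parts['host'];     // "www.example.com"
$path   = $parts['path'];     // "/articles/index.html"

// Hypothetical helper: resolves root-relative and document-relative paths.
// Simplification: no port, no "../" handling, no protocol-relative links.
function resolveSimple(string $base, string $relative): string
{
    // A reference that already has a scheme is absolute.
    if (is_string(parse_url($relative, PHP_URL_SCHEME))) {
        return $relative;
    }

    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host'];

    // Root-relative: append directly to the origin.
    if ($relative !== '' && $relative[0] === '/') {
        return $origin . $relative;
    }

    // Document-relative: replace the last path segment of the base.
    $dir = preg_replace('~[^/]*$~', '', $b['path'] ?? '/');

    return $origin . $dir . $relative;
}

$base = 'https://www.example.com/articles/index.html?page=2#top';
$rootRelative = resolveSimple($base, '/about');        // https://www.example.com/about
$docRelative  = resolveSimple($base, 'contact.html');  // https://www.example.com/articles/contact.html
```

A dedicated URL library adds the parts this sketch leaves out, such as dot-segment handling, registrable domain detection and internationalized domain names.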
The query-string library provides a convenient API to create, access, and manipulate query strings used in HTTP GET requests (as part of the URL) or POST requests (as part of the body).
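For comparison, this is roughly what creating, accessing and manipulating a query string looks like with PHP's built-in functions alone. It is a sketch and not the query-string package's API; the parameter names (`q`, `page`, `tags`) are made up for the example.

```php
<?php
// Sketch with PHP built-ins only; this is NOT the query-string package's API.

$query = 'q=php+crawler&page=2&tags[]=scraping&tags[]=http';

// Access: parse the string into an associative array.
parse_str($query, $params);
$searchTerm = $params['q'];     // "php crawler"
$tags       = $params['tags'];  // ["scraping", "http"]

// Manipulate: change one value and drop another.
$params['page'] = 3;
unset($params['tags']);

// Create: build a query string again from the array.
$newQuery = http_build_query($params);  // "q=php+crawler&page=3"
```

The same string can be appended to a URL for a GET request or sent as the body of a form-encoded POST request.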
Use this library in crawler and scraper programs to parse robots.txt files and check whether your crawler's user agent is allowed to load certain paths.
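The kind of check described here can be sketched in plain PHP as follows. This is a simplified illustration and not the library's API: the function name `isPathAllowed` is hypothetical, rules are matched as plain path prefixes (no `*` or `$` wildcards), and a user agent matches a group only by exact, case-insensitive name, falling back to the `*` group.

```php
<?php
// Simplified sketch, NOT the library's API: prefix rules only, no wildcards.

function isPathAllowed(string $robotsTxt, string $userAgent, string $path): bool
{
    $groups = [];          // lowercase agent name => list of [type, prefix]
    $agents = [];          // agents of the group currently being read
    $readingRules = false; // true once the current group has seen a rule

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($readingRules) { // a user-agent line after rules starts a new group
                $agents = [];
                $readingRules = false;
            }
            $agents[] = strtolower($value);
        } elseif ($field === 'allow' || $field === 'disallow') {
            $readingRules = true;
            foreach ($agents as $agent) {
                $groups[$agent][] = [$field, $value];
            }
        }
    }

    // Prefer the group that names this agent, otherwise use "*".
    $rules = $groups[strtolower($userAgent)] ?? $groups['*'] ?? [];

    $allowed = true;  // no matching rule means the path is allowed
    $bestLength = -1;
    foreach ($rules as [$type, $prefix]) {
        if ($prefix === '' || strpos($path, $prefix) !== 0) {
            continue; // empty value or no prefix match
        }
        $length = strlen($prefix);
        // The longest matching rule wins; on a tie, allow wins.
        if ($length > $bestLength || ($length === $bestLength && $type === 'allow')) {
            $bestLength = $length;
            $allowed = ($type === 'allow');
        }
    }

    return $allowed;
}

$robots = "User-agent: *\nDisallow: /private/\nAllow: /private/public/\n\n"
        . "User-agent: BadBot\nDisallow: /\n";

$blogAllowed    = isPathAllowed($robots, 'MyCrawler', '/blog/post');    // true
$privateAllowed = isPathAllowed($robots, 'MyCrawler', '/private/data'); // false
```

A full parser also has to handle wildcard patterns, product-token matching of user agents and sitemap lines, which is what a dedicated library is for.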
Latest Blog Posts
Version 0.6 is probably the biggest update so far, with a lot of new features and steps, from crawling whole websites and sitemaps to extracting metadata and schema.org structured data from HTML. Here is an overview of all the new stuff.
We're already at v0.5 of the crawler package, and this version comes with a lot of new features and improvements. Here's a quick overview of what's new.
There is a new package in town called query-string. It lets you create, access, and manipulate query strings for HTTP requests in a very convenient way. Here's a quick overview of what you can do with it and how it can be used via the url package.
Last Friday, version 0.4 of the crawler package was released with some pretty useful improvements. Read what's shipped with this new minor update.
There are already two new 0.x versions of the crawler package. Here is a quick summary of what's new in versions 0.2 and 0.3.
After months of hard work, today I'm finally releasing the first version (v0.1.0) of the crwlr / crawler package. Here is some information on what it is, its current state, and its current and future features.
Homograph attacks use internationalized domain names (IDN) to build malicious links with domains that look like those of trusted organizations. You can use the crwlr Url class to detect and monitor URLs containing IDNs in your users' input.