Documentation for crwlr / html-2-text (v0.1)

Building Custom Node Converters

Adding a Custom Node Converter

If you don't like how the package converts certain tags by default, you can customize it by building your own node converter.

use Crwlr\Html2Text\Aggregates\DomNodeAndPrecedingText;
use Crwlr\Html2Text\NodeConverters\AbstractBlockElementWithDefaultMarginConverter;

class UnorderedListConverter extends AbstractBlockElementWithDefaultMarginConverter
{
    public function nodeName(): string
    {
        return 'ul';
    }

    public function convert(DomNodeAndPrecedingText $node): string
    {
        // The DomNodeAndPrecedingText object is an aggregate, containing the actual
        // DOMNode object of the node that shall be converted...
        $domNode = $node->node;

        // ...and the preceding text. In most cases you won't need the preceding text,
        // but in some cases it can be helpful to decide spacing. Mostly you'll just
        // need to pass it to the addSpacingBeforeAndAfter() method. See further below.
        $precedingText = $node->precedingText;

        // Your code to convert the node to text.
        $text = ...;

        return $text;
    }
}

This demo class extends the AbstractBlockElementWithDefaultMarginConverter. There are three distinct abstract parent classes available for use:

  • AbstractBlockElementWithDefaultMarginConverter for block elements typically with a default margin (top and bottom) in the browser (e.g. p, ul).
  • AbstractBlockElementConverter for block elements without a default margin in the browser (e.g. div).
  • AbstractInlineElementConverter for inline elements (e.g. span, strong, ...).

Those three parent converter classes come with a method addSpacingBeforeAndAfter(). You can utilize this method to automatically incorporate the necessary line breaks before and after the text returned by your custom converter. See below.

Once your custom node converter is ready, you can integrate it into a converter instance (Html2Text).

use Crwlr\Html2Text\Html2Text;

$converter = new Html2Text();

$converter->addConverter(new UnorderedListConverter());

$converter->convertHtmlToText($html); // non-static version of Html2Text::convert($html).
                                      // This will now use your custom node converter for <ul> tags.

Automatically Adding Spacing Based on Preceding Text

use Crwlr\Html2Text\Aggregates\DomNodeAndPrecedingText;
use Crwlr\Html2Text\NodeConverters\AbstractBlockElementWithDefaultMarginConverter;

class UnorderedListConverter extends AbstractBlockElementWithDefaultMarginConverter
{
    public function nodeName(): string
    {
        return 'ul';
    }

    public function convert(DomNodeAndPrecedingText $node): string
    {
        $text = ...;

        return $this->addSpacingBeforeAndAfter($text, $node->precedingText);
    }
}

The addSpacingBeforeAndAfter() method is implemented differently in each of the three parent node converter classes. Let's compare the return values. In each case the method is called with "World!" as the first argument and "Hello" as the preceding text.

When called in a class extending AbstractBlockElementWithDefaultMarginConverter, the return value is:

string(10) "

World!

"

When called in a class extending AbstractBlockElementConverter:

string(8) "
World!
"

And in a class extending AbstractInlineElementConverter:

string(6) "World!"
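
The lengths reported by var_dump() above include the line breaks added around the text. As a quick check, the following sketch rebuilds the three values as plain PHP strings. They are written by hand for illustration and are not produced by the package here.

```php
<?php

// Plain PHP strings mirroring the three return values shown above.
$withDefaultMargin = "\n\nWorld!\n\n"; // two line breaks before and after
$blockElement = "\nWorld!\n";          // one line break before and after
$inlineElement = "World!";             // no line breaks added

// strlen() counts each "\n" as one byte:
// 6 characters for "World!" plus 4, 2 or 0 line breaks.
echo strlen($withDefaultMargin), PHP_EOL; // 10
echo strlen($blockElement), PHP_EOL;      // 8
echo strlen($inlineElement), PHP_EOL;     // 6
```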

Handling Child Nodes Conversion

Most nodes can contain child nodes, and your converter has to deal with them. Each child node has its own node converter (either a custom one or a simple fallback block/inline element converter), so let the main converter (the Html2Text class) convert child nodes to text for you. There are two approaches for different scenarios:

  • If you are building a converter for a simple wrapper element (e.g. <div>, <blockquote>) without the need to manually handle child nodes, use $this->getNodeText().
  • For more complex elements involving special child elements (e.g. <table> => <tr> => <td> or <ul> => <li>) use $this->getConverter()->getTextFrom() to separately get converted text for any child node of the main node.

Refer to the demo below for a clearer distinction between the two methods.

Getting Converted Text for All Child Nodes of the Main Node

As mentioned above, when you are building a node converter for a simple wrapper element (such as <div> or <blockquote>) and don't need to handle specific child nodes manually (as you would with <tr> and <td> within a <table>), use $this->getNodeText() in your custom node converter class. This method returns the properly converted text for all child nodes of the node being converted.

use Crwlr\Html2Text\Aggregates\DomNodeAndPrecedingText;
use Crwlr\Html2Text\NodeConverters\AbstractBlockElementWithDefaultMarginConverter;

class BlockquoteConverter extends AbstractBlockElementWithDefaultMarginConverter
{
    public function nodeName(): string
    {
        return 'blockquote';
    }

    public function convert(DomNodeAndPrecedingText $node): string
    {
        // For details about the indent() method, see further below.
        $addText = $this->indent($this->getNodeText($node)); 

        return $this->addSpacingBeforeAndAfter($addText, $node->precedingText);
    }
}

Getting Converted Text for Single Child Nodes of the Main Node

When you need to handle specific child nodes yourself and therefore iterate through the child nodes of the main node manually, use $this->getConverter()->getTextFrom() to get the correctly converted text for any other node.

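use Crwlr\Html2Text\Aggregates\DomNodeAndPrecedingText;
use Crwlr\Html2Text\NodeConverters\AbstractBlockElementConverter;

class SomeNodeConverter extends AbstractBlockElementConverter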
{
    public function nodeName(): string
    {
        return 'some';
    }

    public function convert(DomNodeAndPrecedingText $node): string
    {
        $text = '';

        foreach ($node->node->childNodes as $childNode) {
            if ($childNode->nodeName === 'somechild') {
                // Some special child node handling.

                // If you again need to get the converted text for the children of this child node,
                // you can use $this->getNodeText(new DomNodeAndPrecedingText($childNode, $text)).
            } else {
                $text .= $this->getConverter()->getTextFrom($childNode, $text);
            }
        }

        return $text;
    }
}

The getConverter() method returns the parent Html2Text instance associated with your node converter. Html2Text::getTextFrom() can be called with any DOMNode or DOMNodeList. It contains the core functionality that also runs when you call Html2Text::convert().

To further illustrate the distinction between these two methods, consider the following example.
Your converter receives a node like this:

<div>
    <a href="https://www.example.com">Some <strong>Link</strong></a>
</div>

And this is the convert() method of your node converter class:

public function convert(DomNodeAndPrecedingText $node): string
{
    $text = '';

    foreach ($node->node->childNodes as $childNode) {
        $domNodeAndPrecedingText = new DomNodeAndPrecedingText($childNode, $text);

        var_dump($this->getNodeText($domNodeAndPrecedingText));

        var_dump($this->getConverter()->getTextFrom($childNode, $text));
    }

    return $text;
}

The two var_dump() calls dump this:

string(9) "Some LINK"
string(36) "[Some LINK](https://www.example.com)"

As you can see:

  • $this->getNodeText() only converts the children.
  • $this->getConverter()->getTextFrom() also converts the element itself.

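Do not call $this->getConverter()->getTextFrom() with the main node itself (the node that the convert() method of your converter is called with). Because getTextFrom() also converts the element itself, it would call your converter again, and you would end up with an infinite loop.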

Indentation

You might have read that you can configure the default indentation size used when indenting text (e.g. when converting <ul>, <ol>, <blockquote> or <dl> elements). The AbstractNodeConverter is equipped with a handy method that automatically manages text indentation based on the configured indentation size for you.

use Crwlr\Html2Text\Aggregates\DomNodeAndPrecedingText;
use Crwlr\Html2Text\NodeConverters\AbstractBlockElementWithDefaultMarginConverter;

class BlockquoteConverter extends AbstractBlockElementWithDefaultMarginConverter
{
    public function nodeName(): string
    {
        return 'blockquote';
    }

    public function convert(DomNodeAndPrecedingText $node): string
    {
        $addText = $this->indent($this->getNodeText($node));

        return $this->addSpacingBeforeAndAfter($addText, $node->precedingText);
    }
}

This method indents each line of the text provided as the first argument with the number of spaces configured as the indentation size (default = 2).

Additionally, you can specify the indentation level as the second argument:

$this->indent($this->getNodeText($node), 3);

This means that with an indentation size of 2, the method indents the text by 6 space characters (3 levels × 2 spaces). This is mainly useful for nested lists (<ul> or <ol>).
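
To make the relationship between indentation size and level concrete, here is a minimal standalone sketch. The function indentSketch() and its $size parameter are hypothetical and only mimic the behavior described above (every line is prefixed with size × level spaces). It is not the package's actual implementation, which may treat edge cases such as empty lines differently.

```php
<?php

/**
 * Hypothetical stand-in for the described behavior: prefix every line of
 * $text with ($size * $level) space characters.
 */
function indentSketch(string $text, int $level = 1, int $size = 2): string
{
    $prefix = str_repeat(' ', $size * $level);

    // Split into lines, prefix each one, and join them again.
    $lines = explode("\n", $text);

    return implode("\n", array_map(
        fn (string $line): string => $prefix . $line,
        $lines
    ));
}

// Level 1 with the default size of 2: each line starts with 2 spaces.
echo indentSketch("First line\nSecond line"), PHP_EOL;

// Level 3 with size 2: the line starts with 6 spaces.
echo indentSketch("Nested item", 3), PHP_EOL;
```

In a nested list, a converter could pass the current nesting depth as the level, so that each deeper list moves further to the right by one indentation size.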