Searching HTML the easy way

By Jean Hertel, 3/21/17

symfony, silex, domcrawler, xpath, cssselector

You have almost certainly had to take some HTML and extract content from it with PHP. Whenever you need to do this, the most common solution is to use regular expressions and string functions like strpos. Some developers try to keep the code clean by using classes like DOMDocument, but end up going back to regular expressions because of poorly formed HTML. Others venture into the world of XPath, but run into the same difficulties they had with regular expressions.

Fortunately, there is a much better and simpler solution: Symfony's DomCrawler component. This component lets you easily traverse the HTML, searching for elements or text. If your HTML is poorly formed, it will do everything possible to correct it so that you can still search it. To make things even better, you can add the CssSelector component, which lets you write CSS selectors and translates them into XPath queries.

Let’s see an example in action.

<?php
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
        <div>
            <span>Hellouuww</span>
            <span customAttribute="monster">Not too quick!</span>
            <p id="ninja">A ninja paragraph!</p>
        </div>
    </body>
</html>
HTML;

/* Create the crawler */
$crawler = new Crawler($html);

/* Search the ID ninja */
$subCrawler = $crawler->filter('#ninja');

foreach($subCrawler as $domElement) {
    var_dump($domElement->nodeName);
    var_dump($domElement->nodeValue);
}

$subCrawler2 = $crawler->filter('body > p');

foreach($subCrawler2 as $domElement) {
    var_dump($domElement->nodeValue);
}

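Note that queries are very simple using CSS selectors. Just like in the jQuery library, every query returns a collection of matching nodes. Here that collection is a new Crawler object that you can iterate over with foreach, rather than a plain array.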
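To see what a query actually gives back, it can help to inspect the result. The following is a minimal sketch, assuming both symfony/dom-crawler and symfony/css-selector are installed through Composer; the selector .does-not-exist is made up for illustration.

```php
<?php
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
        <div>
            <p id="ninja">A ninja paragraph!</p>
        </div>
    </body>
</html>
HTML;

$crawler = new Crawler($html);

/* filter() hands back another Crawler wrapping the matched nodes */
$paragraphs = $crawler->filter('p');
$isCrawler = $paragraphs instanceof Crawler;

/* Crawler is countable, so count() reports how many nodes matched */
$total = count($paragraphs);

/* each() runs a callback per node and collects the return values in an array */
$texts = $paragraphs->each(function (Crawler $node) {
    return $node->text();
});

/* A selector that matches nothing yields an empty Crawler (hypothetical class name) */
$missing = $crawler->filter('.does-not-exist')->count();

var_dump($isCrawler, $total, $texts, $missing);
```

Checking count() before calling text() or attr() is a reasonable habit, since those methods throw an exception when the node list is empty.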

There are several useful methods, such as children, first, last, siblings, and parents (renamed to ancestors in recent Symfony versions). The cool thing here is that most PHP developers have already worked with jQuery, so the methods will seem very familiar.
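A few of these traversal methods can be sketched as follows, assuming the same Composer setup as the other examples and a trimmed-down version of the sample HTML. The variable names are only illustrative.

```php
<?php
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <div>
            <span>Hellouuww</span>
            <span class="monster">Not too quick!</span>
            <p id="ninja">A ninja paragraph!</p>
        </div>
    </body>
</html>
HTML;

$crawler = new Crawler($html);
$div = $crawler->filter('div');

/* children() returns the element children of the div: two spans and one p */
$childCount = $div->children()->count();

/* first() and last() narrow a node list down to a single node */
$firstText = $div->children()->first()->text();
$lastText = $div->children()->last()->text();

/* siblings() returns the other elements at the same level as #ninja */
$siblingNames = $crawler->filter('#ninja')->siblings()->each(
    function (Crawler $node) {
        return $node->nodeName();
    }
);

var_dump($childCount, $firstText, $lastText, $siblingNames);
```

Each call returns a new Crawler, so the methods can be chained much like jQuery calls.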

To conclude, let’s read the class attribute of the second span, assuming its value changes dynamically and is not known in advance.

<?php
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
        <div>
            <span>Hellouuww</span>
            <span class="monster">Not too quick!</span>
            <p id="ninja">A ninja paragraph!</p>
        </div>
    </body>
</html>
HTML;

$crawler = new Crawler($html);

/* Take the last span and read its class attribute */
$unknownClass = $crawler->filter('span')->last()->attr('class');

var_dump($unknownClass);

If you know XPath well, you can call the filterXPath method directly, avoiding the overhead of translating the CSS selector to XPath.
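As a rough sketch of the two approaches side by side, the snippet below runs the same lookup with filter and with filterXPath, then uses the CssSelectorConverter class from the CssSelector component to show the XPath generated for a CSS selector. It assumes both components are installed through Composer; the XPath expression //p[@id="ninja"] is just one hand-written way to express the query.

```php
<?php
require "vendor/autoload.php";

use Symfony\Component\CssSelector\CssSelectorConverter;
use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <div>
            <p id="ninja">A ninja paragraph!</p>
        </div>
    </body>
</html>
HTML;

$crawler = new Crawler($html);

/* CSS selector: translated to XPath behind the scenes */
$byCss = $crawler->filter('#ninja')->text();

/* Hand-written XPath: no translation step */
$byXPath = $crawler->filterXPath('//p[@id="ninja"]')->text();

/* Inspect the XPath expression generated for a CSS selector */
$converter = new CssSelectorConverter();
$generatedXPath = $converter->toXPath('#ninja');

var_dump($byCss, $byXPath, $generatedXPath);
```

Printing the generated expression is also a handy way to learn XPath from selectors you already know.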