How to parse HTML/XML and extract information

Published on : April 27, 2026


There are many ways to parse HTML or XML and extract information from it. This post describes one of them: the Simple HTML DOM Parser library for PHP.

DOM stands for Document Object Model. It represents an HTML or XML document as a tree of nodes, and the DOM API lets you read and modify that tree.
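To make the idea of a node tree concrete, here is a minimal sketch that reads and modifies a small XML document. It assumes PHP's bundled DOM extension (`DOMDocument`) rather than Simple HTML DOM, and the XML snippet and variable names are made up for illustration.

```php
<?php
// Assumption: uses PHP's built-in DOM extension, not Simple HTML DOM.
// The XML below is a made-up sample.
$xml = '<books>'
     . '<book id="1"><title>First</title></book>'
     . '<book id="2"><title>Second</title></book>'
     . '</books>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// Read: walk every <book> node and collect its id and title text
$titles = array();
foreach ($doc->getElementsByTagName('book') as $book) {
    $title = $book->getElementsByTagName('title')->item(0);
    $titles[$book->getAttribute('id')] = $title->textContent;
}

// Modify: add an attribute to the first <book> node
$doc->getElementsByTagName('book')->item(0)->setAttribute('lang', 'en');

print_r($titles);
echo $doc->saveXML();
```

The same read-then-modify pattern is what the Simple HTML DOM examples in this post do, only with a shorter, selector-based syntax.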

Simple HTML DOM Parser requires PHP 5+ and has these features:

  • Supports invalid HTML
  • Finds tags on an HTML page with selectors, just like jQuery
  • Extracts contents from HTML in a single line

Example

How to get HTML elements:

[sourcecode]
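// Load the library (simple_html_dom.php ships with Simple HTML DOM Parser)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images and print their src attribute
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links and print their href attribute
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}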
[/sourcecode]
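If you want to try the same extraction without installing anything or touching the network, the sketch below does it with PHP's built-in DOM extension on an in-memory string. This is a stand-in technique, not Simple HTML DOM itself; the markup is a made-up sample with deliberately unclosed `<li>` tags to show that sloppy HTML can still be parsed.

```php
<?php
// Assumption: PHP's built-in DOMDocument is used instead of Simple HTML DOM.
// Made-up sample markup; the <li> tags are intentionally left unclosed.
$markup = '<ul><li><a href="/about">About</a>'
        . '<li><a href="/contact">Contact</a></ul>'
        . '<img src="logo.png">';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($markup);
libxml_clear_errors();

// Collect the href of every link
$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

// Collect the src of every image
$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}

print_r($links);
print_r($images);
```

The built-in extension matches by tag name or XPath rather than by jQuery-style selectors, which is the main convenience Simple HTML DOM adds.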

How to modify HTML elements:

[sourcecode]
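// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// Set the class attribute on the second div (index 1)
$html->find('div', 1)->class = 'bar';

// Replace the inner text of the div whose id is "hello"
$html->find('div[id=hello]', 0)->innertext = 'foo';

// Print the modified HTML
echo $html;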
[/sourcecode]

Extract content from HTML:

[sourcecode]
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
[/sourcecode]

Scraping Slashdot:

[sourcecode]
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Collect one entry per article block
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title'] = $article->find('div.title', 0)->plaintext;
    $item['intro'] = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);
[/sourcecode]
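One thing to watch in the scraping loop: when `find()` is called with an index and nothing matches, it returns null, so reading `->plaintext` from the result fails on any article that lacks one of the blocks. A small guard avoids that. The helper below is hypothetical (the name `text_or_default` is not part of the library) and only assumes that the node it receives has the library's `find($selector, $index)` method.

```php
<?php
// Hypothetical helper, not part of Simple HTML DOM.
// Returns the trimmed plaintext of the first element matching $selector,
// or $default when find() returns null because nothing matched.
function text_or_default($node, $selector, $default = '')
{
    $match = $node->find($selector, 0);
    return $match ? trim($match->plaintext) : $default;
}
```

Inside the loop it would be used as `$item['title'] = text_or_default($article, 'div.title');`, and likewise for the intro and details fields.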