There are many ways to parse HTML/XML and extract information from it. Here I describe one of them, using the Simple HTML DOM parser.
DOM (Document Object Model) lets you work with an XML document through the DOM API. It is used here to parse and modify HTML.
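To make the idea of working with a document through the DOM API concrete, here is a minimal sketch. Note that it uses PHP's bundled DOM extension (`DOMDocument`) rather than the Simple HTML DOM parser, so it runs without any extra library. The sample markup and the variable names are made up for illustration.

```php
<?php
// Sketch using PHP's bundled DOM extension, not Simple HTML DOM Parser.
// $markup is invented sample data.
$markup = '<html><body>'
        . '<img src="logo.png" alt="Logo">'
        . '<a href="/about">About</a>'
        . '<a href="/contact">Contact</a>'
        . '</body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of emitting them for imperfect HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

// Read the src attribute of every <img> element.
$images = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}

// Read the href attribute of every <a> element.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

// Modify the tree through the same API, then serialize it back to HTML.
$doc->getElementsByTagName('a')->item(0)->setAttribute('class', 'bar');
$modified = $doc->saveHTML();

print_r($images);
print_r($links);
```

The pattern is the same whichever parser you pick: load the markup into a tree, look up nodes, then read or change their attributes.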
To parse HTML or XML using DOM you need
Example
How to get HTML elements:
[sourcecode]
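// Load the library (assumes simple_html_dom.php is on the include path)
require_once 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all images and print their src attributes
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}
// Find all links and print their href attributes
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}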
[/sourcecode]
How to modify HTML elements:
[sourcecode]
// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
// Set a class on the second div (index 1)
$html->find('div', 1)->class = 'bar';
// Replace the inner text of the div with id "hello"
$html->find('div[id=hello]', 0)->innertext = 'foo';
echo $html;
[/sourcecode]
Extract content from HTML:
[sourcecode]
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
[/sourcecode]
Scraping Slashdot:
[sourcecode]
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');
$articles = array();
// Find all article blocks
foreach ($html->find('div.article') as $article) {
    // Start each article with a fresh item
    $item = array();
    $item['title'] = $article->find('div.title', 0)->plaintext;
    $item['intro'] = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}
print_r($articles);
[/sourcecode]
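One thing to watch for when scraping: a block may be missing one of the expected child elements, and reading a property from a lookup that found nothing will fail. Below is a rough sketch of the same article-collecting loop with a guard for missing nodes. It is written against PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) instead of the Simple HTML DOM parser so that it runs on its own. The `$sample` markup is invented (it is not Slashdot's real HTML) and `firstText()` is a hypothetical helper.

```php
<?php
// Sketch with PHP's bundled DOM extension; $sample is invented markup.
$sample = '<div class="article">'
        . '<div class="title">First</div><div class="intro">Intro one</div>'
        . '</div>'
        . '<div class="article"><div class="title">Second</div></div>';

// Hypothetical helper: trimmed text of the first node matching $query
// under $context, or null when nothing matches.
function firstText(DOMXPath $xpath, string $query, DOMNode $context): ?string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$doc = new DOMDocument();
// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$articles = [];
// Find all article blocks (exact match on the class attribute).
foreach ($xpath->query('//div[@class="article"]') as $article) {
    $articles[] = [
        'title' => firstText($xpath, './div[@class="title"]', $article),
        'intro' => firstText($xpath, './div[@class="intro"]', $article),
    ];
}
print_r($articles);
```

Here the second article has no intro block, so its `intro` entry is simply `null` and the loop carries on. The same kind of check can be applied to the result of a lookup before reading `plaintext` from it.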