Web developers sometimes need data from other websites. Retrieving that data programmatically is called web scraping. The basic idea is to download the page (its HTML content) first and then extract the data you need, for example by matching it with a regular expression. In PHP, cURL is the usual tool for downloading the page.
Here we will scrape data from a website in several different ways.
Suppose we want to get the title of a website (for example, http://www.amazon.com).
Example:
[sourcecode language="php"]
<?php
// Download the page in a single call.
$html = file_get_contents("http://www.amazon.com/");
if ($html === false) {
    exit('Could not download the page');
}
// Capture the text between <title> and </title>.
$res = preg_match('/<title[^>]*>([^<]+)<\/title>/im', $html, $title_matches);
if (!$res) {
    exit('No title found');
}
// Clean up title: remove EOLs and excessive whitespace.
$title = preg_replace('/\s+/', ' ', $title_matches[1]);
$title = trim($title);
echo $title;
?>
[/sourcecode]
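Calling file_get_contents() like this is not good practice: it sets no explicit timeout (it falls back on PHP's default_socket_timeout setting) and gives little control over the request, so a connection problem can leave the script waiting or without data. Fetching the page with cURL lets us set a timeout and other request options: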
[sourcecode language="php"]
<?php
$url = 'http://www.facebook.com';
$timeout = 30000; // Timeout limit in milliseconds (here 30 seconds)
$html = getHTML($url, $timeout); // Call getHTML with the URL and timeout
if ($html === false) {
    exit('Could not download the page');
}
$res = preg_match('/<title[^>]*>([^<]+)<\/title>/im', $html, $title_matches); // Match the page title
if (!$res) {
    exit('No title found');
}
$title = preg_replace('/\s+/', ' ', $title_matches[1]); // Collapse extra whitespace in the title
$title = trim($title);
echo $title; // Print the page title

// Fetch a page with cURL; returns the HTML, or false on failure
function getHTML($url, $timeout) {
    $ch = curl_init($url); // cURL initialization
    $agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 FirePHP/0.7.4'; // Optional: send a browser-like user agent
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_HEADER, FALSE); // Do not include response headers in the output
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // Skips HTTPS certificate verification; not recommended in production
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return the page instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT_MS, $timeout); // Give up after $timeout milliseconds
    curl_setopt($ch, CURLOPT_URL, $url);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
?>
[/sourcecode]
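The title-matching step used in the examples above can be tried on its own, without any network access. The following is only a minimal sketch: the function name extractTitle and the sample HTML string are made up for illustration and are not part of the original scripts.

[sourcecode language="php"]
<?php
/**
 * Return the cleaned-up <title> text of an HTML string,
 * or null when no title is found.
 * (extractTitle is a hypothetical helper name.)
 */
function extractTitle($html)
{
    // Capture everything between <title ...> and </title>, case-insensitively.
    if (!preg_match('/<title[^>]*>([^<]+)<\/title>/i', $html, $matches)) {
        return null;
    }
    // Collapse line breaks and runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $matches[1]));
}

// Sample input, invented for this sketch.
$sample = "<html><head><title>\n  My   Sample\n Page </title></head><body></body></html>";
echo extractTitle($sample); // prints: My Sample Page
[/sourcecode]

Keeping the matching in a function that returns null on failure lets the calling code decide what to do when a page has no title, instead of stopping the whole script.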
Suppose we want to get data from an HTML table. That is possible with regular expressions, but the pattern quickly becomes complex and hard to get right.
Instead, we can use a third-party library called PHP Simple HTML DOM Parser, which makes parsing HTML much easier.
This is the table data we want to get:
[sourcecode language="html"]
<h3 align='center'>-- Table One --</h3>
<table id="one" border="1" cellpadding="3" style="border-collapse: collapse">
<tr class="main">
<td align="center">Company</td>
<td class="buy" align="center">Buy Price</td>
<td class="cell" align="center">Cell Price</td>
</tr>
<tr class="body">
<td align="center">Sample Ltd</td>
<td class="buy" align="center">100.00</td>
<td class="cell" align="center">200.00</td>
</tr>
</table>
<h3 align='center'>-- Table Two --</h3>
<table id="two" border="1" cellpadding="3" style="border-collapse: collapse">
<tr class="main">
<td align="center">Company</td>
<td class="buy" align="center">Buy Price</td>
<td class="cell" align="center">Cell Price</td>
</tr>
<tr class="body">
<td align="center">Sample Ltd</td>
<td class="buy" align="center">600.00</td> <!-- This is our expected data -->
<td class="cell" align="center">900.00</td>
</tr>
</table>
[/sourcecode]
[sourcecode language="php"]
<?php
include('lib/HtmlDomParser/simple_html_dom.php'); // Include the DOM parser library (adjust the path to your setup)
$html = file_get_html('https://dev.cybernetikz.com/wp-content/uploads/2015/01/sample.html');
// Select the "buy" cell in the body row of the table with id "two"
$dateNodes = $html ? $html->find('table[id="two"] tr[class="body"] td[class="buy"]') : array();
if (!empty($dateNodes)) {
    foreach ($dateNodes as $date) {
        echo "Your Request Table Value is : " . trim($date->innertext) . '<br>';
        break; // Only the first match is needed
    }
} else {
    echo "Something Wrong With IT";
}
?>
[/sourcecode]
This is quite simple and works, but it does not behave well when the requested page does not exist, and it gives no control over the timeout.
Now we will combine these techniques into a more robust solution that scrapes a page and extracts exactly the data we want.
[sourcecode language="php"]
<?php
include('lib/HtmlDomParser/simple_html_dom.php');
$url = 'https://dev.cybernetikz.com/wp-content/uploads/2015/01/sample.html';
$timeout = 30000; // Timeout limit in milliseconds (here 30 seconds)
$content = getHTML($url, $timeout); // Download the page with cURL
$html = $content ? str_get_html($content) : false; // Parse only if the download succeeded
$dateNodes = $html ? $html->find('table[id="two"] tr[class="body"] td[class="buy"]') : array(); // Parse data using the DOM parser
if (!empty($dateNodes)) {
    foreach ($dateNodes as $date) {
        echo trim($date->innertext) . '<br>';
        break; // Only the first match is needed
    }
} else {
    echo "Your Request Data Is Empty";
}

// Fetch a page with cURL; returns the HTML, or false on failure
function getHTML($url, $timeout) {
    $ch = curl_init($url); // cURL initialization
    curl_setopt($ch, CURLOPT_HEADER, FALSE); // Do not include response headers in the output
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // Skips HTTPS certificate verification; not recommended in production
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return the page instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT_MS, $timeout); // Give up after $timeout milliseconds
    curl_setopt($ch, CURLOPT_URL, $url);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
?>
[/sourcecode]
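If installing a third-party library is not an option, the same lookup can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM. This is an alternative technique, not the method used above. The function name findBuyPrice is hypothetical, and the inline HTML is a shortened copy of the sample table.

[sourcecode language="php"]
<?php
// findBuyPrice is a hypothetical helper name used only in this sketch.
function findBuyPrice($html, $tableId)
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // XPath counterpart of: table[id=...] tr[class="body"] td[class="buy"]
    // ($tableId is assumed to be a trusted value without quote characters.)
    $query = sprintf('//table[@id="%s"]//tr[@class="body"]/td[@class="buy"]', $tableId);
    $nodes = $xpath->query($query);

    // Return the first match, or null when nothing matches.
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

// Shortened version of the sample tables, invented for this sketch.
$html = '<table id="one"><tr class="body"><td class="buy">100.00</td></tr></table>'
      . '<table id="two"><tr class="body"><td class="buy">600.00</td></tr></table>';

echo findBuyPrice($html, 'two'); // prints: 600.00
[/sourcecode]

In a real script, the $html string would come from the getHTML() function shown earlier.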
When you request data from another website many times, the server may treat you as a robot or spammer and block your IP address.
To avoid being blocked, we can send the requests from different IP addresses in different countries, which means routing them through proxies.
With cURL it is easy to add a working proxy when scraping content from a website:
[sourcecode language="php"]
<?php
$url = 'http://www.facebook.com';
$timeout = 30000; // Timeout limit in milliseconds (here 30 seconds)
$html = getHTML($url, $timeout); // Call getHTML with the URL and timeout
if ($html === false) {
    exit('Could not download the page');
}
$res = preg_match('/<title[^>]*>([^<]+)<\/title>/im', $html, $title_matches); // Match the page title
if (!$res) {
    exit('No title found');
}
$title = preg_replace('/\s+/', ' ', $title_matches[1]); // Collapse extra whitespace in the title
$title = trim($title);
echo $title; // Print the page title
echo $html; // Print the full page content as well

// Fetch a page with cURL through a random proxy; returns the HTML, or false on failure
function getHTML($url, $timeout) {
    // Placeholder entries in server:port form; replace them with your own proxies
    $proxy_list = array('server:port',
        'server:port',
        'server:port'
    );
    $ch = curl_init($url); // cURL initialization
    // Choose a random proxy
    $proxy = $proxy_list[array_rand($proxy_list)];
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    $agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 FirePHP/0.7.4'; // Optional: send a browser-like user agent
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_HEADER, FALSE); // Do not include response headers in the output
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // Skips HTTPS certificate verification; not recommended in production
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return the page instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT_MS, $timeout); // Give up after $timeout milliseconds
    curl_setopt($ch, CURLOPT_URL, $url);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
[/sourcecode]
Example proxy entry: server = 127.0.0.1 and port = 80, written as '127.0.0.1:80'.
Note: the server and port must point to a working proxy; otherwise this script will not work properly.
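Because every entry has to be a usable server:port pair, it can help to filter the list before picking a proxy. Below is a minimal sketch, assuming each entry is a host name or IPv4 address followed by a port. The helper names isValidProxyEntry and pickProxy and the sample addresses are hypothetical. It only checks the format, not whether the proxy is actually reachable.

[sourcecode language="php"]
<?php
// Hypothetical helper: checks that an entry looks like "host:port".
function isValidProxyEntry($entry)
{
    $parts = explode(':', $entry);
    if (count($parts) !== 2 || $parts[0] === '') {
        return false;
    }
    // The port must be an integer between 1 and 65535.
    $port = filter_var($parts[1], FILTER_VALIDATE_INT, array(
        'options' => array('min_range' => 1, 'max_range' => 65535),
    ));
    return $port !== false;
}

// Hypothetical helper: returns a random well-formed entry, or null if none exist.
function pickProxy(array $proxyList)
{
    $valid = array_values(array_filter($proxyList, 'isValidProxyEntry'));
    if (empty($valid)) {
        return null;
    }
    return $valid[array_rand($valid)];
}

// Sample list, invented for this sketch; 'server:port' is rejected as a placeholder.
$proxyList = array('127.0.0.1:80', 'server:port', '10.0.0.5:8080');
$proxy = pickProxy($proxyList);
if ($proxy !== null) {
    echo "Using proxy: $proxy";
}
[/sourcecode]

The value returned by pickProxy() could then be passed to CURLOPT_PROXY in place of the unchecked random choice.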