Scraping Website Content Using PHP cURL

Published on : April 27, 2026


Scraping Website Content Using PHP cURL and a Proxy

Web developers sometimes need data from other websites. Collecting that data is called web scraping. The basic idea is to first download the page (that is, get its HTML content) and then extract the data you need, for example by matching it with a regular expression. In PHP, cURL is the usual tool for downloading the page.
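
As a minimal sketch of the matching step on its own, the snippet below pulls the title out of an HTML string that is already in memory, so no download is involved. The function name extractTitle() and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: return the text of the first <title> element, or null.
function extractTitle(string $html): ?string
{
    // Capture everything between <title ...> and </title>; "i" ignores case.
    if (!preg_match('/<title[^>]*>([^<]+)<\/title>/i', $html, $matches)) {
        return null;
    }
    // Collapse line breaks and repeated whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $matches[1]));
}

// Sample markup standing in for a downloaded page.
$sample = "<html><head><title>\n  Example   Shop </title></head><body></body></html>";
echo extractTitle($sample); // prints: Example Shop
```

Keeping the matching logic in a function of its own means it can be checked against sample strings before any real page is requested.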

Here we will scrape data from a website in several different ways.

Get Website Data Using a Simple PHP Method and a Regular Expression

Suppose we want to get the title of a website (for example, http://www.amazon.com).

  • The easiest way to get website data is the built-in file_get_contents() function
  • Then match the result with a regular expression

Example:

[sourcecode language="php"]
<?php
// Download the page; file_get_contents() returns false on failure.
$html = file_get_contents("http://www.amazon.com/");
if ($html === false) {
    return null;
}

// Capture the text between <title> and </title>.
$res = preg_match('/<title[^>]*>([^<]+)<\/title>/im', $html, $title_matches);
if (!$res) {
    return null;
}
// Clean up title: remove EOLs and excessive whitespace.
$title = preg_replace('/\s+/', ' ', $title_matches[1]);
$title = trim($title);
echo $title;
?>
[/sourcecode]

Calling file_get_contents() like this is not good practice, because the request sets no timeout of its own and falls back to PHP's default_socket_timeout setting. If there is a connection problem, the script can stall and you may not get any data.
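
If you still want to use file_get_contents(), one option is to pass a stream context that carries a timeout. The following is only a sketch: the function name fetchWithTimeout() and the 10-second value in the usage comment are assumptions for illustration.

```php
<?php
// Hypothetical helper: download a URL, giving up after $seconds seconds.
function fetchWithTimeout(string $url, float $seconds): ?string
{
    // The "timeout" option of the http context is a read timeout in seconds.
    $context = stream_context_create([
        'http' => ['timeout' => $seconds],
    ]);
    // @ suppresses the warning so that failure is reported only by the return value.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

// Usage (needs network access):
// $html = fetchWithTimeout('http://www.amazon.com/', 10);
// if ($html === null) { echo 'Download failed or timed out'; }
```

Returning null on failure lets the caller tell a failed download apart from an empty page.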

Get Website Data Using PHP cURL and a Regular Expression

  • Step 1: Use PHP cURL to get the HTML content of the web page
  • Step 2: Match the expected value in the page source with a regular expression

[sourcecode language="php"]
<?php
$url = 'http://www.facebook.com';
$timeout = 30000; // Timeout limit in milliseconds (here 30 seconds)
$html = getHTML($url, $timeout); // Call getHTML() with the URL and timeout
$res = preg_match('/<title[^>]*>([^<]+)<\/title>/im', $html, $title_matches); // Match the page title
if (!$res) {
    return null;
}
$title = preg_replace('/\s+/', ' ', $title_matches[1]); // Collapse extra whitespace in the title
$title = trim($title);
echo $title; // Print the page title

// Download a page with cURL; returns the HTML, or false on failure
function getHTML($url, $timeout) {
    $ch = curl_init($url); // cURL initialization
    $agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 FirePHP/0.7.4'; // Optional: send a browser-like user agent
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // Skips HTTPS certificate verification (insecure; remove for production use)
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_TIMEOUT_MS, $timeout); // Apply the timeout (milliseconds)
    curl_setopt($ch, CURLOPT_URL, $url);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
?>
[/sourcecode]
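
curl_exec() returns false when the transfer itself fails, and a page that answers with an HTTP error status still comes back as a string, so it can help to check both. The following is a sketch under assumptions: the function name fetchPage() and the rule of treating every status of 400 or above as a failure are illustrative choices, not part of the example above.

```php
<?php
// Hypothetical helper: fetch a page with cURL and return null on any failure.
function fetchPage(string $url, int $timeoutMs): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT_MS, $timeoutMs); // total time limit in milliseconds

    $content = curl_exec($ch);
    $status  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($content === false) {
        // Transport-level failure: DNS, refused connection, timeout and so on.
        error_log('cURL error: ' . curl_error($ch));
    }

    // Treat transport errors and HTTP error statuses (400 and above) as failures.
    if ($content === false || $status >= 400) {
        return null;
    }
    return $content;
}

// Usage (needs network access):
// $html = fetchPage('http://www.facebook.com', 30000);
```

With a null return for every failure case, the calling code only needs one check before it starts matching.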

Suppose we want to get data from an HTML table. Matching it with a regular expression is possible but difficult.
Instead, we can use a third-party library called PHP Simple HTML DOM Parser, which makes parsing HTML much easier.

This is the table data we want to get:

[sourcecode language="html"]
<h3 align='center'>-- Table One --</h3>

<table id="one" border="1" cellpadding="3" style="border-collapse: collapse">
<tr class="main">
<td align="center">Company</td>
<td class="buy" align="center">Buy Price</td>
<td class="cell" align="center">Cell Price</td>
</tr>

<tr class="body">
<td align="center">Sample Ltd</td>
<td class="buy" align="center">100.00</td>
<td class="cell" align="center">200.00</td>
</tr>

</table>

<h3 align='center'>-- Table Two --</h3>

<table id="two" border="1" cellpadding="3" style="border-collapse: collapse">
<tr class="main">
<td align="center">Company</td>
<td class="buy" align="center">Buy Price</td>
<td class="cell" align="center">Cell Price</td>
</tr>

<tr class="body">
<td align="center">Sample Ltd</td>
<td class="buy" align="center">600.00</td> <!-- This is our Expected Data -->
<td class="cell" align="center">900.00</td>
</tr>
</table>
[/sourcecode]

Get Website Data Using PHP Simple HTML DOM Parser

  • Step 1: Use the library's file_get_html() function to get the content of the web page
  • Step 2: Parse the expected data out of the page source with PHP Simple HTML DOM Parser

[sourcecode language="php"]
<?php
include('lib/HtmlDomParser/simple_html_dom.php'); // Include the DOM parser library (adjust the path to your setup)
$html = file_get_html('https://dev.cybernetikz.com/wp-content/uploads/2015/01/sample.html');
if (!$html) {
    exit("Could not load the page");
}
// Select the "buy" cell in the body row of table "two"
$dateNodes = $html->find('table[id="two"] tr[class="body"] td[class="buy"]');
if (!empty($dateNodes)) {
    foreach ($dateNodes as $date) {
        echo "Your requested table value is: " . trim($date->innertext) . '<br>';
        break; // Only the first match is needed
    }
} else {
    echo "Something went wrong";
}
?>
[/sourcecode]

This is quite simple and works. However, it does not work properly when the requested page does not exist, and this technique sets no timeout limit.
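
For comparison, the same cell can also be read without a third-party library. The sketch below swaps in PHP's built-in DOMDocument and DOMXPath classes in place of PHP Simple HTML DOM Parser, and works on an in-memory copy of the table so it runs without network access. The function name findBuyPrice() and the cut-down markup are assumptions for illustration.

```php
<?php
// Sample markup: a cut-down copy of "Table Two" from the page above.
$sampleHtml = <<<HTML
<table id="two">
<tr class="main"><td>Company</td><td class="buy">Buy Price</td></tr>
<tr class="body"><td>Sample Ltd</td><td class="buy">600.00</td></tr>
</table>
HTML;

// Hypothetical helper: return the buy price from the body row of table "two".
function findBuyPrice(string $html): ?string
{
    if ($html === '') {
        return null; // nothing to parse
    }
    libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $xpath = new DOMXPath($doc);
    $cells = $xpath->query('//table[@id="two"]//tr[@class="body"]/td[@class="buy"]');
    if ($cells === false || $cells->length === 0) {
        return null;
    }
    return trim($cells->item(0)->textContent);
}

echo findBuyPrice($sampleHtml); // prints: 600.00
```

The XPath expression plays the same role as the selector string passed to find() in the example above.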

Scrape a Website Using PHP Simple HTML DOM Parser and cURL

Now we will combine these approaches into a more robust solution for scraping a page and extracting exactly the data we need.

  • We use cURL to get the HTML content of a page
  • Then we use PHP Simple HTML DOM Parser to parse the data out of the page

[sourcecode language="php"]
<?php
include('lib/HtmlDomParser/simple_html_dom.php');
$url = 'https://dev.cybernetikz.com/wp-content/uploads/2015/01/sample.html';
$timeout = 30000; // Timeout limit in milliseconds (here 30 seconds)
$html = str_get_html(getHTML($url, $timeout));
if (!$html) {
    exit("Could not load the page");
}
$dateNodes = $html->find('table[id="two"] tr[class="body"] td[class="buy"]'); // Parse the data with the DOM parser
if (!empty($dateNodes)) {
    foreach ($dateNodes as $date) {
        echo trim($date->innertext) . '<br>';
        break; // Only the first match is needed
    }
} else {
    echo "Your requested data is empty";
}

// Download a page with cURL; returns the HTML, or false on failure
function getHTML($url, $timeout) {
    $ch = curl_init($url); // cURL initialization
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // Skips HTTPS certificate verification (insecure; remove for production use)
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_TIMEOUT_MS, $timeout); // Apply the timeout (milliseconds)
    curl_setopt($ch, CURLOPT_URL, $url);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
?>
[/sourcecode]

Scraping Websites with PHP cURL Through a Proxy

When you request data from another website many times, the server may treat you as a robot or spammer and block your IP address.
To avoid being blocked, we can send our requests from different IP addresses in different countries, which means using proxies.

With cURL, it is easy to add a valid proxy for scraping content from a website:

[sourcecode language="php"]
<?php
$url = 'http://www.facebook.com';
$timeout = 30000; // Timeout limit in milliseconds (here 30 seconds)
$html = getHTML($url, $timeout); // Call getHTML() with the URL and timeout
$res = preg_match('/<title[^>]*>([^<]+)<\/title>/im', $html, $title_matches); // Match the page title
if (!$res) {
    return null;
}
$title = preg_replace('/\s+/', ' ', $title_matches[1]); // Collapse extra whitespace in the title
$title = trim($title);
echo $title; // Print the page title
echo $html; // Also print the full page source

// Download a page with cURL through a randomly chosen proxy
function getHTML($url, $timeout) {
    // 'server:port' entries are placeholders; you have to use your own proxies
    $proxy_list = array(
        'server:port',
        'server:port',
        'server:port'
    );
    $ch = curl_init($url); // cURL initialization
    // Choose a random proxy
    $proxy = $proxy_list[array_rand($proxy_list)];
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    $agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 FirePHP/0.7.4'; // Optional: send a browser-like user agent
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // Skips HTTPS certificate verification (insecure; remove for production use)
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_TIMEOUT_MS, $timeout); // Timeout in milliseconds
    curl_setopt($ch, CURLOPT_URL, $url);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
[/sourcecode]

Example of a proxy entry: server = 127.0.0.1, port = 80, which is written as '127.0.0.1:80'.
Note: the server and port must be valid; otherwise this script will not work properly.
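
Because any single proxy can be down, a natural extension is to try several proxies until one returns a page. The following is a sketch under assumptions, not part of the script above: fetchViaProxies() and curlFetch() are hypothetical names, and the fetcher is passed in as a callable so the retry logic can be tried out without real proxies.

```php
<?php
// Hypothetical helper: try the proxies in random order until one returns a page.
// $fetch is any callable taking ($url, $proxy) and returning the page body or false.
function fetchViaProxies(string $url, array $proxies, callable $fetch)
{
    shuffle($proxies); // random order, so no proxy is tried twice
    foreach ($proxies as $proxy) {
        $content = $fetch($url, $proxy);
        if ($content !== false) {
            return $content; // the first proxy that works wins
        }
    }
    return false; // every proxy failed
}

// A cURL-based fetcher to pass in as $fetch (timeout in milliseconds).
function curlFetch(string $url, string $proxy, int $timeoutMs = 30000)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT_MS, $timeoutMs);
    return curl_exec($ch); // page body, or false on failure
}

// Usage, with 'server:port' placeholders to replace with your own proxies:
// $html = fetchViaProxies('http://www.facebook.com', array('server:port', 'server:port'), 'curlFetch');
```

Shuffling the list keeps the random choice of the original script while making sure a failed proxy is not picked again in the same run.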
