PHP Web Media Grabber

Advanced PHP web resources extractor

Creating a full-featured image gallery using flickr images

The Need

We need to create an image gallery hosted on our server but using the pictures provided by flickr.
Both the thumbs and the full size images will be grabbed from flickr and downloaded to our localhost using the PHP Web Image Grabber class from the WiseLoop PHP Web Media Grabber package.
The image gallery will be displayed within our website, and nobody will know that the pictures are coming from flickr.

The Solution

Grabbing and downloading media resources from websites is what WiseLoop PHP Web Media Grabber is about.
WiseLoop PHP Web Media Grabber is a set of PHP classes designed to grab, extract or even download media files from the web, such as images, videos, audio, flash files, documents, javascript sources, css stylesheet files etc.
This package allows complex media extraction in a flexible manner, using only a few lines of code.
The extraction is made from any given web URL that contains or refers to media files using web links (a href tags), various tag attributes (src, embed, param, movie etc.) or even inline css styling attributes (such as background images).
The media grabbing engine is also able to identify more than the obvious media resources that have the most common file extensions: it will find media generated dynamically by the servers, or media files that have no valid extension or no extension at all (such as images generated at runtime by the web servers). The identification is made by checking the server response header when pinging the tested media resource.
For more information please check out the product page at http://wiseloop.com/product/php-web-media-grabber
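
To make the header-based identification more concrete, here is a minimal sketch of the idea. It is not the package's code: the looksLikeImage() helper is a hypothetical name of ours, and we simply assume that a Content-Type starting with image/ marks a picture.

```php
<?php
// Hypothetical helper, not part of the WiseLoop package.
// Decides from raw response header lines whether a resource is an image.
function looksLikeImage(array $headerLines): bool
{
    foreach ($headerLines as $line) {
        // Match "Content-Type: image/..." case-insensitively.
        if (preg_match('/^content-type:\s*image\//i', $line) === 1) {
            return true;
        }
    }
    return false;
}

// Header lines in the shape returned by PHP's get_headers($url).
$sample = array(
    'HTTP/1.1 200 OK',
    'Content-Type: image/jpeg',
    'Content-Length: 48213',
);
var_dump(looksLikeImage($sample)); // bool(true)
```

For a live resource the header lines could come from get_headers($url), which returns false on failure, so that case needs to be checked before calling the helper.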

The Implementation

Step 1: Install the WiseLoop PHP Web Media Grabber package

  • Step 1.1: copy the entire package contents to your web server;
  • Step 1.2: create (if it does not exist) the /download directory under the /php-web-media-grabber directory;
  • Step 1.3: make sure that the /php-web-media-grabber/download directory has full access rights (chmod 0777 on Linux);
  • Step 1.4: include /php-web-media-grabber/bin/wlWmg.php in your application.
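
Steps 1.2 and 1.3 can also be checked from PHP itself. The sketch below is only an illustration: ensureWritableDir() is a hypothetical helper of ours, and the temporary directory stands in for your real installation path.

```php
<?php
// Hypothetical helper, not part of the WiseLoop package.
// Creates the directory if it is missing and reports whether PHP can write to it.
function ensureWritableDir(string $dir): bool
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        return false; // the directory could not be created
    }
    return is_writable($dir);
}

// Stand-in root for this demo; use your own installation path here.
$installRoot = sys_get_temp_dir();
$downloadDir = $installRoot . '/php-web-media-grabber/download';

echo ensureWritableDir($downloadDir)
    ? "download directory is ready\n"
    : "download directory is missing or not writable\n";
```

What matters in practice is that the user running the web server can write to the directory; 0777 is simply the broadest way to guarantee that.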

Step 2: Choosing a gallery theme and establishing the url from which the images will be grabbed

Let's say that we want our picture gallery to be about school; after doing an image search with the keyword school, flickr.com shows:

tut1-flickr1.jpg
flickr.com school search screenshot

By doing so, the web browser reveals the url to be passed to the grabbing engine in order to extract the pictures:

http://www.flickr.com/search/?q=school&f=hp
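
If the keyword comes from a variable, the same url can be assembled in code. This is a small sketch under our own assumptions: buildSearchUrl() is a hypothetical helper, and the q and f parameters are simply copied from the url above.

```php
<?php
// Hypothetical helper: builds the search url for a keyword.
// The "q" and "f" parameters are copied from the url shown above.
function buildSearchUrl(string $keyword): string
{
    $query = http_build_query(array('q' => $keyword, 'f' => 'hp'), '', '&');
    return 'http://www.flickr.com/search/?' . $query;
}

echo buildSearchUrl('school'), "\n";      // http://www.flickr.com/search/?q=school&f=hp
echo buildSearchUrl('high school'), "\n"; // http://www.flickr.com/search/?q=high+school&f=hp
```

Using http_build_query() keeps keywords with spaces or special characters from breaking the url.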

We want to grab and download to our host all the pictures presented here: the thumbs and the full size images hidden behind the thumb links.
We will use the PHP Web Image Grabber class to grab and download the images from the targeted url, and then some short custom PHP code to nicely display the gallery.

Step 3: Use the WiseLoop PHP Web Media Grabber in your application

All we need to do is create a wlWmgImageGrabber object and pass it the url of the page to be processed.
To start, our code should look something like this (for now we will focus only on grabbing and leave the gallery display for later, printing the results as an array):

require_once dirname(__FILE__)."/../bin/wlWmg.php";                 //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url);                                  //creating the image grabber object
$ig->grab();                                                        //the grab command
echo '<pre>'.print_r($ig->getValidMediaTable(true), true).'</pre>'; //displaying the result array

After running the code above we obtain:

tut1-res1.jpg
Grabbing results

Quite nice, but there are some issues:

  • the grabbing engine grabbed more than we need: the flickr logo, the search button, the buddy icons on the right etc. We need only the pictures inside the result area;
  • only the thumbs were grabbed, the full size images hidden behind the thumbs are missing;
  • the grabbing engine was able to identify and extract the images, but no download to our localhost was performed.

Ok, so let's start with the first issue:

Step 4: Filtering grabbing results

If we do not need all the media on a page to be extracted, we need to set up some filters and add them to the grabber filter list. To create a filter we must instantiate the wlWmgFilter class by calling its constructor with some arguments (see the class documentation). Also, some filters can be added automatically to the grabber object by passing additional sets of values to the grabber constructor.
Back to our case, we notice that all the needed picture files have the ".jpg" extension. So let's use that in order to filter the results:

require_once dirname(__FILE__)."/../bin/wlWmg.php";                 //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url, '.jpg');                          //creating the filtered(by .jpg extension) image grabber object
$ig->grab();                                                        //the grab command
echo '<pre>'.print_r($ig->getValidMediaTable(true), true).'</pre>'; //displaying the result array

or, if we want more than one extension (use an array):

require_once dirname(__FILE__)."/../bin/wlWmg.php";                 //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url, array('.jpg', '.jpeg'));          //creating the filtered(by .jpg and .jpeg extensions) image grabber object
$ig->grab();                                                        //the grab command
echo '<pre>'.print_r($ig->getValidMediaTable(true), true).'</pre>'; //displaying the result array

or, if we want to use the wlWmgFilter objects explicitly:

require_once dirname(__FILE__)."/../bin/wlWmg.php";                 //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url);                                  //creating the image grabber object
$extensionFilter = new wlWmgFilter(                                 //creating the extensions filter object
    array('.jpg', '.jpeg'),
    wlWmgFilter::TYPE_URL,
    wlWmgFilter::OPERATOR_CONTAINS
);
$ig->addFilter($extensionFilter);                                   //adding the filter to the grabber
$ig->grab();                                                        //the grab command
echo '<pre>'.print_r($ig->getValidMediaTable(true), true).'</pre>'; //displaying the result array
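
To see what a "url contains" filter amounts to, here is a stand-alone sketch. It is not the wlWmgFilter implementation: filterUrlsContaining() is a hypothetical function, the example urls are made up, and the case-insensitive matching is our own choice.

```php
<?php
// Hypothetical stand-in for a "url contains" filter; not the wlWmgFilter implementation.
function filterUrlsContaining(array $urls, array $needles): array
{
    $kept = array();
    foreach ($urls as $url) {
        foreach ($needles as $needle) {
            // stripos() is case-insensitive, so ".JPG" matches ".jpg" too.
            if (stripos($url, $needle) !== false) {
                $kept[] = $url;
                break; // one matching needle is enough
            }
        }
    }
    return $kept;
}

$urls = array(
    'http://example.com/img/logo.png',
    'http://example.com/img/bus_t.jpg',
    'http://example.com/img/icon.jpeg?size=48',
);
print_r(filterUrlsContaining($urls, array('.jpg', '.jpeg')));
```

The logo is dropped, but the small icon is kept because it is a jpeg as well: a filter on the extension alone cannot tell a result thumb from any other jpeg on the page.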

Yes, it's better, but there are still some unwanted pictures in our grab results (some of the buddy icons on the right are jpegs).
Ok, let's try something stronger then:

Step 5: Filtering grabbing results by limiting the searching HTML area

This is a very strong feature: the grabbing engine is able to search for images (or any media) only inside a designated HTML area specified by a tag. This way, unwanted pictures are skipped from the start, because the full HTML target page is narrowed to a smaller area consisting of one tag's content. An incomplete tag (a tag slice) can also be specified; the tag will be auto-completed depending on the contextual HTML content.
In order to do HTML area searching, we need to know a little bit about the HTML DOM structure of our target page.
The page source shows:

tut1-page-source1.jpg
Page source of our url; green marked area represents the needed area to be searched

After just one minute of checking the page source, we notice that the area to be searched is inside the following tag:

<div class="ResultsThumbs" id="ResultsThumbsDiv">

So let's search inside that area only for images:

require_once dirname(__FILE__)."/../bin/wlWmg.php";                 //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url, array('.jpg', '.jpeg'));          //creating the filtered(by .jpg and .jpeg extensions) image grabber object
$ig->setGrabOnlyFromTagSlice('photo-display-container');            //limiting the searching area: only images found inside that tag will be grabbed
$ig->grab();                                                        //the grab command
echo '<pre>'.print_r($ig->getValidMediaTable(true), true).'</pre>'; //displaying the result array

Quite smart, isn't it? Notice that we didn't specify the full tag definition: we used only a slice, 'photo-display-container'; the grabbing engine was able to identify the full tag from the page's HTML source code. This can be very helpful when the tag that delineates the designated HTML search area is dynamically generated but has some static properties that can lead to its unique identification (such as a CSS class or an ID).
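
For the curious, here is one way such a tag slice could be completed and its content extracted. This is only a sketch of the idea and certainly not the package's implementation: extractTagContent() is a hypothetical function, it assumes the slice appears inside an opening tag, and it ignores corner cases such as a > character inside an attribute value.

```php
<?php
// Hypothetical sketch, not the WiseLoop implementation.
// Returns the inner HTML of the first element whose opening tag contains $slice,
// or null when no such element is found.
function extractTagContent(string $html, string $slice): ?string
{
    $hit = strpos($html, $slice);
    if ($hit === false) {
        return null;
    }
    // Complete the slice to a full opening tag: back to "<", forward to ">".
    $open = strrpos(substr($html, 0, $hit), '<');
    if ($open === false) {
        return null;
    }
    $openEnd = strpos($html, '>', $open);
    if ($openEnd === false || $openEnd < $hit) {
        return null; // the slice is not inside a tag
    }
    if (!preg_match('/^<([a-zA-Z][a-zA-Z0-9]*)/', substr($html, $open), $m)) {
        return null;
    }
    $tag = $m[1];

    // Walk over tags with the same name, tracking nesting until the matching close.
    $depth = 1;
    $pos = $openEnd + 1;
    $pattern = '/<(\/?)' . $tag . '\b[^>]*>/i';
    while (preg_match($pattern, $html, $t, PREG_OFFSET_CAPTURE, $pos)) {
        $depth += ($t[1][0] === '/') ? -1 : 1;
        if ($depth === 0) {
            return substr($html, $openEnd + 1, $t[0][1] - $openEnd - 1);
        }
        $pos = $t[0][1] + strlen($t[0][0]);
    }
    return null;
}

$html = '<body><div id="logo"><img src="logo.png"></div>'
      . '<div class="photo-display-container" id="x1"><div><img src="a.jpg"></div><img src="b.jpg"></div>'
      . '<div id="ads"><img src="ad.jpg"></div></body>';
echo extractTagContent($html, 'photo-display-container'), "\n";
// <div><img src="a.jpg"></div><img src="b.jpg">
```

Only a.jpg and b.jpg remain visible to whatever searches the returned content; the logo and the ad are outside the selected area.
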
Just for fun, let's pretend that we do not need all the images, only the first 10 pictures: we'll have a gallery of just 10 pictures.

Step 6: Limiting the number of grabbed media images

If we want to limit the number of grabbed media, we need to set up the limiter like this:

require_once dirname(__FILE__)."/../bin/wlWmg.php";                 //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url, array('.jpg', '.jpeg'));          //creating the filtered(by .jpg and .jpeg extensions) image grabber object
$ig->setGrabOnlyFromTagSlice('photo-display-container');            //limiting the searching area: only images found inside that tag will be grabbed
$ig->setLimit(10);                                                  //limiting the grabbed images count: only first 10 images will be grabbed
$ig->grab();                                                        //the grab command
echo '<pre>'.print_r($ig->getValidMediaTable(true), true).'</pre>'; //displaying the result array

Now we are sure that we have grabbed the needed thumbs.
How about the full size images that hide behind the thumbs?

Step 7: Following links in search for full size images

Let's click on the first image. The web browser shows:

tut1-flickr2.jpg
Followed thumb subpage

Oops; this is not very good for us. Of course, we (human beings) know which is the full size picture of the followed thumb: obviously it is the bus. But how about a computer program (our PHP script)? Does it know that? There are many images on that page: the flickr logo, the bus, the buddy icons on the right, some commercial banners, the buddy icons at the bottom etc. How will the computer identify the full size image that it has to extract?
Simple, by assuming this:
The largest picture (in bytes) will be the needed picture. The engine will check every picture contained in this sub-page, and the largest one (in bytes) will most likely be the one that we need.
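
The "largest in bytes wins" assumption is easy to picture in code. The sketch below is ours, not the package's: pickLargest() is a hypothetical helper and the urls and sizes are invented for the example.

```php
<?php
// Hypothetical helper, not part of the WiseLoop package.
// Given a map of url => size in bytes, returns the url of the largest resource,
// or null when the map is empty. On a tie the first url wins.
function pickLargest(array $sizesByUrl): ?string
{
    if (count($sizesByUrl) === 0) {
        return null;
    }
    $largest = max($sizesByUrl);
    return (string) array_search($largest, $sizesByUrl, true);
}

// Invented sizes, as they could be read from Content-Length response headers.
$sizes = array(
    'http://example.com/logo.png'      => 3120,
    'http://example.com/bus_large.jpg' => 184320,
    'http://example.com/buddy1.jpg'    => 2048,
);
echo pickLargest($sizes), "\n"; // http://example.com/bus_large.jpg
```

Getting each size costs one request per image, and some servers do not send a Content-Length header at all, so the heuristic is only as good as the headers it can read.
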
Let's now activate the following of sub-page links:

require_once dirname(__FILE__)."/../bin/wlWmg.php";                 //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url, array('.jpg', '.jpeg'));          //creating the filtered(by .jpg and .jpeg extensions) image grabber object
$ig->setGrabOnlyFromTagSlice('photo-display-container');            //limiting the searching area: only images found inside that tag will be grabbed
$ig->setLimit(10);                                                  //limiting the grabbed images count: only first 10 images will be grabbed
$ig->setFollowSubpagesLinks(true);                                  //activating the sub-pages links following
$ig->grab();                                                        //the grab command
echo '<pre>'.print_r($ig->getValidMediaTable(true), true).'</pre>'; //displaying the result array

Wow! That was really sloooooow! Why is that?
Because the script had to check every sub-page hidden behind every thumb, and for every sub-page it had to check every picture contained in it in order to find the largest one (in bytes). That can add up to thousands of url checks (pages and/or images) and can even hit the PHP timeout limit.
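
A back-of-the-envelope count shows how quickly the requests pile up. The function and all the figures below are illustrative assumptions of ours, not measurements of the package.

```php
<?php
// Rough estimate of the HTTP requests made when every thumb link is followed.
// All figures passed in are illustrative assumptions, not measurements.
function estimateRequests(int $thumbs, int $imagesPerSubpage): int
{
    $mainPage   = 1;                             // the search results page
    $subpages   = $thumbs;                       // one followed page per thumb
    $sizeChecks = $thumbs * $imagesPerSubpage;   // one size check per image on each sub-page
    return $mainPage + $subpages + $sizeChecks;
}

echo estimateRequests(10, 40), "\n";   // 1 + 10 + 400 = 411
echo estimateRequests(100, 40), "\n";  // 1 + 100 + 4000 = 4101
```

Even with only 10 thumbs and an assumed 40 images per sub-page, that is over 400 requests; at an assumed quarter of a second each, about 100 seconds, well beyond PHP's default max_execution_time of 30 seconds.
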
So, let's fix that, ok?

Step 8: Limiting the searching HTML area in the followed sub-pages

Ok, you already know what this is about.
What if we limit the searched area in the followed sub-pages like we did for the main page at Step 5?
Yes: if we know which tag holds the full size image in the followed sub-page, the problem is solved.
After less than one minute of checking the sub-page source, we notice that the full size image (the bus) is inside the following tag:

<div class="photo-div">

So our PHP script becomes:

require_once dirname(__FILE__)."/../bin/wlWmg.php";                 //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url, array('.jpg', '.jpeg'));          //creating the filtered(by .jpg and .jpeg extensions) image grabber object
$ig->setGrabOnlyFromTagSlice('photo-display-container');            //limiting the searching area: only images found inside that tag will be grabbed
$ig->setLimit(10);                                                  //limiting the grabbed images count: only first 10 images will be grabbed
$ig->setFollowSubpagesLinks(true);                                  //activating the sub-pages links following
$ig->setFollowedSubpageGrabOnlyFromTagSlice('main-photo-container');//limiting the sub-page searching area for finding the full size images
$ig->grab();                                                        //the grab command
echo '<pre>'.print_r($ig->getValidMediaTable(true), true).'</pre>'; //displaying the result array

The result:

tut1-res2.jpg
Final grabbing result array

Please note the parent/child image relations feature: when grabbing image galleries with link following enabled, the followed thumbs are set as parents of the full size images found underneath them; this way you will know, for every grabbed thumb, the corresponding full size image and vice versa.
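
The parent/child relation can be pictured with a very small data model. This is a hypothetical illustration with made-up file names, not the package's internal structure: each record stores its own url and the url of its parent thumb.

```php
<?php
// Hypothetical data model, not the WiseLoop internals: each record has a url and
// the url of its parent thumb (null for a thumb found on the main page).
$media = array(
    array('url' => 'thumbs/bus_t.jpg',   'parent' => null),
    array('url' => 'full/bus.jpg',       'parent' => 'thumbs/bus_t.jpg'),
    array('url' => 'thumbs/class_t.jpg', 'parent' => null),
    array('url' => 'full/class.jpg',     'parent' => 'thumbs/class_t.jpg'),
);

// Index children by parent url: thumb => list of full size images.
function childrenByParent(array $media): array
{
    $index = array();
    foreach ($media as $item) {
        if ($item['parent'] !== null) {
            $index[$item['parent']][] = $item['url'];
        }
    }
    return $index;
}

// Reverse lookup: full size image => its thumb (null when unknown or a root).
function parentOf(array $media, string $url): ?string
{
    foreach ($media as $item) {
        if ($item['url'] === $url) {
            return $item['parent'];
        }
    }
    return null;
}

print_r(childrenByParent($media));
echo parentOf($media, 'full/bus.jpg'), "\n"; // thumbs/bus_t.jpg
```

With both lookups available, a gallery page can go from a thumb to its full size image and back.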

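Step 9: Enabling the download feature

Until now the images were only identified and listed; to also save them to our server (the last issue noted in Step 3), we enable the download feature: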

require_once dirname(__FILE__)."/../bin/wlWmg.php";                 //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url, array('.jpg', '.jpeg'));          //creating the filtered(by .jpg and .jpeg extensions) image grabber object
$ig->setGrabOnlyFromTagSlice('photo-display-container');            //limiting the searching area: only images found inside that tag will be grabbed
$ig->setLimit(10);                                                  //limiting the grabbed images count: only first 10 images will be grabbed
$ig->setFollowSubpagesLinks(true);                                  //activating the sub-pages links following
$ig->setFollowedSubpageGrabOnlyFromTagSlice('main-photo-container');//limiting the sub-page searching area for finding the full size images
$ig->setDoDownload(true);                                           //enabling the download feature
$ig->grab();                                                        //the grab command
echo '<pre>'.print_r($ig->getValidMediaTable(true), true).'</pre>'; //displaying the result array

Step 10: Displaying our gallery

We will follow a classical recipe for displaying image galleries: the main page shows the thumbs, and clicking a thumb shows the full size picture corresponding to it.

require_once dirname(__FILE__)."/../bin/wlWmg.php";                 //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url, array('.jpg', '.jpeg'));          //creating the filtered(by .jpg and .jpeg extensions) image grabber object
$ig->setGrabOnlyFromTagSlice('photo-display-container');            //limiting the searching area: only images found inside that tag will be grabbed
$ig->setLimit(10);                                                  //limiting the grabbed images count: only first 10 images will be grabbed
$ig->setFollowSubpagesLinks(true);                                  //activating the sub-pages links following
$ig->setFollowedSubpageGrabOnlyFromTagSlice('main-photo-container');//limiting the sub-page searching area for finding the full size images
$ig->setDoDownload(true);                                           //enabling the download feature
$ig->grab();                                                        //the grab command
$thumbs = $ig->getMediaRoots();                                     //get the images thumbs (media that have kids)
foreach($thumbs as $thumb) {
    $images = $ig->getMediaKids($thumb);                            //get the thumb full size images (only one in this case)
    if(count($images)) {
        $image = $images[0];                                        //get the thumb full size image
        echo '<a href="'.$image->getGrabbedUrl().'" style="float:left; margin:10px; background-color:#d5d5d5; padding:5px;">';  //link to full size image
        echo '<img src="'.$thumb->getGrabbedUrl().'"/>';            //the thumb
        echo '</a>';
    }
}

The result:

tut1-res3.jpg
Our simple image gallery
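
Since the urls come from a third-party page, it is safer to escape them before echoing them into HTML attributes. A minimal sketch, assuming plain url strings in place of the grabbed media objects; renderGallery() is a hypothetical helper of ours.

```php
<?php
// Hypothetical helper: renders thumb => full size pairs as linked thumbnails.
// Plain url strings stand in for the grabbed media objects.
function renderGallery(array $fullByThumb): string
{
    $html = '';
    foreach ($fullByThumb as $thumbUrl => $fullUrl) {
        $html .= '<a href="' . htmlspecialchars($fullUrl, ENT_QUOTES, 'UTF-8') . '">'
               . '<img src="' . htmlspecialchars($thumbUrl, ENT_QUOTES, 'UTF-8') . '" alt=""/>'
               . '</a>' . "\n";
    }
    return $html;
}

echo renderGallery(array(
    'download/bus_t.jpg' => 'download/bus.jpg?v=1&size=large',
));
// <a href="download/bus.jpg?v=1&amp;size=large"><img src="download/bus_t.jpg" alt=""/></a>
```

In the gallery loop above, the same htmlspecialchars() call could wrap the values returned by getGrabbedUrl().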

Step 11: Beautifying our gallery thumbnails

If we want our gallery to be more attractive we can also use a very powerful inline image processor provided by WiseLoop.
For a full feature list you can visit the PHP Graphic Works product page at http://wiseloop.com/product/php-graphic-works
For our purpose we will use the live feature of the product to standardize the thumb dimensions and to apply a rounded mask and a reflection effect to the thumbs:

require_once dirname(__FILE__)."/../bin/wlWmg.php";                     //including the WiseLoop PHP Web Media Grabber Package: use your installation path here
require_once dirname(__FILE__)."/../../php-graphic-works/bin/wlGw.php";   //including the WiseLoop PHP Graphic Works Package: use your installation path here
$url= 'http://www.flickr.com/search/?q=school&f=hp';                //the url
$ig = new wlWmgImageGrabber($url, array('.jpg', '.jpeg'));          //creating the filtered(by .jpg and .jpeg extensions) image grabber object
$ig->setGrabOnlyFromTagSlice('photo-display-container');            //limiting the searching area: only images found inside that tag will be grabbed
$ig->setLimit(10);                                                  //limiting the grabbed images count: only first 10 images will be grabbed
$ig->setFollowSubpagesLinks(true);                                  //activating the sub-pages links following
$ig->setFollowedSubpageGrabOnlyFromTagSlice('main-photo-container');//limiting the sub-page searching area for finding the full size images
$ig->setDoDownload(true);                                           //enabling the download feature
$ig->grab();                                                        //the grab command
$thumbs = $ig->getMediaRoots();                                     //get the images thumbs (media that have kids)
foreach($thumbs as $thumb) {
    $images = $ig->getMediaKids($thumb);                            //get the thumb full size images (only one in this case)
    if(count($images)) {
        $image = $images[0];                                        //get the thumb full size image
        $fxChain = 'CropAlign(center-center, 100, 70);Mask(rounded, 20);Reflection();';                     //the effects chain to be applied over the thumbs
        $fxThumbPath = './../../php-graphic-works/live/do.php?img='.$thumb->getGrabbedUrl().'&fx='.$fxChain;  //path to the inline graphic processor (requires WiseLoop PHP Graphic Works)
        echo '<a href="'.$image->getGrabbedUrl().'" style="float:left; margin:10px; background-color:#ffffff; padding:5px;">';  //link to full size image
        echo '<img src="'.$fxThumbPath.'"/>';                       //the thumb
        echo '</a>';
    }
}

The result:

tut1-res4.jpg
Our simple image gallery with thumb image processing using PHP Graphic Works
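
The effects chain contains spaces, commas, parentheses and semicolons, so it may be worth encoding the query string before placing it in the src attribute. The sketch below rests on an assumption we cannot verify here: that the live processor reads img and fx as ordinary, percent-decoded query parameters. buildFxUrl() is a hypothetical helper of ours.

```php
<?php
// Hypothetical helper: builds the url for the live processor with an encoded query string.
// The "img" and "fx" parameter names are taken from the code above.
// Assumption: the processor reads them as normal (percent-decoded) query parameters.
function buildFxUrl(string $processorPath, string $imageUrl, string $fxChain): string
{
    $query = http_build_query(
        array('img' => $imageUrl, 'fx' => $fxChain),
        '',
        '&',
        PHP_QUERY_RFC3986
    );
    return $processorPath . '?' . $query;
}

$fxChain = 'CropAlign(center-center, 100, 70);Mask(rounded, 20);Reflection();';
$url = buildFxUrl('./../../php-graphic-works/live/do.php', 'download/bus_t.jpg', $fxChain);

// Escape once more for use inside an HTML attribute.
echo '<img src="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '"/>', "\n";
```

If the processor turns out to expect the raw, unencoded chain, keep the plain concatenation from the code above instead.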

You can buy WiseLoop PHP Graphic Works from here: http://codecanyon.net/item/php-graphic-works/177929

Future Developments

By now you can see that WiseLoop PHP Web Media Grabber is a powerful tool that can help you develop complex, specialized media grabbers for various websites.
You could wrap the code above into a nice class that behaves like a native grabber, in order to grab, extract and download from flickr with only 3 lines of code!
The class definition could look like this:

class wlWmgFlickrGrabber extends wlWmgImageGrabber {
    public function __construct($search) {
        $url = 'http://www.flickr.com/search/?q='.urlencode($search).'&f=hp';   //url-encoding the keyword so spaces and special characters do not break the url
        parent::__construct($url, array('.jpg', '.jpeg'));
        $this->setGrabOnlyFromTagSlice('photo-display-container');
        $this->setFollowSubpagesLinks(true);
        $this->setFollowedSubpageGrabOnlyFromTagSlice('main-photo-container');
    }
}

Usage:

$flickr = new wlWmgFlickrGrabber('school');                                 //create the flickr grabber object
$flickr->setDoDownload(true);
$flickr->grab();
echo '<pre>'.print_r($flickr->getValidMediaTable(true), true).'</pre>';     //displaying the result array

Short Information

If you want to grab, retrieve or download various media files (images, videos, audio, flash files, documents, javascript sources, css stylesheet files etc.) from a public web site and use them locally within your site, PHP Web Media Grabber is the perfect tool to help you do that.
Extract and download anything from anywhere to your localhost server! Use the downloaded media locally in your website without external hot-linking.
