WiseLoop PHP Web Media Grabber 3.1.1
Advanced PHP web resources extractor
wlWmgImageGrabber Class Reference

Public Member Functions

 __construct ($url, $fileExtensions=null, $checkContentType=true)
 setFollowDirectImageLinks ($followFlag)
 getFollowDirectImageLinks ()
 setFollowSubpagesLinks ($followFlag)
 getFollowSubpagesLinks ()
 setFollowedSubpageImgLimitCount ($limit)
 getFollowedSubpageImgLimitCount ()
 setFollowedSubpageGrabOnlyFromTagSlice ($tagSlice)
 getFollowedSubpageGrabOnlyFromTagSlice ()
 isFilterAccomplished ($media, $filter)

Static Public Member Functions

static getImageFileExtensions ()
static getImageFormats ()

Data Fields

const IMAGE_FILE_EXT = '.jpg|.jpeg|.gif|.png|.bmp|.tif|.tiff|.yuv|.ai|.eps|.ps|.svg|.drw|.ief|.jfif|.svg|.cod|.ras|.cmx|.ico|.pnm|.pbm|.pgm|.rgb|.xbm|.xpm|.xwd'

Protected Member Functions

 mainFilter ()
 ready ()

Private Attributes

 $_fileExtensions
 $_checkContentType
 $_followDirectImageLinks
 $_followSubpagesLinks
 $_followedSubpageImgLimitCount
 $_followedSubpageGrabOnlyFromTagSlice

Detailed Description

WiseLoop Web Image Grabber Processor class definition
This class retrieves the images referenced or contained by the page at a given URL and stores them in the $_validMedia array variable.
It uses the capabilities of the base class wlWmgProcessor to search the page for images (referenced in img src attributes, in a href links to full-size images, or in inline CSS backgrounds).
WiseLoop Web Image Grabber main features:

  • smart image recognition (all formats and extensions, in all locations: in the img src attribute, under an a href link, in an inline CSS attribute, or by content-type);
  • native support by default for the most common web image extensions (jpg, jpeg, gif, png, bmp, tif, tiff, yuv, ai, eps, ps, svg, drw, ief, jfif, cod, ras, cmx, ico, pnm, pbm, pgm, rgb, xbm, xpm, xwd);
  • a href link following: the grabbing engine can follow a href links that lead to other images; this powerful feature helps grab entire image galleries (thumbs and full-size images) that display only the thumbs on the starting page, with each thumb linked through an a href tag to the real full-size image;
  • parent/child image relations: when grabbing image galleries with a href link following enabled, each followed thumb is set as the parent of the full-size image found behind it, so for every grabbed thumb you know the corresponding full-size image and vice versa;
  • inline CSS background image recognition: the grabbing engine can identify images referenced inline in the CSS background or background-image properties;
  • image search and identification by the HTTP Content-Type response header: the grabbing engine identifies more than the obvious image resources that have common image file extensions; it also finds images generated dynamically by servers, and images that have no valid image extension or no extension at all; identification is done by checking the server response headers when requesting the tested media resource;
  • image extension filtering: only images with the specified extensions are included in the grabbing results;
  • image dimensions filtering: only images with the specified dimensions (width / height) are included in the grabbing results;
  • image format filtering: only images with the specified format (portrait or landscape) are included in the grabbing results;
  • media URL name (filename) filtering: only images whose URL names contain the specified strings are included in the grabbing results;
  • media size filtering: only images with the specified size (in bytes) are included in the grabbing results;
  • image count limiter: the number of grabbed images is limited to a specified value;
  • HTML area searching: the grabbing engine can search for images only inside a designated HTML area specified by a tag; this lets you skip unwanted pictures from the start by narrowing the full target page to the content of a single tag; an incomplete tag (tag slice) can also be specified, and the tag is completed automatically depending on the surrounding HTML content;
  • downloading capability: the WiseLoop PHP Image Grabber can download the grabbed images to the local server, so they can later be referenced or used as local resources.
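
To make the inline CSS background feature listed above more concrete, here is a minimal sketch of how image URLs might be pulled out of a style attribute value. The function name extractInlineCssBackgroundUrls and the regular expression are assumptions made for illustration; they are not part of the WiseLoop API and may differ from what the library does internally.

```php
<?php
// Hypothetical helper (not part of the WiseLoop API): returns the URLs
// referenced by background / background-image declarations in an inline
// style attribute value.
function extractInlineCssBackgroundUrls(string $style): array
{
    // Matches "background" or "background-image", then the first url(...)
    // within the same declaration; quotes around the URL are optional.
    $pattern = '/background(?:-image)?\s*:[^;]*?url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)/i';
    preg_match_all($pattern, $style, $matches);

    // $matches[1] holds the captured URLs; trim stray whitespace.
    return array_map('trim', $matches[1]);
}

// Example: a thumb container styled inline.
$urls = extractInlineCssBackgroundUrls("background-image: url('img/a.png'); color: red");
```

The sketch only captures the first url(...) of each declaration, which is enough for the common single-background case.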
Note:
WiseLoop takes no responsibility if the targeted URL changes its tag structure or its HTML DOM tree, resulting in unexpected data retrieval; this is not considered a malfunction or a bug. In that case, check the targeted URL's HTML DOM tree for changes and adjust the code that instantiates this class or any inherited classes.
Also, WiseLoop assumes no responsibility for any abusive use of this class and/or violation of terms of usage of the target url.
See also:
wlWmgProcessor
Author:
WiseLoop

Constructor & Destructor Documentation

__construct ($url, $fileExtensions = null, $checkContentType = true)

Constructor.
Creates a wlWmgImageGrabber object.

Parameters:
string $url - the target page URL
string | array $fileExtensions - image extensions filter; if string, the types should be separated by '|' (ex. '.jpg|.png|.gif')
bool $checkContentType - if true, the grabber engine also checks the content-type of the grabbed media to make sure it is an image
Returns:
wlWmgImageGrabber
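
As an illustration of the two accepted shapes of $fileExtensions ('|'-separated string or array), the sketch below normalizes either form into a list and tests a URL against it. Both helper names are hypothetical, and the handling of null, case, and missing leading dots is an assumption of this sketch rather than documented library behavior.

```php
<?php
// Hypothetical helpers (not part of the WiseLoop API).

// Turns '.jpg|.png' or ['.jpg', '.png'] into a clean, de-duplicated list.
// Assumption: null yields an empty list; the library's own treatment of
// null is not described in this reference.
function normalizeExtensions($fileExtensions): array
{
    if ($fileExtensions === null) {
        return [];
    }
    $parts = is_array($fileExtensions) ? $fileExtensions : explode('|', $fileExtensions);

    $normalized = [];
    foreach ($parts as $part) {
        $ext = strtolower(trim((string) $part));
        if ($ext === '') {
            continue;
        }
        if ($ext[0] !== '.') {
            $ext = '.' . $ext; // tolerate 'jpg' as well as '.jpg'
        }
        $normalized[$ext] = true; // array keys de-duplicate
    }
    return array_keys($normalized);
}

// Checks the extension of the URL path (query string ignored).
function hasAllowedExtension(string $url, array $extensions): bool
{
    $path = (string) parse_url($url, PHP_URL_PATH);
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return $ext !== '' && in_array('.' . $ext, $extensions, true);
}
```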

Member Function Documentation

getFollowDirectImageLinks ( )

Returns the follow-direct-image-links flag, which specifies whether the grabbing engine should follow a href links that point directly to other images (direct link to the picture: the full-size image file is linked directly under the thumb, without being embedded in an HTML subpage).

Returns:
bool
See also:
ready()

getFollowedSubpageGrabOnlyFromTagSlice ( )

Returns the tag (or tag slice), common to all followed subpages, whose contents the processor will search for media.

Returns:
string
See also:
ready(), setFollowSubpagesLinks()

getFollowedSubpageImgLimitCount ( )

Returns the limit on how many images are checked on each followed subpage when searching for the full-size image of a gallery thumb.

Returns:
int
See also:
ready(), setFollowSubpagesLinks()

getFollowSubpagesLinks ( )

Returns the follow-subpages-links flag, which specifies whether the grabbing engine should follow a href links to subpages in order to search them for other images (link to an HTML subpage that embeds the full-size picture among other HTML elements).

Returns:
bool
See also:
ready()

static getImageFileExtensions ( ) [static]

Returns the available image file extensions.

Returns:
array

static getImageFormats ( ) [static]

Returns the available image formats (portrait or landscape).

Returns:
array

isFilterAccomplished ($media, $filter)

Parameters:
wlWmgMedia $media
wlWmgFilter $filter
Returns:
bool

Reimplemented from wlWmgProcessor.

mainFilter ( ) [protected]

Filters the media array, keeping only the valid images (found under links, img tags or inline CSS backgrounds).

Returns:
void

Reimplemented from wlWmgProcessor.

ready ( ) [protected]

Searches for additional images by following a href links if $_followDirectImageLinks or $_followSubpagesLinks is true.
The grabbing engine can follow a href links that lead to other images.
This powerful feature helps grab entire image galleries (thumbs and full-size images) that display only the thumbs on the starting page, with each thumb linked through an a href tag to the real full-size image.
A thumb can lead to its full-size image in two ways:

  • direct link to the picture: clicking the thumb makes the browser load and display the full-size image file directly, without embedding it in an HTML page; this is the easiest and fastest case, because the full-size images can be grabbed just by checking the direct links for valid image content-type responses;
  • link to an HTML subpage that embeds the full-size picture among other HTML elements (possibly including other pictures that are part of the subpage design); in this case the grabbing engine assumes that the largest image (in bytes) on the subpage is the full-size image being searched for.
    Grabbing full-size images from subpages can be quite slow, because the grabbing engine has to search for images in every subpage linked from a gallery thumb, and each subpage can contain many images; checking the sizes of all subpage images in order to identify the full-size image of each thumb can be time consuming.
    To avoid checking all the subpage images, you can set a limit with the setFollowedSubpageImgLimitCount($x) method; the grabbing engine then checks only the first $x images, which is appropriate if you are sure that the full-size image is among the first $x images embedded in the subpage. If you intend to grab an image gallery that holds the full-size images in separate HTML subpages linked under the thumbs, take a moment to study the subpages and set this limit; it can save a lot of processing, bandwidth and cache disk space, and it improves execution time.
Returns:
void
See also:
setFollowedSubpageImgLimitCount()

Reimplemented from wlWmgProcessor.
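
The subpage heuristic described above (take the largest image in bytes, optionally looking only at the first $x images) can be sketched as follows. The function pickFullSizeCandidate and its input shape (URL => size in bytes, in document order) are assumptions made for illustration; the library's internal implementation is not shown in this reference.

```php
<?php
// Hypothetical helper (not part of the WiseLoop API).
// $imageSizes maps image URL => size in bytes, in the order the images
// appear on the subpage. A $limit of 0 means "no limit", mirroring the
// documented default of the followed-subpage image limit.
function pickFullSizeCandidate(array $imageSizes, int $limit = 0): ?string
{
    if ($limit > 0) {
        // Only the first $limit images are considered.
        $imageSizes = array_slice($imageSizes, 0, $limit, true);
    }

    $bestUrl = null;
    $bestSize = -1;
    foreach ($imageSizes as $url => $size) {
        if ($size > $bestSize) {
            $bestSize = $size;
            $bestUrl = $url;
        }
    }
    return $bestUrl; // null when the subpage has no images
}

// Sample data, invented for the example.
$sizes = [
    'logo.png'   => 4000,
    'full.jpg'   => 250000,
    'banner.jpg' => 90000,
    'huge.jpg'   => 900000,
];
$withoutLimit = pickFullSizeCandidate($sizes);
$firstTwoOnly = pickFullSizeCandidate($sizes, 2);
```

The example also shows the trade-off of the limit: with a limit of 2 only logo.png and full.jpg are inspected, so a larger image further down the page is never considered.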

setFollowDirectImageLinks ($followFlag)

Sets the follow-direct-image-links flag, which specifies whether the grabbing engine should follow a href links that point directly to other images (direct link to the picture: the full-size image file is linked directly under the thumb, without being embedded in an HTML subpage).

Parameters:
bool $followFlag
Returns:
void
See also:
ready()

setFollowedSubpageGrabOnlyFromTagSlice ($tagSlice)

Sets the tag (or tag slice), common to all followed subpages, whose contents the processor will search for media.

Parameters:
string $tagSlice
Returns:
void
See also:
ready(), setFollowSubpagesLinks()

setFollowedSubpageImgLimitCount ($limit)

Sets the limit on how many images are checked on each followed subpage when searching for the full-size image of a gallery thumb.

Parameters:
int $limit - the new limit on images checked per subpage
Returns:
void
See also:
ready(), setFollowSubpagesLinks()

setFollowSubpagesLinks ($followFlag)

Sets the follow-subpages-links flag, which specifies whether the grabbing engine should follow a href links to subpages in order to search them for other images (link to an HTML subpage that embeds the full-size picture among other HTML elements).

Parameters:
bool $followFlag
Returns:
void
See also:
ready()
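
The setters documented above are typically used together when grabbing a thumb gallery. The sketch below wraps them in a small helper so the calls can be read in one place; configureGalleryGrabber is a hypothetical function, and the URL and tag slice in the usage comment are placeholders (the exact form a tag slice should take is an assumption here). Only the setter names come from this reference.

```php
<?php
// Hypothetical helper (not part of the WiseLoop API): applies gallery
// settings to any object exposing the setters documented above.
function configureGalleryGrabber(object $grabber, int $subpageImageLimit, ?string $tagSlice = null): object
{
    // Follow thumbs that link straight to an image file.
    $grabber->setFollowDirectImageLinks(true);
    // Also follow thumbs that link to an HTML subpage.
    $grabber->setFollowSubpagesLinks(true);
    // Check only the first N images on each followed subpage (0 = no limit).
    $grabber->setFollowedSubpageImgLimitCount($subpageImageLimit);
    // Optionally restrict the subpage search to one tag's contents.
    if ($tagSlice !== null) {
        $grabber->setFollowedSubpageGrabOnlyFromTagSlice($tagSlice);
    }
    return $grabber;
}

// Usage, assuming the WiseLoop library is loaded; values are placeholders:
// $grabber = new wlWmgImageGrabber('http://example.com/gallery', '.jpg|.png');
// configureGalleryGrabber($grabber, 5, '<div id="photo"');
```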

Field Documentation

$_checkContentType [private]

Type: bool - if true, the grabber engine also checks the content-type of the grabbed media to make sure it is an image;
this check adds processing, because the headers of each candidate media resource are downloaded in order to validate it as an image.
If false, the grabber engine relies only on the provided (or common) image file extensions; the grabbing process is faster but may also return non-image files.
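
A minimal sketch of what a content-type check of this kind could look like is given below. The two helper functions are hypothetical and operate on raw header lines, so they can be tried without any network access; how the library itself fetches and inspects headers is not described in this reference.

```php
<?php
// Hypothetical helpers (not part of the WiseLoop API).

// True when a Content-Type value denotes an image, e.g. "image/png" or
// "image/jpeg; charset=binary".
function isImageContentType(string $contentType): bool
{
    // Drop parameters after ';' and compare case-insensitively.
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));
    return strncmp($mediaType, 'image/', 6) === 0;
}

// Returns the last Content-Type value in a list of raw header lines
// (the last one belongs to the final response after any redirects).
function findContentType(array $headerLines): ?string
{
    $found = null;
    foreach ($headerLines as $line) {
        if (stripos($line, 'Content-Type:') === 0) {
            $found = trim(substr($line, strlen('Content-Type:')));
        }
    }
    return $found;
}

// With network access, PHP's built-in get_headers($url) returns such
// header lines (or false on failure):
// $lines = get_headers('http://example.com/image.php?id=1');
// $isImage = $lines !== false && isImageContentType((string) findContentType($lines));
```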

$_fileExtensions [private]

Type: array|string - image extensions filter; if string, the types should be separated by '|' (ex. '.jpg|.png|.gif')

$_followDirectImageLinks [private]

Type: bool - specifies whether the grabbing engine should follow a href links that point directly to other images (direct link to the picture: the full-size image file is linked directly under the thumb, without being embedded in an HTML subpage)

See also:
$_followSubpagesLinks

$_followedSubpageGrabOnlyFromTagSlice [private]

Type: string - the processor will test only the media found inside the contents of the subpage tag specified here;
an incomplete tag (tag slice) can also be specified, and the tag is completed automatically depending on the surrounding HTML content.
This makes sense only if $_followSubpagesLinks is set to true.

See also:
$_followSubpagesLinks, $_followedSubpageImgLimitCount, $_followedSubpageGrabOnlyFromTagSlice

$_followedSubpageImgLimitCount [private]

Type: int - specifies how many images should be checked on each followed subpage in order to find the full-size image of a gallery thumb; the default value is zero, meaning that no limit is applied: all images from all followed subpages are checked (this can be very slow).
This makes sense only if $_followSubpagesLinks is set to true.

See also:
$_followSubpagesLinks, $_followedSubpageImgLimitCount, $_followedSubpageGrabOnlyFromTagSlice

$_followSubpagesLinks [private]

Type: bool - specifies whether the grabbing engine should follow a href links to subpages in order to search them for other images (link to an HTML subpage that embeds the full-size picture among other HTML elements)

See also:
$_followDirectImageLinks, $_followedSubpageImgLimitCount, $_followedSubpageGrabOnlyFromTagSlice

const IMAGE_FILE_EXT = '.jpg|.jpeg|.gif|.png|.bmp|.tif|.tiff|.yuv|.ai|.eps|.ps|.svg|.drw|.ief|.jfif|.svg|.cod|.ras|.cmx|.ico|.pnm|.pbm|.pgm|.rgb|.xbm|.xpm|.xwd'

Image file extensions

