WiseLoop PHP Web Media Grabber 3.1.1
Advanced PHP web resources extractor
wlWmgProcessor Class Reference

Public Member Functions

 __construct ($targetUrl)
 setUrl ($targetUrl)
 getUrl ()
 getCurl ()
 getCurlTitle ()
 setDoDownload ($doDownload)
 getDoDownload ()
 setLimit ($limit)
 getLimit ()
 setGrabOnlyFromTagSlice ($tagSlice)
 getGrabOnlyFromTagSlice ()
 setHtmlContent ($htmlContent)
 getHtmlContent ()
 setCacheTime ($cacheTime)
 getCacheTime ()
 setBaseDownloadDir ($path)
 getBaseDownloadDir ()
 addFilter ($filter)
 makeLocalFileName ($media)
 hasError ()
 getError ()
 getLocalDownloadDirName ()
 getLocalDownloadDir ()
 getLocalCacheDir ()
 getLocalDownloadUrl ()
 getLogPath ()
 getGrabberTypeName ()
 setGrabberTypeName ($grabberTypeName)
 removeResultByKey ($key)
 isFilterAccomplished ($media, $filter)
 grab ()
 getValidMedia ()
 getInvalidMedia ()
 hasInvalidMedia ()
 countValidMedia ()
 countInvalidMedia ()
 getMediaRoots ()
 getMediaKids ($parent)
 getValidMediaNames ()
 getValidMediaTable ($withAHref=true, $rootsOnly=false)
 getInvalidMediaTable ()

Protected Member Functions

 setLocalDownloadDir ($localDownloadDir)
 process ()
 mainFilter ()
 beforeValidate ()
 ready ()
 clean ()
 applyFilters ($when)
 getSearchStringsFilter ()
 hasFilters ($type=null)

Protected Attributes

 $_curl
 $_filters
 $_limit
 $_validMedia
 $_invalidMedia
 $_curlTitle

Private Member Functions

 prepare ()
 loadUrl ()
 normalizeMedia ()
 validateMedia ()
 download ()
 isCacheUpdated ($cacheFilePath)

Private Attributes

 $_error
 $_htmlContent
 $_doDownload
 $_grabOnlyFromTagSlice
 $__localDownloadDir
 $__localCacheDir
 $__localDownloadUrl
 $_htmlCacheTime
 $_grabberTypeName
 $_baseDownloadDir
 $_logger

Detailed Description

WiseLoop Web Media Grabber Processor class definition
This class retrieves the various media referenced or contained by the page at a url and stores them in the $_validMedia array variable.
The media can be filtered using wlWmgFilter objects and can also be downloaded to the local server for future use.

Note:
WiseLoop takes no responsibility if the targeted url changes its tag structure or its HTML DOM tree, resulting in unexpected data retrieval; this will not be considered a malfunction or bug. In that case, check the targeted url's HTML DOM tree for changes and modify the code that instantiates this class or any inherited classes.
Also, WiseLoop assumes no responsibility for any abusive use of this class and/or violation of the terms of use of the target url.
See also:
wlWmgMedia, wlWmgFilter
Author:
WiseLoop
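
A minimal usage sketch of the public interface described above. The helper name runGrab is hypothetical and not part of the library; it only calls methods listed in this reference (setLimit, setDoDownload, grab, hasError, getError, countValidMedia). The guarded call at the end assumes the library is loaded and that wlWmgImageGrabber can be constructed from a target url alone, which this reference does not confirm.

```php
<?php
// Hypothetical helper (not part of the library): drives any object that
// exposes the public methods listed in this reference.
function runGrab($processor, $limit = 10)
{
    $processor->setLimit($limit);      // zero stands for "no limit" (see $_limit)
    $processor->setDoDownload(false);  // collect media only, do not download
    $results = $processor->grab();     // documented to return an array

    if ($processor->hasError()) {
        return array('ok' => false, 'error' => $processor->getError(),
                     'count' => 0, 'results' => array());
    }
    return array('ok' => true, 'error' => '',
                 'count' => $processor->countValidMedia(), 'results' => $results);
}

// Assumption: the library is loaded and this grabber takes only a target url.
if (class_exists('wlWmgImageGrabber')) {
    $summary = runGrab(new wlWmgImageGrabber('http://example.com/'));
    echo $summary['ok']
        ? $summary['count'] . " media found\n"
        : $summary['error'] . "\n";
}
```

Checking hasError() right after grab() keeps the failure case (for example a target url that could not be loaded) separate from a successful grab that simply found nothing.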

Constructor & Destructor Documentation

__construct ( targetUrl)

Constructor.
Creates a wlWmgProcessor object.

Parameters:
string $targetUrl - real target url to be parsed, scanned and processed
Returns:
wlWmgProcessor

Reimplemented in wlWmgCssGrabber, and wlWmgJsGrabber.


Member Function Documentation

addFilter ( filter)

Adds a filter to the current media grabber processor that will be used to filter the grabbed media.

Parameters:
wlWmgFilter $filter - the filter
Returns:
void
applyFilters ( when) [protected]

Filters the valid media array, leaving only media that meet the filter conditions.

Parameters:
string $when - the filter time
Returns:
void
See also:
addFilter, wlWmgFilter::BEFORE_URL_VALIDATION, wlWmgFilter::AFTER_URL_VALIDATION
beforeValidate ( ) [protected]

This routine is designed to be overridden in child classes if additional processing is needed.
This method is called just before the validateMedia() method.

Returns:
void
clean ( ) [protected]

Cleanup code (freeing memory & stuff).

Returns:
void
countInvalidMedia ( )

Returns the invalid media count.

Returns:
int
countValidMedia ( )

Returns the valid media count.

Returns:
int
download ( ) [private]

Downloads (only if the download switch variable is true) all the media contained in the valid media array.

Returns:
void
getBaseDownloadDir ( )

Returns the local parent directory full path of the downloaded media.

Returns:
string
getCacheTime ( )

Returns the HTML page caching time.

Returns:
int
getCurl ( )

Returns the associated wlCurl object.

Returns:
wlCurl
getCurlTitle ( )

Returns the title tag contents of the targeted url (the targeted page title).

Returns:
string
getDoDownload ( )

Returns the download switch variable: whether the grabbed media will be downloaded on localhost.

Returns:
bool
getError ( )

Returns the error message of the current media grabber.

Returns:
string
getGrabberTypeName ( )

Returns the grabber type name (used to compute media directory name).

Returns:
string
getGrabOnlyFromTagSlice ( )

Returns the tag (or tag slice) whose contents will be searched by the processor to find media.

Returns:
string
getHtmlContent ( )

Returns the HTML content (the target url page contents) that embeds the media.

Returns:
string
getInvalidMedia ( )

Returns the invalid media array.

Returns:
array
getInvalidMediaTable ( )

Returns a nice array containing all the needed information about the invalid grabbed media.

Returns:
array
getLimit ( )

Returns the media limit count to be grabbed.

Returns:
int
getLocalCacheDir ( )

Computes and returns the local cache directory path.

Returns:
string
getLocalDownloadDir ( )

Computes and returns the local full download directory path.

Returns:
string
getLocalDownloadDirName ( )

Computes and returns the local download directory name.

Returns:
string
getLocalDownloadUrl ( )

Computes and returns the local url path of the downloaded media directory.

Returns:
string
getLogPath ( )

Returns the log file path.

Returns:
string
getMediaKids ( parent)

Returns an array consisting of children of the given parent media.

Parameters:
wlWmgMedia $parent
Returns:
array
getMediaRoots ( )

Returns an array of valid media items that have no parents.

Returns:
array
getSearchStringsFilter ( ) [protected]

Returns the first search type filter found in the filters array.

Returns:
wlWmgFilter the first search type filter found in the filters array, or null if no search type filter is found
getUrl ( )

Returns the target url string to be parsed, scanned and processed.

Returns:
string
getValidMedia ( )

Returns the valid media array.

Returns:
array
getValidMediaNames ( )

Returns an array containing the media names.

Returns:
array
getValidMediaTable ( withAHref = true, rootsOnly = false)

Returns a nice array containing all the needed information about the valid grabbed media.

Parameters:
bool $withAHref - specifies if the media url should be enveloped into an a href link tag
bool $rootsOnly - specifies if only the valid media items that have no parents should be returned
Returns:
array
grab ( )

Returns the grabbed results.

Returns:
array the grabbed results
hasError ( )

Returns whether the current media grabber has encountered any errors during the grabbing process.

Returns:
bool
hasFilters ( type = null) [protected]

Returns whether the grabbing processor has filters (optionally of a specified type).

Parameters:
string $type - the filter type; if null, the method returns whether the grabbing processor has filters of any type
Returns:
bool
hasInvalidMedia ( )

Tests if the current media grabber has invalid media.

Returns:
bool
isCacheUpdated ( cacheFilePath) [private]

Tests if an HTML page cache is up to date.

Parameters:
string $cacheFilePath - path of the HTML cache file

Returns:
bool
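
A minimal sketch of how such a freshness test could work, assuming the cache is a plain file whose modification time is compared against the caching time in minutes (see setCacheTime). The function name isCacheFresh and the optional $now parameter are hypothetical; the library's private implementation is not shown in this reference.

```php
<?php
// Hypothetical stand-in for the private isCacheUpdated(); assumes the
// cache is "up to date" while the file is younger than the cache time.
function isCacheFresh($cacheFilePath, $cacheTimeMinutes, $now = null)
{
    if (!is_file($cacheFilePath)) {
        return false; // nothing has been cached yet
    }
    $now = ($now === null) ? time() : $now;
    $ageSeconds = $now - filemtime($cacheFilePath);

    // The caching time is expressed in minutes, so convert to seconds.
    return $ageSeconds < $cacheTimeMinutes * 60;
}

// Usage: a freshly written file is fresh for a 10 minute cache time.
$file = tempnam(sys_get_temp_dir(), 'wmg');
var_dump(isCacheFresh($file, 10));
unlink($file);
```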
isFilterAccomplished ( media, filter)

Tests whether the conditions of a filter are met for a given media item.

Parameters:
wlWmgMedia $media - the media
wlWmgFilter $filter - the filter
Returns:
bool if the filter is accomplished

Reimplemented in wlWmgImageGrabber.

loadUrl ( ) [private]

Reads the entire content of the targeted url.

Exceptions:
Exception
Returns:
string the contents of the targeted url
mainFilter ( ) [protected]

Filters the valid media array by leaving only media that meet the current grabbing goal (media type).
Although this method is quite general, for specific media types (such as images) it may need to be overridden in order to include all the conditions necessary to grab the desired media type.

Returns:
void

Reimplemented in wlWmgFlashGrabber, wlWmgImageGrabber, and wlWmgVideoGrabber.

makeLocalFileName ( media)

Generates a valid local file name for the passed media object.

Parameters:
wlWmgMedia $media
Returns:
string the local media file name
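
One way a "valid local file name" could be derived is sketched below. This is an illustration only: the function name localFileNameFromUrl is hypothetical, it takes a plain url string rather than a wlWmgMedia object, and the rule of replacing every character outside letters, digits, dot, underscore and hyphen is an assumption, not the library's documented behaviour.

```php
<?php
// Hypothetical sketch: derive a file-system-safe name from a media url.
function localFileNameFromUrl($mediaUrl)
{
    // Use only the path part, so query strings do not leak into the name.
    $path = parse_url($mediaUrl, PHP_URL_PATH);
    $name = basename(is_string($path) ? $path : '');

    // Decode %XX escapes, then replace unsafe characters with underscores.
    $name = preg_replace('/[^A-Za-z0-9._-]+/', '_', rawurldecode($name));

    // Fall back to a generic name when the url has no usable file part.
    return ($name === '' || $name === '.' || $name === '..') ? 'media' : $name;
}

echo localFileNameFromUrl('http://example.com/img/My%20Photo.JPG?size=2'), "\n";
```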
normalizeMedia ( ) [private]

Normalizes the media urls by making them absolute.

Returns:
void
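
Making a media url absolute can be sketched as follows. The function name makeAbsoluteUrl is hypothetical and the logic is a simplified resolver, not the library's code: it assumes the page url is itself absolute (scheme and host present) and does not cover every case of RFC 3986 reference resolution.

```php
<?php
// Hypothetical sketch: resolve a media url found in a page against the
// url of that page.
function makeAbsoluteUrl($baseUrl, $mediaUrl)
{
    // Already absolute (has a scheme such as http: or https:): keep as is.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $mediaUrl)) {
        return $mediaUrl;
    }
    $base = parse_url($baseUrl);

    // Protocol-relative url: reuse the scheme of the page.
    if (substr($mediaUrl, 0, 2) === '//') {
        return $base['scheme'] . ':' . $mediaUrl;
    }
    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');

    // Split off any query string or fragment so it is left untouched.
    $cut = strcspn($mediaUrl, '?#');
    $suffix = (string) substr($mediaUrl, $cut);
    $path = substr($mediaUrl, 0, $cut);

    // Relative path: prepend the directory of the page.
    if (substr($path, 0, 1) !== '/') {
        $basePath = isset($base['path']) ? $base['path'] : '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Collapse "." and ".." segments.
    $segments = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '' && $segment !== '.') {
            $segments[] = $segment;
        }
    }
    $trailing = (count($segments) > 0 && substr($path, -1) === '/') ? '/' : '';
    return $origin . '/' . implode('/', $segments) . $trailing . $suffix;
}

echo makeAbsoluteUrl('http://example.com/gallery/index.html', '../css/site.css?v=2'), "\n";
```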
prepare ( ) [private]

Loads the targeted url page contents, gets its title, creates the local directories and fills the valid media array with media found using the first search type filter retrieved from the filters array.

Returns:
bool true if everything was ok, false if the targeted url could not be loaded or no search type filter was found in the filters array
process ( ) [protected]

Executes the grabbing procedure.

Returns:
array
ready ( ) [protected]

This routine is designed to be overridden in child classes if additional processing is needed.
This method is called just before the clean() and download() methods.

Returns:
void

Reimplemented in wlWmgImageGrabber.

removeResultByKey ( key)

Removes a media item from the valid media array and moves it to the invalid media array.

Parameters:
int $key - the array key
Returns:
void
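
The move described above can be illustrated on plain arrays. The function name moveToInvalid is hypothetical, and whether the library re-indexes the valid array afterwards is not stated in this reference; the sketch leaves the remaining keys unchanged.

```php
<?php
// Hypothetical sketch: move one entry from the valid list to the invalid
// list by its array key. Remaining keys are NOT re-indexed (an assumption).
function moveToInvalid(array &$valid, array &$invalid, $key)
{
    if (!array_key_exists($key, $valid)) {
        return false; // unknown key, nothing to move
    }
    $invalid[] = $valid[$key];
    unset($valid[$key]);
    return true;
}

$valid = array('a.png', 'b.png', 'c.png');
$invalid = array();
moveToInvalid($valid, $invalid, 1);
print_r($valid);   // keys 0 and 2 remain
print_r($invalid); // contains 'b.png'
```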
setBaseDownloadDir ( path)

Sets the local parent directory full path of the downloaded media.

Parameters:
string $path
Returns:
void
setCacheTime ( cacheTime)

Sets the HTML page caching time.

Parameters:
int $cacheTime - the new caching time expressed in minutes
Returns:
void
setDoDownload ( doDownload)

Sets the download switch variable: whether the grabbed media will be downloaded on localhost.

Parameters:
bool $doDownload - specifies whether the grabbed media will be downloaded on localhost
Returns:
void
setGrabberTypeName ( grabberTypeName)

Sets the grabber type name (used to compute media directory name).

Parameters:
string $grabberTypeName
Returns:
void
setGrabOnlyFromTagSlice ( tagSlice)

Sets the tag (or tag slice) whose contents will be searched by the processor to find media.

Parameters:
string $tagSlice
Returns:
void
setHtmlContent ( htmlContent)

Sets the HTML content (the target url page contents) that embeds the media.
You can use this method if you have an external HTML content and you don't want or need to parse a real url.

Parameters:
string $htmlContent
Returns:
void
setLimit ( limit)

Sets the media limit count to be grabbed.

Parameters:
int $limit - the new media limit count
Returns:
void
setLocalDownloadDir ( localDownloadDir) [protected]

Sets (forced) the local download directory path.

Parameters:
string $localDownloadDir - the new local download directory path
Returns:
void
setUrl ( targetUrl)

Sets the target url to be parsed, scanned and processed.

Parameters:
string $targetUrl - real target url to be parsed, scanned and processed
Returns:
void
validateMedia ( ) [private]

Validates the found media array:

  • checks the HTTP header responses and moves invalid media from the valid media array to the invalid media array;
  • removes duplicates;
  • applies the count limit filter;
  • applies the AFTER_URL_VALIDATION time type filters.

Returns:
void
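
The first three validation steps listed above can be sketched on plain arrays. Everything here is illustrative: the function name validateMediaSketch, the item shape (an array with 'url' and 'status' keys) and the rule that only 2xx HTTP statuses count as valid are assumptions. The statuses are supplied by the caller rather than fetched, items over the limit are simply dropped, and the final AFTER_URL_VALIDATION filter step is left out.

```php
<?php
// Hypothetical sketch of the validation order: header check, duplicate
// removal, then the count limit. No network access is performed.
function validateMediaSketch(array $media, $limit = 0)
{
    $valid = array();
    $invalid = array();

    // 1. Items whose HTTP status is not 2xx go to the invalid list.
    foreach ($media as $item) {
        if ($item['status'] >= 200 && $item['status'] < 300) {
            $valid[] = $item;
        } else {
            $invalid[] = $item;
        }
    }

    // 2. Remove duplicates, keeping the first occurrence of each url.
    $unique = array();
    foreach ($valid as $item) {
        if (!isset($unique[$item['url']])) {
            $unique[$item['url']] = $item;
        }
    }
    $valid = array_values($unique);

    // 3. Apply the count limit; zero stands for no limit.
    if ($limit > 0) {
        $valid = array_slice($valid, 0, $limit);
    }
    return array('valid' => $valid, 'invalid' => $invalid);
}

$result = validateMediaSketch(array(
    array('url' => 'http://example.com/a.png', 'status' => 200),
    array('url' => 'http://example.com/b.png', 'status' => 404),
    array('url' => 'http://example.com/a.png', 'status' => 200),
), 5);
echo count($result['valid']), " valid, ", count($result['invalid']), " invalid\n";
```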

Field Documentation

$__localCacheDir [private]

Type: string - the local cache directory full path

$__localDownloadDir [private]

Type: string - the local download directory full path

$__localDownloadUrl [private]

Type: string - the local url path of the downloaded media

$_baseDownloadDir [private]

Type: string - the local parent directory full path of the downloaded media; if not set, the wlWmgConfig::DOWNLOAD_DIR() will be used

$_curl [protected]

Type: wlCurl - the curl object that holds the real target url to be parsed, scanned and processed

$_curlTitle [protected]

Type: string - the title tag contents of the targeted url (the targeted page title)

$_doDownload [private]

Type: bool - specifies if the grabbed media will be downloaded on localhost

$_error [private]

Type: string - a message that indicates what was wrong with the grabbing, empty string if everything was ok

$_filters [protected]

Type: array - objects of wlWmgFilter type used to filter the grabbed media

$_grabberTypeName [private]

Type: string - the grabber type (used to compute media directory name)

$_grabOnlyFromTagSlice [private]

Type: string - the processor will grab only the media found inside the contents of the tag specified here; an incomplete tag (tag slice) can also be specified, and the tag will be auto-completed depending on the contextual HTML content

$_htmlCacheTime [private]

Type: int - HTML page caching time expressed in minutes

See also:
wlWmgConfig
$_htmlContent [private]

Type: string - holds the HTML content that embeds the media (usually read from the target url)

$_invalidMedia [protected]

Type: array - the invalid media found (broken links, invalid headers etc.), consisting of wlWmgMedia objects

$_limit [protected]

Type: int - media limit count (how many media items will be grabbed), zero stands for no limit

$_logger [private]

Type: wlWmgLogger - logger object

$_validMedia [protected]

Type: array - the valid media found, consisting of wlWmgMedia objects


The documentation for this class was generated from the following file: