CleanCode Perl Libraries
Multi-Lingual Library Maintainability
Available: Perl. Not available: Java, JavaScript, Certified Class, Testable Class, Standalone Mode, Diagnostic Enabled.

NAME

Net::DataMining - Provides a mechanism for extracting data for a list of items from a set of web sites.

SYNOPSIS

(This is actually a complete standalone program.)

        use Net::DataMining;
        my $STD_PAT = $Net::DataMining::STD_PAT;   # local copy, for brevity
        my @extractionSpec = ( {
                        urlTemplate => "http://taxonomy.com/list.cgi?type=%s",
                        pageName => "shapes",
                        columnList => [
                                { dispList => "Color", regExpHead => "Color (.*?)" },
                                { dispList => ["Size-min","Size-max"],
                                        regExpHead => "Size/cm", regExpTail => "$STD_PAT$STD_PAT"
                                },
                                { dispList => "Voice", regExpHead => "Voice" },
                                . . .
                                ]
                },
                . . .
        );
        my @itemList = (
                [ "grebe", "shape53", "waterfowl" ],
                [ "merganser", "shape19", "waterfowl" ],
                [ "kookaburra", "shape3", "kingfisher" ],
                . . .
        );
        my $dataMiner = Net::DataMining->new(
                extractionSpec => \@extractionSpec,   # see examples and specifications
                fileName => "birdfile.txt",
                storagePath => "/docs/taxonomy/%s/%s",   # first %s: category; second: item
                itemName => "species",
                resumeItem => "kookaburra",
                goLive => 1,
                saveLiveFiles => 1
        );
        $dataMiner->process(\@itemList);  # see examples and specifications

REQUIRES

Perl 5.005, Data::Handy, File::Handy, Array::Slice, Proc::ExecJava, Java.URLReader (see below)

DESCRIPTION

This module grabs variable data across related web pages (varying by parameter) and unrelated web pages (varying by URL), saves all the web pages retrieved, creates a concise data table suitable for viewing, printing, or importing into a spreadsheet, and can even traverse cookie gateways.

Having said that, it does take a bit of effort to get set up. Here's a quick overview: In the constructor, you provide an extractionSpec which defines what, where, and how to mine data. The extractionSpec specifies one or more data series. Each data series describes how to mine a set of parameterized web pages defined by a urlTemplate. (The placeholder in the URL template is replaced by each item in the itemList, thereby yielding a set of web pages.) Each data series includes a columnList specifying one or more column data sets. Each column data set describes how to extract and reformat one or more pieces of data from each web page.

In the process method, you provide an itemList which defines the list of items for which to do data mining. Each item in the itemList specifies an item key to plug into the urlTemplate above, an alternate-key if needed, and a category for organizing the stored web pages.

While this module knows how to handle cookies, it does not handle the secure http protocol (https) because the underlying java library does not.

What's that, you wonder? Java? That brings me to one minor inconvenience. Currently my languages of interest are Java and Perl, so I tend to write in whatever suits my fancy at the time. My URL reader was already around in Java, not in Perl, so that's what I used.

Preliminary Setup

Determine Parameterized URLs

Record the URL for each web site you're interested in, then parameterize it with a placeholder for each item. So, for example, in http://www.avian.com/data.cgi?L=32&X=sparrow&I=432 it is pretty clear where the sparrow goes; simply replace it with a %s to get your urlTemplate. But some web sites use encoded values; you might just as well have seen something like http://www.avian.com/data.cgi?L=32&X=321&I=432. After you try a few different birds, you could probably determine that 321 is the value for sparrow. Record that value in your item list as the alternate-key for sparrow.
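
For instance, here is how the sparrow example might be recorded (a sketch; 321 is the hypothetical encoded value discussed above):

        # the %s placeholder is filled by each item's key (or alternate-key)
        my $urlTemplate = "http://www.avian.com/data.cgi?L=32&X=%s&I=432";
        # itemList entry recording 321 as the site's encoded value for sparrow:
        [ "sparrow", "321" ],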

Determine Cookies

If your site of interest uses cookies, you must obtain the necessary cookie manually. Any of the major browsers will divulge the cookie if you set the cookie access level to "prompt". So visit the site, record the cookie, and add that to your data series. This works fine in principle, but there might be a wrinkle. If the website provides a long-lasting cookie, you can just put it in your data series and forget it. But if the website creates quick-expiring cookies, you'll have to obtain a fresh cookie before every run of your program. Note that if you get a long-lasting cookie, your regular browser activities visiting the same site will generally not interfere with your data mining, nor will the data mining interfere with your browsing. Each will have separate cookies assigned.
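
Once recorded, the cookie simply becomes another field of the data series (see the cookie field under Data Series below); for example (a sketch; the cookie text is whatever your browser divulges):

        {
                urlTemplate => "http://www.avian.com/%s.asp",
                pageName => "monthBirdCount",
                cookie => "session=ab12cd34ef",   # pasted from the browser's cookie prompt
                . . .
        },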

Determine Patterns

Review the source for each web page to determine the regular expressions needed to capture the data you need. Ignore the HTML formatting; that will be automatically stripped out by Net::DataMining before attempting to pattern match. (Instruction on regular expressions is beyond the scope of this discussion; see my reference page for more information.) The regular expressions for the regExpHead are often just plain strings. For regExpTail, I provide a couple very basic ones, $STD_PAT and $STD_PAT_SKIP, which are frequently all that you need. $STD_PAT simply matches whitespace followed by non-whitespace. $STD_PAT_SKIP is the same, except it does not save the sub-expression. You may want to grab local copies for brevity:

        my $STD_PAT      = $Net::DataMining::STD_PAT;
        my $STD_PAT_SKIP = $Net::DataMining::STD_PAT_SKIP;

When creating your own, be careful when attempting to match what you think is a number field. If it is a dollar figure, it could have any of these ()$-,. characters as well. And, for any number, it could also have NA or -- or – (an en dash) or some other notation for an unknown or unavailable number.
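
As a concrete illustration, a tolerant pattern for such a field might look like this (a sketch only, not one of the module's built-in patterns):

        # a possibly-parenthesized, possibly-signed dollar figure,
        # or a common "unavailable" notation (NA, --, or an en dash)
        my $figure = qr/\(?-?\$?[\d,.]+\)?|NA|--|\x{2013}/;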

A typical scenario might be

        some label here (in thousands)   other-number-here   number-you-want-here

So a quite typical approach would be:

        regExpHead => 'some label here \(in thousands\)'   # single quotes keep the \( \) escapes intact
        regExpTail => "$STD_PAT_SKIP$STD_PAT"

That regExpTail grabs the second number; to grab the first, just use "$STD_PAT", while to grab the third, use "$STD_PAT_SKIP$STD_PAT_SKIP$STD_PAT". The regExpHead is a pattern also; you could use something like 'some.*?\(.*?\)'.

When data is matched (via regular expression sub-expressions), it is automatically converted to a pure number if it is dressed up with financial notation (as in (153) => -153), currency notation (as in $43.22 => 43.22), or "user-friendly" notation (as in 13,353,234 => 13353234), or any combination of these.
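
The conversion is roughly equivalent to this sketch (an illustration of the behavior, not the module's actual code):

        # normalize financial/currency/user-friendly notation to a plain number
        sub to_pure_number {
                my ($s) = @_;
                my $negative = ($s =~ /^\(.*\)$/);   # financial notation: (153) is negative
                $s =~ tr/()$,//d;                    # strip parens, dollar sign, commas
                return $negative ? 0 - $s : 0 + $s;
        }
        # to_pure_number("(153)")      yields -153
        # to_pure_number('$43.22')     yields 43.22
        # to_pure_number("13,353,234") yields 13353234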

If a sub-expression does not grab a value, this will show up in the final output as "--" (two hyphens).

Examples

This module is easier to visualize with a few examples.

Example 1

This example shows the absolute minimum specification to get some useful output. Consider this data series (one element in the extractionSpec list):

        {
        urlTemplate => "http://math.com/shapes?type=%s",
        pageName => "shapes",
        columnList => [
                { dispList => "Sides", regExpHead => "Sides" },
                { dispList => "Axes", regExpHead => "Axes of Symmetry" }
                ]
        }

The data series will retrieve a set of web pages, one for each item in the itemList:

        [
                [ "rhombus" ],
                [ "trapezoid" ],
                [ "square" ]
        ]

One such page is http://math.com/shapes?type=rhombus which will be saved as rhombus-2002-07-01.htm (that is, with the current date as part of the file name). A piece of the page looks like this:

        . . .
        <P>Rhombus Details</P><BR>
        <TABLE>
        <TR><TD>Sides</TD><TD>4</TD></TR>
        <TR><TD>Axes of Symmetry</TD><TD>2</TD></TR>
        . . .

The resultant output file will be:

        Item    Sides   Axes
        rhombus 4       2
        trapezoid       4       1
        square  4       4

(Fields are separated by a single tab, which is terribly useful to a spreadsheet or a word-processor, but alas, here it simply looks like misaligned columns.)

Example 2

Now we run into an issue: using just the regExpHead, the data we want is not uniquely specified! (Each zone label appears twice on the page, under both monthly and annual counts.) So we add in the startPat and stopPat modifiers to isolate the data:

        {
        urlTemplate => "http://www.avian.com/%s.asp",
        pageName => "monthBirdCount",
        startPat => 'Month',
        stopPat => 'Annual',
        columnList => [
                { dispList => "Zone1", regExpHead => "Zone 1" },
                { dispList => "Zone2", regExpHead => "Zone 2" },
                { dispList => "Zone3", regExpHead => "Zone 3" }
                ]
        }

Also, let's assume that what the web page wants to see and what we want to see for the item name are different. We specify both a key and an alternate-key in this itemList:

        [
                [ "diving ducks", "diving" ],
                [ "dabbling ducks", "dabbling" ],
                [ "shorebirds" ]
        ]

Here's a web page fragment (from http://www.avian.com/diving.asp) for diving ducks:

        <H2>Monthly Counts</H2>
        <P>Zone 1: 2341, Zone 2: 302, Zone 3: 8002</P>
        <H2>Annual Counts</H2>
        <P>Zone 1: 13132, Zone 2: 11302, Zone 3: 14820</P>

... which will yield, in part, this output:

        Item    Zone1   Zone2   Zone3
        diving ducks    2341    302     8002
        dabbling ducks...
        . . .

Example 3

Now let's say that the web page from the previous example looked like this:

        <H2>Monthly Counts -- N1   N2   N3   Zone 1    Zone 2    Zone 3</H2>
        <P>type A  93, 309, 11, 2341, 302, 8002</P>
        <P>type B  1, 8, 395, 293, 2305, 35</P>

That is, three pieces of data, rather than one, are needed. Also, we want to skip the first three numbers and get the last three. Here we will have to explicitly set the regExpTail, which we've used implicitly till now. Also, it is vital that the number of elements in dispList match the number of sub-expression collectors in regExpTail.

        columnList => [
                { dispList => ["Zone1","Zone2", "Zone3"],
                  regExpHead => "type A",
                  # group the skip pattern before quantifying; a bare "$STD_PAT_SKIP{3}"
                  # would interpolate in Perl as a hash-element lookup
                  regExpTail => "(?:$STD_PAT_SKIP){3}$STD_PAT$STD_PAT$STD_PAT"
                  },
                ]

This will produce three columns of output labelled Zone1, Zone2, and Zone3, in that order, in the resultant table.

What if you want the columns in a different order? If you want Zone3, Zone2, and Zone1, i.e., a simple reversal, just add reverseOrder => 1 to the column data set. If you want some intervening columns, split the column data set into multiple entries: extract Zone1, then some other column, then Zone2 and Zone3, as sketched below. If you wish to mix columns between web pages, apply the same principle to multiple data series.
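
Here is what that split might look like for the sample page above (a sketch; the N1 column reuses the page's first data column):

        columnList => [
                { dispList => "Zone1",
                  regExpHead => "type A",
                  regExpTail => "(?:$STD_PAT_SKIP){3}$STD_PAT" },        # skip 3, take 1
                { dispList => "N1",
                  regExpHead => "type A",
                  regExpTail => "$STD_PAT" },                            # take the first number
                { dispList => ["Zone2","Zone3"],
                  regExpHead => "type A",
                  regExpTail => "(?:$STD_PAT_SKIP){4}$STD_PAT$STD_PAT" } # skip 4, take 2
                ]

This yields the columns Zone1, N1, Zone2, Zone3, in that order.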

If you are reading closely, you might wonder about those commas in the sample HTML above. They are, in fact, included in the match by $STD_PAT. We can still use $STD_PAT, though: Recall that matches are automatically converted to pure numbers. And since a comma is one of the characters stripped in that conversion, it goes away harmlessly.

Data Series

A data series describes how to mine the same web page for every item, and consists of these fields:

pageName

String; descriptive name of the web page. This is used as a file-name prefix for the stored web page and for program status displays. It is also used in conjunction with a nameSelector in the object constructor: if a nameSelector is specified, only those series whose pageName matches will be used to collect web pages and generate output.

type

String; category of this data series; used in conjunction with a typeSelector in the object constructor. If a typeSelector is specified, only those series whose type matches will be used to collect web pages and generate output.

urlTemplate

String; URL for the web page containing a single %s placeholder where each item is inserted. Additional placeholders (any printf placeholders) may be put in this template string if needed. Each will be filled from a list of additional arguments supplied in your itemList (see the process method for details).

cookie

Optional; string; cookie required to access a web page. Note that you have to supply the text of the cookie; once you have it, this module will make a repetitive task easy.

startPat

Optional; regular expression string; denotes the start of the active data region of the web page. If omitted, the entire web page is used.

stopPat

Optional; regular expression string; denotes the end of the active data region of the web page. If omitted, scans to the end of the web page.

regExpTail

Optional; regular expression string; defines the sub-expression to capture data from the web page. This tail is concatenated with the regExpHead for each column data set to form the complete regular expression used. This may be overridden on an individual basis by a column data set. If omitted and not overridden, the module default $STD_PAT is used: whitespace followed by non-whitespace.

noData

Optional; regular expression string; a marker to indicate nothing was found. Typically, if a database error occurs on a server, it might report something like "No data found for xxx". Specifying this simply provides a diagnostic message in such cases.

columnList

Array reference of hash references; each hash reference describes one or more column data sets, as detailed in the next section.

Column Data Set

A column data set--contained within a data series--describes how to extract and reformat data from a web page, and consists of these fields:

dispList

String or array reference; defines one or more column titles for the final data table. If more than one, use a reference to an array of strings. The number of strings must match the number of sub-expressions collected by the regExpTail. Furthermore, the order of strings must match the order of data collected by the sub-expressions. Columns are created in the same order.

numeric

Optional; boolean; the system can usually determine if a string should be interpreted as a number, but to avoid any ambiguity, you may specify this boolean flag. Example of an ambiguous item: the S&P stars value, stored as e.g. "2-".

reverseOrder

Optional; boolean; reverses the order of the columns collected. If the web page lists items ascending and you want descending, for example, use this switch.

startPat

Optional; regular expression string; marks the start of a subregion of the data region of the web page already selected by the series startPat and stopPat. If omitted, uses the entire region.

stopPat

Optional; regular expression string; marks the end of a subregion of the data region of the web page already selected by the series startPat and stopPat. If omitted, scans to the end of the series region.

divFactor

Optional; number; used to scale the number(s) collected. For example, if a number is 15,200,000 and the divFactor is 1,000,000, the resultant number returned will be 15.2. This allows for more concise data display in some cases. If omitted, no scaling is done.

regExpHead

This is the front half of a regular expression used to fetch some data. The back half is the most specific regExpTail available: either the column data set's regExpTail, the data series' regExpTail, or the Net::DataMining module default. Note that this parameter is typically a constant string, but may also be a function reference, allowing a value to be calculated on the fly. The function must take a single argument, the key value.

regExpTail

This is the back half of a regular expression used to fetch some data. The front half is regExpHead.

accessor

Optional; function reference; used to provide calculable data rather than data extracted from a web page. Typically used to add constant data to the output ("constant" in the sense of not coming from the remote web page, though possibly different for each key value, as determined by the accessor logic). If, for example, we have a set of stock symbols, and in our program these are associated with company names, we provide an accessor function to fetch the company name locally, adding that as a column to the output. If present, this function overrides any data specified via the regular expression properties above. The function must take a single argument, the key value.
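
Continuing the stock-symbol example, an accessor might look like this (a sketch; the lookup table and ticker symbols are hypothetical):

        my %companyName = ( KKD => "Krispy Kreme", HMC => "Honda Motor Co." );
        . . .
        { dispList => "Company",
          accessor => sub { my ($key) = @_; return $companyName{$key}; } }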

CLASS VARIABLES

$VERSION

Current version of this class.

$FILLER

A pattern to match whitespace. It consists of \s plus non-breaking spaces which show up on some web pages.

$STD_PAT

Most common pattern for recognizing spaces then a figure. A figure may be a plain number or a "dressed-up" dollar amount, i.e. including any of these characters: ()$-,.

$STD_PAT_SKIP

Same as the $STD_PAT but without storing the sub-expression.

CONSTRUCTOR

new

PACKAGE->new(args)

Creates a DataMining object to extract data for a list of items from a set of web sites. The argument list is a hash, as specified below.

Parameters:
extractionSpec => array reference

List of data series (hash references), each defining how to extract data from a web page. See the description above for the contents of a data series.

fileName => string

Optional; name of the file in which to store the mined data. If provided, the mined table is written to this file. (In any case, the mined table is available internally as the return value of the process method.)

storagePath => string

Template string defining the path for web page storage; if categorizing items, use two placeholders: the first %s marks the category, the second %s marks the item. If not categorizing, use just one %s placeholder for the item.

itemName => string

Optional; column name for the item, appearing in the first column of the resultant table (default="Key").

resumeItem => string

Optional; if specified, the process method begins processing at the specified item in the itemList; otherwise it processes the entire list.

goLive => boolean

Optional; indicates whether to fetch live web pages (1) or use stored web pages (0) (default=1).

saveLiveFiles => boolean

Optional; indicates whether to save retrieved web pages for future invocations (default=1).

selectFunc => function reference

Optional; must be a function which takes a single argument, a data series from the extractionSpec, and returns a boolean indicating whether to process that data series. See also typeSelector and nameSelector.
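
A minimal sketch (the pageName test is hypothetical):

        selectFunc => sub {
                my ($series) = @_;
                return $series->{pageName} =~ /^month/;   # process only the monthly series
        },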

typeSelector => string

Optional; if specified, only those data series whose type matches the typeSelector will be used. See also nameSelector and selectFunc.

nameSelector => string

Optional; if specified, only those data series whose pageName matches the nameSelector will be used. See also typeSelector and selectFunc.

outputCols => array reference

An array specifying the ordering of columns in the output. Each element of the array reference must be unique and match exactly some entry in a dispList. The one exception is that you should also include one array element of the form xxx:key, where xxx may be any label you wish for the key field. You may use all data labels enumerated by the sum of all dispList items, but you may freely omit some to obtain fewer columns in your output.
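
For instance, to rearrange the output of Example 1 (a sketch; the "Shape" key label is arbitrary):

        outputCols => [ "Shape:key", "Axes", "Sides" ],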

includeImages => boolean

Optional; if specified, the basename of the target of an IMG element will be retained as part of the scan. Normally, all HTML elements are stripped before scanning to remove all non-displayed text. This option leaves the names of images in place, since they are technically part of what is displayed. The image element is reformatted from, e.g. <img src="/my/project/images/bird.gif" /> to [#bird.gif#].

textDump => boolean

Optional; if specified, the text of each retrieved HTML page will be written to the data file. Each page is identified by item and series. After all series for an item have been written, the actual single data line for that item is written. This is useful for diagnostic purposes; it lets you see the actual text rather than the HTML code when constructing regular expressions.

dateLimit => string (yyyy/mm/dd)

Optional; used only when goLive is not set. If specified, instead of the latest local file for each item being scanned for building the summary, the latest file on or before this date will be used. Handy for historical perspectives.

javaBinPath => string

Java executable path

javaClassPath => string

Java classpath

Returns:

a newly created object

METHODS

process

OBJ->process(itemList [, title])

Iterates through the itemList creating a two-dimensional table of mined data as specified by the extractionSpec. That is, for each item in itemList, one line is written in the table. The line of data consists of one or more pieces of data from each web page in the extractionSpec. In addition, each retrieved web page is optionally stored on your disk for quicker reuse. The itemList is a list of sub-arrays. Each sub-array specifies a key, an alternate-key, a category, and a list of additional arguments to fill in your URL.

The key will always be written into the first column of the output table. This same key is substituted into each URL in the extractionSpec, then each web page is fetched. Sometimes a different value is needed for the URL rather than the key; hence you may supply an alternate-key in the itemList. For example, a company report for Krispy Kreme might require a URL of http://www.coreports.com/annual?ID=4932956&which=10K (i.e. Krispy Kreme's ID is 4932956). In your mined data table, though, you want "Krispy Kreme", not 4932956, so your key will be "Krispy Kreme" and your alternate-key 4932956.

The category allows you to subdivide your itemList into categories for storing the retrieved web pages. For securities you might have a hold list and a watch list, for example. If you have specified your storagePath as /docs/financial/%s/%s, the first %s placeholder is the category; the second placeholder is the key. So your list of stocks will be stored as /docs/financial/hold/key/*.html and /docs/financial/watch/key/*.html.

If your URL needs more than just the key field to uniquely specify the web page you wish to retrieve, you may specify additional key-dependent arguments in your item list for each item. These will be filled into the URL template where you specify, using standard printf placeholders (%s, %f, etc.). The key will always go into the first placeholder, however.
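
For example, extending the Krispy Kreme report URL above to carry the report type as an extra argument (a sketch):

        # two placeholders: the key (or alternate-key) fills the first; extra args fill the rest
        urlTemplate => "http://www.coreports.com/annual?ID=%s&which=%s",
        . . .
        # itemList entry: key, alternate-key, category, then the extra argument
        [ "Krispy Kreme", "4932956", "hold", "10K" ],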

The resultant table is returned by this method, so you can perform further processing if needed. In addition, if a fileName was supplied to the constructor, the table is written to the specified file.

Parameters:

itemList - array reference; each element is itself an array reference containing three fixed arguments and any number of optional arguments to add to your URL: [ key, alternate-key, category, optional-arg1, optional-arg2,... ]

title - optional; string to output as the first line of the output file.

Returns:

String; mined data as two-dimensional text table.

BUGS

None

AUTHOR

Michael Sorens

VERSION

$Revision: 229 $ $Date: 2008-03-08 18:17:25 -0800 (Sat, 08 Mar 2008) $

SINCE

CleanCode 0.9



CleanCode Perl Libraries Copyright © 2001-2013 Michael Sorens - Revised 2013.06.30