NAME

slice - Return a slice of one or more files specified by pattern and offset.

SYNOPSIS

slice options files

options

--bodyTag | --nobodyTag

Adds html and body tag brackets around the extracted text, i.e. <html><body>...</body></html>.

--titleTag=string

String used to generate a title tag and an h1 tag. Requires the bodyTag option as well. Changes the bracketing tags to: <html><head><title>...</title></head><body><h1>...</h1>...</body></html>.

--startText=string

String (typically opening HTML fragment) printed preceding each sliced file. See the note on interpolated text below.

--middleText=string

String (typically HTML row/cell tag fragments) printed between each pair of sliced files. Not printed when only one file is sliced. See the note on interpolated text below.

--endText=string

String (typically closing HTML fragment) printed following each sliced file. See the note on interpolated text below.

--startPat=pattern

Start extraction with first occurrence of pattern.

--stopPat=pattern

Stop extraction with first occurrence of pattern. If omitted, or not found, extracts through end of file.

--startAdj=[!]pattern | integer

If a pattern, adjusts starting line determined by startPat by searching forward (or backward with ! prefix). If a number, adjusts the starting line by the number (positive or negative).

--stopAdj=[!]pattern | integer

If a pattern, adjusts ending line determined by stopPat by searching forward (or backward with ! prefix). If a number, adjusts the ending line by the number (positive or negative).

--colPattern=pattern

After slicing by rows via the various start and stop options, you may additionally slice by columns by specifying a pattern to match within each line. If omitted, the entire line is returned as part of the extraction. If included, the pattern must contain exactly one parenthesized subexpression group to capture the piece of text you want; otherwise, you'll just get a count of what was matched. If the pattern fails to match a line, that line is skipped entirely (i.e. you get neither the original line nor a blank line).

--verbose | --noverbose

If true, prints info about matched line numbers.

files

One or more files to slice. If no file specified, reads from STDIN.
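
The column slicing described under colPattern can be pictured with the following minimal sketch. It is not the script's actual code; slice_columns is a hypothetical helper name, and the sketch assumes the pattern is applied to each line with an ordinary list-context match.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of slice itself): keep only the text
# captured by the single group in $pattern, line by line.
sub slice_columns {
    my ($pattern, @lines) = @_;
    my @out;
    for my $line (@lines) {
        # In list context a successful match returns the captured group.
        # A failed match returns an empty list, so the line is skipped.
        if (my ($cap) = $line =~ /$pattern/) {
            push @out, $cap;
        }
    }
    return @out;
}

my @cells = slice_columns('<td>(.*?)</td>',
    '<td>alpha</td>', 'no cell here', '<td>beta</td>');
print "$_\n" for @cells;
```

Note that a pattern with no group yields 1 for every matching line in this sketch, which is in the same spirit as the "count" behaviour mentioned above.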

REQUIRES

Perl 5.005, Getopt::Long, Data::Handy, Array::Slice
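
As a rough illustration of how the options in the synopsis could be declared with Getopt::Long, consider the sketch below. The option names come from this page; the wrapper parse_slice_args is hypothetical, and the sketch assumes a Getopt::Long recent enough to provide GetOptionsFromArray (newer than the one shipped with Perl 5.005).

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical wrapper: parse a slice-style argument list into a hash.
# Whatever remains in @$args afterwards is the list of files.
sub parse_slice_args {
    my ($args) = @_;
    my %opt;
    GetOptionsFromArray(
        $args, \%opt,
        'bodyTag!',     # "!" makes a flag negatable: --bodyTag / --nobodyTag
        'verbose!',
        'titleTag=s', 'startText=s', 'middleText=s', 'endText=s',
        'startPat=s', 'stopPat=s',
        'startAdj=s', 'stopAdj=s',   # strings: a pattern or a signed integer
        'colPattern=s',
    ) or die "bad options\n";
    return \%opt;
}

my @args = ('--bodyTag', '--startPat=search results', '--stopAdj=-7',
            'p01.htm', 'p02.htm');
my $opt = parse_slice_args(\@args);
print "files: @args\n";
```

Declaring startAdj and stopAdj as strings leaves the decision between "pattern" and "integer" to later code, which is one simple way to support both forms with a single option.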

DESCRIPTION

Slice extracts a piece of a text file (or a set of files). It was named after the analogous array slice concept in Perl. If you think of a text file as an array of lines, slice returns an array slice of that array, but rather than specifying by line number, you specify by pattern (i.e. regular expression).

startPat and stopPat are the main selection patterns that define a range within a file. Each matches the first occurrence of its pattern in the file. You may refine the range with startAdj and stopAdj, which offset a range boundary either forward or backward. startAdj and stopAdj may be patterns or signed integers. A pattern p moves the range boundary forward, while !p (i.e. the pattern prefixed with a "!") moves it backward. Similarly, a positive integer moves the boundary forward and a negative integer moves it backward. (All of these movements are by line.)
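
The range selection just described can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: the helper names are hypothetical, the stop line is assumed to be included in the result, the search for stopPat is assumed to begin at the start line, a pattern adjustment is assumed to begin searching at the neighbouring line, and a start pattern that is never found is assumed to yield nothing.

```perl
use strict;
use warnings;

# Hypothetical helper: index of the first line at or after $from matching $pat.
sub first_match {
    my ($lines, $pat, $from) = @_;
    for my $i ($from .. $#$lines) {
        return $i if $lines->[$i] =~ /$pat/;
    }
    return undef;
}

# Hypothetical helper: move a boundary by a signed integer, or to the nearest
# matching line (forward by default, backward with a "!" prefix).
sub adjust {
    my ($lines, $idx, $adj) = @_;
    return $idx unless defined $adj;
    if ($adj =~ /^[+-]?\d+$/) {
        $idx += $adj;
    }
    else {
        my $step = ($adj =~ s/^!//) ? -1 : 1;
        # Assumption: the search starts at the neighbouring line.
        for (my $i = $idx + $step; $i >= 0 && $i <= $#$lines; $i += $step) {
            if ($lines->[$i] =~ /$adj/) { $idx = $i; last }
        }
    }
    $idx = 0        if $idx < 0;          # clamp to the file
    $idx = $#$lines if $idx > $#$lines;
    return $idx;
}

sub slice_lines {
    my ($lines, %opt) = @_;
    return () unless @$lines;
    my $start = defined $opt{startPat} ? first_match($lines, $opt{startPat}, 0) : 0;
    return () unless defined $start;      # assumption: no start match, no output
    my $stop = defined $opt{stopPat} ? first_match($lines, $opt{stopPat}, $start) : undef;
    $stop = $#$lines unless defined $stop;   # omitted or not found: end of file
    $start = adjust($lines, $start, $opt{startAdj});
    $stop  = adjust($lines, $stop,  $opt{stopAdj});
    return $start <= $stop ? @{$lines}[$start .. $stop] : ();
}

my @lines = map { "$_\n" } qw(a b START c d STOP e);
print slice_lines(\@lines, startPat => 'START', stopPat => 'STOP',
                  startAdj => 1, stopAdj => -1);
```

With the sample data, the integer adjustments trim the START and STOP marker lines themselves, leaving only the lines between them.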

When this program is used with a web page, one would generally lose the proper HTML structure by extracting a middle section. The command-line options bodyTag, titleTag, startText, middleText, and endText provide some correction for this.
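
To make the role of these options concrete, here is a minimal sketch of how the pieces might be assembled around the sliced text. The function wrap_slices and its calling convention are hypothetical; only the option names and the tag layout are taken from this page.

```perl
use strict;
use warnings;

# Hypothetical helper: $slices is an array ref holding the extracted text
# of each file; %opt keys mirror the command-line options.
sub wrap_slices {
    my ($slices, %opt) = @_;
    my $start  = defined $opt{startText}  ? $opt{startText}  : '';
    my $middle = defined $opt{middleText} ? $opt{middleText} : '';
    my $end    = defined $opt{endText}    ? $opt{endText}    : '';

    # startText and endText surround every slice; join() places middleText
    # only between slices, so a single file never gets it.
    my $out = join $middle, map { $start . $_ . $end } @$slices;

    if ($opt{bodyTag}) {
        my ($head, $h1) = ('', '');
        if (defined $opt{titleTag}) {
            $head = "<head><title>$opt{titleTag}</title></head>";
            $h1   = "<h1>$opt{titleTag}</h1>";
        }
        $out = "<html>$head<body>$h1$out</body></html>";
    }
    return $out;
}

print wrap_slices(['<tr><td>one</td></tr>', '<tr><td>two</td></tr>'],
    startText => '<table>', endText => '</table>', middleText => '<hr>',
    bodyTag => 1, titleTag => 'Listing'), "\n";
```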

Interpolated Text

The startText, middleText, and endText command-line options are subject to text interpolation as follows. Instances of \n and \t are converted to actual newlines and tabs, respectively.

<FILE_PATH> is replaced with the full current file specification.

<FILE_NAME> is replaced with the current file name (i.e. no path).

<FILE_BASE> is replaced with the base name (i.e. no path or extension).
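
The substitutions above could be implemented along these lines. This is only a sketch: interpolate is a hypothetical function name, the path is used exactly as given on the command line, and File::Basename is an assumption here (it is not listed under REQUIRES).

```perl
use strict;
use warnings;
use File::Basename qw(basename fileparse);

# Hypothetical helper: expand escapes and file placeholders in $text
# for the file currently being sliced.
sub interpolate {
    my ($text, $path) = @_;
    my $name   = basename($path);                 # file name without path
    my ($base) = fileparse($path, qr/\.[^.]*/);   # name without path or extension

    $text =~ s/\\n/\n/g;    # literal backslash-n becomes a newline
    $text =~ s/\\t/\t/g;    # literal backslash-t becomes a tab
    $text =~ s/<FILE_PATH>/$path/g;
    $text =~ s/<FILE_NAME>/$name/g;
    $text =~ s/<FILE_BASE>/$base/g;
    return $text;
}

print interpolate('<h2><FILE_NAME></h2>\n', '/tmp/pages/p01.htm');
```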

Examples

Example for market guide screen:

 % slice.pl --bodyTag \
        --startText="<table>\n" --endText="\t</td></tr>\n</table>\n" \
        --startPat="Total Match" --stopPat="colspan=10" \
        --startAdj=!tr --stopAdj="colspan=10" < input.htm

Example for series of pages from www.entertainmentpublications.com.au stored in files p01.htm through p14.htm:

 % perl -I/mydocu~1/ms/devel/perl slice.pl \
        --startText="<table>\n" --endText="</table>\n" --bodyTag \
        --titleTag="Melbourne E-Book Listing" \
        --startPat="search results" --stopPat=zone --startAdj=9 --stopAdj=-7 p*.htm