ConvertFrom-Text

CleanCode PowerShell Libraries v1.2.08 API: CleanCode » FileTools » ConvertFrom-Text

NAME

ConvertFrom-Text

SYNOPSIS

Imports a text file using regular expressions.

SYNTAX

ConvertFrom-Text [-InputObject] <String[]> [-Pattern] <Regex> [-RequireAll] [-Multiline] [<CommonParameters>]

DESCRIPTION

Converts a text file to a collection of PowerShell objects. This function lets you import records from any text stream whose fields can be parsed with a regular expression, whether each "record" occupies one line or spans multiple lines of the input stream.

All the effort is in constructing the appropriate regular expression. Review the examples to see how to do this for:

* fixed-width fields within a line
* fixed-width fields + final, ragged-right field within a line
* variable-width, variable-delimited fields within a single line
* variable-width, variable-delimited fields spanning multiple lines

Typically you feed an array or pipeline data to ConvertFrom-Text, e.g.

        $records = Get-Content .\FixedWidth.log | ConvertFrom-Text -Pattern $regex

-or-

        $records = ConvertFrom-Text -InputObject $myDataArray -Pattern $regex

If your data is concatenated into a single string (like a herestring) but each record is still contained within a single line within that string, you can split up your input on line endings before feeding it, e.g.:

         $records = $data -split "`r`n" | ConvertFrom-Text -pattern $regex
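The split above assumes Windows (CRLF) line endings. A here-string takes its line endings from wherever it was typed or saved, so it may contain bare LF instead. Here is a minimal sketch that accepts both; the sample data is illustrative and padded to the 20-character width used in the examples:

```powershell
# Illustrative data joined with a bare LF (`n), as a here-string saved
# with Unix line endings would be.
$data = "george jetson    5  `nwarren buffett   123"

# '\r?\n' matches either CRLF or LF, so no stray carriage return
# stays attached to the end of a line.
$lines = $data -split '\r?\n'
$lines.Count
```

The resulting $lines array can then be piped to ConvertFrom-Text in the same way as the split shown above.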

However, ConvertFrom-Text also provides the -Multiline switch for handling multi-line records. In this case, you *must* send your data all together as a single string (or at least keep together all the lines from which a single record is extracted).

Building a Regex
----------------

Give each field in the file a name and determine its length. Take those two values and plug them into this template for a capture group:

        (?<field name goes here>.{field length goes here})

Repeat that for each field then lay each one down adjacent to the previous one. Here I am representing names with "n" and lengths with "l":

        (?<n1>.{l1})(?<n2>.{l2})(?<n3>.{l3})

Finally, add a caret (^) at the front end and a dollar sign ($) at the rear:

        ^(?<n1>.{l1})(?<n2>.{l2})(?<n3>.{l3})$

These anchors -- beginning-of-line (^) and end-of-line ($) metacharacters -- enforce a stricter pattern match, requiring the entire line to match. You are free to omit the anchors if you want more tolerance, e.g. to ignore extra characters at the end of a line (such as a comment).
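The template steps above can also be automated. The following is only a sketch of one way to do it, not part of ConvertFrom-Text; the field names and widths are assumptions borrowed from the examples later on this page:

```powershell
# Hypothetical field layout: name = width. [ordered] keeps the fields
# in the order they appear on the line.
$fields = [ordered]@{ FirstName = 7; LastName = 10; Id = 3 }

# Build one named capture group per field: (?<name>.{width})
$groups = foreach ($name in $fields.Keys) {
    "(?<$name>.{$($fields[$name])})"
}

# Lay the groups end to end and add the anchors.
$regex = '^' + (-join $groups) + '$'
$regex
```

Drop the '^' and '$' from the last step to get the more tolerant, unanchored form.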

Together with the RequireAll switch, this provides flexibility:

        > Anchors, RequireAll=$false:    matches entire line, non-matches ignored

        > Anchors, RequireAll=$true:     matches entire line, non-matches error

        > No anchors, RequireAll=$false: matches substring, non-matches ignored

        > No anchors, RequireAll=$true:  matches substring, non-matches error
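The anchor half of this table can be checked with PowerShell's built-in -match operator, independently of ConvertFrom-Text. A small sketch; the sample line with a trailing comment is made up for illustration:

```powershell
$anchored   = '^(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})$'
$unanchored = '(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})'

# A 20-character record followed by extra text.
$line = 'warren buffett   123 # trailing comment'

# Anchored: the whole line must be exactly 20 characters, so this fails.
$anchoredResult = $line -match $anchored

# Unanchored: the first 20 characters satisfy the pattern, the rest is ignored.
$unanchoredResult = $line -match $unanchored
$id = $Matches['Id']

"$anchoredResult / $unanchoredResult / $id"
```

Whether a non-matching line is then skipped or reported is what the RequireAll switch decides.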

PARAMETERS

-InputObject <String[]>

        Data to import.

        Required?                    true

        Position?                    1

        Default value

        Accept pipeline input?       true (ByValue, ByPropertyName)

        Accept wildcard characters?  false

-Pattern <Regex>

        Regular expression to match input lines.

        The pattern must provide one or more named capture groups--see the examples

        for details.

        This must be a regular expression either as a [string] or a [regex].

        The latter allows including option flags in the expression.

        Required?                    true

        Position?                    2

        Default value

        Accept pipeline input?       false

        Accept wildcard characters?  false
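As an illustration of the two forms, here is a sketch using only standard .NET regex features; the pattern itself is a made-up example. Inline flags such as (?i) are part of the pattern syntax, while constructor options require building a [regex] object:

```powershell
# Inline flag: (?i) switches on case-insensitive matching inside the pattern.
$inline = [regex]'(?i)^(?<Name>[a-z]+)$'

# Constructor options: the same effect, supplied when the [regex] is created.
$options = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase'
$withOptions = New-Object System.Text.RegularExpressions.Regex -ArgumentList '^(?<Name>[a-z]+)$', $options

$inline.IsMatch('GEORGE')
$withOptions.IsMatch('GEORGE')
```

Either object could then be supplied as the -Pattern argument.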

-RequireAll [<SwitchParameter>]

        Requires all lines in the file to match the supplied regular expression;

        otherwise, throws an exception.

        If omitted or set to false, non-matching lines are silently ignored.

        Required?                    false

        Position?                    named

        Default value                False

        Accept pipeline input?       false

        Accept wildcard characters?  false

-Multiline [<SwitchParameter>]

        Specifies whether records span multiple lines; default is a single line per record.

        Required?                    false

        Position?                    named

        Default value                False

        Accept pipeline input?       false

        Accept wildcard characters?  false

        This cmdlet supports the common parameters: Verbose, Debug,

        ErrorAction, ErrorVariable, WarningAction, WarningVariable,

        OutBuffer and OutVariable. For more information, see

        about_CommonParameters (http://go.microsoft.com/fwlink/?LinkID=113216).

INPUTS

Array of strings.

OUTPUTS

Array of custom objects defined by the named capture groups in the regex.

NOTES

        This function is part of the CleanCode toolbox

        from http://cleancode.sourceforge.net/.

        Since CleanCode 1.2.02

EXAMPLES

-------------------------- EXAMPLE 1 --------------------------

PS>Get-Content .\FixedWidth.txt | ConvertFrom-Text -Pattern $regex

=== FIXED-WIDTH RECORDS === Construct a regular expression per the Description section.

        ^(?<n1>.{l1})(?<n2>.{l2})(?<n3>.{l3})$

Use this regex...

        $regex = "^(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})$"

...for this file:

        --------------------------

        12345671234567890123

        george jetson    5

        warren buffett   123

        horatioalger     -99

        --------------------------

...to get this output:

        Id    FirstName     LastName

        --    ---------     --------

        123   1234567       1234567890

        5     george        jetson

        123   warren        buffett

        -99   horatio       alger
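To see what the regex does with each line, here is a rough stand-alone approximation using plain .NET regex calls. It is not the ConvertFrom-Text implementation, and trimming the values is a choice made for this sketch. Note that the anchored pattern needs every line to be exactly 20 characters, so the "5" record is padded with two trailing spaces:

```powershell
$regex = [regex]'^(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})$'

# Same records as the sample file, each exactly 20 characters wide.
$lines = '12345671234567890123',
         'george jetson    5  ',
         'warren buffett   123',
         'horatioalger     -99'

$records = foreach ($line in $lines) {
    $m = $regex.Match($line)
    if ($m.Success) {
        # One object per matching line, one property per named group.
        [pscustomobject]@{
            Id        = $m.Groups['Id'].Value.Trim()
            FirstName = $m.Groups['FirstName'].Value.Trim()
            LastName  = $m.Groups['LastName'].Value.Trim()
        }
    }
}
$records | Format-Table
```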

-------------------------- EXAMPLE 2 --------------------------

PS>$data -split "`r`n" | ConvertFrom-Text -Pattern $regex

=== HERE-STRING === This is the same as the previous example, except that the data comes from a here-string rather than a file, so use the same regex...

        $regex = "^(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})$"

...for this data:

        $data = @"

        12345671234567890123

        george jetson    5

        warren buffett   123

        horatioalger     -99

"@

-------------------------- EXAMPLE 3 --------------------------

PS>Get-Content .\RaggedRight.txt | ConvertFrom-Text -Pattern $regex

=== RAGGED-RIGHT RECORDS === The ragged right format defines all columns by fixed width except for the last column, which simply runs to the end of the line. This is handled almost identically to the fixed-width field example.

Modify the final capture group in the regular expression to use .* instead of .{n} as shown below. I have also renamed it from Id to Description since that is now a more likely field name.

Use this regex...

        $regex = "^(?<FirstName>.{7})(?<LastName>.{10})(?<Description>.*)$"

...for this file:

        --------------------------

        12345671234567890123

        george jetson    arbitrary text here

        warren buffett   stuff

        horatioalger     more stuff

        --------------------------
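The effect of the final .* group can be checked with the built-in -match operator. This is a sketch that exercises only the regex, not ConvertFrom-Text itself:

```powershell
$regex = '^(?<FirstName>.{7})(?<LastName>.{10})(?<Description>.*)$'

$lines = 'george jetson    arbitrary text here',
         'warren buffett   stuff'

$descriptions = foreach ($line in $lines) {
    if ($line -match $regex) {
        # The last group takes whatever remains, regardless of its length.
        $Matches['Description']
    }
}
$descriptions
```

Because .* also matches zero characters, a line that ends right after the LastName column still matches and yields an empty Description.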

-------------------------- EXAMPLE 4 --------------------------

PS>Get-Content .\apache.log | ConvertFrom-Text -Pattern "^$apacheExtractor$"

=== VARIABLE RECORDS === There are countless variations of log files, but one class of log file that is very common is that generated by a web server. The Apache/NCSA common log format, a standardized format used by Apache web servers, is a good use case to illustrate because it has several special cases: it contains fields separated by white space but also allows whitespace *within* a field when the field is delineated either by quotes (as in the "Access request" field) or brackets (as in the "Timestamp" field). Here are just a few lines from a log using this common log format (see http://httpd.apache.org/docs/2.2/logs.html#common):

        --------------------------

        127.0.0.1 - frank [10/Oct/2012:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

        111.111.111.111 - martha [18/Oct/2012:01:17:44 -0700] "GET / HTTP/1.0" 200 101

        111.111.111.111 - - [18/Oct/2007:11:17:55 -0700] "GET /style.css HTTP/1.1" 200 4525

        --------------------------

Each row contains 7 fields. Here is the first record split apart:

        Host or IP address         127.0.0.1

        Remote log name            -

        Authenticated user name    frank

        Timestamp                  [10/Oct/2012:13:55:36 -0700]

        Access request             GET /apache_pb.gif HTTP/1.0

        Result status code         200

        Bytes transferred          2326

The fields have varied formats so each matching expression below is customized. All fields are separated by whitespace so the collection of expressions is concatenated together with any amount of whitespace (\s+) between items. Here is the pattern to match the line, except for the anchors:

        $apacheExtractor = "(?<Host>\S*)",

           "(?<LogName>.*?)",

           "(?<UserId>\S*)",

           "\[(?<TimeStamp>.*?)\]",

           "`"(?<Request>[^`"]*)`"",

           "(?<Status>\d{3})",

           "(?<BytesSent>\S*)" -join "\s+"

The anchors--added in the invocation line above--force the line to match in its entirety (i.e. with no extraneous characters before or after).

Here is the output from the above input sample:

        TimeStamp                  LogName Host            UserId Status Request                     BytesSent

        ---------                  ------- ----            ------ ------ -------                     ---------

        10/Oct/2012:13:55:36 -0700 -       127.0.0.1       frank  200    GET /apache_pb.gif HTTP/1.0 2326

        18/Oct/2012:01:17:44 -0700 -       111.111.111.111 martha 200    GET / HTTP/1.0              101

        18/Oct/2007:11:17:55 -0700 -       111.111.111.111 -      200    GET /style.css HTTP/1.1     4525
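The pattern can be tried against a single log line without ConvertFrom-Text. This sketch uses the built-in -match operator and writes the same pattern with single-quoted strings, which avoids the backtick escapes:

```powershell
# Same seven expressions as above, joined with \s+ between them.
$apacheExtractor = '(?<Host>\S*)',
    '(?<LogName>.*?)',
    '(?<UserId>\S*)',
    '\[(?<TimeStamp>.*?)\]',
    '"(?<Request>[^"]*)"',
    '(?<Status>\d{3})',
    '(?<BytesSent>\S*)' -join '\s+'

$line = '127.0.0.1 - frank [10/Oct/2012:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

# Anchors are added here, as in the invocation line of this example.
$matched = $line -match "^$apacheExtractor$"
$fields  = $Matches

$fields['Host']
$fields['TimeStamp']
$fields['Request']
```

The brackets and quotes sit outside the named groups, so TimeStamp and Request come back without their delimiters.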

-------------------------- EXAMPLE 5 --------------------------

PS>$data | ConvertFrom-Text -pattern $regex -Multiline

=== MULTI-LINE RECORDS === (Adapted from Per Ostergaard's "Matching multi-line text and converting it into objects" at http://msgoodies.blogspot.com/2008/12/matching-multi-line-text-and-converting.html)

Here multi-line data within a here-string must be used *without* splitting it into lines because the regular expression needs to match all fields together.

        $data=@'

        DC Options: IS_GC

        DC=company,DC=org

            BLL\045ADDC001 via RPC

                DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec

                Last attempt @ 2007-08-21 13:38:53 was successful.

        CN=Configuration,DC=company,DC=org

            BLL\045ADDC001 via RPC

                DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec

                Last attempt @ 2007-08-21 13:38:53 was successful.

        CN=Schema,CN=Configuration,DC=company,DC=org

            BLL\045ADDC001 via RPC

                DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec

                Last attempt @ 2007-08-21 13:38:54 was successful.

'@

The regular expression below uses inline regular expression modifiers (see http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx), which are typical when dealing with multi-line input.

* The "m" allows ^ and $ to match the start/end of each line of a multi-line input rather than the beginning and end of the *entire* input.
* The "s" allows . (dot) to span multiple lines in the input.
* The "x" activates free-spacing mode, allowing whitespace (and # comments) within the regex for the benefit of readability.

        $regex = [regex] '(?msx)

            ^ (?<partition> (CN|DC)=[^$]+?)\s*$

            .+? # skip intervening

            (?<Site> \w+) \\ (?<DC> \w+)

            .+?

            Last\ attempt\D+ (?<date> [\d\-]+\ [\d\:]+ )

            '

To use data from a file, simply concatenate it all together (e.g. with Out-String):

        PS> Get-Content data.txt | Out-String | ConvertFrom-Text -pattern $regex -Multiline
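To check how the multi-line regex carves up the input, here is a stand-alone sketch that calls the .NET Matches method directly on a shortened copy of the data. It illustrates the regex only and is not how ConvertFrom-Text is necessarily implemented:

```powershell
$data = @'
DC=company,DC=org
    BLL\045ADDC001 via RPC
        DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
        Last attempt @ 2007-08-21 13:38:53 was successful.
CN=Configuration,DC=company,DC=org
    BLL\045ADDC001 via RPC
        DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
        Last attempt @ 2007-08-21 13:38:54 was successful.
'@

$regex = [regex]'(?msx)
    ^ (?<partition> (CN|DC)=[^$]+?)\s*$
    .+? # skip intervening
    (?<Site> \w+) \\ (?<DC> \w+)
    .+?
    Last\ attempt\D+ (?<date> [\d\-]+\ [\d\:]+ )
'

# Each match is one multi-line record; each named group is one field.
$found = $regex.Matches($data)
foreach ($m in $found) {
    '{0} | {1} | {2} | {3}' -f $m.Groups['partition'].Value,
        $m.Groups['Site'].Value, $m.Groups['DC'].Value, $m.Groups['date'].Value
}
```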
