
CleanCode PowerShell Libraries v1.2.08 API: CleanCode » FileTools » ConvertFrom-Text

ConvertFrom-Text

NAME

ConvertFrom-Text

SYNOPSIS

Imports a text file using regular expressions.

SYNTAX

ConvertFrom-Text [-InputObject] <String[]> [-Pattern] <Regex> [-RequireAll] [-Multiline] [<CommonParameters>]

DESCRIPTION

Converts a text file to a collection of PowerShell objects. This function lets you import records from any text stream whose fields can be parsed with a regular expression, whether each "record" occupies one line or spans multiple lines of your input stream.

All the effort is in constructing the appropriate regular expression. Review the examples to see how to do this for:

* fixed-width fields within a line
* fixed-width fields + final, ragged-right field within a line
* variable-width, variable-delimited fields within a single line
* variable-width, variable-delimited fields spanning multiple lines

Typically you feed an array or pipeline data to ConvertFrom-Text, e.g.
        $records = Get-Content .\FixedWidth.log | ConvertFrom-Text -Pattern $regex
-or-
        $records = ConvertFrom-Text -InputObject $myDataArray -Pattern $regex

If your data is concatenated into a single string (like a here-string) but each record is still contained within a single line of that string, you can split your input on line endings before feeding it, e.g.:

         $records = $data -split "`r`n" | ConvertFrom-Text -pattern $regex
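The split shown above assumes CRLF line endings. Depending on how the string was produced, its lines may end in a bare LF instead. The following is a minimal sketch, not part of the library, that splits on either form; the sample string is an assumption built from the fixed-width data used in the examples.

```powershell
# Sample data with mixed line endings: the first break is LF, the second is CRLF.
$data = "george jetson    5  `nwarren buffett   123`r`nhoratioalger     -99"

# -split takes a regular expression; \r?\n matches both LF and CRLF.
$lines = $data -split '\r?\n'

$lines.Count   # 3
$lines[1]      # warren buffett   123
```

The resulting array can be piped to ConvertFrom-Text in the same way as the output of Get-Content.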

However, ConvertFrom-Text also provides the -Multiline switch for handling multi-line records. In this case, you *must* send your data all together as a single string (or at least all the lines from which a single record is extracted must be together).

Building a Regex
----------------

For fixed-width data, give each field in the file a name and determine its length. Take those two values and plug them into this template for a capture group:

        (?<field name goes here>.{field length goes here})

Repeat that for each field then lay each one down adjacent to the previous one. Here I am representing names with "n" and lengths with "l":

        (?<n1>.{l1})(?<n2>.{l2})(?<n3>.{l3})

Finally, add a caret (^) at the front end and a dollar sign ($) at the rear:

        ^(?<n1>.{l1})(?<n2>.{l2})(?<n3>.{l3})$

These anchors -- beginning-of-line (^) and end-of-line ($) metacharacters -- enforce a stricter pattern match, requiring the entire line to match. Of course, you are free to omit the anchors if you wish to allow more tolerance, for example to ignore extra characters at the end of a line (like a comment, perhaps).
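The template above can also be assembled programmatically. This is a minimal sketch only: New-FixedWidthPattern is a hypothetical helper, not part of the CleanCode library, and it assumes the fields are supplied in file order as an ordered dictionary of name/length pairs.

```powershell
# Hypothetical helper (not part of CleanCode): builds a fixed-width pattern
# from an ordered dictionary of field names and field lengths.
function New-FixedWidthPattern {
    param(
        [System.Collections.Specialized.OrderedDictionary] $Fields,
        [switch] $NoAnchors   # omit ^ and $ to tolerate extra characters
    )
    # One named capture group per field: (?<name>.{length})
    $groups = foreach ($name in $Fields.Keys) {
        '(?<{0}>.{{{1}}})' -f $name, $Fields[$name]
    }
    $body = -join $groups
    if ($NoAnchors) { $body } else { "^$body`$" }
}

$regex = New-FixedWidthPattern -Fields ([ordered]@{ FirstName = 7; LastName = 10; Id = 3 })
$regex   # ^(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})$
```

Generating the pattern from a field list keeps names and lengths in one place, which may be easier to maintain than editing a long regex by hand.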

Together with the RequireAll switch, this provides flexibility:
        > Anchors, RequireAll=$false:    matches entire line, non-matches ignored
        > Anchors, RequireAll=$true:     matches entire line, non-matches error
        > No anchors, RequireAll=$false: matches substring, non-matches ignored
        > No anchors, RequireAll=$true:  matches substring, non-matches error
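The difference between the anchored and unanchored rows can be checked with the -match operator alone, without calling ConvertFrom-Text. This sketch assumes the fixed-width pattern from the examples; the line with a trailing comment is made up for illustration.

```powershell
$anchored   = '^(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})$'
$unanchored = '(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})'

$exact    = 'warren buffett   123'
$trailing = 'warren buffett   123 # comment'   # illustrative line with extra text

$exact    -match $anchored     # True:  the whole line is exactly 20 characters
$trailing -match $anchored     # False: extra characters follow the Id field
$trailing -match $unanchored   # True:  the first 20 characters match
$Matches['Id']                 # 123
```

Whether a non-matching line such as $trailing is skipped or raises an error is then decided by the RequireAll switch, as listed above.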

PARAMETERS

-InputObject <String[]>
        Data to import.

        Required?                    true
        Position?                    1
        Default value                
        Accept pipeline input?       true (ByValue, ByPropertyName)
        Accept wildcard characters?  false

-Pattern <Regex>
        Regular expression to match input lines.
        The pattern must provide one or more named capture groups--see the examples
        for details.
        This must be a regular expression either as a [string] or a [regex].
        The latter allows including option flags in the expression.

        Required?                    true
        Position?                    2
        Default value                
        Accept pipeline input?       false
        Accept wildcard characters?  false
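As an illustration of the option flags mentioned for -Pattern, here is a sketch that builds a [regex] object with .NET RegexOptions. The key=value pattern and the sample input are assumptions made up for this example.

```powershell
# Combine option flags by casting a comma-separated string to the flags enum.
$options = [System.Text.RegularExpressions.RegexOptions] 'IgnoreCase, Multiline'

# Illustrative pattern: a lowercase key, an equals sign, and the rest of the line.
$regex = New-Object regex '^(?<Key>[a-z]+)=(?<Value>.*)$', $options

$regex.IsMatch('NAME=value')                      # True, because of IgnoreCase
$regex.Match('NAME=value').Groups['Value'].Value  # value
```

The same flags can usually be written inline at the start of a string pattern instead, e.g. (?im), which is the approach taken in the multi-line example.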

-RequireAll [<SwitchParameter>]
        Requires all lines in the file to match the supplied regular expression;
        otherwise, throws an exception.
        If omitted or set to false, non-matching lines are silently ignored.

        Required?                    false
        Position?                    named
        Default value                False
        Accept pipeline input?       false
        Accept wildcard characters?  false

-Multiline [<SwitchParameter>]
        Specifies whether records span multiple lines; default is a single line per record.

        Required?                    false
        Position?                    named
        Default value                False
        Accept pipeline input?       false
        Accept wildcard characters?  false

<CommonParameters>
        This cmdlet supports the common parameters: Verbose, Debug,
        ErrorAction, ErrorVariable, WarningAction, WarningVariable,
        OutBuffer and OutVariable. For more information, see 
        about_CommonParameters (http://go.microsoft.com/fwlink/?LinkID=113216). 

INPUTS

Array of strings.

OUTPUTS

Array of custom objects defined by the named capture groups in the regex.

NOTES



        This function is part of the CleanCode toolbox
        from http://cleancode.sourceforge.net/.

        Since CleanCode 1.2.02

EXAMPLES


-------------------------- EXAMPLE 1 --------------------------

PS>Get-Content .\FixedWidth.txt | ConvertFrom-Text -Pattern $regex

=== FIXED-WIDTH RECORDS === Construct a regular expression per the Description section.

        ^(?<n1>.{l1})(?<n2>.{l2})(?<n3>.{l3})$

Use this regex...

        $regex = "^(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})$"

...for this file:
        --------------------------
        12345671234567890123
        george jetson    5  
        warren buffett   123
        horatioalger     -99
        --------------------------

...to get this output:

        Id    FirstName     LastName  
        --    ---------     --------  
        123   1234567       1234567890
        5     george        jetson    
        123   warren        buffett   
        -99   horatio       alger
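To see how the named capture groups map to object properties, the same lines can be processed with plain -match. This is a sketch under stated assumptions, not the library's implementation: in particular, trimming the padding from each field is a choice made here for readability.

```powershell
$regex = '^(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})$'
$lines = '12345671234567890123',
         'george jetson    5  ',
         'warren buffett   123',
         'horatioalger     -99'

$records = foreach ($line in $lines) {
    if ($line -match $regex) {
        # One object per matching line; one property per named capture group.
        # Trim() is an assumption of this sketch, used to drop fixed-width padding.
        New-Object psobject -Property ([ordered]@{
            FirstName = $Matches['FirstName'].Trim()
            LastName  = $Matches['LastName'].Trim()
            Id        = $Matches['Id'].Trim()
        })
    }
}

$records | Format-Table -AutoSize
```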

-------------------------- EXAMPLE 2 --------------------------

PS>$data -split "`r`n" | ConvertFrom-Text -Pattern $regex

=== HERE-STRING === This is the same data as the previous example, supplied in a here-string instead of a file, so use the same regex...

        $regex = "^(?<FirstName>.{7})(?<LastName>.{10})(?<Id>.{3})$"

...for this data:
        $data = @"
        12345671234567890123
        george jetson    5  
        warren buffett   123
        horatioalger     -99
        "@

-------------------------- EXAMPLE 3 --------------------------

PS>Get-Content .\RaggedRight.txt | ConvertFrom-Text -Pattern $regex

=== RAGGED-RIGHT RECORDS === The ragged right format defines all columns by fixed width except for the last column, which simply runs to the end of the line. This is handled almost identically to the fixed-width field example.

Modify the final capture group in the regular expression to use .* instead of .{n} as shown below. I have also renamed it from Id to Description since that is now a more likely field name.

Use this regex...

        $regex = "^(?<FirstName>.{7})(?<LastName>.{10})(?<Description>.*)$"

...for this file:
        --------------------------
        12345671234567890123
        george jetson    arbitrary text here 
        warren buffett   stuff
        horatioalger     more stuff
        --------------------------
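The effect of the trailing .* group can be checked directly with -match. In this sketch the first line comes from the sample file above; the other two inputs are made up to show the edge cases.

```powershell
$regex = '^(?<FirstName>.{7})(?<LastName>.{10})(?<Description>.*)$'

'horatioalger     more stuff' -match $regex   # True
$Matches['Description']                        # more stuff

# Exactly 17 characters: both fixed fields are present, Description is empty.
'warren buffett   ' -match $regex              # True
$Matches['Description'].Length                 # 0

# Fewer than 17 characters: the fixed-width groups cannot be satisfied.
'too short' -match $regex                      # False
```

Only the last field may be ragged; every earlier field still has to be padded to its full width.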

-------------------------- EXAMPLE 4 --------------------------

PS>Get-Content .\apache.log | ConvertFrom-Text -Pattern "^$apacheExtractor$"

=== VARIABLE RECORDS === There are countless variations of log files, but one very common class is the log generated by a web server. The Apache/NCSA common log format, a standardized format used by Apache web servers, makes a good illustration because it has several special cases: fields are separated by whitespace, yet whitespace is also allowed *within* a field when that field is delimited either by quotes (as in the "Access request" field) or by brackets (as in the "Timestamp" field). Here are just a few lines from a log using this common log format (see http://httpd.apache.org/docs/2.2/logs.html#common):
        --------------------------
        127.0.0.1 - frank [10/Oct/2012:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
        111.111.111.111 - martha [18/Oct/2012:01:17:44 -0700] "GET / HTTP/1.0" 200 101
        111.111.111.111 - - [18/Oct/2007:11:17:55 -0700] "GET /style.css HTTP/1.1" 200 4525
        --------------------------

Each row contains 7 fields; here is the first record split apart:
        Host or IP address         127.0.0.1
        Remote log name            -
        Authenticated user name    frank
        Timestamp                  [10/Oct/2012:13:55:36 -0700]
        Access request             GET /apache_pb.gif HTTP/1.0
        Result status code         200
        Bytes transferred          2326

The fields have varied formats, so each matching expression below is customized. All fields are separated by whitespace, so the expressions are joined together with one or more whitespace characters (\s+) between items. Here is the pattern to match the line, except for the anchors:

        $apacheExtractor = "(?<Host>\S*)",
           "(?<LogName>.*?)",
           "(?<UserId>\S*)",
           "\[(?<TimeStamp>.*?)\]",
           "`"(?<Request>[^`"]*)`"",
           "(?<Status>\d{3})",
           "(?<BytesSent>\S*)" -join "\s+"

The anchors--added in the invocation line above--force the line to match in its entirety (i.e. with no extraneous characters before or after).

Here is the output from the above input sample:

        TimeStamp                  LogName Host            UserId Status Request                     BytesSent
        ---------                  ------- ----            ------ ------ -------                     ---------
        10/Oct/2012:13:55:36 -0700 -       127.0.0.1       frank  200    GET /apache_pb.gif HTTP/1.0 2326     
        18/Oct/2012:01:17:44 -0700 -       111.111.111.111 martha 200    GET / HTTP/1.0              101      
        18/Oct/2007:11:17:55 -0700 -       111.111.111.111 -      200    GET /style.css HTTP/1.1     4525
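The pattern can be tried against a single log line with -match before running it over a whole file. This sketch restates the pattern with single-quoted strings, an assumption made here so that the embedded double quotes need no backtick escaping; the regex itself is unchanged.

```powershell
# Same field expressions as above, joined with \s+ between fields.
$apacheExtractor = '(?<Host>\S*)',
    '(?<LogName>.*?)',
    '(?<UserId>\S*)',
    '\[(?<TimeStamp>.*?)\]',
    '"(?<Request>[^"]*)"',
    '(?<Status>\d{3})',
    '(?<BytesSent>\S*)' -join '\s+'

$line = '127.0.0.1 - frank [10/Oct/2012:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

$line -match "^$apacheExtractor`$"   # True
$Matches['TimeStamp']                # 10/Oct/2012:13:55:36 -0700
$Matches['Request']                  # GET /apache_pb.gif HTTP/1.0
```

Note that the brackets and quotes sit outside the capture groups, which is why they do not appear in the TimeStamp and Request values.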

-------------------------- EXAMPLE 5 --------------------------

PS>$data | ConvertFrom-Text -pattern $regex -Multiline

=== MULTI-LINE RECORDS === (Adapted from Per Ostergaard's "Matching multi-line text and converting it into objects" at http://msgoodies.blogspot.com/2008/12/matching-multi-line-text-and-converting.html)

Here multi-line data within a here-string must be used *without* splitting it into lines because the regular expression needs to match all fields together.

        $data=@'
        DC Options: IS_GC
        DC=company,DC=org
            BLL\045ADDC001 via RPC
                DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
                Last attempt @ 2007-08-21 13:38:53 was successful.
        CN=Configuration,DC=company,DC=org
            BLL\045ADDC001 via RPC
                DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
                Last attempt @ 2007-08-21 13:38:53 was successful.
        CN=Schema,CN=Configuration,DC=company,DC=org
            BLL\045ADDC001 via RPC
                DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
                Last attempt @ 2007-08-21 13:38:54 was successful.
        '@

The regular expression below uses inline regular expression modifiers (see http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx), which are typical when dealing with multi-line input:
* The m allows ^ and $ to match the start/end of each line of a multi-line input rather than the beginning and end of the *entire* input.
* The s allows . (dot) to match newlines, so a match can span multiple lines in the input.
* The x activates free-spacing mode, allowing whitespace (and # comments) within the regex for the benefit of readability.

        $regex = [regex] '(?msx)
            ^ (?<partition> (CN|DC)=[^$]+?)\s*$
            .+? # skip intervening
            (?<Site> \w+) \\ (?<DC> \w+)
            .+?
            Last\ attempt\D+ (?<date> [\d\-]+\ [\d\:]+ )
        '
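To preview what this pattern captures, it can be run through the .NET Matches method directly. This is a sketch rather than the cmdlet's implementation; it uses the first two records of the data above, written flush left so that the here-string terminator starts its own line as PowerShell requires.

```powershell
$data = @'
DC Options: IS_GC
DC=company,DC=org
    BLL\045ADDC001 via RPC
        DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
        Last attempt @ 2007-08-21 13:38:53 was successful.
CN=Configuration,DC=company,DC=org
    BLL\045ADDC001 via RPC
        DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
        Last attempt @ 2007-08-21 13:38:53 was successful.
'@

$regex = [regex] '(?msx)
    ^ (?<partition> (CN|DC)=[^$]+?)\s*$
    .+? # skip intervening
    (?<Site> \w+) \\ (?<DC> \w+)
    .+?
    Last\ attempt\D+ (?<date> [\d\-]+\ [\d\:]+ )
'

# Each match is one multi-line record; print its four named groups.
$rows = foreach ($m in $regex.Matches($data)) {
    $g = $m.Groups
    '{0} | {1} | {2} | {3}' -f $g['partition'].Value, $g['Site'].Value, $g['DC'].Value, $g['date'].Value
}
$rows
```

The first line of the data ("DC Options: IS_GC") is skipped because the partition group requires an equals sign immediately after CN or DC.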

To use data from a file, simply concatenate it all together (e.g. with Out-String):

        PS> Get-Content data.txt | Out-String | ConvertFrom-Text -pattern $regex -Multiline
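On PowerShell 3.0 and later, Get-Content -Raw is another way to read a file as one string. The sketch below only compares the two approaches on a throwaway file; the file name and contents are assumptions for illustration.

```powershell
# Create a small temporary file with two lines.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'ConvertFrom-Text-demo.txt'
Set-Content -Path $path -Value 'line one', 'line two'

# Without -Raw, Get-Content emits one string per line; Out-String rejoins them.
$viaOutString = Get-Content $path | Out-String

# With -Raw, Get-Content returns the file contents as a single string.
$viaRaw = Get-Content $path -Raw

$viaOutString -is [string]   # True
$viaRaw -is [string]         # True

Remove-Item $path
```

Either string can then be passed as a single input for multi-line matching.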

RELATED LINKS

-none-

This documentation set was created with CleanCode's DocTreeGenerator.

Copyright © 2001-2015 Michael Sorens
Usage governed by Mozilla Public License 1.1 and CleanCode Courtesy License
CleanCode -- The Website for Clean Design. Revised 2015.12.16