CleanCode Perl Libraries

NAME

Convert::SimpleHtml2Text - Converts an HTML document to text.

SYNOPSIS

        use Convert::SimpleHtml2Text;
        $plainText =  simpleHtmlToText($htmlText);
        $plainText =  simpleHtmlToText($htmlText, 1);

EXPORTS

Default: simpleHtmlToText

Optional: none

REQUIRES

Perl 5.005

DESCRIPTION

Converts an HTML document to text by stripping out all formatting tags and performing simple conversions. On each line, runs of multiple blanks are removed, as are leading and trailing blanks. Multiple consecutive line breaks are also removed.
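The whitespace rules described above can be illustrated with a minimal sketch. It is not the module's actual implementation: normalize_whitespace is a hypothetical helper, and it assumes that a run of blanks collapses to a single space and that empty lines are dropped entirely.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Convert::SimpleHtml2Text.
# Assumption: runs of blanks collapse to one space; empty lines are dropped.
sub normalize_whitespace {
    my ($text) = @_;
    my @lines;
    for my $line (split /\n/, $text) {
        $line =~ s/[ \t]+/ /g;      # collapse runs of spaces and tabs
        $line =~ s/^ +| +$//g;      # trim leading and trailing blanks
        push @lines, $line if length $line;   # skipping empty lines removes repeated line breaks
    }
    return join("\n", @lines);
}

print normalize_whitespace("  Hello    world  \n\n\n   second   line \n"), "\n";
# prints:
# Hello world
# second line
```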

Example: This block of HTML is an abbreviated version of Google's search page, a refreshingly simple web page.

        <html>
        <head>
        <meta http-equiv="content-type" content="text/html; charset=UTF-8">
        <title>Google</title>
        </head>
        <body bgcolor=#ffffff text=#000000 link=#0000cc vlink=#551a8b alink=#ff0000 onLoad=sf()>
        <br>
        <form action="/search" name=f>
        <table cellspacing=0 cellpadding=0>
        <tr>
        <td align=center>
        <input type=submit value="Google Search" name=btnG>
        <input type=submit value="I'm Feeling Lucky" name=btnI>
        </td>
        <td valign=top nowrap>
        &nbsp;&#8226;&nbsp;<a href=/advanced_search?hl=en>Advanced&nbsp;Search</a><br>
        &nbsp;&#8226;&nbsp;<a href=/preferences?hl=en>Preferences</a><br>
        &nbsp;&#8226;&nbsp;<a href=/language_tools?hl=en>Language Tools</a>
        </td>
        </tr>
        </table>
        </form>
        <p>
        <font size=-2>&copy;2002 Google</font>
        <font size=-2>- Searching 3,083,324,652 web pages</font>
        </body>
        </html>

Running that piece of HTML through the simpleHtmlToText function results in:

        Google
        &#8226;  Advanced Search
        &#8226;  Preferences
        &#8226;  Language Tools
        &copy;2002 Google
        - Searching 3,083,324,652 web pages

The optional keepImages flag lets you retain one piece of information about a graphic file: its base file name. With this input:

        <html>
        <body>
        <p>Some text here...</p>
        <a href="some.url.com"><img src="/some/path/file.gif" /></a>
        </body>
        </html>

you get this output:

        Some text here...
        [#file.gif#]

...assuming you have set the keepImages flag to true.
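As a rough illustration of that behavior, the sketch below replaces each image tag with its base file name and then strips the remaining tags. It is an assumption about one way to do this, not the module's code: condense_images is a hypothetical helper, and the tag stripping is deliberately naive (it would mishandle a ">" inside an attribute value).

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Convert::SimpleHtml2Text.
# Replaces each <img ... src=...> tag with a [#basename#] marker.
sub condense_images {
    my ($html) = @_;
    # Skip any directory part of the src value and capture only the file name.
    $html =~ s{<img\b[^>]*?\bsrc\s*=\s*["']?(?:[^"'\s>]*/)?([^"'\s>/]+)[^>]*>}{[#${1}#]}gi;
    return $html;
}

my $html = qq{<p>Some text here...</p>\n}
         . qq{<a href="some.url.com"><img src="/some/path/file.gif" /></a>};

my $text = condense_images($html);
$text =~ s/<[^>]*>//g;    # naive removal of the remaining tags
print "$text\n";
# prints:
# Some text here...
# [#file.gif#]
```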

FUNCTIONS

simpleHtmlToText

simpleHtmlToText(text, keepImages)

simpleHtmlToText(text)

Convert an HTML document (represented by a string) into text, optionally keeping the image references. If keepImages is true, then image references will be condensed and retained. For example, <IMG SRC="http://www.content.com/abc/def/hello.gif"> will be replaced by [#hello.gif#]. The "[# #]" brackets are used for easy selection of images by other applications.
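To show why the "[# #]" brackets make selection easy, here is a small sketch of how a consuming application might pull the image names back out of converted text. The function name extract_image_names and the sample string are illustrative assumptions, not part of this module.

```perl
use strict;
use warnings;

# Hypothetical consumer, NOT part of Convert::SimpleHtml2Text.
# Returns the file names found inside [# #] markers, in order of appearance.
sub extract_image_names {
    my ($text) = @_;
    my @names = $text =~ /\[#(.*?)#\]/g;   # one capture per marker
    return @names;
}

my @found = extract_image_names("Intro [#hello.gif#] more text [#logo.png#]");
print join(", ", @found), "\n";   # prints "hello.gif, logo.png"
```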

Parameters:

text - string; a string representing the HTML document.

keepImages - optional; a boolean indicating whether to keep image file names in the output.

Returns:

A string representing the text of the HTML document.

BUGS

None

AUTHOR

Michael Sorens

VERSION

$Revision: 8 $ $Date: 2006-12-19 21:13:43 -0800 (Tue, 19 Dec 2006) $

SINCE

CleanCode 0.9

SEE ALSO

Java version

CleanCode Perl Libraries Copyright © 2001-2013 Michael Sorens - Revised 2013.06.30