CleanCode Perl Libraries |
Convert::SimpleHtml2Text - Converts an HTML document to text.
use Convert::SimpleHtml2Text;
$plainText = simpleHtmlToText($htmlText);
$plainText = simpleHtmlToText($htmlText, 1);
Default: simpleHtmlToText
Optional: none
Perl5.005
Converts an HTML document to text by stripping out all formatting tags and doing simple conversions. Within each line, runs of multiple blanks are reduced and leading and trailing blanks are removed. Repeated line breaks are also removed.
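The whitespace cleanup described above can be sketched roughly as follows. This is an illustration only, not the module's actual code: normalize_whitespace is a hypothetical helper, and it assumes that a run of blanks is collapsed to a single blank and that empty lines are dropped entirely.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; NOT part of Convert::SimpleHtml2Text.
# Assumption: runs of blanks collapse to one blank, empty lines are dropped.
sub normalize_whitespace {
    my ($text) = @_;
    my @lines;
    for my $line (split /\r?\n/, $text) {
        $line =~ s/[ \t]+/ /g;    # collapse runs of spaces/tabs to one space
        $line =~ s/^ | $//g;      # strip the leading and trailing blank
        push @lines, $line if length $line;    # skip lines left empty
    }
    return join("\n", @lines);
}

my $sample = "  Advanced    Search \n\n\n   Language   Tools  \n";
print normalize_whitespace($sample), "\n";
```

Doing the cleanup line by line keeps each rule to a single small substitution, which makes it easy to see which rule is responsible for a given change in the output.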
Example: This block of HTML is an abbreviated version of Google's search page, a refreshingly simple web page.
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Google</title>
</head>
<body bgcolor=#ffffff text=#000000 link=#0000cc vlink=#551a8b alink=#ff0000 onLoad=sf()>
<br>
<form action="/search" name=f>
<table cellspacing=0 cellpadding=0>
<tr>
<td align=center>
<input type=submit value="Google Search" name=btnG>
<input type=submit value="I'm Feeling Lucky" name=btnI>
</td>
<td valign=top nowrap>
• <a href=/advanced_search?hl=en>Advanced Search</a><br>
• <a href=/preferences?hl=en>Preferences</a><br>
• <a href=/language_tools?hl=en>Language Tools</a>
</td>
</tr>
</table>
</form>
<p>
<font size=-2>©2002 Google</font>
<font size=-2>- Searching 3,083,324,652 web pages</font>
</body>
</html>
Running that piece of HTML through the simpleHtmlToText function results in:
Google
• Advanced Search
• Preferences
• Language Tools
©2002 Google
- Searching 3,083,324,652 web pages
The optional keepImages flag allows you to retain a little bit of information about a graphic file: the base file name. With this input:
<html>
<body>
<p>Some text here...</p>
<a href="some.url.com"><img src="/some/path/file.gif" /></a>
</body>
</html>
you get this output:
Some text here...
[#file.gif#]
...assuming you have set the keepImages flag to true.
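To make the image handling concrete, here is a minimal sketch of how an image tag could be condensed to its base file name. It is not the module's implementation: condense_images is a hypothetical helper, the regular expression is an assumption about what counts as an image tag, and the core File::Basename module does the path trimming. Removing the surrounding anchor tags is left to the ordinary tag stripping.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper for illustration; NOT the module's actual code.
# Replaces each <img ... src=...> tag with [#base-file-name#].
sub condense_images {
    my ($html) = @_;
    $html =~ s{<img\b[^>]*?\bsrc\s*=\s*["']?([^"'\s>]+)[^>]*>}
              {'[#' . basename($1) . '#]'}gie;
    return $html;
}

my $input = '<a href="some.url.com"><img src="/some/path/file.gif" /></a>';
print condense_images($input), "\n";
```

Only the img tag is rewritten here; the anchor around it is untouched, so a later tag-stripping pass would leave just the bracketed file name.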
simpleHtmlToText(text, keepImages)
simpleHtmlToText(text)
Convert an HTML document (represented by a string) into text, optionally keeping the image references. If keepImages is true, then image references will be condensed and retained. For example, <IMG SRC="http://www.content.com/abc/def/hello.gif"> will be replaced by [#hello.gif#]. The "[# #]" brackets are used for easy selection of images by other applications.
text - string; a string representing the HTML document.
keepImages - optional; a boolean indicating whether to keep image file names in the output.
A string representing the text of the HTML document.
None
Michael Sorens
$Revision: 8 $ $Date: 2006-12-19 21:13:43 -0800 (Tue, 19 Dec 2006) $
CleanCode 0.9
Java version
CleanCode Perl Libraries | Copyright © 2001-2013 Michael Sorens - Revised 2013.06.30 |