com.cleancode.net
Class SimpleHtmlToText

java.lang.Object
  extended by com.cleancode.net.SimpleHtmlToText

public class SimpleHtmlToText
extends Object

Converts an HTML document to text by stripping out all formatting tags and performing simple conversions. Tables are converted to use tabs and newlines (though nested tables may not render correctly). Multiple blanks are removed on each line, as are leading and trailing blanks. Multiple line breaks are removed. <HR> tags are printed as simple dividers (=============). Common HTML character entities (&amp;, &copy;, &lt;, and &gt;) are converted to ASCII equivalents.
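
The conversions listed above can be illustrated with a minimal, self-contained sketch. This is not the source of SimpleHtmlToText; the class name HtmlToTextSketch, the method toText, the regular expressions, the choice of "(c)" for the copyright entity, and the reading of "removed" as "collapsed to a single blank or line break" are all assumptions made for illustration.

```java
public class HtmlToTextSketch {
    // Hypothetical helper: applies the kinds of conversions the class description lists.
    public static String toText(String html) {
        String s = html;
        // <HR> tags become a simple divider on its own line.
        s = s.replaceAll("(?i)<hr\\b[^>]*>", "\n=============\n");
        // Table cells are separated by tabs, rows by newlines (nested tables are not handled).
        s = s.replaceAll("(?i)</t[dh]\\s*>", "\t");
        s = s.replaceAll("(?i)</tr\\s*>", "\n");
        // Strip every remaining tag.
        s = s.replaceAll("<[^>]+>", "");
        // Decode a few common entities; &amp; goes last so text is decoded only once.
        // Mapping &copy; to "(c)" is an assumption.
        s = s.replace("&lt;", "<").replace("&gt;", ">")
             .replace("&copy;", "(c)").replace("&amp;", "&");
        // Collapse runs of blanks, then trim blanks and tabs at the ends of each line.
        s = s.replaceAll(" {2,}", " ");
        s = s.replaceAll("(?m)^[ \\t]+|[ \\t]+$", "");
        // Collapse runs of line breaks into one (assumed meaning of "removed").
        s = s.replaceAll("\\n{2,}", "\n");
        return s.trim();
    }
}
```

Tags are stripped before entities are decoded so that text such as &lt;x&gt; survives as literal <x> instead of being mistaken for a tag.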

Since:
CleanCode 0.9
Version:
$Revision: 9 $
Author:
Michael Sorens
See Also:
REConverter

Nested Class Summary
static class SimpleHtmlToText.Test
          A standalone test class.
 
Field Summary
static String VERSION
          Current version of this class.
 
Constructor Summary
SimpleHtmlToText()
          Construct a SimpleHtmlToText object.
 
Method Summary
 String convert(String content)
          Convert an HTML document (represented by a String) into text.
 String convert(String content, boolean keepImages)
          Convert an HTML document (represented by a String) into text, optionally keeping the image references.
static void main(String[] args)
          Main program for standalone mode.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VERSION

public static final String VERSION
Current version of this class.

Constructor Detail

SimpleHtmlToText

public SimpleHtmlToText()
Construct a SimpleHtmlToText object.

Method Detail

convert

public String convert(String content)
Convert an HTML document (represented by a String) into text. Image references are deleted just as all other HTML tags are.

Parameters:
content - a String representing the HTML document
Returns:
a String representing the text of the HTML document

convert

public String convert(String content,
                      boolean keepImages)
Convert an HTML document (represented by a String) into text, optionally keeping the image references. If keepImages is true, then image references will be condensed and retained. For example, <img src="http://www.content.com/abc/def/hello.gif"> will be replaced by [#hello.gif#]. The "[# #]" brackets are used for easy selection of images by other applications.

Parameters:
content - a String representing the HTML document
keepImages - a boolean indicating whether to keep image file names in the output
Returns:
a String representing the text of the HTML document
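
The image-reference condensation described for keepImages can be sketched as follows. This is an illustrative approximation, not the library's implementation: the class ImageRefSketch, the method condenseImages, and the regular expression (which only handles double-quoted src attributes) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageRefSketch {
    // Hypothetical pattern: matches an <img> tag and captures its double-quoted src value.
    private static final Pattern IMG = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"[^>]*>", Pattern.CASE_INSENSITIVE);

    // Replaces each <img> tag with its file name wrapped in "[# #]" brackets.
    public static String condenseImages(String html) {
        Matcher m = IMG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String src = m.group(1);
            // Keep only the part after the last '/'; the whole value if there is no '/'.
            String name = src.substring(src.lastIndexOf('/') + 1);
            m.appendReplacement(out, Matcher.quoteReplacement("[#" + name + "#]"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this sketch, the example tag from the description condenses to [#hello.gif#], and another application could later locate images by searching for the "[#" and "#]" markers.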

main

public static void main(String[] args)
                 throws IOException
Main program for standalone mode. Converts the specified file to text and prints the results to stdout.

Usage: SimpleHtmlToText filename

Parameters:
args - filename to convert
Throws:
IOException - if there is any problem reading the file.
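
The standalone mode (read the named file, convert it, print the result) can be approximated by the sketch below. It is kept independent of the library so that it compiles on its own: the class StandaloneSketch, the method convertFile, the converter parameter, and the UTF-8 charset are assumptions, and how SimpleHtmlToText actually reads the file is not stated in this documentation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.UnaryOperator;

public class StandaloneSketch {
    // Reads the whole file as text and passes it through the supplied converter.
    // UTF-8 is assumed here; the real tool's charset handling is not documented above.
    public static String convertFile(String filename, UnaryOperator<String> converter)
            throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        String html = new String(bytes, StandardCharsets.UTF_8);
        return converter.apply(html);
    }
}
```

With the library on the classpath, the converter argument could be a method reference to the documented convert(String) method on a SimpleHtmlToText instance, and the returned String could be printed to stdout.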


CleanCode Java Libraries Copyright © 2001-2012 Michael Sorens - Revised 2012.12.10