com.cleancode.net
Class URLReader

java.lang.Object
  extended by com.cleancode.net.URLReader

public class URLReader
extends Object

Fetches the contents of a URL with a variety of options.

Since:
CleanCode 0.9
Version:
$Revision: 9 $
Author:
Michael Sorens

Field Summary
static String PSEUDO_LINE_DELIMITER
          For POST data on the command line, use this to indicate where actual line breaks go.
static String VERSION
          Current version of this class.
 
Constructor Summary
URLReader(String urlString, boolean verbose, String proxyProperty)
          Creates a URLReader object to fetch URLs.
 
Method Summary
 String getContent()
          Returns the HTML content of the previously fetched URL.
 String getText(boolean keepImages)
          Returns the text extracted from the content of the previously fetched URL.
static void main(String[] args)
          Standalone program to fetch a URL.
 void readFromURL()
          Fetches a URL using URL objects to establish a connection.
 void readFromURLConn(String[] args, String agent, String postData)
          Fetches a URL using URLConnection objects.
 void readRaw()
          Fetches a URL in raw mode, using Socket objects to establish a connection.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VERSION

public static final String VERSION
Current version of this class.


PSEUDO_LINE_DELIMITER

public static final String PSEUDO_LINE_DELIMITER
For POST data on the command line, use this to indicate where actual line breaks go.

See Also:
Constant Field Values

Constructor Detail

URLReader

public URLReader(String urlString,
                 boolean verbose,
                 String proxyProperty)
Creates a URLReader object to fetch URLs.

Parameters:
urlString - target URL
verbose - diagnostic output flag
proxyProperty - host:port string for proxy server to use
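
The host:port form of proxyProperty can be illustrated with a small parser. This is only a sketch of one way such a string might be turned into a java.net.Proxy; the class name ProxySpec and its parse method are hypothetical and are not part of URLReader, whose internal handling of the value is not documented here.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper, not part of the CleanCode library.
final class ProxySpec {

    /** Parses a "host:port" string into an HTTP proxy; null or empty means no proxy. */
    static Proxy parse(String proxyProperty) {
        if (proxyProperty == null || proxyProperty.isEmpty()) {
            return Proxy.NO_PROXY;
        }
        int colon = proxyProperty.lastIndexOf(':');
        if (colon <= 0 || colon == proxyProperty.length() - 1) {
            throw new IllegalArgumentException("expected host:port, got: " + proxyProperty);
        }
        String host = proxyProperty.substring(0, colon);
        // Integer.parseInt throws NumberFormatException for a non-numeric port.
        int port = Integer.parseInt(proxyProperty.substring(colon + 1));
        // createUnresolved avoids a DNS lookup at parse time.
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        System.out.println(parse("proxy.example.com:8080"));
    }
}
```

The resulting Proxy could then be passed to URL.openConnection(Proxy) in code that uses the JDK networking classes directly.
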
Method Detail

getContent

public String getContent()
Returns the HTML content of the previously fetched URL.

Returns:
string representation of the fetched URL's content

getText

public String getText(boolean keepImages)
Returns the text extracted from the content of the previously fetched URL. The SimpleHtmlToText converter extracts the text from the HTML content. Optionally, a marker is retained for each image (the image file name in brackets).

Parameters:
keepImages - whether to keep a marker for each image in the text.
Returns:
string representation of extracted text of URL
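
To make the effect of keepImages concrete, here is a rough regex-based sketch of the idea. It is not the SimpleHtmlToText implementation, which is not shown on this page; the class HtmlTextSketch, its toText method, and the exact marker format are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the real converter is SimpleHtmlToText.
final class HtmlTextSketch {

    // Matches an <img> tag and captures the value of its quoted src attribute.
    private static final Pattern IMG = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Matches any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String toText(String html, boolean keepImages) {
        Matcher m = IMG.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String src = m.group(1);
            // Keep only the file name portion of the image path.
            String name = src.substring(src.lastIndexOf('/') + 1);
            String marker = keepImages ? "[" + name + "]" : "";
            m.appendReplacement(sb, Matcher.quoteReplacement(marker));
        }
        m.appendTail(sb);
        // Replace remaining tags with a space, then collapse runs of whitespace.
        String noTags = TAG.matcher(sb).replaceAll(" ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Logo: <img src=\"/img/logo.png\" alt=\"x\"> here</p>";
        System.out.println(toText(html, true));
        System.out.println(toText(html, false));
    }
}
```

A regex approach like this ignores entities, scripts, and malformed markup, so it only conveys the shape of the output, not the behavior of a real converter.
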

readRaw

public void readRaw()
             throws IOException
Fetches a URL in raw mode, using Socket objects to establish a connection. A raw HTTP GET command initiates the transaction.

Throws:
IOException - if an I/O problem occurs
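
For readers unfamiliar with what a raw HTTP GET over a socket looks like, the following sketch shows the general technique with plain JDK classes. The class RawGetSketch, the choice of HTTP/1.0, and the Host header are assumptions; the request that readRaw actually sends is not documented here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a raw-socket fetch; not the URLReader source.
final class RawGetSketch {

    /** Builds the text of a minimal GET request for the given URI. */
    static String buildRequest(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (uri.getRawQuery() != null) {
            path += "?" + uri.getRawQuery();
        }
        // HTTP/1.0 is assumed so that the server closes the connection when done.
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + uri.getHost() + "\r\n"
                + "\r\n";
    }

    /** Sends the request and returns the raw response, headers included. */
    static String fetch(URI uri) throws IOException {
        int port = uri.getPort() == -1 ? 80 : uri.getPort();
        try (Socket socket = new Socket(uri.getHost(), port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(uri).getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(URI.create(args[0])));
    }
}
```

Because nothing interprets the response, the returned string contains the status line and headers as well as the body.
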

readFromURL

public void readFromURL()
                 throws IOException
Fetches a URL using URL objects to establish a connection. Cookies cannot be sent in this mode.

Throws:
IOException - if an I/O problem occurs
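
The URL-object mode corresponds to the simplest JDK fetch, in which URL.openStream() is called with no opportunity to set request headers, which is consistent with cookies being unavailable. The sketch below shows that technique in isolation; UrlStreamSketch is a hypothetical name, and decoding the bytes as UTF-8 is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the URLReader source.
final class UrlStreamSketch {

    /** Reads everything the URL's stream provides and decodes it as UTF-8 (assumed). */
    static String fetch(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(URI.create(args[0]).toURL()));
    }
}
```
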

readFromURLConn

public void readFromURLConn(String[] args,
                            String agent,
                            String postData)
                     throws IOException
Fetches a URL using URLConnection objects. Cookies may be sent with the request. If postData is empty, an HTTP GET is used. Otherwise, the string is split into lines at each embedded PSEUDO_LINE_DELIMITER and then POSTed.

Parameters:
args - array of Strings beginning with the URL; the remaining elements are cookies.
agent - agent string to send (or null to use the system default)
postData - data to send via HTTP POST, if present
Throws:
IOException - if an I/O problem occurs
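
The cookie and POST handling described above might be approximated with URLConnection as in the sketch below. Everything in it is illustrative: UrlConnSketch and its methods are hypothetical, the delimiter is passed in as a parameter because the actual PSEUDO_LINE_DELIMITER value is not shown on this page, and joining the split lines with "\n" before sending is an assumption.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not the URLReader source.
final class UrlConnSketch {

    /** Joins args[1..] into a single Cookie header value. */
    static String cookieHeader(String[] args) {
        return String.join("; ", Arrays.copyOfRange(args, 1, args.length));
    }

    /** Splits POST data at each literal occurrence of the delimiter. */
    static List<String> splitPostData(String postData, String delimiter) {
        return Arrays.asList(postData.split(Pattern.quote(delimiter), -1));
    }

    /** Opens a connection, issuing a GET if postData is empty and a POST otherwise. */
    static URLConnection open(String[] args, String agent, String postData, String delimiter)
            throws IOException {
        URLConnection conn = URI.create(args[0]).toURL().openConnection();
        if (args.length > 1) {
            conn.setRequestProperty("Cookie", cookieHeader(args));
        }
        if (agent != null) {
            conn.setRequestProperty("User-Agent", agent);
        }
        if (postData != null && !postData.isEmpty()) {
            // For HTTP connections, writing a request body makes the request a POST.
            conn.setDoOutput(true);
            String body = String.join("\n", splitPostData(postData, delimiter)); // assumed line separator
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
        }
        return conn;
    }

    public static void main(String[] args) {
        // "##" is a made-up delimiter used only for this demonstration.
        System.out.println(splitPostData("a=1##b=2", "##"));
        System.out.println(cookieHeader(new String[] {"http://www.example.com/", "session=abc", "lang=en"}));
    }
}
```
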

main

public static void main(String[] args)
Standalone program to fetch a URL.
 Usage: java [ options ] URLReader url { cookie... }

 Options:
  -Dproxy=<string> - host:port specification for proxy server
  -Draw - use sockets
  -Dtext - convert HTML to text
  -Dtext=1 - convert HTML to text, but leave image references
  -Dverbose - indicate program actions
  -Dagent=IE - use IE 5.0 user agent identifier
  -Dagent=NS - use NS 4.76 user agent identifier
  -Dagent=<string> - use specified user agent identifier
  -Dpost=<string> - data for HTTP POST [experimental]
 
 Sample invocations:
 java URLReader "http://www.dell.com/"
 java URLReader "http://www.aaii.com/stkscrns/archive/" "session=LOEFGMO"
 

Parameters:
args - command-line arguments.
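
Because the options are passed as -D system properties rather than program arguments, a flag such as -Dtext is visible to the program as a property whose value is the empty string, while -Dtext=1 yields the value "1". The sketch below shows one way that distinction could be read; OptionSketch and its TextMode enum are hypothetical and do not reflect how main is actually written.

```java
import java.util.Properties;

// Hypothetical illustration of reading -D options; not the URLReader source.
final class OptionSketch {

    enum TextMode { HTML, TEXT, TEXT_WITH_IMAGES }

    /** Maps the "text" property to a mode: absent, present, or present with value "1". */
    static TextMode textMode(Properties props) {
        String value = props.getProperty("text");
        if (value == null) {
            return TextMode.HTML;
        }
        return value.equals("1") ? TextMode.TEXT_WITH_IMAGES : TextMode.TEXT;
    }

    /** A valueless flag such as -Draw is present with an empty-string value. */
    static boolean isFlagSet(Properties props, String name) {
        return props.getProperty(name) != null;
    }

    public static void main(String[] args) {
        Properties props = System.getProperties();
        System.out.println("text mode: " + textMode(props));
        System.out.println("raw: " + isFlagSet(props, "raw"));
        System.out.println("verbose: " + isFlagSet(props, "verbose"));
    }
}
```

Running this class as java -Dtext=1 -Draw OptionSketch would exercise the same kind of option handling as the usage line above.
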


CleanCode Java Libraries Copyright © 2001-2012 Michael Sorens - Revised 2012.12.10