Package | Description |
---|---|
org.htmlparser | The basic API classes which will be used by most developers when working with the HTML Parser. |
org.htmlparser.beans | The beans package contains Java Beans using the HTML Parser. |
org.htmlparser.http | The http package is responsible for HTTP connections to servers. |
org.htmlparser.lexer | The lexer package is the base level I/O subsystem. |
org.htmlparser.lexerapplications.thumbelina | Extract the images behind thumbnail images. |
org.htmlparser.nodes | The nodes package has the concrete node implementations. |
org.htmlparser.parserapplications | Example applications. |
org.htmlparser.sax | The sax package implements a SAX (Simple API for XML) parser for HTML. |
org.htmlparser.scanners | The scanners package contains classes responsible for the tertiary identification of tags. |
org.htmlparser.tags | The tags package contains specific tags. |
org.htmlparser.util | Code that can be reused by many classes is located in this package. |

Modifier and Type | Method and Description |
---|---|
Remark | NodeFactory.createRemarkNode(Page page, int start, int end) Create a new remark node. |
Text | NodeFactory.createStringNode(Page page, int start, int end) Create a new text node. |
Tag | NodeFactory.createTagNode(Page page, int start, int end, java.util.Vector attributes) Create a new tag node. |
void | Node.doSemanticAction() Perform the meaning of this tag. |
NodeIterator | Parser.elements() Returns an iterator (enumeration) over the HTML nodes. |
NodeList | Parser.extractAllNodesThatMatch(NodeFilter filter) Extract all nodes matching the given filter. |
NodeList | Parser.parse(NodeFilter filter) Parse the given resource, using the filter provided. |
void | Parser.postConnect(java.net.HttpURLConnection connection) Called just after calling connect. |
void | Parser.preConnect(java.net.HttpURLConnection connection) Called just prior to calling connect. |
void | Parser.setConnection(java.net.URLConnection connection) Set the connection for this parser. |
void | Parser.setEncoding(java.lang.String encoding) Set the encoding for the page this parser is reading from. |
void | Parser.setInputHTML(java.lang.String inputHTML) Initializes the parser with the given input HTML String. |
void | Parser.setResource(java.lang.String resource) Set the HTML, a URL, or a file. |
void | Parser.setURL(java.lang.String url) Set the URL for this parser. |
void | Parser.visitAllNodesWith(NodeVisitor visitor) Apply the given visitor to the current page. |

Constructor and Description |
---|
Parser(java.lang.String resource) Creates a Parser object with the location of the resource (URL or file). |
Parser(java.lang.String resource, ParserFeedback feedback) Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in. |
Parser(java.net.URLConnection connection) Construct a parser using the provided URLConnection. |
Parser(java.net.URLConnection connection, ParserFeedback fb) Constructor for custom HTTP access. |

Modifier and Type | Method and Description |
---|---|
protected NodeList | FilterBean.applyFilters() Apply each of the filters. |
protected java.net.URL[] | LinkBean.extractLinks() Internal routine to extract all the links from the parser. |
protected java.lang.String | StringBean.extractStrings() Extract the text from a page. |

Modifier and Type | Method and Description |
---|---|
java.net.URLConnection | ConnectionManager.openConnection(java.lang.String string) Opens a connection based on a given string. |
java.net.URLConnection | ConnectionManager.openConnection(java.net.URL url) Opens a connection using the given URL. |
void | ConnectionMonitor.postConnect(java.net.HttpURLConnection connection) Called just after calling connect. |
void | ConnectionMonitor.preConnect(java.net.HttpURLConnection connection) Called just prior to calling connect. |

Modifier and Type | Method and Description |
---|---|
char | Page.getCharacter(Cursor cursor) Read the character at the given cursor position. |
static void | Lexer.main(java.lang.String[] args) Mainline for command line operation. |
protected Node | Lexer.makeRemark(int start, int end) Create a remark node based on the current cursor and the one provided. |
protected Node | Lexer.makeString(int start, int end) Create a string node based on the current cursor and the one provided. |
protected Node | Lexer.makeTag(int start, int end, java.util.Vector attributes) Create a tag node based on the current cursor and the one provided. |
Node | Lexer.nextNode() Get the next node from the source. |
Node | Lexer.nextNode(boolean quotesmart) Get the next node from the source. |
Node | Lexer.parseCDATA() Return CDATA as a text node. |
Node | Lexer.parseCDATA(boolean quotesmart) Return CDATA as a text node. |
protected Node | Lexer.parseJsp(int start) Parse a Java Server Page node. |
protected Node | Lexer.parsePI(int start) Parse an XML processing instruction. |
protected Node | Lexer.parseRemark(int start, boolean quotesmart) Parse a comment. |
protected Node | Lexer.parseString(int start, boolean quotesmart) Parse a string node. |
protected Node | Lexer.parseTag(int start) Parse a tag. |
protected void | Lexer.scanJIS(Cursor cursor) Advance the cursor through a JIS escape sequence. |
void | Page.setConnection(java.net.URLConnection connection) Set the URLConnection to be used by this page. |
void | Page.setEncoding(java.lang.String character_set) Begins reading from the source with the given character set. |
abstract void | Source.setEncoding(java.lang.String character_set) Set the encoding to the given character set. |
void | InputStreamSource.setEncoding(java.lang.String character_set) Begins reading from the source with the given character set. |
void | StringSource.setEncoding(java.lang.String character_set) Set the encoding to the given character set. |
void | Page.ungetCharacter(Cursor cursor) Return a character. |

Constructor and Description |
---|
Lexer(java.net.URLConnection connection) Creates a new instance of a Lexer. |
Page(java.net.URLConnection connection) Construct a page reading from a URL connection. |

Modifier and Type | Method and Description |
---|---|
protected java.net.URL[][] | Thumbelina.extractImageLinks(Lexer lexer, java.net.URL docbase) Get the links of an element of a document. |

Modifier and Type | Method and Description |
---|---|
void | AbstractNode.doSemanticAction() Perform the meaning of this tag. |

Modifier and Type | Method and Description |
---|---|
java.lang.String | StringExtractor.extractStrings(boolean links) Extract the text from a page. |
protected boolean | SiteCapturer.isHtml(java.lang.String link) Returns true if the link contains text/html content. |
protected void | SiteCapturer.process(NodeFilter filter) Process a single page. |

Modifier and Type | Method and Description |
---|---|
void | Feedback.error(java.lang.String message, ParserException e) Error message. |

Modifier and Type | Method and Description |
---|---|
protected void | XMLReader.doSAX(Node node) Process nodes recursively on the DocumentHandler. |

Modifier and Type | Method and Description |
---|---|
protected Tag | CompositeTagScanner.createVirtualEndTag(Tag tag, Lexer lexer, Page page, int position) Creates an end tag with the same name as the given tag. |
static java.lang.String | ScriptDecoder.Decode(Page page, Cursor cursor) Decode script encoded by the Microsoft obfuscator. |
protected void | CompositeTagScanner.finishTag(Tag tag, Lexer lexer) Finish off a tag. |
Tag | Scanner.scan(Tag tag, Lexer lexer, NodeList stack) Scan the tag. |
Tag | TagScanner.scan(Tag tag, Lexer lexer, NodeList stack) Scan the tag. |
Tag | CompositeTagScanner.scan(Tag tag, Lexer lexer, NodeList stack) Collect the children. |
Tag | ScriptScanner.scan(Tag tag, Lexer lexer, NodeList stack) Scan for script. |
Tag | StyleScanner.scan(Tag tag, Lexer lexer, NodeList stack) Scan for style definitions. |

Modifier and Type | Method and Description |
---|---|
void | BaseHrefTag.doSemanticAction() Perform the meaning of this tag. |
void | MetaTag.doSemanticAction() Perform the META tag semantic action. |

Modifier and Type | Class and Description |
---|---|
class | EncodingChangeException The encoding has changed, invalidating already scanned characters. |

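
The description of `EncodingChangeException` implies that characters scanned before the encoding switch can no longer be trusted, so a caller would plausibly start the parse over. The following is a minimal sketch of that retry pattern, not the library's own recovery code. To compile without the library jar it declares local stand-ins for `ParserException` and `EncodingChangeException` (assumed here to be a subclass of `ParserException`); `ParseAttempt`, `parseWithRetry`, and the `reset` callback are hypothetical names introduced for illustration.

```java
class EncodingRetrySketch {
    // Local stand-ins so the sketch compiles on its own; not the library classes.
    static class ParserException extends Exception {
        ParserException(String message) { super(message); }
    }

    // Assumption: the encoding-change exception is a kind of ParserException.
    static class EncodingChangeException extends ParserException {
        EncodingChangeException(String message) { super(message); }
    }

    // Hypothetical callback representing one full parse of a page.
    interface ParseAttempt {
        String run() throws ParserException;
    }

    // Runs the attempt; on an encoding change, resets and tries exactly once more.
    // Returns null if the parse fails for any other reason or fails twice.
    static String parseWithRetry(ParseAttempt attempt, Runnable reset) {
        try {
            return attempt.run();
        } catch (EncodingChangeException changed) {
            reset.run(); // discard state built from the wrongly decoded characters
            try {
                return attempt.run();
            } catch (ParserException secondFailure) {
                return null;
            }
        } catch (ParserException other) {
            return null;
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = parseWithRetry(() -> {
            // Simulate an encoding change that is detected on the first pass only.
            if (calls[0]++ == 0) {
                throw new EncodingChangeException("charset changed");
            }
            return "parsed on attempt " + calls[0];
        }, () -> System.out.println("resetting"));
        System.out.println(result);
    }
}
```

Retrying only once keeps the sketch from looping forever if the encoding keeps changing; other `ParserException` failures are not retried at all.
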
Modifier and Type | Method and Description |
---|---|
void | ParserFeedback.error(java.lang.String message, ParserException e) |
void | DefaultParserFeedback.error(java.lang.String message, ParserException exception) Print an error message. |
static void | FeedbackManager.error(java.lang.String message, ParserException e) |

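
`DefaultParserFeedback.error` is described above as printing the message. An application that wants to inspect failures afterwards could supply its own feedback object instead. The sketch below assumes a feedback type needs only the `error(String, ParserException)` method listed here; the library's `ParserFeedback` interface may declare more. The nested `ParserException` and `ParserFeedback` are local stand-ins so the example compiles without the library jar, and `CollectingFeedback` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class FeedbackSketch {
    // Local stand-in for the library's exception type.
    static class ParserException extends Exception {
        ParserException(String message) { super(message); }
    }

    // Local stand-in declaring only the error method shown in the table above.
    interface ParserFeedback {
        void error(String message, ParserException e);
    }

    // Hypothetical implementation that records errors instead of printing them.
    static class CollectingFeedback implements ParserFeedback {
        final List<String> errors = new ArrayList<>();

        @Override
        public void error(String message, ParserException e) {
            // Tolerate a null exception: keep the message alone in that case.
            errors.add(e == null ? message : message + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        CollectingFeedback feedback = new CollectingFeedback();
        feedback.error("could not read page", new ParserException("connection refused"));
        feedback.error("giving up", null);
        System.out.println(feedback.errors);
    }
}
```

Collecting messages in a list makes it possible to report all problems once parsing has finished, for example in a test or a batch job.
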
Modifier and Type | Method and Description |
---|---|
static Parser | ParserUtils.createParserParsingAnInputString(java.lang.String input) Create a Parser object that takes a String as input (instead of a URL or a string representing the URL location). |
boolean | NodeIterator.hasMoreNodes() Check if more nodes are available. |
boolean | IteratorImpl.hasMoreNodes() Check if more nodes are available. |
Node | NodeIterator.nextNode() Get the next node. |
Node | IteratorImpl.nextNode() Get the next node. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, java.lang.Class nodeType) Split the input string into a string array, using the tags as delimiters. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, java.lang.Class nodeType, boolean recursive, boolean insideTag) Split the input string into a string array, using the tags as delimiters. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, NodeFilter filter) Split the input string into a string array, using the tags as delimiters. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, NodeFilter filter, boolean recursive, boolean insideTag) Split the input string into a string array, using the tags as delimiters. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, java.lang.String[] tags) Split the input string into a string array, using the tags as delimiters. |
static java.lang.String[] | ParserUtils.splitTags(java.lang.String input, java.lang.String[] tags, boolean recursive, boolean insideTag) Split the input string into a string array, using the tags as delimiters. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, java.lang.Class nodeType) Trim all tags in the input string and return a string like the input one without the tags and their content. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, java.lang.Class nodeType, boolean recursive, boolean insideTag) Trim all tags in the input string and return a string like the input one without the tags and, optionally, their content. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, NodeFilter filter) Trim all tags in the input string and return a string like the input one without the tags and their content. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, NodeFilter filter, boolean recursive, boolean insideTag) Trim all tags in the input string and return a string like the input one without the tags and, optionally, their content. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, java.lang.String[] tags) Trim all tags in the input string and return a string like the input one without the tags and their content. |
static java.lang.String | ParserUtils.trimTags(java.lang.String input, java.lang.String[] tags, boolean recursive, boolean insideTag) Trim all tags in the input string and return a string like the input one without the tags and, optionally, their content. |
void | NodeList.visitAllNodesWith(NodeVisitor visitor) Utility to apply a visitor to a node list. |

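
The `hasMoreNodes()` and `nextNode()` entries above describe a pull-style iterator. Judging by the `ParserException` parameters that appear elsewhere on this page, both calls are assumed here to throw that checked exception, so a loop over the iterator needs a `try`/`catch`. The sketch below shows one way to write such a loop. `ParserException` and `NodeIterator` are local stand-ins that copy only the two listed signatures (with `String` in place of `Node` to stay small), and `fixedIterator` and `countNodes` are hypothetical helpers.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class NodeIteratorSketch {
    // Local stand-in for the library's checked exception.
    static class ParserException extends Exception {
        ParserException(String message) { super(message); }
    }

    // Local stand-in with the two method names listed above.
    // Assumption: both may throw ParserException; String replaces Node for brevity.
    interface NodeIterator {
        boolean hasMoreNodes() throws ParserException;
        String nextNode() throws ParserException;
    }

    // Hypothetical helper: wraps a list so there is something to iterate over.
    // A null element simulates a node that cannot be read.
    static NodeIterator fixedIterator(List<String> nodes) {
        Iterator<String> it = nodes.iterator();
        return new NodeIterator() {
            public boolean hasMoreNodes() { return it.hasNext(); }
            public String nextNode() throws ParserException {
                String next = it.next();
                if (next == null) throw new ParserException("unreadable node");
                return next;
            }
        };
    }

    // Counts nodes until the iterator is exhausted; returns -1 if iteration fails.
    static int countNodes(NodeIterator iterator) {
        int count = 0;
        try {
            while (iterator.hasMoreNodes()) {
                iterator.nextNode();
                count++;
            }
        } catch (ParserException e) {
            return -1;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNodes(fixedIterator(Arrays.asList("<p>", "text", "</p>"))));
        System.out.println(countNodes(fixedIterator(Arrays.asList("<p>", null))));
    }
}
```

Placing the whole loop inside a single `try` block treats a failure on any node as a failure of the traversal; catching inside the loop instead would allow skipping a bad node and continuing.
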
HTML Parser is an open source library released under LGPL.