public final class NexusTokenizer
extends java.lang.Object
A simple token pull-parser for the NEXUS file format as specified in:
Maddison, D. R., Swofford, D. L., & Maddison, W. P. (1997). NEXUS: an extensible file format for systematic information. Systematic Biology, 46(4), pp. 590-621.
The parser is designed to break a NEXUS file into tokens which are read individually. Tokens come in four different types: words, punctuation, whitespace and newlines.
Whitespace is ' ' or '\t'. Whitespace is only returned if the readWS option is set.
A newline is '\r', '\n' or '\r\n'. The parser will return the newline as read unless convertNL is set, in which case it will replace the token with the user-specified newline character.
The parser has a set of options allowing tokens to be modified before they are returned (such as case modification or newline substitution).
Each read by the parser moves forward in the stream. At present there is no support for unreading tokens or for moving bi-directionally through the stream.
NB: in this implementation, the token #NEXUS is considered special: when it is read, the parser returns one token, '#NEXUS', not two ('#' and 'NEXUS').
This special meaning is reflected in the token having its own token type, HEADER_TOKEN.
// requires java.io.FileReader and java.io.PushbackReader
NexusTokenizer ntp = new NexusTokenizer(new PushbackReader(new FileReader("afile")));
ntp.setReadWhiteSpace(false); // ignore whitespace
ntp.setIgnoreComments(true); // ignore comments
ntp.setWordModification(NexusTokenizer.WORD_UPPERCASE); // all tokens in uppercase
String nToken = ntp.readToken();
while (nToken != null) {
    System.out.println("Token: " + nToken);
    System.out.println("Col: " + ntp.getCol());
    System.out.println("Row: " + ntp.getRow());
    nToken = ntp.readToken(); // advance to the next token; without this the loop never ends
}
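The special handling of #NEXUS described above can be pictured with a small stand-alone sketch. It is not the NexusTokenizer source: the class `HeaderTokenSketch`, its methods, the pushback buffer size and the case-insensitive comparison are all assumptions made for illustration. It only shows one way a pull-parser built on a `PushbackReader` could look ahead after a '#' and fall back to a plain '#' token.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in used for illustration only.
final class HeaderTokenSketch {
    private static final String HEADER_REST = "NEXUS";

    /**
     * Call after a '#' has been consumed. Returns "#" plus the lookahead if
     * the next characters spell NEXUS (case-insensitive, an assumption);
     * otherwise pushes the lookahead back and returns "#".
     * The reader needs a pushback buffer of at least 5 characters.
     */
    static String readAfterHash(PushbackReader in) throws IOException {
        char[] buf = new char[HEADER_REST.length()];
        int n = 0;
        while (n < buf.length) {
            int c = in.read();
            if (c == -1) {
                break; // EOF before five characters were available
            }
            buf[n++] = (char) c;
        }
        String seen = new String(buf, 0, n);
        if (seen.equalsIgnoreCase(HEADER_REST)) {
            return "#" + seen;
        }
        if (n > 0) {
            in.unread(buf, 0, n); // restore the lookahead for the next read
        }
        return "#";
    }

    /** Convenience wrapper: first token of a text that starts with '#'. */
    static String firstToken(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text), 8)) {
            if (in.read() != '#') {
                throw new IllegalArgumentException("text must start with '#'");
            }
            return readAfterHash(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstToken("#NEXUS\nbegin trees;")); // #NEXUS
        System.out.println(firstToken("#NEXT"));                // #
    }
}
```

With the real class, the same distinction is visible through getLastTokenType(), which reports HEADER_TOKEN after '#NEXUS' has been read.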
Modifier and Type | Field | Description |
---|---|---|
static char | ADDITION | |
static char | ASTERIX | |
static char | B_SLASH | |
static char | B_TICK | |
static char | C_RETURN | |
static char | COLON | |
static char | COMMA | |
static char | D_QUOTE | |
static char | DASH | |
static char | EQUALS | |
static char | F_SLASH | |
static char | G_THAN | |
static char | HASH | |
static int | HEADER_TOKEN | Flag indicating last token read was the header token #NEXUS. |
static char | L_BRACE | |
static char | L_BRACKET | |
static char | L_FEED | |
static char | L_PARENTHESIS | |
static char | L_THAN | |
static int | NEWLINE_TOKEN | Flag indicating last token read was a newline symbol/word. |
static char | PERIOD | |
static int | PUNCTUATION_TOKEN | Flag indicating last token read was a punctuation symbol. |
static char | R_BRACE | |
static char | R_BRACKET | |
static char | R_PARENTHESIS | |
static char | S_QUOTE | |
static char | SEMI_COLON | |
static char | SPACE | |
static char | TAB | |
static int | UNDEFINED_TOKEN | Flag indicating last token read was undefined. |
static int | WHITESPACE_TOKEN | Flag indicating last token read was whitespace. |
static int | WORD_LOWERCASE | Flag indicating words should be converted to lowercase. |
static int | WORD_TOKEN | Flag indicating last token read was a word. |
static int | WORD_UNMODIFIED | Flag indicating words should be untouched. |
static int | WORD_UPPERCASE | Flag indicating words should be converted to uppercase. |
Constructor | Description |
---|---|
NexusTokenizer(java.io.PushbackReader pr) | Constructor for a NexusTokenizer that reads from a PushbackReader. |
NexusTokenizer(java.lang.String file) | Constructor for a NexusTokenizer that reads from the named NEXUS file. |
Modifier and Type | Method | Description |
---|---|---|
boolean | convertNewLine() | Gets the flag indicating whether this parser instance should convert newline characters. |
int | getCol() | Gets the current column position of the cursor. |
java.lang.String | getLastReadToken() | Returns the last read token. |
int | getLastTokenType() | Determines the type of the last read token. |
int | getRow() | Gets the current row position of the cursor. |
int | getWordModification() | Gets the word modification flag currently in use. |
java.lang.String | readToken() | Reads a token in from the underlying stream. |
boolean | readWhiteSpace() | Gets the flag indicating whether or not this parser object is reading (and returning) whitespace. |
java.lang.String | seek(int tokenType) | Seeks through the stream to find the next token of the specified type. |
java.lang.String | seek(java.lang.String token) | Seeks through the stream to find the token argument. |
void | setConvertNewLine(boolean b) | Sets the convertNL flag. |
void | setIgnoreComments(boolean b) | Sets the ignoreComments flag. |
void | setNewLineChar(char nl) | Sets the character that newline characters are converted into. |
void | setReadWhiteSpace(boolean b) | Sets the readWS flag. |
void | setWordModification(int flag) | Sets the flag value for word modification. |
public static final char L_PARENTHESIS
public static final char R_PARENTHESIS
public static final char L_BRACKET
public static final char R_BRACKET
public static final char L_BRACE
public static final char R_BRACE
public static final char F_SLASH
public static final char B_SLASH
public static final char COMMA
public static final char SEMI_COLON
public static final char COLON
public static final char EQUALS
public static final char ASTERIX
public static final char S_QUOTE
public static final char D_QUOTE
public static final char B_TICK
public static final char ADDITION
public static final char DASH
public static final char L_THAN
public static final char G_THAN
public static final char HASH
public static final char PERIOD
public static final char L_FEED
public static final char C_RETURN
public static final char TAB
public static final char SPACE
public static final int WORD_UPPERCASE
public static final int WORD_LOWERCASE
public static final int WORD_UNMODIFIED
public static final int UNDEFINED_TOKEN
public static final int WORD_TOKEN
public static final int PUNCTUATION_TOKEN
public static final int NEWLINE_TOKEN
public static final int WHITESPACE_TOKEN
public static final int HEADER_TOKEN
public NexusTokenizer(java.lang.String file) throws java.io.IOException
Constructor for a NexusTokenizer.
file - File name for the NEXUS file
java.io.IOException - I/O errors
public NexusTokenizer(java.io.PushbackReader pr) throws java.io.IOException
Constructor for a NexusTokenizer.
pr - PushbackReader
java.io.IOException - I/O errors
public boolean readWhiteSpace()
Gets the flag indicating whether or not this parser object is reading (and returning) whitespace.
Returns the readWS flag.
public boolean convertNewLine()
Gets the flag indicating whether this parser instance should convert newline characters.
Returns the convertNL flag.
public void setReadWhiteSpace(boolean b)
Sets the readWS flag. True means that the parser will return whitespace characters as a token (where whitespace = ' ' or '\t').
b - flag value for readWS
public void setConvertNewLine(boolean b)
Sets the convertNL flag. True means that the parser will convert newline characters ('\r', '\n' or '\r\n') into either the default ('\n' if setNewLineChar() is not called) or a user-specified newline character.
b - flag value for convertNL
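The newline conversion described for convertNL can be sketched independently of the library. The class `NewlineSketch` and its `tokens` method below are hypothetical, and grouping everything that is not a newline into a single run is a simplification rather than how NexusTokenizer splits words and punctuation. The sketch shows how '\r', '\n' and '\r\n' can each be recognised as one newline token with a one-character pushback, and how a replacement character could be substituted.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in used for illustration only.
final class NewlineSketch {
    /**
     * Splits text into newline tokens and runs of all other characters.
     * When convert is true, every newline token becomes the replacement char.
     */
    static List<String> tokens(String text, boolean convert, char replacement) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        try (PushbackReader in = new PushbackReader(new StringReader(text), 1)) {
            int c;
            while ((c = in.read()) != -1) {
                if (c != '\r' && c != '\n') {
                    run.append((char) c);
                    continue;
                }
                if (run.length() > 0) { // flush the pending non-newline run
                    out.add(run.toString());
                    run.setLength(0);
                }
                String newline = String.valueOf((char) c);
                if (c == '\r') {
                    int next = in.read(); // look one character ahead for "\r\n"
                    if (next == '\n') {
                        newline = "\r\n";
                    } else if (next != -1) {
                        in.unread(next);
                    }
                }
                out.add(convert ? String.valueOf(replacement) : newline);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Three newline styles, all converted to '\n': prints 6
        System.out.println(tokens("a\r\nb\rc\n", true, '\n').size());
    }
}
```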
public void setIgnoreComments(boolean b)
Sets the ignoreComments flag. True means that the tokenizer will ignore comments (i.e. sections of a NEXUS file delimited by '[...]'). When set to true, the tokenizer will return the first token available after a comment.
b - flag value for ignoreComments
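To make the effect of ignoring comments concrete, here is a minimal sketch that removes '[...]' sections from a string. `CommentSkipSketch` and `stripComments` are hypothetical names. Treating nested brackets as nested comments is an assumption, since the text above does not say how nesting is handled, and brackets inside quoted words are not given any special treatment.

```java
// Hypothetical stand-in used for illustration only.
final class CommentSkipSketch {
    /**
     * Returns the text with every '[...]' section removed.
     * Nested brackets are counted, so "[a [b] c]" is one comment (assumption).
     */
    static String stripComments(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of currently open '[' brackets
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '[') {
                depth++;
            } else if (c == ']' && depth > 0) {
                depth--;
            } else if (depth == 0) {
                out.append(c); // outside any comment: keep the character
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The token after the comment is the next thing a reader would see.
        System.out.println(stripComments("begin[a comment]trees;")); // begintrees;
    }
}
```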
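public void setNewLineChar(char nl)
Sets the character that newline characters are converted into.
nl - Replacement newline character
public int getCol()
Gets the current column position of the cursor.
public int getRow()
Gets the current row position of the cursor.
public int getWordModification()
Gets the word modification flag currently in use.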
public void setWordModification(int flag)
Sets the flag value for word modification. WORD_UNMODIFIED indicates that the tokens should be returned in the case in which they are read from the stream. This value can be set at any time between token reads; the next token read will be altered according to it. The default is WORD_UNMODIFIED.
flag - Flag value, one of WORD_LOWERCASE, WORD_UPPERCASE or WORD_UNMODIFIED
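The three word-modification flags amount to a small case-mapping step applied to each word before it is returned. The sketch below illustrates that step in isolation. The class `WordModificationSketch` is hypothetical, and the numeric constant values are invented for the example, because the actual values of the NexusTokenizer constants are not stated here.

```java
import java.util.Locale;

// Hypothetical stand-in used for illustration only.
final class WordModificationSketch {
    // Invented values; the real constants' values are not documented above.
    static final int WORD_UNMODIFIED = 0;
    static final int WORD_UPPERCASE = 1;
    static final int WORD_LOWERCASE = 2;

    /** Applies the case modification selected by flag to a single word. */
    static String modify(String word, int flag) {
        switch (flag) {
            case WORD_UPPERCASE:
                return word.toUpperCase(Locale.ROOT);
            case WORD_LOWERCASE:
                return word.toLowerCase(Locale.ROOT);
            case WORD_UNMODIFIED:
                return word; // returned exactly as read
            default:
                throw new IllegalArgumentException("unknown flag: " + flag);
        }
    }

    public static void main(String[] args) {
        System.out.println(modify("Begin", WORD_UPPERCASE));  // BEGIN
        System.out.println(modify("Begin", WORD_LOWERCASE));  // begin
        System.out.println(modify("Begin", WORD_UNMODIFIED)); // Begin
    }
}
```

Uppercasing words is convenient when comparing against keywords, since a single comparison such as `"BEGIN".equals(token)` then covers every spelling of the keyword.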
public java.lang.String readToken() throws java.io.IOException, NexusParseException
Reads a token in from the underlying stream. A newline is returned as read unless convertNL is set, in which case it will replace the token with the user-specified newline character.
Returns the String token, or null if EOF is reached (i.e. no more tokens to read).
java.io.IOException - I/O errors
NexusParseException - Parsing errors
public int getLastTokenType()
Determines the type of the last read token. After readToken() has been called, the type of token returned can be determined by calling getLastTokenType(). This returns one of six different constants:
UNDEFINED_TOKEN: default before anything is read from the stream
WORD_TOKEN: word token was read
PUNCTUATION_TOKEN: punctuation token was read
NEWLINE_TOKEN: newline token was read
WHITESPACE_TOKEN: whitespace token was read (never returned unless whitespace is being returned)
HEADER_TOKEN: last token was the special word #NEXUS
public java.lang.String seek(int tokenType) throws java.io.IOException, NexusParseException
Seeks through the stream to find the next token of the specified type.
Returns the String token, or null if EOF is reached (i.e. no more tokens to read).
java.io.IOException - I/O errors
NexusParseException - Thrown by parsing errors or if tokenType == WHITESPACE_TOKEN && readWhiteSpace() == false
public java.lang.String seek(java.lang.String token) throws java.io.IOException, NexusParseException
Seeks through the stream to find the token argument.
Returns the String token, or null if the token is not found (i.e. EOF is reached).
java.io.IOException - I/O errors
NexusParseException - Thrown by parsing errors or if token is whitespace && readWhiteSpace() == false
public java.lang.String getLastReadToken()
Returns the last read token. readToken() stores the returned token so that it can be retrieved again. However, each consuming readToken() call replaces this buffer with the new token.
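Because every read moves forward in the stream, seeking for a token can be thought of as repeated reads that stop at the first match or at EOF. The sketch below shows that loop against a hypothetical `TokenSource` interface that stands in for the tokenizer. `SeekSketch`, `TokenSource` and `of` are invented names, and the exact `equals` comparison is an assumption about how a match is decided.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in used for illustration only.
final class SeekSketch {
    /** Minimal token source: returns the next token, or null at EOF. */
    interface TokenSource {
        String readToken();
    }

    /**
     * Reads tokens until one equals wanted and returns it, or returns null
     * if the source is exhausted first. Tokens read on the way are consumed.
     */
    static String seek(TokenSource src, String wanted) {
        String token;
        while ((token = src.readToken()) != null) {
            if (token.equals(wanted)) {
                return token;
            }
        }
        return null; // EOF reached without a match
    }

    /** Builds a token source over a fixed list of tokens. */
    static TokenSource of(String... tokens) {
        Iterator<String> it = Arrays.asList(tokens).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        TokenSource src = of("#NEXUS", "BEGIN", "TREES", ";", "END", ";");
        System.out.println(seek(src, "TREES")); // TREES
        System.out.println(src.readToken());    // ;
        System.out.println(seek(src, "TAXA"));  // null
    }
}
```

A failed seek consumes the rest of the input, so with a forward-only parser the order in which tokens are sought matters.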