pal.io
Class NexusTokenizer

java.lang.Object
  extended by pal.io.NexusTokenizer

public final class NexusTokenizer
extends java.lang.Object

Comments

A simple token pull-parser for the NEXUS file format as specified in:

Maddison, D. R., Swofford, D. L., & Maddison, W. P. (1997). NEXUS: An extensible file format for systematic information. Systematic Biology, 46(4), pp. 590-621.

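The parser is designed to break a NEXUS file into tokens which are read individually. Tokens come in four different types: words, punctuation, whitespace and newlines (see WORD_TOKEN, PUNCTUATION_TOKEN, WHITESPACE_TOKEN and NEWLINE_TOKEN).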

The parser has a set of options allowing tokens to be modified before they are returned (such as case modification or newline substitution).

Each read by the parser moves forward in the stream. At present there is no support for unreading tokens or for moving bi-directionally through the stream.

NB: in this implementation the token #NEXUS is treated specially: the parser returns it as a single token, '#NEXUS', rather than as two tokens, '#' and 'NEXUS'. This special meaning is reflected in its having its own token type, HEADER_TOKEN.
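
The single-token treatment of #NEXUS can be illustrated with a small stand-alone sketch. It is not the pal.io implementation: the class HeaderTokenSketch and its methods are hypothetical, and matching the header case-insensitively is an assumption. The sketch reads ahead through a PushbackReader and pushes the characters back if they do not spell the header.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for illustration only; not part of pal.io.
final class HeaderTokenSketch {

    private static final String HEADER = "#NEXUS";

    /**
     * Returns the header as one token if the next characters spell "#NEXUS"
     * (ignoring case, an assumption). Otherwise pushes the characters back
     * and returns null. The reader's pushback buffer must hold at least
     * HEADER.length() characters.
     */
    static String readHeader(PushbackReader in) throws IOException {
        char[] buf = new char[HEADER.length()];
        int n = 0;
        int c;
        while (n < buf.length && (c = in.read()) != -1) {
            buf[n++] = (char) c;
        }
        String candidate = new String(buf, 0, n);
        if (candidate.equalsIgnoreCase(HEADER)) {
            return candidate;
        }
        in.unread(buf, 0, n); // not the header: leave the stream untouched
        return null;
    }

    /** Convenience wrapper for in-memory text. */
    static String headerOf(String text) {
        try (PushbackReader in =
                 new PushbackReader(new StringReader(text), HEADER.length())) {
            return readHeader(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(headerOf("#NEXUS\nBEGIN DATA;")); // prints "#NEXUS"
        System.out.println(headerOf("# NEXUS"));             // prints "null"
    }
}
```

Pushing the characters back on a mismatch lets a tokenizer built this way fall through to its ordinary rules, which would then see '#' as punctuation.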

Usage

NexusTokenizer ntp = new NexusTokenizer(new PushbackReader(new FileReader("afile")));
ntp.setReadWhiteSpace(false);                            // ignore whitespace
ntp.setIgnoreComments(true);                             // ignore comments
ntp.setWordModification(NexusTokenizer.WORD_UPPERCASE);  // all tokens in uppercase
String nToken = ntp.readToken();

while(nToken != null) {
    System.out.println("Token: " + nToken);
    System.out.println("Col: " + ntp.getCol());
    System.out.println("Row: " + ntp.getRow());
    nToken = ntp.readToken();                            // advance, otherwise the loop never ends
}


Field Summary
static char ADDITION
           
static char ASTERIX
           
static char B_SLASH
           
static char B_TICK
           
static char C_RETURN
           
static char COLON
           
static char COMMA
           
static char D_QUOTE
           
static char DASH
           
static char EQUALS
           
static char F_SLASH
           
static char G_THAN
           
static char HASH
           
static int HEADER_TOKEN
          Flag indicating last token read was the header token #NEXUS
static char L_BRACE
           
static char L_BRACKET
           
static char L_FEED
           
static char L_PARENTHESIS
           
static char L_THAN
           
static int NEWLINE_TOKEN
          Flag indicating last token read was a newline symbol/word
static char PERIOD
           
static int PUNCTUATION_TOKEN
          Flag indicating last token read was a punctuation symbol
static char R_BRACE
           
static char R_BRACKET
           
static char R_PARENTHESIS
           
static char S_QUOTE
           
static char SEMI_COLON
           
static char SPACE
           
static char TAB
           
static int UNDEFINED_TOKEN
          Flag indicating last token read was undefined
static int WHITESPACE_TOKEN
          Flag indicating last token read was whitespace
static int WORD_LOWERCASE
          Flag indicating words should be converted to lowercase
static int WORD_TOKEN
          Flag indicating last token read was a word
static int WORD_UNMODIFIED
          Flag indicating words should be untouched
static int WORD_UPPERCASE
          Flag indicating words should be converted to uppercase
 
Constructor Summary
NexusTokenizer(java.io.PushbackReader pr)
          Constructs a NexusTokenizer that reads from the given PushbackReader.
NexusTokenizer(java.lang.String file)
          Constructs a NexusTokenizer that reads from the named NEXUS file.
 
Method Summary
 boolean convertNewLine()
          Gets the flag indicating whether this parser instance should convert newline characters.
 int getCol()
          Gets the current column position of the cursor.
 java.lang.String getLastReadToken()
          Returns the last read token.
 int getLastTokenType()
          Determine the type of the last read token.
 int getRow()
          Gets the current row position of the cursor.
 int getWordModification()
          Gets the word modification flag currently in use
 java.lang.String readToken()
          Reads a token in from the underlying stream.
 boolean readWhiteSpace()
          Get the flag indicating whether or not this parser object is reading (and returning) whitespace
 java.lang.String seek(int tokenType)
          Seeks through the stream to find the next token of the specified type.
 java.lang.String seek(java.lang.String token)
          Seeks through the stream to find the token argument.
 void setConvertNewLine(boolean b)
          Sets the convertNL flag.
 void setIgnoreComments(boolean b)
          Sets the ignoreComments flag.
 void setNewLineChar(char nl)
          Sets the character that newline characters are converted into.
 void setReadWhiteSpace(boolean b)
          Sets the readWS flag.
 void setWordModification(int flag)
          Sets the flag value for word modification.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

L_PARENTHESIS

public static final char L_PARENTHESIS
See Also:
Constant Field Values

R_PARENTHESIS

public static final char R_PARENTHESIS
See Also:
Constant Field Values

L_BRACKET

public static final char L_BRACKET
See Also:
Constant Field Values

R_BRACKET

public static final char R_BRACKET
See Also:
Constant Field Values

L_BRACE

public static final char L_BRACE
See Also:
Constant Field Values

R_BRACE

public static final char R_BRACE
See Also:
Constant Field Values

F_SLASH

public static final char F_SLASH
See Also:
Constant Field Values

B_SLASH

public static final char B_SLASH
See Also:
Constant Field Values

COMMA

public static final char COMMA
See Also:
Constant Field Values

SEMI_COLON

public static final char SEMI_COLON
See Also:
Constant Field Values

COLON

public static final char COLON
See Also:
Constant Field Values

EQUALS

public static final char EQUALS
See Also:
Constant Field Values

ASTERIX

public static final char ASTERIX
See Also:
Constant Field Values

S_QUOTE

public static final char S_QUOTE
See Also:
Constant Field Values

D_QUOTE

public static final char D_QUOTE
See Also:
Constant Field Values

B_TICK

public static final char B_TICK
See Also:
Constant Field Values

ADDITION

public static final char ADDITION
See Also:
Constant Field Values

DASH

public static final char DASH
See Also:
Constant Field Values

L_THAN

public static final char L_THAN
See Also:
Constant Field Values

G_THAN

public static final char G_THAN
See Also:
Constant Field Values

HASH

public static final char HASH
See Also:
Constant Field Values

PERIOD

public static final char PERIOD
See Also:
Constant Field Values

L_FEED

public static final char L_FEED
See Also:
Constant Field Values

C_RETURN

public static final char C_RETURN
See Also:
Constant Field Values

TAB

public static final char TAB
See Also:
Constant Field Values

SPACE

public static final char SPACE
See Also:
Constant Field Values

WORD_UPPERCASE

public static final int WORD_UPPERCASE
Flag indicating words should be converted to uppercase

See Also:
Constant Field Values

WORD_LOWERCASE

public static final int WORD_LOWERCASE
Flag indicating words should be converted to lowercase

See Also:
Constant Field Values

WORD_UNMODIFIED

public static final int WORD_UNMODIFIED
Flag indicating words should be untouched

See Also:
Constant Field Values

UNDEFINED_TOKEN

public static final int UNDEFINED_TOKEN
Flag indicating last token read was undefined

See Also:
Constant Field Values

WORD_TOKEN

public static final int WORD_TOKEN
Flag indicating last token read was a word

See Also:
Constant Field Values

PUNCTUATION_TOKEN

public static final int PUNCTUATION_TOKEN
Flag indicating last token read was a punctuation symbol

See Also:
Constant Field Values

NEWLINE_TOKEN

public static final int NEWLINE_TOKEN
Flag indicating last token read was a newline symbol/word

See Also:
Constant Field Values

WHITESPACE_TOKEN

public static final int WHITESPACE_TOKEN
Flag indicating last token read was whitespace

See Also:
Constant Field Values

HEADER_TOKEN

public static final int HEADER_TOKEN
Flag indicating last token read was the header token #NEXUS

See Also:
Constant Field Values
Constructor Detail

NexusTokenizer

public NexusTokenizer(java.lang.String file)
               throws java.io.IOException
Constructs a NexusTokenizer that reads from the named NEXUS file.

Parameters:
file - File name for the NEXUS file
Throws:
java.io.IOException - I/O errors

NexusTokenizer

public NexusTokenizer(java.io.PushbackReader pr)
               throws java.io.IOException
Constructs a NexusTokenizer that reads from the given PushbackReader.

Parameters:
pr - PushbackReader
Throws:
java.io.IOException - I/O errors
Method Detail

readWhiteSpace

public boolean readWhiteSpace()
Get the flag indicating whether or not this parser object is reading (and returning) whitespace

Returns:
returns the readWS flag

convertNewLine

public boolean convertNewLine()
Gets the flag indicating whether this parser instance should convert newline characters. As the specification says (see the reference in the class description above), a newline may be '\r', '\n' or '\r\n'. To provide uniformity, the parser can convert all of these into a single specified character. By default, this feature is off.

Returns:
returns the convertNL flag

setReadWhiteSpace

public void setReadWhiteSpace(boolean b)
Sets the readWS flag. True means that the parser will return whitespace characters as a token (where whitespace = ' ' or '\t').

Parameters:
b - flag value for readWS

setConvertNewLine

public void setConvertNewLine(boolean b)
Sets the convertNL flag. True means that the parser will convert newline characters ('\r', '\n' or '\r\n') into either the default ('\n' if setNewLineChar() is not called) or a user-specified newline character.

Parameters:
b - flag value for convertNL
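
The conversion described above can be sketched as follows. This is a minimal illustration, not the parser's own code: the class NewlineConversionSketch and the method convertNewlines are hypothetical names, and the sketch works on a whole string rather than token by token.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for illustration only; not part of pal.io.
final class NewlineConversionSketch {

    /** Copies text, replacing each "\r\n", "\r" or "\n" with the single character nl. */
    static String convertNewlines(String text, char nl) {
        StringBuilder out = new StringBuilder();
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '\r') {
                    int next = in.read();
                    if (next != '\n' && next != -1) {
                        in.unread(next); // lone '\r': the next character is not part of the newline
                    }
                    out.append(nl);
                } else if (c == '\n') {
                    out.append(nl);
                } else {
                    out.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three newline styles collapse to three '\n' characters.
        String converted = convertNewlines("a\r\nb\rc\nd", '\n');
        System.out.println(converted.equals("a\nb\nc\nd")); // prints "true"
    }
}
```

The one-character lookahead after '\r' is what makes "\r\n" count as one newline instead of two.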

setIgnoreComments

public void setIgnoreComments(boolean b)
Sets the ignoreComments flag. True means that the tokenizer will ignore comments (i.e. sections of a NEXUS file delimited by '[...]'). When set to true, the tokenizer will return the first token available after a comment.

Parameters:
b - flag value for ignoreComments
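
A minimal sketch of comment skipping, under stated assumptions: the class CommentSkipSketch and the method stripComments are hypothetical, nested brackets are tracked by depth, and brackets inside quoted tokens are not treated specially. How the tokenizer itself handles nesting, quoting or an unterminated comment is not stated here.

```java
// Hypothetical stand-in for illustration only; not part of pal.io.
final class CommentSkipSketch {

    /** Returns text with every '[...]' section removed. */
    static String stripComments(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of currently open '[' (assumption: comments may nest)
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '[') {
                depth++;
            } else if (c == ']' && depth > 0) {
                depth--;
            } else if (depth == 0) {
                out.append(c); // outside any comment: keep the character
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("BEGIN[a comment]DATA;")); // prints "BEGINDATA;"
    }
}
```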

setNewLineChar

public void setNewLineChar(char nl)
Sets the character that newline characters are converted into.

Parameters:
nl - Replacement newline character

getCol

public int getCol()
Gets the current column position of the cursor. Changed after each read.

Returns:
Column number (zero indexed)

getRow

public int getRow()
Gets the current row position of the cursor. Changed after each read.

Returns:
Row number (zero indexed)
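
The zero-indexed cursor reported by getCol() and getRow() can be pictured with the sketch below. It rests on assumptions that are not confirmed here: that a line feed starts a new row and resets the column to zero, and that every other character advances the column by one. CursorSketch and positionAfter are hypothetical names.

```java
// Hypothetical stand-in for illustration only; not part of pal.io.
final class CursorSketch {

    /** Returns {row, col}, both zero indexed, after consuming every character of text. */
    static int[] positionAfter(String text) {
        int row = 0;
        int col = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                row++;   // assumption: a line feed starts a new row
                col = 0; // assumption: the column restarts at zero
            } else {
                col++;
            }
        }
        return new int[] {row, col};
    }

    public static void main(String[] args) {
        int[] p = positionAfter("#NEXUS\nBEGIN");
        System.out.println("Row: " + p[0] + ", Col: " + p[1]); // prints "Row: 1, Col: 5"
    }
}
```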

getWordModification

public int getWordModification()
Gets the word modification flag currently in use

Returns:
Flag value for word modification

setWordModification

public void setWordModification(int flag)
Sets the flag value for word modification. Depending on the flag, a token can be converted to lowercase or uppercase once it has been read from the stream. WORD_UNMODIFIED indicates that tokens should be returned in the case in which they were read from the stream. This value can be set at any time between token reads, and the next token read will be altered according to it. The default is WORD_UNMODIFIED.

Parameters:
flag - Flag value, one of WORD_LOWERCASE, WORD_UPPERCASE or WORD_UNMODIFIED
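
The effect of the three flags can be sketched as a single switch. The constant values below are placeholders chosen for the example (the real values are those listed under Constant Field Values), and the class WordModificationSketch is hypothetical.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration only; not part of pal.io.
final class WordModificationSketch {

    // Placeholder values, not the real constants.
    static final int WORD_UNMODIFIED = 0;
    static final int WORD_UPPERCASE = 1;
    static final int WORD_LOWERCASE = 2;

    /** Applies the word modification selected by flag to a token that was just read. */
    static String modify(String word, int flag) {
        switch (flag) {
            case WORD_UPPERCASE:
                return word.toUpperCase(Locale.ROOT);
            case WORD_LOWERCASE:
                return word.toLowerCase(Locale.ROOT);
            default:
                return word; // WORD_UNMODIFIED: return the token as read
        }
    }

    public static void main(String[] args) {
        System.out.println(modify("Taxa", WORD_UPPERCASE));  // prints "TAXA"
        System.out.println(modify("Taxa", WORD_LOWERCASE));  // prints "taxa"
        System.out.println(modify("Taxa", WORD_UNMODIFIED)); // prints "Taxa"
    }
}
```

Because the flag is consulted per token, changing it between reads affects only the tokens read afterwards.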

readToken

public java.lang.String readToken()
                           throws java.io.IOException,
                                  NexusParseException
Reads a token in from the underlying stream. Tokens are individual chunks read from the underlying stream. Each token is one of the four basic types (word, punctuation, whitespace or newline), or the special header token #NEXUS.

Returns:
returns a String token or null if EOF is reached (i.e. no more tokens to read)
Throws:
java.io.IOException - I/O errors
NexusParseException - Parsing errors

getLastTokenType

public int getLastTokenType()
Determines the type of the last read token. After readToken() has been called, the type of the token returned can be determined by calling getLastTokenType(). This returns one of five different token type constants (see the *_TOKEN fields above).

Returns:
Type of the last token read.

seek

public java.lang.String seek(int tokenType)
                      throws java.io.IOException,
                             NexusParseException
Seeks through the stream to find the next token of the specified type. The type value is one of the token type constants (the *_TOKEN fields).

Returns:
returns a String token or null if EOF is reached (i.e. no more tokens to read)
Throws:
java.io.IOException - I/O errors
NexusParseException - Thrown by parsing errors or if tokenType == WHITESPACE_TOKEN && readWhiteSpace() == false

seek

public java.lang.String seek(java.lang.String token)
                      throws java.io.IOException,
                             NexusParseException
Seeks through the stream to find the token argument.

Returns:
returns a String token or null if token is not found (i.e. EOF is reached)
Throws:
java.io.IOException - I/O errors
NexusParseException - Thrown by parsing errors or if token is whitespace && readWhiteSpace() == false
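
Conceptually, seeking is a loop that pulls tokens until one matches or the input ends. The sketch below shows that loop over a plain Iterator standing in for repeated readToken() calls. SeekSketch is a hypothetical class, and comparing tokens with exact equality is an assumption; the real method may compare differently.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for illustration only; not part of pal.io.
final class SeekSketch {

    /**
     * Consumes tokens until one equals target. Returns that token,
     * or null if the input is exhausted first.
     */
    static String seek(Iterator<String> tokens, String target) {
        while (tokens.hasNext()) {
            String t = tokens.next();
            if (t.equals(target)) {
                return t;
            }
        }
        return null; // end of input reached without a match
    }

    public static void main(String[] args) {
        Iterator<String> tokens =
            Arrays.asList("#NEXUS", "BEGIN", "TAXA", ";").iterator();
        System.out.println(seek(tokens, "TAXA")); // prints "TAXA"
        System.out.println(tokens.next());        // prints ";"
    }
}
```

Since the stream only moves forward, every token skipped during a seek is consumed, and a failed seek leaves the input at its end.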

getLastReadToken

public java.lang.String getLastReadToken()
Returns the last read token. Each call to readToken() stores the returned token so that it can be retrieved again. However, each consuming readToken() call replaces this buffer with the new token.

Returns:
return the last read token