NexusTokenizer

pal.io
Class NexusTokenizer

java.lang.Object
  pal.io.NexusTokenizer

public final class NexusTokenizer
extends java.lang.Object

Comments

A simple token pull-parser for the NEXUS file format as specified in:

Maddison, D. R., Swofford, D. L., & Maddison, W. P. (1997). NEXUS: An extensible file format for systematic information. Systematic Biology, 46(4), pp. 590-621.

The parser is designed to break a NEXUS file into tokens which are read individually. Tokens come in four different types:

Punctuation: any of the punctuation characters (see constants)
Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is returned only if the readWS option is set (see setReadWhiteSpace)
Word: any string of characters delimited by whitespace or punctuation
Newline: '\r', '\n' or '\r\n'. The parser returns the newline as read unless convertNL is set, in which case it replaces the token with the user-specified newline character

The parser has a set of options allowing tokens to be modified before they are returned (such as case modification or newline substitution).

Each read by the parser moves forward in the stream. At present there is no support for unreading tokens or for moving bi-directionally through the stream.

NB: in this implementation the token #NEXUS is treated specially. When the parser reads it, it returns the single token '#NEXUS' rather than the two tokens '#' and 'NEXUS'. Because this token has special meaning, it has its own token type (HEADER_TOKEN).
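
The token categories described above can be sketched with a small stand-alone classifier. This is an illustration only, not PAL's implementation: the class name TokenTypeSketch, the punctuation set, and the "TYPE:token" output format are all assumptions made for the example, and whitespace tokens are always emitted here rather than being controlled by an option.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in only; this is not PAL's NexusTokenizer. */
final class TokenTypeSketch {
    // Hypothetical punctuation set; the real one is defined by the class constants.
    private static final String PUNCTUATION = "()[]{}/\\,;:=*'\"`+-<>#.";

    enum Type { WORD, PUNCTUATION, WHITESPACE, NEWLINE, HEADER }

    /** Splits text into "TYPE:token" strings following the categories described above. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int start = i;
            Type type;
            if (text.startsWith("#NEXUS", i)) {
                // The header is returned as one token, not '#' followed by 'NEXUS'.
                i += 6;
                type = Type.HEADER;
            } else if (c == '\r' || c == '\n') {
                // "\r\n" counts as a single newline token.
                boolean crlf = c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n';
                i += crlf ? 2 : 1;
                type = Type.NEWLINE;
            } else if (c == ' ' || c == '\t') {
                // A run of spaces and tabs forms one whitespace token.
                while (i < text.length() && (text.charAt(i) == ' ' || text.charAt(i) == '\t')) {
                    i++;
                }
                type = Type.WHITESPACE;
            } else if (PUNCTUATION.indexOf(c) >= 0) {
                i++;
                type = Type.PUNCTUATION;
            } else {
                // A word runs until the next whitespace, newline, or punctuation character.
                while (i < text.length() && !isDelimiter(text.charAt(i))) {
                    i++;
                }
                type = Type.WORD;
            }
            out.add(type + ":" + text.substring(start, i));
        }
        return out;
    }

    private static boolean isDelimiter(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n' || PUNCTUATION.indexOf(c) >= 0;
    }
}
```

Calling TokenTypeSketch.tokenize("#NEXUS\r\nbegin taxa;") yields a header token, a newline token, the word "begin", a whitespace token, the word "taxa", and the punctuation token ";".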

Usage


	import java.io.FileReader;
	import java.io.PushbackReader;
	import pal.io.NexusTokenizer;

	NexusTokenizer ntp = new NexusTokenizer(new PushbackReader(new FileReader("afile")));
	ntp.setReadWhiteSpace(false);    // do not return whitespace tokens
	ntp.setIgnoreComments(true);     // skip comments
	ntp.setWordModification(NexusTokenizer.WORD_UPPERCASE); // convert words to uppercase

	String nToken = ntp.readToken();
	while (nToken != null) {         // readToken() returns null at EOF
	    System.out.println("Token: " + nToken);
	    System.out.println("Col:   " + ntp.getCol());
	    System.out.println("Row:   " + ntp.getRow());
	    nToken = ntp.readToken();    // advance to the next token so the loop terminates
	}

Field Summary
`static char`	`ADDITION`
`static char`	`ASTERIX`
`static char`	`B_SLASH`
`static char`	`B_TICK`
`static char`	`C_RETURN`
`static char`	`COLON`
`static char`	`COMMA`
`static char`	`D_QUOTE`
`static char`	`DASH`
`static char`	`EQUALS`
`static char`	`F_SLASH`
`static char`	`G_THAN`
`static char`	`HASH`
`static int`	`HEADER_TOKEN` Flag indicating last token read was the header token #NEXUS
`static char`	`L_BRACE`
`static char`	`L_BRACKET`
`static char`	`L_FEED`
`static char`	`L_PARENTHESIS`
`static char`	`L_THAN`
`static int`	`NEWLINE_TOKEN` Flag indicating last token read was a newline symbol/word
`static char`	`PERIOD`
`static int`	`PUNCTUATION_TOKEN` Flag indicating last token read was a punctuation symbol
`static char`	`R_BRACE`
`static char`	`R_BRACKET`
`static char`	`R_PARENTHESIS`
`static char`	`S_QUOTE`
`static char`	`SEMI_COLON`
`static char`	`SPACE`
`static char`	`TAB`
`static int`	`UNDEFINED_TOKEN` Flag indicating last token read was undefined
`static int`	`WHITESPACE_TOKEN` Flag indicating last token read was whitespace
`static int`	`WORD_LOWERCASE` Flag indicating words should be converted to lowercase
`static int`	`WORD_TOKEN` Flag indicating last token read was a word
`static int`	`WORD_UNMODIFIED` Flag indicating words should be untouched
`static int`	`WORD_UPPERCASE` Flag indicating words should be converted to uppercase

Constructor Summary
`NexusTokenizer(java.io.PushbackReader pr)` Constructs a `NexusTokenizer` that reads from the given `PushbackReader`
`NexusTokenizer(java.lang.String file)` Constructs a `NexusTokenizer` that reads from the named file

Method Summary
`boolean`	`convertNewLine()` Gets the flag indicating whether this parser instance should convert newline characters.
`int`	`getCol()` Gets the current column position of the cursor.
`java.lang.String`	`getLastReadToken()` Returns the last read token.
`int`	`getLastTokenType()` Determine the type of the last read token.
`int`	`getRow()` Gets the current row position of the cursor.
`int`	`getWordModification()` Gets the word modification flag currently in use
`java.lang.String`	`readToken()` Reads a token in from the underlying stream.
`boolean`	`readWhiteSpace()` Gets the flag indicating whether this parser object is reading (and returning) whitespace.
`java.lang.String`	`seek(int tokenType)` Seeks through the stream to find the next token of the specified type.
`java.lang.String`	`seek(java.lang.String token)` Seeks through the stream to find the token argument.
`void`	`setConvertNewLine(boolean b)` Sets the `convertNL` flag.
`void`	`setIgnoreComments(boolean b)` Sets the `ignoreComments` flag.
`void`	`setNewLineChar(char nl)` Sets the character that newline characters are converted into.
`void`	`setReadWhiteSpace(boolean b)` Sets the `readWS` flag.
`void`	`setWordModification(int flag)` Sets the flag value for word modification.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

L_PARENTHESIS

public static final char L_PARENTHESIS

See Also: Constant Field Values

R_PARENTHESIS

public static final char R_PARENTHESIS

See Also: Constant Field Values

L_BRACKET

public static final char L_BRACKET

See Also: Constant Field Values

R_BRACKET

public static final char R_BRACKET

See Also: Constant Field Values

L_BRACE

public static final char L_BRACE

See Also: Constant Field Values

R_BRACE

public static final char R_BRACE

See Also: Constant Field Values

F_SLASH

public static final char F_SLASH

See Also: Constant Field Values

B_SLASH

public static final char B_SLASH

See Also: Constant Field Values

COMMA

public static final char COMMA

See Also: Constant Field Values

SEMI_COLON

public static final char SEMI_COLON

See Also: Constant Field Values

COLON

public static final char COLON

See Also: Constant Field Values

EQUALS

public static final char EQUALS

See Also: Constant Field Values

ASTERIX

public static final char ASTERIX

See Also: Constant Field Values

S_QUOTE

public static final char S_QUOTE

See Also: Constant Field Values

D_QUOTE

public static final char D_QUOTE

See Also: Constant Field Values

B_TICK

public static final char B_TICK

See Also: Constant Field Values

ADDITION

public static final char ADDITION

See Also: Constant Field Values

DASH

public static final char DASH

See Also: Constant Field Values

L_THAN

public static final char L_THAN

See Also: Constant Field Values

G_THAN

public static final char G_THAN

See Also: Constant Field Values

HASH

public static final char HASH

See Also: Constant Field Values

PERIOD

public static final char PERIOD

See Also: Constant Field Values

L_FEED

public static final char L_FEED

See Also: Constant Field Values

C_RETURN

public static final char C_RETURN

See Also: Constant Field Values

TAB

public static final char TAB

See Also: Constant Field Values

SPACE

public static final char SPACE

See Also: Constant Field Values

WORD_UPPERCASE

public static final int WORD_UPPERCASE

Flag indicating words should be converted to uppercase

See Also: Constant Field Values

WORD_LOWERCASE

public static final int WORD_LOWERCASE

Flag indicating words should be converted to lowercase

See Also: Constant Field Values

WORD_UNMODIFIED

public static final int WORD_UNMODIFIED

Flag indicating words should be untouched

See Also: Constant Field Values

UNDEFINED_TOKEN

public static final int UNDEFINED_TOKEN

Flag indicating last token read was undefined

See Also: Constant Field Values

WORD_TOKEN

public static final int WORD_TOKEN

Flag indicating last token read was a word

See Also: Constant Field Values

PUNCTUATION_TOKEN

public static final int PUNCTUATION_TOKEN

Flag indicating last token read was a punctuation symbol

See Also: Constant Field Values

NEWLINE_TOKEN

public static final int NEWLINE_TOKEN

Flag indicating last token read was a newline symbol/word

See Also: Constant Field Values

WHITESPACE_TOKEN

public static final int WHITESPACE_TOKEN

Flag indicating last token read was whitespace

See Also: Constant Field Values

HEADER_TOKEN

public static final int HEADER_TOKEN

Flag indicating last token read was the header token #NEXUS

See Also: Constant Field Values

Constructor Detail

NexusTokenizer

public NexusTokenizer(java.lang.String file)
               throws java.io.IOException

Constructs a NexusTokenizer that reads from the named file.
Parameters: file - File name for the NEXUS file
Throws: java.io.IOException - I/O errors

NexusTokenizer

public NexusTokenizer(java.io.PushbackReader pr)
               throws java.io.IOException

Constructs a NexusTokenizer that reads from the given PushbackReader.
Parameters: pr - PushbackReader
Throws: java.io.IOException - I/O errors

Method Detail

readWhiteSpace

public boolean readWhiteSpace()

Gets the flag indicating whether this parser object is reading (and returning) whitespace.

Returns: the readWS flag

convertNewLine

public boolean convertNewLine()

Gets the flag indicating whether this parser instance should convert newline characters. As the specification says (see the reference in the class description above), a newline may be '\r', '\n' or '\r\n'. For uniformity, the parser can convert all of these into a single specified character. This feature is off by default.

Returns: the convertNL flag

setReadWhiteSpace

public void setReadWhiteSpace(boolean b)

Sets the readWS flag. True means that the parser will return whitespace characters as a token (where whitespace = ' ' or '\t').

Parameters: b - flag value for readWS

setConvertNewLine

public void setConvertNewLine(boolean b)

Sets the convertNL flag. True means that the parser will convert newline characters ('\r', '\n' or '\r\n') into either the default ('\n' if setNewLineChar() has not been called) or a user-specified newline character.

Parameters: b - flag value for convertNL
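
The conversion described above can be sketched as a stand-alone function. This is a minimal illustration under assumptions, not the tokenizer's actual code: NewlineSketch and convert are hypothetical names, and the sketch works on a whole string rather than on a token stream.

```java
/** Illustrative stand-in only; this is not PAL's NexusTokenizer. */
final class NewlineSketch {
    /** Replaces every "\r\n", '\r' or '\n' in text with the single character nl. */
    static String convert(String text, char nl) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // Treat "\r\n" as one newline by skipping the '\n' that follows.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                sb.append(nl);
            } else if (c == '\n') {
                sb.append(nl);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With '\n' as the replacement character, input mixing all three newline forms comes out with '\n' throughout, which is the uniformity the convertNL flag is meant to provide.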

setIgnoreComments

public void setIgnoreComments(boolean b)

Sets the ignoreComments flag. True means that the tokenizer will ignore comments (i.e. sections of a NEXUS file delimited by '[...]'). When set to true, the tokenizer will return the first token available after a comment.

Parameters: b - flag value for ignoreComments
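
As a rough illustration of what ignoring comments means, the sketch below removes bracketed sections from a string. It is not the tokenizer's implementation: CommentSketch and strip are hypothetical names, and tracking nested brackets with a depth counter is an assumption of this example rather than documented behaviour of this class.

```java
/** Illustrative stand-in only; this is not PAL's NexusTokenizer. */
final class CommentSketch {
    /** Returns text with every '[...]' section removed. */
    static String strip(String text) {
        StringBuilder sb = new StringBuilder();
        int depth = 0; // number of currently open '[' (assumption: comments may nest)
        for (char c : text.toCharArray()) {
            if (c == '[') {
                depth++;
            } else if (c == ']' && depth > 0) {
                depth--;
            } else if (depth == 0) {
                // Only characters outside any comment are kept.
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Note that the whitespace on either side of a removed comment is kept, so "tree [a comment] t1" becomes "tree  t1" with two spaces.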

setNewLineChar

public void setNewLineChar(char nl)

Sets the character that newline characters are converted into.

Parameters: nl - Replacement newline character

getCol

public int getCol()

Gets the current column position of the cursor. Changed after each read.

Returns: Column number (zero indexed)

getRow

public int getRow()

Gets the current row position of the cursor. Changed after each read.

Returns: Row number (zero indexed)
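
One way a zero-indexed cursor could be maintained across reads is sketched below. The documentation above does not say whether the reported position is the start or the end of the last token, so this sketch simply assumes the cursor sits just past everything read so far. CursorSketch and advance are hypothetical names.

```java
/** Illustrative stand-in only; this is not PAL's NexusTokenizer. */
final class CursorSketch {
    private int row = 0; // zero-indexed line number
    private int col = 0; // zero-indexed column within the current line

    /** Moves the cursor past one token that has just been read. */
    void advance(String token) {
        if (token.equals("\n") || token.equals("\r") || token.equals("\r\n")) {
            // A newline token starts a new row and resets the column.
            row++;
            col = 0;
        } else {
            col += token.length();
        }
    }

    int getRow() { return row; }

    int getCol() { return col; }
}
```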

getWordModification

public int getWordModification()

Gets the word modification flag currently in use.

Returns: Flag value for word modification

setWordModification

public void setWordModification(int flag)

Sets the flag value for word modification. Depending on the flag, a token can be converted to lowercase or uppercase once it has been read from the stream. WORD_UNMODIFIED indicates that tokens should be returned in the case in which they were read. This value can be set at any time between token reads; the next token read is altered according to the value then in effect. The default is WORD_UNMODIFIED.

Parameters: flag - Flag value, one of WORD_LOWERCASE, WORD_UPPERCASE or WORD_UNMODIFIED
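
The effect of the three flags can be sketched as follows. The numeric flag values used here are placeholders chosen for the example (the real values are listed under Constant Field Values), and WordModSketch and apply are hypothetical names.

```java
import java.util.Locale;

/** Illustrative stand-in only; this is not PAL's NexusTokenizer. */
final class WordModSketch {
    // Placeholder values for illustration; not the library's actual constants.
    static final int WORD_UNMODIFIED = 0;
    static final int WORD_UPPERCASE = 1;
    static final int WORD_LOWERCASE = 2;

    /** Returns word with the case change selected by flag applied. */
    static String apply(String word, int flag) {
        switch (flag) {
            case WORD_UPPERCASE:
                return word.toUpperCase(Locale.ROOT);
            case WORD_LOWERCASE:
                return word.toLowerCase(Locale.ROOT);
            default:
                // WORD_UNMODIFIED (and anything unrecognised) leaves the word as read.
                return word;
        }
    }
}
```

Using Locale.ROOT keeps the case conversion independent of the platform's default locale, which seems a reasonable choice for format keywords such as BEGIN and END.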

readToken

public java.lang.String readToken()
                           throws java.io.IOException,
                                  NexusParseException

Reads a token in from the underlying stream. Tokens are individual chunks read from the underlying stream. Each token is one of the four basic types:

Word: any string of characters delimited by whitespace or punctuation
Punctuation: any of the punctuation characters (see constants)
Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is returned only if the readWS option is set (see setReadWhiteSpace)
Newline: '\r', '\n' or '\r\n'. The parser returns the newline as read unless convertNL is set, in which case it replaces the token with the user-specified newline character

Returns: a String token, or null if EOF is reached (i.e. no more tokens to read)
Throws: java.io.IOException - I/O errors; NexusParseException - Parsing errors

getLastTokenType

public int getLastTokenType()

Determines the type of the last read token. After readToken() has been called, the type of the token it returned can be determined by calling getLastTokenType(). This returns one of six constants:

UNDEFINED_TOKEN : default before anything is read from the stream
WORD_TOKEN : word token was read
PUNCTUATION_TOKEN : punctuation token was read
NEWLINE_TOKEN : newline token was read
WHITESPACE_TOKEN : whitespace token was read (never returned unless whitespace is being returned)
HEADER_TOKEN : last token was the special word #NEXUS

Returns: The type of the last token read.

seek

public java.lang.String seek(int tokenType)
                      throws java.io.IOException,
                             NexusParseException

Seeks through the stream to find the next token of the specified type. The type value can be one of:

WORD_TOKEN
PUNCTUATION_TOKEN
NEWLINE_TOKEN
WHITESPACE_TOKEN
HEADER_TOKEN

Returns: a String token, or null if EOF is reached (i.e. no more tokens to read)
Throws: java.io.IOException - I/O errors; NexusParseException - Thrown by parsing errors or if tokenType == WHITESPACE_TOKEN && readWhiteSpace() == false
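
Conceptually, seeking by type amounts to reading and discarding tokens until one of the requested type appears. The sketch below shows that loop over a plain iterator of strings. It is an illustration under assumptions: SeekSketch, typeOf and the flag values are hypothetical, and the type test is deliberately crude.

```java
import java.util.Iterator;

/** Illustrative stand-in only; this is not PAL's NexusTokenizer. */
final class SeekSketch {
    // Placeholder values for illustration; not the library's actual constants.
    static final int WORD_TOKEN = 1;
    static final int PUNCTUATION_TOKEN = 2;

    /** Crude classification: a single non-alphanumeric character is punctuation. */
    static int typeOf(String token) {
        boolean punct = token.length() == 1 && !Character.isLetterOrDigit(token.charAt(0));
        return punct ? PUNCTUATION_TOKEN : WORD_TOKEN;
    }

    /** Consumes tokens until one of the requested type is found; null means the input ran out. */
    static String seek(Iterator<String> tokens, int tokenType) {
        while (tokens.hasNext()) {
            String t = tokens.next();
            if (typeOf(t) == tokenType) {
                return t;
            }
        }
        return null;
    }
}
```

Because every skipped token is consumed, a seek cannot be undone, which matches the forward-only behaviour noted in the class description.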

seek

public java.lang.String seek(java.lang.String token)
                      throws java.io.IOException,
                             NexusParseException

Seeks through the stream to find the token argument.

Returns: a String token, or null if the token is not found (i.e. EOF is reached)
Throws: java.io.IOException - I/O errors; NexusParseException - Thrown by parsing errors or if token is whitespace && readWhiteSpace() == false

getLastReadToken

public java.lang.String getLastReadToken()

Returns the last read token. Each call to readToken() stores the token it returns so that it can be retrieved again; the next readToken() call overwrites the stored token with the new one.

Returns: the last read token
