org.apache.pdfbox.pdfparser
Class ConformingPDFParser

java.lang.Object
  extended by org.apache.pdfbox.pdfparser.BaseParser
      extended by org.apache.pdfbox.pdfparser.ConformingPDFParser

public class ConformingPDFParser
extends BaseParser

Author:
Adam Nichols

Field Summary
protected  RandomAccess inputFile
           
 
Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser
DEF, document, ENDOBJ, ENDSTREAM, FORCE_PARSING, forceParsing, pdfSource
 
Constructor Summary
ConformingPDFParser(java.io.File inputFile)
          Creates a parser that reads the PDF document from the given file.
 
Method Summary
protected  byte consumeWhitespace()
          This will read all bytes until a non-whitespace character is found.
protected  byte consumeWhitespaceBackwards()
          This will read all bytes (backwards) until a non-whitespace character is found.
 COSDocument getDocument()
          This will get the document that was parsed.
 COSBase getObject(long objectNumber, long generation)
           
 PDDocument getPDDocument()
          This will get the PD document that was parsed.
 boolean isRecursivlyRead()
           
 void parse()
          This will parse the stream and populate the COSDocument object.
protected  COSNumber parseNumber(java.lang.String number)
           
protected  long parseTrailerInformation()
           
protected  COSBase processCosObject(java.lang.String string)
           
protected  java.lang.String readBackwardUntilWhitespace()
           
protected  byte readByte()
           
protected  byte readByteBackwards()
           
protected  COSDictionary readDictionaryBackwards()
           
protected  int readInt()
          This will read an integer from the stream.
protected  java.lang.String readLine()
          This will read a line starting with the byte at the current offset and going forward until it finds a newline.
protected  java.lang.String readLineBackwards()
          This will read a line starting with the byte at the current offset and going backwards until it finds a newline.
protected  long readLongBackwards()
          This will consume any whitespace, read in bytes until whitespace is found again and then parse the characters which have been read as a long.
protected  COSName readNameBackwards()
           
protected  COSNumber readNumber()
          This will read in a number and return the COS version of the number (be it a COSInteger or a COSFloat).
protected  COSBase readObject()
          This actually reads the object data.
 COSBase readObject(long objectNumber, long generation)
          This will read an object from the inputFile at whatever our currentOffset is.
protected  COSBase readObjectBackwards()
           
protected  java.lang.String readString()
          This will read the next string from the stream.
protected  java.lang.String readWord()
           
 void setRecursivlyRead(boolean recursivlyRead)
           
 
Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser
isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSStream, parseCOSString, parseDirObject, readExpectedString, readString, setDocument, skipSpaces
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

inputFile

protected RandomAccess inputFile
Constructor Detail

ConformingPDFParser

public ConformingPDFParser(java.io.File inputFile)
                    throws java.io.IOException
Creates a parser that reads the PDF document from the given file.

Parameters:
inputFile - The file that contains the PDF document.
Throws:
java.io.IOException - If there is an error initializing the stream.
Method Detail

parse

public void parse()
           throws java.io.IOException
This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.

Throws:
java.io.IOException - If there is an error reading from the stream or corrupt data is found.

getDocument

public COSDocument getDocument()
                        throws java.io.IOException
This will get the document that was parsed. parse() must be called before this is called. When you are done with this document you must call close() on it to release resources.

Returns:
The document that was parsed.
Throws:
java.io.IOException - If there is an error getting the document.

getPDDocument

public PDDocument getPDDocument()
                         throws java.io.IOException
This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources.

Returns:
The document at the PD layer.
Throws:
java.io.IOException - If there is an error getting the document.

parseTrailerInformation

protected long parseTrailerInformation()
                                throws java.io.IOException,
                                       java.lang.NumberFormatException
Throws:
java.io.IOException
java.lang.NumberFormatException

readByteBackwards

protected byte readByteBackwards()
                          throws java.io.IOException
Throws:
java.io.IOException

readByte

protected byte readByte()
                 throws java.io.IOException
Throws:
java.io.IOException

readBackwardUntilWhitespace

protected java.lang.String readBackwardUntilWhitespace()
                                                throws java.io.IOException
Throws:
java.io.IOException

consumeWhitespaceBackwards

protected byte consumeWhitespaceBackwards()
                                   throws java.io.IOException
This will read all bytes (backwards) until a non-whitespace character is found. To save you an extra read, the non-whitespace character is returned. If the current character is not whitespace, this method will just return the current char.

Returns:
the first non-whitespace character found
Throws:
java.io.IOException - if there is an error reading from the file
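
The backward whitespace skip described above can be sketched as follows. This is a minimal illustration, not the PDFBox implementation: the class BackwardWhitespaceSketch, its in-memory byte array (standing in for the RandomAccess input file) and its behavior at the start of the data are all assumptions made for the example. The whitespace set is the one defined by the PDF specification.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the parser's file cursor; not a PDFBox class. */
final class BackwardWhitespaceSketch {
    private final byte[] data; // assumed in-memory copy of the file content
    private int offset;        // index of the current byte

    BackwardWhitespaceSketch(String content) {
        this.data = content.getBytes(StandardCharsets.ISO_8859_1);
        this.offset = data.length - 1; // start at the last byte
    }

    /** PDF whitespace: NUL, tab, line feed, form feed, carriage return, space. */
    static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    /**
     * Moves backwards past whitespace and returns the first non-whitespace
     * byte. If the current byte is not whitespace, it is returned and the
     * offset does not move.
     */
    byte consumeWhitespaceBackwards() {
        while (offset >= 0 && isWhitespace(data[offset])) {
            offset--;
        }
        if (offset < 0) {
            // Assumption: the sketch treats running off the start as an error.
            throw new IllegalStateException("reached start of data");
        }
        return data[offset];
    }

    int offset() {
        return offset;
    }
}
```

Because the non-whitespace byte is returned directly, a caller does not need a second read to find out what stopped the scan.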

consumeWhitespace

protected byte consumeWhitespace()
                          throws java.io.IOException
This will read all bytes until a non-whitespace character is found. To save you an extra read, the non-whitespace character is returned. If the current character is not whitespace, this method will just return the current char.

Returns:
the first non-whitespace character found
Throws:
java.io.IOException - if there is an error reading from the file

readLongBackwards

protected long readLongBackwards()
                          throws java.io.IOException,
                                 java.lang.NumberFormatException
This will consume any whitespace, read in bytes until whitespace is found again and then parse the characters which have been read as a long. The current offset will then point at the first whitespace character which precedes the number.

Returns:
the parsed number
Throws:
java.io.IOException - if there is an error reading from the file
java.lang.NumberFormatException - if the bytes read cannot be converted to a number
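
The three steps described above (skip whitespace, collect bytes until the next whitespace, parse the result) might look like the sketch below. It is an illustration under assumptions, not the PDFBox code: ReadLongBackwardsSketch is a hypothetical class, and a byte array stands in for the RandomAccess input file.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of reading a number right-to-left; not a PDFBox class. */
final class ReadLongBackwardsSketch {
    private final byte[] data; // assumed in-memory copy of the file content
    private int offset;        // index of the current byte

    ReadLongBackwardsSketch(String content) {
        this.data = content.getBytes(StandardCharsets.ISO_8859_1);
        this.offset = data.length - 1; // start at the last byte
    }

    private static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    /**
     * Skips whitespace, collects bytes until whitespace is found again and
     * parses them as a long. Afterwards the offset points at the whitespace
     * byte that precedes the number (or is -1 if the start was reached).
     */
    long readLongBackwards() {
        while (offset >= 0 && isWhitespace(data[offset])) {
            offset--;
        }
        StringBuilder reversed = new StringBuilder();
        while (offset >= 0 && !isWhitespace(data[offset])) {
            reversed.append((char) (data[offset] & 0xFF));
            offset--;
        }
        // The bytes were collected last-to-first, so reverse before parsing.
        // Long.parseLong throws NumberFormatException for non-numeric text.
        return Long.parseLong(reversed.reverse().toString());
    }

    int offset() {
        return offset;
    }
}
```

For content ending in "startxref", a newline, "12345" and a final newline, the sketch returns 12345 and leaves the offset on the newline that follows "startxref".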

readInt

protected int readInt()
               throws java.io.IOException
Description copied from class: BaseParser
This will read an integer from the stream.

Overrides:
readInt in class BaseParser
Returns:
The integer that was read from the stream.
Throws:
java.io.IOException - If there is an error reading from the stream.

readNumber

protected COSNumber readNumber()
                        throws java.io.IOException
This will read in a number and return the COS version of the number (be it a COSInteger or a COSFloat).

Returns:
the COSNumber which was read/parsed
Throws:
java.io.IOException
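
The choice between an integer and a floating-point result can be illustrated with the sketch below. It is not the PDFBox implementation: NumberTokenSketch is a hypothetical class, and java.lang.Long and java.lang.Double stand in for COSInteger and COSFloat so that the example needs no library. The accepted syntax (optional sign, digits, optional decimal point, no exponent) follows the numeric object syntax of the PDF specification.

```java
/** Hypothetical illustration of integer/real classification; not a PDFBox class. */
final class NumberTokenSketch {
    private NumberTokenSketch() {
    }

    /**
     * Returns a Long for integer tokens and a Double for real tokens.
     * Long and Double are stand-ins for COSInteger and COSFloat.
     */
    static Number parseNumber(String token) {
        if (token.matches("[+-]?\\d+")) {
            return Long.valueOf(token);       // e.g. "42", "-7", "+12"
        }
        if (token.matches("[+-]?(\\d+\\.\\d*|\\.\\d+)")) {
            return Double.valueOf(token);     // e.g. "3.14", "4.", "-.002"
        }
        throw new NumberFormatException("not a PDF number: " + token);
    }
}
```

A caller can then branch on the runtime type of the result, much as code using the real parser would distinguish a COSInteger from a COSFloat.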

parseNumber

protected COSNumber parseNumber(java.lang.String number)
                         throws java.io.IOException
Throws:
java.io.IOException

processCosObject

protected COSBase processCosObject(java.lang.String string)
                            throws java.io.IOException
Throws:
java.io.IOException

readObjectBackwards

protected COSBase readObjectBackwards()
                               throws java.io.IOException
Throws:
java.io.IOException

readNameBackwards

protected COSName readNameBackwards()
                             throws java.io.IOException
Throws:
java.io.IOException

getObject

public COSBase getObject(long objectNumber,
                         long generation)
                  throws java.io.IOException
Throws:
java.io.IOException

readObject

public COSBase readObject(long objectNumber,
                          long generation)
                   throws java.io.IOException
This will read an object from the inputFile at whatever our currentOffset is. If the object and generation are not the expected values and this object is set to throw an exception for non-conforming documents, then an exception will be thrown.

Parameters:
objectNumber - the object number you expect to read
generation - the generation you expect this object to be
Returns:
the object that was read
Throws:
java.io.IOException
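
The check of object number and generation described above can be sketched as follows. This is an illustration only: ObjectHeaderSketch, its method name and the boolean strict flag (standing in for "set to throw an exception for non-conforming documents") are assumptions, and the real parser may perform the check differently. The header form "objectNumber generation obj" comes from the PDF specification.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of an object header check; not a PDFBox class. */
final class ObjectHeaderSketch {
    // An indirect object starts with "<objectNumber> <generation> obj".
    private static final Pattern HEADER = Pattern.compile("(\\d+)\\s+(\\d+)\\s+obj");

    private ObjectHeaderSketch() {
    }

    /**
     * Returns true if the header names the expected object. On a mismatch
     * the method throws when strict is set and returns false otherwise.
     */
    static boolean matchesExpected(String header, long objectNumber,
                                   long generation, boolean strict) throws IOException {
        Matcher m = HEADER.matcher(header.trim());
        boolean ok = m.matches()
                && Long.parseLong(m.group(1)) == objectNumber
                && Long.parseLong(m.group(2)) == generation;
        if (!ok && strict) {
            throw new IOException("Expected object " + objectNumber + " "
                    + generation + " but found: " + header);
        }
        return ok;
    }
}
```

In lenient mode the caller receives false and can decide whether to continue with the object that was actually found.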

readObject

protected COSBase readObject()
                      throws java.io.IOException
This actually reads the object data.

Returns:
the object which is read
Throws:
java.io.IOException

readString

protected java.lang.String readString()
                               throws java.io.IOException
This will read the next string from the stream.

Overrides:
readString in class BaseParser
Returns:
The string that was read from the stream.
Throws:
java.io.IOException - If there is an error reading from the stream.

readDictionaryBackwards

protected COSDictionary readDictionaryBackwards()
                                         throws java.io.IOException
Throws:
java.io.IOException

readLineBackwards

protected java.lang.String readLineBackwards()
                                      throws java.io.IOException
This will read a line starting with the byte at the current offset and going backwards until it finds a newline. This should only be used if we are certain that the data will only be text, and not binary data.

Returns:
the string which was read
Throws:
java.io.IOException - if there was an error reading data from the file
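
Reading a line backwards can be sketched as below. This is a minimal illustration, not the PDFBox implementation: ReadLineBackwardsSketch is a hypothetical class, a byte array stands in for the RandomAccess input file, and stepping over a CR, LF or CR LF terminator after each line is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of reading text lines right-to-left; not a PDFBox class. */
final class ReadLineBackwardsSketch {
    private final byte[] data; // assumed in-memory copy of the file content
    private int offset;        // index of the current byte

    ReadLineBackwardsSketch(String content) {
        this.data = content.getBytes(StandardCharsets.ISO_8859_1);
        this.offset = data.length - 1; // start at the last byte
    }

    /**
     * Collects bytes from the current offset backwards until a newline is
     * found, then steps over the line terminator so that the next call
     * starts on the previous line.
     */
    String readLineBackwards() {
        StringBuilder reversed = new StringBuilder();
        while (offset >= 0 && data[offset] != '\n' && data[offset] != '\r') {
            reversed.append((char) (data[offset] & 0xFF));
            offset--;
        }
        // Assumption: accept LF, CR or CR LF as a single terminator.
        if (offset >= 0 && data[offset] == '\n') {
            offset--;
        }
        if (offset >= 0 && data[offset] == '\r') {
            offset--;
        }
        // The bytes were collected last-to-first, so reverse them.
        return reversed.reverse().toString();
    }
}
```

Each byte is converted to a character one-to-one, which is only meaningful for text; that is why the method should not be pointed at binary stream data.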

readLine

protected java.lang.String readLine()
                             throws java.io.IOException
This will read a line starting with the byte at the current offset and going forward until it finds a newline. This should only be used if we are certain that the data will only be text, and not binary data.

Overrides:
readLine in class BaseParser
Returns:
the string which was read
Throws:
java.io.IOException - if there was an error reading data from the file

readWord

protected java.lang.String readWord()
                             throws java.io.IOException
Throws:
java.io.IOException

isRecursivlyRead

public boolean isRecursivlyRead()
Returns:
the recursivlyRead

setRecursivlyRead

public void setRecursivlyRead(boolean recursivlyRead)
Parameters:
recursivlyRead - the recursivlyRead to set