org.apache.pdfbox.pdfparser
Class NonSequentialPDFParser

java.lang.Object
  extended by org.apache.pdfbox.pdfparser.BaseParser
      extended by org.apache.pdfbox.pdfparser.PDFParser
          extended by org.apache.pdfbox.pdfparser.NonSequentialPDFParser

public class NonSequentialPDFParser
extends PDFParser

A PDFParser that first reads startxref and the xref tables to determine which objects are valid, and then parses only those objects. This makes it closer to a conforming parser than the sequential reading done by PDFParser. This class can be used as a replacement for PDFParser. parse() must be called before the document or its page objects can be retrieved, e.g. via getPDDocument(). This class is a much enhanced version of the QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.
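
To illustrate the idea of learning the valid objects from the xref table before parsing anything else, here is a minimal, self-contained sketch. It is not the parser's actual implementation: the class XrefTableSketch and the method parseXrefTable are hypothetical names, and the sketch handles only a classic textual xref table (no xref streams, no incremental updates).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of PDFBox.
final class XrefTableSketch {

    /** Maps object number to byte offset for every in-use ('n') entry. */
    static Map<Integer, Long> parseXrefTable(String table) {
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        String[] lines = table.trim().split("\\R");
        int i = 0;
        if (lines[i].trim().equals("xref")) {
            i++; // skip the keyword line
        }
        while (i < lines.length) {
            // Subsection header: "<first object number> <entry count>"
            String[] header = lines[i++].trim().split("\\s+");
            if (header.length != 2) {
                break; // e.g. the "trailer" keyword ends the table
            }
            int first = Integer.parseInt(header[0]);
            int count = Integer.parseInt(header[1]);
            for (int k = 0; k < count && i < lines.length; k++) {
                // Entry: "<10-digit offset> <5-digit generation> <n|f>"
                String[] entry = lines[i++].trim().split("\\s+");
                if (entry.length == 3 && entry[2].equals("n")) {
                    offsets.put(first + k, Long.parseLong(entry[0]));
                }
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String table = "xref\n0 3\n"
                + "0000000000 65535 f \n"
                + "0000000017 00000 n \n"
                + "0000000081 00000 n \n"
                + "trailer\n";
        // Only the in-use objects and their offsets are reported.
        System.out.println(parseXrefTable(table));
    }
}
```

A parser working this way can seek directly to each recorded offset and ignore any bytes in the file that no xref entry points to.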


Field Summary
static java.lang.String SYSPROP_EOFLOOKUPRANGE
           
static java.lang.String SYSPROP_PARSEMINIMAL
           
 
Fields inherited from class org.apache.pdfbox.pdfparser.PDFParser
xrefTrailerResolver
 
Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser
DEF, document, ENDOBJ, ENDSTREAM, FORCE_PARSING, forceParsing, pdfSource
 
Constructor Summary
NonSequentialPDFParser(java.io.File file, RandomAccess raBuf)
          Constructs a parser for the given file, using the given buffer for temporary storage.
NonSequentialPDFParser(java.io.File file, RandomAccess raBuf, java.lang.String decryptionPassword)
          Constructs a parser for the given file, using the given buffer for temporary storage and the given password for decryption.
NonSequentialPDFParser(java.lang.String filename)
          Constructs a parser for the given file, using a memory buffer.
 
Method Summary
 PDPage getPage(int pageNr)
          Returns the page requested with all the objects loaded into it.
 int getPageNumber()
          Returns the number of pages in a document.
 PDDocument getPDDocument()
          This will get the PD document that was parsed.
 SecurityHandler getSecurityHandler()
          Returns the security handler of the document, or null if the document is not encrypted or parse() has not been called yet.
 void parse()
          This will parse the stream and populate the COSDocument object.
protected  COSStream parseCOSStream(COSDictionary dic, RandomAccess file)
          This will read a COSStream from the input stream using the length attribute in the dictionary.
 void setEOFLookupRange(int byteCount)
          Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker.
 
Methods inherited from class org.apache.pdfbox.pdfparser.PDFParser
getDocument, getFDFDocument, isContinueOnError, parseStartXref, parseTrailer, parseXrefStream, parseXrefTable, setTempDirectory
 
Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser
isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseDirObject, readExpectedString, readInt, readLine, readString, readString, setDocument, skipSpaces
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SYSPROP_PARSEMINIMAL

public static final java.lang.String SYSPROP_PARSEMINIMAL
See Also:
Constant Field Values

SYSPROP_EOFLOOKUPRANGE

public static final java.lang.String SYSPROP_EOFLOOKUPRANGE
See Also:
Constant Field Values
Constructor Detail

NonSequentialPDFParser

public NonSequentialPDFParser(java.lang.String filename)
                       throws java.io.IOException
Constructs a parser for the given file, using a memory buffer.

Parameters:
filename - the filename of the PDF to be parsed
Throws:
java.io.IOException - If something went wrong.

NonSequentialPDFParser

public NonSequentialPDFParser(java.io.File file,
                              RandomAccess raBuf)
                       throws java.io.IOException
Constructs a parser for the given file, using the given buffer for temporary storage.

Parameters:
file - the PDF file to be parsed
raBuf - the buffer to be used for parsing
Throws:
java.io.IOException - If something went wrong.

NonSequentialPDFParser

public NonSequentialPDFParser(java.io.File file,
                              RandomAccess raBuf,
                              java.lang.String decryptionPassword)
                       throws java.io.IOException
Constructs a parser for the given file, using the given buffer for temporary storage and the given password for decryption.

Parameters:
file - the PDF file to be parsed
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption
Throws:
java.io.IOException - If something went wrong.
Method Detail

setEOFLookupRange

public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.

If the system property named by SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization; it can still be overridden later by calling this method.
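
The following sketch shows one way such a trailing-bytes lookup could work. It is an illustration under stated assumptions, not the parser's code: the class EofLookupSketch, the property name "example.eofLookupRange", and the default of 2048 are all hypothetical stand-ins for the real constants, whose values are not given here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of PDFBox.
final class EofLookupSketch {

    // Assumed placeholder values; the real property name and default differ.
    static final String HYPOTHETICAL_PROP = "example.eofLookupRange";
    static final int HYPOTHETICAL_DEFAULT = 2048;

    /** Uses the system property if it is defined, otherwise the default. */
    static int lookupRange() {
        return Integer.getInteger(HYPOTHETICAL_PROP, HYPOTHETICAL_DEFAULT);
    }

    /**
     * Searches only the last byteCount bytes for "%%EOF" and the
     * "startxref" keyword before it. Returns the offset written after
     * "startxref", or -1 if either marker is outside the searched range.
     * A non-numeric offset would raise NumberFormatException in this sketch.
     */
    static long findStartXref(byte[] pdf, int byteCount) {
        int from = Math.max(0, pdf.length - byteCount);
        String tail = new String(pdf, from, pdf.length - from,
                StandardCharsets.ISO_8859_1);
        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        int keyword = tail.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        String digits = tail.substring(keyword + "startxref".length(), eof).trim();
        return Long.parseLong(digits);
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\nbody\nstartxref\n1234\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findStartXref(pdf, lookupRange())); // range covers both markers
        System.out.println(findStartXref(pdf, 8));             // range too small for 'startxref'
    }
}
```

The second call in main shows why the range matters: if trailing garbage pushes 'startxref' outside the searched window, the lookup fails, and a larger byteCount is needed.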

Parameters:
byteCount - number of trailing bytes

parse

public void parse()
           throws java.io.IOException
This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.

Overrides:
parse in class PDFParser
Throws:
java.io.IOException - If there is an error reading from the stream or corrupt data is found.

getSecurityHandler

public SecurityHandler getSecurityHandler()
Returns the security handler of the document, or null if the document is not encrypted or parse() has not been called yet.

Returns:
the security handler.

getPDDocument

public PDDocument getPDDocument()
                         throws java.io.IOException
This will get the PD document that was parsed. When you are done with this document, you must call close() on it to release resources. Overriding the superclass method was necessary in order to set the security handler.

Overrides:
getPDDocument in class PDFParser
Returns:
The document at the PD layer.
Throws:
java.io.IOException - If there is an error getting the document.

getPageNumber

public int getPageNumber()
                  throws java.io.IOException
Returns the number of pages in a document.

Returns:
the number of pages.
Throws:
java.io.IOException - if the PAGES object or another required object is missing

getPage

public PDPage getPage(int pageNr)
               throws java.io.IOException
Returns the page requested with all the objects loaded into it.

Parameters:
pageNr - zero-based page number; valid values start at 0 and are less than the number of pages.
Returns:
the page with the given page number.
Throws:
java.io.IOException - If something went wrong.

parseCOSStream

protected COSStream parseCOSStream(COSDictionary dic,
                                   RandomAccess file)
                            throws java.io.IOException
This will read a COSStream from the input stream using the length attribute in the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. The stream data is therefore copied without testing for 'endstream' or 'endobj', so it is not a problem if these keywords occur within the stream. 'endstream' must be found after the stream data has been read.
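
The length-based approach described above can be sketched with plain java.io streams. This is a simplified illustration, not the method's implementation: the class LengthBasedStreamSketch and the method readStreamData are hypothetical, the length is passed in as an already-resolved int, and the data is returned as a byte array instead of being written to a RandomAccess buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of PDFBox.
final class LengthBasedStreamSketch {

    /** Copies exactly 'length' bytes, then requires the 'endstream' keyword. */
    static byte[] readStreamData(InputStream in, int length) throws IOException {
        byte[] data = new byte[length];
        // Copies the bytes without inspecting them, so keywords inside the
        // data are harmless. Throws EOFException if the input is too short.
        new DataInputStream(in).readFully(data);

        // Skip optional end-of-line bytes between the data and the keyword.
        int b = in.read();
        while (b == '\r' || b == '\n') {
            b = in.read();
        }
        byte[] keyword = "endstream".getBytes(StandardCharsets.US_ASCII);
        for (int i = 0; i < keyword.length; i++) {
            if (b != keyword[i]) {
                throw new IOException("expected 'endstream' after stream data");
            }
            if (i < keyword.length - 1) {
                b = in.read();
            }
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        // The payload deliberately contains both keywords.
        String payload = "x endstream endobj y";
        byte[] input = (payload + "\nendstream\nendobj")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] data = readStreamData(new ByteArrayInputStream(input), payload.length());
        System.out.println(new String(data, StandardCharsets.US_ASCII));
    }
}
```

With a wrong length the check after the copy fails, which corresponds to the error cases listed for this method (stream not followed by 'endstream', stream too short).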

Overrides:
parseCOSStream in class BaseParser
Parameters:
dic - dictionary that goes with this stream.
file - file to write the stream to when reading.
Returns:
the parsed PDF stream.
Throws:
java.io.IOException - if an error occurred while reading the stream, e.g. the length attribute could not be read, the stream data is not followed by 'endstream', or the stream is too short.