Class PDFTextStripper

  • Direct Known Subclasses:
    PDFTextStripperByArea

    public class PDFTextStripper
    extends PDFStreamEngine
    This class takes a PDF document and strips out all of the text, ignoring the formatting. Note that it is up to clients of this class to verify that a specific user has the correct permissions to extract text from the PDF document. The basic flow of this process is that we get a document and use a series of processXXX() functions that work on smaller and smaller chunks of the page. Eventually, we fully process each page and then print it.
    Author:
    Ben Litchfield
    • Field Detail

      • LINE_SEPARATOR

        protected final java.lang.String LINE_SEPARATOR
        The platform's line separator.
      • charactersByArticle

        protected java.util.ArrayList<java.util.List<TextPosition>> charactersByArticle
        charactersByArticle is used to extract text by article divisions. For example, for a PDF that has two columns like a newspaper, we want to extract the first column and then the second column. In this example the PDF would have 2 beads (or articles), one for each column. The size of charactersByArticle would be 5, because not all text on the screen will fall into one of the articles. The five divisions are:
        1. text before the first article
        2. first article text
        3. text between the first article and the second article
        4. second article text
        5. text after the second article
        Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
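        The two-column example above suggests a general rule of one division before the first bead plus two per bead (the bead's own text and the text that follows it). The sketch below only models that counting; ArticleDivisions and divisionLabels are hypothetical names, not part of PDFBox, and the 2n + 1 rule is inferred from the example rather than stated by this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: labels the divisions that
// charactersByArticle would hold for a page with the given number of beads.
final class ArticleDivisions {

    static List<String> divisionLabels(int beadCount) {
        if (beadCount < 0) {
            throw new IllegalArgumentException("beadCount must be >= 0");
        }
        List<String> labels = new ArrayList<>();
        if (beadCount == 0) {
            // No beads: everything falls into a single division.
            labels.add("all text on the page");
            return labels;
        }
        labels.add("text before article 1");
        for (int i = 1; i <= beadCount; i++) {
            labels.add("article " + i + " text");
            // After each bead comes either the gap to the next bead
            // or the trailing text after the last bead.
            labels.add(i < beadCount
                    ? "text between article " + i + " and article " + (i + 1)
                    : "text after article " + i);
        }
        return labels;
    }
}
```

        Under this model, divisionLabels(2) has five entries and divisionLabels(0) has one, matching the sizes described above.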
      • output

        protected java.io.Writer output
    • Constructor Detail

      • PDFTextStripper

        public PDFTextStripper()
                        throws java.io.IOException
        Instantiate a new PDFTextStripper object.
        Throws:
        java.io.IOException - If there is an error loading the properties.
    • Method Detail

      • getText

        public java.lang.String getText​(PDDocument doc)
                                 throws java.io.IOException
        This will return the text of a document. See writeText.
        NOTE: The document must not be encrypted when coming into this method.
        Parameters:
        doc - The document to get the text from.
        Returns:
        The text of the PDF document.
        Throws:
        java.io.IOException - if the doc state is invalid or it is encrypted.
      • writeText

        public void writeText​(PDDocument doc,
                              java.io.Writer outputStream)
                       throws java.io.IOException
        This will take a PDDocument and write the text of that document to the given writer.
        Parameters:
        doc - The document to get the data from.
        outputStream - The location to put the text.
        Throws:
        java.io.IOException - If the doc is in an invalid state.
      • processPages

        protected void processPages​(PDPageTree pages)
                             throws java.io.IOException
        This will process all of the pages and the text that is in them.
        Parameters:
        pages - The pages object in the document.
        Throws:
        java.io.IOException - If there is an error parsing the text.
      • startDocument

        protected void startDocument​(PDDocument document)
                              throws java.io.IOException
        This method is available for subclasses of this class. It will be called before processing of the document starts.
        Parameters:
        document - The PDF document that is being processed.
        Throws:
        java.io.IOException - If an IO error occurs.
      • endDocument

        protected void endDocument​(PDDocument document)
                            throws java.io.IOException
        This method is available for subclasses of this class. It will be called after processing of the document finishes.
        Parameters:
        document - The PDF document that is being processed.
        Throws:
        java.io.IOException - If an IO error occurs.
      • processPage

        public void processPage​(PDPage page)
                         throws java.io.IOException
        This will process the contents of a page.
        Parameters:
        page - The page to process.
        Throws:
        java.io.IOException - If there is an error processing the page.
      • startArticle

        protected void startArticle()
                             throws java.io.IOException
        Start a new article, which is typically defined as a column on a single page (also referred to as a bead). This assumes that the primary direction of text is left to right. Default implementation is to do nothing. Subclasses may provide additional information.
        Throws:
        java.io.IOException - If there is any error writing to the stream.
      • startArticle

        protected void startArticle​(boolean isLTR)
                             throws java.io.IOException
        Start a new article, which is typically defined as a column on a single page (also referred to as a bead). Default implementation is to do nothing. Subclasses may provide additional information.
        Parameters:
        isLTR - true if primary direction of text is left to right.
        Throws:
        java.io.IOException - If there is any error writing to the stream.
      • endArticle

        protected void endArticle()
                           throws java.io.IOException
        End an article. Default implementation is to do nothing. Subclasses may provide additional information.
        Throws:
        java.io.IOException - If there is any error writing to the stream.
      • startPage

        protected void startPage​(PDPage page)
                          throws java.io.IOException
        Start a new page. Default implementation is to do nothing. Subclasses may provide additional information.
        Parameters:
        page - The page we are about to process.
        Throws:
        java.io.IOException - If there is any error writing to the stream.
      • endPage

        protected void endPage​(PDPage page)
                        throws java.io.IOException
        End a page. Default implementation is to do nothing. Subclasses may provide additional information.
        Parameters:
        page - The page we have just processed.
        Throws:
        java.io.IOException - If there is any error writing to the stream.
      • writePage

        protected void writePage()
                          throws java.io.IOException
        This will print the text of the processed page to "output". It will estimate, based on the coordinates of the text, where newlines and word spacings should be placed. The text will be sorted only if that feature was enabled.
        Throws:
        java.io.IOException - If there is an error writing the text.
      • writeLineSeparator

        protected void writeLineSeparator()
                                   throws java.io.IOException
        Write the line separator value to the output stream.
        Throws:
        java.io.IOException - If there is a problem writing out the line separator to the document.
      • writeWordSeparator

        protected void writeWordSeparator()
                                   throws java.io.IOException
        Write the word separator value to the output stream.
        Throws:
        java.io.IOException - If there is a problem writing out the word separator to the document.
      • writeCharacters

        protected void writeCharacters​(TextPosition text)
                                throws java.io.IOException
        Write the string in TextPosition to the output stream.
        Parameters:
        text - The text to write to the stream.
        Throws:
        java.io.IOException - If there is an error when writing the text.
      • writeString

        protected void writeString​(java.lang.String text,
                                   java.util.List<TextPosition> textPositions)
                            throws java.io.IOException
        Write a Java string to the output stream. The default implementation ignores the textPositions and just calls writeString(String).
        Parameters:
        text - The text to write to the stream.
        textPositions - The TextPositions belonging to the text.
        Throws:
        java.io.IOException - If there is an error when writing the text.
      • writeString

        protected void writeString​(java.lang.String text)
                            throws java.io.IOException
        Write a Java string to the output stream.
        Parameters:
        text - The text to write to the stream.
        Throws:
        java.io.IOException - If there is an error when writing the text.
      • processTextPosition

        protected void processTextPosition​(TextPosition text)
        This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
        Parameters:
        text - The text to process.
      • getStartPage

        public int getStartPage()
        This is the page that the text extraction will start on. The pages start at page 1. For example in a 5 page PDF document, if the start page is 1 then all pages will be extracted. If the start page is 4 then pages 4 and 5 will be extracted. The default value is 1.
        Returns:
        Value of property startPage.
      • setStartPage

        public void setStartPage​(int startPageValue)
        This will set the first page to be extracted by this class.
        Parameters:
        startPageValue - New value of 1-based startPage property.
      • getEndPage

        public int getEndPage()
        This will get the last page that will be extracted. This is inclusive. For example, in a 5 page PDF an endPage value of 5 would extract the entire document, and an endPage value of 2 would extract pages 1 and 2. This defaults to Integer.MAX_VALUE such that all pages of the PDF will be extracted.
        Returns:
        Value of property endPage.
      • setEndPage

        public void setEndPage​(int endPageValue)
        This will set the last page to be extracted by this class.
        Parameters:
        endPageValue - New value of 1-based endPage property.
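        The startPage and endPage examples can be checked with a small model of the range logic. This is a sketch only: PageRangeSketch and pagesToExtract are hypothetical names, and clamping the range to the document's page count and to page 1 is an assumption made for illustration, not something this page specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the 1-based, inclusive page range; not PDFBox code.
final class PageRangeSketch {

    static List<Integer> pagesToExtract(int startPage, int endPage, int pageCount) {
        List<Integer> pages = new ArrayList<>();
        // Assumption: an endPage beyond the document (such as the default
        // Integer.MAX_VALUE) simply means "up to the last page".
        int last = Math.min(endPage, pageCount);
        // Assumption: a startPage below 1 is treated as page 1.
        for (int page = Math.max(startPage, 1); page <= last; page++) {
            pages.add(page);
        }
        return pages;
    }
}
```

        For a 5 page document, pagesToExtract(4, Integer.MAX_VALUE, 5) yields pages 4 and 5, and pagesToExtract(1, 2, 5) yields pages 1 and 2, as in the descriptions above.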
      • setLineSeparator

        public void setLineSeparator​(java.lang.String separator)
        Set the desired line separator for output text. The line.separator system property is used if the line separator preference is not set explicitly using this method.
        Parameters:
        separator - The desired line separator string.
      • getLineSeparator

        public java.lang.String getLineSeparator()
        This will get the line separator.
        Returns:
        The desired line separator string.
      • getWordSeparator

        public java.lang.String getWordSeparator()
        This will get the word separator.
        Returns:
        The desired word separator string.
      • setWordSeparator

        public void setWordSeparator​(java.lang.String separator)
        Set the desired word separator for output text. The PDFBox text extraction algorithm will output a space character if there is enough space between two words. By default a space character is used. If you need an accurate count of characters that are found in a PDF document then you might want to set the word separator to the empty string.
        Parameters:
        separator - The desired word separator string.
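        The effect of the word separator on character counts can be seen in a tiny stand-alone sketch. WordSeparatorSketch and join are hypothetical names; the array of words stands in for the runs of text that the extractor has already split apart.

```java
// Hypothetical illustration, not PDFBox code: joins already-separated words
// with a configurable separator, as the extractor does between words.
final class WordSeparatorSketch {

    static String join(String[] words, String wordSeparator) {
        return String.join(wordSeparator, words);
    }
}
```

        With the words "Hello" and "world", a separator of " " gives 11 characters while the empty string gives 10, which is the count of characters taken from the words themselves.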
      • getSuppressDuplicateOverlappingText

        public boolean getSuppressDuplicateOverlappingText()
        Returns:
        Returns the suppressDuplicateOverlappingText.
      • getCurrentPageNo

        protected int getCurrentPageNo()
        Get the current page number that is being processed.
        Returns:
        A 1 based number representing the current page.
      • getOutput

        protected java.io.Writer getOutput()
        The output stream that is being written to.
        Returns:
        The stream that output is being written to.
      • getCharactersByArticle

        protected java.util.List<java.util.List<TextPosition>> getCharactersByArticle()
        Character strings are grouped by articles. It is quite common that there will only be a single article. This returns a List that contains List objects, the inner lists will contain TextPosition objects.
        Returns:
        A double List of TextPositions for all text strings on the page.
      • setSuppressDuplicateOverlappingText

        public void setSuppressDuplicateOverlappingText​(boolean suppressDuplicateOverlappingTextValue)
        By default the text stripper will attempt to remove text that overlaps other text. Word, for example, paints the same character several times in order to make it look bold. Setting this to false extracts all text, which means that certain sections will be duplicated, but extraction will be faster.
        Parameters:
        suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
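        To make the trade-off concrete, here is a simplified model of duplicate suppression: a glyph is dropped when the same character has already been kept at nearly the same position. OverlapSuppressionSketch, Glyph and the single tolerance parameter are all assumptions made for illustration; the real implementation may decide overlap differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of overlap suppression; not the PDFBox implementation.
final class OverlapSuppressionSketch {

    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps a glyph unless an identical character was already kept within
    // `tolerance` units of it on both axes.
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph candidate : glyphs) {
            boolean duplicate = false;
            for (Glyph earlier : kept) {
                if (earlier.text.equals(candidate.text)
                        && Math.abs(earlier.x - candidate.x) < tolerance
                        && Math.abs(earlier.y - candidate.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

        Each glyph is compared against the ones already kept, which hints at why skipping this check is cheaper at the cost of duplicated text.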
      • getSeparateByBeads

        public boolean getSeparateByBeads()
        This will tell if the text stripper should separate by beads.
        Returns:
        If the text will be grouped by beads.
      • setShouldSeparateByBeads

        public void setShouldSeparateByBeads​(boolean aShouldSeparateByBeads)
        Set whether the text stripper should group the text output by a list of beads. The default value is true.
        Parameters:
        aShouldSeparateByBeads - The new grouping of beads.
      • getEndBookmark

        public PDOutlineItem getEndBookmark()
        Get the bookmark where text extraction should end, inclusive. Default is null.
        Returns:
        The ending bookmark.
      • setEndBookmark

        public void setEndBookmark​(PDOutlineItem aEndBookmark)
        Set the bookmark where the text extraction should stop.
        Parameters:
        aEndBookmark - The ending bookmark.
      • getStartBookmark

        public PDOutlineItem getStartBookmark()
        Get the bookmark where text extraction should start, inclusive. Default is null.
        Returns:
        The starting bookmark.
      • setStartBookmark

        public void setStartBookmark​(PDOutlineItem aStartBookmark)
        Set the bookmark where text extraction should start, inclusive.
        Parameters:
        aStartBookmark - The starting bookmark.
      • getAddMoreFormatting

        public boolean getAddMoreFormatting()
        This will tell if the text stripper should add some more text formatting.
        Returns:
        true if some more text formatting will be added
      • setAddMoreFormatting

        public void setAddMoreFormatting​(boolean newAddMoreFormatting)
        Some additional text formatting will be added if addMoreFormatting is set to true. Default is false.
        Parameters:
        newAddMoreFormatting - Tell PDFBox to add some more text formatting
      • getSortByPosition

        public boolean getSortByPosition()
        This will tell if the text stripper should sort the text tokens before writing to the stream.
        Returns:
        true if the text tokens will be sorted before being written.
      • setSortByPosition

        public void setSortByPosition​(boolean newSortByPosition)
        The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. For example, a PDF writer may write out all text by font, emitting all bold or larger text first and then making a second pass to write out the normal text. A PDF writer could even choose to write each character in a different order.

        The default is to not sort by position: PDFBox does not sort the text tokens before processing them, for performance reasons.
        Parameters:
        newSortByPosition - Tell PDFBox to sort the text positions.
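        The difference between stream order and visual order can be sketched with a plain comparator that orders tokens top to bottom, then left to right. SortByPositionSketch and Token are hypothetical names, y is assumed to be measured from the top of the page, and tokens on the same line are assumed to share exactly the same y; the comparator PDFBox actually uses may be more tolerant than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of position-based sorting; not the PDFBox comparator.
final class SortByPositionSketch {

    static final class Token {
        final String text;
        final float x;
        final float y; // assumption: distance from the top of the page

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns a sorted copy, leaving the original stream order untouched.
    static List<Token> sortByPosition(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator
                .comparingDouble((Token t) -> t.y)
                .thenComparingDouble(t -> t.x));
        return sorted;
    }
}
```

        If a writer emits the bold tokens "Heading" (y = 10) and "Note:" (y = 30) before the normal tokens "body text" (y = 20) and "details" (y = 30, further right), the sorted order becomes Heading, body text, Note:, details.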
      • getSpacingTolerance

        public float getSpacingTolerance()
        Get the current space width-based tolerance value that is being used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error.
        Returns:
        The current tolerance / scaling factor
      • setSpacingTolerance

        public void setSpacingTolerance​(float spacingToleranceValue)
        Set the space width-based tolerance value that is used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error. Setting this value larger will reduce the number of spaces added.
        Parameters:
        spacingToleranceValue - tolerance / scaling factor to use
      • getAverageCharTolerance

        public float getAverageCharTolerance()
        Get the current character width-based tolerance value that is being used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error.
        Returns:
        The current tolerance / scaling factor
      • setAverageCharTolerance

        public void setAverageCharTolerance​(float averageCharToleranceValue)
        Set the character width-based tolerance value that is used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error. Setting this value larger will reduce the number of spaces added.
        Parameters:
        averageCharToleranceValue - average tolerance / scaling factor to use
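        The two tolerances can be pictured as scaling factors on two estimates of how wide a gap must be before it counts as a space. The sketch below is a simplified model: SpacingSketch and needsSpace are hypothetical names, the tolerance values are illustrative only, and taking the smaller of the two estimates is an assumption of this sketch rather than something this page states.

```java
// Hypothetical model of space detection between two glyphs; not PDFBox code.
final class SpacingSketch {

    // gap:                  horizontal distance between the end of the
    //                       previous glyph and the start of the next one
    // spaceWidth:           width of a space in the current font
    // averageCharWidth:     running average of character widths
    static boolean needsSpace(float gap,
                              float spaceWidth,
                              float averageCharWidth,
                              float spacingTolerance,
                              float averageCharTolerance) {
        float spaceBasedThreshold = spaceWidth * spacingTolerance;
        float charBasedThreshold = averageCharWidth * averageCharTolerance;
        // Assumption: the smaller threshold decides.
        return gap > Math.min(spaceBasedThreshold, charBasedThreshold);
    }
}
```

        With a gap of 2, a space width of 3 and an average character width of 5, tolerances of 0.5 and 0.3 report a space, while raising both tolerances to 1.0 does not. That is consistent with the note above that larger values reduce the number of spaces added.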
      • getIndentThreshold

        public float getIndentThreshold()
        Returns the indent threshold: the multiple of whitespace character widths, for the current text, by which the current line start may be indented from the previous line start. Beyond this, the current line start is considered to be a paragraph start.
        Returns:
        the number of whitespace character widths to use when detecting paragraph indents.
      • setIndentThreshold

        public void setIndentThreshold​(float indentThresholdValue)
        Sets the indent threshold: the multiple of whitespace character widths, for the current text, by which the current line start may be indented from the previous line start. Beyond this, the current line start is considered to be a paragraph start. The default value is 2.0.
        Parameters:
        indentThresholdValue - the number of whitespace character widths to use when detecting paragraph indents.
      • getDropThreshold

        public float getDropThreshold()
        Returns the drop threshold: the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start.
        Returns:
        the character height multiple for max allowed whitespace between lines in the same paragraph.
      • setDropThreshold

        public void setDropThreshold​(float dropThresholdValue)
        Sets the drop threshold: the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start. The default value is 2.5.
        Parameters:
        dropThresholdValue - the character height multiple for max allowed whitespace between lines in the same paragraph.
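        The indent and drop thresholds can be read as two independent tests on each new line. The following sketch models just those two tests using the default values quoted above (2.0 and 2.5); ParagraphSketch and isParagraphStart are hypothetical names, and the real detection may take further cues into account.

```java
// Hypothetical model of paragraph-start detection; not the PDFBox logic.
final class ParagraphSketch {

    static final float INDENT_THRESHOLD = 2.0f; // default quoted in this page
    static final float DROP_THRESHOLD = 2.5f;   // default quoted in this page

    // lineStartX / previousLineStartX: x position where each line begins
    // spaceWidth:    width of a whitespace character in the current text
    // verticalGap:   whitespace between the previous line and this one
    // maxCharHeight: max height of the current characters
    static boolean isParagraphStart(float lineStartX,
                                    float previousLineStartX,
                                    float spaceWidth,
                                    float verticalGap,
                                    float maxCharHeight) {
        boolean indented =
                (lineStartX - previousLineStartX) > INDENT_THRESHOLD * spaceWidth;
        boolean dropped = verticalGap > DROP_THRESHOLD * maxCharHeight;
        return indented || dropped;
    }
}
```

        For example, with a space width of 3 a line starting 8 units to the right of the previous one counts as a paragraph start (8 > 6), and with a character height of 10 a vertical gap of 30 does too (30 > 25).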
      • getParagraphStart

        public java.lang.String getParagraphStart()
        Returns the string which will be used at the beginning of a paragraph.
        Returns:
        the paragraph start string
      • setParagraphStart

        public void setParagraphStart​(java.lang.String s)
        Sets the string which will be used at the beginning of a paragraph.
        Parameters:
        s - the paragraph start string
      • getParagraphEnd

        public java.lang.String getParagraphEnd()
        Returns the string which will be used at the end of a paragraph.
        Returns:
        the paragraph end string
      • setParagraphEnd

        public void setParagraphEnd​(java.lang.String s)
        Sets the string which will be used at the end of a paragraph.
        Parameters:
        s - the paragraph end string
      • getPageStart

        public java.lang.String getPageStart()
        Returns the string which will be used at the beginning of a page.
        Returns:
        the page start string
      • setPageStart

        public void setPageStart​(java.lang.String pageStartValue)
        Sets the string which will be used at the beginning of a page.
        Parameters:
        pageStartValue - the page start string
      • getPageEnd

        public java.lang.String getPageEnd()
        Returns the string which will be used at the end of a page.
        Returns:
        the page end string
      • setPageEnd

        public void setPageEnd​(java.lang.String pageEndValue)
        Sets the string which will be used at the end of a page.
        Parameters:
        pageEndValue - the page end string
      • getArticleStart

        public java.lang.String getArticleStart()
        Returns the string which will be used at the beginning of an article.
        Returns:
        the article start string
      • setArticleStart

        public void setArticleStart​(java.lang.String articleStartValue)
        Sets the string which will be used at the beginning of an article.
        Parameters:
        articleStartValue - the article start string
      • getArticleEnd

        public java.lang.String getArticleEnd()
        Returns the string which will be used at the end of an article.
        Returns:
        the article end string
      • setArticleEnd

        public void setArticleEnd​(java.lang.String articleEndValue)
        Sets the string which will be used at the end of an article.
        Parameters:
        articleEndValue - the article end string
      • writeParagraphSeparator

        protected void writeParagraphSeparator()
                                        throws java.io.IOException
        Writes the paragraph separator string to the output.
        Throws:
        java.io.IOException - if something went wrong
      • writeParagraphStart

        protected void writeParagraphStart()
                                    throws java.io.IOException
        Write something (if defined) at the start of a paragraph.
        Throws:
        java.io.IOException - if something went wrong
      • writeParagraphEnd

        protected void writeParagraphEnd()
                                  throws java.io.IOException
        Write something (if defined) at the end of a paragraph.
        Throws:
        java.io.IOException - if something went wrong
      • writePageStart

        protected void writePageStart()
                               throws java.io.IOException
        Write something (if defined) at the start of a page.
        Throws:
        java.io.IOException - if something went wrong
      • writePageEnd

        protected void writePageEnd()
                             throws java.io.IOException
        Write something (if defined) at the end of a page.
        Throws:
        java.io.IOException - if something went wrong
      • setListItemPatterns

        protected void setListItemPatterns​(java.util.List<java.util.regex.Pattern> patterns)
        Use this to supply a different set of regular expression patterns for matching list item starts.
        Parameters:
        patterns - list of patterns
      • getListItemPatterns

        protected java.util.List<java.util.regex.Pattern> getListItemPatterns()
        Returns a list of regular expression Patterns representing different common list item formats. For example, numbered items of the form:
        1. some text
        2. more text
        or
        • some text
        • more text
        etc., all begin with some character pattern, such as the pattern "\\d+\\." (matches "1.", "2.", ...) or "\\[\\d+\\]" (matches "[1]", "[2]", ...).

        This method returns a list of such regular expression Patterns.

        Returns:
        a list of Pattern objects.
      • matchPattern

        protected static java.util.regex.Pattern matchPattern​(java.lang.String string,
                                                              java.util.List<java.util.regex.Pattern> patterns)
        Iterates over the specified list of Patterns until it finds one that matches the specified string, then returns that Pattern.

        Order of the supplied list of patterns is important as most common patterns should come first. Patterns should be strict in general, and all will be used with case sensitivity on.

        Parameters:
        string - the string to be searched
        patterns - list of patterns
        Returns:
        matching pattern
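        The first-match lookup described here can be sketched with java.util.regex alone, using the two example patterns mentioned for getListItemPatterns. ListItemPatternSketch and firstMatch are hypothetical stand-ins, not the PDFBox methods; requiring the whole string to match and returning null when nothing matches are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for list item pattern matching; not PDFBox code.
final class ListItemPatternSketch {

    // Ordered list: earlier patterns win, so common formats go first.
    static final List<Pattern> PATTERNS = new ArrayList<>();

    static {
        PATTERNS.add(Pattern.compile("\\d+\\."));    // "1.", "2.", ...
        PATTERNS.add(Pattern.compile("\\[\\d+\\]")); // "[1]", "[2]", ...
    }

    // Returns the first pattern that matches the whole string,
    // or null if none does (assumption of this sketch).
    static Pattern firstMatch(String string, List<Pattern> patterns) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(string).matches()) {
                return pattern;
            }
        }
        return null;
    }
}
```

        Here firstMatch("12.", PATTERNS) returns the numbered-item pattern and firstMatch("[3]", PATTERNS) returns the bracketed one. The patterns are compiled without Pattern.CASE_INSENSITIVE, in line with the note that matching is case sensitive.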
      • showGlyph

        protected void showGlyph​(Matrix textRenderingMatrix,
                                 PDFont font,
                                 int code,
                                 java.lang.String unicode,
                                 Vector displacement)
                          throws java.io.IOException
        This method was originally written by Ben Litchfield for PDFStreamEngine.
        Overrides:
        showGlyph in class PDFStreamEngine
        Parameters:
        textRenderingMatrix - the current text rendering matrix, Trm
        font - the current font
        code - internal PDF character code for the glyph
        unicode - the Unicode text for this glyph, or null if the PDF does not provide it
        displacement - the displacement (i.e. advance) of the glyph in text space
        Throws:
        java.io.IOException - if the glyph cannot be processed