Package org.apache.pdfbox.text
Class PDFMarkedContentExtractor
java.lang.Object
  org.apache.pdfbox.contentstream.PDFStreamEngine
    org.apache.pdfbox.text.PDFMarkedContentExtractor
-
public class PDFMarkedContentExtractor extends PDFStreamEngine
This is a stream engine that extracts the marked content of a PDF.
- Author:
Johannes Koch
-
Constructor Summary
Constructor, Description
PDFMarkedContentExtractor()
    Instantiate a new PDFMarkedContentExtractor object.
PDFMarkedContentExtractor(java.lang.String encoding)
    Constructor.
-
Method Summary
Modifier and Type, Method, Description
void beginMarkedContentSequence(COSName tag, COSDictionary properties)
    Called when a marked content group begins.
void endMarkedContentSequence()
    Called when a marked content group ends.
java.util.List<PDMarkedContent> getMarkedContents()
void processPage(PDPage page)
    This will initialize and process the contents of the stream.
protected void processTextPosition(TextPosition text)
    This will process a TextPosition object and add the text to the list of characters on a page.
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, java.lang.String unicode, Vector displacement)
    This method was originally written by Ben Litchfield for PDFStreamEngine.
void xobject(PDXObject xobject)
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Constructor Detail
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor() throws java.io.IOException
Instantiate a new PDFMarkedContentExtractor object.
- Throws:
java.io.IOException
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor(java.lang.String encoding) throws java.io.IOException
Constructor. Will apply encoding-specific conversions to the output text.
- Parameters:
encoding - The encoding that the output will be written in.
- Throws:
java.io.IOException
-
Method Detail
-
beginMarkedContentSequence
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
Description copied from class: PDFStreamEngine
Called when a marked content group begins.
- Overrides:
beginMarkedContentSequence in class PDFStreamEngine
- Parameters:
tag - indicates the role or significance of the sequence
properties - optional properties
-
endMarkedContentSequence
public void endMarkedContentSequence()
Description copied from class: PDFStreamEngine
Called when a marked content group ends.
- Overrides:
endMarkedContentSequence in class PDFStreamEngine
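The begin and end callbacks arrive as paired events, and marked content sequences can nest inside one another. As a rough illustration only, the following self-contained sketch models how such paired events could be folded into a tree using a stack. The names `MarkedContentNestingSketch` and `Node` are hypothetical, tags are plain strings rather than `COSName` objects, and this is not PDFBox's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model (not PDFBox code): folds begin/end events into a tree.
class MarkedContentNestingSketch {

    // One marked content sequence; the tag stands in for a COSName.
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();

        Node(String tag) {
            this.tag = tag;
        }
    }

    private final List<Node> roots = new ArrayList<>(); // top-level sequences
    private final Deque<Node> open = new ArrayDeque<>(); // currently open sequences

    // Models a "begin" event: attach to the innermost open sequence, if any.
    void begin(String tag) {
        Node node = new Node(tag);
        if (open.isEmpty()) {
            roots.add(node);
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    // Models an "end" event: close the innermost open sequence.
    // An unbalanced end (nothing open) is silently ignored in this sketch.
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    List<Node> getRoots() {
        return roots;
    }

    public static void main(String[] args) {
        MarkedContentNestingSketch sketch = new MarkedContentNestingSketch();
        sketch.begin("P");
        sketch.begin("Span");
        sketch.end();
        sketch.end();
        sketch.begin("Figure");
        sketch.end();
        for (Node root : sketch.getRoots()) {
            System.out.println(root.tag + " has " + root.children.size() + " nested sequence(s)");
        }
    }
}
```

The stack keeps track of which sequence is innermost, so content seen between a begin and its matching end can be attributed to the right group.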
-
xobject
public void xobject(PDXObject xobject)
-
processTextPosition
protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
- Parameters:
text - The text to process.
-
getMarkedContents
public java.util.List<PDMarkedContent> getMarkedContents()
-
processPage
public void processPage(PDPage page) throws java.io.IOException
This will initialize and process the contents of the stream.
- Overrides:
processPage in class PDFStreamEngine
- Parameters:
page - the page to process
- Throws:
java.io.IOException - if there is an error accessing the stream.
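A typical use of this class is to process a page and then read back the collected marked content. The following is a minimal sketch of that flow, assuming the PDFBox 2.0.x API (`PDDocument.load(File)`, `PDMarkedContent.getTag()` and `PDMarkedContent.getContents()`). The class name `MarkedContentDump`, the helper `extract` and the default path `tagged.pdf` are placeholders introduced for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import org.apache.pdfbox.text.PDFMarkedContentExtractor;

// Illustrative helper, not part of PDFBox.
class MarkedContentDump {

    // Runs a fresh extractor over one page and returns what it collected.
    static List<PDMarkedContent> extract(PDPage page) throws IOException {
        PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
        extractor.processPage(page);
        return extractor.getMarkedContents();
    }

    public static void main(String[] args) throws IOException {
        // "tagged.pdf" is a placeholder; pass a real path as the first argument.
        File file = new File(args.length > 0 ? args[0] : "tagged.pdf");
        try (PDDocument document = PDDocument.load(file)) {
            int pageNumber = 1;
            for (PDPage page : document.getPages()) {
                for (PDMarkedContent content : extract(page)) {
                    System.out.println("page " + pageNumber
                            + ": tag=" + content.getTag()
                            + ", items=" + content.getContents().size());
                }
                pageNumber++;
            }
        }
    }
}
```

Creating a new extractor for each page keeps the result lists separate per page, which avoids depending on whether one instance accumulates content across several `processPage` calls.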
-
showGlyph
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, java.lang.String unicode, Vector displacement) throws java.io.IOException
This method was originally written by Ben Litchfield for PDFStreamEngine.
- Overrides:
showGlyph in class PDFStreamEngine
- Parameters:
textRenderingMatrix - the current text rendering matrix, Trm
font - the current font
code - internal PDF character code for the glyph
unicode - the Unicode text for this glyph, or null if the PDF does not provide it
displacement - the displacement (i.e. advance) of the glyph in text space
- Throws:
java.io.IOException - if the glyph cannot be processed
-