PdfExtractor

Inheritance: java.lang.Object, com.aspose.pdf.facades.IVentureLicenseTarget, com.aspose.pdf.facades.Facade

public final class PdfExtractor extends Facade

Class for extracting images and text from PDF document.

Constructors

ConstructorDescription
PdfExtractor()Initializes new PdfExtractor object.
PdfExtractor(IDocument document)Initializes new PdfExtractor object on base of the document .

Methods

MethodDescription
getStartPage()Gets start page in the page range where extracting operation will be performed.
setStartPage(int value)Sets start page in the page range where extracting operation will be performed.
getEndPage()Gets end page in the page range where extracting operation will be performed.
setEndPage(int value)Sets end page in the page range where extracting operation will be performed.
getExtractTextMode()Gets the mode for extract text’s result.
setExtractTextMode(int value)Sets the mode for extract text’s result.
getTextSearchOptions()Gets text search options.
setTextSearchOptions(TextSearchOptions value)Sets text search options.
getExtractImageMode()Sets the mode for extract images process.
setExtractImageMode(int value)Sets the mode for extract images process.
isBidi()Is true when text has hebriew or arabic symbols.
extractText()Extracts text from a Pdf document.
extractText(Charset encoding)Extracts text from a Pdf document using specified encoding.
extractTextInternal(TextEncodingInternal encoding)For Internal usage only
getText(String outputFile)Saves text to file.
getText(OutputStream outputStream)Saves text to stream.
bindPdf(String inputFile)Bind input PDF file.
bindPdf(InputStream inputStream)Binds PDF document from stream.
extractImage()Extract images from PDF file.
hasNextImage()Checks if more images are accessible in PDF document.
getNextImage(String outputFile)Retreives next image from PDF document.
getNextImage(String outputFile, ImageType format)Retreives next image from PDF document with given image format.
getNextImage(OutputStream outputStream, ImageType format)Retreive next image from PDF file and stores it into stream with given image format.
getNextImage(OutputStream outputStream)Retreive next image from PDF file and stores it into stream.
getAttachNames()Returns list of attachments in PDF file.
extractAttachment()Extracts attachments from a Pdf document.
extractAttachment(String attachmentFileName)Extracts attachment to PDF file by attachment name.
getAttachment(String outputPath)Stores attachment into file.
hasNextPageText()Indicates that whether can get more texts or not.
getNextPageText(String outputFile)Saves one page’s text to file.
getNextPageText(OutputStream outputStream)Saves one page’s text to stream.
getText(OutputStream outputStream, boolean filterNotAscii)Saves text to stream.
getAttachment()Saves all the attachment file to streams.
getAttachmentInfo()Gets the list of attachments.
getResolution()Gets resolution for extracted images.
setResolution(int value)Set resolution for extracted images.
getPassword()Gets input file’s password.
setPassword(String value)Sets input file’s password.
extractMarkedContentAsImages(Page page, String path)Gets all the Marked Content containers as separate images.

PdfExtractor()

public PdfExtractor()

Initializes new PdfExtractor object.

PdfExtractor(IDocument document)

public PdfExtractor(IDocument document)

Initializes new PdfExtractor object on base of the document .

Parameters:

ParameterTypeDescription
documentIDocumentPdf document.

getStartPage()

public int getStartPage()

Gets start page in the page range where extracting operation will be performed.


PdfExtractor ext = new PdfExtractor();
 ext.bindBdf("sample.pdf");
 ext.setStartPage(2);
 ext.setEndPage(5);
 ext.extractText();

Returns: int - start page in the page range.

setStartPage(int value)

public void setStartPage(int value)

Sets start page in the page range where extracting operation will be performed.


PdfExtractor ext = new PdfExtractor();
 ext.bindBdf("sample.pdf");
 ext.setStartPage(2);
 ext.setEndPage(5);
 ext.extractText();

Parameters:

ParameterTypeDescription
valueintstart page in the page range.

getEndPage()

public int getEndPage()

Gets end page in the page range where extracting operation will be performed.


PdfExtractor ext = new PdfExtractor();
 ext.bindBdf("sample.pdf");
 ext.setStartPage(2);
 ext.setEndPage(3);
 ext.extractText();

Returns: int - end page.

setEndPage(int value)

public void setEndPage(int value)

Sets end page in the page range where extracting operation will be performed.


PdfExtractor ext = new PdfExtractor();
 ext.bindBdf("sample.pdf");
 ext.setStartPage(2);
 ext.setEndPage(3);
 ext.extractText();

Parameters:

ParameterTypeDescription
valueintend page.

getExtractTextMode()

public int getExtractTextMode()

Gets the mode for extract text’s result.


The example demonstratres the ```
ExtractTextMode
``` property usage in text extraction scenario.


  PdfExtractor extractor = new PdfExtractor();
 	extractor.bindPdf(@"D:\Text\text.pdf");
  extractor.setExtractTextMode(1);
 	extractor.extractText();
 	extractor.getText(@"D:\Text\text.txt");

Value: 0 is pure text mode and 1 is raw ordering mode. Default is 0.

Returns: int - extract text’s result.

setExtractTextMode(int value)

public void setExtractTextMode(int value)

Sets the mode for extract text’s result.


The example demonstratres the ```
ExtractTextMode
``` property usage in text extraction scenario.


  PdfExtractor extractor = new PdfExtractor();
 	extractor.bindPdf(@"D:\Text\text.pdf");
  extractor.setExtractTextMode(1);
 	extractor.extractText();
 	extractor.getText(@"D:\Text\text.txt");

Value: 0 is pure text mode and 1 is raw ordering mode. Default is 0.

Parameters:

ParameterTypeDescription
valueintextract text’s result.

getTextSearchOptions()

public TextSearchOptions getTextSearchOptions()

Gets text search options.

Returns: TextSearchOptions - text search options.

setTextSearchOptions(TextSearchOptions value)

public void setTextSearchOptions(TextSearchOptions value)

Sets text search options.

Parameters:

ParameterTypeDescription
valueTextSearchOptionstext search options.

getExtractImageMode()

public int getExtractImageMode()

Sets the mode for extract images process.


Default value is ExtractImageMode.DefinedInResources that extracts all images defined in resources. To extract actually shown images ExtractImageMode.ActuallyUsed mode should be used.

Returns: int - ExtractImageMode value

setExtractImageMode(int value)

public void setExtractImageMode(int value)

Sets the mode for extract images process.


Default value is ExtractImageMode.DefinedInResources that extracts all images defined in resources. To extract actually shown images ExtractImageMode.ActuallyUsed mode should be used.

Parameters:

ParameterTypeDescription
valueintExtractImageMode value

isBidi()

public boolean isBidi()

Is true when text has hebriew or arabic symbols. This case must be specially considered because string functions change their behaviour and start process text from right to left (except numbers and other non text chars).

Returns: boolean - boolean value

extractText()

public void extractText()

Extracts text from a Pdf document.


First example demonstratres how to extract all the text from PDF file.


  PdfExtractor extractor = new PdfExtractor();
 	extractor.bindPdf("D:\Text\text.pdf");
 	extractor.extractText();
 	extractor.getText("D:\Text\text.txt");

Second example demonstratres how to extract each page’s text into one txt file.

PdfExtractor extractor = new PdfExtractor();
  extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
  extractor.extractText();
  String prefix = TestPath + "Aspose.Pdf.Kit";
  String suffix = ".txt";
  int pageCount = 1;
  while (extractor.hasNextPageText())
  {
      extractor.getNextPageText(prefix + pageCount + suffix);
      pageCount++;
  }

extractText(Charset encoding)

public void extractText(Charset encoding)

Extracts text from a Pdf document using specified encoding.


First example demonstrates how to extract all the text from PDF file.


  PdfExtractor extractor = new PdfExtractor();
 	extractor.bindPdf("D:\\Text\\text.pdf");
 	extractor.extractText(Encoding.Unicode);
 	extractor.getText("D:\\Text\\text.txt");

Second example demonstrates how to extract each page’s text into one txt file.

PdfExtractor extractor = new PdfExtractor();
  extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
  extractor.extractText(java.nio.charset.Charset.forName("UTF-8"));
  String prefix = TestPath + "Aspose.Pdf.Kit";
  String suffix = ".txt";
  int pageCount = 1;
  while (extractor.hasNextPageText())
  {
      extractor.getNextPageText(prefix + pageCount + suffix);
      pageCount++;
  }

Parameters:

ParameterTypeDescription
encodingjava.nio.charset.CharsetThe encoding of the extracted text.

extractTextInternal(TextEncodingInternal encoding)

public void extractTextInternal(TextEncodingInternal encoding)

For Internal usage only

Parameters:

ParameterTypeDescription
encodingTextEncodingInternalThe encoding of the extracted text.

getText(String outputFile)

public void getText(String outputFile)

Saves text to file. see also: ExtractText

Parameters:

ParameterTypeDescription
outputFilejava.lang.StringThe file path and name to save the text.

getText(OutputStream outputStream)

public void getText(OutputStream outputStream)

Saves text to stream. see also: ExtractText

Parameters:

ParameterTypeDescription
outputStreamjava.io.OutputStreamThe stream to save the text.

bindPdf(String inputFile)

public void bindPdf(String inputFile)

Bind input PDF file.


PdfExtractor ext = new PdfExtractor();
 ext.bindPdf("sample.pdf");

Parameters:

ParameterTypeDescription
inputFilejava.lang.StringPDF fiel to bind

bindPdf(InputStream inputStream)

public void bindPdf(InputStream inputStream)

Binds PDF document from stream.


PdfExtractor ext = new PdfExtractor();
 InputStream stream = new FileInputStream("sample.pdf");
 ext.bindPdf(stream);

Parameters:

ParameterTypeDescription
inputStreamjava.io.InputStreamStream containing PDF document data

extractImage()

public void extractImage()

Extract images from PDF file.


PdfExtractor extractor = new PdfExtractor();
 	extractor.bindPdf("sample.pdf");
 	extractor.extractImage();
 	int i = 1;
 	while (extractor.HasNextImage())
    {
 	    extractor.getNextImage("image-" + i +".pdf");
    }

hasNextImage()

public boolean hasNextImage()

Checks if more images are accessible in PDF document. Note: ExtractImage must be called before using of this method.


PdfExtractor extractor = new PdfExtractor();
 	extractor.bindPdf("sample.pdf");
 	extractor.extractImage();
 	int i = 1;
 	while (extractor.hasNextImage())
    {
 	    extractor.getNextImage("image-" + i +".pdf");
    }

Returns: boolean - Trues if more images are accessible

getNextImage(String outputFile)

public boolean getNextImage(String outputFile)

Retreives next image from PDF document. Note: ExtractImage must be called before using of this method.


PdfExtractor extractor = new PdfExtractor();
 	extractor.bindPdf("sample.pdf");
 	extractor.extractImage();
 	int i = 1;
 	while (extractor.hasNextImage())
    {
 	    extractor.getNextImage("image-" + i +".pdf");
    }

Parameters:

ParameterTypeDescription
outputFilejava.lang.StringFile where image will be stored

Returns: boolean - True is image is successfully extracted

getNextImage(String outputFile, ImageType format)

public boolean getNextImage(String outputFile, ImageType format)

Retreives next image from PDF document with given image format. Note: ExtractImage must be called before using of this method.

Parameters:

ParameterTypeDescription
outputFilejava.lang.StringFile where image will be stored
formatImageTypeImageType element

Returns: boolean - True is image is successfully extracted

getNextImage(OutputStream outputStream, ImageType format)

public boolean getNextImage(OutputStream outputStream, ImageType format)

Retreive next image from PDF file and stores it into stream with given image format.

Parameters:

ParameterTypeDescription
outputStreamjava.io.OutputStreamStream where image data will be saved
formatImageTypeThe format of the image.

Returns: boolean - True in case the image is successfully extracted.

getNextImage(OutputStream outputStream)

public boolean getNextImage(OutputStream outputStream)

Retreive next image from PDF file and stores it into stream.

Parameters:

ParameterTypeDescription
outputStreamjava.io.OutputStreamStream where image data will be saved

Returns: boolean - True in case the image is successfully extracted.

getAttachNames()

public List<String> getAttachNames()

Returns list of attachments in PDF file. Note: ExtractAttachments must be called befor using this method.


Example demonstrates how to extract attachment names form PDF file.


  PdfExtractor extractor = new PdfExtractor();
 	extractor.bindPdf(TestSettings.GetInputFile("sample.pdf"));
 	extractor.ExtractAttachment();
 	List attachments = extractor.getAttachNames();
 	for (String name : ```
(Iterable)
```attachments)
 		System.out.println(name);

Returns: java.util.List<java.lang.String> - List of attachments

extractAttachment()

public void extractAttachment()

Extracts attachments from a Pdf document.

extractAttachment(String attachmentFileName)

public void extractAttachment(String attachmentFileName)

Extracts attachment to PDF file by attachment name.

Parameters:

ParameterTypeDescription
attachmentFileNamejava.lang.StringName of attachment to extract

getAttachment(String outputPath)

public void getAttachment(String outputPath)

Stores attachment into file.

Parameters:

ParameterTypeDescription
outputPathjava.lang.StringDirectory path where attachment(s) will be stored. Null or empty string means attachment(s) will be placed in the application directory.

hasNextPageText()

public boolean hasNextPageText()

Indicates that whether can get more texts or not.


The example demonstratres the ```
HasNextPageText
``` property usage in text extraction scenario.


  PdfExtractor extractor = new PdfExtractor();
  extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
  extractor.extractText(Encoding.Unicode);
  String prefix = TestPath + "Aspose.Pdf.Kit";
  String suffix = ".txt";
  int pageCount = 1;
  while (extractor.hasNextPageText())
  {
      extractor.getNextPageText(prefix + pageCount + suffix);
      pageCount++;
  }

Returns: boolean - Can get more texts or not, true is can, or false.

getNextPageText(String outputFile)

public void getNextPageText(String outputFile)

Saves one page’s text to file.


The example demonstratres the GetNextPageText method usage in text extraction scenario.


  PdfExtractor extractor = new PdfExtractor();
  extractor.bindPdf(TestPath + @"Aspose.Pdf.Kit.Pdf");
  extractor.extractText(Encoding.Unicode);
  String prefix = TestPath + @"Aspose.Pdf.Kit";
  String suffix = ".txt";
  int pageCount = 1;
  while (extractor.hasNextPageText())
  {
      extractor.getNextPageText(prefix + pageCount + suffix);
      pageCount++;
  }

Parameters:

ParameterTypeDescription
outputFilejava.lang.StringThe file path and name to save the text.

getNextPageText(OutputStream outputStream)

public void getNextPageText(OutputStream outputStream)

Saves one page’s text to stream.


The example demonstratres the ```
GetNextPageText
``` method usage in text extraction scenario.


  PdfExtractor extractor = new PdfExtractor();
  extractor.bindPdf(TestPath + @"Aspose.Pdf.Kit.Pdf");
  extractor.extractText(Encoding.Unicode);
  String prefix = TestPath + "Aspose.Pdf.Kit";
  String suffix = ".txt";
  int pageCount = 1;
  while (extractor.hasNextPageText())
  {
      FileInputStream fs = new FileInputStream(prefix + pageCount + suffix, FileMode.Create);
      extractor.getNextPageText(fs);
      fs.close();
      pageCount++;
  }

Parameters:

ParameterTypeDescription
outputStreamjava.io.OutputStreamThe stream to save the text.

getText(OutputStream outputStream, boolean filterNotAscii)

public void getText(OutputStream outputStream, boolean filterNotAscii)

Saves text to stream. see also: ExtractText

Parameters:

ParameterTypeDescription
outputStreamjava.io.OutputStreamThe stream to save the text.
filterNotAsciibooleanIf this parameter is true all Not ASCII simbols will be removed

getAttachment()

public ByteArrayOutputStream[] getAttachment()

Saves all the attachment file to streams.


PdfExtractor extractor = new PdfExtractor();
 	extractor.bindPdf(path + "Attach.pdf");
 	extractor.extractAttachment();
 	IList names = extractor.getAttachNames();
 	ByteArrayOutputStream[] tempStreams =  extractor.getAttachment();
 	for (int i=0; i<tempStreams.Length; i++)
    {
 		string name = (string)names[i];
 		OutputStream fs = new FileOutputStream(path + name);
 		fs.write(tempStreams[i].toByteArray());
 		fs.close();
    }

Returns: java.io.ByteArrayOutputStream[] - The stream array of the attachment file in the pdf document.

getAttachmentInfo()

public List<FileSpecification> getAttachmentInfo()

Gets the list of attachments.

Returns: java.util.List<com.aspose.pdf.FileSpecification> - Returns an List.

getResolution()

public int getResolution()

Gets resolution for extracted images. Default value is 150. Images which have greater resolution value are more clear. However increasing resolution value results in increasing time and memory needed to extract images. Usually to get clear image it’s enough to set resolution to 150 or 300.

Returns: int - int value

setResolution(int value)

public void setResolution(int value)

Set resolution for extracted images. Default value is 150. Images which have greater resolution value are more clear. However increasing resolution value results in increasing time and memory needed to extract images. Usually to get clear image it’s enough to set resolution to 150 or 300.

Parameters:

ParameterTypeDescription
valueintint value

getPassword()

public String getPassword()

Gets input file’s password.

Returns: java.lang.String - String value

setPassword(String value)

public void setPassword(String value)

Sets input file’s password.

Parameters:

ParameterTypeDescription
valuejava.lang.StringString value

extractMarkedContentAsImages(Page page, String path)

public void extractMarkedContentAsImages(Page page, String path)

Gets all the Marked Content containers as separate images.

Every Marked Content will be saved as image with png format named with MCID_.png

Parameters:

ParameterTypeDescription
pagePagePage for process.
pathjava.lang.StringThe path where images will be saved.