com.aspose.pdf.facades

Class PdfExtractor

  • All Implemented Interfaces:
    com.aspose.ms.System.IDisposable, IFacade


    public final class PdfExtractor
    extends Facade

    Class for extracting images and text from a PDF document.

    • Constructor Detail

      • PdfExtractor

        public PdfExtractor()

        Initializes a new PdfExtractor object.

      • PdfExtractor

        public PdfExtractor(IDocument document)

        Initializes a new PdfExtractor object based on the given document.

        Parameters:
        document - PDF document.
    • Method Detail

      • getStartPage

        public int getStartPage()

        Gets the start page of the page range in which the extraction operation will be performed.

          PdfExtractor ext = new PdfExtractor();
          ext.bindPdf("sample.pdf");
          ext.setStartPage(2);
          ext.setEndPage(5);
          ext.extractText();

        Returns:
        start page in the page range.
      • setStartPage

        public void setStartPage(int value)

        Sets the start page of the page range in which the extraction operation will be performed.

          PdfExtractor ext = new PdfExtractor();
          ext.bindPdf("sample.pdf");
          ext.setStartPage(2);
          ext.setEndPage(5);
          ext.extractText();

        Parameters:
        value - start page in the page range.
      • getEndPage

        public int getEndPage()

        Gets the end page of the page range in which the extraction operation will be performed.

          PdfExtractor ext = new PdfExtractor();
          ext.bindPdf("sample.pdf");
          ext.setStartPage(2);
          ext.setEndPage(3);
          ext.extractText();

        Returns:
        end page.
      • setEndPage

        public void setEndPage(int value)

        Sets the end page of the page range in which the extraction operation will be performed.

          PdfExtractor ext = new PdfExtractor();
          ext.bindPdf("sample.pdf");
          ext.setStartPage(2);
          ext.setEndPage(3);
          ext.extractText();

        Parameters:
        value - end page.
      • getExtractTextMode

        public int getExtractTextMode()

        Gets the mode used for the text extraction result.

        The example demonstrates the use of the extract text mode in a text extraction scenario.

          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf("D:\\Text\\text.pdf");
          // 1 selects raw ordering mode
          extractor.setExtractTextMode(1);
          extractor.extractText();
          extractor.getText("D:\\Text\\text.txt");

        Value: 0 is pure text mode and 1 is raw ordering mode. Default is 0.
        Returns:
        the text extraction mode.
      • setExtractTextMode

        public void setExtractTextMode(int value)

        Sets the mode used for the text extraction result.

        The example demonstrates the use of the extract text mode in a text extraction scenario.

          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf("D:\\Text\\text.pdf");
          // 1 selects raw ordering mode
          extractor.setExtractTextMode(1);
          extractor.extractText();
          extractor.getText("D:\\Text\\text.txt");

        Value: 0 is pure text mode and 1 is raw ordering mode. Default is 0.
        Parameters:
        value - the text extraction mode.
      • getTextSearchOptions

        public TextSearchOptions getTextSearchOptions()

        Gets text search options.

        Returns:
        text search options.
      • setTextSearchOptions

        public void setTextSearchOptions(TextSearchOptions value)

        Sets text search options.

        Parameters:
        value - text search options.
      • getExtractImageMode

        public int getExtractImageMode()

        Gets the mode for the image extraction process.

        The default value is ExtractImageMode.DefinedInResources, which extracts all images defined in resources. To extract only the images that are actually shown, use ExtractImageMode.ActuallyUsed.
        Returns:
        ExtractImageMode value
        See Also:
        ExtractImageMode
      • setExtractImageMode

        public void setExtractImageMode(int value)

        Sets the mode for the image extraction process.

        The default value is ExtractImageMode.DefinedInResources, which extracts all images defined in resources. To extract only the images that are actually shown, use ExtractImageMode.ActuallyUsed.
        Parameters:
        value - ExtractImageMode value
        See Also:
        ExtractImageMode
      • isBidi

        public boolean isBidi()

        Returns true when the text contains Hebrew or Arabic symbols. This case needs special handling because string functions change their behaviour and process the text from right to left (except numbers and other non-text characters).

        Returns:
        true if the text contains Hebrew or Arabic symbols
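
        As an illustration of what "text has Hebrew or Arabic symbols" can mean in practice, the following is a minimal sketch that uses only the Java standard library. It is not how PdfExtractor computes isBidi(); the class name RtlTextCheck and the method containsRtl are hypothetical. Digits are not counted, which is in line with the note that numbers are not processed from right to left.

```java
// Hypothetical helper, not part of the Aspose API.
final class RtlTextCheck {
    private RtlTextCheck() {}

    /**
     * Returns true if at least one code point in the text has a strong
     * right-to-left directionality (for example Hebrew or Arabic letters).
     */
    static boolean containsRtl(String text) {
        return text.codePoints().anyMatch(cp -> {
            byte dir = Character.getDirectionality(cp);
            return dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        });
    }
}
```

        A caller could run such a check on text it has already extracted, for instance before applying string operations that assume left-to-right order.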
      • extractText

        public void extractText()

        Extracts text from a PDF document.

        The first example demonstrates how to extract all the text from a PDF file.

          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf("D:\\Text\\text.pdf");
          extractor.extractText();
          extractor.getText("D:\\Text\\text.txt");

        The second example demonstrates how to extract each page's text into its own txt file.

          // TestPath is a directory prefix defined elsewhere
          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
          extractor.extractText();
          String prefix = TestPath + "Aspose.Pdf.Kit";
          String suffix = ".txt";
          int pageCount = 1;
          while (extractor.hasNextPageText())
          {
              extractor.getNextPageText(prefix + pageCount + suffix);
              pageCount++;
          }

      • extractText

        public void extractText(Charset encoding)

        Extracts text from a PDF document using the specified encoding.

        The first example demonstrates how to extract all the text from a PDF file.

          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf("D:\\Text\\text.pdf");
          // UTF-16LE is the Java counterpart of .NET's Encoding.Unicode
          extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
          extractor.getText("D:\\Text\\text.txt");

        The second example demonstrates how to extract each page's text into its own txt file.

          // TestPath is a directory prefix defined elsewhere
          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
          extractor.extractText(java.nio.charset.Charset.forName("UTF-8"));
          String prefix = TestPath + "Aspose.Pdf.Kit";
          String suffix = ".txt";
          int pageCount = 1;
          while (extractor.hasNextPageText())
          {
              extractor.getNextPageText(prefix + pageCount + suffix);
              pageCount++;
          }

        Parameters:
        encoding - The encoding of the extracted text.
      • extractTextInternal

        public void extractTextInternal(TextEncodingInternal encoding)
        For Internal usage only
        Parameters:
        encoding - The encoding of the extracted text.
      • getText

        public void getText(String outputFile)

        Saves the extracted text to a file. See also: extractText

        Parameters:
        outputFile - The file path and name to save the text.
      • getText

        public void getText(OutputStream outputStream)

        Saves the extracted text to a stream. See also: extractText

        Parameters:
        outputStream - The stream to save the text.
      • bindPdf

        public void bindPdf(String inputFile)

        Binds the input PDF file.


         
         
         PdfExtractor ext = new PdfExtractor();
         ext.bindPdf("sample.pdf");
         
        Specified by:
        bindPdf in interface IFacade
        Overrides:
        bindPdf in class Facade
        Parameters:
        inputFile - PDF file to bind
      • bindPdf

        public void bindPdf(InputStream inputStream)

        Binds PDF document from stream.


         
         
         PdfExtractor ext = new PdfExtractor();
         InputStream stream = new FileInputStream("sample.pdf");
         ext.bindPdf(stream);
         
        Specified by:
        bindPdf in interface IFacade
        Overrides:
        bindPdf in class Facade
        Parameters:
        inputStream - Stream containing PDF document data
      • extractImage

        public void extractImage()

        Extracts images from the PDF file.

          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf("sample.pdf");
          extractor.extractImage();
          int i = 1;
          while (extractor.hasNextImage())
          {
              extractor.getNextImage("image-" + i + ".pdf");
              // advance the counter so each image gets its own file
              i++;
          }

      • hasNextImage

        public boolean hasNextImage()

        Checks whether more images are accessible in the PDF document. Note: extractImage must be called before using this method.

          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf("sample.pdf");
          extractor.extractImage();
          int i = 1;
          while (extractor.hasNextImage())
          {
              extractor.getNextImage("image-" + i + ".pdf");
              // advance the counter so each image gets its own file
              i++;
          }

        Returns:
        True if more images are accessible
      • getNextImage

        public boolean getNextImage(String outputFile)

        Retrieves the next image from the PDF document. Note: extractImage must be called before using this method.

          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf("sample.pdf");
          extractor.extractImage();
          int i = 1;
          while (extractor.hasNextImage())
          {
              extractor.getNextImage("image-" + i + ".pdf");
              // advance the counter so each image gets its own file
              i++;
          }

        Parameters:
        outputFile - File where the image will be stored
        Returns:
        True if the image is successfully extracted
      • getNextImage

        public boolean getNextImage(String outputFile,
                                    ImageType format)

        Retrieves the next image from the PDF document in the given image format. Note: extractImage must be called before using this method.

        Parameters:
        outputFile - File where the image will be stored
        format - ImageType element
        Returns:
        True if the image is successfully extracted
      • getNextImage

        public boolean getNextImage(OutputStream outputStream,
                                    ImageType format)

        Retrieves the next image from the PDF file and stores it in a stream in the given image format.

        Parameters:
        outputStream - Stream where image data will be saved
        format - The format of the image.
        Returns:
        True in case the image is successfully extracted.
      • getNextImage

        public boolean getNextImage(OutputStream outputStream)

        Retrieves the next image from the PDF file and stores it in a stream.

        Parameters:
        outputStream - Stream where image data will be saved
        Returns:
        True in case the image is successfully extracted.
      • getAttachNames

        public List<String> getAttachNames()

        Returns the list of attachment names in the PDF file. Note: extractAttachment must be called before using this method.

        The example demonstrates how to extract attachment names from a PDF file.

          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf("sample.pdf");
          extractor.extractAttachment();
          List<String> attachments = extractor.getAttachNames();
          for (String name : attachments)
              System.out.println(name);

        Returns:
        List of attachment names
      • extractAttachment

        public void extractAttachment()
        Extracts attachments from a Pdf document.
      • extractAttachment

        public void extractAttachment(String attachmentFileName)

        Extracts the attachment with the given name from the PDF file.

        Parameters:
        attachmentFileName - Name of attachment to extract
      • getAttachment

        public void getAttachment(String outputPath)

        Stores attachment into file.

        Parameters:
        outputPath - Directory path where attachment(s) will be stored. Null or empty string means attachment(s) will be placed in the application directory.
      • hasNextPageText

        public boolean hasNextPageText()

        Indicates whether more page text is available.

        The example demonstrates the use of the hasNextPageText method in a text extraction scenario.

          // TestPath is a directory prefix defined elsewhere
          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
          // UTF-16LE is the Java counterpart of .NET's Encoding.Unicode
          extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
          String prefix = TestPath + "Aspose.Pdf.Kit";
          String suffix = ".txt";
          int pageCount = 1;
          while (extractor.hasNextPageText())
          {
              extractor.getNextPageText(prefix + pageCount + suffix);
              pageCount++;
          }

        Returns:
        true if more page text can be retrieved, false otherwise.
      • getNextPageText

        public void getNextPageText(String outputFile)

        Saves one page's text to a file.

        The example demonstrates the use of the getNextPageText method in a text extraction scenario.

          // TestPath is a directory prefix defined elsewhere
          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
          // UTF-16LE is the Java counterpart of .NET's Encoding.Unicode
          extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
          String prefix = TestPath + "Aspose.Pdf.Kit";
          String suffix = ".txt";
          int pageCount = 1;
          while (extractor.hasNextPageText())
          {
              extractor.getNextPageText(prefix + pageCount + suffix);
              pageCount++;
          }

        Parameters:
        outputFile - The file path and name to save the text.
      • getNextPageText

        public void getNextPageText(OutputStream outputStream)

        Saves one page's text to a stream.

        The example demonstrates the use of the getNextPageText method in a text extraction scenario.

          // TestPath is a directory prefix defined elsewhere
          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
          // UTF-16LE is the Java counterpart of .NET's Encoding.Unicode
          extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
          String prefix = TestPath + "Aspose.Pdf.Kit";
          String suffix = ".txt";
          int pageCount = 1;
          while (extractor.hasNextPageText())
          {
              // open an output stream for this page's text file
              OutputStream fs = new FileOutputStream(prefix + pageCount + suffix);
              extractor.getNextPageText(fs);
              fs.close();
              pageCount++;
          }

        Parameters:
        outputStream - The stream to save the text.
      • getText

        public void getText(OutputStream outputStream,
                            boolean filterNotAscii)

        Saves the extracted text to a stream. See also: extractText

        Parameters:
        outputStream - The stream to save the text.
        filterNotAscii - If this parameter is true, all non-ASCII symbols will be removed
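
        To make the effect of removing non-ASCII symbols concrete, here is a minimal standard-library sketch. It only illustrates the general idea; the exact rules PdfExtractor applies when filterNotAscii is true are not described here, and the class name AsciiFilter and method stripNonAscii are hypothetical.

```java
// Hypothetical helper, not part of the Aspose API.
final class AsciiFilter {
    private AsciiFilter() {}

    /** Returns the text with every character outside the 7-bit ASCII range (0..127) dropped. */
    static String stripNonAscii(String text) {
        StringBuilder result = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                result.append(c);
            }
        }
        return result.toString();
    }
}
```

        Note that under this assumption accented letters are removed rather than replaced by their unaccented forms, so "caf\u00E9" becomes "caf".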
      • getAttachment

        public ByteArrayOutputStream[] getAttachment()

        Saves all the attachment files to streams.

          // path is a directory prefix defined elsewhere
          PdfExtractor extractor = new PdfExtractor();
          extractor.bindPdf(path + "Attach.pdf");
          extractor.extractAttachment();
          List<String> names = extractor.getAttachNames();
          ByteArrayOutputStream[] tempStreams = extractor.getAttachment();
          for (int i = 0; i < tempStreams.length; i++)
          {
              String name = names.get(i);
              // write each attachment's bytes to a file named after the attachment
              OutputStream fs = new FileOutputStream(path + name);
              fs.write(tempStreams[i].toByteArray());
              fs.close();
          }

        Returns:
        The array of streams holding the attachment files of the PDF document.
      • getAttachmentInfo

        public List<FileSpecification> getAttachmentInfo()

        Gets the list of attachments.

        Returns:
        A List<FileSpecification>.
      • getResolution

        public int getResolution()

        Gets the resolution for extracted images. The default value is 150. Images with a higher resolution are clearer, but increasing the resolution also increases the time and memory needed to extract them. A resolution of 150 or 300 is usually enough to get a clear image.

        Returns:
        int value
      • setResolution

        public void setResolution(int value)

        Sets the resolution for extracted images. The default value is 150. Images with a higher resolution are clearer, but increasing the resolution also increases the time and memory needed to extract them. A resolution of 150 or 300 is usually enough to get a clear image.

        Parameters:
        value - int value
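
        The trade-off between resolution and memory can be estimated with simple arithmetic. The sketch below assumes the resolution value is expressed in dots per inch, which is not stated above, and uses a hypothetical image size of 8.5 x 11 inches; the class ResolutionEstimate is illustrative only and not part of the Aspose API.

```java
// Hypothetical helper, not part of the Aspose API.
final class ResolutionEstimate {
    private ResolutionEstimate() {}

    /** Number of pixels in an image of the given physical size rendered at the given DPI. */
    static long pixelCount(double widthInches, double heightInches, int dpi) {
        long widthPx = Math.round(widthInches * dpi);
        long heightPx = Math.round(heightInches * dpi);
        return widthPx * heightPx;
    }

    /** Rough uncompressed size in bytes, given a number of bytes per pixel (for example 3 for RGB). */
    static long uncompressedBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel) {
        return pixelCount(widthInches, heightInches, dpi) * bytesPerPixel;
    }
}
```

        Under these assumptions, going from 150 to 300 doubles each dimension (1275 x 1650 pixels versus 2550 x 3300), so the pixel count and the uncompressed size grow by a factor of four.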
      • getPassword

        public String getPassword()

        Gets input file's password.

        Returns:
        String value
      • setPassword

        public void setPassword(String value)

        Sets input file's password.

        Parameters:
        value - String value
      • extractMarkedContentAsImages

        public void extractMarkedContentAsImages(Page page,
                                                 String path)

        Extracts all the Marked Content containers on a page as separate images.

        Each Marked Content container is saved as a PNG image named MCID_<ID number of block for the page>.png
        Parameters:
        page - Page to process.
        path - The path where the images will be saved.
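
        To locate the generated files afterwards, a caller can rebuild the expected file names from the naming pattern above. The following is a minimal sketch, assuming the ID is written as a plain decimal number without padding (the text above does not say); the class MarkedContentNames and its method are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the Aspose API.
final class MarkedContentNames {
    private MarkedContentNames() {}

    /** Builds the expected image path for a marked content ID inside the output directory. */
    static Path imagePath(String directory, int mcid) {
        return Paths.get(directory, "MCID_" + mcid + ".png");
    }
}
```

        For example, imagePath("out", 7) points at a file named MCID_7.png inside the out directory.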