2.2.4 Graphics stream parsing proof of concept

The script shows a simple example of reading in a PDF, and using the decodegraphics.py module to try to write the same information out to a new PDF through a reportlab canvas. (If you know about reportlab, you know that if you can faithfully render a PDF to a reportlab canvas, you can do pretty much anything else with that PDF you want.) This kind of low-level manipulation should be done only if you really need to. decodegraphics is really more of a proof of concept than anything else. For most cases, just use the Form XObject capability, as shown in the examples/rl1/booklet.py demo.
Free download PDF Split and Merge 2.2.2 for Windows 10. PDF Split and Merge is a free and very useful application that lets you split and merge PDF files. Features: multiple PDF selection in the merge section; page subset selection in the merge section. Requires Java Runtime E. PDF Split and Merge Basic is an open source tool (GPL license) designed to handle .pdf files. It requires a Java Virtual Machine 1.4.2 or higher and is released in two versions, basic and enhanced. A simple tool designed to split and merge PDF files.
3.1 Core library

The philosophy of the library portion of pdfrw is to provide intuitive functions to read, manipulate, and write PDF files. There should be minimal leakage between abstraction layers, although getting useful work done makes “pure” functionality separation difficult.

A key concept supported by the library is the use of Form XObjects, which allow easy embedding of pieces of one PDF into another.

Addition of core support to the library is typically done carefully and thoughtfully, so as not to clutter it up with too many special cases.

There are a lot of incorrectly formatted PDFs floating around; support for these is added in some cases. The decision is often based on what acroread and okular do with the PDFs; if they can display them properly, then eventually pdfrw should, too, if it is not too difficult or costly.

Contributions are welcome; one user has contributed some decompression filters and the ability to process PDF 1.5 stream objects.
Additional functionality that would obviously be useful includes additional decompression filters, the ability to process password-protected PDFs, and the ability to output linearized PDFs.

4.2 Difficulties

The apparent primary difficulty in mapping PDF files to Python is the PDF file concept of “indirect objects.” Indirect objects provide the efficiency of allowing a single piece of data to be referred to from more than one containing object, but probably more importantly, indirect objects provide a way to get around the chicken and egg problem of circular object references when mapping arbitrary data structures to files. To flatten out a circular reference, an indirect object is referred to instead of being directly included in another object. PDF files have a global mechanism for locating indirect objects, and they all have two reference numbers (a reference number and a “generation” number, in case you wanted to append to the PDF file rather than just rewriting the whole thing).

pdfrw automatically handles indirect references on reading in a PDF file. When pdfrw encounters an indirect PDF file object, the corresponding Python object it creates will have an ‘indirect’ attribute with a value of True. When writing a PDF file, if you have created arbitrary data, you just need to make sure that circular references are broken up by putting an attribute named ‘indirect’ which evaluates to True on at least one object in every cycle.

Another PDF file concept that doesn’t quite map to regular Python is a “stream”. Streams are dictionaries which each have an associated unformatted data block.
Pdfrw handles streams by placing a special attribute on a subclassed dictionary.

4.3 Usage Model

The usage model for pdfrw treats most objects as strings (it takes their string representation when writing them to a file). The two main exceptions are the PdfArray object and the PdfDict object.

PdfArray is a subclass of list with two special features. First, an ‘indirect’ attribute allows a PdfArray to be written out as an indirect PDF object. Second, pdfrw reads files lazily, so PdfArray knows about, and resolves references to, other indirect objects on an as-needed basis.

PdfDict is a subclass of dict that also has an indirect attribute and lazy reference resolution. (The subclassed IndirectPdfDict has indirect automatically set to True.)

But PdfDict also has an optional associated stream. The stream object defaults to None, but if you assign a stream to the dict, it will automatically set the PDF /Length attribute for the dictionary.

Finally, since PdfDict instances are indexed by PdfName objects (which always start with a /) and since most (all?) standard Adobe PdfName objects use names formatted like “/CamelCase”, it makes sense to allow access to dictionary elements via object attribute accesses as well as object index accesses.
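The attribute-style access described above can be pictured with a toy model. This is a minimal sketch of the idea only, not pdfrw's implementation: the class name `SlashKeyDict` is hypothetical, keys are plain strings rather than PdfName objects, and returning None for a missing name is an assumption of the sketch.

```python
class SlashKeyDict(dict):
    """Toy dict whose '/CamelCase' keys can also be reached as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)
        # Assumption of this sketch: a missing name yields None.
        return self.get('/' + name)

    def __setattr__(self, name, value):
        # Store every attribute assignment under the slash-prefixed key.
        self['/' + name] = value


page = SlashKeyDict()
page.Rotate = 90                     # same as page['/Rotate'] = 90
page['/Odd-Name'] = 'index only'     # not a valid identifier, so use indexing

print(page.Rotate, page['/Rotate'], page.Missing)
```

Names that are not valid Python identifiers, such as `/Odd-Name` here, are only reachable through index lookup, which is the split between the two access styles that the text describes.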
So usage of PdfDict objects is normally via attribute access, although non-standard names (though still with a leading slash) can be accessed via dictionary index lookup.

4.3.3 Manipulating PDFs in memory

For the most part, pdfrw tries to be agnostic about the contents of PDF files, and support them as containers, but to do useful work, something a little higher-level is required, so pdfrw works to understand a bit about the contents of the containers. For example:

- PDF pages. Pdfrw knows enough to find the pages in PDF files you read in, and to write a set of pages back out to a new PDF file.
- Form XObjects. Pdfrw can take any page or rectangle on a page, and convert it to a Form XObject, suitable for use inside another PDF file. It knows enough about these to perform scaling, rotation, and positioning.
- reportlab objects. Pdfrw can recursively create a set of reportlab objects from its internal object format. This allows, for example, Form XObjects to be used inside reportlab, so that you can reuse content from an existing PDF file when building a new PDF with reportlab.

There are several examples that demonstrate these features in the example code directory.

5.2 PDF object model support

The sub-package contains one module for each of the internal representations of the kinds of basic objects that exist in a PDF file, with one module in that package simply gathering them up and making them available to the main pdfrw package.

One feature that all the PDF object classes have in common is the inclusion of an ‘indirect’ attribute.
If ‘indirect’ exists and evaluates to True, then when the object is written out, it is written out as an indirect object. That is to say, it is addressable in the PDF file, and could be referenced by any number (including zero) of container objects. This indirect object capability saves space in PDF files by allowing objects such as fonts to be referenced from multiple pages, and also allows PDF files to contain internal circular references. This latter capability is used, for example, when each page object has a “parent” object in its dictionary.

5.2.2 Name objects

The module contains the PdfName singleton object, which will convert a string into a PDF name by prepending a slash. It can be used either by calling it or getting an attribute, e.g.:

    PdfName.Rotate
    PdfName('Rotate')
    PdfObject('/Rotate')

In the example above, there is a slight difference between the objects returned from PdfName, and the object returned from PdfObject. The PdfName objects are actually objects of class “BasePdfName”.
This is important, because only these may be used as keys in PdfDict objects.

5.2.5 Dict objects

The module contains the PdfDict class, which is a subclass of dict that is used to represent dictionaries in a PDF file.

5.3 File reading, tokenization and parsing

One module contains the PdfReader class, which can read a PDF file (or be passed a file object or already read string) and parse it. It uses the PdfTokens class for low-level tokenization.

The PdfReader class does not, in general, parse into containers (e.g. inside the content streams). There is a proof of concept for doing that inside the examples/rl2 subdirectory, but that is slow and not well-developed, and not useful for most applications.

An instance of the PdfReader class is an instance of a PdfDict: the trailer dictionary of the PDF file, to be exact.
It has a private attribute named ‘pages’, which is a list containing all the pages in the file.

When instantiating a PdfReader object, there are options available for decompressing all the objects in the file. Pdfrw does not currently have very many options for decompression, so this is not all that useful, except in the specific case of compressed object streams.

Also, there are no options for decryption yet. If you have PDF files that are encrypted or heavily compressed, you may find that using another program like pdftk on them can make them readable by pdfrw.

In general, the objects are read from the file lazily, but this is not currently true with compressed object streams: all of these are decompressed and read in when the PdfReader is instantiated.
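The lazy reading mentioned above can be pictured with a toy model. This is a sketch of the general idea only, not pdfrw's code: `Ref`, `LazyArray`, and `load_object` are hypothetical names, and a real reader would parse bytes from the file where this loader builds a small dict.

```python
class Ref:
    """Stand-in for an unresolved indirect reference (object number, generation)."""

    def __init__(self, num, gen=0):
        self.num = num
        self.gen = gen


class LazyArray(list):
    """Toy list that resolves Ref items only when they are first accessed.

    Iteration and slicing are not handled in this sketch.
    """

    def __init__(self, items, loader):
        super().__init__(items)
        self._loader = loader  # callable: Ref -> parsed object

    def __getitem__(self, index):
        item = super().__getitem__(index)
        if isinstance(item, Ref):
            item = self._loader(item)
            super().__setitem__(index, item)  # cache the resolved object
        return item


loaded = []  # records which object numbers were actually "parsed"


def load_object(ref):
    # Hypothetical loader standing in for parsing an object from the file.
    loaded.append(ref.num)
    return {'/Type': '/Page', '/ObjNum': ref.num}


kids = LazyArray([Ref(3), Ref(4), Ref(5)], load_object)
first = kids[0]   # resolves object 3 only
again = kids[0]   # served from the cache, no second load
```

Objects 4 and 5 are never loaded unless something indexes them, which is the saving that lazy resolution is meant to provide on large files.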
5.4 File output

One module contains the PdfWriter class, which can create and output a PDF file.

There are a few options available when creating and using this class.

In the simplest case, an instance of PdfWriter is instantiated, and then pages are added to it from one or more source files (or created programmatically), and then the write method is called to dump the results out to a file.

If you have a source PDF and do not want to disturb the structure of it too badly, then you may pass its trailer directly to PdfWriter rather than letting PdfWriter construct one for you. There is an example of this (alter.py) in the examples directory.

5.5 Advanced features

buildxobj contains functions to build Form XObjects out of pages or rectangles on pages. These may be reused in new PDFs essentially as if they were images. buildxobj is careful to cache any page used so that it only appears in the output once.

Another module provides the makerl function, which will translate pdfrw objects into a format which can be used with reportlab. It is normally used in conjunction with buildxobj, to be able to reuse parts of existing PDFs when using reportlab.

A further module builds on the foundation laid by buildxobj.
It contains classes to create a new page (or overlay an existing page) using one or more rectangles from other pages. There are examples showing its use for watermarking, scaling, 4-up output, splitting each page in 2, etc.

Another module contains code that can find specific kinds of objects inside a PDF file. The extract.py example uses this module to create a new PDF that places each image and Form XObject from a source PDF onto its own page, e.g. for easy reuse with some of the other examples or with reportlab.

7.1 Pure Python

reportlab is must-have software if you want to programmatically generate arbitrary PDFs.

pyPdf is, in some ways, very full-featured. It can do decompression and decryption and seems to know a lot about items inside at least some kinds of PDF files. In comparison, pdfrw knows less about specific PDF file features (such as metadata), but focuses on trying to have a more Pythonic API for mapping the PDF file container syntax to Python, and (IMO) has a simpler and better PDF file parser.
The Form XObject capability of pdfrw means that, in many cases, it does not actually need to decompress objects; they can be left compressed.

pdftools feels large and I fell asleep trying to figure out how it all fit together, but many others have done useful things with it.

My understanding is that pagecatcher would have done exactly what I wanted when I built pdfrw. But I was on a zero budget, so I’ve never had the pleasure of experiencing pagecatcher. I do, however, use and like the open source software from the people who make pagecatcher, so I’m sure pagecatcher is great, better documented and much more full-featured than pdfrw.

PDFMiner looks like a useful, actively-developed program. It is quite large, but then, it is trying to actively comprehend a full PDF document. From the website:

“PDFMiner is a suite of programs that help extracting and analyzing text data of PDF documents. Unlike other PDF-related tools, it allows to obtain the exact location of texts in a page, as well as other extra information such as font information or ruled lines.
It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes instead of text analysis.”

8 Release information

Revisions:

0.4 – Released 18 September, 2017

- Python 3.6 added to test matrix
- Proper unicode support for text strings in PDFs added
- buildxobj fixes allow better support creating form XObjects out of compressed pages in some cases
- Compression fixes for Python 3+
- New subsetbooklets.py example
- Bug with non-compressed indices into compressed object streams fixed
- Bug with distinguishing compressed object stream first objects fixed
- Better error reporting added for some invalid PDFs (e.g.
PDF Split and Merge is a free application that allows you to work with PDF files.

Among its options you'll find ones to divide a file into different documents, join different files into a single one, extract parts of the original file, mix several documents, change the page order, etc.

You can create groups of tasks and automate them. Use them as plugins, and import and export them when installing the program on other computers.

If you usually work with PDF files, PDF Split and Merge is a really interesting choice.