Whenever we want to discover new vulnerabilities in software, we should first understand the protocol or file format in which we’re looking for them. In our case, that means understanding the PDF file format in detail. In this article, we’ll take a look at the PDF file format and its internals.
PDF is a portable document format that can be used to present documents that include text, images, multimedia elements, web page links and more. The PDF file format specification is publicly available and can be used by anyone interested in the PDF file format. The documentation for the file format alone runs to almost 800 pages, so reading through it is not something to do on a whim. PDF has more functions than just text: it can include images and other multimedia elements, be password protected, execute JavaScript and so on.
The basic structure of a PDF file is presented in the picture below:
Every PDF document has the following elements:
Header
This is the first line of a PDF file and specifies the version number of the PDF specification that the document uses. If we want to find that out, we can use a hex editor or simply use the xxd command as below:
0000000: 2550 4446 2d31 2e33 0a25 c4e5 f2e5 eba7  %PDF-1.3.%......
The temp.pdf PDF document uses the PDF specification 1.3. The ‘%’ character starts a comment in PDF, so the first and second lines in the example above are both comments, which is typical of PDF documents.
The first bytes of the output above, 2550 4446 2d31 2e33 0a25, correspond to the ASCII text “%PDF-1.3.%”, where xxd prints the newline byte 0a as a dot. What follows (c4e5 f2e5 eba7) are bytes outside the printable ASCII range (note the ‘.’ dots), which are usually there to tell some software products that the file contains binary data and shouldn’t be treated as 7-bit ASCII text. Currently the version numbers are of the form 1.N, where N is in the range 0-7.
In the body of the PDF document, there are objects that typically include text streams, images, other multimedia elements, etc. The Body section is used to hold all the document’s data being shown to the user.
Next comes the cross reference table, which contains the references to all the objects in the document. The purpose of a cross reference table is to allow random access to objects in the file, so we don’t need to read the whole PDF document to locate a particular object. Each object is represented by one entry in the cross reference table, which is always 20 bytes long. We can display the cross reference table of the PDF document by simply opening the PDF with a text editor and scrolling to the bottom of the document.
In the example cross reference table, we can see that we have four subsections (note the four lines that only contain two numbers). The first number in those lines is the object number of the first object in the subsection, while the second number states the number of objects in the current subsection.
Each object is represented by one entry, which is 20 bytes long (including the CRLF). The first 10 bytes are the object’s offset from the start of the PDF document to the beginning of that object. What follows is a space separator with another number specifying the object’s generation number. After that, there is another space separator, followed by a letter “f” or “n” to indicate whether the object is free or in use.
The first object has an ID 0 and always contains one entry with generation number 65535 that is at the head of the list of free objects (note the letter “f” that means free). The last object in the cross-reference table uses the generation number 0. The second subsection starts with object ID 3 and contains one element, object 3, which starts at an offset of 25324 bytes from the beginning of the document. The third subsection has four objects, the first of which has an ID 21 and starts at an offset of 25518 bytes from the beginning of the file. The other objects have the subsequent numbers 22, 23 and 24.
All objects are marked with either an “f” or “n” flag. Flag “f” means that the object may still be present in the file, but is marked free, so it shouldn’t be used. These free objects contain a reference to the next free object and the generation number to be used if the object becomes valid again.
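To make the cross reference table layout discussed in this article more concrete (subsection header lines with two numbers, followed by entries made of a 10-digit offset, a 5-digit generation number and an “f” or “n” flag), here is a minimal Python sketch of how such a table could be parsed. It is only an illustration, not a full PDF parser: the names `XrefEntry`, `parse_xref` and `SAMPLE` are our own, and the sample table is hypothetical. Only the values for objects 0, 3 and 21 follow the text; the offsets for objects 22 to 24 and the generation number of object 21 are made up.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    obj_num: int      # object number, derived from the subsection header
    offset: int       # byte offset if in use; next free object number if free
    generation: int   # generation number
    in_use: bool      # True for flag "n", False for flag "f"


def parse_xref(text: str) -> List[XrefEntry]:
    """Parse a classic (non-stream) cross reference table given as text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("cross reference table must start with 'xref'")

    entries: List[XrefEntry] = []
    i = 1
    while i < len(lines):
        parts = lines[i].split()
        # A subsection header holds exactly two numbers:
        # the first object number and the number of entries.
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            break  # e.g. the 'trailer' keyword ends the table
        first, count = int(parts[0]), int(parts[1])
        i += 1
        for n in range(count):
            if i >= len(lines):
                raise ValueError("truncated cross reference subsection")
            fields = lines[i].split()
            if len(fields) != 3:
                raise ValueError("malformed entry: %r" % lines[i])
            field1, gen, flag = fields
            if len(field1) != 10 or len(gen) != 5 or flag not in ("f", "n"):
                raise ValueError("malformed entry: %r" % lines[i])
            entries.append(XrefEntry(first + n, int(field1), int(gen), flag == "n"))
            i += 1
    return entries


# Hypothetical sample: objects 0, 3 and 21 follow the article's description,
# the remaining values are invented for illustration.
SAMPLE = """xref
0 1
0000000000 65535 f
3 1
0000025324 00000 n
21 4
0000025518 00000 n
0000025635 00000 n
0000025777 00000 n
0000025900 00000 n
trailer
"""

for entry in parse_xref(SAMPLE):
    state = "in use" if entry.in_use else "free"
    print(entry.obj_num, entry.offset, entry.generation, state)
```

Note that the object number never appears in an entry itself: it is derived from the first number of the subsection header plus the entry’s position, which is why the sketch tracks `first + n`. The sketch also strips line endings before parsing, so it does not check that each entry is exactly 20 bytes long.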