Adding Page Labels to PDFs

Aug 26, 2024, 12:16 AM @ 📑 The Office

I hate it when I open up a PDF and find that the page numbers don’t match what’s written on the page. Many of the books in my digital library of academic research and reference material have this problem. It makes it a big hassle to find a particular page—the book says “156,” but when you go to that page in the PDF, now you’re somehow on page 137. I hate it!

There are many solutions to this problem. The easiest one, and the one most people probably choose, is “just deal with it.” But I am not most people. The slightly more tech-savvy might open up Adobe Acrobat and edit the page numbers in there—usually not too much of a hassle. But I don't have Acrobat.

“Hardcore” solutions to the page-labeling problem

Then, there are the more hardcore solutions. The easiest of those is to open up the PDF in a text editor and find the text /Catalog. Then add the following template right after it (as explained in answers to this Stack Exchange question):

/PageLabels <<
	/Nums [
		% labels 1st page with the string "cover"
		0 << /P (cover) >>
		% numbers pages 2-6 in small roman numerals
		1 << /S /r >>
		% numbers pages 7-x in decimal arabic numerals
		6 << /S /D >>
	]
>>

See the PDF 1.7 Standard (p. 382, §12.4.2) for a table of things you can put into /Nums. The main points of interest are

  • /S for page numbering “Style,” which can be /D for Decimal Arabic numerals, /R or /r for uppercase or lowercase Roman numerals, and /A or /a for alphabetic page numbering; and
  • /P, which is the page numbering prefix. This one can be anything. And, if you don’t set a style, then only the prefix is used. This is useful if you want a page label like “(cover),” for example.
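
To make the /S and /P entries concrete, here is a small Python sketch of how a viewer might turn a /Nums array into the label for a given page. It is my own illustration of the rules in the standard, not code from any PDF library; the names page_label, to_roman, and to_letters are made up for this example, and the dictionary simply mirrors the template above. The /St key (the starting number, which defaults to 1) is included for completeness.

```python
def to_roman(n):
    """Convert a positive integer to uppercase Roman numerals."""
    numerals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def to_letters(n):
    """Alphabetic style: A..Z for 1-26, then AA..ZZ for 27-52, and so on."""
    repeats, index = divmod(n - 1, 26)
    return chr(ord("A") + index) * (repeats + 1)


FORMATTERS = {
    "D": str,
    "R": to_roman,
    "r": lambda n: to_roman(n).lower(),
    "A": to_letters,
    "a": lambda n: to_letters(n).lower(),
}


def page_label(page_index, nums):
    """Return the label for a zero-based page index.

    nums maps the first page index of each range to its label rule,
    e.g. {0: {"P": "cover"}, 1: {"S": "r"}}.
    """
    # The rule that applies is the one with the largest start index
    # that is not past the requested page.
    first = max(i for i in nums if i <= page_index)
    rule = nums[first]
    number = rule.get("St", 1) + (page_index - first)
    style = rule.get("S")
    numeral = FORMATTERS[style](number) if style else ""
    return rule.get("P", "") + numeral


# The same three rules as the template above.
nums = {0: {"P": "cover"}, 1: {"S": "r"}, 6: {"S": "D"}}
print([page_label(i, nums) for i in range(8)])
# ['cover', 'i', 'ii', 'iii', 'iv', 'v', '1', '2']
```

Note that each range restarts its numbering at 1 unless /St says otherwise, which is why the seventh page is labeled "1" rather than "7".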

Editing the file by hand works sometimes, but technically it breaks the PDF: every PDF has a cross-reference table that records the byte offset of each internal object, and inserting text shifts those offsets. You can try to repair the file with the PDF program mutool clean my.pdf. Sometimes, though, the PDF is beyond repair and you have to try another way.

I used this method for a long time, but after recently doing more with Python, I wanted to see if there was another way. Yes, there is!

Python makes it easier

It turns out there’s a nice library called pypdf. It can do page labels just fine. Here’s a small example that I used today:

from pypdf import PdfReader, PdfWriter

r = PdfReader("in.pdf")
w = PdfWriter(r)

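# Insert a blank page at index 2; the writer now has one more page than the reader.
w.insert_blank_page(index=2)
w.set_page_label(0, 2,                prefix="")
# Use the writer's page count so the last page is included in the range.
w.set_page_label(3, len(w.pages) - 1, style="/D")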

with open("out.pdf", "wb") as file:
    w.write(file)

It works really well and it was pretty straightforward to learn how to use. But the page ranges for each page label start from zero (just like in my “hardcore” method above) and the end of the page range isn’t optional. I hate doing mental math, even if it’s just x - 1. So, I took my small example above and made it into a command line program that makes this process easy.

Challenges in making the page-labeling program

I decided to call it pdf-pagelabels in a spark of genius. I won’t explain more about how to use it. Instead, I want to document some of the challenges I solved and the things I learned. Here’s the program’s help text.

usage: pdf-pagelabels [-h] input output page_range [OPTIONS] [page_range [OPTIONS]...]

Add page labels to a PDF file

positional arguments:
  input                 Input PDF file
  output                Output PDF file
  page_range [OPTIONS]  Page range and label specifications

options:
  -h, --help            show this help message and exit
  -s STYLE, --style STYLE
                        Page label style, one of D, R, r, A, or a
  -p PREFIX, --prefix PREFIX
                        Page label prefix
  -t START, --start START
                        Starting number for page label

Basically, you give a page range, you set options for that page range, and then repeat that pattern as many times as needed. It’s a straightforward syntax, but it turned out to be a big challenge to handle with Python’s argparse module.

Handling command-line arguments

The argparse module can handle arguments in a few different ways. For example, its parse_known_args method processes just the arguments it knows about, leaving the rest behind for you to deal with. That’s not quite what I wanted because it wouldn’t maintain the order of the options. They need to stay with their page ranges. A workaround would be to let the parser ignore everything after the input and output files, but then the program can’t provide a help message for those options.
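
Here is a small sketch of the ordering problem, using argparse’s parse_known_args. The parser below is a cut-down stand-in written for this illustration, not the real program’s parser. When the main parser knows about -p, it collects every -p into one value and the page ranges come back separately, so there is no way to tell which prefix belonged to which range.

```python
import argparse

# A cut-down stand-in for the main parser (illustration only).
parser = argparse.ArgumentParser(prog="pdf-pagelabels")
parser.add_argument("input")
parser.add_argument("output")
parser.add_argument("-p", "--prefix")

known, leftover = parser.parse_known_args(
    ["in.pdf", "out.pdf", "1-2", "-p", "x", "3-end", "-p", "y"]
)

print(known.prefix)  # y  (the second -p overwrote the first)
print(leftover)      # ['1-2', '3-end']  (ranges are separated from their options)
```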

I was stumped on this problem for quite a while. It took many Google searches and a really long conversation with a GPT before I settled on a strategy.

My GPT conversation yielded two main strategies: one, ignore the page label options and process them manually. It worked, but what’s the point of using argparse if I just have to do it manually anyway?

The second option was a similar idea, but much too complicated for my taste. It was to define a custom action for argparse to handle all the options after input and output. This way, the options could be included in the help text, but it still wasn’t quite right.

A third strategy, which occurred to me when reading the argparse documentation, was to use a sub-parser. This also could have worked, but the resulting help text was unclear because sub-parsers aren’t intended for this kind of use.

In the end, the strategy that finally worked for me was different. I looked at a lot of questions on Stack Exchange, but it was this answer which seemed the most appropriate for my program. The gist is to create a main parser which handles the input and output arguments. Then write a generator which splits the remaining arguments into page label + options groups. Finally, pass each group to a second parser before letting the program do its work.

Using a generator function

This is my first time working with generator functions in Python. Basically, a generator is a way to loop through a data structure using custom logic, rather than just stepping through each item one at a time. Generator functions look just like regular functions, but instead of a return, they use the yield keyword to output something in a loop.
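
As a tiny illustration of the idea (unrelated to the page label program itself), here is a generator that walks through a list two items at a time. The name in_pairs is just something I picked for the example.

```python
def in_pairs(items):
    """Yield the items two at a time instead of one at a time."""
    for i in range(0, len(items), 2):
        yield items[i:i + 2]


pairs = in_pairs(["a", "b", "c", "d", "e"])
print(next(pairs))  # ['a', 'b']
print(list(pairs))  # [['c', 'd'], ['e']]
```

Nothing runs until a value is requested: each call to next() resumes the function right after the last yield, which is what lets a generator keep track of where it is in the loop.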

For this program, I was looking to make groups that looked like this: page_range [OPTIONS]. The generator function looks at each command line argument and decides if it’s an option or a page range. If it’s a page range, it checks if we’ve already processed a page range, and if so, it yields that before starting a new group. Otherwise, if it’s an option, it appends the option and its value to a list. And just like that, we have groups of page range + options!

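# Given ["1-2", "-p", "", "3-end", "-s", "D"]
def page_label_args_generator(args):
    result = []
    i = 0
    while i < len(args):
        if args[i].startswith("-"):
            # An option and its value stay with the current page range.
            # Slicing avoids an IndexError if the last option has no value.
            result.extend(args[i:i + 2])
            i += 2
        else:
            # A new page range: emit the finished group, then start a new one.
            if result:
                yield result
            result = [args[i]]
            i += 1
    # Emit the final group, if there is one.
    if result:
        yield result
# The generator function yields two items:
# ["1-2", "-p", ""]
# ["3-end", "-s", "D"]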
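
To show how groups like these could be handed to a second parser, here is a rough sketch. The option names come from the help text above, but the parser itself is my reconstruction for illustration, not the actual code from the program.

```python
import argparse

# A hypothetical second parser that understands one group at a time.
group_parser = argparse.ArgumentParser(prog="pdf-pagelabels", add_help=False)
group_parser.add_argument("page_range")
group_parser.add_argument("-s", "--style", choices=["D", "R", "r", "A", "a"])
group_parser.add_argument("-p", "--prefix")
group_parser.add_argument("-t", "--start", type=int)

# The two groups from the generator example.
groups = [["1-2", "-p", ""], ["3-end", "-s", "D"]]
parsed = [group_parser.parse_args(group) for group in groups]

for ns in parsed:
    print(repr(ns.page_range), repr(ns.style), repr(ns.prefix), repr(ns.start))
# '1-2' None '' None
# '3-end' 'D' None None
```

Because each group is parsed on its own, every option stays attached to the page range it followed on the command line.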

Processing the page ranges

With the program’s arguments grouped in the right order, we can process the page ranges and turn them into something that pypdf can understand. The rest of the program is pretty straightforward: it breaks down each page range, accounting for the different possibilities. I split the range on “-”, then check how many items come back. If there’s one, I only change the page label for that page. If there are two numbers, I add labels for all the pages from the first to the second. For cases like 3- or 3-end, I check whether the second item is empty or the word “end”, and if so, I return None to mean the end of the PDF.

def parse_page_range(range_str):
    parts = range_str.split("-")
    if len(parts) == 1:
        return int(parts[0]), int(parts[0])
    elif len(parts) == 2:
        start = int(parts[0])
        if parts[1] == "" or parts[1].lower() == "end":
            return start, None
        return start, int(parts[1])
    else:
        raise ValueError("Invalid page range format")
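
The last step is turning that (start, end) pair into the zero-based indexes that pypdf expects. The helper below is only a sketch of how that might look: the name to_zero_based is made up, and it assumes the command line counts pages from 1 and that None means “through the last page.”

```python
def to_zero_based(start, end, page_count):
    """Convert a 1-based inclusive page range to 0-based inclusive indexes.

    Assumption for this sketch: end=None means the last page of the PDF.
    """
    if end is None:
        end = page_count
    if not 1 <= start <= end <= page_count:
        raise ValueError(f"Page range {start}-{end} is outside 1-{page_count}")
    return start - 1, end - 1


print(to_zero_based(1, 2, 10))     # (0, 1)
print(to_zero_based(3, None, 10))  # (2, 9)
```

This is the x - 1 arithmetic from earlier, done once in code so it never has to be done by hand.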

And that’s pretty much it! If you’re interested in this program, you can find it on GitHub.


Profile

Written by Randy Josleyn

Language learner, language teacher, music lover. Living in Beijing, Boise, and elsewhere