Automated processing of PDF files

Every year I receive a PDF document from our accountants containing tax forms (P11D) for our employees. Splitting that up and emailing it on by hand is tedious, and I live in fear of sending the wrong document to someone, so I’ve automated the process.

The key steps are to split the original file into one for each individual, rename those individual files after the relevant employee, and then email the file to that person.

Splitting uses a Python script to divide the original PDF into two-page chunks, as each individual form occupies two pages. I use PyPDF2 to do this; it can be installed using pip.

Listing 1: Split tax forms

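#!/usr/bin/env python3
# Uses the PyPDF2 1.x/2.x API; PyPDF2 3.0 renamed these classes to
# PdfReader and PdfWriter.
from PyPDF2 import PdfFileWriter, PdfFileReader

def main():
    # Keep the source file open while pages are copied out of it.
    with open("p11d-2017.pdf", "rb") as inputStream:
        inputpdf = PdfFileReader(inputStream)

        # Each form occupies two pages, so step through the document in pairs.
        for i in range(0, inputpdf.numPages, 2):
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            output.addPage(inputpdf.getPage(i + 1))

            # Output is named after the index of its first page: 0-p11d.pdf, 2-p11d.pdf, ...
            with open("%s-p11d.pdf" % i, "wb") as outputStream:
                output.write(outputStream)

if __name__ == '__main__':
    main()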
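
The listing assumes the page count is even. A minimal sketch of a guard for that assumption follows; the helper name page_chunks is mine, not part of PyPDF2. It computes the page ranges up front and refuses to continue if the document does not divide cleanly into forms.

```python
def page_chunks(num_pages, pages_per_form=2):
    """Return (start, end) page index pairs, end exclusive, one per form."""
    if num_pages % pages_per_form != 0:
        # A stray cover sheet or missing page would shift every later form.
        raise ValueError(
            "%d pages is not a multiple of %d" % (num_pages, pages_per_form)
        )
    return [(start, start + pages_per_form)
            for start in range(0, num_pages, pages_per_form)]


if __name__ == "__main__":
    print(page_chunks(6))  # [(0, 2), (2, 4), (4, 6)]
```

Failing early here seems preferable to discovering halfway through that every later form is split across two files.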

I had originally intended to name the files after the individual concerned by finding the name with PyPDF2’s extractText method, but this unfortunately fails to find the text in our PDFs and is documented as being unreliable. To resolve this I use a second stage based on pdftotext. Using this I’ve written a shell script, rename.sh, which takes the name of a file to rename, extracts the text, and searches that for the name of the relevant employee. Files are then copied to a new directory and named accordingly.

Listing 2: Rename form after employee

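#!/usr/bin/env bash

file=$1
# Pull the text after "Name:" out of the PDF; xargs trims surrounding whitespace.
name=$(pdftotext "$file" - | grep Name: | cut -f2 -d: | xargs)
# Copy rather than move, so the original chunk is left untouched.
cp "$file" "$name.pdf"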
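
If pdftotext finds no Name: line, the script above will quietly produce a file called ".pdf". As a rough sketch of the same extraction with that case made explicit, here is a Python version. The function name extract_name and the sample text are hypothetical, and it assumes the form text contains a line of the shape "Name: Fred Bloggs".

```python
def extract_name(text):
    """Return the value following the first non-empty 'Name:' label."""
    for line in text.splitlines():
        if "Name:" in line:
            after_label = line.split("Name:", 1)[1]
            # Collapse runs of whitespace, roughly what xargs does in the shell version.
            name = " ".join(after_label.split())
            if name:
                return name
    raise ValueError("no non-empty 'Name:' field found")


if __name__ == "__main__":
    sample = "P11D 2017\nName:   Fred   Bloggs  \n"  # made-up pdftotext output
    print(extract_name(sample))  # Fred Bloggs
```

The text would come from running pdftotext on one of the two-page files; raising an error rather than returning an empty string stops a badly parsed form from being renamed at all.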

All that’s then left to do is to email the files to their owners. The file-to-email mapping is critical: you don’t want to email the wrong file to someone. Copious dry runs emailing only me, followed by a dry run that emails the real recipients without attaching the file, help to provide reassurance.
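
One cheap extra check before any dry run is to compare the names in the mapping against the files actually on disk. This is only a sketch: check_mapping is a hypothetical helper, and it assumes the renamed files are called "<employee name>.pdf" as in Listing 2.

```python
def check_mapping(employees, filenames):
    """Compare expected '<name>.pdf' files against the files actually present.

    Returns (missing, unexpected): PDFs with no file on disk, and files on
    disk that belong to nobody in the mapping.
    """
    expected = {"%s.pdf" % name for name in employees}
    present = set(filenames)
    return sorted(expected - present), sorted(present - expected)


if __name__ == "__main__":
    employees = {"Fred Bloggs": "fred@example.com",
                 "Joe Smith": "joe@example.com"}
    # In real use the second argument would be os.listdir() of the output directory.
    missing, unexpected = check_mapping(employees, ["Fred Bloggs.pdf", "Jane Doe.pdf"])
    print(missing)     # ['Joe Smith.pdf']
    print(unexpected)  # ['Jane Doe.pdf']
```

Both lists should be empty before anything is sent; a non-empty one usually means a misspelt name or a form that failed to rename.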

Listing 3: Email P11Ds to employees

import smtplib

from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.application import MIMEApplication

employees = { "Fred Bloggs" : "fred@example.com",
              "Joe Smith" : "joe@example.com" }

me = 'sender@example.com'
pwd = "elided" # This needs to be an application specific password from gmail as we have 2FA enabled.

s = smtplib.SMTP_SSL('smtp.gmail.com', 465) 
s.ehlo()
s.login(me, pwd)

p11ds = employees.keys()
for p11d in p11ds:
    # Create the container (outer) email message.
    msg = MIMEMultipart()
    msg['Subject'] = 'P11D 2017'
    body = MIMEText(" Here's your P11D for 2017.  No more excuses for not doing your tax return!\n\nGiles")
    msg.attach(body)

    you = employees[p11d]
    p11dName = "%s.pdf" % p11d
    with open(p11dName, 'rb') as fp:
        pdf = MIMEApplication(fp.read())
        pdf.add_header('Content-Disposition', 'attachment', filename=p11dName)
    msg.attach(pdf) # This is the critical line!  Drop for dry runs
    s.sendmail(me, you, msg.as_string())
    
s.quit()
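
Rather than commenting lines in and out between dry runs, the two dry-run modes could be made explicit switches. The sketch below is an assumption-laden alternative, not the script I actually ran: it uses the standard library's newer EmailMessage class instead of the MIME classes in Listing 3, and build_message, attach and override_to are names invented for illustration.

```python
from email.message import EmailMessage


def build_message(sender, recipient, pdf_name, pdf_bytes,
                  attach=True, override_to=None):
    """Build one P11D email without sending it.

    attach=False leaves the PDF off; override_to redirects the mail,
    for example back to the sender, for a dry run.
    """
    msg = EmailMessage()
    msg["Subject"] = "P11D 2017"
    msg["From"] = sender
    msg["To"] = override_to or recipient
    msg.set_content("Here's your P11D for 2017.")
    if attach:
        msg.add_attachment(pdf_bytes, maintype="application",
                           subtype="pdf", filename=pdf_name)
    return msg


if __name__ == "__main__":
    dry = build_message("sender@example.com", "fred@example.com",
                        "Fred Bloggs.pdf", b"%PDF-1.4",
                        attach=False, override_to="sender@example.com")
    print(dry["To"], dry.is_multipart())
```

A message built this way can be handed to smtplib's send_message method in place of the sendmail call, and because nothing is sent while building, the recipient and attachment can be inspected first.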

As I’ve mentioned before, I use BBDB to maintain a simple employee database. This allows me to grab the name and email address data in the employees dictionary from BBDB using a custom record layout.

(add-to-list 'bbdb-layout-alist '(short-email
                                  (order  mail)
                                  (primary . t)
                                  (toggle . t)))

(defun bbdb-display-record-short-email (record layout fields)
  (let ((copy (copy-sequence record)))
    (bbdb-record-set-field copy 'organization '(""))
    (bbdb-display-record-one-line copy
                                  layout
                                  fields)))

There’s a bug in the BBDB 3.1.2 documentation for bbdb-layout-alist. It claims that “When you add a new layout FOO, you can write a corresponding layout function `bbdb-display-record-layout-FOO'”. Actually the corresponding function should be bbdb-display-record-FOO. Most of the hard work of my custom display format is delegated to the built-in bbdb-display-record-one-line, but I erase the organization field from a copy of the record before passing it on, as there doesn’t seem to be a more elegant way of preventing it from being displayed.

Pressing *t in the *BBDB* buffer then toggles through the available layouts of all displayed records. Once short-email is displayed, the buffer can be copied, and a simple edit then produces the required format for the employees structure.
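
That final edit could also be scripted. The sketch below is speculative: it assumes each line of the copied buffer holds a name followed by a single email address, separated by whitespace, which may not match what your BBDB version displays. The function name parse_bbdb_lines is hypothetical.

```python
def parse_bbdb_lines(buffer_text):
    """Turn 'Name   email' lines into a {name: email} dictionary.

    Assumes one record per line with the address as the last
    whitespace-separated token; lines that do not fit are skipped.
    """
    employees = {}
    for line in buffer_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or "@" not in parts[-1]:
            continue
        employees[" ".join(parts[:-1])] = parts[-1]
    return employees


if __name__ == "__main__":
    copied = "Fred Bloggs    fred@example.com\nJoe Smith      joe@example.com\n"
    print(parse_bbdb_lines(copied))
```

Any record with more than one address, or with none, would need handling by hand, so the result is worth eyeballing before it goes anywhere near the email script.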
