★ Automating date collection in scanned documents.


I was reading the other day on MacSparky about a set of Hazel rules to extract dates from the content of PDF files. The rules themselves can be found on Macorios.com and I have to say, they’re pretty brilliant. However, one of the items listed under the known issues made them unworkable for me:

The rules work only reliable, when there is only one date in the file’s content. If there is more then one date, the rules can mix up and you can get a improper renamed file (Hazel runs the rules from top to bottom. So if the file’s content has to dates, lets say 01/30/2012 and 02/02/2013, the result will be 2012-01-02).

Since most of the docs I’m scanning are bills, they’re going to have at least two dates in them – a statement closing date and a due date. This meant that nearly every document I wanted to process was not going to work. Sigh.

But wait – I’m using a ScanSnap S1300, and one of its features is that it takes highlighted text and attaches it as a keyword to the resulting PDF. Once the scan is complete, Hazel picks up the file and runs this script. The script does a bunch of things:

  • gets a list of the keywords
  • converts any of the keywords which appear to be dates to the local short date format
  • prefixes all the keywords with ‘@’ and writes them to the Spotlight comments
  • also writes @month and @year to the Spotlight comments
  • exports the short date format keyword to Hazel
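To make the steps above concrete, here is a rough Python sketch of the keyword-handling part only. It is not the actual script: the function names, the list of recognized date formats, the US-style short date, and the choice of the month name for the @month tag are all assumptions for illustration, and it deliberately leaves out reading the PDF keywords, writing the Spotlight comments, and handing the date back to Hazel.

```python
from datetime import datetime

# Assumed input formats; the post does not say which ones the real script accepts.
DATE_FORMATS = ("%m/%d/%Y", "%m/%d/%y", "%Y-%m-%d", "%B %d, %Y", "%b %d, %Y")


def parse_date(keyword):
    """Return a datetime if the keyword looks like a date, otherwise None."""
    text = keyword.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def build_tags(keywords, short_format="%m/%d/%Y"):
    """Turn scanner keywords into '@' tags plus a short date for renaming.

    short_format stands in for the local short date format (assumed US style).
    Returns (tags, file_date); file_date is None when no keyword is a date.
    """
    tags = []
    file_date = None
    for keyword in keywords:
        parsed = parse_date(keyword)
        if parsed is None:
            # Not a date: keep the keyword as-is, prefixed with '@'.
            tags.append("@" + keyword.strip())
            continue
        short = parsed.strftime(short_format)
        tags.append("@" + short)
        if file_date is None:
            # The first date found drives the file name and the month/year tags.
            file_date = short
            tags.append("@" + parsed.strftime("%B"))
            tags.append("@" + str(parsed.year))
    return tags, file_date


if __name__ == "__main__":
    tags, file_date = build_tags(["Acme Power", "02/25/2013"])
    print(tags)
    print(file_date)
```

Using the first date-like keyword for the file name is a design choice of this sketch; since only the highlighted date should end up in the keywords, there is normally just one to pick from.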

Hazel then renames the file with that date and passes it along to another folder (called “Dated Scans” in my workflow), where it is processed again to add the company name, based on the account number Hazel finds in the PDF content. My Hazel rule is set up like this:

[Screenshot of the Hazel rule, taken 2013-02-25]

(The “filedate” variable in the Rename action is exported from the script.)

I’ve tested with my bills over the past month – most of the misses have been because I wasn’t careful with the highlighter and caught some extra characters. One format I did find particularly troublesome was “February 25, 2013” – the comma may cause it to be parsed as two separate keywords.
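One possible way around the comma problem, sketched in Python: before looking for dates, try gluing each pair of neighboring keywords back together with a comma and see whether the result parses as a long-form date. This is only an idea, not something the script described here does; the function name and the single "%B %d, %Y" format are assumptions.

```python
from datetime import datetime


def rejoin_split_dates(keywords, fmt="%B %d, %Y"):
    """Merge neighboring keywords that form a date once rejoined with a comma."""
    merged = []
    i = 0
    while i < len(keywords):
        if i + 1 < len(keywords):
            candidate = keywords[i].strip() + ", " + keywords[i + 1].strip()
            try:
                datetime.strptime(candidate, fmt)
            except ValueError:
                pass  # Not a date when joined; fall through and keep them separate.
            else:
                merged.append(candidate)
                i += 2
                continue
        merged.append(keywords[i])
        i += 1
    return merged


if __name__ == "__main__":
    print(rejoin_split_dates(["Acme Power", "February 25", "2013"]))
```

Keywords that do not form a date when joined are passed through untouched, so running this on a list with no split dates should be harmless.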

[In the interest of full disclosure: The ScanSnap link is an Amazon affiliate link. If you click it and then buy something from Amazon, I get a percentage. And you get something you wanted. So we both win.]

