Thursday, March 8, 2012

File Naming Conventions

I had a bunch of files. I was looking at ways to sort them. It seemed that it might help if they were all named according to file naming conventions, so that files of a certain type would be named in certain standard ways.

It was not immediately clear if there was any scholarly consensus as to what approach would be best. Along with general advice, there seemed to be at least two fundamentally different philosophies. On one hand, sources like the National Technology Assistance Project (NTAP) recommended using a folder hierarchy and relatively plain-English filenames. For instance, a file named "2009-05-15 Letter to Smith, R P" might be found in the 2009 > Correspondence subfolder; or in a different scheme, it might be in the Completed Projects > Waterway subfolder. On the other hand, Vincent Santaguida recommended putting the information into the filename itself and avoiding folder hierarchies. (I found that document and others on an Exadox webpage.) Santaguida's first example said this:

Do: Z:\Prod\QA\AssL7_WO_Suzuki_L3688_20090725.xls

Don't: Z:\Production\Quality Control\Assembly Line7\Work Orders\Clients\Suzuki Motors\LOT3688_July‐25‐2009.xls

Depending on who was using the files and how much they knew about the variety of filenames in the archive, it seemed that Santaguida's approach might benefit from a formal, elaborate naming scheme -- with, for instance, a reference work where users could look up "Quality Control" or "Assembly Line 7" (and other variations) and find the proper rules for naming documents related to those topics, along with a list or guidance system leading to relevant documents already filed. I could see where such a system might be valuable in some settings. I had a couple of concerns about it, though. One was this: what happens if you lose the reference list, or if the specialized database management system creating such filenames goes on the fritz?

It seemed that, for most purposes, PC Magazine had the better idea: make your filenames indicative of what the file contained, in terms that potential users could understand -- and, I would add, within a scheme that would not require more maintenance than users or database managers would devote to it. For instance, aside from special projects like this one, I was not generally going to invest the time to create a highly precise file naming arrangement. It did seem that Santaguida's approach could help reduce file duplication, but I felt that DoubleKiller gave me an adequate solution for that. The other thing was that I didn't actually know how life was going to turn out yet. File arrangements grew up on the fly, as new situations emerged. I wasn't positioned to put it all into a rigid structure.

So, while adopting some principles recommended in the Best Practice for File-Naming produced by the Government Records Branch of North Carolina, I was concerned that "Records will be moved from their original location" -- that is, I might have to re-sort things that I had already sorted once -- but I didn't see an easy way around that. Building a file's location into its filename would have been a bad idea because, in many cases, I *wanted* to be able to re-sort things freely.

Within the individual filename, Santaguida's second principle seemed right: "Put sufficient elements in the structure [particularly in the filename] for easy retrieval and identification but do not overdo it." I had been working toward a couple of basic formats:

2009-05-04 13.59 Message from X to Y re Z.pdf
Shakespeare--Romeo and Juliet.pdf
Shakespeare--Romeo and Juliet - with notes.pdf
Garfunkel--Bridge Over Troubled Water.mp3

The present project, I decided, was one in which I could mostly tend toward the first example: Year-Mo-Da Hr.Mn DocType from Sender to Recipient re Subject.ext. I would use periods and hyphens (-) for some limited purposes, but would tend not to use other punctuation. This tended to agree with Santaguida's third rule: Do not use characters such as ! # $ % & ' @ ^ ` ~ + , . ; =)]([. Santaguida also said not to use spaces, but I had rejected that in opting for plain-English filenames. He further said to use the underscore (_) to separate filename elements, but that was unnecessary in my approach. It also had the potential to confuse things. I noticed that some naming and conversion programs used the underscore in place of the space, giving me "File_name.exe" instead of "File name.exe". In Santaguida's approach, that would falsely suggest that "File" and "name" were two different elements. I planned to scout out and remove underscores. Minimizing punctuation also seemed generally consistent with Windows, which reserves certain punctuation characters (such as \ / : * ? " < > |) for special purposes and does not allow them in filenames.
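For illustration only, here is a minimal Python sketch of that first format. It assembles a filename from its elements, swaps underscores for spaces, and drops the punctuation named in Santaguida's third rule, except periods and hyphens, which this scheme keeps. The function names (build_filename, tidy) and the exact character set are assumptions made for the example, not part of any tool mentioned here.

```python
from datetime import datetime

# Punctuation to drop, taken from the characters in Santaguida's third rule.
# Assumption: periods and hyphens are left alone, since this scheme uses them.
UNWANTED = set("!#$%&'@^`~+,;=)]([")

def tidy(text):
    """Replace underscores with spaces, drop unwanted punctuation, collapse spaces."""
    text = text.replace("_", " ")
    text = "".join(ch for ch in text if ch not in UNWANTED)
    return " ".join(text.split())

def build_filename(when, doc_type, sender, recipient, subject, ext):
    """Assemble 'Year-Mo-Da Hr.Mn DocType from Sender to Recipient re Subject.ext'."""
    stamp = when.strftime("%Y-%m-%d %H.%M")
    body = f"{doc_type} from {sender} to {recipient} re {subject}"
    return f"{stamp} {tidy(body)}.{ext}"

name = build_filename(datetime(2009, 5, 4, 13, 59), "Message", "X", "Y", "Z", "pdf")
print(name)
```

Putting the date first in Year-Mo-Da order means that an ordinary alphabetical sort of the filenames is also a chronological sort.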

I also had to think about Santaguida's seventh principle. He recommended putting surname first, first name second: "Smith, Roger" rather than "Roger Smith." Actually, in his approach, it was "Smith-Roger." It seemed to me that there were some reasons not to do it that way, at least in my system. One was that, either way, I would occasionally have to sweep filenames (at least new filenames) for consistency, to catch and rename those newly created documents where some variation appeared (e.g., "Smith, R.P.," "Smith, R P," "R Smith"). There didn't seem to be any difference between one approach and the other in that sense. What seemed more practical was to use whatever name I actually tended to use for someone, so that I would be most likely to name the file correctly the first time, when I was thinking about something other than my file naming scheme. Typically, this would be along the lines of "Roger Smith" -- which would also have the advantage of eliminating extra hyphens and commas.
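A consistency sweep of that kind could be sketched in Python as a simple lookup from known variants to the preferred form. This is only a sketch under stated assumptions: the CANONICAL table and the canonical_name function are hypothetical, and the table would have to be filled in by hand with whatever variants actually turned up.

```python
# Hypothetical lookup: each variant spelling (lowercased) maps to the preferred name.
CANONICAL = {
    "smith, r.p.": "Roger Smith",
    "smith, r p": "Roger Smith",
    "r smith": "Roger Smith",
    "smith-roger": "Roger Smith",
}

def canonical_name(name):
    """Return the preferred form of a personal name, or the name unchanged if unknown."""
    # Lowercase and collapse runs of whitespace so trivial differences still match.
    key = " ".join(name.lower().split())
    return CANONICAL.get(key, name)

for variant in ["Smith, R.P.", "Smith, R P", "R Smith", "Jane Doe"]:
    print(variant, "->", canonical_name(variant))
```

Names not in the table pass through unchanged, so the sweep never invents a name; it only flags nothing and fixes what it has been told about.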

Once I had such ideas in mind, I went through the list of files, using Excel to generate batch commands to rename many files at once. Where the files had one of the structures shown above, I was able to use FIND and other text functions to segregate certain elements (e.g., Author, Date) into separate columns, and then use unique filters and other tools to eliminate variations (e.g., in personal names).
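The same idea can be sketched in Python rather than Excel, as a rough equivalent and not the method actually used here: a regular expression splits a filename in the first format into its elements, one element is corrected, and a Windows ren command is generated for the batch file. The pattern and the helper names (split_elements, rename_command) are assumptions for illustration, and the pattern assumes a one-word DocType.

```python
import re

# Assumed pattern for "YYYY-MM-DD HH.MM DocType from Sender to Recipient re Subject.ext".
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}\.\d{2}) "
    r"(?P<doctype>\S+) from (?P<sender>.+?) to (?P<recipient>.+?) "
    r"re (?P<subject>.+)\.(?P<ext>[^.]+)$"
)
TEMPLATE = "{date} {time} {doctype} from {sender} to {recipient} re {subject}.{ext}"

def split_elements(filename):
    """Split a filename into its named elements, or return None if it does not fit."""
    match = PATTERN.match(filename)
    return match.groupdict() if match else None

def rename_command(old, new):
    """Build a Windows 'ren' command; the quotes handle spaces in the filenames."""
    return f'ren "{old}" "{new}"'

old = "2009-05-04 13.59 Message from R Smith to Y re Z.pdf"
parts = split_elements(old)
parts["sender"] = "Roger Smith"  # correct one element, as if editing one column
new = TEMPLATE.format(**parts)
print(rename_command(old, new))
```

Files that do not match the pattern come back as None, which makes it easy to set them aside for manual renaming instead of guessing at their structure.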



Hello Ray,
Your concern about databases that go on the fritz is valid; for that reason eXadox does not employ a proprietary database. It takes advantage of Microsoft's standard file system, NTFS. It is therefore possible to locate any eXadox-named documents using standard Microsoft tools such as Windows Explorer. If the standard Microsoft search tools do not work, there are bigger problems.

Regarding the use of complex hierarchical folder structures, eXadox can work that way also; undue complexities can be avoided by adding more info in the file name. The benefit is that the file can stand more on its own with an information-rich file name.
Vincent Santaguida