Saturday, February 18, 2012

Saving Disk Space; Finding Types of Files to Shrink

I wanted to save drive space.  The first line of attack was to use a freeware program to find large files.  Among the various possibilities, WinDirStat, TreeSize, and SpaceSniffer seemed to be the most positively reviewed and/or familiar to me.  Among those three, TreeSize and WinDirStat used a directory listing to indicate the largest folders, while SpaceSniffer and WinDirStat used a graphic approach to highlight large individual files.  In other words, WinDirStat offered both.  The graphic approach led me to multiple space-saving solutions faster than the TreeSize approach.  Between the two graphic programs, I found SpaceSniffer's graphic presentation more readable and zoomable than that of WinDirStat, except that WinDirStat made it easier to tell, right from the start, whether a folder was large because it had many small files or a few large ones.  SpaceSniffer also offered a full right-click Windows Explorer context menu, while the right-click options in WinDirStat were more limited.

This sort of program was good for quickly locating large space hogs.  The concept, which I pursued to some extent, was that you would probably free up the most space most quickly by homing in on very large files or folders.  But once I got past the point of being impressed by the graphics, I noticed certain drawbacks.  One was disorientation, especially in SpaceSniffer:  its method of zooming in on a particular folder was not the exact opposite of its method of zooming out.  It seemed like I was coming out by a route different than the one I had gone in on.  So it could be hard to build an intuitive sense of where you were located, with respect to the drive or directory as a whole.  Another drawback was that I could not compare across partitions.  I had to back out and start over, or run a different session of the program, to see clearly that I should be focusing on one drive rather than another.  Another missing dimension of coherence:  file type.  I might never know, from looking at the graphic maps, that I was consistently making PDFs or JPGs that were larger than they needed to be.  Even a grossly large PDF could escape notice when nestled among AVIs ten times its size.  There was also no logging or comparison capability that might alert me to the fact that a certain folder had been growing more rapidly than I would have expected -- or, for that matter, that a certain folder had disappeared since the last time I ran the comparison.

I thought that I might be able to capture the chief benefits of these programs -- that is, their ability to draw attention to large files -- while also adding at least some of those missing ingredients (though admittedly I could not compete with their graphics).  What I wanted, for this purpose, was a simple file list that I could sort according to selected criteria, particularly folder size, file date, file size, and file extension.  This approach seemed likely to require a much larger time investment up front; but once I had the file list, it seemed I would be able to do quite a bit of analysis and revision of files and processes, so as to root out a variety of kinds of disk space waste.  In short, I was seeking a systematic approach that would minimize my preoccupation with the same few large files (which may have had to remain large for good reason), turning my attention instead to other areas where I could make an impact on drive bloat.

The approach I took was to develop a somewhat automated method of generating a list of files across multiple partitions, and then put that list into a spreadsheet.  It took some work to figure out a way to automate the production of a file list that would work simply in a spreadsheet.  The problem was that simple batch commands, with which I had some familiarity, did not want to put all relevant information about each file on a single line.  (While I was particularly interested in the full path, file size, and date, others might have wanted to draw on some other available types of file information, such as file attributes.)
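
For readers without TCC/LE, the same one-line-per-file idea can be sketched in Python. This is only an illustration, not the method used in this post: the function names and the tab-separated output format are assumptions of the sketch.

```python
import os
import time

def list_files(roots):
    """Yield (date, size_in_bytes, full_path) for every file under each root."""
    for root in roots:
        for folder, _subdirs, names in os.walk(root):
            for name in names:
                path = os.path.join(folder, name)
                try:
                    info = os.stat(path)
                except OSError:
                    continue  # skip files that vanish or cannot be read
                date = time.strftime("%Y-%m-%d", time.localtime(info.st_mtime))
                yield date, info.st_size, path

def write_listing(roots, out_path):
    """Write one tab-separated line per file: date, size, full path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for date, size, path in list_files(roots):
            out.write(f"{date}\t{size}\t{path}\n")
```

Something like `write_listing(["D:\\", "E:\\"], "file_list.tsv")` would cover two partitions in one pass (the output file name here is hypothetical). A tab-separated file has the side benefit of opening directly as a spreadsheet.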

As described in another post, I found the desired solution by installing TCC/LE and running its PDIR command in a batch file. This operation required two parts. First, I had to work up the command that I would run to make everything happen. This command would start TCC/LE and would tell it to open a batch file. To find where TCC was, I used the Properties > Target path I found in a right-click on its Start Menu icon. To see if this would all work, I generated a little batch file called Test File.bat. Then I ran this:

"C:\Program Files\JPSoft\TCCLE13x64\tcc.exe" "D:\Current\Test File.bat"

It worked. Test File.bat ran. So I went ahead and replaced the second part of that command with the name and location of the real batch file that I wanted to run. I called it ListAllFiles.bat, and I put it in a permanent location with other batch files that I would run for various purposes. (For the moment, it was an empty file, but that would change momentarily.) I also made a copy of the Start Menu shortcut that would run TCC. I modified that shortcut's Properties so that its Target line contained the information just described: the path to tcc.exe, and then the path to ListAllFiles.bat, each in quotation marks as shown in the line quoted above. Now I could double-click on that icon to make TCC run ListAllFiles.bat, instead of having to look up the proper command syntax. Once that was done, I just needed to make sure ListAllFiles.bat said what I wanted. I was still working on that, but for now it looked like this:

@echo off
echo For cleanest results, empty the Recycle Bin before proceeding.
echo Notepad may take several minutes to display a large list.
echo Be patient ...
:: PDIR requires TCC/LE to be installed

pdir D:\ /s /(dy-m-d zc fpn) > "D:\Current\List of All Files on D and E.txt"

pdir E:\ /s /(dy-m-d zc fpn) >> "D:\Current\List of All Files on D and E.txt"

start notepad.exe "D:\Current\List of All Files on D and E.txt"

The last three command lines (double-spaced for clarity, in case the blog wraps them) were where the action was; everything before that just displayed a few informational notices and a comment about PDIR that would not be visible when the batch file ran. I opened the resulting text file, "List of All Files on D and E.txt," in a LibreOffice Calc spreadsheet, since LibreOffice (alternatively OpenOffice 3.3) could accommodate a large number of rows. When I did that, LibreOffice treated it as a text file and opened it in Writer rather than Calc, but I just copied and pasted from there into Calc. I told it that the imported text had fixed-width columns, and I pointed out where those columns were.  LibreOffice Calc crashed repeatedly during this process -- I had to remember to save frequently -- but in the end it came through.
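
As an alternative to marking fixed-width columns by hand, the listing could be split into fields with a short script. The sketch below is in Python and rests on an assumption about the PDIR output that should be checked against the real file: each line holds a date, a size with thousands separators, and a full path, in that order, separated by spaces. The function name is hypothetical.

```python
def parse_listing_line(line):
    """Split one listing line into date, size (bytes), and full path.

    Assumed layout: "<date> <size with commas> <full path>".
    Returns None for blank lines or lines whose size field is not a number.
    """
    # Split on whitespace at most twice, so spaces inside the path survive.
    parts = line.strip().split(None, 2)
    if len(parts) < 3:
        return None
    date, size_text, path = parts
    try:
        size = int(size_text.replace(",", ""))
    except ValueError:
        return None  # e.g. a directory entry with no numeric size
    return {"date": date, "size": size, "path": path}
```

Limiting the split to two breaks is the point of the design: a path such as `D:\Current\Test File.bat` contains a space and has to stay in one piece.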

So I had my table. I added a column to display the extension, extracted it using the RIGHT function (using different length values to extract extensions of different lengths, e.g., .html, .px), and saved the extension in an adjacent column (i.e., undisturbed by those calculations of varying extension length).  I added a column in which I could record the date on which I had last examined the file to see whether it was currently feasible to shrink it, and another column for notes.  For instance, I had a 6GB zip file that I couldn't get into right now, because there were things I needed to do and learn before I would be ready for the project that its opening would commence.  So I added that "Last Examined" column.  For purposes of making a first pass through the files, I could filter out that one, among many others, as not being of further concern right now.
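
Outside a spreadsheet, the varying-length problem with RIGHT goes away if the extension is taken as whatever follows the last dot in the file name. A minimal Python sketch (the function name is hypothetical; `ntpath` is used so that backslash paths are handled the same way on any operating system):

```python
import ntpath

def extension_of(path):
    """Return the lowercase extension of a Windows path, without the dot.

    Files with no extension give an empty string; a dot in a folder
    name is not mistaken for an extension.
    """
    return ntpath.splitext(path)[1].lower().lstrip(".")
```

For example, `extension_of(r"D:\Docs\index.html")` gives `html`, and a file with no dot in its name gives an empty string.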

Now that the spreadsheet was ready, I could do some file sorting.  First, I sorted by date, and also by extension, to find any files whose name or other properties might need to be adjusted, perhaps with the aid of a relevant utility (e.g., SetFileDate, TrID).  Then I sorted by file extension.  I added a Flag column to the spreadsheet, to indicate files that would be worth looking at, to see if I could possibly shrink them.  I put an X in the Flag column for each AVI file.  I suspected that, with or without editing, most AVIs could probably be converted to MP4 or some other more compressed format that would retain about the same apparent quality for a fraction of the space.  Likewise for BMPs (most could probably be JPGs) and WAVs (many could be converted to MP3).  I also flagged all ISOs, which I probably didn't need to keep, and all ZIP (and RAR and 7z) files, because they could hold a lot of unnecessary stuff somewhat removed from notice.
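
The extension-based flagging just described could also be expressed as a lookup table. This Python sketch is illustrative only: the extensions and reasons come from the paragraph above, while the names `FLAG_REASONS` and `flag_reason` are made up for the example.

```python
import ntpath

# Extension -> why the file is worth a second look (taken from the text above).
FLAG_REASONS = {
    "avi": "could probably be converted to MP4",
    "bmp": "could probably be a JPG",
    "wav": "could perhaps be converted to MP3",
    "iso": "probably not needed",
    "zip": "may hold unnecessary files",
    "rar": "may hold unnecessary files",
    "7z": "may hold unnecessary files",
}

def flag_reason(path):
    """Return the reason a file is flagged, or None if it is not flagged."""
    ext = ntpath.splitext(path)[1].lower().lstrip(".")
    return FLAG_REASONS.get(ext)
```

Keeping the reason next to the extension plays the role of the Flag and notes columns together.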

Next, I sorted the spreadsheet in order of declining file size, and flagged all files larger than 500MB.  Within that sort, I filtered for common image extensions (including PDF as well as JPG and PNG) and flagged all files larger than 100MB.  I could see that these steps were going to draw attention to whole folders full of files.  For example, I noticed a folder of mixed MP3 and WAV files, all of which could have been MP3s.  In that instance, I suspected I would probably wind up converting the whole folder at once.
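
The two size thresholds can be written as a single rule. In this Python sketch the 500MB and 100MB limits come from the paragraph above; treating 1MB as 1,048,576 bytes and limiting the image set to PDF, JPG, and PNG are assumptions of the example, and the names are hypothetical.

```python
import ntpath

MB = 1024 * 1024  # assumption: binary megabytes

IMAGE_EXTENSIONS = {"pdf", "jpg", "png"}  # illustrative subset
GENERAL_LIMIT = 500 * MB
IMAGE_LIMIT = 100 * MB

def is_oversized(path, size_bytes):
    """Flag image-type files over 100MB and any other file over 500MB."""
    ext = ntpath.splitext(path)[1].lower().lstrip(".")
    limit = IMAGE_LIMIT if ext in IMAGE_EXTENSIONS else GENERAL_LIMIT
    return size_bytes > limit
```

A lower limit per file type is what keeps a grossly large PDF from hiding among much larger video files.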

I figured I would probably refine this approach, if I went back through it again sometime in the future.  But for right now, these steps resulted in the flagging of about 9% of the total number of files on these partitions -- and those 9% accounted for about 60% of disk usage.  I probably wouldn't have time to work through all of those files individually.  But identifying a few major categories of unnecessarily large files did seem likely to yield some reductions in disk usage that I could achieve through mass conversions and other relatively simple steps.  Finally, while I had concluded that I liked LibreOffice Calc -- it had pretty much stopped crashing as I got more familiar with it, thus probably making fewer finger fumbles -- the focus on just 9% of the files gave me a list short enough to handle in Excel 2003.
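
The two percentages above are simple ratios: flagged files over all files, and flagged bytes over all bytes. A small Python sketch of that arithmetic follows (the function name and the input shape are assumptions; no figures from my own drives are used).

```python
def flagged_share(records):
    """Return (share of files flagged, share of bytes flagged).

    `records` is an iterable of (size_in_bytes, is_flagged) pairs.
    Both shares are fractions between 0 and 1.
    """
    total_count = total_bytes = 0
    flagged_count = flagged_bytes = 0
    for size, flagged in records:
        total_count += 1
        total_bytes += size
        if flagged:
            flagged_count += 1
            flagged_bytes += size
    count_share = flagged_count / total_count if total_count else 0.0
    bytes_share = flagged_bytes / total_bytes if total_bytes else 0.0
    return count_share, bytes_share
```

A small count share alongside a large byte share is the sign that the flags are landing on the files that matter.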



Correction on WinDirStat: it is possible to view multiple partitions simultaneously.


Just received a note that I should have tried Directory Report for this purpose. A quick look didn't reveal how spending $25 on that program would help this project. But I'm open to clarification.


Unlike WinDirStat (which uses treemaps), Directory Report looks just like Windows Explorer, but it always shows each folder's size, both with and without its subdirectories. This display format is easier to use than treemaps.

Also, Directory Report has a window which shows all files in descending size order.

This allows you to quickly drill down to space hogging directories to locate problem files.

Directory Report can save all screens directly to Excel or to a csv file.

You can save just the file name, or many general file properties plus properties specific to AVI, MP3, MSI, MS Office, DLL/EXE, and WAV files.

You can use many filters to scan just the files/directories you want.

Plus, Directory Report is faster than WinDirStat.