Wednesday, December 30, 2009

Sorting and Manipulating a Long Text List to Eliminate Some Files

In Windows XP, I made a listing of all of the files on a hard drive.  For that, I could have typed DIR *.* /s > OUTPUTFILE.TXT at a command prompt, but instead I used PrintFolders.  I selected the option for full pathnames, so each line in the file list was like this:  D:\FOLDER\SUBFOLDER\FILENAME.EXT, along with date and file size information.

I wanted to sort the lines in this file list alphabetically.  They already were sorted that way, but DIR and PrintFolders tended to insert blank lines and other lines (e.g., "=======" divider lines) that I didn't want in my final list.  The question was, how could I do that sort?  I tried the SORT command built into WinXP, but it seemed my list was too long.  I tried importing OUTPUTFILE.TXT into Excel, but it had more than 65,536 lines, so Excel couldn't handle it.  It gave me a "File not loaded completely" message.  I tried importing it into Microsoft Access, but it ended with this:

Import Text Wizard

Finished importing file 'D:\FOLDER\SUBFOLDER\OUTPUTFILE.TXT' to table 'OUTPUTTXT'.  Not all of your data was successfully imported.  Error descriptions with associated row numbers of bad records can be found in the Microsoft Office Access table 'OUTPUTFILE.TXT'.

And then it turned out that it hadn't actually imported anything.  At this point, I didn't check the error log.  I looked for freeware file sorting utilities, but everything was shareware.  I was only planning to do this once, and didn't want to spend $30 for the privilege.  I did download and try one shareware program called Sort Text Lists Alphabetically Software (price $29.99), but it hung, probably because my text file had too many lines.  After 45 minutes or so, I killed it.

Eventually, I found I was able to do the sort very quickly using the SORT command in Ubuntu.  (I was running WinXP inside a VMware virtual machine on Ubuntu 9.04, so switching back and forth between the operating systems was just a matter of a click.)  The sort command I used was like this:
sort -b -d -f -i -o SORTEDFILE.TXT INPUTFILE.TXT
That worked.  I edited SORTEDFILE.TXT using Ubuntu's GEDIT program (like WinXP's Notepad).  For some reason, PrintFolders (or something) had inserted a lot of lines that did not match the expected pattern of D:\FOLDER\SUBFOLDER\FILENAME.EXT.  These may have been shortcuts or something.  Anyway, I removed them, so everything in SORTEDFILE.TXT matched the pattern.
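For reference, here is what those sort switches do (my annotations, written as bash comments; the file names are just the placeholders used above):

# -b  ignore leading blanks on each line
# -d  dictionary order: consider only blanks and alphanumeric characters
# -f  fold case, so "a" and "A" sort together
# -i  ignore nonprinting characters
# -o  write the result to the named file instead of the screen
sort -b -d -f -i -o SORTEDFILE.TXT INPUTFILE.TXT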

Now I wanted to parse the lines.  My purpose in doing the file list and analysis was to see if I had any files that had the same names but different extensions.  I suspected, in particular, that I had converted some .doc and .jpg files to .pdf and had forgotten to zip or delete the original .doc and .jpg files.  So I wanted to get just the file names, without extensions, and line them up.  But how?  Access and Excel still couldn't handle the list.
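In bash terms, the goal was something like the following sketch (not what I actually ran at the time; FILELIST.TXT is a placeholder for whatever listing file is at hand):

# strip the folder path and the extension from each full pathname,
# then print the base names that occur more than once (ignoring case)
sed 's|.*[\\/]||; s|\.[^.]*$||' FILELIST.TXT | sort -f | uniq -di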

This time around, I took a look at the Access error log mentioned in its error message (above).  The error, in every case, was "Field Truncation."  According to a Microsoft troubleshooting page, truncation was occurring because some of the lines in my text file contained more than 255 characters, which was the maximum Access could handle.  I tried importing into Access again, but this time I chose the Fixed Width option rather than Delimited.  It only went as far as 111 characters, so I just removed all delimiting lines in the Import Text Wizard and clicked Finish.  That didn't give me any errors, but it still truncated the lines.  Instead of File > Get External Data > Import, I tried Access's File > Open command.  Same result.

I probably could have worked through that problem in Access, but I had not planned to invest so much time in this project, and anyway I still wasn't sure how I was going to use Access to remove file extensions and folder paths so that I would just have filenames to compare.  I generally used Excel rather than Access for that kind of manipulation.  So I considered dividing up my text list into several smaller text files, each of which would be small enough for Excel to handle.  I'd probably have done that manually, by cutting and pasting, since I assumed that a file splitter program would give me files that Excel wouldn't recognize.  Also, to compare the file names in one subfile against the file names in another subfile would probably require some kind of lookup function.
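Had I gone the splitting route, the standard split utility would probably have produced plain-text chunks that Excel could open, without any manual cutting and pasting (a sketch; the 60,000-line chunk size and the part_ prefix are just illustrative):

# break the big list into pieces of 60,000 lines apiece, named
# part_aa, part_ab, and so on, each under Excel 2003's 65,536-row limit
split -l 60000 OUTPUTFILE.TXT part_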

That sounded like a mess, so instead I tried going at the problem from the other end.  I did another directory listing, this time looking only for PDFs.  I set the file filter to *.pdf in PrintFolders.  I still couldn't fit the result into Excel, so I did the Ubuntu SORT again, this time using a slightly more economical format:
sort -bdfio OUTPUTFILE.TXT INPUTFILE.TXT
This time, I belatedly noticed that PrintFolders and/or I had somehow introduced lots of duplicate lines, which would do much to explain why I had so many more files than I would have expected.  As advised, I used another Ubuntu command:
sort OUTPUTFILE.TXT | uniq -u
to remove duplicate lines.  But this did not seem to make any difference.  Regardless, after I had cleaned out the junk lines from OUTPUTFILE.TXT, it did all fit into Excel, with room to spare.  The import was giving me lots of #NAME? errors, because Excel was splitting some lines into columns in such a way that a character like "-" (which Excel treats as a mathematical operator) came first in a cell but was followed by letters rather than numbers, which did not compute.  (This would happen if, e.g., the split came at the wrong place in a file named "TUESDAY--10AM.PDF.")  So when running the Text Import Wizard, I had to designate each column as a Text column, not General.

I then used Excel text functions (e.g., MID and FIND) on each line to isolate the filenames without pathnames or extensions.  I used Excel's text concatenation functions to work up a separate DIR command for each file I wanted to find.  In other words, I began with something like this:
D:\FOLDER\SUBFOLDER\FILENAME.EXT

and I ended with something like this:
DIR "FILE NAME."* /b/s/w >> OUTPUT.TXT

The quotes were necessary because some file names have spaces in them, which confuses the DIR command.  I forget exactly what the /b and the other switches did, but basically they made the output look the way I wanted.  The >> told the command to put the results in a file called OUTPUT.TXT.  If I had used just one > sign, OUTPUT.TXT would have been recreated each time one of the DIR commands ran, wiping out the results of the earlier commands.  Using two >> signs meant that OUTPUT.TXT should be created if it did not yet exist, but that otherwise the results of the command should be appended to whatever was already in OUTPUT.TXT.
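The distinction works the same way at a bash prompt, where it is easy to demonstrate (OUTPUT.TXT here is just a scratch file):

# ">" truncates the file and leaves only this one line in it
echo first > OUTPUT.TXT
# ">>" appends, so after this the file holds both lines
echo second >> OUTPUT.TXT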

In cooking up the final batch commands, I would have been helped by the MCONCAT function in the Morefunc add-in, but I didn't know about it yet.  I did use Morefunc's TEXTREVERSE function in this process, but I found that it would crash Excel when the string it was reversing was longer than 128 characters.  Following other advice, I used Excel's SUBSTITUTE function instead.

I pasted the thousands of resulting commands (such as the DIR "FILE NAME."* >> OUTPUT.TXT example shown above), one for each base file name (e.g., FILE NAME.*) that I was looking for, into a DOS batch file (i.e., a text file created in Notepad, with a .bat extension, saved in ANSI format) and ran it.  It began finding files (e.g., FILENAME.DOC, FILENAME.JPG) and listing them in OUTPUT.TXT.  Unfortunately, this thing was running very slowly.  Part of the slowness, I thought, was due to the generally slower performance of programs running inside a virtual machine.  So I thought I'd try my hand at creating an equivalent shell script in Ubuntu.  After several false starts, I settled on the FIND command.  I got some help from the Find page in O'Reilly's Linux Command Directory, but also found some useful tips in Pollock's Find tutorial.  It looked like I could recreate the DOS batch commands, like the example shown above, in this format:

find -name "FILE NAME.*" 2>/dev/null | tee -a found.txt

The "-name" part instructed FIND to find the name of the file.  There were three options for the > command, called a redirect:  1> would have sent the desired output to the null device (i.e., to nowhere, so that it would not be visible or saved anywhere), which was not what I wanted; 2> sent error messages (which I would get because the lost+found folder was producing them every time the FIND command tried to search that folder) to the null device instead; and &> would have sent both the standard output and the error messages to the same place, whatever I designated.  Then the pipe ("|") said to send everything else (i.e., the standard output) to TEE.  TEE would T the standard output; that is, it would send it to two places, namely, to the screen (so that I could see what was happening) and also to a file called found.txt.  The -a option served the same function as the DOS >> redirection, which is also available in Ubuntu's BASH script language, which is what I was using here:  that is, -a appended the desired output to the already existing found.txt, or created it if it was not yet existing.  I generated all of these commands in Excel - one for each FILE NAME.* - and saved them to a Notepad file, as before.  Then, in Ubuntu, I made the script executable by typing "chmod +x " at the BASH prompt and then ran it by typing "./" and it ran.  And the trouble proved to be worthwhile:  instead of needing a month to complete the job (which was what I had calculated for the snail's-pace search underway in the WinXP virtual machine), it looked like it could be done in a few hours.

And so it was.  I actually ran it on two different partitions, to be sure I had caught all duplicates.  Being cautious, I had the two partitions' results output to two separate .txt files, and then I merged them with the concatenation command:  "cat *.txt > full.lst."  (I used a different extension because I wasn't sure whether cat would try to combine the output file back into itself.  I think I've had problems with that in DOS.)  Then I renamed full.lst to be Found.txt, and made a backup copy of it.
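Naming the inputs explicitly would have sidestepped that worry altogether (the partition file names here are placeholders):

# no wildcard, so the output file cannot be swept up as an input
cat partition1.txt partition2.txt > full.lst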

I wanted to save the commands and text files I had accumulated so far, until I knew I wouldn't need them anymore, so I zipped them using the right-click context menu in Nautilus.  It didn't give me an option to simultaneously combine and delete the originals.


Next, I needed to remove duplicate lines from Found.txt.  I now understood that the command I had used earlier (above) had failed to specify where the output should go.  So I tried again:

sort Found.txt | uniq -u >> Sorted.txt

This produced a dramatically shorter file - 1/6 the size of Found.txt.  Had there really been that many duplicates?  I wanted to try sorting again, this time by filename.  But, of course, the entries in Sorted.txt included both folder and file names, like this:

./FOLDER1/File1.pdf
./SomeotherFolder/AAAA.doc
Sorting normally would put them in the order shown, but sorting by their ending letters (i.e., reading each line from right to left) would put the .doc files before the .pdf files, and would also alphabetize the .doc files by filename rather than by folder name.  Sorting them in this way would show me how many copies of a given file there had been, so that I could eyeball whether the ultimate list of unique files should really be so much shorter than Found.txt.  I didn't know how to do that in bash, so I posted a question about it.
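One standard bash recipe for that kind of sort (a sketch of the general technique, not the specific answers I received):  tag each line with its file-name portion, sort on the tag, then strip the tag back off.

# put the text after the last "/" at the front of each line, separated
# by a tab; sort; then cut the added field away again
awk -F/ '{print $NF "\t" $0}' Sorted.txt | sort -f | cut -f2-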


Meanwhile, I found that the whole Found.txt file would now fit into Excel.  When I sorted it there, I found that each line was duplicated - but only once.  So plainly I had done something wrong in arriving at Sorted.txt.  From this point, I basically did the rest of the project in Excel, though it belatedly appeared that there were some workable answers in response to my post.
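In hindsight, the likely culprit (this is the standard behavior of uniq, not something I verified at the time) is that uniq -u prints only the lines that occur exactly once, silently discarding every line that has a duplicate; collapsing duplicates down to a single copy calls for plain uniq, or sort -u.

# drops every line that appears more than once - not what I wanted
sort Found.txt | uniq -u > Sorted.txt
# keeps one copy of each distinct line - what I was actually after
sort Found.txt | uniq > Sorted.txt
sort -u Found.txt > Sorted.txt    # same result in one step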
