
Thursday, March 15, 2012

Batch Converting Many Text Files to PDF

I had a bunch of .TXT files that I wanted to convert to PDF.  I had solved this problem previously, but it looked like I hadn't written it out clearly, so that's the purpose of this post.  This explanation includes solutions to several other sub-problems; altogether, the things presented here were useful for solving a variety of problems.

First, I made a list of the files to convert.  My preferred way of doing this was to use DIR.  First, I would open a command window.  My preferred way of doing *that* was to use the "Open command window here" context menu (i.e., right-click in Windows Explorer) option.  An alternative was to use Start > Run > cmd, but then I would have to navigate to the desired folder using commands like CD.

The DIR command I usually used, to make a list of files, was DIR /s /a-d /b > filelist.txt.  (Information on DIR and other DOS-style commands was available in the command window by typing the command followed by /?.  For example, DIR /? told me that the /s option would tell DIR to search subdirectories.)  A variation on the DIR command:  DIR *.txt /s /a-d /b.  The addition of *.txt, in that example, would tell DIR that I wanted a list of only the *.txt files in the folder in question (and its subfolders).  If I wanted to search a whole drive, I'd make it DIR D:\*.txt /s /a-d /b > filelist.txt.  If I wanted to search multiple drives, I'd use >> rather than > in the command for the second drive, so that the results would add to, rather than overwrite, the filelist.txt created by the preceding command.
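For example, if there were also text files on a (hypothetical) second data drive E:, the pair of commands would be:
DIR D:\*.txt /s /a-d /b > filelist.txt
DIR E:\*.txt /s /a-d /b >> filelist.txt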

Using DIR that way could gather files from all over the drive.  Sometimes it was better to gather the files into one folder first, and then run my DIR command just on that folder.  An easy way of finding certain kinds of files was to use the Everything file finding utility, and then just cut and paste all those files from Everything to the desired folder.  For instance, a search in Everything for this:

"see you tomorrow" *.txt
would find all text files whose names contained that phrase.  Cutting and pasting that specialized list into a separate folder would quickly give me a manageable set of files on which I could focus my DIR command.  (There were other directory listing or printing programs that would also do this work; I just found them more convoluted than the simple DIR command.)

Once I had filelist.txt, I copied its contents into Excel (or I could have used Excel to open filelist.txt) and used various formulas to create the commands that would convert my text files into PDF.  The form of the command was like this:
notepad /p textfile.txt
I wasn't sure whether this mattered for Notepad specifically, but I was able to run some programs (e.g., Word) from the command line by typing just one word (rather than, e.g., "winword.exe" or a longer statement of the path to the folder where winword.exe was located) because I had put the necessary shortcuts in C:\Windows.
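As a sketch of the Excel step:  assuming the full path of each text file (from filelist.txt) was in column A, a formula like this in row 2 would build the command, with CHAR(34) supplying the quotation marks needed around paths containing spaces:
="notepad /p "&CHAR(34)&A2&CHAR(34)
That would yield something like notepad /p "D:\Current\Some Notes.txt" (the path here is just an illustration), ready to be copied down the column and then out to a batch file.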

Those Notepad commands would send the text files to my default printer.  My default printer was Bullzip.  When I installed it, it gave me a separate shortcut leading to its options.  For this purpose, I set its options so that it did not open the document after creation (General tab), specified an output folder (General tab), and indicated that no dialogs or questions should be asked (Dialogs tab).

I copied the desired commands from Excel to a Notepad text file and saved it with a .bat extension.  The rest of the file name didn't matter, but the .bat extension was important to make it an executable program.  In other words, if I double-clicked on PrintThoseFiles.bat (or if I selected PrintThoseFiles.bat and hit Enter) in Windows Explorer, the batch file would run and those commands would execute.  (I could also run the batch file from the command line, just by typing its name and hitting Enter -- which meant that I could have a batch file running other batch files.)

So that pretty much did it for me.  I ran the batch file, running lots of Notepad commands, and it produced lots of good-looking PDFs.

Please feel free to post questions or comments.

Tuesday, March 13, 2012

Converting URL-Linked Webpages (Bookmarks, Favorites) to PDF

At some point, I discovered that it was possible to save a link to a webpage by just dragging its icon from the Address bar in an Internet browser (e.g., Internet Explorer, Chrome, Firefox) to a folder in Windows Explorer.  Doing that would create a URL file with two bits of information:  the location of the webpage, and the name by which I referred to it.  For example, I might have renamed one of those URL files, pointing at some random CNN.com webpage, to be "July 30 Article about China."  I had also gotten some URL files by dragging Firefox bookmarks to a Windows Explorer folder.

I now had a bunch of those URL files.  I wanted to print the linked webpages in PDF format, using those given names as the names of the resulting PDFs.  And I wanted to automate it, so that I could just provide the set of URL files, or something derived from it, and some program would do the rest.  Input a dozen URL link files; output a dozen PDFs, each containing a printout of what was on the webpage cited in one of the URL files.

Part of the solution was easy enough.  On the command line, I could type two commands:

echo China Story.url > filelist.txt
type "China Story.url" >> filelist.txt
Note the seeming inconsistency in quotation marks.  (I would have to continue using >> rather than > with any subsequent additions, so as not to overwrite what was already in Filelist.txt.)  Filelist.txt would then contain three lines, like this:
China Story.url
[Internet Shortcut]
URL=http://www.cnn.com/Chinastory.html
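Instead of typing that echo/type pair for every URL file, a batch file with a FOR loop could build filelist.txt in one pass.  A sketch, assuming it was run from the folder containing the URL files (on the command line, the double percent signs would become single ones):
for %%F in (*.url) do (
echo %%F >> filelist.txt
type "%%F" >> filelist.txt
)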
I could do a simple find-and-replace to get rid of the [Internet Shortcut] part, and manipulate the rest of it in a spreadsheet like Microsoft Excel, using DIR to produce the filenames.  So the spreadsheet created from filelist.txt could have a series of commands whose concept would be like this:
print webpage located at URL as PDF and call it China Story.pdf
So then I needed a PDF printer that would print webpages from the command line.  I had previously searched for something similar and had wound up using wkHTMLtoPDF (which they actually wrote as just all lower-case:  wkhtmltopdf, where the "wk" apparently stood for "webkit").  It had been somewhat complicated.  I decided to look for something else.  There followed a time-consuming search that brought me back to wkHTMLtoPDF.

As before, wkHTMLtoPDF seemed vastly more capable than the alternatives I had been looking at.  After the initial reluctance that drove me to look at those other possibilities, I realized that, this time, due to the different nature of the project, the wkHTMLtoPDF part might not be complicated at all.  Among the many options available in the help file ("wkhtmltopdf --help" on the command line), it looked like I might get acceptable results from a command like this:
start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 "http://www.nytimes.com/" "D:\NY Times Front Page.pdf"
That command, which I would have to run from within the folder where wkhtmltopdf was installed, would specify a Size of letter-sized paper, a Top margin of 25 (apparently millimeters, the default unit, unless something like 0.25in was specified), and so forth.  I just needed to come up with my list of URLs and their names (above).  I noticed that some of those items were really long, so in Excel I added a column to calculate their lengths (using LEN) and edited some of them.  I also had to run tests (i.e., Excel formulas looking at the contents of preceding rows) to verify that I had exact alternating pairs:  one file title followed by one URL.  In some instances, somehow the file name had come over in corrupted form, so I had to add some rows to account for all the URLs.  Among other things, ampersands (&) in URL file names seemed to result in some confusion; I'd have been better off to replace them in advance with "and."  This cleanup took longer than expected.  If I were doing a truly huge number of URLs, I would probably have wanted to begin with a dry run of a hundred or so, to test it out and see if there were any other preliminary steps that might ease things along.
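To generate those command lines, an Excel formula along these lines would do it, assuming the cleaned-up page title was in column A, the URL in column B, and that the PDFs should land in a (hypothetical) D:\PDF Output folder:
="start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 "&CHAR(34)&B2&CHAR(34)&" "&CHAR(34)&"D:\PDF Output\"&A2&".pdf"&CHAR(34)
Copying that formula down the column would give one wkhtmltopdf command per webpage, ready for the batch file.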

So I assembled my batch file lines and ran them in a batch file.  They worked for the first 42 PDFs.  Then I ran into a problem.  Wkhtmltopdf gave me this error and said it was going to close:
Warning: Javascript confirm: You need Adobe Flash Player 8 (or above) to view the charts.  It is a free and lightweight installation from Adobe.com.  Please click on Ok to install the same. (answered yes)
Error: Failed loading page.
The error message gave the URL and then said, "Sometimes it will work just to ignore this error with --load-error-handling ignore."
I was going to rewrite my batch commands, putting "--load-error-handling ignore" at the end of each.  But when I clicked Close, the program just kept on running.  I was closing wkhtmltopdf only for that particular session (i.e., for just one URL being PDFd); I wasn't closing the batch file.  So I let it run, helping it over another one or two similar speed bumps along the way, figuring that I would catch any failed PDFs in the post-game review.  "Speed bumps" is perhaps the wrong metaphor.  It was not fast.  Then again, neither was my Internet connection; that may have been the slowing factor.  In one case, there was no error message; it just got up to 57% completion in PDFing a URL and then stopped.  I hit Ctrl-C.  This made no difference in the active command window, so I killed it.  That may have been dumb, or maybe it was just the ticket.  Back in the other running command window, where I had started, I now had an option to terminate the batch job.  I said no.  This happened a couple more times.  It looked like it might be a problem with the pages on a particular website. 

After a while, it was finished.  Now I wanted to see what it had done.  The first step was to do a count and, if necessary, a comparison of the URL files being examined against the PDFs output.  I had 507 URL files but only 489 resulting PDFs.  Some of the missing ones corresponded to pages that no longer existed; apparently webmasters had made changes between the time when I created the URL files and now.  Some were evidently due to a malfunction somewhere; they were available when I revisited them manually.  Others were PDFs (as distinct from HTML webpages) that apparently did not work in this process.  I had to PDF a few webpages by hand.  Then I used Adobe Acrobat to combine the resulting individual PDFs into a single PDF.  Its merge process would treat the names of the individual files as bookmarks, saving me that labor.  In the final wrapup, there were some imperfections, but on this first run it appeared that the basic process was successful in converting large numbers of webpages into a PDF.

Monday, March 12, 2012

Windows 7: Improving Command Window

I was looking for advice on how to edit the Windows 7 registry to make permanent changes to the command window (cmd.exe).  I went into its Properties and made changes (to e.g., the colors and fonts), and I added parameters to the command line that would open a command box (saved in e.g., a desktop shortcut).  I wasn't seeing clear advice on how to edit the registry to make those changes permanent -- assuming that such a registry edit was possible.

During this search, I ran across a reference to Console.  The purpose of this program was apparently just to make the command window (or PowerShell, or other command interface) more adjustable and appealing.  Console version 2.0 (confusingly cited by some as Console2) appeared to have drawn a few positive reviews.  I downloaded and ran the portable version 2.00b (beta).  In its menu, I went into View, where Tabs and Status Bar were the only items I left checked.  (The others were available by right-clicking anywhere in the Console window.)  Then, adjusting some advice in light of Console Help (i.e., console.chm in the portable program's folder), I went into its Edit > Settings and did the following:

  • Console:  Startup dir:  set to my working folder, D:\Current.  Window size:  80 rows, 80 columns, save on exit.  Buffer size:  1000 rows, 0 columns.  Save settings to user directory:  not checked, since I wanted the settings to be saved to the folder where they were loaded (i.e., the customized Start Menu folder that would persist even if I had to reinstall Windows).  Console colors map:  click on the black box next to the black square and change everything that would have been black to be white (or whatever) instead.
  • Appearance:  Font:  Consolas size 11, bold, Clear Type smoothing, Custom color.  Position:  docking:  none.  Snap to desktop edges:  off.
  • Hotkeys:  New Tab 1 = Ctrl-T.  (Actually hit Ctrl-T and then click Assign.)  Copy Selection = Ctrl-C.  Paste = Ctrl-V.  Mouse:  Select text:  Left.
I created a shortcut to Console.exe.  I put one copy of that link in the Start Menu.  On reboot, it functioned as desired.

I let some time pass.  As it turned out, my changes to the Windows 7 command window proved to be permanent, even without registry edits.  And I liked it as it was.  I was not actually using Console.  It was an option to remain aware of, but it appeared unnecessary for my purposes at present.

Friday, March 9, 2012

Windows 7: "Access Denied" on the Command Line When Using FIND

I tried to run a FIND command in the CMD (sometimes called the DOS) box in Windows 7.  I got an error message:

Access Denied - D:\FOLDER\SUB FOLDER
This was bizarre.  I had been using the command line in this Win7 installation for months.  Well, something had apparently changed.  I got this message regardless of whether I used my pre-installed "Open Command Window Here" right-click option in Windows Explorer or the Administrator Command Box option I had created in the Start menu.

A search led me through various possible solutions.  I tried especially to tweak the permissions for the entire partition and for the individual folder via right-click > Properties > Security tab > Advanced > Owner tab > Edit > select Administrators and check "Replace owner on subcontainers and objects," and to play with other tabs and settings in that vicinity.

One option I hadn't seen previously was to open the Local Group Policy Editor (Start > Run > gpedit.msc) and go into Computer Configuration > Windows Settings > Security Settings > Local Policies > Security Options > right-click on "User Account Control:  Behavior of the elevation prompt for administrators in Admin Approval Mode" > Properties > Elevate Without Prompting.  But I already had that selected.

That writeup raised the question, though:  had I not completely disabled User Account Control (UAC)?  As advised, I went into Start > Run > Regedit and navigated to HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Policies\System.  There, I right-clicked on EnableLUA and verified that the value was zero.  Nonetheless, I exported the tweak, extracted the relevant lines from the REG file, and added them to my Win7RegEdit.reg file for future installations, just to be sure.
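For reference, the relevant lines in that REG file looked something like this (the standard REG file format for a zero EnableLUA value):
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Policies\System]
"EnableLUA"=dword:00000000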

When I was in Permissions (via Windows Explorer folder right-click > Properties > Security tab > Edit), I noticed that the permissions checkboxes were checked but greyed out.  I couldn't change them.  Was I somehow not in the right account?  A search led to a proffered batch file (Permissions.zip) that read as follows:
@echo off
title File/Folder Permissions
echo.
echo.             BY KAOS - Windows 7 Forums
echo.
echo.
set /p a=Enter Path Of File/Folder:
echo.
set /p b=User Name:
echo Type "deny" = remove access OR type "grant" = allow full access
set /p c=Permission Type:
echo.
echo.
echo.
if %c%==deny goto lock
if %c%==grant goto unlock
if %a%==menu goto start2
if %b%==menu goto start2
if %c%==menu goto start2
:lock
cacls %a% /e /d %b%
cls
exit
:unlock
cacls %a% /e /g %b%:f
cls
The gist of this batch file seemed to be that I could try a CACLS command of this form:  cacls [folder path] /e /g [username]:f, where /e meant "edit," /g specified the user, and :f gave full control.  So, in my case, what username would I use?  Control Panel > User Accounts said that I had set up my system so that the only options were Ray (Administrator) and Guest (turned off).  Maybe this was the problem:  I had been thinking in terms of the Administrator account rather than the Ray account.  So, OK, I tried typing this on the command line, to grant full control to the whole drive:
cacls D:\ /e /g Ray:f
In response to that command, I got a short message:  "processed dir: D:\".  Did that mean champagne?  I retried the command that had given me the "Access Denied" error.  No, still denied.  Alright, how about the same command with a different user:  Administrator.  Same output.  Still denied.  Baffling!

I ran across a post that said something about turning off simple file sharing and then redoing permissions.  It raised a question:  was there a way to reset permissions to the default, and start over?  For the drive (i.e., right-clicking on D: in Windows Explorer), I went into Share with > Advanced sharing > Sharing tab > Advanced Sharing.  I unclicked Share This Folder > Apply.  I got a note indicating that I had some files currently opened, and they would be closed.  I clicked Yes > OK > Close.  This, in itself, didn't have any effect on another try of the FIND command; still denied.

I had recently gotten an indication that the Recycle Bin on drive D was corrupted.  I had said go ahead and empty it.  That message had recurred.  I had also been getting bothersome messages when I tried to move or delete folders, telling me that these folders were shared and confirming that I really wanted to do what I had said I wanted to do.  It seemed that these problems might be related.  But how, and what could I do about it?

I found a Windows XP article from 2002 that said, "If you don't have a thorough understanding of the various permissions and their relationships, it can be nearly impossible to sort out a permission problem and find a solution."  So I could see how Windows 7 was a direct descendant from Windows XP:  both could make it impossible to get any work done.  The article said that there was a difference between sharing permissions and NTFS permissions, and that the more restrictive one wins.  So if I wanted to grant full control to everyone for everything, I had to do that in two different right-click > Properties tabs:  the Sharing tab and also the Security tab.  But it really looked like I had done that, over and over again.

Ah, but now I saw a new problem.  In the Security tab, I saw that I had a little red circle with an X in it, next to the Administrator group.  There was no right-click option to explain it.  I guessed that the problem was that I had entered the wrong CACLS command, regarding Administrator rather than Administrators (plural).  So that was interesting.  I clicked on Edit, selected Administrator, and clicked Remove.  Then I re-ran that CACLS command with a reference to Administrators, this time, instead of Administrator.  But still no joy on my FIND command:  Access denied.

So anyway, as I was saying, it did seem that I had given full control to almost everyone listed in the Security tab.  I mean, literally, Everyone:  I had an entry for them, and they had full control, and so did SYSTEM, Ray, and Administrators.  But not Authenticated Users, and not plain old Users.  Who the hell were all these people, anyway, and why did we all need to have so many kinds of access to my computer so that I could get work done?  (Sigh.)  Wiser minds knew; I did not.  Anyway, I went ahead and gave full control to my whole world to everyone and his brother, Users and all.  And still the godforsaken command did not run.

And, by the way, at this point I searched in vain for those greyed-out permissions boxes I had seen earlier.  Evidently I had altered something significant, in all this screwing around.  Not so significant as to actually let me get any work done, but significant certainly in the sense that I could no longer detect greyness when I searched therefor.  Not in the Security, nor back in the Sharing tab.  Speaking of which, I now saw that my reestablished share of drive D now had permissions only for Everyone.  Did Everyone include me and all the other Administrators and Users and Authenticated Users of my home system (I was living alone), or did I need to add the whole gang back to my computer?  Not sure.

It occurred to me that I did have a solution.  It was called System Restore.  But, alas, the mere fact of telling Windows to keep system restores, accompanied by weekly checking to make sure that the task was really running as scheduled, did not necessarily mean that I would actually have recourse to any system restore points, other than the one created that very morning.  Apparently Windows was not content with the 10GB of disk space I had set aside for this purpose.  Fortuitously, I did have an Acronis drive image from a week or so earlier, and so, without further ado, I wiped the drive and restored that.  Did there exist any further difficulty?  Yes, there did.  My Acronis backup was too recent.  Apparently this problem had lurked for days and/or was not only, or primarily, a matter of drive C (stored in Acronis) as distinct from drive D (not backed up in Acronis).

I tried a different command that, I knew, I had run within recent days:  DIR.  It ran.  Now, why would DIR run and FIND not run?  FIND took a look inside files; were my permissions of some type not reaching into the files?  I right-clicked on the files in question.  They didn't have sharing or security options collectively; I had to click them one at a time to get a Security tab.  It said everyone had full control.  The Advanced > Owner tab option said the owner of at least one file was Administrators.  Anyway, the CACLS command was supposed to take care of user account issues.

I tried the same FIND command on another computer of virtually identical configuration.  It, too, provided a FIND error.  A search led to a brilliant insight:  my command was wrong.  I was trying to use FIND on a directory, when it only works on files.  I had to make one change:  I had to add a star (asterisk) to the end of my search.  The FIND command worked without error when I did it this way:
find /i /c "X-Message-Delivery:" "D:\Folder\Sub Folder\*"
Solution:  operator error.  Case closed.

Friday, February 17, 2012

Windows 7: Finding a DIR Alternative

I needed a DIR-type listing that would provide extended information about a file:  its name, date, and size, and also its path (i.e., the folder and subfolder where it was located), all on a single line of output.  DIR didn't seem to be capable of this, and neither did the utilities I found with a search (e.g., Karen's Directory Printer).

Another search raised the possibility that certain Linux utilities brought over to Windows might have this kind of capability.  I didn't want to run a Linux virtual machine on Win7; I just wanted to be able to run Linux commands that might add functionality I wasn't getting in Windows 7.

Linux commands were probably not the only alternative.  For instance, I could have learned how to use Windows PowerShell scripts.  My general impression of the Microsoft approach (as in the contrast between original BASIC and VB) was that, unfortunately, something that could be done with one relatively simple command in another tool would require three or four lines of code, which I would be able to write only after mastering a handful of relatively abstruse programming concepts, in the Microsoft product.  This impression seemed borne out when a search led to an indication that the DIR equivalent in PowerShell would require a multiline FOREACH loop.

Preliminary inquiries gave me the impression that Cygwin sought to provide a subsystem that would emulate a Linux machine within Windows.  There were indications that other projects (e.g., MSYS) sought to provide a somewhat comparable (e.g., 110MB) environment.  These seemed a tad heavy for my purposes; I was looking for something more like GnuWin, which was described as relying "only on libraries provided with any standard 32-bits MS-Windows operating system" and as not needing any Unix emulation.  Ideally, I would have some cool, relatively simple Linux-like commands available at the Windows command prompt.

By this point in my investigation, several people had mentioned CoreUtils.  This turned out to be a package within GnuWin.  The CoreUtils homepage described it as "the basic file, shell, and text manipulation utilities of the GNU operating system."  GNU was "a Unix-like operating system," in development since 1983, that apparently provided most of the materials used by Linux (which was, in turn, the basis for Debian Linux, from which Ubuntu was built).

To clarify, it appeared that the CoreUtils existed in GNU, and there was an offshoot called CoreUtils for Windows.  Apparently this was what I would be getting through GnuWin.  There were other approaches to this sort of thing (e.g., Gow, UTools, UnxUtils), but my sense at this point was that GnuWin was dominant in this category.

I looked at the list of tools included in CoreUtils (for Windows).  I didn't count them, but I thought I remembered seeing an indication that there were more than 100 of them.  They were grouped into three main categories:  file utilities, text utilities, and shell utilities.  In the file utilities group, the description of the ls command was simply "lists directory contents"; and vdir would apparently provide a "long directory listing."  These sounded like what I needed.  Examples in the text utilities category included comm ("compares two sorted files line by line") and uniq ("remove duplicate lines from a sorted file").  Examples in the shell utilities category included sleep ("suspends execution for a specified time") and uname ("print system information").

Although I could have just clicked on a download link, I went into the folder for the latest version and saw that it had not been updated since 2005.  This made me wonder whether I should have opted instead for Gow (short for GNU on Windows), which had apparently been updated as recently as November 2011.  I found a spate of brief summaries of Gow published about that time.  Their similarities raised the thought that they may have been written from similar press releases.  Not that that would necessarily be bad.  Any product being promoted in 2011 could count as fresh air against a 2005 alternative.  But it was not reassuring that none of these explained clearly whether Gow was genuinely different, or just a borrowing, from the seemingly better-documented and more widely used GnuWin.  I found a page stating that Gow had been developed by a corporation in 2010 and used for some years before being released as open source.  This appeared to be an authoritative page.  It puzzlingly characterized GnuWin as being appropriate "if you want just one or two utilities."  A list of Gow utilities seemed similar, at a glance, to the GnuWin list (above), though I noticed that it did not have vdir.  The seeming mischaracterization of GnuWin, combined with the sense of evasion in the press-release writeups, persuaded me to stick with Plan A.

So now I did download and install the executable (exe; not src.exe) version of CoreUtils (6MB).  But, weird thing, they didn't give me a way to run the program.  My Start Menu had links to several PDFs.  Actually, it was rather messed up: they gave me four shortcuts to a total of two PDFs, and some of those links were buried about five layers deep in superfluous subdirectories. They also gave me two links to CoreUtils Help files that, when I clicked on them, gave me the familiar "Why can't I get Help from this program?" message that Windows 7 kindly provided when I would try to run Help files written for Windows XP.

Obviously, I ignored the manuals' actual contents and went looking for a way to run the program.  Weird thing:  I had all these redundant and dysfunctional help materials, and a link to an Uninstall routine, but no actual "Run CoreUtils" shortcut. I was half-tempted to uninstall them as defective, when it occurred to me that, well, they're supposed to be run from the command line, not the Start Menu.  So, OK, I went to the command line and typed "ls."  Windows said, "'ls' is not recognized as an internal or external command, operable program or batch file."  Hmm.  The manual, then, if I must.  Or manuals, I should say:  a regular-looking manual and also what appeared to be the set of Linux MAN (i.e., manual) pages, both in PDF format.  Neither had installation instructions.  I went to the ls MAN page.  It seemed to say that "ls -a" would be a working command.  Well, not on my machine, it wasn't.

I rooted around and found an article on how to use CoreUtils.  It said that I would have to adjust the PATH environment variable to tell the system where to look for the CoreUtils command instructions.  My way of applying those instructions was as follows:  first, in Windows Explorer, find where the CoreUtils executables (e.g., ls.exe) were installed.  On a 32-bit Windows 7 system, the location would probably be C:\Program Files\GnuWin32\bin; on a 64-bit system, C:\Program Files (x86)\GnuWin32\bin.  With that folder selected, click on the address bar at the top of Windows Explorer, make sure the whole address was highlighted (Ctrl-A if necessary), and copy the address (Ctrl-C).  Now I went to Start > Run > SystemPropertiesAdvanced.exe (could have used sysdm.cpl and then the Advanced tab) > Environment Variables > System Variables > highlight Path > Edit > hit the End key.  There, I typed a semicolon (";") and then pasted in what I had copied from the Windows Explorer address bar.  (I could have typed it manually, using the 32-bit or 64-bit address just shown, but this was more accurate and it also forced me to verify the actual location.)  I OKed out of there and tried ls -a again on the command line.  Did I have to reboot to make the Path take hold?  Yes.  That was it.  I had ls, and it listed files.
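As a quicker test, without editing the system Path or rebooting, it should also have been possible to adjust the Path for the current command window only (this assumes the default 64-bit install location; the change lasts only until that window is closed):
set PATH=%PATH%;C:\Program Files (x86)\GnuWin32\bin
ls -a
The Environment Variables edit described above was still what made the change permanent.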

So now, how about getting all that information mentioned at the outset -- path, date, etc., all on one line?  First question:  how could I get command-line command help?  In Windows, it was DIR /?.  But the /? option gave me an error with ls.  "man ls" didn't work either.  Page 9 of the manual PDF said the MAN pages were no longer being maintained.  I wasn't sure if that applied to what looked like the MAN pages included with GnuWin.  There wasn't a MAN MAN page in that PDF.  Page 10 said --help might work.  I tried "ls --help" and experienced satisfaction.  What I was seeing there looked like what appeared on pages 50-52 of the man PDF, pages 60-70 of 176 (text pages 52-62) in the more explanatory help PDF.  I wasn't inclined to read 11 pages to figure out how to get my directory listing.  Skimming down through the ls --help output, I tried "ls -l -N -R."  Good, but no cigar:  the path wasn't on the same line as the filename; no improvement over DIR.

The user's guide PDF didn't seem to think that there actually was a way to print the file's path on the same line as its date, filename, etc.  And so there I was.  I had come all this way with faith in my heart for the infinite possibilities of Linux.  I fervently believed that, with GNU, anything was possible.  But now, with my limited knowledge of Linux and such, cruel reality was saying Bismillah, no! we will not let you have all that stuff on one line of output.  There actually probably was a way to do it with some other tool, like the awe-inspiring grep, available in a different GnuWin package.  But I wasn't quite ready to go there.  In this project, grep looked, for me, like a bridge too far.

I thought about posting a question in the GnuWin Help forum.  But there had only been a handful of posts there in the last couple of months.  I also thought about going down the list of other utilities contained in CoreUtilities, so as to demonstrate to myself that this hadn't been a wild goose chase.  I thought about trying Gow after all, just in case its version of ls had different capabilities.  I thought about working up a kludge in which I would do a listing of all directories first (with e.g., "dir /ad /s /b") and then try to invent a way to append the pathname to each file line.
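In hindsight, one plain-cmd kludge along those lines (a sketch; I did not try it at the time) would have been FOR /R, whose variable modifiers can put a file's date (~t), size (~z), and full path (~f) on a single line.  In a batch file (single percent signs on the command line):
for /r D:\ %%F in (*) do echo %%~tF %%~zF %%~fF >> D:\dirlist.txt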

But before pursuing those rather lame possibilities, I noticed TCC/LE, advertised as a complete, powerful replacement for Windows CMD.  (TCC was short for "Take Command Console.")  It got 3.5 stars from 43 voters at Softpedia, only a solitary vote (five stars) at CNET -- but it had apparently been updated there just a few days earlier.  At MajorGeeks, it averaged 4.07 from 38 voters.  The description said it had enhanced commands (specifically including DIR) with new options.  A search didn't encourage the sense that there was a regular category of this sort of thing, with lots of competitors.  I downloaded and installed it.  The installation process seemed pretty slick, ending with a direct ride to their forums.  The installation left me with an open CMD window with a funky prompt, though apparently it was actually their own version of a command window.  (I did have another Win7 command window open throughout the installation.  It remained functional; I was able to close and open a new one after installation.)  I typed Help at their command prompt and went straight into their GUI help dialog, which actually made me say "Wow."  It wasn't spectacular; it was just good, and helpful, which I guess counts as spectacular after a long slog.  I replaced their ugly prompt with the ordinary Windows one by typing "prompt $P$g" at the prompt, though not without first amusing myself with variants (e.g., "Now what?").

Eventually I discovered that their help dialog was more or less the same as their online help page.  The manual had a large number of further instructions on how to tinker with the prompt and, it seemed, everything else.  Typing "option" at the prompt brought up settings, but not an obvious way to preserve prompt settings between sessions; it appeared the answer to that might lie somewhere within their SET command.  Anyway, I found information on their DIR command almost instantly, and also got a cursory version of it by typing dir /? at their prompt.  It led me to PDIR, and there I found the answer I was looking for.  What I had to type in a TCC/LE command window was this:

pdir D:\ /s /(dy-m-d zc fpn) > dirlist.txt
That gave me all of the information I was looking for, on a line-by-line basis, for every file on drive D, output into dirlist.txt.  Specifically, with the options in that sequence, I got the date (y-m-d), size (with commas), and the file path and name.

I took a quick look at their list of Commands by Category.  I also saw that they had a number of video and textual tutorials.  An impressive program.  But in any case, this investigation was done.

Saturday, January 14, 2012

Batch Merging Many Scattered JPGs into Many Multipage PDFs - Second Try

I had previously looked for a way to combine multiple JPG files into a single PDF.  As explained in more detail in that previous post, the specific problem was that I might have sets of several JPGs in a single folder that should be merged into several different PDFs, and there might be multiple folders at various places that would have such JPG sets.  Hence, if I wanted to automate this project across dozens of folders containing hundreds of JPGs, it seemed that I would need a command-line solution rather than a GUI.  This post updates that previous attempt.

There were commercial programs that seemed to offer the necessary command-line capabilities, such as PDF Merger Deluxe ($30) and ParmisPDF Enterprise Edition ($300).  A search suggested, however, that PDFsam might offer a freeware alternative.

Assembling the List of JPGs to Be Converted

Before investigating PDFsam, I decided to get a more specific sense of what I needed to accomplish.  In a command window (Start > Run > Cmd), I navigated to the root of the drive I wanted to search (using commands like D: and "cd \").  (The root was the folder whose command prompt looked like C:\ or D:\, as distinct from e.g., C:\FolderZ.)  Being at the root folder meant that my command would apply to all subfolders on that drive.  Once I was there, I ran this command:

DIR *.jpg /s /b /a-d > jpgslist.txt
That gave me a list of files (but not directories, /a-d) in all subdirectories (/s), listed in bare (i.e., simple, /b) format, saved in a new file called jpgslist.txt.  (For more information on DIR or any other DOS-like command, type DIR /? at the command prompt, or search for informational webpages.)  If I'd had files with a .jpeg (as distinct from .jpg) extension, I would have added a second line, referring to *.jpeg and using the double greater-than sign (>>) to tell the program to add these to the existing jpgslist.txt, rather than creating a new jpgslist.txt (which is what a single > would do).
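That second line would have looked like this:
DIR *.jpeg /s /b /a-d >> jpgslist.txt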

Now I wanted to see which folders had more than one JPG.  I would use Microsoft Excel to help me with this.  I could either open jpgslist.txt and copy its contents into an Excel spreadsheet, or import it into Excel.  In Excel, I did a reverse text search to find the last backslash in each text line, so as to distinguish file names from the directories holding them.  I sorted the spreadsheet by folder name and file name.  I set up a column to detect folders containing more than one JPG, and deleted the rows of folders not containing more than one JPG.  I might still want to do another search and conversion for isolated JPGs at some point, but that would be a different project.
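For the reverse text search, a formula like this one would find the position of the last backslash, assuming the full path from jpgslist.txt was in cell A2 (it swaps the last backslash for a placeholder character and then locates that placeholder):
=FIND(CHAR(1),SUBSTITUTE(A2,"\",CHAR(1),LEN(A2)-LEN(SUBSTITUTE(A2,"\",""))))
With that position in, say, B2, =LEFT(A2,B2-1) would give the folder and =MID(A2,B2+1,255) would give the file name.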

Next, I wanted to see if I could eliminate some folders.  For instance, I might not want to PDF and combine JPGs that were awaiting image editing, or important photos whose quality might get degraded in the PDF conversion.  In other words, I decided that this particular project was just for those JPGs that I was going to combine into a single PDF and then delete.  To get a concise list of folders containing multiple JPGs, I went into Data > Filter > Advanced Filter.  (That's Excel 2003.)  I moved the output into another worksheet.  I could then do a VLOOKUP to automatically mark rows to be deleted.  So that gave me the folders to work on.

Now it was time to decide which files to combine, and in what order.  In some cases, I had named files very similarly -- usually with just an ending digit change (e.g., Photo 01, Photo 02 ...).  So I set up a couple of columns to find the filename's length, subtract a half-dozen characters, and decide whether those first X characters were the same as in the preceding row.  If so, and if both were in the same folder, we had a match.  I discovered, at this point, that one or two folders contained large numbers of files.  I decided to combine those manually.  With those out of the way, it seemed that the next step was to decide the names of the resulting multi-image PDFs (e.g., Medieval Churches.pdf), and to put those names on the spreadsheet rows, next to the individual JPGs that would go into them.
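As a sketch of that same-name test, assuming the folder was in column B, the file name in column C, and names longer than six characters, a formula like this in row 3 would flag a file whose name (ignoring the last six characters) matched the file in the row above it, within the same folder:
=IF(AND(B3=B2,LEFT(C3,LEN(C3)-6)=LEFT(C2,LEN(C2)-6)),"match","")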

At this point, as described in another post, I learned how to use PDFsam to combine several PDFs into one PDF.  So I had a rough idea of the start of my project (i.e., identify the JPGs that I would want to merge into a single output PDF), and I also had a basic sense of the end of my project (i.e., use PDFsam to merge multiple PDFs into that single output PDF).  I was missing the middle part, where I would convert the original JPGs into single-page PDFs and would get them into a form where PDFsam could work on them.

Converting Individual JPGs to Individual PDFs

I had originally assumed that I would start by converting the JPGs to PDFs within the various folders where they were originally located.  So if I had File1.jpg in E:\Folder1, and if I had File2.jpg in E:\Folder2, my conversion would result in two files in each of those folders:  File1.jpg and File1.pdf in Folder1, and File2.jpg and File2.pdf in Folder2.  Then I would use PDFsam to merge the PDFs (i.e., File1.pdf and File2.pdf) from those locations; delete the original JPGs and PDFs; and move Output.pdf to an appropriate location.

I didn't entirely like that scenario.  It seemed like it could make a mess.  As I reviewed another post in which I had worked through similar issues, I decided that a better approach might (1) make a list of original JPG file locations, (2) move those JPGs to a single folder where I could convert them to individual PDFs, (3) merge the individual PDFs into concatenated PDFs, (4) delete the individual JPGs and PDFs, and (5) move the concatenated PDFs to the desired locations.  I decided to try this approach.

One problem with moving files from many folders to one folder was that there might be two files with the same name.  They would coexist peacefully as long as they were in separate folders; but when they converged into one target folder, something would get overwritten or left behind.  It seemed that a first step, then, was to rename the source JPGs, so that each one would have a unique name -- preferably a short name without spaces, potentially making it easier to write commands for them as needed.  In this step, as in others, it would be important to keep a list indicating how various files were changed.  To rename the files where they were, I returned to my spreadsheet and used various formulas to produce a bunch of rename commands of this type:
ren "D:\Folder Name\File Name.jpg" "D:\Folder Name\ZZZ_00001.jpg"
after doing a search to make sure that my computer did not already have any files with names resembling ZZZ_?????.jpg.  The spreadsheet gave me one REN command for each JPG.  I copied those commands into a Notepad file, named it Renamer.bat, and double-clicked to run that batch file.  (A slower and more cautious approach would have been to run it in a command window, perhaps with Pause commands among its renaming lines, so that I could monitor what it was doing.)  A search in a file-finding program like Everything now confirmed that the number of files on my computer with ZZZ_?????.jpg names was equal to the number of files listed in my spreadsheet.  I cut and pasted all those ZZZ_?????.jpg files from Everything to a single folder, D:\Workspace.  (I could also have used the spreadsheet to generate Move commands to run in a batch file for the same purpose.)
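The spreadsheet formula behind those REN commands could be as simple as this (a sketch, assuming the full original path was in A2 and the data started in row 2, so that ROW()-1 produced the sequential number):
="ren "&CHAR(34)&A2&CHAR(34)&" "&CHAR(34)&"ZZZ_"&TEXT(ROW()-1,"00000")&".jpg"&CHAR(34)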

Now I had a spreadsheet telling me what the original names of these ZZZ_?????.jpg files were, and I had all those ZZZ files together in D:\Workspace.  My spreadsheet also told me which of them were supposed to be put together into which merged output PDFs.  But they weren't ready to be merged by PDFsam, because they were still JPGs, not PDFs.

To convert the JPGs to PDFs, I could have prepared another batch file, using IrfanView commands to do the conversion, like those that I had previously played with in another project.  But I figured it would be easier to use IrfanView's File > Batch Conversion/Rename.  There, I told IrfanView to Add All of the ZZZ files to its list.  I specified PDF as the Batch Conversion Settings output format, and set its Options > General tab to indicate that Preview was not needed (and adjusted other settings as desired).  I told it to Use Current ("Look In") Directory as the Output Directory for Result Files (adding "Output" as the name of the output subfolder to be created).  Then I clicked Start Batch.
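For reference, the batch-file route would have used IrfanView's /convert option, something like this (a sketch; it assumes IrfanView's PDF plugin is installed and that i_view32.exe is on the PATH or specified with its full path):
i_view32.exe D:\Workspace\ZZZ_*.jpg /convert=D:\Workspace\Output\*.pdf
In this case, though, I stuck with the GUI batch dialog.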

That produced one PDF, in the Output subfolder, for each original JPG.  I hadn't done anything to change their filenames, so ZZZ_00001.jpg had been converted to ZZZ_00001.pdf.  Spot checks indicated that the resulting single-page PDFs were good.  I deleted the original ZZZ*.jpg files, moved the output PDFs up into D:\Workspace, made a backup, and turned to the project of merging those single-page PDFs into multipage PDFs.

Preparing XML Files to Concatenate PDFs

In my spreadsheet, I had already decided which ZZZ files would be merged together, and what the resulting multipage PDFs would be called.  Now -- referring, again, to the other post in which I worked through the process of using PDFsam -- I needed that information to create File Value lines for a set of ConcatList.xml files that PDFsam would then merge into a set of output PDFs.

In other words, I would have a batch file that would run PDFsam, and I would have a data file, in XML format, to specify the single-page PDFs that PDFsam would combine into the multipage output PDF.  I would have a pair of such files (i.e., a batch file and an XML data file) for each resulting multipage PDF.  In my particular project, there were 65 single-page PDFs, and they would be combined into a total of eight multipage PDFs.  So I would have eight pairs of .bat + .xml files, and the eight XML files would contain a total of 65 File Value lines.

To the extent possible, I would want to automate the creation of these batch and data files.  Sorting 65 data lines into eight different XMLs would be tedious and easily confused.  Things would get much worse if I wanted, in some later project, to use these procedures for hundreds or thousands of JPGs or other files.

I began by adding a column to my spreadsheet that contained the exact text of the appropriate File Value line.  Example:  for ZZZ_00001.pdf, the line would read like this:
<file value="D:\Workspace\ZZZ_00001.pdf"/>
To produce that result, if the Excel spreadsheet's cell D2 contained ZZZ_00001.pdf, its cell E2 would contain this formula:
="<file value="&CHAR(34)&"D:\Workspace\"&D2&CHAR(34)&"/>"
(Note the use of CHAR(34) to add quotation marks where they would otherwise be misunderstood.)  Next, I wanted to assign those File Value lines to the appropriate batch files.  A search confirmed that I didn't have any files on my data drives with YYY_????? names, so I decided that my first multipage output PDF would be called YYY_00001.pdf, and that the pair of files used to produce it would be YYY_00001.bat and YYY_00001.xml.  In other words, the File Value line for ZZZ_00001.pdf (above) would have to be one of the File Value lines appearing in YYY_00001.xml.  But the next File Value line in YYY_00001.xml could be a ZZZ file out of sequence (e.g., ZZZ_00027.pdf), if that happened to be the next original file that I wanted to put into YYY_00001.pdf.

Since YYY_00001.pdf was going to be the temporary working name of the multipage PDF that I would ultimately be calling "Short Letters to Mother," I sorted the spreadsheet (making sure to first use Edit > Copy, Edit > Paste Special to convert formulas to values) by the column containing those ultimate desired filenames, and worked up a column indicating the corresponding YYY filename.  In other words, each cell in that column contained one of eight different labels, from YYY_00001.pdf to YYY_00008.pdf.

With that in place, I was ready to generate some batch commands.  Each batch command would use the ECHO command to send the contents of spreadsheet cells to YYY*.xml files.  My first attempt looked like this:
echo <file value="D:\Workspace\ZZZ_00001.pdf"/> >> YYY_00001.xml
The double greater-than signs (">>") indicated that YYY_00001.xml would be created, if it didn't already exist, and that the File Value line (above) would be added to it.  This first try produced an error, as I feared it might:  ">> was unexpected at this time."  The less-than and greater-than symbols were confusing Windows.  I had to modify the formula in my spreadsheet (or use Ctrl-H) to add carets (^) before them, like this:
^<file value="D:\Workspace\ZZZ_00001.pdf"/^>
That worked.  Now YYY_00001.xml contained that line.  With commands like that for each of the 65 single-page PDFs, my spreadsheet now had cells like these:
echo ^<file value="D:\Workspace\ZZZ_00051.pdf"/^> >> YYY_00004.xml
echo ^<file value="D:\Workspace\ZZZ_00025.pdf"/^> >> YYY_00006.xml
I sorted the rows in my spreadsheet by the appropriate column to make sure the single-page PDFs would get added to their multipage PDFs in the proper order.  (If necessary, I would have added another column containing numbers that I could manipulate to ensure the desired order.)  Then I copied all those cells over to Notepad and saved it as a new batch file that I called Sorter.bat.  I ran Sorter.bat and got eight XMLs, as hoped.  Spot checks seemed to indicate that the process had worked correctly.

My eight XML files were not complete for purposes of PDFsam.  Each of them would need lines of code preceding and following the File Value lines.  As described in the other post, those files would begin with
<?xml version="1.0" encoding="UTF-8"?>
<filelist>
and would end with
</filelist>
I saved those two beginning lines into a text file called Header.txt, and I saved that ending line into another text file called Tailer.txt.  Now I needed to combine them with the XML files that Sorter.bat had just created.  For that purpose, it seemed that my spreadsheet could benefit from a separate page dedicated to XML file manipulation.  I added that page, filtered my existing page for unique values in the YYY*.pdf column, and placed the results on that new page. 

I could now see that I was too early in adding .xml extensions to the eight files.  I went back into the spreadsheet and changed it to produce files without extensions (e.g., YYY_00001 instead of YYY_00001.xml), and then I deleted the XML files and re-ran Sorter.bat (as modified) to verify that it was all still working.

With that change, I returned to the spreadsheet's XML Files page.  Next to each of the eight XML filenames, I added columns to produce commands of this type:
copy Header.txt+YYY_00001+Tailer.txt YYY_00001.xml
I put those eight commands into a batch file and ran it.  It worked:  I had eight XML files with everything that PDFsam needed.  There was just one small glitch:  at the end of each resulting XML file, there was a little arrow, pointing to the right.  Searches didn't yield any obvious explanations.  I wasn't sure if it would make a difference; I thought it might just be a symbol representing end-of-file or line feed.  I decided to forge ahead and see what happened.  Except for that little character, I had exactly what I needed for my XML files.

Preparing Batch Files to Use the XMLs

The next step was to create a matching YYY_?????.bat file for each YYY_?????.xml file.  This batch file would run the commands necessary to merge the single-page PDFs listed in the XML file.  I would use the same techniques as in the XML files.  There would be no Tailer.txt file this time; the line that would need to change, in each BAT file, was the very last line.  So my COPY command (above) would just have Header.txt plus the variable line to produce the YYY_?????.bat file.  The variable (last) line of the batch file had to look like this:
%JAVA% %JAVA_OPTS% -jar %CONSOLE_JAR% -l D:\Workspace\YYY_00001.xml -o D:\Workspace\Merged\YYY_00001.pdf concat
In other words, it would have two variables:  the name of the XML file providing the input, and the name of the PDF file containing the output, saved in a Merged subfolder.  It was pretty straightforward, by now, to use the spreadsheet to generate the necessary commands and to run them in another Sorter.bat file (see above).  I just had to remember to delete my previous YYY_????? files (without extension), so that their contents would not get thrown into the mix.  I did wish that PDFsam's -l option were expressed as -L, so that nobody would think it was the number one, but I wasn't yet ready to experiment and find out whether -L would work just as well.  Anyway, to produce a line like the one shown immediately above, my Excel formula looked like this:
="echo %%JAVA%% %%JAVA_OPTS%% -jar %%CONSOLE_JAR%% -l D:\Workspace\"&A2&".xml -o D:\Workspace\Merged\"&A2&".pdf concat >> "&A2 
where cell A2 contained the filename without extension (e.g., YYY_00001).  I had to use double percentage symbols to get a single one to come through.  I put the resulting eight lines into Sorter.bat, and it produced eight YYY_????? files, as before.  I ran another batch file to combine Header.txt plus the YYY_????? files to produce YYY_?????.bat -- again, same as above, but without Tailer.txt and making sure to produce .bat files, not .xml files.

These steps gave me eight pairs of .bat and .xml files.  The batch files looked good, except for the little arrow at the end (above).  Now, if all went well, the batch files would run, would consult the XML files for the lists of single-page PDFs to merge, and would produce eight YYY_?????.pdf output files in the Merged subfolder.  I would not want to run the batch files manually, if I were producing a large number of merged PDFs, so I wrote a batch file to run the batch files.  The commands in this file looked like this:
@echo off
call YYY_00001.bat
call YYY_00002.bat
and so forth.  I ran this batch file.  It gave me error messages.  I opened a command window and typed just its first action line:  call YYY_00001.bat.  The error was:
FATAL  Error executing ConsoleClient
java.lang.Exception: org.pdfsam.console.exceptions.console.ParseException: PRS001 - Parse error. Invalid name (D:\Workspace\YYY_00001.xml) specified for <l>, must be an existing, readable, file.
Oh.  Dumb mistake.  The XML files weren't in D:\Workspace.  I moved them and tried again on the command line.  Another error:
Error on line 5 of document file:///D:/Workspace/YYY_00001.xml : Content is not allowed in trailing section.
That little arrow was on line 5.  I moved it to line 6 and tried again.  Same error, except now it said the problem was on line 6.  I deleted the little arrow and tried again.  That solved that problem.  Now a different error:  "The system cannot find the path specified."  That was probably because I had not yet created the Merged subfolder.  Apparently PDFsam was not going to create a folder that did not already exist.  I created it and tried again.  Success!  YYY_00001.pdf was created in the Merged folder with the desired single-page PDFs in it.

Now I just had to figure out how to prevent that little arrow from appearing in the XML files.  It came at the end of both the XML and the BAT files, and it got there when I used the COPY command to combine the header, the command, and the tailer text files.  When concatenating files, COPY's default ASCII mode appends a Ctrl-Z end-of-file character, which was what that little arrow represented.  The solution was to add the /b (binary) switch, which leaves that character off:
copy /b Header.txt+YYY_00001+Tailer.txt YYY_00001.xml
With that change, I went back through the process of creating the XML files.  Then I tried running YYY_00001.bat again.  I got an error:  "Cannot overwrite output file (overwrite is false)."  Oops.  I had forgotten to get rid of the previously produced YYY_00001.pdf in the Merged folder.  This time I was successful without having to manually remove the little arrow -- it wasn't there anymore in the new YYY_00001.xml.  I ran the batch file that called the eight YYY_?????.bat files.  It ran and produced eight multipage PDFs.  Those eight contained a total of 65 pages.  I combined them all in Acrobat, just to take a quick look.  They were all good.

Putting the Multipage PDFs Back Where They Belong

Now I needed to decide where to put the multipage PDFs.  In this case, the individual PDFs that went into each of the multipage PDFs had all come from the same folder.  That is, I did not have Folder1\PDF1 plus Folder73\PDF2 going into BigPDF-A.  So I could use the spreadsheet to determine semi-automatically the path and filename for a rename command.  I wound up with Rename commands like this:
ren YYY_00004.pdf "Medieval Craft Workers.pdf"
followed by Move commands like this:
move /y "Medieval Craft Workers.pdf" "E:\IMAGES\Medieval Craftsmanship\"
I verified that the multipage PDFs had returned to the folders whence the single-page PDFs had originated.  This project was finished.

Saturday, July 31, 2010

VMware Workstation, Ubuntu Host, Windows XP Guest: Automated Way to Map Network Drives

I was using Windows XP virtual machines (VMs) in VMware Workstation 7 on Ubuntu Linux.  I wanted to automate the process of mapping network drives.  I typically used the Map Network Drive technique in Windows Explorer for this purpose.  It appeared possible to automate this with a registry edit, but apparently a  more typical and robust approach was to use the "net use" command.

The net use command could be entered from the command line.  I was more interested in saving it in a batch file that I could apply to multiple network drives and could re-run anytime without having to remember or research the proper syntax.  Microsoft seemed to say that, for my purposes, the command to map a network drive would look something like this:

net use d: "\\vmware-host\Shared Folders\DATA" /persistent:yes
where DATA was the name that I had given to the drive in Ubuntu.  This resulted in an entry in Windows Explorer that read, "Data on 'vmware-host\Shared Folders' (D:)."  I was not able, at this point, to automate the process of right-clicking and renaming that to be simply "DATA."

I combined several of those commands in a batch file.  A batch file was just a file created in Notepad, with one command on each line.  I also included a comment (in batch files, comment lines start with REM), in case I wanted to write an "undo" batch file.  The line just shown, accompanied by that comment, looked like this:
net use d: "\\vmware-host\Shared Folders\DATA" /persistent:yes
REM to disconnect, use this (I think):  net use d: "\\vmware-host\Shared Folders\DATA" /delete
I saved that batch file in the folder containing my various WinXP installation materials.  So then, for future installations, all I had to do was to double-click on that batch file in Windows Explorer, or start it from the command line or from another batch file, and my drive mapping would proceed automatically.
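As a sketch of where this ended up (the second share name and the drive letters here are just examples), the mapping batch file might contain:
net use d: "\\vmware-host\Shared Folders\DATA" /persistent:yes
net use e: "\\vmware-host\Shared Folders\ARCHIVE" /persistent:yes
and a matching "undo" batch file could be as simple as:
net use d: /delete
net use e: /delete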

Sunday, July 18, 2010

Making Space on a Windows XP System Drive

I wanted to make space on my computer's C drive.  For some purposes, it could be more sensible to just install a bigger hard drive or make a larger partition for drive C.  But for other purposes (in my case, where drive C was in a virtual machine (VM) in VMware Workstation and I really didn't want to deal with the slowness and overhead of a huge virtual drive), there might be no alternative but to make space.  Drive C had a habit of just continuing to grow and grow, if you let it; I had one that got up to around 35GB.

I started with a general-purpose Web Developers Notes cleanup page.  I was already doing one thing they suggested, which was to use portable versions of various utilities.  IrfanView, for example, was available in both an installed version and a portable version.  Typically, there was no difference in functionality; the main thing was just that you had to create a link to the portable version if you wanted to have it listed in your Start > Programs list.  Portable versions could be run from anywhere, which meant they didn't have to be on drive C.  So could installed versions, in theory; but in practice, programs didn't always run correctly and updates were not always applied when the program was not located where the programmers expected it to be.  Back in the late 1990s, I did spend an enormous amount of time trying to figure out which installed programs could safely be installed somewhere other than the default location, but ultimately I concluded it wasn't worth the hassle.  In short, if it was a portable version, I put it in a folder labeled "Installed Here" on drive D; otherwise, I installed it on C.  Those who hadn't done this during installation could, as advised, uninstall from C and reinstall on D.

I was also doing another thing they suggested, which was to keep data files (including e-mail) on a separate partition.  Program files went onto drive C if they had to be on drive C, or if they would be a lot less hassle if they were on drive C (e.g., see previous paragraph).  Stuff generated by me and by the rest of the world went on drive D whenever possible.  It helped, for this purpose, to relocate those folders (unwanted by me, at least) that Windows automatically created, including "My Documents" and "My Pictures" and "Microsoft, I Need Your Help in Telling Me Where to Put Everything."  Also, in Microsoft Office programs (among many others), I could change settings to store files by default in a folder that was not on drive C.  Then you could back up drive C once every couple of months -- whenever you had accumulated enough new program installations and adjustments -- using Acronis or some other drive mirroring program, while continuing to back up your drive D (data) partition on a daily if not hourly basis.
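Relocating "My Documents," for example, could be done by right-clicking it, choosing Properties, and using the Move button.  It could apparently also be scripted by editing the "Personal" value under the User Shell Folders registry key -- the D:\Documents target here is just an example, and unlike the Move button, this would not move any existing files:
reg add "HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders" /v Personal /t REG_EXPAND_SZ /d "D:\Documents" /f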

Another suggestion was to use Control Panel > Add or Remove Programs to delete programs that were not being used.  It was unwise to remove programs that you needed, or to remove programs whose function was unclear.  No point making extra work for yourself or screwing up your system.  (Incidentally, the command to open Add or Remove Programs was this:

rundll32.exe shell32.dll,Control_RunDLL appwiz.cpl,,,
That was a pretty funky command, and for future reference I saved a webpage containing others like it.  I would end up combining several of these commands in a single batch file (below) for one-click, all-purpose cleanup.)

Another space-saving step they recommended, which I rejected for my purposes, was to empty out the browser cache (in e.g., Internet Explorer, Firefox).  Why bother?  It would fill up again -- I would want it to fill up again, so as to load pages faster and save the cookies that would store my login information for many webpages -- and I would still need the disk space to accommodate it.  This step would make sense for a one-time task, like making an image of drive C.  For more enduring space saving, the more sensible step was to go into those browsers and make the cache smaller.

A step they should have recommended, but didn't, was to move the page file.  It could be huge.  After moving it and rebooting, I made sure there was not still a copy on drive C.  There was also apparently a Microsoft utility to protect against performance degradation due to pagefile fragmentation.
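For anyone hunting for it, the virtual memory setting was buried under System Properties, which could be opened from the command line with:
control sysdm.cpl
and then, if I recall correctly, Advanced > Performance > Settings > Advanced > Virtual memory > Change.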

Moving the paging file was more complicated in my case, because I was using VMware.  (In other words, those who were not using VMware should skip this paragraph.)  In my virtual machine, the paging file seemed to be preset to about 2GB.  Then I came across a mention of the option of setting up a separate virtual drive, within my virtual machine, just for the paging file.  The advice there was to go into VMware Workstation for this virtual machine and choose VM > Settings > Hardware tab > Add > Hard Disk > Next > Create a new virtual disk > IDE, Independent, Persistent > Next.  I set the disk capacity at a 4GB single file, which seemed like plenty when combined with the 2GB of RAM I was allocating to the VM.  Then, continuing with the advice, I powered up the VM and, with a series of right-clicks, initialized, partitioned, and formatted that drive, set its drive letter to I (so that it would not interfere with D or other drives I was already using), and set its pagefile.  I varied somewhat from the advice on one point:  I set the size of the drive I pagefile to a minimum and maximum of 3.5GB, having heard that making the system enlarge the file could hurt performance.  (I originally tried 4GB, but Windows gave me warnings that the pagefile drive was running out of free space.)  Finally, they advised me to reboot again and set drive I to nonpersistent.  This, it seemed, would take care of the problem of pagefile fragmentation.

They also recommended defragmenting the hard drive.  I used Smart Defrag for this purpose.  It supposedly ran all the time in the background, but I included it in my batch file anyway, because there always seemed to be work waiting for it whenever I actually opened it.

They suggested using WinXP's Disk Cleanup (command line:  cleanmgr).  Good idea, but typically this made less space than one might have imagined.  Again, there was a tradeoff:  you could make more space by deleting things that might cost you extra time (e.g., Office setup files) whenever you did next need them.
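Disk Cleanup could also be made less interactive, which mattered for batch purposes.  As I understood it, running this once:
cleanmgr /sageset:1
would let me choose which categories to clean and would save those choices, so that this:
cleanmgr /sagerun:1
would thereafter run them without prompting.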

Another possibility was to run a program to see which files and folders were most space-consuming. Raymond recommended TreeSize Free and JDiskReport, both of which were portable freeware.

Someone at HelpWithWindows.com recommended deleting unneeded old files.  Some of this was already being taken care of, for me, via Advanced WindowsCare 2, which I had included in my Startup folder.  Still, I used a complex command to open a search dialog (it appears as the last line of the batch file below), where I could search for files matching these patterns:
*.bak
*.chk
*.gid
*.old
*.tmp
*.~mp
*.$$$
*.000
~*.*
*~.*
Not every file coming up in those searches would deserve deletion, but many would.
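A command-line alternative, at least for a first look, was to list some of the candidates with DIR -- for example (this sketch covers only drive C and a few of the patterns):
dir C:\*.bak C:\*.chk C:\*.old C:\*.tmp /s /a-d /b > junklist.txt
and then review junklist.txt before deleting anything.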

These steps, combined, freed up about 2GB (10%) of my drive C.  The batch file I wrote to automate some of these steps looked like this:
start rundll32.exe shell32.dll,Control_RunDLL appwiz.cpl,,,
start "" "C:\Program Files\IObit\IObit SmartDefrag\IObit SmartDefrag.exe"
start cleanmgr
start "" "D:\Miscellany\Installation\Installed Here\TreeSizeFree.exe"
start "" "D:\Miscellany\Installation\Installed Here\JDiskReport.exe"
type nul > %temp%\1.fnd & start %temp%\1.fnd & del /q /f "%temp%\1.fnd"
It ran slowly, but it did tend to automate the steps needed in the process.