
Thursday, March 15, 2012

Batch Converting Many Text Files to PDF

I had a bunch of .TXT files that I wanted to convert to PDF.  I had solved this problem previously, but it looked like I hadn't written it out clearly, so that's the purpose of this post.  This explanation includes solutions to several other sub-problems.  Taken together, the things presented here were useful for solving a variety of problems.

First, I made a list of the files to convert.  My preferred way of doing this was to use DIR.  To run it, I would first open a command window.  My preferred way of doing *that* was to use the "Open command window here" context menu option (i.e., right-click in Windows Explorer).  An alternative was to use Start > Run > cmd, but then I would have to navigate to the desired folder using commands like CD.

The DIR command I usually used, to make a list of files, was DIR /s /a-d /b > filelist.txt.  (Information on DIR and other DOS-style commands was available in the command window by typing the command followed by /?.  For example, DIR /? told me that the /s option would tell DIR to search subdirectories.)  A variation on the DIR command:  DIR *.txt /s /a-d /b.  The addition of *.txt, in that example, would tell DIR that I wanted a list of only the *.txt files in the folder in question (and its subfolders).  If I wanted to search a whole drive, I'd make it DIR D:\*.txt /s /a-d /b > filelist.txt.  If I wanted to search multiple drives, I'd use >> rather than > in the command for the second drive, so that the results would add to rather than overwrite the filelist.txt created by the preceding command.
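As a quick sketch of how those pieces fit together (the drive letters here are just examples), a two-drive search would have looked like this:
DIR D:\*.txt /s /a-d /b > filelist.txt
DIR E:\*.txt /s /a-d /b >> filelist.txt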

Using DIR that way could gather files from all over the drive.  Sometimes it was better to gather the files into one folder first, and then run my DIR command just on that folder.  An easy way of finding certain kinds of files was to use the Everything file finding utility, and then just cut and paste all those files from Everything to the desired folder.  For instance, a search in Everything for this:

"see you tomorrow" *.txt
would find all text files whose names contained that phrase.  Cutting and pasting that specialized list into a separate folder would quickly give me a manageable set of files on which I could focus my DIR command.  (There were other directory listing or printing programs that would also do this work; I just found them more convoluted than the simple DIR command.)

Once I had filelist.txt, I copied its contents into Excel (or I could have used Excel to open filelist.txt) and used various formulas to create the commands that would convert my text files into PDF.  The form of the command was like this:
notepad /p textfile.txt
I wasn't sure about Notepad specifically, but I was able to run some programs (e.g., Word) from the command line by typing just one word (instead of, e.g., "notepad.exe" or a longer statement of the path to the folder where winword.exe was located), because I had put the necessary shortcuts in C:\Windows.
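For the formula-building step itself, here is a sketch of the sort of thing I mean, assuming the paths from filelist.txt were sitting in column A; the doubled quotation marks produce a quoted path, in case it contained spaces:
B1  ="notepad /p """&A1&""""
Filling that down column B would generate one command per file.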

Those Notepad commands would send the text files to my default printer.  My default printer was Bullzip.  When I installed it, it gave me a separate shortcut leading to its options.  For this purpose, I set its options so that it did not open the document after creation (General tab), specified an output folder (General tab), and indicated that no dialogs or questions should be asked (Dialogs tab).

I copied the desired commands from Excel to a Notepad text file and saved it with a .bat extension.  The rest of the file name didn't matter, but the .bat extension was important to make it an executable program.  In other words, if I double-clicked on PrintThoseFiles.bat (or if I selected PrintThoseFiles.bat and hit Enter) in Windows Explorer, the batch file would run and those commands would execute.  (I could also run the batch file from the command line, just by typing its name and hitting Enter -- which meant that I could have a batch file running other batch files.)
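So the finished batch file was nothing exotic.  A hypothetical PrintThoseFiles.bat (file names invented here for illustration) would contain little more than this:
@echo off
notepad /p "D:\Convert\Letter to Bob.txt"
notepad /p "D:\Convert\Meeting notes.txt"
notepad /p "D:\Convert\Status report.txt"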

So that pretty much did it for me.  I ran the batch file, running lots of Notepad commands, and it produced lots of good-looking PDFs.

Please feel free to post questions or comments.

Tuesday, March 13, 2012

Choosing Emacs as a Text Editor with Macro Capabilities

I had a project that called for a text editor with macro capabilities. I went searching for one. This post describes how that turned out.

I started by trying Notepad++. It did have a simple macro recording capability: just select Macro > Start Recording. The particular application I had in mind involved a search for certain text. This couldn't be done in Notepad++. Its help file said, "Currently (v.5.4.3), macros cannot save Find(/Replace) operations."

So I looked for alternatives. Wikipedia offered a longish list of them, but without detailed feature information. I found a Lifehacker article that suggested Emacs as a second-best after Notepad++. I hadn't used Emacs since 1982. Last time I'd tried, maybe around 2005, it had been intimidating. Was I ready to invest time to become competent in Emacs? Lifehacker called it "primarily a text editor for serious programmers." Lifehacker said an alternative with comparably good macro capabilities was Vim or, less intimidatingly, Cream.

I looked at Cream. I saw that it had a keyboard macro capability (Shift-F8 to record, F8 to play). I downloaded and installed it and tried its macro features. It may have saved the macro somewhere, at least temporarily. But it didn't give me an option to save it permanently or assign it to a function key. There didn't seem to be a way to auto-run a macro on startup or file opening. This tool seemed to be very different from something like Emacs.

What I was really looking for, and what I found at this point, was the Extra Features table in Wikipedia's comparison of text editors. There, I found a column, "Macro Language." I didn't necessarily need a whole language, but I had missed having something along those lines, which I had used extensively in certain DOS programs (especially WordPerfect and Paradox) in the 1990s, and had never gotten back to after switching to Microsoft Office. If I relied on Wikipedia's comparison to steer me to a text editor that definitely had a macro language, newline conversion capability, large file support, and graphical shell integration, for no cost in a Windows environment, then my candidates seemed to be AkelPad, ConTEXT, GNU Emacs, PSPad, Vim, and XEmacs.

I had missed some things about the Emacs world, and was inclined in that direction. Wikipedia and WikiVs articles on the choice between Emacs and Vim did not change that feeling. After browsing a discussion or two of the choice between GNU Emacs and XEmacs, I decided I preferred the former. But then it developed that there were quite a few other versions of Emacs. I was tempted to go with MicroEmacs, as I thought it might be simpler. The JASSPA version seemed to be most actively developed.  But then I decided that, to avoid bugs and have deep support in complex problems, it made more sense to go with the most mainstream version, and that seemed to be GNU Emacs.

I downloaded and installed the most recent GNU Emacs from Softpedia (apparently released within the past few months), though possibly I should have gone to the official download site. I say "installed," but actually it seemed to be pretty much a portable application. Looking through the two accompanying "read-me" files (one of which was not formatted for use on Windows machines), I saw that they advised putting it into a folder whose directory path contained no spaces. There were a few installation options. One was to run the file called Emacs\bin\addpm.exe. Another was to create a shortcut to Emacs\bin\runemacs.exe. There was also the option of running a REG file to make some changes. The file as they gave it to me contained just a couple of lines that added a context menu option, "Open with Emacs." The entire contents of the REG file were:

Windows Registry Editor Version 5.00
[HKEY_CLASSES_ROOT\AllFilesystemObjects\shell\Open With Emacs\command]
@="C:\\emacs\\bin\\emacsclientw.exe -n -a \"C:\\emacs\\bin\\runemacs.exe\" \"%1\""
The read-me seemed to say that I could add some formatting options to that REG file, specifically:
[HKEY_LOCAL_MACHINE\SOFTWARE\GNU\Emacs]
"Emacs.Background"="Black"
"Emacs.Foreground"="Wheat"
"Emacs.ScrollBars"="OFF"
"Emacs.MenuBar"="OFF"
"Emacs.ToolBar"="OFF"
"Emacs.Geometry"="125x43+0+0"
But when I tried to add those keys and values to the registry manually, I didn't see that they were making any difference. I wondered if they were perhaps intended for the text version as distinct from the GUI version; the latter seemed to have its own menu-driven configuration options.

That seemed to be all there was, by way of preliminaries. Going into the actual Emacs program, the read-me files recommended starting with Emacs > Help > Tutorial. The opening screen contained a link to the tutorial and also to a Guided Tour and to the Emacs Manual. The tutorial was contained within Emacs and didn't seem to be available online. The Guided Tour and the Emacs Manual were available online. There were also other tutorials out there. Examples included a Rice University PDF that only seemed to be available from a University of Illinois website, a University of Chicago tutorial, and a GNU Emacs 101 for Beginners webpage. There were other help sources as well, including the GNU Emacs FAQ for MS Windows, the help-emacs-windows listserv, and two Usenet groups (gnu.emacs.help and comp.emacs).

The GNU Emacs for Beginners page, and others, had some tips. One was to start with the menus and worry about the "keybindings" (i.e., the Ctrl- and Alt- multi-key options) later. Another was to use Ctrl-] (i.e., control - right bracket) to kill funky processes, if the program started to seem out of control. Also, they said, "Meta" was the old (still used in Emacs) name for the Alt key.

The University of Chicago tutorial seemed to have been written for non-GUI users. I read partway through it nonetheless. It was talking about key combinations. It said they were written in shorthand, as C- (for Ctrl-) and M- (for Meta- (i.e., Alt-)). Sometimes it would apparently be necessary to use Esc instead of Alt, but in those cases (unlike C- and M-) you'd just tap it, not hold it. You'd do your work in a buffer, not directly in a file; the changes in the buffer would then be written to the file. C-x commands would do things on the environmental level (e.g., open files, save your work, manipulate windows). Control combinations had to be entered sequentially, lifting off the Ctrl key and then hitting it again as needed.  For example, C-x C-e did not mean C-x-e.  The Rice University tutorial (dated 2002) also had information about these key combinations.

It looked like the best sources of information would be the FAQs, the Emacs Manual, and the Guided Tour, which was actually more like an introduction to general concepts.  The manual included in Emacs itself seemed to be the same as the online manual.  The included tutorial was actually just a long (maybe 15-page) single-spaced text file.

I wasn't sure I needed all that right now.  I went looking for information on macros specifically.  So that started my education in Emacs:  Ctrl-F (or should I say C-f) would not do a Find; instead, it would move the cursor Forward one space.  To find the part of the tutorial that talked about macros, I needed to use C-s (i.e., search forward) or perhaps C-r (i.e., reverse).  There wasn't a search box per se; there was a search line on the bottom of the screen.  There was nothing on macros in the tutorial.  The online Guided Tour said F3 would start recording a macro and F4 would stop and would also do playback.

So, OK, I would open a file of the kind that I wanted to edit; I would go to its end, since I was looking for the last occurrence of a certain text string (specifically, "Content-Disposition: attachment;"); I would do a C-r to back up to that last occurrence; and then I would need to know how to delete everything after that point.  So here were the steps to get that far:
  • Opening the file:  Menu > File > Open
  • Going to its end:  C-end (just like in Word)
  • Searching for last occurrence of text:  C-r.
  • Text to search for:  I could have typed in the "Content-Disposition: attachment;" string that I was searching for, but could I paste it into the search?  Ctrl-V didn't work, at least not without customization.  A search led to an indication that the customization itself could be relatively easy, though there were multiple preparatory steps.  First, I had to have an "init" file.  It could be called _emacs and it had to be in my Home directory.  To find Home, I typed C-x C-e (insert (getenv "HOME")) and saw that it printed, in the text of the file I was looking at, this location:  C:\Users\Ray\AppData\Roaming.  (I had to give it a minute after the C-e before it brought up the "Eval" space at the bottom of the screen.  Once it did that, I could type the (insert (getenv "HOME")) part.)  In Windows Explorer, I created a new _emacs file (not _emacs.txt) in that folder, opened it with Notepad, put (setq x-select-enable-clipboard t) into it, and closed it.  I closed and restarted Emacs.  Ctrl-V still did not work to paste contents of the Windows clipboard into Emacs.  I noticed this in the lower half of the Emacs screen:  "Warning (initialization): `_emacs' init file is deprecated, please use `.emacs'."  I tried to change _emacs to .emacs, but Windows Explorer said, "You must type a valid file name."  So I left it as _emacs and, ultimately, I just typed in the "Content-Disposition: attachment;" text that I was looking for.
  • Marking the start of the text to be deleted:  I wanted to delete everything from "Content-Disposition: attachment;" on to the end of the file.  In Emacs-speak, this was called "killing" a "region."  First, I had to set the "mark" at one end of the region.  I probably could have done that already at the end of the file, but I wasn't thinking that way.  I was at "Content-Disposition: attachment;" now, so I set the mark by hitting C-space.  The bottom line in Emacs gave me a gratifying little confirmation:  "Mark set."
  • Marking the end of the text to be deleted:  the deletion would apply to everything between "mark" and "point."  Point seemed to mean just the location of the cursor.  So I just hit Ctrl-End to go to the end of the file.  I saw that all this text was now highlighted.
  • Deleting the selected text:  to kill the region, I just hit Delete.
  • Saving and closing the file:  Menu > File > Close.  I got a dialog telling me that a buffer had been modified and asking if I wanted to kill anyway.  I said yes, and then I reopened the file to see what happened.  My change had not been saved, so I redid it, and this time I did save first.  I used the key commands instead of the menu pick:  C-x C-s.  Then I quit with C-x C-c.
Now I wanted to be able to automate those commands in a macro.  That called for another sequence of steps:
  • Create the macro:  I opened the next file that needed the treatment just described.  I hit F3 and went through those steps, ending with the file save.  Then I hit F4 to end the definition.  I got a confirmation at the bottom:  "Keyboard macro defined."
  • Name the macro:  I just wanted to call it Macro1.  It seemed that several macro-related commands started with C-x C-k.  To name the macro, I typed C-x C-k n Macro1 followed by Enter.  It didn't give me a confirmation, or maybe I missed it before I hit Enter, which may have been unnecessary.
  • Bind the macro to a specified key:  in case I wanted to be able to run Macro1 by just hitting a key (instead of using its name), I typed C-x C-k b 1.  This command would "b"ind the last keyboard macro (i.e., the one I had just defined) to key 1.  To run it, I could hit C-x C-k 1.  The same approach would work if I wanted to assign macros to C-x C-k 2 through 9 and A through Z (capital letters only).  That is, (a) the binding process would be this simple, for 1-9 and A-Z, and (b) those particular key combinations were safe to use (i.e., were not already spoken for).  I got a confirmation:  "Keyboard macro bound to C-x C-k 1."
  • Save the macro in a file:  to make Macro1 available if I stopped and restarted Emacs, I had to "visit" the file where I wanted to save its definition.  I decided to call this file "Macros."  There was a menu option to visit a new file.  I tried C-x C-f Macros [Enter].  It defaulted to putting this file in the directory where I had last been working (i.e., where the files I was editing were located).  I decided this was OK for now; I could move the Macros file later if I wanted to keep it after this job was done.  It confirmed:  New file.  Now, to save Macro1 in Macros, I typed M-x insert-kbd-macro [Enter] Macro1 [Enter], where M- meant Alt-.  The top pane showed a bunch of mostly numbers.
  • Test the macro:  I closed Emacs.  The confirmation line asked if I wanted to save Macros.  I typed y.  It closed.  I reopened Emacs and typed C-x C-k 1.  The bottom line said, "C-x C-k 1 is undefined."  I typed M-x load-file [Enter].  It volunteered to load something from my Emacs program folder.  That wasn't where I had saved the Macros file.  So I would apparently have to type the full path name.  Once I did type D:\Folder\Subfolder\Subsubfolder\Macros, I got a confirmation:  Loading ... done.  I made sure the next file to be fixed was visible and that's where the cursor was, and then I tried C-x C-k 1 again. Still undefined.  I tried F4:  "No kbd macro has been defined."  It wouldn't accept C-x C-k e Macro1.  I moved the file to D:\Folder and renamed it to Macros.el.  This made no difference.
  • Correct my mistakes:  eventually I came to understand that I should have named the macro (C-x C-k n Macro1) and then saved it by opening C:\Users\Ray\AppData\Roaming\_emacs and then typing M-x insert-kbd-macro [Enter] Macro1 [Enter].  So I hit F3, re-ran the sequence of steps, hit F4, and then named and saved it in that way.  Or, I should say, tried to save.  When I typed M-x insert-kbd-macro [Enter], it prompted me with "Insert kbd macro (name)," and when I typed Macro1, it said, "No match."  I opened another file to edit, hit F4, and it ran.  But it ended in the _emacs file, so maybe I hadn't hit F4 to end it at the right time.  I went through the F3 paces and so forth, another time, with another file.  This time it seemed to work:  the _emacs file in the bottom pane of the two-pane Emacs window now contained some codes.  I saved _emacs.  But, oops, I hadn't assigned it to a key.  I opened another file to edit, hit F4, and observed that the macro failed to run.  I must have screwed something up.  I went through the F3 process again.  This time I hit F4 several times at the end.  I typed C-x C-k n Macro1a [Enter].  I bound it to the 1 key by typing C-x C-k b 1.  I saved it to _emacs by clicking in the bottom pane, where _emacs was still open, and typing M-x insert-kbd-macro [Enter] Macro1a [Enter].  This added more text to _emacs, identical to what I had added before except that it referred to Macro1a instead of Macro1.  I manually deleted the extraneous Macro1a lines and saved _emacs.  I opened another file to edit and hit C-x C-k 1.  The macro didn't work.  It went to the end of the file and stopped.  I suspected the reason was that, at file's end, I had hit C-f (i.e., the Windows Find combination), intending to tell it to search backwards, and had gotten an error message:  I was supposed to search with C-r from the file's end (not with C-f).  Apparently the error message terminated macro recording.  So I went through the steps again, starting with F3.  This time, there was a larger amount of code saved in _emacs, in the bottom pane, and it included references to desired steps like "delete" and "(menu-bar) file save-buffer."  I manually edited _emacs again so that Macro1 (not Macro1a) would be associated with all these nice lush codes.  I opened another file and hit C-x C-k 1.  This time, it got to approximately the point where it should have started deleting, and stopped there with an error on the bottom line:  "After 0 kbd macro iterations: Keyboard macro terminated by a command ringing the bell."  I suspected this, too, was due to an error during macro recording, though not one that I would have expected to matter.  After hitting C-r, I had told Emacs to search backwards for Content-d.  That's as far as I got:  a small d, instead of D as in Disposition.  Typo.  But the macro had recorded it literally, which it would not have done when recording a keyboard macro in Word.  I wondered if I could please make it through an F3 macro recording session without any typographical errors.  Trying again with F3, I got another Macro1a saved in _emacs.  And this time, by God, it worked.  C-x C-k 1 did exactly what it was supposed to do, right down to the final step of saving the edited file.  (A sketch of the resulting _emacs contents appears below.)
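For reference, the lines that insert-kbd-macro put into _emacs looked roughly like the following sketch.  The key vector here is only a made-up illustration (the real one spells out every recorded keystroke), and the global-set-key line is my own addition, not something insert-kbd-macro wrote; it is one way to make the C-x C-k 1 binding survive a restart.
;; written by M-x insert-kbd-macro; the vector holds the recorded keystrokes (illustrative only)
(fset 'Macro1
   [C-end ?\C-r ?C ?o ?n ?t ?e ?n ?t return ?\C-@ C-end delete ?\C-x ?\C-s])
;; optional: bind the named macro to C-x C-k 1 so the key works in future sessions
(global-set-key (kbd "C-x C-k 1") 'Macro1)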
That worked.  I had a text editor with macro capabilities.  And I could tell already that it was going to be sweet, special, loving, but also nasty, and sweaty, and sometimes hostile and ugly . . . oops, familiar rant, wrong context.  Point is, I was coming home.

Saturday, March 10, 2012

Optical Character Recognition Freeware: JOCR and FreeOCR

I had some PNG images of text files.  I wanted to do optical character recognition (OCR) to convert their contents into text.

I already had Adobe Acrobat.  I tried printing the PNGs as PDFs in Acrobat, which was capable of doing OCR.  Unfortunately, its recognition was poor, even when I set its PDF printer to 1200 DPI.

Next, I turned to a fairly recent review of desktop OCR software.  It looked like the best-known OCR freeware engines might be Cuneiform, Tesseract, and SimpleOCR.  I checked Softpedia for examples of those.  Cuneiform was virtually unknown, and SimpleOCR was making a mediocre showing, but FreeOCR (using Tesseract) seemed relatively popular.

That review also directed me toward JOCR, which was apparently designed to do OCR directly from the screen. The reviews of JOCR at CNET and Softpedia were underwhelming.  But because of that screen-oriented design, I decided to compare it against one of the others.  FreeOCR seemed to be the most likely candidate.  I might have gotten different results from one of the other OCR engines, or from another implementation of the Tesseract engine.

I let JOCR and FreeOCR try their luck with a screenshot taken from a maximized Notepad display of a text file, upsampled to 300 DPI.  (FreeOCR had frozen with a 600 DPI file, which JOCR had been able to handle without difficulty.)

Briefly, the FreeOCR output was visibly inferior to the JOCR output.  Of course, this was a test of text taken from a screen image, the situation for which JOCR was specially designed; but within the parameters of this brief test, there was no question that JOCR produced better output.

Compared against the original text, the primary problem with the JOCR output was in the area of capitalization.  Otherwise, the recognized text was generally pretty accurate, with few dropped letters or other errors.  Its output would have made a bad impression if pasted directly into the body of a professional letter or memorandum, but it was quite good for the archival purpose of capturing the wording in an image.

Thursday, March 1, 2012

Excel 2003: An Attempt to Sort Sentences or Filenames by Identifying Multiword Phrases

I had a bunch of filenames.  I wanted to know which two-word phrases appeared most frequently.  The purpose of this effort was to put those filenames into categories.  I hoped that the approach of finding significant two-word phrases in those filenames would help me automate the file-sorting process.  This post describes the first approach I took to identify those two-word phrases.

Parsing the Filenames

I hoped that two-word phrases would be a good size.  I figured they would sort into many essentially one-word phrases (e.g., "a dog" would be trimmed to "dog," which would not have much significance) and a relatively small number of more significant two-word phrases (e.g., "big dog").  I hoped, that is, that I could easily filter the list for common words (a, an, the, if, of ...) and then count what was left.  It seemed, though, that I could still have taken approximately the approach described here if I had been looking for three-word or longer phrases.

The approach would have been similar if I had been examining sentences instead of filenames.  I would have gotten the sentences into individual rows by using Microsoft Word and replacing sentence ends (e.g., ? ! .) with the same punctuation followed by ^p characters, which Word would interpret as the start of a new line, and then pasting the text into Excel.  It might have taken a while to resolve sentence ends, given problems with e.g., "Mr.".  I did not have more than 65,000 filenames, so Excel 2003 would handle it.  Otherwise, I might have needed a newer spreadsheet (e.g., the Calc programs in LibreOffice or OpenOffice).

Consider this sentence or filename:  "I saw a dog - a big dog."  Using spaces as delimiters (as I intended to do), its two-word phrases would be "I saw," "saw a," "a dog," "dog -," "- a," "a big," and "big dog."  This approach would produce some junk, but I hoped it would be easily filtered.  Not worrying about the punctuation seemed faster than trying to anticipate endless punctuation combinations (e.g., "dog -- a").  It would fail in some instances (e.g., "dog--a"), but those too seemed likely to involve insignificant phrases.

That sentence contains seven spaces.  I could count the number of spaces manually.  With many sentences, it would make more sense to use a formula.  One approach would be to use Excel's SUBSTITUTE and LEN functions to calculate the difference in length between the sentence with and without spaces, and then use MAX to see which row has the largest number of spaces.  TRIM would remove excess spaces.
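A sketch of that space-counting formula, assuming the sentence or filename was in A3 and a spare helper column was available:
B3  =LEN(TRIM(A3))-LEN(SUBSTITUTE(TRIM(A3)," ",""))
Copying that down the helper column and then taking MAX of the column would show how many column groups the longest entry needed.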

Since each two-word pair would contain one space, the sample sentence would contain seven pairs of words.  Therefore, my spreadsheet needed enough columns to accommodate seven two-word phrases.  The approach I took was to set up the spreadsheet like this:



The top row indicated the number of the two-word phrase.  The first one was "I saw," in group 1.  The second row indicated the number of the space found.  For instance, the first space was at character position 2 (i.e., right after "I"), and the second space appeared at position 6 (after "saw").  This image shows only the first three groups; I would need four more.  As it shows, the first group required calculation of only two numbers because its starting point (character position 1) was already known.  I preferred this approach of breaking out the calculations visibly, rather than jamming them all together in one long, complicated formula.  It was usually faster this way, and it made it easier to spot errors.

I used formulas in the header rows to facilitate copying.  For instance, the formula in cell F2 was =F1, and in F1 it was =C1.  That way, once I worked out my formulas in row 3, I could select and copy cells F1 through I3 repeatedly across the spreadsheet, until I had enough column groups to accommodate all of the spaces in my longest filename or sentence.

To prepare the following list of formulas for the cells on row 3, I used a macro.  Those formulas were as follows:

C3  =FIND(" ",$A3)
D3  =FIND(" ",$A3,C3+1)
E3  =TRIM(LEFT(A3,D3))
F3  =IF(D3=LEN($A3),"",C3)
G3  =IF(F3="","",FIND(" ",$A3,F3+1))
H3  =IF(G3="","",IF(ISERROR(FIND(" ",$A3,G3+1)),LEN($A3),FIND(" ",$A3,G3+1)))
I3  =IF(H3="","",TRIM(MID($A3,F3,H3-F3+1)))
J3  =IF(H3=LEN($A3),"",G3)
K3  =IF(J3="","",FIND(" ",$A3,J3+1))
L3  =IF(K3="","",IF(ISERROR(FIND(" ",$A3,K3+1)),LEN($A3),FIND(" ",$A3,K3+1)))
M3  =IF(L3="","",TRIM(MID($A3,J3,L3-J3+1)))

So anyway, now I had a spreadsheet containing the two-word phrases that I wanted to sort and examine.  I made a backup and proceeded to convert the spreadsheet's formulas to values.  With the formulas frozen as values, I could sort and delete rows and columns without changing the results shown.  There appeared to be functions for this purpose in ASAP Utilities and possibly elsewhere.  An easy way to do this was to go to the top left-hand corner of the spreadsheet, hit Ctrl-A to select it all, and then hit Edit > Copy, Edit > Paste Special > Values > Enter > Enter.  I had usually found that the same keystrokes worked in later versions of Excel.  In this case, the keystrokes were Alt-E, C, Alt-E, S, V, Enter, Enter.

However achieved, the point is that cell C3 no longer contained the formula =FIND(" ",$A3).  Now it just contained the number 2, and it would continue to contain that number even if I deleted cell A3.  With that done, I could now delete the columns containing the numbers that I had used to detect my two-word phrases (e.g., columns C, D, F, G, H ...).  I sped up that process by doing a column sort (select all (Ctrl-A) and then Data > Sort > Options > Sort Left to Right > Row 2).  I put numbers at the tops of the Phrase columns, so as to keep their order straight.

Lining Up the Phrases

At this point, the cells toward the right side of my spreadsheet looked like this:
That is, the rows did not all end at the same place.  In this illustration, the filename on row 2 had 17 two-word phrases, while the one on row 3 had only 16.  I wanted them all in a single column.  I wasn't sure if I needed them in proper order (in this example, P2, Q2, R2, followed by P3, Q3, R3), but I preferred that if possible.

I had tackled a similar problem previously.  That post still provided more details than I have written here.  But in some regards, I now saw a way to simplify that process a bit.  I deleted row 1, containing the column headings, as well as column 1, containing the original filenames.  They seemed unnecessary and potentially confusing at this point.  Then I created another worksheet (Insert > Worksheet).  Its default name was Sheet2.  In cell A1 in that worksheet, I entered a formula that referred to the corresponding cell in Sheet1, where my data were located.  The formula that I put into cell A1 in Sheet2 was =IF(LEN(Sheet1!A1)>0,CELL("address",Sheet1!A1),"").

That formula produced this output in Sheet2, cell A1:  [xA8.xls]Sheet1!$A$1.  (xA8.xls was the name of my spreadsheet file.)  I was going to use the preliminary information, the part that said [xA8.xls]Sheet1!, to tell the computer where a new entry began.  In other words, if the spreadsheet had given me just $A$1, I would have changed my formula so that Sheet2, cell A1 would contain something like this:  #$A$1.  Any odd character would have been sufficient in place of # as long as it gave the computer something unique to look for.

Next, I copied that formula from Sheet2, cell A1 to each cell that corresponded to my data in Sheet1.  In other words, in Sheet1 I hit Ctrl-End to discover that the bottom right corner of my spreadsheet was at cell Q14998.  So now, in Sheet2, I copied that formula from cell A1 to all cells down to Q14998.  A fast way to do that was to hit F5 and type Q14998 and hit Enter.  That took me to Q14998.  I put an X there.  Then I went back to A1 and hit Ctrl-C > Ctrl-Shift-End > Enter.  Boom!  Done.  I went back to Q14998 to double-check.  Then Alt-O, C, A to autofit the column widths.

Now I had a bunch of cells that had ugly stuff like cell A1 (above):  [xA8.xls]Sheet1!$A$1.  I hit Ctrl-A > Ctrl-C to copy it all.  I opened Notepad and hit Ctrl-V to paste it.  (If there had been too much data for Notepad, I think something like Notepad++ would have worked.)  The purpose of putting it into Notepad was to remove the column formatting.  (There were delays with some of these operations, as the Windows clipboard and the various programs wrapped their heads around the size of the task.)

Now I copied it all from Notepad to Microsoft Word.  Any editor would have worked, as long as it had a way of doing global search-and-replace operations that would produce line breaks.  That is, in Word, I replaced all instances of the ugly stuff (i.e., [xA8.xls]Sheet1!) with a newline character (^p).  Word reported that it had made about 170,000 replacements.  This gave me the desired list, starting like this:

$A$1
$B$1  
$C$1  
$D$1  

There were invisible tabs after the entries (e.g., it was actually $A$1 tab), and there were also some blank lines.  I cleaned these up with various ^p^p and ^t^p replace operations, repeated until there were no more.  The file was long -- about 3,700 pages.

I went back to Excel and deleted Sheet2:  I didn't need it anymore, and all its calculations would slow things down.  I created a new Sheet2 and pasted the results from Word into it.  I wasn't sure if Word had inserted any other funky invisible characters, so I ran the results through Notepad on the way back into Excel, just to be sure.

But now, oops, new problem.  My old Excel 2003 couldn't accommodate 170,000 rows.  It gave me an error:  "File not loaded completely."  I took a look.  Row 65536 contained a reference to $F$5911.  I could have transferred the whole project over to LibreOffice, which could handle 170,000 rows, but that would have been premature.  I had kept Notepad open, so now I searched for that item $F$5911 in the data that were still sitting there in Notepad.  I deleted everything down to that point, copied the remainder, and pasted it into an adjacent column in Excel.  I had to do this a couple of times.  So now I had a spreadsheet that was three columns wide and 65,536 rows long.  Warning: do not print your results.

I inserted blank columns between these columns of output.  Those blank columns would contain the text corresponding to those cell references.  For instance, in Sheet2 cell A1, I now had $A$1.  So in the blank cell B1, I put this formula:  =INDIRECT("Sheet1!"&A1).  I copied that formula to all the other cells in those adjacent columns.  That gave me three sets of two columns each:  the first provided the reference, and the second contained my actual two-word phrase.

At this point, I froze all those values, as above, and deleted the reference columns.  So now I just had three columns containing two-word phrases.  In short, I had managed to isolate the phrases.  Now it was time to start slicing and dicing them.

Working with Unique Phrases

I wanted to filter those two-word phrases for unique ones.  For example, if there were a thousand two-word phrases that said "Email from," I wanted just one entry of that nature, with a numerical companion cell containing the number 1,000.
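A sketch of that companion-cell calculation, assuming (hypothetically) that the full list of phrases sat in column A and the filtered unique phrases in column D:
E1  =COUNTIF(A:A,D1)
Copied down column E, that would give the number of occurrences of each unique phrase.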

To do that filtering, I tried to use LibreOffice Calc, where I could get all that material into column A.  But Calc hung for about 15 hours before producing output.  (Excel also hung at a number of points in this whole process.)  I did not test Calc's output, but it appeared to have given me some repetitive items in that supposedly nonrepetitive list.  I went back to Excel, filtering each of the three columns and then combining the short output into one column and filtering that.  This gave me 7,539 unique two-word phrases.  That list passed the test for duplicates, but counts indicated that it was not accounting for all 169,891 two-word phrases.  The formulas to do those counts really slowed things down, so I deleted them for the moment.

I thought that it might help to clean up unnecessary characters.  I sorted each of the three columns according to starting and ending characters, so as to detect extraneous ones (e.g., various kinds of brackets, quotation marks, and other symbols).  I also did some global (Ctrl-H) searches in Excel to the same effect.  In addition, I deleted a few meaningless two-word pairs.  These efforts exposed me to the fact that many two-word phrases would consist, at least in part, of single letters (e.g., "A dog," "Roger P").  It seemed that I probably should have done global replace operations, even before breaking out the two-word phrases, so as to promote meaningful two-word combinations (e.g., "Roger P" and "P Smith" might be less recognizable than "Roger Smith").  Possibly a preliminary search for three- and even four-word proper nouns would have been a good approach there.

I also decided, along the way, that it would make sense to tolerate meaningful one-word terms (e.g., the "Pentium" word that would remain after removing the unhelpful "a" from "a Pentium").  That predictable conclusion raised questions about my sequence of steps generally.

After those cleanup efforts, I filtered again for unique two-word phrases and counted their occurrences.  Working with three source columns required some extra steps, mostly to eliminate duplicates, but ultimately my count of items found from searches for unique items was within 1% of the actual number of such items.  In other words, my filtering for unique items substantially accounted for all items.  Specifically, I had 163,391 two-word phrases, and in some apparent overcounting, my list of unique phrases accounted for 163,772 such phrases.

Because those counts did not jibe perfectly, I decided to take an elimination approach -- that is, to remove the two-word phrases that lacked significance, rather than selecting those that had it.  To help me in this process, I divided the two-word phrases into individual words, and then prepared to test each of them against a list of relatively insignificant words that I had cooked up for the occasion.  Before doing that, I had to screen again for stray punctuation (now that e.g., a comma had emerged in "soon, however" such that "soon," would not be found in an elimination of "soon") -- making it increasingly appealing to find some way of doing that earlier in the process.  Another step that would have been beneficially taken earlier, to the extent possible, would have been to convert date references (e.g., "2-13-09 project" vs. "Feb. 13 project") into a standard format, or else decide that none were significant.  It seemed, generally, that renaming the files to standardize various names and terms (or at least revising aliases for them) should have occurred earlier in the process.  Such changes seemed likely to shape this sentence-parsing process.  For instance, if full names were to be used (e.g., "Roger P. Smith"), then a two- rather than three-word focus could be wrongheaded.

Project Abortion

At this point, I had an unpleasant realization.  The need to pare down my spreadsheet, so as not to have to wait forever for calculations, and the loss of concentration due to long waits and the passage of time, had produced a situation where I had not realized that, in the past several hours, my work had ceased to be directly linkable back to the original list of files.  There was no index.  Worse, there had been global changes to remove troublesome punctuation, further reducing direct linkability.  I could continue with this effort, using backups to reconstruct the links.  But there seemed to be good reasons to reconceive the whole approach.

One emerging conclusion was that there just might not be a shortcut to a good rubric or other structured approach to the renaming of files, the coding of sentences, or the constructing of aliases for bits of text being examined in this kind of effort.  One question in that regard was whether there would be a good way to identify and change troublesome characters or combinations of characters up front.  That question led me to a second try.

Friday, April 22, 2011

Finding the Last Occurrence of a Character in an Excel String: A User-Defined Function

I was looking for the last occurrence of a hyphen in a string of text data in a cell in Excel 2003.  I wasn't sure how many hyphens there would be, so I couldn't just do repeated searches for a hyphen from the beginning.  I vaguely recalled that there was a way to do this, but I wasn't sure what it was.  This post describes the process of answering that question.

This, I felt, was probably going to be close to the technique for counting the number of times a letter occurred in a cell.  I had searched for the answer to that one previously.  The answer here did involve the use of the SUBSTITUTE command too.  In that other case, the solution was to count the difference in length of the string before and after substituting nothing (i.e., "") for the occurrence of the character in question.

As pointed out by Aladin Akyurek, that would work here too, if you did the substitution, calculated the length difference, and then searched for the last occurrence, now that you knew how many times the character appeared in the string.   His solution appeared in a single formula, which I disfavored because it was easy to make mistakes in long formulas, hard to understand them, and hard to figure out where the mistake was.  But he did use SUBSTITUTE well, whereas I never used it, so I had something to learn there.

What I found more interesting, though, was his alternate suggestion, involving something else that I never did:  define a function and save it for future use.  I surely should have invested the time to learn Visual Basic, when it replaced the old-style BASIC that I had used for minor programming in the 1980s and somewhat thereafter, but then I guess VB changed again in the 2000s and threw a lot of its users off the trail.  Anyway, his function was one that I could actually understand, from the old BASIC days.  The steps to make it work were:

  • Close all Excel files other than the one you're working on.
  • Go into Tools > Macro > Visual Basic Editor > Insert > Module.
  • Copy and paste this into the window:
Function Reverse(Text As String) As String
    Dim i As Integer
    Dim StrNew As String
    Dim StrOld As String
    StrOld = Trim(Text)
    For i = 1 To Len(StrOld)
        StrNew = Mid(StrOld, i, 1) & StrNew
    Next i
    Reverse = StrNew
End Function
  • Go to File > Close and return to Microsoft Excel.
  • Use the new REVERSE function by specifying the text to be reversed.  For my present purposes, a good use would be like this:  =FIND("-",REVERSE(A1)).
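Following up on that last bullet:  =FIND("-",REVERSE(A1)) gives the position of the last hyphen counted from the end of the string.  To convert that into the usual position counted from the start, a formula along these lines would finish the job:
=LEN(A1)-FIND("-",REVERSE(A1))+1
For example, with "ab-cd-ef" in A1, REVERSE gives "fe-dc-ba", FIND returns 3, and the formula returns 6, the position of the last hyphen in the original string.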
I gave that a whirl.  It worked.  But some months later, after reinstalling Windows and Excel, it didn't work.  I was not sure what had changed.  I tried going into Tools > Macro > Security and changing the level to Medium; I played with alternatives; but I still got error messages.  As another approach, I installed Morefunc and then went into Tools > Morefunc and did what it required in order to install it, and then tried Morefunc's TEXTREVERSE command.  That didn't work either.  The message I was getting was #NAME, which apparently meant that Excel was not recognizing the function (i.e., TEXTREVERSE, or whatever it was called).  Something seemed to be interfering with the operation of the Visual Basic script.  I tried suspending my antivirus program, but that wasn't the solution.  Ultimately, the solution -- getting Morefunc's TEXTREVERSE to work -- seemed to involve a combination of killing and restarting Excel, embedding Morefunc into the worksheet until it finally said "Update successful," and typing my formula in all lowercase.  There was a new problem, though:  Morefunc's TEXTREVERSE function would crash Excel 2003 when dealing with cells containing more than 127 characters.  I tried again with the REVERSE code shown above.  Now that was working too, and no 127-character limit.  It worked on a cell containing more than 1,000 characters.

Thursday, April 21, 2011

Batch Renaming Text File Memos or Emails Using Their Contents to Create the Filenames

I was using Windows 7.  I had text files that contained old email messages, one message per file.  Each began with four or five lines:  FROM, TO, DATE, and SUBJECT, sometimes preceded by a blank line.  In that sense, they were very similar to memoranda.  At present, these emails had numerical names (e.g., 0045.txt) that said nothing about their contents.  I wanted to rename them to a Date - Time - From - To - Subject format, like this:

2010-04-13 13.45 Email from Joe to Nancy re Dinner.txt
In other words, their top several lines told me everything I needed to name them; I just didn't want to have to do it manually.  This post describes the steps I took to automate that naming process.

My knowledge of programming was more or less limited to DOS-style batch file scripting.  If I had been interested in entering single commands, I would have done that in a DOS box (a/k/a command window), which I could open in Windows 7 (customized) by going to Start > Run > cmd.  I wanted to run the relevant commands automatically, across a number of text files, so I would be saving those commands in a batch file, created in Notepad as a text file (ANSI format), saved as CommandFile.bat, and executed by double-clicking on it in Windows Explorer.  The question here was, what commands should go into that file?

Instead of creating a batch file, I could have used some other approach.  For instance, it seemed possible that IntelliGet or TextCrawler would do the job better, provided I were interested in spending the time to learn how to use them.  I did actually try with TextCrawler, but at this writing I had a question sitting in the TextCrawler forum for almost two days without an answer.  Same thing with AutoIT.  I wasn't sure how to proceed with those approaches, so at this point I was deciding to return to the batch file approach.  I had been using batch files on a very simple level for a long time, so I started there.

A search for batch command syntax information led me to the A-Z Index site, which I had used before.  I used that site to interpret what was happening in a two-line batch script that I found in another post:
@ECHO OFF
FOR /F "skip=2 delims=" %%i in (%1) DO >> lines.txt ECHO %%i & GOTO :eof
I've presented the DOS commands in capital letters to make them more visible.  ECHO would tell the computer what to make visible onscreen or in the output file (in this case, lines.txt).  The second line used a "FOR /F ... DO" loop to repeat the command.  (There were other FOR options.  For instance, FOR /D would operate on directories, not files.)  In plain English, the concept there was, "For each X, do Y."  The big idea was, apparently, that the "skip" part would tell it to skip two lines and then echo the third one to the lines.txt file, and then go to :eof (short for "end of file," i.e., end of this script).  In other words, this FOR-DO combination just said, skip a line, skip a line, output a line, then break out of the loop (i.e., quit).  So there would have to be another command that would call this script for each file in the directory.  But before getting to that, I wanted to understand how the "skip" part worked, because that's what I would need to modify for my purposes.  The syntax reference seemed to be saying that "skip=2" was a command to skip two lines, as I had just surmised.  Another source seemed to say that the "delims" part would specify what would count as the unit of analysis.  For instance, if there were a blank space between the equals sign (=) and the quotation mark, we would be looking for a blank space, not the end of a line, as a delimiter.  The loop would restart each time it found a blank space.  Apparently leaving no space between = and " meant that it would default to end-of-line as its delimiter.  I wasn't sure about that.  But what we seemed to be saying, so far, was that we were going to loop each time we encountered a line end, for a total of two loops.  Then, having arrived at the third line, we would proceed with the main attraction:  we would do something with "%%i in (%1)."  Specifically, we would export it, whatever it was, to lines.txt.  What we would be exporting would be %%i.  The percent signs, I knew, indicated a variable.  So we were going to export variable i, in %1, to lines.txt.  I guessed that (%1), being in parentheses, referred to a variable already supplied.  That is, the command that would call this script would indicate the name of a file to examine, and this %1 would represent that filename.  So, to recap:  having arrived at the third line of the file in question, let's export something (specifically, %%i) to lines.txt.  Since there wasn't any more information on what the %i variable (presented as %%i because it was occurring inside a batch file) was supposed to do, it seemed that it must just refer to the whole unit of analysis -- in this case, the whole line -- where we had ended up, having skipped two lines to get here.  In other words, this whole third line of the text file would be exported to lines.txt.
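As for the command that would call this script once per file, a hedged sketch (assuming the two-line script above was saved as a hypothetical GetLine.bat, and that the text files had simple names with no spaces, like 0045.txt) would be something along these lines, run from a batch file in the folder containing the emails:
FOR %%f IN (*.txt) DO CALL GetLine.bat %%f
At the command prompt, the %%f would be written with a single percent sign (%f); the doubled form is only for batch files.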

Now, how would all that work for me, where I wanted to output multiple lines?  I thought I might try to identify each of the lines in question with a separate variable, and then concatenate (i.e., combine) those variables in the output.  Basically, if variable L1 stands for line 1, and L2 for line 2, and so forth, then what I wanted was to output L1+L2+L3+L4 to lines.txt (though I preferred to call it output.txt).  (The double output signs, >>, meant "add this onto whatever already exists in the output file," as distinct from >, which would mean "replace whatever is in the output file with this new data.")  I wouldn't need a skip parameter -- I wanted to output all of the first four lines or, actually, the first five, since in some files the first line was a blank.
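I never polished this into a working script, but a rough, untested sketch of that multi-line idea -- hypothetical names throughout, and assuming the header lines contained nothing that the batch language would choke on -- might have run along these lines:
@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
SET /A n=0
REM FOR /F skips blank lines on its own, which takes care of the optional leading blank
FOR /F "delims=" %%i IN (%1) DO (
  SET /A n+=1
  SET "L!n!=%%i"
  IF !n! GEQ 5 GOTO gotlines
)
:gotlines
>> output.txt ECHO !L1! !L2! !L3! !L4! !L5!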

At this point, it occurred to me that I might have another option.  I made a backup of the folder containing the emails.  Then, at the DOS command prompt, I typed COPY *.* COMBINED.TXT.  This put all of their contents into one large text file.  I opened it and copied and pasted its entire contents into an Excel spreadsheet.  Then I parsed its lines to produce the desired filenames.  This was much more time-consuming than a scripted solution would have been, and it was not what I set out to find, but it did get the job done.
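For the parsing-in-Excel step, one sketch of the kind of formula involved (the columns here are hypothetical:  A holding the old numeric filename, B the already-reformatted date and time, and C, D, and E the From, To, and Subject values) would be a concatenation that builds a rename command:
="REN """&A2&""" """&B2&" Email from "&C2&" to "&D2&" re "&E2&".txt"""
Copied down and then pasted into a batch file, lines like that would produce commands of the form REN "0045.txt" "2010-04-13 13.45 Email from Joe to Nancy re Dinner.txt".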

Thursday, December 31, 2009

Notes on Converting Word Processing Documents from 1985-1995

I was using Ubuntu 9.04 (Jaunty Jackalope) and VMware Workstation 6.5.2, running Windows XP virtual machines (VMs) as guests. I was trying, in one of those VMs, to convert some data files from the 1980s and 1990s. This post conveys some notes from that process.


I had used a number of different database, spreadsheet, and word processing programs back then. The filenames had extensions like .sec and .95. These suggested that the file in question was probably not a spreadsheet (whose extensions would probably have been .wks or .wk1 or .wq1). I suspected these were word processing docs, but what kind?

I had a copy of WordPerfect Office X4, so I tried opening them in that. The formats I had used principally back then were WordStar (3.3, I think), WordPerfect 6.0 for DOS, XyWrite III+, and plain ASCII text. So for some documents it took several tries, telling WordPerfect X4 to try these different formats, before the document would open properly. Even then, not all of them did.

I also tried the approach of highlighting a bunch of these files, right-clicking, and indicating that I wanted to convert them to Adobe Acrobat 8, or to combine them in Acrobat. Unfortunately, these efforts tended to cause Windows Explorer and/or Acrobat to crash.

It occurred to me to try another approach. I left Windows in VMware and dropped down to Ubuntu. I selected 57 files that I wanted to convert. OpenOffice 3.0 Writer started up by default. It opened them all. They had been last modified in 1993 and thereabouts. I think they were created with Word 3.1. For each file, I clicked a button and got a PDF created in the same folder with the same name and a PDF extension.

OOo Writer wasn't able to open some WordStar 3.3 files from the mid-1980s. Several sources referred me to Advanced Computer Innovations for that sort of conversion. Their prices weren't bad, but I didn't want to pay $1 per file per 50K for these old materials. Instead, I looked into old Microsoft converters.  Those, unfortunately, did not appear to be available anymore.  A search led to a forum that led to WordStar.org converters.  Those, however, did not appear to go back to WordStar for DOS 3.3.  Graham Mayor's page looked like a better bet.  It gave me a Wrdstr32.zip file, but by the time I got around to it, I had already addressed my needs, so I didn't actually try this one.

Separately, somehow, I found (or maybe I had always retained) a copy of a program that seemed willing to install "Microsoft Word 97 Supplemental Converters."  Searching for this led to a Microsoft page where I was able to download the Word 97-2000 Import Converter (wrd97cnv.exe); unfortunately, that proved to be a backwards conversion from Word 97 to Word 95.  Trying again, I found that the Microsoft Office 2003 Resource Kit webpage led to a list of downloads that included an Office Converter Pack that I downloaded (oconvpck.exe).  I seem to have installed this, and I think this is what ultimately did the job for me.

Resources for converting XyWrite III+ files were pretty scarce by now, a decade after what appears to have been the last (short-lived) effort to reconstruct a manual of its text-formatting codes. Apparently nobody who has a copy of the paper manual has gotten around to PDFing and posting it; or perhaps Nota Bene (which apparently bought XyWrite in the 1990s), for some reason, is unwilling to allow any such reference to be made available. But here are some examples of codes used, from what I've been able to figure out and recall:

«PT23» start using proportional type font no. 23
«PG» page break
«TS5,10» set tabs 5 and 10 spaces to the right
«DC1=A 1 a» set DC1 outline structure (first level = A, B, C ...)
lm=0 set left text margin at zero characters (i.e., not indented)
«FC» format centered, i.e., center text
«MDBO» begin boldface
«MDNM» end special formatting (e.g., boldface)
«SPi» set page number to i (e.g., for preface)



I also had some old .wpd (WordPerfect) documents.  Not all of them had .wpd extensions to begin with, so I searched for a bulk renamer to give the stragglers .wpd extensions.  I tried Bulk Rename, but its interface was complex and inflexible compared to that of ExplorerXP -- just select the files you want to rename, press F2, and set the parameters.
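A command-line alternative, sketched in the same batch spirit as the posts above and assuming old extensions like .sec and .95, would have been a one-line loop in a batch file:
FOR %%f IN (*.sec *.95) DO REN "%%f" "%%~nf.wpd"
In the event, though, ExplorerXP handled the renaming quickly enough.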

Once I had the files named with .wpd extensions, the next question was, how to get them into PDF format.  That was easy with the others, above, to the extent that Microsoft Word could read them; I could PDF them from there.  I shouldn't say it was "easy"; it was still a manual process, and I was now searching for a way to automate it.  Unfortunately, I was not finding any freeware ways to convert from .wpd to .pdf.  Later versions of WordPerfect include a Conversion Utility to bring those files into the modern era, but they are still wpd files.  Adobe Acrobat 8.0 was able to recognize and convert the files (select multiple files in Windows Explorer, right-click and choose the Convert to PDF option), but they proceed one by one, and I had hundreds of files, and it took several seconds for each one to process.  Also, it added an extra blank page to the ends of some if not all of these old WordPerfect documents.  I didn't find any wpd to odt (OpenOffice Writer) converters.  I thought about trying Google Docs, which someone said could bulk convert to pdf, but they didn't accept wpd as input.  I tried looking for a converter from wpd to doc, and that led me to Zamzar.com, which would convert directly from .wpd to .pdf, but would only let me upload one file at a time. I found that the Options in OpenOffice (I was using the Ubuntu version) could be set to save automatically as Word documents, so I did that, and then uploaded a few of them to Google Docs and downloaded them as PDFs.  The formatting was messed up on a couple of them.  I tried a comparison without Google Docs, just converting to pdf from the .doc files that OpenOffice had saved.  The formatting was better that way, so Google Docs didn't add anything; and the process of converting the Word docs to PDF was the same one-file-at-a-time thing as if I were printing from WordPerfect itself, so involving Word didn't add anything either.  In the end, the best and probably fastest approach seemed to be to select a bunch of wpd files in Windows Explorer, right-click and select Convert to Adobe PDF.

This seemed likely to be a continuing effort, but these notes ended here.