Wednesday, November 9, 2011

Documenting Computer Work with Screenshots and Duplicate Detectors

I was doing some work in Windows 7.  I wanted to log the changes periodically as I went along.  I had already developed a batch file, which I called Shotshooter.bat, to take screenshots and save them as .png files periodically.  (I think this will work in all versions of Windows.)

Having already done the NirCmd setup required to make that batch file work, I slightly revised it to read as follows.  (If the print is too small, use Ctrl-+ or copy and paste into Notepad.  Batch files are best not edited in a word processor like Word, since word processors sometimes change characters.)

:: Shotshooter.bat

:: Captures a series of screenshots.

:: See http://www.nirsoft.net/utils/nircmd.html for info on NirCmd.exe.

:: Takes two arguments:  how many shots, and how many milliseconds between shots.

:: Sample usage:  shotshooter 3600 1000
:: That example would take screenshots every second for an hour,
::   assuming the computer could work that fast.

:: To kill the program sooner, use Task Manager (Ctrl-Alt-Del) > Processes > nircmd.exe.
:: IrfanView provides a fast way to play back results.

:: Create Screenshots folder on drive D

D:
md \Screenshots
:: Go to the drive and folder where you put NirCmd.exe

W:

cd "\Start Menu\Programs\Tools\Programming and Scripting\NirCmd"
:: Run NirCmd with the desired settings

nircmd.exe loop %1 %2 savescreenshot D:\Screenshots\scr~$currdate.yyyy-MM-dd$-~$currtime.HH_mm_ss$.png

:: Go see what you've created in D:\Screenshots

start explorer.exe /e,"D:\Screenshots"

Having created Shotshooter.bat, now I wanted to run it.  I had already used Ultimate Windows Tweaker (UWT) to install a right-click option to open a command window in any folder I would select in Windows Explorer, so all I had to do was open a CMD window where I had saved Shotshooter.bat, and just type "shotshooter 1200 30000" and hit Enter.

(If I hadn't installed that option with UWT, I could also have gone to Start > Run > cmd and then navigated manually to the proper folder with change-drive (e.g., D:) and change-folder (e.g., cd "folder name") commands like those used in Shotshooter.bat, above.  Note that quotation marks are necessary if folder names contain spaces or possibly if they are too long.)

Those parameters of 1200 and 30000 meant that Shotshooter.bat would use NirCmd to take a screenshot every 30000 milliseconds (i.e., every 30 seconds), and would take a total of 1200 screenshots (i.e., would run for 10 hours).  It would save them in D:\Screenshots, and now I would have to decide what to do with them.
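The arithmetic behind those two arguments can be checked with a few lines of Python.  This is only a sketch of the calculation, not part of Shotshooter.bat; the function name capture_duration_hours is made up for illustration, and it ignores whatever time NirCmd itself needs to save each file.

```python
def capture_duration_hours(shots: int, interval_ms: int) -> float:
    """Approximate running time, in hours, for `shots` screenshots
    taken `interval_ms` milliseconds apart."""
    total_seconds = shots * interval_ms / 1000
    return total_seconds / 3600


# The run described in the post: 1200 shots, 30000 ms apart.
print(capture_duration_hours(1200, 30000))  # 10.0

# The sample usage from the batch file comments: 3600 shots, 1000 ms apart.
print(capture_duration_hours(3600, 1000))   # 1.0
```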

When I was writing up these notes, I didn't recall whether NirCmd could also produce JPGs or other image formats.  It probably could.  But after getting used to IrfanView, it probably would have been easier to just use IrfanView (File > Batch Conversion/Rename) to do a mass conversion of PNGs into JPGs if necessary.

The problem with a straightforward slideshow was that, if I allowed a few seconds for each screenshot, I could easily wind up with an hourlong show that would feature extended periods of no change.  On this particular day, I had gone to the store and out to lunch, so presumably nothing was happening during those periods.  The changes that I would want to see might pop up for only a few seconds in that hourlong show.

It seemed I had better get rid of the PNGs that merely repeated the same unchanging screenshot for long periods of time.  To do this, I tried a couple of approaches.  After making a backup of my Screenshots folder, I started with Awesome Duplicate Photo Finder (ADPF).  I adjusted its settings to examine PNGs and told it to search only the D:\Screenshots folder.  It felt that, out of my 1,200 screenshots, 1,160 were potential duplicates.  Closer examination revealed that, while many of those files were not what ADPF considered 100% identical, hundreds were.  Unfortunately, ADPF did not offer a way to bulk-delete the 100% matches.  I did not take the time-consuming approach of just letting ADPF guide me through the 580 pairs of images comprising those 1,160 alleged duplicates, making manual choices as to whether I should delete one of the two images it showed me.

I tried another approach.  Among the many free duplicate file finders, I had long used DoubleKiller.  (Exact Duplicate Finder gave the same results as one type of DoubleKiller comparison, but offered fewer comparison options.)  For some reason, a CRC and size comparison in DoubleKiller gave me only 161 duplicates.  I suspected DoubleKiller was being too precise.  Doing an unreliable size-only comparison, it still found only 340 duplicates.

Following the advice on a page that recommended five duplicate file detectors, I downloaded and installed Dup Detector.  It was not easy to understand, but with some tinkering I was able to get it to work.  The first time I ran it, I told it to search only for 100% matches.  (By default, it was set to search for "Dup if within 98.5% to 100% match.")  The thing that made it work was to go into its Options > "Automatic and Semi-auto delete setup," highlight the "Delete left image" criterion (the only criterion I needed to use in this case) and use the "Swap up" and "Swap down" buttons to make "No delete" come after "Delete left image."  Then, to make it run, I had to start with Get data > Build.  Also, because of the number of files, I thought I had better start with Find > "Find dups setup (method and restrictions)" set to find 9999 pairs.  I still wasn't sure, at the time when I was writing these remarks, whether that was a good number to put there.

When I ran Dup Detector to search for only 100% matches, it found that about half of the PNGs were duplicates.  I eyeballed some of them, using IrfanView to flip through them quickly with just a right-arrow keypress.  There did appear to be a lot of exact duplicates.  I ran an automatic delete to get rid of those dups.  Then I ran a DoubleKiller search for matches that had both identical sizes and identical CRC checksums.  DoubleKiller still turned up 79 pairs of duplicates.  Apparently there had been more than 9999 pairs, first time around.  (If each pair were counted once, a single run of 142 identical screenshots would already produce 10,011 possible pairs, which is more than 9999.)  To check this, I ran another Dup Detector search for 100% matches.  It found a bunch more.
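How quickly a 9999-pair limit fills up depends on how pairs are counted.  The sketch below assumes each unordered pair is counted once, so that n identical images give n(n-1)/2 pairs.  That counting rule is an assumption on my part, not something taken from Dup Detector's documentation, and the function names are made up for illustration.

```python
from math import comb


def pairs_in_group(n: int) -> int:
    """Number of unordered pairs among n mutually identical images."""
    return comb(n, 2)


def smallest_group_exceeding(limit: int) -> int:
    """Smallest group size whose pair count is larger than `limit`."""
    n = 2
    while comb(n, 2) <= limit:
        n += 1
    return n


print(pairs_in_group(100))             # 4950
print(smallest_group_exceeding(9999))  # 142
```

Under that assumption, one long idle stretch (such as a lunch break) could fill the 9999-pair limit by itself.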

It seemed, then, that the best strategy would have been to run DoubleKiller first, so as to get rid of one item in each exact pair.  I did that now.  Then I ran Dup Detector, looking again only for 100% matches.  It found none.  I tried again, this time with a search for 99.9% to 100% matches.  Again, it found 9999 pairs.  Some looked identical, in the program's necessarily reduced and imperfect matchup screen, but the differences in others were visible -- more than 0.1% different, I would have thought.  I tried another search, this time adding a decimal point -- looking, that is, for 99.99% to 100.00% matches.

While that was underway, I ran another of the simple comparisons available in ADPF.  It said that, of the 851 pictures remaining (out of the original 1,200), it found 811 similar pictures.  In the bottom pane, I clicked twice on the Similarity column heading, so as to see what it considered the 100% matches first.  I couldn't tell any difference between the ones that I looked at.  I wasn't sure why ADPF had not considered them identical.

Before doing anything with that ADPF comparison, I went back to Dup Detector.  It had completed its 99.99% search.  It was still finding 9,999 pairs.  I had already noticed that, unlike ADPF, Dup Detector was not able to show the right portion (maybe one-sixth) of my widescreen screenshots.  I also noticed that the first line of the Dup Detector report, in the top left corner of the screen, said that it was comparing 99.9% (not 99.99%) matches.  So unless there was a bug in that report, apparently one decimal place was as precise as it got.

I preferred ADPF's visual comparison, so I went back to it.  In the bottom pane, it looked like about two-thirds of its similar pictures were at the 100% level.  That would apparently mean I would have to do hundreds of manual comparisons:  Picture 1 might match Picture 2, and also Pictures 3, 4, 5 . . . I went down to the 99% matches.  These, too, were identical, as far as I could tell.  I had noticed, in IrfanView, that the only thing that had changed since the previous screenshot, among some screenshots, was that the system clock, in the bottom right-hand corner of the screen, had moved ahead by one minute, so maybe that sort of thing kept ADPF from catching them at the 100% level.  ADPF wasn't showing the taskbar or other outer edges of the screenshots, so I couldn't tell for sure.

With hardly any exceptions, I found that even the 95% matches in ADPF were virtually identical.  The only difference that I could detect, in the 50% or less of matches where I did see a difference, was that a different window might be foregrounded -- that, in other words, its title bar would be a different color in one screenshot than in the other.  That is, the ADPF matching levels seemed more realistic than those in Dup Detector:  this was the kind of difference that I would expect to be detected at the 96% level, well before the 99.9% level.  A 99.9% match, I felt, should involve no more than the tiniest flyspeck of difference.

For my purposes, I did not get regular, visible differences in matches, in ADPF, until I was down at the 91% level.  There were few matches at that level, so I decided to err on the safe side, manually deleting duplicates down to the 95% level.  But in the process, I stopped along the way to re-run Dup Detector, since it seemed I might be able to calibrate it against ADPF.  In other words, I first deleted all of the 100% matches in ADPF, and then ran Dup Detector.  There were about 360 of those 100% matches, or about 45% of the 811 similar pictures detected by ADPF.  Deleting them was pretty fast, once I got the keystroke combination worked out; it probably took 6-7 minutes.

Rerunning Dup Detector at the 99.9% setting, after deleting ADPF's 100% matches, still produced 9999 matches from the 482 screenshots remaining.  I expected it to produce matches that were extremely difficult to tell apart.  This was not the outcome I got.  There were a number of rather obvious (although still very minor) differences.  It seemed, at this point, that Dup Detector's supposed 99.9% match was not realistic and, for my purposes, not meaningful.  In general, it seemed that Dup Detector had been useful only for purposes of automated deletion of 100% matches, though possibly that function would have been served equally well by an easier DoubleKiller comparison in terms of file size and CRC.  It didn't look like Dup Detector had anything more to offer me at this point.

In the interests of automating future comparisons, I looked around for a free bulk CRC calculator, but ultimately got better results searching for an MD5sum program on CNET.  I thought about FSUM but finally went with the somewhat higher-ranked MD5summer.  Both would do batch work and yield text-file output, but FSUM was command-line.  MD5summer did give me the option of saving its calculated sums as a text file, which I then imported into Microsoft Excel.  Using text parsing functions (e.g., MID, LEFT, TRIM), I extracted the 32-character checksum, sorted, and used a formula to compare cells.  MD5summer had evidently identified only 322 duplicates in 161 pairs.  Possibly the reason the duplicates were found only in pairs was that I had run Shotshooter (above) in 30-second intervals:  there would be only two screenshots per minute, before the system clock changed, producing a screenshot with a different checksum.  That had probably been the case with the DoubleKiller CRC checksum results as well.
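The checksum-and-compare step that I did with MD5summer and Excel could probably be scripted instead.  The following is a minimal sketch in Python, using only the standard library, not a description of what MD5summer does internally.  The function names are made up for illustration, and the folder to scan is passed on the command line.

```python
import hashlib
import sys
from collections import defaultdict
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the 32-character hexadecimal MD5 checksum of one file."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(folder: Path, pattern: str = "*.png") -> list[list[Path]]:
    """Group files in `folder` whose contents have the same MD5 checksum.

    Only groups of two or more files are returned; within each group
    the files are in filename order.
    """
    by_hash = defaultdict(list)
    for path in sorted(folder.glob(pattern)):
        by_hash[md5_of_file(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python find_dups.py D:\Screenshots
    for group in find_duplicate_groups(Path(sys.argv[1])):
        keep, *extras = group
        print(f"keep {keep.name}; exact duplicates: {[p.name for p in extras]}")
```

This only reports duplicates; it deletes nothing.  Like the CRC and MD5 comparisons above, it treats two screenshots as different as soon as a single pixel (such as the system clock) changes.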

One workaround would have been to see if I could conceal the system clock before starting, or I could batch-trim the PNGs in IrfanView (File > Batch Conversion/Rename > Advanced > Crop) before running the checksum, but in this case I wanted the clock to be visible on the output.  (I could also have used cropping, or could have drawn circles and arrows on my PNGs, using something like Photoshop, to narrow the focus to particularly interesting changes on the screen, so as to reduce the percentage match that ADPF or other duplicate detectors would calculate.)  I could have done the batch-trim with a copy of the snapshots backup folder, so as to produce a list of files to be deleted without harming the originals (or, in this case, the copies).  But what if the only thing that changed (in some future application) was a small item in the center (i.e., not at the edge) of the screenshot?  I could use the spreadsheet to identify not only the duplicate pairs but also the time periods during which every minute had a matching pair, so as to lead me toward large stretches of time when nothing would change, and then maybe a manual comparison in ADPF would be manageable for the rest.
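The idea of finding stretches of minutes in which every screenshot matched could be sketched as follows.  This is a rough illustration, not something I actually ran.  It assumes filenames shaped like scr2011-11-09-14_30_00.png (my guess at what the NirCmd naming pattern in Shotshooter.bat produces), and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def parse_shot_time(filename: str) -> datetime:
    """Read the timestamp out of a name like scr2011-11-09-14_30_00.png
    (assumed filename shape)."""
    return datetime.strptime(filename, "scr%Y-%m-%d-%H_%M_%S.png")


def static_stretches(shots):
    """Find runs of consecutive minutes in which nothing changed.

    `shots` is an iterable of (datetime, checksum) pairs.  A minute counts
    as static if it has at least two screenshots and they all share one
    checksum.  Returns a list of (first_minute, last_minute) tuples.
    """
    checksums = defaultdict(set)
    counts = defaultdict(int)
    for taken_at, checksum in shots:
        minute = taken_at.replace(second=0, microsecond=0)
        checksums[minute].add(checksum)
        counts[minute] += 1

    static = sorted(m for m in checksums
                    if counts[m] >= 2 and len(checksums[m]) == 1)

    stretches = []
    for minute in static:
        if stretches and minute - stretches[-1][1] == timedelta(minutes=1):
            stretches[-1][1] = minute           # extend the current run
        else:
            stretches.append([minute, minute])  # start a new run
    return [(start, end) for start, end in stretches]
```

Long stretches would point to periods, like a lunch break, where most screenshots could be deleted in bulk, leaving the shorter, busier periods for manual comparison.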

So those were possibilities for future projects.  At present, having deleted the ADPF matches down to the 95% level (leaving a few near-duplicates where I could quickly see a difference), I continued on a bit further in ADPF, deleting some additional near-duplicates.  At the 91% level, almost every pair of near-duplicates contained visible differences, so I stopped there.  So I was done with ADPF.

I now had 442 fairly distinct screenshots, out of the original 1,200.  I would have had more if I hadn't abandoned the computer for several hours while Shotshooter was running.  I viewed a bunch of them in IrfanView, again using the right-arrow key to move quickly to the next.  I had already set IrfanView (Options > Properties > Browsing/Editing) to go to the next file after I deleted one -- or maybe it did that anyway, by default.  It now occurred to me that some of my ADPF work might have been faster, and that differences might have been easier to see, if I had just paged through the screenshots (or at least some of them) in IrfanView, using the Delete key (alternately, the X button, up by the menu bar) to delete apparent duplicates.  When I ran into a stretch where there seemed to be many duplicates, I stopped hitting Delete (in case, by deleting too fast, I would accidentally delete one after the screen did change) and instead just selected and deleted many at once in Windows Explorer.

Holding down the right-arrow key in IrfanView allowed me to page quickly through obviously similar or dissimilar screenshots.  There were still quite a few, requiring as many as 90 virtually duplicate screenshots to be deleted in one case.  It seemed that ADPF may have been fooled, not only by the system clock (which still seemed to be the only thing that was changing, in many cases), but also by relatively complex screenshots (e.g., showing photographic or Google Earth images rather than just text documents, spreadsheets, and Windows Explorer sessions).  That is, to my way of thinking, many of these images were 99% similar, but ADPF hadn't even considered them 91% similar, so possibly its comparison engine was miscalculating similarities in some conditions.

After these other steps, I took a final trip through the snapshots, in IrfanView, and deleted a few more that were very similar to the ones immediately preceding them.  I wound up with 207 snapshots, out of the original 1,200, that seemed to represent fairly well what I had been doing over a 10-hour period.  Again, the number could have been substantially larger -- maybe around 300-350 -- if I hadn't spent a few hours away from the computer.

Now I wanted to put these snapshots into some kind of slideshow.  I rarely made slideshows.  On a few of the screenshots, I decided to use Adobe Photoshop Elements to draw circles and lines to draw attention to changes, from one slide to the next, that might otherwise escape attention.  Then I figured I would use IrfanView (File > Slideshow > Save slideshow as EXE/SCR) to create a slideshow.  (I could also have used something like PowerPoint, except that apparently that would have required me to create 207 slides and then import a photo into each.)  But IrfanView wouldn't let me add circles and arrows or, as seemed increasingly appropriate, a voiceover, to explain what was going on.

I tried using both Adobe Premiere Elements and CyberLink PowerDirector, but neither of them wanted to let me export to a full widescreen format.  Saving it in a reduced format (e.g., 720x480) lost so much detail that it was hard to read what was being displayed.  They also created huge files.  These programs -- especially Premiere Elements -- were also pretty terrible at giving me a simple way of arranging slides.  I wound up just creating an IrfanView EXE slideshow in full-screen mode -- and you know what?  It was beautiful.  Visually, it was perfect.  It looked exactly like the regular computer screen from which I had created all those screenshots.  And it was only 91KB.  Tiny!  The only drawback was that it couldn't incorporate a voiceover and lines and arrows.  So the output side of this project was still in development.

1 comment:

raywood

In a later post, I describe a different approach to this project.