Showing posts with label crash. Show all posts

Sunday, April 8, 2012

Windows 7: BSOD: Errors 116 & 119: Interpreting the Minidump or Kernel Dump File

I had been having Blue Screen of Death (BSOD) crashes.  These were happening on one machine and not the other.  This was odd; both machines had virtually identical Windows 7 installations.  They also had the same motherboards and same amounts and kinds of RAM.  This post is a continuation in the effort to figure out why.

Given the similarities between the computers, I suspected the crashes were due to software.  Although the Windows installations were virtually identical, I was not always using exactly the same programs on both machines.  There was also a possibility that a CPU upgrade was responsible for a new bout of crashes:  both machines had previously had the same processors, but I had just installed a faster one on the crashing machine, and it had just begun crashing again.

In the previous episode, I had used BlueScreenView but had not known how to interpret its reports.  More accurately, I had not known how to interpret the minidump reports, viewable in BlueScreenView, that Windows would produce during a BSOD.  I wanted to be able to understand what the minidump file was telling me.

Understanding the minidump seemed especially important this time because, unlike the last time, the BSOD was not pausing onscreen long enough for me to see what it said.  It was flashing by so quickly that I just caught a glimpse of blue and then the machine was rebooting.  I recalled that I had seen, somewhere, a setting that would prevent that from happening.  Eventually I found it:  Start > Run > SystemPropertiesAdvanced.exe (or Control Panel > System > Advanced tab) > Startup and Recovery Settings > uncheck Automatically restart.  My other settings there were "Write an event to the system log," "Kernel memory dump" (not "None" or "Small memory dump (256KB)"), and "Dump file = %SystemRoot%\MEMORY.DMP," with "Overwrite any existing file" selected.  I wasn't sure if those were the right settings; that's just what I had.  One source told me that I would want to overwrite because the MEMORY.DMP file would eat up lots of disk space.  With these settings, I would have a minidump for every crash and a MEMORY.DMP for only the most recent crash.  So then I clicked OK and got this message:

System Properties

Windows might not be able to record details that could help identify system errors because your current paging file is disabled or less than 800 megabytes.  Click OK to return to the Virtual Memory settings window, enable the paging file, and set the size to a value over 800 megabytes, or click Cancel to change your memory dump selection.
What I inferred from this was that I could opt for the small memory dump with my existing settings, or else I would have to change the paging file settings.  Right there in the Advanced tab, I went to Performance Settings > Advanced tab > Virtual memory > Change.  I had a 16MB paging file on drive C and a minimum 2GB paging file on another drive.  Apparently the kernel dump needed a paging file of more than 800MB on drive C.  Since at least the days of Windows XP, I had emphasized putting the bulk of the paging file on another drive, in the belief that this would enhance performance.  A search now led to the suggestion that, especially on a machine with substantial RAM, I would rarely if ever run out of RAM and actually use the paging file.  On the other hand, a different post in that same thread quoted Microsoft as saying that paging files are used often and should be located on fast (and, obviously, uncompressed) drives if available.  A quick look at pagefile.sys on the second drive indicated that it was presently at the minimum 2GB size I had set for it, on a system with 12GB RAM.  So it seemed that advice to make the paging file half as large as RAM, or twice as large, or some other similar value, might significantly overstate how large a paging file I would actually need.  There had long been warnings that setting the minimum size too low would impose at least a slight performance hit, because Windows would have to dynamically resize the pagefile if it needed more space; but I thought that saving or otherwise manipulating a larger file might also cause a slowdown.  I concluded that the paging file probably was not being used often, that I didn't want to preallocate space that I might need for some other purpose in a pinch, that a fixed larger size could have its own drawbacks (including being inadvertently saved in a drive image), and therefore I should set the paging files on both drive C and the other drive to the System Managed Size > Set option.
After a reboot, I saw that the memory dump settings were as I had left them and the paging file size (with a full set of programs loaded) was 18427MB recommended and 24571MB currently allocated, or about 150% and 200%, respectively, of installed RAM.
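
To double-check those percentages, here is a minimal sketch of the arithmetic in Python.  It assumes that 12GB of RAM means 12 x 1024 MB, and it takes the 800MB threshold from the System Properties warning quoted above.  The function names are mine, made up for illustration.

```python
# Assumption: "12GB RAM" is treated as 12 * 1024 MB.
RAM_MB = 12 * 1024
# Threshold named in the System Properties warning quoted above.
KERNEL_DUMP_MIN_PAGEFILE_MB = 800


def percent_of_ram(size_mb, ram_mb=RAM_MB):
    """Return size_mb as a whole-number percentage of installed RAM."""
    return round(100 * size_mb / ram_mb)


def big_enough_for_kernel_dump(pagefile_mb):
    """True if the paging file is over the 800MB the warning asked for."""
    return pagefile_mb > KERNEL_DUMP_MIN_PAGEFILE_MB


print(percent_of_ram(18427))            # recommended size, as % of RAM
print(percent_of_ram(24571))            # allocated size, as % of RAM
print(big_enough_for_kernel_dump(16))   # my old 16MB paging file on drive C
```

The two sizes come out at roughly 150% and 200% of RAM, and the old 16MB paging file on drive C fails the 800MB test, which would explain the warning.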

One thing still on the burner was the indication, picked up from somewhere, that maybe I should be looking into the Windows Event Viewer (Start > Run > eventvwr.msc).  It seemed that Event Viewer was an alternative to BlueScreenView, so I wasn't sure I needed it.  Another recommended approach was to start by looking at the minidump to find the BCCode or STOP code, the cause, and the time when it happened.  I could see that BlueScreenView was showing me the Crash Time, the Bug Check Code, and a Caused by Driver column of information.  I didn't see a column for STOP codes.  I went into View > Choose Columns and saw that there wasn't even a column for STOP codes.  I had forgotten that Bug Check and STOP codes were synonyms.  Looking again, I saw that the three .dmp files shown in BlueScreenView all displayed Bug Check Codes of 0x00000116.  The "Caused by Driver" column listed three different drivers, highlighted in the lower pane, but what was this bug check code telling me?  Microsoft's Bug Check Code Reference said that Bug Check 0x116 was VIDEO_TDR_ERROR.  The detailed description said, "This indicates that an attempt to reset the display driver and recover from a timeout failed."  (Later, I saw a suggestion that I would have found FaultWire more informative.  For this particular error code, I examined the suggestions below.)
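
Since BlueScreenView shows the code in its long form (0x00000116) while Microsoft's reference uses the short form (0x116), a tiny lookup helps keep them straight.  This is only a sketch:  the table holds just the three codes that come up in the posts on this page, with the names as given in those posts, and the function names are my own.

```python
# Names as quoted in the posts on this page; not a complete list.
BUG_CHECK_NAMES = {
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x116: "VIDEO_TDR_ERROR",
    0x119: "VIDEO_SCHEDULER_INTERNAL_ERROR",
}


def short_code(text):
    """Convert a padded code like '0x00000116' to the short form '0x116'."""
    return hex(int(text, 16))


def describe(text):
    """Return the short code plus its name, if the small table knows it."""
    name = BUG_CHECK_NAMES.get(int(text, 16), "not in this small table")
    return f"{short_code(text)}: {name}"


print(describe("0x00000116"))
```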

So that was interesting.  It wasn't the CPU; it was the relatively new video card, an MSI R6570-MD2GD3 LP Radeon HD 6570 2GB.  I'd had it for a few weeks.  It seemed to me that the crashes were happening especially when I was running the Opera browser.  I couldn't make anything of the parameter information provided in the Bug Check Code Reference and listed in BlueScreenView, but I did do quick searches for the three drivers that were listed, for the three .dmp files shown in BlueScreenView:  pacer.sys, atikmpag.sys, and discache.sys.  Nothing jumped out at me for the other two, but I had seen pop-up dialogs referring to atikmpag when running Opera, and now it appeared that atikmpag.sys BSODs were related to video hardware problems (e.g., having a video card in the wrong slot). A right-click in Control Panel > Device Manager > Display Adapters indicated that I was already using the latest driver for the video card, and Opera said I was using the latest version.  Possibly this was happening only when Opera was overloaded:  I usually had a bunch of tabs open.  I decided to try the approach of killing Opera as soon as an atikmpag dialog popped up.  But the next crash wasn't due to Opera -- it wasn't running at the time -- so this was more like background information for the time being.  The next several runs of Opera produced no crashes, so possibly one or more of the steps taken here solved the problem.

Previously, I had gotten minidumps after an indication that my dump file size (presumably meaning my pagefile) was too small. Now that that was no longer a problem, I believed I could expect to see full kernel dumps instead of minidumps. I shelved my budding search for guidance on interpreting minidumps, to wait and see what I would get next.  After the next crash, I did have both a new minidump visible in BlueScreenView and a full MEMORY.DMP file in C:\Windows.  I wasn't sure how to view the MEMORY.DMP, so I ran a search and saw two options.  One was to upload the .DMP as an attachment to a request for help (at e.g., SevenForums.com).  I had a slow connection and my .DMP was about 1GB, so the recommended alternative (in an ExpertsExchange post) was to use Microsoft Debugging Tools for Windows.  (I had learned that I didn't have to pay to see the answers provided in ExpertsExchange.com threads; I just had to scroll to the bottom of the screen.)  The solution seemed to be to download the Windows SDK for Windows 7.  This gave me winsdk_web.exe.  That turned out to be a 2.5GB download that would require 4.5GB when installed.  I looked at my notes from the last time I flirted with the SDK.  I had apparently downloaded more than necessary; I was now seeing advice to download only the Debugging Tools for Windows.  (In my version of winsdk_web.exe, these were under Developer Tools, not under Common Utilities.)  This would be a 177MB download requiring 419MB when installed.  It downloaded and installed directly; it didn't give me an option of saving the download for future reinstallation.  It did not seem that it had actually downloaded 177MB, though; it was done in just a few minutes, and that would not have happened on my slow connection.

While that process was unfolding, I cleaned up the following notes that I had accumulated in the meantime; this post returns eventually (below) to the topic of using the SDK to read MEMORY.DMP.

One such miscellaneous note:  I saw a webpage on which Microsoft suggested two different sequences of steps, depending on whether Windows would start or not.  Since Windows was starting for me, their suggestion was, first, to undo recent changes using System Restore.  I had been having this problem for several days, past my most recent restore point.  Besides, by this point I believed I had traced the problem to the video hardware and/or Opera.  So the next step was to consult Control Panel > Action Center for clues.  Nothing there.  Next, make sure I was current on Control Panel > Windows Update.  I had already done that.

The next step recommended by Microsoft was to search for drivers on the manufacturer's website.  Well, I hadn't done that, not exactly.  I had relied on Device Manager, but now I went to the webpage of the video card manufacturer.  To do that, I started with GPU-Z (similar to CPU-Z).  I discovered that I had to choose the Install rather than the Portable option:  the latter would make GPU-Z not only uninstalled but uninstallable on that machine.  Fortunately, I learned this on the machine that I was not trying to diagnose.  On the machine being examined, GPU-Z ran, and it gave me lots of information, but it didn't give me any more manufacturer information than I had gotten from Device Manager:  I was being lazy, but now I saw that I had an AMD Radeon HD 6570.  For that purpose, System Information for Windows (SIW) was a competent alternative.  To get the actual manufacturer information, it seemed I had no alternative but to consult my receipt, or the box that the video card came in.  Oddly, according to Device Manager, SIW, and GPU-Z, the driver I had installed was actually newer than the latest one on the manufacturer's webpage.  I decided to try the Roll Back Driver option in Device Manager.  That put it back to a driver dated about four months earlier.  I hadn't actually installed that older driver, to my knowledge; evidently Windows downloaded and installed the older driver automatically.  So I would have to see if that fixed the problem.  And in the long haul, that was one possible reason for the reduction in BSODs that I would experience in coming days.

In the meantime, the next step recommended by Microsoft was to use Safe Mode to troubleshoot problems.  They explained how to get into Safe Mode, but not what to do once I was there.  One possible intention was that I would load safe mode without startup programs that might be causing the problem.  A clean boot could be helpful at times, but did not seem highly relevant to the kind of crash I was having.  My crashes could occur after hours of operation.  Microsoft's final suggestion was to check for hard drive and memory errors.  I had recently run Windows Explorer > right-click on a drive > Properties > Tools > Check Now > check both options, and had also run MemTest86+.  These did not appear to be the problem in this case.

FaultWire offered other suggestions specifically oriented toward error 116.  The problem, they felt, was probably either in the driver or in hardware that was either defective or improperly installed.  On the video driver side, they suggested using their own commercial (nonfree) Driver Genius or Radar Sync to verify that I had the latest drivers, assuming I hadn't been comfortable with a direct search of the manufacturer's site.  On the hardware diagnostic side, they pointed me toward their Fix-It Utilities and System Suite, and also toward Eurosoft's PC Check and Iolo's System Mechanic (all commercial).  They also suggested checking the Windows 7 compatibility list.

I did get another BSOD, within a day or two, but this time the error was different.  The number was 119 and the message was, "The video scheduler has encountered an unexpected fatal error."  I got it while running the Windows Experience Index test, so in that sense it seemed to be provoked by demanding use, as when Opera had been overloaded (above).  FaultWire had nothing new to add to what it had already said for error 116:  check the drivers, consider faulty hardware or incorrect hardware installation.  I hadn't previously searched the Win7 Compatibility Center, but now I did, and saw that there was no entry for my particular graphics card.  It was an MSI card, and a search of the Compatibility Center for "MSI" by itself turned up over 800 items, so it's not as though the database was weak.  I had evidently just stumbled into a product that was not listed.  I wasn't sure if that meant it hadn't been checked, or if it had been checked and was definitely not compatible.  Either way, this now seemed like something that I obviously should have checked before -- "obvious" being the standard word for what we have learned about, after we have learned it (or re-learned it, as the case may be).  I checked the manufacturer's page for the video card.  I was not impressed with MSI's website in this regard:  searching did not find the product, and when I did finally drill my way down to it, I got a notice:  "The specifications may differ from areas."  Some kind of typo there, but apparently they sold different products under the same model name.  I emailed MSI customer service, to verify that I was understanding the compatibility situation correctly.  They said no, it definitely was compatible.

I tried running the Windows Experience Index again, several weeks later.  By that point, I had rolled back the driver and had taken most if not all of the other steps described above.  This time, it did not crash.  I had also had no further crashes, with Opera or otherwise, during those weeks.  It seemed the driver rollback may have been the solution.  Having evidently solved the problem, I provide the following notes just for future reference.

By this point, I had installed the SDK (above).  This gave me a couple of folders (e.g., C:\Program Files\Debugging Tools for Windows) and a Start Menu shortcut for a folder called Microsoft Windows SDK v7.0.  Choosing Open from the context menu for that folder shortcut took me to the C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Microsoft Windows SDK v7.0 folder.  There, I saw a shortcut for CMD Shell.  This opened up a command window.  It said, "The x64 compilers are not currently installed.  Please go to Add/Remove Programs to update your installation."  I went to Control Panel > Programs and Features > select Microsoft Windows SDK for Windows 7 (7.0) > click Change at the top of the list of programs there > Repair > Next.  But that didn't help.  I did a search and found that few people had had this problem.  My guess was that I got this message because I had installed only a fraction of the full contents of the SDK, and the solution was to install more of it, probably through that same Programs and Features route.  In that case, it seemed I might just ignore the message.

To use the SDK for reading MEMORY.DMP, Dirk Smith said I would actually run WinDbg.exe.  The link to this program (now located in C:\Program Files\Debugging Tools for Windows (x64)) had been installed in another Start Menu folder.  So evidently I was on the wrong track, when I opened the CMD Shell, or maybe WinDbg was just a front end for the command line.  Dirk said I needed to start by using WinDbg to find the proper symbol files.  This involved going into WinDbg > File > Symbol File Path.  There, I typed this:
srv*c:\cache*http://msdl.microsoft.com/download/symbols
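
For what it's worth, that string has three parts separated by asterisks:  the keyword srv, a local folder where downloaded symbols are cached, and the address of the symbol server.  A minimal sketch of that structure in Python follows; the helper names are hypothetical and are not part of WinDbg.

```python
def build_symbol_path(cache_dir, server_url):
    """Join a local cache folder and a symbol server into one srv* string."""
    return f"srv*{cache_dir}*{server_url}"


def parse_symbol_path(path):
    """Split a srv*cache*server string back into its two meaningful parts."""
    parts = path.split("*")
    if len(parts) != 3 or parts[0].lower() != "srv":
        raise ValueError("expected the form srv*<cache folder>*<server URL>")
    return {"cache": parts[1], "server": parts[2]}


path = build_symbol_path(r"c:\cache", "http://msdl.microsoft.com/download/symbols")
print(path)
print(parse_symbol_path(path))
```

So c:\cache was simply the folder where WinDbg would keep whatever symbol files it fetched.
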
Then I clicked OK.  Nothing seemed to happen.  But perhaps it was downloading the appropriate symbols quietly, which was what Dirk seemed to be saying.  The next step was apparently to go into WinDbg > File > Open Crash Dump > navigate to C:\Windows or wherever MEMORY.DMP was.  This got me a command window that seemed to hang, but apparently it was just figuring things out.  After a minute or two, it came back with errors:
Module load completed but symbols could not be loaded for atikmpag.sys.
Module load completed but symbols could not be loaded for atikmdag.sys.
Probably caused by:  dxgmms1.sys
Dirk said I could ignore the first two lines, but I wasn't so sure.  As noted above, an atikmpag file was named in one of my minidumps and I was seeing references to atikmpag in Opera.  He said I should focus on the last line, the reference to dxgmms1.sys.  That one hadn't been named in my minidumps.  Dirk told me to type "!analyze -v" (without quotes) in the command line at the bottom of the WinDbg screen.  That got me another error 119 message, and more besides:
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

VIDEO_SCHEDULER_INTERNAL_ERROR (119)
The video scheduler has detected that fatal violation has occurred. This resulted
in a condition that video scheduler can no longer progress. Any other values after
parameter 1 must be individually examined according to the subtype.

Arguments:
Arg1: 0000000000000001, The driver has reported an invalid fence ID.
Arg2: 0000000000004362
Arg3: 0000000000004363
Arg4: 0000000000004363

Debugging Details:
------------------
DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT
BUGCHECK_STR:  0x119
PROCESS_NAME:  System
CURRENT_IRQL:  a
LAST_CONTROL_TRANSFER:  from fffff880015e322f to fffff8000307ed40
STACK_TEXT: 
[displaying, here, only the right end of each line - RW]
nt!KeBugCheckEx
watchdog!WdLogEvent5+0x11b
dxgmms1!VidSchiVerifyDriverReportedFenceId+0xad
dxgmms1!VidSchDdiNotifyInterruptWorker+0x19d
dxgmms1!VidSchDdiNotifyInterrupt+0x9e
dxgkrnl!DxgNotifyInterruptCB+0x83
atikmpag+0x52dc
atikmdag+0x4f526
atikmdag+0x4d479
atikmdag+0x62070
atikmdag+0xfb298
atikmdag+0x1015de
atikmdag+0x10161d
atikmdag+0x101714
atikmdag+0x101845
atikmdag+0x108d7b
atikmdag+0xfa0dc
atikmdag+0x4d15f
atikmpag+0x5ddb
nt!KiInterruptDispatch+0x16c
amdppm!C1Halt+0x2
nt!PoIdle+0x52a
nt!KiIdleLoop+0x2c

STACK_COMMAND:  kb
FOLLOWUP_IP:
dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
fffff880`053b9eb9 c744244053eeffff mov     dword ptr [rsp+40h],0FFFFEE53h
SYMBOL_STACK_INDEX:  2
SYMBOL_NAME:  dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: dxgmms1
IMAGE_NAME:  dxgmms1.sys
DEBUG_FLR_IMAGE_TIMESTAMP:  4ce799c1
FAILURE_BUCKET_ID:  X64_0x119_dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
BUCKET_ID:  X64_0x119_dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
Followup: MachineOwner
Dirk said the right ends of the STACK TEXT lines were important for identifying third-party drivers.  Atikmpag and atikmdag were prominent there, just before (i.e., below) the dxgmms1 lines.  Anyway, the next step was to type "lmv" into the WinDbg command line.  This command provided details on all running programs or drivers (not sure) when Windows crashed.  As instructed, I searched this pile of information (using Ctrl-F) for the "probably caused by" item, which in my case (above) was dxgmms1.sys.  That search (with variations) found nothing.  I copied and pasted the WinDbg output into Notepad and tried my search there.  This time, it worked.  I tried it again in WinDbg, and this time it worked there too.  Not sure what I had done wrong the first time.  It seems the purpose of this step was to verify the manufacturer of the problematic file.  It looked like dxgmms1.sys came from Microsoft.  But if that Microsoft file had been the source of the problem, wouldn't I have been having these crashes before I installed the new video card?  WinDbg was showing me that the source of atikmdag.sys was AMD.  As Dirk said, Windows itself (i.e., Microsoft) was probably not the culprit.
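
The point about the right ends of the STACK_TEXT lines can be sketched in a few lines of Python.  The stack below is copied from the output above.  The heuristic is my own assumption, not something Dirk spelled out:  frames that show only module+offset, with no !function name, are the ones for which symbols could not be loaded, and in this dump those were the third-party drivers.

```python
import re

# Right-hand ends of the STACK_TEXT lines, as quoted in the post.
STACK_TEXT = """\
nt!KeBugCheckEx
watchdog!WdLogEvent5+0x11b
dxgmms1!VidSchiVerifyDriverReportedFenceId+0xad
dxgmms1!VidSchDdiNotifyInterruptWorker+0x19d
dxgmms1!VidSchDdiNotifyInterrupt+0x9e
dxgkrnl!DxgNotifyInterruptCB+0x83
atikmpag+0x52dc
atikmdag+0x4f526
atikmdag+0x4d479
atikmdag+0x62070
atikmdag+0xfb298
atikmdag+0x1015de
atikmdag+0x10161d
atikmdag+0x101714
atikmdag+0x101845
atikmdag+0x108d7b
atikmdag+0xfa0dc
atikmdag+0x4d15f
atikmpag+0x5ddb
nt!KiInterruptDispatch+0x16c
amdppm!C1Halt+0x2
nt!PoIdle+0x52a
nt!KiIdleLoop+0x2c
"""

# module, optional !symbol, optional +0xoffset
FRAME = re.compile(
    r"^(?P<module>[^!+\s]+)(?:!(?P<symbol>[^+\s]+))?(?:\+0x(?P<offset>[0-9a-fA-F]+))?$"
)


def parse_frames(text):
    """Turn each stack line into a dict with module, symbol, and offset."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            frames.append(match.groupdict())
    return frames


def modules_without_symbols(frames):
    """List modules (in stack order) that appear with no function name."""
    seen = []
    for frame in frames:
        if frame["symbol"] is None and frame["module"] not in seen:
            seen.append(frame["module"])
    return seen


print(modules_without_symbols(parse_frames(STACK_TEXT)))
```

Run against this stack, the heuristic picks out atikmpag and atikmdag, the same two files WinDbg complained about when it could not load symbols.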

It really looked like the purpose of this whole WinDbg and MEMORY.DMP rigmarole was just to get the identity of the driver manufacturer.  I wasn't sure this process was more effective than just doing web searches for the driver name and the error message.  I guess it added dxgmms1.sys to my list of possible causes, and provided confirmation that the atikmpag and atikmdag files were near the heart of this problem.  Whether I would be seeing more of this problem remained to be seen.  As noted above, the older driver presently seemed to have provided the desired stability.

There was one other approach that I hadn't pursued, and decided not to pursue at this point.  That was simply the suggestion to look at the time of the crash, in BlueScreenView, and then use NirSoft's MyEventViewer to examine events within a second or two before the crash.  Preliminarily, that seemed to be another way of getting at the contents of MEMORY.DMP, as listed in the STACK_TEXT above.  But possibly that would be more informative.  For me, further learning on that could await a future BSOD.
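
The idea of looking at events within a second or two before the crash amounts to a simple time-window filter.  Here is a rough sketch in Python.  The event records are made up purely for illustration; I did not export anything from MyEventViewer, and its real export format may look quite different.

```python
from datetime import datetime, timedelta


def events_near_crash(events, crash_time, seconds_before=2):
    """Return events logged in the last few seconds before the crash."""
    window_start = crash_time - timedelta(seconds=seconds_before)
    return [e for e in events if window_start <= e["time"] <= crash_time]


# Hypothetical sample data: the time and the source names are invented.
crash = datetime(2012, 4, 8, 10, 15, 30)
events = [
    {"time": datetime(2012, 4, 8, 10, 12, 0), "source": "ExampleServiceA"},
    {"time": datetime(2012, 4, 8, 10, 15, 29), "source": "ExampleDriverB"},
    {"time": datetime(2012, 4, 8, 10, 15, 31), "source": "ExampleServiceC"},
]

for event in events_near_crash(events, crash):
    print(event["time"], event["source"])
```

In practice the crash time would come from BlueScreenView's Crash Time column, and the window might need widening if the clock readings are not exact.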

Wednesday, March 21, 2012

Windows 7: BSOD: PROCEXP111.SYS

My computer was sailing along, when suddenly I got a Blue Screen of Death (BSOD).  The message began, oddly, with a sentence fragment:

to your computer.

PROCEXP111.SYS

PAGE_FAULT_IN_NONPAGED_AREA

If this is the first time you've seen this Stop error screen, restart your computer.  If this screen appears again, follow these steps:

Check to make sure any new hardware or software is properly installed.  If this is a new installation, ask your hardware or software manufacturer for any Windows updates you might need.

If problems continue, disable or remove any newly installed hardware or software.  Disable BIOS memory options such as caching or shadowing.  If you need to use Safe Mode to remove or disable components, restart your computer, press F8 to select Advanced Startup Options, and then select Safe Mode.

Technical information:

*** STOP: 0x00000050 (0xFFFFFA8100043F20, 0x0000000000000000, 0xFFFFF880073B6DDD,0x0000000000000000)

*** PROCEXP111.SYS - Address FFFFF880073B6DDD base at FFFFF880073B5000, DateStamp 47194089

Collecting data for crash dump ...
Initializing disk for crash dump ...
Physical memory dump complete.
Contact your system admin or technical support group for further assistance.
I probably didn't need to type out all that information, but doing so provided a constructive outlet for frustration.  Besides, you never know what archaeologists of some future civilization will find absolutely crucial for understanding what they are digging out of the rocks.
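
Having typed it all out, I can at least pick apart the two technical lines.  The sketch below pulls the STOP code and its four parameters out of the first line, and works out how far into PROCEXP111.SYS the fault address lies from the second.  It assumes that DateStamp is a hexadecimal count of seconds since January 1, 1970 (UTC), as is usual for driver file timestamps; the function names are my own.

```python
import re
from datetime import datetime, timezone

# The two "Technical information" lines, as typed out above.
STOP_LINE = ("*** STOP: 0x00000050 (0xFFFFFA8100043F20, 0x0000000000000000, "
             "0xFFFFF880073B6DDD,0x0000000000000000)")
DRIVER_LINE = ("*** PROCEXP111.SYS - Address FFFFF880073B6DDD "
               "base at FFFFF880073B5000, DateStamp 47194089")


def parse_stop(line):
    """Return (stop_code, [four parameters]) as integers."""
    values = [int(v, 16) for v in re.findall(r"0x[0-9A-Fa-f]+", line)]
    return values[0], values[1:]


def parse_driver(line):
    """Return the driver name, fault address, offset into the driver, and build date."""
    match = re.search(
        r"\*\*\* (\S+) - Address ([0-9A-Fa-f]+) base at ([0-9A-Fa-f]+), "
        r"DateStamp ([0-9A-Fa-f]+)",
        line,
    )
    address = int(match.group(2), 16)
    base = int(match.group(3), 16)
    stamp = int(match.group(4), 16)  # assumption: seconds since 1970-01-01 UTC
    return {
        "name": match.group(1),
        "address": address,
        "offset": address - base,
        "built": datetime.fromtimestamp(stamp, tz=timezone.utc),
    }


code, params = parse_stop(STOP_LINE)
driver = parse_driver(DRIVER_LINE)
print(hex(code), [hex(p) for p in params])
print(driver["name"], hex(driver["offset"]), driver["built"].date())
```

Two things stand out.  The third STOP parameter is the same address that the second line places inside PROCEXP111.SYS, and the date stamp works out to October 2007, so this was an old driver file.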

This was the second time I'd gotten this BSOD, so evidently it was not going to be good enough to simply reboot and hope for the best.  To respond to the BSOD's first bit of advice, there wasn't any new hardware or software.  On the software side, I had recently restored a backup image of drive C that I had made more than a month earlier.  The first BSOD had occurred within the past few days, prior to the restoration.  In other words, I was suddenly getting BSODs on an install that had worked fairly well for weeks, with and also without software changes made during those weeks.

I did notice something atypical on the hardware side.  I had two different USB external drives connected.  That, itself, was not unusual, though it had not happened often.  The unusual part was that the system would not reboot without one of them being turned off.  It would get as far as giving me a message, which I think was "Loading Operating System," and then it would pause until I shut one of those USB drives off.

I wasn't actually doing anything in particular on the computer when the BSOD happened.  Both computers had been up all night; I had just returned to them in the morning; and at the moment I wasn't even using that computer; I was working on the other machine.  By the time I got to writing these notes, I didn't recall if I had even done anything on that computer.  Not much, anyway; nothing that would seem to have provoked the crash.

As I worked through this issue, I was guided by two posts I had written up a few months earlier, regarding a different STOP error.  One was a closer look at the "memory dump" concept mentioned toward the bottom of the BSOD; the other was a more general-purpose review of possibilities.  The memory dump investigation came to mind at this point because, on reboot, Windows 7 gave me a dialog that said, "Windows has recovered from an unexpected shutdown.  Windows can check online for a solution to the problem."  I hadn't always gotten this dialog after a crash.  It dimly seemed that something I had changed about my system, during the process of working through the prior memory dump post, had given me this information; otherwise I had to use something like BlueScreenView to see it.  The dialog gave me an option, "View problem details."  I took that option and got some technical information that I wasn't eager to read.  It pointed me toward two "Files that help describe the problem."  I copied the addresses of those files (without the actual filename), pasted them into the address bar in Windows Explorer, and looked at them.  One was an XML file that, if I just double-clicked on it (or if I pasted the full path and filename in Windows Explorer), would open as code in Internet Explorer.  This file was arguably more readable in Firefox, but I didn't see anything particularly informative in it.  The other was a minidump file that I opened in BlueScreenView.  (In Notepad, it was semi-gibberish.)  The problem was that I hadn't fared too well in interpreting the minidump last time around, and that was still the case this time.  Following that previous guidance, I did notice that this day's minidump, and also the one from the previous BSOD, contained lines referring specifically to PROCEXP111.SYS, named in the BSOD.  But, as before, I didn't know what else, if anything, I could do with the memory dump information.

Two other things worth noting about this crash.  First, after the crash, Glary Registry Repair found an unusually large number of registry errors.  Since I ran Glary every day, I suspected these were a result, not a cause, of the crash.  Second, in recent days the system had been functioning extremely slowly.  This seemed to depend on the number of programs running, but not entirely.  In particular, I was having the previously noted slowdowns that I had attributed to resource-hog programs (especially GoodSync and BeyondCompare).  Sometimes I noticed that, when those programs were out of the picture, the system sped up considerably; at other times, there seemed to be a lingering effect where the system continued to seem screwed up.  This was what had prompted me to do the system restore.  In that previous post, I mentioned trying Process Hacker to put a speed limit on these resource-intensive programs; but I also noted that this had not seemed to make much difference.  I wasn't sure that there was anything particularly wrong about the Windows 7 installation as a whole, and certainly wasn't eager to reinstall from scratch.

A search led to the information -- surprising to me, but obvious once stated -- that PROCEXP111.SYS was related to Sysinternals Process Explorer.  I had just begun using Process Explorer, after several weeks of using Process Hacker, to control certain programs -- especially GoodSync -- that made excessive resource demands, to the point of making the computer unusable while they were running.  I hadn't noticed specifically whether the previous BSOD named PROCEXP111.SYS as the culprit; but since it had occurred just one day earlier, probably PROCEXP111.SYS was named in that one too.  I probably could have figured this out from the minidump, with sufficient time investment.

I wasn't sure how to interpret this information.  Generally, Microsoft Sysinternals tools like Process Explorer had seemed relatively stable.  It seemed possible that the crash named PROCEXP111.SYS because, unlike Process Hacker, Process Explorer was actually succeeding in putting the brakes on some overly grabby programs, and they didn't like it.  That is, it may have been a problem with Process Explorer, but it may instead have been a problem with these other programs -- that, basically, they would either run at their preferred speed or not at all.

I tried another search.  This led to the suggestion that I should be using a more recent version.  I hadn't checked, but now I saw that mine was v. 11.04, copyright 2007.  Oops.  Upon closer examination, I saw that they were now up to v. 15.13.  I downloaded and installed the upgrade.  I wasn't sure how long it might be until the next BSOD due to Process Explorer, so I decided to close this post at this point.

Monday, January 3, 2011

VMware Workstation 7.1 Unrecoverable Error

I was using VMware Workstation 7.1 on Ubuntu 10.04 with a Windows XP SP3 guest.  I had a WinXP virtual machine (VM) open, and suddenly it crashed, with this error message:

VMware Workstation unrecoverable error: (mks)
Unexpected signal: 11.
A log file is available in "/media/VMS/VMware VMs/WXMUProjectC/vmware.log". Please request support and include the contents of the log file.

To collect data to submit to VMware support, select Help > About and click "Collect Support Data". You can also run the "vm-support" script in the Workstation folder directly.
We will respond on the basis of your support entitlement.
I pursued that option, but it turned out I didn't have a support entitlement, so I turned to this process of researching the solution on my own.  I had gotten a similar error message once before.  I wasn't entirely sure what I had done to solve the problem in that case, other than to keep flailing around until something clicked.  But as I reviewed that previous post, I did recall a different error message that I had gotten when starting the VM.  I started it again and saw this in the lower right-hand corner:
Could not connect Ethernet0 to virtual network "/dev/vmnet8"
More information can be found in the vmware.log file.
Virtual device Ethernet0 will start disconnected.
I tried a search for that error and came across some possible answers.  One was just to reboot, but I had already done that.  Another was to use the Virtual Network Editor.  I found a VMware video on that.  It told me that vmnet8 was associated with NAT, which was the kind of network connection I had selected in VM > Settings > Hardware tab > Network Adapter.  The video then started to talk about adding a network adapter.

It seemed that this could have two implications for me.  First, I had just put a network interface card (NIC) into the computer, as an alternative to the motherboard's onboard network connector.  I had done that in an attempt to deal with a networking problem that turned out to be just a bad cable.  I did think that I had already been getting the second error message, the one just quoted, "Could not connect Ethernet0," before I installed that NIC.  But I had not been getting the crashes previously.  So probably I could fix the crashes, if not that error, by just removing the unnecessary NIC, instead of trying to figure out how to configure it as described in the video.  Second, I had preserved my Ubuntu /home directory during this most recent installation of Ubuntu.  I had also recently installed a new motherboard.  Possibly the settings that I had saved in that /home directory were still dreaming of the old days, with the previous motherboard; perhaps I would have to configure the VM's Ethernet adapter anyway, so as to make it comfortable with the new motherboard.

I started by shutting down the machine, removing that unnecessary NIC, and restarting.  (Before shutting down, I checked Ubuntu's System > Administration > Update Manager, just in case there were updates that would make my life easier in unknown ways.)  I powered up the VM.  No Ethernet0 error message.  OK, one problem solved.  Would it crash?  I worked with the VM for a couple of days, but then it did crash again.  It seemed that the frequency of the crashes was much reduced, so maybe removing the NIC helped.
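For what it's worth, the Ethernet0 message quoted earlier names a specific device node, /dev/vmnet8.  Here is a minimal sketch, in Python, of one way to check whether that node exists before powering up the VM.  The function name and the wording of its return values are my own, for illustration only, and I am assuming the node would normally show up as a character device.  A present node is only a first check; it does not prove that VMware's NAT networking is actually working.

```python
import os
import stat


def device_node_status(path):
    """Report whether `path` is missing, a character device, or something else."""
    try:
        mode = os.stat(path).st_mode
    except (FileNotFoundError, NotADirectoryError):
        return "missing"
    if stat.S_ISCHR(mode):
        return "character device present"
    return "exists but is not a character device"


if __name__ == "__main__":
    # /dev/vmnet8 is the node named in the Ethernet0 message quoted earlier.
    print(device_node_status("/dev/vmnet8"))
```

If this reported "missing," that would at least be consistent with the "Could not connect Ethernet0" message, and it would point toward VMware's networking setup rather than toward the VM itself.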

This time, I took a look at the log file.  It was in the folder containing the other files for the VM, including the .vmdk file, and it was named simply "vmware.log."  It contained a large number of entries, going back days, including what looked like several hundred, at the end of the file, that all occurred within the last second before the crash.  The first one in that last second was "Caught signal 11 -- tid 3652."  There hadn't been any others for several minutes before that, and I also noticed that the "11" was the same number as appeared in the error message onscreen (quoted above).  This did seem to be the beginning of the end.  After that "Caught signal 11" message, there in the log file, I saw many repetitions of a few other types of messages, like these:
mks| SIGNAL: stack B6CE1AE0 : [etc.]
mks| Backtrace[0] [etc.]
mks| SymBacktrace[0] [etc.]
mks| Panic: dropping lock (was bug 49968)
mks| Unexpected signal: 11.
mks| Core dump limit is 0 KB.
mks| Child process 19031 failed to dump core (status 0x6).

mks| Backtrace[0] [etc.]
mks| SymBacktrace[0] [etc.]
where [etc.] refers to various sets of computer gibberish (e.g., 0xb7823410).  It went on from there, but now we seemed to be at the point of no return, where the log showed VMware giving me the "Unexpected signal" error message onscreen.  A search led to a thread in which people were saying that they avoided this error by tinkering with their screen resolution.  When I saw that, I figured that we were talking about a kind of general reaction to somewhat incompatible hardware:  could be the NIC, could be the screen resolution, etc.
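Scrolling through hundreds of entries by hand to find that first "Caught signal" line was tedious.  (On Linux, signal 11 is SIGSEGV, a segmentation fault.)  Below is a rough Python sketch of how the same search could be automated.  It is not a VMware tool.  The function name is hypothetical, and it assumes only what the excerpts above show: that the crash begins with a line containing "Caught signal N," and that a "|" separates the thread name (such as mks) from the message.  Anything about the log format beyond that is a guess.

```python
import re
import sys
from collections import Counter

# Pattern taken from the log line quoted above: "Caught signal 11 -- tid 3652"
SIGNAL_RE = re.compile(r"Caught signal (\d+)")


def summarize_crash(lines):
    """Find the first 'Caught signal' line and tally the kinds of messages after it.

    Returns (signal_number, line_index, Counter) or None if no such line exists.
    """
    for i, line in enumerate(lines):
        match = SIGNAL_RE.search(line)
        if not match:
            continue
        kinds = Counter()
        for later in lines[i + 1:]:
            # Keep the text after the first "|" (assumed thread/message separator).
            message = later.split("|", 1)[-1].strip()
            # Cut at the first ":", "[" or digit to get a rough message "kind".
            kind = re.split(r"[:\[\d]", message, maxsplit=1)[0].strip()
            if kind:
                kinds[kind] += 1
        return int(match.group(1)), i, kinds
    return None


if __name__ == "__main__":
    # Usage: python thisscript.py /path/to/vmware.log
    with open(sys.argv[1], errors="replace") as log:
        result = summarize_crash(log.read().splitlines())
    if result is None:
        print("No 'Caught signal' line found.")
    else:
        signal_number, index, kinds = result
        print(f"Signal {signal_number} first caught at line {index + 1}")
        for kind, count in kinds.most_common():
            print(f"{count:5d}  {kind}")
```

The tally is deliberately crude: it lumps together every Backtrace line, every SymBacktrace line, and so on, which is enough to see at a glance what the log was doing in that last second.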

I had noticed that this particular crash occurred when I used my KVM to switch from the computer in question to a different computer.  It was a new KVM, an IOGear GCS72U.  I had previously been using a PS/2 KVM without problems, but my new motherboard did not have two PS/2 sockets, so I had to switch to this USB KVM.  Had I installed the KVM before or after taking out the NIC?  I couldn't remember.  But the KVM was a suspect.

Another suspect was the keyboard.  I had also had to get a new keyboard and mouse -- because, of course, I was using PS/2 devices previously, and now I had to have USB devices.  It was an inexpensive keyboard, a Logitech K120, and I had noticed that it did not work consistently on the other computer.  It would be working fine, and then there would suddenly be no more keyboard input.  The mouse was still working, but not the keyboard.  At that point, it didn't matter whether the keyboard was connected to that other computer through the KVM or was plugged into it directly; either way, it wouldn't work.  But I hadn't done a scientific study to determine whether it was the keyboard or the USB KVM that was screwing up.

So, putting it together, what had happened in this case was that I was using VMware on computer no. 1 (C1).  The keyboard and mouse had been working fine on both C1 and C2.  I hit the KVM switch button to move to C2.  That's when VMware crashed; and at the same time, suddenly the keyboard was not working on C2.  But the USB mouse would still work.  So it seemed that either the keyboard or the KVM was sending a funky keyboard-related signal to C2.

C2 was an older system, and I was in the process of upgrading it, and I thought that might solve the problem.  But in the meantime, I plugged my old PS/2 keyboard into C2 and rebooted, with the new keyboard and KVM still connected to both computers.  In this setup, over a period of days, I observed that the PS/2 keyboard continued to work consistently throughout, but the USB keyboard would sometimes stop working on C2, like before.

Then I had another VMware crash on C1, and that made me decide to try another angle.  I unplugged the KVM from C2.  By this time, I had replaced C2; now it was pretty much the same computer as C1 (same kind of motherboard, CPU, and case).  So now the mouse and keyboard connected to the KVM would work only on C1.  On C2, I kept using the PS/2 keyboard, and added a USB mouse.  If there were still crashes or freezes, I would have a better idea of whether the problem was the new USB keyboard or the new USB KVM.  As of this writeup, a couple of weeks later, my recollection is that I did not have any further crashes.

I decided to send the KVM back to IOGear.  In the interim, I connected my old PS/2 KVM to both computers, and used it only with the PS/2 keyboard.  So now I had a separate USB mouse for each computer, but only one keyboard for both of them.  It took a while to get used to switching mice when I switched the keyboard from one computer to the other, but the point here is that there were no further crashes.  I had to wait for the replacement KVM to arrive from IOGear before I could say for sure whether it was the keyboard or the KVM; but by that point I had decided to stop using VMware on Ubuntu,