Sunday, April 8, 2012

Windows 7: BSOD: Errors 116 & 119: Interpreting the Minidump or Kernel Dump File

I had been having Blue Screen of Death (BSOD) crashes.  These were happening on one machine and not the other.  This was odd; both machines had virtually identical Windows 7 installations.  They also had the same motherboards and same amounts and kinds of RAM.  This post is a continuation in the effort to figure out why.

Given the similarities between the computers, I suspected the crashes were due to software.  Although the Windows installations were virtually identical, I was not always using exactly the same programs on both machines.  There was also a possibility that a CPU upgrade was responsible for a new bout of crashes:  both machines had previously had the same processors, but I had just installed a faster one on the crashing machine, and it had just begun crashing again.

In the previous episode, I had used BlueScreenView but had not known how to interpret its reports.  More accurately, I had not known how to interpret the minidump reports, viewable in BlueScreenView, that Windows would produce during a BSOD.  I wanted to be able to understand what the minidump file was telling me.

Understanding the minidump seemed especially important this time because, unlike the last time, the BSOD was not pausing onscreen long enough for me to see what it said.  It was flashing by so quickly that I just caught a glimpse of blue and then the machine was rebooting.  I recalled that I had seen, somewhere, a setting that would prevent that from happening.  Eventually I found it:  Start > Run > SystemPropertiesAdvanced.exe (or Control Panel > System > Advanced tab) > Startup and Recovery Settings > uncheck Automatically restart.  At present, my other settings there were for "Write an event to the system log," "Kernel memory dump" (not "None" or "Small memory dump (256KB)"), "Dump file = %SystemRoot%\MEMORY.DMP," and "Overwrite any existing file" was selected.  I wasn't sure if those were the right settings; that's just what I had.  One source told me that I would want to overwrite because the MEMORY.DMP file would eat up lots of disk space.  With these settings, I would have a minidump for every crash and a MEMORY.DMP for only the most recent crash.  So then I clicked OK and got this message:

System Properties

Windows might not be able to record details that could help identify system errors because your current paging file is disabled or less than 800 megabytes.  Click OK to return to the Virtual Memory settings window, enable the paging file, and set the size to a value over 800 megabytes, or click Cancel to change your memory dump selection.

What I inferred from this was that I could opt for the small memory dump with my existing settings, or else I would have to change the paging file settings.  Right there in the Advanced tab, I went to Performance Settings > Advanced tab > Virtual memory Change.  I had a 16MB paging file on drive C and a minimum 2GB paging file on another drive.  Apparently the kernel dump needed at least an 800MB paging file on drive C.

Since at least the days of Windows XP, I had emphasized putting the bulk of the paging file on another drive, in the belief that this would enhance performance.  A search now led to the suggestion that, especially on a machine with substantial RAM, I would rarely if ever run out of RAM and actually use the paging file.  On the other hand, a different post in that same thread quoted Microsoft as saying that paging files are used often and should be located on fast (and, obviously, uncompressed) drives if available.  A quick look at pagefile.sys on the second drive indicated that it was presently at the minimum 2GB size I had set for it, on a system with 12GB RAM.  So it seemed that advice to make the paging file half as large as RAM, or twice as large, or some other similar value, might significantly overstate how large a paging file I would actually need.

There had long been warnings that setting the minimum size too low would impose at least a slight performance hit, because Windows would have to dynamically resize the pagefile if it needed more space; but I thought that saving or otherwise manipulating a larger file might also cause a slowdown.  I concluded that the paging file probably was not being used often, that I didn't want to preallocate space that I might need for some other purpose in a pinch, that a fixed larger size could have its own drawbacks (including being inadvertently saved in a drive image), and therefore I should set the paging files on both drive C and the other drive to the System Managed Size > Set option.

After a reboot, I saw that the memory dump settings were as I had left them and the paging file size (with a full set of programs loaded) was 18427MB recommended and 24571MB currently allocated, or about 150% and 200%, respectively, of installed RAM.
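
The percentages just mentioned can be checked with a little arithmetic.  The following is only a sketch, assuming that 12GB of RAM means 12 x 1024 MB and that the "over 800 megabytes" rule in the warning dialog is a simple threshold; the function names are made up for this illustration.

```python
# Figures reported above; 12GB of RAM is assumed to mean 12 * 1024 MB.
RAM_MB = 12 * 1024
RECOMMENDED_MB = 18427
ALLOCATED_MB = 24571
KERNEL_DUMP_MIN_MB = 800  # threshold named in the System Properties message


def percent_of_ram(size_mb, ram_mb=RAM_MB):
    """Return a paging file size as a whole-number percentage of RAM."""
    return round(100 * size_mb / ram_mb)


def large_enough_for_kernel_dump(pagefile_mb, minimum_mb=KERNEL_DUMP_MIN_MB):
    """Apply the 'over 800 megabytes' rule quoted in the warning dialog."""
    return pagefile_mb > minimum_mb


print(percent_of_ram(RECOMMENDED_MB))    # 150
print(percent_of_ram(ALLOCATED_MB))      # 200
print(large_enough_for_kernel_dump(16))  # False: the old 16MB file on drive C
```

By this reckoning, the old 16MB paging file on drive C was far below what the kernel dump setting wanted, which is consistent with the warning.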

One thing still on the burner was the indication, picked up from somewhere, that maybe I should be looking into the Windows Event Viewer (Start > Run > eventvwr.msc).  It seemed that Event Viewer was an alternative to BlueScreenView, so I wasn't sure I needed it.  Another recommended approach was to start by looking at the minidump to find the BCCode or STOP code, the cause, and the time when it happened.  I could see that BlueScreenView was showing me the Crash Time, the Bug Check Code, and a Caused by Driver column of information.  I didn't see a column for STOP codes, and View > Choose Columns did not offer one either.  I had forgotten that Bug Check and STOP codes were synonyms.  Looking again, I saw that the three .dmp files shown in BlueScreenView all displayed Bug Check Codes of 0x00000116.  The "Caused by Driver" column listed three different drivers, highlighted in the lower pane, but what was this bug check code telling me?  Microsoft's Bug Check Code Reference said that Bug Check 0x116 was VIDEO_TDR_ERROR.  The detailed description said, "This indicates that an attempt to reset the display driver and recover from a timeout failed."  (Later, I saw a suggestion that I would have found FaultWire more informative.  Its suggestions for this particular error code are examined below.)
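
For anyone who wants to keep their own notes on these codes, here is a minimal sketch of turning the string that BlueScreenView displays into a number and looking it up.  The table is deliberately tiny: it holds only the two codes that come up in this post (the second one appears further down), with the names as quoted here.  The function names are invented for the example.

```python
# Only the two codes discussed in this post, named as they are quoted here.
BUG_CHECK_NAMES = {
    0x116: "VIDEO_TDR_ERROR",
    0x119: "VIDEO_SCHEDULER_INTERNAL_ERROR",
}


def parse_bug_check(text):
    """Convert a string such as '0x00000116' or '119' to an integer (hex)."""
    return int(text, 16)


def describe_bug_check(text):
    """Return a short label for a bug check (STOP) code string."""
    code = parse_bug_check(text)
    name = BUG_CHECK_NAMES.get(code, "not in this small table")
    return f"0x{code:X}: {name}"


print(describe_bug_check("0x00000116"))  # 0x116: VIDEO_TDR_ERROR
```

The leading zeros make no difference: 0x00000116 and 0x116 are the same bug check.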

So that was interesting.  It wasn't the CPU; it was the relatively new video card, an MSI R6570-MD2GD3 LP Radeon HD 6570 2GB.  I'd had it for a few weeks.  It seemed to me that the crashes were happening especially when I was running the Opera browser.  I couldn't make anything of the parameter information provided in the Bug Check Code Reference and listed in BlueScreenView, but I did do quick searches for the three drivers that were listed, for the three .dmp files shown in BlueScreenView:  pacer.sys, atikmpag.sys, and discache.sys.  Nothing jumped out at me for the other two, but I had seen pop-up dialogs referring to atikmpag when running Opera, and now it appeared that atikmpag.sys BSODs were related to video hardware problems (e.g., having a video card in the wrong slot). A right-click in Control Panel > Device Manager > Display Adapters indicated that I was already using the latest driver for the video card, and Opera said I was using the latest version.  Possibly this was happening only when Opera was overloaded:  I usually had a bunch of tabs open.  I decided to try the approach of killing Opera as soon as an atikmpag dialog popped up.  But the next crash wasn't due to Opera -- it wasn't running at the time -- so this was more like background information for the time being.  The next several runs of Opera produced no crashes, so possibly one or more of the steps taken here solved the problem.

Previously, I had gotten minidumps after an indication that my dump file size (presumably meaning my pagefile) was too small. Now that that was no longer a problem, I believed I could expect to see full kernel dumps instead of minidumps. I shelved my budding search for guidance on interpreting minidumps, to wait and see what I would get next.  After the next crash, I did have both a new minidump visible in BlueScreenView and a full MEMORY.DMP file in C:\Windows.  I wasn't sure how to view the MEMORY.DMP, so I ran a search and saw two options.  One was to upload the .DMP as an attachment to a request for help (at e.g., SevenForums.com).  I had a slow connection and my .DMP was about 1GB, so the recommended alternative (in an ExpertsExchange post) was to use Microsoft Debugging Tools for Windows.  (I had learned that I didn't have to pay to see the answers provided in ExpertsExchange.com threads; I just had to scroll to the bottom of the screen.)  The solution seemed to be to download the Windows SDK for Windows 7.  This gave me winsdk_web.exe.  That turned out to be a 2.5GB download that would require 4.5GB when installed.  I looked at my notes from the last time I flirted with the SDK.  I had apparently downloaded more than necessary; I was now seeing advice to download only the Debugging Tools for Windows.  (In my version of winsdk_web.exe, these were under Developer Tools, not under Common Utilities.)  This would be a 177MB download requiring 419MB when installed.  It downloaded and installed directly; it didn't give me an option of saving the download for future reinstallation.  It did not seem that it had actually downloaded 177MB, though; it was done in just a few minutes, and that would not have happened on my slow connection.

While that process was unfolding, I cleaned up the following notes that I had accumulated in the meantime; this post returns eventually (below) to the topic of using the SDK to read MEMORY.DMP.

One such miscellaneous note:  I saw a webpage on which Microsoft suggested two different sequences of steps, depending on whether Windows would start or not.  Since Windows was starting for me, their suggestion was, first, to undo recent changes using System Restore.  I had been having this problem for several days, past my most recent restore point.  Besides, by this point I believed I had traced the problem to the video hardware and/or Opera.  So the next step was to consult Control Panel > Action Center for clues.  Nothing there.  Next, make sure I was current on Control Panel > Windows Update.  I had already done that.

The next step recommended by Microsoft was to search for drivers on the manufacturer's website.  Well, I hadn't done that, not exactly.  I had relied on Device Manager, but now I went to the webpage of the video card manufacturer.  To do that, I started with GPU-Z (similar to CPU-Z).  I discovered that I had to choose the Install rather than the Portable option:  the latter would make GPU-Z not only uninstalled but uninstallable on that machine.  Fortunately, I learned this on the machine that I was not trying to diagnose.  On the machine being examined, GPU-Z ran, and it gave me lots of information, but it didn't give me any more manufacturer information than I had gotten from Device Manager:  I was being lazy, but now I saw that I had an AMD Radeon HD 6570.  For that purpose, System Information for Windows (SIW) was a competent alternative.  To get the actual manufacturer information, it seemed I had no alternative but to consult my receipt, or the box that the video card came in.  Oddly, according to Device Manager, SIW, and GPU-Z, the driver I had installed was actually newer than the latest one on the manufacturer's webpage.  I decided to try the Roll Back Driver option in Device Manager.  That put it back to a driver dated about four months earlier.  I hadn't actually installed that older driver, to my knowledge; evidently Windows downloaded and installed the older driver automatically.  So I would have to see if that fixed the problem.  And in the long haul, that was one possible reason for the reduction in BSODs that I would experience in coming days.

In the meantime, the next step recommended by Microsoft was to use Safe Mode to troubleshoot problems.  They explained how to get into Safe Mode, but not what to do once I was there.  One possible intention was that I would load safe mode without startup programs that might be causing the problem.  A clean boot could be helpful at times, but did not seem highly relevant to the kind of crash I was having.  My crashes could occur after hours of operation.  Microsoft's final suggestion was to check for hard drive and memory errors.  I had recently run Windows Explorer > right-click on a drive > Properties > Tools > Check Now > check both options, and had also run MemTest86+.  These did not appear to be the problem in this case.

FaultWire offered other suggestions specifically oriented toward error 116.  The problem, they felt, was probably either in the driver or in hardware that was either defective or improperly installed.  On the video driver side, they suggested using their own commercial (nonfree) Driver Genius or Radar Sync to verify that I had the latest drivers, assuming I hadn't been comfortable with a direct search of the manufacturer's site.  On the hardware diagnostic side, they pointed me toward their Fix-It Utilities and System Suite, and also toward Eurosoft's PC Check and Iolo's System Mechanic (all commercial).  They also suggested checking the Windows 7 compatibility list.

I did get another BSOD, within a day or two, but this time the error was different.  The number was 119 and the message was, "The video scheduler has encountered an unexpected fatal error."  I got it while running the Windows Experience Index test, so in that sense it seemed to be provoked by demanding use, as when Opera had been overloaded (above).  FaultWire had nothing new to add to what it had already said for error 116:  check the drivers, consider faulty hardware or incorrect hardware installation.  I hadn't previously searched the Win7 Compatibility Center, but now I did, and saw that there was no entry for my particular graphics card.  It was an MSI card, and a search of the Compatibility Center for "MSI" by itself turned up over 800 items, so it's not as though the database was weak.  I had evidently just stumbled into a product that was not listed.  I wasn't sure if that meant it hadn't been checked, or if it had been checked and was definitely not compatible.  Either way, this now seemed like something that I obviously should have checked before -- "obvious" being the standard word for what we have learned about, after we have learned it (or re-learned it, as the case may be).  I checked the manufacturer's page for the video card.  I was not impressed with MSI's website in this regard:  searching did not find the product, and when I did finally drill my way down to it, I got a notice:  "The specifications may differ from areas."  Some kind of typo there, but apparently they sold different products under the same model name.  I emailed MSI customer service, to verify that I was understanding the compatibility situation correctly.  They said no, it definitely was compatible.

I tried running the Windows Experience Index again, several weeks later.  By that point, I had rolled back the driver and had taken most if not all of the other steps described above.  This time, it did not crash.  I had also had no further crashes, with Opera or otherwise, during those weeks.  It seemed the driver rollback may have been the solution.  Since the problem was evidently solved, the following notes are provided just for future reference.

By this point, I had installed the SDK (above).  This gave me a couple of folders (e.g., C:\Program Files\Debugging Tools for Windows) and a Start Menu shortcut for a folder called Microsoft Windows SDK v7.0.  Choosing Open from the context menu for that folder shortcut took me to the C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Microsoft Windows SDK v7.0 folder.  There, I saw a shortcut for CMD Shell.  This opened up a command window.  It said, "The x64 compilers are not currently installed.  Please go to Add/Remove Programs to update your installation."  I went to Control Panel > Programs and Features > select Microsoft Windows SDK for Windows 7 (7.0) > click Change at the top of the list of programs there > Repair > Next.  But that didn't help.  I did a search and found that few people had had this problem.  My guess was that I got this message because I had installed only a fraction of the full contents of the SDK, and the solution was to install more of it, probably through that same Programs and Features route.  In that case, it seemed I might just ignore the message.

To use the SDK for reading MEMORY.DMP, Dirk Smith said I would actually run WinDbg.exe.  The link to this program (now located in C:\Program Files\Debugging Tools for Windows (x64)) had been installed in another Start Menu folder.  So evidently I was on the wrong track, when I opened the CMD Shell, or maybe WinDbg was just a front end for the command line.  Dirk said I needed to start by using WinDbg to find the proper symbol files.  This involved going into WinDbg > File > Symbol File Path.  There, I typed this:
srv*c:\cache*http://msdl.microsoft.com/download/symbols
Then I clicked OK.  Nothing seemed to happen.  But perhaps it was downloading the appropriate symbols quietly, which was what Dirk seemed to be saying.  The next step was apparently to go into WinDbg > File > Open Crash Dump > navigate to C:\Windows or wherever MEMORY.DMP was.  This got me a command window that seemed to hang, but apparently it was just figuring things out.  After a minute or two, it came back with errors:
Module load completed but symbols could not be loaded for atikmpag.sys.
Module load completed but symbols could not be loaded for atikmdag.sys.
Probably caused by:  dxgmms1.sys
Dirk said I could ignore the first two lines, but I wasn't so sure.  As noted above, an atikmpag file was named in one of my minidumps and I was seeing references to atikmpag in Opera.  He said I should focus on the last line, the reference to dxgmms1.sys.  That one hadn't been named in my minidumps.  Dirk told me to type "!analyze -v" (without quotes) in the command line at the bottom of the WinDbg screen.  That got me another error 119 message, and more besides:
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

VIDEO_SCHEDULER_INTERNAL_ERROR (119)
The video scheduler has detected that fatal violation has occurred. This resulted
in a condition that video scheduler can no longer progress. Any other values after
parameter 1 must be individually examined according to the subtype.

Arguments:
Arg1: 0000000000000001, The driver has reported an invalid fence ID.
Arg2: 0000000000004362
Arg3: 0000000000004363
Arg4: 0000000000004363

Debugging Details:
------------------
DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT
BUGCHECK_STR:  0x119
PROCESS_NAME:  System
CURRENT_IRQL:  a
LAST_CONTROL_TRANSFER:  from fffff880015e322f to fffff8000307ed40
STACK_TEXT: 
[displaying, here, only the right end of each line - RW]
nt!KeBugCheckEx
watchdog!WdLogEvent5+0x11b
dxgmms1!VidSchiVerifyDriverReportedFenceId+0xad
dxgmms1!VidSchDdiNotifyInterruptWorker+0x19d
dxgmms1!VidSchDdiNotifyInterrupt+0x9e
dxgkrnl!DxgNotifyInterruptCB+0x83
atikmpag+0x52dc
atikmdag+0x4f526
atikmdag+0x4d479
atikmdag+0x62070
atikmdag+0xfb298
atikmdag+0x1015de
atikmdag+0x10161d
atikmdag+0x101714
atikmdag+0x101845
atikmdag+0x108d7b
atikmdag+0xfa0dc
atikmdag+0x4d15f
atikmpag+0x5ddb
nt!KiInterruptDispatch+0x16c
amdppm!C1Halt+0x2
nt!PoIdle+0x52a
nt!KiIdleLoop+0x2c

STACK_COMMAND:  kb
FOLLOWUP_IP:
dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
fffff880`053b9eb9 c744244053eeffff mov     dword ptr [rsp+40h],0FFFFEE53h
SYMBOL_STACK_INDEX:  2
SYMBOL_NAME:  dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: dxgmms1
IMAGE_NAME:  dxgmms1.sys
DEBUG_FLR_IMAGE_TIMESTAMP:  4ce799c1
FAILURE_BUCKET_ID:  X64_0x119_dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
BUCKET_ID:  X64_0x119_dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
Followup: MachineOwner
Dirk said the right ends of the STACK_TEXT lines were important for identifying third-party drivers.  Atikmpag and atikmdag were prominent there, just before (i.e., below) the dxgmms1 lines.  Anyway, the next step was to type "lmv" into the WinDbg command line.  This command listed details on the modules (drivers and other images) that were loaded when Windows crashed.  As instructed, I searched this pile of information (using Ctrl-F) for the "probably caused by" item, which in my case (above) was dxgmms1.sys.  That search (with variations) found nothing.  I copied and pasted the WinDbg output into Notepad and tried my search there.  This time, it worked.  I tried it again in WinDbg, and this time it worked there too.  Not sure what I had done wrong the first time.  It seems the purpose of this step was to verify the manufacturer of the problematic file.  It looked like dxgmms1.sys came from Microsoft.  But if that Microsoft file had been the source of the problem, wouldn't I have been having these crashes before I installed the new video card?  WinDbg was showing me that the source of atikmdag.sys was AMD.  As Dirk said, Windows itself (i.e., Microsoft) was probably not the culprit.
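
The idea of reading the right ends of the STACK_TEXT lines can be sketched in a few lines of Python.  This is only an illustration, and it rests on an assumption of mine: a frame printed as module+offset, with no "!" and no function name, is one for which WinDbg had no symbols, and that is a rough hint (not proof) that the module is a third-party driver.  The function names are invented for the example.

```python
def module_of(frame):
    """Return the module part of a frame such as 'dxgmms1!Func+0xad'."""
    # Frames with symbols look like module!function+offset;
    # frames without symbols look like module+offset.
    head = frame.strip().split("!", 1)[0]
    return head.split("+", 1)[0]


def has_symbol(frame):
    """True when the debugger resolved a function name for this frame."""
    return "!" in frame


def modules_without_symbols(frames):
    """List, in stack order and without repeats, modules lacking symbols."""
    seen = []
    for frame in frames:
        if not has_symbol(frame):
            name = module_of(frame)
            if name not in seen:
                seen.append(name)
    return seen


# A few frames copied from the STACK_TEXT above.
frames = [
    "nt!KeBugCheckEx",
    "dxgmms1!VidSchiVerifyDriverReportedFenceId+0xad",
    "dxgkrnl!DxgNotifyInterruptCB+0x83",
    "atikmpag+0x52dc",
    "atikmdag+0x4f526",
    "atikmdag+0x4d479",
    "nt!KiInterruptDispatch+0x16c",
    "amdppm!C1Halt+0x2",
]

print(modules_without_symbols(frames))  # ['atikmpag', 'atikmdag']
```

Those are the same two files for which WinDbg reported that symbols could not be loaded.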

It really looked like the purpose of this whole WinDbg and MEMORY.DMP rigmarole was just to get the identity of the driver manufacturer.  I wasn't sure this process was more effective than just doing web searches for the driver name and the error message.  I guess it added dxgmms1.sys to my list of possible causes, and provided confirmation that the atikmpag and atikmdag files were near the heart of this problem.  Whether I would be seeing more of this problem remained to be seen.  As noted above, the older driver presently seemed to have provided the desired stability.
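
For repeat use, the WinDbg steps described above (set the symbol path, open the crash dump, run "!analyze -v" and "lmv") might be collected into one command line.  The sketch below only builds the argument list and does not launch anything.  It assumes WinDbg's -y (symbol path), -z (dump file), and -c (startup commands, separated by semicolons) options, and the install and dump locations mentioned in this post; the function names are made up.

```python
def symbol_path(cache_dir, server_url):
    """Build a symbol-server path in the srv*cache*server form."""
    return f"srv*{cache_dir}*{server_url}"


def windbg_arguments(windbg_exe, dump_file, sym_path, commands):
    """Return an argument list: -y symbol path, -z dump file, -c commands."""
    return [windbg_exe, "-y", sym_path, "-z", dump_file, "-c", "; ".join(commands)]


args = windbg_arguments(
    r"C:\Program Files\Debugging Tools for Windows (x64)\windbg.exe",
    r"C:\Windows\MEMORY.DMP",
    symbol_path(r"c:\cache", "http://msdl.microsoft.com/download/symbols"),
    ["!analyze -v", "lmv"],
)
print(args)
# On the Windows machine itself, passing args to subprocess.run would
# launch the debugger; that step is left out of this sketch.
```

I have not tested this against my own installation, so it should be treated as a starting point rather than a recipe.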

There was one other approach that I hadn't pursued, and decided not to pursue at this point.  That was simply the suggestion to look at the time of the crash, in BlueScreenView, and then use NirSoft's MyEventViewer to examine events within a second or two before the crash.  Preliminarily, that seemed to be another way of getting at the contents of MEMORY.DMP, as listed in the STACK_TEXT above.  But possibly that would be more informative.  For me, further learning on that could await a future BSOD.
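
The "within a second or two before the crash" idea amounts to a simple time-window filter.  Here is a rough sketch, assuming the events have already been exported into (timestamp, description) pairs by some means; the timestamps and descriptions below are hypothetical placeholders, not entries from my logs, and the function name is invented.

```python
from datetime import datetime, timedelta


def events_before_crash(events, crash_time, window_seconds=2):
    """Return events logged within window_seconds before crash_time."""
    start = crash_time - timedelta(seconds=window_seconds)
    return [e for e in events if start <= e[0] <= crash_time]


# Hypothetical timestamps and descriptions, for illustration only.
crash = datetime(2012, 4, 8, 14, 30, 5)
events = [
    (datetime(2012, 4, 8, 14, 29, 0), "hypothetical event A"),
    (datetime(2012, 4, 8, 14, 30, 4), "hypothetical event B"),
    (datetime(2012, 4, 8, 14, 30, 9), "hypothetical event C"),
]

for when, text in events_before_crash(events, crash):
    print(when, text)
```

With these placeholder values, only event B falls inside the two-second window; A is too early and C comes after the crash.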

2 comments:

raywood

An earlier post and its following comments develop additional aspects of related problems.

raywood

A later post provides a summary of related steps.