Tracking down a production bug
Production Applications have bugs. It doesn't seem to matter how well
tested, or under what conditions that testing happens under, the bugs
still persist. These bugs typically are hard or impossible to reproduce
in our test lab, so having cooperation from the customer in tracking it
down is invaluable.
A client running the SoapBox Server 2005 SR2 recently reported major memory
issues under load conditions that should not have caused any troubles at all.
After scratching our heads for a bit, the need to get to the bottom of the
problem was serious enough to take a break out of daily routine and find the
issue. Fortunately the customer has been exceptionally helpful, and was
able to collect dump files for us.
Dump files, you ask? Well. This blog highlights the methodology taken to get
to the bottom of a critical problem in a production .Net application. This
mostly means obtaining a dump file, and using WinDbg to track down the issues.
For those of you used to developing in Visual Studio, the WinDbg debugging experience will
come as something of a shock. Frankly, it stinks. It's like going back 20+ years
and using GDB or some other command line driven debugger. Unfortunately WinDbg is the
only option for debugging MiniDump files that contain Managed Code.
The Problem
With relatively few users logged in, memory on the server was far higher than it
should have been. As more users log in, memory continues to grow until an
OutOfMemoryException happens, and the process restarts.
Environment
Windows 2003 Server (x86), .Net 1.1 SP1, SoapBox Server 2005 SR2, Dual Core Processor.
Obtaining a Dump File
The first step is work with the customer so that we can get a MiniDump
file. This file is a snapshot of the memory used by the process, the running threads and their call stacks, and just about everything else
you can think of. With this we can poke and prod to determine the application state when the dump was generated.
The easiest way to generate MiniDumps is to use with ADPlus. To get this setup:
- On the server, download and install the Debugging Tools For Window for your platform.
- Copy the DNExitProcess.xml file that I shameless stole from John Robbin's Bugslayer column to your local machine.
With the process up and running (but not yet doing anything), setup AD Plus to capture a dump when the process exits:
- C:\Program Files\Debugging Tools for Windows"\adplus -c c:\Temp\CrashDump\DNExitProcess.xml -o C:\Temp\CrashDump\ -pn SoapBoxServer.exe
At this point, when the SoapBoxServer.exe process terminates, ADPlus will create a MiniDump in the C:\Temp\CrashDump directory.
The name MiniDump is deceiving. These files will be huge. 2+ GB per file,
and there will typically be two of them. Fortunately, they compress very well.
Now That your customer has generated dumps, compressed them, and gotten them to you via a FTP or WebDAV folder, the real fun begins.
Setting up for Debugging
On your local workstation:
- Download and install the x86 version of the Debugging tools for Windows.
- Track down the actual binary and symbol files (.pdb files) that match the
version of software your customer is running. If you didn't save these when you
did your build, you're in real trouble. I bet you won't make THAT mistake again! :)
- Put the binary and symbol files into a local directory like: C:\Temp\SoapBoxBuildSymbols\
- Copy the dump files into a local directory like: C:\Temp\Debugging\
- Create an empty symbol directory C:\temp\symbols\, that will contain symbols downloaded from the Microsoft Symbol Server.
Setting up WinDbg
Run WinDbg. If you installed the Debugging Tools, you can find this in your start menu. If you didn't install the Debugging Tools, go do it now. They're at Debugging Tools For Window.
With WinDbg running, you need to set your Symbol Path, which is found under the File Menu. We want to set WinDbg to use our
local symbols (for our custom application), as well as the Microsoft Symbol Server for all .Net framework and Windows Symbols. To do this, set the symbol path to:
- C:\temp\SoapBoxBuildSymbols\; SRV*c:\temp\symbols*http://msdl.microsoft.com/download/symbols
Now we're ready to open the dump file and start debugging. Select Open Crash Dump, and select the dump file to debug.
If there are two dumps, I generally debug the 1st Chance Exception and/or Process Shutdown dump.
In the dump file window type, .load clr10/sos.dll. This loads Son of Strike
which is the .Net extension to WinDbg.
Without Son of Strike
you're not going to be able to do any significant debugging. Unfortunately the only version of this I've been able to find
is for x86 - I'm not sure yet what the answer is for x64 or IA64. If anyone knows the answer to this, please let me know.
Starting the Debug Process
Based on the fact that we're running out of memory, its pretty likely that we're suffering some sort of memory leak. Let's
take a quick look at the threads that are running on the system and the state they're in:
0:000> !threads
ThreadCount: 180
UnstartedThread: 0
BackgroundThread: 60
PendingThread: 0
DeadThread: 119
PreEmptive GC Alloc Lock
ID ThreadOBJ State GC Context Domain Count APT Exception
XXX 0x98c 0x00150070 0xa020 Enabled 0x00000000:0x00000000 0x0014b0a8 0 Ukn
XXX 0xbb4 0x00151790 0xb220 Enabled 0x00000000:0x00000000 0x0014b0a8 0 Ukn (Finalizer)
XXX 0 0x00197458 0x1800820 Enabled 0x00000000:0x00000000 0x0014b0a8 0 Ukn (Threadpool Worker)
XXX 0 0x001bb030 0x1820 Enabled 0x00000000:0x00000000 0x0014b0a8 0 Ukn System.OutOfMemoryException
...
XXX 0 0x128eebc8 0x1800820 Enabled 0x00000000:0x00000000 0x0014b0a8 0 Ukn (Threadpool Worker) System.OutOfMemoryException
XXX 0 0x12828628 0x800820 Enabled 0x00000000:0x00000000 0x0014b0a8 0 MTA (Threadpool Completion Port) System.OutOfMemoryException
...
XXX 0xd74 0x0d156680 0x800220 Enabled 0x00000000:0x00000000 0x0014b0a8 0 Ukn (Threadpool Completion Port) System.OutOfMemoryException
...
XXX 0xff0 0x12906490 0x1800220 Enabled 0x00000000:0x00000000 0x0014b0a8 0 Ukn (Threadpool Worker) System.OutOfMemoryException
...
0 0xbc8 0x0b39da48 0x1800220 Enabled 0x99dc8528:0x99dca49c 0x0014b0a8 0 MTA (Threadpool Worker) System.OutOfMemoryException
All those OutOfMemory Exceptions prove, beyond any doubt, that we've got a memory issue. Let's pull statistics on the
Managed Heap and see if we can figure out a bit more.
0:000> !DumpHeap -Stat
------------------------------
Heap 0
total 552,517 objects
------------------------------
Heap 1
total 306,058 objects
------------------------------
total 858,575 objects
Statistics:
MT Count TotalSize Class Name
0x7bee8da4 1 12 System.InvariantComparer
0x7becc074 1 12 System.Xml.XmlDownloadManager
0x7b58ed44 1 12 System.Drawing.Printing.PrintingPermission
.
.
.
0x006c2c3c 437,647 1,797,282,964 System.Byte[]
Just looking at these heap stats, I can see that I have 437K Byte Arrays allocated,
consuming a total of 1.8 Gigabytes. In terms of figuring out what the problem is,
I just found it. We're leaking byte arrays.
I'm actually somewhat surprised by this, as I was 95% certain this would end up
being a heap fragmentation issue. Finding the cause of this should be pretty
easy - certainly easier than finding the root cause of fragmentation or some
other obscure issue. Ironically, the place where these byte arrays are allocated
exists primarily to help us minimize heap fragmentation.
Using DumpHeap we can look at the byte arrays. With 400K+ of these, I want to limit the number I return.
Let's take a look at some of these byte arrays:
0:000> !DumpHeap -Type Byte[] -l 5 -min 4000
------------------------------
Heap 0
Address MT Size Gen
0x00e45430 0x006c2c3c 8,204 2 System.Byte[]
0x00e4dc58 0x006c2c3c 4,108 2 System.Byte[]
0x00e4ec64 0x006c2c3c 4,108 2 System.Byte[]
0x00e4fc70 0x006c2c3c 4,108 2 System.Byte[]
0x00e50c7c 0x006c2c3c 4,108 2 System.Byte[]
total 5 objects
------------------------------
Heap 1
Address MT Size Gen
0x04de7518 0x006c2c3c 65,548 2 System.Byte[]
0x04e1e7c8 0x006c2c3c 8,204 2 System.Byte[]
0x04e208c0 0x006c2c3c 8,204 2 System.Byte[]
0x04e22a30 0x006c2c3c 8,204 2 System.Byte[]
0x04e24aec 0x006c2c3c 8,204 2 System.Byte[]
total 5 objects
------------------------------
total 10 objects
Note: For those of you wondering why there are two heaps here's the answer:
Initially I though Heap 0 was the standard Managed Heap (complete with the expected Gen0, Gen1, Gen2), and Heap 1 was the
Large Object Heap.
Then I remembered we are running on a dual-core machine, and using the
Server Version of the CLR..
The key here is the phrase (from that link): We create a GC thread and a separated heap for each CPU. GC happens on these threads instead of on the allocating thread.
So there ya have it, we have 2 heaps - one for each CPU.
If I run this command without the -l 5 option, I see all 400K+ byte arrays. This takes about an hour to scroll
by on the screen. However, it's important to pick a "newer" array, so that we can be sure of getting one that
I would consider a "leaked" array. If we grabbed one from the top of the list, it's likley to be a legit,
pre-allocated array (which we do quite a bit of, in order to prevent heap fragmentation.
So I run the command, get the FULL list of arrays, then pick one from towards the end, and record the address.
The one I pick happens to have an address of 0x9c559b60.
Now that I have a byte array isolated, and know it's address, it's time to find out more about it:
0:000> !DumpObj 0x9c559b60
Name: System.Byte[]
MethodTable 0x006c2c3c
EEClass 0x006c2bc4
Size 4108(0x100c) bytes
GC Generation: 2
Array: Rank 1, Type System.Byte
Element Type: System.Byte
Content: 4,096 items
So this byte array is 4108 bytes, and lives in GC Generation 2. This is nice to know, but doesn't tell
us enough. 'Who owns this byte array?' is really the question we need answered. Fortunately that's pretty easy to answer:
0:000> !GCRoot 0x9c559b60
Scan Thread 0 (0xbc8)
Scan HandleTable 0x14a430
Scan HandleTable 0x14d348
HANDLE(Strong):6b13c8:Root:0x1338944(System.Net.Sockets.AcceptAsyncResult)->
0x13388d0(System.AsyncCallback)->
0x1318118(Winfessor.Proxy.Server.Socks.XMPPBytestreamsSocksListener)->
0x13180fc(Winfessor.Proxy.Server.DestroyDelegate)->
0x1276d7c(Winfessor.SoapBox.Server.ServerCore.Gateways.FileTransferProxy.FileTransferProxyComponent)->
0x4e0e078(Winfessor.SoapBox.Base.XMPPSocketManager)->
0xe4d5e0(Winfessor.SoapBox.Base.BufferPool)->
0xe4d648(System.Collections.Queue)->
0x912c9e0(System.Object[])->
0x9c559b60(System.Byte[])
We now have the owning call stack for the array. Now we're getting somewhere! Knowing our server as I do,
the thing that jumps out at me is the BufferPool (0xe4d5e0). The BufferPool is a class designed to allocate a number
of byte arrays at a time. These byte arrays are then passed into methods that cause them to be pinned in memory (such as
Socket.BeginReceive, Socket.BeginSend, and a number of LDAP and Active Directory methods). These pinned byte arrays tend to
be pinned for a long time, and if they're not carefully managed create all sorts of holes in the heap, which is very
problematic.
Let's look a bit closer:
0:000> !DumpObj 0xe4d5e0
Name: Winfessor.SoapBox.Base.BufferPool
MethodTable 0x0c4c0074
EEClass 0x0c4136b8
Size 28(0x1c) bytes
GC Generation: 2
mdToken: 0x02000061 (c:\program files\coversant\soapbox server
2005\winfessor.soapbox.base.dll)
FieldDesc*: 0x0c38ff7c
MT Field Offset Type Attr Value Name
0x0c38fe74 0x400013c 0x8 System.Int32 instance 100 _initialCount
0x0c38fe74 0x400013d 0xc System.Int32 instance 100 _growthCount
0x0c38fe74 0x400013e 0x10 System.Boolean instance 1 _gc
0x0c38fe74 0x400013f 0x4 CLASS instance 0x00e4d648 _queue
0x0c4c0074 0x4000109 0x14 System.Int32 instance 4096 _bufferSize
This doesn't tell me anything. But it does give me the address of the Queue. This queue is (by design)
full of byte arrays that are continually pinned and unpinned in memory by Win32 calls. Its purpose is
to help minimize heap fragmentation, which it does very well. Let's poke at the queue:
0:000> !dumpobj 0x00e4d648
Name: System.Collections.Queue
MethodTable 0x79bca304
EEClass 0x79bca3a8
Size 32(0x20) bytes
GC Generation: 2
mdToken: 0x0200011f (c:\windows\microsoft.net\framework\v1.1.4322\mscorlib.dll)
FieldDesc*: 0x79bca40c
MT Field Offset Type Attr Value Name
0x79bca304 0x40003bb 0x4 CLASS instance 0x0912c9e0 _array
0x79bca304 0x40003bc 0x8 System.Int32 instance 128646 _head
0x79bca304 0x40003bd 0xc System.Int32 instance 565323 _tail
0x79bca304 0x40003be 0x10 System.Int32 instance 435315 _size
0x79bca304 0x40003bf 0x14 System.Int32 instance 200 _growFactor
0x79bca304 0x40003c0 0x18 System.Int32 instance 15536770 _version
Its a bit weird that Version is 15536770. Version on a CLR Queue is updated for each call into the queue. This means there have been
15 million queue operations so far. This seems a bit high.
Let's take a look at the array (0x0912c9e0)that is used by the queue:
0:000> !DumpObj 0x0912c9e0
Name: System.Object[]
MethodTable 0x006c209c
EEClass 0x006c2018
Size 3276816(0x320010) bytes
GC Generation: 3
Array: Rank 1, Type CLASS
Element Type: System.Object
Content: 819,200 items
Well, here's our problem. This array has 819,200 items in it. Each of these
items is (I just happen to know) an array of bytes. Clearly something is
very, very wrong with the algorithm used to grow the queue! At this point I
could start looking at the individual array items, but they wouldn't tell me
much. For details on how to do this
see Arrays and SOS on MSDN.
So now we know problem: We have way, way too many of these byte arrays around. Now that
we have the culprit, it's time to figure out why.
Why are we growing Byte Arrays?
Being a process oriented and very methodical company, we have all the source
code used for the exact build our customers are running. The gateway into the Queue
of object is the ObjectPoolBase class that we have. This offers two methods: CheckIn,
and CheckOut. One of these two (or both!) methods must be broken and causing the
"Grow" algorithm to fire too often.
Note: Our SoapoBox 2005 product line was written primarily in Visual Basic.Net. Our
SoapBox 2007 product line has been converted to C#. This doesn't really have any impact
on anything, as the two langauges are pretty much identical in all ways that matter.
Protected Function CheckOutObject() As Object
Dim emptyQueue As Boolean = False
SyncLock _queue.SyncRoot
emptyQueue = (_queue.Count = 0)
End SyncLock
If emptyQueue Then
AddObjects(GrowthCount)
End If
Return _queue.Dequeue()
End Function
There's a race condition there. How on earth did that happen? This code has been tested, retested, and tested again. Yikes!
Even scarier than the race condition around the checking of queue size, is that the _queue.Dequeue method isn't protected
inside a lock. I seriously doubt the CLR Queue class is thread safe, so this is probably causing a problem.
Let's take a quick look at the Grow Algorithm to see how that looks:
Private Sub AddObjects(ByVal count As Integer)
'force garbage collection to make sure
'these new long lived objects cause as
'little fragmentation as possible
If _gc Then
System.GC.Collect()
System.GC.WaitForPendingFinalizers()
End If
For i As Integer = 1 To count
Dim obj As Object = GetObjectInstance()
SyncLock _queue.SyncRoot
_queue.Enqueue(obj)
End SyncLock
Next
'force garbage collection to make sure these new
'long lived objects cause as little fragmentation
'as possible
If _gc Then
System.GC.Collect()
System.GC.WaitForPendingFinalizers()
End If
End Sub
That's also... far from ideal in terms of threading and locking. When did this happen? Which versions of our server have this code? Does the 2007 version still have this?
Before we figure that out, let's see the details on that call to _queue.DeQueue().
Using Reflector I can pull up the source easily enough:
public virtual object Dequeue()
{
if (this._size == 0)
{
throw new InvalidOperationException(
Environment.GetResourceString(
"InvalidOperation_EmptyQueue"));
}
object obj1 = this._array[this._head];
this._array[this._head] = null;
this._head = (this._head + 1) % this._array.Length;
this._size--;
this._version++;
return obj1;
}
Found deep inside the CLR, this code is absolutely, positively, not thread safe. Queue.Dequeue needs to be called from inside a Lock. This is a bug on our part.
Which versions are vulnerable?
Using Source Code control, it's easy to pull up the history for this file. Doing so, I find that it didn't always look like this. In fact, in all versions PRIOR to the
SoapBox Server 2005 SR2, it looked like:
Protected Function CheckOutObject() As Object
Monitor.Enter(_queue.SyncRoot)
Try
If _queue.Count = 0 Then
AddObjects(GrowthCount)
End If
Return _queue.Dequeue()
Finally
Monitor.Exit(_queue.SyncRoot)
End Try
End Function
This code is fine, and does not have the race conditions of the code in question. How about the SoapBox Server 2007?
protected object CheckOutObject()
{
if (_disposed)
throw new ObjectDisposedException("ObjectPoolBase");
object obj;
bool added = false;
lock (_queue.SyncRoot)
{
if (0 == _queue.Count)
{
AddObjects(GrowthCount);
added = true;
}
obj = _queue.Dequeue();
}
if (added && _gc)
{
//Have to do this outside of any
//synchronization or things may hang.
GC.WaitForPendingFinalizers();
}
return obj;
}
Whew! Our current release is clean, and all releases other than the SR2 release of SoapBox 2005 are also fine.
Conculusions
So in summary, there is (was!) a threading bug in the server. Why this bug didn't show up under testing,
is somewhat of a mystery.
My best guess is that the action of actually having hundreds of client machines all hitting
the server causes this to happen far, far more than having hundreds of clients on a single machine hitting
the server does.
In general we're very diligent about testing for multi-threaded bugs. All of our workstation are
multi-processor machines, we have dozens of explicitly multi-threaded tests, and our current high-stress
test infrastructure includes a heavily multi-threaded XMPP BotNet that can run on any number of machines.
In addition to that, the people working on our server are all very knowledgeable threading experts. In fact
the developer who introduced that particular bug shown took one look at it and instantly
said, "Who wrote that? There's a race condition there.".
I guess the real answer is a more formal code review process. Unfortunately this is hard to enforce at
the process level, as none of the source-code control tools that we use can do this.
Technorati Tags:
XMPP,
SoapBox,
Son of Strike,
WinDbg,
.Net,
Debugging