Using the System ThreadPool
In an upcoming blog post, I’m going to be talking about some of the threading issues we ran into when running on systems with a number of processors (4, 8, 16). As part of that, I wrote this little section on the pain we’ve suffered through due to the ThreadPool. Rather than keep it in the already-too-long entry on threading, I figured it could stand on its own.
I read a lot of literature on threading, threading patterns, and all sorts of stuff in between. In .Net land, most of the literature that I have read strongly recommends using the system threadpool instead of managing your own threads. This literature then goes on to very strongly recommend against rolling your own thread management code, and emphasizes again and again to just use the system threadpool.
As a direct result of reading this literature, the early versions of the SoapBox Server made extensive use of the system threadpool.
I should take a moment to differentiate between the system threadpool and the IOCP threadpool (which we also use):
· The system threadpool is made up of (by default) 25 threads per processor and is used internally by the .Net framework for a number of tasks. This pool is leveraged by ADO.Net, by all of the built-in delegate BeginInvoke operations, all of the various timer classes, and so forth. Work items are easily posted to a work item queue, and the threadpool then manages as best it can to get the operations done in the order they were posted. The system threadpool is absolutely vital to the proper operation of the .Net framework.
· The IOCP threadpool, by contrast, has (sort of) 1000 threads in it and is only used (as far as I know) by the Sockets infrastructure. This threadpool is managed by the IO Completion Port infrastructure within the kernel. A detailed explication of how IOCP works can be found in a number of places on the web.
In a WinSock server application (such as the SoapBox Server), when we get a callback from a socket (from BeginReceive, BeginSend, BeginAccept, BeginConnect, etc.), processing is done in the context of an IOCP thread, not a threadpool thread.
In an earlier architecture, we used this IOCP thread to parse Xml coming off the socket into strongly typed classes, and then would put the resulting objects into a work item queue. For quite some time, we naively used the System Threadpool to actually do work against the items in the queue. For a little while, this even worked.
The problem is this: under heavy load, we would quickly get all 25 threadpool threads busy doing something. Every once in a while, though, our server would hang. To those of us initiated into the Skull and FreeThreading society, the reaction is simple: “Oh, it’s a race condition. No trouble. I’ll fix that this afternoon.” Problem was, it wasn't a race condition. At least not exactly.
After poking around for a bit, we would find all 25 threadpool threads blocked in ADO.NET. The database had already returned its results, but none of our threads had woken up. It turns out that the completion callbacks inside the .Net framework were themselves queued to the threadpool. With no threads available (’cause we had them all tied up doing work!) the callbacks would never run, and so all of our threads were essentially deadlocked.
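The hang described above can be reproduced in miniature. What follows is only a sketch, not the SoapBox code: the class name StarvationDemo and its Run method are made up for illustration, the pool is deliberately shrunk to one thread per processor with ThreadPool.SetMaxThreads so the effect shows up immediately, and the syntax is more modern C# than the runtime this post was written against. Every pool thread blocks waiting for a "callback" that has been queued to the same pool. A timeout stands in for the real hang so the demo actually finishes.

```csharp
using System;
using System.Threading;

public static class StarvationDemo
{
    // Returns how many work items timed out waiting for a callback that was
    // queued to the same (fully occupied) pool.
    public static int Run(int waitMilliseconds)
    {
        int size = Environment.ProcessorCount;
        ThreadPool.GetMaxThreads(out int oldWorkers, out int oldIo);
        if (!ThreadPool.SetMaxThreads(size, oldIo))
            throw new InvalidOperationException("Could not shrink the pool for the demo.");

        int starved = 0;
        try
        {
            using (var allRunning = new CountdownEvent(size))
            using (var allDone = new CountdownEvent(size))
            {
                for (int i = 0; i < size; i++)
                {
                    ThreadPool.QueueUserWorkItem(outerState =>
                    {
                        // Wait until every pool thread is occupied by one of these items.
                        allRunning.Signal();
                        allRunning.Wait();

                        // Stand-in for a framework completion callback posted to the pool.
                        var callbackRan = new ManualResetEventSlim(false);
                        ThreadPool.QueueUserWorkItem(innerState => callbackRan.Set());

                        // No pool thread is free to run the callback, so this wait
                        // can only end by timing out (for at least one work item).
                        if (!callbackRan.Wait(waitMilliseconds))
                            Interlocked.Increment(ref starved);

                        allDone.Signal();
                    });
                }
                allDone.Wait();
            }
        }
        finally
        {
            ThreadPool.SetMaxThreads(oldWorkers, oldIo);
        }
        return starved;
    }
}
```

Calling StarvationDemo.Run(200) from a non-pool thread (the main thread, say) should report at least one starved work item. Take the timeout away and you have our production hang.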
To make matters worse, our watchdog was set up as a Timer class to get callbacks every 30 seconds. During the callback it would check all the threads that were doing something, and take appropriate action if things were stalled. Any guesses how well the timer callbacks work when the threadpool is out of threads?
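One way around the watchdog problem is to give the watchdog a thread of its own, so it keeps ticking no matter what state the threadpool is in. This is a minimal sketch under that assumption, not what the SoapBox Server actually ships. The DedicatedWatchdog name and its constructor parameters are hypothetical.

```csharp
using System;
using System.Threading;

public sealed class DedicatedWatchdog : IDisposable
{
    private readonly Thread _thread;
    private readonly ManualResetEventSlim _stop = new ManualResetEventSlim(false);

    public DedicatedWatchdog(TimeSpan interval, Action check)
    {
        _thread = new Thread(() =>
        {
            // Wait returns false when the interval elapses (time to run a check)
            // and true once Dispose has signalled the stop event.
            while (!_stop.Wait(interval))
            {
                // Runs on this dedicated thread, never on a threadpool thread.
                // An exception thrown by check() is not handled in this sketch.
                check();
            }
        });
        _thread.IsBackground = true;
        _thread.Name = "Watchdog";
        _thread.Start();
    }

    public void Dispose()
    {
        _stop.Set();
        _thread.Join();
    }
}
```

Something like new DedicatedWatchdog(TimeSpan.FromSeconds(30), CheckForStalledThreads) would replace the Timer, where CheckForStalledThreads is whatever stall detection you already have. The cost is one extra thread that spends almost all of its life asleep.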
We saw this process repeat itself several times across several different sets of circumstances (web service calls, ADO.Net, File system writes for trace logs, etc). Between ADPlus generated minidumps and way too much time staring at Son Of Strike (Why doesn’t Microsoft include managed minidump debugging in VS.NET 2005?!? Why? WHY!?), we decided that using the system threadpool is simply not an option.
The bottom line is this: for production-grade server applications, do not use the system threadpool. Thread starvation, combined with the CLR's assumption that it can always use the threadpool, makes it something that cannot be used by a robust server-class application. Applications that do depend on it are at major risk of hanging.
Now, for small apps and client-side code, I love the threadpool. It’s great. For server apps it’s deadly.
As a result of this, we have rolled our own threadpool. This is something that sounds easy at first, but is quite hard. I really wish the CLR let you create your own instances of the ThreadPool class, as it would make things much easier.
I would love to be able to say, “ThreadPool _processingPool = new ThreadPool();”
Getting things like thread affinity, thread creation, thread destruction, and all the other little details juuuuuust right is very, very hard.
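To give a feel for the easy part, here is a bare-bones private pool: a fixed set of dedicated threads draining a shared queue. It is a sketch only, and not our implementation. The DedicatedWorkerPool name is invented, it leans on BlockingCollection (which is newer than the framework version discussed in this post), and it skips exactly the hard details listed above: no thread affinity, no growing or shrinking, no recovery of dead threads.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class DedicatedWorkerPool : IDisposable
{
    private readonly BlockingCollection<Action> _queue = new BlockingCollection<Action>();
    private readonly Thread[] _workers;

    public DedicatedWorkerPool(int threadCount)
    {
        _workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            _workers[i] = new Thread(WorkLoop) { IsBackground = true, Name = "Worker-" + i };
            _workers[i].Start();
        }
    }

    // Work posted here never competes with the framework for system threadpool threads.
    public void Post(Action work)
    {
        _queue.Add(work);
    }

    private void WorkLoop()
    {
        // Blocks while the queue is empty; ends once CompleteAdding has been
        // called and the remaining items have been drained.
        foreach (Action work in _queue.GetConsumingEnumerable())
        {
            try
            {
                work();
            }
            catch (Exception)
            {
                // A real pool would log this; swallowing keeps the worker alive.
            }
        }
    }

    // Call once: stops accepting work, lets queued items finish, then joins the workers.
    public void Dispose()
    {
        _queue.CompleteAdding();
        foreach (Thread worker in _workers)
        {
            worker.Join();
        }
        _queue.Dispose();
    }
}
```

With this in place, the parsed objects coming off the IOCP threads would be handed to Post instead of the system threadpool, which leaves the system pool free for the framework's own callbacks and timers.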
Update:
I had a few conversations with Jeff Richter about this topic, and it was his assertion that it's not the threadpool that's broken here, but rather our architecture. He is of the opinion that the critical flaw is that we're doing synchronous operations while running on threadpool threads. This can cause timeout and deadlock issues just like the ones we're seeing.
A problem arises in that we need to perform database operations in response to packets sent to us by clients. Richter's answer: Perform async database operations, make sure everything else is async, and use nothing but threadpool threads. This will increase scalability, reliability, throughput, and everything else. At a very fundamental level, I believe him. His arguments are very persuasive.
At first glance, making the database access async will work. SqlClient has support for this through the Begin/End methods on SqlCommand (BeginExecuteReader, BeginExecuteNonQuery), and this should do the trick. Things even appear to be IOCP enabled, which is a great bonus. Problems quickly arise though:
- The only ADO.Net provider that supports the Begin/End pattern is SqlClient. This means we're out of luck for Oracle, MySQL, and PostgreSQL. If we're going to do this, it means special casing our database support. Not a showstopper, but certainly a pain.
- The showstopper is that SqlClient only partially supports async operations. There is no "SqlConnection.BeginOpen"!! This means that we're always stuck synchronously opening database connections. As soon as we exhaust the connection pool (100 connections by default), whatever thread is trying to perform a database I/O will block waiting for a connection. I would be stuck writing our own connection pooling mechanism, tying it to IOCP, and then managing everything. Ick.
As a result of the inability to perform async database operations at any reasonable scale due to the lack of a BeginOpen method on SqlConnection, I am forced to restate my original premise with a slight modification:
Don't use the ThreadPool if you have to perform any synchronous I/O. Because all significant applications need to perform database access, which must be synchronous, you can't use the ThreadPool in any significant server application.
I suppose there are special-case applications where you could post database operations to a message queue using async operations. These messages would then be dequeued and executed, with the results posted back to the queue for the application to pick up. For write-only operations this seems like it would be a good way to go. For anything else it's not much of an option.