I’ve been helping a good client of mine troubleshoot some performance issues with their SharePoint environment. They have a single 32-bit MOSS 2007 server, so with 1,000+ active users (though not concurrent), it is stretched about as thin as it will go. Recently, the server started having issues where the app pool would lock up and take all the users down. Now, IIS app pools are designed to recycle when certain limits are reached so that the transition is seamless to the end user. This app pool was set to recycle when physical memory consumption under the worker process (w3wp.exe) reached 1 GB, or when virtual memory consumption for the app pool reached 1.9 GB. We were not seeing the overlapped recycle take place automatically because the app pool would lock up when memory reached around 940 MB. It was not consistent, though, so the cause could not be readily identified. We eventually trimmed the values back to 800 MB physical and 1.5 GB virtual memory before triggering a recycle.
Once the app pool reached either of those limits in its memory consumption, IIS would spin up a new w3wp.exe worker process with a fresh app pool, and all new SharePoint requests would be directed to that process instead. All pending requests on the existing worker process/app pool would complete, or be terminated once the timeout configured in IIS was reached. After all requests completed and released their execution threads, the worker process would terminate and release its memory back for IIS to use. If you are seeing similar behavior in your SharePoint environment, there are a few things you need to pay attention to:
- IIS Timeout setting.
- Runaway/locked up threads.
- Time between recycles.
- Physical server memory.
- Bit architecture of the server and SharePoint.
Once your server isn’t crashing for end users any more, it’s time to tune its health more closely. Identify what the IIS timeout setting is for your server/app. If your server is still on IIS 6, you will want to ensure that the LogEventOnRecycle property in the IIS metabase is set to true. Next, look in the Event Log under System for event ID 1077, which indicates that an overlapped recycle took place for the app pool. Make sure to note the time between these messages; it’s best to use the smallest interval, which should correspond to the peak volume time of day for the given server. Lastly, check how much physical memory the server has and what the bit architecture of the server, the OS and SharePoint is, i.e. are you running 32-bit or 64-bit?
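Once you have the timestamps of those 1077 events (exported from Event Viewer, for example), finding the shortest recycle interval is simple arithmetic. A minimal sketch in Python — the function name and the sample timestamps below are mine, purely for illustration:

```python
from datetime import datetime

def min_recycle_interval_minutes(timestamps):
    """Given ISO-format timestamps of app pool recycle events
    (event ID 1077), return the shortest gap between consecutive
    recycles, in minutes."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
    return min(gaps)

# Hypothetical timestamps pulled from the System event log:
events = ["2009-03-10T09:00:00", "2009-03-10T09:42:00", "2009-03-10T10:05:00"]
print(min_recycle_interval_minutes(events))  # shortest gap: 23.0 minutes
```

The smallest gap is the number you care about, because it is the tightest window your IIS timeout has to fit inside.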
Now it’s time for some math. If you have a 32-bit server running 32-bit Windows Server and 32-bit SharePoint, this is a much more crucial issue than if you were running all 64-bit. The issue deals with memory. You have to figure that the server will not realistically have more than half of its actual physical ADDRESSABLE memory available to your worker processes. I highlight addressable here because remember that under a 32-bit architecture, your server cannot address more than 3.2 GB of memory, even if it has 8 GB of physical memory!
Thus in our all-32-bit example, even though the server has 4 GB of physical memory, the OS can only address 3.2 GB, which means by my math about 1.6 GB would be available to our worker processes in IIS. You may be tempted to use something just below that as your recycle point, but remember that we have OVERLAPPED RECYCLE going on, which means that IIS is managing two worker processes at the same time, so each would require its own memory in order to function properly.
That was the problem we ran into when the recycle threshold was set at 1 GB. The worker process would trip the limit and IIS would attempt to spin up an overlapping worker process, but since there wasn’t enough memory available to do so, it took no time at all to completely lock up IIS and bring down end users. Only a forced recycle of the app pool could restore the server to a working state: it forcibly releases all threads and memory pages, thus also dropping users, before spinning up a new worker process.
By dropping the memory recycle trigger down to 800 MB instead, we kept each worker process to half of our available memory, or 25% of the addressable memory. When the worker process triggered the overlapped recycle, IIS would spin up a second worker process and direct traffic to it while finishing up requests in the first. Provided none of those requests had runaway threads, the first worker process would typically shut down and release its memory within a minute or two.
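The arithmetic above can be captured in a tiny sketch. The two halvings are my rule of thumb from this scenario, not any official IIS formula: half of addressable memory realistically available to worker processes, then half again so two overlapping processes can coexist:

```python
def worker_process_budget_mb(addressable_mb, overlapped=True):
    """Rough per-process memory budget: assume at most half of
    addressable memory is realistically available to IIS worker
    processes; with overlapped recycling, two processes must fit
    in that budget at once."""
    available = addressable_mb // 2
    return available // 2 if overlapped else available

ADDRESSABLE_32BIT_MB = 3200  # ~3.2 GB addressable on this 32-bit box
print(worker_process_budget_mb(ADDRESSABLE_32BIT_MB))  # -> 800 (MB per process)
```

That 800 MB result is exactly where we landed for the physical memory recycle trigger.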
This gets the server into a usable state as far as the end user is concerned, because they no longer see crashes or lock-ups. On the server side, you will see the app pools recycle much more frequently, and you run the risk that a runaway thread will lock up the first worker process until the IIS timeout is reached. That setting is 15 minutes under IIS by default, but most SharePoint shops have upped it to 30 minutes, especially where low-bandwidth or VPN users are in play. As a result, a runaway thread would keep the first worker process alive for 30 minutes. You can see how the time between recycles now becomes super CRITICAL! If your overlapped recycles happen more frequently than your IIS timeout value, change something.
RECOMMENDATION: Ensure that your IIS timeout value is always LESS than your overlapped recycle time at its shortest interval.
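A quick sanity check for that recommendation might look like this. The function name and inputs are mine, purely for illustration, using the shortest recycle interval found earlier:

```python
def recycle_config_ok(iis_timeout_min, shortest_recycle_interval_min):
    """The IIS timeout must be shorter than the shortest observed gap
    between overlapped recycles; otherwise a runaway thread can keep
    an old worker process alive into the next recycle and pile up
    worker processes."""
    return iis_timeout_min < shortest_recycle_interval_min

print(recycle_config_ok(30, 23))  # False -> change something
print(recycle_config_ok(15, 23))  # True  -> safe margin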
Of course the answer is to solve the memory leak problems so that the app pools don’t have to recycle, but if you’ve ever tried to track down memory leaks, you know it’s HELL! If you’ve never had the misfortune of having to do so, consider yourself truly blessed.
It’s also not always realistic to bring the IIS timeout value down. If your server is recycling worker processes every 15 minutes, it’s certainly not doable. That’s when it becomes mission critical to hunt down any runaway threads and determine their cause. Anything that may cause the worker process to remain alive needs to be addressed in order to keep your server up and running. At my client’s site, we were still getting runaway processes that could potentially put us in a state where a third worker process would need to be spun up, which would bring the whole thing to a screeching halt.
As an Enterprise Architect I get to see all sides of the fence. I work with and talk to everyone involved. When talking to developers, the feeling is usually that Ops people must have done “something” to the servers which is causing the instability. When talking to operations personnel, the feeling is usually that Devs are writing bad code that’s causing the instability. I’ve been in many SharePoint shops and have seen both sides of this argument be true, but not this time.
We had an awesome traffic profiling tool available for the job and that’s where we discovered two items that would cause runaway threads.
SQL Server Reporting Services Integrated Mode. If you’re a SharePoint Architect, you probably just had a cold shiver go over your entire body as you read that line. Yes, every SharePoint shop dabbles with SSRS Integrated Mode at some point. Most come to the conclusion that performance is a problem and usually deploy a dedicated server to run SSRS. That was also the case here. Unfortunately, there were a couple of instances of Integrated Mode reports that could not be moved over to the dedicated server, so Integrated Mode was left active.

What we discovered was a series of reports developed and built by end users (as SSRS empowers them to do). Of course, end users are not going to know how to write optimized queries for data, so these reports performed poorly. There were reports that would take upward of 30 seconds to load, and that was while local to the servers on a 1 Gb Ethernet connection. The reports had very large amounts of tabular data, and we know how well IE renders tables. Imagine being a user on a remote VPN connection: your wait time on the report could easily go over 2 minutes.

The problem with that is the thread requesting data is locked up while all this data is transferred and interpreted for rendering in the browser. Additionally, a user could easily lose patience and simply close their browser, fearing that it may be “locked up”. When a user does that, the thread still remains alive in the background until the download is complete, and the loss of the endpoint on the client side could very well cause the thread to become a runaway thread that never releases its resources. No matter how you slice or dice it, it’s bad.
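If you have request-duration data from a traffic profiling tool, even a trivial filter will surface the report URLs worth investigating as runaway-thread suspects. This sketch assumes hypothetical log entries with a URL and a duration; the field names are mine:

```python
def flag_slow_requests(requests, threshold_s=30):
    """Return URLs of requests whose duration exceeds the threshold;
    long-running SSRS report requests are prime runaway-thread
    suspects."""
    return [r["url"] for r in requests if r["duration_s"] > threshold_s]

# Hypothetical entries from a traffic profiling tool:
log = [
    {"url": "/Reports/sales.rdl", "duration_s": 124},
    {"url": "/Pages/home.aspx", "duration_s": 2},
]
print(flag_slow_requests(log))  # -> ['/Reports/sales.rdl']
```

Anything that keeps showing up in that list deserves a look at its underlying query before you touch any IIS settings.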
As we’ve seen in this case, as developers and architects, we always have to be conscious of our end users. Tools we provide them in order to empower them can often come back to haunt us at the most inopportune times.