Posts Tagged ‘Best Practices’

Series – Building your business on Microsoft technologies (Part 0 – Roadmap)

Written by Cornelius J. van Dyk on . Posted in Blog

Many people launch new businesses, or grow small businesses to the point where IT starts to play a role.  It is at that tipping point that the question of which software to use and build on becomes unavoidable.  As happens in most companies, someone installs a software package that most closely does whatever is most urgently needed, and it starts to gain user traction.  This repeats over and over until, at some point, someone has to figure out how to untangle the resulting spaghetti mess.

If only someone had planned the expansion and use of software beforehand, it would have saved tons of time for whoever ends up with that project.  And that… is where this series comes into play…

I’ve built my career over the last 10 years or so on Microsoft technologies.  There’s always someone out there who’s done what you need, IF you understand what you need.  That’s what Enterprise Architects do best: understanding the business need and marrying it up with technology decisions that will help drive the business forward.  I intend this series to provide a road map for anyone who needs to build a business on technology in a way that’ll allow less rework down the road.  I will cover the topics as one may encounter them, from the perspective of small (or even one man/woman) IT departments where budgets are tight (especially in the current economy) and high priced consultants aren’t always an option.  The most expensive thing that’s done in IT is rework: doing the same thing over and over again because it wasn’t done properly the first time.

My vision for this series is to be a guide that most IT personnel could follow to deploy technologies within their company that’ll be properly positioned to support company growth in the future, requiring little to no rework at any point in time.  So without any further delay… here is my Roadmap for this series… Please note that I’ll be updating the Roadmap as time goes on and I write the corresponding articles and link to them.  It may be a good idea to Bookmark/Favorite this post for future reference.

  1. Installation – Windows Server 2008 R2.  Since Windows Server 2008 R2 is the latest and greatest server operating system from Microsoft, we’ll use it as the basis for all our servers.
  2. Configuration – Creating the Primary Domain Controller – Enabling the Active Directory Domain Services Role on Windows Server 2008 R2.  Once we have our first server with an operating system installed, it’s time to create our company domain.  We’ll be using Active Directory authentication for our environment.
  3. Business Continuity – Enabling and Testing the Windows Server Backup Feature on Windows Server 2008 R2.  No progress can or should be made until we’re sure we can recover from absolute disaster.  That means our server is completely dead and we have to restore onto new metal.  Backup and Restore functionality must be tested before we do anything else.
  4. Configuration – Enabling the Hyper-V Role on Windows Server 2008 R2.  Getting ready for virtualization is a key action here.  In a small business, there is seldom money for multiple servers so we have to stretch our resources to the max by employing virtualization.  Since running absolutely everything on one single server is not only NOT recommended as a Best Practice but also detrimental to scaling with business growth, virtualization is a perfect solution.  We will be using Microsoft’s Hyper-V technology to host all our servers on the same physical box.
  5. Installation – SQL Server 2008 R2 on Hyper-V.  Since absolutely everything we’ll do requires a SQL Server database, and since SQL Server 2008 R2 is Microsoft’s latest and greatest database server product, we’ll build on it.  Initially we’re not going to cluster or scale the SQL Server, but that will be the first point of scaling once volume and traffic increase.
  6. Business Continuity – Configuring and Testing Disaster Recovery for Hyper-V Servers.  Since our SQL Server was the first Hyper-V server we built, we have to test the Backup and Restore of our Hyper-V server before proceeding.
  7. Installation – Exchange Server 2010 on Hyper-V.  Now that we have a domain and a database server, we need email.  We’ll be building on Microsoft’s latest email server for that.
  8. Business Continuity – Testing Disaster Recovery for the Exchange Server 2010 server on Hyper-V.
  9. Installation – SharePoint Server 2010 on Hyper-V.  After establishing email for the company, we need to work on the web site and collaboration between employees.  We’ll use the latest version of SharePoint for that.
  10. Business Continuity – Testing Disaster Recovery for the SharePoint Server 2010 server on Hyper-V.
  11. Etc.

And so the list will grow and continue over time.  I am going to endeavor to post a new chapter in the series every week to two weeks so stay tuned.



SharePoint, IIS, w3wp.exe, threads and app pools… Sometimes, it IS the end user after all!

Written by Cornelius J. van Dyk on . Posted in Blog

I’ve been helping a good client of mine troubleshoot some performance issues with their SharePoint environment.  They have a single 32 bit MOSS 2007 server, so with 1,000+ active users (not concurrent, though) it is stretched about as thin as it will go.  Recently, the server started having issues where the app pool would get locked up and take all the users down.  Now, IIS app pools are designed to recycle when certain limits are reached, so that the recycle is seamless to the end user.  The app pool was set to recycle when memory consumption under the worker process (w3wp.exe) reached 1 GB, or when virtual memory consumption for the app pool reached 1.9 GB.  We were not seeing the overlapped recycle take place automatically because the app pool would get locked up when memory reached around 940 MB.  It wasn’t consistent though, so it couldn’t be identified readily.  We ended up trimming the values back to 800 MB physical and 1.5 GB virtual memory as the triggers for a recycle.

Once the app pool reached either of those limits in its memory consumption, IIS would spin up a new w3wp.exe worker process with a fresh app pool, and all new SharePoint requests would be directed to that process instead.  All existing pending requests on the current worker process/app pool would complete, or be terminated once the timeout configured in IIS was reached.  After all requests completed and released their execution threads, the worker process would terminate and release its memory back to the pool for IIS to use.

If you are seeing similar behavior in your SharePoint environment, there are a couple of things you need to pay attention to:
  1. IIS Timeout setting.
  2. Runaway/locked up threads.
  3. Time between recycles.
  4. Physical server memory.
  5. Bit architecture of the server and SharePoint.

Once your server isn’t crashing for end users any more, it’s time to tune its health more closely.  Identify what the IIS timeout setting is for your server/app.  If your server is still on IIS6, you will want to ensure that the LogEventOnRecycle property in the IIS metabase is set to true.  Next, look in the Event Log under System for message 1077, which indicates that an overlapped recycle took place for the app pool.  Make sure to note the time between these messages.  It’s best to use the smallest interval, which should correspond to the peak volume time of day for the given server.  Lastly, you want to establish how much physical memory the server has and what the bit architecture of the server, the OS and SharePoint is, i.e. are you running 32 bit or 64 bit.
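To make that interval check concrete, here’s a small sketch that takes the Event ID 1077 timestamps you collected and compares the shortest recycle interval against the IIS timeout.  The timestamps and the 30 minute timeout are made up for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical Event ID 1077 (overlapped recycle) timestamps pulled
# from the System event log during a peak-traffic morning.
recycles = [
    datetime(2010, 6, 1, 9, 5),
    datetime(2010, 6, 1, 9, 52),
    datetime(2010, 6, 1, 10, 31),
    datetime(2010, 6, 1, 10, 58),
]

# Shortest gap between consecutive recycles.
shortest = min(b - a for a, b in zip(recycles, recycles[1:]))

iis_timeout = timedelta(minutes=30)  # the (raised) IIS timeout in this example

print(shortest)                # 0:27:00
print(shortest > iis_timeout)  # False -> timeout exceeds recycle interval: trouble
```

If the shortest interval comes out below the timeout, a single runaway thread can keep an old worker process alive right into the next recycle, which is exactly the condition you want to avoid.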

Now it’s time for some math.  If you have a 32 bit server running 32 bit Windows Server and 32 bit SharePoint, this is a much more crucial issue than if you were running all 64 bit.  The issue deals with memory.  You have to figure that the server will not realistically have more than half of its actual physical ADDRESSABLE memory available to your worker processes.  I highlight addressable here because, remember, under a 32 bit architecture your server cannot address more than 3.2 GB of memory, even if it has 8 GB of physical memory!

Thus in our all 32 bit example, even though the server physically has 4 GB of memory, the OS can only address 3.2 GB, which means by my math that about 1.6 GB would be available to our worker processes in IIS.  You may be tempted to use something just below that as your recycle point, but remember that we have OVERLAPPED RECYCLE going on, which means that IIS is managing two worker processes at the same time, so each would require its own memory in order to function properly.
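A quick back-of-the-envelope sketch of that math, using the numbers from this example (the halving factors are the rough rules of thumb described above, not exact figures):

```python
physical_gb    = 4.0
addressable_gb = min(physical_gb, 3.2)  # a 32 bit OS can't address the full 4 GB
worker_pool_gb = addressable_gb / 2     # roughly half is realistically free for w3wp.exe

# An overlapped recycle means TWO worker processes are alive at once,
# so each one can only safely claim half of that again.
per_process_gb = worker_pool_gb / 2

print(worker_pool_gb)  # 1.6
print(per_process_gb)  # 0.8 -> the 800 MB recycle threshold we settled on
```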

That was the problem we ran into when the recycle threshold was set at 1 GB.  The worker process would trip the limit and then IIS would attempt to spin up an overlapping worker process, but since there wasn’t enough memory available to do so, it took no time at all to completely lock up IIS and bring down end users.  Only a forced recycle of the app pool, which forcibly releases all threads and memory pages, thus also dropping users, before spinning up a new worker process, could restore the server to a working state.

By dropping the memory recycle trigger down to 800 MB, each worker process consumed half of our available memory, or 25% of the addressable memory.  When the worker process triggered the overlapped recycle, IIS would spin up a second worker process and direct traffic to it while finishing up requests in the first worker process.  Provided none of these requests had runaway threads, the first worker process would typically shut down and release its memory within a minute or two.

This gets the server into a usable state as far as the end user is concerned, because they no longer see crashes or get locked up.  On the server side, you will see the app pools recycle much more frequently, and you run the risk that a runaway thread will lock up the first worker process until the IIS timeout is reached.  That setting is 15 minutes under IIS by default, but most SharePoint shops have upped it to 30 minutes, especially where low bandwidth or VPN users are in play.  As a result, a runaway thread would keep the first worker process alive for 30 minutes.  You can see how the time between recycles now becomes super CRITICAL!  If your overlapped recycles happen more frequently than your IIS timeout value, change something.

RECOMMENDATION:  Ensure that your IIS timeout value is always LESS than your overlapped recycle time at its shortest interval.

Of course the answer is to solve the memory leak problems so that the app pools don’t have to recycle, but if you’ve ever tried to track down memory leaks, you know it’s HELL!    If you’ve never had the misfortune of having to do so, consider yourself truly blessed.

It’s also not always realistic to bring the IIS timeout value down.  If your server is recycling worker processes every 15 minutes, it’s certainly not likely to be doable.  That’s when it becomes mission critical to hunt down any runaway threads and determine their cause.  Anything that may cause the worker process to remain alive needs to be addressed in order to keep your server up and running.  At my client’s site, we were still getting runaway processes that could potentially put us in a state where a third worker process needed to be spun up, which would bring the whole thing to a screeching halt.

As an Enterprise Architect I get to see all sides of the fence.  I work with and talk to everyone involved.  When talking to developers, the feeling is usually that Ops people must have done “something” to the servers which is causing the instability.  When talking to operations personnel, the feeling is usually that Devs are writing bad code that’s causing the instability.  I’ve been in many SharePoint shops and have seen both sides of this argument be true, but not this time.

We had an awesome traffic profiling tool available for the job and that’s where we discovered two items that would cause runaway threads.

  1. SQL Server Reporting Services Integrated Mode.  If you’re a SharePoint Architect, you probably just had a cold shiver go over your entire body as you read that line.  Yes, every SharePoint shop dabbles with SSRS Integrated Mode at some point.  Most come to the conclusion that performance is a problem and usually deploy a dedicated server to run SSRS.  That was also the case here.  Unfortunately, there were a couple of instances of Integrated Mode reports that could not be moved over to the dedicated server, so Integrated Mode was left active.  What we discovered was a series of reports developed and built by end users (as SSRS empowers them to do).  Of course end users are not going to know how to write optimized queries for data, so these reports performed poorly.  There were reports that would take upward of 30 seconds to load, and that was while local to the servers on a 1 GB ethernet connection.  The reports had very large amounts of tabular data, and we know how well IE renders tables.  Imagine being a user on a remote VPN connection: your wait time on the report could easily go over 2 minutes.  The problem with that is that the thread requesting data is locked up while all this data is transferred and interpreted for rendering in the browser.  Additionally, a user could easily lose patience and simply close their browser, fearing that it may be “locked up”.  When a user does that, the thread still remains alive in the background until the download is complete, and the loss of the end point on the client side could very well cause the thread to become a runaway thread that never releases its resources.  No matter how you slice or dice it, it’s bad.
  2. Image Rotating Banner.  We have a nifty little web part that adds pizzazz to user created pages by rotating through images chosen by the designer/user.  Now as I said, any time we empower end users to do design of content delivery, a LOT of thought has to go into it.  In this case, the web part was designed for ease of use: all the designer/user had to do was drop it on the page, set its Title and point it to an Image Library on the site.  Then, when the page loaded, the web part would start rotating images using JavaScript.  Nice.  But wait, there’s more.  Using our awesome traffic profiling tool, we discovered pages, like main departmental home pages, that were loading literally dozens of images.  Taking a closer look at those pages, they appeared to load rather slowly.  If you’ve ever dissected the loading sequence of a SharePoint page, you’d know that, even if you set pages to display partially while downloading content, the JavaScript is usually the last part to be downloaded.  As a SharePoint developer would understand, a SharePoint page isn’t really functional until that JavaScript has loaded.  None of the dropdowns work, etc.  But I digress.  Needless to say, until the page is completely loaded, you can’t really do too much.  What we saw was all these images loading with the page before the script would load, making page load times very slow.  To make matters worse, we looked at some of the pictures being loaded, and most were not resized to the 100 pixel banner size they displayed at.  On the contrary, the images were in their original 9 megapixel JPG format!    Cracking open the code for the web part, we discovered that it did exactly what I just described: it showed ALL of the pictures in the picture library, regardless of SIZE.
Though that design is OK where experienced developers or web designers would be using the web part, it unfortunately does not work well for end users or inexperienced designers, because it’s not realistic to expect an end user to think about the number of pictures being displayed, all being preloaded on the page, as well as the size of those images.  Considering some images were up to 5 MB in size and libraries easily contained in excess of 20 images, you can see how a 1 MB page, now having to preload all these images, suddenly became a 100 MB+ page.  That’s never good for performance.  Now granted, the web part should probably use AJAX to load its images and not preload them on the page, but this was the design that was available.  We implemented a hot fix to the code whereby we simply leveraged SharePoint’s built in thumbnails for image libraries, since it’s just a banner anyway.  In addition, we display only a random 10 images from the library each time.  That meant no more than 100 KB in extra page size.  Again, you can see how a user could easily give up and close the browser, leaving a thread locked as it processes.
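A rough sketch of the payload math behind that hot fix.  The library contents, image sizes and thumbnail weight below are made up for illustration; this is not the web part’s actual code:

```python
import random

# A hypothetical image library: 25 originals of ~5 MB each.
library = [{"name": f"photo{i}.jpg", "bytes": 5 * 1024 * 1024} for i in range(25)]

THUMB_BYTES = 10 * 1024  # assume a generated thumbnail weighs ~10 KB

def banner_payload(items, max_images=10):
    """Pick at most max_images random items and cost them as thumbnails."""
    picks = random.sample(items, min(max_images, len(items)))
    # Each pick costs only a thumbnail, not the original image.
    return len(picks) * THUMB_BYTES

before = sum(img["bytes"] for img in library)  # every original preloaded
after = banner_payload(library)                # 10 random thumbnails

print(before // (1024 * 1024))  # 125  (MB of images the old page preloaded)
print(after // 1024)            # 100  (KB for the fixed banner)
```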

As we’ve seen in this case, as developers and architects, we always have to be conscious of our end users.  Tools we provide them in order to empower them can often come back to haunt us at the most inopportune times.



Why passing SPBasePermissions.FullMask to SPContext.Current.Web.DoesUserHavePermissions() instead of SPBasePermissions.ManageWeb when you’re trying to determine if a user is a SPWeb administrator, is a bad idea…

Written by Cornelius J. van Dyk on . Posted in Blog

OK, so the title of this post could also have been “Best Practices for Determining if a User is a SPWeb Administrator”, but then the search engines wouldn’t catch the post for all those unfortunate enough to be searching for FullMask or DoesUserHavePermissions() in the future.

The Problem

OK, OK, in all seriousness though.  There is a lot of content out there that recommends people simply use SPContext.Current.Web.DoesUserHavePermissions(SPBasePermissions.FullMask) when trying to determine if the current user has Administrator rights to the current web site (SPWeb).  This is all good and well, but it assumes that you have NEVER customized your web application’s available permissions list, i.e. your effective base permissions.  So what’s the problem, you may be wondering?  The problem is in the way SharePoint behaves when you do indeed customize your permissions for the web app.  If you dive into SharePoint Central Administration under Central Administration > Application Management > Permissions for Web Application, you will find all the SharePoint base permissions and the ability to turn any one of these permissions off by simply unchecking its checkbox and then clicking the OK button… all except for one… FullMask.  If you were to uncheck, say, UseClientIntegration (in order to prevent desktop apps such as Office or SPD from editing content directly on the server) and then save that state, SharePoint will do two things.

  1. It will remove the UseClientIntegration bit flag from the permissions bit mask and
  2. because total full control is no longer possible for the web app, it will also remove the FullMask bit flag from the mask.
That’s all fine and dandy until you go and add the permissions back again.  If you now recheck the UseClientIntegration checkbox and click OK, you’d expect SharePoint to add the bit flag back to the permissions mask again, and it does do that, but for the UseClientIntegration flag only.  If you’re expecting it to also reset the FullMask flag, you’ll be disappointed.  This appears, at least in my mind, to be a bug in the SharePoint core code.  Yes, yes, I know it’s most probably “working as designed” or “behaving as intended”, but given the absence of any UI way to reset the FullMask flag, as well as the sparse documentation surrounding it, this just feels like a bug and not an intended feature.  So to be clear… SharePoint does not reset the FullMask security bit once permissions on the web app have been customized!

Now as far as the SharePoint UI and everything else an end user sees is concerned, everything works perfectly as per usual, with no ill effects.  It’s only when you drop into the world of the SharePoint developer that things can become hairy.  In case you were wondering, here’s the base permissions bit mask as returned when the FullMask bit is set:

7FFFFFFFFFFFFFFF

And once customized, even with all permissions enabled again, the bit mask returns rather as a union of all aggregated permissions and looks like this:

400001F07FFF1BFF

If a SharePoint developer used the DoesUserHavePermissions() method and passed the FullMask flag to it in hopes of identifying an admin user, the method will always return False, because the FullMask bit is never reset again.  So using SPContext.Current.Web.DoesUserHavePermissions(SPBasePermissions.FullMask) simply isn’t reliable for admin checking in code.  Fortunately, the failure occurs on the safe side, i.e. it reports an Admin to simply not be an Admin rather than reporting a User to be an Admin, so most probably your app doesn’t break, but is simply not quite working as expected.
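The effect is easy to demonstrate with plain bit arithmetic.  The two mask values are the ones quoted above; the ManageWeb bit value (0x40000000) matches SPBasePermissions.ManageWeb to the best of my knowledge, and the check below mirrors how a mask comparison like DoesUserHavePermissions() behaves:

```python
FULL_MASK  = 0x7FFFFFFFFFFFFFFF  # SPBasePermissions.FullMask
MANAGE_WEB = 0x40000000          # SPBasePermissions.ManageWeb

pristine_admin   = 0x7FFFFFFFFFFFFFFF  # FullMask bit still set
customized_admin = 0x400001F07FFF1BFF  # union of the individual permissions

def has_permissions(user_mask, wanted):
    # Every requested bit must be present in the user's mask.
    return (user_mask & wanted) == wanted

print(has_permissions(pristine_admin, FULL_MASK))     # True
print(has_permissions(customized_admin, FULL_MASK))   # False - the admin "vanishes"
print(has_permissions(customized_admin, MANAGE_WEB))  # True  - still detected
```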
The Solution

So then, you ask, what exactly would be the best practice for determining if the current user is an admin?  You may be tempted to use code like ((ISecurableObject)SPContext.Current.Web).DoesUserHavePermissions((SPBasePermissions)Microsoft.SharePoint.SPPermissionGroup.WebDesigner) and then use SPPermissionGroup.Administrator as the target, but since the value of Administrator in this case is –1 and the DoesUserHavePermissions() method is looking for a ulong value, it will epically fail on you, even if you were to “duck punch” it with up/down casts like (SPBasePermissions)(Object)…  Rather, the proper way to check if a user has admin rights to the current SPWeb, regardless of source (web, site collection or CA policy), is as follows:

((ISecurableObject)SPContext.Current.Web).DoesUserHavePermissions(SPBasePermissions.ManageWeb)

By checking whether the user has the ManageWeb security bit set, you will always get the proper result back.  Using FullMask is akin to trying to remove a wart with a cannon.  ManageWeb is more like a scalpel.

OK, I broke it.  Now what?

If you’re in the boat where you’ve already “broken” the FullMask bit, don’t despair.  There is indeed hope.  Luckily some smart people, like MOSS MVP Gary LaPointe, have struggled with this problem before and have created clever workarounds for it.  To reset the FullMask security bit, you can use the following code, courtesy of Gary:
// Look up the web application and OR the FullMask bit back into its rights mask.
SPWebApplication wa = SPWebApplication.Lookup(new Uri(url));
wa.RightsMask = wa.RightsMask | SPBasePermissions.FullMask;
wa.Update();  // persist the change back to the configuration database
If you’re adventurous, you can go ahead and write an app or even your own STSADM extension for that, but if you’re like me, you’ll be happy to know that Gary already did that!  Simply go and get his MOSS or WSS extension methods for STSADM package from his Download Page.  It comes all nicely packaged in a .wsp, ready for deployment into your SharePoint environment.  The STSADM operation you need is gl-enableuserpermissionforwebapp, and the proper syntax for it is as follows:

stsadm -o gl-enableuserpermissionforwebapp -fullmask -url http://YourWebAppURL

So there you go.  The best practice for determining if a user is an administrator for a site, and a way to fix it if your code is using FullMask and broke because someone played with permissions.  Enjoy…