Friday, December 20, 2013

Unexpected unloading of mono web application


After several bugs in the mono GC were fixed, I was able to run benchmarks for an aspx page served by apache2+mod-mono. I used mono from the master branch; mono --version says: "Mono Runtime Engine version 3.2.7 (master/01b7a50 Sat Dec 14 01:48:49 NOVT 2013)". The SIGSEGV crashes went away, but unfortunately I can't say that serving aspx with apache2 is stable now. Twice during the benchmarks I got something similar to a deadlock: mono stopped processing requests and got stuck consuming 100% of CPU. I don't know what that was; my attempt to debug the mono process with GDB did not bring an answer (unlike the other cases, when GDB helped me find the cause of deadlocks/SIGSEGVs, or at least the suspicious code, and send this info to the mono team). There are also memory leaks. And one more bad thing: the server stops responding after processing ~160 000 requests, but there is a workaround for it.

Mono .aspx 160K requests limit

If you run ab -n 200000 http://yoursite/hello.aspx, where hello.aspx is a simple aspx page which does nothing and the site is served under apache mod-mono, after ~160K requests you'll get a denial of service. This error is caused by several factors; I'll try to explain what is going on and how to avoid it.
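For reference, a page as trivial as the following is enough to reproduce it (a minimal sketch of hello.aspx, matching the command above):

   <%@ Page Language="C#" %>
   <html>
     <body>
       <!-- intentionally empty: the page does nothing -->
     </body>
   </html>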

When a request comes to an aspx page, the web server creates a new session. Then the session is saved to the internal web cache. When the second request comes, the server tries to read the session cookie and, if it is not found, creates and saves a new session to the cache again. So every request without cookies creates a new session object in the cache. This could cause huge memory leaks if the number of sessions grew without bound, so to prevent this the web server has a limit on the number of objects the internal web cache can store. This limit is defined as a constant in Cache.cs and hardcoded to 15000.

When the number of objects in the internal cache hits 15000, the web server starts to aggressively delete objects from the cache using an LRU strategy. So if a user got his session 5 minutes ago and has been working with the site by clicking a page every minute, his session will be removed from the cache (and all the data inside it will be lost), while some hostile script (which never sets a session cookie) that made 15K requests to the page during the last minute has created 15K empty sessions. But this is not all.
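To make the mechanism concrete, here is a simplified sketch of such a cache. This is not the real Mono Cache.cs (the class name and the LOW_WATERMARK value are illustrative); it only models the behaviour described above: a hardcoded limit and LRU eviction that treats every entry the same way.

   using System.Collections.Generic;

   // Simplified sketch of a cache with a hardcoded limit and LRU eviction.
   // Every entry is equal here: a throw-away session and a compiled
   // assembly are evicted by exactly the same rule.
   class BoundedLruCache
   {
    const int LOW_WATERMARK = 10000;   // illustrative value
    const int HIGH_WATERMARK = 15000;  // the hardcoded limit mentioned above

    // head of the list = most recently used entry; keys are assumed unique
    readonly LinkedList<KeyValuePair<string, object>> lru =
     new LinkedList<KeyValuePair<string, object>> ();
    readonly Dictionary<string, LinkedListNode<KeyValuePair<string, object>>> index =
     new Dictionary<string, LinkedListNode<KeyValuePair<string, object>>> ();

    public void Insert (string key, object value)
    {
     // once the limit is hit, aggressively evict from the LRU tail
     if (index.Count >= HIGH_WATERMARK)
      while (index.Count > LOW_WATERMARK) {
       LinkedListNode<KeyValuePair<string, object>> victim = lru.Last;
       lru.RemoveLast ();
       index.Remove (victim.Value.Key);
      }

     index[key] = lru.AddFirst (new KeyValuePair<string, object> (key, value));
    }
   }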

The internal cache is also used for storing some important server objects; for example, all dynamically compiled assemblies are stored there. And there is no preference for server objects when deleting from the cache: all objects are equal. So if some server object has not been accessed for too long, it will be removed. And this is the cause of the second error.

Here is the code of the GetCompiledAssembly() method. It's called every time the page is accessed:

   string vpabsolute = virtualPath.Absolute;

   // Precompiled site: look the page type up in the precompiled map.
   if (is_precompiled) {
    Type type = GetPrecompiledType (vpabsolute);
    if (type != null)
     return type.Assembly;
   }

   // Otherwise try the internal cache first.
   BuildManagerCacheItem bmci = GetCachedItem (vpabsolute);
   if (bmci != null)
    return bmci.BuiltAssembly;

   // Not cached (or already evicted): compile the page; Build()
   // stores the result back into the internal cache.
   Build (virtualPath);
   bmci = GetCachedItem (vpabsolute);
   if (bmci != null)
    return bmci.BuiltAssembly;

   return null;

Let's look. When an .aspx page is accessed for the first time, the server checks whether the site was precompiled; if it was, it returns the precompiled type's assembly. If not, it tries to find the compiled page in the internal cache, and if it is not found there it compiles the page and stores the compiled type in the cache (inside the Build() function). The scheme looks good, but not in our case. When the internal cache outgrows the 15K limit, the compiled type is removed from the cache even though it was accessed just now! I think there is some bug in the LRU implementation, or maybe the object is fetched from the LRU only once and saved into some temporary variable, so the LRU entry's last access time is never updated.
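For comparison, in a correct LRU cache a read is supposed to refresh the entry's position, roughly like this (a hypothetical Get added to the BoundedLruCache sketch above, not Mono's code):

   // hypothetical Get for the BoundedLruCache sketch above
   public object Get (string key)
   {
    LinkedListNode<KeyValuePair<string, object>> node;
    if (!index.TryGetValue (key, out node))
     return null;

    // a read must move the entry to the most-recently-used end;
    // if this step is skipped (or performed on a stale copy), frequently
    // used items such as the compiled page type look old and get evicted
    // together with abandoned sessions
    lru.Remove (node);
    lru.AddFirst (node);
    return node.Value.Value;
   }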

You may ask: "So what? The compiled type was deleted from the cache, but won't it be there again on the next page request? The algorithm checks for the type in the cache, and if it is not found it compiles the page again and places it into the cache. That could reduce performance, but it can't be the reason for a denial of service." And you'd be right: this is not exactly the reason for the DoS. But if you look inside the page compilation code, you'll find that it has a limit on the number of recompilations. And when this limit is reached, it unloads the AppDomain together with the whole application! And on top of that, mod-mono somehow does not handle the AppDomain unloading (I don't know why; it probably should), so after ~160K requests the page stops responding. Here is the relevant code:

try {
 BuildInner (vp, cs != null ? cs.Debug : false);
 if (entryExists && recursionDepth <= 1)
  // We count only update builds - first time a file
  // (or a batch) is built doesn't count.
  buildCount++;
} finally {
 // See http://support.microsoft.com/kb/319947
 if (buildCount > cs.NumRecompilesBeforeAppRestart)
  HttpRuntime.UnloadAppDomain ();
 recursionDepth--;
}

How can this be worked around?
I know only one way: always use a precompiled web site. At first I had hoped that the cache constants LOW_WATERMARK and HIGH_WATERMARK could be changed by setting an appropriate environment variable, but unfortunately they can't. In my opinion the cache usage should be rewritten: user sessions and the web server's internal objects should have different storage places and must not affect each other. Also, a session should not be created on the first page access; if the page doesn't ask for the session object, it can be created later, when it is really needed for processing the page.
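A rough sketch of what that separation could look like (hypothetical names, reusing the BoundedLruCache sketch from above, just to make the idea concrete):

   using System.Collections.Generic;

   // Hypothetical layout: user data and server infrastructure are kept apart,
   // so a flood of cookie-less requests can never evict compiled pages.
   static class WebCaches
   {
    // bounded and LRU-evicted: losing an entry costs at most one user session
    public static readonly BoundedLruCache Sessions = new BoundedLruCache ();

    // stored separately (bounded by its own policy, if bounded at all):
    // compiled assemblies, configuration and other internal server objects
    public static readonly Dictionary<string, object> InternalObjects =
     new Dictionary<string, object> ();
   }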
