Espace Deployment crashed server

Hi all.

Yesterday we got a very strange error. The Agile Platform became unavailable. Restarting IIS and Windows had no effect; stopping and starting the services made the sites available for one or two clicks until IIS failed again.

We suspect it was caused by deploying too fast (we published 3 related espaces consecutively instead of publishing the usual solution).

After half an hour trying to undo it - the funny thing was that I could get to Service Center through Service Studio without delays, while browsers were stuck - we disabled those espaces and everything else worked fine.
Starting the espaces one by one, they also worked fine.

The publishing happened between 16:45 and 16:50. The related Event Viewer entries were as follows (we had several of each, especially the first two).

Error 17:02:28 Application Error
Faulting application name: w3wp.exe, version: 7.5.7600.16385, time stamp: 0x4a5bd0eb
Faulting module name: KERNELBASE.dll, version: 6.1.7600.16385, time stamp: 0x4a5bdfe0
Exception code: 0xe053534f
Fault offset: 0x000000000000aa7d
Faulting process id: 0x%9
Faulting application start time: 0x%10
Faulting application path: %11
Faulting module path: %12
Report Id: %13

warning 17:04:25 WAS
A process serving application pool 'OutSystemsApplications' failed to respond to a ping. The process id was '3176'.

Information 17:21:16 Windows Error Reporting
Fault bucket , type 0
Event Name: APPCRASH
Response: Not available
Cab Id: 0

Problem signature:
P1: w3wp.exe
P2: 7.5.7600.16385
P3: 4a5bd0eb
P5: 6.1.7600.16385
P6: 4a5bdfe0
P7: e053534f
P8: 000000000000aa7d
Report Status: 4

This happened in a platform version that is no longer supported, so I'm a bit ashamed of opening a support case and will just ask here: has anyone come across anything similar? I'd appreciate any clue on the subject.

(I found this discussion about an appcrash error caused by inconsistent application logic, but it is not the same. The espaces were and are ok. I'm almost sure it was the publishing process that failed.)

Windows Server 2008 R2 Standard x64 (in VMWare)
Last AppCrash was Jan 2013.
Hello Nuno,

Tracing IIS worker process crashes is not easy. The forum topic you've referred to covers a generic error and provides a troubleshooting technique to actually identify the real error.

The System and Application event logs will not show the real reason for the crash, but tracing the worker process crash with the IIS Debug Diagnostics tools, to collect and analyze a worker process crash dump, will narrow down the cause of the error, or at least identify the classes/objects that caused the crash.

The common causes that can crash a worker process running OutSystems applications are:

- Changes to the applications that introduce a code pattern that crashes the worker process. In the OutSystems language (espace OML only), the only ones I've seen so far are infinite loops, or loops that exhaust the heap memory of the process (a small sketch of what I mean follows this list). If the code runs from an extension, it could be misuse of a system object, or loading unmanaged code into the application (like a C++ DLL that mismatches the process architecture, x86 or x64).
- Instability of IIS due to system misconfiguration, or third-party tools (like antivirus) that cause the process to crash. I don't recall an episode where I found this.
- Filesystem corruption... Windows doesn't like hard reboots very much.
- And obviously, hardware failure (corrupted memory DIMMs). But if this is the case, it should affect other memory-consuming processes as well.
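
To make the first bullet more concrete, here is a minimal C# sketch (the names are made up; it's not taken from any real espace or extension) of the kind of runaway loop that takes a worker process down: either it pegs the CPU until the WAS ping fails, or it keeps allocating until the process runs out of memory.

using System;
using System.Collections.Generic;

class RunawayLoopDemo
{
    static void Main()
    {
        // The exit flag is never set, so this loop never ends.
        // Inside an ASP.NET request this either pegs the CPU (the worker
        // process stops answering WAS pings, like the warning event above)
        // or, because it allocates on every pass, grows until the process
        // runs out of memory and gets torn down.
        var rows = new List<string>();
        bool finished = false;
        while (!finished)
        {
            rows.Add(new string('x', 1024)); // keeps allocating, never released
        }
    }
}

The OML equivalent is typically a cycle whose exit condition never becomes true.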

My suggestion is to try to cover the basics.

- If you can replicate the problem, try to check CPU and memory usage before the crashes. Do they spike? Then you may have an application loop.
- If you think that an application change may have introduced the problem, try to roll back the application.
- If you think that it could be a Timer or BPT Process, stop the Scheduler before starting IIS, and check if the problem persists.
- If you suspect that the concurrent deployments corrupted something, try to redeploy the applications again, one by one. Usually corrupted data in the database causes application errors, not worker process crashes.
- If you suspect file corruption, run a system disk check.

Ultimately, collecting a crash dump and analyzing it with the Microsoft DebugDiag tools will narrow down the problem and possibly determine the cause of the crash. The referred post includes a few links to articles on how to achieve this.

Hope this information is helpful.

Hello Nuno,

You really shouldn't be afraid of calling support. I hear they're pretty decent guys who will always try to help you as much as they can.

Regarding this particular error, from the exception code I can tell it's a stack overflow (maybe I've seen too many of those). All you need to do is plug the exception code into DebugDiag and set up a rule to grab the stack when that code occurs, and you should have a pretty good idea of where the problem is happening (whether in your eSpace's code or in platform code).
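
To illustrate what I mean (a made-up sketch, not your actual code): the simplest pattern that produces that exception code is unbounded recursion, and since the CLR cannot recover from a stack overflow, the whole worker process goes down instead of just one request failing.

using System;

class StackOverflowDemo
{
    // No base case: every call adds another stack frame.
    static int Depth(int n)
    {
        return Depth(n + 1) + 1; // the "+ 1" keeps the call from being optimized into a loop
    }

    static void Main()
    {
        try
        {
            Depth(0);
        }
        catch (Exception)
        {
            // Never reached: since .NET 2.0 a StackOverflowException
            // (exception code 0xE053534F) cannot be caught, so the CLR
            // terminates the process - in IIS that means w3wp.exe dies,
            // just like the Application Error event above.
        }
    }
}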

When all else fails, you can generate a full dump and analyse it with WinDbg.

Let us know if you require further assistance moving on with this issue.

Best regards,
Ricardo Silva
Thanks for the tips.

About server problems, the service desk hasn't found anything so far. The last reboot was in November or so. There are 12GB of free disk space. Memory seems normal, but deployment has a temporary impact that can't be ignored.

There were no timers involved (there's one, but it wasn't running and wasn't changed), no BPT, and no new extensions. This project is one year old (its birthday was Thursday :p) and over the last few months the only changes were screens, consumed web services and user permissions. We have no SUs to change anything more.
Like I said above, enabling the espaces one by one was enough to fix it. That's why I blame it on them.

The platform unavailability caused a small crisis (almost 24h to clear data) and was useful to see the weak points of some key processes. This last day was also about making sure it wouldn't happen again, but I'll get back to the analysis soon.

Thanks again for your contributions Miguel and Ricardo.