Friday, July 25, 2014

What's from China and has 8 legs?

It's one of those errors that could only happen with computers.

Apparently, if you run IIS,
and you run ColdFusion,
and maybe you run Commonspot, 
and a specific Chinese web crawler hits your website, then your website will crash.

For over a week, the client's website had been going down.

Just a few months earlier they had been migrated to a new server and upgraded to the latest software: Windows Server 2008 R2, ColdFusion 10, Commonspot 8 (the CMS), and IIS 7.5.

Things were fine when the server was first set up, but now, several times a day, ColdFusion would hit 100% processor usage and die, requiring a restart to get the site back online.

I traced that issue to a bad webservice call - a webservice the client website utilized heavily had become unreliable, and on those occasions when it didn't respond to requests, ColdFusion pages would hang indefinitely in the system, eventually sucking up all the resources and killing the website.

I fixed this by swapping some cfinvoke calls for straight cfhttp calls to the webservice - there's an issue with cfinvoke where at certain stages of the webservice invocation (like waiting for a response), ColdFusion will go into a holding pattern of just waiting, regardless of timeout settings.  Cfhttp calls worked every bit as well and timed out successfully, thus eliminating the problem.
But it didn't save the website.  That still died several times a day, but it died in a different way.  IIS would show pages with a 503 'Service Unavailable' error.  ColdFusion server monitoring and FusionReactor, useful in diagnosing the last issue, were showing no long-running pages this time.  The server resources never hit their capacity.  Everything appeared to be fine and dandy, even as the website died every few hours.
The webservice issue had masked something else.  Once the original killer was dead, another one reared its ugly head - the server either hadn't been running long enough for this one to be noticed, or whatever crashes it had caused in the interim were assumed to be the webservice issue.
Server event logs and Performance Monitoring of the application pools started to hint at the issue - I found errors like these in the event logs:

A process serving application pool 'DefaultAppPool' exceeded time limits during shut down. The process id was '6332'.  Event ID 5013

A worker process '6332' serving application pool 'DefaultAppPool' failed to stop a listener channel for protocol 'http' in the allotted time.  The data field contains the error number. Event ID 5138

A process serving application pool 'DefaultAppPool' suffered a fatal communication error with the Windows Process Activation Service. The process id was '3748'. The data field contains the error number. Event ID 5011
I found my go-to source of information - the internet - rather unhelpful in diagnosing these errors.  Suggestions included 'Find the error in your code' and 'Try replacing this .dll file with an older version of itself'.  After finding more promising info, I tried changing various application pool settings with no success.  

At a dead end, I looked for a way to get a better picture of what was going on inside the worker processes.  You may need to install a role service, as I did, to enable this (IIS -> Health and Diagnostics -> Request Monitor).  Once it's there, from IIS Manager you can select your server (not any of the websites), open 'Worker Processes', and then open one of the listed worker processes to see the requests currently executing inside it.
Inside a normal, healthy worker process you should see very little - a few requests executing with nice short times (those are in milliseconds), or maybe even nothing going on at the moment.

This is not what you want to see.

Ok, so that's a problem. 
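The same check can be done mechanically once the request list is in hand. What follows is only a rough sketch in Python, not anything we ran on the client's server: the `ActiveRequest` record, the 60-second threshold, and the sample rows are all made up for illustration, and it assumes you have already copied the URL, client IP, and elapsed time out of the worker process view by whatever means you prefer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ActiveRequest:
    """One row from the worker process request list (hypothetical shape)."""
    url: str
    client_ip: str
    elapsed_ms: int


def hung_requests(requests, threshold_ms=60_000):
    """Return requests that have been executing for at least threshold_ms."""
    return [r for r in requests if r.elapsed_ms >= threshold_ms]


def group_by_prefix(requests, octets=2):
    """Group requests by the first `octets` octets of the client IP."""
    groups = defaultdict(list)
    for r in requests:
        prefix = ".".join(r.client_ip.split(".")[:octets])
        groups[prefix].append(r)
    return dict(groups)


# Made-up sample rows: one healthy request and two that never finished.
SAMPLE = [
    ActiveRequest("/index.cfm", "203.0.113.7", 85),
    ActiveRequest("/news.cfm", "183.60.0.1", 5_400_000),
    ActiveRequest("/about.cfm", "183.60.0.2", 7_200_000),
]

stuck = hung_requests(SAMPLE)
by_prefix = group_by_prefix(stuck)
for prefix, reqs in by_prefix.items():
    print(prefix, "-", len(reqs), "request(s) stuck")
```

Grouping by the first two octets is a blunt instrument, but it is enough to make a cluster of stuck requests from one network stand out.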
My first thought was that there was something wrong with the pages, but no.  My own requests for any of these pages went through the application pool and CF, were processed, and came back with a response - no problem.
My second thought was "Why are these all from similar IP addresses?"
A quick IP lookup revealed that the requests were coming from Guangzhou, China.  Odd traffic for an American site.  Armed with the IP address, I checked the IIS logs, where more of them showed up, helpfully self-identifying as 'Easouspider' requests.  There were several in the logs, and many of them had completed successfully - anything that wasn't a .cfm page was fine, and even some .cfm page requests had gone through.
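For anyone who wants to run the same log check, here is a minimal sketch in Python. It relies on two things about IIS's W3C extended log format: a `#Fields:` line names the columns, and spaces inside a value such as the user agent are written as `+`. The sample log lines, the dates, the addresses, and the exact user-agent string are invented for the example, and `crawler_hits` is just a name I picked.

```python
from collections import Counter


def parse_w3c_log(lines):
    """Yield one dict per request, keyed by the names on the #Fields: line."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split(" ")
            if len(values) == len(fields):  # skip malformed rows
                yield dict(zip(fields, values))


def crawler_hits(lines, agent_substring="easouspider"):
    """Count requests per (client IP, is .cfm page) for a given crawler."""
    hits = Counter()
    for row in parse_w3c_log(lines):
        agent = row.get("cs(User-Agent)", "").lower()
        if agent_substring in agent:
            is_cfm = row.get("cs-uri-stem", "").lower().endswith(".cfm")
            hits[(row.get("c-ip", "?"), is_cfm)] += 1
    return hits


# Invented sample lines, shaped like a W3C extended log.
SAMPLE_LOG = """\
#Fields: date time cs-method cs-uri-stem c-ip cs(User-Agent) sc-status time-taken
2014-07-20 10:00:01 GET /index.cfm 183.60.0.1 Mozilla/5.0+(compatible;+EasouSpider) 200 312
2014-07-20 10:00:02 GET /logo.png 183.60.0.1 Mozilla/5.0+(compatible;+EasouSpider) 200 15
2014-07-20 10:00:03 GET /index.cfm 203.0.113.9 Mozilla/5.0+(compatible;+Googlebot/2.1) 200 280
""".splitlines()

hits = crawler_hits(SAMPLE_LOG)
for (ip, is_cfm), count in sorted(hits.items()):
    print(ip, "cfm" if is_cfm else "other", count)
```

Keep in mind that IIS writes a log entry when a request finishes, so the requests that are stuck in the worker process are exactly the ones you should not expect to find here; the log only tells you who the visitor is.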
Easou.com bills itself as 'China's #1 mobile search engine', and they may well be - they're pretty big.  They're just crawling the web much like Google and Yahoo.  But for some reason, some requests from this particular crawler for .cfm pages on my client's server never came back out of the system.  They just sat in the application pools and nowhere else, gradually filling them up until something (usually errors) triggered application pool recycling.

Normal application pool recycling is pretty slick - IIS keeps the old pool running while it starts up the new one, and then does a handoff so the site doesn't drop, allowing all the requests in the old application pool to finish before fully transitioning to the new one.  That way your website doesn't appear to ever drop during a normal application pool transition.

That's where the truly diabolical part came in - these hung requests are tough.  Persistent.  They do not want to let go.  During a recycle, IIS would spend several minutes waiting for them to shut down so it could properly restart the application pool - and that's the period when the website would be dead, stuck in limbo with an old application pool that refused to die and unable to fully transition to the new one.
Now, finding that the requests are stuck doesn't tell us exactly why it's happening.  The prevailing theory in the office is that requests from China are perhaps sending some odd character that IIS or CF doesn't like as part of the request, but even if we could narrow it down to that, what could we do about it?  I can't change how IIS or CF handles these requests.
So I blocked all IPs beginning with 183.60 from making requests to the site.  Better to cut part of China off from the site than to let the website keep crashing.
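In network terms, "all IPs beginning with 183.60" is the range 183.60.0.0/16, a little over 65,000 addresses. The sketch below only illustrates that range check using Python's standard `ipaddress` module; it is not how the block was actually configured on the server, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# "Everything beginning with 183.60" expressed as a CIDR range.
BLOCKED_NETWORKS = [ipaddress.ip_network("183.60.0.0/16")]


def is_blocked(ip, networks=BLOCKED_NETWORKS):
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Illustrative addresses only.
for candidate in ("183.60.12.34", "183.61.0.1", "203.0.113.9"):
    print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```

Keeping the blocked ranges in a list makes it easy to add another network later if the crawler turns up from somewhere else.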
It worked.  The site stopped crashing and ran smooth as butter.  The client was happy, I was happy, and all was well with the world.
A week later, another client called us in.  Their website was crashing mysteriously.

I checked the worker processes first this time, and there it was again.  Hung requests from IPs starting with 183.60.  The Chinese Spider was here too.  I had previously assumed this was an issue limited to the first client - despite having been upgraded recently, theirs was an older code base that had seen a lot of modification over the years, and nobody else had been crashing like they were until now.  This other client was a recent build, but apparently susceptible to the same issue.
After that, I checked every client we had - if two separate sites could go down like this, then others might be vulnerable as well.

I was right.  I found the Chinese Spider's requests stuck in two more ColdFusion / Commonspot / IIS websites that we manage (several others got a clean bill of health, but we have nonetheless blocked the IP range on all our clients for now as a safety precaution).  In those two cases it appears that the crawler hadn't been hitting the sites too hard, so the application pool would recycle naturally before dying from too many hung requests.
If you manage a ColdFusion / IIS site, I'd recommend taking a quick peek inside those worker processes - if it was present in four of ours, I'd wager we're not the only folks experiencing this strange issue, even if it isn't bringing down your website yet.

Written by: Benjamin Bently
