Handling timeouts when trying to start a workflow?

Feb 2, 2015 at 6:19 PM
We have a situation in production when we try to start a workflow through SPServices. The first person to start the workflow each day after the nightly app pool recycle gets some kind of timeout (either from the browser or SP, not sure), and we can't replicate the issue locally, so we have no response to analyze and handle cleanly. I am all but positive the issue is that the production farm is highly taxed, and when the workflow is built/compiled into memory for the first time in the day it takes just a bit longer than some timeout value allows. Since our code in the completefunc is executed, I am pretty sure the timeout is server side, though.

Ideally what we would like to do is catch the timeout and then try one more time before giving up and alerting the user that the code has gone south.

Can anyone tell us what we can look for in the completefunc to let us know that a server-side timeout occurred? I am sure it's going to be something simple, but I am struggling with intentionally causing a workflow to time out in a test environment.
Coordinator
Feb 4, 2015 at 2:12 PM
You should switch to using promises rather than a synchronous call using completefunc.

M.
Feb 4, 2015 at 3:47 PM
A promise wouldn't help in this scenario since the server is returning a timeout message. I would still have to handle when the server kills the thread and returns back the message that the workflow wasn't started. Also, this is the last thing the form does before closing itself and redirecting the browser, so being synchronous is actually nice.

I have cobbled some code together that sort of replicates the issue (I think) by setting the HTTP timeout on my web app to 30 seconds and doing an IIS reset before I kick off the workflow. The really frustrating part is that SharePoint returns an error page with an HTTP status code of 200 that contains HTML markup instead of XML. This trips up SPServices into thinking the workflow started but causes a parsererror result.

I have some additional tracing code going into production soon to track this issue to see if I can provide the community some additional details once I can grok what is actually happening under the covers. I am not confident my 'replication' of the issue is an apples to apples replication, but if it is then I can provide some further insight for future consumption.

BTW - this is SP2010 Enterprise.
Coordinator
Feb 4, 2015 at 4:06 PM
If you're doing a nightly app pool recycle, you might want to run a warm up script so that the first user hit is more "normal". That first hit is just waiting for everything to spin up.

M.
Feb 4, 2015 at 4:29 PM
There is another warmup that happens when the workflow is first started after an app pool recycle. This warmup is outside of the 'normal' SP warmup and only happens when the workflow is started for the first time after the recycle. It may be because we primarily use reusable workflows deployed from a WSP but that behavior has always been pretty consistent in my experience.
Feb 16, 2015 at 6:04 PM
I have confirmed that the server is throwing a timeout error message. What actually happens is SharePoint hands the SPServices call to workflow.asmx the generic error page with a "Request timed out" message. This causes SPServices to encounter a parsererror when it tries to parse the HTML response stream into XML. Rather annoyingly, the correlation ID that is part of the response stream is for the rendering of the error page, not the correlation ID of the start workflow request call.

If we start the workflow from the native SP UI it takes approximately 3:30 to start the first time. Every subsequent StartWorkflow() call that is made returns in ~1 second. There is definitely some kind of warmup the workflow has to go through, presumably this is when the XOML is compiled into its temporary assembly (I seem to remember hearing once that is what SharePoint does in the background...). When the app pools are recycled the temporary XOML assembly is lost and has to be rebuilt the first time the workflow is executed.

I am currently looking for ways to speed up the XOML compilation process (increasing the HTTP timeout via web.config isn't an option). If anyone has any other ideas I would gladly welcome them.
Feb 18, 2015 at 11:55 PM
Re: "when it tries to parse the HTML response stream into XML"

I doubt it's SPServices throwing the parse error. Can you post your code? Specifically, how you are handling the response (normally in the completefunc).
If the response never gets to your callback, then I bet it's jQuery. SPServices sets a value that tells jQuery to handle the response as an XML message.

If it does reach your callback, you might have to do a few checks on the response to determine if it is XML or not.
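
A rough sketch of what such a check might look like, assuming the callback receives the jqXHR object as its first argument so that the raw body is available as responseText. The helper name classifyResponse and the three outcome strings are made up for illustration, and the "Request timed out" match is based only on the error page text reported earlier in this thread:

```javascript
// Hypothetical helper (not part of SPServices): decide whether the body handed
// to completefunc is a SOAP response or SharePoint's HTML error page.
function classifyResponse(status, responseText) {
  var text = (responseText || "").replace(/^\s+/, "");
  // A real web service reply contains a SOAP envelope; the generic
  // SharePoint error page is plain HTML and does not.
  var isSoap = /<(\w+:)?Envelope[\s>]/.test(text);
  if (isSoap && status === "success") {
    return "ok";
  }
  // Assumption: the error page contains the literal text "Request timed out".
  if (/Request timed out/i.test(text)) {
    return "timeout";
  }
  return "error";
}

// Possible use inside the callback:
// completefunc: function (xData, Status) {
//   var outcome = classifyResponse(Status, xData.responseText);
// }
```

Checking the text rather than the parsed XML avoids depending on whether jQuery managed to parse the body.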

All of this will not solve the root of your problem (sounds like you know what it is), but hopefully you can show the user some sort of failure notice.

Paul
Feb 19, 2015 at 1:18 AM
You're correct, the parser error is most likely coming from jQuery. The Status code does give us something useful to key off, though.
    completefunc: function (data, Status) {
        // Status is the jQuery text status, e.g. "success" or "parsererror"
        if (Status === "success") {
            // close the form
        } else {
            // handle error
        }
    }
As for the core issue (the web service taking so long to start the workflow), I think the issue is actually Workflow.asmx. As a test, I moved the start workflow call into an event receiver using the SPOM, and the workflow starts up very quickly each time. Watching the ULS logs, I can see there is a roughly 120 second gap between the start of the XOML compile and the thread being aborted by IIS due to the timeout.
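
For the "try one more time before giving up" idea from the original post, a minimal sketch could look like the following. The names startWithOneRetry, startFn, onSuccess and onFailure are hypothetical; startFn is assumed to wrap the SPServices StartWorkflow call and to pass its argument along as the completefunc. The sketch does not check whether the first attempt actually started the workflow on the server, so it is only safe if a failed Status really means nothing was started:

```javascript
// Hypothetical wrapper: run startFn, and retry exactly once if the first
// attempt does not report "success".
// startFn(done) is assumed to call done(xData, Status) once per attempt.
function startWithOneRetry(startFn, onSuccess, onFailure) {
  var attempts = 0;
  function attempt() {
    attempts += 1;
    startFn(function (xData, Status) {
      if (Status === "success") {
        onSuccess(xData, attempts);
      } else if (attempts < 2) {
        attempt(); // first failure: try one more time
      } else {
        onFailure(xData, Status, attempts); // second failure: give up
      }
    });
  }
  attempt();
}
```

Because the callback is injected, the same wrapper works whether the underlying call is synchronous or asynchronous.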