Fixing the Stack Overflow Bug in Camunda BPM

Today I spent the better part of a great sunny Saturday on something pretty much useless. But it was a lot of fun.

Usually, I try to do useful stuff. How do I know whether something is useful? I listen to our users and customers. We get a lot of feedback through different channels: enterprise support, the community forums, Twitter, consultants returning from workshops, … many people use our product and have something to say about their experience.

Listening to users and working on the things they really need is an important lesson every product team has to learn. At Camunda we have institutionalized a sort of “culture of listening” as opposed to a “culture of guessing”. We do not work on the features we guess are useful. Instead, we try to derive “usefulness” empirically.

While this sounds really nice and all, this kind of customer-centric culture can also be frustrating.

So this Saturday I indulged in something that has been bugging me for some time now but unfortunately very few of our users seem to care much about: I fixed the Stack Overflow Bug in Camunda BPM Platform.

What is the Stack Overflow Bug?

Consider the following process:

The process contains a loop which performs about 500 iterations in a single unit of work (without asynchronous continuations). Executing this process failed in Camunda: the process engine executed all the steps recursively, growing the stack until the maximum stack size was reached. At that point it used to fail with a StackOverflowError:

java.lang.StackOverflowError
    at org.camunda.bpm.engine.impl.pvm.runtime.AtomicOperationActivityExecute.execute(AtomicOperationActivityExecute.java:44)
    at org.camunda.bpm.engine.impl.interceptor.CommandContext.performOperation(CommandContext.java:93)
    at org.camunda.bpm.engine.impl.persistence.entity.ExecutionEntity.performOperationSync(ExecutionEntity.java:728)
    at org.camunda.bpm.engine.impl.persistence.entity.ExecutionEntity.performOperation(ExecutionEntity.java:719)
    at org.camunda.bpm.engine.impl.pvm.runtime.AtomicOperationTransitionNotifyListenerStart.eventNotificationsCompleted(AtomicOperationTransitionNotifyListenerStart.java:63)
    at org.camunda.bpm.engine.impl.pvm.runtime.AbstractEventAtomicOperation.execute(AbstractEventAtomicOperation.java:63)
    at org.camunda.bpm.engine.impl.interceptor.CommandContext.performOperation(CommandContext.java:93)
    at org.camunda.bpm.engine.impl.persistence.entity.ExecutionEntity.performOperationSync(ExecutionEntity.java:728)
    at org.camunda.bpm.engine.impl.pvm.runtime.AbstractEventAtomicOperation.execute(AbstractEventAtomicOperation.java:56)
    at org.camunda.bpm.engine.impl.interceptor.CommandContext.performOperation(CommandContext.java:93)
    at org.camunda.bpm.engine.impl.persistence.entity.ExecutionEntity.performOperationSync(ExecutionEntity.java:728)
    at org.camunda.bpm.engine.impl.pvm.runtime.AbstractEventAtomicOperation.execute(AbstractEventAtomicOperation.java:56)
    at org.camunda.bpm.engine.impl.interceptor.CommandContext.performOperation(CommandContext.java:93)
    at org.camunda.bpm.engine.impl.persistence.entity.ExecutionEntity.performOperationSync(ExecutionEntity.java:728)
    at org.camunda.bpm.engine.impl.persistence.entity.ExecutionEntity.performOperation(ExecutionEntity.java:719)
    at org.camunda.bpm.engine.impl.pvm.runtime.AtomicOperationTransitionCreateScope.execute(AtomicOperationTransition
    ...

(This does not mean that a loop cannot have 500 iterations or that you cannot have 500 tasks in a process. It just means that if the 500 iterations are performed in a single unit of work, without an intermediary save point, the stack overflow occurs. The whole point of a workflow engine is save points and wait states, so it comes as little surprise that this limitation is not practically relevant.)
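To make the mechanism concrete, here is a minimal sketch (hypothetical, not Camunda's actual classes) of why recursive dispatch overflows: each atomic operation invokes the next one with a direct method call, so every step executed in one unit of work adds frames to the Java call stack.

```java
// Hypothetical sketch of the problem, not Camunda's real implementation:
// each step triggers the next one via a direct recursive call, so every
// activity executed in one unit of work grows the Java call stack.
public class RecursiveEngine {

    // Executes 'remainingSteps' activities in one unit of work, recursively.
    // Returns the number of steps executed.
    static int performOperation(int remainingSteps) {
        if (remainingSteps == 0) {
            return 0; // unit of work complete
        }
        // "execute the activity", then recurse into the next operation
        return 1 + performOperation(remainingSteps - 1);
    }

    public static void main(String[] args) {
        System.out.println(performOperation(500));
        // A toy frame is tiny, so 500 still fits here; but deep enough
        // chains throw java.lang.StackOverflowError either way:
        // performOperation(10_000_000);
    }
}
```

In the real engine each activity costs several frames (as the trace above shows: `performOperation`, `performOperationSync`, `execute`, and so on), which is why the default JVM stack fills up after a few hundred activities rather than tens of thousands.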

But as a process engine developer, this bugged me, of course. If you know that there is this theoretical limitation, you just keep thinking about it, even if you know it has no practical relevance. Part of me always wanted to fix this bug while the other part was like: “let's wait until somebody actually reports this”.

A Bug not worth fixing

I waited, and waited, but nobody asked for a fix…

None of our enterprise customers ever reported this problem in support. In the community forums, if you search for “stack overflow”, out of over 1700 topics, one guy has this problem.

Then I talked to Bernd Rücker. Bernd is CEO at Camunda and has over 10 years of hardcore, in-the-trenches BPM and workflow consulting under his belt. He has talked to every person and project doing workflow in the German-speaking world and many of the rest. My hope was that Bernd would know. He would understand me. Surely someday somebody would be affected by this bug, and it would be terrible, and Bernd would be farsighted enough to know that, and he would say: “Daniel, you are right! We have to fix this. Stop everything else, put the whole team on it, nobody goes home until this is fixed.” But unfortunately, this is not what he said. When I talked to him, his answer was: “Ah, that thing again. Yeah… nobody cares about that.”

So it really looks like this is not worth fixing. Let’s put it to rest, once and for all.

So I fixed it anyway

My wife works Saturdays. I am home and usually I work too. Today though, the weather was nice and I just didn’t feel like doing something useful. Hey, what about this Stack Overflow Bug? I had tried a couple of times in the past but never really finished it. After two cups of coffee and having opened and closed the fridge 20 times without taking anything out, I thought, let’s give it a shot.

When fixing something like this, it is important not to change anything big. After all, this bugfix has no value to 99% of our users, and even small internal changes can cause them headaches. I definitely don’t want something like this to be the cause of frustration for anybody, when all this really is, is a vanity project. This is why the solution has to change the order in which things happen in the process engine as little as possible.

Turns out, it was not as complicated as I initially thought. The idea is to break the recursion before and after activities. This way, most of the behavior is maintained as is, but the stack no longer grows without bound. Based on this, it is also easy to ensure that activities are still executed in the same order (depth-first).
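The shape of such a fix can be sketched as a trampoline (a simplified illustration of the idea, not the actual pull request): instead of each operation invoking its follow-ups recursively, operations are pushed onto an explicit LIFO stack and drained in a loop. The LIFO order preserves depth-first execution, while the Java call stack stays flat.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Simplified trampoline sketch (not Camunda's actual code): an explicit
// operation stack replaces the recursive calls, so the Java call stack
// no longer grows with the number of activities in one unit of work.
public class TrampolineEngine {

    interface AtomicOperation {
        // Returns follow-up operations instead of invoking them directly.
        List<AtomicOperation> execute();
    }

    static int performOperations(AtomicOperation initial) {
        Deque<AtomicOperation> pending = new ArrayDeque<>();
        pending.push(initial);
        int executed = 0;
        while (!pending.isEmpty()) {
            AtomicOperation op = pending.pop();
            executed++;
            List<AtomicOperation> next = op.execute();
            // Push in reverse so the first follow-up runs first (depth-first).
            for (int i = next.size() - 1; i >= 0; i--) {
                pending.push(next.get(i));
            }
        }
        return executed;
    }

    // A chain of n no-op activities. Recursion of this depth would overflow;
    // the trampoline drains it with constant Java stack usage.
    static AtomicOperation chain(int n) {
        return () -> n == 0 ? List.of() : List.of(chain(n - 1));
    }

    public static void main(String[] args) {
        System.out.println(performOperations(chain(1_000_000)));
    }
}
```

The key design point is that an operation *describes* what comes next instead of *calling* it, which is exactly the kind of ordering-preserving, minimally invasive change the situation calls for.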

This is the result.

And I threw in something useful as well

After having fixed the bug, I thought: let’s do something useful after all. There was some feedback that when an error occurs, it is hard to recognize from the process engine’s Java stack trace in which activity the error occurred. So, given that I was dealing with these stacks already, why not try to format a decent “BPMN-oriented Error Stack Trace”:

This pull request proposes a stack trace which hopefully allows users to better locate the source of exceptions in the process:

BPMN Stack Trace:
  callActivity (activity-execute, [415](ScopeExecution), pa=invoice)
  callActivity
    ^
    |
  ExclusiveGateway_1
    ^
    |
  ServiceTask_2
    ^
    |
  ServiceTask_1
    ^
    |
  StartEvent_1

It probably needs a little more love, but one can already see in which activity the error occurred and which activities were executed beforehand.
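Rendering such a trace is straightforward once the engine records the activities it has passed through. Here is a hypothetical sketch (assumed helper, not the actual formatter from the pull request) that prints the executed activity ids newest-first, connected by upward arrows like the trace above:

```java
import java.util.List;

// Hypothetical formatter sketch, not Camunda's actual code: renders the
// executed activity ids with the most recent activity on top, oldest at
// the bottom, mirroring the BPMN stack trace layout shown above.
public class BpmnStackTraceFormatter {

    static String format(List<String> executedActivityIds) {
        StringBuilder sb = new StringBuilder("BPMN Stack Trace:\n");
        // Walk the history backwards: last executed activity first.
        for (int i = executedActivityIds.size() - 1; i >= 0; i--) {
            sb.append("  ").append(executedActivityIds.get(i)).append('\n');
            if (i > 0) {
                sb.append("    ^\n    |\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(format(List.of("StartEvent_1", "ServiceTask_1", "ServiceTask_2")));
    }
}
```

The real trace additionally annotates the failing activity with execution details (operation, execution id, process application), which this sketch leaves out.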

I also added the possibility to format a verbose stack trace which lists every atomic operation that was executed. Again, this is probably useless to our users, but I am sure my colleague Thorben will love it when he is hunting the next bugs :)

So there you go. Saturdays are meant for fun and useless things. And those mean different things to different people.