OVO Tech Blog

Monitoring Step Function based workflows


Jonny Cundall


I work in the Orion Migrations team


aws lambda stepfunctions monitoring

Posted by Jonny Cundall on .
I'm an engineer on the Orion Migrations team. Our job as a team is to make the migration of users and all their data from a legacy system to the Orion platform as seamless as possible. This sounds simple at first but becomes more complicated the closer you look. Examples of the challenges we face:

  • Users need to be informed when they are migrated, which leads them to look at their accounts and often to contact the company - so migrating too many in a short period of time could overload our call centre.
  • The ‘shape’ of the data in the two systems differs - something which is mandatory in the new system may be missing in the old system.
  • If there is a problem during migration, we need to be able to resolve it so that the user is either restored in full to the old system or fixed in the new system.

AWS Step Functions are a powerful tool for certain use cases. If you have a batch or reactive process with many distinct steps, using several different AWS services with a mixture of default behaviour and custom code, they could be something to consider.

I’m going to share some of the things we’ve learnt about monitoring the work our system is doing, which should be of interest to anyone using Step Functions.

In our team, a few Step Functions, running various Lambdas, form the core of our tooling.

Here’s an automatically generated view of our step function based workflow:


I won’t explain everything about it, but here are a few key points:

  • Step Functions are finite state machines, and each step is a state with alternative transitions to the next state.
  • Most (but not all) of the steps/states invoke lambda functions.
  • The input into each step is the whole execution JSON object at that point, which includes the output of all the previous Lambdas.
  • At any point we need to be able to safely roll back in case of failure. So typically each step has two outputs, for success and failure respectively.
  • During the early steps this is trivial, but later on we have done things such as loading the users into the new system and creating direct debits, so the steps needed to roll back become more complicated.
  • A single run of this state machine migrates a single account to the Orion system. To migrate a batch of accounts we have another step function which triggers this one. Typically that runs many copies of this function concurrently up to a configured limit.
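For context, that fan-out in the batch step function can be expressed with a Map state with a concurrency cap, roughly like this (a sketch, not our actual definition; the state names, ARN, item field and limit are all illustrative):

```json
"MigrateAccounts": {
  "Type": "Map",
  "ItemsPath": "$.accounts",
  "MaxConcurrency": 100,
  "Iterator": {
    "StartAt": "MigrateAccount",
    "States": {
      "MigrateAccount": {
        "Type": "Task",
        "Resource": "arn:aws:states:::states:startExecution.sync",
        "Parameters": {
          "StateMachineArn": "arn:aws:states:eu-west-2:1234123411:stateMachine:MigrateAccount",
          "Input": {
            "accountId.$": "$.accountId"
          }
        },
        "End": true
      }
    }
  },
  "End": true
}
```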

How do we monitor this stuff?

One feature of the AWS web console for Step Functions is that we can look at an execution of the function and drill down into the various steps to see what has happened, at least to the level of the steps' outputs and any errors which are detected at that level. But that last qualification, at that level, is a key caveat. When we want to see things at a finer level of detail we’ll need to see the logs for the Lambdas.

The trouble with looking at the Lambda logs is that there are a lot of them. We could be running more than 100 migrations concurrently and these are all pouring messages into the same log groups. Finding the needles of information about a specific migration we are investigating in the haystack of logs is going to be difficult.

We have ways of aggregating and filtering logs (we use Datadog, which I won’t go into detail about; other tools will do this just as well). But to make this work we need to tie every log to an execution. There needs to be a trace token - some unique identifier for each execution - which we can feed into each Lambda’s logging code so that each log carries the token as metadata.
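As a sketch of what that looks like in a Python Lambda (the logger name and JSON format here are illustrative, not our actual Datadog setup), a logging filter can stamp a trace token onto every record:

```python
import logging

class TraceTokenFilter(logging.Filter):
    """Stamp a per-execution trace token onto every log record."""

    def __init__(self, trace_token):
        super().__init__()
        self.trace_token = trace_token

    def filter(self, record):
        record.trace_token = self.trace_token
        return True  # never suppress the record, just annotate it

def make_logger(trace_token):
    logger = logging.getLogger("migration")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    # Emit the token as structured metadata so a log aggregator can facet on it
    handler.setFormatter(logging.Formatter(
        '{"message": "%(message)s", "trace_token": "%(trace_token)s"}'
    ))
    logger.addHandler(handler)
    logger.addFilter(TraceTokenFilter(trace_token))
    return logger
```

With a logger built this way, every call like `make_logger(token).info("extracted account")` carries the token, so the aggregator can pull out all the logs for one migration.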

Execution Context

We need to start with a unique identifier for each run of the step function. Luckily there is something built in: each run of the step function has a unique execution name, and AWS will refuse any name that has been used before. So if we can get hold of this in each Lambda, we can feed it into each Lambda's logging logic and every log can include it.

While it is running, the step function has an execution context. This contains various bits of metadata, such as when the step function started, which step it’s currently executing, and also the unique execution name that we want to use.

Getting the execution context into the Lambdas is a little tricky. A Lambda has a single input - usually referred to as event in the Lambda code. For the Lambdas in a step function, the event is a big JSON object including the outputs of all the previous steps. Our process is designed around being able to use information from those previous steps to make later decisions. There is a way to get the execution context information into the steps, but it is problematic to fit it into the way the Lambdas are set up.

Take an existing step calling a Lambda:

"Extract": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-2:1234123412:function:mylambda:master",
      "Next": "Validate",
      "ResultPath": "$.extract"
}

I can feed information into the Lambda using the Parameters field:

"Extract": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-2:1234123412:function:mylambda:master",
      "Next": "Validate",
      "Parameters": {
        "Execution.$": "$$.Execution.Name"
      },
      "ResultPath": "$.extract"
}

Let’s have a little look at how that’s working:

On the left side of an assignment, the .$ suffix on the key means “there’s going to be some substitution going on here”.

On the right side of the assignment, a single dollar refers to the step function execution object. A double dollar refers to the step function execution context.

So this works, but now the input to our Lambda is just the execution name, and we’ve lost the step function object we had before. So let’s add that back in:

"Extract": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-2:1234123412:function:mylambda:master",
      "Next": "Validate",
      "Parameters": {
        "Execution.$": "$$.Execution.Name",
        "Data.$": "$"
      },
      "ResultPath": "$.extract"
}


So now we have all the data we had before, plus the execution name. But this isn’t the only change we have to make: our Lambda was expecting the data in a certain shape, and it is now nested one level deeper in the JSON. It’s a small change, but an annoying one. Furthermore, we would have to make the same change in every step of the step function, and in every Lambda - not only a lot of change, but also a lot more noise in the step function code, making it harder to read.
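For example, a Python Lambda that used to read fields straight off the event would now have to unwrap them (a sketch; accountId is a hypothetical output of an earlier step):

```python
def handler(event, context):
    # The execution name now arrives at the top level of the event...
    execution_name = event["Execution"]
    # ...while the original step function object is nested under "Data"
    data = event["Data"]
    account_id = data["accountId"]  # hypothetical field from a previous step
    return {"accountId": account_id, "execution": execution_name}
```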

One cool trick

Looking at the solution above, it’s clear that it leaves something to be desired. I was left with the feeling that there must be a better way. I was looking for a way to inject this execution context information into the execution object. That way any Lambda could access it without changing the way it accesses the execution object data. Surely there must be a way to pass parameters in, like we’ve already done, and just output it like we do at the end of each Lambda.

Well it turns out … there is! There is something called a Pass step, which simply takes its input and spits it out again. The documentation shows this as a convenient way of passing in some fixed data, but crucially the Pass step also accepts the same Parameters field. So we can construct a step like this:

(I’ve shown the whole state machine here, in case you want to copy this and test it yourself)

{
    "Comment": "A Pass step",
    "StartAt": "GetContext",
    "States": {
      "GetContext": {
        "Type": "Pass",
        "Parameters": {
          "context.$": "$$"
        },
        "ResultPath": "$.executionContext",
        "End": true
      }
    }
}


Which gives us an output like this:

{
  "Comment": "my input",
  "executionContext": {
    "context": {
      "Execution": {
        "Id": "arn:aws:states:eu-west-2:1234123411:execution:MyStateMachine:aebe52d2-877d-8844-c4f9-336fde6a4355",
        "Input": {
          "Comment": "my input"
        },
        "Name": "aebe52d2-877d-8844-c4f9-336fde6a4355",
        "RoleArn": "arn:aws:iam::1234123411:role/step-role",
        "StartTime": "2020-05-23T13:43:18.042Z"
      },
      "StateMachine": {
        "Id": "arn:aws:states:eu-west-2:1234123411:stateMachine:MyStateMachine",
        "Name": "MyStateMachine"
      },
      "State": {
        "Name": "GetContext",
        "EnteredTime": "2020-05-23T13:43:18.072Z"
      }
    }
  }
}

So now the whole execution context is available to any Lambda in the step function.

There’s a bit more information here than we are going to need, so let’s trim it down to what we are interested in - for us that includes the start time of the execution, which we use for some custom metrics:

{
    "Comment": "A Pass step",
    "StartAt": "GetContext",
    "States": {
      "GetContext": {
        "Type": "Pass",
        "Parameters": {
          "startTime.$": "$$.Execution.StartTime",
          "name.$": "$$.Execution.Name"
        },
        "ResultPath": "$.executionContext",
        "End": true
      }
    }
}

So this step is something I can put at the start of any step function I want to monitor. Then we just need to tell our logging library to read event.executionContext.name, and we can trace each execution run by its unique identifier.
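In Python, that lookup might be as simple as the following sketch (the fallback value and logger wiring are assumptions for illustration, not our actual library):

```python
import logging

def get_execution_name(event):
    """Pull the trace token injected by the GetContext Pass step."""
    # Fall back to a placeholder so a Lambda invoked outside the
    # step function (e.g. in a local test) still logs something sensible
    return event.get("executionContext", {}).get("name", "unknown")

def handler(event, context):
    # Bind the execution name to every log line this invocation emits
    log = logging.LoggerAdapter(
        logging.getLogger("migration"),
        {"execution_name": get_execution_name(event)},
    )
    log.info("starting step")
```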

I think this is a neat little way of accessing the useful data in the execution context so that each Lambda is not quite so ignorant of the process it’s being called in.











