
2 posts tagged with "lambda"


Michael Hart · 13 min read

This post is about how to build an AWS Step Functions state machine and how you can use it to interact with IoT edge devices. In this case, we are sending a smoothie order to a "robot" and waiting for it to make that smoothie.

The state machine works by chaining together a series of Lambda functions and defining how data should be passed between them (if you're not sure about Lambda functions, take a look at this blog post!). There's also a step where the state machine needs to wait for the smoothie to be made, which is slightly more complicated - we'll cover that later in this post.

This post is also available in video form - check the video link below if you want to follow along!

AWS Step Functions Service

AWS Step Functions is an AWS service that allows users to build serverless workflows. Serverless came up in my post on Lambda functions - it means that you can run applications in the cloud without provisioning any servers or constantly-running resources. That in turn means you only pay for the time that something is executing in the cloud, which is often much cheaper than provisioning a server, but with the same performance.

To demonstrate Step Functions, we're building a state machine that accepts smoothie orders from customers and sends them to an available robot to make that smoothie. Our state machine will look for an available robot, send it the order, and wait for the order to complete. The state machine will be built in AWS Step Functions, which we can access using the console.

State Machine Visual Representation

First, we'll look at the finished state machine to get an idea of how it works. Clicking the edit button within the state machine will open the workflow Design tab for a visual representation of the state machine:

Visual representation of Step Functions State Machine

Each box in the diagram is a stage of the Step Functions state machine. Most of the stages are Lambda functions, which are configured to interface with AWS resources. For example, the first stage (GetRobot) scans a DynamoDB table for the first robot with the ONLINE status, meaning that it is ready for work.

If at least one robot is available, GetRobot will pass its name to the next stage - SetRobotWorking. This function updates that robot's entry in the DynamoDB table to WORKING, so future invocations don't try to give that robot another smoothie order.

From there, the robot name is again passed on to TellRobotOrder, which is responsible for sending an MQTT message via AWS IoT Core to tell the robot its new smoothie order. This is where the state machine gets slightly more complicated - we need the state machine to pause and wait for the smoothie to be made.

Activities

While we're waiting for the smoothie to be made, we could have the Lambda function wait for a response, but we would be paying for the entire time that function is sitting and waiting. If the smoothie takes 5 minutes to complete, that could be over 6000x the price of a short-lived invocation!

Instead, we can use the task token ("wait for a callback") pattern in Step Functions to allow the state machine to wait at no extra cost. The system follows this setup:

IoT Rule to Robot Diagram

When the state machine sends the smoothie order to the robot, it includes a generated task token. The robot then makes the smoothie and, when it is finished, publishes a message saying it was successful, including that same task token. An IoT Rule forwards that message to another Lambda function, which tells the state machine that the task was a success. Finally, the state machine updates the robot's status back to ONLINE, so it can receive more orders, and the execution completes successfully.
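
The repository's Send Task Success function is written in Rust, but at its heart it is a single Step Functions API call. Here is a rough TypeScript sketch of the same idea - the event field names are assumptions for illustration, not the repository's actual code:

import { SFNClient, SendTaskSuccessCommand } from "@aws-sdk/client-sfn";

const sfn = new SFNClient({});

// Invoked by the IoT Rule with the contents of the robot's MQTT success message.
export const handler = async (event: { TaskToken: string; RobotName: string }) => {
  // Tell the waiting state machine that the smoothie order succeeded.
  await sfn.send(new SendTaskSuccessCommand({
    taskToken: event.TaskToken,
    output: JSON.stringify({ RobotName: event.RobotName }),
  }));
};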

Why go through Lambda and IoT Core?

The robot could directly call the Task Success API, but we would need to give it permission to do so - as well as a direct internet connection. This version of the system means that the robot only ever communicates using MQTT messages via AWS IoT Core. Check out my video on AWS IoT Core to see how to set this up.

Testing the Smoothie State Machine

To test the state machine, we start with a table with two robots, both with ONLINE status. If you follow the setup instructions in the README, your table will have these entries:

Robots with ONLINE state

Successful Execution

If we now request any kind of smoothie using the test_stepfunction.sh script, we start an execution of the state machine. It will find that Robot1 is free to take the order and update its status to WORKING:

Robot1 with WORKING state

Then it will send an MQTT message requesting the smoothie. After a few seconds, the mock robot script will respond with a success message. We can see this in the MQTT test client:

MQTT Test Client showing order and success messages

This allows the state machine to finish its execution successfully:

Successful step function execution

If we click on the execution, we can see the successful path lit up in green:

State machine diagram with successful states

Smoothie Complete!

We've made our first fake smoothie! Now we should make sure we can handle errors that happen during smoothie making.

Robot Issue during Execution

What happens if there is an issue with the robot? Here we can use error handling in Step Functions. We define a timeout on the smoothie making task, and if that timeout is reached before the task is successful, we catch the error - in this case, we update the robot's state to BROKEN and fail that state machine's execution.

To test this, we can kill the mock robot script, which simulates all robots being offline. In this case, running the test_stepfunction.sh script will request the smoothie from Robot1, but will then time out after 10 seconds. This then updates the robot's state to BROKEN, ensuring that future executions do not request smoothies from Robot1.

Robot Status shown as BROKEN

The overall state execution also fails, allowing us to alert the customer of the failure:

Execution fails from time out

We can also see what happened to cause the failure by clicking on the execution and scrolling to the diagram:

State Machine diagram of timeout failure

Another execution will have the same effect for Robot2, leaving us with no available robots.

No Available Robots

If we never add robots into the table, or all of our robots are BROKEN or WORKING, we won't have a robot to make a smoothie order. That means our state machine will fail at the first step - getting an available robot:

State Machine diagram with no robots available

That's our state machine defined and tested. In the next section, we'll take a look at how it's built.

Building a State Machine

To build the Step Functions state machine, we have a few options, but I would recommend using CDK for the definition and the visual designer in the console for prototyping. If you're not sure what the benefits of using CDK are, I invite you to watch my video where I discuss how to use CDK with SiteWise:

The workflow goes something like this:

  1. Make a base state machine with functions and AWS resources using CDK
  2. Use the visual designer to prototype and build the stages of the state machine up further
  3. Define the stages back in the CDK code to make the state machine reproducible and recover from any breaking changes made in the previous step

Once complete, you should be able to deploy the CDK stack to any AWS account and have a fully working serverless application! To make this step simpler, I've uploaded my CDK code to a GitHub repository. Setup instructions are in the README, so I'll leave them out of this post. Instead, we'll break down some of the code in the repository to see how it forms the full application.

CDK Stack

This time, I've split the CDK stack into multiple files to make the dependencies and interactions clearer. In this case, the main stack is at lib/cdk-stack.ts, and refers to the four components:

  1. RobotTable - the DynamoDB table containing robot names and statuses
  2. Functions - the Lambda functions with the application logic, used to interact with other AWS services
  3. IoTRules - the IoT Rule used to forward the MQTT message from a successful smoothie order back to the Step Function
  4. SmoothieOrderHandler - the definition of the state machine itself, referring to the Lambda functions in the Functions construct

We can take a look at each of these in turn to understand how they work.

RobotTable

This construct is simple; it defines a DynamoDB table where the name of the robot is the primary key. The table will be filled by a script after stack deployment, so this is all the construct needs to define. Once filled, the table will have the same contents as shown in the testing section.
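
As a rough idea of what that looks like in CDK, a minimal sketch might be something like the following (the construct and attribute names here are illustrative, not necessarily the repository's exact code):

import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import { Construct } from 'constructs';

export class RobotTable extends Construct {
  public readonly table: dynamodb.Table;

  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The robot name is the partition key; the status is a plain attribute.
    this.table = new dynamodb.Table(this, 'Robots', {
      partitionKey: { name: 'Name', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });
  }
}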

Functions

This construct defines four Lambda functions. All four are written using Rust to minimize the execution time - the benefits are discussed more in my blog post on Lambda functions. Each handler function is responsible for one small task to show how the state machine can pass data around.

Combining Functions

We could simplify the state machine by combining functions together, or using Step Functions to call AWS services directly. I'll leave it to you to figure out how to simplify the state machine!

The functions are as follows, with a rough sketch of how one of them could be defined in CDK after the list:

  1. Get Available Robot - scans the DynamoDB table to find the first robot with ONLINE status. Requires the table name as an environment variable, and permission to read the table.
  2. Update Status - updates the given robot's entry in the DynamoDB table to the given status. Also requires the table name as an environment variable, and permission to write to the table.
  3. Send MQTT - sends a smoothie order to the given robot name. Requires IoT data permissions to connect to IoT Core and publish a message.
  4. Send Task Success - called by an IoT Rule when a robot publishes that it has successfully finished a smoothie. Requires permission to send the task success message to the state machine; since that permission can only be granted after the state machine has been defined, it is added in a separate function.
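
As promised, here is a sketch of what the Update Status function definition could look like in CDK. The runtime, handler, and asset path are placeholders (the repository builds its functions from Rust), so treat this as an illustration rather than the repository's exact code:

import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import { Construct } from 'constructs';

export function buildUpdateStatusFunction(scope: Construct, table: dynamodb.Table): lambda.Function {
  const fn = new lambda.Function(scope, 'UpdateStatus', {
    runtime: lambda.Runtime.PROVIDED_AL2023, // custom runtime for a compiled Rust binary
    handler: 'bootstrap',
    code: lambda.Code.fromAsset('lambda/update_status'), // placeholder asset path
    environment: {
      TABLE_NAME: table.tableName, // table name passed as an environment variable
    },
  });
  table.grantWriteData(fn); // permission to write to the table
  return fn;
}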

IoT Rules

This construct defines an IoT Rule that listens on the topic filter robots/+/success for any messages, then pulls out the contents of the MQTT message and calls the Send Task Success Lambda function. The only additional permission it needs is to invoke that Lambda function.
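
In CDK, that can be expressed as a topic rule plus an invoke permission for IoT Core - roughly like the sketch below (names are illustrative, not the repository's exact code):

import * as iot from 'aws-cdk-lib/aws-iot';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

export function buildSuccessRule(scope: Construct, sendTaskSuccessFn: lambda.IFunction) {
  // Forward every message published on robots/+/success to the Send Task Success function.
  const rule = new iot.CfnTopicRule(scope, 'SmoothieSuccessRule', {
    topicRulePayload: {
      sql: "SELECT * FROM 'robots/+/success'",
      actions: [{ lambda: { functionArn: sendTaskSuccessFn.functionArn } }],
    },
  });

  // Allow AWS IoT Core to invoke that function.
  sendTaskSuccessFn.addPermission('AllowIotInvoke', {
    principal: new iam.ServicePrincipal('iot.amazonaws.com'),
    sourceArn: rule.attrArn,
  });
}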

Smoothie Order Handler

This construct pulls all the Lambda functions together into our state machine. Each stage corresponds to one of the stages in the State Machine Visual Representation section.

The actual state machine is defined as a chain of functions:

const orderDef =
  getAvailableRobot
    .next(setRobotWorking)
    .next(tellRobotOrder
      .addCatch(setRobotBroken.next(finishFailure),
      {
        errors: [step.Errors.TIMEOUT],
        resultPath: step.JsonPath.DISCARD,
      })
    )
    .next(setRobotFinished)
    .next(finishSuccess);

Defining each stage as a constant, then chaining them together, allows us to see the logic of the state machine more easily. However, it does hide the information that is being passed between stages - Step Functions will store metadata while executing and pass the output of one function to the next. We don't always want to pass the output of one function directly to another, so we define how to modify the data for each stage.

For example, the Get Robot function looks up a robot name, so the entire output payload should be saved for the next function:

const getAvailableRobot = new steptasks.LambdaInvoke(this, 'GetRobot', {
  lambdaFunction: functions.getAvailableRobotFunction,
  outputPath: "$.Payload",
});

However, the Set Robot Working stage does not produce any relevant output for future stages, so its output can be discarded. Also, it needs a new Status field defined for the function to work, so the payload is defined in the stage. To set one of the fields based on the output of the previous function, we use .$ to tell Step Functions to fill it in automatically. Hence, the result is:

const setRobotWorking = new steptasks.LambdaInvoke(this, 'SetRobotWorking', {
  lambdaFunction: functions.updateStatusFunction,
  payload: step.TaskInput.fromObject({
    "RobotName.$": "$.RobotName",
    "Status": "WORKING",
  }),
  resultPath: step.JsonPath.DISCARD,
});

Another interesting thing to see in this construct is how to define a stage that waits for a task to complete before continuing. This is done by changing the integration pattern, plus passing the task token to the task handler - in this case, our mock robot. The definition is as follows:

const tellRobotOrder = new steptasks.LambdaInvoke(this, 'TellRobotOrder', {
  lambdaFunction: functions.sendMqttFunction,
  // Define the task token integration pattern
  integrationPattern: step.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
  // Define the task timeout
  taskTimeout: step.Timeout.duration(cdk.Duration.seconds(10)),
  payload: step.TaskInput.fromObject({
    // Pass the task token to the task handler
    "TaskToken": step.JsonPath.taskToken,
    "RobotName.$": "$.RobotName",
    "SmoothieName.$": "$.SmoothieName",
  }),
  resultPath: step.JsonPath.DISCARD,
});

This tells the state machine to generate a task token and give it to the Lambda function as defined, then wait for a task success signal before continuing. We can also define a catch route in case the task times out, which is done using the addCatch function:

.addCatch(setRobotBroken.next(finishFailure),
  {
    errors: [step.Errors.TIMEOUT],
    resultPath: step.JsonPath.DISCARD,
  })

With that, we've seen how the state machine is built, seen how it runs, and seen how to completely define it in CDK code.

Challenge!

Do you want to test your understanding? Here are a couple of challenges for you to extend this example:

  1. Retry making the smoothie! If a robot times out making the smoothie, just cancelling the order is not a good customer experience - ideally, the system should give the order to another robot instead. See if you can set up a retry path from the BROKEN robot status update back to the start of the state machine.
  2. Add a queue to the input! At present, if we have more orders than robots, the later orders will simply fail immediately. Try adding a queue using Amazon Simple Queue Service (SQS) that holds incoming orders and starts state machine executions.

Summary

Step Functions can be used to build serverless applications as state machines that call other AWS resources. In particular, a powerful combination is Step Functions with AWS Lambda functions for the application logic.

We can use other serverless AWS resources to access more cloud functionality or interface with edge devices. In this case, we use MQTT messages via IoT Core to message robots with smoothie orders, then listen for the responses to those messages to continue execution. We also use a DynamoDB table - a serverless database - to store each robot's current status as the state machine executes.

Best of all, this serverless application runs in the cloud, giving us all of the advantages of running on AWS - excellent logging and monitoring, fine-grained permissions, and the ability to modify the application on demand, to name a few!

Michael Hart · 14 min read

This post shows how to build two simple functions, running in the cloud, using AWS Lambda. The purpose of these functions is the same - to update the status of a given robot name in a database, allowing us to view the current statuses in the database or build tools on top of it. This is one way we could co-ordinate robots in one or more fleets - using the cloud to store the state and run the logic that co-ordinates those robots.

This post is also available in video form - check the video link below if you want to follow along!

What is AWS Lambda?

AWS Lambda is a service for executing serverless functions. That means you don't need to provision any virtual machines or clusters in the cloud - just trigger the Lambda with some kind of event, and your pre-built function will run. It runs on inputs from the event and could give you some outputs, make changes in the cloud (like database modifications), or both.

AWS Lambda charges based on the time taken to execute the function and the memory assigned to the function. The compute power available for a function scales with the memory assigned to it. We will explore this later in the post by comparing the memory and execution time of two Lambda functions.

In short, AWS Lambda allows you to build and upload functions that will execute in the cloud when triggered by configured events. Take a look at the documentation if you'd like to learn more about the service!

How does that help with robot co-ordination?

Moving from one robot to multiple robots helping with the same task means that you will need a central system to co-ordinate between them. The system may distribute orders to different robots, tell them to go and recharge their batteries, or alert a user when something goes wrong.

This central service can run anywhere that the robots are able to communicate with it - on one of the robots, on a server near the robots, or in the cloud. If you want to avoid standing up and maintaining a server that is constantly online and reachable, the cloud is an excellent choice, and AWS Lambda is a great way to run function code as part of this central system.

Let's take an example: you have built a prototype robot booth for serving drinks. Customers can place an order at a terminal next to the robot and have their drink made. Now that your booth is working, you want to add more booths with robots and distribute orders among them. That means your next step is to add two new features:

  1. Customers should be able to place orders online through a digital portal or webapp.
  2. Any order should be dispatched to any available robot at a given location, and the user should be alerted when it is complete.

Suddenly, you have gone from one robot capable of accepting orders through a terminal to needing a central database and ordering system. Not only that, but if you want to be able to deploy to a new location, having a single server per site makes it more difficult to route online orders to the right location. One central system in the cloud to manage the orders and robots is perfect for this use case.

Building Lambda Functions

Convinced? Great! Let's start by building a simple Lambda function - or rather, two simple Lambda functions. We're going to build one Python function and one Rust function. That's to allow us to explore the differences in memory usage and runtime, both of which increase the cost of running Lambda functions.

All of the code used in this post is available on GitHub, with setup instructions in the README. In this post, I'll focus on relevant parts of the code.

Python Function

Firstly, what are the Lambda functions doing? In both cases, they accept a name and a status as arguments, attached to the event object passed to the handler; check the status is valid; and update a DynamoDB table for the given robot name with the given robot status. For example, in the Python code:

def lambda_handler(event, context):
    # ...
    name = str(event["name"])
    status = str(event["status"])

We can see that the event is passed to the lambda handler and contains the required fields, name and status. If valid, the DynamoDB table is updated:

ddb = boto3.resource("dynamodb")
table = ddb.Table(table_name)
table.update_item(
    Key={"name": name},
    AttributeUpdates={
        "status": {
            "Value": status
        }
    },
    ReturnValues="UPDATED_NEW",
)

Rust Function

Here is the equivalent for checking the input arguments for Rust:

#[derive(Deserialize, Debug, Serialize)]
#[serde(rename_all = "UPPERCASE")]
enum Status {
    Online,
}
// ...
#[derive(Deserialize, Debug)]
struct Request {
    name: String,
    status: Status,
}

The difference here is that Rust states its allowed arguments using an enum, so no extra code is required for checking that arguments are valid. The arguments are obtained by accessing event.payload fields:

let status_str = format!("{}", &event.payload.status);
let status = AttributeValueUpdate::builder().value(AttributeValue::S(status_str)).build();
let name = AttributeValue::S(event.payload.name.clone());

With the fields obtained and checked, the DynamoDB table can be updated:

let request = ddb_client
    .update_item()
    .table_name(table_name)
    .key("name", name)
    .attribute_updates("status", status);
tracing::info!("Executing request [{request:?}]...");

let response = request
    .send()
    .await;
tracing::info!("Got response: {:#?}", response);

CDK Build

To make it easier to build and deploy the functions, the sample repository contains a CDK stack. I've talked more about Cloud Development Kit (CDK) and the advantages of Infrastructure-as-Code (IaC) in my video "From AWS IoT Core to SiteWise with CDK Magic!":

In this case, our CDK stack is building and deploying a few things:

  1. The two Lambda functions
  2. The DynamoDB table used to store the robot statuses
  3. An IoT Rule per Lambda function that will listen for MQTT messages and call the corresponding Lambda function

The DynamoDB table comes from Amazon DynamoDB, another AWS service, which provides a NoSQL database in the cloud. This service is also serverless, again meaning that no servers or clusters are needed.

There are also two IoT Rules from AWS IoT Core, each defining an action to take when an MQTT message is published on a particular topic filter. In our case, they allow robots to publish an MQTT message saying they are online, which calls the corresponding Lambda function. I have used IoT Rules before for inserting data into AWS IoT SiteWise; for more information on setting up rules and seeing how they work, take a look at the video I linked just above.

Testing the Functions

Once the CDK stack has been built and deployed, take a look at the Lambda console. You should have two new functions built, just like in the image below:

Two new Lambda functions in the AWS console

Great! Let's open one up and try it out. Open the function name that has "Py" in it and scroll down to the Test section (top red box). Enter a test name (center red box) and a valid input JSON document (bottom red box), then save the test.

Test configuration for Python Lambda function

Now run the test event. You should see a box pop up saying that the test was successful. Note the memory assigned and the billed duration - these are the main factors in determining the cost of running the function. The actual memory used does not affect the cost, but it can help you choose the right memory setting to balance cost and speed of execution.

Test result for Python Lambda function

You can repeat this for the Rust function, only with the test event name changed to TestRobotRs so we can tell them apart. Note that the memory used and duration taken are significantly lower.

Test result for Rust Lambda function

Checking the Database Table

We can now access the DynamoDB table to check the results of the functions. Access the DynamoDB console and click on the table created by the stack.

DynamoDB Table List

Select the button in the top right to explore items.

Explore Table Items button in DynamoDB

This should reveal a screen with the current items in the table - the two test names you used for the Lambda functions:

DynamoDB table with Lambda test items

Success! We have used functions run in the cloud to modify a database to contain the current status of two robots. We could extend our functions to allow different statuses to be posted, such as OFFLINE or CHARGING, then write other applications to work using the current statuses of the robots, all within the cloud. One issue is that this is a console-heavy way of executing the functions - surely there's something more accessible to our robots?

Executing the Functions

Lambda functions have a huge variety of ways that they can be executed. For example, we could set up an API Gateway that is able to accept API requests and forward them to the Lambda, then return the results. One way to check the possible input types is to access the Lambda, then click the "Add trigger" button. There are far too many options to list them all here, so I encourage you to take a look for yourself!

Lambda add trigger button
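
For example, putting a REST endpoint in front of the status-update function is only a few lines of CDK. This is a sketch of one possible trigger, not something included in the sample repository:

import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

// API Gateway forwards incoming HTTPS requests to the Lambda function
// and returns its response to the caller.
export function addRestApi(scope: Construct, updateStatusFn: lambda.IFunction) {
  return new apigateway.LambdaRestApi(scope, 'RobotStatusApi', {
    handler: updateStatusFn,
  });
}

Note that an API Gateway trigger wraps the request in its own event format, so the handler would need to read the name and status from the request body rather than directly from the event object.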

There's already one input for each Lambda - the AWS IoT trigger. This is an IoT Rule set up by the CDK stack, which is watching the topic filter robots/+/status. We can test this using either the MQTT test client or by running the test script in the sample repository:

./scripts/send_mqtt.sh

One message published on the topic will trigger both functions to run, and we can see the update in the table.

DynamoDB Table Contents after MQTT

There is only one extra entry, and that's because both functions executed on the same input. That means "FakeRobot" had its status updated to ONLINE once by each function.

If we wanted, we could set up the robot to call the Lambda function when it comes online - it could make an API call, or it could connect to AWS IoT Core and publish a message with its ONLINE status. We could also set up more Lambda functions to take customer orders, dispatch them to robots, and so on - the Lambda functions and accompanying AWS services allow us to build a completely serverless robot co-ordination system in the cloud. If you want to see more about connecting ROS2 robots to AWS IoT Core, take a look at my video here:

Lambda Function Cost

How much does Lambda cost to run? For this section, I'll give rough numbers using the AWS Price Calculator. I'll assume a rough estimate of 100 messages per minute - accounting for customer orders arriving, robots reporting status changes, and orders being distributed - with each message triggering one Lambda function invocation.

For our functions, we can run the test case a few times for each function to get a small spread of numbers. We can also edit the configuration in the console to set higher memory limits, to see if the increase in speed will offset the increased memory cost.

Edit Lambda general configuration

Edit Lambda memory setting

Finally, we will use an ARM architecture, as this currently costs less than x86 in AWS.

I will run a valid test input for each function 4 times each for 3 different memory values - 128MB, 256MB, and 512MB - and use the latter 3 invocations, as the first (cold start) invocation takes much longer. I will then take the median billed runtime and calculate the cost per month for 100 invocations per minute at that runtime and memory setting.

My results are as follows:

| Test | Python (128MB) | Python (256MB) | Python (512MB) | Rust (128MB) | Rust (256MB) | Rust (512MB) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 594 ms | 280 ms | 147 ms | 17 ms | 5 ms | 6 ms |
| 2 | 574 ms | 279 ms | 147 ms | 15 ms | 6 ms | 6 ms |
| 3 | 561 ms | 274 ms | 133 ms | 5 ms | 5 ms | 6 ms |
| Median | 574 ms | 279 ms | 147 ms | 15 ms | 5 ms | 6 ms |
| Monthly Cost | $5.07 | $4.95 | $5.17 | $0.99 | $0.95 | $1.06 |

There is a lot of information to pull out from this table! The first thing to notice is the monthly cost. This is the estimated cost per month for Lambda - 100 invocations per minute for the entire month costs a maximum total of $5.17. These are rough numbers, and other services will add to that cost, but that's still very low!
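
As a rough sanity check against Lambda's public ARM pricing (around $0.0000133334 per GB-second plus $0.20 per million requests at the time of writing): 100 invocations per minute is about 4.32 million invocations per month; at 256MB (0.25GB) and ~279 ms each, that is roughly 4,320,000 x 0.279 s x 0.25 GB ≈ 301,000 GB-seconds, or about $4.00 of compute plus about $0.86 of request charges - in the same ballpark as the ~$4.95 the calculator reports for the Python function.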

Next, in the Python function, we can see that multiplying the memory will divide the runtime by roughly the same factor. The cost stays roughly the same as well. That means we can configure the function to use more memory to get the fastest runtime, while still paying the same price. In some further testing, I found that 1024MB is a good middle ground. It's worth experimenting to find the best price point and speed of execution.

If we instead look at the Rust function, we find that the execution time is pretty stable from 256MB onwards. Adding more memory doesn't speed up our function - it is most likely limited by the response time of DynamoDB. The optimal point seems to be 256MB, which gives very stable (and snappy) response times.

Finally, when we compare the two functions, we can see that Rust is much faster to respond (5ms instead of 279 ms at 256MB), and costs ~20% as much per month. That's a large difference in execution time and in cost, and tells us that it's worth considering a compiled language (Rust, C++, Go etc) when building a Lambda function that will be executed many times.

The main point to take away from this comparison is that memory and execution time are the major factors when estimating Lambda cost. If we can minimize these parameters, we will minimize cost of Lambda invocation. The follow-up to that is to consider using a compiled language for frequently-run functions to minimize these parameters.

Summary

Once you move from one robot working alone to multiple robots working together, you're very likely to need some central management system, and the cloud is a great option for this. What's more, you can use serverless technologies like AWS Lambda and Amazon DynamoDB to only pay for the transactions - no upkeep, and no server provisioning. This makes the management process easy: just define your database and the functions to interact with it, and your system is good to go!

AWS Lambda is a great way to define one or more of these functions. It can react to events like API calls or MQTT messages by integrating with other services. By combining IoT, DynamoDB, and Lambda, we can allow robots to send an MQTT message that triggers a Lambda, allowing us to track the current status of robots in our fleet - all deployed using CDK.

Lambda functions are charged by invocation, where the cost for each invocation depends on the memory assigned to the function and the time taken for that function to complete. We can minimize the cost of Lambda by reducing the memory required and the execution time for a function. Because of this, using a compiled language could translate to large savings for functions that run frequently. With that said, the optimal price point might not be the minimum possible memory - the Python function seems to be cheapest when configured with 1024MB.

We could continue to expand this system by adding more possible statuses, defining the fleet for each robot, and adding more functions to manage distributing orders. This is the starting point of our management system. See if you can expand one or both of the Lambda functions to define more possible statuses for the robots!