AWS Step Functions Error Handling

For AWS Step Functions

  • Errors can happen in a variety of ways
    • State machine definition errors (example: no matching name for a state)
    • Task failures (example: an exception in a Lambda function)
    • Transient issues (example: a network partition event)
  • Use Retry to retry a failed state and Catch to transition the state machine to a failure path
    • Note: prefer handling errors in the state machine rather than in the application layer, because handling them in application code increases the complexity of the application
  • Some of the predefined error codes
    • States.ALL: matches any error name
    • States.Timeout: the task ran longer than TimeoutSeconds, or no heartbeat was received within HeartbeatSeconds
    • States.TaskFailed: execution failure
    • States.Permissions: insufficient privileges to execute code
  • A state may report its own errors, and you can catch them in Step Functions by name
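
As a minimal sketch of a state reporting its own error: assuming the task is a Python Lambda function, raising a custom exception surfaces the exception class name as the error name, which ErrorEquals can then match. The names OrderValidationError, HandleInvalidOrder, FailureState and the order_id field are hypothetical and only serve the illustration.

```python
import json


class OrderValidationError(Exception):
    """Hypothetical custom error raised by the task's Lambda function."""


def lambda_handler(event, context):
    # Raise a named error instead of recovering in application code;
    # the state machine decides what happens next.
    if "order_id" not in event:
        raise OrderValidationError("order_id is missing")
    return {"order_id": event["order_id"], "status": "validated"}


# Catch fragment for the Task state that invokes the function above.
# Assumption: for a Python Lambda the error name is the exception class name.
# States.ALL is listed alone and last, as a fallback for everything else.
catch_fragment = [
    {"ErrorEquals": ["OrderValidationError"], "Next": "HandleInvalidOrder"},
    {"ErrorEquals": ["States.ALL"], "Next": "FailureState"},
]

if __name__ == "__main__":
    print(json.dumps(catch_fragment, indent=2))
```

The function itself stays free of recovery logic; routing on the error name is left to the state machine definition.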

Retry (for Task State or Parallel State)

Pasted image 20221017150840.png

  • ErrorEquals: the error names this retrier matches
  • IntervalSeconds: delay in seconds before the first retry (defaults to 1)
  • BackoffRate: multiplier applied to the delay after each retry, giving exponential backoff (a pattern that applies to any AWS service); defaults to 2.0
  • MaxAttempts: defaults to 3. Set to 0 to never retry
    • When the maximum number of attempts is reached, the Catch block (if one is defined) kicks in
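
The fields above can be sketched as a Retry fragment plus a small helper that computes the resulting wait times. This is only an illustration: the values are made up, retry_delays is a hypothetical helper (not part of any AWS SDK), and it ignores optional fields such as MaxDelaySeconds and JitterStrategy.

```python
import json

# Retry fragment for a Task (or Parallel) state; the values are illustrative.
retry_fragment = [
    {
        "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
        "IntervalSeconds": 2,
        "BackoffRate": 2.0,
        "MaxAttempts": 3,
    }
]


def retry_delays(interval_seconds=1, backoff_rate=2.0, max_attempts=3):
    """Return the wait in seconds before each retry attempt.

    The first retry waits interval_seconds; every later wait is the
    previous one multiplied by backoff_rate.
    """
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


if __name__ == "__main__":
    rule = retry_fragment[0]
    print(json.dumps(retry_fragment, indent=2))
    print(retry_delays(rule["IntervalSeconds"], rule["BackoffRate"], rule["MaxAttempts"]))
```

With these values the waits are 2, 4 and 8 seconds; with MaxAttempts set to 0 the list is empty, meaning the state is never retried.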

Catch (for Task State or Parallel State)

Pasted image 20221017151226.png

  • ErrorEquals: match a specific kind of error
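  • Next: the state to transition to
  • ResultPath: a path that determines what input is sent to the state specified in the Next field
    • Setting ResultPath to $.error adds the error under an error field of the state's input, and the combined document becomes the input of the Next state. For example: Pasted image 20221017151614.png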
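
To make the ResultPath behaviour concrete, here is a rough sketch that mimics how the error output is merged into the state's input. The state name NotifyFailure, the order_id field and the helper apply_result_path are hypothetical; the helper only handles one-level paths such as $.error and is not how Step Functions is implemented. The error output itself carries the fields Error and Cause.

```python
import copy
import json

# Catch fragment; the Next state name is hypothetical.
catch_fragment = [
    {"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure", "ResultPath": "$.error"}
]


def apply_result_path(state_input, error_output, result_path="$.error"):
    """Mimic ResultPath for a simple one-level path such as "$.error"."""
    if not result_path.startswith("$.") or "." in result_path[2:]:
        raise ValueError("this sketch only handles paths of the form $.field")
    field = result_path[2:]
    merged = copy.deepcopy(state_input)  # leave the original input untouched
    merged[field] = error_output
    return merged


if __name__ == "__main__":
    state_input = {"order_id": 42}
    error_output = {"Error": "States.Timeout", "Cause": "illustrative cause text"}
    next_input = apply_result_path(
        state_input, error_output, catch_fragment[0]["ResultPath"]
    )
    print(json.dumps(next_input, indent=2))
```

The Next state therefore sees both the original input and the error details, rather than the error replacing the input.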