This is a cautionary tale… A month or so ago, someone from support asked me why the hell a test environment had spent over a thousand NZD in Logic Apps actions. My first reaction was “Are you kidding?”… my second reaction was that pit in your stomach feeling when you know something is really wrong, but you don’t know why.
I had a lot of “logical assumptions”, but most of them were based on some big load being dropped in the environment or some test gone really, really wrong.
The answer was a bit more complicated, but it also highlighted something worse. We hadn’t been treating the test environment (especially Logic Apps) as a production environment. And because of that, a couple of really small things had consequences we didn’t expect.
To give some context, in this project we had a series of logic apps that would monitor the dead-letter queues of Service Bus topics. Based on a set of rules, they would allow a couple of retries of those messages by republishing them. Each logic app would run every 6 hours, to give any potential transient issue time to be resolved. After a number of attempts, one branch of the logic app would notify the team that a message had failed too many times (and that message should then be removed from the Service Bus dead-letter queue).
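To make the retry rule concrete, here is a minimal sketch of the decision each run makes for a dead-lettered message. It is not the actual logic app definition: the `DeadLetteredMessage` type, the `next_step` function and the `MAX_RETRIES` value are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed value: the real number of retries was defined by the project's rules.
MAX_RETRIES = 3


@dataclass
class DeadLetteredMessage:
    message_id: str
    retry_count: int  # how many times this message has already been republished


def next_step(message: DeadLetteredMessage, max_retries: int = MAX_RETRIES) -> str:
    """Return the branch to take for a message found in the dead-letter queue."""
    if message.retry_count < max_retries:
        # Still within the retry budget: put the message back on the topic.
        return "republish"
    # Retry budget exhausted: tell the team and take the message off the queue.
    return "notify_and_remove"


if __name__ == "__main__":
    for msg in (DeadLetteredMessage("a", 0), DeadLetteredMessage("b", 3)):
        print(msg.message_id, next_step(msg))
```

The second branch is the one that matters for the rest of this story, because it combines two steps (notify and remove) that turned out to be coupled.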
Dissecting the problem
Here is a list of things that happened at the same time and compounded the problem. Each one by itself might not have had such a catastrophic result, but the combination was quite painful.
Notification Email credentials expired
Issue 01: The email notification was set up using a developer’s account. This was the first link in a chain of problems. Because of the password rotation policy, the authentication token for the email connection became invalid when he changed his password. We could have done two things to mitigate this from the beginning: either use a service account (which would be the preferred approach, and is what we would have in production), or at least use an application password, which would not expire when the user password changed. So here is lesson #1:
Lesson #01: Make sure that the credentials you are using on your system do not expire! Prefer using service accounts or Managed Service Identities and start that practice from the development environment.
Logic App didn’t fail gracefully
Issue 02: Removal of the message from the Service Bus dead-letter queue was dependent on the email notification. This was probably one of the biggest issues in how the logic app was implemented. Ideally, the completion of the peek-lock in this case should not depend on the notification step completing successfully. So here is lesson #2:
Lesson #02: Make sure that the critical path of your process has as few dependencies as possible. Analyze your workflow and decide if your next step is really a dependency or if it should run even if the previous step failed.
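In Logic Apps this is what the "run after" setting on an action is for: it lets a step run even when the previous one failed. The same idea can be sketched in plain Python. Everything here is hypothetical: `notify` and `complete` stand in for the email action and the Service Bus "complete message" action, and the function name is made up for this example.

```python
from typing import Callable


def handle_exhausted_message(
    message_id: str,
    notify: Callable[[str], None],
    complete: Callable[[str], None],
) -> bool:
    """Remove the message from the queue even if the notification fails.

    Returns True if the notification was sent, False if it failed.
    """
    notified = True
    try:
        notify(message_id)
    except Exception as exc:
        # Broad on purpose: no notification failure should block the cleanup.
        notified = False
        print(f"notification failed for {message_id}: {exc}")
    # Completing the message is the critical step, so it runs regardless.
    complete(message_id)
    return notified
```

In a real workflow you would probably also record the failed notification somewhere durable, so the message is not lost silently. The point of the sketch is only that the cleanup no longer waits on the email.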
No alerts set up
Issue 03: Since we were running this in test, we didn’t create any alerting for the logic apps. Although the alerts were created in production, we didn’t bother to set them up in test. So here is lesson #3:
Lesson #03: Make sure that you create alerts for your logic apps. At least alerts for when the logic app fails. Depending on your scenario, also create alerts to trigger when your billable actions spike.
If you have a logic app that runs on average once a minute and executes 10 billable actions per run, create an alert if it goes over 100 actions in a 5-minute period, for example. Understand the volumes you expect from your logic apps, and make sure you are notified as fast as possible when they deviate from normal. That will help you avoid unwanted billing and stop rogue processes earlier. You could use this PowerShell script to create alerts for all your logic apps in a resource group.
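The arithmetic behind that example is worth spelling out: one run a minute with 10 billable actions gives 50 expected actions in 5 minutes, so an alert at 100 is twice the normal volume. A small sketch of that calculation follows. The function names and the `headroom` factor are assumptions for illustration, not part of any Azure API.

```python
def expected_actions(runs_per_minute: float, actions_per_run: int, window_minutes: int) -> float:
    """Billable actions you expect to see in one alert window under normal load."""
    return runs_per_minute * actions_per_run * window_minutes


def alert_threshold(
    runs_per_minute: float,
    actions_per_run: int,
    window_minutes: int,
    headroom: float = 2.0,  # assumed multiplier: alert at twice the normal volume
) -> float:
    """Action count above which the alert should fire for the given window."""
    return expected_actions(runs_per_minute, actions_per_run, window_minutes) * headroom


if __name__ == "__main__":
    print(expected_actions(1, 10, 5))
    print(alert_threshold(1, 10, 5))
```

How much headroom to allow is a judgment call. Too little and the alert fires on normal variation, too much and a runaway process gets a long head start.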
No automated Testing or Environment Management
Issue 04: The logic app shouldn’t have been running at the time this happened, but the tester didn’t clean up the environment. Automating the testing process would probably have avoided this. So here is the last lesson, #4 (or #0, depending on how you want to look at it):
Lesson #04: Invest in test automation as much as possible, with pre- and post-test steps that configure your environment and leave it in a stable state.
Mike Stephenson presented an interesting session on Integration Monday, where he discusses the use of SpecFlow for test automation with logic apps. I strongly recommend watching that presentation.
At the same time, if you want to make sure that your resources (logic apps or others) are not enabled when they should not be, even if you don’t implement test automation (which we all should), try to implement some automation scripts to turn them off automatically every night. You could use this PowerShell script to turn logic apps on and off, which can be run automatically or on demand.
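The shape of such a nightly shutdown job is simple, whatever tool performs the actual call. The sketch below is hypothetical throughout: `disable` stands in for whatever really disables a logic app (the PowerShell script mentioned above, the Azure CLI or an SDK call), and `keep_enabled` is an assumed allow-list for apps that must stay on.

```python
from typing import Callable, Iterable, List


def disable_all(
    logic_app_names: Iterable[str],
    disable: Callable[[str], None],
    keep_enabled: Iterable[str] = (),
) -> List[str]:
    """Disable every logic app that is not on the allow-list.

    Returns the names that were disabled, which is handy for logging.
    """
    keep = set(keep_enabled)
    disabled = []
    for name in logic_app_names:
        if name in keep:
            continue  # explicitly allowed to keep running overnight
        disable(name)
        disabled.append(name)
    return disabled
```

Scheduling this to run every night against the test resource group means a forgotten clean-up costs at most one day of actions.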
So if I could summarize that painful experience and distil it into some “wisdom”, these are the things I will try to incorporate into my design, deployment and monitoring processes from now on:
- Minimize dependencies on your logic app’s critical path. Try to make sure it always fails gracefully.
- Make sure that you use service accounts or MSI for every connection you have.
- Create alerts for failures and deviations from your normal processes. Especially around unexpected spikes in action executions.
- Invest in test automation, and where possible in environment management automation too.