I tend to bang on about the balance of risk between updating everything automatically, and running the more traditional ‘Dev > Test > Stage > Production’ method of introducing change. With the fashion for CI/CD and the automation available you’d think that we’ve got this covered.
Not when those updates happen in your infrastructure layer, operated by a third party.
This is what happened to me this week.
One of my clients runs an eCommerce platform, based on Magento. The servers it runs on (a classic three-tier setup, hosted by Memset) are patched and updated by a daily automatic process. (I’m taking my own advice, plus this client doesn’t pay enough for a real human to carry out the work.)
The client complained that they were unable to update products from the admin backend. I was able to replicate the problem on a different client PC, so I couldn’t fall back on the trusty “Have you tried turning it off and on again?”
Time to roll up the metaphorical sleeves, turn on detailed logging and hunt for bugs.
I spent half a day fixing issues that weren’t actually the root cause of the problem (this is typical when a platform isn’t getting regular love/db-admin attention) but couldn’t see anything in the logs that showed why the product editing process was failing. The next step was to emulate the client’s browser setup and look for client-side errors.
Sure enough there was an error in the browser console, a cryptic message:
[Error] Failed to load resource: the server responded with a status of 403 () https://foo.com/index.php/admin/catalog_product/validate/id/608/key/69f6be00f17405c6a7c2008ce1112561/?isAjax=true
It took a bit of googling, some dead ends, and a couple of forum searches for me to realise that something was blocking the validation POST request, and that it wasn’t happening at the server end.
Next step was to check all the security control points. We have network firewalls, application firewalls…with just about everything turned on. Judging by the number of ‘attacks’ per hour, they do a good job of keeping crud out and letting customers in.
I checked the network and web application firewalls (WAF) in front of the server, but no rules had been triggered for my session.
Then I remembered that there is a second WAF in the Cloudflare CDN that I put in front of all of my hosted services to filter out dodgy traffic before it gets to gobble up server bandwidth. Sure enough, a rule had been triggered that treated the long URI (above) as suspicious. I added an exception for my client’s IP address and…voilà. It’s working again!
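With hindsight, the hunt boils down to one comparison: does the same request still get a 403 when it skips the CDN? Here is a minimal sketch of that reasoning in Python. The function name and the returned labels are my own invention for illustration, not anything from Magento or Cloudflare, and it assumes you can replay the request both through the CDN and directly against the origin (with a valid admin session, which is the fiddly part).

```python
from typing import Optional


def locate_block(status_via_cdn: int, status_direct: Optional[int]) -> str:
    """Guess which layer rejected a request (hypothetical helper).

    status_via_cdn: HTTP status seen when the request goes through the CDN.
    status_direct:  status for the same request sent straight to the origin,
                    or None if that test could not be run.
    """
    if status_via_cdn != 403:
        # Nothing was rejected on the normal path, so there is nothing to locate.
        return "not blocked"
    if status_direct is None:
        # Without the direct comparison we cannot tell the layers apart.
        return "unknown: replay the request directly against the origin"
    if status_direct == 403:
        # The origin side rejects it even without the CDN in the path.
        return "origin side: server, application or origin WAF"
    # The origin accepts the request, so the rejection happens in front of it.
    return "edge side: CDN or CDN WAF"


if __name__ == "__main__":
    # Roughly the situation in this post: 403 through the CDN, fine at the origin.
    print(locate_block(403, 200))
```

A 403 on both paths only proves that the origin side is blocking; the edge could be blocking as well, so it is still worth looking at the CDN’s own security event log.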
The root cause of the problem is that I have the Cloudflare WAF set to apply standard OWASP rules. Each time those rules are updated they are automatically applied. Some time between the last product update session and this one, at least one new rule has been added which has blocked the product update process. You can see quite a few rules are triggered.
Am I going to turn off the WAF or stop automatically updating? No. At a cost of 4 hours of troubleshooting per 6 months, it’s worth having someone else keep the security rules up to date.
As always YMMV.