Different types of complexity interact with test suites in different ways. Consider BitBacker, an online backup product that I worked on for three years. It had very few functional requirements. At its core, it only had to let the user choose files, back them up, and restore them. Almost all of the complexity was non-functional: it had to look good and be easy to use, of course, but it also had to be fast and secure. I spent most of my development effort on the "fast" and "secure" parts.
(Note that when I say "functional" here, I'm talking about requirements for a system's behavior. This has nothing to do with functional programming.)
In this type of app, where there are so few functional requirements, functional test fragility is less of a problem. In a recent discussion with Jonathan Penn, I mentioned that the backup and restore functionality was tested at the unit, subsystem, and full-stack levels: three different levels of tests, all testing the same thing. He asked me whether this made refactoring difficult. It didn't.
BitBacker's functional requirements were never going to change. When the user backed up and then restored files, they had to be identical to the originals. That's all. It took 17,000 lines of code to make that happen efficiently and securely, but the surface area of the user-facing problem is tiny.
I didn't know this at the time. If I'd been building a business app instead of a backup system, I probably would've ended up with a similar test suite, and in that situation it would've been a burden. Fortunately, I got lucky, and I've learned this lesson by retrospecting on my luck rather than on some pain that I felt.
What about automated non-functional tests? The topic is murky in general, and I only know how to test small subsets of the non-functional requirement space. I don't know how to automate testing of user experience, for example.
I have done automated performance testing, however. At one point, I wrote tests for BitBacker that ran backups across a wide range of file counts and asserted that the backup time grew linearly with the number of files. That's clearly a non-functional test, but how fragile is it?
It's very fragile, of course, unless it's run on massive file counts, which would have taken far longer than my patience allowed at the time. I kept the file counts small, so the test ran fast but broke constantly, which eventually led me to remove it.
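The shape of that test can be sketched in a few lines. Everything here is illustrative rather than BitBacker's actual code: the `growth_is_linear` helper, the tolerance, and the timings are all invented. In the real test, the times came from running actual backups at each file count.

```python
# Sketch of a "backup time grows linearly" assertion. A run is linear
# if the time per file stays roughly constant as the file count grows.

def growth_is_linear(counts, times, tolerance=0.25):
    """Check that time/count stays within `tolerance` of the first run's."""
    per_item = [t / n for n, t in zip(counts, times)]
    baseline = per_item[0]
    return all(abs(p - baseline) <= tolerance * baseline for p in per_item)

# Synthetic linear timings: time per file is roughly constant, so it passes.
assert growth_is_linear([100, 1_000, 10_000], [0.5, 5.1, 48.0])

# Synthetic quadratic timings: time per file keeps growing, so it fails.
assert not growth_is_linear([100, 1_000, 10_000], [0.5, 50.0, 5_000.0])
```

The fragility shows up in that `tolerance` parameter: at small file counts, per-file overhead and timing noise swamp the trend, so any tolerance tight enough to catch real regressions also fails on innocent runs.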
I replaced the test with a system that could kick off various predefined processes ("do an empty backup", "back up 1,000 files", etc.), graphing the runtimes and memory footprints across revisions in version control. One look at those performance graphs would show whether, and where, there was a problem. This gave me a different kind of feedback: instead of defining "success" and "failure", it alerted me to a change, which I could then investigate on my own.
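The core of that change-detection idea can be sketched without any graphing at all. This is not BitBacker's tool, just a minimal illustration under assumed parameters: compare each revision's runtime against the mean of the few before it, and flag the ones that moved, leaving the judgment to a human.

```python
# Sketch of runtime change detection across revisions: flag any revision
# whose runtime deviates from the mean of the preceding `window` runs by
# more than `threshold` (a fraction). The window and threshold are invented.

def flag_changes(history, window=3, threshold=0.20):
    """history is a list of (revision, runtime) pairs, oldest first."""
    flagged = []
    for i in range(window, len(history)):
        recent = [t for _, t in history[i - window:i]]
        mean = sum(recent) / len(recent)
        rev, t = history[i]
        if abs(t - mean) > threshold * mean:
            flagged.append(rev)
    return flagged

history = [("r101", 4.9), ("r102", 5.1), ("r103", 5.0),
           ("r104", 5.2), ("r105", 7.8)]
assert flag_changes(history) == ["r105"]  # r105's backup got slower
```

Note that flagging "r105" isn't a verdict. Maybe the slowdown is a regression, or maybe that revision added encryption and the cost is expected; the tool only says "look here".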
I suspect that this is a fundamental property of non-functional testing. Trying to fully automate it and boil it down to a set of pass/fail assertions, while sometimes possible, seems prone to fragility. It may be that non-functional testing is best achieved by dashboard apps, like my performance-over-revisions graph, or an app that renders every page in a user flow automatically and highlights recent changes in appearance.