Disclaimer: This post is very much me unpacking my thoughts around fault analysis, attribution and thinking in systems. It does not aim to be super objective, and is mostly abstract. I've tried to tie it all together at the end with a concrete example, but I think it still remains fairly abstract. In practice, we often have less time to think about the philosophy behind fault analysis, and need to "just fix things and make sure they work".
Edit (June 2022): I've modified the content of this post to use the word "fault" instead of "root cause" - this is deliberate and in part due to reading How Complex Systems Fail.
An anecdote
Having worked on software projects comprising decades-old code (as a relatively young professional), I often found myself blaming or mischaracterizing the programmers who came before me.
It seems obvious that no “self-respecting” engineer would’ve written such sub-par code. Surely the person responsible for the thousand-line bash script to do a database backup must’ve had something fundamentally amiss about them.
Unfortunately (or fortunately) - I have now written my own version of the thousand-line bash script, several times over, and I wonder if any young engineer fresh out of university has scoffed at my engineering abilities, or perhaps taken issue with my character and commitment to my craft.
Cognitive bias and attribution
The fundamental attribution error is the tendency to attribute a person's behaviour to their personal or dispositional characteristics, rather than to the situational or contextual factors surrounding the behaviour.
It seems (from my anecdote) that I am subject to this bias. I would assume that a large majority of software engineers are. I would also argue that the dispositional attribution of behaviour isn’t always wrong, but is oftentimes misplaced (or, as the definition of the FAE puts it, overly emphasised).
Why attribution is important
Often during fault analysis, we aim to understand a few things:
- What happened? e.g. our production instance of Redis ran out of memory.
- What was its impact? e.g. our users were not able to log in for 8 minutes.
- Why did it happen? e.g. commit `xxxx` introduced a new API call that continually wrote to Redis.
- How do we prevent it from happening again? e.g. sanity check changes to code paths that might write to Redis in code review.
What is missing from the above is the contextual understanding of why commit `xxxx` had to happen in the first place, and by what series of events it ended up in production. We might jump to assumptions like "The author of commit `xxxx` does not know enough about Redis, and they should go RTFM", or perhaps "We need stricter rate limiting on write APIs so that we can prevent this from happening if a similar change occurs". While fair assumptions, these are absolutist and attribute the fault either to the system configuration (no rate limiting) or to actor disposition (the author does not understand Redis). They do not really answer the question of what circumstances resulted in:
- The commit being written.
- The commit executing in a production environment.
- The system (the notion of the "organizational system" in which the actor and the code exist) accepting the commit.
To bridge that gap, it seems important to have a strategy for weighing the dispositional and contextual characteristics of behaviour (and its side effects) so we can better understand why things happened, and perhaps how we can do them better.
Note: Maybe commit `xxxx` did not need to happen in the first place? (An idealized commit `xxxy` that didn't result in a production failure would've been one that had the foresight of the configuration and dispositional elements of the system, perhaps through a better-managed design step.)
`git blame` and the importance of the actor
In a world with version control and tagged ownership of code changes, I doubt I could've avoided misplacing judgement on the unlucky engineers who came before me. In hindsight, had I not been able to `git blame` a particular commit, I might've thought more about the history and context that gave rise to a particularly nasty branch of code.
Note: Seeing as this article is less technical, it seems appropriate to describe `git blame`. The best description I can think of is that it acts as a guestbook for a particular file in a codebase managed by git, allowing anyone who interacts with that codebase to view a line-by-line history of authors and revisions for that file.
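As a self-contained sketch of that "guestbook", the snippet below builds a throwaway repository and annotates a file with it (the directory, file name, and author are invented for illustration):

```shell
# Create a throwaway repository so the example is self-contained.
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.name "Alice"
git config user.email "alice@example.com"

# Commit a small script under Alice's name.
printf 'echo "backing up database"\n' > backup.sh
git add backup.sh
git commit -qm "add backup script"

# Annotate the file: each line is prefixed with the commit hash,
# author, and timestamp that last touched it.
git blame backup.sh
```

Each output line pairs a line of `backup.sh` with the short hash and author of the commit that last modified it, which is precisely what makes the author of a change so visible.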
A world without owners attached to commits would be a world without actor salience. The person who wrote the code is less important than the reason the code was written, or the side effect the code has. This is a far simpler world view, but not one that I believe is strictly correct.
We may need to go up a few layers of abstraction to ask the right questions about attribution (and where it should fall).
Let’s instead think about actors, the systems they interact with, and the side effects of their interactions.
A systems model of attribution
Given an actor who interacts with a system under a given configuration, and who gives rise to a side effect through that interaction, can we answer the following questions about the attached assumptions:
- Assumption: The characteristics of the actor are just as important as the configuration of the system.
- Question: What inherent characteristic did the actor have that allowed them to interact with the system such that the observed side effect resulted?
- Assumption: The characteristics of the actor are unimportant; rather, their interaction with the system was determined by the configuration of the system.
- Question: What about the system configuration supported the interaction that resulted in the side effect we observe?
Lastly, to decouple and simplify the above assumptions into attribution types:
- Assumption: There is an attribute in the system configuration that gives rise to the side effect.
- Question: Would the side effect be observed if we had placed different actors in the same system configuration? (contextual attribution)
- Question: What system configurations would’ve resulted in the same side effect with different actors? (contextual attribution)
- Assumption: There is an attribute common amongst actors that give rise to the side effect.
- Question: Would the same actor have given rise to the same side effect under a different set of system configurations? (dispositional attribution)
- Question: What types of actors would’ve resulted in the same or different side effects under the same environment and situation? (dispositional attribution)
Being more concrete
We can use our systems model of attribution in a real-world scenario quite easily by directly applying our abstractions to real stakeholders and systems.
- Actor: a software engineer.
- System: a business or corporation with one or more codebases.
- Interacting with the system: Writing some code, deploying a change, etc.
- Configuration: Business strategy, stakeholder pressure, market factors, corporate process, legislation, etc.
- Side effects: A production failure that impacts users, etc.
Given that we know the interaction with the system resulted in the side effect, how do we understand why the interaction happened? Do we need to understand the actor at a fundamental level, or do we need to interrogate the system and its configuration?
I think the answer lies somewhere in between.