Coffee KPIs, Eric Zinn
If you’ve ever worked with me on metrics you’ve probably heard me say something like, “well, let’s set up some coffee KPIs and see how we do.” You probably understood what I meant from the context but this is how that term came to be.
Once upon a time I had the responsibility to define, create and deliver monthly reports for several areas within my organization. Those areas included: Network, Mainframe, Deployment, Messaging and Collaboration, Windows, UNIX and Middleware.
The reports needed to include an executive summary that was a narrative of the included information, a description of key issues and their resolution, changes past and planned and all the associated metrics.
In addition to the overall report, I was also responsible for creating the middleware report. In that report I included, at first, metrics that looked like these:
Up-time percentage
Number of transactions total and by server
Number of failed transactions
Three charts with just those numbers. Now, if you are comfortable with metrics you already have a lot of questions and see a lot of problems.
I, however, was not comfortable or experienced with metrics so to me those were perfectly valid answers. Therefore when my boss asked, “Is that a good number of transactions?” my youthful and passionate answer was a respectful if not gruff, “how the hell should I know?”
My boss, a fantastic mentor, said, “Then why are you telling me how many there are?”
“Because it tells you how much work the system is doing,” I said. Which it does, dang it.
“How much is it capable of doing?” he said.
I took a sip of my coffee and said, “I’ll be right back.”
I went to my cube and gathered the key performance indicators (KPIs) that I had. I looked at memory utilization when compared to transaction volume as well as disk IO, CPU utilization, network utilization and a number of other KPIs. I then identified what seemed to be our bottleneck, the network, and did some basic math. If X transactions caused Y percent load on the network, then I could determine how many transactions each percent of network load represented (X divided by Y). All I had to do was decide what I wanted my network utilization to be and multiply that percentage by the number of transactions per percent.
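As a rough sketch, that back-of-the-envelope math looks something like the Python below. The function name and every number in it are made up for illustration; they are not the figures from the actual report, and the sketch assumes the load grows in a straight line with transaction volume.

```python
def linear_capacity(observed_transactions, observed_load_pct, target_load_pct):
    """Estimate transaction capacity, assuming load grows linearly with volume.

    Hypothetical helper: the name and inputs are illustrative assumptions.
    """
    if observed_load_pct <= 0:
        raise ValueError("observed load must be greater than zero")
    # How many transactions each percent of network load represents.
    transactions_per_pct = observed_transactions / observed_load_pct
    # Scale up to the utilization we are willing to run at.
    return transactions_per_pct * target_load_pct


# Illustrative numbers only: 2,000,000 transactions produced 40% network load,
# and we want to run at no more than 70% utilization.
estimate = linear_capacity(2_000_000, 40.0, 70.0)
print(f"Estimated capacity: {estimate:,.0f} transactions")
```

With these invented inputs the estimate works out to 3,500,000 transactions. The whole thing rests on the straight-line assumption, which is exactly the weak spot.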
I proudly returned to my boss and gave him my number. “That’s my theoretical maximum and that’s what I can run at without having to wake you up in the middle of the night.”
He said, “Are you sure?”
“I’m sure,” I said.
“So if our system breaks due to load before that number I can fire you?”
“WHAT?” I was, of course, not prepared for such a possibility. “You can’t hold me to that.”
“Then what use is that number to me,” he said, “if I can’t trust it?”
“The numbers are the numbers,” I said as I eyed my coffee.
“So the growth is linear?” he said. “It’s a perfect 1:1 relationship?”
“Well,” I said, “it seems to be. I don’t know.” I had to think about it. I’ve seen very few things grow in a linear fashion in IT. Normally it’s more logarithmic and there’s an inflection point where it all falls apart. “No, it probably isn’t.”
“Tell you what,” he said, “I agree with you. I think that we’ll see a sharp decline in performance rather than a slow degradation. Set the number to whatever you feel is right. Then we’ll watch it and see what happens.”
It was my first coffee KPI.
“We can do that?” I said, still stinging from the shock that I could be fired for picking a bad metric threshold. “We can just pick one and change it?”
“How many transactions do we process in a year?” he said.
“About a billion,” I said.
“Is it fair to say that if the system is down we lose money?”
I nodded. “Sure.”
“How much?”
“I really can’t say,” I said.
“Is it more than you make?”
“Um… probably a hundred times more,” I said and stifled a laugh.
“So if it goes down for an hour it costs the company more than you make?”
I felt sick. “Yes.”
“So, it’s an important number,” he said. “We need to be thoughtful about setting it and we need to make sure everyone else agrees with where we have it set.”
I went back to my cube, set the threshold, clearly annotated that it was just an estimate, and then watched it for a few months. When we got to the holidays we had exactly the data we needed to draw a better conclusion. As traffic went up so did the network load, but not at the rate I’d predicted; it was more of a curve. I adjusted my number, watched my KPIs, then adjusted it again. By the time the holidays were over, traffic had returned to normal levels and I had my answer.
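One way to picture that "guess, watch, adjust" loop is to re-estimate the threshold from observed data instead of from a single straight line. The sketch below is only an assumption about how that could be done: it interpolates between neighboring observations, and both the function name and the sample values are hypothetical.

```python
def capacity_from_observations(samples, target_load_pct):
    """Estimate the volume at which load reaches the target, from real samples.

    samples: list of (transactions, network_load_pct) pairs.
    Uses piecewise-linear interpolation between neighboring samples, so it
    follows a curve instead of assuming one straight line through the origin.
    Returns None if the target load was never reached in the samples.
    """
    ordered = sorted(samples)
    for (tx_lo, load_lo), (tx_hi, load_hi) in zip(ordered, ordered[1:]):
        if load_lo <= target_load_pct <= load_hi and load_hi > load_lo:
            fraction = (target_load_pct - load_lo) / (load_hi - load_lo)
            return tx_lo + fraction * (tx_hi - tx_lo)
    return None


# Hypothetical holiday observations: load climbs faster as volume grows.
holiday_samples = [
    (1_000_000, 20.0),
    (2_000_000, 40.0),
    (3_000_000, 65.0),
    (4_000_000, 95.0),
]
revised = capacity_from_observations(holiday_samples, 70.0)
print(f"Revised capacity: {revised:,.0f} transactions")
```

With these made-up samples, a straight line through the 2,000,000-transaction point would suggest 3,500,000 transactions at 70% load, while the interpolated estimate comes out lower, at roughly 3,170,000. Each new batch of observations can be fed back in to adjust the number again.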
While I would never stake my job on a single metric, I was more than able to confidently state what the capacity of our middleware servers was. Because of this I was able to cut our server utilization by 25% and move onto new hardware with a well-informed load balancing solution.
I was resistant to putting my life, my job, on the line for a metric. It’s a feeling I’m sure many people share when they are first learning to measure themselves. We see the setting of metrics and thresholds as a one-time endeavor and believe that we must be perfect the first time we set them. We never really consider that taking an educated guess is a viable answer, nor do we realize we can guess, then guess again if we get it wrong.
Coffee KPIs give us more than one chance to get it right.
That’s a neat trick, but can they do more for us?
It’s not just important to pick a good threshold but to pick the right KPIs. Of all the KPIs we watch, we may only really care about a few of them while the others are merely background noise or are simply interesting. Figuring out which is which requires some development time.
I was preparing the end-of-year report and wanted to show some kind of combined uptime chart across all areas. Of all the areas, our deployment team had, by far, the best record: 100% availability across the board for twelve months. Our worst team, Network, showed an average availability of 89%.
It was hardly a fair comparison. The network was enormous and made up of anything from ancient switches to modern routing clusters. Also, because of the web-like nature of networks, if a router went down it took down entire swaths of the environment, which destroyed their up-time. It was like saying the guy juggling one ball was a far better juggler than the guy juggling a hundred chainsaws.
In talking with our deployment manager I shared my thoughts with him and asked if he had a better number we could use.
“Percentage of successful automated deployments on the first try,” he said without even thinking.
I asked him what his number was and he said, sheepishly, “It sucks.”
It really did. It was in the mid-70s. It meant that if they deployed to a thousand servers, roughly 250 would fail. If the automated jobs failed, they had to be manually deployed, which held up all the other packages that needed to go out.
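The arithmetic behind that KPI is simple enough to sketch. The function name and the counts below are illustrative assumptions, chosen only to match the "mid-70s" figure in the story.

```python
def first_try_success_rate(total_deployments, first_try_failures):
    """Percentage of automated deployments that succeeded on the first try.

    Hypothetical helper: the name and inputs are illustrative assumptions.
    """
    if total_deployments <= 0:
        raise ValueError("total deployments must be greater than zero")
    successes = total_deployments - first_try_failures
    return 100.0 * successes / total_deployments


# Illustrative: 1,000 servers targeted, 250 needed a manual redeploy.
rate = first_try_success_rate(1000, 250)
print(f"First-try success rate: {rate:.1f}%")
```

A rate of 75% sounds tolerable until it is turned back into a count: a quarter of every rollout lands in the manual queue.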
While he knew about it and was working on it, he wasn’t reporting the number. Why? Because it’s not what we asked for (and yes, we were pretty adamant about server “up-time” as a metric) and because it sucked. He didn’t want to show his ugly number any more than I wanted to commit to a threshold.
The issue was that he wasn’t getting his needs prioritized by the teams that could support him. They weren’t evil, just busy. What we needed to do was find a way to highlight how big the impact was to him then share that with the teams.
We decided to put a new chart in the appendix of his report that included this one and several other new KPIs. They were all clearly labeled as experimental. When it came time to set the thresholds we eyeballed it and went from there.
He and I agreed to just track them and see if he could have any impact. He didn’t want to be held accountable to a KPI he couldn’t do anything about.
That’s the second part of coffee KPIs.
Whether it’s IT or not you need to “show your work.” You need to communicate what you are doing to other teams but, maybe more importantly, you need to communicate what you are thinking about doing. Coffee KPIs are a great way to tell other teams how you are thinking and to illustrate what is important to you.
It’s like saying, “hey, guys. We’re looking at this and it could cause you a bunch of work. You may want to start thinking about what you need to prioritize us.”
After much effort on several teams it turned out he could not only impact his new KPI but he could drastically impact it. We ended up having to create several charts to track all the things he wanted to capture. It was great because we showed not only where we started the year and our intended threshold but where he improved the solution and how far past that threshold he went.
Coffee KPIs allow us to try new things and help us collaborate with other teams.
There is, however, a special power coffee KPIs have. It’s kind of like a super power.
In that year I was talking with a mentor of mine about being overloaded and afraid that I was going to miss something and fail.
He said, “Okay, so you fail.”
“But we’re not allowed to fail,” I said.
“Who says?” I thought maybe his years in IT operations had ruined his cognitive ability.
“Anyone,” I said. “We get fired if we fail.”
“Do you really think that if you make a mistake or get something wrong you’ll be fired?”
“Well,” I said, “yeah.”
He laughed. “Our job as leaders,” he said, “is to get the most out of our teams. Sometimes we do that by pushing you harder and harder. We know you’re going to eventually fail. It’s okay. It tells us what you are capable of.”
This was new to me. Weren’t they depending on me? Didn’t I need to deliver?
“Look at it this way. If you want to be alerted any time your server goes over a set threshold, would you rather get a few alerts when it goes outside of how it normally operates or only get alerted when it’s in danger?”
I thought for a moment and said, “a few.”
“Why?”
“Because I know that the monitoring system is working, for one,” I said, “and for another I can go in and check and see why it’s working harder.”
“So technically it’s failing,” he said.
“Well yeah, but it’s not a big deal.”
He smiled.
That was the third and special power, the super power, of coffee KPIs.
Coffee KPIs give you permission to fail.