Fix bugs by reading the room

On a previous project, my team and I were working hard to get ready for a release. Because our application was used in an enterprise setting, the release was scheduled for the end of the week: a weekend rollout would affect the fewest users.

We scheduled the deployment for Friday (yes, I know this is a bad idea) and were working towards it.

On Thursday, I woke up sick. For the sake of my health, I didn't work and spent the day resting in bed. For my team, however, a new critical bug had surfaced.

In this application, we had what Kent calls a Soul Crushing Component. This component was complex and could be in lots of different states.

To make matters worse, features for this component were added over the course of a year, by many different developers. These developers didn't always follow the same state management pattern laid out at the beginning, so state was being managed in many different ways, including Redux, local storage, React Hooks, and more.

It was the component you ran from.

[Image: a muppet running away]

And of course, the bug was in this component. In essence, if you followed a specific sequence of steps, the component would end up in an impossible state and get stuck looking as though it were still loading.

This was marked as a show stopper for the production deployment, and my team quickly tried to find a solution.

Since I was out sick the day this surfaced, I don't know the details of every attempt to solve it.

When I came into work the next day, however, we were still working on the bug. I joined a war room discussing it and was quickly caught up on what had been found so far:

You get stuck in an impossible state: data has been returned from our API, but the component still shows loading.
It only happens when you follow a specific path (~5% of users would run into this).
As a workaround, you can refresh the page and it works again (which was unacceptable to the client)

A small team of developers was already attempting to fix this. They were trying to work out how the impossible state came about and how they could make it truly impossible. After spending many hours on it the day before, they had no fix yet, but they had come up with a design document describing how state is managed in this component.
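To make the "impossible state" idea concrete, here is a minimal sketch of the difference between tracking loading and data as independent values and modelling them as a single tagged union. This is not our actual component; the names `FlagState`, `RequestState`, `isImpossible`, `receive`, and `showsSpinner` are made up for illustration.

```typescript
// Independent fields: nothing stops isLoading and data from both being set.
interface FlagState<T> {
  isLoading: boolean;
  data: T | null;
}

// Tagged union: each status carries only the fields that make sense for it.
type RequestState<T> =
  | { status: "loading" }
  | { status: "loaded"; data: T };

// Detects the contradictory combination: data has arrived, yet we still say "loading".
function isImpossible<T>(state: FlagState<T>): boolean {
  return state.isLoading && state.data !== null;
}

// Receiving data produces a whole new state, so "loading" cannot linger next to it.
function receive<T>(data: T): RequestState<T> {
  return { status: "loaded", data };
}

// The spinner depends on a single source of truth.
function showsSpinner<T>(state: RequestState<T>): boolean {
  return state.status === "loading";
}
```

With the first shape, every piece of code that writes state has to remember to keep the two fields in sync. With the second, the contradictory combination simply cannot be expressed.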

In that war room, I recognized that this bug was putting the deploy in jeopardy, which would be bad for everyone involved. Seeing the work that had already been done, I started thinking about alternative solutions. We needed a fast fix to make the deployment on time.

I didn't want to sink a lot of work into refactoring the state management, because we did not have enough time to fix it properly. Taking that approach would most likely mean the whole team working the weekend once a new bug turned up in production because of our rushed changes.

Then I zeroed in on one thing that had already been found:

As a workaround, you can refresh the page and it works again

So I forced a refresh any time you followed this path. It wasn't super pretty, and it was a little jarring for the user for about a second. But it was a sure-fire fix that would not cause more issues once deployed.
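A forced refresh like this can be very small. The following is a rough sketch rather than the code we shipped: the step names in `PROBLEM_FLOW` and the helpers `endsWithProblemFlow` and `refreshIfStuck` are hypothetical, and it assumes the app keeps some record of the steps a user has taken.

```typescript
// Hypothetical sketch: the step names and the flow below are invented for illustration.
type Step = string;

// The sequence of steps that (in this sketch) leads to the stuck loading state.
const PROBLEM_FLOW: readonly Step[] = ["open-details", "edit", "cancel"];

// True when the most recent steps match the problematic flow exactly.
function endsWithProblemFlow(history: readonly Step[]): boolean {
  if (history.length < PROBLEM_FLOW.length) return false;
  const tail = history.slice(-PROBLEM_FLOW.length);
  return tail.every((step, i) => step === PROBLEM_FLOW[i]);
}

// Calls `reload` when the user has just completed the problematic flow.
// `reload` is passed in so the decision logic can be tested outside a browser.
function refreshIfStuck(history: readonly Step[], reload: () => void): boolean {
  if (!endsWithProblemFlow(history)) return false;
  reload();
  return true;
}
```

In the browser, the callback would be something like `() => window.location.reload()`. Keeping the check separate from the reload makes the workaround easy to delete once the underlying state management is fixed.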

I was able to write this fix within minutes, had a PR open, and we were on schedule to deploy by lunch time.

Reflecting on this, I learned a lesson that I can apply to many more situations in the future: Read the room.

I understood the importance of this deployment, how many users were affected by the bug, and the risk that a "correct" fix written under time pressure would cause more issues.

By shipping my simple quick fix, we were able to dedicate more time to refactoring the entire state management of this component for the next release. We took our time, and the result was a much cleaner solution that is actually maintainable.

What do you think? Let me know on Twitter