11---
22title: 'A Most Intense Debugging Experience'
33date: '2025-06-26'
4- updated: '2025-06-26'
5- draft: true
4+ updated: '2025-08-03'
5+ draft: false
66tags: ['Career']
77featuredImage: blog/career-debugging-nightmare.jpeg
8+ featured: true
89summary : >
910 I'd been contracting with the company for 3 weeks. Production had an issue, and I was asked to
1011 help.
@@ -35,8 +36,8 @@ the lead engineer.
3536## The Call
3637
3738One day, I was plugging along, exploring the codebase and learning about the product. I received a
38- Zoom call from the Product Manager . He mentioned that they were having some issues with the app, and
39- was wondering if I could help to debug it. He mentioned that the Lead Engineer was out of the
39+ Zoom call from the product manager. He mentioned that they were having some issues with the app, and
40+ was wondering if I could help to debug it. He mentioned that the lead engineer was out of the
4041office, so they're checking in with me to see if I could help.
4142
4243"Sure!" I said. I like to help out, and this seemed like a good opportunity to get to know the
@@ -45,66 +46,83 @@ product better.
4546At first, I thought it might be a simple issue, nothing too serious. I thought perhaps they were
4647giving me a chance to learn more as well as to understand how I approach problems.
4748
48- I quickly realized that this was not the case.
49+ I quickly realized that **this was not the case.**
4950
5051Soon, the Chief Technology Officer joined the call. Then the Chief Operating Officer. Then others...
5152people I hadn't really met yet.
5253
53- I learned that the issue was that users were unable to login.
54+ I learned more about the issue: **users were unable to sign in.**
5455
5556At that point in time, I wasn't too familiar with the company's user base. I didn't know how many
5657users were struggling to sign in, and I didn't know how many companies or businesses that the issue
57- impacted. I honestly didn 't want to know.
58+ impacted... and honestly, I _didn't want_ to know.
5859
59- A big part of staying calm and collected in these situations is to not let the situation get to you.
60- Whether the bug or issue is affecting one customer, one thousand, or one million, the root cause is
61- the same.
60+ A useful strategy to remain calm and collected during these situations is this: **do not let the
61+ situation get to you.** Whether the bug or issue is affecting _one_ customer, _one thousand_, or
62+ _one million_, **the root cause is the same.** The potential complexity of the solution **does not
63+ increase** with the number of users affected! This should be good news.
6264
63- I was able to help them debug the issue, and we were able to get the users back up and running...
64- but wow, what an experience!
65+ Spoiler: I was able to help them debug the issue, and we were able to get the users back up and
66+ running... but wow, what an experience!
6567
66- ## The Debugging
68+ ## The Debugging Process
6769
68- To debug, I was told where the login page was located in the repository, so I could start there. I
69- was guided to a line in the mid-500s where the Product Manager thought the issue might be ocurring.
70+ To debug, I was told where the login page was located in the repository, so I could start there.
71+
72+ I was guided to a line in the **mid-500s** where the product manager thought the issue might be
73+ occurring.
7074
7175I began by adding some server-side logs to see if I could get any more information. I checked out a
72- hotfix branch, committed the log statements and pushed, and the Product Manager was able to deploy
73- this branch to a specific environment for testing. Thank goodness he was able to do that, because I
74- was so fresh in the system that there was no way I'd be deploying anything!
76+ hotfix branch, committed the log statements and pushed, and the PM was able to deploy this branch to
77+ a specific environment for testing. Thank goodness he was able to do that, because I was so fresh in
78+ the system that there was no way I'd be deploying anything!
79+
80+ > **Learning Opportunity #1: Awareness.** When onboarding engineers, review the app from top to
81+ > bottom. Keep it to a high level if necessary, but always cover all the points _from local setup
82+ > through deployment and hosting._ There's no need to hide facets of an application from your team!
83+ > Sharing abundantly could save you in the future. If I had understood how this application was
84+ > normally deployed, I could have found the fastest path to debugging it more quickly.
7585
7686After deploying, the CTO was able to see the logs on production (which I didn't have access to yet),
7787and would paste the log results into our Zoom chat.
7888
89+ > **Learning Opportunity #2: Transparency.** Every engineer on your team should have immediate
90+ > access to the logging system. This should not be difficult to find.
91+
7992I realized we weren't seeing anything meaningful, and most of the logs weren't even showing up, so
8093the problem must be higher up in the file.
8194
8295I added more logs starting at the very beginning of the file, and we deployed again.
8396
8497There! Data. We started to see some details.
8598
86- The login page itself was lacking in error handling, and there were some linear queries that were
87- being made to the database when the page loads and users attempt to log in.
99+ The login page itself was lacking in error handling, and there were some linear SQL queries that
100+ were being made to the database when the page initially loads and then when users attempt to log in.
88101
89102One of the log results after a query showed us that the query was returning an empty result set for
90103a specific field... this empty set wasn't accounted for via error handling, and an empty value broke
91- everything down the line!
104+ everything _else_ down the line!
92105
93106The COO was able to pinpoint the specific data issue in the database, and we were able to fix it.
94107
95- Voila!
108+ **Voila!**
109+
110+ > **Learning Opportunity #3: Test, test, test... automatically.** Proper error handling and
111+ > automated tests would have caught this error long before it became a production issue. Defensive
112+ > programming is a must... always expect the worst-case scenario, and account for it in code. Then,
113+ > automate some tests to replicate those scenarios and ensure they pass regularly.
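
As a rough illustration of that kind of defensive check: the post doesn't show the real code, so this is only a sketch. The `login_settings` table, the `fetch_login_setting` helper, and the `LoginConfigError` exception are all hypothetical, and Python with an in-memory SQLite database is just a stand-in for whatever stack the app actually uses. The idea is that a query helper refuses to pass an empty result down the line and fails loudly instead.

```python
import sqlite3


class LoginConfigError(Exception):
    """Raised when data the login flow depends on is missing or empty."""


def fetch_login_setting(conn: sqlite3.Connection, name: str) -> str:
    """Return a required setting, or raise instead of returning nothing."""
    row = conn.execute(
        "SELECT value FROM login_settings WHERE name = ?", (name,)
    ).fetchone()
    # fetchone() returns None when the query matches no rows. Treat a
    # missing row, a NULL value, and an empty string all as errors.
    if row is None or row[0] is None or row[0] == "":
        raise LoginConfigError(f"login setting {name!r} is missing or empty")
    return row[0]


# Hypothetical data, only so the example runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE login_settings (name TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO login_settings VALUES ('sso_provider', 'example-idp')")
conn.execute("INSERT INTO login_settings VALUES ('tenant_label', '')")
```

An automated test can then replicate the bad-data scenario directly: call `fetch_login_setting(conn, "tenant_label")` and assert that it raises `LoginConfigError`, so the failure shows up in CI rather than on the login page.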
96114
97- Overall, this process took about an hour. It was intense, there was about a dozen people on the call
98- at one point... but my blinders were on. I was focused on the code , getting answers, and finding a
99- fix.
115+ Overall, this debugging process took about an hour. **It was intense,** and there were about a
116+ dozen people on the call at one point... but my blinders were on. I was focused on the _code,
117+ getting answers, and finding a fix._
100118
101119## The Aftermath
102120
103121After the fix, I was able to get a better understanding of the codebase, and I learned more about
104- the actual impact of the outage that I helped troubleshoot. Apparently, this issue was a major one,
105- and the company was grateful for my help.
122+ the actual impact of the outage that I helped troubleshoot. Apparently, this issue was a _major
123+ problem_, and the fix was met with relief. The company was grateful for my help.
106124
107- I found out the next day that the reason the Lead Engineer was out of the office was because he quit
108- without notice after over a decade with the company .
125+ I found out the next day that the reason the lead engineer was out of the office was that he had
126+ quit, without notice.
109127
110128The next chapter is to be written...