On-call incident story

Tell me about a situation that required you to dig deep to get to the root cause. How did you know you were focusing on the right things? What was the outcome? Would you have done anything differently?

這個問題就是觀察你是否有trouble shooting或是debugging的技能所以算是滿好回答的問題唯一要注意的是不要講的太細節不然回答會往技術方面走那就會浪費時間而且走題了點到就好主要描述一下如何做的跟結果就可以

One time I was on call, and we had an incident. This is with the current company that I’m with. The customers are not able to view their transaction, which holds property information such as selling/buying price, who is buyer/seller and agents, e-sign paperwork, etc. That’s a huge issue. This happened after the regular work hours, so a lot of developers were not online anymore, and I could not reach the developer that was on-call,

so I contacted the engineer manager. Since I did not know the root cause, I did not know which service I should roll back, and it’s not the release hours so there should not be any release. While waiting on the engineer manager or on-call dev response, I looked into CI log and noticed an AWS lambda finished. I dig through the log to see if there are any errors that I can find, and I found an error message about marketId attribute is complained about being a string. I checked the code history and then saw that we have a new code to turn marketId from an integer to a string. Because it is new, even if I roll back, the marketId will still be a string, which will not fix the issue. I made a code change to convert the type back to integer, and indicate that this is just an emergency PR to bring back production. We still need to know the reason why we started a lambda to run a script that converted integer to string. The dev on-call came back and said he was testing and deployed to the wrong environment. He thought he deploy to the staging environment and then stepped away. He approved the code change and then we were able to bring production back. We now have production deployment restrictions only to release managers to prevent this from happening again.

I believe that I made the right choice here because I was able to find the root cause, and then recover the production environment. I would repeat what I did because even if I did not find the root cause, then I could provide the information that I found to the developer which can help speed up debugging the process.

套用：不太好套用但是直接把故事換成自身除錯的經驗就可以了

ST: Production出了問題我也找不到其他團隊的on-call人員

A: 我先連絡了對方的經理指出我找不到on call的工程師在等待時間查看日誌與查找每個有可能出錯的地方

R: 找到了錯誤上傳了更新的代碼也執行了一些限制讓以後同樣的事情不會發生