Random Thoughts About 5 Days Downtime

Who said the network never fails?

Timeline

22/09/2015 That dreaded late call: it's 22:00 and one of the product owners calls. "We have a problem with the newly deployed features, something does not work, can you help?" I get a summary of the problem and try to give some pointers on how to solve it.

But one hour later, OK, it's time to head over there. I can't follow what they are describing over the phone and need to see more information than I can get remotely.

Once I see the exact error, it turns out to be simple: some sources were not deployed to Production. I talk with the developer who stayed for the deployment; he knows what has to be done, and tomorrow I'm on leave. We quickly decide, with the product owners, to call it a day and wait until tomorrow.

Back home, it's really nagging me: we should have caught that on our QA environment, and I take this as a personal failure. I fix the error. At this point I hate Dependency Injection. It's 02:00, the fix is deployed to QA & tested, and I send a mail to the system team, dev team & product owners saying it's fixed. They just need to deploy it to Production.

Meanwhile, we're out of business externally and internally, for a few hours.

23/09/2015 I'm on leave today, caring for the children. Then another call, in the late morning.
Product Owner: "It does not work, Production is still down, but it seems it's something else."
Me: "I need to talk to the developers and the system admins to get the exact error."
Developers & System Admins: "Database lock, caused by a transaction triggered by this operation, on Production & QA."
Me: "A database lock there? On Production & QA? That should not happen, we fixed that a long time ago."
I walk through the mental model of the code. OK... so this operation might indeed cause a transaction, but a long-lived lock? I give instructions on how to change the code (a sketch of this kind of code path follows below).
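As a hedged illustration (the table, method and service names here are hypothetical, not our actual code), this is the kind of pattern where an operation legitimately opens a transaction but ends up holding a row lock far longer than the database work itself:

```csharp
using System.Data.SqlClient;
using System.Transactions;

class OrderProcessor
{
    // Hypothetical example: the UPDATE takes a row lock, but that lock is only
    // released when the TransactionScope is disposed, so any slow or hanging
    // work inside the scope keeps other sessions blocked on that row.
    public static void ProcessOrder(string connectionString, int orderId)
    {
        using (var scope = new TransactionScope())
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "UPDATE Orders SET Status = 'Processing' WHERE Id = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", orderId);
                conn.Open();
                cmd.ExecuteNonQuery();      // row lock acquired here
            }

            NotifyExternalSystem(orderId);  // slow call: the lock is still held

            scope.Complete();
        }                                   // lock released only here, on dispose
    }

    // Stand-in for whatever slow external work happens inside the transaction.
    static void NotifyExternalSystem(int orderId)
    {
        System.Threading.Thread.Sleep(30000);
    }
}
```

In a shape like this, moving the slow call outside the scope (or completing the transaction before it) keeps the lock alive only for the duration of the UPDATE.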

They fix it, and around tea time I get another call...
Developers & SysAdmins: "OK, so the operation that caused a lock... does not anymore."
Me: "Cool."
Developers & SysAdmins: "But we have other locks on the database, caused by other operations."
Me: "But those haven't changed for ages."
Developers & SysAdmins: "By the way... Production, QA and development are DOWN."
At this point, having not slept enough, I don't understand what that means. Fatal error: not sleeping enough.

Meanwhile, we're still out of business externally and internally for 1 day.

24/09/2015 Yesterday the SysAdmins, Product Owners & Development team decided to allow internal access to the systems, even though they are not fully functional. Some things do work, sometimes. Externally, minimal access to the system is granted. They also implemented a manual process to feed other systems with the data they need from ours, so that the business can still run, albeit suboptimally.

Everybody is searching for the problem causing those database locks. The first thing we do is list what has changed in the last few days. Our deployment from 2 days ago, of course... but nothing in that code, or in the deployment process itself, could affect all environments in this way.

Checking the logs... we find something like "the process could not access the Transaction Manager".

Oops... the transaction manager in this case means DTC (the Distributed Transaction Coordinator). At this point, well, I already hated distributed transactions.

Our web & app servers cannot access the Distributed Transaction Coordinator. This points to a change in the network or the firewall... This is checked: no change at any level of the networking infrastructure can be found. We check in real time whether all traffic is going where it should and is not blocked by any firewall rule. No, everything seems to be OK. We search, have friendly arguments, and don't find any reason for all environments to be down. By the end of the day, we find that third-party systems are also having the exact same issue with DTC. We check: those systems haven't changed either.

Time for Plan B. With the product owners, we review the list of the top features that have to be restored as soon as possible. Development will look for a way to avoid any distributed transactions in those code paths.

System admins are still looking for the reason the DTC communication is blocked. I find DTCPing cumbersome to use: you have to open the application on both sides, and every time you ping you have to close & reopen it on both sides.

I quickly write a console app that checks database connectivity & triggers a DTC transaction. Way quicker for the sysadmins to check.
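A minimal sketch of such a checker, assuming a .NET console app with two SQL Server connection strings passed on the command line (an outline of the idea, not the exact tool we used): enlisting two open connections in one TransactionScope promotes the transaction to a distributed one, so success proves both database connectivity and working DTC communication, while a failure at the second connection points at DTC rather than at the database.

```csharp
using System;
using System.Data.SqlClient;
using System.Transactions;

// Quick probe: enlists two open SQL connections in one TransactionScope,
// which promotes the local transaction to a distributed one handled by MSDTC.
// If the SELECTs succeed the databases are reachable; if the promotion fails,
// the problem is DTC communication, not database connectivity.
class DtcCheck
{
    static int Main(string[] args)
    {
        // args[0] and args[1] are two connection strings, e.g. two SQL servers.
        try
        {
            using (var scope = new TransactionScope())
            using (var connA = new SqlConnection(args[0]))
            using (var connB = new SqlConnection(args[1]))
            {
                connA.Open();
                connB.Open();   // second open connection: escalation to DTC

                using (var cmdA = new SqlCommand("SELECT 1", connA))
                using (var cmdB = new SqlCommand("SELECT 1", connB))
                {
                    cmdA.ExecuteScalar();
                    cmdB.ExecuteScalar();
                }

                scope.Complete();
            }
            Console.WriteLine("Database connectivity and DTC escalation: OK");
            return 0;
        }
        catch (Exception ex)
        {
            Console.WriteLine("FAILED: " + ex.Message);
            return 1;
        }
    }
}
```

Run from any web or app server, something like this simply prints OK or FAILED, which is all the sysadmins need while testing firewall changes.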

Meanwhile, we're still out of business externally for 2 days; internally, well, it kind of works.

25/09/2015 One of the system admins has a great idea: he moves one of the internal web servers from the firewall's web server zone to the SQL server zone. Guess what: it can fully communicate with the DTC servers located in the SQL server zone.

It is definitely a networking-related problem, but why?

Our plan B proves to be difficult.

Meanwhile, we're still out of business externally for 3 days; internally, well, it kind of works.

26/09/2015 Everybody has slept; we abandon Plan B. We focus on the network. We don't find anything: the firewall logs say that no packets are dropped. It's getting late. We are losing hope.

Since moving a web server to the SQL zone seems to correct the problem, we are thinking of moving all of them. But this proves difficult as well: some of the web servers & app servers are behind a load balancer, which we would have to reconfigure too. It's too many changes, too error-prone...

Time for Plan B(s)

  • SysAdmins, instead of moving the servers to another zone, will simply add a network card and put this NIC in the SQL server zone. We tested that on one server & it actually works. The bonus is that we can apply this solution to the other systems having problems as well.

  • Development will remove any trace of possible distributed transactions from the code (see the sketch after this list). The downside: only our system will be available, and we don't know yet whether we could suffer data loss & inconsistencies in the data.
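For the second plan, a rough sketch of the direction under assumed, illustrative names (Orders and OrderAudit are not our real tables): keep each unit of work on a single SqlConnection with a local SqlTransaction, so no second resource ever enlists and nothing escalates to the Distributed Transaction Coordinator.

```csharp
using System.Data.SqlClient;

class LocalTransactionOnly
{
    // Illustrative only: both statements run on the same connection and the
    // same local SqlTransaction, so SQL Server commits them on its own and
    // the Distributed Transaction Coordinator is never involved.
    public static void ConfirmOrder(string connectionString, int orderId)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                Exec(conn, tx, "UPDATE Orders SET Status = 'Confirmed' WHERE Id = @id", orderId);
                Exec(conn, tx, "INSERT INTO OrderAudit (OrderId) VALUES (@id)", orderId);
                tx.Commit();    // plain local transaction, no DTC
            }
        }
    }

    static void Exec(SqlConnection conn, SqlTransaction tx, string sql, int orderId)
    {
        using (var cmd = new SqlCommand(sql, conn, tx))
        {
            cmd.Parameters.AddWithValue("@id", orderId);
            cmd.ExecuteNonQuery();
        }
    }
}
```

The catch, as the downside above says, is that anything outside that single connection (another database, another system) is no longer covered by the transaction, which is exactly where the risk of data loss & inconsistencies comes from.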

We are tired; we will do both in parallel tomorrow.

Meanwhile, we're still out of business externally for 4 days.

27/09/2015 Everybody is rested. SysAdmins are implementing the bypass, development is changing the code.

The sysadmins win... they finish first.

Meanwhile, we're in business externally and internally.

But we still don't know what the root cause is.

28/09/2015 All systems are up and running; we take the time to check anything that might have hung while we were down. Easy enough: we use message queuing at key points of the system, and the messages have been patiently waiting for us to re-enable the consumers.

Meanwhile, we're in business externally and internally.

29/09/2015

Meanwhile, we're in business externally and internally.

Conclusion

Many, many thanks to Christophe, Olivier, Pieter, Pierre, Tom, Inne. Everybody was open to discussion, trying to find solutions, helping to narrow down the problem & finding different ways to bypass it. No one pointed fingers at anyone.

Many, many thanks to management: they have not been breathing down our necks. They stayed out of sight, because the product owners regularly provided information on what was going on & their understanding of the problem.

Many, many thanks to our users: they stayed out of the way. I would have expected them to get nervous after the first day, but no, they were supportive and actually tried to understand what the problem was.

On the human side, although it was an exhausting 7 days, this was a real eye-opener. Quite a nice surprise to see that we acted as one team.

Post-Scriptum

The problem was [....]

TODO

  • Add business explanation drawing of the bypass (the one with the roadblock, that you can use to explain it to your non-IT-savvy friends)
  • Add IT explanation drawing of the bypass