Oh no. Pings stopped.
A cold breeze raised the hair on the back of my neck while I stood motionless in front of rack 5 in aisle 7. I much preferred standing in the hot aisle, but this is where the KVM console was. The screen flickered as I adjusted the angle of the monitor. This thing was ancient, but at least I had something to work with.
Thankfully, pings worked outbound from the LAN right away. What a relief. It’s truly a satisfying feeling when it works the first time.
This network had a massive DMZ, however, and that’s what worried me about this cutover. My customer had several United States agencies pulling massive amounts of data pretty much every hour, and coordinating this two-hour change window had taken months.
This had to work now.
DMZ_ASA_01#ping 8.8.8.8
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 8.8.8.8, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
DMZ_ASA_01#
I tried again and held my breath. OK, maybe it’ll come back in a few seconds… any time now… any moment… any second…
I’ve done enough cutovers to know that it isn’t unheard of to have some issue after making even the smallest change, but tonight was different. The change window my customer gave me was 2am to 4am, and I was already exhausted.
I wasn’t able to sleep late that day in order to be fresh and awake for tonight’s work. That’s not realistic for me. That morning I had a 9am infrastructure team meeting, then I had to finish up some documentation for a different customer. My son was sick, so my wife needed me to take my daughters to piano lessons followed by a sprint to ballet class.
In my world, dads don’t get to sleep in.
I shook it off. I had to stay focused.
When it comes to technology, I tend to second-guess myself, so I carefully crafted a script for tonight’s cutover. When I say “script”, what I really mean is that I wrote out in Notepad the step-by-step configuration changes I was making along with a few rollback procedures. It was a far cry from what actual technology professionals would call a script. Nevertheless, I felt a sense of comfort knowing I had some semblance of a plan to follow.
The data center lights shut off again. I must have been standing very still for a long while. I waved my arms around toward the sensor in the ceiling like some drowning swimmer waving to shore, and the cold, sterile, fluorescent lights flickered back on row by row.
Still no pings, but I was only half an hour into the change window. I wasn’t rolling back yet.
My troubleshooting method at that time can best be described as a loose assortment of show commands and frequent pinging.
DMZ_ASA_01#show ip int brief
Interfaces are correct and up/up.
DMZ_ASA_01#show ip route
Looks good.
DMZ_ASA_01#show ip cef 8.8.8.8
Yep. That’s going out the correct interface.
Maybe the problem is upstream from me.
DMZ_ASA_01#traceroute 8.8.8.8
Type escape sequence to abort.
Tracing the route to 8.8.8.8
VRF info: (vrf in name/id, vrf out name/id)
  1  *  *  *
  2  *  *  *
  3  *  *  *
  4  *  *  *
  5  *  *  *
  6  *  *  *
  7  *  *  *
  8  *  *  *
  9  *  *  *
 10  *  *  *
Ok kill the trace. It looks like it’s probably me.
By now I was just throwing random show commands into the firewall and tracing the cable to the ISP’s router a couple racks away hoping to see that their link lights weren’t blinking. After seeing they were lit and green, I started to sweat a little bit.
This was stupid. I was supposed to be some sort of senior network engineer or something, and I couldn’t figure this out right away?
I yearned for a cup of coffee but wasn’t allowed to bring any liquids into the data center.
I sat down crouched against the wall with my computer in my lap and scrolled through the configuration in notepad.
My stomach growled again.
With one hour left in the change window and still no connectivity, my customer, who was monitoring connectivity, sent an email looking for an update. I ignored it and felt a strange sense of gratification since my justification for ignoring it was really good.
At about 3:20am I saw a couple more emails from my customer. They wanted an update and wanted to know if I planned to roll back soon. They weren’t interested in troubleshooting. Their network needed to be back online by 4am. Period.
I knew the rollback would take only minutes, though, so with a deep sense of undue shame, I sent my co-worker a message at 3:30am asking for help.
He must have been awake or had his phone right by his bed, because in only two or three minutes I saw a message from him.
“What’s up?”
I explained the issue, tethered my laptop to my cell, and had him remote in. Within five minutes or so he asked if the DMZ network, a public address space, needed to be exempted from NAT.
Holy crap. He was right, and how in the world did I miss that?
I made the changes and tested.
DMZ_ASA_01#ping 8.8.8.8
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 8.8.8.8, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 8/12/28 ms
DMZ_ASA_01#
I sent a quick reply to my customer and asked if they could check everything on their end. Success across the board. I thanked my co-worker and promised him lunch the next day.
To everyone involved, this was a successful cutover with no major issues. I maintained my status as the network master. But the reality is that I didn’t see something that I should have, and I didn’t know how to interpret that.
Did this mean I didn’t know what I was doing? No, I don’t believe that.
Did it mean I was just careless? Possibly.
Could it mean that I needed to kill my pride, get a second set of eyes on my configs before starting a cutover, and be quicker to ask for help when I needed it? Definitely. And this is supremely ironic because inwardly I second-guess every technical decision I make.
In those days, and to an extent still today, I would hunt for an answer to a problem for a long while before asking a co-worker for help. Probably due simply to foolish pride, I often wasted too much time on an issue that wasn’t all that important just to hide the fact that I didn’t know something.
This cutover project was maybe five or six years ago – I was about 33 years old at the time. Surely that’s a mature adult by any measure, but for me, and specifically in the context of being a network engineer, I had to grow up just a little bit more.