Amazon S3 Outage: We’ve All Been There

I’ve been thinking a little bit about the Amazon S3 incident. Not really the incident, actually, but the responses to it. More than once I read something along the lines of “I’m sure that guy got fired” with regard to the engineer who entered the fatal command.

Sure, that’s kind of funny for a quick tweet or in the greater context of a blog post on change control, but I’m not sitting at my desk shaking my head right now. Instead, I’m reminded of the times I did the exact same thing (on a much smaller scale) and will probably do it again.

Here are a few of my most memorable:

  • Remotely shutting down the only WAN interface on an edge router in another country.
  • Accidentally flipping the second and third octets in a script for modifying production site-to-site VPNs.
  • Applying a service policy to the wrong interface and killing all phone calls.
  • Forgetting to cancel a pending reload in after a successful change (more on that safety net just after this list).
  • I even destroyed several physical ports in a core switch by accidentally ripping out cables while trying to make the rack, which wasn’t bolted to the floor, line up better with the floor tiles.
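
For anyone who hasn’t used that last safety net: a scheduled reload is cheap insurance for remote changes, because if you cut off your own access, the box reboots to its saved config on its own. Here’s a rough sketch of the pattern on Cisco IOS (the 10-minute timer is just an example):

    ! Schedule an automatic reboot before starting the risky change
    Router# reload in 10
    ! ...make the change; if you lock yourself out, the router reloads
    ! with its saved startup-config in 10 minutes and you can get back in...
    ! If everything still works, save the config and cancel the reload
    Router# copy running-config startup-config
    Router# reload cancel

The embarrassing part, of course, is when the change goes fine and you move on to the next task without ever typing reload cancel.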

I’m not really thinking about the S3 incident itself, and I’m not too concerned with how a manager should handle this type of situation with someone on their team. I don’t know that much about either of those things. What does come to mind is that I’ve been there to one extent or another more times than I care to admit. In the context in which I work, that could just as easily have happened to me.

I’ve never taken down huge swaths of the internet or services for thousands of customers, but I have had those moments in which I went from confidently hacking away at the CLI to staring wide-eyed at my screen in a cold sweat and with a sinking feeling in my stomach.

I guess this is just part and parcel of the human interaction we still have with our infrastructures and why we have change control. In any case, I’m not judging the engineer who took down S3 accidentally. I’ve been there in my own small way, and I’m sure we all have to one extent or another.

Feel free to leave some of your most memorable blunders in the comments or tweet them out.

14 thoughts on “Amazon S3 Outage: We’ve All Been There”


  1. I’ve been there. I got an embarrassing call from a vendor asking why they were seeing links to my core switch bouncing. That’s when I realized I had forgotten to do reload cancel after making some changes….
    Good thing it was waay after hours.


  2. Telnet’d to the VTP server (also the core switch), created a new VLAN, then *thought* I’d telnet’d to the access switch where I wanted to change the VLAN on some ports, and made the VLAN membership change.

    Was *actually* still logged onto the core switch so took down the internet and MPLS links for the data centre.

    The clue was the helpdesk wallboard lighting up straight away, cue cycling back through the command history to work out what I’d done!


  3. It’s refreshing reading your article, Phil; not many people care to talk about previous mistakes. The biggest outages I’ve caused in the past have not always been because I’ve broken something but because I’ve fixed something.

    One example of that was an IPsec tunnel, meant to provide a backup layer 2 DCI link, that was failing to establish. Unbeknownst to me, spanning tree had not been configured properly on the switches at either DC, and almost immediately after I fixed the tunnel issue the link came up, caused a layer 2 loop, and levelled both DCs. I had to ‘un-fix’ the tunnel to break the loop and restore service.

    I’m not sure if those types of incidents make us stronger engineers or just shave a bit more time off our life spans?

    And no, the stretched L2 between the DCs was not my idea!


    1. I would never think the stretched L2 design was your idea hahaha! I think you’re right in both ways – those incidents definitely shave a little bit off our life spans, but I think they also make us better engineers 🙂


  4. I had a change to create a new VLAN at a large automated warehouse, and I couldn’t remember whether you had to add the VLAN to the port channel or to the switchports, so I had a bright idea to do both at the same time: ‘int range t1/1-4,po10’…. It didn’t work; the port channel unbundled and spanning tree reconverged while the port channel re-bundled. I thought I’d gotten away with it until a couple of the server guys who were monitoring the site during the change came over and asked me if the network was down.

    It turns out the site had some old LeftHand iSCSI storage for all the servers, connected via the network. When the network was partitioned, both storage clusters became active and data got corrupted, and when the network reconverged the storage took itself offline! It took the server guys the rest of the day to restore everything and bring the site back online.


  5. Forgot to add the word “add” in the switchport trunk allowed vlan command on one of the MDF switches at a remote site. Luckily there was no production at the time, and there was an engineer there who could set up a workstation that I could RDP to.
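
    Context for that one: without the add keyword, the command replaces the trunk’s entire allowed-VLAN list instead of appending to it, which is how it can prune the very VLAN you’re managing the switch over. A quick illustration, with VLAN 30 standing in for whatever was being added:

        ! replaces the allowed list – only VLAN 30 is left on the trunk
        Switch(config-if)# switchport trunk allowed vlan 30
        ! appends VLAN 30 to whatever is already allowed
        Switch(config-if)# switchport trunk allowed vlan add 30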


  6. A couple of quickies…

    – “debug ip packet” on a busy, remote WAN edge 7206VXR with a T3. Killed it immediately. I only had to drive across town to restart it, thankfully, but a campus of about 3,000 people wasn’t getting outside of its local networks until I got there.

    – interface range gi0/1 – 14 when I meant 4, not 14. Went on to apply a shiny new VLAN to 10 ports that were already doing something else. Made for a happy 10+ minutes when some servers and storage fell off the wire until I realized what I’d fat-fingered.


  7. With you on that one. I’ve been fortunate in this area, as an earlier, although brief, career as a professional pilot taught me some things that nicely apply to NE work. One flight instructor taught me to use the mental flow “identify, verify, execute” before making any change to an aircraft system because, as he put it, “flying isn’t inherently dangerous; just darn unforgiving of errors.”

    But caution notwithstanding, I’ve had a few “fun” incidents. Like when I learned that you shouldn’t have a critical device’s management window up at the same time you’re going to do something more invasive on another device. 🙂

    I was troubleshooting an issue with the VPN connection to a remote site and had the active HQ edge router in one window and the branch router in the other. A reload of the branch router was the next step. One has to bear in mind VTY exec-timeout settings.

    I looked down and typed “reload”, looked up, and as my little finger pressed the “return” key I saw a hostname I didn’t want to see. The VPN router session had timed out just as I looked down, and I had typed reload into the edge router. Fortunately the total outage was short-lived, as HSRP kicked the connection over to the other edge router. But that was a rather long minute as “is the Internet connection down?” queries went up around the office.


    1. Thanks for that, Mike. I’m sure we could go back and forth with example after sorry example haha! It’s cool to see how others’ pre-networking experience impacts their career as a network engineer, too.

