Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill • GOTO 2017

  • Published on Nov 6, 2024

Comments • 77

  • @jfltech
    @jfltech 7 years ago +170

    I watch anything with Bryan Cantrill immediately. The guy is basically a young Unix greybeard who understands the deep intricacies of Unix-based operating systems; I wouldn't expect any less from the author of DTrace.

    • @jonathanwagner4600
      @jonathanwagner4600 7 years ago +12

      The real fun happens when you try to find videos that he is in but isn't listed in the title anywhere. You can find these by searching for what is listed on his CV.

    • @tariqilyaskhan2083
      @tariqilyaskhan2083 7 years ago

      Raynold Cherry

    • @JeffRAllenCH
      @JeffRAllenCH 6 years ago +1

      Well, to be fair, he's not so young. (Nor am I anymore.)

    • @chrissherlock1748
      @chrissherlock1748 5 years ago +2

      Yeah, once asked "ever kissed a girl" to a superb Linux developer, then years later got all upset when someone used a non-gender neutral pronoun in documentation.

  • @bocckoka
    @bocckoka 6 years ago +145

    This guy is a stand up computer scientist.

  • @cabc74
    @cabc74 7 years ago +149

    That poor camera operator ;)

    • @thapakazi_
      @thapakazi_ 7 years ago +4

      hehe, guy is more agile.. camera ops playing hide and seek :D

    • @kalleguld
      @kalleguld 4 years ago +5

      @technicalthug I guess he did a fine job - until the picture was cropped

  • @beofonemind
    @beofonemind 1 year ago +2

    Another classic talk by Bryan Cantrill. Legend. His talks probably pre-emptively saved me from disaster :)

  • @afterthesmash
    @afterthesmash 7 years ago +29

    My formative exposure to "how could this ever have worked?" was in 1982, on the original Wang PC, where I was passing NULL as the database pointer into a lookup routine on a simple in-memory data structure _for an entire week of active development_ with my test application returning correct records the whole while. Memory protection? Surely you jest. Turns out the search loop was wandering through low memory, not finding any matched keys, then _accidentally_ aligning to the correct modulo-14 boundary (pranked by the linker yet again) and proceeding to then scan the intended data structure correctly, with me none the wiser, for an entire week. And when I finally did have to chase this bug down, I pretty much gave my entire code base the most thorough code review of all time, before finally noticing my "obvious" coding error.

    • @GeorgeTsiros
      @GeorgeTsiros 3 years ago +2

      ah, the old "start reading, you'll know when you reach it"

  • @piyushpkurur78
    @piyushpkurur78 2 years ago +4

    How in the world can debugging be an interesting thing to talk about? That is what I thought, but this is one of the best talks (if not the best) I have heard about anything. And I got here by chance.

  • @afterthesmash
    @afterthesmash 7 years ago +22

    As a Canadian, yes, I know the Gimli accident. There's a reason the "out of gas" airplane is on its nose cone. The front wheel assembly swings down and forward. For the down part, gravity is enough, but the forward part (against the wind) requires a hydraulic assist, which wasn't available, to reach the locked position. Which is good, because otherwise the plane would have required brakes that weren't available in order to stop in time. As it slid down the runway on its nose cone, its nose cone was engulfed in more sparks (or so I imagine) than any demonic trade-of-paint since Halifax harbour (an event no one who witnessed it from close range ever related).
    "The go-cart races were over on the night we landed, and people were cooking on their barbecues beside tents and trailers," Mr Pearson said. "Their mouths were wide open as our plane went sliding by. But the go-cart races went ahead the next day." A few hours earlier, the plane would have come in mowing down the multitudes. And it was only because one of the pilots recalled Gimli that they diverted there (it was an old military runway, very long, which I don't think was even listed for active runway duty in panic control).
    The reason no other pilot could duplicate the landing is that one of the pilots was also a recreational glider pilot and used some kind of small-craft air spill to shed speed at the end, which only small-craft glider pilots train to perform. "Gosh, I've never done this in a big bird before, but hey, no time like the present!"
    Note that the L1011 crash (Flight 232) is an even more gripping story, one of the best disaster yarns ever, and none of the simulator pilots came _anywhere_ close to a survivable landing in many, many attempts afterward. It was like the Gimli crash, where the "hey, what about Gimli?" moment happened not once, but ten freaking times in a row (from having a third pilot deadheading on the flight, right up to the airport in Sioux City having 285 trained personnel from the Iowa National Guard on duty for a training exercise, to assist in rescuing crash victims, on top of a shift change which doubled airport personnel on site as well).
    "Following Air Canada's internal investigation, Captain Pearson was demoted for six months, and First Officer Quintal was suspended for two weeks for allowing the incident to happen. Three maintenance workers were also suspended. In 1985 the pilots were awarded the first ever Fédération Aéronautique Internationale Diploma for Outstanding Airmanship." Suspension _and_ medal, in true "it was me / let's fix this" Operator1 stoic-heroic fashion.

    • @bcantrill
      @bcantrill 7 years ago +6

      A truly terrific comment! My only addendum: UA232 was a DC-10, not an L-1011. As Gimli might be to a Canadian, 232 is to me as (1) a Denverite (it was a Denver-originated flight) and (2) a Tristar fan and a DC-10 malcontent. On that latter point: I highly recommend "The Rise and Fall of the DC-10" (if you can find it!) as a tale of engineering arrogance (in particular with respect to the cargo door misdesign that caused Turkish 981).

  • @jubalskaggs3025
    @jubalskaggs3025 7 years ago +4

    Very much enjoyed. I feel that those of us who have to deal with these pathological situations in production know the values Bryan is conveying. Profuse kudos to him for raising awareness that we have a duty to create software that can be dissected, troubleshot, and repaired not only in the heat of the moment, but also by future generations who may depend on it.

  • @hopperstreams4487
    @hopperstreams4487 3 years ago +4

    I would love nothing more than to have Bryan as a mentor, he sounds like he would be a top-notch teacher.

  • @TatianaRacheva
    @TatianaRacheva 3 years ago +2

    Another @bcantrill classic I've watched many times, and I need a timestamp TOC to find my favorite spots, so here it is:
    3:04 - if you want, you can stay logged in and edit away, because we're going down anyway
    3:48 - any time somebody says WTF, the bot takes a random sentence from chat over its entire history in which someone used the word f*** and offers it up. . .the other thing that the bot likes to do is to correct anyone saying Linux to, "you mean, GNU Linux?"
    5:05 - please don't be me, please don't be me, please don't be me
    6:18 - it is me. I am become Death, the destroyer of datacenters
    12:10 - 67-hour outage
    13:30 - sleep management and judgment impairment
    16:15 - thought, might never recover, might go full Magnolia on the cloud
    17:40 - we are not post-singularity, despite what the bot will tell you in chat: humans are still in the loop
    18:15 - the curse of the intermediate skier
    19:07 - welcome, Canadian! . . .Gimli glider, glass cockpit, the RAT
    21:25 - it's a 767, it does not glide. . .this is a brick
    24:35 - 40% or more of the microservices boom is inter-organizational strife
    26:30 - an outage in production does not feel like a murder mystery. . .it feels like an active shooter
    28:25 - if you go dark because of load, you have gone dark at the worst possible time
    29:20 - 3 Mile Island: this all started because of some routine maintenance where they were running autovacuum on a Postgres shard
    30:45 - they had not checked the database backups
    32:25 - pilot-operated relief valve UI disaster
    33:40 - "we monitor everything! We alert everything!"
    38:55 - legerdemain: "a debugger never tells"
    39:58 - debugging: you are playing 20 questions
    43:35 - root-cause things: is that a fire in the kitchen, is that the coffeemaker, or is that a fire raging in the coal seam?
    44:48 - the cost of the rewrite is never borne by the technical debt that induced it
    45:03 - 18 months, 18 months, 18 months - crooked founder, crooked founder, crooked founder
    46:20 - Look, gotta restart it! - Look, gotta debug it!
    46:45 - we no longer understand the system; restarting everything all the time: that's called Windows, we did this experiment, and it doesn't work
    48:00 - fatal failure/uncaught exception handling: present your embalmed carcass to Quincy M.E.
    49:15 - you write up the postmortem because it forces you to completely understand it
    50:00 - for programmatic failure, you need to die, for operational failure, you need to handle it
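
One way to read the 50:00 point in code, as a rough Python sketch and not anything shown in the talk: expected operational failures (modelled here by a hypothetical `OperationalError` class) are caught and retried, while anything else is treated as a programmer error and left uncaught so the process dies with its traceback intact. The names `OperationalError` and `call_with_retry` are illustrative assumptions.

```python
import time


class OperationalError(Exception):
    """Hypothetical marker for failures a correct program must expect,
    such as a timeout or a refused connection."""


def call_with_retry(operation, attempts=3, delay_s=0.0):
    """Run operation(), retrying only on OperationalError.

    Any other exception (TypeError, AssertionError, ...) is assumed to be
    a programmer error: it is deliberately not caught, so it propagates
    and the process can die with its state available for a postmortem.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OperationalError as exc:
            # Operational failure: handle it (here, by waiting and retrying).
            last_error = exc
            time.sleep(delay_s)
    # Retries exhausted: surface the operational failure to the caller.
    raise last_error
```

The design choice is the narrow `except` clause: a blanket `except Exception` would also swallow the bugs that, per the talk, should be fatal.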

  • @hopmingu
    @hopmingu 7 years ago +13

    The talk is great, and let's make it even greater...
    Whenever Bryan says 'Um', we drink! Go!

    • @kalleguld
      @kalleguld 6 years ago

      Every time he starts an interposed sentence, you drink. If he starts an interposed sentence within that sentence, you drink twice.

    • @PatrickAustin
      @PatrickAustin 3 years ago

      and...we're dead...

  • @WorldTravelerCooking
    @WorldTravelerCooking 1 year ago

    Regarding early missteps, I think that there is an important approach of structured decision making. Implementing TDODAR or FORDEC in an outage is extremely helpful in terms of breaking those early missteps.

  • @chordfunc3072
    @chordfunc3072 4 years ago +7

    this is better than any stand-up special I've seen all year 😂

  • @cbrunnkvist
    @cbrunnkvist 2 years ago +1

    40:40 explains what separates how hobby programmers approach ops from how professional software engineers do

  • @joeyalfaro2323
    @joeyalfaro2323 2 years ago +1

    I'm watching an ex-bomb maker tell his story and listening to his reasoning. Part of his training was that your first mistake is your last mistake. Another student learned crazy guy he said never do this in your home, always sober. You usually train for the worst-case scenario.

  • @TomAtkinson
    @TomAtkinson 2 years ago +2

    I laughed so hard the tears came down into my mouth!!!

  • @Dygear
    @Dygear 4 years ago +5

    Dear Camera Operator ... Just Zoom Out.

  • @IevgenPyrogov
    @IevgenPyrogov 7 years ago +7

    Around 12 minutes in, Bryan mentions a presentation on a Heroku production outage. I was not able to quickly find it. Any ideas what he is talking about?

    • @TheSrishanbhattarai
      @TheSrishanbhattarai 4 years ago +1

      Found this: status.heroku.com/incidents/151
      not sure if it matches the timeline though.

  • @joshuagies4900
    @joshuagies4900 2 years ago +1

    Love this talk 👍

  • @nealelliott
    @nealelliott 7 years ago +2

    Word on the street is that "friends don't let friends use us-east-1"...

  • @Rene-tu3fc
    @Rene-tu3fc 3 years ago +2

    "do it right the first time"

  • @RandomInsano2
    @RandomInsano2 5 years ago

    He’s right about the pride about the Gimli Glider. I live in Manitoba (same province) and everyone knows about this darn plane.

  • @soberhippie
    @soberhippie 2 years ago

    Man this guy looks like the second bad dude from Despicable Me. And man, do his stories ring a bell

  • @ornous
    @ornous 6 years ago +1

    Brilliant. Thanks!

  • @SiD3WiNDR
    @SiD3WiNDR 3 years ago

    48:38 Who else was yelling "Uhh I think you mean GNU/Linux" at the monitor? ;-)

  • @DanielDugovic
    @DanielDugovic 3 years ago +1

    29:52 "... auto vacuum of a postgres shard..."

  • @joshgibson3618
    @joshgibson3618 3 years ago

    Can't be like General Motors and Microsoft who just want to get the product out the door!! Different departments must communicate with each other!

  • @vmachacek
    @vmachacek 15 days ago +1

    I guess this is before Chaos Monkey was invented 😝

  • @nirmeshkhandelwal7161
    @nirmeshkhandelwal7161 7 years ago +1

    Awesome talk. Keep it up :)

  • @ellenorbjornsdottir1166
    @ellenorbjornsdottir1166 2 years ago

    What was this about the Magnolia?

    • @rolandcrosby
      @rolandcrosby 1 year ago

      "ma.gnolia", a blogosphere-era social bookmarking service best known for irretrievably losing all of their users' data

  • @GeorgeTsiros
    @GeorgeTsiros 4 years ago

    46:24 if you only restart it... the problem will happen again and you will be facing the same problem *again*. I would go as far as to say the *first* time it happened, it was only unfortunate... the second time it happens, it is the fault of the person that decided to simply restart it, not mere chance or bad coding.

  • @GeorgeTsiros
    @GeorgeTsiros 4 years ago

    I am fairly certain that if they could, they would... so why didn't they abort the shutdown?

  • @TomerBenDavid
    @TomerBenDavid 7 years ago +1

    Great show and lecture!

  • @falcon20243
    @falcon20243 2 years ago

    6:20 I have become death, the destroyer of datacenters.

  • @SaravanaThiyagarajs
    @SaravanaThiyagarajs 5 years ago +3

    47:55 "It's your duty, to humanity, to die"

  • @MrTweetyhack
    @MrTweetyhack 1 year ago

    This is a rm * -rf moment, can happen to anybody

  • @nickmeier1571
    @nickmeier1571 3 years ago

    47:58 if you have an uncaught exception, it is your duty to humanity to die 😂

  • @blenderpanzi
    @blenderpanzi 4 years ago

    How long was the outage?

  • @mw3653
    @mw3653 2 years ago +1

    The camera man for this conference needs to be fired.

  • @TomerBenDavid
    @TomerBenDavid 7 years ago

    Under fire indeed :)

  • @beavis3141
    @beavis3141 7 years ago

    Thanks d00d

  • @sahanasrivats7950
    @sahanasrivats7950 7 years ago +10

    Alright, I am 25 minutes into the talk. The speaker is STILL setting up the topic!

    • @apteryx01
      @apteryx01 7 years ago +1

      I'm 3 minutes into the talk. I guess I'll give up now.

    • @rostislavsvoboda7013
      @rostislavsvoboda7013 5 years ago +1

      I'm at 16:16 now. You just saved me 9 minutes of my life. THX.

  • @melonangie
    @melonangie 5 years ago +1

    22:48

  • @ace100hyper3
    @ace100hyper3 7 years ago +4

    This whole incident and the way it happened reeks of bad engineering. The fact that someone nuked a data center without prompt by forgetting -n; somebody didn't think what will happen if the main node gets rebooted but did program a stupid bot and all that... A typical example of UNIXOID stupidity that will cause sleepless nights. And they plan to solve it by better debugging...

    • @THB192
      @THB192 7 years ago +14

      Ace100 Hyper Firstly, they DID think of what would happen. Secondly, yes the command was poorly designed to allow this to happen (it's rm -rf / all over again), but that's not UNIX's fault, that's the fault of the engineer responsible for the design. And no, UNIX doesn't encourage this: Yes, confirmation is uncommon, but it's generally expected to have the command DEFAULT to doing something you'd want. And that you wouldn't write a command that defaults to taking down your entire DC. That's the bad engineering.
      Thirdly, the bot was written by someone who probably wasn't even involved in this event, or this design. It's got nothing to do with this.
      tl;dr, go back to your emulated ITS machine and your lispms until you can realize that while bad engineering happens, it's not always UNIX's fault.

    • @afterthesmash
      @afterthesmash 7 years ago +5

      Many times this kind of rough edge is found in a script which exists as a placeholder for a proper solution that you haven't quite got around to finishing yet (e.g. a control daemon that simply wouldn't allow the fatal command to run without the explicit option --smellslikevictory). You strike me as naive about what it takes to keep the Hydra's every neck-stump properly bevelled in all scripts, at all times.
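
The "safe by default" design argued for in this thread can be sketched in Python with `argparse`. This is an illustration only, not the tool from the talk: the program name `reboot-nodes`, the `--execute` flag, and the `plan` helper are all hypothetical. The point is that omitting a flag leaves the command in dry-run mode, instead of a forgotten `-n` making it destructive.

```python
import argparse


def build_parser():
    """Build the CLI parser for a hypothetical 'reboot-nodes' command."""
    parser = argparse.ArgumentParser(prog="reboot-nodes")
    parser.add_argument("nodes", nargs="+", help="names of nodes to reboot")
    # Destructive behaviour is opt-in: without --execute nothing is rebooted.
    parser.add_argument(
        "--execute",
        action="store_true",
        help="actually reboot; the default is a dry run",
    )
    return parser


def plan(argv):
    """Return one human-readable line per node describing what would happen."""
    args = build_parser().parse_args(argv)
    verb = "rebooting" if args.execute else "would reboot"
    return [f"{verb} {node}" for node in args.nodes]
```

With this shape, `plan(["db1", "db2"])` only describes the work; the operator has to type `--execute` on purpose before anything destructive is reported as happening.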

    • @ellenorbjornsdottir1166
      @ellenorbjornsdottir1166 2 years ago

      This was a UI engineering fail. The debugging was not the moral of THAT story.

  • @rostislavsvoboda7013
    @rostislavsvoboda7013 5 years ago +4

    26:50 "production is war, war is hell" uhm this guy has an ego size of a small Trump. /quit self-staging

  • @tenminutetokyo2643
    @tenminutetokyo2643 4 years ago

    Debugging is for failed programmers. If you can’t write it correctly the first time, find a new line of work.

    • @ThePandaGuitar
      @ThePandaGuitar 4 years ago +9

      LOL. i bet u just learned to program in 27 days.

    • @tenminutetokyo2643
      @tenminutetokyo2643 4 years ago

      @ThePandaGuitar I have been programming since Apple II and have over 30 years' experience.

    • @affegpus4195
      @affegpus4195 2 years ago +1

      If you ain't debugging, then you haven't written any large or complex enough software.

    • @tenminutetokyo2643
      @tenminutetokyo2643 2 years ago

      @affegpus4195 I can assure you I have. Is 2-3 million lines large and complex enough for you? How about the entire audio editing suite for PlayStation 2 - is that large and complex enough for you? "What does it say about the quality of your engineering that you have to have all this testing" - Steve Jobs. If you write buggy code you suck at engineering.

    • @ellenorbjornsdottir1166
      @ellenorbjornsdottir1166 2 years ago +1

      People do fail. The difference between a failed and successful programmer is whether they can recover.