I'd like to see an episode of "Things KSP Doesn't Teach" about instrumentation. How air/spacecraft instruments work, their limitations and quirks, and how they can fail.
I'll second that. The Soviets had a number of uncrewed vehicle losses because they used ionic sensors to determine the orientation of the vehicle, which would fail on occasion. On the other hand, the gyroscopes aboard Apollo 13 held true despite being pushed well outside their comfort zone. Some sort of video about orientation sensors would be very enlightening.
@@ferdievanschalkwyk1669 As a former Formula 500 driver, I can tell you the fastest lap times were NOT the shortest-distance lines around the track. A computer simulation takes the shortest distance, while the faster drivers took a slightly longer, but faster, path that defied logic.
That's what happens when you keep changing the software specs of a project. It's a bit hard to believe that they changed the landing site without rerunning the simulations.
I feel like something in the sensor processing design isn't fundamentally robust enough if it can be this easily confused by real terrain features. Maybe they can add a second radar or lidar sensor for dissimilar redundancy or to differentiate unexpected yet real inputs from sensor faults. We all know what happens when you run a safety-critical algorithm on a single AoA sensor...
Exactly! Mission creep eats into the project timeline, and system tests degrade into delta testing for success instead of system testing for non-failure.
In my personal experience, people care insufficiently about aerospace software. I worked at a software company that worked for ESA and we were pretty much always ignored (e.g. in all presentations of our local space agency). But when some other company made a screw for a satellite, it was plastered all over their presentations. There were literal delegations going to take a look at the space-screw-producing machine. Such an interesting visit, you see, to a hall with machining equipment and clean rooms - that's the "space stuff" in people's minds. Something you can touch and see. How do you brag about a company with people sitting at their PCs? Nobody cares. Even if these are the guys whose work ultimately decides whether those magical screws end up doing something or are splattered over the moon. I don't care about the publicity, but it's the mindset. Everyone focuses on the aluminum this and titanium that - and software is always the afterthought. We can change that anytime. We can even send an update to space... so why should we think about it too hard? Bam!
Good point. And I think it is because it takes a higher level of intelligence and technical knowledge to understand software systems, and the media and others can't manage it. You only have to look at any news article published by the mainstream media, on television or in newspapers, and you will see the errors the journalists make, the incorrect use of terminology, the lack of detail - and you walk away realising the article has told you almost nothing.
Boom! As a retired architect for space sensor payloads, I can say you are spot on. I watched management spend all sorts of money on convenience tooling but if SW wanted licenses for software production and testing tools, oh God, you got run through the gauntlet. So how many times must a company learn these lessons? Simple, once per program.
It's the same attitude all over the place. Games no longer ship as completed projects... 'we can just patch it later'. Many other fields also do shit like this.
In their official debriefing, ispace actually admitted that it's primarily a (project / program) management issue, not an engineering issue. That gives me hope that they might actually learn something from this.
It's almost *ALWAYS* a project/program management issue, not an engineering issue. This was also true for Mars Polar Lander and for Mars Climate Orbiter (the one that famously mixed up imperial and metric units).
This is why you also have timers for expected milestones (earliest and latest time a milestone can be validly sensed). My background is that I worked on the Attitude and Articulation Flight Software for the Galileo and Cassini spacecraft when I worked at JPL. For a very simple and solid method they could have used what the Surveyor landers did.
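To sketch the milestone-window idea in Python - purely illustrative, with made-up names and times, nothing from a real flight system - the fault monitor is simply disarmed outside the window where the event can validly occur:

```python
# Minimal sketch of milestone validity windows: a sensor-failure monitor is
# only armed inside the window where the milestone can legitimately occur.
# All names and thresholds here are invented for illustration.

from dataclasses import dataclass

@dataclass
class MilestoneWindow:
    earliest_s: float  # earliest mission time the event can validly be sensed
    latest_s: float    # latest mission time the event can validly be sensed

    def is_armed(self, t: float) -> bool:
        return self.earliest_s <= t <= self.latest_s

# e.g. large altitude jumps are only treated as sensor faults once the
# lander is in terminal descent, not while still crossing rough terrain
ALTIMETER_FAULT_MONITOR = MilestoneWindow(earliest_s=540.0, latest_s=900.0)

def altimeter_ok(t: float, jump_m: float, max_jump_m: float = 1000.0) -> bool:
    if not ALTIMETER_FAULT_MONITOR.is_armed(t):
        return True   # monitor disarmed: big deltas are expected out here
    return abs(jump_m) <= max_jump_m
```

With something like this, a 3 km jump seen while still crossing rough terrain would never even reach the fault-latching logic.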
I'm sure mission control had a plot of the expected altitude changes; the lander may have had one as well. Problem is that the expected rate of change of the altitude was outside what had been set as acceptable for the altitude radar. It was probably written in the specs somewhere. Proper simulation of the landing would have caught this; it could possibly even have been dealt with after launch. It's changing the landing site without simulating it that screwed them.
What did Galileo and Cassini use for altitude readings? And would they have been equally screwed if forced to switch over to gyro/accelerometer readings with an apparently failed altitude radar?
@@nocturnal6863 John's point was that "forced" switch is averted, if the switch algorithm is completely disabled at such an early phase of flight. Reread about "earliest time.. a milestone can be validly sensed".
@@u1zha Except you wouldn't disable the software monitoring a sensor for failure - not unless you knew in advance it might give faulty readings at that point. On further thought, I think I see what you are suggesting: the lander should have been expecting the dip in altitude, and its failure to see it means it should have known its own altitude estimate was off.
It fascinates me that as you look at the history of disasters how many of them are ultimately caused by cutting corners to meet time pressures or budget targets. In this case you have to wonder (A) why the target zone was changed late in the game, and (B) why simulations with the new target zone weren't run. I would bet a dollar that engineers thought of it, but they were over-ruled because of time pressures or a budget target.
Your question (A) is a great one. It could be that the new landing site could be reached with less expenditure of propellant, or something like that, and they thought it had a lower margin of error. Or was it the opposite - a "better", more ambitious site with more interesting geography?
@@aarondavis8943 It is interesting to speculate. IMHO "less expenditure of propellant" would fit into theory about disasters and cutting corners to meet budget targets. On the "better geography" thought... unless an asteroid suddenly impacted an area close to their original site, one would think that the geography question would have been settled long ago... the lunar surface is pretty well documented (at least the front side).
Proverbs have a sort of statistical truth. "Haste makes waste" exists exactly because of that. The sad part is that we seemingly keep making the same mistakes.
@@pierQRzt180 Yes it is sad to think about all the people who have lost their lives due to decisions on someone's part to save a few bucks by cutting corners. One of the latest examples appears to be that partial collapse of the apartment building in Davenport. Looking like the owner went with a cheaper contractor who would forego shoring up the building before proceeding.
It continues to amaze me that we managed to safely land astronauts on the moon AND have them take off from the lunar surface and return home, several times. Obviously, having actual humans present makes a ton of difference, but the number of things that could have gone wrong but didn't is mind-boggling.
The brain is a wonderful flight computer. Lander: I'm going to land here. Human: Dummy, there's a rock the size of a McMansion there! Gimme manual control.
think about it like this: humans have managed to control powered airplanes since the start of the 20th century, while autonomous aircraft have only just appeared in the last decade or two. humans are just that versatile
One of the NASA research reports justified the cost and risk of sending human astronauts to the Moon with an observation to the effect that "the human brain is the most lightweight and easiest-to-acquire real-time non-linear computer".
About 30 years ago I had a single page photocopied from Computer World or some such industry publication taped to the outside of my cubicle. On that page were listed the ten most expensive software defects (bugs). I was astounded that the most expensive defects caused hundreds of millions of dollars of loss. When you read the top five defects (again, multi-million-dollar losses) you found out that they were all losses of spacecraft and/or their payloads. Flight software is tremendously complex, and a single error will cost you your whole vehicle and years of effort. Nowadays that page would have to be scaled to nearly billions in losses, I expect.
Maybe not most expensive, but Therac-25 should be on that list somewhere. Ya know, because it ended up maiming and killing a bunch of people with intense doses of radiation.
@@a.p.2356 In a way, that is possibly the most expensive software bug ever; in another, it's quite cheap. Consider: we know for a fact that cars kill people, all the time, in every way, yet we do not ban cars. The value of a human being's life has been calculated, and apparently it's cheaper than you would expect. Electricity production has a cost measured in lives per TWh. You can look it up. Biofuel has a cost of 12 people per TWh. Solar is 0.44. Wind is 0.15, and nuclear is 0.07. The average American consumes 0.1-0.2 GWh per year. In other words, over the course of your entire life you will likely kill less than one fiftieth of a person in order to keep the lights on. This does stack with the people you kill while driving, however - I'm talking about tyre particulates and excess deaths from pollution, not running somebody over. Ain't that grand?
@@o0alessandro0o It's not easy to respond to that kind of information. I do know that training and regular checking of pilots contributes to a high level of safety in commercial aviation (ignoring mechanical failures). For drivers, I reckon similar processes should be followed. It would not be popular with the general public, but I have said for years that licenses should be graded, based on years of experience and how many training courses a driver takes. Governments balk at the idea, however, and go on putting up cameras and roadside radars and more draconian speed limits, never addressing the fact that poor situational awareness, slow and inappropriate reactions, and limited skills are the biggest factors in car accident rates. But I'm going down a rabbit hole!
Not strictly a bug, rather bad design; but implicit nullability - first introduced in ALGOL W in 1965 and later copied into most programming languages - was famously described by its creator as his "billion-dollar mistake". I think I read somewhere that at the time the estimate was quite accurate, but that was 2009, so by now it wouldn't be surprising if it were an order of magnitude too low.
It's very interesting that this is almost the exact reverse of the famous 1201 alarm on Apollo 11. In that case the computer restarted and generated errors on the astronauts' control panels, but because the crew knew they were at the right altitude per the flight plan, they had confidence that they were still flying correctly, and Neil Armstrong brought the lander down safely.
The source of the 1202 and 1201 alarms was traced to the rendezvous radar (used for rejoining the command/service module) being inadvertently left on at the same time as the landing radar, the only one needed for the descent phase. This overwhelmed the lunar module computer, but mission control knew it was still safe to land because of one man at Houston.
@@warrenpierce5542 yes, when you go into the details then these are different cases. But in the abstract, in both cases the computer was confused because it got signals which were unexpected and didn't handle them well. In Apollo's case, the human was able to use additional information to recognize that the problem wasn't severe and in this case - there was no human.
@@warrenpierce5542 In Mike Collins' excellent book, he mentioned the 1201 and 1202 weren't exactly "well known" issues. Took a bit of "looking up" (quickly, albeit)
Just realized that manned landings were in some ways technologically easier (thanks to the skill of the pilot) than the unmanned soft landings that are only now possible due to progress in software systems.
My experience of being a software engineer is that the code has to be tested every time. It's amazing how often things that can't go wrong do go wrong.
It is not the software that changed but the parameters of the flight. You may of course argue that the software was made for the particular landing zone, which I do not buy. I may be mistaken, as the video is the only source of my knowledge of the situation - sort of like this radar was. So you take a peek at the surface with radar and see this crater - or rather, a human with eyes would have seen the crater; the landing module saw just a point on the surface which was 3 km higher than the previous point it peeked at. I suspect what they needed to do is have the radar measure more points, especially from a distance, and average them, or use some other technique to see where one is. Much lower down this would also be needed, to check that no big stone is occupying part of the landing zone. I suppose this last concern was eliminated by the assumption that the landing would be done on a flat, empty surface chosen by mission control. If they were landing on a water surface, this radar error could only occur due to a massive tsunami - well, no water surface and no tsunamis, but a hard landing. Interesting to know all this though, ain't it?
Been there! Written lots of software... made some unbelievable bone-headed mistakes, which are all *BLINDINGLY OBVIOUS* in retrospect. "This change is SOOOOOOOOOO OBVIOUS that we don't need to test it" ... ha ha ha... this is when reality bites you on the backside, informing you that you definitely *DO* need to test it again.
@@hanskloss7726 HA! Then you discover it's high tide instead of low tide... maybe you simulated it with mean sea level but a mile away someone opened the sluice gates and there was a large wave from the reservoir... etc.
@@simonmultiverse6349 Low tide vs. high tide doesn't cut it here - the surface is still mostly flat, at least from a 5 km perspective. The crater is a different story, so you need to measure many points, and possibly have a map too? Not sure which is easier here, but their method obviously failed. And no shame in that - we've all been there...
There was a similar bug in the LEM: if the module had flown above a circular crater of a certain size, the radar altimeter would have shut off all propulsion, probably leading to a crash. Fortunately the bug was never triggered (mainly because the onboard crew had taken over manual control at that point) and it was only found decades after the landings.
@@dr.cheeze5382 So I've heard. But compared to the alternative of losing these spacecraft to software errors, I think "no Soviets to beat" is a ridiculous excuse. So I still really don't understand why lunar exploration has been strictly rover-based these past few years...
I flew fighters for over 20 years. The Kalman filter was the bane of the navigation and bombing solution. It would actually discount most of the updates I would insert. It thought it knew more than I did... it didn't.
Another outstanding episode, Scott! Having been a software safety engineer for the last 39 years, I have to agree with previous comments pointing out this is not a software bug, but more of a people problem during design, testing, management, etc. I believe the first Ariane 5 launch was a similar issue, where the software worked perfectly per its specifications (carried over from Ariane 4) and doomed the flight to failure. As in this case, proper testing would have prevented the expensive tragedy. Also wanted to give a shout to "How To Destroy Wayward Rockets - Flight Termination Systems Explained". My 39 years were all spent on range safety software, with the last 13 years working on autonomous flight termination systems. That was another outstanding episode! Keep up the awesome work!
@@vast634 Depends on the type of Flight Termination System (FTS). For solid rocket motors, they use a shaped linear charge that opens the casing and exposes the fuel which burns up quickly in an impressive display. (I think Scott mentioned that in his previous video.) For chemical fuels, things are different. You have more choices. The basic idea is to stop thrusting the vehicle so it falls into an unpopulated area, such as a broad ocean area in the case of SpaceX. Based on the video of the flight, the FTS worked properly and detonated explosive devices that created holes in the fuel tanks. That reduced or stopped the fuel flow to the engines. The FTS did its job. After that, it's all physics. If the fuels are hypergolic, they will combust on contact and you get a near-instant explosion. Otherwise, you need combustible fuel, oxygen, and an ignition source. Guessing, it took about 40 seconds before the three elements came together in the right quantities in the case of SpaceX. An FTS doesn't need to create an explosion. Rather than connect to explosives, the FTS can connect to fuel valves that terminate fuel flow.
@@xGOKOPx I understand your perspective. My point is that the bug should have been avoided during design or implementation, and if not, then detected during development testing. Find and correct all the bugs before deployment. Since their development testing failed to react properly to "unexpected terrain" (kind of a silly term considering the moon's terrain is pretty stable), the people failed in the software development cycle and left in a failure mode (i.e., the bug) so it could be exposed during execution. The software did what it was designed to do so it worked properly. The people failed to account for something. The same thing happens with hardware but folks don't usually blame the hardware. The failure of Galloping Gertie wasn't blamed on the bridge. The people who designed and built it were blamed for not accounting for potential wind loading.
Scott - great video as usual - thank you! But really this is not a software 'bug' is it. It's a systemic design and control failure. The software was designed to work as it did, but the specifications do not seem to have included passing over a crater like this. In other words the initial flight plan was intended to avoid this situation, and the software was designed to work within that flight plan. The first error was changing the flight plan without checking if the software could still function with the new one. The second error was not testing the software under the revised conditions it would have to work in. Both errors are symptomatic of inadequate control over change management. In other words, the flaw did not lie in the programming, but the organization's approach to change management.
...and perhaps a third error of the program disregarding the radar altimeter instead of querying it again. 'Say what? That result is outside of parameters. Please take your reading again.'
Came here to say that! Also, even if the result later on were inside parameters again: the device has already been shown to be unreliable. It might be an intermittent error, or it might be a bias that existed all along and was only noticed on this occasion. Revalidating system reliability would be a tough cookie on its own without a redundant 2nd and 3rd system, though it should have notified ground control and gotten an update/patch - to my understanding that is how system failures are generally resolved. I don't know if their mission profile put an artificial time delay on that to prepare for longer-ranged versions, or what happened.
Haven't read the report so I might be wrong, but if they use something like a Kalman filter, it is likely that they never stopped querying the sensor; rather, the variance assigned to its readings spiked. In that case the sensor would still be queried but _effectively_ disregarded, since its effect on the output would be so small (due to the inflated variance). Someone can correct me if I am wrong here.
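For the scalar case, the standard (textbook) Kalman update shows why inflating the variance amounts to ignoring the sensor - this is the general mechanism, not ispace's actual filter:

```latex
% Scalar Kalman measurement update:
K = \frac{P^{-}}{P^{-} + R}, \qquad
\hat{x}^{+} = \hat{x}^{-} + K\,(z - \hat{x}^{-})
% As R \to \infty for a distrusted reading, K \to 0, so the radar is still
% "queried" but its value z barely moves the fused estimate.
```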
There's also the design of the spacecraft that has to be called into question, specifically the AD&C architecture. Relying on a single altimeter means that you can't verify the data with a redundant sensor. Since accelerometers and gyroscopes can't really capture things like topography from orbit, it's like flying with one eye. I don't know how much mass, power, and space another altimeter would have taken up, but perhaps a redundant altitude sensor, possibly one with a lower resolution and/or sample rate, could have been used to verify the data coming from the primary one.
They may not accomplish it. Very often such incidents reveal a whole load of issues that have been swept under the carpet, and the organisational change required to address them all can easily break a small team or organisation. Even big companies can be killed by this.

This is what is going on at Boeing right now. They caused the crashes of two 737 MAXes and killed people. Since then they've tried to institute root-and-branch reform of how they run their business. Yet they're still having problems. The most recent one was a fuselage manufacturing defect (they were building them wrong) that had gone unnoticed for approximately 700 airframes (yep, they're flying, possibly with Southwest today!). Fine, they've found it; repairs needed, not immediately dangerous, but it cannot be ignored. Trouble is, the manner of finding it was accidental: someone was in the right place at the right time and realised what was going wrong.

The issue is that if, despite root-and-branch reform of how they approach quality (= safety, reliability), they're still finding major issues by chance, then the reforms are junk and not working. They should be finding such problems as part of a systematic continuous improvement process, and they're not. So the bet-your-life question is: what else have they missed, given that they've essentially admitted they've not been looking hard enough? It's similar with the 787 (fuselage barrel joints), brand-new 737 MAXes with FOD and rodent damage, etc.

This suggests to me that Boeing is in no way adequately reformed following the MAX crashes, the problem most likely being in the senior management who never understood it before and are still there today. It's worryingly possible that they're going to make another fatal mistake. OK, the FAA is now (belatedly) keeping a much beadier eye on Boeing, but they can't see and check everything; certification engineers and inspectors are not there to do basic QC and basic QC improvement.

The Hakuto team's best bet, if they're to try again, is to just fix that one core issue and do as much simming as they can muster. Unlike Boeing's, their crashes are just disappointments and money.
Unfortunate to see what led to the failure of this mission. But glad to see that they have found the issue. Really hoping they succeed on their next attempt. Thanks Scott for the comprehensive explanation on this!
People need to wake up, controversy surrounded moon landings because there is stuff there. The issue / bug was in there on purpose. They will probably never let us see the real moon.
@@davidbeppler3032 The whole thing was planned - it is every time, with every country. Why do people not see this? Every single machine that lands on the moon has issues: #1 because the surface is covered in glass domes and other hanging debris, and #2 to cover up such things from the public in a convincing way.
7:50 Yeah, the VIKRAM lander from CHANDRAYAAN-2 had lost communication and went out of control, but with improvements in software, dampers etc., we are ready again for CHANDRAYAAN-3 to launch in July, according to official messaging......
I'd argue it's due to inadequate testing and making assumptions they shouldn't make rather than just blaming the software. To move the landing site and NOT run a series of full simulations for the new site is just an astonishing degree of incompetence!
@@mcgilliman I like to think of an F1 analogy... Imagine if you set your car up to race in good sunny weather in Monaco at sea level, and they changed the race to be in Mexico in soaking wet weather at 2,260m above sea level... You would NEVER just race the car with the exact same set up and no testing before the race!!
True, but if history has taught us anything, it's that the incompetence almost certainly wasn't the software engineers themselves; it was instead the cumulative effect of multiple levels of bureaucracy repeatedly ignoring the recommendations and pleas of the people who actually knew what they were doing and what additional work had to be done. I suspect this is a scaled-down version of Challenger all over again, albeit thankfully with no loss of life this time.
Seems a pretty basic error - in what universe did they think they could figure the exact altitude without the radar? Even if it really was broken, too bad: damned if you do, damned if you don't.
@thePronto They moved their landing site to align with NASA South Pole targets super late in development (post validation). The threshold for culling/re-baselining seems to be the issue. The sudden change in relative altitude wasn’t expected from their simulations.
Working as a software tester, I often see managers take the risk of shipping malfunctioning software to save some money, especially when it comes to error handling.
Which is a shame, really. The more expensive the project, the less management should feel like cutting corners on error handling and verification. Ah, well, what "should" happen in the real world doesn't agree closely with what actually happens in the real world.
Being a software developer, I have seen this happen countless times in multiple companies. Software is often overlooked. Testing is usually considered redundant and a waste of time/money. Developer's warnings and requests are normally disregarded or displaced by other department's concerns which are non-technical and even non-functional.
In my experience the software developers/engineers are kept out of the decision-making inner circle. Actually, this goes for engineering/tech in general. It’s fine, just change this, this and that: what’s the worst that can happen? (Changing the Moon’s landscape to more closely resemble a seedy neighborhood in Brooklyn, one spacecraft at a time.)
Software may have been the proximate cause but you can argue the real problem was somewhere in the development and quality control procedures. How can you not re-run a full landing simulation after changing the landing location? It reminds me of Starliner's problems in 2019. The software glitch where the flight computer grabbed the wrong "time" was the proximate cause, but the real problem was Boeing never ran a full end-to-end launch simulation.
Actually these days so much of manufacturing and coding is outsourced that the management, hardware and software teams are no longer next to each other - quality control begins to suffer massively. The more people outsource stuff, the more the work gets into the hands of rookies paid on cheap wages, who then end up making rookie mistakes that then require even more time and energy to fix. Boeing turned from an engineering firm to a management firm and the rest is history - 787, 737 max, 777x, Starliner. And as more and more automation comes in there's less and less human intervention to take care of the times where the computers reach their limits.
Feels like they might not have great CI indeed; probably more like a bunch of artifacts in Git LFS type of management. But the Starliner glitch might be a slightly different topic IMO.
I was also wondering how expensive such a simulation would be. If they aren't too expensive, I was wondering if you couldn't run landing simulations from randomized positions and flag anomalies from there. Not so much that you can just fling the lander at the moon arbitrarily, but more so you can find starting conditions which result in something weird. IDK, maybe we're getting into a space where "moon lander software testing" and later "asteroid lander software testing" might be a market, that would be amazing. With the costs of these missions, there might be some money on the line for a testing company - especially if they end up with a body of "known problematic situations" like the one from the video.
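A randomized sweep like that is cheap to set up around any simulator. Here's a hedged toy version in Python, where `simulate_landing` is a stand-in for a real dynamics model and every number is invented:

```python
# Toy Monte Carlo scenario sweep: run the landing sim over perturbed initial
# conditions and flag anomalous runs for human review. `simulate_landing`
# is a placeholder for a real 6-DOF simulator.

import random

def simulate_landing(lat_deg: float, lon_deg: float, alt_err_m: float) -> dict:
    # Placeholder physics: pretend runs near a 'crater rim' at lon 25.0
    # burn extra fuel, just so the harness has something to flag.
    fuel_left = 100.0 - abs(lon_deg - 25.0) * 10.0 - abs(alt_err_m) * 0.01
    return {"landed": fuel_left > 0.0, "fuel_left": fuel_left}

random.seed(42)
anomalies = []
for run in range(1000):
    lat = random.uniform(46.8, 47.8)      # scatter around a nominal site
    lon = random.uniform(24.0, 26.0)
    alt_err = random.gauss(0.0, 50.0)     # injected navigation-state error
    result = simulate_landing(lat, lon, alt_err)
    if not result["landed"] or result["fuel_left"] < 10.0:
        anomalies.append((run, lat, lon, alt_err, result))

print(f"{len(anomalies)} of 1000 runs flagged for review")
```

The point isn't the placeholder physics; it's that a harness like this builds up exactly the "body of known problematic situations" mentioned above.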
How can you put a tank that has experienced both problems and damage in test into Apollo 13? Exactly like that, only different. Or get km and miles crossed up and smash a probe into Mars (IIRC), or... ad infinitum. You can run all the simulations in the universe and still have problems, but not running ANY sims to cover a deviation in the program... yeah, that's just begging for it.

I would think, in this day and age, that you could pretty much run that sim in real time in parallel with the mission - for the problem they had, knowing the path and surface profile - and have it fire off a quick "do not ignore the damned properly functioning radar" command, or some such. It might even be good to have REAL TIME simulations running against the truth of the mission.

Having done some aerospace hardware design, I'm guessing that there were schedulers and/or bean counters directly in the problematic loop. Or maybe idiotic MBA-wielding managers who think they are engineers, or worse, know BETTER than the engineers because they know a few buzzwords - and then maybe hold people's feet to the fire to get them to sign off on VERY cold Shuttle launches, or what have you. That's the sort of feedback you do NOT want in, say, a servo. :-/

Sometimes I look back and am glad I am retired, frankly. Some of it was fun, some of it SUCKED. Doc requirements come to mind as some of the latter. I had one junior documentation fiefdom wannabe tell me that the real output of a program was the documentation. When I finally quit laughing I told her that if she actually believed that, she should go ask some F-16 pilot which they'd rather have with them on a mission: a working LANTIRN pod, or the documentation that describes it. She wasn't happy, because then a couple of people standing around laughed too. She wasn't a nice person (that's putting it mildly), or I wouldn't have said it that way. My bad, I guess.
Armchair quarterbacks are a dime a dozen. You can certainly crow if you ever land a vehicle on the moon or even achieve orbit. Perhaps we can talk about "how obvious" the solution was when we stop whining about how LONG it takes to build and fly a vehicle and how the contractors are "milking the American public" for so much money. I thank Providence every day for Kathy Lueders and NASA for riding herd on SpaceX to make the Dragon 2 safe. Everyone had lots of criticism for NASA for being conservative and "delaying" the first launch of the manned spacecraft. But all that effort kept the astronauts safe. (Also SpaceX had a real leg up on Boeing, because they had a working cargo spacecraft in Dragon 1 to build on. The last time Boeing designed a manned spacecraft was the 1970s and the Space Shuttle. All those engineers are long since retired.)
Similar thing happened in 2017 with the second launch from the new Vostochny cosmodrome: old software logic applied to new geography without a double check. It didn't happen on the first launch because they used the very rare Volga upper stage, but the second launch was in the default configuration that had flown for decades from launch pads everywhere, including South America. After final separation, the Fregat upper stage was scheduled to make a 10-degree turn counterclockwise, but due to the geography of the new cosmodrome and the flight trajectory, the software decided it needed a 350° clockwise turn instead. It didn't end well. Turned out there was a narrow set of input parameters that could make the upper stage behave like this, and the new launch pad won the jackpot.
There's an argument to be made that if a sensor is critical enough that if it fails you're going to land on non-existent terrain 5km up, then you just assume it won't fail. If you handle failure gracefully but then don't have enough data to avoid crashing, what's the point of handling it gracefully? Of course, ideally you'd have a backup. Like another radar, or GPS, or a video camera capable of estimating height using machine vision and a map, so you can sanity check it. The next best thing is just have a map: the vehicle knows where it is, so if it knows the terrain it can estimate what the radar values _should_ be, so instead of going 'eek, a delta of 3km in ten seconds is clearly wrong' you go 'the radar has shown a delta of 3km in ten seconds, what does the map say the delta should be? Right, 3km, moving on'.
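That map cross-check could be as simple as this sketch (hypothetical Python; `terrain_map` stands in for a digital elevation model, and the tolerance is invented):

```python
# Sketch of a terrain-map plausibility check: before declaring the radar
# failed, compare its altitude delta with what the map predicts along the
# ground track. Names and numbers are illustrative only.

def expected_delta_m(terrain_map, p0, p1) -> float:
    """Predicted change in radar range from terrain elevation alone,
    between ground-track points p0 and p1."""
    return terrain_map.elevation_m(p0) - terrain_map.elevation_m(p1)

def radar_plausible(measured_delta_m: float, map_delta_m: float,
                    tolerance_m: float = 500.0) -> bool:
    # A 3 km jump is fine if the map says there's a 3 km crater rim below.
    return abs(measured_delta_m - map_delta_m) <= tolerance_m
```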
I work in flight software and you're right. At a certain point, if a system is so critical and irreplaceable, you just have to trust it won't fail because, as you said, detecting the failure isn't helpful if you're SOL.
There is an argument to be made that if a $90 million project can go up in smoke due to a single sensor failure, and you have an expectation that it could potentially fail, you should really have some sort of redundancy even if failure is unlikely - or some other form of backup plan. The question is whether it was actually considered that this sensor could fail, or whether it just used the same failure detection and handling as any other sensor, without further consideration.
This reminds me of an Alastair Reynolds novel where an automated system recorded the sudden vanishing of a planet but disregarded the data because the event was so far out of expected results that it assumed there was some kind of fault.
@@yogiwp_ Absolution Gap, it's the third book in the Revelation Space series which is kind of weird. If you're looking to check out the author, I'd recommend Pushing Ice!
@@ShoeTheGreyCat I forgot about that bit. Given that series largely relates to characters that are functionally aging immortal, it's wild how easily they torture and kill each other.
My reaction to just the title is "There are no unbelievable computer bugs". Now that I've watched the video: *very* believable. Accumulation of error is nasty and dead reckoning is very hard. Changing something that "can't possibly affect the outcome" late in the process and not doing a full test happens often enough that it's a subject of comic strips and many high profile failures.
When you're using accelerometer and gyroscope data alone for position on a 2D plane, it can become hilariously inaccurate quickly, no matter how good your algorithm is. Doing this in 3D would be basically impossible, if I'm being honest. As an example, there is a reason VR relies so heavily on video processing for limb positioning. Obviously these aren't in the same ballpark of cost/importance, but the same rules apply.
@@winebartender6653 US missile submarines can pull it off, but their inertial navigation hardware is larger than the entire lunar probe and submarines experience much smaller accelerations. It does seem like it was selected as a fallback with rather optimistic expectations of how well it would stay accurate. In hindsight, it would have been better to try turning the radar off and back on, relying on inertial navigation only as long as it took the radar to come back on. Also, redundant radar.
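The drift math shows why pure inertial navigation degrades so fast: a constant, uncorrected accelerometer bias integrates into a quadratically growing position error.

```latex
% Position error from an uncorrected accelerometer bias b:
\delta v(t) = b\,t, \qquad \delta x(t) = \tfrac{1}{2}\,b\,t^{2}
% Example with invented numbers: b = 10^{-3}\ \mathrm{m/s^2} over
% t = 600\ \mathrm{s} gives \delta x = 0.5 \cdot 10^{-3} \cdot 600^{2}
% = 180\ \mathrm{m} of drift, and it keeps growing quadratically for
% as long as the radar is ignored.
```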
Hi Scott: Very nicely done! (but then, I say that a lot about your stuff...). This scenario is directly reminiscent of the situation on the Apollo landings where passing over a crater (or any other feature like that) would cause a 'jump' in the Radar Altimeter-portrayed altitude, and it would 'jump' from the PGNCS altitude. Remember 'Delta H'? The difference between RA and PGNCS altitudes. In order to keep things from diverging in the PGNCS, they had to incorporate a 'terrain map' into the software that accounted for local differences in surface elevation. Remember the landing of Apollo 17? At some time in the PDI maneuver, one of the crewmen (I can't tell which one--they sound a lot alike) said 'We went over the hump, and Delta H just jumped'. It sounds (at least at first blush) like a feature similar to the Apollo 'terrain map' might have been appropriate here (?) Thank you.
From this description it sounds like the software worked as intended based on the circumstances. It sounds more like they need to rethink the system level design to have more inputs that can be used to sanity check one another, and perhaps have a means for a one-time instrument glitch (at least in the design interpretation) to be "forgiven" if later sanity checks pass.
Yes, that makes sense, and the "forgiving" part is commonly solved by Kalman filtering, which Scott also mentioned. Here it sounds like Ispace overengineered a little bit, overeagerly dropping sensor data on the floor before giving the filter a chance.
I've never done it, but from what I have read, sensor fusion is an enormously complicated and fuzzy technique. You have to take a bunch of sensors, account for non linearities and malfunctions, and you need to figure out which ones are correct, which ones are sorta correct and by how much, and which to ignore. On top of this you have enormous weight and power restraints. And there must be a million fudge factors that have to be played with. Move it one way and you get a false positive, move it the other way and you get a false negative.
I wonder if this would be a good application for AI? A computer would definitely be able to interpret way more inputs than a human pilot ever could and in real time
@@andrewahern3730 AI isn't a magic fix. Those sensor fusion algorithms are supported by a deep understanding of the system and statistics. Like with the Falcon 9, they are extremely reliable once properly tuned. Obviously an advanced enough AI system can always do the job. But if, like in this case, you simply didn't test the system with enough variations of inputs, you're not going to get good results either. The amount of simulation needed to properly train an AI would also have been plenty to find this bug in the old control code. The lesson here is that more robust testing is needed. I have a feeling that spaceflight is often seen as hardware-first. That's understandable, but without proper software the hardware is useless. I think more modern software engineering practices could be useful here.
IRL, nothing says you can’t have false positives and false negatives at the same time, while you struggle to understand the data. That’s no fun at all.
Kalman filtering is pretty damn straightforward. It's a basic method, not something extraordinary - known for more than 50 years, and optimal for typical sensors (i.e. those with a common, Gaussian-like noise distribution).
Mr. Manley, for the whole planet you are our 21st-century Eugene Kranz. At 01:13 your video proves what we now have available: 'a da Vinci world of creativity at home'. The video shows that they used a $170 Airspy R2 receiver (with a $620 LNC + antenna) with the mind-blowing power of the software available for the Airspy - so for less than $900 USD you can have the same setup at home! Your use of the Kerbal simulator, to help us better understand the sequence of events, is of jaw-dropping beauty.
hey Scott - thanks for the analysis. I remember this one (as well as the Israeli and Indian ones) and seeing the disbelief in the control room was sad. It is easy to tell who has a clue and who is a bureaucrat by their expressions, etc. :P
@@adarsh4764 agreed - one would have thought that even a small lander would have a pretty robust navigation system these days, but obviously they met an edge condition they hadn't properly tested for... and a sad oversight too as nearly all landing trajectories will have the radar return affected by craters you're passing over. There are many of them after all, and although most are small, many are large/deep and you need to keep their profile in mind as you use the radar/laser/etc surface measurement. The state vector routine needs a sanity check to make sure the drift never disagrees from projected too much without it doing some form of reliable recheck.
I'm very impressed with the abilities to diagnose what went wrong. Even amateurs helped! Another case study for future designers of "fail-safe" systems.
How do you know? Did you read the design spec? If the design spec stated it should be able to handle multiple lunar landing locations, then it's not a design spec issue.
Mismatch at Requirements and/or Expectation levels. Activated by beyond test envelope operation. Needed a calm (seasoned?) "captain" to hold a steady, pre-planned course.
@@simongeard4824 I think the video was quite clear on that the software started ignoring that sensor because it was programmed to do so. An intentional feature that behaved differently than expected *because* it was put into a situation that was not considered while designing it. And that this only happened because they changed the mission plan after the software was developed and did not test it again with the new landing site, because their tests would have detected the issue. That last part really hurts because they reasonably could have avoided the crash.
I can't believe they didn't simulate their final landing site but that's what you are saying. Thanks for the explanation. Such a shame, they picked the wrong thing for a shortcut!
By, "unbelievable", I'm pretty sure you meant, "Completely realistic, very common scenario when the software is put in an untested environment." Note that I am saying this as a software developer myself. I actually just identified a scenario where our existing tests were thought to be sufficient, but then some surrounding parameters changed and a bug was found.
@@jarisundell8859 Good question. Seems like actually running the sim again once the final site was chosen should have caught this, maybe allowing them to upload the fix. For the record, CI is how the one I looked at was caught... prior to release 👍
Yea, that is the challenge of small projects with limited resources. It is great that this is not a problem for larger projects (cough-cough-Starliner) that have the money and resources to allocate to proper SW verification.😆
But what if the radar altimeter actually had failed around the time it passed over that crater? It sounds like it would have produced the same result. I think the only way to deal with this in the design is to have at least one redundant sensor for something this mission-critical. Of course, the problem with only one redundant sensor is that you need to figure out which of the two is actually broken. That's why there are often 3 sensors, or 3 computer systems, used in this kind of redundancy...
Sounds like a redundant sensor wouldn't have helped this particular issue though, because it would have just gotten the same confusing measurements of the cliff wall. I think they just need to thoroughly run simulations of the actual mission to catch edge cases like this early.
@@sonaxaton 3 redundant sensors and a voting system are the way to go. Worked flawlessly in many aeronautical applications, from the Concorde autopilot to missile guidance systems.
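A minimal sketch of that kind of voter, assuming three independent altimeters (illustrative Python, not any real flight code):

```python
# 2-out-of-3 style voting via the median: tolerates one arbitrarily bad
# sensor with no failure-mode logic at all.

def vote_altitude_m(a: float, b: float, c: float) -> float:
    return sorted((a, b, c))[1]   # median of three readings

# One sensor going insane barely matters:
print(vote_altitude_m(5002.0, 4998.0, -120000.0))  # -> 4998.0
# And a real 3 km cliff still passes, because all three jump together:
print(vote_altitude_m(8003.0, 7998.0, 8001.0))     # -> 8001.0
```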
The dead reckoning system combined with prior knowledge (a map of roughly what is expected) should have been enough of a redundant system. Seems like they should have included a reassessment/recovery routine to check if that apparent altimeter glitch (which wasn't a glitch of course) cleared and the instrument was giving reasonable data. This stuff is really tricky without a human in the loop.
To be honest (and a bit philosophical), I would not call this a "bug", in the sense that "bug" usually means an error in the software that makes it behave differently from the behavior specified at design time. In this case the software had to face a situation that was not anticipated: a sudden increase of apparent altitude due to a deep crater. It was not an error introduced at implementation time (when they wrote the software), but at design time. Like a bridge that collapses not because of some error during construction, but because of a strong wind that was not considered in the design.
I agree. This was a planning error, or a failure-to-test error, or a change of the landing into a regime that had not been tested, or all of the above. It's been known for a long time that radar altimeters can be spoofed by terrain; it is nothing new.
As a software engineer I agree lol but that's not to say it is not also partly the responsibility of software engineers to raise potential bugs in the design.
Love this! Thank you for reminding me of Kalman filters - I studied them in my M.Tech., loved them, but never thought I would hear of them again. I still remember how the location-estimation part, based on current velocity and direction integrated over time (aka dead reckoning), can provide smooth and accurate predictions over short durations, but errors tend to accumulate in a physics-based predictor like this, so it needs to be augmented with an independent measurement (i.e. the radar), even if the radar data is noisy. Amazing to see how stuff like that led to this outcome. It is a tough one though... I wish you had shared your thoughts on how a "faulty sensor" should be detected. I mean, you could say that a 3 km sudden jump in the sensor output means the sensor is probably broken, right? If not, how else would you do that, and still handle the case when the sensor actually is broken?
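For anyone curious, here's a toy 1-D version of that predict/update loop (purely illustrative Python; a real lander fuses full state vectors with proper covariance matrices):

```python
# Toy scalar Kalman filter for altitude: dead reckoning (predict) plus a
# noisy radar (update). All numbers are invented for illustration.

def kalman_step(x, P, vel, dt, z_radar, q=0.5, r=25.0):
    """One predict/update cycle for altitude x with variance P.
    vel: estimated descent rate (m/s), z_radar: radar altitude (m),
    q: process noise per second, r: radar measurement variance."""
    # Predict: dead reckoning - integrate velocity; uncertainty grows.
    x_pred = x + vel * dt
    P_pred = P + q * dt
    # Update: blend in the radar; uncertainty shrinks.
    K = P_pred / (P_pred + r)          # Kalman gain in [0, 1]
    x_new = x_pred + K * (z_radar - x_pred)
    P_new = (1.0 - K) * P_pred
    return x_new, P_new

x, P = 5000.0, 100.0
for z in (4990.0, 4975.0, 4961.0):
    x, P = kalman_step(x, P, vel=-15.0, dt=1.0, z_radar=z)
    print(round(x, 1), round(P, 2))
```

Fault detection then usually hangs off the innovation `z_radar - x_pred`: if it's implausibly large relative to the expected spread, you distrust the reading - which is exactly the check that bit this lander.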
Redundant systems and a majority check: if all three of your radar sensors report a sudden altitude change, then that's what actually happened. What surprises me is that the sudden-altitude-change eventuality was never accounted for...
I mean to be fair, even the Apollo lunar module only had a single non-redundant landing radar altimeter for determining exact altitude. The astronauts were fairly confident they could manage to land without it but if it failed, mission rules called for an immediate abort. The weight constraints for landers are so tight, engineers have no choice but to make those trade offs.
If it slowed to a speed of 0 and then fell until it reached 500 km/h, it would have had to fall for ~86 seconds. Lunar gravitational acceleration = 1.62 m/s^2. That means it was in free fall for a distance of about 6.0 km. That's all based on the "500 km/h" crash speed given.
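Worked out explicitly, assuming free fall from rest in constant lunar gravity:

```latex
% Free fall from rest with g = 1.62 m/s^2:
\begin{align*}
v &= 500\ \mathrm{km/h} \approx 138.9\ \mathrm{m/s} \\
t &= \frac{v}{g} = \frac{138.9}{1.62} \approx 86\ \mathrm{s} \\
d &= \frac{v^{2}}{2g} = \frac{138.9^{2}}{2 \times 1.62} \approx 5950\ \mathrm{m} \approx 6\ \mathrm{km}
\end{align*}
```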
@@scottmanley Love how reliable basic physics equations are! With either bits of data it still comes up with the same results! If only the rest of landing on the moon were that simple.
What I want to know is how that equates to a violent impact on earth. Do I divide by six, which comes to 83.3km/h or just under 52mph? That's bad enough for it to need airbags...
@@travelbugse2829 You multiply by the square root of six - assuming there is no air resistance, so with air resistance you might end up with something not too different from 500 km/h for this type of vehicle.
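In equation form (vacuum assumption):

```latex
% Impact speed from the same drop height scales with \sqrt{g}:
v = \sqrt{2gd} \;\Rightarrow\;
\frac{v_{\mathrm{Earth}}}{v_{\mathrm{Moon}}}
  = \sqrt{\frac{g_{\mathrm{Earth}}}{g_{\mathrm{Moon}}}}
  \approx \sqrt{6} \approx 2.45
% so ~500 km/h on the Moon corresponds to ~1200 km/h from the same
% height on Earth, ignoring air resistance.
```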
@@travelbugse2829 When it comes to the moment of impact, 500 km/h is 500 km/h. It's about Mach 0.4. You know those old war movies where they show fighters shot down and augering in? *That.*
I wonder what all these people in mission control were doing during the landing. Were they analyzing the telemetry in real-time? I assume they were supposed to notice that the radar altimeter was considered faulty and disabled. If so, perhaps they could have reviewed its readings and realized that after passing the edge of the crater, the readings returned back to normal. In that case, they could have just manually reenabled the radar altimeter. Since it is not Mars, the signal delay is small enough to allow for manual corrections during the landing.
Putting aside changes in mission plans, missing redundant systems, or even software bugs, I think the main issue here is overly strict programming. Assuming something is defective just because of a sudden out-of-scope change is a bit extreme. It baffles me how it could hover waiting for the moon, letting propellant run to zero, without at some point trying to salvage itself with something like: "this is not working, maybe I should take another look at that system I think is dead".
What you're describing is human decision-making: being ready to scratch the plan and try something better when the moment comes. You can't just imagine every scenario branching out at every step and hard-code solutions to each; at some point you realize you need a generic decision-making algorithm. In fact, the mission failed because they had a specific solution - switching off a reading - that had allowed them success in previous simulations.
It tore me apart watching the team come so close. It really has to weigh on the people who didn't catch the glitch - I am sure some are still lying in bed awake at night. Can't wait to see the team bounce back with a flawless mission.
This sort of situation is actually a good point to use against those who claim the Apollo moon landings were impossible back then because of the limited computer power available. Having a human (or two!) at the controls made the landings difficult but not impossible. Comparing it to the success /failure rate of the even earlier Surveyor unmanned landers shows it can be hard to do, and losing a lander is expensive but you can try again.
I seem to recall the University of Wyoming having a "Missile Guidance for Dummies" audio description of a guidance system for knowing where the missile is by knowing where it isn't - it seemed pretty rock solid. I have to wonder why this method hasn't been adapted for spacecraft yet.
Exactly. It would be especially helpful in this case; if the lander knew where it isn't, it would not waste fuel by trying to land as if it was just above the surface. :)
I believe that's just a sentence for lulz, engineers expressing themselves purposefully obtuse. Kalman filters are exactly the "knowing" part, and a closed loop control system is exactly the "subtracting" part.
@@u1zha th-cam.com/video/bZe5J8SVCYQ/w-d-xo.html This video is full of these sentences, that are close to how control loops work, but not quite, which I find quite funny, especially if you know how it works
@@u1zha Unfortunately, it also inspires a lot of morons to quote that line constantly on YouTube, perhaps under the mistaken impression that it makes them look smart.
6:20 Planned landing site change? That would normally require a revalidation of the software in the industry. I would blame this on the people who decided not to do that. I would investigate those guys and why the change. I would not blame the software as the software was not designed to be used that way.
Hardly a "bug" when it worked correctly for the data input it was programmed to handle. At best, it encountered data it _wasn't_ programmed to handle, which makes this more a missing feature.
Can't imagine why they didn't run simulations of this. It's not like the moon's topography isn't known down to the meter. Stick it in Kerbal and run simulations.
@@ddnguyen278 Uh, the moon's topography *isn't* known down to the meter. Some areas of the moon are, but generating meaningful maps of the moon is actually quite hard and time consuming. There are folks whose entire job is to take lower res digital elevation maps and apply reasonable interpolations to generate higher fidelity maps than we actually have. Not saying they shouldn't have done more sims, but it's harder than it sounds.
Sounds like they tightened the Mahalanobis check margins in the Kalman filter. That's the check that the real measurement at each time step, the expected measurement, and the estimated measurement errors are all in accordance with each other. You usually hard-code the acceptable margins for that, i.e. |real - expected| must be < 4 times the expected error. If it isn't, the measurement is deemed bad (e.g. the accelerometer physically fell off its mounting). Unfortunately the margins are often set too tight. It could've been another problem though, related to algorithms like simultaneous localization and mapping, but I don't have enough experience with those to judge.
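In symbols, the scalar version of that gate (standard notation, with the threshold hard-coded):

```latex
% Scalar innovation (Mahalanobis) gate:
\nu = z - H\hat{x}^{-}, \qquad S = H P^{-} H^{\top} + R, \qquad
\text{accept } z \iff \frac{\nu^{2}}{S} \le \gamma^{2}
% e.g. \gamma = 4: "real minus expected must be < 4 times the expected error".
```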
My Dad helped design the Apollo lunar landing software ... and curiously enough, it was never used due to a sensor overload ... the famous DSKY error 1202. When Neil Armstrong disabled my Dad's software for Apollo 11, that was the end of it. The LM landing program was always over-ridden by future LM pilots and the LM was landed manually. The fault was in a completely unrelated system ... I guess a lot of people wonder if it would have done its job. My dad says it was pretty robust and he never saw a simulation that it would have failed if given the chance to run to completion. It's a good thing Armstrong was a good pilot! My dad would go on to be famous for mockups, and then later, he worked on the avionics of the world's most capable fighter jet. He's getting old, but still with us. I wish he was more of a storyteller ... but the one he thought was the funniest (and most irrelevant) was meeting the president in the restroom at NASA ... as in, _um, nice day, isn't it Lyndon?_ as they conducted their business. I am guessing it was during LBJ's visit to Houston in 1968, the same time frame that my dad was working there.
Scott, I am surprised that you did not touch on redundancy. I was a fighter jet aviator, and one of the things we always did was use multiple sensors, letting the software compare them and estimate probability. If they had had three radar altimeters, they could see the rate of change of the surface as the spacecraft travels. Even if each would have shown the cliff, the probability calculation would have told it that it is virtually impossible for all three to suddenly go bad at once. Redundancy would be one answer in my book.
@@drill_fiend1097 This is a commercial effort so they could have just gotten one normally used in aircraft. It's not NASA where they cost $750K each just because...
Didn't Neil Armstrong have to deal with the 1202 error on the fly when the Eagle lunar module was getting overwhelmed with input? (Which was addressed on later Apollo missions through code fixes and turning off an un-needed radar as part of the checklist.) Seems like they could have used an altimeter, but on the moon the altimeter setting is always "00.00" 😅
Stuff like that is really hard to catch beforehand. Perfectly good sensors spit out garbage sometimes. I honestly think it might be worth having dual altimeters just to make the signal more robust (yeah, I know it would weigh more) - maybe lidar + radar. Then you make the software unable to disqualify both the radar and the lidar at once, or something.
Calling it a bug seems wrong to me, though; with dual altimeters at least one would have kept being trusted. It sounds to me like once an unusual number popped up, all further readings were ignored - did they really set up the software to turn off a key input the moment one odd reading occurred? Having a backup to double- if not triple-check incoming data on the fly, instead of relying on past information, should be normal. Just because a road was empty 30 seconds ago doesn't mean I'll blindly trust that there won't be something speeding around the corner. Expect the unexpected, especially when you're not on the same floating rock, right?
@@dounyamonty The system was designed to ignore all new data from a sensor if it felt like some of the readings were out of bounds. They could have had 20 altimeters and the system would have ignored them all because they all would have shown the same "error".
@@patreekotime4578 When you have only one sensor for a given parameter, treating it as failed after seeing out-of-bounds data from it is just about the only thing you *can* do. But when you have more than one sensor, you can cross-check them, and *that* usually becomes your primary method of determining whether a failure has occurred. If both sensors do something unusual *at the same time,* you might reasonably infer that the problem is *not* in the sensors. In this particular case, the spacecraft should have been able to cross-check the radar-altimeter reading with the topography it was flying over. Seeing the distance-to-ground reading increase rapidly as the ground abruptly drops away beneath the spacecraft is an entirely expected condition.
Very good explanation and a top video! I guess the loss of the Mars Polar Lander was also caused by a software issue - it shut down the landing thrusters too early, after the jolt of leg deployment was misread as touchdown, and the probe fell the rest of the way...
Redundant systems to help the mission don't matter if the mission never starts. I worked on a single/dual/triple redundancy system a long time ago. I think the probability of catching a single incorrect signal per million samples was roughly 75/93/98 percent for the respective configurations (I don't recall the exact numbers). A huge bonus from single to dual redundancy, but rarely worth the extra 33% in cost between dual and triple. However, each module had to boot up on its own, and if they did not, the system wouldn't run anyway.
Is there a requirement that all titles must be clickbait and include one of these words: Unbelievable, Shocking, Terrifying? No the reason wasn't unbelievable, it's actually quite believable and simply just an oversight.
@@scottmanley Lol, good save. Wasn't really directed at your video per se; that's just the YouTube titling trend these days. Although yours is actually technically accurate hah 🙂
Hi Scott, during the ispace debriefing, they reported that their velocimeter did not start reporting data when it expected to be 2 km above the surface (event 9 in the schedule). Do you know if this is a separate issue or a consequence of being too high above the ground?
Having seen in person how Japanese programmers work - how specialized and narrow their programming skills are, how ridiculously rigid their management approaches are, and how many non-unified standards they use - this sort of thing doesn't surprise me at all. PS: the Ron Burgundy clip was priceless and so on the money xD
The point being, this was a bug that should've been relatively easy to find if they had simulated a couple of "landing site changed at the last minute" scenarios that included heavily cratered areas or craters with steep walls. Just doing some tests on random landing sites would've triggered this. But nobody thought of this, and because of corporate culture, everybody was dis-incentivized to even raise the question.
It would be interesting to try an optical parallax system to verify the radar readings. If both systems agree, then the data is correct. The cameras could be a few meters apart, so the parallax would be measurable from pretty far away.
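The geometry behind that idea is plain triangulation. A toy sketch with invented numbers (range = focal length × baseline / disparity):

```python
def stereo_range(baseline_m, focal_px, disparity_px):
    """Distance from stereo disparity: range = f * B / d.

    baseline_m:   separation between the two cameras
    focal_px:     focal length expressed in pixels
    disparity_px: horizontal shift of a feature between the two images
    """
    if disparity_px <= 0:
        raise ValueError("feature not resolved in both images")
    return focal_px * baseline_m / disparity_px

# 3 m baseline, 4000 px focal length: a 2-pixel disparity puts the
# ground at 6 km -- enough to sanity-check a radar altimeter.
print(stereo_range(3.0, 4000.0, 2.0))  # 6000.0 m
```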
@@xonx209 In sci-fi it would usually be three independent systems, with two having to agree on what they were seeing. A two-system setup would require that both systems agree, and if one system cut out a sensor as malfunctioning and the other didn't, something would have to be present to break the disagreement deadlock.
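For what it's worth, the classic way to break that deadlock is median voting across three channels. A minimal sketch, with an illustrative tolerance:

```python
def vote_2_of_3(a, b, c, tol=100.0):
    """Two-out-of-three voting: take the median and flag any channel
    that strays too far from it. Values in meters; tol is illustrative."""
    median = sorted([a, b, c])[1]
    faulty = [name for name, r in (("A", a), ("B", b), ("C", c))
              if abs(r - median) > tol]
    return median, faulty

# Channel B has gone bad; the other two outvote it.
print(vote_2_of_3(5010.0, 9999.0, 4990.0))  # (5010.0, ['B'])
```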
Given that it was not possible to land the craft without the altimeter data, it's an odd programming decision to permanently ignore that data the moment it starts to look janky. Odds of the altimeter recovering its wits may be slight, but I'd rather give that a go than rely on an approximation that's +/- 5 km in altitude.
Exactly what I was going to say. Since radar altimeters are highly robust there is almost no situation where you should ignore one. If it does fail your mission is doomed, whether you ignore it or not.
If they tested by simulating landings on other spots but not on the selected one, then they didn't test! This isn't a software bug but a project management issue (specifically testing). It's like "testing" your computer program on your desktop but then deploying it on a server, where the faster hardware makes apparent a race condition that borks the system. Testing is expensive, testing is hard, but not testing the actual flight plan is dumb.
Software engineering does not start at the keyboard and end when the work gets sent to a testing team that is somehow not software engineering. Engineers have a responsibility to work with the testing crew to validate the test scenarios. The teams failed to run enough variations of realistic input, so inputs outside the limited sets caused a fault. Specifically, several bugs in the system as a whole:
1. The spacecraft is unable to land without altimeter inputs. Relying on inertial guidance alone cannot be accurate enough to land, due to inherent input noise. If the altimeter signal is discarded more than X seconds before touchdown, error margins push failure rates toward 100%.
2. The guidance system (apparently) had no way to recover confidence in sensors.
3. The guidance system would erroneously flag valid altimeter inputs as a broken sensor.
4. Testing was not done to cover the new landing site (and yes, a senior engineer should have balked at the change).
Another fascinating and instructive example of Robert Burns' "The best-laid schemes o' mice an' men / Gang aft agley." Cheers from sunny Vienna, Scott.
When flying on instruments, pilots are trained to trust what the instruments are showing them, not what the software in their heads is telling them. "I can't see the horizon, so I think I'm upside down..." No... you are the right way up, etc.
Been through the training; you absolutely cannot trust your feelings. It varies for each person - for me, I felt I was leaning left and right. I had to actively fight to stay focused on the instruments, which drains your energy very quickly. Even if you are completely focused on the instruments, your arms will try to 'level' the plane. You acclimate with each subsequent flight, and eventually it's nothing. Instrument flying is still draining, as you have to fuse all these sensors yourself. In comparison, looking outside is about as stressful as driving.
I would add optical recognition and stored high-resolution images to the package. These would optically compare the expected position and orientation to the visible information and so call out anomalies. This entire apparatus could be as small as a Raspberry Pi, using off-the-shelf components. DJI drones use something similar in their return-to-home software: when the drone sets off, it takes photographs of its starting location and compares them to the downward-facing camera's live images when returning to land.
One of the reasons Chandrayaan-2 failed was the mapping. When the lander drifted away from the photographed landing site, it tried to over-correct and failed.
Neil Armstrong took over the controls and manually landed on the moon when he saw rougher terrain than expected at the final approach of the first manned landing in 1969. He was a true test pilot who was able to think fast and take action without losing his nerve. He barely had enough fuel for the extra maneuver, so he was also lucky. The problem with depending on robotics is that software doesn't have "common sense" and enough experience to handle the unexpected. However, these crashed robot landers are much cheaper than manned missions, so with trial and error they will eventually work.
And then a unicorn ran up & they rode the unicorn all around the Moon going 240,000 miles back to Earth. The Unicorn didn't run 28,000mph like they would've had to go in the pop rivet aluminum can they brought them there.
It has. All those "radar sensors" (I think it had three, for collecting data from different directions) that were measuring the height were identified as "faulty" by the bug, and it turned off the input from those sensors. Marking the sensors as faulty based on their own output and turning them all off was the bug.
A moment of silence for the Mars programmer who couldn't handle math, metric and standard. Although, having some direct experience with atrocious egomaniacal team leaders and managers, including one in particular at JPL, I would suspect that the programmers and reviewers did exactly what they were told, while the team leader should get a "black mark in his permanent record" for being the one who was the true culprit.
@@samuraidriver4x4 Really? Can you describe your design? I was thinking about a lightweight baseball sized package on the end of a long collapsible pole, throw a dozen of them out in hopes a few survived to do the job. Or how about a huge air bag with the probe suspended by rubber bands in the center, no need for an accurate landing of the beacon. Did y'all really think I was suggesting to land a huge piece of equipment to be a beacon?
I remember one of my computational methods professors saying, "There is no such thing as software bugs, only human error," in one of our first lectures, so that we would code more carefully and make sure everything has correct syntax before moving on to a new line. It'll save you a lot of time compared with overconfidently writing 100 more lines of code and then having to scroll through it all only to find that you typed "fr" instead of "for."
Most bugs are not caused by typos but by overlooking consequences somewhere else in the code - like creating a potential timing/synchronization issue - or by wrongly interpreting the functional requirements, or by misreading the documentation of an external library, and things like that. Typo bugs are rare, except in the UI, but that is because certain programmers cannot write decent English sentences 😅
There are very few human activities that are as complex and as routine and yet have such potential to fail catastrophically from even the smallest of errors. So yes, it's objectively a hard thing to consistently do well.
I suppose that with the rapid advances in technology and AI, this kind of problem will soon disappear. A simple camera pair, for example, could recreate human-like vision and give enough information to an AI to perform a landing, especially if paralleled with all the already-existing sensors.
With AI comes the need for more processing power to multiply large matrices. That increases the electrical power requirements and demands more R&D to create radiation-hardened variants of processors. The Snapdragon 801 in the Ingenuity Mars helicopter is probably the state-of-the-art SoC in use - and that's the same chip used in the Galaxy S5 a decade ago.
I wouldn't call it a software bug. The software performed as instructed. It was the change in the planned landing site at the last minute without any testing afterward. Someone f____d up.
such a stupid error... I'm sure the engineers were banging their heads against the wall when they found out. Hopefully they will be able to build it again and fix the SW for a proper landing.
@@laimejannister5627 you do realize you just compared a private company to a government, right? Also, China has a ton of experience in space - they have their own space station, a Mars mission, satellites, multiple rockets... just because cheap crap is produced there for people not willing to spend $$$ doesn't mean that they can't produce quality items
@@witchdoctor6502 well, to be fair, SpaceX is also a private company, yet they are better than all but maybe a handful of government agencies on Earth. So what kind of entity it is doesn't mean much. Plus, they already get a pass since it was launched on a SpaceX rocket.
Dunno if you'll see this, but I'd be suuuuper interested to hear your thoughts on whether software engineers should have the same certification processes that physical engineers have.
Fyi: they do have in some countries. One isn't allowed to develop software for medical gear for example here in Germany without certain qualifications. There are also legal requirements and standards that the software development teams working on critical or potentially dangerous software have to adhere to here both in regards to coding and testing, but also in regards to their overall software design, risk analysis and much much more.
The kind of software you're working on makes an enormous difference to the consequences of getting it wrong, and I don't think it necessarily makes sense to try and develop a certification scheme that's simultaneously rigorous enough to assess people working on nuclear reactor control software while not failing 90% of candidates going into a career in casual game development...
They do. Software engineer, computer engineer, computer scientist, etc. are all degrees that you can get studying in college. The issue is that you have all the three-month React crash-course Jimmys who call themselves "software engineers" when they are software developers instead. So yeah, there are many different certificates for software engineers; it's just that the name is often not respected. Btw, I'm not trying to shit on self-taught or non-engineer software devs. The same way people like Michael Faraday did amazing things in the name of scientific research without a degree, some software developers are incredible at what they do. I don't want to sound like I'm implicitly blaming the lack of certificate enforcement for why bugs like this exist. This could very well have been coded by a software engineer, who would also not be to blame, because stuff like this has to be peer-reviewed and tested by many people and on many levels. It was simply an incredibly unfortunate event.
A degree is not certification. And for the physical engineers who assume accountability for the designs of their companies - Principal Engineers - the job titles do not come easily. What would be interesting to see is whether there is any push to require engineer ACCOUNTABILITY, not just responsibility, as is currently the case in the physical space.
This brings me back to my software engineering course. There are supposed to be several stages like requirements, specifications, design, implementation, testing, maintenance. And each stage is supposed to be evaluated and fed back to improve processes. The instructor stated that there had only ever been ONE software project to have ever followed the theory: Space Shuttle avionics.
My first feeling was that one of the issues is that they all seem to be young people - very, very smart young people, I know. But I don't see the 'old guy' in any of the pictures; an old guy who has experience with developers and engineers.
Yes please!!! We need more of those videos please Scott.
Another vote. I see it in formula racing where drivers are having to "fail" various sensors to address issues with the power train.
Yep, me too. I think I'm aware of the issues already, but I'd like to know how different real sensors would be.
This! This all day.
Your first remark is very true, and not just in an aerospace environment!
That's like releasing software without running unit tests right after the remote guy pushed ten thousand lines of code.
oof
Software in general too
In their official debriefing, ispace actually admitted that it's primarily a (project / program) management issue, not an engineering issue. That gives me hope that they might actually learn something from this.
most underrated comment
Life imitates art: a common problem in the Dilbert comic results in utter failure.
It's almost *ALWAYS* a project/program management issue, not an engineering issue. This was also true for Mars Polar Lander and for Mars Climate Orbiter (the one that famously mixed up imperial and metric units).
Don’t bet on it
@@Josh_728 Get with the program: in 2023, we measure things in bananas
This is why you also have timers for expected milestones (the earliest and latest time a milestone can be validly sensed). My background: I worked on the Attitude and Articulation Control flight software for the Galileo and Cassini spacecraft at JPL. For a very simple and solid method, they could have used what the Surveyor landers did.
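A bare-bones sketch of those milestone windows (the names and timings below are invented for illustration): the sensor-rejection logic is simply not armed at times when a wild reading could still be legitimate terrain.

```python
# Hypothetical milestone windows: (earliest_s, latest_s) after the deorbit
# burn during which each event can validly be sensed. Outside its window,
# a "weird" reading is treated as environment, not as a sensor failure.
MILESTONE_WINDOWS = {
    "altitude_below_3km": (540.0, 660.0),
    "touchdown":          (700.0, 780.0),
}

def fault_logic_armed(milestone, mission_time_s):
    earliest, latest = MILESTONE_WINDOWS[milestone]
    return earliest <= mission_time_s <= latest

# A huge altitude jump at t=300 s (still coasting over rough terrain)
# must not trip the sensor-rejection logic:
print(fault_logic_armed("altitude_below_3km", 300.0))  # False
```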
Thanks for your awesome contribution to space science!
I'm sure mission control had a plot of the expected altitude changes; the lander may have had one as well. The problem is that the expected rate of change of the altitude was outside what had been set as acceptable for the radar altimeter. It was probably written in the specs somewhere. Proper simulation of the landing would have caught this, and it could possibly even have been dealt with after launch. It's changing the landing site without simulating it that screwed them.
What did Galileo and Cassini use for altitude readings? and would they have been equally screwed if forced to switch over to gyro / accelerometer readings with an apparent failed altitude radar?
@@nocturnal6863 John's point was that the "forced" switch is averted if the switching algorithm is completely disabled at such an early phase of flight. Reread the part about the "earliest time... a milestone can be validly sensed".
@@u1zha except you wouldn’t disable the software monitoring a sensor for failure. Not unless you knew in advance it might give faulty readings at that point.
Thinking about it further, I think I see what you are suggesting: the system should have been expecting the dip in altitude, and its failure to see it coming means it should have known its altitude estimate was off.
It fascinates me that as you look at the history of disasters how many of them are ultimately caused by cutting corners to meet time pressures or budget targets. In this case you have to wonder (A) why the target zone was changed late in the game, and (B) why simulations with the new target zone weren't run. I would bet a dollar that engineers thought of it, but they were over-ruled because of time pressures or a budget target.
Your question (A) is a great one.
It could be that the new landing site could be reached with less expenditure of propellant or something like that. They thought it was a lower margin of error. Or was it the opposite? Was there a "better" more ambitious site with more interesting geography?
@@aarondavis8943 It is interesting to speculate. IMHO "less expenditure of propellant" would fit the theory about disasters and cutting corners to meet budget targets. On the "better geography" thought... unless an asteroid suddenly impacted an area close to their original site, one would think that the geography question would have been settled long ago... the lunar surface is pretty well documented (at least the near side).
Proverbs have a sort of statistical truth. "Haste makes waste" exists exactly because of that.
The sad part is that we seemingly keep making the same mistakes.
@@pierQRzt180 Yes it is sad to think about all the people who have lost their lives due to decisions on someone's part to save a few bucks by cutting corners. One of the latest examples appears to be that partial collapse of the apartment building in Davenport. Looking like the owner went with a cheaper contractor who would forego shoring up the building before proceeding.
And (C) why there weren't redundant systems with a majority check before deciding to discard the most vital part of your data...
It continues to amaze me that we managed to safely land astronauts on the moon AND have them take off from the lunar surface and return home, several times. Obviously, having actual humans present makes a ton of difference, but the number of things that could have gone wrong but didn't is mind-boggling.
The brain is a wonderful flight computer.
Lander: I'm going to land here.
Human: Dummy, there's a rock the size of a McMansion there! Gimme manual control.
The first moon landing was saved by the astronauts: the automation on the lander was going to put it down in a field of big boulders.
Think about it like this: humans have managed to control powered airplanes since the start of the 20th century, while autonomous aircraft have only appeared in the last decade or two. Humans are just that versatile.
@@MarlinMay I was howling when I pictured this in my head, with an astronaut slapping their computer and calling it dumb.
One of the NASA research reports justified the cost and risk of sending human astronauts to the Moon with the allegory that the human brain is "the most lightweight and easy-to-acquire real-time non-linear computer".
About 30 years ago I had a single page, photocopied from Computerworld or some such industry publication, taped to the outside of my cubicle. On that page was a list of the ten most expensive software defects (bugs). I was astounded that the most expensive defects caused hundreds of millions of dollars of loss. When you read the list, the top five defects (again, multi-million-dollar losses) were all losses of spacecraft and/or their payloads. Flight software is tremendously complex, and a single error will cost you your whole vehicle and years of effort. Today that page would have to be scaled to nearly billions in losses, I expect.
Maybe not most expensive, but Therac-25 should be on that list somewhere. Ya know, because it ended up maiming and killing a bunch of people with intense doses of radiation.
There you go Scott - that’s an epic video right there. Top 10 most expensive Astro/Software defects.
@@a.p.2356 In a way, that is possibly the most expensive software bug ever; in another, it's quite cheap. Consider: we know for a fact that cars kill people, all the time, in every way, yet we do not ban cars.
The value of a human being's life has been calculated, and apparently it's cheaper than you would expect. Electricity production has a cost measured in lives per TWh - you can look it up. Biofuel has a cost of 12 people per TWh. Solar is 0.44. Wind is 0.15, and nuclear is 0.07.
The average American consumes 0.1-0.2 GWh per year. In other words, over the course of your entire life you will likely kill less than one fiftieth of a person in order to keep the lights on. This does stack with the people you kill while driving, however - I'm talking about tyre particulates and excess deaths from pollution, not running somebody over.
Ain't that grand?
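Taking the figures above at face value (they are the quoted numbers, not independently verified), the lifetime arithmetic works out like this:

```python
# Back-of-envelope check using the rates quoted above (unverified).
deaths_per_twh = {"biofuel": 12.0, "solar": 0.44, "wind": 0.15, "nuclear": 0.07}
consumption_twh_per_year = 0.15e-3   # 0.15 GWh/year expressed in TWh
lifetime_years = 80

for source, rate in deaths_per_twh.items():
    deaths = rate * consumption_twh_per_year * lifetime_years
    print(f"{source:8s}: {deaths:.4f} statistical deaths per lifetime")
# Biofuel is the outlier at ~0.14; the grid-scale sources come out
# orders of magnitude below one fiftieth of a person.
```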
@@o0alessandro0o It's not easy to respond to that kind of information. I do know that training and regular checking of pilots contributes to a high level of safety in commercial aviation (ignoring mechanical failures). For drivers, I reckon similar processes should be followed. It would not be popular with the general public, but I have said for years that licenses should be graded, based on years of experience and how many training courses a driver takes. Governments balk at the idea, however, and go on putting up cameras and roadside radars and more draconian speed limits, while never addressing the fact that poor situational awareness, slow and inappropriate reactions, and limited skills are the biggest factors in car accident rates. But I'm going down a rabbit hole!
Not strictly a bug, rather bad design; but implicit nullability - first introduced in ALGOL in 1965 and later copied into most programming languages - was famously described by its creator, Tony Hoare, as his billion-dollar mistake.
I think I read somewhere that at the time the estimate was quite accurate, but that was 2009 so by now it wouldn't be surprising if it is an order of magnitude too low.
It's very interesting that this is almost the exact reverse of the famous 1201 alarm on Apollo 11.
In that case the computer restarted and generated errors on the astronauts' control panels. But because they knew they were at the right altitude per the flight plan, they had confidence that they were still flying correctly, and Neil Armstrong brought the lander down safely.
The source of the 1202 and 1201 alarms was traced to the rendezvous docking radar - used for rejoining the command/service module - being inadvertently left on at the same time as the landing radar, the only one needed for the descent phase. This overwhelmed the lunar module computer, but mission control knew it was still safe to land because of one man at Houston.
@@warrenpierce5542 yes, when you go into the details these are different cases. But in the abstract, in both cases the computer was confused because it got signals that were unexpected and didn't handle them well. In Apollo's case, the human was able to use additional information to recognize that the problem wasn't severe - and in this case, there was no human.
@@warrenpierce5542 In Mike Collins' excellent book, he mentioned the 1201 and 1202 weren't exactly "well known" issues. Took a bit of "looking up" (quickly, albeit)
That second antenna wasn't inadvertently left on. I saw Buzz sheepishly confess in an interview that the engineers didn't think the same way he did.
Just realized that manned missions are in a sense technologically easier (relying on the skill of the pilot) than the unmanned soft landings that are only now possible thanks to progress in software systems.
My experience of being a software engineer is that the code has to be tested every time. It's amazing how often things that can't go wrong do go wrong.
It is not the software that changed but the parameters of the flight. You may of course argue that the software was made for one particular landing zone - which I do not buy.
I may be mistaken, as the video is my only source of knowledge about the situation - sort of like this radar was. So you take a peek at the surface with radar and see this crater... or rather, a human with eyes on it would have seen the crater; the landing module saw just a single point on the surface which suddenly read 3 km further away than the previous point it peeked at. I suspect what they would have needed is for the radar to measure more points, especially from a distance, and average them - or use some other technique to see where the craft is. Much lower down, this would also be needed to check that no big boulder occupies part of the landing zone. I suppose that last concern was eliminated by the assumption that the landing would be made on a flat, empty surface chosen by mission control. I suspect that if they were landing on a water/liquid surface this radar error could only occur due to a massive tsunami - well, no water surface means no tsunamis, but also a hard landing.
Interesting to know all this, though, ain't it?
Been there! Written lots of software... made some unbelievable bone-headed mistakes, which are all *BLINDINGLY OBVIOUS* in retrospect. "This change is SOOOOOOOOOO OBVIOUS that we don't need to test it" ... ha ha ha... this is when reality bites you on the backside, informing you that you definitely *DO* need to test it again.
@@hanskloss7726 HA! Then you discover it's high tide instead of low tide... maybe you simulated it with mean sea level but a mile away someone opened the sluice gates and there was a large wave from the reservoir... etc.
This moon lander crash is an example of space sabotage. Deliberate.
@@simonmultiverse6349 low tide vs. high tide doesn't cut it here - the surface is still mostly flat, at least from a 5 km perspective. The crater is a different story, so you need many measurement points, and possibly also a map? Not sure which is easier here, but their method obviously failed.
We know there's no shame in this - we have all been there...
There was a similar bug in the LEM: if the module had flown above a circular crater of a certain size, the radar altimeter would have shut off all propulsion, probably leading to a crash. Fortunately the bug was never triggered (mainly because the onboard crew had taken over manual control by that point), and it was only found decades after the landings.
why are lunar manned missions not done anymore these days?
@@alamrasyidi4097 Congress dropped funding, so NASA had no money for going to the moon (canceled the last planned 4 trips).
@@alamrasyidi4097 No Soviets to beat
@@alamrasyidi4097 isn't NASA planning to go back? Starting with an (unmanned?) mission sometime after 2024?
@@dr.cheeze5382 so I've heard. But compared to the alternative of losing these spacecraft to software errors, I think "no Soviets to beat" is a ridiculous excuse. So I still really don't understand why lunar exploration has been strictly rover-based these past few years...
I flew fighters for over 20 years. The Kalman filter was the bane of the navigation and bombing solution. It would actually discount most of the updates I would insert. It thought it knew more than I did... it didn't.
Another outstanding episode Scott! Being a software safety engineer for the last 39 years, I have to agree with previous comments that point out this is not a software bug, but more of a people problem during design, testing, management, etc. I believe the first Ariane 5 launch was a similar issue where the software worked perfectly per its specifications (from Ariane 4) and doomed the flight to failure. Like in this case, proper testing would have prevented the, expensive, tragedy. Also wanted to give a shout to "How To Destroy Wayward Rockets - Flight Termination Systems Explained". My 39 years were all spent on Range Safety Software with the last 13 years working on autonomous flight termination systems. That was another outstanding episode! Keep up the awesome work!
It is a software bug, though. The people problem is that the bug wasn't caught.
Have you ever experienced a flight termination system not working instantly, but 50 seconds late, as with the Starship launch?
@@vast634 Depends on the type of Flight Termination System (FTS). For solid rocket motors, they use a shaped linear charge that opens the casing and exposes the fuel which burns up quickly in an impressive display. (I think Scott mentioned that in his previous video.) For chemical fuels, things are different. You have more choices. The basic idea is to stop thrusting the vehicle so it falls into an unpopulated area, such as a broad ocean area in the case of SpaceX. Based on the video of the flight, the FTS worked properly and detonated explosive devices that created holes in the fuel tanks. That reduced or stopped the fuel flow to the engines. The FTS did its job. After that, it's all physics. If the fuels are hypergolic, they will combust on contact and you get a near-instant explosion. Otherwise, you need combustible fuel, oxygen, and an ignition source. Guessing, it took about 40 seconds before the three elements came together in the right quantities in the case of SpaceX. An FTS doesn't need to create an explosion. Rather than connect to explosives, the FTS can connect to fuel valves that terminate fuel flow.
@@xGOKOPx I understand your perspective. My point is that the bug should have been avoided during design or implementation, and if not, then detected during development testing. Find and correct all the bugs before deployment. Since their development testing failed to react properly to "unexpected terrain" (kind of a silly term considering the moon's terrain is pretty stable), the people failed in the software development cycle and left in a failure mode (i.e., the bug) so it could be exposed during execution. The software did what it was designed to do so it worked properly. The people failed to account for something. The same thing happens with hardware but folks don't usually blame the hardware. The failure of Galloping Gertie wasn't blamed on the bridge. The people who designed and built it were blamed for not accounting for potential wind loading.
Scott - great video as usual - thank you!
But really, this is not a software 'bug', is it? It's a systemic design and control failure. The software was designed to work as it did, but the specifications do not seem to have included passing over a crater like this. In other words, the initial flight plan was intended to avoid this situation, and the software was designed to work within that flight plan.
The first error was changing the flight plan without checking if the software could still function with the new one. The second error was not testing the software under the revised conditions it would have to work in.
Both errors are symptomatic of inadequate control over change management.
In other words, the flaw did not lie in the programming, but in the organization's approach to change management.
...and perhaps a third error: the program disregarding the radar altimeter instead of querying it again. 'Say what? That result is outside of parameters. Please take your reading again.'
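That "take your reading again" policy is cheap to implement. A minimal sketch (the interface here is hypothetical): require several consecutive implausible readings before writing the sensor off, rather than latching the first anomaly forever.

```python
def query_altimeter(take_reading, prev_alt_m, max_delta_m=1000.0, strikes_allowed=3):
    """Ask a suspicious sensor again instead of permanently latching it out.

    take_reading: hypothetical callable returning a fresh altitude in meters.
    Only after `strikes_allowed` consecutive implausible jumps relative to
    prev_alt_m do we declare a fault. Thresholds are illustrative only.
    """
    for _ in range(strikes_allowed):
        alt = take_reading()
        if abs(alt - prev_alt_m) <= max_delta_m:
            return alt   # plausible again: keep using the sensor
        # implausible: ask again instead of giving up forever
    raise RuntimeError(f"altimeter faulted after {strikes_allowed} consecutive bad reads")
```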
I agree this probably shouldn't be called a bug. Requirements were not properly validated, so it's a failure in their systems engineering process.
came here to say that!
Also, even if the result later came back inside parameters: the device has already proven unreliable. It might be an intermittent error, or it might be a bias that existed all along and was only noticed on this occasion. Revalidating system reliability would be a tough cookie to crack on its own without redundant second and third systems, though it should have notified ground control and gotten an update/patch - to my understanding that is how system failures are generally resolved. I don't know if their mission profile put an artificial time delay on that to prepare for longer-ranged versions, or what happened.
Haven't read the report, so I might be wrong, but if they use something like a Kalman filter, it is likely that they are not simply declining to query the sensor; rather, the calculated variance associated with the sensor readings spiked. In that case, the sensor would still be queried but _effectively_ disregarded, since the resulting effect on the output would be so low (due to the change in the assigned variance). Someone can correct me if I am wrong here.
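For readers who haven't met this: in a one-dimensional Kalman update, inflating the measurement variance drives the gain toward zero, so a distrusted reading is still ingested but barely moves the estimate. A toy sketch (not ispace's actual filter):

```python
def kalman_update(x_est, p_est, z, r_meas):
    """One-dimensional Kalman measurement update.

    x_est, p_est: prior state estimate and its variance
    z, r_meas:    measurement and its assigned variance
    """
    k = p_est / (p_est + r_meas)     # Kalman gain
    x_new = x_est + k * (z - x_est)
    p_new = (1.0 - k) * p_est
    return x_new, p_new

# Trusted sensor: the 8 km radar reading pulls the estimate hard.
print(kalman_update(5000.0, 100.0**2, 8000.0, 50.0**2))
# Distrusted sensor: same reading with its sigma inflated a million-fold.
# The gain is near zero -- effectively disregarded without being switched off.
print(kalman_update(5000.0, 100.0**2, 8000.0, 50.0e6**2))
```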
There's also the design of the spacecraft that has to be called into question, specifically the AD&C architecture. Relying on a single altimeter means that you can't verify the data with a redundant sensor. Since accelerometers and gyroscopes can't really capture things like topography from orbit, it's like flying with one eye. I don't know how much mass, power, and space another altimeter would have taken up, but perhaps a redundant altitude sensor, possibly one with a lower resolution and/or sample rate, could have been used to verify the data coming from the primary one.
Makes the Apollo landings even more amazing, considering the precision of the sensors and the computational resources available, both to simulate and to support the landing.
I feel such compassion for the Hakuto-R team. They are going to accomplish it!
I hope that they can build a new one and land it on the moon in the next few years.
I watched the landing live stream and I felt so bad for them. Their nervous faces really hurt me too. I bet they do it next tie!
They may not accomplish it. Very often such incidents reveal a whole load of issues that have been swept under the carpet, and the necessary organisational change required to address them all can easily break a small team / organisation.
Even big companies can be killed by this. This is what is going on at Boeing right now. They caused the crashes of two 737 MAXs and killed people. Since then they've tried to institute root-and-branch reform of how they run their business. Yet they're still having problems. The most recent one was a fuselage manufacturing defect (they were building them wrong) that had gone unnoticed for approximately 700 airframes (yep, they're flying, possibly with Southwest today!). Fine, they've found it; repairs needed, not immediately dangerous, but it cannot be ignored.
Trouble is the manner of them finding it was accidental; someone was in the right place, at the right time and realised what was going wrong. The issue is that, if despite the introduction of a root and branch reform about how they approach quality (= safety, reliability) they're still finding major issues by chance, then the root and branch reforms are junk and are not working. They should be finding such problems as part of a systematic continuous improvement process, and they're not. So the bet-your-life question is, what else have they missed, given that they've essentially admitted that they've not been looking hard enough?
It's similar with 787 (fuselage barrel joints), brand new 737MAXs with FOD and rodent damage, etc.
This suggests to me that Boeing are in no way adequately reformed following the MAX crashes, the problem most likely being in the senior management who never understood it before and are still there today. It's worryingly possible that they're going to make another fatal mistake. Ok, the FAA is now (belatedly) keeping a much beadier eye on Boeing, but they can't see and check everything; certification engineers / inspectors are not there to do basic QC and basic QC improvement.
The Hakuto team's best bet, if they're to try again, is to just fix that one core issue, do as much simming as they can muster, and go again. Unlike at Boeing, their crashes cost only disappointment and money.
❤❤❤❤❤❤❤❤❤❤ Yes they will succeed…… After they spend a lot of someone else’s money……… Let’s all go to Sugar rock Candy Mountain
@@emileriksson76 What color is the next tie?
Unfortunate to see what led to the failure of this mission. But glad to see that they have found the issue. Really hoping they succeed on their next attempt. Thanks Scott for the comprehensive explanation on this!
they didn't find the issue, they made the issue
Sounds like the software took the path of flat-Earth "science".
What I see doesn't fit my preconceptions, ignore it!
People need to wake up, controversy surrounded moon landings because there is stuff there. The issue / bug was in there on purpose. They will probably never let us see the real moon.
They did not find the issue. The issue was management. The software was fine. Software did not change the landing location, management did.
@@davidbeppler3032 The whole thing was planned - it is every time, with every country. Why do people not see this? Every single machine that lands on the moon has issues: #1 because the surface is covered in glass domes and other hanging debris, and #2 to cover up such things from the public in a convincing way.
7:50 Yeah, the Vikram lander from Chandrayaan-2 lost communication and went out of control, but with improvements in software, dampers, etc., we are ready again with Chandrayaan-3, set to go in July according to the official announcement...
Excellent explanation, Scott. Thanks for putting it all together for us to easily digest. Nice Kerbal recreation, too!
I'd argue it's due to inadequate testing and making assumptions they shouldn't make rather than just blaming the software.
To move the landing site and NOT run a series of full simulations for the new site is just an astonishing degree of incompetence!
This.
@@mcgilliman I like to think of an F1 analogy... Imagine if you set your car up to race in good sunny weather in Monaco at sea level, and they changed the race to be in Mexico in soaking wet weather at 2,260m above sea level... You would NEVER just race the car with the exact same set up and no testing before the race!!
True, but if history has taught us anything, it's that the incompetence almost certainly wasn't the software engineers themselves; it was instead the cumulative effect of multiple levels of bureaucracy repeatedly ignoring the recommendations and pleas of the people who actually knew what they were doing and what additional work had to be done. I suspect this is a scaled-down version of Challenger all over again, albeit thankfully with no loss of life this time.
Been watching you since 2015! You have helped keep me interested in space flight! Thank you for doing what you do Mr.Manley.
Such a bummer that an error-correction filter with an edge case nailed them. Lots of amazing data, and at least it's a software fix!
Cliff case*
🎢
Seems a pretty basic error: in what universe did they think they could figure out the exact altitude without the radar? Even if it really was broken - too bad, damned if you do, damned if you don't.
@thePronto They moved their landing site to align with NASA South Pole targets super late in development (post validation). The threshold for culling/re-baselining seems to be the issue. The sudden change in relative altitude wasn’t expected from their simulations.
It's not the error-correction filter that nailed them, it's the people who designed/specified/signed off on it.
Working as a software tester, I often see managers choose to take the risk of malfunctioning software to save some money, especially when it comes to error handling.
Which is a shame, really. The more expensive the project, the less management should feel like cutting corners on error handling and verification. Ah, well, what "should" happen in the real world doesn't agree closely with what actually happens in the real world.
Being a software developer, I have seen this happen countless times in multiple companies. Software is often overlooked. Testing is usually considered redundant and a waste of time/money. Developer's warnings and requests are normally disregarded or displaced by other department's concerns which are non-technical and even non-functional.
In my experience the software developers/engineers are kept out of the decision-making inner circle. Actually, this goes for engineering/tech in general. It’s fine, just change this, this and that: what’s the worst that can happen?
(Changing the Moon’s landscape to more closely resemble a seedy neighborhood in Brooklyn, one spacecraft at a time.)
Even after full testing, I still fear my code will break in some case we haven't looked at 😂. That's scary for a space mission.
Software may have been the proximate cause but you can argue the real problem was somewhere in the development and quality control procedures. How can you not re-run a full landing simulation after changing the landing location? It reminds me of Starliner's problems in 2019. The software glitch where the flight computer grabbed the wrong "time" was the proximate cause, but the real problem was Boeing never ran a full end-to-end launch simulation.
Actually these days so much of manufacturing and coding is outsourced that the management, hardware and software teams are no longer next to each other - quality control begins to suffer massively. The more people outsource stuff, the more the work gets into the hands of rookies paid on cheap wages, who then end up making rookie mistakes that then require even more time and energy to fix. Boeing turned from an engineering firm to a management firm and the rest is history - 787, 737 max, 777x, Starliner.
And as more and more automation comes in there's less and less human intervention to take care of the times where the computers reach their limits.
Feels like they might not have great CI indeed; probably more like a bunch of artifacts in Git LFS-type management. But the Starliner glitch might be a slightly different topic, IMO.
I was also wondering how expensive such a simulation would be. If they aren't too expensive, couldn't you run landing simulations from randomized positions and flag anomalies from there? Not so much that you can just fling the lander at the moon arbitrarily, but so you can find starting conditions which result in something weird.
IDK, maybe we're getting into a space where "moon lander software testing" and later "asteroid lander software testing" might be a market - that would be amazing. With the costs of these missions, there might be some money on the line for a testing company, especially if they end up with a body of "known problematic situations" like the one from the video.
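Something like this Monte Carlo fuzzing loop is one way to do it; everything here is hypothetical, including the `simulate_landing` interface:

```python
import random

def fuzz_landing_sim(simulate_landing, n_runs=10_000, seed=42):
    """Monte Carlo fuzzing of landing scenarios.

    simulate_landing: hypothetical callable taking (site_lat, site_lon,
    nav_error_m) and returning a result object with .crashed and
    .sensor_rejected flags. We hunt for runs where a healthy sensor was
    rejected -- exactly the failure mode described in the video.
    """
    rng = random.Random(seed)
    anomalies = []
    for i in range(n_runs):
        scenario = dict(
            site_lat=rng.uniform(-89.0, 89.0),
            site_lon=rng.uniform(-180.0, 180.0),
            nav_error_m=rng.gauss(0.0, 500.0),
        )
        result = simulate_landing(**scenario)
        if result.sensor_rejected or result.crashed:
            anomalies.append((i, scenario))
    return anomalies
```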
How can you put a tank that has experienced both problems and damage in test into Apollo 13? Exactly like that, only different. Or get km and miles crossed up and smash a probe into Mars (IIRC), or ... ad infinitum.
You can run all the simulations in the universe and still have problems, but not running ANY sims to cover a deviation in the program... yeah, that's just begging for it. I would think, in this day and age, that you could pretty much run that sim in real time, in parallel with the mission - for the problem they had there, knowing the path and surface profile - and have it fire off a quick "do not ignore the damned properly functioning radar" command, or some such. It might even be good to have REAL-TIME simulations running against the truth of the mission.
Having done some aerospace hardware design, I'm guessing that there were schedulers and/or bean counters directly in the problematical loop. Or maybe idiotic MBA wielding managers that think they are engineers, or worse know BETTER than the engineers, because they know a few buzz words, and then maybe hold people's feet to the fire to get them to sign off on VERY cold Shuttle launches, or what have you. That's the sort of feedback you do NOT want in, say, a servo. :-/ Sometimes I look back and am glad I am retired, frankly. Some of it was fun, some of it SUCKED.
Doc requirements come to mind as some of the latter. I had one junior documentation fiefdom wannabe tell me that the real output of a program was the documentation. When I finally quit laughing I told her that if she actually believed that she should go talk to some F16 pilot and ask them which they'd rather have with them on a mission, a working LANTIRN pod, or the documentation that describes it. She wasn't happy, because then a couple of people standing around laughed too. She wasn't a nice person (that's putting it mildly), or I wouldn't have said it that way. My bad, I guess.
Armchair quarterbacks are a dime a dozen. You can certainly crow if you ever land a vehicle on the moon or even achieve orbit. Perhaps we can talk about "how obvious" the solution was when we stop whining about how LONG it takes to build and fly a vehicle and how the contractors are "milking the American public" for so much money.
I thank Providence every day for Kathy Lueders and NASA for riding herd on SpaceX to make the Dragon 2 safe. Everyone had lots of criticism for NASA for being conservative and "delaying" the first launch of the manned spacecraft. But all that effort kept the astronauts safe. (Also SpaceX had a real leg up on Boeing, because they had a working cargo spacecraft in Dragon 1 to build on. The last time Boeing designed a manned spacecraft was the 1970s and the Space Shuttle. All those engineers are long since retired.)
A similar thing happened in 2017 with the second launch from the new Vostochny cosmodrome: old software logic applied to new geography without a double-check. It didn't bite on the first launch because they used the very rare Volga upper stage, but the second launch was in the default configuration that had flown for decades from launch pads everywhere, including South America. So after final separation, the Fregat upper stage was scheduled to make a 10-degree turn counterclockwise, but due to the geography of the new cosmodrome and the flight trajectory, the software decided that it needed a 350° clockwise turn instead. It didn't end well. It turned out that there was a narrow set of input parameters that could make the upper stage behave like this, and the new launch pad won the jackpot.
Wasn't there something about thermal modelling and pipes freezing in the Fregat upper stage? Or am I misremembering?
@@JohnMullee No, that was definitely some other story
I’m surprised they didn’t have a redundant altimeter to verify the suspect altimeter reading against.
Japan, sorry for your loss, but thanks for the software design lesson. Rockets are hard and this is how we learn. Thanks for sharing this one, Scott!
There's an argument to be made that if a sensor is critical enough that if it fails you're going to land on non-existent terrain 5km up, then you just assume it won't fail. If you handle failure gracefully but then don't have enough data to avoid crashing, what's the point of handling it gracefully?
Of course, ideally you'd have a backup - like another radar, or GPS, or a video camera capable of estimating height using machine vision and a map - so you can sanity-check it. The next best thing is just to have a map: the vehicle knows where it is, so if it knows the terrain it can estimate what the radar values _should_ be. Then instead of going 'eek, a delta of 3 km in ten seconds is clearly wrong' you go 'the radar has shown a delta of 3 km in ten seconds; what does the map say the delta should be? Right, 3 km, moving on'.
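The map check could be tiny. A toy sketch with an invented terrain profile: a drop in ground height should show up as a matching increase in radar range.

```python
# Toy along-track terrain profile: height of the ground (m, relative to
# the reference sphere) every 10 km along the planned ground track.
TERRAIN_PROFILE_M = [0, 20, 50, -3000, -3000, 40, 10]   # a crater at index 3

def radar_delta_expected(track_km, measured_delta_m, tol_m=500.0):
    """Is a sudden change in radar range consistent with the map?"""
    i = int(track_km // 10)
    expected_delta = TERRAIN_PROFILE_M[i] - TERRAIN_PROFILE_M[i - 1]
    # A drop in terrain height shows up as an *increase* in radar range.
    return abs(measured_delta_m - (-expected_delta)) < tol_m

# Radar range jumps +3050 m just as the track crosses the crater rim:
print(radar_delta_expected(30.0, 3050.0))  # True -> don't blame the radar
```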
You can have a video camera that is very good at finding the distance by using phase detect autofocus, same principle as a rangefinder.
I work in flight software and you're right. At a certain point, if a system is so critical and irreplaceable, you just have to trust that it won't fail - because, as you said, detecting the failure isn't helpful if you're SOL.
There is an argument to be made that if a $90 million project can go up in smoke due to a single sensor failure, and you have an expectation that the sensor could potentially fail, you should really have some sort of redundancy even if failure is unlikely - or some other form of backup plan. The question is whether it was actually considered that this sensor could fail, or whether it just used the same failure detection and handling behavior as any other sensor, without further consideration.
I love the KSP2 animations you added. They were really nice to watch.
This reminds me of an Alastair Reynolds novel where an automated system recorded the sudden vanishing of a planet but disregarded the data because the event was so far out of expected results that it assumed there was some kind of fault.
It then accidentally creates a cult.
Which novel is this?
@@yogiwp_ Absolution Gap, it's the third book in the Revelation Space series which is kind of weird. If you're looking to check out the author, I'd recommend Pushing Ice!
@@letsburn00 And also liquefying the poor guy's wife stuck in the scrimshaw suit.
@@ShoeTheGreyCat I forgot about that bit. Given that the series largely revolves around characters who are functionally immortal, it's wild how easily they torture and kill each other.
My reaction to just the title is "There are no unbelievable computer bugs".
Now that I've watched the video: *very* believable. Accumulation of error is nasty and dead reckoning is very hard. Changing something that "can't possibly affect the outcome" late in the process and not doing a full test happens often enough that it's a subject of comic strips and many high profile failures.
Except the one that flew into one of the first computers and caused a short circuit.
Scott's been moving to more and more clickbait titles of late. It's unfortunate to see him doing it.
When you're using accelerometer and gyroscope data alone for position on a 2D plane, it can become hilariously inaccurate quickly, no matter how good your algorithm is.
Doing this in 3D space would be basically impossible, if I'm being honest.
As an example, there is a reason VR relies so heavily on video processing for limb positioning. Obviously these aren't in the same ballpark of cost/importance, but the same rules apply.
The unbelievable part here, honestly, is how someone expected this to work without simulating the actual final flight plan at least once.
@@winebartender6653 US missile submarines can pull it off, but their inertial navigation hardware is larger than the entire lunar probe and submarines experience much smaller accelerations.
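How fast pure dead reckoning drifts is easy to demonstrate; a toy one-dimensional example with a small, constant accelerometer bias (numbers invented):

```python
# 1-D dead reckoning with a tiny constant accelerometer bias.
# Position error from a bias b grows roughly as 0.5 * b * t^2.
bias = 0.001                     # m/s^2 -- a very good accelerometer
dt = 0.1                         # 10 Hz integration
velocity_err = position_err = 0.0

for _ in range(6000):            # ten minutes of descent
    velocity_err += bias * dt
    position_err += velocity_err * dt

print(f"{position_err:.0f} m of altitude error")   # ~180 m
# And that's with zero noise; real IMU noise makes it worse. Hence the
# kilometres of uncertainty after minutes without radar updates.
```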
It does seem like it was selected as a fallback with rather optimistic expectations of how well it would stay accurate. In hindsight, it would have been better to try turning the radar off and back on, relying on inertial navigation only for as long as it took the radar to come back. Also: redundant radar.
Hi Scott: Very nicely done! (but then, I say that a lot about your stuff...). This scenario is directly reminiscent of the situation on the Apollo landings where passing over a crater (or any other feature like that) would cause a 'jump' in the Radar Altimeter-portrayed altitude, and it would 'jump' from the PGNCS altitude. Remember 'Delta H'? The difference between RA and PGNCS altitudes. In order to keep things from diverging in the PGNCS, they had to incorporate a 'terrain map' into the software that accounted for local differences in surface elevation. Remember the landing of Apollo 17? At some time in the PDI maneuver, one of the crewmen (I can't tell which one--they sound a lot alike) said 'We went over the hump, and Delta H just jumped'. It sounds (at least at first blush) like a feature similar to the Apollo 'terrain map' might have been appropriate here (?) Thank you.
From this description it sounds like the software worked as intended based on the circumstances. It sounds more like they need to rethink the system level design to have more inputs that can be used to sanity check one another, and perhaps have a means for a one-time instrument glitch (at least in the design interpretation) to be "forgiven" if later sanity checks pass.
Yes, that makes sense, and the "forgiving" part is commonly handled by Kalman filtering, which Scott also mentioned. Here it sounds like ispace over-engineered a little, over-eagerly dropping sensor data on the floor before giving the filter a chance.
I've never done it, but from what I have read, sensor fusion is an enormously complicated and fuzzy technique. You have to take a bunch of sensors, account for non-linearities and malfunctions, and figure out which ones are correct, which ones are sort of correct and by how much, and which to ignore. On top of this you have enormous weight and power constraints. And there must be a million fudge factors that have to be played with: move one one way and you get a false positive, move it the other way and you get a false negative.
I wonder if this would be a good application for AI? A computer would definitely be able to interpret far more inputs than a human pilot ever could, and in real time.
It's a satisfying problem to work on.
@@andrewahern3730 AI isn't a magic fix. Those sensor fusion algorithms are supported by a deep understanding of the system and of statistics. As with the Falcon 9, they are extremely reliable once properly tuned.
Obviously a sufficiently advanced AI system could always do the job. But if, as in this case, you simply didn't test the system with enough variations of inputs, you're not going to get good results either. The amount of simulation needed to properly train an AI would also have been plenty to find this bug in the old control code.
The lesson here is that more robust testing is needed. I have a feeling that spaceflight is often seen as hardware-first. That's understandable, but without proper software the hardware is useless. I think more modern software engineering practices could be useful here.
IRL, nothing says you can’t have false positives and false negatives at the same time, while you struggle to understand the data. That’s no fun at all.
Kalman filtering is pretty damn straightforward. It's a basic method, not something extraordinary. It's been known for more than 50 years and is optimal for typical sensors (i.e. those with Gaussian-distributed noise).
Mr. Manley, for the whole planet you are our 21st century Eugene Kranz.
At 01:13 your video proves that we now have available:
‘A da Vinci World of Creativity at Home’
The video shows that they used a $170 Airspy R2 receiver (with a $620 LNC + antenna) with the mind-blowing power of the software available for the Airspy, so for less than $900 USD you can have the same setup at home!
Your use of the Kerbal simulator, to help us better understand the sequence of events, is of jaw dropping beauty.
Hey Scott - thanks for the analysis. I remember this one (as well as the Israeli and Indian ones), and seeing the disbelief in the control room was sad. It is easy to tell who has a clue and who is a bureaucrat by their expressions, etc. :P
Hope there's no software issue when Nasa lands back on the moon!😂
@@adarsh4764 agreed - one would have thought that even a small lander would have a pretty robust navigation system these days, but obviously they hit an edge condition they hadn't properly tested for... a sad oversight too, as nearly all landing trajectories will have the radar return affected by craters you're passing over. There are many of them, after all, and although most are small, many are large/deep, and you need to keep their profile in mind as you use the radar/laser/etc. surface measurement.
The state vector routine needs a sanity check to make sure the drift never diverges too far from the projection without performing some kind of reliable recheck.
I'm very impressed with the abilities to diagnose what went wrong. Even amateurs helped! Another case study for future designers of "fail-safe" systems.
I'm so sorry to hear that this happened. I hope they try again and maybe send back some remarkable pictures. Don't give up. Greetings from Arizona.
Thank you, Scott! You answered questions I've had for the last few years about the landers crashing on the moon.
Not a software bug, it was a design bug. The software functioned as specified.
How do you know? Did you read the design spec? If the design spec stated it should be able to handle multiple lunar landing locations, then it's not a design spec issue.
Definitely a process bug that this wasn't picked up in testing - but premature to say that it wasn't also a software bug.
@@usingthecharlim 6:17
A mismatch at the requirements and/or expectations level, triggered by operation beyond the test envelope. It needed a calm (seasoned?) "captain" to hold a steady, pre-planned course.
@@simongeard4824 I think the video was quite clear that the software started ignoring that sensor because it was programmed to do so. An intentional feature that behaved differently than expected *because* it was put into a situation that was not considered while designing it. And this only happened because they changed the mission plan after the software was developed and did not test it again with the new landing site, even though their tests would have detected the issue. That last part really hurts, because they reasonably could have avoided the crash.
I can't believe they didn't simulate their final landing site but that's what you are saying. Thanks for the explanation. Such a shame, they picked the wrong thing for a shortcut!
By, "unbelievable", I'm pretty sure you meant, "Completely realistic, very common scenario when the software is put in an untested environment."
Note that I am saying this as a software developer myself. I actually just identified a scenario where our existing tests were thought to be sufficient, but then some surrounding parameters changed and a bug was found.
As a software developer myself, I'm actually asking myself why those simulations were not set up to run like a CI.
@@jarisundell8859 Good question. Seems like actually running the sim again once the final site was chosen should have caught this, maybe allowing them to upload the fix.
For the record, CI is how the one I looked at was caught... prior to release 👍
Scott is also a dev by trade, lol, he works at Apple.
@@firefly4f4 I think it's a joke, since the bug happened because the computer didn't "believe" the radar
By unbelievable, I mean the software stopped believing the radar
Yea, that is the challenge of small projects with limited resources. It is great that this is not a problem for larger projects (cough-cough-Starliner) that have the money and resources to allocate to proper SW verification.😆
It's crazy to me that they didn't redo the simulations when a new landing site was chosen.
But what if the radar altimeter actually did fail around the time it passed over that crater? It sounds like it would have produced the same result. I think the only way to deal with this in the design is to have at least one redundant sensor for something this mission-critical. Of course the problem with just one backup sensor is that you then need to figure out which of the two is actually broken. That's why there are often 3 sensors or 3 computer systems used in this kind of redundancy...
Sounds like a redundant sensor wouldn't have helped this particular issue though, because it would have just gotten the same confusing measurements of the cliff wall. I think they just need to thoroughly run simulations of the actual mission to catch edge cases like this early.
On a vehicle like this, without humans onboard, the space and weight requirements might be too costly compared to the risk of a failed sensor.
@@sonaxaton Exactly. A proper simulation campaign would've caught that.
@@sonaxaton 3 redundant sensors and a voting system is the way to go. It has worked flawlessly in many aeronautical applications, from the Concorde autopilot to missile guidance systems.
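For anyone curious, the classic scheme is mid-value selection; a toy sketch, with a made-up disagreement threshold:

```python
# Toy mid-value selection: with three channels, fly on the median so a
# single wild channel is outvoted. The threshold is invented.
def mid_value_select(a, b, c, disagree=50.0):
    median = sorted((a, b, c))[1]
    suspects = [r for r in (a, b, c) if abs(r - median) > disagree]
    return median, suspects

alt, bad = mid_value_select(5012.0, 5020.0, 1990.0)
# alt == 5012.0; the 1990.0 channel is flagged instead of trusted.
```

The caveat, as noted above: in this crash all three channels would have agreed on the 3 km jump, so voting alone only catches broken hardware, not surprising-but-real terrain.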
The dead reckoning system combined with prior knowledge (a map of roughly what is expected) should have been enough of a redundant system. Seems like they should have included a reassessment/recovery routine to check if that apparent altimeter glitch (which wasn't a glitch of course) cleared and the instrument was giving reasonable data.
This stuff is really tricky without a human in the loop.
To be honest (and a bit philosophical), I would not call this a "bug," in the sense that by "bug" we usually mean an error in the software that makes it behave differently from the behavior specified at design time. In this case the software had to face a situation that was not expected, that is, a sudden increase in measured altitude due to a deep crater. It was not an error introduced at implementation time (that is, when they wrote the software), but at design time. Like a bridge that collapses, not because of some error during construction, but because of a strong wind that was not considered at design time.
I agree. This was a planning error, or a failure-to-test error, or changing the landing into a regime that had not been tested, or all of the above. It's been known for a long time that radar altimeters can be spoofed by terrain; it is nothing new.
An engineering oversight
As a software engineer I agree lol but that's not to say it is not also partly the responsibility of software engineers to raise potential bugs in the design.
Agreed, and came to write this. As a space systems engineer, this was a systems engineering failure, not a software bug.
They should have tested their software with real data.
Love this! Thank you for reminding me of Kalman filters; I studied those in my M.Tech., loved them, but never thought I would ever hear of them again. I still remember how the "location estimation" part, based on current velocity and direction integrated over time (aka dead reckoning), can provide smooth and accurate predictions over short durations, but errors tend to accumulate in a physics-based predictor like this, and it needs to be augmented with an independent measurement (i.e. the radar), even if the radar data is not accurate. Amazing to see how stuff like that led to this outcome. It is a tough one though... I wish you had shared your thoughts on how a "faulty sensor" should be detected, then. I mean, you could say that a sudden 3 km jump in the sensor output means the sensor is probably broken, right? If not, how else would you do that and handle the case when the sensor actually is broken?
Redundant systems and a majority check: if all three of your radar sensors report a sudden altitude change, then that's what actually happened. What surprises me is that the sudden-altitude-change eventuality was never accounted for...
Unbelievable that they had only one method of determining altitude!
My thought exactly.
And that that one method could be turned off for the rest of the landing!
I guess that's par for the course for those very small landers.
I mean, to be fair, even the Apollo lunar module only had a single non-redundant landing radar for determining exact altitude. The astronauts were fairly confident they could manage to land without it, but if it failed, mission rules called for an immediate abort. The weight constraints for landers are so tight, engineers have no choice but to make those trade-offs.
@@GlutenEruption Ah, but they had the backup that was the Mk. 1 Eyeball and its associated biological computer!
Enjoyed hearing you on NSF, Mr. Manley.
Thank you for all the detailed explanation!
Greetings,
Anthony
If it slowed to a speed of 0 and then fell to a speed of 500 km/h, it would have had to fall for ~86 seconds. Moon gravity acceleration = 1.62 m/s². That means it was in free fall for a distance of about 6.0 km. That's all based on the "500 km/h" crash speed given.
Actually, I figured out 500 km/h based upon the amateur radar measurements of 88 seconds of freefall.
@@scottmanley Love how reliable basic physics equations are! With either bit of data it still comes up with the same result! If only the rest of landing on the moon were that simple.
What I want to know is how that equates to a violent impact on Earth. Do I divide by six, which comes to 83.3 km/h, or just under 52 mph? That's bad enough for it to need airbags...
@@travelbugse2829 You multiply by the square root of six (for a fall from the same height) - assuming there is no air resistance, so with air resistance you might end up with something not too different from 500 km/h for this type of vehicle.
@@travelbugse2829 When it comes to the moment of impact, 500kph is 500kph. It's about Mach 0.5. You know those old war movies where they show fighters shot down and augering in? *That.*
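For anyone who wants to check the arithmetic in this thread, assuming vacuum free fall from rest:

```python
# Checking the numbers in this thread (vacuum free fall on the Moon).
g_moon = 1.62               # m/s^2
v = 500 / 3.6               # 500 km/h -> ~138.9 m/s

t = v / g_moon              # ~85.7 s of free fall from rest
h = 0.5 * g_moon * t ** 2   # ~5950 m fallen, i.e. ~6 km

# Same drop height on Earth (no air): impact speed scales with sqrt(g),
# so multiply by ~sqrt(6).
v_earth = (6 ** 0.5) * 500  # ~1225 km/h
print(round(t, 1), round(h), round(v_earth))  # -> ~85.7 s, ~5950 m, ~1225 km/h
```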
I wonder what all these people in mission control were doing during the landing. Were they analyzing the telemetry in real-time? I assume they were supposed to notice that the radar altimeter was considered faulty and disabled. If so, perhaps they could have reviewed its readings and realized that after passing the edge of the crater, the readings returned back to normal. In that case, they could have just manually reenabled the radar altimeter. Since it is not Mars, the signal delay is small enough to allow for manual corrections during the landing.
Word!
They lost telemetry. If they had still had a connection, maybe they could have saved it.
There might also be a delay in communication.
Putting aside changes in mission plans, missing redundant systems, or even software bugs, I think the main issue here is overly strict programming. Assuming something is defective just because of a sudden out-of-scope change is a bit extreme. It baffles me how it could hover waiting for the moon while letting propellant go to zero without at some point trying to salvage itself with something like "this is not working, maybe I should take another look at that system I think is dead."
What you're describing is human decision making, where you're ready to scrap the plan and try something better when the moment comes. You can't just imagine every scenario branching out at every step and hard-code solutions to each. At some point you realize you need a generic decision-making algorithm. In fact the mission failed because they had a specific hard-coded solution - switching off a reading - that happened to produce success in previous simulations.
It tore me apart watching the team come so close. It really has to weigh on the people who didn't catch the glitch; I am sure some still lie awake in bed at night. Can't wait to see the team bounce back with a flawless mission.
Those software engineers were laid off, hence the lying in bed awake at night.
This sort of situation is actually a good point to use against those who claim the Apollo moon landings were impossible back then because of the limited computer power available. Having a human (or two!) at the controls made the landings difficult but not impossible. Comparing it to the success/failure rate of the even earlier Surveyor unmanned landers shows it can be hard to do, and losing a lander is expensive, but you can try again.
That's why I love Scott Manley videos, so detailed.
I seem to recall the University of Wyoming having a "Missile Guidance for Dummies" audio description of a guidance system for knowing where the missile is by knowing where it isn't - it seemed pretty rock solid. I have to wonder why this method hasn't been adapted for spacecraft yet.
It subtracts where it should be from where it wasn't.
Exactly. It would be especially helpful in this case; if the lander knew where it isn't, it would not waste fuel by trying to land as if it was just above the surface. :)
I believe that's just a sentence for lulz - engineers expressing themselves in a purposefully obtuse way. Kalman filters are exactly the "knowing" part, and a closed-loop control system is exactly the "subtracting" part.
@@u1zha th-cam.com/video/bZe5J8SVCYQ/w-d-xo.html This video is full of these sentences that are close to how control loops work, but not quite, which I find quite funny, especially if you know how it actually works.
@@u1zha Unfortunately, it also inspires a lot of morons to quote that line constantly on YouTube, perhaps under the mistaken impression that it makes them look smart.
6:20 A planned landing site change? In industry, that would normally require revalidation of the software. I would put the blame on the people who decided not to do that, and investigate those people and why the change was made. I would not blame the software, as the software was not designed to be used that way.
Maybe multiple countries could drop beacons around common landing areas that everyone could use during landing. Not foolproof, but it could help.
Hardly a "bug" when it worked correctly for the data input it was programmed to handle. At best, it encountered data it _wasn't_ programmed to handle, which makes this more a missing feature.
I was thinking the same thing. Sounds like the software did exactly what it was supposed to do.
well, a bug is just unintended behaviour. The computer did exactly what you told it to do, just not what you wanted it to do
Can't imagine why they didn't run simulations of this. It's not like the moon's topography isn't known down to the meter. Stick it in Kerbal and run simulations.
@@ddnguyen278 Uh, the moon's topography *isn't* known down to the meter. Some areas of the moon are, but generating meaningful maps of the moon is actually quite hard and time consuming. There are folks whose entire job is to take lower res digital elevation maps and apply reasonable interpolations to generate higher fidelity maps than we actually have.
Not saying they shouldn't have done more sims, but it's harder than it sounds.
Repeat after me “That’s not a bug it’s a feature”
Sounds like they tightened the Mahalanobis check margins in the Kalman filter. That's the check that the real measurement at each time step, the expected measurement, and the estimated measurement errors are all in accordance with each other. You usually hardcode the acceptable margins for that, i.e. |real − expected| must be < 4 times the expected error. If it isn't, the measurement is declared bad (e.g. the accelerometer physically fell off its mounting). Unfortunately the margins are often set too tight.
It could've been another problem though, related to algorithms similar to simultaneous localization and mapping, but I don't have enough experience with them to judge.
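A toy version of the gate being described, to show why a real crater rim and a dead sensor look identical to it (all numbers invented):

```python
import math

# Toy innovation ("Mahalanobis") gate: reject a measurement that disagrees
# with the prediction by more than k standard deviations of the expected
# spread. All numbers are invented for illustration.
def gate(measured, predicted, pred_var, meas_var, k=4.0):
    sigma = math.sqrt(pred_var + meas_var)      # expected innovation spread
    return abs(measured - predicted) <= k * sigma

# A genuine 3 km jump from crossing a crater rim fails the gate exactly
# like a dead sensor would -- the gate alone can't tell them apart.
ok = gate(measured=5000.0, predicted=2000.0,
          pred_var=100.0 ** 2, meas_var=20.0 ** 2)
# ok is False; latching that rejection for the rest of the descent,
# rather than re-testing later readings, was the fatal design choice.
```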
My Dad helped design the Apollo lunar landing software ... and curiously enough, it was never used due to a sensor overload ... the famous DSKY error 1202. When Neil Armstrong disabled my Dad's software for Apollo 11, that was the end of it. The LM landing program was always overridden by future LM pilots and the LM was landed manually. The fault was in a completely unrelated system ... I guess a lot of people wonder if it would have done its job. My dad says it was pretty robust and he never saw a simulation where it would have failed if given the chance to run to completion.
It's a good thing Armstrong was a good pilot!
My dad would go on to be famous for mockups, and then later, he worked on the avionics of the world's most capable fighter jet. He's getting old, but still with us. I wish he was more of a storyteller ... but the one he thought was the funniest (and most irrelevant) was meeting the president in the restroom at NASA ... as in, _um, nice day, isn't it Lyndon?_ as they conducted their business. I am guessing it was during LBJ's visit to Houston in 1968, the same time frame that my dad was working there.
You @PT-xi5rt didn't know that LBJ was president? Or that he used the bathroom like the rest of us?
Scott, I am surprised that you did not touch on redundancy. I was a fighter jet aviator, and one of the things we always did was use multiple sensors to allow the software to compare and then estimate probability. If they had three radar altimeters, they could see the rate of change of the surface as the spacecraft travels. Even if each would have shown the cliff, the probability calculation would have told it that it is virtually impossible for all three to suddenly go bad at once. Redundancy would be one answer in my book.
Agreed, and for my part, I always thought that radar altimeters were used "closer" to the surface.
Probably budget constrained.
@@drill_fiend1097 This is a commercial effort, so they could have just used one normally found in aircraft. It's not NASA, where they cost $750K each just because...
Didn't Neil Armstrong have to do some on-the-fly recoding to overcome the 1202 error when the Eagle lunar module was getting overwhelmed with input? (Which was fixed on later Apollo missions through code fixes and turning off an un-needed radar as part of the checklist?) Seems like they could have used an altimeter, but on the moon the altimeter setting is always "00.00" 😅
"Actually I'm underground, so I should cut my parachute" is the funniest conclusion an AI spacecraft could make before murking itself
Like when you're walking down the stairs and miss that last step.
Stuff like that is really hard to catch beforehand. Perfectly good sensors spit out garbage sometimes. I honestly think it might be worth having dual altimeters just to make the signal more robust (yeah, I know it would weigh more), maybe lidar + radar. Then you make the software unable to disqualify both the radar and lidar at once, or something.
Calling it a bug seems wrong to me though; dual altimeters would at least have kept measuring.
Sounds to me like if an unusual number pops up, you shouldn't then just ignore all further readings, maybe?
Like, did they really set up the software to turn a key component off once an odd reading occurred?
Having a backup to double- if not triple-check incoming data on the fly, instead of relying on past information, should be normal.
Just because a road was empty 30 seconds ago doesn't mean I'll blindly trust there won't be something speeding around the corner.
Expect the unexpected, especially when it's not on the same floating rock, right?
@@dounyamonty The system was designed to ignore all new data from a sensor if it felt like some of the readings were out of bounds. They could have had 20 altimeters and the system would have ignored them all because they all would have shown the same "error".
@@patreekotime4578 When you have only one sensor for a given parameter, treating it as failed after seeing out-of-bounds data from it is just about the only thing you *can* do. But when you have more than one sensor, you can cross-check them, and *that* usually becomes your primary method of determining whether a failure has occurred. If both sensors do something unusual *at the same time,* you might reasonably infer that the problem is *not* in the sensors.
In this particular case, the spacecraft should have been able to cross-check the radar-altimeter reading with the topography it was flying over. Seeing the distance-to-ground reading increase rapidly as the ground abruptly drops away beneath the spacecraft is an entirely expected condition.
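A sketch of what that cross-check might look like, in the spirit of Apollo's "Delta-H" terrain model; the terrain_model lookup and tolerance here are hypothetical, not anything ispace actually flew:

```python
# Toy terrain-aware sanity check: compare the radar range against a terrain
# model along the ground track before declaring the sensor dead.
# terrain_model is a hypothetical lookup of ground elevation vs downrange.
def radar_consistent(radar_range, nav_altitude, downrange_km, terrain_model,
                     tolerance=500.0):
    ground_elev = terrain_model(downrange_km)     # e.g. -3000.0 inside a crater
    expected_range = nav_altitude - ground_elev   # range to the local terrain
    return abs(radar_range - expected_range) < tolerance

# Over a 3 km deep crater the raw reading jumps, but checked against the
# terrain model it's exactly what the spacecraft should expect to see.
```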
Right! Really hard. Who would consider that the Moon might have craters?
Very good explanation and a top video! I guess the loss of the Mars Polar Lander was also caused by a software issue, telling the landing thrusters to shut down too early and causing the probe to fall the rest of the way to the surface...
Redundant systems to help the mission don't matter if the mission never starts. I worked on a Single/Dual/Triple redundancy system a long time ago. I think the probability of a single incorrect signal per million samples for each device was 75/93/98 percent (roughly, I don't recall the exact number). A huge bonus from single to dual redundancy but rarely worth the extra 33% in cost between Dual and Triple. However, each module had to boot up on its own and if they did not, then the system wouldn't run anyway.
It's amazing how Apollo didn't have such bugs, despite being written in pure assembly!
They chose tamer landing sites.
@@PMA65537 Apollo 15 would like to have a word with you.
Apollo missions had some issues, but they handled them well. The engineers couldn't even visit their families because of the work pressure.
Is there a requirement that all titles must be clickbait and include one of these words: Unbelievable, Shocking, Terrifying? No the reason wasn't unbelievable, it's actually quite believable and simply just an oversight.
It’s unbelievable because the navigation software stopped believing the altimeter.
@@scottmanley Lol, good save. Wasn't really directed at your video per se, that's just the YouTube titling trend these days. Although yours is actually technically accurate hah 🙂
What was the 'REAL TRUTH?!!'
Hi Scott, during the iSpace debriefing, they reported that their velocimeter did not start reporting data when it expected to be 2km above the surface (event 9 in the schedule).
Do you know if this is a separate issue or a consequence of being too high from the ground?
Seems pretty sloppy not realising a change in landing site might cause the craft problems ... Sounds like bad project management to me.
😅 Good to hear from you again, buddy. Thanks for your good update videos.
Having seen in person how Japanese programmers work, how specialized and narrow their programming skills are and how ridiculously rigid their management approaches are, how many non-unified standards they use, this sort of thing doesn't surprise me at all
PS: the Ron Burgundy clip was priceless and so on the money xD
I actually did not get your point. Could you please explain a little bit more?
The point being, this was a bug that should've been relatively easy to find if they had simulated a couple of "landing site changed at the last minute" scenarios that included heavily cratered areas or craters with steep walls. Just running some tests on random landing sites would've triggered this. But nobody thought of it, and because of corporate culture, everybody was disincentivized to even raise the question.
The software was built by Astrobotic, an American company. Not sure how stereotypes of Japanese corporate culture come into this.
@@JosePineda-cy6om Oh ok. Thanks a lot for the explanation.
@@JosePineda-cy6om Yes, this is exactly what I meant
It would be interesting to try an optical parallax system to verify the radar readings. If both systems agree, then the data is correct. The cameras could be a few meters apart, so the parallax would be measurable from pretty far away.
That’s a very creative solution! ✌🏼✌🏼 Pretty sure it would work!
If they don't agree, then what do you do?
@@xonx209 In sci-fi it would usually be three independent systems, with two having to agree on what they were seeing. A two-system setup would require that both systems agree, and if one system cut out a sensor as malfunctioning and the other didn't, something would have to be present to break the disagreement deadlock.
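Rough numbers for the parallax idea, using invented camera parameters:

```python
# Back-of-envelope stereo parallax: range = baseline * focal_px / disparity_px.
# The camera numbers are invented for illustration.
def stereo_range_m(baseline_m, focal_px, disparity_px):
    return baseline_m * focal_px / disparity_px

# Two cameras 3 m apart with a ~4000 px focal length: at 5 km altitude the
# disparity is only 3 * 4000 / 5000 = 2.4 px, so you'd need sub-pixel
# matching (and decent lighting) for the cross-check to work that high up.
```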
Your explanation reminded me quite a bit of dynamic positioning systems on ships / oil rigs
Given that it was not possible to land the craft without the altimeter data, it's an odd programming decision to permanently ignore that data the moment it starts to look janky. Odds of the altimeter recovering its wits may be slight, but I'd rather give that a go than rely on an approximation that's +/- 5 km in altitude.
Exactly what I was going to say. Since radar altimeters are highly robust there is almost no situation where you should ignore one. If it does fail your mission is doomed, whether you ignore it or not.
If they tested by simulating landings on other spots but not on the selected one, then they didn't test! This isn't a software bug but a project management issue (specifically testing). It's like "testing" your program on your desktop but then deploying it on a server, where the faster hardware exposes a race condition that borks the system. Testing is expensive, testing is hard, but not testing the actual flight plan is dumb.
Software engineering does not start at the keyboard and end when the code gets sent to a testing team that is somehow not software engineering.
Engineers have a responsibility to work with the testing crew to validate the test scenarios. The teams failed to run enough variations of realistic input, so inputs outside the limited sets caused a fault.
Specifically, there were several bugs in the system as a whole:
1. The spacecraft is unable to land without altimeter inputs. Relying on inertial guidance alone cannot be accurate enough to land, due to inherent input noise; if the altimeter signal is discarded more than X seconds before touchdown, error margins push failure rates toward 100%.
2. The guidance system (apparently) had no way to recover confidence in sensors.
3. The guidance system would erroneously flag valid altimeter inputs as a broken sensor.
4. Testing was not done to cover the new landing site (and yes, a senior engineer should have balked at the change).
Another fascinating and instructive example of Robert Burns' "The best laid schemes o' mice an' men / Gang aft a-gley.”.
Cheers from sunny Vienna, Scott.
When flying on instruments, pilots are trained to trust what the instruments are showing them, not what the software in their heads is telling them "I can't see the horizon, so I think I'm upside down..." No... You are the right way up etc.
@thePronto We were too high when we ran out of fuel... OOPS. The instruments didn't lie then?
Been through the training; you absolutely cannot trust your feelings. It varies for each person; for me, I felt I was leaning left and right. I had to actively fight to focus on the instruments, which drains your energy very quickly. Even if you are completely focused on the instruments, your arms will try to "level" the plane. You acclimate with each subsequent flight, until eventually it's nothing. Instrument flying is still draining, as you have to fuse all these sensors yourself. In comparison, looking outside is about as stressful as driving.
@@oohhboy-funhouse Yes Sir! It's almost like the programming team had no idea that the moon has craters...
I would add optical recognition and stored high-resolution images to the package. These would optically compare the expected position and orientation to the visible terrain and so call out anomalies. The entire apparatus could be as small as a Raspberry Pi, using off-the-shelf components. DJI drones use something similar in their RTB (return-to-base) software: when the drone sets off, it takes photographs of its starting location and compares them to the downward-facing camera's live images when returning to land.
One of the reasons Chandrayaan-2 failed was the mapping. When the lander drifted away from the photographed landing site, it tried to overcorrect and failed.
Neil Armstrong took over the controls and manually landed on the moon when he saw rougher terrain than expected at the final approach of the first manned landing in 1969. He was a true test pilot who was able to think fast and take action without losing his nerve. He barely had enough fuel for the extra maneuver, so he was also lucky. The problem with depending on robotics is that software doesn't have "common sense" and enough experience to handle the unexpected. However, these crashed robot landers are much cheaper than manned missions, so with trial and error they will eventually work.
And then a unicorn ran up & they rode the unicorn all around the Moon going 240,000 miles back to Earth.
The unicorn didn't run 28,000 mph like they would've had to go in the pop-rivet aluminum can that brought them there.
Perhaps a stupid question but why didn't it have redundancy on sensors?
Probably weight limit
Weight would be the first thing that comes to mind.
More mass = more capable spacecraft needed = more costs.
It had some. All those radio sensors (I think it had 3, collecting data from different directions) that were measuring the height were identified as "faulty" by the bug, and it turned off the input from them. Marking the sensors as faulty based on their own output and then turning them all off was the bug.
$$$ and weight.
Find me a spacecraft that has/had a redundant radar altimeter; even Apollo didn't have one.
A moment of silence for the Mars programmer who couldn't handle math, metric and standard.
Although, having some direct experience with atrocious egomaniacal team leaders and managers, including one in particular at JPL, I would suspect that the programmers and reviewers did exactly what they were told, while the team leader should get a "black mark in his permanent record" for being the one who was the true culprit.
Way to go, Scott. Fly safe, indeed.❤
Speaking as a veteran in the software industry, no bug, no matter how absurd, is unbelievable to me.
India's Chandrayaan-3 has finally soft-landed near the Moon's south pole successfully. 🎉
great presentation
thanks for the follow up
Why haven't they orbited GPS satellites around the moon yet?
Why don't they drop transmitters to the surface first as becons?
The Moon has very few stable orbits, and even those require propellant. The Moon's gravity is very lumpy, and the Earth tends to fling you off.
Dropping beacons on the surface is the exact same thing as putting a lander like this down.
@@samuraidriver4x4 To make it easier to land our probe, we will first land three probes.
Apollo did that once using a Surveyor lander as the beacon. But that was just for position, not altitude.
@@samuraidriver4x4 Really? Can you describe your design?
I was thinking about a lightweight baseball sized package on the end of a long collapsible pole, throw a dozen of them out in hopes a few survived to do the job.
Or how about a huge air bag with the probe suspended by rubber bands in the center, no need for an accurate landing of the beacon.
Did y'all really think I was suggesting to land a huge piece of equipment to be a beacon?
I remember one of my computational methods professors said, "There is no such thing as software bugs, only human error," in one of our first lectures, so that we would code more carefully and make sure everything has correct syntax before moving on to a new line. It'll save you a lot of time, rather than overconfidently writing 100 more lines of code and then having to scroll through it all only to find that you typed "fr" instead of "for."
Most bugs are not caused by typos but by overlooking consequences somewhere else in the code, like creating a potential timing/synchronization issue. Or by wrongly interpreting the functional requirements, or by misreading the documentation of an external library, and things like that. Typo bugs are rare, except in the UI, but that's because certain programmers cannot write decent English sentences 😅
@@effedrien I think he was just giving a basic example.
Thanks Scott!
Software is hard…
So, as the lander found, is the moon.
There are very few human activities that are as complex and as routine and yet have such potential to fail catastrophically from even the smallest of errors. So yes, it's objectively a hard thing to consistently do well.
I suppose that with the rapid advances in technology and AI, this kind of problem will soon disappear.
A simple camera pair, for example, could recreate human-like vision and give an AI enough information to perform a landing, especially if combined with all the already existing sensors.
Excellent point.
And do you really think the Japanese didn't deploy that?
@@alwayshiking_ Well, apparently not since it crashed after thinking for too long that it had landed, 5km above the surface ...
With AI comes the need for higher processing power to multiply large matrices. This increases the electrical power requirements and demands more R&D to create radiation-hardened variants of processors.
The Snapdragon 801 in the Ingenuity Mars helicopter is probably the state-of-the-art SoC being used. And that's the same one used in the Galaxy S5 a decade ago.
I doubt we'll see AI in spacecraft any time soon unless we develop more efficient processors.
I wouldn't call it a software bug. The software performed as instructed. It was the change of planned landing site at the last minute without any testing afterward. Someone f____d up.
Such a stupid error... I'm sure the engineers were banging their heads against the wall when they found out. Hopefully they'll be able to build it again and fix the SW for a proper landing.
imagine failing a moon landing when even china did it successfully
@@laimejannister5627 You do realize you just compared a private company to a government, right? Also, China has a ton of experience in space - they have their own space station, a Mars mission, satellites, multiple rockets... Just because cheap crap is produced there for people not willing to spend $$$ doesn't mean they can't produce quality items.
@@witchdoctor6502 Well, to be fair, SpaceX is also a private company, yet they are better than all except maybe a handful of government agencies on Earth. So what kind of entity it is doesn't mean much. Plus they already get a pass, since it was launched on a SpaceX rocket.
Dunno if you'll see this, but I'd be suuuuper interested to hear your thoughts on whether software engineers should have the same certification processes that physical engineers have.
FYI: they do in some countries.
One isn't allowed to develop software for medical gear, for example, here in Germany without certain qualifications. There are also legal requirements and standards that development teams working on critical or potentially dangerous software have to adhere to, both in regards to coding and testing and in regards to their overall software design, risk analysis, and much more.
The kind of software you're working on makes an enormous difference to the consequences of getting it wrong, and I don't think it necessarily makes sense to try and develop a certification scheme that's simultaneously rigorous enough to assess people working on nuclear reactor control software while not failing 90% of candidates going into a career in casual game development...
They do. Software engineer, computer engineer, computer scientist, etc. - they are all degrees you can get studying in college. The issue is that you have all the 3-month React crash course Jimmys who call themselves "software engineers" when they are software developers instead. So yeah, there are many different certificates for software engineers; it's just that the name is often not respected.
BTW, I'm not trying to shit on self-taught or non-engineer software devs. The same way people like Michael Faraday did amazing things in the name of scientific research without a degree, some software developers are incredible at what they do. I don't want to sound like I'm implicitly blaming the lack of certificate enforcement for software bugs like these. This could very well have been coded by a software engineer, who would also not be to blame, because stuff like this has to be peer reviewed and tested by many people and on many levels. It was simply an incredibly unfortunate event.
A degree is not certification. And for physical engineers that assume accountability for the designs of their companies, Principal Engineers, their job titles do not come easily. What would be interesting to see is if there is any push to require engineer ACCOUNTABILITY, not just responsibility, as things are currently in the physical space.
This brings me back to my software engineering course. There are supposed to be several stages like requirements, specifications, design, implementation, testing, maintenance. And each stage is supposed to be evaluated and fed back to improve processes. The instructor stated that there had only ever been ONE software project to have ever followed the theory: Space Shuttle avionics.
My first feeling was that one of the issues is that they all seem to be young people - I know they are very, very smart young people - but I don't see the "old guy" in any of the pictures. An old guy who has experience with developers and engineers.
They built a kamikaze moon lander.