Friday, 22 May 2015

Cyber Security. It's really hard but it's also not that hard!

On December 20th 1995, American Airlines flight 965 crashed into a mountain in Buga, Columbia while en-route from Miami to Cali airport. The aircraft was fully functional and the crew both experienced and concious but a series of errors led to the deaths of 159 people - only 4 passengers survived.

The flight departed 2 hours late from Miami due to both waiting for some connecting passengers and then, missing their slot, a host of other flights had to depart before another space was open. In some ways, this was the foundation for everything that followed. Being in a rush is rarely a good thing in aviation since it makes you skip things that you would normally do and more importantly, it reduces the thinking time you have when unexpected conditions arise in the air. It is also serious because of FAA rules on working hours and breaks, which although well-meaning can have the effect of causing people to rush even more to avoid an embarrassing situation, such as Flight 965, where such a delay would have caused a knock-on delay in the subsequent departure from Cali in the understandably but not ideal short turnaround times that airlines plan their schedules around.

The flight was largely uneventful until it arrived close to Cali and was planning its final approach. Cali being under civil conflict at the time had lost its radar system so it had no visibility of approaching aircraft, relying instead on the pilots to inform the tower of their position. The approach should have been largely textbook. It involved approaching a beacon called TULUA, after which another called ROZA close to the airport would be tracked, after which the plane would have passed the airport, turned and landed from the south. The approach here, as in other airports, is critical because the airport is in a valley surrounded by 4000m high mountains and it was night time. (I think sometimes, us developers feel like we are always flying at night!)

A second factor now comes into play. The air traffic controller was Columbian and spoke Spanish as his first language. There was no suggestion that his English was not understood but some confusion as to the language used causes the Captain to make the first of a series of errors. The flight is told that they are "cleared" to ROZA and to report TULUA. The intention here was that although they were still supposed to fly the approach as planned, they were clear all the way to ROZA (they didn't need any more permissions) but when they flew over the TULUA beacon en-route, they would inform the tower so he knew where they were. The Captain heard, "ignore TULUA and go straight to ROZA", causing him fatefully and incorrectly to delete the TULUA beacon from his flight plan.

The tower also informs them that because the wind has died down, they can land directly from the north if they want to. Of course they want to, they are 2 hours late and that kind of thing is helpful to save some minutes and potentially avert any further delays. They are, however, too high to make this a simple procedure and against the background of being rushed, they deploy the speed brakes to enable the flight to descend more quickly and continue their approach.

A series of more confusion leads the two pilots to decide to put ROZA back into their Flight Management System (FMS) and go there as part of the ROZA 1 standard approach.

Entering a new waypoint is usually done before the engines have even started from the comfort of the ground and without anything else distracting you. You can think more clearly about each decision and the FMS can warn you about any discontinuities - which is waypoints that don't appear to logically connect and which therefore are not permitted. Do the same thing in a rush in the air and you are presented with a list of all the waypoints listed under the letter R (in this case for ROZA) and you would normally assume that the top one is both the closest to the current position and therefore the correct one in this situation. It is not. It turns out that for unknown reasons, ROZA cannot be found under R in the FMS on the 757 aircraft - it would be found only by its full name, the Captain doesn't know this and blindly selects this new waypoint, executes it (without checking with his First Officer as process requires) and the plane starts a strong left bank to point to a waypoint that happens to be 100 miles away in the wrong direction near Bogota.

Remember, this is nighttime and it is not always obvious when you are turning. They are also still descending and they are also still confused about what is going on.

They reach a point a couple of minutes later where they realise that something is not right with their heading and they can't seem to agree how to reset their bearings and bring them back to the normal approach. The Captain tries TULUA but can't find it on his radio so instead plots to ROZA on his NAV radio instead of the FMS - which gives him a much more unambiguous heading. What they haven't done is kept an eye on their flight, which has now descended so far that they are on the other side of a high mountain from Cali but without visual references, they don't know this until an alarm suddenly blasts the cockpit with a continuous "Terrain, terrain, pull up". They quickly throttle up and pull back as they have been trained to do but it is too little and too late. The flight crashes into the side of the mountain. In a scary twist, investigators find that the speed brakes are still deployed from earlier and that if they hadn't been, the pilots would probably have cleared the mountain!

Why have I told this story? Firstly, I find flight crash investigations incredibly interesting but also there are clear parallels with the software industry and particularly Cyber Security.

What if I said to you that the incident above was unavoidable? What if I said that there was no practical way that the plane could have avoided crashing? I hope you would disagree. Of course it could have been avoided. In fact, 99.99999% of flights avoid this every day by following procedures, learning from other people's mistakes, working out where weaknesses lie and doing something to mitigate the risks that they carry.

How is this is very different from Cyber Security? A breach is rarely caused by a single thing but by a series of events which, when added together, create the opportunity for an attacker to take advantage and for your site to crash.

Sure, in some ways Cyber is very difficult because there are many different attack vectors, also many different types of attack vectors. One person rarely has enough expertise to understand all of them (except in Hollywood movies) and it seems every day there is some new exploit or malware or weakness. Also, there are sometimes advanced, persistent threats. These are not easy because they occur over periods of time and exploit human factors as well as systems but they are still systems that can be quantified, risk assessed and mitigated. You can still create processes that help and don't hinder your security.

So what can we learn from flight 965 and how can Cyber be easier instead of harder?

You need knowledge. With the best systems in the world, if you do not have one person who understands the basic attack surface of each type of system that you expose to the web (hardware, software, remote desktop, web application etc.) you will not be able to have any security assurance. Even though external contractors can help you, you still must have someone in-house that understands broadly what's going on - a domain expert. In flight 965, this was the flight crew. You cannot get a non-pilot and expect him to fly a plane safely, even with everything that is known about the subject being available to that person.

Secondly, you need a good map. The pilots would not have been able to even attempt a night landing without their maps, including the beacons that specify waypoints to safely navigate the valley. But yet in our companies, most of us have, at best, a general idea of our systems and how they are connected and exposed (or not) to the outside and inside world. This is partly a software issue since the only programs I am aware of that present this sort of data tend to be expensive networking tools from people like Cisco and HP, not the kinds of software that most people can afford. There is also, sometimes, an assumption that things like Network monitoring tools are non-productive and are therefore a luxury. Why pay someone to maintain something that if done properly is never an issue? You might as well employ someone to keep an eye on your office carpets. Of course, this logic is the same for many things that only betray their value if something goes wrong and most of us are fortunate enough to either never get attacked or not to find out that we are.

Thirdly, you need processes and procedures. Is someone allowed to spin up an FTP server on one of your web servers without any oversight or approval or risk management? Could you imagine if the pilots of airlines were allowed to fly in their own way depending on what worked for them? "I always fly faster than planned to give myself some slack if I hit a delay", "I never follow that route exactly because I think it goes too close to those mountains". How are your networks wired? Do you have anything that ensures that things only get connected because they have to, not just because it is easier than buying another firewall or another web server? Do you have any code development checklists and approvals to ensure that someone - who might have all the right skills - hasn't forgotten something and opened up a hole?

The fact is, sometimes just one of these measures will be enough to stop a hack in teh same way as any one of the various links could have prevented flight 965 from crashing. Of course, it is better to have several measures - defence in depth - so that we can afford for one to break under some certain conditions and rely on the others to help us. We do, however, need to know when the individual checks break so that we can see whether those measures are fit for purpose. If flight 965 had retracted the spoilers and got over the mountain, an investigation might have decided that the processes for verifying flight plan changes or even the language used between Air Traffic Control and aircraft needed tightening up for this specific scenario.

Whatever happens. Do something. Your worst processes are probably 100 times better than nothing at all but if you have an improvement mentality, you start where you are, you learn from yourselves or others and you improve things over time.

So Cyber's not that hard!
Post a Comment