At the Johnson Space Center (JSC), and at every other NASA center with a hand in managing the International Space Station (ISS), there are case studies and lessons-learned databases on every significant anomaly that has arisen during the two decades the station has been in operation. They are not simply short-form documents or synopses of a problem faced and resolved.
“There’s a narrative around the details,” Brian Derkowski said. Derkowski is the manager of the ISS On-Orbit Engineering Office responsible for the ISS Mission Evaluation Room (MER) and the ISS anomaly resolution and engineering support to the flight control team.
The instructive power of the story that goes with the particulars of any anomalous event – large or small – is something NASA has striven to internalize over the 20-year history of the ISS, and indeed across the six decades of the agency’s existence.
The ISS has a few stories.
At around 10:00 in the morning East Coast time on March 30, 2017, veteran U.S. astronaut Peggy Whitson was on a spacewalk outside the ISS doing some routine maintenance. In this case, Whitson and her spacewalk (or extravehicular activity, EVA) companion, astronaut Shane Kimbrough, were installing a thermal shield on an unused docking port on the ISS.
The port needed protection against space debris and radiation.
“Peggy, I don’t have a shield,” Kimbrough said over the UHF radio to Whitson. “What?” she replied.
“Yeah, I don’t have a shield,” Kimbrough reiterated.
“Where is it?” she asked. “It’s right by the radiator,” she said, answering her own question. “It’s moving at about half-a-foot-a-second and it looks like straight away from the radiator angle.”
“We copy, and we see it,” said a controller from NASA’s Mission Control Center at JSC.
With that, the 5-foot shield floated blithely off into space, joining 20,000-plus other pieces of sizable debris orbiting Earth. Whitson and Kimbrough had planned to install a total of four such shields during a 6.5-hour EVA. They now had just a trio. What to do?
“Despite the rigor we put into preparation and planning, those types of anomalies occasionally occur,” Derkowski acknowledged.
As with all EVAs, Whitson and Kimbrough had only a few hours of time available to work outside the space station. If a timely solution to the loss of the shield was going to be arrived at and executed, it would have to happen very quickly.
“Our team in the control center met and rapidly came up with a work-around,” Derkowski said. “Within an hour or so, they had the procedures defined well enough to call up to the EVA crew in real-time and execute the fix.”
The fix was effected by retrieving a thermal blanket that had been removed from a port earlier in the EVA. Whitson and Kimbrough fitted the improvised cover to the final port. It wasn’t a perfect fit, but it would effectively prevent exposure to high temperatures and micrometeoroid debris until a purpose-built replacement could be put on-orbit.
“That’s the beauty of having all the ISS teams – engineering, operations, safety – all coming together and working issues in real-time,” Derkowski said.
It also illustrates the ability of ISS’ managers to improvise. Improvisation is a talent in high demand among those tasked with responding to the unexpected challenges that emerge in the operation of the space station. The various engineering and flight control teams have established troubleshooting procedures that apply to any anomaly event.
“But sometimes, we’ll exhaust those without resolving the issue,” Derkowski acknowledged. “Improvisation comes into play quite frequently. That’s where the NASA team will get together to talk about what we’ve done, what steps have been taken, what we might do, and the risk associated with taking proposed steps. When you’re improvising a troubleshooting procedure on-orbit, major or minor, there’s a degree of risk associated, which we always discuss before launching into improvised troubleshooting.”
Common Procedures, Uncommon People
The need to identify and resolve ISS anomalies quickly is obvious. Space, even nearby space, is an inhospitable environment. But the need for speed in problem-solving doesn’t displace methodology and procedure. It makes them more important.
NASA and the ISS flight control and engineering teams at JSC have a long-established methodology in terms of how they respond to anomalies. The MER, along with the flight control team and the rest of the NASA community, is part of the anomaly resolution legacy. All significant ISS anomalies are discussed and worked in the MER. It is not a new concept.
“Anomaly resolution has been refined and evolved through the [20-year] course of the ISS program, but it has its roots in earlier programs as well,” Derkowski said.
There were MERs for the Apollo, Skylab (America’s first space station), and Space Shuttle programs. Each accumulated and built upon the databases, case studies, and narratives arising from operational challenges and anomalies. The information is part of the record, available to the ISS MER and other NASA programs. In addition, most communication loops between the ISS and the flight control team are recorded and available for review should the MER, engineering, or flight control teams wish to consult them.
Sharing the accumulated knowledge is, in its own right, a key common procedure within the ISS program, Derkowski said.
“It’s really important, because when you have an anomaly, you want not only the procedures [that] tell you how to respond, but common technical data that’s as visible to as many people as possible. You all want to be looking at the same data for the on-orbit anomaly.”
The eyes surveying the data matter too. The ISS MER has institutional knowledge stored within the people who staff it. They’re not just in the MER or even at JSC itself, but farther afield.
“That’s resident not only here at JSC but at centers throughout NASA and our partners, which had a hand in developing the current ISS system,” Derkowski added. “The body of technical work is also shared with our flight control team counterparts. In some cases, the flight control team knows the systems just as well as the engineering support teams.”
While the reservoir of experience is distributed, it’s still a rare and difficult-to-come-by commodity. The number of people who have worked on, with, or around the International Space Station is, not surprisingly, tiny.
“I think we have a really good mix of personnel right now,” Paul Rathbun said. Rathbun manages the Vehicle Integration Office at JSC and is also a former MER manager.
“There are still a number of folks who worked the early days of the space station, who worked through its assembly and learned good lessons along the way. As we’ve had folks move on, we have had a chance to transition some of their knowledge.”
Doing so isn’t easy, Rathbun and Derkowski admit. But it is vital to the future of the ISS and other follow-on programs.
“As the program grows older,” Derkowski said, “one of the challenges is to make sure that tacit knowledge – knowledge inherent to the experts whether they’re on the engineering or operations side – is documented so that when they move on to a different job or retirement, that knowledge remains for the benefit of the rest of the team.”
Reaching Out, Reaching In
Retained knowledge and experience with past failures come in handy both in resolving anomalies and in provisioning the space station for expected and unexpected failures. Having the right components on hand can make a huge difference in response time when something goes awry, say, with an electrical or computer component.
“Responding to that could take a couple weeks to plan, but I’ve seen plans devised and executed in three days,” Derkowski said. “If the anomaly affects critical ISS functionality, I’ve seen an [anomaly] event happen on a Saturday and by the following Tuesday, we were ready to go do an EVA.”
The failure of a computer located on the exterior of the ISS on Saturday, May 20, 2017, is the anomaly to which Derkowski refers. Known as a multiplexer-demultiplexer (MDM), the computer is one of two units on the station used to route commands to its solar power system, radiators, cooling loops, and other equipment. Though the second, backup external MDM assumed those control functions when the first failed, the functions are critical enough that the ISS team wanted to get a replacement installed as soon as possible to restore full redundancy.
The computer failure was ultimately traced to a faulty internal circuit card. At JSC, the MER, engineering, and flight control teams devised a plan for an EVA to replace the failed unit as soon as possible.
The plan would require an astronaut, in this case ISS veteran Whitson, to assemble and test a spare MDM to replace the failed device that had been installed only two months earlier. Whitson had the spare, which NASA calls an Orbital Replacement Unit (ORU), on hand thanks to routine re-provisioning of the ISS, which is done with an eye firmly focused on anomaly resolution.
Knowing what spares to send via unmanned resupply flights is a bit science and a bit art, according to Bill Robbins, manager of International Space Station logistics and the Maintenance Office at JSC. For a start, there is only a limited amount of “up-mass” – the ability to launch and set aside storage for spares – for the ISS. As such, the logistics team must not only anticipate critical component failures, but also determine whether getting the requisite ORU to the station and storing it there is feasible.
“We do have a good deal of [spares] history, mean time between failures data, actual performance data,” Robbins explained. “And working with the systems teams, we have a thorough understanding of the impact of the failure of any given piece of hardware. We’ve gone to great lengths in the past few years to posture ourselves with the right set of spares already positioned on board so that in the event of a failure, the spare is immediately available to the crew.”
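As a rough illustration of how mean-time-between-failures data can feed a sparing decision, the minimal Python sketch below treats failures as a Poisson process and asks how many spares would cover the gap between resupply opportunities at a chosen confidence level. It is a generic textbook model and not a description of how the JSC logistics team actually sizes its spares; the function names and every number in it are hypothetical.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """Probability of at most k failures when the failure count is Poisson with this mean."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


def spares_needed(mtbf_hours: float, units_in_service: int,
                  resupply_gap_hours: float, confidence: float) -> int:
    """Smallest spare count that covers every failure in the gap with the given probability."""
    # Expected failures across all units over the resupply gap (assumes a constant failure rate).
    expected_failures = units_in_service * resupply_gap_hours / mtbf_hours
    spares = 0
    while poisson_cdf(spares, expected_failures) < confidence:
        spares += 1
    return spares


# Hypothetical inputs, for illustration only: two units in service,
# a 100,000-hour MTBF, and one year (8,760 hours) between resupply chances.
print(spares_needed(100_000, 2, 8_760, 0.95))
```

A real provisioning decision would also weigh the factors Robbins describes, such as the impact of each failure and the limited up-mass and storage available, none of which this sketch models.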
The practice is similar to the long-established pre-positioning of assets by the U.S. military in far-flung operational theaters from Europe and Africa to Asia and the Middle East. Having ORUs aboard the ISS also alleviates uncertainties in resupply launch schedules, whether those uncertainties are technical or political.
Of course, capitalizing on a spare on hand requires that the ISS crew (typically five members) have some knowledge of the systems they may need to work on. How much troubleshooting knowledge each crewmember should have is a balancing act, since the crew’s primary focus is conducting the projects and scientific experiments that each nation places on the ISS.
“The crew does get familiarization training with the systems and how they work,” Robbins said. “They get what we call ‘skills-based training,’ which means they’re taught the skills to operate and maintain onboard systems in a generic sense rather than trying to memorize detailed procedures for multiple systems.”
The combination of training and an onboard ORU allowed JSC’s MER and other control teams to map out the MDM assembly, test, and an EVA to swap out the failed component in just three days. After the MDM failed on Saturday, Whitson and fellow astronaut Jack Fischer ventured out of the station’s Quest airlock the following Tuesday to put the new MDM in place and install a pair of antennas to improve wireless communications for future EVAs. (It’s worth pointing out that not every ISS anomaly requires an EVA. In fact, many don’t require the crew to act at all.)
The ISS has near-continuous communications with JSC’s Mission Control Center via S-band (voice, telemetry) and Ku-band (video, high-speed internet) frequencies. The channels allow the crew ample contact with the ground for private conferences, calls with the media, schools, and relatives, or specialized troubleshooting conferences.
The channels are also conduits for remote control of onboard systems. Ground controllers now have the ability to command the vast majority of ISS systems from Earth, Derkowski said.
“A good example is the ground control command of robotics. In the past, a crew member would have to manipulate an actuator and move the big robotic arm on the space station, but in the past couple of years, we’ve had the ability to do that on the ground, which has alleviated a lot of crew workload on board.”
Solar arrays, the electrical system, and coolant pumps can be manipulated as well. However, the last of these demonstrated the risk in believing that systems work as you expect.
Assumptions
There really isn’t a situation in which the well-worn “don’t assume” mantra does not apply. It certainly extends to anomaly resolution for the ISS.
On a Wednesday morning in December 2013, a pump module in the ISS’ External Thermal Control System (ETCS) unexpectedly shut down when a fault detector noted far-below-normal coolant temperatures.
The ETCS has two separate external ammonia coolant loops that jointly transport heat away from the ISS’ electronic equipment and toward respective cooling radiators. With one loop operating much too cold due to the apparent failure of an ammonia pump temperature control valve, the station was in jeopardy of losing half its cooling capacity and possibly having ammonia enter its cabin.
Derkowski led the Close Call Investigation team that later investigated the incident. Investigation teams are common review mechanisms in aviation and spaceflight. Members of the MER, engineering, and flight control teams, along with outside experts, typically make up an ISS investigative board.
After the shutdown, the ISS Mission Control team successfully re-powered the affected loop. However, they soon discovered that a control valve in the pump was not closing correctly, leading to the ammonia in the loop becoming too cold for nominal operation. Since a single cooling loop cannot support all of the ISS’ cooling needs, ground teams were forced to begin shutting down equipment in order to reduce the heat generation of the ISS systems. The loss of cooling would potentially limit experiments aboard the station. Addressing it was important, but as it turned out, unintentionally risky.
“We took some actions that were pre-planned, but some of those actions resulted in the loss of internal water coolant flow to one of our heat exchangers,” Derkowski explained. “The exchanger is the junction between the external ammonia system and the internal coolant. We got very close to freezing our interface heat exchangers.”
Freezing the heat exchangers could trigger a potential rupture and the entry of ammonia into the cabin. The actions taken by the ground control team were based on suppositions made when the ETCS was initially conceived. There was an underlying assumption that cold ammonia resulting from failure in a cooling loop could be avoided if systems were configured correctly, Derkowski recalled.
“That underlying assumption from the design phase of the space station persisted. The short of the investigation was that this assumption was flawed. If the systems were configured a certain way, there was still risk of freezing the heat exchanger. We determined there were definitely operational procedures to update and some changes to how we operate the ISS in terms of automatic recovery steps embedded in software.”
Ultimately, astronauts Rick Mastracchio and Michael Hopkins would undertake another EVA to remove and replace the affected pump module in late December. The swap was successful, but with every additional EVA comes a degree of risk. That’s why a mishap board and/or the anomaly team look at just about every unforeseen event or phenomenon on the ISS. And, as Derkowski said, they try not to let time pressure affect their post-event analysis.
“We try to resist the temptation to dismiss an anomaly as explained or a one-off event. We really take things seriously.”
The bare necessity of doing so was brought home by one of the closest calls NASA has ever had during a spacewalk, ironically following another cooling issue seven months prior.
Fluid Assumptions
In May 2013, the ISS crew spotted a growing ammonia coolant leak in the station’s U.S. cooling system. The seepage they noticed was a dramatic increase over a recognized long-running, low-rate leak in one of the cooling legs. Left unchecked, it could shut down the cooling loop. The MER and Mission Control team began working on a plan to fix the leak via a contingency EVA.
Two days later, NASA astronauts Tom Marshburn and Chris Cassidy exited the station’s Quest airlock on a spacewalk to check the area where the coolant leak was spotted and replace the system’s pump flow control subassembly unit with a spare. Their labors were a success and no major leaks were found.
Shortly thereafter, ISS Commander Chris Hadfield, Marshburn, and cosmonaut Roman Romanenko departed the ISS in a Soyuz TMA-07M capsule. They were replaced on May 28 with the arrival of another Soyuz carrying NASA’s Karen Nyberg, Italian astronaut Luca Parmitano, and cosmonaut Fyodor Yurchikhin.
On July 9, another EVA was undertaken by Parmitano and Cassidy to perform a variety of scheduled maintenance tasks on the ISS exterior. Upon returning from the EVA, Cassidy and Parmitano found a small quantity of water in the helmet of Parmitano’s spacesuit. They informed Mission Control and related their assumption that the water was the product of a leaky drink bag inside Parmitano’s helmet. That conclusion was accepted and agreed with by the ground team.
A week later, Parmitano and Cassidy found themselves outside the ISS yet again, tackling more scheduled maintenance tasks. Forty-four minutes into the EVA, Parmitano reported that he felt a moderate quantity of water in the back of his helmet.
“Chris and I were ahead on our tasks, so we were starting our third task and I felt some water on the back of my head,” Parmitano said after the incident. “I realized that it was cold water, it was not a normal feeling, so I told ground control.”
The spacewalk was expected to last about 6.5 hours, but after Parmitano reported the presence of water, mission controllers aborted the EVA about 60 minutes in.
“I started going back to the airlock and the water kept trickling,” Parmitano said. “It completely covered my eyes and my nose. It was really hard to see. I couldn’t hear anything. It was really hard to communicate. I went back using just memory, basically going back to the airlock until I found it.”
“At that moment, as I turned ‘upside-down,’ two things happen: the sun sets, and my ability to see – already compromised by the water – completely vanishes, making my eyes useless; but worse than that, the water covers my nose, a really awful sensation that I make worse by my vain attempts to move the water by shaking my head. By now, the upper part of the helmet is full of water, and I can’t even be sure that the next time I breathe I will fill my lungs with air and not liquid.”
Parmitano came perilously close to drowning in his spacesuit outside the ISS. Back inside the hatch, the Italian and his fellow crewmembers found approximately 1.5 liters of water filling the helmet.
In the aftermath, a Mishap Investigation Board met and issued a report on Feb. 27, 2014, with 16 key findings/recommendations. Chief among these was that the water buildup on the July 9 EVA was not discussed in enough detail.
If the leaky drink bag assumption had been challenged, the report concluded, mission controllers probably would have realized that the “issue needed to be investigated further before pressing ahead” to the next spacewalk.
It was another lesson that assumption can potentially lead to disaster. And yet, there are practical limits to the energy or attention that can be devoted to any single anomaly.
“It’s the trade-off we deal with daily, but the key is not to dismiss an off-nominal event, rather decide as a team if and how we address it,” Derkowski said.
Still, it’s a trade-off that isn’t made without taking in a diverse range of input.
Broad-based Problem-solving
“We do have to have a culture where people feel free to bring concerns up and have them heard. I think the anomaly-resolution process is very conducive to that. Having an international partnership is a great benefit.”
Rathbun alludes to the ISS’ Russian, European, and Canadian partners, all of whom participate in anomaly resolution or review.
“When I used to lead anomaly-resolution teams, I wanted diverse input from across the communities that were participating or had a stake in the issue. You never knew where a creative thought was going to come from,” he said.
There is wide agreement among ISS stakeholders that applying more brainpower to any problem is a good thing. The international ISS management team includes every major agency that contributes to the mission.
“In the Mission Evaluation Room, we bring in multilateral anomaly-resolution teams with our engineering partners at other space agencies,” Derkowski said. “So when you have an issue that involves an interface or affects multiple parties, we’re all looking at that issue together.”
Those solving problems at JSC not only benefit from international partners but from the diversity of their own experience. Cross-disciplinary knowledge is gained every day through the working relationships of different teams resident at Johnson Space Center and throughout the ISS infrastructure. Robbins acknowledges that his role as ISS logistics and maintenance manager broadens his perspective daily.
“We buy spares but we also do the long-range maintenance planning, so with those functions, I work with all of the different discipline teams – reliability, safety, quality teams. We manage the ground logistics infrastructure, including hardware repair, so I work with all of those companies. I get exposed to a great number of disciplines and entities.”
Rathbun works with organizations across the ISS program as well, absorbing knowledge from the Systems Engineering and Integration Office, Transportation Integration, and other teams. Other NASA field centers, from Kennedy Space Center, Marshall Space Flight Center and Glenn Research Center, to Goddard Space Flight Center, Ames Research Center, Jet Propulsion Laboratory, and White Sands Test Facility, regularly consult with the JSC team. The same holds true for personnel in other NASA programs.
Such a broad-based approach will have to continue for as long as the ISS operates, which is presently expected to extend to 2028. As it does, the station will see new anomalies, including some associated with the commercial supply delivery systems that the U.S. space industry is developing to serve it.
“We are already starting to think about how to respond to potential issues on the ISS or at the provider vehicle to ISS interface,” Derkowski said.
“How would we plan for those? How would we respond in a timely manner? We do know from our history and experience that there are a lot of challenges when you do first-time operations. They’ve been part of all NASA programs. We’re definitely taking steps on our side to make sure we’re ready.”