Introduction
The purpose of this post is to learn lessons from NASA’s Cassini spacecraft mission and apply them to healthcare IT. These lessons can be applied by any system engineer to any combination of off-the-shelf systems/software and do not require special training.
These lessons will be increasingly valuable as more and more healthcare systems become interconnected and more people are depending on them for every basic transaction in healthcare. It is of paramount importance to make these systems 100% reliable.
When we first started work on the project’s electronics and software in the early 1990s, we had a design requirement for at least 16 years of nonstop operation. My primary responsibility was the firmware design and testing of a set of sixteen small computers called Remote Terminal Input Output Units (RT-IOUs): eight primary units plus eight redundant backups. The RT-IOUs interpret commands from the main computer to control the attitude (x-y-z axis) and positioning of the spacecraft, and the main computer in turn talks to the ground control staff. The RT-IOUs control multiple devices, including the thrusters (for maneuvering), the main engine, the gyros (which sense changes in orientation), and the reaction wheels. Reaction wheels work by conservation of momentum: spinning a wheel one way turns the spacecraft the other way, so the spacecraft can be reoriented without firing thrusters. This conserves fuel. There’s also a sun sensor that can sense the distant sun for position and Earth tracking.
The applicable lessons learned for Healthcare IT are:
- Increasing system reliability using standards
- The value of redundancy
- Operating contingencies when the network goes down
The Cassini mission had an interesting travel path because it had to build up enough speed to reach Saturn. After launch from Earth, Cassini looped around Venus twice for gravity assists, swung back past Earth, and then passed Jupiter on the way out to Saturn, getting up to 40,000 miles an hour. Even at that speed it took seven years to reach its destination. The distance from Earth to Saturn was over 900 million miles, but the actual travel distance was about 2 billion miles because of the roundabout path. All of these critical maneuvers were controlled by the onboard computers, and they taught us some very good lessons about reliability. Having witnessed and participated in this, I believe the lessons that matter for the space program can matter just as much for healthcare IT systems.
Standards increase reliability
Using standards allows us to design more reliable systems. Once a group agrees on a set of defined standards, projects can be tested more frequently and more reliably. For example, each of the devices that control the positioning of the Cassini spacecraft is connected to a standard, redundant MIL-STD-1553B communications bus. The devices are also driven through a standard set of commands, which are interpreted by the Remote Terminal Input-Output Units (RT-IOUs). This is similar to organizations that use clinical integration engines to get their standard HL7 feeds into their proprietary EHRs. On prior spacecraft (such as the Galileo mission to Jupiter), each device had a unique custom interface, which made testing more difficult. On the Cassini spacecraft the reaction wheels, gyros, accelerometers, thrusters, valve driver electronics for the main engine, and sun sensor all worked through the same interface. This enabled us to write custom checkout software early in the design process, so each device designer could verify that their device was working to spec. Over time it also let us accumulate more and more test hours on the interface, and we were able to rule out many potential problems that could otherwise have occurred in flight.
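To make the checkout idea concrete, here is a minimal sketch in Python of what a shared interface buys you. It is not the Cassini checkout software: the `Device` interface, the command names, and the two example devices are all invented for illustration. The point is only that one test routine can exercise every device once they all answer the same way.

```python
from typing import Protocol


class Device(Protocol):
    """Hypothetical common interface; not the actual Cassini command set."""

    name: str

    def handle(self, command: str) -> str: ...


class ReactionWheel:
    name = "reaction_wheel"

    def handle(self, command: str) -> str:
        return "OK" if command in {"STATUS", "SPIN_UP"} else "ERR"


class SunSensor:
    name = "sun_sensor"

    def handle(self, command: str) -> str:
        # Deliberately incomplete: this device does not answer STATUS yet.
        return "OK" if command == "READ" else "ERR"


def checkout(devices: list[Device], required: list[str]) -> dict[str, list[str]]:
    """Send every required command to every device and collect the failures."""
    failures: dict[str, list[str]] = {}
    for device in devices:
        bad = [cmd for cmd in required if device.handle(cmd) != "OK"]
        if bad:
            failures[device.name] = bad
    return failures


report = checkout([ReactionWheel(), SunSensor()], required=["STATUS"])
print(report)  # {'sun_sensor': ['STATUS']}
```

With custom interfaces, each device would need its own test harness. Here a new device only has to implement `handle` to be covered by the same checkout run, which is roughly what a shared HL7 feed does for systems behind an integration engine.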
Redundancy reduces complexity
It sounds like an oxymoron, but it is not, as long as the designers understand the “End to End” operations. This “End to End” knowledge is critical to understanding which parts of the system require redundancy. For the Cassini spacecraft, redundancy was mandatory to reduce the risk of mission failure. For example, the reaction wheels, the gyros, the thrusters, and the main engine electronics were all redundant. In the event of a failure we could switch over to a backup and continue the mission. If there is only one system, it has to carry belt-and-suspenders safeguards of its own, and those extras make the design of each individual system complex. Two simpler systems tied together give you redundancy, which yields lower failure rates.
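A quick back-of-the-envelope calculation shows why two simple units can beat one elaborate one. The sketch below assumes the units fail independently, and the 5% failure figure is purely illustrative, not Cassini data.

```python
def redundant_failure_probability(unit_failure: float, units: int = 2) -> float:
    """Chance that every unit in a redundant set fails.

    Assumes failures are independent, which is an idealization.
    """
    return unit_failure ** units


p_single = 0.05  # illustrative failure probability for one unit
p_pair = redundant_failure_probability(p_single)
print(f"single unit: {p_single:.4f}, redundant pair: {p_pair:.4f}")
```

The independence assumption matters. If both units depend on the same power feed or the same router, a single fault can take out both, and the real failure rate is much closer to that of a single unit.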
During the Cassini mission these redundancies became very useful several years ago when one of the thrusters started functioning below spec. The mission controllers were able to use an alternate thruster series for the X, Y, and Z positioning without any loss of mission performance. A similar situation occurred with the reaction wheels: after running for ten years, one of them was not spinning as well as desired, which required switching to an alternate reaction wheel. So rather than attempting to build an overly complex device that never fails, the design takes simpler devices and connects them in parallel so they can be used interchangeably.
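The switchover pattern can be sketched in a few lines of Python. This is a simplification: on Cassini the decision to switch was made by the mission controllers, while the hypothetical `RedundantPair` below switches automatically. The class names and the command string are invented for the example.

```python
class Unit:
    """Hypothetical device with a simple health flag."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def execute(self, command: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is out of spec")
        return f"{self.name} executed {command}"


class RedundantPair:
    """Routes commands to the active unit and swaps to the standby on failure."""

    def __init__(self, primary: Unit, backup: Unit) -> None:
        self.active = primary
        self.standby = backup

    def execute(self, command: str) -> str:
        try:
            return self.active.execute(command)
        except RuntimeError:
            # Swap roles once; if the backup also fails, the error propagates.
            self.active, self.standby = self.standby, self.active
            return self.active.execute(command)


wheels = RedundantPair(Unit("wheel_A", healthy=False), Unit("wheel_B"))
print(wheels.execute("HOLD_ATTITUDE"))  # wheel_B executed HOLD_ATTITUDE
```

Each `Unit` stays simple because it does not try to recover from its own faults. The recovery logic lives in one small place, the pair.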
For Healthcare IT this means that there should be backup modules and backup systems that allow people to keep working despite failures. I once witnessed a whole hospital being locked down because of an internet problem caused by the failure of a single router. The failure of one part took out the whole network.
Do the systems in your organization have built-in redundancy?
When Always On, isn’t
With increasing digital connectivity around the world these days, it seems difficult to imagine a system that isn’t connected to a network. Even so, systems shouldn’t be designed to need a constant connection because, let’s face it, connections are not always on. Any number of glitches can prevent a system from reaching the network at a specific moment. The user should be able to continue with most of their work without that connection.
In the Cassini example, with a remote spacecraft at the distance of the planet Saturn, it actually takes 80 minutes or more to receive commands from Earth. This means that the system has to operate without getting commands from Earth in real time. It’s not direct remote control, like a toy remote control car. Instead, sequences are uploaded and the Cassini robotic spacecraft then executes them on its own, which lets the system work without being in constant network contact. For an Earth-bound health IT system, this means that designers should be able to configure an off-the-shelf system so that common queries and functions keep working without depending on a network connection all of the time.
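The 80-minute figure can be checked with simple arithmetic: radio signals travel at the speed of light, so the one-way delay is distance divided by speed. The short Python sketch below uses the roughly 900 million mile Earth-to-Saturn distance quoted earlier in this post; the actual distance changes as the planets move, so treat the result as a ballpark.

```python
SPEED_OF_LIGHT_MILES_PER_SECOND = 186_282.397  # speed of light in vacuum
EARTH_SATURN_MILES = 900e6  # approximate figure quoted earlier in this post


def one_way_delay_minutes(distance_miles: float) -> float:
    """One-way signal travel time in minutes for a given distance."""
    return distance_miles / SPEED_OF_LIGHT_MILES_PER_SECOND / 60


delay = one_way_delay_minutes(EARTH_SATURN_MILES)
print(f"one-way signal delay: {delay:.1f} minutes")
```

A command and its acknowledgement make a round trip, so the ground team waits about twice that long to learn whether anything happened. No amount of engineering removes this delay, which is why the spacecraft has to run stored sequences.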
There are local connections between the database and the main application, the interface engines, and the EHR system, and these connections are pretty reliable. But whenever traffic goes outside the local network, across the gateway, the system shouldn’t just lock the user out because it can’t get a basic connection. It should start caching and continue to operate normally. As part of the design process, determine what happens if a network connection is not available. Don’t assume that there is a working connection all the time.
A possible use case for a clinician application is as follows: an EHR could queue up patient data for the next day’s set of appointments. It could refer to this data if network connectivity is not available, and update it as needed when connectivity is restored. Currently there are limitations when aggregating data from multiple hospitals, and some of the new specifications indicate that data can’t be saved at all. But there need to be some design considerations for caching data for performance and connectivity purposes.
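Here is a minimal sketch of that use case in Python, assuming local caching is permitted for the data in question. Everything in it is hypothetical: `PatientCache`, the `fetch_from_ehr` stand-in, the record fields, and the `network` flag used to simulate an outage. A real system would also need encryption, expiry, and audit rules that are out of scope here.

```python
from typing import Callable, Dict, Iterable


class PatientCache:
    """Prefetches records and serves the cached copy when the network is down."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch
        self._records: Dict[str, dict] = {}

    def prefetch(self, patient_ids: Iterable[str]) -> None:
        """Load records ahead of time, e.g. for tomorrow's appointments."""
        for pid in patient_ids:
            self._records[pid] = self._fetch(pid)

    def get(self, pid: str) -> dict:
        try:
            record = self._fetch(pid)
        except ConnectionError:
            if pid not in self._records:
                raise  # nothing cached, so the outage is visible to the caller
            # Offline: serve the cached copy and flag it as possibly out of date.
            return {**self._records[pid], "stale": True}
        self._records[pid] = record  # online: refresh the cache
        return {**record, "stale": False}


network = {"up": True}  # simulated connectivity switch for the demo


def fetch_from_ehr(pid: str) -> dict:
    """Hypothetical stand-in for a remote EHR query."""
    if not network["up"]:
        raise ConnectionError("gateway unreachable")
    return {"id": pid, "allergies": ["penicillin"]}


cache = PatientCache(fetch_from_ehr)
cache.prefetch(["p1", "p2"])
network["up"] = False  # simulate the outage
print(cache.get("p1"))
```

The `stale` flag is the important design choice: the clinician keeps working during an outage, but the application can show that the data may not be current.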
When the Cassini spacecraft doesn’t “hear” a signal from Earth within a certain time interval, it assumes that something may have gone wrong and goes into a “safe-ing” mode. It closes its lens covers to prevent damage, starts a call-home signal, and automatically starts looking for the sun. Once the sun is located with the help of a few internal star charts, it turns a few degrees to look for Earth and reorients itself until contact is reestablished.
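The same idea, a timer that triggers a known safe state when contact is lost, can be sketched in Python. This is an illustration, not Cassini’s flight logic: the class name, the timeout value, and the action names are assumptions, with the actions loosely named after the steps described above. Time is passed in explicitly so the example is deterministic.

```python
class CommandLossTimer:
    """Hypothetical watchdog: enters safe mode after `timeout` seconds of silence."""

    SAFING_ACTIONS = ["close_lens_covers", "start_call_home_signal", "search_for_sun"]

    def __init__(self, timeout: float, now: float = 0.0) -> None:
        self.timeout = timeout
        self.last_contact = now
        self.mode = "normal"

    def heard_from_ground(self, now: float) -> None:
        """Record contact and return to normal operation."""
        self.last_contact = now
        self.mode = "normal"

    def tick(self, now: float) -> list[str]:
        """Return the safing actions to run, once, when the timeout expires."""
        if self.mode == "normal" and now - self.last_contact > self.timeout:
            self.mode = "safe"
            return list(self.SAFING_ACTIONS)
        return []


timer = CommandLossTimer(timeout=100)
events = [timer.tick(50), timer.tick(150), timer.tick(200)]
print(events)
```

For a health IT system the equivalent safe state might be switching to cached data and queuing outbound messages, rather than locking users out.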
In conclusion, putting some sequencing into Healthcare IT systems will help designers use network connections effectively without relying on them too heavily.