Tuesday, June 19, 2012

Too Big Not To Fail?



Our Agfa IMPAX 6.5 failed for over three hours this past weekend, and that's not the first time that's happened. This sent the three hospitals it services into complete pandemonium, as you might expect. At the moment, there is no hardware disaster recovery option. Hopefully, this will be implemented...eventually.

Fortunately, I was not on call when this little disaster occurred, but I certainly heard about it. To this point, we don't know quite what went wrong, but bouncing the servers seemed to fix it.

I guess we should be quite grateful, as the IMPAX installation in Western Australia continues to misbehave. I've received no word from Agfa as to why they think this might be happening, nor when it will be fixed. Personally, I would hesitate to buy anything new until I knew the answers to these questions.

I have to contrast the IMPAX experience with our other hospital (and our group-owned system) that uses AMICAS Merge PACS. Merge (I still have some trouble with the concept) is a simpler product, using Windows Server and SQL databases instead of the supposedly more robust Oracle and Unix boxes galore. But it is much more stable. We've had perhaps two or three hours total unplanned downtime in the past 8 years.

PACS is not rocket science. It is simply a database of images and associated text. But the images are rather large, and have MEDICAL stamped on them, and thus has grown the culture of complex, expensive, and cantankerous big iron systems.
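To make that point concrete, here is a minimal sketch of the "database of images plus associated text" idea, using Python's built-in sqlite3 module. The table and fields are hypothetical, a toy example of my own rather than any vendor's actual schema:

```python
# A minimal sketch of the "database of images and associated text" idea,
# using Python's sqlite3 stdlib module. Table and column names are
# hypothetical, not taken from any actual PACS schema.
import sqlite3

conn = sqlite3.connect("toy_pacs.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS studies (
        accession_number TEXT PRIMARY KEY,   -- order identifier from the RIS
        patient_name     TEXT NOT NULL,
        modality         TEXT NOT NULL,      -- e.g. CT, MR, CR
        study_date       TEXT NOT NULL,
        image_path       TEXT NOT NULL       -- pointer to image files on disk
    )
""")

# Register a study: the text lives in the database, while the large image
# files stay on the filesystem (or an archive) and are referenced by path.
conn.execute(
    "INSERT OR REPLACE INTO studies VALUES (?, ?, ?, ?, ?)",
    ("ACC-0001", "DOE^JANE", "CT", "2012-06-16", "/archive/ACC-0001/"),
)
conn.commit()

# A worklist query is then just ordinary SQL.
for row in conn.execute(
    "SELECT accession_number, patient_name FROM studies WHERE modality = ?",
    ("CT",),
):
    print(row)
```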

Most of the next generation of PACS are not a whole lot more than web servers and associated databases. As a result, they are robust and (relatively) simple, and while not quite bullet-proof, they have fewer points of failure and are more easily maintained.
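As a rough illustration (and only that), a toy version of the "web server in front of a database" pattern can be sketched with nothing but the Python standard library, reusing the hypothetical toy_pacs.db schema above. This is not any vendor's API; it just shows how few moving parts the basic pattern needs:

```python
# A toy "web server in front of a database" sketch, reusing the
# hypothetical toy_pacs.db schema from the earlier example.
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

class StudyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect URLs like /studies/ACC-0001
        accession = self.path.rsplit("/", 1)[-1]
        conn = sqlite3.connect("toy_pacs.db")
        row = conn.execute(
            "SELECT patient_name, modality, image_path FROM studies "
            "WHERE accession_number = ?", (accession,)
        ).fetchone()
        conn.close()
        if row is None:
            self.send_error(404, "unknown accession number")
            return
        body = f"{accession}: {row[0]} {row[1]} images at {row[2]}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), StudyHandler).serve_forever()
```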

Could it be that the legacy designs are so complex that there is no way to avoid failure? I have to wonder...

4 comments:

stacey said...

It's always interesting to me that bouncing a server always seems to fix things/processes/services that have either stopped or become "hung up," but when I ask my support folks what CAUSED the sequence of events, they never seem to be able to put it together. Since they never find root cause, they can never prevent the problem. My "favorite" sentence is when they say things like "It SHOULD work now," and they have no idea or clue why it failed or why it's now working.

Anonymous said...

23 Skidoo, that is because the system would have had to be in debug at the time of the incident. There are no sites (that I know of) that are willing to leave the system in debug ALL the time (for good reason).

If a problem is of a recurrent nature then support *should* request that the affected services be placed and left in debug to gather additional information at the time of the incident.

Finally, since the IMPAX web services run on IIS, they are at the mercy of IIS. Many times the failure is with IIS itself, which is why an iisreset often resolves the problem.

stacey said...

There should be a better way to analyze the system and explore root cause without compromising performance. Poor design. Perhaps that's not an issue in some areas where computers are used, but it is not well suited to a healthcare environment. I also love it when network folks shut things down or change some settings they think aren't being used, then wait for someone to "scream" and call support to find out who was actually using it. Do they not maintain a simple spreadsheet to keep track of who uses what ports, etc.? Or is it easier just to flip a switch and have others call in? This may not be a big deal if you are playing World of Warcraft, but it endangers patient lives when they pull this crap.

Anonymous said...

23 Skidoo, I can't comment on the network side of things. I don't support that side of the infrastructure. I am just involved in helping to troubleshoot those sort of issues.

I'm a little confused by your comment, though. You are lamenting poor performance while the system is in debug, yet chastising vendors for not having a better way of diagnosing problems.

DEBUG is the method for identifying certain recurrent problems. System degradation should be minimal if the support staff can identify and debug only the web service(s) with said problem (and not throw the entire system into debug).

It's a balancing act: performance vs. diagnosis. When the system is running optimally, all services should be in ERROR mode (reducing the overhead required by DEBUG).
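To illustrate the per-service idea with a sketch: here is how it might look using Python's standard logging module, purely as an analogy. The service names are invented, and this is not how IMPAX itself is configured:

```python
# A sketch of "only debug the affected service" using Python's stdlib
# logging module as an analogy; the service names are made up.
import logging

logging.basicConfig(format="%(levelname)s %(name)s: %(message)s")

# Default posture: everything at ERROR, so routine traffic adds no overhead.
for name in ("worklist", "image-streaming", "prefetch"):
    logging.getLogger(name).setLevel(logging.ERROR)

# A recurrent problem is suspected in one service, so only that logger
# is dropped to DEBUG while the rest stay quiet.
logging.getLogger("image-streaming").setLevel(logging.DEBUG)

logging.getLogger("worklist").debug("suppressed - still at ERROR")
logging.getLogger("image-streaming").debug("captured for the support ticket")
```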