Friday, August 07, 2009

Amazon EC2 and instance failures

I really like Amazon EC2 and use it for different kinds of experiments. Most of the time I use small, large and high-CPU extra large instances, and my experiments typically run for about 10~15 hours. So far so good. I like that I can easily scale out my experiments: I can increase the number of instances, and if an instance type is not enough to handle my application, I can move to a different instance type.

However, I am experiencing some issues. The first is that when I start about 10~15 instances, most of them start quickly but a few take a long time, so I have to wait (and pay additional $$ to Amazon) before I can start my experiments. Even after I start an experiment, sometimes some of the instances become unreachable, and that's the killer. When that happens I have no option other than rebooting the instance(s) (or waiting a long time...) and restarting the experiment. So my 15-hour experiment ends up running for 25~30 hours. The issue is that I have to pay $$ ...

I believe this is something Amazon has to address. The EC2 idea is cool, but due to some network and infrastructure issues we cannot use EC2 as we expected.

Sometimes an instance failure lasts only a few minutes and the instance comes back fine; sometimes we really need to reboot or terminate the instance.
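One way to tell these two cases apart is to give an unreachable instance a grace period before rebooting it. The following is only a rough sketch in Python, not something from my actual setup: the function names, the SSH port 22 probe, the 5 minute grace period and the 15 second polling interval are all assumptions chosen for illustration.

```python
import socket
import time


def is_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Probing port 22 (SSH) is an assumption; use whatever port your
    instances actually listen on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_recovery(probe, grace_seconds=300.0, interval=15.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True or grace_seconds have elapsed.

    Returns True if the instance came back on its own, and False if the
    grace period ran out and the caller should reboot or replace it.
    sleep and clock are injectable so the policy can be tested offline.
    """
    deadline = clock() + grace_seconds
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would use it as `wait_for_recovery(lambda: is_reachable(host))`, with `host` being the instance's address, and only reboot or terminate the instance when it returns False.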

1 comment:

SteveL said...

All large clusters have failure rates; the funny thing about virtualised ones is that you can't learn which physical machines are playing up, so you can't blacklist them.

What you can do is design your app with checkpointing, so that failure can be handled without a complete restart. Without that, the run time of any application, on anyone's datacentre, will be limited.
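As a rough illustration of the checkpointing idea, here is a minimal Python sketch. It is not taken from the post or the comment: the function names, the JSON state layout, the `every` interval and the `fail_at` switch (which only simulates an instance failure) are all hypothetical, and the loop body is a placeholder for real work.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename means a crash mid-write leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def run_experiment(path, num_steps, every=10, fail_at=None):
    """Run steps from the last checkpoint onward, saving every `every` steps."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], num_steps):
        if fail_at is not None and step == fail_at:
            # Hypothetical stand-in for an instance becoming unreachable.
            raise RuntimeError("simulated instance failure")
        state["total"] += step  # placeholder for one unit of real work
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

With this shape, a failure at step 25 costs only the work done since the checkpoint at step 20, rather than the whole run. For that to help on EC2, the checkpoint file would need to live on storage that outlives the instance, not on the instance's local disk.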