Reflections Of The Void: Network performance within the cloud, a hidden enemy

Monday, January 25, 2010

A lot of people have talked about the latency issue when hosting services in the cloud. Amazon's recent latency hiccup revealed a deeper problem, one that seems to be rarely discussed. While most people focus on network access to the cloud and on consuming services from it, I realise that there is a big unknown concerning network performance inside the cloud.

Cloud providers don't disclose the real infrastructure underlying their cloud offerings. As a result, cloud customers are left completely in the dark regarding the network linking their different instances, and with the false warm feeling that they are sitting on top of their own flat network.

What does this mean?

You have no idea of the network or I/O performance of your instance. Your virtual interface shares one or more physical (sometimes trunked) interfaces with other tenants co-located on the same physical server, and they compete with you for a share of the network pipe.
You have no idea of the network performance between multiple instances within the same cloud:

First, your instances can be located in different branches of the infrastructure, which means more network gear between them.
Then, virtualized network gear can also be thrown into the mix. This adds virtual switches and routers that offer greater flexibility but suboptimal performance (remember, they are software).
Finally, the network traffic generated by all the tenants makes it very difficult (and expensive) to guarantee QoS throughout the infrastructure. Not to mention that capacity planning, measurement and management become extremely difficult, because it is impossible to predict the (often asymmetric) bandwidth consumption of the instances. This is one reason why cloud providers dream of hugely dense, multi-terabit, wire-speed L2 switching fabrics.
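
Since the provider will not tell you what sits between two of your instances, about the only option left is to measure it yourself. Here is a minimal sketch in Python, assuming the target instance accepts TCP connections on some port you control. The function names and the address in the usage note are my own illustrations, not part of any provider API. It times the TCP handshake, which is roughly one network round trip plus some connection setup overhead, so treat the numbers as relative rather than absolute.

```python
import socket
import statistics
import time


def tcp_connect_latencies_ms(host, port, samples=20, timeout=2.0):
    """Time `samples` TCP handshakes to host:port; return milliseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection completes the TCP three-way handshake, so the
        # elapsed time approximates one round trip plus setup overhead.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Return min, median and max of the samples.

    A large gap between median and max suggests that something
    (other tenants, extra hops, software switches) is in the way.
    """
    return {
        "min": min(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }
```

From one instance you could run something like summarize(tcp_connect_latencies_ms("10.0.0.12", 22)), where 10.0.0.12 is a hypothetical private address of another instance. Repeating the measurement at different times of day gives a rough picture of how much your neighbours affect you.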

As a consequence, there is generally no published service level associated with throughput and latency within the cloud. When oversubscription hits you, you often don't see it coming. Maybe the cloud will become similar to home broadband, with advertised "unlimited" offers but with a contention ratio.
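
To make the oversubscription point concrete, here is a small back-of-the-envelope sketch. The numbers are purely hypothetical (no provider publishes them, which is the whole problem): 40 tenants, each sold a 1 Gbps virtual interface, sharing a 10 Gbps uplink.

```python
def oversubscription_ratio(tenants, nic_gbps, uplink_gbps):
    """Total bandwidth sold to tenants divided by what the uplink can carry."""
    return (tenants * nic_gbps) / uplink_gbps


def worst_case_share_gbps(tenants, nic_gbps, uplink_gbps):
    """Per-tenant bandwidth if every tenant saturates its link at once.

    Assumes the uplink is shared evenly, which is an idealisation.
    """
    return min(nic_gbps, uplink_gbps / tenants)


# Hypothetical figures, chosen only for illustration.
ratio = oversubscription_ratio(tenants=40, nic_gbps=1, uplink_gbps=10)
share = worst_case_share_gbps(tenants=40, nic_gbps=1, uplink_gbps=10)
print(f"oversubscription {ratio:.0f}:1, worst-case share {share * 1000:.0f} Mbps")
```

With these assumed figures the uplink is oversubscribed 4:1, and a tenant who thinks they have 1 Gbps gets 250 Mbps when everybody is busy at the same time.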

All this makes it extremely difficult to deploy services that rely on low-latency and/or high-bandwidth architectures, and to guarantee their performance: high performance computing, web and database clusters, storage access, seismic analysis, large-scale data analytics, financial services and algorithmic trading platforms.

I can think of some solutions to these problems, but that will be for another post.