Outsourcing IT to the cloud has become an extremely popular model in the enterprise sector, offering elasticity and lower costs than on-premises infrastructure. Other sectors – such as HPC – have been less enthusiastic, adopting it only for a limited set of workloads while the bulk of computation is still executed on traditional clusters and supercomputers.
What are the reasons behind this? Just tradition and an unwillingness to adapt to the new world order? Or are there actual hard facts behind not outsourcing all HPC computation to Infrastructure as a Service (IaaS) clouds offered by, e.g., Amazon, Google and Microsoft?
Performance – forget parallel computing with multiple nodes
Cloud capacity is offered using virtualization technology. On each node a hypervisor runs multiple virtual machines (VMs, or "instances"). On these VMs one can install Linux or Windows and a complete custom software stack.
Cloud platforms cannot execute typical parallel HPC workloads
This offers great flexibility and benefits for some applications with very specific demands on the software stack, but also comes with drawbacks. As we discussed in a recent report, performance is just not there to execute typical parallel HPC workloads.
Scientific applications with tightly coupled communication cannot scale efficiently beyond a single node, as shown by network latency and bandwidth tests as well as application benchmarks. This means that the vast majority of the scientific workload run at CSC cannot be efficiently executed on a cloud platform; in 2015, 87% of CPU cycles at CSC were spent in jobs using more than 32 cores.
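The latency tests mentioned above follow the classic ping-pong pattern: bounce a small message back and forth and time the round trips. A minimal, self-contained sketch of the idea is below; it uses TCP over localhost purely for illustration, whereas a real benchmark would use MPI over the cluster interconnect (all names and parameters here are our own, not from the report).

```python
# Ping-pong latency sketch: send a tiny message back and forth and time it.
# Runs over TCP on 127.0.0.1 so it is self-contained; a real interconnect
# test would use an MPI ping-pong between two nodes.
import socket
import threading
import time

MESSAGE = b"x" * 8   # small message: measures latency, not bandwidth
ROUNDS = 1000

def echo_server(server_sock):
    """Accept one client and echo every message back."""
    conn, _ = server_sock.accept()
    with conn:
        for _ in range(ROUNDS):
            data = conn.recv(len(MESSAGE))
            conn.sendall(data)

def measure_rtt():
    """Return the mean round-trip time in seconds over ROUNDS exchanges."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    t = threading.Thread(target=echo_server, args=(server,))
    t.start()

    client = socket.socket()
    client.connect(("127.0.0.1", port))
    # Disable Nagle's algorithm so small messages are sent immediately.
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    start = time.perf_counter()
    for _ in range(ROUNDS):
        client.sendall(MESSAGE)
        client.recv(len(MESSAGE))
    elapsed = time.perf_counter() - start
    client.close()
    t.join()
    server.close()
    return elapsed / ROUNDS

if __name__ == "__main__":
    print(f"mean RTT: {measure_rtt() * 1e6:.1f} microseconds")
```

Even on loopback this typically reports tens of microseconds per round trip; InfiniBand latencies are on the order of a microsecond, while cloud Ethernet fabrics are often one to two orders of magnitude slower, which is exactly what kills tightly coupled scaling.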
Nearly 90% of the computing done at CSC couldn't be done in the cloud
The main reason for this lack of performance is that the large commercial cloud infrastructures are typically built on Ethernet interconnects, which have considerably higher latency and lower bandwidth than InfiniBand (used in our Taito supercluster) or proprietary interconnects such as Cray Aries (used in Sisu supercomputer).
The other reason is that virtualization itself incurs a performance penalty. A VM can efficiently utilize the central processing unit (CPU) and main memory, but accessing external devices such as disks, graphics processing units (GPUs) and network interfaces may incur significant overhead, since the hypervisor translates the accesses in software. Recently a number of hardware-based approaches that reduce this overhead have been developed, such as Single Root I/O Virtualization (SR-IOV), but these do not completely solve the performance issues.
Price – not all it’s cracked up to be
What about pricing? The main reason why cloud computing is touted as cost efficient is that for users whose need for resources varies a lot, it makes sense to pay as you go instead of investing in an infrastructure that would sit idle most of the time. On the other hand, for large installations with a constant workload, replacing owned capacity with pay-as-you-go cloud resources would become very expensive. It is no wonder: cloud providers still want to make a profit.
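The trade-off above boils down to a break-even utilization: below it, renting wins; above it, owning wins. A back-of-the-envelope sketch with entirely hypothetical prices (none of these figures come from the report):

```python
# Break-even utilization sketch. All prices are made-up placeholders,
# not actual CSC or cloud-provider figures.
def breakeven_utilization(node_capex_eur, lifetime_hours,
                          opex_eur_per_hour, cloud_eur_per_hour):
    """Utilization fraction above which owning a node beats renting one."""
    owned_cost_per_hour = node_capex_eur / lifetime_hours + opex_eur_per_hour
    return owned_cost_per_hour / cloud_eur_per_hour

# Hypothetical: a 10 000 EUR node amortized over 5 years, 0.05 EUR/h for
# power and administration, vs. a comparable cloud instance at 1.50 EUR/h.
u = breakeven_utilization(10_000, 5 * 365 * 24, 0.05, 1.50)
print(f"owning wins above {u:.0%} utilization")
```

With these assumed numbers the break-even point lands well below 50% utilization, which is why a center running a near-constant workload cannot save money by renting on-demand capacity.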
Fluctuating prices or data transfer costs can surprise you
When comparing Amazon's EC2 cloud prices to the ones offered by CSC, the report shows that the compute- or I/O-optimized pay-as-you-go cloud instances are many times more expensive than CSC's resources. Even reserved instances, where one pays for continuous use years in advance, are not competitive. The only pricing models with a comparable price-performance ratio are spot-priced AWS EC2 instances and preemptible instances on Google Cloud.
This is not an apples-to-apples comparison, though: the service level is not the same. The availability of spot-priced instances is not guaranteed, they may be terminated at any time, and the pricing fluctuates according to demand.
In addition to the costs related to compute resources there will also be charges for data transfers and storage. Depending on the case at hand these can range from very small to very large. They may also lead to a lock-in effect, where it is difficult to change to another computing platform: retrieval fees from deep storage such as AWS Glacier, combined with data transfer costs, can make moving the data prohibitively expensive.
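How quickly egress and retrieval fees add up is easy to see with a little arithmetic. The per-gigabyte prices below are assumptions for illustration, not quotes from any provider:

```python
# Egress-cost sketch. The per-GB prices are hypothetical placeholders,
# not actual AWS (or any provider's) rates.
def egress_cost_eur(dataset_gb, eur_per_gb, retrieval_eur_per_gb=0.0):
    """Cost of moving a dataset out of the cloud: transfer plus any
    deep-storage retrieval fee."""
    return dataset_gb * (eur_per_gb + retrieval_eur_per_gb)

# Hypothetical: moving a 50 TB research dataset out at an assumed
# 0.08 EUR/GB transfer price plus 0.01 EUR/GB retrieval from cold storage.
cost = egress_cost_eur(50_000, 0.08, 0.01)
print(f"moving 50 TB out: {cost:,.0f} EUR")
```

Even at modest per-gigabyte rates, a dataset in the tens of terabytes costs thousands of euros to move, which is the mechanism behind the lock-in effect described above.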
Not only clouds on the horizon
It is not all gloomy under the clouds though. One significant benefit of IaaS clouds is the ability for users to run a completely custom software stack. Scientific software has typically been developed to be easy to run on normal clusters and supercomputers, but some pieces of software may not have been developed with this in mind. For example, in bioinformatics users have embraced the possibility of customizing the whole stack to run complex workflows. Parallel performance is not an issue there, since the work comprises a large number of single-node computations.
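This "many independent single-node computations" pattern is often called embarrassingly parallel, and it is why such workflows map well onto cloud instances. A minimal sketch of the pattern (the task and names are our own invented placeholders, not any real pipeline):

```python
# Embarrassingly parallel sketch: many independent tasks, no communication
# between them, so nothing depends on a low-latency interconnect.
from concurrent.futures import ThreadPoolExecutor

def analyze(sample_id):
    """Placeholder for one independent computation, e.g. one sample's
    analysis step in a bioinformatics workflow."""
    return sample_id, sum(i * i for i in range(10_000)) % 97

def run_workflow(sample_ids):
    # A thread pool keeps the sketch self-contained; a real workflow would
    # fan the tasks out over many single-node cloud instances or batch jobs.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(pool.map(analyze, sample_ids))

if __name__ == "__main__":
    print(run_workflow(range(8)))
```

Because no task ever waits on another, throughput scales simply with the number of instances rented, and none of the interconnect limitations discussed earlier apply.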
CSC has been among the first supercomputing centers to offer cloud resources for researchers
In view of this, CSC has been among the first supercomputing centers to offer cloud resources for researchers. The cPouta compute cloud, and the ePouta cloud for sensitive data are in active use by those customer segments for which it is the best choice.
The facts are thus: for the vast majority of scientific computation done in Finland, the cloud cannot compete with CSC's resources in terms of performance and price. At the same time, technologies such as containers and cloud computing are constantly evaluated and proactively taken into use, to continue the tradition of offering world-class compute resources to the Finnish scientific community.
The blogger is the corresponding author of CSC's report High-performance computing in the cloud? (2016).