While traditional HPC is accustomed to POSIX, adding S3 to the armoury enables workloads processing large, unwieldy files to be more portable and accessible in a modern cloud native environment.
Advanced computing is reshaping industries, and the research that powers their innovation – but legacy systems and issues with moving data in and out of the public cloud can hamper even the most dedicated admins. And as more organisations recognise the advantages of edge computing, the pressure is on researchers to shift to minimal latency solutions that can scale up or down as needed.
But performance isn’t the only challenge facing HPC solutions. High-performance storage can be eye-wateringly expensive, while economical storage choices can impede workflows. And many storage solutions don’t easily scale or adapt to accommodate future needs, or remain cost-effective when storage demands fluctuate heavily.
Many HPC technology stacks look similar – lots of horsepower for computations, scalable and fault-tolerant scheduling with SLURM, and a parallel file system such as Lustre for storage. This is a model that has been proven to work over the course of decades, but how can we adapt the workflow to enable workload portability and keep up with the rapid changes in infrastructure technology?
With Ceph’s native file system, CephFS, you’ll still get your POSIX fix – but also harness the power of object storage for managing bulky unstructured data, and block storage for virtualisation.
Everything in Ceph is built upon its foundational layer, RADOS (Ceph’s Reliable Autonomic Distributed Object Store), which eliminates the scaling limitations normally experienced with HPC file system storage. This unified approach not only removes barriers to scale, but offers opportunities for more efficient storage, in terms of both latency and infrastructure costs.
Ceph’s unique CRUSH algorithm dramatically reduces data storage and retrieval bottlenecks typically encountered in other cloud storage software, enabling clusters to scale without paying a performance penalty. Ceph can be used to provide a massively scalable storage solution for modern workloads like cloud infrastructure, data analytics, and other high-performance computing applications.
Much like SLURM and Lustre, Ceph is open-source software, free from licensing restrictions. There are no organisation-wide infrastructure replacements required in order to begin making use of it, either. Software-defined Ceph is also hardware independent; that is, it can run on anything (though I don’t recommend managing your aeronautics research data with a toaster oven).
I'm such a big fan, I've become the Ceph Ambassador for Australia!
Getting acquainted with Ceph and S3
Amazon’s S3 allows organisations to add additional object storage instantly, when they need it. It also enables the processing of large HPC workloads such as genomic sequencing to be carried out via AWS ParallelCluster. This on-demand approach makes fluctuating HPC demands much easier to manage on a day-to-day basis – and you only pay for what you use.
Ceph’s ability to flexibly scale out pairs perfectly with Amazon for HPC use – any Ceph cluster will dynamically adjust to balance data replication across storage appliances, allowing you to add and remove storage with minimal bottlenecks or hoops to jump through. Ceph’s use in HPC environments enables scalability, high performance, and timely access to data. It can help support work on large datasets, such as complex topographical mapping data from Mars rover exploration, or support collaborative research across disparate locations. These were some of the reasons CERN chose to utilise Ceph as part of their particle collision experiments.
Now that I’ve covered the why behind a combination of Amazon S3 and Ceph, it’s time to look at the how.
In the examples below, we’re running Ceph on SoftIron HyperDrive (full disclosure - SoftIron is my employer - but I genuinely believe we are making an awesome product). HyperDrive is completely compatible with a mixed-hardware approach to Ceph, so while it offers clear performance advantages (amongst other unique features) when compared to a generic hardware approach, the information below will broadly apply to any Ceph cluster.
Accessing S3 object storage with Ceph
Accessing objects natively in Ceph can be done in two ways – the first is directly in your application code with Librados, and the second is to use the RADOS Gateway’s (RGW) API functionality, which is compatible with both OpenStack’s Swift Object Store and S3. S3 is ubiquitous and is the most popular way to move objects in and out of HyperDrive storage.
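To give a flavour of the first route, here is a minimal Python sketch using the librados bindings (the python3-rados package that ships with Ceph) to write and read an object directly. The config path, pool name, and object name are hypothetical stand-ins for your own environment.

```python
import rados  # librados Python bindings (python3-rados)

# Hypothetical paths and names: adjust for your own cluster and an existing pool.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("hpc-data")      # I/O context for the pool
    try:
        ioctx.write_full("simulation-0001", b"raw result bytes")  # store an object
        print(ioctx.read("simulation-0001"))                      # and read it back
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```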
Hosted on our HyperDrive Storage Router and powered by RGW, Ceph’s thorough implementation of an S3-compatible API acts as a translation layer between the underlying object store and clients making RESTful HTTP requests.
These requests are converted into RADOS calls to monitor (mon) daemons and OSDs hosted on HyperDrive storage, which are in turn responsible for the lower-level data management, replication, and housekeeping.
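From a client’s point of view, the gateway looks just like S3. As a rough sketch with a made-up endpoint and credentials, any S3 SDK or tool can be pointed at RGW simply by overriding the endpoint, for example with boto3:

```python
import boto3

# Hypothetical RGW endpoint and keys; swap in your own gateway and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="genomes")
s3.put_object(Bucket="genomes", Key="sample-001.fastq", Body=b"...sequencing data...")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```

The same calls work unchanged against Amazon S3 itself, which is a big part of what makes workloads built this way so portable.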
As one of the most popular ways to use Ceph, RGW has seen rapid and heavy development, and today not only has a plethora of independent features but is also designed to be massively scalable and performant, enabling users to build out deployments across regions with multiple availability zones. In addition, multi-tenancy allows service providers to host many customers on a single deployment with total namespace separation.
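As a small, hedged illustration of that tenancy model (the tenant and user names below are invented), creating a user under a tenant with radosgw-admin gives that tenant its own bucket namespace: a bucket called results under tenant acme is entirely separate from anyone else’s results bucket.

```python
import subprocess

# Hypothetical tenant and user; run on a node with Ceph admin credentials.
subprocess.run([
    "radosgw-admin", "user", "create",
    "--tenant=acme", "--uid=alice", "--display-name=Alice (ACME Research)",
], check=True)
```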
Scalability and load balancing your cluster
RGW is a stateless daemon which can be added or removed from a cluster at will – and each instance functions autonomously. Each gateway can independently handle requests sent its way, without having to talk to any other gateways. This stateless architecture means that the service can be scaled horizontally, and adding more gateways as demand increases is trivial.
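As a rough sketch on a generic cephadm-managed cluster (Ceph Pacific or later; exact flags vary by release, and the service name and host names below are hypothetical), adding gateway daemons is a single orchestrator call. Re-run it with a different count to scale in or out.

```python
import subprocess

# Ask the cephadm orchestrator for three RGW daemons across three (hypothetical) hosts.
subprocess.run([
    "ceph", "orch", "apply", "rgw", "s3gw",
    "--placement=3 node1 node2 node3",
], check=True)
```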
In this example, deploying and scaling endpoints is entirely handled by HyperDrive Storage Router. The decision you’ll need to make with this setup is how you wish to provide S3 clients with a single interface or endpoint, and how you’ll perform the magic behind the scenes to ensure that requests can be balanced across a changing and evolving landscape of storage routers hosting the gateway. This single interface needs to effectively handle the scaling up or down of not only the storage routers, but also the load balancing services themselves.
For some environments, a solution as simple as round-robin DNS may suffice. For a larger-scale architecture, developing automation to deploy multiple load balancing layers is the way to go. This typically consists of Layer 4 TCP load balancing in conjunction with Layer 7 HTTP load balancers and ECMP routing.
Multiple layers give us the flexibility to add and lose both storage routers and load balancers, without sacrificing availability or performance.
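To make the statelessness point concrete, here is a minimal client-side sketch (with invented endpoints and credentials) that simply rotates requests across several gateways. In production you would let the DNS or load balancing layers described above do this for you, but it illustrates that any RGW instance can serve any request.

```python
import itertools
import boto3

# Hypothetical gateway endpoints, all fronting the same Ceph cluster.
ENDPOINTS = [
    "http://rgw1.example.com:8080",
    "http://rgw2.example.com:8080",
    "http://rgw3.example.com:8080",
]

def client_for(endpoint):
    return boto3.client("s3", endpoint_url=endpoint,
                        aws_access_key_id="ACCESS_KEY",
                        aws_secret_access_key="SECRET_KEY")

# Because each gateway is stateless, the same (pre-existing) bucket is
# reachable through any of them; rotate writes across the endpoints.
for endpoint, key in zip(itertools.cycle(ENDPOINTS), ["a.dat", "b.dat", "c.dat", "d.dat"]):
    client_for(endpoint).put_object(Bucket="results", Key=key, Body=b"payload")
```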
Multi-site implementation, for your geographically distributed HPC needs
If you’re building a production S3-enabled service that needs to either service multiple regions, or be resilient to the loss of an entire facility, you’re going to need to build a geographically distributed S3 service. RGW has a mature system for building out a complex and large-scale distributed S3 service, with multiple layers.
This consists of Realms, Zone Groups, and Zones, which allow you to structure your S3 service with layered logical namespaces to support geographically dispersed customers with a performant and resilient service. Replication takes place between zones within a zone group, and multiple zone groups can serve and handle S3 requests within a realm. Zone groups can also be split across larger geographical regions for asynchronous replication between zones, if that architecture is desired.
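As a hedged sketch of how those layers are declared (all names and endpoints below are made up, and a real multi-site build also involves a system user and secondary zones pulling the realm, as covered in the Ceph multisite documentation), the hierarchy is created with radosgw-admin and then published as a period:

```python
import subprocess

def rgw_admin(*args):
    """Thin wrapper around the radosgw-admin CLI; run on a node with admin access."""
    subprocess.run(["radosgw-admin", *args], check=True)

# Hypothetical realm, zone group, and master zone for an Asia-Pacific deployment.
rgw_admin("realm", "create", "--rgw-realm=research", "--default")
rgw_admin("zonegroup", "create", "--rgw-zonegroup=apac",
          "--endpoints=http://rgw-syd.example.com:8080", "--master", "--default")
rgw_admin("zone", "create", "--rgw-zonegroup=apac", "--rgw-zone=sydney",
          "--endpoints=http://rgw-syd.example.com:8080", "--master", "--default")
rgw_admin("period", "update", "--commit")  # commit and publish the new configuration
```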
The economy and flexibility of Ceph make it a natural choice for scalable storage. And with HyperDrive’s wire-speed performance (PDF download), you have a solution that can work efficiently with S3 for highly cost-efficient, on-demand, and portable HPC.
Please note: I originally published a version of this article on the SoftIron blog, but as SoftIron has changed its product line-up since then, the article is no longer live. So, I've republished a modified version of it here for posterity.