Fast Storage For OpenStack

Traditional spinning disks typically sustain only on the order of a hundred IO Operations Per Second (IOPS). This creates an IO bottleneck in virtualized environments where hundreds (or more) of Virtual Machines (VMs) run on a single hypervisor. A memory bottleneck is easily addressed by adding more RAM to the hypervisor, and a compute bottleneck can be solved by adding more CPUs, or CPUs with more cores or hyper-threads. However, there is a limit on how many disks a physical machine can hold, and multiple disks must all be driven by a disk controller, which can itself become the bottleneck. Much of the per-disk limit stems from the mechanical seek and rotational latency of spinning disks. One way to address this is to leverage storage tiering with faster Solid-State Drives (SSDs) and Software Defined Storage (SDS). This solution is also more efficient than the earlier approach of dedicated storage arrays: arrays are typically expensive, suffer from vendor lock-in, are challenging to manage and maintain, and their upgrades are quite an ordeal.

This approach uses SSDs attached to the physical machine to provide both a write and a read cache to the VMs running on that machine. SSDs are multiple orders of magnitude faster than Hard Disk Drives (HDDs), but they are currently significantly more expensive. For this reason, the fast storage augments a large pool of HDDs (say 12 TB) with a small SSD cache (say about ½ TB). The storage is exposed as an NFS store so that it can be easily configured and managed with any virtualization platform such as VMware, KVM, etc. It can also be configured with the NFS driver to back OpenStack Cinder.
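As a rough illustration, this is how such an NFS export could be wired in as a Cinder backend using the standard `cinder.volume.drivers.nfs.NfsDriver`; the backend name, share host, and export path below are placeholders, not values from our deployment:

```ini
# /etc/cinder/cinder.conf -- hypothetical backend section
[fast_nfs]
volume_backend_name = fast_nfs
volume_driver = cinder.volume.drivers.nfs.NfsDriver
nfs_shares_config = /etc/cinder/nfs_shares
nfs_mount_point_base = /var/lib/cinder/mnt

# /etc/cinder/nfs_shares -- one export per line (placeholder host and path)
# fast-storage-server:/export/fastnfs
```

The backend is then activated by listing it under `enabled_backends` in the `[DEFAULT]` section of cinder.conf.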

The interface between NFS and the custom Block Device Driver (BDD) is managed by Filesystem in Userspace (FUSE). All file system calls from all the VMs on the hypervisor are intercepted using FUSE, which provides the bridge between kernel space and user space. FUSE has two components: a kernel module and a user-space library (libfuse) against which applications can be built. The kernel module exposes a filesystem interface to the kernel and reroutes all filesystem calls to the user-space library via a character device (/dev/fuse). The FUSE user-space library is multi-threaded and routes the calls to the BDD, so all VM IOs are processed by the BDD in userland. This gives us the freedom to implement the custom BDD in user space, which is easy to develop, maintain, and manage. In exchange we accept a small performance penalty, which has little impact on the overall solution: beyond the achieved IO speed-up, other bottlenecks such as memory and CPU start creeping in anyway.
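A minimal sketch of the user-space side of this routing, with hypothetical class names; a real deployment would implement these callbacks against libfuse (or a binding such as fusepy) rather than the toy dispatcher shown here, and would back files with the SSD log rather than memory:

```python
class BlockDeviceDriver:
    """User-space BDD sketch: backs each file with an in-memory byte buffer
    here; the real driver would write to the SSD transactional log instead."""
    def __init__(self):
        self.files = {}

    def create(self, path):
        self.files[path] = bytearray()

    def write(self, path, data, offset):
        buf = self.files[path]
        if len(buf) < offset + len(data):
            buf.extend(b"\0" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data
        return len(data)

    def read(self, path, size, offset):
        return bytes(self.files[path][offset:offset + size])


class FuseRouter:
    """Stands in for the multi-threaded libfuse dispatch loop: every call
    arriving from /dev/fuse is handed to the BDD in userland by name."""
    def __init__(self, bdd):
        self.bdd = bdd

    def handle(self, op, *args):
        return getattr(self.bdd, op)(*args)


router = FuseRouter(BlockDeviceDriver())
router.handle("create", "/vm1.img")
router.handle("write", "/vm1.img", b"boot sector", 0)
print(router.handle("read", "/vm1.img", 4, 0))  # b'boot'
```

The indirection through a single `handle` entry point mirrors the design point in the text: because every VM IO surfaces as an ordinary user-space call, the BDD can be developed and debugged like any normal application.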

The create, delete, open, close, read, write, and flush calls are handled by the custom BDD. Write calls go to a high-endurance SSD write cache for low latency; the writes are laid out as a transactional log, and every write is acknowledged only after it is committed to the SSD, providing the desired low write latency. Data on the SSD is periodically flushed to a staging area on the hard disks in the background, outside the IO path of the end-user application / operating system. The staged data is further processed periodically in the background to achieve deduplication across all the VMs running on the hypervisor. The data can also be archived to an object store in the cloud.
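The write path above can be sketched as follows; all names and the JSON record format are illustrative assumptions, not the actual on-SSD layout:

```python
# Sketch of the SSD write path: each write is appended to a transactional log
# and acknowledged only after os.fsync(), so an acknowledged write is durable;
# a background job later replays the log into the slower HDD staging area,
# outside the guest IO path.
import json
import os
import tempfile

class SsdWriteLog:
    def __init__(self, log_path):
        self.log_path = log_path
        self.fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def write(self, path, offset, data):
        record = json.dumps({"path": path, "off": offset,
                             "data": data.hex()}) + "\n"
        os.write(self.fd, record.encode())
        os.fsync(self.fd)          # commit to SSD before acknowledging
        return len(data)           # the VM sees the ack only after the fsync

    def flush_to_staging(self, staging_dir):
        # Background job: replay committed records onto the HDD staging area.
        with open(self.log_path) as log:
            for line in log:
                rec = json.loads(line)
                target = os.path.join(staging_dir, rec["path"].lstrip("/"))
                mode = "r+b" if os.path.exists(target) else "w+b"
                with open(target, mode) as f:
                    f.seek(rec["off"])
                    f.write(bytes.fromhex(rec["data"]))


# Usage: one write is committed to the log, then staged in the background.
staging = tempfile.mkdtemp()
log = SsdWriteLog(os.path.join(staging, "wal.log"))
log.write("/vm.img", 0, b"hello")
log.flush_to_staging(staging)
```

Appending to a sequential log is what makes the SSD cache fast: it turns the VMs' random writes into sequential writes, which also limits SSD wear.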

Multiple storage servers run in an active-passive failover cluster to avoid a single point of failure. Each storage server in the cluster has its own SSDs, and the SSD on the active server is continuously replicated to the SSD on the standby server. This continuous replication guards against failure of the active SSD while the primary storage of running VMs is being served from it. The staging and deduplicated data are stored on a GlusterFS cluster to guard against disks going bad while the VMs are running.
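One way to reason about the replication guarantee, as a sketch with hypothetical interfaces (the text does not specify whether replication is synchronous, so the synchronous variant below is an assumption): the active server commits each log record locally and to the standby SSD before acknowledging, so a failover loses no acknowledged write.

```python
class SsdNode:
    """Stands in for one server's SSD log."""
    def __init__(self):
        self.records = []

    def commit(self, record):
        self.records.append(record)


class ActiveServer:
    def __init__(self, local, standby):
        self.local, self.standby = local, standby

    def write(self, record):
        self.local.commit(record)    # commit on the active SSD
        self.standby.commit(record)  # replicate before acknowledging
        return "ack"


active, standby = SsdNode(), SsdNode()
server = ActiveServer(active, standby)
server.write(("vm1.img", 0, b"data"))
```

After any acknowledged write, the standby holds an identical copy of the log, so the passive server can take over and serve the same VM images.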

We have implemented and deployed this solution with standard SSDs on our in-house OpenStack deployment and compared its performance against plain local-disk storage:

The fast storage is almost a hundred times faster than storage on the local disk. This means we can run hundreds of VMs without hitting an IO bottleneck or noticing any unresponsiveness in the VMs. It also means that VMs launch much faster and can run IO-intensive workloads efficiently. The improvement can be enhanced further; for example, the current OpenStack NFS driver for Cinder is fairly new and is still being tuned for better performance. The solution also scales further with more VMs. The same software stack provides additional secondary storage capabilities such as snapshot, backup, and archival without much overhead on primary production storage. This storage can also be deployed for other platforms like VMware, KVM, etc., and can easily be extended as an acceleration layer for other storage, such as an object store.