Evaluating SSDs in Virtualized Datacenters by Irfan Ahmad

Flash-based solid-state disks (SSDs) offer impressive performance capabilities and are all the rage these days. Rightly so? Let’s find out how you can assess the performance benefit of SSDs in your own datacenter before purchasing anything and without expensive, time-consuming and usually inaccurate proofs-of-concept.

** Please note that this article is written by Irfan Ahmad, follow him on twitter and make sure to attend his webinar on the 5th of June on this topic, and vote for CloudPhysics in the big data startup top 10. **

I was fortunate enough to have started the very first project at VMware that optimized ESX to take advantage of Flash and SSDs. Swap to Host Cache (aka Swap-to-SSD) shipped in vSphere 5. For those customers wanting to manage their DRAM spend, this feature can be a huge cost saving. It also continues to serve as a differentiator for vSphere against competitors.

Swap-to-SSD has the distinction of being the first VMware project to fully utilize the capabilities of Flash, but it is certainly not the only one. Since then, every established storage vendor has entered this area, not to mention a dozen awesome startups. Some have solutions that apply broadly to all compute infrastructures, while others have products designed specifically for the hypervisor platform.

The performance capabilities of Flash are indeed impressive, but they can cost a pretty penny. Marketing machines are in full force trying to convince you that you need a shiny hardware or software solution. An important question remains: can the actual benefit keep up with the hype? The results are mixed and worth reading through.

I thought it’d be fun to look for a measurement of hype. I plotted Google Ngrams for the usage of the terms “hard disk drive” and “flash drive” in publications. The results were remarkable: there is an exponential increase in usage of the latter term alongside a slow decline of the former. Take it for what it’s worth :-)

What is a virtualization team to do? For example, imagine the expense of buying $10K Flash PCI-e cards for read caching in all your servers. Is that cost justified? It could be, if the performance benefit were clear. But the reality is that it is extremely difficult to predict whether any given VM will see a significant performance benefit.

Let’s continue with the server-side IO caching use case (in my presentation on June the 5th, I’ll discuss several other use cases). My team was called in to help a customer in the automotive business with revenue in the multi-billions.

Case Study 1 (Automotive)—Disaster:

Customer Quick Facts:

4,000+ employees, $4 billion+ in revenue
Large SSD caching project
Feasibility study completed
Proof-of-concept (POC) completed
Production deployment completed
No measurable benefit to production VMs!

The customer first did a time-consuming, detailed study to validate feasibility of deploying SSD in their datacenter. Once completed, they undertook an expensive proof of concept with their SSD cache vendor. After all of this, the customer experienced NO measurable benefit! While this came as a shock to both the customer and the vendor, the scenario is far too common.

Understanding the mismatch between a POC and reality is the key to avoiding such problems. It turns out the customer had plenty of areas where performance was hurting and IO caching could have helped, but they used experts to do back-of-the-envelope VM selection for the initial SSD rollout. The experts chose VMs for the POC based on application identity and their assumptions about what the workload characteristics might be. Bingo! The issue was the assessment: they simply picked the wrong VMs. The operations team selected their top-tier application-tier VMs for the SSD rollout. Unbeknownst to them, the application-tier developers, tired of bad performance, had switched to a different architecture that changed the on-disk workload pattern. And you guessed it: that pattern didn’t see a speedup from caching.

In reality, the ops team should have selected the numerous other VMs serving as the DB backing store, which were still hurting badly. Instead, their POC blinded them to the fact that no benefit would come from their choice of VMs.

Aha! So different workloads respond differently to SSD IO caching in non-trivial ways. But argh! How do we figure that out? In the webinar I also provide more details on the types of workload characteristics that can affect performance.

To resolve this guessing game that is rampant in our industry, CloudPhysics engineers began discussing the idea of developing a card that would simulate the exact caching behavior of any VM without actually installing a caching solution. If we could do this, the results of this card for all the VMs in a datacenter would help us predict which VMs would benefit and by how much. OK, that sounded easy on paper, but pretty much everyone in the industry thought it was almost impossible to get to that level of accuracy. Let it be known that CloudPhysics engineers aren’t ordinary engineers. Under the technical direction of Carl Waldspurger, the team nailed it (I’ll cover how we accomplished this amazing feat in another post).

Let this industry achievement sink in for a moment: we now have the capability to simulate the latency benefit to a VM of applying an SSD cache. Amazing predictive power.
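To make the idea of simulating a cache concrete, here is a minimal Python sketch. It is a toy illustration only and not the CloudPhysics model: it assumes a plain LRU cache, a read-only trace of block addresses, and made-up latency figures (10 ms for backend storage, 0.5 ms for SSD). The function names and the two synthetic traces are hypothetical and chosen purely for illustration.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, cache_blocks):
    """Replay a sequence of block addresses through an LRU cache
    and return the fraction of accesses that were cache hits."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace) if trace else 0.0


def estimated_speedup(hit_ratio, backend_ms, ssd_ms):
    """Ratio of uncached latency to the hit-ratio-weighted average latency."""
    avg_ms = hit_ratio * ssd_ms + (1.0 - hit_ratio) * backend_ms
    return backend_ms / avg_ms


# Two synthetic read traces (assumed workloads, not real VM data).
hot_set = [i % 100 for i in range(10_000)]  # keeps re-reading the same 100 blocks
scan = list(range(10_000))                  # touches every block exactly once

for name, trace in (("hot set", hot_set), ("scan", scan)):
    h = lru_hit_ratio(trace, cache_blocks=500)
    s = estimated_speedup(h, backend_ms=10.0, ssd_ms=0.5)
    print(f"{name}: hit ratio {h:.2f}, estimated latency speedup {s:.1f}x")
```

The hot-set trace fits in the cache and gets a large estimated speedup, while the one-pass scan never re-reads a block and gets none. A real assessment has to deal with writes, mixed IO sizes, cache warm-up and actual device latencies, which this sketch ignores.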

Last year, we released the Caching Analytics Service card, which predicts how much any single VM, or all the VMs running on a host, cluster or datacenter, would benefit from server-side IO caching. Naturally, we offered this to customers, and hundreds have already used it as a paid service. Let me share two results here, though more details are covered in the webinar.

Case Study 2 (Finance): the CloudPhysics Caching Assessment card predicted that 16% of VMs would see a latency reduction of 1.5x to 4x

Case Study 3 (Education): the CloudPhysics Caching Assessment card predicted that only 3% of VMs would see a latency reduction of greater than 1.5x

In both case studies, the customer saved a tremendous amount of money by targeting only the VMs that would actually benefit significantly. We’ve saved customers hundreds of thousands of dollars while delivering the intelligence needed to achieve superb performance improvements. Imagine being able to pinpoint exactly where to drop the SSD cards to extract the best performance for your VMs. Best of all, the assessment is completely transparent: you don’t have to install any agents. It is point-and-click easy. If you’d like to try out this service, please contact us.

In conclusion, SSDs hold great promise, and they can resolve performance issues in your datacenters today. However, we have shown that a blind rollout is wasteful and that the vast majority of the benefit can be had by targeting the right VMs after a detailed assessment. Sign up now for a free trial of all the various CloudPhysics cards.

Bio:
Irfan Ahmad is the CTO and co-founder of CloudPhysics. He was the lead engineer behind VMware’s flagship products Storage DRS and Storage I/O Control, and he authored vscsiStats. His new company’s product takes the guesswork out of operations management, enabling you to model how your systems will behave by simulating different configurations. Irfan is leading a webinar this week that delves deeper into the benefits and challenges of SSDs in virtualized datacenters.

"Evaluating SSDs in Virtualized Datacenters by Irfan Ahmad" originally appeared on Yellow-Bricks.com. Follow me on twitter - @DuncanYB.
