The Art in the Architecture – vSAN Performance Design, Part 2

June 14, 2018

Preface

I’ll resume our discussion from the previous article in this series, specifically around vSAN’s handling of “hot spots”. Let’s start with a return visit to The Case for Shared Nothing. Mr. Stonebraker calls out three ways to solve hot spots:

  1. get rid of them
  2. divide a hot spot record into N subrecords
  3. use some implementation of a reservation system

Armed with the information we’ve covered so far, we can line up vSAN’s implementations of these quite well.

Get Rid of Hot Spots

As a storage system for VMs, vSAN doesn’t have the luxury of simply not performing writes desired. However, it can make these writes less of a hot spot by ensuring the involved parts are few in number and high in capability. The same efficiency you experience from running ESXi as a bare metal, “tier 0” operating system to provide VMs with CPU and RAM also means it can provide that efficiency when it comes to storage. I don’t need to proxy read/write activity through an array, VM, or something else. The hypervisor will communicate directly with the disks on one side and the VM on the other. This lessens complexity and latency involved and avoids potential bottlenecks like the availability of a VM, the route to a storage fabric or array, and so on.

The storage media involved in that VM write is a single device per node, which means I don’t have to muck about with disk RAID or other complications. The write goes directly to a single cache disk on at least one node, but usually to two or more nodes, depending on the “Failures to Tolerate” policy involved. Further, the writes are done in parallel (see “A Day in the Life of a vSAN I/O” to understand this better). This parallel behavior means there’s no waiting for a replication to somewhere else after every write before I can acknowledge back to the VM. Even better, deduplication and compression aren’t performed until data is destaged down to the capacity tier, so I don’t have to worry about that, either. This pushes a great deal of I/O through very quickly, so many hot spots are simply avoided. To quickly review (a simplified latency sketch follows the list):

  1. The hypervisor communicates directly to the storage, no middleman array, fabric, or storage VM is needed.

  2. The write is received upon a single cache device per node, which means no disk RAID complexities or chokepoints. (Ever seen what happens when the write back cache isn’t large enough on a RAID controller?)

  3. The redundant writes happen in parallel rather than being replicated afterward, which means I’m not waiting for post-write activity before I can acknowledge back to the VM.
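To make the parallel-versus-serial point concrete, here’s a tiny back-of-the-envelope sketch in Python. The latency numbers are made up purely for illustration; this models nothing about vSAN internals, just the arithmetic of waiting on the slowest replica versus waiting on each one in turn:

```python
# Toy latency model: acknowledging a write after parallel replica writes
# versus after a serial write-then-replicate sequence. The numbers are
# invented for illustration and say nothing about real vSAN latencies.

replica_write_latency_ms = [0.8, 1.1]  # per-node cache-device write times

parallel_ack_ms = max(replica_write_latency_ms)  # wait only for the slowest replica
serial_ack_ms = sum(replica_write_latency_ms)    # write, then replicate, then ack

print(f"Parallel replica writes acknowledge in ~{parallel_ack_ms} ms")
print(f"A serial write-then-replicate design would take ~{serial_ack_ms} ms")
```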

Divide Hot Spots into Smaller Parts

vSAN sub-divides objects into multiple components in two cases:

  1. A given object is > 255 GB in provisioned size, so separate components are created for each 255 GB “chunk”. These can be placed on the same or different disks (including other nodes).
  2. An object has a stripe width policy setting > 1, so the component is sub-divided into the requested number of stripes, which are always placed on different disks (also including other nodes).

We can see a nice visual of these when examining the data placement of a VMDK > 255 GB in size:

vSAN placing an object larger than 255 GB
An example of a larger VMDK placed on vSAN. This VMDK is 510 GB in size and has a policy of FTT = 1 with Mirroring, so two “RAID-0 chunks”, (data stripes) are created for each replica.
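For the figure above, the arithmetic works out like this. A quick back-of-the-envelope sketch in Python; it ignores witness components and any version-specific placement rules, and the helper function is purely illustrative:

```python
import math

# Back-of-the-envelope math for the figure above: how many 255 GB "chunks"
# a VMDK is split into per replica, and the resulting data components with
# FTT = 1 mirroring. Witness components are ignored.

CHUNK_SIZE_GB = 255

def chunks_per_replica(vmdk_size_gb: float) -> int:
    return max(1, math.ceil(vmdk_size_gb / CHUNK_SIZE_GB))

vmdk_gb = 510
replicas = 2  # FTT = 1 with mirroring

chunks = chunks_per_replica(vmdk_gb)
print(f"{vmdk_gb} GB VMDK -> {chunks} chunks per replica, "
      f"{chunks * replicas} data components in total")
# 510 GB VMDK -> 2 chunks per replica, 4 data components in total
```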

The VMDK pictured has been divided into smaller parts. This alone can help certain transactions performed against it complete much faster, as there’s less data in each part. In general, computers, especially the general-purpose servers vSAN runs on, prefer many small problems over a few large ones. Great, we’re dividing a large problem into small parts and opening up an opportunity for advantage here. However, these “chunks” aren’t really made for performance’s sake: they’re made so we can fit large VMDKs onto the storage media we might use in vSAN. After all, I can create a 62 TB VMDK in vSAN 6.7, but I can’t buy a 62 TB capacity disk!1 Because the focus of these components is practical rather than performance-oriented, they can be placed on the same disk, so there’s still a chance that a very busy VMDK isn’t engaging enough disks to work as quickly as it could.

This is where the vSAN Stripe Width policy rule comes to the rescue. John Nicholson wrote about vSAN stripes a while back; it’s a great quick read. The key point of stripes in vSAN is that they’re always on different disks from their counterparts. To see this in action, here’s a smaller VMDK, but with a stripe width of 3:

vSAN placing an object with a stripe width of 3
In this example, a 100 MB VMDK has a stripe width of 3. While 100 MB isn’t large enough to require the “chunks” a larger VMDK has, this object has many sub-components as a result of the striping policy. When mirroring, these striped components will always be on different disks from those belonging to the same replica.

We can see that there are three parts to each replica. Each of those three parts within the same replica will always be on different disks, so we can ensure plenty of disk engagement here. As John notes in the article I referenced above, this is a great option when needed, but it’s actually best to start at the default single stripe, examine metrics and data, and scale to more stripes as you need them. We often find that engaging more disks isn’t necessary for many objects in vSAN, thanks to the other intrinsic performance features within vSAN that I’ve noted here and in the previous article.

Note

As you add more stripes, you need more disks to support those striped components. Check out Cormac’s great writeup on vSAN stripes and disk requirements for more info.
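To get a feel for why more stripes demand more disks, here’s a rough multiplication sketch in Python for the mirrored case. The function names are just illustrative, and the math ignores witness components, the 255 GB chunking above, and erasure coding; treat it as a feel for the scaling, not an official sizing formula:

```python
# Rough rule-of-thumb arithmetic for how stripe width multiplies components
# and capacity-disk requirements under RAID-1 mirroring. Witnesses, chunking,
# and erasure coding are ignored.

def data_components(stripe_width: int, ftt: int) -> int:
    replicas = ftt + 1
    return stripe_width * replicas

def min_capacity_disks(stripe_width: int, ftt: int) -> int:
    # Stripes within a replica always land on different disks, and each
    # replica lives on a different host, so the disks don't overlap.
    return stripe_width * (ftt + 1)

# The 100 MB example above: stripe width 3, FTT = 1 with mirroring
print(data_components(3, 1))     # 6 data components
print(min_capacity_disks(3, 1))  # at least 6 distinct capacity disks
```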

So we can see here that vSAN has two ways of dividing potential hot spots, for two different reasons:

  1. Divide large objects for practical purposes

  2. Offer optional division of any object to arbitrarily engage more disks

Use Some Implementation of a Reservation System

I’m sometimes asked why vSAN doesn’t have an “IOPS guarantee” policy or something like that. That might be worth a whole article in itself sometime, but the simplest answer is that vSAN focuses on delivering sufficient IOPS to every requester, instead of making you micro-manage that. Moreover, what would be done to “guarantee” these IOPS? Limit all other I/O unconditionally? That gets dangerous, like when an important VM is being resynchronized after a disk or node failure. Remove some threshold? Why not have it run unfettered in the first place? vSAN gives you a great deal of control over the performance of your VMs, but doesn’t have or need a performance reservation. However, there are two different reservation systems in vSAN that are very relevant here.

IOPS Limits

While I can’t (and personally, don’t want to) specify some “IOPS floor” for VMs, I can specify a limit. This is great for test VMs that might be subject to scripts or processes with unintended consequences. Do I really want that test SQL server ticking away at 10,000 IOPS just because someone accidentally ran an UPDATE statement that changed 2 million rows? Even if vSAN can handle that burst with no issue, there are other side effects it might create. IOPS limits ensure that a VM (or even an individual VMDK) can’t make more requests than I allow. These limits can be monitored alongside the actual ongoing IOPS of the VM by navigating to Your VM > Monitor > Performance > vSAN > Performance, then selecting the “Virtual Disks” tab and choosing the VMDK you’ve limited from the “Virtual Disk:” drop-down.

Monitoring the actual IOPS vs. IOPS limit on a VMDK.
This VM has an IOPS limit of 100. Good thing my ongoing IOPS are nowhere near that…
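For a sense of what such a limit does to a runaway workload, here’s a tiny conceptual sketch in Python. This isn’t vSAN’s implementation or any vSphere API, just arithmetic on example numbers like those above:

```python
# Conceptual illustration only: NOT vSAN's implementation, just simple math
# showing what a per-VMDK IOPS limit does to an aggressive workload like the
# runaway UPDATE described above.

iops_limit = 100        # the policy limit from the screenshot
offered_iops = 10_000   # what the test SQL server would like to push
duration_s = 60         # how long the burst lasts

serviced = min(offered_iops, iops_limit) * duration_s
held_back = max(0, offered_iops - iops_limit) * duration_s

print(f"I/Os serviced in {duration_s}s: {serviced}")
print(f"I/Os held back (left queued in the guest): {held_back}")
```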

Adaptive Resync

Adaptive Resync dynamically allocates network bandwidth to the different types of data queues in vSAN based on need and available bandwidth. For example, if VM traffic is busy enough that it’s putting pressure on the network available to vSAN, then resync traffic will be throttled to no more than 20% of the total bandwidth available. As soon as that pressure is relieved, the choke is removed. This is a great consideration, because so far in this article we’ve been focusing on storage media while ignoring just how crucial available network bandwidth is to ongoing I/O. This is a type of reservation system, but one focused entirely on network bandwidth, as that’s especially out of vSAN’s control: it can’t dictate when or how VMs will ask for reads and writes, it just has to “roll with the punches”. If you’re interested in learning more about this feature, here’s a great quick gist on vSAN Adaptive Resync by Duncan, and the official Tech Note on Adaptive Resync in vSAN 6.7.
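As a rough mental model of that 20% rule, here’s a simplified sketch in Python. It is not vSAN’s scheduler code; it only encodes the behavior described above, and the function name, threshold check, and bandwidth figures are assumptions made for illustration:

```python
# Simplified model of the Adaptive Resync behavior described above: under
# contention, resync is held to ~20% of available bandwidth; otherwise it
# may use whatever VM traffic leaves free. Not vSAN's actual scheduler.

RESYNC_CAP_FRACTION = 0.20

def resync_bandwidth_gbps(total_gbps: float, vm_demand_gbps: float) -> float:
    congested = vm_demand_gbps > total_gbps * (1 - RESYNC_CAP_FRACTION)
    if congested:
        return total_gbps * RESYNC_CAP_FRACTION
    return total_gbps - vm_demand_gbps

print(resync_bandwidth_gbps(10.0, 9.5))  # contention: resync throttled to 2.0 Gbps
print(resync_bandwidth_gbps(10.0, 3.0))  # quiet: resync free to use up to 7.0 Gbps
```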

So, while vSAN has some reservation systems, these take a back seat to the other methods mentioned above to control I/O flow: eliminating hot spots where possible, and sub-dividing work across agents. Mr. Stonebraker agrees with this priority in his notes about improving database performance:

It is clear that this tactic [of using reservation systems] can be applied equally well to any of the proposed architectures; however, it is not clear that it ever dominates the “divide into subrecords” tactic. Consequently, hot spots should be solvable using conventional techniques.

Conclusion

Hopefully this has given some insight into the native performance characteristics of vSAN. There’s plenty more that could be said, of course; I chose to focus on the key aspects most relevant to core I/O flow and shared-nothing systems. Besides what I’ve linked in this article and the previous vSAN performance design article in this series, some other great material on vSAN performance is below.

Storage Performance on the Virtually Speaking Podcast
Pete Koehler chats with the hosts of Virtually Speaking about storage performance and vSAN’s advantages
Designing for Performance on the Virtually Speaking Podcast
Pete returns to discuss more performance considerations and great new performance features in vSAN 6.7
Extreme Performance Series: vSAN Performance Troubleshooting
Amithaba and Suraj provide an awesome deep-dive presentation on understanding, troubleshooting, and designing for performance in vSAN at VMworld 2017
vSAN Performance Evaluation Checklist
A quick rundown on how to properly assess vSAN performance hands-on
vSAN Caching Algorithms Whitepaper
The caching behavior of vSAN is reviewed in detail, shedding light on the performance gains found in its remarkable design

Next time in this series, let’s put this great performance to specific work by examining Storage Policy-Based Management in detail. Until then, thanks for reading!


  1. At least, not yet. We’re up to [16 TB], so let’s see how far this goes…