Mike Workman
 


October 16, 2009

Auto-Tiering of Data

Auto-Tiering of data in a storage array, at least in Pillar’s vision, has two axes: 1) application priority; and 2) class of storage. In the Axiom, these are represented by two functions: 1) QoS, explained in many other posts; and 2) data migration within the array. Currently we have coupled QoS with storage class, but we are decoupling those attributes so that all classes are available to different quality of service levels, with the benefits and restrictions applied appropriately.

Simply put, QoS allows for constructive contention management and resource deployment (cache, CPU, network, storage class). Instead of segregating spindles and silo-ing platforms by application the way we did with servers before VMware, Xen, and Hyper-V, Axiom allows sharing or multi-tenancy of the storage by managing system resources to meet the service levels required by applications.

Data migration within an array (or coupling of arrays) is: “Data moves to faster class of storage elements when signaled to move, scheduled to move, or allowed to do so automatically using array-based algorithms that determine it should move.”

We refer to our vision on this topic as Auto-Tiering.  It is a superset of what some refer to as data progression. IBM referred to it as HSM 20+ years ago, but never implemented it in their storage arrays. Storage used to be dumb enough that the industry virtualized above the storage array. The argument was, if I virtualize above the array, I can manage heterogeneous storage from different vendors and do tons of fancy things that sound great. IBM’s SVC is just such a contrivance and they have had pretty good success with this approach. It took 10 years, but hey, some of this has materialized and we have customers that manage the Axiom with SVC.

At the array level though, for modern architectures, there are a few must-haves:

1.    Based on a storage pool
2.    Managed by policy
3.    Allow classes of storage within the pool (tiered pool)

These are basic.

Our Auto-Tiering vision includes the basics above with the following key system level features:

1.    QoS
2.    Dividing volumes into finer grains (super-blocks)
3.    Providing for scheduled, signaled, or automatic movement of super-blocks through the classes or a subset of classes of storage within the pool.
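
To make the idea of super-blocks and movement triggers concrete, here is a minimal sketch. Everything in it is hypothetical: the tier names, the `SuperBlock` and `Trigger` types, and the `move` function are illustrations of the concept, not the Axiom's data structures or API.

```python
from dataclasses import dataclass
from enum import Enum

class Trigger(Enum):
    """The three ways a move can be initiated (names are illustrative)."""
    SCHEDULED = "scheduled"
    SIGNALED = "signaled"
    AUTOMATIC = "automatic"

# Hypothetical storage classes in one pool, ordered slowest to fastest.
TIERS = ("sata", "fc", "ssd")

@dataclass
class SuperBlock:
    volume: str   # volume this grain belongs to
    index: int    # position of the grain within the volume
    tier: str     # storage class it currently lives on

def move(block, target_tier, trigger, allowed_tiers=TIERS):
    """Move one super-block, restricted to the subset of classes policy allows."""
    if target_tier not in allowed_tiers:
        raise ValueError(f"{target_tier} is not in the allowed subset {allowed_tiers}")
    previous = block.tier
    block.tier = target_tier
    return (block.volume, block.index, previous, target_tier, trigger.value)

# A volume divided into four super-blocks, all starting on SATA.
volume = [SuperBlock("db01", i, "sata") for i in range(4)]

# Only the second super-block moves; the rest of the volume stays put.
event = move(volume[1], "ssd", Trigger.SIGNALED)
```

The point of the finer grain is visible in the last line: one hot region can change class without dragging the whole volume along with it.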

And that is it! It is very powerful, depending on the degree of development of each of the basics, and auto-tiering attributes. The current landscape amongst a few providers who are working on at least a subset of this is shown in the table below.

[Table: current auto-tiering landscape by provider (image not reproduced)]

Algorithms used to move stuff up and down the classes of storage in the pool are not very good today. One could argue that they are better than none at all, but that is not true either. Here’s why: the algorithms use past data usage patterns to predict the future. In other words, it takes days to decide to move data. And of course it takes time to move the data once the decision has been made. To defeat the algorithm, one just needs to imagine cases where the last X days of use aren’t representative of the next Y day(s). It’s that simple. And common! High-priority data sets are not necessarily the same day in and day out: use patterns are not regular, and even when they are, variations ruin the algorithm’s effectiveness. It is easy to come up with examples where it works, and easier still to come up with realistic “hostile” cases. This is why you don’t see standard benchmark results from companies who have some version of this feature: benchmarks are designed to avoid allowing this kind of tuning because it is in general not representative of how you should run a shop.
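
A toy version of the "hostile" case is easy to write down. The sketch below assumes a deliberately naive predictor that promotes whatever was busiest over the last few days. The data set names and access counts are invented for illustration, and no vendor's algorithm is being reproduced here.

```python
def hottest_by_history(history, window, slots):
    """Pick the `slots` data sets with the most accesses over the last `window` days."""
    totals = {name: sum(days[-window:]) for name, days in history.items()}
    return sorted(totals, key=totals.get, reverse=True)[:slots]

# Hypothetical daily access counts for the past five days.
history = {
    "web_logs":   [900, 950, 1000, 980, 990],
    "month_end":  [0, 0, 0, 0, 0],          # idle all month
    "mail_store": [300, 310, 290, 305, 300],
}

# Tomorrow is month end, and the reporting data set gets hammered.
tomorrow = {"web_logs": 950, "month_end": 5000, "mail_store": 300}

promoted = hottest_by_history(history, window=5, slots=1)
actually_hottest = max(tomorrow, key=tomorrow.get)

print("promoted on history:", promoted)            # ['web_logs']
print("busiest tomorrow:   ", actually_hottest)    # month_end
```

The predictor does exactly what it was built to do and still puts the wrong data on the fast tier, because the last X days said nothing about the next one.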

If you have to set up a benchmark by placing data on a storage system, how do you do it? Well, you examine the workload and you use the policy-based management CLI or GUI to place volumes according to the workload. Obviously no goofball at Pillar or EMC or NetApp is going to purposely do this wrong…so we all set it up to perform the best we can given the storage assets we have in the system (or storage pool if you are Pillar, CML or 3PAR). Unless you are setting up someone else’s system to make the case that it is a piece of crap – but who would do that?

Very few people are out of space on high-performance spindles. The fact is they don’t have enough spindles for the IO they need in the first place. We all know this one. So what are you going to do with more free space? Put more stuff on it? Not unless you had lots of IOPS on them to give away to the “super-blocks” you wanted to move there.

On the other hand, this is not true, or won’t be as true, for SSD. SSD IOPS are plentiful, but the capacity side is a challenge. So moving data to lower tiers to free up space on SSD is currently a valuable scheme.
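
The spindle-versus-SSD argument is just arithmetic, and a small sketch shows it. All device counts, capacities, and IOPS figures below are made-up round numbers chosen for illustration, not measurements of any product.

```python
def binding_constraint(devices, capacity_gb_each, iops_each, need_gb, need_iops):
    """Return which resource the workload exhausts first, plus both utilizations."""
    capacity_used = need_gb / (devices * capacity_gb_each)
    iops_used = need_iops / (devices * iops_each)
    limit = "iops" if iops_used > capacity_used else "capacity"
    return limit, capacity_used, iops_used

# Hypothetical workload: 2,000 GB of data that needs 6,000 IOPS.
need_gb, need_iops = 2000, 6000

# Assumed tier of 20 fast spindles, 300 GB and 180 IOPS each.
spindles = binding_constraint(20, 300, 180, need_gb, need_iops)

# Assumed tier of 4 SSDs, 64 GB and 5,000 IOPS each.
ssd = binding_constraint(4, 64, 5000, need_gb, need_iops)

print("spindles run out of:", spindles[0])  # iops
print("ssd runs out of:    ", ssd[0])       # capacity
```

With these assumed numbers the spindle tier is two thirds empty yet has far fewer IOPS than the workload needs, so freeing space on it buys nothing. The SSD tier has IOPS to spare and nowhere near enough room, which is why moving cold data down matters there.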

But having multiple applications vying for SSD without the ability to prioritize volumes from a business perspective is not a good idea.

Let me show you how the lack of QoS coupled with data migration will make things worse for a system: put a low priority set of LUNs onto it and access them frequently; without QoS, the system will purposely move the low priority super-blocks up the tiers of storage where they can hog system resources (capacity, IOPS) – the squeaky wheel gets the grease. In this case, the wheel that is squeaking is taking precedence over more important applications for a finite resource. You see this in a litter of dogs – the biggest puppy hogs the best spots in the chow line and gets the most chow. If the smallest dog, the one who is starving to death, could talk, you would clearly not be deploying the food correctly: that talking pup is worth a billion dollars. Nature may be grand, but this is similar to letting those downloading music take all the system resources from those running the business.
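
The squeaky-wheel problem can be sketched in a few lines. This is an illustration under stated assumptions: the `Block` type, the four priority levels, and the rule of ranking by priority first and access frequency second are all hypothetical choices, one of many ways to fold business priority into placement.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    accesses_per_day: int
    priority: int  # hypothetical QoS level: 1 = low, 4 = premium

def promote_by_frequency(blocks, slots):
    """No QoS: the busiest blocks win the fast tier, whatever they are."""
    ranked = sorted(blocks, key=lambda b: b.accesses_per_day, reverse=True)
    return [b.name for b in ranked[:slots]]

def promote_with_qos(blocks, slots):
    """With QoS: business priority ranks first, frequency breaks ties."""
    ranked = sorted(blocks, key=lambda b: (b.priority, b.accesses_per_day), reverse=True)
    return [b.name for b in ranked[:slots]]

blocks = [
    Block("music_share", 9000, 1),
    Block("order_db", 4000, 4),
    Block("billing", 2500, 4),
    Block("test_env", 6000, 2),
]

print(promote_by_frequency(blocks, slots=2))  # ['music_share', 'test_env']
print(promote_with_qos(blocks, slots=2))      # ['order_db', 'billing']
```

With only two slots on the fast tier, frequency alone hands both to the low priority data, and the applications that run the business get none.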

Defining algorithms that can see into the future and respond to patterns of usage while following your business priorities is challenging. Today in the Axiom we move entire volumes up and down in class. We can deploy CPU, priority, network, cache, and storage class by moving data, or adjust everything without moving the data.  This can be done manually, or by “signal” from some event, or scripted on a schedule that allows data centers to deterministically make these changes based on known workloads, patterns, or movement of virtual machines or whatever.  Automatic data migration (managed by policy, QoS, and data use patterns) will appear in the Axiom in 2010.

Vision

Here are the key elements of where I believe everyone is headed:

1.    Storage pool
2.    Policy management
3.    Tiered storage management within the pool
4.    Granular, sub-volume level flexibility (super-blocks)
5.    Algorithms that determine which tier the “working set” of super-blocks should reside in. These need massive improvement over where we are today.
6.    QoS. You should not use frequency of use as the sole determinant of tier position – it is highly flawed. A business-priority or application-aware axis is essential in optimizing system resources.

Within the next year, Pillar will expose the granular structure of our volumes to make auto-tiering possible, and we will provide more hooks into our API that allow the automatic movement of data subject to constraints, while still supporting the control we have today.


