
Storage Industry Archives • Home
OK, we’ve talked about SAS drives and the fact that what people really want out of them is “fast”. Let’s expound a bit.
What anyone wants out of an HDD is high access density (IOPS/GB) and a high GB/$ (more normally specified as low $/GB, but I inverted this to make a point on the following graph).

Unfortunately we can see from this that as GB/$ has gone up sharply, access density has dropped precipitously. What is access density? Access density translates to the number of actuators chasing data. Thus with the relentless pursuit of more GB per disk drive, access density has gone in the tank.
We all know that the low capacity models of server drives last a lot longer than a mere $/GB consideration can explain, and the reason is that the more capacity you have under an actuator, the lower the performance of a subsystem comprised of these sorts of disks.
So 24 spindle “shelves”, or in Pillar’s parlance - Bricks, have higher access density than the same capacity, low RPM larger platter incarnations. Hence, better performance.
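To put rough numbers on that, here is a minimal sketch that computes access density as IOPS per GB and compares two ways of building the same capacity. The drive sizes and per-spindle IOPS figures are assumptions chosen for illustration (loosely in line with the rules of thumb used elsewhere in these posts), not measurements or vendor specifications.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    name: str
    capacity_gb: float  # capacity sitting under one actuator
    iops: float         # assumed random IOPS per spindle

    @property
    def access_density(self) -> float:
        """IOPS available per GB stored on this drive."""
        return self.iops / self.capacity_gb


# Illustrative figures only (assumptions, not vendor specifications).
sata = Drive("7200 RPM SATA, 1 TB", capacity_gb=1000, iops=100)
fc15k = Drive("15K RPM FC, 300 GB", capacity_gb=300, iops=250)


def spindles_needed(drive: Drive, total_gb: float) -> int:
    """Number of drives (actuators) required to hold total_gb."""
    return math.ceil(total_gb / drive.capacity_gb)


# Capacity of a 24-spindle shelf of the small, fast drives.
total_gb = 24 * fc15k.capacity_gb

for drive in (fc15k, sata):
    n = spindles_needed(drive, total_gb)
    print(f"{drive.name}: {n} actuators, {n * drive.iops:.0f} IOPS, "
          f"{drive.access_density:.2f} IOPS/GB per drive")
```

Under these assumed figures the same capacity ends up with a third as many actuators chasing the data when it is built from the big, slow drives, which is the whole point about access density.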
There are a few interesting and very enlightening points you can make about this:
1. Access density is always at odds with cost for HDD-based subsystems of a given RPM.
2. Access density always gets better with smaller platter sizes for an equal number of platters.
3. Smaller form factor drives of the high RPM variety don’t yield as big an improvement as you might think, because 95mm (3.5” form factor) 15K RPM drives already use smaller platters – closer to the 2.5” form factor that the SFF HDD uses anyway.
4. What Small Form Factor Drives give you is the ability to package them in a serviceable enclosure that puts more actuators per TB in the familiar storage tray! Serviceability is key for small, medium, and large Enterprise.
5. There has always been an option of stacking drives in some high density fashion to optimize actuators per unit volume, but Serviceability is just about always compromised – hence it is not a common practice.
OK good, now the question is… what else can we do to get around access density? Well, with legacy architectures the answer is cache (don’t use the disk if we can avoid it). We can pay more and use smaller disks – oh, the wonderful days of 9GB disk drives.
Or, we could take a large capacity disk with great $/GB, and short-stroke the disk. As an example, let’s say we use 10% of the drive’s capacity. The access density will go up by about 20X!! (~2X from access time reduction, 10X from the fact that the actuator is chasing 10% of the capacity). Wow! And to think we made that HUGE improvement by only increasing the $/GB by 10X. Oops.
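The arithmetic behind that example can be written out in a few lines. This is only a sketch of the back-of-the-envelope calculation above; the function name and its parameters are made up for illustration, and the ~2X access time speedup is taken as an input rather than derived.

```python
def short_stroke(fraction_used: float, access_time_speedup: float):
    """Back-of-the-envelope effect of short-stroking a drive.

    fraction_used: share of the drive's capacity actually used (0 < f <= 1).
    access_time_speedup: assumed gain from the shorter seeks (about 2X above).
    Returns (access density gain, $/GB multiplier).
    """
    if not 0 < fraction_used <= 1:
        raise ValueError("fraction_used must be in (0, 1]")
    capacity_factor = 1 / fraction_used          # actuator chases less data
    density_gain = capacity_factor * access_time_speedup
    cost_per_gb_multiplier = 1 / fraction_used   # same drive price, less usable GB
    return density_gain, cost_per_gb_multiplier


gain, cost = short_stroke(fraction_used=0.10, access_time_speedup=2.0)
print(f"access density gain: about {gain:.0f}X, $/GB goes up about {cost:.0f}X")
```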
Well, here is what the Axiom QoS does for you. It allows you to short stroke the drive, and get a HUGE access density improvement. But instead of throwing away the other 90% of the capacity, it allows you to sneak in and access that part of the disk in a way that only causes the access density of the high performance capacity (10% share) to drop by say, 10%, from 10X better to only 9X better. WOW! You get the whole drive space back, but have a portion of it that behaves like it has 9X the access density!!
You see, the truth is that stove-piped storage doesn’t get you anything if you can buy a single array that can mitigate contention while increasing access density.
So what in the hell does this mean anyway? To many, it means using expensive disk for high performance applications, midrange disk storage for mid-performance applications, and low cost SATA desktop drives for archive or disk backup applications. (You may recall this as ILM, HSM, or SRM depending on which vendor’s Flavor Aid you were drinking at the time. BTW, did you know that despite conventional wisdom it was not in fact Kool-Aid that was ladled out at Jonestown?)
If you have a long memory and have been around a while like me, you can actually recall the days when this meant mainframe and minicomputers (and 8 track tape!). Tiering was by platform, and they were all very expensive relative to today.
Most of our competitors tier by platform, although it is pretty much all open, networked modular arrays. Few people are building closed interface monolithic storage anymore, unless they have a cash cow to milk. (DMoooooX)
So to some, the modern idea of Tiering is allowing different types of disk on the same platform and allowing the storage for less critical applications to be SATA on the same platform that FC disk resides. You can see a lot of confusion in chat rooms and blog posts around this. Tiering on an array – well just about everyone has this don’t they? Sure, if you define it as having SATA and FC disk on the same array. We have finally reached the point where people at companies like NetApps (I know, but I like saying it that way since they are so officially uptight about their damn name) have stopped saying SATA will never make it into the enterprise. Whoopty do!
So to me, Tiering on disk goes much further. For those of you who think I am going to say down to the location on the platter, you’re wrong. Tiering on disk to me means that you don’t have to think about what platter, and define LUNs or Filesystems to reside on certain types of disk. Tiering on the array means you have a storage pool, and the array picks the type of disk you need out of the pool based on your application requirements. It also means that the system will move or migrate LUNs and File systems based on changing requirements. For those of you who think this is standard, that QoS, Application-Aware, and auto-migration of data from FC to SATA are standard, you need to look again. They are far from standard; in fact they are basically not there unless you buy a Pillar Axiom.
Tiering on disk means you don’t have to buy 2, 3, 4 different platforms to meet disparate needs in your IT shop; you buy one platform that meets those needs out of a disk storage pool according to the application requirements you specify when you set it up. The only thing you need to do is pick some disk resources that encompass the needs of your applications to put into the pool, like SATA 1TB disk, FC 300GB 15K RPM disk to span a wide range of high capacity and high performance for QoS to work with for all your applications.
Why? Efficiency. Utilization is driven up with proper application of a storage pool in your data center. You can use both the capacity and the IOPS of the spindles you own using Axiom instead of one or the other.
Chris Mellor quoted me regarding an EMC spokesperson saying that SSD would be at price parity with HDD by 2010 – about 18 months from now. The implication was that SSD will replace HDD in 18 months time.
While I think this is incorrect, I did point out that this happened once before and there are reasons other than Cost per GB that might drive this crossover.
My IBM team 10 years ago or so developed a 1-inch drive we called the “MicroDrive”. Well, before you could buy 256MB of Flash, we sold 1GB MicroDrives. Although the cost per MB was much cheaper with the MicroDrive, the marginal value of the capacity was not large enough to motivate its purchase by a large enough proportion of the users. Only in extreme situations where the largest capacity was of value would someone choose the “cheaper” solution. Essentially, the value of a robust solid state solution was higher than the one that gave a lower cost per MB.
It’s kinda like making a trip to a Warehouse Store that sells groceries; it may be cheaper by the pound to buy a 10 pound block of cheddar cheese, but unless you’re feeding an Army the stuff will turn green before you can use it. In this situation the one pounder is easier to keep in the fridge, and you can use it up before you get sick of trying to add cheddar cheese to everything you make. Never mind the large quantity value proposition.
Such will be the case for Laptops – 128GB of SSD is a lot, and its speed and physical robustness are far more valuable than an extra 128GB for all but the most geeky users.
How about the Enterprise? Well, sure there are a lot of Customers who will find the speed and size of SSD to be ideal for certain circumstances, but with the capacity demands today it is unlikely that we can afford to substitute all the Petabytes of HDD with SSD for at least another 4-5 years.
So I am not sure Tucci was blowin’ smoke, but at the least his assertion seems a bit aggressive.
Perhaps if I started a “Pinheads and Patriots” section of the Blog? Nah, one of those is enough.
Well, you do if you own or operate a storage array. Storage systems have lots of components, including mechanical ones like disk drives. The whole point of RAID is to deal with failures of parts. Moving parts fail most often. Most systems today (not all) have redundancy built-in throughout the entire system. In an Axiom, all those redundant parts do work for the array all the time because it is an active-active architecture. Some arrays have active-passive architectures that waste those components that just sit around waiting for a failure.
So, what happens when a component fails? Well, in HA systems like Axiom, Clariion, and NetApp products, customers still have access to their data. Paramount in all but the most trivial storage applications is being able to get at your data regardless of any single failure.
What goes mostly unspoken is the effect of a failure on performance. Systems from NetApp and EMC can take a long time under load to rebuild their failed disks onto a spare drive. In fact, it can take more than a day! This matters because while the rebuild is in progress, the array is running without protection against another failure. The odds of a second failure are small, but get proportionately larger as the rebuild times grow longer.
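To see why rebuild time matters, here is a rough sketch of the exposure window. It assumes drive failures are independent and exponentially distributed, and the MTBF and drive count are hypothetical numbers picked for illustration; real failure behavior is messier than this.

```python
import math


def second_failure_probability(surviving_drives: int,
                               rebuild_hours: float,
                               mtbf_hours: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild.

    Assumes independent, exponentially distributed failures (a simplification).
    """
    rate = surviving_drives / mtbf_hours      # combined failure rate per hour
    # 1 - exp(-rate * t), computed accurately for small values.
    return -math.expm1(-rate * rebuild_hours)


# Hypothetical numbers: 13 surviving drives, 1,000,000 hour MTBF per drive.
for hours in (2, 24):
    p = second_failure_probability(13, hours, 1_000_000)
    print(f"{hours:>2} hour rebuild: second-failure probability {p:.6f}")
```

For small probabilities the result is very nearly proportional to the rebuild time, so a rebuild that takes twelve times as long carries about twelve times the risk.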
To solve this problem, some storage manufacturers put yet another redundant disk drive into their arrays. So you pay for more unusable capacity and power to protect yourself against the vendor’s long rebuild times. This is a great technique against loss of data, but wasteful and expensive. In contrast, Pillar’s Axiom drastically reduces the drive rebuild time by using a distributed hardware RAID architecture. Distributed RAID gives the following clear, demonstrable benefits that have been measured by outside laboratories against our competitors:
- We rebuild faster than any array on the market.
- We perform better under faulted conditions by a HUGE margin, factors of 2-3.
- We perform under all faulted conditions with minor loss of performance, on the order of 0-8%, versus 50% loss of performance from some vendors.
While most everyone guarantees continuous access to data under fault, they really don’t want to talk about the systems’ performance under those fault conditions. Why does this matter? Well, backup window integrity, customer perceived performance, boot times, the list goes on and on. They all depend on predictable, reasonable performance of the system, not 3 to 1 variations in performance under fault.
If you want great performance under fault conditions of any type, buy the Axiom. Your mileage may vary, but it will vary a hell of a lot less with the Axiom than with our competitors.
I was reading a competitor’s brochure a few days ago and it struck me how dishonest people can be talking about performance in a storage array.
I am going to give a couple of rules of thumb for performance here, and out of context they could be said to be “wrong”, but I think they are true enough to be illuminating:
1. IO’s per second or IOPS to a first order depend on how many spindles you have in a disk array. This can be exceeded with really effective cache on the right workloads, but IO performance is fundamentally based on Spindle count.
2. SATA drives (7200 RPM) are capable of 100-125 IOPS if you have an algorithm like Pillar’s QoS intelligently laying out the data on the disk and accessing it optimally.
3. Fibre Channel or SAS drives at 15K RPM can give you 2.5X that of a 7200 RPM SATA drive.
4. A typical storage shelf (2U) with 12 SATA drives therefore gives you about 1,000-1,500 IOPS.
5. A typical storage shelf (2U) with 12 15K RPM FC drives in it will yield about 2,500-3,000 IOPS.
6. Other than effective cache utilization, storage controllers at some point limit performance of spindles. In other words, you aren’t going to get more than about 3000 IOPS from 12 15K RPM FC or SAS drives. You can put the world’s most capable controller on it, but you aren’t going to coax more IOPS out of them with a bigger controller. Of course if you have enough cache the disk can eventually idle, but that is not reality in any system.
7. Putting 1,000 spindles of any type on a pair of RAID controllers, or in a single Storage Controller, will not net you 100,000 IOPS: nobody makes a single controller (even an active-active pair) that handles that many spindles today. So saying your system scales because you can support 1,000 spindles on your Storage Controller is deceptive at best: you may support the capacity but you will not get the performance those spindles can deliver – this is a waste.
8. If you expand your definition of storage controller to mean a confederation of controllers, in Pillar’s case 2, 4, 6, or 8 controllers on the same storage pool, expanding to 1,000 spindles may give you both the capacity and the performance the drives can deliver. While some companies besides Pillar offer multiple controllers and RAID engines, nobody offers 128 hardware RAID engines as Pillar does. Even with multiple controllers, most companies charge you for software that you load on those “expansion” controllers, and as I always tell our customers, buying two of something isn’t scaling. Pillar does not charge you for the software when you non-disruptively add controllers to a system.
9. Cache is a weapon in the war on performance. Cache can help avoid using slower disk, and it can help in organizing operations to disk in a way that makes the disk more efficient. Cache is probably more important than ever with larger numbers of servers sharing the data of the same storage array. There is nobody in the midrange market that makes more cache available than Pillar (96 GB per storage pool). There are storage companies who offer a whole 4GB of cache on 1PB of disk; this is just downright lame, embarrassing to say the least. Actually, let’s be brutally honest, it is pathetic as it is about twice as much as most people have on their laptop.
10. Throttling IO is different than maximizing performance through intelligent architecture. Throttling is just a limitation, and can be useful in preventing an application from hogging too much storage resource with other applications that share the storage array. Pillar’s QoS applies business priorities to LUNS and Filesystems to give applications the attention deserved from a business perspective, in the event of contention for resource. Even better, Pillar’s QoS optimizes all system resources to fit the application; cache, disk striping, queuing, network resources, and layout of data on the disk.
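Rules 1 through 7 boil down to simple multiplication with a ceiling on it. The sketch below is only that arithmetic: the per-drive figures come from the rules of thumb above, while the controller ceiling is a hypothetical number picked for illustration. The shelf figures quoted in the list are rounded, so the straight multiplication lands in the same neighborhood rather than on the same digits.

```python
SATA_7200_IOPS = (100, 125)   # per-drive range from rule 2
FC_15K_MULTIPLIER = 2.5       # rule 3: 15K RPM relative to 7200 RPM SATA


def spindle_iops(spindles: int, per_drive_iops: float) -> float:
    """First-order IOPS: spindle count times per-drive IOPS (cache ignored)."""
    return spindles * per_drive_iops


def delivered_iops(spindles: int, per_drive_iops: float,
                   controller_limit_iops: float) -> float:
    """IOPS actually delivered once the controller ceiling is applied."""
    return min(spindle_iops(spindles, per_drive_iops), controller_limit_iops)


low, high = SATA_7200_IOPS
print("12-drive SATA shelf:", spindle_iops(12, low), "to", spindle_iops(12, high))
print("12-drive 15K shelf, low end:", spindle_iops(12, low * FC_15K_MULTIPLIER))

# Hypothetical controller pair that tops out at 50,000 IOPS (an assumption).
print("1,000 SATA spindles behind it:",
      delivered_iops(1000, low, controller_limit_iops=50_000))
```

Whatever the real ceiling is for a given controller, the shape of the result is the same: past that point, extra spindles add capacity but not performance.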
There is a lot more to say about performance, but the above list of 10 items pertains to almost any storage purchase, not just those from Pillar. There is too much BS out there - some of the claims made by some vendors are just downright ridiculous; what’s worse, they are misleading.
At Oracle OpenWorld a few weeks ago, I was asked more than once if we provide SAS drives. Although I answered politely, another thought came to mind: Why do you care?
The fact is any disk drive can be made to work with any interface. The fundamental performance characteristics of the drive - access time and rotational speed - have nothing to do with the communication interface you use. Interfaces are chosen for their relative cost and physical packaging attributes way more than for fundamental performance.
FC is not meant for a laptop environment, it has drivers and receivers that are meant for much bigger, physically distributed systems. Desktop computers are made in the hundreds of millions, so computer manufacturers notice even a 10 cent difference in price. Simple, point-to-point connections of limited distance suffice for desktop applications.
So why do people ask? Well, I think they aren’t really asking about the interface, they are asking about the type of drive. In other words, I think they want to know if this is a 7200 RPM drive, a 10K RPM drive, or a 15K RPM drive? Is this a high capacity drive or a lower capacity drive? Is this “fast” or not? From the interface type, they draw conclusions which are not unreasonable, about system performance or most likely application target. The mindset has been that SATA drives are high capacity lower performance on a relative scale, so they are great for applications like back-up and virtual tape libraries (VTL).
When we introduced high performance arrays based on SATA, many raised their eyebrows because it sounded oxymoronic. After all, people were using them in storage arrays called “near store,” so how could they be used for serious “real store” systems? The truth is, we get a hell of a lot of performance out of SATA disk drives; we wanted to, that’s all. Our QoS architecture allows the system to overachieve or exceed most people’s expectations for performance on “SATA” disk (read 7200 RPM slower access time).
However, at the end of the day the drive physics can get ya. Regardless of the interface the access time and latency will become visible in certain applications. So in those applications, if you can afford it, you use low latency, fast access time drives. The trick is realizing which applications need what, rather than using the most expensive power hogs for everything.
Storage system designers try to get the most out of the components they use – at Pillar the QoS architecture allows fantastic performance from slower but much more cost effective disk, at half the power. When applications push the system, like all the other vendors, we resort to higher performance drives like FC, and SAS drives or “Server” disk drives. Why are these choices of last resort? Because they are more expensive and burn more power, that’s all.
So what about SAS then? Well, SAS (Serial Attached SCSI) is associated with “Server” drive attributes, and is a better alternative than FC for certain applications, mostly due to the electromechanical packaging requirements of the enclosures the drives are packaged in. Specifically, small, 2.5 inch form factor higher performance drives are being packaged into an enclosure that provides good density and response time characteristics for storage subsystems. SAS isn't what people are asking about – they are asking if you are offering storage bricks (Pillar’s nomenclature) that have more actuators per TB with faster response time.
Oh… and the short answer is, yes. We have SATA, FC drives and SAS is in our plans along with the rest of the industry.
At Oracle OpenWorld last week, a customer asked me about storage benchmarks. I told him to approach performance benchmarks with a high degree of skepticism… especially if they were published by someone trying to sell him something!
The fact is benchmarks are often not good representations of actual workloads or environments. That said, good benchmarks don’t have to necessarily represent real-world applications in order to be useful. EPA estimates for automobile mileage get criticized for this, but the estimates are useful comparisons for two vehicles even though no one drives the way the tests are defined, except my mother-in-law. Good benchmarks, like the EPA performs, have extreme controls and allow no vendor “tuning”.
And what about SPC benchmarks? Well, they are very tunable, and most configurations are downright ludicrous for most of the market. See for example this config - 1536 Spindles for 24TB… for Pete’s sake!…only $29/GB of user capacity, and a whopping $58/GB for the storage used in the benchmark!
Yeah, right. Is this a typical real-world config? Nope. Not even close. My advice to anyone taking these benchmarks too seriously is this: Don’t.
I think it is important to point out that SPC benchmark engineers work hard to build representative workloads; workloads are engineered and are often close enough to reality that they represent meaningful tests. But representative workloads are not the problem. The problem is more centered on extreme configuration of solutions as in the $3M+ case cited above.
Most vendors who participate in these types of benchmarks put together corner-case configurations to win marketing bragging rights. Enough said.
It is interesting to read Dave Hitz’ Blog on this subject. The shape of the IO response curve is indeed critical, and Dave points this out by referring to the IO rate when the response time is 1 ms. The simplest characterization of response time versus IO rate curves is two numbers: 1) The minimum response time (Y-Axis intercept) and 2) The maximum IO rate (ridiculously large response time). I find it amusing that Dave cannot even use this characterization of the response time curve – he has to summarize it as one number. His point is valid, but for crying out loud are we all so pressed for time that we cannot look at the curve, two points, or more data than one number?
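To show how much more two numbers tell you than one, here is a sketch that draws a whole curve from the two-number characterization. The curve shape is an assumption borrowed from simple queueing models (response time climbing as R0 / (1 - X/Xmax)); real arrays will not follow it exactly, and the sample values are hypothetical.

```python
def response_time_ms(io_rate: float, min_response_ms: float,
                     max_io_rate: float) -> float:
    """Assumed curve: R(X) = R0 / (1 - X / Xmax), a simple queueing shape."""
    if not 0 <= io_rate < max_io_rate:
        raise ValueError("io_rate must be in [0, max_io_rate)")
    return min_response_ms / (1 - io_rate / max_io_rate)


def io_rate_at(response_ms: float, min_response_ms: float,
               max_io_rate: float) -> float:
    """Invert the curve: the IO rate at which response time reaches response_ms."""
    if response_ms < min_response_ms:
        raise ValueError("response time cannot be below the minimum")
    return max_io_rate * (1 - min_response_ms / response_ms)


# Hypothetical system: 0.5 ms minimum response time, 10,000 IOPS maximum.
R0, XMAX = 0.5, 10_000
for x in (0, 2_500, 5_000, 7_500, 9_000):
    print(f"{x:>6} IOPS -> {response_time_ms(x, R0, XMAX):.2f} ms")
print("IO rate at 1 ms:", io_rate_at(1.0, R0, XMAX))
```

Two systems can report the same IO rate at 1 ms and still have very different intercepts and ceilings, which is why the single number hides so much.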
One respondent to Dave made a critical observation: How many spindles? How many filesystems? These are major influencers on the result! The hilarity of this is that the whole give and take on the subject reflects the problem of “benchmarks”: they aren’t well enough controlled.
In general, people don’t want to read and understand lots of data or analysis, they want a simple number. When you try and boil the performance of a complex system down to a simple number – the results are extremely easy to manipulate, and often misleading.
So what should you do in trying to assess performance? The best way is to do proof-of-concept tests, or “bake-offs”. They are useful if done in your shop by your own folks, trying to get the most out of all the systems tested as if you owned them.
I think a key question for any benchmark is this: Who ran it, for what purpose? Weight your interpretation of the results accordingly.
Unfortunately, disk drives are not solid-state devices like RAM or microprocessors. Disk drives are more like blenders, or doorbells. Disks spin, and the read-write heads move, because data is written in concentric circles called tracks. There are about 25000 tracks in a disk drive of current vintage. The distance the read-write heads move to get from the track closest to the outside of the disk to the one closest to the center of the disk is often referred to as the “stroke length”.
To read or write data stored on a track, the disk drives have to move the heads from the track they are on, to the track to be accessed. Easy enough, but it is slow; it takes milliseconds, during which an Opteron or Xeon could execute a million instructions.
In fact it is so slow, that a lot of Storage system engineering, and database engineering is put into trying to hide the relative slowness of disk. One way to cut the time to access data on a disk is to cut the distance the heads need to travel to get from track to track (access time), and another is to cut the distance the heads need to seek over – stroke length.
The head move time goes as the square root of the distance traveled, so you can cut the move time by a factor of two if you cut the stroke (distance) down by a factor of four. This is called “short stroking”. [I hope you appreciate my discretion in the title of this blog ☺]. To double the speed, you throw away 75% of your storage space. In the old days, people made storage “drums,” where fixed heads just floated over rotating magnetic cylinders, then later over a disk surface. Why? No move time!
Anyway, the problem with short stroking is that you end up throwing away capacity.
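The square-root relationship is easy to tabulate. This sketch follows the simplification used above, that capacity scales directly with the stroke you keep; it ignores rotational latency and the fact that outer tracks hold more data than inner ones.

```python
import math


def relative_move_time(stroke_fraction: float) -> float:
    """Head move time relative to full stroke, using the square-root rule."""
    if not 0 < stroke_fraction <= 1:
        raise ValueError("stroke_fraction must be in (0, 1]")
    return math.sqrt(stroke_fraction)


for fraction in (1.0, 0.5, 0.25, 0.10):
    move = relative_move_time(fraction)
    print(f"keep {fraction:>4.0%} of the stroke: move time x{move:.2f}, "
          f"capacity thrown away {1 - fraction:.0%}")
```

Keeping a quarter of the stroke halves the move time, and each further halving costs you another factor of four in capacity.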
Being greedy, I always wondered why we couldn’t pretend to throw away capacity to make it appear we had achieved faster access time, but sneak in, every now and then, to the “thrown away part” and read/write stuff. Well, that’s the basis for Pillar’s QoS Operating System. Essentially, we partition the disk off so important, high priority, fast-accessed data is stored in the short stroke regions and less important stuff is stored in the rest of the space. We call this “statistical short-stroking”.
Many systems can specify LBA ranges (which translate to track or data band ranges) where data is to reside. With Pillar’s Axiom QoS, you don’t have to tell the system where to put the data. Who does then? Well, not you Ms. or Mr. Database Admin. Do you really want to assign LBA ranges to provision 100 LUNS or Filesystems? Probably not unless you are trying to get out of going to your neighbor’s kid’s piano recital again. With Axiom, the system asks you the application type, a few other parameters like read, write, or mixed bias, redundancy level, and relative business priority to lay the data out for you. It stripes it, determines RAID level, write performance level, and business priority.
The trick to statistical short stroking is that you have queues that make sure the disk drive heads stay over the highest priority bands when they need to be, most of the time. Since Axiom has the business priority and queue managers for each file system and LUN, there is no reason why we cannot also assign cache, network bandwidth, CPU utilization all aligned with business priority and application characteristics as well. In fact, this is exactly what Axiom does.
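To make the queueing idea concrete, here is a toy scheduler. It is emphatically not Pillar's implementation: the class, its `burst` parameter, and the two-queue policy are all made up for illustration. It keeps serving requests for the high-priority band and lets one low-priority request sneak in after a fixed number of high-priority ones, or whenever the high-priority queue is empty.

```python
from collections import deque


class BandScheduler:
    """Toy two-queue scheduler (illustrative only, not Axiom's algorithm)."""

    def __init__(self, burst: int = 9):
        self.high = deque()      # requests for the short-stroke band
        self.low = deque()       # requests for the rest of the disk
        self.burst = burst       # high-priority requests served per low one
        self._served_high = 0

    def submit(self, request, high_priority: bool) -> None:
        (self.high if high_priority else self.low).append(request)

    def next_request(self):
        """Return the next request to send to the drive, or None if idle."""
        take_low = bool(self.low) and (
            not self.high or self._served_high >= self.burst)
        if take_low:
            self._served_high = 0
            return self.low.popleft()
        if self.high:
            self._served_high += 1
            return self.high.popleft()
        return None


sched = BandScheduler(burst=9)
for i in range(20):
    sched.submit(f"H{i}", high_priority=True)
for i in range(2):
    sched.submit(f"L{i}", high_priority=False)

order = [sched.next_request() for _ in range(22)]
print(order)
```

The heads spend most of their time over the high-priority band, and the rest of the disk still gets serviced. A real system would also have to weigh seek distance, cache, and per-LUN priorities, none of which this sketch attempts.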
If you have a “Write anywhere” anything in your NAS system – it is a little difficult to plop “Write here” technology into your product. Write anywhere is clever, and had its day, but it is the worst thing you can do from a data layout point of view. Fragmentation, non-contiguous data layout is not a good thing for performance. For a great SAN product, the last thing you would do is “Write anywhere”, it is inefficient as hell.
While underutilized NAS can survive some fragmentation and meet NAS performance norms, SAN cannot. Hence making a SAN product out of an underlying NAS structure is not a great idea. In fact doing so is akin to having a hammer and convincing yourself that everything is a nail.
But, a good SAN product can be interfaced to a volume manager, a file system, and a protocol stack to make a NAS system, so this is exactly what we built at Pillar. Why not? As long as the SAN doesn’t go through the NAS for data or have NAS determine its data structure and layout, all is good. This is the Pillar Axiom and what in my mind is meant by converged storage. Stacking stuff in the same rack is not “convergence”, nor is driving a screw in with a hammer.
Aren’t you proud of me for resisting the temptation to make up something really goofy for the title of this?
People say getting techies to agree on anything is a bit like pulling the molars from a Rottweiler. Well, I’m a geek and I disagree with that. Here are 16 ways…
1. A Switched Fabric is better than Loop topology for all but the most trivial applications (Disk Drive attachment).
2. Off the shelf parts and standards are less costly to build a system around than proprietary ones.
3. Modular is better, more flexible, than monolithic.
4. Hardware-assisted RAID usually out-performs Software RAID, especially in re-build.
5. SAN and NAS are both block based storage at some point.
6. FC, Ethernet attached storage both have their place (File or Block).
7. To share storage, a SAN is optimal for some applications, NAS for others, both often fit in the same shop.
8. A storage pool is much more flexible than dedicated hardware for each LUN or file system – at least virtualize your own if you can’t virtualize everyone else’s stuff!
9. Cache memory is a weapon in the war on performance, and it isn’t too expensive.
10. Quality of Service (QoS) concepts in Networking are valuable.
11. Quality of Service concepts applied to Storage can be very useful in many applications (Pillar QoS).
12. All LUNS and File systems probably do not have equal business priority.
13. “Ease of use” is very hard to do, but necessary as systems get more complex.
14. Single points of failure (SPOF) are not allowed in Storage Subsystems; highly available (HA) systems are.
15. Reliability, Availability, and Serviceability (RAS) are becoming more and more important; Uptime is imperative.
16. Maintenance can be made easy if you want it to be, but it is a lot of work.
17. The iPhone is way over rated.
I said 16, not 17.
Hey. This whole thing was a set up. We started Pillar knowing full well technology pros would agree with this list of Pillar premises. Think there’s a reason to go with a more expensive, or a less capable solution from one of our competitors? I disagree.
Jonathan Schwartz at Sun shocked no one a few days ago by announcing that Sun has decided to do a reorganization. Well, stop the presses!
It turns out that Sun is going to fold the Storage group into the Server group because it realizes that what they sell is a “System”. Time to re-think this whole business, and as Mario Apicella wrote, watch out EMC, NetApp, and all of us littler guys like Pillar. The times they are a changin’.
Yes this is the kind of rock-your-world announcement that can have you staring at the bottom of a bottle, wondering how the rest of us missed such a critical observation and organizational structure. Well, the rest of us except IBM, who has done this at least three times in the last 15 years. And every time IBM did it, they subsequently un-did it.
Why? Because a company’s organizational structure depends on how they view their business rather than technology or the needs of their Customers. Organizations aren’t products; they should be immaterial to the Customer.
Perhaps Steve Duplessie said it best when he said, “Whoopdie Doo” in his normal eloquent fashion.
Server people look at storage as “clothing” for Servers. Storage people look at storage as, well, storage. The storage industry is quite large, and storage companies and storage divisions of larger companies look at the total available market (TAM) as their customer base, not just the part of the TAM that has their company’s Servers, or Switches, or whatever in it.
So in the end, if your company’s attach rate is say, 30%, meaning that 30% of your servers are sold with your own storage instead of somebody else’s, you could argue this two ways: 1) Build a better storage subsystem to improve your attach rate, or 2) Sell your storage product on everyone else’s servers in addition to your own since that is a lot larger opportunity anyway.
Both of these arguments are reasonable, hence the shuffling around of these groups inside companies like IBM and Sun. The truth is, the shuffling has more to do with internal politics, Sales force structure, and business growth targets than it does with some technological shifts or customer requirements.
Perhaps my annoyance is too obvious, but for crying out loud it seems like the internal machinations of our companies are not really relevant to our Customers; it is goods and services that matter. Who reports to whom should not matter: If it does we have gone nowhere because all storage manufacturers were part of their respective “server” (read computer mainframe) groups 50 years ago.
Mike