Sunday, October 16, 2011

ZFS: A Multi-Year Case Study in Moving From Desktop Mirroring (Part 2)



Abstract:
ZFS was created by Sun Microsystems to modernize the storage subsystem of computing systems by greatly expanding capacity and data integrity while collapsing the formerly separate layers of storage (volume manager, file system, RAID, etc.) into a single layer, delivering capabilities that would otherwise be very complex to achieve. One such innovation introduced in ZFS was the ability to insert inexpensive, limited-life solid state storage (flash media), which can offer fast (or at least more deterministic) random read and write access, into the storage hierarchy at a point where it can enhance the performance of less deterministic rotating media. This paper discusses the use of various configurations of inexpensive flash to enhance the write performance of high-capacity, low-cost mirrored external media under ZFS.

Case Study:
A particular Media Design House had formerly met its storage requirements with multiple pairs of mirrored external drives on desktops, plus racks of archived optical media. A pair of (formerly high-end) 400 Gigabyte Firewire drives lost a drive, and an additional pair of (formerly high-end) 500 Gigabyte Firewire drives lost a drive within a month of that. A media wall of CDs and DVDs was becoming cumbersome to maintain.

First Upgrade:
A newer release of Solaris 10 became available with more recent ZFS features. The Media House was pleased to move to Update 8, which offered the possibility of a Level 2 ARC (L2ARC) for increased read performance and a separate ZFS Intent Log (ZIL) for increased write performance.
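
Whether an existing pool can actually use these features depends on its ZFS pool version, which may lag behind the OS release; a quick check and upgrade, sketched below with output omitted, is all that is needed after the update.

# (sketch) list the pool versions this release supports (separate intent log
# devices and cache devices appear in that list), then upgrade the existing pool
Ultra60/root# zpool upgrade -v
Ultra60/root# zpool upgrade zpool2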

The Media House did not see the need to purchase flash for read or write logging at this time. The mirrored 1.5 Terabyte SAN performed adequately.


Second Upgrade:
About a year later, the Media House started becoming concerned when 65% of its 1.5 Terabyte SAN storage had been consumed.
Ultra60/root# zpool list

NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.36T   905G   487G    65%  ONLINE  -
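
The same check can be put into a script or cron job so that capacity creep is noticed earlier; a minimal sketch using the pool's capacity property:

# (sketch) report just the percentage used; suitable for a periodic cron check
Ultra60/root# zpool get capacity zpool2
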
The decision to invest in an additional pair of 2 Terabyte drives for the SAN was an easy one. External Seagate Expansion drives were selected because of the reliability of the earlier drives and their built-in power management, which reduces power consumption.

Additional storage was purchased for the network, but if there was going to be an upgrade, a major question remained: what kind of commodity flash media would perform best for the investment?


Multiple Flash Sticks or Solid State Disk?

Understanding that commodity flash media normally has high write latency, the question on everyone's mind was: which would perform better, an army of flash sticks or a solid state disk?

This simple question started what became a testing rat hole: people often ask it, but the responses usually come from anecdotal assumptions. The Media House was interested in a measured answer.

Testing Methodology

It was decided that copying large files between large mirrored drive pairs was the most accurate way to simulate the day-to-day operations of the design house. This is what they do with media files, so this is how the storage should be tested.

The first set of tests focused on the write cache (ZIL) in different configurations; a sketch of the per-run procedure appears after the list below.
  • The USB sticks would each use a dedicated 480 Mbit/second USB 2.0 port
  • USB stick mirroring would occur across 2 different PCI buses
  • 4x consumer grade 8 Gigabyte USB sticks from MicroCenter were procured
  • Approximately 900 Gigabytes of data would be copied during each test run
  • The same source mirror was used: the 1.5TB mirror
  • The same destination mirror would be used: the 2TB mirror
  • The same Ultra60 Creator 3D with dual 450MHz processors would be used
  • The SAN platform was maxed out at 2 GB of ECC RAM
  • The destination drives would be destroyed and re-mirrored between tests
  • Solaris 10 Update 8 would be used
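
Each run therefore took the same shape, with only the log configuration varying; a minimal sketch of one iteration follows.

# (sketch) shape of a single test run; the "log ..." clause varies per scenario
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0 [log ...]
Ultra60/root# cd /u002 ; time cp -r . /u003     # copy the ~905 GB of data from the 1.5TB mirror
Ultra60/root# time zpool destroy zpool3         # tear down before the next configuration
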
The Base System
# Check patch release
Ultra60/root# uname -a
SunOS Ultra60 5.10 Generic_141444-09 sun4u sparc sun4u


# check OS release
Ultra60/root# cat /etc/release
Solaris 10 10/09 s10s_u8wos_08a SPARC
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 16 September 2009


# check memory size
Ultra60/root# prtconf | grep Memory
Memory size: 2048 Megabytes


# status of zpool, show devices
Ultra60/root# zpool status zpool2
  pool: zpool2
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        zpool2        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0

errors: No known data errors
The Base Test: No Write Cache

A baseline needed to be established against which each additional run could be compared. This base test was a straight create and copy.

ZFS creates mirrored pools tremendously fast: a 2TB mirrored pool takes only 4 seconds on an old dual 450MHz UltraSPARC II.
# Create mirrored pool of 2x 2.0TB drives
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0

real 0m4.09s
user 0m0.74s
sys 0m0.75s
The source and destination pools, and the amount of data to be copied, are easily listed.
# show source and destination zpools
Ultra60/root# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.36T   905G   487G    65%  ONLINE  -
zpool3  1.81T  85.5K  1.81T     0%  ONLINE  -

The copy of over 900 GB between the mirrored external drive pairs takes about 41 hours.
# perform copy of 905GBytes of data from old source to new destination zpool
Ultra60/root# cd /u002 ; time cp -r . /u003
real 41h6m14.98s
user 0m47.54s
sys 5h36m59.29s
The time to destroy the 2 TB mirrored pool holding 900GB of data was about 2 seconds.
# erase and unmount new destination zpool
Ultra60/root# time zpool destroy zpool3
real 0m2.19s
user 0m0.02s
sys 0m0.14s
Another Base Test: Quad Mirrored Write Cache

The ZFS Intent Log can be split off from the main pool onto separate, higher-throughput media in order to speed up writes. Because this write log holds data that has not yet been committed to the main pool, it is extremely important that the media be redundant - losing the log device can result in a corrupt pool and loss of data.
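
For reference, a separate log does not have to be specified at pool creation time; it can also be attached to an existing pool, as sketched below with placeholder device names. Note that on this Solaris release a log device generally cannot be removed again once added, which is one reason the test pools below are simply destroyed and recreated between runs.

# (sketch) attach a mirrored log to an existing pool; device names are examples only
Ultra60/root# zpool add zpool3 log mirror c1t0d0s0 c2t0d0s0
Ultra60/root# zpool status zpool3     # the mirror should now appear under "logs"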

The first test was to create a quad-mirrored write cache. With only 2 GB of RAM, the four 8 GB sticks would never have more than a fraction of their flash in use, but the hope was that using such a small portion of each stick would allow the commodity flash to perform well.

The 4x 8GB sticks were inserted into the system, discovered, and formatted (see this article for additional USB stick handling under Solaris 10); the system was then ready to use them in a new destination pool.
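
The referenced article covers the details, but the pre-flight checks are roughly the following (a sketch; controller and target numbers will differ per system):

# (sketch) confirm the sticks enumerate before building the pool
Ultra60/root# rmformat -l        # list removable/USB media visible to the system
Ultra60/root# format -e          # label each stick; slice 0 (e.g. c1t0d0s0) is used for the log
Ultra60/root# zpool status       # verify none of the sticks is already in use by a pool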

Creating the 2TB mirrored pool with a 4-way mirrored ZFS Intent Log took longer: 20 seconds.
# Create mirrored pool with 4x 8GB USB sticks for ZIL for highest reliability
Ultra60/root# time zpool create -m /u003 zpool3 \
mirror c8t0d0 c9t0d0 \
log mirror c1t0d0s0 c2t0d0s0 c6t0d0s0 c7t0d0s0
real 0m20.01s
user 0m0.77s
sys 0m1.36s
The status output clearly shows the new zpool's intent log as a 4-way mirror.
# status of zpool, show devices
Ultra60/root# zpool status zpool3
  pool: zpool3
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        zpool3        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
            c9t0d0    ONLINE       0     0     0
        logs
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
            c6t0d0s0  ONLINE       0     0     0
            c7t0d0s0  ONLINE       0     0     0

errors: No known data errors
No copy was done using the quad mirrored USB ZIL, because this level of redundancy was not needed.

Destroying the 2TB mirrored zpool with its 4-way mirrored ZIL still took only about 2 seconds.

# destroy zpool3 to create without mirror for highest throughput
Ultra60/root# time zpool destroy zpool3
real 0m2.19s
user 0m0.02s
sys 0m0.14s
The intention of this setup was simply to see whether it was possible, to ensure the USB sticks were functioning, and to determine whether adding an unreasonably redundant ZIL created any odd performance behavior. If this configuration is acceptable, nearly every other realistic scenario should be fine.

Scenario One: 4x Striped USB Stick ZIL

The first scenario to test was the 4-way striped USB stick ZFS Intent Log. With 4 USB sticks, 2 on each PCI bus, and each stick on a dedicated USB 2.0 port, this should offer the greatest throughput from the commodity flash sticks, but no protection against a failed stick.
# Create zpool without mirror to round-robin USB sticks for highest throughput (dangerous)
Ultra60/root# time zpool create -m /u003 zpool3 \
mirror c8t0d0 c9t0d0 \
log c1t0d0s0 c2t0d0s0 c6t0d0s0 c7t0d0s0
real 0m19.17s
user 0m0.76s
sys 0m1.37s

# list zpools
Ultra60/root# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.36T   905G   487G    65%  ONLINE  -
zpool3  1.81T    87K  1.81T     0%  ONLINE  -


# show status of zpool including devices
Ultra60/root# zpool status zpool3
  pool: zpool3
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        zpool3        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
            c9t0d0    ONLINE       0     0     0
        logs
          c1t0d0s0    ONLINE       0     0     0
          c2t0d0s0    ONLINE       0     0     0
          c6t0d0s0    ONLINE       0     0     0
          c7t0d0s0    ONLINE       0     0     0

errors: No known data errors


# start copy of 905GB of data from mirrored 1.5TB to 2.0TB
Ultra60/root# cd /u002 ; time cp -r . /u003
real 37h12m43.54s
user 0m49.27s
sys 5h30m53.29s

# destroy it again for new test
Ultra60/root# time zpool destroy zpool3
real 0m2.77s
user 0m0.02s
sys 0m0.56s
The zpool creation took 19 seconds and the destroy almost 3 seconds, but the copy time decreased from 41 to 37 hours, about a 10% savings... with no log redundancy.
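
One way to see how much log traffic the sticks are actually absorbing during a run (not captured in the timings above) is to watch per-device activity while the copy is in flight; a minimal sketch:

# (sketch) per-vdev activity on the destination pool, sampled every 10 seconds
Ultra60/root# zpool iostat -v zpool3 10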

Scenario Two: 2x Mirrored USB ZIL on 2TB Mirrored Pool

The 4-way striped ZIL offered a 10% boost but no redundancy. What if the same four sticks were arranged as two mirrored pairs, striping writes across the pairs for speed while mirroring within each pair for redundancy?
# create zpool3 with pair of mirrored intent USB intent logs
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0 \
log mirror c1t0d0s0 c2t0d0s0 mirror c6t0d0s0 c7t0d0s0
real 0m19.20s
user 0m0.79s
sys 0m1.34s

# view new pool with pair of mirrored intent logs
Ultra60/root# zpool status zpool3
  pool: zpool3
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        zpool3        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
            c9t0d0    ONLINE       0     0     0
        logs
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c6t0d0s0  ONLINE       0     0     0
            c7t0d0s0  ONLINE       0     0     0

errors: No known data errors

# run capacity test
Ultra60/root# cd /u002 ; time cp -r . /u003
real 37h9m52.78s
user 0m48.88s
sys 5h31m30.28s


# destroy it again for new test
Ultra60/root# time zpool destroy zpool3
real 0m21.99s
user 0m0.02s
sys 0m0.31s
The results were almost identical: a 10% improvement in speed was again measured. Splitting the commodity 8GB USB sticks into mirrored pairs offered redundancy without sacrificing performance.

If 4 USB sticks are to be purchased for a ZIL, don't bother striping all 4; split them into mirrored pairs and still get the 10% boost in speed.


Scenario Three: OCZ Vertex Solid State Disk

Purchasing 4 USB sticks for a ZIL starts to approach the purchase price of a fast SATA SSD. On this UltraSPARC II platform, however, SATA driver support is lacking, so a directly attached SSD is not necessarily an option.

The decision was made to test the SSD through a USB-to-SATA conversion kit and run a single SSD ZIL.
# new flash disk, format
Ultra60/root# format -e
Searching for disks...done
AVAILABLE DISK SELECTIONS:
...
2. c1t0d0
/pci@1f,2000/usb@1,2/storage@4/disk@0,0
...


# create zpool3 with the USB-to-SATA flash disk as the intent log
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0 log c1t0d0
real 0m5.07s
user 0m0.74s
sys 0m1.15s
# show zpool3 with intent log
Ultra60/root# zpool status zpool3
  pool: zpool3
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zpool3      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0
        logs
          c1t0d0    ONLINE       0     0     0

# run capacity test
Ultra60/root# cd /u002 ; time cp -r . /u003
real 32h57m40.99s
user 0m52.04s
sys 5h43m13.31s
The single SSD over a SATA to USB interface provided a 20% boost in throughput.

In Conclusion

Using commodity parts, the write performance of a ZFS SAN can be boosted by about 10% with USB sticks and by about 20% with an SSD. The SSD is the more reliable device and the better choice for a ZIL.
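
For reference, the measured copy times recap as follows (improvements are approximate, computed from the elapsed times above):

Configuration                       Copy time     Improvement
No separate ZIL (baseline)          41h06m        -
4x striped USB stick ZIL            37h13m        ~10%
2x mirrored-pair USB stick ZIL      37h10m        ~10%
Single SSD ZIL (USB-to-SATA)        32h58m        ~20%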
