06 July, 2009

LVM Installation (Partition Alignment)

We are now ready to delve into the details and start the procedure again with the goal of performing various
optimizations.

Things to consider when doing lvm on top of raid:
- stripe vs. extent alignment
- stride vs. stripe vs. extent size for ext3 filesystems (or sunit swidth in the case of xfs filesystems)
- filesystem's awareness that there's also a raid layer below
- lvm's readahead
In the discussion that follows I will detail the various topics above and how I addressed them.

Step 0: Boot from the server or the desktop CD in rescue (or live) mode to execute these commands.

Step 1: Create the array
One of the choices that has to be made is the stripe (chunk) size for the raid5 array.
I found much, oftentimes conflicting, information on the web about how to choose an appropriate stripe size; the
discussion here gives satisfactory explanations. There are also a number of benchmarks that help in understanding
the effect of the various parameters. Based on the benchmarks here and the discussion in the previous link, I created
the array using a stripe size of 256kB.

To create the arrays:
mdadm --create /dev/md0 --chunk=256 --level=raid5 --raid-devices=5 /dev/sd[a-e]2
and
mdadm --create /dev/md1 --level=raid1 --raid-devices=2 /dev/sd[a-b]1

To delete the arrays: (Warning: This can and probably will destroy your data)
mdadm --stop /dev/md0
mdadm --zero-superblock /dev/sd[a-e]2

After 2-3 hours the building of the array is complete, as can be verified by
cat /proc/mdstat
and
mdadm --detail /dev/md0
mdadm --detail /dev/md1

For md0 the layout is left-symmetric:
(Figure: Left-Symmetric Layout)
A quick benchmark with
hdparm -tT /dev/md0
shows an uncached read speed of ~434MB/sec, which is to be expected based on the stripe size
and the per-disk performance of the hardware used. Compared to the previous chunk size (64kB, see my previous post), an
increase in performance is obtained, as expected given the increase in stripe size.

Step 2: LVM and its alignment
A search on the web turns up a long discussion about the alignment of the various layers. This is especially important for RAID 5 installations, since a misalignment incurs a performance hit, especially during write operations. Many people, however, are unclear about the exact procedure.

In my long-winded search, I found a number of interesting discussions. These can be found in the following links:

Link 1:   A discussion on alignment for SSDs. Although the topic is only somewhat related, the discussion is extremely clear and all the salient points are addressed. This post helped me understand the various problems.

The main discussion by Ted Ts'o covers alignment at the sector level. Basically, the idea is to change the hard disk geometry in such a way that each cylinder is aligned with a certain basic (stripe) size. Depending on the application, alignment can happen on 4KiB boundaries (for next-gen hard drives) or 128KiB boundaries (for SSDs; erase block boundaries in his case). His explanations are very clear, so I will simply point to his discussion. One thing that needs special care is the partition table: for MS-DOS compatibility the first partition starts on track 1. To have proper partition alignment one has to move the start of the partition so that it is aligned correctly. This can be done with fdisk in expert mode (see the fdisk example later in this post). In our case we do not need alignment at the disk level (this is not true for hardware raid, or if we create a partition table on the md array; in that case read this for a discussion), but if we did we would have to move the partitions manually.

Link 2: The impact of misalignment can be significant, as the link illustrates. As the discussion on this link also illustrates, it can have an impact of 30% or more on performance. The impact can be expected to be greater as the stripe size becomes smaller. This makes sense: with misalignment the read-modify-write operation crosses stripe boundaries more often, so we pay a higher premium in terms of performance.

Note: I will be using the same partition scheme described in a previous post. That is, at the hardware level each of the hard drives has two partitions: a small one (~100MB, to be used for the boot partition) and a larger one to be used for the raid5 array. Note that on each of the drives we do not care about alignment, and there is a partition table on the first track (although we could easily take this into account). We need alignment once the array is created. In this case, given the stripe (chunk) size of 256kB, the basic "quantum" size for alignment is 256K. When overlaying the LVM on top of the md array we have problems similar to the ones described before, in the sense that the LVM extents should be aligned with the md stripe...

Note for HW Raids:
In the software case we need a RAID 1 partition to boot from, since the bootloaders do not fully support booting from RAID 5. In the HW case this is not a problem: the system sees the array as a single drive (since the controller takes care of that). The best approach in this case is not to create a partition table at all and to overlay the LVM system directly, then use LVM to do the partitioning.
Link 3: An extremely comprehensive benchmark and comparison between hard and soft raid. Essentially md is compared to a 3ware card in RAID 5 and RAID 10 configurations. Along the way much interesting information is presented. A definite must-read, as it contains a lot of information...

To create the LVM we have two options:
1) Create a partition on /dev/md0 and label it for LVM (typical of a HW raid), or
2) Create the LVM without a partition table.
In the first case we would need alignment for the partition (since we are doing this on top of the md layer; see the example below), and then alignment for the metadata, whereas in the second case alignment is only required for the metadata (see what follows). The LVM tools from version 2.0.40 onwards (which unfortunately as of this time is not yet integrated with ubuntu) can get information from a software array and take care of alignment issues automagically. For HW raids, or in our case (since we have version 2.0.39), we will do it manually.
Link 4:  Interesting discussion on alignment for Windows OSes.
Link 5:  Linux-raid mailing list: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
Link 6:  LVM tools confuse Megabytes with Mebibytes. Overall a very detailed and interesting article...

Relevant/Interesting HOWTOs:
HOWTO: Software Raid
HOWTO: Multi Disk System Tuning
HOWTO: LVM

Disk partition adjustment for Linux systems
In Linux, align the partition table before data is written to the LUN, as the partition map will be rewritten and all data on the LUN destroyed. In the following example, the LUN is mapped to /dev/emcpowerah, and the LUN stripe element size is 128 blocks. Arguments for the fdisk utility are as follows:
fdisk    /dev/emcpowerah
x      # expert mode
b      # adjust starting block number
1      # choose partition 1
xxx #    set it to an appropriate size for the alignment, our stripe element size
w      # write the new partition
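
The starting block entered in the fdisk session above has to be a multiple of the stripe element size. As a rough sketch (aligned_start is a hypothetical helper name, and the minimum sector of 63 assumes the classic DOS layout where the first track is reserved), the first usable aligned sector can be computed like this:

```shell
#!/bin/sh
# aligned_start MIN_SECTOR STRIPE_SECTORS
# Prints the first sector >= MIN_SECTOR that is a multiple of STRIPE_SECTORS.
# Both values are in 512-byte sectors. (Hypothetical helper, not part of fdisk.)
aligned_start() {
    echo $(( ( ($1 + $2 - 1) / $2 ) * $2 ))
}

aligned_start 63 128   # stripe element of 128 blocks -> prints 128
aligned_start 63 512   # 256KiB chunk = 512 sectors   -> prints 512
```

The printed value is what would replace the xxx placeholder in the session above.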

Steps to setup LVM:

1) First create a test filesystem using the defaults

mkfs.xfs /dev/md0
and record the various parameters (they will be needed later).

Filesystem parameters by default on /dev/md0
meta-data=/dev/md0       isize=256     agcount=32, agsize=15258240 blks
         =               sectsz=4096   attr=2
data     =               bsize=4096    blocks=488263424, imaxpct=5
         =               sunit=64      swidth=256 blks
naming   =version 2      bsize=4096    ascii-ci=0
log      =internal log   bsize=4096    blocks=32768, version=2
         =               sectsz=4096   sunit=1 blks, lazy-count=0
realtime =none           extsz=1048576 blocks=0, rtextents=0

This will wipe the first sector of /dev/md0, removing any partition table:
dd if=/dev/zero of=/dev/md0 bs=512 count=1

2) Create physical volume
Normally the LVM metadata takes up 192KiB (we need to allocate a little more so that the first extent is aligned)

pvcreate --metadatasize 250k /dev/md0     (apparently the calculation is 250KiB *1.024=256, what a mess...)

To verify:
pvs -o +pe_start 
(you can also add     --units B)
or
pvdisplay --units b

The second set of commands is used to verify that the first physical extent is aligned with the 256K boundary. One might
wonder why 250K is used: the lvm tools confuse KiB, MiB, GiB with kB, MB, GB. It's a mess, but see Link 6 for an "explanation".
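
The verification can also be scripted. This is only a sketch: pe_start_ok is a hypothetical helper name, and the commented pvs invocation assumes the physical volume is /dev/md0 as in this post. The check itself is plain modulo arithmetic on the pe_start value in bytes:

```shell
#!/bin/sh
# pe_start_ok PE_START_BYTES CHUNK_KIB
# Succeeds when the first physical extent starts on a chunk boundary.
# (Hypothetical helper for illustration.)
pe_start_ok() {
    [ $(( $1 % ($2 * 1024) )) -eq 0 ]
}

# On a live system the value could be fed in from pvs, for example:
#   pe_start_ok "$(pvs --noheadings --nosuffix --units b -o pe_start /dev/md0 | tr -d ' ')" 256

if pe_start_ok 262144 256; then echo "262144: aligned"; fi
if ! pe_start_ok 196608 256; then echo "196608: not aligned"; fi
```

Here 262144 bytes is 256KiB, and 196608 bytes (192KiB) is included only as an example of a start offset that does not sit on a 256KiB boundary.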

3) Create volume group (32MB extent size)
This needs to align on top of the md layer, so the extent size has to be a multiple of 256KiB.

It can be argued that it is beneficial to have it be a multiple of
256KiB*4=1MiB (where 4 = raid devices - 1).

Here we choose it to be 32MiB

vgcreate --physicalextentsize 32M /dev/md0

to verify alignment

vgdisplay --units b
and we get PE size 33554432= 32*(1024)^2

4) Create Logical Volumes

100GiB for /
600GiB for /var
600GiB for /home

In terms of extents this is equal to:
32 extents * 32MiB = 1GiB
100GiB = 32*100 = 3200 extents
600GiB = 32*600 = 19200 extents
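
The same arithmetic as a small shell sketch (gib_to_extents is a hypothetical helper name; the 32MiB extent size is the one chosen in step 3):

```shell
#!/bin/sh
# gib_to_extents SIZE_GIB EXTENT_MIB
# Prints how many physical extents of EXTENT_MIB are needed for SIZE_GIB.
# (Hypothetical helper for illustration.)
gib_to_extents() {
    echo $(( $1 * 1024 / $2 ))
}

gib_to_extents 100 32   # prints 3200
gib_to_extents 600 32   # prints 19200
```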

lvcreate -l 3200 -n root
lvcreate -l 19200 -n home

lvcreate -l 19200 -n var

lvs (to verify that everything is ok)

To activate an lv:
vgchange -a y


Step 3: Create the filesystem


To create the filesystem we need to make sure that we get alignment also at this level. Thankfully the XFS filesystem
can become RAID aware and adapt performance to the presence of soft/hard RAID. The relevant parameters are
the sunit (stripe unit) and swidth (stripe width) parameters.

Explanation of options: from the manpage:
and also using notes from the following links
 tuning the XFS        XFS FAQ     Tweaking XFS Performance

My choices are outlined below
Block Size
-b size : This option specifies the fundamental block size of the filesystem. It cannot be larger than the kernel page size, which is 4096 bytes on x86 (both 32-bit and 64-bit) and can be larger on some other architectures. Normally, a larger block size will result in better performance, but here I kept the default.

-b size=4096

Data Section

-d data_section_options

    agcount:This is used to specify the number of allocation groups. The data section of the filesystem is divided into allocation groups to improve the performance of XFS.
    sunit: This is used to specify the stripe unit for a RAID device or a logical volume. The value has to be specified in 512-byte block units. Use the su sub-option to specify the stripe unit size in bytes.
    swidth: This is used to specify the stripe width for a RAID device or a striped logical volume. The value has to be specified in 512-byte block units. Alternatively, use the sw suboption to specify the width as a multiple of the stripe unit (usually the number of data disks).

 -d agcount=4,su=256k,sw=4

Here for RAID5: width=su*(number of Raid Drives - 1)
for RAID 6, it would be: width=su*(number of Raid Drives -2)
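
To make the relation between the two notations concrete, here is a minimal sketch (xfs_stripe_opts is a hypothetical helper name) that derives su/sw and the equivalent sunit/swidth values from the chunk size, the number of drives and the number of parity drives:

```shell
#!/bin/sh
# xfs_stripe_opts CHUNK_KIB NDISKS PARITY_DISKS
# Prints the su/sw suboptions and the same geometry as sunit/swidth
# in 512-byte units. PARITY_DISKS is 1 for RAID 5 and 2 for RAID 6.
# (Hypothetical helper for illustration.)
xfs_stripe_opts() {
    chunk_kib=$1
    data_disks=$(( $2 - $3 ))
    sunit=$(( chunk_kib * 1024 / 512 ))
    swidth=$(( sunit * data_disks ))
    echo "su=${chunk_kib}k,sw=${data_disks} (sunit=${sunit},swidth=${swidth} in 512-byte units)"
}

xfs_stripe_opts 256 5 1   # prints: su=256k,sw=4 (sunit=512,swidth=2048 in 512-byte units)
```

Note that the mkfs.xfs output reports sunit and swidth in filesystem blocks instead, which is why the default run on /dev/md0 showed sunit=64 and swidth=256 with a 4096-byte block size.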

Force Overwrite (Optional)
-f Force overwrite when an existing filesystem is detected on the device.

Log Section
-l log_section_options
     internal: This is used to specify that the log section is a piece of the data section instead of being another device or logical volume.
     size: This is used to specify the size of the log section.
     version: This specifies the version of the log. The current default is 2, which allows for larger log buffer sizes as well as supporting stripe-aligned log writes (see the sunit and su options, below).
     sunit: This specifies the alignment to be used for log writes. The value has to be specified in 512-byte block units. Note: I do not set it, as it is done automatically once the data sunit is given.
     lazy-count: This changes the method of logging various persistent counters in the superblock. Under metadata intensive workloads, these counters are updated and logged frequently enough that the superblock updates become a serialisation point in the filesystem. The value can be either 0 or 1.

-l internal,size=128m,version=2,lazy-count=1

The remaining options can be left at their default values

mkfs.xfs -b size=4096 -d agcount=4,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev//root
Running it produces a warning that the AG size is a multiple of the stripe width (which can place all the AG headers on the same disk), together with a recommended AG size, so the commands become
mkfs.xfs -b size=4096 -d agsize=6553536b,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev//root
mkfs.xfs -b size=4096 -d agsize=39321536b,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev//home
mkfs.xfs -b size=4096 -d agsize=39321536b,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev//var
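
The agsize values used above can be reproduced with a little arithmetic. This is only a sketch of what I believe mkfs.xfs is doing (suggested_agsize is a hypothetical helper name, and it assumes the logical volume is exactly the stated size): split the volume evenly into agcount groups and, if the result is a multiple of the stripe width, make it one stripe unit smaller.

```shell
#!/bin/sh
# suggested_agsize LV_GIB AGCOUNT BLOCK_BYTES SU_KIB SW
# Prints an AG size in filesystem blocks. If an even split would be a
# multiple of the stripe width, one stripe unit is subtracted.
# (Hypothetical helper; an assumption about the rule, not mkfs.xfs code.)
suggested_agsize() {
    total_blocks=$(( $1 * 1024 * 1024 * 1024 / $3 ))
    su_blocks=$(( $4 * 1024 / $3 ))
    ag=$(( total_blocks / $2 ))
    if [ $(( ag % (su_blocks * $5) )) -eq 0 ]; then
        ag=$(( ag - su_blocks ))
    fi
    echo "$ag"
}

suggested_agsize 100 4 4096 256 4   # prints 6553536
suggested_agsize 600 4 4096 256 4   # prints 39321536
```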

and then to mount (also change these options in the fstab)
nobarrier,logbufs=8,noatime,nodiratime /dev/root
nobarrier,logbufs=8,noatime,nodiratime /dev/var
nobarrier,logbufs=8,noatime,nodiratime /dev/home

Step 4: Final tweaks. Set readahead buffers correctly

There is an issue with the readahead buffers. This is a known problem and is discussed extensively in the following links:

Link 1     Link2

blockdev --getra /dev/md0 /dev//root /dev//var /dev//home

Gives
4096 256 256 256
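
These values are in 512-byte sectors, so the md device reads ahead 2MiB while the logical volumes sit at 128KiB. A tiny conversion sketch (ra_to_kib is a hypothetical helper name):

```shell
#!/bin/sh
# ra_to_kib SECTORS
# Converts a blockdev readahead value (512-byte sectors) to KiB.
# (Hypothetical helper for illustration.)
ra_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

ra_to_kib 4096   # prints 2048 (md0)
ra_to_kib 256    # prints 128  (logical volumes)
```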

To fix this:
blockdev --setra 4096 /dev/md0 /dev//root /dev//var /dev//home

(To make this permanent add an entry to /etc/rc.local)

and then do a bonnie++ benchmark to test that everything works as expected.
bonnie++ -u -f

Note: The benefits of partition alignment will be more pronounced as the chunk size becomes smaller.

Step 5: System Installation

Reboot the system and do the installation as usual (one could do it manually but there are no
significant reasons why one should complicate things more).

Once the basic installation has completed, the system will be restarted and the basic grub prompt will
appear:

find /grub/menu.lst (or find /boot/grub/menu.lst)
root (hd0,0)
setup (hd0)

root (hd1,0)
setup (hd1)

This installs grub on two hard drives. Then restart.

Once the system is setup:

1. Configure the network (edit /etc/network/interfaces)

iface eth0 inet static
address 192.168.1.100
netmask 255.255.255.0
network 192.168.1.0
broadcast 192.168.1.255
gateway 192.168.1.254

Edit /etc/resolv.conf to add the nameservers
search myisp.com
nameserver 192.168.1.254
nameserver 202.54.1.20
nameserver 202.54.1.30

/etc/init.d/networking restart

Test connectivity:
ping www.google.com


2. Update the system

apt-get update;apt-get dist-upgrade

dpkg-reconfigure debconf (to set the level of questions that you want asked, I choose medium)


3. Add swap (if you have not added it before)

Rule of thumb (for system memory of 4GB and higher, swap should be system memory+2GB),
so for me it is 6GB.

lvcreate -l 192 -n swap1
mkswap /dev//swap1
and record the UUID given.
swapon -va (To activate it)
and
cat /proc/swaps
or
free
To verify that it is installed
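
The extent count passed to lvcreate follows from the rule of thumb. As a sketch (swap_extents is a hypothetical helper name; the 4GB of system memory is my reading of the 6GB figure above, and the rule is only stated for 4GB and higher):

```shell
#!/bin/sh
# swap_extents RAM_GIB EXTENT_MIB
# Swap = RAM + 2GiB, expressed as a number of physical extents.
# (Hypothetical helper; only meaningful for RAM_GIB >= 4.)
swap_extents() {
    echo $(( ($1 + 2) * 1024 / $2 ))
}

swap_extents 4 32   # prints 192
```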

4. Edit /etc/fstab  and add there
a. The options for the xfs filesystems (see above)
b. for the swap one line along the lines
UUID=  swap     swap    defaults     0 0

5. Set readahead to a larger value automatically on system boot.

Edit rc.local and add the line
blockdev --setra 4096 /dev/md0 /dev//*

Reboot and we are done!!!!

Other minor topics: defragmenting the filesystem...
1. Info on xfs system
xfs_info /dev/data/test
Check Fragmentation Level:
xfs_db -c frag -r /dev/hdXY
To lower fragmentation level:
xfs_fsr /dev/hdXY

2. Expert mode in server installation

Note: in Ubuntu 9.04 using expert mode seems to create problems during the installation of the base system when mkinitrd is creating the initrd image. There are some workarounds on the internet but it is easier to not use expert mode.

3. Items that need further investigation

a. bonnie++ and bonnie++ -f give different results....

I have no clue why this is the case. The difference is probably due to a bug in bonnie++. Other people around the net have noticed this behavior. While not certain, I can say that compared with other benchmarking software it seems that there is a problem with the -f switch.

After the tweakings above indicative numbers are shown below:

Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
name          8G 81129  93 161448  19 115095  13 86482  96 414836  27 586.2   0
                    ------Sequential Create------ --------Random Create--------  
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--  
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  
                 16  7780  23 +++++ +++  3866   8  8314  19 +++++ +++  3874   9  

and with the option in my rc.local file:

echo 4096 > /sys/block/md0/md/stripe_cache_size

I get the following:

Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
foxtrot          8G 87635  96 180044  21 131761  16 91021  97 401820  25 499.2   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  8988  27 +++++ +++  4239   8 10419  33 +++++ +++  3921   9
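
A larger stripe cache costs memory. According to the kernel md documentation the memory consumed is page size * number of disks * stripe_cache_size, which can be sketched as follows (stripe_cache_mib is a hypothetical helper name; a 4096-byte page size is assumed):

```shell
#!/bin/sh
# stripe_cache_mib STRIPE_CACHE_SIZE NDISKS PAGE_BYTES
# Prints the memory used by the md stripe cache, in MiB.
# (Hypothetical helper for illustration.)
stripe_cache_mib() {
    echo $(( $1 * $2 * $3 / 1024 / 1024 ))
}

stripe_cache_mib 4096 5 4096   # prints 80
```

So the setting used here on the 5-disk array costs about 80MiB of RAM.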


c. Why hdparm -tT gives different numbers on a mounted vs an unmounted filesystem
Based on the discussion here it seems that there is some communication between the filesystem and the block device. This gives slower hdparm results when the filesystem is mounted. 

d. Configure mdadm.conf to send automatic notifications regarding the health of the disk array.

