21 July, 2009

Notes on various small tweaks

Note 1: Console properties
To change the console configuration (e.g. keyboard language switching), run:
dpkg-reconfigure console-setup

For locales, the old "debian" way of selecting them from a menu is no longer used. Rather, one
has to set them up manually (check the various easy guides on the web for more information).

Note 2: Don't Zap
Restarting the X server with Ctrl+Alt+Backspace is now disabled by default (this has changed recently following a
decision by the Xorg maintainers). To restore this functionality, either:
1. Edit xorg.conf and add

Section "ServerFlags"
        Option  "DontZap"       "False"
EndSection

2. or (in Ubuntu) run
dontzap --disable

Both have the same effect, i.e. they modify the xorg.conf file.

Note 3: MDADM
To configure mail notifications I had to enter my e-mail address. The easiest way is to run
dpkg-reconfigure mdadm
and let it run as a daemon.

It is also a wise choice to have it run the consistency check (it can be time-consuming, but it is useful).

Alternatively, you can manually edit the file
/etc/mdadm/mdadm.conf

Now to test it we need to simulate a drive failure. I will simulate a failure on my raid1 (md1) array (since it is easier to rebuild):
mdadm --manage /dev/md1 --set-faulty /dev/sdb1
and then, to see that it failed, try:
mdadm --detail /dev/md1
or
dmesg
or
cat /proc/mdstat

You should now have received an e-mail notifying you of the failure.
Remove the failed drive:
mdadm /dev/md1 -r /dev/sdb1
and re-add it:
mdadm /dev/md1 -a /dev/sdb1
and verify that everything is back to normal.
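
To check the state of the arrays from a script rather than by eye, something along these lines could work. It is only a sketch: the function name degraded_arrays is my own, and it assumes the usual /proc/mdstat layout, where each array has a member map such as [UU] or [U_].

```shell
# List md arrays whose member map (for example [U_]) shows a missing or failed disk.
# Reads mdstat-formatted text from the file given as $1 (default: /proc/mdstat).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array name
        match($0, /\[[U_]+\]/) {             # member map such as [UU] or [U_]
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    ' "${1:-/proc/mdstat}"
}

# Usage: prints nothing when all arrays are healthy.
#   degraded_arrays
```

After the simulated failure above it should print md1, and nothing once the rebuild has finished.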

Note 4: Configure smartmon tools
Verify that the packages mail (or mailx) and smartmontools are installed

1) Edit the file /etc/smartd.conf (see the man page for the available options)

*  Comment out the DEVICESCAN line

* and add the following 
# Run a long self-test on the 13th of each month and a short self-test every Wednesday.
# The hour (last field) is staggered per disk, 01:00 to 06:00, so that the tests do not overlap.
# -a: monitor all SMART attributes (the default set of checks)
# -W 4,47,55: report temperature changes of 4 degrees, log at 47 and warn at 55 degrees Celsius
# -m root: mail warnings to root
/dev/sda -d sat -s (L/../13/./01|S/../../3/01) -a -W 4,47,55 -m root
/dev/sdb -d sat -s (L/../13/./02|S/../../3/02) -a -W 4,47,55 -m root
/dev/sdc -d sat -s (L/../13/./03|S/../../3/03) -a -W 4,47,55 -m root
/dev/sdd -d sat -s (L/../13/./04|S/../../3/04) -a -W 4,47,55 -m root
/dev/sde -d sat -s (L/../13/./05|S/../../3/05) -a -W 4,47,55 -m root
/dev/sdf -d sat -s (L/../13/./06|S/../../3/06) -a -W 4,47,55 -m root
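
Since the lines differ only in the device name and the test hour, they can also be generated. A small sketch (the helper name gen_smartd_lines is made up; the options are simply the ones used above):

```shell
# Print one smartd.conf line per disk, staggering the self-test hour (01, 02, ...)
# so that no two disks run a self-test at the same time.
# Note: with more than 23 disks the hour field would become invalid.
gen_smartd_lines() {
    hour=1
    for disk in "$@"; do
        printf '%s -d sat -s (L/../13/./%02d|S/../../3/%02d) -a -W 4,47,55 -m root\n' \
            "$disk" "$hour" "$hour"
        hour=$((hour + 1))
    done
}

gen_smartd_lines /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
```

The output can be reviewed and then pasted into /etc/smartd.conf.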

2) Edit /etc/default/smartmontools
and uncomment the following line (to start the daemon):
start_smartd=yes

3) Restart the daemon:
/etc/init.d/smartmontools restart

and check /var/log/syslog to confirm that everything works as expected.

Note 5: Receiving various notifications automatically
(Of course this needs something like sendmail installed and configured.)
Edit /etc/aliases

add the line
root: name@mailadd.com

and run newaliases

Note 6: Other rc.local options

Here we can add various optimizations. For example

Note 7: Disable ipv6

TODO

Note 8: Logwatch

TODO

Note 9: prefetch and readahead
TODO


Note 10: APC UPS (Configuration and notifications).
TODO

Note 11: Sensors
TODO

Note 12: Sound Card
TODO

Note 13: Sudoers

Run visudo as root and add the lines:

# User privilege specification
root    ALL=(ALL) ALL
username   ALL=(ALL) ALL



18 July, 2009

Fsarchiver

FSarchiver is a new tool that takes filesystem snapshots (much like Acronis True Image does on Windows). On Ubuntu there is already a rather mature solution, namely partimage.

Partimage has several problems:
  • It does not support multithreaded compression.
  • It has stopped being actively developed, and
  • does not seem to work well with lvm.
FSarchiver seems to be a better option as it resolves these problems. The catch is that it is not yet packaged for Ubuntu, so it has to be compiled from source. Below I describe how to compile it and also how to back up a snapshot of a partition (using the LVM snapshot function).

The website for fsarchiver is here

To have full functionality (i.e. lzma compression support), the xz utils are needed. Download them from
here

Step 1: FSarchiver installation
Make sure you have some of the required packages:
apt-get install zlib1g-dev libssl-dev libbz2-dev liblzo2-dev e2fslibs-dev attr-dev libblkid-dev uuid-dev

Download the source code for fsarchiver and xz utils and untar it
cd
tar xvfz xz-4.999.8beta.tar.gz
tar xvfz fsarchiver-0.5.8.tar.gz

First let's build the xz utils:
cd xz-4.999.8beta/
./configure && make && make check
and verify that all tests pass. Then do:
make install

cd ../fsarchiver-0.5.8
./configure --enable-static
make && make install
cd ../xz-4.999.8beta/
make uninstall
cd ..
rm -rf fsarchiver-0.5.8 xz-4.999.8beta/

You will now have fsarchiver installed (with support for all compression options) in /usr/local/sbin. Because it was linked statically, the xz utils could be uninstalled again afterwards.

Step 2: Creating an LVM snapshot and an image of useful directories (can be used to
restore system in case of failure).
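
I have not written this step up yet, so the following is only a rough sketch of the idea, not a tested procedure. It prints the commands (a dry run) rather than executing them. The names vgname, lvname and backup_name.fsa are placeholders, and the 5G snapshot size is an arbitrary assumption that must be large enough to hold the changes made while the backup runs.

```shell
# Dry run: print the commands for a snapshot-based backup instead of running them.
# Volume group, logical volume, snapshot size and archive name are placeholders.
plan_snapshot_backup() {
    vg=$1 lv=$2 snap_size=$3 archive=$4
    snap="${lv}_snap"
    echo "lvcreate -s -L $snap_size -n $snap /dev/$vg/$lv"   # freeze a point-in-time copy
    echo "fsarchiver savefs -j 4 $archive /dev/$vg/$snap"    # archive the frozen copy
    echo "lvremove -f /dev/$vg/$snap"                        # drop the snapshot afterwards
}

plan_snapshot_backup vgname lvname 5G backup_name.fsa
```

Review the printed commands and run them by hand as root once they look right.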




To restore
fsarchiver restfs -j 4 backup_name.fsa id=0,dest=/dev/vgname/lvname
id=0: selects the filesystem inside the archive (needed in case the archive contains more than one filesystem).
-j 4: use four threads, i.e. all four cores.

To display information regarding the partitions and the current filesystems:
fsarchiver probe simple

To see the details of an archive use:
fsarchiver archinfo backup_file.fsa






06 July, 2009

LVM Installation (Partition Alignment)

We are now ready to delve into the details and start the procedure again with the goal of performing various
optimizations.

Things to consider when doing lvm on top of raid:
- stripe vs. extent alignment
- stride vs. stripe vs. extent size for ext3 filesystems (or sunit swidth in the case of xfs filesystems)
- filesystem's awareness that there is also a raid layer below
- lvm's readahead
In the discussion that follows I will detail the various topics above and how I addressed them.

Step 0: Boot from the server or the desktop CD in rescue (or live) mode to execute these commands.

Step 1: Create the array
One of the choices that has to be made is the stripe (chunk) size for the raid5 array.
I found a lot of information on the web on how to choose an appropriate stripe size, much of it conflicting; the discussion here
gives satisfactory explanations. There are also a number of benchmarks that help in understanding the effect of the various
parameters. Based on the benchmarks here and the discussion in the previous link, I created the array using a
stripe size of 256kB.

To create the arrays:
mdadm --create /dev/md0 --chunk=256 --level=raid5 --raid-devices=5 /dev/sd[a-e]2
and
mdadm --create /dev/md1 --level=raid1 --raid-devices=2 /dev/sd[a-b]1

To delete the arrays: (Warning: This can and probably will destroy your data)
mdadm --stop /dev/md0
mdadm --zero-superblock /dev/sd[a-e]2

After creating the arrays, the initial build takes 2-3 hours. That it is complete can be verified by
cat /proc/mdstat
and
mdadm --detail /dev/md0
mdadm --detail /dev/md1

For md0 the layout is left-symmetric (figure: Left-Symmetric Layout).
A quick benchmark with
hdparm -tT /dev/md0
shows an uncached read speed of ~434MB/sec, which is to be expected based on the stripe size
and the per-disk performance of the hardware used. Compared to the previous chunk size (64kB, see my previous post), an
increase in performance is obtained, as expected given the larger stripe size.

Step 2: LVM and its alignment
A search on the web turns up a long discussion about the alignment of the various layers. This is especially important for RAID 5 installations, since a misalignment will incur a performance hit, especially during write operations. However, many people seem to be unclear about the exact procedure.

In my (long-winded) search, I found a number of interesting discussions. These can be found in the following links:

Link 1:   Is a discussion on alignment for SSDs. Although the topic is only somewhat related the discussion is extremely clear and all the salient points are addressed. This post helped me understand the various problems.

The main discussion by Ted Ts'o covers alignment at the sector level. Basically the idea is to change the hd geometry in such a way that each cylinder is aligned with a certain basic (stripe) size. Depending on the application, alignment can happen on 4KiB boundaries (for next-gen H/Ds) or 128KiB boundaries (for SSDs; erase block boundaries in his case). His explanations are very clear, so I will simply point to his discussion. One thing that needs special care is the partition table: for MS-DOS compatibility the first partition starts on track 1. To have proper partition alignment one has to move the start of the partition so that it is aligned correctly. This can be done with fdisk in expert mode (see the end of this post for an example). In our case we do not need alignment at the disk level (this is not the case with hardware raid OR if we create a partition table on the md array; in that case read this for a discussion), but if we did, we would have to move the partitions manually.

Link 2: The impact of misalignment can be significant, as the link illustrates. Also, as the discussion on this link illustrates, it can have an impact of 30% or more on performance. It seems that the impact is greater when the stripe size is smaller. This makes sense: with misalignment, the read-modify-write operations cross stripe boundaries more often, and therefore we pay a higher premium in terms of performance.

Note: I will be using the same partition scheme described in a previous post. That is, at the hardware level each of the hard drives has two partitions: a small one (~100MB, to be used for the boot partition) and a larger one to be used for the raid5 array. On each of the drives we do not care about alignment, and there is a partition table on the first track (although we could easily take this into account). We need alignment once the array is created. In this case, given the stripe (chunk) size of 256kB, the basic "quantum" size for alignment is 256KiB. When overlaying LVM on top of the md array we have similar problems to the ones described before, in the sense that the LVM extents should be aligned with the md stripes.

Note for HW Raids:
In the software case we need a RAID 1 partition to boot from, since the bootloaders do not fully support booting from a RAID 5. With HW raid this is not a problem: the system sees the array as a single hard drive (since the controller takes care of that). The best approach in this case is not to create a partition table at all, but to overlay LVM directly and then use LVM to do the partitioning.

Link 3: An extremely comprehensive benchmark and comparison between hard and soft raid. Essentially md is compared to a 3ware card in RAID 5 and RAID 10 configurations. Along the way a lot of interesting information is presented. A definite must-read.

To create the LVM we have two options:
1) Create a partition on /dev/md0 and label it for LVM (typical of a HW raid), or
2) Create the LVM without a partition table.
In the first case we would need alignment for the partition (since we are doing this on top of the md layer; see the example below) and then alignment for the metadata, whereas in the second case alignment is only required for the metadata (see what follows). The LVM tools from version 2.02.40 onwards (which unfortunately is not yet integrated in Ubuntu as of this writing) can get the chunk size from a software array and handle the alignment automagically. For HW raids, or in our case (since we have version 2.02.39), we will do it manually.
Link 4:  Interesting discussion on alignment for Windows OSes..
Link 5:  Linux-raid mailing list: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
Link 6:  LVM tools confuse Megabytes with Mebibytes. Overall a very detailed and interesting article...

Relevant/Interesting HOWTOs:
HOWTO: Software Raid
HOWTO: Multi Disk System Tuning
HOWTO: LVM

Disk partition adjustment for Linux systems
In Linux, align the partition table before data is written to the LUN, as the partition map will be rewritten and all data on the LUN destroyed. In the following example, the LUN is mapped to /dev/emcpowerah, and the LUN stripe element size is 128 blocks. Arguments for the fdisk utility are as follows:
fdisk    /dev/emcpowerah
x      # expert mode
b      # adjust starting block number
1      # choose partition 1
xxx #    set it to an appropriate size for the alignment, our stripe element size
w      # write the new partition
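
The value to enter for the starting block is the first multiple of the stripe element size at or after the current start. A tiny sketch of that calculation (align_up is just an illustrative helper; 63 is the traditional MS-DOS start sector, and 128 is the element size from the example above):

```shell
# Round a starting sector up to the next multiple of the stripe element size.
# Both values are in the same unit (512-byte blocks here).
align_up() {
    sector=$1 element=$2
    echo $(( (sector + element - 1) / element * element ))
}

align_up 63 128    # prints 128
```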

Steps to setup LVM:

1) First create a test filesystem using the defaults

mkfs.xfs /dev/md0
and record the various parameters (they will be needed later).

Filesystem parameters by default on /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=15258240 blks
         =                       sectsz=4096  attr=2
data     =                       bsize=4096   blocks=488263424, imaxpct=5
         =                       sunit=64     swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=0
realtime =none                   extsz=1048576 blocks=0, rtextents=0

The following will remove the partition table (it zeroes the first sector of the device):
dd if=/dev/zero of=/dev/md0 bs=512 count=1

2) Create physical volume
Normally the LVM metadata area takes 192KiB, which is not a multiple of 256KiB, so we need to allocate a little more for alignment:

pvcreate --metadatasize 250k /dev/md0
(apparently the calculation is 250KiB*1.024=256, what a mess...)

To verify:
pvs -o +pe_start 
(you can also add     --units B)
or
pvdisplay --units b

The commands above are used to verify that the first physical extent (pe_start) is aligned with the 256K boundary. One might
wonder why 250K is used: the lvm tools confuse KiB, MiB, GiB with kB, MB, GB. It's a mess, but see Link 6 for an "explanation".
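
The check itself is a simple modulo. A minimal sketch (is_aligned is a made-up helper; feed it the pe_start value in bytes as reported with --units b):

```shell
# Succeed if an offset in bytes falls on a multiple of the chunk size (given in KiB).
is_aligned() {
    offset_bytes=$1 chunk_kib=$2
    [ $(( offset_bytes % (chunk_kib * 1024) )) -eq 0 ]
}

# 262144 bytes = 256KiB, which is what we want pe_start to be.
is_aligned 262144 256 && echo "pe_start is aligned"
```

With the default 192KiB start (196608 bytes) the same check fails.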

3) Create volume group (32MiB extent size)
The extent size needs to align on top of the md layer, so it has to be a multiple of 256KiB.

It can be argued that it is beneficial to have it a multiple of the full stripe,
256KiB*4=1MiB (where 4 = RAID devices - 1).

Here we choose it to be 32MiB

vgcreate --physicalextentsize 32M /dev/md0

to verify alignment

vgdisplay --units b
and we get PE size 33554432= 32*(1024)^2

4) Create Logical Volumes

100GiB for /
600GiB for /var
600GiB for /home

In terms of extents (32 extents * 32MiB = 1GiB) this is equal to:
100GiB = 32*100 = 3200 extents
600GiB = 32*600 = 19200 extents
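
The same arithmetic as a reusable one-liner, in case a different extent size is chosen (gib_to_extents is an illustrative name, not an LVM tool):

```shell
# Convert a volume size in GiB into a number of physical extents of pe_mib MiB each.
gib_to_extents() {
    size_gib=$1 pe_mib=$2
    echo $(( size_gib * 1024 / pe_mib ))
}

gib_to_extents 100 32    # prints 3200
gib_to_extents 600 32    # prints 19200
```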

lvcreate -l 3200 -n root
lvcreate -l 19200 -n home

lvcreate -l 19200 -n var

lvs (to verify that everything is ok)

To activate an lv:
vgchange -a y


Step 3: Create the filesystem


To create the filesystem we need to make sure that we get alignment also at this level. Thankfully the XFS filesystem
can become RAID aware and adapt performance to the presence of soft/hard RAID. The relevant parameters are
the sunit (stripe unit) and swidth (stripe width) parameters.

Explanation of the options (from the manpage, together with notes from the following links):
 tuning the XFS        XFS FAQ     Tweaking XFS Performance

My choices are outlined below
Block Size
-b size : This option specifies the fundamental block size of the filesystem. It cannot be larger than the kernel page size, which is 4096 on x86 Linux (some other architectures use larger pages). Normally, a larger block size will result in better performance, but here I kept the default.

-b size=4096

Data Section

-d data_section_options

    agcount:This is used to specify the number of allocation groups. The data section of the filesystem is divided into allocation groups to improve the performance of XFS.
    sunit: This is used to specify the stripe unit for a RAID device or a logical volume. The value has to be specified in 512-byte block units. Use the su sub-option to specify the stripe unit size in bytes instead.
    swidth: This is used to specify the stripe width for a RAID device or a striped logical volume. The value has to be specified in 512-byte block units. Alternatively, use the sw sub-option to specify the width as a multiple of the stripe unit (usually the number of data disks).

 -d agcount=4,su=256k,sw=4

Here, for RAID 5: stripe width = su * (number of RAID drives - 1), i.e. sw=4 for five drives;
for RAID 6 it would be: stripe width = su * (number of RAID drives - 2)
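
To see how su/sw relate to the sunit/swidth values, here is a small sketch that converts the chunk size and disk count into 512-byte units (xfs_stripe_geometry is a hypothetical helper, not part of xfsprogs):

```shell
# Compute sunit and swidth in 512-byte units from the md chunk size (KiB),
# the total number of disks and the number of parity disks (1 for RAID 5, 2 for RAID 6).
xfs_stripe_geometry() {
    chunk_kib=$1 disks=$2 parity=$3
    sunit=$(( chunk_kib * 1024 / 512 ))
    swidth=$(( sunit * (disks - parity) ))
    echo "sunit=$sunit swidth=$swidth"
}

xfs_stripe_geometry 256 5 1    # prints sunit=512 swidth=2048
```

In 4096-byte filesystem blocks these are 64 and 256, which matches what mkfs.xfs reported for /dev/md0 earlier.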

Force Overwrite (Optional)
-f Force overwrite when an existing filesystem is detected on the device.

Log Section
-l log_section_options
     internal: This is used to specify that the log section is a piece of the data section instead of being another device or logical volume.
     size: This is used to specify the size of the log section.
     version: This specifies the version of the log. The current default is 2, which allows for larger log buffer sizes as well as supporting stripe-aligned log writes (see the sunit and su options, below).
     sunit: This specifies the alignment to be used for log writes. The value has to be specified in 512-byte block units. Note: I do not set it, as it is done automatically once the data sunit is given.
     lazy-count: This changes the method of logging various persistent counters in the superblock. Under metadata intensive workloads, these counters are updated and logged frequently enough that the superblock updates become a serialisation point in the filesystem. The value can be either 0 or 1.

-l internal,size=128m,version=2,lazy-count=1

The remaining options can stay at their default values.

mkfs.xfs -b size=4096 -d agcount=4,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev//root
Running it produces a warning that the AG size is a multiple of the stripe width (which would place the start of every AG on the same disk), together with a recommended AG size that is one stripe unit smaller. Using the recommended values:
mkfs.xfs -b size=4096 -d agsize=6553536b,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev//root
mkfs.xfs -b size=4096 -d agsize=39321536b,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev//home
mkfs.xfs -b size=4096 -d agsize=39321536b,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev//var
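
The recommended agsize values can be reproduced by hand: divide the volume into agcount groups and, if the result is a multiple of the stripe width, subtract one stripe unit. A sketch of that arithmetic (suggest_agsize is my own helper and only mirrors the numbers above; mkfs.xfs remains the authority):

```shell
# Suggest an AG size in filesystem blocks.
# Arguments: volume size (GiB), agcount, block size (bytes), stripe unit (KiB), sw (data disks).
suggest_agsize() {
    lv_gib=$1 agcount=$2 bsize=$3 su_kib=$4 sw=$5
    blocks=$(( lv_gib * 1024 * 1024 * 1024 / bsize ))
    su_blocks=$(( su_kib * 1024 / bsize ))
    agsize=$(( blocks / agcount ))
    # If every AG would start on the same disk, shift by one stripe unit.
    if [ $(( agsize % (su_blocks * sw) )) -eq 0 ]; then
        agsize=$(( agsize - su_blocks ))
    fi
    echo "$agsize"
}

suggest_agsize 100 4 4096 256 4    # prints 6553536
suggest_agsize 600 4 4096 256 4    # prints 39321536
```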

and then to mount, use the following options (also set them in the fstab)
nobarrier,logbufs=8,noatime,nodiratime /dev/root
nobarrier,logbufs=8,noatime,nodiratime /dev/var
nobarrier,logbufs=8,noatime,nodiratime /dev/home

Step 4: Final tweaks. Set readahead buffers correctly

There is an issue with the readahead buffers. This is a known problem and is discussed extensively in the following links:

Link 1     Link2

blockdev --getra /dev/md0 /dev//root /dev//var /dev//home

Gives
4096 256 256 256

To fix this:
blockdev --setra 4096 /dev/md0 /dev//root /dev//var /dev//home

(To make this permanent add an entry to /etc/rc.local)
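
Note that blockdev expresses readahead in 512-byte sectors, which makes the numbers a bit opaque. A quick conversion sketch (ra_sectors_to_kib is an illustrative helper):

```shell
# Convert a blockdev readahead value (512-byte sectors) into KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

ra_sectors_to_kib 256     # prints 128  (the logical volumes' default)
ra_sectors_to_kib 4096    # prints 2048 (the value on md0)
```

So the logical volumes were reading ahead only 128KiB, i.e. half of a single 256KiB chunk.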

and then do a bonnie++ benchmark to test that everything works as expected.
bonnie++ -u -f

Note: The benefits of partition alignment will be more pronounced as the chunk size becomes smaller.

Step 5: System Installation

Reboot the system and do the installation as usual (one could do it manually but there are no
significant reasons why one should complicate things more).

Once the basic installation has completed, the system will be restarted and the basic grub prompt will
appear. Entering

find /grub/menu.lst (or find /boot/grub/menu.lst)
root (hd0,0)
setup (hd0)

root (hd1,0)
setup (hd1)

will install grub on the two hard drives. Then restart.

Once the system is setup:

1. Set up the network (edit /etc/network/interfaces)

auto eth0
iface eth0 inet static
address 192.168.1.100
netmask 255.255.255.0
network 192.168.1.0
broadcast 192.168.1.255
gateway 192.168.1.254
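
The network and broadcast lines follow from the address and netmask, so they can be double-checked with a bit of shell arithmetic. A sketch (the helper names are my own):

```shell
# Convert a dotted quad to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1                      # split the address on dots
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Convert a 32-bit integer back to a dotted quad.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Print the network and broadcast addresses for an address/netmask pair.
net_and_broadcast() {
    addr=$(ip_to_int "$1")
    mask=$(ip_to_int "$2")
    network=$(( addr & mask ))
    broadcast=$(( network | (~mask & 4294967295) ))
    echo "network $(int_to_ip "$network")"
    echo "broadcast $(int_to_ip "$broadcast")"
}

net_and_broadcast 192.168.1.100 255.255.255.0
```

For the values above this gives network 192.168.1.0 and broadcast 192.168.1.255.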

Edit /etc/resolv.conf to add the nameservers
search myisp.com
nameserver 192.168.1.254
nameserver 202.54.1.20
nameserver 202.54.1.30

/etc/init.d/networking restart

Test connectivity:
ping www.google.com


2. Update the system

apt-get update;apt-get dist-upgrade

dpkg-reconfigure debconf (to set the level of questions that you want asked, I choose medium)


3. Add swap (if you have not added it before)

Rule of thumb: for system memory of 4GB and higher, swap should be system memory + 2GB,
so for me it is 6GB (192 extents of 32MiB).

lvcreate -l 192 -n swap1
mkswap /dev//swap1
and record the UUID given.
swapon -va (To activate it)
and
cat /proc/swaps
or
free
To verify that it is installed

4. Edit /etc/fstab  and add there
a. The options for the xfs filesystems (see above)
b. for the swap, one line along the lines of
UUID=  swap     swap    defaults     0 0

5. Set readahead to a larger value automatically on system boot.

Edit rc.local and add the line
blockdev --setra 4096 /dev/md0 /dev//*

Reboot and we are done!!!!

Other minor topics
1. Info on the xfs filesystem and how to defrag it
xfs_info /dev/data/test
Check Fragmentation Level:
xfs_db -c frag -r /dev/hdXY
To lower fragmentation level:
xfs_fsr /dev/hdXY
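
To decide from a script whether a defrag run is worth it, the fragmentation factor can be extracted from the xfs_db output. This is a sketch that assumes the output line looks like "actual N, ideal M, fragmentation factor X%"; the function names and the threshold are my own choices.

```shell
# Extract the fragmentation percentage from xfs_db "frag" output read on stdin.
frag_percent() {
    sed -n 's/.*fragmentation factor \([0-9.]*\)%.*/\1/p'
}

# Succeed if the fragmentation in the given output line exceeds the threshold (percent).
needs_defrag() {
    pct=$(printf '%s\n' "$1" | frag_percent)
    awk -v p="$pct" -v t="$2" 'BEGIN { exit !(p + 0 > t + 0) }'
}

# Usage (as root):
#   out=$(xfs_db -c frag -r /dev/hdXY)
#   needs_defrag "$out" 10 && xfs_fsr /dev/hdXY
```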

2. Expert mode in server installation

Note: in Ubuntu 9.04, using expert mode seems to create problems during the installation of the base system, when mkinitrd is creating the initrd image. There are some workarounds on the internet, but it is easier not to use expert mode.

3. Items that need further investigation

a. bonnie++ and bonnie++ -f give different results....

I have no clue why this is the case. The difference is probably due to a bug in bonnie++; other people around the net have noticed this behavior. While not certain, comparing with other benchmarking software suggests that the problem is with the -f switch.

After the tweaks above, indicative numbers are shown below:

Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
name          8G 81129  93 161448  19 115095  13 86482  96 414836  27 586.2   0
                    ------Sequential Create------ --------Random Create--------  
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--  
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  
                 16  7780  23 +++++ +++  3866   8  8314  19 +++++ +++  3874   9  

and with the option in my rc.local file:

echo 4096 > /sys/block/md0/md/stripe_cache_size

I get the following:

Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
foxtrot          8G 87635  96 180044  21 131761  16 91021  97 401820  25 499.2   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  8988  27 +++++ +++  4239   8 10419  33 +++++ +++  3921   9
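
To put the two runs side by side, the relative change of each column can be computed with a small helper (pct_change is just an illustrative name; the numbers are the K/sec values from the two tables above):

```shell
# Percentage change from value a to value b, rounded to one decimal.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

pct_change 161448 180044    # block write:  prints 11.5
pct_change 115095 131761    # rewrite:      prints 14.5
pct_change 414836 401820    # block read:   prints -3.1
```

So the larger stripe cache helps writes and rewrites noticeably, while block reads are slightly lower in this run.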


c. Why hdparm -tT gives different numbers on a mounted vs an unmounted filesystem
Based on the discussion here it seems that there is some communication between the filesystem and the block device. This gives slower hdparm results when the filesystem is mounted. 

d. Configure mdadm.conf to send automatic notifications regarding the health of the disk array.


04 July, 2009

LVM Advanced Installation Notes:

1) The problem
After the  default installation (see previous post) I noticed that performance was not satisfactory.
I ran bonnie++ and other io benchmarking software.
The problem can be illustrated as follows:
hdparm -tT /dev/md0
Gives reasonable performance (380MB/s reads), while
hdparm -tT /dev/vgvol/dir
gives abysmal performance (120 MB/sec, equivalent to that of one drive)...

This suggests that we might have a problem with alignment between raid/lvm/xfs....

2) Raid Information
The following resources provide a lot of useful information regarding raid installation:
RAID HOWTO

In particular it defines the superblock and gives lots of useful information on mdadm and its use.

3) File mdadm.conf
/etc/mdadm.conf is mdadm's primary configuration file. Unlike /etc/raidtab, mdadm does not rely on /etc/mdadm.conf to create or manage arrays. Rather, mdadm.conf is simply an extra way of keeping track of software RAIDs. Using a configuration file with mdadm is useful, but not required. Having one means you can quickly manage arrays without spending extra time figuring out what array properties are and where disks belong. For example, if an array wasn't running and there was no mdadm.conf file describing it, then the system administrator would need to spend time examining individual disks to determine array properties and member disks.

# mdadm --detail --scan
ARRAY /dev/md0 level=raid0 num-devices=2   \
    UUID=410a299e:4cdd535e:169d3df4:48b7144a

If there were multiple arrays running on the system, then mdadm would generate an array line for each one. So after you're done building arrays you could redirect the output of mdadm --detail --scan to /etc/mdadm.conf. Just make sure that you manually create a DEVICE entry as well. Using the example I've provided above we might have an /etc/mdadm.conf that looks like:

DEVICE    /dev/sdb1 /dev/sdc1
ARRAY     /dev/md0 level=raid0 num-devices=2    \
    UUID=410a299e:4cdd535e:169d3df4:48b7144a
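
Putting the two pieces together, the file can be assembled from the scan output plus a DEVICE line. A minimal sketch (make_mdadm_conf is a made-up helper; it simply prepends the DEVICE entry to whatever ARRAY lines arrive on stdin):

```shell
# Emit an mdadm.conf: a DEVICE line built from the arguments,
# followed by the ARRAY lines read from stdin.
make_mdadm_conf() {
    echo "DEVICE $*"
    cat
}

# Usage (as root), reviewing the result before installing it:
#   mdadm --detail --scan | make_mdadm_conf /dev/sdb1 /dev/sdc1 > /tmp/mdadm.conf.new
```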


4) Choices
HW vs SoftRaid vs FakeRaid:
I had all three options (I have a raid controller and an ICH10R mobo, and I only run linux).
See below for a discussion
Pros and Cons
I chose softraid because:
- I have a fast processor.
- I only run linux.
- From what I have seen it is reliable and fast, and compared to fakeraid (dmraid) it is more stable and slightly
faster.
See also the following for more discussion:
Link 1 Link 2   Link 3  



Superblock:

It turns out there are multiple superblock versions. The one in use is reported when running
mdadm --detail /dev/md0  (under Version)
See link for more information.
Update: Add here choice ....


Swap Location:
There is a discussion on where to put the swap if you have a RAID setup: should it be put on the raid
or kept separate?

Three solutions are proposed:
- A separate RAID 1 for swap on 2 drives (so that if a drive fails there is swap on the other),
- many swap partitions, one on each of the drives, letting the kernel decide where to place the swap, or
- swap on the raid 5.

After looking around the following discussion is the most convincing:
If you have everything on RAID on your server, it's often debated whether you want your swap partition on RAID as well. Some will state correctly that Linux optimally uses two swap partitions (e.g. on /dev/sda2 and /dev/sdb2) and that putting the swap on a RAID impacts the swap performance. While this is technically correct, it is nonsense when it comes to availability.
First: if swap performance is an issue, the problem isn't RAID or not, it is too little RAM. Under normal circumstances, swap should be used only sparsely, if at all. From time to time the system might swap out something not used for some time. If a larger amount of swap is used on a regular basis, either there's a memory leak in one of the applications running, or you simply do not have enough RAM for the tasks running. Go buy some!
Second: while Linux can indeed distribute swapped pages across several swap partitions, once one of them suddenly disappears because the underlying disk died, the system simply crashes. And that's exactly what you don't want.

Conclusion: put the swap on a RAID as well as everything else.

So for me: swap on the RAID 5.



01 July, 2009

Notes on RAID 5/LVM Installation...

I found quite a few resources on the internet with useful information regarding setting up a RAID 5 system with LVM.

 Setup soft RAID/LVM
Using the 9.04 server CD, I partitioned the 5 disks as follows:
128MB on every disk (set flag to raid) (/dev/sd[a-e]1)
and the rest as a single partition (with the flag set on raid again) (/dev/sd[a-e]2)

I set the bootable flag on /dev/sda1 and /dev/sda2 and created a raid array that I formatted using XFS, with the mount point set to /boot. I then proceeded with the creation of a raid5 array (/dev/md0) using /dev/sd[a-e]2, and on top of it I put a partition with the
lvm flag set. Using the LVM tool I then created the volumes
swap, var, home, root
for the corresponding directories.
I formatted everything as XFS (and mounted them in the appropriate locations).
Then grub was installed per the recommendations in the first link below:
Software Raid on Ubuntu: This first link has a useful suggestion on how to install the boot loader with grub
after the configuration has been completed (essentially, run grub and install the boot loader on both of the boot partitions).
Ubuntu forums: This link gives an interesting suggestion regarding the swap. The suggestion is to have
multiple cache files.

Once restarted everything seems to be working correctly and the raid array started the sync process...
cat /proc/mdstat

It is also easy to check that the file /etc/mdadm/mdadm.conf has been created correctly.
The following link (in Greek) has a very good description of the process, as do the
Greek Forums on Ubuntu, which describe many details of the procedure.

I have left chunk sizes and other misc settings at their default values. It seems that there are benefits to
selecting the proper stripe sizes, but for my system (which is not under very heavy load) the difference would be
marginal, with a concomitant waste of my time.