21 July, 2009

Notes on various small tweaks

Note 1: Console properties
To change the console configuration (e.g. keyboard layout switching), run:
dpkg-reconfigure console-setup

Also, the old "Debian" way of selecting locales (via a menu) is no longer used. Instead one
has to configure them manually (check the various easy guides on the web for more information).

Note 2: DontZap
Restarting the X server with Ctrl+Alt+Backspace is now disabled by default (this changed recently
following a decision by the Xorg maintainers). To restore this functionality, either:
1. Edit xorg.conf and add

Section "ServerFlags"
        Option  "DontZap"       "False"
EndSection

2. or (in Ubuntu) run
dontzap --disable

Both have the same effect, i.e. they modify the xorg.conf file.

Note 3: MDADM
To configure mail notifications I had to enter my e-mail address. The easiest way is to run
dpkg-reconfigure mdadm
and have it run as a daemon.

It is also wise to let it run a periodic consistency check (time consuming, but useful nevertheless).

Alternatively, you can manually edit the file
/etc/mdadm/mdadm.conf

To test it we need to simulate a drive failure. I will simulate one on my raid1 (md1) array, since it is the quickest to rebuild:
mdadm --manage --set-faulty /dev/md1 /dev/sdb1
and then to see that it failed try:
mdadm --detail /dev/md1
or
dmesg
or
cat /proc/mdstat

You should now have received an e-mail notifying you of the failure.
Remove the failed drive:
mdadm /dev/md1 -r /dev/sdb1
and re-add it:
mdadm /dev/md1 -a /dev/sdb1
and verify that everything is back to normal.
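
The check of /proc/mdstat can also be scripted. Below is a minimal sketch, assuming the usual mdstat status format in which a failed or missing member shows up as an underscore inside the bracketed status (e.g. [U_]). The function name degraded_arrays is made up for illustration; it reads mdstat-style text on standard input.

```shell
#!/bin/sh
# Print the name of every md array whose member status (e.g. [U_]) contains
# an underscore, i.e. at least one member is failed or missing.
# Input: text in /proc/mdstat format on standard input.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/       { name = $1 }     # remember the array being described
        /\[[U_]*_[U_]*\]/   { print name }    # status such as [U_] or [_UUUU]
    '
}

# Typical use on a live system:
#   degraded_arrays < /proc/mdstat
```

After the re-add above has finished rebuilding, the function should print nothing.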

Note 4: Configure smartmon tools
Verify that the packages mail (or mailx) and smartmontools are installed

1) Edit the file /etc/smartd.conf (see the man page for the available options)

* Comment out the DEVICESCAN line

* and add the following
# Run a long self-test on the 13th of each month and a short self-test every Wednesday,
# both in the early morning (01:00 for the first drive, one hour later for each following drive).
# -a: enable the default set of monitoring checks
# -W 4,47,55: report temperature changes of 4 degrees, log at 47 and warn at 55 degrees Celsius
# -m root: mail warnings to root
/dev/sda -d sat -s (L/../13/./01|S/../../3/01) -a -W 4,47,55 -m root
/dev/sdb -d sat -s (L/../13/./02|S/../../3/02) -a -W 4,47,55 -m root
/dev/sdc -d sat -s (L/../13/./03|S/../../3/03) -a -W 4,47,55 -m root
/dev/sdd -d sat -s (L/../13/./04|S/../../3/04) -a -W 4,47,55 -m root
/dev/sde -d sat -s (L/../13/./05|S/../../3/05) -a -W 4,47,55 -m root
/dev/sdf -d sat -s (L/../13/./06|S/../../3/06) -a -W 4,47,55 -m root
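
Since the lines differ only in the device name and the test hour, they can be generated instead of typed. A small sketch follows; the function name smartd_lines is my own invention, and it assumes the same options as above for every drive, with the hour simply counting up from 01 (so it only makes sense for up to 23 drives).

```shell
#!/bin/sh
# Emit one smartd.conf line per drive given as argument, staggering the
# self-test hour (01 for the first drive, 02 for the second, ...), so that
# the drives are not all tested at the same time.
smartd_lines() {
    hour=1
    for dev in "$@"; do
        printf '%s -d sat -s (L/../13/./%02d|S/../../3/%02d) -a -W 4,47,55 -m root\n' \
            "$dev" "$hour" "$hour"
        hour=$((hour + 1))
    done
}

# Example: smartd_lines /dev/sda /dev/sdb /dev/sdc
```

The output can be reviewed and then pasted into /etc/smartd.conf.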

2) Edit /etc/default/smartmontools
and uncomment the following line (to start the daemon):
start_smartd=yes

3) Restart the daemon:
/etc/init.d/smartmontools restart

and check /var/log/syslog to confirm that everything works as expected.

Note 5: Receiving the various notifications automatically
A mail transfer agent (something like sendmail) must of course be installed and configured.
Edit /etc/aliases

and add the line
root: name@mailadd.com

then run newaliases

Note 6: Other rc.local options

Here we can add various optimizations. For example

Note 7: Disable ipv6

TODO

Note 8: Logwatch

TODO

Note 9: prefetch and readahead
TODO


Note 10: APC UPS (Configuration and notifications).
TODO

Note 11: Sensors
TODO

Note 12: Sound Card
TODO

Note 13: Sudoers

Run visudo as root and add the lines:

# User privilege specification
root    ALL=(ALL) ALL
username   ALL=(ALL) ALL



18 July, 2009

Fsarchiver

FSarchiver is a new tool for taking filesystem snapshots (much like Acronis True Image does on Windows). For Ubuntu there is already a rather mature solution, namely partimage.

Partimage has several problems:
  • It does not support multithreaded compression,
  • it is no longer actively developed, and
  • it does not seem to work well with LVM.
FSarchiver seems to be a better option as it resolves these problems. The catch is that it is not yet packaged for Ubuntu, so it has to be compiled from source. Below I describe how to compile it and how to back up a snapshot of a partition (using the LVM snapshot function).

The website for fsarchiver is here

For full functionality (i.e. lzma compression support) the xz utils are needed. Download them from
here

Step 1: FSarchiver installation
Make sure you have the required packages:
apt-get install zlib1g-dev libssl-dev libbz2-dev liblzo2-dev e2fslibs-dev attr-dev libblkid-dev uuid-dev

Download the source code for fsarchiver and xz utils and untar it
cd
tar xvfz xz-4.999.8beta.tar.gz
tar xvfz fsarchiver-0.5.8.tar.gz

First let's build the xz utils:
cd xz-4.999.8beta/
./configure && make && make check
and verify that all tests pass. Then do:
make install

Then build fsarchiver itself:
cd ../fsarchiver-0.5.8
./configure --enable-static
make && make install
cd ../xz-4.999.8beta/
make uninstall
cd ..
rm -rf fsarchiver-0.5.8 xz-4.999.8beta/

And you will have fsarchiver installed (with all options regarding compression support) on /usr/local/sbin

Step 2: Creating an LVM snapshot and an image of useful directories (can be used to
restore system in case of failure).




To restore:
fsarchiver restfs -j 4 backup_name.fsa id=0,dest=/dev/vgname/lvname
id=0: selects which filesystem to restore when the archive contains more than one.
-j 4: use four threads (all four cores in my case).

To display information regarding the partitions and the current filesystems:
fsarchiver probe simple

To see the details of an archive use:
fsarchiver archinfo backup_file.fsa






06 July, 2009

LVM Installation (Partition Alignment)

We are now ready to delve into the details and start the procedure again with the goal of performing various
optimizations.

Things to consider when doing lvm on top of raid:
- stripe vs. extent alignment
- stride vs. stripe vs. extent size for ext3 filesystems (or sunit swidth in the case of xfs filesystems)
- filesystem's awareness that there's also a raid layer below
- lvm's readahead
In the discussion that follows I will detail the various topics above and how I addressed them.

Step 0: Boot from the server or the desktop CD in rescue (or live) mode to execute these commands.

Step 1: Create the array
One of the choices that has to be made is the stripe (chunk) size for the raid5 array.
I found much, oftentimes conflicting, information on the web about how to choose an appropriate stripe size;
the discussion here gives satisfactory explanations. There are also a number
of benchmarks that help understand the effect of the various parameters. Based on the benchmarks here and the
discussion in the previous link I created the array using a stripe size of 256kB.

To create the arrays:
mdadm --create /dev/md0 --chunk=256 --level=raid5 --raid-devices=5 /dev/sd[a-e]2
and
mdadm --create /dev/md1 --level=raid1 --raid-devices=2 /dev/sd[a-b]1

To delete the arrays: (Warning: This can and probably will destroy your data)
mdadm --stop /dev/md0
mdadm --zero-superblock /dev/sd[a-e]2

After 2-3 hours the build of a newly created array is complete, as can be verified by
cat /proc/mdstat
and
mdadm --detail /dev/md0
mdadm --detail /dev/md1

For md0 the layout is left-symmetric, i.e.
[Figure: Left-Symmetric Layout]
A quick benchmark with
hdparm -tT /dev/md0
shows an uncached read speed of ~434MB/sec, which is to be expected given the stripe size
and the per-disk performance of the hardware used. Compared to the previous chunk size (64kB, see my previous post) an
increase in performance is obtained, as expected given the larger stripe size.

Step 2: LVM and its alignment
A search on the web turns up a long discussion about the alignment of the various layers. This is especially important for RAID 5 installations, since misalignment incurs a performance hit, especially during write operations. Many people, however, seem to be unclear about the exact procedure.

In my (long-winded) search, I found a number of interesting discussions. These can be found in the following links:

Link 1: A discussion on alignment for SSDs. Although the topic is only somewhat related, the discussion is extremely clear and all the salient points are addressed. This post helped me understand the various problems.

The main discussion by Ted Ts'o covers alignment at the sector level. Basically the idea is to change the hard disk geometry in such a way that each cylinder is aligned with a certain basic (stripe) size. Depending on the application, alignment can be on 4KiB boundaries (for next-generation hard disks) or 128KiB boundaries (for SSDs; erase block boundaries in his case). His explanations are very clear so I will simply point to his discussion. One thing that needs special care is the partition table: for MS-DOS compatibility the first partition starts on track 1. To have proper partition alignment one has to move the start of the partition so that it is aligned correctly. This can be done with fdisk in expert mode (see the end of this post for an example). In our case we do not need alignment at the disk level (this is not true for hardware raid or if we create a partition table on the md array; in that case read this for a discussion), but if we did we would have to move the partitions manually.

Link 2: The impact of misalignment can be significant, as the link illustrates: according to the discussion there it can cost 30% or more in performance. The impact can be expected to grow as the stripe size becomes smaller, since the read-modify-write operation then crosses stripe boundaries more often in the misaligned case, and we pay a correspondingly higher premium in terms of performance.

Note: I will be using the same partition scheme described in a previous post. That is, at the hardware level each of the hard drives has two partitions: a small one (~100MB, to be used for the boot partition) and a larger one to be used for the raid5 array. On each of the drives we do not care about alignment, and there is a partition table on the first track (although we could easily take this into account). We need alignment once the array is created. Given the stripe (chunk) size of 256kB, the basic "quantum" for alignment is 256KiB. When overlaying LVM on top of the md array we have problems similar to the ones described before, in the sense that the LVM extents should be aligned with the md stripe.

Note for HW Raids:
In the software case we need a RAID 1 partition to boot from, since the bootloaders do not fully support booting from a RAID 5. With hardware raid this is not a problem: the system sees the array as a single drive (since the controller takes care of that). The best approach in this case is not to create a partition table at all but to put LVM directly on the device, and then use LVM to do the partitioning.

Link 3: An extremely comprehensive benchmark and comparison between hard and soft raid. Essentially md is compared to a 3ware card in RAID 5 and RAID 10 configurations. Along the way much interesting information is presented. A definite must-read.

To create the LVM we have two options:
1) Create a partition on /dev/md0 and label it for LVM (typical of a HW raid), or
2) Create the LVM without a partition table.
In the first case we would need alignment for the partition (since we are doing this on top of the md layer; see the example below), and then alignment for the metadata, whereas in the second case alignment is only required for the metadata (see what follows). The LVM tools from version 2.0.40 onwards (which unfortunately is not yet integrated in Ubuntu as of this writing) can get information from a software array and take care of alignment issues automagically. For HW raids, or in our case (since we have version 2.0.39), we will do it manually.
Link 4:  Interesting discussion on alignment for Windows OSes..
Link 5:  Linux-raid mailing list: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
Link 6:  LVM tools confuse Megabytes with Mebibytes. Overall a very detailed and interesting article...

Relevant/Interesting HOWTOs:
HOWTO: Software Raid
HOWTO: Multi Disk System Tuning
HOWTO: LVM

Disk partition adjustment for Linux systems
In Linux, align the partition table before data is written to the LUN, as the partition map will be rewritten and all data on the LUN destroyed. In the following example, the LUN is mapped to /dev/emcpowerah, and the LUN stripe element size is 128 blocks. Arguments for the fdisk utility are as follows:
fdisk /dev/emcpowerah
x      # expert mode
b      # adjust starting block number
1      # choose partition 1
xxx    # set it to an appropriate value for the alignment (our stripe element size)
w      # write the new partition

Steps to setup LVM:

1) First create a test filesystem using the defaults

mkfs.xfs /dev/md0

and record the various parameters (they will be needed later).

Filesystem parameters by default on /dev/md0
meta-data=/dev/md0               isize=256     agcount=32, agsize=15258240 blks
         =                       sectsz=4096   attr=2
data     =                       bsize=4096    blocks=488263424, imaxpct=5
         =                       sunit=64      swidth=256 blks
naming   =version 2              bsize=4096    ascii-ci=0
log      =internal log           bsize=4096    blocks=32768, version=2
         =                       sectsz=4096   sunit=1 blks, lazy-count=0
realtime =none                   extsz=1048576 blocks=0, rtextents=0

Then wipe the first sector of the device, which removes the test filesystem's signature and any partition table:
dd if=/dev/zero of=/dev/md0 bs=512 count=1

2) Create physical volume
Normally the LVM metadata area takes 192KiB (we need to allocate a little more for alignment)

pvcreate --metadatasize 250k /dev/md0     (apparently the calculation is 250KiB*1.024=256; what a mess...)

To verify:
pvs -o +pe_start
(you can also add --units B)
or
pvdisplay --units b

These commands are used to verify that the first physical extent is aligned on a 256KiB boundary. One
might wonder why 250k is used rather than 256k: the LVM tools confuse KiB, MiB, GiB with kB, MB, GB. It's a mess, but see Link 6 for an "explanation".
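
The alignment test itself is just a remainder check on the pe_start value reported in bytes. A minimal sketch, where is_chunk_aligned is a name I made up and the chunk size is given in KiB:

```shell
#!/bin/sh
# Succeed (exit status 0) when a byte offset is a whole multiple of the
# RAID chunk size; fail (exit status 1) otherwise.
#   $1: offset in bytes (e.g. pe_start as reported with --units b)
#   $2: chunk size in KiB (256 for the array above)
is_chunk_aligned() {
    offset_bytes=$1
    chunk_kib=$2
    [ $((offset_bytes % (chunk_kib * 1024))) -eq 0 ]
}

# Example: is_chunk_aligned 262144 256 && echo aligned
```

With a 256KiB chunk, a first extent at 262144 bytes passes the check, whereas one at 196608 bytes (192KiB) does not.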

3) Create volume group (32MiB extent size)
The extent size needs to align on top of the md layer, so it has to be a multiple of 256KiB.

It can be argued that it is beneficial to have it a multiple of
256KiB*4=1MiB (where 4 = raid devices - 1, i.e. the full stripe width).

Here we choose it to be 32MiB

vgcreate --physicalextentsize 32M vgname /dev/md0

To verify alignment:

vgdisplay --units b
and we get PE size 33554432 = 32*(1024)^2

4) Create Logical Volumes

100GiB for /
600GiB for /var
600GiB for /home

In terms of extents this is equal to:
32 extents * 32MiB = 1GiB
100GiB = 32*100 = 3200 extents
600GiB = 32*600 = 19200 extents
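
The same conversion as a tiny helper, in case the extent size changes. This is only a sketch; gib_to_extents is a hypothetical name and it assumes the extent size in MiB divides 1024 evenly.

```shell
#!/bin/sh
# Convert a volume size in GiB to a number of LVM physical extents.
#   $1: size in GiB
#   $2: extent size in MiB (32 in this setup)
gib_to_extents() {
    size_gib=$1
    extent_mib=$2
    echo $((size_gib * 1024 / extent_mib))
}

# Example: gib_to_extents 100 32
```

For the sizes above this gives 3200 extents for 100GiB and 19200 extents for 600GiB.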

lvcreate -l 3200 -n root vgname
lvcreate -l 19200 -n home vgname
lvcreate -l 19200 -n var vgname

lvs (to verify that everything is ok)

To activate an lv:
vgchange -a y


Step 3: Create the filesystem


To create the filesystem we need to make sure that we get alignment also at this level. Thankfully the XFS filesystem
can become RAID aware and adapt performance to the presence of soft/hard RAID. The relevant parameters are
the sunit (stripe unit) and swidth (stripe width) parameters.

Explanation of options (from the manpage, and also using notes from the following links):
tuning the XFS        XFS FAQ     Tweaking XFS Performance

My choices are outlined below
Block Size
-b size : This option specifies the fundamental block size of the filesystem. On Linux it cannot be larger than the kernel page size, which is 4096 bytes on x86 (both 32-bit and 64-bit) and can be larger on some other architectures. Normally, a larger block size will result in better performance, but here I kept the default.

-b size=4096

Data Section

-d data_section_options

    agcount: This is used to specify the number of allocation groups. The data section of the filesystem is divided into allocation groups to improve the performance of XFS.
    sunit: This is used to specify the stripe unit for a RAID device or a logical volume. The value has to be specified in 512-byte block units. Use the su suboption to specify the stripe unit size in bytes instead (suffixes such as k are accepted).
    swidth: This is used to specify the stripe width for a RAID device or a striped logical volume. The value has to be specified in 512-byte block units. Use the sw suboption to specify the width as a multiple of the stripe unit (i.e. the number of data disks) instead.

-d agcount=4,su=256k,sw=4

Here for RAID5: width=su*(number of Raid Drives - 1)
for RAID 6, it would be: width=su*(number of Raid Drives -2)
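
To make the relation between the two notations concrete, here is a small sketch that derives both su/sw and the equivalent sunit/swidth (in 512-byte units) from the array parameters. The function name xfs_stripe_geometry and its argument order are my own.

```shell
#!/bin/sh
# Derive the XFS stripe settings from the RAID geometry.
#   $1: chunk size in KiB
#   $2: number of raid devices
#   $3: number of parity disks (1 for RAID 5, 2 for RAID 6)
# Prints su/sw and the equivalent sunit/swidth in 512-byte units.
xfs_stripe_geometry() {
    chunk_kib=$1
    raid_devices=$2
    parity_disks=$3
    data_disks=$((raid_devices - parity_disks))
    sunit=$((chunk_kib * 1024 / 512))
    swidth=$((sunit * data_disks))
    echo "su=${chunk_kib}k sw=${data_disks} sunit=${sunit} swidth=${swidth}"
}

# Example: xfs_stripe_geometry 256 5 1
```

For my 5-disk RAID 5 with 256kB chunks this gives su=256k, sw=4, which corresponds to sunit=512 and swidth=2048 in 512-byte units.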

Force Overwrite (Optional)
-f Force overwrite when an existing filesystem is detected on the device.

Log Section
-l log_section_options
     internal: This is used to specify that the log section is a piece of the data section instead of being another device or logical volume.
     size: This is used to specify the size of the log section.
     version: This specifies the version of the log. The current default is 2, which allows for larger log buffer sizes as well as supporting stripe-aligned log writes (see the sunit and su options, below).
     sunit: This specifies the alignment to be used for log writes. The value has to be specified in 512-byte block units. Note: I do not set it, as it is done automatically once the data sunit is given.
     lazy-count: This changes the method of logging various persistent counters in the superblock. Under metadata intensive workloads, these counters are updated and logged frequently enough that the superblock updates become a serialisation point in the filesystem. The value can be either 0 or 1.

-l internal,size=128m,version=2,lazy-count=1

The remaining options can stay at their default values.

mkfs.xfs -b size=4096 -d agcount=4,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev/vgname/root
Running it produces a warning that the AG size is a multiple of the stripe width (which would line up all the AG headers on the same disk), together with a recommended AG size, so:
mkfs.xfs -b size=4096 -d agsize=6553536b,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev/vgname/root
mkfs.xfs -b size=4096 -d agsize=39321536b,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev/vgname/home
mkfs.xfs -b size=4096 -d agsize=39321536b,su=256k,sw=4 -l internal,size=128m,version=2,lazy-count=1 -f /dev/vgname/var
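
The agsize values above are consistent with taking the per-AG block count and subtracting one stripe unit. The sketch below reproduces that arithmetic; it is my reading of the numbers, not something taken from the mkfs.xfs source, and the function name xfs_agsize_blocks is made up.

```shell
#!/bin/sh
# Reproduce the recommended agsize (in filesystem blocks): the volume is
# split evenly into agcount groups and one stripe unit is subtracted.
#   $1: logical volume size in GiB
#   $2: number of allocation groups
#   $3: filesystem block size in bytes
#   $4: stripe unit in KiB
xfs_agsize_blocks() {
    lv_gib=$1
    agcount=$2
    block_bytes=$3
    su_kib=$4
    blocks_per_gib=$((1073741824 / block_bytes))
    su_blocks=$((su_kib * 1024 / block_bytes))
    echo $((lv_gib * blocks_per_gib / agcount - su_blocks))
}

# Example: xfs_agsize_blocks 100 4 4096 256
```

For the 100GiB volume with 4 allocation groups this yields 6553536 blocks, and for the 600GiB volumes 39321536 blocks, matching the values used in the commands.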

and then mount them with the following options (also put these options in the fstab):
nobarrier,logbufs=8,noatime,nodiratime /dev/vgname/root
nobarrier,logbufs=8,noatime,nodiratime /dev/vgname/var
nobarrier,logbufs=8,noatime,nodiratime /dev/vgname/home

Step 4: Final tweaks. Set readahead buffers correctly

There is an issue with the readahead buffers. This is a known problem and is discussed extensively in the following links:

Link 1     Link2

blockdev --getra /dev/md0 /dev/vgname/root /dev/vgname/var /dev/vgname/home

gives
4096 256 256 256

To fix this:
blockdev --setra 4096 /dev/md0 /dev/vgname/root /dev/vgname/var /dev/vgname/home

(To make this permanent add an entry to /etc/rc.local)
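
For rc.local it can be handy to generate one command per device. The following is only a sketch: readahead_commands is a hypothetical helper that prints the blockdev commands rather than running them, so the list can be checked first. Note that blockdev --setra takes the value in 512-byte sectors, so 4096 corresponds to 2MiB of readahead.

```shell
#!/bin/sh
# Print (do not run) one "blockdev --setra" command per device.
#   $1: readahead in 512-byte sectors
#   remaining arguments: block devices
readahead_commands() {
    sectors=$1
    shift
    for dev in "$@"; do
        echo "blockdev --setra ${sectors} ${dev}"
    done
}

# Example (as root, once the output looks right):
#   readahead_commands 4096 /dev/md0 /dev/vgname/root | sh
```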

and then do a bonnie++ benchmark to test that everything works as expected.
bonnie++ -u -f

Note: The benefits of partition alignment become more pronounced as the chunk size becomes smaller.

Step 5: System Installation

Reboot the system and do the installation as usual (one could do it manually but there are no
significant reasons why one should complicate things more).

Once the basic installation has completed, the system restarts and the basic grub prompt
appears. There, enter:

find /grub/menu.lst (or find /boot/grub/menu.lst)
root (hd0,0)
setup (hd0)

root (hd1,0)
setup (hd1)

This installs grub on both hard drives. Then restart.

Once the system is setup:

1. Set up the network (edit /etc/network/interfaces)

iface eth0 inet static
address 192.168.1.100
netmask 255.255.255.0
network 192.168.1.0
broadcast 192.168.1.255
gateway 192.168.1.254

Edit /etc/resolv.conf to add the nameservers
search myisp.com
nameserver 192.168.1.254
nameserver 202.54.1.20
nameserver 202.54.1.30

/etc/init.d/networking restart

Test connectivity:
ping www.google.com


2. Update the system

apt-get update && apt-get dist-upgrade

dpkg-reconfigure debconf (to set the level of questions that you want asked, I choose medium)


3. Add swap (if you have not added it before)

Rule of thumb: for system memory of 4GB and higher, swap should be system memory + 2GB,
so for me it is 6GB (192 extents of 32MiB).

lvcreate -l 192 -n swap1 vgname
mkswap /dev/vgname/swap1
and record the UUID given.
swapon -va (to activate it)
and
cat /proc/swaps
or
free
to verify that it is active.

4. Edit /etc/fstab  and add there
a. The options for the xfs filesystems (see above)
b. for the swap, one line along the following lines (with the UUID recorded above)
UUID=  swap     swap    defaults     0 0

5. Set readahead to a larger value automatically on system boot.

Edit rc.local and add the line
blockdev --setra 4096 /dev/md0 /dev/vgname/*

Reboot and we are done!!!!

Other minor topics (defragmenting the filesystem etc.)
1. Info on xfs system
xfs_info /dev/data/test
Check Fragmentation Level:
xfs_db -c frag -r /dev/hdXY
To lower fragmentation level:
xfs_fsr /dev/hdXY

2. Expert mode in server installation

Note: in Ubuntu 9.04 using expert mode seems to create problems during the installation of the base system, when mkinitrd is creating the initrd image. There are some workarounds on the internet but it is easier not to use expert mode.

3. Items that need further investigation

a. bonnie++ and bonnie++ -f give different results....

I have no clue why this is the case. The difference is probably due to a bug in bonnie++; other people around the net have noticed this behavior. While not certain, comparing with other benchmarking software suggests that the problem lies with the -f switch.

After the tweaks above, indicative numbers are shown below:

Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
name          8G 81129  93 161448  19 115095  13 86482  96 414836  27 586.2   0
                    ------Sequential Create------ --------Random Create--------  
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--  
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  
                 16  7780  23 +++++ +++  3866   8  8314  19 +++++ +++  3874   9  

and with the option in my rc.local file:

echo 4096 > /sys/block/md0/md/stripe_cache_size

I get the following:

Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
foxtrot          8G 87635  96 180044  21 131761  16 91021  97 401820  25 499.2   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  8988  27 +++++ +++  4239   8 10419  33 +++++ +++  3921   9


c. Why hdparm -tT gives different numbers on a mounted vs an unmounted filesystem
Based on the discussion here it seems that there is some communication between the filesystem and the block device. This gives slower hdparm results when the filesystem is mounted. 

d. Configure mdadm.conf to send automatic notifications regarding the health of the disk array.


04 July, 2009

LVM Advanced Installation Notes:

1) The problem
After the  default installation (see previous post) I noticed that performance was not satisfactory.
I ran bonnie++ and other io benchmarking software.
The problem can be illustrated as follows:
hdparm -tT /dev/md0
Gives reasonable performance (380MB/s reads), while
hdparm -tT /dev/vgvol/dir
gives abysmal performance (120 MB/sec, equivalent to that of one drive)...

This suggests that we might have a problem with alignment between raid/lvm/xfs....

2) Raid Information
The following resources provide a lot of useful information regarding raid installation:
RAID HOWTO

In particular it defines the superblock and gives lots of useful information on mdadm and its use.

3) File mdadm.conf
/etc/mdadm.conf is mdadm's primary configuration file. Unlike /etc/raidtab, mdadm does not rely on /etc/mdadm.conf to create or manage arrays. Rather, mdadm.conf is simply an extra way of keeping track of software RAIDs. Using a configuration file with mdadm is useful, but not required. Having one means you can quickly manage arrays without spending extra time figuring out what array properties are and where disks belong. For example, if an array wasn't running and there was no mdadm.conf file describing it, then the system administrator would need to spend time examining individual disks to determine array properties and member disks.

# mdadm --detail --scan
ARRAY /dev/md0 level=raid0 num-devices=2   \
    UUID=410a299e:4cdd535e:169d3df4:48b7144a

If there were multiple arrays running on the system, then mdadm would generate an array line for each one. So after you're done building arrays you could redirect the output of mdadm --detail --scan to /etc/mdadm.conf. Just make sure that you manually create a DEVICE entry as well. Using the example I've provided above we might have an /etc/mdadm.conf that looks like:

DEVICE    /dev/sdb1 /dev/sdc1
ARRAY     /dev/md0 level=raid0 num-devices=2    \
    UUID=410a299e:4cdd535e:169d3df4:48b7144a
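
The two steps (writing the DEVICE entry and appending the scan output) can be combined. A minimal sketch, where mdadm_conf_from_scan is a name of my own choosing; it takes the member devices as arguments and the output of mdadm --detail --scan on standard input:

```shell
#!/bin/sh
# Build the contents of an mdadm.conf: a DEVICE line listing the member
# devices given as arguments, followed by the ARRAY lines read from
# standard input (the output of "mdadm --detail --scan").
mdadm_conf_from_scan() {
    echo "DEVICE $*"
    cat
}

# Example (as root):
#   mdadm --detail --scan | mdadm_conf_from_scan /dev/sdb1 /dev/sdc1 > /etc/mdadm.conf
```

It is worth inspecting the result before overwriting an existing configuration file.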


4) Choices
HW vs SoftRaid vs FakeRaid:
I had all three options (I have a raid controller and an ICH10R mobo, and only run linux).
See below for a discussion
Pros and Cons
I chose softraid because:
- I have a fast processor
- I only run linux.
- From what I have seen it is reliable and fast, and compared to fakeraid (dmraid) it is more stable and slightly
faster.
See also the following for more discussion:
Link 1 Link 2   Link 3  



Superblock:

It turns out there are multiple versions. This is reported when running
mdadm --detail /dev/md0  (Under the version)
See link for more information.
Update: Add here choice ....


Swap file Location:
There is a discussion on where to put the swap if you have a RAID partition: should it be put on the raid
or separately?

Three solutions are proposed:
- A separate RAID 1 for swap on 2 drives (so that if a drive fails there is swap on the other),
- swap partitions on each of the drives, letting the kernel decide where to place the swap, or
- swap on the raid 5.

After looking around the following discussion is the most convincing:
If you have everything on RAID on your server, it's often debated whether you want your swap partition on RAID as well. Some will state correctly that Linux optimally uses two swap partitions (e.g. on /dev/sda2 and /dev/sdb2) and that putting the swap on a RAID impacts the swap performance. While this is technically correct, it is nonsense when it comes to availability.
First: if swap performance is an issue, the problem isn't RAID or not, it is too little RAM. Under normal circumstances, swap should be used only sparsely -- if at all. From time to time the system might swap out something not used for some time. If a larger amount of swap is used on a regular basis, either there's a memory leak in one of the applications running, or you simply do not have enough RAM built in for the tasks running. Go buy some!
Second: while Linux can indeed distribute swapped pages across several swap partitions, once one of them suddenly disappears because the underlying disk died, the system simply crashes. And that's exactly what you don't want.

Conclusion: put the swap on a RAID as well as everything else.

Swap on RAID 5 for me



01 July, 2009

Notes on RAID 5/LVM Installation...

I found quite a few resources on the internet with useful information regarding setting up a RAID 5 system with LVM.

 Setup soft RAID/LVM
Using the 9.04 server CD, I partitioned the 5 disks as follows:
128MB on every disk (with the flag set to raid) (/dev/sd[a-e]1)
and the rest as a single partition (again with the flag set to raid) (/dev/sd[a-e]2)

I set the bootable flag on /dev/sda1 and /dev/sdb1 and created from them a raid array that I formatted using XFS and mounted on /boot. I then proceeded with the creation of a raid5 array using /dev/sd[a-e]2 (/dev/md0) and then I set on top of it a partition that had the
flag set to lvm. Using the LVM tool I then proceeded to create the logical volumes:
swap, var, home, root
for the corresponding directories.
I formatted everything as XFS (and mounted them in the appropriate locations).
Then grub was installed per the recommendations in the first link below:
Software Raid on Ubuntu: This first link gives a useful suggestion on how to install the boot loader with grub
after the configuration has been completed (essentially, install the boot loader on both of the boot partitions).
Ubuntu forums: This link gives an interesting suggestion regarding the swap. The suggestion is to have
multiple cache files

Once restarted everything seems to be working correctly and the raid array started the sync process...
cat /proc/mdstat

It is also easy to check that the file /etc/mdadm/mdadm.conf has been created correctly.
The following link (in Greek) has a very good description of the process as well as the
Greek Forums on Ubuntu describing many details of the procedure.

I have left chunk sizes and other miscellaneous settings at their default values. It seems that there are benefits to
selecting the proper stripe sizes, but for my system (which is not under very heavy load) the difference would be
marginal, with a concomitant waste of my time.



29 June, 2009

Bacula Installation Notes

Bacula is especially hard to configure as there are many options. My backup plan was to
be able to automatically take backups from various hosts. These might be users' machines,
in which case, depending on the operating system, their /home (for linux) or My Documents
(for windows hosts) would be backed up. What made it especially hard was the need to take
server backups. The servers host many websites, as well as other lab services.
This created the need to take automatic (and consistent) backups of each web site and the
associated database. The solution I devised was a set of scripts that take
LVM snapshots and then back up these snapshots.

I had to write a number of scripts so that this would scale to many hosts,
and I also found the script mylvmbackup extremely useful. The bacula conf files are
a work in progress, especially as I try to automate the various processes. Here I just document (for my own sake) the input and output of the scripts I wrote. For the bacula terminology see the end of this post.

Bacula Installation Notes

The first problem when installing bacula is that the new version (3.0.x) is not yet officially packaged (although there  exists an unofficial PPA package). It seems that the ubuntu server team will be preparing an official ppa package but nothing has been done yet. I decided to use the old version and upgrade in a few months as the new version becomes
available. (In fact, I tried the PPA version and it would not work.)

To enable ssl follow the steps below:

apt-get build-dep bacula
apt-get install build-essential libssl-dev fakeroot devscripts
apt-get source bacula
cd bacula-2.2.4
(edit debian/rules to add the openssl option)
dch -i -Djaunty
fakeroot dpkg-buildpackage

and then install the deb packages to commence the installation.

After answering the questions (creating a separate db user with access privileges to the bacula catalog).



Add here stuff about the pools and how to create them.....


Backup Websites (or other applications that have a file and a db part)

Step 1: Download mylvmbackup and mylvmbackup.conf
Edit and place them in the bacula scripts directory (diffs follow)

mylvmbackup
19,26d18
<
< #
< # Note I have edited two things here.
< # a. $configfile to point to the actual file. Due to a bug I could not pass it as an option
< # b. removed the default user from being the root (since the my.cnf will be used).
< # c. and of course, I edited the file mylvmbackup.conf
<
<
45c37
< my $configfile = "/etc/bacula/scripts/mylvmbackup.conf";
---
> my $configfile = "/etc/mylvmbackup.conf";
116c108
< }
---
> }
411c403
<   $user = '';
---
>   $user = 'root';

mylvmbackup.conf

16c16
< user=
---
> user=root
18c18
< host=localhost
---
> host=
21c21
< mycnf=/etc/mysql/my.cnf
---
> mycnf=/etc/my.cnf
27,28c27,28
< vgname=
< lvname=
---
> vgname=mysql
> lvname=data
30c30
< lvsize=10G
---
> lvsize=5G
88c88
< skip_hooks=1
---
> skip_hooks=0


Step 2: On the host with the website, insert the following (commented out) in the file daemon (client) configuration file:
# WebSite {
#  Name = "Joomla_Website"
#  dbuser ="joomuser"; dbpassword ="dbpasswd"
#  dbname = "Joomla"; dbdir = "/path to db";
#  dbvgname="dbvgname"; dblvname="database"; dbxfs=0;
#  webdir ="/path to website";
#  webvgname="dbwebname";weblvname="websites"; webxfs=0;
# }

With the following information:

REQUIRED
Name: Unique name to identify the database
dbuser: User name to access the database
dbpassword: Password to access the database
dbname: Name of the database (used for the dump in the non-lvm case)
dbdir:
      In the non lvm case, full path to dir where the temp sql dump will be placed.
      In the lvm case, the relative path (in the lv) where the db is located.

OPTIONAL
If the optional values are provided, an LVM snapshot is used.

Database options
dbvgname: Name of the volume group where the database resides.
dblvname: The name of the logical volume where the database resides.
dbxfs: Set to 1 if the snapshot volume has the xfs filesystem.

Website Data Directory options
webdir: Directory where the data files reside
       In the non-lvm case, this should be the actual directory.
       In the lvm case, the relative path (in the lv) where the website is located
       If not specified no website backup will be taken.
webvgname: The name of the volume group where the data dir resides
weblvname: The logical volume name where the data dir resides.
webxfs: Set to 1 if the snapshot volume has the xfs filesystem.
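
As a rough illustration of what the LVM case involves (this is a sketch, not the actual backup_website script), the snapshot and release steps could look like the following. The function names, the "_snapshot" suffix, the 5G snapshot size and the mount point are all assumptions made for the example. By default it only prints the commands; set DRY_RUN=0 to run them as root. The xfs flag matters because a snapshot of an XFS volume has the same filesystem UUID as its origin, so it has to be mounted with nouuid.

```shell
#!/bin/sh
# Hypothetical sketch: snapshot a logical volume, mount it read-only,
# and release it again. Prints the commands unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# snapshot_and_mount VG LV IS_XFS MOUNTPOINT
snapshot_and_mount() {
    vg=$1; lv=$2; is_xfs=$3; mnt=$4
    snap="${lv}_snapshot"            # assumed naming convention
    opts=ro
    # An XFS snapshot shares the UUID of its origin; nouuid lets it mount.
    if [ "$is_xfs" = 1 ]; then
        opts=ro,nouuid
    fi
    run lvcreate --snapshot --size 5G --name "$snap" "/dev/$vg/$lv"
    run mkdir -p "$mnt"
    run mount -o "$opts" "/dev/$vg/$snap" "$mnt"
}

# release_snapshot VG LV MOUNTPOINT
release_snapshot() {
    run umount "$3"
    run lvremove -f "/dev/$1/${2}_snapshot"
}

# Dry run using the placeholder names from the WebSite block.
snapshot_and_mount dbvgname database 1 /mnt/db_snapshot
release_snapshot dbvgname database /mnt/db_snapshot
```

The backup itself would run between the two calls, reading from the mount point instead of the live filesystem.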

Also copy the scripts mylvmbackup (see note above), backup_website and backup_website_awk
into the scripts directory.

The awk script scans the file for configuration information, and the backup_website (sh)
script then does the actual work. To invoke the script:

backup_website _mode_of_operation_    jobname

mode_of_operation: One of three choices: snapshot, release, filelist
jobname: The job name as created by bacula.
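
To give an idea of the scanning step, here is a minimal sketch of how a commented-out WebSite block could be turned into key=value pairs with awk. It is not the backup_website_awk script itself; the function name parse_website_block and the output format are assumptions for illustration, and it expects every setting to be written as key = value.

```shell
#!/bin/sh
# Hypothetical parser sketch: print key=value pairs found inside a
# commented-out "WebSite { ... }" block of a configuration file.
parse_website_block() {
    awk '
        /^#[ \t]*WebSite[ \t]*[{]/ { inside = 1; next }
        inside && /^#[ \t]*[}]/    { inside = 0; next }
        inside {
            sub(/^#[ \t]*/, "")               # drop the comment marker
            n = split($0, parts, ";")         # several settings per line
            for (i = 1; i <= n; i++) {
                eq = index(parts[i], "=")
                if (eq == 0) continue         # skip fragments without =
                key = substr(parts[i], 1, eq - 1)
                val = substr(parts[i], eq + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", key)
                gsub(/^[ \t"]+|[ \t"]+$/, "", val)
                print key "=" val
            }
        }
    ' "$1"
}

# Small demonstration on a temporary file.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# WebSite {
#  Name = "Joomla_Website"
#  dbuser ="joomuser"; dbpassword ="dbpasswd"
#  dbvgname="dbvgname"; dblvname="database"; dbxfs=0;
# }
EOF
parse_website_block "$conf"
rm -f "$conf"
```

Each setting comes out on its own line (Name=Joomla_Website, dbuser=joomuser, and so on), which a shell script can then read line by line.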

Step 3: In the director I use the following





Terminology

1.
Glossary on data storage schemes
Volume: A Volume is a single physical tape (or possibly a single file) on which Bacula will write your backup data.
Pools: Pools group together Volumes so that a backup is not restricted to the length of a single Volume (tape).
Label: Before Bacula will read or write a Volume, the physical Volume must have a Bacula software label so that Bacula can be sure the correct Volume is mounted.
Console: The program that interfaces to the Director allowing the user or system administrator to control Bacula.

2. A number of daemons facilitate the operation:
Bacula-Director: The director orchestrates all the backup operations.
Bacula-SD (Storage Daemon): The storage daemon is in charge of handling the storage devices.
Bacula-FD (File Daemon): Essentially the client software installed on the machine to be backed up.
Upon installation, all these daemons require (minimal) configuration by editing their configuration
files, which reside in the /etc/bacula directory.

3. Other utilities/interfaces of note:
Bconsole: Console utility that starts whenever a user logs onto the console.
Bsmtp: smtp utility used to send messages to the administrators
BootStrapRecord: Is the crucial information used to recover files in case of a catastrophic failure of the server itself.

4. Types of backups:
Full: A full backup
Differential: A backup that includes all files that have changed since the last Full backup.
Incremental: A backup that includes all the files changed since the last Full, Differential, or Incremental backup started.
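
The difference between the three levels is only the reference point in time. The sketch below illustrates that with find -newer and two marker files; it is an illustration of the idea, not how Bacula works internally (Bacula takes the reference times from its catalog). The function name and the marker files are assumptions for the example.

```shell
#!/bin/sh
# Illustration only. Two hypothetical marker files record when the last
# Full backup and the last backup of any level started.
# files_for_level LEVEL DIR LAST_FULL_MARKER LAST_ANY_MARKER
files_for_level() {
    case $1 in
        Full)         find "$2" -type f ;;
        Differential) find "$2" -type f -newer "$3" ;;
        Incremental)  find "$2" -type f -newer "$4" ;;
        *)            echo "unknown level: $1" >&2; return 1 ;;
    esac
}

# Build a small timeline in a scratch directory.
work=$(mktemp -d)
mkdir "$work/data"
touch -t 200907010000 "$work/data/old.txt"   # untouched since before the Full
touch -t 200907050000 "$work/last_full"      # the Full started here
touch -t 200907100000 "$work/data/mid.txt"   # changed after the Full
touch -t 200907150000 "$work/last_any"       # an Incremental started here
touch -t 200907200000 "$work/data/new.txt"   # changed after that Incremental

for level in Full Differential Incremental; do
    echo "$level:" $(files_for_level "$level" "$work/data" \
        "$work/last_full" "$work/last_any" | sed 's|.*/||' | sort)
done
rm -rf "$work"
```

With this timeline the Full level lists all three files, the Differential level lists mid.txt and new.txt, and the Incremental level lists only new.txt.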

5. Bacula Jobs (Configuration Resource)
A configuration resource that defines the work that Bacula must perform to back up a particular client. It consists of:
Type: Backup, restore, verify, etc
Level: Full, Incremental, Differential
Fileset: A Resource contained in a configuration file that defines the files to be backed up. It consists of a list of
   included files or directories, a list of excluded files, and how the files are to be stored.
Storage:
Storage Device, Media Pool

6. Types of Resources
Jobs: See 5 above
Restore: Describes the process of recovering a file from backup media.
Schedule: Defines when a job will be scheduled for execution
Verify: Operation (Job) to verify restored data.
Scan: A scan operation causes the contents of a Volume or a series of Volumes to be scanned.

7. Other terminology and information repositories
Resource: Part of a configuration file that defines a specific unit of information that is available to bacula.
Bootstrap file: Is an ASCII file containing commands that allow Bacula to restore the contents of one or more volumes.
Catalog: The catalog stores summary information about Jobs, Clients, and Files that were backed up on a Volume. 
Retention Period: The most important are the File Retention Period, Job Retention Period, and the Volume Retention Period. Each of these retention periods applies to the time that specific records will be kept in the Catalog database.
  • The File Retention Period is important for two reasons. The first is that as long as File records remain in the database, you
    can "browse" the database with a console program and restore any individual file. Once the File records are removed or pruned from the database, the individual files of a backup job can no longer be "browsed". The second is that the File records use the most storage space in the database. As a consequence, you must ensure that regular "pruning" of the database File records is done to keep your
    database from growing too large.
  • The Job Retention Period is the length of time that Job records will be kept in the database. Note, all the File records are tied to the Job that saved those files. The File records can be purged leaving the Job records. In this case, information will be available about the jobs that ran, but not the details of the files that were backed up. Normally, when a Job record is purged, all its File records will also be purged.


28 June, 2009

RAID/LVM Notes

General Notes on the concept:

The following link provides a comprehensive description of the fundamental ideas behind LVM

IBM Tutorial

Following the (excellent) discussion above, LVM is an interesting solution because it offers the following possibilities:

  • In multiple disk installations, it offers the possibility of having filesystems larger than any of the disks
  • Add disks/partitions to your disk-pool and extend existing filesystems online
  • Replace two 80GB disks with one 160GB disk without the need to bring the system offline or manually move data between disks
  • Shrink filesystems and remove disks from the pool when their storage space is no longer necessary
  • Perform consistent backups using snapshots (more on this later in the article)
All this flexibility comes at the cost of a small added complexity, in the sense that one has to properly describe the abstraction using CLI commands. As we see below, this is not that big of a deal as long as one has a thorough understanding of the concepts.

    The LVM is structured in three elements:
    • Volumes: physical and logical volumes and volume groups
    • Extents: physical and logical extents
    • Device mapper: the Linux kernel module

    Volume

    Linux LVM is organized into:

    • physical volumes (PVs),

    • volume groups (VGs), and

    • logical volumes (LVs).

    Physical volumes are physical disks or physical disk partitions (as in /dev/hda or /dev/hdb1). A volume group is an aggregation of physical volumes. And a volume group can be logically partitioned into logical volumes.
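
To make the three layers concrete, here is a small sketch that prints the commands needed to go from physical volumes to a logical volume. The function name lvm_plan and the 100G size are assumptions for the example; the device names are the ones mentioned above and the VG0/LV0 names are the ones used in the figures. Nothing is executed; the output can be reviewed and then run as root.

```shell
#!/bin/sh
# lvm_plan VG LV SIZE PV...: print the commands that create one logical
# volume of SIZE in a new volume group VG built from the given PVs.
lvm_plan() {
    vg=$1; lv=$2; size=$3
    shift 3
    for pv in "$@"; do
        echo "pvcreate $pv"                       # mark each disk/partition as a PV
    done
    echo "vgcreate $vg $*"                        # aggregate the PVs into one VG
    echo "lvcreate --size $size --name $lv $vg"   # carve an LV out of the VG
}

lvm_plan VG0 LV0 100G /dev/hda /dev/hdb1
```

The logical volume is then available as /dev/VG0/LV0 and can be formatted like any other block device.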

    Figure 1: Physical-to-logical volume mapping


    Extents

    In order to do the n-to-m, physical-to-logical volume mapping, PVs and VGs must share a common quantum size for their basic blocks; these are called physical extents (PEs) and logical extents (LEs). Despite the n-physical to m-logical volume mapping, PEs and LEs always map 1-to-1. The
    following image illustrates this concept.

    Physical to logical extent mapping

    Different extent sizes mean different VG granularity. For instance, if you choose an extent size of 4GB, you can only shrink/extend LVs in steps of 4GB. Of importance is also the extent allocation policy. LVM2 doesn't always allocate PEs contiguously; for more details, see the Linux man page on lvm. The system administrator can set different allocation policies, but that isn't normally necessary, since the default one (called the normal allocation policy) uses common-sense rules such as not placing parallel stripes on the same physical volume.
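
The granularity effect is plain integer arithmetic: a requested size is rounded up to a whole number of extents. The sketch below works that out in shell; the function names are made up for the example, and sizes are in megabytes. (The extent size itself is chosen when the volume group is created, with the -s option of vgcreate.)

```shell
#!/bin/sh
# extents_needed SIZE_MB EXTENT_MB: number of extents required, rounding
# the requested size up to a whole number of extents.
extents_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# allocated_mb SIZE_MB EXTENT_MB: the size that actually gets allocated.
allocated_mb() {
    echo $(( $(extents_needed "$1" "$2") * $2 ))
}

allocated_mb 10000 4       # 4 MB extents: 10000 MB fits exactly
allocated_mb 10000 4096    # 4 GB extents: rounded up to 12288 MB
```

So with 4GB extents a request for roughly 10GB ends up occupying 12GB, which is why a large extent size makes fine-grained resizing impossible.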

    Device Mapper
    When creating VGs and LVs, you can give them a meaningful name (as opposed to the previous examples where, for didactic purposes, the names VG0, LV0, and LV1 were used). It is the Device mapper's job to map these names correctly to the physical devices. Using the previous examples, the Device mapper would create the following device nodes in the /dev filesystem:
    • /dev/mapper/VG0-LV0
    with /dev/VG0/LV0 a link to the above.
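
One detail worth knowing about these names: since a single hyphen separates the VG name from the LV name under /dev/mapper, any hyphen inside either name is doubled. A small sketch of the naming rule (the function name dm_name and the my-vg example are made up for illustration):

```shell
#!/bin/sh
# dm_name VG LV: device-mapper node name for a logical volume.
# Hyphens inside a VG or LV name are doubled, because a single hyphen
# is the separator between the two names.
dm_name() {
    vg=$(printf '%s' "$1" | sed 's/-/--/g')
    lv=$(printf '%s' "$2" | sed 's/-/--/g')
    echo "/dev/mapper/$vg-$lv"
}

dm_name VG0 LV0        # /dev/mapper/VG0-LV0
dm_name my-vg data     # /dev/mapper/my--vg-data
```

The /dev/VG/LV style links avoid this ambiguity, which is one reason to prefer them in scripts.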
    Note: Many distributions provide utilities to partition using LVM and/or RAID. RedHat has a very nice tool, but I will be using
    Ubuntu (since I very much prefer the apt package management system). In Ubuntu, the alternate installation CD has
    partman and support for LVM/RAID, but this does not offer much flexibility in setting extent sizes, stripe sizes, etc. So I will be
    using the CLI to do much of the partitioning. Also note that LVM (and RAID) support must be included in the initrd for the
    system to be able to boot from an LVM volume. Ubuntu does this automatically from version 9.04 on, although you need
    the server edition.


    References

    1. IBM Tutorial, Logical Volume Management
    2. LVM-HOWTO, LVM Howto


    23 June, 2009

    On checking the hard drive

    This is a post documenting efforts to recover data from a failed hard drive. The drive had a reiserfs filesystem and failed suddenly. I can't mount it or in any other way access my data, so I will be documenting here the investigations…

    Smartmon Tools:

    It is possible to use the smartmon tools to check the health of the hard drive…

    1. Check the health of the drive

      smartctl -H -d ata /dev/sda (if PASSED this is a good indication)

    2. One can do more elaborate tests

      smartctl -t short -d ata /dev/sda (or)

      smartctl -t long -d ata /dev/sda

      smartctl -l selftest -d ata /dev/sda (to display results)

    3. One can also display the following

      smartctl -a /dev/sda

      smartctl -A /dev/sda

    Both the short and the extended offline tests report read failures. Not good.
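
For scripting the first check, the verdict can be pulled out of the smartctl -H output. This is a minimal sketch that assumes the ATA-style result line ("SMART overall-health self-assessment test result: ..."); the function name health_status is made up for the example, and the sample line is only there so the snippet runs without a drive.

```shell
#!/bin/sh
# health_status: read `smartctl -H` output on stdin and print the verdict.
# Assumes the ATA-style "... test result: PASSED" line.
health_status() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}

# Sample line in the format printed for ATA drives (illustrative input).
printf '%s\n' 'SMART overall-health self-assessment test result: PASSED' | health_status

# Real use (needs root and smartmontools):
#   smartctl -H -d ata /dev/sda | health_status
```

Anything other than PASSED, including empty output, should be treated as a reason to look at the full smartctl -a report.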

    reiserfsck:

        reiserfsck --check /dev/sda

    This gives out a warning that there is some sort of hardware failure. (Will get back to this later)

    seatools:

    Now I moved it over to Windows and tried the tools offered by Seagate (it turns out that some of their drives are shipped with buggy firmware, and this can cause an unexpected crash). The idea is to run their diagnostic tests and see if they pass. I checked on the Seagate web site, and it turns out that for my serial number no firmware update is required. I ran the updater utility (to update the firmware on my other drive) and it updated the firmware on the messed-up one as well. Some of the status messages changed, but there was no change whatsoever in the drive's accessibility. I do get errors with all their diagnostic tests (long/short DST, generic DST).

    PCB:

    After looking around a little, it seems that one way people fix problem drives is to replace the PCB. This is probably not an option for me, as it seems to be necessary when a drive is destroyed by a power surge or some other anomaly. In my case the drive works "perfectly" (i.e. rotates) and the filesystems are recognized.

    badblocks:

    It is now time to investigate bad blocks and the potential for at least partially recovering some data.

    References

    1. Smartmon Tools
    2. Linux Journal Article
    3. Ubuntu Data Recovery


     

    15 March, 2009

    I just found this tutorial on how to set up chroot sftp sessions:
    chroot sftp

    Worked like a charm