We have made a significant upgrade to the central file services and
backup facilities. An important consideration is the availability and
security of the storage, as well as the capacity and speed of access.
The model of a RAID storage system served via dual computer systems, up
to now provided by the StorageWorks array and Alphaserver systems, is
one that has provided reliable service for the last five years. It
provides resilience against the most common causes of lack of access to
data, such as disk failure and system crashes, as well as providing
access to the storage during a software upgrade on one of the systems.
A fully redundant system with no single point of failure would however
require considerably more hardware (dual storage systems, dual
controllers and dual connections for all possible data paths).
Equally important is the software running on the file servers which
coordinates disk access between the servers and ensures that external
clients are provided with an appropriate access path to the storage
whenever one or more of the servers is functioning. After some
investigation of such 'high-availability' software and demonstrations
(some of which showed that potential software packages did not perform
as advertised), we decided on the 'Convolo' package from Mission
Critical Linux. Although less sophisticated than VMS clustering,
Convolo provides several important mechanisms for checking the
availability of the servers, including both serial-line and dedicated
ethernet interfaces for heartbeat, which enable each server to
ascertain the status of the other. The file services are seen by the
clients as being served from a number of 'virtual' host addresses which
are switched between systems either automatically on failover or
manually by management commands. The services for specific filesystems
are associated with
these virtual addresses, so a change to the alternate server is
completely transparent to the client, involving only an updating of its
ARP table to reflect the change of host hardware address.
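The heartbeat and virtual-address mechanism described above can be illustrated with a minimal Python sketch. This is not Convolo's actual algorithm or interface: the class `Peer`, the function `addresses_to_hold`, the timeout value and the host names are all hypothetical, chosen only to show how two independent heartbeat channels and a table of preferred owners could drive the decision about which server holds which virtual address.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """What one server remembers about its partner (hypothetical model)."""
    timeout: float                                 # seconds of silence tolerated per channel
    last_seen: dict = field(default_factory=dict)  # channel name -> time of last heartbeat

    def record(self, channel: str, now: float) -> None:
        """Note that a heartbeat arrived on the given channel."""
        self.last_seen[channel] = now

    def is_alive(self, now: float) -> bool:
        # The partner counts as alive if ANY channel was heard from recently,
        # so losing only the serial line or only the Ethernet link is not
        # mistaken for a server failure.
        return any(now - t <= self.timeout for t in self.last_seen.values())


def addresses_to_hold(me, preferred_owner, peer_alive):
    """Return the virtual addresses this server should be answering for."""
    return sorted(
        addr for addr, owner in preferred_owner.items()
        if owner == me or not peer_alive
    )


# Assumed example: two virtual addresses, one preferred on each server.
preferred = {"vhost-a": "server1", "vhost-b": "server2"}
peer = Peer(timeout=5.0)
peer.record("serial", 100.0)
peer.record("ethernet", 100.0)
print(addresses_to_hold("server1", preferred, peer.is_alive(103.0)))
print(addresses_to_hold("server1", preferred, peer.is_alive(110.0)))
```

In this sketch the surviving server simply takes over every address once both channels have been silent for longer than the timeout; a real package must also guard against both servers claiming the same storage at once, which is not modelled here.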
Reliable hardware is also essential, and we therefore required
well-constructed server systems that do not fail, as cheap hardware
often does, for trivial reasons such as poor-quality power supplies or
inadequate ventilation. The final choice of hardware was:
Two Compaq Proliant ML370 G2, each with dual 1.2 GHz PIII
processors and 1GB of main memory, four 160MB/sec SCSI channels,
10/100Mbit/sec and Gigabit Ethernet Interfaces, dual-redundant power
supplies and redundant fans.
Operating systems:- SuSE 7.2 distribution of Linux.
Jetstor III RAID array - six Seagate 180GB disks, 900GB usable
space, dual Ultra160 SCSI interface.
Compaq MLS5026 SuperDLT robot. 100GB/tape. Data transfer speed up
to 70GB/hr. Magazine capacity 25 tapes.
Compaq UPS R3000 XR Uninterruptible Power Supply.
All the above fit into one standard rack, with space still available
for expanding the storage, either with additional disk arrays or
greater magazine capacity for the SDLT robot. The SuperDLT tapes are a
development of DLT technology that we have been using with great
success since 1994, and are backwards compatible with the existing DLT
IV tapes.
Each system has its own system disk, which allows each system to
operate should the other fail or be unavailable due to maintenance or
upgrading, but does mean that there is a certain duplication of effort
on installing or upgrading.
The storage array has a separate SCSI connection to each system, which
allows one system to be powered off and disconnected without affecting
disk access by the other. It supports the most common RAID levels, but
for our purposes it has been configured as a RAID5 array, so a single
disk can fail with no loss of data. It is partitioned into four logical
disk devices and each of these is further software partitioned to present
appropriately-sized filesystems to the users.
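The RAID5 figures can be checked with a short Python sketch. It assumes a single RAID5 set across all six disks, which is consistent with the quoted 900GB of usable space, and it uses a toy four-byte stripe to show how XOR parity allows the contents of one failed disk to be rebuilt. The function names are illustrative only, and a real controller also rotates the parity block between disks, which is not shown.

```python
from functools import reduce


def raid5_usable_gb(n_disks: int, disk_gb: int) -> int:
    """Usable capacity of one RAID5 set: one disk's worth of space holds parity."""
    if n_disks < 3:
        raise ValueError("RAID5 needs at least three disks")
    return (n_disks - 1) * disk_gb


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


print(raid5_usable_gb(6, 180))  # six 180GB disks

# One toy stripe across six disks: five data blocks plus their parity block.
data = [bytes([i] * 4) for i in (1, 2, 3, 4, 5)]
parity = xor_blocks(data)

# Simulate the loss of the disk holding data[2] and rebuild it from the
# four surviving data blocks together with the parity block.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[2])
```

The same XOR relation explains why a second simultaneous disk failure cannot be recovered: with two blocks missing from a stripe, the single parity block no longer determines either of them.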
As of October 2002, a number of filesystems are being served both via
NFS and Samba, and migration of filesystems currently hosted on the
older servers will take place progressively.
At the moment this system is supplementing rather than replacing the
AlphaServers, and the following systems continue to be in use:-
A VMS cluster consisting of an AlphaServer 1200 and AlphaStation
500, providing access to about 1.5TB of RAID disk storage, with a TL891
DLT tape robot. The storage is served to the rest of the laboratory
systems using various network protocols (VMScluster, NFS, SMB and
AppleTalk). This also runs network services for mail, POP, Imap2, DHCP
and the laboratory web server. The inherent security of the VMS
architecture gives greater protection against external disruption to
these services.
A 4-processor Compaq AlphaServer 4100 running Compaq Unix, which
provides the main computational power, serves about 50GB of storage,
and acts as the NIS master node.
1.1.2 Laboratory Network.
The laboratory network continues to be based on a 100Mbit/sec Fast
Ethernet network. The only significant addition this year has been a
3Com 4900 12-port Gigabit Ethernet switch. This initially provides
connectivity between the Proliant server nodes and the other network
switches, with of course the latter links still operating at
100Mbit/sec, but will provide connectivity for Gigabit links for
further servers and upgraded peripheral switches in the future.
An Uninterruptible Power Supply (UPS) has been installed to supply
power to the main network components in the computer room (switches and
media converter for the University backbone link). This protects
against mains power surges and dropouts, and also provides battery backup
power for about 15 minutes in case of power failure. This has been invaluable
during the series of power disconnections in the building for testing,
etc., as the equipment can continue to run while the power source is
changed to a temporary supply.
The thinwire network has finally been disconnected, and the other older
networks (LocalTalk, FDDI) serve rather few nodes: LocalTalk connects
just two older printers, although the FDDI provides a useful addition to
the network capacity to the AlphaServers.
1.1.3 Reorganising NIS services and NFS mounting.
A vital component of the network services is the NIS service, which
provides password information and other network information to all Unix
and Linux client systems. For several years various SGI workstations
were being used as both master and slave servers. In Feb 2002 this
service was migrated to the AS4100 as master and the dual Linux servers
as slaves, a rather fiddly process, as it had to ensure that information
was supplied to clients without interruption and that e.g. password
changes were communicated to the correct servers. This was fortuitously performed
just a month before the system disk on the previous SGI master failed!
In addition, a set of NIS maps was produced to distribute the NFS
automounting information automatically to all clients. This
information relates network filesystem names to the appropriate server
hosting the disk and the mount point to use, and had previously been
maintained separately on each client. After a one-off reconfiguration
of the automounter on each client, we are now assured that the disk
mounting information is consistent on all clients, and further changes
and additions to NFS disk mounting information can be propagated by a
single change to the NIS maps. This is invaluable as new disk services
are set up and existing ones migrated between servers.
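As an illustration of what such a map contains, the Python sketch below parses entries in the common automounter indirect-map layout of `key [-options] server:/path`. The map contents, host names and paths are invented for the example and are not the Laboratory's actual maps; the function `parse_automount_map` is likewise hypothetical and handles only this simple single-location form.

```python
def parse_automount_map(text):
    """Parse 'key [-options] server:/path' lines into a dictionary."""
    entries = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        key = fields[0]
        # Mount options are optional and, when present, start with '-'.
        options = fields[1][1:] if len(fields) == 3 else ""
        server, path = fields[-1].split(":", 1)
        entries[key] = {"options": options, "server": server, "path": path}
    return entries


# Invented example map: every name below is an assumption.
SAMPLE_MAP = """\
# key      options     location
projects   -rw,hard    fileserv1:/export/projects
scratch                fileserv2:/export/scratch
"""

automounts = parse_automount_map(SAMPLE_MAP)
for name, entry in sorted(automounts.items()):
    print(name, "->", entry["server"] + ":" + entry["path"])
```

Because every client reads the same map from NIS, migrating a filesystem to another server amounts to editing the location field of one entry and rebuilding the map, rather than editing a file on each client.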
There are now nearly 200 host addresses registered to the Laboratory
network, which includes a wide variety of systems:- the central servers
as described above; Mark Sansom's group's PC cluster; a large number of
desktop systems, including Intel systems running either Windows or
Linux, and Apple Macintosh computers; personal laptop computers; and a
number of systems dedicated to specific tasks, such as control of the
Area Detectors and the Optronics scanner.