Current Status of the PPD Linux Systems

Current - 30/03/2010

There was a problem with the site-bdii last night after we moved it to new hardware yesterday. This was resolved at about 9:30 this morning (off-site services may have taken a while to pick up the updated information).

One of the CEs, heplnx207, had problems this morning with a large number of stuck processes. Several services were restarted at about 13:30 and the service should be settling down.

We are still running the home and experiment file systems on spare hardware after the failure of both main servers within a couple of weeks of each other.

10/03/2010

A copy of the home file system from just before the server crashed has been placed on a fresh server and made available (read-only) on the interactive machines (heplnx101-109). You should be able to find your files at the following location, from where you can copy them to wherever they are needed:

/net/restore/heplnx162restore/home/your_user_name
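
For anyone who would rather script the copy than do it by hand, the following is a minimal Python sketch. The restore path is the one given above; the function name `copy_from_restore` and its arguments are hypothetical and chosen only for illustration. It assumes you want to pull a single file or directory back and would rather fail than overwrite anything that already exists at the destination.

```python
import shutil
from pathlib import Path

# Restore location quoted above; the per-user directory sits directly beneath it.
RESTORE_ROOT = Path("/net/restore/heplnx162restore/home")


def copy_from_restore(user, relative_path, destination, restore_root=RESTORE_ROOT):
    """Copy one file or directory out of the read-only restore area.

    The copy is placed inside `destination` under its original name.
    Nothing that already exists at the destination is overwritten.
    """
    source = Path(restore_root) / user / relative_path
    target = Path(destination) / Path(relative_path).name
    if not source.exists():
        raise FileNotFoundError(f"not found in restore area: {source}")
    if target.exists():
        raise FileExistsError(f"refusing to overwrite: {target}")
    if source.is_dir():
        # Recursively copy the whole directory tree.
        shutil.copytree(source, target)
    else:
        # Copy a single file, preserving timestamps where possible.
        shutil.copy2(source, target)
    return target
```

As an example, `copy_from_restore("your_user_name", "thesis", Path.home())` would copy a (hypothetical) `thesis` directory from the restore area into your current home directory, and raise an error rather than replace an existing `thesis` directory there.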

05/03/2010

The backup home file server and experiment software area server failed with errors on its system disk today, which caused the interactive services heplnx101-heplnx109 to be unavailable for most of the day.

We've now restored the services onto a third server and will set up an extra backup server on Monday.

These two servers will remain in production while we investigate the problems on the main servers.

26/02/2010

PPD Linux systems are running "at Risk" on backup air conditioning. However, the current setup appears to be performing well, and Estates and the PPD Computing Group are fairly confident that we will be able to lift the "at Risk" on Monday or Tuesday.

  • Most Services are up and running normally
  • 52 Worker Nodes in Lab 8 (208 batch slots) have been shut down - This includes all the SL4 batch nodes
  • We had problems overnight with one worker node not accepting jobs, which caused a large number of jobs to be lost
  • The main home file server for Linux is being rebuilt; the home file system is currently being served off a backup server

25/02/2010

PPD Linux systems are currently running at risk with reduced capacity while additional temporary air conditioning is installed in R1 Lab 8.

  • Interactive Linux nodes heplnx103, heplnx105, heplnx106, heplnx107 and heplnx108 have been shut down to reduce the heat load
  • A number of disk storage nodes have been shut down
    • Grid and Local Batch jobs have been paused because of this
    • Some files will be unavailable; however, it should still be possible to write new files
  • hepcvs has been shut down
  • Three of the four TCAD nodes have been shut down
  • 52 Worker Nodes in Lab 8 (208 batch slots) have been shut down - This includes all the SL4 batch nodes

Last 24 hour status according to WLCG SAM tests

Jobs currently running/queued on the farm

Future planned interventions affecting the PPD Linux Systems

%CALENDAR{topic="PPDLinuxStatusEvents" showweekdayheaders="1" width="90%" cellheight="100"}%

%CALENDAR{topic="PPDLinuxStatusEvents" month="+1" showweekdayheaders="1" width="90%" cellheight="100"}%

-- ChrisBrew - 2010-02-25


Topic revision: r6 - 2010-04-30 - ChrisBrew
 