Data Acquisition Error Recovery

This document describes how to recover from fatal errors in the MIDAS data acquisition system. Note that it applies only to complete crashes where nothing else works. For smaller problems, please refer to the MIDAS manual.

Restarting clients

Most problems can be solved by restarting various programs. The ones running on the backend computer can be restarted by exiting the current programs with "!", then restarting them by typing their name. On the "Ana & Log" Page there are currently the programs

mlogger, lazylogger, mhttpd and analyzer

On other pages there might be instances of odbedit, which can be stopped with "quit".

The two frontends down at the area are also stopped with "!", or with Ctrl-C if this doesn't work. If they are frozen completely (and only then!), they need a "hard reset" via the main PC power switch. To restart them, double-click on their frontend icon after rebooting. You have to switch the screen between the two computers at the PC switch box.

Once all programs have been stopped, start a new odbedit and enter "cleanup" to remove hanging clients from the ODB. Then all programs can be restarted.

Trouble with terminating clients

Normally the "kill -9 pid" command kills the offending process. This can be verified with the "ps alx" command.

However, in some cases, when you subsequently go into odbedit and check for surviving clients by typing "scl", you find a loose client still "hanging on".

As stated above, this is normally remedied by typing "cleanup" inside odbedit. However, there are times when the "cleanup" command does not work and instead hangs indefinitely. According to SR, this appears to be caused by confused communication with the various subprocesses.

More often than not, in my experience (DP), this condition did not require a full recreation of the ODB. I recommend ignoring the hanging client and trying to restart all of the processes in an orderly way. Usually, this procedure takes care of the superfluous "hanging" client.

If the error message is "rpc timeout"

If the alarm is 'FE inactive' and one encounters a combination of the following errors:

'rpc timeout' on the screen of either frontend computer.
None of the frontend programs can be terminated with '!'.
'cleanup' in odbedit hangs odbedit.

one should do the following:

Kill both frontends with Ctrl-C, or click the 'x' at the right corner of the program window.
Kill mlogger and analyzer while keeping an eye on the odbedit window. After these two programs have been killed, odbedit should be able to recover.
If it does, restart all the programs you just stopped or killed; the system has been fixed.
If not, keep killing other programs; essentially one has to do a 'minor recovery'.

If this does not work

In rare cases the ODB can be corrupt. In this case it has to be re-created from scratch. Try to save the current version in odbedit with

[local]> save tmp.odb

This creates an ASCII version of the ODB which can later be used to recreate the ODB. If this does not work, you can later load a recent ODB file from one of the last runs at /data/runxxx.odb. Then delete the ODB at the UNIX prompt with:

$ cd ~/online
$ rm .ODB.SHM

This deletes the disk backup of the database. Now check if the shared memory still exists with:

$ ipcs

------ Shared Memory Segments --------
key       shmid     owner     perms     bytes     nattch    status
0x4e4c4e4f 1664      pibeta    666       20000000  1
0x4d013763 1537      pibeta    666       2027708   8
0x4d01376a 1538      pibeta    666       109532    8
0x4d01376b 1539      pibeta    666       1058108   4

------ Semaphore Arrays --------
key       semid     owner     perms     nsems     status
0x4d013684 1280      pibeta    666       1
0x4d013760 1281      pibeta    666       1
0x4d013763 1282      pibeta    666       1
0x4d01376a 1283      pibeta    666       1
0x4d01376b 1284      pibeta    666       1
0x4d013862 1285      pibeta    666       1

------ Message Queues --------
key       msqid     owner     perms     used-bytes  messages

The Semaphore Arrays and Message Queues are not relevant for this discussion. To understand which entry refers to what segment of the shared memory, it is best to go to

 /home/pibeta/online

and list the .*.SHM files:

[pibeta@pc2106 ~/online]$ ls -l .*SHM
-rw-r--r--   1 pibeta   users           0 May  3 21:04 .ALARM.SHM
-rw-r--r--   1 pibeta   users           0 May  3 21:04 .ELOG.SHM
-rw-r--r--   1 pibeta   users      527708 May 10 17:22 .Hl.SHM
-rw-r--r--   1 pibeta   users           0 May  4 21:46 .LAZY.SHM
-rw-r--r--   1 pibeta   users     2027708 Jun 15 11:28 .ODB.SHM
-rw-r--r--   1 pibeta   users      109532 Jun 13 12:45 .SYSMSG.SHM
-rw-r--r--   1 pibeta   users     1058108 Jun 13 12:45 .SYSTEM.SHM
-rw-r--r--   1 pibeta   users       27708 May 22 13:01 .?2.SHM

In the above example it is clear that the segment with size 2027708 corresponds to the ODB, the one with 109532 bytes to the system message buffer, and the one with 1058108 to the system event buffer. Not appearing above is the PAWC shared memory segment, which is the first one listed by "ipcs" with size of 20 million bytes. Thus, the shared memory region with 2027708 belongs to the ODB and must be deleted with

$ ipcrm shm [id]

where [id] is 1537 in the above case. You might have to log in as root to do that (type "su -" and then the same password as pibeta). After the memory has disappeared (check with ipcs), the ODB can be recreated with:

$ odbedit -s 2000000
[local]/>load tmp.odb   (or /data/runxxx.odb)
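
If tmp.odb could not be saved, the most recent run file has to be picked by hand. A minimal sketch of that lookup, assuming the run files match the pattern run*.odb in /data as the text suggests; latest_odb is a hypothetical helper name:

```shell
# Print the most recently modified run*.odb file in a directory.
# latest_odb is a hypothetical helper; /data is the directory named in the text.
latest_odb() {
    dir=${1:-/data}
    # Expand the pattern into the positional parameters of this function.
    set -- "$dir"/run*.odb
    # If nothing matched, the pattern is left unexpanded and names no file.
    if [ ! -e "$1" ]; then
        echo "no run*.odb file found in $dir" >&2
        return 1
    fi
    # ls -t sorts by modification time, newest first.
    ls -t "$@" | head -n 1
}
```

Running "latest_odb /data" prints one file name, which can then be given to the "load" command inside odbedit. The sketch selects by modification time, not by run number, so it is worth checking that the printed file really belongs to a recent good run.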

[Note: if one does not intend to run the DSC in RAW mode, it is enough to start "odbedit -s 1000000". However, for normal running with the DSC we want to have this option available, so we always start the ODB with a size of 2 million.]
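
The matching of the .ODB.SHM file size against the "ipcs" listing described earlier in this section could also be done by a short script. The following is a sketch under stated assumptions: shmid_for_file is a hypothetical helper name, and it assumes the Linux column layout shown in the listing above (key, shmid, owner, perms, bytes, nattch, status).

```shell
# Print the shmid of every shared memory segment whose size in bytes
# equals the size of the given .SHM file.
# shmid_for_file is a hypothetical helper name.
# Assumes the "ipcs" column layout shown in this document:
# key shmid owner perms bytes nattch status.
shmid_for_file() {
    file=$1
    # Size of the disk copy in bytes.
    size=$(wc -c < "$file") || return 1
    # Read an "ipcs -m" style listing on standard input and keep only
    # data lines (key starts with 0x) whose bytes column matches.
    awk -v size="$size" '$1 ~ /^0x/ && $5 == size { print $2 }'
}
```

It would be used as "ipcs -m | shmid_for_file ~/online/.ODB.SHM" before the .ODB.SHM file is removed, since the file size is what identifies the segment. With the listing shown above this prints 1537. If two segments happen to have the same size, both ids are printed and the choice has to be made by hand.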

After that, all other programs can be restarted as described in the previous section. Once all programs are running (no red sections in the Web status page), a new run can be started.

Backend PC reboot

The Backend Linux PC should never be rebooted. Hanging programs can be stopped with "kill -9 [pid]", and the X-Server can be restarted with Alt-Ctrl-Backspace. If the X-Server is completely frozen, one can switch to a different console by pressing Alt-Ctrl-F1, log in as root, then go to runlevel 3 with "telinit 3", then back to 5 with "telinit 5". This should restart the X-Server.

If the PC still has to be rebooted (for example due to SCSI problems or a power failure), it rewinds the tape. In order not to overwrite the tape with the next run, it has to be spooled to the end of data with

$ mt -f /dev/nst0 seod (upper drive)
$ mt -f /dev/nst1 seod (lower drive)

which can take a while. The next run will then be appended at the end of the previous data.
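
Since a forgotten drive means overwritten data, the spooling of both drives could be wrapped in one function that stops at the first failure. This is a sketch only: spool_to_end and the MT variable are hypothetical names introduced here, and the device is passed with the -f option, which is how the Linux mt command normally selects a tape device.

```shell
# Space each given tape device to the end of recorded data.
# MT names the tape command so that a dry run is possible (MT=echo);
# spool_to_end and MT are hypothetical names introduced for this sketch.
MT=${MT:-mt}

spool_to_end() {
    for dev in "$@"; do
        echo "spooling $dev to end of data"
        # -f selects the tape device; seod is the operation used in the text.
        if ! $MT -f "$dev" seod; then
            echo "spooling $dev failed" >&2
            return 1
        fi
    done
}
```

It would be called as "spool_to_end /dev/nst0 /dev/nst1". Setting MT=echo beforehand only prints the commands instead of moving the tapes, which is a safe way to check the device names first.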

-----

A logbook entry describing a reboot and recovery can be found in the Elog under August 17/18, 2000.

Whenever you have to reboot the Backend PC, be careful about the order in which you start the clients. Always begin by winding the tape to the end of the data; whatever happens after that, you cannot overwrite your data. In the second step, start odbedit and load your ODB file. Once this is done, you can start mhttpd to get all the other clients running.

The frontends should be started from their own machines to make sure that they start up properly without error messages.

Restarting WebPAW

Sometimes WebPAW hangs on `pc2106'. You can restart it from a normal window on `pc2106' with:

$ cd ~/online
$ setenv DISPLAY :2
$ webpaw -D -p 8080

Notes:

(1) If you don't start webpaw in the `online' directory, it won't find the PAW macros.

(2) After each reboot of `pc2106' you have to start the VNC server before you can start WebPAW:

$ vncserver

Operating PC809 And PC812 Remotely

PC812: Trigger Frontend

If the trigger frontend is frozen and has to be restarted remotely, there are two ways to achieve this:

Through rsh

Log onto pc2106.
Type "rsh pc812 ps" and find the PID of the trigger frontend.
Type "rsh pc812 'kill PID'", where PID is the value obtained from the previous command.
Type "rsh pc812 frontend.exe".
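
Picking the PID out of the remote process listing could be sketched as follows. This is an illustration under a loud assumption: the listing returned by pc812 is taken to have one header line and the PID in the first column, as plain Unix "ps" output does; the actual format on pc812 may differ. pid_of is a hypothetical helper name.

```shell
# Print the PID of each process whose listing line contains a name.
# Reads a "ps" style listing on standard input.
# Assumption: one header line, PID in the first column, as in plain
# Unix "ps" output; the listing format on pc812 may differ.
pid_of() {
    awk -v name="$1" 'NR > 1 && index($0, name) { print $1 }'
}
```

On pc2106 it would be used as "rsh pc812 ps | pid_of frontend", and the printed number would then go into the "rsh pc812 'kill PID'" step. If more than one line matches, several PIDs are printed and the right one has to be chosen by hand.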

Through VNC

Log onto pc2106.
Type "vncviewer pc812:0".

PC809: Slow Control Frontend

Currently the only way to restart the slow control (SC) frontend remotely is via vncviewer.
First log onto pc2106 and then type
vncviewer pc809:0

S. Ritt, August 13, 1999.