***********************************************************
                   Locate the Panelmate

I made a little movie on how to locate the panelmate.  If anyone calls
about no light, have them type:

cd
animate panelmate.gif

-James
*********************************************
                      TOUCH SCREEN 
touch.com is rsh'd to graphics1 from the server
killall rsh will KILL touch.com, so BEWARE!

To restart the touch screen software, first check that it's NOT running
(ps -ef | grep touch), and kill if needed. Make sure you're on graphics1.

To restart and append messages to the existing touch.log, enter this command:
/programs/beamline/touch.tcl >>& /data/log/touch.log &
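
One thing to watch with the "ps -ef | grep touch" check is that the grep usually lists itself, which can look like a running touch process. The sketch below shows one way to filter that out. It is written in Python only for illustration; the helper name find_pids, the usual eight-column "ps -ef" layout, and the sample listing (including the PIDs and the wish path) are all made up and not taken from graphics1.

```python
def find_pids(ps_output, name):
    """Return PIDs of processes whose command line mentions `name`.

    Assumes the usual `ps -ef` columns: UID PID PPID C STIME TTY TIME CMD.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:      # skip the header row
        fields = line.split(None, 7)             # CMD keeps its own spaces
        if len(fields) < 8:
            continue
        command = fields[7]
        # ignore the grep that was used to search for the process
        if name in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


# Invented listing, for illustration only.
sample = """UID        PID  PPID  C STIME TTY          TIME CMD
mcfuser   4211     1  0 09:12 ?        00:00:03 /usr/bin/wish /programs/beamline/touch.tcl
mcfuser   5120  5001  0 10:40 pts/2    00:00:00 grep touch
"""
print(find_pids(sample, "touch"))   # [4211]
```

An empty list would mean it is safe to start the touch screen software again.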

********************************************************
                             WATCHDOG

If watchdog.com spots an "error" it will grab control of blu-ice and make blu-ice go "passive".  
The user will have to click on "passive" to make it "active" again.

8.3.1 ~21:50 5-29-03

The user called to complain about a "weird image".  I looked at it over
the video camera and saw the unmistakable outline of the backlight
shadowing the diffraction pattern.  I told her to tap the polarizer
button on the touch screen to retract the polarizer and do the image
again.  That worked.

This does not normally happen.  Normally, a program called
"watchdog.tcl" running on bl831 checks for certain conditions and takes
corrective action:
polarizer in and shutter open      --> retract polarizer
collimator down and shutter open   --> close shutter
collecting and Izero < 0.1 nA      --> pause collection, back up one image
paused and Izero >= 0.1 nA         --> unpause collection
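
The rule list above can be sketched as a small lookup function. This is not the code in watchdog.tcl, just a Python illustration of the same conditions. The function name, the dictionary keys, and the idea that the rules are checked top to bottom with the first match winning are all assumptions.

```python
def watchdog_action(state):
    """Return the corrective action for a beamline state, or None if all is well.

    `state` is a hypothetical dict; the key names are illustrative only.
    Rules are tried in the order listed (an assumption about priority).
    """
    if state["polarizer_in"] and state["shutter_open"]:
        return "retract polarizer"
    if state["collimator_down"] and state["shutter_open"]:
        return "close shutter"
    if state["collecting"] and state["izero_nA"] < 0.1:
        return "pause collection, back up one image"
    if state["paused"] and state["izero_nA"] >= 0.1:
        return "unpause collection"
    return None


# The "weird image" case: polarizer left in while the shutter is open.
weird_image = {
    "polarizer_in": True, "shutter_open": True, "collimator_down": False,
    "collecting": True, "paused": False, "izero_nA": 5.0,
}
print(watchdog_action(weird_image))   # retract polarizer
```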

In this case, "watchdog.tcl" had hung.  This seems to be a problem with
tcl on RedHat 7.2 when tcl is launched from an rsh.  "watchdog.tcl" is
new, and I haven't figured out how to prevent this yet.  If it happens
again, you can restart it like this:

ssh -l mcfuser bl831.als.lbl.gov
bl831% killall -9 watchdog.tcl
bl831% watchdog.tcl >>& /data/log/watchdog.log &
bl831% exit

-James
*****************************************************************
                           adxv
This is not a problem anymore, but I've left it for the record...

adxv runs on the server, but displays on graphics1 using an X11 window.
Killing adxv on graphics1 leaves an ssh process running which uses LOTS 
of CPU time.  Kill this ssh process on graphics1 to free up the CPU.

************************************************************
         GO: Green Light: CURRENT AMPLIFIERS and I-ZERO
When the beam came back up after a dump I noticed that blu-ice didn't
resume collecting on 8.3.1.  The reason was that the amplifier signal
from Izero was pegged at 7V or so, and autoscale wasn't working.  The
watchdog.tcl program was trying to read Izero and getting "Inf".  When
this happens, watchdog.tcl will turn on autoscale and wait for a
reasonable value, but autoscale wasn't autoscaling, so watchdog.tcl
would wait forever before resuming data collection.
I tried fiddling with the scaling manually, but couldn't control the
Izero amplifier settings from Labview.  I restarted the amp driver,
restarted Labview, and power-cycled the amplifier itself, and that
didn't change anything.  I then power-cycled the ethernet-to-serial
bridge connected to the amp, restarted the amp driver and then Izero
started autoscaling normally.
-James
**********************************************************************
                              LANTRONIX BOXES
The Lantronix box on one of 8.3.1's pmacs flaked out for a few minutes
today.  You can tell when this has happened by the red message "hardware
server pmac1 is offline" in blu-ice.  The solution was tapping on the
ribbon connector in the pmac's serial port.
*****************************************************************
                              no_hw_host
BLU-ICE refused to start a data collection with the "no_hw_host" error 
message.  This can mean that the Lantronix box flaked out, but 
this time it was the "self" hardware emulator that went offline.  Ken 
and I will be investigating this today.
*************************************************************************
At 02:17:53am the detector reported simultaneous retry errors from all three slaves.

grep retry /data/log/det_api_workstation.log

output_detcmd: DONE (retry ) at          972543: stopw (  7)   937384
output_detcmd: DONE (retry ) at          972543: stopw (  8)   937385
output_detcmd: DONE (retry ) at          972544: stopw (  9)   937385
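
If there are many of these lines, it can help to pull the numbers out and see how close together the retries are. The Python sketch below does that for the three lines above. The field names (when, command, index, last) are guesses made for illustration; what the detector software means by each number is not known here.

```python
import re

# Matches lines such as:
#   output_detcmd: DONE (retry ) at          972543: stopw (  7)   937384
RETRY_RE = re.compile(
    r"DONE \(retry\s*\) at\s+(\d+):\s+(\w+)\s+\(\s*(\d+)\)\s+(\d+)"
)


def parse_retries(log_text):
    """Return (when, command, index, last) tuples for every retry line.

    The tuple field names are assumptions, not documented meanings.
    """
    entries = []
    for line in log_text.splitlines():
        match = RETRY_RE.search(line)
        if match:
            when, command, index, last = match.groups()
            entries.append((int(when), command, int(index), int(last)))
    return entries


sample = """\
output_detcmd: DONE (retry ) at          972543: stopw (  7)   937384
output_detcmd: DONE (retry ) at          972543: stopw (  8)   937385
output_detcmd: DONE (retry ) at          972544: stopw (  9)   937385
"""
entries = parse_retries(sample)
spread = max(e[0] for e in entries) - min(e[0] for e in entries)
print(len(entries), "retries within", spread, "counter units")
```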

**************************************************************
                     DETECTOR ERROR
Abinav called me to complain that the data collection had stopped and 
the lower-left corner of BLU-ICE was registering a "Detector Error".  
The /data/log/detectorDHS.log ended with a "could not start new thread" 
message.  This same error occurred on:
Mar 10 02:09  #435
Mar 29 14:13  #463
May  4 10:13  #477
Jun  5 21:46  #529

This is a new kind of error, and not the "RETRY" business we have been 
seeing in the past.  Ken and I will have to look into this.  I have a 
feeling the problem is with our heavily overloaded server, and not the 
software itself.

Unfortunately, the only solution we currently have to any "Detector 
Error" is a systemwide nuke.  You do this by typing "nuke" on any 
command line.  After that, you usually need to take a few snapshots to 
clear the module buffers of corrupted dark currents.
****************************************************************
                        COMPUTER CRASH
9-20-03

At around 10:00am on Saturday, the "graphics3" computer on 8.3.1
crashed.  The symptom is that every linux computer appears to "hang".
You won't be able to "ls" or collect data or anything until graphics3
comes back up.
Brian called me about this at ~10:40am, and I think I got back to him
at ~11:20am.  He power-cycled graphics3 at ~11:23am.

FYI: the network infrastructure at 8.3.1 is documented at:
http://bl831.als.lbl.gov/~jamesh/beamline/network.html

If "server" goes down, then /data will stop responding to "ls", etc. and
you won't be able to collect data.  You will have to run a "nuke" to
start the data collection software after rebooting "server".

If "graphics1" or "graphics2" goes down, then the system suffers few ill
effects.  Data collection will keep running without the blu-ice GUI.
When you restart graphics#, just log back in and type "go" to resume
monitoring the data collection.


The last time we had to reboot graphics3 was Sept 3, and this was when
an unexpected screen saver was running.  I'm suspicious that some screen
saver is just crashing linux (regardless of whether HKL is running or
not).  I'm going to disable all screen savers on 8.3.1 and see if that
clears things up.

-James
******************************************************************************
                           Data Disk Full
~3:00am 8-3-03

The /data disk filled up on bl831.  This creates a wide variety of
problems, some more obvious than others.  For example, all of the
control system programs keep logs in /data/log, so the control system
tends to get hung up on these output pipes...
In general, a "nuke" needs to be run when /data fills up.

This is largely my fault for not staying on top of things.  It looks
like, in my absence, our average data collection rate has doubled to
~2.0Gb/hour. 

If this ever happens when I'm not around, please try to keep the users
from deleting any data.  It is important for me to keep a faithful
systemwide backup.  Instead, please free up space on /data by moving
files to /data2 and collecting to /data2/mcfuser when /data is full.
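
A check along these lines could give some warning before /data actually fills. This is only a Python sketch: the 90% limit and both function names are assumptions, and nothing like this is known to be running on bl831.

```python
import shutil


def percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def choose_dir(used_pct, limit=90.0,
               primary="/data", fallback="/data2/mcfuser"):
    """Pick where to collect: the fallback once the primary is nearly full.

    The 90% default limit is an arbitrary choice for illustration.
    """
    return fallback if used_pct >= limit else primary


print(choose_dir(95.0))   # /data2/mcfuser
print(choose_dir(40.0))   # /data
```

On the beamline the call would look like choose_dir(percent_used("/data")).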

-James
********************************************************************************
                       Resetting the "Remote Detector Op"
05:38am 10-4-03

The detector Module 1/2 started to register slow reads and finally stopped.  
Resetting the "Remote Detector Op" with:

foreach module ( detector1 detector2 detector3 detector4 )
    # send the reset signal
    echo -n "$module restart "
    echo "restart" | sock_exchange.tcl $module 8038 1
    echo ""
end

reported that each module reset itself "OK".  After I restarted the detector DHS
(which didn't seem to know how to bring itself back online),
Abinav reported that things were working again.  
This was "nuke" number 602 for those who want to review the logs.

This is a new one on me.  There were no module "retry" errors at all this time.  
I haven't done Chris's latest upgrade yet, so hopefully this will turn out to be the problem.

-James

*************************************
                             Couldn't open Hutch Door

looking into /data/log/pmac1DHS.tcl ...

between 00:50:56 and 00:51:37, they tried seven times to move the
collimator up with the door open.

at Jun 26 00:51:47 they shut the door, which started the detector moving
back to 421mm and the collimator up to -25 mm.

at Jun 26 00:51:48 the hutch door locked when the detector distance got
closer than 550mm

at Jun 26 00:51:52 , they pushed the "Abort" button with the detector at
505.3 mm and the collimator motor at -95.7 mm.

between 00:53:55 and 01:10:58 there were 38 requests to open the door.
However, the door button does not trigger a detector move to 600mm if the
detector is already at > 500mm, and the door will remain locked if the
detector is at <550mm.

at Jun 26 01:09:20 they took a snapshot, which brought the detector to
200mm.

at Jun 26 01:10:58 they requested the door open again, and this time the
detector made it all the way to 600mm, triggering the door to unlock at
550mm.

Five more mm and we would have never noticed it.

The problem was in my DHS.  Setting the "detector retract" position to be
closer than the "unlock door" position created a logic trap where the
door can get stuck like this.

I have now moved the detector retract cutoff to 590mm, so this should
never happen again.
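
The trap is easy to see in a toy model. The Python sketch below is not the DHS code; the function name and argument names are made up, and it only encodes the three numbers described above (retract cutoff, unlock distance of 550mm, retract target of 600mm).

```python
def door_opens(distance_mm, retract_cutoff_mm,
               unlock_mm=550.0, retract_to_mm=600.0):
    """Model one press of the door button; return True if the door unlocks.

    The button only triggers a retract move when the detector is at or
    inside the cutoff, and the door stays locked below `unlock_mm`.
    """
    if distance_mm <= retract_cutoff_mm:
        distance_mm = retract_to_mm      # detector retracts all the way
    return distance_mm >= unlock_mm


# Detector aborted at 505.3 mm, as in the log above.
print(door_opens(505.3, retract_cutoff_mm=500.0))   # False: old cutoff, stuck
print(door_opens(505.3, retract_cutoff_mm=590.0))   # True: new cutoff
print(door_opens(200.0, retract_cutoff_mm=500.0))   # True: after the snapshot
```

With the old 500mm cutoff, any detector position between 500mm and 550mm gets no retract move and no unlock. With the cutoff at 590mm, which is beyond the 550mm unlock distance, there is no such gap.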

Incidentally, the "park" button in BLU-ICE or on the touch screen would
have circumvented this problem.  Also, if all else fails, crashing off the
motor power with the BIG RED BUTTON will always unlock the door.

-James
****************************************************
                    Collimator Stage Problem

I got a call from Ho this morning that "realign.com" was not working and
the stage was refusing to go into the hi-mag position.

I'm not sure how it happened, but the problem was that the stage "up"
position had been recalibrated to -25 mm.  It is normally around -0.35
mm.  To fix this, I checked the history of stage "up" calibrations:

grep stage /data/log/change.log | tail -2
Nov 12 04:34:17 2003 energy: 11111.0 Hdiv: 2.0 Vdiv: 0.3 Iring: 288.766
Iin: 0.7278 Iout: 0.3538 Izero: 8.2800 bl831 stage zero -0.336600
Nov 12 07:49:43 2003 energy: 11111.0 Hdiv: 2.0 Vdiv: 0.3 Iring: 228.98
Iin: 0.5774 Iout: 0.2802 Izero: 6.6689 bl831 stage zero -25.034900

and then corrected the stage position like this:
stage.com -0.3366
stage.com save

You can also check/edit the contents of the file:
/data/calibrations/stage_zero.txt

for the current "up" position of the stage.
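
Finding the last sensible calibration by eye works for two entries, but for a longer history something like the Python sketch below could do it. The function names and the 0.1 mm tolerance are assumptions; the nominal value of -0.35 mm and the sample text are taken from the entries above.

```python
import re

STAGE_RE = re.compile(r"stage zero (-?\d+(?:\.\d+)?)")


def stage_zero_history(log_text):
    """All recorded stage zero values, oldest first."""
    return [float(value) for value in STAGE_RE.findall(log_text)]


def last_plausible(values, nominal=-0.35, tolerance=0.1):
    """Most recent value within `tolerance` mm of nominal, or None.

    The tolerance is an arbitrary choice for illustration.
    """
    for value in reversed(values):
        if abs(value - nominal) <= tolerance:
            return value
    return None


sample = """\
Nov 12 04:34:17 2003 energy: 11111.0 Hdiv: 2.0 Vdiv: 0.3 Iring: 288.766
Iin: 0.7278 Iout: 0.3538 Izero: 8.2800 bl831 stage zero -0.336600
Nov 12 07:49:43 2003 energy: 11111.0 Hdiv: 2.0 Vdiv: 0.3 Iring: 228.98
Iin: 0.5774 Iout: 0.2802 Izero: 6.6689 bl831 stage zero -25.034900
"""
history = stage_zero_history(sample)
print(history)                   # [-0.3366, -25.0349]
print(last_plausible(history))   # -0.3366
```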

-James


**********************************************
Shortest oscillation is 0.1s.  This is a software limit, 
and I can set things up to go shorter if you like.  
I have done more than 10 degrees in 0.1s.  
Reproducibility does not seem to be a problem.

The minimum slit width is, well, zero.  
The beam will probably disappear somewhere around 0.002 mrad.  
It is not possible to "crash" the slits together since they are staggered.  
You can go to small negative slit widths if you want to really tweak the intensity.  
However, I usually stick in the Al attenuator if I need to go below 0.02 mrad.