***********************************************************
Locate the Panelmate

I made a little movie on how to locate the panelmate. If anyone calls about no light, have them type:

    cd
    animate panelmate.gif

-James

*********************************************
TOUCH SCREEN

touch.com is rsh'd to graphics1 from the server.
killall rsh will KILL touch.com, so BEWARE!

To restart the touch screen software, first check that it's NOT running (ps -ef | grep touch), and kill it if needed. Make sure you're on graphics1. To restart and append messages to the existing touch.log, enter this command:

    /programs/beamline/touch.tcl >>& /data/log/touch.log &

********************************************************
WATCHDOG

If watchdog.com spots an "error" it will grab control of blu-ice and make blu-ice go "passive". The user will have to click on "passive" to make it "active" again.

8.3.1 ~21:50 5-29-03

The user called to complain about a "weird image". I looked at it over the video camera and saw the unmistakable outline of the backlight shadowing the diffraction pattern. I told her to tap the polarizer button on the touch screen to retract the polarizer and take the image again. That worked.

This does not normally happen. Normally, a program called "watchdog.tcl" running on bl831 checks for certain conditions and takes corrective action:

    polarizer in and shutter open     --> retract polarizer
    collimator down and shutter open  --> close shutter
    collecting and Izero < 0.1 nA     --> pause collection, back up one image
    paused and Izero >= 0.1 nA        --> unpause collection

In this case, "watchdog.tcl" had hung. This seems to be a problem with tcl on RedHat 7.2 when tcl is launched from an rsh. "watchdog.tcl" is new, and I haven't figured out how to prevent this yet.
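As an illustration only, the four watchdog rules listed above can be written as a small decision function. This is a sketch in Python, not the actual watchdog.tcl: the names `BeamlineState` and `watchdog_action` are hypothetical, and trying the rules in the order they are listed is an assumption (the notes do not say which rule wins if several apply).

```python
from dataclasses import dataclass

IZERO_THRESHOLD_NA = 0.1  # nA, the threshold quoted in the notes


@dataclass
class BeamlineState:
    """Hypothetical snapshot of the conditions the notes say are checked."""
    polarizer_in: bool
    collimator_down: bool
    shutter_open: bool
    collecting: bool
    paused: bool
    izero_na: float


def watchdog_action(state: BeamlineState) -> str:
    """Return the corrective action for one polling pass, or "none".

    Rules are tried in the order they appear in the notes; that
    ordering is an assumption, not something the notes specify.
    """
    if state.polarizer_in and state.shutter_open:
        return "retract polarizer"
    if state.collimator_down and state.shutter_open:
        return "close shutter"
    if state.collecting and state.izero_na < IZERO_THRESHOLD_NA:
        return "pause collection, back up one image"
    if state.paused and state.izero_na >= IZERO_THRESHOLD_NA:
        return "unpause collection"
    return "none"


# The 5-29-03 "weird image": polarizer left in with the shutter open.
weird_image = BeamlineState(
    polarizer_in=True, collimator_down=False, shutter_open=True,
    collecting=True, paused=False, izero_na=5.0,
)
print(watchdog_action(weird_image))
```

With a running watchdog, the 5-29-03 situation would fall under the first rule, which is the same fix the user applied by hand from the touch screen.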
If watchdog.tcl hangs again, you can restart it like this:

    ssh -l mcfuser bl831.als.lbl.gov
    bl831% killall -9 watchdog.tcl
    bl831% watchdog.tcl >>& /data/log/watchdog.log &
    bl831% exit

-James

*****************************************************************
adxv

This is not a problem anymore, but I've left it for the record...

adxv runs on the server, but displays on graphics1 using an X11 window. Killing adxv on graphics1 leaves an ssh process running which uses LOTS of CPU time. Kill this ssh process on graphics1 to free up the CPU.

************************************************************
GO: Green Light: CURRENT AMPLIFIERS and I-ZERO

When the beam came back up after a dump, I noticed that blu-ice didn't resume collecting on 8.3.1. The reason was that the amplifier signal from Izero was pegged at 7V or so, and autoscale wasn't working. The watchdog.tcl program was trying to read Izero and getting "Inf". When this happens, watchdog.tcl will turn on autoscale and wait for a reasonable value, but autoscale wasn't autoscaling, so watchdog.tcl would wait forever before resuming data collection.

I tried fiddling with the scaling manually, but couldn't control the Izero amplifier settings from Labview. I restarted the amp driver, restarted Labview, and power-cycled the amplifier itself, and that didn't change anything. I then power-cycled the ethernet-to-serial bridge connected to the amp and restarted the amp driver, and then Izero started autoscaling normally.

-James

**********************************************************************
LANTRONIX BOXES

The Lantronix boxes on one of 8.3.1's pmacs flaked out for a few minutes today. You can tell when this has happened by the red message "hardware server pmac1 is offline" in blu-ice. The solution was tapping on the ribbon connector in the pmac's serial port.

*****************************************************************
no_hw_host

BLU-ICE refused to start a data collection with the "no_hw_host" error message.
This can mean that the Lantronix box flaked out, but this time it was the "self" hardware emulator that went offline. Ken and I will be investigating this today.

*************************************************************************
At 02:17:53am the detector reported simultaneous retry errors from all three slaves.

    grep retry /data/log/det_api_workstation.log
    output_detcmd: DONE (retry ) at 972543: stopw ( 7) 937384
    output_detcmd: DONE (retry ) at 972543: stopw ( 8) 937385
    output_detcmd: DONE (retry ) at 972544: stopw ( 9) 937385

**************************************************************
DETECTOR ERROR

Abinav called me to complain that the data collection had stopped and the lower-left corner of BLU-ICE was registering a "Detector Error". The /data/log/detectorDHS.log ended with a "could not start new thread" message. This same error occurred on:

    Mar 10 02:09  #435
    Mar 29 14:13  #463
    May  4 10:13  #477
    Jun  5 21:46  #529

This is a new kind of error, and not the "RETRY" business we have been seeing in the past. Ken and I will have to look into this. I have a feeling the problem is with our heavily overloaded server, and not the software itself.

Unfortunately, the only solution we currently have to any "Detector Error" is a systemwide nuke. You do this by typing "nuke" on any command line. After that, you usually need to take a few snapshots to clear the module buffers of corrupted dark currents.

****************************************************************
COMPUTER CRASH
9-20-03

At around 10:00am on Saturday, the "graphics3" computer on 8.3.1 crashed. The symptom is that every Linux computer appears to "hang". You won't be able to "ls" or collect data or anything until graphics3 comes back up. Brian called me about this at ~10:40am, and I think I got back to him at ~11:20am. He power-cycled graphics3 at ~11:23am.
FYI: the network infrastructure at 8.3.1 is documented at:
http://bl831.als.lbl.gov/~jamesh/beamline/network.html

If "server" goes down, then /data will stop responding to "ls", etc. and you won't be able to collect data. You will have to run a "nuke" to start the data collection software after rebooting "server".

If "graphics1" or "graphics2" goes down, then the system suffers few ill effects. Data collection will keep running without the blu-ice GUI. When you restart graphics#, just log back in and type "go" to resume monitoring the data collection.

The last time we had to reboot graphics3 was Sept 3, and this was when an unexpected screen saver was running. I suspect that some screen saver is just crashing Linux (regardless of whether HKL is running or not). I'm going to disable all screen savers on 8.3.1 and see if that clears things up.

-James

******************************************************************************
Data Disk Full
~3:00am 8-3-03

The /data disk filled up on bl831. This creates a wide variety of problems, some more obvious than others. For example, all of the control system programs keep logs in /data/log, so the control system tends to get hung up on these output pipes... In general, a "nuke" needs to be run when /data fills up.

This is largely my fault for not staying on top of things. It looks like, in my absence, our average data collection rate has doubled to ~2.0Gb/hour.

If this ever happens when I'm not around, please try to keep the users from deleting any data. It is important for me to keep a faithful systemwide backup. Instead, please free up space on /data by moving files to /data2, and collect to /data2/mcfuser while /data is full.

-James

********************************************************************************
Resetting the "Remote Detector Op"
05:38am 10-4-03

The detector Module 1/2 started to register slow reads and finally stopped.
Resetting the "Remote Detector Op" with:

    foreach module ( detector1 detector2 detector3 detector4 )
        # send the reset signal
        echo -n "$module restart "
        echo "restart" | sock_exchange.tcl $module 8038 1
        echo ""
    end

reported that each module reset itself "OK".

After restarting the detector DHS (which didn't seem to know how to bring itself back online), Abinav reported that things were working again. This was "nuke" number 602 for those who want to review the logs.

This is a new one on me. There were no module "retry" errors at all this time. I haven't done Chris's latest upgrade yet, so hopefully this will turn out to be the problem.

-James

*************************************
Couldn't open Hutch Door

Looking into /data/log/pmac1DHS.tcl ...

    between 00:50:56 and 00:51:37, they tried seven times to move the collimator up with the door open.
    at Jun 26 00:51:47 they shut the door, which started the detector moving back to 421mm and the collimator up to -25 mm.
    at Jun 26 00:51:48 the hutch door locked when the detector distance got closer than 550mm.
    at Jun 26 00:51:52 they pushed the "Abort" button with the detector at 505.3 mm and the collimator motor at -95.7 mm.
    between 00:53:55 and 01:10:58 there were 38 requests to open the door. However, the door button does not trigger a detector move to 600mm if the detector is already at > 500mm, and the door will remain locked if the detector is at < 550mm.
    at Jun 26 01:09:20 they took a snapshot, which brought the detector to 200mm.
    at Jun 26 01:10:58 they requested the door open again, and this time the detector made it all the way to 600mm, triggering the door to unlock at 550mm.

Five more mm and we would have never noticed it.

The problem was in my DHS. Setting the "detector retract" position to be closer than the "unlock door" position created a logic trap where the door can get stuck like this: with the detector between 500mm and 550mm, the door button does not move the detector, and the door stays locked. I have now moved the detector retract cutoff to 590mm, so this should never happen again.
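To make the trap concrete, here is a minimal sketch of the door logic as described in the log analysis above. It is written in Python for illustration and is not the DHS code: the function names are hypothetical, and the handling of the exact boundary values (a detector sitting at precisely 500mm or 550mm) is an assumption read off the "> 500mm" and "< 550mm" wording.

```python
UNLOCK_DISTANCE_MM = 550.0  # door stays locked below this distance (from the notes)
RETRACT_TARGET_MM = 600.0   # where the door button sends the detector (from the notes)


def door_opens(detector_mm: float, retract_cutoff_mm: float) -> bool:
    """Model one press of the door button.

    The button retracts the detector to RETRACT_TARGET_MM only when the
    detector is at or below retract_cutoff_mm; afterwards the door unlocks
    only if the detector sits at or beyond UNLOCK_DISTANCE_MM.
    """
    if detector_mm <= retract_cutoff_mm:
        detector_mm = RETRACT_TARGET_MM
    return detector_mm >= UNLOCK_DISTANCE_MM


def trapped_range(retract_cutoff_mm: float):
    """Return the open interval (low, high) of stuck distances, or None."""
    if retract_cutoff_mm < UNLOCK_DISTANCE_MM:
        return (retract_cutoff_mm, UNLOCK_DISTANCE_MM)
    return None


# Old cutoff of 500mm: the aborted move left the detector at 505.3mm.
print(door_opens(505.3, retract_cutoff_mm=500.0))
# Same position with the cutoff moved to 590mm.
print(door_opens(505.3, retract_cutoff_mm=590.0))
# After the snapshot brought the detector to 200mm (old cutoff).
print(door_opens(200.0, retract_cutoff_mm=500.0))
```

In this model the trap exists whenever the retract cutoff is below the unlock distance, which is why raising the cutoff past 550mm closes the gap.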
Incidentally, the "park" button in BLU-ICE or on the touch screen would have circumvented this problem. Also, if all else fails, crashing off the motor power with the BIG RED BUTTON will always unlock the door.

-James

****************************************************
Collimator Stage Problem

I got a call from Ho this morning that "realign.com" was not working and the stage was refusing to go into the hi-mag position. I'm not sure how it happened, but the problem was that the stage "up" position had been recalibrated to -25 mm. It is normally around -0.35 mm. To fix this, I checked the history of stage "up" calibrations:

    grep stage /data/log/change.log | tail -2
    Nov 12 04:34:17 2003 energy: 11111.0 Hdiv: 2.0 Vdiv: 0.3 Iring: 288.766 Iin: 0.7278 Iout: 0.3538 Izero: 8.2800 bl831 stage zero -0.336600
    Nov 12 07:49:43 2003 energy: 11111.0 Hdiv: 2.0 Vdiv: 0.3 Iring: 228.98 Iin: 0.5774 Iout: 0.2802 Izero: 6.6689 bl831 stage zero -25.034900

and then corrected the stage position like this:

    stage.com -0.3366
    stage.com save

You can also check/edit the contents of the file:

    /data/calibrations/stage_zero.txt

for the current "up" position of the stage.

-James

**********************************************
Shortest oscillation is 0.1s. This is a software limit, and I can set things up to go shorter if you like. I have done more than 10 degrees in 0.1s. Reproducibility does not seem to be a problem.

The minimum slit width is, well, zero. The beam will probably disappear somewhere around 0.002 mrad. It is not possible to "crash" the slits together since they are staggered. You can go to small negative slit widths if you want to really tweak the intensity. However, I usually stick in the Al attenuator if I need to go below 0.02 mrad.