MANUAL

Chapter 7: Tips and Tricks

Neat-o unix commands

The unix command line

The unix command line is the little sentence you type to run unix programs. The text that appears on each line to the left of what you're typing is the command prompt. This is a very old convention, but it is incredibly powerful. No graphical interface has ever come close to providing the detailed control that a command line gives you. The only problem with command lines is that they almost always have obtuse syntax, and novice users have a really hard time figuring out what to type. This section is intended as a quick guide for crystallographers who want to optimize their unix environment without having to read through all the unix manuals that I have. :)

The unix "shell"

Your "shell" program is probably tcsh (the TENEX C shell, an enhanced version of csh). All a shell program does is accept your typed commands, find and run the unix program you have invoked, and then prompt you for another command. That's about it. However, the shell is equipped with an amazing array of time-saving tools and tricks (once you know how to use them). I have tried to outline the ones most useful to crystallographers here, as briefly as possible, with examples.

X-windows cut-and-paste

If you are sitting in front of a workstation running any kind of unix, then, chances are, you are using X-windows. If you are using X-windows, then you can use your left and middle mouse button to copy and paste (respectively) text between any windows on your screen.

Use the left mouse button to select some text somewhere. Move the mouse pointer into another window, and then click the middle mouse button. The text you selected will then be "typed" into that window, just as if you had typed it on the keyboard.

X is an incredibly flexible windowing system that is all but universally found on any unix workstation with graphics. There are many cool things about X that no other windowing system has managed to duplicate as powerfully as X. One of these is the remote-display window feature, but, in my opinion the two-button copy-paste feature is by far the coolest feature in X.

"Re-directing" input and output

If you have used CCP4, you are probably familiar with this:
refmac hklin x.mtz hklout y.mtz << EOF-refmac
blah blah blah
EOF-refmac
but don't really understand it. You have probably also seen the "|" and ">" characters stuck in between commands from time to time too. Maybe even a ">!", ">>&" or a ">&!". What does it all mean?

The "<<" characters mean: "input to this command is on the following lines, and ends when you see a line beginning with EOF-refmac".
The ">>" characters, however, are completely different. They mean "append the output of this command to the following file"
unix% refmac.com >> all_refmacs.log
The ">>&" characters are the same as ">>", except that error messages are sent to the file too.
unix% refmac.com >>& all_refmacs_and_errors.log
The ">!" characters mean: "create/erase the following file, and send this command's output to it". Although you can use just ">", this will fail with a "file exists" error if the file is already there and the shell's "noclobber" variable is set. Use ">!" when you're sure you don't want to keep the last output file.
unix% refmac.com >! refmac0001.log
The ">&!" characters are the same as ">!", except that error messages are sent to the file too.
The "|" character means "take whatever this program (on the left) prints out and type it into the next (on the right) program's input".
unix% refmac.com | grep "R_fac"
This will monitor the R-factor in a refmac run. Note how this is different from sending the output to a file! If you typed:
unix% refmac.com > grep "R_fac"
then you would create a file (in the current directory) called "grep" that would contain the refmac log! (a common mistake).
The "|&" characters, like all the other "&" combinations, mean to send error messages along with "standard output" to the next program.
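Here is a small sketch of the difference between these operators that does not need a real refmac run. The printf command stands in for the program, and the file name redirect_demo.log is made up for this example. Plain ">" is used because the file does not exist yet. Lines starting with "#" are comments; leave them out if you type this at an interactive tcsh prompt.

```shell
# Create the file with one line of pretend program output.
printf 'Overall R_fac 0.25\n' > redirect_demo.log
# Append a second line to the same file.
printf 'Overall R_fac 0.24\n' >> redirect_demo.log
# The pipe hands the text to grep instead of to a file; -c counts matching lines.
cat redirect_demo.log | grep -c "R_fac"
```

The last command should print 2, since both lines ended up in the file.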

Keeping Logs

As a scientist, it is a good idea to keep a record of everything you do. To keep a computer record of the output of some program, you can "redirect" its output to a log file. You can then monitor this log file from another shell window, or use the unix backgrounding feature (described next). Instead of running:

unix% script.com
use:
unix% script.com >! logfile.log

This command sends everything that would normally be sent to your screen (except for error messages) to the file "logfile.log", which then becomes a permanent record of how the program ran. The exclamation (!) tells unix to overwrite logfile.log if it already exists. To send error messages to the log too, use:

unix% script.com >&! logfile.log

"backgrounding" stuff

Unix was one of the first, and is still the most stable of all the multiprocessing operating systems. You can run unix programs in the "background" (while you do something else) in several ways. The most common is to just put a "&" at the very end of the command-line:

unix% script.com >&! logfile.log &

Alternately, you can launch the program as a "foreground" job (no "&"), and then background it using <Ctrl>-Z, and "bg". That is, hold down the "Ctrl" key and then hit "z". This will display the message "Suspended", and give you your command prompt back. At this point, your process is in suspended animation: "time", as far as your job is concerned, is stopped. To start it going again, you use the unix command "bg", which continues the job in the unix background. You can also type "fg" to make the job run in the foreground again instead.

To stop (kill) a job you have put in the background, you need to use the "kill" command. First, type "jobs". This will display all the programs you are running in the background, along with a job number in []:

[1] + Running refmac.com > refmac.log
[2] + Running scala.com > scala.log

to kill your refmac job, type:

unix% kill -9 %1

The "-9" sends the strongest kill signal possible. Sometimes shell scripts don't respond to ordinary "kill"s. Alternately, you can get the process number by typing "ps" and kill the process by number. To see all processes you are running on the current machine, use:

unix% ps -fu username

where "username" is your user name. You can use this to kill old jobs you might have forgotten about. This is the same process ID you see in "top". For programs that run other programs, however, you need to be more careful. Killing a program being run from a script doesn't always kill the script, and killing the script doesn't always kill the programs it's running! To save yourself the headache of figuring out all the relationships between the processes in your jobs, do this:

unix% ps -fju username

This adds one more column to the "ps" output (usually labeled as PGID). This is the "process group" ID, and is usually the same as the process number of the "parent" jobs of all the processes. To kill an entire process group, use this:

unix% kill -9 -12345

where 12345 is the process group ID.

tail

Tail is a very popular program for monitoring logfiles, especially ones written by jobs running in the background. Use it like this:

unix% script.com > logfile.log &
unix% tail -1000f logfile.log

The "f" tells tail to "follow" the log. It will keep updating the display as the log is generated, until you hit <Ctrl>-C.

tee

The tee command took me a long time to find, but it turns out it's pretty neat. Have you ever wanted to monitor a job and keep a log file at the same time? Instead of tail (above), type this:

unix% script.com | tee logfile.log

Note that there is no "&". The job runs in the foreground, but keeps a log as usual. You could also tack a grep or awk command on the end of the pipe:

unix% script.com | tee logfile.log | grep R_factor

This is a useful way to keep track of critical numbers in the log and keep the whole log on disk, without having to do a lot of fancy footwork with backgrounded jobs.
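As a quick way to convince yourself that tee really does both jobs, here is a sketch with printf standing in for script.com. The log lines and the file name tee_demo.log are made up for this example.

```shell
# Pretend job output goes through tee (which saves it) and on to grep (which filters it).
printf 'cycle 1\nR_factor 0.25\ncycle 2\nR_factor 0.24\n' | tee tee_demo.log | grep R_factor
# The complete, unfiltered output is still on disk:
cat tee_demo.log
```

Only the two R_factor lines come out of the pipe, but all four lines are in tee_demo.log.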

foreach

A handy little command in csh and tcsh is the foreach loop. You use it like this:

unix% foreach X ( *.osc )
foreach? cmp -l ${X} /CDROM/${X}
foreach? end

This sets the value of ${X} to each of the filenames ending in .osc from the current directory, in turn. For each of these filenames, the commands up until the "end" are executed. The above command is an easy way to compare x-ray images on disk to ones you backed up onto a CD, checking for mis-matched bytes. Furthermore:

unix% foreach num ( `awk 'BEGIN{for(num=1;num<=180;++num) printf "%03d ", num; exit}'` )
foreach? mv stupidname${num}.img bettername_${num}.img
foreach? end

re-names all the x-ray images from stupidname001.img through stupidname180.img to bettername_001.img through bettername_180.img. I'll bet there have been plenty of times you wished you knew how to do that!

If you also want to re-number your images, you can add one line:

unix% foreach num ( `awk 'BEGIN{for(num=12;num<=180;++num) printf "%03d ", num; exit}'` )
foreach? set newnum = `echo ${num} | awk '{printf "%03d ", $1-11}'`
foreach? mv stupidname${num}.img bettername_${newnum}.img
foreach? end

re-names all the x-ray images from stupidname012.img through stupidname180.img to bettername_001.img through bettername_169.img.
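Before letting a loop like this loose on real data, it can help to preview the numbering. The following is only a sketch of a "dry run": it uses the same awk counting trick to print the mv commands for three frames instead of running them. The short range 12 to 14 is chosen just to keep the output small.

```shell
# Print (not run) the rename commands for frames 012 through 014.
awk 'BEGIN{for(num=12;num<=14;++num) printf "mv stupidname%03d.img bettername_%03d.img\n", num, num-11; exit}'
```

This prints three lines, starting with "mv stupidname012.img bettername_001.img" and ending with "mv stupidname014.img bettername_003.img", so you can check the old and new numbers line up before doing the real thing.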

Remember, the command:
unix% mv stupidname*.img bettername*.img
WILL NOT WORK! See below for why.

Another wonderful thing about foreach is that it allows you to systematically try a bunch of possible parameters in a script. For example, let's say you weren't exactly sure about your solvent content (and, let's face it, you never really are), and you wanted to just try running dm with several values in a row:

unix% foreach solc ( 30 35 40 45 50 )
foreach? awk '/^SOLC/{$2 = 0.'$solc'} {print}' dm.com >! dm_trial.com
foreach? chmod u+x dm_trial.com
foreach? echo -n "trying ${solc}%: "
foreach? ./dm_trial.com >&! dm${solc}.log
foreach? mv dm.mtz dm${solc}.mtz
foreach? awk '/Free_R_factor/{p=1} p==1 && NF==3{print}' dm${solc}.log | tail -1
foreach? end

(A procedure like this one is used by Phaser Elves.) This will take each of the numbers in the "()", and stick them on the end of the line in the dm.com script that begins with "SOLC" (for dm, this is the solvent content), and create a new script called dm_trial.com, which is absolutely identical to dm.com, except for the change in the number after the SOLC line. The output of dm_trial.com is then put into dm30.log, dm35.log, etc., and a brief summary of the real-space free-R is printed to the screen after each run. The output MTZ file is also backed up.
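To see what the keyword substitution does on its own, here is a sketch that runs it on two made-up input lines instead of a real dm.com. It is a variation on the loop above, not the same command: the percentage is passed in with awk's -v option (which needs nawk, gawk or another POSIX awk) and divided by 100, rather than pasted after "0." by the shell.

```shell
# Two placeholder lines stand in for the keyword input inside dm.com.
printf 'SOLC 0.45\nEND\n' | awk -v solc=35 '/^SOLC/{$2 = solc/100} {print}'
```

This prints "SOLC 0.35" followed by the untouched "END" line. Dividing by 100 also behaves sensibly for single-digit percentages: a value of 5 becomes 0.05, where pasting it after "0." would give 0.5.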

wildcards

Most people know how to use the *.img type wildcards, but there are other useful ones too! The question mark: ? is a one-character wildcard, and brackets: [] and (to a lesser extent) braces {} are also nice. For example, if you have the files:

unix% ls -1 .
refmac001.log
refmac002.log
refmac003.log
refmac004.log
refmac005.log

in some directory, then refmac*.log will, of course, refer to them all, as will refmac00?.log. But, refmac00[12].log will refer to just the first two, and refmac{001,003}.log will refer to the first and third. This can be REALLY handy when you're looking at refinement runs.
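If you are ever unsure what a wildcard will match, "echo" will show you the expanded list without doing anything to the files. The sketch below sets up three empty files in a scratch directory (wild_demo is a made-up name) and tries two of the patterns described above.

```shell
# Make a scratch directory with three empty stand-in log files.
mkdir wild_demo
touch wild_demo/refmac001.log wild_demo/refmac002.log wild_demo/refmac003.log
# "?" matches any single character, so this lists all three files.
echo wild_demo/refmac00?.log
# "[12]" matches only a 1 or a 2, so this lists the first two.
echo wild_demo/refmac00[12].log
```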

Pitfalls:

An unfortunate thing to remember about wildcards is that they are expanded by the shell, and not by the program you are launching. This means, when you type:

unix% mv frame_*.img bettername_*.img   <- NEVER EVER DO THIS!

This looks like the most obvious way to re-name a bunch of files at once (or so I thought once...), but what the "mv" command sees is:

unix% mv frame_001.img frame_002.img frame_003.img bettername_001.img bettername_002.img bettername_003.img

If the bettername_*.img files don't exist already, then the shell will complain with a "no match" error, and not run the "mv" command. The worst thing that could happen is if the bettername_*.img files DO exist. If this is the case each file on the above command line will be, one by one, moved to "bettername_003.img", leaving you with only one file (which used to be bettername_002.img). And there is NO WAY YOU CAN GET THEM BACK! Unix does not do file recovery very well.

If you ever want to rename a bunch of frames, I recommend the 3-line foreach loop described above.
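A harmless way to find out what a command will actually receive is to put "echo" in front of it, so the expanded command line is printed instead of run. This is a sketch using empty stand-in files in a scratch directory (mv_demo is a made-up name):

```shell
# Two "old" frames and one file that already has the new name.
mkdir mv_demo
touch mv_demo/frame_001.img mv_demo/frame_002.img mv_demo/bettername_001.img
# echo prints the command line that mv would have been given.
echo mv mv_demo/frame_*.img mv_demo/bettername_*.img
```

The output is "mv mv_demo/frame_001.img mv_demo/frame_002.img mv_demo/bettername_001.img": a plain list of existing files, with no trace of the pairing you had in mind.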

Another horrible mistake you could make (although, technically, unrelated) is mis-using tar:

unix% tar cvf frame*.img <- NEVER EVER DO THIS!

seems innocent enough, and you might think this would back up all your frames to tape, right? WRONG! what the "tar" command sees is this:

unix% tar cvf frame_001.img frame_002.img frame_003.img...etc.

which, to tar anyway, means to use "frame_001.img" as the tape device! Your entire dataset will then be "backed up" on top of your first image. This not only destroys "frame_001.img", but can double the size of your data set on disk (by making a gigantic "first image"), and probably fill up the disk. Your lab-mates will either laugh at you, or yell at you, depending on whether or not your filling up the disk killed one of their jobs. ;)

The moral of the story is, use:

unix% tar cv frame*.img

whenever you want to back up to the "default" tape. Only use "f" when you have a particular tape in mind.

filename completion and <Ctrl>-D

If you use C-shell (csh) then you might know that hitting <Esc> will complete filenames for you. However, this feature only fills in up to the first ambiguity between filenames, which can be annoying if you're not sure what the ambiguity is. At this point, hitting <Ctrl>-D will display all the alternative options of the filename completion! Pretty cool, eh?

nice command prompt

If you find yourself typing pwd and whoami a lot, then you should probably use the following tcsh command prompt:

set prompt="%n@%m:%C2 %h% "

This will make your prompt look like this:

jamesh@ucxray:alber/jamesh 40%

Note that only the last two directories in your current path are displayed. This keeps your prompt from getting super-long, like it can with some other prompt strings. However, this prompt only works with tcsh, not csh.

changing over from csh to tcsh

The TENEX C shell (tcsh) is quite widespread by now, and is decidedly superior to the old csh, without sacrificing backward compatibility. In fact, if you are an avid csh user, you may well have found yourself using tcsh on some systems without realizing it. The only real difference is that the <Esc> key needs to be typed twice for filename completion in tcsh. This is because tcsh uses the <Tab> key for filename completion, and the <Esc> key has been deprecated.

Simple awk commands that everyone should have

Hey! Don't skip this section! I know how scary something like awk can be. I resisted learning it for years, but there are a few little one-line "awk programs" that are incredibly useful for crystallographic stuff:

For example, most crystallographers use grep a lot:

unix% grep Overall_R_factor logs/refmac.log

This is a common way of watching the R-factor in a refmac log. If you want a "real-time" display of the R-factor, you can use this:

unix% tail -1111f logs/refmac.log | grep Overall_R_factor

Here the "tail" program continuously streams the logfile (as it is written) to grep, and you need to Ctrl-C it to stop (even after refmac is done). But, what if you want to "grep" for more than one thing from the file? Like the R and the Free-R? Or the rms bond angle deviations? Here is a good point to introduce egrep:

unix% egrep "all_R_factor|ee_R_factor|Estimated bond angle" logs/refmac.log

Will display all three kinds of lines from the log file. Incidentally, you can also use:

unix% egrep "Cycle|shift" logs/scala.log

to monitor the progress of a scala run.

Now, most people look at the difference between the R and the Free-R, so wouldn't it be nice to have a quick way to just display that value? Here is where awk becomes useful:

unix% awk '/all_R_factor/{R=$NF; ++cyc} /ee_R_factor/{print cyc, 100*R, 100*$NF, 100*($NF-R)}' logs/refmac.log

This command will look for "all_R_factor" and take the last word on the line ($NF) as the R value, and keeps a count of how many R-factors have been seen in "cyc". Whenever a line with the text "ee_R_factor" comes along, the cycle number, R, free-R, and free-R - R are printed out. At this point, I'd like to plug my Rplot.com script, which produces nice, xloggraph-formatted output from refmac logs.
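To try the command without a refmac log handy, you can feed it a few made-up lines. The four lines below are invented for illustration and are not the real refmac log layout; all that matters is that they contain the search text and end with the number.

```shell
# Two pretend refinement cycles, each with an R and a free-R line.
printf 'Overall_R_factor 0.25\nFree_R_factor 0.30\nOverall_R_factor 0.24\nFree_R_factor 0.295\n' | awk '/all_R_factor/{R=$NF; ++cyc} /ee_R_factor/{print cyc, 100*R, 100*$NF, 100*($NF-R)}'
```

This prints "1 25 30 5" and then "2 24 29.5 5.5": the cycle number, R, free-R, and the gap between them, all in percent.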

As a general example, if you want to see the 4th and last words on a line containing the text "something important", use this:

unix% awk '/something important/{print $4,$NF}' file.log

You can also specify the beginning of a line with a "^" and the end of a line with a "$". For example:

unix% awk '/^Cycle/ && /[0-9]$/' random.log

will print only lines that begin with "Cycle" and end with a number in the file random.log.

unix% awk '/^ATOM/ && /CA/{print substr($0,61,6)+0}' file.pdb

will print the B-factors of all the C_alpha atoms in file.pdb. The "substr" function is used here because PDBs don't always have spaces between the various items on each line. substr() allows you to refer to the characters on each line, instead of just words, and the pdb file definition is column-based, not word-based. The substr($0,61,6) line means the 6 characters starting at the 61st column on the current line (where the B-factor is stored).
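The sketch below shows why the column-based approach matters. The first awk builds two fixed-width ATOM records with made-up coordinates (the second one has a B-factor of 100.00, which fills its whole column), and the second awk prints the B-factor two ways: with substr(), and as the last word on the line.

```shell
# Generate two PDB-style ATOM lines, then compare substr() with the last word ($NF).
awk 'BEGIN{f="%-6s%5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f\n"; printf f,"ATOM",2," CA ","ALA","A",1,11.104,6.134,-6.504,1,15.5; printf f,"ATOM",7," CA ","GLY","A",2,12.560,9.206,-3.441,1,100}' | awk '/^ATOM/ && /CA/{print substr($0,61,6)+0, $NF}'
```

The output is "15.5 15.50" for the first atom and "100 1.00100.00" for the second. In the second record the occupancy and B-factor run together with no space, so the word-based answer is garbage while substr() still gets it right.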

Now, what if you want to see something on the next line from some kind of telltale text? Use awk's "getline" command to skip lines. Like this:

unix% echo "HEAD" | mtzdump hklin test.mtz | awk '/Cell Dimensions/{getline;getline;print}'

this will print the unit cell from test.mtz. The getline command can be used to skip lines, but it is only available in newer versions of awk. If your computer's awk program complains, try using nawk (new awk) or gawk (GNU's awk).
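If you want to watch getline work without an MTZ file, here is a sketch with three made-up lines in place of the mtzdump output. The layout (a heading, a blank line, then the numbers) only imitates the idea and is not copied from a real mtzdump log.

```shell
# The heading matches, the two getlines step over the blank line to the numbers, and print shows them.
printf ' * Cell Dimensions :\n\n   89.5 89.5 47.2 90 90 120\n' | awk '/Cell Dimensions/{getline;getline;print}'
```

Only the line with the six cell parameters is printed.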

Awk also supports a variety of math functions. The following commands:

unix% set CELL = "89.5 89.5 47.2 90 90 120"
unix% echo "$CELL" | nawk 'NF==6{s=3.14159265358979323846/180; \
 A=cos(s*$4); B=cos(s*$5); G=cos(s*$6); \
 skew = 1 + 2*A*B*G - A*A - B*B - G*G ; if(skew < 0) skew = -skew;\
 printf "%.3f\n", $1*$2*$3*sqrt(skew)}'

can be used to calculate the volume of your unit cell (in cubic Angstroms). If you know your protein's mass, and the number of asymmetric units in your crystal's unit cell, you can use this to calculate your Matthews coefficient.
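As a sketch of that last step, the same awk program can be stretched to divide the volume by the total mass in the cell, which gives the Matthews coefficient in cubic Angstroms per dalton. The two extra numbers on the input line are made up for illustration: 6 molecules in the unit cell (this assumes one molecule per asymmetric unit) and a mass of 25000 daltons. The command is kept on one line so it can be pasted into either sh or csh, and plain awk is used in place of nawk.

```shell
# Input fields: a b c alpha beta gamma, molecules per cell (made up), mass in daltons (made up).
echo "89.5 89.5 47.2 90 90 120 6 25000" | awk 'NF==8{s=3.14159265358979/180; A=cos(s*$4); B=cos(s*$5); G=cos(s*$6); skew=1+2*A*B*G-A*A-B*B-G*G; if(skew<0) skew=-skew; V=$1*$2*$3*sqrt(skew); printf "%.2f\n", V/($7*$8)}'
```

For these example numbers the result is 2.18.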

Tricks for your shell scripts

We all write shell scripts sooner or later. It's either that or staying up all night running programs manually. Therefore I have compiled a few very useful tricks here that you can use to make your shell scripts more efficient.

Starting a csh script

One should always begin with the following as their very first line:

#! /bin/csh -f

Very few people know what this line does, but I'll tell you: the "#!" thing is a signal to the unix program launcher (exec()), that this file is a script, and that the remainder of the line is the command to use to run this script. This is usually /bin/csh, but /bin/sh is also a popular shell for scripts which, unfortunately, has completely different syntax than you're used to typing on the unix command line. This is the main reason why Elves avoid sh entirely. To csh, the -f option means "fast start" and keeps the /bin/csh program from running your ~/.cshrc script every time you run your script.

For debugging shell scripts, you can use "-fv", which will echo all the commands the script is executing to the terminal window, as they are run. This makes it easy to see which line killed your shell script.

You can also write scripts for other programs by putting them on this first line. For example, putting "#! /usr/bin/nawk -f" at the top of an awk script turns the script into a "standalone" awk script that can be run like any other unix command.

Command-line arguments

One of the most immediately useful things to put into a shell script is the ability to read the command line. For example, if you are writing a script to do fft, wouldn't it be nice to only have to enter the input mtz file name on the command line instead of editing the script? Well, here's how you do it. It turns out that the first item on the command line to a shell script will be pasted into your script everywhere you put a $1. $2 is the second item, etc. So, in your script, you might write:

#! /bin/csh -f
# command-line reading fft script
fft hklin $1 mapout fft.map << EOF-fft
blah blah blah
EOF-fft

Now, when you run the above shell script like this:

unix% fft.com this.mtz

the string "this.mtz" will be substituted for the $1 after hklin before fft is run.

The $1 value will be set to "" if there is nothing on the command line, so watch out, and make sure you check for this.

Variable substitution

Shell scripts can also have arbitrary variables, set by you. You set variables using the set command:

set hires = "1.8"

from here on, any time you use the word "$hires" in your script, it will be replaced by "1.8". This is useful for putting commonly-changed values up at the top of your script, so changing one line affects all the programs you call in the script.

Command substitution

One of the most powerful features in shell programming is the ability to set variables to the output of a unix command. To do this, you use back-quotes:

set CELL = `echo "HEAD" | mtzdump hklin input.mtz | awk '/Cell Dimensions/{getline;getline;print}'`

will set the variable "CELL" to the unit cell from input.mtz. From this line on, the text: "${CELL}" will be substituted with the six unit cell numbers.

Reading from the keyboard

If, for some reason, you want the user to type something into your script, the easiest way to do this is to use the "special" shell variable "$<":

set userinput = "$<"

will read one line, entered by the user, and set ${userinput} to that string. Be careful, scripts that use this cannot easily be run in the background, because the "$<" command needs a terminal. (You'll get the "Stopped: tty input" message)

"if" statements

Even simple shell scripts will probably need a little flow control in them. For example, when using command-line arguments, something like this is almost always a good idea:

set mtz_input = "$1"
if("$1" == "") then
    echo "usage: $0 input.mtz"
    exit 9
endif

this will remind you if you forget to type something on the command line (instead of giving you some kind of weird, cryptic error message somewhere deeper in the script). Notice the structure of the "if" statement: the "if" line must end with a "then", and the "if" condition applies until a matching "endif" is found. You don't have to indent the way I do, but I find it makes the statements easier to read.

The csh "if" statement is a little finicky, and can easily crash the whole script if you make a syntax error, or provide unexpected input to it at run-time. But, once you get the hang of it, it can be a very powerful addition to your scripts. You can "man csh" or "man tcsh" for the detailed description of the "if" rules, (as well as everything else the shell can do), but I have a few rules that I find apply to crystallographic applications:

always put text-valued variables in double-quotes ("${variable}"). This way, even if they are empty, the "if" won't crash.
put plenty of spaces around numbers: if(1+2==3) will crash csh on some systems, but if( ( 1 + 2 ) == 3 ) will always work.
decimal numbers do not work in "if" statements: if(3.5 > 3) will crash with a "badly formed number" error.
make sure you never treat a string like a number: if $hires is "2A", then if($hires > 3) will crash. A neat trick to make sure this never happens is to use awk. Instead of set hires = $1, use set hires = `echo $1 | awk '{print $1+0}'`, this will always be a pure number.
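The awk trick in the last rule is easy to try by itself. The sketch below shows what it does to a few inputs, plus one way (my suggestion, not something csh provides) to get around the decimal-number rule by letting awk do the comparison and print a word that the "if" can test as text.

```shell
# A value with a unit stuck on the end becomes a pure number.
echo 2A | awk '{print $1+0}'
# Text with no leading number becomes 0.
echo none | awk '{print $1+0}'
# awk compares decimals happily; the script can then test for the word "yes".
echo 3.5 3 | awk '{if ($1 > $2) print "yes"; else print "no"}'
```

These three commands print 2, 0 and yes, in that order.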

As another example:

if(-e "${filename}") then
    ls -l ${filename}
else
    echo "${filename} does not exist! "
    exit 9
endif

will execute the command: "ls -l ${filename}" only if the filename contained in the variable value ${filename} actually exists, otherwise, the statement in quotes following the "echo" command will be printed out on to the screen. You can also do "wildcard" comparisons in csh "if" statements:

if("$1" =~ *.mtz) then
    set mtzfile = "$1"
endif

will set the value of the shell variable "$mtzfile" to the first word on the command line if that word ends in ".mtz". You can also put more complex logic into an "if" statement:

if((("$1" =~ *.mtz) && (-e "$1")) || (-e "$default_mtz")) then
    echo "okay"
endif

will print "okay" only if either the first argument on the command line ends in ".mtz", and is an actual, existing file, or if the string stored in the default_mtz variable is an existing file.

Directory tree transfer

Most of us still use FTP to transfer files, but ftp has some serious security problems. It sends your password, as clear text, in an unencrypted packet that any computer connected to your network segment can read. Bad, eh? FTP is also not very good at transferring a whole tree-structured directory in one go.

A popular solution to the cleartext password problem is ssh, which encrypts the password before it's sent. Unfortunately, only ssh version 2.x has an ftp program, and most synchrotrons don't have ssh 2.x, and probably won't for the foreseeable future (since they have to pay big licensing fees for it). Some users resort to logging in with ssh to temporarily change their password, do the FTP transfer, and then change their password back. This works (sort of), but can be complicated on remote systems that have slow NIS password updates.

A more direct (and portable) alternative is to send your files directly through an ssh login session. A nice, packaged version of this for x-ray images is in sendhome, but you might want more flexibility for, say, transferring data processing directories from here to there. The following command:

unix% tar cBf - processing | compress -c | ssh user@home.college.edu "cd /bigdisk/user; uncompress -c | tar xBvf -"

will move the local directory "processing" (and everything in it) to /bigdisk/user/processing (provided /bigdisk/user exists!) using "user"'s account on the remote computer: home.college.edu. The files are compressed during the transfer, so this is actually faster than FTP! The command to "pull" files the other way is this:

unix% cd /bigdisk/user
unix% ssh mcfuser@bl831.als.lbl.gov "cd /data/mcfuser/yourname ; tar cBf - . | compress -c" | uncompress -c | tar xvBf -

will move all the files and directories in "/data/mcfuser/yourname" on the remote computer bl831.als.lbl.gov to /bigdisk/user/ on the local machine.
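If you would like to see the tar pipe at work before trusting it with a network transfer, here is a purely local sketch. A second parenthesized command takes the place of the ssh login, the compress step is left out, and the "B" flag is dropped because nothing travels over a network. The directory names tar_src and tar_dest and the file notes.txt are made up for this example.

```shell
# Make a small directory tree to copy, and an empty destination.
mkdir -p tar_src/processing tar_dest
echo "test file" > tar_src/processing/notes.txt
# One tar writes the tree to the pipe; the other unpacks it somewhere else.
(cd tar_src; tar cf - processing) | (cd tar_dest; tar xf -)
# The copy arrives with its directory structure intact.
cat tar_dest/processing/notes.txt
```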


This page is not finished. It will never be finished, and neither will yours. Admit it.

James Holton <JMHolton@lbl.gov>