The unix command line is the little sentence you type to run unix programs. The text that appears on each line to the left of what you're typing is the command prompt. This is a very old convention, but it is incredibly powerful. No graphical interface has ever come close to providing the detailed control that a command line gives you. The only problem with command lines is that they almost always have obtuse syntax, and novice users have a really hard time figuring out what to type. This section is intended as a quick guide for crystallographers who want to optimize their unix environment without having to read through all the unix manuals that I have. :)
Your "shell" program is probably tcsh (the TENEX C shell). All a shell program does is accept your typed commands, find and run the unix program you have invoked, and then prompt you for another command. That's about it. However, the shell is equipped with an amazing array of time-saving tools and tricks (once you know how to use them). I have tried to outline the ones most useful to crystallographers here, as briefly as possible, with examples.
If you are sitting in front of a workstation running any kind of unix, then, chances are, you are using X-windows. If you are using X-windows, then you can use your left and middle mouse buttons to copy and paste (respectively) text between any windows on your screen.
Use the left mouse button to select some text somewhere. Move the mouse pointer into another window, and then click the middle mouse button. The text you selected will then be "typed" into that window, just as if you had typed it on the keyboard.
X is an incredibly flexible windowing system that is all but universally found on any unix workstation with graphics. There are many cool things about X that no other windowing system has managed to duplicate as powerfully as X. One of these is the remote-display window feature, but, in my opinion the two-button copy-paste feature is by far the coolest feature in X.
"Re-directing" input and output
If you have used CCP4, you are probably familiar with this:
refmac hklin x.mtz hklout y.mtz << EOF-refmac
blah blah blah
EOF-refmac
but don't really understand it. You have probably also seen the "|" and ">" characters stuck in between commands from time to time too. Maybe even a ">!", ">>&" or a ">&!". What does it all mean?
As a scientist, it is a good idea to keep a record of everything you do. To keep a computer record of the output of some program, you can "redirect" its output to a log file. You can then monitor this log file from another shell window, or use the unix backgrounding feature (described next). Instead of running:
unix% script.com
use:
unix% script.com >! logfile.log
This command sends everything that would normally be sent to your screen (except for error messages) to the file "logfile.log", which then becomes a permanent record of how the program ran. The exclamation (!) tells the shell to overwrite logfile.log if it already exists. To send error messages to the log too, use:
unix% script.com >&! logfile.log
Unix was one of the first, and is still the most stable of all the multiprocessing operating systems. You can run unix programs in the "background" (while you do something else) in several ways. The most common is to just put a "&" at the very end of the command-line:
unix% script.com >&! logfile.log &
Alternately, you can launch the program as a "foreground" job (no "&"), and then background it using <Ctrl>-Z and "bg". That is, hold down the "Ctrl" key and then hit "z". This will display the message "Suspended", and give you your command prompt back. At this point, your process is in suspended animation: "time", as far as your job is concerned, is stopped. To start it going again, use the unix command "bg", which continues the job in the unix background. You can also type "fg" to make the job run in the foreground again instead.
To stop (kill) a job you have put in the background, you need to use the "kill" command. First, type "jobs". This will display all the programs you are running in the background, along with a job number in []:
[1] + Running refmac.com > refmac.log
[2] + Running scala.com > scala.log
to kill your refmac job, type:
unix% kill -9 %1
The "-9" sends the strongest kill signal possible. Sometimes shell scripts don't respond to ordinary "kill"s. Alternately, you can get the process number by typing "ps" and kill the process by number. To see all processes you are running on the current machine, use:
unix% ps -fu username
where "username" is your user name. You can use this to kill old jobs you might have forgotten about. The process ID shown here is the same one you see in "top". For programs that run other programs, however, you need to be more careful. Killing a program being run from a script doesn't always kill the script, and killing the script doesn't always kill the programs it's running! To save yourself the headache of figuring out all the relationships between the processes in your jobs, do this:
unix% ps -fju username
This adds one more column to the "ps" output (usually labeled as PGID). This is the "process group" ID, and is usually the same as the process number of the "parent" jobs of all the processes. To kill an entire process group, use this:
unix% kill -9 -12345
where 12345 is the process group ID.
Tail is a very popular program for monitoring logfiles, especially those being written by jobs in the background. Use it like this:
unix% script.com > logfile.log &
unix% tail -1000f logfile.log
The "f" tells tail to "follow" the log. It will keep updating the display as the log is generated, until you hit <Ctrl>-C.
The tee command took me a long time to find, but it turns out it's pretty neat. Have you ever wanted to monitor a job and keep a log file at the same time? Instead of tail (above), type this:
unix% script.com | tee logfile.log
Note that there is no "&". The job runs in the foreground, but keeps a log as usual. You could also tack a grep or awk command on the end of the pipe:
unix% script.com | tee logfile.log | grep R_factor
This is a useful way to keep track of critical numbers in the log and keep the whole log on disk, without having to do a lot of fancy footwork with backgrounded jobs.
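To try the tee pipeline without a real refinement job, here is a minimal sketch in which printf stands in for script.com. The file name demo_tee.log and the four log lines are made up for illustration. Nothing in it is specific to one shell, so it should behave the same from a tcsh or sh script.

```shell
# printf plays the role of script.com here (hypothetical log lines)
printf 'cycle 1\nR_factor 0.30\ncycle 2\nR_factor 0.25\n' | tee demo_tee.log | grep R_factor
# only the R_factor lines reached the screen, but the whole log is on disk
cat demo_tee.log
```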
A handy little command in csh and tcsh is the foreach loop. You use it like this:
unix% foreach X ( *.osc )
foreach? cmp -l ${X} /CDROM/${X}
foreach? end
This sets the value of ${X} to each of the filenames ending in .osc from the current directory, in turn. For each of these filenames, the commands up until the "end" are executed. The above command is an easy way to compare x-ray images on disk to ones you backed up onto a CD, checking for mis-matched bytes. Furthermore:
unix% foreach num ( `awk 'BEGIN{for(num=1;num<=180;++num) printf "%03d ", num; exit}'` )
foreach? mv stupidname${num}.img bettername_${num}.img
foreach? end
re-names all the x-ray images from stupidname001.img through stupidname180.img to bettername_001.img through bettername_180.img. I'll bet there have been plenty of times you wished you knew how to do that!
If you also want to re-number your images, you can add one line:
unix% foreach num ( `awk 'BEGIN{for(num=12;num<=180;++num) printf "%03d ", num; exit}'` )
foreach? set newnum = `echo ${num} | awk '{printf "%03d ", $1-11}'`
foreach? mv stupidname${num}.img bettername_${newnum}.img
foreach? end
re-names all the x-ray images from stupidname012.img through stupidname180.img to bettername_001.img through bettername_169.img.
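Before letting a loop like this loose on real data, it can help to preview the numbering. The sketch below prints the old and new names for frames 12 through 14 only, without calling mv. The output file renumber_preview.txt is a name chosen just for this example.

```shell
# dry run: print "old -> new" for frames 12..14 instead of renaming anything
awk 'BEGIN{for(num=12;num<=14;++num) printf "stupidname%03d.img -> bettername_%03d.img\n", num, num-11; exit}' > renumber_preview.txt
cat renumber_preview.txt
```

The "%03d" format is what pads each number with zeros to three digits, and "num-11" is the same offset that the loop applies with "$1-11".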
Remember, the command:
unix% mv stupidname*.img bettername*.img
WILL NOT WORK! See below for why.
Another wonderful thing about foreach is that it allows you to systematically try a bunch of possible parameters in a script. For example, let's say you weren't exactly sure about your solvent content (and, let's face it, you never really are), and you wanted to just try running dm with several values in a row:
unix% foreach solc ( 30 35 40 45 50 )
foreach? awk '/^SOLC/{$2 = 0.'$solc'} {print}' dm.com >! dm_trial.com
foreach? chmod u+x dm_trial.com
foreach? echo -n "trying ${solc}%: "
foreach? ./dm_trial.com >&! dm${solc}.log
foreach? mv dm.mtz dm${solc}.mtz
foreach? awk '/Free_R_factor/{p=1} p==1 && NF==3{print}' dm${solc}.log | tail -1
foreach? end
(A procedure like this one is used by Phaser Elves.) This will take each of the numbers in the "()", and stick them on the end of the line in the dm.com script that begins with "SOLC" (for dm, this is the solvent content), and create a new script called dm_trial.com, which is absolutely identical to dm.com, except for the change in the number after SOLC. The output of dm_trial.com is then put into dm30.log, dm35.log, etc., and a brief summary of the real-space free-R is printed to the screen after each run. The output MTZ file is also backed up.
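The awk substitution at the heart of this loop can be tried on its own. This is a minimal sketch under a few assumptions: solc_demo.com is a two-line stand-in for dm.com (not a real dm script), and the value is passed with awk's "-v" option rather than spliced in from a shell variable, so the same line works in tcsh and sh. The "-v" option needs nawk or a later awk.

```shell
# stand-in for dm.com: only the SOLC line matters for this demonstration
printf 'SOLC 0.50\nOTHER keyword line\n' > solc_demo.com
# replace the second word on the SOLC line, copy every other line unchanged
awk -v solc=35 '/^SOLC/{$2 = "0." solc} {print}' solc_demo.com > solc_demo_trial.com
cat solc_demo_trial.com
```

The first line of solc_demo_trial.com now reads "SOLC 0.35", and the second line is untouched.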
Most people know how to use the *.img type wildcards, but there are other useful ones too! The question mark: ? is a one-character wildcard, and brackets: [] and (to a lesser extent) braces {} are also nice. For example, if you have the files:
unix% ls -1 .
refmac001.log
refmac002.log
refmac003.log
refmac004.log
refmac005.log
in some directory, then refmac*.log will, of course, refer to them all, as will refmac00?.log. But, refmac00[12].log will refer to just the first two, and refmac{001,003}.log will refer to the first and third. This can be REALLY handy when you're looking at refinement runs.
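A safe way to find out what a wildcard will match is to hand it to echo before using it with a real command. The sketch below builds five empty files in a scratch directory (wildcard_demo is a name made up for this example) and records what two of the patterns expand to.

```shell
mkdir -p wildcard_demo
cd wildcard_demo
touch refmac001.log refmac002.log refmac003.log refmac004.log refmac005.log
# the shell expands each pattern before echo ever runs
echo refmac00[12].log > ../wildcard_demo.out
echo refmac00?.log >> ../wildcard_demo.out
cd ..
cat wildcard_demo.out
```

The first line of output lists only refmac001.log and refmac002.log. The second lists all five files.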
An unfortunate thing to remember about wildcards is that they are expanded by the shell, and not by the program you are launching. This means, when you type:
unix% mv frame_*.img bettername_*.img <- NEVER EVER DO THIS!
This looks like the most obvious way to re-name a bunch of files at once (or so I once thought...), but what the "mv" command actually sees is:
unix% mv frame_001.img frame_002.img frame_003.img bettername_001.img bettername_002.img bettername_003.img
If the bettername_*.img files don't exist already, then the shell will complain with a "no match" error, and not run the "mv" command. The worst thing that could happen is if the bettername_*.img files DO exist. If this is the case each file on the above command line will be, one by one, moved to "bettername_003.img", leaving you with only one file (which used to be bettername_002.img). And there is NO WAY YOU CAN GET THEM BACK! Unix does not do file recovery very well.
If you ever want to rename a bunch of frames, I recommend the 3-line foreach loop described above.
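A different approach from the foreach loop, sketched here as one possibility, is to have awk write out one explicit "mv old new" line per file, look over the list, and only then run it. The directory rename_demo, the file rename_cmds.txt and the empty frame files are all invented for this example.

```shell
mkdir -p rename_demo
cd rename_demo
touch frame_001.img frame_002.img frame_003.img
# build one explicit "mv old new" line per file; nothing is renamed yet
ls frame_*.img | awk '{new = $0; sub(/^frame_/, "bettername_", new); print "mv", $0, new}' > rename_cmds.txt
cat rename_cmds.txt
# once the list looks right, run it
sh rename_cmds.txt
ls
cd ..
```

Because every mv command names exactly one source and one destination, there is no wildcard left on the command line to expand in an unexpected way.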
Another horrible mistake you could make (although, technically, unrelated) is mis-using tar:
unix% tar cvf frame*.img <- NEVER EVER DO THIS!
seems innocent enough, and you might think this would back up all your frames to tape, right? WRONG! what the "tar" command sees is this:
unix% tar cvf frame_001.img frame_002.img frame_003.img ...etc.
which, to tar anyway, means to use "frame_001.img" as the tape device! Your entire dataset will then be "backed up" on top of your first image. This not only destroys "frame_001.img", but can double the size of your data set on disk (by making a gigantic "first image"), and probably fill up the disk. Your lab-mates will either laugh at you, or yell at you, depending on whether or not your filling up the disk killed one of their jobs. ;)
The moral of the story is, use:
unix% tar cv frame*.img
whenever you want to back up to the "default" tape. Only use "f" when you have a particular tape in mind.
filename completion and <Ctrl>-D
If you use C-shell (csh) then you might know that hitting <Esc> will complete filenames for you. However, this feature only fills in up to the first ambiguity between filenames, which can be annoying if you're not sure what the ambiguity is. At this point, hitting <Ctrl>-D will display all the alternative options of the filename completion! Pretty cool, eh?
If you find yourself typing pwd and whoami a lot, then you should probably use the following tcsh command prompt:
set prompt="%n@%m:%C2 %h% "
This will make your prompt look like this:
jamesh@ucxray:alber/jamesh 40%
Note that only the last two directories in your current path are displayed. This keeps your prompt from getting super-long, like it can with some other prompt strings. However, this prompt only works with tcsh, not csh.
changing over from csh to tcsh
The TENEX C shell (tcsh) is quite widespread by now, and is decidedly superior to the old csh, without sacrificing backward compatibility. In fact, if you are an avid csh user, you may well have found yourself using tcsh on some systems without realizing it. The only real difference is that the <Esc> key needs to be typed twice for filename completion in tcsh. This is because tcsh uses the <Tab> key for filename completion, and the <Esc> key has been deprecated.
Hey! Don't skip this section! I know how scary something like awk can be. I resisted learning it for years, but there are a few little one-line "awk programs" that are incredibly useful for crystallographic stuff:
For example, most crystallographers use grep a lot:
unix% grep Overall_R_factor logs/refmac.log
This is a common way of watching the R-factor in a refmac log. If you want a "real-time" display of the R-factor, you can use this:
unix% tail -1111f logs/refmac.log | grep Overall_R_factor
Here the "tail" program continuously streams the logfile (as it is written) to grep, and you need to Ctrl-C it to stop (even after refmac is done). But, what if you want to "grep" for more than one thing from the file? Like the R and the Free-R? Or the rms bond angle deviations? Here is a good point to introduce egrep:
unix% egrep "all_R_factor|ee_R_factor|Estimated bond angle" logs/refmac.log
will display all three kinds of lines from the log file. Incidentally, you can also use:
unix% egrep "Cycle|shift" logs/scala.log
to monitor the progress of a scala run.
Now, most people look at the difference between the R and the Free-R, so wouldn't it be nice to have a quick way to just display that value? Here is where awk becomes useful:
unix% awk '/all_R_factor/{R=$NF; ++cyc} /ee_R_factor/{print cyc, 100*R, 100*$NF, 100*($NF-R)}' logs/refmac.log
This command will look for "all_R_factor" and take the last word on the line ($NF) as the R value, and keeps a count of how many R-factors have been seen in "cyc". Whenever a line with the text "ee_R_factor" comes along, the cycle number, R, free-R, and free-R - R are printed out. At this point, I'd like to plug my Rplot.com script, which produces nice, xloggraph-formatted output from refmac logs.
As a general example, if you want to see the 4th and last words on a line containing the text "something important", use this:
unix% awk '/something important/{print $4,$NF}' file.log
You can also specify the beginning of a line with a "^" and the end of a line with a "$". For example:
unix% awk '/^Cycle/ && /[0-9]$/' random.log
will print only lines that begin with "Cycle" and end with a number in the file random.log.
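The two ideas above (picking words with $4 or $NF, and anchoring patterns with "^" and "$") combine naturally. Here is a small sketch on an invented four-line file, random_demo.log, whose contents are made up purely to exercise the pattern.

```shell
printf 'Cycle 1 shift 0.5\nCycle 2 finished\nnote: Cycle 3\nCycle 4 shift 0.1\n' > random_demo.log
# keep lines that start with "Cycle" and end in a digit; print the 2nd and last words
awk '/^Cycle/ && /[0-9]$/{print $2, $NF}' random_demo.log > random_demo.out
cat random_demo.out
```

Only the first and fourth lines pass both tests, so the output is "1 0.5" followed by "4 0.1". The second line does not end in a digit, and the third does not begin with "Cycle".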
unix% awk '/^ATOM/ && /CA/{print substr($0,61,6)+0}' file.pdb
will print the B-factors of all the Calpha atoms in file.pdb. The "substr" function is used here because PDBs don't always have spaces between the various items on each line. substr() allows you to refer to the characters on each line, instead of just words, and the pdb file definition is column-based, not word-based. The substr($0,61,6) line means the 6 characters starting at the 61st column on the current line (where the B-factor is stored).
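The same column-based approach extends to simple statistics. The sketch below writes four made-up ATOM records into a scratch file (bfac_demo.pdb, with invented coordinates and B-factors) and averages the B-factor over the Calpha atoms. As a variation on the command above, it tests the atom-name columns 13-16 directly with substr instead of searching the whole line for "CA".

```shell
# four made-up ATOM records, laid out in the fixed PDB columns by printf
printf 'ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f\n' \
  1 ' N'  ALA A 1 11.104  6.134 -6.504 1.00 20.00 \
  2 ' CA' ALA A 1 11.639  6.071 -5.147 1.00 22.50 \
  3 ' C'  ALA A 1 12.700  7.100 -4.900 1.00 21.00 \
  4 ' CA' GLY A 2 14.100  7.500 -4.200 1.00 27.50 > bfac_demo.pdb
# columns 13-16 hold the atom name, columns 61-66 the B-factor
awk '/^ATOM/ && substr($0,13,4)==" CA "{sum += substr($0,61,6); ++n} END{if(n) printf "%d CA atoms, mean B = %.2f\n", n, sum/n}' bfac_demo.pdb > bfac_demo.out
cat bfac_demo.out
```

With these four records the result is "2 CA atoms, mean B = 25.00".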
Now, what if you want to see something on the next line from some kind of telltale text? Use awk's "getline" command to skip lines. Like this:
unix% echo "HEAD" | mtzdump hklin test.mtz | awk '/Cell Dimensions/{getline;getline;print}'
this will print the unit cell from test.mtz. The getline command can be used to skip lines, but it is only available in newer versions of awk. If your computer's awk program complains, try using nawk (new awk) or gawk (GNU's awk).
Awk also supports a variety of math functions. The following commands:
unix% set CELL = "89.5 89.5 47.2 90 90 120"
unix% echo "$CELL" | nawk 'NF==6{s=3.1415926535897899419/180; \
A=cos(s*$4); B=cos(s*$5); G=cos(s*$6); \
skew = 1 + 2*A*B*G - A*A - B*B - G*G ; if(skew < 0) skew = -skew;\
printf "%.3f\n", $1*$2*$3*sqrt(skew)}'
can be used to calculate the volume of your unit cell (in cubic Angstroms). If you know your protein's mass, and the number of asymmetric units in your crystal's unit cell, you can use this to calculate your Matthews coefficient.
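As a sketch of that last step, the Matthews coefficient is the cell volume divided by the total protein mass in the cell: Vm = V / (mass x number of asymmetric units). The mass of 25000 Da and the 6 asymmetric units below are invented numbers for illustration only, as are the variable names mass and nasu. The "-v" option assumes nawk or a later awk.

```shell
# cell volume as above, then divide by (protein mass in Da) x (asymmetric units per cell)
echo "89.5 89.5 47.2 90 90 120" | awk -v mass=25000 -v nasu=6 'NF==6{s=3.14159265358979/180; A=cos(s*$4); B=cos(s*$5); G=cos(s*$6); V=$1*$2*$3*sqrt(1+2*A*B*G-A*A-B*B-G*G); printf "Vm = %.2f A^3/Da\n", V/(mass*nasu)}' > matthews.out
cat matthews.out
```

For this cell and these made-up values the result is "Vm = 2.18 A^3/Da".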
We all write shell scripts sooner or later. It's either that or staying up all night running programs manually. Therefore I have compiled a few very useful tricks here that you can use to make your shell scripts more efficient.
One should always begin with the following as their very first line:
#! /bin/csh -f
Very few people know what this line does, but I'll tell you: the "#!" thing is a signal to the unix program launcher (exec()), that this file is a script, and that the remainder of the line is the command to use to run this script. This is usually /bin/csh, but /bin/sh is also a popular shell for scripts which, unfortunately, has completely different syntax than you're used to typing on the unix command line. This is the main reason why Elves avoid sh entirely. To csh, the -f option means "fast start" and keeps the /bin/csh program from running your ~/.cshrc script every time you run your script.
For debugging shell scripts, you can use "-fv", which will echo all the commands the script is executing to the terminal window, as they are run. This makes it easy to see which line killed your shell script.
You can also write scripts for other programs by putting them on this first line. For example, putting "#! /usr/bin/nawk -f" at the top of an awk script turns the script into a "standalone" awk script that can be run like any other unix command.
One of the most immediately useful things to put into a shell script is the ability to read the command line. For example, if you are writing a script to do fft, wouldn't it be nice to only have to enter the input mtz file name on the command line instead of editing the script? Well, here's how you do it. It turns out that the first item on the command line to a shell script will be pasted into your script everywhere you put a $1. $2 is the second item, etc. So, in your script, you might write:
#! /bin/csh -f
# command-line reading fft script
fft hklin $1 mapout fft.map << EOF-fft
blah blah blah
EOF-fft
Now, when you run the above shell script like this:
unix% fft.com this.mtz
the string "this.mtz" will be substituted for the $1 after hklin before fft is run.
The $1 value will be set to "" if there is nothing on the command line, so watch out, and make sure you check for this.
Shell scripts can also have arbitrary variables, set by you. You set variables using the set command:
set hires = "1.8"
from here on, any time you use the word "$hires" in your script, it will be replaced by "1.8". This is useful for putting commonly-changed values up at the top of your script, so changing one line affects all the programs you call in the script.
One of the most powerful features in shell programming is the ability to set variables to the output of a unix command. To do this, you use back-quotes:
set CELL = `echo "HEAD" | mtzdump hklin input.mtz | awk '/Cell Dimensions/{getline;getline;print}'`
will set the variable "CELL" to the unit cell from input.mtz. From this line on, the text: "${CELL}" will be substituted with the six unit cell numbers.
If, for some reason, you want the user to type something into your script, the easiest way to do this is to use the "special" shell variable "$<":
set userinput = "$<"
will read one line, entered by the user, and set ${userinput} to that string. Be careful, scripts that use this cannot easily be run in the background, because the "$<" command needs a terminal. (You'll get the "Stopped: tty input" message)
Even simple shell scripts will probably need a little flow control in them. For example, when using command-line arguments, something like this is almost always a good idea:
set mtz_input = "$1"
if("$1" == "") then
    echo "usage: $0 input.mtz"
    exit 9
endif
this will remind you if you forget to type something on the command line (instead of giving you some kind of weird, cryptic error message somewhere deeper in the script). Notice the structure of the "if" statement: the "if" line must end with a "then", and the "if" condition applies until a matching "endif" is found. You don't have to indent the way I do, but I find it makes the statements easier to read.
The csh "if" statement is a little finicky, and can easily crash the whole script if you make a syntax error, or provide unexpected input to it at run-time. But, once you get the hang of it, it can be a very powerful addition to your scripts. You can "man csh" or "man tcsh" for the detailed description of the "if" rules, (as well as everything else the shell can do), but I have a few rules that I find apply to crystallographic applications:
As another example:
if(-e "${filename}") then
    ls -l ${filename}
else
    echo "${filename} does not exist! "
    exit 9
endif
will execute the command "ls -l ${filename}" only if the file named in the variable ${filename} actually exists. Otherwise, the statement in quotes following the "echo" command will be printed out on the screen, and the script will exit. You can also do "wildcard" comparisons in csh "if" statements:
if("$1" =~ *.mtz) then
    set mtzfile = "$1"
endif
will set the value of the shell variable "$mtzfile" to the first word on the command line if that word ends in ".mtz". You can also put more complex logic into an "if" statement:
if((("$1" =~ *.mtz) && (-e "$1")) || (-e "$default_mtz")) then
    echo "okay"
endif
will print "okay" only if either the first argument on the command line ends in ".mtz", and is an actual, existing file, or if the string stored in the default_mtz variable is an existing file.
Most of us still use FTP to transfer files, but ftp has some serious security problems. It sends your password, as clear text, in an unencrypted packet that any computer connected to your network segment can read. Bad, eh? FTP is also not very good at transferring a whole tree-structured directory in one go.
A popular solution to the cleartext password problem is ssh, which encrypts the password before it's sent. Unfortunately, only ssh version 2.x has an ftp program, and most synchrotrons don't have ssh 2.x, and probably won't for the foreseeable future (since they have to pay big licensing fees for it). Some users resort to logging in with ssh to temporarily change their password, do the FTP transfer, and then change their password back. This works (sort of), but can be complicated on remote systems that have slow NIS password updates.
A more direct (and portable) alternative is to send your files directly through an ssh login session. A nice, packaged version of this for x-ray images is in sendhome, but you might want more flexibility for, say, transferring data processing directories from here to there. The following command:
unix% tar cBf - processing | compress -c | ssh user@home.college.edu "cd /bigdisk/user; uncompress -c | tar xBvf -"
will move the local directory "processing" (and everything in it) to /bigdisk/user/processing (provided /bigdisk/user exists!) using "user"'s account on the remote computer: home.college.edu. The files are compressed during the transfer, so this is actually faster than FTP! The command to "pull" files the other way is this:
unix% cd /bigdisk/user
unix% ssh mcfuser@bl831.als.lbl.gov "cd /data/mcfuser/yourname ; tar cBf - . | compress -c" | uncompress -c | tar xvBf -
will move all the files and directories in "/data/mcfuser/yourname" on the remote computer bl831.als.lbl.gov to /bigdisk/user/ on the local machine.
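The tar-through-a-pipe idea behind both commands can be rehearsed on a single machine before involving ssh. In this sketch a parenthesized "cd dest" takes the place of the remote login, and the compress step is left out in case compress is not installed. The directory tarpipe_demo and the file notes.txt are made up for the example.

```shell
mkdir -p tarpipe_demo/processing tarpipe_demo/dest
echo "test" > tarpipe_demo/processing/notes.txt
cd tarpipe_demo
# "f -" makes the first tar write to the pipe and the second tar read from it
tar cf - processing | ( cd dest ; tar xf - )
cd ..
ls tarpipe_demo/dest/processing
```

The copy arrives as dest/processing/notes.txt, with the directory structure intact. This is the same thing that happens on the far side of the ssh connection.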