Moving Data Around


The following are some useful tips for moving x-ray data around and processing it in a platform-independent way.  We begin with the "easy" tools available at beamline 8.3.1 and then move on to a deeper explanation of how they work.
 

The Easy Way

For data collected on beamline 8.3.1 you can use these live data transfer techniques:
graphics1% ssh server
bl831% sendhome /data/mcfuser/holton/semet_2_E1_025.img user@computer.college.edu:/some/big/disk
This will periodically check the contents of /data/mcfuser/holton and send any files newer than semet_2_E1_025.img to /some/big/disk on "computer".  If semet_2_E1_025.img is omitted, then the oldest file in /data/mcfuser/holton is used as the "first" file.  It is faster and more reliable to run sendhome on server/bl831 because that's the computer with the local /data disk.  You should also try to make sure /some/big/disk is locally attached to the computer on the other end.
However, if your sysadmin at home is particularly paranoid, you might be faced with some kind of multi-layered firewall system, and there will be no way to ssh directly to the computer with the big disk.  Nevertheless, if you can ssh into your home computers, then you can also get your data home.  Like this:
graphics1% ssh -l name firewall.college.edu
firewall% ssh -l username machinewithdisk
unix% cd /some/big/disk
unix% ssh -l mcfuser bl831.als.lbl.gov "download holton semet_2_E1_035.img" | tar xvBf -
Here files are always transferred in chronological order (of collection), and any new files will be transferred as soon as they appear on the disk.  The word "holton" means to download all files containing the string "holton", so use your data directory name here.  The word "semet_2_E1_035.img" means to only send files that occur after semet_2_E1_035.img.  This is useful if your first attempt at a download dies and you want to start the second transfer where the first one left off.

If "semet_2_E1_035.img" is omitted, then all files containing "holton" in their pathname are transmitted.  If "holton" is omitted, then only newly-collected files are transmitted.  Therefore, only use "download" with no arguments before you start collecting data.


 

If your network link is really slow:

unix% cd /some/big/disk
unix% ssh -n -l mcfuser bl831.als.lbl.gov "download holton semet_2_E1_035.img | gzip --fast" |\
gunzip -c | tar xvBf -
This will make the transfer of x-ray images about 2-3 times faster.  However, on a fast network, the compression/decompression overhead will only slow things down.

 

WARNING: If you moved your files after collecting them, the download program will not find them! But if your home computer is equipped with rsync, you can use that:

unix% cd /some/big/disk
unix% rsync -azv mcfuser@bl831.als.lbl.gov:/data/mcfuser/your/stuff/ ./



Firewire

If you have a Mac-formatted firewire drive, just plug it into the firewire cable marked "iMac", open up a terminal, and type this:
[iMac:~]% df
Filesystem              512-blocks     Used    Avail Capacity  Mounted on
/dev/disk0s5             117221400 31035432 85673968    26%    /
devfs                          180      180        0   100%    /dev
fdesc                            2        2        0   100%    /dev
<volfs>                       1024     1024        0   100%    /.vol
automount -fstab [326]           0        0        0   100%    /Network/Servers
automount -static [326]          0        0        0   100%    /automount
//mcfuser@BL831/DATA    1341095936 858587136 482508800    64%    /Volumes/mcfuser
/dev/disk2s2             234441296   787160 233654128     0%    /Volumes/Untitled
[iMac:~]% rsync -av server:/data/mcfuser/you/ /Volumes/Untitled/831_2-14-04
This will recursively transfer everything under the "you" directory into /Volumes/Untitled/831_2-14-04.  The 831_2-14-04 folder will be created if necessary. The rsync program will only transfer files you haven't already got, so just run this same command every so often to keep your firewire drive synced up with data collection. Alternatively, if you want to just launch something and forget about it, you can run this:
[iMac:~]% rsh server "download holton" | tar xvBCf /Volumes/Untitled -
This will transfer all data files containing "holton" to the firewire disk named "Untitled" in real time.  Make sure your name and your firewire disk's name are entered appropriately.
 

If you have a PC-formatted firewire drive, just plug it into the firewire or USB2 cable marked "PC".

Then go to a terminal on graphics1 and type this:

graphics1% ssh server
bl831% cd /data/mcfuser/yourname
bl831% backup_firewire.com .
This will check /data/mcfuser/yourname at least every ten minutes and transfer any new files to your firewire drive.  The contents of your firewire drive will appear as /firewire on "graphics1" and "server".  So, you can check on it periodically with:

graphics1% ls /firewire

When you're done, check that all your data are on the firewire disk, and then just pull the plug and go home.


The Hard Way

(understanding file-transfer commands)


Copying one file at a time:

unix% cp test_0_001.img /somewhere/else/test_0_001.img
For hundreds of files, this can get boring really fast.  The "cp" command also has the unfortunate habit of changing the date stamps on files, and this can present problems later on when you're trying to figure out how the data were collected, etc.  You can do this (on some systems):
unix% cp -p test_0_001.img /somewhere/else/test_0_001.img
to preserve the dates, but you are still stuck with the tedium of typing in a lot of cp commands.  This is especially annoying if you need to rename or renumber the files.  However, there is a better way:

Renaming a bunch of files:

 The following commands may seem like a lot of fancy typing, but it really is the most flexible way of doing this.  Just cut-and-paste from this web page into your terminal and you'll never have to understand how it works.  If you do want to understand, have a look at the tips n tricks page.

unix% foreach num ( `awk 'BEGIN{for(num=1;num<=180;++num) printf "%03d ", num; exit}'` )
foreach? mv stupidname${num}.img bettername_${num}.img
foreach? end
re-names all the x-ray images from stupidname001.img through stupidname180.img to bettername_001.img through bettername_180.img. I'll bet there have been plenty of times you wished you knew how to do that!
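If your login shell is bash rather than csh, a roughly equivalent loop looks like this (a sketch on three dummy files, so it is safe to paste anywhere; with real data you would loop over 1 through 180):

```shell
# create three dummy frames so the demo is self-contained
touch stupidname001.img stupidname002.img stupidname003.img
for i in $(seq 1 3); do
    num=$(printf '%03d' "$i")    # zero-pad: 1 -> 001
    mv "stupidname${num}.img" "bettername_${num}.img"
done
ls bettername_*.img
```

Here printf does the zero-padding that awk did in the csh version above.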

If you also want to re-number your images, you can add one line:

unix% foreach num ( `awk 'BEGIN{for(num=12;num<=180;++num) printf "%03d ", num; exit}'` )
foreach? set newnum = `echo ${num} | awk '{printf "%03d ", $1-11}'`
foreach? mv stupidname${num}.img bettername_${newnum}.img
foreach? end
re-names all the x-ray images from stupidname012.img through stupidname180.img to bettername_001.img through bettername_169.img.



Whatever you do,
DO NOT DO THIS -> mv frame_*.img bettername_*.img <- NEVER EVER DO THIS!

The above command may seem like a good idea, but it will actually destroy all but the last of your images!  The shell expands both wildcards before mv ever runs, so mv receives nothing but a flat list of filenames with no pattern information, and some mv implementations will happily rename each source file in turn onto the final name, each one clobbering the last.
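To see why, ask a Bourne-type shell to show what mv would actually receive (echo is harmless; the filenames here are dummies):

```shell
cd "$(mktemp -d)"                      # scratch directory for the demo
touch frame_001.img frame_002.img
echo mv frame_*.img bettername_*.img
# prints: mv frame_001.img frame_002.img bettername_*.img
```

The unmatched bettername_*.img pattern is passed through literally, and mv has no idea any renaming was intended.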


Copying a whole directory system:

A popular way of doing this is to use the "-r" flag in cp.

unix% cp -r /data/wrong/place /data/right/
This will recurse into subdirectories and copy them too.  However, it will usually mess up your file dates, change the file permissions (read, write, execute), and change the ownership.  If you want to make sure everything about the files is preserved, you want to do something like this:
unix% cd /data/wrong/place
unix% tar cf /tmp/deleteme.tar .
unix% cd /data/right/place
unix% tar xvf /tmp/deleteme.tar
unix% rm /tmp/deleteme.tar
This will copy every file in /data/wrong/place to /data/right/place.

This command sequence may seem like a lot more work than "cp -r", but the advantages to using "tar" will become apparent below. The tar command also has the nice feature of recursing into all subdirectories while preserving all file dates, permissions and ownerships.  The "tar" command (tape archiver) is also one of the oldest backup/restore programs, so you can move the deleteme.tar file to just about any kind of computer system (Unix, Windows, Mac) and be able to restore all your data.  The tar options "cf" mean "create a new archive" and "put it in this file", where "this" file is the next argument.  The "xvf" options mean "extract", "verbose" and "from this file", where "this" file is the next argument.

The main disadvantage of the above five commands is that the copy can take twice as long as cp -r, because you are generating a huge intermediate deleteme.tar file.  You also might not have enough disk space.  The following command solves both these problems:

unix% cd /data/wrong/place
unix% tar cf - . | ( cd /data/right/place ; tar xvf - )
This will copy every file in /data/wrong/place to /data/right/place.

In this case, instead of "deleteme.tar", we use the special filename "-", which means "dump to screen" or "take from screen" depending on whether you are creating or restoring. Running the tar cf - . command by itself would probably be a bad idea.  However, unix allows you to send one program's "screen" output to another program by using a pipe "|".  Therefore, the "screen" output of tar cf - . becomes the "screen" input of the tar xvf - command.  The bytes that would have gone into "deleteme.tar" are still created and extracted, but only exist in the ethereal world of unix pipes.  Unix pipes consume no extra bandwidth and no disk space.

The "()" just allows you to change directories before running the tar xvf - command.  Some tar programs allow you to use "tar xvCf /data/right/place -" instead of "( cd /data/right/place ; tar xvf - )", but not all systems support this.  The "()" construct will always work.
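The whole pipe trick can be tried in a throw-away form (the directory and file names here are invented for the demo):

```shell
cd "$(mktemp -d)"
mkdir -p wrongplace/sub rightplace
echo "pretend image data" > wrongplace/sub/test_0_001.img
# archive on one side of the pipe, extract on the other
( cd wrongplace ; tar cf - . ) | ( cd rightplace ; tar xf - )
ls rightplace/sub
```

The subdirectory and its file come out the other side intact, and no intermediate .tar file ever touches the disk.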

Copying a whole directory system from one computer to another:

The tar program has real advantages when you need to move data from one computer to another.

unix% cd /data/wrong/place
unix% tar cBf - . |\
ssh -l username othercomputer.college.edu "cd /data/right/place ; tar xvBf -"
This command shuttles the data through an ssh login session (scp pushes files through the same kind of encrypted channel).  The commands in "" are executed on othercomputer, and their "screen input" comes from the "screen output" of the tar cBf - . command on your side.  In this way a tar file can be "streamed" over a secure connection to any other unix system that you have an account on. You will still be able to enter your password at the usual prompt, and this will not interfere with the data transfer.

The additional "B" in each tar command is a universal way to tell each computer's tar program to use the same block size.  Omitting the "B" can lead to problems if you are transferring data between different flavors of unix.

In our modern world of firewalls, you may find yourself on the wrong side of one.  Your data is on a remote computer, you can log into it, but you can't connect back to your own system.  In that case, use this command:

home% cd /data/right/place
home% ssh -n -l username bcsb-download.als.lbl.gov "cd /data/bl822/wrong/place ; tar cBf - ." |\
tar xvBf -
In this case, the data you collected ended up in /data/bl822/wrong/place on beamline 8.2.2.  The files will be expanded into /data/right/place on your current (home) machine.

For slow connections, you might want to use compression:

unix% cd /data/wrong/place
unix% tar cBf - . |\
compress -c |\
ssh -l username othercomputer.college.edu "cd /data/right/place ; uncompress -c | tar xvBf -"
This command adds a compress filter into the pipe, so that the bytes sent over the network are in compressed form.  For x-ray images, this usually means a factor of 2-3 increase in transfer rate.  On the other end, the uncompress command restores the archive stream to its original size and tar behaves just as it does in the last example.
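Many modern systems ship without compress/uncompress, but gzip does the same job in the pipe.  A local sketch of the idea, with src and dst standing in for the two machines:

```shell
cd "$(mktemp -d)"
mkdir src dst
echo "frame bytes" > src/frame_001.img
# gzip -c shrinks the archive stream; gunzip -c restores it on the far side
( cd src ; tar cf - . | gzip -c ) | ( cd dst ; gunzip -c | tar xf - )
ls dst
```

Over ssh, the gunzip -c | tar xvBf - part goes inside the remote quoted command, exactly where uncompress -c sits above.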

The reverse direction would be this:

home% cd /data/right/place
home% ssh -n -l username bcsb-download.als.lbl.gov "cd /data/bl822/dcsuser/wrongplace ; tar cBf - . | compress -c" |\
uncompress -c | tar xvBf -
Note: you will only see the increase in speed for slow networks.  If you do this over gigabit, the CPU overhead from compression and decompression will only slow things down.

Copying the part of a whole directory system that didn't make it on the first try:

Network connections can break, and when they do you are left with the problem of figuring out which files made it and which files still need to be transferred.  This can be a little tricky if you didn't plan ahead.

1) Plan ahead:
unix% cd /data/wrong/place
unix% find . -type f -name '*.img' >! ~/2bsent.txt
unix% tar cTBf ~/2bsent.txt - |\
ssh -l username othercomputer.college.edu "cd /data/right/place ; tar xvBf -"

The find command will recurse into all subdirectories, looking for files that match certain criteria.  In this case the rules are: it has to be a "regular file" (not a directory or other special file), and it must end in ".img".  The output of the find command will be a one-line-per-file listing of each file, pathnames included.  It's just like "ls -1", except that it recurses into subdirectories.
The T flag to tar means to take the list of files to archive from the contents of the next file, which is "~/2bsent.txt".  The "T" is the option used under Linux.  On an SGI, this is an "L" instead of a "T" and on a Sun it's an "I".
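You can check which letter your tar wants with a quick local experiment (GNU tar syntax shown; substitute L or I as noted above):

```shell
cd "$(mktemp -d)"
mkdir here there
echo a > here/a.img ; echo b > here/b.img
# build the file list, then archive only the listed files
( cd here ; find . -type f -name '*.img' > list.txt ; tar cTf list.txt - ) | ( cd there ; tar xf - )
ls there
```

Note that list.txt itself is not archived, because it does not match the '*.img' rule.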

Creating ~/2bsent.txt ensures that you know the order in which the files will be sent.  This lets you "continue" the transfer smoothly if it ever gets interrupted.  For example, if the connection breaks and the last file you see is "random_0_034.img", type this command to continue the transfer:

unix% awk '/random_0_034.img/{p=1} p==1{print}' ~/2bsent.txt |\
tar cTBf - - |\
ssh -l username othercomputer.college.edu "cd /data/right/place ; tar xvBf -"

The short awk program will only pass the lines in ~/2bsent.txt after, and including, the line containing the text "random_0_034.img".  This will repeat the transmission of random_0_034.img (which is probably incomplete anyway) and commence sending every file that was not already sent.
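Here is the resume idiom by itself, on a toy list (filenames invented for the demo):

```shell
printf 'a_001.img\na_002.img\na_003.img\n' > list.txt
# print everything from the first matching line onward
awk '/a_002.img/{p=1} p==1{print}' list.txt
# prints: a_002.img  then  a_003.img
```

The first pattern sets the flag p at the matching line; the second prints every line once p is set, including the matching line itself.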

2) Didn't plan ahead:
Okay, so you didn't plan ahead.  No problem.  You may also find yourself in this situation if you have collected more data since you started your last transmission, and the 2bsent.txt file above has already become out-of-date.  In either case, you'd probably rather not waste time transferring a bunch of files that you already have.  To make sure you don't, you need to make a list of all the files you want to transfer, then make a list of all the files that have transferred successfully, and, finally, take the difference between these lists.

First, let's find out what we already did:

unix% cd
unix% ssh -l username othercomputer "cd /data/right/place ; find . -type f -exec ls -ln \{\} \;" |\
awk '{print $5,$NF}' |\
sort >! alreadysent.txt
This will create a list of files and their sizes in alreadysent.txt. Like this:
10617344 ./fpp/tetPEG1_fpp_1_001.img
10617344 ./fpp/tetPEG1_fpp_1_002.img
10617344 ./fpp/tetPEG1_fpp_1_003.img
10617344 ./fpp/tetPEG1_fpp_1_004.img
10617344 ./fpp/tetPEG1_fpp_1_005.img
10617344 ./fpp/tetPEG1_fpp_1_006.img
etc...
This example takes advantage of the -exec flag for find.  This will run a given command on each found file.  The filename is substituted for the \{\} thing.  Here we call the ls command to get size information about each file. You want to include the size so you can detect the one file that was only partially transferred.
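A miniature version with one five-byte dummy file (in a Bourne-type shell the braces need no backslashes):

```shell
cd "$(mktemp -d)"
mkdir demo
printf '12345' > demo/x.img
# column 5 of "ls -ln" is the size in bytes; $NF is the filename
find demo -type f -exec ls -ln {} \; | awk '{print $5,$NF}'
# prints: 5 demo/x.img
```

If that partially-transferred file had been cut short, its size column would differ between the two lists, so diff would flag it for re-sending.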

Now let's make a list of all files that should, eventually, be transferred.
unix% cd /data/wrong/place
unix% find . -type f -exec ls -ln \{\} \; |\
awk '{print $5,$NF}' |\
sort >! ~/shouldbesent.txt

Now that you have both lists, you just need to see the differences between them
unix% cd
unix% diff shouldbesent.txt alreadysent.txt | awk '/^</{print $NF}' >! 2bsent.txt

Now 2bsent.txt contains the names of all remaining files to send.
The diff program shows the differences between two files, and it prints out a "<" character before lines that occur in the left file.  The awk program looks for lines beginning with "<" and prints out the last word on the line (the name of the untransferred file).
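The list-difference trick on a toy pair of lists (sizes and names invented for the demo):

```shell
cd "$(mktemp -d)"
printf '5 ./a.img\n5 ./b.img\n5 ./c.img\n' > shouldbesent.txt
printf '5 ./a.img\n' > alreadysent.txt
# lines only in the left file get a "<"; keep just the filename column
diff shouldbesent.txt alreadysent.txt | awk '/^</{print $NF}'
# prints: ./b.img  then  ./c.img
```

Only the two files that have not yet been transferred survive into the output.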

Now you can just send these files via tar and ssh.  Just like the first example.
unix% tar cTBf ~/2bsent.txt - |\
ssh -l username othercomputer.college.edu "cd /data/right/place ; tar xvBf -"
 

Another increasingly popular way of doing incremental transfers like this is to use rsync.  You can man rsync to find out more.  Note, however, that running an rsync daemon usually requires unacceptably low security for transferring data from synchrotron to home; rsync over ssh, as used earlier on this page, does not have this problem.

Live transfer of data as it is collected

Even over a fast link like firewire or gigabit ethernet, multiple-gigabyte data transfers can take a really long time.  For example, transferring 40Gb of data onto a firewire drive over a 100BaseT link will take about 6 hours (in reality).  However, data can usually be transferred much faster than it is collected.  Why not just transfer as you go?
Unfortunately, "ordinary" data transfer programs like ftp or scp will only do batch transfers of files that already exist.  However, tar can (usually) be fooled into generating the archive stream on-the-fly.  This works on Linux and SGI but not Suns or Alphas.
 

1) follow the beamline control system

On beamlines running BLU-ICE, you can generate a "live" list of collected images like this:
unix% xos3_channel.tcl | awk '/image_ready/{print $NF}'
This list of filenames can be fed into tar just like we did above.
unix% xos3_channel.tcl | awk '/image_ready/{print $NF}' |\
tar cBTf - - |\
ssh -l user othercomputer.college.edu "cd /some/big/disk ; tar xvBf -"
This will send all new files to /some/big/disk on othercomputer.  You can also do this from the "other direction"
unix% cd /some/big/disk
unix% ssh -l mcfuser bl831.als.lbl.gov "xos3_channel.tcl | awk '/image_ready/{print $NF}' | tar cBTf - - " |\
tar xvBf -
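The essential ingredient here, tar reading its file list from one pipe while writing the archive stream to another, can be tested locally with a dummy file (GNU tar syntax):

```shell
cd "$(mktemp -d)"
mkdir out
echo "data" > img_001.img
# T reads the list from stdin ("-"); f writes the archive to stdout ("-")
echo ./img_001.img | tar cBTf - - | ( cd out ; tar xBf - )
ls out
```

Anything that prints filenames, one per line, can take the place of the echo, which is exactly how the live transfers above work.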
On beamlines running MCF/DCS, you might be able to do a "live" file transfer like this:
bl822c% mkfifo fifo$$
bl822c% tail -f /prod/DCS/log/hwserv.log | awk '/img/{print $NF}' >> fifo$$ &
bl822c% tar cBf - -I fifo$$ |\
ssh -l user othercomputer.college.edu "cd /some/big/disk ; tar xvBf -"


2) poll the file system

The following command will poll the filesystem every five minutes for new files and list them on the screen

unix% echo "/data/mcfuser semet_1_E1_025.img" |\
awk '{framedir=$1;firstfile=$NF;\
        # no firstfile specified? then send everything \
        newer="";\
        if(firstfile != framedir) newer=" -cnewer "framedir"/"firstfile;\
        cmd="find "framedir" -type f"newer;\
        while(1){\
        # keep searching the directory on every pass \
        while(cmd | getline){\
          if(! sent[$0]){\
                # keep an internal record of files we have sent \
                print $NF; sent[$0]=1;};\
        };\
        # wait five minutes between polls \
        close(cmd); system(" sleep 300 ")}}'
This is kind of complicated, but it does work.  This will soon be packaged into new_files.com.
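If you just want the polling idea without the bookkeeping, a marker file and find -newer will do.  This sketch shows a single pass (the directory is a stand-in; in real use you would wrap the last three commands in a loop with a sleep 300 between passes):

```shell
framedir=$(mktemp -d)                  # stand-in for /data/mcfuser
marker=$framedir/.last_poll
touch "$marker" ; sleep 1
echo "new data" > "$framedir/new_001.img"
# one polling pass: report files newer than the marker, then advance it
find "$framedir" -type f -newer "$marker" -print
touch "$marker"
```

Each pass reports only the files that appeared since the previous pass, because the marker's timestamp is advanced after every search.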

Firewire disks

If you have shared your firewire disk, as demonstrated above, you can shuttle data to it using smbclient like this:
 

unix% xos3_channel.tcl | awk '/image_ready/{print $NF}' |\
tar cBTf - - |\
smbclient //192.168.10.5/share -U "username%password" -Tx -
Where 192.168.10.5 is the IP address of the computer that is sharing the firewire disk as "share".  Under Windows, you can use the "ipconfig" command under "Programs:Accessories:Command Prompt" to see what your IP address is.

However, it is probably easier to use backup_firewire.com.



This page is not finished. It will never be finished, and neither will yours. Admit it.

James Holton <JMHolton@lbl.gov>