Project4/BIRNPortalIssues

From CVRG Wiki

Project 4

Current issues with BIRN portal implementation at JHU:

  1. A Dorian user cannot log into the BIRN portal
    1. How is authentication being handled?
    2. What is the current method for enabling a user? Do they have to register via the BIRN registration page?
    3. Where is the ability to select the authentication source?
  2. Need for source code for everything from UCSD
    1. The accounts that we had to UCSD's CVS were all under Aaron's name
    2. Anthony needs to get accounts for himself, Tim, Steve Granite and Kyle Reynolds, to ensure that any one of us can get their source code for further understanding of their portal.
  3. Issues with condor/large memory allocation
    1. The current implementation of job submission to the condor queue does not allow the user to define resource requirements. The lddmm portal should allow the user to define the memory and temporary disk space needed to run a job. The heart data sets took 8GB of memory. If we do not define the memory requirement, we exceed the memory limit of our processing machines.
    2. An alternative to user-defined requirements is to calculate the potential memory required automatically and attach this to the GWE submission.
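
One way to approach item 3 is to have the portal write the resource requests into the condor submit description. The following is a minimal sketch, not the portal's actual code: the function name `write_submit_file` and its arguments are hypothetical, and it assumes a Condor version that understands the `request_memory` (MiB by default) and `request_disk` (KiB by default) submit commands. Older pools may instead need a `requirements` expression on the machine's `Memory` and `Disk` attributes.

```shell
#!/bin/bash
# Hypothetical helper: print a Condor submit description that carries
# user-supplied memory and scratch-disk requests.
#   $1 = executable, $2 = memory in MiB, $3 = disk in KiB
write_submit_file() {
    local exe=$1 mem_mb=$2 disk_kb=$3
    cat <<EOF
Executable     = $exe
Universe       = vanilla
getenv         = true
request_memory = $mem_mb
request_disk   = $disk_kb
Queue
EOF
}

# Example: the heart data sets needed about 8 GB of memory (8192 MiB).
# The 20 GiB disk figure (20971520 KiB) is illustrative only.
write_submit_file parallel.sh 8192 20971520
```

The same function could serve the automatic alternative in item 3.2: the portal would compute the two numbers from the input volumes instead of asking the user for them.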

---

2008.10.08

Heart data processed through the portal.

2008.09.30

To      : Ramil Manansala <ramil@ncmir.ucsd.edu>
Cc      : Kyle Reynolds <ksr@jhu.edu>,
          Larry Lui <llui@ncmir.ucsd.edu>
Attchmnt:
Subject : Re: CVRG lddmm portal still not working
----- Message Text -----
Ramil,

/usr/local/bin/url_io already has that change. This morning I copied
your /home/ramil/lddmm/url_io  to /usr/local/bin. That version didn't
work either.

Check: /home/ramil/lddmm/gwiz/spool/lddmm-40/82/gwiz.out

Note: the environment variable:

   X509_USER_PROXY=x509up

is in your version of url_io. But, the original one had LDDMM_PROXY.

   cd /usr/local/bin
   grep X509_USER_PROXY url_io
   grep LDDMM_PROXY url_io.OLD

The SERVER_DN is not using Dorian. It looks like:

   SERVER_DN=/C=US/O=BIRN/OU=BIRN-CC/CN=birnsrb/Storage Resource Broker

Is this the problem?

-Anthony

2008.09.10

  • Ramil and Larry are looking into how the grid-mapfile gets created.
Date: Tue, 09 Sep 2008 11:51:10 -0700
From: Larry Lui <llui@ncmir.ucsd.edu>
To: Anthony Kolasny <akolasny@jhu.edu>
Cc: Ramil Manansala <ramil@ncmir.ucsd.edu>, V. Rowley <vrowley@ucsd.edu>,
    Stephen Granite <sgranite@jhu.edu>, Kyle Reynolds <ksr@jhu.edu>,
    Timothy Brown <tjab@jhu.edu>, rwinslow@bme.jhu.edu,
    Jeff Grethe <jgrethe@ncmir.ucsd.edu>
Subject: Re: condor/srb followup
Parts/Attachments:
   1 Shown    134 lines  Text
   2 Shown    113 lines  Text
   3 Shown     81 lines  Text
   4 Shown     28 lines  Text
----------------------------------------

Hello All,
Here are the components that I was able to gather that create the
/etc/grid-security/grid-mapfile.  I'm not much of a perl expert, but I was
able to get the gist of what the code does....

1.)generate-grid-mapfile.prl (on gama server)
        A.) Creates a webpage that has DN strings mapped to users.  This
script looks through all the certs stored in a directory and will produce a
static html page with the DN strings.


2.)sync_grid (on cluster)
        A.) Reads in the static webpage generated from
generate-grid-mapfile.prl and will create the user accounts necessary on the
cluster.
        B.) sync_grid calls sync-grid-accounts to alter the mapfile as well
as adding the user accounts to the cluster.

These files have been included in this email.

Larry

2008.09.08

  • Kyle provided Anthony with root on core001. This allowed for greater exploration of the portal processing.
  • The GridWizard is running through Ramil's account. /home/ramil/lddmm/gwiz/spool/lddmm-26/61 provides information relating to the last lddmm run.

[root@cor001 61]# ls -l
total 36
-rw-r--r-- 1 ramil ramil 2016 Sep  8 10:59 gwiz.err
-rw------- 1 ramil ramil  291 Sep  8 10:59 gwiz.log
-rw-r--r-- 1 ramil ramil 4288 Sep  8 10:59 gwiz.out
-rw------- 1 ramil ramil  290 Sep  8 10:59 lddmm.condor
-rw------- 1 ramil ramil  618 Sep  8 10:59 lddmm.condor.log
-rw------- 1 ramil ramil 2197 Sep  8 10:59 lddmm-config.txt
-rw------- 1 ramil ramil 1231 Sep  8 10:59 parallel.sh
-rw------- 1 ramil ramil 2776 Sep  8 10:59 x509up
  • parallel.sh is the job that is submitted to condor. It looks like:
#!/bin/bash

#
# This script generated automatically
#
$GWIZ_HOME/bin/loggit -f $GWIZ_WORK_DIR/gwiz.log Starting script
printenv
cd $GWIZ_WORK_DIR

chmod go=,u=rwX *

#  --> 61
(
  $GWIZ_HOME/bin/loggit -f $GWIZ_WORK_DIR/gwiz.log Starting task 61
nullATLAS=srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img
TARGET=srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img
ATLASNAME=$(basename $ATLAS | perl -pe's/\..*//')
TARGETNAME=$(basename $TARGET | perl -pe's/\..*//')
DOMAIN=$(echo $ATLAS | perl -pe's/.*\/akolasny\.([^-]+-[^\/]+).*/\1/')
OUTDIR=$(dirname $ATLAS | perl -pe's/srbfile://')/results.lddmm-26
export srbUser=akolasny
export mdasDomainName=$DOMAIN
export mdasDomainHome=$DOMAIN
export PORTAL_USER_EMAIL=akolasny@jhu.edu
Smkdir $OUTDIR
Smkdir $OUTDIR/$ATLASNAME
RESULTDIR=$OUTDIR/$ATLASNAME
printenv

  /usr/local/bin/lddmm-volume -A srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img -T  srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img -d 3 -o srbdir:$RESULTDIR -m -c lddmm-config.txt  # --: Task 61
  $GWIZ_HOME/bin/loggit -f $GWIZ_WORK_DIR/gwiz.log Job exited with code=$?.
) &
pid0=$!
#  <-- 61
for pid in $pid0 ; do
   wait $pid;
done;

$GWIZ_HOME/bin/loggit -f $GWIZ_WORK_DIR/gwiz.log Job finished.
  • From 'gwiz.out', we learn about the environment variables being passed to parallel.sh. They look like:
srbHost=cvrg-portal.nbirn.net
mdasDomainHome=ucsd-bcc
defaultResource=cvrg-nas
SERVER_DN=/C=US/O=BIRN/OU=BIRN-CC/CN=birnsrb/Storage Resource Broker
AUTH_SCHEME=GSI_AUTH
X509_USER_PROXY=x509up
mdasDomainName=ucsd-bcc

### Repeats
_=/usr/bin/printenv
srbHost=cvrg-portal.nbirn.net
mdasDomainHome=
HOSTNAME=cor001.icm.jhu.edu
defaultResource=cvrg-nas
mdasDomainName=
  • What appears to be happening is that the following line in parallel.sh is wiping out the mdasDomainName and mdasDomainHome information.

DOMAIN=$(echo $ATLAS | perl -pe's/.*\/akolasny\.([^-]+-[^\/]+).*/\1/')
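
The substitution can be tried in isolation. The sketch below mirrors the Perl expression with `sed -E` (a substitution made so the example needs only a shell; the function name `extract_domain` is hypothetical). With the atlas URL from parallel.sh it yields `jhu-cis`. If `$ATLAS` is empty, however, there is nothing to match and `DOMAIN` comes out empty, so the two `export` lines that follow would overwrite the inherited values with empty strings. One possible cause, not yet confirmed, is that the assignment in the generated script reads `nullATLAS=...` and therefore never sets `ATLAS`.

```shell
#!/bin/bash
# Hypothetical helper mirroring the DOMAIN line in parallel.sh:
# pull "<site>-<group>" out of an SRB URL for user akolasny.
extract_domain() {
    printf '%s\n' "$1" | sed -E 's|.*/akolasny\.([^-]+-[^/]+).*|\1|'
}

# With the atlas URL set as intended:
extract_domain 'srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img'   # prints jhu-cis

# With an unset or empty ATLAS the result is an empty line:
extract_domain ''
```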

2008.09.05

  • 'getenv = true' in the condor command script allows the SRB commands to work. It now looks like:

Executable  = /home/lddmmproc/test_condor/lddmm_srb_getenv.sh
getenv = true
Universe = vanilla
machine_count = 4
output  = /home/lddmmproc/test_condor/lddmm_srb_getenv.out
error   = /home/lddmmproc/test_condor/lddmm_srb_getenv.err
Log     = /home/lddmmproc/test_condor/lddmm_srb_getenv.log
Queue
  • Looking at /opt/condor/etc/condor_config.local on core001 and the birn cluster: they have the same GSI information.

        SEC_DEFAULT_AUTHENTICATION = REQUIRED
        SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
        GSI_DAEMON_DIRECTORY = /etc/grid-security
        GSI_DAEMON_CERT = /etc/grid-security/birncondorpool.cert
        GSI_DAEMON_KEY = /etc/grid-security/birncondorpool.key
        GSI_DAEMON_NAME = /C=US/O=BIRN/OU=BIRN-CC/CN=birncondorpool
        GRIDMAP = /etc/grid-security/grid-mapfile

[akolasny@birn-cluster0 etc]$ more /etc/grid-security/grid-mapfile |grep kolas
"/C=US/O=BIRN/O=Johns Hopkins Univ/OU=CIS/CN=Anthony Kolasny" btakolas
"/C=US/O=BIRN/OU=JHU - Center for Image Science/CN=Anthony Kolasny/USERID=akolasny" btakolas

In the CVRG version, the /etc/grid-security/grid-mapfile is missing.
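
A quick way to confirm whether a node has a usable mapping is to check that the file exists and look up the DN in it. The helper below is a sketch only: the function name `gridmap_user` is hypothetical, and it assumes the grid-mapfile layout seen in the entries above, a double-quoted DN followed by a space and a local account name.

```shell
#!/bin/bash
# Hypothetical helper: print the local account mapped to a DN.
#   $1 = DN string, $2 = grid-mapfile path (defaults to the standard location)
# Returns 2 if the file is missing, 1 if the DN has no entry, 0 on success.
gridmap_user() {
    local dn=$1 mapfile=${2:-/etc/grid-security/grid-mapfile}
    [ -r "$mapfile" ] || { echo "missing: $mapfile" >&2; return 2; }
    # Fixed-string match: the line must start with the quoted DN.
    awk -v want="\"$dn\"" '
        index($0, want) == 1 { print substr($0, length(want) + 2); found = 1; exit }
        END { exit (found ? 0 : 1) }
    ' "$mapfile"
}
```

For example, `gridmap_user "/C=US/O=BIRN/O=Johns Hopkins Univ/OU=CIS/CN=Anthony Kolasny"` should print `btakolas` on birn-cluster0 and report the missing file on the CVRG machine.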

2008.09.04

  • condor/srb status - lddmm submission to condor using srb is working. It appears the user environment is not being inherited.
Under /home/lddmmproc/test_condor, 'cat lddmm_srb.sh' shows the changes:

 #!/bin/bash

 HOME=/home/lddmmproc
 export HOME

 . $HOME/.profile
 . $HOME/.bashrc

 /usr/local/bin/lddmm-volume \
         -A srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img \
         -T srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img \
         -o srbdir:/home/akolasny.jhu-cis/ball_test -d 3

The key was adding the 'HOME'.
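
Since the user environment does not appear to be inherited under condor, a wrapper can fail early with a clear message rather than letting the Scommands crash. The following is a sketch under assumptions: the function name `check_srb_env` is hypothetical, and it assumes the SRB client reads its settings from `$HOME/.srb/.MdasEnv`.

```shell
#!/bin/bash
# Hypothetical pre-flight check for a condor wrapper script.
# Succeeds only if HOME is set and the SRB client settings are readable.
check_srb_env() {
    if [ -z "${HOME:-}" ]; then
        echo "HOME is not set" >&2
        return 1
    fi
    if [ ! -r "$HOME/.srb/.MdasEnv" ]; then
        echo "cannot read $HOME/.srb/.MdasEnv" >&2
        return 1
    fi
    return 0
}
```

In lddmm_srb.sh this would be called right after `export HOME`, for example as `check_srb_env || exit 1`, so that a missing environment shows up in the .err file as a plain message.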

2008.09.02

  • More portal troubleshooting.
The following command used SRB data and put the results on the
   server:

   [lddmmproc@cor001 ~]$ /usr/local/bin/lddmm-volume -A \
        srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img -T \
        srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img -o \
        srbdir:/home/akolasny.jhu-cis/ball_test -d 3

   [lddmmproc@cor001 ~]$ Sls
    /home/akolasny.jhu-cis/ball_test:
      ball3d_1.hdr
      ball3d_1.img
      ball3d_2.hdr
      ball3d_2.img
      C-/home/akolasny.jhu-cis/ball_test/ball3d_2

2. I tested this script using condor. It failed.

  /home/lddmmproc/software/lddmm-volume/url_io: line 66: 22084 Segmentation fault (core dumped) Sinit
  Sget: Connection to srbMaster failed.
  Sget: FATAL: clConnect: Unable to determine the client domainName!

  CLI_ERR_INVAILD_DOMAIN: Invalid domain name

  Since it's a segmentation fault, it would appear it's calling 'Sinit'. Is it
  that the condor daemon cannot see the .srb information?

  This works through BIRN. But, the url_io is not calling 'Sinit' or
  'Sexit'.

For testing the condor queue with srb:

        su - lddmmproc
        cd test_condor
        condor_submit lddmm_srb.cmd

        condor_q  # Almost immediately dies (similar to portal)

        more lddmm_srb.err # shows what happened.

        rm lddmm_srb\.{out,err,log} core*

2008.08.29

  • In order to run SRB Scommands on core-001, I needed to set up my .srb/.MdasEnv to look like:
mdasCollectionName '/home/akolasny.jhu-cis'
mdasCollectionHome '/home/akolasny.jhu-cis'
mdasDomainHome 'jhu-cis'
srbUser 'akolasny'
AUTH_SCHEME 'ENCRYPT1'
srbHost 'cvrg-portal.nbirn.net'
srbPort '5925'
defaultResource 'cvrg-nas'
SERVER_DN  '/C=US/O=BIRN/OU=BIRN-CC/CN=birnsrb/Storage Resource Broker'
  • From core-001, I was able to access CVRG SRB data and run LDDMM
[lddmmproc@cor001 portal_test]$ /usr/local/bin/lddmm-volume -A srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img -T srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img -d 3
** SINGLE TARGET **
SINGLE PROCESSOR JOB
ATLAS: ball3d_1
TARGET: ball3d_2
   Performing mapping
 /home/lddmmproc/software/lddmm-volume/x86_64/BIRNfluidmatch3D_x86_64 /tmp/lddmm.11614/input.tmp/ball3d_1.img /tmp/lddmm.11614/input.tmp/ball3d_1.hdr /tmp/lddmm.11614/input.tmp/ball3d_2.img /tmp/lddmm.11614/input.tmp/ball3d_2.hdr 1000 10 0.1 0.0000000001 1 1 1 0.01 1 1 20 5 1000 0 1 0.02 25  > stdout.txt



--------- LDDMM-Volume License Validation ------------

  License for host id -595530932 valid
------------------------------------------------------
/home/lddmmproc/software/lddmm-volume/x86_64/gatherData_x86_64
Hmap Kimap Atlas defAtlas Patient defPatient gradI0
gather_out.txt
   Gathering output files
   Writing output to final destination
Finished ball3d_2
  • Test of sending the output to SRB.
[lddmmproc@cor001 portal_test]$ /usr/local/bin/lddmm-volume -A srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img -T srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img -o srbdir:/home/akolasny.jhu-cis -d 3
** SINGLE TARGET **
SINGLE PROCESSOR JOB
ATLAS: ball3d_1
TARGET: ball3d_2
   Performing mapping
 /home/lddmmproc/software/lddmm-volume/x86_64/BIRNfluidmatch3D_x86_64 /tmp/lddmm.11819/input.tmp/ball3d_1.img /tmp/lddmm.11819/input.tmp/ball3d_1.hdr /tmp/lddmm.11819/input.tmp/ball3d_2.img /tmp/lddmm.11819/input.tmp/ball3d_2.hdr 1000 10 0.1 0.0000000001 1 1 1 0.01 1 1 20 5 1000 0 1 0.02 25  > stdout.txt



--------- LDDMM-Volume License Validation ------------

  License for host id -595530932 valid
------------------------------------------------------



/home/lddmmproc/software/lddmm-volume/x86_64/gatherData_x86_64
Hmap Kimap Atlas defAtlas Patient defPatient gradI0
gather_out.txt
   Gathering output files
   Writing output to final destination
connectSvr: initPort error. status =-1103
Finished ball3d_2