Project4/BIRNPortalIssues
From CVRG Wiki
Current issues with BIRN portal implementation at JHU:
- A Dorian user cannot log into the BIRN portal
- How is authentication being handled?
- What is the current method to make a user work? Do they have to register via the BIRN registration page?
- Where is the ability to select the authentication source?
- Need for source code for everything from UCSD
- The accounts that we had to UCSD's CVS were all under Aaron's name
- Anthony needs to get accounts for himself, Tim, Steve Granite and Kyle Reynolds, so that any one of us can access the UCSD source code and better understand their portal.
- Issues with condor/large memory allocation
- The current implementation of submitting jobs to the condor queue does not allow the user to define resource requirements. The lddmm portal should allow the user to define the memory and temporary disk space needed to run a job. The heart data sets took 8GB of memory. If we do not define the memory requirement, we exceed the memory limit of our processing machines.
- An alternative to user-defined requirements is to calculate the potential memory required automatically and attach this to the GWE submission.
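One way user-defined requirements could reach the queue is through the Condor submit description. The sketch below is an assumption, not the current GWE behavior: `write_submit` and its arguments are hypothetical names, and it relies on Condor's `request_memory` (in MiB) and `request_disk` (in KiB) submit commands, which require a Condor version that supports them. The disk figure in the example call is an illustrative placeholder.

```shell
#!/bin/bash
# Hypothetical helper: print a Condor submit description that carries
# user-defined memory and scratch-disk requirements.
#   $1 = executable, $2 = memory in MiB, $3 = disk in KiB
write_submit() {
    local exe=$1 mem_mb=$2 disk_kb=$3
    cat <<EOF
Executable = $exe
Universe = vanilla
getenv = true
request_memory = $mem_mb
request_disk = $disk_kb
Queue
EOF
}

# Example: the heart data sets needed about 8GB of memory.
# The disk value (20 GiB expressed in KiB) is a placeholder, not a measured figure.
write_submit /usr/local/bin/lddmm-volume 8192 20971520
```

If the memory were calculated automatically instead, the same function could be called with the computed value in place of the user's input.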
---
2008.10.08
Heart data processed through the portal.
2008.09.30
To : Ramil Manansala <ramil@ncmir.ucsd.edu>
Cc : Kyle Reynolds <ksr@jhu.edu>,
Larry Lui <llui@ncmir.ucsd.edu>
Attchmnt:
Subject : Re: CVRG lddmm portal still not working
----- Message Text -----
Ramil,
/usr/local/bin/url_io already has that change. This morning I copied
your /home/ramil/lddmm/url_io to /usr/local/bin. That version didn't
work either.
Check: /home/ramil/lddmm/gwiz/spool/lddmm-40/82/gwiz.out
Note: the environment variable:
X509_USER_PROXY=x509up
is in your version of url_io. But, the original one had LDDMM_PROXY.
cd /usr/local/bin
grep X509_USER_PROXY url_io
grep LDDMM_PROXY url_io.OLD
The SERVER_DN is not using Dorian. It looks like:
SERVER_DN=/C=US/O=BIRN/OU=BIRN-CC/CN=birnsrb/Storage Resource Broker
Is this the problem?
-Anthony
2008.09.10
- Ramil and Larry are looking into how the grid-mapfile gets created.
Date: Tue, 09 Sep 2008 11:51:10 -0700
From: Larry Lui <llui@ncmir.ucsd.edu>
To: Anthony Kolasny <akolasny@jhu.edu>
Cc: Ramil Manansala <ramil@ncmir.ucsd.edu>, V. Rowley <vrowley@ucsd.edu>,
Stephen Granite <sgranite@jhu.edu>, Kyle Reynolds <ksr@jhu.edu>,
Timothy Brown <tjab@jhu.edu>, rwinslow@bme.jhu.edu,
Jeff Grethe <jgrethe@ncmir.ucsd.edu>
Subject: Re: condor/srb followup
Parts/Attachments:
1 Shown 134 lines Text
2 Shown 113 lines Text
3 Shown 81 lines Text
4 Shown 28 lines Text
----------------------------------------
Hello All,
Here are the components that I was able to gather that create the
/etc/grid-security/grid-mapfile. I'm not much of a Perl expert, but I was
able to get the gist of what the code does....
1.)generate-grid-mapfile.prl (on gama server)
A.) Creates a webpage that has DN strings mapped to users. This
script looks through all the certs stored in a directory and produces a
static HTML page with the DN strings.
2.)sync_grid (on cluster)
A.) Reads in the static webpage generated from
generate-grid-mapfile.prl and will create the user accounts necessary on the
cluster.
B.) sync_grid calls sync-grid-accounts to alter the mapfile as well
as adding the user accounts to the cluster.
These files have been included in this email.
Larry
2008.09.08
- Kyle provided Anthony with root on core001. This allowed for greater exploration of the portal processing.
- The GridWizard is running through Ramil's account. /home/ramil/lddmm/gwiz/spool/lddmm-26/61 provides
information relating to the last lddmm run.
[root@cor001 61]# ls -l
total 36
-rw-r--r-- 1 ramil ramil 2016 Sep 8 10:59 gwiz.err
-rw------- 1 ramil ramil 291 Sep 8 10:59 gwiz.log
-rw-r--r-- 1 ramil ramil 4288 Sep 8 10:59 gwiz.out
-rw------- 1 ramil ramil 290 Sep 8 10:59 lddmm.condor
-rw------- 1 ramil ramil 618 Sep 8 10:59 lddmm.condor.log
-rw------- 1 ramil ramil 2197 Sep 8 10:59 lddmm-config.txt
-rw------- 1 ramil ramil 1231 Sep 8 10:59 parallel.sh
-rw------- 1 ramil ramil 2776 Sep 8 10:59 x509up
- parallel.sh is the job that is submitted to condor. It looks like:
#!/bin/bash
#
# This script generated automatically
#
$GWIZ_HOME/bin/loggit -f $GWIZ_WORK_DIR/gwiz.log Starting script
printenv
cd $GWIZ_WORK_DIR
chmod go=,u=rwX *
# --> 61
(
$GWIZ_HOME/bin/loggit -f $GWIZ_WORK_DIR/gwiz.log Starting task 61
nullATLAS=srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img
TARGET=srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img
ATLASNAME=$(basename $ATLAS | perl -pe's/\..*//')
TARGETNAME=$(basename $TARGET | perl -pe's/\..*//')
DOMAIN=$(echo $ATLAS | perl -pe's/.*\/akolasny\.([^-]+-[^\/]+).*/\1/')
OUTDIR=$(dirname $ATLAS | perl -pe's/srbfile://')/results.lddmm-26
export srbUser=akolasny
export mdasDomainName=$DOMAIN
export mdasDomainHome=$DOMAIN
export PORTAL_USER_EMAIL=akolasny@jhu.edu
Smkdir $OUTDIR
Smkdir $OUTDIR/$ATLASNAME
RESULTDIR=$OUTDIR/$ATLASNAME
printenv
/usr/local/bin/lddmm-volume -A srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img -T srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img -d 3 -o srbdir:$RESULTDIR -m -c lddmm-config.txt
# --: Task 61
$GWIZ_HOME/bin/loggit -f $GWIZ_WORK_DIR/gwiz.log Job exited with code=$?.
) &
pid0=$!
# <-- 61
for pid in $pid0 ; do wait $pid; done;
$GWIZ_HOME/bin/loggit -f $GWIZ_WORK_DIR/gwiz.log Job finished.
- From 'gwiz.out', we learn about the environment variables being passed to parallel.sh. They look like:
srbHost=cvrg-portal.nbirn.net
mdasDomainHome=ucsd-bcc
defaultResource=cvrg-nas
SERVER_DN=/C=US/O=BIRN/OU=BIRN-CC/CN=birnsrb/Storage Resource Broker
AUTH_SCHEME=GSI_AUTH
X509_USER_PROXY=x509up
mdasDomainName=ucsd-bcc
### Repeats
_=/usr/bin/printenv
srbHost=cvrg-portal.nbirn.net
mdasDomainHome=
HOSTNAME=cor001.icm.jhu.edu
defaultResource=cvrg-nas
mdasDomainName=
- What appears to be happening is that the following line in parallel.sh is wiping out the
mdasDomainName and mdasDomainHome information.
DOMAIN=$(echo $ATLAS | perl -pe's/.*\/akolasny\.([^-]+-[^\/]+).*/\1/')
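To see what that substitution yields, the expression can be tried outside the portal. The sketch below is not the GridWizard code: `extract_domain` is a hypothetical name, and `sed -E` stands in for the Perl one-liner with the same pattern. It suggests the pattern itself returns the expected domain when `$ATLAS` holds the SRB path, and returns nothing when `$ATLAS` is empty, which would blank both exported variables. Whether `$ATLAS` is actually empty at that point in parallel.sh is not confirmed here.

```shell
#!/bin/bash
# Same pattern as the Perl substitution in parallel.sh, written with sed
# so it can be run on its own. extract_domain is a hypothetical name.
extract_domain() {
    echo "$1" | sed -E 's#.*/akolasny\.([^-]+-[^/]+).*#\1#'
}

# With ATLAS set to the SRB path, the domain is extracted.
extract_domain "srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img"   # prints jhu-cis

# With ATLAS empty, the result is empty as well, so
# mdasDomainName and mdasDomainHome would be exported blank.
extract_domain ""   # prints an empty line
```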
2008.09.05
- 'getenv = true' in the condor command script allows the SRB commands to work. It now
looks like:
Executable = /home/lddmmproc/test_condor/lddmm_srb_getenv.sh
getenv = true
Universe = vanilla
machine_count = 4
output = /home/lddmmproc/test_condor/lddmm_srb_getenv.out
error = /home/lddmmproc/test_condor/lddmm_srb_getenv.err
Log = /home/lddmmproc/test_condor/lddmm_srb_getenv.log
Queue
- Looked at /opt/condor/etc/condor_config.local on core001 and on the birn cluster. They have the
same GSI information.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
GSI_DAEMON_DIRECTORY = /etc/grid-security
GSI_DAEMON_CERT = /etc/grid-security/birncondorpool.cert
GSI_DAEMON_KEY = /etc/grid-security/birncondorpool.key
GSI_DAEMON_NAME = /C=US/O=BIRN/OU=BIRN-CC/CN=birncondorpool
GRIDMAP = /etc/grid-security/grid-mapfile
[akolasny@birn-cluster0 etc]$ more /etc/grid-security/grid-mapfile |grep kolas
"/C=US/O=BIRN/O=Johns Hopkins Univ/OU=CIS/CN=Anthony Kolasny" btakolas
"/C=US/O=BIRN/OU=JHU - Center for Image Science/CN=Anthony Kolasny/USERID=akolasny" btakolas
In the CVRG version, the /etc/grid-security/grid-mapfile is missing.
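The grep above can be turned into a small check that distinguishes a missing grid-mapfile from a missing entry. This is a minimal sketch under the assumption that entries have the form shown above (a quoted DN followed by a local account); `lookup_dn` is a hypothetical name, not part of the BIRN or Globus tooling.

```shell
#!/bin/bash
# Hypothetical check: print the local account a certificate DN maps to.
# Returns 2 if the mapfile is unreadable, 1 if the DN has no entry.
#   $1 = path to grid-mapfile, $2 = DN without surrounding quotes
lookup_dn() {
    local mapfile=$1 dn=$2
    [ -r "$mapfile" ] || { echo "missing: $mapfile" >&2; return 2; }
    # Match the quoted DN exactly at the start of a line; print what follows it.
    awk -v dn="\"$dn\"" '
        index($0, dn " ") == 1 { print substr($0, length(dn) + 2); found = 1 }
        END { exit found ? 0 : 1 }
    ' "$mapfile"
}
```

For example, `lookup_dn /etc/grid-security/grid-mapfile "/C=US/O=BIRN/O=Johns Hopkins Univ/OU=CIS/CN=Anthony Kolasny"` should print `btakolas` on the birn cluster, given the entries listed above, and report the file as missing on the CVRG side.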
2008.09.04
- condor/srb status - lddmm submission to condor using srb is working. It appears the user environment is not being inherited.
Under /home/lddmmproc/test_condor, 'cat lddmm_srb.sh' shows the changes:
#!/bin/bash
HOME=/home/lddmmproc
export HOME
. $HOME/.profile
. $HOME/.bashrc
/usr/local/bin/lddmm-volume \
-A srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img \
-T srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img \
-o srbdir:/home/akolasny.jhu-cis/ball_test -d 3
The key was adding the 'HOME'.
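Because the job does not inherit the login environment, a missing variable only shows up later as an SRB error. A guard at the top of the wrapper can name the problem directly. The sketch below is an assumption about how such a guard might look; `require_env` is a hypothetical name and is not part of lddmm_srb.sh.

```shell
#!/bin/bash
# Hypothetical guard for a Condor job wrapper: fail early, naming the first
# environment variable that is unset or empty, instead of letting a later
# SRB command fail with a less obvious error.
require_env() {
    local name value
    for name in "$@"; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "required variable $name is not set" >&2
            return 1
        fi
    done
}
```

A wrapper could then start with `require_env HOME || exit 1` before sourcing `$HOME/.profile`.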
2008.09.02
- More portal troubleshooting.
The following command used SRB data and put the results on the
server:
[lddmmproc@cor001 ~]$ /usr/local/bin/lddmm-volume -A \
srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img -T \
srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img -o \
srbdir:/home/akolasny.jhu-cis/ball_test -d 3
[lddmmproc@cor001 ~]$ Sls
/home/akolasny.jhu-cis/ball_test:
ball3d_1.hdr
ball3d_1.img
ball3d_2.hdr
ball3d_2.img
C-/home/akolasny.jhu-cis/ball_test/ball3d_2
2. I tested this script using condor. It failed.
/home/lddmmproc/software/lddmm-volume/url_io: line 66: 22084 Segmentation fault (core dumped) Sinit
Sget: Connection to srbMaster failed.
Sget: FATAL: clConnect: Unable to determine the client domainName!
CLI_ERR_INVAILD_DOMAIN: Invalid domain name
Since it's a segmentation fault in 'Sinit', it would appear url_io is calling
'Sinit'. Is it that the condor daemon cannot see the .srb information?
This works through BIRN. But, the url_io is not calling 'Sinit' or
'Sexit'.
For testing the condor queue with srb:
su - lddmmproc
cd test_condor
condor_submit lddmm_srb.cmd
condor_q # Almost immediately dies (similar to portal)
more lddmm_srb.err # shows what happened.
rm lddmm_srb\.{out,err,log} core*
2008.08.29
- In order to run SRB Scommands on core-001, I needed to set up my .srb/.MdasEnv to look like:
mdasCollectionName '/home/akolasny.jhu-cis'
mdasCollectionHome '/home/akolasny.jhu-cis'
mdasDomainHome 'jhu-cis'
srbUser 'akolasny'
AUTH_SCHEME 'ENCRYPT1'
srbHost 'cvrg-portal.nbirn.net'
srbPort '5925'
defaultResource 'cvrg-nas'
SERVER_DN '/C=US/O=BIRN/OU=BIRN-CC/CN=birnsrb/Storage Resource Broker'
- From core-001, I was able to access CVRG SRB data and run LDDMM
[lddmmproc@cor001 portal_test]$ /usr/local/bin/lddmm-volume -A srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img -T srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img -d 3
** SINGLE TARGET
** SINGLE PROCESSOR JOB
ATLAS: ball3d_1
TARGET: ball3d_2
Performing mapping
/home/lddmmproc/software/lddmm-volume/x86_64/BIRNfluidmatch3D_x86_64 /tmp/lddmm.11614/input.tmp/ball3d_1.img /tmp/lddmm.11614/input.tmp/ball3d_1.hdr /tmp/lddmm.11614/input.tmp/ball3d_2.img /tmp/lddmm.11614/input.tmp/ball3d_2.hdr 1000 10 0.1 0.0000000001 1 1 1 0.01 1 1 20 5 1000 0 1 0.02 25 > stdout.txt
--------- LDDMM-Volume License Validation ------------
License for host id -595530932 valid
------------------------------------------------------
/home/lddmmproc/software/lddmm-volume/x86_64/gatherData_x86_64 Hmap Kimap Atlas defAtlas Patient defPatient gradI0 gather_out.txt
Gathering output files
Writing output to final destination
Finished ball3d_2
- Test trying to send the output to SRB.
[lddmmproc@cor001 portal_test]$ /usr/local/bin/lddmm-volume -A srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_1.img -T srbfile:/home/akolasny.jhu-cis/ball_test/ball3d_2.img -o srbdir:/home/akolasny.jhu-cis -d 3
** SINGLE TARGET
** SINGLE PROCESSOR JOB
ATLAS: ball3d_1
TARGET: ball3d_2
Performing mapping
/home/lddmmproc/software/lddmm-volume/x86_64/BIRNfluidmatch3D_x86_64 /tmp/lddmm.11819/input.tmp/ball3d_1.img /tmp/lddmm.11819/input.tmp/ball3d_1.hdr /tmp/lddmm.11819/input.tmp/ball3d_2.img /tmp/lddmm.11819/input.tmp/ball3d_2.hdr 1000 10 0.1 0.0000000001 1 1 1 0.01 1 1 20 5 1000 0 1 0.02 25 > stdout.txt
--------- LDDMM-Volume License Validation ------------
License for host id -595530932 valid
------------------------------------------------------
/home/lddmmproc/software/lddmm-volume/x86_64/gatherData_x86_64 Hmap Kimap Atlas defAtlas Patient defPatient gradI0 gather_out.txt
Gathering output files
Writing output to final destination
connectSvr: initPort error. status =-1103
Finished ball3d_2
