wiki:RunningUCERF3Inversions

Getting and running the code

The UCERF3 inversion can be run with Java 6 and OpenSHA. First, either obtain the OpenSHA source code and build it yourself (see SettingUpEclipse), or download the latest complete distribution jar file from: http://opensha.usc.edu/dev/opensha-complete/nightly/opensha-complete.jar

If you choose the jar file approach, you must also download Apache Commons CLI 1.2 (or later) and include it in your Java classpath when running inversions.

You must then run the following Java class with all required arguments: scratch.UCERF3.inversion.CommandLineInversionRunner. The inversion is extremely memory intensive, and at least 8 GB of memory is recommended. For example, to run the UCERF3 reference branch from a jar file, with results written to /tmp:

java -Xmx8G -Xms8G -cp /path/to/OpenSHA_complete.jar:/path/to/commons-cli-1.2.jar scratch.UCERF3.inversion.CommandLineInversionRunner --completion-time 5h --sub-completion 1s --cool FAST_SA --nonneg LIMIT_ZERO_RATES --num-threads 100% --branch-prefix FM3_1_ZENGBB_Shaw09Mod_DsrTap_CharConst_M5Rate7.9_MMaxOff7.6_NoFix_SpatSeisU3 --directory /tmp

Command Line Inversion Runner usage

The following arguments are required for any inversion:

option | typical value | description
-s, --sub-completion | 1s | Length of time or number of iterations between thread synchronizations in the parallel simulated annealing framework. Append 's' for seconds, 'm' for minutes, or 'h' for hours (default is milliseconds). A typical value is 1s (1 second).
-time, --completion-time | 5h | Total length of time for the inversion. Append 's' for seconds, 'm' for minutes, or 'h' for hours (default is milliseconds). A typical value is 5h (5 hours) to ensure good convergence, although shorter times can converge quite well.
-t, --num-threads | 100% | Number of threads. A percentage of the available processors can also be specified, for example '50%'.
-branch, --branch-prefix | | Logic tree branch, for example: FM3_1_ZENGBB_Shaw09Mod_DsrTap_CharConst_M5Rate7.9_MMaxOff7.6_NoFix_SpatSeisU3. See Logic Tree Branch Names below.
-dir, --directory | | Directory where inversion results and plots should be written.
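
To make the time suffix convention concrete, here is a minimal shell sketch that converts a value such as 5h or 1s into milliseconds. It only illustrates the convention described in the table and is not the runner's own parser; the function name duration_to_millis is hypothetical, and the sketch assumes whole-number time values (it does not cover iteration counts).

```shell
# Illustration only (not part of OpenSHA): convert a duration written with
# an s, m or h suffix into milliseconds. A bare number is taken to be
# milliseconds already. Only whole numbers are handled.
duration_to_millis() {
  value=$1
  case "$value" in
    *h) echo $(( ${value%h} * 3600000 )) ;;  # hours
    *m) echo $(( ${value%m} * 60000 )) ;;    # minutes
    *s) echo $(( ${value%s} * 1000 )) ;;     # seconds
    *)  echo "$value" ;;                     # no suffix: milliseconds
  esac
}

duration_to_millis 5h   # completion time used in the examples
duration_to_millis 1s   # sub-completion time used in the examples
```

With these inputs, a 5h completion time and a 1s sub-completion time correspond to 18000000 ms and 1000 ms respectively.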

Other optional arguments:

option | typical value | description
-cool, --cooling-schedule | FAST_SA | Simulated annealing cooling schedule. One of: CLASSICAL_SA, FAST_SA, VERYFAST_SA, LINEAR. Default: FAST_SA
-i, --initial-state-file | | An initial state for the inversion can be supplied (default is all zeros). It must have already had any waterlevel subtracted.
-inigr, --initial-gr | (n/a) | Flag for using a GR starting model.
-light, --lightweight | (n/a) | Only write out a bin file for the solution. Leave the rup set file if the prefix indicates run 0.
-nonneg, --nonnegativity-const | LIMIT_ZERO_RATES | Nonnegativity constraint. One of: TRY_ZERO_RATES_OFTEN, LIMIT_ZERO_RATES, PREVENT_ZERO_RATES. Default: LIMIT_ZERO_RATES
-noplots, --no-plots | (n/a) | Flag for disabling any post-inversion plot generation.
-perturb, --perturbation-function | UNIFORM_NO_TEMP_DEPENDENCE | Perturbation function. One of: UNIFORM_NO_TEMP_DEPENDENCE, VARIABLE_NO_TEMP_DEPENDENCE, GAUSSIAN, TANGENT, POWER_LAW, EXPONENTIAL. Default: UNIFORM_NO_TEMP_DEPENDENCE
-serial, --force-serial | (n/a) | Force serial (classical) simulated annealing that doesn't use multiple threads.

Additional arguments for tweaking inversion weights and other non-production changes are described at the start of the CommandLineInversionRunner class.

Logic Tree Branch Names

You must specify a full logic tree branch prefix for each inversion run. The prefix is a concatenation of one choice from each of the following logic tree branch nodes, in the order listed, separated by underscores. The default (reference) choice for each node is the one that appears in the reference branch.

For example, the reference branch is: FM3_1_ZENGBB_Shaw09Mod_DsrTap_CharConst_M5Rate7.9_MMaxOff7.6_NoFix_SpatSeisU3
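
A prefix can be assembled mechanically from one choice per node. The following is a minimal shell sketch, not part of OpenSHA: the function name build_branch_prefix is hypothetical, and it does not check that the choices are valid or in the right order.

```shell
# Hypothetical helper (not part of OpenSHA): join one choice per logic tree
# node with underscores, in the order the arguments are given.
build_branch_prefix() {
  # In a subshell, set IFS so that "$*" joins the arguments with "_".
  ( IFS=_; printf '%s\n' "$*" )
}

build_branch_prefix FM3_1 ZENGBB Shaw09Mod DsrTap CharConst \
  M5Rate7.9 MMaxOff7.6 NoFix SpatSeisU3
```

Called with the reference choices as above, the function prints the reference branch prefix.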

Fault Models

Prefix | Description
FM3_1 | Fault Model 3.1
FM3_2 | Fault Model 3.2

Deformation Models

Prefix | Description
ABM | Average Block Model
NEOK | Neokinema
ZENGBB | Zeng B-Fault Bounded
GEOL | Geologic

Scaling Relationships

Prefix | Description
Shaw09Mod | Shaw (2009) Modified
HB08 | Hanks & Bakun (2008)
EllB | Ellsworth B
EllBsqrtLen | EllB M(A) & Shaw12 Sqrt Length D(L)
ShConStrDrp | Shaw09 M(A) & Shaw12 Const Stress Drop D(L)

Slip Along Rupture Models

Prefix | Description
DsrUni | Uniform
DsrTap | Tapered Ends

Inversion Models

Prefix | Description
CharConst | Characteristic (Constrained)

Total Mag 5 Rate

Prefix | Description
M5Rate6.5 | 6.5
M5Rate7.9 | 7.9
M5Rate9.6 | 9.6

Max Mag Off Fault

Prefix | Description
MMaxOff7.3 | 7.3
MMaxOff7.6 | 7.6
MMaxOff7.9 | 7.9

Moment Rate Fixes

Prefix | Description
NoFix | No Moment Rate Fixes

Spatial Seismicity PDF

Prefix | Description
SpatSeisU2 | UCERF2
SpatSeisU3 | UCERF3
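
Taking one choice from each node listed above gives 2 x 4 x 5 x 2 x 1 x 3 x 3 x 1 x 2 = 1440 possible prefixes. The shell sketch below enumerates them with nested loops. The function name list_all_branches is hypothetical, and the sketch only combines the prefixes shown on this page; whether every combination is appropriate for a given study is not addressed here.

```shell
# Illustration only: print every branch prefix that can be formed from the
# node choices listed on this page, one per line. The Inversion Model
# (CharConst) and Moment Rate Fixes (NoFix) nodes have a single choice each.
list_all_branches() {
  for fm in FM3_1 FM3_2; do
  for dm in ABM NEOK ZENGBB GEOL; do
  for scale in Shaw09Mod HB08 EllB EllBsqrtLen ShConStrDrp; do
  for dsr in DsrUni DsrTap; do
  for m5 in M5Rate6.5 M5Rate7.9 M5Rate9.6; do
  for mmax in MMaxOff7.3 MMaxOff7.6 MMaxOff7.9; do
  for seis in SpatSeisU2 SpatSeisU3; do
    echo "${fm}_${dm}_${scale}_${dsr}_CharConst_${m5}_${mmax}_NoFix_${seis}"
  done; done; done; done; done; done; done
}

# Count the generated prefixes.
list_all_branches | wc -l
```

The output of list_all_branches can be piped to other tools, for example to select a subset with grep before building job scripts.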

Running on a Cluster

All cluster job submission scripts were written by scratch.UCERF3.simulatedAnnealing.hpc.LogicTreePBSWriter.main(String[]). Inversions were run at USC HPCC (smaller tests/prototypes) and TACC Stampede (production runs). Instructions are given below for running inversions in a similar cluster environment.

Many single node jobs

The simplest way to run inversions on a cluster is to run a single inversion on each compute node, with a single submission script for each job. Sample job scripts are given below for both HPCC and Stampede. Paths and arguments would need to be modified for the user running the job.

HPCC Example Single Node PBS Script

#!/bin/bash

#PBS -l walltime=00:360:00,nodes=1:quadcore:ppn=8
#PBS -V

/usr/usc/jdk/default/jre/bin/java -Djava.awt.headless=true -Xmx8000M -Xms8000M -cp /home/scec-02/kmilner/ucerf3/inversions/2013_05_10-ucerf3p3-cooling-tests/OpenSHA_complete.jar:/home/scec-02/kmilner/ucerf3/inversions/parallelcolt-0.9.4.jar:/home/scec-02/kmilner/ucerf3/inversions/commons-cli-1.2.jar:/home/scec-02/kmilner/ucerf3/inversions/csparsej.jar scratch.UCERF3.inversion.CommandLineInversionRunner --completion-time 5h --sub-completion 1s --cool FAST_SA --nonneg LIMIT_ZERO_RATES --num-threads 95% --branch-prefix FM3_1_ZENGBB_Shaw09Mod_DsrTap_CharConst_M5Rate6.5_MMaxOff7.6_NoFix_SpatSeisU3_VarNone_VarSlowCool10 --directory /home/scec-02/kmilner/ucerf3/inversions/2013_05_10-ucerf3p3-cooling-tests --slower-cooling 10
exit $?

Stampede Example Single Node PBS Script

#!/bin/bash

#SBATCH -t 00:360:00
#SBATCH -n 16
#SBATCH -p normal

/home1/00950/kevinm/java/default/bin/java -Djava.awt.headless=true -Xmx25000M -Xms25000M -cp /work/00950/kevinm/ucerf3/inversion/2013_05_03-ucerf3p3-production-first-five/OpenSHA_complete.jar:/work/00950/kevinm/ucerf3/inversion/parallelcolt-0.9.4.jar:/work/00950/kevinm/ucerf3/inversion/commons-cli-1.2.jar:/work/00950/kevinm/ucerf3/inversion/csparsej.jar scratch.UCERF3.inversion.CommandLineInversionRunner --completion-time 5h --sub-completion 1s --cool FAST_SA --nonneg LIMIT_ZERO_RATES --num-threads 5 --branch-prefix FM3_1_ABM_EllB_DsrTap_CharConst_M5Rate6.5_MMaxOff7.3_NoFix_SpatSeisU2_run0 --directory /work/00950/kevinm/ucerf3/inversion/2013_05_03-ucerf3p3-production-first-five --no-plots
exit $?
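
When many branches are run this way, the per-job scripts differ only in the branch prefix. The production scripts were written by the LogicTreePBSWriter class mentioned above; the following shell sketch is a simplified stand-in, not that class. The function name write_job_scripts, the .slurm file extension, and all /path/to/... locations are hypothetical placeholders, and the scheduler header and Java options are copied from the Stampede example and would need adjusting for another system.

```shell
# Sketch (not LogicTreePBSWriter): write one single-node job script per
# branch prefix read from stdin. All paths below are placeholders.
write_job_scripts() {
  script_dir=$1   # where the job scripts are written
  run_dir=$2      # value passed to --directory in each job
  mkdir -p "$script_dir"
  while read -r branch; do
    cat > "$script_dir/$branch.slurm" <<EOF
#!/bin/bash

#SBATCH -t 00:360:00
#SBATCH -n 16
#SBATCH -p normal

java -Djava.awt.headless=true -Xmx25000M -Xms25000M -cp /path/to/OpenSHA_complete.jar:/path/to/parallelcolt-0.9.4.jar:/path/to/commons-cli-1.2.jar:/path/to/csparsej.jar scratch.UCERF3.inversion.CommandLineInversionRunner --completion-time 5h --sub-completion 1s --cool FAST_SA --nonneg LIMIT_ZERO_RATES --num-threads 100% --branch-prefix $branch --directory $run_dir --no-plots
exit \$?
EOF
  done
}
```

A possible invocation, with one branch prefix per line in a hypothetical file branches.txt, is: write_job_scripts ./jobs /path/to/results < branches.txt. Each generated script can then be submitted individually.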

Bundled Large MPI Jobs

Some schedulers give preference to single large jobs over many small jobs. They may also limit the total number of jobs that can be submitted. You can get around these limitations by submitting a single MPI job that runs many inversions on many nodes. This has the added benefit of allowing you to run multiple inversions per node if enough processors and memory are available. UCERF3 production runs were performed with this method on the Stampede supercomputer with 3 inversions per node.

To use this method, your PBS script must call scratch.UCERF3.simulatedAnnealing.hpc.MPJInversionDistributor. You must also supply the "--exact-dispatch X" argument, where X is the number of threads per node. The total number of inversions must not exceed X*NODES, so with 3 threads per node and 256 nodes, you can submit at most 768 inversions. You must also supply an XML file argument, which is described below. Additionally, you must download and install FastMPJ in your user account, as this library is required.
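
The capacity limit above is simple arithmetic, sketched here in shell so it can be checked before a batch is submitted. The function names max_inversions and inversions_fit are hypothetical and are not part of OpenSHA or FastMPJ.

```shell
# Illustration only: upper bound on inversions for one bundled job, i.e.
# X * NODES, where X is the value passed to --exact-dispatch.
max_inversions() {
  nodes=$1
  per_node=$2
  echo $(( nodes * per_node ))
}

# Succeeds (exit status 0) if the requested number of inversions fits.
inversions_fit() {
  [ "$1" -le "$(max_inversions "$2" "$3")" ]
}

max_inversions 256 3   # the example from the text
inversions_fit 768 256 3 && echo "768 inversions fit on 256 nodes"
```

For the example in the text, max_inversions 256 3 gives 768, and a batch of 769 inversions would fail the inversions_fit check.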

Stampede Batch PBS Script

#!/bin/bash

#SBATCH -t 00:420:00
#SBATCH -n 2048
#SBATCH -p normal

PBS_NODEFILE="/tmp/${USER}-hostfile-${SLURM_JOBID}"
echo "creating PBS_NODEFILE: $PBS_NODEFILE"
scontrol show hostnames $SLURM_NODELIST > $PBS_NODEFILE

export FMPJ_HOME=/home1/00950/kevinm/FastMPJ
export PATH=$PATH:$FMPJ_HOME/bin

if [[ -e $PBS_NODEFILE ]]; then
  #count the number of processors assigned by PBS
  NP=`wc -l < $PBS_NODEFILE`
  echo "Running on $NP processors: "`cat $PBS_NODEFILE`
else
  echo "This script must be submitted to PBS with 'qsub -l nodes=X'"
  exit 1
fi

if [[ $NP -le 0 ]]; then
  echo "invalid NP: $NP"
  exit 1
fi

date
echo "RUNNING FMPJ"
fmpjrun -machinefile $PBS_NODEFILE -np $NP -dev niodev -Djava.library.path=$FMPJ_HOME/lib -Djava.awt.headless=true -Xmx25000M -Xms25000M -cp /work/00950/kevinm/ucerf3/inversion/2013_05_03-ucerf3p3-production-first-five/OpenSHA_complete.jar:/work/00950/kevinm/ucerf3/inversion/parallelcolt-0.9.4.jar:/work/00950/kevinm/ucerf3/inversion/commons-cli-1.2.jar:/work/00950/kevinm/ucerf3/inversion/csparsej.jar  -class scratch.UCERF3.simulatedAnnealing.hpc.MPJInversionDistributor --exact-dispatch 3 /work/00950/kevinm/ucerf3/inversion/2013_05_03-ucerf3p3-production-first-five/batch00.xml
ret=$?

date
echo "DONE with process 0. EXIT CODE: $ret"

exit $ret

XML Input File

The XML input file simply supplies a list of arguments for each inversion. This is an example for 384 inversions. The "num" attribute at the end of each InversionConfiguration line is a sanity check: it must equal the number of arguments in "args" (15 in this example).

<?xml version="1.0" encoding="UTF-8"?>

<OpenSHA>
  <InversionConfigurations num="384">
    <InversionConfiguration index="0" args="--completion-time 5h --sub-completion 1s --cool FAST_SA --nonneg LIMIT_ZERO_RATES --num-threads 5 --branch-prefix FM3_1_ABM_Shaw09Mod_DsrUni_CharConst_M5Rate6.5_MMaxOff7.3_NoFix_SpatSeisU2_run0 --directory /work/00950/kevinm/ucerf3/inversion/2013_05_03-ucerf3p3-production-first-five --no-plots" num="15"/>
    <InversionConfiguration index="1" args="--completion-time 5h --sub-completion 1s --cool FAST_SA --nonneg LIMIT_ZERO_RATES --num-threads 5 --branch-prefix FM3_1_ABM_Shaw09Mod_DsrUni_CharConst_M5Rate6.5_MMaxOff7.3_NoFix_SpatSeisU3_run0 --directory /work/00950/kevinm/ucerf3/inversion/2013_05_03-ucerf3p3-production-first-five --no-plots" num="15"/>
    <InversionConfiguration index="2" args="--completion-time 5h --sub-completion 1s --cool FAST_SA --nonneg LIMIT_ZERO_RATES --num-threads 5 --branch-prefix FM3_1_ABM_Shaw09Mod_DsrUni_CharConst_M5Rate6.5_MMaxOff7.6_NoFix_SpatSeisU2_run0 --directory /work/00950/kevinm/ucerf3/inversion/2013_05_03-ucerf3p3-production-first-five --no-plots" num="15"/>
    ...
    <InversionConfiguration index="383" args="--completion-time 5h --sub-completion 1s --cool FAST_SA --nonneg LIMIT_ZERO_RATES --num-threads 5 --branch-prefix FM3_1_ZENGBB_Shaw09Mod_DsrTap_CharConst_M5Rate6.5_MMaxOff7.9_NoFix_SpatSeisU3_run0 --directory /work/00950/kevinm/ucerf3/inversion/2013_05_03-ucerf3p3-production-first-five --no-plots" num="15"/>
  </InversionConfigurations>
</OpenSHA>
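
For large batches, a file in this layout is easier to generate than to write by hand. The following is a minimal shell sketch, assuming the layout shown above and the same argument list as the example. The function name write_inversion_xml is hypothetical, the output directory is a placeholder, and the sketch assumes that branch prefixes and paths contain no spaces or XML special characters.

```shell
# Sketch (not part of OpenSHA): print an XML input file in the layout shown
# above, with one InversionConfiguration per branch prefix read from stdin.
write_inversion_xml() {
  out_dir=$1   # value passed to --directory for every inversion
  common="--completion-time 5h --sub-completion 1s --cool FAST_SA --nonneg LIMIT_ZERO_RATES --num-threads 5"
  branches=$(cat)
  total=$(printf '%s\n' "$branches" | wc -l | tr -d ' ')
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo
  echo '<OpenSHA>'
  echo "  <InversionConfigurations num=\"$total\">"
  index=0
  printf '%s\n' "$branches" | while read -r branch; do
    args="$common --branch-prefix $branch --directory $out_dir --no-plots"
    # Split args on whitespace so that $# is the argument count used for "num".
    set -- $args
    echo "    <InversionConfiguration index=\"$index\" args=\"$args\" num=\"$#\"/>"
    index=$((index + 1))
  done
  echo '  </InversionConfigurations>'
  echo '</OpenSHA>'
}

printf '%s\n' \
  FM3_1_ABM_Shaw09Mod_DsrUni_CharConst_M5Rate6.5_MMaxOff7.3_NoFix_SpatSeisU2_run0 \
  FM3_1_ABM_Shaw09Mod_DsrUni_CharConst_M5Rate6.5_MMaxOff7.3_NoFix_SpatSeisU3_run0 \
  | write_inversion_xml /path/to/results
```

Because "num" is computed from the argument string itself, it stays consistent if arguments are added to or removed from the common list.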
    
Last modified on Feb 3, 2015, 9:44:14 AM