Submitting Open MPI jobs to SGE

I installed openmpi in /commun/data/packages/openmpi/ rather than under /usr/...; it was compiled with --with-sge.

I added a new PE to SGE as described in http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html
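For reference, a minimal sketch of creating such a PE and attaching it to a queue (the PE name `orte`, slot count, and queue name `all.q` are taken from this thread or assumed; adjust for your cluster):

```shell
# Add a new parallel environment (opens an editor; set at least these fields):
#   pe_name            orte
#   slots              999
#   control_slaves     TRUE        # let SGE start/control the remote MPI daemons
#   job_is_first_task  FALSE       # mpirun itself is not a compute task
qconf -ap orte

# Attach the PE to the queue, so "qconf -sq all.q | grep pe_" lists it:
qconf -mattr queue pe_list "make orte" all.q
```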

# /commun/data/packages/openmpi/bin/ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3)

# qconf -sq all.q | grep pe_
pe_list               make orte

Without SGE, the program runs fine on multiple processors.

/commun/data/packages/openmpi/bin/orterun -np 20 ./a.out args

Now I want to submit my program through SGE.

In the Open MPI FAQ I read:

# Allocate a SGE interactive job with 4 slots
# from a parallel environment (PE) named 'orte'
shell$ qsh -pe orte 4

but my output is:

qsh -pe orte 4
Your job 84550 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Could not start interactive job.

I also tried wrapping the mpirun command in a script:

$ cat ompi.sh
#!/bin/sh
/commun/data/packages/openmpi/bin/mpirun  \
    /path/to/a.out args

but it fails:

$ cat ompi.sh.e84552
error: executing task of job 84552 failed: execution daemon on host "node02" didn't accept task
--------------------------------------------------------------------------
A daemon (pid 18327) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
error: executing task of job 84552 failed: execution daemon on host "node01" didn't accept task
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.

How can I fix this?


Answered on the Open MPI users mailing list: http://www.open-mpi.org/community/lists/users/2013/02/21360.php


asked by Pierre, 08.02.2013


Answers (1)


In my case, setting "job_is_first_task FALSE" and "control_slaves TRUE" solved the problem.

# qconf -mp mpi1 

pe_name            mpi1
slots              9
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
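With the PE configured this way, a hedged sketch of a batch submission script (the paths are from this thread; the slot count and the LD_LIBRARY_PATH export, which addresses the "shared libraries" hint in the error output above, are assumptions):

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -pe mpi1 8              # request 8 slots from the PE defined above

# Make the non-standard Open MPI install visible on the remote nodes
export PATH=/commun/data/packages/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/commun/data/packages/openmpi/lib:$LD_LIBRARY_PATH

# Under SGE, mpirun picks up the allocated slots/hosts automatically,
# so no -np or hostfile is needed here
/commun/data/packages/openmpi/bin/mpirun /path/to/a.out args
```

Submit it with `qsub ompi.sh`.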
answered by carlewis, 13.09.2013