Why does the memory reported by the master not match the memory requested in the slurm script?

I am using the following slurm script to launch Spark 2.3.0.

#!/bin/bash
#SBATCH --account=def-hmcheick
#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --mem=100G
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=6
#SBATCH --output=/project/6008168/moudi/job/spark-job/sparkjob-%j.out
#SBATCH --mail-type=ALL
#SBATCH --error=/project/6008168/moudi/job/spark-job/error6_hours.out



## --------------------------------------
## 0. Preparation
## --------------------------------------

# load the Spark module
module load spark/2.3.0
module load python/3.7.0
source "/home/moudi/ENV3.7.0/bin/activate"

set -x
# identify the Spark cluster with the Slurm jobid
export SPARK_IDENT_STRING=$SLURM_JOBID

# prepare directories
export SPARK_WORKER_DIR=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/worker
export SPARK_LOG_DIR=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/logs
export SPARK_LOCAL_DIRS=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/tmp/spark
mkdir -p $SPARK_LOG_DIR $SPARK_WORKER_DIR $SPARK_LOCAL_DIRS

# These are the defaults anyway, but configurable
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080

export JOB_HOME="$HOME/.spark/2.3.0/$SPARK_IDENT_STRING"
echo "line 39----JOB_HOME=$JOB_HOME"
echo "line 40----SPARK_HOME=$SPARK_HOME"
mkdir -p $JOB_HOME

# Try to load stuff that the spark scripts will load
source "$SPARK_HOME/sbin/spark-config.sh"
source "$SPARK_HOME/bin/load-spark-env.sh"

## --------------------------------------
## 1. Start the Spark cluster master
## --------------------------------------

$SPARK_HOME/sbin/start-master.sh
sleep 5
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.*master*.out)
echo "line 54----MASTER_URL = ${MASTER_URL}"


## --------------------------------------
## 2. Start the Spark cluster workers
## --------------------------------------

# get the resource details from the Slurm job
export SPARK_WORKER_CORES=${SLURM_CPUS_PER_TASK:-1}
export SPARK_MEM=$(( ${SLURM_MEM_PER_CPU:-3072} * ${SLURM_CPUS_PER_TASK:-1} ))
#export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *93/100)))
export SPARK_DAEMON_MEMORY=${SPARK_MEM}m
export SPARK_WORKER_MEMORY=${SPARK_MEM}
NWORKERS=${SLURM_NTASKS:-1} # just for testing; you should delete this line
NEXECUTORS=$((SLURM_NTASKS - 1))

# start the workers on each node allocated to the job
export SPARK_NO_DAEMONIZE=1

srun -n ${NWORKERS} -N $SLURM_JOB_NUM_NODES --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL}  &

## --------------------------------------
## 3. Submit a task to the Spark cluster
## --------------------------------------
spark-submit --master ${MASTER_URL} --total-executor-cores $((SLURM_NTASKS * SLURM_CPUS_PER_TASK)) --executor-memory ${SPARK_WORKER_MEMORY}m  --num-executors $((SLURM_NTASKS - 1)) --driver-memory ${SPARK_WORKER_MEMORY}m /project/6008168/moudi/mainold.py

flag_path=$JOB_HOME/master_host
export SPARK_MASTER_IP=$( hostname )
echo "line 81----SPARK_MASTER_IP=$SPARK_MASTER_IP"
MASTER_NODE=$( scontrol show hostname $SLURM_NODELIST | head -n 1 )
MASTER_NODE=$MASTER_NODE.int.cedar.computecanada.ca
MASTER_URL="spark://$MASTER_NODE:$SPARK_MASTER_PORT"

## --------------------------------------
## 4. Clean up
## --------------------------------------


# stop the workers
scancel ${SLURM_JOBID}.0

# stop the master
$SPARK_HOME/sbin/stop-master.sh

The script does not work well. I get the following problem: "Error occurred during initialization of VM. Too small initial heap".
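To locate the failure I simply grep the job's error output for the JVM message (the path is just the --error file set in the #SBATCH directives above):

grep -n -B2 -A2 "Error occurred during initialization of VM" /project/6008168/moudi/job/spark-job/error6_hours.out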

Indeed, at the beginning of the master's output file I get:

Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host cdr1272.int.cedar.computecanada.ca --port 7077 --webui-port 8080

Because of this -Xmx1g, Spark fails. Can you help me diagnose why it is 1g? I already specify that the master's memory should be 15g.
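If I understand spark-env.sh.template correctly, SPARK_DAEMON_MEMORY is the heap given to the master and worker daemons themselves (default 1g), so I expected the master to be launched with -Xmx15360m. Here is a minimal sketch of what the script's arithmetic should produce; the 3072 MB fallback is my assumption, since I request --mem per node and SLURM_MEM_PER_CPU may therefore be unset:

# sketch only, mirroring the arithmetic in the script above
SLURM_CPUS_PER_TASK=5
SPARK_MEM=$(( ${SLURM_MEM_PER_CPU:-3072} * ${SLURM_CPUS_PER_TASK:-1} ))  # 3072 * 5 = 15360
echo "SPARK_MEM=${SPARK_MEM} MB"           # 15360 MB, i.e. ~15 GB
echo "SPARK_DAEMON_MEMORY=${SPARK_MEM}m"   # heap I expect the master JVM to get
echo "SPARK_WORKER_MEMORY=${SPARK_MEM}"    # memory each worker should register with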

In the same file (the master's output) I see 12 workers registering, each with 5 cores and 15 GB:

19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:46822 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.141.2:41554 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.141.2:38652 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:35553 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:43477 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:36128 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:35494 with 5 cores, 15.0 GB RAM
19/05/07 15:11:27 INFO Master: Registering worker 172.16.141.2:34899 with 5 cores, 15.0 GB RAM
19/05/07 15:11:27 INFO Master: Registering worker 172.16.140.247:40010 with 5 cores, 15.0 GB RAM
19/05/07 15:11:29 INFO Master: Registering worker 172.16.141.2:37054 with 5 cores, 15.0 GB RAM
19/05/07 15:11:31 INFO Master: Registering worker 172.16.141.2:37322 with 5 cores, 15.0 GB RAM
19/05/07 15:11:33 INFO Master: Registering worker 172.16.141.2:36519 with 5 cores, 15.0 GB RAM

...

19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/0 on worker worker-20190507151124-172.16.140.247-36128
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/1 on worker worker-20190507151124-172.16.141.2-38652
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/2 on worker worker-20190507151124-172.16.141.2-41554
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/3 on worker worker-20190507151124-172.16.140.247-43477
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/4 on worker worker-20190507151126-172.16.141.2-34899
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/5 on worker worker-20190507151128-172.16.141.2-37054
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/6 on worker worker-20190507151124-172.16.140.247-35553
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/7 on worker worker-20190507151130-172.16.141.2-37322
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/8 on worker worker-20190507151124-172.16.140.247-46822
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/9 on worker worker-20190507151124-172.16.140.247-35494
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/10 on worker worker-20190507151132-172.16.141.2-36519
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/11 on worker worker-20190507151127-172.16.140.247-40010
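The worker count itself does match the Slurm request if I do the arithmetic by hand (again a sketch, with the same 3072 MB per-CPU assumption as above):

NODES=2; NTASKS_PER_NODE=6; CPUS_PER_TASK=5; MEM_PER_CPU=3072
echo "workers: $(( NODES * NTASKS_PER_NODE ))"                    # 12, as registered by the master
echo "memory per worker: $(( CPUS_PER_TASK * MEM_PER_CPU )) MB"   # 15360 MB ~= 15.0 GB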

In addition, in the worker directory I see a subfolder named app-20190507151158-0000. It in turn contains subfolders 0..11, each with a stderr file that looks like a log. I also notice that each of these files contains:

19/05/05 23:49:38 INFO MemoryStore: MemoryStore started with capacity 7.8 GB

I do not know what exactly this means. Does each executor get only 7.8 GB, or the full 15 GB?
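My rough understanding (which may well be wrong, hence the question) is that in Spark 2.x the MemoryStore capacity is not the whole executor heap, but roughly (JVM max heap - 300 MB reserved) * spark.memory.fraction, with spark.memory.fraction defaulting to 0.6, and that the JVM reports a max heap somewhat below -Xmx. A back-of-envelope sketch under those assumptions:

XMX_MB=15360                            # --executor-memory 15360m
JVM_MAX_MB=$(( XMX_MB * 90 / 100 ))     # assumption: Runtime.maxMemory() is ~90% of -Xmx
USABLE_MB=$(( JVM_MAX_MB - 300 ))       # minus the 300 MB Spark reserves
STORAGE_MB=$(( USABLE_MB * 60 / 100 ))  # spark.memory.fraction = 0.6 (Spark 2.x default)
echo "expected MemoryStore capacity: ~${STORAGE_MB} MB"   # ~8100 MB, close to the 7.8 GB logged

If that reading is correct, the executors do get the full 15 GB heap and 7.8 GB is only the unified storage/execution pool, but I would like confirmation.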

...