I am using the following Slurm script to launch Spark 2.3.0.
#!/bin/bash
#SBATCH --account=def-hmcheick
#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --mem=100G
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=6
#SBATCH --output=/project/6008168/moudi/job/spark-job/sparkjob-%j.out
#SBATCH --mail-type=ALL
#SBATCH --error=/project/6008168/moudi/job/spark-job/error6_hours.out
## --------------------------------------
## 0. Preparation
## --------------------------------------
# load the Spark module
module load spark/2.3.0
module load python/3.7.0
source "/home/moudi/ENV3.7.0/bin/activate"
set -x
# identify the Spark cluster with the Slurm jobid
export SPARK_IDENT_STRING=$SLURM_JOBID
# prepare directories
export SPARK_WORKER_DIR=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/worker
export SPARK_LOG_DIR=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/logs
export SPARK_LOCAL_DIRS=$HOME/.spark/2.3.0/$SPARK_IDENT_STRING/tmp/spark
mkdir -p $SPARK_LOG_DIR $SPARK_WORKER_DIR $SPARK_LOCAL_DIRS
# These are the defaults anyways, but configurable
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export JOB_HOME="$HOME/.spark/2.3.0/$SPARK_IDENT_STRING"
echo "line 39----JOB_HOME=$JOB_HOME"
echo "line 40----SPARK_HOME=$SPARK_HOME"
mkdir -p $JOB_HOME
# Try to load stuff that the spark scripts will load
source "$SPARK_HOME/sbin/spark-config.sh"
source "$SPARK_HOME/bin/load-spark-env.sh"
## --------------------------------------
## 1. Start the Spark cluster master
## --------------------------------------
$SPARK_HOME/sbin/start-master.sh
sleep 5
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.*master*.out)
echo "line 54----MASTER_URL = ${MASTER_URL}"
## --------------------------------------
## 2. Start the Spark cluster workers
## --------------------------------------
# get the resource details from the Slurm job
export SPARK_WORKER_CORES=${SLURM_CPUS_PER_TASK:-1}
export SPARK_MEM=$(( ${SLURM_MEM_PER_CPU:-3072} * ${SLURM_CPUS_PER_TASK:-1} ))
#export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *93/100)))
export SPARK_DAEMON_MEMORY=${SPARK_MEM}m
export SPARK_WORKER_MEMORY=${SPARK_MEM}
NWORKERS=${SLURM_NTASKS:-1} #just for testing you should delete this line
NEXECUTORS=$((SLURM_NTASKS - 1))
# start the workers on each node allocated to the job
export SPARK_NO_DAEMONIZE=1
srun -n ${NWORKERS} -N $SLURM_JOB_NUM_NODES --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} &
## --------------------------------------
## 3. Submit a task to the Spark cluster
## --------------------------------------
spark-submit --master ${MASTER_URL} --total-executor-cores $((SLURM_NTASKS * SLURM_CPUS_PER_TASK)) --executor-memory ${SPARK_WORKER_MEMORY}m --num-executors $((SLURM_NTASKS - 1)) --driver-memory ${SPARK_WORKER_MEMORY}m /project/6008168/moudi/mainold.py
flag_path=$JOB_HOME/master_host
export SPARK_MASTER_IP=$( hostname )
echo "line 81----SPARK_MASTER_IP=$SPARK_MASTER_IP"
MASTER_NODE=$( scontrol show hostname $SLURM_NODELIST | head -n 1 )
MASTER_NODE=$MASTER_NODE.int.cedar.computecanada.ca
MASTER_URL="spark://$MASTER_NODE:$SPARK_MASTER_PORT"
## --------------------------------------
## 4. Clean up
## --------------------------------------
# stop the workers
scancel ${SLURM_JOBID}.0
# stop the master
$SPARK_HOME/sbin/stop-master.sh
The script does not work well. I get the following problem:

Error occurred during initialization of VM
Too small initial heap

Indeed, at the top of the master's output file I see:
Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/moudi/.spark/2.3.0/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host cdr1272.int.cedar.computecanada.ca --port 7077 --webui-port 8080
Because of the -Xmx1g, Spark does not work. Can you help me diagnose why it is 1g? I have already specified that the master's memory should be 15g.
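To confirm which heap the master really got, I grep the -Xmx flag out of the "Spark Command:" line at the top of its log. A minimal sketch; here log_line stands in for the actual contents of the $SPARK_LOG_DIR/spark-*-master*.out file:

```shell
# Sketch: extract the -Xmx flag from the "Spark Command:" line of a master log.
# log_line is a stand-in for the real log file contents.
log_line='Spark Command: /usr/bin/java -cp ... -Xmx1g org.apache.spark.deploy.master.Master --host host --port 7077'
echo "$log_line" | grep -o -- '-Xmx[0-9]*[gm]'
```

This prints -Xmx1g, i.e. the Spark default for the master daemon, rather than the 15g I expected.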
In the same file (the master's output) I see 12 workers with 5 cores and 15 GB each:
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:46822 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.141.2:41554 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.141.2:38652 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:35553 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:43477 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:36128 with 5 cores, 15.0 GB RAM
19/05/07 15:11:25 INFO Master: Registering worker 172.16.140.247:35494 with 5 cores, 15.0 GB RAM
19/05/07 15:11:27 INFO Master: Registering worker 172.16.141.2:34899 with 5 cores, 15.0 GB RAM
19/05/07 15:11:27 INFO Master: Registering worker 172.16.140.247:40010 with 5 cores, 15.0 GB RAM
19/05/07 15:11:29 INFO Master: Registering worker 172.16.141.2:37054 with 5 cores, 15.0 GB RAM
19/05/07 15:11:31 INFO Master: Registering worker 172.16.141.2:37322 with 5 cores, 15.0 GB RAM
19/05/07 15:11:33 INFO Master: Registering worker 172.16.141.2:36519 with 5 cores, 15.0 GB RAM
...
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/0 on worker worker-20190507151124-172.16.140.247-36128
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/1 on worker worker-20190507151124-172.16.141.2-38652
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/2 on worker worker-20190507151124-172.16.141.2-41554
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/3 on worker worker-20190507151124-172.16.140.247-43477
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/4 on worker worker-20190507151126-172.16.141.2-34899
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/5 on worker worker-20190507151128-172.16.141.2-37054
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/6 on worker worker-20190507151124-172.16.140.247-35553
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/7 on worker worker-20190507151130-172.16.141.2-37322
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/8 on worker worker-20190507151124-172.16.140.247-46822
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/9 on worker worker-20190507151124-172.16.140.247-35494
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/10 on worker worker-20190507151132-172.16.141.2-36519
19/05/07 15:11:58 INFO Master: Launching executor app-20190507151158-0000/11 on worker worker-20190507151127-172.16.140.247-40010
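For what it's worth, these registrations line up with the Slurm directives: srun launches one worker per task. A quick sanity check using the same fallback value the script assumes for SLURM_MEM_PER_CPU (3072 MB is my assumption, taken from the script's default):

```shell
# Sanity check: expected worker count and per-worker resources
# from the #SBATCH directives above (3072 MB/CPU is the script's fallback).
nodes=2
ntasks_per_node=6
cpus_per_task=5
workers=$(( nodes * ntasks_per_node ))          # srun -n $SLURM_NTASKS => one worker per task
mem_per_worker_mb=$(( 3072 * cpus_per_task ))   # same arithmetic as SPARK_MEM in the script
echo "$workers workers with $cpus_per_task cores and $mem_per_worker_mb MB each"
```

This gives 12 workers with 5 cores and 15360 MB (15.0 GB) each, exactly what the master log reports.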
In addition, in the worker directory I see a subfolder named app-20190507151158-0000, which in turn has 12 subfolders, 0..11. Each contains a stderr file that looks like a log file. I also notice that each of these files contains:
19/05/05 23:49:38 INFO MemoryStore: MemoryStore started with capacity 7.8 GB
I do not know what exactly this means. Does each executor only get 7.8 GB, or the full 15 GB?
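If I understand Spark 2.x's unified memory manager correctly (defaults I am assuming here: 300 MB reserved, spark.memory.fraction = 0.6), the MemoryStore capacity is derived from the executor heap rather than equal to it:

```shell
# Sketch of Spark 2.x unified memory sizing, assuming the documented
# defaults: 300 MB reserved memory and spark.memory.fraction = 0.6.
heap_mb=15360                                    # what --executor-memory 15360m would request
reserved_mb=300
usable_mb=$(( (heap_mb - reserved_mb) * 6 / 10 ))
echo "MemoryStore capacity ~ $usable_mb MB"      # ~8.8 GB for a full 15g heap
```

By this arithmetic a full 15g heap should report roughly 8.8 GB, not 7.8 GB, so either the JVM's usable heap is smaller than -Xmx or the executors are getting less than 15g; I am not sure which.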