Is AWS Glue scalable?
09 June 2018

I have included all the relevant information about how I am using Glue; please let me know if you need anything else.

Here is my scenario:

aws s3 ls s3://bucketname/ --recursive --profile production | grep Auto | wc -l

2487

No more than 2487 S3 objects are of interest for the transformation.

aws s3api list-objects --bucket bucketname --output json --query "[sum(Contents[].Size), length(Contents[])]" --profile production | awk 'NR!=2 {print $0; next} NR==2 {print $0/1024/1024/1024" GB"}'

[
344.768 GB
    3829
]

Each S3 object is no larger than 100 MB, and each one is a compressed JSON file.

3829 is the total number of objects, but I am only interested in processing 2487 of them.
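Since only those 2487 "Auto" objects matter, here is a minimal sketch, for reference, of reading just that subset directly from S3 with plain Spark, as opposed to the Glue catalog approach shown below; the bucket path and glob are placeholders, not the real layout:

import org.apache.spark.sql.SparkSession

object ReadAutoObjects {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-auto-objects")
      .getOrCreate()

    // Read only the compressed JSON objects whose keys match the "Auto" pattern.
    // The path and glob are placeholders, not the actual bucket layout.
    val autoDf = spark.read.json("s3://bucketname/Auto*")

    // Schema inference happens while the JSON source is read;
    // printSchema() then just prints the inferred schema.
    autoDf.printSchema()
  }
}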

Scala Code:

// Read the "01" table from the "jsondb" Glue Data Catalog database as a DynamicFrame
val glueContext: GlueContext = new GlueContext(sc)
val auto01: DynamicFrame = glueContext.getCatalogSource(database = "jsondb", tableName = "01").getDynamicFrame()
// printSchema() makes Glue compute the schema over the underlying data
auto01.printSchema()

While trying to print the schema, I get:

18/06/09 18:31:44 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 32, ip-172-31-16-40.ec2.internal, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.7 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/09 18:31:44 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.7 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
..
..
..

18/06/09 18:34:13 WARN ExecutorAllocationManager: Attempted to mark unknown executor 12 idle
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 44, ip-172-31-16-40.ec2.internal, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 6.0 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
  at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
  at org.apache.spark.sql.glue.util.SchemaUtils$.fromRDD(SchemaUtils.scala:57)
  at com.amazonaws.services.glue.DynamicFrame.recomputeSchema(DynamicFrame.scala:235)
  at com.amazonaws.services.glue.DynamicFrame.schema(DynamicFrame.scala:223)
  at com.amazonaws.services.glue.DynamicFrame.printSchema(DynamicFrame.scala:244)
  ... 48 elided

Am I missing something here that I should take into account when using Glue?
