Creating a MongoDB view using only Spark SQL
0 votes
/ 05 April 2020

I'm trying to query MongoDB from the Spark SQL shell. I have a constraint that I can only use SQL: no Scala, Python, etc. I intend to use Thrift eventually, but for a proof of concept I'm using spark-sql. I'm running Spark 2.4.4 on EMR. Additional information:

Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_242
Branch HEAD
Compiled by user ec2-user on 2019-12-14T00:54:30Z
Revision 5f788d5e8f90539ee331702c753fa250727128f4
Url git@aws157git.com:/pkg/Aws157BigTop
Type --help for more information.

I launch my shell with the Maven coordinates of the MongoDB Spark connector:

spark-sql --packages org.mongodb.spark:mongo-spark-connector_2.12:2.4.1 --conf spark.mongodb.input.uri=mongodb://something.real/development?readPreference=secondary

Spark SQL seems to recognize the jar, according to the logs:

org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency

Then I run:

CREATE TEMPORARY VIEW mongo 
USING com.mongodb.spark.sql.DefaultSource
OPTIONS (
  collection 'accounts'
);

And I get the following error:

java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
        at com.mongodb.spark.rdd.partitioner.DefaultMongoPartitioner$.<init>(DefaultMongoPartitioner.scala:64)
        at com.mongodb.spark.rdd.partitioner.DefaultMongoPartitioner$.<clinit>(DefaultMongoPartitioner.scala)
        at com.mongodb.spark.config.ReadConfig$.<init>(ReadConfig.scala:48)
        at com.mongodb.spark.config.ReadConfig$.<clinit>(ReadConfig.scala)
        at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:91)
        at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
        at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:93)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
        at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
        at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
        at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Any idea how to set up this view using only SQL, ideally with no launch command-line parameters except the Maven coordinates, would be ace.

1 Answer

0 votes
/ 05 April 2020

It looks like a Scala version mismatch: the connector package is built for Scala 2.12 (mongo-spark-connector_2.12), while this Spark 2.4.4 build runs on Scala 2.11, as the version info above shows.

Launching the shell with spark-sql --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 worked fine.
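Presumably the full launch command from the question, with only the connector's Scala suffix changed, would be:

spark-sql --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 --conf spark.mongodb.input.uri=mongodb://something.real/development?readPreference=secondary

Since the view definition below passes a uri in its OPTIONS, the --conf flag may not even be necessary, which would also satisfy the wish to avoid launch parameters beyond the Maven coordinates.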

To create the Mongo view using only SQL:

CREATE TEMPORARY VIEW source 
USING mongo
OPTIONS (
  uri 'mongodb://…',
  collection 'my_collection'
);
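Once the view exists, it can be queried with ordinary SQL from the same spark-sql session. A minimal sketch, assuming the view name source from above (the column selection and LIMIT are just for illustration):

SELECT * FROM source LIMIT 10;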