Если вы пытаетесь оценить, какие 3 из них лучше подходят для той же цели, между ними нет никакой разницы.physical plan
ответьте на ваш вопрос: «За сценой?».
Метод 1:
sqlDF = spark.sql("SELECT CallNumber,CallFinalDisposition FROM parquet.`/tmp/ParquetA`").show()
== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(CallNumber#2988 as string) AS CallNumber#3026, CallFinalDisposition#2992]
+- *(1) FileScan parquet [CallNumber#2988,CallFinalDisposition#2992] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/ParquetA], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CallNumber:int,CallFinalDisposition:string>
Метод 2:
df = spark.read.parquet('/tmp/ParquetA')
df.select("CallNumber","CallFinalDisposition").show()
== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(CallNumber#3100 as string) AS CallNumber#3172, CallFinalDisposition#3104]
+- *(1) FileScan parquet [CallNumber#3100,CallFinalDisposition#3104] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/ParquetA], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CallNumber:int,CallFinalDisposition:string>
Метод 3:
tempDF = spark.read.parquet('/tmp/ParquetA/')
tempDF.createOrReplaceTempView("temptable");
tiny = spark.sql("SELECT CallNumber,CallFinalDisposition FROM temptable").show()
== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(CallNumber#2910 as string) AS CallNumber#2982, CallFinalDisposition#2914]
+- *(1) FileScan parquet [CallNumber#2910,CallFinalDisposition#2914] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/ParquetA], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CallNumber:int,CallFinalDisposition:string>