We can read the schema from the parquet file using .schema, convert it to JSON with .json(), and finally save it as a text file.
Input parquet file:
spark.read.parquet("/tmp").printSchema()
#root
#|-- id: integer (nullable = true)
#|-- name: string (nullable = true)
#|-- booking_date: timestamp (nullable = true)
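For reference, here is a minimal sketch that produces a parquet file matching this schema; the path /tmp and the sample row are assumptions for illustration:
import datetime
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("booking_date", TimestampType(), True),
])
spark.createDataFrame(
    [(1, "alice", datetime.datetime(2023, 1, 15, 10, 30))],  # hypothetical sample row
    schema,
).write.mode("overwrite").parquet("/tmp")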
Extract the schema and write it to HDFS or the local filesystem:
schema_json = spark.read.parquet("/tmp").schema.json()  # schema of the parquet file as a JSON string
spark.sparkContext.parallelize([schema_json]) \
    .repartition(1) \
    .saveAsTextFile("/tmp_schema/")  # write the string as a single part file into HDFS
Read the output file back from HDFS:
$ hdfs dfs -cat /tmp_schema/part-00000
{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"integer"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"booking_date","nullable":true,"type":"timestamp"}],"type":"struct"}