One way to create a DataFrame from Rows is to call SparkSession.createDataFrame()
on a list of Rows.
If you want to create a DataFrame like
# +----------+----------+-------+
# |   product|  category|revenue|
# +----------+----------+-------+
# |   product|Cell phone|   6000|
# |    Normal|    Tablet|   1500|
# |      Mini|    Tablet|   5500|
# |             ...             |
# +----------+----------+-------+
# with Schema:
# root
# |-- product: string (nullable = true)
# |-- category: string (nullable = true)
# |-- revenue: long (nullable = true)
then, instead of making productRevenueAll a Row of Rows,
make it a plain list of Rows, e.g.:
productRevenueAll = [
    productRevenue1, productRevenue2, productRevenue3,
    productRevenue4, productRevenue5, productRevenue6,
    productRevenue7, productRevenue8, productRevenue9,
    productRevenue10,
]
dataFrame = spark.createDataFrame(productRevenueAll)
# then use it like:
dataFrame.product
# Column<b'product'>
dataFrame.select(dataFrame.product).show()
# +----------+
# |   product|
# +----------+
# |   product|
# |    Normal|
# |      Mini|
# |    ...   |
# +----------+
However, if you really did intend to create a nested structure like:
# +-----------------------------+
# |       productRevenue        |
# +----------+----------+-------+
# |   product|  category|revenue|
# +----------+----------+-------+
# |   product|Cell phone|   6000|
# |    Normal|    Tablet|   1500|
# |             ...             |
# +----------+----------+-------+
# with Schema:
# root
# |-- productRevenue: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- product: string (nullable = true)
# |    |    |-- category: string (nullable = true)
# |    |    |-- revenue: long (nullable = true)
then pass createDataFrame() a list containing a single Row
whose only field holds the list of Rows, e.g.:
productRevenueAllNested = Row(
    productRevenue=[
        productRevenue1, productRevenue2, productRevenue3,
        productRevenue4, productRevenue5, productRevenue6,
        productRevenue7, productRevenue8, productRevenue9,
        productRevenue10,
    ])
dataFrameNested = spark.createDataFrame([productRevenueAllNested])
# then access it like
dataFrameNested.printSchema()
# productRevenue.product is the array of product values taken from every struct
dataFrameNested.select(dataFrameNested.productRevenue.product).show()
# +----------------------+
# |productRevenue.product|
# +----------------------+
# |  [product, Normal,...|
# +----------------------+