У меня есть столбец в pyspark.sql.DataFrame
типа matrix
.
Каждая ячейка в этом столбце имеет DenseMatrix
формы (numRows, 268)
т.е. количество строк от ячейки к ячейке будет варьироваться, но количество столбцов всегда будет 268.
Я хочу разделить все строки во всех матрицах в этом столбце так, чтобы каждая строка в созданном кадре данных была вектором.
Например, как бы я преобразовал следующее:
|groups|windows |
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |0.0 0.0 1.383419689119171 ... (268 total)
0.0 1.0308333333333333 1.0 ...
0.0 1.0714285714285714 1.0 ...
0.0 1.241112828438949 1.0 ...
0.0 1.01 1.0212464589235128 ...
0.0 0.0 1.0303994011640099 ...
0.0 1.0310714270488266 0.0 ...
0.0 1.7106598984771573 0.0 ...
0.0 1.0 1.7657142857142856 ...
0.0 1.3483709273182958 1.7071428571428573 ...
0.0 1.4608788853161845 1.2461538461538462 ...
0.0 1.0 0.0 ...
0.0 1.0 0.0 ...
1.6600496277915633 1.0 1.0 ...
1.3537936913895994 1.812121212121212 1.2403100775193798 ...
0.0 1.6721590909090909 1.0 ...
1.6479591836734695 0.0 0.0 ...
0.0 1.075 0.0 ...
1.2246376811594204 0.0 0.0 ...
1.0 1.659994867847062 1.0 ...
1.0 0.0 1.5507936E9 ...
0.0 1.0 0.0 ...
1.6974358974358972 0.0 0.0 ...|
|2 |0.0 0.0 1.4455958549222798 ... (268 total)
0.0 1.02875 1.0 ...
0.0 1.0714285714285714 1.0 ...
0.0 1.2179289026275115 1.0 ...
0.0 1.01 1.0191218130311614 ...
0.0 0.0 1.028490828331661 ...
0.0 1.028214284187194 0.0 ...
0.0 1.7309644670050761 0.0 ...
0.0 1.0 1.7885714285714287 ...
0.0 1.3525480367585632 1.7285714285714286 ...
0.0 1.4683815648445875 1.2153846153846155 ...
0.0 1.0 0.0 ...
0.0 1.0 0.0 ...
1.6972704714640199 1.0 1.0 ...
1.3580562659846547 1.8242424242424242 1.2170542635658914 ...
0.0 1.6971590909090908 1.0 ...
1.663265306122449 0.0 0.0 ...
0.0 1.0964285714285715 0.0 ...
1.2028985507246377 0.0 0.0 ...
1.0 1.6782140107775212 1.0 ...
1.0 0.0 1.5507936E9 ...
0.0 1.0 0.0 ...
1.7282051282051283 0.0 0.0 ...|
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 2 rows
На что-то вроде:
|groups|windows
+------+-------------------------------------------------------------------------
|1 |0.0, 0.0, 1.383419689119171, ... (268 total)
+------+-----------------------------------------------------------------------
|1 |0.0, 1.0308333333333333, 1.0, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.0714285714285714, 1.0, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.241112828438949, 1.0, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.01, 1.0212464589235128, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 0.0, 1.0303994011640099, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.0310714270488266, 0.0, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.7106598984771573, 0.0, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.0, 1.7657142857142856, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.3483709273182958, 1.7071428571428573, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.4608788853161845, 1.2461538461538462, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.0, 0.0, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.0, 0.0, ...
+------+-----------------------------------------------------------------------
|1 |1.6600496277915633, 1.0, 1.0, ...
+------+-----------------------------------------------------------------------
|1 |1.3537936913895994, 1.812121212121212, 1.2403100775193798, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.6721590909090909, 1.0, ...
+------+-----------------------------------------------------------------------
|1 |1.6479591836734695, 0.0, 0.0, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.075, 0.0, ...
+------+-----------------------------------------------------------------------
|1 |1.2246376811594204, 0.0, 0.0, ...
+------+-----------------------------------------------------------------------
|1 |1.0, 1.659994867847062, 1.0, ...
+------+-----------------------------------------------------------------------
|1 |1.0, 0.0, 1.5507936E9, ...
+------+-----------------------------------------------------------------------
|1 |0.0, 1.0, 0.0, ...
+------+-----------------------------------------------------------------------
|1 |1.6974358974358972, 0.0, 0.0, ...|
+------+-----------------------------------------------------------------------
|2 |0.0, 0.0, 1.4455958549222798, ... (268 total)
+------+-----------------------------------------------------------------------
|2 |0.0, 1.02875, 1.0, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.0714285714285714, 1.0, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.2179289026275115, 1.0, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.01, 1.0191218130311614, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 0.0, 1.028490828331661, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.028214284187194, 0.0, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.7309644670050761, 0.0, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.0, 1.7885714285714287, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.3525480367585632, 1.7285714285714286, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.4683815648445875, 1.2153846153846155, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.0, 0.0, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.0, 0.0, ...
+------+-----------------------------------------------------------------------
|2 |1.6972704714640199, 1.0, 1.0, ...
+------+-----------------------------------------------------------------------
|2 |1.3580562659846547, 1.8242424242424242, 1.2170542635658914, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.6971590909090908, 1.0, ...
+------+-----------------------------------------------------------------------
|2 |1.663265306122449, 0.0, 0.0, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.0964285714285715, 0.0, ...
+------+-----------------------------------------------------------------------
|2 |1.2028985507246377, 0.0, 0.0, ...
+------+-----------------------------------------------------------------------
|2 |1.0, 1.6782140107775212, 1.0, ...
+------+-----------------------------------------------------------------------
|2 |1.0, 0.0, 1.5507936E9, ...
+------+-----------------------------------------------------------------------
|2 |0.0, 1.0, 0.0, ...
+------+-----------------------------------------------------------------------
|2 |1.7282051282051283, 0.0, 0.0, ...|
+------+-----------------------------------------------------------------------
+------+-----------------------------------------------------------------------
only showing top 2 rows
Любая помощь будет принята с благодарностью!
EDIT_1
Повторюсь, я начну с DenseMatrix
. Я также смог «решить» мою проблему с помощью функции explode
, но мне пришлось:
1) приведите столбец windows
к строке:
def stringify_matrices(x):
arr = x.toArray()
l = arr.tolist()
return l
stringify_matrices_udf = udf(lambda y: stringify_matrices(y),)
expanded = \
extracted.withColumn('expanded',
stringify_matrices_udf('windows')
)
2) анализ этой строки в массив строк (каждая строка представляет вектор)
def parse_matrices(x):
from ast import literal_eval
t = literal_eval(str(x))
str_arr = [str(a) for a in t]
return str_arr
parse_matrices_udf = udf(lambda y: parse_matrices(y), ArrayType(StringType()))
parsed = \
expanded.withColumn('parsed',
parse_matrices_udf('expanded')
)
3) explode
ing
parsed = parsed.withColumn('exploded', explode(parsed.parsed)).select('groups', 'exploded')
4) приведение к ArrayType(DoubleType()))
def convert_to_double(x):
str_arr = x.replace('[','').replace(']','').split(',')
flt_arr = [float(a) for a in str_arr]
return flt_arr
convert_to_double_udf = udf(lambda y: convert_to_double(y), ArrayType(DoubleType()))
converted = parsed.withColumn('feature_vector', convert_to_double_udf('exploded'))
Вышесказанное работает, но я чувствую, что есть лучший способ приблизиться к этому.
EDIT_2
@mayanak agrawal
Спасибо за Ваш ответ!
Я думаю, в ответ я бы спросил:
Как конвертировать из DenseMatrix
столбца:
например, * * тысяча пятьдесят-семь
dm_df = sqlContext.createDataFrame([
(1,
DenseMatrix(numRows=3, numCols=4, values=[2,4,2,5,30,4,2,5,30,4,2,5], isTransposed=True)),
(2,
DenseMatrix(numRows=2, numCols=4, values=[2,1,3,7,2,4,2,9], isTransposed=True)),
(3,
DenseMatrix(numRows=4, numCols=4, values=[2,4,2,5,2,4,2,5,2,1,3,7,2,1,3,7], isTransposed=True))],
['groups', 'windows'])
dm_df.show()
+------+-----------------------------------------------------------------------------------+
|groups|windows |
+------+-----------------------------------------------------------------------------------+
|1 |2.0 4.0 2.0 5.0
30.0 4.0 2.0 5.0
30.0 4.0 2.0 5.0 |
|2 |2.0 1.0 3.0 7.0
2.0 4.0 2.0 9.0 |
|3 |2.0 4.0 2.0 5.0
2.0 4.0 2.0 5.0
2.0 1.0 3.0 7.0
2.0 1.0 3.0 7.0 |
+------+-----------------------------------------------------------------------------------+
Для столбца 2D-чисел (как видно из вашего примера):
arr_df = sqlContext.createDataFrame([
(1, [[2,4,2,5],[30,4,2,5],[30,4,2,5]]),
(2, [[2,1,3,7],[2,4,2,9]]),
(3, [[2,4,2,5],[2,4,2,5],[2,1,3,7],[2,1,3,7]])],
['groups', 'windows'])
arr_df.show()
+------+--------------------------------------------------------+
|groups|windows |
+------+--------------------------------------------------------+
|1 |[[2, 4, 2, 5], [30, 4, 2, 5], [30, 4, 2, 5]] |
|2 |[[2, 1, 3, 7], [2, 4, 2, 9]] |
|3 |[[2, 4, 2, 5], [2, 4, 2, 5], [2, 1, 3, 7], [2, 1, 3, 7]]|
+------+--------------------------------------------------------+
Еще раз спасибо!