PySpark MLP Neural Network

0 votes / 24 December 2018
PySpark version 2.4.0

I am trying to reduce the number of outputs in the output layer. Unfortunately, I have not been able to achieve this with PySpark's MLPC (MultilayerPerceptronClassifier).

I used the letter-recognition dataset. Link: https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/letterdata.csv

>>> df.show(1)
+------+----+----+-----+------+-----+----+----+-----+-----+-----+------+------+-----+------+-----+------+
|letter|xbox|ybox|width|height|onpix|xbar|ybar|x2bar| y2bar|xybar| x2ybar| xy2bar|xedge|xedgey|yedge|yedgex|
+------+----+----+-----+------+-----+----+----+-----+-----+-----+------+------+-----+------+-----+------+
|     T|   2|   8|    3|     5|    1|   8|  13|    0|    6|    6|    10|     8|    0|     8|    0|     8|
+------+----+----+-----+------+-----+----+----+-----+-----+-----+------+------+-----+------+-----+------+
only showing top 1 row

I used StringIndexer to convert the "letter" column to an integer (column = "letter_id").

I have two options. Either use letter_id as the label, which gives an output layer with 26 outputs:

>>> indexed_df.columns[1:-1]
['xbox', 'ybox', 'width', 'height', 'onpix', 'xbar', 'ybar', 'x2bar', 'y2bar', 'xybar', 'x2ybar', 'xy2bar', 'xedge', 'xedgey', 'yedge', 'yedgex', 'letter_id']

Or create a binary-encoded column as the label, which would need only 5 outputs:

>>> indexed_df.select('letter', 'letter_id', 'binary').distinct().show()
+------+---------+------+                                                       
|letter|letter_id|binary|
+------+---------+------+
|     D|        1| 00001|
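For reference, 5 binary digits are enough here because 2^5 = 32 >= 26, so every letter id fits in five bits. A minimal pure-Python sketch of the same encoding the udf in the code below produces:

```python
import math

n_classes = 26                            # letters A-Z
n_bits = math.ceil(math.log2(n_classes))  # smallest bit width that fits: 5

# Same formatting the udf uses: zero-padded 5-bit binary string per id
codes = ['{0:05b}'.format(i) for i in range(n_classes)]
print(n_bits)     # 5
print(codes[1])   # '00001' -- matches letter D in the table above
print(len(set(codes)) == n_classes)  # True: all codes are distinct
```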

Here is the full code:

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.sql.functions import *


df = spark.read.csv('letterdata.csv', header=True, inferSchema=True)

indexer = StringIndexer(inputCol='letter', outputCol='letter_id')
indexed_df = indexer.fit(df).transform(df)
indexed_df = indexed_df.withColumn('letter_id', indexed_df['letter_id'].cast('int'))
indexed_df.select('letter', 'letter_id').distinct().show()

udf_binary = udf(lambda x: '{0:05b}'.format(x))

indexed_df = indexed_df.withColumn('binary', udf_binary(indexed_df['letter_id']))

indexed_df.select('letter', 'letter_id', 'binary').distinct().show()

final_cols = indexed_df.columns[1:-1]
#final_cols = indexed_df.columns[1:-2] + ['binary']

dataset = indexed_df.select(final_cols)

parser = VectorAssembler(inputCols=final_cols[:-1], outputCol="features")
dataset = parser.transform(dataset)
final = dataset.select(col('letter_id').alias('label'), col('features'))
#final = dataset.select(col('binary').alias('label'), col('features'))

splits = final.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]
layers = [17, 30, 20, 26]

trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

When I try to fit,

model = trainer.fit(train)

I get the error below:

2018-12-24 11:17:27 WARN  BlockManager:66 - Putting block rdd_44_0 failed due to exception java.lang.ArrayIndexOutOfBoundsException.
2018-12-24 11:17:27 WARN  BlockManager:66 - Block rdd_44_0 could not be removed as it was not found on disk or in memory
2018-12-24 11:17:27 ERROR Executor:91 - Exception in task 0.0 in stage 26.0 (TID 336)
java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:665)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:664)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3.apply(Layer.scala:664)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3.apply(Layer.scala:660)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:298)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2018-12-24 11:17:27 WARN  TaskSetManager:66 - Lost task 0.0 in stage 26.0 (TID 336, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:665)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:664)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3.apply(Layer.scala:664)

I could not find any reference to the above error. It looks like something is wrong with the data format after changing to ['label', 'features'], so I cannot proceed further.
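One thing worth double-checking (my assumption, not confirmed from the trace): MLPC requires layers[0] to equal the length of the assembled features vector. In the code above, VectorAssembler receives final_cols[:-1], i.e. the 16 feature columns, while the first layer is declared as 17. A quick consistency check:

```python
# The 16 feature columns that VectorAssembler actually receives
# (final_cols[:-1] in the code above).
feature_cols = ['xbox', 'ybox', 'width', 'height', 'onpix', 'xbar', 'ybar',
                'x2bar', 'y2bar', 'xybar', 'x2ybar', 'xy2bar',
                'xedge', 'xedgey', 'yedge', 'yedgex']

# MLPC expects layers[0] == feature vector length and
# layers[-1] == number of classes.
layers = [len(feature_cols), 30, 20, 26]
print(layers)  # [16, 30, 20, 26] -- not [17, 30, 20, 26] as in the question
```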

And also, how can I get 5 outputs instead of 26 in MLPC? Any pointers?
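For the 5-output idea, one decomposition I am considering (a sketch only, not verified against MLPC, whose label must be a single numeric class index rather than a binary string): train five independent binary classifiers, one per bit, and recombine their predictions into a letter id. The pure-Python helpers below show the bit split and its inverse; the Spark column/udf wiring is omitted.

```python
# Sketch: decompose each letter_id (0-25) into 5 independent binary targets,
# so five binary classifiers (layers ending in 2) could stand in for one
# 26-way output layer.
def bits_of(letter_id, width=5):
    """Most-significant bit first, matching the '{0:05b}' encoding."""
    return [(letter_id >> (width - 1 - i)) & 1 for i in range(width)]

def id_from_bits(bits):
    """Inverse: recombine per-bit predictions into a letter id."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out

print(bits_of(1))                     # [0, 0, 0, 0, 1] -- letter D above
print(id_from_bits([0, 0, 0, 0, 1]))  # 1
```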
