from pyspark.sql.functions import split,concat,lit
myValues = [('Alan Turing','UK',1000),('James Clark','US',5000)]
df = sqlContext.createDataFrame(myValues,['Name','Country','Income'])
df.show()
+-----------+-------+------+
|       Name|Country|Income|
+-----------+-------+------+
|Alan Turing|     UK|  1000|
|James Clark|     US|  5000|
+-----------+-------+------+
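# Keep the first letter of the first word, then a space, then the second word.
# Spark's substr is 1-based, so the first character is substr(1,1).
df = df.withColumn('Name', concat(split(df['Name'], ' ')[0].substr(1,1), lit(' '), split(df['Name'], ' ')[1]))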
df.show()
+--------+-------+------+
|    Name|Country|Income|
+--------+-------+------+
|A Turing|     UK|  1000|
| J Clark|     US|  5000|
+--------+-------+------+
The code above will not work if the name has more than two parts, such as Alan Turing Müller: it keeps only the second word and drops the rest. The following code is more robust:
from pyspark.sql.functions import concat, instr, length
myValues = [('Alan Turing Müller','UK',1000),('James Clark','US',5000)]
df = sqlContext.createDataFrame(myValues,['Name','Country','Income'])
df.show()
+------------------+-------+------+
|              Name|Country|Income|
+------------------+-------+------+
|Alan Turing Müller|     UK|  1000|
|       James Clark|     US|  5000|
+------------------+-------+------+
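# First letter (substr is 1-based), followed by everything from the first space to the end of the string.
df = df.withColumn('Name', concat(df['Name'].substr(1,1),df['Name'].substr(instr(df['Name'],' '),length(df['Name'])-instr(df['Name'],' ')+1)))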
df.show()
+---------------+-------+------+
|           Name|Country|Income|
+---------------+-------+------+
|A Turing Müller|     UK|  1000|
|        J Clark|     US|  5000|
+---------------+-------+------+
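To make the index arithmetic easier to follow, here is a minimal plain-Python sketch that mirrors the two column expressions on ordinary strings. It does not use Spark at all; the helpers `sql_substr`, `sql_instr`, `abbreviate_split` and `abbreviate_instr` are hypothetical names introduced only for illustration, and they only approximate the 1-based behaviour of `substr` and `instr` for names that contain at least one space.

```python
def sql_substr(s, pos, length):
    """Approximate a 1-based substring (valid for pos >= 1 only)."""
    start = pos - 1
    return s[start:start + length]


def sql_instr(s, sub):
    """Approximate instr: 1-based index of the first match, 0 if absent."""
    return s.find(sub) + 1


def abbreviate_split(name):
    # Mirrors the split-based expression: first initial + second word only.
    parts = name.split(' ')
    return parts[0][:1] + ' ' + parts[1]


def abbreviate_instr(name):
    # Mirrors the instr/substr expression:
    # first initial + everything from the first space to the end.
    space = sql_instr(name, ' ')
    return sql_substr(name, 1, 1) + sql_substr(name, space, len(name) - space + 1)


for name in ['Alan Turing Müller', 'James Clark']:
    print(abbreviate_split(name), '|', abbreviate_instr(name))
```

For 'Alan Turing Müller' the split-based version yields 'A Turing', while the instr-based version yields 'A Turing Müller'; both give 'J Clark' for 'James Clark'. Names with no space at all are outside what this sketch covers.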