I am writing a grep tool in pyspark that takes a word on the command line, searches a text file, and returns every line that contains that word. My search returns strings that are not the search word.
#!/usr/bin/python
import sys
from pyspark import SparkContext
def search_word(word):
    if (word) != -1:
        print ('%s\t%s' % ( word, word.strip() ))
# assign search word given on command line
if len(sys.argv) > 1:
    word = sys.argv[1]
sc = SparkContext()
textRDD = sc.textFile("input.txt")
textRDD = textRDD.map(lambda word: word.replace(',',' ').replace('.',' '). lower())
textRDD = textRDD.flatMap(lambda word: word.split())
textRDD = textRDD.filter(lambda word: search_word(word))
firstten = textRDD.take(10)
print(firstten)
Example command line: spark-submitself
Sample text file:
Ere quitting, for the nonce, the Sperm Whale's head, I would have
you, as a sensible physiologist, simply--particularly remark its front
aspect, in all its compacted collectedness. I would have you investigate
it now with the sole view of forming to yourself some unexaggerated,
intelligent estimate of whatever battering-ram power may be lodged
there. Here is a vital point; for you must either satisfactorily settle
this matter with yourself, or for ever remain an infidel as to one of
the most appalling, but not the less true events, perhaps anywhere to be found in all recorded history.
Expected output:
yourself -- it now with the sole view of forming to yourself some unexaggerated
The code above returns this:
produce produce
our our
new new
ebooks ebooks
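A note on why the output looks like this, with a minimal sketch of one possible fix. `search_word` never compares its argument with the word from the command line: for a string, `word != -1` is always true, so every token is printed twice, and the function returns `None`, so `filter` keeps nothing. The `flatMap` step also splits each line into separate words, so the original line is no longer available to return. The sketch below is plain Python so it runs without Spark; the helper names `normalize`, `line_contains_word` and `grep_lines` are hypothetical, and it assumes "contains the word" means a whole-token match after the same comma/period cleanup used in the question.

```python
def normalize(line):
    """Lower-case a line and replace commas and periods with spaces,
    mirroring the map step in the question."""
    return line.replace(',', ' ').replace('.', ' ').lower()


def line_contains_word(line, word):
    """Return True when `word` is one of the whitespace-separated tokens
    of the normalized line. A filter predicate has to return a boolean."""
    return word.lower() in normalize(line).split()


def grep_lines(lines, word):
    """Keep whole lines that contain `word`, instead of splitting them up."""
    return [line for line in lines if line_contains_word(line, word)]


# Three lines taken from the sample text in the question.
sample = [
    "it now with the sole view of forming to yourself some unexaggerated,",
    "intelligent estimate of whatever battering-ram power may be lodged",
    "this matter with yourself, or for ever remain an infidel as to one of",
]

matches = grep_lines(sample, "yourself")
for line in matches:
    print('%s -- %s' % ("yourself", line))
```

In the Spark script the same predicate could be passed straight to the RDD of lines, for example `sc.textFile("input.txt").filter(lambda line: line_contains_word(line, word))`, dropping the `flatMap` step so that each element stays a full line.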