I have the following DataFrame:
   MESSAGE                                                              DOCUMENT_ID
0  @Zuora wants to help @Network4Good with Hurricane and hurriacane...  263403828328665088
1  @ztrip please help spread the good word on hello and hello...        264142543883739136
2  #ZSwaggers @Zendaya96 did this,you should too. You...                265122997348753408
3  @Zendaya96 u have inspired me girl! So can eve...                    265499798952628224
4  ''@Zendaya96 let's help the Hurricane Sandy vi...                    265161977662435328
5  @Zendaya96 Help the hurricane Sandy victims . ...                    265496790881669120
6  @Zendaya96 Help the hurricane Sandy victims¡¡ ...                    265496111257624576
7  @Zendaya96 @bellathorne : Help the Hurricane ...                     265192268137373696
8  Your Personal Discount Co...                                         263385298296270848
9  Your help is needed! Donate $10 to the America...                    265578540001554432
How can I create a pandas DataFrame that holds the count of each word in every message?
For example:
DOCUMENT_ID         word       count
263403828328665088  hurricane  2
263403828328665088  with       1
.........
264142543883739136  hello      2
...........
I tried using the functions below, but I don't know how to attach the DOCUMENT_ID to each word:
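import re
from collections import Counter, OrderedDict

from nltk.corpus import stopwords

# Build the stop-word set once instead of reloading the list on every lookup
STOP_WORDS = set(stopwords.words('english'))


def wordsplit(wordlist):
    '''cleans a single string and yields it unless it is a stop word'''
    j = wordlist
    j = re.sub(r'\d+', '', j)
    j = re.sub('RT', '', j)
    j = re.sub('http', '', j)
    # replace @mentions, non-alphanumeric characters and URLs with a space
    j = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", j)
    j = j.lower()
    j = j.strip()
    if j not in STOP_WORDS:
        yield j


def wordSplitCount(wordlist):
    '''merges a list into string, splits it, removes stop words and
    then counts the occurrences returning an ordered dictionary'''
    # join with a space so words from adjacent messages do not run together
    string1 = ' '.join(filter(None, wordlist))
    cnt = Counter()
    # split() without arguments also skips the empty tokens left by repeated spaces
    for i in string1.split():
        i = re.sub(r'&', ' ', i.lower())
        if i not in STOP_WORDS:
            cnt[i] += 1
    return OrderedDict(cnt)


def sortedValues(wordlist):
    '''creates a list of (word, count) pairs sorted by count, descending'''
    d = wordSplitCount(wordlist)
    return sorted(d.items(), key=lambda t: t[1], reverse=True)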
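
To make the goal concrete, here is a minimal sketch of the shape I am after: counting per message rather than over one merged string, so that each word keeps its DOCUMENT_ID. The helper name `count_words_per_document`, the tiny `STOP_WORDS` set, and the simplified cleaning regex are assumptions for illustration only, not the NLTK list or the exact cleaning used in my functions.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set so the sketch runs without NLTK;
# for real use it could be replaced by set(stopwords.words('english')).
STOP_WORDS = {"the", "to", "on", "and", "with", "a", "is"}


def count_words_per_document(pairs):
    """Return (DOCUMENT_ID, word, count) rows from (DOCUMENT_ID, MESSAGE) pairs."""
    rows = []
    for doc_id, message in pairs:
        # Drop URLs and @mentions, then keep runs of letters as lowercase words
        text = re.sub(r"\w+://\S+|@\w+", " ", message)
        words = re.findall(r"[a-z]+", text.lower())
        # Count inside the loop so every count stays tied to its document
        counts = Counter(w for w in words if w not in STOP_WORDS)
        # most_common() yields (word, count) pairs sorted by count, descending
        rows.extend((doc_id, word, n) for word, n in counts.most_common())
    return rows


sample = [
    (264142543883739136, "@ztrip please help spread the good word on hello and hello..."),
    (263385298296270848, "Your Personal Discount Co..."),
]
rows = count_words_per_document(sample)
```

If this is on the right track, the rows could presumably be handed to pandas with `pd.DataFrame(rows, columns=['DOCUMENT_ID', 'word', 'count'])`, and the pairs could come from `zip(df['DOCUMENT_ID'], df['MESSAGE'])`.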