Question

I'm working on a text problem where my pandas DataFrame has many columns, one of which contains paragraphs. In the output I need 3 columns, defined as follows:

Length of the longest word(s)
Number of longest words (in case several share that length)
Total number of words of the same length.

I count something as a word if it is separated by whitespace. I'm looking for an answer that uses Python's apply/map.

Here is a sample of the input data:

import pandas as pd

df = pd.DataFrame({'text':[
    "that's not where the biggest opportunity is - it's with heart failure drug - very very huge market....",
    "Of course! I just got diagnosed with congestive heart failure and type 2 diabetes. I smoked for 12 years and ate like crap for about the same time. I quit smoking and have been on a diet for a few weeks now. Let me assure you that I'd rather have a coke, gummi bears, and a bag of cheez doodles than a pack of cigs right now. Addiction is addiction.",
    "STILLWATER, Okla. (AP) ? Medical examiner spokeswoman SpokesWoman: Oklahoma State player Tyrek Coger died of enlarged heart, manner of death ruled natural."
]})

df

    text                                                
0   that's not where the biggest opportunity is - ...   
1   Of course! I just got diagnosed with congestiv...   
2   STILLWATER, Okla. (AP) ? Medical examiner spok...

Here is the expected output:

    text                                               word_count   word_length     words
0   that's not where the biggest opportunity is - ...   1           11             opportunity
1   Of course! I just got diagnosed with congestiv...   1           10              congestive
2   STILLWATER, Okla. (AP) ? Medical examiner spok...   2           11              spokeswoman SpokesWoman

meW · Answer 1 · January 19, 2019

One possible solution using apply-map:

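import nltk
import pandas as pd

# Read df as in the question, then proceed (assumes the default 0..n-1 index).
# Note: nltk.word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first.

# Tokenize each paragraph and spread the tokens over columns (one token per cell).
expanded_text = df.text.apply(lambda x: ' '.join(nltk.word_tokenize(x))).str.split(" ", expand=True)

# Length of the longest token in each row; padding cells are missing and count as 0.
# Column assignment must use df[...]; df.word_length = ... would only set an attribute.
# (In pandas >= 2.1, DataFrame.map is the newer name for applymap.)
df['word_length'] = expanded_text.applymap(lambda x: len(x) if pd.notna(x) else 0).max(axis=1)

for idx in range(len(expanded_text)):
    # One-row mask, transposed: True where the token has the row's maximum length.
    temp = expanded_text.iloc[idx:idx + 1, :].applymap(lambda x: pd.notna(x) and len(x) == df.loc[idx, 'word_length']).T
    idx_ = temp.index[temp[idx].astype(bool)].values
    words = " ".join(expanded_text.iloc[idx, idx_].tolist())
    df.loc[idx, 'words'] = words
    df.loc[idx, 'word_count'] = len(words.split())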
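
For comparison, here is a minimal sketch of the same per-row logic without NLTK. It is not the tokenizer used above: it assumes a word is any run of letters, digits or underscores (the regex `\w+`), and the function name `longest_words` is made up for illustration.

```python
import re

def longest_words(text):
    """Return (word_count, word_length, words) for the longest words in text."""
    # Assumption: a "word" is a run of \w characters, so punctuation is dropped
    # and "that's" is split into "that" and "s".
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return 0, 0, ""
    max_len = max(len(t) for t in tokens)
    # Keep every token that ties for the maximum length, in order of appearance.
    longest = [t for t in tokens if len(t) == max_len]
    return len(longest), max_len, " ".join(longest)

sample = "Medical examiner spokeswoman SpokesWoman: Oklahoma State player"
print(longest_words(sample))  # (2, 11, 'spokeswoman SpokesWoman')
```

On the DataFrame from the question, the three columns could then be filled with `df['word_count'], df['word_length'], df['words'] = zip(*df['text'].map(longest_words))`.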

GRoutar · Answer 2 · January 18, 2019

The following code should help:

def get_values(text):
    tokens = text.split()  # Splitting by whitespace
    max_word_length = -1
    list_words = []  # Initializing list of max length words

    for token in tokens:
        if len(token) > max_word_length:
            max_word_length = len(token)
            list_words = [token]  # Resetting the list, since there's a new max
        elif len(token) == max_word_length:
            list_words.append(token)

    words_string = ' '.join(list_words)  # Concatenating list into string

    return [len(list_words), max_word_length, words_string]

df['word_count'], df['word_length'], df['words'] = zip(*df['text'].map(get_values))

Edit: I forgot to join the list.
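
One thing to watch with whitespace splitting: punctuation stays attached to the token, so `SpokesWoman:` counts as 12 characters and the result can differ from the expected table in the question. A small sketch of the difference, assuming leading and trailing punctuation should not count toward word length (`longest_by_whitespace` and its `strip_punct` flag are hypothetical names used only here):

```python
import string

def longest_by_whitespace(text, strip_punct=False):
    """Return (word_count, word_length, words) for whitespace-separated tokens."""
    tokens = text.split()
    if strip_punct:
        # Assumption: leading/trailing punctuation is not part of a word.
        tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    if not tokens:
        return 0, 0, ""
    max_len = max(len(t) for t in tokens)
    longest = [t for t in tokens if len(t) == max_len]
    return len(longest), max_len, " ".join(longest)

text = "Medical examiner spokeswoman SpokesWoman: Oklahoma State player"
print(longest_by_whitespace(text))                    # (1, 12, 'SpokesWoman:')
print(longest_by_whitespace(text, strip_punct=True))  # (2, 11, 'spokeswoman SpokesWoman')
```

Unlike a regex such as `\w+`, stripping only the ends keeps inner apostrophes, so `that's` remains one 6-character word.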

Get the maximum word length from a paragraph
