Question

У меня есть фрейм данных с 500 текстами в столбце с именем Текст (1 текст в строке), и я хочу подсчитать наиболее распространенные слова из всех текстов.

Я пробовал до сих пор (оба метода из stackoverflow):

pd.Series(' '.join(df['Text']).lower().split()).value_counts()[:100]

и

Counter(" ".join(df["Text"]).split()).most_common(100)

оба дали мне следующую ошибку:

TypeError: элемент последовательности 0: ожидаемый экземпляр str, список найден

И я попробовал метод счетчика просто с помощью

df.Text.apply(Counter())

, который дал мне количество слов в каждом тексте, и я также изменил метод счетчика, чтобы он возвращал наиболее распространенные слова в каждомtext

Но я хочу, чтобы общие наиболее распространенные слова

Вот пример фрейма данных (текст уже в нижнем регистре, очищен от пунктуации, размечен, а стоп-слова удалены)

    Datum   File    File_type                                         Text                         length    len_cleaned_text
Datum                                                   
2000-01-27  2000-01-27  _04.txt     _04     [business, date, jan, heineken, starts, integr...       396         220

Редактировать: код для его «переотображения»

  for file in file_list:
    name = file[len(input_path):]
        date = name[11:17]
        type_1 = name[17:20] 


with open(file, "r", encoding="utf-8", errors="surrogateescape") as rfile:
                format
                text = rfile.read()
                text = text.encode('utf-8', 'ignore')
                text = text.decode('utf-8', 'ignore')
     a={"File": name, "Text": text,'the':count_the, 'Datum': date, 'File_type': type_1, 'length':length,}
        result_list.append(a)

новая ячейка

  df['Text']= df['Text'].str.lower()
    p = re.compile(r'[^\w\s]+')
    d = re.compile(r'\d+')
    for index, row in df.iterrows():
        df['Text']=df['Text'].str.replace('\n',' ')
        df['Text']=df['Text'].str.replace('################################ end of story 1 ##############################','')
        df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]
        df['Text'] = [d.sub('', x) for x in df['Text'].tolist()]
    df['Text']=df['Text'].apply(word_tokenize)


    Datum   File    File_type   Text    length  the
Datum                       
2000-01-27  2000-01-27  0864820040_000127_04.txt    _04     [business, date, jan, heineken, starts, integr...   396     0
2000-02-01  2000-02-01  0910068040_000201_04.txt    _04     [group, english, cns, date, feb, bat, acquisit...   305     0
2000-05-03  2000-05-03  1070448040_000503_04.txt    _04     [date, may, cobham, plc, cob, acquisitionsdisp...   701     0
2000-05-11  2000-05-11  0865985020_000511_04.txt    _04     [business, date, may, swedish, match, complete...   439     0
2000-11-28  2000-11-28  1067252020_001128_04.txt    _04     [date, nov, intec, telecom, sys, itl, doc, pla...   158     0
2000-12-18  2000-12-18  1963867040_001218_04.txt    _04     [associated, press, apw, date, dec, volvo, div...   367     0
2000-12-19  2000-12-19  1065767020_001219_04.txt    _04     [date, dec, spirent, plc, spt, acquisition, co...   414     0
2000-12-21  2000-12-21  1076829040_001221_04.txt    _04     [bloomberg, news, bn, date, dec, eni, ceo, cfo...   271     0
2001-02-06  2001-02-06  1084749020_010206_04.txt    _04     [date, feb, chemring, group, plc, chg, acquisi...   130     0
2001-02-15  2001-02-15  1063497040_010215_04.txt    _04     [date, feb, electrolux, ab, elxb, acquisition,...   420     0

И описание кадра данных:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 557 entries, 2000-01-27 to 2017-10-06
Data columns (total 13 columns):
Datum               557 non-null datetime64[ns]
File                557 non-null object
File_type           557 non-null object
Text                557 non-null object
customers           557 non-null int64
grwoth              557 non-null int64
human               557 non-null int64
intagibles          557 non-null int64
length              557 non-null int64
synergies           557 non-null int64
technology          557 non-null int64
the                 557 non-null int64
len_cleaned_text    557 non-null int64
dtypes: datetime64[ns](1), int64(9), object(3)
memory usage: 60.9+ KB

Заранее спасибо

RoyaumeIX · Answer 1 · 21 ноября 2018

Вот моя версия, в которой я преобразую значения столбцов в список, затем составляю список слов, очищаю его, и у вас есть счетчик:

your_text_list = df['Text'].tolist()
your_text_list_nan_rm = [x for x in your_text_list if str(x) != 'nan']
flat_list = [inner for item in your_text_list_nan_rm for inner in ast.literal_eval(item)] 

counter = collections.Counter(flat_list)
top_words = counter.most_common(100)

Mikhail Stepanov · Answer 2 · 21 ноября 2018

Вы можете сделать это с помощью apply и Counter.update методов:

from collections import Counter

counter = Counter()
df = pd.DataFrame({'Text': values})
_ = df['Text'].apply(lambda x: counter.update(x))

counter.most_common(10) 
Out:

[('Amy', 3), ('was', 3), ('hated', 2),
 ('Kamal', 2), ('her', 2), ('and', 2), 
 ('she', 2), ('She', 2), ('sent', 2), ('text', 2)]

Где df['Text']:

0    [Amy, normally, hated, Monday, mornings, but, ...
1    [Kamal, was, in, her, art, class, and, she, li...
2    [She, was, waiting, outside, the, classroom, w...
3              [Hi, Amy, Your, mum, sent, me, a, text]
4                         [You, forgot, your, inhaler]
5    [Why, don’t, you, turn, your, phone, on, Amy, ...
6    [She, never, sent, text, messages, and, she, h...
Name: Text, dtype: object

Scotty1- · Answer 3 · 21 ноября 2018

Хорошо, я понял.Ваш df['Text'] состоит из списков текстов.Таким образом, вы можете сделать это:

full_list = []  # list containing all words of all texts
for elmnt in df['Text']:  # loop over lists in df
    full_list += elmnt  # append elements of lists to full list

val_counts = pd.Series(full_list).value_counts()  # make temporary Series to count

Это решение позволяет избежать использования слишком большого числа списочных представлений и, таким образом, делает код легким для чтения и понимания.Кроме того, никаких дополнительных модулей, таких как re или collections, не требуется.

Как посчитать 500 самых распространенных слов в панде

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

Ответы [ 3 ]

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Как посчитать 500 самых распространенных слов в панде

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

Ответы [ 3 ]

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Нет похожих вопросов