Question

I'm really just trying to build an elementary classifier, so I'd be fine with an NLTK solution, but my first couple of attempts were with Pandas.

I have a couple of lists that I want to check the text against to get word counts, and then return an ordered list.

import pandas as pd
import re
fruit_sentences = ["Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
                                "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
                                "Wednesday: The stout, sweet green fruit keeps me on my toes!",
                                "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
                                "Friday: Long yellow fruit day, peel it and it's ready to go."]
df = pd.DataFrame(fruit_sentences, columns = ['text'])
banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']

print(df['text'].str.count(r'[XYZ_word in word list]'))

This is where the code blows up, because str.count() does not accept a list.

The end goal is to get back a list of tuples like this:

fruits = [('banana', 5), ('pear', 6), ('apple', 6)]

Yes, I could loop over all of the lists to do this, but it seems more likely that I just don't know enough Python than that Python doesn't know how to handle this.

I did find this question (linked here), but it seems everyone either answered it incorrectly or proposed a solution different from what was actually asked.

Thanks for helping this newbie figure it out!

YOLO · Answer 1 · September 20, 2018

For this I would use a dictionary lookup (super fast) and a Counter, which is O(n), to build the word counts.

# create a dict of look up values
d = {'banana': banana_words, 'apple': apple_words, 'pear':pear_words}

# preprocess data
df['text'] = df['text'].str.lower()
df['text'] = [re.sub(r'[^a-zA-Z0-9\s]','',x) for x in df['text']]
df['text'] = df.text.str.split()

# flatten the list and create a dict
from collections import Counter 

my_list = [i for s in df['text'] for i in s]
word_count = Counter(my_list)

# final job
output_dict = {k:len([x for x in v if x in word_count]) for k,v in d.items()}
sorted(output_dict.items(), key=lambda x: x[1])

[('apple', 2), ('banana', 3), ('pear', 3)]
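Note that the final step above counts how many of each fruit's keywords appear at least once, not how often they occur. If total occurrences are what is wanted, one possible variation is to sum the Counter values instead. This is only a sketch under the same preprocessing assumptions; the names `keywords`, `tokens`, `totals` and `result` are introduced here for illustration. Because the text is split into single words, the two-word phrase 'green leaf' still contributes nothing.

```python
import re
from collections import Counter

fruit_sentences = [
    "Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
    "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
    "Wednesday: The stout, sweet green fruit keeps me on my toes!",
    "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
    "Friday: Long yellow fruit day, peel it and it's ready to go.",
]
keywords = {
    "banana": ["yellow", "long", "peel"],
    "apple": ["round", "red", "green leaf"],
    "pear": ["stout", "sweet", "green"],
}

# Lowercase, strip punctuation, and split every sentence into word tokens.
tokens = [
    word
    for sentence in fruit_sentences
    for word in re.sub(r"[^a-z0-9\s]", "", sentence.lower()).split()
]
word_count = Counter(tokens)

# Sum occurrences instead of testing membership. A Counter returns 0 for
# missing keys, so the multi-word keyword 'green leaf' adds nothing here.
totals = {fruit: sum(word_count[w] for w in words) for fruit, words in keywords.items()}
result = sorted(totals.items(), key=lambda kv: kv[1])
print(result)  # [('apple', 4), ('banana', 6), ('pear', 6)]
```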

Abhi · Answer 2 · September 20, 2018

Use str.contains with a regular expression.

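# store lists in a dictionary for checking values.
a = {'banana': banana_words, 'apple': apple_words, 'pear':pear_words}

d = {}
# regular expression to match whole words, allowing one trailing punctuation mark
# (raw string so that \S, \w and \s are not treated as string escapes)
regex = r'(?<!\S){0}[^\w\s]?(?!\S)'

for i, j in a.items():
    # for each keyword, count the sentences that contain it (case-insensitive)
    d[i] = sum([df['text'].str.contains(regex.format(k), case=False).sum() for k in j])

print(list(d.items()))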

Output:

[('banana', 6), ('apple', 6), ('pear', 6)]
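str.contains reports whether a sentence contains a keyword, so the totals above are sentence counts; they match occurrence counts here only because no keyword repeats within a sentence. A possible alternative, sketched below, joins each list into a single regex alternation so that str.count can be called once per list. The helper name `count_keywords` and the dict `keywords` are assumptions made for this illustration, not part of the answer.

```python
import re
import pandas as pd

fruit_sentences = [
    "Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
    "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
    "Wednesday: The stout, sweet green fruit keeps me on my toes!",
    "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
    "Friday: Long yellow fruit day, peel it and it's ready to go.",
]
df = pd.DataFrame(fruit_sentences, columns=["text"])
keywords = {
    "banana": ["yellow", "long", "peel"],
    "apple": ["round", "red", "green leaf"],
    "pear": ["stout", "sweet", "green"],
}

def count_keywords(series, words):
    """Count all whole-word, case-insensitive occurrences of any keyword."""
    # Escape each keyword and join them into one alternation; the \b anchors
    # keep short keywords from matching inside longer words.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return int(series.str.count(pattern, flags=re.IGNORECASE).sum())

fruits = [(name, count_keywords(df["text"], words)) for name, words in keywords.items()]
print(fruits)  # [('banana', 6), ('apple', 6), ('pear', 6)]
```

Each list costs one vectorized pass over the column, and multi-word keywords such as 'green leaf' are handled because matching is done on the raw text rather than on split tokens.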

Andrea Nagy · Answer 3 · September 20, 2018

How about:

Python 3.6.4 / pandas 0.23.4:

import pandas as pd

def count(word_list):
    d = pd.Series(word_list).apply(lambda x: s.str.count(x))
    return d.sum()

fruit_sentences = ["Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
                        "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
                        "Wednesday: The stout, sweet green fruit keeps me on my toes!",
                        "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
                        "Friday: Long yellow fruit day, peel it and it's ready to go."]

banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']

keywords = {'banana': banana_words, 'apple': apple_words, 'pear': pear_words}

s = pd.Series(fruit_sentences)
res = pd.DataFrame(columns=[])
res['type'] = pd.Series(list(keywords.keys()))
res['value'] = pd.Series(list(keywords.values())).apply(lambda x: count(x)).sum(axis=1)
print(list(res.itertuples(index=False, name=None)))

Python 2.7.11 / pandas 0.17:

import pandas as pd


def count(word_list):
    d = pd.Series(word_list).apply(lambda x: s.str.count(x))
    return d.sum()


fruit_sentences = ["Monday: Yellow makes me happy.  So I eat a long, sweet fruit with a peel.",
                        "Tuesday: A fruit round red fruit with a green leaf a day keeps the doctor away.",
                        "Wednesday: The stout, sweet green fruit keeps me on my toes!",
                        "Thursday: Another day with the red round fruit.  I like to keep the green leaf.",
                        "Friday: Long yellow fruit day, peel it and it's ready to go."]

banana_words = ['yellow', 'long', 'peel']
apple_words = ['round', 'red', 'green leaf']
pear_words = ['stout', 'sweet', 'green']

keywords = {'banana': banana_words, 'apple': apple_words, 'pear': pear_words}

s = pd.Series(fruit_sentences)

res = pd.DataFrame(columns=[])
res['type'] = pd.Series(keywords.keys())
res['value'] = pd.Series(keywords.values()).apply(lambda x: count(x)).sum(axis=1)

print(list(res.itertuples(index=False)))

Both will give you:

[('banana', 4), ('apple', 6), ('pear', 6)]

jezrael · Answer 4 · September 20, 2018

I think you need:

#create dict for names of lists
d = {'banana': banana_words, 'apple': apple_words, 'pear':pear_words}
#join all sentences into one big string
L =  ' '.join(df['text'])

#count each value of lists and sum in generator
out = [(k, sum(L.count(x) for x in v)) for k,v in d.items()]
print (out)

[('banana', 4), ('apple', 6), ('pear', 6)]

If you want to match against lowercased text:

#join all sentences into one big lowercased string
L =  ' '.join(df['text']).lower()

#count each value of lists and sum in generator
out = [(k, sum(L.count(x) for x in v)) for k,v in d.items()]
print (out)

[('banana', 6), ('apple', 6), ('pear', 6)]
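One caveat with plain string counting: it counts substrings, so a keyword is also counted when it appears inside a longer word. That does not affect the sample sentences, but it can on other text. The sketch below contrasts the two behaviours on a made-up sentence (the text and word list are hypothetical, chosen only to show the difference).

```python
import re

# Hypothetical sentence: 'along' contains 'long', 'peeling' contains 'peel'.
text = "I walked along a long road and picked a pear near the peeling fence."
words = ["long", "peel"]

# Plain substring counting also matches inside longer words.
substring_total = sum(text.count(w) for w in words)

# Word-boundary matching counts whole words only.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
whole_word_total = len(pattern.findall(text))

print(substring_total, whole_word_total)  # 3 1
```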

Fastest way to get word counts from text using lookup words from lists?
