I'm trying to replicate this answer for some data in English:
Python / Pandas / spacy - iterate over a DataFrame and count the number of pos_ tags
I'm struggling to figure out where I'm going wrong. I believe I've worked out that set_value
is deprecated in favor of .at
, but beyond that I get an error saying my pandas DataFrame is not callable when I try to apply the function.
Example data structure:
import pandas as pd
d = {'name': [20, 21, 22, 23, 24],
'text': ["""In this chapter and throughout the book, I use the standard
NumPy convention of always using import numpy as np. You are,
of course, welcome to put from numpy import * in your code to
avoid having to write np., but I advise against making a habit of
this. The numpy namespace is large and contains a number of functions
whose names conflict with built-in Python functions (like min
and max).""", """This chapter will introduce you to the basics of using NumPy arrays, and should be
sufficient for following along with the rest of the book. While it’s not necessary to
have a deep understanding of NumPy for many data analytical applications, becoming
proficient in array-oriented programming and thinking is a key step along the
way to becoming a scientific Python guru.""",
"""The easiest way to create an array is to use the array function. This accepts any
sequence-like object (including other arrays) and produces a new NumPy array containing
the passed data. For example, a list is a good candidate for conversion:""",
"""Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape
inferred from the data. We can confirm this by inspecting the ndim and shape
attributes:""","""In addition to np.array, there are a number of other functions for creating new
arrays. As examples, zeros and ones create arrays of 0s or 1s, respectively, with a
given length or shape. empty creates an array without initializing its values to any particular
value. To create a higher dimensional array with these methods, pass a tuple
for the shape:"""]}
df = pd.DataFrame(data=d)
# current code didn't seem to accept int values, ideally it would
df['name'] = ['red','blue','green','yellow','orange']
Current adaptation of the answer posted above:
import spacy
import pandas as pd
from spacy.lang.en import English
from collections import defaultdict
nlp = spacy.load("en_core_web_lg")
def calculate_the_word_types(data):
    nouns = defaultdict(lambda: 0)
    verbs = defaultdict(lambda: 0)
    adjectives = defaultdict(lambda: 0)
    # count all tokens, but not the punctuations
    for i, row in data.iterrows():
        doc = nlp(row["name"] + " " + row["text"])
        data.at(i, "nr_token", len(list(map(lambda x: x.text,
                                            filter(lambda x: x.pos_ != 'PUNCT', doc)))))
        # count only the adjectives
        for a in map(lambda x: x.lemma_, filter(lambda x: x.pos_ == 'ADJ', doc)):
            adjectives[a] += 1
        data.at(i, "nr_adj", len(list(map(lambda x: x.text,
                                          filter(lambda x: x.pos_ == 'ADJ', doc)))))
        # count only the nouns
        for n in map(lambda x: x.lemma_, filter(lambda x: x.pos_ == 'NOUN', doc)):
            nouns[n] += 1
        data.at(i, "nr_noun", len(list(map(lambda x: x.text,
                                           filter(lambda x: x.pos_ == 'NOUN', doc)))))
        # count only the verbs
        for v in map(lambda x: x.lemma_, filter(lambda x: (x.pos_ == 'AUX') | (x.pos_ == 'VERB'), doc)):
            verbs[v] += 1
        data.at(i, "nr_verb", len(list(map(lambda x: x.text,
                                           filter(lambda x: (x.pos_ == 'AUX') | (x.pos_ == 'VERB'), doc)))))
    return data

calculate_the_word_types(df)
Which gives the error:
TypeError: '_AtIndexer' object is not callable
The output structure from the answer above would also be useful. I'd eventually like to add more output variables (e.g. number of sentences, word count, etc.), but understanding what is going wrong will help me troubleshoot.
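For reference, a minimal sketch of what the error message seems to point at: .at is an indexer, so it is used with square brackets and assignment rather than called like set_value. To stay self-contained, the sketch leaves spaCy out. The function name add_token_counts and the whitespace split are hypothetical stand-ins for the real pipeline, and the two short texts are made-up sample data. Wrapping the name value in str() is one possible way to let integer names through the string concatenation.

```python
import pandas as pd

# Made-up sample data; "name" holds integers, as in the original example
df = pd.DataFrame({
    "name": [20, 21],
    "text": ["A short example.", "Another, longer example text."],
})


def add_token_counts(data):
    """Hypothetical stand-in for the spaCy-based function: counts whitespace tokens."""
    # Create the output column up front so each cell assignment has a target
    data["nr_token"] = 0
    for i, row in data.iterrows():
        # str() allows integer values in "name" to be joined with the text
        combined = str(row["name"]) + " " + row["text"]
        tokens = combined.split()
        # .at is an indexer: square brackets plus assignment, not a function call
        data.at[i, "nr_token"] = len(tokens)
    return data


result = add_token_counts(df)
print(result[["name", "nr_token"]])
```

If this reading is right, the same change (data.at[i, "nr_adj"] = ..., and so on) would apply to each of the data.at(...) calls in the function above.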