Question

I have a SpaCy dependency tree created with this code:

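import spacy
from spacy import displacy

# Assumed model: the question does not show how nlp was created
nlp = spacy.load('en_core_web_lg')

text = "We could say to them that if in fact that's all there is, then we could, Oh, we can do something."
# With jupyter=True the figure is displayed inline, so wrapping the call in print() is unnecessary
displacy.render(nlp(text), style='dep', jupyter=True, options={'distance': 120})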

This displays the following:

SpaCy determines that this entire string is connected in a dependency tree. What I am trying to figure out is how to discern how direct or indirect the connection is between a word and the next word. For example, looking at the first 3 words:

'We' is connected to the next word 'could' because it is directly connected to 'say', which is directly connected to 'could'. Therefore, it is 2 connection points away from the next word.
'could' is directly connected to 'say'. Therefore, it is 1 connection point away from the next word.
and so on.

Essentially, I want to make a df that would look like this:

word  connection_points_to_next_word

We            2
could         1
say           1
...

I'm not sure how to achieve this. As SpaCy makes this graph, I'm sure there is some efficient way to calculate the number of vertices required to connect adjacent nodes, but all of SpaCy's tools I've found, such as:

token.lefts
token.rights
token.subtree
token.children
more here https://spacy.io/api/token

include information about the connection, but not about how direct it is. Any ideas on how to approach this problem?

thorntonc · Answer 1 · August 7, 2020

Using the networkx library, we can build an undirected graph from the edge list of token-child relations. I use each token's index within the document as its unique identifier, so that repeated words are treated as separate nodes.

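import spacy
import networkx as nx

nlp = spacy.load('en_core_web_lg')
text = "We could say to them that if in fact that's all there is, then we could, Oh, we can do something."
doc = nlp(text)

# One (head index, child index) edge per dependency arc
edges = []
for tok in doc:
    edges.extend([(tok.i, child.i) for child in tok.children])

# Undirected graph, so paths can go both up and down the tree
graph = nx.Graph(edges)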

The shortest path length between adjacent tokens can then be calculated as follows:

for idx, _ in enumerate(doc):
    if idx < len(doc)-1:
        print(doc[idx], doc[idx+1], nx.shortest_path_length(graph,source=idx, target=idx+1))

Output:

We could 2
could say 1
say to 1
to them 1
them that 4
that if 3
if in 2
in fact 1
fact that 3
that 's 1
's all 1
all there 2
there is 1
is , 4
, then 2
then we 2
we could 2
could , 2
, Oh 2
Oh , 2
, we 2
we can 2
can do 1
do something 1
something . 3
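
If you would rather not pull in networkx, the same distance can be obtained by walking up the tree from both tokens until the two paths meet. The following is only a minimal sketch under some assumptions: `heads` is a plain list where `heads[i]` is the index of token i's head and the root points to itself (which is what `[tok.head.i for tok in doc]` gives in spaCy), and the names `ancestors`, `tree_distance` and `adjacent_distances` are made up for this illustration. A tiny hand-built parse of the first three words is used here so the snippet runs without a model.

```python
def ancestors(heads, i):
    """Return the path from token i up to the root, as a list of indices."""
    path = [i]
    while heads[i] != i:  # assumed convention: the root is its own head
        i = heads[i]
        path.append(i)
    return path


def tree_distance(heads, a, b):
    """Number of edges on the path between tokens a and b in the tree."""
    # Map every node on a's upward path to its distance from a
    steps_from_a = {node: steps for steps, node in enumerate(ancestors(heads, a))}
    # Walk up from b until reaching a node that also lies above (or is) a
    for steps_from_b, node in enumerate(ancestors(heads, b)):
        if node in steps_from_a:
            return steps_from_a[node] + steps_from_b
    raise ValueError("tokens are not in the same tree")


def adjacent_distances(words, heads):
    """Pair each word (except the last) with its distance to the next word."""
    return [(words[i], tree_distance(heads, i, i + 1))
            for i in range(len(words) - 1)]


# Hand-built toy parse (not spaCy output): 'We' and 'could' both attach to 'say'
words = ["We", "could", "say"]
heads = [2, 2, 2]
rows = adjacent_distances(words, heads)
print(rows)  # [('We', 2), ('could', 1)]
```

With a real parse you would pass `[tok.text for tok in doc]` and `[tok.head.i for tok in doc]`. The resulting `rows` can go straight into `pandas.DataFrame(rows, columns=['word', 'connection_points_to_next_word'])` to get the table from the question. Note that in this sketch, adjacent tokens that sit in different sentences have no common ancestor, so `tree_distance` raises `ValueError` for them.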

Ranking the directness of dependencies from a spaCy tree