Generally, for grouping, a dict is a good way to go. For counting, you could use an implementation like the following:
c = {}
singleabstract = ('This is a research abstract that includes words like '
                  'mental health and anxiety. My hope is that I get my code to work and '
                  'not resort to alcohol.')
for s in singleabstract.split():
    s = ''.join(char.lower() for char in s if char.isalpha())  # '<punctuation>'.isalpha() yields False
    # you'll need to check if the word is in the dict
    # first, and set it to 1
    if s not in c:
        c[s] = 1
    # otherwise, increment the existing value by 1
    else:
        c[s] += 1
# You can sum the number of occurrences, but you'll need
# to use c.get to avoid KeyErrors
occurrences = sum(c.get(term, 0) for term in mh_terms)
occurrences
3
# or you can use an if in the generator expression
occurrences = sum(c[term] for term in mh_terms if term in c)
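The if/else bookkeeping above can also be folded into a single `dict.get` call. A minimal sketch, assuming a hypothetical helper name `count_words` and a short sample sentence that is not from the question:

```python
def count_words(text):
    """Count lowercase, letters-only words in text using a plain dict."""
    counts = {}
    for token in text.split():
        # Keep only alphabetic characters, lowercased
        word = ''.join(ch.lower() for ch in token if ch.isalpha())
        if not word:  # token held no letters at all (e.g. '--' or '42')
            continue
        # get() returns 0 for unseen words, so no membership test is needed
        counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_words('Anxiety and stress. More anxiety!')
print(counts)
```

Skipping empty strings is a small addition here: it keeps tokens made up only of punctuation or digits from being counted under the key `''`.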
The most idiomatic way to count occurrences is to use collections.Counter. It is a dict subclass, so key lookups are O(1) on average:
from collections import Counter
singleabstract = ('This is a research abstract that includes words like '
                  'mental health and anxiety. My hope is that I get my code to work and '
                  'not resort to alcohol.')
# the Counter can consume a generator expression analogous to
# the for loop in the dict implementation
c = Counter(''.join(char.lower() for char in s if char.isalpha())
            for s in singleabstract.split())
# Then you can iterate through
for term in mh_terms:
    # don't need to use get, as Counter will return 0
    # for missing keys, rather than raising KeyError
    print(term, c[term])
mental 1
ptsd 0
sud 0
substance abuse 0
drug abuse 0
alcohol 1
alcoholism 0
anxiety 1
depressing 0
bipolar 0
mh 0
smi 0
oud 0
opioid 0
To get your desired output, you can sum the values from the Counter object:
total_occurrences = sum(c[v] for v in mh_terms)
total_occurrences
3
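Putting the pieces together, here is a self-contained sketch of the Counter approach. The `mh_terms` list is an assumption, reconstructed from the terms printed in the output above (its original definition lives in the question), and `count_term_occurrences` is a hypothetical helper name.

```python
from collections import Counter

# Assumption: reconstructed from the terms printed in the output above
mh_terms = ['mental', 'ptsd', 'sud', 'substance abuse', 'drug abuse',
            'alcohol', 'alcoholism', 'anxiety', 'depressing', 'bipolar',
            'mh', 'smi', 'oud', 'opioid']

singleabstract = ('This is a research abstract that includes words like '
                  'mental health and anxiety. My hope is that I get my code to work and '
                  'not resort to alcohol.')


def count_term_occurrences(text, terms):
    """Return how many words in text match any of the given terms."""
    # Normalize each whitespace-separated token to lowercase letters only
    words = (''.join(ch.lower() for ch in token if ch.isalpha())
             for token in text.split())
    counts = Counter(words)
    # Counter returns 0 for missing keys, so no KeyError handling is needed
    return sum(counts[term] for term in terms)


total = count_term_occurrences(singleabstract, mh_terms)
print(total)  # 3
```

Because the text is split on whitespace, every key in the Counter is a single word, so multi-word terms such as 'substance abuse' can never match with this approach.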