I have raw text and want to extract the patient's gender from it, but I end up with fewer or more entries than there are rows in the input. How can I fix this?
fil = data['transcription']
print(fil)
Output:
0 SUBJECTIVE:, This 23-year-old white female pr...
1 PAST MEDICAL HISTORY:, He has difficulty climb...
2 HISTORY OF PRESENT ILLNESS: , I have seen ABC ...
3 2-D M-MODE: , ,1. Left atrial enlargement wit...
4 1. The left ventricular cavity size and wall ...
...
4994 HISTORY:, I had the pleasure of meeting and e...
4995 ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...
4996 SUBJECTIVE: , This is a 42-year-old white fema...
4997 CHIEF COMPLAINT: , This 5-year-old male presen...
4998 HISTORY: , A 34-year-old male presents today s...
Name: transcription, Length: 4999, dtype: object
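Before matching, it may be worth checking whether every entry of the column is actually a string. An `object` column can hold missing values (NaN, which is a float), and `re.findall` raises `TypeError` on anything that is not a string or bytes. A minimal sketch, where `sample` is a made-up stand-in for `data['transcription']`:

```python
import re

# Hypothetical stand-in for data['transcription']; the float NaN mimics a missing row.
sample = [
    "SUBJECTIVE:, This 23-year-old white female ...",
    float("nan"),
    "HISTORY: , A 34-year-old male ...",
]

# Collect the entries that are not strings.
non_strings = [x for x in sample if not isinstance(x, str)]
print(len(non_strings))

# re.findall rejects non-string input with a TypeError.
try:
    re.findall("male", sample[1])
    raised = False
except TypeError:
    raised = True
print(raised)
```

On the real data, `fil.isna().sum()` should give the number of missing rows directly.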
And this is the code I use to extract the gender from the text:
import re
gender_aux = []
for i in fil:
    try:
        gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", i) or [" "]
    except:
        gender_aux.append(' ')
        # pass
    gender_dict = {"male": ["gentleman", "man", "male", "boy", 'he'],
                   "female": ["lady", "female", "woman", "girl", 'she']}
    for g in gender:
        if g in gender_dict['male']:
            gender_aux.append('male')
            break
        elif g in gender_dict['female']:
            gender_aux.append('female')
            break
        else:
            gender_aux += [' ']
            break
print(len(gender_aux))
print(gender_aux)
If I remove `or [" "]` or the `else` branch, I get 4967 entries; otherwise I get 5032. I should actually get 4999 entries in total, one per row.
output:
4967 or 5032 #it should be 4999 when i do print(len(gender_aux))
['female', 'male', 'male', ' ', 'male', 'male', 'male', 'male', 'male', ' ', 'male'...]
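A possible explanation for the mismatch, assuming the second `for` loop sits inside the first one and some rows are not strings: when `re.findall` raises, the `except` branch appends `' '`, and then the inner loop still runs over the `gender` list left over from the previous row and appends a second time. Without `or [" "]`, a row with no match leaves `gender` empty, so the inner loop appends nothing. Below is a sketch that produces exactly one label per row. `extract_gender` and `sample` are hypothetical names, and the `\b` word boundaries and `re.IGNORECASE` are additions of mine, so that `he` does not match inside words like `the` and `She` at the start of a sentence is still found:

```python
import re

# Map every keyword to its label once, outside the loop.
GENDER_WORDS = {
    "male": ["gentleman", "man", "male", "boy", "he"],
    "female": ["lady", "female", "woman", "girl", "she"],
}
WORD_TO_LABEL = {w: label for label, words in GENDER_WORDS.items() for w in words}

# \b keeps "he" from matching inside "the" and "male" inside "female".
PATTERN = re.compile(r"\b(" + "|".join(WORD_TO_LABEL) + r")\b", re.IGNORECASE)


def extract_gender(text):
    """Return 'male', 'female', or ' ' for one row; non-strings give ' '."""
    if not isinstance(text, str):
        return " "
    match = PATTERN.search(text)
    if match is None:
        return " "
    return WORD_TO_LABEL[match.group(1).lower()]


# Hypothetical sample rows standing in for data['transcription'].
sample = [
    "SUBJECTIVE:, This 23-year-old white female presents ...",
    float("nan"),
    "The echocardiogram was normal.",
    "HISTORY: , A 34-year-old male presents today ...",
]

# One call per row, so the result always has as many entries as the input.
gender_aux = [extract_gender(row) for row in sample]
print(len(gender_aux) == len(sample))
print(gender_aux)
```

With pandas, `data['transcription'].apply(extract_gender)` should return a Series of the same length as the column, because the function returns exactly one value for every row, including missing ones.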