What I want to do
I want to remove dashes, but not hyphens, from sentences as a preprocessing step for NLP.
Input
samples = [
'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
'He is afraid of two things — spiders and senior prom.', #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
Expected result
#output
['A former employee of the accused company','offered a statement off the record.']
['He is afraid of two things', 'spiders and senior prom.']
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']
The sentences above are taken from the following two articles about hyphens and dashes.
Problem
- The first step, which was supposed to remove the '-' symbol, did not work, and I can't figure out why the second and third sentences were joined into a single string, with no quotation marks ('') between them.
#output
['A former employee of the accused company, — — —, offered a statement off the record.',
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
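A likely cause of the merged sentences, assuming the list literal is written exactly as shown in this post: Python joins adjacent string literals into one string when no comma separates them. The short lists below are made-up stand-ins, not the real samples.

```python
# Two string literals with no comma between them become ONE list element.
merged = [
    'first sentence.'   # note: no comma after this literal
    'second sentence.'
]

# With the comma in place, each literal is its own element.
separate = [
    'first sentence.',
    'second sentence.',
]

print(len(merged), merged)      # 1 ['first sentence.second sentence.']
print(len(separate), separate)  # 2 ['first sentence.', 'second sentence.']
```

If that is what happened here, adding a comma after each sentence in `samples` should keep the second and third sentences apart.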
I have no idea how to write code that distinguishes a hyphen from a dash.
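One way to tell the characters apart is by code point: the hyphen typed on a keyboard is U+002D, while the en dash and em dash are U+2013 and U+2014. The sketch below prints their Unicode names with the standard `unicodedata` module. The set `DASHES` and the helper `is_dash` are hypothetical names introduced for illustration, and which characters count as a "dash" is an assumption.

```python
import unicodedata

# Assumption: only the en dash and em dash count as "dashes" here.
DASHES = {'\u2013', '\u2014'}

def is_dash(ch):
    """Return True for an en/em dash, False for anything else (including '-')."""
    return ch in DASHES

for ch in ['-', '\u2013', '\u2014']:
    print(f'U+{ord(ch):04X}', unicodedata.name(ch), is_dash(ch))
# U+002D HYPHEN-MINUS False
# U+2013 EN DASH True
# U+2014 EM DASH True
```

This would also explain why comparing each word against `'-'` never matches: the samples contain U+2014, which is a different character from U+002D.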
Current code
samples = [
'A former employee of the accused company, — — —, offered a statement off the record.', #dash
'He is afraid of two things—spiders and senior prom.' #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
ignore_symbol = ['-']
for i in range(len(samples)):
    text = samples[i]
    ret = []
    for word in text.split(' '):
        ignore = len(word) <= 0
        for iw in ignore_symbol:
            if word == iw:
                ignore = True
                break
        if not ignore:
            ret.append(word)
    text = ' '.join(ret)
    samples[i] = text
print(samples)
#output
['A former employee of the accused company, — — —, offered a statement off the record.',
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
for i in range(len(samples)):
    list_temp = []
    text = samples[i]
    list_temp.extend([x.strip() for x in text.split(',') if not x.strip() == ''])
    samples[i] = list_temp
print(samples)
#output
[['A former employee of the accused company',
'— — —',
'offered a statement off the record.'],
['He is afraid of two things—spiders and senior prom.Fifty-six bottles of pop on the wall',
'fifty-six bottles of pop.']]
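For comparison, here is a minimal sketch that gets to the expected result in one pass by splitting on commas and dashes together with a regular expression. It assumes a "dash" is an en dash (U+2013) or em dash (U+2014) and that each sentence is a separate list element. `SEPARATOR` and `split_sentence` are hypothetical names, not part of the code above.

```python
import re

# A separator starts with a comma or dash (optionally preceded by whitespace)
# and swallows any further commas, dashes and whitespace that follow.
# The hyphen-minus '-' (U+002D) is not in the character class, so it is kept.
SEPARATOR = re.compile(r'\s*[,\u2013\u2014][,\u2013\u2014\s]*')

def split_sentence(sentence):
    """Split on commas/dashes and drop empty pieces."""
    return [part.strip() for part in SEPARATOR.split(sentence) if part.strip()]

samples = [
    'A former employee of the accused company, \u2014\u2014\u2014, offered a statement off the record.',
    'He is afraid of two things \u2014 spiders and senior prom.',
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.',
]

for sentence in samples:
    print(split_sentence(sentence))
```

Because the dash is matched as a character rather than as a whole word, this also handles dashes with no surrounding spaces, such as `things\u2014spiders`.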
Development environment
Python 3.7.0