So, given input data that looks like this:
import pandas as pd

a = 'She is the besst i like here'
b = ['', '', '', 'best', 'I', '', 'her']
c = ['x', '', '', '', 'x', '', '']
df = pd.DataFrame({'A': a.split(), 'B': b, 'C': c})
print(df)
       A     B  C
0    She        x
1     is
2    the
3  besst  best
4      i     I  x
5   like
6   here   her
Then this script:
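# Where B has no correction, fall back to the original word from A
df.loc[df['B'] == '', 'B'] = df.loc[df['B'] == '', 'A']
# Mark the first word of each sentence with 1, then number the sentences
df.loc[df['C'] == 'x', 'C'] = 1
df['C'] = pd.to_numeric(df['C'], errors='coerce').cumsum().ffill()
# Select both columns with a list (double brackets) and collect words per sentence
data = df.groupby('C')[['A', 'B']].agg(list).to_dict('list')
with open('file.txt', 'w') as f_out:
    for incorrect, correct in zip(*data.values()):
        print('{}. {}.'.format(' '.join(incorrect), ' '.join(correct)), file=f_out)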
creates file.txt, which contains:
She is the besst. She is the best.
i like here. I like her.
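To see why the grouping works, here is a minimal sketch of how the 'x' markers turn into sentence numbers. It uses a standalone Series rather than the DataFrame; the names markers, flags and group_ids are illustrative only, and Series.map is used here as a compact stand-in for the .loc assignment followed by pd.to_numeric.

```python
import pandas as pd

# 'x' flags the first word of each sentence, as in column C.
markers = pd.Series(['x', '', '', '', 'x', '', ''])

# Map 'x' to 1.0; every other value is missing from the dict and becomes NaN.
flags = markers.map({'x': 1.0})

# cumsum() counts the markers seen so far (NaN rows stay NaN),
# and ffill() copies each sentence number down to the rows below it.
group_ids = flags.cumsum().ffill()

print(group_ids.tolist())  # [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
```

Rows that share a number belong to the same sentence, which is what groupby('C') relies on.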
EDIT: Version with NaN values:
import numpy as np
import pandas as pd

a = 'She is the besst i like here'
b = [np.nan, np.nan, np.nan, 'best', 'I', np.nan, 'her']
c = ['x', np.nan, np.nan, np.nan, 'x', np.nan, np.nan]
df = pd.DataFrame({'A': a.split(), 'B': b, 'C': c})
# Where B is NaN, fall back to the original word from A
df.loc[df['B'].isna(), 'B'] = df.loc[df['B'].isna(), 'A']
# Mark the first word of each sentence with 1, then number the sentences
df.loc[df['C'] == 'x', 'C'] = 1
df['C'] = pd.to_numeric(df['C']).cumsum().ffill()
# Select both columns with a list (double brackets) and collect words per sentence
data = df.groupby('C')[['A', 'B']].agg(list).to_dict('list')
with open('file.txt', 'w') as f_out:
    for incorrect, correct in zip(*data.values()):
        print('{}. {}.'.format(' '.join(map(str, incorrect)), ' '.join(map(str, correct))), file=f_out)