I have parsed XML data into a dict. The dict has the following form:
{'id': 'Q1',
'subject': 'Massage oil',
'question': 'Where I can buy good oil for massage?',
'comments': {},
'related': {'Q1_R1': {'rid': 'Q1_R1',
'rel_subject': 'massage oil',
'rel_question': 'is there any place i can find scented massage oils in qatar?',
'rel_givenRelevance': 'PerfectMatch',
'rel_givenRank': '1',
'rel_comments': {'Q1_R1_C1': {'cid': 'Q1_R1_C1',
'com_date': '2010-08-27 01:40:05',
'com_username': 'anonymous',
'comment': 'Yes. It is right behind Kahrama in the National area.',
'com_isTraining': True},
'Q1_R1_C2': {'cid': 'Q1_R1_C2',
'com_date': '2010-08-27 01:42:59',
'com_username': 'sognabodl',
'comment': 'whats the name of the shop?',
'com_isTraining': True},
'Q1_R1_C3': {'cid': 'Q1_R1_C3',
'com_date': '2010-08-27 01:44:09',
'com_username': 'anonymous',
'comment': "It's called Naseem Al-Nadir. Right next to the Smartlink shop. You'll find the chinese salesgirls at affordable prices there.",
'com_isTraining': True},
'Q1_R1_C4': {'cid': 'Q1_R1_C4',
'com_date': '2010-08-27 01:58:39',
'com_username': 'sognabodl',
'comment': 'dont want girls;want oil',
'com_isTraining': True},
'Q1_R1_C5': {'cid': 'Q1_R1_C5',
'com_date': '2010-08-27 01:59:55',
'com_username': 'anonymous',
'comment': "Try Both ;) I'am just trying to be helpful. On a serious note - Please go there. you'll find what you are looking for.",
'com_isTraining': True},
'Q1_R1_C6': {'cid': 'Q1_R1_C6',
'com_date': '2010-08-27 02:02:53',
'com_username': 'lawa',
'comment': 'you mean oil and filter both',
'com_isTraining': True},
'Q1_R1_C7': {'cid': 'Q1_R1_C7',
'com_date': '2010-08-27 02:04:29',
'com_username': 'anonymous',
'comment': "Yes Lawa...you couldn't be more right LOL",
'com_isTraining': True}},
'rel_featureVector': [],
'rel_isTraining': True}},
'featureVector': [],
'isTraining': True}
In general, the structure looks like this:
{ID : Q1,
...
related:{
Q1_R1 :{
rid:Q1_R1,
....
rel_comments:{
Q1_R1_C1: {
cid: Q1_R1_C1,
....
}
....
Q1_R1_C10
}
...
Q1_R10
}
...
ID : 100
}
I want to turn it into a table like this:
ID ... question rid ... rel_question cid .... comment
Q1 ... 1234 Q1_R1 ... 5678 Q1_R1_c1 .... 90
Q1 ... 1234 Q1_R1 ... 5678 Q1_R1_c2 .... 92
Q1 ... 1234 Q1_R1 ... 5678 Q1_R1_c3 .... 93
..........................................
Q100 ... 1234 Q100_R10 ... 5678 Q100_R10_c13 ....465
I am trying to flatten this dict, but I end up with the rid values (Q1_R1 ... Q100_R10)
and the cid values (Q1_R1_c1 ... Q100_R10_c13)
as columns instead of as row values. Is there a way to do this?
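A minimal sketch of one possible approach, assuming every question dict has the shape shown above: walk the three levels (question, related question, comment) and emit one flat row per comment, so that `rid` and `cid` end up as values rather than as column names. The function name `flatten_question` and the trimmed `sample` are hypothetical and used only for illustration.

```python
def flatten_question(q):
    """Yield one flat row per comment of a single parsed question dict."""
    for rel in q.get('related', {}).values():
        for com in rel.get('rel_comments', {}).values():
            yield {
                'id': q['id'],
                'question': q['question'],
                'rid': rel['rid'],
                'rel_question': rel['rel_question'],
                'cid': com['cid'],
                'comment': com['comment'],
            }


# Hypothetical, trimmed-down sample in the same shape as the dict above.
sample = {
    'id': 'Q1',
    'question': 'Where I can buy good oil for massage?',
    'related': {
        'Q1_R1': {
            'rid': 'Q1_R1',
            'rel_question': 'is there any place i can find scented massage oils in qatar?',
            'rel_comments': {
                'Q1_R1_C1': {'cid': 'Q1_R1_C1',
                             'comment': 'Yes. It is right behind Kahrama in the National area.'},
                'Q1_R1_C2': {'cid': 'Q1_R1_C2',
                             'comment': 'whats the name of the shop?'},
            },
        },
    },
}

rows = list(flatten_question(sample))
for row in rows:
    print(row['id'], row['rid'], row['cid'])
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` to get the table. For many questions, the rows of every question dict would be collected into one list first.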
This is the data for SemEval 2016 subtask 1. I think that using a DataFrame function such as apply
could improve performance, for example when computing how similar the Q1
question and the Q1_R1_C1
comment are ...
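To make the intended comparison concrete, here is a rough sketch that scores each question against each comment on already flattened rows. The `similarity` helper is hypothetical and uses `difflib.SequenceMatcher` from the standard library purely as a stand-in measure; it is not a metric defined by the SemEval task. The two sample rows are abbreviated from the dict above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in, not a SemEval metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical flat rows, one per (question, comment) pair.
rows = [
    {'id': 'Q1', 'question': 'Where I can buy good oil for massage?',
     'cid': 'Q1_R1_C1',
     'comment': 'Yes. It is right behind Kahrama in the National area.'},
    {'id': 'Q1', 'question': 'Where I can buy good oil for massage?',
     'cid': 'Q1_R1_C4',
     'comment': 'dont want girls;want oil'},
]

for row in rows:
    # Store the score next to the pair it was computed from.
    row['q_c_similarity'] = similarity(row['question'], row['comment'])
    print(row['cid'], round(row['q_c_similarity'], 3))
```

With the rows in a DataFrame `df`, the same helper could be applied row-wise with `df.apply(lambda r: similarity(r['question'], r['comment']), axis=1)`. A row-wise `apply` still calls the Python function once per row, so it is mainly a convenience rather than a guaranteed speed-up.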