Use the aptly named sklearn.model_selection.train_test_split():
from sklearn.model_selection import train_test_split
data = {
    'spam': [
        ['hi', "what's", 'going', 'on', 'sexy', 'thing'],
        ['1-800', 'call', 'girls', 'if', "you're", 'lonely'],
        ['sexy', 'girls', 'for', 'youuuuuu']],
    'ham': [
        ['hey', 'hey', 'I', 'got', 'your', 'message,', "I'll", 'be', 'home', 'soon!!!'],
        ['Madden', 'MUT', 'time', 'boys']]}
# Flatten into (words, label) pairs so each message keeps its class
all_messages = [(words, k) for k, v in data.items() for words in v]
# Hold out 20% of the messages (1 of 5 here) as the test set
train, test = train_test_split(all_messages, test_size=0.2)
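The split is shuffled at random, so each run can put different messages in each part. A minimal sketch of making it repeatable, assuming you want the same split every time: pass random_state (the value 42 and the names train_again / test_again are arbitrary choices for this illustration, not part of the original answer).

```python
from sklearn.model_selection import train_test_split

data = {
    'spam': [
        ['hi', "what's", 'going', 'on', 'sexy', 'thing'],
        ['1-800', 'call', 'girls', 'if', "you're", 'lonely'],
        ['sexy', 'girls', 'for', 'youuuuuu']],
    'ham': [
        ['hey', 'hey', 'I', 'got', 'your', 'message,', "I'll", 'be', 'home', 'soon!!!'],
        ['Madden', 'MUT', 'time', 'boys']]}

# (words, label) pairs, as in the snippet above
all_messages = [(words, label) for label, msgs in data.items() for words in msgs]

# random_state seeds the shuffle; 42 is an arbitrary value chosen for this sketch
train, test = train_test_split(all_messages, test_size=0.2, random_state=42)
train_again, test_again = train_test_split(all_messages, test_size=0.2, random_state=42)

print(len(train), len(test))
print(train == train_again and test == test_again)
```

Without random_state the sizes stay the same, but which messages land in train and test changes between runs.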
You can, and probably should, use something more powerful, such as Pandas:
import pandas as pd
from sklearn.model_selection import train_test_split
data_dict = {
    'spam': [
        ['hi', "what's", 'going', 'on', 'sexy', 'thing'],
        ['1-800', 'call', 'girls', 'if', "you're", 'lonely'],
        ['sexy', 'girls', 'for', 'youuuuuu']],
    'ham': [
        ['hey', 'hey', 'I', 'got', 'your', 'message,', "I'll", 'be', 'home', 'soon!!!'],
        ['Madden', 'MUT', 'time', 'boys']]}
# One row per message: column 0 holds the label, column 1 the word list
df = pd.DataFrame(data=((k, words) for k, v in data_dict.items() for words in v))
print(df)
train, test = train_test_split(df, test_size=0.2)
print(train)
print(test)
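Output (the full frame, then train, then test; the split is random, so which rows land in each part will vary between runs):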
      0                                                  1
0  spam               [hi, what's, going, on, sexy, thing]
1  spam           [1-800, call, girls, if, you're, lonely]
2  spam                       [sexy, girls, for, youuuuuu]
3   ham  [hey, hey, I, got, your, message,, I'll, be, h...
4   ham                          [Madden, MUT, time, boys]
      0                                                  1
1  spam           [1-800, call, girls, if, you're, lonely]
2  spam                       [sexy, girls, for, youuuuuu]
0  spam               [hi, what's, going, on, sexy, thing]
3   ham  [hey, hey, I, got, your, message,, I'll, be, h...
      0                          1
4   ham  [Madden, MUT, time, boys]
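If you go on to train a classifier, you will usually want the word lists and the labels as separate objects. A rough sketch of one way to do that, assuming named columns: the column names 'label' and 'words', the X_/y_ variable names, and random_state=0 are all choices made for this illustration, not part of the original answer. train_test_split accepts several arrays at once and splits them along the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data_dict = {
    'spam': [
        ['hi', "what's", 'going', 'on', 'sexy', 'thing'],
        ['1-800', 'call', 'girls', 'if', "you're", 'lonely'],
        ['sexy', 'girls', 'for', 'youuuuuu']],
    'ham': [
        ['hey', 'hey', 'I', 'got', 'your', 'message,', "I'll", 'be', 'home', 'soon!!!'],
        ['Madden', 'MUT', 'time', 'boys']]}

# Build the rows first, then give the columns readable (hypothetical) names
rows = [(label, words) for label, msgs in data_dict.items() for words in msgs]
df = pd.DataFrame(rows, columns=['label', 'words'])

# Passing two Series yields four pieces: features and labels for each part
X_train, X_test, y_train, y_test = train_test_split(
    df['words'], df['label'], test_size=0.2, random_state=0)

print(X_train)
print(y_train)
```

Because both Series are split with the same shuffled indices, X_train and y_train keep matching index values, so each word list still lines up with its label.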