use sklearn.feature_extraction.text.CountVectorizer.
Demo:
In [192]: from sklearn.feature_extraction.text import CountVectorizer
In [193]: cv = CountVectorizer(token_pattern='(?u)\\b\\w+\\b', vocabulary=list('abcd'))
In [194]: X = cv.fit_transform(df['col1'])
In [195]: X
Out[195]:
<3x4 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>
In [196]: X.A
Out[196]:
array([[1, 0, 1, 0],
       [0, 1, 1, 0],
       [0, 0, 0, 1]], dtype=int64)
In [197]: cv.get_feature_names()
Out[197]: ['a', 'b', 'c', 'd']
If we don't pass vocabulary, we get one column for each unique word:
In [203]: cv = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
In [204]: X = cv.fit_transform(df['col1'])
In [205]: X.A
Out[205]:
array([[1, 0, 1, 0, 1, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 0, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
In [206]: cv.get_feature_names()
Out[206]: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'i', 'p', 'x']
Source DF:
In [191]: df
Out[191]:
         col1
0  a, c, e, f
1  b, c, g, p
2  d, e, i, x