I would use sklearn.preprocessing.MultiLabelBinarizer in this case:
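import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# the separator is a regex, so use a raw string and the Python parser engine
df = pd.read_csv(filenames, sep=r":\s*", header=None, names=["cust", "items"], engine="python")
# sparse_output=True makes fit_transform return a SciPy sparse matrix, which from_spmatrix expects
mlb = MultiLabelBinarizer(sparse_output=True)
res = (pd.DataFrame.sparse.from_spmatrix(mlb.fit_transform(df["items"].str.split(",")),
                                         index=df.index,
                                         columns=mlb.classes_)
       .set_index(df["cust"]))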
Result:
In [24]: res
Out[24]:
             item_1062  item_1312  item_1545  item_158  item_1659  ...  item_2703  item_2858  item_454  item_575  \
cust                                                               ...
customer_21          0          0          0         0          0  ...          1          0         0         1
customer_11          0          0          0         1          0  ...          0          0         1         0
customer_10          0          0          0         0          0  ...          0          0         0         0
customer_4           0          1          1         0          0  ...          0          0         0         0
customer_6           1          0          0         0          0  ...          0          0         0         0
customer_23          0          0          0         0          1  ...          0          0         0         0
customer_14          0          0          0         0          0  ...          0          1         0         0

             item_613
cust
customer_21         0
customer_11         0
customer_10         1
customer_4          0
customer_6          0
customer_23         0
customer_14         0

[7 rows x 14 columns]

In [25]: res.dtypes
Out[25]:
item_1062    Sparse[int32, 0]
item_1312    Sparse[int32, 0]
item_1545    Sparse[int32, 0]
item_158     Sparse[int32, 0]
item_1659    Sparse[int32, 0]
item_1760    Sparse[int32, 0]
item_2007    Sparse[int32, 0]
item_2608    Sparse[int32, 0]
item_2610    Sparse[int32, 0]
item_2703    Sparse[int32, 0]
item_2858    Sparse[int32, 0]
item_454     Sparse[int32, 0]
item_575     Sparse[int32, 0]
item_613     Sparse[int32, 0]
dtype: object
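
To make it clearer what the binarizer computes, here is a rough plain-Python sketch of the same multi-hot encoding. It is only an illustration, not how scikit-learn implements it: the `sample_lines` data and the `multi_hot` helper are made up for this example, and the "<cust>: <item>,<item>,..." input format is assumed from the separator and the `str.split(",")` call.

```python
# Hypothetical sample input in the assumed "<cust>: <item>,<item>,..." format.
sample_lines = [
    "customer_1: item_10,item_20",
    "customer_2: item_20",
    "customer_3: item_30,item_10",
]


def multi_hot(lines):
    """Return (customers, classes, rows); rows[i][j] is 1 if
    customer i has item classes[j], otherwise 0."""
    customers, baskets = [], []
    for line in lines:
        cust, items = line.split(":", 1)
        customers.append(cust.strip())
        baskets.append({item.strip() for item in items.split(",") if item.strip()})
    # Sorted unique labels, analogous to mlb.classes_
    classes = sorted(set().union(*baskets))
    # One 0/1 row per customer, one column per label
    rows = [[int(label in basket) for label in classes] for basket in baskets]
    return customers, classes, rows


customers, classes, rows = multi_hot(sample_lines)
for cust, row in zip(customers, rows):
    print(cust, row)
```

The labels are sorted as strings, which is also why item_158 comes after item_1545 in the output above. The sparse version from the answer stores only the non-zero cells, so it is the better choice when there are many distinct items.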