Я использую библиотеку pandas_dedupe. Я получаю эту ошибку, когда пытаюсь запустить на машине Windows, но этот же код отлично работает на Ma c.
import pandas as pd
import pandas_dedupe as pdd
df=pd.read_csv('sample.csv')
df=pdd.dedupe_dataframe(df,['firstname','lastname','gender','zipcode','address'])
df.to_csv('sample_deduped.csv')
df=df[df['cluster id'].isnull() | ~df[df['cluster id'].notnull()].duplicated(subset='cluster id',keep='first')]
df.to_csv('sample_deuped_removed.csv')
Вот журналы на случай, если вы захотите взглянуть :
Traceback (most recent call last):
File "C:/Users/vikas.mittal/Desktop/python projects/untitled2/deduplication.py", line 10, in <module>
df=pdd.dedupe_dataframe(df,['firstname','lastname','gender','zipcode','address'])
File "C:\Users\vikas.mittal\Desktop\python projects\untitled2\venv\lib\site-packages\pandas_dedupe\dedupe_dataframe.py", line 213, in dedupe_dataframe
sample_size)
File "C:\Users\vikas.mittal\Desktop\python projects\untitled2\venv\lib\site-packages\pandas_dedupe\dedupe_dataframe.py", line 72, in _train
dedupe.consoleLabel(deduper)
File "C:\Users\vikas.mittal\Desktop\python projects\untitled2\venv\lib\site-packages\dedupe\convenience.py", line 36, in consoleLabel
uncertain_pairs = deduper.uncertainPairs()
File "C:\Users\vikas.mittal\Desktop\python projects\untitled2\venv\lib\site-packages\dedupe\api.py", line 714, in uncertainPairs
return self.active_learner.pop()
File "C:\Users\vikas.mittal\Desktop\python projects\untitled2\venv\lib\site-packages\dedupe\labeler.py", line 323, in pop
raise IndexError("No more unlabeled examples to label")
IndexError: No more unlabeled examples to label
Process finished with exit code 1