Попытка сопоставить два df на основе трех общих столбцов, ни один из которых не является идентичным - PullRequest
0 голосов
/ 22 января 2020

У меня есть два df's

df1

    date            League                                 teams
0   201902272215    brazil cup                             foz do iguacu fcceara ce
1   201902272300    colombia primera a                     deportes tolimaatletico bucaramanga
2   201902272300    brazil campeonato gaucho 2nd division  ypiranga rsuniao frederiquense
3   201902272300    brazil campeonato gaucho 2nd division  esportivo rstupi rs
4   201902272300    brazil campeonato gaucho 2nd division  sao paulo rsgremio esportivo bage
14  201902280000    four nations women tournament (in usa) usa (w)japan (w)
25  201902280030    bolivia professional football league   real potosibolivar

df2

    date            league                     teams
0   201902280000    womens international       usa womenjapan women
1   201902280000    brazil amazonense          sul america ecrio negro am
2   201902280030    bolivia apertura           real potosibolivar
3   201902280030    brazil campeonato paulista palmeirasituano
4   201902280030    copa sudamericana          racing clubcorinthians

В результате я хотел бы получить все строки из df2, которые почти совпадают с df1

    date            league                     teams                      near_match
0   201902280000    womens international       usa womenjapan women       1
1   201902280000    brazil amazonense          sul america ecrio negro am 0
2   201902280030    bolivia apertura           real potosibolivar         1
3   201902280030    brazil campeonato paulista palmeirasituano            0
4   201902280030    copa sudamericana          racing clubcorinthians     0 

Я пытался использовать вариацию a для l oop, используя SequenceMatcher и устанавливая порог для совпадения выше 0,8, но мне не повезло.

df_1['merge_teams'] = df_1['teams'] # we will use these as the merge keys
df_1['merge_date'] = df_1['date']
# df_1['merge_league'] = df_1['league']


for teams_1, date_1, league_1 in df_1[['teams','date']].values:
    for ixb, (teams_1, teams_2) in enumerate(df_2[['teams','date']].values):
        if difflib.SequenceMatcher(None,teams_1,teams_2).ratio() > .8:
            df_2.ix[ixb,'merge_teams'] = teams_1 # creates a merge key in df_2
        if difflib.SequenceMatcher(None,date_1, date_2).ratio() > .8:
            df_2.ix[ixb,'merge_date'] = date_1 # creates a merge key in df_2

# This should rturn all rows where teams,date and league all match by over 80%
# This is just for teams and date, I want to include league as well

Буду признателен за любые советы или рекомендации.

...