Second left outer join does not return the correct number of rows in Spark
February 27, 2019

I am currently working with 3 DataFrames and joining them together. I start with the network DataFrame and join the organization DataFrame to it with a left outer join on the OrgID column, which produces a new DataFrame. I then take that new DataFrame and join the asn DataFrame to it, again with a left outer join on OrgID.

The DataFrames:

network.show(5)
+--------------------+---------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+
|           NetHandle|    OrgID|          Parent|             NetName|            NetRange|     NetType|Comment|            RegDate|            Updated|Source|
+--------------------+---------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+
|NET-69-150-149-184-1|C00868859|NET-69-148-0-0-1|SBC06915014918429...|69.150.149.184 - ...|reassignment|   null|2004-07-23 00:00:00|2004-07-23 00:00:00|  ARIN|
| NET-69-224-242-40-1|C00868860|NET-69-224-0-0-1|SBC06922424204029...|69.224.242.40 - 6...|reassignment|   null|2004-07-23 00:00:00|2004-07-23 00:00:00|  ARIN|
| NET-170-55-30-176-1|  CC-3105|NET-170-55-0-0-1|FPLFI-CROWNSCFSW-...|170.55.30.176 - 1...|reassignment|   null|2018-03-26 00:00:00|2018-03-26 00:00:00|  ARIN|
| NET-69-224-249-24-1|C00868862|NET-69-224-0-0-1|SBC06922424902429...|69.224.249.24 - 6...|reassignment|   null|2004-07-23 00:00:00|2004-07-23 00:00:00|  ARIN|
| NET-69-29-107-152-1|C02309164| NET-69-29-0-0-1|   CTEL-CITZENS-BANK|69.29.107.152 - 6...|reassignment|   null|2009-09-03 00:00:00|2009-09-03 00:00:00|  ARIN|
+--------------------+---------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+
only showing top 5 rows

organization.show(5)
+---------+--------------------+-----------+--------------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
|    OrgID|             OrgName|CanAllocate|              Street|        City|State/Prov|Country|PostalCode|            RegDate|            Updated|OrgAdminHandle|OrgTechHandle|OrgAbuseHandle|Source|
+---------+--------------------+-----------+--------------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
|C05709929|      Allen Matthews|       null|           PO Box399|Simpsonville|        MD|     US|     21150|2015-05-02 00:00:00|2015-05-02 00:00:00|          null|         null|          null|  ARIN|
|C07025896|BIANCA BIANCA-180...|       null|     Private Address|       Plano|        TX|     US|     75075|2018-07-19 00:00:00|2018-07-19 00:00:00|          null|         null|          null|  ARIN|
|  TBL-353|TEST BVOIP COMPAN...|       null|225 W RANDOLPH UN...|        CHGO|        IL|     US|     99774|2015-05-02 00:00:00|2015-05-02 00:00:00|  SHRES56-ARIN| SHRES56-ARIN|  SHRES56-ARIN|  ARIN|
|  AIM-109|ASHLEY INDUSTRIAL...|       null|      951 2ND AVE SE|     OELWEIN|        IA|     US|     50662|2015-05-02 00:00:00|2015-05-02 00:00:00|  MARTZ16-ARIN| MARTZ16-ARIN|  MARTZ16-ARIN|  ARIN|
|C07025664|Brodynt Global Se...|       null|2500 William Park...|    Brampton|        ON|     CA|   L6S 5M9|2018-07-19 00:00:00|2018-07-19 00:00:00|          null|         null|          null|  ARIN|
+---------+--------------------+-----------+--------------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
only showing top 5 rows

asn.show(5)
+--------+---------+------------+--------+-------------------+--------------------+-------------------+------+
|ASHandle|    OrgID|      ASName|ASNumber|            RegDate|             Comment|            Updated|Source|
+--------+---------+------------+--------+-------------------+--------------------+-------------------+------+
|     AS0|     IANA| IANA-RSVD-0|       0|2002-09-13 00:00:00|Reserved - May be...|2002-09-13 00:00:00|  ARIN|
|     AS1|  LPL-141|      LVLT-1|       1|2001-09-20 00:00:00|                null|2018-02-20 00:00:00|  ARIN|
|     AS2|UNIVER-19|    UDEL-DCN|       2|1991-01-10 00:00:00|                null|2012-06-21 00:00:00|  ARIN|
|     AS3|    MIT-2|MIT-GATEWAYS|       3|1970-01-01 00:00:00|                null|2010-09-27 00:00:00|  ARIN|
|     AS4|   USC-32|      ISI-AS|       4|1984-02-22 00:00:00|                null|2012-03-13 00:00:00|  ARIN|
+--------+---------+------------+--------+-------------------+--------------------+-------------------+------+
only showing top 5 rows

These are the row counts of the 3 Spark DataFrames:

network.count()
3418057
organization.count()
3660886
asn.count()
27745

The first join works. As you can see, the count matches the network DataFrame's count of 3418057:

df = network.join(organization, ["OrgID"], 'leftouter')
df.show(2)
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
|  OrgID|           NetHandle|          Parent|             NetName|            NetRange|     NetType|Comment|            RegDate|            Updated|Source|             OrgName|CanAllocate|           Street|        City|State/Prov|Country|PostalCode|            RegDate|            Updated|OrgAdminHandle|OrgTechHandle|OrgAbuseHandle|Source|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
| 3DIMEN|  NET-199-33-182-0-1| NET-199-0-0-0-0|             NET-3DP|199.33.182.0 - 19...|  assignment|   null|1994-01-11 00:00:00|1994-06-21 00:00:00|  ARIN|                null|       null|             null|Philadelphia|        PA|     US|     19104|1994-01-11 00:00:00|2011-09-24 00:00:00|    JM143-ARIN|   JM143-ARIN|    JM143-ARIN|  ARIN|
|AA-1166|NET6-2001-1890-13...|NET6-2001-1890-1|ATTW-2001-1890-13...|2001:1890:131E:6D00:|reallocation|   null|2016-02-29 00:00:00|2016-02-29 00:00:00|  ARIN|AMERICAN ACCESSORIES|       null|3100 BANDINI BLVD|      VERNON|        CA|     US|     40456|2016-02-29 00:00:00|2016-02-29 00:00:00|   DURAZ5-ARIN|  DURAZ5-ARIN|   DURAZ5-ARIN|  ARIN|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
only showing top 2 rows

Edit: corrected a typo

print(df.count())
[Stage 52:====================================================> (193 + 7) / 200]3418057

But when I take the new DataFrame and left outer join it with the asn DataFrame, I expect a count of 3418057, but I get 4987448 instead:

df1 = df.join(asn, ["OrgID"], 'leftouter')
df1.show(2)
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+--------+------+--------+-------+-------+-------+------+
|  OrgID|           NetHandle|          Parent|             NetName|            NetRange|     NetType|Comment|            RegDate|            Updated|Source|             OrgName|CanAllocate|           Street|        City|State/Prov|Country|PostalCode|            RegDate|            Updated|OrgAdminHandle|OrgTechHandle|OrgAbuseHandle|Source|ASHandle|ASName|ASNumber|RegDate|Comment|Updated|Source|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+--------+------+--------+-------+-------+-------+------+
| 3DIMEN|  NET-199-33-182-0-1| NET-199-0-0-0-0|             NET-3DP|199.33.182.0 - 19...|  assignment|   null|1994-01-11 00:00:00|1994-06-21 00:00:00|  ARIN|                null|       null|             null|Philadelphia|        PA|     US|     19104|1994-01-11 00:00:00|2011-09-24 00:00:00|    JM143-ARIN|   JM143-ARIN|    JM143-ARIN|  ARIN|    null|  null|    null|   null|   null|   null|  null|
|AA-1166|NET6-2001-1890-13...|NET6-2001-1890-1|ATTW-2001-1890-13...|2001:1890:131E:6D00:|reallocation|   null|2016-02-29 00:00:00|2016-02-29 00:00:00|  ARIN|AMERICAN ACCESSORIES|       null|3100 BANDINI BLVD|      VERNON|        CA|     US|     40456|2016-02-29 00:00:00|2016-02-29 00:00:00|   DURAZ5-ARIN|  DURAZ5-ARIN|   DURAZ5-ARIN|  ARIN|    null|  null|    null|   null|   null|   null|  null|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+--------+------+--------+-------+-------+-------+------+
only showing top 2 rows

print(df1.count())
[Stage 70:==================================================>   (187 + 7) / 200]4987448

This DataFrame should have a count of 3418057, not 4987448. What am I doing wrong?
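A possible explanation, offered as a sketch rather than a confirmed diagnosis: a left outer join keeps every left row at least once, but it emits one output row per matching right row. The count can therefore only grow past the left side's count when the right side (here asn) holds several rows with the same OrgID, for example an organization that owns more than one AS number. The plain-Python illustration below uses hypothetical toy rows (ORG-A, ORG-B, NET-1, AS100 and so on are made up, not taken from the ARIN data) and a hand-written `left_outer_join` helper, not Spark itself.

```python
from collections import Counter


def left_outer_join(left, right, key):
    """Left outer join of two lists of dicts on `key` (illustration only)."""
    # Index the right side by key; one key may map to several rows.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for lrow in left:
        # As in SQL, a null (None) key never matches anything.
        matches = index.get(lrow[key]) if lrow[key] is not None else None
        if matches:
            # One output row per matching right row: this is where rows multiply.
            for rrow in matches:
                joined.append({**rrow, **lrow})
        else:
            # No match: the left row is kept once, without right-side columns.
            joined.append(dict(lrow))
    return joined


# Hypothetical toy data: ORG-A owns two networks and two AS numbers.
networks = [
    {"OrgID": "ORG-A", "NetHandle": "NET-1"},
    {"OrgID": "ORG-A", "NetHandle": "NET-2"},
    {"OrgID": "ORG-B", "NetHandle": "NET-3"},
]
asns = [
    {"OrgID": "ORG-A", "ASHandle": "AS100"},
    {"OrgID": "ORG-A", "ASHandle": "AS200"},
]

joined = left_outer_join(networks, asns, "OrgID")

# Keys that appear more than once on the right side are the ones that multiply rows.
duplicate_keys = {k: n for k, n in Counter(r["OrgID"] for r in asns).items() if n > 1}

print(len(networks), len(joined), duplicate_keys)
```

On the real DataFrames, a comparable check would be something like asn.groupBy("OrgID").count().filter("count > 1").show(). If duplicates turn up, whether to deduplicate first (for example asn.dropDuplicates(["OrgID"])) or to aggregate the AS rows per OrgID before joining depends on which AS information needs to survive.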

...