How to perform column-level validation by joining one large DataFrame to many small DataFrames in Spark
0 votes
/ May 2, 2018

I have one large table (DataFrame) that contains more than 50 million records and 135 columns. For every row, I need to validate more than 50 of those columns.

So basically, for each row and each validated column, I need to look up the corresponding value in up to 25 lookup tables.

I have listed only 4 of the small tables here, but in my case there will be 25 of them.

For example, here is one of my validations, called the CityId validation.

For the CityId validation, we need to get the TownCode from Table2 by passing the physical province code, physical country code, and physical city name from Table1.

With the TownCode, I then need to go to Table3, pass the physical country code, physical province code, and TownCode, and get the CityID.

If a CityID is found, the value is valid (true); otherwise it is invalid (false).
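To make the rule above concrete, here is a minimal plain-Scala sketch of the two-step lookup, without Spark. The case class, the map-based lookups, and the choice of key fields are assumptions made for illustration. The town name in the usage example is hypothetical, and the codes are borrowed from the Table3 sample shown further down.

```scala
// Hypothetical, simplified view of one row of the main table (Table1).
final case class MainRow(countryCode: String, provinceCode: String, cityName: String)

object CityIdCheck {
  // Table2 as a map: (country code, province code, town name) -> town code.
  type TownLookup = Map[(String, String, String), String]
  // Table3 as a map: (country code, state code, town code) -> city id.
  type CityLookup = Map[(String, String, String), String]

  // Step 1: find the TownCode in Table2. Step 2: use it to find the CityID in Table3.
  def cityId(row: MainRow, towns: TownLookup, cities: CityLookup): Option[String] =
    for {
      townCode <- towns.get((row.countryCode, row.provinceCode, row.cityName))
      id       <- cities.get((row.countryCode, row.provinceCode, townCode))
    } yield id

  // The validation passes only when both lookups succeed.
  def isValid(row: MainRow, towns: TownLookup, cities: CityLookup): Boolean =
    cityId(row, towns, cities).isDefined
}
```

For example, with `towns = Map(("805", "001", "SPRINGFIELD") -> "006129")` and `cities = Map(("805", "001", "006129") -> "110880")`, a row `MainRow("805", "001", "SPRINGFIELD")` is valid, while a row with an unknown city name is not.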

This is what my DataFrames look like.

The logic above is an example for just one column, but I need to perform this kind of validation for more than 50 columns.

Can we do this in Spark?

Table1: main table (50 million records)

+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
|filler1|dunsnumber|businessname                               |tradestylename              |registeredaddressindicator|physicalstreetaddress    |physicalstreetaddress2|physicalcityname|physicalstateorprovincename|physicalcountryname|physicalcitycode|physicalcountycode|physicalstateorprovincecode|physicalstateorprovinceabbreviation|physicalcountrycode|physicalpostalcode|physicalcontinentcode|mailingaddress|mailingcityname|mailingcountyname|mailingstateorprovincename|mailingcountryname|mailingcitycode|mailingcountycode|mailingstateorprovincecode|mailingstateorprovinceabbreviation|mailingcountrycode|mailingpostalcode|mailingcontinentcode|nationalidentificationnumber|nationalidentificationsystemcode|countrytelephoneaccesscode|telephonenumber|cabletelex|faxnumber |chiefexecutiveofficername|chiefexecutiveofficertitle|lineofbusiness                           |sic1|sic2|sic3|sic4|sic5|sic6|primarylocalactivitycode|activityindicator|yearstarted|annualsaleslocal  |annualsalesindicator|annualsalesinusd|currencycode|employeeshere|employeeshereindicator|employeestotal|employeestotalindicator|includeprinciplesindicator|importexportagentindicator|legalstatus|filler2|statuscode|subsidiarycode|filler3|previousdunsnumber|financialstatementdate|filler4|headquarterorparentdunsnumber|headquarterorparentbusinessname            |headquarterorparentstreetaddress|headquarterorparentcityname|headquarterorparentstateorprovincename|headquarterorparentcountryname|headquarterorparentcitycode|headquarterorparentcountycode|headquarterorparentstateorprovinceabbreviation|headquarterorparentcountrycode|headquarterorparentpostalcode|headquarterorparentcontinentcode|filler5|domesticultimatedunsnumbers|domesticultimatebusinessname          
|domesticultimatephysicalstreetaddress|domesticultimatecityname|domesticultimatestateorprovincename|domesticultimatecitycode|domesticultimatecountrycode|domesticultimatestateorprovinceabbreviation|domesticultimatepostalcode|globalultimateindicator|filler6|globalultimatedunsnumber|globalultimatebusinessname            |globalultimatestreetaddress           |globalultimatecityname|globalultimatestateorprovincename|globalultimatecountryname|globalultimatecitycode|globalultimatecountycode|globalultimatestateorprovinceabbreviation|globalultimatecountrycode|globalultimatepostalcode|globalultimatecontinentcode|numberoffamilymembers|diascode |hierarchycode|filler7|filler8|urldomain               |naics1|naics2|naics3|naics4|naics5|naics6|publicprivateindicator|obindicator|latitude  |longitude  |oporactdescpart1                                                                                                                                                                                                                         |oporactdescpart2|oporactdescpart3|oporactdescpart4|oporactdescpart5|nixieindicator|delistindicator|primary8digitsic|primary8digitdescription                                    |primarynaicsdescription                                        |natlidfull|transactionalindicator|
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+
|       |001007108 |DOLGENCORP, LLC                            |DOLLAR GENERAL              |N                         |1342 PINE ST             |                      |UNADILLA        |GEORGIA                    |USA                |008857          |296               |019                        |GA                                 |805                |31091             |6                    |              |               |                 |                          |                  |               |000              |000                       |                                  |000               |                 |                    |                            |                                |0001                      |4786279585     |          |          |EVE MEADOWS              |MANAGER                   |VARIETY STORES                           |5331|    |    |    |    |    |                        |000              |0000       |000000000000000000|                    |000000000000000 |            |0000006      |1                     |              |                       |Y                         |G                         |000        |       |2         |0             |       |000000000         |00000000              |       |068331990                    |DOLGENCORP, LLC                            |100 MISSION RDG                 |GOODLETTSVILLE             |TENNESSEE                             |USA                           |003754                     |203                          |TN                                            |805                           |370722171                    |6                               |       |006946172                  |DOLLAR GENERAL CORPORATION            |100 MISSION RDG                      |GOODLETTSVILLE          |TENNESSEE                          |003754                  |805                        |TN                                         |370722171                 |N                      |       |006946172  
             |DOLLAR GENERAL CORPORATION            |100 MISSION RDG                       |GOODLETTSVILLE        |TENNESSEE                        |USA                      |003754                |203                     |TN                                       |805                      |370722171               |6                          |11210                |005479269|02           |       |       |                        |452319|      |      |      |      |      |                      |N          |+32.252708|-083.740074|                                                                                                                                                                                                                                         |                |                |                |                |N             |N              |53310000        |VARIETY STORES                                              |ALL OTHER GENERAL MERCHANDISE STORES                           |          |C                     |
|       |001132690 |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|ADVANCE AMERICA             |N                         |332 N L ROGERS WELLS BLVD|                      |GLASGOW         |KENTUCKY                   |USA                |003211          |060               |033                        |KY                                 |805                |421411300         |6                    |              |               |                 |                          |                  |               |000              |000                       |                                  |000               |                 |                    |                            |                                |0001                      |2706511990     |          |          |LISA BROWN               |MANAGER                   |PERSONAL CREDIT INSTITUTIONS             |6141|    |    |    |    |    |                        |000              |0000       |000000000000000000|                    |000000000000000 |            |0000002      |0                     |              |                       |Y                         |G                         |000        |       |2         |0             |       |000000000         |00000000              |       |179469978                    |ADVANCE AMERICA, CASH ADVANCE CENTERS, INC.|135 N CHURCH ST                 |SPARTANBURG                |SOUTH CAROLINA                        |USA                           |008468                     |839                          |SC                                            |805                           |293065138                    |6                               |       |078454395                  |EAGLE U.S. SUB, INC.                  
|135 N CHURCH ST                      |SPARTANBURG             |SOUTH CAROLINA                     |008468                  |805                        |SC                                         |293065138                 |N                      |       |811589639               |GRUPO ELEKTRA, S.A.B. DE C.V.         |AV. FERROCARRIL DE RIO FRIO NO. 419 CJ|CIUDAD DE MEXICO      |CIUDAD DE MEXICO                 |MEXICO                   |009100                |000                     |CDMX                                     |489                      |09310                   |5                          |04316                |008037671|03           |       |       |WWW.ADVANCEAMERICA.NET  |522291|      |      |      |      |      |                      |N          |+37.006016|-085.924526|                                                                                                                                                                                                                                         |                |                |                |                |N             |N              |61410000        |PERSONAL CREDIT INSTITUTIONS                                |CONSUMER LENDING                                               |          |C                     |
|       |001134456 |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION |                            |N                         |126 DANIEL ST            |                      |PORTSMOUTH      |NEW HAMPSHIRE              |USA                |006885          |725               |057                        |NH                                 |805                |038013857         |6                    |              |               |                 |                          |                  |               |000              |000                       |                                  |000               |                 |                    |                            |                                |0001                      |               |          |          |BARBARA CONDA            |MANAGER                   |NATIONAL COMMERCIAL BANKS, NSK           |6021|    |    |    |    |    |                        |000              |0000       |000000000000000000|                    |000000000000000 |            |0000015      |0                     |              |                       |Y                         |G                         |000        |       |2         |0             |       |000000000         |00000000              |       |072147077                    |PEOPLE'S UNITED BANK, NATIONAL ASSOCIATION |850 MAIN ST FL 6                |BRIDGEPORT                 |CONNECTICUT                           |USA                           |000677                     |112                          |CT                                            |805                           |066044917                    |6                               |       |800407673                  |PEOPLE'S UNITED FINANCIAL, INC.       
|850 MAIN ST                          |BRIDGEPORT              |CONNECTICUT                        |000677                  |805                        |CT                                         |066044917                 |N                      |       |800407673               |PEOPLE'S UNITED FINANCIAL, INC.       |850 MAIN ST                           |BRIDGEPORT            |CONNECTICUT                      |USA                      |000677                |112                     |CT                                       |805                      |066044917               |6                          |00583                |014029370|02           |       |       |WWW.BRANCHES.PEOPLES.COM|522110|      |      |      |      |      |                      |N          |+43.077690|-070.755372|                                                                                                                                                                                                                                         |                |                |                |                |P             |N              |60210000        |NATIONAL COMMERCIAL BANKS                                   |COMMERCIAL BANKING                                             |          |C                     |
+-------+----------+-------------------------------------------+----------------------------+--------------------------+-------------------------+----------------------+----------------+---------------------------+-------------------+----------------+------------------+---------------------------+-----------------------------------+-------------------+------------------+---------------------+--------------+---------------+-----------------+--------------------------+------------------+---------------+-----------------+--------------------------+----------------------------------+------------------+-----------------+--------------------+----------------------------+--------------------------------+--------------------------+---------------+----------+----------+-------------------------+--------------------------+-----------------------------------------+----+----+----+----+----+----+------------------------+-----------------+-----------+------------------+--------------------+----------------+------------+-------------+----------------------+--------------+-----------------------+--------------------------+--------------------------+-----------+-------+----------+--------------+-------+------------------+----------------------+-------+-----------------------------+-------------------------------------------+--------------------------------+---------------------------+--------------------------------------+------------------------------+---------------------------+-----------------------------+----------------------------------------------+------------------------------+-----------------------------+--------------------------------+-------+---------------------------+--------------------------------------+-------------------------------------+------------------------+-----------------------------------+------------------------+---------------------------+-------------------------------------------+--------------------------+-----------------------+-------+-----------
-------------+--------------------------------------+--------------------------------------+----------------------+---------------------------------+-------------------------+----------------------+------------------------+-----------------------------------------+-------------------------+------------------------+---------------------------+---------------------+---------+-------------+-------+-------+------------------------+------+------+------+------+------+------+----------------------+-----------+----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------+----------------+----------------+--------------+---------------+----------------+------------------------------------------------------------+---------------------------------------------------------------+----------+----------------------+

The lookup tables are very small, no more than 10 MB each.

Table2

+------------+------------+------------+-------------+---------+--------------+
|COUNTRY_CODE|COUNTRY_NAME|PROVINCE    |PROVINCE_CODE|TOWN_CODE|TOWN_NAME     |
+------------+------------+------------+-------------+---------+--------------+
|021         |ANDORRA     |null        |000          |000002   |ALDOSA        |
|021         |ANDORRA     |null        |000          |000013   |EL TARTER     |
|033         |ARGENTINA   |BUENOS AIRES|001          |000223   |OLIVOS        |
|033         |ARGENTINA   |BUENOS AIRES|001          |000226   |PABLO PODESTA |
+------------+------------+------------+-------------+---------+--------------+

Table3

+------+--------+-----------+---------+
|CityID|TownCode|CountryCode|StateCode|   
+------+--------+-----------+---------+
|110880|006129  |805        |001      |
|110888|007554  |805        |005      |
|111164|004661  |805        |009      |
|111368|005193  |805        |075      |
+------+--------+-----------+---------+

Table4: Identifier

+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|IdentifierTypeId|Value |EntityId  |ValueTypeId|EffectiveFrom       |ProviderId|ProviderType|SourceUpdateDate|SourceLink|SourceType|EffectiveToNACode|EffectiveToMinus|EffectiveTo           |EffectiveFromNACode|EffectiveFromPlus|NaCode|IsPrimary|ValueOrder|ValueTypeCode|EntityType|EntityTypeId|SysFrom             |SysFileId           |
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+
|320114          |3339  |4294963171|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114          |333997|4294963154|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114          |333999|4294963153|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
|320114          |334   |4294963152|320114     |1/1/1997 12:00:00 AM|null      |null        |null            |null      |null      |NA02             |null            |12/31/9999 12:00:00 AM|null               |null             |null  |False    |1         |Naics        |Industry  |404008      |7/2/2015 12:00:00 AM|2015-07-02-0000.Full|
+----------------+------+----------+-----------+--------------------+----------+------------+----------------+----------+----------+-----------------+----------------+----------------------+-------------------+-----------------+------+---------+----------+-------------+----------+------------+--------------------+--------------------+

1 Answer

0 votes
/ May 2, 2018

Yes, you can do this in Spark. There are two approaches:

  1. Broadcast the small tables, then use filter or where on the big table
  2. Do a broadcast join

Here is a basic example of the first approach.

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

object Main {

  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val sql = new SQLContext(sc)

  def main(args: Array[String]): Unit = {

    sc.setLogLevel("ERROR")
    import sql.implicits._

    // Create a DataFrame with the valid data. Column names will be _1 and _2.
    val validDataRdd = sc.parallelize(Seq((1, 2), (2, 3), (3, 4), (10, 20), (20, 31), (30, 40), (100, 200), (200, 300)))
    val validDataDf = sql.createDataFrame(validDataRdd)

    // This is the big DataFrame. It has a single Int column.
    val theData = sc.parallelize(1 to 10000).toDF()

    // To broadcast the data, it first needs to be collected to the driver.
    // Instead of broadcasting Array[Row], one can map each Row to a custom case class for more convenient processing.
    val localValidData = validDataDf.collect()
    val broadcastedValidData = sc.broadcast(localValidData)

    // Filtering is easier on RDDs, but it is also possible with DataFrames.
    // Keep only the rows whose first column matches the first column of some valid row.
    theData.rdd.filter(rowBig =>
      broadcastedValidData.value.exists(row => row.getAs[Int](0) == rowBig.getAs[Int](0))
    ).collect().foreach(println)
  }
}
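The same filter can also be written purely with DataFrames. The following is a sketch, assuming Spark 2.x or later (for `SparkSession`) and an explicit column name `id` that is not in the original example. It uses a left semi join against the broadcast small DataFrame, which keeps the rows of the big DataFrame that have a match and adds no columns from the small one.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object SemiJoinFilter {
  // Keep only the rows of `data` whose "id" also appears in `valid`.
  // The broadcast hint asks Spark to ship the small DataFrame to every executor.
  def keepValid(data: DataFrame, valid: DataFrame): DataFrame =
    data.join(broadcast(valid), Seq("id"), "left_semi")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("semi-join-filter").master("local[*]").getOrCreate()
    import spark.implicits._

    // Same keys as the first column of the valid data in the example above.
    val validKeys = Seq(1, 2, 3, 10, 20, 30, 100, 200).toDF("id")
    val theData = (1 to 10000).toDF("id")

    keepValid(theData, validKeys).orderBy("id").show()
    spark.stop()
  }
}
```

Compared with the RDD version, this avoids collecting the small table by hand and lets Spark plan the lookup as a join.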

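EDIT (added a broadcast join example):

import org.apache.spark.sql.functions.broadcast

// Left join with a hint to broadcast the small customers DataFrame.
val ordersByCustomer = ordersDataFrame
  .join(broadcast(customersDataFrame), ordersDataFrame("customers_id") === customersDataFrame("id"), "left")

// foreach runs on the executors, so on a cluster the output goes to the executor logs.
ordersByCustomer.foreach(customerOrder => {
  println("> " + customerOrder.toString())
})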
...
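Applied to the CityId validation from the question, the broadcast join approach might look like the sketch below. It assumes Spark 2.x or later, and it assumes which columns correspond to each other: the main table's `physicalcountrycode`, `physicalstateorprovincecode`, and `physicalcityname` are matched against Table2's `COUNTRY_CODE`, `PROVINCE_CODE`, and `TOWN_NAME`, and then against Table3's `CountryCode`, `StateCode`, and `TownCode`. The helper name `withCityIdFlag`, the output column `cityid_valid`, and the `t2_`/`t3_` prefixes are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, col}

object CityIdValidation {
  // Adds a boolean column "cityid_valid" to the main table (hypothetical name).
  def withCityIdFlag(main: DataFrame, table2: DataFrame, table3: DataFrame): DataFrame = {
    // Rename the lookup columns to avoid name clashes, and keep one arbitrary row
    // per key so the joins cannot multiply rows of the big table.
    val towns = table2.select(
      col("COUNTRY_CODE").as("t2_country"),
      col("PROVINCE_CODE").as("t2_province"),
      col("TOWN_NAME").as("t2_town_name"),
      col("TOWN_CODE").as("t2_town_code")
    ).dropDuplicates("t2_country", "t2_province", "t2_town_name")

    val cities = table3.select(
      col("CountryCode").as("t3_country"),
      col("StateCode").as("t3_state"),
      col("TownCode").as("t3_town_code"),
      col("CityID").as("t3_city_id")
    ).dropDuplicates("t3_country", "t3_state", "t3_town_code")

    main
      // Step 1: look up the town code in Table2.
      .join(broadcast(towns),
        col("physicalcountrycode") === col("t2_country") &&
          col("physicalstateorprovincecode") === col("t2_province") &&
          col("physicalcityname") === col("t2_town_name"),
        "left")
      // Step 2: look up the city id in Table3 using the town code from step 1.
      .join(broadcast(cities),
        col("physicalcountrycode") === col("t3_country") &&
          col("physicalstateorprovincecode") === col("t3_state") &&
          col("t2_town_code") === col("t3_town_code"),
        "left")
      // The row is valid only if both lookups found a match.
      .withColumn("cityid_valid", col("t3_city_id").isNotNull)
      .drop("t2_country", "t2_province", "t2_town_name", "t2_town_code",
        "t3_country", "t3_state", "t3_town_code", "t3_city_id")
  }
}
```

Because both joins are left joins, every row of the big table is kept and only a flag column is added, so further validations can be chained in the same way, one flag column per rule. Since the lookup tables are at most about 10 MB, they are around Spark's default `spark.sql.autoBroadcastJoinThreshold` of 10 MB; the explicit `broadcast` hint makes the intent clear regardless of that setting.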