Spark SQL: Why do the execution times of two SQL statements differ so much?
March 17, 2020
spark-sql> explain
         > SELECT 'ADDRESS', 'IDTYPE', a.pid
         >   FROM dmgr.ex_p10ids_address a
         >   LEFT JOIN p10ids_riskcon b
         >     ON (a.pid = b.apid OR a.pid = b.pid)
         >  WHERE a.pt IN ('20200308')
         >    AND b.classcode NOT IN ('26371100', '26371200', '26371300', '13770100', '26376000')
         >    AND a.src_sys = 'APP0001'
         >    AND a.endtime = '99991231999'
         >    AND a.idtype NOT IN ('00')
         >    AND zhengjianleixing(a.idtype, 'P10IDS') <> '0';
== Physical Plan ==
TungstenProject [ADDRESS AS ...,IDTYPE AS ...]
 Union
  SortMergeJoin [pid#459], [apid#475]
   TungstenSort [pid#459 ASC], false, 0
    TungstenExchange hashpartitioning(pid#459)
     ConvertToUnsafe
      Project [pid#459]
       Filter ((((src_sys#471 = APP0001) && (endtime#468 = 99991231999)) && NOT idtype#460 INSET (00)) && NOT (HiveSimpleUDF#com.cpic.udf.dmgr_udf.ZhengjianLeixingSensitive(idtype#460,P10IDS) = 0))
        HiveTableScan [pid#459,src_sys#471,endtime#468,idtype#460], (MetastoreRelation dmgr, ex_p10ids_address, Some(a)), [pt#446 INSET (20200308)], Statistics(10485761, 1522668470)
   TungstenSort [apid#475 ASC], false, 0
    TungstenExchange hashpartitioning(apid#475)
     ConvertToUnsafe
      Project [apid#475,pid#506]
       Filter NOT classcode#501 INSET (26371200,26371100,26376000,26371300,13770100)
        Scan ParquetRelation[hdfs://hacluster/user/hive/warehouse/dmgr.db/p10ids_riskcon](dmgr.p10ids_riskcon)[apid#475,pid#506,classcode#501] Statistics(94039006999, 1584411030)
  Filter NOT (pid#459 = apid#475)
   SortMergeJoin [pid#459], [pid#506]
    TungstenSort [pid#459 ASC], false, 0
     TungstenExchange hashpartitioning(pid#459)
      ConvertToUnsafe
       Project [pid#459]
        Filter ((((src_sys#471 = APP0001) && (endtime#468 = 99991231999)) && NOT idtype#460 INSET (00)) && NOT (HiveSimpleUDF#com.cpic.udf.dmgr_udf.ZhengjianLeixingSensitive(idtype#460,P10IDS) = 0))
         HiveTableScan [pid#459,src_sys#471,endtime#468,idtype#460], (MetastoreRelation dmgr, ex_p10ids_address, Some(a)), [pt#446 INSET (20200308)], Statistics(10485761, 1522668470)
    TungstenSort [pid#506 ASC], false, 0
     TungstenExchange hashpartitioning(pid#506)
      ConvertToUnsafe
       Project [apid#475,pid#506]
        Filter NOT classcode#501 INSET (26371200,26371100,26376000,26371300,13770100)
         Scan ParquetRelation[hdfs://hacluster/user/hive/warehouse/dmgr.db/p10ids_riskcon](dmgr.p10ids_riskcon)[apid#475,pid#506,classcode#501] Statistics(94039006999, 1584411030)

The second SQL statement differs only in that the "AND b.classcode NOT IN" condition is moved into the ON clause.

The first statement ran for 7 minutes, while the second ran for hours, and I don't know why. Thanks in advance for your answers!
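
One thing that may be worth checking, offered as a sketch rather than a diagnosis: moving a predicate on the right-hand table from WHERE into ON changes what a LEFT JOIN returns, so the two statements are not equivalent. With the predicate in WHERE, rows of `a` that have no match in `b` are discarded (a NULL `classcode` never satisfies NOT IN), so the join behaves like an inner join. With the predicate in ON, every row of `a` survives. The plan above shows plain SortMergeJoin nodes under a Union, which appears consistent with the first statement being planned as an inner join; whether that explains the runtime gap is an assumption, since the plan for the second statement is not shown. The toy tables `a` and `b` and their rows below are hypothetical, and SQLite stands in for Spark SQL only to illustrate the row semantics, not Spark's physical planning.

```python
import sqlite3

# Hypothetical miniature of the two tables; the names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (pid TEXT);
    CREATE TABLE b (pid TEXT, apid TEXT, classcode TEXT);
    INSERT INTO a VALUES ('p1'), ('p2'), ('p3');
    -- p1 matches a row with an allowed classcode,
    -- p2 matches only a row with an excluded classcode,
    -- p3 matches nothing in b.
    INSERT INTO b VALUES ('p1', 'x', '11111111'), ('y', 'p2', '26371100');
""")

# Shape of the first statement: predicate on b in the WHERE clause.
where_version = """
    SELECT a.pid
      FROM a
      LEFT JOIN b
        ON (a.pid = b.apid OR a.pid = b.pid)
     WHERE b.classcode NOT IN ('26371100')
     ORDER BY a.pid
"""

# Shape of the second statement: the same predicate moved into the ON clause.
on_version = """
    SELECT a.pid
      FROM a
      LEFT JOIN b
        ON (a.pid = b.apid OR a.pid = b.pid)
       AND b.classcode NOT IN ('26371100')
     ORDER BY a.pid
"""

where_rows = [row[0] for row in conn.execute(where_version)]
on_rows = [row[0] for row in conn.execute(on_version)]

print(where_rows)  # only the pid with an allowed match remains
print(on_rows)     # every pid from a is kept, matched or not
```

If the second statement must keep unmatched rows of `a`, it is a genuine outer join with an OR condition, and comparing its `explain` output against the plan above would show which join operator Spark picks for it.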

...