When I use TOP in a SELECT to pull a sample of rows from a Teradata table, it uses MUCH more spool (and therefore often runs out of spool) than, say, SAMPLE does.
Comparing the EXPLAINs for TOP and SAMPLE, it looks like TOP copies spool tables considerably more. What confuses me, though, is the step it describes as a "STAT FUNCTION" step. Can anyone explain what that step is? The two EXPLAINs are below. The table's primary index is UNIQUE PRIMARY INDEX (Customer_ID). The Teradata version is 16.10.05.03.
EXPLAIN
SELECT TOP 2000
       M.Customer_ID
     , M.customer_type
FROM ESRE.MEAS_CUST_TBL AS M
WHERE M.Customer_ID IS NOT NULL;
1) First, we lock ESRE.M for read on a reserved RowHash to
prevent global deadlock.
2) Next, we lock ESRE.M for read.
3) We do an all-AMPs RETRIEVE step from ESRE.M by way of an
all-rows scan with a condition of ("NOT (ESRE.M.Customer_ID IS
NULL)") into Spool 2 (all_amps), which is built locally on the
AMPs. The size of Spool 2 is estimated with high confidence to be
43,384,684 rows (1,778,772,044 bytes). The estimated time for
this step is 1.28 seconds.
4) We do an all-AMPs STAT FUNCTION step from Spool 2 by way of an
all-rows scan into Spool 5, which is built locally on the AMPs.
The result rows are put into Spool 1 (group_amps), which is built
locally on the AMPs. This step is used to retrieve the TOP 2000
rows. One AMP is randomly selected to retrieve 2000 rows.
If this step retrieves less than 2000 rows, then execute step 5.
The size is estimated with high confidence to be 2,000 rows (
94,000 bytes).
5) We do an all-AMPs STAT FUNCTION step from Spool 2 (Last Use) by
way of an all-rows scan into Spool 5 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 1 (group_amps), which is built locally on the AMPs.
This step is used to retrieve the TOP 2000 rows. The size is
estimated with high confidence to be 2,000 rows (94,000 bytes).
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
EXPLAIN
SELECT M.Customer_ID
     , M.customer_type
FROM ESRE.MEAS_CUST_TBL AS M
WHERE M.Customer_ID IS NOT NULL
SAMPLE 2000;
1) First, we lock ESRE.M for read on a reserved RowHash to
prevent global deadlock.
2) Next, we lock ESRE.M for read.
3) We do an all-AMPs RETRIEVE step from ESRE.M by way of an
all-rows scan with a condition of ("NOT (ESRE.M.Customer_ID IS
NULL)") into Spool 2 (all_amps), which is built locally on the
AMPs. The size of Spool 2 is estimated with high confidence to be
43,384,684 rows (2,039,080,148 bytes). The estimated time for
this step is 1.28 seconds.
4) We do an all-AMPs SAMPLING step from Spool 2 (Last Use) by way of
an all-rows scan into Spool 1 (group_amps), which is built locally
on the AMPs. Samples are specified as a number of rows. The size
of Spool 1 is estimated with high confidence to be 2,000 rows (
94,000 bytes).
5) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
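As a sanity check on the estimates above, the per-row width implied by each spool estimate can be derived by dividing the estimated bytes by the estimated rows. This is a minimal sketch, assuming plain integer arithmetic in a Teradata session. The column aliases are made up for illustration, and the figures are copied from the two EXPLAINs.

```sql
-- Estimated bytes per row = estimated bytes / estimated rows.
-- All operands are taken verbatim from the EXPLAIN output above;
-- the aliases are hypothetical names used only for this illustration.
SELECT
      1778772044 / 43384684 AS top_spool2_bytes_per_row     -- TOP plan, step 3
    , 2039080148 / 43384684 AS sample_spool2_bytes_per_row  -- SAMPLE plan, step 3
    , 94000 / 2000          AS result_spool1_bytes_per_row; -- Spool 1 in both plans
-- Both Spool 2 divisions come out exact: 41 and 47 bytes per row; Spool 1 is 47.
```

Taken at face value, the estimated Spool 2 for TOP is narrower than the one for SAMPLE, so the estimates alone do not account for the extra spool. One visible difference is that the TOP plan keeps Spool 2 until step 5 ("Last Use") while step 4 builds Spool 5 from it, whereas the SAMPLE plan releases Spool 2 in step 4. Whether that is what actually drives the spool usage is only a guess, since the EXPLAIN gives no size for Spool 5.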