Возьмите копию NGrams8k , и вы можете сделать это:
DECLARE @String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE @String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE @String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE @String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
SELECT s.String, Total = COUNT(DISTINCT ng.token)
FROM (VALUES(@String1),(@String2),(@String3),(@String4)) AS s(String)
CROSS APPLY dbo.NGrams8k(s.String,1) AS ng
GROUP BY s.String;
Возвраты:
String Total
-------- -----------
* 1
*A-zz 4
11 1
1A^ 3
ОБНОВЛЕНО Просто быстрое обновление на основе сообщений и комментариев @ Larnu. Я не заметил, что ОП имел дело с Unicode, например, NVARCHAR. Я создал версию NVARCHAR (4000) здесь - аналогично тому, что @Larnu опубликовал выше. Я только что обновил токен возврата, чтобы использовать сопоставление Latin1_General_BIN.
SUBSTRING(@string COLLATE Latin1_General_BIN,CAST(N AS int),@N)
Это возвращает правильный ответ:
DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT ng.token)
FROM dbo.NGramsN4k(@String5,1) AS ng;
Без сопоставления вы можете использовать то, что опубликовал Ларну, и получитьправильный ответ так:
DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT UNICODE(ng.token))
FROM dbo.NGramsN4k(@String5,1) AS ng;
Вот моя обновленная функция NGramsN4K:
ALTER FUNCTION dbo.NGramsN4K
(
@string nvarchar(4000), -- Input string
@N int -- requested token size
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@string). Accepts strings up to 4000 nvarchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGramsN4K(@string,@N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGramsN4K(s.SomeValue,@N) ng;
Parameters:
@string = The input string to split into tokens.
@N = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = nvarchar(4000); a @N-sized character-level N-Gram token
Developer Notes:
1. NGramsN4K is not case sensitive
2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not chose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL);
4. NGramsN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGramsN4K('abcd',1); -- unigrams (@N=1)
SELECT position, token FROM dbo.NGramsN4K('abcd',2); -- bigrams (@N=2)
SELECT position, token FROM dbo.NGramsN4K('abcd',3); -- trigrams (@N=3)
--===== How many times the substring "AB" appears in each record
DECLARE @table TABLE(stringID int identity primary key, string nvarchar(100));
INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM @table t
CROSS APPLY dbo.NGramsN4K(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
------------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20170324 - Initial Development - Alan Burstein
Rev 01 - 20191108 - Added Latin1_General_BIN collation to token output - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1 FROM (VALUES -- 64 dummy values to CROSS join for 4096 rows
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($)) t(N)
),
iTally(N) AS
(
SELECT
TOP (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
FROM L1 a CROSS JOIN L1 b -- cartesian product for 4096 rows (16^2)
)
SELECT
position = N, -- position of the token in the string(s)
token = SUBSTRING(@string COLLATE Latin1_General_BIN,CAST(N AS int),@N) -- the @N-Sized token
FROM iTally
WHERE @N > 0 -- Protection against bad parameter values:
AND @N <= (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)));