This is easy using a word-level N-Grams function (covered here). The code to create the function I'll use to solve your problem is at the end of this post. First, a quick demo of wngrams2012. This code splits your string into 4-grams
(the number of words plus the search term):
Query:
DECLARE
@string VARCHAR(MAX) = 'The quick brown fox jumps over the lazy dog',
@search VARCHAR(100) = 'Brown',
@words INT = 3;
SELECT
ng.ItemNumber,
ng.ItemIndex,
ng.ItemLength,
ng.Item
FROM dbo.wngrams2012(@string, @words+1) AS ng;
Results:
ItemNumber ItemIndex ItemLength Item
----------- ----------- ----------- ----------------------
1 1 20 The quick brown fox
2 5 22 quick brown fox jumps
3 11 21 brown fox jumps over
4 17 19 fox jumps over the
5 21 20 jumps over the lazy
6 27 17 over the lazy dog
Now for your specific problem:
DECLARE
@string VARCHAR(MAX) = 'The quick brown fox jumps over the lazy dog',
@search VARCHAR(100) = 'Brown',
@words INT = 3;
SELECT TOP (1)
ItemLength = ng.ItemLength,
Item = ng.Item
FROM (VALUES(LEN(@string), CHARINDEX(@search,@string))) AS s(Ln,Si)
CROSS APPLY (VALUES(s.Ln-s.Si+1)) AS nsl(Ln)
CROSS APPLY (VALUES(SUBSTRING(@string,s.Si,nsl.Ln))) AS ns(txt)
CROSS APPLY dbo.wngrams2012(ns.txt, @words+1) AS ng
WHERE s.Si > 0
ORDER BY ng.ItemNumber;
Results:
ItemLength Item
------------ ----------------------
21 brown fox jumps over
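To see why this works, it can help to look at the intermediate step on its own. The sketch below runs only the first part of the query: it finds the search term with CHARINDEX and trims the string so that it starts at that term. The aliases s, Si and ns come from the query above; StartIndex and Trimmed are illustrative column names, and the match assumes a case-insensitive collation so that 'Brown' finds 'brown'.

```sql
DECLARE
  @string VARCHAR(MAX) = 'The quick brown fox jumps over the lazy dog',
  @search VARCHAR(100) = 'Brown';

SELECT
  StartIndex = s.Si,   -- position of the search term inside @string
  Trimmed    = ns.txt  -- @string with everything before the search term removed
FROM (VALUES(LEN(@string), CHARINDEX(@search,@string))) AS s(Ln,Si)
CROSS APPLY (VALUES(SUBSTRING(@string, s.Si, s.Ln-s.Si+1))) AS ns(txt)
WHERE s.Si > 0;  -- returns no row when the search term is not found
```

With the sample string this should return 11 and 'brown fox jumps over the lazy dog'. The first (@words+1)-gram of that trimmed string is the search term plus the next @words words, which is why the main query takes TOP (1) ordered by ItemNumber.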
A couple of other examples. 'quick' and 1 returns:
ItemLength Item
------------ --------------
12 quick brown
'fox' and 4 returns:
ItemLength Item
------------ -------------------------
24 fox jumps over the lazy
UPDATE: Against a table
I forgot to include this. Here the search words and the strings are in two separate tables:
DECLARE @sometable TABLE(someid INT IDENTITY, someword VARCHAR(100));
DECLARE @sometable2 TABLE(someid INT IDENTITY, someword VARCHAR(MAX));
INSERT @sometable(someword) VALUES ('brown'),('fox'),('quick'),('zoo');
INSERT @sometable2(someword) VALUES ('The quick brown fox jumps over the lazy dog'),
('The brown lazy dog went to the zoo for a quick visit');
DECLARE @words INT = 4;
SELECT
SearchId = t.someid,
StringId = t2.someid,
Searchstring = t.someword,
Item = f.Item
FROM @sometable AS t
CROSS JOIN @sometable2 AS t2
CROSS APPLY -- OUTER APPLY
(
SELECT TOP (1) ng.Item
FROM (VALUES(LEN(t2.someword), CHARINDEX(t.someword,t2.someword))) AS s(Ln,Si)
CROSS APPLY (VALUES(s.Ln-s.Si+1)) AS nsl(Ln)
CROSS APPLY (VALUES(SUBSTRING(t2.someword,s.Si,nsl.Ln))) AS ns(txt)
CROSS APPLY dbo.wngrams2012(ns.txt, @words+1) AS ng
WHERE s.Si > 0
ORDER BY ng.ItemNumber
) AS f;
Returns:
SearchId StringId Searchstring Item
--------- --------- -------------- ------------------------------
1 1 brown brown fox jumps over the
2 1 fox fox jumps over the lazy
3 1 quick quick brown fox jumps over
1 2 brown brown lazy dog went to
4 2 zoo zoo for a quick visit
Note that using OUTER APPLY instead of CROSS APPLY
will cause the query to also return rows when the search term is not found in the search string.
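As a minimal sketch of that difference, the query below is the same as the one above but with OUTER APPLY and a reduced set of sample rows. It assumes dbo.wngrams2012 (created at the end of this post) already exists and that the database uses a case-insensitive collation.

```sql
-- Assumes dbo.wngrams2012 from this post has already been created.
DECLARE @sometable  TABLE(someid INT IDENTITY, someword VARCHAR(100));
DECLARE @sometable2 TABLE(someid INT IDENTITY, someword VARCHAR(MAX));
INSERT @sometable(someword)  VALUES ('brown'),('fox');
INSERT @sometable2(someword) VALUES ('The brown lazy dog went to the zoo for a quick visit');
DECLARE @words INT = 4;

SELECT
  SearchId     = t.someid,
  StringId     = t2.someid,
  Searchstring = t.someword,
  Item         = f.Item   -- NULL when the search word is not in the string
FROM @sometable AS t
CROSS JOIN @sometable2 AS t2
OUTER APPLY  -- keeps the (search word, string) pair even when the subquery returns no row
(
  SELECT TOP (1) ng.Item
  FROM (VALUES(LEN(t2.someword), CHARINDEX(t.someword,t2.someword))) AS s(Ln,Si)
  CROSS APPLY (VALUES(SUBSTRING(t2.someword, s.Si, s.Ln-s.Si+1))) AS ns(txt)
  CROSS APPLY dbo.wngrams2012(ns.txt, @words+1) AS ng
  WHERE s.Si > 0
  ORDER BY ng.ItemNumber
) AS f;
```

Here 'fox' does not occur in the string, so with OUTER APPLY its row should still come back with a NULL Item, whereas CROSS APPLY would drop it. A NULL Item also appears when the term is found but fewer than @words words follow it.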
Purely set-based, fully parallelizable (multi-threaded), no loops / cursors / slow iteration.
The functions:
CREATE FUNCTION dbo.NGrams2B
(
@string varchar(max),
@N int
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a stream of tokens based on an input
string (@string) up to 2^31-1 bytes (2 GB). For more
information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGrams2B(@string,@N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGrams2B(s.SomeValue,@N) ng;
Parameters:
@string = varchar(max); the input string to split into tokens
@N = int; the size of each token returned
Returns:
Position = bigint; the position of the token in the input string
token = varchar(max); a @N-sized character-level N-Gram token
Developer Notes:
1. Based on NGrams8k but modified to accept varchar(max)
2. NGrams2B is not case sensitive
3. Many functions that use NGrams2B will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
4. Performs about 2-3 times slower than NGrams8k. Only use when you are sure that
NGrams8k will not suffice.
5. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL)
6. NGrams8k can also be used as a tally table with the position column being your "N"
row. To do so use REPLICATE to create an imaginary string, then use NGrams8k to split
it into unigrams then only return the position column. NGrams8k will get you up to
8000 numbers. There will be no performance penalty for sorting by position in
ascending order but there is for sorting in descending order. To get the numbers in
descending order without forcing a sort in the query plan use the following formula:
N = <highest number>-position+1.
Pseudo Tally Table Examples:
--===== (1) Get the numbers 1 to 100000 in ascending order:
SELECT N = position FROM dbo.NGrams2B(REPLICATE(CAST(0 AS varchar(max)),100000),1);
--===== (2) Get the numbers 1 to 100000 in descending order:
DECLARE @maxN bigint = 100000;
SELECT N = @maxN-position+1
FROM dbo.NGrams2B(REPLICATE(CAST(0 AS varchar(max)),@maxN),1)
ORDER BY position;
7. NGrams8k is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGrams2B('abcd',1); -- unigrams (@N=1)
SELECT position, token FROM dbo.NGrams2B('abcd',2); -- bigrams (@N=2)
SELECT position, token FROM dbo.NGrams2B('abcd',3); -- trigrams (@N=3)
---------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20150909 - Initial Development - Alan Burstein
Rev 01 - 20151029 - Added ISNULL logic to the TOP clause for both parameters: @string
and @N. This will prevent a NULL string or NULL @N from causing an
"improper value" to be passed to the TOP clause. - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS
(
SELECT N
FROM (VALUES
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(N)
), --216 values
iTally(N) AS
(
SELECT
TOP (
ABS(CONVERT(BIGINT,
(DATALENGTH(ISNULL(CAST(@string AS varchar(max)),'')) - (ISNULL(@N,1)-1)),0))
)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c CROSS JOIN L1 d
--2,176,782,336 rows: enough to handle varchar(max) -> 2^31-1 bytes
)
SELECT
position = N,
token = SUBSTRING(@string,N,@N)
FROM iTally
WHERE @N > 0 AND @N <= DATALENGTH(CAST(@string AS varchar(max)));
GO
CREATE FUNCTION dbo.wngrams2012(@string varchar(max), @N bigint)
/*****************************************************************************************
Purpose:
wngrams2012 accepts a varchar(max) input string (@string) and splits it into a contiguous
sequence of @N-sized, word-level tokens.
Per Wikipedia (http://en.wikipedia.org/wiki/N-gram) an "n-gram" is defined as:
"a contiguous sequence of n items from a given sequence of text or speech. The items can
be phonemes, syllables, letters, words or base pairs according to the application. "
------------------------------------------------------------------------------------------
Compatibility:
SQL Server 2012+, Azure SQL Database
2012+ because the function uses LEAD
Parameters:
@string = varchar(max); input string to split into n-sized items
@N = bigint; number of items per row
Returns:
itemNumber = bigint; the item's ordinal position inside the input string
itemIndex = bigint; the item's location inside the input string
itemLength = bigint; the length of the item, in characters
item = varchar(max); the @N-sized word-level token
Determinism:
wngrams2012 is deterministic
SELECT ROUTINE_NAME, IS_DETERMINISTIC
FROM information_schema.routines where ROUTINE_NAME = 'wngrams2012';
------------------------------------------------------------------------------------------
Syntax:
--===== Autonomous
SELECT
ng.ItemNumber,
ng.Item
FROM dbo.wngrams2012(@string,@N) ng;
--===== Against another table using APPLY
SELECT
t.someID,
ng.ItemNumber,
ng.Item
FROM dbo.SomeTable t
CROSS APPLY dbo.wngrams2012(t.SomeValue,@N) ng;
-----------------------------------------------------------------------------------------
Usage Examples:
--===== Example #1: Word-level Unigrams:
SELECT
ng.itemNumber,
ng.itemIndex,
ng.item
FROM dbo.wngrams2012('One two three four words', 1) ng;
--Results:
ItemNumber ItemIndex Item
1 1 One
2 5 two
3 9 three
4 15 four
5 20 words
--===== Example #2: Word-level Bi-grams:
SELECT
ng.itemNumber,
ng.itemIndex,
ng.item
FROM dbo.wngrams2012('One two three four words', 2) ng;
--Results:
ItemNumber ItemIndex Item
1 1 One two
2 5 two three
3 9 three four
4 15 four words
--===== Example #3: Only the first two Word-level Bi-grams:
-- Key: TOP(2) does NOT guarantee the correct result without an order by, which will
-- degrade performance; see programmer note #5 below for details about sorting.
SELECT
ng.ItemNumber, ng.ItemIndex, ng.ItemLength, ng.Item
FROM dbo.wngrams2012('One two three four words',2) AS ng
WHERE ng.ItemNumber < 3;
--Results:
ItemNumber ItemIndex ItemLength Item
---------- --------- ----------- ---------------------------------------------------
1 1 8 One two
2 5 10 two three
-----------------------------------------------------------------------------------------
Programmer Notes:
1. This function requires NGrams2B (created above), the varchar(max) version of NGrams8k.
NGrams8k can be found here:
http://www.sqlservercentral.com/articles/Tally+Table/142316/
2. This function could not have been developed without what I learned reading "Reaping
the benefits of the Window functions in T-SQL" by Eirikur Eiriksson
The code looks different but, under the covers, WNGrams2012
is simply a slightly altered rendition of DelimitedSplit8K_LEAD.
3. Requires SQL Server 2012
4. wngrams2012 uses spaces (char(32)) as the delimiter; the text must be pre-formatted
to address line breaks, carriage returns, multiple spaces, etc.
5. Result set order does not matter and therefore no ORDER BY clause is required. The
*observed* default sort order is ItemNumber which means position is also sequential.
That said, *any* ORDER BY clause will cause a sort in the execution plan. If you need
to sort by position (ASC) or itemNumber (ASC), follow these steps to avoid a sort:
A. In the function DDL, replace COALESCE/NULLIF for N1.N with the N. e.g. Replace
"COALESCE(NULLIF(N1.N,0),1)" with "N" (no quotes)
B. Add an ORDER BY position (which is logically identical to ORDER BY itemnumber).
C. This will cause the position of the 1st token to be 0 instead of 1 when position
is included in the final result-set. To correct this, simply use this formula:
"COALESCE(NULLIF(position,0),1)" for "position". Note this example:
SELECT
ng.itemNumber,
itemIndex = COALESCE(NULLIF(ng.itemIndex,0),1),
ng.item
FROM dbo.wngrams2012('One two three four words',2) ng
ORDER BY ng.itemIndex;
-----------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20171116 - Initial creation - Alan Burstein
Rev 01 - 20200206 - Misc updates - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
delim(RN,N) AS -- locate all of the spaces in the string
(
SELECT 0,0 UNION ALL
SELECT ROW_NUMBER() OVER (ORDER BY ng.position), ng.position
FROM dbo.ngrams2b(@string,1) ng
WHERE ng.token = ' '
),
tokens(itemNumber,itemIndex,item,itemLength,itemCount) AS -- Create tokens (e.g. split string)
(
SELECT
N1.RN+1,
N1.N+1, -- change to N then ORDER BY position to avoid a sort
SUBSTRING(v1.s, N1.N+1, LEAD(N1.N,@N,v2.l) OVER (ORDER BY N1.N)-N1.N),
LEAD(N1.N,@N,v2.l) OVER (ORDER BY N1.N)-N1.N,
v2.l-v2.sp-(@N-2)
-- count number of spaces in the string then apply the N-GRAM rows-(@N-1) formula
-- Note: using (@N-2) to compensate for the extra row in the delim cte.
FROM delim N1
CROSS JOIN (VALUES (@string)) v1(s)
CROSS APPLY (VALUES (LEN(v1.s), LEN(REPLACE(v1.s,' ','')))) v2(l,sp)
)
SELECT
ItemNumber = ROW_NUMBER() OVER (ORDER BY (t.itemIndex)),
ItemIndex = t.itemIndex, --ISNULL(NULLIF(t.itemIndex,0),1),
ItemLength = t.itemLength,
Item = t.item
FROM tokens t
WHERE @N > 0 AND t.itemNumber <= t.itemCount; -- startup predicate
GO