First you'll need a good N-Grams function, such as the one described here. Below is an NVARCHAR(4000) version (thanks to Larnu for his contribution). I used NGramsN4K to build an NVARCHAR(4000) PatReplace function. I keep my functions in different schemas, but dbo will work fine.
Note that this:
SELECT pr.NewString
FROM samd.patReplaceN4K(N'ൈൈƐABCƐƐ123ˬˬˬˬXYZˤˤ',N'[^0-9a-zA-Z]',N'') AS pr;
Returns: ABC123XYZ
Every character that matches the pattern [^0-9a-zA-Z], that is, anything other than a digit or an ASCII letter, was removed. Now let's use the function against a table whose records contain bad characters, strip them out, and then join the result to a table of good values. Note my comments.
-- Sample data
DECLARE @Customers TABLE (CustomerId INT IDENTITY, Surname NVARCHAR(100));
DECLARE @GoodValues TABLE (Surname NVARCHAR(100));
INSERT @Customers (Surname) VALUES (CHAR(10)+'Johnny'+CHAR(10)),('Smith'),('Jones'+CHAR(160));
INSERT @GoodValues (Surname) VALUES('Johnny'),('Smith'),('Jones'),('James');
-- Fail:
SELECT c.CustomerId, g.Surname
FROM @Customers AS c
JOIN @GoodValues AS g
ON c.Surname = g.Surname;
-- Success:
SELECT c.CustomerId, g.Surname
FROM @Customers AS c
CROSS APPLY samd.patreplaceN4K(c.Surname,'[^0-9a-zA-Z ]','') AS pr
JOIN @GoodValues AS g
ON pr.newString = g.Surname;
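To see why the first join misses rows, it can help to look at the code point of every character in a suspect value. The following is a minimal, self-contained sketch and not part of the functions in this answer: the variable @s, the CTE n, and the column aliases are illustrative names, and sys.all_columns is used only as a convenient row source.

```sql
-- Hypothetical diagnostic: list each character of a value with its code point.
DECLARE @s NVARCHAR(100) = N'Jones' + NCHAR(160);  -- trailing non-breaking space

WITH n AS
(
    -- Inline numbers table; any row source with enough rows would do.
    SELECT pos = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns
)
SELECT n.pos,
       ch        = SUBSTRING(@s, n.pos, 1),
       codepoint = UNICODE(SUBSTRING(@s, n.pos, 1)),
       -- 1 when the character matches the "bad character" pattern, else 0
       isBad     = PATINDEX(N'[^0-9a-zA-Z ]',
                            SUBSTRING(@s, n.pos, 1) COLLATE Latin1_General_BIN)
FROM n
WHERE n.pos <= LEN(@s + N'x') - 1   -- LEN ignores trailing spaces, hence the sentinel
ORDER BY n.pos;
```

The last row reports code point 160, the non-breaking space, with isBad = 1. That is the same per-character PATINDEX test that patReplaceN4K applies to each unigram.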
samd.NGramsN4K
CREATE FUNCTION samd.NGramsN4K
(
@string NVARCHAR(4000), -- Input string
@N INT -- requested token size
)
/*****************************************************************************************
[Purpose]:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@string). Accepts strings up to 4000 NVARCHAR characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+, Azure SQL Database
[Syntax]:
--===== Autonomous
SELECT ng.position, ng.token
FROM samd.NGramsN4K(@string,@N) AS ng;
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable AS s
CROSS APPLY samd.NGramsN4K(s.SomeValue,@N) AS ng;
[Parameters]:
@string = The input string to split into tokens.
@N = The size of each token returned.
[Returns]:
Position = bigint; the position of the token in the input string
token = NVARCHAR(4000); a @N-sized character-level N-Gram token
[Dependencies]:
1. core.rangeAB (iTVF)
[Developer Notes]:
1. NGramsN4K is not case sensitive
2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @N is NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL);
4. NGramsN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Turn the string, 'ɰɰXɰɰ' into unigrams, bigrams and trigrams
DECLARE @string NVARCHAR(4000) = N'ɰɰXɰɰ';
BEGIN
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,1) AS ng; -- unigrams (@N=1)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,2) AS ng; -- bigrams (@N=2)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,3) AS ng; -- trigrams (@N=3)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,4) AS ng; -- 4-grams (@N=4)
END
--===== 2. Scenarios where the function would not return rows
SELECT ng.Position, ng.Token FROM samd.NGramsN4K('abcd',5) AS ng; -- 5-grams (@N=5)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x', 0) AS ng;
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x', NULL) AS ng;
This will fail:
--SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x',-1) AS ng;
--===== 3. How many times the substring "ƒƓ" appears in each record
BEGIN
DECLARE @table TABLE(stringID int identity primary key, string NVARCHAR(100));
INSERT @table(string)
VALUES (N'ƒƓ123ƒƓ'),(N'123ƒƓƒƓƒƓ'),(N'!ƒƓ!ƒƓ!'),(N'ƒƓ-ƒƓ-ƒƓ-ƒƓ-ƒƓ');
SELECT t.string, Occurrences = COUNT(*)
FROM @table AS t
CROSS APPLY samd.NGramsN4K(t.string,2) AS ng
WHERE ng.token = N'ƒƓ'
GROUP BY t.string;
END;
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20170324 - Initial Development - Alan Burstein
Rev 01 - 20180829 - Changed TOP logic and startup-predicate logic in the WHERE clause
- Alan Burstein
Rev 02 - 20191129 - Redesigned to leverage rangeAB - Alan Burstein
Rev 03 - 20200416 - changed the cast from NCHAR(4000) to NVARCHAR(4000)
- Removed: WHERE @N BETWEEN 1 AND s.Ln; this must now be handled
manually moving forward. - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
Position = r.RN, -- Token Position
Token = CAST(SUBSTRING(@string,r.RN,@N) AS NVARCHAR(4000)) -- @N-Sized Token
FROM (VALUES(DATALENGTH(ISNULL(NULLIF(@string,N''),N'X'))/2)) AS s(Ln)
CROSS APPLY core.rangeAB(1,s.Ln-(ISNULL(@N,1)-1),1,1) AS r
GO
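NGramsN4K depends on core.rangeAB, which is not included in this answer. If you don't have it, a stand-in along the following lines should be enough to run the code here. This is only a sketch and not the author's rangeAB: the parameter names and meanings are assumptions inferred from the call core.rangeAB(1, ..., 1, 1) above, only the RN column that NGramsN4K reads (plus an illustrative N1) is returned, and the range is capped at 10,000 rows.

```sql
-- Hypothetical stand-in for core.rangeAB; NOT the author's implementation.
-- Assumes the core schema already exists (CREATE SCHEMA core; in its own batch).
CREATE FUNCTION core.rangeAB
(
  @Low  BIGINT,  -- assumed: first value of the range
  @High BIGINT,  -- assumed: last value of the range
  @Gap  BIGINT,  -- assumed: step between values, must be >= 1
  @Row1 BIT      -- assumed: 1 = RN starts at 1; accepted for signature
                 -- compatibility only, this sketch always starts RN at 1
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
  L1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS v(N)), -- 10 rows
  L2(N) AS (SELECT 1 FROM L1 AS a CROSS JOIN L1 AS b),                               -- 100 rows
  L3(N) AS (SELECT 1 FROM L2 AS a CROSS JOIN L2 AS b),                               -- 10,000 rows
  iTally(RN) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM L3)
SELECT RN = i.RN,                          -- 1-based row number
       N1 = @Low + (i.RN - 1) * @Gap       -- value at that position in the range
FROM iTally AS i
WHERE i.RN <= (@High - @Low) / @Gap + 1;   -- yields no rows when @High < @Low
GO
```

Filtering with WHERE rather than TOP means a range whose upper bound is below its lower bound, as in NGramsN4K('abcd',5), simply returns no rows instead of raising an error.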
samd.patReplaceN4K
CREATE FUNCTION samd.patReplaceN4K
(
@string NVARCHAR(4000), -- Input String
@pattern NVARCHAR(50), -- Pattern to match/replace
@replace NVARCHAR(20) -- What to replace the matched pattern with
)
/*****************************************************************************************
[Purpose]:
Given a string (@string), a pattern (@pattern), and a replacement character (@replace)
patReplaceN4K will replace any character in @string that matches the @Pattern parameter
with the character, @replace.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Basic Syntax Example
SELECT pr.NewString
FROM samd.patReplaceN4K(@String,@Pattern,@Replace) AS pr;
[Parameters]:
@string = NVARCHAR(4000); The input string to manipulate
@pattern = NVARCHAR(50); The pattern to match/replace
@replace = NVARCHAR(20); What to replace the matched pattern with
[Returns]:
Inline Table Valued Function returns:
NewString = NVARCHAR(4000); The new string with all instances of @Pattern replaced with
the value of @Replace.
[Dependencies]:
samd.NGramsN4K (iTVF)
[Developer Notes]:
1. @Pattern IS case sensitive but can be easily modified to make it case insensitive
2. There is no need to include the "%" before and/or after your pattern since we
are evaluating each character individually
3. Certain special characters, such as "$" and "%" need to be escaped with a "/"
like so: [/$/%]
4. As is the case with functions which leverage samd.ngrams or samd.ngramsN4k,
samd.patReplaceN4K is almost always dramatically faster with a parallel execution
plan. One way to get a parallel query plan (if the optimizer does not choose one) is
to use make_parallel by Adam Machanic found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
On my PC (8 logical CPU, 64GB RAM, SQL 2019) samd.patReplaceN4K is about 4X
faster when executed using all 8 of my logical CPUs.
5. samd.patReplaceN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Remove non alphanumeric characters
SELECT pr.NewString
FROM samd.patReplaceN4K('ൈൈƐABCƐƐ123ˬˬˬˬXYZˤˤ','[^0-9a-zA-Z]','') AS pr;
--===== 2. Replace numeric characters with a "*"
SELECT pr.NewString
FROM samd.patReplaceN4K('My phone number is 555-2211','[0-9]','*') AS pr;
--===== 3. Using against a table
DECLARE @table TABLE(OldString varchar(60));
INSERT @table VALUES ('Call me at 555-222-6666'), ('phone number: (312)555-2323'),
('He can be reached at 444.665.4466 on Monday.');
SELECT t.OldString, pr.NewString
FROM @table AS t
CROSS APPLY samd.patReplaceN4K(t.oldstring,'[0-9]','*') AS pr;
[Revision History]:
-----------------------------------------------------------------------------------------
Rev 01 - 20200422 - Created - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
SELECT CASE WHEN @string = a.Blank THEN a.Blank ELSE
CASE WHEN PATINDEX(@pattern,a.Token)&0x01=0 THEN ng.token ELSE @replace END END
FROM samd.NGramsN4K(@string,1) AS ng
CROSS APPLY (VALUES(CAST('' AS NVARCHAR(4000)),
ng.token COLLATE Latin1_General_BIN)) AS a(Blank,Token)
ORDER BY ng.position
FOR XML PATH(''),TYPE
).value('text()[1]', 'NVARCHAR(4000)');
GO
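Developer note 1 says the pattern is case sensitive and that this is easy to change. The case sensitivity comes from the COLLATE Latin1_General_BIN applied to each token, so one way to get case-insensitive matching, offered here as an assumption rather than the author's stated approach, would be to swap in a case-insensitive collation. The sketch below is self-contained and only compares the two collations with PATINDEX; the column aliases are illustrative.

```sql
-- Compare how the same pattern matches single characters under a binary
-- collation and under a case-insensitive collation.
SELECT t.Token,
       BinMatch = PATINDEX(N'[a-z]', t.Token COLLATE Latin1_General_BIN),   -- code-point range
       CiMatch  = PATINDEX(N'[a-z]', t.Token COLLATE Latin1_General_CI_AS)  -- dictionary range
FROM (VALUES (N'a'),(N'A'),(N'Z'),(N'5')) AS t(Token);
```

Under the binary collation only 'a' matches [a-z], while under Latin1_General_CI_AS 'a', 'A' and 'Z' all match. Be aware that ranges under non-binary collations follow dictionary order, so accented letters may also fall inside [a-z].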