Question

Почему

При сравнении и дедупликации между двумя списками кодеры не всегда находят наиболее эффективную реализацию во время выполнения, испытывая давление времени. Два вложенных цикла for - это общее решение goto для многих кодеров. Можно попробовать CROSS JOIN с LINQ, но это явно неэффективно. Для этого кодерам необходим запоминающийся и эффективный код, который также относительно эффективен во время выполнения.

Этот вопрос был создан после просмотра более конкретного: Удалить дубликаты в одном наборе данных относительно другого в C # - он более специализирован с использованием наборов данных. Термин «набор данных» не поможет людям в будущем. Других обобщенных вопросов не найдено.

Что

Я использовал термин Список / Коллекция, чтобы помочь с более общей проблемой кодирования.

var setToDeduplicate = new List<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M 

var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M

var deduplicatedSet = deduplicationFunction(setToDeduplicate, referenceSet);

При реализации функции deduplicationFunction входные данные и выходные данные должны быть чистыми. Вывод может быть IEnumerable. Ожидаемый результат в этом входном примере будет четными числами от 1-1M {2,4,6,8, ...}

Примечание. В эталонном наборе могут быть дубликаты. Значения в обоих наборах являются только ориентировочными, поэтому я не ищу математическое решение - это также должно работать для ввода случайных чисел в обоих наборах.

Если к этому приблизиться с помощью простых функций LINQ, оно будет слишком медленным O (1M * 0,5M). Для таких больших наборов должен быть более быстрый подход.

Скорость важна, но постепенные улучшения с большим количеством кода будут иметь меньшее значение. Кроме того, в идеале это будет работать для других типов данных, включая объекты модели данных, но ответа на этот конкретный вопрос должно быть достаточно. Другие типы данных просто потребуют дополнительной предварительной обработки или небольшого изменения в ответе.

Краткое изложение решения

Вот код теста, результаты которого приведены ниже:

using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Preparing...");

            List<int> set1 = new List<int>();
            List<int> set2 = new List<int>();

            Random r = new Random();
            var max = 10000;

            for (int i = 0; i < max; i++)
            {
                set1.Add(r.Next(0, max));
                set2.Add(r.Next(0, max/2) * 2);
            }

            Console.WriteLine("First run...");

            Stopwatch sw = new Stopwatch();
            IEnumerable<int> result;
            int count;

            while (true)
            {
                sw.Start();
                result = deduplicationFunction(set1, set2);
                var results1 = result.ToList();
                count = results1.Count;
                sw.Stop();
                Console.WriteLine("Dictionary and Where - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                sw.Start();
                result = deduplicationFunction2(set1, set2);
                var results2 = result.ToList();
                count = results2.Count;
                sw.Stop();
                Console.WriteLine("  HashSet ExceptWith - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction3(set1, set2);
                var results3 = result.ToList();
                count = results3.Count;
                sw.Stop();
                Console.WriteLine("     Sort Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction4(set1, set2);
                var results4 = result.ToList();
                count = results3.Count;
                sw.Stop();
                Console.WriteLine("Presorted Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                set2.RemoveAt(set2.Count - 1); //Remove the last item, because it was added in the 3rd test

                sw.Start();
                result = deduplicationFunction5(set1, set2);
                var results5 = result.ToList();
                count = results5.Count;
                sw.Stop();
                Console.WriteLine("        Nested Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                Console.ReadLine();

                Console.WriteLine("");
                Console.WriteLine("Next Run");
                Console.WriteLine("");
            }

        }


        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var ReferenceHashSet = Reference
                                .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                                .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer

            int throwAway;
            return Set.Distinct().Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
        }

        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction2(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var SetAsHash = new HashSet<int>();

            Set.ForEach(x =>
            {
                if (SetAsHash.Contains(x))
                    return;

                SetAsHash.Add(x);
            }); // .Net 4.7.2 - ToHashSet will reduce this code to a single line.

            SetAsHash.ExceptWith(Reference); // This is ultimately what we're testing

            return SetAsHash.AsEnumerable();
        }

        static IEnumerable<int> deduplicationFunction3(List<int> Set, List<int> Reference)
        {
            Set.Sort();
            Reference.Sort();
            Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.

            return deduplicationFunction4(Set, Reference);
        }

        static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
        {
            int i1 = 0;
            int i2 = 0;
            int thisValue = Set[i1];
            int thisReference = Reference[i2];
            while (true)
            {
                var difference = thisReference - thisValue;

                if (difference < 0)
                {
                    i2++; //Compare side is too low, there might be an equal value to be found
                    if (i2 == Reference.Count)
                        break;
                    thisReference = Reference[i2];
                    continue;
                }

                if (difference > 0) //Duplicate
                    yield return thisValue;

                GoFurther:
                i1++;
                if (i1 == Set.Count)
                    break;
                if (Set[i1] == thisValue) //Eliminates duplicates
                    goto GoFurther; //I rarely use goto statements, but this is a good situation

                thisValue = Set[i1];
            }
        }

        static IEnumerable<int> deduplicationFunction5(List<int> Set, List<int> Reference)
        {
            var found = false;
            var lastValue = 0;
            var thisValue = 0;
            for (int i = 0; i < Set.Count; i++)
            {
                thisValue = Set[i];

                if (thisValue == lastValue)
                    continue;

                lastValue = thisValue;

                found = false;
                for (int x = 0; x < Reference.Count; x++)
                {
                    if (thisValue != Reference[x])
                        continue;

                    found = true;
                    break;
                }

                if (found)
                    continue;

                yield return thisValue;
            }
        }
    }
}

Я буду использовать это для сравнения производительности нескольких подходов. (На данном этапе мне особенно интересен подход Hash-подхода против подхода двойного индекса-на-сортировке, хотя ExceptWith позволяет получить краткое решение)

Результаты на 10 000 предметов в наборе (Good Run):

Первый запуск

Словарь и где - Количество: 3565, Миллисекунды: 16.38.
HashSet ExceptWith - Количество: 3565, Миллисекунды: 5,33.
Sort Dual Index - Количество: 3565, Миллисекунды: 6,34.
предварительно отсортированный двойной индекс - счетчик: 3565, миллисекунды: 1,14.
Вложенный индекс - Количество: 3565, Миллисекунды: 964.16.

Good Run

Словарь и где - Количество: 3565, Миллисекунды: 1,21.
HashSet ExceptWith - Количество: 3565, Миллисекунды: 0,94.
Sort Dual Index - Количество: 3565, Миллисекунды: 1.09.
Предварительно отсортированный двойной индекс - Количество: 3565, Миллисекунды: 0,76.
Вложенный индекс - Количество: 3565, Миллисекунды: 628.60.

Выбранный ответ:

@ backs Подход HashSet.ExceptWith - незначительно быстрее с минимальным кодом, использует интересную функцию ExceptWith, однако она ослаблена из-за отсутствия универсальности, и тот факт, что интересная функция менее известна.
Один из моих ответов: HashSet> Where (.. Contains ..) - только чуть-чуть медленнее, чем @backs, но использует шаблон кода, использующий LINQ, и очень универсален за пределами списков первичных элементов. Я полагаю, что это более распространенный сценарий, с которым я сталкиваюсь при кодировании, и верю, что это имеет место для многих других кодеров.
Особая благодарность @TheGeneral за сравнительный анализ некоторых ответов, а также некоторых интересных unsafe версий, а также за то, что они помогли сделать ответ @Backs более эффективным для последующего теста.

Todd · Answer 1 · 25 июня 2018

Одноконтурный Dual-Index

Как рекомендовано @PepitoSh в комментариях к Вопросу:

Я думаю, что HashSet - это очень общее решение для довольноконкретная проблема.Если ваши списки упорядочены, сканирование их параллельно и сравнение текущих элементов является самым быстрым

Это очень отличается от наличия двух вложенных циклов.Вместо этого существует один общий цикл, и индексы увеличиваются параллельно по возрастанию, в зависимости от разницы относительных значений.Разница в основном заключается в выводе любой обычной функции сравнения: {отрицательный, 0, положительный}

static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
    int i1 = 0;
    int i2 = 0;
    int thisValue = Set[i1];
    int thisReference = Reference[i2];
    while (true)
    {
        var difference = thisReference - thisValue;

        if (difference < 0)
        {
            i2++; //Compare side is too low, there might be an equal value to be found
            if (i2 == Reference.Count)
                break;
            thisReference = Reference[i2];
            continue;
        }

        if (difference > 0) //Duplicate
            yield return thisValue;

        GoFurther:
        i1++;
        if (i1 == Set.Count)
            break;
        if (Set[i1] == thisValue) //Eliminates duplicates
            goto GoFurther; //I rarely use goto statements, but this is a good situation

        thisValue = Set[i1];
    }
}

Как вызывать эту функцию, если списки еще не отсортированы:

Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.

return deduplicationFunction4(Set, Reference);

Это дало мне лучшую производительность в моем тестировании.Вероятно, это также может быть использовано с небезопасным кодом для ускорения в некоторых сценариях.В тех случаях, когда данные уже отсортированы, это, безусловно, лучший вариант.Также может быть выбран более быстрый алгоритм сортировки, но он не является предметом этого вопроса.

Примечание: этот метод дедуплицирует по ходу дела.

Я фактически кодировал такой шаблон с одной петлей раньше, когда завершалрезультаты текстового поиска, за исключением того, что у меня было N массивов для проверки на "близость".Итак, у меня был массив индексов - array[index[i]].Так что я уверен, что наличие единственного цикла с контролируемым приращением индекса не является новой концепцией, но это, безусловно, отличное решение здесь.

Backs · Answer 2 · 25 июня 2018

Используйте HashSet для вашего начального списка и ExceptWith метод для получения результата урегулирования:

var setToDeduplicate = new HashSet<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M 

var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M

setToDeduplicate.ExceptWith(referenceSet);

TheGeneral · Answer 3 · 25 июня 2018

Вот еще несколько, в основном, я хотел протестировать как разные, так и не отличные входные данные с различными решениями. В не отличной версии я должен был вызывать отличную, где это необходимо, в конечном результате.

Mode             : Release (64Bit)
Test Framework   : .NET Framework 4.7.1

Operating System : Microsoft Windows 10 Pro
Version          : 10.0.17134

CPU Name         : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Description      : Intel64 Family 6 Model 58 Stepping 9

Cores (Threads)  : 4 (8)      : Architecture  : x64
Clock Speed      : 3901 MHz   : Bus Speed     : 100 MHz
L2Cache          : 1 MB       : L3Cache       : 8 MB

Benchmarks Runs : Inputs (1) * Scales (5) * Benchmarks (6) * Runs (100) = 3,000

Результаты Отдельный ввод

--- Random Set 1 ---------------------------------------------------------------------
| Value         |   Average |   Fastest |      Cycles |    Garbage | Test |     Gain |
--- Scale 100 --------------------------------------------------------- Time 0.334 ---
| Backs         |  0.008 ms |  0.007 ms |      31,362 |   8.000 KB | Pass |  68.34 % |
| ListUnsafe    |  0.009 ms |  0.008 ms |      35,487 |   8.000 KB | Pass |  63.45 % |
| HasSet        |  0.012 ms |  0.011 ms |      46,840 |   8.000 KB | Pass |  50.03 % |
| ArrayUnsafe   |  0.013 ms |  0.011 ms |      49,388 |   8.000 KB | Pass |  47.75 % |
| HashSetUnsafe |  0.018 ms |  0.013 ms |      66,866 |  16.000 KB | Pass |  26.62 % |
| Todd          |  0.024 ms |  0.019 ms |      90,763 |  16.000 KB | Base |   0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.377 ---
| Backs         |  0.070 ms |  0.060 ms |     249,374 |  28.977 KB | Pass |  57.56 % |
| ListUnsafe    |  0.078 ms |  0.067 ms |     277,080 |  28.977 KB | Pass |  52.67 % |
| HasSet        |  0.093 ms |  0.083 ms |     329,686 |  28.977 KB | Pass |  43.61 % |
| ArrayUnsafe   |  0.096 ms |  0.082 ms |     340,154 |  36.977 KB | Pass |  41.72 % |
| HashSetUnsafe |  0.103 ms |  0.085 ms |     367,681 |  55.797 KB | Pass |  37.07 % |
| Todd          |  0.164 ms |  0.151 ms |     578,933 | 112.664 KB | Base |   0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.965 ---
| ListUnsafe    |  0.706 ms |  0.611 ms |   2,467,327 | 258.516 KB | Pass |  48.60 % |
| Backs         |  0.758 ms |  0.654 ms |   2,656,610 | 180.297 KB | Pass |  44.81 % |
| ArrayUnsafe   |  0.783 ms |  0.696 ms |   2,739,156 | 276.281 KB | Pass |  43.02 % |
| HasSet        |  0.859 ms |  0.752 ms |   2,999,230 | 198.063 KB | Pass |  37.47 % |
| HashSetUnsafe |  0.864 ms |  0.783 ms |   3,029,086 | 332.273 KB | Pass |  37.07 % |
| Todd          |  1.373 ms |  1.251 ms |   4,795,929 | 604.742 KB | Base |   0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 5.535 ---
| ListUnsafe    |  5.624 ms |  4.874 ms |  19,658,154 |   2.926 MB | Pass |  40.36 % |
| HasSet        |  7.574 ms |  6.548 ms |  26,446,193 |   2.820 MB | Pass |  19.68 % |
| Backs         |  7.585 ms |  5.634 ms |  26,303,794 |   2.009 MB | Pass |  19.57 % |
| ArrayUnsafe   |  8.287 ms |  6.219 ms |  28,923,797 |   3.583 MB | Pass |  12.12 % |
| Todd          |  9.430 ms |  7.326 ms |  32,880,985 |   2.144 MB | Base |   0.00 % |
| HashSetUnsafe |  9.601 ms |  7.859 ms |  32,845,228 |   5.197 MB | Pass |  -1.81 % |
--- Scale 1,000,000 -------------------------------------------------- Time 47.652 ---
| ListUnsafe    | 57.751 ms | 44.734 ms | 201,477,028 |  29.309 MB | Pass |  22.14 % |
| Backs         | 65.567 ms | 49.023 ms | 228,772,283 |  21.526 MB | Pass |  11.61 % |
| HasSet        | 73.163 ms | 56.799 ms | 254,703,994 |  25.904 MB | Pass |   1.36 % |
| Todd          | 74.175 ms | 53.739 ms | 258,760,390 |   9.144 MB | Base |   0.00 % |
| ArrayUnsafe   | 86.530 ms | 67.803 ms | 300,374,535 |  13.755 MB | Pass | -16.66 % |
| HashSetUnsafe | 97.140 ms | 77.844 ms | 337,639,426 |  39.527 MB | Pass | -30.96 % |
--------------------------------------------------------------------------------------

Результаты Случайный список с использованием Distinct для результатов, где это необходимо

--- Random Set 1 ---------------------------------------------------------------------
| Value         |    Average |   Fastest |      Cycles |    Garbage | Test |    Gain |
--- Scale 100 --------------------------------------------------------- Time 0.272 ---
| Backs         |   0.007 ms |  0.006 ms |      28,449 |   8.000 KB | Pass | 72.96 % |
| HasSet        |   0.010 ms |  0.009 ms |      38,222 |   8.000 KB | Pass | 62.05 % |
| HashSetUnsafe |   0.014 ms |  0.010 ms |      51,816 |  16.000 KB | Pass | 47.52 % |
| ListUnsafe    |   0.017 ms |  0.014 ms |      64,333 |  16.000 KB | Pass | 33.84 % |
| ArrayUnsafe   |   0.020 ms |  0.015 ms |      72,468 |  16.000 KB | Pass | 24.70 % |
| Todd          |   0.026 ms |  0.021 ms |      95,500 |  24.000 KB | Base |  0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.361 ---
| Backs         |   0.061 ms |  0.053 ms |     219,141 |  28.977 KB | Pass | 70.46 % |
| HasSet        |   0.092 ms |  0.080 ms |     325,353 |  28.977 KB | Pass | 55.78 % |
| HashSetUnsafe |   0.093 ms |  0.079 ms |     331,390 |  55.797 KB | Pass | 55.03 % |
| ListUnsafe    |   0.122 ms |  0.101 ms |     432,029 |  73.016 KB | Pass | 41.19 % |
| ArrayUnsafe   |   0.133 ms |  0.113 ms |     469,560 |  73.016 KB | Pass | 35.88 % |
| Todd          |   0.208 ms |  0.173 ms |     730,661 | 148.703 KB | Base |  0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.870 ---
| Backs         |   0.620 ms |  0.579 ms |   2,174,415 | 180.188 KB | Pass | 55.31 % |
| HasSet        |   0.696 ms |  0.635 ms |   2,440,300 | 198.063 KB | Pass | 49.87 % |
| HashSetUnsafe |   0.731 ms |  0.679 ms |   2,563,125 | 332.164 KB | Pass | 47.32 % |
| ListUnsafe    |   0.804 ms |  0.761 ms |   2,818,293 | 400.492 KB | Pass | 42.11 % |
| ArrayUnsafe   |   0.810 ms |  0.751 ms |   2,838,680 | 400.492 KB | Pass | 41.68 % |
| Todd          |   1.388 ms |  1.271 ms |   4,863,651 | 736.953 KB | Base |  0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 6.616 ---
| Backs         |   5.604 ms |  4.710 ms |  19,600,934 |   2.009 MB | Pass | 62.92 % |
| HasSet        |   6.607 ms |  5.847 ms |  23,093,963 |   2.820 MB | Pass | 56.29 % |
| HashSetUnsafe |   8.565 ms |  7.465 ms |  29,239,067 |   5.197 MB | Pass | 43.34 % |
| ListUnsafe    |  11.447 ms |  9.543 ms |  39,452,865 |   5.101 MB | Pass | 24.28 % |
| ArrayUnsafe   |  11.517 ms |  9.841 ms |  39,731,502 |   5.483 MB | Pass | 23.81 % |
| Todd          |  15.116 ms | 11.369 ms |  51,963,309 |   3.427 MB | Base |  0.00 % |
--- Scale 1,000,000 -------------------------------------------------- Time 55.310 ---
| Backs         |  53.766 ms | 44.321 ms | 187,905,335 |  21.526 MB | Pass | 51.32 % |
| HasSet        |  60.759 ms | 50.742 ms | 212,409,649 |  25.904 MB | Pass | 44.99 % |
| HashSetUnsafe |  79.248 ms | 67.130 ms | 275,455,545 |  39.527 MB | Pass | 28.25 % |
| ListUnsafe    | 106.527 ms | 90.159 ms | 370,838,650 |  39.153 MB | Pass |  3.55 % |
| Todd          | 110.444 ms | 93.225 ms | 384,636,081 |  22.676 MB | Base |  0.00 % |
| ArrayUnsafe   | 114.548 ms | 98.033 ms | 398,219,513 |  38.974 MB | Pass | -3.72 % |
--------------------------------------------------------------------------------------

Данные

private Tuple<List<int>, List<int>> GenerateData(int scale)
{
   return new Tuple<List<int>, List<int>>(
      Enumerable.Range(0, scale)
                .Select(x => x)
                .ToList(),
      Enumerable.Range(0, scale)
                .Select(x => Rand.Next(10000))
                .ToList());
}

код

public class Backs : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var hashSet = new HashSet<int>(Input.Item1); 
      hashSet.ExceptWith(Input.Item2); 
      return hashSet.ToList(); 
   }
}

public class HasSet : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{

   protected override List<int> InternalRun()
   {
      var hashSet = new HashSet<int>(Input.Item2); 

      return Input.Item1.Where(y => !hashSet.Contains(y)).ToList(); 
   }
}

public class Todd : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var referenceHashSet = Input.Item2.Distinct()                 
                                      .ToDictionary(x => x, x => x);

      return Input.Item1.Where(y => !referenceHashSet.TryGetValue(y, out _)).ToList();
   }
}

public unsafe class HashSetUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new HashSet<int>();
      fixed (int* pAry = Input.Item1.ToArray())
      {
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               result.Add(*p);
         }
      }
      return result.ToList(); 
   }
}
public unsafe class ListUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new List<int>(Input.Item2.Count);

      fixed (int* pAry = Input.Item1.ToArray())
      {
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               result.Add(*p);
         }
      }
      return result.ToList(); 
   }
}

public unsafe class ArrayUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new int[Input.Item1.Count];

      fixed (int* pAry = Input.Item1.ToArray(), pRes = result)
      {
         var j = 0;
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               *(pRes+j++) = *p;
         }
         return result.Take(j).ToList(); 
      }

   }
}

Резюме

Здесь нет ничего удивительного, если у вас есть отдельный список, который лучше начать с некоторых решений, если не самая простая версия с хэш-сетами, то лучше

Todd · Answer 4 · 25 июня 2018

HashSet и где

Вы должны использовать HashSet (или Словарь) для скорости:

//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = Reference
                        .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                        .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer

    int throwAway;
        return Set.Where(y => ReferenceHashSet.TryGetValue(y, out throwAway));
}

Это версия лямбда-выражения. Он использует словарь, который обеспечивает адаптируемость для изменения значения при необходимости. Можно использовать буквальные циклы for и, возможно, получить еще большее прирост производительности, но по сравнению с наличием двух вложенных циклов это уже удивительное улучшение.

Изучив несколько вещей, глядя на другие ответы, вы получите более быструю реализацию:

static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference);
    return Set.Where(y => ReferenceHashSet.Contains(y) == false).Distinct();
}

Важно, что этот подход (хотя и немного медленнее, чем ответ @backs) по-прежнему достаточно универсален, чтобы использовать его для объектов базы данных, И другие типы можно легко использовать в поле проверки дубликатов.

Вот пример того, как код легко настраивается для использования со списком сущностей базы данных Person.

static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
    return Set.Where(y => ReferenceHashSet.Contains(y.ID) == false)
            .GroupBy(p => p.ID).Select(p => p.First()); //The groupby and select should accomplish DistinctBy(..p.ID)
}

С двумя очень большими списками / коллекциями - как эффективно обнаруживать и / или удалять дубликаты

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

Ответы [ 4 ]

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Резюме

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

С двумя очень большими списками / коллекциями - как эффективно обнаруживать и / или удалять дубликаты

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

Ответы [ 4 ]

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Резюме

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Нет похожих вопросов