Я пытаюсь оптимизировать код на C, и кажется, что одна инструкция занимает около 22% времени.
Код скомпилирован с gcc 8.2.0. Флаги -O3 -DNDEBUG -g
и -Wall -Wextra -Weffc++ -pthread -lrt
.
509529.517218 task-clock (msec) # 0.999 CPUs utilized
6,234 context-switches # 0.012 K/sec
10 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,985,640,853,831 cycles # 3.897 GHz (30.76%)
1,897,574,410,921 instructions # 0.96 insn per cycle (38.46%)
229,365,727,020 branches # 450.152 M/sec (38.46%)
13,027,677,754 branch-misses # 5.68% of all branches (38.46%)
604,340,619,317 L1-dcache-loads # 1186.076 M/sec (38.46%)
47,749,307,910 L1-dcache-load-misses # 7.90% of all L1-dcache hits (38.47%)
19,724,956,845 LLC-loads # 38.712 M/sec (30.78%)
3,349,412,068 LLC-load-misses # 16.98% of all LL-cache hits (30.77%)
<not supported> L1-icache-loads
129,878,634 L1-icache-load-misses (30.77%)
604,482,046,140 dTLB-loads # 1186.353 M/sec (30.77%)
4,596,384,416 dTLB-load-misses # 0.76% of all dTLB cache hits (30.77%)
2,493,696 iTLB-loads # 0.005 M/sec (30.77%)
21,356,368 iTLB-load-misses # 856.41% of all iTLB cache hits (30.76%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
509.843595752 seconds time elapsed
507.706093000 seconds user
1.839848000 seconds sys
Усилитель VTune дает мне подсказку для функции: https://pasteboard.co/IagrLaF.png
Инструкция cmpq
, кажется, занимает 22% от всего времени. С другой стороны, другие инструкции занимают незначительное время.
perf
дает мне несколько иную картину, но я думаю, что результаты согласуются:
Percent│ bool mapFound = false;
0.00 │ movb $0x0,0x7(%rsp)
│ goDownBwt():
│ bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
0.00 │ lea 0x20(%rsp),%r12
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.00 │ lea (%rax,%rax,2),%rax
0.00 │ shl $0x3,%rax
0.00 │ mov %rax,0x18(%rsp)
0.01 │ movzwl %dx,%eax
0.00 │ mov %eax,(%rsp)
0.00 │ ↓ jmp d6
│ nop
│ if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
0.30 │ 88: mov (%rax),%rsi
8.38 │ mov 0x10(%rsi),%rcx
0.62 │ test %rcx,%rcx
0.15 │ ↓ je 1b0
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
2.05 │ add 0x18(%rsp),%rcx
│ ++stats->nDownPreprocessed;
0.25 │ addq $0x1,0x18(%rdx)
│ newState->trace = PREPROCESSED;
0.98 │ movb $0x10,0x30(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
43.36 │ mov 0x8(%rcx),%rax
2.61 │ cmp %rax,(%rcx)
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.05 │ mov %rcx,0x20(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
3.47 │ setbe %dl
Функция
inline bool goDownBwt (state_t *previousState, unsigned short nucleotide, state_t *newState) {
++stats->nDown;
if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
++stats->nDownPreprocessed;
newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
newState->trace = PREPROCESSED;
return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
}
bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
newState->interval.k = bwt->L2[nucleotide] + newState->interval.k + 1;
newState->interval.l = bwt->L2[nucleotide] + newState->interval.l;
newState->trace = 0;
return (newState->interval.k <= newState->interval.l);
}
state_t
определяется как
struct state_t {
union {
bwtinterval_t interval;
preprocessedInterval_t *preprocessedInterval;
};
unsigned char trace;
struct state_t *previousState;
};
preprocessedInterval_t
- это:
struct preprocessedInterval_t {
bwtinterval_t interval;
preprocessedInterval_t *firstChild;
};
Есть несколько (~ 1000) state_t
структур. Однако есть много (350k) preprocessedInterval_t
объектов, размещенных где-то еще.
Первое if
истинно в 15 миллиардов раз за 19 миллиардов.
Поиск ошибочно предсказанных ветвей с помощью perf record -e branches,branch-misses mytool
в функции дает мне:
Available samples
2M branches
1M branch-misses
Могу ли я предположить, что неправильное прогнозирование ветвей ответственно за это замедление?
Каким будет следующий шаг по оптимизации моего кода?
Код доступен на GitHub
Редактировать 1
valgrind --tool=cachegrind
дает мне:
I refs: 1,893,716,274,393
I1 misses: 4,702,494
LLi misses: 137,142
I1 miss rate: 0.00%
LLi miss rate: 0.00%
D refs: 756,774,557,235 (602,597,601,611 rd + 154,176,955,624 wr)
D1 misses: 39,489,866,187 ( 33,583,272,379 rd + 5,906,593,808 wr)
LLd misses: 3,483,920,786 ( 3,379,118,877 rd + 104,801,909 wr)
D1 miss rate: 5.2% ( 5.6% + 3.8% )
LLd miss rate: 0.5% ( 0.6% + 0.1% )
LL refs: 39,494,568,681 ( 33,587,974,873 rd + 5,906,593,808 wr)
LL misses: 3,484,057,928 ( 3,379,256,019 rd + 104,801,909 wr)
LL miss rate: 0.1% ( 0.1% + 0.1% )
Редактировать 2
Я скомпилировал с -O3 -DNDEBUG -march=native -fprofile-use
и использовал команду perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss ./a.out
508322.348021 task-clock (msec) # 0.998 CPUs utilized
21,592 context-switches # 0.042 K/sec
33 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,978,382,746,597 cycles # 3.892 GHz (44.44%)
228,898,532,311 branches # 450.302 M/sec (44.45%)
12,816,920,039 branch-misses # 5.60% of all branches (44.45%)
1,867,947,557,739 instructions # 0.94 insn per cycle (55.56%)
2,957,085,686,275 uops_issued.any # 5817.343 M/sec (55.56%)
2,864,257,274,102 uops_executed.thread # 5634.726 M/sec (55.56%)
2,490,571,629 mem_load_uops_retired.l3_miss # 4.900 M/sec (55.55%)
12,482,683,638 mem_load_uops_retired.l2_miss # 24.557 M/sec (55.55%)
18,634,558,602 mem_load_uops_retired.l1_miss # 36.659 M/sec (44.44%)
509.210162391 seconds time elapsed
506.213075000 seconds user
2.147749000 seconds sys
Редактировать 3
Я выбрал результаты perf record -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss a.out
, в которых упоминалась моя функция:
Samples: 2M of event 'task-clock', Event count (approx.): 517526250000
Overhead Command Shared Object Symbol
49.76% srnaMapper srnaMapper [.] mapWithoutError
Samples: 917K of event 'cycles', Event count (approx.): 891499601652
Overhead Command Shared Object Symbol
49.36% srnaMapper srnaMapper [.] mapWithoutError
Samples: 911K of event 'branches', Event count (approx.): 101918042567
Overhead Command Shared Object Symbol
43.01% srnaMapper srnaMapper [.] mapWithoutError
Samples: 877K of event 'branch-misses', Event count (approx.): 5689088740
Overhead Command Shared Object Symbol
50.32% srnaMapper srnaMapper [.] mapWithoutError
Samples: 1M of event 'instructions', Event count (approx.): 1036429973874
Overhead Command Shared Object Symbol
34.85% srnaMapper srnaMapper [.] mapWithoutError
Samples: 824K of event 'uops_issued.any', Event count (approx.): 1649042473560
Overhead Command Shared Object Symbol
42.19% srnaMapper srnaMapper [.] mapWithoutError
Samples: 802K of event 'uops_executed.thread', Event count (approx.): 1604052406075
Overhead Command Shared Object Symbol
48.14% srnaMapper srnaMapper [.] mapWithoutError
Samples: 13K of event 'mem_load_uops_retired.l3_miss', Event count (approx.): 1350194507
Overhead Command Shared Object Symbol
33.24% srnaMapper srnaMapper [.] addState
31.00% srnaMapper srnaMapper [.] mapWithoutError
Samples: 142K of event 'mem_load_uops_retired.l2_miss', Event count (approx.): 7143448989
Overhead Command Shared Object Symbol
40.79% srnaMapper srnaMapper [.] mapWithoutError
Samples: 84K of event 'mem_load_uops_retired.l1_miss', Event count (approx.): 8451553539
Overhead Command Shared Object Symbol
39.11% srnaMapper srnaMapper [.] mapWithoutError
(с использованием perf record --period 10000
триггеров Workload failed: No such file or directory
.)