A review of [[likely]] and [[unlikely]]
tldr;
[[(un)likely]] may change how the compiler lays out basic blocks, but likely does not help the branch prediction unit. If you just want faster programs, try profile-guided optimisation; if you want to change the layout of a specific section of code, consider [[(un)likely]], but its impact is not consistent across implementations. Always check the assembly, always benchmark.
- Background: Code Layout
- How [[(un)likely]] affects branch probabilities
- Applications
- [[(un)likely]] vs PGO
- Verifiable branch annotation with PGO
- Conclusion
C++20 introduced the [[likely]] and [[unlikely]] attributes. A quick search of these terms returns numerous articles warning that usage of these attributes is “perilous” 1; that they should not be used at all 2; and that using them is HGO: “hunch guided optimisation” 3. And yet, the cppreference page shows that a function using [[likely]] and [[unlikely]] is faster than an otherwise identical function without them 4. In this post I’ll attempt to break down how these attributes affect the layout of the emitted assembly, and hopefully give a better idea of when (if ever) they may be useful.
Whilst the aim of using [[(un)likely]] is to write better performing code, reasoning about performance differences between code which differs only in layout is largely an examination of cache performance. Since cache behaviour is extremely volatile between systems, I won’t be providing any benchmarks, and encourage you to do your own.
I initially assumed that these attributes would somehow leave hints to the branch prediction unit (BPU) in the CPU to aid in predicting whether a conditional branch was taken or untaken. However, this won’t be the focus of this post. Whether branch prediction hints are supported and emitted is completely dependent on the processor generation 5 and compiler version 6. I recommend resources such as Agner Fog’s manuals 5 for great references for your microarchitecture, and checking the machine code generated by your compiler for hints. For quick context, between Intel Pentium 4 and Redwood Cove architectures, branch prediction hints were largely ignored 5 7.
Background: Code Layout
Wherever a compiler sees a conditional branch, a decision must be made as to whether the (spatially) subsequent instructions should be the “taken” path (and have a “jump” instruction to code which contains the “untaken” path) or the “untaken” path (and have “jump” to the “taken” path). For a toy example, should the following C code;
int func(int x, int y) {
if (x == 10) {
return -1;
} else {
int b = y + 20;
return b;
}
}
be compiled into assembly looking something like;
func(int, int):
push {r11, lr}
cmp r0, #10
bne .ELSE
mov r0, #-1
b .RET
.ELSE:
add r0, r1, #20
.RET:
pop {r11, pc}
where the condition x == 10 being false results in the conditional branch being taken? Or instead, something like;
func(int, int):
push {r11, lr}
cmp r0, #10
beq .IF
add r0, r1, #20
b .RET
.IF:
mov r0, #-1
.RET:
pop {r11, pc}
where the condition x == 10 being true means the conditional branch is taken? Or maybe something else, like a conditional move?
In general, compilers aim to maximise the number of fallthrough branches, meaning the compiler would like to place conditional jump instructions so they are not taken most of the time. This is often referred to as maximising the straight-line sequence, because a contiguous (“straight line”) block of instructions is executed, rather than JMPing around the instruction address space. Such straight-line sequences are preferred for a few reasons. For one, they reduce the instruction cache’s working-set size 8. Fetching a cache line brings multiple contiguous instructions into the cache, so if the next instruction to be executed is contiguous in memory with the previous one, it is more likely to already be in cache. For straight-line sequences and beyond, check out 9.
But how does a compiler - at compile-time - know which condition is more likely at runtime, so that it may order blocks accordingly? We will look at how both gcc and clang decide which branch to make the fallthrough; both compilers use the code contained within the conditional branch to decide which path is most likely to be executed.
A good place to start with gcc is their predict.def 10. This file defines a number of “predictors” which attribute a particular code pattern to a probability. For example, gcc considers the return value of malloc to almost always be non-null; considers branches which cause an early-return unlikely; and considers branches which return constants also unlikely. To see how gcc uses these “predictions”, we can look at the “profile tree” 11 which is outputted when compiling with the -fdump-tree-profile_estimate flag. The outputted file contains the GIMPLE representation of the source code, along with annotations using static heuristics and profile data (if present). For now, we are only concerned with the static heuristics.
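As a small sketch of patterns predict.def attributes probabilities to (the function names and constants here are my own, invented for illustration, not taken from gcc):

```cpp
#include <cstdlib>

// Hypothetical example: both branch conditions below match entries in
// predict.def, so gcc assigns them probabilities with no profile data.
int first_element_or(const int* data, int fallback) {
    if (data == nullptr) {
        // Early return of a constant: predicted unlikely.
        return fallback;
    }
    return data[0];
}

int demo() {
    int* p = static_cast<int*>(std::malloc(sizeof(int)));
    // "malloc almost always returns non-null": this branch is predicted
    // not-taken, so the happy path below becomes the fallthrough.
    if (p == nullptr) {
        return -1;
    }
    *p = 42;
    int result = first_element_or(p, 0);
    std::free(p);
    return result;
}
```

Compiling something like this with -fdump-tree-profile_estimate should show the early-return and malloc predictors being applied to these branches.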
For a slightly more interesting example, consider the function below. There is a conditional branch which is encountered some number of times in a loop, and whose taken path contains a call to std::vector::push_back.
auto
noattr(const vector<int>& in) -> vector<int> {
vector<int> out;
for (const int x : in) {
if (x > CONSTANT) {
out.push_back(x);
}
}
return out;
}
Below is an abridged extract of the profile output regarding the noattr function.
;; Function noattr (...)
...
Predictions for bb 3
DS theory heuristics: 33.00%
combined heuristics: 33.00%
call heuristics of edge 3->4: 33.00%
...
<bb 3> [local count: 1073312329]:
...
_2 = *SR.76_30;
x = _2;
if (_2 > 10)
goto <bb 4>; [33.00%]
else
goto <bb 5>; [67.00%]
gcc has kindly annotated the branches with the probability it has assigned them, as well as all the heuristics used to determine those probabilities. In this case it is using the “call” heuristic (where the “call” here is to push_back) to estimate that most of the time, x will not be greater than 10 (x exceeds 10 only 33% of the time, to be precise). Sure enough, in the generated assembly, we have:
cmp r3, #10
ble .L23
i.e. the branch is taken if x is less than or equal to 10: the case that gcc thinks is less likely.
There is no clang equivalent of the profile tree, but we can take a look at some key files to determine how branches are predicted. BranchProbabilityInfo.h 12 gives a nice description of the algorithm, but we turn to BranchProbabilityInfo.cpp 13, and specifically the calculate function on line 1221, to see the closest equivalent to gcc’s predict.def. In this small block, we can see most of clang’s branch prediction logic:
if (calcMetadataWeights(BB))
continue;
if (calcEstimatedHeuristics(BB))
continue;
if (calcPointerHeuristics(BB))
continue;
if (calcZeroHeuristics(BB, TLI))
continue;
if (calcFloatingPointHeuristics(BB))
continue;
Let’s ignore calcMetadataWeights and calcEstimatedHeuristics for now. Notice how only one heuristic/source of probability is considered, and there is an explicit order. For instance, if calcZeroHeuristics returns true (meaning that heuristic was applied), then the heuristics in calcFloatingPointHeuristics are not considered. clang uses a number of ProbabilityTables to store the estimated probabilities of various local events. For example, here is the PointerTable (found at the top of BranchProbabilityInfo.cpp):
static const ProbabilityTable PointerTable{
{ICmpInst::ICMP_NE, {PtrTakenProb, PtrUntakenProb}}, /// p != q -> Likely
{ICmpInst::ICMP_EQ, {PtrUntakenProb, PtrTakenProb}}, /// p == q -> Unlikely
};
which stores {probability evaluates to true, probability evaluates to false} for a given condition. PtrTakenProb and PtrUntakenProb are hardcoded values at the top of this .cpp file.
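As a sketch of the branch shape this table applies to (a function of my own, not one of the post's examples):

```cpp
// clang's PointerTable statically predicts p != q as likely, so with no
// annotation at all, the "distinct" path is laid out as the fallthrough.
const char* classify(const int* p, const int* q) {
    if (p != q) {        // ICMP_NE -> predicted likely
        return "distinct";
    }
    return "aliased";    // ICMP_EQ direction: predicted unlikely/cold
}
```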
So clang thinks that pointers are not likely to be equal. What about calcZeroHeuristics? Interestingly, quite a few special cases are considered; for example, ANDing with a single-bit bitmask;
// If the LHS is the result of AND'ing a value with a single bit bitmask,
// we don't have information about probabilities.
if (Instruction *LHS = dyn_cast<Instruction>(CI->getOperand(0)))
if (LHS->getOpcode() == Instruction::And)
if (ConstantInt *AndRHS = GetConstantInt(LHS->getOperand(1)))
if (AndRHS->getValue().isPowerOf2())
return false;
and return values of library functions;
// Check if the LHS is the return value of a library function
LibFunc Func = LibFunc::NotLibFunc;
if (TLI)
if (CallInst *Call = dyn_cast<CallInst>(CI->getOperand(0)))
if (Function *CalledFn = Call->getCalledFunction())
TLI->getLibFunc(*CalledFn, Func);
are treated individually. Similarly to pointer comparisons, we see from the ICmpWithZeroTable that clang predicts that comparisons with 0 are unlikely to be true, and that checks of whether a value is less than 0 are likewise unlikely. Notice that these rules essentially mirror gcc’s “predict comparisons with a negative constant as unlikely”, found in predict.def.
We’ve seen that both clang and gcc assign probabilities to branches based on some static heuristics. However, these heuristics could be called quite primitive, and the probabilities assigned are static, hardcoded values, with little indication of how they were chosen.
How [[(un)likely]] affects branch probabilities
Consider the following two functions, both of whom are identical to the previously defined noattr, aside from the fact the conditionals in the loop are now annotated with [[likely]] in the first, and [[unlikely]] in the second:
auto
likely(const vector<int>& in) -> vector<int> {
vector<int> out;
for (const int x : in) {
if (x > CONSTANT) [[likely]] {
out.push_back(x);
}
}
return out;
}
auto
unlikely(const vector<int>& in) -> vector<int> {
vector<int> out;
for (const int x : in) {
if (x > CONSTANT) [[unlikely]] {
out.push_back(x);
}
}
return out;
}
Let’s have a look at the equivalent profile tree for the likely function:
;; Function likely (...)
...
Predictions for bb 3
first match heuristics: 90.00%
combined heuristics: 90.00%
call heuristics of edge 3->4 (ignored): 33.00%
hot label heuristics of edge 3->4: 90.00%
...
<bb 3> [local count: 955630224]:
...
_2 = *SR.83_31;
x = _2;
if (_2 > 10)
goto <bb 4>; [90.00%]
else
goto <bb 5>; [10.00%]
It is almost identical to the previous output for noattr, but with two changes: the addition of hot label heuristics of edge 3->4: 90.00%, and (ignored) appended to the previously considered call heuristic.
The profile tree for unlikely shows the same, in the opposite direction:
;; Function unlikely (...)
...
Predictions for bb 3
first match heuristics: 10.00%
combined heuristics: 10.00%
call heuristics of edge 3->4 (ignored): 33.00%
cold label heuristics of edge 3->4: 10.00%
...
<bb 3> [local count: 955630224]:
...
_2 = *SR.90_31;
x = _2;
if (_2 > 10)
goto <bb 4>; [10.00%]
else
goto <bb 5>; [90.00%]
In gcc, the [[(un)likely]] attributes override the branch probability heuristics that would otherwise be used, replacing the estimated probabilities with constants: 90% for [[likely]], 10% for [[unlikely]]. This ensures that the path marked [[likely]] is optimised for - in the sense that it is made the fallthrough block - and that the path marked [[unlikely]] is pessimised.
Examining the assembly of the above functions 14, we see for likely;
cmp r3, #10
ble .L34 ; next loop iteration
...
; fallthrough: std::vector::push_back
i.e. the branch is taken in the case that x is not greater than 10: the opposite path to what was marked [[likely]]. And for unlikely;
cmp r3, #10
bgt .L20 ; std::vector::push_back
...
; fallthrough: next loop iteration
i.e. the branch is taken precisely when execution enters the path marked [[unlikely]].
Since clang does not have a profile tree output, we will take a look at the source code. clang lowers [[(un)likely]] to its __builtin_expect() intrinsic. This builtin takes two arguments: an expression to be annotated, and the value that expression is expected to evaluate to. For example, __builtin_expect(x > 0, 0) declares that x > 0 is expected to evaluate to false (0).
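To make the correspondence concrete, here is a sketch of my own (not from clang's test suite) of two functions that should receive the same branch-weight metadata; behaviour is identical either way, only the emitted layout may differ:

```cpp
// [[likely]] on the branch...
int with_attr(int x) {
    if (x > 0) [[likely]] {
        return x * 2;
    }
    return -1;
}

// ...is lowered by clang as if we had written __builtin_expect directly.
// The second argument (1) says we expect `x > 0` to evaluate to true.
int with_builtin(int x) {
    if (__builtin_expect(x > 0, 1)) {
        return x * 2;
    }
    return -1;
}
```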
llvm::Value *
CodeGenFunction::emitCondLikelihoodViaExpectIntrinsic(llvm::Value *Cond,
Stmt::Likelihood LH) {
switch (LH) {
case Stmt::LH_None:
return Cond;
case Stmt::LH_Likely:
case Stmt::LH_Unlikely:
...
llvm::Type *CondTy = Cond->getType();
...
llvm::Value *ExpectedValueOfCond =
llvm::ConstantInt::getBool(CondTy, LH == Stmt::LH_Likely);
return Builder.CreateCall(FnExpect, {Cond, ExpectedValueOfCond},
Cond->getName() + ".expval");
}
llvm_unreachable("Unknown Likelihood");
}
The call to ConstantInt::getBool gets the second argument for the builtin, so the {Cond, ExpectedValueOfCond} contains the arguments to pass to __builtin_expect. The above function is called in CodeGenFunction::EmitBranchOnBoolExpr 15 when emitting conditional branches.
Next, __builtin_expect() is lowered to LLVM IR over a few passes, which is a bit too detailed to get into here, but a good starting point is LowerExpectIntrinsic.cpp 16. This lowering provides the branch probability info which is consumed by the calcMetadataWeights() function that we saw earlier in BranchProbabilityInfo.cpp. Recall that if metadata on branch weights exists, the other static heuristics are not used.
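Roughly, for a condition like __builtin_expect(x > 0, 1), the frontend emits a call to the llvm.expect intrinsic, and LowerExpectIntrinsic then strips it, attaching branch_weights metadata to the conditional branch instead. A sketch of the before/after IR (the 2000:1 ratio reflects LLVM's default likely/unlikely branch weights; treat exact values as illustrative):

```llvm
; before LowerExpectIntrinsic:
%cmp    = icmp sgt i32 %x, 0
%expval = call i1 @llvm.expect.i1(i1 %cmp, i1 true)
br i1 %expval, label %if.then, label %if.end

; after: the intrinsic call is gone, and the branch carries metadata
br i1 %cmp, label %if.then, label %if.end, !prof !0
!0 = !{!"branch_weights", i32 2000, i32 1}
```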
So we can see that the way clang handles the attribute annotations is similar to gcc in two ways:
- [[(un)likely]] attributes dominate other static heuristics, and;
- [[likely]] (respectively, [[unlikely]]) always ensures the annotated branch is optimised for (respectively, pessimised).
We can conclude that one should use the [[(un)likely]] attributes if and only if they wish to override the code layout that the compiler would otherwise generate.
From gcc’s predict.def and clang’s heuristics, we can see that it is not necessary to use the attributes to annotate branches which are obviously cold/error paths. For example, annotations on conditionals such as;
constexpr int ERROR_CODE = -1;
if (bad_condition) [[unlikely]] {
// ... handle error
return ERROR_CODE;
}
or;
if (bad_condition) [[unlikely]] {
throw my_exception("...");
}
will likely have no effect on code layout, since they are both identified as unlikely by static heuristics.
Applications
I see two scenarios in which one may wish to override the code layout the compiler would otherwise generate:
- using the programmer’s external knowledge to provide a heuristic in cases otherwise ambiguous to the compiler;
- overriding a heuristic that the compiler would normally use to determine code layout.
For an example of the first scenario, consider an if statement with an else clause, where the two paths combined contain enough instructions that they do not fit in a single cache line, and therefore are unlikely to both be in L1i after one fetch. Consider further that the code in neither path contains patterns identified by the static heuristics we outlined above: neither returns a constant value, neither calls a function, etc. The general shape is
if (ambiguous condition) {
// do inline work
} else {
// do other inline work
}
This godbolt shows an example of a case where gcc estimates both branches to be equally likely, along with an alternative function which has an [[unlikely]] annotation, and a diff view of the different code emitted.
The [[unlikely]] attribute has changed the code layout such that the if path has been swapped with the else path, and by extension swapped which path is most likely in L1i cache. In this example, the condition (namely sin(a) == cos(b)) which determines code flow can be reasoned about quite easily. The programmer may well have contextual knowledge on the inputs a and b as to how likely this condition is to be true, and be tempted to annotate attributes accordingly.
However, this approach is often discouraged 2 in favour of Profile Guided Optimisation (PGO).
PGO 17 is a technique to optimise a program using dynamic analysis of the execution of that program. Statistics on the number of times each branch is taken and each function is called are recorded during execution, and those statistics are fed into a second compilation of the same program. This second compilation uses the provided statistics to attempt to improve the runtime performance of the program; for example, by reordering basic blocks to maximise the number of fallthrough branches. (For more on PGO, I recommend 18 as an introduction.)
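As a sketch of the instrumented-PGO workflow with gcc (file names here are placeholders; clang's equivalent uses -fprofile-instr-generate/-fprofile-instr-use plus llvm-profdata):

```sh
# 1. Build with instrumentation; the binary will record branch/call counts.
g++ -O2 -fprofile-generate prog.cpp -o prog

# 2. Run a representative workload; gcc writes *.gcda profile files.
./prog < representative_input.txt

# 3. Rebuild, feeding the recorded profile back into the optimiser.
g++ -O2 -fprofile-use prog.cpp -o prog
```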
[[(un)likely]] vs PGO
The argument in the above example then becomes: “if sin(a) == cos(b) is particularly common/rare in production, an instrumented profile will record this, and a PGO compilation will lay out the paths in the most optimal way, without the need for [[(un)likely]] attributes”. With PGO, there is no need for a developer to possess domain knowledge of the branch likelihoods, and (possibly incorrectly) annotate likely branches.
More formally, PGO is often touted as superior to [[(un)likely]] for reasons such as:
- humans are susceptible to incorrectly guessing branch probabilities; statistical profiles (used by PGO) are not,
- attributes can become outdated as the codebase changes,
- changes to the generated assembly are implementation defined, i.e. changing compilers/architectures will not guarantee the same result.
However, I do not believe the capabilities of PGO are a superset of those of [[(un)likely]].
Profile runs can become outdated just as easily as branches annotated with attributes. Although profiles can be version-controlled along with source code (allowing reproducibility of builds and observability of history), there is significantly lower friction with attributes. Since [[(un)likely]] annotations reside inside the source, there is no extra effort required to manage a separate directory/repository of profile runs. Further, profile runs are not human-readable, whilst I suggest good practice for using attributes is to provide inline documentation as to why a branch was marked the way it was. Not only is this human-readable, but it requires no extra version control, providing a view of the history of the rationale behind any annotation. If, upon a refactor, one comes across a justification for an attribute which no longer holds true, it can be removed, with documentation as to why.
Moreover, acquiring PGO profiles always requires some sacrifice. Instrumentation profiles collect data on (almost) all branches and function calls, and lead to unbounded slowdowns 18; unquestionably unsuitable for production. Sampling profiles are a compromise which collect data only at certain intervals (akin to how perf samples execution time), but are therefore not as accurate as a fully instrumented profile 18. The question of which workload to use to collect a sample is also not straightforward. The profile needs to be representative of the code paths you wish to optimise for. If a sloppily chosen sample workload is biased towards a code path which in general is not common, that path will be optimised for regardless, whilst other paths - which may in fact be more common in real workloads/production - may suffer. A PGO compilation is only as good as its profile. Granted, if there is a workload which you know you would like to optimise for, e.g. common queries in a DBMS, then PGO will often provide speedups with a profile-guided recompilation, with no changes to the source code 18.
There may also be cases where it is hard to generate a profile for the path you wish to optimise for. This is an instance of the second application I mentioned above.
One example is notorious in discussions around [[(un)likely]]: that of latency-sensitive trading applications which rarely send an order to the exchange, but when they do, need to be as fast as possible. Something along the lines of;
if (signal) {
// send order to exchange
}
Another example, adapted from 2, is that of safety-critical software handling a catastrophic failure. Perhaps;
if (critical_failure) {
// don't kill people
}
In both cases, signal and critical_failure are almost never true in real-world workloads of either program, but theirs are the paths that need to be optimised for. Therefore, to achieve that optimisation using PGO, one would need to craft a synthetic workload to produce a profile in which the condition is often true. Such a synthetic workload may inadvertently pessimise other branches in the codebase which are otherwise the hot path (due to the skewed/misrepresentative workload). Instead of a fake workload, we could write;
if (signal) [[likely]] {
// send order to exchange
}
and;
if (critical_failure) [[likely]] {
// don't kill people
}
Using the attributes in this way is a little abrasive. It feels slightly wrong to write “likely” for a condition which we know is in fact unlikely to evaluate to true. However, in a single annotation we may be able to achieve the desired code layout (as shown above in the previous examples), with no PGO. Perhaps [[favour]] and [[disfavour]] would have been more apt names.
Verifiable branch annotation with PGO
Rather than pitting branch annotations and PGO against each other, there has been some effort from the LLVM project to make them work together. Given some program source and profile data on that program, clang-misexpect 19 can make clang emit warnings/reports when the source code contains branch annotations which contradict the profile data. For example, if you annotated a branch as unlikely, but in fact it was almost always the taken path, then clang-misexpect can catch this, and direct you to the line of the questionable annotation. It is then up to the developer’s discretion as to whether that annotation should be changed.
This workflow provides the empirical verification of branch likelihoods of PGO, whilst maintaining the flexibility to ignore profile results for specific branches with annotations. Also, it allows all your code layout manipulation (in the form of annotations) to be version-controlled and documented inline, as opposed to relying on a separate store of profile data.
However, as you may have guessed, clang-misexpect only works when compiling with clang (and clang 23+ at that), adding friction if you also need to target other compilers. Sadder still is that only the clang builtin annotations - i.e. __builtin_expect(...) - work; [[(un)likely]] is not supported. And since a PGO pipeline is inherently toolchain dependent, we are unlikely to ever see a more portable option.
Conclusion
If your aim is to try and optimise runtime performance of your program, then PGO is a tried and tested approach, which will likely give you good results. If instead you would like to manipulate the code layout of your generated assembly, then consider [[(un)likely]]. But be aware of its limitations, and how it will not behave the same across compilers. As always, benchmark on your compiler and your machine. Take a look at the assembly generated to verify what your annotations are doing.
Reprise
I restricted the conversation to mostly be around simple if statements, as opposed to switch statements. Similar concepts apply, but since there are arbitrarily more branches, the probabilities can get a bit more complicated. But here’s a small trick with [[(un)likely]] that you might be able to use for certain switch statements:
Both gcc and clang (especially on high optimisation levels) tend to treat the default case of a switch as unlikely, and place it furthest away from the condition check, regardless of where you place the default yourself (simple godbolt to illustrate). If there are a lot of switch cases, the code in the default case might not be pulled into cache. But you may have a switch where the default case is actually the most common, or the path you’d like to optimise for. Annotating default with [[likely]] can sometimes help with this, with differing levels of effectiveness depending on compiler and optimisation level (godbolt).
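A minimal sketch of the trick (the dispatcher and its cases are invented for illustration):

```cpp
// A dispatcher where, in our hypothetical workload, `op` is almost always
// something other than the enumerated special cases. Annotating `default`
// with [[likely]] asks the compiler to lay it out as the near/fallthrough
// path; how much this helps varies by compiler and optimisation level.
int dispatch(int op, int x) {
    switch (op) {
        case 1: return x + 1;   // rare special cases
        case 2: return x - 1;
        case 3: return x * 2;
        [[likely]] default:
            return x;           // the common path in this workload
    }
}
```

As always, diff the generated assembly with and without the annotation before trusting it.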