Why doesn't GCC do this "easy" NRVO optimization?
2022-01-25
Since C++17, we C++ programmers have rejoiced in the fact that when we return something from a function, the standard now guarantees that the value won't be copied. This is known as return value optimization (RVO) or copy/move elision, and happens in cases like this:
MyType myfunc() {
    // ...
    return MyType{arg1, arg2};
}
That is, MyType is constructed once and never copied or moved.
But some compilers still skip this optimization in some cases. It turns out that's because the guarantee only applies when you return unnamed values (prvalues). Returning a named object, known as named return value optimization (NRVO), is not covered by the standard's guarantee; it remains optional. "Named" means something like:
MyType myfunc() {
    MyType x{};
    x.do_something();
    return x;
}
And gcc (11.2) doesn't always perform NRVO, even if it "obviously" can. Why? Do other compilers do better? I tried to find out.
An example that works
Let's look at a code sample to see what the problem isn't, i.e. an example where NRVO does happen. We have this code:
void doSomething(char*);
struct MyType {
    char buffer[100000];
};
MyType wantNrvo() {
    MyType x;
    x.buffer[0] = '\0';
    return x;
}
int main() {
    auto x = wantNrvo();
    doSomething(x.buffer);
}
And we are interested in what the wantNrvo function compiles down to with -O3; we want gcc to give it its best shot. That's why we add the call to doSomething: to stop x from being optimised away entirely. We could have marked x volatile instead, but I hardly ever see that in real code, and it would just confuse things. A function call is more realistic. Anyway, the generated assembly is:
wantNrvo():
        mov     BYTE PTR [rdi], 0
        mov     rax, rdi
        ret
main:
        sub     rsp, 100008
        mov     rdi, rsp
        mov     BYTE PTR [rsp], 0
        call    doSomething(char*)
        xor     eax, eax
        add     rsp, 100008
        ret
We can see there's obviously no copying of the huge 100,000-byte buffer there, since there's no memcpy (or equivalent) anywhere. That's what we're looking for.
It's interesting to note that with NRVO, the caller has to make space on the stack for the returned object before the function even runs; the object is then constructed directly in that space.
The function is passed the location to build the object in rdi. Thus, RVO can be viewed as rewriting our function to take a pointer to a block of memory big enough for MyType, and using placement new.
That might look like this:
void wantNrvo(char *memory) {
    MyType *x = new(memory) MyType{};
    x->buffer[0] = '\0';
}
The point is that there is one single block of memory where the return value is constructed, and this is allocated by the caller. That seems fine: the caller knows the size of the block and can allocate it - what could go wrong?
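To make that rewrite concrete, here is a hypothetical caller side (the names and the static buffer are mine, purely for illustration; the compiler would use stack space instead):

```cpp
#include <new>

struct MyType {
    char buffer[100000];
};

// Hypothetical rewrite of the NRVO'd function: the caller owns the
// storage and the callee constructs the result directly into it.
void wantNrvo(void* memory) {
    MyType* x = new (memory) MyType;  // placement new: no allocation, no copy
    x->buffer[0] = '\0';
}

char* callerSide() {
    // The caller reserves a suitably sized and aligned block, just as
    // the compiler reserves stack space before making the call.
    alignas(MyType) static char storage[sizeof(MyType)];
    wantNrvo(storage);
    return storage;
}
```

The key property is the same as in the generated assembly: the object is only ever built in one place, and that place belongs to the caller.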
What goes wrong
Sometimes it's hard to tell which object out of multiple objects is going to be returned, meaning we don't know which object to construct in the return value area.
That is kind of non-obvious, so here's an example:
MyType wantNrvo(bool test) {
    MyType x;
    MyType y;
    x.buffer[0] = '\0';
    return test ? x : y;
}
There is obviously no way to know at compile time which of x or y we
should construct in the return value area. Using our knowledge of this
program, you might think: hey, why don't we just check test and
decide which of x and y to construct in the return value? Alas, gcc,
even with -O3, isn't that smart. This is the machine code produced:
wantNrvo(bool):
        sub     rsp, 200008
        mov     r9d, esi
        mov     edx, 100000
        lea     rax, [rsp+100000]
        test    r9b, r9b
        mov     rsi, rsp
        mov     BYTE PTR [rsp], 0
        cmove   rsi, rax
        call    memcpy
        add     rsp, 200008
        ret
It's there, plain as day: memcpy! The huge buffer is being copied!
It gets worse!
There are other, even more obvious cases where gcc fails. Maybe the problem is that we're constructing two objects and then picking one to return; maybe that's confusing for some reason. We can attempt to rewrite the code:
MyType wantNrvo(bool test) {
    if (test) {
        MyType x;
        x.buffer[0] = '\0';
        return x;
    } else {
        MyType y;
        return y;
    }
}
This code is starting to get a bit contrived. But let's see what that compiles to:
wantNrvo(bool) [clone .part.0]:
        sub     rsp, 100008
        mov     edx, 100000
        mov     rsi, rsp
        call    memcpy
        add     rsp, 100008
        ret
wantNrvo(bool):
        push    r12
        mov     r12, rdi
        test    sil, sil
        je      .L5
        mov     rax, r12
        mov     BYTE PTR [rdi], 0
        pop     r12
        ret
.L5:
        call    wantNrvo(bool) [clone .part.0]
        mov     rax, r12
        pop     r12
        ret
Now this is bizarre! There are two branches of this function:
- The test = true branch which uses NRVO and writes directly to the block of memory pointed to by rdi. 
- The test = false branch which allocates new memory on the stack, and copies the new, just-allocated memory into the area pointed to by rdi with memcpy. 
Surely this is just optimized badly because we don't actually call it, right? Don't be so sure: when we call it, it gets inlined but still uses memcpy:
int main() {
    auto x = wantNrvo(false);
    doSomething(x.buffer);
}
main:
        sub     rsp, 200008
        mov     edx, 100000
        lea     rsi, [rsp+100000]
        mov     rdi, rsp
        call    memcpy
        mov     rdi, rsp
        call    doSomething(char*)
        xor     eax, eax
        add     rsp, 200008
        ret
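As an aside, restructuring the function to keep exactly one named object on every path does side-step the problem; this is my rewrite, not part of the examples above, but it reduces to the same shape as the first, NRVO-friendly example:

```cpp
struct MyType {
    char buffer[100000];
};

// With a single named object, the compiler always knows which object
// lives in the caller-provided return slot, so NRVO can apply.
MyType singleObject(bool test) {
    MyType x{};              // value-initialise so both paths are defined
    if (test) {
        x.buffer[0] = 'a';
    }
    return x;
}
```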
Why, God? Why?
I am not a compiler author. I have never written a compiler and am much dumber than those behind gcc's NRVO.
But NRVO is pretty important, and I expect it sometimes! That's backed up by this comment on this gcc bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51571
I would like to strongly oppose the notion that this "just a missed optimisation and not very critical, really".
NRVO is not just "an optimisation". It's actually one that is explcitly permitted to change observable behaviour of the program and it's extremely powerful.
And it it required for performant C++. Just try to return a std::vector by value to see the importance of this optimisation. This is not missed optimisation. This is premature pessimisation.
You could just as well stop all optimisation work for the C++ frontend until this is implemented, because any other optimisation effords are dwarfed by the overhead when NRVO is expected by the developer but not applied.
Please make this a top priority. Every C++ program will benefit both in text size and in runtime performance - dramatically.
-- Marc Mutz
But then again, there are often hyperbolic, borderline rude comments left on projects saying "make this a top priority", so who knows?
Let's see clang's business card
Let's see clang's machine code. Does it do any better? Yes!
(using clang 13.0.0)
wantNrvo(bool):                           # @wantNrvo(bool)
        mov     rax, rdi
        test    esi, esi
        je      .LBB0_2
        mov     byte ptr [rax], 0
.LBB0_2:
        ret
This is exactly the perfect machine code we want. Give us some memory in rdi, collapse those branches down to writing or not writing a nul byte to it. Easy, no memcpy!
Lessons to learn
Although you might see some text on cppreference.com saying:
Return value optimization is mandatory and no longer considered as copy elision; see above. (since C++17)
This doesn't refer to NRVO! Assuming that NRVO will happen can be very dangerous, resulting in copies of huge objects, and that hurts performance a lot.
Try as hard as possible to return prvalues to take advantage of C++17's guaranteed RVO.
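If you want to check at runtime whether you're paying for a copy, a small probe type helps (Probe is my own name, not from any library): count copies and moves in the constructors and see what actually happens.

```cpp
// A probe type: count copies and moves to observe elision at runtime.
struct Probe {
    static int copies;
    Probe() = default;
    Probe(const Probe&) { ++copies; }
    Probe(Probe&&)      { ++copies; }
};
int Probe::copies = 0;

Probe fromPrvalue() {
    return Probe{};          // prvalue: elision guaranteed since C++17
}

Probe fromNamed(bool flip) {
    Probe a, b;
    return flip ? a : b;     // NRVO impossible here: a copy gets counted
}
```

Calling fromPrvalue leaves the counter untouched on every conforming C++17 compiler; calling fromNamed bumps it, because the conditional expression is an lvalue and must be copied into the return slot.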
It's a dog-eat-RVO world out there. Be careful.
