I found a strange situation when programming in C for AVRs. I have a bunch of structures to be worked on, but any of them can be flagged "active" or not at any given moment, to signal whether they need further work. My first approximation was to add a (uint8_t volatile: set in main, checked in interrupt time) member variable named "active" to the struct, to quickly check/set each structure.
Shock: only looping through the array checking the "active" member variables was taking an enormous amount of time – about 20 clock cycles per access!
After some testing, looks like the stride of the memory accesses causes the time spent in each access to change. If the stride is 1 (i.e., accessing each machine-word-size member of an array) then each access takes 11 clock cycles. If the stride is 2, 13 cycles. At stride 10 and 20, 20 cycles. And my structs have stride 18...
And it is not really dependent on the array-member-size; just using a for(uint8_t i=0;..;i+=2) causes those 13 cycles too. So it really depends on the stride.
I know that memory access patterns can be a problem when there is a cache involved. But that has nothing to do here: the AVR I am using has no cache, and anyway the code is behaving predictably, with no memory access stalls. In fact the clock cycles needed for an access can be predicted by just inspecting the assembler; the code with longer strides simply do have more instructions and a longer jump implementing the loop. So, it's "just" the compiler's "fault".
I have tried using for, while, do while with predecrement (as recommended by the AVR docs)… nothing really changes. Looking at the assembler, the extra cycles are used in (strange) math involving the Z pointer, so it is related to calculating the memory addresses. I would have expected that having a fixed stride would make all the similar cases, well, similar.
(why I say "strange" math? because the thing is swapping nibbles in the registers which make up the Z pointer… WTF? having read recently about magic numbers used by compilers to avoid some expensive math operations, I am not feeling inclined to investigate right now what exactly is going on. It guess it HAS to be something smart, and it's not really like I can control it anyway... I'm using -O3, by the way)
So, the solution to the slowness problem will be using bit-addressable GPIO registers for bit flags (feels like overkill…) or using parallel arrays of word-sized-members to store the flags (ugly).
Maybe I am still too used to big, well organized structures and should get less fancy for embedded C? :P
But all of this only makes me think that I want to fix (?) the compiler instead. Would be really interesting to see what is the root problem that the compiler is trying to solve (maybe even solving?), and how other compilers (or even better, a JITC) would deal with it…
Shock: only looping through the array checking the "active" member variables was taking an enormous amount of time – about 20 clock cycles per access!
After some testing, looks like the stride of the memory accesses causes the time spent in each access to change. If the stride is 1 (i.e., accessing each machine-word-size member of an array) then each access takes 11 clock cycles. If the stride is 2, 13 cycles. At stride 10 and 20, 20 cycles. And my structs have stride 18...
And it is not really dependent on the array-member-size; just using a for(uint8_t i=0;..;i+=2) causes those 13 cycles too. So it really depends on the stride.
I know that memory access patterns can be a problem when there is a cache involved. But that has nothing to do here: the AVR I am using has no cache, and anyway the code is behaving predictably, with no memory access stalls. In fact the clock cycles needed for an access can be predicted by just inspecting the assembler; the code with longer strides simply do have more instructions and a longer jump implementing the loop. So, it's "just" the compiler's "fault".
I have tried using for, while, do while with predecrement (as recommended by the AVR docs)… nothing really changes. Looking at the assembler, the extra cycles are used in (strange) math involving the Z pointer, so it is related to calculating the memory addresses. I would have expected that having a fixed stride would make all the similar cases, well, similar.
(why I say "strange" math? because the thing is swapping nibbles in the registers which make up the Z pointer… WTF? having read recently about magic numbers used by compilers to avoid some expensive math operations, I am not feeling inclined to investigate right now what exactly is going on. It guess it HAS to be something smart, and it's not really like I can control it anyway... I'm using -O3, by the way)
So, the solution to the slowness problem will be using bit-addressable GPIO registers for bit flags (feels like overkill…) or using parallel arrays of word-sized-members to store the flags (ugly).
Maybe I am still too used to big, well organized structures and should get less fancy for embedded C? :P
But all of this only makes me think that I want to fix (?) the compiler instead. Would be really interesting to see what is the root problem that the compiler is trying to solve (maybe even solving?), and how other compilers (or even better, a JITC) would deal with it…
5 years later, I wonder if a couple of things would have helped control or reduce the number of cycles per access. I should have studied better the generated assembler - but that wasn't interesting nor fruitful at the moment.
ReplyDeleteThe loop counter being unsigned implies that the counter can overflow / wrap around without ill effect. That means that the generated code has to take care of the possibility... unless the compiler can directly rely on the hardware doing the right thing. So, I wonder whether changing the type of the counter to signed would have made the compiler not care about the possible overflow (because that would be Undefined Behavior), and so would simplify the generated assembler.
Otherwise, making explicit the range available to the loop counter in some way (in the for, or with some assert?) maybe would have also convinced the compiler to generate simpler assembler...