I have found what looks like a hardware bug in the AT90CAN128! I guess I could be happy about it in a geeky way, but the number of hours spent tracking the problem down makes that a bit difficult :P.
The thing is that aborting a pending MOb (message object) in the CAN controller embedded in the AT90CAN128 can leave CONMOB in an unexpected state. I am not totally sure whether this is a hardware bug… or whether Atmel's library at90CANlib_3_2 is buggy AND misleading.
The library's procedure for aborting a pending MOb consists of writing 0 to the CONMOB bits (the configuration bits in the MOb's CANCDMOB register). The library also relies on those bits being 0 to detect that a MOb is free.
So the library is self-consistent. But that's not how the hardware actually works! It turns out that a failed MOb (e.g. because of bus errors) will retry its operation, and if its CONMOB is overwritten with 0 in that situation, there is a small chance that CONMOB will jump back to its old value. So any routine that tests for CONMOB==0 to find a free MOb, as Atmel's library does, will keep considering that MOb busy. Result: a MOb leaked and unusable until reset.
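To make the failure mode concrete, here is a rough paraphrase in C of the scheme described above. It is not the library's actual code: the function names are made up for illustration, and the register is passed in as a pointer so the logic can be read (and run) without the AVR headers. The only hardware fact assumed is that CONMOB occupies bits 7:6 of CANCDMOB.

```c
#include <stdint.h>
#include <stdbool.h>

/* CONMOB1:0 are bits 7:6 of CANCDMOB. */
#define CONMOB_MASK 0xC0u

/* Hypothetical name: abort a MOb the way the scheme above does,
 * by clearing its CONMOB bits and nothing else. */
static inline void lib_style_abort(volatile uint8_t *cancdmob)
{
    *cancdmob &= (uint8_t)~CONMOB_MASK;
}

/* Hypothetical name: a MOb counts as free only while CONMOB reads 0. */
static inline bool lib_style_is_free(const volatile uint8_t *cancdmob)
{
    return (*cancdmob & CONMOB_MASK) == 0;
}
```

If CONMOB jumps back to its old value after `lib_style_abort()` has returned, `lib_style_is_free()` reports the MOb as busy from then on, and nothing in this scheme ever clears it again.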
This happens very rarely, and seemingly only in pretty extreme situations. In my case, the CAN bus was suffering a large number of errors, to the point of periodically taking all nodes to the Bus Off state, which is already a catastrophic situation anyway. So one could argue that this should never happen to begin with. But consider a short period of broken communication because of (for example) some maintenance of the bus, after which the bus returns to a good condition; we would then be in a situation where the CAN bus looks OK and yet the CAN controller in the µC is in a bad state because of the past period of bus errors.
(What does "very rarely" mean? It means that, if the MOb is transmitting about 1 message every 2 ms, the bus runs at 1 Mbaud and there is a Bus Off event about once per second, then about one MOb is leaked per hour on average. That is less than 1 leaked MOb per million messages - so good luck catching it! One way of forcing enough bad bus behaviour to get those Bus Off events is to have a CAN node try to send messages at a different baud rate; its own cycle of going into Bus Off and becoming active again will generate windows of good and bad bus behaviour. Even more frequent Bus Off events can be generated by combining more nodes at mismatched speeds.)
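The "per million messages" figure is simple arithmetic on the numbers above, and can be checked with a few lines of C. The function name is made up for this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: number of messages sent in one hour by a MOb
 * that transmits one message every period_ms milliseconds. */
static uint32_t messages_per_hour(uint32_t period_ms)
{
    return UINT32_C(3600000) / period_ms;  /* 3,600,000 ms in one hour */
}
```

With a 2 ms period this gives 1,800,000 messages per hour, so one leaked MOb per hour works out to roughly one leak per 1.8 million messages.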
I wrote to Atmel about this and their answer was that...
"If you want to abort CAN communications, I would instead recommend using ABRQ in CANGCON as the means to abort. CANCDMOB is not affected by the abort, so you would still have to manage it."
... but that is not a good solution, because ABRQ aborts all MObs. I only wanted to abort one MOb.
Also they said, ...
"CANCDMOB really does not indicate that a MOB is 'free' in the sense that it is ready for the next transmission - other bits indicate this; the user manual is clear that CONMOB bits are not cleared once communication is performed..."
Which is true according to the documentation, which mentions that CANEN is the register to check for availability of MObs. But then, that means that their own library is pretty broken, since it never even refers to the CANEN registers.
Also, the manual mentions repeatedly that CONMOB is not expected to change by itself, not even after a reset. That is what makes me think that this is actually a hardware bug. (The library has plenty of bugs anyway. The strangest one is that it changes the semantics of CAN messages: normally a CAN message is supposed to be retried until it is sent successfully, and that is in fact how the AT90CAN128's CAN controller works if you program it directly. But the library will instead abort any errored-and-retrying CAN message the moment you try to check its status!)
Anyway, my most minimal workaround for the MOb leaks is: when aborting a MOb, write 0 to CONMOB, check that the corresponding CANEN bit is 0, and then write the 0 to CONMOB again.
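A minimal sketch of that sequence in C follows. It is an illustration, not tested production code. The function name, the pointer-based signature and the `max_polls` limit are all assumptions made for this sketch. In particular, "check that the CANEN bit is 0" is read here as "poll until it clears, giving up after a bounded number of tries", which is only one possible interpretation.

```c
#include <stdint.h>
#include <stdbool.h>

/* CONMOB1:0 are bits 7:6 of CANCDMOB. */
#define CONMOB_MASK 0xC0u

/* Hypothetical helper. On the AT90CAN128, cancdmob would point at
 * CANCDMOB (after selecting the MOb through CANPAGE) and canen at
 * CANEN2 for MObs 0-7 or CANEN1 for MObs 8-14; enmob_bit is the mask
 * of this MOb's bit in that register.
 * Returns true if the MOb was seen disabled and CONMOB was rewritten,
 * false if the CANEN bit never cleared within max_polls extra reads. */
static bool mob_abort_checked(volatile uint8_t *cancdmob,
                              const volatile uint8_t *canen,
                              uint8_t enmob_bit,
                              uint16_t max_polls)
{
    /* Step 1: write 0 to CONMOB. */
    *cancdmob &= (uint8_t)~CONMOB_MASK;

    /* Step 2: check that the MOb's CANEN bit is 0 (bounded polling
     * is an assumption of this sketch). */
    while ((*canen & enmob_bit) != 0) {
        if (max_polls-- == 0) {
            return false;
        }
    }

    /* Step 3: write 0 to CONMOB again, in case it jumped back. */
    *cancdmob &= (uint8_t)~CONMOB_MASK;
    return true;
}
```

Passing the registers as pointers keeps the sketch independent of the AVR headers; on the real part the caller would pass the addresses of the actual registers.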
Alternatively, the safest option seems to be to stop checking CONMOB when searching for free MObs and to check the CANEN bits instead.
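A search for a free MOb based on the CANEN bits could look roughly like the sketch below. The function name and `MOB_COUNT` are made up for illustration. The register layout assumed is the one in the AT90CAN128 datasheet: CANEN2 holds the enable bits of MObs 0-7 and CANEN1 those of MObs 8-14. The register values are passed in as plain arguments, so on the real part the caller would read CANEN1 and CANEN2 and hand them over.

```c
#include <stdint.h>

/* The AT90CAN128 has 15 MObs, numbered 0..14. */
#define MOB_COUNT 15

/* Hypothetical helper: return the number of the first MOb whose CANEN
 * bit is 0, or -1 if every MOb is currently enabled. CONMOB is not
 * consulted at all. */
static int8_t find_free_mob(uint8_t canen1, uint8_t canen2)
{
    /* Bit n of 'enabled' is the ENMOB bit of MOb n. */
    uint16_t enabled = (uint16_t)(((uint16_t)canen1 << 8) | canen2);

    for (int8_t mob = 0; mob < MOB_COUNT; mob++) {
        if ((enabled & ((uint16_t)1 << mob)) == 0) {
            return mob;
        }
    }
    return -1;
}
```

Because this never looks at CONMOB, a MOb whose CONMOB has jumped back to a stale value is still found as free, provided its CANEN bit is 0.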
(Atmel didn't answer my objections to their "solutions", which is why I am writing this. I hope it will save someone the hurried debugging nights I had to spend at the office :P)