Optimizing data memory utilization
A key difference between embedded and desktop system programming is variability: every Windows PC is essentially the same, whereas every embedded system is different. This variability has a number of implications: tools need to be more sophisticated and flexible; programmers need to be ready to accommodate the specific requirements of their system; standard programming languages are often less than ideal for the job. This last implication leads to a key issue: control of optimization.
Optimization is a set of processes and algorithms that enable a compiler to go beyond simply translating code from (say) C into assembly language, and instead translate an algorithm expressed in C into a functionally identical one expressed in assembly. This is a subtle but important difference.
A key aspect of optimization is memory utilization. Typically, a trade-off has to be made between fast code and small code; it is rare to have the best of both worlds. The same decision applies to data. The way data is stored in memory affects its access time. With a 32-bit CPU, if everything is aligned on word boundaries, access is fast; this is termed ‘unpacked’ data. Alternatively, if bytes of data are stored as compactly as possible, it may take more effort to retrieve each item and hence access is slower; this is ‘packed’ data. So you have a choice much the same as with code: compact data that is slow to access, or some wasted memory but fast access to data.
For example, this structure:
could be mapped into memory in a number of ways. The C language standard leaves the amount of padding between and after structure members to the compiler. Two possibilities are packed, as in figure 1 (left), or unpacked, as in figure 2 (right):
Unpacked could be even more wasteful. This graphic shows word (16-bit) alignment. Long word (32-bit) alignment would result in 5 bytes being wasted for every 3 bytes of data!
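The arithmetic of wasted bytes can be checked with a short sketch. It assumes a hypothetical structure holding one short and one char (3 bytes of data on a target with a 2-byte short); the structure and function names are illustrative only. C11 _Alignas is used here merely to imitate a compiler that aligns every member on a 32-bit boundary.

```c
#include <stddef.h>

/* Hypothetical structure: 3 bytes of real data, assuming a 2-byte short. */
struct natural
{
    short two_byte;
    char  one_byte;
};

/* Same members, each forced onto a 32-bit boundary with C11 _Alignas,
   imitating long word alignment. */
struct long_word_aligned
{
    _Alignas(4) short two_byte;
    _Alignas(4) char  one_byte;
};

/* Bytes of real data held by either structure. */
static size_t payload_bytes(void)
{
    return sizeof(short) + sizeof(char);
}

/* Padding bytes added by the long-word-aligned layout. */
static size_t wasted_bytes(void)
{
    return sizeof(struct long_word_aligned) - payload_bytes();
}
```

With a 2-byte short, the aligned structure occupies 8 bytes, so wasted_bytes() accounts for the 5 bytes of padding mentioned above.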
Most embedded compilers have a switch to select what kind of code generation and optimization is required. However, there may be a situation where you decide to have all your data unpacked for speed, but have certain data structures where you would rather save memory by packing. In this case, the language extension keyword packed may be applied, thus:
This overrides the optimization setting for this one object.
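The exact spelling of such an extension varies between toolchains. As a rough sketch of the same idea, the following uses the GCC/Clang __attribute__((packed)) extension as a stand-in for the packed keyword; the structure and array names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical structure with the default, compiler-chosen layout. */
struct sample
{
    short two_byte;
    char  one_byte;
};

/* Same members, packed for this one type only.
   __attribute__((packed)) is a GCC/Clang extension, used here as a
   stand-in for the packed keyword described in the text. */
struct sample_packed
{
    short two_byte;
    char  one_byte;
} __attribute__((packed));

struct sample        fast_table[4];     /* default layout, usually padded */
struct sample_packed compact_table[4];  /* no padding between members */
```

Only compact_table gives up alignment; every other object in the program keeps the layout selected by the compiler's global setting.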
Alternatively, you may need to pack all the data to save memory, and have certain items that you want unpacked either for speed or for sharing with other software. This is where the unpacked extension keyword applies.
It is unlikely that you would use both packed and unpacked keywords in one program, as only one of the two code generation options can be active at any one time.
As previously discussed, modern embedded compilers provide the opportunity to minimize the space used by data objects; this may be controlled quite well by the developer. However, this optimization is only to the level of bytes, which might not be good enough.
For example, imagine an application that uses a large table of values, each of which is in the range 0 to 15. Clearly this requires 4 bits of storage (a nibble), so keeping them in bytes would only be 50% efficient. It is the developer’s job to do better (if memory footprint is deemed to be of greater importance than access time). There are broadly two ways to address this problem.
One way is to use bit fields in structures. This has the advantage that a compiler can readily optimize memory usage, if the target CPU offers a convenient capability. The downside is that bit fields within a structure cannot be indexed without writing additional code, but this is not too difficult. The following code shows how to access nibbles in an array of structures:
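struct nibbles
{
    unsigned n0 : 4;
    unsigned n1 : 4;
    unsigned n2 : 4;
    unsigned n3 : 4;
};
unsigned get_nibble(struct nibbles words[], unsigned index)
{
    unsigned nibble = index % 4;    /* which field within the structure */
    index /= 4;                     /* which structure within the array */
    return nibble == 0 ? words[index].n0 : nibble == 1 ? words[index].n1
         : nibble == 2 ? words[index].n2 : words[index].n3;
}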
A similar put_nibble() function would be required, of course.
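A minimal sketch of what such a put_nibble() might look like follows. It assumes the same four-field structure as get_nibble() and repeats the definition so that it stands alone; the parameter names are illustrative.

```c
/* Four 4-bit values per structure, as used by get_nibble(). */
struct nibbles
{
    unsigned n0 : 4;
    unsigned n1 : 4;
    unsigned n2 : 4;
    unsigned n3 : 4;
};

/* Store the low 4 bits of value as nibble number index. */
void put_nibble(struct nibbles words[], unsigned index, unsigned value)
{
    unsigned nibble = index % 4;    /* which field within the structure */

    index /= 4;                     /* which structure within the array */
    value &= 0x0Fu;                 /* keep only the 4 bits that fit */
    switch (nibble)
    {
    case 0:  words[index].n0 = value; break;
    case 1:  words[index].n1 = value; break;
    case 2:  words[index].n2 = value; break;
    default: words[index].n3 = value; break;
    }
}
```

Neither function checks that index lies within the table; in a real application the caller would be responsible for that.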
The other way to code a solution would be to perform all the bit shifting explicitly in the code, which is really just emulating what the compiler might generate. It is unlikely that a human programmer could produce code substantially more efficient than a modern compiler.
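For comparison, the explicit approach might be sketched as follows, assuming two nibbles per byte with even-numbered entries in the low half. The function names and that ordering are choices made for this illustration, not anything a compiler is obliged to match.

```c
#include <stddef.h>
#include <stdint.h>

/* Read nibble number index from a byte array holding two nibbles per byte.
   Even indices use the low 4 bits, odd indices the high 4 bits. */
static unsigned get_nibble_shift(const uint8_t table[], size_t index)
{
    unsigned shift = (unsigned)(index % 2) * 4;

    return (table[index / 2] >> shift) & 0x0Fu;
}

/* Write the low 4 bits of value as nibble number index,
   leaving the other nibble in the same byte unchanged. */
static void put_nibble_shift(uint8_t table[], size_t index, unsigned value)
{
    unsigned shift = (unsigned)(index % 2) * 4;
    uint8_t *byte = &table[index / 2];

    *byte = (uint8_t)((*byte & ~(0x0Fu << shift)) | ((value & 0x0Fu) << shift));
}
```

A table of N values then needs only (N + 1) / 2 bytes, at the cost of a shift and a mask on every access.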
There is little a developer can do to improve speed of access to data beyond the optimization that the compiler does (that is, leaving data unpacked for fast access). But one option is to locate data in the fastest available memory. An embedded toolchain includes a linker, which will normally have the flexibility to effect this optimization. This opens up a few possibilities for consideration:
The fastest place to keep data is in a CPU register, but these are in short supply and should be used sparingly. Most compilers make smart choices for register optimization.
RAM is the fastest type of memory in most systems. Variables are naturally located in RAM, but it may be worthwhile to ensure that constant data is copied into RAM as well. This often happens automatically in systems where the program image is copied from flash to RAM for execution.
Microcontrollers typically have on-chip RAM, which is faster than external memory. So ensuring that speed-critical data is located there makes sense.
Memory contents are commonly held in an on-chip cache for fast access. Some CPUs permit locking of the cache so that its contents are always immediately available.
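How data is steered to a particular memory depends on the toolchain. As one hedged illustration, GCC and Clang let a variable be assigned to a named section, which the linker script can then map to on-chip RAM. The section name .fast_data, the FAST_DATA macro, and the table below are all hypothetical, and the matching linker script entry is assumed rather than shown.

```c
#include <stdint.h>

/* Place an object in a named section on GCC/Clang ELF toolchains.
   ".fast_data" is a hypothetical name; the linker script would need to
   map it to the fast memory region. Other compilers get the default. */
#if defined(__GNUC__) && defined(__ELF__)
#define FAST_DATA __attribute__((section(".fast_data")))
#else
#define FAST_DATA
#endif

/* Hypothetical speed-critical table, requested to live in fast memory. */
FAST_DATA uint16_t filter_coefficients[8] = { 1, 2, 4, 8, 8, 4, 2, 1 };

/* Ordinary code uses the table exactly as it would any other array. */
uint32_t sum_coefficients(void)
{
    uint32_t sum = 0;

    for (unsigned i = 0; i < 8; i++)
    {
        sum += filter_coefficients[i];
    }
    return sum;
}
```

The source code is unaffected by where the table ends up; only the linker configuration decides which physical memory holds it.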
Embedded software developers are always interested in the efficient use of resources. Careful coding and use of compiler optimizations can ensure that code is optimal for a given application. To complete the job, a similar approach must be taken to the handling of data, where the balance of memory footprint against access time needs careful consideration.
About the author:
Colin Walls is an embedded software technologist with the Mentor Graphics Embedded Software Division – www.mentor.com/embedded-software and is based in the UK. He may be reached by email at email@example.com