Part Number: TMS320C6678
Tool/software: TI C/C++ Compiler
Hi there,
I'm having performance issues with a function that runs fast when built without RTSC and slowly when built with RTSC.
In the first example, I link with the following C6678.cmd:
MEMORY
{
    SHRAM:          o = 0x0C000000  l = 0x00400000  /* 4MB Multicore shared Memory */
    CORE0_L2_SRAM:  o = 0x10800000  l = 0x00080000  /* 512kB CORE0 L2/SRAM */
    CORE0_L1P_SRAM: o = 0x10E00000  l = 0x00008000  /* 32kB CORE0 L1P/SRAM */
    CORE0_L1D_SRAM: o = 0x10F00000  l = 0x00008000  /* 32kB CORE0 L1D/SRAM */
    // goes on with CORE1-CORE7
}

SECTIONS
{
#ifdef CORE0
    .myfastsection > CORE0_L2_SRAM
    .text:optimized: load >> CORE0_L2_SRAM
    // goes on with other sections, all of them placed in L2SRAM
}
The corresponding functions are placed in .text:optimized using #pragma CODE_SECTION, and the arrays are placed in .myfastsection using #pragma DATA_SECTION and double-word aligned using #pragma DATA_ALIGN(., 2). The performance is very satisfying and, judging by the generated assembly code, the compiler seems to pipeline the code well.
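For illustration, the pragma usage described above might look like the following minimal sketch. The names `N`, `in_buf`, `out_buf`, and `scale_block` are placeholders and not the actual code. The sketch passes 8 to DATA_ALIGN on the assumption that the constant is a byte count, so that 8 corresponds to one double word; please check this against the compiler user's guide.

```c
#include <stddef.h>

#define N 1024  /* placeholder buffer length */

/* The pragmas are TI-specific; the guards only let this sketch
 * also build with a host compiler. */
#if defined(_TMS320C6X)
#pragma DATA_SECTION(in_buf, ".myfastsection")
#pragma DATA_ALIGN(in_buf, 8)   /* assumed: alignment given in bytes */
#pragma DATA_SECTION(out_buf, ".myfastsection")
#pragma DATA_ALIGN(out_buf, 8)
#endif
float in_buf[N];
float out_buf[N];

/* Place the hot function in the section that the linker command
 * file (or the cfg sectMap entry) maps to L2 SRAM. */
#if defined(_TMS320C6X)
#pragma CODE_SECTION(scale_block, ".text:optimized")
#endif
void scale_block(float *restrict dst, const float *restrict src,
                 float gain, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] = gain * src[i];   /* simple loop the compiler can pipeline */
    }
}
```

Whether the symbols actually ended up in L2 SRAM can be checked in the linker map file for both builds.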
In the second example, I add some RTSC components because I plan to use OpenMP (OMP) in another code section that is unrelated to the one above. However, with the same compiler optimization options, the performance of the function above deteriorates greatly: it runs at half the speed, measured with both TSCL and omp_get_wtime(). The generated assembly code for the function is identical. My first guess was that I'm doing something wrong with the memory sections. In my modified cfg file I added:
program.sectMap[".text:optimized"] = new Program.SectionSpec();
program.sectMap[".myfastsection"] = new Program.SectionSpec();
program.sectMap[".text:optimized"].loadSegment = "L2SRAM";
program.sectMap[".myfastsection"].loadSegment = "L2SRAM";
Shouldn't that be identical to the linker command file above? Is it also possible (and necessary) to partition the L2SRAM for the different cores as above? As long as I do not use any OMP in my code (even though I compile with RTSC components), the performance is fine. However, as soon as I use OMP in a different function, called after my initial function, the performance is halved. The initial function is called after omp_set_num_threads().
My second guess was that OMP introduces some overhead. However, I do not understand why since the initial function is totally unrelated to OMP. It would be helpful to get some additional insights here because in some cases it would be really useful to actually use OMP - but the performance degradation is not acceptable in our case.
NB: In the first case, code is loaded onto core0 only. In the second case (compilation with RTSC, no use of OMP in the code) and in the third case (compilation with RTSC, use of OMP in a different function), code is loaded onto all cores. The same optimizer flags are used in all cases. Arrays are double-word aligned and placed in L2SRAM in all cases. The functions are called 4 times in a row in all cases.
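The measurement described above (the function called 4 times in a row, timed with TSCL) can be sketched roughly as follows. This is a minimal sketch, not the actual benchmark: `time_calls`, `read_cycles`, `timer_start`, and `dummy_kernel` are hypothetical names, and the host fallback using `clock()` is only there so the code builds off-target. Recording each call separately makes it possible to see whether only the first call is slow or all four are.

```c
#include <stdio.h>
#if defined(_TMS320C6X)
#include <c6x.h>    /* declares the TSCL control register */
#else
#include <time.h>   /* host fallback only */
#endif

#define NUM_CALLS 4

typedef void (*kernel_fn)(void);

/* Placeholder kernel; stands in for the function under test. */
static volatile unsigned int dummy_calls = 0;
static void dummy_kernel(void) { dummy_calls++; }

static void timer_start(void)
{
#if defined(_TMS320C6X)
    TSCL = 0;   /* a write to TSCL starts the free-running counter */
#endif
}

static unsigned int read_cycles(void)
{
#if defined(_TMS320C6X)
    return TSCL;                    /* low 32 bits of the cycle counter */
#else
    return (unsigned int)clock();   /* not cycles; host stand-in only */
#endif
}

/* Time each of the NUM_CALLS back-to-back invocations separately. */
void time_calls(kernel_fn fn, unsigned int cycles[NUM_CALLS])
{
    int i;
    timer_start();
    for (i = 0; i < NUM_CALLS; i++) {
        unsigned int t0 = read_cycles();
        fn();
        cycles[i] = read_cycles() - t0;  /* unsigned math handles wrap */
    }
}

void print_cycles(const unsigned int cycles[NUM_CALLS])
{
    int i;
    for (i = 0; i < NUM_CALLS; i++) {
        printf("call %d: %u ticks\n", i, cycles[i]);
    }
}
```

Comparing the per-call numbers across the three cases (no RTSC, RTSC without OMP, RTSC with OMP) should show whether the slowdown is a one-time cost or affects every call.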
Please let me know if you need additional information. Thank you very much in advance.
Best wishes,
Idris