using the preprocessor to do something N times

topic posted Sun, July 30, 2006 - 12:27 PM by  Jason
Is there a C++ preprocessor directive to repeat something a number of times?

I am doing some loop unrolling and I would like to do something like this (making up my own preprocessor command :P)

#repeat 16
x += p[a++];
#endrepeat
posted by:
Jason
  • Not that I know of. You could make an inline function or macro to simplify the code but you'll still have to call the function or macro 16 times.

    I wonder how much of a performance increase loop unrolling of this type would yield since (I believe) nearly all modern CPU architectures have good branch prediction/multiple execution pipelines for simple loops like this.
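On the macro idea above: there is no built-in `#repeat`, but a small family of doubling macros can stand in for it. The following is only a sketch; the `REPEAT_n` names and the `sum16` function are made up for illustration and are not part of the language or of any library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper macros (not standard): each level pastes the
// previous level twice, so REPEAT_16 expands to sixteen copies of stmt.
#define REPEAT_2(stmt)  stmt stmt
#define REPEAT_4(stmt)  REPEAT_2(stmt) REPEAT_2(stmt)
#define REPEAT_8(stmt)  REPEAT_4(stmt) REPEAT_4(stmt)
#define REPEAT_16(stmt) REPEAT_8(stmt) REPEAT_8(stmt)

// Sums exactly 16 elements starting at p (illustrative function).
int sum16(const int* p)
{
    assert(p != nullptr);
    int x = 0;
    std::size_t a = 0;
    REPEAT_16(x += p[a++];)   // the statement from the question, 16 times
    return x;
}
```

The statement passed to the macro must not contain a top-level comma, since the preprocessor would read that as an argument separator.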
    • You might be surprised. Unrolling the following naive loop:

      x = 0;
      for (int a = 0; a < READ_UNROLL_BLOCKSIZE; a++)
      {
          x += p[a];
      }

      to the following:
      int a;
      x = 0;

      // we will use 16-fold unroll
      int blockLimit = READ_UNROLL_BLOCKSIZE & ~15;
      for (a = 0; a < blockLimit; a += 16)
      {
          x += p[a + 0];
          x += p[a + 1];
          x += p[a + 2];
          x += p[a + 3];
          x += p[a + 4];
          x += p[a + 5];
          x += p[a + 6];
          x += p[a + 7];
          x += p[a + 8];
          x += p[a + 9];
          x += p[a + 10];
          x += p[a + 11];
          x += p[a + 12];
          x += p[a + 13];
          x += p[a + 14];
          x += p[a + 15];
      }

      // process the remaining reads
      for (a = blockLimit; a < READ_UNROLL_BLOCKSIZE;)
      {
          x += p[a++];
      }
      results in a significant performance increase. If we run the routine several million times to get a good sample, we'd see numbers like this:

      Running : Loop Unroll Read
      Running unoptimized version
      Elapsed time [6.33] seconds
      Running optimized version
      Elapsed time [3.17] seconds

      During the unrolled loop we do 16 read operations before we hit the loop condition. This allows the processor to do a lot more pipelining. Unrolling more than 16-fold appears to yield negligible further improvement, however.
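For anyone who wants to repeat the comparison, here is a rough sketch of a harness. The function names and the `time_seconds` helper are invented for this example, and the timings above came from the poster's own setup, not from this code. Note that an optimizing compiler may unroll or vectorize the naive loop on its own, so results depend heavily on compiler and flags.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Naive version: one loop test per element read.
long long sum_naive(const int* p, std::size_t n)
{
    long long x = 0;
    for (std::size_t a = 0; a < n; ++a)
        x += p[a];
    return x;
}

// 16-fold unrolled version: one loop test per 16 reads, then a tail loop.
long long sum_unrolled16(const int* p, std::size_t n)
{
    long long x = 0;
    const std::size_t blockLimit = n & ~static_cast<std::size_t>(15);
    std::size_t a = 0;
    for (; a < blockLimit; a += 16)
    {
        x += p[a + 0];  x += p[a + 1];  x += p[a + 2];  x += p[a + 3];
        x += p[a + 4];  x += p[a + 5];  x += p[a + 6];  x += p[a + 7];
        x += p[a + 8];  x += p[a + 9];  x += p[a + 10]; x += p[a + 11];
        x += p[a + 12]; x += p[a + 13]; x += p[a + 14]; x += p[a + 15];
    }
    for (; a < n; ++a)          // process the remaining reads
        x += p[a];
    return x;
}

// Hypothetical timing helper: calls f(p, n) `repetitions` times and
// returns the elapsed wall-clock seconds. The results are accumulated
// into `sink` so the calls are not trivially discarded.
template <typename F>
double time_seconds(F f, const int* p, std::size_t n,
                    int repetitions, long long& sink)
{
    assert(repetitions >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r)
        sink += f(p, n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_seconds(sum_naive, p, n, reps, sink)` and `time_seconds(sum_unrolled16, p, n, reps, sink)` on the same buffer gives the two figures to compare.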
      • btw I was thinking of using a macro like:

        #define BODY x += p[a++]

        then in the loop body doing:
        BODY;
        BODY;
        BODY;
        etc...

        Turns out that this kills the pipelining, since the processor must increment a before issuing the next read to the memory controller. The best speed I get comes from hardcoding the offsets, as in the code in my previous post.
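A possible middle ground is a macro that takes the offset as a parameter, so each read uses a constant offset from `a` and `a` itself only changes once per block. This is a sketch under that assumption; the parameterized `BODY(i)` and the 4-fold block size are chosen here for brevity, and the thread does not report timings for this variant.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical variant of BODY: the offset is a constant argument,
// so `a` is not modified between the individual reads.
#define BODY(i) x += p[a + (i)]

// Sums n elements in blocks of 4; n must be a multiple of 4 here
// (a tail loop, as in the earlier post, would lift that restriction).
int sum_blocks_of_4(const int* p, std::size_t n)
{
    assert(n % 4 == 0);
    int x = 0;
    for (std::size_t a = 0; a < n; a += 4)
    {
        BODY(0);
        BODY(1);
        BODY(2);
        BODY(3);
    }
    return x;
}
```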

  • Jason,

    Yes, you can do this with templates. It's called metaprogramming, and it has its own slew of weirdnesses, and like any tool, it can be used to do cool things (good), yet also be used to write code that is absolute hell to figure out (evil). Basically, with templates, you can "run" code at compile time, without ever "running" the compiled code.

    Here is a URL that describes some of what is possible with template metaprogramming:

    osl.iu.edu/~tveldhui/pa...meta-art.html
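As a rough illustration of the idea (not taken from the linked article), a recursive template can expand a fixed number of reads at compile time. The `Unroll` struct and `sum16_templates` function are hypothetical names for this sketch. Whether the compiler actually flattens the recursion into straight-line code depends on its inlining decisions, so it is worth checking the generated code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: Unroll<N>::sum(p) is instantiated recursively
// at compile time and computes p[0] + p[1] + ... + p[N - 1].
template <std::size_t N>
struct Unroll
{
    static int sum(const int* p)
    {
        return Unroll<N - 1>::sum(p) + p[N - 1];
    }
};

// Specialization that terminates the recursion.
template <>
struct Unroll<0>
{
    static int sum(const int*) { return 0; }
};

// Sums exactly 16 elements; no runtime loop is written in the source.
int sum16_templates(const int* p)
{
    assert(p != nullptr);
    return Unroll<16>::sum(p);
}
```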

    Regards,

    John

    Falling You - exploring the beauty of voice and sound
    www.fallingyou.com