[refactor/fix] use proper tiling and tile order
1. Refactor the main function to match the intended benchmark interface.
2. Use the new tiling type to clean up the noprefetch version. Careful inspection uncovered some bad offset computations, which are fixed here.
3. Double-check the way threads are spawned; the new code should be straightforward.

I believe the code should now be easier to read and to play with. Converting the prefetch versions might not be as easy.