I would like to try replacing the memory allocator with TBB's scalable memory allocator as detailed here:
http://www.threadingbuildingblocks.org/docs/help/tbb_userguide/Automical...
I would like to do this for allocations in offload regions. This is on Windows. What I've tried:
1) Adding
tbbmalloc_proxy.lib /INCLUDE:"__TBB_malloc_proxy"
to the link line. This clearly affects only the host allocations.
2) Adding
-ltbbmalloc_proxy -ltbbmalloc
to the offload linker options. I also had to copy the .so's to the MIC and put them in /usr/lib64.
Memory allocation still seems to be slow, although I'm not sure how to definitively tell if I'm actually using the TBB allocator. Is there anything else that I need to do?
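One rough way to check whether the library is at least loaded on the coprocessor side is to look for it in the process's memory map. The sketch below assumes the offloaded code runs under Linux with a readable /proc/self/maps; the helper names are hypothetical and not part of TBB. A match only shows that a library with that name is mapped into the process, not that every malloc call is actually routed through it.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if a /proc/<pid>/maps style listing
// mentions the given library name (for example "libtbbmalloc").
bool maps_text_mentions_library(const std::string& maps_text,
                                const std::string& library_name) {
    assert(!library_name.empty());
    return maps_text.find(library_name) != std::string::npos;
}

// Hypothetical helper: reads /proc/self/maps (assumed to exist, Linux only)
// and checks whether the named library is mapped into the current process.
// Returns false if the file cannot be opened.
bool process_maps_mention_library(const std::string& library_name) {
    std::ifstream maps("/proc/self/maps");
    if (!maps) {
        return false;
    }
    std::ostringstream contents;
    contents << maps.rdbuf();  // copy the whole listing into a string
    return maps_text_mentions_library(contents.str(), library_name);
}
```

Calling process_maps_mention_library("libtbbmalloc") from inside an offload region and reporting the result back to the host would then give a first indication of whether the library was picked up there.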