  • 01HNNWZ0MV43FF 28 minutes

    This is good data, but I'm not sure what the actionable takeaway is for me as a Grug Programmer.

    Does it mean that if I'm doing very light processing (sums), I should try to move that to structure-of-arrays to take advantage of the cache? And if I'm doing something very expensive, I can leave it as array-of-structures, since the computation will dominate the memory access in an Amdahl's Law analysis?

    This data should tell me something about organizing my data and accessing it, right?
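
A rough way to frame the Amdahl's Law question above is the following sketch. The function name `amdahl_speedup`, the memory-time fractions, and the 3x layout speedup are all hypothetical values chosen for illustration; none of them come from the linked benchmark.

```python
def amdahl_speedup(memory_fraction: float, memory_speedup: float) -> float:
    """Overall speedup when only the memory-bound share of runtime gets faster.

    memory_fraction: share of total runtime spent waiting on memory (0..1).
    memory_speedup:  how much faster that share becomes, e.g. after AoS -> SoA.
    """
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be between 0 and 1")
    if memory_speedup <= 0.0:
        raise ValueError("memory_speedup must be positive")
    return 1.0 / ((1.0 - memory_fraction) + memory_fraction / memory_speedup)


# Hypothetical numbers: assume a layout change makes memory access 3x faster.
light = amdahl_speedup(0.90, 3.0)  # light processing (sums): mostly memory-bound
heavy = amdahl_speedup(0.05, 3.0)  # expensive processing: mostly compute-bound

print(f"light workload: {light:.2f}x overall")  # 2.50x
print(f"heavy workload: {heavy:.2f}x overall")  # 1.03x
```

Under these assumed numbers the layout change is worth it for the light workload and barely visible for the heavy one, which matches the intuition in the question. The real fractions have to be measured.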

  • aapoalas 3 minutes

    Would kernel huge pages possibly have an effect here also?

  • smj-edison 17 seconds

    Side note, but this product looks really cool! I have a fundamental mistrust of all boolean operations, so to see a system that actually works with degenerate cases correctly is refreshing.

  • PhilipTrettner 3 days

    I looked into this because part of our pipeline is forced to be chunked. Most advice I've seen boils down to "more contiguity = better", but without numbers, or at least not generalizable ones.

    My concrete tasks already reach peak performance below a 128 kB chunk size, and I couldn't find pure processing workloads that benefit significantly beyond a 1 MB chunk size. The code is linked in the post; it would be nice to see results on more systems.
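
For readers who want the shape of such an experiment, here is a minimal sketch of a chunked reduction timed across chunk sizes. It is not the benchmark linked in the post; the helper names (`make_chunks`, `chunked_sum`, `time_chunk_sizes`) are hypothetical. In Python, interpreter overhead will likely mask cache effects, so treat this as an outline of the structure rather than a measurement tool.

```python
import time
from array import array


def make_chunks(values, chunk_elems):
    """Split values into separately allocated array('d') chunks."""
    return [array("d", values[i:i + chunk_elems])
            for i in range(0, len(values), chunk_elems)]


def chunked_sum(chunks):
    """Reduce across all chunks, one contiguous chunk at a time."""
    total = 0.0
    for chunk in chunks:
        total += sum(chunk)
    return total


def time_chunk_sizes(values, chunk_bytes_list):
    """Return {chunk_bytes: (total, seconds)} for each requested chunk size."""
    item_size = array("d").itemsize  # bytes per element, typically 8
    results = {}
    for chunk_bytes in chunk_bytes_list:
        chunk_elems = max(1, chunk_bytes // item_size)
        chunks = make_chunks(values, chunk_elems)
        start = time.perf_counter()
        total = chunked_sum(chunks)
        results[chunk_bytes] = (total, time.perf_counter() - start)
    return results
```

A call such as `time_chunk_sizes(values, [4096, 131072, 1048576])` gives one timing per chunk size. A compiled language and repeated runs would be needed before drawing conclusions.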

    twoodfin 5 hours

    Your results match similar analyses of database systems I’ve seen.

    64KB-128KB seems like the sweet spot.

  • _zoltan_ 2 hours

    is this an attempt at nerd sniping? ;-)

    on GPU databases sometimes we go up to the GB range per "item of work" (input permitting) as it's very efficient.

    I need to add it to my TODO list to have a look at your GitHub code...

    PhilipTrettner 2 hours

    It definitely worked on me :)

    Do have a look, I've tried to keep it small and readable. It's effectively ~250 LOC.

    Also, this is CPU only. I'm not super sure what a good GPU version of my benchmark would be, though ... Maybe measuring a "map" more than a "reduction" like I do on the CPU? We should probably take a look at common chunking patterns there.