The advantage of random segmentation
Hey Everyone,
Keen for thoughts on this.
I’ve been setting up my HDBs to be segmented. I have a few different projects, with one parent drive hosting the home folders for the HDBs and four extra drives holding the segmented tables (no RAID, daily backups). I do this because I’ve got more data than I can fit in memory, so any gains from parallel reads/writes are worth it.
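For context, the layout is roughly this (paths are made up, purely for illustration): the HDB root on the parent drive holds the sym file, par.txt and the home folders, and par.txt just lists one segment directory per line, each on its own drive:

/ contents of /hdb/par.txt (hypothetical paths)
/mnt/drive1/seg
/mnt/drive2/seg
/mnt/drive3/seg
/mnt/drive4/seg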
Thing is, whenever I look at examples, the data always seems to be assigned to segments by grouping it in some non-arbitrary way, based on a feature of the data itself.
I think in reality this would almost always be the wrong approach (boom).
My thinking is that in order to get a gain from segmentation, you need to ensure your data is being accessed in parallel. So for any given dataset, you’d prefer it to be randomly distributed across segments. Segmenting based on an attribute of the data increases ‘clumpiness’, so the chance of one or more segments being hit more than the others goes up, and the number of parallel I/Os would, I reckon, decrease.

In practice, I pull my segment list from par.txt, randomly assign a segment label to each row of my data, then filter on that label and write each subset to its segment. Out of something like neat-freakishness I add the segment label as a column in the data, but frankly I don’t think I’ll ever use it (maybe some edge case when testing access speeds or something).
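Concretely, the write step looks something like this. A rough sketch in q rather than my exact code; the HDB root /hdb, the table trade, the date dt and the sym column are all just placeholders:

segs:read0 `:/hdb/par.txt                / segment roots, one per line in par.txt
n:count segs
dt:.z.D-1                                / the date partition being written

seg:(count trade)?n                      / random segment index for each row
trade:update seg:seg from trade          / keep the label as a column (probably never used)

/ write each random slice to <segment>/<date>/trade/, enumerating syms
/ against the parent HDB root so every segment shares the one sym file
{[i]
  slice:.Q.en[`:/hdb] select from trade where seg=i;
  (` sv hsym[`$segs i],(`$string dt),`trade`) set slice;
 } each til n

In practice I’d also sym xasc each slice and re-apply the p# attribute before writing (or use .Q.dpft), but that’s beside the point here.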
Anyway, am I missing something?
Simon