-
Loading data for combinations of instruments and date ranges (rolled series)
I am trying to come up with an efficient way to load data from disk, for multiple instruments over different date ranges. The actual problem we’re trying to solve is loading rolled series for futures, forwards, and other derivatives.
Say we want to load data for the following instruments and date ranges (spec)
inst startDate endDate A 2022.01.01 2022.03.31 B 2022.04.01 2022.06.30 C 2022.07.01 2022.08.31 One way of doing this, following the usual date-then-sym query pattern would be:
minDate: min exec startDate from spec; maxDate: max exec endDate from spec; insts: exec inst from spec;
result: select from trades where date within (minDate; maxDate), sym in insts;
And then perhaps join result and spec so that we can filter each instrument to its specific date range.
The problem with this is that we end up loading much more data than we care about, for example we are loading data for B from well before 2022.04.01. I imagine KDB is also wasting time scanning for instruments in historical dates that likely have no data at all for that instrument.
For our datasets, we’ve found the following to be much faster:
loadContract: {[x] select from trades where date within (x[`startDate]; x[`endDate]), sym=x[`inst]}; result: raze loadContract each spec
We now only load data for a given inst for its relevant dates.
Can we do better than this?
I was reading up on segmented databases, and that queries can be parallelised over how partitions are divided over the segments. I am concerned that by looping our queries rather than providing all the information upfront, we may lose out on this and other under-the-hood efficiencies/optimisations.
I had considered a table lookup, e.g.
where ([] date; sym) in ...
but have found in the past these don’t leverage the partitioned/parted attributes.
Many thanks.
Log in to reply.