
  • Large Scale WindowJoins Questions

    Posted by Laura on November 30, 2022 at 12:00 am

    Hello Fellow Qbies,

    I’m attempting to run a wj across a table of 1 million rows and 30 columns.

    It works, but it takes too long to run.

    What are some tips and tricks for dealing with large datasets and window joins?

  • rocuinneagain

    Member
    November 30, 2022 at 12:00 am

    The documentation does have a note on attributes to apply:

    https://code.kx.com/q/ref/wj/

    t and q are simple tables to be joined (q should be sorted `sym`time with `p# on sym)

    Have you applied this?
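
    For example, a minimal sketch (assuming a simple quote table q with sym and time columns, per the note above):

    q:`sym`time xasc q        / sort by sym,time
    q:update `p#sym from q    / apply the parted attribute on sym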

  • Laura

    Administrator
    November 30, 2022 at 12:00 am

    What if I’m not using a sym column? I’m going datetime to datetime for the window join.

    Would I get speed improvements by using a sym column?

  • rocuinneagain

    Member
    November 30, 2022 at 12:00 am

    So you are only joining on time windows? No other column?

    You could apply the sorted attribute.

    https://code.kx.com/q/ref/set-attribute/#sorted
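
    For example (assuming the join column is called dt and is already in ascending order):

    data:update `s#dt from data    / apply the sorted attribute to dt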

    The ‘datetime’ datatype is deprecated. ‘timestamp’ should be used instead.

    https://code.kx.com/q/basics/datatypes/#temporal
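
    A datetime column can be converted with something like this (dt as the column name is just an assumption):

    data:update "p"$dt from data    / "p"$ casts datetime to timestamp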

    If you can share a minimal reproducible example in q, that will help us answer your questions.

    https://stackoverflow.com/help/minimal-reproducible-example

  • Laura

    Administrator
    November 30, 2022 at 12:00 am

    I’ve tested it down to only one column, the timestamp, being used in the wj.

    Is there a way to enable threading or something? It’s around 12,000 entries per timestamp being read. How long should that take?

  • Laura

    Administrator
    November 30, 2022 at 12:00 am

    I have 800,000 rows of data in a table with the following meta:

    c              | t f a 
    ---------------| ----- 
    col1           | n     
    col2           | s     
    col3           | d     
    mmm3           | f     
    col4           | f     
    col5           | s     
    col6           | i     
    col7           | i     
    col8           | j     
    col9           | f     
    col10          | j     
    dt             | p   s 
    col11          | f     
    col12          | f     
    col13          | j     
    col14          | j     
    col15          | f     
    col16          | f     
    col17          | f     
    col18          | f     
    col19          | f     
    col20          | f     
    col21          | f     
    col22          | f     
    col23          | j     
    col24          | j

    I’m attempting to get a rolling 5-minute min/max using wj. (I recognize the window below isn’t exactly 5 minutes; I’m just writing up an example.)

    f:`dt; w:(-1000000;0)+\:data[`dt];                 / window offsets in nanoseconds (1ms lookback)
    data:wj[w;f;data;(data;(max;`mmm3);(min;`mmm3))];  / rolling min/max of mmm3
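
    (For an actual 5-minute lookback, the offsets would be timespans rather than raw nanosecond counts, e.g.:)

    w:(-0D00:05:00;0D00:00:00)+\:data[`dt];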

    The wj works just fine, but takes insanely long to complete.

  • rocuinneagain

    Member
    December 1, 2022 at 12:00 am

    On my laptop this took 765 milliseconds – what runtime are you seeing?

    data:([] dt:`s#.z.p+0D00:00:01*til 800000;mmm3:800000?1000.0)  / 800k rows at 1s intervals, sorted
    f:`dt; w:(-1000000;0)+\:data[`dt];
    \ts result:wj[w;f;data;(data;(max;`mmm3);(min;`mmm3))]
  • Laura

    Administrator
    December 1, 2022 at 12:00 am

    It took me significantly longer. But I’m also dealing with 30 columns; would that matter, even though I’m just using mmm3 for the wj?

    In testing, when I make the data table that I’m searching (the data in (data;(max;`mmm3))) smaller, things speed up. For example, I ran a 1-minute xbar on that table and the window join now takes a couple of seconds.
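
    (The pre-aggregation was along these lines; the mx/mn column names are just illustrative:)

    small:select mx:max mmm3,mn:min mmm3 by dt:0D00:01:00 xbar dt from data  / 1-minute buckets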

    I don’t understand what I could be doing wrong. At full scale, 800,000 rows by 30 columns, it took 30-40 minutes to complete a 5-minute lookback.

  • rocuinneagain

    Member
    December 2, 2022 at 12:00 am

    I added extra columns and it did not slow down the operation for me.
    Is your data in memory or on disk?

    data:([] dt:`s#.z.p+0D00:00:01*til 800000;mmm3:800000?1000.0)
    data:data,'flip (`$"col",/:string til 30)!30#enlist til 800000  / add 30 extra columns
    f:`dt; w:(-1000000;0)+\:data[`dt];
    result:wj[w;f;data;(data;(max;`mmm3);(min;`mmm3))]
  • Laura

    Administrator
    December 2, 2022 at 12:00 am

    I’m working in memory.

    I load the file in; it’s 800,000 x 7. Then I run a bunch of updates to make the remaining 23 columns (30 or so total).

    Then I run the wj and pass back the resulting table.

    I’m trying to simulate what it would be doing in a TP (tickerplant). As more data comes in, the wj is going to run slower and slower until it hits the max file size for the day, 800k rows. So I figured loading the whole file in and doing it all at once would be a decent enough way to test what it’d be doing.
