KX Community

Find answers, ask questions, and connect with our KX Community around the world.

Home Forums kdb+ What is the role played by key columns in a keyed table [ query join/performance ] ?

  • What is the role played by key columns in a keyed table [ query join/performance ] ?

    Posted by anand_kulkarni_ on May 24, 2022 at 12:00 am

    I would like to understand what is the underlying optimization or performance tweak that kdb+/q is able to extract when we apply key on one/more columns in the table. Key serves the purpose of helping join 2 or more tables via row to row match using the keys. Internally the keyed table is stored as a map [ key value pair ]. The keyed columns are the key and the rest of unkeyed columns make the payload/data tuple. since it is a map querying the map for a particular key must be faster than an unkeyed table. The keys are basically nodes on a BST ( binary search tree ) which should look up in O(log(n)) time given n is size of the table. An unkeyed table on the other hand is an aggregate of rows [ an array basically ]. Hence searching it or  querying should be O(n). But when we run a simple test with some data there is no difference in speed between querying unkeyed vs keyed table at all. Why is this ? In Fact the keyed table takes slightly more   time to  lookup a particular record. I do understand that the table may not be sorted in memory based on the keys but still with a dictionary type keyed table should a faster search be not possible ?

    anand_kulkarni_ replied 2 months ago 4 Members · 6 Replies
  • 6 Replies
  • rocuinneagain

    Member
    May 24, 2022 at 12:00 am

    You can see the speed improvement if you index rather than query using qsql:

    q)select from t1 where Name=`Anand 
    Name Age Salary 
    ---------------- 
    Anand 10 1000 
    
    q)select from t2 where Name=`Anand 
    Name | Age Salary 
    -----| ---------- 
    Anand| 10 1000 
    
    q)t2`Anand 
    Age   | 10 
    Salary| 1000 
    
    q)\ts do[100000;select from t1 where Name=`Anand] 
    164 1808 
    
    q)\ts do[100000;select from t2 where Name=`Anand] 
    174 1808 
    
    q)\ts do[100000;t2`Anand] 
    71 960

     

  • darrenwsun

    Member
    May 24, 2022 at 12:00 am

    With qsql, the full column is searched before the result is presented. With key lookup, the search stops when it finds the first match. This is where the performance gain from key lookup comes from.

    > The keys are basically nodes on a BST ( binary search tree ) which should look up in O(log(n)) time given n is size of the table

    This isn’t true. KDB’s keyed table, or more generally a plain dictionary, does not use hashing techniques like Java’s HashMap. Lookup is done by searching through the table/list linearly. If one wishes less time for the lookup, grouped (conceptually the same as index) or parted (conceptually a space-optimized index that assumes data are sorted) attributes help.

     

  • anand_kulkarni_

    Member
    May 24, 2022 at 12:00 am

    It’s amazing that lookup via key works faster but same lookup isn’t used as a hint when querying the table via qsql ? Sounds surprising behaviour actually and doesn’t look sensible.

  • rocuinneagain

    Member
    May 24, 2022 at 12:00 am

    For optimisations in qsql queries you should look to attributes.

    Here applying the unique attribute to the Name column:

    q)t3:update `u#Name from t1 
    q)\ts do[100000;select from t3 where Name=`Anand] 
    64 1808

     

     

  • darrenwsun

    Member
    May 24, 2022 at 12:00 am

    Regarding why q-sql doesn’t do key lookup internally if the search is on the key, short answer is that they are not equivalent. Consider a keyed table where there are duplicate keys (yes this is allowed in KDB, there is no such thing as “primary key constraint” in this world), q-sql may return a table of multiple rows, while the key lookup form returns a dictionary representing the first entry (you know why – the search stops at first matched entry)

  • pcarroll

    Member
    May 24, 2022 at 12:00 am

    Just wanted to codify some of the above:

    q)tab:([]sym:-50000?`6;px:50000?10) 
    q)ktab:`sym xkey tab // wanted to show the last record to convince that if kdb // has to search through entire list, no speed gain is evident 
    q)-1#tab 
    sym px 
    --------- 
    obafmn 6 
    
    q)-1#ktab 
    sym   | px 
    ------| -- 
    obafmn| 6 
    
    q)\ts do[100000;select from tab where sym=`obafmn] 
    2034 66240 
    
    q)\ts do[100000;select from ktab where sym=`obafmn] 
    2051 66240 // below no speed gain, but memory usage is reduced 
    
    q)\ts do[100000;ktab`obafmn] 
    2079 960 // however this does mean any other *valid* request would be faster 
    
    q)rand key ktab 
    sym| kfpfmd 
    
    q)\ts do[100000;ktab`kfpfmd] 
    1169 960 // showing point above with grouped attribute applied to keyed column 
    q)gktab:`sym xkey update `g#sym from tab 
    q)\ts do[100000;select from gktab where sym=`obafmn] 
    89 1808 // final point I wanted to make showing duplicate keys 
    q)`a`a!(1;1) 
    a| 1 
    a| 1

    Thanks  and  for contributing here.

Log in to reply.