KX Community


  • Compression for null string column

    Posted by eohara_kdb on July 3, 2024 at 11:16 am

    Hi,

    We’re seeing compressed null string columns take up more space on disk than expected. Would anyone be able to shine some light on this behaviour?

    Example:

    q)n:10000000;tab:([]time:n#.z.p;val:n?1000;str:n#enlist "");(`:tab/;17;2;5) set tab
    `:tab/
    q)-21!`:tab/str
    compressedLength | 14074225
    uncompressedLength| 80004096
    algorithm | 2i
    logicalBlockSize | 17i
    zipLevel | 5i
    q)-21!`$":tab/str#"
    compressedLength | 24189
    uncompressedLength| 20004096
    algorithm | 2i
    logicalBlockSize | 17i
    zipLevel | 5i

    According to this page, “the non-sharp file is a serialized q list of integers representing the lengths of each sublist of the original list.”

    For a null string column we’d expect the non-sharp file to just contain zeroes, which should compress better than what we’re seeing.
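
    For reference, a minimal sketch of how the ratio for each of the two files could be read off the -21! output. It assumes the `:tab/ directory written in the example above; ratio is a hypothetical helper name, not a built-in.

    ```q
    / hypothetical helper: uncompressed-to-compressed ratio for one file,
    / taken from the dictionary that -21! returns
    ratio:{[f] s:-21!f; s[`uncompressedLength]%s`compressedLength}

    / apply to the non-sharp file and the sharp file of the str column
    ratio each `$(":tab/str";":tab/str#")
    ```

    The file names are built from strings because `:tab/str# cannot be written as a symbol literal, which is also why the example above uses `$":tab/str#".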

    Using 4.0 2020.06.18

    Thanks,

    Eoghan

  • 3 Replies
  • rocuinneagain

    Member
    July 3, 2024 at 12:07 pm

    Can you test against a newer version of 4.0?

    My 4.1 gets much improved numbers:

    q)(.z.K;.z.k)
    4.1
    2024.04.29
    q)n:10000000;tab:([]time:n#.z.p;val:n?1000;str:n#enlist "");(`:tab/;17;2;5) set tab
    `:tab/
    q)-21!`:tab/str
    compressedLength | 136807
    uncompressedLength| 80004096
    algorithm | 2i
    logicalBlockSize | 17i
    zipLevel | 5i
    //Your compression 5.6x
    q)80004096%14074225
    5.684441
    //Compression now 584x
    q)80004096%136807
    584.7953
    q)-21!`$":tab/str#"
    compressedLength | 93
    uncompressedLength| 4098
    algorithm | 2i
    logicalBlockSize | 17i
    zipLevel | 5i
  • rocuinneagain

    Member
    July 3, 2024 at 12:20 pm

    I expect the improvement comes from this entry in the 4.0 README for 2022.04.15, so that is the version from which you should start to see it:


    2022.04.15
    NEW
    anymap write now detects consecutive deduplicated (address matching) toplevel objects, skipping them to save space
    q)a:("hi";"there";"world");`:a0 set a;`:a1 set a@where 1000 2000 3000;(hcount`$":a0#")=hcount`$":a1#"
    improved memory efficiency of writing nested data sourced from a type 77 file, commonly encountered during compression of files. e.g.
    q)`:a set 500000 100#"abc";system"ts `:b set get`:a" / was 76584400 bytes, now 8390720
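
    A rough way to check whether a running process is at or past that README entry is to compare .z.K and .z.k, the same values printed in the earlier reply. This is only a sketch: hasEntry is a hypothetical helper name, and it assumes that any version newer than 4.0 also carries the change, which the README excerpt itself does not confirm.

    ```q
    / hypothetical helper: true for a 4.0 build dated 2022.04.15 or later,
    / or for any later version (assumption: later versions include the change)
    hasEntry:{[ver;rel] (ver>4.0) or (ver=4.0) and rel>=2022.04.15}

    / check the current process
    hasEntry[.z.K;.z.k]
    ```

    With the builds mentioned in this thread, hasEntry[4.0;2020.06.18] gives 0b and hasEntry[4.1;2024.04.29] gives 1b.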
  • eohara_kdb

    Member
    July 3, 2024 at 1:11 pm

    Thanks @rocuinneagain , will test.

    FYI, we also tested changing the type from string to symbol. The symbol column compresses at the same ratio as in your example:

    q)show c:count get`:eohara_dev/strCol
    18809996
    q)vals:`sym?c#`
    q)sym
    `symbol$()
    q)(`:eohara_dev/test;17;2;5)set vals
    `:eohara_dev/test
    q)-21!`:eohara_dev/test
    compressedLength | 257281
    uncompressedLength| 150484064
    algorithm | 2i
    logicalBlockSize | 17i
    zipLevel | 5i
    q)150484064%257281
    584.9016
