Compression for null string column

Tagged: 4.0, kdb


Posted by eohara_kdb on July 3, 2024 at 11:16 am

Hi,

We’re seeing compressed null string columns take up more space on disk than expected. Would anyone be able to shed some light on this behaviour?

Example:

q)n:10000000;tab:([]time:n#.z.p;val:n?1000;str:n#enlist "");(`:tab/;17;2;5) set tab
`:tab/
q)-21!`:tab/str
compressedLength  | 14074225
uncompressedLength| 80004096
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 5i
q)-21!`$":tab/str#"
compressedLength  | 24189
uncompressedLength| 20004096
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 5i

According to this page, “the non-sharp file is a serialized q list of integers representing the lengths of each sublist of the original list.”

For a null string column we’d expect the non-sharp file to just contain zeroes, which should compress better than what we’re seeing.
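
To make the comparison easier to repeat, the two `-21!` calls can be wrapped in a small helper that also computes the ratio. This is only a sketch: `stats` is a hypothetical name, the path is the one from the example above, and it assumes both files are compressed so that `-21!` returns the fields shown.

```q
/ stats: hypothetical helper, returns the -21! fields for one file plus a ratio
stats:{[f] s:-21!f; s,enlist[`ratio]!enlist s[`uncompressedLength]%s`compressedLength}

/ apply to the column file and its '#' companion; the result is one row per file
stats each `:tab/str,`$":tab/str#"
```

A higher `ratio` means better compression, so the two files can be compared at a glance across kdb+ versions.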

Using 4.0 2020.06.18

Thanks,

Eoghan



rocuinneagain

Member

July 3, 2024 at 12:07 pm

Can you test against a newer version of 4.0?

My 4.1 gets much improved numbers:

q)(.z.K;.z.k)
4.1
2024.04.29
q)n:10000000;tab:([]time:n#.z.p;val:n?1000;str:n#enlist "");(`:tab/;17;2;5) set tab
`:tab/
q)-21!`:tab/str
compressedLength  | 136807
uncompressedLength| 80004096
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 5i
//Your compression 5.6x
q)80004096%14074225
5.684441
//Compression now 584x
q)80004096%136807
584.7953
q)-21!`$":tab/str#"
compressedLength  | 93
uncompressedLength| 4098
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 5i

rocuinneagain

Member

July 3, 2024 at 12:20 pm

I expect the improvement first appears in the 4.0 build covered by this README entry for 2022.04.15:

2022.04.15
NEW
anymap write now detects consecutive deduplicated (address matching) toplevel objects, skipping them to save space
 q)a:("hi";"there";"world");`:a0 set a;`:a1 set a@where 1000 2000 3000;(hcount`$":a0#")=hcount`$":a1#"
improved memory efficiency of writing nested data sourced from a type 77 file, commonly encountered during compression of files. e.g.
 q)`:a set 500000 100#"abc";system"ts `:b set get`:a" / was 76584400 bytes, now 8390720
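
A quick way to check whether a given process is at or past that release is to compare `.z.K` and `.z.k` against the README date. This is a sketch only: `hasChange` is a hypothetical name, and it assumes that major versions after 4.0 also include the change.

```q
/ hasChange: hypothetical flag, true when this build is 4.0 dated 2022.04.15 or later,
/ or a later major version (assumed to carry the same change)
hasChange:(.z.K>4.0) or (.z.K=4.0) and .z.k>=2022.04.15
hasChange
```

Columns written by an older build would presumably need to be written again with the newer build before any size difference shows up on disk.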

eohara_kdb

Member
July 3, 2024 at 1:11 pm
Thanks @rocuinneagain, will test.

FYI we’d also tested changing the type from string to symbol. The symbol column compresses at the same ratio as in your example:
```
q)show c:count get`:eohara_dev/strCol
18809996
q)vals:sym?c#`
q)sym
`symbol$()
q)(`:eohara_dev/test;17;2;5)set vals
`:eohara_dev/test
q)-21!`:eohara_dev/test
compressedLength  | 257281
uncompressedLength| 150484064
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 5i
q)150484064%257281
584.9016
```
