KX Community


  • Does KDB have string data type? What is a string in KDB?

    Posted by MilanGill on February 25, 2025 at 12:20 pm

    Hi,

    I am trying to understand what is meant by the concept of a String in KDB.

    Initially when I read the documentation on datatypes, which is related to the serialization and deserialization of data to/from disk or network sockets, I read that there is no String datatype in KDB.

    Please find the referenced page below.

    https://code.kx.com/q/basics/datatypes/

    There is no string datatype. On this site, string is a synonym for character vector (type 10h). In q, the nearest equivalent to an atomic string is the symbol.

    I understand that Symbols are atomic, and interned, and that these data types are distinct from Strings.

    My question is really in two parts. Firstly what is a String?

    • Is there nothing more to it than “a string is the same as a list of characters”?
    • This seems unlikely, given that a string could be Unicode. Since a character is an 8-bit piece of data, a single character clearly cannot hold every Unicode code point. So it would seem there should be some distinction?
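To make the 8-bit point concrete, here is a minimal Python sketch (Python only stands in for the client language). It assumes the text is UTF-8 encoded before being stored as a char vector, which is a common convention but an assumption here, since nothing in a char vector records its encoding. Under that assumption a non-ASCII character simply occupies more than one element of the vector.

```python
# Assumption: text is UTF-8 encoded before it is stored as a q char vector.
text = "h\u00e9llo"              # "héllo", with a single precomposed e-acute
raw = text.encode("utf-8")       # the bytes a char vector (type 10h) would hold

code_points = len(text)          # number of Unicode code points
char_count = len(raw)            # number of 8-bit chars in the vector

# Decoding reverses the step; the client has to pick the encoding itself.
round_trip = raw.decode("utf-8")

print(code_points, char_count, round_trip)
```

The two counts differ because the accented letter takes two bytes in UTF-8, so the length of a char vector is a byte count rather than a character count.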

    Secondly:

    • In relation to serialization, KDB defines “char” (atom), “list of char”, “symbol” (atom) and “list of symbol”. It does not discuss strings. I wonder if anyone can comment on that?
    • Finally, assuming that a String is the same as a list of char, it would seem slightly strange that there is no datatype for list of string, in other words list of list of char. As far as I can see there are 5 things which can be serialized, rather than 4. Those things are: char, list of char, string, symbol and list of symbol. I am confused further by this because the serialization format for a list of char is the same as a symbol, except for the code used for the datatype (10 for list of char, -11 for symbol). From the point of view of some other application which deserializes data which has been serialized by KDB, I struggle to understand how a Symbol would differ from a list of char. The fact that Symbols are atomic and interned is an internal implementation detail of KDB. This detail is not relevant once the data has been serialized to a file as a stream of bytes.

    Sorry, my question perhaps isn’t particularly coherent. I’m just a bit confused by various factors.

    To provide some context on what I am actually trying to achieve – I am writing a serialization and deserialization library for another language which is not directly supported by KDB.

    This is why I am looking in detail at how KDB performs serialization and what the precise semantics are.

    Thank you in advance for any feedback and comments.

  • 2 Replies
  • ss1

    Member
    February 25, 2025 at 1:32 pm

    Consider the output of type on a list of longs, a list of chars (string), a symbol, etc.:

    q)type 1 2 3 
    7h
    q)type "abcde"
    10h
    q)type `abcde
    -11h
    q)type "a"
    -10h
    q)type 1
    -7h
    q)type `abc`def
    11h

    Vectors show positive type values: 7h (vector of long) and 10h (vector of char). Atoms show negative values: a single symbol is -11h, changing to 11h for a vector of symbols. There is no ‘string’ type, but the term is often used when talking about type 10h (as seen above).

    This may be of some relevance: Strings

    Symbols are interned. Imagine sending a symbol to another process over IPC. The receiving process may never have seen that symbol before, so it needs to be told what the symbol represents in order to intern it as well. That is why, in your example, the IPC representation of a symbol looks similar to that of a char vector: it carries the characters so the other process knows what the symbol represents and that it should intern it as a symbol.
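For anyone writing a client library, the two wire representations can be sketched in Python as below. This follows the layout as I understand it from the KX serialization knowledge-base page: an 8-byte header (endianness, message type, compressed flag, reserved byte, total length), then a type byte. The function names are hypothetical, UTF-8 is an assumed encoding, and the sketch should be checked against the output of `-8!` on a real q process before being relied on.

```python
import struct

def encode_char_vector(text: str) -> bytes:
    """Char vector: type byte 10, attribute byte, int32 length, raw bytes."""
    data = text.encode("utf-8")  # assumed encoding, not stated in the format
    return struct.pack("<bBi", 10, 0, len(data)) + data

def encode_symbol(name: str) -> bytes:
    """Symbol atom: type byte -11, then the bytes ended by one zero byte."""
    return struct.pack("<b", -11) + name.encode("utf-8") + b"\x00"

def with_header(payload: bytes, msg_type: int = 0) -> bytes:
    """Prefix the 8-byte header: little-endian flag, message type,
    compressed flag, reserved byte, total message length."""
    return struct.pack("<BBBBi", 1, msg_type, 0, 0, 8 + len(payload)) + payload

char_msg = with_header(encode_char_vector("abc")).hex()
sym_msg = with_header(encode_symbol("abc")).hex()
print(char_msg)
print(sym_msg)
```

In this sketch the two payloads differ in more than the type byte: the char vector carries an attribute byte and an explicit length, while the symbol is terminated by a zero byte. A decoder therefore has to branch on the type byte before it knows how to find the end of the text.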

    • MilanGill

      Member
      February 26, 2025 at 10:22 am

      On the subject of interning, does the fact that KDB interns Symbols suggest that all processes it is communicating with should also intern strings?

      I lean slightly towards “yes” on the basis that these datatypes would be expected to have similar performance in both systems.

      However, consider the serialization format alone. It is independent of the implementation detail. Whether or not strings (symbols) are interned is a system implementation detail. A KDB “system” interns them. Some other system might not.

      The serialization format does not contain any information about whether or not strings (symbols) are interned, or atomic. It contains a tag (number) followed by some data.
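A hedged Python sketch of the receiving side illustrates the "tag followed by data" point. The decoder below handles only a char vector and a symbol atom without the message header, `decode_object` and its `intern_symbols` flag are hypothetical names, and the byte layout is assumed from the KX serialization knowledge-base page. Whether to intern is decided entirely by the decoder.

```python
import sys

def decode_object(buf: bytes, intern_symbols: bool = True):
    """Decode one serialized object (no 8-byte message header).

    Returns a (kind, text) pair. Only type 10 (char vector) and
    type -11 (symbol atom) are handled in this sketch.
    """
    type_byte = int.from_bytes(buf[0:1], "little", signed=True)
    if type_byte == 10:
        # byte 1 is the attribute byte; bytes 2..5 hold the int32 length
        n = int.from_bytes(buf[2:6], "little", signed=True)
        return ("chars", buf[6:6 + n].decode("utf-8"))
    if type_byte == -11:
        # symbol text runs up to the first zero byte
        end = buf.index(b"\x00", 1)
        name = buf[1:end].decode("utf-8")
        # Interning is a choice made here, not something the bytes require.
        return ("symbol", sys.intern(name) if intern_symbols else name)
    raise ValueError(f"unsupported type byte {type_byte}")

chars = decode_object(bytes.fromhex("0a0003000000616263"))
sym_a = decode_object(bytes.fromhex("f561626300"))
sym_b = decode_object(bytes.fromhex("f561626300"))
plain = decode_object(bytes.fromhex("f561626300"), intern_symbols=False)
print(chars, sym_a, plain)
```

Both settings of the flag yield equal values from the same bytes. The only difference is whether repeated symbols share one object in memory, which is a performance decision for the receiving system.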

      KDB could release a new version of their software tomorrow, and decide not to intern Symbols. They could keep the same serialization format.

      This suggests leaning towards answering “no” to the above question.

      Any thoughts?
