Code Formatting

Posted by dhodgins on July 25, 2022 at 12:00 am
What does this statement mean? “can’t read it”

Let’s assume all readers of the code are developers.
- Are they all proficient enough with the language to read the entire .q file, even if it is formatted in the ‘perfect’ way? A good beginner developer will not learn the whole programming language; only a poor beginner developer will. Seems backwards, right? A good developer knows that ‘getting stuff done’ is what they are hired for, not furthering their own skillset. Of course, as they gain experience, the two developers with different approaches will converge in terms of language knowledge.
- Why would you want to read the whole file? We often forget this when we write code. Other people will need to read it, and they should not have to read all of it every time they want to find a bug or change behaviour. Imagine you are trying to fix the spelling of a log message: you only want to find the part of the code that generates the message, not read the whole file.
So we have broken “can’t read it” into two cases: the reader literally can’t read the code, or the coding style forces them to read more of it than they want to.

So we have two goals when formatting code:
- give structure to allow readers to find the section they are interested in
- make the reading of a section of code ‘easier’
Let’s consider one extreme, where a .q file is a single line 7000 characters long. I think it is safe to say such code is poorly formatted. How should we format the code? Let’s create a set of rules.

Rule 1 – Consistency
Ideally code should be autoformatted by the IDE. Please leave a comment if you know of an IDE that actually understands q syntax well enough to do this. Whatever rules we come up with, it is important for everyone on the team to follow the same rules, and for code to be rejected during code reviews if it does not follow the team’s rules.

Rule 2 – Each line should have a single statement

a:1;b:2; /This is 2 statements, and breaks this rule.

If the number of characters involved is less than say 15, and the variable names all have the same length then I have no objection to initialising variables on a single line like above. What I object to are statements like:

a:…………………….;b:3;

or

a:……………………;b:…………………………;c:………………………….;

or

abcdefghijk:3;e:4;bcd:5;

When I write code I write:

a:1;
b:2;

Rule 3 – Consistent use of ;
Either every line that isn’t to be printed ends in a ; or only the lines that need a ; get one. I have no real preference, and this normally boils down to whether the IDE people are using actually understands q, i.e.

a:1
b:2

If I select both lines and click execute it should work. The IDE should know that an unindented line is not a continuation of the previous line, and so should not need a ; at the end of each line. Sadly, most IDEs don’t, and so the rule is normally to add the ; i.e.

a:1;
b:2;

Rule 4 – Every branch gets its own line

if[….(….;…..)…….[…;…[…;..]..]…………………..;:…………..]; /This breaks the rule. An if statement splits the code into 2 branches (true case, false case) [let’s ignore signals].

To correctly format the statement the ‘then’ statement should be on its own line.

if[….(….;…..)…….[…;…[…;..]..]…………………..;
	:…………..];

This gives the reader 2 choices: do I read the condition, or do I read what happens if it is true? Quite often we can look at the ‘then’, see that it doesn’t interest us, and move on to the next line. For example, if we are debugging and there is a log line later on that means we didn’t return(:) from the function, we can ignore most of this code. Similarly, if the ‘then’ starts with variable42:…. and we are not interested in variable42, we can skip the line. And if we are looking for why the function returned earlier than expected, then the return(:) makes us much more interested in this if than in the others around it.

Rule 5 – Do not assign to variables inside branch conditions

if[….(….;…..)…….[…;…[.importantVariable:..;..]..]…………………..; :…………..];

Above is bad because it invalidates the assumptions of Rule 4. In Rule 4 we assumed that we could skip reading the condition based on its outcome, but here the condition itself has a side effect. Instead write it like this:

importantVariable:..;
if[….(….;…..)…….[…;…[.importantVariable;..]..]…………………..;
	:…………..];

One possible exception is if we name the condition. As it will be immediately after the ‘if[’ this seems acceptable, eg:

if[nameOfCond:…………………..;
	…………….]
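As a small illustration of Rules 4 and 5 together, here is a sketch with made-up names (checkOrder, qty, limit and total are not from any real codebase). It assumes the code is loaded from a script file, since the multi-line definition relies on indented continuation lines.

```q
/ Hypothetical example: checkOrder, qty, limit and total are illustrative names
/ Discouraged form, assignment hidden inside the condition:
/   if[limit<total:sum qty; :`rejected];
checkOrder:{[qty;limit]
	total:sum qty;
	if[total>limit;
		:`rejected];
	`accepted}

checkOrder[1 2 3;10]   / `accepted
checkOrder[5 6;10]     / `rejected
```

A reader scanning for the early return can find the line starting with : without reading the condition, and a reader looking for where total is set finds it on a line of its own.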

Rule 6 – Tabs are variable width so use spaces to align.
People often get into arguments about this. There is one, and only one way to use tabs and spaces when working in a team. Tabs indent, spaces align.

somevar:.[!]flip(
	(`key1 ;val1 );
	(`key11;val11));

Lines 2 and 3 are indented as they are a continuation of line 1. We indent with a tab, and the reader can configure their IDE to make the tab as wide as they like.
key1 and val1 are aligned with key11 and val11 using spaces to make them the same width. Using tabs here would look wrong depending on how the IDE treats tabs. You can indent with spaces but then you’ve taken choice away from the reader for no benefit.
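To try the example, val1 and val11 need values. The ones below are placeholders assumed purely for illustration, and the snippet is meant to be loaded from a script file so the indented lines are treated as continuations.

```q
/ val1 and val11 are placeholder values, assumed for illustration
val1:1
val11:11
somevar:.[!]flip(
	(`key1 ;val1 );
	(`key11;val11));
somevar~`key1`key11!1 11   / 1b
```

The padding spaces inside the brackets are ignored by the parser, so the alignment costs nothing at runtime.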

Rule 7 – Line length should be considered
Many people say lines should be no more than 80 characters long. These people are wrong. Firstly, you should not be printing out pages of code without permission as it poses security risks. I do not believe in setting a line length limit. It results in code that is harder to read. Instead you should consider each line on its own merits. Consider the following when judging a line:
- Does it break any other rules, such as Rule 4 about branches, or some of the later rules?
- How long are the variable names? If the line contains 5 variable names whose length is longer than 15 characters then the 80 character limit makes no sense. If your variable names are being shortened to single characters in order to meet an 80 character limit, then I just want to cry.
- What resolution monitors does the team use?
- What is the largest font size the team uses?
- Does the line have large sections within brackets? (These could be put on their own line.)
This is bad:

resultvar:select column1,column2,column3,column4,column5,column6,column7
	from table;

And should be:

resultvar:select column1,column2,column3,column4,column5,column6,column7 from table;

Rule 8 – spaces should be used consistently
Either you put spaces around every operator:

a : sqrt[(b * b) + c * c]

Or you don’t:

a:sqrt[(b*b)+c*c]

If your IDE allows you to configure the colour, size, and style of keywords, operators and brackets then removal of unnecessary spaces makes sense. If you are living with monochrome text then you may make other choices. Discuss with your team, but be careful to fully understand the desires of people that want the spaces. In my own experience, people that want spaces most often need to increase their font by three zoom levels and make it bold by default. Given that the width of the lines decreases without the spaces, increasing the font can be a reasonable trade-off for them.

Rule 9 – SQL should be formatted
I normally base this on complexity, eg this is fine:

select a,b from table;

I am not so happy with this; it doesn’t seem complex enough to split onto 2 lines:

select a,b
	from table

What about this?

select a,b,c:….,d:….. by ………… from table where ……………….

The above seems like it should probably be split like below:

select a,b,c:….,d:…..
	by …………
	from table
	where ……………….

Once we’ve split a query onto multiple lines, I expect to see the keywords at the start of each line, even though the ‘from table’ line seems wasteful.

Consider this:

select ab,bc,cd:….,def:…..,egh:………,fij,gklmn,hopq,ijk:……………… by ………… from table where ……………….

We should split the columns into groups. Adjacent column names which aren’t being modified can stay together, but columns that are modified should get their own lines because of the complexity of the statement.

select ab,bc,
	cd:….,
	def:…..,
	egh:………,
	fij,gklmn,hopq,
	ijk:………………
	by …………
	from table
	where ……………….

Alternatively you can have 1 column per line – depending on your preferences.
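A runnable sketch of the split layout, using a made-up trade table (the table, its columns and the name res are assumptions for illustration only). It is meant to be loaded from a script file, where the indented lines continue the statement.

```q
/ trade is a made-up table for illustration
trade:([]sym:`a`b`a`b;price:10 20 11 21f;size:100 200 300 400)

res:select lastPrice:last price,
	notional:sum price*size
	by sym
	from trade
	where size>100;

res~([sym:`a`b]lastPrice:11 21f;notional:3300 12400f)   / 1b
```

Each computed column and each keyword starts its own line, so a reader interested only in the where clause can jump straight to it.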

Rule 10 – Early return and changing order of evaluation
‘:’ is ‘early return’ from a function, it is not ‘return’. The following is badly formatted and slightly slower than it needs to be.

{a:1; :a*2;}

Code like this is typically written by people that think colon is ‘return’. It is not. This code should be written as:

{a:1; a*2}

q returns the result of the last expression (if the function body ends with a ; then the last expression is empty and the function returns the generic null, ::). We should learn this and use it.
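The three cases can be checked directly. This is a minimal sketch; f1, f2 and f3 are names chosen for the example.

```q
f1:{a:1; :a*2;}   / explicit early return
f2:{a:1; a*2}     / result of the last expression
f3:{a:1; a*2;}    / trailing ; so the last expression is empty

f1[]         / 2
f2[]         / 2
f3[]~(::)    / 1b, the generic null
```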

Similarly, round brackets are used to change the evaluation order of the statement. Don’t use them to make the code look ‘pretty’.
Prefer:

a:(b*b)+c*c;

Over:

a:(b*b)+(c*c);

Kdb has good, simple order of evaluation. Learn it. Use it.
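A quick check of the right-to-left rule, with b and c given sample values chosen for illustration:

```q
b:3
c:4
(b*b)+c*c     / 25
(b*b)+(c*c)   / 25, the second pair of brackets changes nothing
b*b+c*c       / 57, evaluated as b*(b+(c*c))
```

Only the first pair of brackets changes the result, which is why it is the only pair worth keeping.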

Rule 11 – Decide whether \d is evil or not
Most of the people I have worked with think that \d is evil incarnate and never use it. Your team, whatever it decides, should be consistent. Some of the reasons people don’t like \d are:
- system"d .dave"; doesn’t work as expected
- it is harder to search for function calls if they are not fully qualified
- you have to learn a whole set of scoping rules that you otherwise don’t need to know
- you have to learn what the heck `. `trade and `..trade mean, which you otherwise don’t need to know
- when reading the code you no longer know whether a variable is a local to the namespace (‘age’ vs ‘.dave.age’), a local to the function, or a global, without checking
- when debugging and you want to redefine a function in a namespace it now takes 3 lines:
  \d .thenamespace
  thefunc:newdef
  \d .
  Instead of one:
  .thenamespace.thefunc:newdef
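A small sketch of the scoping point, using a made-up namespace .dave with a made-up variable age and function getAge:

```q
\d .dave
age:42
getAge:{age}      / age here resolves to .dave.age
\d .

age:7             / a different, root-level variable
.dave.getAge[]    / 42, not 7

/ Fully qualified, a redefinition is one line and needs no \d
.dave.getAge:{.dave.age+1}
.dave.getAge[]    / 43
```

Reading {age} on its own, there is no way to tell which age is meant without knowing which \d was in force when the function was defined.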

Rule 12 – Decide whether csv adds value
Personally I don’t understand why csv was added to the language; it seems to have no positive attributes.

","0:table

Above is shorter than:

csv 0:table

csv needs explanation, and it hampers understanding of 0: which is already ridiculously overloaded and hence hard to understand. When you see "," you can assume you can use "|" instead. When you see csv you start hunting for pip or txt, which don’t exist, and then you look up 0: to see how to "|" delimit.
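A short check, using a made-up two-row table t, that csv is nothing more than the comma character and that any other delimiter works the same way:

```q
t:([]a:1 2;b:`x`y)   / made-up table for illustration
csv~","              / 1b
(csv 0:t)~","0:t     / 1b
"|"0:t               / ("a|b";"1|x";"2|y")
```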

Rule 13 – Prefer infix over functional forms
Don’t write this:

+[3;4]

Write this:

3+4

Rule 14 – Minimise the distance between open and close brackets
Kdb is easier to read than other languages when it comes to order of evaluation. Try to write your code to be simpler by reducing the number of brackets and the distance between them:
Prefer:

if[2<count a;

Over:

if[(count a)>2;

Similarly:
Prefer:

count[a]

Over:

(count a)
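The reordering is not just cosmetic: simply dropping the brackets from the second form would change the meaning. A quick check with a sample one-item list (the value of a is chosen for illustration):

```q
a:enlist 5
count a>2     / 1, evaluated as count (a>2), which is not the test we wanted
(count a)>2   / 0b
2<count a     / 0b, same answer with no brackets
count[a]>2    / 0b, brackets open and close next to each other
```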

Rule 15 – Decide where to end your brackets
On the principle that I know where brackets end because indentation is important, and that I write code for the reader, not the writer, I don’t give closing brackets their own lines.
Bad:

a:(
	…..;
	….
	)

Good:

a:(
	…..;
	….))

For functions, I wish the kdb parser was smart enough to see the closing brace and not make me indent it. Therefore I use a single space and not a tab for the last line of a function:
eg1:

f:{
	….; /tab indented
	….;
 answer} /Single space because I wish it were zero spaces

If the function returns nothing then:

f:{
	….; /tab indented
	….;} /Although I don’t mind if the } gets its own line.

Rule 16 – Decide how you want your case/switch statements to look and stick to it.
Personally I prefer:

{$[ /No case on the $ line
   a=2;4; /three spaces to align with above. Cond and case on the same line
   a=3;7;
   a=7;[ //Multi statement needs squares so nothing else on this line
     ……;
     …..];
   42]} //Default case aligns with conds not cases

If this is not the last thing in the function then each line might start with a tab and then the appropriate amount of spaces.
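A filled-in version of that layout, as a sketch: classify and the body of the multi-statement branch are made up for illustration, and the code is meant to be loaded from a script file.

```q
/ Hypothetical example: classify and the a=7 branch body are illustrative
classify:{[a]$[
   a=2;4;
   a=3;7;
   a=7;[
     b:a*a;
     b+1];
   42]}

classify each 2 3 7 9   / 4 7 50 42
```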

Rule 17 – Immediately used projections are slower
If you write your code according to rule 14 you might use projections:

f[;a]………………………………….; /Move big ass calculation of first argument to f outside of the brackets.

I do like the above, but it is slightly slower than:

f[………………………………….;a];

This is because kdb has no optimizer.

q)f1:{a[1;3]}
q)f2:{a[1]3}
q)\ts:5000000 f1[]
1427 464
q)\ts:5000000 f1[]
1423 464
q)\ts:5000000 f2[]
1894 464
q)\ts:5000000 f2[]
1904 464
q)f3:{a[1;]3}
q)\ts:5000000 f3[]
1991 464
q)\ts:5000000 f3[]
2000 464
/We can also see the difference in bytecodes:
q)`int$first value f1
160 13 129 10 2 0 3i
q)`int$first value f2
160 13 129 82 82 0 3i
q)`int$first value f3
160 17 13 129 10 2 82 0 4i

Above we can see that f2 does a double index(82) whereas f1 simply identifies a(129) as a function(10) taking 2 arguments. For f3 we see that the byte code is longer due to both a projection and an index.
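The session does not show how a was defined. Assuming it is some two-argument function (the definition below is a stand-in, not the one used for the timings above), the three forms can be confirmed to give the same answer before comparing their speed:

```q
a:{x+y}        / assumed definition, the original session does not show it
f1:{a[1;3]}    / both arguments supplied at once
f2:{a[1]3}     / implicit projection, then applied
f3:{a[1;]3}    / explicit projection, then applied

f1[]~f2[]      / 1b
f2[]~f3[]      / 1b

\ts:1000000 f1[]
\ts:1000000 f3[]
```

Timings will vary by machine and q version, so it is worth rerunning the comparison before deciding how much weight to give this rule.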

Rule 18 – Variable names should rarely be single characters.
Unless the function has mathematical meaning, in which case x, y, z might make sense, variables and parameters should have meaningful names. d does not mean dictionary; use dict instead or, better, a name that describes what the dictionary holds.

Let me know in the comments if I missed any rules, or you have a better set of rules.
davidcrossey
March 19, 2024 at 11:01 am

Thanks for sharing @dhodgins.

Whilst I don’t necessarily agree with all of the points, based on personal preference, I do agree with the overall theme of keeping it simple and consistent – both for readability and performance.

I find it’s a balancing act. For example, on projection performance, one may prefer a more tacit style, as the following expression can be cleaner and easier to read right-to-left:

aggFunc projFunc[var1;] calcFunc

vs

aggFunc projFunc[var1;calcFunc]

This is especially so if calcFunc has a longer definition and the gain in readability justifies the small performance overhead.

Your blog post and notes on style can be eye-opening and help to develop good code hygiene over the long term.

KX Community