Laura
Forum Replies Created
-
Laura
Administrator, June 30, 2023 at 12:00 am, in reply to: Path not found error when launching sandbox
Hi – the error was happening when I tried to open a notebook that I had not accessed before, such as memory management and query optimization or Parallelization.
I do not have a screenshot of the error. The best description I have is that after I clicked “Launch KX Sandbox”, the page went through the normal loading process (it went white, and the progress bar appeared and disappeared). At the point where the Jupyter notebook logo showed, the error (The path: /user/<my email>/lab/qbies.png was not found. JupyterLab redirected to: /user/<my email>/) appeared and sent me to that default directory, where I could still access previously opened notebooks.
If it happens again I will document with screenshots.
Thank you
-
Laura
Administrator, June 30, 2023 at 12:00 am, in reply to: Path not found error when launching sandbox
The problem seems to have resolved itself today.
If anyone can explain why this was happening, it would be appreciated for future reference.
-
The other comments explain well that md5 is a hashing function, which is by design non-reversible. It's also not a very secure hashing algorithm, as it's vulnerable to both collision attacks and length-extension attacks (more reading here for interest).
Depending on your use case, if there's a known, fixed list of strings that the original messages can be, you can "decode" the data. Take the scenario where there are two users in a message chat, Alice and Bob, and we have a table with those users md5-hashed to hide their identity, but we know beforehand that the users are Alice and Bob. You can see who sent what like this:
t:([]time:10#.z.p;users:10?`Alice`Bob;message:10?10);
t:update users:{md5 x} each string users from t;
lookup:(md5 "Alice";md5 "Bob")!`Alice`Bob;
t:update users:lookup[users] from t;
This can be extended to any number of users (and other use cases) but has the prerequisite of there being a known fixed list of users.
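For a longer known list, the lookup can be built in one step rather than hashing each name by hand (the extra names here are just examples):

users:`Alice`Bob`Carol`Dave;                / the known, fixed list of users
lookup:(md5 each string users)!users;       / maps each md5 digest back to the original symbol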
If, however, your use case is that you are storing large amounts of data and don't want a lookup like this, I'd suggest not using MD5 (which is hashing, not encryption) for your data at rest, but instead encrypting with public/private keys so you are able to decrypt your data while still having security at rest. You could, say, use OpenSSL and integrate it with kdb+/q (example of encrypting/decrypting data here).
Unfortunately (or rather fortunately), if you don't know what the values in your table are and are trying to retrieve them knowing only that they've been md5-hashed, this isn't possible outside of brute-force attacks, the likes of which underpin password cracking.
I hope one of these responses addresses your use case!
-
Laura
Administrator, June 13, 2023 at 12:00 am, in reply to: c# API – Is it possible to build (dynamically) and run aggregate queries to receive real time streaming results using the c# API
Hi DG,
Here are a few solutions depending on requirements/specifics:
1. Depending on how dynamic the queries are, i.e. if it's a mostly fixed list of aggregations, I would suggest using a Real-Time Subscriber (RTS) and creating tables for the different aggregations, then subscribing to those aggregate tables from C# using .u.sub (see the q-side sketch after this list).
2. If the aggregations are very dynamic, create a polling query from C# to q. This would be similar to static queries (examples below) but wrapped in a looping statement within C#.
– AquaQ Documentation on C# Aggregations
– KX Example Simple Query
Note this second solution wouldn't be truly real-time, but you could reduce the polling interval to the point where it's close enough to real-time for your purposes.
3. If you’re truly looking for Real-Time responses you could subscribe via a Gateway Process and treat it as an aggregator, then define the aggregation that’s run on-update via remote queries from C#. Something along the lines of:
connection.ks(".my.agg:{select avg price from x}")
Where the string would be defined by C# dynamically and run on-change.
The third method is quite a bit more complicated than the other two and has a few issues that need to be overcome. I have noted it to answer the question in its entirety, but would advise either approach 1 or 2. If a non-polling approach is definitely required, I'll be happy to go into more detail on the third solution and its gotchas.
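For completeness, here is a rough q-side sketch of approach 1. It assumes a standard kdb+tick-style setup where u.q is loaded on the RTS so .u.pub/.u.sub are available; the table and column names are only examples, and recomputing the aggregate on every update is deliberately naive:

/ schema of the raw table received from the tickerplant
trade:([]time:`timestamp$();sym:`$();price:`float$();size:`long$());
/ aggregate table the C# client subscribes to with .u.sub
aggTrade:([]sym:`$();avgPrice:`float$());
upd:{[t;x]
  if[t=`trade;
    `trade insert x;
    / naive: recompute the aggregate over the whole table on every update
    aggTrade::0!select avgPrice:avg price by sym from trade;
    / push the refreshed aggregate out to subscribers of aggTrade
    .u.pub[`aggTrade;aggTrade]
  ]
 }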
Regards,
Sam B
-
Laura
Administrator, June 13, 2023 at 12:00 am, in reply to: How does nested columns/lists fragment memory?
Hi there,
Firstly, what you're seeing here isn't memory fragmentation due to nested columns. Looking at the documentation, the memory-fragmentation issue in relation to garbage collection arises when kdb+/q can't find contiguous blocks of memory to release back to the OS because the memory is fragmented. The symptom is that, although used memory is low, the heap remains large even after garbage collection (see attached snippet). The solution to this fragmentation is to serialise your data, release it, then de-serialise (see the comments within the coded example).
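For reference, a minimal sketch of that serialise/release/de-serialise cycle for a global table t that has ended up referencing fragmented memory (the variable name is just an example):

tmp:-8!t;           / serialise t into one contiguous byte vector
delete t from `.;   / drop the original, fragmented reference
.Q.gc[];            / hand the freed blocks back to the OS
t:-9!tmp;           / de-serialise into a single fresh allocation
delete tmp from `.;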
Secondly, the coded example you provided doesn’t contain nested columns. An example of a nested column, continuing from your definition of trades:
select nested from update nested:n#enlist (.z.p;3i) from trades
Here the nested column is a mixed list in which every element is itself a 2-item list (a timestamp and an int).
In relation to the time taken to garbage collect, you'll find your second query requires more memory to be allocated to the process to complete, and thus more time to release that memory.
To solve the timing issue for you, I'd advise using immediate garbage collection, which can be enabled with the command-line argument -g 1 when starting q, or dynamically on a running process with the system command \g 1. This means you don't have to manually invoke .Q.gc[], as the system will automatically release memory blocks back to the OS when they become available.
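For example:

/ enable at startup with the command-line flag:
/   q myscript.q -g 1
/ or switch it on in a running session, then check the current mode:
\g 1
\g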
More reading on Memory Fragmentation for interest here
Regards,
Sam B
-
Laura
Administrator, June 13, 2023 at 12:00 am, in reply to: How does nested columns/lists fragment memory?
Ah yes, you're right – multiple entries of qty/price for a given uid/tid will create nested columns in the by-clause result.
Agreed, memory fragmentation is definitely not the issue here; I was just providing documentation supporting that claim.
-
Laura
Administrator, June 13, 2023 at 12:00 am, in reply to: How does nested columns/lists fragment memory?
“I have tried the Immediate garbage collection mode in my application but didn’t get much speed up.”
Can I clarify what metric you're using when you say you didn't get much speed-up? Just to restate: when you use immediate garbage collection you no longer need to call .Q.gc[] in your script, so most of the slowness of your script will come from the aggregations themselves. You can compare the two methods by logging .Q.w[] and .z.p at different points in your script.
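Something along these lines would do for the logging (a minimal sketch; the helper name chk is just an example):

/ print a timestamp plus the current used/heap figures
chk:{-1 string[.z.p]," used=",string[.Q.w[]`used]," heap=",string .Q.w[]`heap;};
chk[];
/ ... run an aggregation ...
chk[];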
Regarding memory fragmentation, it's not something you need to consider at this point; as mentioned, the space of the whole temporary result is released rather than part of it.
1. Yes, deallocating part of a nested structure will mean that part cannot be released if the other part is referenced, as per the comments in the docs:
2. Yes, for the same reason as 1, if you haven't deleted nl from the local namespace. The global table will also reference the same memory locations as nl, so even if nl is deleted from the local namespace you will still have this problem:
I'd advise experimenting with the code provided in the docs, as I have here; you might find some of the answers to your questions by observing how much memory is released back to the OS when using globals/nested variables.
3. I believe it's when new data is referenced, e.g.:
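A minimal sketch of the behaviour (run interactively; the sizes and names are arbitrary):

a:til 10000000;   / allocate a reasonably large vector
.Q.w[]`used
b:a;              / b references the same memory as a: no increase in used
.Q.w[]`used
a[0]:42;          / amending a forces a fresh copy; b keeps the original block
.Q.w[]`used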
Note that when assigning b:a there is no memory increase, but once a is modified it needs a new memory allocation, and b still uses the memory space that a previously occupied.
But to reiterate: the problem of speed should be solved with immediate garbage collection (if you could verify with logs), and the issue of memory fragmentation shouldn't be particularly relevant to you at this point. However, if it does become an issue, you'll see it reflected in .Q.w[], with used being orders of magnitude lower than heap even after a manual .Q.gc[]. The solution is to periodically serialise, release and de-serialise the variable that's referencing the fragmented memory (your global table), or pursue a solution that doesn't involve nested vectors.
-
This is really interesting Simon!
Looking forward to hearing how well enumerating the parent posts worked – keep us posted!
Laura
-
You are correct that in practice date or session could be considered – it depends what the end goal is.
Taking date for example you could either:
- join on time (no date included)
  - gives an average value over a certain window across multiple days
  - this tells the user that, after say 3 days of practice sessions, the average sensor value in the morning or afternoon is X
  - for example, this might indicate that tyre temperatures are colder in the morning and hotter in the afternoon, meaning we might want to adjust the car setup in the later sessions
  - in this scenario we don't care about the date – just the time of day it happened
- join on date and time
  - gives an average value over a certain window on a specific day
  - the dataset would be much larger (triple the size if 3 days)
  - this would be useful if I wanted a breakdown of averages per window per day
The same is true for session – we don’t actually care what session it is – we just want averages per time of day.
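To make the two options concrete, here is a toy sketch of the resulting aggregations (not the window join from the exercise itself, just a simple grouping; the table and column names are made up):

sensor:([]date:2024.01.01 2024.01.01 2024.01.02 2024.01.02;time:09:10 14:30 09:05 14:20;val:18.2 24.1 17.9 25.3);
/ time only: one average per hourly window across all days
select avg val by 60 xbar time from sensor
/ date and time: one average per hourly window per day (grows with each extra day)
select avg val by date,60 xbar time from sensor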
Hope this makes sense!
-
Hi, wj takes the initial value within the window to be the prevailing value that existed before the window started.
So for example in the case of tempBackRight for the second time interval between 12:02:56.325 and 12:03:33.564:
- you are correct, there is no event in this interval for tempBackRight
- wj will take the prevailing value that existed before the window started, i.e. 20.87774, which happens to be the only entry for this sensor
- this is a useful feature of wj, because we would rather have the last known value than a table full of nulls
Note that if we did want to consider only the values that occurred during the window, we could use wj1. Hope that helps. It is correct in this example not to have lapId as a join column, as the laps are already represented by the time and endTime values defining each lap window.
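If it helps, here is a standalone toy example of the difference (made-up tables and column names, not the course data):

/ one reading before the window, none inside it
r:([]sym:`p#`s`s`s;time:09:00:01 09:00:04 09:00:08;val:1 2 3f);
t:([]sym:`s;time:enlist 09:00:06);
w:-1 1+\:t`time;                       / a single window: 09:00:05 to 09:00:07
wj[w;`sym`time;t;(r;(last;`val))]      / val=2: the prevailing value is carried into the window
wj1[w;`sym`time;t;(r;(last;`val))]     / val=0n: only values inside the window are considered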
If you’re still getting an error you can share your attempt with a spoiler tag and I can better help with where you might be going wrong.
Thanks,
Michaela
-
Hi – you are very close! I had a go at running your solution and see a different meta from what you have shared.
It's the first column, trade_id, where the issue is – you should not need to add string in front of it.
messages: update trade_id: trade_id, exch_message, broker_id: extractBrokerId each exch_message from messages
-
Laura
Administrator, April 18, 2023 at 12:00 am, in reply to: How to walkthrough a tree and calculate value on path?
Really like this solution – I haven't seen scan used for indexing like that before. Just wanted to add to it so the user doesn't have to define calc and outputTree by hand:
tree:([]parent:`A`A`A`B`B`E`E;child:`B`C`D`E`F`G`H;data:(1;2;3;4;5;6;7));
traverse_dict:exec child!parent from tree;
root:`A;
// value pairing appending root node with 1 factor
calc:(root, exec child from tree)!1,exec data from tree;
// calc:`A`D`C`B`F`E`G`H!1 3 2 1 5 4 6 7;
traverse_func:{[st;end;dict;calc] prd calc except[(dict) end;(dict) st] }[;;traverse_dict;calc];
outputTree:exec child by parent from tree;
outputTree:key[outputTree]!raze each (value outputTree),' outputTree value outputTree;
outputTree:(key outputTree)!except[;key outputTree]each distinct each (raze/) each (outputTree)each value outputTree;
p:raze (count each value outputTree)#'key[outputTree];
c:raze value outputTree;
outputTree:([] parent:p;child:c);
// outputTree:([] parent:`A`A`A`A`A`B`B`B`E`E;child:`C`D`F`G`H`F`G`H`G`H);
// output
update val:traverse_func'[parent;child] from outputTree
There’s probably a more concise way to do the above so keen to see any further improvements.
-
Laura
Administrator, April 18, 2023 at 12:00 am, in reply to: mmap increasing every time table is queried
That is a bit odd, being unable to find the exact cause of the mmap increasing while similar data doesn't show the same trend. If it persists after upgrading your q version, let's investigate; if the behaviour is resolved by the update, then we can say the issue was covered by ANYMAP, and perhaps that can guide you towards what the differences are in the data you're seeing on the older version of q.
-
Laura
Administrator, April 17, 2023 at 12:00 am, in reply to: mmap increasing every time table is queried
Further supporting the suggestion to update q, this blog post might be of interest to you – specifically the ANYMAP feature that was added in v3.6.
“The anymap structure within the files provides a format which is mappable, as opposed to previously unmappable non-fixed-width records”. Strings are non-fixed-width records, which would explain the mmap values you're seeing. Further reading on this can be found in the release notes.
-
Adding to the earlier information surrounding the underlying issue of .Q.en for concurrent writes, you might find This White Paper useful. It covers how data loaders handle mass writedowns and explicitly deals with maintaining the integrity of the sym file across multiple processes. Leveraging this will mean you can still benefit from the performance of multiple processes while avoiding any conflicts with the sym file.
This may be an over-engineered solution for your use case; a more straightforward option would be IPC calls between the writing processes, communicating when the sym file is being written to so that only one process writes at any given time. The white paper should, however, give you another option if concurrent writedowns are a requirement.
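As a rough illustration of the simpler IPC route, a single process can own all enumeration while the writers only write their own partitions (a sketch only; the port, paths and function name are assumptions, not the white paper's design):

/ on the coordinator process (the only one that touches the sym file), listening on port 5010:
enumerate:{[t] .Q.en[`:/data/hdb;t]};

/ on each writer process:
h:hopen `::5010;                       / connect to the coordinator
t:([]sym:`a`b`c;price:1 2 3f);
t:h(`enumerate;t);                     / comes back enumerated against the shared sym file
`:/data/hdb/2024.01.01/trade/ set t;   / each writer then writes only its own partition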