Laura
Forum Replies Created
-
Laura
Administrator
April 5, 2023 at 12:00 am in reply to: How to download attachments from *.eml file
Hi KPC,
Looking at an example .eml file here: if you want to parse the attachments purely in KDB/Q without the use of Python libs (although I suggest using Python libs), I’d suggest something along the lines of:
- read0 the *.eml file. Depending on the contents, and on whether you want to interpret new lines literally, you may find "c"$read1 a more appropriate solution
- Use regex to locate the contents of the attachment, the content type and the encoding type (from the example, the encoding looks to default to base64)
- Decode the body of the attachment. For base64 decoding in KDB/Q it looks like this is a solution:
b64Decode:{c:sum x="=";neg[c]_"c"$raze 256 vs'64 sv'0N 4#.Q.b6?x}
- Post-process the data further into Q objects if suitable. E.g. if the file type is JSON you may want to use the .j.k JSON deserialiser for Q
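As a rough sketch of the steps above, here is how they could fit together on a tiny in-memory example. The `lines` list is a hypothetical MIME part standing in for what read0 would return, and the `like` patterns assume the headers appear exactly as written here; a real .eml file will need more careful matching.

```q
/ hypothetical MIME part, already split into lines as read0 would return it
lines:("Content-Type: application/json; name=\"a.json\"";
  "Content-Transfer-Encoding: base64";
  "";
  "eyJhIjox";
  "fQ==";
  "--boundary--")

/ decoder quoted in the reply
b64Decode:{c:sum x="=";neg[c]_"c"$raze 256 vs'64 sv'0N 4#.Q.b6?x}

s:first where lines like "Content-Transfer-Encoding: base64*"  / encoding header
b:s+first where 0=count each s _ lines                          / blank line ending the headers
e:b+first where (b _ lines) like "--*"                          / next MIME boundary
body:raze lines (b+1)+til e-b+1                                 / q evaluates right to left, so e-b+1 is e-(b+1)
txt:b64Decode body                                              / decoded attachment text
obj:.j.k txt                                                    / parse the JSON into a q dictionary
```

Here `txt` is the JSON text of the attachment and `obj` is the resulting dictionary. For binary attachments I’d check the decoder carefully before relying on it.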
The solution provided should be the preferred solution with embedPy. Adding to this, there is a PyPI lib that claims to handle attachments too:
https://pypi.org/project/eml-parser/
-
Hey Roc,
Thanks. Don’t think the Platform stuff is going to fit the bill given I already have a backend with the TP stuff going on.
I’ve been working off the UI options, since I’m using Dashboards, and have managed to get data streaming in, but it appears to stop after 20-30 seconds.
Do I need to add anything to the TP sym.q file?
Once I’ve added the:
.u.snap:{tablename}
.ringBuffer.read:{[t;i] $[i<=count t; i#t;i rotate t]};
.ringBuffer.write:{[t;r;i] @[t;(i mod count value t)+til 1;:;r];};
.stream.i:0-1;
.stream.tablename:20000#tablename;
Is there any other requisite to stream data into Dashboards?
-
Laura
Administrator
March 22, 2023 at 12:00 am in reply to: Heap is a lot larger than used, how to find the cause?
Hi Nick,
Understood on the QCE version not being an issue. In my initial response I wasn’t able to replicate the issue with n:50000000; if you look at that, you’ll see I call position twice and the heap returns to normal.
For n:2000000, however, I do see the issue, so we’re on the same page now:
Regardless, did you try the fix I suggested in the latest response? It works for both QCE and Q:
See how if I delete position from the local namespace before reassigning it the heap returns to normal after GC.
I think your theory about the first block allocation then second block use on second IPC call is correct. The reason I didn’t see this for the n=50000000 case was because the data was of a size that the memory allocated was large enough to hold both the IPC read and what was currently in memory without allocating another block. For the data you’re using or the n=2000000 case the memory allocated was nearer to the amount taken up by the object in memory.
So my solution of deleting from the local namespace before calling again reduces the used memory in the process enough to be able to contain the second assignment and stop the invocation of the second block. Important to note that if you delete from the local namespace immediately before the second assignment this shouldn’t affect your code since the reassignment would overwrite the variable anyway.
-
Laura
Administrator
March 17, 2023 at 12:00 am in reply to: Heap is a lot larger than used, how to find the cause?
I wasn’t able to replicate the issue on my local machine running on KDB+ 4.0 2020.07.15:
My heap returned back to the level it was at the start of the Q session on release as expected.
However I was able to recreate the issue running KDB+ 4.0 Cloud Edition 2022.01.31.
So the issue seems to lie with QCE releasing back to OS. I’ll follow up internally on this to see if it’s a known issue and what can be done to minimise the heap used.
However, per the screenshot, I wasn’t able to recreate the case where re-assigning position via an IPC call leaves the heap elevated after running .Q.gc[] (the heap after re-assigning and GC is the same as after the initial assignment and GC).
As a potential fix, can you try purging position from memory before your second assignment:
delete position from `.
.Q.gc[]
.Q.w[] // to inspect
position:h"position"
.Q.w[] // to inspect
.Q.gc[]
.Q.w[] // to inspect
-
As a simple starting point before investigating the data, type errors in Q suggest that you have provided the wrong datatype to a function. In the case of .Q.chk make sure you’re passing in a filepath as per the documentation which is a symbol type. If you could share the line of code you’re calling the .Q.chk in and the argument you’re providing, as well as an ls on the directory you’re performing the check on, that would help in isolating the problem. An example call of .Q.chk is:
.Q.chk[`:/path/to/dir]
N.B. You will get a type error if you call .Q.chk with a string:
q).Q.chk["/path/to/dir"]
'type
  [0]  .Q.chk["/path/to/dir"]
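If your path is held as a string, a small sketch of converting it to the file symbol that .Q.chk expects is below. The path and the names `p` and `hp` are hypothetical, for illustration only.

```q
p:"/path/to/dir"      / hypothetical HDB root held as a string
hp:hsym `$p           / cast to a symbol, then prefix with a colon to make a file symbol
type hp               / -11h, i.e. a symbol atom
/ .Q.chk hp           / the call itself, left commented out as the path is made up
```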
Hope this helps!
-
Laura
Administrator
March 15, 2023 at 12:00 am in reply to: Heap is a lot larger than used, how to find the cause?
Hi Nick,
Here are the steps I did to attempt reproducing your issue:
Host Machine (Port 5000):
q)n:50000000
q)position:([]time:n?.z.p;sym:n?`ABC`APPL`WOW;x:n?10f)
Client Machine:
q)h:hopen`::5000
q).Q.w[]
used| 357632
heap| 67108864
peak| 67108864
wmax| 0
mmap| 0
mphy| 8335175680
syms| 668
symw| 28560
q)position:h"position"
q).Q.w[]
used| 1610970544
heap| 2751463424
peak| 2751463424
wmax| 0
mmap| 0
mphy| 8335175680
syms| 672
symw| 28678
q).Q.gc[]
1073741824
q).Q.w[]
used| 1610969232
heap| 1677721600
peak| 2751463424
wmax| 0
mmap| 0
mphy| 8335175680
syms| 673
symw| 28708
q)position:h"position"
q).Q.w[]
used| 1610969232
heap| 4362076160
peak| 4362076160
wmax| 0
mmap| 0
mphy| 8335175680
syms| 673
symw| 28708
q).Q.gc[]
2684354560
q).Q.w[]
used| 1610969232
heap| 1677721600
peak| 4362076160
wmax| 0
mmap| 0
mphy| 8335175680
syms| 673
symw| 28708
As you can see, in trying to replicate your issue, my example releases the expected amount of memory back to the OS. Given the number of records you have and the relative size of the table afterwards, I think the issue you’re encountering is due to the data structure of position leading to memory fragmentation. As per my other reply, the reference on code.kx.com gives an example of this, stating that “nested data, e.g. columns of char vectors, or much grouping” will fragment memory heavily. Does this reflect your data?
To fix this I’d suggest the approach from the reference: serialise, release, deserialise. Or, to extend further to your case: serialise, release, deserialise, release, IPC reassign, release. This will maintain a low memory footprint and try to remedy the memory fragmentation, but you may still unavoidably have heap greater than used purely due to the data structure (although to a lesser extent than what you’re experiencing).
If memory fragmentation isn’t the cause, can you give a bit more insight into the data structure of position? My attempt to replicate suggests this problem is data specific.
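A minimal sketch of the serialise, release, deserialise sequence is below. The table here is a small made-up stand-in with a nested char column, and `z` is a hypothetical temporary name; in your case position would come from the IPC call instead.

```q
position:([]sym:1000?`a`b`c;note:1000#enlist"nested text")  / hypothetical table with a nested column
z:-8!position          / serialise to a byte vector
position:0             / drop the reference to the original table
.Q.gc[]                / release freed blocks back to the OS
position:-9!z          / deserialise into freshly allocated memory
z:0                    / drop the serialised copy
.Q.gc[]                / release again
.Q.w[]`used`heap       / inspect used versus heap afterwards
```

How much this helps depends on the data, so I’d compare .Q.w[] before and after on your real table.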
-
Laura
Administrator
March 14, 2023 at 12:00 am in reply to: Heap is a lot larger than used, how to find the cause?
Hi Nick,
The previous comment about using .Q.w[] is a good start for isolating which parts of the calculation are memory intensive and require a large heap allocation from the OS. Printing it to standard out using 0N! after each line you expect to be memory intensive will isolate that point in your code.
On the more under-the-hood side, this article by AquaQ is quite helpful for understanding what is going on. But to summarise and add some additional points:
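As a sketch of that kind of instrumentation, the hypothetical helper `memlog` below prints a tag together with the used and heap figures from .Q.w[]. The vector `x` just stands in for one of your memory intensive lines.

```q
memlog:{[tag] 0N!(tag;.Q.w[]`used`heap);}   / hypothetical helper: print tag with used and heap
x:til 1000000                                / stands in for a memory intensive step
memlog`afterAlloc
x:0                                          / drop the reference
.Q.gc[]
memlog`afterGc
```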
- KDB allocates memory in powers of two. Meaning a vector of data will be placed in a memory block one power of 2 up from the raw data, leading to at most 2x memory used.
- Memory fragmentation may also be an issue depending on your aggregations – example here
- The Q process starts with a certain amount of heap allocation that is larger than the used space (this can be seen by starting a Q session and running .Q.w[] straight away). The process won’t go below this heap allocation by the OS on startup.
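To put a number on the powers-of-two point, here is a back-of-envelope sketch. It assumes 8 bytes per long plus a 16-byte vector header, which is my assumption for illustration rather than something taken from the article.

```q
n:1000000                               / hypothetical number of longs in a vector
raw:16+8*n                              / raw bytes: 8 per long plus an assumed 16-byte header
blk:`long$2 xexp ceiling 2 xlog raw     / next power of two at or above raw
blk%raw                                 / ratio of block size to raw size, always below 2
```

For this n the block comes out at 8388608 bytes (2 to the power 23) for about 8 million bytes of raw data, so the overhead is small; a vector just over a power of two would land near the 2x worst case.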
If you don’t think that a combination of these points is enough to make the heap this much larger than used after calling .Q.gc[], I’d recommend invoking the script from the timer manually and investigating with .Q.w from there, as the heap does appear rather large even given the above. This would rule out garbage collection and the timer function running again while you are investigating with .Q.w, which could make the numbers misleading.
-
Laura
Administrator
March 8, 2023 at 12:00 am in reply to: Parallelising .Q.dpft with default compression enabled
Tacking on here some further improvements Alex and I discussed:
funcMem:{[d;p;f;t]
  i:iasc t f;
  c:cols t;
  is:(ceiling count[i]%count c) cut i;
  tab:.Q.en[d;`. t];
  {[d;tab;c;t;f;i].[{[d;t;i;c;a]@[d;c;,;a t[c]i]}[d;tab;i;;]]peach flip(c;)(::;`p#)f=c:cols t}[d:.Q.par[d;p;t];tab;c;t;f;]each is;
  @[d;`.d;:;f,c where not f=c];
  t
 };
This reduces the memory drawback; theoretically it will be more memory efficient than the standard .Q.dpft. The function slices the sort index of the parted column into chunks, sized so that each in-memory chunk of the table holds roughly the same number of entries as a single column of the table (which is the most data .Q.dpft holds in memory, since it writes column by column).
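To make the chunk sizing concrete, here is a toy sketch using the same cut expression with made-up counts. The names `n` and `nc` are hypothetical, and `i` stands in for the sort index.

```q
n:10; nc:3                    / hypothetical row and column counts
i:til n                       / stands in for the sort index iasc t f
is:(ceiling n%nc) cut i       / same chunking expression as in funcMem
count each is                 / rows per chunk: 4 4 2
nc*max count each is          / entries in the largest chunk across all columns: 12
```

So with 10 rows and 3 columns, the largest chunk holds 12 entries across all columns, close to the 10 entries of a single full column.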
The result of this will lead to the benefits of parallelisation as above without the memory drawback we have seen by simply adding peach.
I claim “more memory efficient than standard .Q.dpft” because the chunks are sized to match the number of elements in a column. Since .Q.dpft writes column by column, its maximum memory use is set by the column with the biggest (in bytes) datatype. With this new method, a chunk only ever contains part of that large-datatype column, alongside parts of the smaller-datatype columns, so it uses at most the same memory as .Q.dpft, with equality when all columns have the same-sized datatype.
Preliminary tests showed the maintained improvement in speed, with no memory drawback. However these tests were not standardised or conducted in an official unit testing framework. Would love to know the official results of this at some point – be that generated by myself or someone else who is curious.
-
Hi, thanks for your question.
You’ll need to set up the report template to connect, and then use report management in Dashboards to connect to the source data populating the reports e.g. ds_gw_report.
See the documentation here for more information: Report Manager – KX Dashboards
Hope that helps!
Laura
-
Laura
Administrator
January 25, 2023 at 12:00 am in reply to: Multiple Chart, Single Cursor — Chart GL
Basics > Hover worked.
Y-Axis > Range > Selection Min and Max hasn’t worked so far. Do I need to click the min/max button too?
-
Or, for that matter,
metaTbl . `ref`m
So, if you are thinking of defining objects by their paths, in this case the path would be
`ref`m
-
Hi,
You have two options to get workspaces back:
1. Force a fresh launch of the learn.kx.com sandbox by opening it in an incognito/private window, in a different browser from the one you previously used to access it.
2. Use the new Academy sandbox, which will replace learn.kx.com as the default Academy sandbox very soon – the three workspaces are:
Let us know if you try option 2 and have any feedback on the user experience! More on the new sandbox here.
Thanks,
Michaela
-
I’m working in memory.
I load the file in; it’s 800,000 x 7. Then I run a bunch of updates to make the remaining 23 columns (30 or so total).
Then I run the wj and pass back the resulting table.
I’m trying to simulate what it would be doing in a TP: as more data comes in, the wj is going to run continuously slower until it hits the max file size for the day, 800k rows. So I figured loading the whole file in and doing it all at once would be a decent enough way to test what it’d be doing.
-
It took me significantly longer. But I’m also dealing with 30 columns; would that matter, even though I’m just using mmm3 for the wj?
In testing, when I make the data table that I’m searching, (data;(max;`mmm3)), smaller, things speed up. For example, I ran a 1-minute xbar on that table and the window join now takes a couple of seconds.
I don’t understand what I could be doing wrong. At full scale, 800,000 rows x 30 cols, it took 30-40 minutes to complete a 5-minute lookback.
-
What if I’m not using a sym column? I’m going datetime to datetime for the windowjoin.
Would I get speed improvements by using a sym column?