KX Community


  • embedPy/ BeautifulSoup

    Posted by simon_watson_sj on July 15, 2021 at 12:00 am

    Hi All,

    I am currently looking into scraping data using BeautifulSoup4 in embedPy.

    My issue is that I am having difficulty accessing the list of results produced by the find_all method of a BeautifulSoup object. I have run the equivalent code in Python and it all works as expected.

    For simplicity, I have defined a document over which to apply BeautifulSoup. At the bottom of this post there are some lines of code that create the sample HTML document I used. It is built line by line because I was having all kinds of trouble trying to define it as one very long string in JupyterQ (another post in its own right!)

    // import the BeautifulSoup library

    bs4: .p.import `bs4

    // careful below: the double speech marks need to be the straight ones, not the right-leaning monstrosities from Microsoft Office.

    bs: bs4[`:BeautifulSoup;example_html;"html.parser"]

    // at this point, running bs[`:prettify;::] correctly returns the loaded html document.

    // now run the find_all method to find the a elements that have an href attribute (again watching for those speech marks).

    rslt: bs[`:find_all;"a";`href pykw 1b]

    Now if I evaluate rslt` I get a list of 2 foreign elements. This agrees with the two found records I see when I run the same process in pure Python.

    Here I get stuck.

    // I think the error is that when I ran the find_all method, I should have run it through .p.qcallable:

    rslt: .p.qcallable bs[`:find_all;"a";`href pykw 1b]

    However, when I do that, I am left with an object

    .[code[foreign]]`.p.q2pargsenlist

    I can't seem to manipulate this object at all.

    Could you advise how I might use this function to end up with an object which Q sees as a list of 2 strings?

    I include the test data I used below. It is built line by line because something seems to overflow in JupyterQ if you do it as a single string with \n.

    Regards,

    Simon

    \——————————————————————-

    example_html: "<html>"
    example_html: example_html,"<head>"
    example_html: example_html,"<title>Your Title Here</title>"
    example_html: example_html,"</head>"
    example_html: example_html,"<body bgcolor=\"#ffffff\">"
    example_html: example_html,"<center>"
    example_html: example_html,"<img align=\"bottom\" src=\"clouds.jpg\"/>"
    example_html: example_html,"</center>"
    example_html: example_html,"<hr/>"
    example_html: example_html,"<a href=\"http://somegreatsite.com\">Link Name</a> is a link to another nifty site"
    example_html: example_html,"<h1>This is a Header</h1>"
    example_html: example_html,"<h2>This is a Medium Header</h2>"
    example_html: example_html,"Send me mail at <a href=\"mailto:support@yourcompany.com\">support@yourcompany.com</a>."
    example_html: example_html,"<p>This is a paragraph!</p>"
    example_html: example_html,"<p>"
    example_html: example_html,"<b>This is a new paragraph!</b><br/>"
    example_html: example_html,"<b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>"
    example_html: example_html,"<a>This is an empty anchor</a>"
    example_html: example_html,"</p>"
    example_html: example_html,"<hr/>"
    example_html: example_html,"</body>"
    example_html: example_html,"</html>"

     

  • 2 Replies
  • simon_watson_sj

    Member
    July 16, 2021 at 12:00 am

    Hi All,

    I got a response from Conor at KX about this. The problem is that the Python library returns its results in non-standard Python data types (bs4's own classes rather than built-ins such as str, list or dict), which embedPy does not convert to q automatically.

    To get around this, you convert the output to standard Python data types on the Python side before bringing it over to q. The key to his solution is the Python function below, which converts each result to a Python string:

    q)p)def func(x):return str(x)

    Also note that the [<] in qfunc[<] below is just a better way of telling embedPy to return the result as a q object rather than a Python one (which q would display as 'foreign', if you recall from the docs).

    His full solution is as follows (based on the example_html doc I defined in my initial post):

    q)bs4:.p.import`bs4

    q)bs4[`:BeautifulSoup][example_html;"html.parser"]

    {[f;x]embedPy[f;x]}[foreign]enlist

    q)bs4[`:BeautifulSoup][example_html;"html.parser"]`

    foreign

    q)bs:bs4[`:BeautifulSoup][example_html;"html.parser"]

    q)result:bs[`:find_all]["a";`href pykw 1b]

    q)p)def func(x):return str(x)

    q)qfunc:.p.get`func

    q)qfunc[<]each result`

    "<a href=\"http://somegreatsite.com\">Link Name</a>"

    "<a href=\"mailto:support@yourcompany.com\">support@yourcompany.com</a>"
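
A possible variation on the above, as a rough sketch only: rather than calling func on each element from q, the whole result list could be converted to built-in strings inside Python, so that q receives one plain list of str. The helper name to_strings and the FakeTag class are hypothetical; FakeTag is only a stand-in for a bs4 tag so that the sketch runs without bs4 installed.

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 tag, so the sketch runs without bs4."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        # bs4 tags render back to their HTML markup when passed to str()
        return self.markup


def to_strings(items):
    """Return a plain list of str, one entry per item of any iterable."""
    return [str(item) for item in items]


# With bs4 installed this would be: to_strings(soup.find_all("a", href=True))
sample = [
    FakeTag('<a href="http://somegreatsite.com">Link Name</a>'),
    FakeTag('<a href="mailto:support@yourcompany.com">support@yourcompany.com</a>'),
]
converted = to_strings(sample)
print(converted)
```

Defined on the Python side with p) and fetched with .p.get in the same way as func, a helper like this would be called once on the find_all result rather than once per element. I have not checked that against the session above, so treat it as an assumption.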

     

     

  • simon_watson_sj

    Member
    July 16, 2021 at 12:00 am

    Actually, subsequent to this, using:

    q)p)def func(x):return x.attrs

    in place of

    q)p)def func(x):return str(x)

    will return a Python dictionary of attributes. Since this is a native Python type, I found that it comes across to q as a dictionary and keeps the nested structure.
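
The same idea can be sketched for a whole list of results at once. The helper name to_attr_dicts and the FakeTag class are hypothetical; FakeTag only mimics the attrs dictionary that a bs4 tag carries, so that the sketch runs without bs4 installed.

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 tag; only mimics its .attrs dict."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs


def to_attr_dicts(items):
    """Return a list of plain dicts, one shallow copy of .attrs per item."""
    return [dict(item.attrs) for item in items]


# With bs4 installed this would be: to_attr_dicts(soup.find_all("a", href=True))
sample = [
    FakeTag("a", {"href": "http://somegreatsite.com"}),
    FakeTag("a", {"href": "mailto:support@yourcompany.com"}),
]
attr_dicts = to_attr_dicts(sample)
print(attr_dicts)
```

Each entry is a built-in dict, which is the kind of value that came across to q as a dictionary in my test above. The copy is shallow, so any nested values (for example lists) are shared with the original attrs.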
