embedPy/ BeautifulSoup
Hi All,
I am currently looking into scraping data using BeautifulSoup4 in embedPy.
My issue is that I am having difficulty accessing the list of lists produced by the find_all method of a BeautifulSoup object. I have run the code below in Python and it all works as expected.
For simplicity, I have defined a document over which to apply BeautifulSoup. At the bottom of this post there are some lines of code to create the sample HTML document I used. It is built line by line because I was having all kinds of trouble trying to define it as one very long string in JupyterQ (another post in its own right!).
// import the BeautifulSoup library
bs4: .p.import `bs4
// Careful below: the double quotes need to be the straight ones, not the right-leaning monstrosities from Microsoft Office.
bs: bs4[`BeautifulSoup;example_html;"html.parser"]
// at this point, running bs[`prettify;::] correctly returns the loaded html document.
// now run the find_all method to find the a tags that have an href attribute (again watching for those quotes).
rslt: bs[`find_all;"a";`href pykw 1b]
Now if I evaluate rslt` I get a list of 2 foreign elements. This agrees with the two records found when I run the same process in pure Python.
Here I get stuck.
// I think the error is that when I ran the find_all method, I should have run it via .p.qcallable:
rslt: .p.qcallable bs[`find_all;a;`href pykw 1b]
However, when I do that, I am left with an object
Code.[code[foreign]]`.p.q2pargsenlist
I can't seem to manipulate this object at all.
Could you advise how I might use this function to end up with an object which Q sees as a list of 2 strings?
I include the test data I used below. It is built line by line because something seems to overflow in JupyterQ if you do it as a single string with \n.
Regards,
Simon
\——————————————————————-
example_html: "<html>"
example_html: example_html,"<head>"
example_html: example_html,"<title>Your Title Here</title>"
example_html: example_html,"</head>"
example_html: example_html,"<body bgcolor=\"#ffffff\">"
example_html: example_html,"<center>"
example_html: example_html,"<img align=\"bottom\" src=\"clouds.jpg\"/>"
example_html: example_html,"</center>"
example_html: example_html,"<hr/>"
example_html: example_html,"<a href=\"http://somegreatsite.com\">Link Name</a> is a link to another nifty site"
example_html: example_html,"<h1>This is a Header</h1>"
example_html: example_html,"<h2>This is a Medium Header</h2>"
example_html: example_html,"Send me mail at <a href=\"mailto:support@yourcompany.com\">support@yourcompany.com</a>."
example_html: example_html,"<p>This is a paragraph!</p>"
example_html: example_html,"<p>"
example_html: example_html,"<b>This is a new paragraph!</b><br/>"
example_html: example_html,"<b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>"
example_html: example_html,"<a>This is an empty anchor</a>"
example_html: example_html,"</p>"
example_html: example_html,"<hr/>"
example_html: example_html,"</body>"
example_html: example_html,"</html>"
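
For reference, one possible way to get from the list of foreigns to q strings is sketched below. It has not been run against the setup in this post and rests on several assumptions: a recent embedPy release in which attributes and methods are addressed as `:name, .p.wrap turning a foreign back into an embedPy object, and a `<` marker requesting a q return value. The helper name href_of is hypothetical. Tag.get is the BeautifulSoup method for reading an attribute from a tag.

```q
\l p.q

/ assumes example_html has been built as above
bs4:.p.import`bs4
bs:bs4[`:BeautifulSoup;example_html;"html.parser"]

/ embedPy object wrapping the Python list of matching Tag objects
rslt:bs[`:find_all;enlist"a";`href pykw 1b]

/ rslt` converts the outer list to q, leaving each Tag as a foreign
tags:rslt`

/ href_of (hypothetical helper): re-wrap one foreign Tag and call
/ tag.get("href"), asking for the result as q data
href_of:{.p.wrap[x][`:get;<;"href"]}

/ intended result: a list of q strings, one per matching tag
hrefs:href_of each tags
```

The idea is to keep the result of find_all as an embedPy object rather than making the call q-callable, because the individual Tag objects have no q representation. Only the final attribute lookup, which returns a plain Python string, is converted to q.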