I want to fetch an HTML page from a website and store it in an entity. The catch is that the page relies heavily on JavaScript to build its content, so just getting the source does not do the trick. How can I fetch a webpage as seen in my browser, with all JavaScript executed, and then get the result (just the HTML)?
Hi Erwin,
There's no use storing only the HTML, since it won't render correctly without the CSS. I think your best bet is to use an HTML to PDF converter like Ultimate PDF and store that.
Hi Kilian, rendering it back correctly is not a requirement. I'm only interested in some data in certain tags. Think about this imaginary use case:
Ah, right. In that case, I'd not store the HTML, but process it directly to grab the data you need. But either way, you need to crawl the browser document, which is a problem, as that lives only client side. Server side, no JavaScript will be run, so you can only get the actual HTML document via HTTP. So I think you have a problem there with the JavaScript, if you are trying to fetch data server-side.
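To illustrate the point about server-side fetching, here is a minimal Python sketch using only the standard library. It assumes a made-up page (`SAMPLE_HTML`) whose `<h1>` is filled in by a script; the helper names `TagTextCollector` and `extract_tag_text` are hypothetical. Parsing the raw document, as a server-side HTTP fetch would receive it, yields only the placeholder text, because nothing executes the script.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while inside the target tag
        self.buffer = []    # text pieces of the occurrence being read
        self.results = []   # one entry per completed occurrence

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        # Script source also arrives here, but it is ignored
        # unless it sits inside the target tag.
        if self.depth > 0:
            self.buffer.append(data)


def extract_tag_text(html, tag):
    """Return the text content of each <tag> element in the raw HTML."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical page: the visible title is only set once the script runs.
SAMPLE_HTML = """
<html><body>
  <h1>Loading...</h1>
  <script>document.querySelector("h1").textContent = "Built by JavaScript";</script>
</body></html>
"""

titles = extract_tag_text(SAMPLE_HTML, "h1")
print(titles)
```

A browser would show "Built by JavaScript" for this page, while the parser only sees the placeholder from the raw document. The same helper would work on HTML captured after rendering, if a tool that actually runs the page's JavaScript is used to produce it.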
Kilian, I've just tested UltimatePDF and it can print the page I want to save, but the website's cookie banner shows up in the output, so I have to find out how to send cookies with the request and which ones to send :) Maybe this is going to work ;-) And if it works, I have to find out how to get the data I want out of the PDF. Nice to experiment with, it's not a real customer test case (yet).
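As a rough sketch of the "send cookies in the request" idea, this is how a `Cookie` header can be attached to a plain HTTP request in Python. This is not UltimatePDF's API, and whether that component accepts cookies this way is not something this thread confirms. The URL and the cookie name `cookie_consent` are assumptions; the real name and value have to be read from the browser's developer tools after dismissing the banner.

```python
from urllib.request import Request


def build_request(url, cookies):
    """Return a GET request that carries the given cookies in a Cookie header."""
    # A Cookie header is a list of name=value pairs separated by "; ".
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header})


# Hypothetical URL and cookie name, for illustration only.
request = build_request(
    "https://example.com/page",
    {"cookie_consent": "accepted"},
)
print(request.get_header("Cookie"))

# To actually send it (requires network access):
# from urllib.request import urlopen
# html = urlopen(request).read().decode("utf-8", errors="replace")
```

Note that a request like this still returns the raw HTML without running any JavaScript; it only shows the shape of the header a rendering tool would need to send.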
Have you read this article on web scraping?
https://www.outsystems.com/blog/posts/web-scraping-tutorial/
I think this will help you figure out a solution to your problem.
Hi Ravi, I will definitely dive into this article to see if this is the solution!