[Text and HTML Processing] How do I use Text and HTML Processing? I tried it for web scraping but it failed. Please help


Mohamad Yusup Dias Ibrahim

Question

Reactive

Forge

HTML

Web

Application Type

Reactive

I want to scrape each product's name and link, plus the shop's name and link, from this website (https://www.bukalapak.com/c/perawatan-kecantikan/makeup-bibir?page=2&search%5Brating_gte%5D=4&search%5Btop_seller%5D=1). I fetched the HTML with a REST integration, and that part works. But when I tried to parse the text with SelectHtmlText from the Text and HTML Processing Forge asset, I got nothing. For example, I want to scrape the information shown in the image with this selector:

#product-explorer-container > div > div.bl-flex-container > div.bl-flex-item.bl-product-list-wrapper > div > div:nth-child(2) > div:nth-child(3) > div > div:nth-child(1) > div > div > div.bl-product-card__description > div.bl-product-card__description-name > p > a

and it returned nothing. Please help me: how can I scrape information from this page, such as the shop name, the names of all the products, the price, and the town? Thank you.

Attachments: Oml files.

scrapingbukalapak.oml

29 Nov 2021

João Marques

MVP

Hi Mohamad,

It's very simple to do web scraping with OutSystems but several steps are needed: get the file, parse it to an HTML document, use the selector, etc.

You can follow a complete step-by-step tutorial, with screenshots, here.

After taking a quick look at your code, I would suggest the following:

To scrape the information for every product, first get the list of HTML elements, then loop through them and extract the details from each one (check the article for the example);
Use shorter selectors, perhaps a class is enough, rather than the full path;
Debug it and check:
- Whether the HTML you receive from the web service actually contains the info you need. This is not always the case: the information shown on screen may be loaded from another file by JavaScript, the page may be protected with a reCAPTCHA, etc.
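
To make the first point concrete outside of OutSystems, here is a minimal sketch in Python using only the standard library's html.parser. It is not how SelectHtmlText works internally; the sample markup, the `ProductNameParser` class, and the `extract_products` helper are all hypothetical, and the real page may be structured differently. The idea is the same though: find every element with the class, then read the name and link from each one.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup; the real page structure may differ.
SAMPLE_HTML = """
<div class="bl-product-card">
  <div class="bl-product-card__description-name">
    <p><a href="/p/lipstick-a">Lipstick A</a></p>
  </div>
</div>
<div class="bl-product-card">
  <div class="bl-product-card__description-name">
    <p><a href="/p/lip-tint-b">Lip Tint B</a></p>
  </div>
</div>
"""

# Void elements have no closing tag, so they must not change the nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ProductNameParser(HTMLParser):
    """Collects name and link of each <a> nested in an element with target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # > 0 while inside a matching element
        self.current = None   # the link currently being read, if any
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        if self.depth > 0:
            self.depth += 1
            if tag == "a":
                self.current = {"name": "", "link": attrs.get("href")}
        elif self.target_class in (attrs.get("class") or "").split():
            self.depth = 1

    def handle_data(self, data):
        if self.current is not None:
            self.current["name"] += data

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID:
            return
        if tag == "a" and self.current is not None:
            self.current["name"] = self.current["name"].strip()
            self.products.append(self.current)
            self.current = None
        self.depth -= 1


def extract_products(html, target_class):
    """Return one {'name', 'link'} dict per product link found."""
    parser = ProductNameParser(target_class)
    parser.feed(html)
    return parser.products


for product in extract_products(SAMPLE_HTML, "bl-product-card__description-name"):
    print(product["name"], product["link"])
```

In OutSystems the equivalent would be getting the list of elements for the class and looping over it with a For Each, as described in the article.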

Kind Regards,
João

29 Nov 2021


Mohamad Yusup Dias Ibrahim

On your second point (use shorter selectors): does a shorter selector give a different result than the full path?

29 Nov 2021

João Marques

MVP

Replying to Mohamad Yusup Dias Ibrahim's comment on 29 Nov 2021 08:34:07

I would use shorter selectors for several reasons:

Easy to understand
Easy to identify
Easy to test (for instance, if a single class is enough as a selector, you can simply check whether it exists in the response of the REST call, while checking whether the full path works is... harder)
More performant
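
As a rough illustration of the "easy to test" point, the check can be as small as the sketch below. It is plain Python, not OutSystems; `class_in_html` and both sample responses are hypothetical. It only looks for the class name inside quoted class attributes of the raw response text, which is enough to tell whether the server sent the product markup at all or only an empty shell that JavaScript fills in later.

```python
import re


def class_in_html(html: str, class_name: str) -> bool:
    """Return True if any quoted class attribute in the raw HTML lists class_name."""
    # Matches class="..." and class='...'; unquoted attribute values are not handled.
    for match in re.finditer(r"""class\s*=\s*(["'])(.*?)\1""", html, re.DOTALL):
        if class_name in match.group(2).split():
            return True
    return False


# Hypothetical responses: one with server-rendered products, one that is only
# a shell populated by JavaScript after the page loads.
SERVER_RENDERED = '<div class="bl-product-card__description-name"><p><a href="/p/1">A</a></p></div>'
JS_SHELL = '<div id="product-explorer-container"></div><script src="app.js"></script>'

print(class_in_html(SERVER_RENDERED, "bl-product-card__description-name"))
print(class_in_html(JS_SHELL, "bl-product-card__description-name"))
```

If the second case is what the REST call returns, no selector will find the products, short or long, because they are not in the response.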

29 Nov 2021

Miguel Antunes

Replying to Mohamad Yusup Dias Ibrahim's comment on 29 Nov 2021 08:34:07

Hi Mohamad,

João said it all. For the second point, the short answer is yes. A full-path selector is only viable when you're scraping a fairly static website: it is so specific that if some JavaScript changes the DOM content, you can easily fail to get the desired content.

In your example, follow what João said: get the list of items, then loop through them to get the details.
For the selector, try using only the class name: '.bl-product-card__description-name'. To verify it, open the browser's Inspect tools, search the DOM (CTRL+F) in the Inspect window, and paste that selector; you'll see that it picks the names of the products.
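
Here is a small sketch of why the full path is fragile. It uses Python's xml.etree.ElementTree and its limited XPath support as a stand-in for CSS selectors, so the syntax differs from what you paste into SelectHtmlText, and the `[@class='...']` test matches the whole attribute value rather than a single class token. The `BEFORE` and `AFTER` snippets and the `names` helper are hypothetical: they imitate a page before and after a script wraps the product in one extra div.

```python
import xml.etree.ElementTree as ET

NAME_CLASS = "bl-product-card__description-name"

# Hypothetical, well-formed stand-ins for the same page before and after
# a script inserts one extra wrapper <div>.
BEFORE = f"""
<div id="root">
  <div>
    <div class="{NAME_CLASS}"><p><a href="/p/lipstick-a">Lipstick A</a></p></div>
  </div>
</div>
""".strip()

AFTER = f"""
<div id="root">
  <div>
    <div class="wrapper">
      <div class="{NAME_CLASS}"><p><a href="/p/lipstick-a">Lipstick A</a></p></div>
    </div>
  </div>
</div>
""".strip()

# Full path: every level from the root is spelled out.
FULL_PATH = "./div/div/p/a"
# Short path: anchored only on the class, wherever it sits in the tree.
SHORT_PATH = f".//div[@class='{NAME_CLASS}']/p/a"


def names(markup, path):
    """Return the text of every element matched by path."""
    root = ET.fromstring(markup)
    return [element.text for element in root.findall(path)]


print(names(BEFORE, FULL_PATH))   # the full path works on the original layout
print(names(AFTER, FULL_PATH))    # one extra wrapper and it matches nothing
print(names(AFTER, SHORT_PATH))   # the class-based path still finds the link
```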

29 Nov 2021

Mohamad Yusup Dias Ibrahim

Miguel and Marques,
thanks a lot for the information. I tried it, and I got it. Problem solved, thanks. A shorter selector was the answer.

29 Nov 2021
