[Text and HTML Processing] How do I use Text and HTML Processing? I tried it for web scraping but it failed. Please help
text-and-html-processing
Forge asset by Leonardo Fernandes
Application Type
Reactive

I just want to scrape the product name, product link, shop name, and shop link from this website (https://www.bukalapak.com/c/perawatan-kecantikan/makeup-bibir?page=2&search%5Brating_gte%5D=4&search%5Btop_seller%5D=1). I fetched the HTML with a REST integration, and that part works. But when I tried to parse the text with the SelectHtmlText action from the Text and HTML Processing Forge asset, I got nothing. For example, I want to scrape the information shown in the image with this selector:

#product-explorer-container > div > div.bl-flex-container > div.bl-flex-item.bl-product-list-wrapper > div > div:nth-child(2) > div:nth-child(3) > div > div:nth-child(1) > div > div > div.bl-product-card__description > div.bl-product-card__description-name > p > a

and I got nothing. Please help me: how do I scrape information from this page, such as the shop name, the names of all these products, the price, and the town? Thank you.

Attachments: Oml files.

scrapingbukalapak.oml
2018-10-29 08-31-03
João Marques
 
MVP

Hi Mohamad,


It's very simple to do web scraping with OutSystems but several steps are needed: get the file, parse it to an HTML document, use the selector, etc.

You can follow a complete step-by-step tutorial, with screenshots, here.


After taking a quick look at your code, I would suggest the following:

  • To scrape the information for every product, first get the list of HTML elements (one per product), then loop through them and extract the information from each one (check the article for the example);
  • Use shorter selectors; a single class may be enough, rather than the full path;
  • Debug it and check:
    • Whether the HTML you are receiving from the web service has the info you need. This is not always the case: the information shown on the screen may be loaded from another file by JavaScript, the page may be protected with a reCAPTCHA, etc.
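
For readers who want to see the "get the elements, then loop" idea outside OutSystems, here is a minimal Python sketch using only the standard library. It is not how the Forge asset works internally. The HTML fragment is hypothetical, modelled on the class names in the selector from the question (the `bl-product-card` wrapper class, the product names and the links are made up), and `scrape_products` is an illustrative helper, so the real page may need adjustments.

```python
from html.parser import HTMLParser

# Hypothetical fragment modelled on the selector in the question;
# the real page markup may differ.
SAMPLE_HTML = """
<div class="bl-product-card">
  <div class="bl-product-card__description">
    <div class="bl-product-card__description-name">
      <p><a href="/p/lipstick-a">Lipstick A</a></p>
    </div>
  </div>
</div>
<div class="bl-product-card">
  <div class="bl-product-card__description">
    <div class="bl-product-card__description-name">
      <p><a href="/p/lip-tint-b">Lip Tint B</a></p>
    </div>
  </div>
</div>
"""

NAME_CLASS = "bl-product-card__description-name"
# Tags that never have a closing tag, so they must not affect the depth count.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class ProductNameParser(HTMLParser):
    """Collects the text and href of each link inside a NAME_CLASS element."""

    def __init__(self):
        super().__init__()
        self.products = []     # one dict per product link found
        self._depth = 0        # nesting depth inside a name element (0 = outside)
        self._in_link = False  # True while inside an <a> of a name element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._depth:
            self._depth += 1
            if tag == "a":
                self._in_link = True
                self.products.append({"name": "", "link": attrs.get("href")})
        elif NAME_CLASS in classes:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.products[-1]["name"] += data


def scrape_products(html):
    """Return a list of {'name', 'link'} dicts, one per product name link."""
    parser = ProductNameParser()
    parser.feed(html)
    return [{"name": p["name"].strip(), "link": p["link"]} for p in parser.products]


for product in scrape_products(SAMPLE_HTML):
    print(product["name"], product["link"])
```

The parser matches on a single class rather than on the full path, so it does not care how deeply the product cards are nested in the page.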


Kind Regards,
João

2023-08-28 11-33-39
Mohamad Yusup Dias Ibrahim

On your second point (use shorter selectors): does a shorter selector give a different result than the full path?

2018-10-29 08-31-03
João Marques
 
MVP

I would use shorter selectors for several reasons:

  • Easy to understand
  • Easy to identify
  • Easy to test (for instance, if a single class is enough as a selector, you can simply check whether it exists in the response of the REST call, while checking whether the full path works is... harder)
  • More performant
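
The "easy to test" point can be sketched in a few lines of Python (standard library only). This is just an illustration: `has_class` is a hypothetical helper, it only looks at double-quoted `class` attributes, and the sample strings stand in for the body returned by the REST call.

```python
import re


def has_class(html, class_name):
    """Return True if any double-quoted class attribute in `html`
    contains `class_name` as a whole class token."""
    for match in re.finditer(r'class\s*=\s*"([^"]*)"', html):
        if class_name in match.group(1).split():
            return True
    return False


# Hypothetical response bodies: one with server-rendered products,
# one where the product list would be filled in later by JavaScript.
rendered = '<div class="card bl-product-card__description-name"><p>Lipstick A</p></div>'
empty_shell = '<div id="product-explorer-container"></div>'

print(has_class(rendered, "bl-product-card__description-name"))
print(has_class(empty_shell, "bl-product-card__description-name"))
```

If the check fails on the raw response, no selector will find the element there, short or long, because the content is not in the HTML that was downloaded.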
2023-03-30 10-13-40
Miguel Antunes

Hi Mohamad,

João said it all. For the second point, the short answer is yes. The full selector is only viable when you're scraping a fairly static website: the full path is so specific that if some JavaScript changes the DOM content, you can easily fail to get the desired content.

In your example, follow what João said: get the list of items first, then loop through them to get the details.
For the selector, you can try using only the class name: '.bl-product-card__description-name'. Open the browser's Inspect tools, search the DOM (CTRL+F) in the Inspect window, and paste that selector; you'll see that it matches the product names.
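
To illustrate why the full path is fragile, here is a small Python sketch (standard library only). The two fragments are hypothetical and well-formed so that `xml.etree.ElementTree` can parse them; real pages usually need a proper HTML parser. The second fragment has one extra wrapper `div` (with a made-up `wrapper` class), which is the kind of change a script or a redesign can introduce.

```python
import xml.etree.ElementTree as ET

NAME_CLASS = "bl-product-card__description-name"

# Hypothetical markup before and after one extra wrapper <div> is added.
ORIGINAL = f"""
<div id="root">
  <div>
    <div class="{NAME_CLASS}"><p><a href="/p/1">Lipstick A</a></p></div>
  </div>
</div>
"""

CHANGED = f"""
<div id="root">
  <div class="wrapper">
    <div>
      <div class="{NAME_CLASS}"><p><a href="/p/1">Lipstick A</a></p></div>
    </div>
  </div>
</div>
"""


def find_by_full_path(markup):
    """Follow a fixed path from the root element; return the link text or None."""
    root = ET.fromstring(markup)
    link = root.find("./div/div/p/a")
    return link.text if link is not None else None


def find_by_class(markup):
    """Find the element by its class anywhere in the tree, then the link inside it."""
    root = ET.fromstring(markup)
    name_element = root.find(f".//*[@class='{NAME_CLASS}']")
    if name_element is None:
        return None
    link = name_element.find(".//a")
    return link.text if link is not None else None


print(find_by_full_path(ORIGINAL), find_by_full_path(CHANGED))
print(find_by_class(ORIGINAL), find_by_class(CHANGED))
```

The fixed path stops matching as soon as the nesting changes, while the class lookup still finds the link in both fragments.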

2023-08-28 11-33-39
Mohamad Yusup Dias Ibrahim

Miguel and Marques,
thanks a lot for the information. I tried 1, and I got it. Problem solved, thanks. A shorter selector is the answer.
