Best way to quickly batch process data
Application Type
Reactive

Hi,

I am working with a client who wants to build a sales forecasting tool using AI. Generating each model takes 2-15 minutes, depending on the sales data.

They want an initial 10,000 models.

How best to go about this? A Timer alone is just very unrealistic. I am thinking BPT, but will this just time out half the time? And what about the other applications they run that also use BPT during the day?

Even without the BPT question, at an average of 5 minutes each, 10,000 models takes 50,000 minutes, around 34 days, which is unfeasible. And that's just the initial creation, too.

Any ideas? Can I stagger BPT instances? Or is there some way to run a maximum of, say, 8 at once so other applications can use them?

Any responses would be hugely appreciated!

2022-08-03 04:32:50
Ravi Punjwani

Hi Aaron,

A quick look through the different options available for async processing gave me this comparison for reference.

In your situation, 50,000 minutes would translate into the following at a bare minimum (considering Timers only, with 1 Front-end server):

  • 50,000 minutes ÷ 3 in parallel per Front-end server ÷ 60 mins ÷ 24 hours ≈ 11.5 days, to start with.
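The arithmetic above can be sketched as a back-of-envelope estimate. This is just the thread's figures (3 Timer slots, 10 Process activities, 20 Light Process activities per Front-end server) plugged into a formula; it assumes perfectly parallel, uninterrupted processing, which real workloads won't achieve:

```python
# Back-of-envelope throughput estimate for 10,000 models at ~5 minutes each,
# using the per-Front-end-server parallelism limits mentioned in the thread.
TOTAL_MINUTES = 10_000 * 5  # 50,000 minutes of sequential work

def days_to_finish(total_minutes, parallel_slots, frontends=1):
    """Wall-clock days, assuming perfectly parallel, uninterrupted processing."""
    wall_minutes = total_minutes / (parallel_slots * frontends)
    return wall_minutes / 60 / 24

print(f"Timers (3 slots):           {days_to_finish(TOTAL_MINUTES, 3):.1f} days")
print(f"Processes (10 slots):       {days_to_finish(TOTAL_MINUTES, 10):.1f} days")
print(f"Light Processes (20 slots): {days_to_finish(TOTAL_MINUTES, 20):.1f} days")
```

With one Front-end server this gives roughly 11.6, 3.5, and 1.7 days respectively; passing `frontends=2` halves each figure.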


Improving throughput:

  • Try to optimize the code so each model takes less time to process: optimize Aggregates and SQL queries. With careful optimization you may see significant improvements in execution time.
  • Categorise your models into two or three groups to decide which go into Timers and which run as a Process. Running the eligible models as Processes significantly increases your throughput, since Processes allow 10 activities per Front-end server. In the best case that's at least a 3x boost over Timers.
  • For very small models, adapt your code to support Light Processes. This can increase processing capacity by at least 6x (20 activities per Front-end server). If, say, only 10% of the models take less than 3 minutes, you can compress 5,000 minutes of their processing into just 250 minutes by handling them in this category.
  • Chunk multiple models into a single Timer run. Tune the Timer code to use its time budget: with the default 20-minute timeout, aim to stop picking up new work once roughly 60% of the allowed time is used. If a model finishes with, say, 8 minutes of budget left, the Timer can pick up another small model that fits in those remaining 8 minutes. This lets smaller models piggyback on larger ones free of charge.
  • Add another Front-end server to your environment. That is a straight 2x.
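The "multiple models in a single Timer run" idea above can be sketched as a time-budgeted loop. This is only an illustration of the pattern, not OutSystems code: `model_queue`, `process_model`, and `estimate_minutes` are hypothetical names, and the 20-minute timeout and 60% safety budget are the assumptions from the bullet above:

```python
import time

TIMEOUT_MINUTES = 20   # default Timer timeout (assumption from the thread)
SAFETY_BUDGET = 0.6    # stop taking new work beyond ~60% of the timeout

def run_timer_batch(model_queue, process_model, estimate_minutes):
    """Process queued models until the next one no longer fits the time budget.

    model_queue      -- list of pending model ids (hypothetical)
    process_model    -- callable that generates one model (hypothetical)
    estimate_minutes -- callable returning an estimated duration for a model
    """
    start = time.monotonic()
    budget_seconds = TIMEOUT_MINUTES * SAFETY_BUDGET * 60
    done = []
    while model_queue:
        elapsed = time.monotonic() - start
        next_model = model_queue[0]
        # Only start the next model if its estimate still fits the budget.
        if elapsed + estimate_minutes(next_model) * 60 > budget_seconds:
            break
        process_model(model_queue.pop(0))
        done.append(next_model)
    return done  # unfinished models stay queued for the next Timer wake-up
```

Unprocessed models simply wait for the next scheduled run, so no single run ever risks hitting the hard timeout.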

I personally don't have much experience with BPT, but I have worked extensively with Timers, and they work great most of the time. However, the processing capacity offered by Processes and Light Processes could seriously make a big impact in your situation.

What is your expectation regarding processing time for whatever solution you choose? That is, in how much time should the 10,000 models be processed?

Aaron Gordon

Ok, it seems BPT is probably the way to go, along with optimisation and compartmentalising more of the process so each step doesn't exceed 5 minutes.

Well, the client wants to start testing ASAP, so realistically anything more than a few days is a bit much. Especially as they'll want to push to prod soon, and therefore repeat the process.

Thanks so much for your long detailed answer, really appreciated!

2022-08-03 04:32:50
Ravi Punjwani

You're welcome @Aaron Gordon. Glad I was able to help you with this. Please mark it as the answer if it resolved your initial query.

By the way, any good progress on your project? It's been a few days since your reply. I would love to know whether you got any results that could be insightful for others too. It seems like an interesting project to work on.
