OutSystems allows processing events at ridiculous speeds! Our laboratory tests revealed that the pipeline can handle impressive amounts of sustained throughput.
Not only that, but our customers have started to deliver solutions that can batch process peaks of 400,000 payments in minutes.
These are so-called light processes, which can only run faster if we skip all the database interaction that long-running processes require. They are specifically suited to track events that are fast to process.
But it wasn’t always so. Five years ago, OutSystems launched Business Process Technology (BPT) as a pragmatic approach to process modeling and execution. At the time, BPMN had inspired the notation, and BPEL had inspired the execution.
BPT allowed customers to model and execute long-running processes where activities could wait for minutes, hours, or months until a given event occurred. Problem solved, right? Not so fast. Our customers had other ideas. So, we had to act. Here’s how it all went down.
Looking Under the Covers
Processes can advance whenever database entities are updated. So, at the time, we introduced the concept of “insert” and “update” events for specific database entities. This way, processes could wait for specific instances of a process or activity by specifying the primary key of the entity they were waiting for.
These events are captured using database triggers that are attached to the “insert” and “update” statements, adding new entries to a queue that is also in the database. These queued entries are then fetched by multiple schedulers that compete to process these entries in the context of a process definition.
This simplified the coding of fast reaction processes—processes that react to human activities in less than a second—because then developers only needed to define the entity they are waiting for to release a specific instance. This resulted in great joy for OutSystems developers.
We were caught by surprise when we learned that some of our customers were using BPT for another purpose: large-scale batch-processing and also event broker systems.
In Real Life: Creative Usage of BPT
With the original BPT model, the simplest process for handling an event creates three application server requests and nine database transactions. For the scenarios we had envisioned, this capability was more than acceptable.
Reality soon kicked in, and customers quickly turned up the volume. They found themselves dealing with more events than we had planned for, reaching or even surpassing 500.000 daily events on average!
BPT was not designed for such loads, leading to a performance degradation that in worst-case scenarios, could bring the whole system to a screeching halt.
The BPT activity identifier was a 32-bit integer. The customers who were consuming over 500.000 identifiers per day (and growing) would soon not be able to maintain the system for the number of years they had planned for.
Once we were aware of this issue, we looked into it, and we realized that the performance problem led some customers to avoid BPT altogether. This was disappointing because BPT’s expressiveness is the perfect fit for such event broker patterns.
Revising the Event Handling Pipeline
We revisited our entire event handling pipeline to search for bottlenecks and critical issues, and we realized that the scheduler query was taking a long time to dequeue and lock events in the database.
This query is one of the most complicated in the OutSystems platform. The scheduler service (inside each front-end server) uses it to gather events that the corresponding application server will then process.
Originally, this was a query that captured all the available events followed by a cycle that iterated the query’s results trying to lock each row. If the row were already locked, it would be skipped; otherwise, the row was marked to be fetched and collected. This cycle finished once it collected the first 10 or 20 rows.
Conceptually it was correct, and most of the time the database optimizer did a good job of using the appropriate strategy to optimize this query. But once the queue grew to a few hundred thousand rows, this was no longer true. The optimizer would frequently choose an inefficient strategy, and row locks were appearing for too many rows and even escalating to table locks.
Revisiting the Query
The first measure we took was to split the query into smaller and simpler ones. So, the first is a query to populate a temporary table that synthesizes the available modules (eSpaces) for a specific server using the “zone” configuration. The result is the delivery of a small set of identifiers, and its execution is trivial for the database optimizer.
Next, we focused on the queue table. The records of the previous tables were already placed into temporary tables, simplifying the optimizer work at this step.
The critical part here is getting the rows on the top of the queue using “updlock, rowlock, readpast” hints for the table that represents the queue. So we make sure the query is protected with a SELECT TOP(@batchsize) to lock only the rows you really want to fetch:
- "rowlock" forces row level locks instead of page or table locks.
- "updlock" locks each returned row.
- "readpast" skips all previously locked rows.
Once the rows are locked, we update them in a different query:
And finally, the results are returned with the joins we need from other tables with a trivial query.
This solved most of the scalability problems. We were finally able to execute a very fast query with only the number of returned events as a dependency.
Revisiting the Identifier: Defining an Event Pipeline
But we still had to deal with the identifier exhaustion problem. Changing the database metamodel was not an option due to the upgrade costs and risks. The OutSystems customers who depend on BPT have a zero downtime policy for such applications. A substantial database upgrade would violate that policy.
So, we needed to design a different approach. We ended up defining an entire event pipeline for the kind of process that was flooding the scheduler.
In the new pipeline, these processes are triggered by an “insert” entity event and are handled by the single automatic activity inside the process. Such processes are so simple that they don’t need to be tracked by our process metamodel. There are no sequences, no decisions, no wait activities, no human activities involved.
This new pipeline uses a new queue of events that is not restricted to the 32 identifier limitation. Each event only triggers a single application request to handle the code defined inside the automatic activity in a single database transaction.
Customer Scenario: Taking BPT All the Way Up to 11
Let’s take a look at a specific customer scenario so you can see how this works.
One of our customers needed to batch-process peaks of 400K payments in just a few minutes. Event-driven data processing with high parallelization is key in this situation to prevent the bottleneck of pooling payment requests and serializing the processing. This means that you need to add a scalable architecture on top of the light process execution.
The following picture depicts the solution that allows us to implement the massive batch processing.
- Files are sent from different commercial systems and are processed by a timer scheduled to run at regular intervals.
- Each file contains thousands of records that are inserted in bulk into a Payment table.
- For each set of 2,000 payment records, a single record is inserted into a bucket control table, setting the initial and final positions of the payments bound to the bucket.
At this point, a BPT process is fired, reacting to the bucket record insertion, and the payment records bound to the same bucket are iterated to validate and classify each payment (Clearing House).
The Clearing House then decides which payment channel should be used to optimize operation costs.
The process could be fired when the record is inserted in the Payment table, but the bucket strategy lets you balance between parallelism and overhead. Smaller bucket sizes increase parallelism and reduce the scope of potential processing failures. The tradeoff is that you get additional overhead from more requests, more database transactions, and more context gathering for each bucket.
The right balance must be achieved by trying, measuring, and iterating until you are happy with the results.
Finally, at the end of the processing period, a timer will group payments to direct them per channel.
To respect the maximum window of time required to respond to a certain peak, you can adjust your horizontal scaling options, such as number of front-end servers, timers per front-end server, and event processing threads per front-end server. In the example, you can see the results for fine-tuning the handling of 400,000 payments, when the requirement is to process all the payments in less than half an hour. With minimal infrastructure—values above the red line—it took us 89 minutes to process the 400,000 requests.
After some tests of the scalable options (see values below the red line), by duplicating the number of front-ends dedicated to each phase and increasing the number of threads, we again reduced the processing time to a mere 14 minutes.
Light Events at Work
OutSystems processes are designed to model and execute long-running activities where a specific event can take weeks or months to occur. This lighter execution mode is useful for scaling database queueing in the case of large-scale event-driven processes that handle several thousands of events processed per day. To learn how this works, read our documentation.