Google Cloud BigQuery Batch Load

Stable version 1.0.0 (Compatible with OutSystems 11)

Google Cloud BigQuery Batch Load Component Documentation


This document provides a detailed overview and guide for using the Google Cloud BigQuery Batch Load OutSystems component. This component facilitates efficient, scalable, and automated data ingestion into Google BigQuery, alongside robust query execution and storage management capabilities.


1. Introduction


The Google Cloud BigQuery Batch Load component is designed to simplify interactions between OutSystems applications and Google BigQuery. It leverages Google Cloud Storage for bulk data processing, allowing you to ingest large volumes of data cost-effectively. The component covers the entire lifecycle of data ingestion, from automatic dataset and table creation to monitoring and cleanup of processed files.

A dedicated Test and Monitor application is included to assist with configuration, testing, and ongoing management of your BigQuery operations.


2. Core Functionalities & Use Cases


This component is ideal for scenarios requiring:

  • High-volume, cost-effective data loading: Utilize BigQuery's free batch loading tier for large JSON datasets.

  • Automated schema management: Datasets and tables are created automatically based on your input data.

  • Flexible query execution: Run various SQL queries against your BigQuery data.

  • Simplified storage management: Control and clean up files in your Google Cloud Storage bucket.

  • Monitoring and error handling: Track the status of your data ingestion jobs.


3. Component Structure


The component exposes several public service and server actions, grouped below by function: batch file processing, ingestion monitoring, query execution, and BigQuery/storage management.


4. Getting Started: Installation and Configuration


After installing the component from Forge, you will have access to the BigQueryBatchLoad_IS module and the BigQueryBatchLoadManager application.

The BigQueryBatchLoadManager application serves as your control panel for configuring and testing the component. Navigate to the Status tab to begin.

4.1. Configuration


On the Status screen, you will configure and test your Google Cloud Project details.


Key Configuration Fields:

  • Google Project Id: Your Google Cloud Project ID.

  • Service Account Private Key: The private key of your Google Cloud service account.

  • Service Account Email: The email address of your Google Cloud service account.

  • Token Life Span in Seconds: Duration for authentication tokens.

  • Google Storage Bucket Name: The name of the Google Cloud Storage bucket used for staging JSON files.

  • Storage Upload Timeout: Timeout for uploading files to the bucket.

  • Remove Logs & Older Unsuccessful Records After (X Days): Defines the retention period for logs and unsuccessful records.

Important Permissions: Ensure your Google Service Account has the necessary permissions as outlined in the "Notes" section of the configuration screen, including permissions to view/list, create, and delete storage files, and to create/insert data into BigQuery datasets and tables.
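To illustrate how the Service Account Email, Service Account Private Key, and Token Life Span settings fit together, the sketch below shows the claim set used in Google's OAuth 2.0 service-account flow, which is what a component like this signs with the private key to obtain access tokens. This is an illustration of the flow, not the component's actual internals; the function and constant names are assumptions.

```python
import time

TOKEN_LIFESPAN_SECONDS = 3600  # mirrors "Token Life Span in Seconds"

def build_jwt_claims(service_account_email, now=None):
    # Claim set for Google's OAuth 2.0 service-account flow; these claims
    # are signed with the service account's private key and exchanged for
    # an access token at the "aud" endpoint.
    now = int(time.time() if now is None else now)
    return {
        "iss": service_account_email,  # who is requesting the token
        "scope": "https://www.googleapis.com/auth/bigquery "
                 "https://www.googleapis.com/auth/devstorage.read_write",
        "aud": "https://oauth2.googleapis.com/token",
        "iat": now,
        "exp": now + TOKEN_LIFESPAN_SECONDS,  # token life span
    }
```

A shorter token life span reduces exposure if a token leaks, at the cost of more frequent token refreshes.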


5. Main Process: Sending JSON Files to BigQuery


The primary method for ingesting data is GCBigQueryFileProcess. This action handles sending your JSON data to Google Cloud Storage, which then triggers the import process into BigQuery.


Action: GCBigQueryFileProcess (Service Action)

  • Description: Uploads a JSON array file to the configured Google Cloud Storage bucket. BigQuery then automatically imports this file, creating the dataset and table if they don't already exist. The automatically created datasets and tables are configured with case-insensitive collation for text attributes, simplifying search and filtering.

  • Input Parameters:

    • DatasetName (Text): The name of the BigQuery dataset.

    • TableName (Text): The name of the BigQuery table.

    • JSONFile (Text): A JSON array string (e.g., [{"name":"Name 1"},{"name":"Name 2"}]).

    • TotalRecords (Long Integer): The count of records in the JSON array, primarily for logging and monitoring purposes.

  • Output Parameters:

    • FileGuid (Text): A unique identifier for the uploaded file, useful for monitoring its status.

    • ErrorMessage (Text): Returns an error message if the process fails.

Important Note: The component fully manages the cleanup of successfully imported files from the Google Cloud Storage bucket, based on the configured retention period. Unsuccessfully processed files are automatically retried and resent until they are successfully processed.
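The JSONFile and TotalRecords inputs above can be prepared in any language before calling the action; a minimal Python sketch (the helper name is hypothetical, not part of the component):

```python
import json

def prepare_batch_payload(records):
    """Build the JSONFile and TotalRecords inputs for GCBigQueryFileProcess.

    `records` is a list of dicts; BigQuery infers the table schema from the
    keys, so every record should use consistent field names and types.
    """
    return json.dumps(records), len(records)

json_file, total_records = prepare_batch_payload(
    [{"name": "Name 1"}, {"name": "Name 2"}]
)
```

Keeping TotalRecords in sync with the actual array length, as above, makes the monitoring dashboard's record counts trustworthy.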


6. Monitoring Data Ingestion


Once a file is sent via GCBigQueryFileProcess, you can monitor its status using the GCBigQueryFileStatus action and the Monitoring tab in the manager application.

Action: GCBigQueryFileStatus (Service Action)

  • Description: This action queries the status of a specific BigQuery load job associated with a FileGuid. It provides detailed feedback on whether the file has been processed, if any errors occurred, and allows for immediate cleanup of the source file from the Google Cloud Storage bucket upon successful import.

  • Input Parameters:

    • FileGuid (Text): The unique identifier returned by the GCBigQueryFileProcess action. This is the key to tracking your file.

    • DeleteIfProcessed (Boolean): If set to True, this action will immediately delete the corresponding file from the Google Cloud Storage bucket after confirming it has been successfully processed by BigQuery. This is a highly efficient way to manage storage, as it cleans up files as they complete, rather than waiting for a scheduled bulk cleanup.

  • Output Parameters:

    • IsProcessed (Boolean): Returns True if the file has been successfully imported into BigQuery.

    • JobErrors (Text): If the import failed, this output contains the specific error message from the BigQuery job (e.g., schema mismatch, data format error).

    • FileNotFound (Boolean): Returns True if the file corresponding to the provided FileGuid could not be located.

    • ErrorMessage (Text): Returns a general error message if the status check action itself fails (e.g., due to an authentication issue).
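A typical consumer of these outputs polls until the load job settles. The sketch below shows one such loop; `check_status` is a placeholder for a wrapper around the GCBigQueryFileStatus action (the wrapper, its dict interface, and the default intervals are assumptions, not part of the component):

```python
import time

def wait_until_processed(check_status, file_guid,
                         poll_seconds=30, max_attempts=20):
    """Poll until a staged file is imported into BigQuery.

    `check_status` stands in for a call to GCBigQueryFileStatus and must
    return a dict with the action's IsProcessed, JobErrors, and
    FileNotFound outputs.
    """
    for _ in range(max_attempts):
        status = check_status(file_guid, delete_if_processed=True)
        if status["FileNotFound"]:
            raise LookupError(f"no file found for GUID {file_guid}")
        if status["JobErrors"]:
            raise RuntimeError(f"load job failed: {status['JobErrors']}")
        if status["IsProcessed"]:
            return True   # imported; source file already deleted from the bucket
        time.sleep(poll_seconds)
    return False          # still pending after max_attempts checks
```

Passing DeleteIfProcessed = True on every check, as here, keeps the bucket clean without waiting for the scheduled cleanup job.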


7. Execute Queries


The component provides two distinct actions for running SQL queries against your BigQuery datasets, catering to different timeout requirements.


7.1. GCBigQueryExecuteQueryService (Service Action)


  • Description: Ideal for quick, synchronous queries that are expected to complete within the standard platform timeout. This action runs with a timeout capped at 100 seconds.

  • Inputs: SQLQuery (Text)

  • Outputs: ResultJSON (Text), TotalRows (Integer), ProcessingTimeMs (Long Integer), ErrorMessage (Text)


7.2. GCBigQueryExecuteQuery (Public Server Action)


  • Description: Designed for long-running or complex queries. As a public server action, it allows you to specify a custom timeout that can exceed 100 seconds, preventing platform timeout exceptions for heavy analytical workloads.

  • Inputs: SQLQuery (Text), Timeout (Integer)

  • Outputs: ResultJSON (Text), TotalRows (Integer), ProcessingTimeMs (Long Integer), ErrorMessage (Text)

The Execute tab in the management application provides a simple UI to run ad-hoc queries using this functionality.
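The choice between the two query actions comes down to the expected runtime. A hypothetical routing helper (not part of the component) makes the rule explicit:

```python
SERVICE_ACTION_LIMIT_SECONDS = 100  # cap of GCBigQueryExecuteQueryService

def pick_query_action(expected_seconds):
    """Return the action name to call and the Timeout input to pass
    (None when the action has no Timeout input)."""
    if expected_seconds <= SERVICE_ACTION_LIMIT_SECONDS:
        return "GCBigQueryExecuteQueryService", None
    return "GCBigQueryExecuteQuery", expected_seconds
```

Both actions return the same outputs (ResultJSON, TotalRows, ProcessingTimeMs, ErrorMessage), so callers can switch between them without changing downstream handling.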


Other Service Actions


The component includes several other utility actions for comprehensive management of your BigQuery and Google Cloud Storage resources.


BigQuery Actions


  • GCBigQueryListDatasets: Lists all datasets within your configured Google Cloud Project.

  • GCBigQueryListTables: Lists all tables within a specified dataset.

The Datasets & Tables tab in the management app uses these actions to allow you to browse your BigQuery structure.


Storage Actions


  • GCStorageFileList: Lists all files currently in the configured Google Cloud Storage bucket.

  • GCStorageFileDownload: Downloads a specific file from the bucket.

  • GCStorageFileDelete: Deletes a specific file from the bucket.

  • GCBatchLoadFileDelete: Deletes a file record from the component's internal logging/tracking tables.

  • GCBatchLoadCleanUp: A batch job action that cleans up old log records and processed files from the bucket immediately, ignoring the retention policy set in the configuration.
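The scheduled, retention-based cleanup described in the configuration (as opposed to GCBatchLoadCleanUp's immediate sweep) can be sketched as a simple filter. This is an illustration of the policy, not the component's code; the file-dict keys are assumptions:

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 7  # stands in for "Remove Logs & Older Unsuccessful Records After (X Days)"

def files_eligible_for_cleanup(files, now=None):
    """Pick bucket files the scheduled cleanup may remove: successfully
    processed files, plus unsuccessful ones older than the retention window.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=RETENTION_DAYS)
    return [f["name"] for f in files
            if f["processed"] or f["created"] < cutoff]
```

Files that fail within the retention window are deliberately kept so that the retry logic can resend them.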



Management Application Guide (BigQueryBatchLoadManager)


The Google Cloud BigQuery Batch Load component includes a powerful management application that serves as your central hub for configuration, testing, and monitoring. This guide explains the functionality of each screen available in the app menu.



Status ✅


This is the main landing and configuration page. It provides an at-a-glance health check of your setup and is where you'll input your Google Cloud credentials.

Key Functions:

  • Configuration Status: Confirms that all necessary configuration fields have been filled.

  • Authentication Test: Shows whether the provided credentials are valid and can successfully authenticate with your Google Cloud project.

  • Configuration Form: This is where you set up the connection. You must provide your Google Project ID, Service Account Email, Service Account Private Key, and the Google Storage Bucket Name that will be used for staging files.

  • Settings: You can also configure operational parameters like token lifespan, upload timeouts, and the retention period for old logs and files.



Test 🧪


The Test screen provides a simple, two-step interface to perform a live, end-to-end test of the data ingestion pipeline, from sending a JSON file to checking its import status.

How to Use:

  1. Send: In the first panel, specify a Dataset Name and Table Name. Paste your sample JSON File Data (as a JSON array) into the text area and enter the Total Records count. Click Send & Process. This action uploads the data to your bucket and initiates the BigQuery load job, returning a unique File GUID.

  2. Check: In the second panel, paste the File GUID you received from the send step. Click Check Status to query the real-time status of the import process.



Monitoring 📊


This screen is your central dashboard for observing all data ingestion jobs. It gives you full visibility into the files being processed and provides a history of completed jobs (deleted files).

Key Features:

  • Active Files / Deleted Files Toggle: Switch between two views:

    • Active Files: Shows all files currently being processed or awaiting processing. You can see the file's GUID, target dataset/table, and status. This view is essential for tracking ongoing jobs.

    • Deleted Files: Provides a historical log of jobs that have been processed and whose source files have been cleaned up from the storage bucket.

  • Clean-Up Now: A manual trigger that runs the cleanup logic to remove all successfully processed files from your storage bucket immediately.

  • Refresh: Updates the dashboard with the latest status of all jobs.



Storage 🗄️


The Storage screen gives you direct access to view and manage the files within your configured Google Cloud Storage bucket.

Key Functions:

  • File Listing: Displays a list of all files currently in the staging bucket, along with their size and creation/update timestamps.

  • File Actions: For each file, you have two options:

    • Download (cloud icon): Download a copy of the file directly.

    • Delete (trash can icon): Manually delete the file from the bucket.

  • Note: This screen also reminds you of the automatic cleanup logic that removes files after a successful import or after they exceed the configured age for unsuccessful imports.



Execute ⚡


This screen provides a simple but powerful tool for running ad-hoc SQL queries directly against your BigQuery project from within OutSystems.

Key Functions:

  • Query Editor: A text area where you can write and edit your SQL queries.

  • Timeout Setting: Allows you to specify a custom timeout in seconds. This is crucial for running complex or long-running queries that might otherwise exceed standard platform limits.

  • Execute Button: Runs the query and displays the results.



Datasets & Tables 🧬


This screen acts as a simple browser for your BigQuery project, allowing you to see the datasets and tables that exist. It's a great way to quickly verify that the component is creating your schemas as expected.

Key Functions:

  • Dataset List: On the left, it lists all the datasets available in your configured project.

  • Table List: When you select a dataset, the panel on the right populates with a list of all the tables contained within that dataset.