Word Document Text Extractor

Stable version 1.0.0 (Compatible with OutSystems 11)
Uploaded on 25 October 2024
5.0 (1 rating)

Word Document Text Extractor for OutSystems

Overview

The Word Document Text Extractor is a service extension for OutSystems that extracts text from .docx files. Built with the OpenXML SDK, it reads the content and structure of a Word file and returns plain text with line breaks and paragraph structure preserved. This lets you use document-based data in OutSystems applications without additional software dependencies such as Microsoft Word.

Key Features

  • Format-Preserved Text Extraction: Captures all text from Word files, preserving line breaks and paragraph structures.
  • OutSystems Integration: Designed for seamless integration into OutSystems workflows, scripts, and service actions.
  • Efficient Server Processing: Operates on the server side using OpenXML SDK, optimized for high performance and compatibility.
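
To illustrate what this kind of extraction involves, the sketch below reads a .docx package directly with the Python standard library. It is not the component's implementation (the component uses the OpenXML SDK on the server); the function name extract_text_sketch is hypothetical, and the sketch assumes that only body paragraphs, text runs, and explicit line breaks matter.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = f"{{{W_NS}}}p"    # paragraph
W_R = f"{{{W_NS}}}r"    # run (a span of text with uniform formatting)
W_T = f"{{{W_NS}}}t"    # text inside a run
W_BR = f"{{{W_NS}}}br"  # explicit break inside a run


def extract_text_sketch(file_content: bytes) -> str:
    """Return the body text of a .docx, one output line per paragraph.

    Illustrative only: headers, footers, footnotes, and comments live in
    other parts of the package and are ignored here. Paragraphs nested in
    text boxes would appear twice.
    """
    # A .docx file is a ZIP package; the main body is word/document.xml.
    with zipfile.ZipFile(io.BytesIO(file_content)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(W_P):
        parts = []
        for run in paragraph.iter(W_R):
            for node in run:
                if node.tag == W_T:
                    parts.append(node.text or "")
                elif node.tag == W_BR:
                    parts.append("\n")  # keep explicit breaks as newlines
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)
```

Joining paragraphs with a newline is one way to keep the paragraph structure visible in plain text; the separator the component itself uses is not stated in this documentation.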

Requirements

  • OutSystems Version: Compatible with OutSystems 11 and later versions.
  • Dependencies: Requires the OpenXML SDK to be referenced in your OutSystems environment.

Installation Guide

  1. Download the Extension:

    • Go to the OutSystems Forge and download the Word Document Text Extractor component.
  2. Add to Your Application:

    • In Service Studio, open the application where you want to use this component.
    • Go to the Manage Dependencies panel and add the extension to your module.
  3. Reference in Logic:

    • In the Logic tab, drag and drop the ExtractTextFromDocx action into the flow where you want to use it.

How to Use

  1. Input Requirements:

    • fileContent (Binary Data): Pass the .docx file content in binary format. You can obtain this from an uploaded file, database, or other source.
  2. Output:

    • The action returns a text string containing all text extracted from the Word file, with paragraph breaks kept for readability.
  3. Example Usage:

    • Use this extension in a server action that receives a file upload.
    • Pass the file’s binary content to the ExtractTextFromDocx action.
    • Retrieve the extracted text and, if needed, store it in a database, display it in the UI, or use it for further processing.
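
The flow described in the example usage can be sketched as follows. In OutSystems this logic is built visually in a server action, so the Python below is only an analogy: handle_upload, the extract_text callable (standing in for the component's action), and the store dictionary (standing in for a database entity) are all hypothetical names.

```python
from typing import Callable, Dict


def handle_upload(
    file_name: str,
    file_content: bytes,
    extract_text: Callable[[bytes], str],
    store: Dict[str, str],
) -> str:
    """Hypothetical upload flow: check the input, extract text, persist it."""
    # Reject obviously unsupported uploads before calling the extractor.
    if not file_name.lower().endswith(".docx"):
        raise ValueError(f"Unsupported file type: {file_name}")
    if not file_content:
        raise ValueError("Uploaded file is empty")

    text = extract_text(file_content)  # stands in for the component's action
    store[file_name] = text            # stands in for a database write
    return text                        # caller may display or process it
```

Passing the extractor in as a parameter keeps the sketch independent of any particular extraction library.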

API Reference

ExtractTextFromDocx

  • Description: Extracts text from a Word file with line breaks preserved.
  • Input Parameters:
    • fileContent: Binary Data – The Word file content in binary format.
  • Output:
    • ExtractedText: Text – The plain text extracted from the Word file, retaining paragraph formatting.

Error Handling

The method will return an error message if:

  • The file format is invalid (e.g., not a .docx).
  • The file content cannot be parsed due to corruption or other issues.
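
The two failure cases above can be illustrated with a small pre-check. This is a minimal sketch using the Python standard library, not the component's own validation; the function name check_docx and the wording of the messages are assumptions made for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def check_docx(file_content: bytes) -> str:
    """Return "" if the bytes look like a parseable .docx, else an error message."""
    try:
        # A .docx must be a readable ZIP package...
        with zipfile.ZipFile(io.BytesIO(file_content)) as archive:
            # ...that contains the main document part.
            xml_bytes = archive.read("word/document.xml")
    except zipfile.BadZipFile:
        return "Invalid file format: not a .docx (ZIP) package"
    except KeyError:
        # ZipFile.read raises KeyError when the member does not exist.
        return "Invalid file format: word/document.xml not found"

    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return f"Document content could not be parsed: {exc}"
    return ""
```

Returning a message instead of raising mirrors the behaviour described above, where the caller inspects the output for error details.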

Usage Scenarios

  1. Document Management Systems: Populate fields with text extracted from uploaded Word documents for easy data entry and content management.
  2. Search and Indexing Applications: Store and retrieve text for search and indexing, allowing content analysis and text-based searching.
  3. Data Analysis Workflows: Extract information for processing within automated data analysis workflows.

Troubleshooting

  • Invalid File Format: Ensure the file being passed is in .docx format. This component does not support other Word file types.
  • Error Messages: If errors occur, the output will include details. Verify that the input content is valid and in the correct binary format.
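
When diagnosing an invalid-format error, it can help to look at the first bytes of the input. The sketch below is a diagnostic aid written for this guide, not part of the component; sniff_word_format and its return strings are hypothetical. It relies on three well-known signatures: ZIP packages (which include .docx) start with "PK\x03\x04", legacy binary .doc files start with the OLE2 compound-file header, and a ZIP that has been Base64-encoded to text starts with "UEsDB".

```python
ZIP_MAGIC = b"PK\x03\x04"                           # ZIP local file header
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"    # OLE2 compound file (.doc)
BASE64_ZIP_PREFIX = b"UEsDB"                        # Base64 of the ZIP header


def sniff_word_format(file_content: bytes) -> str:
    """Guess what kind of data was passed, based on its leading bytes."""
    if file_content.startswith(ZIP_MAGIC):
        return "zip package (possibly .docx)"
    if file_content.startswith(OLE2_MAGIC):
        return "legacy binary Word file (.doc)"
    if file_content.startswith(BASE64_ZIP_PREFIX):
        return "base64 text (decode to binary first)"
    return "unknown"
```

A ZIP signature alone does not prove the file is a .docx, since other formats such as .xlsx use the same container, so treat the result as a hint only.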

Support & Contributions

For issues, improvements, or contributions, please contact the developer or visit the component page on OutSystems Forge to submit suggestions.