Extract Text from PDF
Description
The Extract Text from PDF activity processes uploaded PDFs and extracts structured or free-form text using a combination of markers, regex extractors, and column mapping. This activity supports parsing structured documents such as invoices, contracts, or system-generated reports where field boundaries and patterns can be identified.
Unlike OCR-based activities, this activity assumes the PDFs are machine-readable, that is, their text layer is directly accessible.
Use case:
A vendor's PDF invoice is parsed by configuring regex extractors for “Invoice Number”, “Customer Name”, and “Amount Due”. The output is a structured table of extracted values, which is then processed by FillEmptyCells, validated, and sent to an ERP system.
Input
| Field | Required | Description |
|---|---|---|
| PDF Files | Yes | One or more PDF documents to extract content from. Must be machine-readable. |
Output
| Output Type | Format | Description |
|---|---|---|
| Data | Table | List of rows containing extracted key-value pairs. One row per document or data region. |
Configuration Fields
| Field Name | Description |
|---|---|
| Markers | List of start and end markers to segment sections of interest in the PDF. |
| Column Map | Maps the extracted values to specific columns (e.g., map “Invoice Total” to Amount). |
| Regex Extractors | List of regular expressions to extract text patterns like email, invoice number, or date. |
Sample Input
Not Applicable
Sample Configuration
| Field | Value |
|---|---|
| Markers | ["StartInvoiceDetails", "EndInvoiceDetails"] |
| Column Map | {"Amount Due": "Total", "Invoice No": "Invoice Number"} |
| Regex Extractors | ["(?i)Invoice\\s*No\\s*:\\s*([\\w-]+)", "(?i)Amount\\s*Due\\s*:\\s*(\\d+\\.\\d{2})"] |
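To illustrate how these three settings might interact, here is a minimal Python sketch. It is not the activity's actual implementation. It assumes that the markers delimit one region of text, that each regex extractor is associated with a field name (the sample lists bare patterns, so the names used as dictionary keys are hypothetical), and that the column map renames a field to its output column. The function name `extract_row` and the sample text are also made up for illustration.

```python
import re

# Hypothetical configuration modeled on the sample above. Raw strings replace
# the JSON-escaped backslashes, and the invoice pattern allows hyphens so that
# IDs such as INV-12345 are captured whole.
MARKERS = ("StartInvoiceDetails", "EndInvoiceDetails")
REGEX_EXTRACTORS = {
    "Invoice No": r"(?i)Invoice\s*No\s*:\s*([\w-]+)",
    "Amount Due": r"(?i)Amount\s*Due\s*:\s*(\d+\.\d{2})",
}
# Assumption: keys are extracted field names, values are output column names.
COLUMN_MAP = {"Amount Due": "Total", "Invoice No": "Invoice Number"}


def extract_row(text: str) -> dict:
    """Return one output row for the text layer of a single document."""
    start, end = MARKERS
    # Keep only the text between the markers; if either marker is missing,
    # fall back to searching the whole text.
    segment = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, re.DOTALL)
    region = segment.group(1) if segment else text

    row = {}
    for field, pattern in REGEX_EXTRACTORS.items():
        found = re.search(pattern, region)
        if found:
            # Rename the field if the column map has an entry for it.
            row[COLUMN_MAP.get(field, field)] = found.group(1)
    return row


sample_text = """Vendor header
StartInvoiceDetails
Invoice No: INV-12345
Amount Due: 500.00
EndInvoiceDetails
Page footer"""

print(extract_row(sample_text))
```

Under these assumptions the sketch yields one row per document, with the column names taken from the column map wherever a mapping exists.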
Sample Output
| Invoice No | Customer Name | Amount Due | Date |
|---|---|---|---|
| INV-12345 | Acme Corp | 500.00 | 2025-07-01 |
| INV-67890 | Beta Ltd. | 1200.50 | 2025-07-02 |
Output rows can be filtered, transformed, or routed in downstream activities such as Filter, Send Email, or GoogleSheetUpload.
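As a rough picture of what downstream filtering could look like, the Python sketch below keeps only the sample rows whose amount meets a threshold. It assumes the rows arrive as dictionaries of strings keyed by column name. The helper `filter_by_min_amount` is hypothetical and does not represent the Filter activity's real interface.

```python
from decimal import Decimal

# Rows copied from the sample output above, represented as dictionaries.
rows = [
    {"Invoice No": "INV-12345", "Customer Name": "Acme Corp",
     "Amount Due": "500.00", "Date": "2025-07-01"},
    {"Invoice No": "INV-67890", "Customer Name": "Beta Ltd.",
     "Amount Due": "1200.50", "Date": "2025-07-02"},
]


def filter_by_min_amount(rows: list, threshold: Decimal) -> list:
    """Keep rows whose 'Amount Due' is at least the threshold."""
    # Decimal avoids floating-point rounding surprises with currency values.
    return [row for row in rows if Decimal(row["Amount Due"]) >= threshold]


large_invoices = filter_by_min_amount(rows, Decimal("1000"))
print([row["Invoice No"] for row in large_invoices])
```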