Description
The Extract Ngrams activity processes text data by extracting sequences of words (n-grams) from a selected column. This is useful for tasks involving natural language processing (NLP), pattern recognition, or feature generation for machine learning.
N-grams are contiguous sequences of n items (typically words) in a sentence. For example:
- 2-gram (bigram): "great product"
- 3-gram (trigram): "fast delivery time"
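The definition above amounts to sliding a window of n words across the text. The following is a minimal Python sketch of that idea, not the activity's implementation: the function name `ngrams` and the lowercase, whitespace-based tokenization are assumptions made for illustration.

```python
def ngrams(text, n):
    """Return the contiguous n-word sequences found in `text`, in order."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    # Slide a window of n words across the token list; a text shorter
    # than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = ngrams("Great product and fast delivery time", 2)
trigrams = ngrams("Great product and fast delivery time", 3)
```

Here `bigrams` starts with "great product" and `trigrams` ends with "fast delivery time". A text of k words produces k - n + 1 n-grams.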
This activity provides options to remove stop words, apply stemming, and sort terms before forming n-grams.
Use case:
Extracting bigrams or trigrams from customer reviews, survey responses, or feedback fields for sentiment analysis or topic modeling.
Input
| Type | Description |
|---|---|
| Data | Textual data containing the column to process |
Output
| Type | Description |
|---|---|
| Transformed Data | Table with extracted n-gram tokens |
Configuration Fields
| Field Name | Required | Description |
|---|---|---|
| Column To Extract | Yes | Name of the column containing text to extract n-grams from. (Uses Previous Data Column Editor) |
| Output Method | Yes | Format for outputting extracted n-grams: One per row, One per column, or JSON |
| Output Column | Yes | Name of the column to store the resulting n-grams |
| Include Original | No | Whether to include the original columns in the output alongside the n-grams |
| Size | Yes | Number of words per n-gram (e.g., 2 = bigram, 3 = trigram) |
| Clear Stop Words | No | Remove common stop words (e.g., “the”, “is”, “and”) before generating n-grams |
| Stem Words | No | Reduce words to their root form before generating n-grams (e.g., “running” → “run”) |
| Sort Words | No | Sort words alphabetically within each n-gram (e.g., “product great” → “great product”) |
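The effect of the Clear Stop Words, Stem Words, and Sort Words options can be sketched in Python as follows. Everything here is hypothetical: the stop-word list is a small illustrative subset, `naive_stem` is a toy suffix stripper standing in for whatever stemmer the activity uses, and the order of the steps (stop words, then stemming, then sorting within each n-gram) is an assumption.

```python
import re

# Illustrative subset only; the activity's real stop-word list is not documented here.
STOP_WORDS = {"the", "is", "and", "was", "but", "a", "an", "of"}


def naive_stem(word):
    """Toy suffix stripper (assumption), not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "running" -> "runn" -> "run".
            if suffix != "s" and len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def extract_ngrams(text, size, clear_stop_words=False,
                   stem_words=False, sort_words=False):
    """Build n-grams of `size` words after the optional preprocessing steps."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if clear_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if stem_words:
        words = [naive_stem(w) for w in words]
    grams = []
    for i in range(len(words) - size + 1):
        window = words[i:i + size]
        if sort_words:
            window = sorted(window)  # alphabetical order inside the n-gram
        grams.append(" ".join(window))
    return grams


sorted_gram = extract_ngrams("product great", 2, sort_words=True)
stemmed = extract_ngrams("The service is running", 2,
                         clear_stop_words=True, stem_words=True)
```

With these assumptions, `sorted_gram` is `["great product"]` and `stemmed` is `["service run"]`. Removing stop words before forming n-grams means that words originally separated by a stop word become adjacent in the output.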
Sample Input
| ReviewID | ReviewText | Rating | Date | Reviewer |
|---|---|---|---|---|
| 101 | The product quality is amazing | 5 | 2024-02-01 | Alice |
| 102 | Great service and fast delivery | 4 | 2024-02-02 | Bob |
| 103 | The material is poor and fragile | 2 | 2024-02-03 | Charlie |
| 104 | Excellent support and great help | 5 | 2024-02-04 | David |
| 105 | Delivery was slow, but good item | 3 | 2024-02-05 | Emma |
Sample Configuration
| Field | Value |
|---|---|
| Column To Extract | ReviewText |
| Output Method | One per row |
| Output Column | ExtractedNgrams |
| Include Original | No |
| Size | 2 |
| Clear Stop Words | Yes |
| Stem Words | No |
| Sort Words | No |
Sample Output
| ExtractedNgrams |
|---|
| product quality |
| quality amazing |
| great service |
| service fast |
| fast delivery |
| material poor |
| poor fragile |
| excellent support |
| support great |
| great help |
| delivery slow |
| slow good |
| good item |
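The rows above can be reproduced with a short Python sketch. It lowercases each review, drops stop words, and pairs adjacent words, with no stemming step, which is what the listed rows reflect. The stop-word set and the shape of the JSON output (one JSON array per input row) are assumptions; the activity's actual list and JSON layout may differ.

```python
import json
import re

STOP_WORDS = {"the", "is", "and", "was", "but"}  # assumed stop-word list

reviews = {
    101: "The product quality is amazing",
    102: "Great service and fast delivery",
    103: "The material is poor and fragile",
    104: "Excellent support and great help",
    105: "Delivery was slow, but good item",
}


def bigrams(text):
    """Lowercase, drop stop words, then pair each word with its successor."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(words, words[1:])]


per_review = {rid: bigrams(text) for rid, text in reviews.items()}

# "One per row": every bigram becomes its own output row.
one_per_row = [gram for grams in per_review.values() for gram in grams]

# "JSON" (assumed shape): one JSON array of bigrams per input row.
as_json = {rid: json.dumps(grams) for rid, grams in per_review.items()}
```

`one_per_row` contains the same 13 values as the ExtractedNgrams column, in the same order. Bigrams such as "service fast" and "slow good" appear because stop words are removed before the pairs are formed.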
Enable the Sort Words and Stem Words options when generating normalized features for text clustering or classification, so that differences in word order and inflection map to the same n-gram.