Tokenizing Text

Description

This activity splits the text in the selected columns into tokens (words) and applies the transformations the user enables. The available transformations are removing stop words, stemming words, and sorting words.

Input

Data only

Output

Transformed data

Configuration Fields

  • Column Names: The columns containing text data to tokenize.
  • Option Mode: The tokenization mode.
    • Options: JSON, One token per row, One token per column
  • Output Column: The column where the tokenized text will be stored.
  • Include Original: If enabled, the input columns will be retained along with the transformed column.
  • Clear Stop Words: If enabled, common stop words (such as “and”, “the”, “is”) will be removed.
  • Stem Words: If enabled, words will be reduced to their root form (e.g., “running” → “run”).
  • Sort Words: If enabled, words in the column will be sorted alphabetically.
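
The three optional transformations can be sketched in plain Python. This is a minimal illustration, not the activity's actual implementation: the function names `tokenize` and `naive_stem`, the tiny stop word set, the toy suffix-stripping stemmer, and the order in which the steps run are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately small stop word list (assumption for illustration).
STOP_WORDS = {"a", "an", "and", "in", "is", "the", "this"}


def naive_stem(word):
    """Toy stemmer: strips "ing" or a plural "s". Real stemmers are more careful."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if stem[-1] == stem[-2]:  # collapse a doubled letter: "runn" -> "run"
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def tokenize(text, clear_stop_words=False, stem_words=False, sort_words=False):
    """Split text into word tokens, then apply the enabled transformations."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    if clear_stop_words:
        # Compare case-insensitively so "This" matches "this".
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem_words:
        tokens = [naive_stem(t) for t in tokens]
    if sort_words:
        tokens = sorted(tokens, key=str.lower)
    return tokens


print(tokenize("This is a sample text.", clear_stop_words=True))
# ['sample', 'text']
print(tokenize("running helps", stem_words=True))
# ['run', 'help']
print(tokenize("Tokenizing helps in NLP tasks.", clear_stop_words=True, sort_words=True))
# ['helps', 'NLP', 'tasks', 'Tokenizing']
```

The sketch preserves the original letter case and sorts case-insensitively; how the activity itself handles case is not specified here.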

Sample Input

| ID | TextColumn                     |
|----|--------------------------------|
| 1  | This is a sample text.         |
| 2  | Tokenizing helps in NLP tasks. |

Sample Configuration

[Image: screenshot of the sample configuration]

Sample Output

With One token per row mode

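| ID | TextColumn                     | TokenizedText |
|----|--------------------------------|---------------|
| 1  | This is a sample text.         | sample        |
| 1  | This is a sample text.         | text          |
| 2  | Tokenizing helps in NLP tasks. | tokenizing    |
| 2  | Tokenizing helps in NLP tasks. | helps         |
| 2  | Tokenizing helps in NLP tasks. | NLP           |
| 2  | Tokenizing helps in NLP tasks. | tasks         |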
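
One token per row mode behaves like an "explode" operation: each input row is repeated once for every token it produces. The following is a rough sketch in plain Python, assuming rows are represented as dictionaries. The helper names `simple_tokens` and `one_token_per_row` and the small stop word set are hypothetical, and the sketch keeps the original letter case.

```python
import re

# Hypothetical stop word list, just large enough for the sample input.
STOP_WORDS = {"a", "in", "is", "this"}


def simple_tokens(text):
    """Split text into words and drop stop words (case-insensitive match)."""
    return [t for t in re.findall(r"\w+", text) if t.lower() not in STOP_WORDS]


def one_token_per_row(rows, column, output_column, include_original=True):
    """Emit one output row per token; rows that yield no tokens are dropped."""
    out = []
    for row in rows:
        for token in simple_tokens(row[column]):
            new_row = dict(row)  # copy so the input rows are not modified
            if not include_original:
                del new_row[column]
            new_row[output_column] = token
            out.append(new_row)
    return out


rows = [
    {"ID": 1, "TextColumn": "This is a sample text."},
    {"ID": 2, "TextColumn": "Tokenizing helps in NLP tasks."},
]

for r in one_token_per_row(rows, "TextColumn", "TokenizedText"):
    print(r["ID"], r["TokenizedText"])
```

Dropping rows that produce no tokens is a choice made for this sketch; the activity may instead keep such rows with an empty value.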

With One token per column mode

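| ID | TextColumn                     | TokenizedText1 | TokenizedText2 | TokenizedText3 | TokenizedText4 |
|----|--------------------------------|----------------|----------------|----------------|----------------|
| 1  | This is a sample text.         | sample         | text           |                |                |
| 2  | Tokenizing helps in NLP tasks. | tokenizing     | helps          | NLP            | tasks          |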
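
One token per column mode spreads the tokens across numbered columns, and the number of columns follows the row with the most tokens. Below is a minimal sketch in plain Python under the same assumptions as before: rows are dictionaries, the helper names `simple_tokens` and `one_token_per_column` and the stop word set are hypothetical, and shorter rows are padded with empty strings.

```python
import re

# Hypothetical stop word list, just large enough for the sample input.
STOP_WORDS = {"a", "in", "is", "this"}


def simple_tokens(text):
    """Split text into words and drop stop words (case-insensitive match)."""
    return [t for t in re.findall(r"\w+", text) if t.lower() not in STOP_WORDS]


def one_token_per_column(rows, column, output_column):
    """Add output_column1..N to each row, where N is the longest token count."""
    tokenized = [simple_tokens(row[column]) for row in rows]
    width = max((len(tokens) for tokens in tokenized), default=0)
    out = []
    for row, tokens in zip(rows, tokenized):
        new_row = dict(row)  # copy so the input rows are not modified
        for i in range(width):
            # Pad rows that have fewer tokens than the widest row.
            new_row[f"{output_column}{i + 1}"] = tokens[i] if i < len(tokens) else ""
        out.append(new_row)
    return out


rows = [
    {"ID": 1, "TextColumn": "This is a sample text."},
    {"ID": 2, "TextColumn": "Tokenizing helps in NLP tasks."},
]

for r in one_token_per_column(rows, "TextColumn", "TokenizedText"):
    print(r)
```

Because the column count depends on the data, downstream steps that expect a fixed schema may be easier to build on the one token per row layout.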