[Text] Regular expression for data cleansing #1131

Open
@ShihChun-H

Description

Describe Your Proposed Tutorial

Issue Description

Current State

  • Users cannot clean their data in VDP with a simple flow.

Why We Want to Change?

  • We want to exclude some data to make the chunks cleaner, which can improve the efficiency of RAG.

Proposed Change

Pseudo Recipe

# VDP Version
version: v1beta

component:
  text-0:
    type: text
    input:
      # Array of texts to be cleaned.
      texts:
      setting:
        # Option 1: clean with regular expressions.
        clean-method: Regex
        # A text that matches an exclude pattern is removed from the array.
        exclude-patterns:
        # A text that matches an include pattern is kept in the array.
        include-patterns:

        # Option 2: clean with substrings.
        clean-method: Substring
        # A text that contains an exclude substring is removed from the array.
        exclude-substrings:
        # A text that contains an include substring is kept in the array.
        include-substrings:
        # Whether substring matching is case-sensitive. Defaults to false.
        # When true, cat matches only 'cat', not 'Cat' or 'CAT'.
        # When false, cat matches 'cat', 'Cat', 'CAT', and any other mix of upper- and lowercase letters.
        case-sensitive:
          
    condition:
    task: TASK_CLEAN_DATA
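
To make the intended behavior of the two clean methods concrete, here is a minimal Python sketch. It is not the component's implementation. The function names `clean_regex` and `clean_substring` are hypothetical, and two points the recipe leaves open are filled in as assumptions: a pattern matches if it is found anywhere in the text, and when both include and exclude settings are given, a text is kept only if it matches at least one include entry and no exclude entry.

```python
import re
from typing import List, Optional, Sequence


def clean_regex(
    texts: Sequence[str],
    exclude_patterns: Optional[Sequence[str]] = None,
    include_patterns: Optional[Sequence[str]] = None,
) -> List[str]:
    """Hypothetical sketch of clean-method: Regex."""
    excludes = [re.compile(p) for p in (exclude_patterns or [])]
    includes = [re.compile(p) for p in (include_patterns or [])]
    cleaned = []
    for text in texts:
        # Assumption: with include patterns set, a text must match at least one.
        if includes and not any(p.search(text) for p in includes):
            continue
        # A text that matches any exclude pattern is dropped.
        if any(p.search(text) for p in excludes):
            continue
        cleaned.append(text)
    return cleaned


def clean_substring(
    texts: Sequence[str],
    exclude_substrings: Optional[Sequence[str]] = None,
    include_substrings: Optional[Sequence[str]] = None,
    case_sensitive: bool = False,  # default follows the recipe comment
) -> List[str]:
    """Hypothetical sketch of clean-method: Substring."""

    def norm(s: str) -> str:
        # Fold case on both sides when matching is case-insensitive.
        return s if case_sensitive else s.casefold()

    excludes = [norm(s) for s in (exclude_substrings or [])]
    includes = [norm(s) for s in (include_substrings or [])]
    cleaned = []
    for text in texts:
        haystack = norm(text)
        if includes and not any(s in haystack for s in includes):
            continue
        if any(s in haystack for s in excludes):
            continue
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    chunks = ["Page 3 of 10", "The cat sat on the mat.", "CAT scans are common.", ""]
    # Drop page footers and blank chunks with regular expressions.
    print(clean_regex(chunks, exclude_patterns=[r"^Page \d+ of \d+$", r"^\s*$"]))
    # Drop every chunk that mentions "cat", ignoring case (the default).
    print(clean_substring(chunks, exclude_substrings=["cat"]))
```

The original, unmodified text is always what gets returned. Case folding is applied only for the comparison, so cleaning never alters the surviving chunks.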

Rules for the Component Hackathon

  • Each issue will only be assigned to one person/team at a time.
  • You can only work on one issue at a time.
  • To express interest in an issue, please comment on it and tag @kuroxx, allowing the Instill AI team to assign it to you.
  • Ensure you address all feedback and suggestions provided by the Instill AI team.
  • If no commits are made within five days, the issue may be reassigned to another contributor.
  • Join our Discord to engage in discussions and seek assistance in the #hackathon channel. For technical queries, you can tag @chuang8511.

Component Contribution Guideline | Documentation | Official Go Tutorial

Metadata

Projects

  • Status: In Progress