Ingest pipeline best practices #1381

Open

philippkahr wants to merge 8 commits into elastic:main from philippkahr:best-practices-ingest-pipelines

+1,428 −0

philippkahr commented May 7, 2025

Based on the discussions here: #1052

This is my first PR against the docs, and I am building a couple of new pages. I think it makes sense to split the content out. I am putting it into this part of the docs: https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines The tips and tricks are generic and not specific to o11y or security.


          initial commit, let's see if it builds

a01a622

github-actions bot deployed to docs-preview

May 7, 2025 16:45

View deployment


          this should help

ce87300

github-actions bot deployed to docs-preview

May 7, 2025 16:52

View deployment


          reworked the md

e42cd64

github-actions bot deployed to docs-preview

May 7, 2025 16:59

View deployment


          Reworking the line breaks

4ef0b86

github-actions bot deployed to docs-preview

May 7, 2025 17:15

View deployment


          Reworked, grammar, whitespace, formatting

0c7ed69

github-actions bot deployed to docs-preview

May 7, 2025 17:52

View deployment

Author

philippkahr commented May 7, 2025

There are a couple of things I need help with.

Can someone proofread it and give suggestions on how easy it is to understand?
Does the file order make sense the way I put it? Should we add an additional subfolder?
Can you please read through it? Here and there I think we can add links to other docs. For example, where I mention the remove processor, we should probably link to the remove processor docs.
I'm not 100% convinced that "Common mistakes" is the correct heading.


          wrong naming

545f91b

github-actions bot deployed to docs-preview

May 7, 2025 17:57

View deployment

philippkahr added Team:Platform documentation enhancement labels

kilfoyle added Team:Obs and removed Team:Platform labels

Contributor

kilfoyle commented May 7, 2025

Thanks a lot for opening this Philipp! I've added the "Team:Obs" label since under the new docs organization that's where the ingest content will land.


          Marius suggested to use append which makes more sense since it is an …

bd76756

…array!

github-actions bot deployed to docs-preview

May 9, 2025 10:47

View deployment


          fix typo

ca4d379

philippkahr requested review from a team as code owners

May 22, 2025 08:29

github-actions bot deployed to docs-preview

May 22, 2025 08:33

View deployment

colleenmcginnis added the Team:Ingest label

alexandra5000 self-assigned this

colleenmcginnis reviewed

Contributor

colleenmcginnis left a comment

@philippkahr I started reviewing this PR, but I didn't get very far (yet!). There's a lot of content to get into! I'm just going to post the comments/questions/suggestions I have so far so I can see if I'm on the right track. I can jump back in next week.

Some themes in my early feedback include:

I see some opportunities to simplify the examples to really emphasize the point you're making in each section.
It might be helpful to write out in plain language what the example is trying to achieve before jumping into a code snippet. (I provided a couple suggestions below.)
There are probably opportunities to remove redundant information.

manage-data/ingest/transform-enrich/common-mistakes.md

Comment on lines +9 to +11

    
              # Common mistakes

              Here we are not discussing any performance metrics and if one way or the other one is faster, has less heap usage etc. What we are looking for is ease of maintenance and readability. Anybody who knows a bit about ingest pipelines should be able to fix a lot of issues. This section should provide a clear guide to “oh I have written this myself, ah that is the easier way to write it”.

Contributor

colleenmcginnis May 30, 2025

What about reframing it so it's more focused on what it is (rather than what it's not)? Maybe something like this?

Suggested change

      
            # Common mistakes
          
            Here we are not discussing any performance metrics and if one way or the other one is faster, has less heap usage etc. What we are looking for is ease of maintenance and readability. Anybody who knows a bit about ingest pipelines should be able to fix a lot of issues. This section should provide a clear guide to “oh I have written this myself, ah that is the easier way to write it”.
          
            # Create readable and maintainable ingest pipelines
          
            There are many ways to achieve similar results when creating ingest pipelines, which can make maintenance and readability difficult. This guide outlines patterns you can follow to make the maintenance and readability of ingest pipelines easier without sacrificing functionality.
          
            :::{note}
          
            This guide does not provide guidance on optimizing for ingest pipeline performance.
          
            :::

manage-data/ingest/transform-enrich/common-mistakes.md

    
              ## if statements

              ### Contains and lots of ORs

Contributor

colleenmcginnis May 30, 2025

This just seems like a best practice for writing code in general. There's nothing specific about ingest pipelines (or am I missing something)?

manage-data/ingest/transform-enrich/common-mistakes.md

Comment on lines +35 to +54

    
              ### Missing ? and contains operation

              Here is another example, which would fail if `openshift` is not properly set since it is not using `?`, also the `()` are not really doing anything, and the check of `openshift.origin` followed by `openshift.origin.threadId` is unnecessary.

              ```painless
              "if": "ctx.openshift.eventPayload != null
              && (ctx.openshift.eventPayload.contains('Start expire sessions'))
              && ctx.openshift.origin != null
              && ctx.openshift.origin.threadId != null
              && (ctx.openshift.origin.threadId.contains('Catalina-utility'))",
              ```

              This can become this:

              ```painless
              "if": "ctx.openshift?.eventPayload instanceof String
              && ctx.openshift.eventPayload.contains('Start expire sessions')
              && ctx.openshift?.origin?.threadId instanceof String
              && ctx.openshift.origin.threadId.contains('Catalina-utility')",
              ```

Contributor

colleenmcginnis May 30, 2025

also the () are not really doing anything

I don't think this adds much value. Can we simplify these examples so the difference between before/after are easier to see?

We could end up with something like below instead. Let me know what you think about this approach.

Screenshot 2025-05-30 at 3 45 39 PM

### Null safe operator

Anticipate potential problems with the data, and use the [null safe operator](elasticsearch://reference/scripting-languages/painless/painless-operators-reference.md#null-safe-operator) (`?.`) to prevent data from being processed incorrectly.

For example, if you only want data that has a valid string in a `ctx.openshift.origin.threadId` field:

#### **Don't**: Leave the condition vulnerable to failures and use redundant checks

```painless
ctx.openshift.origin != null <1>
&& ctx.openshift.origin.threadId != null <2>
```
1. This will fail if `openshift` is not properly set because it assumes that `ctx.openshift` exists.
2. It's unnecessary to check `openshift.origin` and `openshift.origin.threadId` separately.

#### **Do**: Use the null safe operator

```painless
ctx.openshift?.origin?.threadId instanceof String <1>
```
1. Only if there's a `ctx.openshift` and a `ctx.openshift.origin` will it check for a `ctx.openshift.origin.threadId` and make sure it is a string.
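
As a rough illustration of that behavior, the sketch below lists a few hypothetical documents (the field values are made up for this example) and what the condition would evaluate to for each:

```painless
// Hypothetical documents and the result of the condition below:
// { "openshift": { "origin": { "threadId": "Catalina-utility-1" } } } -> true
// { "openshift": { "origin": { "threadId": 42 } } }                   -> false (not a String)
// { "openshift": { "origin": {} } }                                   -> false (threadId is null)
// { "message": "no openshift object" }                                -> false (chain stops at ctx.openshift)
ctx.openshift?.origin?.threadId instanceof String
```

None of the four cases throws, which is the point of combining `?.` with the type check.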

manage-data/ingest/transform-enrich/common-mistakes.md

Comment on lines +56 to +79

    
              ### Contains operation and null check

              This includes an initial null check, which is not necessary.

              ```painless
              "if": "ctx.event?.action != null
              && ['bandwidth','spoofed syn flood prevention','dns authentication','tls attack prevention',
                  'tcp syn flood detection','tcp connection limiting','http rate limiting',
                  'block malformed dns traffic','tcp connection reset','udp flood detection',
                  'dns rate limiting','malformed http filtering','icmp flood detection',
                  'dns nxdomain rate limiting','invalid packets'].contains(ctx.event.action)"
              ```

              This behaves nearly the same:

              ```painless
              "if": "['bandwidth','spoofed syn flood prevention','dns authentication','tls attack prevention',
                      'tcp syn flood detection','tcp connection limiting','http rate limiting',
                      'block malformed dns traffic','tcp connection reset','udp flood detection',
                      'dns rate limiting','malformed http filtering','icmp flood detection',
                      'dns nxdomain rate limiting','invalid packets'].contains(ctx.event?.action)"
              ```

              The difference is in the execution itself, which should not matter in practice since it is Java under the hood and fast. With the first version and its initial `ctx.event?.action != null` check, evaluation stops there when `action` is null and the contains operation is never performed. In the second version, the contains operation always runs: it compares every item in the list against `null` and then returns `false`.

Contributor

colleenmcginnis May 30, 2025

This example confuses me. Why would you want to run the contains operation n times if you already know ctx.event.action is null and it's going to return false.
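
For context, here is a shortened sketch of the two variants from the diff above. The two-item list is a stand-in chosen for brevity, not the full list from the PR:

```painless
// Variant 1: explicit guard. When action is null, `&&` stops at the guard
// and `contains` is never called.
ctx.event?.action != null && ['bandwidth', 'invalid packets'].contains(ctx.event.action)

// Variant 2: no guard. `contains(null)` is called, compares each item
// against null, and returns false.
['bandwidth', 'invalid packets'].contains(ctx.event?.action)
```

Both variants evaluate to `false` when `ctx.event.action` is missing. The only difference is whether `contains` is called at all.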

manage-data/ingest/transform-enrich/common-mistakes.md

Comment on lines +95 to +115

    
              Similar to the one above, except that it is also missing the `?` for dot-walking.

              ```json
              {
                "fail": {
                  "message": "This cannot be parsed as it is a list and not a single message",
                  "if": "ctx._tmp.leef_kv.labelAbc != null && ctx._tmp.leef_kv.labelAbc instanceof List"
                }
              },
              ```

              This version is easier to read and maintain since we remove the unnecessary null check and add dot walking.

              ```json
              {
                "fail": {
                  "message": "This cannot be parsed as it is a list and not a single message",
                  "if": "ctx._tmp?.leef_kv?.labelAbc instanceof List"
                }
              },
              ```

Contributor

colleenmcginnis May 30, 2025

Do these have to be two separate examples? Why are some examples painless and some json?

manage-data/ingest/transform-enrich/common-mistakes.md

Comment on lines +163 to +178

    
              ### Checking null way too often

              This:

              ```painless

              "if": "ctx.process != null && ctx.process.thread != null 

                     && ctx.process.thread.id != null && (ctx.process.thread.id instanceof String)"

              ```

              Can become just this:

              ```painless

              "if": "ctx.process?.thread?.id instanceof String"

              ```

              That is what the `?` is for: it replaces listing every step individually. The unnecessary `()` are removed as well.

Contributor

colleenmcginnis May 30, 2025

It feels like this lesson has already been learned on this page by the time you get here.

manage-data/ingest/transform-enrich/common-mistakes.md

Comment on lines +133 to +145

    
              ### Checking null

              It is not necessary to write a `?` after `ctx` itself. For first-level fields such as `ctx.message` or `ctx.demo`, plain dot access is enough. If `ctx` is ever null you face other problems, because that means the entire context, and with it the entire `_source`, is missing.

              ```painless

              "if": "ctx?.message == null"

              ```

              Is the same as:

              ```painless

              "if": "ctx.message == null"

              ```

Contributor

colleenmcginnis May 30, 2025

Can this be covered in a note in the second example with something like this?

:::{tip}
It is not necessary to use a null safe operator for first level objects
(for example, use `ctx.openshift` instead of `ctx?.openshift`).
`ctx` will only ever be `null` if the entire `_source` is empty.
:::

manage-data/ingest/transform-enrich/common-mistakes.md

Comment on lines +81 to +93

    
              ### Checking null and type unnecessarily

              This is just unnecessary

              ```painless

              "if": "ctx?.openshift?.eventPayload != null && ctx.openshift.eventPayload instanceof String"

              ```

              Because this is the same.

              ```painless

              "if": "ctx.openshift?.eventPayload instanceof String"

              ```

Contributor

colleenmcginnis May 30, 2025

This is what this section could look like using the same structure as the section above.

Screenshot 2025-05-30 at 4 41 48 PM

### Use null safe operators when checking type

The null safe operator returns `null` when any part of the path is missing, and `instanceof` evaluates to `false` for `null`, so there is no reason to check whether a value is not `null` before checking the type of that value.

For example, if you only want data when the value of the `ctx.openshift.eventPayload` field is a string:

#### ![don't](../../images/icon-cross.svg) **Don't**: Use redundant checks

```painless
ctx?.openshift?.eventPayload != null && ctx.openshift.eventPayload instanceof String
```

#### ![do](../../images/icon-check.svg) **Do**: Use the null safe operator with the type check

```painless
ctx.openshift?.eventPayload instanceof String
```
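
The same type check also protects a method call that follows it. A minimal sketch, reusing the `contains('Start expire sessions')` call from the original example in this PR:

```painless
ctx.openshift?.eventPayload instanceof String <1>
&& ctx.openshift.eventPayload.contains('Start expire sessions') <2>
```
1. Evaluates to `false` when `ctx.openshift` or `eventPayload` is missing, or when `eventPayload` is not a string, so `&&` stops here.
2. Only reached when `eventPayload` is known to be a string, so no null safe operator is needed on this line.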

manage-data/ingest/transform-enrich/common-mistakes.md

Comment on lines +117 to +131

    
              ### Checking null and for a value

              This is interesting as it misses the `?` and therefore will have a null pointer exception if `ctx.event` is ever null.

              ```painless

              "if": "ctx.event.type == null || ctx.event.type == '0'"

              ```

              This needs to become this:

              ```painless

              "if": "ctx.event?.type == null || ctx.event?.type == '0'"

              ```

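              The `?` appears on both sides of the OR operator `||` because the first condition is always evaluated and the second is evaluated whenever the first is false. Strictly, `||` short-circuits, so the second condition is only reached when `ctx.event` exists, but keeping the `?` there makes each condition safe on its own if the expression is later reordered.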

Contributor

colleenmcginnis May 30, 2025

This is what this section could look like using the same structure as the sections above.

Screenshot 2025-05-30 at 5 04 23 PM

### Use null safe operator with boolean OR operator

When using the [boolean OR operator](elasticsearch://reference/scripting-languages/painless/painless-operators-boolean.md#boolean-or-operator) (`||`), you need to use the null safe operator for both conditions being checked.

For example, if you want to include data when the value of the `ctx.event.type` field is either `null` or `'0'`:

#### ![don't](../../images/icon-cross.svg) **Don't**: Leave the conditions vulnerable to failures

```painless
ctx.event.type == null || ctx.event.type == '0' <1>
```
1. This will fail if `ctx.event` is not properly set because it assumes that `ctx.event` exists. If it fails on the first condition it won't even try the second condition.

#### ![do](../../images/icon-check.svg) **Do**: Use the null safe operator in both conditions

```painless
ctx.event?.type == null || ctx.event?.type == '0' <1>
```
1. Neither condition fails if `ctx.event` is missing, because `ctx.event?.type` evaluates to `null` instead of throwing.
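
To make the outcomes concrete, the sketch below pairs the condition with a few hypothetical documents (the values are invented for illustration) and the result each would produce:

```painless
// Hypothetical documents and the result of the condition below:
// { "event": { "type": "0" } }  -> true  (second condition matches)
// { "event": { "type": "1" } }  -> false (neither condition matches)
// { "event": {} }               -> true  (type is null)
// { "message": "no event" }     -> true  (ctx.event is null, so ctx.event?.type is null)
ctx.event?.type == null || ctx.event?.type == '0'
```

Note that a document without any `event` object matches this condition, which may or may not be what a given pipeline intends.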

Labels

documentation enhancement Team:Ingest Team:Obs