CT-594 truncate large events #64
base: master
Conversation
This is to prevent logstash from getting stuck repeatedly trying to send the same large event that will never be ingested. CT-594
end
end

if event_json.bytesize > @max_record_size_bytes
Just thinking out loud - I wonder what kind of performance impact this may have in the worst case (a lot of events which hit this limit, and possibly also events with many such fields) - we may want to add some quick micro benchmarks to better understand the impact.
Since previously we wouldn't perform additional processing in such cases on the client side, right (aka we would send it as-is and the server would reject / truncate it)?
It may be worthwhile to add some instrumentation / periodic logging (number of such events, duration of the truncation operation). I also wonder if we may want to only periodically log the "Event size exceeds max_record_size_bytes" message (something similar to the rate-limited warn in the agent code) to avoid a potential flood of those messages in case we encounter many such events.
In addition to that, we may also want to add a periodic DEBUG log with the event content (pre-truncation) to make troubleshooting easier.
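For reference, a minimal sketch of what the rate-limited warning could look like (the method name, counters, and 300 second interval are illustrative, not existing plugin code; it assumes the plugin's `@logger` is available):

```ruby
# Hypothetical sketch: emit the oversized-event warning at most once per interval,
# reporting how many events were truncated since the last warning.
def warn_about_oversized_event(event_size)
  @oversized_event_count = (@oversized_event_count || 0) + 1
  @last_truncation_warning_at ||= 0
  now = Time.now.to_i

  return unless now - @last_truncation_warning_at >= 300

  @logger.warn(
    "Event size exceeds max_record_size_bytes, truncating",
    :event_size => event_size,
    :events_truncated_since_last_warning => @oversized_event_count
  )
  @last_truncation_warning_at = now
  @oversized_event_count = 0
end
```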
> Since previously we wouldn't perform additional processing in such cases on the client side, right (aka we would send it as-is and the server would reject / truncate it)?
Right. This has been weighing on my mind a lot. I don't actually know how often logstash handles records that exceed the max record size (whatever that value is), or the arbitrary value of 200 KiB we chose.
How does the server deal with events that exceed the max record size? Our docs don't mention a record size limit, only a post body size limit.
@Kami How does the agent deal with large events? I couldn't find an answer to that question quickly on my own, and didn't want to delay this feature in order to dig around the agent codebase to find out.
As mentioned on Slack, agent datasets are usually different - there is usually a single, potentially large message attribute, vs the logstash scenario where it's common to have many other attributes (in addition to, or instead of, the message attribute) on the event.
Having said that, I don't know how the agent handles that, so I will dig into the code and also try to test it out.
value = attrs.delete key
value_size = _get_size(value)
if value_size > @max_field_size_bytes
  value.slice!(@max_field_size_bytes, value_size)
Do we want to add "..." or similar to the end to signal that the field value has been truncated? (iirc, we do the same on the server side for truncated field values)
Makes sense. I'll add that.
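For reference, a minimal sketch of appending the marker when shortening a value (`TRUNCATION_MARKER` is an illustrative name, not an existing constant):

```ruby
# Sketch: truncate the value in place and append a marker so it is obvious the
# field was cut. Note that String#slice! indexes by character, so multi-byte
# input may land slightly off the exact byte limit (the existing code shares
# that caveat).
TRUNCATION_MARKER = "...".freeze

if value_size > @max_field_size_bytes
  keep = @max_field_size_bytes - TRUNCATION_MARKER.bytesize
  value.slice!(keep, value_size)
  value << TRUNCATION_MARKER
end
```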
attrs = event[:attrs]
new_attrs = Hash.new
total_size = 0
priority_events = ["serverHost", "parser", "logfile", "severity", "message"]
We could define this as a top / module level constant, and perhaps call it something along the lines of "special / reserved" field names or similar.
Depending on where this processing happens (either before or after the special event level serverHost field value is set), we may also need to include EVENT_LEVEL_SERVER_HOST_ATTRIBUTE_NAME constant / __origServerHost field name here.
# Rename serverHost (if exists) to __origServerHost so sources filtering works correctly
On a related note, do we also want to include thread + log here?
And IIRC, post processing (what gets sent to the server side), the severity field value is stored in the sev field -
scalyr_event[:sev] = severity_int
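For illustration, a sketch of what pulling this out into a module-level constant could look like, including the renamed server host field (the constant names here are illustrative; the plugin may already define the server host constant elsewhere):

```ruby
# Illustrative constant names, not the plugin's actual definitions.
EVENT_LEVEL_SERVER_HOST_ATTRIBUTE_NAME = "__origServerHost".freeze

# Field names that should never be dropped outright during truncation, only shortened.
RESERVED_FIELD_NAMES = [
  "serverHost",
  EVENT_LEVEL_SERVER_HOST_ATTRIBUTE_NAME,
  "parser",
  "logfile",
  "severity",
  "message",
].freeze
```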
And also worth noting that some of those field values (top level serverHost, logs, etc.) are, in some / all cases, sent once as session level parameters and not as event level attributes.
Not sure how things would behave in case one of the session level fields is very large (e.g. very large logs session field with many entries).
Probably also not a blocker, and not fully relevant to this change, but worth thinking about.
> And IIRC, post processing (what gets sent to the server side), the severity field value is stored in the sev field
That is correct. That field is also in the parent object, so if it is present I don't touch it.
> And also worth noting that some of those field values (top level serverHost, logs, etc.) are, in some / all cases, sent once as session level parameters and not as event level attributes.
Also correct. I decided against touching anything in there since that isn't where the problem currently lies. For the sake of expediency, and to avoid introducing a regression, I felt it better to be surgical in my approach.
> Depending on where this processing happens (either before or after the special event level serverHost field value is set), we may also need to include EVENT_LEVEL_SERVER_HOST_ATTRIBUTE_NAME constant / __origServerHost field name here.
It happens after. I'll look into this. I think that data is stored outside of the attrs object, but I'll verify.
Yeah, correct, some of those attributes (e.g. sev) are stored outside of the event.attrs (tbh, not sure how limits apply to those attributes).
end
attrs.each do |key, value|
  kv_size = key.bytesize + _get_size(value)
  # Skip field if the combined size of the key and value exceeds the max field size.
Yeah, I think this is a good call, since doing that would be somewhat complex and possibly also add a lot of overhead in the worst-case scenario.
Having said that, I wonder if we still want to add some periodic opt-in debug logging of those offending fields to make troubleshooting easier for the end user?
That makes sense.
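For illustration, the opt-in logging could look roughly like this inside the `attrs.each` loop quoted above (the `log_truncated_fields` flag is a hypothetical name, not an existing setting):

```ruby
attrs.each do |key, value|
  kv_size = key.bytesize + _get_size(value)

  # Skip the field if key + value exceed the per-field limit, optionally logging
  # what was dropped so the user can see which fields are being discarded.
  if kv_size > @max_field_size_bytes
    if @log_truncated_fields && @logger.debug?
      @logger.debug(
        "Dropping field that exceeds max_field_size_bytes",
        :field => key,
        :field_size_bytes => kv_size
      )
    end
    next
  end

  # ... existing accumulation logic continues here ...
end
```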
# does not exceed `max_record_size_bytes`. This feature is only active when
# `estimate_each_event_size` is enabled.
config :max_record_size_bytes, :validate => :number, :default => 200 * 1024
config :max_field_size_bytes, :validate => :number, :default => 50 * 1024
IIRC, the server side limit for those accounts right now is 200k (and the general one is 100k). Would be good to double check and decide how to proceed (e.g. change the default value here, or tell the user to bump this value to 200k in their config, or similar).
I didn't know where to find that limit, so I used the defaults suggested by Steven. If you can direct me towards where that limit is defined, I'd be happy to make sure the defaults here match the server-side defaults.
Actually, I just remembered something: according to our docs on how we bill on logging volume (https://app.scalyr.com/help/scalyr-agent#billing), each key in the attrs object only counts as 1 byte. Since our concern is serialization size, we should set our max record size to be larger than the actual max record size to allow some wiggle room.
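To make the wiggle-room point concrete, here is a rough sketch comparing the serialized size the plugin measures with a billing-style size where each key counts as one byte (this is only an approximation of the billing rules linked above; the attribute values are made up):

```ruby
require "json"

attrs = { "message" => "x" * 1000, "parser" => "accessLog", "serverHost" => "web-1" }

# What the plugin measures: the size of the serialized payload.
serialized_size = attrs.to_json.bytesize

# Rough billing-style size: each key counts as 1 byte, values count at full size.
billed_size = attrs.sum { |_key, value| 1 + value.to_s.bytesize }

# serialized_size is always the larger of the two (keys, quotes, and separators all
# count), which is the argument for giving max_record_size_bytes some headroom.
puts "serialized=#{serialized_size} billed=#{billed_size}"
```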
This PR truncates large events to ensure logstash does not get stuck trying to send an event that is so large it will never be accepted.
Truncation is performed by discarding fields to get the event under the max record size set by `max_record_size_bytes`. Fields larger than `max_field_size_bytes` are dropped. The exceptions to these rules are the fields `serverHost`, `parser`, `logfile`, `severity`, and `message` (if present). We ensure these fields do not get dropped, and if the field value exceeds `max_field_size_bytes`, we truncate the string.
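Put together, the truncation pass described above looks roughly like this (a condensed sketch, not the exact code in the diff; the method, field, and constant names are illustrative):

```ruby
# Condensed sketch of the truncation pass described in this PR.
RESERVED_FIELD_NAMES = ["serverHost", "parser", "logfile", "severity", "message"].freeze

def truncate_event_attrs(attrs, max_record_size_bytes, max_field_size_bytes)
  new_attrs = {}
  total_size = 0

  # Reserved fields are always kept; oversized string values are shortened, not dropped.
  RESERVED_FIELD_NAMES.each do |key|
    next unless attrs.key?(key)
    value = attrs.delete(key)
    if value.is_a?(String) && value.bytesize > max_field_size_bytes
      value = value.byteslice(0, max_field_size_bytes)
    end
    new_attrs[key] = value
    total_size += key.bytesize + value.to_s.bytesize
  end

  # Remaining fields are kept only while they fit: a field whose key + value exceed
  # max_field_size_bytes, or that would push the event past max_record_size_bytes,
  # is dropped.
  attrs.each do |key, value|
    kv_size = key.bytesize + value.to_s.bytesize
    next if kv_size > max_field_size_bytes
    next if total_size + kv_size > max_record_size_bytes
    new_attrs[key] = value
    total_size += kv_size
  end

  new_attrs
end
```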