CT-594 truncate large events #64
base: master
Conversation
This is to prevent logstash from getting stuck repeatedly trying to send the same large event that will never be ingested. CT-594
end
end

if event_json.bytesize > @max_record_size_bytes
Just thinking out loud - I wonder what kind of performance impact this may have in the worst case (a lot of events which hit this limit, and possibly also events with many such fields) - we may want to add some quick micro benchmarks to better understand the impact.
Since previously we wouldn't perform additional processing in such cases on the client side, right (aka we would send it as-is and the server would reject / truncate it)?
It may be worthwhile to add some instrumentation / periodic logging (number of such events, duration of the truncation operation). I also wonder if we may want to only periodically log the "Event size exceeds max_record_size_bytes" message (something similar to the rate-limited warn in the agent code) to avoid a potential flood of those messages in case we encounter many such events.
In addition to that, we may also want to add a periodic DEBUG log with the event content (pre-truncation) to make troubleshooting easier.
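For reference, a minimal sketch of what the rate-limited warning could look like (the method name, counters, and 300 second interval are illustrative, not existing plugin code; it assumes the plugin's `@logger` is available):

```ruby
# Hypothetical sketch: emit the oversized-event warning at most once per interval,
# reporting how many events were truncated since the last warning.
def warn_about_oversized_event(event_size)
  @oversized_event_count = (@oversized_event_count || 0) + 1
  @last_truncation_warning_at ||= 0
  now = Time.now.to_i

  return unless now - @last_truncation_warning_at >= 300

  @logger.warn(
    "Event size exceeds max_record_size_bytes, truncating",
    :event_size => event_size,
    :events_truncated_since_last_warning => @oversized_event_count
  )
  @last_truncation_warning_at = now
  @oversized_event_count = 0
end
```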
> Since previously we wouldn't perform additional processing in such cases on the client side, right (aka we would send it as-is and the server would reject / truncate it)?
Right. This has been weighing on my mind a lot. I don't actually know how often logstash handles records that exceed the max record size (whatever that value is), or the arbitrary value of 200 KiB we chose.
How does the server deal with events that exceed the max record size? Our docs don't mention a record size limit, only a post body size limit.
@Kami How does the agent deal with large events? I couldn't find an answer to that question quickly on my own, and didn't want to delay this feature in order to dig around the agent codebase to find out.
As mentioned on Slack, agent datasets are usually different - there is usually a single, potentially large message attribute, vs the logstash scenario where it's common to have many other attributes (in addition to, or instead of, the message attribute) on the event.
Having said that, I don't know how the agent handles that, so I will dig into the code and also try to test it out.
value = attrs.delete key
value_size = _get_size(value)
if value_size > @max_field_size_bytes
  value.slice!(@max_field_size_bytes, value_size)
Do we want to add "..." or similar to the end to signal that the field value has been truncated? (iirc, we do the same on the server side for truncated field values)
Makes sense. I'll add that.
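For reference, a minimal sketch of appending the marker when shortening a value (`TRUNCATION_MARKER` is an illustrative name, not an existing constant):

```ruby
# Sketch: truncate the value in place and append a marker so it is obvious the
# field was cut. Note that String#slice! indexes by character, so multi-byte
# input may land slightly off the exact byte limit (the existing code shares
# that caveat).
TRUNCATION_MARKER = "...".freeze

if value_size > @max_field_size_bytes
  keep = @max_field_size_bytes - TRUNCATION_MARKER.bytesize
  value.slice!(keep, value_size)
  value << TRUNCATION_MARKER
end
```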
attrs = event[:attrs]
new_attrs = Hash.new
total_size = 0
priority_events = ["serverHost", "parser", "logfile", "severity", "message"]
We could define this as a top / module level constant, and perhaps call it something along the lines of "special / reserved" field names or similar.
Depending on where this processing happens (either before or after the special event level serverHost field value is set), we may also need to include EVENT_LEVEL_SERVER_HOST_ATTRIBUTE_NAME constant / __origServerHost field name here.
# Rename serverHost (if exists) to __origServerHost so sources filtering works correctly
On a related note, do we also want to include thread + log here?
And IIRC, post processing (what gets sent to the server side), the severity field value is stored in the sev field -
scalyr_event[:sev] = severity_int
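For illustration, a sketch of what pulling this out into a module-level constant could look like, including the renamed server host field (the constant names here are illustrative; the plugin may already define the server host constant elsewhere):

```ruby
# Illustrative constant names, not the plugin's actual definitions.
EVENT_LEVEL_SERVER_HOST_ATTRIBUTE_NAME = "__origServerHost".freeze

# Field names that should never be dropped outright during truncation, only shortened.
RESERVED_FIELD_NAMES = [
  "serverHost",
  EVENT_LEVEL_SERVER_HOST_ATTRIBUTE_NAME,
  "parser",
  "logfile",
  "severity",
  "message",
].freeze
```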
And also worth noting that some of those field values (top level serverHost, logs, etc.) are, in some / all cases, sent once as session level parameters and not as event level attributes.
Not sure how things would behave in case one of the session level fields is very large (e.g. very large logs session field with many entries).
Probably also not a blocker, and not fully relevant to this change, but worth thinking about.
> And IIRC, post processing (what gets sent to the server side), the severity field value is stored in the sev field
That is correct. That field is also in the parent object, so if it is present I don't touch it.
> And also worth noting that some of those field values (top level serverHost, logs, etc.) are, in some / all cases, sent once as session level parameters and not as event level attributes.
Also correct. I decided against touching anything in there since that isn't where the problem currently lies. For the sake of expediency, and to avoid introducing a regression, I felt it better to be surgical in my approach.
> Depending on where this processing happens (either before or after the special event level serverHost field value is set), we may also need to include EVENT_LEVEL_SERVER_HOST_ATTRIBUTE_NAME constant / __origServerHost field name here.
It happens after. I'll look into this. I think that data is stored outside of the attrs object, but I'll verify.
Yeah, correct, some of those attributes (e.g. sev) are stored outside of the event.attrs (tbh, not sure how limits apply to those attributes).
end
attrs.each do |key, value|
  kv_size = key.bytesize + _get_size(value)
  # Skip field if the combined size of the key and value exceeds the max field size.
Yeah, I think this is a good call, since doing that would be somewhat complex and possibly also add a lot of overhead in the worst-case scenario.
Having said that, I wonder if we still want to add some periodic opt-in debug logging of those offending fields to make troubleshooting easier for the end user?
That makes sense.
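For illustration, the opt-in logging could look roughly like this inside the `attrs.each` loop quoted above (the `log_truncated_fields` flag is a hypothetical name, not an existing setting):

```ruby
attrs.each do |key, value|
  kv_size = key.bytesize + _get_size(value)

  # Skip the field if key + value exceed the per-field limit, optionally logging
  # what was dropped so the user can see which fields are being discarded.
  if kv_size > @max_field_size_bytes
    if @log_truncated_fields && @logger.debug?
      @logger.debug(
        "Dropping field that exceeds max_field_size_bytes",
        :field => key,
        :field_size_bytes => kv_size
      )
    end
    next
  end

  # ... existing accumulation logic continues here ...
end
```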
# does not exceed `max_record_size_bytes`. This feature is only active when
# `estimate_each_event_size` is enabled.
config :max_record_size_bytes, :validate => :number, :default => 200 * 1024
config :max_field_size_bytes, :validate => :number, :default => 50 * 1024
IIRC, the server side limit for those accounts right now is 200k (and the general one is 100k). Would be good to double check and decide how to proceed (e.g. change the default value here, or tell the user to bump this value to 200k in their config, or similar).
I didn't know where to find that limit, so I used the defaults suggested by Steven. If you can direct me towards where that limit is defined, I'd be happy to make sure the defaults here match the server-side defaults.
Actually, I just remembered something: according to our docs on how we bill on logging volume (https://app.scalyr.com/help/scalyr-agent#billing), each key in the attrs object only counts as 1 byte. Since our concern is serialization size, we should set our max record size to be larger than the actual max record size to allow some wiggle room.
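To make the wiggle-room point concrete, here is a rough sketch comparing the serialized size the plugin measures with a billing-style size where each key counts as one byte (this is only an approximation of the billing rules linked above; the attribute values are made up):

```ruby
require "json"

attrs = { "message" => "x" * 1000, "parser" => "accessLog", "serverHost" => "web-1" }

# What the plugin measures: the size of the serialized payload.
serialized_size = attrs.to_json.bytesize

# Rough billing-style size: each key counts as 1 byte, values count at full size.
billed_size = attrs.sum { |_key, value| 1 + value.to_s.bytesize }

# serialized_size is always the larger of the two (keys, quotes, and separators all
# count), which is the argument for giving max_record_size_bytes some headroom.
puts "serialized=#{serialized_size} billed=#{billed_size}"
```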
This PR truncates large events to ensure logstash does not get stuck trying to send an event that is so large it will never be accepted.
Truncation is performed by discarding fields to get the event under the max record size set by `max_record_size_bytes`. Fields larger than `max_field_size_bytes` are dropped. The exceptions to these rules are the fields `serverHost`, `parser`, `logfile`, `severity`, and `message` (if present). We ensure these fields do not get dropped, and if the field value exceeds `max_field_size_bytes`, we truncate the string.
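Put together, the truncation pass described above looks roughly like this (a condensed sketch, not the exact code in the diff; the method, field, and constant names are illustrative):

```ruby
# Condensed sketch of the truncation pass described in this PR.
RESERVED_FIELD_NAMES = ["serverHost", "parser", "logfile", "severity", "message"].freeze

def truncate_event_attrs(attrs, max_record_size_bytes, max_field_size_bytes)
  new_attrs = {}
  total_size = 0

  # Reserved fields are always kept; oversized string values are shortened, not dropped.
  RESERVED_FIELD_NAMES.each do |key|
    next unless attrs.key?(key)
    value = attrs.delete(key)
    if value.is_a?(String) && value.bytesize > max_field_size_bytes
      value = value.byteslice(0, max_field_size_bytes)
    end
    new_attrs[key] = value
    total_size += key.bytesize + value.to_s.bytesize
  end

  # Remaining fields are kept only while they fit: a field whose key + value exceed
  # max_field_size_bytes, or that would push the event past max_record_size_bytes,
  # is dropped.
  attrs.each do |key, value|
    kv_size = key.bytesize + value.to_s.bytesize
    next if kv_size > max_field_size_bytes
    next if total_size + kv_size > max_record_size_bytes
    new_attrs[key] = value
    total_size += kv_size
  end

  new_attrs
end
```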