AWS S3 exporter for OTC - Data Archiving #6205
amee-sumo wants to merge 13 commits into main from AWS-S3-exporter-for-OTC---Data-Archiving
+269 −4
---
id: archive-otel
title: Archive Log Data to S3 using OpenTelemetry Collectors
description: Send data to an Archive that you can ingest from later.
---

import useBaseUrl from '@docusaurus/useBaseUrl';

This document describes how to archive log data to Amazon S3 using OpenTelemetry Collectors. Archiving allows you to store log data cost-effectively in S3 and ingest it later on demand, while retaining full enrichment and searchability when the data is re-ingested.

:::important
Do not change the name or location of the archived files in your S3 bucket; otherwise, ingesting them later will not work properly.
:::

## Overview

With the OpenTelemetry-based approach, log data is sent to S3 using an OpenTelemetry Collector pipeline:

**Sources** > **OpenTelemetry Collector** > **awss3exporter** > **Amazon S3**

For S3 archiving, we use:
- The `awss3exporter` component to upload data to S3.
- The `sumo_ic` marshaler, which formats the archived files so they are compatible with Sumo Logic's ingestion process (see the sketch below).
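
For orientation, here is a minimal sketch of the exporter entry. The region and bucket name are placeholders; the complete working configuration appears later in this document.

```yaml
exporters:
  awss3/archive:
    marshaler: "sumo_ic"               # formats archived files for Sumo Logic ingestion
    s3uploader:
      region: "us-east-1"              # placeholder: use your bucket's region
      s3_bucket: "my-archive-bucket"   # placeholder: use your bucket name
      s3_prefix: "v2/"                 # v2 archive format prefix
```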

### Why use OpenTelemetry Collector for archiving logs to S3

Compared to legacy Installed Collector-based archiving, the OpenTelemetry approach provides:
- Full control over metadata enrichment
- Flexible and transparent configuration through YAML
- Better alignment with modern observability pipelines
- Easier integration across hybrid and cloud-native environments
- A future-proof architecture aligned with OpenTelemetry standards

## Required metadata for archived logs

For archived logs to be enriched and ingested correctly later, three resource attributes must be present on every log record. These are configured using the OpenTelemetry resource processor:

| Resource Attribute | Description | Maximum length |
|:--|:--|:--|
| `_sourceCategory` | This field is open and free to use to logically group data. | 1024 characters |
| `_sourceHost` | This field is ideally the hostname of the machine where the logs originate, but it can be any meaningful value. | 128 characters |
| `_sourceName` | This field is the source name, such as the filename being ingested or the type of logs (for example, dockerlogs, apachelogs). | |

These attributes can be set statically in configuration, or populated dynamically using a custom resource processor (advanced use case). However, dynamic extraction requires advanced implementation and is not generally recommended unless you have strong OpenTelemetry expertise.
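
If you do want dynamic population, one possible approach (a hedged sketch, not part of the configuration shown later in this document) is to combine the `resourcedetection` and `transform` processors, assuming both are included in your collector build:

```yaml
processors:
  # Assumption: detect the host name of the machine running the collector.
  resourcedetection/system:
    detectors: [system]
    system:
      hostname_sources: [os]

  # Assumption: copy the detected host.name into the _sourceHost resource attribute
  # using an OTTL statement in the transform processor.
  transform/map_source_host:
    log_statements:
      - context: resource
        statements:
          - set(attributes["_sourceHost"], attributes["host.name"])
```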

## Archive format

:::important
Only the `v2` archive format is supported when using OpenTelemetry Collector. The legacy `v1` format is deprecated for OpenTelemetry Collector and must not be used.
:::

Archived files use the format:

`<deployment>/<collectorID>/<bladeID>`

These three identifiers are used to populate `_sourceCategory`, `_sourceHost`, and `_sourceName` during ingestion, as described in the [attributes section](#required-metadata-for-archived-logs).

The identifier values do not need to be real IDs. Dummy values are allowed and ingestion will still work correctly. However, providing meaningful values is strongly recommended to help you differentiate log sources during ingestion. For example, if you archive Docker logs, Apache logs, and PostgreSQL logs into the same bucket, the filename generated by the `sumo_ic` marshaler alone does not indicate the source type. Using different `collectorID` and `bladeID` values allows you to differentiate log types during ingestion using path patterns.

:::note
In many environments, the `collectorID` can be a dummy value. The `bladeID` (source template ID) is particularly useful for identifying log types.
:::

Below is a sample OpenTelemetry Collector configuration that archives logs from files into S3 using the supported Sumo Logic archive format.

```yaml
receivers:
  filelog/myapps:
    include: ["/home/ec2-user/docker/validation/s3archive/logs/*.log"]
    start_at: beginning

processors:
  resource/add_sumo_fields:
    attributes:
      - key: _sourceCategory
        value: "testlogs"
        action: insert
      - key: _sourceHost
        value: "my-host.example.com" # replace with your host name, or set dynamically
        action: insert
      - key: _sourceName
        value: "myapp" # replace with a logical source name
        action: insert

  batch:
    timeout: 600s
    send_batch_size: 8192

exporters:
  awss3/my-sumo-archive:
    marshaler: "sumo_ic"

    s3uploader:
      region: "eu-north-1"
      s3_bucket: "s3-archive-test"
      s3_prefix: "v2/"
      # The trailing path segments encode the deployment, collectorID, and bladeID
      # described in the Archive format section above.
      s3_partition_format: 'dt=%Y%m%d/hour=%H/minute=%M/stag/0000000007EB64D7/000000002DFFBCA8'
      s3_partition_timezone: 'UTC'
      compression: gzip

service:
  pipelines:
    logs:
      receivers: [filelog/myapps]
      processors: [resource/add_sumo_fields, batch]
      exporters: [awss3/my-sumo-archive]
```

## Ingestion filtering using path patterns

When configuring an AWS S3 Archive Source on a Hosted Collector, specify a file path pattern to control what gets ingested.

For example, to ingest only Docker logs:

```
v2/*/<DockerSourceTemplateID>/*
```

If differentiation is not required, you can use dummy 16-digit hexadecimal values for both `collectorID` and `bladeID`, and ingestion will still work with correct metadata enrichment.
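
If you do want differentiation, the following hedged sketch routes Docker and Apache logs to different `bladeID` path segments by using two exporter instances. The receiver names, IDs, and bucket values are placeholders, not values from this document.

```yaml
exporters:
  # Docker logs: note the distinct trailing bladeID segment (placeholder value).
  awss3/docker-archive:
    marshaler: "sumo_ic"
    s3uploader:
      region: "eu-north-1"
      s3_bucket: "s3-archive-test"
      s3_prefix: "v2/"
      s3_partition_format: 'dt=%Y%m%d/hour=%H/minute=%M/stag/0000000000000001/00000000000000AA'
      compression: gzip

  # Apache logs: same bucket and prefix, different bladeID segment (placeholder value).
  awss3/apache-archive:
    marshaler: "sumo_ic"
    s3uploader:
      region: "eu-north-1"
      s3_bucket: "s3-archive-test"
      s3_prefix: "v2/"
      s3_partition_format: 'dt=%Y%m%d/hour=%H/minute=%M/stag/0000000000000001/00000000000000BB'
      compression: gzip

service:
  pipelines:
    logs/docker:
      receivers: [filelog/docker]    # placeholder receiver, defined elsewhere
      processors: [resource/add_sumo_fields, batch]
      exporters: [awss3/docker-archive]
    logs/apache:
      receivers: [filelog/apache]    # placeholder receiver, defined elsewhere
      processors: [resource/add_sumo_fields, batch]
      exporters: [awss3/apache-archive]
```

With this layout, a path pattern such as `v2/*/00000000000000AA/*` would pick up only the Docker archive files, following the path-pattern example above.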

## Batching

The size and time window of archived files are controlled using the OpenTelemetry batch processor. For example, a batching timeout of 15 minutes produces one S3 file approximately every 15 minutes.
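
For instance, a minimal sketch of a 15-minute batch configuration (the sample configuration earlier in this document uses `timeout: 600s`, that is, 10 minutes):

```yaml
processors:
  batch:
    timeout: 900s          # 15 minutes: roughly one archived S3 file every 15 minutes
    send_batch_size: 8192
```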

If the ingestion job window does not exactly align with the batching boundaries, the Hosted Collector behaves conservatively and may ingest slightly more data than requested rather than risk missing any. This ensures no data loss around interval boundaries.

Example:

| Archive File Creation Window | Ingestion Job Window | Ingested File Window |
|:--|:--|:--|
| hour1/minute07 | hour1/minute05 to hour1/minute30 | hour0/minute52 |
| hour1/minute22 | hour1/minute05 to hour1/minute30 | hour1/minute07 |
| hour1/minute37 | hour1/minute05 to hour1/minute30 | hour1/minute22 |
| hour1/minute52 | hour1/minute05 to hour1/minute30 | hour1/minute37 |

## Ingest data from Archive

You can ingest a specific time range of data from your Archive at any time with an **AWS S3 Archive Source**. First, [create an AWS S3 Archive Source](#create-an-aws-s3-archive-source), then [create an ingestion job](#create-an-ingestion-job).

### Rules

* A maximum of 2 concurrent ingestion jobs is supported. If more jobs are needed, contact your Sumo Logic account representative.
* An ingestion job has a maximum time range of 12 hours. If a longer time range is needed, contact your Sumo Logic account representative.
* Filenames or object key names must be in either of the following formats (see the example after this list):
  * Sumo Logic [Archive format](#archive-format)
  * `prefix/dt=YYYYMMDD/hour=HH/fileName.json.gz`
* If the logs from the Archive do not have timestamps, they are only searchable by receipt time.
* If a Field is tagged to an archived log message and the ingesting Collector or Source has a different value for that Field, the field values already tagged to the archived log take precedence.
* If the Collector or Source that archived the data is deleted, the ingesting Collector and Source metadata Fields are tagged to your data.
* You can create ingestion jobs for the same time range. However, jobs maintain a 10-day history of ingested data, and any data resubmitted for ingestion within 10 days of its last ingestion is automatically filtered so it's not ingested.
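
For reference, an object key in the second format might look like the following; the prefix and file name here are hypothetical:

```
logs-archive/dt=20250115/hour=09/myapp-logs.json.gz
```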

### Create an AWS S3 Archive Source

:::note
You need the **Manage Collectors** role capability to create an AWS S3 Archive Source.
:::

An AWS S3 Archive Source allows you to ingest your archived data. Configure it to access the AWS S3 bucket that has your archived data.

:::note
To use JSON to create an AWS S3 Archive Source, reference our AWS Log Source parameters and use `AwsS3ArchiveBucket` as the value for `contentType`.
:::

1. [**New UI**](/docs/get-started/sumo-logic-ui). In the main Sumo Logic menu, select **Data Management**, and then under **Data Collection** select **Collection**. You can also click the **Go To...** menu at the top of the screen and select **Collection**. <br/>[**Classic UI**](/docs/get-started/sumo-logic-ui-classic). In the main Sumo Logic menu, select **Manage Data > Collection > Collection**.
1. On the **Collectors** page, click **Add Source** next to a Hosted Collector, either an existing Hosted Collector or one you have created for this purpose.
1. Select **AWS S3 Archive**. <br/><img src={useBaseUrl('img/archive/archive-icon.png')} alt="Archive icon" width="100"/>
1. Enter a name for the new Source. A description is optional.
1. Select an **S3 region** or keep the default value of **Others**. The S3 region must match the appropriate S3 bucket created in your Amazon account.
1. For **Bucket Name**, enter the exact name of your organization's S3 bucket. Be sure to double-check the name as it appears in AWS.
1. For **Path Expression**, enter the wildcard pattern that matches the Archive files you'd like to collect. The pattern:
    * can use one wildcard (\*).
    * can specify a [prefix](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html#object-keys) so only certain files from your bucket are ingested. For example, if your filename is `prefix/dt=<date>/hour=<hour>/minute=<minute>/<collectorId>/<sourceId>/v2/<fileName>.txt.gzip`, you could use `prefix*` to ingest only the matching files.
    * can **NOT** use a leading forward slash.
    * can **NOT** include the S3 bucket name.
1. For **Source Category**, enter any string to tag the data collected from this Source. Category metadata is stored in a searchable field called `_sourceCategory`.
1. **Fields**. Click the **+Add Field** link to add custom metadata Fields. Define the fields you want to associate; each field needs a name (key) and value.
    :::note
    Fields specified on an AWS S3 Archive Source take precedence if the archived data has the same fields.
    :::
    * <img src={useBaseUrl('img/reuse/green-check-circle.png')} alt="green check circle.png" width="20"/> A green circle with a check mark is shown when the field exists and is enabled in the Fields table schema.
    * <img src={useBaseUrl('img/reuse/orange-exclamation-point.png')} alt="orange exclamation point.png" width="20"/> An orange triangle with an exclamation point is shown when the field doesn't exist, or is disabled, in the Fields table schema. In this case, an option to automatically add or enable the nonexistent fields to the Fields table schema is provided. If a field is sent to Sumo that does not exist in the Fields schema, or is disabled, it is ignored, known as dropped.
1. For **AWS Access**, you have two **Access Method** options. Select **Role-based access** or **Key access** based on the AWS authentication you are providing. Role-based access is preferred; this was completed in the prerequisite step, Grant Sumo Logic access to an AWS Product.
    * For **Role-based access**, enter the Role ARN that was provided by AWS after creating the role.
    * For **Key access**, enter the **Access Key ID** and **Secret Access Key**. See [AWS Access Key ID](http://docs.aws.amazon.com/STS/latest/UsingSTS/UsingTokens.html#RequestWithSTS) and [AWS Secret Access Key](https://aws.amazon.com/iam/) for details.
1. Create any Processing Rules you'd like for the AWS Source.
1. When you are finished configuring the Source, click **Save**.

## Archive page

:::important
You need the Manage or View Collectors role capability to manage or view Archive.
:::

The Archive page provides a table of all the existing [AWS S3 Archive Sources](#create-an-aws-s3-archive-source) in your account and their ingestion jobs.

[**New UI**](/docs/get-started/sumo-logic-ui/). To access the Archive page, in the main Sumo Logic menu select **Data Management**, and then under **Data Collection** select **Archive**. You can also click the **Go To...** menu at the top of the screen and select **Archive**.

[**Classic UI**](/docs/get-started/sumo-logic-ui-classic). To access the Archive page, in the main Sumo Logic menu select **Manage Data > Collection > Archive**.

<img src={useBaseUrl('img/archive/archive-page.png')} alt="Archive page" width="800"/>

### Details pane

Click a table row to view the Source details. These include:

* **Name**
* **Description**
* **AWS S3 bucket**
* All **Ingestion jobs** that are and have been created on the Source.
  * Each ingestion job shows the name, time window, and volume of data processed by the job. Click the icon <img src={useBaseUrl('img/archive/open-search-icon.png')} alt="Open in search icon" width="30" /> to the right of the job name to start a Search against the data that was ingested by the job.
  * Hover your mouse over the information icon to view who created the job and when.<br/><img src={useBaseUrl('img/archive/archive-details-pane.png')} alt="Archive details pane" width="325"/>

## Create an ingestion job

:::note
A maximum of 2 concurrent jobs is supported.
:::

An ingestion job is a request to pull data from your S3 bucket. The job begins immediately and provides statistics on its progress. To ingest from your Archive, you need an AWS S3 Archive Source configured to access the AWS S3 bucket with your archived data.

1. [**New UI**](/docs/get-started/sumo-logic-ui). In the main Sumo Logic menu, select **Data Management**, and then under **Data Collection** select **Archive**. You can also click the **Go To...** menu at the top of the screen and select **Archive**. <br/>[**Classic UI**](/docs/get-started/sumo-logic-ui-classic). In the main Sumo Logic menu, select **Manage Data > Collection > Archive**.
1. On the **Archive** page, search for and select the AWS S3 Archive Source that has access to your archived data.
1. Click **New Ingestion Job** and a window appears where you:
    1. Define a mandatory job name that is unique to your account.
    1. Select the date and time range of archived data to ingest. A maximum of 12 hours is supported. <br/><img src={useBaseUrl('img/archive/Archive-ingest-job.png')} alt="Archive ingest job" width="350"/>
1. Click **Ingest Data** to begin ingestion. The status of the job is visible in the Details pane of the Source on the Archive page.

### Job status

An ingestion job will have one of the following statuses:

* **Pending**. The job is queued before scanning has started.
* **Scanning**. The job is actively scanning for objects in your S3 bucket. Your objects may be ingested in parallel.
* **Ingesting**. The job has completed scanning for objects and is still ingesting your objects.
* **Failed**. The job failed to complete. Partial data may have been ingested and is searchable.
* **Succeeded**. The job has completed ingesting and your data is searchable.

## Search ingested Archive data

Once your Archive data is ingested with an ingestion job, you can search it as you would any other data ingested into Sumo Logic. On the Archive page, find and select the AWS S3 Archive Source that ran the ingestion job. In the [Details pane](#details-pane), you can click the **Open in Search** link to view the data that was ingested by the job in a Search.

:::note
When you search for data in the Frequent or Infrequent Tier, you must explicitly reference the partition.
:::

The metadata field `_archiveJob` is automatically created in your account and assigned to ingested Archive data. This field does not count against your Fields limit. Ingested Archive data has the following metadata assignments:

| Field | Description |
|:--|:--|
| `_archiveJob` | The name of the ingestion job assigned to ingest your Archive data. |
| `_archiveJobId` | The unique identifier of the ingestion job. |
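
For example, a hedged search sketch that scopes results to a specific ingestion job; the job name here is hypothetical:

```
_archiveJob=docker-logs-backfill
| count by _sourceCategory, _sourceName
```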

## Audit ingestion job requests

The [Audit Event Index](/docs/manage/security/audit-indexes/audit-event-index) provides event logs in JSON when ingestion jobs are created, completed, and deleted.