Description
Hi,
We have dotnet-monitor set up on ECS Fargate, running in Listen mode and collecting metrics every X. Our setup is a single dotnet-monitor sidecar in each launched task, with many tasks running. On some tasks it stops working after a few hours, logging the following warning:
{
  "Timestamp": "2022-04-15T06:50:03.0758604Z",
  "EventId": 52,
  "LogLevel": "Warning",
  "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
  "Message": "Unexpected timeout from process 6. Process will no longer be monitored.",
  "State": {
    "Message": "Unexpected timeout from process 6. Process will no longer be monitored.",
    "processId": "6",
    "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
  },
  "Scopes": []
}
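Once this warning is logged the process is also no longer listed by the /processes endpoint. For example, checked from inside the task (the default https://+:52323 URL and a MonitorApiKey bearer token in $TOKEN are assumptions here, not part of the template below; adjust to your setup):

curl -sk https://localhost:52323/processes -H "Authorization: Bearer $TOKEN"
# before the warning this listed pid 6; afterwards it returns an empty array []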
Then all subsequent requests get this error:
{
  "Timestamp": "2022-04-15T06:55:01.6363199Z",
  "EventId": 1,
  "LogLevel": "Error",
  "Category": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController",
  "Message": "Request failed.",
  "Exception": "System.ArgumentException: Unable to discover a target process. at Microsoft.Diagnostics.Monitoring.WebApi.DiagnosticServices.GetProcessAsync(DiagProcessFilter processFilterConfig, CancellationToken token) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/DiagnosticServices.cs:line 100 at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.<>c__DisplayClass33_0`1.<<InvokeForProcess>b__0>d.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 713 --- End of stack trace from previous location --- at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagControllerExtensions.InvokeService[T](ControllerBase controller, Func`1 serviceCall, ILogger logger) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagControllerExtensions.cs:line 91",
  "State": {
    "Message": "Request failed.",
    "{OriginalFormat}": "Request failed."
  },
  "Scopes": [
    {
      "Message": "SpanId:5f73f4ec6a4c2a06, TraceId:6e3bec22534dca3eed9ae13c8150dc0c, ParentId:0d6726492bd0e999",
      "SpanId": "5f73f4ec6a4c2a06",
      "TraceId": "6e3bec22534dca3eed9ae13c8150dc0c",
      "ParentId": "0d6726492bd0e999"
    },
    {
      "Message": "ConnectionId:0HMGU731FOFDF",
      "ConnectionId": "0HMGU731FOFDF"
    },
    {
      "Message": "RequestPath:/livemetrics RequestId:0HMGU731FOFDF:00000002",
      "RequestId": "0HMGU731FOFDF:00000002",
      "RequestPath": "/livemetrics"
    },
    {
      "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)",
      "ActionId": "cc79e4d4-794e-481f-8083-fb3f3c7b5ca5",
      "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)"
    },
    {
      "Message": "ArtifactType:livemetrics",
      "ArtifactType": "livemetrics"
    }
  ]
}
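From the caller's side, every endpoint that needs a target process fails the same way from that point on; e.g. a live metrics request (same placeholder URL/token as above) returns an error instead of streaming counters:

curl -sk https://localhost:52323/livemetrics -H "Authorization: Bearer $TOKEN"
# expected: an HTTP 4xx problem response containing "Unable to discover a target process."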
Note that the main container itself keeps working just fine and processes requests without any issues. Based on the metrics captured before the error, I don't see any abnormal memory/CPU/etc. usage compared to the other tasks where dotnet-monitor keeps working.
Here is our ECS task definition (the dotnet-monitor config values are under 'Environment'):
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Cpu: !Ref TaskCpu
    Memory: !Ref TaskMemory
    NetworkMode: awsvpc
    ExecutionRoleArn: !Sub "arn:aws:iam::${AWS::AccountId}:role/ecsTaskExecutionRole"
    TaskRoleArn: !ImportValue AppServicesEcsTaskRoleArn
    RequiresCompatibilities:
      - FARGATE
    Volumes:
      - Name: tmp
    ContainerDefinitions:
      - Essential: true
        Name: appservices
        Image:
          !Sub
            - "${repository}:${image}"
            - repository: !ImportValue AppServicesEcrRepository
              image: !Ref TaskEcrImageTag
        Ulimits:
          - Name: nofile
            HardLimit: 65535
            SoftLimit: 65535
        PortMappings:
          - ContainerPort: 44392
            Protocol: tcp
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !ImportValue AppServicesEcsLogGroup
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: !Ref EnvironmentName
        LinuxParameters:
          InitProcessEnabled: true
          Capabilities:
            Add:
              - SYS_PTRACE
        StopTimeout: 120
        MountPoints:
          - ContainerPath: /tmp
            SourceVolume: tmp
        Environment:
          - Name: DOTNET_DiagnosticPorts
            Value: /tmp/port
        DependsOn:
          - ContainerName: dotnet-monitor
            Condition: START
      - Essential: true
        Name: dotnet-monitor
        Image:
          !Sub
            - "${repository}:${image}-dotnetmonitor"
            - repository: !ImportValue AppServicesEcrRepository
              image: !Ref TaskEcrImageTag
        MountPoints:
          - ContainerPath: /tmp
            SourceVolume: tmp
        Environment:
          - Name: Kestrel__Certificates__Default__Path
            Value: /tmp/cert.pfx
          - Name: DotnetMonitor_S3Bucket
            Value: !Sub '{{resolve:ssm:/appservices/${EnvironmentName}/integration.bulk.s3.bucket:1}}'
          - Name: DotnetMonitor_DefaultProcess__Filters__0__Key
            Value: ProcessName
          - Name: DotnetMonitor_DefaultProcess__Filters__0__Value
            Value: dotnet
          - Name: DotnetMonitor_DiagnosticPort__ConnectionMode
            Value: Listen
          - Name: DotnetMonitor_DiagnosticPort__EndpointName
            Value: /tmp/port
          - Name: DotnetMonitor_Storage__DumpTempFolder
            Value: /tmp
          - Name: DotnetMonitor_Egress__FileSystem__file__directoryPath
            Value: /tmp/gcdump
          - Name: DotnetMonitor_Egress__FileSystem__file__intermediateDirectoryPath
            Value: /tmp/gcdumptmp
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Type
            Value: EventCounter
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__ProviderName
            Value: System.Runtime
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__CounterName
            Value: working-set
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__GreaterThan
            Value: !Ref TaskMemoryAutoGCDump
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__SlidingWindowDuration
            Value: 00:00:05
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Type
            Value: CollectGCDump
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Name
            Value: GCDump
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Settings__Egress
            Value: file
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Type
            Value: Execute
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Settings__Path
            Value: /bin/sh
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Settings__Arguments
            Value: /app/gcdump.sh $(Actions.GCDump.EgressPath)
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Limits__ActionCount
            Value: 1
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Limits__ActionCountSlidingWindowDuration
            Value: 03:00:00
        Secrets:
          - Name: DotnetMonitor_Authentication__MonitorApiKey__Subject
            ValueFrom: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/appservices/${EnvironmentName}/dotnetmonitor.subject"
          - Name: DotnetMonitor_Authentication__MonitorApiKey__PublicKey
            ValueFrom: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/appservices/${EnvironmentName}/dotnetmonitor.publickey"
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !ImportValue AppServicesEcsLogGroup
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: !Ref EnvironmentName
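As a sanity check that this flattened env-var configuration is picked up as intended, dotnet-monitor's own config command prints the merged configuration it actually loaded. We would run it inside the sidecar container (e.g. via ECS Exec, which this template does not enable, so treat this as a sketch rather than something we have wired up):

# inside the dotnet-monitor container
dotnet-monitor config show
# prints the effective configuration (DefaultProcess, DiagnosticPort, Egress,
# CollectionRules, ...) so typos in the DotnetMonitor_* variables are easy to spot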
And the Dockerfile we use to customize the default dotnet-monitor image:
FROM mcr.microsoft.com/dotnet/monitor:6
RUN apk add curl && \
    apk add jq && \
    apk add aws-cli && \
    apk add dos2unix
RUN adduser -s /bin/true -u 1000 -D -h /app app \
    && chown -R "app" "/app"
COPY --chown=app:app --chmod=500 gcdump.sh /app/gcdump.sh
RUN dos2unix /app/gcdump.sh
USER app
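gcdump.sh itself is not included here; it is the Execute action target that receives $(Actions.GCDump.EgressPath) and, given the aws-cli install and the DotnetMonitor_S3Bucket variable above, ships the dump to S3. A minimal illustrative sketch of that idea (not the actual script):

#!/bin/sh
# Illustrative only - the real gcdump.sh is not part of this issue.
# $1 is the egressed .gcdump path passed in by the Execute action.
set -eu
DUMP_PATH="$1"
aws s3 cp "$DUMP_PATH" "s3://${DotnetMonitor_S3Bucket}/$(hostname)-$(date +%Y%m%d%H%M%S).gcdump"
rm -f "$DUMP_PATH"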