Skip to content

The process has stuck when reached too many open files #55

@Felixoid

Description

@Felixoid

Hello. Yesterday the process has stuck in a strange state without doing anything.

I did check the CH logs, there wasn't any suspicious.

There's a log from the stuck time:

> grep -C15 2020-02-06T20:12:33.379Z TMOP-CCH.log
[2020-02-06T20:12:27.381Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4"}
[2020-02-06T20:12:27.403Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019944197075408.lz4"}
[2020-02-06T20:12:27.617Z] ERROR [upload] handle failed {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4", "error": "Post http://ch-host:8123/?query=INSERT+INTO+graphite.tagged+%28Date%2C+Tag1%2C+Path%2C+Tags%2C+Version%29+FORMAT+RowBinary: EOF", "time": 0.235977166}
[2020-02-06T20:12:27.754Z] INFO [upload] handle success {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019944197075408.lz4", "time": 0.374817767}
[2020-02-06T20:12:28.129Z] INFO [upload] handle success {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019944197075408.lz4", "time": 0.727357581}
[2020-02-06T20:12:28.380Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4"}
[2020-02-06T20:12:28.486Z] INFO [upload] handle success {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4", "time": 0.105951873}
[2020-02-06T20:12:30.379Z] INFO [upload] start handle {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019947197180216.lz4"}
[2020-02-06T20:12:30.380Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019947197180216.lz4"}
[2020-02-06T20:12:30.380Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4"}
[2020-02-06T20:12:30.478Z] INFO [upload] handle success {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019947197180216.lz4", "time": 0.098053346}
[2020-02-06T20:12:30.625Z] INFO [upload] handle success {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019947197180216.lz4", "time": 0.245967823}
[2020-02-06T20:12:30.875Z] ERROR [upload] handle failed {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4", "error": "Post http://ch-host:8123/?query=INSERT+INTO+graphite.index+%28Date%2C+Level%2C+Path%2C+Version%29+FORMAT+RowBinary: EOF", "time": 0.495387526}
[2020-02-06T20:12:31.380Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4"}
[2020-02-06T20:12:31.639Z] INFO [upload] handle success {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4", "time": 0.25905138}
[2020-02-06T20:12:33.379Z] INFO [upload] start handle {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019950197305764.lz4"}
[2020-02-06T20:12:33.380Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019950197305764.lz4"}
[2020-02-06T20:12:33.380Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019950197305764.lz4"}
[2020-02-06T20:12:45.198Z] INFO [writer] chunk interval changed {"interval": "10s"}
[2020-02-06T20:12:49.234Z] WARN [tcp] unfinished line {"peer": "[2a00:1f78:fffd:402e::101]:35402", "line": "servers.a"}
[2020-02-06T20:12:50.012Z] WARN [tcp] unfinished line {"peer": "[2a00:1f78:fffd:4002::a9d]:56246", "line": "metric.name"}
[2020-02-06T20:12:50.709Z] WARN [tcp] failed to accept connection {"error": "accept tcp [::]:2003: accept4: too many open files"}
[2020-02-06T20:12:50.709Z] WARN [tcp] failed to accept connection {"error": "accept tcp [::]:2003: accept4: too many open files"}
[2020-02-06T20:12:50.709Z] WARN [tcp] failed to accept connection {"error": "accept tcp [::]:2003: accept4: too many open files"}
**** the line repeats next 1.3GiB****

Another strange thing - disk space was allocated in df output, but du didn't show anything: (upd: looks like logrotate has removed the file, but carbon-clickhouse hasn't release it by some reason)

> df
Filesystem     1K-blocks    Used Available Use% Mounted on
udev              980608       0    980608   0% /dev
tmpfs             198304   22024    176280  12% /run
/dev/vda         7329792 7329772        20 100% /
# cuted
> sudo du -hscx / | sort -h 
1.7G	/
1.7G	total
> sudo lsof | grep -i del
COMMAND     PID   TID             USER   FD      TYPE             DEVICE SIZE/OFF       NODE NAME
carbon-cl 19199       carbon-clickhouse   13w      REG              254,0 5603639296    4872077 /var/log/carbon-clickhouse/carbon-clickhouse.log.1 (deleted)
carbon-cl 19199  6028 carbon-clickhouse   13w      REG              254,0 5603639296    4872077 /var/log/carbon-clickhouse/carbon-clickhouse.log.1 (deleted)
carbon-cl 19199 19200 carbon-clickhouse   13w      REG              254,0 5603639296    4872077 /var/log/carbon-clickhouse/carbon-clickhouse.log.1 (deleted)
*** dozens of line with different TID ***

I'll increase nofiles limit for sure because it was the reason for the outage. But maybe you'd find a way to avoid such deadlocks. Even panic and die would be better behavior IMHO.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions