The purpose of this tool is to parse log files generated by the Percona Audit Plugin for Percona Server and import them into a Clickhouse database.
Use at your own risk.
I needed to perform audit logging on several Percona servers. The Percona Audit Plugin does exactly that, but while generating the log is easy, doing the actual audit -- not so much. When I saw the quite mind-blowing performance of Clickhouse when searching through huge amounts of data, I decided to give it a try and import all my logs into a Clickhouse database. I looked around for a tool to handle the import, but couldn't find a perfect match. I found a couple of projects, but they didn't handle importing logs from different hosts into the same database, and because they were built for handling logs in general, they didn't perform quite as well as I'd like. So, to make a long story short, I made my own importer. It works for me, and perhaps it can work for you too.
- The importer does not interface with Percona Server itself in any way. It only handles the log files you feed it. This means you need to have your Percona Server up and running and configured for audit logging. Please read the Percona documentation on configuring the Audit Plugin.
- The importer connects to Clickhouse like a normal client, so you have to set up a working Clickhouse server in advance and provide the necessary credentials (see config file below).
A few other things that are nice to know before starting:
- The parser itself is pretty CPU intensive, so it's not recommended to run it on a production Percona server. Instead, consider copying the log file to another host and running the parser from there.
- The current importer only supports the JSON log format. If you need to use other formats (like CSV or XML), you will need to adapt the importer accordingly.
- You'll also need a Go environment to compile the source into a binary.
- The parser currently only supports log files in the UTC time zone (timestamps ending in "UTC" or "Z").
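Regarding the UTC limitation: the sketch below shows one way such timestamps can be parsed in Go, handling both the "UTC" and "Z" suffixes mentioned above. The layouts and the sample values are assumptions for illustration only; the actual parsing in main.go may differ.

package main

import (
	"fmt"
	"time"
)

// parseAuditTime parses a timestamp from the audit log, accepting both the
// "YYYY-MM-DDTHH:MM:SS UTC" style and the RFC3339 "...Z" style.
// Illustrative only -- the layouts used by the real importer may differ.
func parseAuditTime(s string) (time.Time, error) {
	layouts := []string{
		"2006-01-02T15:04:05 UTC", // literal " UTC" suffix
		time.RFC3339,              // e.g. 2020-06-16T19:35:01Z
	}
	var t time.Time
	var err error
	for _, layout := range layouts {
		if t, err = time.Parse(layout, s); err == nil {
			return t.UTC(), nil
		}
	}
	return time.Time{}, err
}

func main() {
	for _, s := range []string{"2020-06-16T19:35:01 UTC", "2020-06-16T19:35:01Z"} {
		t, err := parseAuditTime(s)
		fmt.Println(t, err)
	}
}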
If you have a look at the code (and I urge you to do so), you'll see it consists of three .go files:
main.go
: The main file that does the actual parsing.

subfunctions.go
: Definitions of some simple helper functions used by main.go.

structs.go
: The structs into which the JSON records and the configuration parameters are loaded.
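As a rough illustration of what structs.go deals with, each log line is a JSON object that can be unmarshalled into a struct along these lines. The struct, field, and JSON key names here are my own assumptions based on the table layout shown further down; the real file may differ.

package main

import (
	"encoding/json"
	"fmt"
)

// auditRecord is an illustrative stand-in for the struct(s) defined in
// structs.go. The JSON key names are assumptions based on the columns of
// the mysql_audit_log table shown below.
type auditRecord struct {
	Record struct {
		Name         string `json:"name"`
		Record       string `json:"record"`
		Timestamp    string `json:"timestamp"`
		CommandClass string `json:"command_class"`
		ConnectionID string `json:"connection_id"`
		Status       uint32 `json:"status"`
		Sqltext      string `json:"sqltext"`
		User         string `json:"user"`
		Host         string `json:"host"`
		OSUser       string `json:"os_user"`
		IP           string `json:"ip"`
		DB           string `json:"db"`
	} `json:"audit_record"`
}

func main() {
	// A minimal, made-up example line.
	line := `{"audit_record":{"name":"Query","record":"1_2020-06-16T19:35:01","timestamp":"2020-06-16T19:35:01 UTC","command_class":"select","connection_id":"42","status":0,"sqltext":"SELECT 1","user":"app","host":"localhost","os_user":"","ip":"127.0.0.1","db":"test"}}`

	var r auditRecord
	if err := json.Unmarshal([]byte(line), &r); err != nil {
		panic(err)
	}
	fmt.Println(r.Record.User, r.Record.CommandClass, r.Record.Sqltext)
}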
Download the code to your Go environment and build it with:
go build
Move the binary to a location of your choice, and you're ready to go.
Alternatively, download the binary file (clickhouseimporter).
Move the binary to a location of your choice, and you're ready to go.
The following commands will create a database called "audit" with one table called "mysql_audit_log" from the SQL files included in the repository:
[user@host]# clickhouse-client --multiline < createdatabase.sql
[user@host]# clickhouse-client --multiline < createtable.sql
The configuration is defined in a YAML file, similar to this:
# Clickhouse credentials
database:
  user: "username"
  pass: "password"
  hostname: "localhost" # or the hostname of the clickhouse server
  port: 9000
  dbname: "audit"
  tablename: "mysql_audit_log"
# MISC
misc:
  batchsize: 20000 # the number of lines parsed before committing to clickhouse
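If you're curious how such a file could be mapped onto Go structs, here is a minimal sketch. The struct and field names are illustrative (the importer's own structs.go may differ), and the gopkg.in/yaml.v2 dependency is my assumption, not necessarily what the importer uses.

package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v2"
)

// config mirrors the YAML file above. Names are illustrative only.
type config struct {
	Database struct {
		User      string `yaml:"user"`
		Pass      string `yaml:"pass"`
		Hostname  string `yaml:"hostname"`
		Port      int    `yaml:"port"`
		DBName    string `yaml:"dbname"`
		TableName string `yaml:"tablename"`
	} `yaml:"database"`
	Misc struct {
		BatchSize int `yaml:"batchsize"`
	} `yaml:"misc"`
}

func loadConfig(path string) (*config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var c config
	if err := yaml.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	// Path taken from the bash example further down; adjust to your setup.
	c, err := loadConfig("/data/audit/conf.yml")
	if err != nil {
		panic(err)
	}
	fmt.Printf("connecting to %s:%d, batch size %d\n", c.Database.Hostname, c.Database.Port, c.Misc.BatchSize)
}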
The name of the database and table can be changed in the createdatabase.sql and createtable.sql files. Remember to change the names accordingly in the config file.
Changing the structure of the table itself is not recommended.
The structure of the table is almost identical to that of the JSON file, except for the fields os_login and proxy_user. In all the samples I've examined, these fields have always been empty, so for simplicity's sake, they have been left out. Other than that, the structure of the table is like this:
CREATE TABLE audit.mysql_audit_log (
    `time` DateTime,
    `date` Date DEFAULT toDate(time),
    `name` String,
    `record` String,
    `command_class` String,
    `connection_id` String,
    `status` UInt32,
    `sqltext` String,
    `user` LowCardinality(String),
    `host` LowCardinality(String),
    `os_user` String,
    `ip` String,
    `db` LowCardinality(String),
    `dbserver` LowCardinality(String)
) ENGINE = MergeTree() PARTITION BY toDate(time) ORDER BY time SETTINGS index_granularity = 8192;
The table is partitioned by date to make cleanup easy; old partitions can simply be dropped when you no longer need them.
./clickhouseimporter -d database-hostname -a file-to-parse -c path/to/yaml-config-file
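For reference, the three flags could be wired up with the standard flag package roughly as below. This is a sketch, not the actual main.go. Note that -d is the name of the Percona server the log file came from (it ends up in the dbserver column, see the bash example further down).

package main

import (
	"flag"
	"fmt"
)

func main() {
	// -d: name of the Percona server the log originated from (stored in the dbserver column)
	dbserver := flag.String("d", "", "name of the originating database server")
	// -a: the audit log file to parse
	logfile := flag.String("a", "", "path to the audit log file to parse")
	// -c: the YAML configuration file described above
	configfile := flag.String("c", "", "path to the YAML config file")
	flag.Parse()

	if *dbserver == "" || *logfile == "" || *configfile == "" {
		flag.Usage()
		return
	}
	fmt.Printf("parsing %s from %s using config %s\n", *logfile, *dbserver, *configfile)
}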
I can't guarantee that this example will work out of the box in your environment, but it is more or less a copy of what I ended up doing in my own environment. It shows the clickhouseimporter processing a directory with subdirectories containing incoming files from four different servers: server1-4. Obviously, you have to copy the log files from the Percona server(s) to the importer host prior to running it.
The contents of the /data/audit/ingress/ folder look something like this:
/data/audit/ingress/
|-- server1
|   `-- audit.log-20200616193501
|-- server2
|   `-- audit.log-20200616193501
|-- server3
|   |-- audit.log-20200616193001
|   |-- audit.log-20200616193501
|   `-- audit.log-20200616194001
`-- server4
    `-- audit.log-20200616193501
The bash script below will find all folders in /data/audit/ingress/ and scan them individually for files named audit*. Each file will then be parsed line by line, imported into Clickhouse, and deleted afterwards. If you don't want your original files to be deleted, remove or comment out the rm -f line. Note that the name of each folder is assumed to be the hostname the log file originated from; this name becomes the value of the dbserver column.
#!/bin/bash
shopt -s nullglob
INCOMINGDIR="/data/audit/ingress"
CONFIGFILE="/data/audit/conf.yml"
cd "$INCOMINGDIR"
for DBHOST in *; do
    cd "$INCOMINGDIR/$DBHOST"
    # nullglob makes this expand to an empty list when there are no audit* files
    files=( audit* )
    if [ ${#files[@]} -eq 0 ]; then
        echo "no new files for $DBHOST"
    else
        for file in "${files[@]}"; do
            echo "doing $INCOMINGDIR/$DBHOST/$file"
            if ! /data/audit/clickhouseimporter -d "$DBHOST" -a "$INCOMINGDIR/$DBHOST/$file" -c "$CONFIGFILE"; then
                echo "Could not complete task"
                exit 1
            fi
            # comment out the next line if you want to keep the original files
            rm -f "$INCOMINGDIR/$DBHOST/$file"
        done
    fi
done
The performance of this importer depends on almost every factor in the equation:
- The batchsize configured in the config file (read the Clickhouse documentation for guidance on insert batch sizes)
- The CPU on the host
- The IO capacity of the storage system
- The size of the file and the length of the queries within it.
- Probably more factors in your environment.
Tested on a VMware VM with two CPU cores, SAN storage, and a batchsize of 20000, the parser was able to process roughly 60,000 lines of JSON per second with a CPU load of roughly 70-100% on one core, while the Clickhouse server used 15-30% of the other core. If your setup requires more than this, you might consider another approach. Running the importer in parallel with a higher number of CPU cores should be possible, but has not been tested.
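To give an idea of why batchsize matters, Clickhouse prefers a few large batched inserts over many small ones, and a batched-insert loop in Go typically looks something like the sketch below. This is not the importer's actual code: I'm assuming the clickhouse-go driver for database/sql here (the import path and DSN format depend on the driver version you pick), and the row values are placeholders.

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	// Assumption: a ClickHouse driver for database/sql. Adjust the import
	// path and the DSN below to match the driver version you actually use.
	_ "github.com/ClickHouse/clickhouse-go"
)

func main() {
	db, err := sql.Open("clickhouse", "tcp://localhost:9000?username=username&password=password&database=audit")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const batchSize = 20000 // same role as batchsize in the config file

	// One batch = one transaction with a prepared INSERT, committed as a whole.
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	stmt, err := tx.Prepare(`INSERT INTO mysql_audit_log
		(time, name, record, command_class, connection_id, status, sqltext, user, host, os_user, ip, db, dbserver)
		VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`)
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; i < batchSize; i++ {
		// In the real importer these values come from the parsed JSON lines.
		if _, err := stmt.Exec(time.Now().UTC(), "Query", "dummy", "select", "42",
			0, "SELECT 1", "app", "localhost", "", "127.0.0.1", "test", "server1"); err != nil {
			log.Fatal(err)
		}
	}
	if err := tx.Commit(); err != nil { // the whole batch hits Clickhouse here
		log.Fatal(err)
	}
	fmt.Println("batch committed")
}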
Disk usage depends on how much activity you have on your Percona servers.
In my particular case I log everything (audit_log_policy set to ALL), and after running this for a couple of months, I have accumulated 18 billion rows using roughly 870 GB of disk space. Even with this amount of data, Clickhouse is still going through it like a hot knife through butter.
Here we count the number of queries per server to see how the activity is distributed across the servers:
SELECT
dbserver,
count()
FROM mysql_audit_log
GROUP BY dbserver
┌─dbserver──────────────────────┬───count()─┐
│ server1 │ 48218977 │
│ server2 │ 7851352 │
│ server3 │ 48069978 │
│ server4 │ 780280 │
│ server5 │ 150245687 │
│ server6 │ 349 │
└───────────────────────────────┴───────────┘
6 rows in set. Elapsed: 1.138 sec. Processed 255.17 million rows, 1.28 GB (224.19 million rows/s., 1.12 GB/s.)
clickhouse-1.stofa.net :)
Next, we count the queries per status code, ignoring the successful ones (status 0), to see which errors occur:
SELECT
status,
count()
FROM mysql_audit_log
WHERE (status != 0)
GROUP BY status
┌─status─┬─count()─┐
│ 1062 │ 219 │
│ 1146 │ 44 │
│ 1045 │ 5 │
│ 1226 │ 191 │
│ 1364 │ 29 │
│ 1406 │ 1 │
│ 1142 │ 42139 │
│ 1156 │ 152 │
│ 1094 │ 2 │
│ 1366 │ 12 │
│ 1048 │ 7 │
│ 1242 │ 31 │
│ 1292 │ 38 │
│ 1158 │ 1 │
│ 1064 │ 1261 │
│ 1356 │ 2 │
│ 1265 │ 2 │
└────────┴─────────┘
17 rows in set. Elapsed: 0.290 sec. Processed 256.85 million rows, 1.26 GB (887.05 million rows/s., 4.34 GB/s.)
clickhouse-1.stofa.net :)
You can now start digging into the different status codes and look for failed queries. You can identify if someone is trying to brute force their way into the database. Frankly, the opportunities are endless once you start combing through the data. You can combine various search parameters in many different ways in search of the elusive needle in a haystack. You're only limited by your imagination. Have fun!
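As an example of building something on top of the data, the sketch below pulls the top sources of "access denied" errors (MySQL error code 1045), which is one way to spot brute-force attempts. Again, the clickhouse-go driver import and the DSN are assumptions; adjust them to your setup, or simply run the SELECT in clickhouse-client.

package main

import (
	"database/sql"
	"fmt"
	"log"

	// Assumption: same ClickHouse driver as in the batching sketch above.
	_ "github.com/ClickHouse/clickhouse-go"
)

func main() {
	db, err := sql.Open("clickhouse", "tcp://localhost:9000?username=username&password=password&database=audit")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Error code 1045 is "Access denied for user" -- repeated hits from the
	// same ip/user combination may indicate a brute-force attempt.
	rows, err := db.Query(`
		SELECT ip, user, count() AS attempts
		FROM mysql_audit_log
		WHERE status = 1045
		GROUP BY ip, user
		ORDER BY attempts DESC
		LIMIT 10`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var ip, user string
		var attempts uint64
		if err := rows.Scan(&ip, &user, &attempts); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-15s %-20s %d\n", ip, user, attempts)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}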