Comparing Logging Solutions

9 min readJun 3, 2021

Loki vs ELK vs SPLUNK. A Case Study.

Logs has long been one of our greatest companion when it comes to investigating a production incident. Gone are the days where one had to parse and delve into a haystack of logs in order to find a cause of an issue. Nowadays, there are plenty of open-source and commercial solutions offering a complete suite of parsing support, correlation and even amazing visualizations.

In this article, I’ll be analysing three well known logging solutions: Grafana Loki, ELK Stack and SPLUNK Core. For each I’ll measure three aspects: usability, ingest performance and query/visualization performance.

Disclaimer: All steps shown below were based solely on my own searching, debugging, tunning and understanding of each technology. Feel free to contribute to it by dropping a comment :)

That being say, let’s start!

Methodology

Hardware

I used a m5.xlarge EC2 instance for each test. It comprises of 4 vCPUs and 16.0 GiB RAM. Each with 300GB EBS (gp2–900IOPS).

Dataset

One month of millisecond granularity randomly created nginx like format log of the sort:

log_format main '{ "time": "[]", "status": "$status", "path": "$request" }'

Log sample:

{ "time":"[01/May/2021:00:00:00.000 -0300]", "path":"page153", "status":"302" }
{ "time":"[01/May/2021:00:00:00.001 -0300]", "path":"page112", "status":"304" }
{ "time":"[01/May/2021:00:00:00.003 -0300]", "path":"page6", "status":"408" }
{ "time":"[01/May/2021:00:00:00.005 -0300]", "path":"page54", "status":"404" }
{ "time":"[01/May/2021:00:00:00.007 -0300]", "path":"page113", "status":"304" }
{ "time":"[01/May/2021:00:00:00.009 -0300]", "path":"page81", "status":"200" }
{ "time":"[01/May/2021:00:00:00.011 -0300]", "path":"page75", "status":"302" }
{ "time":"[01/May/2021:00:00:00.013 -0300]", "path":"page147", "status":"408" }
{ "time":"[01/May/2021:00:00:00.015 -0300]", "path":"page198", "status":"404" }

Performance measurement

Application logs and instance metrics were scraped using prometheus node_exporter.

For measurement I’ve measured the time taken to render two common visualizations when it comes to webserver visibility needed.

Number of connections per status_code
Number of connections per path

Two time ranges were also used:

The whole month
1 day

Workload

The data used can be easily replicated by running the script below:

import pandas as pd
import math
import os
import random
import datetime
import numpy as np
import matplotlib.pyplot as plt

random.seed(8)

status_codes = [200, 304, 500, 408, 404, 302]
status_codes += [s for s in status_codes for i in range(random.randint(10,50))]
# dummy_weight
paths = [f"page{i}" for i in range(0, 300)]
paths += [p for p in random.sample(paths, 20) for i in range(random.randint(10,50))]

series = []
variance = 0
start = datetime.datetime(2021,5,1)
for i in range(310):
    variance += (random.random() - 0.5)/10
    val = math.cos(i/10) + variance
    series.append(abs(val) * random.randint(8,40)*1000)
df = pd.DataFrame({"x": [start+datetime.timedelta(minutes=1*i) for i in range(len(series))], "y": series})
df.set_index('x').plot(figsize=(18,6))

# oversample to millisecond granularity
# 31days * 24hours * 60minutes = 44640
arr = [float('nan') for i in range(44640)]
for idx,v in enumerate(series):
    arr[idx*144] = v
ts_start = datetime.datetime(2021,5,1).timestamp()
x = [datetime.datetime.fromtimestamp(ts_start+60*i) for i in range(len(arr))]
df = pd.DataFrame({"x": x, "y": arr})
df['y'] = df['y'].interpolate(method='linear')
df['y'] = df['y'].astype(int)
df.set_index('x').plot(figsize=(18,6))


logdir = "./logdir"
if not os.path.exists(logdir):
    os.mkdir(logdir)
    

for idx,v in enumerate(df['y'].values):
    row = []
    dstart = datetime.datetime(2021,5,1) + datetime.timedelta(minutes=idx)
    dstart = dstart
    filename = logdir + "/" + dstart.strftime("%Y-%m-%d.log")
    dstart = float(dstart.timestamp()*1000)
    t = np.linspace(0, 59, v) * 1000
    for i in t:
        ts = int(dstart+i)
        ds = datetime.datetime.fromtimestamp(ts/1000.0)
        datefmt = ds.strftime("[%d/%b/%Y:%H:%M:%S.%f")[:-3] + " -0300]"
        status_code = random.choice(status_codes)
        path = random.choice(paths)
        row.append(f"{{ \"time\":\"{datefmt}\", \"path\":\"{path}\", \"status\":\"{status_code}\" }}")
    if os.path.exists(filename):
        with open(filename, "a+") as f:
            f.write("\n")
    with open(filename, "a+") as f:
        f.write("\n".join(row))

It took almost 3h30min hours to generate 42GB of data.

CPU times: user 1h 16min 53s, sys: 1min 16s, total: 1h 18min 10s
Wall time: 3h 28min 21s

File list:

2.9G    logdir/2021-05-01.log
1.1G    logdir/2021-05-02.log
2.3G    logdir/2021-05-03.log
3.1G    logdir/2021-05-04.log
1.4G    logdir/2021-05-05.log
1.5G    logdir/2021-05-06.log
2.7G    logdir/2021-05-07.log
895M    logdir/2021-05-08.log
2.2G    logdir/2021-05-09.log
2.8G    logdir/2021-05-10.log
870M    logdir/2021-05-11.log
1.7G    logdir/2021-05-12.log
3.2G    logdir/2021-05-13.log
2.4G    logdir/2021-05-14.log
818M    logdir/2021-05-15.log
3.6G    logdir/2021-05-16.log
2.2G    logdir/2021-05-17.log
957M    logdir/2021-05-18.log
1.8G    logdir/2021-05-19.log
2.4G    logdir/2021-05-20.log
963M    logdir/2021-05-21.log
2.7G    logdir/2021-05-22.log
2.6G    logdir/2021-05-23.log
664M    logdir/2021-05-24.log
2.8G    logdir/2021-05-25.log
3.8G    logdir/2021-05-26.log
1.2G    logdir/2021-05-27.log
2.7G    logdir/2021-05-28.log
3.3G    logdir/2021-05-29.log
1.4G    logdir/2021-05-30.log
2.0G    logdir/2021-05-31.log

Now, to the applications..

ELK

Deploying EK

sysctl -w vm.max_map_count=262144

docker-compose.yaml

version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.12.1
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      #- discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms8g -Xmx8g"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
       soft: 65536
       hard: 65536
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastickib01:
    image: docker.elastic.co/kibana/kibana:7.12.1
    container_name: kib01
    ports:
      - 5601:5601
    environment:
      ELASTICSEARCH_URL: http://es01:9200
      ELASTICSEARCH_HOSTS: '["http://es01:9200"]'
    networks:
      - elasticvolumes:
  data01:
    driver: localnetworks:
  elastic:
    driver: bridge

Installing Logstash

rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
cat <<EOF > /etc/yum.repos.d/logstash.repo
[logstash-7.x]
name=Elastic repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
EOF
yum -y install logstash
echo "export PATH=\"/usr/share/logstash/bin/:$PATH\"" >> /root/.bashrc
export PATH="/usr/share/logstash/bin/:$PATH"

logstash config

input {
    file {
        path => "/root/wkdir/logs/*.log"
        start_position => "beginning"
        #codec => multiline {
        #    pattern => '.*128=.*'
        #    negate => false
        #    what => previous
        #}
        sincedb_path => "/dev/null"        
    }
}# https://stackoverflow.com/questions/27443392/how-can-i-have-logstash-drop-all-events-that-do-not-match-a-group-of-regular-exp
filter {
    grok {
        add_tag => [ "valid" ]
        match => {
            "message" => [
                ".*?"time":"(?P<time>[^"]+)".*?path":"(?<path>[^"]+)".*"status":"(?<status>[^"]+)""
            ]
        }
    }
    # https://discuss.elastic.co/t/what-is-the-date-format-to-be-used-with-logstash-to-match-9-digits-millisecond/107032 
    # https://discuss.elastic.co/t/how-to-replace-timestamp-with-logtime/118949/2
    # 20210521-23:47:00.333333333
    date {
        match => ["time", "yyyyMMdd-HH:mm:ss.SSS Z"]
        target => "@timestamp"
        remove_field => ["time"]
    }if "_grokparsefailure" in [tags] {
        drop {}
    }if "valid" not in [tags] {            
        drop { }
    }mutate {
        remove_tag => [ "valid" ]
        #remove_field => [ "message" ]
    }
}output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "nginx"
    }
    #stdout {
    #    codec => json_lines
    #}
}

ELK took about 10 hours to ingest all data and accounted for over 100GB of indexed data.

On one hand ELK’s ingest time was long and utilized disk size was almost twice the dataset size but, on the other hand the queries were pretty fast.
Querying 1 month (12/12h interval) of data only took 4mins and 1 day query was instantaneous.

SPLUNK

Download and install SPLUNK Enterprise trial here.

Splunk is very complete when it comes to usability. I didn’t need to touch the CLI after installing it. Configuration and data ingestion were all done using the web ui, even timestamp formating, that’s a plus.

It took almos 2h20minutes to ingest all data which resulted a total of 40GB of data indexed.

Regex:

.*?"time":"\[(?P<time>[^\]]+)\]".*?path":"(?<path>[^"]+)".*"status":"(?<status>[^"]+)"

1 Month query

Querying the whole month took around 1h30mins to display the results. Quite a lot!

Query by status:

sourcetype=nginx | timechart useother=f limit=1000 span=1h count by status

Query by path:

sourcetype=nginx | timechart useother=f limit=1000 span=1h count by path

1 Day Query

1 Day length query was faster. It took around 4minutes.

Grafana Loki

Loki being the “new kid” amongst its peers surprised me in many aspects. Also Grafana dashboards itself amazes me everytime I plot something there. Anyway.. let’s continue.

docker-compose.yaml

version: "3"networks:
  loki:services:
  loki:
    image: grafana/loki:2.2.0
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - lokipromtail:
    image: grafana/promtail:2.2.0
    volumes:
      - ./logs:/logs
      - ./promtail-config.yaml:/etc/promtail/config.yaml
    command: -config.file=/etc/promtail/config.yaml
    networks:
      - loki#  grafana:
        #    image: grafana/grafana:latest
        #    ports:
        #      - "3000:3000"
        #    networks:
        #      - loki

loki-config.yaml

auth_enabled: falseserver:
  http_listen_port: 3100
  grpc_listen_port: 9096
  http_server_read_timeout: 20m
  http_server_write_timeout: 20mingester:
  wal:
    enabled: true
    dir: /tmp/wal
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabledschema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24hstorage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    cache_location: /tmp/loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /tmp/loki/chunksquerier:
  query_timeout: 20m
  engine:
    timeout: 20mcompactor:
  working_directory: /tmp/loki/boltdb-shipper-compactor
  shared_store: filesystemlimits_config:
  reject_old_samples: false
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 32
  max_streams_per_user: 100000
  #reject_old_samples_max_age: 168hchunk_store_config:
  max_look_back_period: 0stable_manager:
  retention_deletes_enabled: false
  retention_period: 0sruler:
  storage:
    type: local
    local:
      directory: /tmp/loki/rules
  rule_path: /tmp/loki/rules-temp
  alertmanager_url: http://localhost:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
  search_pending_for: 20m

promtail-config.yaml

server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/pushscrape_configs:
- job_name: receive_nginx
  pipeline_stages:
  - match:
      selector: '{job="nginx"}'
      stages:
        - json:
            expressions:
              time:
              path:
              status:- labels:
            path:
            status:
  
        - timestamp:
            #{ "time":"[01/May/2021:23:59:58.979 -0300]", "path":"page260", "status":"200" }
            source: time
            format: "[02/Jan/2006:15:04:05.000 -0700]"
                    
  static_configs:
  - targets:
      - nginx 
    labels:
      job: nginx
      __path__: /logs/*.log

Loki took over 6 hours to ingest and consumed only 9GB of disk space (that’s incredible!)

1 Month Query

Query by status:

sum by (status) (
    count_over_time({job="nginx"})
    [$__interval]
    )
)

Query by path:

sum by (path) (
    count_over_time({job="nginx"})
    [$__interval]
    )
)

Query 1 month of data took about 27minutes (30s interval) with a disk footprint of 70GB read.

loki_1      | level=info ... length=744h0m0s step=20m0s duration=27m20.345708133s status=200 limit=1944 returned_lines=0 throughput=43MB total_bytes=70GB
loki_1      | level=info ... length=744h0m0s step=20m0s duration=27m22.792997646s status=200 limit=1944 returned_lines=0 throughput=43MB total_bytes=70GB

1 Month (12/12h interval) also took the same time

1 Day Query

Querying 1 day of data took about 1min

loki_1      | ... length=24h0m0s step=30s duration=51.944914194s status=200 limit=1944 returned_lines=0 throughput=59MB total_bytes=3.1GB
loki_1      | ... length=24h0m0s step=30s duration=53.915508495s status=200 limit=1944 returned_lines=0 throughput=57MB total_bytes=3.1GB

Conclusion

As I sort of dumped a bunch of configurations, time measurements and notes on the sections above. Let’s organize them a bit.

As one can see, each software has its own perks as well as downsides. Grafana Loki being the only one that allows interval customization (others allow but up to a limit).

The right solution platform really depends on each administrator needs.

Further work

I’m pretty sure other configuration tunning can be done. We could also explore cluster setups. But, overall, this already gives us a good starting point whether of not you’re assessing your next CLM — Centralized Logging Management mission.

That’s all! Thank you.
If you enjoyed it please like it.
Feel free to drop a comment too.

Comparing Logging Solutions

Methodology

Hardware

Dataset

Performance measurement

Workload

ELK

Deploying EK

Installing Logstash

SPLUNK

1 Month query

1 Day Query

Grafana Loki

1 Month Query

1 Day Query

Conclusion

Further work

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by CrashLaker

Responses (5)