Parse Googlebot logs from MaxCDN #7

Open
mrdavidlaing wants to merge 1 commit into master from parse_maxcdn

@mrdavidlaing
Member

It's possible to get logs of Googlebot traffic to MaxCDN via the MaxCDN API. This gives source logs in the following format:

{  "bytes": 46953, "client_asn": "AS15169 Google Inc.", "client_city": "Mountain View", "client_continent": "NA", "client_country": "US", "client_dma": "0", "client_ip": "66.249.67.220", "client_latitude": 37.38600158691406, "client_longitude": -122.08380126953125, "client_state": "CA", "company_id": 85, "cache_status": "MISS", "hostname": "cdn.yoast.com", "method": "GET", "origin_time": 0.024, "pop": "vir", "protocol": "HTTP/1.1", "query_string": "", "referer": "-", "scheme": "https", "status": 200, "time": "2014-06-30T08:40:45.159Z", "uri": "/wp-content/uploads/2009/10/apple-404.png", "user_agent": "Googlebot-Image/1.0", "zone_id": 33008     }

These should be parsed into a format that makes them easy to analyse.
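As a rough illustration of what "parsed" could mean here, the sketch below decodes one log line and turns its `time` string into a timezone-aware timestamp. It is not the filter in this PR: the function name `parse_maxcdn_line` is hypothetical, and the sample line is an abbreviated copy of the record above with most fields dropped.

```python
import json
from datetime import datetime, timezone

# Abbreviated copy of the sample record above (only a few fields kept).
SAMPLE_LINE = (
    '{"bytes": 46953, "client_ip": "66.249.67.220", "status": 200, '
    '"time": "2014-06-30T08:40:45.159Z", '
    '"uri": "/wp-content/uploads/2009/10/apple-404.png", '
    '"user_agent": "Googlebot-Image/1.0"}'
)


def parse_maxcdn_line(line):
    """Decode one JSON log line (hypothetical helper, not part of this PR)."""
    record = json.loads(line)
    # The sample uses a trailing "Z", so the timestamp is treated as UTC.
    record["time"] = datetime.strptime(
        record["time"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return record


record = parse_maxcdn_line(SAMPLE_LINE)
print(record["client_ip"], record["status"], record["time"].isoformat())
```

Converting `time` up front means later steps can sort or bucket events by timestamp without re-parsing the string.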

@mrdavidlaing
Member Author

A very basic JSON filter gives the following:

  '@type': googlebot-maxcdn
  '@message': '{"bytes":0,"client_asn":"AS16509 Amazon.com, Inc.","client_city":"-","client_continent":"EU","client_country":"IE","client_dma":"0","client_ip":"54.247.60.162","client_latitude":53,"client_longitude":-8,"client_state":"-","company_id":85,"cache_status":"MISS","hostname":"cdn.yoast.com","method":"HEAD","origin_time":0.471,"pop":"lhr","protocol":"HTTP\/1.1","query_string":"","referer":"-","scheme":"https","status":200,"time":"2014-07-01T05:10:50.388Z","uri":"\/wp-content\/uploads\/2007\/12\/blogmetrics02.png","user_agent":"Googlebot\/2.1
    (+http:\/\/www.google.com\/bot.html)","zone_id":33008}'
  '@version': '1'
  '@timestamp': 2014-07-01 06:10:50.388000000 +01:00
  bytes: 0
  client_asn: AS16509 Amazon.com, Inc.
  client_city: '-'
  client_continent: EU
  client_country: IE
  client_dma: '0'
  client_ip: 54.247.60.162
  client_latitude: 53
  client_longitude: -8
  client_state: '-'
  company_id: 85
  cache_status: MISS
  hostname: cdn.yoast.com
  method: HEAD
  origin_time: 0.471
  pop: lhr
  protocol: HTTP/1.1
  query_string: ''
  referer: '-'
  scheme: https
  status: 200
  time: '2014-07-01T05:10:50.388Z'
  uri: /wp-content/uploads/2007/12/blogmetrics02.png
  user_agent: Googlebot/2.1 (+http://www.google.com/bot.html)
  zone_id: 33008

Compare this to @type:googlebot, which has the following shape:

  '@type': googlebot
  '@message': '{ "content_type": "text/xml; charset=UTF-8", "@timestamp": "2014-06-19T21:54:20-07:00",
    "remote_addr": "66.249.69.45", "body_bytes_sent": 38704, "request_time": 1.539,
    "status": 200, "robots": "noindex,follow", "redirect_location": "-", "request_method":
    "GET", "scheme": "https", "server_name": "yoast.com", "request_uri": "/cat/wordpress/feed/",
    "document_uri": "/index.php", "http_user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1;
    +http://www.google.com/bot.html)" }'
  '@version': '1'
  '@timestamp': 2014-06-20 04:54:20.000000000 Z
  content_type:
    charset: utf-8
    type: text/xml
  remote_addr: 66.249.69.45
  body_bytes_sent: 38704
  request_time: 1.539
  status: 200
  robots: noindex,follow
  redirect_location: '-'
  request_method: GET
  scheme: https
  server_name: yoast.com
  request_uri: /cat/wordpress/feed/
  document_uri: /index.php
  http_user_agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  remote_addr_dns: crawl-66-249-69-45.googlebot.com

I think we should rename the @type:googlebot-maxcdn fields to match those of @type:googlebot.
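To make the proposal concrete, here is a minimal sketch of such a rename in Python. The mapping is an assumption inferred by lining up the two sample events above, not an agreed field list: in particular, `bytes` may not measure the same thing as `body_bytes_sent`, and `origin_time` is left untouched because it is unclear whether it corresponds to `request_time`. `FIELD_RENAMES` and `rename_fields` are hypothetical names.

```python
# Hypothetical mapping from googlebot-maxcdn field names to googlebot ones.
# Pairs were guessed by comparing the two sample events; treat as assumptions.
FIELD_RENAMES = {
    "client_ip": "remote_addr",
    "bytes": "body_bytes_sent",  # assumption: may not be an exact equivalent
    "method": "request_method",
    "hostname": "server_name",
    "uri": "request_uri",
    "user_agent": "http_user_agent",
}


def rename_fields(event, renames=FIELD_RENAMES):
    """Return a copy of event with mapped keys renamed, others unchanged."""
    return {renames.get(key, key): value for key, value in event.items()}


maxcdn_event = {
    "client_ip": "54.247.60.162",
    "method": "HEAD",
    "hostname": "cdn.yoast.com",
    "status": 200,
    "scheme": "https",
}
print(rename_fields(maxcdn_event))
```

Fields such as `status` and `scheme` already share a name in both shapes, so they pass through unchanged.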

@jdevalk - do you agree?
