Wednesday, November 18, 2015

Using BigQuery to analyze data collected by GRR

GRR is great at collecting large amounts of data, but once you get more than a handful of results you need to rely on external systems to analyze that data. To make this work at scale, GRR has output plugins that export data as the results are received from the clients. Upload is automatic, with less than five minutes of latency.

The newest and best output plugin for data analysis uses Google's BigQuery service. As of December 2015 it isn't in a server release, so you'll need to at least sync past this commit to use it.

Setup


To set it up, visit console.developers.google.com and create a Google Cloud project that will hold your BigQuery data. Then follow the instructions to create a service account and download the credential file. From that file, populate these values in your GRR config file:


BigQuery.service_account: accountname@projectname.iam.gserviceaccount.com
BigQuery.private_key: "-----BEGIN PRIVATE KEY-----......."
BigQuery.project_id: "projectname"

Note that OpenSSL is picky about newlines, so make sure you copy-paste the private key as a single line with the embedded \n newline escapes, exactly as it appears in the JSON file.
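
If you prefer not to copy these values out by hand, the sketch below prints them from the downloaded credential file. It assumes the standard service-account JSON field names (client_email, private_key, project_id); adjust it if your file differs.

#!/usr/bin/env python
"""Print GRR BigQuery config values from a service account JSON file."""
import json
import sys

with open(sys.argv[1]) as fd:
  creds = json.load(fd)

# json.dumps keeps the private key on one line with the \n escapes intact,
# which is the form the config expects.
print("BigQuery.service_account: %s" % creds["client_email"])
print("BigQuery.private_key: %s" % json.dumps(creds["private_key"]))
print("BigQuery.project_id: %s" % json.dumps(creds["project_id"]))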

Restart the server processes (you can use the grr_restart_all shell helper) and test it by running a flow (e.g. FileFinder) that will generate some results, with the "BigQueryOutputPlugin" added as an output plugin. Within five minutes you should see a "grr" dataset with a table created in the BigQuery console. If the data doesn't turn up, check the GRR worker logs in /var/log/grr/grr-worker.log.
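
You can also check for the table from a script rather than the BigQuery console. This is a minimal sketch, assuming the google-cloud-bigquery client library is installed; "projectname" is a placeholder for your project id, and "grr" is the dataset created by the output plugin.

from google.cloud import bigquery

client = bigquery.Client(project="projectname")

# List the tables the output plugin has created so far (e.g. ExportedFile).
for table in client.list_tables("grr"):
  print(table.table_id)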

Create a hunt with BigQuery output


To get hunt data into BigQuery, just choose the BigQuery output plugin in the hunt creation wizard. FileFinder or RegistryFinder are good ones to start with since their output formats are known to export cleanly.

Flows whose result types have exporters defined in export.py should export to BigQuery correctly and cover the common use cases. For everything else (result types without defined converters) we attempt to export on a best-effort basis.

Use BigQuery to analyze your results


BigQuery is extremely powerful and intuitive for anyone familiar with SQL syntax. A full query reference is here. Below are some example queries operating on FileFinder hunt results. I ran these on an ExportedFile table with 604,600 rows (243 MB table size).

Calculate some file size stats for a hunt (query time: 1.4s)

SELECT
  COUNT(st_size) AS file_count,
  SUM(st_size) AS total_size,
  MAX(st_size) AS max_size,
  MIN(st_size) AS min_size,
  AVG(st_size) AS avg_size,
  STDDEV(st_size) AS standard_deviation_size
FROM
  grr.ExportedFile
WHERE
  metadata.timestamp > PARSE_UTC_USEC("2015-11-17")
  AND metadata.source_urn = "aff4:/hunts/H:ED7458F8/Results"

[
  {
    "file_count": "598207",
    "total_size": "5218211208",
    "max_size": "106065056",
    "min_size": "0",
    "avg_size": "8723.086169168866",
    "standard_deviation_size": "176877.15047685735"
  }
]

Count results for hunts and flows (query time: 1.3s)

SELECT
  COUNT(*) AS result_count,
  metadata.source_urn AS source
FROM
  grr.ExportedFile
GROUP BY
  metadata.source_urn,
  source

[
  {
    "result_count": "3196",
    "source": "aff4:/hunts/H:F5AF9AB4/Results"
  },
  {
    "result_count": "598207",
    "source": "aff4:/hunts/H:ECDB3112/Results"
  },
  {
    "result_count": "1",
    "source": "aff4:/C.82f05be53ee950dc/analysis/FileFinder/admin-1447724230.58"
  },
  {
    "result_count": "3196",
    "source": "aff4:/hunts/H:ED7458F8/Results"
  }
]

The 100 least-common filenames found (query time: 3.0s)

SELECT
  SUBSTR(urn, 25) AS filename,
  COUNT(metadata.hostname) AS host_count
FROM
  grr.ExportedFile
GROUP BY
  filename
ORDER BY
  host_count ASC
LIMIT
  100
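
The same queries can be run from a script instead of the web UI. The sketch below is a rough example, assuming the google-cloud-bigquery client library rather than anything GRR ships; "projectname" is a placeholder for your project id, and the query is the hunt/flow result count from above.

from google.cloud import bigquery

client = bigquery.Client(project="projectname")

query = """
SELECT
  COUNT(*) AS result_count,
  metadata.source_urn AS source
FROM
  grr.ExportedFile
GROUP BY
  source
"""

# The examples in this post use BigQuery legacy SQL syntax.
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
for row in client.query(query, job_config=job_config).result():
  print("%s: %s results" % (row.source, row.result_count))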

Screenshots


The BigQuery exporter takes advantage of the protobuf definition to give you rich field name descriptions for the table.


Writing queries, re-running old queries, and tweaking as you go are all spectacularly easy with the BigQuery UI.

Comments

  1. Would it be at all possible to set this up in an offline fashion if, say, due to the sensitivity of the data being analyzed, it could not be uploaded into public tables?

  2. It's an online service that takes advantage of Google compute power to give you very fast querying, so there's not really a way to do that without moving the data to Google. Note, however, that BigQuery has configurable ACLs with quite restrictive settings by default, so it is far from "public".

    If you have your own analytical tools and resources on-prem, you could export from GRR as CSV and do the analysis yourself.
