Tuesday, September 23, 2014

GAE Golang BigQuery Client - Streaming data from GAE



Today we've decided to test streaming our live event trackers to BigQuery instead of the long datastore write-export-import routine.

We have a few extremely high-volume endpoints (Golang GAE handlers), and thus far we have come up with quite a few ways to query the data being collected - yet none was satisfactory.

Streaming live traffic to BigQuery is very appealing for us, because right now we have quite a process to go through before we can query our data: we need to log it, export it, and then load it into an analytics DB. At every step of the chain there are things that could be better - logging to the datastore is not cheap, exporting doesn't really work well (even after refactoring the code to use cursors, as we do in our open source GAE Remote utility, it fails at around 300,000 records), and loading the data into an analytics DB defeats the purpose of using a PaaS.

So we've decided to give BigQuery a shot. To our surprise, though, we couldn't find an example in the official GAE Go docs of streaming data into BigQuery from Golang. There is a client library to be found, but it's not a GAE client (it relies on OAuth flows that are irrelevant for a GAE backend).

So we've started writing a nice little Go client for BigQuery on GAE. It currently only supports connecting and inserting rows, but there's more coming up. Feel free to fork and add stuff!

go-gae-bigquery

A nice little package to abstract usage of the BigQuery service on GAE. Currently supports only inserting rows (queries coming soon, feel free to fork and add stuff!)

usage

Import the package:
import (
    "github.com/streamrail/go-gae-bigquery"
)

and go get it using the GAE SDK's goapp command:
goapp get "github.com/streamrail/go-gae-bigquery"
The package is now imported under the "gobq" namespace.
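
To get a feel for the API before the batched example below, here's a minimal handler that streams a single row per request. The row type passed to InsertRows (a slice of map[string]interface{}) and the placeholder project/dataset/table ids are assumptions here - check the package source for the exact signature.

package example

import (
    "net/http"

    "appengine"

    gobq "github.com/streamrail/go-gae-bigquery"
)

// TrackOne streams a single row per request, without batching.
func TrackOne(w http.ResponseWriter, r *http.Request) {
    c := appengine.NewContext(r)
    client, err := gobq.NewClient(&c)
    if err != nil {
        c.Errorf(err.Error())
        return
    }
    // build a row as a map of column name to value
    row := map[string]interface{}{
        "path":      r.URL.Path,
        "useragent": r.UserAgent(),
    }
    // "my-project", "my_dataset" and "events" are placeholders -
    // replace them with your own BigQuery identifiers
    if err := client.InsertRows("my-project", "my_dataset", "events", []map[string]interface{}{row}); err != nil {
        c.Errorf(err.Error())
        return
    }
    c.Infof("inserted 1 row")
}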

example

Running the example:
git clone https://github.com/StreamRail/go-gae-bigquery.git
cd go-gae-bigquery
cd example-batch

goapp get "github.com/streamrail/go-gae-bigquery"
goapp serve 
The example may be found at example-batch/example.go. The part you want to look at is the Track function:
func Track(w http.ResponseWriter, r *http.Request) {
    c := appengine.NewContext(r)
    // create an instance of the BigQuery client
    if client, err := gobq.NewClient(&c); err != nil {
        c.Errorf(err.Error())
    } else {
        // get some data to write
        rowData := GetRowData(r)
        // append the row to the in-memory buffer
        if err := buff.Append(rowData); err != nil {
            c.Errorf(err.Error())
        }
        c.Infof("buffered rows: %d\n", buff.Length())
        // if the buffer is full, flush it into BigQuery.
        // flushing resets the buffer so rows can accumulate again
        if buff.IsFull() {
            rows := buff.Flush()
            if err := client.InsertRows(*projectID, *datasetID, *tableID, rows); err != nil {
                c.Errorf(err.Error())
            } else {
                c.Infof("inserted rows: %d", len(rows))
            }
        }
    }
}
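
The snippet references a few package-level pieces that aren't shown above. Roughly, they could look like the following - note that the flag definitions and the NewBufferedWrite constructor name are assumptions based on the snippet, not necessarily what example.go actually contains:

package example

import (
    "flag"
    "net/http"
    "time"

    gobq "github.com/streamrail/go-gae-bigquery"
)

// BigQuery recommends at most 500 rows per streaming insert request.
const maxBufferedRows = 500

var (
    // referenced as *projectID etc. in Track above
    projectID = flag.String("project", "my-project", "BigQuery project id")
    datasetID = flag.String("dataset", "my_dataset", "BigQuery dataset id")
    tableID   = flag.String("table", "events", "BigQuery table id")

    // per-instance buffer shared by all requests handled by this instance
    // (constructor name assumed - check the package for the actual API)
    buff = gobq.NewBufferedWrite(maxBufferedRows)
)

// GetRowData extracts whatever fields you want to track from the request.
func GetRowData(r *http.Request) map[string]interface{} {
    return map[string]interface{}{
        "path":      r.URL.Path,
        "useragent": r.UserAgent(),
        "timestamp": time.Now().UTC().Unix(),
    }
}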

batching

To improve performance, you might want to batch your inserts. A request that only appends a row to an in-memory buffer takes about 10-60ms, while a request that performs an actual insert takes about 1.3 sec! As long as you don't mind losing a few rows here and there when an instance shuts down and its memory is discarded, you can batch your inserts by using the RAM of the currently running instance.
For this purpose the package includes a thread-safe BufferedWrite implementation, which guards a slice of rows with a mutex and can be used to flush a whole batch of rows into BigQuery in a single operation.
Be sure to set MAX_BUFFERED to a sensible number: BigQuery's streaming insert quotas impose a few limits on batching, and the recommended maximum is 500 rows per request.
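
For illustration, here's a rough sketch of how a thread-safe buffer along these lines works - a mutex guarding a slice of rows, with Append/IsFull/Flush operations (this is the general idea, not the package's actual BufferedWrite code):

package gobq

import "sync"

// bufferedWrite accumulates rows in memory and hands them off in batches.
type bufferedWrite struct {
    mu   sync.Mutex
    max  int
    rows []map[string]interface{}
}

func newBufferedWrite(max int) *bufferedWrite {
    return &bufferedWrite{max: max}
}

// Append adds a single row to the buffer.
func (b *bufferedWrite) Append(row map[string]interface{}) error {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.rows = append(b.rows, row)
    return nil
}

// Length reports how many rows are currently buffered.
func (b *bufferedWrite) Length() int {
    b.mu.Lock()
    defer b.mu.Unlock()
    return len(b.rows)
}

// IsFull reports whether the buffer has reached its maximum size.
func (b *bufferedWrite) IsFull() bool {
    return b.Length() >= b.max
}

// Flush returns the buffered rows and resets the buffer, so new rows can
// accumulate while the returned batch is inserted into BigQuery.
func (b *bufferedWrite) Flush() []map[string]interface{} {
    b.mu.Lock()
    defer b.mu.Unlock()
    rows := b.rows
    b.rows = nil
    return rows
}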
