Introducing BigQuery ML for building predictive models with SQL

Google’s beta extension performs linear regression forecasting and binary logistic classification in the BigQuery data warehouse

Introducing BigQuery ML for building predictive models with SQL
Thinkstock

One key to efficient data analysis of big data is to do the computations where the data lives. In some cases, that means running R, Python, Java, or Scala programs in a database such as SQL Server or in a big data environment such as Spark. But that takes some fairly technical programming and data science skills not often found among business analysts or SQL programmers. In addition, if you have to extract, transform, and load your datasets from your data warehouse to another data store for machine learning, you introduce delays in the process.

To that end, Google has announced a beta release of BigQuery ML, a SQL-based extension to its enterprise data warehouse service, for building and deploying machine learning models. This is important because it lowers the barriers to training certain kinds of models and deploying predictive analytic services, both reducing the time required and bringing it within reach of data analysts and statisticians.

One SQL-style statement and six functions have been added to BigQuery's SQL dialect to support machine learning: the CREATE MODEL statement and the functions ML.EVALUATE, ML.ROC_CURVEML.PREDICT, ML.TRAINING_INFO, ML.FEATURE_INFO, and ML.WEIGHTS. Google has produced animated GIFs that show how to create a model with the CREATE MODEL statement and generate predictions from the model using the ML.PREDICT function in a SELECT query.

You might wonder how all the formidable complexity of machine learning can be encapsulated in a SQL statement. The short answer is that it can't, at least not now. BigQuery ML is nota general-purpose neural network modeling tool on the order of TensorFlow.

There are only two supported types of models: linear regression for forecasting (such as "What will next month's sales be?") and binary logistic regression for classification (such as "Will the credit applicant default?"). BigQuery ML can't even handle multiclass logistic regression, although there's a hint in the data structures used that multiclass logistic classification might be supported in a future release.

The way you store data for normal SQL queries isn't appropriate for machine learning, so in preparing features (input variables used for predictions) for a CREATE MODEL statement, BigQuery ML automatically one-hot encodes categorical variables (strings, dates, times, and Booleans) and standardizes numerical variables (numeric, float, and integer). Numerical variables are also standardized for prediction. BigQuery ML can draw data from multiple BigQuery datasets for both training and prediction.

The syntax for the CREATE MODEL statement is:

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} 
model_name 
[OPTIONS(model_option_list)] 
[AS query_statement]

The model option list specifies the type of model (linear or logistic regression) as well as the training parameters, such as the learning rate strategy, maximum number of iterations, early stop criteria (minimum relative progress), and the data-split parameters (for separating the training and evaluation data sets). BigQuery ML uses gradient descent to optimize the model weights.

Google offers the following simple example of a CREATE MODEL statement:

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='linear_reg',
    ls_init_learn_rate=.15,
    l1_reg=1,
    max_iterations=5 ) AS
SELECT
  column1,
  column2,
  column3,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10

Once you have created a model, you'll want to evaluate it and use it for prediction. Use the ML.EVALUATE function to evaluate model metrics for both linear regression and logistic regression models; it generates a single row of output holding the metrics. You can also use the ML.ROC_CURVEfunction to evaluate logistic regression models; it generates multiple rows with metrics for different threshold values for the model.

Use the ML.PREDICT function to predict outcomes of a model applied to a data set. For example, the following query compares the predictions of mymodel1 and mymodel2 for table mytable. You could of course add a WHERE clause to the SELECTstatement to predict values for a subset of the table.

SELECT
  label,
  predicted_label1,
  predicted_label AS predicted_label2
FROM
  ML.PREDICT(MODEL `mydataset.mymodel2`,
    (
    SELECT
      * REPLACE (predicted_label AS predicted_label1)
    FROM
      ML.PREDICT(MODEL `mydataset.mymodel1`,
        TABLE `mydataset.mytable`)))

Copyright © 2018 IDG Communications, Inc.