BigQuery: A Walkthrough of ML with Conventional SQL

Debanjan Last Updated : 05 Aug, 2022

This article was published as a part of the Data Science Blogathon.

Introduction

Most of us are familiar with SQL, and many of us have hands-on experience with it. Machine learning, meanwhile, keeps growing in popularity. BigQuery ML, shortened to BQML, is a toolset that lets us build machine learning models by executing standard SQL queries. It is a pure SQL solution that leverages BigQuery to query massive datasets and train machine learning models on them. In this article, we’ll try out BQML, learn about its principles and how it works, and then follow an example implementation.

We will proceed step by step, starting with the introduction of BigQuery, to better grasp the entire process and what happens behind the scenes.

Prerequisites:

Intermediate knowledge of and experience with standard SQL
Basic understanding of machine learning concepts

What is BigQuery?

BigQuery is a highly scalable, serverless data warehouse that can process queries on petabytes of data in a few minutes. It is a cloud-based PaaS (platform as a service) data warehouse offered by Google. In addition to machine learning, which we will emphasize today, BigQuery offers built-in features such as geospatial analysis, real-time data collection, business intelligence, and integration with a range of Google Cloud Platform (GCP) services.

Any business works with data, and if the data is modest enough, it can probably fit into spreadsheets. However, once the amount of data grows to gigabytes, terabytes, or even petabytes, a more efficient solution, such as a data warehouse, is required. Traditional database management systems struggle with such massive amounts of data. This is where BigQuery comes in. It is built to manage huge amounts of data, such as log data from thousands of retail systems or IoT data from millions of car sensors worldwide. It can evaluate on the order of 100 billion regular expressions at about 1 μs each. We can use BigQuery via clients such as the BigQuery web UI, the REST API, or the bq command-line tool.

How does query processing work in BigQuery?

BigQuery is built on top of Dremel, a technology that Google has been developing internally since 2006. Dremel is the execution engine for BigQuery. The BQ architecture is illustrated in the source below.

Source: https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood

First, the BigQuery client interacts with the Dremel engine via a client interface. Dremel converts the query into an execution tree, which has two parts: branches and leaves. The branches are called mixers, and they perform aggregation. The leaves are called slots; they perform the necessary computation and read data from the BQ filesystem over the Jupiter network. Google’s Jupiter network can deliver 1 petabit/sec of total bisection bandwidth. Both mixers and slots are run by Borg, a large-scale cluster management system that allocates server resources to Dremel jobs.

Unlike traditional relational databases, BigQuery uses columnar storage, where data is co-located by column rather than by row. Internally, BQ stores data in a proprietary file format called Capacitor, which uses access patterns to encode data and reorder rows.
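The split between leaves and mixers can be pictured with a two-stage aggregation written out by hand. This is only a loose analogy in standard SQL, not a view of Dremel's real query plan; it assumes the public ga4_obfuscated_sample_ecommerce dataset used later in this article and an arbitrarily chosen one-week range of daily tables.

```sql
-- Loose analogy only: this does not show Dremel's actual execution plan.
-- Stage 1 plays the role of the leaves (slots): each daily table is
-- scanned and reduced to partial counts.
WITH leaf_partials AS (
  SELECT
    _TABLE_SUFFIX AS shard,
    event_name,
    COUNT(*) AS partial_count
  FROM
    `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20201101' AND '20201107'  -- assumed sample week
  GROUP BY
    shard, event_name
)
-- Stage 2 plays the role of the mixers: the partial results are merged
-- into the final aggregate.
SELECT
  event_name,
  SUM(partial_count) AS event_count
FROM
  leaf_partials
GROUP BY
  event_name
ORDER BY
  event_count DESC;
```

In practice we would simply write a single GROUP BY and let the engine decide how to distribute the work.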

Databases such as MySQL and PostgreSQL use record-oriented (row-oriented) storage. This is effective for transactional modifications to a single row or a group of rows. For an aggregation, however, such a database must read every row in full, even when the query needs only one column. Because BigQuery is focused on analytical use cases, its columnar storage allows it to read only the columns that the aggregation references.
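As a small illustration of why this matters, the query below aggregates over a single column. It is a sketch that assumes the public ga4_obfuscated_sample_ecommerce dataset introduced in the hands-on section and one of its daily tables; because only event_name is referenced, the bytes processed should be far lower than for a SELECT * over the same table.

```sql
-- Only event_name is referenced, so columnar storage lets BigQuery
-- scan that one column instead of every field of every row.
SELECT
  event_name,
  COUNT(*) AS event_count
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201101`  -- assumed sample day
GROUP BY
  event_name
ORDER BY
  event_count DESC;
```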

All the files in BigQuery are stored in Colossus, Google’s distributed file system. Each Google data center has its own Colossus cluster. Colossus ensures durability using erasure encoding, which breaks data into fragments and saves redundant pieces across a set of different disks.

Now that we have a general understanding of BigQuery and how it processes such massive quantities of data so quickly and effectively, we can move on to the machine learning portion.

Machine Learning on BigQuery

We are all aware that machine learning is an area of study in which we feed data to computers and let them learn and improve from that data without being explicitly programmed. A typical machine learning project begins with identifying the business problem, and then proceeds through collecting an appropriate amount of data, preprocessing the data and splitting it into training and test sets, training and evaluating the model, and finally deploying it to the cloud and making predictions.

BigQuery ML, on the other hand, greatly simplifies this process by automatically handling preprocessing and data splitting. It lets us focus on formatting the data correctly and choosing which model to use. BQML allows us to:

Train & Deploy ML models without moving data from BigQuery
Iterate on models in SQL in BigQuery
Make predictions without worrying about model deployment

Pricing & Supported Models of BQML

BigQuery ML currently supports over ten model types, ranging from linear regression and K-Means clustering to time series models and deep neural networks. A full list of supported models can be found in the BQML documentation.

Models in BQML fall into two categories: built-in models, which are trained within BigQuery, and external models, such as imported models, DNN, or AutoML models. BQML pricing is on-demand and depends on the data location, the type of operation (model creation, evaluation, or prediction), and the model used.

Hands-on Implementation of BQML

We will create a logistic regression model to predict the probability of a buyer adding a product to the cart, using BigQuery in a sandbox environment. To set it up, go to console.cloud.google.com/bigquery and click the create project button.

BigQuery provides over 100 publicly available datasets to analyze. These datasets can be found in the Marketplace section of the Google Cloud navigation panel. All the public datasets are available under the project bigquery-public-data, and we will pin this project within our UI by clicking on the + ADD DATA button.

For our prediction, we will use the ga4_obfuscated_sample_ecommerce dataset. Its tables are named events_YYYYMMDD, i.e., each table holds the data for one day. We can view the schema of any table in the UI and then write a query against it.

Changing the table name to events_* selects all the tables in the dataset. Before running the query, we can also check in the upper right corner of the editor how much data it will process.
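A wildcard query of this kind might look like the sketch below. The one-week date range is an assumption chosen for illustration; _TABLE_SUFFIX is the pseudo-column BigQuery exposes for the part of the table name matched by the asterisk, and filtering on it limits how many daily tables are scanned.

```sql
-- Query every daily table that matches events_*, restricted to one
-- week through the _TABLE_SUFFIX pseudo-column.
SELECT
  event_date,
  event_name,
  COUNT(*) AS event_count
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20201101' AND '20201107'  -- assumed sample week
GROUP BY
  event_date, event_name
ORDER BY
  event_date, event_count DESC;
```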

Models must be stored in a dataset, so before beginning the ML training we will create a new dataset using the BigQuery UI, giving it a name and choosing a location.
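The same step can also be done in SQL instead of the UI. In this sketch the dataset name bqml_demo is hypothetical, and the US location is an assumption made so that the new dataset sits in the same location as the public data it will read from.

```sql
-- Create the dataset that will hold the training table and the model.
-- bqml_demo is a hypothetical name; pick a location that matches the
-- location of the source data.
CREATE SCHEMA IF NOT EXISTS bqml_demo
OPTIONS (
  location = 'US'
);
```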

Now, we will create our training set with the following query (link to query):

The resulting table will be used as our training dataset, containing data from 2020. Its schema can be viewed in the same manner as before. We will skip the data exploration and jump straight to the model creation part.
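Since the original query is only linked, here is a minimal sketch of what a training set for this task could look like. It is not the author's query: the dataset name bqml_demo, the table name training_data, and the choice of features are all assumptions made for illustration. It builds one row per user from the 2020 tables and sets label to 1 when the user triggered at least one add_to_cart event.

```sql
-- Hypothetical training table: one row per user, built from 2020 data.
-- Feature selection here is illustrative, not the article's original.
CREATE OR REPLACE TABLE bqml_demo.training_data AS
SELECT
  user_pseudo_id,
  ANY_VALUE(device.category) AS device_category,
  ANY_VALUE(device.operating_system) AS operating_system,
  ANY_VALUE(geo.country) AS country,
  ANY_VALUE(traffic_source.medium) AS traffic_medium,
  COUNTIF(event_name = 'page_view') AS page_views,
  COUNTIF(event_name = 'view_item') AS item_views,
  -- Target column: 1 if the user ever added an item to the cart.
  MAX(IF(event_name = 'add_to_cart', 1, 0)) AS label
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20201101' AND '20201231'
GROUP BY
  user_pseudo_id;
```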

A short introduction to the BQML conventions for creating a model:

The CREATE MODEL statement is similar to CREATE TABLE in standard SQL. Using CREATE OR REPLACE MODEL is generally preferable, since it overwrites any existing model with the same name.
The TRANSFORM clause of CREATE MODEL defines input preprocessing, and the ML.TRANSFORM function applies that preprocessing to a given table or query.
BQML looks for a column named label by default. If that column is not present in your query, pass the name of the target column in input_label_cols.

Sample statement/convention for creating a model
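The conventions above can be sketched in one statement. Everything named here is hypothetical: the dataset bqml_demo, the model convention_example, and a training table bqml_demo.training_data with columns page_views, device_category, and label. The target is deliberately renamed to added_to_cart to show input_label_cols in use.

```sql
-- Hypothetical example of the CREATE MODEL conventions.
CREATE OR REPLACE MODEL bqml_demo.convention_example
TRANSFORM (
  -- Preprocessing declared here is stored with the model and reapplied
  -- automatically at evaluation and prediction time.
  ML.STANDARD_SCALER(page_views) OVER () AS page_views_scaled,
  device_category,
  -- The target column must be passed through the TRANSFORM clause.
  added_to_cart
)
OPTIONS (
  model_type = 'logistic_reg',
  -- Needed because the target column is not named "label".
  input_label_cols = ['added_to_cart']
) AS
SELECT
  page_views,
  device_category,
  label AS added_to_cart
FROM
  bqml_demo.training_data;
```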

Now, we will create our model with a CREATE OR REPLACE MODEL query.

(log_model_predict.sql)
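A minimal sketch of such a statement follows. It assumes a hypothetical training table bqml_demo.training_data with a 0/1 label column and a user_pseudo_id identifier that should not be used as a feature; the model name and the option values are illustrative choices, not necessarily the ones used to obtain the results reported in this article.

```sql
-- Train a logistic regression model on the hypothetical training table.
CREATE OR REPLACE MODEL bqml_demo.add_to_cart_model
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['label'],
  -- Reweight classes, since add-to-cart users are likely a minority.
  auto_class_weights = TRUE,
  -- Let BQML hold out part of the data for evaluation.
  data_split_method = 'AUTO_SPLIT'
) AS
SELECT
  * EXCEPT (user_pseudo_id)  -- drop the identifier, keep features and label
FROM
  bqml_demo.training_data;
```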

We can also evaluate our predictions using a query.

So, we’ve just created a logistic regression model that predicts the probability of an add-to-cart event with an accuracy of 93%. This is only a base model; BQML offers many advanced techniques for tuning it, which we leave as future scope.
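One way to do this is with the ML.EVALUATE function, sketched here against a hypothetical model named bqml_demo.add_to_cart_model. When no input data is passed, the function evaluates on the data that BQML reserved during training; the figures it returns will depend on the features and options used.

```sql
-- Evaluate the model on the data held out during training.
-- For a logistic regression model the result includes precision,
-- recall, accuracy, f1_score, log_loss, and roc_auc.
SELECT
  *
FROM
  ML.EVALUATE(MODEL bqml_demo.add_to_cart_model);
```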

When to use BQML?

We’ve just seen how powerful BigQuery’s toolkit is. Still, it has several limitations and shortcomings that restrict BQ for general use in ML. BQML is a good choice in any of the cases below:

When the dataset is too big to read into local memory, or when other constraints prevent copying the dataset to a local machine.
When we need to serve the model directly after training. Because the model is in the same location as our data, we can make predictions directly from the database, eliminating the need to write serving code, unit test it, and explicitly deploy it into production.
When team members work in different languages, such as Python and R, and SQL is the common ground for all.
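Serving predictions next to the data, as described in the list above, can be sketched with ML.PREDICT. The model name bqml_demo.add_to_cart_model and the feature columns are hypothetical and must match whatever the model was trained on; here, users active in January 2021 are scored. For a label column named label, ML.PREDICT returns predicted_label and predicted_label_probs alongside the input columns.

```sql
-- Score January 2021 users with the hypothetical model.
SELECT
  user_pseudo_id,
  predicted_label,
  -- Pull the probability assigned to the positive class (label = 1).
  (
    SELECT p.prob
    FROM UNNEST(predicted_label_probs) AS p
    WHERE p.label = 1
  ) AS add_to_cart_probability
FROM
  ML.PREDICT(
    MODEL bqml_demo.add_to_cart_model,
    (
      -- Same feature columns as the training table, without the label.
      SELECT
        user_pseudo_id,
        ANY_VALUE(device.category) AS device_category,
        ANY_VALUE(device.operating_system) AS operating_system,
        ANY_VALUE(geo.country) AS country,
        ANY_VALUE(traffic_source.medium) AS traffic_medium,
        COUNTIF(event_name = 'page_view') AS page_views,
        COUNTIF(event_name = 'view_item') AS item_views
      FROM
        `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
      WHERE
        _TABLE_SUFFIX BETWEEN '20210101' AND '20210131'
      GROUP BY
        user_pseudo_id
    )
  )
ORDER BY
  add_to_cart_probability DESC
LIMIT 10;
```

No separate deployment step is involved: the model is referenced by name in the query, just like a table.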

Conclusion

We’ve given a brief introduction to BigQuery ML. In this article, we’ve covered:

What BigQuery is, and how BQ manages to query terabytes of data within seconds.
Pricing and currently supported models in BQML.
How ML in BigQuery differs from traditional ML and when to use BQML.
Steps to build, from scratch, a logistic regression model in BQML that can predict the probability of adding an item to the cart.

I hope this article was as straightforward and interactive as possible and that it inspired you to explore BigQuery for ML. If you have any suggestions or corrections, please let me know.

I’d love to connect with you via LinkedIn.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Debanjan
