Episode 36 — BigQuery ML: Models with SQL

Welcome to Episode 36, BigQuery M L: Models with S Q L, where we explore how machine learning can live directly where your data already resides. BigQuery M L brings modeling into the data warehouse itself, eliminating the traditional friction between data engineering and data science. Instead of exporting datasets to separate environments, teams can train, evaluate, and predict using simple S Q L commands within the same system that stores their data. This integration means fewer moving parts, faster iteration, and tighter governance. It’s machine learning for analysts, engineers, and business users alike—accessible, auditable, and scalable. BigQuery M L turns the warehouse from a passive storehouse into an active intelligence engine, bringing predictive power to every dataset and every decision.

Data preparation remains the foundation of reliable results. Even in S Q L, quality, partitioning, and feature selection determine model strength. Partitioning large tables by date or region ensures efficient training and avoids bias toward recent or dominant data. Feature engineering transforms raw columns into meaningful signals, like ratios or lagged values. Leakage—when information from the target variable leaks into predictors—can still occur if joins or derived features use future data. For example, calculating revenue growth using the current quarter when predicting the same quarter’s sales inflates accuracy unrealistically. By validating partitions and time logic, teams keep training realistic and predictive. BigQuery M L simplifies data access, but it still rewards rigor in preparation and validation.

The CREATE MODEL command defines everything about a model’s purpose, structure, and behavior. Users specify options such as model type, target column, input features, and hyperparameters—all in one statement. For example, a regression model predicting “sales” might use CREATE MODEL my_dataset.sales_model OPTIONS (model_type=‘linear_reg’) AS SELECT * FROM training_data. Additional options control regularization, early stopping, and learning rates. BigQuery M L handles scaling automatically, allocating compute resources based on dataset size. This declarative approach turns machine learning into an extension of data management: you describe what you want, and the platform does the rest. CREATE MODEL democratizes model creation, transforming S Q L scripts into predictive assets ready for evaluation and deployment.

Evaluating model performance with M L dot EVALUATE provides clear metrics for decision-making. This function outputs standard measures such as accuracy, precision, recall, and area under the curve for classification, or mean absolute error and R-squared for regression. These metrics quantify how well the model generalizes to unseen data. For example, an e-commerce analyst predicting customer lifetime value can compare model versions by their mean squared error directly in a query result table. The evaluation step encourages evidence-based iteration—keeping the focus on measurable improvement rather than assumptions. By storing evaluation outputs within BigQuery, teams can track performance history alongside the data that trained the model, reinforcing governance and reproducibility.

Explainability in BigQuery M L uses feature attribution to show which inputs influence predictions most strongly. Functions like M L dot EXPLAINWEIGHTS return feature importance values for linear and boosted tree models, while M L dot GLOBAL_EXPLAIN summarizes global trends across the dataset. These insights reveal how models think, supporting auditability and trust. For example, in a churn model, seeing that “recent support interactions” and “contract duration” carry the highest weights helps managers validate logic and refine strategy. Feature attribution also aids fairness analysis by detecting unintended dependencies. Transparency turns black boxes into glass boxes—models that can be examined, debated, and improved collaboratively. BigQuery M L’s explainability tools make accountability part of the workflow, not an afterthought.

Time-series forecasting in BigQuery M L leverages ARIMA plus models, an enhanced version of the classical autoregressive integrated moving average method. These models handle trends, seasonal patterns, and outliers automatically. Users can specify forecast horizons and confidence intervals directly in S Q L. For example, a retailer might forecast monthly sales per region for the next year with a single CREATE MODEL statement. ARIMA plus adapts to changing trends without manual parameter tuning. The simplicity hides sophisticated internal logic that balances accuracy and computational efficiency. This capability means that forecasting—once the domain of specialized statisticians—can now be part of everyday analytics, empowering business teams to project outcomes quickly and adjust operations proactively.

Recommendation systems in BigQuery M L use matrix factorization to predict user preferences. This method decomposes large interaction tables, such as user-product ratings, into latent factors that reveal hidden affinities. For instance, an online learning platform can recommend courses to students based on patterns shared by similar learners. Matrix factorization models are trained with a few lines of S Q L and scale automatically to millions of interactions. Evaluation metrics like mean average precision guide tuning. The combination of ease and power makes recommendation building accessible to analysts who once needed specialized environments. By hosting data, training, and prediction in one place, BigQuery M L removes the friction between insight and implementation in personalization projects.

Governance in BigQuery M L ensures that models remain secure, reproducible, and compliant. Access controls managed through Identity and Access Management define who can create, view, or execute models. Lineage tracking links models to datasets and queries, maintaining transparency over inputs and transformations. Reproducibility is guaranteed by storing model definitions, training parameters, and evaluation results directly in BigQuery tables. For example, auditors can trace a prediction back to the model version and data snapshot that generated it. Integration with audit logs and Cloud Monitoring adds operational assurance. Governance in BigQuery M L is not an extra step—it is built into the platform’s architecture, ensuring that speed never compromises accountability.

Common business use cases illustrate the platform’s practical range. Marketing teams use logistic regression to predict campaign responders, finance groups use linear regression for credit risk, and supply planners rely on time-series forecasting for inventory. Retailers generate recommendations, media platforms classify content, and support departments automate sentiment analysis. What unites these examples is accessibility—no external pipelines or custom code. The models live with the data, run at scale, and align with existing analytics workflows. The simplicity lowers barriers, allowing small teams to achieve outcomes once reserved for large data science divisions. BigQuery M L brings prediction to the front line, where questions and decisions intersect.

Episode 36 — BigQuery ML: Models with SQL
Broadcast by