This guide walks you through setting up a governed data lakehouse with Apache Iceberg tables in Snowflake, implementing data governance policies, establishing data quality monitoring, and creating an AI agent for natural language data exploration.
- Setup: External Iceberg Tables
- Governance: Access Control & Data Protection
- Data Quality: Monitoring with Data Metric Functions
- AI Agent: Natural Language Data Access
- Snowflake account with
ACCOUNTADMINrole access - Enterprise Edition or higher (required for Row Access Policies, Dynamic Data Masking, and Data Metric Functions)
- AWS account (if setting up your own external volume)
Apache Iceberg tables in Snowflake combine the performance and query semantics of native Snowflake tables with external cloud storage that you manage. They use the Apache Iceberg open table format specification, which provides:
- ACID transactions - Atomicity, consistency, isolation, durability
- Schema evolution - Modify schemas without rewriting data
- Hidden partitioning - Automatic partition management
- Table snapshots - Point-in-time data access
Before creating an external volume, you need to configure AWS IAM to grant Snowflake access to your S3 bucket.
Create a policy that grants Snowflake the necessary S3 permissions. The policy in iam_policy.json provides:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowIcebergObjectAccess",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:GetObjectVersion",
"s3:DeleteObject",
"s3:DeleteObjectVersion"
],
"Resource": "arn:aws:s3:::<bucket-name-and-prefix>/*"
},
{
"Sid": "AllowIcebergBucketAccess",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": "arn:aws:s3:::<bucket-name>",
"Condition": {
"StringLike": {
"s3:prefix": [
"<prefix>/*"
]
}
}
}
]
}Key Permissions Explained:
| Permission | Purpose |
|---|---|
s3:PutObject |
Write data files to S3 |
s3:GetObject |
Read data and metadata files |
s3:GetObjectVersion |
Access versioned objects (required for Time Travel) |
s3:DeleteObject |
Remove data files during compaction |
s3:ListBucket |
List bucket contents |
s3:GetBucketLocation |
Verify bucket region |
- Create an AWS IAM role with a trust policy for Snowflake
- Attach the IAM policy created above
- Record the Role ARN for use in the external volume configuration
After creating the external volume, retrieve the Snowflake IAM user ARN using DESC EXTERNAL VOLUME and update the IAM role's trust policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "<snowflake_user_arn>"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "<your_external_id>"
}
}
}
]
}USE ROLE ACCOUNTADMIN;
CREATE DATABASE LAKEHOUSE_VHOL;
CREATE WAREHOUSE VHOL_XS WAREHOUSE_SIZE = XSMALL;
USE LAKEHOUSE_VHOL.PUBLIC;
USE WAREHOUSE VHOL_XS;An external volume is an account-level Snowflake object that connects Snowflake to your external cloud storage for Iceberg tables. It stores the IAM entity that Snowflake uses to securely access your storage.
CREATE OR REPLACE EXTERNAL VOLUME amazon_reviews_iceberg_volume
STORAGE_LOCATIONS = (
(
NAME = 'amazon-reviews-s3'
STORAGE_PROVIDER = 'S3'
STORAGE_BASE_URL = 's3://amazon-product-reviews-dataset/curated/product_reviews/product_reviews_db/product_reviews_100k/'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::849350360261:role/snowflake-iceberg-role'
STORAGE_AWS_EXTERNAL_ID = 'snowflake_iceberg_ext_id'
)
)
ALLOW_WRITES = TRUE;Parameters:
| Parameter | Description |
|---|---|
STORAGE_PROVIDER |
Cloud provider (S3, GCS, AZURE) |
STORAGE_BASE_URL |
Base URL for your storage location |
STORAGE_AWS_ROLE_ARN |
ARN of the IAM role with S3 access |
STORAGE_AWS_EXTERNAL_ID |
External ID for cross-account access security |
ALLOW_WRITES |
Enable write operations (required for Snowflake-managed tables) |
Verify the external volume configuration:
DESC EXTERNAL VOLUME amazon_reviews_iceberg_volume;
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('amazon_reviews_iceberg_volume');A catalog integration tells Snowflake how to interact with Iceberg metadata. For externally managed tables, use OBJECT_STORE:
CREATE OR REPLACE CATALOG INTEGRATION amazon_reviews_catalog_int
CATALOG_SOURCE = OBJECT_STORE
TABLE_FORMAT = ICEBERG
ENABLED = TRUE;Create an Iceberg table pointing to existing metadata:
CREATE OR REPLACE ICEBERG TABLE product_reviews_iceberg
EXTERNAL_VOLUME = 'amazon_reviews_iceberg_volume'
CATALOG = 'amazon_reviews_catalog_int'
METADATA_FILE_PATH = 'metadata/00000-00a5b7be-48a2-4966-918f-99af062f8e85.metadata.json';SELECT * FROM product_reviews_iceberg LIMIT 10;
SELECT COUNT(*) AS total_reviews,
ROUND(AVG(OVERALL), 2) AS avg_rating,
COUNT(DISTINCT ASIN) AS unique_products,
COUNT(DISTINCT REVIEWERID) AS unique_reviewers
FROM product_reviews_iceberg;Snowflake provides comprehensive governance features for Iceberg tables, enabling you to:
- Control row-level access based on user roles
- Mask sensitive data dynamically
- Classify and tag data for compliance
USE ROLE ACCOUNTADMIN;
CREATE ROLE IF NOT EXISTS DATA_ADMIN;
CREATE ROLE IF NOT EXISTS ANALYST;
SET my_user = CURRENT_USER();
GRANT ROLE DATA_ADMIN TO USER IDENTIFIER($my_user);
GRANT ROLE ANALYST TO USER IDENTIFIER($my_user);GRANT USAGE ON DATABASE LAKEHOUSE_VHOL TO ROLE DATA_ADMIN;
GRANT USAGE ON DATABASE LAKEHOUSE_VHOL TO ROLE ANALYST;
GRANT USAGE ON SCHEMA LAKEHOUSE_VHOL.PUBLIC TO ROLE DATA_ADMIN;
GRANT USAGE ON SCHEMA LAKEHOUSE_VHOL.PUBLIC TO ROLE ANALYST;
GRANT SELECT ON ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg TO ROLE DATA_ADMIN;
GRANT SELECT ON ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg TO ROLE ANALYST;
GRANT USAGE ON WAREHOUSE VHOL_XS TO ROLE DATA_ADMIN;
GRANT USAGE ON WAREHOUSE VHOL_XS TO ROLE ANALYST;Row Access Policies enable row-level security by determining which rows are visible based on the user's role and context. The policy is evaluated at query runtime, creating a dynamic secure view.
This policy restricts the ANALYST role to only see reviews from 2010 onwards, while DATA_ADMIN and ACCOUNTADMIN can see all data:
CREATE OR REPLACE ROW ACCESS POLICY LAKEHOUSE_VHOL.PUBLIC.time_based_review_policy
AS (UNIXREVIEWTIME NUMBER) RETURNS BOOLEAN ->
CASE
WHEN CURRENT_ROLE() IN ('ACCOUNTADMIN', 'DATA_ADMIN') THEN TRUE
WHEN CURRENT_ROLE() = 'ANALYST'
AND TO_TIMESTAMP(UNIXREVIEWTIME) >= '2010-01-01'::TIMESTAMP THEN TRUE
ELSE FALSE
END;ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg
ADD ROW ACCESS POLICY LAKEHOUSE_VHOL.PUBLIC.time_based_review_policy ON (UNIXREVIEWTIME);-- DATA_ADMIN: should see all 100,000 rows (1997-2014)
USE ROLE DATA_ADMIN;
SELECT 'DATA_ADMIN' AS role, COUNT(*) AS visible_rows,
MIN(TO_TIMESTAMP(UNIXREVIEWTIME))::DATE AS earliest,
MAX(TO_TIMESTAMP(UNIXREVIEWTIME))::DATE AS latest
FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg;
-- ANALYST: should see fewer rows (2010-2014 only)
USE ROLE ANALYST;
SELECT 'ANALYST' AS role, COUNT(*) AS visible_rows,
MIN(TO_TIMESTAMP(UNIXREVIEWTIME))::DATE AS earliest,
MAX(TO_TIMESTAMP(UNIXREVIEWTIME))::DATE AS latest
FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg;Dynamic Data Masking is a column-level security feature that masks data at query time based on the user's role. The original data remains unchanged; masking is applied dynamically.
USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE MASKING POLICY LAKEHOUSE_VHOL.PUBLIC.mask_reviewer_name
AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('ACCOUNTADMIN', 'DATA_ADMIN') THEN val
ELSE '***MASKED***'
END;ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg
MODIFY COLUMN REVIEWERNAME SET MASKING POLICY LAKEHOUSE_VHOL.PUBLIC.mask_reviewer_name;-- DATA_ADMIN sees real names
USE ROLE DATA_ADMIN;
SELECT REVIEWERID, REVIEWERNAME, SUMMARY
FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg LIMIT 5;
-- ANALYST sees masked names
USE ROLE ANALYST;
SELECT REVIEWERID, REVIEWERNAME, SUMMARY
FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg LIMIT 5;Tags enable data classification and can be used to organize data governance across your organization. Tags can be applied to tables, columns, and other database objects.
USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE TAG LAKEHOUSE_VHOL.PUBLIC.pii_type ALLOWED_VALUES 'name', 'email', 'id', 'none';
CREATE OR REPLACE TAG LAKEHOUSE_VHOL.PUBLIC.data_sensitivity ALLOWED_VALUES 'high', 'medium', 'low';ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg
MODIFY COLUMN REVIEWERNAME SET TAG LAKEHOUSE_VHOL.PUBLIC.pii_type = 'name';
ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg
MODIFY COLUMN REVIEWERID SET TAG LAKEHOUSE_VHOL.PUBLIC.pii_type = 'id';
ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg
MODIFY COLUMN REVIEWTEXT SET TAG LAKEHOUSE_VHOL.PUBLIC.data_sensitivity = 'medium';ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg
SET TAG LAKEHOUSE_VHOL.PUBLIC.data_sensitivity = 'medium';SELECT * FROM TABLE(
INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS(
'LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg',
'TABLE'
)
);Data Metric Functions (DMFs) measure the state and integrity of your data. Snowflake provides built-in system DMFs and supports custom DMFs for specialized quality measurements. DMFs can be scheduled to run automatically and results are stored in a dedicated event table.
- Automated measurement - Schedule quality checks without manual intervention
- Compliance facilitation - Document data quality for regulatory requirements
- SLA monitoring - Track freshness and accuracy metrics
- Consistent evaluation - Apply the same metrics across tables
Checks for NULL or empty review text:
USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE DATA METRIC FUNCTION LAKEHOUSE_VHOL.PUBLIC.empty_review_count(
ARG_T TABLE(ARG_C STRING)
)
RETURNS NUMBER AS
'SELECT COUNT(*) FROM ARG_T WHERE ARG_C IS NULL OR TRIM(ARG_C) = \'\'';Validates ratings are within the expected 1-5 range:
CREATE OR REPLACE DATA METRIC FUNCTION LAKEHOUSE_VHOL.PUBLIC.invalid_rating_count(
ARG_T TABLE(ARG_C FLOAT)
)
RETURNS NUMBER AS
'SELECT COUNT(*) FROM ARG_T WHERE ARG_C < 1 OR ARG_C > 5 OR ARG_C IS NULL';ALTER ICEBERG TABLE LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg
SET DATA_METRIC_SCHEDULE = '5 MINUTE';-- Check for empty reviews
SELECT LAKEHOUSE_VHOL.PUBLIC.empty_review_count(
SELECT REVIEWTEXT FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg
) AS empty_review_count;
-- Check for invalid ratings
SELECT LAKEHOUSE_VHOL.PUBLIC.invalid_rating_count(
SELECT OVERALL FROM LAKEHOUSE_VHOL.PUBLIC.product_reviews_iceberg
) AS invalid_rating_count;| Feature | Description |
|---|---|
| Serverless compute | DMFs use Snowflake-managed compute resources |
| Event logging | Results are logged to a dedicated event table |
| Billing | Charged under "Data Quality Monitoring" category |
| Scheduling | All DMFs on a table follow the same schedule |
Snowflake Cortex Agents are AI-powered assistants that orchestrate across multiple tools to answer questions and perform tasks. In this lab, we create an agent that uses Cortex Analyst for text-to-SQL capabilities, allowing users to query product reviews using natural language.
The semantic model (product_reviews_semantic_model.yaml) defines the business meaning of your data, enabling the AI to generate accurate SQL queries.
Dimensions - Categorical attributes for grouping and filtering:
dimensions:
- name: asin
synonyms:
- product_id
- amazon_product_id
description: Amazon Standard Identification Number - unique product identifier
expr: ASIN
data_type: VARCHARTime Dimensions - Date/time fields for temporal analysis:
time_dimensions:
- name: review_timestamp
synonyms:
- review_date
- timestamp
description: Unix timestamp when the review was posted
expr: TO_TIMESTAMP(UNIXREVIEWTIME)
data_type: TIMESTAMPMeasures - Quantitative metrics for aggregation:
measures:
- name: overall_rating
synonyms:
- rating
- stars
- score
description: Overall rating given by the reviewer (typically 1-5 scale)
expr: OVERALL
data_type: FLOAT
default_aggregation: avgVerified Queries - Pre-validated query patterns:
verified_queries:
- name: average_rating
question: What is the average overall rating?
sql: |
SELECT AVG(OVERALL) as average_rating
FROM LAKEHOUSE_VHOL.PUBLIC.PRODUCT_REVIEWS_ICEBERGUSE LAKEHOUSE_VHOL.PUBLIC;
USE WAREHOUSE VHOL_XS;
USE ROLE ACCOUNTADMIN;
CREATE STAGE IF NOT EXISTS LAKEHOUSE_VHOL.PUBLIC.SEMANTIC_MODELS;
-- Manually upload product_reviews_semantic_model.yaml to the stageFor agents to be discoverable in Snowflake Intelligence, they must be created in the SNOWFLAKE_INTELLIGENCE.AGENTS schema:
CREATE DATABASE IF NOT EXISTS SNOWFLAKE_INTELLIGENCE;
GRANT USAGE ON DATABASE SNOWFLAKE_INTELLIGENCE TO ROLE PUBLIC;
CREATE SCHEMA IF NOT EXISTS SNOWFLAKE_INTELLIGENCE.AGENTS;
GRANT USAGE ON SCHEMA SNOWFLAKE_INTELLIGENCE.AGENTS TO ROLE PUBLIC;
GRANT CREATE AGENT ON SCHEMA SNOWFLAKE_INTELLIGENCE.AGENTS TO ROLE ACCOUNTADMIN;CREATE OR REPLACE AGENT SNOWFLAKE_INTELLIGENCE.AGENTS.AMAZON_PRODUCT_REVIEWS_AGENT
COMMENT = 'Agent for querying Amazon product reviews data'
PROFILE = '{"display_name": "Amazon Product Reviews Assistant"}'
FROM SPECIFICATION $$
{
"models": {
"orchestration": "claude-4-sonnet"
},
"instructions": {
"orchestration": "Use the product reviews tool to answer questions about reviews, ratings, and customer feedback.",
"response": "Be concise and data-driven. When showing ratings, include counts for context."
},
"tools": [
{
"tool_spec": {
"type": "cortex_analyst_text_to_sql",
"name": "product_reviews",
"description": "Query product reviews data including ratings, review text, summaries, and reviewer information"
}
}
],
"tool_resources": {
"product_reviews": {
"semantic_model_file": "@LAKEHOUSE_VHOL.PUBLIC.SEMANTIC_MODELS/product_reviews_semantic_model.yaml",
"execution_environment": {
"type": "warehouse",
"warehouse": "VHOL_XS"
}
}
}
}
$$;| Component | Description |
|---|---|
models.orchestration |
LLM for planning and tool selection (e.g., claude-4-sonnet, llama3.3-70B) |
instructions.orchestration |
Guidance for tool selection and query planning |
instructions.response |
Tone and formatting for responses |
tools |
Array of tool definitions (Cortex Analyst, Cortex Search, custom tools) |
tool_resources |
Configuration for each tool (semantic model, warehouse, etc.) |
GRANT USAGE ON AGENT SNOWFLAKE_INTELLIGENCE.AGENTS.AMAZON_PRODUCT_REVIEWS_AGENT TO ROLE PUBLIC;After creation, the agent is accessible via:
- Snowflake Intelligence UI (AI & ML > Snowflake Intelligence)
- REST API for programmatic access
- SQL using agent functions
Example questions the agent can answer:
- "What is the average rating for all products?"
- "How many reviews are there for each star rating?"
- "Show me the most recent reviews"
- "What products have the highest ratings?"
| File | Purpose |
|---|---|
iam_policy.json |
AWS IAM policy for S3 bucket access |
1-setup.sql |
Database, warehouse, external volume, and Iceberg table creation |
2-governance.sql |
Role creation, row access policies, masking policies, and tagging |
3-data-quality.sql |
Custom data metric functions for quality monitoring |
4-create-agent.sql |
Cortex Agent with Cortex Analyst for natural language queries |
product_reviews_semantic_model.yaml |
Semantic model defining business context for AI |