GoogleCloudPlatform
diff --git a/‎.github/actions/spelling/allow.txt‎
Lines changed: 24 additions & 0 deletions b/‎.github/actions/spelling/allow.txt‎
Lines changed: 24 additions & 0 deletions
diff --git a/‎.github/actions/spelling/expect.txt‎ b/‎.github/actions/spelling/expect.txt‎
diff --git a/‎gemini/use-cases/multimodal-data-curation/README.md‎
Lines changed: 76 additions & 0 deletions b/‎gemini/use-cases/multimodal-data-curation/README.md‎
Lines changed: 76 additions & 0 deletions
@@ -67,6 +67,7 @@ COINIT
 CONOUT
 COPD
 COUNTIF
+CRF
 CRM
 CTHH
 CUAD
@@ -104,6 +105,7 @@ Dexin
 Disturbia
 Doaa
 Doogler
+Downscale
 Dreesen
 Duh
 Duonebs
@@ -257,6 +259,7 @@ Kubeflow
 Kvyat
 Kylian
 L'avenir
+LAION
 LCEL
 LEBRON
 LLMs
@@ -511,7 +514,10 @@ Utik
 VAPO
 VBG
 VFT
+VIDGEN
 VLM
+VLMs
+VMAF
 VMs
 VOS
 VQA
@@ -529,6 +535,7 @@ Verilog
 Ves
 Vesia
 Vettel
+Vid
 Vijay
 Virat
 Viru
@@ -599,6 +606,7 @@ arize
 arun
 arxiv
 astype
+atleast
 atms
 auc
 aujourd'hui
@@ -624,6 +632,7 @@ bbq
 beir
 bert
 bff
+bgr
 bgswap
 bigframes
 bigquery
@@ -699,6 +708,7 @@ csa
 cse
 ctd
 cupertino
+curating
 cva
 cycleway
 cycleways
@@ -711,6 +721,7 @@ dbln
 dcg
 ddbb
 ddl
+decord
 dedup
 deepeval
 deepseek
@@ -786,6 +797,7 @@ fea
 fect
 fewshot
 ffi
+ffprobe
 fibonacci
 figheight
 figsize
@@ -872,6 +884,7 @@ hashtag
 hashtags
 hdfs
 hdlr
+hdtv
 heatmap
 heatmapgl
 hexsha
@@ -927,6 +940,7 @@ jesieni
 jetbrains
 jit
 jiwer
+jsemer
 jsonify
 jsonlines
 jsonpath
@@ -1007,6 +1021,7 @@ mic
 mics
 millis
 miranda
+mlp
 mmarco
 mmol
 mmr
@@ -1063,6 +1078,7 @@ noabe
 nobserved
 nodularis
 nohup
+nokey
 norigin
 notetaker
 novnc
@@ -1074,6 +1090,7 @@ nunique
 nvidia
 oai
 objc
+oglevel
 ollama
 olleh
 onesie
@@ -1146,6 +1163,7 @@ putalpha
 putdata
 pvc
 pyautogen
+pyav
 pybind
 pydantic
 pydub
@@ -1254,6 +1272,7 @@ streamlit
 strfreev
 stt
 stuffie
+subclip
 subviews
 subword
 suis
@@ -1301,6 +1320,7 @@ tscore
 tseslint
 tsne
 tsv
+ttext
 tts
 tures
 typehints
@@ -1330,6 +1350,7 @@ vectoral
 vectordb
 veo
 vesselin
+vidgen
 viridis
 vllm
 vnc
@@ -1380,6 +1401,7 @@ xaxis
 xcassets
 xcconfig
 xcodeproj
+xcol
 xcscheme
 xctest
 xcvi
@@ -1391,9 +1413,11 @@ xsi
 xsum
 xticks
 xxxxxxxx
+xzn
 yanchor
 yaxes
 yaxis
+ycol
 ylabel
 yougov
 youtube
 
@@ -0,0 +1,76 @@
+# Building the Foundation for Breakthroughs: A Multimodal Data Curation Pipeline for Text-to-Video Models
+
+## Authors
+
+- Noa Ben-Efraim (`noabe`)
+- John Semerdjian (`jsemer`)
+
+
+In the rapidly evolving landscape of AI, Text-to-Video (T2V) models are capturing imaginations, promising to revolutionize content creation from film to marketing. But for all their dazzling potential, the real magic isn't just in the model architecture; it's in the data that feeds them. High-quality, diverse, and well-curated multimodal data is the secret sauce for building truly exceptional T2V models.
+
+That's where our new initiative comes in. We're thrilled to introduce a comprehensive solution designed specifically for organizations embarking on their journey into large-scale text-to-video model building – a fully built, serverless data curation pipeline on Google Cloud, accompanied by a blog series that dives deep into its operation and the strategic choices behind it.
+
+## Why Data is the New Frontier in Text-to-Video
+
+Just a short while ago, the focus in generative AI was heavily on novel model architectures. Today, with many powerful architectures being open-sourced and commoditized, the competitive edge has shifted. The true advantage now lies in the quality, quantity, and diversity of the underlying data used for training.
+
+Think of it this way: a brilliant artist can't create a masterpiece without high-quality paints and brushes. Similarly, even the most sophisticated T2V model will struggle to produce compelling results if it's trained on subpar or insufficient data. Data processing and filtering pipelines, once niche knowledge, have now become standardized best practices across leading open-source model builders.
+
+> Our mission is to aggregate these proven recipes and techniques into an easily deployable, robust pipeline and then, through this blog series, demystify the trade-offs and granular details. This isn't just about sharing code; it's about elevating the collective understanding of multimodal data curation, driving innovation, and demonstrating the immense value of Google Cloud's managed services.
+
+## What We're Offering: A Two-Pronged Approach
+
+1.  **A Fully Built, Serverless Architecture + Working Pipeline**: We've engineered a complete, ready-to-deploy pipeline for curating pre-training data for text-to-video models. Built on a serverless architecture, it minimizes maintenance costs and leverages Google Cloud's pay-what-you-use pricing model. Our packaged pipelines implement the best video data curation techniques honed over the last few years, making advanced practices accessible.
+
+2.  **A Comprehensive Blog Series**: This series will be your guide through the entire pipeline. We'll walk you through each step at a high level, explaining how to operate it effectively on top of Cloud resources. More importantly, we'll delve into the crucial trade-offs you'll encounter, providing insights backed by both cutting-edge research papers and practical links to public Cloud documentation.
+
+## Navigating the Data Curation Journey: Our Pipeline's Pillars
+
+Building a high-quality text-to-video dataset is a complex endeavor. Our pipeline addresses common challenges head-on, structured around key pillars that tackle specific pain points.
+
+### 1. Video Splitting
+
+* **Pain Point:** Raw video files are often too long and contain multiple distinct scenes. Manually segmenting these videos is time-consuming and prone to inconsistencies.
+* **Our Solution:** We leverage advanced techniques like `PySceneDetect` with boundary detection and offer options for short clip creation using `MoviePy`, enabling efficient and accurate segmentation of long videos into manageable, meaningful clips.
+
+### 2. Quality Filtering
+
+* **Pain Point:** Not all video content is suitable for training. Low resolution, poor brightness, watermarks, or very short clips can degrade model performance.
+* **Our Solution:** Our pipeline incorporates robust visual filtering based on resolution, brightness, aspect ratio, text boundary detection, and watermark presence. It also filters by minimum FPS to ensure smooth video. Furthermore, we include model-based filtering using techniques like aesthetic scoring (e.g., `LAION-Aesthetics`) and temporal consistency checks (e.g., using `CLIP`) to ensure only visually appealing and coherent content makes it through.
+
+### 3. Motion Filtering
+
+* **Pain Point:** T2V models benefit from understanding motion, but videos with too little or too much chaotic motion (e.g., slideshows, blurry camera shakes) can introduce noise.
+* **Our Solution:** We implement sophisticated motion analysis using metrics like **VMAF** (Video Multimethod Assessment Fusion) for optical flow and `PySceneDetect` for static scene detection. This helps identify and filter out clips with no slow motion or excessive jitter.
+
+### 4. Captioning & Tagging
+
+* **Pain Point:** High-quality text descriptions are paramount for T2V models. Manual captioning is prohibitively expensive, and generic captions limit a model's learning capabilities.
+* **Our Solution:** This pillar leverages Google Cloud's **Vertex AI** for automated Video Captioning and Tagging. We employ powerful models like **Gemini** for classifying camera motion (zoom, pan, etc.) and generating rich clip taxonomies. Additionally, we use Embedding APIs to create multimodal embeddings for semantic understanding.
+
+### 5. Deduplication
+
+* **Pain Point:** Large datasets often contain near-duplicate videos and captions, which can lead to overfitting and inefficient training.
+* **Our Solution:** We address this with semantic deduplication via K-Means clustering on multimodal embeddings. This method effectively groups similar clips and captions, allowing for intelligent removal of redundant data.
+
+Each of these pillars is orchestrated using Google Cloud's powerful services like **Dataflow**, ensuring a scalable, serverless, and robust solution for handling even the largest video datasets.
+
+## Target Audience
+
+This solution is tailor-made for customers relatively new to building proprietary text-to-video models, such as:
+* Film and TV studios
+* Educational content creators
+* Marketing agencies looking to leverage generative AI
+
+Our goal is to evangelize the power of Google Cloud's managed service offerings – including **Dataflow**, **BigQuery**, **Vector Search**, and **Gemini** – to demonstrate the tangible business value and productivity gains that migrating to Google Cloud can unlock.
+
+**Who it's not for:** Advanced research organizations that have already settled on alternative orchestration frameworks (like Ray or Spark) for their existing, highly specialized pipelines.
+
+## Unlocking Business Value and Driving Innovation
+
+By providing this pipeline and accompanying guidance, we aim to achieve several key business outcomes:
+
+* **Increased Thought Leadership:** Establish Google Cloud as a leading voice in the nascent but rapidly growing field of multimodal data curation for generative AI.
+* **Increased Revenue Across Cloud SKUs:** Drive adoption across multiple Google Cloud services by showcasing the seamless integration and power of Dataflow, BigQuery, Vector Search, and Gemini.
+
+The quality of your models will be directly proportional to the quality of your data. Our multimodal data curation pipeline provides the robust foundation you need to build the next generation of creative AI applications.