
Commit df813a1

Multimodal data curation use case (#2139)
1 parent 50b414a commit df813a1

7 files changed

Lines changed: 3703 additions & 0 deletions

File tree

.github/actions/spelling/allow.txt

Lines changed: 24 additions & 0 deletions
@@ -67,6 +67,7 @@ COINIT
 CONOUT
 COPD
 COUNTIF
+CRF
 CRM
 CTHH
 CUAD
@@ -104,6 +105,7 @@ Dexin
 Disturbia
 Doaa
 Doogler
+Downscale
 Dreesen
 Duh
 Duonebs
@@ -257,6 +259,7 @@ Kubeflow
 Kvyat
 Kylian
 L'avenir
+LAION
 LCEL
 LEBRON
 LLMs
@@ -511,7 +514,10 @@ Utik
 VAPO
 VBG
 VFT
+VIDGEN
 VLM
+VLMs
+VMAF
 VMs
 VOS
 VQA
@@ -529,6 +535,7 @@ Verilog
 Ves
 Vesia
 Vettel
+Vid
 Vijay
 Virat
 Viru
@@ -599,6 +606,7 @@ arize
 arun
 arxiv
 astype
+atleast
 atms
 auc
 aujourd'hui
@@ -624,6 +632,7 @@ bbq
 beir
 bert
 bff
+bgr
 bgswap
 bigframes
 bigquery
@@ -699,6 +708,7 @@ csa
 cse
 ctd
 cupertino
+curating
 cva
 cycleway
 cycleways
@@ -711,6 +721,7 @@ dbln
 dcg
 ddbb
 ddl
+decord
 dedup
 deepeval
 deepseek
@@ -786,6 +797,7 @@ fea
 fect
 fewshot
 ffi
+ffprobe
 fibonacci
 figheight
 figsize
@@ -872,6 +884,7 @@ hashtag
 hashtags
 hdfs
 hdlr
+hdtv
 heatmap
 heatmapgl
 hexsha
@@ -927,6 +940,7 @@ jesieni
 jetbrains
 jit
 jiwer
+jsemer
 jsonify
 jsonlines
 jsonpath
@@ -1007,6 +1021,7 @@ mic
 mics
 millis
 miranda
+mlp
 mmarco
 mmol
 mmr
@@ -1063,6 +1078,7 @@ noabe
 nobserved
 nodularis
 nohup
+nokey
 norigin
 notetaker
 novnc
@@ -1074,6 +1090,7 @@ nunique
 nvidia
 oai
 objc
+oglevel
 ollama
 olleh
 onesie
@@ -1146,6 +1163,7 @@ putalpha
 putdata
 pvc
 pyautogen
+pyav
 pybind
 pydantic
 pydub
@@ -1254,6 +1272,7 @@ streamlit
 strfreev
 stt
 stuffie
+subclip
 subviews
 subword
 suis
@@ -1301,6 +1320,7 @@ tscore
 tseslint
 tsne
 tsv
+ttext
 tts
 tures
 typehints
@@ -1330,6 +1350,7 @@ vectoral
 vectordb
 veo
 vesselin
+vidgen
 viridis
 vllm
 vnc
@@ -1380,6 +1401,7 @@ xaxis
 xcassets
 xcconfig
 xcodeproj
+xcol
 xcscheme
 xctest
 xcvi
@@ -1391,9 +1413,11 @@ xsi
 xsum
 xticks
 xxxxxxxx
+xzn
 yanchor
 yaxes
 yaxis
+ycol
 ylabel
 yougov
 youtube

.github/actions/spelling/expect.txt

Whitespace-only changes.
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# Building the Foundation for Breakthroughs: A Multimodal Data Curation Pipeline for Text-to-Video Models

## Authors

- Noa Ben-Efraim (`noabe`)
- John Semerdjian (`jsemer`)

In the rapidly evolving landscape of AI, Text-to-Video (T2V) models are capturing imaginations, promising to revolutionize content creation from film to marketing. But for all their dazzling potential, the real magic isn't just in the model architecture; it's in the data that feeds them. High-quality, diverse, and well-curated multimodal data is the secret sauce for building truly exceptional T2V models.

That's where our new initiative comes in. We're thrilled to introduce a comprehensive solution designed specifically for organizations embarking on their journey into large-scale text-to-video model building – a fully built, serverless data curation pipeline on Google Cloud, accompanied by a blog series that dives deep into its operation and the strategic choices behind it.

## Why Data is the New Frontier in Text-to-Video

Just a short while ago, the focus in generative AI was heavily on novel model architectures. Today, with many powerful architectures being open-sourced and commoditized, the competitive edge has shifted. The true advantage now lies in the quality, quantity, and diversity of the underlying data used for training.

Think of it this way: a brilliant artist can't create a masterpiece without high-quality paints and brushes. Similarly, even the most sophisticated T2V model will struggle to produce compelling results if it's trained on subpar or insufficient data. Data processing and filtering pipelines, once niche knowledge, have now become standardized best practices across leading open-source model builders.

> Our mission is to aggregate these proven recipes and techniques into an easily deployable, robust pipeline and then, through this blog series, demystify the trade-offs and granular details. This isn't just about sharing code; it's about elevating the collective understanding of multimodal data curation, driving innovation, and demonstrating the immense value of Google Cloud's managed services.

## What We're Offering: A Two-Pronged Approach

1. **A Fully Built, Serverless Architecture + Working Pipeline**: We've engineered a complete, ready-to-deploy pipeline for curating pre-training data for text-to-video models. Built on a serverless architecture, it minimizes maintenance costs and leverages Google Cloud's pay-as-you-go pricing model. Our packaged pipelines implement video data curation techniques that open-source model builders have refined over the last few years, making advanced practices accessible.

2. **A Comprehensive Blog Series**: This series will be your guide through the entire pipeline. We'll walk you through each step at a high level, explaining how to operate it effectively on top of Cloud resources. More importantly, we'll delve into the crucial trade-offs you'll encounter, providing insights backed by both cutting-edge research papers and practical links to public Cloud documentation.

## Navigating the Data Curation Journey: Our Pipeline's Pillars

Building a high-quality text-to-video dataset is a complex endeavor. Our pipeline addresses common challenges head-on, structured around key pillars that tackle specific pain points.

### 1. Video Splitting

* **Pain Point:** Raw video files are often too long and contain multiple distinct scenes. Manually segmenting these videos is time-consuming and prone to inconsistencies.
* **Our Solution:** We use `PySceneDetect` for scene boundary detection and offer options for short clip creation using `MoviePy`, enabling efficient and accurate segmentation of long videos into manageable, meaningful clips.
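
To make the short-clip option concrete, here is a minimal sketch of turning detected scene boundaries into bounded-length clips. It is an illustration rather than the pipeline's actual code: the function name, the `min_len`/`max_len` defaults, and the sample boundaries are assumptions. In practice the `(start, end)` pairs would come from a scene detector such as `PySceneDetect`, and the cutting itself (for example with `MoviePy`) is left out here.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def split_scenes_into_clips(
    scenes: List[Span], min_len: float = 2.0, max_len: float = 10.0
) -> List[Span]:
    """Cut each detected scene into clips between min_len and max_len seconds."""
    if not 0 < min_len <= max_len:
        raise ValueError("expected 0 < min_len <= max_len")
    clips: List[Span] = []
    for start, end in scenes:
        cursor = start
        # Keep cutting while enough of the scene is left for a valid clip.
        while end - cursor >= min_len:
            clip_end = min(cursor + max_len, end)
            clips.append((cursor, clip_end))
            cursor = clip_end
        # Any remainder shorter than min_len is dropped.
    return clips


# Hypothetical scene boundaries, in seconds, as a scene detector might report them.
scenes = [(0.0, 25.0), (25.0, 26.0), (26.0, 31.5)]
clips = split_scenes_into_clips(scenes)
print(clips)
```

Clips never cross a scene boundary, so each one stays visually coherent. The one-second scene in the sample input is discarded because it is shorter than `min_len`.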

### 2. Quality Filtering

* **Pain Point:** Not all video content is suitable for training. Low resolution, poor brightness, watermarks, or very short clips can degrade model performance.
* **Our Solution:** Our pipeline incorporates robust visual filtering based on resolution, brightness, aspect ratio, text boundary detection, and watermark presence. It also filters by minimum FPS to ensure smooth video. Furthermore, we include model-based filtering using techniques like aesthetic scoring (e.g., `LAION-Aesthetics`) and temporal consistency checks (e.g., using `CLIP`) to ensure only visually appealing and coherent content makes it through.
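
The rule-based part of this filtering can be sketched as a set of threshold checks over per-clip measurements. Everything below is an assumption made for illustration: the `ClipStats` and `Thresholds` types, the field names, and the cut-off values are not taken from the pipeline. The model-based filters (aesthetic scoring, CLIP consistency, watermark and text detection) are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClipStats:
    """Per-clip measurements, assumed to be computed upstream."""
    width: int
    height: int
    fps: float
    duration_s: float
    mean_luma: float  # average brightness on a 0-255 scale


@dataclass(frozen=True)
class Thresholds:
    """Illustrative cut-offs; tune them for your own dataset."""
    min_short_side: int = 480
    min_fps: float = 24.0
    min_duration_s: float = 2.0
    min_aspect: float = 0.5
    max_aspect: float = 2.4
    min_luma: float = 20.0
    max_luma: float = 235.0


def failed_checks(clip: ClipStats, t: Thresholds = Thresholds()) -> List[str]:
    """Return the names of the checks a clip fails (an empty list means keep)."""
    if clip.width <= 0 or clip.height <= 0:
        return ["invalid_dimensions"]
    failures: List[str] = []
    if min(clip.width, clip.height) < t.min_short_side:
        failures.append("resolution")
    if not t.min_aspect <= clip.width / clip.height <= t.max_aspect:
        failures.append("aspect_ratio")
    if clip.fps < t.min_fps:
        failures.append("fps")
    if clip.duration_s < t.min_duration_s:
        failures.append("duration")
    if not t.min_luma <= clip.mean_luma <= t.max_luma:
        failures.append("brightness")
    return failures


good = ClipStats(width=1920, height=1080, fps=30.0, duration_s=5.0, mean_luma=110.0)
bad = ClipStats(width=320, height=240, fps=15.0, duration_s=1.0, mean_luma=5.0)
print(failed_checks(good))
print(failed_checks(bad))
```

Returning the list of failed checks, rather than a single boolean, makes it easier to see which rule removes the most data when tuning thresholds.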

### 3. Motion Filtering

* **Pain Point:** T2V models benefit from understanding motion, but videos with too little or too much chaotic motion (e.g., slideshows, blurry camera shakes) can introduce noise.
* **Our Solution:** We implement motion analysis using metrics like the **VMAF** (Video Multimethod Assessment Fusion) motion score and optical flow, and `PySceneDetect` for static scene detection. This helps identify and filter out clips with little or no motion or with excessive jitter.
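
As a rough illustration of the idea, the sketch below scores motion as the mean absolute pixel change between consecutive grayscale frames and rejects clips at both extremes. This is a deliberately simple stand-in, not VMAF or optical flow, and the function names and the `low`/`high` thresholds are assumptions chosen for the example.

```python
from typing import List, Sequence

Frame = Sequence[float]  # flattened grayscale pixels on a 0-255 scale


def motion_score(frames: Sequence[Frame]) -> float:
    """Average absolute pixel change between consecutive frames."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    per_pair: List[float] = []
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr) or not prev:
            raise ValueError("frames must be non-empty and equally sized")
        per_pair.append(sum(abs(a - b) for a, b in zip(prev, curr)) / len(prev))
    return sum(per_pair) / len(per_pair)


def classify_motion(frames: Sequence[Frame], low: float = 1.0, high: float = 50.0) -> str:
    """Label a clip as 'static', 'ok', or 'too_much_motion' (thresholds are illustrative)."""
    score = motion_score(frames)
    if score < low:
        return "static"
    if score > high:
        return "too_much_motion"
    return "ok"


# Tiny synthetic 4-pixel "frames" standing in for decoded video frames.
static_clip = [[10, 10, 10, 10]] * 3
moving_clip = [[0, 0, 0, 0], [4, 4, 4, 4], [8, 8, 8, 8]]
jittery_clip = [[0, 0, 0, 0], [200, 200, 200, 200], [0, 0, 0, 0]]
print(classify_motion(static_clip), classify_motion(moving_clip), classify_motion(jittery_clip))
```

A two-sided threshold is what separates this from a plain "is anything moving" test: slideshow-like clips fall below `low`, while camera shake and hard flicker land above `high`.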

### 4. Captioning & Tagging

* **Pain Point:** High-quality text descriptions are paramount for T2V models. Manual captioning is prohibitively expensive, and generic captions limit a model's learning capabilities.
* **Our Solution:** This pillar leverages Google Cloud's **Vertex AI** for automated Video Captioning and Tagging. We employ powerful models like **Gemini** for classifying camera motion (zoom, pan, etc.) and generating rich clip taxonomies. Additionally, we use Embedding APIs to create multimodal embeddings for semantic understanding.
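
One practical detail of model-based tagging is constraining the reply to a closed label set and validating it before it reaches the dataset. The sketch below shows only that part, under stated assumptions: the label set, the prompt wording, and the JSON keys are hypothetical, and the actual call to a model on Vertex AI is intentionally omitted.

```python
import json
from typing import Dict

# Assumed label set for illustration; the pipeline's real taxonomy may differ.
CAMERA_MOTIONS = ("static", "pan", "tilt", "zoom", "other")


def build_prompt() -> str:
    """Prompt asking for a caption plus one camera-motion label, as JSON."""
    return (
        "Describe the video clip in one or two sentences and classify its camera motion. "
        'Reply with JSON only, using the keys "caption" and "camera_motion". '
        f"camera_motion must be one of: {', '.join(CAMERA_MOTIONS)}."
    )


def parse_reply(reply_text: str) -> Dict[str, str]:
    """Validate a model reply; unknown motion labels are mapped to 'other'."""
    data = json.loads(reply_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    caption = str(data.get("caption", "")).strip()
    motion = str(data.get("camera_motion", "")).strip().lower()
    if not caption:
        raise ValueError("missing caption")
    if motion not in CAMERA_MOTIONS:
        motion = "other"
    return {"caption": caption, "camera_motion": motion}


# A hand-written reply standing in for real model output.
sample_reply = '{"caption": "A dog runs along a beach.", "camera_motion": "Pan"}'
print(parse_reply(sample_reply))
```

Normalizing labels at parse time keeps downstream filters simple, since every stored clip carries one of a known, small set of tags.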

### 5. Deduplication

* **Pain Point:** Large datasets often contain near-duplicate videos and captions, which can lead to overfitting and inefficient training.
* **Our Solution:** We address this with semantic deduplication via K-Means clustering on multimodal embeddings. Clustering groups similar clips and captions together, so near-duplicates can be found within each cluster and removed.
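
The clustering-then-pruning idea can be sketched in plain Python. This is a toy version under stated assumptions: a basic Lloyd's K-Means seeded with the first `k` points, a cosine-similarity `threshold` of 0.95, and a "keep the first item seen" rule, none of which are taken from the pipeline itself. A production run would use a library implementation and real multimodal embeddings.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def kmeans(points: Sequence[Vector], k: int, iterations: int = 20) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    # Deterministic seeding for this sketch: the first k points start as centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: _sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def deduplicate(embeddings: Sequence[Vector], k: int, threshold: float = 0.95) -> List[int]:
    """Return indices to keep after removing near-duplicates inside each cluster."""
    if not 1 <= k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of embeddings")
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(kmeans(embeddings, k)):
        clusters.setdefault(label, []).append(idx)
    kept: List[int] = []
    for members in clusters.values():
        kept_here: List[int] = []
        for idx in members:
            # Keep an item only if it is not too similar to one already kept.
            if all(_cosine(embeddings[idx], embeddings[j]) < threshold for j in kept_here):
                kept_here.append(idx)
        kept.extend(kept_here)
    return sorted(kept)


# Toy 2-D "embeddings": items 2 and 3 are near-copies of items 0 and 1.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.999, 0.01], [0.01, 0.999], [0.8, 0.6]]
print(deduplicate(embeddings, k=2))
```

Comparing items only within their own cluster is what keeps this tractable at scale: pairwise similarity is computed per cluster instead of across the whole dataset.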

Each of these pillars is orchestrated using Google Cloud's powerful services like **Dataflow**, ensuring a scalable, serverless, and robust solution for handling even the largest video datasets.

## Target Audience

This solution is tailor-made for customers relatively new to building proprietary text-to-video models, such as:
* Film and TV studios
* Educational content creators
* Marketing agencies looking to leverage generative AI

Our goal is to evangelize the power of Google Cloud's managed service offerings – including **Dataflow**, **BigQuery**, **Vector Search**, and **Gemini** – to demonstrate the tangible business value and productivity gains that migrating to Google Cloud can unlock.

**Who it's not for:** Advanced research organizations that have already settled on alternative orchestration frameworks (like Ray or Spark) for their existing, highly specialized pipelines.

## Unlocking Business Value and Driving Innovation

By providing this pipeline and accompanying guidance, we aim to achieve several key business outcomes:

* **Increased Thought Leadership:** Establish Google Cloud as a leading voice in the nascent but rapidly growing field of multimodal data curation for generative AI.
* **Increased Revenue Across Cloud SKUs:** Drive adoption across multiple Google Cloud services by showcasing the seamless integration and power of Dataflow, BigQuery, Vector Search, and Gemini.

The quality of your models depends directly on the quality of your data. Our multimodal data curation pipeline provides the robust foundation you need to build the next generation of creative AI applications.
