GSoC 2026 Proposal – Auto-Labeling Data Factory & Edge Training Integration #35100

MapleEagles · 2026-04-01T00:52:52Z

About Me

Name: Jing Jing
University: Michigan State University (PhD student, Construction Engineering / Data-driven Systems)
Timezone: EST (UTC-5)

Short Bio:
I am a PhD student focusing on AI-driven data pipelines and decision systems. My research interests lie in transforming unstructured data (e.g., images, logs) into structured representations for downstream reasoning and control.

Programming Experience:

Strong experience in Python (data processing, ML, CV pipelines)
Experience with PyTorch, scikit-learn
Familiar with building end-to-end pipelines (data → model → evaluation)

Relevant Technical Experience:

Computer vision workflows (image-based analysis, UAV data)
Dataset handling and preprocessing
Experience working with structured and unstructured data

About the Project

Project Choice

Auto-Labeling "Data Factory" & Edge Training Integration

Why I Chose This Project

This project aligns strongly with my research interest in data-centric AI systems. I am particularly interested in building pipelines that transform raw data into structured datasets that can support scalable model training and deployment.

Compared to traditional annotation workflows, this project introduces a more scalable approach using zero-shot models as teachers, which I find both technically interesting and practically impactful.

Proposed Solution (Abstract)

I propose to design a teacher–student data pipeline that leverages zero-shot models to generate pseudo-labels and uses quality-aware filtering to produce reliable training datasets.

The system will include:

Batch inference on video/image streams using zero-shot models
Pseudo-label generation with structured outputs
Multi-stage quality filtering (confidence, temporal consistency, stability)
Dataset structuring using Datumaro
Export to YOLO / COCO formats
Integration with OTX for training lightweight models
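
As a rough illustration of the quality-filtering stage listed above, here is a minimal sketch that combines a confidence threshold with a simple temporal-consistency check. It assumes detections arrive as per-frame boxes with a label and score. The `Detection` class, the `filter_pseudo_labels` function, and the default thresholds are hypothetical names and values chosen for this example; they are not part of Datumaro, OTX, or any zero-shot model API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One pseudo-label from the teacher model (hypothetical structure)."""
    frame: int    # frame index in the video
    label: str    # predicted class name
    score: float  # teacher confidence in [0, 1]
    box: tuple    # (x1, y1, x2, y2) in pixels


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_pseudo_labels(detections, min_score=0.5, min_iou=0.5):
    """Keep detections that are confident AND supported by a neighbouring frame.

    Stage 1: drop detections below min_score.
    Stage 2: keep a detection only if the previous or next frame contains a
    confident detection with the same label and IoU >= min_iou.
    """
    confident = [d for d in detections if d.score >= min_score]

    # Index the confident detections by frame for neighbour lookups.
    by_frame = {}
    for d in confident:
        by_frame.setdefault(d.frame, []).append(d)

    kept = []
    for d in confident:
        neighbours = by_frame.get(d.frame - 1, []) + by_frame.get(d.frame + 1, [])
        if any(n.label == d.label and iou(n.box, d.box) >= min_iou
               for n in neighbours):
            kept.append(d)
    return kept
```

In this sketch a detection that appears in only one isolated frame is discarded even if its score is high. That is a deliberate but debatable choice: it trades recall for label reliability, and the thresholds would need tuning per dataset.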

Additionally, I propose an iterative refinement loop, where student model performance is used to improve pseudo-label quality over time.
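
One possible shape for such a refinement loop is sketched below, under the assumption that the feedback signal is a single validation metric from the student and that the only knob being tuned is the pseudo-label confidence threshold. It uses plain hill climbing, which is a stand-in technique for illustration and not a committed design. `train_and_eval` is a hypothetical callback that would filter the labels at the given threshold, train the student (for example through OTX), and return a validation score.

```python
def refine_threshold(train_and_eval, start=0.5, step=0.05, rounds=5,
                     lo=0.1, hi=0.95):
    """Hill-climb the confidence threshold using student validation scores.

    train_and_eval(threshold) -> float is a hypothetical callback that
    filters pseudo-labels at `threshold`, trains the student model, and
    returns a validation metric where higher is better.
    Returns a list of (threshold, metric) pairs, one per round.
    """
    history = []
    threshold = start
    direction = step      # start by tightening the filter
    prev_metric = None

    for _ in range(rounds):
        metric = train_and_eval(threshold)
        history.append((threshold, metric))

        # If the student got worse, the last move hurt: reverse direction.
        if prev_metric is not None and metric < prev_metric:
            direction = -direction
        prev_metric = metric

        # Take the next step, clamped to [lo, hi]; rounding avoids float drift.
        threshold = round(min(hi, max(lo, threshold + direction)), 4)

    return history


def best_round(history):
    """Return the (threshold, metric) pair with the highest metric."""
    return max(history, key=lambda pair: pair[1])
```

Each round here implies a full student training run, so in practice the number of rounds would be small and the student would likely be trained on a subset or for a reduced schedule.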

Time Commitment

I plan to dedicate up to 30 hours/week during the GSoC period.

Timeline (High-Level)

Weeks 1–2: System design and setup
Weeks 3–5: Batch inference and pseudo-label generation
Weeks 6–8: Quality filtering and refinement
Weeks 9–10: Dataset structuring and export
Weeks 11–12: Integration with OTX
Week 13: Testing, optimization, and documentation

General Questions

How do I know OpenVINO?

I found OpenVINO while reviewing the GSoC organization list and saw that this project aligns with my thesis direction. It also addresses a problem in my field: in civil engineering, data is limited, and I would like to contribute to tooling that helps with this.
I became familiar with OpenVINO through its role in optimizing AI models for edge deployment. I have explored its use in accelerating inference and supporting lightweight deployment scenarios.

What do I know about OpenVINO?

OpenVINO provides tools for optimizing and deploying deep learning models efficiently on edge hardware. It supports model conversion, inference optimization, and integration with training pipelines such as OTX.

Contributions to OpenVINO

I am currently exploring the repository and plan to contribute via the prerequisite task.

Professional Development

This project aligns closely with my research direction in AI-driven data systems. It will help me deepen my understanding of:

data-centric AI
scalable training pipelines
edge deployment workflows

Other Summer Plans

My primary focus for the summer is GSoC. I do not have conflicting commitments.

Why Should You Pick Me?

I bring a strong combination of:

hands-on experience in Python and machine learning
understanding of data pipelines and model training
research perspective on structured data and decision systems

I am particularly interested in not just building the pipeline, but improving data quality and understanding how it impacts model performance.

Prerequisites

I am currently working on the prerequisite task and will update this thread with my pull request.
