
Quantifying the Academic Quality of Children’s Videos Using Machine Comprehension

This archive is distributed in association with the INFORMS Journal on Computing under the MIT License.

The software and data in this repository are a snapshot of the software and data that were used in the research reported on in the paper Quantifying the Academic Quality of Children’s Videos Using Machine Comprehension by Sumeet Kumar, Mallikarjuna Tupakula, and Ashiqur R. KhudaBukhsh.

Cite

To cite the contents of this repository, please cite both the paper and this repo, using their respective DOIs.

https://doi.org/10.1287/ijoc.2023.0502

https://doi.org/10.1287/ijoc.2023.0502.cd

Below is the BibTeX for citing this snapshot of the repository.

@misc{kumar2025,
  author =        {Sumeet Kumar and Mallikarjuna Tupakula and Ashiqur R. KhudaBukhsh},
  publisher =     {INFORMS Journal on Computing},
  title =         {{Quantifying the Academic Quality of Children’s Videos Using Machine Comprehension}},
  year =          {2025},
  doi =           {10.1287/ijoc.2023.0502.cd},
  url =           {https://github.com/INFORMSJoC/2023.0502},
  note =          {Available for download at https://github.com/INFORMSJoC/2023.0502},
}

Description

Our experimental validation is divided into three experiments. The first experiment validates our proposed approach of using a reading comprehension (RC) model for question answering on videos, using a labeled dataset in the data/ folder. In this part, we also compare several RC models across varying video lengths to pick the most suitable one for our use case. Using the best model found in the first experiment, the second experiment presents our video retrieval and ranking approach based on topics from children’s textbooks (the ScienceQA dataset). Finally, in Experiment 3, we compare different channels, examining their academic quality and viewership. All experiments were run on a machine with an Intel Xeon Platinum 8358 CPU (32 cores, 2.60 GHz, 250 W), 512 GB of RAM, and one NVIDIA RTX 3090 GPU. For any questions regarding the code, data, or execution, please email the second author at [email protected]. The process of running the code is described next.
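To make the core RC step concrete, here is a minimal sketch of extractive question answering over a transcript snippet using the Hugging Face transformers pipeline. The model name is an illustrative stand-in, not necessarily one of the models compared in the paper, and the transcript text is invented for the example.

```python
# Minimal sketch: given a video transcript as context, an extractive RC
# model returns a span answer plus a confidence score.
from transformers import pipeline

# Illustrative stand-in model; the notebooks list the models actually compared.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

transcript = (
    "Plants make their own food through a process called photosynthesis. "
    "They use sunlight, water, and carbon dioxide to produce sugar and oxygen."
)
result = qa(question="How do plants make their food?", context=transcript)
print(result["answer"], result["score"])
```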

Environment

To install the libraries required for this project, run `pip install -r requirements.txt`.

Data (data/)

All smaller data files needed to execute the experiments are in the data/ folder. In addition, we created a dataset composed of video transcripts and video-frame captions from the top YouTube channels that create children's content; it can be privately shared by emailing the second author at [email protected]. We have also created labeled datasets for video-based question answering, available at `data/CVQA_dataset.csv` and `data/CVQA_visual_dataset.csv`.
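As a quick sanity check, the sketch below loads the two labeled CSV files with pandas and prints their shapes and column names; it assumes it is run from the repository root and makes no assumptions about the column schema.

```python
# Inspect the labeled CVQA datasets; print shapes and columns rather than
# assuming a particular schema.
import pandas as pd

cvqa = pd.read_csv("data/CVQA_dataset.csv")
cvqa_visual = pd.read_csv("data/CVQA_visual_dataset.csv")

print(cvqa.shape, cvqa_visual.shape)
print(cvqa.columns.tolist())
print(cvqa.head())
```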

Source (src/)

To generate the data used in scripts/Experiment_1.ipynb, run `src/Experiment_1_src.ipynb`. Similarly, to recreate the data used in scripts/Experiment_2.ipynb, run `src/Experiment_2_src_data.ipynb`.
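One way to run these data-generation notebooks non-interactively is with nbconvert's ExecutePreprocessor, as sketched below; opening them in Jupyter and running all cells works just as well.

```python
# Execute the src/ notebooks headlessly and save the executed copies in place.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

for name in ["Experiment_1_src.ipynb", "Experiment_2_src_data.ipynb"]:
    with open(f"src/{name}") as f:
        nb = nbformat.read(f, as_version=4)
    # timeout=-1 disables the per-cell timeout; model-inference cells can be slow.
    ExecutePreprocessor(timeout=-1).preprocess(nb, {"metadata": {"path": "src/"}})
    with open(f"src/{name}", "w") as f:
        nbformat.write(nb, f)
```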

Scripts (scripts/)

The Experiment 1 script (scripts/Experiment_1.ipynb) analyzes and visualizes the relationship between passage length and accuracy across different language models. It processes the experimental output to plot how model performance varies with input length, generating high-resolution graphs that illustrate which models maintain performance as passage length increases.
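The plot itself reduces to one accuracy curve per model over increasing passage lengths. Below is a minimal matplotlib sketch; the model names and numbers are placeholders, since the notebook computes the real values from the experimental output.

```python
# Placeholder sketch of the accuracy-vs-passage-length plot.
import matplotlib.pyplot as plt

lengths = [100, 200, 400, 800, 1600]  # passage length in tokens (hypothetical)
accuracy = {                          # hypothetical models and accuracies
    "model_a": [0.82, 0.80, 0.76, 0.70, 0.61],
    "model_b": [0.78, 0.77, 0.75, 0.73, 0.70],
}

for model, acc in accuracy.items():
    plt.plot(lengths, acc, marker="o", label=model)
plt.xlabel("Passage length (tokens)")
plt.ylabel("Accuracy")
plt.legend()
plt.savefig("passage_length_vs_accuracy.png", dpi=300)  # high-resolution output
```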

The Experiment 2 script (scripts/Experiment_2.ipynb) uses QA models to answer questions from the ScienceQA dataset based on YouTube video transcripts and video captions. It compares the different reading comprehension (RC) models on ScienceQA questions, using video transcripts (Transcript) and video-frame captions (Caption) as context. Finally, it produces a heatmap showing how effectively the YouTube videos in our dataset can be used to answer questions from the ScienceQA dataset.
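The heatmap step amounts to a grade-by-category grid of QA scores. A sketch with placeholder labels and values (the notebook derives the real grid from the ScienceQA results):

```python
# Placeholder sketch of the grade-by-category heatmap.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

scores = pd.DataFrame(
    [[0.61, 0.54], [0.58, 0.49]],                     # hypothetical accuracies
    index=["grade1", "grade2"],                       # hypothetical grade labels
    columns=["natural science", "language science"],  # hypothetical categories
)
sns.heatmap(scores, annot=True, cmap="viridis")
plt.tight_layout()
plt.savefig("scienceqa_heatmap.png", dpi=300)
```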

The Experiment 3 script (scripts/Experiment_3.ipynb) estimates and visualizes the academic quality of the top channels on YouTube Kids (YTK) based on school-textbook questions. The notebook processes the precomputed output to generate the plots.
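The channel-level figure is essentially a bubble chart: one point per channel, sized by view count. A sketch with placeholder channels and numbers:

```python
# Placeholder sketch: academic quality per channel, dot size ~ view count.
import matplotlib.pyplot as plt

channels = ["channel_a", "channel_b", "channel_c"]  # hypothetical channels
quality = [0.42, 0.55, 0.31]                        # hypothetical quality scores
views = [1.2e9, 4.5e8, 2.3e9]                       # hypothetical view counts

plt.scatter(range(len(channels)), quality, s=[v / 1e7 for v in views], alpha=0.5)
plt.xticks(range(len(channels)), channels, rotation=45)
plt.ylabel("Academic quality")
plt.tight_layout()
plt.savefig("channel_quality.png", dpi=300)
```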

Results (results/)

Table 1 and Figure 5 in the paper compare the performance of different RC models for varying passage lengths. These can be reproduced using scripts/Experiment_1.ipynb.


Figure 6 in the paper presents a heatmap of lesson grades and categories from the ScienceQA dataset, showing how effectively YouTube videos in our dataset can be used to answer them. The plot can be reproduced by executing the code in scripts/Experiment_2.ipynb.


Figure 7 in the paper visualizes the academic quality of the top channels on YTK based on school-textbook questions. Each dot represents a YouTube Kids channel, with the dot size indicating the channel's view count. The plot can be reproduced using scripts/Experiment_3.ipynb.
