CTLB Tweet Embedding Project

This repository provides scripts for processing, analyzing, and aggregating embeddings and mental health scores derived from the County Tweet Lexical Bank (CTLB) dataset.

Overview

The CTLB dataset comprises billions of geolocated tweets across U.S. counties over several years. In this project, we leverage sentence-transformer embeddings (generated via a separate pipeline — see below) to analyze mental health indicators like depression and anxiety at various levels of aggregation — user, weekly, and county level.

The embedding generation was handled in a different repository and stored in HDFS on the Apollo server for efficient distributed access. This repo provides downstream analysis utilities built on top of these embeddings.

Project Pipeline

1. Message-Level Embeddings

Message embeddings were generated using a Transformer-based model in a separate repo and stored in Parquet format on the Apollo HDFS cluster:

Each row includes:

user_id
message_id
user-year-week
embedding
Location metadata (e.g., FIPS code)

2. User and Region-Level Aggregation

Embeddings are aggregated using grouping scripts to generate:

User-week embeddings
User-county embeddings
County-level weekly average embeddings

These grouped embeddings are used for downstream prediction and spatial/temporal trend analysis.

Key Scripts

dep_anx_score_predictor.py
Predicts depression and anxiety scores using Ridge Regression models trained on Facebook data. Input is user-level or county-level aggregated embeddings. Output includes mental health scores written back to HDFS or local disk.
agg_scores_yr_week_cnty.py
Aggregates predicted scores to the year-week-county level for trend visualization and correlation studies.
save_dlatk_table.py
Converts predicted scores or aggregated embeddings into a format compatible with DLATK (Differential Language Analysis Toolkit) for further linguistic analysis.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
DataSampler		DataSampler
README.md		README.md
agg_scores_yr_week_cnty.py		agg_scores_yr_week_cnty.py
dep_anx_score_predictor.py		dep_anx_score_predictor.py
embedding_grouping_3.py		embedding_grouping_3.py
save_dlatk_table.py		save_dlatk_table.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

CTLB Tweet Embedding Project

Overview

Project Pipeline

1. Message-Level Embeddings

2. User and Region-Level Aggregation

Key Scripts

About

Uh oh!

Releases

Packages

Languages

ojasdeshpande10/ctlb-embedding-project

Folders and files

Latest commit

History

Repository files navigation

CTLB Tweet Embedding Project

Overview

Project Pipeline

1. Message-Level Embeddings

2. User and Region-Level Aggregation

Key Scripts

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages