Medical Transcription Specialty Classification

Overview

The primary objective is to classify medical transcriptions by specialty, using publicly available samples scraped from mtsamples.com.

Dataset

The dataset consists of clinical notes categorized by medical specialty.

Size: 2,377 unique sample names and 2,358 unique transcriptions.
Data Quality: The keywords feature contains 21% null values and 2% empty values.
Distribution: The target variable is heavily skewed. Surgery constitutes 22% of the dataset. Consultations account for 10%. The remaining 68% falls into a long tail of other specialties.

Medical Specialty Frequencies (Excerpt)

Specialty	Count
Radiology	218
General Medicine	207
Gastroenterology	179
Neurology	178
SOAP / Chart / Progress Notes	133
Urology	125

Methodology

The pipeline processes the text and evaluates multiple classification frameworks.

Data Splitting: Data is partitioned into training, validation, and test sets.
Classical Baselines: We evaluate logistic regression, Support Vector Machine (SVM), random forest, and gradient boosting. We use 5-fold cross-validation.
Transformer Baseline: We fine-tune DistilBERT for sequence classification. This captures deeper semantic context.
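
The classical-baseline step above could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the TF-IDF features, the use of `LinearSVC` for the SVM, every hyperparameter, and the helper name `cross_validate_baselines` are hypothetical. The source only names the four model families and the 5-fold protocol.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def cross_validate_baselines(texts, labels, n_splits=5, seed=0):
    """Return {model_name: {"accuracy": mean, "macro_f1": mean}} over stratified folds."""
    # Model choices and settings are assumptions made for this sketch.
    models = {
        "logreg": LogisticRegression(max_iter=1000),
        "svm": LinearSVC(),
        "rf": RandomForestClassifier(n_estimators=100, random_state=seed),
        "gb": GradientBoostingClassifier(random_state=seed),
    }
    # Stratification keeps each fold's label mix close to the full dataset's.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        # The vectorizer sits inside the pipeline, so it is refit on each
        # training fold and never sees the held-out fold's vocabulary.
        pipeline = make_pipeline(TfidfVectorizer(), model)
        scores = cross_validate(
            pipeline, texts, labels, cv=cv, scoring=["accuracy", "f1_macro"]
        )
        results[name] = {
            "accuracy": float(scores["test_accuracy"].mean()),
            "macro_f1": float(scores["test_f1_macro"].mean()),
        }
    return results
```

With a long-tail label distribution, `StratifiedKFold` warns when a specialty has fewer samples than `n_splits`, so very rare classes may need to be merged or filtered before this runs cleanly.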

Results

Macro-F1 is the primary evaluation metric. Overall accuracy is driven largely by frequent classes such as Surgery, whereas macro-F1 weights every specialty equally and so gives a balanced evaluation across the long-tail label distribution.
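
To make the difference between the two metrics concrete, here is a small pure-Python sketch. The toy labels are invented for illustration and are not drawn from the dataset; the function names are likewise hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    f1_scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); a class with no hits at all scores 0.
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels (invented): one dominant class and two rare ones.
y_true = ["Surgery"] * 8 + ["Urology", "Neurology"]
y_pred = ["Surgery"] * 10  # a classifier that always predicts the majority class

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

On these toy labels the majority-class predictor reaches an accuracy of 0.8 but a macro-F1 of only about 0.30, because the two rare classes each contribute an F1 of zero to the average.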

Model Performance Leaderboard (5-Fold CV Means)

Model	Split	Accuracy	Macro-F1	Source
svm	cv5_mean	0.154332	0.120294	classical_cv
rf	cv5_mean	0.146776	0.065472	classical_cv
logreg	cv5_mean	0.274422	0.063190	classical_cv
gb	cv5_mean	0.151032	0.040418	classical_cv

Logistic regression maintains the highest accuracy across all five folds, ranging from 0.25 to 0.29. Gradient boosting accuracy drops sharply at fold five, while SVM and random forest remain stable near 0.15.

The SVM model relies heavily on specific clinical tokens to form decision boundaries. The top five predictive tokens by absolute magnitude are "sleep", "eye", "operation", "discharge", and "thyroid".
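
One way such a token ranking could be produced is sketched below. The source does not say how weights are aggregated across classes, so taking the largest absolute weight a token receives in any class is an assumption, as are the helper name `top_tokens_by_abs_weight` and the toy weights. With scikit-learn, the inputs would typically come from a fitted linear SVM's `coef_` attribute and the vectorizer's `get_feature_names_out()`.

```python
def top_tokens_by_abs_weight(coef, vocabulary, k=5):
    """Rank tokens by the largest absolute weight they receive in any class.

    coef: per-class weight rows, shape (n_classes, n_features)
    vocabulary: one token per feature column
    """
    scored = []
    for j, token in enumerate(vocabulary):
        # Aggregation rule (assumed): max absolute weight across class rows.
        scored.append((max(abs(row[j]) for row in coef), token))
    # Sort by descending magnitude, breaking ties alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [token for _, token in scored[:k]]


# Toy weights (invented): three classes, four tokens.
vocabulary = ["sleep", "eye", "patient", "thyroid"]
coef = [
    [1.2, -0.1, 0.05, 0.0],
    [-0.3, 0.9, 0.02, -0.1],
    [0.0, 0.1, -0.04, -1.5],
]

print(top_tokens_by_abs_weight(coef, vocabulary, k=3))
```

Using the absolute value means a token counts as influential whether it pushes strongly toward a class or strongly away from it. A generic word like "patient" ends up with a small weight in every row and falls out of the ranking.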

Conclusion

Logistic regression achieves the highest mean cross-validation accuracy (0.274), while SVM achieves the highest macro-F1 (0.120). Even the best macro-F1 is low, which is consistent with the severe class imbalance: the models struggle to generalize to the rare specialties.

Limitations and Future Work

Class Imbalance: Extreme label skew limits macro-F1 performance across all classical baselines.
Metric Consolidation: Held-out test metrics are currently unconsolidated. They appear as NaN in the broader pipeline outputs.
Transformer Integration: DistilBERT training requires patches for Transformers v5 API changes. Final metrics cannot be recorded until this is fixed.
