Perform exploratory data analysis (EDA) and advanced data preprocessing on simulated patient data leveraging Generative AI (GenAI), Large Language Models (LLMs), and Small Language Models (SLMs). The dataset will cover six vital signs (oxygen saturation, heart rate, temperature, blood pressure, weight, and blood glucose), questionnaire responses, and timestamps.
- Dataset Simulation using GenAI (3 marks)
  - Simulate a dataset representing 500 patients monitored over 1 month. Use GenAI to produce realistic numerical variations in vital signs and to generate plausible textual questionnaire responses or clinical notes, including scenarios with missing data.
- Exploratory Data Analysis (EDA) enhanced by LLMs (4 marks)
  - Conduct comprehensive exploratory data analysis using visualizations and statistical summaries.
  - Use Large Language Models (e.g., GPT-4) to interpret complex patterns, automatically summarize findings, identify trends and anomalies, and provide clinically relevant insights.
- Advanced Data Preprocessing utilizing SLMs/LLMs (4 marks)
  - Implement preprocessing techniques, including intelligent missing value handling, normalization, and categorical encoding.
  - Apply Small Language Models or fine-tuned LLMs to textual preprocessing tasks, such as classifying questionnaire responses, sentiment analysis, or textual data imputation.
- AI-Assisted Summary Report and Visualization (4 marks)
  - Prepare a short, insightful report (2-3 pages) summarizing the findings, preprocessing techniques, and key insights from the analysis.
  - Use LLMs to draft clear, coherent explanations for visualizations and data-driven insights.
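
The simulation and preprocessing tasks above can be pictured with a minimal, standard-library-only sketch. It is not the repository's implementation: it swaps GenAI for plain Gaussian sampling, draws each reading independently with no per-patient baseline, and omits the textual questionnaire data. The column names, the baseline means and standard deviations, the 5% missing rate, and the helper names `simulate`, `impute_median`, and `z_scores` are all assumptions made for illustration.

```python
import random
import statistics
from datetime import datetime, timedelta

# Hypothetical (mean, standard deviation) per vital sign; illustrative values only.
VITALS = {
    "oxygen_saturation": (97.0, 1.5),
    "heart_rate": (75.0, 10.0),
    "temperature": (36.8, 0.4),
    "systolic_bp": (120.0, 12.0),  # stands in for "blood pressure"
    "weight": (70.0, 12.0),
    "blood_glucose": (5.5, 1.0),
}


def simulate(n_patients=500, n_days=30, missing_rate=0.05, seed=42):
    """Return one record per patient per day, with some readings set to None."""
    rng = random.Random(seed)
    start = datetime(2025, 1, 1)  # arbitrary start date
    records = []
    for patient_id in range(1, n_patients + 1):
        for day in range(n_days):
            row = {"patient_id": patient_id, "timestamp": start + timedelta(days=day)}
            for name, (mean, sd) in VITALS.items():
                value = round(rng.gauss(mean, sd), 1)
                # Randomly drop readings to mimic missing data.
                row[name] = None if rng.random() < missing_rate else value
            records.append(row)
    return records


def impute_median(records, column):
    """Fill missing values in `column` with the median of the observed values."""
    observed = [r[column] for r in records if r[column] is not None]
    median = statistics.median(observed)
    for r in records:
        if r[column] is None:
            r[column] = median


def z_scores(records, column):
    """Return z-score normalized values for a fully populated column."""
    values = [r[column] for r in records]
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


records = simulate()
for column in VITALS:
    impute_median(records, column)
heart_rate_z = z_scores(records, "heart_rate")
print(len(records), "records; mean heart-rate z-score:",
      round(statistics.fmean(heart_rate_z), 6))
```

Median imputation and z-score normalization are used here only as simple baselines; the "intelligent" missing value handling and the LLM/SLM-based text processing the assignment asks for would replace or extend these steps.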
Clone the repository to your local machine:
git clone https://github.com/keanteng/wqd7005-assignment-1
Make sure to install `tex-live` and `latex-workshop` to compile the LaTeX. `tex-live` can be downloaded online, and `latex-workshop` can be installed from the VS Code extensions marketplace.