Appearing in the NeurIPS 2022 Datasets and Benchmarks track.
We are releasing under the CC-BY licence a new large-scale dataset for Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the medical domain.
The dataset contains patients synthesized using a proprietary medical knowledge base and a commercial rule-based ASD system. Patients in the dataset are characterized by their socio-demographic data, a pathology they are suffering from, a set of symptoms and antecedents related to this pathology, and a differential diagnosis. The symptoms and antecedents can be binary, categorical and multi-choice, with the potential of leading to more efficient and natural interactions between ASD/AD systems and patients.
To the best of our knowledge, this is the first large-scale dataset that includes the differential diagnosis, and non-binary symptoms and antecedents.
- DDXPlus: A New Dataset For Automatic Medical Diagnosis
- Our paper is available on arXiv.
- The dataset in French is hosted on figshare.
  - This is the original version of DDXPlus that all results in our paper were obtained on.
- Starting from 9 May 2023, the dataset is also available in English for easier use. This version is hosted on figshare.
  - The English version of DDXPlus contains the same data in the same format as the French version.
  - Wherever possible, English names or non-semantic codes are used instead of French names.
  - Using the English version should lead to the same performance as using the French version.
Looking for a text version?
We're hoping to release a text version of DDXPlus in the coming weeks/months! Meanwhile, you can check out this script to convert the dataset into a conversational format.
In what follows, we use the term evidence as a general term to refer to a symptom or an antecedent. The dataset contains the following files:
- `release_evidences.json`: a JSON file describing all possible evidences considered in the dataset.
- `release_conditions.json`: a JSON file describing all pathologies considered in the dataset.
- `release_train_patients.zip`: a CSV file containing the patients of the training set.
- `release_validate_patients.zip`: a CSV file containing the patients of the validation set.
- `release_test_patients.zip`: a CSV file containing the patients of the test set.
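As a minimal sketch of how these files could be read with the Python standard library: the helper names `load_json` and `iter_patients` are ours, not part of the release, and we assume each `.zip` archive holds a single UTF-8 CSV file with a header row.

```python
import csv
import io
import json
import zipfile


def load_json(path):
    """Load release_evidences.json or release_conditions.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_patients(zip_path):
    """Yield one dict per patient from a zipped CSV such as release_train_patients.zip.

    Assumption: the archive contains exactly one CSV file with a header row.
    """
    with zipfile.ZipFile(zip_path) as archive:
        csv_name = archive.namelist()[0]
        with archive.open(csv_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text)
```

Every value yielded by `csv.DictReader` is a string, so numeric and list-valued columns still need to be parsed after loading.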
Each evidence in the release_evidences.json file is described using the following entries:
- `name`: name of the evidence.
  - In the English version, this is replaced with a unique, non-semantic code starting with `E`.
- `code_question`: a code identifying which evidences are related. Evidences having the same `code_question` form a group of related symptoms. The value of `code_question` refers to the evidence that needs to be simulated/activated for the other members of the group to eventually be simulated.
- `question_fr`: the query, in French, associated with the evidence.
- `question_en`: the query, in English, associated with the evidence.
- `is_antecedent`: a flag indicating whether the evidence is an antecedent or a symptom.
- `data_type`: the type of the evidence. We use "B" for binary, "C" for categorical, and "M" for multi-choice.
- `default_value`: the default value of the evidence. If this value is used to characterize the evidence, then it is as if the evidence was not synthesized.
- `possible-values`: the possible values for the evidence. Only valid for categorical and multi-choice evidences.
  - In the English version, every value is replaced with a unique, non-semantic code starting with `V`.
- `value_meaning`: the meaning, in French and English, of each code that is part of the `possible-values` field. Only valid for categorical and multi-choice evidences.
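To illustrate how these entries fit together, here is a small sketch that turns a value code into its readable meaning and groups evidences by `code_question`. The helpers `value_label` and `group_by_code_question` are hypothetical names of ours. They work on individual evidence entries, so they make no assumption about the top-level layout of `release_evidences.json`. The fallback to the raw value when no meaning is listed is also our choice.

```python
DATA_TYPES = {"B": "binary", "C": "categorical", "M": "multi-choice"}


def value_label(evidence, value, lang="en"):
    """Return the readable meaning of a value code, or the raw value if none is listed."""
    meaning = evidence.get("value_meaning", {}).get(str(value))
    return meaning[lang] if meaning else str(value)


def group_by_code_question(evidences):
    """Map each code_question to the names of the evidences that share it."""
    groups = {}
    for ev in evidences:
        groups.setdefault(ev["code_question"], []).append(ev["name"])
    return groups


# Entry abridged from the E_130 example in this README.
rash_color = {
    "name": "E_130",
    "code_question": "E_129",
    "data_type": "C",
    "default_value": "V_11",
    "possible-values": ["V_11", "V_86", "V_107"],
    "value_meaning": {
        "V_11": {"fr": "NA", "en": "NA"},
        "V_86": {"fr": "foncée", "en": "dark"},
        "V_107": {"fr": "jaune", "en": "yellow"},
    },
}

print(DATA_TYPES[rash_color["data_type"]])      # categorical
print(value_label(rash_color, "V_86"))          # dark
print(value_label(rash_color, "V_107", "fr"))   # jaune
```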
English
{
"name": "E_130",
"code_question": "E_129",
"question_fr": "De quelle couleur sont les lésions?",
"question_en": "What color is the rash?",
"is_antecedent": false,
"default_value": "V_11",
"value_meaning": {
"V_11": {"fr": "NA", "en": "NA"},
"V_86": {"fr": "foncée", "en": "dark"},
"V_107": {"fr": "jaune", "en": "yellow"},
"V_138": {"fr": "pâle", "en": "pale"},
"V_156": {"fr": "rose", "en": "pink"},
"V_157": {"fr": "rouge", "en": "red"}
},
"possible-values": [
"V_11",
"V_86",
"V_107",
"V_138",
"V_156",
"V_157"
],
"data_type": "C"
}

French
{
"name": "lesions_peau_couleur",
"code_question": "lesions_peau",
"question_fr": "De quelle couleur sont les lésions?",
"question_en": "What color is the rash?",
"is_antecedent": false,
"default_value": "NA",
"value_meaning": {
"NA": {"fr": "NA", "en": "NA"},
"foncee": {"fr": "foncée", "en": "dark"},
"jaune": {"fr": "jaune", "en": "yellow"},
"pale": {"fr": "pâle", "en": "pale"},
"rose": {"fr": "rose", "en": "pink"},
"rouge": {"fr": "rouge","en": "red"}
},
"possible-values": [
"NA",
"foncee",
"jaune",
"pale",
"rose",
"rouge"
],
"data_type": "C"
}

The file `release_conditions.json` contains information about the pathologies that patients in the dataset may suffer from. Each pathology has the following attributes:
- `condition_name`: name of the pathology.
  - In the English version, the English name is used instead of the French name.
- `cond-name-fr`: name of the pathology in French.
- `cond-name-eng`: name of the pathology in English.
- `icd10-id`: ICD-10 code of the pathology.
- `severity`: the severity associated with the pathology. The lower the value, the more severe the pathology.
- `symptoms`: data structure describing the set of symptoms characterizing the pathology. Each symptom is represented by its corresponding `name` entry in the `release_evidences.json` file.
- `antecedents`: data structure describing the set of antecedents characterizing the pathology. Each antecedent is represented by its corresponding `name` entry in the `release_evidences.json` file.
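A short sketch of how a condition entry might be consumed: the helper names below are ours and hypothetical. They operate on individual condition entries, so nothing is assumed about the top-level layout of `release_conditions.json`.

```python
def condition_evidence_names(condition):
    """Return (symptom names, antecedent names) of one condition entry, in file order."""
    return list(condition["symptoms"]), list(condition["antecedents"])


def most_severe_first(conditions):
    """Sort condition entries by `severity`; lower values are more severe."""
    return sorted(conditions, key=lambda cond: cond["severity"])


# Entry abridged from the Myasthenia gravis example in this README.
myasthenia = {
    "condition_name": "Myasthenia gravis",
    "icd10-id": "G70.0",
    "symptoms": {"E_65": {}, "E_63": {}, "E_52": {}},
    "antecedents": {"E_28": {}, "E_204": {}},
    "severity": 3,
}

symptoms, antecedents = condition_evidence_names(myasthenia)
print(symptoms)     # ['E_65', 'E_63', 'E_52']
print(antecedents)  # ['E_28', 'E_204']
```

The returned names can then be looked up among the `name` entries of `release_evidences.json` to retrieve the associated questions and value meanings.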
English
{
"condition_name": "Myasthenia gravis",
"cond-name-fr": "Myasthénie grave",
"cond-name-eng": "Myasthenia gravis",
"icd10-id": "G70.0",
"symptoms": {
"E_65": {},
"E_63": {},
"E_52": {},
"E_172": {},
"E_84": {},
"E_66": {},
"E_90": {},
"E_38": {},
"E_176": {}
},
"antecedents": {
"E_28": {},
"E_204": {}
},
"severity": 3
}

French
{
"condition_name": "Myasthénie grave",
"cond-name-fr": "Myasthénie grave",
"cond-name-eng": "Myasthenia gravis",
"icd10-id": "G70.0",
"symptoms": {
"dysphagie": {},
"dysarthrie": {},
"diplopie": {},
"ptose": {},
"faiblesse_msmi": {},
"dyspn": {},
"fatigabilité_msk": {},
"claud_mâchoire": {},
"rds_paralys_gen": {}
},
"antecedents": {
"atcdfam_mg": {},
"trav1": {}
},
"severity": 3
}

Each patient in each of the 3 sets has the following attributes:
- `AGE`: the age of the synthesized patient.
- `SEX`: the sex of the synthesized patient.
- `PATHOLOGY`: name of the ground truth pathology (cf. the `condition_name` property in the `release_conditions.json` file) that the synthesized patient is suffering from.
- `EVIDENCES`: list of evidences experienced by the patient. An evidence can be binary, categorical or multi-choice. A categorical or multi-choice evidence is represented in the format `[evidence-name]_@_[evidence-value]`, where `[evidence-name]` is the name of the evidence (`name` entry in the `release_evidences.json` file) and `[evidence-value]` is a value from the `possible-values` entry. Note that for a multi-choice evidence, it is possible to have several `[evidence-name]_@_[evidence-value]` items in the evidence list, with each item being associated with a different evidence value. A binary evidence is represented as `[evidence-name]`.
- `INITIAL_EVIDENCE`: the evidence provided by the patient to kick-start an interaction with an ASD/AD system. This is useful during model evaluation for a fair comparison of ASD/AD systems, as they will all begin an interaction with a given patient from the same starting point. The initial evidence is randomly selected from the evidence list mentioned above (i.e., `EVIDENCES`) and it is part of this list.
- `DIFFERENTIAL_DIAGNOSIS`: the ground truth differential diagnosis for the patient. It is represented as a list of pairs of the form `[[patho_1, proba_1], [patho_2, proba_2], ...]`, where `patho_i` is the pathology name (`condition_name` entry in the `release_conditions.json` file) and `proba_i` is its related probability.
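The `_@_` convention and the differential format described above can be decoded with a few lines of Python. This is only a sketch: the function names are ours, and we assume that list-valued CSV cells are serialized as Python- or JSON-style list literals, which `ast.literal_eval` can parse. Check the actual files before relying on that.

```python
import ast

SEPARATOR = "_@_"


def split_evidence(item):
    """Split one EVIDENCES item into (name, value); value is None for a binary evidence."""
    name, sep, value = item.partition(SEPARATOR)
    return name, (value if sep else None)


def collect_evidences(items):
    """Group values by evidence name; a multi-choice evidence may collect several values."""
    grouped = {}
    for item in items:
        name, value = split_evidence(item)
        values = grouped.setdefault(name, [])
        if value is not None:
            values.append(value)
    return grouped


def parse_list_cell(cell):
    """Parse a list-valued CSV cell (assumed to be a Python- or JSON-style list literal)."""
    return ast.literal_eval(cell)


def pathology_rank(differential, pathology):
    """1-based rank of `pathology` in the differential by descending probability, or None."""
    ranked = sorted(differential, key=lambda pair: pair[1], reverse=True)
    names = [name for name, _ in ranked]
    return names.index(pathology) + 1 if pathology in names else None


# Values abridged from the English patient example in this README.
evidences = parse_list_cell('["E_48", "E_54_@_V_161", "E_54_@_V_183", "E_56_@_4", "E_91"]')
differential = [["Bronchitis", 0.19171203430383882],
                ["Pneumonia", 0.17579340398940366],
                ["URTI", 0.1607809719801254]]

print(collect_evidences(evidences))
# {'E_48': [], 'E_54': ['V_161', 'V_183'], 'E_56': ['4'], 'E_91': []}
print(pathology_rank(differential, "URTI"))  # 3
```

Note that in the example patient the ground truth pathology (`URTI`) is not the top-ranked entry of its own differential, so the two fields carry different information.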
English
{
"AGE": 18,
"DIFFERENTIAL_DIAGNOSIS": [["Bronchitis", 0.19171203430383882], ["Pneumonia", 0.17579340398940366], ["URTI", 0.1607809719801254], ["Bronchiectasis", 0.12429044460990353], ["Tuberculosis", 0.11367177304035844], ["Influenza", 0.11057936110639896], ["HIV (initial infection)", 0.07333003867293564], ["Chagas", 0.04984197229703562]],
"SEX": "M",
"PATHOLOGY": "URTI",
"EVIDENCES": ["E_48", "E_50", "E_53", "E_54_@_V_161", "E_54_@_V_183", "E_55_@_V_89", "E_55_@_V_108", "E_55_@_V_167", "E_56_@_4", "E_57_@_V_123", "E_58_@_3", "E_59_@_3", "E_77", "E_79", "E_91", "E_97", "E_201", "E_204_@_V_10", "E_222"],
"INITIAL_EVIDENCE": "E_91"
}

French
{
"AGE": 18,
"DIFFERENTIAL_DIAGNOSIS": [["Bronchite", 0.19171203430383882], ["Pneumonie", 0.17579340398940366],["IVRS ou virémie", 0.1607809719801254], ["Bronchiectasies", 0.12429044460990353], ["Tuberculose", 0.11367177304035844], ["Possible influenza ou syndrome virémique typique", 0.11057936110639896], ["VIH (Primo-infection)", 0.07333003867293564], ["Chagas", 0.04984197229703562]],
"SEX": "M",
"PATHOLOGY": "IVRS ou virémie",
"EVIDENCES": ["crowd", "diaph", "douleurxx", "douleurxx_carac_@_sensible", "douleurxx_carac_@_une_lourdeur_ou_serrement", "douleurxx_endroitducorps_@_front", "douleurxx_endroitducorps_@_joue_D_", "douleurxx_endroitducorps_@_tempe_G_", "douleurxx_intens_@_4", "douleurxx_irrad_@_nulle_part", "douleurxx_precis_@_3", "douleurxx_soudain_@_3", "expecto", "f17.210", "fievre", "gorge_dlr", "toux", "trav1_@_N", "z77.22"],
"INITIAL_EVIDENCE": "fievre"
}

| | Binary | Categorical | Multi-choice | Total |
|---|---|---|---|---|
| Evidences | 208 | 10 | 5 | 223 |
| Symptoms | 96 | 9 | 5 | 110 |
| Antecedents | 112 | 1 | 0 | 113 |

| | Avg | Std dev | Min | 1st quartile | Median | 3rd quartile | Max |
|---|---|---|---|---|---|---|---|
| Evidences | 13.56 | 5.06 | 1 | 10 | 13 | 17 | 36 |
| Symptoms | 10.07 | 4.69 | 1 | 8 | 10 | 12 | 25 |
| Antecedents | 3.49 | 2.23 | 0 | 2 | 3 | 5 | 12 |
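For readers who want to recompute figures of this kind, the sketch below counts evidences by type and summarizes per-patient evidence counts. It rests on several assumptions of ours that this README does not confirm: that the second table reports the number of evidences per patient, that evidences are counted by distinct name, that the standard deviation is the population one, and that quartiles use the "inclusive" method. All function names are hypothetical.

```python
import statistics
from collections import Counter


def count_by_type(evidences):
    """Count evidence entries by (symptom or antecedent, data_type)."""
    counts = Counter()
    for ev in evidences:
        kind = "antecedent" if ev["is_antecedent"] else "symptom"
        counts[(kind, ev["data_type"])] += 1
    return counts


def distinct_evidence_count(items):
    """Number of distinct evidence names in one patient's EVIDENCES list."""
    return len({item.partition("_@_")[0] for item in items})


def summarize(counts):
    """Summary statistics for a list of per-patient counts (needs at least two values)."""
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "avg": statistics.mean(counts),
        "std": statistics.pstdev(counts),  # population standard deviation (assumption)
        "min": min(counts),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(counts),
    }
```

If the results differ slightly from the tables, the counting convention or the quartile method is the first thing to check.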
Code for reproducing the results in the paper can be found in the `code` directory.
In our paper, we reported results for two methods: an RL-based method, AARLC, and a supervised method, BASD, which is adapted from ASD. For instructions on how to run them, see here for AARLC and here for BASD.



