Commit a602b42

Merge pull request #117 from ncihtan/update-id-regex
Refine HTAN identifier regex patterns and examples
2 parents 1fd21d5 + b1071c7 commit a602b42

1 file changed: +78 -10 lines changed

data_model/identifiers.md

Lines changed: 78 additions & 10 deletions
@@ -89,18 +89,86 @@ HTA209_EXT3_D590
8989
```
9090

9191
### Phase 2 Regex Validation
92-
HTAN Phase 2 Identifiers can be validated with the following regex patterns.
9392

94-
Participant IDs:
95-
```
96-
^(?=.{1,50}$)(?P<center>HTA20[0-9])_(?P<participant>(?:0000|EXT\d+|\d+))
97-
```
93+
These regular expressions validate HTAN identifiers by enforcing a center prefix in the range HTA200–HTA229, a middle identifier (`0000`, numeric, or EXT-based), a type-specific suffix (`_B` for biospecimens, `_D` for data files), and a total length of at most 50 characters.
9894

99-
Biospecimen and Data File IDs:
100-
```
101-
^(?=.{1,50}$)(?P<center>HTA20[0-9])_(?P<participant>(?:0000|EXT\d+|\d+))_(?P<id>(B|D)\d+)$
102-
```
95+
#### HTAN Data File ID
96+
**Regex:**
97+
`^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(D[0-9]{1,20})$`
98+
99+
**Examples:**
100+
* `HTA201_12345_D1` (Valid)
101+
* `HTA201_12345_B1` (Invalid: ends with `_B` instead of `_D`)
102+
* `HTA250_12345_D1` (Invalid: prefix `HTA250` is out of the valid range)
103+
104+
#### HTAN Participant ID
105+
**Regex:**
106+
`^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})$`
107+
108+
**Examples:**
109+
* `HTA210_EXT999` (Valid)
110+
* `HTA210_EXT999_B1` (Invalid: contains a suffix, which is not allowed)
111+
* `HTA199_EXT999` (Invalid: prefix `HTA199` is out of the valid range)
112+
113+
#### HTAN Biospecimen ID
114+
**Regex:**
115+
`^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(B[0-9]{1,20})$`
116+
117+
**Examples:**
118+
* `HTA220_55555_B2` (Valid)
119+
* `HTA220_55555_D2` (Invalid: ends with `_D` instead of `_B`)
120+
* `HTA220_55555` (Invalid: missing the mandatory `_B` suffix)
121+
122+
#### HTAN Parent ID (from biospecimen)
123+
*Matches a Participant ID OR a Biospecimen ID.*
124+
125+
**Regex:**
126+
`^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})(?:_(B[0-9]{1,20}))?$`
127+
128+
**Examples:**
129+
* `HTA205_1001_B5` (Valid Biospecimen ID used as parent)
130+
* `HTA205_1001` (Valid Participant ID used as parent)
131+
* `HTA205_1001_D5` (Invalid: contains a `_D` suffix; only no suffix or `_B` is allowed)
132+
* `HTA205_` (Invalid: missing the middle ID section)
103133

134+
#### HTAN Parent ID (from core)
135+
*Matches a Biospecimen ID OR a Data File ID.*
136+
137+
**Regex:**
138+
`^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$`
139+
140+
**Examples:**
141+
* `HTA215_777_D1` (Valid Data File ID used as parent)
142+
* `HTA215_777` (Invalid: missing the mandatory suffix; must be `_B` or `_D`)
143+
* `HTA215_777_A1` (Invalid: suffix `_A` is not allowed)
144+
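The five patterns above can be exercised against their own examples. The following is a minimal sketch in Python, assuming the standard `re` module; the dictionary keys and the helper names `is_valid` and `matching_kinds` are hypothetical and not part of the HTAN data model. The pattern strings are copied verbatim from the sections above.

```python
import re

# Pattern strings copied verbatim from the sections above.
# The dictionary keys are illustrative labels, not official HTAN names.
PATTERNS = {
    "data_file": r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(D[0-9]{1,20})$",
    "participant": r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})$",
    "biospecimen": r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(B[0-9]{1,20})$",
    "parent_from_biospecimen": r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})(?:_(B[0-9]{1,20}))?$",
    "parent_from_core": r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$",
}

COMPILED = {kind: re.compile(pattern) for kind, pattern in PATTERNS.items()}


def is_valid(kind: str, identifier: str) -> bool:
    """Return True if `identifier` matches the pattern registered under `kind`."""
    # fullmatch also rejects a trailing newline, which `$` alone would tolerate.
    return COMPILED[kind].fullmatch(identifier) is not None


def matching_kinds(identifier: str) -> list[str]:
    """Return the sorted list of pattern labels that accept `identifier`."""
    return sorted(kind for kind in COMPILED if is_valid(kind, identifier))


if __name__ == "__main__":
    for candidate in ("HTA201_12345_D1", "HTA205_1001", "HTA250_12345_D1"):
        print(candidate, matching_kinds(candidate))
```

Because the parent patterns are unions of the basic ones, a single identifier can satisfy several patterns at once, so a caller needs to know which field it is validating rather than inferring the type from the first match.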
145+
### Regex Structure Explanation
146+
The following breakdown uses the **HTAN Parent ID (from core)** as an example, as it contains all component parts used across the identifiers.
147+
148+
**Pattern:** `^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$`
149+
150+
1. **`^(?=.{1,50}$)`**
151+
* **Start of string (`^`):** Ensures the match starts at the very beginning.
152+
* **Lookahead (`(?=...)`):** Checks that the total length of the string is between 1 and 50 characters before proceeding with the specific matching.
153+
154+
2. **`(HTA2[0-2][0-9])`**
155+
* **Center ID:** Matches the literal `HTA` followed by a number from 200 to 229 inclusive (`2`, then a digit `0-2`, then a digit `0-9`).
156+
157+
3. **`_`**
158+
* **Separator:** Literal underscore character separating the Center ID from the Participant ID.
159+
160+
4. **`(0000|EXT[0-9]{1,18}|[0-9]{1,21})`**
161+
* **Participant ID:** Matches one of three valid formats:
162+
* `0000` (Standard zero ID)
163+
* `EXT` followed by 1 to 18 digits (External ID)
164+
* 1 to 21 digits (Standard numeric ID)
165+
166+
5. **`_`**
167+
* **Separator:** Literal underscore character.
168+
169+
6. **`([BD][0-9]{1,20})$`**
170+
* **Suffix & End:** Matches either `B` (Biospecimen) or `D` (Data File), followed by 1 to 20 digits.
171+
* **End of string (`$`):** Ensures there are no extra characters after the ID.
104172
## Phase 1 HTAN IDs
105173

106174
### Phase 1 Participant IDs
@@ -150,4 +218,4 @@ If a data file is derived from an external control participant, the biospecimen
150218
HTA4_EXT1_1
151219
HTA4_EXT2_2
152220
HTA4_EXT3_3
153-
```
221+
```
