@@ -89,18 +89,86 @@ HTA209_EXT3_D590
8989```
9090
9191### Phase 2 Regex Validation
92- HTAN Phase 2 Identifiers can be validated with the following regex patterns.
9392
94- Participant IDs:
95- ```
96- ^(?=.{1,50}$)(?P<center>HTA20[0-9])_(?P<participant>(?:0000|EXT\d+|\d+))
97- ```
93+ These regular expressions validate HTAN identifiers by enforcing a center prefix in the range HTA200–HTA229, a participant segment (numeric or EXT-based), and type-specific suffix rules for data files and biospecimens.
9894
99- Biospecimen and Data File IDs:
100- ```
101- ^(?=.{1,50}$)(?P<center>HTA20[0-9])_(?P<participant>(?:0000|EXT\d+|\d+))_(?P<id>(B|D)\d+)$
102- ```
95+ #### HTAN Data File ID
96+ **Regex:**
97+ `^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(D[0-9]{1,20})$`
98+
99+ **Examples:**
100+ * ✅ `HTA201_12345_D1`
101+ * ❌ `HTA201_12345_B1` (Ends with _B instead of _D)
102+ * ❌ `HTA250_12345_D1` (Prefix HTA250 is out of valid range)
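
As a minimal sketch of how the Data File pattern might be applied in Python (the function names here are illustrative and not part of HTAN tooling), note that Python's `$` also matches just before a single trailing newline, so `re.match` accepts `"HTA201_12345_D1\n"` while `re.fullmatch` does not. Other regex engines may behave differently.

```python
import re

# Pattern copied verbatim from the Data File ID section.
DATA_FILE_ID = re.compile(
    r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(D[0-9]{1,20})$"
)


def is_data_file_id(identifier: str) -> bool:
    """Strict check: the whole string must be a Data File ID."""
    return DATA_FILE_ID.fullmatch(identifier) is not None


def is_data_file_id_lenient(identifier: str) -> bool:
    """Looser check with re.match: tolerates one trailing newline,
    because `$` matches just before a newline at the end of the string."""
    return DATA_FILE_ID.match(identifier) is not None
```

If identifiers are read line by line from a file, stripping whitespace first or using `fullmatch` avoids silently accepting values that still carry a newline.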
103+
104+ #### HTAN Participant ID
105+ **Regex:**
106+ `^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})$`
107+
108+ **Examples:**
109+ * ✅ `HTA210_EXT999`
110+ * ❌ `HTA210_EXT999_B1` (Contains a suffix, which is not allowed)
111+ * ❌ `HTA199_EXT999` (Prefix HTA199 is out of valid range)
112+
113+ #### HTAN Biospecimen ID
114+ **Regex:**
115+ `^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(B[0-9]{1,20})$`
116+
117+ **Examples:**
118+ * ✅ `HTA220_55555_B2`
119+ * ❌ `HTA220_55555_D2` (Ends with _D instead of _B)
120+ * ❌ `HTA220_55555` (Missing the mandatory _B suffix)
121+
122+ #### HTAN Parent ID (from biospecimen)
123+ *Matches a Participant ID OR a Biospecimen ID.*
124+
125+ **Regex:**
126+ `^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})(?:_(B[0-9]{1,20}))?$`
127+
128+ **Examples:**
129+ * ✅ `HTA205_1001_B5`
130+ * ✅ `HTA205_1001` (Valid Participant ID used as parent)
131+ * ❌ `HTA205_1001_D5` (Contains _D suffix; only no suffix or _B allowed)
132+ * ❌ `HTA205_` (Missing the middle ID number section)
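
Because the `_B` suffix sits in an optional group, the same match can also tell the two parent kinds apart. The sketch below assumes Python's `re` module; `classify_parent` is a hypothetical helper name, not an existing HTAN function.

```python
import re
from typing import Optional

# Pattern copied verbatim from the "Parent ID (from biospecimen)" section.
PARENT_FROM_BIOSPECIMEN = re.compile(
    r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})(?:_(B[0-9]{1,20}))?$"
)


def classify_parent(identifier: str) -> Optional[str]:
    """Return 'participant', 'biospecimen', or None if the ID is invalid."""
    match = PARENT_FROM_BIOSPECIMEN.fullmatch(identifier)
    if match is None:
        return None
    # Group 3 is the optional B suffix; it is None when the suffix is absent.
    return "biospecimen" if match.group(3) is not None else "participant"
```

For the examples listed above, `classify_parent("HTA205_1001_B5")` yields `"biospecimen"`, `classify_parent("HTA205_1001")` yields `"participant"`, and both invalid examples yield `None`.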
103133
134+ #### HTAN Parent ID (from core)
135+ *Matches a Biospecimen ID OR a Data File ID.*
136+
137+ **Regex:**
138+ `^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$`
139+
140+ **Examples:**
141+ * ✅ `HTA215_777_D1`
142+ * ❌ `HTA215_777` (Missing mandatory suffix; must be _B or _D)
143+ * ❌ `HTA215_777_A1` (Suffix _A is invalid)
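
The five patterns share the same prefix and differ only in the suffix, so one way to keep them in sync is to build them from a common fragment and check them against the examples listed above. This is a sketch under that assumption; the dictionary keys and helper names are illustrative only.

```python
import re

# Shared part of every Phase 2 pattern: length check, center, participant.
PREFIX = r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})"

PATTERNS = {
    "data_file": re.compile(PREFIX + r"_(D[0-9]{1,20})$"),
    "participant": re.compile(PREFIX + r"$"),
    "biospecimen": re.compile(PREFIX + r"_(B[0-9]{1,20})$"),
    "parent_from_biospecimen": re.compile(PREFIX + r"(?:_(B[0-9]{1,20}))?$"),
    "parent_from_core": re.compile(PREFIX + r"_([BD][0-9]{1,20})$"),
}

# (kind, identifier, expected validity), taken from the examples in this section.
EXAMPLES = [
    ("data_file", "HTA201_12345_D1", True),
    ("data_file", "HTA201_12345_B1", False),
    ("data_file", "HTA250_12345_D1", False),
    ("participant", "HTA210_EXT999", True),
    ("participant", "HTA210_EXT999_B1", False),
    ("participant", "HTA199_EXT999", False),
    ("biospecimen", "HTA220_55555_B2", True),
    ("biospecimen", "HTA220_55555_D2", False),
    ("biospecimen", "HTA220_55555", False),
    ("parent_from_biospecimen", "HTA205_1001_B5", True),
    ("parent_from_biospecimen", "HTA205_1001", True),
    ("parent_from_biospecimen", "HTA205_1001_D5", False),
    ("parent_from_biospecimen", "HTA205_", False),
    ("parent_from_core", "HTA215_777_D1", True),
    ("parent_from_core", "HTA215_777", False),
    ("parent_from_core", "HTA215_777_A1", False),
]


def is_valid(kind: str, identifier: str) -> bool:
    """True if `identifier` fully matches the pattern registered for `kind`."""
    return PATTERNS[kind].fullmatch(identifier) is not None


def failing_examples():
    """Return the examples whose actual result differs from the documented one."""
    return [(k, i) for k, i, expected in EXAMPLES if is_valid(k, i) != expected]
```

An empty list from `failing_examples()` indicates the patterns agree with every documented example.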
144+
145+ ### Regex Structure Explanation
146+ The following breakdown uses the **HTAN Parent ID (from core)** as an example, as it contains all component parts used across the identifiers.
147+
148+ **Pattern:** `^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$`
149+
150+ 1. **`^(?=.{1,50}$)`**
151+    * **Start of string (`^`):** Ensures the match starts at the very beginning.
152+    * **Lookahead (`(?=...)`):** Checks that the total length of the string is between 1 and 50 characters before proceeding with the specific matching.
153+
154+ 2. **`(HTA2[0-2][0-9])`**
155+    * **Center ID:** Matches the literal `HTA` followed by a three-digit number from 200 to 229 inclusive (`2`, then one digit in `0-2`, then one digit in `0-9`).
156+
157+ 3. **`_`**
158+    * **Separator:** Literal underscore character separating the Center ID from the Participant ID.
159+
160+ 4. **`(0000|EXT[0-9]{1,18}|[0-9]{1,21})`**
161+    * **Participant ID:** Matches one of three valid formats:
162+      * `0000` (Standard zero ID)
163+      * `EXT` followed by 1 to 18 digits (External ID)
164+      * 1 to 21 digits (Standard numeric ID)
165+
166+ 5. **`_`**
167+    * **Separator:** Literal underscore character.
168+
169+ 6. **`([BD][0-9]{1,20})$`**
170+    * **Suffix & End:** Matches either `B` (Biospecimen) or `D` (Data File), followed by 1 to 20 digits.
171+    * **End of string (`$`):** Ensures there are no extra characters after the ID.
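
The six parts above can be sketched as one Python pattern assembled piece by piece. The group names `center` and `participant` follow the earlier named-group version of these patterns; `suffix` and the helper `parse_core_parent` are names chosen here purely for illustration.

```python
import re
from typing import Dict, Optional

NAMED_CORE_PARENT = re.compile(
    r"^(?=.{1,50}$)"                                     # 1. start + length check
    r"(?P<center>HTA2[0-2][0-9])"                        # 2. center ID, HTA200 to HTA229
    r"_"                                                 # 3. separator
    r"(?P<participant>0000|EXT[0-9]{1,18}|[0-9]{1,21})"  # 4. participant ID
    r"_"                                                 # 5. separator
    r"(?P<suffix>[BD][0-9]{1,20})$"                      # 6. B/D suffix + end
)


def parse_core_parent(identifier: str) -> Optional[Dict[str, str]]:
    """Split a valid ID into its named components, or return None."""
    match = NAMED_CORE_PARENT.fullmatch(identifier)
    return match.groupdict() if match else None


# Longest string the bounded quantifiers allow: 6 + 1 + 21 + 1 + 21 characters.
LONGEST_ID = "HTA229_" + "9" * 21 + "_B" + "0" * 20
```

Adding up the maximum widths of each component (6 + 1 + 21 + 1 + 21) gives exactly 50 characters, so for this pattern the longest ID the quantifiers permit sits right at the lookahead's limit.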
104172## Phase 1 HTAN IDs
105173
106174### Phase 1 Participant IDs
@@ -150,4 +218,4 @@ If a data file is derived from an external control participant, the biospecimen
150218HTA4_EXT1_1
151219HTA4_EXT2_2
152220HTA4_EXT3_3
153- ```
221+ ```