Reading bytes

Matthew Archer edited this page Mar 18, 2021 · 1 revision

Originally written by Matthew Archer (@Fincap) 23/04/2020 on GitHub

1 Data Element

Note: This document heavily references the DICOM Standard maintained and published by NEMA.

1.1 Tags

A DICOM file consists of a sequence of Data Elements (the Data Set) which describe instances of real-world information. Each Data Element is uniquely identified by a tag consisting of two parts: the Group Number and the Element Number. Although similar or related Data Elements often share a Group Number, the group itself does not convey any semantic meaning. A given Data Element occurs at most once in a Data Set, although the same tag may appear again inside nested Data Sets (the items of a Sequence). A tag is encoded as two 16-bit unsigned integers, the Group Number followed by the Element Number; for example, the Data Element Modality has the tag (0008,0060).
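As a minimal sketch of the tag structure described above, a tag can be held as a pair of integers and printed in the conventional (gggg,eeee) notation. The helper name format_tag is hypothetical and chosen only for illustration; it is not taken from the Standard or from any library.

```python
def format_tag(group: int, element: int) -> str:
    """Format a tag as (gggg,eeee); both parts must fit in 16 bits."""
    for part in (group, element):
        if not 0 <= part <= 0xFFFF:
            raise ValueError("Group and Element Numbers are 16-bit unsigned integers")
    return f"({group:04X},{element:04X})"


MODALITY = (0x0008, 0x0060)  # Group Number, Element Number
print(format_tag(*MODALITY))  # (0008,0060)
```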

1.2 Value Representation

Each Data Element has a Value Representation (VR) which describes the data type and format of the Data Element's values. A VR constrains the permitted length of the value (either a fixed length or a maximum) and, for character strings, which characters are permitted. When a VR is encoded, it is written as two uppercase letters from the DICOM default character set (i.e. A - Z). A list of all VRs and their specifications can be found in the DICOM Standard PS3.5 Section 6.2.

1.3 Transfer Syntax

Every DICOM file has a Transfer Syntax which communicates how the subsequent data is encoded. The default Transfer Syntax provided by the Standard is DICOM Implicit VR Little Endian Transfer Syntax.

The first attribute of this Transfer Syntax is Implicit VR. This means that each Data Element outside the File Meta Information header (more on this later) does not declare its VR, because every standard tag already has a VR defined for it in the Standard. In an Explicit VR Transfer Syntax, by contrast, the VR is declared within each Data Element.
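To illustrate what "implicit" means in practice, a decoder has to look the VR up from the tag. The sketch below assumes a tiny, hand-written excerpt of the data dictionary (the full dictionary is published in DICOM PS3.6); the names IMPLICIT_VR and lookup_vr are hypothetical, and falling back to UN for unknown tags is a simplification made for this example.

```python
# Heavily abbreviated, hand-written excerpt of the data dictionary.
IMPLICIT_VR = {
    (0x0008, 0x0060): "CS",  # Modality
    (0x0010, 0x0010): "PN",  # Patient's Name
    (0x0010, 0x0020): "LO",  # Patient ID
}


def lookup_vr(group: int, element: int) -> str:
    """Return the VR implied by a tag, or UN (Unknown) if it is not listed."""
    return IMPLICIT_VR.get((group, element), "UN")


print(lookup_vr(0x0008, 0x0060))  # CS
```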

The second attribute of the Transfer Syntax is Little Endian. Endianness describes the order in which bytes are interpreted. DICOM Standard PS3.5 Section 7.3 defines Little Endian Byte Ordering as follows:

  • In a binary number consisting of multiple bytes (e.g., a 32-bit unsigned integer value, the Group Number, the Element Number, etc.), the least significant byte shall be encoded first; with the remaining bytes encoded in increasing order of significance.
  • In a character string consisting of multiple 8-bit single byte codes, the characters will be encoded in the order of occurrence in the string (left to right).

Using the tag for Modality as an example, we can see how Little Endian Byte Ordering works. Consider the following bytes, written in hexadecimal:

08 00 60 00

Let's say we already know that these four bytes represent the tag. As established in 1.1, a tag consists of two 16-bit unsigned integers. Each pair of hexadecimal digits represents one byte (8 bits), so the four bytes above hold two 16-bit numbers: 08 00 and 60 00. Little Endian Byte Ordering dictates that each number is encoded with its least significant byte first, so within each number we read the bytes right to left (the digits inside a byte keep their order). The two numbers are therefore interpreted as 0x0008 and 0x0060, which correspond to the Group Number and Element Number of (0008,0060), the tag for Modality.
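The same decoding can be sketched with Python's standard struct module, where the "<" prefix selects little-endian byte order and "H" denotes a 16-bit unsigned integer. The variable names are only illustrative.

```python
import struct

raw = bytes.fromhex("08 00 60 00")

# "<HH": two little-endian 16-bit unsigned integers.
group, element = struct.unpack("<HH", raw)
print(f"({group:04X},{element:04X})")  # (0008,0060)

# The same result by hand: the second byte of each pair is the more significant.
assert group == raw[0] | (raw[1] << 8)
assert element == raw[2] | (raw[3] << 8)
```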

Big Endian is also defined in the DICOM Standard; however, the only Big Endian Transfer Syntax it defined has since been retired.

1.4 Encoded Byte Structure

With the information above we can begin to look at the raw binary data of an encoded DICOM Data Set. A Data Element can be easily encoded and decoded once the rules governing its structure are understood. Let's take a look at this example:

08 00 60 00 08 00 00 00 52 54 53 54 52 55 43 54

First there is the tag (which we have seen an example of in 1.3) which consists of 2 bytes representing the Group Number followed by 2 bytes representing the Element Number. This decodes to (0008, 0060).

Decoding the next set of bytes depends on the Transfer Syntax established for this particular DICOM File. In this instance we are using the default, DICOM Implicit VR Little Endian Transfer Syntax. This means that the next 4 bytes are dedicated to declaring the Value Length (32-bit unsigned integer) which is a number stating how long the Data Element's Value is in bytes. The example above decodes to 8 bytes.

If we were using an Explicit VR Transfer Syntax instead, the next 2 bytes would declare the VR, and for most VRs the 2 bytes after that would hold the Value Length (which limits the length to a 16-bit unsigned integer). Some VRs, such as OB, OW, SQ and UN, instead follow the VR with 2 reserved bytes and a 32-bit Value Length; see PS3.5 Section 7.1.2.
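For comparison, here is a rough sketch of how the same Modality element might be encoded under an Explicit VR Little Endian Transfer Syntax, covering only VRs that use a 16-bit Value Length. The function name encode_explicit_short is hypothetical, and the VR of Modality (CS, Code String) is taken from the data dictionary rather than from the example bytes.

```python
import struct


def encode_explicit_short(group: int, element: int, vr: str, value: bytes) -> bytes:
    """Encode tag + 2-byte VR + 16-bit Value Length + value.

    Only suitable for VRs that use a 16-bit length field.
    """
    if len(value) % 2:
        raise ValueError("value must already be padded to an even length")
    # "<HH2sH": group, element, two VR characters, 16-bit length (8 bytes total).
    header = struct.pack("<HH2sH", group, element, vr.encode("ascii"), len(value))
    return header + value


encoded = encode_explicit_short(0x0008, 0x0060, "CS", b"RTSTRUCT")
print(" ".join(f"{b:02x}" for b in encoded))
# 08 00 60 00 43 53 08 00 52 54 53 54 52 55 43 54
```

The tag and value bytes are unchanged; only the four bytes between them differ, now holding the VR (43 53, the ASCII codes for "CS") and a 2-byte length.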

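Note that a Value Length must be an even number. The one exception is the special value 0xFFFFFFFF, which denotes Undefined Length rather than a byte count.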

Following the header bytes (the tag, then the VR and/or Value Length; 8 bytes in this example) is the Data Element's value itself, called the Value Field. How these bytes are interpreted is determined by the Value Length and the Data Element's VR. As the Value Length must always be an even number, the Value Field must also contain an even number of bytes. If a value falls short of an even length, the Value Field will be padded at the end with a single byte: a space (0x20, the ASCII 'Space' character) for most character string VRs, or a null byte (0x00) for VRs such as UI and OB. In the example, the 8 bytes of the Value Field decode as the ASCII characters RTSTRUCT.
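The padding rule can be sketched as a small helper. The name pad_value is hypothetical, and the choice of which VRs receive a null byte rather than a space is a simplification based on PS3.5 Section 6.2: here only UI and OB are padded with 0x00, and every other VR is assumed to be a character string padded with a space.

```python
def pad_value(value: bytes, vr: str) -> bytes:
    """Pad a value to an even number of bytes, as required for the Value Field."""
    if len(value) % 2 == 0:
        return value  # already even, nothing to do
    # Simplifying assumption: UI and OB use NULL, all other VRs use a space.
    pad = b"\x00" if vr in ("UI", "OB") else b" "
    return value + pad


print(pad_value(b"RTSTRUCT", "CS"))  # b'RTSTRUCT' (8 bytes, unchanged)
print(pad_value(b"DOC", "CS"))       # b'DOC ' (padded with a space)
```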

To break the example down more concisely:

First 2 bytes: Group Number (0x0008)

Next 2 bytes: Element Number (0x0060)

Next 4 bytes: Value Length (0x00000008)

Next 8 bytes: Value Field (RTSTRUCT)

And in plain language:

Tag: (0008, 0060), Modality

Length: 8 bytes

Value: RTSTRUCT
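Putting the pieces together, the whole example can be decoded with a few lines of Python. This is a minimal sketch for Implicit VR Little Endian only: the function name decode_implicit_element is hypothetical, and Undefined Length elements and Sequences are deliberately not handled.

```python
import struct


def decode_implicit_element(data: bytes, offset: int = 0):
    """Decode one Implicit VR Little Endian Data Element starting at offset.

    Returns (group, element, value, next_offset).
    """
    # "<HHI": Group Number, Element Number, 32-bit Value Length.
    group, element, length = struct.unpack_from("<HHI", data, offset)
    start = offset + 8  # 4 bytes of tag + 4 bytes of Value Length
    end = start + length
    if end > len(data):
        raise ValueError("Value Length runs past the end of the data")
    return group, element, data[start:end], end


raw = bytes.fromhex("08 00 60 00 08 00 00 00 52 54 53 54 52 55 43 54")
group, element, value, next_offset = decode_implicit_element(raw)
print(f"Tag: ({group:04X},{element:04X})")  # Tag: (0008,0060)
print(f"Length: {len(value)} bytes")         # Length: 8 bytes
print(f"Value: {value.decode('ascii')}")     # Value: RTSTRUCT
```

The returned next_offset is where the following Data Element would begin, so calling the function repeatedly would walk through a flat Data Set one element at a time.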
