-
Notifications
You must be signed in to change notification settings - Fork 221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using datatype when reading fields instead of byte size #608
base: main
Are you sure you want to change the base?
Conversation
c158bc6
to
ab46321
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one! 🐱 The idea itself looks good, no complains here 👍
memcpy( &buf32, header + (field - 1), 4 ); | ||
*f = (int32_t)be32toh( buf32 ); | ||
switch ( fd->type ) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we extend test suite with examples where stored value is close to border of each type?
Like if value is int16, can you correctly read -32766 and -32767?
We shouldn't even need the files, can simply create in-memory header with whatever values we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is on the list of things I would like to get in place. I have generated test files with binary headers with max and min. The main question is how to fit this neatly into the test framework. I suggest to put this in a separate PR or PO is okay with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can avoid using files to test get_field
and set_field
functions through segy_get/set_field
.
These tests seem so simple that they could just be in this PR.
TEST_CASE(
"segy_get_field reads values correctly",
"[c.segy]" ) {
char header[ SEGY_TRACE_HEADER_SIZE ] = { 0 };
SECTION("test negative odd near-min int16") {
// -32767, big endian
header[28] = 0x80;
header[29] = 0x01;
int32_t value;
Err err = segy_get_field( header, SEGY_TR_TRACE_ID, &value );
CHECK( success( err ) );
CHECK( value == -32767 );
}
SECTION("test negative even near-min int16") {
// -32766, big endian
header[28] = 0x80;
header[29] = 0x02;
int32_t value;
Err err = segy_get_field( header, SEGY_TR_TRACE_ID, &value );
CHECK( success( err ) );
CHECK( value == -32766 );
}
// keep on
}
I am wrong though. I better not be even/odd, it likely should be 0x80 0x01
and 0x01 0x80
- to turn value negative depending on the machine endianness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I liked the idea of testing using a synthetic header.
Hardcoding indexes and hex values is not the way I prefer to build tests. As a minimum we have to use the predefined indexes and cast the values in a way that makes this readable. If I cant read and verify the test, it has little value. Furthermore there has to be some thought behind what to actually test. In my opinion this is not a smal thing we can just slap on.
Your suggest test is added.
uint32_t buf32 = 0; | ||
uint16_t buf16 = 0; | ||
uint8_t buf8 = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we better leave these variables be - to simplify understanding.
uint16_t buf16 = 0;
memcpy( &buf16, header + (field - 1), formatsize( fd->type ));
fd->buffer = (int16_t) be16toh(buf16);
If we keep them, we would have simplified flow with less casting.
I hope our analyzers won't complain about it, but no promises...
(Don't read the wall of text below if you don't want to, I've been simply trying to rubber-duck what we do here to see if there may be any bugs. Spoiler: I don't see one).
I think all implementations rely extensively on two properties: what happens when we convert signed int into unsigned int and the other way around? Looks like compilers do reasonable thing, but I understand this is not very well defined:
6.3.1.3 Signed and unsigned integers
- When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
- Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
- Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
Lets see at 2 bytes example.
A: Before we:
had the header with value in big endian, lets say -32767: 0x80 0x01
- memcopy content to buffer of uint16 (
buf16
). Nothing changed, memory still reads0x80 0x01
, which is 384. - be16toh accepts uint16, so we send memory as is.
- on little-endian system be16toh will simply swap the bytes and return uint16. bytes
0x01 0x80
will be swapped to0x80 0x01
and returned as 32769. - then value uint16 is explicitly cast to int16, which uses (3) implementation-defined "reinterpret cast" - we just read bytes as if it was int16 and get our -32767.
- in the end it is stored in int32, but as int16 is (1) nicely convertible, it is still -32767.
B: Now we
have the header with value in big endian, lets say -32767: 0x80 0x01
- memcopy 2 bytes to fd->buffer. So we have
0x80 0x01 0x00 0x00 0x00 0x00 0x00 0x00
bytes there. On little-endian system this sequence is 384. - then we explicitly convert it to int16_t by (1) casting. Value stays the same.
- then as be16toh accepts uint16_t, we have (1) casting again. Value stays the same.
- (same as A-3) on little-endian system be16toh will simply swap the bytes and return uint16. 384 has bytes
0x01 0x80
, so they will be swapped to0x80 0x01
and returned as 32769. - (same as A-4) then value uint16 is explicitly cast to int16, which uses (3) implementation-defined "reinterpret cast" - we just read bytes as if it was int16 and get our -32767.
- in the end we store int16_t in uint64 buffer again, this time using (2). I am not sure what byte representation it will be.
- beyond this function our uint64 value is casted to int32 again, assuming again using (3).
So we have more conversions. Also not sure what would happen for 64-bit values instead of 16-bit ones (what happens if value in B-1 gets interpreted as negative?).
C: Suggested version is the same as A, only that A-5 is exchanged with B-6 and B-7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for carefully checking. :-) This is a critical point.
When defining a variable or buffer as in this case it is a good to ask if it is actually necessary. In this case the answer is no. I did try a variant using buf8, buf16, and buf32. Unfortunately it is not allowed to define variables or buffers in a switch case. As a result all three has to be defined each time. At the same time we have to think of the variable as a buffer. Thus, it really does not help that much with readability. In the end the choice was made from an efficiency perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many kinds of efficiency 😄 Now we have to perform two more conversions 😈
Also what do you mean about why all 3 have to be defined each time? They have to be defined once. The code as it was before works just fine.
fd->type = table[ field ];
fd->buffer = 0;
uint32_t buf32 = 0;
uint16_t buf16 = 0;
uint8_t buf8 = 0;
switch (fd->type) {
case SEGY_SIGNED_INTEGER_4_BYTE:
memcpy(&buf32, header + (field - 1), formatsize(fd->type));
fd->buffer = (int32_t)be32toh(buf32);
return SEGY_OK;
case SEGY_SIGNED_SHORT_2_BYTE:
memcpy(&buf16, header + (field - 1), formatsize(fd->type));
fd->buffer = (int16_t)be16toh(buf16);
return SEGY_OK;
case SEGY_UNSIGNED_SHORT_2_BYTE:
memcpy(&buf16, header + (field - 1), formatsize(fd->type));
fd->buffer = (uint16_t)be16toh(buf16);
return SEGY_OK;
case SEGY_UNSIGNED_CHAR_1_BYTE:
memcpy(&buf8, header + (field - 1), formatsize(fd->type));
fd->buffer = (uint8_t)buf8;
return SEGY_OK;
default:
return SEGY_INVALID_FIELD;
}
If you also use the same logic in set_field
function, you avoid copying the w_fd, have code consistency between set and get functions and would also have less conversions 💛
(I am bringing it up again because my certainty that I understand this B-case correctly is lower than A/C-case, so I really prefer less conversions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Crated a new version with different pointes to the buffer. In this way only the first, first two and first four bytes are used. Hope that it is better.
Each time a field i read in the old code three buffers
uint32_t buf32 = 0;
uint16_t buf16 = 0;
uint8_t buf8 = 0;
are declared in memory, while only one of them are actually used. How it is not confusing to define three variables and never user more than one if them is beyond me. I presume we have the option of improve in existing code when we find something.
Casting is basically question on how to interpret the binary sequence and not really costly. According to CatGPT it should be a single operation:
"No Additional Overhead: The casting itself does not introduce any additional overhead, such as function calls or memory allocation. The operation is typically done in a single instruction."
New version in fixup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for constructive feedback.
Comments addressed with change in fixup or comment.
lib/src/segy.c
Outdated
FieldData fd; | ||
err = get_field( header, tr_field_type, field, &fd ); | ||
if( err != 0 ) return SEGY_FSEEK_ERROR; | ||
if (lsb) fd.buffer = bswap_header_word((int32_t)fd.buffer, formatsize( fd.type )); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suspicion is that the values used for testing is too small. Using values below 256 only tests that byte zero is in the right place. For many field values that are tested this seams to be the case.
Variable buf is an int pointer so the cast should be correct.
Sorry, I do not see the reason or motivation to change Jørgens code here.
uint32_t buf32 = 0; | ||
uint16_t buf16 = 0; | ||
uint8_t buf8 = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many kinds of efficiency 😄 Now we have to perform two more conversions 😈
Also what do you mean about why all 3 have to be defined each time? They have to be defined once. The code as it was before works just fine.
fd->type = table[ field ];
fd->buffer = 0;
uint32_t buf32 = 0;
uint16_t buf16 = 0;
uint8_t buf8 = 0;
switch (fd->type) {
case SEGY_SIGNED_INTEGER_4_BYTE:
memcpy(&buf32, header + (field - 1), formatsize(fd->type));
fd->buffer = (int32_t)be32toh(buf32);
return SEGY_OK;
case SEGY_SIGNED_SHORT_2_BYTE:
memcpy(&buf16, header + (field - 1), formatsize(fd->type));
fd->buffer = (int16_t)be16toh(buf16);
return SEGY_OK;
case SEGY_UNSIGNED_SHORT_2_BYTE:
memcpy(&buf16, header + (field - 1), formatsize(fd->type));
fd->buffer = (uint16_t)be16toh(buf16);
return SEGY_OK;
case SEGY_UNSIGNED_CHAR_1_BYTE:
memcpy(&buf8, header + (field - 1), formatsize(fd->type));
fd->buffer = (uint8_t)buf8;
return SEGY_OK;
default:
return SEGY_INVALID_FIELD;
}
If you also use the same logic in set_field
function, you avoid copying the w_fd, have code consistency between set and get functions and would also have less conversions 💛
(I am bringing it up again because my certainty that I understand this B-case correctly is lower than A/C-case, so I really prefer less conversions).
memcpy( &buf32, header + (field - 1), 4 ); | ||
*f = (int32_t)be32toh( buf32 ); | ||
switch ( fd->type ) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can avoid using files to test get_field
and set_field
functions through segy_get/set_field
.
These tests seem so simple that they could just be in this PR.
TEST_CASE(
"segy_get_field reads values correctly",
"[c.segy]" ) {
char header[ SEGY_TRACE_HEADER_SIZE ] = { 0 };
SECTION("test negative odd near-min int16") {
// -32767, big endian
header[28] = 0x80;
header[29] = 0x01;
int32_t value;
Err err = segy_get_field( header, SEGY_TR_TRACE_ID, &value );
CHECK( success( err ) );
CHECK( value == -32767 );
}
SECTION("test negative even near-min int16") {
// -32766, big endian
header[28] = 0x80;
header[29] = 0x02;
int32_t value;
Err err = segy_get_field( header, SEGY_TR_TRACE_ID, &value );
CHECK( success( err ) );
CHECK( value == -32766 );
}
// keep on
}
I am wrong though. I better not be even/odd, it likely should be 0x80 0x01
and 0x01 0x80
- to turn value negative depending on the machine endianness.
lib/src/segy.c
Outdated
FieldData fd; | ||
err = get_field( header, tr_field_type, field, &fd ); | ||
if( err != 0 ) return SEGY_FSEEK_ERROR; | ||
if (lsb) fd.buffer = bswap_header_word((int32_t)fd.buffer, formatsize( fd.type )); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely see both reason and motivation.
You are changing whole interface to work on 64-bit values. This function was clearly extracted just to avoid duplication.
At the moment you cast your unint64 buffer to int32 to adhere to the interface. It works because uint64 buffer contains just 32-bit values for now.
Soon it would contain real 64-bit value. Something will hopefully fail and we will have to dig out what. We would have to remove this cast to int32_t and rewrite the bswap_header_word
anyway.
So to me changing signature now is only natural.
uint32_t buf32; | ||
uint16_t buf16; | ||
uint8_t buf8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using these here as well?
static int set_field( char* header,
const uint8_t* table,
int field,
FieldData* fd) {
fd->type = table[ field ];
uint32_t buf32;
uint16_t buf16;
uint8_t buf8;
switch ( fd->type ) {
case SEGY_SIGNED_INTEGER_4_BYTE:
buf32 = htobe32( (int32_t)fd->buffer );
memcpy( header + (field - 1), &buf32, formatsize( fd->type ));
return SEGY_OK;
case SEGY_SIGNED_SHORT_2_BYTE:
buf16 = htobe16( (int16_t)fd->buffer );
memcpy( header + (field - 1), &buf16, formatsize( fd->type ));
return SEGY_OK;
case SEGY_UNSIGNED_SHORT_2_BYTE:
buf16 = htobe16( (uint16_t)fd->buffer );
memcpy( header + (field - 1), &buf16, formatsize( fd->type ));
return SEGY_OK;
case SEGY_UNSIGNED_CHAR_1_BYTE:
buf8 = (uint8_t) fd->buffer;
memcpy( header + (field - 1), &buf8, formatsize( fd->type ));
return SEGY_OK;
default:
return SEGY_INVALID_FIELD;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used the same approach as for get.
411e32c
to
fcd1236
Compare
Red Hat Enterprise Linux (RHEL) 7 has reached its end of life in 2024. This means we can bump the minimum required version to 3.11 which is provided in RHEL 8 [1]. We assume that other systems either have CMake 3.11 or newer available, or have a way to install a sufficently new CMake versions. Bumping the minimum version removes some deprecation warning (about support for older CMake versions being dropped), and also ensures compatibility with CMake 4.0 which has dropped support for CMake <3.5 and was rolled out to GitHub runners recently. [1]: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html-single/8.0_release_notes/index#platform-tools
To be removed prior to pushing to main.
In order to keep the trace header offset and datatype definitions close we move the list of datatypes such that this is defined first.
Rewrite get_field function to read data base on field datatype and return the datatype along with the field value. This functunality will be required later when we are handeling float values.
Rewrite set_field function to write data base on field datatype and return the datatype. This functunality will be required later when we are handeling float values.
fcd1236
to
8def904
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed comments. Github message log is a bit messy, so I hope that I did not miss anything.
Rewrote set and get field to only use the bytes that are required. Hopefully this will resolve the casting issue.
uint32_t buf32 = 0; | ||
uint16_t buf16 = 0; | ||
uint8_t buf8 = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Crated a new version with different pointes to the buffer. In this way only the first, first two and first four bytes are used. Hope that it is better.
Each time a field i read in the old code three buffers
uint32_t buf32 = 0;
uint16_t buf16 = 0;
uint8_t buf8 = 0;
are declared in memory, while only one of them are actually used. How it is not confusing to define three variables and never user more than one if them is beyond me. I presume we have the option of improve in existing code when we find something.
Casting is basically question on how to interpret the binary sequence and not really costly. According to CatGPT it should be a single operation:
"No Additional Overhead: The casting itself does not introduce any additional overhead, such as function calls or memory allocation. The operation is typically done in a single instruction."
New version in fixup.
memcpy( &buf32, header + (field - 1), 4 ); | ||
*f = (int32_t)be32toh( buf32 ); | ||
switch ( fd->type ) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I liked the idea of testing using a synthetic header.
Hardcoding indexes and hex values is not the way I prefer to build tests. As a minimum we have to use the predefined indexes and cast the values in a way that makes this readable. If I cant read and verify the test, it has little value. Furthermore there has to be some thought behind what to actually test. In my opinion this is not a smal thing we can just slap on.
Your suggest test is added.
lib/src/segy.c
Outdated
FieldData fd; | ||
err = get_field( header, tr_field_type, field, &fd ); | ||
if( err != 0 ) return SEGY_FSEEK_ERROR; | ||
if (lsb) fd.buffer = bswap_header_word((int32_t)fd.buffer, formatsize( fd.type )); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to split this up into manageable PRs. This means that not all things can be done at once and choices have to be made. One of them is to postpone the 64 bit stuff. Have a look at the new attempt.
uint32_t buf32; | ||
uint16_t buf16; | ||
uint8_t buf8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used the same approach as for get.
SEG technical standard defines fields for the binary header and the trace header. Each field has a well defined size and in most cases the datatype is also defined. This PR changes the system to solely relay on field size and instead use the filed datatype. This feature will become critical when implementing the new float field that are defined in version 2.0 and 2.1.