res['numeric_count'] calculation in Hierarchical Tokenizer #252

Hi, I am looking at your hierarchical tokenizer and have a question about how you calculate res['numeric_count'] per property.

The relevant code is "weight = 1.0 / (num_subjects * total_events)" ([line 100](https://github.com/som-shahlab/femr/blob/37a99749637b36777770064f87b071a1c74bffd0/src/femr/models/tokenizer/hierarchical_tokenizer.py#L100C9-L100C53)) and "res['numeric_count'] += len(v['numeric_samples']) / weight" ([line 151](https://github.com/som-shahlab/femr/blob/37a99749637b36777770064f87b071a1c74bffd0/src/femr/models/tokenizer/hierarchical_tokenizer.py#L151)). Because weight is the reciprocal of num_subjects * total_events, dividing by it is equivalent to res['numeric_count'] += len(v['numeric_samples']) * num_subjects * total_events. Should it be res['numeric_count'] += len(v['numeric_samples']) * weight instead? My understanding is that the number of quantile bins each property gets depends on its total number of numeric values, which is why I would expect the count to be divided by (num_subjects * total_events) rather than multiplied by it.
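To make the arithmetic concrete, here is a minimal sketch with made-up values. Only the two expressions marked as quoted come from the tokenizer source; the numbers and the `numeric_samples` list are assumptions chosen purely for illustration, and this is not the tokenizer's actual code path.

```python
import math

# Hypothetical values, chosen only for illustration.
num_subjects = 4
total_events = 5
numeric_samples = [0.1, 0.2, 0.3]  # stand-in for v['numeric_samples']

# Quoted from line 100.
weight = 1.0 / (num_subjects * total_events)

# Quoted from line 151: dividing by weight scales the count UP.
divided = len(numeric_samples) / weight

# The alternative I am asking about: multiplying by weight scales it DOWN.
multiplied = len(numeric_samples) * weight

# Dividing by weight is the same as multiplying by num_subjects * total_events.
assert math.isclose(divided, len(numeric_samples) * num_subjects * total_events)

# Multiplying by weight is the same as dividing by num_subjects * total_events.
assert math.isclose(multiplied, len(numeric_samples) / (num_subjects * total_events))

print(divided, multiplied)
```

With these values the two versions differ by a factor of (num_subjects * total_events) squared, so the choice matters a lot if res['numeric_count'] is later compared across properties.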

