f1-score evaluation

Hi,

While working on #8, I noticed that the `f-score` evaluation is computed on the flattened true and predicted labels. For example, given 2 samples with lengths `7` and `20`, the current code flattens the labels to shape `(27,)` and computes a single score. However, I think this could overestimate the value compared with averaging the per-sample scores.

To illustrate, I've made [a notebook][1] using random data. There you can see that the average per-sample f-score is slightly lower than the f-score computed from the flattened data.
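
As a rough sketch of the difference (not the code from the notebook or from this repository), here is a small self-contained example with two made-up binary label sequences of lengths `7` and `20`. The helper `f1_binary` and all of the label values are assumptions for illustration only. In this particular example the long sample is predicted perfectly and the short one poorly, so the flattened score comes out higher than the per-sample average; other data could behave differently.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1) of two equal-length 0/1 sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


# Hypothetical sample of length 7: 1 TP, 1 FP, 2 FN.
true_a = [1, 0, 0, 1, 0, 1, 0]
pred_a = [1, 0, 1, 0, 0, 0, 0]

# Hypothetical sample of length 20: predicted perfectly (7 TP).
true_b = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1]
pred_b = list(true_b)

# Per-sample scores, then their unweighted mean.
per_sample = [f1_binary(true_a, pred_a), f1_binary(true_b, pred_b)]
avg_f1 = sum(per_sample) / len(per_sample)

# Flattened evaluation: concatenate to a single sequence of length 27.
flat_f1 = f1_binary(true_a + true_b, pred_a + pred_b)

print(f"per-sample: {per_sample}")        # [0.4, 1.0]
print(f"average   : {avg_f1:.4f}")        # 0.7000
print(f"flattened : {flat_f1:.4f}")       # 0.8421
```

The flattened score weights every label equally, so the longer sample dominates; the average weights every sample equally regardless of its length.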

Looking forward to your thoughts on this.

[1]: https://github.com/heytitle/thai-word-segmentation/blob/f1-test/f1-test.ipynb


f1-score evaluation #9
