- [ ] Unify the set of languages between cld2 and fasttext (see `unify_lang` branch for a start)
- [x] Audit the list of name pairs (noticed (maria, mary), (kathleen, katherine))
- [ ] Generally improve language detection on titles (would require a whole model)
- [ ] If a person has two very disjoint "personas", they will end up as two clusters. Probably not resolvable, but putting here anyway
- [ ] Somehow do better with low-information papers (e.g. no abstract, venue, affiliation, references)