Commit 13f94c0

Enhance OpenAlex conference scoring to recognize high-quality single-year instances (fixes #77) (#82)
The previous scoring algorithm applied a harsh 70% penalty to all conference instances with 1-2 years of activity, regardless of their quality metrics. This incorrectly penalized major conferences like CVPR 2022 (2,082 papers, 179,530 citations) and ICCV 2021 (1,618 papers, 124,236 citations), causing them to be ranked below poor-quality matches.

The enhanced algorithm now:
- Calculates publication volume and citation impact before applying year-span logic
- Identifies high-quality conferences (>50k citations OR >1000 papers)
- Applies bonuses to high-quality single-year instances instead of penalties
- Only penalizes low-quality short-span conferences (with reduced penalty)

This ensures legitimate year-specific conference instances receive appropriate scores based on their actual quality metrics.

[AI-assisted]

Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
1 parent 25a12a4 commit 13f94c0
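
The quality thresholds described in the commit message can be sketched as a small standalone classifier. This is an illustrative sketch only: `quality_tier` is a hypothetical helper that does not exist in the repository, and the "medium" tier (>10k citations OR >500 papers) is taken from the diff rather than from the message.

```python
def quality_tier(works_count: int, cited_by_count: int) -> str:
    """Classify a conference source with the thresholds used in this commit.

    Hypothetical helper for illustration; the commit computes two inline
    booleans (is_high_quality, is_medium_quality) instead.
    """
    if cited_by_count > 50_000 or works_count > 1_000:
        return "high"
    if cited_by_count > 10_000 or works_count > 500:
        return "medium"
    return "low"


# Figures quoted in the commit message
print(quality_tier(works_count=2_082, cited_by_count=179_530))  # CVPR 2022
print(quality_tier(works_count=1_618, cited_by_count=124_236))  # ICCV 2021
```

Both comparisons are strict, so a source with exactly 1,000 papers and 50,000 citations falls into the medium tier, not the high one.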

1 file changed (+20 -8 lines changed)

src/aletheia_probe/openalex.py

Lines changed: 20 additions & 8 deletions
@@ -119,14 +119,6 @@ def _score_source_match(self, source: dict[str, Any], journal_name: str) -> floa
         elif any(word in display_name for word in search_name.split() if len(word) > 3):
             score += 0.2
 
-        # Avoid year-specific conference instances (conferences with only 1-2 years active)
-        if source_type == "conference" and first_year and last_year:
-            years_active = last_year - first_year + 1
-            if years_active <= 2:
-                score *= 0.3  # Heavily penalize single-year instances
-            elif years_active >= 10:
-                score += 0.1  # Bonus for long-running venues
-
         # Publication volume (30% of score)
         if works_count > 1000:
             score += 0.3
@@ -147,6 +139,26 @@ def _score_source_match(self, source: dict[str, Any], journal_name: str) -> floa
         elif cited_by_count <= 10:
             score *= 0.5  # Penalize low-impact sources
 
+        # Enhanced conference scoring: Consider quality metrics before penalizing short spans
+        if source_type == "conference" and first_year and last_year:
+            years_active = last_year - first_year + 1
+
+            # Determine if this is a high-quality conference instance
+            is_high_quality = (cited_by_count > 50000) or (works_count > 1000)
+            is_medium_quality = (cited_by_count > 10000) or (works_count > 500)
+
+            if years_active <= 2:
+                # High-quality single-year instances are legitimate (e.g., CVPR 2022)
+                if is_high_quality:
+                    score += 0.15  # Bonus for major conference instance
+                elif is_medium_quality:
+                    score += 0.05  # Small bonus for good conference instance
+                else:
+                    # Only penalize low-quality short-span conferences
+                    score *= 0.4  # Reduced penalty (was 0.3)
+            elif years_active >= 10:
+                score += 0.1  # Bonus for long-running venues
+
         # Recency (10% of score) - penalize inactive sources
         if last_year:
             current_year = datetime.now().year
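
To see the effect of the change in isolation, the removed and added year-span logic can be lifted into two standalone functions and applied to the same starting score. This is a minimal sketch under stated assumptions: `old_adjustment` and `new_adjustment` are hypothetical names, the source is assumed to be of type "conference", and `base` is an arbitrary score chosen for illustration. In the real method the step also moved from before to after the publication-volume and citation scoring, so the score entering it differs between the two versions.

```python
from typing import Optional


def old_adjustment(score: float, first_year: Optional[int],
                   last_year: Optional[int]) -> float:
    """Year-span logic removed by this commit (conference sources only)."""
    if first_year and last_year:
        years_active = last_year - first_year + 1
        if years_active <= 2:
            score *= 0.3  # flat penalty, regardless of quality
        elif years_active >= 10:
            score += 0.1  # bonus for long-running venues
    return score


def new_adjustment(score: float, works_count: int, cited_by_count: int,
                   first_year: Optional[int], last_year: Optional[int]) -> float:
    """Year-span logic added by this commit (conference sources only)."""
    if first_year and last_year:
        years_active = last_year - first_year + 1
        is_high_quality = cited_by_count > 50000 or works_count > 1000
        is_medium_quality = cited_by_count > 10000 or works_count > 500
        if years_active <= 2:
            if is_high_quality:
                score += 0.15  # bonus for a major conference instance
            elif is_medium_quality:
                score += 0.05  # small bonus for a good conference instance
            else:
                score *= 0.4  # reduced penalty for low-quality short spans
        elif years_active >= 10:
            score += 0.1  # unchanged bonus for long-running venues
    return score


base = 0.6  # hypothetical score entering the year-span step

# CVPR 2022 figures from the commit message: penalized before, rewarded now
print(old_adjustment(base, 2022, 2022))
print(new_adjustment(base, 2082, 179530, 2022, 2022))

# A made-up low-quality single-year source is still penalized, but less
print(new_adjustment(base, 20, 100, 2022, 2022))
```

With the same starting score, the high-quality single-year instance now ends up above the low-quality one, which is the ranking behaviour the commit message describes.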
