
Feature/fix wikipedia#240

Open
cmosguy wants to merge 9 commits into anthropics:main from cmosguy:feature/fix-wikipedia

Conversation

cmosguy commented Oct 24, 2025

There were some changes in the API that needed to be reflected in the Wikipedia notebook example. We need to start using the transformers library for the tokenizer.

PedramNavid
PedramNavid previously approved these changes Oct 24, 2025

@PedramNavid PedramNavid left a comment

Collaborator

Approved, just a small suggestion on one line.

Comment thread third_party/Wikipedia/wikipedia-search-cookbook.ipynb Outdated
@PedramNavid

Collaborator

@cmosguy there are a few conflicts that need to be resolved. I think you can just pull from the main branch for both pyproject and uv.lock, but make sure ruff format and ruff check have been run on your IPython notebook.

@cmosguy

cmosguy commented Oct 30, 2025

Author

@PedramNavid - trying again, thanks for the heads up

@PedramNavid PedramNavid left a comment

Collaborator

Hi @cmosguy

I went through the full PR. There are a few issues:

  • Rather than importing a tokenizer, let's rely on the count tokens API already available in the Anthropic library.
  • Please ensure you are using the ruff formatter, as there are a lot of changed lines here that are pure formatting changes that should not be part of the diff.
  • Please do not make changes to the uv.lock file; we should not be making any dependency changes for this PR.

" def process_raw_search_results(\n",
" self,\n",
" results: list[SearchResult],\n",
" self, results: list[SearchResult],\n",

Collaborator

Have you run ruff format on this notebook? This change you made undoes what our formatter is doing.

Author

ok ran ruff format

" result = \"\\n\".join(\n",
" [\n",
" f'<item index=\"{i + 1}\">\\n<page_content>\\n{r}\\n</page_content>\\n</item>'\n",
" f'<item index=\"{i+1}\">\\n<page_content>\\n{r}\\n</page_content>\\n</item>'\n",

Collaborator

Same here, looks like your linter/formatter is not using the ruff format we use.

Author

done

"class WikipediaSearchResult(SearchResult):\n",
" title: str\n",
"\n",
"from transformers import AutoTokenizer\n",

Collaborator

Rather than adding a new dependency, I think we should use the messages.count_tokens API.

Author

@PedramNavid agreed, removed this.
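
For reference, a minimal sketch of counting prompt tokens through the SDK instead of a local tokenizer. It assumes the `anthropic` Python SDK's `client.messages.count_tokens(...)` call, whose response exposes `input_tokens`. The helper name `count_prompt_tokens` and the stub client are hypothetical and exist only so the snippet runs without an API key.

```python
from types import SimpleNamespace


def count_prompt_tokens(client, model: str, text: str) -> int:
    """Return the input-token count the API reports for a single user message."""
    response = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return response.input_tokens


class _StubMessages:
    """Hypothetical stand-in for client.messages; counts whitespace-separated words."""

    def count_tokens(self, *, model, messages):
        words = sum(len(message["content"].split()) for message in messages)
        return SimpleNamespace(input_tokens=words)


# Stand-in for a real SDK client so the sketch runs offline.
stub_client = SimpleNamespace(messages=_StubMessages())
print(count_prompt_tokens(stub_client, "claude-haiku-4-5", "How tall is Mount Everest?"))
```

With a real client in place of `stub_client`, the same helper would make one network request per call, so it is worth calling sparingly.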

" page = wikipedia.page(result)\n",
" print(page.url)\n",
" except Exception:\n",
" except:\n",

Collaborator

We should not have bare except clauses.

Author

done
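
A short sketch of why the narrower clause matters: a bare `except:` also catches `KeyboardInterrupt` and `SystemExit`, while a named exception type handles only the expected failure. `PageLookupError`, `fetch_page_url`, and `collect_urls` are hypothetical stand-ins; in the notebook the caught type would be whatever the search library actually raises.

```python
class PageLookupError(Exception):
    """Hypothetical error type standing in for the search library's own exceptions."""


def fetch_page_url(title: str, pages: dict[str, str]) -> str:
    """Hypothetical lookup that raises PageLookupError for unknown titles."""
    try:
        return pages[title]
    except KeyError as exc:
        raise PageLookupError(f"no page named {title!r}") from exc


def collect_urls(titles: list[str], pages: dict[str, str]) -> list[str]:
    """Return URLs for the titles that resolve, skipping the ones that do not."""
    urls = []
    for title in titles:
        try:
            urls.append(fetch_page_url(title, pages))
        except PageLookupError as exc:
            # Only the expected failure is handled here; KeyboardInterrupt,
            # SystemExit, and genuine bugs still propagate.
            print(f"Skipping {title!r}: {exc}")
    return urls
```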

"# load the antrophic key from .env\n",
"from dotenv import load_dotenv\n",
"load_dotenv(verbose=True)\n",
"ANTHROPIC_SEARCH_MODEL = os.environ.get('ANTHROPIC_MODEL', 'claude-2')\n",

Collaborator

Let's default to haiku 4-5

Author

done
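
One way the default could look, as a sketch. The model ID `claude-haiku-4-5` is an assumption about the identifier for Haiku 4.5 and should be checked against the current model list; `resolve_search_model` is a hypothetical helper.

```python
import os

# Assumed model ID for Haiku 4.5; verify against the current model list.
DEFAULT_SEARCH_MODEL = "claude-haiku-4-5"


def resolve_search_model(env=None) -> str:
    """Return ANTHROPIC_MODEL if it is set and non-empty, otherwise the default."""
    env = os.environ if env is None else env
    return env.get("ANTHROPIC_MODEL") or DEFAULT_SEARCH_MODEL


ANTHROPIC_SEARCH_MODEL = resolve_search_model()
```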

Comment thread uv.lock Outdated
[[package]]
name = "huggingface-hub"
version = "1.0.0"
version = "0.36.0"

Collaborator

We should not be changing our existing dependencies to a lower version.

Author

done

@cmosguy cmosguy requested a review from PedramNavid October 31, 2025 18:05
@cmosguy

cmosguy commented Nov 3, 2025

Author

@PedramNavid gentle ping here on the updates. Do they meet your requirements now?

@PedramNavid PedramNavid left a comment

Collaborator

Hi @cmosguy. I've given it another look, and there are still quite a few issues with this PR. I think part of the challenge is that this was an old notebook from the Claude 2 era, so it mixes a lot of old and new concepts. I wonder if a rewrite might be better than trying to fix things piecemeal. Either way, I've noted a few logic issues that would need to be resolved before we can merge.

Comment thread pyproject.toml
"voyageai>=0.3.5",
"python-dotenv>=1.1.1",
"wikipedia>=1.4.0",
"huggingface-hub>=1.0.0",

Collaborator

Can delete this, I imagine.

Author

I think we should leave wikipedia in, right? It is required in this notebook.

"\n",
" def __init__():\n",
" def __init__(self, anthropic_client: Anthropic):\n",
" self.anthropic_client = anthropic_client\n",

Collaborator

Why did you add the client here?
__init__ now takes anthropic_client: Anthropic as a required parameter, but then has a pass statement that does nothing with it.

Either remove the pass or properly initialize self.anthropic_client = anthropic_client.

"# Create a searcher\n",
"wikipedia_search_tool = WikipediaSearchTool()\n",
"ANTHROPIC_SEARCH_MODEL = \"claude-2\"\n",
"# load the antrophic key from .env\n",

Collaborator

Can delete this comment; the loading happens in cell 3 with `load_dotenv()`.

" )\n",
" print(partial_completion)\n",
" token_budget -= self.count_tokens(partial_completion)\n",
" token_count = self.messages.count_tokens(\n",

Collaborator

I'm not sure this is correct.

You call self.messages.count_tokens() after every partial completion to count the prompt tokens, not the completion tokens.

This is backwards - you should be subtracting partial_completion_.usage.input_tokens + partial_completion_.usage.output_tokens from the budget, which are already returned by the Messages API.
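
A minimal sketch of the accounting described above, assuming each Messages API response carries `usage.input_tokens` and `usage.output_tokens`. The helper `spend_budget` and the stand-in responses are hypothetical; real responses would come from `client.messages.create(...)`.

```python
from types import SimpleNamespace


def spend_budget(token_budget: int, response) -> int:
    """Subtract the tokens one Messages API call consumed from the remaining budget."""
    used = response.usage.input_tokens + response.usage.output_tokens
    return token_budget - used


# Hypothetical stand-ins for two successive Messages API responses.
responses = [
    SimpleNamespace(usage=SimpleNamespace(input_tokens=120, output_tokens=40)),
    SimpleNamespace(usage=SimpleNamespace(input_tokens=310, output_tokens=55)),
]

budget = 1000
for response in responses:
    budget = spend_budget(budget, response)
    if budget <= 0:
        break  # stop issuing further retrieval rounds

print(budget)  # 475
```

Because the usage figures arrive with every response, this needs no extra `count_tokens` request per iteration.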

" )\n",
" information = extract_between_tags(\"information\", retrieval_response)[-1]\n",
"\n",
" # Try to extract information tags, handle case where none exist\n",

Collaborator

When <information> tags are missing, you use the entire retrieval response as information. This could include all the scratchpad content and search quality reflections. Any reason why this is necessary? If no tags are found, I would think there's an error in the response.
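
A sketch of the stricter alternative: treat a response without `<information>` tags as an error instead of falling back to the whole text. The `extract_between_tags` below is a simplified stand-in written for this example, not the notebook's own helper, and `extract_information` is a hypothetical name.

```python
import re


def extract_between_tags(tag: str, text: str) -> list[str]:
    """Return the stripped contents of every <tag>...</tag> pair in text."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    return [match.strip() for match in re.findall(pattern, text, re.DOTALL)]


def extract_information(retrieval_response: str) -> str:
    """Return the last <information> block, or raise if the response has none."""
    blocks = extract_between_tags("information", retrieval_response)
    if not blocks:
        # Failing loudly keeps scratchpad text and search reflections out of the answer.
        raise ValueError("Retrieval response contained no <information> tags.")
    return blocks[-1]
```

The caller can then decide whether to retry the request or surface the error, rather than silently passing scratchpad content downstream.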

" self.search_tool = search_tool\n",
" self.verbose = verbose\n",
"\n",
" # Pass the anthropic client to the search tool if it supports it\n",

Collaborator

Why wouldn't the search tool support it? You've updated the definition.

" if search_query is None:\n",
" raise Exception(\n",
" \"Completion with retrieval failed as partial completion returned mismatched <search_query> tags.\"\n",
" f\"Completion with retrieval failed as partial completion returned mismatched <search_query> tags.\"\n",

Collaborator

Why is this an f-string?

@nidhishgajjar


Orb Code Review (powered by GLM 5.1 on Orb Cloud)

This is a significant refactoring of the Wikipedia search cookbook that migrates from the deprecated completions API to the messages API and replaces local tokenizer-based truncation with API-based token counting.

Key Concerns

1. Token counting API calls in a loop (Medium)
The truncate_page_content() method calls messages.count_tokens() in a while loop, potentially making many API calls per Wikipedia page. For a page that needs significant truncation, this could be slow and costly:

while (
    truncated_token_count.input_tokens > self.truncate_to_n_tokens
    and len(truncated_content) > 100
):
    char_limit = int(char_limit * 0.9)
    truncated_content = page_content[:char_limit]
    truncated_token_count = self.anthropic_client.messages.count_tokens(...)

Consider using a binary search approach instead of linear 10% reduction, which would converge in O(log n) API calls rather than potentially many iterations.
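
A sketch of the bisection idea, under the assumption that the token count never decreases as the prefix gets longer. `truncate_to_token_limit` and the character-based `approx_count` are hypothetical; in the notebook the counter would wrap a `messages.count_tokens` request, so every call in the log below would be one API call. Unlike a fixed 10% shrink per step, bisection returns the longest prefix that fits rather than stopping at the first one that does.

```python
import math
from typing import Callable


def truncate_to_token_limit(
    text: str, max_tokens: int, count_tokens: Callable[[str], int]
) -> str:
    """Return the longest prefix of text whose token count is <= max_tokens."""
    if count_tokens(text) <= max_tokens:
        return text
    # Invariant: text[:low] fits (the empty prefix is assumed to), text[:high] does not.
    low, high = 0, len(text)
    while high - low > 1:
        mid = (low + high) // 2
        if count_tokens(text[:mid]) <= max_tokens:
            low = mid
        else:
            high = mid
    return text[:low]


call_log: list[int] = []


def approx_count(s: str) -> int:
    """Hypothetical counter: roughly one token per four characters."""
    call_log.append(len(s))
    return math.ceil(len(s) / 4)


page = "a" * 1000
truncated = truncate_to_token_limit(page, 50, approx_count)
print(len(truncated), len(call_log))  # 200 11
```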

2. Token budget tracking overcounts (Medium)
In the main retrieval loop, the token budget is reduced by the full message token count:

token_count = self.messages.count_tokens(
    model=model, messages=[{"role": "user", "content": prompt}]
)
token_budget -= token_count.input_tokens

This counts the entire prompt (which grows with each iteration) rather than just the new tokens generated. The budget will deplete much faster than intended. Consider tracking only the newly generated tokens.

3. huggingface-hub dependency appears unused (Low)
The huggingface-hub>=1.0.0 package is added to pyproject.toml and uv.lock, but I don't see it imported or used anywhere in the notebook code. If it's not needed, removing it would keep the dependency list clean.

4. AIClient sets itself as anthropic_client (Low)

self.search_tool.anthropic_client = self

The AIClient instance assigns itself as the anthropic_client, but AIClient is not an Anthropic instance — it's a wrapper. This works because it also has a messages attribute, but it could be confusing. Consider passing the underlying Anthropic client explicitly.
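
A sketch of passing the client explicitly. The class names follow the review text, but the bodies are hypothetical and trimmed down to the wiring only; the `SimpleNamespace` object stands in for a real `Anthropic()` instance so the snippet runs without credentials.

```python
from types import SimpleNamespace


class WikipediaSearchTool:
    """Hypothetical, trimmed-down search tool that only stores the client."""

    def __init__(self, anthropic_client):
        self.anthropic_client = anthropic_client


class AIClient:
    """Hypothetical wrapper that owns an SDK client and shares it with its tool."""

    def __init__(self, anthropic_client, search_tool_cls=WikipediaSearchTool):
        self.anthropic_client = anthropic_client
        # Hand the tool the SDK client itself rather than `self`, so the tool never
        # depends on the wrapper happening to expose a `messages` attribute.
        self.search_tool = search_tool_cls(anthropic_client)


# Stand-in for the SDK client.
sdk_client = SimpleNamespace(messages=SimpleNamespace())
ai_client = AIClient(sdk_client)
print(ai_client.search_tool.anthropic_client is sdk_client)  # True
```

Making the client a required constructor argument also removes the need for a "does the tool support it" check at the call site.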

Positive Changes

  • Migration from deprecated completions API to messages API
  • Raw string prefix in regex (rf"..." instead of f"...")
  • Improved error messages with exception details
  • Added python-dotenv for environment management

Summary

The migration to the messages API is a necessary update. However, the token-counting approach in the truncation loop could be a performance concern in practice, and the token budget tracking overcounts. The unused huggingface-hub dependency should be removed or justified.

Assessment: request-changes

3 participants