Feature/fix wikipedia by cmosguy · Pull Request #240 · anthropics/claude-cookbooks

cmosguy · 2025-10-24T14:39:37Z

There were some changes in the API that needed to be reflected in the Wikipedia notebook example. We need to start using the transformers library for the tokenizer.

PedramNavid

Approved, just a small suggestion on one line.

PedramNavid · 2025-10-28T16:34:28Z

@cmosguy there are a few conflicts that need to be resolved. I think you can just pull from the main branch for both pyproject and uv.lock, but make sure your IPython notebook has been run through ruff format and ruff check.

cmosguy · 2025-10-30T16:08:34Z

@PedramNavid - trying again, thanks for the heads up

PedramNavid

Hi @cmosguy

I went through the full PR. There are a few issues:

Rather than importing a tokenizer, let's rely on the count tokens API already available in the Anthropic library.
Please make sure you are using the ruff formatter; a lot of the changed lines here are pure formatting changes that should not be part of the diff.
Please do not make changes to the uv.lock file; we should not be making any dependency changes in this PR.

PedramNavid · 2025-10-30T16:41:26Z

    "    def process_raw_search_results(\n",
-    "        self,\n",
-    "        results: list[SearchResult],\n",
+    "        self, results: list[SearchResult],\n",


Have you run ruff format on this notebook? This change you made undoes what our formatter is doing.

ok ran ruff format

PedramNavid · 2025-10-30T16:42:01Z

    "        result = \"\\n\".join(\n",
    "            [\n",
-    "                f'<item index=\"{i + 1}\">\\n<page_content>\\n{r}\\n</page_content>\\n</item>'\n",
+    "                f'<item index=\"{i+1}\">\\n<page_content>\\n{r}\\n</page_content>\\n</item>'\n",


Same here, looks like your linter/formatter is not using the ruff format we use.

PedramNavid · 2025-10-30T16:48:26Z

    "class WikipediaSearchResult(SearchResult):\n",
    "    title: str\n",
    "\n",
+    "from transformers import AutoTokenizer\n",


Rather than adding a new dependency, I think we should use the messages.count_tokens API.

@PedramNavid agreed, removed this

PedramNavid · 2025-10-30T16:48:46Z

    "                page = wikipedia.page(result)\n",
    "                print(page.url)\n",
-    "            except Exception:\n",
+    "            except:\n",


We should not have bare except clauses.
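
A minimal sketch of the point, not the notebook's actual code: `fetch_page` and `PageLookupError` are hypothetical stand-ins for the real lookup call and whatever error type it raises. Catching a named exception keeps the handler from swallowing `KeyboardInterrupt` and `SystemExit`, which a bare `except:` would also catch.

```python
class PageLookupError(Exception):
    """Hypothetical stand-in for a library-specific lookup error."""


def fetch_page(title: str) -> str:
    # Hypothetical stand-in for the real page lookup; not the wikipedia API.
    if not title:
        raise PageLookupError("empty title")
    return f"https://en.wikipedia.org/wiki/{title}"


def safe_urls(titles: list[str]) -> list[str]:
    """Collect URLs for the titles that resolve, skipping the ones that fail."""
    urls = []
    for title in titles:
        try:
            urls.append(fetch_page(title))
        except PageLookupError as exc:
            # Only the expected failure is handled here; KeyboardInterrupt
            # and SystemExit still propagate, unlike with a bare `except:`.
            print(f"Skipping {title!r}: {exc}")
    return urls
```

When the specific error type is not known, `except Exception:` (the form the diff removed) is still preferable to a bare clause.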

PedramNavid · 2025-10-30T16:49:29Z

+    "# load the antrophic key from .env\n",
+    "from dotenv import load_dotenv\n",
+    "load_dotenv(verbose=True)\n",
+    "ANTHROPIC_SEARCH_MODEL = os.environ.get('ANTHROPIC_MODEL', 'claude-2')\n",


Let's default to haiku 4-5

PedramNavid · 2025-10-30T16:50:38Z

 [[package]]
 name = "huggingface-hub"
-version = "1.0.0"
+version = "0.36.0"


should not be changing our existing dependencies to a lower version.

cmosguy · 2025-11-03T13:21:36Z

@PedramNavid gentle ping here on the updates. Do they meet your requirements now?

PedramNavid

Hi @cmosguy. I've given it another look, and there are still quite a few issues with this PR. I think part of the challenge is that this was an old notebook from Claude 2, so it mixes a lot of old and new concepts. I wonder if a rewrite might be better than trying to fix things piecemeal. Either way, I've noted a few logic issues that would need to be resolved before we can merge.

PedramNavid · 2025-11-03T17:05:37Z

    "voyageai>=0.3.5",
+    "python-dotenv>=1.1.1",
+    "wikipedia>=1.4.0",
+    "huggingface-hub>=1.0.0",


Can delete this I imagine

I think we should leave in wikipedia, right? It is required in this notebook.

PedramNavid · 2025-11-03T17:28:21Z

    "\n",
-    "    def __init__():\n",
+    "    def __init__(self, anthropic_client: Anthropic):\n",
+    "        self.anthropic_client = anthropic_client\n",


Why did you add the client here?
__init__ now takes anthropic_client: Anthropic as a required parameter, but then has a pass statement that does nothing with it.

Either remove the pass or properly initialize self.anthropic_client = anthropic_client.

PedramNavid · 2025-11-03T17:30:05Z

    "# Create a searcher\n",
    "wikipedia_search_tool = WikipediaSearchTool()\n",
-    "ANTHROPIC_SEARCH_MODEL = \"claude-2\"\n",
+    "# load the antrophic key from .env\n",


Can delete this comment; the loading happens at cell 3 with `load_dotenv()`.

PedramNavid · 2025-11-03T17:31:06Z

    "            )\n",
    "            print(partial_completion)\n",
-    "            token_budget -= self.count_tokens(partial_completion)\n",
+    "            token_count = self.messages.count_tokens(\n",


I'm not sure this is correct.

You call self.messages.count_tokens() after every partial completion to count the prompt tokens, not the completion tokens.

This is backwards - you should be subtracting partial_completion_.usage.input_tokens + partial_completion_.usage.output_tokens from the budget, which are already returned by the Messages API.
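
As a rough sketch of that accounting (not the notebook's code): the `Usage` dataclass below is a stand-in that mirrors the `input_tokens` and `output_tokens` fields on a Messages API response's `usage` object, and `remaining_budget` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in mirroring the usage fields returned with a Messages API response."""

    input_tokens: int
    output_tokens: int


def remaining_budget(token_budget: int, usage: Usage) -> int:
    """Subtract the tokens one call consumed (prompt plus completion) from the budget."""
    return token_budget - (usage.input_tokens + usage.output_tokens)
```

With this shape no extra `count_tokens` request is needed per iteration, since the usage numbers arrive with each response.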

PedramNavid · 2025-11-03T17:33:10Z

    "        )\n",
-    "        information = extract_between_tags(\"information\", retrieval_response)[-1]\n",
+    "\n",
+    "        # Try to extract information tags, handle case where none exist\n",


When <information> tags are missing you use the entire retrieval response as information. This could include all the scratchpad content and search quality reflections. Any reason why this is necessary? If no tags are found I would think there's an error in the response
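
One way to treat missing tags as an error, sketched under assumptions: `extract_information` is a hypothetical helper written here with a plain regex, not the notebook's `extract_between_tags`, and `ValueError` is an arbitrary choice of exception.

```python
import re


def extract_information(retrieval_response: str) -> str:
    """Return the last <information> block, or raise if none is present."""
    matches = re.findall(
        r"<information>(.+?)</information>", retrieval_response, re.DOTALL
    )
    if not matches:
        # Fail loudly instead of falling back to the whole response, which
        # may contain scratchpad text and search quality reflections.
        raise ValueError("No <information> tags found in retrieval response.")
    return matches[-1].strip()
```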

PedramNavid · 2025-11-03T17:44:24Z

    "        self.search_tool = search_tool\n",
    "        self.verbose = verbose\n",
    "\n",
+    "        # Pass the anthropic client to the search tool if it supports it\n",


Why wouldn't the search tool support it? You've updated the definition.

PedramNavid · 2025-11-03T17:44:46Z

    "        if search_query is None:\n",
    "            raise Exception(\n",
-    "                \"Completion with retrieval failed as partial completion returned mismatched <search_query> tags.\"\n",
+    "                f\"Completion with retrieval failed as partial completion returned mismatched <search_query> tags.\"\n",


why is this an f string?

nidhishgajjar · 2026-04-14T21:19:23Z

Orb Code Review (powered by GLM 5.1 on Orb Cloud)

This is a significant refactoring of the Wikipedia search cookbook that migrates from the deprecated completions API to the messages API and replaces local tokenizer-based truncation with API-based token counting.

Key Concerns

1. Token counting API calls in a loop (Medium)
The truncate_page_content() method calls messages.count_tokens() in a while loop, potentially making many API calls per Wikipedia page. For a page that needs significant truncation, this could be slow and costly:

while (
    truncated_token_count.input_tokens > self.truncate_to_n_tokens
    and len(truncated_content) > 100
):
    char_limit = int(char_limit * 0.9)
    truncated_content = page_content[:char_limit]
    truncated_token_count = self.anthropic_client.messages.count_tokens(...)

Consider using a binary search approach instead of linear 10% reduction, which would converge in O(log n) API calls rather than potentially many iterations.
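
A sketch of what that binary search could look like, under stated assumptions: `truncate_to_token_limit` is a hypothetical name, `count_tokens` is any callable returning a token count for a string (in the notebook it would wrap the `messages.count_tokens` call), and the count is assumed to be non-decreasing as the prefix grows.

```python
from typing import Callable


def truncate_to_token_limit(
    text: str, max_tokens: int, count_tokens: Callable[[str], int]
) -> str:
    """Return the longest prefix of `text` whose token count fits `max_tokens`.

    Assumes count_tokens never decreases as the prefix gets longer.
    """
    if count_tokens(text) <= max_tokens:
        return text
    # Invariant: the prefix of length lo is treated as fitting, length hi does not fit.
    lo, hi = 0, len(text)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if count_tokens(text[:mid]) <= max_tokens:
            lo = mid
        else:
            hi = mid
    return text[:lo]
```

Each loop iteration makes one counting call and halves the search range, so the number of calls grows with the logarithm of the page length. If not even a one-character prefix fits, the sketch returns an empty string.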

2. Token budget tracking overcounts (Medium)
In the main retrieval loop, the token budget is reduced by the full message token count:

token_count = self.messages.count_tokens(
    model=model, messages=[{"role": "user", "content": prompt}]
)
token_budget -= token_count.input_tokens

This counts the entire prompt (which grows with each iteration) rather than just the new tokens generated. The budget will deplete much faster than intended. Consider tracking only the newly generated tokens.

3. huggingface-hub dependency appears unused (Low)
The huggingface-hub>=1.0.0 package is added to pyproject.toml and uv.lock, but I don't see it imported or used anywhere in the notebook code. If it's not needed, removing it would keep the dependency list clean.

4. AIClient sets itself as anthropic_client (Low)

self.search_tool.anthropic_client = self

The AIClient instance assigns itself as the anthropic_client, but AIClient is not an Anthropic instance — it's a wrapper. This works because it also has a messages attribute, but it could be confusing. Consider passing the underlying Anthropic client explicitly.

Positive Changes

Migration from deprecated completions API to messages API
Raw string prefix in regex (rf"..." instead of f"...")
Improved error messages with exception details
Added python-dotenv for environment management

Summary

The migration to the messages API is a necessary update. However, the token-counting approach in the truncation loop could be a performance concern in practice, and the token budget tracking overcounts. The unused huggingface-hub dependency should be removed or justified.

Assessment: request-changes

cmosguy added 3 commits October 24, 2025 09:04

trying to fix to use the new packages from anthropic

865fced

adding the wikipedia and transformers library

b78e11a

Merge commit 'b78e11aa58f2498e4de998bb8ac4cf6da477b24f' into feature/…

862233a

…fix-wikipedia

PedramNavid previously approved these changes Oct 24, 2025

Comment thread third_party/Wikipedia/wikipedia-search-cookbook.ipynb Outdated

fixed the recommended line of code

13c9e99

cmosguy dismissed PedramNavid’s stale review via 13c9e99 October 24, 2025 21:01

cmosguy added 2 commits October 30, 2025 11:07

Merge branch 'main' of github.com:anthropics/claude-cookbooks into fe…

6217692

…ature/fix-wikipedia

synced to main

e2f344d

PedramNavid requested changes Oct 30, 2025

cmosguy added 3 commits October 30, 2025 15:47

fixed based on feedback from PedramNavid

059d4d3

ran ruff format

70d6284

fixed some minor issues

606f19f

cmosguy requested a review from PedramNavid October 31, 2025 18:05

PedramNavid reviewed Nov 3, 2025
