how-i-utilize-ai-agents-article/initial sketch.txt at main · j-wang/how-i-utilize-ai-agents-article · GitHub

I did mention that the Substack isn’t meant to be a how-to or productivity thing. At the same time, I think it’s helpful to illustrate how I use agents at this point. A lot of people have also asked me to illustrate this, because they’ve been somewhat surprised by how extensive my setup is.

I push it to the max so that I fully understand what its capabilities and limits are. Also, I don’t think humans should do what is easily automated anyway. Mainly, though, this prevents the NYU Professor-Uber problem.

Kudos to him for admitting he didn't use the service… though it may or may not have been a mea culpa vs. a somewhat snooty statement about using the subway. Still, you can't really know how disruptive a technology is until you try it. For example, a lot of people are surprised when I say Waymos are both more expensive and take longer to get a ride than Uber in SF. The reason? Well, tourists for one, but also some people refuse to take Ubers anymore. People who have gotten into car accidents. Women who have had bad experiences with drivers. Certain others who are just very nervous and uncomfortable with small talk. There's an archetype that you wouldn't have rationally sketched out on paper: a market that is opened by a new technology.

Intelligent agents are considered by many to be the ultimate goal of AI. The classic book by Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall, 1995) defines the field of artificial intelligence research as "the study and design of rational agents."

We are unquestionably here.
We may not have AGI, but we definitely have AI. That is a bit funny to say, but I make the distinction because we now call so many things "AI," including aspects of classical machine learning.

(One of the other core principles: I iteratively improve both prompts AND context "automatically." Basically, I ask the AI agent to evaluate the diff between its outputs and what I did, and to add more documentation, similar to a coding exercise. Most of this lives in DevonThink for me, because an MCP exists, I like the platform, and it's easier to back up, protect, and sync across devices than just Files. It's NOT perfect. Sometimes it overwrites past things, etc. But as long as I'm careful to read the changes, at least enough to ensure it isn't overwriting, I'm ok with it and continue to improve it, even if some of the guidance doesn't make as MUCH sense to me as a human. Everything is iterative and builds on what came before, and that's a BIG BIG portion of ultimately making it useful. For example, even with less structured context: I have all of my Substack pieces in DevonThink. For a talk, I can simply provide some bullets, ask Claude to pull my relevant Substack pieces, which are categorized and tagged neatly as well, and write up notes for me. This is like an assistant; it's obviously still me giving the talk. After all, I gave the bullets on what I want to cover. I wrote the original Substack pieces. I ultimately decide whether or not to use the sketch of the talk. But I get leverage and some time savings. That only happens because of organization and context. It's essentially RAG, but hyper-specialized for my own context.)
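The "evaluate the diff" step above can be sketched roughly as follows. This is a minimal illustration, not the actual setup: the function names, the prompt wording, and the sample texts are all hypothetical, and it only assumes that the agent's draft and the edited final version are available as plain strings.

```python
import difflib


def draft_vs_final_diff(agent_draft: str, my_final: str) -> str:
    """Unified diff from the agent's draft to the human-edited final text."""
    diff_lines = difflib.unified_diff(
        agent_draft.splitlines(),
        my_final.splitlines(),
        fromfile="agent_draft",
        tofile="my_final",
        lineterm="",
    )
    return "\n".join(diff_lines)


def build_review_prompt(diff_text: str) -> str:
    """Wrap the diff in a request for guidance updates (wording is illustrative)."""
    if not diff_text:
        return "No edits were made to the draft; no guidance changes needed."
    return (
        "Below is a diff between your draft and my final version.\n"
        "Propose additions to the guidance notes that would have produced "
        "my version. Do not overwrite existing guidance.\n\n" + diff_text
    )


# Hypothetical example texts, not taken from any real draft.
example_diff = draft_vs_final_diff(
    "Let's be clear: agents help.\nThey save time.",
    "Agents help.\nThey save time.",
)
print(build_review_prompt(example_diff))
```

Asking for additions rather than rewrites is one way to reduce the overwriting problem mentioned above, though reading the proposed changes is still necessary.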

For the future, I can also have things like the Slack bot getting human feedback for its continual loop (for ML, it's honestly pretty easy for a human to tell if an attempt is really bad or not. Some of these are good just for getting qualitative feedback from nontechnical (well, non-ML/AI) team members, which can help improve things). It requires my approval to reach out to our team on Slack. There are no routes for exfiltration because these are closed.

(I should call out exfiltration, for example of your passwords, secret keys, or bank account info, as a key consideration and the reason this shouldn’t be done lightly or without thought. But it WILL inevitably be solved by companies. OpenClaw was just YOLO (you only live once) on this point: powerful, but also likely to create disaster.)

Claude Code in folders doesn’t have to be about coding. It can just be structures for automation

Morning briefing each day: runs as a cron job (well, launchd, a Mac-native equivalent, because of auth token stuff) and dumps into my Day One journal and DevonThink (… because for some reason the default Gmail connector can only draft. I get why, given prompt injection (covered later), but that defeats its usefulness for sending things to me).
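For readers who have not used launchd, a daily job is described by a property list. The sketch below builds one with Python's standard `plistlib`. The label, script path, and 7:00 schedule are placeholder assumptions, not the real configuration; `Label`, `ProgramArguments`, and `StartCalendarInterval` are standard launchd job keys.

```python
import plistlib


def build_briefing_plist(label: str, program_args: list, hour: int, minute: int) -> bytes:
    """Return XML plist bytes for a launchd job that runs once a day."""
    job = {
        "Label": label,
        "ProgramArguments": list(program_args),
        # Run at the given local time every day.
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
    }
    return plistlib.dumps(job)


# Hypothetical label and script path, for illustration only.
plist_bytes = build_briefing_plist(
    "com.example.morning-briefing",
    ["/bin/sh", "/path/to/morning_briefing.sh"],
    hour=7,
    minute=0,
)
print(plist_bytes.decode("utf-8"))
```

The output would typically be saved under `~/Library/LaunchAgents/` and loaded with `launchctl`. Running as a per-user agent is one plausible reason launchd plays better with auth tokens than cron, but that is a guess about the setup described here.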

Writing: research and junior drafting assistant. Credit to Alejandro Morfiss for some of the inspiration, though from doing it myself and helping others set it up… it’s really personalized. It requires multiple iterations and lots of context and examples (like the whole corpus of my Substack, in categories that can be retrieved for relevant topics) to be any good. I still need to write the outline of what I want (with the broad notes to hit). I still need to go through and substantially edit: ask for different charts or data, check and probe the math (which is often wrong), and change a decent amount of the text to match my own tone. It’s more akin to an intern helping with a first draft than a ghostwriter. Is it useful? Yeah, broadly. It helps get stuff on paper. It helps do research, even if I have to check it. There’s no real question as to who the author is in any meaningful sense, though. Still, it actually saves me time and mental energy, versus before, when it was totally useless.

Really, though, it’s augmented by a lot of the research snippets I keep in DevonThink. I have my own “canon” that I direct it to draw off of instead of random internet data. I also still form the basis of the piece through my initial outline.

Make a GitHub repo showing the process and all of the drafts so people can click through it and see that it’s mainly helpful in the one main step of taking scattered notes and organizing them into a coherent flow… but not necessarily a huge amount more.

It also seems to think a critical part of my writing is to say “Let’s be clear” and “The thing is” a lot. I mean, it’s obviously a tic, ok? But I would argue it isn’t a core part of my writing. I assume, anyway, that it isn’t the reason you’re reading this.

Context matters—especially if it’s tailored by a human and refined over time

I also use it for meeting notes. I have detailed instructions in a directory for Claude Code to classify my notes from my Sony ICD-UX570 recorder using a Deepgram transcript, ask me who each speaker is (with context on what was said), and auto-generate a summary in both Markdown (for me) and PDF (to share).
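The speaker-identification step can be pictured with a small sketch. It assumes the transcript has already been reduced to a list of `(speaker_id, text)` pairs. That shape, the function names, and the sample lines are illustrative assumptions, not the transcription service's actual output format.

```python
def first_quotes(utterances):
    """Map each speaker id to the first thing they said, to help identify them."""
    quotes = {}
    for speaker_id, text in utterances:
        quotes.setdefault(speaker_id, text)
    return quotes


def render_markdown(utterances, speaker_names):
    """Render the transcript as Markdown, merging consecutive turns by one speaker."""
    blocks = []
    previous_id = None
    for speaker_id, text in utterances:
        if speaker_id == previous_id:
            blocks[-1] += " " + text
        else:
            # Fall back to a generic label for anyone not yet identified.
            name = speaker_names.get(speaker_id, f"Speaker {speaker_id}")
            blocks.append(f"**{name}:** {text}")
        previous_id = speaker_id
    return "\n\n".join(blocks)


# Hypothetical transcript fragment.
utterances = [(0, "Hi."), (0, "Thanks for coming."), (1, "Glad to be here.")]
print(first_quotes(utterances))
print(render_markdown(utterances, {0: "Me"}))
```

Showing each unknown speaker's first quote is what makes the "who is this?" question answerable without replaying the recording.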

Ironically, I use Opus for everything, even when it’s unnecessary. The one place I don’t as much is Claude Code: almost all implementation (note, not planning or my “main” chat) is Sonnet or Haiku. I think it tells you something that the most “permissive” case for problems, errors, inaccuracy, and dumbness is code. It can be validated against tests, and I am generally going to review each pull request (set of changes) anyway. That perhaps says something about why adoption is happening so quickly in the coding realm.

Prompt injection (see Simon’s pieces) is a real thing. Generally speaking, I only allow my OWN MCPs, ones that I’ve vetted, or ones that are more broadly trustworthy (e.g., Google Calendar). However, note that even Google Calendar could have a prompt injection planted in it. Because of that, you should be somewhat choosy about what your LLM is allowed to “phone out” to. It’s a pretty good idea to forbid external web calls except for the ones you have explicitly allowed. Generally speaking, this doesn’t hugely impede most use cases, in my experience.
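The "forbid everything except what is explicitly allowed" idea boils down to a default-deny check on outbound URLs. Here is a minimal sketch, assuming a hypothetical allowlist with a single host. In practice this kind of rule is better enforced at the sandbox or network layer than in application code, so treat it as an illustration of the logic only.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; the host is an example, not a recommendation.
ALLOWED_HOSTS = frozenset({"www.googleapis.com"})


def is_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Default-deny: permit only HTTPS requests to exactly matching hosts."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = (parsed.hostname or "").lower()
    # Exact match only, so look-alike hosts with an allowed prefix are rejected.
    return host in allowed_hosts


print(is_allowed("https://www.googleapis.com/calendar/v3/calendars"))
print(is_allowed("https://www.googleapis.com.evil.example/steal"))
```

Matching the full hostname (rather than checking whether the URL merely contains an allowed string) matters because an injected instruction can easily construct a look-alike address.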

This is vs. OpenClaw, obviously

DevonThink, OmniFocus, and Day One MCP Passthrough (use Google OAuth to make it available on Claude Web), as well as my own other MCPs.

DevonThink Capture using tmux Claude Code that I can connect to with Blink.sh at any time to capture a Substack article (go to Substack reader, resize to 650px to remove the sidebar, capture as Web Archive in DevonThink in my To Read inbox)

(Claude just released remote control, which means most people won’t need to do what I’m doing with tmux, though mine still has an advantage of more control and flexibility. But remote control is probably easier for most)
https://code.claude.com/docs/en/remote-control

For the future, auto-start tmux sessions

Claude health information in my folder with tests

Photo of the report on the meetings with pass watchlist and follow up

As a note in the article about security as well: I cannot emphasize this enough. Preset which permissions are available and which are not, and use isolation to ensure that it only does what you want. You cannot rely on the ability to approve every single one of these requests. The AI will do a lot of things and will quickly overwhelm you, so just don’t get into that trap to begin with.

Conclusion idea: What does this mean for people? Block (previously Square) just laid off 4,000 workers, or essentially half of their workforce. Citrini Research published a thought piece that essentially crashed various stocks because it painted a people-less future as AI takes over (which I disagree with). At the same time, an "it's true, trust me" refrain on various social media platforms is "AI adds nothing in productivity. It's clear/been shown in studies." My last piece indirectly addresses Citrini's point and directly rebuts the "nothing in productivity" point. Studies _don't_ show that. Even in fairly conservative cases (Copilot only, from a year ago), we _clearly_ see productivity gains in different studies. This also fits my piece about the "Boring Stage of AI," where AI that does real tasks is both less flashy and likely much more important. That future is already here. We're just getting started. And I also think it's highly unlikely to cause a dystopian scenario of persistent mass unemployment (note: I didn't say it won't cause disruption, especially in the short term).