Here's a plan for a user-triggered browser extension, compatible with Chrome (Manifest V3) and Firefox (WebExtensions API), that leverages the Gemini API for this task:
Goal: A user-triggered browser extension that intelligently auto-fills web forms using a user's stored personal information, leveraging Gemini Flash 2.5's reasoning capabilities to understand form structure and make smart filling decisions.
Key Principles:
- User-Triggered: No automatic background processing. The user explicitly initiates the autofill process.
- Privacy-First: User data is stored locally and sent to the Gemini API only when explicitly requested by the user for autofill, with clear data handling policies.
- Cross-Browser Compatibility: Designed for both Chrome (Manifest V3) and Firefox (WebExtensions API).
- Robust Form Interaction: Handles various input types, dropdowns, radio buttons, and checkboxes.
- Intelligent Mapping: Leverages LLM for semantic understanding of form fields and user data.
The extension will primarily consist of:
- Popup UI (HTML, CSS, JS): The main interface displayed when the user clicks the extension icon.
- User Profile Management: Allows users to input and manage their personal information (e.g., name, address, email, phone). This data should be stored securely (e.g., chrome.storage.local / browser.storage.local).
- Gemini API Key Input: A field for users to enter their personal Gemini API key. This should be securely stored.
- "Autofill Now" Button: Initiates the autofill process for the active tab.
- Status/Feedback: Displays messages about the autofill process (e.g., "Analyzing form...", "Filling fields...", "Completed!").
- Background Script (Service Worker for Chrome MV3, Background Script for Firefox):
- Handles long-running tasks, API calls to Gemini, and communication between the popup/options page and content script.
- Listens for messages from the popup (e.g., "autofill_request").
- Manages the Gemini API calls.
- Responsible for injecting and communicating with the content script.
- Content Script (JavaScript):
- Injected into the active webpage to interact with the DOM.
- Reads form structure and data.
- Receives instructions from the background script to fill fields.
- Manipulates input, select, and other form elements.
- Options Page (HTML, CSS, JS - Optional, but good for more settings):
- More persistent settings (e.g., default profile, advanced mapping rules, privacy settings).
- User Input: Users enter their personal data (e.g., First Name: John, Last Name: Doe, Email: john.doe@example.com, Address Line 1: 123 Main St, City: Anytown, State: CA, Zip: 90210, Phone: 555-123-4567, Gender: Male, Date of Birth: 1990-01-15).
- Local Storage: This data is stored in chrome.storage.local (Chrome) or browser.storage.local (Firefox). This keeps the sensitive data on the user's machine.
- API Key Storage: The Gemini API key is also stored securely in local storage.
- Click Extension Icon: User clicks the browser extension icon.
- Popup Displays: The popup UI appears.
- "Autofill Now" Click: User clicks the "Autofill Now" button.
- Popup to Background Script: The popup sends a message to the background script, including:
- The user's stored personal data.
- The active tab ID.
- The Gemini API key.
- Background Script to Content Script (Phase 1: DOM Extraction):
- The background script injects the content script (if not already injected) into the active tab.
- Sends a message to the content script asking it to extract form information.
- Content Script (Phase 1: DOM Extraction):
- DOM Traversal: Iterates through the current webpage's DOM to find all relevant form elements (<input>, <textarea>, <select>, <button>, <label>).
- Feature Extraction per Field: For each identified form element, it extracts:
  - id attribute
  - name attribute
  - type attribute (for inputs)
  - placeholder attribute
  - Associated <label> text (using the for attribute or traversing parent/sibling nodes).
  - Any aria-label or aria-labelledby attributes.
  - Values for <option> tags within <select> elements.
  - Current value (if any).
  - Read-only/disabled status.
  - XPath or unique CSS Selector: Crucial for reliable element identification later. This is often the hardest part to make robust across arbitrary sites.
- Structure Representation: Organizes this extracted data into a structured format (e.g., a JSON array of objects, each representing a form field).
- Content Script to Background Script: Sends this structured form data back to the background script.
- Background Script to Gemini API:
- Constructs a detailed prompt for Gemini Flash 2.5 using:
- The structured form data from the content script.
- The user's structured personal information.
- System Instruction: "You are an AI assistant specialized in intelligently filling web forms. Your goal is to match user data to form fields, handle variations, and suggest precise values. Always use the provided tool definitions to output your actions."
- Tool Definitions (Function Calling): Define functions that the LLM can "call" to represent actions:
  - fill_text_input(xpath_or_selector: str, value: str, field_type: str): For text, email, number, password, textarea.
  - select_dropdown_option(xpath_or_selector: str, value: str): For <select> elements.
  - check_radio_or_checkbox(xpath_or_selector: str, checked: bool): For radio and checkbox types.
  - (Optional) click_button(xpath_or_selector: str): For "Next" or "Submit" buttons, though the initial version should focus on filling, not submitting.
- User Query: "Based on the form elements provided and the user's information, generate a list of actions (tool calls) to fill out the form. Prioritize exact matches but use your general knowledge for semantic matching (e.g., 'Surname' for 'Last Name'). For dropdowns and radio buttons, pick the most appropriate option if a direct match isn't found, or infer a common default. Do not include actions for fields where you have no relevant user data."
- Makes an API call to Gemini Flash 2.5 with this prompt and the user's Gemini API key.
- Gemini API Response: Gemini returns a JSON object containing the recommended tool calls (e.g., [{ "tool_name": "fill_text_input", "args": { "xpath_or_selector": "/html/body/div[2]/form/input[1]", "value": "John", "field_type": "text"}}, ...]).
- Background Script to Content Script (Phase 2: Action Execution):
- The background script receives Gemini's response.
- Validates the structure of the tool calls.
- Sends these validated actions to the content script.
- Content Script (Phase 2: Action Execution):
- Receives the list of actions from the background script.
- For each action:
- Locates the element using the provided XPath or selector.
- Performs the specified action (e.g., sets element.value, selects an option, clicks an element).
- Visually highlights the filled fields for user review (e.g., a temporary green border).
- Sends a "fill_complete" message back to the background script.
- Popup Updates: The popup UI updates to "Autofill Complete! Please review."
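The prompt-construction step in the flow above can be sketched as a pure function that assembles a Gemini `generateContent` request body. This is a minimal sketch, not a definitive implementation: `FormField`, `Profile`, and `buildGeminiRequest` are hypothetical names, and the field layout (`systemInstruction`, `contents`, `tools.functionDeclarations`) is assumed to follow the Gemini REST API's function-calling format.

```typescript
// Hypothetical shape of one extracted form field (not defined in the plan itself).
interface FormField {
  selector: string;
  tagName: string;
  type?: string;
  name?: string;
  label?: string;
  options?: { text: string; value: string }[];
}

// Hypothetical profile type: any JSON-serializable key/value data.
type Profile = Record<string, unknown>;

const SYSTEM_INSTRUCTION =
  'You are an AI assistant specialized in intelligently filling web forms. ' +
  'Always use the provided tool definitions to output your actions.';

// Small helper to keep the three declarations short.
function declareTool(name: string, description: string, extra: Record<string, { type: string }>) {
  return {
    name,
    description,
    parameters: {
      type: 'OBJECT',
      properties: { xpath_or_selector: { type: 'STRING' }, ...extra },
      required: ['xpath_or_selector', ...Object.keys(extra)],
    },
  };
}

// The three filling tools from the plan, expressed as function declarations.
const TOOL_DECLARATIONS = [
  declareTool('fill_text_input', 'Fill a text-like input or textarea.', {
    value: { type: 'STRING' },
    field_type: { type: 'STRING' },
  }),
  declareTool('select_dropdown_option', 'Choose an option in a <select>.', {
    value: { type: 'STRING' },
  }),
  declareTool('check_radio_or_checkbox', 'Set a radio button or checkbox.', {
    checked: { type: 'BOOLEAN' },
  }),
];

// Builds the JSON body; sending it is left to the background script.
function buildGeminiRequest(fields: FormField[], profile: Profile) {
  const userText =
    `Form fields:\n${JSON.stringify(fields)}\n\n` +
    `User information:\n${JSON.stringify(profile)}\n\n` +
    'Generate tool calls to fill the form. Skip fields with no relevant user data.';
  return {
    systemInstruction: { parts: [{ text: SYSTEM_INSTRUCTION }] },
    contents: [{ role: 'user', parts: [{ text: userText }] }],
    tools: [{ functionDeclarations: TOOL_DECLARATIONS }],
  };
}
```

Keeping the builder free of `fetch` and `chrome.*` calls makes it easy to unit test; the background script would serialize the returned object with `JSON.stringify` and POST it to the model's `generateContent` endpoint.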
- Manifest File:
- Chrome: manifest.json with manifest_version: 3.
- Firefox: manifest.json with manifest_version: 2 (Firefox is transitioning to MV3, but MV2 is still widely supported and simpler for initial development).
- Differences in permissions and background script registration will need conditional logic or separate manifest files.
- API Differences:
- chrome.* vs. browser.*: Most WebExtension APIs are cross-compatible by using browser.* for Firefox and falling back to chrome.* for Chrome, or by using a polyfill.
- Background Script: Chrome MV3 uses Service Workers (event-driven, non-persistent), while Firefox traditionally uses persistent background scripts. This is the biggest architectural difference.
  - Chrome MV3: Service workers run only when needed. State management might require chrome.storage.local or IndexedDB. No direct DOM access; all interaction must be via content scripts.
  - Firefox MV2: Persistent background scripts can hold state. Direct DOM access is also restricted, so content scripts are still needed.
- Content Script Injection: Use chrome.scripting.executeScript (Chrome MV3) or browser.tabs.executeScript (Firefox MV2).
- Messaging: chrome.runtime.sendMessage/onMessage and browser.runtime.sendMessage/onMessage are largely compatible.
- Storage: chrome.storage.local and browser.storage.local are compatible.
Strategy for Cross-Browser:
- Develop for Firefox first (MV2): Often more forgiving and allows for faster iteration.
- Port to Chrome (MV3): Address service worker non-persistence and host_permissions changes carefully. Use a build script to generate separate manifests and bundle for each.
- Polyfills/Abstractions: Use libraries or custom wrappers to abstract chrome vs. browser API differences.
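A custom wrapper for the chrome vs. browser difference can be very small. The sketch below is one possible shape, assuming a hypothetical helper named `resolveExtensionApi` and a deliberately minimal `ExtensionApi` type. It takes the global scope as a parameter so it can be exercised outside a browser.

```typescript
// Minimal structural type covering only what this sketch touches (an assumption,
// not the full WebExtension API surface).
interface ExtensionApi {
  runtime: { id?: string };
}

interface GlobalScope {
  browser?: ExtensionApi;
  chrome?: ExtensionApi;
}

// Prefer the `browser` namespace when present (Firefox), otherwise fall back to `chrome`.
function resolveExtensionApi(scope: GlobalScope): ExtensionApi {
  if (scope.browser && scope.browser.runtime) return scope.browser;
  if (scope.chrome && scope.chrome.runtime) return scope.chrome;
  throw new Error('No WebExtension API found; is this running inside an extension?');
}

// Inside the extension you would call it once, for example:
//   const api = resolveExtensionApi(globalThis as GlobalScope);
```

A wrapper like this only picks a namespace. It does not paper over behavioral differences between the two, so a polyfill such as webextension-polyfill may still be the simpler choice.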
- User Data:
- Local Storage: All user personal information and API keys MUST be stored locally using chrome.storage.local (or browser.storage.local), which is isolated per extension. Never store on a remote server.
- Encryption (Optional but Recommended): For highly sensitive data, consider encrypting user profiles within local storage using a user-provided passphrase, though this adds complexity.
- Ephemeral API Key Usage: The API key is sent with each request to Gemini, but should not be persistently logged or exposed.
- Gemini API Communication:
- HTTPS Only: All communication with the Gemini API must be over HTTPS.
- Minimizing Data Sent: Only send the necessary DOM structure and user data to the LLM. Avoid sending entire page content unless absolutely necessary for context, and always filter out highly sensitive, non-form-related information.
- Data Usage Policy: Clearly inform users that their form data and parts of the webpage DOM will be sent to Google's Gemini API for processing when they trigger autofill. Link to Google's AI privacy policies.
- Permissions:
- Least Privilege: Request only the minimum necessary permissions in the manifest.json.
  - activeTab: Allows access to the current tab only when the user invokes the extension (e.g., clicks the icon), which is perfect for this user-triggered approach.
  - storage: For storing user profiles and API keys locally.
  - scripting (Chrome MV3) / tabs (Firefox): For injecting content scripts.
  - host_permissions: Specific URLs or <all_urls> for content script injection. <all_urls> is necessary for arbitrary forms, but users should be warned.
- Content Script Security:
- Isolation: Content scripts run in an isolated world, preventing direct access to page JavaScript variables, but they can still interact with the DOM.
- Input Sanitization: Any data retrieved from the webpage should be treated as untrusted. Ensure robust sanitization if any part of it is used to construct dynamic code or displayed in the extension UI.
- Error Handling: Implement robust error handling for API calls, network issues, and unexpected DOM structures.
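The "Minimizing Data Sent" point can be made concrete with a small filter that runs before the form description leaves the browser. This is a sketch under stated assumptions: `ExtractedField` and `minimizeFormData` are hypothetical names, and the list of skipped input types is an illustrative choice rather than something the plan prescribes.

```typescript
// Hypothetical shape of one field as produced by the content script.
interface ExtractedField {
  selector: string;
  tagName: string;
  type: string;
  name?: string;
  label?: string;
  value?: string;
  disabled?: boolean;
  readOnly?: boolean;
}

// Input types that are never fill targets and may carry tokens or other page secrets.
const SKIPPED_TYPES = new Set(['hidden', 'file', 'submit', 'button', 'reset', 'image']);

// Drops non-fillable fields and strips current values, which the model does not need.
function minimizeFormData(fields: ExtractedField[]): ExtractedField[] {
  return fields
    .filter(f => !SKIPPED_TYPES.has(f.type.toLowerCase()) && !f.disabled && !f.readOnly)
    .map(f => {
      const copy: ExtractedField = { ...f };
      delete copy.value; // pre-filled values stay on the user's machine
      return copy;
    });
}
```

Hidden inputs often hold CSRF tokens or session identifiers, so excluding them reduces what is shared with the API without affecting which visible fields can be filled.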
Phase 1: Core User Profile & Storage
- Create basic popup UI for adding/editing user profiles.
- Implement chrome.storage.local/browser.storage.local for saving/loading profiles.
- Implement Gemini API key input and storage.
- Cross-browser test: Ensure storage and basic UI work on both Chrome and Firefox.
Phase 2: DOM Extraction (Content Script)
- Develop a content script to identify and extract relevant form elements (inputs, textareas, selects, labels, types, placeholders).
- Focus on generating robust XPath/CSS selectors for each element.
- Implement messaging from background script to content script to trigger extraction, and back to send data.
- Testing: Test on various complex forms (different websites, different frameworks like React, Angular, plain HTML).
Phase 3: Gemini API Integration & LLM Prompting
- Set up background script to receive form data and user profile.
- Craft initial Gemini Flash 2.5 prompt with tool definitions.
- Implement API call to Gemini.
- Parse Gemini's tool call response.
- Iterative Prompt Engineering: This will be the most time-consuming part. Experiment with different prompt structures, examples, and system instructions to get reliable and intelligent autofill results from Gemini. Test edge cases (e.g., "M" vs. "Male", "USA" vs. "United States of America").
Phase 4: Form Filling (Content Script)
- Implement content script logic to receive Gemini's tool calls and execute them on the DOM.
- Add visual feedback (e.g., temporary highlights) to filled fields.
- Testing: Test filling on various forms, ensuring correct values are set for all field types.
Phase 5: User Experience & Refinements
- Add "Autofill Now" button to the popup.
- Add loading states and success/error messages in the popup.
- Implement an optional "Review & Confirm" step where the user can see suggested fills before they are applied.
- Add a simple onboarding guide for users.
Phase 6: Cross-Browser Polish & Release
- Finalize manifest files for both Chrome MV3 and Firefox.
- Address any remaining API compatibility issues.
- Optimize performance (e.g., minimize content script footprint, handle large DOMs efficiently).
- Write clear privacy policy and instructions for API key usage.
- Prepare for submission to Chrome Web Store and Firefox Add-ons.
Here's a guide/spec on how to implement the AI Autofill Pro extension using this boilerplate, focusing on integration rather than rewriting the core logic from the previous plan.
This guide outlines how to integrate the AI Autofill Pro logic into the provided boilerplate. We will leverage the boilerplate's structure for background scripts, content scripts, and popup UI, while adding our specific logic for form analysis, LLM interaction, and form filling.
Boilerplate Overview:
- src/pages/popup/: For the extension's primary popup UI (React application).
- src/pages/background/: For the Manifest V3 Service Worker (background script).
- src/pages/content/: For the content script injected into web pages.
- manifest.js: Where the manifest.json is generated (important for permissions and script registration).
- Vite Configuration: Handles building and HMR.
Clone the Boilerplate:

```bash
git clone https://github.com/Jonghakseo/chrome-extension-boilerplate-react-vite.git
cd chrome-extension-boilerplate-react-vite
npm install # or yarn
```
Inspect chrome-extension/manifest.ts:
- Ensure manifest_version: 3 for Chrome compatibility.
- Add necessary permissions:
  - "activeTab": Crucial for user-triggered interaction with the current tab.
  - "storage": For storing user profiles and API keys.
  - "scripting": For injecting the content script dynamically.
  - "host_permissions": ["<all_urls>"]: Required for the content script to run on any webpage to analyze forms. This is a sensitive permission, and users should be informed.
- Verify background.service_worker points to background.js.
- Verify content_scripts are configured correctly. The current chrome-extension/manifest.ts already has a single content script entry: js: ['content/index.iife.js'].
Clean up / Rename Boilerplate Components:
- The main popup component is pages/popup/src/Popup.tsx. Adjust imports accordingly.
- Remove any boilerplate example logic from chrome-extension/src/background/index.ts or pages/content/src/matches/ scripts that is not relevant to autofill.
This will be your main user interface.
- User Profile Management:
  - State Management: Use React useState and useEffect hooks for managing the plaintext user data in a single textarea.
  - Persistence:
    - On component mount, load saved user data from chrome.storage.local using chrome.storage.local.get(). The key for profile data is profileText.
    - On input changes, update local React state.
    - On a "Save Profile" button click, save the current state with chrome.storage.local.set().
  - UI Elements: A textarea for plaintext user data, a "Save Profile" button.
- Gemini API Key Input:
  - Dedicated input field for the user's Gemini API key.
  - Similarly, load and save this key to chrome.storage.local. Emphasize never hardcoding this key or bundling it with the extension.
- "Autofill Now" Button:
  - This is the trigger for the entire autofill process.
  - On click, it will:
    - Retrieve the current plaintext user profile and Gemini API key from local storage.
    - Send a message to the background service worker (chrome.runtime.sendMessage) with the type AUTOFILL_REQUEST and the payload containing the user data (as plaintext) and API key.
    - Update the popup UI to show a "Loading..." or "Analyzing..." status.
- Status and Feedback Display:
  - A dedicated area (e.g., a <div>) to show messages like "Autofill Complete!", "Error: Invalid API Key", "Analyzing form...", "Please review the filled fields."
  - Use useState to manage the status message.
- User Consent/Disclaimer:
- Prominently display a disclaimer about data being sent to Google's Gemini API for processing when autofill is triggered. Include a link to Google's privacy policy.
This script acts as the orchestrator and the bridge to the Gemini API.
Message Listener:
- Use chrome.runtime.onMessage.addListener to listen for messages from the popup and the content script.
- Handle AUTOFILL_REQUEST (from Popup):
  - Extract profile (plaintext string) and geminiApiKey from the message.
  - Inject Content Script (if not already): The content script is already declared in manifest.ts to run on document_idle. You will communicate with it via chrome.tabs.sendMessage.
  - Request Form Data: Send a message to the content script (e.g., type EXTRACT_FORM_DATA).
- Handle FORM_DATA_EXTRACTED (from Content Script):
  - Receive the formStructure (the structured JSON of form elements) from the content script.
  - Gemini API Call:
    - Construct the detailed prompt for Gemini Flash 2.5 (as described in the previous plan: system instruction, tool definitions, user query, form structure, user data).
    - Make the fetch request to the Gemini API endpoint using the geminiApiKey.
    - Implement robust try-catch blocks for API errors (network issues, invalid key, rate limits).
    - The prompt should explicitly state that the user's personal information is provided as plaintext and the model should extract relevant details from it.
  - Process Gemini Response:
    - Parse the JSON response from Gemini.
    - Validate that the tool_calls array is present and correctly structured.
  - Send Actions to Content Script: Send a message to the content script (e.g., type EXECUTE_ACTIONS) with the array of tool_calls returned by Gemini.
- Handle AUTOFILL_COMPLETE / AUTOFILL_ERROR (from Content Script):
  - Receive confirmation or error messages from the content script after filling.
  - Send a final message back to the popup (e.g., type UPDATE_POPUP_STATUS) to update its UI.
API Key Management:
- The API key is passed through the background script for the API call, but should never be logged or stored persistently by the background script itself. It's ephemeral for the API call.
Error Handling & Fallbacks:
- Implement robust error handling for network issues, API errors, and unexpected content script behavior.
- Communicate errors back to the popup for user feedback.
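The "Process Gemini Response" step can be sketched as a defensive parser that turns a raw response into the `{ tool_name, args }` list the content script expects. This sketch assumes the response follows the Gemini function-calling shape (`candidates[].content.parts[].functionCall` with `name` and `args`); `extractToolCalls`, `ToolCall`, and the allow-list are hypothetical names introduced for illustration.

```typescript
interface ToolCall {
  tool_name: string;
  args: Record<string, unknown>;
}

// Only the filling tools from the plan are accepted; anything else is dropped.
const ALLOWED_TOOLS = new Set(['fill_text_input', 'select_dropdown_option', 'check_radio_or_checkbox']);

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v);
}

// Walks the response defensively: model output is untrusted input.
function extractToolCalls(response: unknown): ToolCall[] {
  if (!isRecord(response)) return [];
  const candidates = response.candidates;
  if (!Array.isArray(candidates)) return [];

  const calls: ToolCall[] = [];
  for (const candidate of candidates) {
    const content = isRecord(candidate) ? candidate.content : undefined;
    const parts = isRecord(content) ? content.parts : undefined;
    if (!Array.isArray(parts)) continue;

    for (const part of parts) {
      const fc = isRecord(part) ? part.functionCall : undefined;
      if (!isRecord(fc)) continue; // plain text parts are ignored
      const name = fc.name;
      const args = fc.args;
      if (typeof name !== 'string' || !ALLOWED_TOOLS.has(name)) continue;
      if (!isRecord(args) || typeof args.xpath_or_selector !== 'string') continue;
      calls.push({ tool_name: name, args });
    }
  }
  return calls;
}
```

An empty result is a useful signal in its own right: the background script can report "no fillable fields matched" to the popup instead of sending an empty EXECUTE_ACTIONS message.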
This script directly interacts with the webpage's DOM.
DOM Extraction Logic:
- Listen for the EXTRACT_FORM_DATA message from the background script.
- Implement functions to traverse the DOM and extract form elements.
- Crucial: Generating stable and unique selectors. XPath is often more robust than simple CSS selectors for complex pages.
  - For each element, store its tagName, type, id, name, placeholder, value, text from associated <label>, aria-label, aria-labelledby, and most importantly, a unique XPath or highly specific CSS selector.
  - For <select> elements, include the options (text and value).
  - For radio and checkbox elements, identify their value and checked status. Group radio buttons by name.
- Send the extracted structured data (formData) back to the background script using chrome.runtime.sendMessage.

```typescript
// in pages/content/src/index.ts
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type === 'EXTRACT_FORM_DATA') {
    const formData = extractFormData();
    // The reply goes out as a separate message; sendResponse is not used,
    // so the listener must not return true.
    chrome.runtime.sendMessage({ type: 'FORM_DATA_EXTRACTED', payload: formData });
  } else if (message.type === 'EXECUTE_ACTIONS') {
    executeActions(message.payload); // Implemented in the action execution step
  }
});

function extractFormData() {
  const formElements: any[] = [];
  document.querySelectorAll('input, textarea, select').forEach(element => {
    const el = element as HTMLInputElement | HTMLTextAreaElement | HTMLSelectElement;
    const data: any = {
      tagName: el.tagName.toLowerCase(),
      type: el.type || el.tagName.toLowerCase(), // 'input' type can be missing
      id: el.id,
      name: el.name,
      placeholder: 'placeholder' in el ? el.placeholder : '', // <select> has no placeholder
      selector: getElementXPath(el) || getCssSelector(el), // Prioritize XPath, then CSS
      // Add label text, aria-labels, etc.
    };
    if (el instanceof HTMLSelectElement) {
      data.options = Array.from(el.options).map(opt => ({
        text: opt.textContent,
        value: opt.value,
      }));
    }
    // Add specific logic for radio/checkbox groups if needed
    formElements.push(data);
  });
  return formElements;
}

// Helper function to get XPath (can be complex, use a robust library or implement carefully)
// Example (simplified; assumes ids contain no single quotes):
function getElementXPath(element: Element): string {
  if (element.id !== '') return `//*[@id='${element.id}']`;
  if (element === document.body) return '/html/body';
  const parent = element.parentElement;
  if (!parent) return ''; // Fallback if no parent
  let ix = 0; // Number of preceding siblings with the same tag
  for (const sibling of Array.from(parent.children)) {
    if (sibling === element) {
      return `${getElementXPath(parent)}/${element.tagName.toLowerCase()}[${ix + 1}]`;
    }
    if (sibling.tagName === element.tagName) ix++;
  }
  return '';
}
// getCssSelector can also be implemented or use a library
```
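One detail worth handling when generating id-based XPath selectors is quoting: an id that contains a quote character would break a naive `//*[@id='...']` expression. The sketch below shows one common way to build a safe XPath 1.0 string literal; `xpathLiteral` and `xpathForId` are hypothetical helper names, not part of the boilerplate.

```typescript
// Builds an XPath 1.0 string literal for an arbitrary value.
// XPath 1.0 has no escape character, so a value containing both quote kinds
// must be assembled with concat().
function xpathLiteral(value: string): string {
  if (!value.includes("'")) return `'${value}'`;
  if (!value.includes('"')) return `"${value}"`;
  // Split on single quotes and re-join the pieces with a double-quoted "'" between them.
  const pieces = value.split("'").map(piece => `'${piece}'`);
  return `concat(${pieces.join(`, "'", `)})`;
}

// Produces a selector that matches any element with the given id.
function xpathForId(id: string): string {
  return `//*[@id=${xpathLiteral(id)}]`;
}
```

For example, `xpathForId('email')` yields `//*[@id='email']`, while an id such as `o'brien` is wrapped in double quotes instead.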
Action Execution Logic:
- Listen for the EXECUTE_ACTIONS message from the background script.
- Implement the executeActions function:
  - Iterate through the payload (Gemini's tool calls).
  - For each action, use the selector (XPath/CSS selector) to find the element on the page.
  - Perform the action (e.g., set value, selectedIndex, checked status).
  - Add visual feedback (e.g., element.style.border = '2px solid green').

```typescript
// in pages/content/src/index.ts
async function executeActions(actions: any[]) {
  for (const action of actions) {
    const { tool_name, args } = action;
    // Assumes the selector is an XPath, as produced by getElementXPath
    const element = document.evaluate(
      args.selector,
      document,
      null,
      XPathResult.FIRST_ORDERED_NODE_TYPE,
      null,
    ).singleNodeValue as HTMLElement | null;

    if (!element) {
      console.warn(`Element not found for selector: ${args.selector}`);
      continue;
    }

    if (tool_name === 'fill_text_input') {
      (element as HTMLInputElement | HTMLTextAreaElement).value = args.value;
      element.dispatchEvent(new Event('input', { bubbles: true })); // Notify input listeners
    } else if (tool_name === 'select_dropdown_option') {
      (element as HTMLSelectElement).value = args.value;
    } else if (tool_name === 'check_radio_or_checkbox') {
      (element as HTMLInputElement).checked = args.checked;
    }
    // Trigger a change event for every field type so page scripts can react
    element.dispatchEvent(new Event('change', { bubbles: true }));

    // Add temporary visual feedback
    element.style.outline = '2px solid #4CAF50'; // Green outline
    element.style.transition = 'outline 0.3s ease-out';
    setTimeout(() => {
      element.style.outline = 'none';
    }, 1500); // Remove highlight after 1.5 seconds
  }
  // Matches the AUTOFILL_COMPLETE message type handled by the background script
  chrome.runtime.sendMessage({ type: 'AUTOFILL_COMPLETE' });
}
```
- Start Development Server: Run npm run dev. This will build the extension and provide HMR for popup changes.
- Load Unpacked Extension:
  - Chrome: Go to chrome://extensions/, enable "Developer mode," click "Load unpacked," and select the dist folder generated by Vite.
  - Firefox: Go to about:debugging#/runtime/this-firefox, click "Load Temporary Add-on," and select the manifest.json file inside the dist folder. (Note: Firefox's temporary add-on won't persist across browser restarts, so you'll need to reload it frequently for testing.)
- Iterative Testing:
  - Phase 1 (Popup & Storage): Test saving/loading user data and API key in the popup.
  - Phase 2 (DOM Extraction): Open various complex websites. Trigger the EXTRACT_FORM_DATA message (initially via a dummy button in popup, or direct console command in background script). Inspect the formStructure logged by the content script to ensure it's accurate and robust. This is critical.
  - Phase 3 (Gemini Integration): Test the full roundtrip: Popup -> Background -> Gemini -> Background. Verify Gemini's response is structured correctly.
  - Phase 4 (Form Filling): Test filling various input types (text, email, password, number, date), dropdowns, radio buttons, and checkboxes on different websites. Pay attention to events (e.g., change events for React forms).
- Debugging:
  - Popup: Standard React DevTools, browser's console for the extension popup.
  - Background Script: Open chrome://extensions/, click the "Service worker" link for your extension (or "Inspect" for the Firefox background script).
  - Content Script: Open the developer tools of the webpage itself (F12). You'll see your content script's console logs there.
The boilerplate primarily targets Chrome MV3. For full Firefox compatibility:
- chrome-extension/manifest.ts (Conditional Generation): You might need to adjust chrome-extension/manifest.ts to generate a Manifest V2 for Firefox during its build step (Firefox is still in transition to MV3). This could involve:
  - A different manifest_version.
  - A potentially different background key (Firefox MV2 uses a scripts array for background pages, not service_worker).
  - Different permissions if necessary.
  - The boilerplate's manifest.ts might have a way to handle this, or you may need to introduce a separate manifest.firefox.ts.
- API Compatibility:
  - browser vs. chrome: For chrome.storage, chrome.runtime, chrome.tabs, etc., consider using a polyfill like webextension-polyfill or manually checking if (typeof browser !== 'undefined') to use browser.storage for Firefox and chrome.storage for Chrome.
  - chrome.scripting: This API is how Chrome MV3 injects scripts. Under Firefox MV2, injection is typically done with browser.tabs.executeScript, so your background script will need conditional logic or a wrapper for this.
  - Service Worker vs. Persistent Background Script: The boilerplate's service worker will work for Chrome. For Firefox, if you stick to MV2, the background script will be persistent. This doesn't affect your core logic too much, but be aware of state management differences.
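Conditional manifest generation can be sketched as a single function that returns a different object per target. This is an illustrative outline only: `buildManifest` and the popup path are hypothetical, and the boilerplate may already provide its own mechanism. The content script and background file names are taken from the manifest notes earlier in this guide.

```typescript
type Target = 'chrome' | 'firefox';

// Returns a plain object that a build step could serialize to manifest.json.
function buildManifest(target: Target): Record<string, unknown> {
  const common = {
    name: 'AI Autofill Pro',
    version: '0.1.0',
    content_scripts: [
      { matches: ['<all_urls>'], js: ['content/index.iife.js'], run_at: 'document_idle' },
    ],
  };

  if (target === 'chrome') {
    // Manifest V3: service worker background, separate host_permissions, `action` key.
    return {
      ...common,
      manifest_version: 3,
      permissions: ['activeTab', 'storage', 'scripting'],
      host_permissions: ['<all_urls>'],
      background: { service_worker: 'background.js' },
      action: { default_popup: 'popup/index.html' }, // hypothetical popup path
    };
  }

  // Manifest V2 (Firefox): background scripts array, host patterns live in
  // `permissions`, and the toolbar button is declared as `browser_action`.
  return {
    ...common,
    manifest_version: 2,
    permissions: ['activeTab', 'storage', '<all_urls>'],
    background: { scripts: ['background.js'] },
    browser_action: { default_popup: 'popup/index.html' }, // hypothetical popup path
  };
}
```

A build script could call `buildManifest` once per target and write each result into its own output folder, which keeps the shared fields in one place.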
This guide provides a detailed roadmap for integrating the AI Autofill Pro logic into the chrome-extension-boilerplate-react-vite. The key is to leverage the boilerplate's established structure for communication and rendering, while focusing your efforts on the core logic of DOM analysis, prompt engineering for Gemini, and precise form manipulation.
The Jonghakseo/chrome-extension-boilerplate-react-vite is an excellent choice as it provides a solid foundation with React, TypeScript, Vite, and is configured for Manifest V3, supporting both Chrome and Firefox (though Firefox's MV3 transition might require slight adjustments depending on its current state).
Here's a guide/spec for implementing the AI Autofill Pro extension using this boilerplate:
This guide assumes you have cloned the boilerplate and have a basic understanding of its structure.
Clone the Boilerplate:

```bash
git clone https://github.com/Jonghakseo/chrome-extension-boilerplate-react-vite.git ai-autofill-pro
cd ai-autofill-pro
pnpm install # or yarn install / npm install
```

Update package.json:
- Change name, description, version, and author to reflect "AI Autofill Pro".
Update manifest.ts (or manifest.json):
- Locate src/manifest.ts (the boilerplate uses TypeScript for manifest generation, which is great).
- Name & Description: Update name and description.
- Permissions:
  - Add storage for local data storage.
  - Add activeTab for user-triggered interaction with the current tab.
  - Add scripting for injecting content scripts (Chrome MV3).
  - For Firefox, ensure host_permissions are correctly configured (e.g., <all_urls>). Note that web_accessible_resources is a separate setting and does not grant host access.
  - Example snippet for permissions and host_permissions in manifest.ts:

```typescript
// in src/manifest.ts
export default defineManifest({
  // ... existing fields
  permissions: ['storage', 'activeTab', 'scripting'], // 'scripting' for Chrome MV3, 'tabs' for Firefox MV2
  host_permissions: ['<all_urls>'], // Necessary for content script to interact with any website
  // ...
  content_scripts: [
    {
      matches: ['<all_urls>'], // Inject content script on all URLs
      js: ['src/pages/content/index.ts'], // Ensure this path is correct for your content script
      run_at: 'document_idle', // Run when the DOM is mostly ready
    },
  ],
  // ...
});
```

Remove Unused Boilerplate Code:
- Review src/pages/newtab and src/pages/devtools. If they exist and are not needed for this extension, you can remove them and their corresponding entries in manifest.ts.
This is where the user will interact with the extension.
src/pages/popup/Popup.tsx (or similar main component):
- Layout: Use React components to create the UI.
- User Profiles/Data Input:
  - Implement forms for users to input their personal data (Name, Email, Address components, etc.).
  - Use React state (e.g., useState) to manage form input.
  - Upon saving, use chrome.storage.local.set to persist data.
  - Load existing data using chrome.storage.local.get when the popup opens.
  - Example structure for stored data:

```json
{
  "profiles": {
    "default": {
      "firstName": "John",
      "lastName": "Doe",
      "email": "john.doe@example.com",
      "address": {
        "line1": "123 Main St",
        "city": "Anytown",
        "state": "CA",
        "zip": "90210",
        "country": "USA"
      },
      "phone": "555-123-4567"
    }
  },
  "geminiApiKey": "YOUR_API_KEY_HERE"
}
```

- Gemini API Key Input: A dedicated input field for the user's Gemini API key, stored alongside profiles.
- "Autofill Now" Button:
  - Attach an onClick handler.
  - When clicked, retrieve the current active profile and Gemini API key from local storage.
  - Send a message to the Background Script to initiate the autofill process.

```javascript
// Inside Popup.tsx
import { sendMessage } from '@src/shared/utils/messaging'; // Assuming a shared messaging utility

const handleAutofill = async () => {
  // 1. Get current active tab ID
  const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
  if (!tab || !tab.id) {
    console.error('No active tab found.');
    return;
  }

  // 2. Retrieve user data and API key from storage
  const storedData = await chrome.storage.local.get(['profiles', 'geminiApiKey']);
  const currentProfile = storedData.profiles?.default; // Or allow user to select profile
  const geminiApiKey = storedData.geminiApiKey;
  if (!currentProfile || !geminiApiKey) {
    // Show error message to user in UI
    return;
  }

  // 3. Send message to background script
  sendMessage({
    type: 'AUTOFILL_REQUEST',
    payload: {
      tabId: tab.id,
      profile: currentProfile,
      apiKey: geminiApiKey,
    },
  });

  // 4. Update UI to show "Autofilling..." status
};

// ... JSX for button
<button onClick={handleAutofill}>Autofill Now</button>
```

- Status Display: Add a React state variable to show messages like "Analyzing form...", "Filling fields...", "Autofill Complete!", or error messages.
This acts as the central orchestrator and API caller.
Message Listener:
- Listen for messages from the popup (AUTOFILL_REQUEST).
- Listen for messages from the content script (FORM_DATA_EXTRACTED, FILL_COMPLETE).

```typescript
// in src/pages/background/index.ts
chrome.runtime.onMessage.addListener((message, sender) => {
  if (message.type === 'AUTOFILL_REQUEST') {
    // Results are reported through follow-up messages rather than sendResponse,
    // so the listener does not need to return true.
    handleAutofillRequest(message.payload);
  } else if (message.type === 'FORM_DATA_EXTRACTED') {
    const tabId = sender.tab?.id;
    if (tabId !== undefined) handleFormDataExtracted(message.payload, tabId);
  } else if (message.type === 'FILL_COMPLETE') {
    // Update popup UI via another message or direct state if possible
    // (e.g., using a state management library accessible to both popup and background)
  }
});
```
- `handleAutofillRequest` Function:
  - Receives `tabId`, `profile`, and `apiKey` from the popup.
  - Inject Content Script (if not already injected): This is handled by the boilerplate's `content_scripts` in `manifest.ts`, but you might explicitly inject it if you want more control over when it runs beyond `document_idle`.
  - Request DOM Extraction: Sends a message to the content script in the target `tabId` to initiate DOM extraction.

```typescript
// In background script
type AutofillPayload = { tabId: number; profile: any; apiKey: string };

// Per-tab request state. A service worker can be stopped between messages,
// so chrome.storage.session is the more durable place for this.
const globalAutofillState: Record<number, AutofillPayload> = {};

async function handleAutofillRequest(payload: AutofillPayload) {
  try {
    // Store the payload first so it is available when the content script replies
    globalAutofillState[payload.tabId] = payload;
    // Ask the content script to extract form data
    await chrome.tabs.sendMessage(payload.tabId, { type: 'EXTRACT_FORM_DATA' });
  } catch (error) {
    console.error('Failed to inject or communicate with content script:', error);
    // Send error back to popup
  }
}
```
- `handleFormDataExtracted` Function:
  - Receives the `formData` (structured representation of form elements) from the content script.
  - Gemini API Call:
    - Construct the prompt for Gemini Flash 2.5.
      - Include `formData` and `profile`.
      - Define tool schemas (e.g., `fill_text_input`, `select_dropdown_option`, `check_radio_or_checkbox`).
      - Set `model` to `gemini-2.5-flash`.
      - Use the `generationConfig` to specify `response_mime_type: "application/json"` and `response_schema` if you want a stricter output format from Gemini.
    - Call the Gemini API through the `@google/generative-ai` SDK or directly with `fetch`. With a direct REST call, send the API key in the `x-goog-api-key` header (or the `key` query parameter) rather than in an `Authorization` header.
  - Send Actions to Content Script: Once Gemini returns the `toolCalls`, send them to the content script for execution.

```typescript
// In background script
import { GoogleGenerativeAI } from '@google/generative-ai';
// ... (ensure you have the @google/generative-ai package installed)

async function handleFormDataExtracted(formData: any[], tabId: number | undefined) {
  // Retrieve the payload stored by handleAutofillRequest
  const pending = tabId === undefined ? undefined : globalAutofillState[tabId];
  if (tabId === undefined || !pending) {
    console.error('No pending autofill request for this tab.');
    return;
  }
  const { profile, apiKey } = pending;

  const genAI = new GoogleGenerativeAI(apiKey);
  const model = genAI.getGenerativeModel({ model: 'gemini-2.5-flash' });

  const prompt = `You are an AI assistant specialized in intelligently filling web forms.

Here is the current state of the web page's form elements (including their unique selectors for interaction):
${JSON.stringify(formData, null, 2)}

Here is the user's information:
${JSON.stringify(profile, null, 2)}

Your goal is to fill out this form accurately using the provided user information.
You have the following tools available:
function fill_text_input(selector: string, value: string, field_type: string)
function select_dropdown_option(selector: string, value: string)
function check_radio_or_checkbox(selector: string, checked: boolean)

Based on the form elements and user data, suggest the next action(s) to take using the available tools.
Respond by calling these tools.
`;

  try {
    const result = await model.generateContent({
      contents: [{ role: 'user', parts: [{ text: prompt }] }],
      tools: [
        {
          functionDeclarations: [
            {
              name: 'fill_text_input',
              parameters: {
                type: 'object',
                properties: {
                  selector: { type: 'string' },
                  value: { type: 'string' },
                  field_type: { type: 'string' },
                },
                required: ['selector', 'value', 'field_type'],
              },
            },
            {
              name: 'select_dropdown_option',
              parameters: {
                type: 'object',
                properties: {
                  selector: { type: 'string' },
                  value: { type: 'string' },
                },
                required: ['selector', 'value'],
              },
            },
            {
              name: 'check_radio_or_checkbox',
              parameters: {
                type: 'object',
                properties: {
                  selector: { type: 'string' },
                  checked: { type: 'boolean' },
                },
                required: ['selector', 'checked'],
              },
            },
          ],
        },
      ],
      // For stricter JSON output, you might leverage response_mime_type and response_schema in advanced cases
      // Or simply rely on function calling output and parse it.
    });

    // functionCalls() returns { name, args } objects, or undefined if the model called no tools
    const toolCalls = result.response.functionCalls() ?? [];
    // If LLM output is a JSON string instead, parse it: JSON.parse(result.response.text())

    // Send tool calls to content script for execution
    await chrome.tabs.sendMessage(tabId, { type: 'EXECUTE_ACTIONS', payload: toolCalls });
  } catch (error) {
    console.error('Gemini API call failed:', error);
    // Send error back to popup
  } finally {
    delete globalAutofillState[tabId]; // Clean up state
  }
}
```
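If the Gemini API is called with `fetch` instead of the SDK, the request could be assembled roughly as follows. This is a minimal sketch under stated assumptions: it targets the `v1beta` `generateContent` REST endpoint and sends the key in the `x-goog-api-key` header; `buildGeminiRequest` and `GeminiRequest` are hypothetical names, and the function only builds the request so it can be inspected before anything is sent.

```typescript
// Shape of a request that can be passed to fetch(url, init).
interface GeminiRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

// Builds (but does not send) a generateContent request for the Gemini REST API.
// `model` and `functionDeclarations` are supplied by the caller.
function buildGeminiRequest(
  apiKey: string,
  model: string,
  prompt: string,
  functionDeclarations: object[],
): GeminiRequest {
  const url =
    'https://generativelanguage.googleapis.com/v1beta/models/' +
    `${encodeURIComponent(model)}:generateContent`;
  return {
    url,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // The key travels in this header, not in an Authorization header.
        'x-goog-api-key': apiKey,
      },
      body: JSON.stringify({
        contents: [{ role: 'user', parts: [{ text: prompt }] }],
        tools: [{ functionDeclarations }],
      }),
    },
  };
}
```

The background script would then call `fetch(request.url, request.init)` and read any function calls from the `functionCall` parts of the first candidate in the JSON response. Separating request construction from sending keeps the API key handling in one place and makes it testable.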
The content script interacts directly with the webpage DOM.
- DOM Extraction Logic:
  - Listen for the `EXTRACT_FORM_DATA` message from the background script.
  - Implement functions to traverse the DOM and extract form elements.
  - Crucial: Generating stable and unique selectors. XPath is often more robust than simple CSS selectors for complex pages.
    - For each element, store its `tagName`, `type`, `id`, `name`, `placeholder`, `value`, text from associated `<label>`, `aria-label`, `aria-labelledby`, and most importantly, a unique XPath or highly specific CSS selector.
    - For `<select>` elements, include the `options` (text and value).
    - For `radio` and `checkbox` elements, identify their `value` and `checked` status. Group radio buttons by `name`.
  - Send the extracted structured data (`formData`) back to the background script using `chrome.runtime.sendMessage`.

```typescript
// in src/pages/content/index.ts
chrome.runtime.onMessage.addListener((message) => {
  if (message.type === 'EXTRACT_FORM_DATA') {
    const formData = extractFormData();
    // The result goes back as its own message, so no async response is needed here
    chrome.runtime.sendMessage({ type: 'FORM_DATA_EXTRACTED', payload: formData });
  } else if (message.type === 'EXECUTE_ACTIONS') {
    executeActions(message.payload); // Implement this function
  }
});

function extractFormData() {
  const formElements: any[] = [];
  document.querySelectorAll('input, textarea, select').forEach(element => {
    const el = element as HTMLInputElement | HTMLTextAreaElement | HTMLSelectElement;
    const data: any = {
      tagName: el.tagName.toLowerCase(),
      type: el.type || el.tagName.toLowerCase(), // 'input' type can be missing
      id: el.id,
      name: el.name,
      placeholder: el.getAttribute('placeholder') ?? '', // <select> has no placeholder property
      ariaLabel: el.getAttribute('aria-label') ?? '',
      label: el.labels?.[0]?.textContent?.trim() ?? '', // Text of the first associated <label>
      selector: getElementXPath(el), // executeActions resolves selectors as XPath
    };
    if (el instanceof HTMLSelectElement) {
      data.options = Array.from(el.options).map(opt => ({
        text: opt.textContent,
        value: opt.value,
      }));
    }
    if (el instanceof HTMLInputElement && (el.type === 'radio' || el.type === 'checkbox')) {
      data.value = el.value;
      data.checked = el.checked;
    }
    formElements.push(data);
  });
  return formElements;
}

// Helper function to get XPath (can be complex, use a robust library or implement carefully)
// Example (simplified): assumes ids are unique and contain no single quotes.
function getElementXPath(element: Element): string {
  if (element.id !== '') return `//*[@id='${element.id}']`;
  if (element === document.documentElement) return '/html';
  if (element === document.body) return '/html/body';
  const parent = element.parentElement;
  if (!parent) return ''; // Fallback if no parent
  // XPath positions are 1-based and count only same-tag siblings
  let ix = 1;
  for (const sibling of Array.from(parent.children)) {
    if (sibling === element) break;
    if (sibling.tagName === element.tagName) ix++;
  }
  return `${getElementXPath(parent)}/${element.tagName.toLowerCase()}[${ix}]`;
}
// getCssSelector can also be implemented or use a library; executeActions would
// then need document.querySelector for those selectors.
```
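Grouping radio buttons by `name` can be done on the extracted data before it is sent to the model. The following sketch assumes each extracted field carries `tagName`, `type`, `name`, `selector`, and, for radios, `value` and `checked`; the `ExtractedField`, `RadioGroup`, and `groupRadios` names are hypothetical and not part of any boilerplate.

```typescript
// Minimal shape of one extracted field (an assumption about extractFormData's output).
interface ExtractedField {
  tagName: string;
  type: string;
  name: string;
  selector: string;
  value?: string;
  checked?: boolean;
}

interface RadioGroup {
  name: string;
  options: { selector: string; value: string; checked: boolean }[];
}

// Collects radio inputs that share a `name` into one group each,
// preserving document order within and across groups.
function groupRadios(fields: ExtractedField[]): RadioGroup[] {
  const groups = new Map<string, RadioGroup>();
  for (const field of fields) {
    if (field.tagName !== 'input' || field.type !== 'radio') continue;
    // Unnamed radios cannot be grouped reliably, so each is keyed by its selector.
    const key = field.name || field.selector;
    let group = groups.get(key);
    if (!group) {
      group = { name: key, options: [] };
      groups.set(key, group);
    }
    group.options.push({
      selector: field.selector,
      value: field.value ?? '',
      checked: field.checked ?? false,
    });
  }
  return Array.from(groups.values());
}
```

Presenting each group as one logical field with several options may help the model pick a single radio per group instead of treating every radio as an independent checkbox.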
- Action Execution Logic:
  - Listen for the `EXECUTE_ACTIONS` message from the background script.
  - Implement the `executeActions` function:
    - Iterate through the `payload` (Gemini's tool calls).
    - For each action, use the `selector` (XPath/CSS selector) to find the element on the page.
    - Perform the action (e.g., set `value`, `selectedIndex`, `checked` status).
    - Add visual feedback (e.g., `element.style.border = '2px solid green'`).

```typescript
// in src/pages/content/index.ts
async function executeActions(actions: any[]) {
  for (const action of actions) {
    // Gemini function calls arrive as { name, args }
    const { name, args } = action;
    const element = document.evaluate(
      args.selector,
      document,
      null,
      XPathResult.FIRST_ORDERED_NODE_TYPE,
      null,
    ).singleNodeValue as HTMLElement | null;
    if (!element) {
      console.warn(`Element not found for selector: ${args.selector}`);
      continue;
    }

    if (name === 'fill_text_input') {
      (element as HTMLInputElement | HTMLTextAreaElement).value = args.value;
      element.dispatchEvent(new Event('input', { bubbles: true }));
    } else if (name === 'select_dropdown_option') {
      (element as HTMLSelectElement).value = args.value;
    } else if (name === 'check_radio_or_checkbox') {
      (element as HTMLInputElement).checked = args.checked;
    }
    // Trigger change event so page scripts notice the new value
    element.dispatchEvent(new Event('change', { bubbles: true }));

    // Add temporary visual feedback, then restore the previous outline
    const previousOutline = element.style.outline;
    element.style.outline = '2px solid #4CAF50'; // Green outline
    element.style.transition = 'outline 0.3s ease-out';
    setTimeout(() => {
      element.style.outline = previousOutline;
    }, 1500); // Remove highlight after 1.5 seconds
  }
  chrome.runtime.sendMessage({ type: 'FILL_COMPLETE' });
}
```
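For `select_dropdown_option`, the model may return an option's visible text rather than its `value`, in which case assigning it directly to the `<select>` would select nothing. A minimal sketch of a tolerant lookup follows; it assumes the option list has the `{ text, value }` shape collected during extraction, and `resolveSelectValue` is a hypothetical helper name.

```typescript
interface SelectOption {
  text: string;
  value: string;
}

// Case-insensitive, whitespace-insensitive comparison key.
function normalize(s: string): string {
  return s.trim().toLowerCase();
}

// Returns the option value to assign to the <select>, or null if nothing matches.
// An exact value match wins over a match on the visible text.
function resolveSelectValue(options: SelectOption[], requested: string): string | null {
  const wanted = normalize(requested);
  const byValue = options.find(option => normalize(option.value) === wanted);
  if (byValue) return byValue.value;
  const byText = options.find(option => normalize(option.text) === wanted);
  return byText ? byText.value : null;
}
```

When the result is `null`, the content script can skip the field and log a warning, in the same way it handles a selector that matches no element.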
Create a shared utility for sending and receiving messages between different parts of the extension. This abstracts `chrome.runtime.sendMessage` and `chrome.tabs.sendMessage` and keeps the calling code cleaner.

```typescript
// packages/shared/lib/utils/messaging.ts
export const sendMessage = async (message: any, tabId?: number) => {
  if (tabId !== undefined) {
    // Send to specific tab (content script)
    return chrome.tabs.sendMessage(tabId, message);
  }
  // Send to background script (from popup) or other parts
  return chrome.runtime.sendMessage(message);
};

export const onMessage = (
  callback: (
    message: any,
    sender: chrome.runtime.MessageSender,
    sendResponse: (response?: any) => void,
  ) => boolean | void,
) => {
  chrome.runtime.onMessage.addListener(callback);
};
```

The current implementation uses direct `chrome.runtime.sendMessage` calls, which is also acceptable.
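The `any` message type in the utility can be narrowed with a discriminated union, so that each listener knows which payload to expect. This is a sketch, assuming the five message names used in this plan; `ExtensionMessage` and `isExtensionMessage` are hypothetical names, and the payload shapes are simplified.

```typescript
// One variant per message exchanged between popup, background, and content script.
type ExtensionMessage =
  | { type: 'AUTOFILL_REQUEST'; payload: { tabId: number; profile: Record<string, unknown>; apiKey: string } }
  | { type: 'EXTRACT_FORM_DATA' }
  | { type: 'FORM_DATA_EXTRACTED'; payload: unknown[] }
  | { type: 'EXECUTE_ACTIONS'; payload: unknown[] }
  | { type: 'FILL_COMPLETE' };

const MESSAGE_TYPES: ReadonlySet<string> = new Set([
  'AUTOFILL_REQUEST',
  'EXTRACT_FORM_DATA',
  'FORM_DATA_EXTRACTED',
  'EXECUTE_ACTIONS',
  'FILL_COMPLETE',
]);

// Runtime guard for incoming messages. It checks only the `type` tag;
// payload validation is left to the individual handlers.
function isExtensionMessage(value: unknown): value is ExtensionMessage {
  if (typeof value !== 'object' || value === null) return false;
  const type = (value as { type?: unknown }).type;
  return typeof type === 'string' && MESSAGE_TYPES.has(type);
}
```

A listener can call `isExtensionMessage(message)` first and then `switch` on `message.type`, which lets the compiler flag unhandled message types.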
- Service Worker (`chrome-extension/src/background/index.ts`): Remember that the background script is a Service Worker. It's event-driven and non-persistent. This means any state it needs to maintain across messages (like the `profile` and `apiKey` for a specific autofill request) must be stored (e.g., in `chrome.storage.session` for temporary state, or passed along in message payloads). For a single, sequential autofill request, passing the data in the `AUTOFILL_REQUEST` message and then to `handleFormDataExtracted` is often sufficient.
- `host_permissions`: Make sure `<all_urls>` is declared in `chrome-extension/manifest.ts` under `host_permissions` for the content script to run on any website. This will prompt a warning to the user during installation.
- Dynamic Content Script Injection (Optional): While `content_scripts` in `chrome-extension/manifest.ts` will auto-inject, for more control (e.g., injecting only when needed for performance), you could use `chrome.scripting.executeScript` from the background script. The boilerplate already sets up automatic injection via `manifest.ts`, which is fine for this use case.
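Keeping per-request state in `chrome.storage.session` could look roughly like the sketch below. It is written against a small `StorageAreaLike` interface (a hypothetical name covering the promise-based `get`, `set`, and `remove` calls of a storage area) so that it can be tried outside a browser; in the extension, `chrome.storage.session` would be passed in. The key scheme and helper names are assumptions.

```typescript
// Subset of the promise-based chrome.storage.StorageArea API used below.
interface StorageAreaLike {
  get(key: string): Promise<Record<string, unknown>>;
  set(items: Record<string, unknown>): Promise<void>;
  remove(key: string): Promise<void>;
}

// Data the background script needs once the form data arrives.
interface PendingAutofill {
  profile: Record<string, unknown>;
  apiKey: string;
}

// Hypothetical key scheme: one entry per tab.
function pendingKey(tabId: number): string {
  return `pendingAutofill:${tabId}`;
}

async function savePending(area: StorageAreaLike, tabId: number, pending: PendingAutofill): Promise<void> {
  await area.set({ [pendingKey(tabId)]: pending });
}

// Reads and removes the entry so each request is consumed at most once.
async function takePending(area: StorageAreaLike, tabId: number): Promise<PendingAutofill | null> {
  const key = pendingKey(tabId);
  const items = await area.get(key);
  const found = items[key] as PendingAutofill | undefined;
  if (found === undefined) return null;
  await area.remove(key);
  return found;
}

// In-memory stand-in so the helpers can be tried outside a browser.
function createMemoryArea(): StorageAreaLike {
  const data = new Map<string, unknown>();
  return {
    async get(key) {
      const result: Record<string, unknown> = {};
      if (data.has(key)) result[key] = data.get(key);
      return result;
    },
    async set(items) {
      for (const key of Object.keys(items)) data.set(key, items[key]);
    },
    async remove(key) {
      data.delete(key);
    },
  };
}
```

With this in place, the request handler would call `savePending` before messaging the content script, and the form-data handler would call `takePending` when the reply arrives. Unlike a module-level variable, the stored entry survives a service worker restart within the same browser session.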
- API Errors: Gracefully handle failed Gemini API calls (network issues, invalid API key, rate limits). Display informative messages to the user in the popup.
- Element Not Found: If the content script cannot find an element based on Gemini's suggested selector, log a warning and proceed with other fields. This is crucial for arbitrary forms where selectors might vary.
- Review Step: Before truly "completing" the autofill, consider adding a confirmation step in the popup where the user can see a summary of what was filled and approve/deny the changes. This could involve sending the filled data back from the content script to the popup for display.
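The element-not-found and review points can share one pre-check: before anything is filled, compare the model's tool calls against the selectors that were actually extracted, and build a readable summary for the popup. The sketch below is illustrative only; `PlannedAction`, `reviewActions`, and `summarize` are hypothetical names, and the action shape assumes `{ name, args }` tool calls with the three tools defined in this plan.

```typescript
interface PlannedAction {
  name: string;
  args: { selector: string; value?: string; checked?: boolean };
}

interface ReviewResult {
  accepted: PlannedAction[];
  rejected: { action: PlannedAction; reason: string }[];
}

const KNOWN_TOOLS: ReadonlySet<string> = new Set([
  'fill_text_input',
  'select_dropdown_option',
  'check_radio_or_checkbox',
]);

// Splits tool calls into those that target an extracted element with a known
// tool, and those that should be skipped (with the reason for skipping).
function reviewActions(actions: PlannedAction[], knownSelectors: Iterable<string>): ReviewResult {
  const selectors = new Set(knownSelectors);
  const result: ReviewResult = { accepted: [], rejected: [] };
  for (const action of actions) {
    if (!KNOWN_TOOLS.has(action.name)) {
      result.rejected.push({ action, reason: 'unknown tool' });
    } else if (!selectors.has(action.args.selector)) {
      result.rejected.push({ action, reason: 'selector not in extracted form' });
    } else {
      result.accepted.push(action);
    }
  }
  return result;
}

// One line per accepted action, suitable for a confirmation list in the popup.
function summarize(action: PlannedAction): string {
  const target = action.args.selector;
  if (action.name === 'check_radio_or_checkbox') {
    return `${action.args.checked ? 'Check' : 'Uncheck'} ${target}`;
  }
  return `Set ${target} to "${action.args.value ?? ''}"`;
}
```

The popup could list `summarize(action)` for every accepted action and send `EXECUTE_ACTIONS` only after the user approves, while rejected actions are logged as warnings.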
- Development Server: Use `pnpm dev` (or `yarn dev`/`npm run dev`) to start the Vite development server.
- Loading Unpacked Extension:
  - Chrome: Go to `chrome://extensions`, enable Developer mode, click "Load unpacked," and select the `dist` folder generated by Vite.
  - Firefox: Go to `about:debugging#/runtime/this-firefox`, click "Load Temporary Add-on," and select the `manifest.json` file inside the `dist` folder.
- Debugging:
  - Popup: Right-click the extension icon -> "Inspect popup".
  - Background Script (Service Worker): In Chrome, go to `chrome://extensions`, find your extension, and click "Service Worker". In Firefox, go to `about:debugging#/runtime/this-firefox` and click "Inspect" next to the extension.
  - Content Script: Open the target webpage, open DevTools (F12), and you'll see console logs/errors from your content script. You might need to select the content script's "context" in the DevTools console dropdown.
- Iterative Testing: Test on a variety of websites with different form structures to refine the DOM extraction logic and the LLM's prompt engineering.