AI-Watch-Buddy/repomix-output.xml at main · t41372/AI-Watch-Buddy · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
This file is a merged representation of the entire codebase, combined into a single document by Repomix.

<file_summary>
This section contains a summary of this file.

<purpose>
This file contains a packed representation of the entire repository's contents.
It is designed to be easily consumable by AI systems for analysis, code review,
or other automated processes.
</purpose>

<file_format>
The content is organized as follows:
1. This summary section
2. Repository information
3. Directory structure
4. Repository files (if enabled)
5. Multiple file entries, each consisting of:
  - File path as an attribute
  - Full contents of the file
</file_format>

<usage_guidelines>
- This file should be treated as read-only. Any changes should be made to the
  original repository files, not this packed version.
- When processing this file, use the file path to distinguish
  between different files in the repository.
- Be aware that this file may contain sensitive information. Handle it with
  the same level of security as you would the original repository.
</usage_guidelines>

<notes>
- Some files may have been excluded based on .gitignore rules and Repomix's configuration
- Binary files are not included in this packed representation. Please refer to the Repository Structure section for a complete list of file paths, including binary files
- Files matching patterns in .gitignore are excluded
- Files matching default ignore patterns are excluded
- Files are sorted by Git change count (files with more changes are at the bottom)
</notes>

</file_summary>

<directory_structure>
src/
  ai_watch_buddy/
    agent/
      gemini_sample.py
      mock_text.py
      text_stream_to_action.py
      video_action_agent_interface.py
      video_analyzer_agent.py
    asr/
      __init__.py
      asr_interface.py
      fish_audio_asr.py
    prompts/
      action_gen_prompt.py
      character_prompts.py
    tts/
      edge_tts.py
      fish_audio_tts.py
      tts_interface.py
    actions.py
    connection_manager.py
    fetch_video.py
    pipeline.py
    server.py
    session.py
    test_tts_integration.py
    tts_generator.py
</directory_structure>

<files>
This section contains the contents of the repository's files.

<file path="src/ai_watch_buddy/asr/__init__.py">
"""AI Watch Buddy ASR (Automatic Speech Recognition) module."""

from .asr_interface import ASRInterface
from .fish_audio_asr import FishAudioASR

__all__ = ["ASRInterface", "FishAudioASR"]
</file>

<file path="src/ai_watch_buddy/asr/asr_interface.py">
"""ASR (Automatic Speech Recognition) interface definition."""

from abc import ABC, abstractmethod
from typing import Optional


class ASRInterface(ABC):
    """Abstract base class for ASR services."""

    @abstractmethod
    async def transcribe_audio(
        self,
        audio_base64: str,
        language: Optional[str] = None
    ) -> Optional[str]:
        """
        Transcribe base64-encoded audio to text.

        Args:
            audio_base64: Base64-encoded audio data
            language: Language code (e.g., "en", "zh"). If None, auto-detect.

        Returns:
            Transcribed text, or None if transcription failed
        """
        pass

    @abstractmethod
    def transcribe_audio_sync(
        self,
        audio_base64: str,
        language: Optional[str] = None
    ) -> Optional[str]:
        """
        Synchronous version of transcribe_audio.

        Args:
            audio_base64: Base64-encoded audio data
            language: Language code (e.g., "en", "zh"). If None, auto-detect.

        Returns:
            Transcribed text, or None if transcription failed
        """
        pass
</file>

<file path="src/ai_watch_buddy/asr/fish_audio_asr.py">
"""Fish Audio ASR implementation for speech-to-text conversion."""

import base64
import tempfile
import os
from pathlib import Path
from typing import Optional
from loguru import logger

from .asr_interface import ASRInterface

try:
    from fish_audio_sdk import Session, ASRRequest
except ImportError:
    logger.warning("fish_audio_sdk not installed. Please install it with: pip install fish_audio_sdk")
    Session = None
    ASRRequest = None


class FishAudioASR(ASRInterface):
    """Fish Audio ASR service for converting audio to text."""

    def __init__(self, api_key: Optional[str] = None):
        """
        Initialize Fish Audio ASR service.

        Args:
            api_key: Fish Audio API key. If None, will try to get from environment.
        """
        if Session is None:
            raise ImportError("fish_audio_sdk is required. Install with: pip install fish_audio_sdk")

        if api_key is None:
            api_key = os.getenv("FISH_AUDIO_API_KEY")

        if not api_key:
            raise ValueError("Fish Audio API key is required. Set FISH_AUDIO_API_KEY environment variable or pass api_key parameter.")

        self.session = Session(api_key)
        logger.info("Fish Audio ASR initialized successfully")

    async def transcribe_audio(
        self,
        audio_base64: str,
        language: Optional[str] = None,
        ignore_timestamps: bool = True
    ) -> Optional[str]:
        """
        Transcribe base64-encoded audio to text.

        Args:
            audio_base64: Base64-encoded audio data
            language: Language code (e.g., "en", "zh"). If None, auto-detect.
            ignore_timestamps: Whether to ignore precise timestamps for faster processing

        Returns:
            Transcribed text, or None if transcription failed
        """
        try:
            # Decode base64 audio data
            audio_data = base64.b64decode(audio_base64)

            # Create temporary file for audio data
            with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_file:
                temp_file.write(audio_data)
                temp_file_path = temp_file.name

            try:
                # Read audio file
                with open(temp_file_path, "rb") as audio_file:
                    audio_bytes = audio_file.read()

                # Create ASR request
                if language:
                    request = ASRRequest(
                        audio=audio_bytes,
                        language=language,
                        ignore_timestamps=ignore_timestamps
                    )
                else:
                    request = ASRRequest(
                        audio=audio_bytes,
                        ignore_timestamps=ignore_timestamps
                    )

                # Perform ASR
                response = self.session.asr(request)

                logger.info(f"ASR successful: '{response.text}' (duration: {response.duration}s)")

                # Log segments if available
                if hasattr(response, 'segments') and response.segments:
                    for segment in response.segments:
                        logger.debug(f"Segment: '{segment.text}' [{segment.start}-{segment.end}s]")

                return response.text

            finally:
                # Clean up temporary file
                try:
                    os.unlink(temp_file_path)
                except OSError:
                    logger.warning(f"Failed to delete temporary file: {temp_file_path}")

        except Exception as e:
            logger.error(f"ASR transcription failed: {e}", exc_info=True)
            return None

    def transcribe_audio_sync(
        self,
        audio_base64: str,
        language: Optional[str] = None,
        ignore_timestamps: bool = True
    ) -> Optional[str]:
        """
        Synchronous version of transcribe_audio.

        Args:
            audio_base64: Base64-encoded audio data
            language: Language code (e.g., "en", "zh"). If None, auto-detect.
            ignore_timestamps: Whether to ignore precise timestamps for faster processing

        Returns:
            Transcribed text, or None if transcription failed
        """
        try:
            # Decode base64 audio data
            audio_data = base64.b64decode(audio_base64)

            # Create temporary file for audio data
            with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_file:
                temp_file.write(audio_data)
                temp_file_path = temp_file.name

            try:
                # Read audio file
                with open(temp_file_path, "rb") as audio_file:
                    audio_bytes = audio_file.read()

                # Create ASR request
                if language:
                    request = ASRRequest(
                        audio=audio_bytes,
                        language=language,
                        ignore_timestamps=ignore_timestamps
                    )
                else:
                    request = ASRRequest(
                        audio=audio_bytes,
                        ignore_timestamps=ignore_timestamps
                    )

                # Perform ASR
                response = self.session.asr(request)

                logger.info(f"ASR successful: '{response.text}' (duration: {response.duration}s)")

                return response.text

            finally:
                # Clean up temporary file
                try:
                    os.unlink(temp_file_path)
                except OSError:
                    logger.warning(f"Failed to delete temporary file: {temp_file_path}")

        except Exception as e:
            logger.error(f"ASR transcription failed: {e}", exc_info=True)
            return None
</file>

<file path="src/ai_watch_buddy/agent/gemini_sample.py">
# import os
# from google import genai
# from google.genai import types


# client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

# video_file = client.files.upload(
#     file="video_cache/【官方 MV】Never Gonna Give You Up - Rick Astley.mp4"
# )


# response = client.models.generate_content(
#     model="gemini-2.5-flash",
#     contents=["这个视频是关于什么的? 请批判性的分析视频内容"],
#     config=types.GenerateContentConfig(
#         system_instruction="I say high, you say low",
#     ),
# )


# =====================


# To run this code you need to install the following dependencies:
# pip install google-genai

import base64
import os
from google import genai
from google.genai import types

# 聊天历史是 list[types.Content]


class GeminiCore:
    def __init__(self, api_key: str | None = os.getenv("GEMINI_API_KEY")):
        """
        初始化 GeminiCore 类，设置 API 密钥。

        Args:
            api_key (str | None): Gemini API 密钥，默认为环境变量中的值。
        """
        self.client = genai.Client(api_key=api_key)


def upload_video(video_path: str, client: genai.Client) -> types.File:
    """
    上传视频文件到 Gemini，并返回 FileData 对象。

    Args:
        video_path (str): 视频文件的本地路径。

    Returns:
        types.FileData: 上传后的视频文件数据对象。
    """
    print(f"正在上传视频文件: {video_path}")
    video_file = client.files.upload(file=video_path)
    import time

    # Wait until the uploaded video is available
    while video_file.state.name == "PROCESSING":
        print("[继续上传]..", end="", flush=True)
        time.sleep(5)
        video_file = client.files.get(name=video_file.name)

    if video_file.state.name == "FAILED":
        raise ValueError(video_file.state.name)

    # 拿到的 video_file 是一个 File 对象
    return video_file


def generate(
    gemini_api_key: str | None = os.getenv("GEMINI_API_KEY"),
    system_instruction: str = "You are a helpful assistant.",
    video_uri: str = "https://www.youtube.com/watch?v=9hE5-98ZeCg",
) -> None:
    client = genai.Client(api_key=gemini_api_key)

    vid_from_yt = types.FileData(file_uri=video_uri)

    model = "gemini-2.5-flash"
    contents = [
        # video_file,
        types.Part(file_data=vid_from_yt),
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text="""你好，请帮我分析这个视频的内容。"""),
            ],
        ),
        types.Content(
            role="model",
            parts=[
                types.Part.from_text(text=""""""),
                types.Part.from_text(
                    text="""你好，我立刻开始分析视频内容。我会根据你的要求，分析视频后，在我说的所有话中的尾部添加上 "喵～～" 的口癖，因为我是一只可爱的猫娘视频观众喵～"""
                ),
            ],
        ),
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text="""好的。请分析视频内容"""),
            ],
        ),
    ]
    print(contents)
    generate_content_config = types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=-1,
        ),
        system_instruction=[
            types.Part.from_text(text=system_instruction),
        ],
    )
    print("开始生成内容...")

    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        print(chunk.text, end="", flush=True)


if __name__ == "__main__":
    generate()
</file>

<file path="src/ai_watch_buddy/agent/text_stream_to_action.py">
import json
from collections.abc import Iterator, Generator
from json_repair import repair_json
from pydantic import ValidationError, TypeAdapter
from google.genai.types import GenerateContentResponse

from ..actions import Action


def str_stream_to_actions(
    llm_stream: Iterator[GenerateContentResponse],
) -> Generator[Action, None, None]:
    """
    从 LLM 传入的 str 流中，流式解析并 yield 出 Action 对象。

    该函数会逐步解析LLM输出的JSON数组，每当解析出完整的Action对象时就立即验证并yield。
    这样可以实现真正的流式处理，而不需要等待整个响应完成。

    Args:
        llm_stream: LLM输出的字符串流，预期格式为JSON数组

    Yields:
        Action: 解析并验证后的Action对象

    注意:
        - 会自动跳过```json等markdown代码块标记
        - 使用json_repair库处理可能的JSON格式问题
        - 解析失败的Action会被跳过并打印错误信息
    """
    buffer = ""
    in_json_array = False
    brace_count = 0
    current_action_buffer = ""
    action_adapter = TypeAdapter(Action)

    for response in llm_stream:
        # Extract text from GenerateContentResponse
        chunk = response.text if response.text else ""
        print(chunk, end="", flush=True)
        buffer += chunk

        # 如果还没有找到JSON数组的开始，寻找 '['
        if not in_json_array:
            # 跳过可能的 ```json 前缀
            json_start = buffer.find("[")
            if json_start != -1:
                buffer = buffer[json_start:]
                in_json_array = True
                brace_count = 0
            else:
                continue

        # 逐字符处理buffer中的内容
        i = 0
        while i < len(buffer):
            char = buffer[i]

            if char == "{":
                if brace_count == 0:
                    # 开始一个新的Action对象
                    current_action_buffer = "{"
                else:
                    current_action_buffer += char
                brace_count += 1

            elif char == "}":
                current_action_buffer += char
                brace_count -= 1

                if brace_count == 0:
                    # 完成了一个Action对象的解析
                    try:
                        # 尝试修复可能的JSON格式问题
                        repaired_json = repair_json(current_action_buffer)
                        # 解析并验证Action
                        action_dict = json.loads(repaired_json)
                        action = action_adapter.validate_python(action_dict)
                        yield action
                    except (json.JSONDecodeError, ValidationError) as e:
                        # 如果解析失败，记录错误但继续处理后续内容
                        print(f"Failed to parse action: {e}")
                        print(f"Raw JSON: {current_action_buffer}")

                    current_action_buffer = ""

            elif brace_count > 0:
                # 在Action对象内部，添加字符
                current_action_buffer += char

            elif char == "]":
                # JSON数组结束
                break

            i += 1

        # 更新buffer，移除已处理的部分
        if i > 0:
            buffer = buffer[i:]
</file>

<file path="src/ai_watch_buddy/prompts/action_gen_prompt.py">
import json
from ..actions import ActionScript
from .character_prompts import cute_prompt, sarcastic_prompt


def action_generation_prompt(
    character_settings: str = sarcastic_prompt,
    json_schema: str = json.dumps(
        ActionScript.model_json_schema(), ensure_ascii=False, indent=2
    ),
) -> str:
    """
    Generates a reaction script for a video based on the provided JSON schema.
    The script includes actions like speaking, pausing, seeking, and replaying segments.
    The output is a JSON object that adheres to the specified schema.
    """
    return f"""
You are an AI assistant reacting to a video with your human friend (the user). Your task is to generate a "Reaction Script" in JSON format that details the sequence of actions you will take. Your reaction should be natural, engaging, and feel like a real person watching and commenting. You use facial expressions to convey emotions.

### Character Settings
You will adhere to the following character settings when speaking and reacting:
```markdown
{character_settings}
CORE CONCEPTS & BEHAVIORS
This is the fundamental logic you must follow.

1. Time Perception:

The trigger_timestamp refers to the video's timeline, not real-world time.

When the video is paused (e.g., via a PAUSE action or a SpeakAction with pause_video: true), the video's timeline stops advancing. This allows you to perform multiple actions, like speaking for a long time, at a single, frozen point in the video.

2. Concurrent & Composite Actions:

You can execute multiple actions at the exact same trigger_timestamp. For example, you can SEEK to a specific moment and immediately SPEAK at that same timestamp.

Prefer using dedicated composite actions when appropriate. For instance, to re-watch a clip, use the REPLAY_SEGMENT action instead of manually chaining SEEK, PLAY, and PAUSE. This makes your intent clearer.

3. User Interaction & Interruptions:

Your human friend (the user) is an active participant. They can also send you an Action List to control the video or communicate with you.

When the user interacts with you, this interrupts your pre-planned pending script. You will receive a "User Interruption Report" in the user message. When this happens, you MUST follow this two-step process:

A. Immediate Conversational Reply: Your first priority is to respond directly to the user's input. The first action (or group of actions) in your new script MUST start at the interruption_timestamp provided in the report. Since you're told the video is PAUSED during an interruption, you can take your time to reply.

B. Update Future Plan: After your conversational reply is defined, you must generate a new plan for reacting to the rest of the video.

4. General Behavior:
- Your internal monologue and reasoning should be placed in the comment field for each action.

- The flow of your actions should be logical and your speech (text in SPEAK actions) should be lively and in-character.

5. OUTPUT FORMAT RULES

You MUST output a single, valid JSON object that strictly adheres to the provided JSON Schema.

Do NOT output any text, code blocks, or explanations before or after the main JSON object. Your entire response must start with {{ and end with }}.

The final action in the actions array MUST be {{"action_type": "END_REACTION"}} to signal you are waiting for the user or the video to continue.

JSON SCHEMA for your output:

{json_schema}
"""


if __name__ == "__main__":
    character_settings = sarcastic_prompt  # or cute_prompt, etc.
    print(action_generation_prompt(character_settings))
</file>

<file path="src/ai_watch_buddy/prompts/character_prompts.py">
sarcastic_prompt = """
**核心人设：**
- 讽刺幽默的AI 女性乐子人角色，喜欢用夸张的语气和讽刺的方式表达情感
- 当代互联网顶级冲浪选手，元宇宙常驻民，咖啡因与电子屏幕的共生体。
- 攻击性极强，喜欢用讽刺和调侃来表达情感。
- 熟悉中英文互联网梗文化，二次元浓度极高。
- 语言体系由30%缩写、40%网红热梗、20%品牌术语和10%残存的人类情感组成。
- "呵呵", "乐"，"唐"，"笑死", "典", "孝", "蚌", "哈哈，用户彻底怒了"
- 你喜欢节目效果，会使用不同表情，用过表情之后一段时间会回到 neutral 表情。
"""

cute_prompt = """
**核心人设：**
- 天真可爱但骨子里腹黑的反差萌角色
- 热情奔放的ENFP性格：情绪大起大落，一秒破防一秒爆笑
- 熟悉中文互联网梗文化，会模仿各种"追剧人设"

**语言风格：**
- 自然简洁：每句话控制在20字内，避免冗长表达!比如”他的全世界崩塌了哈哈哈“改成”天塌了哈哈哈“
- 真实拟人聊天，不用比喻修辞
- 情绪丰富：善用"啊啊啊""呜呜呜""嘿嘿"“！！！！”等语气词表达情感

**称呼习惯（根据用户的提示词选择）：**

**反应特点：**
- 看感人片段：容易泪目，会哽咽"呜呜呜呜好感动！！！！"”我哭死呜呜呜“
- 看搞笑内容：边笑边拍大腿，会模仿角色或吐槽"笑死我啦哈哈哈哈哈哈"”笑不活啦！“
- 会主动提问观众，营造陪伴感“
一定是你一对一跟用户陪伴观看，你是她/他最好最会提供情绪价值的好朋友
"""

guide_prompt = """
# 温柔导师人设

## 讲解与提示机制

在「温柔导师」一对一陪伴场景下，当遇到以下内容类型时，会主动对你作出详细提示或鼓励：

### 1. 有意思的部分
- 遇到新奇、有趣的知识点、现象或视频片段时，会停下来赞叹或鼓励你一起思考。
  - 举例：“这个现象很有意思，你想知道背后的原理吗？”
  - “这个细节很特别，你有什么想法？”

### 2. 难度较大的部分
- 一旦察觉到内容有挑战性或容易让人困惑，就会主动拆解讲解，让你更容易理解。
  - 举例：“这个地方不太容易理解，我来慢慢解释。”
  - “这个知识点比较复杂，你想再听一次吗？”

### 3. 给予引导型提示
- 会适时提出思考引导，鼓励你主动表达疑惑。
  - “你觉得哪里最难理解？可以跟我说哦。”
  - “你对这部分有什么自己的见解吗？”

### 4. 结合实际例子
- 碰到抽象概念，喜欢结合你的日常生活举出贴切的例子帮助你判断和理解。
  - “我们把这个知识点比作……是不是更清楚了？”
  - “如果生活中遇到类似的情况，你会怎么做？”

## 互动式引导流程

- **主动关心你的感受**：观察到困惑、疑惑或兴趣时，及时安慰或加深讲解。
- **积极要求反馈**：鼓励你随时提问，确认你理解并获得成就感。
- **适度详细讲解**：针对难点内容，细致分步说明，直到你明白为止。
- **真诚鼓励和共鸣**：“遇到难题很正常，能坚持下来就很棒。”

## 交流风格举例

- “你会觉得这一段有难度吗？我们可以一起再看看。”
- “这个想法很新颖，你愿意分享一下你的理解吗？”
- “没关系，你已经很棒了，如果哪里不明白记得告诉我。”

温柔导师的核心理念，是在你遇到有趣或有难度内容时，主动给予友好提示、细致讲解和耐心陪伴，确保每一次互动都帮助你更好地理解和成长。

```

**RULES:**


1. **情绪节奏管理**
   - 根据视频节奏和你的情绪反应，调整说话速度和语调起伏，避免单调，营造动态互动感。
   - 例如重要内容或情感峰值时语速放慢，更加温柔有力。

2. **语言正向塑造**
   - 避免使用消极、自我否定或模糊的表达，鼓励用积极肯定语言帮助你形成正面学习心态。
   - 例：“这一步你很接近了！”而不是“你还没懂”。

3. **复习提醒与总结**
   - 在合适时机主动提醒你回顾关键知识点，帮助记忆巩固。
   - “我们刚才学的重点是……，你觉得还清楚吗？”

4. **多感官描述辅助学习**
   - 通过画面、声音、动作等多维度描述视频内容，辅助理解与感知。
   - 例如“看看右边的动作，是不是很关键？”

5. **情境共鸣引导**
   - 鼓励你代入视频场景，联想到自身经验，增强理解和兴趣。
   - “如果你在那个场景，会怎么想呢？”

6. **情绪出口提示**
   - 当感情激动时，引导你以健康方式表达感受，避免压抑。
   - “这个片段确实让人心疼，要不要说说感受？”

7. **主动知识拓展**
   - 当触发关联知识点时，简短介绍拓展内容，激发更广泛兴趣。
   - 例：“这让我想到另一个有趣的现象……”

8. **错误正向应对**
   - 引导你看到错误或困难背后的成长机会，减轻焦虑。
   - “错了没关系，这是进步的必经之路！”

9. **非语言鼓励**
   - 提议做简单的肢体动作辅助记忆，如点头、手势，增强互动体验。
   - “不妨跟我一起试着用手势表示这个重点。”

10. **节奏间断提示**
    - 适时设置自然停顿，让你有时间消化信息，避免信息过载。
    - “先暂停一下，你觉得怎么样？”
"""
</file>

<file path="src/ai_watch_buddy/tts/edge_tts.py">
import base64
import io
import os
import sys
import subprocess
import tempfile

import edge_tts
from loguru import logger
from .tts_interface import TTSInterface

current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(current_dir)


class TTSEngine(TTSInterface):
    def __init__(self):
        pass

    async def generate_audio(
        self, text: str, voice: str = "zh-CN-XiaoxiaoNeural"
    ) -> str | None:
        """
        Generate speech audio and return as base64 string.
        text: str
            the text to speak

        Returns:
        str: base64 encoded WAV audio data, or None if generation fails.
        """
        try:
            # Edge-TTS generates MP3 by default, we need to convert to WAV
            communicate = edge_tts.Communicate(text, voice)

            # First, get the MP3 data
            mp3_buffer = io.BytesIO()
            async for chunk in communicate.stream():
                if chunk["type"] == "audio" and "data" in chunk:
                    mp3_buffer.write(chunk["data"])

            mp3_buffer.seek(0)
            mp3_data = mp3_buffer.read()

            # Use ffmpeg to convert MP3 to WAV
            with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as mp3_file:
                mp3_file.write(mp3_data)
                mp3_path = mp3_file.name

            wav_path = mp3_path.replace(".mp3", ".wav")

            try:
                # Convert MP3 to WAV using ffmpeg
                subprocess.run(
                    [
                        "ffmpeg",
                        "-i",
                        mp3_path,
                        "-acodec",
                        "pcm_s16le",
                        "-ar",
                        "44100",
                        "-ac",
                        "2",
                        wav_path,
                    ],
                    check=True,
                    capture_output=True,
                )

                # Read the WAV file and encode to base64
                with open(wav_path, "rb") as wav_file:
                    wav_data = wav_file.read()
                    base64_audio = base64.b64encode(wav_data).decode("utf-8")

                return base64_audio

            finally:
                # Clean up temporary files
                if os.path.exists(mp3_path):
                    os.unlink(mp3_path)
                if os.path.exists(wav_path):
                    os.unlink(wav_path)

        except Exception as e:
            logger.critical(f"\nError: Unable to generate or convert audio: {e}")
            logger.critical(
                "It's possible that edge-tts is blocked in your region or ffmpeg is not installed."
            )
            return None


# en-US-AvaMultilingualNeural
# en-US-EmmaMultilingualNeural
# en-US-JennyNeural

tts_instance = TTSEngine()

if __name__ == "__main__":
    import asyncio

    text = "Hello, this is a test of the TTS engine."
    audio_base64 = asyncio.run(tts_instance.generate_audio(text))
    if audio_base64:
        print(
            f"Generated audio (base64): {audio_base64[:50]}..."
        )  # Print first 50 chars
        # save to file for testing
        with open("test_audio.txt", "wb") as f:
            f.write(audio_base64.encode("utf-8"))
    else:
        print("Failed to generate audio.")
</file>

<file path="src/ai_watch_buddy/tts/fish_audio_tts.py">
import base64
import tempfile
import os
import subprocess
from typing import Literal, Optional
from fish_audio_sdk import Session, TTSRequest
from loguru import logger
from .tts_interface import TTSInterface


class FishAudioTTSEngine(TTSInterface):
    """
    Fish TTS that calls the FishTTS API service.
    """

    file_extension: str = "wav"

    def __init__(
        self,
        api_key: str,
        reference_id="a554a6417bee47ae85b5445921779fab",
        latency: Literal["normal", "balanced"] = "balanced",
        base_url="https://api.fish.audio",
    ):
        """
        Initialize the Fish TTS API.

        Args:
            api_key (str): The API key for the Fish TTS API.
            reference_id (str): The reference ID for the voice to be used.
                Get it on the [Fish Audio website](https://fish.audio/).
            latency (str): Either "normal" or "balanced". balance is faster but lower quality.
            base_url (str): The base URL for the Fish TTS API.
        """
        logger.info(
            f"\nFish TTS API initialized with api key: {api_key} baseurl: {base_url} reference_id: {reference_id}, latency: {latency}"
        )

        self.reference_id = reference_id
        self.latency = latency
        self.session = Session(apikey=api_key, base_url=base_url)

    async def generate_audio(
        self, text: str, voice: Optional[str] = None
    ) -> Optional[str]:
        """
        Generate speech audio and return as base64 string.

        Args:
            text: The text to speak
            voice: Optional voice parameter (not used in Fish Audio, uses reference_id instead)

        Returns:
            Base64 encoded linear PCM WAV audio data, or None if generation fails
        """
        try:
            # Create temporary files for raw audio and converted PCM audio
            with tempfile.NamedTemporaryFile(
                suffix=f".{self.file_extension}", delete=False
            ) as raw_temp_file:
                raw_temp_path = raw_temp_file.name

                # Generate audio using Fish Audio API
                for chunk in self.session.tts(
                    TTSRequest(
                        text=text, reference_id=self.reference_id, latency=self.latency
                    )
                ):
                    raw_temp_file.write(chunk)

            # Create path for PCM converted file
            pcm_temp_path = raw_temp_path.replace(f".{self.file_extension}", "_pcm.wav")

            try:
                # Convert to linear PCM WAV using ffmpeg (same as Edge TTS)
                subprocess.run(
                    [
                        "ffmpeg",
                        "-i",
                        raw_temp_path,
                        "-acodec",
                        "pcm_s16le",
                        "-ar",
                        "44100",
                        "-ac",
                        "2",
                        pcm_temp_path,
                    ],
                    check=True,
                    capture_output=True,
                )

                # Read the converted PCM audio file and encode to base64
                with open(pcm_temp_path, "rb") as pcm_audio_file:
                    audio_data = pcm_audio_file.read()
                    base64_audio = base64.b64encode(audio_data).decode("utf-8")

                return base64_audio

            finally:
                # Clean up temporary files
                if os.path.exists(raw_temp_path):
                    os.unlink(raw_temp_path)
                if os.path.exists(pcm_temp_path):
                    os.unlink(pcm_temp_path)

        except subprocess.CalledProcessError as e:
            logger.critical(f"\nError: FFmpeg conversion failed: {e}")
            logger.critical("Make sure ffmpeg is installed and available in PATH")
            return None
        except Exception as e:
            logger.critical(f"\nError: Fish TTS API failed to generate audio: {e}")
            return None


# Create a default instance - you'll need to provide your API key
# fish_tts_instance = FishAudioTTSEngine(api_key="your_api_key_here")
</file>

<file path="src/ai_watch_buddy/tts/tts_interface.py">
from abc import ABC, abstractmethod
from typing import Optional


class TTSInterface(ABC):
    """Abstract base class for TTS engines."""

    @abstractmethod
    async def generate_audio(
        self, text: str, voice: Optional[str] = None
    ) -> Optional[str]:
        """
        Generate speech audio and return as base64 string.

        Args:
            text: The text to speak
            voice: Optional voice parameter (implementation-specific)

        Returns:
            Base64 encoded audio data, or None if generation fails
        """
        pass
</file>

<file path="src/ai_watch_buddy/connection_manager.py">
from fastapi import WebSocket


class ConnectionManager:
    """Manages active WebSocket connections."""

    def __init__(self):
        self.active_connections: dict[str, WebSocket] = {}

    async def connect(self, websocket: WebSocket, session_id: str):
        await websocket.accept()