| 1 | +# TalkPipe developer handbook |
| 2 | + |
| 3 | +Contributor-oriented reference: glossary, repository conventions, parameter semantics, and standard configuration keys. |
| 4 | + |
| 5 | +## Glossary |
| 6 | + |
| 7 | +* **Unit** - A component in a pipeline that either produces or processes data. There are two types of units: Sources and Segments. |
| 8 | +* **Segment** - A unit that reads from another Unit and may or may not yield data of its own. All units that |
| 9 | +are not at the start of a pipeline are Segments. |
| 10 | +* **Source** - A unit that takes nothing as input and yields data items. These Units are used in the |
| 11 | +"INPUT FROM..." portion of a pipeline. |
| 12 | + |
| 13 | +## Conventions |
| 14 | + |
| 15 | +### Versioning |
| 16 | + |
| 17 | +This codebase uses [semantic versioning](https://semver.org/), with one additional convention during 0.x.y development: releases within a MINOR series will mostly maintain backward compatibility with each other, and PATCH releases can include substantial new capability. For example, every 0.2.x version will be mostly backward compatible, but 0.3.0 might contain code reorganization. |
| 18 | + |
| 19 | +### Codebase Structure |
| 20 | + |
| 21 | +The codebase is divided into the following main packages. Treat these divisions as firm but not strict: a source could reasonably fit within either operations or data, for example. |
| 22 | + |
| 23 | +* **talkpipe.app** - Contains the primary runnable applications. |
| 24 | + * Example: chatterlang_script |
| 25 | +* **talkpipe.operations** - Contains general algorithm implementations. Associated segments and sources can be included next to the algorithm implementations, but the algorithms themselves should also work stand-alone. |
| 26 | + * Example: bloom filters |
| 27 | +* **talkpipe.data** - Contains components having to do with complex, type-specific data manipulation. |
| 28 | + * Example: extracting text from files. |
| 29 | +* **talkpipe.llm** - Contains the abstract classes and implementations for accessing LLMs, both code for accessing specific LLMs and code for doing prompting. |
| 30 | + * Example: Code for talking with Ollama or OpenAI |
| 31 | +* **talkpipe.pipe** - Code that implements the core classes and decorators for the pipe API, as well as miscellaneous helper segments and sources. |
| 32 | + * Example: echo and the definition of the @segment decorator |
| 33 | +* **talkpipe.chatterlang** - The definition, parsers, and compiler for the ChatterLang language, as well as any ChatterLang-specific segments and sources. |
| 34 | + * Example: the chatterlang compiler and the variable segment |
| 35 | + |
| 36 | +### Source/Segment Names |
| 37 | + |
| 38 | +- **For your own Units, do whatever you want!** These conventions are for authors writing units intended for broader reuse. |
| 39 | +- **Classes that implement Units** are named in CamelCase with the initial letter in uppercase. |
| 40 | +- **Units defined using `@segment` and `@source` decorators** should be named in camelCase with an initial lowercase letter. |
| 41 | +- In **ChatterLang**, sources and segments also use camelCase with an initial lowercase letter. |
| 42 | +- Except for the **`cast`** segment, segments that convert data into a specific format—whether they process items one-by-one or drain the entire input—should be named using the form `[tT]oX`, where **X** is the output data type (e.g., `toDataFrame` outputs a pandas DataFrame). |
| 43 | +- **Segments that write files** use the form `[Ww]riteX`, where **X** is the file type (e.g., `writeExcel` writes an Excel file, `writePickle` writes a pickle file). |
| 44 | +- **Segments that read files** use the form `[Rr]eadX`, where **X** is the file type (e.g., `readExcel` should read an Excel file). |
| 45 | +- **Parameter names in segments** should be all lowercase, with words separated by underscores (_). |
| 46 | + |
| 47 | +### Parameter Names |
| 48 | + |
| 49 | +These parameter names should behave consistently across all units: |
| 50 | + |
| 51 | +- **item** should be used in field_segment, referring to the item passed to the function. It will not |
| 52 | + be a parameter to the segment in ChatterLang. |
| 53 | + |
| 54 | +- **items** is used in segment definitions, referring to the iterable over all the pieces of data in the stream. |
| 55 | +  It will not be a parameter in ChatterLang. |
| 56 | + |
| 57 | +- **set_as** |
| 58 | + If used, any processed output is attached to the original data using bracket notation. The original item is then emitted. |
| 59 | + |
| 60 | +- **fail_on_error** |
| 61 | + If True, the exception should be raised, likely aborting the pipeline. If False, the operation should continue |
| 62 | + and either None should be yielded or nothing, depending on the segment or source. A warning message should be logged. |
| 63 | + |
| 64 | +- **field** |
| 65 | + Specifies that the unit should operate on data accessed via “field syntax.” This syntax can include indices, properties, or parameter-free methods, separated by periods. |
| 66 | + - For example, given `{"X": ["a", "b", ["c", "d"]]}`, the field `"X.2.0"` refers to `"c"`. |
| 67 | + |
| 68 | +- **field_list** |
| 69 | + Specifies that a list of fields can or should be provided, with each field separated |
| 70 | + by a comma. In some cases, each field needs to be mapped to some other name. In |
| 71 | + those cases, the field and name should be separated by a colon. In field_lists, |
| 72 | + the underscore (_) refers to the item as a whole. |
| 73 | +  - For example, `"X.2.0:SomeName,X.1:SomeOtherName"`. If no name is provided, |
| 74 | +    the field name itself is used. Where only a list of fields is needed, |
| 75 | +    names can still be provided but have no effect. |
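The `field` and `field_list` rules above can be illustrated with a small plain-Python sketch. This is not TalkPipe's implementation: `extract_field` and `parse_field_list` are hypothetical names chosen for this example, and the lookup order (dict key, then sequence index, then attribute or parameter-free method) is an assumption.

```python
from typing import Any, List, Tuple


def extract_field(item: Any, field: str) -> Any:
    """Resolve a dotted field path such as "X.2.0" against an item (hypothetical helper)."""
    if field == "_":
        return item  # in a field_list, "_" refers to the item as a whole
    current = item
    for part in field.split("."):
        if isinstance(current, dict):
            current = current[part]
        elif isinstance(current, (list, tuple)):
            current = current[int(part)]
        else:
            attr = getattr(current, part)
            # Parameter-free methods are called; plain properties are returned as-is.
            current = attr() if callable(attr) else attr
    return current


def parse_field_list(field_list: str) -> List[Tuple[str, str]]:
    """Split "field:name,field2" into (field, name) pairs (hypothetical helper)."""
    pairs = []
    for entry in field_list.split(","):
        entry = entry.strip()
        if not entry:
            continue
        field, _, name = entry.partition(":")
        field = field.strip()
        # When no name is given, the field name itself is used.
        pairs.append((field, name.strip() or field))
    return pairs
```

With the example item from the text, `extract_field({"X": ["a", "b", ["c", "d"]]}, "X.2.0")` walks the key `X`, then index `2`, then index `0`.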
| 76 | + |
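The `set_as` and `fail_on_error` semantics described above might look like the following in plain Python. This is a sketch under stated assumptions, not TalkPipe code: `apply_to_items` is a hypothetical helper, and on a suppressed error it yields nothing (yielding `None` is the other behavior the convention allows).

```python
import logging
from typing import Any, Callable, Iterable, Iterator, Optional

logger = logging.getLogger(__name__)


def apply_to_items(
    items: Iterable[Any],
    func: Callable[[Any], Any],
    set_as: Optional[str] = None,
    fail_on_error: bool = True,
) -> Iterator[Any]:
    """Apply func to each item, honoring set_as and fail_on_error (hypothetical helper)."""
    for item in items:
        try:
            result = func(item)
        except Exception:
            if fail_on_error:
                raise  # propagate, likely aborting the pipeline
            logger.warning("Error processing item %r; skipping", item, exc_info=True)
            continue  # assumption: this sketch yields nothing for a failed item
        if set_as is not None:
            item[set_as] = result  # attach the output using bracket notation
            yield item  # emit the original item
        else:
            yield result
```

For example, `apply_to_items(docs, len_of_text, set_as="n")` would emit each original dictionary with an added `"n"` key, where `docs` and `len_of_text` are placeholders for your own data and function.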
| 77 | +### General Behavior Principles |
| 78 | + |
| 79 | +* Units that have side effects (e.g. writing data to a disk) should generally also pass |
| 80 | +on their data. |
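As a rough illustration of this principle, the generator below writes each item to a file and still yields it downstream. It is written in plain Python rather than with TalkPipe's decorators, and `write_lines` is a hypothetical name used only for this sketch.

```python
from typing import Any, Iterable, Iterator


def write_lines(items: Iterable[Any], path: str) -> Iterator[Any]:
    """Write each item as one line of text, then pass the item on unchanged."""
    with open(path, "w", encoding="utf-8") as handle:
        for item in items:
            handle.write(f"{item}\n")  # the side effect
            yield item  # the data keeps flowing to the next unit
```

Because the items are passed on, a later unit can keep processing the same stream after the file has been written.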
| 81 | + |
| 82 | +### Source and Segment Reference |
| 83 | + |
| 84 | +The chatterlang_workbench command starts a web service designed for experimentation. It also links to HTML and text versions |
| 85 | +of the documentation for all the sources and segments included in TalkPipe. |
| 86 | + |
| 87 | +After talkpipe is installed, the "chatterlang_reference_browser" script provides an interactive command-line tool for searching and exploring sources and segments. The "chatterlang_reference_generator" command generates single-page HTML and text versions of all the source and segment documentation. |
| 88 | + |
| 89 | +### Standard Configuration File Items |
| 90 | + |
| 91 | +Configuration constants can be defined either in `~/.talkpipe.toml` or in environment variables. A constant defined in an environment variable must be prefixed with `TALKPIPE_`; for example, `email_password` is stored in the environment variable `TALKPIPE_email_password`. In ChatterLang, any key defined in either place can be referenced in scripts as a parameter using `$var_name`. That reference resolves to the environment variable `TALKPIPE_var_name` or to `var_name` in `~/.talkpipe.toml`. |
| 92 | + |
| 93 | +* **default_embedding_model_source** - The default source (e.g. ollama) to be used for creating sentence embeddings. |
| 94 | +* **default_embedding_model_name** - The name of the model to be used for creating sentence embeddings. |
| 95 | +* **default_model_name** - The default name of an LLM to be used in chat |
| 96 | +* **default_model_source** - The default source (e.g. ollama) to be used in chat |
| 97 | +* **email_password** - Password for the SMTP server |
| 98 | +* **logger_files** - Files to store logs, in the form logger1:fname1,logger2:fname2,... |
| 99 | +* **logger_levels** - Logger levels in the form logger1:level1,logger2:level2 |
| 100 | +* **recipient_email** - Who should receive a sent email |
| 101 | +* **rss_url** - The default URL used by the rss segment |
| 102 | +* **sender_email** - Who the sender of an email should be |
| 103 | +* **smtp_port** - SMTP server port |
| 104 | +* **smtp_server** - SMTP server hostname |
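The lookup rule described at the top of this section can be sketched in plain Python. This is not TalkPipe's loader: `lookup` and `resolve_reference` are hypothetical names, the TOML file is represented by an already-parsed dictionary, and letting the environment variable win over the file is an assumption, since the text does not state a precedence.

```python
import os
from typing import Any, Mapping, Optional

ENV_PREFIX = "TALKPIPE_"


def lookup(name: str, file_values: Mapping[str, Any],
           environ: Optional[Mapping[str, str]] = None) -> Any:
    """Find a configuration value by its unprefixed name (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    env_key = f"{ENV_PREFIX}{name}"
    if env_key in environ:
        return environ[env_key]  # assumption: the environment overrides the file
    return file_values.get(name)


def resolve_reference(token: str, file_values: Mapping[str, Any],
                      environ: Optional[Mapping[str, str]] = None) -> Any:
    """Resolve a "$var_name" reference; any other token is returned unchanged."""
    if token.startswith("$"):
        return lookup(token[1:], file_values, environ)
    return token
```

Here `file_values` stands in for the parsed contents of `~/.talkpipe.toml`, so `resolve_reference("$smtp_server", file_values)` checks `TALKPIPE_smtp_server` first and then the `smtp_server` key.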
| 105 | + |
| 106 | +--- |
| 107 | + |
| 108 | +*For the main project overview, see the [project README](../../README.md).* |