Commit 0bd1fd2

Spacy and Huggingface integrations (#84)

* first pass at adding spacy and hf integrations
* fix pyproject
* adding langchain and modifying document
* added testing
* adding docs
* finish docs
* fix test
* fix test2
* skip transformers test
* fix tests
* adding magicmock for iterable
* respond to feedback
* finalise document attributes
* update lock
* constrain numpy version

1 parent aa6ee27 commit 0bd1fd2

File tree: 11 files changed, +952 −514 lines changed

Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@

# HealthChain Integrations

This document provides an overview of the integration components available in the HealthChain package. These components allow you to easily incorporate popular NLP libraries into your HealthChain pipelines.

## Table of Contents

1. [SpacyComponent](#spacycomponent)
2. [HuggingFaceComponent](#huggingfacecomponent)
3. [LangChainComponent](#langchaincomponent)

## Installation Requirements

The integration components rely on third-party libraries that are not included in HealthChain's default installation. This design decision was made to:

- Maintain a lean and flexible core package
- Allow users to selectively install only the necessary dependencies
- Avoid potential version conflicts with other packages in your environment

To use these integrations, install the corresponding libraries yourself, for example with pip:

```bash
pip install spacy
python -m spacy download en_core_web_sm # or another desired model
pip install transformers
pip install langchain
```

## SpacyComponent

The `SpacyComponent` allows you to integrate spaCy models into your HealthChain pipeline. There are several ways to initialize this component with different types of spaCy models:

1. Using standard spaCy models:

```python
# Using a standard spaCy model (requires: python -m spacy download en_core_web_sm)
spacy_component = SpacyComponent("en_core_web_sm")
```

2. Loading custom trained pipelines from a directory:

```python
# Using a custom pipeline saved to disk
spacy_component = SpacyComponent("/path/to/your/custom/model")
```

3. Using specialized domain models such as [scispaCy](https://allenai.github.io/scispacy/), which are trained for processing clinical and biomedical text:

```python
# Using scispaCy models for biomedical text (requires: pip install scispacy)
spacy_component = SpacyComponent("en_core_sci_sm")
```

Choose the appropriate model based on your specific needs: standard models for general text, custom-trained models for domain-specific tasks, or specialized models like scispaCy for biomedical text analysis.

```python
from healthchain.pipeline.components.integrations import SpacyComponent

spacy_component = SpacyComponent(path_to_pipeline="en_core_web_sm")
```

When called on a document, this component processes the input document using the specified spaCy model and adds the resulting spaCy `Doc` object to the HealthChain `Document`.

### Example

```python
from healthchain.io.containers import Document
from healthchain.pipeline.base import Pipeline
from healthchain.pipeline.components.integrations import SpacyComponent

pipeline = Pipeline()
pipeline.add_node(SpacyComponent(path_to_pipeline="en_core_web_sm"))

doc = Document("This is a test sentence.")
processed_doc = pipeline(doc)

# Access spaCy annotations
spacy_doc = processed_doc.get_spacy_doc()
for token in spacy_doc:
    print(f"Token: {token.text}, POS: {token.pos_}, Lemma: {token.lemma_}")
```
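
If the loaded pipeline includes an entity recognizer (as `en_core_web_sm` and the scispaCy models do), the same spaCy `Doc` also exposes named entities through spaCy's standard API. A minimal sketch continuing the example above:

```python
# Continuing from the example above: inspect the named entities
# recognised by the spaCy pipeline (standard spaCy Doc attributes).
for ent in spacy_doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
```
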
## HuggingFaceComponent

The `HuggingFaceComponent` integrates Hugging Face Transformers models into your HealthChain pipeline. Models can be browsed on the [Hugging Face Hub](https://huggingface.co/models). Hugging Face offers models for a wide range of tasks, and while not all of these have been thoroughly tested for HealthChain compatibility, we expect all NLP models and tasks to be compatible. If you have any issues integrating a model, please raise an issue on our [GitHub repository](https://github.com/dotimplement/HealthChain)!

```python
from healthchain.pipeline.components.integrations import HuggingFaceComponent

huggingface_component = HuggingFaceComponent(task="sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
```
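
Other tasks follow the same pattern. For example, assuming the same interface, a summarization component could be configured as in the sketch below (the model name is just a publicly available example, not a HealthChain recommendation):

```python
# Illustrative sketch: any Hub model that supports the chosen task
# should plug in the same way (not exhaustively tested).
summarizer = HuggingFaceComponent(task="summarization", model="facebook/bart-large-cnn")
```
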
- `task` (str): The NLP task to perform (e.g., "sentiment-analysis", "ner", "summarization").
- `model` (str): The name or path of the Hugging Face model to use.

This component applies the specified Hugging Face model to the input document and stores the output in the HealthChain Document's `huggingface_outputs` dictionary.

### Example

```python
from healthchain.io.containers import Document
from healthchain.pipeline.base import Pipeline
from healthchain.pipeline.components.integrations import HuggingFaceComponent

pipeline = Pipeline()
pipeline.add_node(HuggingFaceComponent(task="sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english"))

doc = Document("I love using HealthChain for my NLP projects!")
processed_doc = pipeline(doc)

# Access Hugging Face output
sentiment_result = processed_doc.get_huggingface_output("sentiment-analysis")
print(f"Sentiment: {sentiment_result}")
```

## LangChainComponent

The `LangChainComponent` allows you to integrate LangChain chains into your HealthChain pipeline.

```python
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.llms import FakeListLLM
from healthchain.pipeline.components.integrations import LangChainComponent

# Let's create a simple FakeListLLM for demonstration
fake_llm = FakeListLLM(responses=["This is a great summary!"])

# Define the prompt template
prompt = PromptTemplate.from_template("Summarize the following text: {text}")

# Create the LCEL chain
chain = prompt | fake_llm | StrOutputParser()

langchain_component = LangChainComponent(chain=chain)
```

- `chain`: A LangChain chain object to be executed within the pipeline.

This component runs the specified LangChain chain on the input document's text and stores the output in the HealthChain Document's `langchain_outputs` dictionary.

### Example

```python
from healthchain.io.containers import Document
from healthchain.pipeline.base import Pipeline
from healthchain.pipeline.components.integrations import LangChainComponent
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.llms import FakeListLLM

# Set up LangChain with a FakeListLLM
fake_llm = FakeListLLM(responses=["HealthChain integrates NLP libraries for easy pipeline creation."])

# Define the prompt template
prompt = PromptTemplate.from_template("Summarize the following text: {text}")

# Create the LCEL chain
chain = prompt | fake_llm | StrOutputParser()

# Set up your HealthChain pipeline
pipeline = Pipeline()
pipeline.add_node(LangChainComponent(chain=chain))

# Let's summarize something
doc = Document("HealthChain is a powerful package for building NLP pipelines. It integrates seamlessly with popular libraries like spaCy, Hugging Face Transformers, and LangChain, allowing users to create complex NLP workflows with ease.")
processed_doc = pipeline(doc)

# What summary did we get?
summary = processed_doc.get_langchain_output("chain_output")
print(f"Summary: {summary}")
```
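
In practice you would swap the `FakeListLLM` for a real model. A minimal sketch, assuming you have `langchain-openai` installed and an `OPENAI_API_KEY` set in your environment (the model name is just an example):

```python
# Sketch only: the chain construction is identical, only the LLM changes.
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | llm | StrOutputParser()  # reuses the prompt defined above
langchain_component = LangChainComponent(chain=chain)
```
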
## Combining Components

You can easily combine multiple integration components in a single HealthChain pipeline:

```python
from healthchain.io.containers import Document
from healthchain.pipeline.base import Pipeline
from healthchain.pipeline.components.integrations import SpacyComponent, HuggingFaceComponent, LangChainComponent
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.llms import FakeListLLM

# Set up our components
spacy_component = SpacyComponent(path_to_pipeline="en_core_web_sm")
huggingface_component = HuggingFaceComponent(task="sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Set up LangChain with a FakeListLLM
fake_llm = FakeListLLM(responses=["HealthChain: Powerful NLP pipeline builder."])

# Define the prompt template
prompt = PromptTemplate.from_template("Summarize the following text: {text}")

# Create the LCEL chain
chain = prompt | fake_llm | StrOutputParser()
langchain_component = LangChainComponent(chain=chain)

# Build our pipeline
pipeline = Pipeline()
pipeline.add_node(spacy_component)
pipeline.add_node(huggingface_component)
pipeline.add_node(langchain_component)
pipeline.build()

# Process a document
doc = Document("HealthChain makes it easy to build powerful NLP pipelines!")
processed_doc = pipeline(doc)

# Let's see what we got!
spacy_doc = processed_doc.get_spacy_doc()
sentiment = processed_doc.get_huggingface_output("sentiment-analysis")
summary = processed_doc.get_langchain_output("chain_output")

print(f"Tokens: {[token.text for token in spacy_doc]}")
print(f"Sentiment: {sentiment}")
print(f"Summary: {summary}")
```

This documentation provides an overview of the integration components available in HealthChain. For more detailed information on each library, please refer to their respective documentation:

- [spaCy Documentation](https://spacy.io/api)
- [Hugging Face Transformers Documentation](https://huggingface.co/transformers/)
- [LangChain Documentation](https://python.langchain.com/docs/introduction/)

docs/reference/pipeline/pipeline.md

Lines changed: 6 additions & 0 deletions

````diff
@@ -131,6 +131,12 @@ pipeline.add_input(cda_connector)
 pipeline.add_output(cda_connector)
 ```
 
+### Integrations
+
+HealthChain offers integrations with popular NLP libraries, allowing you to build more sophisticated pipelines. These include components for spaCy, Hugging Face Transformers, and LangChain, so you can leverage state-of-the-art NLP models and techniques within your HealthChain workflows.
+
+Integrations are covered in detail on the [Integrations](./integrations.md) page.
+
 ## Pipeline Management 🔨
 
 #### Adding
````
