Commit 61ababc

feat(sql_connector): add support for sql connector (#1543)
* feat(sql_connector): adding support for the create and push of sql connectors
* feat(sql_implementation): add columns of sql in schema file
* feat(dataset): add test cases for the sql connectors
* feat(sql_connectors): add unit tests for sql connectors
* fix(connector): pull dataframe fixed
* fix(sql_connector): update docs
* remove extra schema redeclaration
1 parent 6ad4539 commit 61ababc

12 files changed

Lines changed: 863 additions & 126 deletions

docs/v3/semantic-layer.mdx

Lines changed: 127 additions & 32 deletions
@@ -1,15 +1,17 @@
11
---
2-
title: 'Semantic Layer'
3-
description: 'Turn raw data into semantic-enhanced and clean dataframes'
2+
title: "Semantic Layer"
3+
description: "Turn raw data into semantic-enhanced and clean dataframes"
44
---
55

66
<Note title="Beta Notice">
7-
Release v3 is currently in beta. This documentation reflects the features and functionality in progress and may change before the final release.
7+
Release v3 is currently in beta. This documentation reflects the features and
8+
functionality in progress and may change before the final release.
89
</Note>
910

1011
## What's the Semantic Layer?
1112

1213
The semantic layer allows you to turn raw data into [dataframes](/v3/dataframes) you can ask questions to and [share with your team](/v3/share-dataframes) as conversational AI dashboards. It serves several important purposes:
14+
1315
1. **Data Configuration**: Define how your data should be loaded and processed
1416
2. **Semantic Information**: Add context and meaning to your data columns
1517
3. **Data Transformation**: Specify how data should be cleaned and transformed
@@ -60,7 +62,9 @@ pai.create(
6062
...
6163
)
6264
```
65+
6366
**Type**: `str`
67+
6468
- A string without special characters or spaces
6569
- Using kebab-case naming convention
6670
- Unique within your project
@@ -80,6 +84,7 @@ pai.create(
8084
```
8185

8286
**Type**: `str`
87+
8388
- Must follow the format: "organization-identifier/dataset-identifier"
8489
- Organization identifier should be unique to your organization
8590
- Dataset identifier should be unique within your organization
@@ -101,11 +106,42 @@ pai.create(
101106
```
102107

103108
**Type**: `DataFrame`
109+
104110
- Must be a pandas DataFrame created with `pai.read_csv()`
105111
- Contains the raw data you want to enhance with semantic information
106112
- Required parameter for creating a semantic layer
107113

114+
#### Connectors
115+
116+
The connector field lets you connect data sources such as PostgreSQL, MySQL, and SQLite to the semantic layer.
117+
For example, if you're working with a SQL database, you can specify the connection details using the connector field.
118+
119+
```python
120+
121+
pai.create(
122+
    path="acme-corp/sales-data",
123+
    connector={
124+
        "type": "postgres",
125+
        "connection": {
126+
            "host": "postgres-host",
127+
            "port": 5432,
128+
            "user": "postgres",
129+
            "password": "*****",
130+
            "database": "postgres",
131+
        },
132+
        "table": "orders",
133+
    },
134+
    ...
135+
)
136+
```
137+
138+
**Type**: `Dict`
139+
140+
- Must be a SQL connector source dict
141+
- Must include the connection details needed to create the semantic layer
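
Before passing a connector dict to `pai.create`, it can help to check its shape locally. The sketch below is a minimal, hypothetical validator (`check_connector` is not part of PandaAI). It assumes that every key shown in the PostgreSQL example above is required; the library itself may accept other shapes, for instance a `file_path` connection for SQLite.

```python
# Hypothetical helper, not part of the PandaAI API.
# Assumption: the keys from the PostgreSQL example are all required.
REQUIRED_TOP_LEVEL = {"type", "connection", "table"}
REQUIRED_CONNECTION = {"host", "port", "user", "password", "database"}


def check_connector(connector):
    """Return a list of problems found in a connector dict (empty if none)."""
    problems = []
    missing = REQUIRED_TOP_LEVEL - connector.keys()
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    connection = connector.get("connection", {})
    if not isinstance(connection, dict):
        problems.append("connection must be a dict")
        return problems
    missing_connection = REQUIRED_CONNECTION - connection.keys()
    if missing_connection:
        problems.append(f"missing connection keys: {sorted(missing_connection)}")
    return problems


connector = {
    "type": "postgres",
    "connection": {
        "host": "postgres-host",
        "port": 5432,
        "user": "postgres",
        "password": "*****",
        "database": "postgres",
    },
    "table": "orders",
}
problems = check_connector(connector)
```

Running a check like this first turns a missing key into a readable message instead of a failure at connection time.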
142+
108143
#### description
144+
109145
A clear text description that helps others understand the dataset's contents and purpose.
110146

111147
```python
@@ -121,15 +157,17 @@ pai.create(
121157
```
122158

123159
**Type**: `str`
160+
124161
- The purpose of the dataset
125162
- The type of data contained
126163
- Any relevant context about data collection or usage
127164
- Optional but recommended for better data understanding
128165

129166
#### columns
167+
130168
Define the structure and metadata of your dataset's columns to help PandaAI understand your data better.
131169

132-
**Note**: If the `columns` parameter is not provided, all columns from the input dataframe will be included in the semantic layer.
170+
**Note**: If the `columns` parameter is not provided, all columns from the input dataframe will be included in the semantic layer.
133171
When specified, only the declared columns will be included, allowing you to select specific columns for your semantic layer.
134172

135173
```python
@@ -171,6 +209,7 @@ pai.create(
171209
```
172210

173211
**Type**: `dict[str, dict]`
212+
174213
- Keys: column names as they appear in your DataFrame
175214
- Values: dictionary containing:
176215
- `type` (str): Data type of the column
@@ -181,22 +220,28 @@ pai.create(
181220
- "boolean": flags, true/false values
182221
- `description` (str): Clear explanation of what the column represents
183222

184-
185223
### For other data sources: YAML configuration
186224

187225
For other data sources (SQL databases, data warehouses, etc.), create a YAML file in your datasets folder:
226+
188227
> Keep in mind that you have to install the sql, cloud data (ee), or yahoo_finance data extension to use this feature.
189228
190-
Example
229+
Example PostgreSQL YAML file:
191230

192231
```yaml
193232
name: SalesData # Dataset name
194233
description: "Sales data from our SQL database"
195234

196235
source:
197-
  type: postgresql
198-
  connection_string: "postgresql://user:pass@localhost:5432/db"
199-
  query: "SELECT * FROM sales"
236+
  type: postgres
237+
  connection:
238+
    host: postgres-host
239+
    port: 5432
240+
    database: postgres
241+
    user: postgres
242+
    password: "******"
243+
  table: orders
244+
  view: false
200245

201246
columns:
202247
- name: transaction_id
@@ -207,26 +252,54 @@ columns:
207252
description: Date and time of the sale
208253
```
209254
255+
Example SQLite YAML file:
256+
257+
```yaml
258+
name: SalesData # Dataset name
259+
description: "Sales data from our SQL database"
260+
261+
source:
262+
  type: sqlite
263+
  connection:
264+
    file_path: /Users/arslan/Documents/SinapTik/pandas-ai/companies.db
265+
  table: companies
266+
  view: false
267+
268+
description: Companies table
269+
columns:
270+
  - name: id
271+
    type: integer
272+
  - name: name
273+
    type: string
274+
  - name: domain
275+
    type: string
276+
  - name: year_founded
277+
    type: float
278+
```
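
To try the SQLite configuration locally, you need a database file that contains a `companies` table. The sketch below creates one with Python's standard `sqlite3` module. The temporary file location and the sample rows are made up for illustration, and the mapping of the YAML types (`integer`, `string`, `float`) to SQLite column types is an assumption.

```python
import os
import sqlite3
import tempfile

# Illustrative location; `file_path` in the YAML would point at this file.
db_path = os.path.join(tempfile.mkdtemp(), "companies.db")

conn = sqlite3.connect(db_path)
conn.execute(
    """
    CREATE TABLE companies (
        id INTEGER PRIMARY KEY,
        name TEXT,
        domain TEXT,
        year_founded REAL
    )
    """
)

# Made-up sample rows.
rows = [
    (1, "Acme Corp", "acme.example", 1999.0),
    (2, "Globex", "globex.example", 2005.0),
]
conn.executemany("INSERT INTO companies VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Read the column names back to confirm they match the YAML `columns` list.
column_names = [row[1] for row in conn.execute("PRAGMA table_info(companies)")]
conn.close()
```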
279+
210280
### YAML Semantic Layer Configuration
211281
212282
The following sections detail all available configuration options for your schema.yaml file:
213283
214284
#### name (mandatory)
285+
215286
The name field identifies your dataset in the schema.yaml file.
287+
216288
```yaml
217289
name: sales-data
218290
```
219291
220-
221292
**Type**: `str`
293+
222294
- A string without special characters or spaces
223295
- Using kebab-case naming convention
224296
- Unique within your project
225297
- Examples: "sales-data", "customer-profiles"
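
The naming rules above can be checked with a short regular expression. This is a minimal sketch, assuming that kebab-case here means lowercase letters and digits joined by single hyphens; `is_kebab_case` is a hypothetical helper, and the library's own validation may be more lenient.

```python
import re

# Assumption: lowercase letters and digits, separated by single hyphens.
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_kebab_case(name: str) -> bool:
    """Return True if `name` is a kebab-case identifier."""
    return KEBAB_CASE.fullmatch(name) is not None


valid = [name for name in ("sales-data", "customer-profiles") if is_kebab_case(name)]
invalid = [name for name in ("Sales Data", "sales_data", "sales--data") if not is_kebab_case(name)]
```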
226298

227-
228299
#### columns
300+
229301
Define the structure and metadata of your dataset's columns to help PandaAI understand your data better.
302+
230303
```yaml
231304
columns:
232305
- name: transaction_id
@@ -238,6 +311,7 @@ columns:
238311
```
239312

240313
**Type**: `list[dict]`
314+
241315
- Each dictionary represents a column.
242316
- **Fields**:
243317
- `name` (str): Name of the column.
@@ -252,10 +326,12 @@ columns:
252326
- `description` (str): Clear explanation of what the column represents.
253327

254328
**Constraints**:
329+
255330
1. Column names must be unique.
256331
2. For views, all column names must be in the format `[table].[column]`.
257332

258333
#### transformations
334+
259335
Apply transformations to your data to clean, convert, or anonymize it.
260336

261337
```yaml
@@ -274,26 +350,34 @@ transformations:
274350
```
275351

276352
**Type**: `list[dict]`
353+
277354
- Each dictionary represents a transformation
278355
- `type` (str): Type of transformation
279356
- "anonymize" for anonymizing data
280357
- "convert_timezone" for converting timezones
281358
- `params` (dict): Parameters for the transformation
282359

283-
284360
#### source (mandatory)
361+
285362
Specify the data source for your dataset.
286363

287364
```yaml
288365
source:
289-
  type: postgresql
290-
  connection_string: "postgresql://user:pass@localhost:5432/db"
291-
  query: "SELECT * FROM sales"
366+
  type: postgres
367+
  connection:
368+
    host: postgres-host
369+
    port: 5432
370+
    database: postgres
371+
    user: postgres
372+
    password: "******"
373+
  table: orders
374+
  view: false
292375
```
293376

294377
> The available data sources depend on the installed data extensions (sql, cloud data (ee), yahoo_finance).
295378

296379
**Type**: `dict`
380+
297381
- `type` (str): Type of data source
298382
- "postgres" for PostgreSQL databases
299383
- "mysql" for MySQL databases
@@ -306,11 +390,14 @@ source:
306390
- `connection_string` (str): Connection string for the data source
307391
- `query` (str): Query to retrieve data from the data source
308392

309-
{/* commented as destination and update frequency will be only in the materialized case
393+
{/\* commented as destination and update frequency will be only in the materialized case
394+
310395
#### destination (mandatory)
396+
311397
Specify the destination for your dataset.
312398

313399
**Type**: `dict`
400+
314401
- `type` (str): Type of destination
315402
- "local" for local storage
316403
- `format` (str): Format of the data
@@ -324,11 +411,12 @@ destination:
324411
path: /path/to/data
325412
```
326413

327-
328414
#### update_frequency
415+
329416
Specify the frequency of updates for your dataset.
330417

331418
**Type**: `str`
419+
332420
- "daily" for daily updates
333421
- "weekly" for weekly updates
334422
- "monthly" for monthly updates
@@ -337,12 +425,15 @@ Specify the frequency of updates for your dataset.
337425
```yaml
338426
update_frequency: daily
339427
```
340-
*/}
428+
429+
\*/}
341430

342431
#### order_by
432+
343433
Specify the columns to order by.
344434

345435
**Type**: `list[str]`
436+
346437
- Each string should be in the format "column_name DESC" or "column_name ASC"
347438

348439
```yaml
@@ -352,6 +443,7 @@ order_by:
352443
```
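
An `order_by` entry in the "column_name DESC" or "column_name ASC" format can be split into its two parts as sketched below. `parse_order_by` is a hypothetical helper, not a PandaAI function, and accepting lowercase directions is an assumption made for convenience.

```python
def parse_order_by(entry: str):
    """Split an entry such as 'created_at DESC' into (column, direction)."""
    parts = entry.split()
    # Assumption: exactly one column name followed by ASC or DESC (any case).
    if len(parts) != 2 or parts[1].upper() not in {"ASC", "DESC"}:
        raise ValueError(f"expected 'column_name ASC' or 'column_name DESC', got {entry!r}")
    return parts[0], parts[1].upper()


parsed = [parse_order_by(entry) for entry in ("created_at DESC", "amount asc")]
```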
353444

354445
#### limit
446+
355447
Specify the maximum number of records to load.
356448

357449
**Type**: `int`
@@ -371,34 +463,37 @@ name: table_heart
371463
source:
372464
  type: postgres
373465
  connection:
374-
    host: localhost
466+
    host: postgres-host
375467
    port: 5432
376-
    database: test
377-
    user: test
378-
    password: test
379-
  view: true
468+
    database: postgres
469+
    user: postgres
470+
    password: "******"
471+
  table: heart
472+
  view: false
380473
columns:
381-
- name: parents.id
382-
- name: parents.name
383-
- name: parents.age
384-
- name: children.name
385-
- name: children.age
474+
- name: parents.id
475+
- name: parents.name
476+
- name: parents.age
477+
- name: children.name
478+
- name: children.age
386479
relations:
387-
- name: parent_to_children
388-
description: Relation linking the parent to its children
389-
from: parents.id
390-
to: children.id
480+
- name: parent_to_children
481+
description: Relation linking the parent to its children
482+
from: parents.id
483+
to: children.id
391484
```
392485

393486
---
394487

395488
#### Constraints
396489

397490
1. **Mutual Exclusivity**:
491+
398492
- A schema cannot define both `table` and `view` simultaneously.
399493
- If `source.view` is `true`, then the schema represents a view.
400494

401495
2. **Column Format**:
496+
402497
- For views:
403498
- All columns must follow the format `[table].[column]`.
404499
- `from` and `to` fields in `relations` must follow the `[table].[column]` format.
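
The column format constraint for views can be checked mechanically. The sketch below is a hypothetical check (`check_view_schema` is not part of PandaAI) and assumes that table and column names are plain identifiers made of letters, digits, and underscores. It uses the column and relation names from the example schema above.

```python
import re

# Assumption: table and column names are plain identifiers.
QUALIFIED_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\.[A-Za-z_][A-Za-z0-9_]*")


def check_view_schema(columns, relations):
    """Return the fields that do not follow the `[table].[column]` format."""
    problems = []
    for column in columns:
        if not QUALIFIED_NAME.fullmatch(column["name"]):
            problems.append(f"column name: {column['name']}")
    for relation in relations:
        for key in ("from", "to"):
            if not QUALIFIED_NAME.fullmatch(relation[key]):
                problems.append(f"relation {relation['name']} {key}: {relation[key]}")
    return problems


columns = [
    {"name": "parents.id"},
    {"name": "parents.name"},
    {"name": "parents.age"},
    {"name": "children.name"},
    {"name": "children.age"},
]
relations = [
    {"name": "parent_to_children", "from": "parents.id", "to": "children.id"},
]
problems = check_view_schema(columns, relations)
```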
