Open
Description
Code:
import pandas as pd
import numpy as np
from dask.distributed import Client
from dask_sql import Context
client = Client()
cont = Context()
df1 = pd.DataFrame({
'dated': pd.date_range(pd.Timestamp('2021-01-01'), pd.Timestamp('2021-01-10')),
'var1': np.ones(10)})
df2 = pd.DataFrame({
'startdate': [pd.Timestamp('2020-12-30'), pd.Timestamp('2021-01-09')],
'enddate': [pd.Timestamp('2021-01-03'), pd.Timestamp('2021-01-20')],
'var2': np.array([2.0, 3.0])})
cont.create_table('df1', df1)
cont.create_table('df2', df2)
df3 = cont.sql(
"""select a.*, b.var2
from df1 a left join df2 b
on b.startdate<=a.dated and a.dated<=b.enddate""").compute()
Results:
df1
:
dated var1
0 2021-01-01 1.0
1 2021-01-02 1.0
2 2021-01-03 1.0
3 2021-01-04 1.0
4 2021-01-05 1.0
5 2021-01-06 1.0
6 2021-01-07 1.0
7 2021-01-08 1.0
8 2021-01-09 1.0
9 2021-01-10 1.0
df2
:
startdate enddate var2
0 2020-12-30 2021-01-03 2.0
1 2021-01-09 2021-01-20 3.0
df3
:
dated var1 var2
0 2021-01-01 1.0 2.0
2 2021-01-02 1.0 2.0
4 2021-01-03 1.0 2.0
17 2021-01-09 1.0 3.0
19 2021-01-10 1.0 3.0
This is an Inner Join, not Left Join.
The correct output should be as follows, using sqlite3:
import sqlite3
# Connect database
conn = sqlite3.connect(':memory:')
df1.to_sql('df1', conn, index=False)
df2.to_sql('df2', conn, index=False)
df3 = pd.read_sql_query(
"""select a.*, b.var2
from df1 a left join df2 b
on b.startdate<=a.dated and a.dated<=b.enddate""",
conn,
parse_dates=['dated'])
where df3
is
dated var1 var2
0 2021-01-01 1.0 2.0
1 2021-01-02 1.0 2.0
2 2021-01-03 1.0 2.0
3 2021-01-04 1.0 NaN
4 2021-01-05 1.0 NaN
5 2021-01-06 1.0 NaN
6 2021-01-07 1.0 NaN
7 2021-01-08 1.0 NaN
8 2021-01-09 1.0 3.0
9 2021-01-10 1.0 3.0
Metadata
Metadata
Assignees
Labels
No labels