| llama2_13b_hf_lora_best | 0.744 | SFT trained by this project on the Spider training set only; evaluated with the same method as this project. |
| chatglm3_lora_default | 0.590 | SFT trained by this project on the Spider training set only; evaluated with the same method as this project. |
| chatglm3_qlora_default | 0.581 | SFT trained by this project on the Spider training set only; evaluated with the same method as this project. |
If you have improved methods for objective evaluation, we warmly welcome contributions to the project's codebase.
## LLMs Text-to-SQL capability evaluation before 2023-12-08
The table below shows the execution accuracy (EX) from our experiments on Spider. This round of evaluation uses the database downloaded from [the Spider-based test-suite](https://github.com/taoyds/test-suite-sql-eval) (about 1.27 GB), which differs from the database on the official Spider [website](https://yale-lily.github.io/spider) (only about 95 MB).
Results by model:
<table>
<tr>
<th>Model</th>
<th>Method</th>
<th>EX (easy)</th>
<th>EX (medium)</th>
<th>EX (hard)</th>
<th>EX (extra)</th>
<th>EX (all)</th>
</tr>
<tr>
<td>Llama2-7B-Chat</td>
<td>base</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.887</td>
<td>0.641</td>
<td>0.489</td>
<td>0.331</td>
<td>0.626</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.847</td>
<td>0.623</td>
<td>0.466</td>
<td>0.361</td>
<td>0.608</td>
</tr>
<tr>
<td>Llama2-13B-Chat</td>
<td>base</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.907</td>
<td>0.729</td>
<td>0.552</td>
<td>0.343</td>
<td>0.68</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.911</td>
<td>0.7</td>
<td>0.552</td>
<td>0.319</td>
<td>0.664</td>
</tr>
<tr>
<td>CodeLlama-7B-Instruct</td>
<td>base</td>
<td>0.214</td>
<td>0.177</td>
<td>0.092</td>
<td>0.036</td>
<td>0.149</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.923</td>
<td>0.756</td>
<td>0.586</td>
<td>0.349</td>
<td>0.702</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.911</td>
<td>0.751</td>
<td>0.598</td>
<td>0.331</td>
<td>0.696</td>
</tr>
<tr>
<td>CodeLlama-13B-Instruct</td>
<td>base</td>
<td>0.698</td>
<td>0.601</td>
<td>0.408</td>
<td>0.271</td>
<td>0.539</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.94</td>
<td>0.789</td>
<td>0.684</td>
<td>0.404</td>
<td>0.746</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.94</td>
<td>0.774</td>
<td>0.626</td>
<td>0.392</td>
<td>0.727</td>
</tr>
<tr>
<td>Baichuan2-7B-Chat</td>
<td>base</td>
<td>0.577</td>
<td>0.352</td>
<td>0.201</td>
<td>0.066</td>
<td>0.335</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.871</td>
<td>0.63</td>
<td>0.448</td>
<td>0.295</td>
<td>0.603</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.891</td>
<td>0.637</td>
<td>0.489</td>
<td>0.331</td>
<td>0.624</td>
</tr>
<tr>
<td>Baichuan2-13B-Chat</td>
<td>base</td>
<td>0.581</td>
<td>0.413</td>
<td>0.264</td>
<td>0.187</td>
<td>0.392</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.903</td>
<td>0.702</td>
<td>0.569</td>
<td>0.392</td>
<td>0.678</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.895</td>
<td>0.675</td>
<td>0.58</td>
<td>0.343</td>
<td>0.659</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td>
<td>base</td>
<td>0.395</td>
<td>0.256</td>
<td>0.138</td>
<td>0.042</td>
<td>0.235</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.855</td>
<td>0.688</td>
<td>0.575</td>
<td>0.331</td>
<td>0.652</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.911</td>
<td>0.675</td>
<td>0.575</td>
<td>0.343</td>
<td>0.662</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td>base</td>
<td>0.871</td>
<td>0.632</td>
<td>0.368</td>
<td>0.181</td>
<td>0.573</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.895</td>
<td>0.702</td>
<td>0.552</td>
<td>0.331</td>
<td>0.663</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.919</td>
<td>0.744</td>
<td>0.598</td>
<td>0.367</td>
<td>0.701</td>
</tr>
<tr>
<td>ChatGLM3-6b</td>
<td>base</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.855</td>
<td>0.605</td>
<td>0.477</td>
<td>0.271</td>
<td>0.59</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.843</td>
<td>0.603</td>
<td>0.506</td>
<td>0.211</td>
<td>0.581</td>
</tr>
</table>
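The overall EX in the last column is consistent with a weighted average of the per-difficulty scores over the Spider dev split sizes (248 easy, 446 medium, 174 hard, 166 extra; these counts are an assumption, not stated in this table). A minimal sketch:

```python
# Sketch: recover the overall EX from per-difficulty EX scores.
# Assumed Spider dev split sizes (not stated in this table): 248/446/174/166.
SPLIT_SIZES = {"easy": 248, "medium": 446, "hard": 174, "extra": 166}


def overall_ex(scores: dict) -> float:
    """Weighted mean of per-difficulty execution accuracy."""
    total = sum(SPLIT_SIZES.values())
    return sum(scores[k] * n for k, n in SPLIT_SIZES.items()) / total


# Llama2-7B-Chat + lora row from the table above:
print(round(overall_ex({"easy": 0.887, "medium": 0.641,
                        "hard": 0.489, "extra": 0.331}), 3))
```

This reproduces the table's overall value (0.626) up to rounding of the per-split inputs.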
1. All LoRA and QLoRA variants were trained with the default configuration on the Spider training set.
2. All candidate models use the same evaluation method and prompt, and the prompt explicitly requires the model to output only SQL. The base evaluation results of Llama2-7B-Chat, Llama2-13B-Chat, and ChatGLM3-6b are 0; analysis shows that many errors come from generating content other than SQL.
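Execution accuracy (EX) counts a prediction as correct only if the predicted SQL executes and returns the same rows as the gold SQL on the same database. A minimal sketch with `sqlite3`, using a hypothetical toy schema (the project's actual evaluation uses the test-suite harness linked above):

```python
import sqlite3


def same_result(db: sqlite3.Connection, gold_sql: str, pred_sql: str) -> bool:
    """True iff both queries execute and return the same multiset of rows."""
    try:
        gold = db.execute(gold_sql).fetchall()
        pred = db.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # predicted SQL failed to parse or execute
    # Compare as order-insensitive multisets of rows.
    return sorted(map(repr, gold)) == sorted(map(repr, pred))


# Hypothetical toy database for illustration only:
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
db.executemany("INSERT INTO singer VALUES (?, ?)",
               [("Ann", 30), ("Bob", 25)])

# Two syntactically different but semantically equivalent queries:
print(same_result(db,
                  "SELECT name FROM singer WHERE age > 26",
                  "SELECT name FROM singer WHERE age >= 27"))
```

Note that this sketch treats row order as irrelevant; queries whose gold SQL has an `ORDER BY` would need an order-sensitive comparison.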