
Commit 3ec335f

docs:eval add exp and summary res,make base eval as same ,del filter (#177)
1 parent cb8aeeb commit 3ec335f

File tree

1 file changed: +274 −2 lines


docs/eval_llm_result.md

@@ -17,8 +17,7 @@ This doc aims to summarize the performance of publicly available big language mo
 | Baichuan2-13B-Chat | 0.392 | eval in this project default param |
 | llama2_13b_hf | 0.449 | [numbersstation-eval-res](https://www.numbersstation.ai/post/nsql-llama-2-7b) |
 | llama2_13b_hf_lora_best | 0.744 | sft train by our this project,only used spider train dataset, the same eval way in this project. |
-| chatglm3_lora_default | 0.590 | sft train by our this project,only used spider train dataset, the same eval way in this project. |
-| chatglm3_qlora_default | 0.581 | sft train by our this project,only used spider train dataset, the same eval way in this project. |
+
@@ -28,6 +27,279 @@ It's important to note that our evaluation results are obtained based on the cur
 If you have improved methods for objective evaluation, we warmly welcome contributions to the project's codebase.

## LLMs Text-to-SQL capability evaluation before 20231208

The following are our experimental execution-accuracy results on Spider. This round is based on the database downloaded from [the Spider-based test-suite](https://github.com/taoyds/test-suite-sql-eval) (1.27 GB), which differs from the version on the official Spider [website](https://yale-lily.github.io/spider) (only 95 MB).
Execution accuracy (EX) per model, broken down by Spider query difficulty (easy / medium / hard / extra) and overall:
| Model | Method | EX (easy) | EX (medium) | EX (hard) | EX (extra) | EX (all) |
|-------|--------|-----------|-------------|-----------|------------|----------|
| Llama2-7B-Chat | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.887 | 0.641 | 0.489 | 0.331 | 0.626 |
| | qlora | 0.847 | 0.623 | 0.466 | 0.361 | 0.608 |
| Llama2-13B-Chat | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.907 | 0.729 | 0.552 | 0.343 | 0.68 |
| | qlora | 0.911 | 0.7 | 0.552 | 0.319 | 0.664 |
| CodeLlama-7B-Instruct | base | 0.214 | 0.177 | 0.092 | 0.036 | 0.149 |
| | lora | 0.923 | 0.756 | 0.586 | 0.349 | 0.702 |
| | qlora | 0.911 | 0.751 | 0.598 | 0.331 | 0.696 |
| CodeLlama-13B-Instruct | base | 0.698 | 0.601 | 0.408 | 0.271 | 0.539 |
| | lora | 0.94 | 0.789 | 0.684 | 0.404 | 0.746 |
| | qlora | 0.94 | 0.774 | 0.626 | 0.392 | 0.727 |
| Baichuan2-7B-Chat | base | 0.577 | 0.352 | 0.201 | 0.066 | 0.335 |
| | lora | 0.871 | 0.63 | 0.448 | 0.295 | 0.603 |
| | qlora | 0.891 | 0.637 | 0.489 | 0.331 | 0.624 |
| Baichuan2-13B-Chat | base | 0.581 | 0.413 | 0.264 | 0.187 | 0.392 |
| | lora | 0.903 | 0.702 | 0.569 | 0.392 | 0.678 |
| | qlora | 0.895 | 0.675 | 0.58 | 0.343 | 0.659 |
| Qwen-7B-Chat | base | 0.395 | 0.256 | 0.138 | 0.042 | 0.235 |
| | lora | 0.855 | 0.688 | 0.575 | 0.331 | 0.652 |
| | qlora | 0.911 | 0.675 | 0.575 | 0.343 | 0.662 |
| Qwen-14B-Chat | base | 0.871 | 0.632 | 0.368 | 0.181 | 0.573 |
| | lora | 0.895 | 0.702 | 0.552 | 0.331 | 0.663 |
| | qlora | 0.919 | 0.744 | 0.598 | 0.367 | 0.701 |
| ChatGLM3-6b | base | 0 | 0 | 0 | 0 | 0 |
| | lora | 0.855 | 0.605 | 0.477 | 0.271 | 0.59 |
| | qlora | 0.843 | 0.603 | 0.506 | 0.211 | 0.581 |
1. All lora and qlora models were trained with this project's default settings, using only the Spider training set.
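As background on the two methods in the table: LoRA freezes the base weights `W` and trains only a low-rank update `B @ A`, so very few parameters are updated (QLoRA applies the same idea over a 4-bit-quantized base model). A toy, dependency-free sketch of the merged weight, purely illustrative and not this project's training code:

```python
def lora_merge(W, A, B, alpha=16, r=2):
    """Effective weight W' = W + (alpha / r) * B @ A.

    W is d x k (frozen base), B is d x r and A is r x k (trained adapters).
    """
    scale = alpha / r
    d, k = len(W), len(W[0])
    return [[W[i][j] + scale * sum(B[i][t] * A[t][j] for t in range(r))
             for j in range(k)] for i in range(d)]

# Frozen 2x2 base weight plus a tiny rank-2 adapter (second rank is zero).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.1, 0.0], [0.0, 0.0]]   # d x r, trained
A = [[1.0, 1.0], [0.0, 0.0]]   # r x k, trained
merged = lora_merge(W, A, B)   # scale = 16 / 2 = 8
assert merged[0] == [1.8, 0.8]  # 1.0 + 8*0.1*1.0 and 0.0 + 8*0.1*1.0
```

Only `A` and `B` receive gradients during fine-tuning; the merge above is what happens (conceptually) when the adapter is folded back into the base model for inference.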
2. All candidate models use the same evaluation method and prompt, and the prompt explicitly requires the model to output only SQL. The base results for Llama2-7B-Chat, Llama2-13B-Chat, and ChatGLM3-6b are 0: on analysis, most errors occur because these models generate content other than SQL.
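That failure mode (SQL wrapped in explanations or markdown fences) is why some evaluation pipelines post-process generations before executing them. A hypothetical cleanup sketch; the function name and regexes are ours, not this project's:

```python
import re

def extract_sql(generation: str) -> str:
    """Pull the first SQL statement out of a free-form model generation."""
    # Prefer the contents of a ```sql ... ``` fence if one is present.
    fenced = re.search(r"```(?:sql)?\s*(.*?)```", generation,
                       re.DOTALL | re.IGNORECASE)
    if fenced:
        generation = fenced.group(1)
    # Otherwise fall back to the first SELECT ... (up to a semicolon or end).
    match = re.search(r"(SELECT\b.*?)(?:;|$)", generation,
                      re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else ""

raw = "Sure! Here is the query:\n```sql\nSELECT name FROM singer WHERE age > 26;\n```"
assert extract_sql(raw) == "SELECT name FROM singer WHERE age > 26"
```

Whether to apply such cleanup is an evaluation-design choice: doing so measures SQL quality, while skipping it (as in the base results above) also penalizes failure to follow the output format.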
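For context on how EX is scored in the Spider test-suite style: the gold and predicted queries are executed against the same database and their result sets compared. A minimal sketch with Python's built-in sqlite3, simplified relative to the real test-suite (which also handles row ordering for `ORDER BY`, value normalization, and multiple database variants):

```python
import sqlite3
from collections import Counter

def execution_match(db_conn, gold_sql: str, pred_sql: str) -> bool:
    """True if both queries run and return the same multiset of rows."""
    try:
        gold_rows = db_conn.execute(gold_sql).fetchall()
        pred_rows = db_conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # predicted SQL that fails to execute counts as a miss
    return Counter(gold_rows) == Counter(pred_rows)

# Tiny demo schema standing in for a Spider database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("A", 30), ("B", 25)])

assert execution_match(conn, "SELECT name FROM singer WHERE age > 26",
                       "SELECT name FROM singer WHERE age >= 27")
```

Overall EX is then simply the fraction of examples where this check passes, optionally reported per difficulty bucket as in the table above.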
## 2. Acknowledgements
Thanks to the following open source projects.
