Does not work well and is not accurate

It stops working well after 2-3 questions.

Sequence: "What is the status of my flight?" (works) -> "Is there wifi available on the flight?" (works) -> "Can you cancel my flight?" (does not work; the request goes to the FAQ tool instead).

Is this being used internally anywhere at OpenAI? Is there any benchmark for this?