Performance Suggestion: Replace iterrows().to_dict() with to_dict(orient='records') for better efficiency #26

https://github.com/bangoc123/drop-rag/blob/76726f84de8cb67a8a1accf4900a0599571a5365/utils.py#L18
Current code:
`metadatas = [row.to_dict() for _, row in batch_df.iterrows()]`

Recommended replacement:
`metadatas = batch_df.to_dict(orient='records')`

The original implementation uses iterrows(), which builds a new pandas Series for every row, and then calls row.to_dict() on each one. Constructing a Series per row is expensive: each one gets its own index (the column labels) and a dtype inferred across the row's values. This adds substantial Python-level overhead in both time and memory, and the cost grows with the number of rows in the DataFrame.
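To make the per-row cost concrete, here is a minimal sketch of what iterrows() hands back. The `batch_df` below is a hypothetical two-column stand-in, not the DataFrame built in utils.py. It also shows a documented side effect of building one Series per row: values are coerced to a single dtype for the row.

```python
import pandas as pd

# Hypothetical stand-in for batch_df: one integer column, one float column.
batch_df = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.75]})

# iterrows() yields (index_label, Series) pairs; keep only the Series.
rows = [row for _, row in batch_df.iterrows()]
first = rows[0]

is_series = isinstance(first, pd.Series)  # a full Series object per row
row_index = list(first.index)             # each row carries the column labels
row_dtype = str(first.dtype)              # one dtype for the whole row

print(is_series, row_index, row_dtype)
print(first.to_dict())  # the integer id has become a float
```

Every element of `rows` is a separate Series allocation, which is the overhead described above.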

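In contrast, to_dict(orient='records') builds the whole list of dictionaries in a single call and never constructs a per-row Series. It is not vectorized in the NumPy sense (pandas still visits each row internally), but it works from plain tuples of values rather than Series objects. That is typically much faster on large batches and allocates fewer intermediate objects, which matters in production scenarios where large DataFrames are processed frequently. One behavioral difference to check before switching: iterrows() coerces each row to a single dtype, so in a frame that mixes integer and float columns the integers come out as floats, whereas to_dict(orient='records') keeps each column's own type.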
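A rough way to check both the equivalence and the speed difference locally is sketched below. The DataFrame contents, the helper names `with_iterrows` and `with_records`, and the row count are all assumptions made for illustration. Actual timings depend on the pandas version, the column dtypes, and the batch size, so treat the printed numbers as indicative only.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for batch_df: string-only metadata columns.
N_ROWS = 2_000
batch_df = pd.DataFrame({
    "title": [f"doc {i}" for i in range(N_ROWS)],
    "source": ["wiki"] * N_ROWS,
})


def with_iterrows(df):
    # Current approach: one Series per row, then Series.to_dict().
    return [row.to_dict() for _, row in df.iterrows()]


def with_records(df):
    # Proposed approach: a single call that returns a list of dicts.
    return df.to_dict(orient="records")


# For string-only columns the two approaches produce identical output.
same_output = with_iterrows(batch_df) == with_records(batch_df)

# Time a few repetitions of each; absolute values vary by machine.
iterrows_s = timeit.timeit(lambda: with_iterrows(batch_df), number=3)
records_s = timeit.timeit(lambda: with_records(batch_df), number=3)

print(f"same output: {same_output}")
print(f"iterrows: {iterrows_s:.4f}s  records: {records_s:.4f}s")
```

If the real metadata contains mixed numeric columns, compare the two outputs on a sample batch before switching, since the value types can differ.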

