Current code:
metadatas = [row.to_dict() for _, row in batch_df.iterrows()]
Recommended replacement:
metadatas = batch_df.to_dict(orient='records')
The original implementation uses iterrows(), which yields each row as a pandas Series, and then calls row.to_dict() on every one of them. Building a Series per row is the expensive part: each one allocates a new object with its own index and a dtype inferred across the whole row, only to be discarded after the conversion. This per-row overhead runs in a Python-level loop, so the cost grows linearly with the number of rows and becomes noticeable on large DataFrames.
In contrast, to_dict(orient='records') builds the list of dictionaries in a single call and does not construct a Series for each row. It still produces one dictionary per row, so it is not vectorized in the NumPy sense, but it skips the per-row Series allocation and dtype inference, which is where most of the time goes. In practice this is usually substantially faster for large batches and creates far fewer temporary objects, which matters in production code paths that process large DataFrames frequently.
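The comparison above can be checked with a minimal sketch like the one below. The contents of batch_df here are a hypothetical stand-in (the real columns are not shown in this review), and the helper names metadatas_iterrows and metadatas_records are introduced only for illustration.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for batch_df; the real columns are not shown in the review.
batch_df = pd.DataFrame(
    {
        "doc_id": [f"doc-{i}" for i in range(2_000)],
        "source": ["crawl"] * 2_000,
        "page": list(range(2_000)),
    }
)


def metadatas_iterrows(df: pd.DataFrame) -> list:
    """Current approach: one Series is built per row, then converted."""
    return [row.to_dict() for _, row in df.iterrows()]


def metadatas_records(df: pd.DataFrame) -> list:
    """Recommended approach: a single call that skips the per-row Series."""
    return df.to_dict(orient="records")


old = metadatas_iterrows(batch_df)
new = metadatas_records(batch_df)

# Both approaches should produce the same list of dicts for this frame.
assert old == new

# Rough timing only; absolute numbers depend on the machine and pandas version.
t_old = timeit.timeit(lambda: metadatas_iterrows(batch_df), number=5)
t_new = timeit.timeit(lambda: metadatas_records(batch_df), number=5)
print(f"iterrows: {t_old:.3f}s  to_dict(records): {t_new:.3f}s")
```

One caveat worth checking before swapping the call: because iterrows() returns each row as a single Series, it does not preserve dtypes across columns. In a frame that mixes integer and float columns, the integer values come back as floats from iterrows(), whereas to_dict(orient='records') keeps them as integers. If downstream code depends on the upcast values, the outputs of the two approaches may differ in type even when they compare equal.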