The number of chunks produced from a CSV file is not always consistent. During testing, the .report_from_directory() function generated 76 chunks for the test dataset. When the report.csv file it produced is copied and renamed, chunking the copy sometimes returns 76 chunks and other times 75. One potential cause is the way whitespace or line breaks are treated in the document: some autoformatting may be reducing the number of characters in the CSV file.
The exact root cause of this issue should be investigated, along with whether it can be resolved by adding a pre-processing step that removes problematic characters before the chunking step to improve reliability.
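As a starting point for that investigation, the sketch below shows one way to check whether two copies of the file differ at the byte level and to normalize the text before chunking. The helper names `summarize_bytes` and `normalize_csv_text` are hypothetical and not part of the existing code. The sketch assumes the file is UTF-8 encoded and that the differences come from a byte-order mark, line endings, or trailing whitespace; none of this has been confirmed as the actual cause.

```python
import unicodedata


def summarize_bytes(raw: bytes) -> dict:
    """Count the byte-level features most likely to differ between two copies."""
    crlf = raw.count(b"\r\n")
    return {
        "size": len(raw),
        "has_bom": raw.startswith(b"\xef\xbb\xbf"),
        "crlf": crlf,
        "lf_only": raw.count(b"\n") - crlf,
        "ends_with_newline": raw.endswith(b"\n"),
    }


def normalize_csv_text(raw: bytes) -> str:
    """Return text with a canonical form for BOM, line endings, and trailing whitespace.

    Assumption: the input is UTF-8. Note that this also alters trailing
    whitespace and line breaks inside quoted multi-line fields.
    """
    # "utf-8-sig" drops a leading byte-order mark if one is present.
    text = raw.decode("utf-8-sig")
    # Compose equivalent Unicode sequences into a single canonical form.
    text = unicodedata.normalize("NFC", text)
    # Convert Windows (\r\n) and bare \r line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Strip trailing spaces and tabs from every line.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # End the text with exactly one newline.
    return "\n".join(lines).rstrip("\n") + "\n"
```

Running `summarize_bytes` on the original report.csv and on a copy that yields 75 chunks would show whether the two files differ in size, line endings, or BOM. If they do, passing the output of `normalize_csv_text` to the chunking step for both files should give identical input, and therefore the same chunk count, provided the chunker itself is deterministic.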