
Conversation

@FastLee (Contributor) commented Oct 9, 2024

closes #151

github-actions bot commented Oct 9, 2024

✅ 38/38 passed, 2 skipped, 1m26s total

Running from acceptance #219

@gueniai requested a review from JCZuurmond January 13, 2025 15:23

@JCZuurmond (Contributor) left a comment

Added some pointers; please add more tests.


__all__ = ["Installation", "MockInstallation", "IllegalState", "NotInstalled", "SerdeError"]

FILE_SIZE_LIMIT: int = 1024 * 1024 * 10
Contributor:

Type hint looks redundant.

Suggested change:
- FILE_SIZE_LIMIT: int = 1024 * 1024 * 10
+ FILE_SIZE_LIMIT = 1024 * 1024 * 10

Contributor:

Irrespective of whether it's redundant or not, it's incorrect: it should be ClassVar[int].
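
A minimal sketch of the annotation the reply suggests, assuming FILE_SIZE_LIMIT is (or becomes) a class attribute on Installation; if it stays a module-level constant, the plain unannotated form above is fine:

from typing import ClassVar

class Installation:
    # ClassVar tells type checkers this is shared class state,
    # not a per-instance attribute.
    FILE_SIZE_LIMIT: ClassVar[int] = 1024 * 1024 * 10  # 10 MiB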

        return Threads.strict(f"finding {product} installations", tasks)

    @staticmethod
    def extension(filename):
Contributor:

Missing type hints, and this looks like a redundant method; typically we use Path.suffix (see the sketch below).
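
A minimal sketch of the Path.suffix alternative the comment points to (the file names are illustrative):

from pathlib import Path

# Path.suffix already yields the extension, dot included, so a
# hand-rolled extension() helper is unnecessary.
assert Path("backups/state.csv").suffix == ".csv"
assert Path("backups/state.1.csv").suffix == ".csv"  # only the last suffix
assert Path("backups/state.1.csv").suffixes == [".1", ".csv"]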

writer = csv.DictWriter(buffer, field_names, dialect="excel")
writer.writeheader()
for as_dict in raw:
    # Check if the buffer + the current row is over the file size limit
Contributor:

@asnare: could you have a look at this? From experience I know you are familiar with looping over these situations in a smart way. Is there a way to chunk raw so that we have a minimum number of chunks, while each chunk stays smaller than FILE_SIZE_LIMIT (after writing it to the io.StringIO buffer)?

Contributor:

This is fiddly, partly because we're treating the limit as a hard limit and there's no way to know in advance whether a row will exceed the limit.

A few things spring to mind (a sketch of the byte-counting approach follows this list):

  • If we treat the limit as a target and don't try to back out the last line then things become a lot simpler.
  • Even though we try to treat the maximum size as a limit, this code can still make files larger because we're counting characters but some will be more than a single byte. (Need to use BytesIO to fix this.)
  • An alternative approach, but not necessarily better, would be to first encode the header and rows as lines (in bytes), and then a second pass to chunk things up. I don't think this would be less code though.
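
A minimal sketch combining the first two points: sizes are counted in bytes rather than characters, and a chunk is closed just before a row would push it past the limit, so only a single row larger than the limit can overshoot. _chunk_csv and its integration with _dump_csv are assumptions, not the actual implementation:

import csv
import io

FILE_SIZE_LIMIT = 1024 * 1024 * 10  # 10 MiB, as in the module above

def _chunk_csv(raw: list[dict], field_names: list[str]) -> list[bytes]:
    """Split rows into encoded CSV chunks, each carrying its own header."""

    def encode(rows: list[dict]) -> bytes:
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, field_names, dialect="excel")
        writer.writeheader()
        writer.writerows(rows)
        # Count bytes, not characters: multi-byte characters would
        # otherwise let a "full" chunk exceed the limit on disk.
        return buffer.getvalue().encode("utf-8")

    header_size = len(encode([]))
    chunks: list[bytes] = []
    pending: list[dict] = []
    size = header_size
    for row in raw:
        row_size = len(encode([row])) - header_size
        if pending and size + row_size > FILE_SIZE_LIMIT:
            # Close the current chunk before this row crosses the limit;
            # a single oversized row still becomes a chunk of its own.
            chunks.append(encode(pending))
            pending, size = [], header_size
        pending.append(row)
        size += row_size
    if pending:
        chunks.append(encode(pending))
    return chunks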

def _dump_csv(raw: list[Json], type_ref: type) -> list[bytes]:
    """Serialize a list of dictionaries into one or more CSV payloads, returned as encoded bytes.
    This private method is called by the `save` method."""
    raws = []
Contributor:

Move this closer to where it is used.

    with self._ws.workspace.download(f"{self.install_folder()}/{filename}") as f:
        return self._convert_content(filename, f)
except NotFound:
    # If the file is not found, check if it is a multi-part csv file
Contributor:

Move this into a separate method (see the sketch below).
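
A minimal sketch of the extraction, assuming the `<stem>.<i>.csv` naming scheme used on the save side below; _load_multipart_csv is a hypothetical name, and the header stitching is an assumption about how the parts recombine:

import itertools

def _load_multipart_csv(self, filename: str) -> bytes:
    """Hypothetical helper: reassemble `<stem>.<i>.csv` parts into one blob."""
    stem = filename.removesuffix(".csv")
    parts: list[bytes] = []
    for i in itertools.count(start=1):
        try:
            with self._ws.workspace.download(f"{self.install_folder()}/{stem}.{i}.csv") as f:
                data = f.read()
        except NotFound:
            break
        if parts:
            # every part carries its own header row; keep only the first
            data = data.split(b"\n", 1)[1]
        parts.append(data)
    if not parts:
        raise NotFound(filename)
    return b"".join(parts)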

raws = converters[extension](as_dict, type_ref)
if len(raws) > 1:
    for i, raw in enumerate(raws):
        self.upload(f"{filename[0:-4]}.{i + 1}.csv", raw)
Contributor:

Why filename[0:-4]?
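
Presumably the slice strips a trailing ".csv"; a sketch of a more self-documenting spelling (assuming the filename always ends in .csv at this point):

# filename[0:-4] silently drops the last four characters; naming the
# intent makes the ".csv" assumption explicit.
stem = filename.removesuffix(".csv")
self.upload(f"{stem}.{i + 1}.csv", raw)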

self.upload(filename, raw)
raws = converters[extension](as_dict, type_ref)
if len(raws) > 1:
    for i, raw in enumerate(raws):
Contributor:

In the Spark world it is common to use a folder instead (a sketch follows the list):

  1. Name the folder like the file
  2. Combine all files inside the folder to get the final file
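
A minimal sketch of that layout; _save_multipart, _list_parts, and _download_part are hypothetical helpers, while upload mirrors the method used in the surrounding code:

def _save_multipart(self, filename: str, chunks: list[bytes]) -> None:
    # Spark-style: the "file" is a folder containing numbered parts.
    for i, chunk in enumerate(chunks):
        self.upload(f"{filename}/part-{i:05d}.csv", chunk)

def _load_multipart(self, filename: str) -> bytes:
    # Reading concatenates all parts, in order, into the final file.
    parts = sorted(self._list_parts(filename))  # hypothetical listing helper
    return b"".join(self._download_part(part) for part in parts)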



Development

Successfully merging this pull request may close these issues.

Installation save fails if the file is larger than 10MB
