
Conversation

nivm-port
Contributor

@nivm-port nivm-port commented Sep 10, 2025

User description

Description

What - gitlab-v2 memory optimizations

Why - OOMs on large files (GitLab imposes no limit on file size)

How -

Type of change

Please leave one option from the following and delete the rest:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • New Integration (non-breaking change which adds a new integration)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Non-breaking change (fix of existing functionality that will not change current behavior)
  • Documentation (added/updated documentation)

All tests should be run against the Port production environment (using a testing org).

Core testing checklist

  • Integration able to create all default resources from scratch
  • Resync finishes successfully
  • Resync able to create entities
  • Resync able to update entities
  • Resync able to detect and delete entities
  • Scheduled resync able to abort existing resync and start a new one
  • Tested with at least 2 integrations from scratch
  • Tested with Kafka and Polling event listeners
  • Tested deletion of entities that don't pass the selector

Integration testing checklist

  • Integration able to create all default resources from scratch
  • Completed a full resync from a freshly installed integration and it completed successfully
  • Resync able to create entities
  • Resync able to update entities
  • Resync able to detect and delete entities
  • Resync finishes successfully
  • If new resource kind is added or updated in the integration, add example raw data, mapping and expected result to the examples folder in the integration directory.
  • If resource kind is updated, run the integration with the example data and check if the expected result is achieved
  • If new resource kind is added or updated, validate that live-events for that resource are working as expected
  • Docs PR link here

Preflight checklist

  • Handled rate limiting
  • Handled pagination
  • Implemented the code in async
  • Support Multi account

Screenshots

Include screenshots from your environment showing how the resources of the integration will look.

API Documentation

Provide links to the API documentation used for this integration.


PR Type

Bug fix


Description

  • Memory optimizations for GitLab file processing

  • Stream-based file content handling to prevent OOM

  • Chunked base64 decoding and YAML-to-JSON conversion

  • Temporary file storage for large content processing (sketched below)

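A minimal sketch of the chunked-decoding idea described above; the helper name is illustrative, and the one constraint worth noting is that the chunk size stays a multiple of 4 so each base64 slice decodes independently:

import base64
import tempfile

CHUNK = 4 * 1024 * 1024  # multiple of 4, so every slice is independently valid base64

def decode_base64_to_tempfile(content_b64: str) -> str:
    # Decode a large base64 string to a temp file without holding the decoded bytes in memory.
    with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".bin") as out:
        for i in range(0, len(content_b64), CHUNK):
            out.write(base64.b64decode(content_b64[i:i + CHUNK], validate=True))
        return out.name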

Diagram Walkthrough

flowchart LR
  A["Large File Request"] --> B["Stream Download"]
  B --> C["Chunked Base64 Decode"]
  C --> D["Temp File Storage"]
  D --> E["YAML-to-JSON Stream"]
  E --> F["Memory Efficient Processing"]

File Walkthrough

Relevant files
Enhancement
base_client.py
Add streaming file download with chunked decoding               

integrations/gitlab-v2/gitlab/clients/base_client.py

  • Add download_decoded_content method for streaming file downloads
  • Implement chunked base64 decoding to temporary files
  • Extract error handling into separate method
  • Add memory-efficient file processing with ijson
+51/-7   
gitlab_client.py
Enhance file enrichment with type detection                           

integrations/gitlab-v2/gitlab/clients/gitlab_client.py

  • Add file type detection (path vs content)
  • Conditional reference resolution based on file content
  • Ensure content is properly structured as dict
  • Add __base_jq field for file processing
+7/-4     
rest_client.py
Implement chunked decoding for file content                           

integrations/gitlab-v2/gitlab/clients/rest_client.py

  • Replace direct base64 decoding with streaming approach
  • Implement chunked UTF-8 decoding using memoryview
  • Use temporary file storage for large content
  • Add memory-efficient content processing
+20/-5   
utils.py
Rewrite file parsing with streaming YAML-to-JSON                 

integrations/gitlab-v2/gitlab/helpers/utils.py

  • Complete rewrite of parse_file_content for memory efficiency
  • Add streaming YAML-to-JSON conversion with YamlToJsonStreamer
  • Implement temporary file-based content storage
  • Add comprehensive scalar resolution for YAML parsing
+406/-23
Documentation
CHANGELOG.md
Update changelog for memory optimizations                               

integrations/gitlab-v2/CHANGELOG.md

  • Add version 0.2.29 changelog entry
  • Document memory optimizations improvement
+6/-0     
Configuration changes
pyproject.toml
Version bump to 0.2.29                                                                     

integrations/gitlab-v2/pyproject.toml

  • Bump version from 0.2.28 to 0.2.29
+1/-1     


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 Security concerns

Temp file handling:
Decoded contents are written under a predictable directory (/tmp/ocean) with globally readable default umask and no cleanup/rotation. Consider using secure temp files (e.g., NamedTemporaryFile with restricted permissions) and ensure deletion after use to avoid leaking sensitive repository contents.
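A minimal sketch of the recommended direction, assuming a context manager around the temp file is acceptable; mkstemp already creates the file with owner-only (0600) permissions, and the finally block guarantees deletion. The helper name and usage are illustrative, not the integration's API:

import os
import tempfile
from contextlib import contextmanager
from typing import Iterator

@contextmanager
def secure_temp_file(suffix: str = ".json") -> Iterator[str]:
    # mkstemp creates the file with mode 0o600, readable only by the current user
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.unlink(path)

# Hypothetical usage around the decode step:
# with secure_temp_file() as out_path:
#     write_decoded_content(out_path)  # placeholder for the streaming decode
#     process(out_path)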

⚡ Recommended focus areas for review

Possible Issue

Using ijson.items on an async iterator via _AiterReader may not work as written; ijson's synchronous routines expect synchronous file-like objects, and async streams need the *_async variants (e.g., ijson.items_async). Verify that parsing the streamed JSON for the content key actually yields the base64 string and not an event stream, and that _AiterReader conforms to ijson's expectations.

async def download_decoded_content(self, path: str, params: Optional[dict[str, Any]] = None, content_key: str = "content"):
        url = f"{self.base_url}/{path}"
        logger.debug(f"Downloading decoded content from {url}")

        try:
            async with self._client.stream("GET", url, params=params, headers=self._headers) as r:
                r.raise_for_status()
                reader = _AiterReader(r.iter_bytes())
                # ijson can parse from an async byte iterator via ijson.asyncio
                parser = ijson.items(reader, content_key)
                os.makedirs("/tmp/ocean", exist_ok=True)
                out_path = f"/tmp/ocean/bulk_{uuid.uuid4()}.json"
                async with aiofiles.open(out_path, "wb") as f:
                    for content_b64 in parser:
                        # For very long base64, decode in chunks:
                        for i in range(0, len(content_b64), 4*1024*1024):
                            chunk = content_b64[i:i+4*1024*1024]
                            await f.write(base64.b64decode(chunk, validate=True))
                return out_path
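For reference, a minimal sketch of what the async wiring could look like, assuming ijson >= 3.0 (which provides items_async for file-like objects whose read method is a coroutine); the adapter below is an assumption, not the PR's _AiterReader:

import httpx
import ijson

class AsyncByteReader:
    # Adapts an httpx streaming response to the async read() interface ijson's *_async routines expect.
    def __init__(self, response: httpx.Response, chunk_size: int = 64 * 1024):
        self._chunks = response.aiter_bytes(chunk_size)

    async def read(self, size: int = -1) -> bytes:
        # ijson only needs some bytes per call, or b"" at EOF, so one chunk per call is enough.
        try:
            return await self._chunks.__anext__()
        except StopAsyncIteration:
            return b""

async def iter_content_field(client: httpx.AsyncClient, url: str, params: dict) -> None:
    async with client.stream("GET", url, params=params) as response:
        response.raise_for_status()
        async for content_b64 in ijson.items_async(AsyncByteReader(response), "content"):
            ...  # chunked base64 decode of the value goes here

Note that even with the async API, ijson yields the whole content value as a single string, which is what the high-level suggestion further down addresses by proposing GitLab's raw file endpoint.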
Inconsistent Return Type

get_file_data now sometimes returns a dict with 'content' and sometimes a string path, breaking the annotated return type and likely callers’ expectations. Standardize the return format and ensure callers handle temp file lifecycle.

async def get_file_data(
    self, project_id: str, file_path: str, ref: str
) -> dict[str, Any]:
    encoded_project_id = quote(project_id, safe="")
    encoded_file_path = quote(file_path, safe="")
    path = f"projects/{encoded_project_id}/repository/files/{encoded_file_path}"
    params = {"ref": ref}

    response_path = await self.download_decoded_content(path, params=params)
    if isinstance(response_path, str):
        with open(response_path, "r") as f:
            response = {"content": f.read()}
    else:
        response = response_path
    return response
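One way to make the shape consistent, sketched under the assumption that callers are updated to read from disk and own the temp file's lifecycle; the content_path key and the caller snippet are illustrative, not the PR's final design:

import os
from urllib.parse import quote
from typing import Any

async def get_file_data(self, project_id: str, file_path: str, ref: str) -> dict[str, Any]:
    encoded_project_id = quote(project_id, safe="")
    encoded_file_path = quote(file_path, safe="")
    path = f"projects/{encoded_project_id}/repository/files/{encoded_file_path}"

    # Always return the same shape: a dict pointing at the decoded content on disk.
    content_path = await self.download_decoded_content(path, params={"ref": ref})
    return {"content_path": content_path, "file_path": file_path, "ref": ref}

# Caller side (illustrative): read the temp file and remove it when done.
# data = await client.get_file_data(project_id, file_path, ref)
# try:
#     with open(data["content_path"]) as f:
#         process(f)
# finally:
#     os.unlink(data["content_path"])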
JSON/YAML Semantics

The custom YAML-to-JSON streamer modifies scalar typing heuristics and may change behavior vs SafeLoader (e.g., keys coercion, numeric/boolean parsing, multi-doc handling). This could produce different results than before and affect reference resolution or mappings; validate against existing fixtures.

# -------- Scalar resolution (lightweight YAML 1.2-ish) --------

_NULLS = {"null", "~", ""}
_TRUE  = {"true"}
_FALSE = {"false"}

_int_re = re.compile(r"""^[+-]?(
    0
  | [1-9][0-9_]*
)$""", re.X)

_float_re = re.compile(r"""^[+-]?(
    ([0-9][0-9_]*)?\.[0-9_]+([eE][+-]?[0-9_]+)?
  | [0-9][0-9_]*([eE][+-]?[0-9_]+)
)$""", re.X)

def _clean_underscores(s: str) -> str:
    # YAML allows numeric separators "_"; JSON doesn't.
    return s.replace("_", "")

def scalar_to_json_text(val: str | None, tag: str | None, implicit: tuple[bool,bool] | None) -> str:
    """
    Convert a YAML scalar string to JSON text without building native containers.
    We ignore YAML 1.1 oddities and follow a YAML 1.2-ish core schema.
    """
    if val is None:
        return "null"

    raw = val.strip()

    # If tag explicitly says string, skip heuristics
    if tag and (tag.endswith(":str") or tag.endswith(":binary") or tag.endswith(":timestamp")):
        return json.dumps(val)

    # If the value was quoted in the original YAML (implicit[1] is True), treat as literal string
    if implicit is not None and implicit[1]:
        return json.dumps(val)

    # Nulls - only if not quoted (implicit)
    if raw.lower() in _NULLS:
        return "null"

    # Booleans - only if not quoted (implicit)
    low = raw.lower()
    if low in _TRUE:
        return "true"
    if low in _FALSE:
        return "false"

    # Integers
    if _int_re.match(raw):
        try:
            return str(int(_clean_underscores(raw)))
        except ValueError:
            pass  # fall through to string

    # Floats (NaN/Inf represented as strings since JSON lacks them)
    # Only treat as special values if not quoted (implicit)
    if raw.lower() in {"nan", ".nan"}:
        return json.dumps("NaN")
    if raw.lower() in {"inf", "+inf", ".inf", "+.inf"}:
        return json.dumps("Infinity")
    if raw.lower() in {"-inf", "-.inf"}:
        return json.dumps("-Infinity")

    if _float_re.match(raw):
        try:
            # Use repr to avoid locale surprises; still finite numbers only
            float_val = float(_clean_underscores(raw))
            # Check if the float is finite and reasonable (not inf/nan)
            if float_val == float_val and abs(float_val) < 1e308:
                return repr(float_val)
        except ValueError:
            pass

    # Fallback: JSON string
    return json.dumps(val)

# -------- Streaming JSON emitter from YAML events --------

class _Frame:
    __slots__ = ("kind", "index", "expect")  # kind: 'seq'|'map'|'docarray'
    def __init__(self, kind: str, expect: str = "key"):
        self.kind = kind
        self.index = 0
        self.expect = expect  # only for maps: 'key' or 'value'

class YamlToJsonStreamer:
    """
    Stream YAML -> JSON using PyYAML events.
    - No compose/load (no big trees in memory)
    - Emits chunks via a writer callback
    """
    def __init__(self, writer):
        self.writer = writer
        self.stack: list[_Frame] = []
        self._doc_count = 0
        self._mode = "single"  # or 'array' or 'newline'

    def set_multiple_mode(self, mode: str):
        self._mode = mode  # 'array' | 'newline' | 'single'

    # ---- helpers ----
    def _comma_if_needed(self):
        """Add comma before the next element if we're not at the first element"""
        if not self.stack:
            return
        top = self.stack[-1]
        if top.kind in ("seq", "map", "docarray") and top.index > 0:
            self.writer(",")

    def _open_seq(self):
        # Only add comma if we're not inside a map expecting a value
        # and we're not at the start of a document in array mode
        if not (self.stack and self.stack[-1].kind == "map" and self.stack[-1].expect == "value"):
            # Don't add comma if we're at the start of a document in array mode
            # (the document start event already handled the comma)
            if not (self._mode == "array" and self.stack and self.stack[-1].kind == "docarray" and self._doc_count > 1):
                self._comma_if_needed()
        self.writer("[")
        self.stack.append(_Frame("seq"))

    def _close_seq(self):
        self.writer("]")
        frame = self.stack.pop()
        if self.stack:
            top = self.stack[-1]
            if top.kind == "map":
                # We just closed a sequence that was the value of a key in a map
                # Set the expectation back to "key" for the next key-value pair
                top.expect = "key"
            top.index += 1
        return frame

    def _open_map(self):
        # Only add comma if we're not inside a map expecting a value
        # and we're not at the start of a document in array mode
        if not (self.stack and self.stack[-1].kind == "map" and self.stack[-1].expect == "value"):
            # Don't add comma if we're at the start of a document in array mode
            # (the document start event already handled the comma)
            if not (self._mode == "array" and self.stack and self.stack[-1].kind == "docarray" and self._doc_count > 1):
                self._comma_if_needed()
        self.writer("{")
        self.stack.append(_Frame("map", expect="key"))

    def _close_map(self):
        self.writer("}")
        frame = self.stack.pop()
        if self.stack:
            top = self.stack[-1]
            if top.kind == "map":
                # We just closed a map that was the value of a key in another map
                # Set the expectation back to "key" for the next key-value pair
                top.expect = "key"
            top.index += 1
        return frame

    def _emit_scalar(self, val: str, tag: str | None, implicit):
        # If inside a mapping and expecting a key, we must ensure JSON string key
        if self.stack and self.stack[-1].kind == "map" and self.stack[-1].expect == "key":
            self._comma_if_needed()
            # Convert to JSON text; must end up as a *string* key in JSON
            key_json = scalar_to_json_text(val, tag, implicit)
            if not (len(key_json) >= 2 and key_json[0] == '"' and key_json[-1] == '"'):
                # stringify any non-string key json
                key_json = json.dumps(json.loads(key_json))
            self.writer(key_json + ":")
            self.stack[-1].expect = "value"
            # Don't bump index yet; we'll do it after the value is emitted
            return

        # Normal value position - add comma before sequence elements
        if self.stack and self.stack[-1].kind == "seq":
            self._comma_if_needed()
        self.writer(scalar_to_json_text(val, tag, implicit))
        if self.stack:
            top = self.stack[-1]
            if top.kind == "map":
                # just wrote a value
                top.expect = "key"
                top.index += 1
            elif top.kind == "seq":
                top.index += 1

    # ---- main driver ----
    def feed(self, events):
        # Handle multiple docs wrapper
        docarray_opened = False
        for ev in events:
            if isinstance(ev, DocumentStartEvent):
                if self._mode == "array":
                    if not docarray_opened:
                        self.stack.append(_Frame("docarray"))
                        self.writer("[")
                        docarray_opened = True
                    else:
                        # Only add comma if this is not the first document
                        # and we're not already inside a sequence or mapping
                        if not self.stack or self.stack[-1].kind == "docarray":
                            self._comma_if_needed()
                self._doc_count += 1

            elif isinstance(ev, DocumentEndEvent):
                if self.stack and self.stack[-1].kind == "docarray":
                    self.stack[-1].index += 1

            elif isinstance(ev, SequenceStartEvent):
                self._open_seq()

            elif isinstance(ev, SequenceEndEvent):
                self._close_seq()

            elif isinstance(ev, MappingStartEvent):
                self._open_map()

            elif isinstance(ev, MappingEndEvent):
                self._close_map()

            elif isinstance(ev, ScalarEvent):
                self._emit_scalar(ev.value, ev.tag, ev.implicit)

            elif isinstance(ev, AliasEvent):
                # YAML alias/anchor -> stringify the alias name (low-memory + JSON-safe)
                # Alternative policies are possible (e.g., error out).
                self._emit_scalar("*" + (ev.anchor or ""), "tag:yaml.org,2002:str", (True, True))

            elif isinstance(ev, StreamStartEvent):
                pass
            elif isinstance(ev, StreamEndEvent):
                pass
            else:
                # Unknown/rare event types -> ignore or raise
                pass

        # Close doc array if used
        if docarray_opened:
            self.writer("]")
            self.stack.pop()  # docarray

def yaml_to_json_chunks(yaml_text: str, multiple: str = "array", file_stream=None):
    """
    Convert YAML to JSON and write directly to a file stream if provided,
    otherwise yield JSON text chunks while parsing YAML incrementally.

    Args:
        yaml_text: The YAML text to convert
        multiple: 'array' -> single JSON array of docs
                  'newline' -> one JSON per line
                  'single' -> exactly one doc expected
        file_stream: Optional file-like object to write JSON directly to
    """
    if file_stream is not None:
        # Write directly to file stream - separate function to avoid generator issues
        return _yaml_to_json_stream(yaml_text, multiple, file_stream)
    else:
        # Original behavior: yield chunks
        return _yaml_to_json_generator(yaml_text, multiple)

def _yaml_to_json_stream(yaml_text: str, multiple: str, file_stream):
    """Helper function to write YAML to JSON directly to a file stream."""
    bulk = []
    # Write directly to file stream
    def _write_to_stream(chunk: str):
        bulk.append(chunk)
        if len(bulk) == 10000:
            file_stream.write("".join(bulk))
            bulk.clear()

    parser = yaml.parse(io.StringIO(yaml_text))
    streamer = YamlToJsonStreamer(_write_to_stream)
    streamer.set_multiple_mode("array" if multiple == "array" else "single")

    streamer.feed(parser)

    # Flush any remaining chunks in the bulk buffer
    if bulk:
        file_stream.write("".join(bulk))
        bulk.clear()

    if multiple == "newline":
        # For newline mode, we need to handle per-document writing
        # Re-parse and write each document separately
        docs = list(iter_yaml_docs_as_single_json(yaml_text))
        for i, jtxt in enumerate(docs):
            if i:
                file_stream.write("\n")
            file_stream.write(jtxt)

def _yaml_to_json_generator(yaml_text: str, multiple: str):
    """Helper function to generate YAML to JSON chunks."""

    buf = []
    def _push(chunk: str):
        buf.append(chunk)

    parser = yaml.parse(io.StringIO(yaml_text))
    streamer = YamlToJsonStreamer(_push)
    streamer.set_multiple_mode("array" if multiple == "array" else "single")

    streamer.feed(parser)

    if multiple == "newline":
        # Re-parse once but flush per-document; to keep memory tiny, you could
        # instead refactor YamlToJsonStreamer to flush a newline on DocumentEnd.
        # Here we provide a tiny two-pass workaround for simplicity.
        buf.clear()
        # A more memory-tight single-pass variant is included below in the
        # "newline mode" example usage.
        docs = list(iter_yaml_docs_as_single_json(yaml_text))
        for i, jtxt in enumerate(docs):
            if i:
                yield "\n"
            yield jtxt
        return

    # Array/single: yield what we streamed
    out = "".join(buf)
    yield out

# Tiny helper to produce one-JSON-per-doc without storing all docs
def iter_yaml_docs_as_single_json(yaml_text: str):
    """
    Iterate JSON strings for each YAML document with low memory.
    """
    # We’ll restart a small streamer at each document boundary.
    stream = yaml.parse(io.StringIO(yaml_text))
    buf = []
    def flush_doc():
        if buf:
            s = "".join(buf)
            buf.clear()
            return s
        return None

    doc_started = False

    def writer(s): buf.append(s)

    local = YamlToJsonStreamer(writer)
    local.set_multiple_mode("single")

    # We proxy events to the local streamer and detect doc boundaries
    from collections import deque
    pending = deque()

    for ev in stream:
        if isinstance(ev, DocumentStartEvent):
            doc_started = True
            pending.append(ev)
        elif isinstance(ev, DocumentEndEvent):
            pending.append(ev)
            # feed the buffered doc events
            local.feed(list(pending))
            pending.clear()
            s = flush_doc()
            if s is not None:
                yield s
            doc_started = False
        else:
            pending.append(ev)

    # Handle no-document YAML (empty input)
    if not doc_started and not buf and not pending:
        yield "null"

Contributor

qodo-merge-pro bot commented Sep 10, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
High-level
Stream raw file endpoint

The current implementation using ijson still loads the entire base64-encoded
file content into memory, which negates the memory optimization goal. It also
discards essential file metadata. The suggestion is to switch to GitLab's raw
file endpoint to stream file bytes directly, avoiding base64 and JSON parsing
overhead for the content. This involves fetching metadata and content in
separate, more efficient calls and ensuring temporary files are always cleaned
up.

Examples:

integrations/gitlab-v2/gitlab/clients/base_client.py [65-83]
    async def download_decoded_content(self, path: str, params: Optional[dict[str, Any]] = None, content_key: str = "content"):
            url = f"{self.base_url}/{path}"
            logger.debug(f"Downloading decoded content from {url}")

            try:
                async with self._client.stream("GET", url, params=params, headers=self._headers) as r:
                    r.raise_for_status()
                    reader = _AiterReader(r.iter_bytes())
                    # ijson can parse from an async byte iterator via ijson.asyncio
                    parser = ijson.items(reader, content_key)

 ... (clipped 9 lines)
integrations/gitlab-v2/gitlab/clients/rest_client.py [61-75]
    async def get_file_data(
        self, project_id: str, file_path: str, ref: str
    ) -> dict[str, Any]:
        encoded_project_id = quote(project_id, safe="")
        encoded_file_path = quote(file_path, safe="")
        path = f"projects/{encoded_project_id}/repository/files/{encoded_file_path}"
        params = {"ref": ref}

        response_path = await self.download_decoded_content(path, params=params)
        if isinstance(response_path, str):

 ... (clipped 5 lines)

Solution Walkthrough:

Before:

# base_client.py
async def download_decoded_content(self, path, content_key="content"):
    async with self._client.stream("GET", ...) as r:
        # ijson still loads the entire 'content' value (a large base64 string) into memory
        parser = ijson.items(r.iter_bytes(), content_key)
        out_path = f"/tmp/ocean/bulk_{uuid.uuid4()}.json"
        async with aiofiles.open(out_path, "wb") as f:
            for content_b64 in parser: # content_b64 is the full base64 string
                # Chunking happens *after* the full string is in memory
                for i in range(0, len(content_b64), CHUNK_SIZE):
                    chunk = content_b64[i:i+CHUNK_SIZE]
                    await f.write(base64.b64decode(chunk))
        return out_path # Only the path is returned, metadata is lost

# rest_client.py
async def get_file_data(self, project_id, file_path, ref):
    response_path = await self.download_decoded_content(...)
    # The original file metadata from GitLab is discarded.
    return {"content": open(response_path).read()}

After:

# base_client.py
async def download_raw_content(self, path, params):
    out_path = f"/tmp/ocean/raw_{uuid.uuid4()}"
    try:
        async with self._client.stream("GET", path, params=params) as r:
            r.raise_for_status()
            # Stream raw bytes directly to a file, no base64 decoding needed
            async with aiofiles.open(out_path, "wb") as f:
                async for chunk in r.iter_bytes():
                    await f.write(chunk)
        return out_path
    except Exception:
        # Ensure temp file is cleaned up on error
        if os.path.exists(out_path):
            os.remove(out_path)
        raise

# rest_client.py
async def get_file_data(self, project_id, file_path, ref):
    # Get metadata first, without the content
    metadata = await self.send_api_request("GET", f"projects/.../files/{file_path}", params={"ref": ref})
    # Use the raw endpoint to stream content efficiently
    raw_content_path = await self.download_raw_content(f"projects/.../files/{file_path}/raw", params={"ref": ref})
    metadata["content_path"] = raw_content_path
    return metadata
Suggestion importance[1-10]: 10


Why: The suggestion correctly identifies a critical design flaw where the full base64 content is still loaded into memory, defeating the PR's purpose, and points out a significant regression where file metadata is discarded, likely causing breakages.

High
General
Add temporary file cleanup
Suggestion Impact: The commit wrapped the temp file writing in a try/except and deletes the file with os.unlink if an exception occurs, preventing disk leaks.

code diff:

+                    try:
+                        async with aiofiles.open(out_path, "wb") as f:
+                            for content_b64 in parser:
+                                # For very long base64, decode in chunks:
+                                for i in range(0, len(content_b64), 4*1024*1024):
+                                    chunk = content_b64[i:i+4*1024*1024]
+                                    await f.write(base64.b64decode(chunk, validate=True))
+                        return out_path
+                    except Exception:
+                        # Clean up temp file on error
+                        if os.path.exists(out_path):
+                            os.unlink(out_path)
+                        raise

Add proper cleanup of temporary files to prevent disk space leaks. The method
creates temporary files but doesn't provide a mechanism to clean them up after
use. Consider adding a context manager or cleanup callback.

integrations/gitlab-v2/gitlab/clients/base_client.py [65-83]

 async def download_decoded_content(self, path: str, params: Optional[dict[str, Any]] = None, content_key: str = "content"):
         url = f"{self.base_url}/{path}"
         logger.debug(f"Downloading decoded content from {url}")
 
         try:
             async with self._client.stream("GET", url, params=params, headers=self._headers) as r:
                 r.raise_for_status()
                 reader = _AiterReader(r.iter_bytes())
                 # ijson can parse from an async byte iterator via ijson.asyncio
                 parser = ijson.items(reader, content_key)
                 os.makedirs("/tmp/ocean", exist_ok=True)
                 out_path = f"/tmp/ocean/bulk_{uuid.uuid4()}.json"
-                async with aiofiles.open(out_path, "wb") as f:
-                    for content_b64 in parser:
-                        # For very long base64, decode in chunks:
-                        for i in range(0, len(content_b64), 4*1024*1024):
-                            chunk = content_b64[i:i+4*1024*1024]
-                            await f.write(base64.b64decode(chunk, validate=True))
-                return out_path
+                try:
+                    async with aiofiles.open(out_path, "wb") as f:
+                        for content_b64 in parser:
+                            # For very long base64, decode in chunks:
+                            for i in range(0, len(content_b64), 4*1024*1024):
+                                chunk = content_b64[i:i+4*1024*1024]
+                                await f.write(base64.b64decode(chunk, validate=True))
+                    return out_path
+                except Exception:
+                    # Clean up temp file on error
+                    if os.path.exists(out_path):
+                        os.unlink(out_path)
+                    raise

[Suggestion processed]

Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a significant resource leak where temporary files created by download_decoded_content are never deleted, which could lead to disk space exhaustion.

High
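The committed fix above only removes the temp file when decoding fails; a sketch of the context-manager variant the suggestion mentions, which also cleans up after successful use (names are assumptions, not the integration's API):

import os
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator

@asynccontextmanager
async def downloaded_content(client: Any, path: str, params: dict[str, Any]) -> AsyncIterator[str]:
    # Yield the temp file path produced by download_decoded_content and always delete it afterwards.
    out_path = await client.download_decoded_content(path, params=params)
    try:
        yield out_path
    finally:
        if os.path.exists(out_path):
            os.unlink(out_path)

# Hypothetical caller:
# async with downloaded_content(rest_client, file_api_path, {"ref": ref}) as tmp_path:
#     parsed = await parse_file_content(tmp_path)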

@github-actions github-actions bot added size/XXL and removed size/L labels Oct 9, 2025
@IdansPort
Contributor

I wish I could see a memory_profiler or tracemalloc summary of the difference between the current integration and the new implementation. Also, please add unit tests for the core logic.

@IdansPort
Contributor

writer = Mock()
streamer = YamlToJsonStreamer(writer)
import yaml
yaml_text = "key: value"
Contributor


Can you test for longer and more complicated cases?
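A sketch of the kind of longer case meant here, reusing the Mock-based setup quoted above; the YAML fixture is illustrative, not part of the PR's tests:

import json
import yaml
from unittest.mock import Mock

complex_yaml = """\
service: &svc
  name: api
  ports: [8080, 8443]
  env:
    DEBUG: "false"
---
jobs:
  - name: build
    uses: *svc
"""

writer = Mock()
streamer = YamlToJsonStreamer(writer)
streamer.set_multiple_mode("array")
streamer.feed(yaml.parse(complex_yaml))
result = "".join(call.args[0] for call in writer.call_args_list)
json.loads(result)  # at minimum, nested maps, lists, anchors and multi-doc input should yield valid JSON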

@IdansPort
Contributor

I did some tests myself; performance looks OK. Please fix the lint issues and the test issues (they seem to be failing for some reason).

@nivm-port
Contributor Author

Also, did you read about https://pypi.org/project/ijson/ and https://pypi.org/project/ruamel.yaml/?

Unfortunately, ruamel does not give the performance we get from the custom code (I've tried it and it is not very different from the yaml package we are using...).
As for ijson, we are already using it (and this is why I'm converting the YAML files to JSON 👍).

@nivm-port nivm-port merged commit 9178525 into main Oct 15, 2025
79 checks passed
@nivm-port nivm-port deleted the PORT-16058-bug-items-to-parse-might-cause-oom-gitlab-update branch October 15, 2025 16:54