[BUG] Performance issue when using pagination #1297

Open
tschale opened this issue Sep 12, 2024 · 4 comments

Comments


tschale commented Sep 12, 2024

Describe the bug
I have a GET endpoint for a model that returns 872 instances in total. However, when using pagination with limit=1000, the Kubernetes pod crashes with OOMKilled, i.e. it uses too much memory. When limit is omitted from the request, the endpoint returns all 872 instances without problems. The limit parameter works up to around 750; at limit=800 the issue occurs.

One thing that might be of interest: the model behind the GET endpoint uses 12 VectorFields from pgvector, with 768 and 1024 dimensions. The response contains them as lists, so each instance carries quite a lot of data compared to "vanilla" models with basic fields. Could this be the problem?

I looked at the paginate decorator, but that code is a bit too abstract to quickly figure out what exactly is going on... Has anybody experienced similar issues? Or does anybody know the pagination internals well enough to tell whether the model/pgvector fields might be causing the issue?
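
For reference, the usage pattern in question is django-ninja's standard @paginate flow, roughly like the sketch below (simplified, all names are illustrative; the actual model and schema are shown further down in the thread):

# illustrative sketch of a paginated list endpoint (names are hypothetical)
from typing import List

from ninja import Router
from ninja.pagination import LimitOffsetPagination, paginate

from myapp.models import Foo          # hypothetical module path
from myapp.schemas import FooObject   # hypothetical module path

router = Router()

@router.get("/foos", response=List[FooObject])
@paginate(LimitOffsetPagination)
def list_foos(request):
    # the paginator slices this queryset according to the limit/offset query params
    return Foo.objects.all()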

Versions (please complete the following information):

  • Python version: 3.11.9
  • Django version: 4.2.7
  • Django-Ninja version: 1.0.1
  • Pydantic version: 2.5.1

vitalik commented Sep 12, 2024

Hi @tschale

could you provide some examples - like your model (partially) and the operation code?


tschale commented Sep 18, 2024

Hi @vitalik, thanks for your response!

By operation code, do you mean how the django app is run in Kubernetes?


Here is the model:

from django.db import models
from django.utils.translation import gettext_lazy as _
from django_extensions.db.models import TimeStampedModel
from pgvector.django import VectorField

class Foo(TimeStampedModel):
    date = models.DateTimeField(null=True, blank=True)
    equivalent_foo = models.ForeignKey(
        "self",
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
    )
    url = models.URLField(verbose_name=_("URL"))
    flag = models.BooleanField(default=False)
    bar = models.ForeignKey(
        "Bar",
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
    )
    summary = models.TextField(verbose_name=_("Summary"))
    many_bar2 = models.ManyToManyField("Bar2")
    bar3 = models.ForeignKey(
        "Bar3",
        on_delete=models.SET_NULL,
        null=True,
        blank=True
    )
    json = models.JSONField(null=True, blank=True)
    vector1 = VectorField(
        dimensions=768, null=True, blank=True
    )
    vector2 = VectorField(
        dimensions=768, null=True, blank=True
    )
    vector3 = VectorField(
        dimensions=768, null=True, blank=True
    )
    vector4 = VectorField(
        dimensions=768, null=True, blank=True
    )
    vector5 = VectorField(
        dimensions=1024, null=True, blank=True
    )
    vector6 = VectorField(
        dimensions=1024, null=True, blank=True
    )
    vector7 = VectorField(
        dimensions=1024, null=True, blank=True
    )
    vector8 = VectorField(
        dimensions=1024, null=True, blank=True
    )
    vector9 = VectorField(
        dimensions=1024, null=True, blank=True
    )
    vector10 = VectorField(
        dimensions=1024, null=True, blank=True
    )
    vector11 = VectorField(
        dimensions=1024, null=True, blank=True
    )
    vector12 = VectorField(
        dimensions=1024, null=True, blank=True
    )

    text1 = models.TextField(blank=True)
    text2 = models.TextField(blank=True)
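
Just to put the vector payload into perspective, a rough back-of-envelope estimate (four 768-dim and eight 1024-dim vectors per row, as above; CPython object sizes are approximate, and JSON serialization overhead comes on top):

# rough back-of-envelope for the vector payload, once loaded as Python lists of floats
floats_per_row = 4 * 768 + 8 * 1024            # 11,264 floats per instance
# ~24 bytes per CPython float object + ~8 bytes per list slot
bytes_per_row = floats_per_row * 32
print(f"~{bytes_per_row / 2**20:.2f} MB per instance")        # ~0.34 MB
print(f"~{800 * bytes_per_row / 2**20:.0f} MB for 800 rows")  # ~275 MB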

Representation of the vectors in the schema:

from typing import Annotated, Any, Dict, Optional, Tuple, TypeVar

from annotated_types import Len
from ninja import Field, Query, Schema

Vector = TypeVar("Vector", bound=Tuple[float, ...])
Vector768 = Annotated[Vector, Len(max_length=768, min_length=768)]
Vector1024 = Annotated[Vector, Len(max_length=1024, min_length=1024)]


class FooObject(Schema):
    # [...]
    json: Optional[Dict[Any, Any]] = Field(default=None)
    vector1: Optional[Vector768] = Field(default=None)
    # [...]
    vector5: Optional[Vector1024] = Field(default=None)
    # [...]

We use a custom filter schema to support filtering in conjunctive normal form. E.g. a GET with params &url=www.foo.com&url=www.bar.com resolves to an AND (i.e. all Foo instances whose url equals both www.foo.com and www.bar.com), while a GET with params &url=www.foo.com,www.bar.com resolves to an OR (i.e. all Foo instances whose url equals www.foo.com or www.bar.com).

The filter schema supports all fields except the json and the vectorX fields. It uses a custom_expression and has the following customizations:

from typing import List, TypeVar

from django.db.models import Q, QuerySet
from ninja import FilterSchema

T = TypeVar("T", bound=QuerySet)


class ConjunctiveNormalFormFilterSchema(FilterSchema):
    def custom_expression(self) -> List[Q]:  # type: ignore[override]
        """
        Customization: returns List[Q] instead of Q.
        """
        [...]

    def get_filter_expression(self) -> List[Q]:  # type: ignore[override]
        """Customization: returns List[Q] instead of Q and only uses
        custom_expression and not _connect_fields().
        """
        return self.custom_expression()

    def filter(self, queryset: T) -> T:
        """Customization: chains filter expressions instead of AND-chained
        Q-objects to support conjunctive normal form.
        """
        q_list = self.get_filter_expression()
        for q in q_list:
            queryset = queryset.filter(q)
        return queryset.distinct()
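
To make the conjunctive-normal-form semantics concrete, the chaining roughly translates like this (hypothetical values, not the actual output of custom_expression; Foo and Q are as in the snippets above):

# ?url=www.foo.com&url=www.bar.com  ->  [Q(url="www.foo.com"), Q(url="www.bar.com")]
# each Q is applied in its own .filter() call, so the clauses are ANDed
qs = Foo.objects.all()
for q in [Q(url="www.foo.com"), Q(url="www.bar.com")]:
    qs = qs.filter(q)

# ?url=www.foo.com,www.bar.com  ->  [Q(url="www.foo.com") | Q(url="www.bar.com")]
# a single Q with an OR inside, applied in one .filter() call
qs_or = Foo.objects.filter(Q(url="www.foo.com") | Q(url="www.bar.com"))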


vitalik commented Sep 18, 2024

Hi @tschale

still hard to tell.. I assume you are using postgres...

what you can try is to check:

  • whether your queryset.count() works quickly
  • whether your queryset.all()[:800] works quickly
  • if that also doesn't give any clues - maybe try printing the SQL queries with timing (set DEBUG to True and add this middleware):
# myapp/middleware.py

import time
import logging
from django.db import connection, reset_queries

logger = logging.getLogger(__name__)

class SQLDebugMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        reset_queries()
        start_time = time.time()

        response = self.get_response(request)

        total_time = time.time() - start_time
        queries = connection.queries
        query_time = sum(float(q['time']) for q in queries)

        print(f"Total Request Time: {total_time:.3f}s")
        print(f"Total DB Query Time: {query_time:.3f}s")
       print(f"Number of Queries: {len(queries)}")

        for query in queries:
            sql = query['sql']
            time_taken = query['time']
            print(f"({time_taken}s) {sql}")

        return response

or use django debug toolbar
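
For completeness, a minimal sketch of wiring the middleware in (assuming it lives at myapp/middleware.py, as in the comment at the top of the snippet):

# settings.py (sketch)
DEBUG = True  # connection.queries is only populated when DEBUG is True

MIDDLEWARE = [
    # ... existing middleware ...
    "myapp.middleware.SQLDebugMiddleware",
]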


tschale commented Sep 19, 2024

Hi @vitalik, thanks for your response!

We are using postgres, yes. I will try your suggestions. However, it will probably take some days, or even weeks. I will get in touch again once I've had the opportunity to try them.
