Python framework improvements #12

@ghamerly

Suggestions from @cur33. These can (and should) be broken into separate issues as appropriate and handled separately.

Proposed changes to core library:

    - Change Dataset::print to accept any std::ostream, defaulting to std::cout,
      in order to simplify adding a __str__ method to py_dataset.cpp


Design decisions:

    - The method of struct initialization recommended by the Python/C API docs is
      to use C99 designated initializers, which are very useful but not available
      in C++ before C++20. Instead, I followed this approach:

        - for structs with sufficiently few attributes, like PyModuleDef, I used
          a standard initialization approach by providing values for all members
          (except trailing ones)

        - for structs with many attributes, like PyTypeObject, I created
          init_*_members() procedures that would set only the needed members,
          which are called in the PyInit_fastkmeans() function

      However, having seen other extension modules such as those in numpy which
      simply provide 0 or NULL values for all unused struct members, I'm
      starting to think that might just be simpler and more maintainable than
      I'd originally thought.

    - The Kmeans::initialize method needs an unsigned short * as an argument, so rather
      than jumping through a lot of hoops to provide this simple functionality,
      I made a simple Assignment class that wraps an unsigned short *. An
      Assignment object is returned by the fastkmeans.init_centers* methods and
      then passed to the [SomeKmeansType].initialize method.

    - The Dataset type defined in py_dataset.cpp is technically a mapping type
      in order to allow the `a_dataset[i, j]` syntax. The Assignment type is a
      normal sequence type since it only needs a single value passed to the
      bracket operator.
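
The mapping-versus-sequence distinction can be seen from the Python side: `a_dataset[i, j]` passes the single tuple `(i, j)` as the key, which in the C API only the mapping slot (`mp_subscript`) receives as an arbitrary object, while the sequence slot (`sq_item`) receives a plain integer index. The sketch below is a pure-Python stand-in, not the extension code; the constructor arguments and the row-major storage are assumptions made for illustration.

```python
class Dataset:
    """Pure-Python stand-in for the extension's Dataset type (hypothetical layout)."""

    def __init__(self, n, d, values):
        if len(values) != n * d:
            raise ValueError("expected n * d values")
        self.n, self.d = n, d
        self._values = list(values)  # row-major storage, assumed for illustration

    def __getitem__(self, key):
        # a_dataset[i, j] arrives here as the single tuple key (i, j).
        i, j = key
        if not (0 <= i < self.n and 0 <= j < self.d):
            raise IndexError("dataset index out of range")
        return self._values[i * self.d + j]


class Assignment:
    """Stand-in for the Assignment wrapper: one integer index is enough."""

    def __init__(self, labels):
        self._labels = list(labels)

    def __len__(self):
        return len(self._labels)

    def __getitem__(self, index):
        return self._labels[index]


data = Dataset(2, 3, [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
assignment = Assignment([0, 1])
print(data[1, 2], assignment[1])
```

Because the key is an ordinary tuple, the mapping-style `__getitem__` is free to unpack it however it likes, which is what makes the two-index syntax possible.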


Possible future enhancements:

    - It generally seems frowned upon to directly manipulate custom object
      types, e.g. NaiveObject instances. Rather, to access specific struct
      fields, one should provide macros or (usually inline) functions like
      `NaiveObject_SOME_FIELD(PyObject *)` that will return the fields desired.
      This way, a method that accepts a custom object instance does not need to
      convert the PyObject * to a custom object *, and does not need to know
      implementation details of the custom object type.
      
      This should be easy to change because I only had to directly manipulate
      Dataset and Assignment object pointers in one or two methods.

    - Originally, and primarily because Python does not natively have the concept
      of "abstract base classes", I followed the method of simply copying the
      file for one Kmeans subclass and altering it for each other subclass.
      However, I have since realized that the `NotImplementedError` could be
      used to simulate base classes and allow for inheritance. This would make
      the code base much more maintainable and easier to read. However, it might
      be overkill since the project is not huge.
      

    - Since the original structure of the project merely contained all the
      source files in the `src` directory, I kept it that way, but now that the
      number of files has increased quite a bit, it might make more sense to
      move the Python module files into a separate directory. This shouldn't
      create a problem for the Python module to find the header files of the C++
      classes and functions, because the setup.py file can just be altered to
      set the `include_dirs` keyword argument to include something like '..'
      instead of '.'.
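
The `NotImplementedError` idea from the list above can be sketched in pure Python as follows. This is only an illustration of the pattern, not the extension's code: the class names, the `closest_center` hook, and the use of 1-D points are assumptions chosen to keep the example short. (Python's standard `abc` module offers a stricter alternative; whether either pattern maps cleanly onto the C extension types is left open here.)

```python
class Kmeans:
    """Hypothetical base class holding the logic shared by all variants."""

    def __init__(self, centers):
        self.centers = list(centers)

    def assign_all(self, points):
        # Shared code lives once in the base class and calls the hook below.
        return [self.closest_center(p) for p in points]

    def closest_center(self, point):
        # Subclasses are expected to override this method.
        raise NotImplementedError(
            f"{type(self).__name__} must implement closest_center()"
        )


class NaiveKmeans(Kmeans):
    """Hypothetical subclass: brute-force search over all centers (1-D points)."""

    def closest_center(self, point):
        distances = [abs(point - c) for c in self.centers]
        return distances.index(min(distances))


labels = NaiveKmeans([0.0, 10.0]).assign_all([1.0, 9.0, 4.0])
print(labels)
```

With this layout each algorithm file would only need to supply its own override rather than a full copy of the shared methods; calling `assign_all` on the bare `Kmeans` base raises `NotImplementedError`.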
