Framework for data analysis in Python #7685
Replies: 7 comments 2 replies
---
Hi @ynikitenko - thanks for following up and sharing the […]. My impression is that Lena operates at the much higher level of Python generators, while […]. That being said, I do see some analogies. An interesting experiment would be the ability to execute certain […]. Ultimately, […]
---
Hi @shwina - thanks for looking into the library and for your kind words! Pipelines can operate at all levels. Let me show you several examples.

1. Lena sequence. This is the first example from the tutorial you've improved: […]

2. CUDA beginner code. Unfortunately, I could not launch a notebook yet, so I apologize for possible errors. This code is straightforward and written in an imperative style, though rather low-level. […]

3. "CUDA framework." User code: […] Library code: […] Here the user's code, even though it could still be improved, is already more compact and structured.

4. GPU: 90% of analysis possible. As I've shown in example 3, it is straightforward to construct analysis chains based on the libraries you have already implemented. You have exposed good interfaces, which allow on-the-fly computation and abstract compositions of transformation iterators. The core feature of data analysis is that it mostly uses a standard set of functions: averages, standard deviations, quantiles. These are defined through simple mathematical operations and can be calculated on a GPU. Even when they are not enough, it is still necessary to calculate them in most cases. The fact that you've added histograms to the library makes it fit for 90% of analysis workflows.

5. 10% of cases: fault tolerance. As you correctly mentioned, not every calculation can be performed on a GPU. Some tasks are inherently unparallelizable; many tasks do not require high performance; and analysts often don't have GPUs (for example, I'm writing from a laptop with an integrated graphics card). Analysts often lack the required skills, which creates an entry barrier for them. Here a general framework would help greatly. The reasons could be: […]
The features 2-3 can be called fault tolerance: if our program stops working on the GPU, we lose performance but can still execute it. We can also introduce optimizations only when needed, and employ rapid development and prototyping. This largely improves the value of a general NVIDIA framework. Even though an exclusive "CUDA framework" would simplify some tasks for users (section 3), I believe that because of the limitations of GPU programming it would not be general and useful enough: you've made the right decision to support only a library.

6. Unified developer interface. You correctly noticed that Lena uses high-level iterators as well. Probably you meant another example of a "real analysis": […] Indeed, after reduction (producing a histogram from our data) we operate on higher-level objects. Though we don't need GPU performance for a handful of histograms or tables, we still use one sequence for the low-level and high-level parts of the analysis, since it is easier for the user to think in terms of the same sequence of operations (no context switching). A unified design improves the clarity and usability of our interface.

7*. Example from DALI. This is a shortened example from the NVIDIA Data Loading Library: […] It is clear that the images variable can be omitted in the pipeline (compare with example 1). Moreover, the interface (images, labels) is exactly a special case of what is called (data, context) in Lena.

8. What CUDA developers can provide. As I wrote, you have already implemented the core data analysis tools, which allows users to efficiently perform most standard analysis workflows. There are, of course, certain things to improve usability: for example, many tools are scattered across different CUDA libraries (like reading data and processing it).
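To make the (data, context) idea concrete, here is a minimal plain-Python sketch of such a pipeline. All names (`read_values`, `scale`, `run_pipeline`) are invented for illustration and belong to neither the Lena nor the DALI API; they only show how elements pass (data, context) pairs through lazily.

```python
def read_values(values):
    """Source element: emit (data, context) pairs,
    analogous to DALI's (images, labels)."""
    for i, value in enumerate(values):
        yield (value, {"index": i})

def scale(factor):
    """A pipeline element: transform the data part,
    pass the context through unchanged."""
    def element(flow):
        for data, context in flow:
            yield (data * factor, context)
    return element

def run_pipeline(source, *elements):
    """Compose elements left to right; nothing runs
    until the resulting generator is consumed."""
    flow = source
    for element in elements:
        flow = element(flow)
    return flow

results = list(run_pipeline(read_values([1, 2, 3]), scale(10)))
# Each result keeps its data together with its context.
```

The context dictionary travels alongside the data, so any element can annotate the flow without the other elements knowing about it.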
I'm curious about your perspective here. Why do you prefer users to rely on third-party libraries instead of building more directly on CUDA? Do you see the entry barrier for CUDA GPU programming as relatively high? Is it mainly due to the risk of subtle mistakes that can affect results, or has user interest in GPU programming only increased recently? What do you think are the main reasons? If this feels like a long discussion, I've sent you a connection request on LinkedIn and we could discuss it in a short call.
---
Thanks @ynikitenko. I hear you about functionality being scattered across various libraries.
As you noted earlier, we are only providing a library here (not a framework). Frameworks have all the advantages you mentioned in your initial comment in this discussion. Additionally, frameworks can unify functionality from the various libraries we provide, hiding all that complexity from their users.
Thanks! Are there specific challenges related to […]?
---
Thanks @shwina. You write that it would be easier for users to have a framework; however, NVIDIA supports only libraries. Is that the development team's decision or a managerial one? What do you think the reasons could be?

Building upon NVIDIA Python libraries. Do I understand correctly that NVIDIA is interested in scientific and data analysis applications? As a developer, what do you think: if more people use CUDA libraries, will that generally improve their quality?
---
Hi @ynikitenko, this is a really good discussion! NVIDIA is very interested in accelerating all scientific and data analysis applications, and we certainly benefit from more people using CUDA libraries, as that lets us increase the quality of the CUDA libraries and interfaces like […].

In general we provide as many tools as we can for developers and end users to be successful in as many areas of high performance computing as possible. So lower-level libs like […]. From the CCCL side, as @shwina said, our idea is to provide the building blocks for domain experts to build and accelerate their libraries. In general our approach is to accelerate the libraries and frameworks that our users need, and not to be overly prescriptive about which libraries they should use.
We are working towards a 1.0 release soon, and that would make the APIs stable. About the other question: we are interested in working with the community on building these tools. We can't promise resources, but our approach is to work with our users, hence the permissive open-source licenses in our libraries.
---
Hi @danielfrg - glad that you found it interesting! Let me explain what I mean by a framework. Imagine your company is a renowned producer of bricks, windows and (high-level) components like walls and kitchens. Every consumer goes to one factory to buy bricks, then to another one to buy windows, and somehow assembles their own house as best they can. The question is: why doesn't the company produce houses? Evidently, a ready product would be easier for most users – but we don't know that in advance. Even if the potential market is huge, we still risk resources (that is essentially a venture investment). Then I would ask the following questions: How much can you invest in an innovative product? How much can you invest to improve your infrastructure? How much can you invest to pay off your technical debt? You mentioned that your first version would be 1.0.0 – which suggests you are polishing your product until it is ideal. In open source we often don't make an ideal product first, but create a Minimum Viable Product (MVP) with version 0.1. That helps with scarce resources, and feedback from users then greatly helps with improving towards the major release. I was also thinking about why a framework for data analysis could fail: […]
In fact, Web frameworks existed before Django (and I have also met some for data analysis), but it is easy to make a mistake when creating them. Their failures, however, did not preclude the success of Django. For users, as you have noted, a framework is a limitation: it takes control from the user and makes decisions for them. However, Braess's paradox shows that adding limitations can sometimes improve a system. And those are unimportant decisions: the user cares about their data and algorithm, not about how exactly data files are read and processed, or how memory is allocated. A framework by NVIDIA could help with the following: […]
My personal reasons for a framework were: […]
Those were also the reasons why my framework did not gain popularity: they are not a priority of the market and do not solve its daily problems. Many scientists, even those specializing in programming, do not understand what a framework is or how it could be useful; they have never used Django. Most data-analysis tasks do not involve complex systems. Iterators and lazy evaluation are a new foundation for computation and reasoning compared to arrays. The framework, however, proved indispensable for my advanced analysis in experimental physics. I also believe that those features are strategically important in the age of poorly-written AI code, since structured code allows easier reasoning and better testing. And I believe those features could be most valuable for NVIDIA.

I have asked an AI about the major challenges of NVIDIA Python libraries, and these are its answers: 1) Fragmentation: there is no unified Python GPU layer. 2) Interoperability issues, resulting in data copies between frameworks and hidden performance penalties. 3) Python control overhead: kernel launch overhead, many small GPU calls, synchronization issues. 4) GPU memory management: better global memory orchestration would help. 5) Interface complexity: structured GPU programming is missing. 6) User-base imbalance: there is little tooling bridging the worlds of AI and HPC users. Developer team – please correct me if the AI is wrong. It mostly says what I wrote, but in different words.

Probably you imagine a framework as a large product with many prepared decisions; but you may also think of it as an interface. If you don't release a framework, you can miss an opportunity. If you wait with the interface, you accumulate technical debt, which grows invisibly and dangerously. An important concern you mentioned is the ecosystem. A pipeline interface is orthogonal to the existing software: data in a pipeline can still be NumPy arrays, so there is no competition.
As I have shown, the already implemented transform iterators from cuda.compute can be naturally combined into pipelines. This is so straightforward that probably no third-party library would do it: it belongs more naturally to NVIDIA. In 2021 you released a blog post: "To date, access to CUDA and NVIDIA GPUs through Python could only be accomplished by means of third-party software... Each wrote its own interoperability layer between the CUDA API and Python." An interface is what the company should define itself. In 2021, to write a kernel, one still had to do so in C++. Since that time your developers have greatly improved the interface for users. You create great resources like accelerated computing notebooks; however, when I tried to run them, it seemed that few people outside NVIDIA actually use them, which is a pity. I cannot guarantee how many new users a framework will bring, but I can guarantee that the existing interface can be further improved and simplified. In the end, the success of a framework depends on the concrete framework and on how well your existing code base supports it (being 90% ready is a good sign). I hope I could explain what a framework could be. In my opinion, you have all the bricks to build a good framework, which could help users, third-party developers and CUDA developers themselves. Data analysis is important and also foundational for AI. If this direction is interesting for you, I would be glad to discuss it further and see whether my experience could be useful.
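As a toy illustration of combining per-element transforms into a single-pass pipeline, here is a plain-Python sketch. The names (`compose`, `transform_reduce`) are invented for illustration; this is not the cuda.compute API, only the shape of the idea: several transforms fused into one traversal followed by a reduction.

```python
from functools import reduce

def compose(*functions):
    """Fuse several per-element transforms into a single callable,
    applied left to right."""
    return reduce(lambda f, g: lambda x: g(f(x)), functions)

def transform_reduce(data, transforms, reduction, initial):
    """One pass over the data: apply the fused transform to each
    element, then fold the results with the reduction."""
    fused = compose(*transforms)
    accumulator = initial
    for x in data:
        accumulator = reduction(accumulator, fused(x))
    return accumulator

# Sum of squares of doubled values, computed in a single traversal:
# 1,2,3 -> 2,4,6 -> 4,16,36 -> 56
result = transform_reduce(
    range(1, 4),
    transforms=[lambda x: 2 * x, lambda x: x * x],
    reduction=lambda a, b: a + b,
    initial=0,
)
```

Because the transforms are fused before the loop, the data is read once, which is the same property that makes transform iterators attractive on a GPU.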
---
Dear CCCL Team,

I have recently been thinking about the global optimizations that a framework could provide. Let us take a simple example: a sum of the numbers from 1 to N, multiplied by a constant. It is based on the benchmark from "Delivering the Missing Building Blocks for NVIDIA CUDA Kernel Fusion in Python", published last year. In the benchmark, as well as in many CUDA Python documentation examples, the transformation is typically a Python function: […]

I decided to try a symbolic wrapper (also called an expression template in C++). In the code above, […]

Performance

First, I tried more complicated examples using the library operations from […]. For many floating-point constants the benchmark is very close between a lambda and an expression tree. This is probably because of fused multiply-add, where an intermediate multiplication does not require additional processor cycles. However, I found a more complex example, where the constant is defined via an array and a lambda: […]

Here are the benchmark results and the benchmark code I used ("parallel" stands for the standard example with a lambda). The optimized algorithm turned out to be about 50% faster than the standard one. This is most likely due to the complicated runtime of Python: NVCC optimizes instructions quite well, but cannot optimize everything. One could argue that the example is somewhat exaggerated, but in our experiment we also use slightly different constants for each of the three detectors. An array of constants is realistic; in C++ we would simply define it as a constant, but standard Python does not support this directly. Here a framework, which would have a global view of the computations, could help.

Another framework advantage would be horizontal fusion, that is, applying several algorithms to the same data within a single kernel. In data analysis we often want to calculate both the mean and the standard deviation of the same variable. Do I understand correctly that cuda.compute does not support horizontal fusion yet?
Are there plans to implement it?

Precision

For the simple problem at hand we have a known answer. Moving a constant outside the sum improves accuracy. In our case of integer summation the results are exact; when we don't optimize the lambdas, the results are wrong in the 8th digit. Naturally, it could get worse with more constants and computations. It is clear that for GPU algorithms the order of operations matters; however, for data analysis there is usually only one answer, based on mathematics.

To sum up, a global view of the computations would provide more optimizations both for the speed and for the accuracy of the computations. In my framework I used global optimizations for high-level operations, such as reading data files only once, but it turns out they can also be useful for micro-calculations, which can still be too complicated for low-level optimizers. On the human level, thinking about optimizations of single versus combined operations is also different: for the former, a lambda would be ideal; the benefits of a symbolic wrapper appear only for the latter.

Thanks to the CCCL Team for providing the computational resources and for sharing the benchmark code.
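The symbolic-wrapper idea can be sketched in plain Python: instead of a lambda, the computation is recorded as a small expression tree, which a "global" optimizer can rewrite before evaluation, for example hoisting a constant factor out of a sum. All class and function names here are illustrative, not the benchmark code referenced above.

```python
class Expr:
    """Base class: overloading * builds a tree instead of computing."""
    def __mul__(self, other):
        return Mul(self, to_expr(other))
    __rmul__ = __mul__

class Var(Expr):
    def __call__(self, x):
        return x

class Const(Expr):
    def __init__(self, value):
        self.value = value
    def __call__(self, x):
        return self.value

class Mul(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __call__(self, x):
        return self.left(x) * self.right(x)

def to_expr(value):
    return value if isinstance(value, Expr) else Const(value)

def reduce_sum(expr, data):
    """Naive: evaluate the whole expression for every element."""
    return sum(expr(x) for x in data)

def reduce_sum_hoisted(expr, data):
    """Global rewrite: sum(c * x) becomes c * sum(x), so the
    multiplication happens once instead of N times (and, for floats,
    the constant no longer participates in every rounded addition)."""
    if isinstance(expr, Mul):
        for const, rest in ((expr.left, expr.right), (expr.right, expr.left)):
            if isinstance(const, Const):
                return const.value * reduce_sum(rest, data)
    return reduce_sum(expr, data)

x = Var()
data = range(1, 1001)
# Integer case: both evaluation orders are exact and agree.
assert reduce_sum(3 * x, data) == reduce_sum_hoisted(3 * x, data) == 3 * 500500
```

A lambda is opaque to such a rewriter; a tree is not, which is exactly the trade-off between single and combined operations discussed above.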
---
Dear CCCL Team,
It is great to see NVIDIA focusing more on Python. Following our exchange with @aterrel and @shwina, I was encouraged to open a discussion here. What I have seen in cuda.compute is very similar to what I have been developing in my architectural framework for data analysis. Your team has captured important features of modern data analysis: […] Building on this, it is also possible to structure an entire analysis framework. A functional design greatly facilitates modularity – enabling clear structure, loose coupling, and code reuse. All of that is glued together by Python, a very high-level and flexible language.
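As a toy illustration of such a functional, sequence-based design, here is a minimal plain-Python sketch. It is loosely inspired by this style of framework but is not any real API: `Sequence`, `to_gev` and `calibrate` are hypothetical names invented for this example.

```python
class Sequence:
    """Compose independent analysis steps; each step is a plain
    callable, so steps stay loosely coupled and reusable."""
    def __init__(self, *steps):
        self.steps = steps

    def run(self, flow):
        # Chain lazy maps; nothing is computed until consumption.
        for step in self.steps:
            flow = map(step, flow)
        return flow

def to_gev(mev):
    """Toy unit-conversion step."""
    return mev / 1000.0

def calibrate(energy):
    """Toy calibration step with an invented constant."""
    return 1.02 * energy

analysis = Sequence(to_gev, calibrate)
# The same sequence can be reused on any iterable data source.
energies = list(analysis.run([500.0, 1500.0]))
```

Because each step knows nothing about the others, steps can be tested in isolation and recombined into different analyses, which is the modularity point made above.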
Apart from the features mentioned above, the framework also supports: […]
Please refer to ReadTheDocs for a quick overview, or watch the video of my presentation at the conference of Python developers in High-Energy Physics. The framework is under development, but the core features are stable: they are based on solid design principles and my rigorous mathematical (MIPT, IUM) and software (Debian/Ubuntu) background, and have been validated in real analysis workflows. Feel free to ask any relevant questions (test coverage, code maturity, etc.) here.
Lena was developed while solving real-world data analysis problems. A framework implies an initial entry barrier; on the other hand, having started, a scientist could use the same framework for decades. A common framework would greatly improve code quality and sharing – consider the Django ecosystem. The computational community and German funding agencies are steadily increasing requirements for exchangeable algorithms.
Why it could be interesting for NVIDIA: […]
What are your thoughts on the strategic alignment of our goals? I would also be interested in discussing this with others working on long-term CUDA Python architecture.