<p><em>Kernel Float</em> is a header-only library for CUDA/HIP that simplifies working with vector types and reduced-precision floating-point arithmetic in GPU code.</p>
87
+
<p><em>Kernel Float</em> is a header-only library for CUDA/HIP that makes working with reduced-precision floating-point types and vector arithmetic simple and expressive, with zero performance overhead.</p>
88
88
<section id="summary">
89
89
<h2>Summary<a class="headerlink" href="#summary" title="Permalink to this heading"></a></h2>
type conversion is awkward (e.g., <code class="docutils literal notranslate"><span class="pre">__nv_cvt_halfraw2_to_fp8x2</span></code> converts float16 to float8),
95
95
and some functionality is missing (e.g., one cannot convert a <code class="docutils literal notranslate"><span class="pre">__half</span></code> to <code class="docutils literal notranslate"><span class="pre">__nv_bfloat16</span></code>).</p>
96
-
<p><em>Kernel Float</em> resolves this by offering a single data type <code class="docutils literal notranslate"><span class="pre">kernel_float::vec&lt;T,</span> <span class="pre">N&gt;</span></code> that stores <code class="docutils literal notranslate"><span class="pre">N</span></code> elements of type <code class="docutils literal notranslate"><span class="pre">T</span></code>.
97
-
Internally, the data is stored as a fixed-sized array of elements.
96
+
<p><em>Kernel Float</em> resolves this by offering a single unified vector type <code class="docutils literal notranslate"><span class="pre">kernel_float::vec&lt;T,</span> <span class="pre">N&gt;</span></code> that stores <code class="docutils literal notranslate"><span class="pre">N</span></code> elements of type <code class="docutils literal notranslate"><span class="pre">T</span></code>.
97
+
Internally, the data is stored using the optimal data layout for the given type.
98
98
Operator overloading (like <code class="docutils literal notranslate"><span class="pre">+</span></code>, <code class="docutils literal notranslate"><span class="pre">*</span></code>, <code class="docutils literal notranslate"><span class="pre">&amp;&amp;</span></code>) has been implemented such that the most optimal intrinsic for the available types is selected automatically.
99
99
Many mathematical functions (like <code class="docutils literal notranslate"><span class="pre">log</span></code>, <code class="docutils literal notranslate"><span class="pre">exp</span></code>, <code class="docutils literal notranslate"><span class="pre">sin</span></code>) and common operations (such as <code class="docutils literal notranslate"><span class="pre">sum</span></code>, <code class="docutils literal notranslate"><span class="pre">range</span></code>, <code class="docutils literal notranslate"><span class="pre">for_each</span></code>) are also available.</p>
100
-
<p>Using Kernel Float, developers avoid the complexity of reduced precision floating-point types in CUDA and can focus on their applications.</p>
100
+
<p>The generated assembly is identical to hand-written intrinsics code, meaning you get clean and maintainable source code without sacrificing performance.</p>
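<p>To make the idea of a fixed-size vector type with overloaded operators concrete, here is a minimal host-side C++ sketch. It is a toy stand-in written for illustration only: the names <code class="docutils literal notranslate"><span class="pre">toy_vec</span></code> and <code class="docutils literal notranslate"><span class="pre">toy_sum</span></code> are hypothetical, it does not use or reproduce the Kernel Float API, and it performs plain scalar loops rather than selecting GPU intrinsics.</p>

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for illustration only; this is NOT kernel_float::vec.
// It stores N elements of type T in a fixed-size array.
template <typename T, std::size_t N>
struct toy_vec {
    T items[N];

    T& operator[](std::size_t i) {
        assert(i < N);
        return items[i];
    }

    const T& operator[](std::size_t i) const {
        assert(i < N);
        return items[i];
    }
};

// Element-wise addition of two vectors of the same type and length.
template <typename T, std::size_t N>
toy_vec<T, N> operator+(const toy_vec<T, N>& a, const toy_vec<T, N>& b) {
    toy_vec<T, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Element-wise multiplication of two vectors of the same type and length.
template <typename T, std::size_t N>
toy_vec<T, N> operator*(const toy_vec<T, N>& a, const toy_vec<T, N>& b) {
    toy_vec<T, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = a[i] * b[i];
    }
    return out;
}

// Horizontal reduction: adds up all elements of the vector.
template <typename T, std::size_t N>
T toy_sum(const toy_vec<T, N>& v) {
    T total = T{};
    for (std::size_t i = 0; i < N; ++i) {
        total += v[i];
    }
    return total;
}
```

<p>With this sketch, changing the element type or the vector length only means changing the template arguments, for example from <code class="docutils literal notranslate"><span class="pre">toy_vec&lt;float,</span> <span class="pre">2&gt;</span></code> to <code class="docutils literal notranslate"><span class="pre">toy_vec&lt;double,</span> <span class="pre">4&gt;</span></code>; the arithmetic code stays the same.</p>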
101
101
</section>
102
102
<section id="features">
103
103
<h2>Features<a class="headerlink" href="#features" title="Permalink to this heading"></a></h2>
@@ -109,16 +109,14 @@ <h2>Features<a class="headerlink" href="#features" title="Permalink to this head
109
109
<li><p>Support for quarter (8 bit) floating-point types.</p></li>
110
110
<li><p>Easy integration as a single header file.</p></li>
111
111
<li><p>Written for C++17.</p></li>
112
-
<li><p>Compatible with NVCC (NVIDIA Compiler) and NVRTC (NVIDIA Runtime Compilation).</p></li>
113
-
<li><p>Compatible with HIPCC (AMD HIP Compiler)</p></li>
112
+
<li><p>Compatible with CUDA: <code class="docutils literal notranslate"><span class="pre">nvcc</span></code> (NVIDIA Compiler) and <code class="docutils literal notranslate"><span class="pre">nvrtc</span></code> (NVIDIA Runtime Compilation).</p></li>
113
+
<li><p>Compatible with HIP: <code class="docutils literal notranslate"><span class="pre">hipcc</span></code> (AMD HIP Compiler)</p></li>
114
114
</ul>
115
115
</section>
116
-
<section id="example">
117
-
<h2>Example<a class="headerlink" href="#example" title="Permalink to this heading"></a></h2>
118
-
<p>Check out the <a class="reference external" href="https://github.com/KernelTuner/kernel_float/tree/master/examples">examples</a> directory for some examples.</p>
119
-
<p>Below shows a simple example of a CUDA kernel that adds a <code class="docutils literal notranslate"><span class="pre">constant</span></code> to the <code class="docutils literal notranslate"><span class="pre">input</span></code> array and writes the results to the <code class="docutils literal notranslate"><span class="pre">output</span></code> array.
120
-
Each thread processes two elements.
121
-
Notice how easy it would be to change the precision (for example, <code class="docutils literal notranslate"><span class="pre">double</span></code> to <code class="docutils literal notranslate"><span class="pre">half</span></code>) or the vector size (for example, 4 instead of 2 items per thread).</p>
116
+
<section id="quick-example">
117
+
<h2>Quick Example<a class="headerlink" href="#quick-example" title="Permalink to this heading"></a></h2>
118
+
<p>Below shows a simple example kernel that multiplies an <code class="docutils literal notranslate"><span class="pre">input</span></code> array by a <code class="docutils literal notranslate"><span class="pre">constant</span></code> and accumulates into an <code class="docutils literal notranslate"><span class="pre">output</span></code> array.
@@ -128,8 +126,12 @@ <h2>Example<a class="headerlink" href="#example" title="Permalink to this headin
128
126
<span class="p">}</span>
129
127
</pre></div>
130
128
</div>
129
+
<p>Notice how easy it would be to change the precision (for example, <code class="docutils literal notranslate"><span class="pre">double</span></code> to <code class="docutils literal notranslate"><span class="pre">half</span></code>) or the vector size (for example, 4 instead of 2 items per thread).
130
+
Check out the <a class="reference external" href="https://github.com/KernelTuner/kernel_float/tree/main/examples">examples</a> directory for some examples.</p>
131
131
<p>Here is how the same kernel would look for CUDA without Kernel Float.</p>
@@ -146,7 +148,7 @@ <h2>Example<a class="headerlink" href="#example" title="Permalink to this headin
146
148
<span class="p">}</span>
147
149
</pre></div>
148
150
</div>
149
-
<p>Even though the second kernel looks a lot more complex, the PTX code generated by these two kernels is nearly identical.</p>
151
+
<p>Even though the second kernel looks a lot more complex, both generate nearly identical PTX code.</p>
150
152
</section>
151
153
<section id="installation">
152
154
<h2>Installation<a class="headerlink" href="#installation" title="Permalink to this heading"></a></h2>
@@ -159,9 +161,13 @@ <h2>Installation<a class="headerlink" href="#installation" title="Permalink to t
159
161
</pre></div>
160
162
</div>
161
163
</section>
162
-
<section id="documentation">
163
-
<h2>Documentation<a class="headerlink" href="#documentation" title="Permalink to this heading"></a></h2>
164
-
<p>See the <a class="reference external" href="https://kerneltuner.github.io/kernel_float/">documentation</a> for the <a class="reference external" href="https://kerneltuner.github.io/kernel_float/api.html">API reference</a> of all functionality.</p>
164
+
<section id="links">
165
+
<h2>Links<a class="headerlink" href="#links" title="Permalink to this heading"></a></h2>