Commit db187a4

Add changes for f9ffc69
1 parent 65976ca commit db187a4

2 files changed

Lines changed: 25 additions & 19 deletions

index.html

Lines changed: 24 additions & 18 deletions
@@ -84,7 +84,7 @@ <h1>Kernel Float<a class="headerlink" href="#kernel-float" title="Permalink to t
 <img alt="GitHub" src="https://img.shields.io/github/license/KernelTuner/kernel_float" />
 <img alt="GitHub tag (latest by date)" src="https://img.shields.io/github/v/tag/KernelTuner/kernel_float" />
 <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/KernelTuner/kernel_float?style=social" /></p>
-<p><em>Kernel Float</em> is a header-only library for CUDA/HIP that simplifies working with vector types and reduced precision floating-point arithmetic in GPU code.</p>
+<p><em>Kernel Float</em> is a header-only library for CUDA/HIP that makes working with reduced-precision floating-point types and vector arithmetic simple and expressive, with zero performance overhead.</p>
 <section id="summary">
 <h2>Summary<a class="headerlink" href="#summary" title="Permalink to this heading"></a></h2>
 <p>CUDA/HIP natively offers several reduced precision floating-point types (<code class="docutils literal notranslate"><span class="pre">__half</span></code>, <code class="docutils literal notranslate"><span class="pre">__nv_bfloat16</span></code>, <code class="docutils literal notranslate"><span class="pre">__nv_fp8_e4m3</span></code>, <code class="docutils literal notranslate"><span class="pre">__nv_fp8_e5m2</span></code>)
@@ -93,11 +93,11 @@ <h2>Summary<a class="headerlink" href="#summary" title="Permalink to this headin
 mathematical operations require intrinsics (e.g., <code class="docutils literal notranslate"><span class="pre">__hadd2</span></code> performs addition for <code class="docutils literal notranslate"><span class="pre">__half2</span></code>),
 type conversion is awkward (e.g., <code class="docutils literal notranslate"><span class="pre">__nv_cvt_halfraw2_to_fp8x2</span></code> converts float16 to float8),
 and some functionality is missing (e.g., one cannot convert a <code class="docutils literal notranslate"><span class="pre">__half</span></code> to <code class="docutils literal notranslate"><span class="pre">__nv_bfloat16</span></code>).</p>
-<p><em>Kernel Float</em> resolves this by offering a single data type <code class="docutils literal notranslate"><span class="pre">kernel_float::vec&lt;T,</span> <span class="pre">N&gt;</span></code> that stores <code class="docutils literal notranslate"><span class="pre">N</span></code> elements of type <code class="docutils literal notranslate"><span class="pre">T</span></code>.
-Internally, the data is stored as a fixed-sized array of elements.
+<p><em>Kernel Float</em> resolves this by offering a single unified vector type <code class="docutils literal notranslate"><span class="pre">kernel_float::vec&lt;T,</span> <span class="pre">N&gt;</span></code> that stores <code class="docutils literal notranslate"><span class="pre">N</span></code> elements of type <code class="docutils literal notranslate"><span class="pre">T</span></code>.
+Internally, the data is stored using the optimal data layout for the given type.
 Operator overloading (like <code class="docutils literal notranslate"><span class="pre">+</span></code>, <code class="docutils literal notranslate"><span class="pre">*</span></code>, <code class="docutils literal notranslate"><span class="pre">&amp;&amp;</span></code>) has been implemented such that the most optimal intrinsic for the available types is selected automatically.
 Many mathematical functions (like <code class="docutils literal notranslate"><span class="pre">log</span></code>, <code class="docutils literal notranslate"><span class="pre">exp</span></code>, <code class="docutils literal notranslate"><span class="pre">sin</span></code>) and common operations (such as <code class="docutils literal notranslate"><span class="pre">sum</span></code>, <code class="docutils literal notranslate"><span class="pre">range</span></code>, <code class="docutils literal notranslate"><span class="pre">for_each</span></code>) are also available.</p>
-<p>Using Kernel Float, developers avoid the complexity of reduced precision floating-point types in CUDA and can focus on their applications.</p>
+<p>The generated assembly is identical to hand-written intrinsics code, meaning you get clean and maintainable source code without sacrificing performance.</p>
 </section>
 <section id="features">
 <h2>Features<a class="headerlink" href="#features" title="Permalink to this heading"></a></h2>
@@ -109,16 +109,14 @@ <h2>Features<a class="headerlink" href="#features" title="Permalink to this head
 <li><p>Support for quarter (8 bit) floating-point types.</p></li>
 <li><p>Easy integration as a single header file.</p></li>
 <li><p>Written for C++17.</p></li>
-<li><p>Compatible with NVCC (NVIDIA Compiler) and NVRTC (NVIDIA Runtime Compilation).</p></li>
-<li><p>Compatible with HIPCC (AMD HIP Compiler)</p></li>
+<li><p>Compatible with CUDA: <code class="docutils literal notranslate"><span class="pre">nvcc</span></code> (NVIDIA Compiler) and <code class="docutils literal notranslate"><span class="pre">nvrtc</span></code> (NVIDIA Runtime Compilation).</p></li>
+<li><p>Compatible with HIP: <code class="docutils literal notranslate"><span class="pre">hipcc</span></code> (AMD HIP Compiler)</p></li>
 </ul>
 </section>
-<section id="example">
-<h2>Example<a class="headerlink" href="#example" title="Permalink to this heading"></a></h2>
-<p>Check out the <a class="reference external" href="https://github.com/KernelTuner/kernel_float/tree/master/examples">examples</a> directory for some examples.</p>
-<p>Below shows a simple example of a CUDA kernel that adds a <code class="docutils literal notranslate"><span class="pre">constant</span></code> to the <code class="docutils literal notranslate"><span class="pre">input</span></code> array and writes the results to the <code class="docutils literal notranslate"><span class="pre">output</span></code> array.
-Each thread processes two elements.
-Notice how easy it would be to change the precision (for example, <code class="docutils literal notranslate"><span class="pre">double</span></code> to <code class="docutils literal notranslate"><span class="pre">half</span></code>) or the vector size (for example, 4 instead of 2 items per thread).</p>
+<section id="quick-example">
+<h2>Quick Example<a class="headerlink" href="#quick-example" title="Permalink to this heading"></a></h2>
+<p>Below shows a simple example kernel that multiplies an <code class="docutils literal notranslate"><span class="pre">input</span></code> array by a <code class="docutils literal notranslate"><span class="pre">constant</span></code> and accumulates into an <code class="docutils literal notranslate"><span class="pre">output</span></code> array.
+Each thread processes two elements.</p>
 <div class="highlight-cpp notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&quot;kernel_float.h&quot;</span>
 <span class="k">namespace</span><span class="w"> </span><span class="nn">kf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nn">kernel_float</span><span class="p">;</span>

@@ -128,8 +126,12 @@ <h2>Example<a class="headerlink" href="#example" title="Permalink to this headin
 <span class="p">}</span>
 </pre></div>
 </div>
+<p>Notice how easy it would be to change the precision (for example, <code class="docutils literal notranslate"><span class="pre">double</span></code> to <code class="docutils literal notranslate"><span class="pre">half</span></code>) or the vector size (for example, 4 instead of 2 items per thread).
+Check out the <a class="reference external" href="https://github.com/KernelTuner/kernel_float/tree/main/examples">examples</a> directory for some examples.</p>
 <p>Here is how the same kernel would look for CUDA without Kernel Float.</p>
-<div class="highlight-cpp notranslate"><div class="highlight"><pre><span></span><span class="n">__global__</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="n">kernel</span><span class="p">(</span><span class="k">const</span><span class="w"> </span><span class="n">half</span><span class="o">*</span><span class="w"> </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">constant</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">output</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
+<div class="highlight-cpp notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;cuda_fp16.h&gt;</span>
+
+<span class="n">__global__</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="n">kernel</span><span class="p">(</span><span class="k">const</span><span class="w"> </span><span class="n">half</span><span class="o">*</span><span class="w"> </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">constant</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">output</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
 <span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">blockDim</span><span class="p">.</span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
 <span class="w"> </span><span class="n">__half</span><span class="w"> </span><span class="n">in0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">0</span><span class="p">];</span>
 <span class="w"> </span><span class="n">__half</span><span class="w"> </span><span class="n">in1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">];</span>
@@ -146,7 +148,7 @@ <h2>Example<a class="headerlink" href="#example" title="Permalink to this headin
 <span class="p">}</span>
 </pre></div>
 </div>
-<p>Even though the second kernel looks a lot more complex, the PTX code generated by these two kernels is nearly identical.</p>
+<p>Even though the second kernel looks a lot more complex, both generate nearly identical PTX code.</p>
 </section>
 <section id="installation">
 <h2>Installation<a class="headerlink" href="#installation" title="Permalink to this heading"></a></h2>
@@ -159,9 +161,13 @@ <h2>Installation<a class="headerlink" href="#installation" title="Permalink to t
 </pre></div>
 </div>
 </section>
-<section id="documentation">
-<h2>Documentation<a class="headerlink" href="#documentation" title="Permalink to this heading"></a></h2>
-<p>See the <a class="reference external" href="https://kerneltuner.github.io/kernel_float/">documentation</a> for the <a class="reference external" href="https://kerneltuner.github.io/kernel_float/api.html">API reference</a> of all functionality.</p>
+<section id="links">
+<h2>Links<a class="headerlink" href="#links" title="Permalink to this heading"></a></h2>
+<ul class="simple">
+<li><p><a class="reference external" href="https://kerneltuner.github.io/kernel_float/">Documentation</a></p></li>
+<li><p><a class="reference external" href="https://kerneltuner.github.io/kernel_float/api.html">API reference</a></p></li>
+<li><p><a class="reference external" href="https://github.com/KernelTuner/kernel_float/tree/main/examples">Examples</a></p></li>
+</ul>
 </section>
 <section id="citation">
 <h2>Citation<a class="headerlink" href="#citation" title="Permalink to this heading"></a></h2>
@@ -179,7 +185,7 @@ <h2>Citation<a class="headerlink" href="#citation" title="Permalink to this head
 </section>
 <section id="license">
 <h2>License<a class="headerlink" href="#license" title="Permalink to this heading"></a></h2>
-<p>Licensed under Apache 2.0. See <a class="reference external" href="https://github.com/KernelTuner/kernel_float/blob/master/LICENSE">LICENSE</a>.</p>
+<p>Licensed under Apache 2.0. See <a class="reference external" href="https://github.com/KernelTuner/kernel_float/blob/main/LICENSE">LICENSE</a>.</p>
 </section>
 <section id="related-work">
 <h2>Related Work<a class="headerlink" href="#related-work" title="Permalink to this heading"></a></h2>

searchindex.js

Lines changed: 1 addition & 1 deletion
