KernelTuner
diff --git a/index.html b/index.html
Lines changed: 24 additions & 18 deletions
diff --git a/searchindex.js b/searchindex.js
Lines changed: 1 addition & 1 deletion
@@ -84,7 +84,7 @@ <h1>Kernel Float<a class="headerlink" href="#kernel-float" title="Permalink to t
 <img alt="GitHub" src="https://img.shields.io/github/license/KernelTuner/kernel_float" />
 <img alt="GitHub tag (latest by date)" src="https://img.shields.io/github/v/tag/KernelTuner/kernel_float" />
 <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/KernelTuner/kernel_float?style=social" /></p>
-<p><em>Kernel Float</em> is a header-only library for CUDA/HIP that simplifies working with vector types and reduced precision floating-point arithmetic in GPU code.</p>
+<p><em>Kernel Float</em> is a header-only library for CUDA/HIP that makes working with reduced-precision floating-point types and vector arithmetic simple and expressive, with zero performance overhead.</p>
 <section id="summary">
 <h2>Summary<a class="headerlink" href="#summary" title="Permalink to this heading"></a></h2>
 <p>CUDA/HIP natively offers several reduced precision floating-point types (<code class="docutils literal notranslate"><span class="pre">__half</span></code>, <code class="docutils literal notranslate"><span class="pre">__nv_bfloat16</span></code>, <code class="docutils literal notranslate"><span class="pre">__nv_fp8_e4m3</span></code>, <code class="docutils literal notranslate"><span class="pre">__nv_fp8_e5m2</span></code>)
@@ -93,11 +93,11 @@ <h2>Summary<a class="headerlink" href="#summary" title="Permalink to this headin
 mathematical operations require intrinsics (e.g., <code class="docutils literal notranslate"><span class="pre">__hadd2</span></code> performs addition for <code class="docutils literal notranslate"><span class="pre">__half2</span></code>),
 type conversion is awkward (e.g., <code class="docutils literal notranslate"><span class="pre">__nv_cvt_halfraw2_to_fp8x2</span></code> converts float16 to float8),
 and some functionality is missing (e.g., one cannot convert a <code class="docutils literal notranslate"><span class="pre">__half</span></code> to <code class="docutils literal notranslate"><span class="pre">__nv_bfloat16</span></code>).</p>
-<p><em>Kernel Float</em> resolves this by offering a single data type <code class="docutils literal notranslate"><span class="pre">kernel_float::vec&lt;T,</span> <span class="pre">N&gt;</span></code> that stores <code class="docutils literal notranslate"><span class="pre">N</span></code> elements of type <code class="docutils literal notranslate"><span class="pre">T</span></code>.
-Internally, the data is stored as a fixed-sized array of elements.
+<p><em>Kernel Float</em> resolves this by offering a single unified vector type <code class="docutils literal notranslate"><span class="pre">kernel_float::vec&lt;T,</span> <span class="pre">N&gt;</span></code> that stores <code class="docutils literal notranslate"><span class="pre">N</span></code> elements of type <code class="docutils literal notranslate"><span class="pre">T</span></code>.
+Internally, the data is stored using the optimal data layout for the given type.
 Operator overloading (like <code class="docutils literal notranslate"><span class="pre">+</span></code>, <code class="docutils literal notranslate"><span class="pre">*</span></code>, <code class="docutils literal notranslate"><span class="pre">&amp;&amp;</span></code>) has been implemented such that the optimal intrinsic for the available types is selected automatically.
 Many mathematical functions (like <code class="docutils literal notranslate"><span class="pre">log</span></code>, <code class="docutils literal notranslate"><span class="pre">exp</span></code>, <code class="docutils literal notranslate"><span class="pre">sin</span></code>) and common operations (such as <code class="docutils literal notranslate"><span class="pre">sum</span></code>, <code class="docutils literal notranslate"><span class="pre">range</span></code>, <code class="docutils literal notranslate"><span class="pre">for_each</span></code>) are also available.</p>
-<p>Using Kernel Float, developers avoid the complexity of reduced precision floating-point types in CUDA and can focus on their applications.</p>
+<p>The generated assembly is nearly identical to hand-written intrinsics code, so you get clean, maintainable source code without sacrificing performance.</p>
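<p>As a rough illustration of the idea behind a unified vector type with overloaded operators, here is a minimal host-side sketch in plain C++17. It is not Kernel Float's implementation or API: the namespace <code class="docutils literal notranslate"><span class="pre">toy</span></code> and its <code class="docutils literal notranslate"><span class="pre">vec</span></code>, <code class="docutils literal notranslate"><span class="pre">cast</span></code>, and <code class="docutils literal notranslate"><span class="pre">sum</span></code> are hypothetical names chosen for this example, the storage is a plain array, and no GPU intrinsics are selected.</p>

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

namespace toy {  // hypothetical stand-in, NOT the kernel_float namespace

// N elements of type T, stored as a plain fixed-size array (an assumption
// made for this sketch; a real library may pick a different layout).
template <typename T, std::size_t N>
struct vec {
    std::array<T, N> data{};

    T& operator[](std::size_t i) { return data[i]; }
    const T& operator[](std::size_t i) const { return data[i]; }
};

// Element-wise conversion to another element type R.
template <typename R, typename T, std::size_t N>
vec<R, N> cast(const vec<T, N>& v) {
    vec<R, N> out;
    for (std::size_t i = 0; i < N; ++i) out[i] = static_cast<R>(v[i]);
    return out;
}

// Element-wise addition; mixed element types promote via std::common_type.
template <typename T, typename U, std::size_t N>
vec<std::common_type_t<T, U>, N> operator+(const vec<T, N>& a, const vec<U, N>& b) {
    using R = std::common_type_t<T, U>;
    vec<R, N> out;
    for (std::size_t i = 0; i < N; ++i) out[i] = static_cast<R>(a[i]) + static_cast<R>(b[i]);
    return out;
}

// Element-wise multiplication with the same promotion rule.
template <typename T, typename U, std::size_t N>
vec<std::common_type_t<T, U>, N> operator*(const vec<T, N>& a, const vec<U, N>& b) {
    using R = std::common_type_t<T, U>;
    vec<R, N> out;
    for (std::size_t i = 0; i < N; ++i) out[i] = static_cast<R>(a[i]) * static_cast<R>(b[i]);
    return out;
}

// Horizontal reduction: adds up all N elements.
template <typename T, std::size_t N>
T sum(const vec<T, N>& v) {
    T total{};
    for (std::size_t i = 0; i < N; ++i) total += v[i];
    return total;
}

}  // namespace toy
```

<p>With these definitions, an expression such as <code class="docutils literal notranslate"><span class="pre">a * b + b</span></code> on two <code class="docutils literal notranslate"><span class="pre">toy::vec</span></code> values reads like scalar code while operating on every element. A GPU library could specialize the same operators per element type to dispatch to packed intrinsics; that part is left out of this sketch.</p>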
 </section>
 <section id="features">
 <h2>Features<a class="headerlink" href="#features" title="Permalink to this heading"></a></h2>
@@ -109,16 +109,14 @@ <h2>Features<a class="headerlink" href="#features" title="Permalink to this head
 <li><p>Support for quarter (8 bit) floating-point types.</p></li>
 <li><p>Easy integration as a single header file.</p></li>
 <li><p>Written for C++17.</p></li>
-<li><p>Compatible with NVCC (NVIDIA Compiler) and NVRTC (NVIDIA Runtime Compilation).</p></li>
-<li><p>Compatible with HIPCC (AMD HIP Compiler)</p></li>
+<li><p>Compatible with CUDA: <code class="docutils literal notranslate"><span class="pre">nvcc</span></code> (NVIDIA Compiler) and <code class="docutils literal notranslate"><span class="pre">nvrtc</span></code> (NVIDIA Runtime Compilation).</p></li>
+<li><p>Compatible with HIP: <code class="docutils literal notranslate"><span class="pre">hipcc</span></code> (AMD HIP Compiler).</p></li>
 </ul>
 </section>
-<section id="example">
-<h2>Example<a class="headerlink" href="#example" title="Permalink to this heading"></a></h2>
-<p>Check out the <a class="reference external" href="https://github.com/KernelTuner/kernel_float/tree/master/examples">examples</a> directory for some examples.</p>
-<p>Below shows a simple example of a CUDA kernel that adds a <code class="docutils literal notranslate"><span class="pre">constant</span></code> to the <code class="docutils literal notranslate"><span class="pre">input</span></code> array and writes the results to the <code class="docutils literal notranslate"><span class="pre">output</span></code> array.
-Each thread processes two elements.
-Notice how easy it would be to change the precision (for example, <code class="docutils literal notranslate"><span class="pre">double</span></code> to <code class="docutils literal notranslate"><span class="pre">half</span></code>) or the vector size (for example, 4 instead of 2 items per thread).</p>
+<section id="quick-example">
+<h2>Quick Example<a class="headerlink" href="#quick-example" title="Permalink to this heading"></a></h2>
+<p>Below is a simple example kernel that multiplies an <code class="docutils literal notranslate"><span class="pre">input</span></code> array by a <code class="docutils literal notranslate"><span class="pre">constant</span></code> and accumulates into an <code class="docutils literal notranslate"><span class="pre">output</span></code> array.
+Each thread processes two elements.</p>
 <div class="highlight-cpp notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&quot;kernel_float.h&quot;</span>
 <span class="k">namespace</span><span class="w"> </span><span class="nn">kf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nn">kernel_float</span><span class="p">;</span>
 
@@ -128,8 +126,12 @@ <h2>Example<a class="headerlink" href="#example" title="Permalink to this headin
 <span class="p">}</span>
 </pre></div>
 </div>
+<p>Notice how easy it would be to change the precision (for example, <code class="docutils literal notranslate"><span class="pre">double</span></code> to <code class="docutils literal notranslate"><span class="pre">half</span></code>) or the vector size (for example, 4 instead of 2 items per thread).
+See the <a class="reference external" href="https://github.com/KernelTuner/kernel_float/tree/main/examples">examples</a> directory for more examples.</p>
 <p>Here is how the same kernel would look for CUDA without Kernel Float.</p>
-<div class="highlight-cpp notranslate"><div class="highlight"><pre><span></span><span class="n">__global__</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="n">kernel</span><span class="p">(</span><span class="k">const</span><span class="w"> </span><span class="n">half</span><span class="o">*</span><span class="w"> </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="kt">double</span><span class="w"> </span><span class="n">constant</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">output</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
+<div class="highlight-cpp notranslate"><div class="highlight"><pre><span></span><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;cuda_fp16.h&gt;</span>
+
+<span class="n">__global__</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="n">kernel</span><span class="p">(</span><span class="k">const</span><span class="w"> </span><span class="n">half</span><span class="o">*</span><span class="w"> </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">constant</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">output</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
 <span class="w">    </span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">blockDim</span><span class="p">.</span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
 <span class="w">    </span><span class="n">__half</span><span class="w"> </span><span class="n">in0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">0</span><span class="p">];</span>
 <span class="w">    </span><span class="n">__half</span><span class="w"> </span><span class="n">in1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">];</span>
@@ -146,7 +148,7 @@ <h2>Example<a class="headerlink" href="#example" title="Permalink to this headin
 <span class="p">}</span>
 </pre></div>
 </div>
-<p>Even though the second kernel looks a lot more complex, the PTX code generated by these two kernels is nearly identical.</p>
+<p>Even though the second kernel looks a lot more complex, both generate nearly identical PTX code.</p>
 </section>
 <section id="installation">
 <h2>Installation<a class="headerlink" href="#installation" title="Permalink to this heading"></a></h2>
@@ -159,9 +161,13 @@ <h2>Installation<a class="headerlink" href="#installation" title="Permalink to t
 </pre></div>
 </div>
 </section>
-<section id="documentation">
-<h2>Documentation<a class="headerlink" href="#documentation" title="Permalink to this heading"></a></h2>
-<p>See the <a class="reference external" href="https://kerneltuner.github.io/kernel_float/">documentation</a> for the <a class="reference external" href="https://kerneltuner.github.io/kernel_float/api.html">API reference</a> of all functionality.</p>
+<section id="links">
+<h2>Links<a class="headerlink" href="#links" title="Permalink to this heading"></a></h2>
+<ul class="simple">
+<li><p><a class="reference external" href="https://kerneltuner.github.io/kernel_float/">Documentation</a></p></li>
+<li><p><a class="reference external" href="https://kerneltuner.github.io/kernel_float/api.html">API reference</a></p></li>
+<li><p><a class="reference external" href="https://github.com/KernelTuner/kernel_float/tree/main/examples">Examples</a></p></li>
+</ul>
 </section>
 <section id="citation">
 <h2>Citation<a class="headerlink" href="#citation" title="Permalink to this heading"></a></h2>
@@ -179,7 +185,7 @@ <h2>Citation<a class="headerlink" href="#citation" title="Permalink to this head
 </section>
 <section id="license">
 <h2>License<a class="headerlink" href="#license" title="Permalink to this heading"></a></h2>
-<p>Licensed under Apache 2.0. See <a class="reference external" href="https://github.com/KernelTuner/kernel_float/blob/master/LICENSE">LICENSE</a>.</p>
+<p>Licensed under Apache 2.0. See <a class="reference external" href="https://github.com/KernelTuner/kernel_float/blob/main/LICENSE">LICENSE</a>.</p>
 </section>
 <section id="related-work">
 <h2>Related Work<a class="headerlink" href="#related-work" title="Permalink to this heading"></a></h2>