index.html

<!doctype html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="chrome=1">
    <title>Titouan Galopin by tgalopin</title>

    <link rel="stylesheet" href="stylesheets/styles.css">
    <link rel="stylesheet" href="stylesheets/github-light.css">
    <script src="javascripts/scale.fix.js"></script>
    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
    <!--[if lt IE 9]>
    <script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body>
    <div class="wrapper">
      <header>
        <h1 class="header">Titouan Galopin</h1>
        <p class="header">Personnal website</p>


        <ul>
          <li><a class="buttons github" href="https://github.com/tgalopin">GitHub Profile</a></li>
        </ul>

      </header>
      <section>
        <h1>
<a id="simhash-or-the-way-to-compare-quickly-two-datasets" class="anchor" href="#simhash-or-the-way-to-compare-quickly-two-datasets" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>SimHash or the way to compare quickly two datasets</h1>

<p>SimHash is an <strong>algorithm created by Moses Charikar</strong>, from Google. It let us to compare easily
two texts and more generally two datasets of any type, quickly and effectively.</p>

<p>What does this really mean? Let's talk a bit about it.</p>

<h3>
<a id="the-similarity-problem" class="anchor" href="#the-similarity-problem" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>The similarity problem</h3>

<p>Computers are very pragmatic. They can determine very fast whether two elements are different
or not. It's a binary world: equal or different.</p>

<p>Imagine now we would like to have an idea of the similarity between these two elements.
That would be a much bigger problem: a computer is not designed to do such comparison by nature.
It's difficult, so it takes resources.</p>

<p>For a better understanding, let's take the following sentences as an example:</p>

<pre lang="no-highlight"><code>Shakespeare produced most of his known work between 1589 and 1613
</code></pre>

<pre lang="no-highlight"><code>Shakespeare produced most of his work after 1589
</code></pre>

<p>How does the computer will determine if the texts are (strictly) equals? It will
browse the first text and compare each letter of it with the letter in the same
position in the second text.</p>

<p>But the point here is that if the computer find a difference, <strong>it stops</strong>. It does not
use anymore resources, it knows that the strings are strictly different.</p>

<p>That's where similarity is a challenge. If we check for strict equality, we only need
to stop on the first difference. If we want a similarity estimation, we have to check
the entire text. In math it would be:</p>

<p>$similarity(A, B) = \frac{A \cap B}{A \cup B}$</p>

<p>For big datasets, the part $A \cap B$ is very hard to compute. That's where SimHash is useful.</p>

<p>With SimHash, we will create two fingerprints to replace the datasets A and B:</p>

<p>$simhash(A) \cap simhash(B)$</p>

<p>Thus we will compare much smaller elements and the comparison time will be dramatically reduced.</p>

<h3>
<a id="simhash-or-the-way-to-create-fingerprints" class="anchor" href="#simhash-or-the-way-to-create-fingerprints" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>SimHash or the way to create fingerprints</h3>

<p>To compare fingerprints, we need an algorithm that generate them using a big dataset.
The first idea would be to use hash (md5, sha1), but theses algorithms does not represent
well a text as if the text change a bit, the hash change a lot.</p>

<p>SimHash does not behave like this: instead, it's a bit like a very compressed version
of the dataset (it does not change a lot if the text does not change a lot).</p>

<p>The official SimHash algorithm is:</p>

<ul>
<li>Define a fingerprint size (for instance 32 bits)</li>
<li>Create an array <code>V[]</code> filled with this size of zeros</li>
<li>For each element in the dataset, we create a unique hash with md5,
sha1 of any other hash algorithm that give same-sized results</li>
<li>For each hash, for each bit <code>i</code> in this hash:

<ul>
<li>If the bit is <code>0</code>, we add <code>1</code> to <code>V[i]</code>
</li>
<li>If the bit is <code>1</code>, we take <code>1</code> from <code>V[i]</code>
</li>
</ul>
</li>
<li>For each bit <code>j</code> of the global fingerprint:

<ul>
<li>If <code>V[j] &gt;= 0</code>, we set <code>V[j] = 1</code>
</li>
<li>If <code>V[j] &lt; 0</code>, we set <code>V[j] = 0</code>
</li>
</ul>
</li>
</ul>

<p>It gives us a fingerprint characterizing our text, an approximation of the text data.
This fingerprint is a binary number, for instance: <code>10101011100010001010000101111100</code>.</p>

<p>Now, to find</p>

<p>$simhash(A) \cap simhash(B)$</p>

<p>we only have to use a XOR operation:</p>

<pre lang="no-highlight"><code>    10101011100010001010000101111100
XOR 10101011100010011110000101111110
  = 00000000000000010100000000000010
</code></pre>

<p>Here, the <code>1</code> in the XOR result are the differences between the two fingerprints.
To get an idea of the difference between the original texts, we just have to count
the number of <code>1</code> and divide it by the total size.</p>

<p>Here, we have <code>3</code> ones for <code>32</code> characters : the estimation of the difference is
<code>3 / 32 = 0,09375</code>, so the estimation of the similarity is <code>1 - 3 / 32 = 0,90625</code>
(a bit more than 90%). We have our similarity index!</p>

<h3>
<a id="real-world-usage" class="anchor" href="#real-world-usage" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Real world usage</h3>

<p>SimHash is currently used by Google to compare page with its database, to avoid duplicate
contents. But we can use it too!</p>

<p>I created a small PHP library to use it programmatically in PHP: <a href="https://github.com/tgalopin/SimHashPhp">SimHashPHP</a>.</p>

<p>But the main usage of SimHash is to compare things in a database. For instance, let's
imagine we want to find the most similar articles of the one we are currently reading.
It appears complicated at the first sight using only SQL. However with SimHash it's not
that difficult: we just have to store a fingerprint for each article, and use the XOR
operation in SQL to count the <code>1</code> in the binary result.</p>

<p>For instance:</p>

<div class="highlight highlight-source-sql"><pre><span class="pl-k">SELECT</span> id, title, (LENGTH(CONV(fp ^ ?, <span class="pl-c1">10</span>, <span class="pl-c1">2</span>)) <span class="pl-k">-</span> LENGTH(REPLACE(CONV(fp ^ ?, <span class="pl-c1">10</span>, <span class="pl-c1">2</span>), <span class="pl-s"><span class="pl-pds">'</span>1<span class="pl-pds">'</span></span>, <span class="pl-s"><span class="pl-pds">'</span><span class="pl-pds">'</span></span>))) <span class="pl-k">/</span> LENGTH(<span class="pl-s"><span class="pl-pds">'</span>1<span class="pl-pds">'</span></span>) <span class="pl-k">AS</span> comparison
<span class="pl-k">FROM</span> articles
<span class="pl-k">ORDER BY</span> comparison <span class="pl-k">ASC</span></pre></div>
      </section>
      <footer>
        <p><small>Hosted on <a href="https://pages.github.com">GitHub Pages</a> using the Dinky theme</small></p>
      </footer>
    </div>
    <!--[if !IE]><script>fixScale(document);</script><![endif]-->
		
  </body>
</html>