index.html

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

        <title>Joan Arnaldich - Home</title>

        <link rel="stylesheet" type="text/css" href="https://netdna.bootstrapcdn.com/bootstrap/3.0.0/css/bootstrap.min.css" />
        <link href="https://netdna.bootstrapcdn.com/font-awesome/4.0.1/css/font-awesome.css" rel="stylesheet">

        <link href="https://fonts.googleapis.com/css?family=Abel" rel="stylesheet" type="text/css">

	<link href="https://fonts.googleapis.com/css?family=Source+Code+Pro&display=swap" rel="stylesheet">
	<link href="https://fonts.googleapis.com/css?family=Quicksand&display=swap" rel="stylesheet">

	<link rel="stylesheet" type="text/css" href="./css/syntax.css" />
        <link rel="stylesheet" type="text/css" href="./css/custom.css" />
        <link rel="stylesheet" type="text/css" href="./css/prism.css" />
    </head>
    <body>
<script type="text/javascript" async src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
	<script src="./js/prism.js"></script>
        <div class="container">
            <div class="row">
                <div class="col-md-10 col-md-offset-1">
                    <div class="page-header">
                        <h1>Joan
                        Arnaldich<sup class="sup-title">(.me)</sup></h1>

                        <div style="margin-top:20px;">
                            <div class="row">
                                <div class="col-md-2"><a href="./"><i class="fa fa-home fa-lg fa-fw"></i> Home</a></div>

                                <div class="col-md-2"><a href="./posts.html"><i class="fa fa-pencil fa-lg fa-fw"></i> Posts</a></div>
                                <div class="col-md-2"><a href="./tags.html"><i class="fa fa-tag fa-lg fa-fw"></i> Tags</a></div>
                                <div class="col-md-2"><a href="./about.html"><i class="fa fa-male fa-lg fa-fw"></i> About</a></div>
                                <div class="col-xs-1"><a href="http://github.com/jarnaldich"><i class="fa fa-github fa-lg fa-fw"></i></a></div>
                                <div class="col-xs-1"><a href="https://twitter.com/jarnaldich"><i class="fa fa-twitter fa-lg fa-fw"></i></a></div>
                                <div class="col-xs-1"><a href="./rss.xml"><i class="fa fa-rss fa-lg fa-fw"></i></a></div>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            <div class="row">
                <div class="col-md-10 col-md-offset-1">
                    <h1>Near Duplicates Detection</h1>

<small>Posted on February 19, 2023 <a href="./blog/2023/03/19/near-duplicates.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<p>In my <a href="http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html">previous post</a> I set up a tool to ease the download of open datasets into a JupyterLite environment, which is a neat tool to perform simplish data wrangling without local installation.</p>
<p>In this post we will put that tool to good use for one of the most common data cleaning utilities: near duplicate detection.</p>
<figure>
<img src="./images/spiderman_double.png" title="spiderman double" class="center" alt /><figcaption> </figcaption>
</figure>
<h2 id="why-bother-about-near-duplicates">Why bother about near duplicates?</h2>
<p>Near duplicates can be a sign of a poor schema implementation, especially when they appear in variables with finite domains (factors). For example, in the following addresses dataset:</p>
<center>
<table>
<thead>
<tr class="header">
<th>kind</th>
<th>name</th>
<th>number</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>road</td>
<td>Abbey</td>
<td>3</td>
</tr>
<tr class="even">
<td>square</td>
<td>Level</td>
<td>666</td>
</tr>
<tr class="odd">
<td>drive</td>
<td>Mullholand</td>
<td>1</td>
</tr>
<tr class="even">
<td>boulevard</td>
<td>Broken Dreams</td>
<td>4</td>
</tr>
</tbody>
</table>
</center>
<p></p>
<p>The “kind” variable could predictably take any of the following values:</p>
<ul>
<li>road</li>
<li>square</li>
<li>avenue</li>
<li>drive</li>
<li>boulevard</li>
</ul>
<p>The problem is that this kind of data is too often modelled as an unconstrained string, which makes it error prone: ‘sqare’ is just as valid as ‘square’. This generates all kind of problems down the data analysis pipeline: what would happen if we analyze the frequency of each kind?</p>
<p>There are ways to ensure that the variable “kind” can only take one of those values, depending on the underlying data infrastructure:</p>
<ul>
<li>In relational databases one could use <a href="https://www.postgresql.org/docs/current/sql-createdomain.html">domain types</a> , data validation <a href="https://www.postgresql.org/docs/current/sql-createtrigger.html">triggers</a>, or plain old dictonary tables with 1:n relationships.</li>
<li>Non-relational DBs may have other ways to ensure schema conformance, e.g. through <a href="https://www.mongodb.com/docs/manual/core/schema-validation/specify-json-schema/">JSON schema</a> or <a href="http://exist-db.org/exist/apps/doc/validation">XML schema</a>.</li>
<li>The fallback option is to guarantee this “by construction” via application validation, (eg. using drop-downs in the UI), although this is a weaker solution since it incurs in unnecessary coupling… and thing can go sideways anyway, so in this scenario you should consider performing periodic schema validation tests on the data.</li>
</ul>
<p>Notice that all of these solutions require <em>a priori</em> knowledge of the domain.</p>
<p>But what happens when we are faced with an (underdocumented) dataset and asked to use it as a source for analysis? Or when we are asked to derive these rules <em>a posteriori</em> eg. to improve a legacy database? Well, without knowledge of the domain, it is just not possible to decide wether two similar values are both correct (and just happen to be spelled similarly) or a misspelling. The best thing we can do is to detect which values are indeed similar and raise a flag.</p>
<p>This is when the techniques explained in this blog post come handy.</p>
<h2 id="the-algorithm">The algorithm</h2>
<p>For the sake of simplicity, in this blog post we will assume our data is small enough so that a quadratic algorithm is acceptable (for the real thing, see the references at the end). Beware that, in modern hardware, this simple case can take you farther than you would initially expect. My advise is to always <em>use the simplest solution that gets the job done</em>. It usually pays off in both development time and incidental complexity (reliance on external dependencies, etc…).</p>
<p>There are two main metrics regarding similarity. The first one, restricted to strings, is the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein</a> (aka edit) distance and represents the number of edits needed to go from one string to another. This metric is hard to scale in general, since it requires pairwise comparison.</p>
<p>The other one is both more general and more scalable. It involves generating n-gram sets and then comparing them using a set-similarity measure.</p>
<h3 id="n-gram-sets">N-gram sets</h3>
<p>For each string, we can associate a set of n-grams that can be derived from it. N-grams (sometimes called <em>shingles</em>) are just substrings of length n. A typical case is <code>n=3</code>, which generates what is known as trigrams. For example, the trigram set for the string <code>"algorithm"</code> would be <code>['alg', 'lgo', 'gor', 'ori', 'rit', 'ith', 'thm']</code>.</p>
<h3 id="jaccard-index">Jaccard Index</h3>
<p>Once we have the n-gram set for a string, we can use a general metric for set similarity. A popular one is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard Index</a>. Which is defined as the ratio between the cardinality of intersection over the cardinality of the union of any two sets.</p>
<p><span class="math display">\[J(A,B) = \frac{|A \bigcap B|}{|A \bigcup B|}\]</span></p>
<p>Note that this index will range from 0, for disjoint sets, to 1, for exactly equal sets.</p>
<h3 id="if-we-were-to-scale">If we were to scale…</h3>
<p>The advantadge of using n-gram sets is that we can use similarity-preserving summaries of those sets (eg. via <a href="https://en.wikipedia.org/wiki/MinHash">minhashing</a>) which, combined with <a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">locality sensitive hashing</a> to efficiently compare pairs of sets provides a massively scalable solution. In this post we will just assume that the size of our data is small enought so that we do not need to scale.</p>
<h2 id="the-code">The Code</h2>
<p>All the above can be implemented in the following utility function, which will take an iterable of strings and the minimum jaccard similarity and max levenshtein distance to consider a pair a candidate for duplicity. It will return a pandas dataframe with the pair indices, their values, and their mutual Levenshtein and Jaccard distances. We will use the <a href="https://www.nltk.org/">Natural Languate Toolkit</a> for the implementation of those distances.</p>
<p>Bear in mind that, in a real use case, we would very likely apply some normalization before testing for near duplicates (eg. to account for spaces and/or differences in upper/lowercase versions).</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">def</span> near_duplicates(factors, min_jaccard: <span class="bu">float</span>, max_levenshtein: <span class="bu">int</span>):</span>
<span id="cb1-2"><a href="#cb1-2"></a>  trigrams <span class="op">=</span> [ <span class="bu">set</span>(<span class="st">''</span>.join(g) <span class="cf">for</span> g <span class="kw">in</span> nltk.ngrams(f, <span class="dv">3</span>)) <span class="cf">for</span> f <span class="kw">in</span> factors ]</span>
<span id="cb1-3"><a href="#cb1-3"></a>  jaccard <span class="op">=</span> <span class="bu">dict</span>()</span>
<span id="cb1-4"><a href="#cb1-4"></a>  levenshtein <span class="op">=</span> <span class="bu">dict</span>()</span>
<span id="cb1-5"><a href="#cb1-5"></a>  <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="bu">len</span>(factors)):</span>
<span id="cb1-6"><a href="#cb1-6"></a>    <span class="cf">for</span> j <span class="kw">in</span> <span class="bu">range</span>(i<span class="op">+</span><span class="dv">1</span>, <span class="bu">len</span>(factors)):</span>
<span id="cb1-7"><a href="#cb1-7"></a>      denom <span class="op">=</span> <span class="bu">float</span>(<span class="bu">len</span>(trigrams[i] <span class="op">|</span> trigrams[j]))</span>
<span id="cb1-8"><a href="#cb1-8"></a>      <span class="cf">if</span> denom <span class="op">&gt;</span> <span class="dv">0</span>:</span>
<span id="cb1-9"><a href="#cb1-9"></a>        jaccard[(i,j)] <span class="op">=</span> <span class="bu">float</span>(<span class="bu">len</span>(trigrams[i] <span class="op">&amp;</span> trigrams[j])) <span class="op">/</span> denom</span>
<span id="cb1-10"><a href="#cb1-10"></a>      <span class="cf">else</span>:</span>
<span id="cb1-11"><a href="#cb1-11"></a>        jaccard[(i,j)] <span class="op">=</span> np.NaN</span>
<span id="cb1-12"><a href="#cb1-12"></a>      levenshtein[(i,j)] <span class="op">=</span> nltk.edit_distance(factors[i], factors[j])</span>
<span id="cb1-13"><a href="#cb1-13"></a></span>
<span id="cb1-14"><a href="#cb1-14"></a>  acum <span class="op">=</span> []</span>
<span id="cb1-15"><a href="#cb1-15"></a>  <span class="cf">for</span> (i,j),v <span class="kw">in</span> jaccard.items():</span>
<span id="cb1-16"><a href="#cb1-16"></a>    <span class="cf">if</span> v <span class="op">&gt;=</span> min_jaccard <span class="kw">and</span> levenshtein[(i,j)] <span class="op">&lt;=</span> max_levenshtein: </span>
<span id="cb1-17"><a href="#cb1-17"></a>      acum.append([i,j,factors[i], factors[j], jaccard[(i,j)], levenshtein[(i,j)]])</span>
<span id="cb1-18"><a href="#cb1-18"></a></span>
<span id="cb1-19"><a href="#cb1-19"></a>  <span class="cf">return</span> pd.DataFrame(acum, columns<span class="op">=</span>[<span class="st">'i'</span>, <span class="st">'j'</span>, <span class="st">'factor_i'</span>, <span class="st">'factor_j'</span>, <span class="st">'jaccard_ij'</span>, <span class="st">'levenshtein_ij'</span>])</span></code></pre></div>
<p>We can extend the above functions to explore a set of columns in a pandas data frame with the following code:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">def</span> df_dups(df, cols<span class="op">=</span><span class="va">None</span>, except_cols<span class="op">=</span>[], min_jaccard<span class="op">=</span><span class="fl">0.3</span>, max_levenshtein<span class="op">=</span><span class="dv">4</span>):</span>
<span id="cb2-2"><a href="#cb2-2"></a>  acum <span class="op">=</span> []</span>
<span id="cb2-3"><a href="#cb2-3"></a>  </span>
<span id="cb2-4"><a href="#cb2-4"></a>  <span class="cf">if</span> cols <span class="kw">is</span> <span class="va">None</span>:</span>
<span id="cb2-5"><a href="#cb2-5"></a>    cols <span class="op">=</span> df.columns</span>
<span id="cb2-6"><a href="#cb2-6"></a></span>
<span id="cb2-7"><a href="#cb2-7"></a>  <span class="cf">if</span> <span class="bu">isinstance</span>(min_jaccard, numbers.Number):</span>
<span id="cb2-8"><a href="#cb2-8"></a>    mj <span class="op">=</span> defaultdict(<span class="kw">lambda</span> : min_jaccard)</span>
<span id="cb2-9"><a href="#cb2-9"></a>  <span class="cf">else</span>:</span>
<span id="cb2-10"><a href="#cb2-10"></a>    mj <span class="op">=</span> min_jaccard</span>
<span id="cb2-11"><a href="#cb2-11"></a></span>
<span id="cb2-12"><a href="#cb2-12"></a>  <span class="cf">if</span> <span class="bu">isinstance</span>(max_levenshtein, numbers.Number):</span>
<span id="cb2-13"><a href="#cb2-13"></a>    ml <span class="op">=</span> defaultdict(<span class="kw">lambda</span>: max_levenshtein)</span>
<span id="cb2-14"><a href="#cb2-14"></a>  <span class="cf">else</span>:</span>
<span id="cb2-15"><a href="#cb2-15"></a>    ml <span class="op">=</span> max_levenshtein</span>
<span id="cb2-16"><a href="#cb2-16"></a></span>
<span id="cb2-17"><a href="#cb2-17"></a>  <span class="cf">for</span> c <span class="kw">in</span> cols:</span>
<span id="cb2-18"><a href="#cb2-18"></a></span>
<span id="cb2-19"><a href="#cb2-19"></a>    <span class="cf">if</span> c <span class="kw">in</span> except_cols <span class="kw">or</span> <span class="kw">not</span> is_string_dtype(df[c]):</span>
<span id="cb2-20"><a href="#cb2-20"></a>      <span class="cf">continue</span></span>
<span id="cb2-21"><a href="#cb2-21"></a></span>
<span id="cb2-22"><a href="#cb2-22"></a>    factors <span class="op">=</span> df[c].factorize()[<span class="dv">1</span>]</span>
<span id="cb2-23"><a href="#cb2-23"></a>    col_dups <span class="op">=</span> near_duplicates(factors, mj[c], ml[c])</span>
<span id="cb2-24"><a href="#cb2-24"></a>    col_dups[<span class="st">'col'</span>] <span class="op">=</span> c</span>
<span id="cb2-25"><a href="#cb2-25"></a>    acum.append(col_dups)</span>
<span id="cb2-26"><a href="#cb2-26"></a></span>
<span id="cb2-27"><a href="#cb2-27"></a>  <span class="cf">return</span> pd.concat(acum)</span></code></pre></div>
<p>If we apply the above code to the open dataset from the <a href="http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html">last blog post</a></p>
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1"></a>df_dups(df, cols<span class="op">=</span>[<span class="st">'Proveïdor'</span>,</span>
<span id="cb3-2"><a href="#cb3-2"></a>       <span class="st">'Objecte del contracte'</span>, </span>
<span id="cb3-3"><a href="#cb3-3"></a>       <span class="st">'Tipus Contracte'</span>])</span></code></pre></div>
<p>The column names are in Catalan since the dataset comes from the <a href="https://opendata-ajuntament.barcelona.cat/">Barcelona Council Open Data Hub</a>, and stand for the <em>contractor</em>, <em>the service descripction</em>, and the <em>type of service</em>.</p>
<p>We get the following results:</p>
<figure>
<img src="./images/near_dups_menors.png" title="spiderman double" class="center" width="850" alt /><figcaption> </figcaption>
</figure>
<p>Notice that the first two are actually valid, despite being similar (two companies with similar names and <em>electric</em> vs <em>electronic</em> supplies), while the last two seem to be a case of not controlling the variable domain properly (singular/plural entries). We should definitely decide for a canonical value (singular/plural) for the column “Tipus Contracte” before we compute any aggregation for those columns.</p>
<h2 id="conclusions">Conclusions</h2>
<p>We can use the above functions as helpers prior to performing some analysis on datasets where domain rules have not been previously enforced. They are compatible with JupyterLite, so no need to install anything for the test. For convenience, you can find a working notebook <a href="https://gist.github.com/jarnaldich/24ece34b6fb441c3ef8878a39a265b82">in this gist</a>.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="http://www.mmds.org/">Mining Of Massive Datasets</a> - An absolute classic book. Chapter 3, in particular, describes a scalable improvement on the technique described in this blog post.</li>
</ul>

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="./tags/jupyterlite.html">jupyterlite</a>, <a href="./tags/data.html">data</a>, <a href="./tags/nltk.html">nltk</a>, <a href="./tags/jaccard.html">jaccard</a>, <a href="./tags/qc.html">qc</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2023/03/19/near-duplicates.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

             <script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
             <span>
                <g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/" size="medium"></g:plusone>
             </span>
            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type"text javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    

                </div>
            </div>
        </div>

        <script>
          (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
            (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
              m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
                })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

                  ga('create', 'UA-6838480-2', 'jarnaldich.me');
                    ga('send', 'pageview');

        </script>


    </body>
</html>