<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>jarnaldich.me</title>
<link>http://jarnaldich.me</link>
<description><![CDATA[Joan Arnaldich's Blog]]></description>
<atom:link href="http://jarnaldich.me/rss.xml" rel="self"
type="application/rss+xml" />
<lastBuildDate>Sun, 19 Feb 2023 00:00:00 UT</lastBuildDate>
<item>
<title>Near Duplicates Detection</title>
<link>http://jarnaldich.me/blog/2023/03/19/near-duplicates.html</link>
<description><![CDATA[<h1>Near Duplicates Detection</h1>
<small>Posted on February 19, 2023 <a href="/blog/2023/03/19/near-duplicates.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>In my <a href="http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html">previous post</a> I set up a tool to ease the download of open datasets into a JupyterLite environment, which is a neat tool to perform simplish data wrangling without local installation.</p>
<p>In this post we will put that tool to good use for one of the most common data cleaning tasks: near duplicate detection.</p>
<figure>
<img src="/images/spiderman_double.png" title="spiderman double" class="center" alt="" /><figcaption> </figcaption>
</figure>
<h2 id="why-bother-about-near-duplicates">Why bother about near duplicates?</h2>
<p>Near duplicates can be a sign of a poor schema implementation, especially when they appear in variables with finite domains (factors). For example, in the following addresses dataset:</p>
<center>
<table>
<thead>
<tr class="header">
<th>kind</th>
<th>name</th>
<th>number</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>road</td>
<td>Abbey</td>
<td>3</td>
</tr>
<tr class="even">
<td>square</td>
<td>Level</td>
<td>666</td>
</tr>
<tr class="odd">
<td>drive</td>
<td>Mullholand</td>
<td>1</td>
</tr>
<tr class="even">
<td>boulevard</td>
<td>Broken Dreams</td>
<td>4</td>
</tr>
</tbody>
</table>
</center>
<p/>
<p>The “kind” variable could predictably take any of the following values:</p>
<ul>
<li>road</li>
<li>square</li>
<li>avenue</li>
<li>drive</li>
<li>boulevard</li>
</ul>
<p>The problem is that this kind of data is too often modelled as an unconstrained string, which makes it error prone: ‘sqare’ is just as valid as ‘square’. This generates all kinds of problems down the data analysis pipeline: what would happen if we analyzed the frequency of each kind?</p>
<p>There are ways to ensure that the variable “kind” can only take one of those values, depending on the underlying data infrastructure:</p>
<ul>
<li>In relational databases one could use <a href="https://www.postgresql.org/docs/current/sql-createdomain.html">domain types</a>, data validation <a href="https://www.postgresql.org/docs/current/sql-createtrigger.html">triggers</a>, or plain old dictionary tables with 1:n relationships.</li>
<li>Non-relational DBs may have other ways to ensure schema conformance, e.g. through <a href="https://www.mongodb.com/docs/manual/core/schema-validation/specify-json-schema/">JSON schema</a> or <a href="http://exist-db.org/exist/apps/doc/validation">XML schema</a>.</li>
<li>The fallback option is to guarantee this “by construction” via application validation (eg. using drop-downs in the UI), although this is a weaker solution since it incurs unnecessary coupling… and things can go sideways anyway, so in this scenario you should consider performing periodic schema validation tests on the data.</li>
</ul>
<p>Notice that all of these solutions require <em>a priori</em> knowledge of the domain.</p>
<p>But what happens when we are faced with an (underdocumented) dataset and asked to use it as a source for analysis? Or when we are asked to derive these rules <em>a posteriori</em>, eg. to improve a legacy database? Well, without knowledge of the domain, it is just not possible to decide whether two similar values are both correct (and just happen to be spelled similarly) or one is a misspelling. The best thing we can do is to detect which values are indeed similar and raise a flag.</p>
<p>This is when the techniques explained in this blog post come in handy.</p>
<h2 id="the-algorithm">The algorithm</h2>
<p>For the sake of simplicity, in this blog post we will assume our data is small enough that a quadratic algorithm is acceptable (for the real thing, see the references at the end). Beware that, on modern hardware, this simple case can take you farther than you would initially expect. My advice is to always <em>use the simplest solution that gets the job done</em>. It usually pays off in both development time and incidental complexity (reliance on external dependencies, etc…).</p>
<p>There are two main metrics for similarity. The first one, restricted to strings, is the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein</a> (aka edit) distance, which represents the number of edits needed to go from one string to another. This metric is hard to scale in general, since it requires pairwise comparison.</p>
<p>The other one is both more general and more scalable. It involves generating n-gram sets and then comparing them using a set-similarity measure.</p>
<h3 id="n-gram-sets">N-gram sets</h3>
<p>For each string, we can associate a set of n-grams that can be derived from it. N-grams (sometimes called <em>shingles</em>) are just substrings of length n. A typical case is <code>n=3</code>, which generates what is known as trigrams. For example, the trigram set for the string <code>"algorithm"</code> would be <code>['alg', 'lgo', 'gor', 'ori', 'rit', 'ith', 'thm']</code>.</p>
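<p>As a quick illustration, NLTK can generate these sets directly. A minimal sketch (the <code>trigram_set</code> helper name is ours, not part of any library):</p>
<pre><code>import nltk

def trigram_set(s, n=3):
    # nltk.ngrams yields tuples of characters; join each one back into a substring
    return set(''.join(g) for g in nltk.ngrams(s, n))

trigram_set("algorithm")
# {'alg', 'lgo', 'gor', 'ori', 'rit', 'ith', 'thm'}</code></pre>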
<h3 id="jaccard-index">Jaccard Index</h3>
<p>Once we have the n-gram set for a string, we can use a general metric for set similarity. A popular one is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard Index</a>, which is defined as the ratio of the cardinality of the intersection to the cardinality of the union of any two sets:</p>
<p><span class="math display">\[J(A,B) = \frac{|A \bigcap B|}{|A \bigcup B|}\]</span></p>
<p>Note that this index will range from 0, for disjoint sets, to 1, for exactly equal sets.</p>
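<p>In code, this is a one-liner. A minimal sketch, reusing the <code>trigram_set</code> helper above:</p>
<pre><code>def jaccard(a: set, b: set) -> float:
    # Intersection cardinality over union cardinality
    union = a | b
    return len(a & b) / len(union) if union else float('nan')

jaccard(trigram_set("square"), trigram_set("sqare"))
# 0.1666...: only 'are' is shared among the 6 distinct trigrams</code></pre>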
<h3 id="if-we-were-to-scale">If we were to scale…</h3>
<p>The advantage of using n-gram sets is that we can use similarity-preserving summaries of those sets (eg. via <a href="https://en.wikipedia.org/wiki/MinHash">minhashing</a>) which, combined with <a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">locality sensitive hashing</a> to efficiently compare pairs of sets, provides a massively scalable solution. In this post we will just assume that the size of our data is small enough that we do not need to scale.</p>
<h2 id="the-code">The Code</h2>
<p>All the above can be implemented in the following utility function, which takes an iterable of strings, the minimum Jaccard similarity, and the maximum Levenshtein distance for a pair to be considered a duplicate candidate. It returns a pandas dataframe with the pair indices, their values, and their mutual Levenshtein and Jaccard distances. We will use the <a href="https://www.nltk.org/">Natural Language Toolkit</a> for the implementation of those distances.</p>
<p>Bear in mind that, in a real use case, we would very likely apply some normalization before testing for near duplicates (eg. to account for spaces and/or differences in upper/lowercase versions).</p>
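<p>Note that the snippets below rely on a few imports that are not shown. A plausible preamble, matching the names used in the code, would be:</p>
<pre><code>import numbers
from collections import defaultdict

import nltk
import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype</code></pre>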
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">def</span> near_duplicates(factors, min_jaccard: <span class="bu">float</span>, max_levenshtein: <span class="bu">int</span>):</span>
<span id="cb1-2"><a href="#cb1-2"></a> trigrams <span class="op">=</span> [ <span class="bu">set</span>(<span class="st">''</span>.join(g) <span class="cf">for</span> g <span class="kw">in</span> nltk.ngrams(f, <span class="dv">3</span>)) <span class="cf">for</span> f <span class="kw">in</span> factors ]</span>
<span id="cb1-3"><a href="#cb1-3"></a> jaccard <span class="op">=</span> <span class="bu">dict</span>()</span>
<span id="cb1-4"><a href="#cb1-4"></a> levenshtein <span class="op">=</span> <span class="bu">dict</span>()</span>
<span id="cb1-5"><a href="#cb1-5"></a> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="bu">len</span>(factors)):</span>
<span id="cb1-6"><a href="#cb1-6"></a> <span class="cf">for</span> j <span class="kw">in</span> <span class="bu">range</span>(i<span class="op">+</span><span class="dv">1</span>, <span class="bu">len</span>(factors)):</span>
<span id="cb1-7"><a href="#cb1-7"></a> denom <span class="op">=</span> <span class="bu">float</span>(<span class="bu">len</span>(trigrams[i] <span class="op">|</span> trigrams[j]))</span>
<span id="cb1-8"><a href="#cb1-8"></a> <span class="cf">if</span> denom <span class="op">></span> <span class="dv">0</span>:</span>
<span id="cb1-9"><a href="#cb1-9"></a> jaccard[(i,j)] <span class="op">=</span> <span class="bu">float</span>(<span class="bu">len</span>(trigrams[i] <span class="op">&</span> trigrams[j])) <span class="op">/</span> denom</span>
<span id="cb1-10"><a href="#cb1-10"></a> <span class="cf">else</span>:</span>
<span id="cb1-11"><a href="#cb1-11"></a> jaccard[(i,j)] <span class="op">=</span> np.NaN</span>
<span id="cb1-12"><a href="#cb1-12"></a> levenshtein[(i,j)] <span class="op">=</span> nltk.edit_distance(factors[i], factors[j])</span>
<span id="cb1-13"><a href="#cb1-13"></a></span>
<span id="cb1-14"><a href="#cb1-14"></a> acum <span class="op">=</span> []</span>
<span id="cb1-15"><a href="#cb1-15"></a> <span class="cf">for</span> (i,j),v <span class="kw">in</span> jaccard.items():</span>
<span id="cb1-16"><a href="#cb1-16"></a> <span class="cf">if</span> v <span class="op">>=</span> min_jaccard <span class="kw">and</span> levenshtein[(i,j)] <span class="op"><=</span> max_levenshtein: </span>
<span id="cb1-17"><a href="#cb1-17"></a> acum.append([i,j,factors[i], factors[j], jaccard[(i,j)], levenshtein[(i,j)]])</span>
<span id="cb1-18"><a href="#cb1-18"></a></span>
<span id="cb1-19"><a href="#cb1-19"></a> <span class="cf">return</span> pd.DataFrame(acum, columns<span class="op">=</span>[<span class="st">'i'</span>, <span class="st">'j'</span>, <span class="st">'factor_i'</span>, <span class="st">'factor_j'</span>, <span class="st">'jaccard_ij'</span>, <span class="st">'levenshtein_ij'</span>])</span></code></pre></div>
<p>We can extend the above function to explore a set of columns in a pandas data frame with the following code:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">def</span> df_dups(df, cols<span class="op">=</span><span class="va">None</span>, except_cols<span class="op">=</span>[], min_jaccard<span class="op">=</span><span class="fl">0.3</span>, max_levenshtein<span class="op">=</span><span class="dv">4</span>):</span>
<span id="cb2-2"><a href="#cb2-2"></a> acum <span class="op">=</span> []</span>
<span id="cb2-3"><a href="#cb2-3"></a> </span>
<span id="cb2-4"><a href="#cb2-4"></a> <span class="cf">if</span> cols <span class="kw">is</span> <span class="va">None</span>:</span>
<span id="cb2-5"><a href="#cb2-5"></a> cols <span class="op">=</span> df.columns</span>
<span id="cb2-6"><a href="#cb2-6"></a></span>
<span id="cb2-7"><a href="#cb2-7"></a> <span class="cf">if</span> <span class="bu">isinstance</span>(min_jaccard, numbers.Number):</span>
<span id="cb2-8"><a href="#cb2-8"></a> mj <span class="op">=</span> defaultdict(<span class="kw">lambda</span> : min_jaccard)</span>
<span id="cb2-9"><a href="#cb2-9"></a> <span class="cf">else</span>:</span>
<span id="cb2-10"><a href="#cb2-10"></a> mj <span class="op">=</span> min_jaccard</span>
<span id="cb2-11"><a href="#cb2-11"></a></span>
<span id="cb2-12"><a href="#cb2-12"></a> <span class="cf">if</span> <span class="bu">isinstance</span>(max_levenshtein, numbers.Number):</span>
<span id="cb2-13"><a href="#cb2-13"></a> ml <span class="op">=</span> defaultdict(<span class="kw">lambda</span>: max_levenshtein)</span>
<span id="cb2-14"><a href="#cb2-14"></a> <span class="cf">else</span>:</span>
<span id="cb2-15"><a href="#cb2-15"></a> ml <span class="op">=</span> max_levenshtein</span>
<span id="cb2-16"><a href="#cb2-16"></a></span>
<span id="cb2-17"><a href="#cb2-17"></a> <span class="cf">for</span> c <span class="kw">in</span> cols:</span>
<span id="cb2-18"><a href="#cb2-18"></a></span>
<span id="cb2-19"><a href="#cb2-19"></a> <span class="cf">if</span> c <span class="kw">in</span> except_cols <span class="kw">or</span> <span class="kw">not</span> is_string_dtype(df[c]):</span>
<span id="cb2-20"><a href="#cb2-20"></a> <span class="cf">continue</span></span>
<span id="cb2-21"><a href="#cb2-21"></a></span>
<span id="cb2-22"><a href="#cb2-22"></a> factors <span class="op">=</span> df[c].factorize()[<span class="dv">1</span>]</span>
<span id="cb2-23"><a href="#cb2-23"></a> col_dups <span class="op">=</span> near_duplicates(factors, mj[c], ml[c])</span>
<span id="cb2-24"><a href="#cb2-24"></a> col_dups[<span class="st">'col'</span>] <span class="op">=</span> c</span>
<span id="cb2-25"><a href="#cb2-25"></a> acum.append(col_dups)</span>
<span id="cb2-26"><a href="#cb2-26"></a></span>
<span id="cb2-27"><a href="#cb2-27"></a> <span class="cf">return</span> pd.concat(acum)</span></code></pre></div>
<p>If we apply the above code to the open dataset from the <a href="http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html">last blog post</a>:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1"></a>df_dups(df, cols<span class="op">=</span>[<span class="st">'Proveïdor'</span>,</span>
<span id="cb3-2"><a href="#cb3-2"></a> <span class="st">'Objecte del contracte'</span>, </span>
<span id="cb3-3"><a href="#cb3-3"></a> <span class="st">'Tipus Contracte'</span>])</span></code></pre></div>
<p>The column names are in Catalan since the dataset comes from the <a href="https://opendata-ajuntament.barcelona.cat/">Barcelona Council Open Data Hub</a>, and stand for the <em>contractor</em>, the <em>service description</em>, and the <em>type of service</em>.</p>
<p>We get the following results:</p>
<figure>
<img src="/images/near_dups_menors.png" title="spiderman double" class="center" width="850" alt="" /><figcaption> </figcaption>
</figure>
<p>Notice that the first two are actually valid, despite being similar (two companies with similar names, and <em>electric</em> vs <em>electronic</em> supplies), while the last two seem to be a case of not controlling the variable domain properly (singular/plural entries). We should definitely decide on a canonical value (singular or plural) for the column “Tipus Contracte” before we compute any aggregation on it.</p>
<h2 id="conclusions">Conclusions</h2>
<p>We can use the above functions as helpers before performing analysis on datasets where domain rules have not been previously enforced. They are compatible with JupyterLite, so there is no need to install anything to try them out. For convenience, you can find a working notebook <a href="https://gist.github.com/jarnaldich/24ece34b6fb441c3ef8878a39a265b82">in this gist</a>.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="http://www.mmds.org/">Mining Of Massive Datasets</a> - An absolute classic book. Chapter 3, in particular, describes a scalable improvement on the technique described in this blog post.</li>
</ul>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/jupyterlite.html">jupyterlite</a>, <a href="/tags/data.html">data</a>, <a href="/tags/nltk.html">nltk</a>, <a href="/tags/jaccard.html">jaccard</a>, <a href="/tags/qc.html">qc</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2023/03/19/near-duplicates.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Sun, 19 Feb 2023 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2023/03/19/near-duplicates.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Dealing with CORS in JupyterLite</title>
<link>http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html</link>
<description><![CDATA[<h1>Dealing with CORS in JupyterLite</h1>
<small>Posted on January 29, 2023 <a href="/blog/2023/01/29/jupyterlite-jsonp.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>Following my <a href="/blog/2022/12/08/data-manipulation-jupyterlite.html">previous post</a>, I intend to see how far I can push JupyterLite as a platform for data analysis in the browser. The convenience of having a full environment with a sensible default set of libraries for dealing with data <a href="https://jupyterlite.github.io/demo/lab/index.html">one link away</a> is really something I could use.</p>
<p>But of course, for data analysis you need… well… data. There is certainly no shortage of public datasets on the internet, many of them published under some sort of Open Data initiative, such as the <a href="https://data.europa.eu/en/publications/open-data-maturity/2022">EU Open Data</a> one.</p>
<p>But, as soon as you try to use JupyterLite to directly fetch data from those sites, you find yourself running into a wall named the <a href="https://portswigger.net/web-security/cors/same-origin-policy">Same Origin Policy</a>.</p>
<h2 id="same-origin-policy">Same Origin Policy</h2>
<p>The Same Origin Policy is a protection mechanism designed to guarantee that resource providers (hosts) can restrict usage of their data to the pages they host. This is the safe thing to do when there is user data involved, since it prevents third parties from gaining access to eg. the user’s cookies and session IDs.</p>
<p>Notice that, when there is no user data involved, it is perfectly safe to relax this policy. In fact, as we will see, it is desirable to do so.</p>
<p>Browsers implement this protection by not allowing a page to perform requests to a server that is different from where it was downloaded unless this other server explicitly allows for it.</p>
<p>This behaviour bites hard at any application involving third party data analysis in the browser, as well as at a lot of webassembly “ports” of existing applications with networking capabilities, since the original desktop apps were not designed to deal with this kind of restriction<a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a> in the first place.</p>
<figure>
<img src="/images/cors.png" title="CORS" class="center" alt="" /><figcaption> </figcaption>
</figure>
<p>For example, if you are using the JupyterLite at <code>jupyterlite.github.io</code>, you will not be able to fetch data from any server beyond <code>github.io</code> unless it specifically allows it… which many data providers don’t. The request will be blocked by the browser itself (step 2 in the diagram above). You will either need to download the data yourself and upload it to JupyterLite, or self-host JupyterLite and the data on your own server (using it as a proxy for data requests), which kinda takes all the convenience out of it. As an example, evaluating this snippet in JupyterLite works exactly as you would expect:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
<span id="cb1-2"><a href="#cb1-2"></a><span class="im">from</span> js <span class="im">import</span> fetch</span>
<span id="cb1-3"><a href="#cb1-3"></a></span>
<span id="cb1-4"><a href="#cb1-4"></a>WORKS <span class="op">=</span> <span class="st">"https://raw.githubusercontent.com/jupyterlite/jupyterlite/main/examples/data/iris.csv"</span></span>
<span id="cb1-5"><a href="#cb1-5"></a>WORKS_CORS_ENABLED <span class="op">=</span> <span class="st">"https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD"</span></span>
<span id="cb1-6"><a href="#cb1-6"></a>FAILS_CORS_DISABLED <span class="op">=</span> <span class="st">"https://opendata-ajuntament.barcelona.cat/data/dataset/1121f3e2-bfb1-4dc4-9f39-1c5d1d72cba1/resource/69ae574f-adfc-4660-8f81-73103de169ff/download/2018_menors.csv"</span></span>
<span id="cb1-7"><a href="#cb1-7"></a></span>
<span id="cb1-8"><a href="#cb1-8"></a>res <span class="op">=</span> <span class="cf">await</span> fetch(WORKS)</span>
<span id="cb1-9"><a href="#cb1-9"></a>text <span class="op">=</span> <span class="cf">await</span> res.text()</span>
<span id="cb1-10"><a href="#cb1-10"></a><span class="bu">print</span>(text)</span></code></pre></div>
<p>There are two ways in which a data provider can accept cross-origin requests. The main one (the canonical, modern one) is known as <em>Cross Origin Resource Sharing</em> (CORS). By adding explicit permission in some dedicated HTTP headers, a resource provider can control <em>who</em> can access their data (the world or selected domains) and <em>how</em> (which HTTP methods).</p>
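<p>For example, a provider can open its data to everyone with a single response header, <code>Access-Control-Allow-Origin: *</code>. A minimal sketch using only the Python standard library (purely illustrative, not how any particular portal does it):</p>
<pre><code>from http.server import HTTPServer, SimpleHTTPRequestHandler

class CORSHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Grant read access to any origin before closing the header section
        self.send_header('Access-Control-Allow-Origin', '*')
        super().end_headers()

HTTPServer(('', 8000), CORSHandler).serve_forever()</code></pre>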
<p>Whenever this is not possible or practical (it needs access to the HTTP server configuration, and some hosting providers may not allow it), there is a second way: the JSONP callback.</p>
<h2 id="the-jsonp-callback">The JSONP Callback</h2>
<p>The JSONP callback works along these lines:</p>
<ol type="1">
<li>The calling page (eg. JupyterLite) defines a callback function, with a data parameter.</li>
<li>The calling page (JupyterLite) loads a script from the data provider, passing the name of the callback function.</li>
<li>The data provider script calls back the function with the requested data.</li>
</ol>
<p>Since the script was downloaded from the data provider’s domain, it can perform requests to that domain, so CORS restrictions do not apply.</p>
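<p>On the provider side, the whole trick amounts to wrapping the JSON payload in a call to the client-supplied function name. A hedged sketch (the helper and the values are illustrative):</p>
<pre><code>import json

def jsonp_response(callback: str, payload: dict) -> str:
    # The response body is executable JavaScript, not plain JSON
    return f"{callback}({json.dumps(payload)})"

jsonp_response('window.corsCallBack', {'result': {'records': []}})
# 'window.corsCallBack({"result": {"records": []}})'</code></pre>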
<p>This is not the recommended solution because it delegates to the application something that belongs to another layer: both the server and the consuming webpage have to be modified. One typical use case is making older browsers work. The other is kind of accidental: downloading from (poorly configured?) Open Data portals. Most Open Data portals (including administrative ones) use pre-built data management systems such as <a href="https://ckan.org">CKAN</a>. These can often handle JSONP by default, while HTTP servers have CORS disabled by default, so keeping the defaults leaves you with JSONP.</p>
<h2 id="implementing-a-jsonp-helper-in-jupyterlite">Implementing a JSONP helper in JupyterLite</h2>
<p>One of the things I love about the browser as a platform is that it is… pretty hackable… just press F12 and you can enter the kitchen. For example, you can see how JupyterLite “fakes” its filesystem on top of <a href="https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API">IndexedDB</a>, which is an API for storing persistent data in the browser.</p>
<p>So, we have a way to perform CORS requests and get data from a server implementing JSONP, and we can also fiddle with JupyterLite’s virtual filesystem… would it be possible to write a helper to download datasets into the virtual filesystem? You bet! Just paste the following code in a javascript kernel cell, or use the <code>%%javascript</code> magic in a python one:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span id="cb2-1"><a href="#cb2-1"></a><span class="va">window</span>.<span class="at">saveJSONP</span> <span class="op">=</span> <span class="kw">async</span> (urlString<span class="op">,</span> file_path<span class="op">,</span> mime_type<span class="op">=</span><span class="st">'text/json'</span><span class="op">,</span> binary<span class="op">=</span><span class="kw">false</span>) <span class="kw">=></span> <span class="op">{</span></span>
<span id="cb2-2"><a href="#cb2-2"></a> <span class="kw">const</span> sc <span class="op">=</span> <span class="va">document</span>.<span class="at">createElement</span>(<span class="st">'script'</span>)<span class="op">;</span></span>
<span id="cb2-3"><a href="#cb2-3"></a> <span class="kw">var</span> url <span class="op">=</span> <span class="kw">new</span> <span class="at">URL</span>(urlString)<span class="op">;</span></span>
<span id="cb2-4"><a href="#cb2-4"></a> <span class="va">url</span>.<span class="va">searchParams</span>.<span class="at">append</span>(<span class="st">'callback'</span><span class="op">,</span> <span class="st">'window.corsCallBack'</span>)<span class="op">;</span></span>
<span id="cb2-5"><a href="#cb2-5"></a> </span>
<span id="cb2-6"><a href="#cb2-6"></a> <span class="va">sc</span>.<span class="at">src</span> <span class="op">=</span> <span class="va">url</span>.<span class="at">toString</span>()<span class="op">;</span></span>
<span id="cb2-7"><a href="#cb2-7"></a></span>
<span id="cb2-8"><a href="#cb2-8"></a> <span class="va">window</span>.<span class="at">corsCallBack</span> <span class="op">=</span> <span class="kw">async</span> (data) <span class="kw">=></span> <span class="op">{</span></span>
<span id="cb2-9"><a href="#cb2-9"></a> <span class="va">console</span>.<span class="at">log</span>(data)<span class="op">;</span></span>
<span id="cb2-10"><a href="#cb2-10"></a></span>
<span id="cb2-11"><a href="#cb2-11"></a> <span class="co">// Open (or create) the file storage</span></span>
<span id="cb2-12"><a href="#cb2-12"></a> <span class="kw">var</span> open <span class="op">=</span> <span class="va">indexedDB</span>.<span class="at">open</span>(<span class="st">'JupyterLite Storage'</span>)<span class="op">;</span></span>
<span id="cb2-13"><a href="#cb2-13"></a></span>
<span id="cb2-14"><a href="#cb2-14"></a> <span class="co">// Create the schema</span></span>
<span id="cb2-15"><a href="#cb2-15"></a> <span class="va">open</span>.<span class="at">onupgradeneeded</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span></span>
<span id="cb2-16"><a href="#cb2-16"></a> <span class="cf">throw</span> <span class="at">Error</span>(<span class="st">'Error opening IndexedDB. Should not ever need to upgrade JupyterLite Storage Schema'</span>)<span class="op">;</span></span>
<span id="cb2-17"><a href="#cb2-17"></a> <span class="op">};</span></span>
<span id="cb2-18"><a href="#cb2-18"></a></span>
<span id="cb2-19"><a href="#cb2-19"></a> <span class="va">open</span>.<span class="at">onsuccess</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span></span>
<span id="cb2-20"><a href="#cb2-20"></a> <span class="co">// Start a new transaction</span></span>
<span id="cb2-21"><a href="#cb2-21"></a> <span class="kw">var</span> db <span class="op">=</span> <span class="va">open</span>.<span class="at">result</span><span class="op">;</span></span>
<span id="cb2-22"><a href="#cb2-22"></a> <span class="kw">var</span> tx <span class="op">=</span> <span class="va">db</span>.<span class="at">transaction</span>(<span class="st">"files"</span><span class="op">,</span> <span class="st">"readwrite"</span>)<span class="op">;</span></span>
<span id="cb2-23"><a href="#cb2-23"></a> <span class="kw">var</span> store <span class="op">=</span> <span class="va">tx</span>.<span class="at">objectStore</span>(<span class="st">"files"</span>)<span class="op">;</span></span>
<span id="cb2-24"><a href="#cb2-24"></a></span>
<span id="cb2-25"><a href="#cb2-25"></a> <span class="kw">var</span> now <span class="op">=</span> <span class="kw">new</span> <span class="at">Date</span>()<span class="op">;</span></span>
<span id="cb2-26"><a href="#cb2-26"></a></span>
<span id="cb2-27"><a href="#cb2-27"></a> <span class="kw">var</span> value <span class="op">=</span> <span class="op">{</span></span>
<span id="cb2-28"><a href="#cb2-28"></a> <span class="st">'name'</span><span class="op">:</span> <span class="va">file_path</span>.<span class="at">split</span>(<span class="ss">/</span><span class="sc">[\\/]</span><span class="ss">/</span>).<span class="at">pop</span>()<span class="op">,</span></span>
<span id="cb2-29"><a href="#cb2-29"></a> <span class="st">'path'</span><span class="op">:</span> file_path<span class="op">,</span></span>
<span id="cb2-30"><a href="#cb2-30"></a> <span class="st">'format'</span><span class="op">:</span> binary <span class="op">?</span> <span class="st">'binary'</span> : <span class="st">'text'</span><span class="op">,</span></span>
<span id="cb2-31"><a href="#cb2-31"></a> <span class="st">'created'</span><span class="op">:</span> <span class="va">now</span>.<span class="at">toISOString</span>()<span class="op">,</span></span>
<span id="cb2-32"><a href="#cb2-32"></a> <span class="st">'last_modified'</span><span class="op">:</span> <span class="va">now</span>.<span class="at">toISOString</span>()<span class="op">,</span></span>
<span id="cb2-33"><a href="#cb2-33"></a> <span class="st">'content'</span><span class="op">:</span> <span class="va">JSON</span>.<span class="at">stringify</span>(data)<span class="op">,</span></span>
<span id="cb2-34"><a href="#cb2-34"></a> <span class="st">'mimetype'</span><span class="op">:</span> mime_type<span class="op">,</span></span>
<span id="cb2-35"><a href="#cb2-35"></a> <span class="st">'type'</span><span class="op">:</span> <span class="st">'file'</span><span class="op">,</span></span>
<span id="cb2-36"><a href="#cb2-36"></a> <span class="st">'writable'</span><span class="op">:</span> <span class="kw">true</span></span>
<span id="cb2-37"><a href="#cb2-37"></a> <span class="op">};</span> </span>
<span id="cb2-38"><a href="#cb2-38"></a></span>
<span id="cb2-39"><a href="#cb2-39"></a> <span class="kw">const</span> countRequest <span class="op">=</span> <span class="va">store</span>.<span class="at">count</span>(file_path)<span class="op">;</span></span>
<span id="cb2-40"><a href="#cb2-40"></a> <span class="va">countRequest</span>.<span class="at">onsuccess</span> <span class="op">=</span> () <span class="kw">=></span> <span class="op">{</span></span>
<span id="cb2-41"><a href="#cb2-41"></a> <span class="va">console</span>.<span class="at">log</span>(<span class="va">countRequest</span>.<span class="at">result</span>)<span class="op">;</span></span>
<span id="cb2-42"><a href="#cb2-42"></a> <span class="cf">if</span>(<span class="va">countRequest</span>.<span class="at">result</span> <span class="op">></span> <span class="dv">0</span>) <span class="op">{</span></span>
<span id="cb2-43"><a href="#cb2-43"></a> <span class="va">store</span>.<span class="at">put</span>(value<span class="op">,</span> file_path)<span class="op">;</span></span>
<span id="cb2-44"><a href="#cb2-44"></a> <span class="op">}</span> <span class="cf">else</span> <span class="op">{</span></span>
<span id="cb2-45"><a href="#cb2-45"></a> <span class="va">store</span>.<span class="at">add</span>(value<span class="op">,</span> file_path)<span class="op">;</span></span>
<span id="cb2-46"><a href="#cb2-46"></a> <span class="op">}</span> </span>
<span id="cb2-47"><a href="#cb2-47"></a> <span class="op">};</span> </span>
<span id="cb2-48"><a href="#cb2-48"></a></span>
<span id="cb2-49"><a href="#cb2-49"></a> <span class="co">// Close the db when the transaction is done</span></span>
<span id="cb2-50"><a href="#cb2-50"></a> <span class="va">tx</span>.<span class="at">oncomplete</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span></span>
<span id="cb2-51"><a href="#cb2-51"></a> <span class="va">db</span>.<span class="at">close</span>()<span class="op">;</span></span>
<span id="cb2-52"><a href="#cb2-52"></a> <span class="op">};</span></span>
<span id="cb2-53"><a href="#cb2-53"></a> <span class="op">}</span></span>
<span id="cb2-54"><a href="#cb2-54"></a> <span class="op">}</span></span>
<span id="cb2-55"><a href="#cb2-55"></a></span>
<span id="cb2-56"><a href="#cb2-56"></a> <span class="va">document</span>.<span class="at">getElementsByTagName</span>(<span class="st">'head'</span>)[<span class="dv">0</span>].<span class="at">appendChild</span>(sc)<span class="op">;</span></span>
<span id="cb2-57"><a href="#cb2-57"></a><span class="op">}</span></span></code></pre></div>
<p>Then, each time you need to download a file, you can just use the following javascript:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span id="cb3-1"><a href="#cb3-1"></a><span class="op">%%</span>javascript</span>
<span id="cb3-2"><a href="#cb3-2"></a><span class="kw">var</span> url <span class="op">=</span> <span class="st">'https://opendata-ajuntament.barcelona.cat/data/es/api/3/action/datastore_search?resource_id=69ae574f-adfc-4660-8f81-73103de169ff'</span></span>
<span id="cb3-3"><a href="#cb3-3"></a><span class="va">window</span>.<span class="at">saveJSONP</span>(url<span class="op">,</span> <span class="st">'data/menors.json'</span>)</span></code></pre></div>
<p>To clarify, you should either use a python kernel with the <code>%%javascript</code> magic or the javascript kernel in <em>both</em> the definition and the call, otherwise they won’t see each other.</p>
<p>Then from a python cell we can read it the standard way:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1"></a><span class="im">import</span> json</span>
<span id="cb4-2"><a href="#cb4-2"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
<span id="cb4-3"><a href="#cb4-3"></a></span>
<span id="cb4-4"><a href="#cb4-4"></a><span class="cf">with</span> <span class="bu">open</span>(<span class="st">'data/menors.json'</span>, <span class="st">'r'</span>) <span class="im">as</span> f:</span>
<span id="cb4-5"><a href="#cb4-5"></a> data <span class="op">=</span> json.load(f)</span>
<span id="cb4-6"><a href="#cb4-6"></a> </span>
<span id="cb4-7"><a href="#cb4-7"></a>pd.read_json(json.dumps(data[<span class="st">'result'</span>][<span class="st">'records'</span>]))</span></code></pre></div>
<p>You can find a notebook with the whole code for your convenience <a href="https://gist.github.com/6418a53b50568a2b201bf592d854c0df#file-pythonjsonphelper-ipynb">in this GIST</a>.</p>
<h2 id="conclusions">Conclusions</h2>
<ul>
<li><p>We are just starting to see the potential of WebAssembly based solutions and the browser environment (IndexedDB…). This will increase the demand for data accessibility across origins.</p></li>
<li><p>If you are a data provider, please consider enabling CORS to promote the usage of your data. Otherwise you will be banning a growing market of web-based analysis tools from your data.</p></li>
</ul>
<h2 id="references">References</h2>
<ul>
<li>Simple IndexedDB <a href="https://gist.github.com/JamesMessinger/a0d6389a5d0e3a24814b">example</a></li>
<li><a href="https://github.com/jupyterlite/jupyterlite/discussions/91?sort=new">Sample code</a> for reading and writing files in JupyterLite (this is where the idea for this post comes from).</li>
<li><a href="https://enable-cors.org/">On CORS</a> and how to enable it.</li>
<li><a href="https://www.w3.org/wiki/CORS_Enabled">An w3 article</a> on how to open your data by enabling CORS and why it is important, with a list of providers implementing it.</li>
<li>A test <a href="https://www.test-cors.org/">web page</a> to check if a server is CORS enabled.</li>
</ul>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>If you are curious about the possible solutions to this problems, you may like to read how <a href="https://webvm.io/">WebVM</a>, a server-less virtual Debian, implements a general solution <a href="https://leaningtech.com/webvm-virtual-machine-with-networking-via-tailscale/">here</a>.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/jupyterlite.html">jupyterlite</a>, <a href="/tags/CORS.html">CORS</a>, <a href="/tags/data.html">data</a>, <a href="/tags/data.html">data</a>, <a href="/tags/webassembly.html">webassembly</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Sun, 29 Jan 2023 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Data Manipulation with JupyterLite</title>
<link>http://jarnaldich.me/blog/2022/12/08/data-manipulation-jupyterlite.html</link>
<description><![CDATA[<h1>Data Manipulation with JupyterLite</h1>
<small>Posted on December 8, 2022 <a href="/blog/2022/12/08/data-manipulation-jupyterlite.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<figure>
<img src="/images/jupyterlite.png" title="JupyterLite screenshot" class="center" width="400" alt="" /><figcaption> </figcaption>
</figure>
<p>Data comes in all sizes, shapes and qualities. The process of getting data ready for further analysis is equally crucial and tedious, as many data professionals will <a href="https://forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/">confirm</a>.</p>
<p>This process of many names (data wrangling/munging/cleaning) is often performed by an unholy mix of command-line tools, one-shot scripts and whatever is at hand depending on the data formats and computing environment.</p>
<p>I have been intending to share some of the tools I have found useful for this task in a series of blog posts, especially when they are unexpected or lesser-known. I will always try to demonstrate the tool with some common data processing application, and then highlight under which conditions the tool is most suitable.</p>
<h1 id="jupyterlite">JupyterLite</h1>
<p>Jupyter/JupyterLab are the de-facto standard notebook environment, especially among Python data scientists (although it was designed from the start to work with multiple languages or <em>kernels</em>, as the <a href="https://blog.jupyter.org/i-python-you-r-we-julia-baf064ca1fb6">name hints</a>). The frontend runs in a browser, and setting up the backend often requires a local installation, although some providers will let you spin up a backend in the cloud; see <a href="https://colab.research.google.com/">Google Colab</a> or <a href="https://mybinder.org/">The Binder Project</a>.</p>
<p>JupyterLite is a simpler/cleaner solution for simple analysis if sharing is not needed: it is a quite complete Jupyter environment in which all components run in the browser via WebAssembly compilation. Just visit its <a href="https://github.com/jupyterlite/jupyterlite">GitHub</a> project page for the details. Following some of the referenced projects and examples is a worthy rabbit hole to enter.</p>
<p>Some things you might not expect from a webassembly solution:</p>
<ul>
<li>Comes with most data-science libraries ready to use: matplotlib, pandas, numpy.</li>
<li>Can install third party packages via regular magic:</li>
</ul>
<pre><code>%pip install -q bqplot ipyleaflet</code></pre>
<h2 id="example-not-so-simple-excel-manipulation">Example: Not so Simple Excel Manipulation</h2>
<p>Sometimes you need to perform some not-so-simple manipulation in an Excel sheet that outgrows pivot tables but is kind of the bread and butter of pandas. Since copying an Excel table yields a tab-separated string, getting a pandas dataframe is as easy as firing up JupyterLite by visiting <a href="https://jupyterlite.github.io/demo/lab/index.html">this page</a>, opening a Python notebook, and evaluating this code in the first cell, pasting the Excel table between the multi-line string delimiters:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
<span id="cb2-2"><a href="#cb2-2"></a><span class="im">import</span> io</span>
<span id="cb2-3"><a href="#cb2-3"></a></span>
<span id="cb2-4"><a href="#cb2-4"></a>df <span class="op">=</span> pd.read_table(io.StringIO(<span class="st">"""</span></span>
<span id="cb2-5"><a href="#cb2-5"></a><span class="st"><PRESS C-V HERE></span></span>
<span id="cb2-6"><a href="#cb2-6"></a><span class="st">"""</span>))</span>
<span id="cb2-7"><a href="#cb2-7"></a>df</span></code></pre></div>
<h1 id="highlights">Highlights</h1>
<ul>
<li><strong>Useful for:</strong> The kind of analysis/manipulation one would use Pandas / Numpy for, especially if it involves visualizations or richer interaction.</li>
<li><strong>Useful when:</strong> You don’t have access to a pre-installed Jupyter environment but have a modern browser and internet connection at hand, or when you are dealing with sensitive data that should not leave your computer.</li>
</ul>
<h1 id="conclusion">Conclusion</h1>
<p>JupyterLite is an amazing project: as with many webassembly based solutions, we are just starting to see the possibilities. I encourage you to explore it beyond data manipulation because you can easily find other applications for it, from interactive dashboards to authoring diagrams…</p>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/data.html">data</a>, <a href="/tags/tools.html">tools</a>, <a href="/tags/jupyterlite.html">jupyterlite</a>, <a href="/tags/data-manipulation.html">data-manipulation</a>, <a href="/tags/data-wrangling.html">data-wrangling</a>, <a href="/tags/data-munging.html">data-munging</a>, <a href="/tags/webassembly.html">webassembly</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/12/08/data-manipulation-jupyterlite.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Thu, 08 Dec 2022 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2022/12/08/data-manipulation-jupyterlite.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Cloud Optimized Vector</title>
<link>http://jarnaldich.me/blog/2022/04/22/cloud-optimized-vector.html</link>
<description><![CDATA[<h1>Cloud Optimized Vector</h1>
<small>Posted on April 22, 2022 <a href="/blog/2022/04/22/cloud-optimized-vector.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>A few days ago a coworker of mine sent me a <a href="http://blog.cleverelephant.ca/2022/04/coshp.html">recent article</a> by Paul Ramsey (of <a href="http://blog.cleverelephant.ca/projects">Postgis et al.</a> fame) reflecting on what would a Cloud Optimized Vector format look like. His shocking proposal was … (didn’t see that coming)… shapefiles!</p>
<img src="https://imgs.xkcd.com/comics/duty_calls.png" title="fig:someone is wrong on the internet" class="center" alt=" " />
<p>
<center>
<small>Source: xkcd</small>
</center>
</p>
<p>I understand the article was written as a provocation for thought and as such makes some really good points. I also think that the general discussion over what a “cloud optimized vector” format would look like can be productive, but I am afraid that some less experienced developers (or, God forbid, managers!) would take the proposal of pushing shapefiles as the next cloud format a bit too literally, so I thought I would give some context and counterpoint to that article.</p>
<p>Him being Paul Ramsey and me being… well… <a href="/about.html">me</a>, I’d better motivate my opinion, so here comes a longish post. I will try to analyze what makes something <em>cloud optimized</em> based on the COG experience, see how that could be applied to a vector format, then justify why shapefiles should be (once again) avoided and finally see if we can get any closer to an ideal cloud vector format.</p>
<h2 id="what-makes-something-cloud-optimized-anyway">What makes something <em>cloud optimized</em> anyway?</h2>
<p><a href="https://www.cogeo.org/">Cloud Optimized GeoTiffs</a> are technically just a name for a GeoTiff with a <a href="https://github.com/cogeotiff/cog-spec/blob/master/spec.md">particular internal organization</a> (the sequencing of the bytes on disk). Tiff is a old format (old as in <em>venerable</em>) that allows for huge flexibility in terms of internal storage, data types, etc… For example, an image can be stored on disk one line after the other or, as is the case with COG, in small square “mini images” called tiles. Those tiles are then arranged in a larger grid and then several coarser-resolution layers (called overviews) of such grids can be stacked together to form an <a href="https://en.wikipedia.org/wiki/Pyramid_(image_processing)">image pyramid</a>.</p>
<img src="/images/pyramid.jpeg" title="fig:pyramid mage" class="center" width="400" alt=" " />
<p>
<center>
<small>Source: OsGEO Wiki</small>
</center>
</p>
<p>Of course, all data is properly indexed within the file so that accessing a tile of any pyramid level is easy (seeking byte ranges and at most some trivial multiplications or additions).</p>
<p>Whenever data is fetched in chunks through a channel with some latency (be it disk transfer or network), the efficiency of the overall processing can be improved by organizing data in the same order it will be read by the algorithm to compensate for the cost of setting up each read operation (seek times of spinning disks or protocol overhead in network communications).</p>
<p>A corollary of this is that <em>data formats are not efficient per se</em>: efficiency will always depend on the process/algorithm/use case. For example, for a raster point operation (such as applying a threshold mask for some value), organizing data line by line with no overviews is more efficient than a COG would be (…and that is why the GeoTiff spec allows for different configurations).</p>
<p>When dealing with spatial data, that principle gets hit by a loose version of <a href="https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography">Tobler’s First Law</a>: data representing a nearby area is more likely to be accessed next. For example, when a user is viewing an image, tiles that are close to the ones on screen are more likely to be fetched next than tiles representing remote areas (because users pan; they do not jump around randomly).</p>
<p>So what is the use case COG has in mind? Well, in case you hadn’t figured it out already, it is mainly <em>visualization</em><a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a>. Overviews allow for zooming in and out efficiently, and tiles help with moving along a subset of the higher resolution.</p>
<p>This pattern has been the ABC of raster optimization for decades in the geospatial world. Be it <a href="https://mapproxy.org/docs/1.13.0/caches.html">tile caches</a>, <a href="https://www.ogc.org/standards/tms">tiling schemes</a>, <a href="https://mapserver.org/optimization/raster.html">WMS map servers</a>, etc… they all<a href="#fn2" class="footnote-ref" id="fnref2" role="doc-noteref"><sup>2</sup></a> try to have the same properties:</p>
<ol type="1">
<li>Efficient navigation along contiguous resolutions (through overviews, pyramids, wavelets).</li>
<li>Efficient access of contiguous areas at a given resolution (tiling).</li>
</ol>
<p>This also turns out to be a pretty sensible organization if you cannot know in advance what kind of processing will be performed, because it gives you fast access to a manageable piece of the data: be it a summary (overview), a subset (a slice of tiles), or a combination of both.</p>
<p>Notice what it does <em>not</em> allow, though: it leaves you high and dry if you need a subset based on the <em>content</em> of the data. Eg. if I would like to see all pixels with a red channel value of 42, I would have to read the whole image.</p>
<p>COG is just a name for a GeoTiff implementing that organization. It goes a bit further than that by forcing a particular order<a href="#fn3" class="footnote-ref" id="fnref3" role="doc-noteref"><sup>3</sup></a> of the inner sections, which is smart because a client can ask for a chunk at the beginning and it will get all the directories (think indices, metadata) and probably some overviews. This makes sense because most viewers will start with the lowest zoom that covers the bounding box. It is also a nice organization for <em>streaming</em> tiles of data.</p>
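<p>This is also exactly what makes COGs work over plain HTTP: a client only needs range requests to pull the directories and the tiles it cares about. A hypothetical sketch (placeholder URL, using the well-known <code>requests</code> library):</p>
<pre><code>import requests

# Fetch only the first 16 KiB: for a well-formed COG this typically covers
# the header, the directories and possibly some overview data
resp = requests.get('https://example.com/some_cog.tif',
                    headers={'Range': 'bytes=0-16383'})
print(resp.status_code)  # 206 Partial Content if the server honors Range
chunk = resp.content</code></pre>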
<p>With that in mind, what would it mean for a vector format to be “cloud ready”? It should surely allow for the visualization use case, which here loosely means “rendering a map”, so that gives us an idea:</p>
<ol type="1">
<li>Having the ability to navigate different <em>zoom levels</em> / scales / generalization(s).</li>
<li>Efficient rendering of nearby areas at a given resolution.</li>
</ol>
<p>Notice that point 1 <em>as a process</em> is much harder for vector than for raster formats: for rasters it is (mostly) a question of choosing what “summary” measure we pick for the overview pixel corresponding to the underlying level (nearest neighbor, interpolation, average, other…). Generalizing a vector is much harder, first because it can break topology and geometry validity in many ways, but also because deciding if/how to represent different features at different scales requires cartographic design knowledge. But that is not relevant <em>for the format itself</em>: it just needs to be flexible enough to allow for different geometries at different resolutions and be efficient in navigating them (we do not care how hard it was to generate the different resolution levels).</p>
<p>While I think these two requirements are the equivalent of what a COG offers for raster, I am unsure we would consider them enough in the vector case. For example, we might not find it acceptable to be unable to take subsets or summaries based on attribute values, so there is a whole new level of complexity for vector <em>at the format level</em> as well. It all boils down to whether by <em>vector</em> we mean <em>features</em> or just <em>geometries</em>.</p>
<p>Now that I’ve established the two conditions I think define <em>cloud optimization</em>, at least by COG standards, let’s first dive into why I would say Shapefiles are <em>not</em> the future of the cloud.</p>
<h2 id="the-noble-art-of-bashing-shapefiles">The noble art of bashing shapefiles</h2>
<p>A lot has been argued over the years on the <a href="http://switchfromshapefile.org/">problems with shapefiles</a>. I will just go over the problems specifically relevant in a cloud setting.</p>
<p>First, they are a multiple file format. There is a cost in the OS layer for opening a file (name resolution, checking permissions), and the web server will probably add another layer on top of that, so please let’s not choose a format for the cloud that means opening a .shp, .dbf, .prj, .shx, .qix… and <a href="https://desktop.arcgis.com/en/arcmap/10.3/manage-data/shapefiles/shapefile-file-extensions.htm">potentially all of these</a>.</p>
<p>Second, they are limited to 2GB of file size. Most COGs are effectively BigTIFFs, and easily <em>need</em> to go far beyond that. In any case, one of the reasons for moving to the cloud is being able to process larger data.</p>
<p>Third, as for the use cases, they’re not even good for representation: you need several of them, one per layer/geometry type, to make most general maps (except maybe choropleths and other thematic maps). That already means multiplying the number of files even more.</p>
<p>Finally, Paul’s article only seems to care about property number 2: accessing contiguous areas at a given resolution. That is not cloud ready in the same way COGs are: we also need multi-scale map representation (property 1). You can of course use some attribute to filter which elements should appear at each resolution level, but that requires attribute indexing and clashes with spatial ordering. The other option would be using different shapefiles for the different levels, so even more files.</p>
<p>The spatial ordering tool the article suggests would certainly be useful for a streaming algorithm where spatial contiguity is relevant, but then again there are <a href="https://flatgeobuf.org/">options tailored for this use case</a>.</p>
<h1 id="is-there-a-better-option">Is there a better option?</h1>
<p>For the representation use case, which is what COGs provide, there certainly is, and it has been around for a long time. It’s just that we call them <a href="https://docs.mapbox.com/data/tilesets/guides/vector-tiles-introduction/">vector tiles</a>.</p>
<p>Vector tiles are exactly the application of the old tiling schema idea to vectors. It’s just that instead of mini-images, we have a <code>pbf</code> encoding of a <a href="https://github.com/mapbox/vector-tile-spec/tree/master/2.1#41-layers">format</a> for geometries and attributes.</p>
<p>Those tiles are then organized into the same scheme of grids and pyramids for different resolutions that we had in a COG. It’s just that most of the time the tiling does not depend on the dataset (though it can), but is <a href="https://www.maptiler.com/google-maps-coordinates-tile-bounds-projection/#3/15.00/50.00">globally fixed</a>, following a set of well-known tile schemas.</p>
<p>The tiles can have different schemas and information at different resolution levels (zoom) to allow for different generalization and visualization options.</p>
<p>We can pack all those tiles into a single <code>.mbtiles</code> file, which is a <code>sqlite</code>-based format containing the tiles as blobs. Having a global tile scheme is nice because you can then use SQLite’s <code>ATTACH</code> command to merge datasets, for example. And you can include any metadata (projection, etc…) inside a single file.</p>
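<p>To give an idea of how cheap that kind of merging is, here is a minimal sketch in plain SQLite SQL. The file name is hypothetical; <code>tiles(zoom_level, tile_column, tile_row, tile_data)</code> is the table layout mandated by the MBTiles spec:</p>
<pre><code>-- Merge the tiles of another MBTiles file sharing the same
-- global tile scheme into the current one.
ATTACH DATABASE 'other_region.mbtiles' AS other;

-- OR REPLACE deduplicates only if the spec's unique index on
-- (zoom_level, tile_column, tile_row) is present.
INSERT OR REPLACE INTO tiles (zoom_level, tile_column, tile_row, tile_data)
SELECT zoom_level, tile_column, tile_row, tile_data FROM other.tiles;

DETACH DATABASE other;</code></pre>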
<p>And of course there are libraries for rendering them in the browser (that is their primary use case), among <a href="https://github.com/mapbox/awesome-vector-tiles">many other things</a>. But Paul already knows that, since <a href="https://postgis.net/docs/ST_AsMVT.html">PostGIS</a> itself can generate them.</p>
<h1 id="are-we-there-yet">Are we there yet?</h1>
<p>Well, for representation, at least, we are close… but what if we want more complex queries on top of that (think spatial SQL)? With an <code>.mbtiles</code> alone you would need to actually decode each <code>.pbf</code> and query the attributes, so no luck there…</p>
<p>In a sqlite-based format (like <code>.mbtiles</code> or GeoPackage), it should be possible to add extra tables for queries that may or may not reference the main tiles… but that is an idea yet to be developed…</p>
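<p>Just to illustrate the idea, a sidecar table like the following could be shipped inside the same SQLite container. Every name here is made up; this is not part of any spec:</p>
<pre><code>-- Hypothetical attribute index pointing back to the tiles
-- that contain each feature.
CREATE TABLE feature_index (
    attr_name   TEXT,    -- e.g. 'pop2020'
    attr_value  TEXT,
    zoom_level  INTEGER, -- tile address of the feature
    tile_column INTEGER,
    tile_row    INTEGER
);
CREATE INDEX feature_index_attr ON feature_index (attr_name, attr_value);

-- "Which tiles contain features with population over one million?"
SELECT DISTINCT zoom_level, tile_column, tile_row
FROM feature_index
WHERE attr_name = 'pop2020' AND CAST(attr_value AS REAL) > 1e6;</code></pre>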
<p>The other caveat for <em>vector tiles</em> is the possible loss of information as a general geometry repository. Internal VT coordinates are integers (mainly because they are optimal for screen rendering algorithms), so there is a discrete resolution for each zoom level. Special care has to be taken so that there is no loss of information (i.e. making sure the zoom levels go deep enough for the internal grid cell to be below the resolution of the measuring instruments). So again, they may not be suitable for every application.</p>
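<p>A back-of-the-envelope check of that discrete resolution, assuming a Web Mercator tile grid (world width of roughly 40075017 m) and the usual 4096-unit tile extent:</p>
<pre><code>-- Ground size in meters of one integer coordinate step at zoom 14:
-- 2^14 = 16384 tiles across the world, 4096 units per tile.
SELECT 40075016.686 / (16384 * 4096.0) AS meters_per_step;
-- ~0.6 m, so centimeter-level source coordinates need deeper zooms.</code></pre>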
<h1 id="conclusion">Conclusion</h1>
<p>I hope I made my point on why I do not think shapefiles are the future of cloud-based vector formats (I wrote this in a bit of a hurry) and, more importantly, that the “cloud optimization” concept of the raster world can only be applied to vector formats in a limited way. I <em>do</em> think there is an interesting space to explore, though… Of course I may be completely wrong and maybe Paul has actually found something.</p>
<p>Time will tell, I guess…</p>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>The trick is that some cloud processing platforms such as the <a href="https://earthengine.google.com/">Google Earth Engine</a> are in fact processing on a <em>visualization driven</em> also called <em>lazy</em> processing scheme: only the data that is visualized at any moment by the user gets processed, on demand, so the same principle applies.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn2" role="doc-endnote"><p>Actually, not all, there are more sophisticated methods like wavelet transforms allowing for multi-resolution decoding in formats like .ECW/MrSID (commercial) or JP2000, but for the purpose of this post let’s just call it a very sophisitcated pyramid.<a href="#fnref2" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn3" role="doc-endnote"><p>For many applications, the hard requirements are tiles and overviews. The order of IFDs may not have much of an impact. I encourage the user to try and read a “regular” tiled tiff through <code>/vsicurl/</code> in QGIS. Or even a raster geopackage, for that matter.<a href="#fnref3" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/vector.html">vector</a>, <a href="/tags/vector-tiles.html">vector-tiles</a>, <a href="/tags/mbtiles.html">mbtiles</a>, <a href="/tags/sqlite.html">sqlite</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/04/22/cloud-optimized-vector.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Fri, 22 Apr 2022 01:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2022/04/22/cloud-optimized-vector.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>ETL The Haskell Way</title>
<link>http://jarnaldich.me/blog/2022/03/27/etl-the-haskell-way.html</link>
<description><![CDATA[<h1>ETL The Haskell Way</h1>
<small>Posted on March 27, 2022 <a href="/blog/2022/03/27/etl-the-haskell-way.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>Extract Transform Load (ETL) is a broad term for processes that read a subset of data in one format, perform a more or less involved transformation and then store it in a (maybe) different format. Those processes can of course be linked together to form larger data pipelines. As with many such general terms, this can mean very different things in terms of software architecture and implementation. For example, depending on the scale of the data, the solution may range from a Unix shell pipeline to a full-blown <a href="https://nifi.apache.org/">Apache NiFi</a> solution.</p>
<p>One common theme is data impedance mismatch between formats. Take for example JSON and XML. They are surely different, but for any particular application you can find a way to move data from one to the other. They even have their own <a href="https://chrispenner.ca/posts/traversal-systems">traversal systems</a> (<a href="https://stedolan.github.io/jq/">jq</a>’s syntax and <a href="https://developer.mozilla.org/en-US/docs/Web/XPath">XPath</a>).</p>
<p>The most widely used solution for small to medium data is to write small ad-hoc scripts. One can somewhat abstract over these formats by <a href="https://blog.lazy-evaluation.net/posts/linux/jq-xq-yq.html">abusing jq</a>.</p>
<p>In this blog post we will explore a more elegant way to perform such transformations using Haskell. The purpose of this post is just to pique your curiosity about what’s possible in this area with Haskell. It is definitely <em>not</em> intended as a tutorial on optics, which are not for Haskell beginners anyway…</p>
<h2 id="the-problem">The Problem</h2>
<p>We will be taking a <a href="https://datatracker.ietf.org/doc/html/rfc7946">geojson</a> dataset containing <a href="static/countries.geo.json">countries</a> at a world scale, taken from Natural Earth, and enriching it with <a href="static/population.xml">population data in XML</a> as provided by the World Bank API, so that it can be used, for example, to produce a <a href="">choropleth</a> <del>map</del><a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a> visualization.</p>
<figure>
<img src="/images/worldpop.png" title="this is not a map" class="center" alt="" /><figcaption> </figcaption>
</figure>
<p>Haskell is a curiously effective fit for this kind of problem due to the unlikely combination of three seemingly unrelated traits: its parsing libraries (driven by a community interested in programming language theory), <em>optics</em> (also driven by PLT, and by a gruesome syntax for record accessors, at least up to the recent addition of <code>RecordDotSyntax</code>), and the convenience of writing scripts with the <code>stack</code> tool (driven by the olden unreliability of <code>cabal</code> builds).</p>
<p>It is the fact that Haskell is so <em>abstract</em> that makes it easy to combine libraries never intended to work together in the first place. Haskell libraries tend to define their interfaces in very general terms (e.g. structures that can be mapped over, structures that can be “summarized”, etc…).</p>
<p>Let’s break down how these work together.</p>
<h3 id="parsing-libraries">Parsing Libraries</h3>
<p>Haskell comes from a long tradition of programming language theory applications, and it shines at building parsers, so there is no shortage of libraries for reading the most common formats. But, more important than the availability of parsing libraries itself, it is the <a href="https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/">parse, don’t validate</a> approach in these libraries that works here: most of them can decode (deserialize, parse) their input into a well-typed structured value in memory (think Abstract Syntax Tree).</p>
<p>So a typical workflow would be to read the data from disk into a more or less abstract representation in memory involving nested data structures, transform it into another in-memory representation (maybe generated from a template) through the use of optics, and then serialize it back to disk:</p>
<figure>
<img src="/images/haskell_lens_workflow.png" title="Haskell lens workflow" class="center" alt="" /><figcaption> </figcaption>
</figure>
<h3 id="optics">Optics</h3>
<p>Optics (lenses, prisms, traversals) are a way to abstract getters and setters in a composable way. Their surface syntax reads like “pinpointing” or “bookmarking” into a deeply nested data structure (think <code>XPath</code>), which makes it nice for visually keeping track of what is being read or altered.</p>
<p>The learning curve is wild, and the error messages convoluted, but the fact that in Haskell we can abstract accessors away from any particular data structure, and that there are well-defined functions to combine them, can reduce the size of your data transformation toolbox. And lighter toolboxes are easier to carry around with you.</p>
<h3 id="scripting">Scripting</h3>
<p>A lot of data wrangling programs are one-shot scripts, where you care about the result more than about the software itself. Having to create a new app each time can be tiresome, so being able to script while relying on a set of curated libraries to get the job done is really nice. Starting with a script that can be turned at any time into a full-blown app that works on all the major platforms is a plus.</p>
<h2 id="the-solution">The Solution</h2>
<p>The steps follow the typical workflow quite closely, in our case:</p>
<ol type="1">
<li>Parse the <code>.xml</code> file into a data structure (a document) in memory.</li>
<li>Build a map from country codes to population.</li>
<li>Read the geojson file with country info and get the array of features.</li>
<li>For each feature, create a new key with the population.</li>
</ol>
<p>This overall structure can be traced in our main function:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb1-1"><a href="#cb1-1"></a>main <span class="ot">=</span> <span class="kw">do</span></span>
<span id="cb1-2"><a href="#cb1-2"></a> xml <span class="ot"><-</span> XML.readFile XML.def <span class="st">"population.xml"</span> <span class="co">-- Parse the XML file into a memory document</span></span>
<span id="cb1-3"><a href="#cb1-3"></a> <span class="kw">let</span> pop2020Map <span class="ot">=</span> Map.fromList <span class="op">$</span> runReader records xml <span class="co">-- Build a map Country -> Population</span></span>
<span id="cb1-4"><a href="#cb1-4"></a> jsonBytes <span class="ot"><-</span> LB8.readFile <span class="st">"countries.geo.json"</span> <span class="co">-- Parse the countries geojson into memory</span></span>
<span id="cb1-5"><a href="#cb1-5"></a> <span class="kw">let</span> <span class="dt">Just</span> json <span class="ot">=</span> Json.decode<span class="ot"> jsonBytes ::</span> <span class="dt">Maybe</span> <span class="dt">Json.Value</span></span>
<span id="cb1-6"><a href="#cb1-6"></a> <span class="kw">let</span> featureList <span class="ot">=</span> runReader (features pop2020Map)<span class="ot"> json ::</span> [ <span class="dt">Json.Value</span> ] <span class="co">-- Get features with new population key</span></span>
<span id="cb1-7"><a href="#cb1-7"></a> <span class="kw">let</span> newJson <span class="ot">=</span> json <span class="op">&</span> key <span class="st">"features"</span> <span class="op">.~</span> (<span class="dt">Json.Array</span> <span class="op">$</span> V.fromList featureList) <span class="co">-- Update the original Json</span></span>
<span id="cb1-8"><a href="#cb1-8"></a> LB8.writeFile <span class="st">"countriesWithPopulation.geo.json"</span> <span class="op">$</span> Json.encode newJson <span class="co">-- Write back to disk</span></span></code></pre></div>
<p>The form of the input data is not especially well suited for this app. The world population XML is a table in disguise (remember the data impedance problem?): basically a list of records like this one:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode xml"><code class="sourceCode xml"><span id="cb2-1"><a href="#cb2-1"></a> <span class="kw"><record></span></span>
<span id="cb2-2"><a href="#cb2-2"></a> <span class="kw"><field</span><span class="ot"> name=</span><span class="st">"Country or Area"</span><span class="ot"> key=</span><span class="st">"ABW"</span><span class="kw">></span>Aruba<span class="kw"></field></span></span>
<span id="cb2-3"><a href="#cb2-3"></a> <span class="kw"><field</span><span class="ot"> name=</span><span class="st">"Item"</span><span class="ot"> key=</span><span class="st">"SP.POP.TOTL"</span><span class="kw">></span>Population, total<span class="kw"></field></span></span>
<span id="cb2-4"><a href="#cb2-4"></a> <span class="kw"><field</span><span class="ot"> name=</span><span class="st">"Year"</span><span class="kw">></span>1960<span class="kw"></field></span></span>
<span id="cb2-5"><a href="#cb2-5"></a> <span class="kw"><field</span><span class="ot"> name=</span><span class="st">"Value"</span><span class="kw">></span>54208<span class="kw"></field></span></span>
<span id="cb2-6"><a href="#cb2-6"></a> <span class="kw"></record></span></span></code></pre></div>
<p>That means the function that reads it has to associate information from two siblings in the XML tree, but that is easy using the <code>magnify</code> function inside a <code>Reader</code> monad:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb3-1"><a href="#cb3-1"></a><span class="ot">records ::</span> <span class="dt">Reader</span> <span class="dt">XML.Document</span> [(<span class="dt">T.Text</span>, <span class="dt">Scientific</span>)]</span>
<span id="cb3-2"><a href="#cb3-2"></a>records <span class="ot">=</span></span>
<span id="cb3-3"><a href="#cb3-3"></a> <span class="kw">let</span></span>
<span id="cb3-4"><a href="#cb3-4"></a> <span class="co">-- Lens to access an attribute from record to field. Intended to be composed.</span></span>
<span id="cb3-5"><a href="#cb3-5"></a> field name <span class="ot">=</span> nodes <span class="op">.</span> folded <span class="op">.</span> _Element <span class="op">.</span> named <span class="st">"field"</span> <span class="op">.</span> attributeIs <span class="st">"name"</span> name</span>
<span id="cb3-6"><a href="#cb3-6"></a> <span class="kw">in</span> <span class="kw">do</span></span>
<span id="cb3-7"><a href="#cb3-7"></a> <span class="co">-- Zoom and iterate all records</span></span>
<span id="cb3-8"><a href="#cb3-8"></a> magnify (root <span class="op">.</span> named <span class="st">"Root"</span> <span class="op">./</span> named <span class="st">"data"</span> <span class="op">./</span> named <span class="st">"record"</span>) <span class="op">$</span> <span class="kw">do</span></span>
<span id="cb3-9"><a href="#cb3-9"></a> record <span class="ot"><-</span> ask</span>
<span id="cb3-10"><a href="#cb3-10"></a> <span class="kw">let</span> name <span class="ot">=</span> record <span class="op">^?</span> (field <span class="st">"Country or Area"</span> <span class="op">.</span> attr <span class="st">"key"</span>)</span>
<span id="cb3-11"><a href="#cb3-11"></a> <span class="kw">let</span> year <span class="ot">=</span> record <span class="op">^?</span> (field <span class="st">"Year"</span> <span class="op">.</span> text)</span>
<span id="cb3-12"><a href="#cb3-12"></a> <span class="kw">let</span> val <span class="ot">=</span> record <span class="op">^?</span> (field <span class="st">"Value"</span> <span class="op">.</span> text)</span>
<span id="cb3-13"><a href="#cb3-13"></a> <span class="co">-- Returning a monoid instance (list) combines results.</span></span>
<span id="cb3-14"><a href="#cb3-14"></a> <span class="fu">return</span> <span class="op">$</span> <span class="kw">case</span> (name, year, val) <span class="kw">of</span></span>
<span id="cb3-15"><a href="#cb3-15"></a> (<span class="dt">Just</span> key, <span class="dt">Just</span> <span class="st">"2020"</span>, <span class="dt">Just</span> val) <span class="ot">-></span> [ (key, <span class="fu">read</span> <span class="op">$</span> T.unpack val) ]</span>
<span id="cb3-16"><a href="#cb3-16"></a> _ <span class="ot">-></span> []</span></code></pre></div>
<p>Note how lenses look almost like <code>XPath</code> expressions. The <code>features</code> function just takes the original features and appends a new key:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb4-1"><a href="#cb4-1"></a><span class="ot">features ::</span> <span class="dt">Map.Map</span> <span class="dt">T.Text</span> <span class="dt">Scientific</span> <span class="ot">-></span> <span class="dt">Reader</span> <span class="dt">Json.Value</span> [ <span class="dt">Json.Value</span> ]</span>
<span id="cb4-2"><a href="#cb4-2"></a>features popMap <span class="ot">=</span> <span class="kw">do</span></span>
<span id="cb4-3"><a href="#cb4-3"></a> magnify (key <span class="st">"features"</span> <span class="op">.</span> values) <span class="op">$</span> <span class="kw">do</span></span>
<span id="cb4-4"><a href="#cb4-4"></a> feature <span class="ot"><-</span> ask</span>
<span id="cb4-5"><a href="#cb4-5"></a> <span class="kw">let</span> <span class="dt">Just</span> <span class="fu">id</span> <span class="ot">=</span> feature <span class="op">^?</span> (key <span class="st">"id"</span> <span class="op">.</span> _String) <span class="co">-- Gross, but effective</span></span>
<span id="cb4-6"><a href="#cb4-6"></a> <span class="fu">return</span> <span class="op">$</span> <span class="kw">case</span> (Map.lookup <span class="fu">id</span> popMap) <span class="kw">of</span></span>
<span id="cb4-7"><a href="#cb4-7"></a> <span class="dt">Just</span> pop <span class="ot">-></span> [ feature <span class="op">&</span> key <span class="st">"properties"</span> <span class="op">.</span> _Object <span class="op">.</span> at <span class="st">"pop2020"</span> <span class="op">?~</span> <span class="dt">Json.Number</span> pop ]</span>
<span id="cb4-8"><a href="#cb4-8"></a> _ <span class="ot">-></span> [ feature ]</span></code></pre></div>
<p>That is really all it takes to perform the transformation. Please take a look at the full listing in <a href="https://gist.github.com/7cb4fd07bc8689f5c3bccb58b2e239ae#file-etl-hs">this gist</a>. Even with the imports, it can hardly get any shorter or more expressive than these fifty-something lines…</p>
<h2 id="revenge-of-the-nerds">Revenge of the Nerds</h2>
<p>So Haskell turns out to be the most practical, straightforward solution I have found for this kind of problem. Who knew?</p>
<p>I would absolutely not recommend learning Haskell just to solve this kind of problem (although I would absolutely recommend learning it for many other reasons). This is one of the occasions in which learning something just for the sake of it pays off in unexpected ways.</p>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>No lengend! No arrow pointing north! Questionable projection! This is not a post on map making, just an image to ease the reader’s eye after too much text for the internet…<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/haskell.html">haskell</a>, <a href="/tags/data.html">data</a>, <a href="/tags/xml.html">xml</a>, <a href="/tags/json.html">json</a>, <a href="/tags/geojson.html">geojson</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/03/27/etl-the-haskell-way.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Sun, 27 Mar 2022 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2022/03/27/etl-the-haskell-way.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Finding Curve Inflection Points in PostGIS</title>
<link>http://jarnaldich.me/blog/2022/02/06/postgis-curve-inflection.html</link>
<description><![CDATA[<h1>Finding Curve Inflection Points in PostGIS</h1>
<small>Posted on February 6, 2022 <a href="/blog/2022/02/06/postgis-curve-inflection.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>In this blog post I will present a way to find inflection points in a curve. An easy way to understand this: imagine the curve is a road we are driving along; we want to find the points at which we stop turning right and start turning left, or vice versa, as shown below:</p>
<figure>
<img src="/images/curve_inflection.png" title="Sample of curve inflection points" class="center" width="400" alt="" /><figcaption> </figcaption>
</figure>
<p>We will show a sketch of the solution and a practical implementation with <a href="https://postgis.net">PostGIS</a>.</p>
<h2 id="a-sketch-of-the-solution">A sketch of the solution</h2>
<p>This problem can be solved with pretty standard 2D computational geometry resources. In particular, the use of the <a href="https://mathworld.wolfram.com/CrossProduct.html">cross product</a> as a way to detect whether a point lies left or right of a given straight line will be useful here. The following pseudo-code is based on the determinant formula:</p>
<pre><code>function isLeft(Point a, Point b, Point c) {
    // Sign of the z-component of (b - a) x (c - a):
    // positive when c lies to the left of the directed line a -> b.
    return ((b.X - a.X)*(c.Y - a.Y) - (b.Y - a.Y)*(c.X - a.X)) > 0;
}</code></pre>
<p>In general, I am against implementing your own computational geometry code: direct translations of mathematical formulas are often plagued with rounding-off errors, corner cases and blatant inefficiencies. You would be better off using one of the excellent computational geometry libraries, such as <a href="https://libgeos.org">GEOS</a>, which started as a port of <a href="https://github.com/locationtech/jts">JTS</a>, or <a href="https://www.cgal.org/">CGAL</a>. Chances are that you are using them anyway, since they lie at the bottom of many <a href="https://www.nationalgeographic.org/encyclopedia/geographic-information-system-gis/">GIS</a> software stacks. This holds true for any non-trivial mathematics (linear algebra, optimization…). Remember: <strong><code>float</code>s are NOT real numbers</strong>.</p>
<p>In this case, where I cared a lot more about practicality than sheer efficiency, the use of SQL’s <code>numeric</code> type, which offers arbitrary-precision arithmetic at the expense of speed, prevents some of the rounding-off errors we would get with <code>double precision</code>, sparing us from implementing <a href="https://www.cs.cmu.edu/~quake/robust.html">fast robust predicates</a> ourselves.</p>
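<p>The classic floating-point example makes the difference immediate; in Postgres:</p>
<pre><code>SELECT 0.1::double precision + 0.2 = 0.3 AS dp_equal,  -- false
       0.1::numeric          + 0.2 = 0.3 AS num_equal; -- true</code></pre>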
<h2 id="postgis-implementaton">PostGIS implementaton</h2>
<p>I have long felt that Postgres/PostGIS is the nicest workbench for geospatial analysis (prove me wrong). In many use cases, being able to perform the analysis directly where your data is stored is unbeatable. Having to write an SQL script may be a drawback for some users, but it works wonders in terms of reproducibility and traceability for your data workflows.</p>
<p>In this particular case we will assume our input is a table of <code>LineString</code> geometry features, each one with its unique identifier. Of course, geometries should be properly indexed and tested for validity before any calculation. It is also often useful during development to restrict the calculation to an area of interest in order to shorten the iteration cycle when testing results and parameters.</p>
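<p>For concreteness, the input is assumed to look roughly like this (the names match the query below; the DDL itself is just a sketch, not taken from any production schema):</p>
<pre><code>CREATE TABLE input_contours (
    oid  serial PRIMARY KEY,
    geom geometry(LineString, 25831)
);
CREATE INDEX input_contours_geom_idx ON input_contours USING gist (geom);

-- Validity check before any calculation:
SELECT oid FROM input_contours WHERE NOT ST_IsValid(geom);</code></pre>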
<p>The sketch of the solution is:</p>
<ol type="1">
<li>Simplify the geometries to avoid noise (false positives). <code>ST_Simplify</code> or <code>ST_SimplifyPreserveTopology</code> will suffice.</li>
<li>Explode the points, keeping track of the original geometries, this can be easily done with <code>generate_series</code> and <code>ST_DumpPoints</code>.</li>
<li>We need 3 points to calculate <code>isLeft</code>: 2 to define the segment and the point to test for. So, for each point along the <code>LineString</code>, we get the X,Y coordinates of the point itself and of the 2 previous points. We will be checking the current point’s position relative to the segment defined by the two previous points. This also means that the turning point, when detected, will be the last point of the segment, that is: the previous point. I found this calculation to be surprisingly easy with Postgres window functions.</li>
<li>Use the above points to calculate a measure for isLeft.</li>
<li>Select the points where this measure changes.</li>
</ol>
<p>As usual, good code practices in general also apply to the database. In particular, <a href="https://www.postgresql.org/docs/13/queries-with.html">CTEs</a> can be used to clarify queries in the same way you would name variables or functions in any programming language: to enable reuse, but also to enhance readability through descriptive names. There is no excuse for <em>any</em> of the eye-burning SQL queries that are too often considered normal in the language.</p>
<p>Look back at the sketch of the solution and contrast it with the following implementation to see what I mean:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">WITH</span> </span>
<span id="cb2-2"><a href="#cb2-2"></a> <span class="co">-- Optional: area of interest.</span></span>
<span id="cb2-3"><a href="#cb2-3"></a> aoi <span class="kw">AS</span> (</span>
<span id="cb2-4"><a href="#cb2-4"></a> <span class="kw">SELECT</span> ST_SetSRID(</span>
<span id="cb2-5"><a href="#cb2-5"></a> ST_MakeBox2D(</span>
<span id="cb2-6"><a href="#cb2-6"></a> ST_Point(<span class="dv">467399</span>,<span class="dv">4671999</span>),</span>
<span id="cb2-7"><a href="#cb2-7"></a> ST_Point(<span class="dv">470200</span>,<span class="dv">4674000</span>))</span>
<span id="cb2-8"><a href="#cb2-8"></a> ,<span class="dv">25831</span>) </span>
<span id="cb2-9"><a href="#cb2-9"></a> <span class="kw">AS</span> geom</span>
<span id="cb2-10"><a href="#cb2-10"></a> ),</span>
<span id="cb2-11"><a href="#cb2-11"></a> <span class="co">-- Simplify geometries to avoid excessive noise. Tolerance is empiric and depends on application</span></span>
<span id="cb2-12"><a href="#cb2-12"></a> simplified <span class="kw">AS</span> (</span>
<span id="cb2-13"><a href="#cb2-13"></a> <span class="kw">SELECT</span> <span class="kw">oid</span> <span class="kw">as</span> contour_id, ST_Simplify(input_contours.geom, <span class="fl">0.2</span>) <span class="kw">AS</span> geom </span>
<span id="cb2-14"><a href="#cb2-14"></a> <span class="kw">FROM</span> input_contours, aoi</span>
<span id="cb2-15"><a href="#cb2-15"></a> <span class="kw">WHERE</span> input_contours.geom && aoi.geom</span>
<span id="cb2-16"><a href="#cb2-16"></a> ), </span>
<span id="cb2-17"><a href="#cb2-17"></a> <span class="co">-- Explode points generating index and keeping track of original curve</span></span>
<span id="cb2-18"><a href="#cb2-18"></a> points <span class="kw">AS</span> (</span>
<span id="cb2-19"><a href="#cb2-19"></a> <span class="kw">SELECT</span> contour_id,</span>
<span id="cb2-20"><a href="#cb2-20"></a> generate_series(<span class="dv">1</span>, st_numpoints(geom)) <span class="kw">AS</span> npoint,</span>
<span id="cb2-21"><a href="#cb2-21"></a> (ST_DumpPoints(geom)).geom <span class="kw">AS</span> geom</span>
<span id="cb2-22"><a href="#cb2-22"></a> <span class="kw">FROM</span> simplified</span>
<span id="cb2-23"><a href="#cb2-23"></a> ), </span>
<span id="cb2-24"><a href="#cb2-24"></a> <span class="co">-- Get the numeric values for X an Y of the current point </span></span>
<span id="cb2-25"><a href="#cb2-25"></a> coords <span class="kw">AS</span> (</span>
<span id="cb2-26"><a href="#cb2-26"></a> <span class="kw">SELECT</span> <span class="op">*</span>, st_x(geom):<span class="ch">:numeric</span> <span class="kw">AS</span> cx, st_y(geom):<span class="ch">:numeric</span> <span class="kw">AS</span> cy</span>
<span id="cb2-27"><a href="#cb2-27"></a> <span class="kw">FROM</span> points </span>
<span id="cb2-28"><a href="#cb2-28"></a> <span class="kw">ORDER</span> <span class="kw">BY</span> contour_id, npoint</span>
<span id="cb2-29"><a href="#cb2-29"></a> ),</span>
<span id="cb2-30"><a href="#cb2-30"></a> <span class="co">-- Add the values of the 2 previous points inside the same linestring</span></span>
<span id="cb2-31"><a href="#cb2-31"></a> <span class="co">-- LAG and PARTITION BY do all the work here.</span></span>
<span id="cb2-32"><a href="#cb2-32"></a> segments <span class="kw">AS</span> (</span>
<span id="cb2-33"><a href="#cb2-33"></a> <span class="kw">SELECT</span> <span class="op">*</span>, </span>
<span id="cb2-34"><a href="#cb2-34"></a> <span class="fu">LAG</span>(geom, <span class="dv">1</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> prev_geom, </span>
<span id="cb2-35"><a href="#cb2-35"></a> <span class="fu">LAG</span>(cx:<span class="ch">:numeric</span>, <span class="dv">2</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> ax, </span>
<span id="cb2-36"><a href="#cb2-36"></a> <span class="fu">LAG</span>(cy:<span class="ch">:numeric</span>, <span class="dv">2</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> ay, </span>
<span id="cb2-37"><a href="#cb2-37"></a> <span class="fu">LAG</span>(cx:<span class="ch">:numeric</span>, <span class="dv">1</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> bx, </span>
<span id="cb2-38"><a href="#cb2-38"></a> <span class="fu">LAG</span>(cy:<span class="ch">:numeric</span>, <span class="dv">1</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> <span class="kw">by</span></span>
<span id="cb2-39"><a href="#cb2-39"></a> <span class="kw">FROM</span> coords</span>
<span id="cb2-40"><a href="#cb2-40"></a> <span class="kw">ORDER</span> <span class="kw">BY</span> contour_id, npoint</span>
<span id="cb2-41"><a href="#cb2-41"></a> ),</span>
<span id="cb2-42"><a href="#cb2-42"></a> det <span class="kw">AS</span> (</span>
<span id="cb2-43"><a href="#cb2-43"></a> <span class="kw">SELECT</span> <span class="op">*</span>, </span>
<span id="cb2-44"><a href="#cb2-44"></a> (((bx<span class="op">-</span>ax)<span class="op">*</span>(cy<span class="op">-</span>ay)) <span class="op">-</span> ((<span class="kw">by</span><span class="op">-</span>ay)<span class="op">*</span>(cx<span class="op">-</span>ax))) <span class="kw">AS</span> det <span class="co">-- cross product in 2d</span></span>
<span id="cb2-45"><a href="#cb2-45"></a> <span class="kw">FROM</span> segments</span>
<span id="cb2-46"><a href="#cb2-46"></a> ),</span>
<span id="cb2-47"><a href="#cb2-47"></a> <span class="co">-- Uses the SIGN multipliaction as a proxy for XOR (change in convexity) </span></span>
<span id="cb2-48"><a href="#cb2-48"></a> convexity <span class="kw">AS</span> (</span>
<span id="cb2-49"><a href="#cb2-49"></a> <span class="kw">SELECT</span> <span class="op">*</span>, </span>
<span id="cb2-50"><a href="#cb2-50"></a> <span class="fu">SIGN</span>(det) <span class="op">*</span> <span class="fu">SIGN</span>(<span class="fu">lag</span>(det, <span class="dv">1</span>) <span class="kw">OVER</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id)) <span class="kw">AS</span> <span class="kw">change</span></span>
<span id="cb2-51"><a href="#cb2-51"></a> <span class="kw">FROM</span> det</span>
<span id="cb2-52"><a href="#cb2-52"></a> )</span>
<span id="cb2-53"><a href="#cb2-53"></a><span class="kw">SELECT</span> contour_id, npoint, prev_geom <span class="kw">AS</span> geom</span>
<span id="cb2-54"><a href="#cb2-54"></a><span class="kw">FROM</span> convexity</span>
<span id="cb2-55"><a href="#cb2-55"></a><span class="kw">WHERE</span> <span class="kw">change</span> <span class="op">=</span> <span class="op">-</span><span class="dv">1</span></span>
<span id="cb2-56"><a href="#cb2-56"></a><span class="kw">ORDER</span> <span class="kw">BY</span> contour_id, npoint</span></code></pre></div>
<p>Here’s what the results look like for a sample area:</p>
<figure>
<img src="/images/curve_inflection_2.png" title="Sample of curve inflection points results" class="center" alt="" /><figcaption> </figcaption>
</figure>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/postgres.html">postgres</a>, <a href="/tags/postgis.html">postgis</a>, <a href="/tags/curve.html">curve</a>, <a href="/tags/inflection.html">inflection</a>, <a href="/tags/GIS.html">GIS</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/02/06/postgis-curve-inflection.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Sun, 06 Feb 2022 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2022/02/06/postgis-curve-inflection.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Introspection in PostgreSQL</title>
<link>http://jarnaldich.me/blog/2021/08/30/postgres-introspection.html</link>
<description><![CDATA[<h1>Introspection in PostgreSQL</h1>
<small>Posted on August 30, 2021 <a href="/blog/2021/08/30/postgres-introspection.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p> </p>
<figure>
<img src="/images/introspection.png" title="Detail of Alexander Stirling Calder Introspection (c. 1935)" class="wrap" alt="" /><figcaption> </figcaption>
</figure>
<p>In coding, introspection refers to the ability of some systems to query and expose information on their own structure. Typical examples are being able to query an object’s methods or properties (e.g. Python’s <code>__dict__</code>).</p>
<p>In a DB system, it typically refers to the mechanism by which schema information regarding tables, attributes, foreign keys, indices, data types, etc… can be programmatically queried.</p>
<p>This is useful in many ways, e.g.:</p>
<ul>
<li>Code reuse: writing code that is schema-agnostic. For example, <a href="https://github.com/adrianandrei-ca/pgunit">pgunit</a>, a NUnit-style testing framework for PostgreSQL, automatically searches for functions whose names start with <code>test_</code> (see the sketch after this list).</li>
<li>Discovery and research of the structure of an ill-documented or legacy database.</li>
</ul>
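<p>As a taste, the kind of catalog lookup a framework like pgunit can perform boils down to something like this sketch (not its actual code):</p>
<pre><code>-- Find all functions whose name starts with test_
-- (the backslash escapes the LIKE wildcard _).
SELECT n.nspname AS schema, p.proname AS function
FROM pg_catalog.pg_proc p
JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE p.proname LIKE 'test\_%';</code></pre>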
<p>In this article we will explore some options for making use of the introspection capabilities of PostgreSQL.</p>
<h2 id="information-schema-vs-system-catalogs">Information schema vs system catalogs</h2>
<p>There are two main devices to query information about the objects defined in a Postgres database. The first one is the information schema, which is defined in the SQL standard and thus expected to be portable and to remain stable, but it cannot provide information about postgres-specific features. As with many aspects of the SQL standard, there are vendor-specific issues (most notably, Oracle does not implement it out of the box). If you are using introspection as part of a library and do not need postgres-specific information, this approach gives you a better chance of future compatibility across RDBMSs and even PostgreSQL versions.</p>
<p>The other approach involves querying the so-called <a href="https://www.postgresql.org/docs/13/catalogs.html">System Catalogs</a>. These are tables belonging to the <code>pg_catalog</code> schema. For example, the <code>pg_catalog.pg_class</code> (pseudo-)table catalogs tables and almost everything else that has columns or is otherwise similar to a table (views, materialized or not…). This approach is version-dependent, but I would be surprised to see major changes in the near future.</p>
<p>This is the approach we will be focusing on in this article, because the tooling and coding ergonomics PostgreSQL provides for it are more convenient, as you will see in the next sections.</p>
<h2 id="use-the-command-line-luke">Use the command-line, Luke</h2>
<p>The <code>psql</code> command-line client is a very powerful and often overlooked utility (as are many other command-line tools). Typing <code>\?</code> after connecting will show a plethora of commands that let you inspect the DB. What most people do not know, though, is that these commands are implemented as regular SQL queries against the system catalogs and that <strong>you can actually see the code</strong> just by invoking the <code>psql</code> client with the <code>-E</code> option. For example:</p>
<pre><code>PGPASSWORD=<password> psql -E -U <user> -h <host> <db></code></pre>
<p>And then asking for the description of the <code>pg_catalog.pg_class</code> table itself:</p>
<pre><code>\dt+ pg_catalog.pg_class</code></pre>
<p>yields:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb3-1"><a href="#cb3-1"></a><span class="op">*********</span> <span class="kw">QUERY</span> <span class="op">**********</span></span>
<span id="cb3-2"><a href="#cb3-2"></a><span class="kw">SELECT</span> n.nspname <span class="kw">as</span> <span class="ot">"Schema"</span>,</span>
<span id="cb3-3"><a href="#cb3-3"></a> c.relname <span class="kw">as</span> <span class="ot">"Name"</span>,</span>
<span id="cb3-4"><a href="#cb3-4"></a> <span class="cf">CASE</span> c.relkind </span>
<span id="cb3-5"><a href="#cb3-5"></a> <span class="cf">WHEN</span> <span class="st">'r'</span> <span class="cf">THEN</span> <span class="st">'table'</span> </span>
<span id="cb3-6"><a href="#cb3-6"></a> <span class="cf">WHEN</span> <span class="st">'v'</span> <span class="cf">THEN</span> <span class="st">'view'</span> </span>
<span id="cb3-7"><a href="#cb3-7"></a> <span class="cf">WHEN</span> <span class="st">'m'</span> <span class="cf">THEN</span> <span class="st">'materialized view'</span> </span>
<span id="cb3-8"><a href="#cb3-8"></a> <span class="cf">WHEN</span> <span class="st">'i'</span> <span class="cf">THEN</span> <span class="st">'index'</span> </span>
<span id="cb3-9"><a href="#cb3-9"></a> <span class="cf">WHEN</span> <span class="st">'S'</span> <span class="cf">THEN</span> <span class="st">'sequence'</span> </span>
<span id="cb3-10"><a href="#cb3-10"></a> <span class="cf">WHEN</span> <span class="st">'s'</span> <span class="cf">THEN</span> <span class="st">'special'</span> </span>
<span id="cb3-11"><a href="#cb3-11"></a> <span class="cf">WHEN</span> <span class="st">'f'</span> <span class="cf">THEN</span> <span class="st">'foreign table'</span> </span>
<span id="cb3-12"><a href="#cb3-12"></a> <span class="cf">WHEN</span> <span class="st">'p'</span> <span class="cf">THEN</span> <span class="st">'partitioned table'</span> </span>
<span id="cb3-13"><a href="#cb3-13"></a> <span class="cf">WHEN</span> <span class="st">'I'</span> <span class="cf">THEN</span> <span class="st">'partitioned index'</span> </span>
<span id="cb3-14"><a href="#cb3-14"></a> <span class="cf">END</span> <span class="kw">as</span> <span class="ot">"Type"</span>,</span>
<span id="cb3-15"><a href="#cb3-15"></a> pg_catalog.pg_get_userbyid(c.relowner) <span class="kw">as</span> <span class="ot">"Owner"</span>,</span>
<span id="cb3-16"><a href="#cb3-16"></a> pg_catalog.pg_size_pretty(pg_catalog.pg_table_size(c.<span class="kw">oid</span>)) <span class="kw">as</span> <span class="ot">"Size"</span>,</span>
<span id="cb3-17"><a href="#cb3-17"></a> pg_catalog.obj_description(c.<span class="kw">oid</span>, <span class="st">'pg_class'</span>) <span class="kw">as</span> <span class="ot">"Description"</span></span>
<span id="cb3-18"><a href="#cb3-18"></a><span class="kw">FROM</span> pg_catalog.pg_class c</span>
<span id="cb3-19"><a href="#cb3-19"></a> <span class="kw">LEFT</span> <span class="kw">JOIN</span> pg_catalog.pg_namespace n <span class="kw">ON</span> n.<span class="kw">oid</span> <span class="op">=</span> c.relnamespace</span>
<span id="cb3-20"><a href="#cb3-20"></a><span class="kw">WHERE</span> c.relkind <span class="kw">IN</span> (<span class="st">'r'</span>,<span class="st">'p'</span>,<span class="st">'s'</span>,<span class="st">''</span>)</span>
<span id="cb3-21"><a href="#cb3-21"></a> <span class="kw">AND</span> n.nspname !~ <span class="st">'^pg_toast'</span></span>
<span id="cb3-22"><a href="#cb3-22"></a> <span class="kw">AND</span> c.relname <span class="kw">OPERATOR</span>(pg_catalog.~) <span class="st">'^(pg_class)$'</span></span>
<span id="cb3-23"><a href="#cb3-23"></a> <span class="kw">AND</span> n.nspname <span class="kw">OPERATOR</span>(pg_catalog.~) <span class="st">'^(pg_catalog)$'</span></span>
<span id="cb3-24"><a href="#cb3-24"></a><span class="kw">ORDER</span> <span class="kw">BY</span> <span class="dv">1</span>,<span class="dv">2</span>;</span>
<span id="cb3-25"><a href="#cb3-25"></a><span class="op">**************************</span></span>
<span id="cb3-26"><a href="#cb3-26"></a></span>
<span id="cb3-27"><a href="#cb3-27"></a> <span class="kw">List</span> <span class="kw">of</span> relations</span>
<span id="cb3-28"><a href="#cb3-28"></a> <span class="kw">Schema</span> | Name | <span class="kw">Type</span> | Owner | <span class="kw">Size</span> | Description</span>
<span id="cb3-29"><a href="#cb3-29"></a><span class="co">------------|----------|-------|----------|--------|-------------</span></span>
<span id="cb3-30"><a href="#cb3-30"></a> pg_catalog | pg_class | <span class="kw">table</span> | postgres | <span class="dv">136</span> kB |</span>
<span id="cb3-31"><a href="#cb3-31"></a>(<span class="dv">1</span> <span class="kw">row</span>)</span></code></pre></div>
<p>This gives you a quite descriptive (and corner-case-complete) template to start your own code from. For example, in the former query we could replace the <code>^(pg_class)$</code> regex with some other filter. Bear in mind that this trick is only helpful with the system catalog approach.</p>
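<p>For instance, trimming that template down and swapping the name filter for a schema filter gives a quick listing of ordinary tables in <code>public</code>:</p>
<pre><code>-- The same skeleton, filtered by schema instead of by name.
SELECT n.nspname AS "Schema", c.relname AS "Name"
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname OPERATOR(pg_catalog.~) '^(public)$';</code></pre>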
<h2 id="regclasses-and-oids">Regclasses and OIDs</h2>
<p>Many objects in the system catalogs have a unique id in the form of an <code>oid</code> attribute. It is sometimes convenient to know that you can turn descriptive names into such <code>oid</code>s by casting to the <code>regclass</code> data type.</p>
<p>For example, in a somewhat circular turn of events, the attributes of the catalog table storing attribute information can be queried by name as:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb4-1"><a href="#cb4-1"></a><span class="kw">SELECT</span> attnum, attname, format_type(atttypid, atttypmod) <span class="kw">as</span> <span class="ot">"Type"</span> </span>
<span id="cb4-2"><a href="#cb4-2"></a><span class="kw">FROM</span> pg_attribute </span>
<span id="cb4-3"><a href="#cb4-3"></a><span class="kw">WHERE</span> attrelid <span class="op">=</span> <span class="st">'pg_attribute'</span>:<span class="ch">:regclass</span> </span>
<span id="cb4-4"><a href="#cb4-4"></a> <span class="kw">AND</span> attnum <span class="op">></span> <span class="dv">0</span> </span>
<span id="cb4-5"><a href="#cb4-5"></a> <span class="kw">AND</span> <span class="kw">NOT</span> attisdropped <span class="kw">ORDER</span> <span class="kw">BY</span> attnum;</span></code></pre></div>
<p>In the result of that query, we can see that attrelid should be an <code>oid</code>:</p>
<pre><code>attnum | attname | Type
-----------|---------------|-----------
1 | attrelid | oid
2 | attname | name
...
20 | attoptions | text[]
21 | attfdwoptions | text[]</code></pre>
<p>Without the <code>regclass</code> cast, querying by name would mean joining with <code>pg_class</code> and filtering by name. There are other types that will get you an oid from a string description for other kinds of objects (<code>regprocedure</code> for procedures, <code>regtype</code> for types, …).</p>
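<p>All of them work the same way; for example:</p>
<pre><code>SELECT 'pg_class'::regclass::oid        AS table_oid,     -- 1259
       'lower(text)'::regprocedure::oid AS function_oid,
       'integer'::regtype::oid          AS type_oid;      -- 23</code></pre>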
<h2 id="system-catalog-information-functions">System Catalog Information Functions</h2>
<p>Another interesting utility of the <code>pg_catalog</code> approach is the ability to translate catalog definitions back into SQL DDL. We saw one such function (<code>format_type</code>) in the previous example, but there are many more (for constraints, function source code…).</p>
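<p>A few of them in action; each rebuilds the textual definition of an existing catalog object:</p>
<pre><code>SELECT pg_get_viewdef('pg_catalog.pg_tables'::regclass, true); -- a view's body
SELECT pg_get_indexdef(indexrelid) FROM pg_index LIMIT 1;      -- an index's DDL
SELECT pg_get_constraintdef(oid) FROM pg_constraint LIMIT 1;   -- a constraint's DDL</code></pre>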
<p>Just refer to the <a href="https://www.postgresql.org/docs/13/functions-info.html#FUNCTIONS-INFO-CATALOG-TABLE">section in the manual</a> for more.</p>
<h2 id="inspecting-arbitrary-queries">Inspecting arbitrary queries</h2>
<p>As a side note, it might be useful to know that we can inspect the data types of any provided query by pretending to turn it into a temporary table. This can be handy for user-provided queries in external tools (injection caveats apply)…</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb6-1"><a href="#cb6-1"></a><span class="kw">CREATE</span> TEMP <span class="kw">TABLE</span> tmp <span class="kw">AS</span> <span class="kw">SELECT</span> <span class="dv">1</span>:<span class="ch">:numeric</span>, now() <span class="kw">LIMIT</span> <span class="dv">0</span>;</span></code></pre></div>
<h2 id="wrapping-up">Wrapping up</h2>
<p>As usual, <strong>good SW practices apply to DB code, too</strong>, and it is easy to isolate any incompatible code by defining a clear interface in your library: instead of querying the catalog everywhere, define a set of views or functions that expose the introspection information to the rest of your code and work as an API. This way, any future change in the system catalogs will not propagate further than those specific views. For example, if your application needs to know about tables and attribute data types, define a view that works as an interface between the system catalogs and your code:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb7-1"><a href="#cb7-1"></a><span class="kw">CREATE</span> <span class="kw">OR</span> <span class="kw">REPLACE</span> <span class="kw">VIEW</span> table_columns <span class="kw">AS</span></span>
<span id="cb7-2"><a href="#cb7-2"></a><span class="kw">WITH</span> table_oids <span class="kw">AS</span> (</span>
<span id="cb7-3"><a href="#cb7-3"></a> <span class="kw">SELECT</span> c.relname, c.<span class="kw">oid</span></span>
<span id="cb7-4"><a href="#cb7-4"></a> <span class="kw">FROM</span> pg_catalog.pg_class c</span>
<span id="cb7-5"><a href="#cb7-5"></a> <span class="kw">LEFT</span> <span class="kw">JOIN</span> pg_catalog.pg_namespace n <span class="kw">ON</span> n.<span class="kw">oid</span> <span class="op">=</span> c.relnamespace</span>
<span id="cb7-6"><a href="#cb7-6"></a> <span class="kw">WHERE</span> </span>
<span id="cb7-7"><a href="#cb7-7"></a> pg_catalog.pg_table_is_visible(c.<span class="kw">oid</span>) <span class="kw">AND</span> relkind <span class="op">=</span> <span class="st">'r'</span>),</span>
<span id="cb7-8"><a href="#cb7-8"></a> column_types <span class="kw">AS</span> (</span>
<span id="cb7-9"><a href="#cb7-9"></a> <span class="kw">SELECT</span></span>
<span id="cb7-10"><a href="#cb7-10"></a> toids.relname <span class="kw">AS</span> <span class="ot">"tablename"</span>, </span>
<span id="cb7-11"><a href="#cb7-11"></a> a.attname <span class="kw">as</span> <span class="ot">"column"</span>,</span>
<span id="cb7-12"><a href="#cb7-12"></a> pg_catalog.format_type(a.atttypid, a.atttypmod) <span class="kw">as</span> <span class="ot">"datatype"</span></span>
<span id="cb7-13"><a href="#cb7-13"></a> <span class="kw">FROM</span></span>
<span id="cb7-14"><a href="#cb7-14"></a> pg_catalog.pg_attribute a, table_oids toids</span>
<span id="cb7-15"><a href="#cb7-15"></a> <span class="kw">WHERE</span></span>
<span id="cb7-16"><a href="#cb7-16"></a> a.attnum <span class="op">></span> <span class="dv">0</span></span>
<span id="cb7-17"><a href="#cb7-17"></a> <span class="kw">AND</span> <span class="kw">NOT</span> a.attisdropped</span>
<span id="cb7-18"><a href="#cb7-18"></a> <span class="kw">AND</span> a.attrelid <span class="op">=</span> toids.<span class="kw">oid</span>)</span>
<span id="cb7-19"><a href="#cb7-19"></a><span class="kw">SELECT</span> <span class="op">*</span> <span class="kw">FROM</span> column_types;</span></code></pre></div>
<p>I will keep assembling such utility views as I find them useful in <a href="https://gist.github.com/jarnaldich/d5952a134d89dfac48d034ed141e86c5">this gist</a>.</p>
<p><strong>UPDATE Dec. 15th 2022:</strong> For any real use case, check <em>syonfox</em>’s solution (see comments) documented <a href="https://gist.github.com/jarnaldich/d5952a134d89dfac48d034ed141e86c5?permalink_comment_id=4401600">here</a>. It is way more powerful than my solution above, which I’ll only leave here just to keep things simple in this article.</p>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/postgres.html">postgres</a>, <a href="/tags/introspection.html">introspection</a>, <a href="/tags/database.html">database</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2021/08/30/postgres-introspection.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Mon, 30 Aug 2021 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2021/08/30/postgres-introspection.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Optimizing Geospatial Workloads</title>
<link>http://jarnaldich.me/blog/2020/02/29/optimizing-geospatial-workloads.html</link>
<description><![CDATA[<h1>Optimizing Geospatial Workloads</h1>
<small>Posted on February 29, 2020 <a href="/blog/2020/02/29/optimizing-geospatial-workloads.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>Large-area geospatial processing often involves splitting the area into smaller working tiles that can be processed or downloaded independently. As an example, the 25cm resolution orthophoto production in Catalonia is divided into 4275 rectangular tiles, as seen in the following image.</p>
<figure>
<img src="/images/tiles5k.png" title="Orthophoto Tiling" class="center" alt="Orthophoto tiling of Catalonia" />
</figure>
<p>Whenever a process can be applied to those tiles independently (i.e., not depending on their neighborhood), parallel processing is an easy way to increase throughput. In such environments, the total workload has to be distributed among a fixed, often limited, number of processing units (be they cores or computers). If the scheduling mechanism requires a predefined batch to be assigned to each unit (or if there is no scheduling mechanism at all), and the processing units have similar processing power, then the maximum speedup is attained when all batches contain an equal number of tiles (with the 4275 tiles above and 8 workers, for instance, the best assignment gives each worker 534 or 535 tiles).</p>
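<p>As a minimal sketch of the balance requirement alone (this is <em>not</em> the model developed below, and <code>nbatches = 8</code> is an arbitrary assumption), the even-split condition can be stated in MiniZinc as a per-batch cardinality bound:</p>
<div class="sourceCode"><pre class="sourceCode minizinc"><code class="sourceCode minizinc">include "globals.mzn";

int: ntiles = 4275;      % total number of tiles
int: nbatches = 8;       % assumed number of processing units
% ceil(ntiles / nbatches), the size of a balanced batch
int: max_size = (ntiles + nbatches - 1) div nbatches;

% batch[t] = processing unit assigned to tile t
array[1..ntiles] of var 1..nbatches: batch;

% no batch may exceed the balanced size
constraint forall(b in 1..nbatches)(count(batch, b) <= max_size);

solve satisfy;</code></pre></div>
<p>The spatial-continuity requirement is what makes the real problem interesting, and is what the model below addresses.</p>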
<p>Furthermore, since the result often has to be mosaicked in order to inspect it, or aggregated into a larger final product, it is desirable for each batch to keep spatial continuity, ideally forming an axis-parallel rectangle, since that is the basic form of georeference for projected geospatial imagery.</p>
<h2 id="the-problem">The problem</h2>
<p>This is a discrete optimization problem, so it can be tackled with the standard machinery. Since I had been dusting off my <a href="https://www.minizinc.org">MiniZinc</a> skills through Coursera’s discrete optimization series, I decided to give it a go.</p>
<h3 id="tile-scheme-representation">Tile scheme representation</h3>
<p>For convenience, the list of valid tiles can be read from an external <code>.dzn</code> data file.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb1-1"><a href="#cb1-1"></a>ntiles = <span class="fl">4275</span>;</span>
<span id="cb1-2"><a href="#cb1-2"></a>Tiles = [| <span class="fl">253</span>, <span class="fl">055</span></span>
<span id="cb1-3"><a href="#cb1-3"></a> | <span class="fl">254</span>, <span class="fl">055</span></span>
<span id="cb1-4"><a href="#cb1-4"></a> | <span class="fl">253</span>, <span class="fl">056</span></span>
<span id="cb1-5"><a href="#cb1-5"></a> | <span class="fl">254</span>, <span class="fl">056</span></span>
<span id="cb1-6"><a href="#cb1-6"></a> | <span class="fl">255</span>, <span class="fl">055</span></span>
<span id="cb1-7"><a href="#cb1-7"></a> | <span class="fl">255</span>, <span class="fl">056</span></span>
<span id="cb1-8"><a href="#cb1-8"></a> | <span class="fl">256</span>, <span class="fl">056</span></span>