<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>jarnaldich.me</title>
<link>http://jarnaldich.me</link>
<description><![CDATA[Joan Arnaldich's Blog]]></description>
<atom:link href="http://jarnaldich.me/rss.xml" rel="self"
type="application/rss+xml" />
<lastBuildDate>Sun, 19 Feb 2023 00:00:00 UT</lastBuildDate>
<item>
<title>Near Duplicates Detection</title>
<link>http://jarnaldich.me/blog/2023/03/19/near-duplicates.html</link>
<description><![CDATA[<h1>Near Duplicates Detection</h1>
<small>Posted on February 19, 2023 <a href="/blog/2023/03/19/near-duplicates.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>In my <a href="http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html">previous post</a> I set up a tool to ease the download of open datasets into a JupyterLite environment, which is a neat tool to perform simplish data wrangling without local installation.</p>
<p>In this post we will put that tool to good use for one of the most common data cleaning tasks: near duplicate detection.</p>
<figure>
<img src="/images/spiderman_double.png" title="spiderman double" class="center" alt="" /><figcaption> </figcaption>
</figure>
<h2 id="why-bother-about-near-duplicates">Why bother about near duplicates?</h2>
<p>Near duplicates can be a sign of a poor schema implementation, especially when they appear in variables with finite domains (factors). For example, in the following addresses dataset:</p>
<center>
<table>
<thead>
<tr class="header">
<th>kind</th>
<th>name</th>
<th>number</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>road</td>
<td>Abbey</td>
<td>3</td>
</tr>
<tr class="even">
<td>square</td>
<td>Level</td>
<td>666</td>
</tr>
<tr class="odd">
<td>drive</td>
<td>Mullholand</td>
<td>1</td>
</tr>
<tr class="even">
<td>boulevard</td>
<td>Broken Dreams</td>
<td>4</td>
</tr>
</tbody>
</table>
</center>
<p/>
<p>The “kind” variable could predictably take any of the following values:</p>
<ul>
<li>road</li>
<li>square</li>
<li>avenue</li>
<li>drive</li>
<li>boulevard</li>
</ul>
<p>The problem is that this kind of data is too often modelled as an unconstrained string, which makes it error prone: ‘sqare’ is just as valid as ‘square’. This generates all kinds of problems down the data analysis pipeline: what would happen if we analyzed the frequency of each kind?</p>
<p>There are ways to ensure that the variable “kind” can only take one of those values, depending on the underlying data infrastructure:</p>
<ul>
<li>In relational databases one could use <a href="https://www.postgresql.org/docs/current/sql-createdomain.html">domain types</a>, data validation <a href="https://www.postgresql.org/docs/current/sql-createtrigger.html">triggers</a>, or plain old dictionary tables with 1:n relationships.</li>
<li>Non-relational DBs may have other ways to ensure schema conformance, e.g. through <a href="https://www.mongodb.com/docs/manual/core/schema-validation/specify-json-schema/">JSON schema</a> or <a href="http://exist-db.org/exist/apps/doc/validation">XML schema</a>.</li>
<li>The fallback option is to guarantee this “by construction” via application validation (eg. using drop-downs in the UI), although this is a weaker solution since it incurs unnecessary coupling… and things can go sideways anyway, so in this scenario you should consider performing periodic schema validation tests on the data.</li>
</ul>
<p>Notice that all of these solutions require <em>a priori</em> knowledge of the domain.</p>
<p>But what happens when we are faced with an (underdocumented) dataset and asked to use it as a source for analysis? Or when we are asked to derive these rules <em>a posteriori</em>, eg. to improve a legacy database? Well, without knowledge of the domain, it is just not possible to decide whether two similar values are both correct (and just happen to be spelled similarly) or one is a misspelling. The best thing we can do is to detect which values are indeed similar and raise a flag.</p>
<p>This is when the techniques explained in this blog post come in handy.</p>
<h2 id="the-algorithm">The algorithm</h2>
<p>For the sake of simplicity, in this blog post we will assume our data is small enough that a quadratic algorithm is acceptable (for the real thing, see the references at the end). Beware that, on modern hardware, this simple case can take you farther than you would initially expect. My advice is to always <em>use the simplest solution that gets the job done</em>. It usually pays off in both development time and incidental complexity (reliance on external dependencies, etc…).</p>
<p>There are two main metrics for similarity. The first one, restricted to strings, is the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein</a> (aka edit) distance, which represents the number of edits needed to go from one string to another. This metric is hard to scale in general, since it requires pairwise comparison.</p>
<p>The other one is both more general and more scalable. It involves generating n-gram sets and then comparing them using a set-similarity measure.</p>
<h3 id="n-gram-sets">N-gram sets</h3>
<p>For each string, we can associate a set of n-grams that can be derived from it. N-grams (sometimes called <em>shingles</em>) are just substrings of length n. A typical case is <code>n=3</code>, which generates what is known as trigrams. For example, the trigram set for the string <code>"algorithm"</code> would be <code>['alg', 'lgo', 'gor', 'ori', 'rit', 'ith', 'thm']</code>.</p>
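<p>As a quick illustration, NLTK can generate these sets directly. A minimal sketch (the <code>trigram_set</code> helper name is ours, not part of any library):</p>
<pre><code>import nltk

def trigram_set(s, n=3):
    # nltk.ngrams yields tuples of characters; join each one back into a substring
    return set(''.join(g) for g in nltk.ngrams(s, n))

trigram_set("algorithm")
# {'alg', 'lgo', 'gor', 'ori', 'rit', 'ith', 'thm'}</code></pre>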
<h3 id="jaccard-index">Jaccard Index</h3>
<p>Once we have the n-gram set for a string, we can use a general metric for set similarity. A popular one is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard Index</a>, which is defined as the ratio of the cardinality of the intersection to the cardinality of the union of any two sets:</p>
<p><span class="math display">\[J(A,B) = \frac{|A \bigcap B|}{|A \bigcup B|}\]</span></p>
<p>Note that this index will range from 0, for disjoint sets, to 1, for exactly equal sets.</p>
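<p>In code, this is a one-liner. A minimal sketch, reusing the <code>trigram_set</code> helper above:</p>
<pre><code>def jaccard(a: set, b: set) -> float:
    # Intersection cardinality over union cardinality
    union = a | b
    return len(a & b) / len(union) if union else float('nan')

jaccard(trigram_set("square"), trigram_set("sqare"))
# 0.1666...: only 'are' is shared among the 6 distinct trigrams</code></pre>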
<h3 id="if-we-were-to-scale">If we were to scale…</h3>
<p>The advantage of using n-gram sets is that we can use similarity-preserving summaries of those sets (eg. via <a href="https://en.wikipedia.org/wiki/MinHash">minhashing</a>) which, combined with <a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">locality sensitive hashing</a> to efficiently compare pairs of sets, provides a massively scalable solution. In this post we will just assume that the size of our data is small enough that we do not need to scale.</p>
<h2 id="the-code">The Code</h2>
<p>All the above can be implemented in the following utility function, which takes an iterable of strings, the minimum Jaccard similarity, and the maximum Levenshtein distance for a pair to be considered a duplicate candidate. It returns a pandas dataframe with the pair indices, their values, and their mutual Levenshtein and Jaccard distances. We will use the <a href="https://www.nltk.org/">Natural Language Toolkit</a> for the implementation of those distances.</p>
<p>Bear in mind that, in a real use case, we would very likely apply some normalization before testing for near duplicates (eg. to account for spaces and/or differences in upper/lowercase versions).</p>
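<p>Note that the snippets below rely on a few imports that are not shown. A plausible preamble, matching the names used in the code, would be:</p>
<pre><code>import numbers
from collections import defaultdict

import nltk
import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype</code></pre>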
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">def</span> near_duplicates(factors, min_jaccard: <span class="bu">float</span>, max_levenshtein: <span class="bu">int</span>):</span>
<span id="cb1-2"><a href="#cb1-2"></a> trigrams <span class="op">=</span> [ <span class="bu">set</span>(<span class="st">''</span>.join(g) <span class="cf">for</span> g <span class="kw">in</span> nltk.ngrams(f, <span class="dv">3</span>)) <span class="cf">for</span> f <span class="kw">in</span> factors ]</span>
<span id="cb1-3"><a href="#cb1-3"></a> jaccard <span class="op">=</span> <span class="bu">dict</span>()</span>
<span id="cb1-4"><a href="#cb1-4"></a> levenshtein <span class="op">=</span> <span class="bu">dict</span>()</span>
<span id="cb1-5"><a href="#cb1-5"></a> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="bu">len</span>(factors)):</span>
<span id="cb1-6"><a href="#cb1-6"></a> <span class="cf">for</span> j <span class="kw">in</span> <span class="bu">range</span>(i<span class="op">+</span><span class="dv">1</span>, <span class="bu">len</span>(factors)):</span>
<span id="cb1-7"><a href="#cb1-7"></a> denom <span class="op">=</span> <span class="bu">float</span>(<span class="bu">len</span>(trigrams[i] <span class="op">|</span> trigrams[j]))</span>
<span id="cb1-8"><a href="#cb1-8"></a> <span class="cf">if</span> denom <span class="op">></span> <span class="dv">0</span>:</span>
<span id="cb1-9"><a href="#cb1-9"></a> jaccard[(i,j)] <span class="op">=</span> <span class="bu">float</span>(<span class="bu">len</span>(trigrams[i] <span class="op">&</span> trigrams[j])) <span class="op">/</span> denom</span>
<span id="cb1-10"><a href="#cb1-10"></a> <span class="cf">else</span>:</span>
<span id="cb1-11"><a href="#cb1-11"></a> jaccard[(i,j)] <span class="op">=</span> np.NaN</span>
<span id="cb1-12"><a href="#cb1-12"></a> levenshtein[(i,j)] <span class="op">=</span> nltk.edit_distance(factors[i], factors[j])</span>
<span id="cb1-13"><a href="#cb1-13"></a></span>
<span id="cb1-14"><a href="#cb1-14"></a> acum <span class="op">=</span> []</span>
<span id="cb1-15"><a href="#cb1-15"></a> <span class="cf">for</span> (i,j),v <span class="kw">in</span> jaccard.items():</span>
<span id="cb1-16"><a href="#cb1-16"></a> <span class="cf">if</span> v <span class="op">>=</span> min_jaccard <span class="kw">and</span> levenshtein[(i,j)] <span class="op"><=</span> max_levenshtein: </span>
<span id="cb1-17"><a href="#cb1-17"></a> acum.append([i,j,factors[i], factors[j], jaccard[(i,j)], levenshtein[(i,j)]])</span>
<span id="cb1-18"><a href="#cb1-18"></a></span>
<span id="cb1-19"><a href="#cb1-19"></a> <span class="cf">return</span> pd.DataFrame(acum, columns<span class="op">=</span>[<span class="st">'i'</span>, <span class="st">'j'</span>, <span class="st">'factor_i'</span>, <span class="st">'factor_j'</span>, <span class="st">'jaccard_ij'</span>, <span class="st">'levenshtein_ij'</span>])</span></code></pre></div>
<p>We can extend the above function to explore a set of columns in a pandas data frame with the following code:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">def</span> df_dups(df, cols<span class="op">=</span><span class="va">None</span>, except_cols<span class="op">=</span>[], min_jaccard<span class="op">=</span><span class="fl">0.3</span>, max_levenshtein<span class="op">=</span><span class="dv">4</span>):</span>
<span id="cb2-2"><a href="#cb2-2"></a> acum <span class="op">=</span> []</span>
<span id="cb2-3"><a href="#cb2-3"></a> </span>
<span id="cb2-4"><a href="#cb2-4"></a> <span class="cf">if</span> cols <span class="kw">is</span> <span class="va">None</span>:</span>
<span id="cb2-5"><a href="#cb2-5"></a> cols <span class="op">=</span> df.columns</span>
<span id="cb2-6"><a href="#cb2-6"></a></span>
<span id="cb2-7"><a href="#cb2-7"></a> <span class="cf">if</span> <span class="bu">isinstance</span>(min_jaccard, numbers.Number):</span>
<span id="cb2-8"><a href="#cb2-8"></a> mj <span class="op">=</span> defaultdict(<span class="kw">lambda</span> : min_jaccard)</span>
<span id="cb2-9"><a href="#cb2-9"></a> <span class="cf">else</span>:</span>
<span id="cb2-10"><a href="#cb2-10"></a> mj <span class="op">=</span> min_jaccard</span>
<span id="cb2-11"><a href="#cb2-11"></a></span>
<span id="cb2-12"><a href="#cb2-12"></a> <span class="cf">if</span> <span class="bu">isinstance</span>(max_levenshtein, numbers.Number):</span>
<span id="cb2-13"><a href="#cb2-13"></a> ml <span class="op">=</span> defaultdict(<span class="kw">lambda</span>: max_levenshtein)</span>
<span id="cb2-14"><a href="#cb2-14"></a> <span class="cf">else</span>:</span>
<span id="cb2-15"><a href="#cb2-15"></a> ml <span class="op">=</span> max_levenshtein</span>
<span id="cb2-16"><a href="#cb2-16"></a></span>
<span id="cb2-17"><a href="#cb2-17"></a> <span class="cf">for</span> c <span class="kw">in</span> cols:</span>
<span id="cb2-18"><a href="#cb2-18"></a></span>
<span id="cb2-19"><a href="#cb2-19"></a> <span class="cf">if</span> c <span class="kw">in</span> except_cols <span class="kw">or</span> <span class="kw">not</span> is_string_dtype(df[c]):</span>
<span id="cb2-20"><a href="#cb2-20"></a> <span class="cf">continue</span></span>
<span id="cb2-21"><a href="#cb2-21"></a></span>
<span id="cb2-22"><a href="#cb2-22"></a> factors <span class="op">=</span> df[c].factorize()[<span class="dv">1</span>]</span>
<span id="cb2-23"><a href="#cb2-23"></a> col_dups <span class="op">=</span> near_duplicates(factors, mj[c], ml[c])</span>
<span id="cb2-24"><a href="#cb2-24"></a> col_dups[<span class="st">'col'</span>] <span class="op">=</span> c</span>
<span id="cb2-25"><a href="#cb2-25"></a> acum.append(col_dups)</span>
<span id="cb2-26"><a href="#cb2-26"></a></span>
<span id="cb2-27"><a href="#cb2-27"></a> <span class="cf">return</span> pd.concat(acum)</span></code></pre></div>
<p>If we apply the above code to the open dataset from the <a href="http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html">last blog post</a>:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1"></a>df_dups(df, cols<span class="op">=</span>[<span class="st">'Proveïdor'</span>,</span>
<span id="cb3-2"><a href="#cb3-2"></a> <span class="st">'Objecte del contracte'</span>, </span>
<span id="cb3-3"><a href="#cb3-3"></a> <span class="st">'Tipus Contracte'</span>])</span></code></pre></div>
<p>The column names are in Catalan since the dataset comes from the <a href="https://opendata-ajuntament.barcelona.cat/">Barcelona Council Open Data Hub</a>, and stand for the <em>contractor</em>, the <em>service description</em>, and the <em>type of service</em>.</p>
<p>We get the following results:</p>
<figure>
<img src="/images/near_dups_menors.png" title="spiderman double" class="center" width="850" alt="" /><figcaption> </figcaption>
</figure>
<p>Notice that the first two are actually valid, despite being similar (two companies with similar names, and <em>electric</em> vs <em>electronic</em> supplies), while the last two seem to be a case of not controlling the variable domain properly (singular/plural entries). We should definitely decide on a canonical value (singular or plural) for the column “Tipus Contracte” before we compute any aggregation on it.</p>
<h2 id="conclusions">Conclusions</h2>
<p>We can use the above functions as helpers before performing analysis on datasets where domain rules have not been previously enforced. They are compatible with JupyterLite, so there is no need to install anything to try them out. For convenience, you can find a working notebook <a href="https://gist.github.com/jarnaldich/24ece34b6fb441c3ef8878a39a265b82">in this gist</a>.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="http://www.mmds.org/">Mining Of Massive Datasets</a> - An absolute classic book. Chapter 3, in particular, describes a scalable improvement on the technique described in this blog post.</li>
</ul>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/jupyterlite.html">jupyterlite</a>, <a href="/tags/data.html">data</a>, <a href="/tags/nltk.html">nltk</a>, <a href="/tags/jaccard.html">jaccard</a>, <a href="/tags/qc.html">qc</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2023/03/19/near-duplicates.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Sun, 19 Feb 2023 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2023/03/19/near-duplicates.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Dealing with CORS in JupyterLite</title>
<link>http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html</link>
<description><![CDATA[<h1>Dealing with CORS in JupyterLite</h1>
<small>Posted on January 29, 2023 <a href="/blog/2023/01/29/jupyterlite-jsonp.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>Following my <a href="/blog/2022/12/08/data-manipulation-jupyterlite.html">previous post</a>, I intend to see how far I can push JupyterLite as a platform for data analysis in the browser. The convenience of having a full environment with a sensible default set of libraries for dealing with data <a href="https://jupyterlite.github.io/demo/lab/index.html">one link away</a> is really something I could use.</p>
<p>But of course, for data analysis you need… well… data. There is certainly no shortage of public datasets on the internet, many of them published under some sort of Open Data initiative, such as the <a href="https://data.europa.eu/en/publications/open-data-maturity/2022">EU Open Data</a> one.</p>
<p>But, as soon as you try to use JupyterLite to directly fetch data from those sites, you find yourself running into a wall named the <a href="https://portswigger.net/web-security/cors/same-origin-policy">Same Origin Policy</a>.</p>
<h2 id="same-origin-policy">Same Origin Policy</h2>
<p>The Same Origin Policy is a protection mechanism designed to guarantee that resource providers (hosts) can restrict usage of their data to the pages they host. This is the safe thing to do when there is user data involved, since it prevents third parties from gaining access to eg. the user’s cookies and session IDs.</p>
<p>Notice that, when there is no user data involved, it is perfectly safe to relax this policy. In fact, as we will see, it is desirable to do so.</p>
<p>Browsers implement this protection by not allowing a page to perform requests to a server that is different from where it was downloaded unless this other server explicitly allows for it.</p>
<p>This behaviour bites hard at any application involving third party data analysis in the browser, as well as at a lot of webassembly “ports” of existing applications with networking capabilities, since the original desktop apps were not designed to deal with this kind of restriction<a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a> in the first place.</p>
<figure>
<img src="/images/cors.png" title="CORS" class="center" alt="" /><figcaption> </figcaption>
</figure>
<p>For example, if you are using the JupyterLite at <code>jupyterlite.github.io</code>, you will not be able to fetch data from any server beyond <code>github.io</code> unless it specifically allows it… which many data providers don’t. The request will be blocked by the browser itself (step 2 in the diagram above). You will either need to download the data yourself and upload it to JupyterLite, or self-host JupyterLite and the data on your own server (using it as a proxy for data requests), which kinda takes all the convenience out of it. As an example, evaluating this snippet in JupyterLite works exactly as you would expect:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
<span id="cb1-2"><a href="#cb1-2"></a><span class="im">from</span> js <span class="im">import</span> fetch</span>
<span id="cb1-3"><a href="#cb1-3"></a></span>
<span id="cb1-4"><a href="#cb1-4"></a>WORKS <span class="op">=</span> <span class="st">"https://raw.githubusercontent.com/jupyterlite/jupyterlite/main/examples/data/iris.csv"</span></span>
<span id="cb1-5"><a href="#cb1-5"></a>WORKS_CORS_ENABLED <span class="op">=</span> <span class="st">"https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD"</span></span>
<span id="cb1-6"><a href="#cb1-6"></a>FAILS_CORS_DISABLED <span class="op">=</span> <span class="st">"https://opendata-ajuntament.barcelona.cat/data/dataset/1121f3e2-bfb1-4dc4-9f39-1c5d1d72cba1/resource/69ae574f-adfc-4660-8f81-73103de169ff/download/2018_menors.csv"</span></span>
<span id="cb1-7"><a href="#cb1-7"></a></span>
<span id="cb1-8"><a href="#cb1-8"></a>res <span class="op">=</span> <span class="cf">await</span> fetch(WORKS)</span>
<span id="cb1-9"><a href="#cb1-9"></a>text <span class="op">=</span> <span class="cf">await</span> res.text()</span>
<span id="cb1-10"><a href="#cb1-10"></a><span class="bu">print</span>(text)</span></code></pre></div>
<p>There are two ways in which a data provider can accept cross-origin requests. The main one (the canonical, modern one) is known as <em>Cross Origin Resource Sharing</em> (CORS). By adding explicit permission in some dedicated HTTP headers, a resource provider can control <em>who</em> can access their data (the world or selected domains) and <em>how</em> (which HTTP methods).</p>
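<p>For example, a provider can open its data to everyone with a single response header, <code>Access-Control-Allow-Origin: *</code>. A minimal sketch using only the Python standard library (purely illustrative, not how any particular portal does it):</p>
<pre><code>from http.server import HTTPServer, SimpleHTTPRequestHandler

class CORSHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Grant read access to any origin before closing the header section
        self.send_header('Access-Control-Allow-Origin', '*')
        super().end_headers()

HTTPServer(('', 8000), CORSHandler).serve_forever()</code></pre>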
<p>Whenever this is not possible or practical (it needs access to the HTTP server configuration, and some hosting providers may not allow it), there is a second way: the JSONP callback.</p>
<h2 id="the-jsonp-callback">The JSONP Callback</h2>
<p>The JSONP callback works along these lines:</p>
<ol type="1">
<li>The calling page (eg. JupyterLite) defines a callback function, with a data parameter.</li>
<li>The calling page (JupyterLite) loads a script from the data provider, passing the name of the callback function.</li>
<li>The data provider script calls back the function with the requested data.</li>
</ol>
<p>Since the script was downloaded from the data provider’s domain, it can perform requests to that domain, so CORS restrictions do not apply.</p>
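<p>On the provider side, the whole trick amounts to wrapping the JSON payload in a call to the client-supplied function name. A hedged sketch (the helper and the values are illustrative):</p>
<pre><code>import json

def jsonp_response(callback: str, payload: dict) -> str:
    # The response body is executable JavaScript, not plain JSON
    return f"{callback}({json.dumps(payload)})"

jsonp_response('window.corsCallBack', {'result': {'records': []}})
# 'window.corsCallBack({"result": {"records": []}})'</code></pre>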
<p>This is not the recommended solution because it delegates to the application something that belongs to another layer: both the server and the consuming webpage have to be modified. One typical use case is making older browsers work. The other is kind of accidental: downloading from (poorly configured?) Open Data portals. Most Open Data portals (including administrative ones) use pre-built data management systems such as <a href="https://ckan.org">CKAN</a>. These can often handle JSONP by default, while HTTP servers have CORS disabled by default, so keeping the defaults leaves you with JSONP.</p>
<h2 id="implementing-a-jsonp-helper-in-jupyterlite">Implementing a JSONP helper in JupyterLite</h2>
<p>One of the things I love about the browser as a platform is that it is… pretty hackable… just press F12 and you can enter the kitchen. For example, you can see how JupyterLite “fakes” its filesystem on top of <a href="https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API">IndexedDB</a>, which is an API for storing persistent data in the browser.</p>
<p>So, we have a way to perform CORS requests and get data from a server implementing JSONP, and we can also fiddle with JupyterLite’s virtual filesystem… would it be possible to write a helper to download datasets into the virtual filesystem? You bet! Just paste the following code in a javascript kernel cell, or use the <code>%%javascript</code> magic in a python one:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span id="cb2-1"><a href="#cb2-1"></a><span class="va">window</span>.<span class="at">saveJSONP</span> <span class="op">=</span> <span class="kw">async</span> (urlString<span class="op">,</span> file_path<span class="op">,</span> mime_type<span class="op">=</span><span class="st">'text/json'</span><span class="op">,</span> binary<span class="op">=</span><span class="kw">false</span>) <span class="kw">=></span> <span class="op">{</span></span>
<span id="cb2-2"><a href="#cb2-2"></a> <span class="kw">const</span> sc <span class="op">=</span> <span class="va">document</span>.<span class="at">createElement</span>(<span class="st">'script'</span>)<span class="op">;</span></span>
<span id="cb2-3"><a href="#cb2-3"></a> <span class="kw">var</span> url <span class="op">=</span> <span class="kw">new</span> <span class="at">URL</span>(urlString)<span class="op">;</span></span>
<span id="cb2-4"><a href="#cb2-4"></a> <span class="va">url</span>.<span class="va">searchParams</span>.<span class="at">append</span>(<span class="st">'callback'</span><span class="op">,</span> <span class="st">'window.corsCallBack'</span>)<span class="op">;</span></span>
<span id="cb2-5"><a href="#cb2-5"></a> </span>
<span id="cb2-6"><a href="#cb2-6"></a> <span class="va">sc</span>.<span class="at">src</span> <span class="op">=</span> <span class="va">url</span>.<span class="at">toString</span>()<span class="op">;</span></span>
<span id="cb2-7"><a href="#cb2-7"></a></span>
<span id="cb2-8"><a href="#cb2-8"></a> <span class="va">window</span>.<span class="at">corsCallBack</span> <span class="op">=</span> <span class="kw">async</span> (data) <span class="kw">=></span> <span class="op">{</span></span>
<span id="cb2-9"><a href="#cb2-9"></a> <span class="va">console</span>.<span class="at">log</span>(data)<span class="op">;</span></span>
<span id="cb2-10"><a href="#cb2-10"></a></span>
<span id="cb2-11"><a href="#cb2-11"></a> <span class="co">// Open (or create) the file storage</span></span>
<span id="cb2-12"><a href="#cb2-12"></a> <span class="kw">var</span> open <span class="op">=</span> <span class="va">indexedDB</span>.<span class="at">open</span>(<span class="st">'JupyterLite Storage'</span>)<span class="op">;</span></span>
<span id="cb2-13"><a href="#cb2-13"></a></span>
<span id="cb2-14"><a href="#cb2-14"></a> <span class="co">// Create the schema</span></span>
<span id="cb2-15"><a href="#cb2-15"></a> <span class="va">open</span>.<span class="at">onupgradeneeded</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span></span>
<span id="cb2-16"><a href="#cb2-16"></a> <span class="cf">throw</span> <span class="at">Error</span>(<span class="st">'Error opening IndexedDB. Should not ever need to upgrade JupyterLite Storage Schema'</span>)<span class="op">;</span></span>
<span id="cb2-17"><a href="#cb2-17"></a> <span class="op">};</span></span>
<span id="cb2-18"><a href="#cb2-18"></a></span>
<span id="cb2-19"><a href="#cb2-19"></a> <span class="va">open</span>.<span class="at">onsuccess</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span></span>
<span id="cb2-20"><a href="#cb2-20"></a> <span class="co">// Start a new transaction</span></span>
<span id="cb2-21"><a href="#cb2-21"></a> <span class="kw">var</span> db <span class="op">=</span> <span class="va">open</span>.<span class="at">result</span><span class="op">;</span></span>
<span id="cb2-22"><a href="#cb2-22"></a> <span class="kw">var</span> tx <span class="op">=</span> <span class="va">db</span>.<span class="at">transaction</span>(<span class="st">"files"</span><span class="op">,</span> <span class="st">"readwrite"</span>)<span class="op">;</span></span>
<span id="cb2-23"><a href="#cb2-23"></a> <span class="kw">var</span> store <span class="op">=</span> <span class="va">tx</span>.<span class="at">objectStore</span>(<span class="st">"files"</span>)<span class="op">;</span></span>
<span id="cb2-24"><a href="#cb2-24"></a></span>
<span id="cb2-25"><a href="#cb2-25"></a> <span class="kw">var</span> now <span class="op">=</span> <span class="kw">new</span> <span class="at">Date</span>()<span class="op">;</span></span>
<span id="cb2-26"><a href="#cb2-26"></a></span>
<span id="cb2-27"><a href="#cb2-27"></a> <span class="kw">var</span> value <span class="op">=</span> <span class="op">{</span></span>
<span id="cb2-28"><a href="#cb2-28"></a> <span class="st">'name'</span><span class="op">:</span> <span class="va">file_path</span>.<span class="at">split</span>(<span class="ss">/</span><span class="sc">[\\/]</span><span class="ss">/</span>).<span class="at">pop</span>()<span class="op">,</span></span>
<span id="cb2-29"><a href="#cb2-29"></a> <span class="st">'path'</span><span class="op">:</span> file_path<span class="op">,</span></span>
<span id="cb2-30"><a href="#cb2-30"></a> <span class="st">'format'</span><span class="op">:</span> binary <span class="op">?</span> <span class="st">'binary'</span> : <span class="st">'text'</span><span class="op">,</span></span>
<span id="cb2-31"><a href="#cb2-31"></a> <span class="st">'created'</span><span class="op">:</span> <span class="va">now</span>.<span class="at">toISOString</span>()<span class="op">,</span></span>
<span id="cb2-32"><a href="#cb2-32"></a> <span class="st">'last_modified'</span><span class="op">:</span> <span class="va">now</span>.<span class="at">toISOString</span>()<span class="op">,</span></span>
<span id="cb2-33"><a href="#cb2-33"></a> <span class="st">'content'</span><span class="op">:</span> <span class="va">JSON</span>.<span class="at">stringify</span>(data)<span class="op">,</span></span>
<span id="cb2-34"><a href="#cb2-34"></a> <span class="st">'mimetype'</span><span class="op">:</span> mime_type<span class="op">,</span></span>
<span id="cb2-35"><a href="#cb2-35"></a> <span class="st">'type'</span><span class="op">:</span> <span class="st">'file'</span><span class="op">,</span></span>
<span id="cb2-36"><a href="#cb2-36"></a> <span class="st">'writable'</span><span class="op">:</span> <span class="kw">true</span></span>
<span id="cb2-37"><a href="#cb2-37"></a> <span class="op">};</span> </span>
<span id="cb2-38"><a href="#cb2-38"></a></span>
<span id="cb2-39"><a href="#cb2-39"></a> <span class="kw">const</span> countRequest <span class="op">=</span> <span class="va">store</span>.<span class="at">count</span>(file_path)<span class="op">;</span></span>
<span id="cb2-40"><a href="#cb2-40"></a> <span class="va">countRequest</span>.<span class="at">onsuccess</span> <span class="op">=</span> () <span class="kw">=></span> <span class="op">{</span></span>
<span id="cb2-41"><a href="#cb2-41"></a> <span class="va">console</span>.<span class="at">log</span>(<span class="va">countRequest</span>.<span class="at">result</span>)<span class="op">;</span></span>
<span id="cb2-42"><a href="#cb2-42"></a> <span class="cf">if</span>(<span class="va">countRequest</span>.<span class="at">result</span> <span class="op">></span> <span class="dv">0</span>) <span class="op">{</span></span>
<span id="cb2-43"><a href="#cb2-43"></a> <span class="va">store</span>.<span class="at">put</span>(value<span class="op">,</span> file_path)<span class="op">;</span></span>
<span id="cb2-44"><a href="#cb2-44"></a> <span class="op">}</span> <span class="cf">else</span> <span class="op">{</span></span>
<span id="cb2-45"><a href="#cb2-45"></a> <span class="va">store</span>.<span class="at">add</span>(value<span class="op">,</span> file_path)<span class="op">;</span></span>
<span id="cb2-46"><a href="#cb2-46"></a> <span class="op">}</span> </span>
<span id="cb2-47"><a href="#cb2-47"></a> <span class="op">};</span> </span>
<span id="cb2-48"><a href="#cb2-48"></a></span>
<span id="cb2-49"><a href="#cb2-49"></a> <span class="co">// Close the db when the transaction is done</span></span>
<span id="cb2-50"><a href="#cb2-50"></a> <span class="va">tx</span>.<span class="at">oncomplete</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span></span>
<span id="cb2-51"><a href="#cb2-51"></a> <span class="va">db</span>.<span class="at">close</span>()<span class="op">;</span></span>
<span id="cb2-52"><a href="#cb2-52"></a> <span class="op">};</span></span>
<span id="cb2-53"><a href="#cb2-53"></a> <span class="op">}</span></span>
<span id="cb2-54"><a href="#cb2-54"></a> <span class="op">}</span></span>
<span id="cb2-55"><a href="#cb2-55"></a></span>
<span id="cb2-56"><a href="#cb2-56"></a> <span class="va">document</span>.<span class="at">getElementsByTagName</span>(<span class="st">'head'</span>)[<span class="dv">0</span>].<span class="at">appendChild</span>(sc)<span class="op">;</span></span>
<span id="cb2-57"><a href="#cb2-57"></a><span class="op">}</span></span></code></pre></div>
<p>Then, each time you need to download a file, you can just use the following javascript:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span id="cb3-1"><a href="#cb3-1"></a><span class="op">%%</span>javascript</span>
<span id="cb3-2"><a href="#cb3-2"></a><span class="kw">var</span> url <span class="op">=</span> <span class="st">'https://opendata-ajuntament.barcelona.cat/data/es/api/3/action/datastore_search?resource_id=69ae574f-adfc-4660-8f81-73103de169ff'</span></span>
<span id="cb3-3"><a href="#cb3-3"></a><span class="va">window</span>.<span class="at">saveJSONP</span>(url<span class="op">,</span> <span class="st">'data/menors.json'</span>)</span></code></pre></div>
<p>To clarify, you should either use a python kernel with the <code>%%javascript</code> magic or the javascript kernel in <em>both</em> the definition and the call, otherwise they won’t see each other.</p>
<p>Then from a python cell we can read it the standard way:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1"></a><span class="im">import</span> json</span>
<span id="cb4-2"><a href="#cb4-2"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
<span id="cb4-3"><a href="#cb4-3"></a></span>
<span id="cb4-4"><a href="#cb4-4"></a><span class="cf">with</span> <span class="bu">open</span>(<span class="st">'data/menors.json'</span>, <span class="st">'r'</span>) <span class="im">as</span> f:</span>
<span id="cb4-5"><a href="#cb4-5"></a> data <span class="op">=</span> json.load(f)</span>
<span id="cb4-6"><a href="#cb4-6"></a> </span>
<span id="cb4-7"><a href="#cb4-7"></a>pd.read_json(json.dumps(data[<span class="st">'result'</span>][<span class="st">'records'</span>]))</span></code></pre></div>
<p>You can find a notebook with the whole code for your convenience <a href="https://gist.github.com/6418a53b50568a2b201bf592d854c0df#file-pythonjsonphelper-ipynb">in this GIST</a>.</p>
<h2 id="conclusions">Conclusions</h2>
<ul>
<li><p>We are just starting to see the potential of WebAssembly based solutions and the browser environment (IndexedDB…). This will increase the demand for data accessibility across origins.</p></li>
<li><p>If you are a data provider, please consider enabling CORS to promote the usage of your data. Otherwise you will be banning a growing market of web-based analysis tools from your data.</p></li>
</ul>
<h2 id="references">References</h2>
<ul>
<li>Simple IndexedDB <a href="https://gist.github.com/JamesMessinger/a0d6389a5d0e3a24814b">example</a></li>
<li><a href="https://github.com/jupyterlite/jupyterlite/discussions/91?sort=new">Sample code</a> for reading and writing files in JupyterLite (this is where the idea for this post comes from).</li>
<li><a href="https://enable-cors.org/">On CORS</a> and how to enable it.</li>
<li><a href="https://www.w3.org/wiki/CORS_Enabled">An w3 article</a> on how to open your data by enabling CORS and why it is important, with a list of providers implementing it.</li>
<li>A test <a href="https://www.test-cors.org/">web page</a> to check if a server is CORS enabled.</li>
</ul>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>If you are curious about the possible solutions to this problems, you may like to read how <a href="https://webvm.io/">WebVM</a>, a server-less virtual Debian, implements a general solution <a href="https://leaningtech.com/webvm-virtual-machine-with-networking-via-tailscale/">here</a>.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/jupyterlite.html">jupyterlite</a>, <a href="/tags/CORS.html">CORS</a>, <a href="/tags/data.html">data</a>, <a href="/tags/data.html">data</a>, <a href="/tags/webassembly.html">webassembly</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Sun, 29 Jan 2023 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Data Manipulation with JupyterLite</title>
<link>http://jarnaldich.me/blog/2022/12/08/data-manipulation-jupyterlite.html</link>
<description><![CDATA[<h1>Data Manipulation with JupyterLite</h1>
<small>Posted on December 8, 2022 <a href="/blog/2022/12/08/data-manipulation-jupyterlite.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<figure>
<img src="/images/jupyterlite.png" title="JupyterLite screenshot" class="center" width="400" alt="" /><figcaption> </figcaption>
</figure>
<p>Data comes in all sizes, shapes and qualities. The process of getting data ready for further analysis is equally crucial and tedious, as many data professionals will <a href="https://forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/">confirm</a>.</p>
<p>This process of many names (data wrangling/munging/cleaning) is often performed by an unholy mix of command-line tools, one-shot scripts and whatever is at hand depending on the data formats and computing environment.</p>
<p>I have been intending to share some of the tools I have found useful for this task in a series of blog posts, especially when they are unexpected or lesser-known. I will always try to demonstrate the tool with some common data processing application, and then highlight under which conditions the tool is most suitable.</p>
<h1 id="jupyterlite">JupyterLite</h1>
<p>Jupyter/JupyterLab are the de-facto standard notebook environment, especially among Python data scientists (although it was designed from the start to work with multiple languages or <em>kernels</em>, as the <a href="https://blog.jupyter.org/i-python-you-r-we-julia-baf064ca1fb6">name hints</a>). The frontend runs in a browser, and setting up the backend often requires a local installation, although some providers will let you spin up a backend in the cloud; see <a href="https://colab.research.google.com/">Google Colab</a> or <a href="https://mybinder.org/">The Binder Project</a>.</p>
<p>JupyterLite is a simpler/cleaner solution for simple analysis if sharing is not needed: it is a quite complete Jupyter environment in which all components run in the browser via WebAssembly compilation. Just visit its <a href="https://github.com/jupyterlite/jupyterlite">GitHub</a> project page for the details. Following some of the referenced projects and examples is a worthy rabbit hole to enter.</p>
<p>Some things you might not expect from a webassembly solution:</p>
<ul>
<li>Comes with most data-science libraries ready to use: matplotlib, pandas, numpy.</li>
<li>Can install third party packages via regular magic:</li>
</ul>
<pre><code>%pip install -q bqplot ipyleaflet</code></pre>
<h2 id="example-not-so-simple-excel-manipulation">Example: Not so Simple Excel Manipulation</h2>
<p>Sometimes you need to perform some not-so-simple manipulation in an Excel sheet that outgrows pivot tables but is kind of the bread and butter of pandas. Since copying an Excel table yields a tab-separated string, getting a pandas dataframe is as easy as firing up JupyterLite by visiting <a href="https://jupyterlite.github.io/demo/lab/index.html">this page</a>, opening a Python notebook, and evaluating this code in the first cell, pasting the Excel table between the multi-line string delimiters:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
<span id="cb2-2"><a href="#cb2-2"></a><span class="im">import</span> io</span>
<span id="cb2-3"><a href="#cb2-3"></a></span>
<span id="cb2-4"><a href="#cb2-4"></a>df <span class="op">=</span> pd.read_table(io.StringIO(<span class="st">"""</span></span>
<span id="cb2-5"><a href="#cb2-5"></a><span class="st"><PRESS C-V HERE></span></span>
<span id="cb2-6"><a href="#cb2-6"></a><span class="st">"""</span>))</span>
<span id="cb2-7"><a href="#cb2-7"></a>df</span></code></pre></div>
<h1 id="highlights">Highlights</h1>
<ul>
<li><strong>Useful for:</strong> The kind of analysis/manipulation one would use Pandas / Numpy for, especially if it involves visualizations or richer interaction.</li>
<li><strong>Useful when:</strong> You don’t have access to a pre-installed Jupyter environment but have a modern browser and internet connection at hand, or when you are dealing with sensitive data that should not leave your computer.</li>
</ul>
<h1 id="conclusion">Conclusion</h1>
<p>JupyterLite is an amazing project: as with many webassembly based solutions, we are just starting to see the possibilities. I encourage you to explore it beyond data manipulation because you can easily find other applications for it, from interactive dashboards to authoring diagrams…</p>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/data.html">data</a>, <a href="/tags/tools.html">tools</a>, <a href="/tags/jupyterlite.html">jupyterlite</a>, <a href="/tags/data-manipulation.html">data-manipulation</a>, <a href="/tags/data-wrangling.html">data-wrangling</a>, <a href="/tags/data-munging.html">data-munging</a>, <a href="/tags/webassembly.html">webassembly</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/12/08/data-manipulation-jupyterlite.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Thu, 08 Dec 2022 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2022/12/08/data-manipulation-jupyterlite.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Cloud Optimized Vector</title>
<link>http://jarnaldich.me/blog/2022/04/22/cloud-optimized-vector.html</link>
<description><![CDATA[<h1>Cloud Optimized Vector</h1>
<small>Posted on April 22, 2022 <a href="/blog/2022/04/22/cloud-optimized-vector.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>A few days ago a coworker of mine sent me a <a href="http://blog.cleverelephant.ca/2022/04/coshp.html">recent article</a> by Paul Ramsey (of <a href="http://blog.cleverelephant.ca/projects">Postgis et al.</a> fame) reflecting on what would a Cloud Optimized Vector format look like. His shocking proposal was … (didn’t see that coming)… shapefiles!</p>
<img src="https://imgs.xkcd.com/comics/duty_calls.png" title="fig:someone is wrong on the internet" class="center" alt=" " />
<p>
<center>
<small>Source: xkcd</small>
</center>
</p>
<p>I understand the article was written as a provocation for thought and as such makes some really good points. I also think that the general discussion over what a “cloud optimized vector” format would look like can be productive, but I am afraid that some less experienced developers (or, God forbid, managers!) would take the proposal of pushing shapefiles as the next cloud format a bit too literally, so I thought I would give some context and counterpoint to that article.</p>
<p>Him being Paul Ramsey and me being… well… <a href="/about.html">me</a>, I’d better motivate my opinion, so here comes a longish post. I will try to analyze what makes something <em>cloud optimized</em> based on the COG experience, see how that could be applied to a vector format, then justify why shapefiles should be (once again) avoided and finally see if we can get any closer to an ideal cloud vector format.</p>
<h2 id="what-makes-something-cloud-optimized-anyway">What makes something <em>cloud optimized</em> anyway?</h2>
<p><a href="https://www.cogeo.org/">Cloud Optimized GeoTiffs</a> are technically just a name for a GeoTiff with a <a href="https://github.com/cogeotiff/cog-spec/blob/master/spec.md">particular internal organization</a> (the sequencing of the bytes on disk). Tiff is a old format (old as in <em>venerable</em>) that allows for huge flexibility in terms of internal storage, data types, etc… For example, an image can be stored on disk one line after the other or, as is the case with COG, in small square “mini images” called tiles. Those tiles are then arranged in a larger grid and then several coarser-resolution layers (called overviews) of such grids can be stacked together to form an <a href="https://en.wikipedia.org/wiki/Pyramid_(image_processing)">image pyramid</a>.</p>
<img src="/images/pyramid.jpeg" title="fig:pyramid mage" class="center" width="400" alt=" " />
<p>
<center>
<small>Source: OsGEO Wiki</small>
</center>
</p>
<p>Of course, all data is properly indexed within the file so that accessing a tile of any pyramid level is easy (seeking byte ranges and at most some trivial multiplications or additions).</p>
<p>Whenever data is fetched in chunks through a channel with some latency (be it disk transfer or network), the efficiency of the overall processing can be improved by organizing data in the same order it will be read by the algorithm to compensate for the cost of setting up each read operation (seek times of spinning disks or protocol overhead in network communications).</p>
<p>A corollary of this is that <em>data formats are not efficient per se</em>: efficiency will always depend on the process/algorithm/use case. For example, for a raster point operation (such as applying a threshold mask for some value), organizing data line by line with no overviews is more efficient than a COG would be (…and that is why the GeoTiff spec allows for different configurations).</p>
<p>When dealing with spatial data, that principle gets hit by a loose version of <a href="https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography">Tobler’s First Law</a>: data representing a nearby area is more likely to be accessed next. For example, when a user is viewing an image, tiles that are close to the ones on screen are more likely to be fetched next than tiles representing remote areas (because users pan; they do not jump around randomly).</p>
<p>So what is the use case COG has in mind? Well, in case you hadn’t figured it out already, it is mainly <em>visualization</em><a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a>. Overviews allow for zooming in and out efficiently, and tiles help with moving along a subset of the higher resolution.</p>
<p>This pattern has been the ABC of raster optimization for decades in the geospatial world. Be it <a href="https://mapproxy.org/docs/1.13.0/caches.html">tile caches</a>, <a href="https://www.ogc.org/standards/tms">tiling schemes</a>, <a href="https://mapserver.org/optimization/raster.html">WMS map servers</a>, etc… they all<a href="#fn2" class="footnote-ref" id="fnref2" role="doc-noteref"><sup>2</sup></a> try to have the same properties:</p>
<ol type="1">
<li>Efficient navigation along contiguous resolutions (through overviews, pyramids, wavelets).</li>
<li>Efficient access of contiguous areas at a given resolution (tiling).</li>
</ol>
<p>This also turns out to be a pretty sensible organization if you cannot know in advance what kind of processing will be performed, because it gives you fast access to a manageable piece of the data: be it a summary (overview), a subset (a slice of tiles), or a combination of both.</p>
<p>Notice what it does <em>not</em> allow, though: it leaves you high and dry if you need a subset based on the <em>content</em> of the data. Eg. if I would like to see all pixels with a red channel value of 42, I would have to read the whole image.</p>
<p>COG is just a name for a GeoTiff implementing that organization. It goes a bit further than that by forcing a particular order<a href="#fn3" class="footnote-ref" id="fnref3" role="doc-noteref"><sup>3</sup></a> of the inner sections, which is smart because a client can ask for a chunk at the beginning and it will get all the directories (think indices, metadata) and probably some overviews. This makes sense because most viewers will start with the lowest zoom that covers the bounding box. It is also a nice organization for <em>streaming</em> tiles of data.</p>
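<p>This is also exactly what makes COGs work over plain HTTP: a client only needs range requests to pull the directories and the tiles it cares about. A hypothetical sketch (placeholder URL, using the well-known <code>requests</code> library):</p>
<pre><code>import requests

# Fetch only the first 16 KiB: for a well-formed COG this typically covers
# the header, the directories and possibly some overview data
resp = requests.get('https://example.com/some_cog.tif',
                    headers={'Range': 'bytes=0-16383'})
print(resp.status_code)  # 206 Partial Content if the server honors Range
chunk = resp.content</code></pre>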
<p>With that in mind, what would it mean for a vector format to be “cloud ready”? It should surely allow for the visualization use case, which here loosely means “rendering a map”, so that gives us an idea:</p>
<ol type="1">
<li>Having the ability to navigate different <em>zoom levels</em> / scales / generalization(s).</li>
<li>Efficient rendering of nearby areas at a given resolution.</li>
</ol>
<p>Notice that point 1 <em>as a process</em> is much harder for vector than for raster formats: for rasters it is (mostly) a question of choosing what “summary” measure we pick for the overview pixel corresponding to the underlying level (nearest neighbor, interpolation, average, other…). Generalizing a vector is much harder, first because it can break topology and geometry validity in many ways, but also because deciding if/how to represent different features at different scales requires cartographic design knowledge. But that is not relevant <em>for the format itself</em>: it just needs to be flexible enough to allow for different geometries at different resolutions and be efficient in navigating them (we do not care how hard it was to generate the different resolution levels).</p>
<p>While I think these two requirements are the equivalent of what a COG offers for raster, I am unsure we would consider them enough in the vector case. For example, we might not find it acceptable to be unable to take subsets or summaries based on attribute values, so there is a whole new level of complexity for vector <em>at the format level</em> as well. It all boils down to whether by <em>vector</em> we mean <em>features</em> or just <em>geometries</em>.</p>
<p>Now that I’ve established the two conditions I think define <em>cloud optimization</em>, at least by COG standards, let’s first dive into why I would say Shapefiles are <em>not</em> the future of the cloud.</p>
<h2 id="the-noble-art-of-bashing-shapefiles">The noble art of bashing shapefiles</h2>
<p>A lot has been argued over the years on the <a href="http://switchfromshapefile.org/">problems with shapefiles</a>. I will just go over the problems specifically relevant in a cloud setting.</p>
<p>First, they are a multiple file format. There is a cost in the OS layer for opening a file (name resolution, checking permissions), and the web server will probably add another layer on top of that, so please let’s not choose a format for the cloud that means opening a .shp, .dbf, .prj, .shx, .qix… and <a href="https://desktop.arcgis.com/en/arcmap/10.3/manage-data/shapefiles/shapefile-file-extensions.htm">potentially all of these</a>.</p>
<p>Second, they are limited to 2GB of file size. Most COGs are effectively BigTIFFs, and easily <em>need</em> to go far beyond that. In any case, one of the reasons for moving to the cloud is being able to process larger data.</p>
<p>Third, as for the use cases, they’re not even good for representation: you need several of them, one per layer/geometry type, to make most general maps (except maybe choropleths and other thematic maps). That already means multiplying the number of files even more.</p>
<p>Finally, Paul’s article only seems to care about property number 2: accessing contiguous areas at a given resolution. That is not cloud ready in the same way COGs are: we also need multi-scale map representation (property 1). You can of course use some attribute to filter which elements should appear at each resolution level, but that requires attribute indexing and clashes with spatial ordering. The other option would be using different shapefiles for the different levels, so even more files.</p>
<p>The spatial ordering tool the article suggests would certainly be useful for a streaming algorithm where spatial contiguity is relevant, but then again there are <a href="https://flatgeobuf.org/">options tailored for this use case</a>.</p>
<h1 id="is-there-a-better-option">Is there a better option?</h1>
<p>For the representation use case, which is what COGs provide, there certainly is, and it has been around for a long time. It’s just that we call them <a href="https://docs.mapbox.com/data/tilesets/guides/vector-tiles-introduction/">vector tiles</a>.</p>
<p>Vector tiles are exactly the application of the old tiling schema idea to vectors. It’s just that instead of mini-images, we have a <code>pbf</code> encoding of a <a href="https://github.com/mapbox/vector-tile-spec/tree/master/2.1#41-layers">format</a> for geometries and attributes.</p>
<p>Those tiles are then organized into the same scheme of grids and pyramids for different resolutions that we had in a COG. It’s just that most of the time the tiling does not depend on the dataset (though it can), but is <a href="https://www.maptiler.com/google-maps-coordinates-tile-bounds-projection/#3/15.00/50.00">globally fixed</a>, following a set of well-known tile schemas.</p>
<p>The tiles can have different schemas and information at different resolution levels (zoom) to allow for different generalization and visualization options.</p>
<p>We can pack all those tiles into a single <code>.mbtiles</code> file, which is a <code>sqlite</code>-based format containing the tiles as blobs. Having a global tile scheme is nice because you can then use SQLite’s <code>ATTACH</code> command to merge datasets, for example. And you can include any metadata (projection, etc…) inside a single file.</p>
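<p>To give an idea of how cheap that kind of merging is, here is a minimal sketch in plain SQLite SQL. The file name is hypothetical; <code>tiles(zoom_level, tile_column, tile_row, tile_data)</code> is the table layout mandated by the MBTiles spec:</p>
<pre><code>-- Merge the tiles of another MBTiles file sharing the same
-- global tile scheme into the current one.
ATTACH DATABASE 'other_region.mbtiles' AS other;

-- OR REPLACE deduplicates only if the spec's unique index on
-- (zoom_level, tile_column, tile_row) is present.
INSERT OR REPLACE INTO tiles (zoom_level, tile_column, tile_row, tile_data)
SELECT zoom_level, tile_column, tile_row, tile_data FROM other.tiles;

DETACH DATABASE other;</code></pre>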
<p>And of course there are libraries for rendering them in the browser (that is their primary use case), among <a href="https://github.com/mapbox/awesome-vector-tiles">many other things</a>. But Paul already knows that, since <a href="https://postgis.net/docs/ST_AsMVT.html">PostGIS</a> itself can generate them.</p>
<h1 id="are-we-there-yet">Are we there yet?</h1>
<p>Well, for representation, at least, we are close… but what if we want more complex queries on top of that (think spatial SQL)? With an <code>.mbtiles</code> alone you would need to actually decode each <code>.pbf</code> and query the attributes, so no luck there…</p>
<p>In a sqlite-based format (like <code>.mbtiles</code> or GeoPackage), it should be possible to add extra tables for queries that may or may not reference the main tiles… but that is an idea yet to be developed…</p>
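<p>Just to illustrate the idea, a sidecar table like the following could be shipped inside the same SQLite container. Every name here is made up; this is not part of any spec:</p>
<pre><code>-- Hypothetical attribute index pointing back to the tiles
-- that contain each feature.
CREATE TABLE feature_index (
    attr_name   TEXT,    -- e.g. 'pop2020'
    attr_value  TEXT,
    zoom_level  INTEGER, -- tile address of the feature
    tile_column INTEGER,
    tile_row    INTEGER
);
CREATE INDEX feature_index_attr ON feature_index (attr_name, attr_value);

-- "Which tiles contain features with population over one million?"
SELECT DISTINCT zoom_level, tile_column, tile_row
FROM feature_index
WHERE attr_name = 'pop2020' AND CAST(attr_value AS REAL) > 1e6;</code></pre>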
<p>The other caveat for <em>vector tiles</em> is the possible loss of information as a general geometry repository. Internal VT coordinates are integers (mainly because they are optimal for screen rendering algorithms), so there is a discrete resolution for each zoom level. Special care has to be taken so that there is no loss of information (i.e. making sure the zoom levels go deep enough for the internal grid cell to be below the resolution of the measuring instruments). So again, they may not be suitable for every application.</p>
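<p>A back-of-the-envelope check of that discrete resolution, assuming a Web Mercator tile grid (world width of roughly 40075017 m) and the usual 4096-unit tile extent:</p>
<pre><code>-- Ground size in meters of one integer coordinate step at zoom 14:
-- 2^14 = 16384 tiles across the world, 4096 units per tile.
SELECT 40075016.686 / (16384 * 4096.0) AS meters_per_step;
-- ~0.6 m, so centimeter-level source coordinates need deeper zooms.</code></pre>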
<h1 id="conclusion">Conclusion</h1>
<p>I hope I made my point on why I do not think shapefiles are the future of cloud-based vector formats (I wrote this in a bit of a hurry) and, more importantly, that the “cloud optimization” concept of the raster world can only be applied to vector formats in a limited way. I <em>do</em> think there is an interesting space to explore, though… Of course I may be completely wrong and maybe Paul has actually found something.</p>
<p>Time will tell, I guess…</p>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>The trick is that some cloud processing platforms such as the <a href="https://earthengine.google.com/">Google Earth Engine</a> are in fact processing on a <em>visualization driven</em> also called <em>lazy</em> processing scheme: only the data that is visualized at any moment by the user gets processed, on demand, so the same principle applies.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn2" role="doc-endnote"><p>Actually, not all, there are more sophisticated methods like wavelet transforms allowing for multi-resolution decoding in formats like .ECW/MrSID (commercial) or JP2000, but for the purpose of this post let’s just call it a very sophisitcated pyramid.<a href="#fnref2" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn3" role="doc-endnote"><p>For many applications, the hard requirements are tiles and overviews. The order of IFDs may not have much of an impact. I encourage the user to try and read a “regular” tiled tiff through <code>/vsicurl/</code> in QGIS. Or even a raster geopackage, for that matter.<a href="#fnref3" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/vector.html">vector</a>, <a href="/tags/vector-tiles.html">vector-tiles</a>, <a href="/tags/mbtiles.html">mbtiles</a>, <a href="/tags/sqlite.html">sqlite</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/04/22/cloud-optimized-vector.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Fri, 22 Apr 2022 01:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2022/04/22/cloud-optimized-vector.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>ETL The Haskell Way</title>
<link>http://jarnaldich.me/blog/2022/03/27/etl-the-haskell-way.html</link>
<description><![CDATA[<h1>ETL The Haskell Way</h1>
<small>Posted on March 27, 2022 <a href="/blog/2022/03/27/etl-the-haskell-way.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>Extract Transform Load (ETL) is a broad term for processes that read a subset of data in one format, perform a more or less involved transformation and then store it in a (maybe) different format. Those processes can of course be linked together to form larger data pipelines. As with many such general terms, this can mean very different things in terms of software architecture and implementation. For example, depending on the scale of the data, the solution may range from a Unix shell pipeline to a full-blown <a href="https://nifi.apache.org/">Apache NiFi</a> solution.</p>
<p>One common theme is data impedance mismatch between formats. Take for example JSON and XML. They are surely different, but for any particular application you can find a way to move data from one to the other. They even have their own <a href="https://chrispenner.ca/posts/traversal-systems">traversal systems</a> (<a href="https://stedolan.github.io/jq/">jq</a>’s syntax and <a href="https://developer.mozilla.org/en-US/docs/Web/XPath">XPath</a>).</p>
<p>The most widely used solution for small to medium data is to write small ad-hoc scripts. One can somewhat abstract over these formats by <a href="https://blog.lazy-evaluation.net/posts/linux/jq-xq-yq.html">abusing jq</a>.</p>
<p>In this blog post we will explore a more elegant way to perform such transformations using Haskell. The purpose of this post is just to pique your curiosity about what’s possible in this area with Haskell. It is definitely <em>not</em> intended as a tutorial on optics, which are not for Haskell beginners anyway…</p>
<h2 id="the-problem">The Problem</h2>
<p>We will be taking a <a href="https://datatracker.ietf.org/doc/html/rfc7946">geojson</a> dataset containing <a href="static/countries.geo.json">countries</a> at a world scale, taken from Natural Earth, and enriching it with <a href="static/population.xml">population data in XML</a> as provided by the World Bank API, so that it can be used, for example, to produce a <a href="">choropleth</a> <del>map</del><a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a> visualization.</p>
<figure>
<img src="/images/worldpop.png" title="this is not a map" class="center" alt="" /><figcaption> </figcaption>
</figure>
<p>Haskell is a curiously effective fit for this kind of problem due to the unlikely combination of three seemingly unrelated traits: its parsing libraries (driven by a community interested in programming language theory), <em>optics</em> (also driven by PLT, and by a gruesome syntax for record accessors, at least up to the recent addition of <code>RecordDotSyntax</code>), and the convenience of writing scripts with the <code>stack</code> tool (driven by the olden unreliability of <code>cabal</code> builds).</p>
<p>It is the fact that Haskell is so <em>abstract</em> that makes it easy to combine libraries never intended to work together in the first place. Haskell libraries tend to define their interfaces in very general terms (e.g. structures that can be mapped over, structures that can be “summarized”, etc…).</p>
<p>Let’s break down how these work together.</p>
<h3 id="parsing-libraries">Parsing Libraries</h3>
<p>Haskell comes from a long tradition of programming language theory applications, and it shines at building parsers, so there is no shortage of libraries for reading the most common formats. But, more important than the availability of parsing libraries itself, it is the <a href="https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/">parse, don’t validate</a> approach in these libraries that works here: most of them can decode (deserialize, parse) their input into a well-typed structured value in memory (think Abstract Syntax Tree).</p>
<p>So a typical workflow would be to read the data from disk into a more or less abstract representation in memory involving nested data structures, transform it into another in-memory representation (maybe generated from a template) through the use of optics, and then serialize it back to disk:</p>
<figure>
<img src="/images/haskell_lens_workflow.png" title="Haskell lens workflow" class="center" alt="" /><figcaption> </figcaption>
</figure>
<h3 id="optics">Optics</h3>
<p>Optics (lenses, prisms, traversals) are a way to abstract getters and setters in a composable way. Their surface syntax reads like “pinpointing” or “bookmarking” into a deeply nested data structure (think <code>XPath</code>), which makes it nice for visually keeping track of what is being read or altered.</p>
<p>The learning curve is wild, and the error messages convoluted, but the fact that in Haskell we can abstract accessors away from any particular data structure, and that there are well-defined functions to combine them, can reduce the size of your data transformation toolbox. And lighter toolboxes are easier to carry around with you.</p>
<h3 id="scripting">Scripting</h3>
<p>A lot of data wrangling programs are one-shot scripts, where you care about the result more than about the software itself. Having to create a new app each time can be tiresome, so being able to script while relying on a set of curated libraries to get the job done is really nice. Starting with a script that can be turned at any time into a full-blown app that works on all the major platforms is a plus.</p>
<h2 id="the-solution">The Solution</h2>
<p>The steps follow the typical workflow quite closely, in our case:</p>
<ol type="1">
<li>Parse the <code>.xml</code> file into a data structure (a document) in memory.</li>
<li>Build a map from country codes to population.</li>
<li>Read the geojson file with country info and get the array of features.</li>
<li>For each feature, create a new key with the population.</li>
</ol>
<p>This overall structure can be traced in our main function:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb1-1"><a href="#cb1-1"></a>main <span class="ot">=</span> <span class="kw">do</span></span>
<span id="cb1-2"><a href="#cb1-2"></a> xml <span class="ot"><-</span> XML.readFile XML.def <span class="st">"population.xml"</span> <span class="co">-- Parse the XML file into a memory document</span></span>
<span id="cb1-3"><a href="#cb1-3"></a> <span class="kw">let</span> pop2020Map <span class="ot">=</span> Map.fromList <span class="op">$</span> runReader records xml <span class="co">-- Build a map Country -> Population</span></span>
<span id="cb1-4"><a href="#cb1-4"></a> jsonBytes <span class="ot"><-</span> LB8.readFile <span class="st">"countries.geo.json"</span> <span class="co">-- Parse the countries geojson into memory</span></span>
<span id="cb1-5"><a href="#cb1-5"></a> <span class="kw">let</span> <span class="dt">Just</span> json <span class="ot">=</span> Json.decode<span class="ot"> jsonBytes ::</span> <span class="dt">Maybe</span> <span class="dt">Json.Value</span></span>
<span id="cb1-6"><a href="#cb1-6"></a> <span class="kw">let</span> featureList <span class="ot">=</span> runReader (features pop2020Map)<span class="ot"> json ::</span> [ <span class="dt">Json.Value</span> ] <span class="co">-- Get features with new population key</span></span>
<span id="cb1-7"><a href="#cb1-7"></a> <span class="kw">let</span> newJson <span class="ot">=</span> json <span class="op">&</span> key <span class="st">"features"</span> <span class="op">.~</span> (<span class="dt">Json.Array</span> <span class="op">$</span> V.fromList featureList) <span class="co">-- Update the original Json</span></span>
<span id="cb1-8"><a href="#cb1-8"></a> LB8.writeFile <span class="st">"countriesWithPopulation.geo.json"</span> <span class="op">$</span> Json.encode newJson <span class="co">-- Write back to disk</span></span></code></pre></div>
<p>The form of the input data is not especially well suited for this app. The world population XML is a table in disguise (remember the data impedance problem?): basically a list of records like this one:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode xml"><code class="sourceCode xml"><span id="cb2-1"><a href="#cb2-1"></a> <span class="kw"><record></span></span>
<span id="cb2-2"><a href="#cb2-2"></a> <span class="kw"><field</span><span class="ot"> name=</span><span class="st">"Country or Area"</span><span class="ot"> key=</span><span class="st">"ABW"</span><span class="kw">></span>Aruba<span class="kw"></field></span></span>
<span id="cb2-3"><a href="#cb2-3"></a> <span class="kw"><field</span><span class="ot"> name=</span><span class="st">"Item"</span><span class="ot"> key=</span><span class="st">"SP.POP.TOTL"</span><span class="kw">></span>Population, total<span class="kw"></field></span></span>
<span id="cb2-4"><a href="#cb2-4"></a> <span class="kw"><field</span><span class="ot"> name=</span><span class="st">"Year"</span><span class="kw">></span>1960<span class="kw"></field></span></span>
<span id="cb2-5"><a href="#cb2-5"></a> <span class="kw"><field</span><span class="ot"> name=</span><span class="st">"Value"</span><span class="kw">></span>54208<span class="kw"></field></span></span>
<span id="cb2-6"><a href="#cb2-6"></a> <span class="kw"></record></span></span></code></pre></div>
<p>That means the function that reads it has to associate information from two siblings in the XML tree, but that is easy using the <code>magnify</code> function inside a <code>Reader</code> monad:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb3-1"><a href="#cb3-1"></a><span class="ot">records ::</span> <span class="dt">Reader</span> <span class="dt">XML.Document</span> [(<span class="dt">T.Text</span>, <span class="dt">Scientific</span>)]</span>
<span id="cb3-2"><a href="#cb3-2"></a>records <span class="ot">=</span></span>
<span id="cb3-3"><a href="#cb3-3"></a> <span class="kw">let</span></span>
<span id="cb3-4"><a href="#cb3-4"></a> <span class="co">-- Lens to access an attribute from record to field. Intended to be composed.</span></span>
<span id="cb3-5"><a href="#cb3-5"></a> field name <span class="ot">=</span> nodes <span class="op">.</span> folded <span class="op">.</span> _Element <span class="op">.</span> named <span class="st">"field"</span> <span class="op">.</span> attributeIs <span class="st">"name"</span> name</span>
<span id="cb3-6"><a href="#cb3-6"></a> <span class="kw">in</span> <span class="kw">do</span></span>
<span id="cb3-7"><a href="#cb3-7"></a> <span class="co">-- Zoom and iterate all records</span></span>
<span id="cb3-8"><a href="#cb3-8"></a> magnify (root <span class="op">.</span> named <span class="st">"Root"</span> <span class="op">./</span> named <span class="st">"data"</span> <span class="op">./</span> named <span class="st">"record"</span>) <span class="op">$</span> <span class="kw">do</span></span>
<span id="cb3-9"><a href="#cb3-9"></a> record <span class="ot"><-</span> ask</span>
<span id="cb3-10"><a href="#cb3-10"></a> <span class="kw">let</span> name <span class="ot">=</span> record <span class="op">^?</span> (field <span class="st">"Country or Area"</span> <span class="op">.</span> attr <span class="st">"key"</span>)</span>
<span id="cb3-11"><a href="#cb3-11"></a> <span class="kw">let</span> year <span class="ot">=</span> record <span class="op">^?</span> (field <span class="st">"Year"</span> <span class="op">.</span> text)</span>
<span id="cb3-12"><a href="#cb3-12"></a> <span class="kw">let</span> val <span class="ot">=</span> record <span class="op">^?</span> (field <span class="st">"Value"</span> <span class="op">.</span> text)</span>
<span id="cb3-13"><a href="#cb3-13"></a> <span class="co">-- Returning a monoid instance (list) combines results.</span></span>
<span id="cb3-14"><a href="#cb3-14"></a> <span class="fu">return</span> <span class="op">$</span> <span class="kw">case</span> (name, year, val) <span class="kw">of</span></span>
<span id="cb3-15"><a href="#cb3-15"></a> (<span class="dt">Just</span> key, <span class="dt">Just</span> <span class="st">"2020"</span>, <span class="dt">Just</span> val) <span class="ot">-></span> [ (key, <span class="fu">read</span> <span class="op">$</span> T.unpack val) ]</span>
<span id="cb3-16"><a href="#cb3-16"></a> _ <span class="ot">-></span> []</span></code></pre></div>
<p>Note how lenses look almost like <code>XPath</code> expressions. The <code>features</code> function just takes the original features and appends a new key:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb4-1"><a href="#cb4-1"></a><span class="ot">features ::</span> <span class="dt">Map.Map</span> <span class="dt">T.Text</span> <span class="dt">Scientific</span> <span class="ot">-></span> <span class="dt">Reader</span> <span class="dt">Json.Value</span> [ <span class="dt">Json.Value</span> ]</span>
<span id="cb4-2"><a href="#cb4-2"></a>features popMap <span class="ot">=</span> <span class="kw">do</span></span>
<span id="cb4-3"><a href="#cb4-3"></a> magnify (key <span class="st">"features"</span> <span class="op">.</span> values) <span class="op">$</span> <span class="kw">do</span></span>
<span id="cb4-4"><a href="#cb4-4"></a> feature <span class="ot"><-</span> ask</span>
<span id="cb4-5"><a href="#cb4-5"></a> <span class="kw">let</span> <span class="dt">Just</span> <span class="fu">id</span> <span class="ot">=</span> feature <span class="op">^?</span> (key <span class="st">"id"</span> <span class="op">.</span> _String) <span class="co">-- Gross, but effective</span></span>
<span id="cb4-6"><a href="#cb4-6"></a> <span class="fu">return</span> <span class="op">$</span> <span class="kw">case</span> (Map.lookup <span class="fu">id</span> popMap) <span class="kw">of</span></span>
<span id="cb4-7"><a href="#cb4-7"></a> <span class="dt">Just</span> pop <span class="ot">-></span> [ feature <span class="op">&</span> key <span class="st">"properties"</span> <span class="op">.</span> _Object <span class="op">.</span> at <span class="st">"pop2020"</span> <span class="op">?~</span> <span class="dt">Json.Number</span> pop ]</span>
<span id="cb4-8"><a href="#cb4-8"></a> _ <span class="ot">-></span> [ feature ]</span></code></pre></div>
<p>That is really all it takes to perform the transformation. Please take a look at the full listing in <a href="https://gist.github.com/7cb4fd07bc8689f5c3bccb58b2e239ae#file-etl-hs">this gist</a>. Even with the imports, it can hardly get any shorter or more expressive than these fifty-something lines…</p>
<h2 id="revenge-of-the-nerds">Revenge of the Nerds</h2>
<p>So Haskell turns out to be the most practical, straightforward solution I have found for this kind of problem. Who knew?</p>
<p>I would absolutely not recommend learning Haskell just to solve this kind of problem (although I would absolutely recommend learning it for many other reasons). This is one of the occasions in which learning something just for the sake of it pays off in unexpected ways.</p>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>No lengend! No arrow pointing north! Questionable projection! This is not a post on map making, just an image to ease the reader’s eye after too much text for the internet…<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/haskell.html">haskell</a>, <a href="/tags/data.html">data</a>, <a href="/tags/xml.html">xml</a>, <a href="/tags/json.html">json</a>, <a href="/tags/geojson.html">geojson</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/03/27/etl-the-haskell-way.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Sun, 27 Mar 2022 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2022/03/27/etl-the-haskell-way.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Finding Curve Inflection Points in PostGIS</title>
<link>http://jarnaldich.me/blog/2022/02/06/postgis-curve-inflection.html</link>
<description><![CDATA[<h1>Finding Curve Inflection Points in PostGIS</h1>
<small>Posted on February 6, 2022 <a href="/blog/2022/02/06/postgis-curve-inflection.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>In this blog post I will present a way to find inflection points in a curve. An easy way to understand this: imagine the curve is a road we are driving along; we want to find the points at which we stop turning right and start turning left, or vice versa, as shown below:</p>
<figure>
<img src="/images/curve_inflection.png" title="Sample of curve inflection points" class="center" width="400" alt="" /><figcaption> </figcaption>
</figure>
<p>We will show a sketch of the solution and a practical implementation with <a href="https://postgis.net">PostGIS</a>.</p>
<h2 id="a-sketch-of-the-solution">A sketch of the solution</h2>
<p>This problem can be solved with pretty standard 2D computational geometry resources. In particular, the use of the <a href="https://mathworld.wolfram.com/CrossProduct.html">cross product</a> as a way to detect whether a point lies left or right of a given straight line will be useful here. The following pseudo-code is based on the determinant formula:</p>
<pre><code>function isLeft(Point a, Point b, Point c) {
    // Sign of the z-component of (b - a) x (c - a):
    // positive when c lies to the left of the directed line a -> b.
    return ((b.X - a.X)*(c.Y - a.Y) - (b.Y - a.Y)*(c.X - a.X)) > 0;
}</code></pre>
<p>In general, I am against implementing your own computational geometry code: direct translations of mathematical formulas are often plagued with rounding-off errors, corner cases and blatant inefficiencies. You would be better off using one of the excellent computational geometry libraries, such as <a href="https://libgeos.org">GEOS</a>, which started as a port of <a href="https://github.com/locationtech/jts">JTS</a>, or <a href="https://www.cgal.org/">CGAL</a>. Chances are that you are using them anyway, since they lie at the bottom of many <a href="https://www.nationalgeographic.org/encyclopedia/geographic-information-system-gis/">GIS</a> software stacks. This holds true for any non-trivial mathematics (linear algebra, optimization…). Remember: <strong><code>float</code>s are NOT real numbers</strong>.</p>
<p>In this case, where I cared a lot more about practicality than sheer efficiency, the use of SQL’s <code>numeric</code> type, which offers arbitrary-precision arithmetic at the expense of speed, prevents some of the rounding-off errors we would get with <code>double precision</code>, sparing us from implementing <a href="https://www.cs.cmu.edu/~quake/robust.html">fast robust predicates</a> ourselves.</p>
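<p>The classic floating-point example makes the difference immediate; in Postgres:</p>
<pre><code>SELECT 0.1::double precision + 0.2 = 0.3 AS dp_equal,  -- false
       0.1::numeric          + 0.2 = 0.3 AS num_equal; -- true</code></pre>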
<h2 id="postgis-implementaton">PostGIS implementaton</h2>
<p>I have long felt that Postgres/PostGIS is the nicest workbench for geospatial analysis (prove me wrong). In many use cases, being able to perform the analysis directly where your data is stored is unbeatable. Having to write an SQL script may be a drawback for some users, but it works wonders in terms of reproducibility and traceability for your data workflows.</p>
<p>In this particular case we will assume our input is a table of <code>LineString</code> geometry features, each one with its unique identifier. Of course, geometries should be properly indexed and tested for validity before any calculation. It is also often useful during development to restrict the calculation to an area of interest in order to shorten the iteration cycle when testing results and parameters.</p>
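<p>For concreteness, the input is assumed to look roughly like this (the names match the query below; the DDL itself is just a sketch, not taken from any production schema):</p>
<pre><code>CREATE TABLE input_contours (
    oid  serial PRIMARY KEY,
    geom geometry(LineString, 25831)
);
CREATE INDEX input_contours_geom_idx ON input_contours USING gist (geom);

-- Validity check before any calculation:
SELECT oid FROM input_contours WHERE NOT ST_IsValid(geom);</code></pre>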
<p>The sketch of the solution is:</p>
<ol type="1">
<li>Simplify the geometries to avoid noise (false positives). <code>ST_Simplify</code> or <code>ST_SimplifyPreserveTopology</code> will suffice.</li>
<li>Explode the points, keeping track of the original geometries, this can be easily done with <code>generate_series</code> and <code>ST_DumpPoints</code>.</li>
<li>We need 3 points to calculate <code>isLeft</code>: 2 to define the segment and the point to test for. So, for each point along the <code>LineString</code>, we get the X,Y coordinates of the point itself and of the 2 previous points. We will be checking the current point’s position relative to the segment defined by the two previous points. This also means that the turning point, when detected, will be the last point of the segment, that is: the previous point. I found this calculation to be surprisingly easy with Postgres window functions.</li>
<li>Use the above points to calculate a measure for isLeft.</li>
<li>Select the points where this measure changes.</li>
</ol>
<p>As usual, good code practices in general also apply to the database. In particular, <a href="https://www.postgresql.org/docs/13/queries-with.html">CTEs</a> can be used to clarify queries in the same way you would name variables or functions in any programming language: to enable reuse, but also to enhance readability through descriptive names. There is no excuse for <em>any</em> of the eye-burning SQL queries that are too often considered normal in the language.</p>
<p>Look back at the sketch of the solution and contrast it with the following implementation to see what I mean:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">WITH</span> </span>
<span id="cb2-2"><a href="#cb2-2"></a> <span class="co">-- Optional: area of interest.</span></span>
<span id="cb2-3"><a href="#cb2-3"></a> aoi <span class="kw">AS</span> (</span>
<span id="cb2-4"><a href="#cb2-4"></a> <span class="kw">SELECT</span> ST_SetSRID(</span>
<span id="cb2-5"><a href="#cb2-5"></a> ST_MakeBox2D(</span>
<span id="cb2-6"><a href="#cb2-6"></a> ST_Point(<span class="dv">467399</span>,<span class="dv">4671999</span>),</span>
<span id="cb2-7"><a href="#cb2-7"></a> ST_Point(<span class="dv">470200</span>,<span class="dv">4674000</span>))</span>
<span id="cb2-8"><a href="#cb2-8"></a> ,<span class="dv">25831</span>) </span>
<span id="cb2-9"><a href="#cb2-9"></a> <span class="kw">AS</span> geom</span>
<span id="cb2-10"><a href="#cb2-10"></a> ),</span>
<span id="cb2-11"><a href="#cb2-11"></a> <span class="co">-- Simplify geometries to avoid excessive noise. Tolerance is empiric and depends on application</span></span>
<span id="cb2-12"><a href="#cb2-12"></a> simplified <span class="kw">AS</span> (</span>
<span id="cb2-13"><a href="#cb2-13"></a> <span class="kw">SELECT</span> <span class="kw">oid</span> <span class="kw">as</span> contour_id, ST_Simplify(input_contours.geom, <span class="fl">0.2</span>) <span class="kw">AS</span> geom </span>
<span id="cb2-14"><a href="#cb2-14"></a> <span class="kw">FROM</span> input_contours, aoi</span>
<span id="cb2-15"><a href="#cb2-15"></a> <span class="kw">WHERE</span> input_contours.geom && aoi.geom</span>
<span id="cb2-16"><a href="#cb2-16"></a> ), </span>
<span id="cb2-17"><a href="#cb2-17"></a> <span class="co">-- Explode points generating index and keeping track of original curve</span></span>
<span id="cb2-18"><a href="#cb2-18"></a> points <span class="kw">AS</span> (</span>
<span id="cb2-19"><a href="#cb2-19"></a> <span class="kw">SELECT</span> contour_id,</span>
<span id="cb2-20"><a href="#cb2-20"></a> generate_series(<span class="dv">1</span>, st_numpoints(geom)) <span class="kw">AS</span> npoint,</span>
<span id="cb2-21"><a href="#cb2-21"></a> (ST_DumpPoints(geom)).geom <span class="kw">AS</span> geom</span>
<span id="cb2-22"><a href="#cb2-22"></a> <span class="kw">FROM</span> simplified</span>
<span id="cb2-23"><a href="#cb2-23"></a> ), </span>
<span id="cb2-24"><a href="#cb2-24"></a> <span class="co">-- Get the numeric values for X an Y of the current point </span></span>
<span id="cb2-25"><a href="#cb2-25"></a> coords <span class="kw">AS</span> (</span>
<span id="cb2-26"><a href="#cb2-26"></a> <span class="kw">SELECT</span> <span class="op">*</span>, st_x(geom):<span class="ch">:numeric</span> <span class="kw">AS</span> cx, st_y(geom):<span class="ch">:numeric</span> <span class="kw">AS</span> cy</span>
<span id="cb2-27"><a href="#cb2-27"></a> <span class="kw">FROM</span> points </span>
<span id="cb2-28"><a href="#cb2-28"></a> <span class="kw">ORDER</span> <span class="kw">BY</span> contour_id, npoint</span>
<span id="cb2-29"><a href="#cb2-29"></a> ),</span>
<span id="cb2-30"><a href="#cb2-30"></a> <span class="co">-- Add the values of the 2 previous points inside the same linestring</span></span>
<span id="cb2-31"><a href="#cb2-31"></a> <span class="co">-- LAG and PARTITION BY do all the work here.</span></span>
<span id="cb2-32"><a href="#cb2-32"></a> segments <span class="kw">AS</span> (</span>
<span id="cb2-33"><a href="#cb2-33"></a> <span class="kw">SELECT</span> <span class="op">*</span>, </span>
<span id="cb2-34"><a href="#cb2-34"></a> <span class="fu">LAG</span>(geom, <span class="dv">1</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> prev_geom, </span>
<span id="cb2-35"><a href="#cb2-35"></a> <span class="fu">LAG</span>(cx:<span class="ch">:numeric</span>, <span class="dv">2</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> ax, </span>
<span id="cb2-36"><a href="#cb2-36"></a> <span class="fu">LAG</span>(cy:<span class="ch">:numeric</span>, <span class="dv">2</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> ay, </span>
<span id="cb2-37"><a href="#cb2-37"></a> <span class="fu">LAG</span>(cx:<span class="ch">:numeric</span>, <span class="dv">1</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> bx, </span>
<span id="cb2-38"><a href="#cb2-38"></a> <span class="fu">LAG</span>(cy:<span class="ch">:numeric</span>, <span class="dv">1</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> <span class="kw">by</span></span>
<span id="cb2-39"><a href="#cb2-39"></a> <span class="kw">FROM</span> coords</span>
<span id="cb2-40"><a href="#cb2-40"></a> <span class="kw">ORDER</span> <span class="kw">BY</span> contour_id, npoint</span>
<span id="cb2-41"><a href="#cb2-41"></a> ),</span>
<span id="cb2-42"><a href="#cb2-42"></a> det <span class="kw">AS</span> (</span>
<span id="cb2-43"><a href="#cb2-43"></a> <span class="kw">SELECT</span> <span class="op">*</span>, </span>
<span id="cb2-44"><a href="#cb2-44"></a> (((bx<span class="op">-</span>ax)<span class="op">*</span>(cy<span class="op">-</span>ay)) <span class="op">-</span> ((<span class="kw">by</span><span class="op">-</span>ay)<span class="op">*</span>(cx<span class="op">-</span>ax))) <span class="kw">AS</span> det <span class="co">-- cross product in 2d</span></span>
<span id="cb2-45"><a href="#cb2-45"></a> <span class="kw">FROM</span> segments</span>
<span id="cb2-46"><a href="#cb2-46"></a> ),</span>
<span id="cb2-47"><a href="#cb2-47"></a> <span class="co">-- Uses the SIGN multipliaction as a proxy for XOR (change in convexity) </span></span>
<span id="cb2-48"><a href="#cb2-48"></a> convexity <span class="kw">AS</span> (</span>
<span id="cb2-49"><a href="#cb2-49"></a> <span class="kw">SELECT</span> <span class="op">*</span>, </span>
<span id="cb2-50"><a href="#cb2-50"></a> <span class="fu">SIGN</span>(det) <span class="op">*</span> <span class="fu">SIGN</span>(<span class="fu">lag</span>(det, <span class="dv">1</span>) <span class="kw">OVER</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id)) <span class="kw">AS</span> <span class="kw">change</span></span>
<span id="cb2-51"><a href="#cb2-51"></a> <span class="kw">FROM</span> det</span>
<span id="cb2-52"><a href="#cb2-52"></a> )</span>
<span id="cb2-53"><a href="#cb2-53"></a><span class="kw">SELECT</span> contour_id, npoint, prev_geom <span class="kw">AS</span> geom</span>
<span id="cb2-54"><a href="#cb2-54"></a><span class="kw">FROM</span> convexity</span>
<span id="cb2-55"><a href="#cb2-55"></a><span class="kw">WHERE</span> <span class="kw">change</span> <span class="op">=</span> <span class="op">-</span><span class="dv">1</span></span>
<span id="cb2-56"><a href="#cb2-56"></a><span class="kw">ORDER</span> <span class="kw">BY</span> contour_id, npoint</span></code></pre></div>
<p>Here’s what the results look like for a sample area:</p>
<figure>
<img src="/images/curve_inflection_2.png" title="Sample of curve inflection points results" class="center" alt="" /><figcaption> </figcaption>
</figure>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/postgres.html">postgres</a>, <a href="/tags/postgis.html">postgis</a>, <a href="/tags/curve.html">curve</a>, <a href="/tags/inflection.html">inflection</a>, <a href="/tags/GIS.html">GIS</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/02/06/postgis-curve-inflection.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Sun, 06 Feb 2022 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2022/02/06/postgis-curve-inflection.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Introspection in PostgreSQL</title>
<link>http://jarnaldich.me/blog/2021/08/30/postgres-introspection.html</link>
<description><![CDATA[<h1>Introspection in PostgreSQL</h1>
<small>Posted on August 30, 2021 <a href="/blog/2021/08/30/postgres-introspection.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p> </p>
<figure>
<img src="/images/introspection.png" title="Detail of Alexander Stirling Calder Introspection (c. 1935)" class="wrap" alt="" /><figcaption> </figcaption>
</figure>
<p>In coding, introspection refers to the ability of some systems to query and expose information on their own structure. Typical examples are being able to query an object’s methods or properties (e.g. Python’s <code>__dict__</code>).</p>
<p>In a DB system, it typically refers to the mechanism by which schema information regarding tables, attributes, foreign keys, indices, data types, etc… can be programmatically queried.</p>
<p>This is useful in many ways, e.g.:</p>
<ul>
<li>Code reuse: writing code that is schema-agnostic. For example, <a href="https://github.com/adrianandrei-ca/pgunit">pgunit</a>, a NUnit-style testing framework for PostgreSQL, automatically searches for functions whose names start with <code>test_</code> (see the sketch after this list).</li>
<li>Discovery and research of the structure of an ill-documented or legacy database.</li>
</ul>
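<p>As a taste, the kind of catalog lookup a framework like pgunit can perform boils down to something like this sketch (not its actual code):</p>
<pre><code>-- Find all functions whose name starts with test_
-- (the backslash escapes the LIKE wildcard _).
SELECT n.nspname AS schema, p.proname AS function
FROM pg_catalog.pg_proc p
JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE p.proname LIKE 'test\_%';</code></pre>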
<p>In this article we will explore some options for making use of the introspection capabilities of PostgreSQL.</p>
<h2 id="information-schema-vs-system-catalogs">Information schema vs system catalogs</h2>
<p>There are two main devices to query information about the objects defined in a Postgres database. The first one is the information schema, which is defined in the SQL standard and thus expected to be portable and to remain stable, but it cannot provide information about postgres-specific features. As with many aspects of the SQL standard, there are vendor-specific issues (most notably, Oracle does not implement it out of the box). If you are using introspection as part of a library and do not need postgres-specific information, this approach gives you a better chance of future compatibility across RDBMSs and even PostgreSQL versions.</p>
<p>The other approach involves querying the so-called <a href="https://www.postgresql.org/docs/13/catalogs.html">System Catalogs</a>. These are tables belonging to the <code>pg_catalog</code> schema. For example, the <code>pg_catalog.pg_class</code> (pseudo-)table catalogs tables and almost everything else that has columns or is otherwise similar to a table (views, materialized or not…). This approach is version-dependent, but I would be surprised to see major changes in the near future.</p>
<p>This is the approach we will be focusing on in this article, because the tooling and coding ergonomics PostgreSQL provides for it are more convenient, as you will see in the next sections.</p>
<h2 id="use-the-command-line-luke">Use the command-line, Luke</h2>
<p>The <code>psql</code> command-line client is a very powerful and often overlooked utility (as are many other command-line tools). Typing <code>\?</code> after connecting will show a plethora of commands that let you inspect the DB. What most people do not know, though, is that these commands are implemented as regular SQL queries against the system catalogs and that <strong>you can actually see the code</strong> just by invoking the <code>psql</code> client with the <code>-E</code> option. For example:</p>
<pre><code>PGPASSWORD=<password> psql -E -U <user> -h <host> <db></code></pre>
<p>And then asking for the description of the <code>pg_catalog.pg_class</code> table itself:</p>
<pre><code>\dt+ pg_catalog.pg_class</code></pre>
<p>yields:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb3-1"><a href="#cb3-1"></a><span class="op">*********</span> <span class="kw">QUERY</span> <span class="op">**********</span></span>
<span id="cb3-2"><a href="#cb3-2"></a><span class="kw">SELECT</span> n.nspname <span class="kw">as</span> <span class="ot">"Schema"</span>,</span>
<span id="cb3-3"><a href="#cb3-3"></a> c.relname <span class="kw">as</span> <span class="ot">"Name"</span>,</span>
<span id="cb3-4"><a href="#cb3-4"></a> <span class="cf">CASE</span> c.relkind </span>
<span id="cb3-5"><a href="#cb3-5"></a> <span class="cf">WHEN</span> <span class="st">'r'</span> <span class="cf">THEN</span> <span class="st">'table'</span> </span>
<span id="cb3-6"><a href="#cb3-6"></a> <span class="cf">WHEN</span> <span class="st">'v'</span> <span class="cf">THEN</span> <span class="st">'view'</span> </span>
<span id="cb3-7"><a href="#cb3-7"></a> <span class="cf">WHEN</span> <span class="st">'m'</span> <span class="cf">THEN</span> <span class="st">'materialized view'</span> </span>
<span id="cb3-8"><a href="#cb3-8"></a> <span class="cf">WHEN</span> <span class="st">'i'</span> <span class="cf">THEN</span> <span class="st">'index'</span> </span>
<span id="cb3-9"><a href="#cb3-9"></a> <span class="cf">WHEN</span> <span class="st">'S'</span> <span class="cf">THEN</span> <span class="st">'sequence'</span> </span>
<span id="cb3-10"><a href="#cb3-10"></a> <span class="cf">WHEN</span> <span class="st">'s'</span> <span class="cf">THEN</span> <span class="st">'special'</span> </span>
<span id="cb3-11"><a href="#cb3-11"></a> <span class="cf">WHEN</span> <span class="st">'f'</span> <span class="cf">THEN</span> <span class="st">'foreign table'</span> </span>
<span id="cb3-12"><a href="#cb3-12"></a> <span class="cf">WHEN</span> <span class="st">'p'</span> <span class="cf">THEN</span> <span class="st">'partitioned table'</span> </span>
<span id="cb3-13"><a href="#cb3-13"></a> <span class="cf">WHEN</span> <span class="st">'I'</span> <span class="cf">THEN</span> <span class="st">'partitioned index'</span> </span>
<span id="cb3-14"><a href="#cb3-14"></a> <span class="cf">END</span> <span class="kw">as</span> <span class="ot">"Type"</span>,</span>
<span id="cb3-15"><a href="#cb3-15"></a> pg_catalog.pg_get_userbyid(c.relowner) <span class="kw">as</span> <span class="ot">"Owner"</span>,</span>
<span id="cb3-16"><a href="#cb3-16"></a> pg_catalog.pg_size_pretty(pg_catalog.pg_table_size(c.<span class="kw">oid</span>)) <span class="kw">as</span> <span class="ot">"Size"</span>,</span>
<span id="cb3-17"><a href="#cb3-17"></a> pg_catalog.obj_description(c.<span class="kw">oid</span>, <span class="st">'pg_class'</span>) <span class="kw">as</span> <span class="ot">"Description"</span></span>
<span id="cb3-18"><a href="#cb3-18"></a><span class="kw">FROM</span> pg_catalog.pg_class c</span>
<span id="cb3-19"><a href="#cb3-19"></a> <span class="kw">LEFT</span> <span class="kw">JOIN</span> pg_catalog.pg_namespace n <span class="kw">ON</span> n.<span class="kw">oid</span> <span class="op">=</span> c.relnamespace</span>
<span id="cb3-20"><a href="#cb3-20"></a><span class="kw">WHERE</span> c.relkind <span class="kw">IN</span> (<span class="st">'r'</span>,<span class="st">'p'</span>,<span class="st">'s'</span>,<span class="st">''</span>)</span>
<span id="cb3-21"><a href="#cb3-21"></a> <span class="kw">AND</span> n.nspname !~ <span class="st">'^pg_toast'</span></span>
<span id="cb3-22"><a href="#cb3-22"></a> <span class="kw">AND</span> c.relname <span class="kw">OPERATOR</span>(pg_catalog.~) <span class="st">'^(pg_class)$'</span></span>
<span id="cb3-23"><a href="#cb3-23"></a> <span class="kw">AND</span> n.nspname <span class="kw">OPERATOR</span>(pg_catalog.~) <span class="st">'^(pg_catalog)$'</span></span>
<span id="cb3-24"><a href="#cb3-24"></a><span class="kw">ORDER</span> <span class="kw">BY</span> <span class="dv">1</span>,<span class="dv">2</span>;</span>
<span id="cb3-25"><a href="#cb3-25"></a><span class="op">**************************</span></span>
<span id="cb3-26"><a href="#cb3-26"></a></span>
<span id="cb3-27"><a href="#cb3-27"></a> <span class="kw">List</span> <span class="kw">of</span> relations</span>
<span id="cb3-28"><a href="#cb3-28"></a> <span class="kw">Schema</span> | Name | <span class="kw">Type</span> | Owner | <span class="kw">Size</span> | Description</span>
<span id="cb3-29"><a href="#cb3-29"></a><span class="co">------------|----------|-------|----------|--------|-------------</span></span>
<span id="cb3-30"><a href="#cb3-30"></a> pg_catalog | pg_class | <span class="kw">table</span> | postgres | <span class="dv">136</span> kB |</span>
<span id="cb3-31"><a href="#cb3-31"></a>(<span class="dv">1</span> <span class="kw">row</span>)</span></code></pre></div>
<p>This gives you a quite descriptive (and corner-case-complete) template to start your own code from. For example, in the former query we could replace the <code>^(pg_class)$</code> regex with some other filter. Bear in mind that this trick is only helpful with the system catalog approach.</p>
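<p>For instance, trimming that template down and swapping the name filter for a schema filter gives a quick listing of ordinary tables in <code>public</code>:</p>
<pre><code>-- The same skeleton, filtered by schema instead of by name.
SELECT n.nspname AS "Schema", c.relname AS "Name"
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname OPERATOR(pg_catalog.~) '^(public)$';</code></pre>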
<h2 id="regclasses-and-oids">Regclasses and OIDs</h2>
<p>Many objects in the system catalogs have a unique id in the form of an <code>oid</code> attribute. It is sometimes convenient to know that you can turn descriptive names into such <code>oid</code>s by casting to the <code>regclass</code> data type.</p>
<p>For example, in a somewhat circular turn of events, the attributes of the catalog table storing attribute information can be queried by name as:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb4-1"><a href="#cb4-1"></a><span class="kw">SELECT</span> attnum, attname, format_type(atttypid, atttypmod) <span class="kw">as</span> <span class="ot">"Type"</span> </span>
<span id="cb4-2"><a href="#cb4-2"></a><span class="kw">FROM</span> pg_attribute </span>
<span id="cb4-3"><a href="#cb4-3"></a><span class="kw">WHERE</span> attrelid <span class="op">=</span> <span class="st">'pg_attribute'</span>:<span class="ch">:regclass</span> </span>
<span id="cb4-4"><a href="#cb4-4"></a> <span class="kw">AND</span> attnum <span class="op">></span> <span class="dv">0</span> </span>
<span id="cb4-5"><a href="#cb4-5"></a> <span class="kw">AND</span> <span class="kw">NOT</span> attisdropped <span class="kw">ORDER</span> <span class="kw">BY</span> attnum;</span></code></pre></div>
<p>In the result of that query, we can see that attrelid should be an <code>oid</code>:</p>
<pre><code>attnum | attname | Type
-----------|---------------|-----------
1 | attrelid | oid
2 | attname | name
...
20 | attoptions | text[]
21 | attfdwoptions | text[]</code></pre>
<p>Without the <code>regclass</code> cast, querying by name would mean joining with <code>pg_class</code> and filtering by name. There are other types that will get you an oid from a string description for other kinds of objects (<code>regprocedure</code> for procedures, <code>regtype</code> for types, …).</p>
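<p>All of them work the same way; for example:</p>
<pre><code>SELECT 'pg_class'::regclass::oid        AS table_oid,     -- 1259
       'lower(text)'::regprocedure::oid AS function_oid,
       'integer'::regtype::oid          AS type_oid;      -- 23</code></pre>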
<h2 id="system-catalog-information-functions">System Catalog Information Functions</h2>
<p>Another interesting utility of the <code>pg_catalog</code> approach is the ability to translate catalog definitions back into SQL DDL. We saw one such function (<code>format_type</code>) in the previous example, but there are many more (for constraints, function source code…).</p>
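<p>A few of them in action; each rebuilds the textual definition of an existing catalog object:</p>
<pre><code>SELECT pg_get_viewdef('pg_catalog.pg_tables'::regclass, true); -- a view's body
SELECT pg_get_indexdef(indexrelid) FROM pg_index LIMIT 1;      -- an index's DDL
SELECT pg_get_constraintdef(oid) FROM pg_constraint LIMIT 1;   -- a constraint's DDL</code></pre>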
<p>Just refer to the <a href="https://www.postgresql.org/docs/13/functions-info.html#FUNCTIONS-INFO-CATALOG-TABLE">section in the manual</a> for more.</p>
<h2 id="inspecting-arbitrary-queries">Inspecting arbitrary queries</h2>
<p>As a side note, it might be useful to know that we can inspect the data types of any provided query by pretending to turn it into a temporary table. This can be handy for user-provided queries in external tools (injection caveats apply)…</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb6-1"><a href="#cb6-1"></a><span class="kw">CREATE</span> TEMP <span class="kw">TABLE</span> tmp <span class="kw">AS</span> <span class="kw">SELECT</span> <span class="dv">1</span>:<span class="ch">:numeric</span>, now() <span class="kw">LIMIT</span> <span class="dv">0</span>;</span></code></pre></div>
<h2 id="wrapping-up">Wrapping up</h2>
<p>As usual, <strong>good SW practices apply to DB code, too</strong>, and it is easy to isolate any incompatible code by defining a clear interface in your library: instead of querying the catalog everywhere, define a set of views or functions that expose the introspection information to the rest of your code and work as an API. This way, any future change in the system catalogs will not propagate further than those specific views. For example, if your application needs to know about tables and attribute data types, define a view that works as an interface between the system catalogs and your code:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb7-1"><a href="#cb7-1"></a><span class="kw">CREATE</span> <span class="kw">OR</span> <span class="kw">REPLACE</span> <span class="kw">VIEW</span> table_columns <span class="kw">AS</span></span>
<span id="cb7-2"><a href="#cb7-2"></a><span class="kw">WITH</span> table_oids <span class="kw">AS</span> (</span>
<span id="cb7-3"><a href="#cb7-3"></a> <span class="kw">SELECT</span> c.relname, c.<span class="kw">oid</span></span>
<span id="cb7-4"><a href="#cb7-4"></a> <span class="kw">FROM</span> pg_catalog.pg_class c</span>
<span id="cb7-5"><a href="#cb7-5"></a> <span class="kw">LEFT</span> <span class="kw">JOIN</span> pg_catalog.pg_namespace n <span class="kw">ON</span> n.<span class="kw">oid</span> <span class="op">=</span> c.relnamespace</span>
<span id="cb7-6"><a href="#cb7-6"></a> <span class="kw">WHERE</span> </span>
<span id="cb7-7"><a href="#cb7-7"></a> pg_catalog.pg_table_is_visible(c.<span class="kw">oid</span>) <span class="kw">AND</span> relkind <span class="op">=</span> <span class="st">'r'</span>),</span>
<span id="cb7-8"><a href="#cb7-8"></a> column_types <span class="kw">AS</span> (</span>
<span id="cb7-9"><a href="#cb7-9"></a> <span class="kw">SELECT</span></span>
<span id="cb7-10"><a href="#cb7-10"></a> toids.relname <span class="kw">AS</span> <span class="ot">"tablename"</span>, </span>
<span id="cb7-11"><a href="#cb7-11"></a> a.attname <span class="kw">as</span> <span class="ot">"column"</span>,</span>
<span id="cb7-12"><a href="#cb7-12"></a> pg_catalog.format_type(a.atttypid, a.atttypmod) <span class="kw">as</span> <span class="ot">"datatype"</span></span>
<span id="cb7-13"><a href="#cb7-13"></a> <span class="kw">FROM</span></span>
<span id="cb7-14"><a href="#cb7-14"></a> pg_catalog.pg_attribute a, table_oids toids</span>
<span id="cb7-15"><a href="#cb7-15"></a> <span class="kw">WHERE</span></span>
<span id="cb7-16"><a href="#cb7-16"></a> a.attnum <span class="op">></span> <span class="dv">0</span></span>
<span id="cb7-17"><a href="#cb7-17"></a> <span class="kw">AND</span> <span class="kw">NOT</span> a.attisdropped</span>
<span id="cb7-18"><a href="#cb7-18"></a> <span class="kw">AND</span> a.attrelid <span class="op">=</span> toids.<span class="kw">oid</span>)</span>
<span id="cb7-19"><a href="#cb7-19"></a><span class="kw">SELECT</span> <span class="op">*</span> <span class="kw">FROM</span> column_types;</span></code></pre></div>
<p>I will keep assembling such utility views as I find them useful in <a href="https://gist.github.com/jarnaldich/d5952a134d89dfac48d034ed141e86c5">this gist</a>.</p>
<p><strong>UPDATE Dec. 15th 2022:</strong> For any real use case, check <em>syonfox</em>’s solution (see comments) documented <a href="https://gist.github.com/jarnaldich/d5952a134d89dfac48d034ed141e86c5?permalink_comment_id=4401600">here</a>. It is way more powerful than my solution above, which I’ll only leave here just to keep things simple in this article.</p>
<div class="panel panel-default">
<div class="panel-body">
<div class="pull-left">
Tags: <a href="/tags/postgres.html">postgres</a>, <a href="/tags/introspection.html">introspection</a>, <a href="/tags/database.html">database</a>
</div>
<div class="social pull-right">
<span class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2021/08/30/postgres-introspection.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
</span>
<script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
<span>
<g:plusone href="https://www.example.com/blog/2013/12/14/parallel-voronoi-in-haskell/"
size="medium"></g:plusone>
</span>
</div>
</div>
</div>
]]></description>
<pubDate>Mon, 30 Aug 2021 00:00:00 UT</pubDate>
<guid>http://jarnaldich.me/blog/2021/08/30/postgres-introspection.html</guid>
<dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
<title>Optimizing Geospatial Workloads</title>
<link>http://jarnaldich.me/blog/2020/02/29/optimizing-geospatial-workloads.html</link>
<description><![CDATA[<h1>Optimizing Geospatial Workloads</h1>
<small>Posted on February 29, 2020 <a href="/blog/2020/02/29/optimizing-geospatial-workloads.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>
<p>Large-area geospatial processing often involves splitting the area into smaller working tiles that can be processed or downloaded independently. As an example, the 25cm resolution orthophoto production in Catalonia is divided into 4275 rectangular tiles, as seen in the following image.</p>
<figure>
<img src="/images/tiles5k.png" title="Orthophoto Tiling" class="center" alt="Orthophoto tiling of Catalonia" />
</figure>
<p>Whenever a process can be applied to those tiles independently (i.e., not depending on their neighborhood), parallel processing is an easy way to increase throughput. In such environments, the total workload has to be distributed among a fixed, often limited, number of processing units (be they cores or computers). If the scheduling mechanism requires a predefined batch to be assigned to each unit (or if there is no scheduling mechanism at all), and the processing units have similar processing power, then the maximum speedup is attained when all batches contain an equal number of tiles (with the 4275 tiles above and 8 workers, for instance, the best assignment gives each worker 534 or 535 tiles).</p>
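<p>As a minimal sketch of the balance requirement alone (this is <em>not</em> the model developed below, and <code>nbatches = 8</code> is an arbitrary assumption), the even-split condition can be stated in MiniZinc as a per-batch cardinality bound:</p>
<div class="sourceCode"><pre class="sourceCode minizinc"><code class="sourceCode minizinc">include "globals.mzn";

int: ntiles = 4275;      % total number of tiles
int: nbatches = 8;       % assumed number of processing units
% ceil(ntiles / nbatches), the size of a balanced batch
int: max_size = (ntiles + nbatches - 1) div nbatches;

% batch[t] = processing unit assigned to tile t
array[1..ntiles] of var 1..nbatches: batch;

% no batch may exceed the balanced size
constraint forall(b in 1..nbatches)(count(batch, b) <= max_size);

solve satisfy;</code></pre></div>
<p>The spatial-continuity requirement is what makes the real problem interesting, and is what the model below addresses.</p>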
<p>Furthermore, since the result often has to be mosaicked in order to inspect it, or aggregated into a larger final product, it is desirable for each batch to keep spatial continuity, ideally forming an axis-parallel rectangle, since that is the basic form of georeference for projected geospatial imagery.</p>
<h2 id="the-problem">The problem</h2>
<p>This is a discrete optimization problem, so it can be tackled with the standard machinery. Since I had been dusting off my <a href="https://www.minizinc.org">MiniZinc</a> skills through Coursera’s discrete optimization series, I decided to give it a go.</p>
<h3 id="tile-scheme-representation">Tile scheme representation</h3>
<p>For convenience, the list of valid tiles can be read from an external <code>.dzn</code> data file.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb1-1"><a href="#cb1-1"></a>ntiles = <span class="fl">4275</span>;</span>
<span id="cb1-2"><a href="#cb1-2"></a>Tiles = [| <span class="fl">253</span>, <span class="fl">055</span></span>
<span id="cb1-3"><a href="#cb1-3"></a> | <span class="fl">254</span>, <span class="fl">055</span></span>
<span id="cb1-4"><a href="#cb1-4"></a> | <span class="fl">253</span>, <span class="fl">056</span></span>
<span id="cb1-5"><a href="#cb1-5"></a> | <span class="fl">254</span>, <span class="fl">056</span></span>
<span id="cb1-6"><a href="#cb1-6"></a> | <span class="fl">255</span>, <span class="fl">055</span></span>
<span id="cb1-7"><a href="#cb1-7"></a> | <span class="fl">255</span>, <span class="fl">056</span></span>
<span id="cb1-8"><a href="#cb1-8"></a> | <span class="fl">256</span>, <span class="fl">056</span></span>