<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Adrián Abreu</title><link>https://adrianabreu.com/</link><description>Recent content on Adrián Abreu</description><generator>Hugo</generator><language>es-ES</language><copyright>2017-2024 Adrián Abreu powered by Hugo and Kiss Theme</copyright><lastBuildDate>Tue, 27 Jan 2026 11:40:43 +0000</lastBuildDate><atom:link href="https://adrianabreu.com/index.xml" rel="self" type="application/rss+xml"/><item><title>Stuck connecting snowflake integrated mcp and cursor</title><link>https://adrianabreu.com/blog/2026-01-27-stuck-snowflake-mcp-cursor/</link><pubDate>Tue, 27 Jan 2026 11:40:43 +0000</pubDate><guid>https://adrianabreu.com/blog/2026-01-27-stuck-snowflake-mcp-cursor/</guid><description><p>Lately I&rsquo;m not exactly enjoying my job. Snowflake&rsquo;s learning curve is really steep from a data engineer&rsquo;s perspective.</p>
<p>Yesterday my boss asked me to test the Snowflake integrated MCP from Cursor. He’d been looking into Snowflake OAuth access, so I went to check the resources. Once again—as of January 27th, 2026—the Terraform resource for this still doesn&rsquo;t exist (though the data source does, for some reason).</p>
<p>To connect to Snowflake you need to define a <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-agents-mcp#set-up-oauth-authentication">security integration</a>.</p></description></item><item><title>Snowflake questionable choices</title><link>https://adrianabreu.com/blog/2026-01-14-snowflake-questionable-choices/</link><pubDate>Wed, 14 Jan 2026 20:14:10 +0000</pubDate><guid>https://adrianabreu.com/blog/2026-01-14-snowflake-questionable-choices/</guid><description><p>I switched jobs once again. I&rsquo;m working at a data company called Shalion, where we provide insights about retail products online. If you&rsquo;re brave enough to handle Snowflake&rsquo;s quirks, <a href="https://shalion.teamtailor.com/jobs/6328344-data-engineer">we&rsquo;re hiring</a>&hellip;</p>
<p>I&rsquo;m working on revamping our permission grant system, moving from <a href="https://gitlab.com/gitlab-data/permifrost">permifrost</a> to a Terraform-managed approach.</p>
<p>We also decided to embrace a two-layer permission grant, using Access Roles (AR) and Functional Roles (FR). The former have access to the resources with the typical permissions (SELECT, USAGE, CREATE&hellip;), while the latter are the ones we assign to people and simply group access roles.</p></description></item><item><title>Learning again</title><link>https://adrianabreu.com/blog/2026-02-13-learn-to-learn-again/</link><pubDate>Tue, 06 Jan 2026 20:14:10 +0000</pubDate><guid>https://adrianabreu.com/blog/2026-02-13-learn-to-learn-again/</guid><description><p>2025 has been a pretty crazy year.</p>
<p>I spent most of the year laying the foundations of a startup in the security world, doing frontend, backend, a lot of devops and&hellip; little data.</p>
<p>The AI era caught me head-on and, as always, the grass is greener on the other side. I lived through the v0 bubble, and my work with it was fairly miserable. As the project grew, it stopped being functional, and any manual change that fixed it was incompatible.</p></description></item><item><title>I’m Building Stuff – My New Motto</title><link>https://adrianabreu.com/blog/2025-01-13-building-stuff/</link><pubDate>Mon, 13 Jan 2025 20:07:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2025-01-13-building-stuff/</guid><description><p>For the past seven years, I worked in data, and I have mixed feelings about it. I still believe data is the most important part of any app, but it’s meaningless without the app itself.</p>
<p>Now that I’m working at a startup, I’ve decided to focus on building things. To start, I revisited one of my older projects: a PDF parser about professor designations in the Canary Islands, where one of my best friends works as a teacher.</p></description></item><item><title>Finding pet projects</title><link>https://adrianabreu.com/blog/2024-08-13-finding-pet-projects/</link><pubDate>Tue, 13 Aug 2024 07:00:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2024-08-13-finding-pet-projects/</guid><description><p>As my company undergoes layoffs, I&rsquo;m back on the job hunt. While I&rsquo;m in the field of data, I often find myself missing the hands-on experience that comes from personal projects. I realized that I&rsquo;m not practicing all the skills I need.</p>
<p>During a recent interview, I was asked about my experience with sending reports via email—something I hadn’t done in a few years. That got me thinking: could I turn this into a pet project?</p></description></item><item><title>Developing on windows</title><link>https://adrianabreu.com/blog/2024-03-22-developing-on-windows/</link><pubDate>Fri, 22 Mar 2024 18:06:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2024-03-22-developing-on-windows/</guid><description><p>Over the years, I&rsquo;ve been using MacOS at work and Ubuntu at home for my development tasks. However, my Lenovo P1 Gen 3 laptop didn&rsquo;t work well with Linux, leading to frequent issues with the camera and graphics (screen flickering, I&rsquo;m looking at you, and it hurts).</p>
<p>I&rsquo;d tried Windows Subsystem for Linux (WSL) before, but it was quite bad, to be honest. Having heard of WSL2 and WSLg, I decided to give it another shot.</p></description></item><item><title>Querying the databricks api</title><link>https://adrianabreu.com/blog/2024-01-26-querying-the-databricks-api/</link><pubDate>Fri, 26 Jan 2024 09:06:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2024-01-26-querying-the-databricks-api/</guid><description><p>Exploring databricks SQL usage</p>
<p>At my company, we adopted databricks SQL for most of our users. Some users have developed applications that use the JDBC connector, some users have built their dashboards, and some users write plain ad-hoc queries.</p>
<p>We wanted to know what they queried, so we tried to use Unity Catalog&rsquo;s insights, but it wasn&rsquo;t enough for our case. We work with IoT and we are interested in what filters they apply within our tables.</p></description></item><item><title>Tweaking Spark Kafka</title><link>https://adrianabreu.com/blog/2023-10-27-tweaking-spark-kafka/</link><pubDate>Fri, 27 Oct 2023 12:06:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2023-10-27-tweaking-spark-kafka/</guid><description><p>Well, I&rsquo;m facing a hugely interesting case. I&rsquo;m working at Wallbox where we need to deal with billions of rows every day. Now we need to use Spark for some Kafka filtering and publish the results into different topics according to some rules.</p>
<p>I won&rsquo;t dig deep into the logic except for performance-related stuff. Let&rsquo;s try to increase the processing speed.</p>
<p>When reading from Kafka you usually get 1 task per partition, so if you have 6 partitions and 48 cores you are leaving 87.5 percent of your cluster idle. That can be adjusted with the <code>minPartitions</code> property.</p></description></item><item><title>KSQL, a horror tale</title><link>https://adrianabreu.com/blog/2023-10-22-ksql-a-horror-tale/</link><pubDate>Sat, 21 Oct 2023 22:52:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2023-10-22-ksql-a-horror-tale/</guid><description><p>After spending several weeks working on a ksql solution to filter billions of events and determine their destination topic, I was disappointed to find that it did not live up to my expectations.</p>
<p>I had hoped for a more robust product that would align with our needs. Previously, we utilized a similar filter in Spark, incurring traffic costs for both Confluent and AWS. With kSQL, the advantage was that we could avoid paying for AWS traffic.</p></description></item><item><title>Repairing metadata unity catalog</title><link>https://adrianabreu.com/blog/2023-10-02-repairing-metadata-unity-catalog/</link><pubDate>Mon, 02 Oct 2023 13:25:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2023-10-02-repairing-metadata-unity-catalog/</guid><description><p>I&rsquo;ve been subscribed to <a href="https://www.dataengineeringweekly.com/p/data-engineering-weekly-148">https://www.dataengineeringweekly.com/p/data-engineering-weekly-148</a> for years. The latest issue included several on-call posts on Medium. I found these quite useful.</p>
<p>Today, I got an alert from Metaplane that a cost monitor dashboard was out of date. I checked the processes, and everything was fine. I ran a query to check the freshness of the data and it was ok too.</p>
<p>Metaplane checks our delta table freshness by querying the table information available in the Unity Catalog. For some unknown reason that metadata didn&rsquo;t receive any update. I ran an optimization operation (the table is tiny) and the metadata didn&rsquo;t update either.</p></description></item><item><title>Adding extra params on DatabricksRunNowOperator</title><link>https://adrianabreu.com/blog/2023-07-28-extra_params_databricksrunnow/</link><pubDate>Fri, 28 Jul 2023 16:00:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2023-07-28-extra_params_databricksrunnow/</guid><description><p>With the <a href="https://docs.databricks.com/api/workspace/jobs/runnow">new Databricks jobs API 2.1</a> you have different parameters depending on the kind of tasks you have in your workflow, such as jar_params, sql_params, python_params, notebook_params&hellip;</p>
<p>The Airflow operator is not always ready to handle all of them. If we check the <a href="https://airflow.apache.org/docs/apache-airflow-providers-databricks/stable/operators/run_now.html">current release of the DatabricksRunNowOperator</a>, we can see that there is only support for:</p>
<ul>
<li>notebook_params</li>
<li>python_params</li>
<li>python_named_params</li>
<li>jar_params</li>
<li>spark_submit_params</li>
</ul>
<p>And not the sql_params mentioned earlier. But there is a way of combining both: there is a param called <em>json</em> that allows you to write the payload of a Databricks run-now request, and it will also merge the content of the JSON with your named params!</p></description></item><item><title>Enabling Unity Catalog</title><link>https://adrianabreu.com/blog/2023-05-23-enabling-unity-catalog/</link><pubDate>Tue, 23 May 2023 07:48:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2023-05-23-enabling-unity-catalog/</guid><description><p>I&rsquo;ve spent the last few weeks setting up Unity Catalog for my company. It&rsquo;s been an extremely tiring process, and there are several concepts to cover here. My main goal is to give a clear view of the requirements.</p>
<p>Disclaimer: as of today, with <a href="https://github.com/databricks/terraform-provider-databricks">https://github.com/databricks/terraform-provider-databricks</a> release 1.17.0, some steps have to be done in an &ldquo;awkward way&rdquo;: the account API does not expose the catalog&rsquo;s endpoint, so they must be done through a workspace.</p></description></item><item><title>Duplicates with delta, how can it be?</title><link>https://adrianabreu.com/blog/2023-03-20-delta-duplicates/</link><pubDate>Mon, 20 Mar 2023 09:50:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2023-03-20-delta-duplicates/</guid><description><p>It&rsquo;s been a long time since I last wrote!
The highlights: I left my job at <strong>Schwarz IT</strong> in December last year, and now I&rsquo;m a full-time employee at Wallbox! I&rsquo;m really happy with my new job, and I&rsquo;ve run into interesting stuff. This was one of those strange cases where you start doubting the compiler.</p>
<h2 id="context">Context</h2>
<p>One of my main tables represents sensor measures from our chargers with millisecond precision. The numbers are quite high: we are talking about over 2 billion rows per day, so the analytic model doesn&rsquo;t handle that level of granularity.
The analyst created a table that takes 5-minute windows, selects some specific sensors and writes their values as columns. To keep the data consistent they were generating fake rows between sessions, so if a value was missing a synthetic value would be put in place.</p></description></item><item><title>Testing Databricks Photon</title><link>https://adrianabreu.com/blog/2022-08-12-testing-photon-engine/</link><pubDate>Fri, 12 Aug 2022 09:52:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2022-08-12-testing-photon-engine/</guid><description><p>I was a bit skeptical about Photon since I realized that it costs about double the amount of DBU, requires specifically optimized machines and does not support UDFs (which were my main target).</p>
<p>From the Databricks Official Docs:</p>
<h1 id="limitations"><strong>Limitations</strong></h1>
<ul>
<li>Does not support Spark Structured Streaming.</li>
<li>Does not support UDFs.</li>
<li>Does not support RDD APIs.</li>
<li>Not expected to improve short-running queries (&lt;2 seconds), for example, queries against small amounts of data.</li>
</ul>
<p><a href="https://docs.databricks.com/runtime/photon.html">Photon runtime</a></p></description></item><item><title>Databricks Cluster Management</title><link>https://adrianabreu.com/blog/2022-07-30-databricks-cluster-management/</link><pubDate>Sat, 30 Jul 2022 13:52:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2022-07-30-databricks-cluster-management/</guid><description><p>For the last few months, I&rsquo;ve been into ETL optimization. The changes ranged from as dramatic as moving tables from ORC to Delta and revamping the partition strategy to as simple as upgrading the runtime version to 10.4 so the ETL starts using low-shuffle merge.</p>
<p>But at my job, we have a <em>lot</em> of jobs. Each ETL can easily be launched 30 times with different parameters, so I wanted to dig into the most effective strategy for it.</p></description></item><item><title>Pushing data to tinybird for free</title><link>https://adrianabreu.com/blog/2022-07-25-pushing-data-to-tinybird-free/</link><pubDate>Mon, 25 Jul 2022 07:28:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2022-07-25-pushing-data-to-tinybird-free/</guid><description><p>So my Azure subscription expired and I ended up losing the function I was using to feed my real-time data on analytics (part of the <a href="https://github.com/adrianabreu/titsa-gtfs-api">Transportes Insulares de Tenerife SA</a> analysis I was making).</p>
<p>And after some struggle, I decided to move it to a GitHub action. Why? Because the free minutes per month were more than enough and because I just needed a script to run on a cron, and that script just makes a request and a post. So, it was quite straightforward.</p></description></item><item><title>Associate Spark Developer Certification</title><link>https://adrianabreu.com/spark-certification/2022-07-21-passed-certification/</link><pubDate>Thu, 21 Jul 2022 12:16:32 +0000</pubDate><guid>https://adrianabreu.com/spark-certification/2022-07-21-passed-certification/</guid><description><p>Yesterday I took (and passed with more than 90%, yay!) the <a href="https://databricks.com/learn/certification/apache-spark-developer-associate">Associate Spark Developer Certification</a>. And before I forget, I want to share my experience:</p>
<p>In general:</p>
<ul>
<li>First of all, I needed to install Windows as there was no Linux support for the control software used during the exam.</li>
<li>Secondly, you need to disable both the antivirus and the firewall before joining. I didn&rsquo;t disable the antivirus, and the technician contacted me because there was a problem with the webcam even though I was able to see myself.</li>
<li>It is a controlled window started by the software, not a browser page (I could zoom comfortably on the example docs they provide, but not in the software window).</li>
<li>You can mark questions to review them later.</li>
</ul>
<p>About the exam:</p></description></item><item><title>Reading firebase data</title><link>https://adrianabreu.com/blog/2022-07-01-reading-firebase-data/</link><pubDate>Fri, 01 Jul 2022 07:28:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2022-07-01-reading-firebase-data/</guid><description><p>Firebase is a common component nowadays for most mobile apps. It can provide some useful insights: for example, at my previous company we used it to detect where people dropped out of the initial app wizard. (We could measure it).</p>
<p>It is quite simple to export your data to BigQuery: <a href="https://firebase.google.com/docs/projects/bigquery-export">https://firebase.google.com/docs/projects/bigquery-export</a></p>
<p>But maybe your lake is in AWS or Azure. In the next lines, I will try to explain how to load the data in your lake and some improvements we have applied.</p></description></item><item><title>Qbeast</title><link>https://adrianabreu.com/blog/2022-06-30-qbeast/</link><pubDate>Thu, 30 Jun 2022 07:28:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2022-06-30-qbeast/</guid><description><p>A few days ago I ran into <a href="https://twitter.com/Qbeast_io">Qbeast</a> which is an open-source project on top of delta lake I needed to dig into.</p>
<p>This introductory post explains it quite well: <a href="https://qbeast.io/qbeast-format-enhanced-data-lakehouse/">https://qbeast.io/qbeast-format-enhanced-data-lakehouse/</a></p>
<p>The project is quite good and it seems helpful if you need to write your custom data source as everything is documented. And well as I&rsquo;m in love with note-taking I want to dig into the following three topics:</p>
<ol>
<li>Explaining how the format works (including optimizations)</li>
<li>Describing how the sampling pushdown is implemented</li>
<li>Understanding the table tolerance</li>
</ol>
<h1 id="1-qbeast-format">1. Qbeast format</h1>
<p>This would be better explained with diagrams. Remember delta lake? We had a _delta_log folder with files pointing to files. Now Qbeast has extended this _delta_log and has added some new properties.</p></description></item><item><title>Spark Dates</title><link>https://adrianabreu.com/spark-certification/2022-06-29-spark-dates/</link><pubDate>Wed, 29 Jun 2022 15:43:22 +0000</pubDate><guid>https://adrianabreu.com/spark-certification/2022-06-29-spark-dates/</guid><description><p>I can perfectly describe this as the scariest part of the exam. I&rsquo;m used to working with dates but I&rsquo;m especially used to suffering from the typical UTC / not UTC / summer time hours difference.</p>
<p>I will try to work through some simple exercises for this. The idea would be:</p>
<ul>
<li>We have some sales data, and God knows how much business people love refreshing their dashboards on Databricks SQL super fast. So we decided to aggregate the same KPI, our sales per store, at different levels. Consider some data such as:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>data <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span> (<span style="color:#ae81ff">1656520076</span>, <span style="color:#ae81ff">1001</span>, <span style="color:#ae81ff">10</span>),
</span></span><span style="display:flex;"><span> (<span style="color:#ae81ff">1656520321</span>, <span style="color:#ae81ff">1001</span>, <span style="color:#ae81ff">8</span>),
</span></span><span style="display:flex;"><span> (<span style="color:#ae81ff">1656509025</span>, <span style="color:#ae81ff">1002</span>, <span style="color:#ae81ff">5</span>),
</span></span><span style="display:flex;"><span> (<span style="color:#ae81ff">1656510826</span>, <span style="color:#ae81ff">1002</span>, <span style="color:#ae81ff">3</span>),
</span></span><span style="display:flex;"><span> (<span style="color:#ae81ff">1656510056</span>, <span style="color:#ae81ff">1001</span>, <span style="color:#ae81ff">5</span>),
</span></span><span style="display:flex;"><span> (<span style="color:#ae81ff">1656514076</span>, <span style="color:#ae81ff">1001</span>, <span style="color:#ae81ff">8</span>),
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>ts <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;ts&#34;</span>
</span></span><span style="display:flex;"><span>store_id <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;store_id&#34;</span>
</span></span><span style="display:flex;"><span>amount <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;amount&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>df <span style="color:#f92672">=</span> spark<span style="color:#f92672">.</span>createDataFrame(data, [ts, store_id, amount])
</span></span></code></pre></div><p>We need to parse that data into a readable date, as the first number is an epoch or <em>unix_time</em>. Using the function from_unixtime this is quite simple:</p></description></item><item><title>Spark Cert Exam Practice</title><link>https://adrianabreu.com/spark-certification/2022-06-28-databricks-practice-exam/</link><pubDate>Tue, 28 Jun 2022 13:43:22 +0000</pubDate><guid>https://adrianabreu.com/spark-certification/2022-06-28-databricks-practice-exam/</guid><description><script
src="https://cdn.jsdelivr.net/npm/quizdown@latest/public/build/quizdown.js">
</script>
<script
src="https://cdn.jsdelivr.net/npm/quizdown@latest/public/build/extensions/quizdownKatex.js">
</script>
<script
src="https://cdn.jsdelivr.net/npm/quizdown@latest/public/build/extensions/quizdownHighlight.js">
</script>
<script>quizdown.register(quizdownHighlight).register(quizdownKatex).init()</script>
<div class='quizdown'>
---
primary_color: orange
secondary_color: lightgray
text_color: black
shuffle_questions: false
---
## Which of the following statements about the Spark driver is incorrect?
- [ ] The Spark driver is the node in which the Spark application's main method runs to coordinate the Spark application.
- [X] The Spark driver is horizontally scaled to increase overall processing throughput.
- [ ] The Spark driver contains the SparkContext object.
- [ ] The Spark driver is responsible for scheduling the execution of data by various worker nodes in cluster mode.
- [ ] The Spark driver should be as close as possible to worker nodes for optimal performance.
## Which of the following describes nodes in cluster-mode Spark?
- [ ] Nodes are the most granular level of execution in the Spark execution hierarchy.
- [ ] There is only one node and it hosts both the driver and executors.
- [ ] Nodes are another term for executors, so they are processing engine instances for performing computations.
- [ ] There are driver nodes and worker nodes, both of which can scale horizontally.
- [X] Worker nodes are machines that host the executors responsible for the execution of tasks.
## Which of the following statements about slots is true?
- [ ] There must be more slots than executors.
- [ ] There must be more tasks than slots.
- [ ] Slots are the most granular level of execution in the Spark execution hierarchy.
- [ ] Slots are not used in cluster mode.
- [X] Slots are resources for parallelization within a Spark application.
## Which of the following is a combination of a block of data and a set of transformations that will run on a single executor?
- [ ] Executor
- [ ] Node
- [ ] Job
- [X] Task
- [ ] Slot
## Which of the following is a group of tasks that can be executed in parallel to compute the same set of operations on potentially multiple machines?
- [ ] Job
- [ ] Slot
- [ ] Executor
- [ ] Task
- [X] Stage
## Which of the following describes a shuffle?
- [X] A shuffle is the process by which data is compared across partitions.
- [ ] A shuffle is the process by which data is compared across executors.
- [ ] A shuffle is the process by which partitions are allocated to tasks.
- [ ] A shuffle is the process by which partitions are ordered for write.
- [ ] A shuffle is the process by which tasks are ordered for execution.
## DataFrame df is very large with a large number of partitions, more than there are executors in the cluster. Based on this situation, which of the following is incorrect? Assume there is one core per executor.
- [X] Performance will be suboptimal because not all executors will be utilized at the same time.
- [ ] Performance will be suboptimal because not all data can be processed at the same time.
- [ ] There will be a large number of shuffle connections performed on DataFrame df when operations inducing a shuffle are called.
- [ ] There will be a lot of overhead associated with managing resources for data processing within each task.
- [ ] There might be risk of out-of-memory errors depending on the size of the executors in the cluster.
## Which of the following operations will trigger evaluation?
- [ ] DataFrame.filter()
- [ ] DataFrame.distinct()
- [ ] DataFrame.intersect()
- [ ] DataFrame.join()
- [X] DataFrame.count()
## Which of the following describes the difference between transformations and actions?
- [ ] Transformations work on DataFrames/Datasets while actions are reserved for native language objects.
- [ ] There is no difference between actions and transformations.
- [ ] Actions are business logic operations that do not induce execution while transformations are execution triggers focused on returning results.
- [ ] Actions work on DataFrames/Datasets while transformations are reserved for native language objects.
- [X] Transformations are business logic operations that do not induce execution while actions are execution triggers focused on returning results.
## Which of the following DataFrame operations is always classified as a narrow transformation?
- [ ] DataFrame.sort()
- [ ] DataFrame.distinct()
- [ ] DataFrame.repartition()
- [X] DataFrame.select()
- [ ] DataFrame.join()
## Spark has a few different execution/deployment modes: cluster, client, and local. Which of the following describes Spark's execution/deployment mode?
- [X] Spark's execution/deployment mode determines where the driver and executors are physically located when a Spark application is run
- [ ] Spark's execution/deployment mode determines which tasks are allocated to which executors in a cluster
- [ ] Spark's execution/deployment mode determines which node in a cluster of nodes is responsible for running the driver program
- [ ] Spark's execution/deployment mode determines exactly how many nodes the driver will connect to when a Spark application is run
- [ ] Spark's execution/deployment mode determines whether results are run interactively in a notebook environment or in batch
## Which of the following cluster configurations will ensure the completion of a Spark application in light of a worker node failure? Note: each configuration has roughly the same compute power using 100GB of RAM and 200 cores.

- [ ] Scenario #1
- [X] They should all ensure completion because worker nodes are fault-tolerant.
- [ ] Scenario #4
- [ ] Scenario #5
- [ ] Scenario #6
## Which of the following describes out-of-memory errors in Spark?
- [X] An out-of-memory error occurs when either the driver or an executor does not have enough memory to collect or process the data allocated to it.
- [ ] An out-of-memory error occurs when Spark's storage level is too lenient and allows data objects to be cached to both memory and disk.
- [ ] An out-of-memory error occurs when there are more tasks than are executors regardless of the number of worker nodes.
- [ ] An out-of-memory error occurs when the Spark application calls too many transformations in a row without calling an action regardless of the size of the data object on which the transformations are operating.
- [ ] An out-of-memory error occurs when too much data is allocated to the driver for computational purposes.
## Which of the following is the default storage level for persist() for a non-streaming DataFrame/Dataset?
- [X] MEMORY_AND_DISK
- [ ] MEMORY_AND_DISK_SER
- [ ] DISK_ONLY
- [ ] MEMORY_ONLY_SER
- [ ] MEMORY_ONLY
## Which of the following describes a broadcast variable?
- [ ] A broadcast variable is a Spark object that needs to be partitioned onto multiple worker nodes because it's too large to fit on a single worker node.
- [ ] A broadcast variable can only be created by an explicit call to the broadcast() operation.
- [ ] A broadcast variable is entirely cached on the driver node so it doesn't need to be present on any worker nodes.
- [X] A broadcast variable is entirely cached on each worker node so it doesn't need to be shipped or shuffled between nodes with each stage.
- [ ] A broadcast variable is saved to the disk of each worker node to be easily read into memory when needed.
## Which of the following operations is most likely to induce a skew in the size of your data's partitions?
- [ ] DataFrame.collect()
- [ ] DataFrame.cache()
- [ ] DataFrame.repartition(n)
- [X] DataFrame.coalesce(n)
- [ ] DataFrame.persist()
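
As a rough illustration of why `coalesce(n)` is the likely culprit, here is a toy model in plain Python. It is not Spark code: the `coalesce` and `repartition` functions are hypothetical stand-ins and the partition sizes are made up. The sketch only mimics the idea that coalescing merges whole existing partitions without a shuffle, while repartitioning redistributes individual rows.

```python
def coalesce(partitions, n):
    """Toy model: merge whole partitions into n groups without moving single rows."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring input partitions end up in the same output partition.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Toy model: full shuffle that deals rows round-robin into n partitions."""
    rows = [row for part in partitions for row in part]
    return [rows[i::n] for i in range(n)]


# Eight unevenly sized partitions (sizes are invented for the example).
sizes = [10, 200, 30, 5, 400, 20, 15, 60]
partitions = [list(range(size)) for size in sizes]

coalesced_sizes = [len(p) for p in coalesce(partitions, 4)]
repartitioned_sizes = [len(p) for p in repartition(partitions, 4)]

print(coalesced_sizes)      # [210, 35, 420, 75]
print(repartitioned_sizes)  # [185, 185, 185, 185]
```

In this model the merged partitions inherit the imbalance of their inputs, which is the skew the question is pointing at.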
## Which of the following data structures are Spark DataFrames built on top of?
- [ ] Arrays
- [ ] Strings
- [X] RDDs
- [ ] Vectors
- [ ] SQL Tables
## Which of the following code blocks returns a DataFrame containing only column storeId and column division from DataFrame storesDF?
- [ ] `storesDF.select("storeId").select("division")`
- [ ] `storesDF.select(storeId, division)`
- [X] `storesDF.select("storeId", "division")`
- [ ] `storesDF.select(col("storeId", "division"))`
- [ ] `storesDF.select(storeId).select(division)`
## Which of the following code blocks returns a DataFrame containing all columns from DataFrame storesDF except for column sqft and column customerSatisfaction? A sample of DataFrame storesDF is below:

- [X] `storesDF.drop("sqft", "customerSatisfaction")`
- [ ] `storesDF.select("storeId", "open", "openDate", "division")`
- [ ] `storesDF.select(-col(sqft), -col(customerSatisfaction))`
- [ ] `storesDF.drop(sqft, customerSatisfaction)`
- [ ] `storesDF.drop(col(sqft), col(customerSatisfaction))`
## The code block shown below contains an error. The code block is intended to return a DataFrame containing only the rows from DataFrame storesDF where the value in DataFrame storesDF's "sqft" column is less than or equal to 25,000. Assume DataFrame storesDF is the only defined language variable. Identify the error. Code block: `storesDF.filter(sqft <= 25000)`
- [ ] The column name **sqft** needs to be quoted like **storesDF.filter("sqft" <= 25000).**
- [X] The column name sqft needs to be quoted and wrapped in the **col()** function like **storesDF.filter(col("sqft") <= 25000).**
- [ ] The sign in the logical condition inside **filter()** needs to be changed from <= to >.
- [ ] The sign in the logical condition inside **filter()** needs to be changed from <= to >=.
- [ ] The column name sqft needs to be wrapped in the **col()** function like **storesDF.filter(col(sqft) <= 25000).**
## The code block shown below should return a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 OR the value in column customerSatisfaction is greater than or equal to 30. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: `storesDF.__1__(__2__ __3__ __4__)`
- [X] ```
1. filter
2. (col("sqft") <= 25000)
3. |
4. (col("customerSatisfaction") >= 30)
```
- [ ] ```
1. drop
2. (col(sqft) <= 25000)
3. |
4. (col(customerSatisfaction) >= 30)
```
- [ ] ```
1. filter
2. col("sqft") <= 25000
3. |
4. col("customerSatisfaction") >= 30
```
- [ ] ```
1. filter
2. col("sqft") <= 25000
3. or
4. col("customerSatisfaction") >= 30
```
- [ ] ```
1. filter
2. (col("sqft") <= 25000)
3. or
4. (col("customerSatisfaction") >= 30)
```
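
The parentheses matter because of Python's operator precedence, not because of anything specific to Spark. A minimal sketch using only the standard `ast` module (the names `a` and `b` are placeholders standing in for the two column expressions) shows how the two spellings are parsed:

```python
import ast

# Without parentheses, `|` binds tighter than `<=` and `>=`, so Python reads
# this as the chained comparison  a <= (25000 | b) >= 30.
unparenthesized = ast.parse("a <= 25000 | b >= 30", mode="eval").body

# With parentheses, each comparison is evaluated first and `|` combines them.
parenthesized = ast.parse("(a <= 25000) | (b >= 30)", mode="eval").body

print(type(unparenthesized).__name__)  # Compare
print(type(parenthesized).__name__)    # BinOp
```

The keyword `or` is a separate matter: it asks Python for the truth value of its operands and cannot be overloaded, which is why the column-wise `|` operator is used instead.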
## Which of the following operations can be used to convert a DataFrame column from one type to another type?
- [X] col().cast()
- [ ] convert()
- [ ] castAs()
- [ ] col().coerce()
- [ ] col()
## Which of the following code blocks returns a new DataFrame with a new column sqft100 that is 1/100th of column sqft in DataFrame storesDF? Note that column sqft100 is not in the original DataFrame storesDF.
- [ ] `storesDF.withColumn("sqft100", col("sqft") * 100)`
- [ ] `storesDF.withColumn("sqft100", sqft / 100)`
- [ ] `storesDF.withColumn(col("sqft100"), col("sqft") / 100)`
- [X] `storesDF.withColumn("sqft100", col("sqft") / 100)`
- [ ] `storesDF.newColumn("sqft100", sqft / 100)`
## Which of the following code blocks returns a new DataFrame from DataFrame storesDF where column numberOfManagers is the constant integer 1?
- [ ] `storesDF.withColumn("numberOfManagers", col(1))`
- [ ] `storesDF.withColumn("numberOfManagers", 1)`
- [X] `storesDF.withColumn("numberOfManagers", lit(1))`
- [ ] `storesDF.withColumn("numberOfManagers", lit("1"))`
- [ ] `storesDF.withColumn("numberOfManagers", IntegerType(1))`
## The code block shown below contains an error. The code block intends to return a new DataFrame where column storeCategory from DataFrame storesDF is split at the underscore character into column storeValueCategory and column storeSizeCategory. Identify the error. A sample of DataFrame storesDF is displayed below: Code block:
```
(storesDF.withColumn(
"storeValueCategory", col("storeCategory").split("_")[0]
).withColumn(
"storeSizeCategory", col("storeCategory").split("_")[1]
)
)
```

- [ ] The split() operation comes from the imported functions object. It accepts a string column name and split character as arguments. It is not a method of a Column object.
- [X] The split() operation comes from the imported functions object. It accepts a Column object and split character as arguments. It is not a method of a Column object.
- [ ] The index values of 0 and 1 should be provided as second arguments to the split() operation rather than indexing the result.
- [ ] The index values of 0 and 1 are not correct — they should be 1 and 2, respectively.
- [ ] The withColumn() operation cannot be called twice in a row.
## Which of the following operations can be used to split an array column into an individual DataFrame row for each element in the array?
- [ ] extract()
- [ ] split()
- [X] explode()
- [ ] arrays_zip()
- [ ] unpack()
## Which of the following code blocks returns a new DataFrame where column storeCategory is an all-lowercase version of column storeCategory in DataFrame storesDF? Assume DataFrame storesDF is the only defined language variable.
- [X] `storesDF.withColumn("storeCategory", lower(col("storeCategory")))`
- [ ] `storesDF.withColumn("storeCategory", col("storeCategory").lower())`
- [ ] `storesDF.withColumn("storeCategory", tolower(col("storeCategory")))`
- [ ] `storesDF.withColumn("storeCategory", lower("storeCategory"))`
- [ ] `storesDF.withColumn("storeCategory", lower(storeCategory))`
## The code block shown below contains an error. The code block is intended to return a new DataFrame where column division from DataFrame storesDF has been renamed to column state and column managerName from DataFrame storesDF has been renamed to column managerFullName. Identify the error. Code block:
```
(storesDF.withColumnRenamed("state", "division")
.withColumnRenamed("managerFullName", "managerName"))
```
- [ ] Both arguments to operation withColumnRenamed() should be wrapped in the col() operation.
- [ ] The operations withColumnRenamed() should not be called twice, and the first argument should be ["state", "division"] and the second argument should be["managerFullName", "managerName"].
- [ ] The old columns need to be explicitly dropped.
- [X] The first argument to operation withColumnRenamed() should be the old column name and the second argument should be the new column name.
- [ ] The operation withColumnRenamed() should be replaced with withColumn().
## Which of the following code blocks returns a DataFrame where rows in DataFrame storesDF containing missing values in every column have been dropped?
- [ ] storesDF.nadrop("all")
- [ ] storesDF.na.drop("all", subset = "sqft")
- [ ] storesDF.dropna()
- [ ] storesDF.na.drop()
- [X] storesDF.na.drop("all")
## Which of the following operations fails to return a DataFrame where every row is unique?
- [ ] DataFrame.distinct()
- [ ] DataFrame.drop_duplicates(subset = None)
- [ ] DataFrame.drop_duplicates()
- [ ] DataFrame.dropDuplicates()
- [X] DataFrame.drop_duplicates(subset = "all")
## Which of the following code blocks will not always return the exact number of distinct values in column division?
- [X] `storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct"))`
- [ ] `storesDF.agg(approx_count_distinct(col("division"), 0).alias("divisionDistinct"))`
- [ ] `storesDF.agg(countDistinct(col("division")).alias("divisionDistinct"))`
- [ ] `storesDF.select("division").dropDuplicates().count()`
- [ ] `storesDF.select("division").distinct().count()`
## The code block shown below should return a new DataFrame with the mean of column sqft from DataFrame storesDF in column sqftMean. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block:
`storesDF.__1__(__2__(__3__).alias("sqftMean"))`
- [X] ```
1. agg
2. mean
3. col("sqft")
```
- [ ] ```
1. mean
2. col
3. "sqft"
```
- [ ] ```
1. withColumn
2. mean
3. col("sqft")
```
- [ ] ```
1. agg
2. mean
3. "sqft"
```
- [ ] ```
1. agg
2. average
3. col("sqft")
```
## Which of the following code blocks returns the number of rows in DataFrame storesDF?
- [ ] storesDF.withColumn("numberOfRows", count())
- [ ] storesDF.withColumn(count().alias("numberOfRows"))
- [ ] storesDF.countDistinct()
- [X] storesDF.count()
- [ ] storesDF.agg(count())
## Which of the following code blocks returns the sum of the values in column sqft in DataFrame storesDF grouped by distinct value in column division?
- [ ] storesDF.groupBy.agg(sum(col("sqft")))
- [ ] storesDF.groupBy("division").agg(sum())
- [ ] storesDF.agg(groupBy("division").sum(col("sqft")))
- [ ] storesDF.groupby.agg(sum(col("sqft")))
- [X] storesDF.groupBy("division").agg(sum(col("sqft")))
## Which of the following code blocks returns a DataFrame containing summary statistics only for column sqft in DataFrame storesDF?
- [ ] storesDF.summary("mean")
- [X] storesDF.describe("sqft")
- [ ] storesDF.summary(col("sqft"))
- [ ] storesDF.describeColumn("sqft")
- [ ] storesDF.summary()
## Which of the following operations can be used to sort the rows of a DataFrame?
- [X] sort() and orderBy()
- [ ] orderby()
- [ ] sort() and orderby()
- [ ] orderBy()
- [ ] sort()
## The code block shown below contains an error. The code block is intended to return a 15 percent sample of rows from DataFrame storesDF without replacement. Identify the error. Code block: storesDF.sample(True, fraction = 0.15)
- [ ] There is no argument specified to the seed parameter.
- [ ] There is no argument specified to the withReplacement parameter.
- [ ] The sample() operation does not sample without replacement — sampleby() should be used instead.
- [ ] The sample() operation is not reproducible.
- [X] The first argument True sets the sampling to be with replacement.
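
To make the with/without replacement distinction concrete, here is a plain-Python analogy using the standard `random` module. It is only a sketch of the concept and does not reproduce Spark's actual sampling algorithm; the row values, the seed, and the variable names are invented for the example.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible
rows = list(range(1000))
fraction = 0.15

# Without replacement: every row is considered once and kept with
# probability `fraction`, so no row can appear twice and the sample
# size is only approximately 15 percent.
without_replacement = [row for row in rows if rng.random() < fraction]

# With replacement: rows are drawn independently, so the same row
# may be picked more than once.
sample_size = round(len(rows) * fraction)
with_replacement = rng.choices(rows, k=sample_size)

print(len(without_replacement), len(set(without_replacement)))
print(len(with_replacement), len(set(with_replacement)))
```

In the first sample the two printed numbers always match; in the second they can differ, because duplicates are allowed.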
## Which of the following operations can be used to return the top n rows from a DataFrame?
- [ ] DataFrame.n()
- [X] DataFrame.take(n)
- [ ] DataFrame.head
- [ ] DataFrame.show(n)
- [ ] DataFrame.collect(n)
## The code block shown below should extract the value for column sqft from the first row of DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
`__1__.__2__.__3__`
- [ ] ```
1. storesDF
2. first
3. col("sqft")
```
- [ ] ```
1. storesDF
2. first
3. sqft
```
- [ ] ```
1. storesDF
2. first
3. ["sqft"]
```
- [X] ```
1. storesDF
2. first()
3. sqft
```
- [ ] ```
1. storesDF
2. first()
3. col("sqft")
```
## Which of the following lines of code prints the schema of a DataFrame?
- [ ] print(storesDF)
- [ ] storesDF.schema
- [ ] print(storesDF.schema())
- [X] DataFrame.printSchema()
- [ ] DataFrame.schema()
## In what order should the below lines of code be run in order to create and register a SQL UDF named "ASSESS_PERFORMANCE" using the Python function assessPerformance and apply it to column customerSatisfaction in table stores?
```
Lines of code:
1. spark.udf.register("ASSESS_PERFORMANCE", assessPerformance)
2. spark.sql("SELECT customerSatisfaction, assessPerformance(customerSatisfaction) AS result FROM stores")
3. spark.udf.register(assessPerformance, "ASSESS_PERFORMANCE")
4. spark.sql("SELECT customerSatisfaction, ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores")
```
- [ ] 3, 4
- [X] 1, 4
- [ ] 3, 2
- [ ] 2
- [ ] 1, 2
## In what order should the below lines of code be run in order to create a Python UDF assessPerformanceUDF() using the integer-returning Python function assessPerformance and apply it to column customerSatisfaction in DataFrame storesDF?
```
Lines of code:
1. assessPerformanceUDF = udf(assessPerformance, IntegerType)
2. assessPerformanceUDF = spark.register.udf("ASSESS_PERFORMANCE",
assessPerformance)
3. assessPerformanceUDF = udf(assessPerformance, IntegerType())
4. storesDF.withColumn("result",
assessPerformanceUDF(col("customerSatisfaction")))
5. storesDF.withColumn("result",
assessPerformance(col("customerSatisfaction")))
6. storesDF.withColumn("result",
ASSESS_PERFORMANCE(col("customerSatisfaction")))
```
- [X] 3, 4
- [ ] 2, 6
- [ ] 3, 5
- [ ] 1, 4
- [ ] 2, 5
## Which of the following operations can execute a SQL query on a table?
- [ ] spark.query()
- [ ] DataFrame.sql()
- [X] spark.sql()
- [ ] DataFrame.createOrReplaceTempView()
- [ ] DataFrame.createTempView()
## Which of the following code blocks creates a single-column DataFrame from Python list years which is made up of integers?
- [ ] `spark.createDataFrame([years], IntegerType())`
- [X] `spark.createDataFrame(years, IntegerType())`
- [ ] `spark.DataFrame(years, IntegerType())`
- [ ] `spark.createDataFrame(years)`
- [ ] `spark.createDataFrame(years, IntegerType)`
## Which of the following operations can be used to cache a DataFrame only in Spark’s memory assuming the default arguments can be updated?
- [ ] DataFrame.clearCache()
- [ ] DataFrame.storageLevel
- [ ] StorageLevel
- [X] DataFrame.persist()
- [ ] DataFrame.cache()
## The code block shown below contains an error. The code block is intended to return a new 4-partition DataFrame from the 8-partition DataFrame storesDF without inducing a shuffle. Identify the error. Code block: storesDF.repartition(4)
- [ ] The repartition operation will only work if the DataFrame has been cached to memory.
- [ ] The repartition operation requires a column on which to partition rather than a number of partitions.
- [ ] The number of resulting partitions, 4, is not achievable for an 8-partition DataFrame.
- [X] The repartition operation induces a full shuffle. The coalesce operation should be used instead.
- [ ] The repartition operation cannot guarantee the number of result partitions.
## Which of the following code blocks will always return a new 12-partition DataFrame from the 8-partition DataFrame storesDF?
- [ ] storesDF.coalesce(12)
- [ ] storesDF.repartition()
- [X] storesDF.repartition(12)
- [ ] storesDF.coalesce()
- [ ] storesDF.coalesce(12, "storeId")
## Which of the following Spark config properties represents the number of partitions used in wide transformations like join()?
- [x] `spark.sql.shuffle.partitions`
- [ ] `spark.shuffle.partitions`
- [ ] `spark.shuffle.io.maxRetries`
- [ ] `spark.shuffle.file.buffer`
- [ ] `spark.default.parallelism`
## In what order should the below lines of code be run in order to return a DataFrame containing a column openDateString, a string representation of Java’s SimpleDateFormat? Note that column openDate is of type integer and represents a date in the UNIX epoch format, the number of seconds since midnight on January 1st, 1970. An example of Java's SimpleDateFormat is "Sunday, Dec 4, 2008 1:05 PM". A sample of storesDF is displayed below:
```
Lines of code:
1. storesDF.withColumn("openDateString",
from_unixtime(col("openDate"), simpleDateFormat))
2. simpleDateFormat = "EEEE, MMM d, yyyy h:mm a"
3. storesDF.withColumn("openDateString",
from_unixtime(col("openDate"), SimpleDateFormat()))
4. storesDF.withColumn("openDateString",
date_format(col("openDate"), simpleDateFormat))
5. storesDF.withColumn("openDateString",
date_format(col("openDate"), SimpleDateFormat()))
6. simpleDateFormat = "wd, MMM d, yyyy h:mm a"
```
- [ ] 2, 3
- [X] 2, 1
- [ ] 6, 5
- [ ] 2, 4
- [ ] 6, 1
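
For intuition about what the conversion does, here is a plain-Python analogue built on the standard `datetime` module. It is not the Spark call: the helper name `format_epoch` is hypothetical, UTC is assumed as the time zone, and `strftime` directives are used in place of the Java pattern letters (so the day and hour come out zero-padded, unlike `d` and `h`).

```python
from datetime import datetime, timezone


def format_epoch(seconds: int) -> str:
    """Turn seconds since 1970-01-01 00:00 UTC into a readable string."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # %A = full weekday, %b = short month, %I:%M %p = 12-hour clock with AM/PM
    return moment.strftime("%A, %b %d, %Y %I:%M %p")


print(format_epoch(0))  # Thursday, Jan 01, 1970 12:00 AM
```

The point carried over to the quiz is that the input is an integer count of seconds, so it needs an epoch-aware conversion rather than a formatter that expects a date or timestamp value.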
## Which of the following code blocks returns a DataFrame containing a column month, an integer representation of the month from column openDate from DataFrame storesDF? Note that column openDate is of type integer and represents a date in the UNIX epoch format — the number of seconds since midnight on January 1st, 1970. A sample of storesDF is displayed below:

- [ ] `storesDF.withColumn("month", getMonth(col("openDate")))`
- [X] `storesDF.withColumn("openTimestamp", col("openDate").cast("Timestamp")).withColumn("month", month(col("openTimestamp")))`
- [ ] `storesDF.withColumn("openDateFormat", col("openDate").cast("Date")).withColumn("month", month(col("openDateFormat")))`
- [ ] `storesDF.withColumn("month", substr(col("openDate"), 4, 2))`
- [ ] `storesDF.withColumn("month", month(col("openDate")))`
## Which of the following operations performs an inner join on two DataFrames?
- [ ] DataFrame.innerJoin()
- [X] DataFrame.join()
- [ ] Standalone join() function
- [ ] DataFrame.merge()
- [ ] DataFrame.crossJoin()
## Which of the following code blocks returns a new DataFrame that is the result of an outer join between DataFrame storesDF and DataFrame employeesDF on column storeId?
- [X] storesDF.join(employeesDF, "storeId", "outer")
- [ ] storesDF.join(employeesDF, "storeId")
- [ ] storesDF.join(employeesDF, "outer", col("storeId"))
- [ ] storesDF.join(employeesDF, "outer", storesDF.storeId == employeesDF.storeId)
- [ ] storesDF.merge(employeesDF, "outer", col("storeId"))
## The code block shown below contains an error. The code block is intended to return a new DataFrame that is the result of an inner join between DataFrame storesDF and DataFrame employeesDF on column storeId and column employeeId which are in both DataFrames. Identify the error.
Code block:
`storesDF.join(employeesDF, [col("storeId"), col("employeeId")])`
- [ ] The join() operation is a standalone function rather than a method of DataFrame — the join() operation should be called where its first two arguments are storesDF and employeesDF.
- [ ] There must be a third argument to join() because the default to the how parameter is not "inner".
- [ ] The col("storeId") and col("employeeId") arguments should not be separate elements of a list — they should be tested to see if they're equal to one another like col("storeId") == col("employeeId").
- [ ] There is no DataFrame.join() operation — DataFrame.merge() should be used instead.
- [X] The references to "storeId" and "employeeId" should not be inside the col() function — removing the col() function should result in a successful join.
## Which of the following Spark properties is used to configure the broadcasting of a DataFrame without the use of the broadcast() operation?
- [X] spark.sql.autoBroadcastJoinThreshold
- [ ] spark.sql.broadcastTimeout
- [ ] spark.broadcast.blockSize
- [ ] spark.broadcast.compress
- [ ] spark.executor.memoryOverhead
## The code block shown below should return a new DataFrame that is the result of a cross join between DataFrame storesDF and DataFrame employeesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block:
`__1__.__2__(__3__)`
- [ ] ```
1. storesDF
2. crossJoin
3. employeesDF, "storeId"
```
- [ ] ```
1. storesDF
2. join
3. employeesDF, "cross"
```
- [ ] ```
1. storesDF
2. crossJoin
3. employeesDF, "storeId"
```
- [ ] ```
1. storesDF
2. join
3. employeesDF, "storeId", "cross"
```
- [X] ```
1. storesDF
2. crossJoin
3. employeesDF
```
## Which of the following operations performs a position-wise union on two DataFrames?
- [ ] The standalone concat() function
- [ ] The standalone unionAll() function
- [ ] The standalone union() function
- [ ] DataFrame.unionByName()
- [X] DataFrame.union()
## Which of the following code blocks writes DataFrame storesDF to file path filePath as parquet?
- [ ] storesDF.write.option("parquet").path(filePath)
- [ ] storesDF.write.path(filePath)
- [ ] storesDF.write().parquet(filePath)
- [ ] storesDF.write(filePath)
- [X] storesDF.write.parquet(filePath)
## The code block shown below contains an error. The code block is intended to write DataFrame storesDF to file path filePath as parquet and partition by values in column division. Identify the error. Code block:
`storesDF.write.repartition("division").parquet(filePath)`
- [ ] The argument division to operation repartition() should be wrapped in the col() function to return a Column object.
- [ ] There is no parquet() operation for DataFrameWriter — the save() operation should be used instead.
- [X] There is no repartition() operation for DataFrameWriter — the partitionBy() operation should be used instead.
- [ ] DataFrame.write is an operation; it should be followed by parentheses to return a DataFrameWriter.
- [ ] The mode() operation must be called to specify that this write should not overwrite existing files.
## Which of the following code blocks reads a parquet at the file path filePath into a DataFrame?
- [ ] spark.read().parquet(filePath)
- [ ] spark.read().path(filePath, source = "parquet")
- [ ] spark.read.path(filePath, source = "parquet")
- [x] spark.read.parquet(filePath)
- [ ] spark.read().path(filePath)
## Which of the following code blocks reads JSON at the file path filePath into a DataFrame with the specified schema schema?
- [ ] spark.read().schema(schema).format(json).load(filePath)
- [ ] spark.read().schema(schema).format("json").load(filePath)
- [ ] spark.read.schema("schema").format("json").load(filePath)
- [ ] spark.read.schema("schema").format("json").load(filePath)
- [x] spark.read.schema(schema).format("json").load(filePath)
</div></description></item><item><title>Spark User Defined Functions</title><link>https://adrianabreu.com/spark-certification/2022-06-19-spark-udf-udaf/</link><pubDate>Sun, 19 Jun 2022 14:43:22 +0000</pubDate><guid>https://adrianabreu.com/spark-certification/2022-06-19-spark-udf-udaf/</guid><description><p>Sometimes we need to execute arbitrary Scala code on Spark, for example to call an external library. For that we have UDFs (user-defined functions), which accept one or more columns and return a new column.</p>
<p>Once we have a function, we need to register it with Spark so it can be used on the worker machines. If you are using Scala or Java, the UDF runs inside the Java Virtual Machine, so there&rsquo;s little extra penalty. From Python, however, there is an extra cost: Spark needs to start a Python process on the worker, serialize the data from the JVM to Python, run the function, and then serialize the result back to the JVM.</p></description></item><item><title>Spark DataSources</title><link>https://adrianabreu.com/spark-certification/2022-06-11-spark-data-sources/</link><pubDate>Sat, 11 Jun 2022 16:43:22 +0000</pubDate><guid>https://adrianabreu.com/spark-certification/2022-06-11-spark-data-sources/</guid><description><p>As stated in the <a href="https://adrianabreu.com/spark-certification/2022-06-10-spark-structured-api">structured API section</a>, Spark supports a lot of sources with a lot of options. The only goal of this post is to clarify how the most common ones work and how they are converted to <strong>DataFrames</strong>.</p>
<p>First, all the supported sources are listed here: <a href="https://spark.apache.org/docs/latest/sql-data-sources.html">https://spark.apache.org/docs/latest/sql-data-sources.html</a></p>
<p>We can focus on the typical ones: JSON, CSV and Parquet (as those are the typical formats for open-source data).</p></description></item><item><title>Spark Dataframes</title><link>https://adrianabreu.com/spark-certification/2022-06-10-spark-structured-api/</link><pubDate>Fri, 10 Jun 2022 17:02:32 +0000</pubDate><guid>https://adrianabreu.com/spark-certification/2022-06-10-spark-structured-api/</guid><description><p>Spark was initially released for dealing with a particular type of data called <strong>RDD</strong>. Nowadays we work with higher-level abstractions built on top of it, and the following table summarizes them.</p>
<table>
<thead>
<tr>
<th>Type</th>
<th>Description</th>
<th>Advantages</th>
</tr>
</thead>
<tbody>
<tr>
<td>Datasets</td>
<td>Structured collection of elements of type T, where T can be your own custom class (Scala and Java only)</td>
<td>Type-safe operations, support for operations that cannot be expressed otherwise.</td>
</tr>
<tr>
<td>Dataframes</td>
<td>Datasets of type Row (a generic spark type)</td>
<td>Allow optimizations and are more flexible</td>
</tr>
<tr>
<td>SQL tables and views</td>
<td>Same as Dataframes but in the scope of databases instead of programming languages</td>
<td></td>
</tr>
</tbody>
</table>
<p>Let&rsquo;s dig into the Dataframes.
They are a data abstraction for interacting with named columns; those names are defined in a <strong>schema</strong>.</p></description></item><item><title>Spark Execution</title><link>https://adrianabreu.com/spark-certification/2022-06-08-spark-execution/</link><pubDate>Wed, 08 Jun 2022 17:02:32 +0000</pubDate><guid>https://adrianabreu.com/spark-certification/2022-06-08-spark-execution/</guid><description><p>Spark provides an API and an engine; the engine is responsible for analyzing the code and performing several optimizations. But how does this work?
We can do two kinds of operations with Spark: transformations and actions.</p>
<p>Transformations are operations on the data that describe how it should be modified but do not yield a result directly, because they are all lazily evaluated: you can add new columns, filter rows, or perform some computations, and none of it will be executed immediately.</p></description></item><item><title>Spark Architecture</title><link>https://adrianabreu.com/spark-certification/2022-06-07-spark-architecture/</link><pubDate>Tue, 07 Jun 2022 17:02:32 +0000</pubDate><guid>https://adrianabreu.com/spark-certification/2022-06-07-spark-architecture/</guid><description><p>Spark works on top of a cluster supervised by a cluster manager. The latter is responsible for:</p>
<ol>
<li>Tracking resource allocation across all applications running on the cluster.</li>
<li>Monitoring the health of all the nodes.</li>
</ol>
<p>Inside each node there is a node manager, which is responsible for tracking that node&rsquo;s health and resources and reporting them to the cluster manager.</p>
<div class="goat svg-container ">
<svg
xmlns="http://www.w3.org/2000/svg"
font-family="Menlo,Lucida Console,monospace"
viewBox="0 0 328 105"
>
<g transform='translate(8,16)'>
<path d='M 0,16 L 136,16' fill='none' stroke='currentColor'></path>
<path d='M 176,16 L 208,16' fill='none' stroke='currentColor'></path>
<path d='M 136,32 L 176,32' fill='none' stroke='currentColor'></path>
<path d='M 0,48 L 136,48' fill='none' stroke='currentColor'></path>
<path d='M 176,48 L 208,48' fill='none' stroke='currentColor'></path>
<path d='M 176,80 L 208,80' fill='none' stroke='currentColor'></path>
<path d='M 0,16 L 0,48' fill='none' stroke='currentColor'></path>
<path d='M 136,16 L 136,32' fill='none' stroke='currentColor'></path>
<path d='M 136,32 L 136,48' fill='none' stroke='currentColor'></path>
<path d='M 176,16 L 176,32' fill='none' stroke='currentColor'></path>
<path d='M 176,32 L 176,48' fill='none' stroke='currentColor'></path>
<path d='M 176,48 L 176,64' fill='none' stroke='currentColor'></path>
<path d='M 176,64 L 176,80' fill='none' stroke='currentColor'></path>
<text text-anchor='middle' x='8' y='36' fill='currentColor' style='font-size:1em'>C</text>
<text text-anchor='middle' x='16' y='36' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='24' y='36' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='32' y='36' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='40' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='48' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='56' y='36' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='72' y='36' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='80' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='88' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='96' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='104' y='36' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='112' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='224' y='20' fill='currentColor' style='font-size:1em'>N</text>
<text text-anchor='middle' x='224' y='52' fill='currentColor' style='font-size:1em'>N</text>
<text text-anchor='middle' x='224' y='84' fill='currentColor' style='font-size:1em'>N</text>
<text text-anchor='middle' x='232' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='232' y='52' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='232' y='84' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='240' y='20' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='240' y='52' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='240' y='84' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='248' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='248' y='52' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='248' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='264' y='20' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='264' y='52' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='264' y='84' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='272' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='272' y='52' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='272' y='84' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='280' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='280' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='280' y='84' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='288' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='288' y='52' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='288' y='84' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='296' y='20' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='296' y='52' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='296' y='84' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='304' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='304' y='52' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='304' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='312' y='20' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='312' y='52' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='312' y='84' fill='currentColor' style='font-size:1em'>r</text>
</g>
</svg>
</div>
<p>When we run a Spark application, we generate processes inside the cluster: one node acts as the Driver and the rest act as Workers. There are two main points here:</p></description></item><item><title>Faker with PySpark</title><link>https://adrianabreu.com/blog/2022-05-31-faker-pyspark/</link><pubDate>Tue, 31 May 2022 09:28:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2022-05-31-faker-pyspark/</guid><description><p>I’m preparing a small blog post about some tweaks I’ve made to a delta table, but I want to dig into the Spark UI differences before that. As this was done as part of my work, I’m reproducing the problem with some generated data.</p>
<p>I didn’t know about <a href="https://faker.readthedocs.io/en/master/">Faker</a> and <em>boy</em> it is really simple and easy.</p>
<p>In this case, I want to generate a small dataset for a dimension product table including its id, category and price.</p></description></item><item><title>Git 101</title><link>https://adrianabreu.com/blog/2022-03-21-git-intro/</link><pubDate>Mon, 21 Mar 2022 22:28:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2022-03-21-git-intro/</guid><description><p>From time to time I get to the same place, telling some people about git, what it solves and some basic usage.</p>
<p>Since I&rsquo;ve done it a lot recently, I wanted to write it down in a post and enjoy it.</p>
<h1 id="what-is-git">What is git?</h1>
<p>Git is a gift from the gods for the following use cases:</p>
<ul>
<li>
<p>My laptop is broken! I need the data; there is a whole month of work on it!</p></description></item><item><title>Sbt tests</title><link>https://adrianabreu.com/blog/2022-02-07-sbt-tests/</link><pubDate>Mon, 07 Feb 2022 19:53:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2022-02-07-sbt-tests/</guid><description><p>Lately at work I have been using delta a lot for some dimension tables, and these tables perform partial row updates to replicate the business logic.</p>
<p>This leads to several tests that replicate a state of the table and perform the relevant updates to check every flow, and therefore to an execution overhead for this kind of test that ends up being exhausting.</p>
<p>One of the proposed solutions was to add a parameter to the builds to skip the test execution step. That is legitimate, but at least to me it feels somewhat arbitrary. Looking for another consensus, we arrived at this: pull requests will run all the tests, and the remaining builds (manual or automatic branch builds) will exclude these tests, so that while testing or integrating branches we do not accumulate time on tests that have already been validated.</p></description></item><item><title>Multiplying rows in Spark</title><link>https://adrianabreu.com/blog/2021-11-11-multiplying-rows-in-spark/</link><pubDate>Thu, 11 Nov 2021 18:32:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2021-11-11-multiplying-rows-in-spark/</guid><description><p>Earlier this week I checked a Pull Request that has bothered me since I first saw it. Let&rsquo;s say we work for a bank and we are going to give cash to our clients if they get some people to join our bank.</p>
<p>And we have an advertising campaign definition like this:</p>
<table>
<thead>
<tr>
<th>campaign_id</th>
<th>inviter_cash</th>
<th>receiver_cash</th>
</tr>
</thead>
<tbody>
<tr>
<td>FakeBank001</td>
<td>50</td>
<td>30</td>
</tr>
<tr>
<td>FakeBank002</td>
<td>40</td>
<td>20</td>
</tr>
<tr>
<td>FakeBank003</td>
<td>30</td>
<td>20</td>
</tr>
</tbody>
</table>
<p>And then our BI team defines the schema they want for their dashboards.</p></description></item><item><title>The horrible azure devops ui</title><link>https://adrianabreu.com/blog/2021-11-07-the-horrible-azure-devops-pipeline-ui/</link><pubDate>Sun, 07 Nov 2021 16:32:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2021-11-07-the-horrible-azure-devops-pipeline-ui/</guid><description><p><strong>Disclaimer: I read the docs, I know this is just complaining and not giving feedback, but man, this UI is still horrible</strong>.</p>
<p>So&hellip; Let me set the scene: there was a connection update between devops and bitbucket and suddenly most of our pipelines stopped working. They told me to change the connection in the yaml file and that didn&rsquo;t work.</p>
<p>I know for sure that there are three parts involved in a pipeline:</p></description></item><item><title>Regex 101</title><link>https://adrianabreu.com/blog/2021-11-01-regex-101/</link><pubDate>Fri, 05 Nov 2021 16:39:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2021-11-01-regex-101/</guid><description><p><em>-You will spend your whole life relearning regex; there is a beginning, but never an end.</em></p>
<p>Last year I participated in some small code problems and practised some regex. I got used to it and feel quite good at it.</p>
<p>And today I had to use it again. I had the following dataframe:</p>
<table>
<thead>
<tr>
<th>product</th>
<th>attributes</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>(SIZE-36)</td>
</tr>
<tr>
<td>2</td>
<td>(COLOR-RED)</td>
</tr>
<tr>
<td>3</td>
<td>(SIZE-38, COLOR-BLUE)</td>
</tr>
<tr>
<td>4</td>
<td>(COLOR-GREEN, SIZE-39)</td>
</tr>
</tbody>
</table>
<p>A wonderful set of strings merged with properties that could vary. And we wanted one column for each:</p></description></item><item><title>Sbt Intro I</title><link>https://adrianabreu.com/blog/2021-10-10-sbt-intro-i/</link><pubDate>Sun, 10 Oct 2021 14:47:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2021-10-10-sbt-intro-i/</guid><description><p>Last month I changed jobs :) and, by chance, I have ended up with scala projects again. This project is quite advanced and makes intensive use of sbt plugins. In fact, one of my upcoming tasks is to build a project template. So I wanted to review the basic concepts of sbt in a series of posts.</p>
<p>What is sbt (<strong>scala build tool</strong>)? It is a tool for managing scala projects. It is the most widely used one (almost 95% of projects are built with sbt) and one of its strengths is that it lets you work with multiple scala versions through cross-compilation.</p></description></item><item><title>A funny bug</title><link>https://adrianabreu.com/blog/2021-07-21-funny-bug/</link><pubDate>Wed, 21 Jul 2021 18:52:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2021-07-21-funny-bug/</guid><description><p>Just yesterday I was trying to analyze some data_types to complete an export to the backend. The idea was simple: for each user, find the latest information available in a table that represented updates to their profile.</p>
<p>Since you could &ldquo;add&rdquo;, &ldquo;update&rdquo; and &ldquo;remove&rdquo; things on the profile, all three types existed in the table. To picture it better, it would look like this:</p>
<table>
<thead>
<tr>
<th>user_id</th>
<th>favorite_stuff</th>
<th>operation</th>
<th>metadata</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>Chocolate</td>
<td>Add</td>
<td>&hellip;</td>
</tr>
<tr>
<td>A</td>
<td>Chocolate</td>
<td>Update</td>
<td>&hellip;</td>
</tr>
<tr>
<td>B</td>
<td>Milk</td>
<td>Remove</td>
<td>&hellip;</td>
</tr>
<tr>
<td>B</td>
<td>Cornflakes</td>
<td>Add</td>
<td>&hellip;</td>
</tr>
</tbody>
</table>
<p>So all the events would have to be combined to know the user&rsquo;s current profile and apply some logic. However, the events that had actually arrived looked like this:</p></description></item><item><title>Exporting Firebase data</title><link>https://adrianabreu.com/blog/2021-05-06-exporting-firebase-data/</link><pubDate>Thu, 06 May 2021 11:49:36 +0000</pubDate><guid>https://adrianabreu.com/blog/2021-05-06-exporting-firebase-data/</guid><description><p>If we work on analyzing data from a mobile application, it is very likely that some system for tracking the app&rsquo;s events is integrated. Among them, one of the best known is Firebase.</p>
<p>These events contain a lot of useful information and let us know, for example, how long a user who has left used the application, or how many days have passed.</p>
<p>Or whether they really followed the flow of actions we expected (with a Sankey diagram we could see where users dropped off).</p></description></item><item><title>Notes on Storytelling with Data</title><link>https://adrianabreu.com/blog/2021-01-04-notes-storytelling-with-data-/</link><pubDate>Mon, 04 Jan 2021 19:00:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2021-01-04-notes-storytelling-with-data-/</guid><description><p>As one of my goals before the new year, I wanted to start giving visibility to the product I am working on with a dashboard. After trying several options, we chose <a href="https://aws.amazon.com/es/quicksight/">Quicksight</a> to simplify the processes in aws and reduce our infrastructure.</p>
<p>Even so, when starting a dashboard from scratch, it is very hard to convey information clearly. It is important to keep users from coming just to export their data to csv and then load it into excel.</p></description></item><item><title>Configuring poetry and gitlab</title><link>https://adrianabreu.com/blog/2020-12-30-configurando-poetry-gitlab/</link><pubDate>Wed, 30 Dec 2020 15:42:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2020-12-30-configurando-poetry-gitlab/</guid><description><p>A little over a month ago I changed jobs and also ran into a considerable change of stack. I am now working with aws + github + python. And well, apart from the changes in concepts and so on, it took me quite a while to find a workflow that did not feel &ldquo;fragile&rdquo;.</p>
<p>The first thing, which disappointed me quite a bit, is that github does not include support for hosting python packages. It will, yes, but with no clear date, and <a href="https://github.com/github/roadmap/projects/1?card_filter_query=label%3Apackages">for now it looks like it will be after June 2021</a>.</p></description></item><item><title>Playing with Data Factory</title><link>https://adrianabreu.com/blog/2020-10-01-jugando-con-df/</link><pubDate>Thu, 01 Oct 2020 10:12:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2020-10-01-jugando-con-df/</guid><description><p>Surprisingly, until now I had not had the chance to work with data factory; I had only used it for some data migrations.</p>
<p>However, after stabilizing a project and consolidating its new stage, we needed to simplify the solution implemented to migrate data.</p>
<p>A simple representation of the current architecture would be:</p>
<p><img src="https://adrianabreu.com/images/data-factory/original-architecture.png" alt="Current architecture"></p>
<p>As a very simple flow, it would be this:</p>
<ol>
<li>The etl writes a csv file with spark to a directory in a blob storage.</li>
<li>The first function filters out the spark files that are not part- files and notifies a function that acts as a gateway for the batch, telling it which file we want to send, the original name, the path and the name we want to give it.</li>
<li>This gateway function makes the necessary calls to the Azure api to create a task in the batch.</li>
<li>The batch compresses the file and sends it to the client&rsquo;s sftp, retrieving the credentials according to the type of file.</li>
</ol>
<p>This process allowed us to work with two versions of the project while we migrated to the new version. Now that the new version is consolidated, and we have also got the client to use a compression format that we can write directly from spark without resorting to the batch, it is time to change the data transfer architecture.</p></description></item><item><title>Join types in spark</title><link>https://adrianabreu.com/blog/2020-12-29-spark-joins/</link><pubDate>Tue, 29 Sep 2020 18:52:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2020-12-29-spark-joins/</guid><description><p>A few days ago I had the fortune (or misfortune) of implementing the most complex logic in the whole domain.
The result, as I expected, was an etl that constantly failed for lack of resources. The problem:</p>
<pre tabindex="0"><code>Caused by: org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
</code></pre><p>The first step was to review the execution plan to see what was happening.</p></description></item><item><title>Calculating the Sunday of the week</title><link>https://adrianabreu.com/blog/2020-09-02-calcular-el-domingo-correspondiente/</link><pubDate>Wed, 02 Sep 2020 10:12:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2020-09-02-calcular-el-domingo-correspondiente/</guid><description><p>When publishing reports it is common to group data by week. Another reason is to align with the business, where closings may happen on specific days, for example a Sunday.</p>
<p>In those cases, if the data is partitioned by day, we want to know which Sunday each record corresponds to.</p>
<p>Those of us who come from other environments tend to think of those complicated date libraries (moment.js, jodatime, etc). Someone might even think of extracting the data from the dataframe and processing it locally.</p></description></item><item><title>Detecting small files in Spark</title><link>https://adrianabreu.com/blog/2020-08-25-detectando-ficheros-pequenos/</link><pubDate>Tue, 25 Aug 2020 17:22:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2020-08-25-detectando-ficheros-pequenos/</guid><description><p>One of the biggest performance problems we can find in data lakes is having to move a huge number of small files, because of the overhead that represents in the transactions.
This databricks post <a href="https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html">https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html</a> recommended creating 1GB parquet files.</p>
<p>However, many people do not know how to detect this. Recently I was playing with a notebook and, using just the dbutils tools, I was able to classify the files I had in the datalake entities into multiple categories, so I could estimate how many files there were in a time range.</p></description></item><item><title>Spark windows functions (I)</title><link>https://adrianabreu.com/blog/2020-08-11-spark-windows-functions/</link><pubDate>Tue, 11 Aug 2020 18:52:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2020-08-11-spark-windows-functions/</guid><description><p>In analytics, it is very common to use window functions for various calculations. Recently I ran into a small problem whose solution improved greatly by using window functions; let me give some context.</p>
<p>We have a users dimension where users register with a date, and a sales table with the global sales for each day.</p>
<p>What we want to provide is a view of how the program evolves each day, so each day should include both the cumulative sales and the cumulative registrations.</p></description></item><item><title>Key Vault access using certificates</title><link>https://adrianabreu.com/blog/2020-05-31-acceso-keyvault-certificados/</link><pubDate>Sat, 30 May 2020 19:52:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2020-05-31-acceso-keyvault-certificados/</guid><description><p>While migrating an application from a webjob to azure batch, we ran into the problem of managing secrets. The batch service picks up an application from a storage and runs executions of it (tasks) on specific machines (pool).</p>
<p>To manage the application&rsquo;s secrets, they were stored in keyvault, and we had to access them securely. That is why we chose certificate-based authentication. The idea of this tutorial is to reproduce the same steps I used to be able to use this certificate.</p></description></item><item><title>Scala best practices notes</title><link>https://adrianabreu.com/blog/2020-04-27-scala-best-practices-nicolas-rinaudo/</link><pubDate>Mon, 27 Apr 2020 20:55:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2020-04-27-scala-best-practices-nicolas-rinaudo/</guid><description><p>I have taken advantage of these quarantine days to review some of the knowledge &ldquo;gaps&rdquo; I had in Scala. One of the talks I was able to watch is this one: <a href="https://www.youtube.com/watch?v=DGa58FfiMqc">Scala best practices I wish someone&rsquo;d told me about - Nicolas Rinaudo</a></p>
<p>Of course I always recommend watching the talk, but I wanted to condense that knowledge (even further) in this post. I insist, it is enjoyable and very interesting. Many of the points defined in the talk have not been explained because most of them are solved in dotty and although</p></description></item><item><title>Notes on functional programming in Scala I</title><link>https://adrianabreu.com/blog/2020-04-07-notes-on-functional-programming-in-scala-i/</link><pubDate>Mon, 06 Apr 2020 18:22:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2020-04-07-notes-on-functional-programming-in-scala-i/</guid><description><p>A few days ago I was able to buy the <a href="https://www.amazon.es/Functional-Programming-Scala-Paul-Chiusano/dp/1617290653">book by Paul Chiusano and Rúnar Bjarnason: Functional Programming in Scala</a> and I could not be happier with it.</p>
<p>As usual, I am taking the opportunity to leave my notes on the book in the blog. This is not a summary of it but rather curiosities that I know I will consult again in the future. I will try to keep the posts from getting too long by writing one per chapter. Still, I recommend everyone get &ldquo;the red book of Scala&rdquo; and take a look at it.</p></description></item><item><title>Limits in azure functions for long-running processes</title><link>https://adrianabreu.com/blog/2020-02-24-azure-functions-service-bus-limits/</link><pubDate>Mon, 24 Feb 2020 19:22:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2020-02-24-azure-functions-service-bus-limits/</guid><description><p>Over the last few weeks I have had to implement some improvements in a project. The goal was very simple: connect the project to an existing datawarehousing application and, externally, compute aggregates and then apply some processing for a particular service.</p>
<p>There was also a series of extra requirements:</p>
<ol>
<li>The processing was going to be reused by another project, and it required compressing and encrypting large files.</li>
<li>The first part simply had to,</li>
<li>There was a very close deadline for this project.</li>
</ol>
<p>With all these constraints, the proposed solution was this:</p></description></item><item><title>Spark basic concepts</title><link>https://adrianabreu.com/blog/2019-11-09-spark-concepts-basicos/</link><pubDate>Sat, 09 Nov 2019 19:22:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2019-11-09-spark-concepts-basicos/</guid><description><p><strong>Author&rsquo;s note</strong>: All the contents of this article are excerpts from the book &ldquo;The Data Engineer&rsquo;s Guide to Apache Spark&rdquo;, which you can download from the databricks page: <a href="https://databricks.com/lp/ebook/data-engineer-spark-guide">https://databricks.com/lp/ebook/data-engineer-spark-guide</a></p>
<h2 id="preludio">Prelude:</h2>
<h3 id="cluster">Cluster:</h3>
<p>A cluster is nothing more than a set of machines working in a coordinated way. A Spark cluster is made up of nodes. One acts as the DRIVER and is the entry point for the user&rsquo;s code. The others act as EXECUTORs, which are in charge of performing the operations.</p></description></item><item><title>Getting started with Spark and Docker</title><link>https://adrianabreu.com/blog/2019-11-10-empezando-en-spark-con-docker/</link><pubDate>Sat, 09 Nov 2019 19:22:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2019-11-10-empezando-en-spark-con-docker/</guid><description><p>Despite having read guides as good as:</p>
<p><a href="https://medium.com/@bogdan.cojocar/how-to-run-scala-and-spark-in-the-jupyter-notebook-328a80090b3b">https://medium.com/@bogdan.cojocar/how-to-run-scala-and-spark-in-the-jupyter-notebook-328a80090b3b</a></p>
<p><a href="https://medium.com/@singhpraveen2010/install-apache-spark-and-configure-with-jupyter-notebook-in-10-minutes-ae120ebca597">https://medium.com/@singhpraveen2010/install-apache-spark-and-configure-with-jupyter-notebook-in-10-minutes-ae120ebca597</a></p>
<p>I have found it an uphill struggle to connect a jupyter notebook and use Scala. Between configuring apache toree to use scala in the notebooks and a later error in spark when using it from IntelliJ, I ended up giving up.</p>
<p><em>Author&rsquo;s note:</em> As a disclaimer, this probably happens because I am on Manjaro and my Scala version is incompatible. I used to solve this kind of problem by pinning a version; however, I think that with a tool as powerful as Docker I am reinventing the wheel for a problem that is already solved. Besides, I am also going to try it on windows to check that it is an OS-agnostic solution.</p></description></item><item><title>Correlated subqueries</title><link>https://adrianabreu.com/blog/2019-09-26-correlated-subqueries/</link><pubDate>Thu, 26 Sep 2019 20:43:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2019-09-26-correlated-subqueries/</guid><description><p>For a couple of months I have been watching most of the effort in the project I am on go into avoiding joins in the different analysis layers. Taking advantage of spark&rsquo;s capabilities, the aim is to keep the structures heavily denormalized, and the join had been &ldquo;demonized&rdquo; as harmful.</p>
<p>So much so that I have spent a couple of days fighting with a piece of code that surprised me. Starting from a fact table that groups data for a period from a to b, the data from 14 days ago should be &ldquo;collapsed&rdquo;. It will be clearer with an example:</p></description></item><item><title>Data I - Introduction to Data Warehousing</title><link>https://adrianabreu.com/blog/2019-02-05-data-i-introduccion-al-datawarehouse/</link><pubDate>Tue, 05 Feb 2019 14:43:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2019-02-05-data-i-introduccion-al-datawarehouse/</guid><description><p>In recent months my work has pivoted from the web world to the data world. I joined a Data Warehouse project and have ended up very happy in it. A few days ago my change became fully official, and now I have realized that I not only have a technical world in front of me, but also need to consolidate some theoretical foundations.</p>
<p>Researching the bibliography, I was recommended this on Reddit: <em>The Data Warehouse Toolkit, The Complete Guide to Dimensional Modeling 2nd Edition.</em>
And the book seems to fit perfectly with the knowledge I am looking for. Even so, as with everything, out of necessity, I will try to summarize in a few posts the knowledge that can be obtained from this book, which I strongly recommend.</p></description></item><item><title>Angular Series III - Dynamic components</title><link>https://adrianabreu.com/blog/2017-12-17-angular-series-iii-dynamic-components/</link><pubDate>Sun, 17 Dec 2017 14:43:32 +0000</pubDate><guid>https://adrianabreu.com/blog/2017-12-17-angular-series-iii-dynamic-components/</guid><description><p>Before finishing the review of <em>templating</em>, I want to make an aside. There are certain cases where templating is insufficient and what we need is simply to choose dynamically which component to render.</p>
<p>This is covered in the angular documentation under the name <a href="https://angular.io/guide/dynamic-component-loader">Dynamic Components</a>.</p>
<h3 id="cómo-funcionan-estos-dynamics-components">How do these dynamic components work?</h3>
<p>Put quick and dirty, the idea is: choose a view element to act as a container and inject the component that should go there.</p></description></item><item><title>Angular Series II - Templating</title><link>https://adrianabreu.com/blog/2017-11-18-angular-series-ii-templating/</link><pubDate>Sat, 18 Nov 2017 18:53:17 +0000</pubDate><guid>https://adrianabreu.com/blog/2017-11-18-angular-series-ii-templating/</guid><description><p>Continuing from <a href="https://adrianabreu.com/blog/2017-08-14-angular-series-i-transclusion/">the other day&rsquo;s article on content projection</a>, here I intend to show another way of passing content: templates.