The following exercises are adapted from HackerRank's SQL challenges. They outline how to perform basic SQL-like operations using dataframes.
This tutorial introduces you to data manipulation using Haskell's DataFrame library. We'll work through filtering, selecting, sorting, and combining data using a functional programming approach that's both powerful and expressive.
Make sure you install dataframe and run the custom REPL, which provides all the necessary imports and extensions.
Before we begin, let's load our data. We'll be working primarily with city and station data stored in CSV files. To load a CSV file and expose its columns for easy access:
dataframe> df <- D.readCsv "./data/city.csv"
dataframe> :declareColumns df
The :declareColumns command makes column names available as variables in your scope, allowing you to reference them directly (e.g., id, name, population).
One of the most fundamental operations in data analysis is filtering - selecting rows that meet certain criteria. In Haskell's DataFrame library, we use the filterWhere function combined with comparison operators.
The filterWhere function takes a boolean expression and returns only the rows where that expression evaluates to true. For example, to find rows where a column equals a specific value, we use the .==. operator (same-type, non-nullable) or .== (nullable-aware):
df |> D.filterWhere (columnName .==. value)
The pipe operator |> allows us to chain operations in a readable left-to-right style, similar to Unix pipes.
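There is nothing magical about |>: it is just reverse function application. The standalone sketch below (plain Haskell, independent of the library's own definition) shows the idea on an ordinary list:

```haskell
-- A sketch of the pipe operator: reverse function application.
-- The DataFrame library ships its own |>; this standalone version
-- exists only to illustrate how the chaining style works.
(|>) :: a -> (a -> b) -> b
x |> f = f x
infixl 1 |>

main :: IO ()
main = print ([1 .. 10 :: Int] |> filter even |> take 3)
-- prints [2,4,6]
```

Reading left to right, each stage feeds its result into the next, which is why a dataframe pipeline reads like a sequence of transformations rather than nested function calls.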
Exercise 1: Basic filtering
For this question we will use the data in ./data/city.csv.
Query all columns for a city with the ID 1661.
dataframe> df |> D.filterWhere (id .==. 1661)
-----------------------------------------------------
id | name | country_code | district | population
------|--------|--------------|----------|-----------
Int | Text | Text | Text | Int
------|--------|--------------|----------|-----------
1661 | Sayama | JPN | Saitama | 162472
Exercise 2: Basic filtering (cont)
For this question we will use the data in ./data/city.csv.
Query all columns of every Japanese city. The country_code for Japan is "JPN".
dataframe> df |> D.filterWhere (country_code .==. "JPN")
--------------------------------------------------------
id | name | country_code | district | population
------|----------|--------------|-----------|-----------
Int | Text | Text | Text | Int
------|----------|--------------|-----------|-----------
1613 | Neyagawa | JPN | Osaka | 257315
1630 | Ageo | JPN | Saitama | 209442
1661 | Sayama | JPN | Saitama | 162472
1681 | Omuta | JPN | Fukuoka | 142889
1739 | Tokuyama | JPN | Yamaguchi | 107078
Often you'll need to filter on multiple conditions simultaneously. You can combine boolean expressions using logical operators:
- .&&. for AND (both conditions must be true, non-nullable Bool); .&& for nullable-aware AND
- .||. for OR (either condition can be true, non-nullable Bool); .|| for nullable-aware OR
- .>, .>=, .<, .<= for comparisons (nullable-aware); .>., .>=., .<., .<=. for same-type strict comparisons
For example, to find cities with large populations in a specific country:
df |> D.filterWhere ((population .>. 100000) .&&. (country_code .==. "USA"))
Exercise 3: Basic filtering (cont)
For this question we will use the data in ./data/city.csv.
Query all columns for all American cities with populations larger than 100000. The country_code for America is "USA".
dataframe> df <- D.readCsv "./data/city.csv"
dataframe> :declareColumns df
dataframe> df |> D.filterWhere ((population .>. 100000) .&&. (country_code .==. "USA"))
--------------------------------------------------------------
id | name | country_code | district | population
------|---------------|--------------|------------|-----------
Int | Text | Text | Text | Int
------|---------------|--------------|------------|-----------
3878 | Scottsdale | USA | Arizona | 202705
3965 | Corona | USA | California | 124966
3973 | Concord | USA | California | 121780
3977 | Cedar Rapids | USA | Iowa | 120758
3982 | Coral Springs | USA | Florida | 117549
When working with large datasets, you often want to preview just a few rows rather than displaying thousands of results. The take function limits the output to a specified number of rows from the beginning of the dataframe.
df |> D.take n -- Shows first n rows
This is particularly useful for quickly inspecting data or when you only need a sample of results.
Exercise 4: Constraining output
For this question we will use the data in ./data/city.csv.
Show the first 5 rows of the dataframe.
dataframe> df |> D.take 5
----------------------------------------------------------------
id | name | country_code | district | population
-----|------------------|--------------|-------------------|-----------
Int | Text | Text | Text | Int
-----|------------------|--------------|-------------------|-----------
6 | Rotterdam | NLD | Zuid-Holland | 593321
19 | Zaanstad | NLD | Noord-Holland | 135621
214 | Porto Alegre | BRA | Rio Grande do Sul | 1314032
397 | Lauro de Freitas | BRA | Bahia | 109236
547 | Dobric | BGR | Varna | 100399
While filtering chooses which rows to include, selecting chooses which columns to display. The select function takes a list of column specifications. You can reference columns using F.name columnName:
df |> D.select [F.name column1, F.name column2]
This is useful when you want to focus on specific attributes and reduce visual clutter in your output.
Exercise 5: Basic selection
For this question we will use the data in ./data/city.csv.
Get the first 5 city names.
dataframe> df |> D.select [F.name name] |> D.take 5
-----------------
name
-----------------
Text
-----------------
Rotterdam
Zaanstad
Porto Alegre
Lauro de Freitas
Dobric
The real power of these operations comes from chaining them together. You can filter rows and then select specific columns (or vice versa) to get exactly the data you need:
df |> D.filterWhere (condition) |> D.select [columns] |> D.take n
The order of operations matters - filtering first reduces the data before selection, which can be more efficient.
Exercise 6: Selection and filtering
For this question we will use the data in ./data/city.csv.
Query the names of all the Japanese cities and show only the first 5 results.
dataframe> df |> D.filterWhere (country_code .==. "JPN") |> D.select [F.name name] |> D.take 5
---------
name
---------
Text
---------
Neyagawa
Ageo
Sayama
Omuta
Tokuyama
Exercise 7: Basic select (cont)
For this question we will use the data in ./data/station.csv.
Show the first five city and state rows.
dataframe> df |> D.select [F.name city, F.name state] |> D.take 5
---------------------
city | state
--------------|------
Text | Text
--------------|------
Kissee Mills | MO
Loma Mar | CA
Sandy Hook | CT
Tipton | IN
Arlington | CO
When analyzing categorical data, you often want to see unique values rather than repeated entries. The distinct function removes duplicate rows from your result set:
df |> D.select [F.name column] |> D.distinct
This is particularly useful when exploring what values exist in a column or when preparing data for aggregation.
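In plain-Haskell terms, distinct behaves like nub from Data.List: it keeps the first occurrence of each value and drops later duplicates. A standalone sketch (an analogy, not the library's implementation):

```haskell
import Data.List (nub)

-- nub keeps the first occurrence of each value and drops later
-- duplicates, mirroring what distinct does to dataframe rows.
main :: IO ()
main = print (nub ["MO", "CA", "MO", "CT", "CA"])
-- prints ["MO","CA","CT"]
```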
Exercise 8: Distinct
For this question we will use the data in ./data/station.csv.
Query a list of city names for cities that have an even ID number. Show the results in any order, but exclude duplicates from the answer.
dataframe> df |> D.filterWhere (F.lift even id) |> D.select [F.name city] |> D.distinct
----------------------
city
----------------------
Text
----------------------
Rockton
Forest Lakes
Yellow Pine
Mosca
Rocheport
Millville
...
Lee
Elm Grove
Orange City
Baker
Clutier
Sometimes you need to sort data and then combine results from multiple queries. The sortBy function orders rows by specified columns. Much like SQL, you can specify multiple columns to order by: rows are ordered by the first column, with ties broken by each subsequent column in turn.
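The same tie-breaking behaviour can be sketched with plain lists. Comparators in Haskell form a semigroup, so combining two with <> yields exactly the "first key, then next key" ordering (a standalone illustration, not the library's internals):

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- Sort by age descending; break ties by name ascending.
people :: [(String, Int)]
people = [("Bob", 30), ("Ann", 25), ("Cal", 30)]

main :: IO ()
main = print (sortBy (comparing (Down . snd) <> comparing fst) people)
-- prints [("Bob",30),("Cal",30),("Ann",25)]
```

Bob and Cal tie on age 30, so the second comparator orders them alphabetically.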
You can also use the <> operator to concatenate dataframes vertically (similar to SQL's UNION).
-- Sort by ascending age
df |> D.sortBy [D.Asc age]
-- 1. Sort by descending age
-- 2. Within those who have the same age, sort by alphabetical order of name.
df |> D.sortBy [D.Desc age, D.Asc name]
You can also derive new columns using derive to compute values based on existing columns:
df |> D.derive "newColumn" (F.lift function existingColumn)
Exercise 9: Merging
For this question we will use the data in ./data/station.csv.
Query the two cities in STATION with the shortest and longest city names, as well as their respective lengths (i.e.: number of characters in the name).
We'll include the SQL for comparison:
(SELECT CITY, LENGTH(CITY) FROM STATION ORDER BY LENGTH(CITY) DESC LIMIT 1)
UNION
(SELECT CITY, LENGTH(CITY) FROM STATION ORDER BY LENGTH(CITY) ASC LIMIT 1);
dataframe> letterSort s = df |> D.derive "length" (F.lift T.length city) |> D.select [F.name city, "length"] |> D.sortBy [s "length"] |> D.take 1
dataframe> (letterSort D.Desc) <> (letterSort D.Asc)
-------------------------------
city | length
-----------------------|-------
Text | Int
-----------------------|-------
Marine On Saint Croix | 21
Roy | 3
One of the strengths of working with dataframes in Haskell is the ability to use any Haskell function in your queries. The F.lift function allows you to apply regular Haskell functions to DataFrame columns. This means you can use string functions, mathematical operations, or even your own custom logic:
df |> D.filterWhere (F.lift customFunction columnName)
This enables sophisticated filtering that goes beyond simple comparisons. For example, you can check string prefixes, perform calculations, or apply complex business logic.
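Because the function handed to F.lift is ordinary Haskell, you can write and test it in isolation before lifting it over a column. A sketch of a vowel-prefix predicate on plain Strings (startsWithVowel is a hypothetical helper name, not part of the library):

```haskell
import Data.Char (toLower)
import Data.List (isPrefixOf)

-- An ordinary predicate: does the lowercased string start with a vowel?
-- Any function like this can be lifted over a column with F.lift.
startsWithVowel :: String -> Bool
startsWithVowel s = any (`isPrefixOf` map toLower s) ["a", "e", "i", "o", "u"]

main :: IO ()
main = print (filter startsWithVowel ["Arlington", "Boston", "Odin"])
-- prints ["Arlington","Odin"]
```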
Exercise 10: Duplicates and user defined functions
For this question we will use the data in ./data/station.csv.
Query the list of city names starting with vowels (i.e., a, e, i, o, or u). Your result cannot contain duplicates.
dataframe> df |> D.select [F.name city] |> D.filterWhere (F.lift (\c -> any (`T.isPrefixOf` (T.toLower c)) ["a", "e", "i", "o", "u"]) city) |> D.distinct |> D.take 5
----------
city
----------
Text
----------
Arlington
Albany
Upperco
Aguanga
Odin
Parquet reads can be configured so you only load the columns and rows you need. This is useful when files are wide or when you want to filter data at read-time.
For this section we will use ./data/mtcars.parquet.
dataframe> df0 <- D.readParquet "./data/mtcars.parquet"
dataframe> :declareColumns df0
ParquetReadOptions currently supports:
- selectedColumns
- predicate
- rowRange
- safeColumns
Options are applied in this order: predicate filtering, column projection, row range, then safe column promotion.
Exercise 11: Parquet projection
Read only the mpg, cyl, and wt columns.
dataframe> D.readParquetWithOpts
dataframe| (D.defaultParquetReadOptions{D.selectedColumns = Just ["mpg", "cyl", "wt"]})
dataframe| "./data/mtcars.parquet"
Exercise 12: Row range
Read rows 5 to 10 (start inclusive, end exclusive).
dataframe> D.readParquetWithOpts
dataframe| (D.defaultParquetReadOptions{D.rowRange = Just (5, 10)})
dataframe| "./data/mtcars.parquet"
Exercise 13: Predicate and projection
Read rows where cyl >= 6, but return only the mpg column.
dataframe> D.readParquetWithOpts
dataframe| ( D.defaultParquetReadOptions
dataframe| { D.selectedColumns = Just ["mpg"]
dataframe| , D.predicate = Just (cyl .>=. 6)
dataframe| }
dataframe| )
dataframe| "./data/mtcars.parquet"
When selectedColumns is set, columns referenced by predicate are automatically read as needed, then projected back to the requested output columns.
Exercise 14: Safe column promotion
Read the file while promoting every output column to an optional column.
dataframe> D.readParquetWithOpts
dataframe| (D.defaultParquetReadOptions{D.safeColumns = True})
dataframe| "./data/mtcars.parquet"
Use safeColumns when downstream code wants a uniformly nullable schema, even when the Parquet file marks some columns as non-nullable.
Exercise 15: Using the typed API
This problem is called "Interviews" on HackerRank. Samantha interviews many candidates from different colleges using coding challenges and contests. Write a query to print the contest_id, hacker_id, name, and the sums of total_submissions, total_accepted_submissions, total_views, and total_unique_views for each contest sorted by contest_id. Exclude the contest from the result if all four sums are 0.
WITH interviews
AS (
SELECT con.contest_id AS contest_id
,con.hacker_id AS hacker_id
,con.name AS name
,ISNULL(ss.total_submissions, 0) total_submissions
,ISNULL(ss.total_accepted_submissions, 0) total_accepted_submissions
,ISNULL(vs.total_views, 0) total_views
,ISNULL(vs.total_unique_views, 0) total_unique_views
FROM contests con
JOIN colleges col ON con.contest_id = col.contest_id
JOIN challenges cha ON col.college_id = cha.college_id
LEFT JOIN (
SELECT challenge_id
,sum(total_views) AS total_views
,sum(total_unique_views) AS total_unique_views
FROM view_stats
GROUP BY challenge_id
) vs ON cha.challenge_id = vs.challenge_id
LEFT JOIN (
SELECT challenge_id
,sum(total_submissions) AS total_submissions
,sum(total_accepted_submissions) AS total_accepted_submissions
FROM submission_stats
GROUP BY challenge_id
) ss ON cha.challenge_id = ss.challenge_id
)
SELECT contest_id
,hacker_id
,name
,sum(total_submissions)
,sum(total_accepted_submissions)
,sum(total_views)
,sum(total_unique_views)
FROM interviews
GROUP BY contest_id
,hacker_id
,name
HAVING sum(total_submissions) + sum(total_accepted_submissions) +
sum(total_views) + sum(total_unique_views) > 0
ORDER BY contest_id;
#!/usr/bin/env cabal
{- cabal:
build-depends: base >= 4, dataframe
-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE NumericUnderscores #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE TypeApplications #-}
module Main where
import qualified DataFrame as D
import qualified DataFrame.Functions as F
import qualified DataFrame.Typed as DT
import DataFrame.Operators ((|>))
import DataFrame.Typed ((.>.))
$(DT.deriveSchemaFromCsvFile "Challenges" "./data/challenges_table.csv")
$(DT.deriveSchemaFromCsvFile "Colleges" "./data/colleges_table.csv")
$(DT.deriveSchemaFromCsvFile "Contests" "./data/contests.csv")
$(DT.deriveSchemaFromCsvFile "Submissions" "./data/submission_stats_table.csv")
$(DT.deriveSchemaFromCsvFile "Views" "./data/view_stats_table.csv")
main :: IO ()
main = do
challenges' <-
either (error . show) id . DT.freezeWithError @Challenges
<$> D.readCsv "./data/challenges_table.csv"
colleges' <-
either (error . show) id . DT.freezeWithError @Colleges
<$> D.readCsv "./data/colleges_table.csv"
contests' <-
either (error . show) id . DT.freezeWithError @Contests
<$> D.readCsv "./data/contests.csv"
submissions' <-
either (error . show) id . DT.freezeWithError @Submissions
<$> D.readCsv "./data/submission_stats_table.csv"
views' <-
either (error . show) id . DT.freezeWithError @Views
<$> D.readCsv "./data/view_stats_table.csv"
let contestsWithColleges =
contests'
|> flip (DT.innerJoin @'["contest_id"]) colleges'
|> flip (DT.innerJoin @'["college_id"]) challenges'
let submissionTotals =
submissions'
|> DT.groupBy @'["challenge_id"]
|> DT.aggregate
( DT.aggNil
|> DT.agg @"total_submissions" (DT.sum (DT.col @"total_submissions"))
|> DT.agg @"total_accepted_submissions"
(DT.sum (DT.col @"total_accepted_submissions"))
)
let viewTotals =
views'
|> DT.groupBy @'["challenge_id"]
|> DT.aggregate
( DT.aggNil
|> DT.agg @"total_views" (DT.sum (DT.col @"total_views"))
|> DT.agg @"total_unique_views" (DT.sum (DT.col @"total_unique_views"))
)
print $
contestsWithColleges
|> flip (DT.leftJoin @'["challenge_id"]) viewTotals
|> flip (DT.leftJoin @'["challenge_id"]) submissionTotals
|> DT.select
@'[ "contest_id"
, "hacker_id"
, "name"
, "total_submissions"
, "total_accepted_submissions"
, "total_views"
, "total_unique_views"
]
|> DT.impute @"total_unique_views" (0 :: Int)
|> DT.impute @"total_views" (0 :: Int)
|> DT.impute @"total_submissions" (0 :: Int)
|> DT.impute @"total_accepted_submissions" (0 :: Int)
|> DT.filterWhere
( DT.col @"total_unique_views"
+ DT.col @"total_views"
+ DT.col @"total_submissions"
+ DT.col @"total_accepted_submissions" .>. DT.lit 0
)
You've now learned the fundamental operations for working with dataframes in Haskell:
- Filtering with filterWhere to select rows based on conditions
- Selecting with select to choose specific columns
- Limiting with take to control output size
- Removing duplicates with distinct
- Sorting with sortBy and combining results with <>
- Applying custom functions with F.lift for sophisticated data manipulation
- Reading Parquet with options using readParquetWithOpts for projection, predicate filtering, and row ranges
- Using the typed API in DataFrame.Typed to create type-safe pipelines
These building blocks can be composed together to answer complex data analysis questions in a clear, functional style.