Commit 9036509

Merge pull request #44 from statisticsnorway/dev
Update in guidelines
2 parents c2d1974 + 481fe8c commit 9036509

2 files changed: 24 additions, 32 deletions

docs/guide.md

Lines changed: 5 additions & 0 deletions
@@ -40,6 +40,11 @@ In panel establishment surveys, companies will sometimes accidentally report an
 ```python
 det.accumulation_error(y_var="turnover", time_var="time_period")
 ```
+This returns data with an extra variable "flag_accumulation" containing a flag (0/1) for whether observations are higher than expected. To return only the units where all time periods are higher than expected, as is the case in accumulation errors, use the `output_format` parameter:
+
+```python
+det.accumulation_error(y_var="turnover", time_var="time_period", output_format="outliers")
+```
 
 ## Check for outliers using the HB-method
 Hidiroglou-Berthelot (HB) method is a popular tool for detecting outliers in data in establishment surveys. It is a data driven approach to determine the parameters for edits. [Winkler et. al. ](http://www.asasrms.org/Proceedings/y2023/files/HB_JSM_2023.pdf) provide a nice summary evaluating the method.
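
The guide text added in this hunk distinguishes a per-observation flag from units that are flagged in every time period. A minimal pandas sketch of that distinction follows. It does not call vaskify, the data are made up, and reading `output_format="outliers"` as "keep units whose flag is 1 in all periods" is an assumption based only on the sentence in the diff.

```python
import pandas as pd

# Hypothetical flagged output: one row per company and period.
flagged = pd.DataFrame(
    {
        "id_company": ["A", "A", "A", "B", "B", "B"],
        "time_period": ["2020-01", "2020-02", "2020-03"] * 2,
        "flag_accumulation": [1, 1, 1, 0, 1, 0],
    }
)

# A company qualifies only if its smallest flag value is 1,
# i.e. every one of its periods is flagged.
all_flagged = (
    flagged.groupby("id_company")["flag_accumulation"].transform("min") == 1
)
outliers = flagged[all_flagged]
```

Here company "A" is kept with all three of its rows, while company "B" is dropped because only one of its periods is flagged.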

src/vaskify/createdata.py

Lines changed: 19 additions & 32 deletions
@@ -1,6 +1,7 @@
 # %%
 # Functions to create data
 
+
 import numpy as np
 import pandas as pd
 
@@ -21,6 +22,10 @@ def create_test_data(
 
     Returns:
         pd.DataFrame: Test data in long format.
+
+    Raises:
+        ValueError: If `freq` is not one of "monthly", "quarterly", or "yearly".
+
     """
     rng = np.random.default_rng(seed) if seed else np.random.default_rng()
 
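
The unchanged context line in this hunk selects the generator with `if seed`, so a seed of `0` is treated the same as no seed at all. The sketch below illustrates that behaviour with two hypothetical helpers that are not part of vaskify. It relies on the fact that `np.random.default_rng(None)` already draws fresh entropy, so passing `seed` straight through would keep `0` reproducible.

```python
import numpy as np


def make_rng_truthy(seed=None):
    # Mirrors the pattern in the diff: 0 is falsy, so seed=0 gives an unseeded generator.
    return np.random.default_rng(seed) if seed else np.random.default_rng()


def make_rng(seed=None):
    # Hypothetical alternative: default_rng(None) already means "unseeded".
    return np.random.default_rng(seed)


# The same nonzero seed gives identical draws under either helper.
a = make_rng_truthy(42).integers(0, 1000, size=5)
b = make_rng(42).integers(0, 1000, size=5)

# With the pass-through helper, seed=0 is reproducible as well.
c = make_rng(0).integers(0, 1000, size=5)
d = make_rng(0).integers(0, 1000, size=5)
```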
@@ -30,39 +35,21 @@ def create_test_data(
     industry_codes = ["B", "C", "F", "G", "H", "J", "M", "N", "S"]
     industries = rng.choice(industry_codes, size=n, replace=True)
 
-    # Generate time periods
+    # Generate time periods as List[str]
     if freq == "monthly":
-        time_periods = (
-            pd.date_range(
-                start="2020-01-01",
-                periods=n_periods,
-                freq="ME",
-            )
-            .to_period("M")
-            .astype(str)
-        )
-    if freq == "quarterly":
-        time_periods = (
-            pd.date_range(
-                start="2020-01-01",
-                periods=n_periods,
-                freq="QE",
-            )
-            .to_period("Q")
-            .astype(str)
-        )
-    if freq == "yearly":
-        time_periods = (
-            pd.date_range(
-                start="2020-01-01",
-                periods=n_periods,
-                freq="YE",
-            )
-            .to_period("Y")
-            .astype(str)
-        )
-
-    # Create Cartesian product of industries and periods
+        periods = pd.period_range(start="2020-01-01", periods=n_periods, freq="M")
+        time_periods: list[str] = [f"{p.year}-{p.month:02d}" for p in periods]
+    elif freq == "quarterly":
+        periods = pd.period_range(start="2020-01-01", periods=n_periods, freq="Q-DEC")
+        time_periods = [f"{p.year}-Q{p.quarter}" for p in periods]
+    elif freq == "yearly":
+        periods = pd.period_range(start="2020-01-01", periods=n_periods, freq="Y")
+        time_periods = [f"{p.year}" for p in periods]
+    else:
+        mes = "freq must be one of: 'monthly', 'quarterly', 'yearly'"
+        raise ValueError(mes)
+
+    # Create product of industries and periods
     data = pd.DataFrame(
         [(id_company, period) for id_company in company_ids for period in time_periods],
         columns=["id_company", "time_period"],
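
The rewritten branch builds period labels by formatting `pd.Period` attributes rather than converting a date range to strings. The sketch below isolates that logic so it can be run on its own. The function name `make_time_periods` is hypothetical and simply wraps the added lines from the diff; it is not a function in vaskify.

```python
import pandas as pd


def make_time_periods(freq: str, n_periods: int) -> list[str]:
    """Return period labels starting in 2020, following the logic in the diff."""
    if freq == "monthly":
        periods = pd.period_range(start="2020-01-01", periods=n_periods, freq="M")
        return [f"{p.year}-{p.month:02d}" for p in periods]
    if freq == "quarterly":
        periods = pd.period_range(start="2020-01-01", periods=n_periods, freq="Q-DEC")
        return [f"{p.year}-Q{p.quarter}" for p in periods]
    if freq == "yearly":
        periods = pd.period_range(start="2020-01-01", periods=n_periods, freq="Y")
        return [f"{p.year}" for p in periods]
    # Unknown frequencies fail loudly instead of leaving the labels undefined.
    mes = "freq must be one of: 'monthly', 'quarterly', 'yearly'"
    raise ValueError(mes)
```

For example, `make_time_periods("quarterly", 3)` returns `["2020-Q1", "2020-Q2", "2020-Q3"]`. With the earlier chain of independent `if` statements, an unrecognised `freq` would have left `time_periods` unassigned and failed later with a less informative error; the `else` branch makes the failure explicit.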
