
Commit caa5b86

👾⚙️ Add crawler, spreadsheet and README
1 parent 1398c24 commit caa5b86

File tree

4 files changed, +81 −0 lines changed

README.md

Lines changed: 2 additions & 0 deletions
# Expressiveness analysis

A spreadsheet (`Transparenzinformationen.xlsx`, with sheet titles generated by `header.py`) was used to examine the individual privacy policies. Each decision was also documented by means of text extracts. In addition, a Python script was developed that can be used to carry out more detailed analyses, e.g. of how frequently instances of non-compliance occur. The software also automatically downloads the full text of each current privacy policy (`crawler.py`), so the study can be repeated at a later point in time for comparison. The study was conducted mainly in German; the summary is translated into English.
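For illustration, a frequency analysis of non-compliance findings could look like the minimal sketch below. The marker string `nicht erfüllt` ("not fulfilled") and the per-sheet layout are assumptions for the example, not taken from the actual spreadsheet:

```python
# Hypothetical sketch: count cells matching a non-compliance marker per
# sheet, assuming the sheets are pandas DataFrames as loaded in crawler.py.
# The marker 'nicht erfüllt' is an assumed placeholder value.
import pandas as pd

def count_marker(sheets, marker='nicht erfüllt'):
    """Number of cells equal to `marker` in each sheet."""
    return [int((sheet == marker).sum().sum()) for sheet in sheets]

# Toy data standing in for the real criterion sheets:
demo = [
    pd.DataFrame({'A': ['erfüllt', 'nicht erfüllt'],
                  'B': ['nicht erfüllt', 'erfüllt']}),
    pd.DataFrame({'A': ['erfüllt', 'erfüllt']}),
]
print(count_marker(demo))  # [2, 0]
```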

Transparenzinformationen.xlsx

13.5 MB
Binary file not shown.

crawler.py

Lines changed: 55 additions & 0 deletions
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os

import pandas as pd
import requests
from tqdm import tqdm

print('Importing file...')
file = 'Transparenzinformationen.xlsx'

print('---')

print('Get names of all available sheets...')
xl = pd.ExcelFile(file)
sheet_names = xl.sheet_names
print(sheet_names)

print('---')

print('Print all available data controllers (companies)...')
df = pd.read_excel(file, sheet_name=sheet_names[0])
controllers = [df['Unnamed: %s' % i][3] for i in range(7, 37)]
print(controllers)

print('---')

print('Read every single sheet for each and every data controller...')
sheets = [pd.read_excel(file, sheet_name=sheet_names[i]) for i in tqdm(range(5, 35))]

print('---')

print('Extract the policy URL for each company...')
policies = []
for count, sheet in enumerate(sheets):
    try:
        p = sheet[controllers[count]][0]
        policies.append(p)
        print(p)
    except (KeyError, IndexError):
        print('URL could not be extracted.')

print('---')

print('Download all privacy policies...')
os.makedirs('download', exist_ok=True)  # make sure the target directory exists
for url in tqdm(policies):
    try:
        response = requests.get(url, timeout=5)
        # Derive a file name from a fixed slice of the URL (host part).
        name = url[12:25].replace('/', '-')
        # response.content is bytes, so the file must be opened in binary mode.
        with open('download/%s.html' % name, 'wb') as f:
            f.write(response.content)
    except requests.exceptions.RequestException:
        print(url + ' could not be fetched.')

header.py

Lines changed: 24 additions & 0 deletions
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# This tool generates the Excel sheet titles according to the schema
# defined in Transparenzinformationen.txt.

import string

# One tab-separated title per criterion: 1A, 1B, 1C, 2A, ..., 10C.
result = ''
for j in range(1, 11):
    for i in range(1, 4):
        result = result + '\t' + str(j) + string.ascii_uppercase[i - 1]
# print(result)


# For each row r, emit a cell reference ='1A'!A{r} ... ='10C'!A{r}, so an
# overview sheet can mirror row r of every criterion sheet.
result = ''
for r in range(1, 21):
    for j in range(1, 11):
        for i in range(1, 4):
            result = result + '\t' + "='" + str(j) + string.ascii_uppercase[i - 1] + "'!A{}".format(r)
print(result)
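As a quick check of the naming schema, the first few generated titles can be reproduced in isolation (a minimal sketch mirroring the loops in `header.py`):

```python
import string

# Titles follow the pattern <criterion number><sub-criterion letter>:
# 1A, 1B, 1C, 2A, ... up to 10C, i.e. 30 titles in total.
titles = [str(j) + string.ascii_uppercase[i] for j in range(1, 11) for i in range(3)]
print(titles[:6])  # ['1A', '1B', '1C', '2A', '2B', '2C']
```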

0 commit comments
