
Commit caa5b86

👾⚙️ Add crawler, spreadsheet and README
1 parent 1398c24 commit caa5b86

File tree

4 files changed, +81 −0 lines changed

README.md

Lines changed: 2 additions & 0 deletions
# Expressiveness analysis

A spreadsheet (`Transparenzinformationen.xlsx`, with sheet titles generated by `header.py`) was used to examine the individual privacy policies. Each decision was also documented by means of text extracts. In addition, a Python script was developed that can be used to carry out more detailed analyses, e.g. of how frequently instances of non-compliance occur. The software also automatically downloads the full text of each current privacy policy (`crawler.py`), so the study can be repeated at a later point in time for comparison. The study was conducted mainly in German; the summary is translated into English.
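For illustration, a frequency analysis of non-compliance findings could look like the minimal sketch below. The marker string `nicht erfüllt` ("not fulfilled") and the per-sheet layout are assumptions for the example, not taken from the actual spreadsheet:

```python
# Hypothetical sketch: count cells matching a non-compliance marker per
# sheet, assuming the sheets are pandas DataFrames as loaded in crawler.py.
# The marker 'nicht erfüllt' is an assumed placeholder value.
import pandas as pd

def count_marker(sheets, marker='nicht erfüllt'):
    """Number of cells equal to `marker` in each sheet."""
    return [int((sheet == marker).sum().sum()) for sheet in sheets]

# Toy data standing in for the real criterion sheets:
demo = [
    pd.DataFrame({'A': ['erfüllt', 'nicht erfüllt'],
                  'B': ['nicht erfüllt', 'erfüllt']}),
    pd.DataFrame({'A': ['erfüllt', 'erfüllt']}),
]
print(count_marker(demo))  # [2, 0]
```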

Transparenzinformationen.xlsx

13.5 MB
Binary file not shown.

crawler.py

Lines changed: 55 additions & 0 deletions
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os

import pandas as pd
import requests
from tqdm import tqdm

print('Importing file...')
file = 'Transparenzinformationen.xlsx'

print('---')

print('Get names of all available sheets...')
xl = pd.ExcelFile(file)
sheet_names = xl.sheet_names
print(sheet_names)

print('---')

print('Print all available data controllers (companies)...')
df = pd.read_excel(file, sheet_name=sheet_names[0])
controllers = [df['Unnamed: %s' % i][3] for i in range(7, 37)]
print(controllers)

print('---')

print('Read every single sheet for each and every data controller...')
sheets = [pd.read_excel(file, sheet_name=sheet_names[i]) for i in tqdm(range(5, 35))]

print('---')

print('Extract the policy URL for each company...')
policies = []
for count, sheet in enumerate(sheets):
    try:
        p = sheet[controllers[count]][0]
        policies.append(p)
        print(p)
    except (KeyError, IndexError):
        print('URL could not be extracted.')

print('---')

print('Download all privacy policies...')
os.makedirs('download', exist_ok=True)  # make sure the target directory exists
for url in tqdm(policies):
    try:
        response = requests.get(url, timeout=5)
        # Derive a file name from a fixed slice of the URL (host part).
        name = url[12:25].replace('/', '-')
        # response.content is bytes, so the file must be opened in binary mode.
        with open('download/%s.html' % name, 'wb') as f:
            f.write(response.content)
    except requests.exceptions.RequestException:
        print(url + ' could not be fetched.')

header.py

Lines changed: 24 additions & 0 deletions
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# This tool generates the Excel sheet titles according to the schema
# defined in Transparenzinformationen.txt.

import string

# One tab-separated title per criterion: 1A, 1B, 1C, 2A, ..., 10C.
result = ''
for j in range(1, 11):
    for i in range(1, 4):
        result = result + '\t' + str(j) + string.ascii_uppercase[i - 1]
# print(result)


# For each row r, emit a cell reference ='1A'!A{r} ... ='10C'!A{r}, so an
# overview sheet can mirror row r of every criterion sheet.
result = ''
for r in range(1, 21):
    for j in range(1, 11):
        for i in range(1, 4):
            result = result + '\t' + "='" + str(j) + string.ascii_uppercase[i - 1] + "'!A{}".format(r)
print(result)
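As a quick check of the naming schema, the first few generated titles can be reproduced in isolation (a minimal sketch mirroring the loops in `header.py`):

```python
import string

# Titles follow the pattern <criterion number><sub-criterion letter>:
# 1A, 1B, 1C, 2A, ... up to 10C, i.e. 30 titles in total.
titles = [str(j) + string.ascii_uppercase[i] for j in range(1, 11) for i in range(3)]
print(titles[:6])  # ['1A', '1B', '1C', '2A', '2B', '2C']
```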

0 commit comments
