\chapter{Discussion\label{discussion}}
% rqs and difficulity of process
The results regarding the research questions can be summarized as follows: there is a relatively large number of public software licenses, and the naming conventions of these licenses are not yet firmly established. In this chapter we discuss these results in more detail and consider their implications. Before proceeding to the following sections, it is worth noting that the work on this thesis proved difficult because the majority of the literature is grey literature and the subject of public software licenses as a whole has not previously been researched academically.

% indications
This research indicates that the field of public licenses in software engineering, meaning public software licenses, is at an early stage of development. The review of existing literature reveals a notable absence of a common set of definitions and terminology in this field. Consequently, separate works often cover the same ground. Furthermore, the variability of terminology across the literature makes it challenging to compare and synthesize results effectively. Addressing this challenge will require the development of common measurement tools and frameworks for evaluating and comparing public licenses in software engineering. Such efforts could lead to the establishment of widely adopted standards for measuring and improving public software licenses.

% follow-up observation
That said, there is clear interest in public licenses in software engineering. Although the amount of purely academic literature on public software licenses is limited, the amount of grey literature on them is steadily increasing. This is most likely due to the recent developments regarding proprietarization noted in \hyperref[intro]{Chapter 1}.

% notable observation
A notable observation of our research is that recent efforts in the industry have led to the development of Post Open Source, which is not yet noted in the research literature reviewed \citep{register:poss}. Whether or not this paradigm will change the industry as open source and free software did will only become evident over time. The paradigm is explained briefly in \hyperref[conclusions]{Chapter 5}.

% acknowledge the topic is complex
The quest to objectively categorize every public software license is a complex one. It is therefore essential to keep taking the right steps towards increasing scientific understanding and providing the industry with examples, standards and processes to follow. However, as the previous chapters have revealed, a significant amount of effort is still being spent on solving the same problem multiple times rather than on building on existing knowledge and finding the next problem to solve. As the knowledge, conventions, and terminology take shape, we can look forward to reaching a state where less effort is spent on defining concepts and more on practical problem-solving.

\section{Implications for research}
% how to generally improve scientific scene 1
To improve the maturity of research methods employed in the field of public licenses in software engineering, the results indicate that researchers should aim to use more rigorous and comprehensive research methods. This may involve using larger and more diverse data sets, developing more sophisticated measurement tools, and conducting experiments that are representative of real-world scenarios.

% how to generally improve scientific scene 2
Furthermore, researchers should strive to increase the transparency and reproducibility of their research by making their data and code openly available. This would enable other researchers to replicate and build upon their work, as well as facilitate the establishment of common standards and best practices.

% improve scientific scene multivocal vs academic
Finally, it is important for researchers to publish more articles even if those articles must draw on grey literature. Because most of what has been published in the field in the twenty-first century is grey literature, the next academic articles will be multivocal by default. Non-multivocal, purely academic articles will follow, but only after systematic, academic and multivocal articles have been published for them to build on. The results presented here are modest, but by working together, researchers and industry professionals could produce more useful research regarding public licenses in software engineering.

\section{Implications for software engineering professionals}
% how to improve professional scene 1
The results indicate that software engineering professionals should start by educating themselves on the basics of public licenses in software engineering and incorporating this knowledge into their design and development processes. They should be mindful, or strive to be mindful, of the public licenses their third-party software uses and of how these licenses affect their craft. Making a map of public software licenses and their corresponding use cases might help in seeing the larger picture.

% how to improve professional scene 2
However, it is important to acknowledge that educational institutions should hold the greater responsibility for teaching the basics of public software licenses, without getting too tangled up in history or politics and without simply waving the field off as a form of human rights activism. The key focus points here are vocational schools, software engineering courses, colleges and universities, since these are the stages at which most software engineers start to produce code that needs to be licensed or that requires the use of licensed software.

% overall implications
Overall, the lack of public software license knowledge among software engineering professionals points to the need for more education regarding public software licenses and the practical effects stemming from the application of these licenses.

\section{Limitations and threats to validity}
We made efforts to ensure the inclusion of a comprehensive set of literature in the search process. However, as with all systematic literature reviews, a comprehensive manual review of all literature would have been a formidable task. Therefore, additional filtering was conducted. This filtering was carried out in two phases: the application of inclusion/exclusion criteria, followed by a second phase focused on evaluating the nature of the public software licenses and conducting a manual review. As a result of this second phase, a set of literature was excluded.

% no peer reviews
The major limitation of this study is that the subjective results could not be validated by multiple researchers. In a systematic review, it is standard practice and highly recommended to have at least two individuals independently conduct the review processes and then cross-validate the findings. This would make it possible to compare individual exclusion decisions and other decisions. To mitigate this limitation, the author has attempted to document the methodology as thoroughly as possible.

% inaccuracy of single reseracher
As the work of a single researcher, this thesis also carries a risk of inaccuracy and bias in the literature selection and filtering process. As much of the literature had to be reviewed manually and then included or excluded on a qualitative basis, this is a known limitation and a threat to validity. Multiple rounds of documented filtering and a clear paper trail of all decisions made keep this threat at an acceptable level.

\subsection{Limitations of literature selection for review}
Efforts were made to ensure the inclusion of a comprehensive set of literature in the search process. This was achieved by using the Wikipedia article on the \texttt{MIT} license as the starting point for finding license listing sites.

\subsection{Inaccuracy and bias in license listing site data}
% SPDX & DFSG
The first phase of filtering has some notable limitations, starting with two of the license listing websites: SPDX and DFSG. Since the material was gathered into a spreadsheet program, duplicates were removed using the short identifier that the listing page used. To make this threat concrete, suppose our spreadsheet has already acquired a public license with the identifier \texttt{MIT}. The results of phase one will then not include any other public license marked with the identifier \texttt{MIT}. In the worst case, a license removed as a duplicate could actually have been a different license, say \texttt{MIT-DFSG-edition}, listed under the identifier \texttt{MIT}. Since there were so many public software licenses in phase one, it would not have been possible to check the uniqueness of all removed duplicates. One reason why this would not have been feasible is that the listing sites fetch the public license contents from another webpage or, in the second-worst case, from another website. In the worst case the URL is dead and returns HTTP 404. The number of public software licenses and duplicates, together with the lack of existing tools, makes this problem multilayered. Nevertheless, this is the level of integrity with which we decided to finish our study.
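
The identifier-based removal described above can be sketched in Python as follows. This is a minimal illustration and not the spreadsheet procedure actually used: the record fields, the function name \texttt{drop\_duplicate\_identifiers} and the example URLs are all hypothetical. The sketch keeps the first record seen for each short identifier and returns the dropped records separately.

```python
from typing import NamedTuple


class LicenseRecord(NamedTuple):
    """One row of the (hypothetical) phase-one spreadsheet."""
    site: str        # listing site the record came from
    identifier: str  # short identifier used by that site
    text_url: str    # placeholder URL of the full license text


def drop_duplicate_identifiers(records):
    """Keep the first record per identifier; return (kept, dropped)."""
    seen = set()
    kept, dropped = [], []
    for record in records:
        if record.identifier in seen:
            dropped.append(record)  # removed without comparing contents
        else:
            seen.add(record.identifier)
            kept.append(record)
    return kept, dropped


# Illustrative data only; the URLs are placeholders.
records = [
    LicenseRecord("SPDX", "MIT", "https://example.org/spdx/MIT"),
    LicenseRecord("DFSG", "MIT", "https://example.org/dfsg/MIT"),
    LicenseRecord("DFSG", "Zlib", "https://example.org/dfsg/Zlib"),
]
kept, dropped = drop_duplicate_identifiers(records)
```

The second \texttt{MIT} record is dropped purely because of its identifier, which is exactly the threat discussed: its contents are never compared with those of the record that was kept.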

% FSF listing site
FSF's listing site presented some limitations to the scope of this thesis. The license with the short identifier \texttt{other} was not a public license but a hyperlink to another listing webpage, which listed programs whose licenses the FSF has not yet managed to document. Although one of the programs, \texttt{babl}, was marked as licensed under ``gplv3'', the number of undocumented programs was over 5200 at the time of observation. For this reason we exclude the public software licenses found indirectly through the category \texttt{other}.

% FSF list problems continued
The FSF license listing site also had some issues more minor than those described above. Licenses like \texttt{DejaVu} and \texttt{DBG-3.0} did have an FSF license page reachable from the listing site, but these pages offered only a single whitespace character as the full license text. Licenses like \texttt{CorkForkPL} also contained a whitespace character as the full license text but included a note about a piece of software that uses the license. Sometimes the full license text could be found simply by following the provided hyperlink to the software mentioned, which is what we did with \texttt{JahiaCSL}. In other cases it would have required the author to download and unarchive source code to see the full license text, or additionally to use an internet archiver because of broken hyperlinks or the software's website being permanently down. We resolved this dilemma by deciding to retrieve the full license text only if it was at most one click away from the original license listing page. In cases where the license was listed on the FSF license listing page as whitespace, the full license text was fetched from the next license listing site in the Wikipedia infobox order, if it existed there. For example, the full license text for FSF's \texttt{MPL} was fetched from GNU under \texttt{MPL-1.1}. While we arrived at reproducible rules for our literature selection phase, it is fair to note that these remain threats to validity regardless of the systematic nature of the remedies presented here.
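
The fallback rule for blank license pages can be expressed as a short Python sketch. The function name \texttt{pick\_license\_text}, the dictionary layout, the two-site order and the placeholder text are assumptions made for illustration only; the order actually used is the Wikipedia infobox order described earlier in the thesis.

```python
def pick_license_text(texts_by_site, site_order):
    """Return (site, text) for the first site whose text is not blank.

    texts_by_site maps a listing site name to the full license text found
    there; a text consisting only of whitespace counts as missing.
    """
    for site in site_order:
        text = texts_by_site.get(site, "")
        if text.strip():
            return site, text
    return None, ""


# Illustrative excerpt of a site order and placeholder texts.
site_order = ["FSF", "GNU"]
texts_by_site = {"FSF": " ", "GNU": "placeholder full license text"}
site, text = pick_license_text(texts_by_site, site_order)
```

Here the FSF entry contains only whitespace, so the sketch falls through to the next site in the order, mirroring the \texttt{MPL} case above.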

% GNU listing site
The GNU project's listing site allowed us to use a shortcut of sorts, which we document here in order to acknowledge its limitations. The table of contents on the listing site marked a certain consecutive run of public licenses as software licenses. In addition, the public software licenses were not organized into easily processable tables but were stacked on top of one another in rich text. Although we decided to use a regular expression on the HTML file, the only public software licenses included were those directly under the header ``Software licenses''. In the worst case, the GNU project could have misclassified some public software licenses as non-software licenses, causing this thesis to exclude them for the wrong reason. While this seems unlikely judging from a quick glance and given the existence of the other four license listing sites, we think it is still worth documenting for the sake of the validity and integrity of this thesis.

% possible false-positive duplicates
\subsection{Technical inaccuracies in the search process}
In addition to filters that were too heavy, we would also like to document filters that were too light in the literature selection for review. We can see from \hyperref[appendix:a]{Appendix A} that, for example, the public software licenses with the literature identifiers L777 and L780 have almost the same short identifiers: \texttt{ZPL - 2.1} and \texttt{ZPL-2.1}. Removing such duplicates would have been seemingly simple in phase 1. However, with over 700 pieces of literature present, we decided not to give special treatment to any potential set of duplicates. While it is most likely that OSI's \texttt{ZPL - 2.1} is exactly equivalent to SPDX's \texttt{ZPL-2.1}, we could not be sure without looking at their contents. This could have resulted in duplicate public software licenses in the literature selection for review, but these types of duplicates are removed in phases 2 and 3, where the public software licenses are read in full.
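
For illustration, near-identical identifiers of this kind could be flagged with a small normalization step. The Python sketch below was not part of the thesis workflow; the function names, the pairing of each identifier with L777 or L780, and the extra \texttt{MIT} entry are assumptions. It only groups candidates for manual inspection and does not decide that two licenses are equal.

```python
import re
from collections import defaultdict


def normalize_identifier(identifier: str) -> str:
    """Collapse whitespace around hyphens and ignore letter case."""
    return re.sub(r"\s*-\s*", "-", identifier.strip()).casefold()


def candidate_duplicates(entries):
    """Group (literature_id, identifier) pairs by normalized identifier."""
    groups = defaultdict(list)
    for literature_id, identifier in entries:
        groups[normalize_identifier(identifier)].append(literature_id)
    # Only identifiers shared by more than one entry are of interest.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# The "L001" entry is made up; the other two follow the example above.
entries = [("L777", "ZPL - 2.1"), ("L780", "ZPL-2.1"), ("L001", "MIT")]
groups = candidate_duplicates(entries)
```

A match on the normalized identifier is only a hint: as noted above, equivalence can be confirmed only by reading the full license texts.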

% Miscellaneous validity issues on literature selection
To finish this subsection we discuss some more minor validity issues that did not fit into \hyperref[results]{Chapter 3} but are nevertheless important to note for the integrity of the thesis. Stage three of the search process included a validity threat regarding the removal of duplicates. If two full license texts seemed to be duplicates, we checked the license pages on the two license listing sites for further investigation without using an internet archiver. Not relying on an internet archiver for every possible source is a recurring validity threat in this thesis. Still, archiving and then accessing more than a thousand license pages would have been a very slow process.

% why exclusion over inclusion
As can be seen in \hyperref[methods]{Chapter 2}, the regular expression string was only an exclusion filter. Using both an inclusion and an exclusion filter made it difficult to match all of the public software licenses. In other words, it eventually turned out to be faster to match the excludable licenses than the includable ones. The validity threat here is that using only an exclusion filter implies that the majority of the public licenses in our dataset are public software licenses. An example of a public software license that is difficult to include is \texttt{wtfpl}, whose text includes no evidence of it being a public software license rather than a general public license. However, \texttt{wtfpl} is widely used in software source code, as can be seen in \hyperref[results]{Chapter 3}, and it is therefore treated as a public software license. Further examples supporting the exclusion-only choice are the font licenses that are considered public software licenses. With such exceptions inflating the inclusion regular expression string, we eventually decided to use only the aforementioned exclusion filter. Before the decision our inclusion string looked like this:
\begin{verbatim}
  (.*\b(source|software|program|code|module|public(s+)license|ware|
  (w+)ware)\b).*
\end{verbatim}
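
The difference between an inclusion filter and an exclusion-only filter can be illustrated with a small Python sketch. Both patterns below are hypothetical and deliberately short: the inclusion pattern reuses only a few keywords from the string above, and the exclusion pattern is invented for this example and is not the filter documented in \hyperref[methods]{Chapter 2}. The sample texts are likewise made up.

```python
import re

# Hypothetical, shortened patterns for illustration only.
INCLUSION = re.compile(r"\b(source|software|program|code)\b", re.IGNORECASE)
EXCLUSION = re.compile(r"\b(documentation\s+license|database)\b", re.IGNORECASE)

# Made-up one-sentence stand-ins for full license texts.
samples = {
    "sw": "Permission is granted to use this software.",
    "font": "Permission is granted to use and modify this font.",
    "docs": "This documentation license covers manuals only.",
}

# Inclusion filter: keep only texts that mention a software keyword.
kept_by_inclusion = [name for name, text in samples.items()
                     if INCLUSION.search(text)]

# Exclusion-only filter: keep everything that is not explicitly excluded.
kept_by_exclusion = [name for name, text in samples.items()
                     if not EXCLUSION.search(text)]
```

In this toy data the font text never mentions software, so the inclusion filter drops it while the exclusion-only filter keeps it. The price of the exclusion-only approach is the threat noted above: anything that is not explicitly excluded is assumed to be a public software license.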

% wikipedia infobox bias
As mentioned earlier in the thesis, the Wikipedia infobox order of license listing sites plays a heavy role in the literature selection. This manifests as a validity threat, for example, in the removal of duplicates: duplicates are removed from the listing site that comes later in the order, which can give a false impression that the majority of the public licenses come from the sites earliest in the order, such as SPDX. While this impression might be accurate, given the high volume of literature from the sites that come first in the Wikipedia infobox order, it is still a threat to validity. Because of this choice in our scope, the recorded origins of the licenses in the search stages are not as accurate as they could be.

% systemacity != automacity
A more general note on the systematicity of this thesis is due. Systematic does not mean automatic. The author's eyesight, for example, was a major factor in distinguishing duplicates in the literature selection in search stage three. The author sorted the licenses using the Ratcliff and Obershelp algorithm, opened all search stage two licenses in tabs in a text editor, switched with key bindings between the full license texts $n - 1$, $n$ and $n + 1$, and removed licenses concluded to be duplicates based on the factors described in \hyperref[methods]{Chapter 2}. As can be seen, the process is systematic and relies heavily on automated tools, but much of the work also depends on the author's eyesight, memory and overall judgement, which makes the process far from automatic. It is also worth noting that the Python script used in \hyperref[methods]{Chapter 2} does not work on Windows systems. This was discovered when trying to reduce the running time of the Ratcliff and Obershelp comparison by using a more powerful desktop computer. This is the last and most minor validity threat mentioned in this thesis regarding the literature selection for review.
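
As a rough illustration of the kind of comparison involved, the sketch below uses \texttt{difflib.SequenceMatcher} from the Python standard library, whose documentation describes its algorithm as closely related to the one published by Ratcliff and Obershelp. Whether the thesis script uses this class is an assumption, as are the function names, the threshold of 0.9 and the sample texts. The sketch flags neighbouring texts $n$ and $n + 1$ that look nearly identical and leaves the final decision to a human reader.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] as computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def flag_adjacent(texts, threshold=0.9):
    """Return index pairs (n, n + 1) whose texts look like near-duplicates."""
    return [
        (n, n + 1)
        for n in range(len(texts) - 1)
        if similarity(texts[n], texts[n + 1]) >= threshold
    ]


# Made-up snippets standing in for already sorted full license texts.
texts = [
    "Permission is hereby granted, free of charge.",
    "Permission is hereby granted, free of charge!",
    "Redistribution and use in source and binary forms.",
]
candidates = flag_adjacent(texts)
```

The \texttt{difflib} documentation notes that the comparison takes quadratic time in the worst case, so applying it to many long license texts can be slow. The flagged pairs are only candidates, and the manual reading described above remains the deciding step.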

As such we note that instructions and good conventions were followed to the best of our abilities.

\subsection{Limitations in data extraction}
% importance of data extraction
The process of data extraction holds great significance in a systematic literature review, as it has a direct impact on the transparency and rationale of the paper. Our data extraction approach was shallow because the data extraction form was relatively small. As mentioned above, not much data could be easily or verifiably extracted from our main grey literature, the five license listing sites. Despite diligent efforts to eliminate researcher bias, which is a common concern in interpretive methods, it was not feasible to have another individual replicate this work for cross-referencing purposes. Efforts were made to keep the steps as transparent as possible and to make good use of the short but well-defined data extraction form.

% lack of measurements and tooling
We still note that, because of the lack of common standardized measurements and tooling, a considerable amount of personal judgement had to be exercised to bring the research results of the primary literature into a comparable state.