Course Project: This project was developed as part of the MSc course 'Data Mining' within the Software Design program at the IT University of Copenhagen (ITU), 2025-2026.
⚠ Disclaimer: We fully understand the seriousness and sensitivity of this topic. All of our findings are preliminary and based on limited, course-related research. We do not attempt to give final answers; our aim is to investigate the topic in a more data-driven way than is common. Topics like immigration usually require extensive research that combines many socio-economic and quantitative data sources with qualitative analysis.
Full Documentation: For a detailed breakdown of our exploratory data analysis, statistical methods, and clustering visualizations, please read the full project report (PDF).
This project investigates common narratives regarding immigration and crime rates in Denmark, specifically focusing on the geographic area of Zealand. We aim to analyze these complex socio-economic topics using a data-driven approach. The analysis is divided into two main hypotheses:
- Hypothesis One: Investigating if there is a relationship between the non-Danish population ratio (immigrants and descendants) and crime rates in Danish municipalities.
- Hypothesis Two: Exploring to what extent criminal activity influences the local housing market.
- crimes.and.populations.ipynb: Contains the data preparation, exploratory data analysis, and correlation/regression testing for Hypothesis One.
- preprocessing_housing_data.ipynb: Handles the cleaning, filtering, and merging of the property sales data with the demographics data for Hypothesis Two.
- analysis.ipynb: Contains the K-Means clustering and linear regression experiments regarding the housing market.
Data Availability: The complete datasets used for this analysis (approximately 45MB) are included directly in this repository to ensure full reproducibility of our results. The project utilizes three primary datasets, filtered for the 2014-2024 timeframe:
- Population Data (population_origin_status_region_quarter_2024-2014.csv): Contains the number of reported people living in each municipality per quarter, categorized by migration status and country of origin, sourced from StatBank Denmark.
- Crimes Data (crimes_data.csv): Includes data regarding the reported crimes in each municipality per quarter, also sourced from StatBank Denmark.
- Housing Prices (housing_prices.parquet): Contains records of residential household sales, including purchase prices and construction years, originally sourced from Kaggle. It was provided and initially cleaned by Martin Frederiksen. The raw data are available in his GitHub repository.
(Note: Processed/merged datasets like non_danish_crime_per_capita are generated within the notebooks during runtime.)
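As a rough illustration of how a merged per-capita table of this kind could be derived, the sketch below uses pandas on a few hand-made rows. The column names (`municipality`, `quarter`, `origin_status`, `persons`, `reported_crimes`) and all values are assumptions made for the example; they are not the real headers or figures from the StatBank extracts, and the notebooks may structure this step differently.

```python
import pandas as pd

# Hypothetical rows standing in for the population and crime extracts.
# Column names and numbers are illustrative assumptions only.
population = pd.DataFrame({
    "municipality": ["A", "A", "B", "B"],
    "quarter": ["2024Q1", "2024Q1", "2024Q1", "2024Q1"],
    "origin_status": ["danish", "non_danish", "danish", "non_danish"],
    "persons": [9000, 1000, 4000, 1000],
})
crimes = pd.DataFrame({
    "municipality": ["A", "B"],
    "quarter": ["2024Q1", "2024Q1"],
    "reported_crimes": [200, 50],
})

keys = ["municipality", "quarter"]

# Total residents and non-Danish residents per municipality and quarter.
totals = population.groupby(keys)["persons"].sum().rename("total_persons")
non_danish = (
    population[population["origin_status"] == "non_danish"]
    .groupby(keys)["persons"]
    .sum()
    .rename("non_danish_persons")
)

# Combine both counts, then attach the reported crimes for the same keys.
merged = (
    pd.concat([totals, non_danish], axis=1)
    .reset_index()
    .merge(crimes, on=keys, how="inner")
)

# Derived metrics: share of non-Danish residents and crimes per resident.
merged["non_danish_ratio"] = merged["non_danish_persons"] / merged["total_persons"]
merged["crime_per_capita"] = merged["reported_crimes"] / merged["total_persons"]

print(merged)
```

Normalizing by population before comparing municipalities keeps large municipalities from dominating simply because they have more residents.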
The analysis was conducted using Python and involved several key data mining techniques:
- Exploratory Data Analysis (EDA): Visualizing distributions and time trends.
- Correlation Analysis: Utilizing Spearman correlation to handle right-skewed data and measure monotonic relationships.
- Linear Regression: Building univariate and multivariate models (using Ordinary Least Squares) to evaluate the predictive power of demographics on crime, and crime on housing prices.
- K-Means Clustering: Grouping municipalities to discover patterns without predefined categories, utilizing the Elbow method and Silhouette Score for optimal cluster selection.
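The correlation and univariate regression steps listed above can be sketched roughly as follows. This is a minimal example on synthetic, right-skewed data, using `scipy.stats.spearmanr` and `scipy.stats.linregress` (an ordinary least squares fit for one predictor). The variable names and the generated numbers are assumptions for illustration and do not reproduce the project's data or results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-ins for the two metrics (illustrative only).
ratio = rng.lognormal(mean=-2.5, sigma=0.4, size=200)   # non-Danish ratio
crime = 0.05 * ratio + rng.lognormal(mean=-4.0, sigma=0.5, size=200)  # crime rate

# Spearman works on ranks, so it measures monotonic association and is
# less sensitive to skew and extreme values than Pearson correlation.
rho, p_value = stats.spearmanr(ratio, crime)

# Univariate OLS: crime = intercept + slope * ratio.
fit = stats.linregress(ratio, crime)

# For a one-predictor OLS fit, R^2 is the squared Pearson correlation and
# equals the share of variance in crime explained by the predictor.
r_squared = fit.rvalue ** 2

print(f"Spearman rho: {rho:.3f} (p = {p_value:.3g})")
print(f"OLS slope: {fit.slope:.4f}, R^2: {r_squared:.3f}")
```

A rank correlation and R² answer different questions: the first asks whether the two metrics tend to rise together, the second how much of the variation a straight-line model accounts for, so a moderate Spearman value can coexist with a low R².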
- Crime & Demographics Trends: While there is a moderate positive correlation (Spearman 0.512) between the non-Danish ratio and crime rates, the two metrics moved in opposite directions over the 11-year period. Crime rates decreased by approximately 24.5%, while the non-Danish ratio increased by approximately 32.2%.
- Weak Predictive Power: Linear regression revealed that the non-Danish ratio explains only 9.5% of the variation in crime rates, meaning the remaining 90.5% is not explained by this demographic metric.
- Cluster Separation: K-Means clustering demonstrated that municipalities with the highest crime rates and those with the highest non-Danish population ratios form distinctly different clusters, providing evidence against a direct causal relationship.
- Housing Market Impact: Location (municipality) is the dominant driver of house prices. The crime rate alone is a poor predictor, though our best multivariate model (combining Crime, Demographics, and Location) explained 32.5% of price variations.
- Outlier Influence: Removing statistical outliers like Copenhagen and Tårnby (which have unique high-crime and high-price dynamics, partly due to transit hubs like the airport) significantly improved the reliability of our models.
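The cluster-based findings depend on how the number of clusters is chosen. A minimal sketch of selecting k with the Elbow method (inertia) and the Silhouette Score, using scikit-learn on synthetic data, might look like the following. The two features, the blob centers, and the candidate range of k are assumptions made for the example and are not taken from the notebooks.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three synthetic groups of municipalities described by
# (crime rate, non-Danish ratio). Values are illustrative only.
centers = np.array([[0.010, 0.05], [0.030, 0.08], [0.015, 0.20]])
X = np.vstack([
    center + rng.normal(scale=[0.002, 0.01], size=(30, 2))
    for center in centers
])

# K-Means uses Euclidean distance, so features are standardized first
# to stop the larger-scaled feature from dominating.
X_scaled = StandardScaler().fit_transform(X)

inertias = {}
silhouettes = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_          # within-cluster sum of squares (Elbow)
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

# Pick the k with the highest mean silhouette; the inertia curve can be
# plotted separately to look for an elbow.
best_k = max(silhouettes, key=silhouettes.get)

for k in sorted(inertias):
    print(f"k={k}: inertia={inertias[k]:.2f}, silhouette={silhouettes[k]:.3f}")
print("k with highest silhouette:", best_k)
```

Inertia always tends to fall as k grows, which is why it is read for an elbow rather than a minimum, while the silhouette score gives a single value per k that can be compared directly.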
- Special thanks to Martin Frederiksen for providing and initially cleaning the housing prices dataset used in Hypothesis Two.
- Across The Globe. Why Denmark is Suddenly Declaring War on Immigration, 2025.
- Katya Adler. The country where the left (not the far right) made hardline immigration laws, 2025.