β
Sequence of data processing steps, where the output of one step becomes the input for the next.
β
Commonly used in data engineering, machine learning, and software development to automate workflows and ensure efficient processing.
β
Apache Beam is a framework for batch and streaming data processing.
β
It provides a unified API that can run on multiple execution engines.
β
Works well with Google Cloud Dataflow, Apache Flink, and Apache Spark.
β
Supports Python, Java, and Go for pipeline development.
- To install Apache Beam in Python, follow these steps:
1οΈβ£ Install Using pip:
pip install apache-beam
2οΈβ£ Install Apache Beam with Google Cloud Support (Optional): - This includes additional dependencies for Google Cloud Storage, Pub/Sub, and BigQuery
pip install apache-beam[gcp]
3οΈβ£ Verify Installation:
import apache_beam as beam
print(beam.__version__)
β
A DataFrame is a tabular structure for handling structured data in Python.
β
It is part of the pandas library and supports fast data manipulation.
β
It is widely used in data science, analytics, and machine learning.
1οΈβ£ Install Using pip:
pip install pandas
2οΈβ£ Verify Installation:
import pandas as pd
print(pd.__version__)
1οΈβ£ Install Using pip:
pip install scikit-learn
2οΈβ£ Verify Installation:
import sklearn
print(sklearn.__version__)
3οΈβ£ Call a dataset to work with it:
from sklearn.datasets import load_linnerud
4οΈβ£ Transform it into a DataFrame:
from sklearn.datasets import load_linnerud
dt = load_linnerud()
df = pd.DataFrame(dt.data, columns=dt.feature_names)
print(df)
- This code will check if the 'Chins' field meets the condition:
- If 'Chins' is greater than 10 (Chins > 10)
- It will only display 'True' or 'False' if 'Chins' meets the previous condition."
- Original Dataset
Chins Situps Jumps
5 162 60
2 110 60
12 101 101
15 250 80
- Output .csv (Chins > 10)
Chins
False
False
True
True
- This code will check if the 'Chins' field meets the condition:
- If 'Chins' is greater than 10.
- Only the 'Chins' field will be displayed, along with whether it meets the previous condition.
- Original Dataset
Chins Situps Jumps
5 162 60
2 110 60
12 101 101
15 250 80
- Output .csv (Chins > 10)
Chins
12
15
- This code will check if the 'Chins' field meets the condition:
- If 'Chins' is greater than 10.
- It will display all fields if the 'Chins' field meets the previous condition.
- Original Dataset
Chins Situps Jumps
5 162 60
2 110 60
12 101 101
15 250 80
- Output .csv (Chins > 10)
Chins,Situps,Jumps
12,101,101
15,250,80
- This code will check if the 'Chins' field meets the condition:
- If 'Chins' is greater than 10.
- It will display all fields and add a new field (Chins>10) indicating whether the 'Chins' field meets the previous condition:
- 'True' if it does.
- 'False' if it does not.
- Original Dataset
Chins Situps Jumps
5 162 60
2 110 60
12 101 101
15 250 80
- Output .csv (Chins>10)
Chins,Situps,Jumps,Chins(>10)
5,162,60,False
2,110,60,False
12,101,101,True
15,250,80,True
-
This code will help us group the 'Chins' field into 4 ranges:
Chins 0-5 (0) 5-10 (5) 10-15 (10) 15-20 (15) 20-25 (20) 25-30 (25)
- Original Dataset
Chins Situps Jumps
5 162 60
2 110 60
12 101 101
15 250 80
8 150 70
6 200 84
- Output .csv
Chins_Range,Count
0,2
5,2
10,1
15,1
-
This code will categorize the 'Chins' data into four ranges:
Chins 0-5 (0) 5-10 (5) 10-15 (10) 15-20 (15) 20-25 (20) 25-30 (25) -
It will also display the results as a bar chart.
- Original Dataset
Chins Situps Jumps
5 162 60
2 110 60
12 101 101
15 250 80
8 150 70
6 200 84
- Output .csv
0,2
5,2
10,1
15,1
- Chart
-
This code will perform two tasks, adding two new fields to the data:
-
Categorize the 'Chins' field into four ranges.
Chins 0-5 (0) 5-10 (5) 10-15 (10) 15-20 (15) 20-25 (20) 25-30 (25) -
Recategorize the 'Chins' field to determine if it is above or below the average.
0 Above Average 1 Below Average -
- Original Dataset
Chins Situps Jumps
5 162 60
2 110 60
12 101 101
12 105 37
- Output .csv
Chins,Situps,Jumps,Chins_Range,Chins_vs_Avg
5.0,162.0,60.0,5,1
2.0,110.0,60.0,0,1
12.0,101.0,101.0,10,0
12.0,105.0,37.0,10,0
-
This code will perform two tasks, adding two new fields to the data:
- Categorize the 'Chins', 'Situps' and 'Jumps' fields into four ranges.
Chins Situps Jumps 0-5 (0) 0-50 (0) 0-25 (0) 5-10 (5) 50-100 (50) 25-50 (25) 10-15 (10) 100-150 (100) 50-75 (50) 15-20 (15) 150-200 (150) 75-100 (75) 20-25 (20) 200-250 (200) 100-125 (100) 25-30 (25) 250-300 (250) 125-300 (125) - Recategorize the 'Chins', 'Situps' and 'Jumps' field to determine if it is above or below the average.
0 Above Average 1 Below Average
- Original Dataset
Chins Situps Jumps
5 162 60
2 110 60
12 101 101
12 105 37
- Output .csv
Chins,Situps,Jumps,Chins Range,Situps Range,Jumps Range,User,Chins_vs_Avg,Situps_vs_Avg,Jumps_vs_Avg
5.0,162.0,60.0,5,150,50,User 1,1,0,1
2.0,110.0,60.0,0,100,50,User 2,1,1,1
12.0,101.0,101.0,10,100,100,User 3,0,1,0
12.0,105.0,37.0,10,100,25,User 4,0,1,1
13.0,155.0,58.0,10,150,50,User 5,0,0,1
-
This code will allow us to analyze the available information to determine what type of analysis we can perform afterward.
-
Three analyses (so far) will be conducted:
- Dataset report.
- Histograms.
- Density analysis.
-
Finally, the reports will be downloaded as a PDF file.
- I will not display the outputs, as they are saved as images and can be accessed from the 'downloads' folder.
- In this code, we will perform two analyses using three variables: 'Chins', 'Jumps', and 'Situps':
- Hypothesis testing
- Clustering
- More may be added in the future. For now, this analysis is conducted to test how it works.
Mean Situps: 135.0
Mean Jumps: 53.75
T-statistic: 42.5296
P-value: 0.0000
We reject the null hypothesis: the means are significantly different.
Mean Jumps: 53.75
Mean Chins: 7.5
T-statistic: 16.8558
P-value: 0.0000
We reject the null hypothesis: the means are significantly different.
Mean Chins: 7.5
Mean Situps: 135.0
T-statistic: -57.3606
P-value: 0.0000
We reject the null hypothesis: the means are significantly different.
Iteration 0 - Silhouette Score: 0.55
Iteration 1 - Silhouette Score: 0.49
Iteration 2 - Silhouette Score: 0.53
- This code will determinate the values of MSE and MAE.
MSE for Situps Range and Jumps Range: 189.8777
MAE for Situps Range and Jumps Range: 10.6886
MSE for Jumps Range and Chins Range: 17.6324
MAE for Jumps Range and Chins Range: 3.0729
MSE for Chins Range and Situps Range: 2265.5290
MAE for Chins Range and Situps Range: 46.8652
- This code will determine the values of MSE, and with that, we can create the linear regression model.
Regression between 'Situps Range' (X) and 'Jumps Range' (y):
Coef: [0.35509138]
Intercept: [4.50391645]
R2: 0.5581028260779296
MSE: 189.87769191963963
Regression between 'Jumps Range' (X) and 'Chins Range' (y):
Coef: [0.06666667]
Intercept: [4.47916667]
R2: -0.02588383838383823
MSE: 17.63237847222222
Regression between 'Chins Range' (X) and 'Situps Range' (y):
Coef: [7.99373041]
Intercept: [65.67398119]
R2: -0.3181259832530947
MSE: 2265.5290337162564
- The code builds decision trees by using each column in selected_columns as a feature and the next column as a label, training the model on 70% of the data and testing on 30%.
- It evaluates accuracy and visualizes the tree for each pair.
- The process repeats in a circular manner for all selected columns.
Decision tree between 'Situps Range' (X) and 'Jumps Range' (y)
X_train shape: (14, 1)
y_train shape: (14,)
Unique classes in y_train: [ 25 50 100 125]
Accuracy: 0.50
Decision tree between 'Jumps Range' (X) and 'Chins Range' (y)
X_train shape: (14, 1)
y_train shape: (14,)
Unique classes in y_train: [ 0 5 10 15]
Accuracy: 0.33
Decision tree between 'Chins Range' (X) and 'Situps Range' (y)
X_train shape: (14, 1)
y_train shape: (14,)
Unique classes in y_train: [ 50 100 150 200 250]
Accuracy: 0.00
-
This code will determine the values of MSE, and with that, we can create the linear regression model.
-
In this code we will use 'pipeline'
Linear Regression between 'Situps Range' (X) and 'Jumps Range' (y)
MSE: 189.8777
Linear Regression between 'Jumps Range' (X) and 'Chins Range' (y)
MSE: 17.6324
Linear Regression between 'Chins Range' (X) and 'Situps Range' (y)
MSE: 2265.5290
-
This code will determine the values of MSE, and with that, we can create the Decision Tree Regressor model.
-
In this code we will use 'pipeline'
Decision Tree between 'Situps Range' (X) and 'Jumps Range' (y)
MSE: 304.9267
Decision Tree between 'Jumps Range' (X) and 'Chins Range' (y)
MSE: 20.8333
Decision Tree between 'Chins Range' (X) and 'Situps Range' (y)
MSE: 2517.3611