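5 Genuinely Useful Bash Scripts for Data Science

Original: www.kdnuggets.com/2023/02/bash-scripts-data-science.html

Image by Author

Python, R, and SQL are often cited as the most-used languages for processing, modeling, and exploring data. While that may be true, there is no reason that other languages cannot be, or are not being, used to do this work.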

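Bash is a shell for Unix and Unix-like operating systems, together with the command and programming language that comes with it. Bash scripts are programs written in the Bash shell scripting language. These scripts are executed sequentially by the Bash interpreter and can include the constructs typically found in other programming languages, including conditional statements, loops, and variables.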
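
As a small illustration of those constructs, the sketch below combines a variable, a loop, and a conditional. The function name count_rows and the sample file are hypothetical, and the sketch assumes each CSV file has exactly one header line.

```bash
#!/bin/bash

# Hypothetical helper: print the number of data rows (lines minus one
# header line) for each CSV file passed as an argument.
count_rows() {
  local file rows
  for file in "$@"; do                     # loop over all arguments
    if [ -f "$file" ]; then                # conditional: is it a regular file?
      rows=$(( $(wc -l < "$file") - 1 ))   # variable holding an arithmetic result
      echo "$file: $rows rows"
    else
      echo "$file: not found" >&2
    fi
  done
}

# Usage with a throwaway sample file (header plus two data rows)
sample=$(mktemp)
printf 'id,value\n1,10\n2,20\n' > "$sample"
count_rows "$sample"
rm -f "$sample"
```

For the sample file this reports 2 rows; a missing file produces a message on standard error instead.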

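Common uses of Bash scripts include:

Automating system administration tasks
Performing backups and maintenance
Parsing log files and other data
Creating command-line tools and utilities

Bash scripts are also used to orchestrate the deployment and management of complex distributed systems, which makes Bash scripting a very useful skill in data engineering, cloud computing environments, and DevOps.

In this article, we will look at five data science-related scripting tasks to see how flexible and useful Bash can be.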

The Bash Scripts

Cleaning and Formatting Raw Data

Here is an example Bash script for cleaning and formatting raw data files:

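#!/bin/bash

# Set the input and output file paths
input_file="raw_data.csv"
output_file="clean_data.csv"

# Remove any leading or trailing whitespace from each line
sed 's/^[ \t]*//;s/[ \t]*$//' "$input_file" > "$output_file"

# Replace any commas within quoted fields with a placeholder
# (loops until no comma remains inside an open quote)
sed -i -E -e ':a' -e 's/^(([^"]*"[^"]*")*[^"]*"[^",]*),/\1<COMMA>/' -e 'ta' "$output_file"

# Remove the quotes around each field
sed -i 's/"//g' "$output_file"

# Replace the placeholder with a semicolon; restoring a literal comma
# would split the now-unquoted field into two columns
sed -i 's/<COMMA>/;/g' "$output_file"

echo "Data cleaning and formatting complete. Output file: $output_file"

This script:

Assumes that your raw data file is a CSV file named raw_data.csv, with one record per line
Saves the cleaned data as clean_data.csv
Uses the sed command to:
- Remove leading and trailing whitespace from each line
- Replace commas inside quoted fields with a placeholder, so they are not confused with field separators
- Remove the quotes around each field
- Replace the placeholder with a semicolon, since restoring a literal comma in an unquoted field would change the column count
Prints a message indicating that the data cleaning and formatting is complete, along with the location of the output file

Because sed processes its input line by line, newlines embedded in quoted fields are not handled here; files that contain them are better served by a CSV-aware tool. The -i option as written assumes GNU sed.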
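
To see the placeholder idea on a concrete line, here is a minimal sketch written as a filter on standard input. The function name clean_csv_line, the <COMMA> placeholder, and the choice of a semicolon as the final substitute are assumptions made for illustration. It handles one record per line and does not cover newlines inside quoted fields.

```bash
#!/bin/bash

# Hypothetical filter: trim each line, protect commas inside quoted
# fields, drop the quotes, then turn the placeholder into a semicolon.
clean_csv_line() {
  sed -E \
    -e 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    -e ':a' \
    -e 's/^(([^"]*"[^"]*")*[^"]*"[^",]*),/\1<COMMA>/' \
    -e 'ta' \
    -e 's/"//g' \
    -e 's/<COMMA>/;/g'
}

printf '  "Smith, John",42,"NYC"  \n' | clean_csv_line
# prints: Smith; John,42,NYC
```

The substitution between the :a label and the ta branch only matches a comma that follows an odd number of double quotes, which is what "inside a quoted field" means on a single line. The loop repeats until no such comma is left.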

Automating Data Visualization

Here is an example Bash script for automating data visualization tasks:

#!/bin/bash

# Set the input file path
input_file="data.csv"

# Create a line chart of column 1 vs column 2
gnuplot -e "set datafile separator ','; set term png; set output 'line_chart.png'; plot '$input_file' using 1:2 with lines"

# Create a bar chart of column 3
gnuplot -e "set datafile separator ','; set term png; set output 'bar_chart.png'; plot '$input_file' using 3:xtic(1) with boxes"

# Create a scatter plot of column 4 vs column 5
gnuplot -e "set datafile separator ','; set term png; set output 'scatter_plot.png'; plot '$input_file' using 4:5 with points"

echo "Data visualization complete. Output files: line_chart.png, bar_chart.png, scatter_plot.png"

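The above script:

Assumes that your data is in a CSV file named data.csv
Uses the gnuplot command to create three different types of plots:
- A line chart of column 1 vs column 2
- A bar chart of column 3, labeled with the values of column 1
- A scatter plot of column 4 vs column 5
Outputs the plots in PNG format and saves them as line_chart.png, bar_chart.png, and scatter_plot.png
Prints a message indicating that the data visualization is complete, along with the names of the output files

Note that for this script to produce meaningful charts, you need to adjust the column numbers and chart types to match your data and requirements.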
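
One way to make the column numbers and chart type easy to adjust is to build the gnuplot command from parameters. The following is a sketch only: the function name build_plot_cmd and its argument order are hypothetical, and it covers the simple "x:y with style" form used by the line and scatter charts.

```bash
#!/bin/bash

# Hypothetical helper: build a gnuplot command string for one chart.
# Arguments: input CSV, x column, y column, plot style, output PNG.
build_plot_cmd() {
  local input_file="$1" x_col="$2" y_col="$3" style="$4" output_png="$5"
  printf "set datafile separator ','; set term png; set output '%s'; plot '%s' using %s:%s with %s" \
    "$output_png" "$input_file" "$x_col" "$y_col" "$style"
}

# Only call gnuplot if it is installed and the data file exists
if command -v gnuplot >/dev/null 2>&1 && [ -f data.csv ]; then
  gnuplot -e "$(build_plot_cmd data.csv 1 2 lines line_chart.png)"
  gnuplot -e "$(build_plot_cmd data.csv 4 5 points scatter_plot.png)"
fi
```

Changing a chart then means changing one argument list rather than editing a long quoted string.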

Statistical Analysis

Here is an example Bash script that performs statistical analysis on a dataset:

#!/bin/bash

# Set the input file path
input_file="data.csv"

# Set the output file path
output_file="statistics.txt"

# Use awk to calculate the mean of column 1
mean=$(awk -F',' '{sum+=$1} END {print sum/NR}' "$input_file")

# Use awk to calculate the (population) standard deviation of column 1
# (^ is the POSIX awk exponent operator; ** is not portable)
stddev=$(awk -F',' '{sum+=$1; sumsq+=$1*$1} END {print sqrt(sumsq/NR - (sum/NR)^2)}' "$input_file")

# Append the results to the output file
echo "Mean of column 1: $mean" >> "$output_file"
echo "Standard deviation of column 1: $stddev" >> "$output_file"

# Use awk to calculate the mean of column 2
mean=$(awk -F',' '{sum+=$2} END {print sum/NR}' "$input_file")

# Use awk to calculate the (population) standard deviation of column 2
stddev=$(awk -F',' '{sum+=$2; sumsq+=$2*$2} END {print sqrt(sumsq/NR - (sum/NR)^2)}' "$input_file")

# Append the results to the output file
echo "Mean of column 2: $mean" >> "$output_file"
echo "Standard deviation of column 2: $stddev" >> "$output_file"

echo "Statistical analysis complete. Output file: $output_file"

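This script:

Assumes that your data is in a CSV file named data.csv with no header row, since every line is counted as a data row
Uses the awk command to calculate the mean and standard deviation of 2 columns
Treats the comma as the field separator
Appends the results to the text file statistics.txt, so repeated runs add to the file rather than overwrite it
Prints a message indicating that the statistical analysis is complete, along with the location of the output file

Note that you can add more awk commands to calculate other statistics or to handle more columns.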
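
As one way of extending the analysis, the sketch below computes several statistics for any column in a single awk pass. The function name column_stats and its output format are hypothetical; it assumes a numeric column, no header row, and no empty fields.

```bash
#!/bin/bash

# Hypothetical helper: summary statistics for one numeric CSV column,
# computed in a single pass. Assumes no header row and no empty fields.
column_stats() {
  local file="$1" col="$2"
  awk -F',' -v c="$col" '
    NR == 1 { min = $c + 0; max = $c + 0 }
    {
      x = $c + 0                      # force numeric comparison
      sum += x; sumsq += x * x
      if (x < min) min = x
      if (x > max) max = x
    }
    END {
      if (NR == 0) exit 1             # empty input: nothing to report
      mean = sum / NR
      # Population standard deviation
      printf "mean=%.2f stddev=%.2f min=%g max=%g\n", mean, sqrt(sumsq / NR - mean ^ 2), min, max
    }' "$file"
}

sample=$(mktemp)
printf '2,a\n4,b\n4,c\n4,d\n5,e\n5,f\n7,g\n9,h\n' > "$sample"
column_stats "$sample" 1
# prints: mean=5.00 stddev=2.00 min=2 max=9
rm -f "$sample"
```

Passing the column number with -v keeps the awk program itself unchanged when you move to another column.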

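Managing Python Package Dependencies

Here is an example Bash script for managing and updating the dependencies and packages required by a data science project: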

#!/bin/bash

# Set the path of the virtual environment
venv_path="venv"

# Activate the virtual environment
source "$venv_path/bin/activate"

# Update pip
pip install --upgrade pip

# Install required packages from requirements.txt
pip install -r requirements.txt

# Deactivate the virtual environment
deactivate

echo "Dependency and package management complete."

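This script:

Assumes that you have already set up a virtual environment and have a file named requirements.txt containing the names and versions of the packages you want to install
Uses the source command to activate the virtual environment at the path given by venv_path
Uses pip to upgrade pip itself to the latest version
Installs the packages listed in requirements.txt
Uses the deactivate command to leave the virtual environment once the packages are installed
Prints a message indicating that dependency and package management is complete

Run this script each time you update your dependencies or install new packages for a data science project.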
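
Since the script assumes that the virtual environment and requirements.txt already exist, a guard that checks both before installing can prevent confusing failures. This is a minimal sketch; the function name venv_exists is hypothetical, and it assumes the standard venv layout with a bin/activate script.

```bash
#!/bin/bash

# Hypothetical helper: return 0 if the directory looks like a virtual
# environment (it contains bin/activate), non-zero otherwise.
venv_exists() {
  [ -f "$1/bin/activate" ]
}

venv_path="venv"

if venv_exists "$venv_path" && [ -f requirements.txt ]; then
  source "$venv_path/bin/activate"
  pip install -r requirements.txt
  deactivate
else
  echo "Skipping install: $venv_path or requirements.txt not found." >&2
fi
```

The check runs before anything is installed, so a missing environment does not cause packages to land in the system Python by accident.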

Managing Jupyter Notebook Execution

Here is an example Bash script for automating the launch of a Jupyter Notebook or other interactive data science environment:

#!/bin/bash

# Set the path of the notebook file
notebook_file="analysis.ipynb"

# Set the path of the virtual environment
venv_path="venv"

# Activate the virtual environment
source "$venv_path/bin/activate"

# Start Jupyter Notebook
jupyter-notebook "$notebook_file"

# Deactivate the virtual environment
deactivate

echo "Jupyter Notebook execution complete."

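The above script:

Assumes that you have already set up a virtual environment and installed Jupyter Notebook in it
Uses the source command to activate the virtual environment at the path given by venv_path
Uses the jupyter-notebook command to start Jupyter Notebook and open the specified notebook_file
Uses the deactivate command to leave the virtual environment once the Jupyter Notebook server has been shut down
Prints a message indicating that Jupyter Notebook execution is complete

Run this script each time you want to work with a Jupyter Notebook or other interactive data science environment.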
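
The script above opens the notebook interactively. If the goal is to run every cell unattended, one option is jupyter nbconvert with the --execute flag. The sketch below is not part of the original script: the function names execute_notebook and executed_path are hypothetical, and it assumes nbconvert is installed and uses its default output name of NAME.nbconvert.ipynb.

```bash
#!/bin/bash

# Hypothetical helper: name of the file nbconvert writes by default when
# executing a notebook with --to notebook (NAME.nbconvert.ipynb).
executed_path() {
  echo "${1%.ipynb}.nbconvert.ipynb"
}

# Run every cell without opening a browser; returns nbconvert's exit status.
execute_notebook() {
  jupyter nbconvert --to notebook --execute "$1"
}

notebook_file="analysis.ipynb"

# Only run if jupyter is on the PATH and the notebook exists
if command -v jupyter >/dev/null 2>&1 && [ -f "$notebook_file" ]; then
  execute_notebook "$notebook_file" &&
    echo "Executed copy written to $(executed_path "$notebook_file")"
fi
```

Because nbconvert exits with a non-zero status when a cell raises an error, this form is also easier to use from a scheduler than an interactive session.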

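I hope these simple scripts have shown both the simplicity and the power of scripting with Bash. It may not be your go-to solution for every situation, but it certainly has its place. Best of luck with your scripting.

Matthew Mayo (@mattmayo13) is a data scientist and the Editor-in-Chief of KDnuggets, the seminal online data science and machine learning resource. His interests include natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.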
