Git is primarily a tool for version control. It lets you track changes to your code efficiently, so you can recover previous versions without having to save multiple copies of your analyses. It is a very powerful, but also complicated, tool that will greatly improve your reliability and confidence when working with your analysis code.
This document briefly introduces the core concepts of Git and how to get started tracking your own analysis code.
Git enables the tracking of changes to files. It provides a system that lets users recall different versions of files. This process becomes more essential as collaborative projects grow.
The fundamental functions that make up the basic Git workflow include:
- Initialization (`init`): Starting a Git repository.
- Staging (`add`): Indicating that a modified file should be included in the next snapshot. The staging area, or Index, holds what will be in your next snapshot.
- Committing (`commit`): Taking a snapshot of the changes. Commits capture the purpose and context of changes.
- Inspecting: Commands like `git status`, `git diff`, and `git log` are used to check the status of files, see what has changed, and view the history of the repository, respectively.
- Undoing: Git provides ways to undo changes, such as unmodifying a file (`git restore <file>`) or undoing the last commit (`git reset HEAD~`). It also allows you to undo a commit from further back in history using `git revert`.
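The inspecting and undoing commands in the list above can be tried safely in a throwaway repository. The following is a minimal sketch, not a recommended workflow: the file names, commit messages, and placeholder identity are all hypothetical, and `git restore` assumes Git 2.23 or newer.

```shell
# Scratch repository in a temporary directory (all names are hypothetical).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

# Build a short history: three small commits.
echo "x = 1" > analysis.py
git add analysis.py
git commit -q -m "Add analysis script"

echo "scratch notes" > notes.txt
git add notes.txt
git commit -q -m "Add scratch notes"

echo "y = 2" >> analysis.py
git commit -q -am "Add y variable to analysis script"

# Inspecting: what is modified, and what does the history look like?
git status --short
git log --oneline

# Unmodify a file: throw away an uncommitted edit.
echo "oops" >> analysis.py
git diff                  # shows the unwanted "+oops" line
git restore analysis.py   # older Git: git checkout -- analysis.py

# Undo the last commit; its changes stay in the working directory.
git reset -q HEAD~
git status --short        # analysis.py is listed as modified again
git commit -q -am "Add y variable to analysis script"

# Undo a commit from further back by adding a new commit that reverses it.
git revert --no-edit HEAD~1   # reverses "Add scratch notes"
git log --oneline
```

Note that `git revert` does not rewrite history. It records a new commit, which is why it is the safer choice for commits that others may already have.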
Git operates by taking a stream of snapshots rather than relying
on delta-based version control. It identifies the contents of files and
directories by a deterministic hash of those contents (e.g.,
`24b9da6552252987aa493b52f8696cd6d3b00373`), so identical content always gets the same identifier and any change or corruption produces a different one.
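To see the deterministic hashing in action, you can ask Git to hash some content directly. This is a small sketch using the low-level `git hash-object` command; the sample strings are arbitrary.

```shell
# Hash the same content twice: the identifier depends only on the bytes.
first=$(printf 'hello\n' | git hash-object --stdin)
second=$(printf 'hello\n' | git hash-object --stdin)
echo "$first"
echo "$second"    # identical to the line above

# Changing a single character gives a completely different identifier.
changed=$(printf 'hellp\n' | git hash-object --stdin)
echo "$changed"
```

Because the identifier is derived from the content, Git can detect when stored data no longer matches its hash.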
Git, often paired with hosting services like GitHub (which hosts Git repositories online), is crucial for collaboration:
- Distributed Version Control: Git uses a distributed model, meaning it does not rely solely on a central server, mitigating risk if the server crashes.
- Branching and Merging: Git supports nonlinear development through branches, which mark a line of development (e.g., a new feature or a collaborator's contribution). This allows multiple team members to work in parallel. Git commands are used for starting a new branch, changing branches, and merging branches.
- Remotes: Git is used to interact with remote repositories (like those on GitHub, GitLab, or Bitbucket). For example, `git push` sends commits to a remote repository, and `git fetch` retrieves commits from a remote.
- Handling Conflicts: Git detects merge conflicts that arise when multiple people edit the same line of code, although someone needs to manually resolve these conflicts.
- Code Review and Contribution: Git facilitates the contribution workflow, often involving forks (personal copies of a repository created on GitHub) and pull requests. A pull request allows project maintainers to review code, make suggestions, and decide whether to merge changes into the original repository.
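The branching, merging, and remote commands described above can be sketched end to end on one machine. In this example a bare repository on disk stands in for a hosted remote, which is an assumption made so the example runs offline; in practice the remote would be a GitHub, GitLab, or Bitbucket URL. Branch and file names are hypothetical.

```shell
work=$(mktemp -d)
cd "$work"

# A bare repository on disk plays the role of the hosted remote.
git init -q --bare remote.git

git init -q project
cd project
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
# The default branch is "main" or "master", depending on configuration.
main_branch=$(git symbolic-ref --short HEAD)

echo "# Demo" > README.md
git add README.md
git commit -q -m "Add README"

# Start a branch for a new line of development and commit on it.
git checkout -q -b add-cleaning-step
echo "drop missing rows" > cleaning.txt
git add cleaning.txt
git commit -q -m "Describe data cleaning step"

# Change back to the default branch and merge the work in.
git checkout -q "$main_branch"
git merge -q add-cleaning-step

# Connect the remote, send commits to it, then retrieve from it.
git remote add origin ../remote.git
git push -q -u origin "$main_branch"
git fetch -q origin
git branch -a     # lists local branches and the remote-tracking branch
```

On a hosted service the merge would usually happen through a pull request rather than a local `git merge`, so that maintainers can review the changes first.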
While Git is the version control system, tools built around it (like GitHub) use Git repositories to manage work. Best practices associated with Git repositories help ensure effective collaboration and streamlined workflows:
- Documentation: Git repositories house documentation like the `README.md` file, which gives an overview of the project, setup instructions, and contribution guidelines.
- Exclusion of Files: The `.gitignore` file, used within a Git repository, specifies files and directories (like build artifacts, sensitive information, or large media files) that should be excluded from version control.
- Maintaining History: A clean commit history comes from writing descriptive commit messages and making atomic commits.
- Project Organization: GitHub allows users to classify repositories using topics (labels) to improve discoverability and organization based on purpose or technology stack.
Overall, Git is essential for managing code changes, enabling teams to work collaboratively, and ensuring a recoverable history for large and small projects.
Initializing a repository is the first step of the basic Git workflow. From the command line, there are two options for starting a Git repository:
- Cloning an existing repository: If the code repository already exists remotely (e.g., on GitHub, GitLab, or a central server), you can create a local copy by cloning it.
  - Command: `git clone <repo URL>`.
- Making an existing folder into a Git repository: If you have a local directory containing your project files (such as those for a data science project) and want to start tracking changes, you can initialize Git in that folder.
  - First, navigate to the desired directory: `cd <directory>`.
  - Then, execute the initialization command: `git init`.
Once a Git repository is initialized, the project folder contains a local repository
(the `.git/` folder), which stores metadata and snapshots.
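Both options can be tried without a network connection. The sketch below initializes a folder first and then clones it from its local path, which is an assumption made to keep the example self-contained; with a hosted project you would pass the repository URL to `git clone` instead. Folder and file names are hypothetical.

```shell
work=$(mktemp -d)
cd "$work"

# Making an existing folder into a Git repository.
mkdir my-analysis
cd my-analysis
echo "print('hello')" > analysis.py
git init -q
ls -a                                       # .git/ now sits next to analysis.py
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.com"    # placeholder identity
git add analysis.py
git commit -q -m "Add initial analysis script"

# Cloning an existing repository (a local path stands in for the URL).
cd "$work"
git clone -q my-analysis my-analysis-copy
git -C my-analysis-copy log --oneline       # the copy carries the full history
```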
After initialization, several best practices help keep the repository functional and well managed.
The basic workflow involves three steps after initialization:
- Modify: Change a file in your working directory (the version of the project you are actively working on).
- Stage: Indicate which modified files should be included in the next snapshot. You use `git add <filename>` to move changes to the staging area (Index). The staging area holds what will be included in your next snapshot.
- Commit: Take the snapshot using `git commit -m "<short, informative commit message>"`.
Commit messages should be clear and descriptive,
explaining the purpose and context of the changes. For example, a good
commit message might be: `git commit -m "Add user authentication mechanism to the inventory management system"`. It is bad practice to
use vague messages like `git commit -m "Fixed stuff"`.
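The modify, stage, and commit steps look like this in a scratch repository. This is a minimal sketch; the file name, commit message, and identity are hypothetical placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

# Modify: create or change a file in the working directory.
echo "import pandas as pd" > load_data.py
git status --short      # "?? load_data.py" (untracked)

# Stage: move the change into the staging area.
git add load_data.py
git status --short      # "A  load_data.py" (staged)

# Commit: take the snapshot, with a message that explains the purpose.
git commit -q -m "Add data loading script for survey responses"
git log --oneline
```

Running `git status` between steps is a cheap way to confirm which state each file is in before you commit.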
Add a `.gitignore` file as soon as you create the repository. This file
allows you to specify patterns for files and directories that should
be excluded from version control. This is particularly useful for data
science projects, as it helps avoid tracking:
- Build artifacts (e.g., log files, temporary files).
- Sensitive information (e.g., API keys, passwords, configuration files).
- Large files (e.g., media files, binary files) that are unnecessary
  for version control. Tools like `git-lfs`, `git-annex`, or `datalad` can help in these scenarios.
- Logs and caches.
- User-specific files (e.g., editor settings).
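A `.gitignore` covering the categories above might look like the following sketch. The specific patterns and file names are illustrative assumptions, so adapt them to your own project layout.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Write an example .gitignore (patterns are illustrative only).
cat > .gitignore <<'EOF'
# Logs, caches, and temporary files
*.log
*.tmp
__pycache__/

# Sensitive information
.env

# Large raw data files
data/raw/

# User-specific editor settings
.vscode/
EOF

# Hypothetical project files, some of which should never be tracked.
mkdir -p data/raw
touch run.log .env data/raw/scan.nii analysis.py

git status --short    # lists only .gitignore and analysis.py
git check-ignore -v run.log .env data/raw/scan.nii   # shows which pattern matched
```

`git check-ignore -v` is useful when a file is unexpectedly ignored (or not), because it reports the exact line of the `.gitignore` responsible.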
A well-documented repository is highly valuable. The README.md
file is the first item a visitor sees and should provide a quick
overview of the repository and how to get started. For projects hosted
on GitHub, the README supports Markdown, allowing for advanced
formatting, links, images, lists, and headers.
The README.md should include information such as:
- Project description.
- Setup and installation instructions.
- Usage examples.
- Contribution guidelines.
- License information.
For a GitHub project, the README and project description can explain
the project's purpose and describe the project views and how to use
them, including relevant links and people to contact.
If the project is hosted on GitHub, applying strong best practices early is beneficial:
- Clear Repository Naming Convention: Use a clear and standardized naming convention for organization and clarity. Examples include prefixing by project/team (e.g., `teambravo_data_pipeline`), using descriptive names (e.g., `machine_learning_model_trainer`), or including the technology stack (e.g., `image_processor_python`). This includes using standard project structures and standards, like `cookiecutter`, `TONIC`, `BIDS`, or `Nipoppy`.
- Topics: Classify the repository using topics (labels) to help categorize and discover the project based on its purpose or technology stack.
- While a README is part of the documentation, there are many tools to facilitate better practices that can be incorporated directly into your repos. Tools like `ReadTheDocs` and `MkDocs` can, with some additional configuration, automate the release of updated documentation when you commit. This advanced usage will be covered later in a different document.
Working collaboratively with Git can introduce several issues, ranging from technical problems when integrating code to difficulty recovering from common human errors, particularly since Git is often described as being hard to use when things go wrong.
Key issues people might encounter when collaborating include:
The most direct issue related to concurrent coding is the merge conflict.
- Conflict Occurrence: A merge conflict happens when two lines of development change the same lines of a file in different ways, for example when several people edit the same line of code.
- Resolution Requirement: When a merge conflict occurs (whether when merging branches or integrating changes between local and remote repositories), someone needs to manually resolve it.
- Recovering from Bad Merges: If a merge goes wrong and breaks the repository, a user can try to recover using the `git reflog` command, which lists the recent positions of `HEAD` and acts like a "magic time machine": resetting to an earlier entry can undo the damage and restore the repo to a previous working state.
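The conflict and recovery steps above can be reproduced in a scratch repository. This is a sketch under simple assumptions: one file, two branches that edit the same line, and a hypothetical decision that the merge was a mistake. Note that `git reset --hard` discards uncommitted work, so only use it when the working directory is clean.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
main_branch=$(git symbolic-ref --short HEAD)

echo "threshold = 0.5" > params.txt
git add params.txt
git commit -q -m "Add default threshold"

# Two branches change the same line in different ways.
git checkout -q -b collaborator
echo "threshold = 0.9" > params.txt
git commit -q -am "Raise threshold to 0.9"

git checkout -q "$main_branch"
echo "threshold = 0.1" > params.txt
git commit -q -am "Lower threshold to 0.1"

# The merge stops and leaves conflict markers in the file.
git merge collaborator || echo "Merge stopped: resolve the conflict by hand"
cat params.txt        # contains <<<<<<<, =======, and >>>>>>> markers

# Resolve manually: pick the final content, stage it, and commit.
echo "threshold = 0.9" > params.txt
git add params.txt
git commit -q -m "Merge collaborator branch, keeping threshold 0.9"

# Suppose the merge was a mistake: find the earlier state and go back.
git reflog                        # recent positions of HEAD, newest first
git reset -q --hard "HEAD@{1}"    # where HEAD was just before the merge commit
cat params.txt                    # back to "threshold = 0.1"
```

If you notice the problem before concluding the merge, `git merge --abort` returns to the pre-merge state without needing the reflog.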
Effective collaboration relies on standardized practices, and deviations can lead to confusion and inefficiency. Reaching out to maintainers and asking for clarifications on how to make contributions is a great idea, especially for tools you heavily rely on. Every group has different standards for how to make contributions, and until you're experienced with the tools, you can't really know what they are. Some things to keep in mind:
- Large Issues: Breaking a large issue into smaller issues is a best practice, as failure to do so makes the work less manageable, prevents team members from working in parallel, and results in larger pull requests that are more difficult to review.
- Unclear Commit History: Collaboration requires a clean commit history. This is undermined by:
  - Vague Commit Messages: Using non-descriptive messages like `git commit -m "Fixed stuff"` is considered bad practice. Instead, clear and descriptive commit messages explaining the purpose and context of the changes are required.
  - Non-Atomic Commits: Commits should be small, focused, and contain a single logical change; failing to use atomic commits increases the risk of conflicts and errors, and makes the history harder to understand.
- Stale Information: To prevent information from getting out of sync across the project, collaborators should maintain a single source of truth. If critical data (like a target ship date) is tracked across multiple fields, it must be updated in all locations when it shifts, increasing the chance of error.
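The advice on atomic commits and descriptive messages can be made concrete with a small sketch. Here two unrelated edits are made in one sitting and then committed separately; the file names and messages are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "def load(): pass" > load.py
echo "def plot(): pass" > plot.py
git add load.py plot.py
git commit -q -m "Add loading and plotting modules"

# Two unrelated edits made in the same work session.
echo "# handles missing values" >> load.py
echo "# uses colorblind-safe palette" >> plot.py

# Stage and commit each logical change on its own.
git add load.py
git commit -q -m "Handle missing values when loading data"

git add plot.py
git commit -q -m "Switch plots to a colorblind-safe palette"

git log --oneline               # one line per logical change
git show --stat --oneline HEAD  # the last commit touches only plot.py
```

Keeping each commit to one logical change means a reviewer can read, and a maintainer can revert, one change without disturbing the other.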
Git is often perceived as difficult because "screwing up is easy, and figuring out how to fix your mistakes is [...] impossible" if the user does not already know the specific command needed.
- Amending Public Commits: If a user realizes they need to make a small change right after committing, they can use `git commit --amend`. However, a critical issue is that one should never amend commits that have been pushed up to a public/shared branch. Amending public commits will lead to "a bad time".
- Committing to the Wrong Branch: Accidentally committing changes to a shared branch (like `master`) when they should have been on a new branch is a common mistake. Correcting this requires resetting the branch history locally, which can be complex.
- Baffling Command Behavior: Git commands sometimes behave non-intuitively for new users. For instance, if a user makes changes and stages them (`add`), the simple `git diff` command will appear empty (showing nothing changed in the working directory) because the changes are in the staging area. To see those changes, the user must remember the special flag `git diff --staged`.
- Destructive Recovery: If a user’s branch becomes too "borked," they might resort to destructive and unrecoverable actions, such as resetting the local state to match the remote repository exactly, using commands like `git reset --hard origin/master`. Be careful doing this, as you risk losing your unfinished work!
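Several of these pitfalls are easier to understand once you have seen them in a scratch repository. The sketch below covers the empty `git diff`, amending an unpushed commit, and moving a commit off the shared branch; the branch and file names are hypothetical, and nothing here has been pushed, which is what makes the amend and reset safe.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
main_branch=$(git symbolic-ref --short HEAD)

echo "alpha = 1" > model.py
git add model.py
git commit -q -m "Add model parameters"

# 1. Staged changes do not appear in a plain `git diff`.
echo "beta = 2" >> model.py
git add model.py
git diff              # prints nothing
git diff --staged     # shows the added "beta = 2" line

# 2. Amend the most recent commit (only if it has NOT been pushed).
git commit -q -m "Add beta paramter"            # message contains a typo
git commit -q --amend -m "Add beta parameter"   # replaces that commit

# 3. The commit belonged on a new branch, not the shared one.
git branch beta-work            # new branch keeps the commit
git reset -q --hard HEAD~       # shared branch steps back one commit
git checkout -q beta-work       # continue working on the new branch
git log --oneline
```

The `--hard` reset in step 3 is safe here only because the commit is still reachable from `beta-work` and there are no uncommitted changes.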
Git is a complicated tool, but provides a very powerful way for you to more easily adhere to best research practices when working with your code. It is very much worth the upfront investment to learn this ecosystem. Fortunately, because it is so powerful, there are many great resources for getting started with it. Please feel free to find and contribute or recommend additional information to improve this document!