Partially based on https://link.medium.com/uQMhEpPL95
0. Install Git and DVC
https://git-scm.com/downloads
https://dvc.org/
1. Initialize Repo
In the root of the project folder, run:
git init
dvc init
git commit -m "Initial commit"
2. Add remote repositories
DVC
Add the DVC remote repository:
dvc remote add name_of_repo /path/to/repo
Set this remote as the default for the project so it does not have to be named in every dvc push:
dvc remote default name_of_repo
Move the data cache to the remote directory to avoid a large storage overhead in the local folder:
dvc cache dir /path/to/data_store
Git
Create a bare repository in the remote folder. A bare repository has no working tree: you do not edit or commit in it directly, it only receives pushes and serves fetches. Typically it would live on GitHub/GitLab. Here we are referring to the case where we want to have a remote repository somewhere on a server.
cd /path/to/repo/
git init --bare repo_origin.git
Go back to the project directory and add the remote:
cd /project/dir/
git remote add origin /path/to/repo/repo_origin.git
git add .dvc/config
git commit -m "Configured remote"
3. Add data to version control
dvc add 02_data 03_imgs
git add .
git commit -m "Data versioning files added to git"
dvc push
4. Start working on a branch to develop a feature
git checkout -b v1
5. Commit changes
dvc add 02_data 03_imgs
git add .
git commit -m "Did some changes to code and data"
6. Push to repo
dvc push
git push origin HEAD:master # Pushes current branch to remote master
7. Merge branch to master after completing a feature
git checkout master
git merge v1
Bonus:
To remove cached data that the current workspace no longer uses:
dvc gc --workspace
Full article
When working on a production machine learning project, you probably deal with a ton of data and several models. To keep track of which models were trained with which data, you should use a system to version the data, similar to versioning and tracking your code. One way to solve this problem is dvc (Data Version Control, https://dvc.org/), which approaches data versioning in a similar way to Git.
To illustrate the use of dvc in a machine learning context, we assume that our data is divided into train, test and validation folders, with the amount of data increasing over time, either through an active learning cycle or by manually adding new data. An example could be the following structure, in which the labels are omitted for simplicity:
├── train
│   ├── image1.jpg
│   ├── image2.jpg
│   └── image3.jpg
├── val
│   └── image4.jpg
└── test
    └── image5.jpg
Normally, a minimal versioning system should have the following two capabilities:
- Tag a new set of data with a new version e.g. vx.y.z
- Return to old data versions or switch between different data versions easily
Among other features, dvc is capable of doing these tasks. For this purpose it works closely together with Git. First you need to install dvc, which can be done using pip:
pip install dvc
To start the versioning process, create a Git repository in the base folder of your data and then initialize dvc:
git init
dvc init
Through the init command, dvc has now created a .dvc folder. It contains the cache, in which dvc stores the data of the different versions, and the config file, which stores meta information.
In case you are wondering how Git fits into this concept: the task of Git here is not to version the data itself but to version the dvc files. These files hold the metadata of a version, such as the location of the files belonging to a particular version and which files of your data belong to the current data version.
So that Git ignores the data itself, dvc also automatically writes to the .gitignore file. To commit the dvc config file and the .gitignore file we need to make an initial commit:
git commit -m "Initial commit"
Each data version is associated with its own .dvc files, which in turn are associated with one commit or one head of Git. The dvc files define and track the data for a given version, while the dvc files themselves are tracked by Git. For me, a good way to associate a new data version with a head of Git is to make a new branch for each new data version. To do this, before we define our first version, we create a new branch named after the version and check it out:
git checkout -b v0.0.1
Now we can define our first version by telling dvc which data should be tracked, in our case the train, val and test folders. This is done with the dvc add command:
dvc add train test val
After that we see a new .dvc file for each folder, such as train.dvc, inside our base folder. The folders themselves have been added to the .gitignore so that Git doesn't track the data itself, which in our case is the task of dvc. To track the new .dvc files with Git, we follow the standard Git procedure for a commit:
git add .
git commit -m "Data versioning files added to Git"
We have now created the first version of our data: the .dvc files record which data belongs to the version, and the current commit references the .dvc files themselves. Please note that you can also connect the Git repository to a remote Git repository to save and version the .dvc files remotely. The data in this case stays in the current folder and is not stored remotely (this can also be changed using dvc push and pull).
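To make the idea of a small metadata file standing in for a large folder more concrete, here is a minimal Python sketch. It is not how dvc is implemented: the function name folder_fingerprint, the use of SHA-256 and the fingerprint format are assumptions chosen purely for illustration. It only shows that a single short hash can identify the exact content of a folder, which is why Git can version the metadata without ever storing the images.

```python
import hashlib
import tempfile
from pathlib import Path


def folder_fingerprint(folder: Path) -> str:
    """Return one hash that changes whenever any file in `folder` changes.

    Hypothetical helper for illustration; real .dvc files use their own format.
    """
    outer = hashlib.sha256()
    for p in sorted(folder.rglob("*")):
        if p.is_file():
            # Hash the content of each file, then fold path and hash into one digest.
            inner = hashlib.sha256(p.read_bytes()).hexdigest()
            outer.update(f"{p.relative_to(folder).as_posix()}:{inner}\n".encode())
    return outer.hexdigest()


# Small demo with throwaway files standing in for images.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train"
    train.mkdir()
    (train / "image1.jpg").write_bytes(b"first image")
    before = folder_fingerprint(train)
    (train / "image6.jpg").write_bytes(b"new image")
    after = folder_fingerprint(train)
    print("fingerprint changed:", before != after)
```

The fingerprint stays the same size no matter how much data the folder holds, so committing it to Git is cheap even for very large data sets.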
We have now associated one state of our data with a version, but of course you don’t need data versioning for a single fixed data set. Therefore we now assume that two new images (image6.jpg and image7.jpg) are added to the train and test folders, so that the structure now looks like this:
├── train
│   ├── image1.jpg
│   ├── image2.jpg
│   ├── image3.jpg
│   └── image6.jpg
├── val
│   └── image4.jpg
└── test
    ├── image5.jpg
    └── image7.jpg
To create a new data version we repeat the previous steps. We therefore create a new branch corresponding to the new data version:
git checkout -b v0.0.2
As we already know, a new data version is always associated with its own .dvc files, which store the meta information of the version. To update the .dvc files we need to tell dvc to track the train and test folders again, as there is new data in these folders:
dvc add train test
The train.dvc and test.dvc files have changed, and dvc now tracks which files belong to the current version. To track the new .dvc files inside the Git branch we have to make a commit:
git add .
git commit -m "Data versioning files added to Git"
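Since every new data version repeats the same four commands, the sequence can be collected in a small helper. The following Python sketch is only one possible way to do this; the function names new_version_commands and run_all are hypothetical, and the helper runs in dry-run mode by default so that it only prints what it would execute. The commands themselves are exactly the ones used above.

```python
import shlex
import subprocess


def new_version_commands(version, folders,
                         message="Data versioning files added to Git"):
    """Return the command sequence that records a new data version."""
    return [
        ["git", "checkout", "-b", version],  # one branch per data version
        ["dvc", "add", *folders],            # update the .dvc files
        ["git", "add", "."],                 # stage the changed .dvc files
        ["git", "commit", "-m", message],    # record them in the branch
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Dry run for the second version of the example data set.
run_all(new_version_commands("v0.0.2", ["train", "test"]))
```

Calling run_all(..., dry_run=False) inside the data folder would execute the commands for real, assuming git and dvc are installed and on the PATH.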
Now comes the cool part. When you list your Git branches, you see two branches besides master, each corresponding to one data version:
master
v0.0.1
* v0.0.2
You are now able to go back to an older data version and update your data directory so that it matches that version. To get back to the previous version we need to do two things. First we check out the corresponding head of the data version, which in this case is the branch v0.0.1:
git checkout v0.0.1
In this head the .dvc files differ from those of v0.0.2, but our current data directory still looks the same and the data inside it still corresponds to v0.0.2. This is because dvc has not yet aligned the data directory with its .dvc files. To align your data directory with the correct data version, which is recorded in the .dvc files, one needs to run the dvc checkout command:
dvc checkout
This command restores the old data version (in this case v0.0.1) using its cache. When you now look into your data directory, you again see the following structure:
├── train
│   ├── image1.jpg
│   ├── image2.jpg
│   └── image3.jpg
├── val
│   └── image4.jpg
└── test
    └── image5.jpg
The files image6.jpg and image7.jpg were removed from the data directories; they remain available in the dvc cache. You can now work with the old data version just as usual with the three folders.
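Because switching versions always takes the same two steps, it is easy to forget the second one and end up with .dvc files and data that do not match. A small wrapper can keep the two together. This is just a sketch: the function name switch_data_version and the injectable run parameter are assumptions made here so that the example can be tried without a real repository.

```python
import subprocess


def switch_data_version(version, run=subprocess.run):
    """Check out the Git branch for `version`, then align the data with it."""
    for argv in (["git", "checkout", version], ["dvc", "checkout"]):
        # With the default runner, check=True raises on the first failing command.
        run(argv, check=True)


# Demo with a recording stand-in for subprocess.run, so nothing is executed.
calls = []
switch_data_version("v0.0.1", run=lambda argv, check: calls.append(argv))
print(calls)
```

With the default runner, switch_data_version("v0.0.1") would execute both commands in the current directory, assuming git and dvc are installed.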
This procedure also works for data versions that contain far more data than is currently present in the data folder: dvc keeps the files of every tracked version in its cache, so its checkout command can recreate older or newer states of the data directories. The checkout of course also works in the other direction. You could check out the branch v0.0.2 in Git and run dvc checkout to set the data directory to the state of version v0.0.2.
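The way a cache makes this back-and-forth possible can be illustrated with a toy model. The following Python sketch is not dvc's implementation; the functions snapshot and restore, the flat cache layout and the use of SHA-256 are assumptions for illustration. Each file is stored in the cache under the hash of its content, a version is just a mapping from file names to hashes, and restoring a version means copying the listed files back and deleting the rest.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def snapshot(folder: Path, cache: Path) -> dict:
    """Copy every file into the cache under its hash; return {relpath: hash}."""
    cache.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for p in sorted(folder.rglob("*")):
        if p.is_file():
            digest = hashlib.sha256(p.read_bytes()).hexdigest()
            target = cache / digest
            if not target.exists():  # identical content is stored only once
                shutil.copyfile(p, target)
            manifest[p.relative_to(folder).as_posix()] = digest
    return manifest


def restore(folder: Path, cache: Path, manifest: dict) -> None:
    """Make `folder` contain exactly the files listed in `manifest`."""
    folder.mkdir(parents=True, exist_ok=True)
    for p in sorted(folder.rglob("*")):
        if p.is_file() and p.relative_to(folder).as_posix() not in manifest:
            p.unlink()  # file does not belong to this version
    for relpath, digest in manifest.items():
        dest = folder / relpath
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(cache / digest, dest)


# Demo: two versions of a train folder, switching in both directions.
with tempfile.TemporaryDirectory() as tmp:
    train, cache = Path(tmp) / "train", Path(tmp) / "cache"
    train.mkdir()
    for name in ("image1.jpg", "image2.jpg", "image3.jpg"):
        (train / name).write_bytes(name.encode())
    v1 = snapshot(train, cache)
    (train / "image6.jpg").write_bytes(b"image6.jpg")
    v2 = snapshot(train, cache)
    restore(train, cache, v1)
    print(sorted(p.name for p in train.iterdir()))
    restore(train, cache, v2)
    print(sorted(p.name for p in train.iterdir()))
```

In this toy model the three unchanged images are stored only once even though they belong to both versions, which is why keeping many versions does not multiply the storage needed for data that stays the same.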
Besides the init, add and checkout commands, dvc has many more features that make the machine learning/big data workflow easier. For example, data versions can be shared between multiple machines by using a remote bucket such as Amazon’s S3 and interacting with the bucket through dvc push and pull (for details see https://dvc.org/).
I hope this article can help to better organize the data in a Machine Learning project and to keep a better overview.
For more blog posts about Machine Learning, Data Science and Statistics, check out www.matthias-bitzer.de