About a year ago I started using the black formatter tool for all my new Python projects. I wrote about it here, explaining why it seemed like a good idea at the time, and left open the question of whether I would stick with it.
So, have I? Yes. I do not agree with all of the decisions that black makes when formatting my code, but I appreciate the consistency it gives and the reduced cognitive load of not having to worry about formatting the code before committing to Git. Most of my projects aren’t large collaborations, but I think it would be even more valuable for those. So, thumbs up to the black devs. And I invite you to try it out if you haven’t yet… See what the Zen of not worrying about formatting feels like.
P.S.: I am a very finicky code formatter. I love things like vertically-aligned operators and vertically-aligned in-code comments. Black doesn’t always do what I want in these cases, but at least I can shrug and say “well, that’s just how it is”. Again, reduced cognitive load — no need to sweat the small stuff if black is going to change it anyway. “Just learn to let go.”
The Dat Project is a promising way to share “folders” of data — which is an attractive option for sharing research data and/or source code for more reproducible research.
I’ve been excited about the Dat project for a while, and lately I’ve been thinking a lot about the pros and cons of several approaches to data sharing/collaboration tools. I admire what the folks at the Dat project are doing, but I worry that their focus has recently moved away from research and toward “a peer-to-peer web”. While that isn’t inherently bad, it does mean that their priorities no longer align so perfectly with what researchers need.
So here is my wishlist of things I wish Dat could/would do differently to become a perfect tool for sharing research datasets (and maybe code). I’m writing it down mainly to help myself make sense of the trade-offs between the plethora of alternatives.
Offline Staging
As a Dat archive currently works, any changes you write into the “live” dat archive are immediately propagated to anyone else who is watching that archive. You could make a copy of the archive, try out your “experimental” changes in that copy, then “commit” to the changes by copying them back into the live archive once you are sure you are ready. The problem is that you now use twice as much space locally, and you always have to be careful that you are working on the “sandbox” version, not the “live” one.
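A rough sketch of that copy-based workaround using rsync might look like this (folder names are my own, and I am assuming the archive’s metadata lives in the hidden .dat directory, which the sandbox shouldn’t carry):
$ rsync -a --exclude '.dat' mydat/ mydat-sandbox/   # working copy, without the live archive metadata
$ # ...edit files in mydat-sandbox/ until you are happy with them...
$ rsync -a --exclude '.dat' mydat-sandbox/ mydat/   # "commit" by copying the changes back into the live archive
It works, but it is exactly the double-the-space, watch-which-folder-you-are-in dance described above.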
You could also work on the local version, then use dat sync to commit and propagate the changes to at least one peer (maybe a pinning service like Hashbase), then go back offline and do more work. Such a workflow would look like this:
$ mkdir mydat && cd mydat
$ dat init
$ # Now, add files to the Dat.
$ dat sync # this also shares!
dat v13.10.0
dat://ec930b[...]
Sharing dat: 4 files (180 B)
0 connections | Download 0 B/s Upload 0 B/s
Watching for file updates
Ctrl+C to Exit
Now you can’t make any changes in the Dat without having them go “live” instantly. So you could press Ctrl+C to take the dat down, and hope that someone else happens to be seeding it (they might not be).
Then you change some things, and run dat sync again when you are ready to make it live…
What I would really prefer is to basically “steal” the workflow of Git: initialize, work, stage, commit, push, repeat.
$ mkdir mydat && cd mydat
$ dat init
$ # Now, add your data...
$ dat add . # Add everything to the managed archive
Then, use a “commit” to mark the changes that are ready to be shared with peers:
$ dat commit -m "initial data version"
$ # When you run `sync`, only committed changes are actually synced:
$ dat sync # A `dat push` to a pinning peer would be nice as well
When you make more changes, you “stage” them with add and then commit them to make them “live”. In my imagining, the dat sync operation starts seeding your committed changes immediately, and runs in the background to make your local machine an (almost) always-on peer. Then you can continue working — none of your local changes become “visible” on remote copies of your Archive until you’ve staged and committed them. There could be a dat status that would quickly tell you what version is “live” and what you’ve changed but not committed.
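Here is roughly how I picture the output of that hypothetical dat status. To be clear, none of this exists in today’s dat CLI; it is just my imagined mash-up of git status and the log format shown further down:
$ dat status
dat://ec930b[...]
Last commit: 3 > adding results to file2 from experiment 1
Changes staged but not committed:
- [put] /file1 32 B (1 block)
Untracked files:
- file4
- MessyNotes.md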
A Real Revert-able History
Dat archives maintain a history of every action that has taken place, but it isn’t a very useful history. Here is an example:
$ dat log
1 [put] /dat.json 39 B (1 block)
2 [put] /dat.json 157 B (1 block)
3 [put] /file1 0 B (0 blocks)
4 [put] /file2 0 B (0 blocks)
5 [put] /file3 0 B (0 blocks)
6 [put] /file2 23 B (1 block)
Log synced with network
Archive has 6 changes (puts: +6, dels: -0)
Current Size: 180 B
Total Size:
- Metadata 336 B
- Content 219 B
Blocks:
- Metadata 7
- Content 3
I can see that there are six revisions of the information in this Dat, but I have no idea what the significance of any of those revisions might be. This is where commit messages like the ones in Git would come in handy. Or, in lieu of that, something akin to the concept of a Git tag to mark milestones. How about this:
$ dat log
1 > dat initialization...
[put] /dat.json 39 B (1 block)
[put] /dat.json 157 B (1 block)
2 > adding placeholders for files 1 - 3
[put] /file1 0 B (0 blocks)
[put] /file2 0 B (0 blocks)
[put] /file3 0 B (0 blocks)
3 > adding results to file2 from experiment 1
[put] /file2 23 B (1 block)
Log synced with network
Archive has 3 commits, 6 changes (puts: +6, dels: -0)
Current Size: 180 B
Total Size:
- Metadata 336 B
- Content 219 B
Blocks:
- Metadata 7
- Content 3
Untracked files:
- file4
- MessyNotes.md
Changes not synced:
- [put] /file1 32 B (1 blocks)
Now, let’s say I wanted to go back to just before the results were entered from experiment 1 (commit #2 above). A command might look like this:
$ dat checkout 2
Archive at commit 2: "adding placeholders for files 1 - 3"
Now, to be fair, the Dat documentation does claim that you can revert the history (so it is there in the protocol), but the example shown on the linked page will not really work as written (you have to use the URL of the file with the “?version=X” query string, not the Dat archive itself). Also, why does this only work over HTTP, and not on a dat:// link? It just feels like either an unfinished feature, or a feature that was abandoned after the project’s focus shifted. Since it already partially works, maybe it just needs some love to get it fully implemented.
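For concreteness, the version of this that works today versus the one I would want look roughly like this (the hostname is just a placeholder for wherever the archive happens to be mirrored over HTTP):
# Works: a per-file URL served over HTTP, with the version in the query string
http://<http-mirror-of-the-dat>/file2?version=5
# What I would want: the same trick on the archive's own link
dat://ec930b[...]?version=5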
Peer-to-Peer Git
The more I think about it, the more I’m convinced that what I actually want is a fork of Git that would do just a few things differently:
- It would be nice to have the peer-to-peer capability of Dat.
- Large binary files would need to be a first-class citizen.
- The interface should be (much) less daunting for non-programmers.
- CLI: Great, but make the commands more obvious.
- GUI: An (optional) graphical interface out-of-the-box is probably a good idea for collaborators who never venture out of the GUI world of Word, Excel, and occasionally R-Studio.
Now, the question is: What is the shortest path to this? Would it be easier to “adapt” Git to be more like Dat, or adapt Dat to be more like Git? I’m not sure, but every time I sit down and write about it, I get a little closer to clarity I think…
I seem to keep running into the problem of maintaining large datasets related to research projects. I’m not alone — this is a requirement for research in many domains, not just *informatics and *omics. One pain point that we feel more acutely in the “Bio-[insert-suffix-here]” fields is controlled-access datasets. Because of (mostly good) laws like HIPAA, we often have to be very careful about how we store and share our data. I suppose that even in fields without these restrictions, it is probably attractive to maintain secrecy at least prior to an initial publication.
From my standpoint, a “perfect” file container for research would meet all of the requirements below:
- Versioning: We would like to be able to track changes made to a dataset. Ideally, it would be nice to be able to “clone” a dataset at any point in its history, or to “revert” changes to a previous version.
- Access Control: We need to be able to limit who has access to a dataset, including rights to view versus edit. Ideally, being able to add new viewers/editors over time, and revoke access would be key features.
- Sharing / Transport: Research teams are rarely all in the same room. There should be an efficient-as-possible means of sharing the research dataset with others. This sharing mechanism should respect the access control requirements for the dataset. A “wishlist” feature here would be “partial” or “on-demand” access to certain parts of a large dataset without requiring space to store the whole thing.
As far as I can tell, there is no current file or container format that meets all of these requirements (it may not even be possible to fulfill all the “wishes”). Still, there are some options that come close, and some that do parts very well:
Git-LFS
Git seems to be the most popular source code versioning system, at least amongst the open-source community. The Git-LFS (Git Large File Storage) extension allows “large” binary files to be stored using the same tools that developers use to version source code (and other text-based content). Git-LFS would handle the versioning problem nicely, and the sharing/transport problem pretty well; it cannot help with access control. (You could encrypt the files within the repository to provide access control.)
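As a concrete sketch, setting this up for a dataset looks roughly like the following (the file patterns, paths, and remote name are just examples):
$ git init research-data && cd research-data
$ git lfs install                          # one-time setup on each machine
$ git lfs track "*.h5" "*.csv"             # tell LFS which large/binary patterns to manage
$ git add .gitattributes                   # the tracking rules live in this file
$ git add data/experiment1.h5
$ git commit -m "Add raw data for experiment 1"
$ git push origin master                   # pointers go into Git history; the big files go to the LFS store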
- Versioning: Great
This is what Git was built for. All of the other options will be measured against this… The ability to modify history is the only negative here.
- Access Control: OK
Git security and access is controlled at the server-side (remote) level, just like with regular Git repositories, so access at the repository level is well understood. Once a user has downloaded the repository, they have possession of all the data and you can’t revoke it at that point (but this is the same as most other options). You could revoke access to future revisions, of course. Support for “readers” versus “editors” is easy and well understood.
- Sharing / Transport: Pretty Good
Git is designed for sharing source code with a development team that might be globally distributed. As long as each member can access the repository’s server (and LFS datastore), all is well. There is nothing in the protocol to help with transfer (large files will take as long as the slowest transport link requires), but that is the case with most other options as well.
Dat
I’ve been following the Dat project intently for a couple of years now. A “Dat” archive is basically a directory with some “magic” (hidden configuration) added, much like the way Git works. The project’s early focus was on boosting availability of open scientific datasets, although they seem to be shifting focus toward peer-to-peer web publishing now. In theory, a Dat archive allows files within the folder to be version-controlled through the use of an append-only log, so you have a cryptographically verifiable audit trail of changes. Supposedly, you could look back at historical versions, although at the time of this writing, that feature is either not implemented yet or it is totally undocumented. Dats are shared over a peer-to-peer protocol, (potentially) improving transport and availability.
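Day-to-day use looks something like this, assuming the dat CLI (the v13-era commands; the link is abbreviated):
$ # On the machine that holds the data:
$ cd mydata && dat share                   # create the archive if needed and start seeding it
$ # On a collaborator's machine, using the link printed above:
$ dat clone dat://ec930b[...] mydata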
- Versioning: Half-Way-There
Although the append-only nature of the format should lead to even better versioning than Git (whose history is mutable), the current lack of any way to go back to a specific version means that this isn’t really useful for what we generally think of as versioning; it is only good for transparency. You could perform a data release by creating and publishing a Dat, then never modifying that one, creating new Dats for future releases instead — but that defeats a primary advantage of using a version control system in the first place. Here’s hoping that this feature is implemented (and exposed) very soon! Also, committing changes to a Dat is tricky — the tooling either tracks changes “live” (scary), or you have to take the Dat offline, modify it, and then re-“sync”. There is no “stage” - “commit” - “push” workflow like with Git. This isn’t a deal-breaker, though, just an annoyance.
- Access Control: Poor
The Dat crew would be disappointed with my “Poor” rating, I’m sure, but their security model is one of “security through obscurity”, and that isn’t good enough for access-controlled datasets. The ability to access a Dat archive is controlled by a “public key” — if you have the key, you can read from (but not write to) the Dat. This key is large enough (32-byte Ed25519 keys) that “guessing” it is statistically impossible, but the real problem is accidental key leakage; you are one “reply-all” away from disaster with Dat. There is no way to revoke access aside from taking an archive offline, and since the protocol is peer-to-peer, every peer would have to agree to take down the archive at the same time to render it “gone”. So, if you want access-controlled data in a Dat, you are left encrypting the contents yourself (see the sketch after these ratings).
- Sharing / Transport: Great
This is one of the primary things Dat was designed to do. Dat uses a decentralized, peer-to-peer architecture (although servers can be used to guarantee discoverability and availability), so the more team members “seed” the dataset, the faster it should be to transfer. Sharing is as easy as sharing a URL (the “dat” link). This is one of the main things that keeps me excited about Dat. One missing feature is the ability for team members to write to the Dat — the Dat team calls this feature “multi-writer”, and work on it is in progress so it should be coming (hopefully) soon.
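As noted under Access Control, the only real option today is to encrypt sensitive files yourself before they ever land in the archive. A minimal sketch with gpg (symmetric passphrase encryption; filenames are just examples):
$ gpg --symmetric --cipher-algo AES256 results/experiment1.csv   # writes results/experiment1.csv.gpg
$ rm results/experiment1.csv             # keep only the encrypted copy inside the Dat folder
$ dat sync                               # peers get the ciphertext; the passphrase has to travel separately
Key management then becomes its own problem, of course, which is exactly the kind of thing I wish the protocol handled for me.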
Dropbox
Well, this one is ubiquitous. I’ll also lump Box in here, as they are pretty similar. Dropbox creates a special directory on your machine that is automatically backed up to their servers. On top of this, they allow easy sharing across machines (with one account), to others with Dropbox logins, or even to anyone with a special Dropbox link. The security model is better for the “has a Dropbox account” scenario, because you can assign read/edit access on a per-user basis, and revoke it. The links are a little more scary, since having the link grants access with no other safeguards, but at least it is easy to revoke a link (of course, if someone unauthorized has already downloaded the files, it is too late). This one is super easy to use and very dependable, but it requires you to trust a third-party company. One limitation might be available space if your datasets are very large, not to mention transfer time.
- Versioning: Good
Dropbox maintains a history of file versions, and allows you to roll back to a previous version through their web interface. The history may be limited (depending on settings), so it might not go back forever.
- Access Control: Fair
Access can be granted in several ways; the most secure is to grant specific access to other Dropbox users. This way, you can set the permissions (read or edit) and change them later. There is nothing to prevent a user who has already downloaded a shared resource from viewing it after access is revoked though. Access controlled datasets probably shouldn’t be shared with the “link” method.
- Sharing / Transport: Great / Mediocre
Sharing is easy and works very well; it is a major selling-point of the service. Transport of very large files is less good, though. In my experience (and anecdotally from others who I’ve discussed the service with), transfers of very large files or large numbers of small files can be quite slow versus the expected bandwidth of the user’s connection. Your mileage may vary, of course.
BitTorrent
BitTorrent is a well known and established protocol/service for sharing files over peer-to-peer connections. It is primarily a file-oriented protocol, so you would need to place your datasets in a container (a tarball would work) before sharing. There is essentially no security beyond discoverability (if someone can find your torrent ID, they can download your file), so you have to provide access control mechanisms at the file container level. But for distribution of a resource that many end users need at about the same time, it is hard to beat.
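A packaging workflow might look like this (mktorrent is just one of several torrent-creation tools, and the tracker URL is a placeholder):
$ tar czf dataset-v1.tar.gz data/                            # bundle the dataset into one container
$ gpg --symmetric --cipher-algo AES256 dataset-v1.tar.gz     # access control has to happen at this level
$ mktorrent -a udp://tracker.example.org:1337 -o dataset-v1.torrent dataset-v1.tar.gz.gpg
$ # Hand dataset-v1.torrent to collaborators and keep at least one machine seeding.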
- Versioning: None
Versioning is really out-of-scope for BitTorrent. You would need to provide this in some other way.
- Access Control: Poor
Anyone with the torrent link can download the file; this is security-through-obscurity at best, and cannot be used for access-controlled data. You would need to encrypt prior to sharing.
- Sharing / Transport: Great
This is the bread-and-butter of BitTorrent. As long as you have at least one person “seeding” the file, it is available. Transport speed will (generally) increase as more peers seed the file. To assure availability, you would need to run a server to perform seeding full-time.
Syncing Services
Here I include things like BitTorrent Sync (Resilio Sync) and SyncThing. These share directories using peer-to-peer technology.
- Versioning: None
As far as I know, these services do not perform versioning in the sense that we would expect from, say, a version-control system. You always get the most current version of the shared directory (and that is the point — syncing).
- Access Control: Varies
Some of these can do per-peer access control; that is not bad. Others do access control based on whether you have the “link” to the shared data (not great). You need to look into the details of the service you choose.
- Sharing / Transport: Great
The peer-to-peer nature makes these services great for quickly moving large amounts of data when multiple peers are online. Transport speed will (generally) increase as more peers share the folder. To assure availability, you would need to keep at least one always-on machine syncing full-time.
Direct Download
I will lump together all of the methods of downloading a file directly from a server here: HTTP, FTP / SFTP, Rsync, etc. This is currently how a large amount of research data is shared online. It is challenging to set up and maintain, but the methods are well known and time-tested. Security can be non-existent (HTTP/FTP) or as good as you desire (and have the skill) to make it.
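For example, pulling a dataset from a lab server over SSH with rsync might look like this (host and paths are placeholders):
$ rsync -avz --partial user@dataserver.example.edu:/srv/projects/study42/ ./study42/
$ # -a preserves file metadata, -z compresses in transit, --partial keeps interrupted transfers resumable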
- Versioning: None
Versioning is out of scope of this option, and would have to be provided via some other means.
- Access Control: Depends
If you are skilled at managing the server, you can set up very specific access control rules. This option is potentially among the best in this category, but it also requires the most effort and skill to get right.
- Sharing / Transport: Good
Assuming your server has a high-bandwidth connection (or you use a CDN) and you are skilled enough to set up accounts to provide the level of access you require, the “just host it on a server” solution is hard to beat.
[Edited 2018-06-15 to fix “Git-LFS” versioning bullet point.]