Jason L Causey

Imperfect Options for Sharing Big Research Data

I seem to keep running into the problem of maintaining large datasets related to research projects. I’m not alone; this is a requirement for research in many domains, not just *informatics and *omics. One pain point that we feel more acutely in the “Bio-[insert-suffix-here]” fields is controlled-access datasets. Because of (mostly good) laws like HIPAA, we often have to be very careful how we store and share our data. I suppose that even in fields without these restrictions, it is probably attractive to maintain secrecy at least prior to an initial publication.

From my standpoint, a “perfect” file container for research would meet all of the requirements below:

  1. Versioning: We would like to be able to track changes made to a dataset. Ideally, it would be nice to be able to “clone” a dataset at any point in its history, or to “revert” changes to a previous version.
  2. Access Control: We need to be able to limit who has access to a dataset, including rights to view versus edit. Ideally, being able to add new viewers/editors over time, and revoke access would be key features.
  3. Sharing / Transport: Research teams are rarely all in the same room. There should be an efficient-as-possible means of sharing the research dataset with others. This sharing mechanism should respect the access control requirements for the dataset. A “wishlist” feature here would be “partial” or “on-demand” access to certain parts of a large dataset without requiring space to store the whole thing.

As far as I can tell, there is no current file or container format that meets all of these requirements (it may not even be possible to fulfill all the “wishes”). Still, there are some options that come close, and some that do parts very well:

Git-LFS

Git seems to be the most popular source code versioning system, at least amongst the open-source community. The Git-LFS (Git Large File Storage) extension allows “large”, binary files to be stored using the same tools that developers use to version source code (and other text-based content). Git-LFS would handle the versioning problem nicely, and the sharing/transport problem pretty well; it cannot help with access control. (You could encrypt the files within the repository to provide access control.)

Dat

I’ve been following the Dat project intently for a couple of years now. A “Dat” archive is basically a directory with some “magic” (hidden configuration) added, much like the way Git works. The project’s early focus was on boosting availability of open scientific datasets, although they seem to be shifting focus toward peer-to-peer web publishing now. In theory, a Dat archive allows files within the folder to be version-controlled through the use of an append-only log, so you have a cryptographically verifiable audit trail of changes. Supposedly, you could look back at historical versions, although at the time of this writing, that feature is either not implemented yet or totally undocumented. Dats are shared over a peer-to-peer protocol, which can potentially improve transport and availability.

Dropbox

Well, this one is ubiquitous. I’ll also lump Box in here, as they are pretty similar. Dropbox creates a special directory on your machine that is automatically backed up to their servers. On top of this, they allow easy sharing across machines (with one account) or to others with Dropbox logins, or even to anyone with a special Dropbox link. The security model is better for the “has a Dropbox account” scenario, because you can assign read/edit access on a per-user basis, and revoke it. The links are a little more scary, since having the link grants access with no other safeguards, but at least it is easy to revoke a link (of course, if someone unauthorized has already downloaded the files, it is too late). This one is super easy to use, and very dependable, but requires you to trust a third-party company. One limitation might be available space if your datasets are very large, not to mention transfer time.

BitTorrent

BitTorrent is a well known and established protocol/service for sharing files over peer-to-peer connections. It is primarily a file-oriented protocol, so you would need to place your datasets in a container (a tarball would work) before sharing. There is essentially no security beyond discoverability (if someone can find your torrent ID, they can download your file), so you have to provide access control mechanisms at the file container level. But for distribution of a resource that many end users need at about the same time, it is hard to beat.
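As a small illustration of the "put it in a container first" step, here is a sketch that packs a dataset directory into a gzip-compressed tarball using Python's standard `tarfile` module. The function name `make_tarball` and the paths in the usage note are hypothetical, and the sketch does nothing about access control; any encryption would have to be applied to the resulting file separately.

```python
import tarfile
from pathlib import Path


def make_tarball(dataset_dir, archive_path):
    """Pack dataset_dir into a gzip-compressed tarball at archive_path."""
    dataset_dir = Path(dataset_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the paths inside the archive relative to the
        # dataset folder instead of recording the full local path.
        tar.add(dataset_dir, arcname=dataset_dir.name)
    return Path(archive_path)
```

Calling `make_tarball("my_dataset", "my_dataset.tar.gz")` would produce a single file that can then be handed to whatever torrent client you use.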

Syncing Services

Here I include things like Resilio Sync (formerly BitTorrent Sync) and Syncthing. These share directories using peer-to-peer technology.

Direct Download

I will lump together all of the methods of downloading a file directly from a server here: HTTP, FTP / SFTP, Rsync, etc. This is currently how a large amount of research data is shared online. It is challenging to set up and maintain, but the methods are well known and time-tested. Security can be non-existent (HTTP/FTP) or as good as you desire (and have the skill) to make it.

[Edited 2018-06-15 to fix “Git-LFS” versioning bullet point.]


Black Formatter for Python

Today I learned about the black formatter tool for Python source code. The name is a play on the Henry Ford quote “Any customer can have a car painted any color that he wants so long as it is black.”

The idea is that black is not configurable. Now I love to dig into configuration options (it’s a great way to procrastiwork), but this really struck me: If you are not allowed to tweak any options, it really reduces the mental load over which ones are the “right fit” for the project. This could be especially freeing if you are working on a large repository over time, or if you are collaborating.

I have found that my style tends to “drift” over time, especially in Python. I’d like to say that I’m a devout follower of the PEP-8 gospel, but the truth is I waver. I usually like my code style “the way I like it”, which is usually whatever is aesthetically most pleasing to me at the time, and what I consider most readable. But aesthetic choices drift… So better to not have any control of it at all! (Yes, I know that Go was created on this philosophy, but I don’t do all that much Go programming.)

The beauty of black is the lack of options — a particular project manager can’t make any decisions about the “right” style for the project, so there can’t be any discussions of the merits of those choices. black will format according to PEP-8 (actually, according to pycodestyle), and that’s that.

So, I’m going to try this out in a new library I’m building that will collect some scripts I’ve been using to work on CSV and other text-based formats. My plan is to always require black formatting before any commit. To help with this, I created a pre-commit hook in git that basically does the following:
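A minimal sketch of that kind of hook, written in Python, might look like the code below. It is an illustration rather than the exact script: it assumes `git` and `black` are both on the PATH, and the function names `python_files`, `staged_paths`, and `main` are made up for this example. It only checks formatting with `black --check` (which exits non-zero when a file would be reformatted) and refuses the commit if the check fails.

```python
import subprocess


def python_files(paths):
    """Keep only the paths that end in .py."""
    return [p for p in paths if p.endswith(".py")]


def staged_paths():
    """List files that are added, copied, or modified in the git index."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def main(paths=None):
    """Return 0 if the commit may proceed, non-zero to abort it."""
    files = python_files(staged_paths() if paths is None else paths)
    if not files:
        return 0  # nothing for black to check
    # `black --check` changes nothing on disk; it exits non-zero when
    # at least one file would be reformatted.
    status = subprocess.run(["black", "--check", *files]).returncode
    if status != 0:
        print("Run black on the files listed above, then stage them again.")
    return status
```

To try it, the file would be saved as `.git/hooks/pre-commit` and made executable, with a `#!/usr/bin/env python3` line at the top and `raise SystemExit(main())` at the bottom. One caveat of this sketch: black looks at the working-tree copy of each file, which can differ from what is actually staged.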

I’m hoping this will help free me from “worrying” about formatting in this library, and allow me to focus on what matters — making useful tools. We’ll see how it goes…


Salt and Hash your Passwords

This post was originally created on the A-State CS Department wiki on May 9, 2016 after I observed several students having trouble understanding the fundamental order of operations required to hash and salt a password. The advice here is at the most basic level — do more research before trying this for anything “real”.

The order in which you “salt” and “hash” matters!

The order in which you perform the “salt” and “hash” steps when storing a password is vital to the security of the whole scheme. You absolutely must do things in this order:

  1. Concatenate the ‘salt’ with the raw password.
  2. Hash the salted password.
  3. Compare the hash with the stored hash in the database. (Or, if creating an account, store the hash and the salt into the database.)
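
The three steps can be sketched in a few lines of Python. This is only an illustration of the ordering: the function names are hypothetical, and `md5` appears solely because the examples in this post use it (it is not suitable for real password storage).

```python
import hashlib
import hmac
import secrets


def hash_password(password, salt):
    """Steps 1 and 2: concatenate password and salt, then hash the result."""
    salted = (password + salt).encode("utf-8")
    # md5 only mirrors the examples in this post; do not use it for real passwords.
    return hashlib.md5(salted).hexdigest()


def create_account(password):
    """Account creation: pick a fresh random salt, return (hash, salt) to store."""
    salt = secrets.token_hex(8)
    return hash_password(password, salt), salt


def check_login(attempt, stored_hash, salt):
    """Step 3: re-hash the attempt with the stored salt and compare."""
    return hmac.compare_digest(hash_password(attempt, salt), stored_hash)
```

At login time, `check_login("sesame", stored_hash, salt)` repeats exactly the same salt-then-hash computation that was done when the account was created, so a correct password reproduces the stored hash.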

Previously, I have seen exam answers where 1 & 2 were reversed. Adding a string of random characters on the end of an already hashed password offers absolutely no advantage. Consider:

Doing it Wrong:

password:  sesame
hash("sesame") => b3fba6554a22fdc16c8e28b173085ccc
salt:             kqrjtiuhvaw

If you hash then salt, you get:

b3fba6554a22fdc16c8e28b173085ccckqrjtiuhvaw

But, if I know that you are using a particular hash algorithm (md5 here), I know the length of the hash string (32 for md5), so I just split it:

                                  |
b3fba6554a22fdc16c8e28b173085ccc  |  kqrjtiuhvaw
                                  |

And throw away the salt… Leaving the hash b3fba6554a22fdc16c8e28b173085ccc that I will just look up in my rainbow table:

286755fad04869ca523320acce0dc6a4 : password
f447b20a7fcbf53a5d5be013ea0b15af : 123456
2f548f61bd37f628077e552ae1537be2 : monkey
b3fba6554a22fdc16c8e28b173085ccc : sesame
6341e21206c4672f8b86dc4af593c5dd : abc123456

I’ll know your password is “sesame” in no time. The salt didn’t help at all.
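
The same attack, sketched in Python. The tiny `LOOKUP` dictionary stands in for a real precomputed table, and the names `md5_hex` and `crack_hash_then_salt` are invented for this example; the hashes are computed on the fly rather than copied from the listing above.

```python
import hashlib


def md5_hex(text):
    """Hex md5 digest of a string (md5 is used only to mirror the post)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# A tiny stand-in for a precomputed table: hash -> password.
COMMON_PASSWORDS = ["password", "123456", "monkey", "sesame", "abc123456"]
LOOKUP = {md5_hex(word): word for word in COMMON_PASSWORDS}


def crack_hash_then_salt(stored, hash_length=32):
    """Split off the appended salt and look the bare hash up in the table."""
    bare_hash = stored[:hash_length]
    return LOOKUP.get(bare_hash)


# The "wrong" scheme: hash first, then append the salt.
stored = md5_hex("sesame") + "kqrjtiuhvaw"
print(crack_hash_then_salt(stored))  # prints: sesame
```

The salt never enters the hash computation, so slicing it off costs the attacker nothing.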

Doing it Right:

password:        sesame
salt:            kqrjtiuhvaw
salted password: sesamekqrjtiuhvaw
hash("sesamekqrjtiuhvaw"): f429f37d8fe81d46ae1afccf80ccaa88

Now you store the salted-then-hashed password in the database along with the salt:

f429f37d8fe81d46ae1afccf80ccaa88:kqrjtiuhvaw

And if I steal that record, I can try my rainbow table, but the md5 hash for “sesame” is b3fba6554a22fdc16c8e28b173085ccc, not f429f37d8fe81d46ae1afccf80ccaa88, so I won’t “see” your password in the table. I would have to brute force every combination of password + "kqrjtiuhvaw" to find one that matched:

"a"  + "kqrjtiuhvaw"  => ee70626aab64bb600e05c4c28c822f0a
"b"  + "kqrjtiuhvaw"  => c13bd0e308c5ef7035fc1ba7409fce14
"c"  + "kqrjtiuhvaw"  => bbdec97b8be3a1098c08ff7ced3c7965
   ...
"z"  + "kqrjtiuhvaw"  => 9c67c13058e8b3713d8f307fe9a914e4
"aa" + "kqrjtiuhvaw"  => b1eab1930d25f46ac92ea8a73fbdc6f6
   ...

I’m going to be at it a while – and then I only get one password for my trouble. Since the salt is unique per-user, I have to start all over again for the next one. Not worth it.
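
To make the cost concrete, here is a rough sketch of that brute-force loop in Python. It assumes the attacker knows the salt and the scheme (password + salt, then md5), and it only searches lowercase candidates up to a small length; `brute_force` and `max_length` are hypothetical names chosen for the example.

```python
import hashlib
from itertools import product
from string import ascii_lowercase


def md5_hex(text):
    """Hex md5 digest of a string (md5 is used only to mirror the post)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def brute_force(target_hash, salt, max_length=3):
    """Try every lowercase candidate up to max_length, salted with this one salt."""
    for length in range(1, max_length + 1):
        for letters in product(ascii_lowercase, repeat=length):
            candidate = "".join(letters)
            if md5_hex(candidate + salt) == target_hash:
                return candidate
    return None  # not found within the search limit
```

Every candidate has to be hashed together with one specific salt, so none of that work carries over to a record stored with a different salt.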

This is why you salt before hashing. The order makes a big difference.

(Also, don’t actually use md5! Look for a proper password hashing function/library, and consult the OWASP password storage cheat-sheet.)
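
As one possible starting point, here is a sketch that keeps the same salt-then-hash order but swaps md5 for PBKDF2-HMAC-SHA256 from Python's standard library (`hashlib.pbkdf2_hmac`). This is a substitute I picked for illustration, not a recommendation over a maintained password-hashing library, and the iteration count is an assumption that should be checked against current guidance.

```python
import hashlib
import hmac
import secrets

# Assumption: verify this figure against current OWASP guidance before use.
ITERATIONS = 600_000


def hash_password(password, salt=None, iterations=ITERATIONS):
    """Derive a deliberately slow, salted hash with PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = secrets.token_bytes(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return digest, salt


def verify_password(attempt, stored_digest, salt, iterations=ITERATIONS):
    """Re-derive the digest with the stored salt and compare in constant time."""
    digest, _ = hash_password(attempt, salt, iterations)
    return hmac.compare_digest(digest, stored_digest)
```

The digest, the salt, and the iteration count would all be stored with the account, so the same derivation can be repeated at login.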