
Data Munging for Non-Programming Biologists

by Amir Karger, Eitan Rubin
October 20, 2005

Have you ever renamed 768 files? Merged the content from 96 files into a spreadsheet? Filtered 100 lines out of a 20,000-line file?

Have you ever done these things by hand?

Disciples of laziness--one of the Perl programmer's three virtues--know that you should never repeat anything five times, let alone 768. It dismayed me to learn that biologists do this kind of thing all the time.

On the Origin of Scripts: The Problem

Experimental biologists increasingly face large sets of large files in often-incompatible formats, which they need to filter, reformat, merge, and otherwise munge (definition 3). Biologists who can't write Perl (most of them) often end up editing large files by hand. When they have the same problem a week later, they do the same thing again--or they just give up.

My job description includes helping biologists to use computers. I could just write tailored, one-off scripts for them, right? As an answer, let me tell you about Neeraj. Neeraj is a typical NPB (non-programming biologist) who works down the hall. He came into my office, saying, "I have 12,000 sequences that I need to make primers for." I said, "Make what?" (I haven't been doing biology for very long.) Luckily, we figured out pretty quickly that all he wants to do is get characters 201-400 from each DNA sequence in a file. Those of you who have been Perling for a while can do this with your eyes closed (if you can touch type):

perl -ne 'print substr($_, 200, 200), "\n"' sequences.in > primers.out

Voilà! I gave Neeraj his output file and he went away, happy, to finish building his clone army to take over the world. (Or was he going to genetically modify rice to solve world hunger? I keep forgetting.)

Unfortunately, that wasn't the end. The next day, Neeraj came back, because he also wanted primers from the back end of the sequences (substr($_, -400, 200)). Because he's doing cutting-edge research, he may have totally different requirements next month, when he finishes his experiments. With just a few people in our group supporting hundreds or even thousands of biologists, writing tailored scripts, even quick one-liners, doesn't scale. Other common solutions, such as teaching biologists Perl or creating graphical workflow managers, didn't seem to fully address the data manipulation problem, especially for occasional users who won't be munging every day.

We need some tool that allows Neeraj, or any NPB, to munge his own data, rather than relying on (and explaining biology to) a programmer. Keeping the biologist in the loop this way gives him the best chance of applying the relevant data and algorithms to answer the right questions. The tool must be easy for a non-programmer to learn and to remember after a month harvesting fish eyes in Africa. It should also be TMTOWTDI-compliant, allowing him to play with data until he can sculpt it in the most meaningful way. While we're at it, the tool will need to evolve rapidly as biologists ask new questions and create new kinds of data at an ever-increasing rate.

When I told Neeraj's story to others in our group, they said that they have struggled with this problem for years. During one of our brainstorming sessions, my not-so-pointy-haired boss, Eitan Rubin, said, "Wouldn't it be nice if we could just give them a book of magical data-munging scripts that Just Work?" "Hm--a sort of Script Tome?" And thus the Scriptome was born. (The joke here is that every self-respecting area of study in biology these days needs to have "ome" in its name: the genome, proteome, metabolome. There's even a journal called OMICS now.)

Harnessing the Power of the Atom

The Scriptome is a cookbook for munging biological data. The cookbook model nicely fits the UNIX paradigm of small tools that do simple operations. Instead of UNIX pipes, though, we encourage the use of intermediate files to avoid errors.

We use a couple of tricks in order to make this cookbook accessible to NPBs. We use the familiar web browser as our GUI and harness the power of hyperlinking to develop a highly granular, hierarchical table of contents for the tools. This means we can include dozens to hundreds of tools, without requiring users to remember command names. Another trick is syntax highlighting. We gray out most of the Perl, to signify that reading it is optional. Parameters--such as filenames, or maximum values to filter a certain column by--we highlight in attention-getting red. Finally, we make a conscious effort to avoid computer science or invented terminology. Instead, we use language biologists find familiar. For example, tools are "atoms," rather than "snippets."

Each Scriptome tool consists of a Perl one-liner in a colored box, along with a couple of sentences of documentation (any more than that and no one will read it), and sample inputs and outputs. In order to use a tool, you:

  • Pick a tool type, perhaps "Choose" to choose certain lines or columns from a file.
  • Browse a hierarchical table of contents.
  • Cut and paste the code from the colored box onto a Unix, Mac OS X, or Windows command line. (Friendlier interfaces are in alpha testing--a later section explains more.)
  • Change red text as desired, using arrow keys or a text editor.
  • Hit Enter.
  • That's it!

Figure 1. A Scriptome tool for finding unique lines in a file.

The tool in Figure 1 reads the input line by line, using Perl's -n option, and prints each line only when it sees the value in a given, user-editable column for the first time. The substitution removes a newline, even if it's a Windows file being read on a UNIX machine. Then the line is split on tabs. A hash keeps track of unique values in the given column, deciding which lines to print. Finally, the script prints to the screen a very quick diagnostic, specifically how many lines it chose out of how many total lines it read. (Choosing all lines or zero lines may mean you're not filtering correctly.)

By cutting and pasting tools like this, a biologist can perform basic data munging operations, without any programming knowledge or program installation (except for ActivePerl on Windows). Unfortunately, that's still not really enough to solve real-world problems.
