Unfolding the mystery behind DNA sequences is key to designing synthetic microorganisms for alternative fuel sources. Penn State University Assistant Professor Howard Salis, Ph.D., leveraged Amazon Web Services (AWS) to offer an online HPC portal to bring supercomputing resources to scientists the world over for this project. We sat down with Dr. Salis to learn more about this fascinating topic.
insideHPC: What does your research within synthetic biology entail?
Howard Salis Ph.D.: My research group deciphers the rules that control how genes are expressed and regulated inside living cells. These rules are quantitative, predictive, and combined into what we call a “DNA Compiler.” We feed DNA sequences and mutations into our DNA Compiler, and it predicts how the cell’s behavior will change as a result. We combine these predictions with optimization algorithms to design non-natural DNA sequences that reprogram an organism’s behavior towards solving a specific problem.
Our rules are formulated using physics and chemistry, and experimentally validated across thousands of natural and synthetic DNA sequences in diverse organisms. Deriving our rules using physics means that we can predict many types of behaviors in diverse scenarios; the physics governing life doesn’t change much from one organism to another.
Our DNA Compiler has rapidly evolved since its birth 4 years ago. We can now design multiple types of DNA sequences (“parts”) found in genetic systems, and assemble them together with predictable functions. We’re also using the DNA Compiler to “reverse-engineer” natural systems, identifying how each genetic part contributes to an organism’s overall function.
We’re starting to see convergence in that effort, though we are still learning how organisms interpret DNA, particularly in human and plant cells. The physics that govern cellular processes create highly emergent systems; deciphering the individual interactions using Synthetic Biology will be an important challenge. Even before we have a complete understanding, though, we know enough to engineer simpler micro-organisms towards solving humanity’s problems: manufacturing novel fuels, drugs, and materials from renewable biomass; and developing new types of computers that interface directly with our bodies.
insideHPC: How does the high performance computing platform you built, the DNA Compiler, work? What technology is involved and why did you architect it the way you did?
Howard Salis Ph.D.: Designing DNA sequences can be extremely computationally intensive; a short DNA strand with 150 nucleotide units has more possible sequences than there are atoms in the universe. We use sophisticated optimization algorithms to quickly identify a DNA sequence that achieves a target organism behavior such as producing a larger quantity of biofuel, or regulating genes that work together to process signals and make decisions.
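The size of that search space is easy to check. The short calculation below counts the sequences of length 150 over the four-letter DNA alphabet and compares the count with 10^80, a commonly cited order-of-magnitude estimate for the number of atoms in the observable universe. That estimate is an assumption added here, not a figure from the interview.

```python
# Count the distinct DNA sequences of length 150 over the alphabet {A, C, G, T}.
SEQUENCE_LENGTH = 150
ALPHABET_SIZE = 4
num_sequences = ALPHABET_SIZE ** SEQUENCE_LENGTH  # Python integers are exact at any size

# Assumption: rough order-of-magnitude estimate of atoms in the observable universe.
ATOMS_IN_UNIVERSE_ESTIMATE = 10 ** 80

# The number of decimal digits minus one gives the order of magnitude.
order_of_magnitude = len(str(num_sequences)) - 1
print(f"possible sequences: about 10^{order_of_magnitude}")
print("exceeds atom estimate:", num_sequences > ATOMS_IN_UNIVERSE_ESTIMATE)
```

Under that assumption, the sequence count exceeds the atom estimate by roughly ten orders of magnitude, which is why exhaustive search is not an option.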
Perhaps ironically, we’ve found that genetic algorithms — optimization algorithms that were inspired by evolution — have especially rapid convergence with a high degree of parallelism. Some optimization calculations can require a few days to design a long DNA sequence if, for example, the target objective was particularly difficult to satisfy.
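For readers unfamiliar with genetic algorithms, here is a minimal sketch of the general technique. It is not the DNA Compiler's implementation: the objective (matching a target GC fraction), the function names, and all parameter values are hypothetical stand-ins for a real biophysical model.

```python
import random

BASES = "ACGT"

def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def fitness(seq, target_gc):
    # Toy objective (assumption): 0.0 is a perfect match, more negative is worse.
    return -abs(gc_fraction(seq) - target_gc)

def mutate(seq, rng, rate=0.02):
    # Replace each base with a random base with probability `rate`.
    return "".join(rng.choice(BASES) if rng.random() < rate else base for base in seq)

def crossover(a, b, rng):
    # Single-point crossover: prefix of one parent, suffix of the other.
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def evolve(length=150, target_gc=0.6, pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice(BASES) for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Each candidate is scored independently of the others, so this
        # step is the natural place to spread work across many nodes.
        ranked = sorted(population, key=lambda s: fitness(s, target_gc), reverse=True)
        parents = ranked[: pop_size // 2]  # the best half survives unchanged
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=lambda s: fitness(s, target_gc))

best = evolve()
print(best, gc_fraction(best))
```

Because the best half of each generation is carried over unchanged, the best score never gets worse from one generation to the next.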
Some more specifics for the like-minded: the DNA Compiler is written in Python with the intensive computation performed in wrapped C code. The web server is implemented using Python’s CherryPy, Genshi, and SQLAlchemy modules to provide interactive and responsive HTML. We try to keep the “fluff” to a minimum, as our audience of scientists and engineers likes a clean interface. The HTML interface is important as some of our researchers will view their results on tablets and smartphones, sitting next to their lab benches. On the back-end, we use Amazon’s EC2 computing cluster, and its S3 distributed storage and SQS queuing system, to dynamically turn on compute nodes, run jobs, and deliver results. This architecture has scaled well as the DNA Compiler has become increasingly popular, particularly using EC2 Auto Scaling groups to handle dynamic scale-up of nodes.
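The queue-and-worker pattern described above can be illustrated with a small, self-contained sketch. It makes no AWS calls and does not reflect the DNA Compiler's actual code: an in-memory `queue.Queue` stands in for SQS, a dictionary stands in for S3, threads stand in for EC2 nodes, and `design_sequence`, `run_jobs`, and `jobs_per_node` are hypothetical names.

```python
import queue
import threading

def design_sequence(job):
    # Placeholder for an expensive design calculation (assumption):
    # here it only reverses the input sequence.
    return job["sequence"][::-1]

def worker(job_queue, result_store, store_lock):
    """Pull jobs until a None sentinel arrives, storing each result by job id."""
    while True:
        job = job_queue.get()
        if job is None:  # sentinel: no more work for this worker
            break
        result = design_sequence(job)
        with store_lock:
            result_store[job["id"]] = result

def run_jobs(jobs, jobs_per_node=2):
    job_queue = queue.Queue()      # stand-in for an SQS queue
    result_store = {}              # stand-in for an S3 bucket
    store_lock = threading.Lock()

    # Crude stand-in for scale-up: one worker per `jobs_per_node` jobs, at least one.
    num_nodes = max(1, -(-len(jobs) // jobs_per_node))
    threads = [threading.Thread(target=worker, args=(job_queue, result_store, store_lock))
               for _ in range(num_nodes)]
    for thread in threads:
        thread.start()
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)        # one sentinel per worker
    for thread in threads:
        thread.join()
    return result_store

results = run_jobs([
    {"id": "a", "sequence": "ACGT"},
    {"id": "b", "sequence": "GGCA"},
    {"id": "c", "sequence": "TTA"},
])
print(results)
```

The point of the design is decoupling: the web front end only enqueues jobs and reads results, so the number of workers can grow or shrink with the queue length without changes to either side.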
insideHPC: Do synthetic biologists need to have at least a basic understanding of high performance computing to be successful today?
Howard Salis Ph.D.: Absolutely, but I would go further. The life science fields, including Synthetic Biology, have become swamped with measurements, particularly from next-generation sequencing. 100 bacterial genomes in a day. Human genomes for about $1,000. RNA-Seq. ChIP-Seq. Flow-Seq. Ribosome Profiling. All of these experiments generate terabytes of data and provide a quantitative measure of how cells work.
To analyze this data and build a comprehensive understanding of the cell’s molecular machinery, though, we need intensive computational modeling to connect the data dots and provide an integrated physical-chemical explanation for millions of disparate observations. Distributed computing platforms make this analysis much easier and faster. Terabyte datasets can be loaded onto high-memory nodes, which then distribute parallelized calculations to dynamically scaled-up clusters of fast-compute nodes. Instead of purchasing and maintaining clusters, renting them as the need arises can be cheaper overall, while providing more flexibility.
insideHPC: How is the DNA Compiler being used today? How does the research being done with the HPC portal apply to real world situations?
Howard Salis Ph.D.: We have 6,000 registered users from 56 countries who have designed over 50,000 synthetic DNA sequences for all sorts of biotech applications. We help the biotech industry produce more therapeutic protein drugs, as well as engineer micro-organisms to manufacture more biofuels, drugs, and materials from renewable feedstocks. The DNA Compiler has fundamentally changed the way that genetic engineering takes place by providing a way to quantitatively control and optimize the expression of many proteins working together, instead of performing trial-and-error DNA mutagenesis.
insideHPC: How do you think the field of synthetic biology will evolve from a scientific as well as technological perspective?
Howard Salis Ph.D.: The costs for DNA synthesis continue to drop, allowing Synthetic Biologists to construct larger and larger genetic systems with more interacting parts. To design such large systems, we also need to incorporate approaches that are more traditionally used by electrical and chemical engineers, such as algorithms to determine the best genetic system architectures and regulatory control loops.
New classes of regulatory genetic parts have also been discovered that work similarly in bacteria, yeast, and even humans. Using these more universal parts may allow us to prototype genetic systems in simple organisms, like bacteria, and then use a “source-to-source” DNA compiler to redesign the genetic system for eventual use in more complex organisms such as humans.
As the technology behind developing web interfaces and distributed computing becomes easier to use, these scientific advancements will be more routinely offered via web-accessible and user-friendly platforms, like the DNA Compiler, to provide these capabilities to all biotech researchers.