In this special guest feature from Scientific Computing World, Shailesh M Shenoy from the Albert Einstein College of Medicine in New York discusses the challenges faced by large medical research organizations in the face of ever-growing volumes of data.
The Albert Einstein College of Medicine in New York, or just Einstein for short, is a research-intensive medical school dedicated to medical education, research and biomedical investigation. To give you an idea of the full size of Einstein, we have over 3,000 faculty and staff, 750 medical students, 245 PhD students and over 300 research laboratories. Our core mission is divided between educating medical students and our high-end research program.
One of my roles is to deliver and maintain the computational resources necessary for the collection, analysis, visualization and storage of biophotonics data sets – harnessing photons to detect, image and manipulate biological material.
What makes us different from other research institutes is that we develop technology that is designed to meet specific research goals. Specifically, my research involves the engineering of optical microscopes and instrumentation that help scientists observe single biological molecules within living cells – that entails developing software to control microscope automation and image acquisition and designing algorithms for image analysis. In this program, our faculty has research projects that use microscopy exclusively to study hypotheses.
We distribute the technology we develop, and the methods and the reagents to the broad research community – we have a big role in educating the wider industry, not just our students.
With such a broad range of research and outputs, it is interesting to note that our central IT department does not have a mandate to support our central research data systems. In fact, our IT support is decentralized and is all department based – that in itself required a significant administrative overhead. Additionally, our data has been growing, and continues to grow, at an unprecedented rate – just a couple of years ago, our data growth was around one terabyte per week.
These two challenges, coupled with an independent HPC resource, multiple storage systems and a plethora of LDAPs for authenticating users, made it very difficult for us to collaborate internally, let alone make our findings available to other institutes and researchers. There was also a lack of high availability systems – if we ever needed to perform maintenance, or there was a network outage to deal with, our systems had to go down, and important research would have to be scheduled in advance or potentially stop altogether – that is unacceptable.
Our workflows may not be very different from those of other biomedical research facilities: we collect raw data on microscope systems, and because we capture metadata along with the images, we can catalogue them. Our researchers study the data sets using homegrown and commercial software systems on many different hardware platforms, which all need to ‘speak’ with each other seamlessly.
In short, our challenge was that we needed the ability to collaborate within the institution and with colleagues at other institutes – we needed to maintain that fluid conversation that involves data, not just the hypotheses and methods.
Solving the problem
Having identified the challenges, we knew that our new system needed to be easy to manage. The team and I needed a storage solution that we could support internally – we needed it to be a sustainable solution, something that would be easy to scale when needed.
Other stipulations were around having a copy of all data in a secondary location for disaster recovery and archive. We also wanted to store data in a secondary tier that offered somewhat lower performance at a lower cost. And we wanted a system that was easy for users to access, with everything in a single namespace.
The primary building block of our infrastructure now is DDN’s GRIDScaler running IBM’s parallel file system, GPFS – a parallel file system appliance that delivers high performance and scalability for high data rate capture and future data growth.
We also use a technology that ‘bridges’ our main storage solution to the secondary tier of object storage – DDN WOS. In a single namespace, we can copy or move files across the tiers, so our user community does not need to know where the data resides. That placement is all taken care of automatically by the DDN WOS Bridge technology, which matters because my small team does not have to manage it by hand.
Object storage in scientific computing
This project was a consolidation of disparate storage systems on to one central platform that is accessible to everybody equally, reliably, and in a robust fashion. As I mentioned earlier, our workflows are not very different from those of other institutions, and their storage systems, like ours, grow over time, especially across large institutions. As they grow, they become harder to manage.
This issue of managing data is somewhat mitigated with object storage – there are numerous case studies, articles and live environments where object storage systems replace the traditional file system approach. Hierarchies that become complex as they grow make way for the flat structure of object storage. The top-line benefits of object storage are the potential for near unlimited scalability, the custom metadata, and the access to other storage solutions through bridging technologies.
What stands out for me, though, is the use of object storage as the secondary storage tier. It gave us lots of features I was not expecting when we were first thinking about the system we wanted. One is that data is stored in an immutable fashion; since we store metadata anyhow, we can add Object Identifiers (OIDs) as additional fields.
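To make the idea of recording OIDs alongside existing metadata concrete, here is a minimal Python sketch. It is an illustration only: the `WriteOnceStore` class, the `archive_image` helper and the field names are hypothetical, the store is an in-memory stand-in rather than the DDN WOS interface, and the OID format (a random hex string) is an assumption.

```python
import uuid


class WriteOnceStore:
    """In-memory stand-in for an immutable object tier (NOT the WOS API)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Assumption: the store assigns an opaque identifier on write.
        oid = uuid.uuid4().hex
        self._objects[oid] = bytes(data)
        return oid

    def get(self, oid: str) -> bytes:
        # There is deliberately no update method: objects are write-once.
        return self._objects[oid]


def archive_image(store, catalogue, name, data, metadata):
    """Store the image bytes and add the returned OID to its metadata record."""
    oid = store.put(data)
    record = dict(metadata, name=name, oid=oid)
    catalogue.append(record)
    return record


store = WriteOnceStore()
catalogue = []
record = archive_image(
    store, catalogue, "cell_001.tif", b"pixels", {"microscope": "scope-A"}
)
```

Because the object cannot be changed after it is written, the OID kept in the catalogue record always refers to exactly the bytes that were archived.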
In our experience, using object storage as the secondary tier makes a lot of sense. Our parallel file system connects into our HPC resource, and once data is no longer needed on primary storage it can be moved to this secondary tier. Thanks to custom rulesets (based on criteria such as age of data, size and use), it is automatically moved to the object storage layer – which we can then grow as and when we need it.
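A ruleset of this kind can be pictured as a simple filter over file attributes. The Python sketch below is a hedged illustration, not the policy language used by GPFS or the WOS Bridge: the `FileInfo` and `TierRule` names, and the 90-day idle threshold, are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    last_access: datetime


@dataclass
class TierRule:
    # Hypothetical thresholds covering two of the criteria named above:
    # age (time since last use) and size.
    min_idle_days: int = 90
    min_size_bytes: int = 0

    def should_migrate(self, f: FileInfo, now: datetime) -> bool:
        idle = now - f.last_access
        old_enough = idle >= timedelta(days=self.min_idle_days)
        big_enough = f.size_bytes >= self.min_size_bytes
        return old_enough and big_enough


def select_for_migration(files, rule, now):
    """Return the paths that the rule would move to the secondary tier."""
    return [f.path for f in files if rule.should_migrate(f, now)]


now = datetime(2015, 1, 1)
files = [
    FileInfo("/lab/raw/old_stack.tif", 4_000_000, now - timedelta(days=120)),
    FileInfo("/lab/raw/new_stack.tif", 4_000_000, now - timedelta(days=3)),
]
to_move = select_for_migration(files, TierRule(), now)
```

With these example inputs only the file left untouched for 120 days is selected; the recently used file stays on primary storage.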