
Volume rendering visualization: ChatVis produced a screenshot identical to the ground truth except for a different color palette because the user prompt did not specify one.
The following is an excerpt from an article written by Gail Pieper, coordinating writer/editor at Argonne National Laboratory. The complete article can be found here.
Applications of large language models (LLMs) range from text processing to predicting virus variants. As these models grow to trillions of parameters and the datasets on which they are trained become increasingly massive, there is a growing need for strategies to make them less costly and more effective for scientific uses, such as code translation, visualization, compression, privacy protection and prediction.
Researchers in the Mathematics and Computer Science (MCS) division at the U.S. Department of Energy’s Argonne National Laboratory have been addressing this need in several ways.
Converting Code
A key problem in science is converting legacy Fortran codes to C++. Although Fortran is highly performant, its support for heterogeneous platforms is inferior to that of C++. Manual translation has been the typical approach, but the process is nontrivial, requiring extensive knowledge of both languages, and can be extremely labor intensive.
“This is where CodeScribe shines,” said Anshu Dubey, a senior computational scientist and lead PI for the research.
CodeScribe is a new tool that combines user supervision with chat completion — a technique that uses structured conversations to craft the most effective “prompts” that produce the desired output. To enhance the process, CodeScribe leverages emerging generative AI technologies. First, it maps the project structure by indexing subroutines, modules and functions across various files. Next, it generates a draft of the C++ code for a given Fortran source file. The generated results are reviewed, and errors are addressed manually by the developer or are sent for regeneration by updating the original prompt.
“CodeScribe automates many aspects, but human expertise remains essential for the final review,” said Akash Dhruv, an assistant computational scientist and primary developer of the new tool.
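The indexing, drafting and review-or-regenerate workflow described above can be pictured with a short sketch. The Python below is purely illustrative, not CodeScribe's code; the gpt-4o model name, the prompt wording and the OpenAI chat-completion calls are assumptions made for the example.

```python
# Illustrative sketch of the index -> draft -> review/regenerate loop.
# NOT CodeScribe itself; model name, prompts and client usage are assumptions.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def index_project(root: str) -> str:
    """Crude project map: list each Fortran file and the units it defines."""
    entries = []
    for path in Path(root).rglob("*.f90"):
        units = []
        for line in path.read_text(errors="ignore").splitlines():
            parts = line.split()
            if len(parts) > 1 and parts[0].lower() in ("subroutine", "module", "function"):
                units.append(parts[1])
        entries.append(f"{path}: {', '.join(units)}")
    return "\n".join(entries)

def draft_cpp(fortran_file: str, project_index: str, feedback: str = "") -> str:
    """Ask the model for a C++ draft; reviewer feedback, if any, updates the prompt."""
    prompt = ("Project structure:\n" + project_index +
              "\n\nTranslate this Fortran source file to idiomatic C++:\n" +
              Path(fortran_file).read_text())
    if feedback:
        prompt += "\n\nReviewer feedback to address in the regenerated draft:\n" + feedback
    reply = client.chat.completions.create(
        model="gpt-4o",  # stand-in choice; any chat-completion model would do here
        messages=[{"role": "system", "content": "You translate Fortran to C++."},
                  {"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```

In this sketch a developer would call draft_cpp once per source file, inspect the result, and either fix errors by hand or pass review comments back as feedback to regenerate the draft.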
CodeScribe was motivated by scientists’ desire to convert MCFM — a Monte Carlo code that simulates particle interactions observed at the Large Hadron Collider — so that the code would be interoperable with other high energy physics codes and libraries. The researchers used several generative AI models for the MCFM code conversion, each with distinct parameter counts and capabilities. While GPT-4o emerged as the most effective model in this context (see Fig. 1), the performance also revealed opportunities for optimization, particularly concerning the manual review and testing processes associated with such translations.
In ongoing work, the researchers are applying CodeScribe to other applications. For example, they are using CodeScribe to build GPU compatibility between the Flash-X open-source multiphysics simulation software and the AMReX framework for block-structured adaptive mesh refinement applications. The researchers envision CodeScribe as a valuable tool that empowers developers in scientific computing to leverage generative AI effectively.
For further information, see A. Dhruv and A. Dubey, “Leveraging Large Language Models for Code Translation and Software Development in Scientific Computing,” accepted by the Platform for Advanced Scientific Computing conference, https://doi.org/10.48550/arXiv.2410.24119
Making LLMs More Manageable
Another major challenge facing LLMs is how to make them accessible with significantly reduced computational resources. Pruning has emerged as an important compression strategy to enhance both memory and computational efficiency, but traditional global pruning has been impractical for LLMs because of scalability issues.
To address this challenge, researchers from Emory University and Argonne have developed SparseLLM. This innovative method decomposes the global pruning problem into multiple local optimization subproblems, coordinated by auxiliary variables. The researchers introduced an alternating optimization strategy in which some subproblems are optimized while others are kept fixed; the process is then repeated with a different subset. For the optimization, they leveraged sparsity-aware algorithms that handle the pruning mask selection and the weight reconstruction simultaneously, ensuring minimal performance degradation (see Fig. 2).
“By not optimizing all the variables at the same time, we can achieve more scalable training while reducing the computational cost,” said Kibaek Kim, a computational mathematician and one of the developers of SparseLLM.
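A toy sketch may help make the alternating, layer-local structure concrete. The code below is not the SparseLLM algorithm; the magnitude-based mask, the least-squares weight reconstruction and the ReLU toy network are stand-ins chosen only to illustrate how local subproblems, coordinated by the layer activations, can be solved one at a time while the others stay fixed.

```python
# Toy sketch of alternating, layer-local pruning (not the SparseLLM algorithm).
import numpy as np

def prune_layer(W, X, sparsity=0.5):
    """One local subproblem: pick a mask, then reconstruct the kept weights so
    the pruned layer's output on calibration inputs X stays close to the original."""
    target = X @ W                                           # original output (auxiliary variable)
    mask = np.abs(W) >= np.quantile(np.abs(W), sparsity)     # stand-in mask selection
    W_new = np.zeros_like(W)
    for j in range(W.shape[1]):                              # weight reconstruction, column by column
        keep = mask[:, j]
        if keep.any():
            W_new[keep, j], *_ = np.linalg.lstsq(X[:, keep], target[:, j], rcond=None)
    return W_new * mask

def alternating_prune(weights, X0, sweeps=2):
    """Alternate over layers: re-solve one local subproblem while the others stay fixed."""
    weights = [w.copy() for w in weights]
    for _ in range(sweeps):
        X = X0                                               # calibration inputs to the first layer
        for i, W in enumerate(weights):
            weights[i] = prune_layer(W, X)
            X = np.maximum(X @ weights[i], 0.0)              # propagate activations (ReLU toy network)
    return weights
```

Here the stored layer outputs play the role of the auxiliary variables: each layer is pruned against them locally, and only cheap, layer-sized problems ever have to be solved.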
For further information, see the paper by Guangji Bai, Yijiang Li, Kibaek Kim and Liang Zhao, “Towards Global Pruning for Pre-trained Language Models,” arXiv:2402.17946; poster at NeurIPS 2024.
Reasoning with LLMs
Whether LLMs can really reason has become a highly debated issue. Some studies cite achievements in multistep planning and prediction as demonstrating LLMs’ reasoning capabilities; others argue that “true reasoning” goes beyond LLMs’ ability to recognize patterns and apply logical rules. In a recent study, researchers from Argonne and the University of Pennsylvania joined the debate by focusing on a new aspect — LLM token biases when solving logical problems.
They introduced a hypothesis-testing framework to evaluate multiple commercial and open-source LLMs. They applied tests on matched problem pairs to detect performance shifts when logically irrelevant tokens were altered, such as names or quantifiers. The results showed that many state-of-the-art LLMs fail to generalize logical reasoning across minor perturbations, suggesting they often rely on superficial token patterns rather than formal logical reasoning (see Fig. 3).
“We demonstrated statistically that apparent reasoning success may stem from token bias rather than actual understanding,” said Tanwi Mallick, an assistant computer scientist. “The study provides new insights into the reliability of LLMs and opens avenues for future work on ways to improve LLMs’ logical reasoning ability.”
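The matched-pair idea can be illustrated with a small sketch. This is not the authors' framework; the syllogism template, the model interface and the McNemar-style exact test are assumptions chosen only to show how a performance shift under logically irrelevant token changes could be detected.

```python
# Illustrative matched-pair perturbation test (not the paper's code).
from scipy.stats import binomtest

def make_pair(name_a: str, name_b: str):
    """Same logical structure; only the (logically irrelevant) names differ."""
    template = ("All of {x}'s friends are doctors. {y} is a friend of {x}. "
                "Is {y} necessarily a doctor? Answer yes or no.")
    return template.format(x="Alice", y="Bob"), template.format(x=name_a, y=name_b)

def token_bias_test(ask_model, name_pairs):
    """McNemar-style exact test: does accuracy shift when irrelevant tokens change?"""
    only_original, only_perturbed = 0, 0
    for a, b in name_pairs:
        q0, q1 = make_pair(a, b)
        c0 = ask_model(q0).strip().lower().startswith("yes")  # correct answer is "yes"
        c1 = ask_model(q1).strip().lower().startswith("yes")
        if c0 and not c1:
            only_original += 1
        elif c1 and not c0:
            only_perturbed += 1
    discordant = only_original + only_perturbed
    # Under the null hypothesis of no token bias, discordant outcomes split 50/50.
    return binomtest(only_original, discordant, 0.5).pvalue if discordant else 1.0
```

A small p-value from such a test indicates that the model's answers depend on the surface tokens rather than on the underlying logical form.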
For further information, see the paper by Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo J. Taylor and Dan Roth, “A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, https://aclanthology.org/2024.emnlp-main.272.pdf
Visualizing with an LLM
Creating scientific visualizations is challenging, consuming considerable time and requiring expertise in both data analysis and visualization. Four researchers from Argonne’s MCS Division have proposed a new approach — synthetic software generation using an LLM. To this end, they have developed an AI assistant, called ChatVis, that allows the user to specify a chain of analysis/visualization operations in natural language.
ChatVis generates a Python script for the desired operations and iterates until the script executes correctly, prompting the LLM to revise the script as needed. Notably, commonly available LLMs such as ChatGPT are not trained on esoteric visualization operations; ChatVis nevertheless enables them to generate correct visualizations without retraining or fine-tuning.
“ChatVis employs a friendly human-centric natural-language interface,” said Orcun Yildiz, an assistant computer scientist. “Domain scientists, managers, administrators and decision-makers who are not visualization experts can now generate their own high-quality visualizations.”
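The generate-execute-repair loop described above can be sketched in a few lines. The code below is not the actual ChatVis implementation; the model name, the prompts and the use of subprocess to run the generated script are illustrative assumptions.

```python
# Minimal sketch of a generate-execute-repair loop (not the ChatVis implementation).
import subprocess, sys, tempfile
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_script(request: str, max_attempts: int = 5) -> str:
    """Ask for a visualization script, run it, and feed errors back until it works."""
    messages = [
        {"role": "system",
         "content": "Write a complete Python visualization script for the user's request."},
        {"role": "user", "content": request},
    ]
    for _ in range(max_attempts):
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        script = reply.choices[0].message.content  # assumes plain Python, not a fenced block
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(script)
        result = subprocess.run([sys.executable, f.name], capture_output=True, text=True)
        if result.returncode == 0:                 # script ran cleanly: done
            return script
        messages += [{"role": "assistant", "content": script},
                     {"role": "user",
                      "content": "The script failed with this error; please fix it:\n"
                                 + result.stderr}]
    raise RuntimeError("could not produce a working script")
```

The key design point is that the error messages themselves become the next prompt, so the user never has to read or debug the generated code.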
The Argonne team compared visualizations produced by five state-of-the-art LLMs with and without ChatVis. With ChatVis, they were able to generate all five visualizations successfully; without ChatVis, the best LLM could generate only one of the five cases correctly. Figure 4 shows how closely ChatVis matched the ground truth.