
LLMs related research papers published on 8th May, 2024

Today's edition covers important papers that demonstrate LLM deployment in physical surgery, efficient selection of LLM deployment configurations, methods to speed up inference, and more.

πŸ”‘ Takeaways from today's papers

  • A new research paper deploys an LLM-powered robot in physical surgery!

  • The conv-basis paper introduces an approximation method that reduces attention-layer computation

  • The Vidur framework finds an optimal LLM deployment configuration

  • Cohere AI published a new study analyzing under-trained tokens in multiple LLMs

πŸ”¬Core research improving LLMs!

πŸ’‘Why?: LLMs have a quadratic computational cost in sequence length, which limits their scalability and further improvement on longer contexts.

πŸ’»How?:  

The paper leverages the convolution-like structure of attention matrices to develop an efficient approximation method for attention computation using convolution matrices.

The paper proposes a conv basis system, "similar" to the rank basis, and shows that any lower triangular (attention) matrix can always be decomposed as a sum of k structured convolution matrices in this basis system.

The paper further designs an algorithm to quickly decompose the attention matrix into k convolution matrices. Thanks to the Fast Fourier Transform (FFT), attention inference can then be computed in O(knd log n) time, where d is the hidden dimension.

In practice d β‰ͺ n, e.g., d = 3,072 and n = 1,000,000 for Gemma. Thus, when kd = n^(o(1)), the algorithm achieves almost linear time, i.e., n^(1+o(1)). Furthermore, attention's forward training pass and backward gradient can be computed in n^(1+o(1)) time as well.

This approach avoids explicitly computing the nΓ—n attention matrix, which largely alleviates the quadratic computational complexity.

Furthermore, this algorithm works on any input matrices. This work provides a new paradigm for accelerating attention computation in transformers to enable their application to longer contexts.
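To make the FFT step concrete, here is a minimal sketch of the core primitive only: applying a lower triangular convolution (Toeplitz) matrix to a vector in O(n log n) without materializing the nΓ—n matrix. The k-term conv-basis decomposition itself is not reproduced here.

```python
import numpy as np

# A lower-triangular convolution (Toeplitz) matrix T, with T[i, j] = kernel[i - j]
# for i >= j, can be applied to a vector via FFT in O(n log n), instead of the
# O(n^2) cost of materializing T and multiplying.
def conv_matvec_fft(kernel: np.ndarray, x: np.ndarray) -> np.ndarray:
    n = len(x)
    m = 2 * n  # zero-pad so circular convolution matches linear convolution
    y = np.fft.irfft(np.fft.rfft(kernel, m) * np.fft.rfft(x, m), m)
    return y[:n]

# Sanity check against the explicit O(n^2) matrix product.
n = 8
kernel, x = np.random.randn(n), np.random.randn(n)
T = np.array([[kernel[i - j] if i >= j else 0.0 for j in range(n)] for i in range(n)])
assert np.allclose(T @ x, conv_matvec_fft(kernel, x))
```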

βš™οΈLLMOps + semi-conductor enhancement

πŸ’‘Why?: Deployment of LLMs can still be optimized. At the moment, it is very expensive and time-consuming due to the need for experimentally running an application workload against different LLM implementations and exploring a large configuration space of system knobs.

πŸ’» How?: The research paper proposes a solution called Vidur - a large-scale, high-fidelity, and easily-extensible simulation framework for LLM inference performance. Vidur uses a combination of experimental profiling and predictive modeling to model the performance of LLM operators. It then evaluates the end-to-end inference performance for different workloads by estimating key metrics such as latency and throughput. This allows for efficient exploration of the large configuration space and helps identify the most cost-effective deployment configuration that meets application performance constraints.

πŸ“ŠResults: The research paper does not explicitly mention any specific performance improvement achieved by Vidur. However, it does mention that Vidur estimates inference latency with less than 9% error across a wide range of LLMs, indicating its high fidelity. Additionally, the paper gives an example of how Vidur-Search, a configuration search tool built on top of Vidur, was able to find the best deployment configuration for LLaMA2-70B in just one hour on a CPU machine!
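To illustrate the kind of search Vidur-Search automates, here is a hypothetical sketch: score every deployment configuration with a stand-in latency predictor instead of benchmarking each one on real GPUs. The predict_latency_ms model, prices, and thresholds below are toy assumptions, not Vidur's.

```python
# Hypothetical sketch of simulation-driven config search. `predict_latency_ms`
# is a toy stand-in for Vidur's profiled + predictive performance model.
from itertools import product

def predict_latency_ms(tp: int, batch: int, gpu: str) -> float:
    base = {"A100": 30.0, "H100": 18.0}[gpu]  # toy per-token compute costs
    return base * batch / tp + 5.0 * tp       # compute vs. communication trade-off

configs = product([1, 2, 4, 8], [8, 16, 32], ["A100", "H100"])  # TP degree, batch, GPU
feasible = [
    (tp, batch, gpu)
    for tp, batch, gpu in configs
    if predict_latency_ms(tp, batch, gpu) < 100.0  # application latency constraint
]

# Pick the cheapest feasible config (toy cost: GPU count * hourly price per request).
price = {"A100": 2.0, "H100": 4.0}
best = min(feasible, key=lambda c: c[0] * price[c[2]] / c[1])
print("best config (tp, batch, gpu):", best)
```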

πŸ’‘Why?: LLMs still suffer from slow inference, specifically in the prompt phase, where the first token is generated.

πŸ’»How?: The paper proposes an efficient parallelization scheme called KV-Runahead. The key observation is that the extension phase generates tokens faster than the prompt phase thanks to the key-value cache (KV-cache), so KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache. This minimizes the time-to-first-token (TTFT) and allows for faster generation of subsequent tokens. Dual-purposing the KV-cache, which already exists for the extension phase, reduces computation and makes implementation easier. Context-level load balancing handles uneven KV-cache generation and further optimizes TTFT.

πŸ“ŠResults: The research paper shows that KV-Runahead can offer over 1.4x and 1.6x speedups for Llama 7B and Falcon 7B, respectively, compared to existing parallelization schemes. This demonstrates a significant performance improvement for LLM inference.
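To illustrate the load-balancing idea (our illustration, not the paper's exact scheme): the attention cost of token t grows with the amount of context before it, so splitting a prompt into equal chunks overloads the process holding the last chunk. Balancing cumulative cost instead gives later chunks fewer tokens.

```python
# Hypothetical sketch of context-level load balancing: the cumulative attention
# cost of a prefix of length t grows roughly like t^2, so place chunk boundaries
# where the cumulative cost reaches equal fractions of the total.
def balanced_boundaries(n_tokens: int, n_procs: int) -> list[int]:
    bounds = []
    for p in range(1, n_procs):
        # boundary where cumulative cost reaches fraction p/n_procs of the total
        bounds.append(int(n_tokens * (p / n_procs) ** 0.5))
    return bounds + [n_tokens]

print(balanced_boundaries(4096, 4))  # -> [2048, 2896, 3547, 4096]: later chunks shrink
```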

πŸ§ͺ LLMs evaluations

πŸ’‘Why?: Detects "alignment fakers", i.e., LLMs that appear aligned but misbehave when they are unlikely to be caught.

πŸ’»How?: Proposes a benchmark to detect these 'alignment fakers' by creating a set of 324 pairs of LLMs that have been fine-tuned to select actions in role-play scenarios. One model in each pair is consistently aligned, while the other model purposely misbehaves in scenarios where it is unlikely to be caught. The benchmark task is to identify the misbehaving model using only inputs where both models behave identically. The research paper tests five different detection strategies, one of which is able to identify 98% of the alignment fakers.

πŸ’‘Why?: To assess the systemic biases perpetuated by LLMs in modern societies, particularly in user-facing applications and enterprise tools. These biases can harm marginalized groups and lead to discrimination and exclusion.

πŸ’»How?: Paper introduces the Covert Harms and Social Threats (CHAST) metrics, which are based on social science literature. These metrics provide a more comprehensive and nuanced understanding of the various forms in which harms can manifest in LLM-generated conversations. To evaluate the presence of covert harms, the researchers use evaluation models aligned with human assessments. This allows for a more accurate detection of harmful views expressed in seemingly neutral language.

πŸ’‘Why?: Sets benchmarks for LLMs in program repair.

πŸ’»How?: Paper proposes a novel educational program repair benchmark that curates two high-quality programming datasets and introduces a unified evaluation procedure with a new evaluation metric called rouge@k to measure the quality of repairs. This benchmark will facilitate the comparison of different approaches in program repair and help establish baseline performance.
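The paper's exact definition of rouge@k is not reproduced here; one plausible reading, by analogy with pass@k, is the best ROUGE-L score among k sampled repairs. A sketch under that assumption, using the rouge-score package:

```python
# Assumed reading of rouge@k (our illustration, by analogy with pass@k): the
# best ROUGE-L F-measure among k candidate repairs of a buggy program.
from rouge_score import rouge_scorer

def rouge_at_k(reference_fix: str, candidate_fixes: list[str]) -> float:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    return max(
        scorer.score(reference_fix, cand)["rougeL"].fmeasure
        for cand in candidate_fixes
    )

reference = "def area(r):\n    return 3.14159 * r * r"
candidates = [
    "def area(r):\n    return 3.14159 * r ** 2",
    "def area(r):\n    return r * r",
]
print(rouge_at_k(reference, candidates))
```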

πŸ’‘Why?: Benchmarks how fairly LLMs can negotiate within a game-theoretic framework.

πŸ’»How?: The research paper proposes to use simulations and gradient-boosted regression with Shapley explainers to measure the impact of different personality traits on negotiation behaviour and outcomes. It works by analyzing data from 1,500 simulations of single-issue and multi-issue negotiations and identifying patterns between personality traits and negotiation tendencies.
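As a minimal sketch of this analysis pipeline, with synthetic stand-in data replacing the 1,500 simulations: fit a gradient-boosted regressor from personality traits to negotiation payoff, then attribute predictions to traits with SHAP.

```python
# Hypothetical sketch: relate personality traits to negotiation outcomes with
# gradient-boosted regression, then attribute outcomes to traits via SHAP.
# The synthetic traits/payoffs below stand in for real simulation logs.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
traits = rng.uniform(0, 1, size=(1500, 5))  # e.g. Big Five scores per simulated agent
payoff = traits @ np.array([0.5, -0.2, 0.1, 0.4, 0.0]) + rng.normal(0, 0.1, 1500)

model = GradientBoostingRegressor().fit(traits, payoff)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(traits)  # per-trait contribution to each prediction
print(np.abs(shap_values).mean(axis=0))      # global importance of each trait
```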

πŸ“šWant to read more? Survey papers

The research paper proposes a comprehensive analysis of LLM tokenizers, using a combination of tokenizer analysis, model weight-based indicators, and prompting techniques. This allows automatic detection of problematic tokens, which can then be addressed to improve the efficiency and safety of language models.
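One weight-based indicator from this line of work checks how far each token's unembedding row sits from the mean row, since rarely updated tokens tend to cluster near it. A minimal sketch, illustrative rather than the paper's exact procedure:

```python
# Illustrative weight-based indicator: tokens whose unembedding rows stay close
# to the mean of all rows were likely rarely updated during training, making
# them candidate under-trained ("glitch") tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM with an accessible unembedding matrix
model = AutoModelForCausalLM.from_pretrained(name)
tok = AutoTokenizer.from_pretrained(name)

W = model.get_output_embeddings().weight.detach()  # (vocab, hidden)
norms = (W - W.mean(dim=0)).norm(dim=1)            # distance from the mean row

# Flag the lowest-norm tokens as candidates for closer inspection.
for i in norms.argsort()[:10].tolist():
    print(i, repr(tok.decode([i])), float(norms[i]))
```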

🌈 Creative ways to use LLMs!!

πŸ’‘Why?: Can we enhance conventional ML algorithms with LLMs?

πŸ’»How?: The research paper proposes to use pre-trained LLMs to improve the prediction performance of classical machine learning estimators. This is achieved by integrating LLM into the estimator and utilizing its ability to simulate various scenarios and generate output based on specific instructions and multimodal input. The proposed approaches are tested on both standard supervised learning binary classification tasks and a transfer learning task with distribution changes in the test data compared to the training data.
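A hypothetical sketch of one such integration: treat an LLM's zero-shot judgment as an extra feature for a classical classifier. The llm_score helper is a stand-in for a real model call, not the paper's method.

```python
# Hypothetical integration sketch: augment a classical classifier with an
# LLM-derived feature. `llm_score` stands in for a real LLM call that rates,
# say, how positive a review sounds on a 0-1 scale.
import numpy as np
from sklearn.linear_model import LogisticRegression

def llm_score(text: str) -> float:
    return 1.0 if "great" in text.lower() else 0.0  # toy stand-in

texts = ["great product", "broke after a day", "great value", "never again"]
labels = np.array([1, 0, 1, 0])

lengths = np.array([[len(t)] for t in texts], dtype=float)  # classical feature
llm_feat = np.array([[llm_score(t)] for t in texts])        # LLM-derived feature
X = np.hstack([lengths, llm_feat])

clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```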

πŸ’‘Why?: Can we use LLMs to locate orphaned wells, i.e., abandoned oil and gas wells that pose environmental risks if they are not properly located and plugged?

πŸ’»How?: The paper proposes a RAG setup to extract this information from unstructured historical documents!
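A minimal RAG sketch under our own assumptions (toy passages, a TF-IDF retriever, and the LLM call left as a stub), not the paper's actual pipeline:

```python
# Minimal RAG sketch: retrieve the most relevant passages from (toy) historical
# records, then build a prompt asking an LLM to extract well locations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Well No. 4 drilled 1931, two miles north of Salem, plugged 1952.",
    "Lease map shows derrick at NE corner of section 12, township 8N.",
    "Annual report 1948: production ceased at the Miller farm well.",
]
vec = TfidfVectorizer().fit(passages)
doc_matrix = vec.transform(passages)

def retrieve(query: str, k: int = 2) -> list[str]:
    sims = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    return [passages[i] for i in sims.argsort()[::-1][:k]]

context = "\n".join(retrieve("location of abandoned well near Salem"))
prompt = f"Using only the context below, list any well locations.\n\n{context}"
# `prompt` would then be sent to an LLM of choice for extraction.
print(prompt)
```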

πŸ’‘Why?: The research paper addresses the issue of efficient communication within System of Systems (SoS), which is a complex and dynamic network of constituent systems called holons. In order to ensure interoperability and collaborative functioning among these holons, there is a need for improved communication mechanisms.

πŸ’»How?: The research paper proposes an innovative solution to enhance holon communication within SoS by integrating Conversational Generative Intelligence (CGI) techniques. This approach utilizes advancements in CGI, specifically LLMs, to enable holons to understand and act on natural language instructions. By leveraging CGI, holons can have more intuitive interactions with humans, improving social intelligence and ultimately leading to better coordination among diverse systems.

πŸ’‘Why?: The paper develops an effective way to analyze large amounts of qualitative data from online discussion forums.

πŸ’»How?: The research paper proposes a novel LLM-based framework called QuaLLM. The framework uses a unique prompting methodology and evaluation strategy to analyze and extract quantitative insights from text data on online forums. It automatically processes the data and generates human-readable outputs, eliminating the need for significant human effort.

πŸ“ŠResults: The research paper successfully applied the QuaLLM framework to analyze over one million comments from two of Reddit's rideshare worker communities. The framework sets a new precedent for AI-assisted quantitative data analysis and has the potential to improve the efficiency and accuracy of analyzing data from online forums.
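A hypothetical sketch of such a prompting pipeline: batch comments into a structured prompt, have an LLM tag each with a concern category, and count the tags. The call_llm stub and the categories are our assumptions, not QuaLLM's.

```python
# Hypothetical QuaLLM-style pipeline: prompt an LLM to categorize forum
# comments, then aggregate the tags into quantitative insights.
import json
from collections import Counter

PROMPT = """For each numbered comment, return JSON mapping its number to one
concern: "pay", "safety", or "other".

{comments}"""

def call_llm(prompt: str) -> str:
    return '{"1": "pay", "2": "safety"}'  # canned stand-in for a real model call

comments = ["Fares dropped again this week.", "Had a scary pickup last night."]
numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(comments))
tags = json.loads(call_llm(PROMPT.format(comments=numbered)))
print(Counter(tags.values()))  # quantitative insight: concern frequencies
```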

πŸ€– LLMs for robotics & VLMs

πŸ’‘Why?: Can we leverage LLMs for object navigation within complex scenes? Specifically, how can language information be effectively represented and utilized for this task?

πŸ’»How?: The research paper proposes a novel language-driven object-centric image representation, called LOC-ZSON, which is specifically designed for object navigation. This representation is used to fine-tune a visual-language model (VLM) and handle complex object-level queries. In addition, the paper also introduces a novel LLM-based augmentation and prompt templates to improve training stability and zero-shot inference. The proposed method is implemented on the Astro robot and tested in both simulated and real-world environments.
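As a hypothetical illustration of LLM-based query augmentation (the templates and the llm_paraphrase stand-in below are our assumptions, not LOC-ZSON's actual method):

```python
# Hypothetical sketch: expand each object query into varied phrasings before
# training/inference, to make the model robust to how users ask for objects.
import random

TEMPLATES = [
    "find the {obj}",
    "navigate to the {obj} in the room",
    "where is the {obj}?",
]

def llm_paraphrase(query: str) -> str:
    return query.replace("the", "a")  # stand-in for an actual LLM rewrite

def augment(obj: str, n: int = 4) -> list[str]:
    base = [t.format(obj=obj) for t in TEMPLATES]
    extra = [llm_paraphrase(random.choice(base)) for _ in range(n - len(base))]
    return base + extra

print(augment("red mug"))
```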

πŸ’‘Why?: The research paper addresses the problem of limited dexterity and autonomy in robotic surgical assistants.

πŸ’»How?: The paper proposes SuFIA, a framework that combines natural language processing with perception modules to enable high-level planning and low-level control of a surgical robot. It uses LLMs to reason and make decisions, allowing for a learning-free approach to surgical tasks without the need for examples or motion primitives. The framework also incorporates a human-in-the-loop paradigm, giving control back to the surgeon when necessary to mitigate errors and ensure mission-critical tasks are completed successfully.

πŸ“ŠResults: The paper evaluates SuFIA on four surgical sub-tasks in a simulation environment and two sub-tasks on a physical surgical robotic platform in the lab. The results demonstrate that SuFIA is able to successfully perform common surgical tasks with supervised autonomous operation, even under challenging physical and workspace conditions.
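A hypothetical sketch of such a human-in-the-loop control loop: an LLM planner proposes the next sub-task and hands control back to the surgeon when unsure. The llm_plan and execute helpers are stand-ins, not SuFIA's actual interfaces.

```python
# Hypothetical SuFIA-style loop: an LLM planner proposes sub-tasks; when it
# signals low confidence, control returns to the surgeon.
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    confident: bool

def llm_plan(task: str, history: list[str]) -> Step:
    # Stand-in: a real system would prompt an LLM with the task and scene state.
    if "needle" in task and not history:
        return Step("grasp_needle", confident=True)
    return Step("hand_off_to_surgeon", confident=False)

def execute(action: str) -> None:
    print(f"executing low-level primitive: {action}")

history: list[str] = []
while True:
    step = llm_plan("hand the needle to the other arm", history)
    if not step.confident:
        print("low confidence: returning control to the surgeon")
        break
    execute(step.action)
    history.append(step.action)
```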
