LLMs Research
Posts
LLMs related research papers published on May 7th, 2024

LLMs related research papers published on May 7th, 2024

Newsletter covering today's research paper proposing groun breaking research to improve LLMs computation, innovative applications and approaches to make LLMs safe to use against jailbreaking attacks

May 10, 2024

🔬Core research improving LLMs!

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Nvidia MIT IBM 🔥

🔗GitHub: https://github.com/mit-han-lab/qserve

🤔 Why?: Existing INT4 quantization techniques failing to deliver performance gains in large-batch, cloud-based language model serving due to significant runtime overhead on GPUs.

💻 How?: The research paper proposes a new quantization algorithm, QoQ, which stands for quattuor-octo-quattuor, that uses 4-bit weight, 8-bit activation, and 4-bit KV cache. This algorithm is implemented in the QServe inference library and aims to reduce dequantization overhead on GPUs by introducing progressive quantization. Additionally, the research paper introduces SmoothAttention to mitigate accuracy degradation caused by 4-bit KV quantization. QServe also performs compute-aware weight reordering and utilizes register-level parallelism to reduce dequantization latency. Finally, QServe makes use of fused attention memory-bound to further improve performance.

🦾 Performance gain: The research paper achieves significant performance improvements compared to existing techniques. QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen1.5-72B by 2.4x on A100.

xLSTM: Extended Long Short-Term Memory must-read

🤔 Why?: Paper addresses the limitations of Long Short-Term Memory (LSTM) models in language modeling, particularly when compared to newer technologies like Transformers.

💻 How?: The research paper proposes two solutions to improve the capabilities of LSTM models. Firstly, they introduce exponential gating with normalization and stabilization techniques to improve memory efficiency. Secondly, they modify the LSTM memory structure, creating two new models: sLSTM with a scalar memory and update, and mLSTM with a matrix memory and update. These modifications allow for better parallelization and improved memory capabilities.

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

🤔 Why?: The research paper addresses the problem of efficient use of GPU memory for high throughput LLM inference. Previous systems have reserved memory for the KV-cache ahead-of-time, resulting in wasted capacity due to internal fragmentation.

💻 How?: The research paper proposes vAttention, a dynamic KV-cache memory management system. Unlike previous systems, vAttention retains KV-cache in contiguous virtual memory and leverages low-level system support for demand paging. This enables on-demand physical memory allocation, unburdening the attention kernel developer from explicitly supporting paging and avoiding re-implementation of memory management in the serving framework.

📊 Results: Tokens generation upto 1.97x faster then vLLMs.

Optimizing Language Model's Reasoning Abilities with Weak Supervision

🤔 Why?: LLMs scaling issue due to their dependence on extensively annotated datasets, which are time-consuming and costly to create.

💻 How?: The paper proposes a self-reinforcement approach to enhance LLMs' reasoning abilities with minimal human supervision. This approach involves first fine-tuning the model using a small set of annotated questions, and then iteratively improving it by learning from the differences in responses between the fine-tuned and unfinetuned models on unlabeled questions. This allows for more efficient use of data and reduces reliance on extensive human-annotated explanations.

FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference

🤔 Why?: Low inference efficiency in LLMs when using retrieved documents from an external corpus. This leads to high runtime and increased inferential cost.

💻 How?: The research paper proposes a solution called Retrieval-Augmented Language Modeling (RALM) which integrates LLMs with relevant documents from an external corpus. This is achieved through a modular approach called FlashBack, which appends retrieved documents at the end of the context instead of simply prepending them. This allows for efficient utilization of the Key-Value (KV) cache and improves inference efficiency. The LLMs are also fine-tuned without losing their knowledge integrity.

📊 Results: The research paper provides an experiment which shows that FlashBack is up to 4x faster than the prepending method when using a 7B LLM (Llama 2). This results in significantly faster inference speed and reduced inferential cost.

Locally Differentially Private In-Context Learning

To overcome LLMs memorization issue, paper proposes a locally differentially private framework of in-context learning (LDP-ICL). This framework treats LLMs as untrusted in terms of privacy and uses gradient descent mechanisms to incorporate privacy into the in-context learning process. This is achieved by adding noise to the gradients during training, which ensures that sensitive information is not leaked from the model. This approach works by balancing the trade-off between privacy and utility, allowing for the use of LLMs for specific tasks while still protecting the privacy of sensitive data.

Long Context Alignment with Short Instructions and Synthesized Positions

🤔 Why?: Enabling long contexts in LLMs.

💻 How?: Paper proposes a new technique called Step-Skipping Alignment (SkipAlign) which strategically inserts skipped positions within instruction-following samples, which utilizes the semantic structure of the data to effectively expand the context. This allows the model to synthesize long-range dependencies from the aspect of position indices, rather than just expanding the length of input samples. This technique does not require additional efforts beyond training with original data length, making it a more efficient solution for handling long context.

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

The research paper proposes Coupled Quantization (CQ) which exploits the inter-dependency between multiple key/value channels to compress the KV cache in a more information-efficient manner. This is achieved by coupling the channels together and encoding the activations jointly, rather than separately. This results in a more compact representation of the data, reducing the memory usage and improving inference latency.

🧪 LLMs evaluations

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

The Silicone Ceiling: Auditing GPT's Race and Gender Biases in Hiring

D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models

🧯Let’s make LLMs safe!!

Unveiling Disparities in Web Task Handling Between Human and Web Agent

A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI

Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks

Paper explores the impact of "jailbreaking" on three state-of-the-art VLMs, each using a different modeling approach. By comparing each VLM to their respective LLM backbone, the paper finds that VLMs are more susceptible to jailbreaking. This is due to the visual instruction-tuning process, which can inadvertently weaken the LLM's safety guardrails.

Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation

Will LLM agents deceive humans? Research paper introduces a novel testbed framework that simulates a goal-driven environment and uses reinforcement learning to build deceptive capabilities in LLM agents. This framework is based on theories from language philosophy and cognitive psychology

Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore LLM detection

The research paper proposes a black-box zero-shot detection approach that relies on the observation that human-written texts tend to have more grammatical errors than LLM-generated texts. This approach involves calculating the Grammar Error Correction Score (GECScore) for a given text and using it to distinguish between human-written and LLM-generated text. It works by leveraging the difference in grammatical errors between the two types of text.

A Causal Explainable Guardrails for Large Language Models

The research paper proposes a novel framework called LLMGuardaril, which incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. This framework systematically identifies and blocks the confounding effects of biases, enabling the extraction of unbiased steering representations. Additionally, it includes an explainable component that provides insights into the alignment between the generated output and the desired direction. This approach aims to mitigate biases and steer LLMs towards desired attributes.

🌈 Creative ways to use LLMs!!

Toward In-Context Teaching: Adapting Examples to Students' Misconceptions

🤔 Why?: How to use LLMs to effectively teach students through providing informative examples that can adapt to their changing state of knowledge.

💻 How?: The paper introduces a suite of models and evaluation methods called AdapT, which includes simulated Bayesian student models and a platform for evaluation with human students. Additionally, the paper introduces AToM, a new probabilistic model for adaptive teaching that infers students' past beliefs and optimizes for future correctness.

Granite Code Models: A Family of Open Foundation Models for Code Intelligence New model👇 code generation

Paper introduces the Granite series of decoder-only code models, which are trained with code written in 116 programming languages. These models have a wide range of capabilities, including code generation, bug fixing, explanation and documentation, and maintaining repositories. They are optimized for enterprise software development workflows and can perform well across different coding tasks.

Enhancing the Efficiency and Accuracy of Underlying Asset Reviews in Structured Finance: The Application of Multi-agent Framework

The research paper proposes to integrate AI with traditional asset review processes to improve efficiency and accuracy. This is achieved by using both open-sourced and close-sourced LLMs to automate the verification process. The open-sourced model, LLAMA3, is a cost-effective alternative while the close-sourced model, GPT-4, shows superior performance. The use of dual-agent systems further increases accuracy, but at a higher operational cost.

Semantic API Alignment: Linking High-level User Goals to APIs
agents

The research paper proposes a system architecture called Semantic API Alignment (SEAL) where LLM-powered "agents" match high-level objectives with appropriate API calls. This system could automate programming tasks by finding matching links or explaining mismatches to guide manual intervention or further development. It works by utilizing existing libraries and applying LLMs to Goal-Oriented Requirements Engineering (GORE) through sub-goal analysis, specifically in aligning with REST API specifications.

Iterative Experience Refinement of Software-Developing Agents agents

The research paper proposes the Iterative Experience Refinement framework, which enables LLM agents to refine experiences during task execution. This is achieved through two fundamental patterns: the successive pattern, where experiences are refined based on nearest experiences within a task batch, and the cumulative pattern, where experiences are acquired across all previous task batches. Additionally, the method utilizes heuristic experience elimination to prioritize high-quality and frequently-used experiences, effectively managing the experience space and enhancing efficiency.

NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions

The research paper proposes to merge the strengths of classical planners and LLMs by creating NL2Plan, a domain-agnostic offline planning system. NL2Plan uses an LLM to extract necessary information from a short text prompt, which is then used to create a complete PDDL representation of the domain and problem. This PDDL description is then solved by a classical planner. NL2Plan also allows for user inspection and correction of intermediate results, increasing explainability and making it an assistive tool for PDDL creation.

Sketch Then Generate: Providing Incremental User Feedback and Guiding LLM Code Generation through Language-Oriented Code Sketches code generation

The research paper proposes an interactive approach called Language-Oriented Code Sketching to solve this problem. This approach provides instant, incremental feedback in the form of code sketches, which are incomplete code outlines, during prompt crafting. It converts a prompt into a code sketch by leveraging the inherent linguistic structures within the prompt and applying classic natural language processing techniques. The sketch serves as an intermediate placeholder that not only previews the intended code structure but also guides the LLM towards the desired code, thereby enhancing human-LLM interaction.

Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application

Research paper proposes an approach called Llm-driven knowlEdge Adaptive RecommeNdation (LEARN) framework. This framework integrates open-world domain knowledge by leveraging the capabilities of LLMs that are pretrained on massive text corpus. This allows for a more comprehensive understanding of the items and users, leading to enhanced recommendations. To address computational complexity concerns, pretrained LLMs are used as item encoders and their parameters are frozen to preserve open-world knowledge and avoid catastrophic forgetting. Additionally, a twin-tower structure is designed to bridge the gap between the open-world and collaborative domains, tailored for practical industrial application.

ERATTA: Extreme RAG for Table To Answers with Large Language Models

The research paper proposes a unique LLM-based system that utilizes multiple LLMs to enable various tasks such as data authentication, user query routing, data retrieval, and custom prompting for question answering from large and diverse data tables. The system is specifically designed for enterprise-level data products and is capable of providing real-time responses in under 10 seconds. It consists of a four-prompt process where the first prompt authenticates the user-to-data, followed by three prompts for routing, fetching, and generating natural language responses. The system also includes a five metric scoring module to detect and report any hallucinations in the LLM responses. This system can achieve >90% confidence scores in sustainability, financial health, and social media domains. The research paper also suggests that further development of extreme RAG architectures can enable query processing from heterogeneous data sources using LLMs.

Codexity: Secure AI-assisted Code Generation

🤔 Why?: To address the vulnerabilities caused by current AI programming agents

💻 How?: The research paper proposes a solution called Codexity, which is a security-focused code generation framework integrated with five LLMs. It leverages feedback from static analysis tools like Infer and CppCheck to mitigate security vulnerabilities in LLM-generated programs. Essentially, Codexity uses the power of LLMs to generate code, but also incorporates the checks and balances of static analysis tools to ensure that the code is secure.

📊 Results: The research paper evaluated Codexity in a real-world benchmark with 751 automatically generated vulnerable subjects and found that it was able to prevent 60% of the vulnerabilities from being exposed to the software developer. This is a significant improvement in terms of security, as it reduces the potential risks and threats that may arise from using AI programming assistants.

🤖 Robotics | MLLMs

ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning [Project page: https://chathuman.github.io/ ] 🔥🔥🔥

ChatHuman takes input text queries, images, or other 3D human-related modalities such as vectors like SMPL pose. Then, based on the user query, ChatHuman adopts a paper-based RAG mechanism to generate a textual response about the tool use and call the tools. Finally, the tool results are transformed into a textual or visual format and fed into the multimodal LLM-based agent, which will incorporate the tool results with its generic world knowledge to generate a response in the form of text, images, or other modalities related to 3D humans.

In-context Learning for Automated Driving Scenarios

The research paper proposes an innovative approach that utilizes LLMs to optimize RL reward functions in a human-centric way. This is done by inputting instructions and dynamic environment descriptions into the LLM, which then assists in generating rewards that steer RL agents towards more human-like driving patterns. This approach makes use of prompt design strategies for reward-proxy and reward-shaping, which have a significant impact on the behavior of AD vehicles.

Reply

or to participate.