• LLMs Research
  • Posts
  • LLMs related research papers published on May 13th, 2024

LLMs related research papers published on May 13th, 2024

Covers research papers evaluting LLMs reasoning capabilities, new distillation to improve performance, VLLMs for gesture detection, and automating warehouse work, and many more!

In partnership with

πŸ”‘ takeaway from today’s newsletter

  • Can LLMs truly reason the tasks or just memorizes the instructions?

  • A new distillation approach to improve the performance of LLMs

  • A new paper swapping LLM tokenizer to enable multi-linguistic features ability!

  • LLMs can now understands the network flow data to detect carpet bombing DDoS

  • VLLMs for gesture detection, and automating warehouse work

πŸ”¬Core research improving LLMs!

πŸ’‘Why?: We all ask if LLMs have theory of mind (ToM), which is the ability to understand and reason about the mental and emotional states of others.

πŸ’»How?: The research paper identifies key areas where LLM ToM will show up in human-LLM interactions at both individual and group levels. It works by analyzing the role and impacts of human ToM and then applying it to LLMs. This helps understand how LLMs can be aligned with human values, especially as they become more integrated into our personal, professional, and social lives.

πŸ“ŠResults: By understanding how LLMs can develop ToM and how it can affect human-LLM interactions, we can better align LLMs with human values and potentially improve their performance in decision-making processes.

πŸ’‘Why?: The research paper addresses the problem of effectively learning prompts for prompt-tuning in low-shot scenarios and large class spaces.

πŸ’»How?: The research paper proposes a method that leverages class descriptions from LLMs to construct part-level description-guided views of both image and text features. These features are then aligned to learn more generalizable prompts, addressing the issue of overfitting in low-shot scenarios and decreased performance in large class spaces.

πŸ“ŠResults: The research paper achieved substantial improvements in performance compared to established methods, as demonstrated by comprehensive experiments conducted across 11 benchmark datasets.

Friedrich-Schiller-Universita ̈t Jena, Leipzig University, CSIRO, University of Queensland, Bauhaus-Universita ̈t Weimar, University of Kassel, hessian.AI, ScadDS.AI
A Systematic Investigation of Distilling Large Language Models into Cross-Encoders for Passage Re-ranking - code

πŸ’‘Why?: The research paper tries to improve the effectiveness and efficiency of LLMs as re-rankers in information retrieval tasks.

πŸ’»How?: The research paper proposes a distillation process in which a smaller model is trained to mimic the behaviour of a larger model. In this case, the researchers aim to distill the knowledge from large language models into cross-encoders, which are more efficient models that can re-rank search results. This is achieved by creating a new distillation dataset, named Rank-DistiLLM, which is used to train the cross-encoders. The dataset incorporates insights from fine-tuning cross-encoders on manually labeled data, such as hard-negative sampling, deep sampling, and listwise loss functions. By training the cross-encoders on this dataset, they are able to reach the effectiveness of large language models while being orders of magnitude more efficient.

πŸ“ŠResults: The research paper does not provide specific performance improvement results, but it is stated that the cross-encoders trained on Rank-DistiLLM are able to reach the effectiveness of large language models while being more efficient. This means that the cross-encoders are able to achieve similar performance to large language models

University of Cambridge, University of Edinburgh
Zero-Shot Tokenizer Transfer

πŸ’‘Why?: The research paper addresses the problem of limited flexibility in LLMs due to their tokenizer, which maps raw text to a sequence of vocabulary items. This restricts their performance in languages other than the one they were primarily trained on.

πŸ’»How?: The research paper proposes a solution called Zero-Shot Tokenizer Transfer (ZeTT), which allows for the swapping of the original LM tokenizer with an arbitrary one without degrading performance. This is achieved by training a hypernetwork that takes a tokenizer as input and predicts corresponding embeddings for the tokens in the vocabulary. This allows for the LM to adapt to different tokenizers and improve efficiency in languages other than the one it was trained on.

πŸ“ŠResults: The research paper shows empirical evidence that the hypernetwork generalizes to new tokenizers and achieves close to the original model’s performance in cross-lingual and coding tasks. It also demonstrates that the remaining performance gap can be quickly closed by continued training on less than 1B tokens.

Salesforce AI Research, University of Illinois Urbana-Champaign
RLHF Workflow: From Reward Modeling to Online RLHF - code

πŸ’‘Why?: The research paper addresses the problem of how to implement Online Iterative Reinforcement Learning from Human Feedback (RLHF) in a practical and reproducible manner. This is a widely studied topic in the large language model (LLM) literature, but existing open-source projects are limited to offline learning settings, making it difficult to apply in real-world scenarios.

πŸ’»How?: The research paper proposes a detailed recipe for implementing online iterative RLHF, which involves constructing preference models using a diverse set of open-source datasets and using these models to approximate human feedback. The theoretical insights and algorithmic principles behind this approach are also discussed in the paper. This method allows for the use of open-source datasets in place of actual human feedback, making it more feasible for communities with limited resources.

πŸ“ŠResults: The research paper reports impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. The trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieved state-of-the-art performance, demonstrating the effectiveness of the proposed approach. Additionally, the research paper provides access to the models

πŸ’‘Why?: The research paper addresses the problem of imbalance in per-class prediction accuracy in language models. This issue is especially prevalent in LLMs and can obscure the true accuracy of the model.

πŸ’»How?: The research paper proposes a solution called Debiasing as Nonlinear Integer Programming (DNIP). This approach involves reconceptualizing the accuracy imbalance as the Contextual Oddity Bias (COBias) and then using nonlinear integer programming (NIP) to debias the model. In simpler terms, the researchers use a new metric, COBias, to identify and correct for differences in accuracy between classes in the language model. This is done by optimizing the model's performance using simulated annealing, a type of optimization algorithm.

πŸ“ŠResults: The research paper reports significant improvements in both COBias reduction (by 27%) and overall accuracy (by 12%) on three different LLMs and seven natural language processing (NLP) classification tasks. This suggests that the proposed DNIP approach is effective in improving the accuracy and reliability of LLM predictions.

πŸ§ͺ LLMs evaluations

Microsoft, Snowflake, Amazon, University College London
Synthetic Test Collections for Retrieval Evaluation

πŸ’‘Why?: The research paper addresses the challenge of obtaining diverse user queries and relevance judgments for test collection construction in information retrieval systems.

πŸ’»How?: The paper proposes to use LLMs to generate synthetic datasets, including both queries and relevance judgments, for constructing test collections. This approach takes advantage of the capabilities of LLMs to mimic human language and generate realistic synthetic data.

πŸ“ŠResults: The paper does not mention any specific performance improvement achieved, but it demonstrates that the use of LLMs can reliably construct synthetic test collections for retrieval evaluation. This suggests that the proposed approach has the potential to improve the performance of ranking models in information retrieval.

Stanford University, Johns Hopkins University,Hospital Israelita Albert Einstein
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

πŸ’‘Why?: Paper evaluates LLMs in clinical care, specifically in the context of interactive decision-making that is required in real-life clinical work.

πŸ’»How?: The research paper proposes a benchmark called AgentClinic, which consists of two open benchmarks: a multimodal image and dialogue environment (AgentClinic-NEJM) and a dialogue-only environment (AgentClinic-MedQA). In this benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. The benchmark also incorporates cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions. The research paper also evaluates a suite of state-of-the-art LLMs to see how they perform in this benchmark.

Georgia Institute of Technology
EconLogicQA: A Question-Answering Benchmark for Evaluating Large Language Models in Economic Sequential Reasoning
[Dataset: HuggingFace] - This paper proposes EconLogicQA, a benchmark that presents challenging multi-event scenarios derived from economic articles. It requires models to not only predict subsequent events, but also sequence multiple interconnected events, capturing the complexity of economic logics. It works by evaluating the performance of various leading-edge LLMs on the benchmark dataset.

πŸ’‘Why?: The research paper addresses the lack of a clear and comprehensive criterion for evaluating the faithfulness of persona-driven role-playing (PRP) AI characters in responding to user queries.

πŸ’»How?: The research paper proposes a fine-grained and explainable criterion called Active-Passive-Constraint (APC) score, which takes into account both active and passive constraints in determining the AI character's response. This is achieved by first identifying the relevance of persona statements to the user query, then incorporating all constraints in a mathematically formulated score based on natural language inference (NLI) scores. This score is then used as a reward system in direct preference optimization (DPO) to improve the AI character's performance. The APC scoring system is also built using small discriminators from GPT-4 for efficiency.

πŸ“ŠResults: The research paper validates the quality of the APC score through human evaluation and shows a high correlation with example personas. It also compares the performance of existing PRP techniques and shows the advantages and limitations of each. The APC-based DPO is found to be one of the most competitive techniques for sticking with all constraints and can be well incorporated with other techniques. The experiments are further extended to real persons with consistent results.

Hire a world class AI team

Engineers who understand AI are expensive and difficult to find, and it can be hard to figure out who to trust. On top of that, 85% of all AI projects fail.

But AE Studio succeeds.

We listen to your business challenge and help you craft and implement the optimal AI solution with our team of world class AI experts from Harvard, Stanford and Princeton.

Our development, design, and data science teams work closely with founders and executives to create custom software and AI solutions that get the job done. The secret to our success is treating your project as if it were our own startup.

🧯Let’s make LLMs safe!!

This research paper: Many-Shot Regurgitation (MSR) Prompting addresses the issue of verbatim content reproduction in large language models (LLMs) and the potential implications this has on privacy and security. proposes a new black-box membership inference attack framework called Many-Shot Regurgitation (MSR) prompting. This framework involves dividing input text into segments and creating a single prompt that simulates a conversation between a user and a language model, eliciting verbatim reproduction of the input text. This is done to examine the extent to which LLMs can reproduce verbatim content, particularly from sources they were trained on. The framework is applied to various text sources, including Wikipedia articles and open educational resources (OER) textbooks, to gather data and analyze the frequency and distribution of verbatim matches.

This research paper PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition proposes a solution to this problem by augmenting the LLM with a dedicated "safeguard" that checks the model's inputs and outputs for any undesired behaviour. This approach uses the LLM itself as the safeguard, avoiding the need for finetuning or white box access to the model. This is achieved through a method called PARDEN, which simply asks the model to repeat its own outputs. This avoids the domain shift caused by other methods, such as prompting the model to self-classify toxic content.

πŸ“ŠResults: The research paper shows empirical evidence that PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2. In particular, it achieves a roughly 11x reduction in False Positive Rate (FPR) at a high True Positive Rate (TPR) of 90% for Llama2-7B on the harmful behaviors dataset.

This research paper Backdoor Removal for Generative Large Language Models solves the backdoor attacks on generative large language models, which can result in malicious responses being generated when certain backdoor triggers are activated.

πŸ’»How?: The research paper proposes a two-stage framework called Simulate and Eliminate (SANDE) to address this problem. The first stage involves Overwrite Supervised Fine-tuning (OSFT), where known backdoor triggers are overwritten with safe data during the fine-tuning process. In the second stage, SANDE uses OSFT to remove unknown backdoor triggers by simulating potential triggers and eliminating them from the model. This ensures that the model behaves normally even when the triggers are activated.

🌈 Creative ways to use LLMs!!

LLMs in academic research reference management tool! presenting PyZoBot: A Platform for Conversational Information Extraction and Synthesis from Curated Zotero Reference Libraries through Advanced Retrieval-Augmented Generation which integrates traditional reference management software with advanced computational techniques. Specifically, it introduces PyZoBot, an AI-driven platform developed in Python that combines Zotero's reference management with OpenAI's sophisticated LLMs. This platform streamlines knowledge extraction and synthesis from extensive human-curated scientific literature databases. It does so by handling complex natural language queries, integrating data from multiple sources, and presenting references in a meticulous manner to uphold research integrity and facilitate further exploration.

This paper tries to utilize LLMs to decipher semantic information from fMRI signals. Open-vocabulary Auditory Neural Decoding Using fMRI-prompted LLM - The research paper proposes a novel method called the Brain Prompt GPT (BP-GPT) which uses the brain representation extracted from fMRI as a prompt and leverages GPT-2 to decode the fMRI signals into stimulus text. It also introduces a text-to-text baseline and aligns the fMRI prompt with the text prompt, allowing for a more robust brain prompt and promoting the decoding of pre-trained LLM.

πŸ’‘Why?: The research paper addresses the problem of detecting unknown malicious flows in network data, specifically focusing on the increasing threat of Carpet Bombing as a DDoS attack.

πŸ’»How?: The research paper proposes a model called DoLLM, which utilizes open-source LLMs as its backbone. This model reorganizes non-contextual network flows into Flow-Sequences and uses LLMs contextual understanding to extract flow representations in overall network context. These representations are then used to improve the performance of DDoS detection.

πŸ“ŠResults: The research paper has achieved significant performance improvements in both zero-shot scenarios and real ISP traces. In zero-shot scenarios, the F1 score increased by up to 3

πŸ’‘Why?: The research paper addresses the problem of detecting anomaly edges in dynamic graphs, particularly in scenarios where there is limited labeled data for each type of anomaly.

πŸ’»How?: The research paper proposes a method called AnomalyLLM, which utilizes the knowledge encoded in LLMs to detect anomaly edges. It achieves this by pre-training a dynamic-aware encoder to generate representations of edges and using prototypes of word embeddings to reprogram the edges. Additionally, the paper introduces an in-context learning framework that incorporates information from a few labeled samples to achieve few-shot anomaly detection.

πŸ“ŠResults: The research paper reports significant improvements in the performance of few-shot anomaly detection using AnomalyLLM. Additionally, it also achieves superior results on new anomalies without the need for updating model parameters.

πŸ€–LLMs for robotics & VLLMs

πŸ’‘Why?:  Address the labor-intensive and inflexible programming of robots for adaptive assembly in the manufacturing industry, despite technological advancements.

πŸ’»How?: Paper uses LLMs by using natural language instructions to generate code for programming robots, allowing for task-specific code to be written quickly and efficiently. The system also incorporates strategies for task decomposition and code generation, as well as a simulated workcell for testing and debugging.

πŸ’‘Why?: The research paper aims to find a more efficient and effective method for generating realistic and appropriate gestures that can enhance the engagement of interactive agents.

πŸ’»How?: The research paper proposes using LLM (Lexical Lab Motion) features extracted from text using LLAMA2 for gesture generation instead of the traditional audio-driven approach. LLMs provide rich encodings of speech-related content, which are then used by the model to generate both beat and semantic gestures. This is achieved by comparing the performance of LLM features against audio features and exploring the combination of both modalities in objective tests and a user study.

πŸ“ŠResults: The research paper shows that using LLM features for gesture generation performs significantly better than using audio features alone. It also demonstrates that the combination of both modalities does not yield any significant improvement over using LLM features in isolation. This suggests that LLMs can provide a more suitable and efficient encoding for gesture generation in character animation.

Reply

or to participate.