RecNet
Do LLMs Exhibit Human-like Response Biases? A Case Study in Survey Design
https://arxiv.org/abs/2311.04076
Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig
Nov, 2023
Read
Yilun Hua recommended on 4/2/2024
Tests whether LLMs show human-like response behaviors when answering survey questionnaires, depending on how the questions are worded. Shows that LLMs, especially those trained with RLHF, generally do not behave like humans. The specific behaviors studied may have broader implications for LLMs' applications.
What Is Missing in Multilingual Visual Reasoning and How to Fix It
Yueqi Song, Simran Khanuja, Graham Neubig
Mar, 2024
Read
Yilun Hua recommended on 3/12/2024
Nice results on multi-image reasoning: image captions are used to bypass the issue that many VLMs were trained on single image-text pairs, though I would prefer models designed/trained to natively support multi-image reasoning. Also, they focused on only one reasoning task.
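To make the idea concrete, here is a minimal sketch of a caption-then-reason pipeline of the kind described above; the caption_model and llm callables are hypothetical stand-ins, not the paper's actual code.

```python
def caption_then_reason(images, question, caption_model, llm):
    """Caption each image separately, then let a text-only LLM reason over the captions.

    caption_model(image) -> str and llm(prompt) -> str are assumed interfaces;
    captioning sidesteps VLMs that were only trained on single image-text pairs.
    """
    captions = [caption_model(img) for img in images]
    context = "\n".join(f"Image {i + 1}: {cap}" for i, cap in enumerate(captions))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```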
Parallel Structures in Pre-training Data Yield In-Context Learning
Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, He He
Feb, 2024
Read
Yilun Hua recommended on 2/27/2024
Smart methods suggesting that parallel structures (PS), i.e. phrases following similar templates, in pretraining data have a big impact on LMs' in-context learning ability. Ablating PS from the training data may introduce confounding factors, which I'm unsure how well this paper addresses.
Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar
Feb, 2024
Read
Yilun Hua recommended on 2/13/2024
A new way of pruning that only needs inference and makes the pruned model faster. It estimates module importance perturbatively by running inference on a small set of submodels and solving an under-determined regression. It also creates priors for modules, so fewer submodels are needed.
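A minimal sketch of the perturbative regression idea, assuming a hypothetical eval_loss_fn that runs a forward pass with a subset of modules masked out; the paper's actual estimator and module priors are more involved.

```python
import numpy as np

def estimate_module_importance(eval_loss_fn, n_modules, n_submodels=64,
                               keep_prob=0.7, l2=1e-3, seed=0):
    """Fit a linear surrogate: submodel loss ~ bias + sum of kept-module effects.

    eval_loss_fn(mask) is assumed to run inference with modules dropped where
    mask == 0 and return a scalar validation loss (forward passes only).
    """
    rng = np.random.default_rng(seed)
    masks = (rng.random((n_submodels, n_modules)) < keep_prob).astype(float)
    losses = np.array([eval_loss_fn(m) for m in masks])

    # Ridge regularization keeps the under-determined regression
    # (n_submodels << n_modules) well-posed.
    A = np.hstack([masks, np.ones((n_submodels, 1))])  # last column is a bias term
    coef = np.linalg.solve(A.T @ A + l2 * np.eye(n_modules + 1), A.T @ losses)
    return -coef[:-1]  # modules whose presence lowers loss get high importance
```

Modules with the lowest estimated importance would then be the natural pruning candidates.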
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu
Jan, 2024
Read
Yilun Hua recommended on 2/6/2024
Introduces the idea of using agent debate to (almost) automatically evaluate the effectiveness of specific LLMs as evaluators (meta-evaluation). Its framework is based on pairwise response comparison and requires human intervention only when the agents fail to reach a consensus.
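A toy sketch of that loop, assuming a hypothetical agent interface (agent.vote returning a choice and an argument); this illustrates the debate-with-human-fallback idea, not the authors' framework.

```python
def meta_evaluate_pair(agents, response_a, response_b, human_judge, max_rounds=3):
    """Debate-style pairwise comparison with human fallback on disagreement.

    Each agent.vote(a, b, transcript) is assumed to return ("A" or "B", argument).
    A human is consulted only when the agents never reach a consensus.
    """
    transcript = []
    for _ in range(max_rounds):
        votes = []
        for agent in agents:
            choice, argument = agent.vote(response_a, response_b, transcript)
            votes.append(choice)
            transcript.append((choice, argument))
        if len(set(votes)) == 1:  # consensus reached, no human needed
            return votes[0]
    return human_judge(response_a, response_b, transcript)
```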
Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence Affect Language Models
Kaitlyn Zhou, Dan Jurafsky, Tatsunori Hashimoto
Feb, 2023
Read
Yilun Hua recommended on 1/30/2024
Reveals that LMs' accuracies are sensitive to epistemic markers, e.g. being prompted with "It could be ..." vs. "I'm certain that ...". Surprisingly, markers of high certainty lead to lower accuracies. The contexts in which these markers appear in pretraining data offer plausible explanations.
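A small sketch of how such a sensitivity test might look, with a hypothetical llm(prompt) -> str interface; the paper's actual prompts and scoring differ.

```python
def accuracy_with_marker(llm, qa_pairs, marker):
    """QA accuracy when the answer prompt is prefixed with an epistemic marker.

    llm(prompt) -> str is an assumed interface; qa_pairs is a list of
    (question, gold_answer) string pairs.
    """
    correct = 0
    for question, answer in qa_pairs:
        prompt = f"{question}\n{marker}"
        prediction = llm(prompt)
        correct += int(answer.lower() in prediction.lower())
    return correct / len(qa_pairs)

# e.g. compare accuracy_with_marker(llm, data, "I'm certain that the answer is")
#      against accuracy_with_marker(llm, data, "It could be that the answer is")
```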
Detecting Pretraining Data from Large Language Models
Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer
Oct, 2023
Read
Yilun Hua recommended on 1/16/2024
A benchmark for pretraining-data detection that is applicable to various models and will be continually updated with new Wikipedia event data; it uses a detection method based on a simple hypothesis: an unseen example tends to contain a few outlier words with low probabilities.
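That hypothesis translates into a short scoring function; below is a minimal sketch assuming a Hugging Face-style causal LM interface (model(ids).logits), not the authors' released code.

```python
import torch
import torch.nn.functional as F

def min_k_percent_score(model, tokenizer, text, k=0.2):
    """Average log-probability of the k% least likely tokens in `text`.

    Very negative scores (a few strong outlier tokens) suggest the text was
    probably not part of the pretraining data.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..n
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    n_low = max(1, int(k * token_lp.numel()))
    return torch.topk(token_lp, n_low, largest=False).values.mean().item()
```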
An In-depth Look at Gemini’s Language Abilities
Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bauerle, Angel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, Graham Neubig
Dec, 2023
Read
Yilun Hua recommended on 1/2/2024
A comparison of Gemini and GPT models on a wide range of tasks. It shows that Gemini Pro underperforms GPT-3.5 Turbo on all the tasks benchmarked and provides analysis of possible causes of Gemini's failures, such as sensitivity to multiple-choice answer ordering.
Getting MoRE out of Mixture of Language Model Reasoning Experts
Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettlemoyer, Jordan Boyd-Graber
Oct, 2023
Read
Yilun Hua recommended on 12/19/2023
A Mixture-of-Experts model for QA reasoning, with promising results on 12 datasets (4 reasoning types). The paper shows that presenting individual models' predictions and the answer-selection process helps users more accurately calibrate when to trust the system's output.
Proving Test Set Contamination in Black Box Language Models
Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, Tatsunori B. Hashimoto
Oct, 2023
Read
Yilun Hua recommended on 12/12/2023
A novel method to provably identify test set contamination in LLMs without accessing model weights or pretraining data. The method allows for hypothesis testing, and the results on small models and contamination by small datasets seem promising.
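The core exchangeability argument can be sketched as a simple permutation test; seq_log_prob is a hypothetical callable returning the model's log-probability of the examples concatenated in a given order (the paper's sharded test is more efficient than this brute-force version).

```python
import random

def contamination_p_value(seq_log_prob, examples, n_permutations=99, seed=0):
    """Permutation test: is the canonical ordering of a test set suspiciously likely?

    If the model never saw the dataset, all orderings of exchangeable examples
    are equally likely, so the canonical order should not be an outlier.
    """
    rng = random.Random(seed)
    canonical = seq_log_prob(examples)
    count_geq = 1  # the canonical ordering counts as one of the orderings tested
    for _ in range(n_permutations):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if seq_log_prob(shuffled) >= canonical:
            count_geq += 1
    return count_geq / (n_permutations + 1)  # small p-value suggests contamination
```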
Divergences between Language Models and Human Brains
Yuchen Zhou, Emmy Liu, Graham Neubig, Leila Wehbe
Nov, 2023
Read
Yilun Hua recommended on 12/5/2023
This paper looks at how LMs differ from the human brain in processing language, using magnetoencephalography (MEG). It identifies phenomena in human MEG data that cannot be explained well by LMs and explores the effects of finetuning on improving the alignment between LMs and the human brain.
Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
Apr, 2023
Read
Yilun Hua recommended on 11/21/2023
A strong vision-language model trained on GPT-4-generated visual instruction-following data. The architecture is not surprising, but the proposed way of synthesizing an instruction-following dataset will probably have a broader impact.