RecNet
Do LLMs Exhibit Human-like Response Biases? A Case Study in Survey Design
https://arxiv.org/abs/2311.04076
Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig
Nov, 2023
Read
Yilun Hua recommended on 4/2/2024
Tests whether LLMs show human-like response behaviors when answering survey questionnaires, depending on how the questions are worded. Shows that LLMs, especially those trained with RLHF, generally do not behave like humans. The specific behaviors studied may have broader implications for LLMs' applications.
What Is Missing in Multilingual Visual Reasoning and How to Fix It
Yueqi Song, Simran Khanuja, Graham Neubig
Mar, 2024
Read
Yilun Hua recommended on 3/12/2024
Nice results on multi-image reasoning: image captions are used to bypass the issue that many VLMs were trained on single image-text pairs, though I would prefer models designed/trained to natively support multi-image reasoning. Also, they focused on only one reasoning task.
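To make the idea concrete, here is a minimal sketch of a caption-then-reason pipeline of the kind described above; the caption_model and llm callables are hypothetical stand-ins, not the paper's actual code.

```python
def caption_then_reason(images, question, caption_model, llm):
    """Caption each image separately, then let a text-only LLM reason over the captions.

    caption_model(image) -> str and llm(prompt) -> str are assumed interfaces;
    captioning sidesteps VLMs that were only trained on single image-text pairs.
    """
    captions = [caption_model(img) for img in images]
    context = "\n".join(f"Image {i + 1}: {cap}" for i, cap in enumerate(captions))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```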
Parallel Structures in Pre-training Data Yield In-Context Learning
Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, He He
Feb, 2024
Read
Yilun Hua recommended on 2/27/2024
Smart methods suggesting that parallel structures (PS), i.e. phrases following similar templates, in pretraining data have a big impact on LMs' in-context learning ability. Ablating PS from the training data may introduce confounding factors, which I'm unsure how well this paper addresses.
Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar
Feb, 2024
Read
Yilun Hua recommended on 2/13/2024
A new way of pruning that only needs inference and makes the pruned model faster. It estimates module importance perturbatively by running inference on a small set of submodels and solving an under-determined regression. It also creates priors for modules, so fewer submodels are needed.
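A minimal sketch of the perturbative regression idea, assuming a hypothetical eval_loss_fn that runs a forward pass with a subset of modules masked out; the paper's actual estimator and module priors are more involved.

```python
import numpy as np

def estimate_module_importance(eval_loss_fn, n_modules, n_submodels=64,
                               keep_prob=0.7, l2=1e-3, seed=0):
    """Fit a linear surrogate: submodel loss ~ bias + sum of kept-module effects.

    eval_loss_fn(mask) is assumed to run inference with modules dropped where
    mask == 0 and return a scalar validation loss (forward passes only).
    """
    rng = np.random.default_rng(seed)
    masks = (rng.random((n_submodels, n_modules)) < keep_prob).astype(float)
    losses = np.array([eval_loss_fn(m) for m in masks])

    # Ridge regularization keeps the under-determined regression
    # (n_submodels << n_modules) well-posed.
    A = np.hstack([masks, np.ones((n_submodels, 1))])  # last column is a bias term
    coef = np.linalg.solve(A.T @ A + l2 * np.eye(n_modules + 1), A.T @ losses)
    return -coef[:-1]  # modules whose presence lowers loss get high importance
```

Modules with the lowest estimated importance would then be the natural pruning candidates.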
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu
Jan, 2024
Read
Yilun Hua recommended on 2/6/2024
Introduces the idea of using agent debate to (almost) automatically evaluate the effectiveness of specific LLMs as evaluators (meta-evaluation). Its framework is based on pairwise response comparison and requires human intervention only when the agents fail to reach a consensus.
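A toy sketch of that loop, assuming a hypothetical agent interface (agent.vote returning a choice and an argument); this illustrates the debate-with-human-fallback idea, not the authors' framework.

```python
def meta_evaluate_pair(agents, response_a, response_b, human_judge, max_rounds=3):
    """Debate-style pairwise comparison with human fallback on disagreement.

    Each agent.vote(a, b, transcript) is assumed to return ("A" or "B", argument).
    A human is consulted only when the agents never reach a consensus.
    """
    transcript = []
    for _ in range(max_rounds):
        votes = []
        for agent in agents:
            choice, argument = agent.vote(response_a, response_b, transcript)
            votes.append(choice)
            transcript.append((choice, argument))
        if len(set(votes)) == 1:  # consensus reached, no human needed
            return votes[0]
    return human_judge(response_a, response_b, transcript)
```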
Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence Affect Language Models
Kaitlyn Zhou, Dan Jurafsky, Tatsunori Hashimoto
Feb, 2023
Read
Yilun Hua recommended on 1/30/2024
Reveals that LMs' accuracies are sensitive to epistemic markers, e.g. being prompted with "It could be ..." vs. "I'm certain that ...". Surprisingly, markers of high certainty lead to lower accuracies. The contexts in which these markers appear in pretraining data offer plausible explanations.
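A small sketch of how such a sensitivity test might look, with a hypothetical llm(prompt) -> str interface; the paper's actual prompts and scoring differ.

```python
def accuracy_with_marker(llm, qa_pairs, marker):
    """QA accuracy when the answer prompt is prefixed with an epistemic marker.

    llm(prompt) -> str is an assumed interface; qa_pairs is a list of
    (question, gold_answer) string pairs.
    """
    correct = 0
    for question, answer in qa_pairs:
        prompt = f"{question}\n{marker}"
        prediction = llm(prompt)
        correct += int(answer.lower() in prediction.lower())
    return correct / len(qa_pairs)

# e.g. compare accuracy_with_marker(llm, data, "I'm certain that the answer is")
#      against accuracy_with_marker(llm, data, "It could be that the answer is")
```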
Detecting Pretraining Data from Large Language Models
Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer
Oct, 2023
Read
Yilun Hua recommended on 1/16/2024
A benchmark for pretraining-data detection that is applicable to various models and will be continually updated with new Wikipedia event data; it uses a detection method based on a simple hypothesis: an unseen example tends to contain a few outlier words with low probabilities.
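That hypothesis translates into a short scoring function; below is a minimal sketch assuming a Hugging Face-style causal LM interface (model(ids).logits), not the authors' released code.

```python
import torch
import torch.nn.functional as F

def min_k_percent_score(model, tokenizer, text, k=0.2):
    """Average log-probability of the k% least likely tokens in `text`.

    Very negative scores (a few strong outlier tokens) suggest the text was
    probably not part of the pretraining data.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..n
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    n_low = max(1, int(k * token_lp.numel()))
    return torch.topk(token_lp, n_low, largest=False).values.mean().item()
```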
An In-depth Look at Gemini’s Language Abilities
Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bauerle, Angel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, Graham Neubig
Dec, 2023
Read
Yilun Hua recommended on 1/2/2024
A comparison of Gemini and GPT models on a wide range of tasks. It shows that Gemini Pro underperforms GPT-3.5 Turbo on all the tasks benchmarked and provides analysis of possible causes of Gemini's failures, such as sensitivity to multiple-choice answer ordering.
Getting MoRE out of Mixture of Language Model Reasoning Experts
Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettlemoyer, Jordan Boyd-Graber
Oct, 2023
Read
Yilun Hua recommended on 12/19/2023
A Mixture-of-Experts model for QA reasoning, with promising results on 12 datasets (4 reasoning types). The paper shows that presenting individual models' predictions and the answer-selection process helps users more accurately calibrate when to trust the system's output.
Proving Test Set Contamination in Black Box Language Models
Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, Tatsunori B. Hashimoto
Oct, 2023
Read
Yilun Hua recommended on 12/12/2023
A novel method to provably identify test set contamination in LLMs without accessing model weights or pretraining data. The method allows for hypothesis testing, and the results on small models and contamination by small datasets seem promising.
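The core exchangeability argument can be sketched as a simple permutation test; seq_log_prob is a hypothetical callable returning the model's log-probability of the examples concatenated in a given order (the paper's sharded test is more efficient than this brute-force version).

```python
import random

def contamination_p_value(seq_log_prob, examples, n_permutations=99, seed=0):
    """Permutation test: is the canonical ordering of a test set suspiciously likely?

    If the model never saw the dataset, all orderings of exchangeable examples
    are equally likely, so the canonical order should not be an outlier.
    """
    rng = random.Random(seed)
    canonical = seq_log_prob(examples)
    count_geq = 1  # the canonical ordering counts as one of the orderings tested
    for _ in range(n_permutations):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if seq_log_prob(shuffled) >= canonical:
            count_geq += 1
    return count_geq / (n_permutations + 1)  # small p-value suggests contamination
```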
Divergences between Language Models and Human Brains
Yuchen Zhou, Emmy Liu, Graham Neubig, Leila Wehbe
Nov, 2023
Read
Yilun Hua recommended on 12/5/2023
This paper looks at how LMs differ from the human brain in processing language, using magnetoencephalography (MEG). It identifies phenomena in human MEG data that cannot be explained well by LMs and explores the effects of finetuning on improving the alignment between LMs and the human brain.
Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
Apr, 2023
Read
Yilun Hua recommended on 11/21/2023
A strong vision-language model trained on GPT-4-generated visual instruction-following data. The architecture is not surprising, but the proposed way of synthesizing an instruction-following dataset will probably have a broader impact.