Trend in Research
[CVPR 24 Best Paper] Rich Human Feedback for Text-to-Image Generation
arxiv.org/abs/2312.10240
Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing on the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants). The RichHF-18K data set will be released in our GitHub repository: this https URL.
Views: 4

Videos

[Meta AI] An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
152 views • 1 hour ago
arxiv.org/abs/2406.09415 This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different fr...
[Stanford] Generative Agents: Interactive Simulacra of Human Behavior
11 views • 2 hours ago
arxiv.org/abs/2304.03442
[NVIDIA] NVIDIA Nemotron-4 340B Technical Report (Better than GPT-4?)
38 views • 2 hours ago
arxiv.org/pdf/2406.11704v1 We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and their outputs. These models perform competitively to open access models o...
Empowering LLM to use Smartphone for Intelligent Task Automation
25 views • 12 hours ago
arxiv.org/pdf/2308.15272 Mobile task automation is an attractive technique that aims to enable voice-based hands-free user interaction with smartphones. However, existing approaches suffer from poor scalability due to the limited language understanding ability and the non-trivial manual efforts required from developers or end-users. The recent advance of large language models (LLMs) in language...
[Northeastern, MIT, Princeton] Reflexion: Language Agents with Verbal Reinforcement Learning
34 views • 16 hours ago
arxiv.org/abs/2303.11366 Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuni...
Can LLM interpret fMRI via embeddings? Crafting Interpretable Embeddings by Asking LLMs Questions
11 views • 14 days ago
arxiv.org/abs/2405.16714 Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks. However, their opaqueness and proliferation into scientific domains such as neuroscience have created a growing need for interpretability. Here, we ask whether we can obtain interpretable embeddings through LLM prompting. We introduce question-ans...
[Stanford] Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
24 views • 14 days ago
arxiv.org/abs/2406.00888 Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. Steering LLMs away from generic output is possible through supervised finetuning or RLHF, but requires prohibitively large datasets for new ad-hoc tasks. We argue that it is instead possible to align an LLM to a specific setting by leveraging a...
[ICRA 2024 Best Paper] NoMaD: Goal Masking Diffusion Policies for Navigation and Exploration
75 views • 14 days ago
arxiv.org/abs/2310.07896 Robotic learning for navigation in unfamiliar environments needs to provide policies for both task-oriented navigation (i.e., reaching a goal that the robot has located), and task-agnostic exploration (i.e., searching for a goal in a novel setting). Typically, these roles are handled by separate models, for example by using subgoal proposals, planning, or separate navig...
[Stanford] FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
19 views • 14 days ago
arxiv.org/abs/2305.05176 There is a rapidly growing number of large language models (LLMs) that users can query for a fee. We review the cost associated with querying popular LLM APIs, e.g. GPT-4, ChatGPT, J1-Jumbo, and find that these models have heterogeneous pricing structures, with fees that can differ by two orders of magnitude. In particular, using LLMs on large collections of queries and...
[Deepmind] Teach LLMs to Phish: Stealing Private Information from Language Models
17 views • 14 days ago
arxiv.org/pdf/2403.00871 When large language models are trained on private data, it can be a significant privacy risk for them to memorize and regurgitate sensitive information. In this work, we propose a new practical data extraction attack that we call "neural phishing". This attack enables an adversary to target and extract sensitive or personally identifiable information (PII), e.g., credit...
[Deepmind] Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning
32 views • 14 days ago
arxiv.org/abs/2304.13653 We investigate whether Deep Reinforcement Learning (Deep RL) is able to synthesize sophisticated and safe movement skills for a low-cost, miniature humanoid robot that can be composed into complex behavioral strategies in dynamic environments. We used Deep RL to train a humanoid robot with 20 actuated joints to play a simplified one-versus-one (1v1) soccer game. The res...
[MetaAI] Better & Faster Large Language Models via Multi-token Prediction
33 views • 21 days ago
arxiv.org/pdf/2404.19737 Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, op...
[Radboud, U of Amsterdam] Fine Tuning vs. RAG, Which is Better?
25 views • 28 days ago
arxiv.org/pdf/2403.01432 Large language models (LLMs) memorize a vast amount of factual knowledge, exhibiting strong performance across diverse tasks and domains. However, it has been observed that the performance diminishes when dealing with less-popular or low-frequency concepts and entities, for example in domain specific applications. The two prominent approaches to enhance the performance ...
[Stanford, UofT] Observational Scaling Laws and the Predictability of Language Model Performance
49 views • 28 days ago
arxiv.org/pdf/2405.10938 Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scalin...
[UC Berkeley, UIUC, NYU] Fine-Tuning VLM Agent via RL
48 views • 28 days ago
[Google] Demystifying MoE in LLM: What is it?
22 views • 1 month ago
[Nature, Stanford] GenAI for designing structurally novel antibiotics
11 views • 1 month ago
[Nature] Research Shows AI Deciphers Ancient Texts
19 views • 1 month ago
[Ilya Sutskever Recommended Paper List] Pointer Networks
501 views • 1 month ago
[Ilya Sutskever Recommended Paper List] Order Matters: Sequence to sequence for sets
147 views • 1 month ago
[ICML 2024] Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos
11 views • 1 month ago
[GaTech, ICRA 2024] VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation
67 views • 1 month ago
[Anthropic] On Attack LLM: Many-shot Jailbreaking
436 views • 1 month ago
[INRIA, MPI] 3D Gaussian Splatting for Real-Time Radiance Field Rendering
33 views • 1 month ago
[Nature, Stanford] Future AI Powered Vision Pro? 3D holographic AR displays
93 views • 1 month ago
[Princeton] Long-Context Language Modeling with Parallel Context Encoding
42 views • 1 month ago
[Microsoft] LoRA: Low-Rank Adaptation of Large Language Models
55 views • 1 month ago
[Princeton, DeepMind] Tree of Thoughts: Deliberate Problem Solving with Large Language Models
21 views • 1 month ago
[Waymo, CVPR 2024] MoST: Multi-modality Scene Tokenization for Motion Prediction
59 views • 1 month ago

Comments

  • @anhtunguyen77 26 days ago

    amazing!

  • @Jshicwhartz 1 month ago

    There is actually a fix for this issue, and Microsoft uses it. A second agent is given the most recent message and forced to output a function_call saying whether it's a jailbreak or not; you pass that verdict to your main agent and boom, the jailbreak no longer works. I won't give all the solutions away, but it's one of many we use internally. Research stopped getting shared publicly in late 2022.

    • @trendinresearch 1 month ago

      Thanks, do you have a link to that paper?

  • @basiclick 1 month ago

    we're waiting.. )

  • @jasonliu1587 1 month ago

    good to know

  • @jasonliu1587 1 month ago

    Can I segment an SVG file?

  • @trendinsightgpt 1 month ago

    what are some of the energy-based models?

  • @gabrielrainer328 1 month ago

    Very important

  • @BooleanDisorder 3 months ago

    I definitely think self-reward is the path forward. Like planting a seed that automatically grows with enough resources.

  • @jamar9205 5 months ago

    👊 *promosm*