[Overall]
The overall flow and structure were easy to understand. However, the presentation lacked details in the implementation and architecture explanation. It would be more meaningful for the audience if there were more explanations about the types of input and output data and the data flow illustrated in the architecture diagrams. At present, it feels more like reading a blog that briefly introduces a new paper rather than attending a seminar.
[Feedback]
F1: Slide 9 contains too much information, and without proper guidance, it is difficult to understand as the explanations are only verbal. Why not use appropriate animations and annotations to aid comprehension?
F2: For slide 16, it would be better to cut and show only a few parts or use strategies that display the images more prominently. Currently, the images are hard to see, and there is a lot of information in the illustrations, but there isn’t enough time taken to explain them thoroughly.
F3: Regarding the "Three main architectures," does this refer to the VAE, Diffusion, and VQ-VAE architectures used in previous research? If so, why does the top title change from "Previous Works" to "What is text-to-motion generation?" The content of the presentation seems confusing.
F4: As I pointed out to other presenters earlier, there is poor control over the amount of information provided on each slide. When there is a lot of information, sufficient explanations and additional guidance and annotations should be provided. There is no consideration of how the audience's gaze should move or the order in which understanding should occur.
[Questions]
Q1: What information did the ParCo authors use to render the stick figure? Joint information? How did they determine the joint configuration?
Q2: When are "part-aware motion discretization" and "text-driven part coordinating synthesis" used? What inputs and outputs are involved, and what do the numbers in the diagrams represent?
Q3: Where is the "Part-Coordinated Transformer" used? Is it an expanded version of the transformer at the top of the "text-driven part coordinating synthesis"?
Q4: What methods are used to maintain coherence between separately generated motions in ParCo? I am curious about the specific training methods, inputs, outputs, and effects.
Thank you for your feedback, sir.
A1. For rendering, the x, y, z coordinates of the joints are used. The joint configuration varies by dataset; for example, HumanML3D and KIT-ML each have their own custom configurations.
A2. Basically, ParCo is a two-stage learning methodology. First, the motion is tokenized using a VQ-VAE; in the case of HumanML3D, every 4 frames are compressed into one token. The loss term used in this step is a reconstruction loss, meaning that the input and target are the same motion. The numbers in the diagram refer to these tokens. The text-driven stage is the second stage: using the motion tokens learned in the first stage, the model generates motion tokens given text as input. The loss used in this stage is NLL (cross-entropy).
A3. It is an expanded version. This is used when combining information about each part's motion to maintain coherence across parts when generating motions independently for each part.
A4. The answer is the same as A3. Specifically, although each part is generated independently through separate modules, coherence cannot be guaranteed if they are completely independent. Therefore, when generating, information from other parts is also provided to ensure consistency. The input consists of information from each part and the previously generated motion tokens of other parts, and the output is the next token for each part.
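To make A2 concrete, here is a minimal sketch of the first stage: a per-part VQ-VAE trained purely on reconstruction, assuming a simple convolutional encoder/decoder with 4x temporal downsampling. The module names, layer choices, and hyperparameters are illustrative assumptions, not ParCo's actual implementation.

```python
# Minimal stage-1 sketch (illustrative, not ParCo's actual code): a per-part
# VQ-VAE that turns one body part's motion features into discrete tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartVQVAE(nn.Module):
    def __init__(self, feat_dim, codebook_size=512, code_dim=256):
        super().__init__()
        # Two stride-2 convs give 4x temporal downsampling, i.e. roughly
        # one token per 4 frames, as described for HumanML3D.
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, code_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(code_dim, code_dim, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(code_dim, code_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(code_dim, feat_dim, 4, stride=2, padding=1),
        )

    def forward(self, x):                                 # x: (B, T, feat_dim)
        z = self.encoder(x.transpose(1, 2))               # (B, code_dim, T/4)
        flat = z.transpose(1, 2).reshape(-1, z.size(1))   # (B*T/4, code_dim)
        tokens = torch.cdist(flat, self.codebook.weight).argmin(-1)
        tokens = tokens.view(x.size(0), -1)               # the "numbers" in the diagram
        z_q = self.codebook(tokens).transpose(1, 2)       # quantized latents
        recon = self.decoder(z_q).transpose(1, 2)         # (B, T, feat_dim)
        return recon, tokens

# Stage-1 training uses a pure reconstruction loss: input and target are the
# same motion (straight-through estimator and commitment loss omitted here).
vqvae = PartVQVAE(feat_dim=50)                 # dummy per-part feature size
part_motion = torch.randn(2, 64, 50)           # (B, T, part_feat_dim), dummy data
recon, tokens = vqvae(part_motion)
loss = F.mse_loss(recon, part_motion)
```

Stage 2 (the text-driven synthesis) then trains a transformer over these tokens with a cross-entropy/NLL loss; its generation loop is sketched further below in the exchange about the part coordination block.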
Thank you for your presentation.
Q1:
In the ParCo model, it is explained that the body is divided into a total of six parts to generate motion. I'm curious why these specific six parts were chosen to divide the body: what were the criteria or reasons behind this decision? Additionally, I'd like to know what trade-offs exist between dividing the body into smaller, more detailed units versus larger units.
Q2:
When using a method like ParCo that generates motion by dividing the body into parts, I'm wondering whether it's possible to input specific textual descriptions for each part and have that more accurately reflected in the motion. In other words, does this approach help to implement detailed instructions for each body part more effectively?
A1. If divided too finely, there is a risk that coherence may break down when combining them later.
A2. This is not present in ParCo, but there is a paper called "LGTM" introduced at SIGGRAPH 2024 that uses ChatGPT to split the text by parts and generate them accordingly.
Thank you for your presentation.
Q1. How are the "parts" of a body given as input to an encoder? Specifically, what is the format of the input?
Q2. How do you define the parts for the body? Is it defined manually by the users?
A1. Literally, the parts are separated and fed into the encoder. Each frame's 263-dimensional feature vector is split into groups, with each group of features corresponding to a different part such as the head, arms, etc. The features include things like joint rotations and velocities.
A2. There is no specific rule; it was simply decided to use six parts, and there is no particular reference for this. My guess is that six parts gave the best performance in the authors' experiments.
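As a rough illustration of A1, this is how a per-frame feature vector could be split into part-wise inputs; the part names and index ranges below are placeholders, not ParCo's actual feature grouping.

```python
# Illustrative only: slice a (T, 263) HumanML3D-style motion feature matrix
# into per-part feature groups. The index ranges are placeholders, not the
# paper's actual assignment of features to the six parts.
import numpy as np

PART_FEATURE_INDICES = {
    "root":      np.arange(0, 4),        # hypothetical ranges
    "backbone":  np.arange(4, 60),
    "left_arm":  np.arange(60, 110),
    "right_arm": np.arange(110, 160),
    "left_leg":  np.arange(160, 210),
    "right_leg": np.arange(210, 263),
}

def split_into_parts(motion):
    """motion: (T, 263) -> dict of per-part feature matrices."""
    return {name: motion[:, idx] for name, idx in PART_FEATURE_INDICES.items()}

parts = split_into_parts(np.zeros((196, 263)))
print({name: feats.shape for name, feats in parts.items()})
```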
Thank you for your seminar.
I have two questions below:
Q1. Is there a clear standard for dividing body parts in the part-based method, or is it expressed differently in each paper? If it differs from paper to paper, how do you think performance might vary depending on how the parts are divided?
Q2. Aside from performance, are there any other advantages of dividing body parts compared to methods that do not divide them?
A1. Typically, it is divided into six parts, but there are slight differences in the joint groups. It seems to be divided based on intuition, and since the differences aren’t that significant, there likely wouldn’t be a large impact on performance.
A2. When you run inference, the results are much better when specific text is provided for each part. Additionally, it performs much better for more complex motions.
Thank you for the presentation.
I have two questions:
Q1: I don’t understand how the Coordinate Networks work for synchronizing the separated generators. Where is the loss function? I can't understand how this model will be optimized according to the author's design intention.
Q2: How is text translated into animation in this architecture? Why do we need to use the Encoder/Decoder architecture from this perspective?
A1. There is no separate loss for synchronization; the model uses only the reconstruction loss for the VQ-VAE and the cross-entropy loss on the motion tokens, conditioned on the text, for text-to-motion generation.
A2. This is a two-stage model. A VQ-VAE is used to tokenize the motions, and CLIP is used to embed the text. The motion tokens are then generated in an autoregressive manner.
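To spell out A1 with concrete shapes, here is a tiny sketch of the stage-2 objective; the tensors are dummies and the shapes are assumptions for illustration only.

```python
# There is no extra "synchronization" loss; stage 1 uses the reconstruction
# loss shown earlier, and stage 2 uses only this cross-entropy (NLL) term.
import torch
import torch.nn.functional as F

B, L, K = 2, 16, 512                          # batch, token length, codebook size (dummies)
logits = torch.randn(B, L, K)                 # transformer output over next motion tokens,
                                              # conditioned on the CLIP text embedding and
                                              # on all parts' previously generated tokens
target_tokens = torch.randint(0, K, (B, L))   # ground-truth stage-1 token indices
stage2_loss = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
```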
Thank you for the excellent presentation.
Q1. I am curious about how detailed the division of body parts should be. In the paper, it appears that the body is divided into six parts. However, I believe that the leg could be further divided into the upper and lower parts, such as the thigh and calf. While it seems possible to categorize body parts in more detail, I didn’t see any mention of this in the paper. Do you have any thoughts on this, or was it addressed in the research?
Q2. I have a question about the MM-Dist metric. Could you explain in more detail how it is measured? I am particularly interested in how this metric evaluates whether the movement of each body part is appropriate when the name of that body part is mentioned in the text. How does this evaluation work?
A1. There isn’t a specific mention of this. It could potentially work better if further divided, or it might not. It’s a good idea, and while we would need to experiment to know for sure, my guess is that if it's split too finely, issues could arise when trying to combine them later. Although it's not mentioned in the paper, when we train and observe the results, the six parts aren’t perfectly coherent, and there are slight discrepancies. If divided even further, maintaining this coherence might become more difficult.
A2. MM-Dist is simply a metric that measures the Euclidean distance in a shared latent space between the text and the generated motion. It evaluates whether the model has generated a motion that is close to the text from a latent-space perspective.
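For reference, MM-Dist is conventionally computed as the mean Euclidean distance between matched text and motion embeddings produced by a pretrained evaluator; the sketch below follows that convention and is not code from the paper.

```python
import torch

def mm_dist(text_emb: torch.Tensor, motion_emb: torch.Tensor) -> torch.Tensor:
    """Mean Euclidean distance between matched text/motion embedding pairs.

    Both tensors are (N, D) embeddings from a pretrained text/motion feature
    extractor (e.g. the evaluator used with HumanML3D benchmarks).
    Lower is better: generated motions lie closer to their texts in latent space.
    """
    return (text_emb - motion_emb).norm(dim=-1).mean()

print(mm_dist(torch.randn(32, 512), torch.randn(32, 512)))  # toy example
```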
Thank you for your seminar.
1. I think using six models will increase the number of model parameters. What do you think?
2. There seems to be no explanation of the loss term. Do they use a cross-entropy loss on the tokens or an MSE loss on the final motion?
A1. No, because a smaller model was used, it actually has fewer parameters compared to the previous model.
A2. Cross-entropy loss on the tokens is used.
Thank you for your good presentation. I'm doing research on a similar topic, so your presentation was very helpful for me.
I have two questions:
1. In the figures (p.3 and p.5) describing the previous works, it seems like "text" should go into the input of the encoder since it's a text-to-motion task, but in the figures it seems like only "motion sequences" go into the input. Is it a mistake to draw the figure? Or is it correct that the motion sequence goes in as input? I would appreciate a little more clarification.
2. In the process of breaking down the text into individual body parts through text-driven part coordinating synthesis, could you please explain in detail how the whole text is broken down into text for each body part? Is it focusing on specific words in the input text that describe the body part (e.g. “left hand”, “right foot”, etc.)?
Thank you.
A1. That is not a mistake. It is correct that motion goes in and motion comes out. This is to construct the latent space through the reconstruction task.
A2. The text is not split. The same text is provided to all parts.
Thank you for your seminar. I have one question: What specific techniques or strategies does ParCo utilize to avoid overfitting to certain motion patterns or repetitive text descriptions during the training process, especially when dealing with limited or biased datasets?
There is actually no mention of this. If anything, there are the code reset and EMA updates used to prevent codebook collapse in the VQ-VAE, but this is not a methodology unique to ParCo; models that use a VQ-VAE generally adopt this approach.
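For readers unfamiliar with those two tricks, here is a generic sketch of an EMA codebook update with dead-code reset, the common recipe for avoiding codebook collapse; it is not ParCo-specific code.

```python
import torch

@torch.no_grad()
def ema_codebook_update(codebook, cluster_size, ema_embed, z_flat, tokens,
                        decay=0.99, reset_threshold=1.0):
    """Generic EMA codebook update plus dead-code reset for a VQ-VAE.

    codebook:     (K, D) current code vectors
    cluster_size: (K,)   EMA count of assignments per code
    ema_embed:    (K, D) EMA sum of encoder outputs assigned to each code
    z_flat:       (N, D) encoder outputs for the current batch
    tokens:       (N,)   code indices selected for the current batch
    """
    K = codebook.size(0)
    one_hot = torch.zeros(z_flat.size(0), K, device=z_flat.device)
    one_hot.scatter_(1, tokens.unsqueeze(1), 1.0)

    # Exponential moving averages of code usage and of the assigned latents.
    cluster_size.mul_(decay).add_(one_hot.sum(0), alpha=1 - decay)
    ema_embed.mul_(decay).add_(one_hot.t() @ z_flat, alpha=1 - decay)
    codebook.copy_(ema_embed / cluster_size.clamp(min=1e-5).unsqueeze(1))

    # Code reset: re-initialize rarely used codes from random encoder outputs.
    dead = cluster_size < reset_threshold
    if dead.any():
        rand = z_flat[torch.randint(0, z_flat.size(0), (int(dead.sum()),))]
        codebook[dead] = rand
        ema_embed[dead] = rand
        cluster_size[dead] = reset_threshold
```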
Thank you for the presentation. I have two questions.
1) How does the part coordination block actually work? My understanding is that it is a combination of a transformer architecture and RNN-style autoregressive processing applied per body part.
2) I agree with the assumption that part-wise generation of character animation yields better details. But does it work well with ambiguous input text (e.g., text without a detailed description of each body part)?
A1. Simply put, it maintains coherence by using the information from the previously generated tokens of the other parts when independently generating the current part's next token.
A2. ParCo doesn't address that issue in detail. However, to solve this problem, a paper called "LGTM" was introduced at SIGGRAPH 2024. It uses ChatGPT to split detailed text for each part and then generates each part accordingly.
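To illustrate A1, here is a toy version of the generation loop: at every step each part's next token is predicted from the text plus the tokens already generated for all parts, which is what keeps the independently generated sequences coherent. The `predict_next_token` interface is invented for this sketch and does not match ParCo's actual API.

```python
# Toy sketch of part-coordinated autoregressive generation (not ParCo's API).
from typing import Callable, Dict, List

PARTS = ["root", "backbone", "left_arm", "right_arm", "left_leg", "right_leg"]

def generate_tokens(text_emb,
                    predict_next_token: Callable[[str, object, Dict[str, List[int]]], int],
                    num_steps: int = 50) -> Dict[str, List[int]]:
    """Generate motion tokens for every part, one time step at a time.

    At each step, each part sees the text embedding *and* the tokens already
    generated for all parts (not just its own), so the per-part sequences
    stay mutually coherent.
    """
    tokens: Dict[str, List[int]] = {part: [] for part in PARTS}
    for _ in range(num_steps):
        step = {part: predict_next_token(part, text_emb, tokens) for part in PARTS}
        for part, tok in step.items():
            tokens[part].append(tok)
    return tokens

# Toy usage with a dummy predictor that always returns token 0.
print(generate_tokens(None, lambda part, text, ctx: 0, num_steps=3))
```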
Thank you for the presentation.
I have three questions.
Q1. Is the model they are trying to train a skeleton model for motion? If so, is the target of learning the translation and rotation values of the joints? I’m a bit confused as the input of the encoder and output of the decoder are not clearly explained.
Q2. In the part-based method, it seems like the parts being trained separately are at a coarse level, such as arms, legs, torso, and head. I don’t quite understand how dividing them at this level can achieve finer-grained and more precise motion generation compared to existing methods. Wouldn't defining the parts down to the bone level, as defined by the skeleton, allow for better learning and representation of more detailed and complex motion?
Q3. Could you explain any limitations mentioned in this paper, or any limitations you personally perceive in this work?
Thank you for your questions.
A1. The model does learn skeleton-based motion, but not directly. The joint translation and rotation values are transformed into a specific feature format, including quantities such as the velocities of the parts' motions.
A2. That's a great question. ParCo itself is a step in this direction, having moved from the existing two-part generation to six-part generation, but there is no research at the bone level.
A3. Because it is an autoregressive model, inference is quite slow.
Thank you for your presentation. I have two questions.
First, have there been any experiments where the body is divided differently from the six parts you mentioned? I'm curious if the division into six parts is the best approach.
Second, it seems that the encoder and decoder are divided based on these six parts. However, I didn't quite understand how information is shared between them. Could you explain this part again and clarify it further?
A1. There isn't anything like that. I think it was probably determined through internal testing.
A2. It is shared in the Part-Coordinated Transformer: the previously generated tokens of the other parts are taken as input, and the current part's token is then generated independently.
I am curious about how the features processed for each part form one natural motion. Text and motion do not have a one-to-one correspondence, so multiple motions can be expressed as one text. Then, wouldn’t it be possible to express different motions for each part?
There are 263 features, and each part has corresponding features, so they were simply split. In ParCo, only one text was used for all parts, but in LGTM, a separate text was created for each part.
I have two questions.
Q1. I know data is very important in text-to-motion generation, but is it good at generating motion that isn't in the data? I wonder if the paper takes this into account.
Q2. I'm wondering how the motions created by the parts are connected and made into a single motion.
A1. It seems that parts without specific input are still fairly well generated.
A2. It’s simply concatenated. From the beginning, coherence is maintained through the part-coordinated transformer during generation.
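As a final illustration of A2, the concatenation step is just the inverse of the per-part split sketched earlier: each part's decoded features are written back into its (hypothetical) column range of the full-body feature matrix.

```python
import numpy as np

def merge_parts(part_motions, part_indices, total_dim=263):
    """Stitch per-part feature matrices back into one (T, total_dim) motion.

    part_motions: dict of part name -> (T, d_part) decoded features
    part_indices: dict of part name -> column indices (hypothetical mapping)
    """
    T = next(iter(part_motions.values())).shape[0]
    full = np.zeros((T, total_dim))
    for name, idx in part_indices.items():
        full[:, idx] = part_motions[name]
    return full
```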