
Commit: add link
Jamil-Yu authored Apr 17, 2024
1 parent 9e3fa4a commit 155c7a6
Showing 1 changed file with 1 addition and 1 deletion.
milestone/milestone.html: 2 changes (1 addition & 1 deletion)
@@ -523,6 +523,6 @@
</style><title>milestone</title>
</head>
<body class='typora-export'><div class='typora-export-content'>
<div id='write' class=''><h1 id='team42-milestone'><span>Team42 Milestone</span></h1><h3 id='videohttpsyoutubehqxuyymf-hc-----slides'><a href='https://youtu.be/hqxUyYMF-Hc'><span>video</span></a><span> </span><a href='https://docs.google.com/presentation/d/1ZP1gO6y45i76lddRRMKG3XSdGh15vEu8KZf1Idavhaw/edit?usp=sharing'><span>slides</span></a></h3><h2 id='about-proposal-feedback'><span>About Proposal Feedback</span></h2><p><span>Thanks to TA Mingyang Wang for the questions and suggestions on our project&#39;s proposal. Here are our answers:</span></p><ul><li><p><span>About the input text: if we were to use a text-prompt input, we would use a search-like approach rather than an LLM, which would be overly complex; there also doesn&#39;t seem to be much existing work combining LLMs with 3D generation. For now, we are not sure whether we want to include a text prompt at all.</span></p></li><li><p><span>About feature transformation: we haven&#39;t finalized our approach because existing methods handle it in quite different ways. We have reproduced several of them and have not yet decided which one to adopt.</span></p></li><li><p><span>About clothes: the clothing shape will already be baked into the model (it depends on the input picture, and there will be no physics simulation). It&#39;s a pity we couldn&#39;t include physics simulation, but the reconstruction alone is already complicated.</span></p></li></ul><h2 id='our-progress'><span>Our Progress</span></h2><p><span>We originally intended to implement text-driven 3D human generation. We found a very relevant paper, </span><a href='https://arxiv.org/abs/2311.17061'><span>HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting</span></a><span>, and planned to reproduce it. However, after 2-3 days of effort, we realized that the repository has a number of problems: the environment configuration leads to conflicts, and some required plugins cannot be installed on our machine.</span></p><p><span>We then turned our attention to </span><a href='https://econ.is.tue.mpg.de/'><span>ECON</span></a><span>, which does not use any NeRF or Gaussian methods; instead, it recovers a full reconstruction from a monocular image (no text prompt) through a series of steps such as estimating normals and depths, combined with the SMPL prior for the human body. We ran this work successfully, and here are some results:</span></p><p><img src="img/m1.png" referrerpolicy="no-referrer"></p><p><span>On the left are the input images, and on the right are the reconstructed people viewed from all angles.</span></p><p><span>However, this method is more &quot;traditional&quot; (it is actually quite recent, but scene-reconstruction methods such as Gaussian splatting seem more up-to-date). We therefore looked at </span><a href='https://arxiv.org/pdf/2311.16482.pdf'><span>Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars</span></a><span>, which uses Gaussian splatting to reconstruct animated human characters. We successfully ran this work as well:</span></p><p><img src="img/13.png" style="zoom:33%;" /><span> </span><img src="img/17.png" style="zoom:33%;" /><span> </span><img src="img/27.png" style="zoom:33%;" /></p><p><span>Above are the results of reconstructing the animated human body.</span></p><p><span>We then turned to another work, </span><a href='https://arxiv.org/abs/2312.03029'><span>Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians</span></a><span>, which focuses on reconstructing a dynamic head avatar that can be driven by input multi-view videos. Our tests show that the results are high-fidelity. The pipeline is shown here:</span></p><p><img src="img/overview.png" referrerpolicy="no-referrer"></p><p><span>They first optimize a guidance model consisting of a neutral mesh, a deformation MLP, and a color MLP, yielding an expressive mesh head avatar. The mesh and the MLPs then serve as the initialization of the Gaussian model and the dynamic generator, respectively. The optimization process is shown below.</span></p><p><img src="img/meshhead.gif" referrerpolicy="no-referrer"></p><p><span>They then use the Gaussian model and the dynamic generator to produce an expressive Gaussian head, which is fed to a super-resolution up-sampling network that renders a 2K RGB image; this image is compared with the ground truth to compute the loss (a rough sketch of this training loop is included at the end of this page). The first 7% of the process is shown below:</span></p><p><img src="img/gshead.gif" referrerpolicy="no-referrer"></p><p><span>Shown here are the rendered result of the trained raw Gaussian head (left) and the high-resolution result (right). The results are high-fidelity.</span></p><p><img src="img/high-fidelity.jpg" style="zoom:20%;" /><span> </span><img src="img/hi-fid-hi-res.jpg" style="zoom:20%;" /><span> </span></p><p><span>After training, the Gaussian avatar can be reenacted with expression coefficients, as shown below.</span></p><p><img src="img/reenactment.gif" referrerpolicy="no-referrer"></p><p><span>We searched for more literature on text-driven 3D human generation on platforms such as Google Scholar and found that, because text-driven generation is still relatively new and 3D Gaussian splatting has only just been proposed, there is very little work combining the two, which makes it difficult to build on. We are therefore considering whether to drop text-driven generation and instead focus on single-view or multi-view reconstruction, which has far more prior work to draw on.</span></p><p>&nbsp;</p><h2 id='reflection-and-update-plan'><span>Reflection and Update Plan</span></h2><ul><li><p><span>So far we have been confined to reproducing various existing methods and have not yet settled on our main direction. We will try to finalize our route over the next week.</span></p></li><li><p><span>Our topic is fairly difficult, and we are not yet sure how to innovate beyond replicating what others have done. We will continue to explore.</span></p></li></ul><p>&nbsp;</p></div></div>
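<p><span>Appendix: to make the Gaussian Head Avatar training loop described above more concrete, here is a minimal, self-contained PyTorch sketch. It is our own simplification for illustration, not the authors&#39; code: the module names, tensor shapes, and the stand-in render step are placeholders, and a real implementation would splat 3D Gaussians and iterate over multi-view video frames.</span></p>
<pre><code class="language-python"># Minimal sketch (our simplification, NOT the authors' code) of the second-stage loop:
# Gaussian head + dynamic generator -> coarse render -> super-resolution to "2K" ->
# image loss against the ground-truth frame. All names and shapes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGaussianHead(nn.Module):
    """Stand-in for the Gaussian model + dynamic generator: maps expression
    coefficients to a coarse RGB render (a real system would splat 3D Gaussians)."""
    def __init__(self, expr_dim=64, res=128):
        super().__init__()
        self.res = res
        self.decode = nn.Linear(expr_dim, 3 * res * res)

    def forward(self, expr):
        img = self.decode(expr).view(-1, 3, self.res, self.res)
        return torch.sigmoid(img)

class ToySuperResolution(nn.Module):
    """Stand-in for the super-resolution up-sampling network (coarse render -> 2K image)."""
    def __init__(self, scale=16):  # 128 * 16 = 2048 ("2K")
        super().__init__()
        self.scale = scale
        self.refine = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, coarse):
        up = F.interpolate(coarse, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return torch.sigmoid(self.refine(up))

head, sr = ToyGaussianHead(), ToySuperResolution()
optimizer = torch.optim.Adam(list(head.parameters()) + list(sr.parameters()), lr=1e-4)

# One optimization step on a dummy frame; a real run would loop over the video dataset.
expr = torch.randn(1, 64)              # per-frame expression coefficients (drive reenactment)
gt_2k = torch.rand(1, 3, 2048, 2048)   # ground-truth 2K RGB frame
coarse = head(expr)                    # expressive Gaussian-head render (coarse)
pred_2k = sr(coarse)                   # up-sampled 2K prediction
loss = F.l1_loss(pred_2k, gt_2k)       # photometric loss against ground truth
optimizer.zero_grad()
loss.backward()
optimizer.step()
</code></pre>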
</body>
</html>
