PhD Position F/M: Computer Vision / Deep Learning: Video Generation

Inria

Sophia Antipolis, 06560

The job offer description below is in English.

Contract type: Fixed-term contract (CDD)

Required level of qualification: Master's degree (Bac + 5) or equivalent

Function: Doctoral student

About the research center or functional department

The Inria Université Côte d'Azur center comprises 36 research teams and 7 support departments. The center's staff (about 500 people, including 320 Inria employees) is made up of scientists of different nationalities (250 foreigners of 50 nationalities), engineers, technicians, and administrative staff. One third of the staff are civil servants; the others are contractual agents. The majority of the center's research teams are located in Sophia Antipolis and Nice, in the Alpes-Maritimes. Four teams are based in Montpellier, and two teams are hosted in Bologna (Italy) and Athens (Greece). The center is a founding member of Université Côte d'Azur and a partner of the I-site MUSE, supported by the University of Montpellier.

Context and advantages of the position

Inria, the French National Institute for computer science and applied mathematics, promotes “scientific excellence for technology transfer and society”. Graduates from the world’s top universities, Inria's 2,700 employees rise to the challenges of digital sciences. With its open, agile model, Inria is able to explore original approaches with its partners in industry and academia and provide an efficient response to the multidisciplinary and application challenges of the digital transformation. Inria is the source of many innovations that add value and create jobs.

Team

The STARS research team combines advanced theory with cutting-edge practice, focusing on cognitive vision systems.

Team website: https://team.inria.fr/stars/

Assignment

The Inria STARS team is seeking a Ph.D. candidate with a strong background in computer vision, deep learning, and machine learning.

The candidate is expected to conduct research related to generative models, including the development of computer vision algorithms for image and video generation.

Main activities

Despite remarkable progress in generative models, a pretrained network is currently limited to generating only the single training subject or object, within the single scenario that its training data covered.

This Ph.D. thesis aims to take video generation to the next level by proposing strategies that generalize the generation ability of generative models: disentangling appearance and motion in the latent space, and further decomposing motion into primary directions, applicable to any subject in any setting. This holds the promise of supporting more complex settings that incorporate interactions between subjects and objects.
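
To make the disentanglement idea concrete, below is a minimal sketch of a video generator with separate appearance and motion latent codes, in the spirit of MoCoGAN [10] and G3AN [11]. All module names, sizes, and the toy MLP decoder are illustrative assumptions, not the team's actual architecture.

    # Sketch: video generator with disentangled appearance and motion codes,
    # in the spirit of [10, 11]. Sizes and the toy MLP decoder are assumptions.
    import torch
    import torch.nn as nn

    class DisentangledVideoGenerator(nn.Module):
        def __init__(self, za_dim=128, zm_dim=64, hidden=256, frame_dim=3 * 64 * 64):
            super().__init__()
            # A GRU turns per-frame motion noise into a smooth motion trajectory.
            self.motion_rnn = nn.GRU(zm_dim, hidden, batch_first=True)
            # The decoder maps (appearance code, motion state) to one frame.
            self.decoder = nn.Sequential(
                nn.Linear(za_dim + hidden, 512), nn.ReLU(),
                nn.Linear(512, frame_dim), nn.Tanh(),
            )

        def forward(self, z_appearance, z_motion):
            # z_appearance: (B, za_dim), one code per video, fixed over time.
            # z_motion:     (B, T, zm_dim), one noise vector per frame.
            B, T, _ = z_motion.shape
            motion_states, _ = self.motion_rnn(z_motion)       # (B, T, hidden)
            za = z_appearance.unsqueeze(1).expand(B, T, -1)    # broadcast over time
            frames = self.decoder(torch.cat([za, motion_states], dim=-1))
            return frames.view(B, T, 3, 64, 64)

    G = DisentangledVideoGenerator()
    video = G(torch.randn(2, 128), torch.randn(2, 16, 64))     # (2, 16, 3, 64, 64)

Keeping z_appearance fixed while resampling z_motion should preserve the subject while varying the motion; the thesis aims to push this property beyond a single training subject and scenario.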

Context:

Generative adversarial networks (GANs) [1] have witnessed increasing interest from academia and industry, owing to their exceptional capacity to generate highly realistic images [2, 3, 4, 5, 6, 7]. Videos constitute more complex data, due to the additional temporal dimension. While some research works have shown early results in video generation [8-11], many questions in the field remain open.
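
For reference, the adversarial game of [1] can be summarized in a few lines of code. The sketch below uses the common non-saturating generator loss and assumes a discriminator D that returns one logit per sample; G, D, and the optimizers are placeholders.

    # Sketch of one training step of the adversarial game from [1], using
    # the non-saturating generator loss. G, D, and the optimizers are
    # placeholders; D is assumed to return one logit per sample.
    import torch
    import torch.nn.functional as F

    def gan_step(G, D, opt_g, opt_d, real, z_dim=128):
        B = real.size(0)
        ones, zeros = torch.ones(B, 1), torch.zeros(B, 1)
        # Discriminator update: push D(real) towards 1 and D(fake) towards 0.
        fake = G(torch.randn(B, z_dim)).detach()
        d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
                  + F.binary_cross_entropy_with_logits(D(fake), zeros))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
        # Generator update: push D(G(z)) towards 1 instead of minimizing
        # log(1 - D(G(z))), which saturates early in training.
        g_loss = F.binary_cross_entropy_with_logits(D(G(torch.randn(B, z_dim))), ones)
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()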

  1. Model architecture

The thesis will first investigate how to design the generator and discriminator architectures of generative models. We will explore traditional architectures such as CNNs and RNNs, as well as Transformer-based generators. Our objective will be to explore whether we can design a unified model architecture that generalizes over categories, such as human bodies and faces, and to study how to connect different architectures in order to create such a general system for cross-category generation. A minimal sketch of one point in this design space follows.
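
The sketch below illustrates a Transformer-based temporal backbone over per-frame latent tokens, followed by a small CNN frame decoder. The dimensions, the learned positional embedding, and the decoder are assumptions made for the sake of a runnable example, not a proposed design.

    # Sketch: Transformer over per-frame latent tokens as temporal backbone,
    # with a small CNN decoder. All dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn

    class TransformerVideoGenerator(nn.Module):
        def __init__(self, z_dim=128, n_frames=16, n_layers=4, n_heads=8):
            super().__init__()
            self.pos = nn.Parameter(torch.randn(1, n_frames, z_dim))  # learned positions
            layer = nn.TransformerEncoderLayer(d_model=z_dim, nhead=n_heads,
                                               batch_first=True)
            self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
            # Tiny CNN decoder: one latent token -> one 32x32 RGB frame.
            self.decode = nn.Sequential(
                nn.Linear(z_dim, 256 * 4 * 4), nn.ReLU(),
                nn.Unflatten(1, (256, 4, 4)),
                nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),  # 4x4 -> 8x8
                nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),   # 8x8 -> 16x16
                nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),     # 16x16 -> 32x32
            )

        def forward(self, z):                     # z: (B, n_frames, z_dim) noise tokens
            tokens = self.temporal(z + self.pos)  # self-attention across time
            B, T, _ = tokens.shape
            frames = self.decode(tokens.reshape(B * T, -1))
            return frames.view(B, T, 3, 32, 32)

    G = TransformerVideoGenerator()
    video = G(torch.randn(2, 16, 128))            # (2, 16, 3, 32, 32)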

  2. 3D-aware generation

Learning 3D-aware models from 2D data has become a popular research topic in image generation. In this thesis, we will go one step further in this direction and explore novel view synthesis in video generation. We intend to jointly combine state-of-the-art novel view synthesis techniques with video generation, aiming to create 3D-aware video generation. Our idea is to explore implicit representations (e.g., NeRF), explicit representations (e.g., explicit 3D geometry), as well as hybrid (implicit-explicit) representations in video generation models. One objective will be to design an efficient and effective representation for novel view synthesis in video generation; a minimal sketch of an implicit representation is given below.
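
The following is a minimal sketch of a latent-conditioned implicit (NeRF-like) representation. NeRF's positional encoding is omitted and all sizes are assumptions; the returned color and density would be integrated along camera rays by a volume renderer, which is not shown.

    # Sketch: latent-conditioned implicit representation (NeRF-like MLP).
    # NeRF's positional encoding is omitted; sizes are assumptions.
    import torch
    import torch.nn as nn

    class ConditionalNeRF(nn.Module):
        def __init__(self, z_dim=128, hidden=256):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(3 + z_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.sigma = nn.Linear(hidden, 1)          # volume density per point
            self.rgb = nn.Sequential(                  # view-dependent color
                nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
                nn.Linear(hidden // 2, 3), nn.Sigmoid(),
            )

        def forward(self, xyz, view_dir, z):
            # xyz: (N, 3) sample points; view_dir: (N, 3); z: (N, z_dim) latent.
            h = self.trunk(torch.cat([xyz, z], dim=-1))
            density = torch.relu(self.sigma(h))
            color = self.rgb(torch.cat([h, view_dir], dim=-1))
            return color, density  # to be integrated along rays by a volume renderer

    f = ConditionalNeRF()
    rgb, sigma = f(torch.rand(1024, 3), torch.rand(1024, 3), torch.randn(1024, 128))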

  3. Generalizability

Finally, we will aim to design a universal model that is able to generate videos across categories. Most current models focus on generating a single category (e.g., faces, sky). Currently, no model is able to generate complex multi-category videos (e.g., Kinetics-600). We plan to increase the complexity of video generative models and design a large-scale video GAN. The objective is to study whether big generative models are able to capture the distribution of complex video datasets and create semantically meaningful videos; a sketch of simple label conditioning is given below.
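
One common way to cover many categories with a single model is to condition the generator on a category label. The sketch below embeds the label and concatenates it with the noise; the class count of 600 matches Kinetics-600, while every other name and size is an illustrative assumption.

    # Sketch: class-conditional generator. The label is embedded and
    # concatenated with the noise before decoding.
    import torch
    import torch.nn as nn

    class ConditionalGenerator(nn.Module):
        def __init__(self, n_classes=600, z_dim=128, emb_dim=64, out_dim=3 * 32 * 32):
            super().__init__()
            self.embed = nn.Embedding(n_classes, emb_dim)  # one vector per category
            self.net = nn.Sequential(
                nn.Linear(z_dim + emb_dim, 512), nn.ReLU(),
                nn.Linear(512, out_dim), nn.Tanh(),
            )

        def forward(self, z, labels):
            return self.net(torch.cat([z, self.embed(labels)], dim=-1))

    G = ConditionalGenerator()
    x = G(torch.randn(4, 128), torch.randint(0, 600, (4,)))  # one flat frame per sample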

 

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
[2] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019.
[3] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017.
[4] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz, “Disentangled person image generation,” in CVPR, 2018.
[5] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in ICLR, 2018.
[6] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” in CVPR, 2018.
[7] B. Zhao, L. Meng, W. Yin, and L. Sigal, “Image generation from layout,” in CVPR, 2019.

[8] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” in NIPS, 2016.
[9] M. Saito, E. Matsumoto, and S. Saito, “Temporal generative adversarial nets with singular value clipping,” in ICCV, 2017.
[10] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, “MoCoGAN: Decomposing motion and content for video generation,” in CVPR, 2018.
[11] Y. Wang, P. Bilinski, F. Bremond, and A. Dantcheva, “G3AN: Disentangling appearance and motion for video generation,” in CVPR, 2020.

Skills

Candidates must hold a Master's degree, or equivalent, in Computer Science or a closely related discipline by the start date.

The candidate must be grounded in the fundamentals of computer vision and have solid mathematical and programming skills, preferably in Python, with OpenCV and a deep learning framework such as PyTorch or TensorFlow.

The candidate must be committed to scientific research and to publishing in strong venues.

Benefits

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Contribution to mutual insurance (subject to conditions)

Remuneration

Gross salary: €2,082 per month (years 1 and 2) and €2,190 per month (year 3)

 
