This article discusses a new release of a multimodal Hunyuan Video world model called ‘HunyuanCustom’. The new paper’s breadth of coverage, combined with several problems in many of the example videos provided at the project page*, constrains us to more general coverage than usual, and to limited reproduction of the large volume of video material accompanying this release (since many of the videos require significant re-editing and processing in order to improve the readability of the layout).
Please note additionally that the paper refers to the API-based generative system Kling as ‘Keling’. For clarity, I refer to ‘Kling’ instead throughout.
Tencent is in the process of releasing a new version of its Hunyuan Video model, titled HunyuanCustom. The new release is apparently capable of making Hunyuan LoRA models redundant, by allowing the user to create ‘deepfake’-style video customization from a single image:
Click to play. Prompt: ‘A man is listening to music and cooking snail noodles in the kitchen’. The new method is compared to both closed-source and open-source methods, including Kling, which is a significant competitor in this space. Source: https://hunyuancustom.github.io/ (warning: CPU/memory-intensive site!)
In the left-most column of the video above, we see the single source image supplied to HunyuanCustom, followed by the new system’s interpretation of the prompt in the second column, next to it. The remaining columns show the results from various proprietary and FOSS systems: Kling; Vidu; Pika; Hailuo; and the Wan-based SkyReels-A2.
In the video below, we see renders of three scenarios central to this release: respectively, person + object; single-character emulation; and virtual try-on (person + clothes):
Click to play. Three examples edited from the material at the supporting site for Hunyuan Video.
We can notice a few things from these examples, mostly related to the system relying on a single source image, rather than multiple images of the same subject.
In the first clip, the man is essentially still facing the camera. He dips his head down and sideways at not much more than 20-25 degrees of rotation, but at an inclination beyond that, the system would really have to start guessing what he looks like in profile. This is hard, probably impossible to gauge accurately from a sole frontal image.
In the second example, we see that the little girl is smiling in the rendered video as she is in the single static source image. Again, with this sole image as reference, HunyuanCustom would have to make a relatively uninformed guess about what her ‘resting face’ looks like. Additionally, her face does not deviate from a camera-facing stance by any more than in the prior example (‘man eating crisps’).
In the last example, we see that because the source material – the woman and the clothes she is prompted into wearing – are not complete images, the render has cropped the scenario to fit – which is actually rather a good solution to a data issue!
The point is that though the new system can handle multiple images (such as person + crisps, or person + clothes), it does not apparently allow for multiple angles or alternative views of a single character, so that diverse expressions or unusual angles could be accommodated. To this extent, the system may therefore struggle to replace the growing ecosystem of LoRA models that has sprung up around HunyuanVideo since its release last December, since these can help HunyuanVideo to produce consistent characters from any angle and with any facial expression represented in the training dataset (20-60 images is typical).
Wired for Sound
For audio, HunyuanCustom leverages the LatentSync system (notoriously difficult for hobbyists to set up and get good results from) to obtain lip movements matched to audio and text that the user supplies:
Features audio. Click to play. Various examples of lip-sync from the HunyuanCustom supplementary site, edited together.
At the time of writing, there are no English-language examples, but those provided appear to be rather good – the more so if the method of creating them is easily installable and accessible.
Editing Existing Video
The new system offers what appear to be very impressive results for video-to-video (V2V, or Vid2Vid) editing, in which a segment of an existing (real) video is masked off and intelligently replaced by a subject given in a single reference image. Below is an example from the supplementary materials site:
Click to play. Only the central object is targeted, but what remains around it also gets altered in a HunyuanCustom vid2vid pass.
As we can see, and as is standard in a vid2vid scenario, the entire video is altered to some extent by the process, though most altered in the targeted region, i.e., the plush toy. Potentially, pipelines could be developed to create such transformations under a garbage matte approach that leaves the majority of the video content identical to the original. This is what Adobe Firefly does under the hood, and does quite well – but it is an under-studied process in the FOSS generative scene.
That said, many of the other examples provided do a better job of targeting these integrations, as we can see in the assembled compilation below:
Click to play. Diverse examples of interjected content using vid2vid in HunyuanCustom, showing notable respect for the untargeted material.
A New Start?
This initiative is a development of the Hunyuan Video project, not a hard pivot away from that development stream. The project’s enhancements are introduced as discrete architectural insertions rather than sweeping structural changes, with the aim of allowing the model to maintain identity fidelity across frames without relying on subject-specific fine-tuning, as with LoRA or textual inversion approaches.
To be clear, therefore, HunyuanCustom is not trained from scratch, but rather is a fine-tuning of the December 2024 HunyuanVideo foundation model.
Those who have developed HunyuanVideo LoRAs may wonder whether they will still work with this new edition, or whether they will have to reinvent the LoRA wheel yet again if they want more customization capabilities than are built into this new release.
In general, a heavily fine-tuned release of a hyperscale model alters the model weights enough that LoRAs made for the earlier model will not work properly, or at all, with the newly-refined model.
Sometimes, however, a fine-tune’s popularity can challenge its origins: one example of a fine-tune becoming an effective fork, with a dedicated ecosystem and following of its own, is the Pony Diffusion tuning of Stable Diffusion XL (SDXL). Pony currently has 592,000+ downloads on the ever-changing CivitAI domain, with a vast range of LoRAs that have used Pony (and not SDXL) as the base model, and which require Pony at inference time.
Releasing
The project page for the new paper (which is titled HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation) features links to a GitHub site that, as I write, has just become functional, and appears to contain all code and the necessary weights for local implementation, along with a proposed timeline (where the only significant item yet to come is ComfyUI integration).
At the time of writing, the project’s Hugging Face presence is still a 404. There is, however, an API-based version where one can apparently demo the system, so long as you can provide a WeChat scan code.
I have rarely seen such an elaborate and extensive usage of such a wide variety of projects in a single assembly as is evident in HunyuanCustom – and presumably some of the licenses would in any case oblige a full release.
Two models are offered at the GitHub page: a 720px1280px version requiring 80GB of GPU peak memory, and a 512px896px version requiring 60GB of GPU peak memory.
The repository states ‘The minimum GPU memory required is 24GB for 720px1280px129f but very slow…We recommend using a GPU with 80GB of memory for better generation quality’ – and notes that the system has so far only been tested on Linux.
The earlier Hunyuan Video model has, since official release, been quantized down to sizes where it can be run on less than 24GB of VRAM, and it seems reasonable to assume that the new model will likewise be adapted into more consumer-friendly forms by the community, and that it will quickly be adapted for use on Windows systems too.
Due to time constraints and the overwhelming volume of information accompanying this release, we can only take a broader, rather than in-depth, look at it. Nonetheless, let’s pop the hood on HunyuanCustom a little.
A Look at the Paper
The data pipeline for HunyuanCustom, apparently compliant with the GDPR framework, incorporates both synthesized and open-source video datasets, including OpenHumanVid, with eight core categories represented: humans, animals, plants, landscapes, vehicles, objects, architecture, and anime.

From the release paper, an overview of the diverse contributing packages in the HunyuanCustom data construction pipeline. Source: https://arxiv.org/pdf/2505.04512
Initial filtering begins with PySceneDetect, which segments videos into single-shot clips. TextBPN-Plus-Plus is then used to remove videos containing excessive on-screen text, subtitles, watermarks, or logos.
To address inconsistencies in resolution and duration, clips are standardized to five seconds in length and resized to 512 or 720 pixels on the short side. Aesthetic filtering is handled using Koala-36M, with a custom threshold of 0.06 applied for the dataset curated by the new paper’s researchers.
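For orientation, here is a minimal sketch of how such a curation stage could be chained together. The helper functions are stand-ins for PySceneDetect, TextBPN-Plus-Plus and Koala-36M; only the five-second standardization, the 512/720px short side and the 0.06 aesthetic threshold come from the paper, everything else is illustrative:

```python
# Illustrative curation pass, not the authors' code. The three helpers below are
# stand-ins for PySceneDetect, TextBPN-Plus-Plus and Koala-36M respectively.
from dataclasses import dataclass

CLIP_SECONDS = 5.0        # clips standardized to five seconds
SHORT_SIDES = (512, 720)  # short side resized to 512 or 720 pixels (downstream step)
AESTHETIC_MIN = 0.06      # custom Koala-36M threshold cited in the paper

@dataclass
class Clip:
    path: str
    start: float
    end: float

def detect_scenes(path: str) -> list[tuple[float, float]]:
    return [(0.0, 12.0), (12.0, 14.5)]    # stand-in for PySceneDetect shot boundaries

def has_text_overlays(clip: Clip) -> bool:
    return False                          # stand-in for TextBPN-Plus-Plus detection

def aesthetic_score(clip: Clip) -> float:
    return 0.5                            # stand-in for Koala-36M scoring

def curate(path: str) -> list[Clip]:
    kept = []
    for start, end in detect_scenes(path):             # single-shot segmentation
        if end - start < CLIP_SECONDS:                  # too short to standardize
            continue
        clip = Clip(path, start, start + CLIP_SECONDS)  # truncate to five seconds
        if has_text_overlays(clip):                     # subtitles, watermarks, logos
            continue
        if aesthetic_score(clip) < AESTHETIC_MIN:       # aesthetic quality gate
            continue
        kept.append(clip)
    return kept

print(curate("example.mp4"))    # [Clip(path='example.mp4', start=0.0, end=5.0)]
```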
The subject extraction process combines the Qwen7B Large Language Model (LLM), the YOLO11X object recognition framework, and the popular InsightFace architecture, to identify and validate human identities.
For non-human subjects, QwenVL and Grounded SAM 2 are used to extract relevant bounding boxes, which are discarded if too small.
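The paper only says that overly small boxes are dropped; a plausible relative-area gate (the 1% cut-off is my assumption) might look like this:

```python
# Hypothetical size gate for non-human subject boxes extracted via QwenVL and
# Grounded SAM 2; the 1% relative-area cut-off is an assumption, since the paper
# only states that boxes which are too small are discarded.
def keep_box(box: tuple[float, float, float, float],
             img_w: int, img_h: int,
             min_area_frac: float = 0.01) -> bool:
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0) >= min_area_frac * img_w * img_h

print(keep_box((100, 100, 400, 300), 1280, 720))   # True: box covers ~6.5% of the frame
print(keep_box((10, 10, 40, 40), 1280, 720))       # False: box covers ~0.1% of the frame
```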

Examples of semantic segmentation with Grounded SAM 2, used in the Hunyuan Control project. Source: https://github.com/IDEA-Research/Grounded-SAM-2
Multi-subject extraction uses Florence2 for bounding box annotation, and Grounded SAM 2 for segmentation, followed by clustering and temporal segmentation of training frames.
The processed clips are further enhanced via annotation, using a proprietary structured-labeling system developed by the Hunyuan team, which furnishes layered metadata such as descriptions and camera motion cues.
Mask augmentation strategies, including conversion to bounding boxes, were applied during training to reduce overfitting and ensure that the model adapts to diverse object shapes.
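As a concrete illustration of the bounding-box conversion, a minimal PyTorch-style augmentation could read as follows (the 50% probability is an assumption; the paper does not give one):

```python
import torch

def augment_mask(mask: torch.Tensor, p_box: float = 0.5) -> torch.Tensor:
    """With probability p_box, replace a binary subject mask (H, W) with its
    filled bounding box, so the model cannot overfit to exact silhouettes.
    Illustrative only; the 50% probability is an assumption."""
    if torch.rand(()) > p_box or mask.sum() == 0:
        return mask
    ys, xs = torch.nonzero(mask, as_tuple=True)
    box = torch.zeros_like(mask)
    box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
    return box
```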
Audio data was synchronized using the aforementioned LatentSync, and clips discarded if synchronization scores fell below a minimum threshold.
The blind image quality assessment framework HyperIQA was used to exclude videos scoring below 40 (on HyperIQA’s bespoke scale). Valid audio tracks were then processed with Whisper to extract features for downstream tasks.
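A sketch of that gating logic follows, with stand-in scoring functions for LatentSync, HyperIQA and Whisper; only the HyperIQA cut-off of 40 comes from the paper, and the sync threshold is an assumption:

```python
# Illustrative audio/quality gate. sync_confidence, hyper_iqa and whisper_features
# are stand-ins for LatentSync scoring, HyperIQA and Whisper feature extraction.
IQA_MIN = 40.0     # HyperIQA threshold stated in the paper
SYNC_MIN = 0.5     # assumed minimum lip-sync confidence

def sync_confidence(clip_path: str) -> float:
    return 0.8                        # stand-in for a LatentSync synchronization score

def hyper_iqa(clip_path: str) -> float:
    return 55.0                       # stand-in for a HyperIQA quality score

def whisper_features(clip_path: str) -> list[float]:
    return [0.0] * 128                # stand-in for Whisper audio features

def process_audio(clip_path: str):
    if sync_confidence(clip_path) < SYNC_MIN:
        return None                   # lip-sync too weak: discard the clip
    if hyper_iqa(clip_path) < IQA_MIN:
        return None                   # image quality below threshold: discard
    return whisper_features(clip_path)  # keep features for downstream tasks
```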
The authors incorporate the LLaVA language assistant model during the annotation phase, and they emphasize the central position that this framework holds in HunyuanCustom. LLaVA is used to generate image captions and to assist in aligning visual content with text prompts, supporting the construction of a coherent training signal across modalities:

The HunyuanCustom framework supports identity-consistent video generation conditioned on text, image, audio, and video inputs.
By leveraging LLaVA’s vision-language alignment capabilities, the pipeline gains an additional layer of semantic consistency between visual elements and their textual descriptions – especially valuable in multi-subject or complex-scene scenarios.
Customized Video
To enable video generation based on a reference image and a prompt, the two modules centered around LLaVA were created, first adapting the input structure of HunyuanVideo so that it could accept an image alongside text.
This involved formatting the prompt in a way that embeds the image directly, or tags it with a short identity description. A separator token was used to prevent the image embedding from overwhelming the prompt content.
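A toy illustration of what such a template could look like is shown below; the <image> and <SEP> tokens, the ‘Identity:’ tag and the wording are all assumptions, since the paper does not publish its exact template – only the idea of a separator between the image/identity information and the prompt is drawn from the description above:

```python
# Hypothetical prompt assembly, not the authors' exact tokens or template.
def build_prompt(identity_desc: str, user_prompt: str) -> str:
    # Image embedding first (LLaVA-style), then a short identity description,
    # then a separator so the image does not overwhelm the text instruction.
    return f"<image> Identity: {identity_desc} <SEP> {user_prompt}"

print(build_prompt("a young woman with short black hair",
                   "A woman is cooking snail noodles in the kitchen"))
```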
Since LLaVA’s visual encoder tends to compress or discard fine-grained spatial details during the alignment of image and text features (particularly when translating a single reference image into a general semantic embedding), an identity enhancement module was incorporated. Since almost all video latent diffusion models have some difficulty maintaining an identity without a LoRA, even in a five-second clip, the performance of this module in community testing may prove significant.
Finally, the reference image is resized and encoded using the causal 3D-VAE from the original HunyuanVideo model, and its latent inserted into the video latent across the temporal axis, with a spatial offset applied to prevent the image from being directly reproduced in the output, while still guiding generation.
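One plausible reading of that mechanism, in PyTorch pseudo-form (the channel count, latent shapes, the use of a spatial roll, and the shift value are all assumptions on my part):

```python
import torch

def inject_reference(video_latent: torch.Tensor,
                     ref_latent: torch.Tensor,
                     shift: int = 8) -> torch.Tensor:
    """Concatenate a single-frame reference latent ahead of the video latent along
    the temporal axis, after rolling it spatially so that it guides identity without
    being reproduced verbatim. Shapes, the roll, and the shift value are assumptions.
    video_latent: (C, T, H, W); ref_latent: (C, 1, H, W)."""
    ref_shifted = torch.roll(ref_latent, shifts=(shift, shift), dims=(-2, -1))
    return torch.cat([ref_shifted, video_latent], dim=1)    # -> (C, T + 1, H, W)

out = inject_reference(torch.randn(16, 33, 45, 80), torch.randn(16, 1, 45, 80))
print(out.shape)    # torch.Size([16, 34, 45, 80])
```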
The model was trained using Flow Matching, with noise samples drawn from a logit-normal distribution – and the network was trained to recover the correct video from these noisy latents. LLaVA and the video generator were both fine-tuned together so that the image and prompt could guide the output more fluently and keep the subject identity consistent.
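For readers unfamiliar with the objective, a generic flow-matching loss with logit-normal timesteps looks roughly like this; it is a sketch in the spirit of the training described, not the released training code, and the stand-in model is purely illustrative:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1: torch.Tensor, cond=None) -> torch.Tensor:
    """Generic rectified-flow objective with logit-normal timesteps; a sketch,
    not HunyuanCustom's actual training loop."""
    b = x1.shape[0]
    t = torch.sigmoid(torch.randn(b, device=x1.device))   # logit-normal times in (0, 1)
    t_ = t.view(b, *([1] * (x1.dim() - 1)))
    x0 = torch.randn_like(x1)                              # pure noise sample
    xt = (1.0 - t_) * x0 + t_ * x1                         # point on the straight path
    target = x1 - x0                                       # velocity the network must recover
    return F.mse_loss(model(xt, t, cond), target)

# Toy usage with a stand-in velocity predictor that ignores its inputs:
toy_model = lambda xt, t, cond: torch.zeros_like(xt)
print(flow_matching_loss(toy_model, torch.randn(2, 4, 8, 8)))
```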
For multi-subject prompts, each image-text pair was embedded separately and assigned a distinct temporal position, allowing identities to be distinguished, and supporting the generation of scenes involving multiple interacting subjects.
Sound and Vision
HunyuanCustom conditions audio/speech generation on both user-input audio and a text prompt, allowing characters to speak within scenes that reflect the described setting.
To support this, an Identity-disentangled AudioNet module introduces audio features without disrupting the identity signals embedded from the reference image and prompt. These features are aligned with the compressed video timeline, divided into frame-level segments, and injected using a spatial cross-attention mechanism that keeps each frame isolated, preserving subject consistency and avoiding temporal interference.
A second temporal injection module provides finer control over timing and motion, working in tandem with AudioNet, mapping audio features to specific regions of the latent sequence, and using a Multi-Layer Perceptron (MLP) to convert them into token-wise motion offsets. This allows gestures and facial movement to follow the rhythm and emphasis of the spoken input with greater precision.
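A minimal sketch of those two audio pathways follows – per-frame cross-attention plus an MLP producing token-wise offsets – with all dimensions, layer choices and the exact injection point being assumptions rather than the released architecture:

```python
import torch
from torch import nn

class AudioInjection(nn.Module):
    """Sketch of the two audio pathways described above: per-frame spatial
    cross-attention between video tokens and that frame's audio features, plus an
    MLP turning audio features into token-wise motion offsets. Dimensions and
    layer choices are assumptions."""
    def __init__(self, dim: int = 1024, audio_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)
        self.offset_mlp = nn.Sequential(nn.Linear(audio_dim, dim),
                                        nn.SiLU(),
                                        nn.Linear(dim, dim))

    def forward(self, video_tokens: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T, N, D) latent tokens per compressed frame
        # audio_feats:  (B, T, A)    audio features aligned to the same frames
        B, T, N, D = video_tokens.shape
        v = video_tokens.reshape(B * T, N, D)               # keep each frame isolated
        a = audio_feats.reshape(B * T, 1, -1)
        attended, _ = self.attn(v, a, a)                     # spatial cross-attention per frame
        offsets = self.offset_mlp(audio_feats).unsqueeze(2)  # (B, T, 1, D) motion offsets
        return video_tokens + attended.reshape(B, T, N, D) + offsets

out = AudioInjection()(torch.randn(1, 8, 16, 1024), torch.randn(1, 8, 768))
print(out.shape)    # torch.Size([1, 8, 16, 1024])
```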
HunyuanCustom allows subjects in existing videos to be edited directly, replacing or inserting people or objects into a scene without having to rebuild the entire clip from scratch. This makes it useful for tasks that involve altering appearance or motion in a targeted way.
Click to play. A further example from the supplementary site.
To facilitate efficient subject replacement in existing videos, the new system avoids the resource-intensive approach of recent methods such as the currently-popular VACE, or those that merge entire video sequences together, favoring instead the compression of a reference video using the pretrained causal 3D-VAE – aligning it with the generation pipeline’s internal video latents, and then adding the two together. This keeps the process relatively lightweight, while still allowing external video content to guide the output.
A small neural network handles the alignment between the clean input video and the noisy latents used in generation. The system tests two ways of injecting this information: merging the two sets of features before compressing them again; and adding the features frame by frame. The second approach works better, the authors found, and avoids quality loss while keeping the computational load unchanged.
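That frame-wise addition scheme can be sketched as below; the channel count and the choice of a 1x1 convolution as the ‘small alignment network’ are assumptions, since the paper does not specify the layer:

```python
import torch
from torch import nn

class VideoConditioner(nn.Module):
    """Sketch of the frame-wise addition scheme the authors found preferable: a small
    alignment network projects the clean reference-video latents, which are then added
    to the noisy generation latents frame by frame. Channel count and the 1x1 conv
    are assumptions."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.align = nn.Conv3d(channels, channels, kernel_size=1)   # lightweight alignment net

    def forward(self, noisy_latent: torch.Tensor, ref_latent: torch.Tensor) -> torch.Tensor:
        # Both latents: (B, C, T, H, W), already compressed by the causal 3D-VAE.
        return noisy_latent + self.align(ref_latent)                # per-frame addition

fused = VideoConditioner()(torch.randn(1, 16, 8, 32, 32), torch.randn(1, 16, 8, 32, 32))
print(fused.shape)    # torch.Size([1, 16, 8, 32, 32])
```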
Data and Tests
In tests, the metrics used were: the identity consistency module in ArcFace, which extracts facial embeddings from both the reference image and each frame of the generated video, and then calculates the average cosine similarity between them; subject similarity, via sending YOLO11x segments to Dino 2 for comparison; CLIP-B, for text-video alignment, which measures similarity between the prompt and the generated video; CLIP-B again, to calculate similarity between each frame and both its neighboring frames and the first frame, as a measure of temporal consistency; and dynamic degree, as defined by VBench.
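The two similarity-based metrics reduce to simple averaged cosine similarities; a sketch of how they could be computed (the embeddings would in practice come from ArcFace and CLIP-B, here they are plain tensors) is:

```python
import torch
import torch.nn.functional as F

def identity_consistency(ref_embed: torch.Tensor, frame_embeds: torch.Tensor) -> torch.Tensor:
    """Face-Sim as described: mean cosine similarity between the reference image's
    face embedding (D,) and every generated frame's embedding (T, D)."""
    return F.cosine_similarity(frame_embeds, ref_embed.unsqueeze(0), dim=-1).mean()

def temporal_consistency(frame_embeds: torch.Tensor) -> torch.Tensor:
    """Temp-Consis sketch: average similarity of each frame embedding to both the
    first frame and its preceding neighbour, following the description above."""
    to_first = F.cosine_similarity(frame_embeds[1:], frame_embeds[:1], dim=-1)
    to_prev = F.cosine_similarity(frame_embeds[1:], frame_embeds[:-1], dim=-1)
    return torch.cat([to_first, to_prev]).mean()

print(identity_consistency(torch.randn(512), torch.randn(24, 512)))
print(temporal_consistency(torch.randn(24, 512)))
```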
As indicated earlier, the baseline closed-source competitors were Hailuo; Vidu 2.0; Kling (1.6); and Pika. The competing FOSS frameworks were VACE and SkyReels-A2.

Model performance evaluation comparing HunyuanCustom with leading video customization methods across ID consistency (Face-Sim), subject similarity (DINO-Sim), text-video alignment (CLIP-B-T), temporal consistency (Temp-Consis), and motion intensity (DD). Optimal and sub-optimal results are shown in bold and underlined, respectively.
Of these results, the authors state:
‘Our [HunyuanCustom] achieves the best ID consistency and subject consistency. It also achieves comparable results in prompt following and temporal consistency. [Hailuo] has the best clip score because it can follow text instructions well with only ID consistency, sacrificing the consistency of non-human subjects (the worst DINO-Sim). In terms of Dynamic-degree, [Vidu] and [VACE] perform poorly, which may be due to the small size of the model.’
Though the project site is saturated with comparison videos (whose layout seems to have been designed for site aesthetics rather than easy comparison), it does not currently feature a video equivalent of the static results crammed together in the PDF for the initial qualitative tests. Though I include it here, I urge the reader to make a close examination of the videos at the project site, as they give a better impression of the results:

From the paper, a comparison on object-centered video customization. Though the viewer should (as ever) refer to the source PDF for better resolution, the videos at the project site may be a more illuminating resource in this case.
The authors comment here:
‘It can be seen that [Vidu], [Skyreels A2] and our method achieve relatively good results in prompt alignment and subject consistency, but our video quality is better than Vidu and Skyreels, thanks to the good video generation performance of our base model, i.e., [Hunyuanvideo-13B].
‘Among commercial products, although [Kling] has good video quality, the first frame of the video has a copy-paste [problem], and sometimes the subject moves too fast and [blurs], leading to a poor viewing experience.’
The authors further comment that Pika performs poorly in terms of temporal consistency, introducing subtitle artifacts (an effect of poor data curation, where text elements in video clips have been allowed to pollute the core concepts).
Hailuo maintains facial identity, they state, but fails to preserve full-body consistency. Among open-source methods, VACE, the researchers assert, is unable to maintain identity consistency, whereas they contend that HunyuanCustom produces videos with strong identity preservation, while retaining quality and diversity.
Next, tests were conducted for multi-subject video customization, against the same contenders. As in the previous example, the flattened PDF results are not print equivalents of videos available at the project site, but are unique among the results presented:

Comparisons using multi-subject video customizations. Please see PDF for better detail and resolution.
The paper states:
‘[Pika] can generate the specified subjects but exhibits instability in video frames, with instances of a man disappearing in one scenario and a woman failing to open a door as prompted. [Vidu] and [VACE] partially capture human identity but lose significant details of non-human objects, indicating a limitation in representing non-human subjects.
‘[SkyReels A2] experiences severe frame instability, with noticeable changes in the chips and numerous artifacts in the right-hand scenario.
‘In contrast, our HunyuanCustom effectively captures both human and non-human subject identities, generates videos that adhere to the given prompts, and maintains high visual quality and stability.’
A further experiment was ‘virtual human advertisement’, in which the frameworks were tasked with integrating a product with a person:

From the qualitative testing round, examples of neural ‘product placement’. Please see PDF for better detail and resolution.
For this round, the authors state:
‘The [results] demonstrate that HunyuanCustom effectively maintains the identity of the human while preserving the details of the target product, including the text on it.
‘Furthermore, the interaction between the human and the product appears natural, and the video adheres closely to the given prompt, highlighting the substantial potential of HunyuanCustom in generating advertisement videos.’
One area where video results would have been very useful was the qualitative round for audio-driven subject customization, where the character speaks the corresponding audio in a text-described scene and posture.

Partial results given for the audio round – though video results would have been preferable in this case. Only the top half of the PDF figure is reproduced here, as it is large and difficult to accommodate in this article. Please refer to the source PDF for better detail and resolution.
The authors assert:
‘Previous audio-driven human animation methods input a human image and an audio, where the human posture, attire, and environment remain consistent with the given image and cannot generate videos in other gestures and environments, which may [restrict] their application.
‘…[Our] HunyuanCustom enables audio-driven human customization, where the character speaks the corresponding audio in a text-described scene and posture, allowing for more flexible and controllable audio-driven human animation.’
Further tests (please see the PDF for all details) included a round pitting the new system against VACE and Kling 1.6 for video subject replacement:

Testing subject replacement in video-to-video mode. Please refer to the source PDF for better detail and resolution.
Of these, the final tests presented in the new paper, the researchers opine:
‘VACE suffers from boundary artifacts due to strict adherence to the input mask, resulting in unnatural subject shapes and disrupted motion continuity. [Kling], in contrast, exhibits a copy-paste effect, where subjects are directly overlaid onto the video, leading to poor integration with the background.
‘In comparison, HunyuanCustom effectively avoids boundary artifacts, achieves seamless integration with the video background, and maintains strong identity preservation – demonstrating its superior performance in video editing tasks.’
Conclusion
This is a fascinating release, not least because it addresses something that the ever-discontent hobbyist scene has been complaining about more lately – the lack of lip-sync, so that the increased realism possible in systems such as Hunyuan Video and Wan 2.1 might be given a new dimension of authenticity.
Though the layout of nearly all the comparative video examples at the project site makes it rather difficult to compare HunyuanCustom’s capabilities against prior contenders, it should be noted that very, very few projects in the video synthesis space have the courage to pit themselves in tests against Kling, the commercial video diffusion API which is always hovering at or near the top of the leaderboards; Tencent appears to have made headway against this incumbent in a rather impressive way.
* The issue being that some of the videos are so large, short, and high-resolution that they will not play in standard video players such as VLC or Windows Media Player, showing black screens instead.
First published Thursday, May 8, 2025