Engineering
Dec 26, 2024
Developing KIMCHI: An AI Image Generation Model Reflecting Korean Culture
Yonggeun Kwon
Research Intern
Introduction
Lablup's research team works on a variety of examples built on Backend.AI. We develop AI Agents that automatically generate CLI commands inside Backend.AI from a user's instructions, and we also build demos for exhibitions. For example, 'VisuTale', a service that generates fairy tales and matching illustrations from photos supplied by users, is the work of our research team. VisuTale has been shown at various exhibitions as an example of an AI service that can be developed with Backend.AI, and it has been a hit with audiences.
During a recent internal study, we realized that AI-powered image generation models often fail to capture specific cultural contexts or languages. This is because global AI models are typically trained on English text or on generic, general-purpose data, so when a region or culture is mentioned in a prompt, its unique characteristics are often not represented. When we examined different models, we found that images generated from Korean text prompts did not fully represent Korean identity.
For example, popular models such as Stable Diffusion v1.5 and DALL-E 3 either fail to interpret Korean text prompts correctly or do not fully reflect Korean identity in the generated images.
- Stable Diffusion v1.5 generated an image of a building instead of a bowl of ramyeon for the Korean prompt “Draw me a Korean Ramyeon”, indicating that the model did not handle the Korean input correctly.
- DALL-E 3 did generate a noodle dish for the same prompt, but the result looked more like Japanese ramen than Korean ramyeon and did not capture the Korean feel.
Typing the prompt in English isn't the answer either: sometimes there is no one-to-one Korean-to-English translation, or the translator misses the context and produces the wrong rendering. For example, if you type the dish “Gochujang-jinmichaeboekum” into a translator, it comes back as “stir-fried vegetables with red pepper paste”, which is not an accurate description of the food.
As a solution to this problem, KIMCHI (Korean dIffusion Model adapting Culture and HIstory), an image generation model that fully reflects Korean culture and identity, was born.
Making 'KIMCHI'
(Image source: Wikipedia & Lotte Group)
Lablup participates in overseas exhibitions and conferences such as CES, MWC, GTC, and SC, so there are many opportunities to showcase demos. We believe 'KIMCHI' can serve as a good example of a culture-specific AI model in these settings. If Lablup's 'KIMCHI' is shown generating customized images that reflect cultural characteristics in real time, it will attract attention overseas and effectively communicate Lablup's differentiation. Furthermore, KIMCHI can be used as marketing material at exhibitions and provide differentiated content.
'KIMCHI' was developed with two goals: (1) to understand Korean prompts directly and produce accurate images without the need for translation, and (2) to generate images that reflect Korean culture and identity. If we achieve both, the model should be able to generate Korean ramyeon noodles, or Lotte World Tower standing in a Korean city center, in response to the prompts “Draw me ramyeon noodles in Korea” or “Draw me Lotte World Tower in Korea”.
In this article, we describe how 'KIMCHI' was developed, including its dataset preprocessing framework, and discuss its limitations.
Preparing Dataset
Datasets collected from AIHub, such as the Korean Food dataset, the Korean Historical Building and Landmark dataset, and the external-knowledge-base multimodal Q&A dataset with Korean images, were used to train 'KIMCHI'.
Processing the Data
1. VLM (Vision Language Model) processing: First, we feed the prepared image, together with a prompt, into a VLM called LLaVA. A VLM is a multimodal AI model that can understand images and text together, meaning it analyzes the information in the image and expresses it as human-readable text. For example, it can look at a photo and generate an English caption such as “A plate of food with a red sauce and a green vegetable.”
2. LLM (Large Language Model) refinement: The English caption is then passed to LLaMA, a language model. LLaMA replaces the English caption with a more detailed and accurate Korean description. If the original English caption was “A plate of food with a red sauce and a green vegetable.” and contained only visual information about the image, LLaMA turns it into a Korean sentence that reads, “Kimchi, a traditional Korean dish made with chunggat tossed in red sauce on a white plate.”
Through these two processes, Korean captions are generated that best describe the existing image. These captions are then used to train the image generation model.
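To make the pipeline concrete, here is a minimal sketch of the two-step captioning process. It assumes publicly available Hugging Face checkpoints (llava-hf/llava-1.5-7b-hf and meta-llama/Llama-2-7b-chat-hf) as stand-ins for the models we actually used; the image file name and prompt wording are illustrative only.

```python
# Two-step captioning sketch: VLM caption in English, then LLM refinement into Korean.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 1: the VLM generates an English caption describing the image.
vlm_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(vlm_id)
vlm = LlavaForConditionalGeneration.from_pretrained(vlm_id, torch_dtype=torch.float16).to(device)

image = Image.open("gat_kimchi.jpg")  # placeholder dataset image
prompt = "USER: <image>\nDescribe this dish in one sentence. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)
caption_ids = vlm.generate(**inputs, max_new_tokens=64)
english_caption = processor.decode(caption_ids[0], skip_special_tokens=True)

# Step 2: the LLM rewrites the English caption as a richer Korean description,
# optionally grounded with metadata from the dataset (e.g. the known dish name).
llm = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device_map="auto")
instruction = (
    "Rewrite the following English image caption as a detailed Korean description "
    f"of the Korean dish 'Gat-Kimchi':\n{english_caption}\n"
)
korean_caption = llm(instruction, max_new_tokens=128)[0]["generated_text"]
print(korean_caption)  # output still contains the instruction prefix; strip it in practice
```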
Training a model
The Korean descriptions generated from this process were used as text prompts and utilized to train the Diffusion model.
In the image above, we obtained the Korean description of Gat-Kimchi (a Korean food that is a variant of kimchi): “a traditional Korean dish made of chunggat tossed in red sauce on a white plate”. Since fully fine-tuning the diffusion model would be very expensive, we applied the Low-Rank Adaptation (LoRA) technique to improve training efficiency, and used the text encoder of the Korean-CLIP model to encode Korean prompts. Korean-CLIP is a CLIP model further trained on Korean-English parallel data. Through this process, we were able to fine-tune the model and develop 'KIMCHI', which understands Korean prompts and generates images that naturally reflect Korean culture.
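Below is a condensed sketch of what this LoRA fine-tuning step can look like with diffusers and peft. The Korean-CLIP checkpoint name, the LoRA hyperparameters, and the single-batch placeholder dataloader are assumptions for illustration, not the exact configuration used to train KIMCHI.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from peft import LoraConfig
from transformers import AutoTokenizer, CLIPTextModel

base = "runwayml/stable-diffusion-v1-5"
ko_clip = "Bingsu/clip-vit-large-patch14-ko"  # hypothetical Korean-CLIP checkpoint name

unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
noise_scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")
tokenizer = AutoTokenizer.from_pretrained(ko_clip)
text_encoder = CLIPTextModel.from_pretrained(ko_clip)

# Freeze all base weights; only the low-rank adapters injected into the
# UNet's attention projections are trained.
unet.requires_grad_(False)
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.add_adapter(LoraConfig(r=8, lora_alpha=8,
                            target_modules=["to_q", "to_k", "to_v", "to_out.0"]))
optimizer = torch.optim.AdamW(
    [p for p in unet.parameters() if p.requires_grad], lr=1e-4)

# Placeholder batch: in practice this is a dataloader of (image, Korean caption) pairs.
dataloader = [(torch.randn(1, 3, 512, 512), ["흰 접시에 담긴 갓김치"])]

for images, korean_captions in dataloader:
    # Encode images into latents and Korean captions into text embeddings.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    tokens = tokenizer(korean_captions, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    text_embeds = text_encoder(tokens.input_ids)[0]

    # Standard diffusion objective: predict the noise added at a random timestep.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeds).sample

    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```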
Evaluating experiments and performance
Image generation experiments
Experiment (1) - Representing cultural elements
The comparison experiment above shows the results of DALL-E 3, ChatGPT-4o, Stable Diffusion 2.1 (using English translation prompts), and the KIMCHI model generating images based on the same Korean prompts alongside real images.
While the other models produce plausible images, KIMCHI captures the texture and ingredients of the food, the details of traditional architecture, and more, without looking artificial. This shows that KIMCHI is trained to accurately understand Korean text and generate natural Korean images without the need for translation.
Experiment(2): Preserving the ability to create generic images
The image above shows a validation of whether KIMCHI retains the generation ability of the pretrained base model after fine-tuning. KIMCHI was trained using Stable Diffusion 1.5 as its base model, so we compared the two side by side.
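For readers who want to reproduce this kind of side-by-side comparison, the sketch below generates an image from the same Korean prompt with plain Stable Diffusion 1.5 and then with a Korean text encoder and a LoRA adapter applied. The checkpoint names, adapter path, and prompt are placeholders rather than the exact assets used in our experiments.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import AutoTokenizer, CLIPTextModel

prompt = "한국의 라면을 그려줘"  # "Draw me Korean ramyeon"

# Baseline: plain Stable Diffusion 1.5, whose original CLIP text encoder
# handles Korean prompts poorly.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe(prompt).images[0].save("baseline.png")

# KIMCHI-style setup: same base model, but with a Korean text encoder swapped
# in and the trained LoRA adapter loaded (names below are placeholders).
ko_clip = "Bingsu/clip-vit-large-patch14-ko"
pipe.tokenizer = AutoTokenizer.from_pretrained(ko_clip)
pipe.text_encoder = CLIPTextModel.from_pretrained(
    ko_clip, torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("kimchi-lora")  # placeholder path to the adapter weights
pipe(prompt).images[0].save("kimchi.png")
```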
Evaluating performance
Result(1): Representing cultural elements
After the experiment, we evaluated the generated images with a survey of 53 evaluators. For each model's images, each evaluator chose one of five responses, from strongly disagree to strongly agree, for each of the items above. We then mapped the responses to scores from 1 to 5 and computed the average for each item.
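As a small aside, here is a sketch of how such responses can be aggregated into per-model, per-item averages; the CSV layout and column names are hypothetical, not our actual survey export.

```python
import pandas as pd

# Map five-point Likert responses to scores from 1 to 5.
likert = {"strongly disagree": 1, "disagree": 2, "neutral": 3,
          "agree": 4, "strongly agree": 5}

# One row per (evaluator, model, item, response); file and columns are hypothetical.
responses = pd.read_csv("survey_responses.csv")
responses["score"] = responses["response"].str.lower().map(likert)

# Average score per model and evaluation item (e.g. Text-Image Alignment, Reality).
summary = responses.groupby(["model", "item"])["score"].mean().unstack()
print(summary.round(2))
```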
The table above summarizes the results of Experiment (1). KIMCHI outperformed all other models in Text-Image Alignment, Reality, and Domain-Specific Suitability. This means KIMCHI generates images that are close to the real images and faithful to the Korean prompt, while also reflecting cultural elements well.
Result(2): Preserving the ability to create generic images
The table above shows the human-evaluation results for Experiment (2). Compared to the Stable Diffusion 1.5 base model, KIMCHI performs slightly worse in Text-Image Alignment and Error Detection, indicating that its ability to generate general images decreased somewhat as a result of specializing in a specific domain. However, its Reality performance is well preserved.
Limitations and conclusions
Limitations
KIMCHI is a fine-tuned model that uses Stable Diffusion v1.5 as its base model. Stable Diffusion v1.5 was released in October 2022, more than two years ago, so even with fine-tuning, image quality may not match the latest models. Also, while the model can be trained to understand Korean prompts, its text-image alignment ability is limited by the capability of the base model.
Conclusions
KIMCHI is a model that directly understands Korean prompts and generates images that reflect cultural nuances, challenging the limitations of existing image generation models. Although there are still many limitations, this research has the following contributions.
- Build a culturally contextualized dataset preprocessing pipeline with VLM and LLM
- Generate images with Korean prompts without translation
- Generate more realistic and culturally contextualized images
If we build on these contributions and collect images from a specific culture, the dataset preprocessing pipeline used for KIMCHI can be leveraged to build a dataset for fine-tuning a diffusion model, allowing us to develop a generation model specific to that language and culture.
While this research started by generating images that accurately reflect Korean culture, it can be extended to AI technologies that understand various cultures. We look forward to further research in the future so that culture-specific models such as KIMCHI can contribute to the world's diverse cultures, and to AI technologies that understand and respect diverse cultures.