HyperCLOVA X: Leading AI Sovereignty in South Korea - 1. How it began

(01) How it began.

Since the rise of large language models in the IT industry, big tech companies from the U.S. and China have been competing intensely in the market, continuously releasing new AI services and solutions. As U.S. big tech companies, including OpenAI, Google, Microsoft, and Meta, launch AI models covering most of the world’s languages, the rest of the world is left with little alternative but to use models from these global big-tech companies. However, ongoing concerns about data sovereignty – including the risk of data breaches and questions over where data is stored – are increasing the need for “Sovereign AI.” Sovereign AI refers to AI infrastructure and models that allow each country or company to fully own and control its data under its own policies, rather than following a uniform global standard or transferring data to a specific company. One example is the UK government’s announcement in March 2023 that it would invest more than 900 million pounds as part of a national AI strategy; this plan to build “BritGPT” illustrates the worldwide demand for Sovereign AI.

In this environment, NAVER Cloud, a large-scale cloud service provider in South Korea, launched HyperCLOVA X, a frontier Korean language model built to protect AI sovereignty in Korea. NAVER Cloud's primary motivations for developing a sovereign AI are to address economic and socio-historical challenges and to reduce dependency on AI models biased toward North American culture, particularly those created by major global tech companies based in Silicon Valley.

The economic burden of using AI models whose pricing is designed around English (a Latin-script language) falls much more heavily on Korean customers than on users in English-speaking countries. The main reason is that English-centric models use English-optimized tokenizers with a fixed vocabulary size. Roughly speaking, a single token often corresponds to a whole word in English, but only to a single byte in non-Latin scripts; because one word in a non-Latin script usually spans several bytes, the same content consumes far more tokens, making non-Latin languages more expensive on global big-tech AI models. According to an experiment by Tomasz, “Cost Overhead of Processing Various Language in GPT-4 Compared to English,” processing Korean costs 347% more than processing English. The problem is similar for other non-Latin languages such as Arabic, Southeast Asian languages, Japanese, and Chinese.
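To make this overhead concrete, the short sketch below counts tokens for an English sentence and a Korean sentence of similar meaning using OpenAI's open-source tiktoken library with the cl100k_base encoding (the encoding used by GPT-4-era models). The sample sentences and the library choice are illustrative assumptions, not the experiment cited above.

```python
# Minimal sketch: compare token counts for roughly equivalent English and
# Korean sentences. Assumes the `tiktoken` package is installed; the
# sentences are illustrative, not taken from the cited experiment.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Large language models learn the social context embedded in their training data."
korean = "대규모 언어 모델은 학습 데이터에 담긴 사회적 맥락도 함께 학습합니다."

en_tokens = enc.encode(english)
ko_tokens = enc.encode(korean)

print(f"English: {len(english)} characters -> {len(en_tokens)} tokens")
print(f"Korean:  {len(korean)} characters -> {len(ko_tokens)} tokens")

# Because the BPE vocabulary is dominated by English sub-words, Korean text
# typically breaks into far more tokens per character of content, so per-token
# API pricing charges Korean users disproportionately more for the same meaning.
```

Running a comparison like this shows Korean sentences splitting into several times as many tokens as their English counterparts, which is exactly the cost asymmetry described above.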

English-centric large language models (LLMs) also exhibit a weaker understanding of the socio-cultural perspectives and history of non-English-speaking countries. For instance, Llama 2 from Meta, a widely used open-source LLM, is trained primarily on a dataset that is roughly 89% English (Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023). Such a model performs well in English use cases but may struggle with context whose cultural sentiment differs by country, since an LLM learns not only the language but also the social context embedded in its data. ChatGPT’s hallucinations became a famous meme in Korea: when asked to explain an incident in which King Sejong (a king of the Joseon Dynasty in the 15th century) threw a MacBook, ChatGPT did not recognize the question as invalid and generated a made-up story, because it lacks an understanding of Korean history. Such gaps in understanding a country’s social and historical background can lead to irrelevant or incorrect outputs. To solve these problems, NAVER Cloud has spent years developing its own independent AI model tailored to Korea.
