Publications – HGU Deep Learning Lab

International Journals

Dougho Park, Younghun Kim, Harim Kang, Junmyeoung Lee, Jinyoung Choi, Taeyeon Kim, Sangeok Lee, Seokil Son, Minsol Kim, and Injung Kim, “PECI-Net: Bolus segmentation from video fluoroscopic swallowing study images using preprocessing ensemble and cascaded inference”, Computers in Biology and Medicine, 2024.

Bolus segmentation is crucial for the automated detection of swallowing disorders in videofluoroscopic swallowing studies (VFSS).
However, it is difficult for the model to accurately segment a bolus region in a VFSS image because VFSS images are translucent, have low contrast and unclear region boundaries, and lack color information. To overcome these challenges, we propose PECI-Net, a network architecture for VFSS image analysis that combines two novel techniques: the preprocessing ensemble network (PEN) and the cascaded inference network (CIN). PEN enhances the sharpness and contrast of the VFSS image by combining multiple preprocessing algorithms in a learnable way. CIN reduces ambiguity in bolus segmentation by using context from other regions through cascaded inference.
Moreover, CIN prevents undesirable side effects from unreliably segmented regions by referring to the context in an asymmetric way. In experiments, PECI-Net exhibited higher performance than four recently developed baseline models, outperforming TernausNet, the best among the baseline models, by 4.54% and the widely used UNet by 10.83%. The results of the ablation studies confirm that CIN and PEN are effective in improving bolus segmentation performance.
[link] [PDF]
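The PECI-Net abstract says only that PEN combines multiple preprocessing algorithms "in a learnable way". As a minimal sketch, and not the paper's actual architecture, the Python below assumes a convex per-pixel blend of several preprocessed copies of the image, weighted by a softmax over learnable logits. The three preprocessing operators are hypothetical stand-ins.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical preprocessing operators. An image is a list of rows of
# grayscale intensities in [0, 1].
def identity(img):
    return [row[:] for row in img]

def contrast_stretch(img):
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    span = (hi - lo) or 1.0  # flat image: avoid division by zero
    return [[(p - lo) / span for p in row] for row in img]

def gamma_correct(img, gamma=0.5):
    return [[p ** gamma for p in row] for row in img]

PREPROCESSORS = [identity, contrast_stretch, gamma_correct]

def preprocessing_ensemble(img, logits):
    """Blend the preprocessed copies of img with weights softmax(logits).

    In a trained network the logits (assumed learnable here) would be
    optimized together with the segmentation loss; here they are passed in.
    """
    weights = softmax(logits)
    outputs = [f(img) for f in PREPROCESSORS]
    height, width = len(img), len(img[0])
    return [[sum(w * out[y][x] for w, out in zip(weights, outputs))
             for x in range(width)]
            for y in range(height)]
```

With equal logits every operator contributes a third of each pixel; during training, gradients of the segmentation loss would shift the logits toward the most useful preprocessing mix.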
Hyunil Kim, Tae-Yeong Kwak, Hyeyoon Chang, Sun Woo Kim, and Injung Kim, “RCKD: Response-Based Cross-Task Knowledge Distillation for Pathological Image Analysis”, Bioengineering, Nov. 2023.

We propose a novel transfer learning framework for pathological image analysis, the Response-based Cross-task Knowledge Distillation (RCKD), which improves the performance of the model by pretraining it on a large unlabeled dataset guided by a high-performance teacher model. RCKD first pretrains a student model to predict the nuclei segmentation results of the teacher model for unlabeled pathological images, and then fine-tunes the pretrained model for the downstream tasks, such as organ cancer sub-type classification and cancer region segmentation, using relatively small target datasets. Unlike conventional knowledge distillation, RCKD does not require that the target tasks of the teacher and student models be the same. Moreover, unlike conventional transfer learning, RCKD can transfer knowledge between models with different architectures. In addition, we propose a lightweight architecture, the Convolutional neural network with Spatial Attention by Transformers (CSAT), for processing high-resolution pathological images with limited memory and computation. CSAT exhibited a top-1 accuracy of 78.6% on ImageNet with only 3M parameters and 1.08 G multiply-accumulate (MAC) operations. When pretrained by RCKD, CSAT exhibited average classification and segmentation accuracies of 94.2% and 0.673 mIoU on six pathological image datasets, which is 4% and 0.043 mIoU higher than EfficientNet-B0, and 7.4% and 0.006 mIoU higher than ConvNextV2-Atto pretrained on ImageNet, respectively.
[link] [PDF]
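RCKD's first stage trains the student to reproduce the teacher's nuclei segmentation responses on unlabeled images. The abstract does not state the distillation loss, so this sketch assumes a per-pixel binary cross-entropy against the teacher's soft probabilities, a common choice for response-based distillation. The function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def response_distillation_loss(teacher_probs, student_logits, eps=1e-7):
    """Mean per-pixel binary cross-entropy between the teacher's soft nuclei
    probabilities (used as targets) and the student's predictions.

    Both arguments are flat lists with one entry per pixel of an unlabeled
    image; the choice of BCE here is an assumption, not the paper's formula.
    """
    total = 0.0
    for t, z in zip(teacher_probs, student_logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp before taking logs
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(teacher_probs)
```

The loss is smallest when the student's probability equals the teacher's at every pixel, so no ground-truth labels are needed; the fine-tuning stage would then use a task-specific head and the small labeled target dataset.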
Junseok Oh, Donghwee Yoon, and Injung Kim, “One-shot Ultra-high-Resolution Generative Adversarial Network That Synthesizes 16K Images On A Single GPU”, Image and Vision Computing, Sep. 2023.

We propose a one-shot ultra-high-resolution generative adversarial network (OUR-GAN) framework that generates non-repetitive 16K (16,384 × 8,640) images from a single training image and is trainable on a single consumer GPU. OUR-GAN generates an initial image that is visually plausible and varied in shape at low resolution, and then gradually increases the resolution by adding detail through super-resolution. Since OUR-GAN learns from a real ultra-high-resolution (UHR) image, it can synthesize large shapes with fine details and long-range coherence, which is difficult to achieve with conventional generative models that rely on the patch distribution learned from relatively small images. OUR-GAN can synthesize high-quality 16K images with 12.5 GB of GPU memory and 4K images with only 4.29 GB, as it synthesizes a UHR image part by part through seamless subregion-wise super-resolution. Additionally, OUR-GAN improves visual coherence while maintaining diversity by applying vertical positional convolution. In experiments on the ST4K and RAISE datasets, OUR-GAN exhibited improved fidelity, visual coherence, and diversity compared with the baseline one-shot synthesis models. To the best of our knowledge, OUR-GAN is the first one-shot image synthesizer that generates non-repetitive UHR images on a single consumer GPU. The synthesized image samples are presented at https://our-gan.github.io.
[link] [PDF]
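Subregion-wise super-resolution keeps peak memory bounded by processing the canvas one tile at a time. The sketch below only enumerates overlapping tile boxes that cover a 16,384 × 8,640 canvas; the tile size and overlap are hypothetical, and how OUR-GAN makes the seams invisible is not modeled here.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of overlapping tiles that together cover [0, length)."""
    if tile >= length:
        return [0]
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] + tile < length:  # add a final tile flush with the edge
        starts.append(length - tile)
    return starts

def tiles(height, width, tile, overlap):
    """(top, left, bottom, right) boxes for processing the canvas part by part.
    Tile size and overlap are illustrative parameters, not OUR-GAN's settings."""
    return [(y, x, min(y + tile, height), min(x + tile, width))
            for y in tile_starts(height, tile, overlap)
            for x in tile_starts(width, tile, overlap)]
```

With 2048-pixel tiles and a 128-pixel overlap, the 16K canvas (about 141.6 million pixels) splits into 45 tiles, so memory depends on the tile size rather than on the full image.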
Sungjae Kim, Yewon Kim, Jewoo Jun, and Injung Kim, “MuSE-SVS: Multi-Singer Emotional Singing Voice Synthesizer that Controls Emotional Intensity”, IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), July 2023.

We propose a multi-singer emotional singing voice synthesizer, MuSE-SVS, that expresses emotion at various intensity levels by controlling subtle changes in pitch, energy, and phoneme duration while accurately following the score. To control multiple style attributes while avoiding loss of fidelity and expressiveness due to interference between the attributes, MuSE-SVS represents all attributes and their relations together by a joint embedding in a unified embedding space. MuSE-SVS can express emotional intensity levels not included in the training data through embedding interpolation and extrapolation. We also propose a statistical pitch predictor to express pitch variation according to emotional intensity, and a context-aware residual duration predictor to prevent the accumulation of phoneme-level duration prediction errors, which is crucial for synchronization with instrumental parts. In addition, we propose a novel ASPPTransformer, which combines atrous spatial pyramid pooling (ASPP) and Transformer, to improve fidelity and expressiveness by referring to broad contexts. In experiments, MuSE-SVS exhibited improved fidelity, expressiveness, and synchronization performance compared with baseline models. The visualization results show that MuSE-SVS effectively expresses the variation in pitch, energy, and phoneme duration according to emotional intensity. To the best of our knowledge, MuSE-SVS is the first neural SVS capable of controlling emotional intensity.
[link] [PDF]
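Embedding interpolation and extrapolation can be illustrated with plain vectors. This sketch assumes that two trained intensity levels have embeddings in the shared space and that unseen levels lie on the straight line through them; MuSE-SVS's actual joint embedding is richer than this, and the function name is hypothetical.

```python
def intensity_embedding(level, lo_level, lo_vec, hi_level, hi_vec):
    """Embedding for an arbitrary emotional-intensity level, obtained by moving
    along the line through two learned intensity embeddings (an assumption).

    t in [0, 1] interpolates between the trained levels; t > 1 or t < 0
    extrapolates beyond them.
    """
    t = (level - lo_level) / (hi_level - lo_level)
    return [a + t * (b - a) for a, b in zip(lo_vec, hi_vec)]
```

For example, with embeddings trained for levels 1 and 3, level 2 lands halfway between them and level 4 continues past level 3 in the same direction.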
Dougho Park and Injung Kim, “Application of Machine Learning in the Field of Intraoperative Neurophysiological Monitoring: A Narrative Review”, Applied Sciences, 12(15), 7943, 2022.

Intraoperative neurophysiological monitoring (IONM) is being applied in a wide range of surgical fields as a diagnostic tool to protect patients from neural insults that may occur during surgery. However, several contributing factors complicate the interpretation of IONM, and it is labor- and training-intensive. Meanwhile, machine learning (ML)-based medical research has been growing rapidly, and many studies on the clinical application of ML algorithms have been published in recent years. However, the application of ML to IONM remains limited. The major challenges in the application of ML to IONM include the presence of non-surgical contributing factors, ambiguity in the definition of false-positive cases, and inter-rater variability. Nevertheless, we believe that the application of ML enables objective and reliable IONM, while overcoming the aforementioned problems that experts may encounter. Large-scale, standardized studies and technical considerations are required to overcome certain obstacles to the use of ML in IONM in the future.
[link] [PDF]
J. Yang, H. Kim, H. Kwak, and Injung Kim, “HanFont: large-scale adaptive Hangul font recognizer using CNN and font clustering”, International Journal on Document Analysis and Recognition (IJDAR), vol. 22, pp. 407-416, 2019.

We propose a large-scale Hangul font recognizer that is capable of recognizing 3300 Hangul fonts. Large-scale Hangul font recognition is a challenging task. Typically, Hangul fonts are distinguished by small differences in detailed shapes, which recognizers often overlook. There are additional issues in practical applications, such as the existence of almost indistinguishable fonts and the release of new fonts after the recognizer has been trained. Only a few recently developed font recognizers are scalable enough to recognize thousands of fonts, and most of them focus on fonts for Western languages. The proposed recognizer, HanFont, is composed of a convolutional neural network (CNN) model designed to effectively distinguish the detailed shapes. HanFont also contains a font clustering algorithm to address the issues caused by indistinguishable fonts and untrained new fonts. In the experiments, HanFont exhibits a recognition rate of 94.11% for 3300 Hangul fonts including numerous similar fonts, which is 2.49% higher than that of ResNet. The cluster-level recognition accuracy of HanFont was 99.47% when the 3300 fonts were grouped into 1000 clusters. In a test on 100 new fonts without retraining the CNN model, HanFont exhibited 57.87% accuracy. The average accuracy for the top 56 untrained fonts was 75.76%. [link] [PDF]
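The cluster-level accuracy reported above can be read as counting a prediction as correct whenever it falls in the same cluster as the true font. A minimal sketch, assuming a fixed font-to-cluster mapping; the font names and clusters used with it are made up.

```python
def font_accuracy(true_fonts, predicted_fonts):
    """Fraction of samples whose predicted font is exactly the true font."""
    correct = sum(t == p for t, p in zip(true_fonts, predicted_fonts))
    return correct / len(true_fonts)

def cluster_accuracy(true_fonts, predicted_fonts, cluster_of):
    """Fraction of samples whose predicted font lies in the same cluster as the
    true font, so confusions among near-identical fonts are not penalized.
    cluster_of is an assumed, precomputed mapping from font to cluster id."""
    correct = sum(cluster_of[t] == cluster_of[p]
                  for t, p in zip(true_fonts, predicted_fonts))
    return correct / len(true_fonts)
```

Grouping fonts this way trades fine-grained identity for robustness: a confusion between two fonts in the same cluster no longer counts as an error.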
H. Park, Y. Yoo, Y. Park, C. Lee, H. Lee, Injung Kim, and K. Yi, “Toward Optimal FPGA Implementation of Deep Convolutional Neural Networks for Handwritten Hangul Character Recognition”, Journal of Computing Science and Engineering, vol. 12, no. 1, pp. 24-35, 2018.

Deep convolutional neural networks (DCNNs) are an advanced technology for image recognition. Because of their extreme computing resource requirements, software-only DCNN implementations cannot meet real-time requirements, so the need for dedicated DCNN accelerator hardware is increasing. In this paper, we present a field-programmable gate array (FPGA)-based hardware accelerator design of a DCNN for handwritten Hangul character recognition. We also present design optimization techniques in the SDAccel environment for searching the optimal FPGA design space. The techniques we used include memory access optimization, computing unit parallelism, and data conversion. We achieved a recognition time of about 11.19 ms per character with a Xilinx FPGA accelerator. Our design optimization was performed with Xilinx HLS and the SDAccel environment, targeting the Kintex XCKU115 FPGA from Xilinx. Our design outperforms a CPU by 5.88 times and a GPGPU by 5 times in terms of energy efficiency (the number of samples per unit energy). We expect these results to provide an alternative to GPGPU solutions for real-time applications, especially in data centers or server farms where energy consumption is a critical problem. [link] [PDF]
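Energy efficiency in this abstract is the number of samples per unit energy. The sketch below shows the arithmetic, assuming one character is processed at a time; the abstract gives no power figures, so the power arguments are placeholders for measured values.

```python
def throughput(latency_ms):
    """Samples per second when samples are processed one at a time."""
    return 1000.0 / latency_ms

def samples_per_joule(latency_ms, power_watts):
    """Energy efficiency: samples per joule = 1 / (power * time per sample)."""
    energy_per_sample = power_watts * latency_ms / 1000.0  # joules
    return 1.0 / energy_per_sample

def efficiency_ratio(latency_a_ms, power_a_w, latency_b_ms, power_b_w):
    """How many times more energy-efficient device A is than device B.
    Power values must come from measurement; none are assumed here."""
    return (samples_per_joule(latency_a_ms, power_a_w)
            / samples_per_joule(latency_b_ms, power_b_w))
```

At 11.19 ms per character, sequential recognition handles about 89 characters per second.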
I. J. Kim, C. B. Choi, and S. H. Lee, “Improving Discrimination Ability of Convolutional Neural Networks by Hybrid Learning”, International Journal on Document Analysis and Recognition, vol. 19, pp. 1-9, 2016.

Discriminating similar patterns is important because similar patterns are a major source of classification errors. This paper proposes a novel method to improve the discrimination ability of convolutional neural networks (CNNs) by hybrid learning. The proposed method embeds a collection of discriminators as well as a recognizer in a shared CNN. By visualizing contrastive class saliency, we show that learning with embedded discriminators leads the shared CNN to detect and capture the differences among similar classes. We also propose a hybrid learning algorithm that learns recognition and discrimination together. The proposed method learns recognition while focusing on the differences among similar classes, and thereby improves the discrimination ability of the CNN. Unlike conventional discrimination methods, the proposed method requires neither predefined sets of similar classes nor an additional step to integrate its result with that of the recognizer. In experiments on two handwritten Hangul databases, SERI95a and PE92, the proposed method reduced the classification error from 2.56 % to 2.33 % and from 4.04 % to 3.66 %, respectively. These improvements correspond to relative error reduction rates of 8.97 % on SERI95a and 9.42 % on PE92. Our best results surpass the previous state-of-the-art error rates of 4.04 % on SERI95a and 7.08 % on PE92. [link] [PDF]
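The hybrid learning idea trains a recognizer and embedded discriminators on one shared CNN. The abstract does not give the objective, so this sketch assumes a weighted sum of a softmax cross-entropy for the recognizer and binary cross-entropies for the discriminator heads; `lam` and the head layout are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for the recognizer head (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

def binary_cross_entropy(logit, label):
    """Stable BCE with logits: max(z, 0) - z * y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def hybrid_loss(rec_logits, target, disc_logits, disc_labels, lam=0.5):
    """Assumed hybrid objective: recognition loss plus lam times the mean
    discriminator loss. In a real network both terms would backpropagate
    into the shared convolutional layers."""
    rec = cross_entropy(rec_logits, target)
    disc = sum(binary_cross_entropy(z, y) for z, y in zip(disc_logits, disc_labels))
    disc /= max(len(disc_logits), 1)
    return rec + lam * disc
```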
I. J. Kim and X. Xie, “Handwritten Hangul Recognition using Deep Convolutional Neural Networks”, International Journal on Document Analysis and Recognition, vol. 18, pp. 1-13, 2015.

In spite of the advances in recognition technology, handwritten Hangul recognition (HHR) remains largely unsolved due to the presence of many confusing characters and excessive cursiveness in Hangul handwriting. Even the best existing recognizers do not achieve satisfactory performance for practical applications and perform much worse than those developed for Chinese or alphanumeric characters. To improve the performance of HHR, we developed a new type of recognizer based on deep neural networks (DNNs). DNNs have recently shown excellent performance in many pattern recognition and machine learning problems, but had not yet been applied to HHR. We built our Hangul recognizers based on deep convolutional neural networks and proposed several novel techniques to improve the performance and training speed of the networks. We systematically evaluated the performance of our recognizers on two public Hangul image databases, SERI95a and PE92. Using our framework, we achieved recognition rates of 95.96 % on SERI95a and 92.92 % on PE92. Compared with the previous best records of 93.71 % on SERI95a and 87.70 % on PE92, our results yielded improvements of 2.25 and 5.22 percentage points, respectively. These improvements correspond to error reduction rates of 35.71 % on SERI95a and 42.44 % on PE92, relative to the previous lowest error rates. Such improvement closes a significant portion of the large gap between practical requirements and the actual performance of Hangul recognizers. [link] [PDF]
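The error reduction rates quoted above are relative reductions of the error rate, not differences in accuracy. A short sketch of the arithmetic:

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative error reduction (%) given recognition rates in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before
```

For example, `relative_error_reduction(87.70, 92.92)` gives about 42.44, matching the PE92 figure: the error rate falls from 12.30 % to 7.08 %.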
S.-J. Ryu and I.-J. Kim, “Discrimination of similar characters using nonlinear normalization based on regional importance measure”, International Journal on Document Analysis and Recognition, vol. 17, issue 1, pp. 79-89, 2014.

Discrimination of confusing characters is very important in recognizing character sets that contain a multitude of similar characters. Confusing characters have very similar shapes and are separated by only a small difference. For successful discrimination, we need to focus on that difference. However, the small difference can be reduced or even lost during feature extraction, in which case further analysis after feature extraction rarely succeeds. This paper proposes a discriminative nonlinear normalization algorithm to improve discrimination ability. The proposed method emphasizes the difference between confusing characters. It measures the importance of each region in discriminating confusing characters and then resamples the image according to this regional importance measure, expanding important regions and shrinking less important ones. Since it emphasizes important regions in the preprocessing step, it does not suffer from information loss during feature extraction. In experiments, the proposed method successfully detected and expanded important regions. In handwritten Hangul recognition, the proposed method outperformed two other recently developed pair-wise discrimination methods. On the SERI95a data set, it improved the recognition rate from 87.69 to 90.11 %, achieving a 19.66 % error reduction rate. [link] [PDF]
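The resampling step of discriminative nonlinear normalization can be sketched in one dimension: output columns are placed according to cumulative importance, so important columns are expanded and unimportant ones shrink. The importance values are given as input here (the paper derives them from confusing character pairs), and nearest-neighbour sampling along a single axis is a simplification.

```python
import bisect

def importance_resample_positions(importance, n_out):
    """Input column index for each of n_out output columns, chosen so that each
    input column receives output samples in proportion to its importance."""
    total = float(sum(importance))
    cumulative, running = [], 0.0
    for w in importance:
        running += w
        cumulative.append(running / total)
    positions = []
    for k in range(n_out):
        q = (k + 0.5) / n_out  # quantile targeted by output column k
        idx = bisect.bisect_left(cumulative, q)
        positions.append(min(idx, len(importance) - 1))  # guard against rounding
    return positions

def resample_rows(img, positions):
    """Nearest-neighbour resampling of every row at the chosen input columns."""
    return [[row[p] for p in positions] for row in img]
```

With importance `[1, 1, 2, 4]` and eight output columns, the positions are `[0, 1, 2, 2, 3, 3, 3, 3]`: the most important column is stretched to four samples while the first two keep one each.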
G.-R. Park and I.-J. Kim, “An Evaluation of Statistical Methodologies in Handwritten Hangul Recognition”, International Journal on Document Analysis and Recognition, vol. 16, issue 3, pp. 273-283, 2013.

Although structural approaches have shown better performance than statistical ones in handwritten Hangul recognition (HHR), they have not been widely used in practical applications because of their vulnerability to image degradation and their high computational complexity. Statistical approaches have received little attention in HHR because early trials were not promising enough. The past decade has seen significant improvements in statistical handwritten character recognition, including handwritten Chinese character recognition. Nevertheless, without a systematic evaluation of statistical methods in HHR, they are unlikely to attract attention, given those discouraging early results. In this study, we comprehensively evaluate state-of-the-art statistical methods in HHR. Specifically, we implemented fifteen character normalization methods, five feature extraction methods, and four classification methods and evaluated their performance on two public handwritten Hangul databases. On the SERI database, statistical methods achieved the best performance of 93.71 % accuracy, which is higher than the best result achieved by structural recognizers. On the PE92 database, which has a small number of samples per class, statistical methods performed slightly worse than the best structural recognizer. [link] [PDF]
I. J. Kim and J. H. Kim, “Statistical Character Structure Modeling and Its Application to Handwritten Chinese Character Recognition”, IEEE TPAMI, vol. 25, no. 11, pp. 1422-1436, 2003.

This paper proposes a statistical character structure modeling method. It represents each stroke by the distribution of its feature points and the character structure by the joint distribution of the component strokes. In the proposed model, stroke relationships are reflected through statistical dependency, so all kinds of stroke relationships can be represented effectively and systematically. Based on this character representation, a stroke neighbor selection method is also proposed. It measures the importance of a stroke relationship by the mutual information among the strokes. With this measure, the important neighbor relationships are selected by the nth-order probability approximation method. The neighbor selection algorithm reduces the complexity significantly because only a few important relationships need to be reflected instead of all existing relationships. The proposed character modeling method was applied to a handwritten Chinese character recognition system. By applying a model-driven stroke extraction algorithm that cooperates with a selective matching algorithm, the proposed system analyzes degraded images better than conventional structural recognition systems. The experiments visualized the effectiveness of the proposed methods: the proposed method successfully detected and reflected the stroke relationships that seemed intuitively important. The overall recognition rate was 98.45 percent, which confirms the effectiveness of the proposed methods. [link] [PDF]
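The stroke neighbor selection ranks relationships by mutual information. A minimal sketch, assuming each pair of strokes is summarized by a discrete joint probability table (for example, over quantized feature-point positions); the table layout and stroke names are hypothetical, and the nth-order approximation itself is not reproduced.

```python
import math

def mutual_information(joint):
    """I(X; Y) in nats from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # terms with zero probability contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

def select_neighbor(stroke, candidates, joint_tables):
    """Pick the candidate stroke sharing the most information with `stroke`.
    joint_tables[(stroke, other)] is an assumed discretized joint distribution."""
    return max(candidates,
               key=lambda other: mutual_information(joint_tables[(stroke, other)]))
```

Independent strokes score zero and strongly coupled strokes score high, so keeping only the top-scoring neighbors captures the relationships that matter most while skipping the rest.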
I. J. Kim and J. H. Kim, “Pair-wise Discrimination Based On Stroke Importance Measure”, Pattern Recognition, vol. 35, no. 10, pp. 2259-2266, 2002.

The pair-wise discriminator is a binary classifier that verifies the recognizer's output when that output belongs to a class in a predefined database of confusion pairs. It is difficult to discriminate a pair of characters that are very similar in shape except for a small difference, because the small difference can be overridden by writing variation. This paper proposes a pair-wise discrimination method that discriminates similar characters by focusing on the structural difference between the two characters. It discriminates a pair of characters by comparing the matching scores between the input character and the models of the two characters. When the stroke matching scores are combined into the overall matching score, each stroke is assigned a weight that reflects its importance in discriminating the character pair. By assigning large weights to the discriminative strokes, the difference between the characters is emphasized. The stroke weights are obtained systematically by a neural network training algorithm. In the experiments, applying the proposed method significantly improved recognition performance. [link] [PDF]
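The weighted combination of stroke matching scores can be sketched as follows. The scores and weights used with it are hand-picked illustrations (the paper learns the weights with a neural network), and the sketch assumes that stroke k of both models corresponds to the same structural part.

```python
def weighted_match(stroke_scores, weights):
    """Overall matching score as an importance-weighted sum of stroke scores."""
    return sum(w * s for w, s in zip(weights, stroke_scores))

def discriminate(scores_a, scores_b, weights):
    """Return 'a' or 'b', whichever model has the higher weighted matching score.
    scores_x[k] is how well stroke k of model x matches the input (assumed
    to share stroke indexing across the two models of the pair)."""
    a = weighted_match(scores_a, weights)
    b = weighted_match(scores_b, weights)
    return "a" if a >= b else "b"
```

With scores `[0.8, 0.8, 0.8, 0.7]` for model a and `[0.99, 0.99, 0.99, 0.2]` for model b, uniform weights pick b because of small variations on the shared strokes; up-weighting the discriminative last stroke with `[0.1, 0.1, 0.1, 1.0]` flips the decision to a.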
C. L. Liu, I. J. Kim, and J. H. Kim, “Model based stroke extraction and matching for handwritten Chinese character recognition”, Pattern Recognition, vol. 34, no. 12, pp. 2339-2352, 2001.

This paper proposes a model-based structural matching method for handwritten Chinese character recognition (HCCR). This method obtains reliable stroke correspondence and enables structural interpretation. In the model base, the reference character of each category is described by an attributed relational graph (ARG). The input character is described by feature points and line segments. The strokes and inter-stroke relations of the input character are not determined until it is matched with a reference character. The structural matching is accomplished in two stages: candidate stroke extraction and consistent matching. All candidate input strokes that may match the reference strokes are extracted by line following, and consistent matching is then achieved by heuristic search. Some structural post-processing operations are applied to improve the stroke correspondence. Recognition experiments were conducted on an image database collected at KAIST, and promising results were achieved. [link] [PDF]

International Conferences

Kyoungmin Kim, Sangoh Lee, Injung Kim, and Wook-Shin Han, “ASM: Harmonizing Autoregressive Model, Sampling, and Multi-dimensional Statistics Merging for Cardinality Estimation”, SIGMOD2024 (accepted).

Recent efforts in learned cardinality estimation (CE) have substantially improved estimation accuracy and query plans inside query optimizers. However, whether decent efficiency, scalability, and support for a wide range of queries can be achieved at the same time has remained an open question. Rather than falling back to traditional approaches that trade off one criterion against another, we present a new learned approach that achieves all of these. Our method, called ASM, harmonizes autoregressive models for per-table statistics estimation, sampling for merging these statistics for join queries, and multi-dimensional statistics merging that extends the sampling to estimate thousands of sub-queries, without assuming independence between join keys. Extensive experiments show that ASM significantly improves query plans with similar or smaller overhead than previous learned methods and supports a wider range of queries.

[link] [PDF]
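ASM's per-table autoregressive models factor a joint distribution as P(A, B) = P(A) · P(B | A). This sketch replaces the neural model with explicit conditional tables to show why the chain rule matters for selectivity; the column values and probabilities used with it are made up, and the sampling-based merging for joins is not modeled.

```python
def chain_rule_cardinality(num_rows, p_a, p_b_given_a, pred_a, pred_b):
    """Estimated row count satisfying pred_a(A) AND pred_b(B), using
    P(A, B) = P(A) * P(B | A) so that correlation between A and B is kept.
    The probability tables stand in for an autoregressive model."""
    selectivity = 0.0
    for a, pa in p_a.items():
        if pred_a(a):
            for b, pb in p_b_given_a[a].items():
                if pred_b(b):
                    selectivity += pa * pb
    return num_rows * selectivity

def independent_cardinality(num_rows, p_a, p_b_given_a, pred_a, pred_b):
    """The same query under the independence assumption P(A, B) = P(A) * P(B)."""
    p_b = {}
    for a, pa in p_a.items():
        for b, pb in p_b_given_a[a].items():
            p_b[b] = p_b.get(b, 0.0) + pa * pb  # marginalize A out
    sel_a = sum(pa for a, pa in p_a.items() if pred_a(a))
    sel_b = sum(pb for b, pb in p_b.items() if pred_b(b))
    return num_rows * sel_a * sel_b
```

With 1000 rows, P(A = x) = 0.6, P(A = y) = 0.4, P(B = 1 | x) = 0.5, and P(B = 1 | y) = 0.9, the query A = y and B = 1 is estimated at 360 rows by the chain rule but only 264 rows under independence, because the latter ignores that B = 1 is much more likely when A = y.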
M. Kang, J. Lee, S. Kim, and Injung Kim, “FastDCTTS: Efficient Deep Convolutional Text-to-Speech”, ICASSP2021.

We propose an end-to-end speech synthesizer, Fast DCTTS, that synthesizes speech in real time on a single CPU thread. The proposed model is composed of a carefully tuned lightweight network designed by applying multiple network reduction and fidelity improvement techniques. In addition, we propose a novel Group highway activation that balances computational efficiency against the regularization effect of the gating mechanism. We also introduce a new metric called Elastic Mel Cepstral Distortion (EMCD) to measure the fidelity of the output mel-spectrogram. In experiments, we analyze the effect of the acceleration techniques on speed and speech quality. Our best model maintains a speech quality similar to that of the baseline model, DCTTS, with the computation reduced to 1.76% and the number of parameters reduced to 2.75%. The speed on a single CPU thread was improved by 7.45 times, which is fast enough to produce mel-spectrograms in real time without a GPU. [project link] [link] [PDF]
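A highway activation mixes a transformed signal h with its input x through a sigmoid gate. The abstract does not describe how Group highway activation groups the gates, so this sketch assumes one shared gate per contiguous group of channels; the group layout is hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def highway(x, h, gate_logits):
    """Standard highway gating with one gate per channel:
    y_c = g_c * h_c + (1 - g_c) * x_c, where g_c = sigmoid(gate_logits[c])."""
    out = []
    for xc, hc, z in zip(x, h, gate_logits):
        g = sigmoid(z)
        out.append(g * hc + (1.0 - g) * xc)
    return out

def group_highway(x, h, group_gate_logits):
    """Hypothetical grouped variant: channels are split into equal, contiguous
    groups and every channel in a group shares that group's single gate."""
    groups = len(group_gate_logits)
    if len(x) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(x) // groups
    out = []
    for c, (xc, hc) in enumerate(zip(x, h)):
        g = sigmoid(group_gate_logits[c // size])
        out.append(g * hc + (1.0 - g) * xc)
    return out
```

With G groups, only G gate values are computed instead of one per channel, which is where the saving would come from under this assumption; setting G equal to the channel count recovers the standard highway.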
J. Yang, J. Lee, Y. Kim, H. Cho, and Injung Kim, “VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network”, Interspeech2020.

We present a novel high-fidelity real-time neural vocoder called VocGAN. A recently developed GAN-based vocoder, MelGAN, produces speech waveforms in real time. However, it often produces waveforms that are insufficient in quality or inconsistent with the acoustic characteristics of the input mel spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform. VocGAN applies a multi-scale waveform generator and a hierarchically-nested discriminator to learn multiple levels of acoustic properties in a balanced way. It also applies the joint conditional and unconditional objective, which has shown successful results in high-resolution image synthesis. In experiments, VocGAN synthesizes speech waveforms 416.7x faster than real time on a GTX 1080Ti GPU and 3.24x faster than real time on a CPU. Compared with MelGAN, it also exhibits significantly improved quality in multiple evaluation metrics, including mean opinion score (MOS), with minimal additional overhead. Additionally, compared with Parallel WaveGAN, another recently developed high-fidelity vocoder, VocGAN is 6.98x faster on a CPU and exhibits higher MOS. [link] [PDF]
I. Paik, S. Oh, T. Kwak, and Injung Kim, “Overcoming Catastrophic Forgetting by Neuron-Level Plasticity Control”, AAAI2020.

To address the issue of catastrophic forgetting in neural networks, we propose a novel, simple, and effective solution called neuron-level plasticity control (NPC). While learning a new task, the proposed method preserves the knowledge of previous tasks by controlling the plasticity of the network at the neuron level. NPC estimates the importance value of each neuron and consolidates important neurons by applying lower learning rates, rather than restricting individual connection weights to stay close to certain values. The experimental results on the incremental MNIST (iMNIST) and incremental CIFAR100 (iCIFAR100) datasets show that neuron-level consolidation is substantially more effective than connection-level consolidation approaches. [link] [PDF]
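NPC consolidates knowledge by giving important neurons smaller learning rates. The sketch assumes a moving average of |activation × gradient| as the importance estimate and a 1 / (1 + strength · importance) learning-rate schedule; both are illustrative choices, not the paper's formulas.

```python
def update_importance(importance, activations, grads, decay=0.9):
    """Exponential moving average of |activation * gradient| for each neuron,
    a Taylor-style importance proxy assumed here for illustration."""
    return [decay * imp + (1.0 - decay) * abs(a * g)
            for imp, a, g in zip(importance, activations, grads)]

def neuron_learning_rates(importance, base_lr=0.1, strength=10.0):
    """Hypothetical schedule: the more important a neuron, the smaller its learning rate."""
    return [base_lr / (1.0 + strength * imp) for imp in importance]

def sgd_step(weights, weight_grads, lrs):
    """weights[i] holds the incoming weights of neuron i; all of them share
    neuron i's learning rate, so plasticity is controlled per neuron rather
    than per connection."""
    return [[w - lr * g for w, g in zip(row, grad_row)]
            for row, grad_row, lr in zip(weights, weight_grads, lrs)]
```

Because a whole neuron is slowed down together, its incoming weights can still change jointly, unlike connection-level methods that pull each weight toward a stored value.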
I. Paik, T. Kwak, and Injung Kim, “Capsule Networks Need an Improved Routing Algorithm”, ACML2019.

In capsule networks, the routing algorithm connects capsules in consecutive layers, enabling the upper-level capsules to learn higher-level concepts by combining the concepts of the lower-level capsules. Capsule networks are known to have a few advantages over conventional neural networks, including robustness to 3D viewpoint changes and generalization capability. However, some studies have reported negative experimental results, and the reason for this phenomenon has not yet been analyzed. We empirically analyzed the effect of five different routing algorithms. The experimental results show that the routing algorithms do not behave as expected and often produce results that are worse than simple baseline algorithms that assign the connection strengths uniformly or randomly. We also show that, in most cases, the routing algorithms do not change the classification result but polarize the link strengths, and the polarization can be extreme when they continue to repeat without stopping. To realize the true potential of the capsule network, it is essential to develop an improved routing algorithm. [link] [PDF]
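For reference, the widely used dynamic routing (routing-by-agreement) procedure can be sketched in plain Python. With `iterations=1` the coupling coefficients stay uniform, which corresponds to the uniform-assignment baseline mentioned above. This is a generic textbook version, not necessarily one of the five variants analyzed in the paper.

```python
import math

def squash(v):
    """Capsule non-linearity: keeps the direction, maps the length into [0, 1)."""
    n2 = sum(x * x for x in v)
    scale = n2 / (1.0 + n2) / math.sqrt(n2 + 1e-12)  # eps avoids 0/0 for zero vectors
    return [scale * x for x in v]

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is lower capsule i's prediction vector for upper capsule j.
    Returns the final coupling coefficients c[i][j] and upper outputs v[j]."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits start at zero
    for it in range(iterations):
        # each lower capsule distributes its output over the upper capsules
        c = [softmax(row) for row in b]
        s = [[sum(c[i][j] * u_hat[i][j][d] for i in range(n_in)) for d in range(dim)]
             for j in range(n_out)]
        v = [squash(sj) for sj in s]
        if it < iterations - 1:
            # raise the logit of every (i, j) pair whose prediction agrees with v[j]
            for i in range(n_in):
                for j in range(n_out):
                    b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return c, v
```

Each extra iteration adds agreement to the routing logits again, so the coupling coefficients tend to drift toward 0 or 1, a small-scale view of the polarization described in the abstract.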
July 2005 – Keynote speaker at the 1st Camera-Based Document Analysis and Recognition Workshop (title: New Chances and New Challenges in Camera-Based Document Analysis and Recognition), ICDAR2005.

Keynote presentation about “New Chances and New Challenges in Camera-Based Document Analysis and Recognition” [link]

Domestic (Korean) Journals

오준석, 이찬효, 우옥균, 김인중, “WICWIU.v3: A C++-Based Open-Source Deep Learning Framework Supporting Natural Language and Time-Series Data Processing”, Journal of KIISE, vol. 50, no. 4, pp. 313-320, 2023 (invited as an outstanding paper from KSC2021)

This paper introduces the natural language and time-series data processing functions added to the third version of WICWIU, the first open-source deep learning framework released by a Korean university. Written in C++, WICWIU supports GPU-based parallel processing and offers excellent readability and extensibility, making it easy for users to add new features. WICWIU.v1 and v2 focused on image processing models such as convolutional neural networks (CNNs) and generative adversarial networks (GANs); in v3, we added functions for processing natural language and time-series data. WICWIU.v3 provides classes and functions that can implement various models such as recurrent neural networks (RNNs), including LSTM and GRU, attention modules, and Transformers. We validated the newly added functions by implementing a machine translator and a text generator based on the natural language and time-series processing functions of WICWIU.v3.
[link] [PDF]
이충헌, 김성재, 김인중, “Phoneme Segmentation of Speech Signals Using a CTC-Based Speech Recognition Model and Low-Level Features”, Journal of KIISE, vol. 50, no. 4, pp. 337-343, 2023 (invited as an outstanding paper from KSC2021)

In this paper, we propose a method of segmenting a speech signal into phoneme units using multi-level features. Most deep-learning-based speech recognition models estimate the locations of phonemes based on high-level features extracted by deep neural networks. However, while high-level features are effective for phoneme recognition, low-level features are more effective for phoneme segmentation since they better reflect local positional information. The proposed method first detects phonemes from speech signals using high-level features and then estimates phoneme boundaries using low-level features. Compared with a baseline model that relies on high-level features, the mean absolute error of phoneme boundary estimation decreased by 95.8%, from 0.34 sec to 0.01 sec, on the HESD dataset, and by 76.5%, from 0.17 sec to 0.04 sec, on the NUS-48E dataset. In visualization analysis, the proposed method estimated phoneme boundaries more accurately than the baseline model.
[link] [PDF]
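The two-stage idea, coarse phoneme detection from high-level features followed by boundary placement from low-level ones, can be sketched as follows. The sketch assumes per-frame argmax labels from a CTC model (blank = 0) and a per-frame low-level change score such as spectral flux; both the score and the rule of putting the boundary at its maximum between neighbouring phonemes are hypothetical.

```python
def ctc_greedy_segments(frame_labels, blank=0):
    """Collapse per-frame argmax labels into (label, first_frame, last_frame)
    runs, dropping blanks; repeated labels separated by a blank stay distinct."""
    segments = []
    prev = None
    for t, lab in enumerate(frame_labels):
        if lab != blank:
            if lab != prev:
                segments.append([lab, t, t])  # a new phoneme starts here
            else:
                segments[-1][2] = t           # the same phoneme continues
        prev = lab
    return [tuple(seg) for seg in segments]

def refine_boundaries(segments, low_level_change):
    """Hypothetical refinement: between two adjacent detected phonemes, place
    the boundary at the frame with the largest low-level feature change."""
    boundaries = []
    for (_, _, end_prev), (_, start_next, _) in zip(segments, segments[1:]):
        window = range(end_prev + 1, start_next + 1)
        boundaries.append(max(window, key=lambda t: low_level_change[t]))
    return boundaries
```

For frame labels `[0, 3, 3, 0, 0, 0, 5, 5, 0]`, the detected phonemes are `(3, 1, 2)` and `(5, 6, 7)`; the boundary is then chosen among frames 3 to 6 using only the low-level change score.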
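신윤선, 서주현, 이민영, 김인중, “Deep-Learning-Based Road Image Recognition Using a TIDL NPU in an SoC Environment”, Smart Media Journal, vol. 11, no. 11, pp. 25-31, 2022 (invited as an outstanding paper from the 2022 Korean Institute of Smart Media Annual Conference)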

Deep learning-based image processing is essential for autonomous vehicles. To process road images in real time in a System-on-Chip (SoC) environment, deep learning models must be executed on an NPU (Neural Processing Unit) specialized for deep learning operations. In this study, we ported seven open-source deep learning image processing models, originally developed on GPU servers, to the Texas Instruments Deep Learning (TIDL) NPU environment. Performance evaluation and visualization confirmed that the ported models operate correctly in the SoC virtual environment. This paper describes the problems that arose during the migration process due to the limitations of the NPU environment, together with their solutions, and thereby provides a reference case for developers and researchers who want to port deep learning models to SoC environments.
[link] [PDF]
선한결, 이명희, 홍참길, 김인중, “Vehicle Image Data Augmentation via GAN-Based Viewpoint Transformation”, Journal of KIISE (정보과학회 논문지), vol. 48, no. 8, pp. 885-891, 2021 (invited as an outstanding presentation paper from KCC2020)

We introduce a GAN-based method that transforms vehicle images captured from arbitrary viewpoints into images taken from a specific viewpoint. Training a vehicle image recognizer requires a large number of vehicle images taken from a specific viewpoint. In practice, however, it is difficult to collect such training data for the many vehicle models newly released every year. We therefore propose augmenting vehicle image data by converting vehicle images taken from arbitrary viewpoints into images from a specific viewpoint. The proposed method first transforms a vehicle image into a top-front view using DRGAN, then enhances the image quality with DeblurGAN, and finally improves the resolution with SRGAN. Experimental results show that the proposed method successfully converts images taken within 45 degrees to the left or right into top-front-view images and is effective in improving image quality and resolution.
[link] [PDF]
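최진, 양진혁, 김인중, “Improving the Quality of Deep Learning-Based TTS Using Generative Adversarial Networks and Data Augmentation”, Journal of KIISE (정보과학회 논문지), vol. 26, no. 5, pp. 256-260, 2020 (invited as an outstanding presentation paper from KCC2019)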

In this paper, we introduce TE-GAN (TTS Enhancement GAN), a deep learning model that uses a generative adversarial network to enhance the Mel-spectrogram synthesized by a deep learning-based TTS model so that it resembles the Mel-spectrogram of human speech. TE-GAN is designed with the characteristics of speech signals in mind and significantly improves speech fidelity even when combined with a simple vocoder such as the Griffin-Lim algorithm. Additionally, we propose a data augmentation technique based on a Temporal Multi-Agent (TMA) approach for training TE-GAN effectively. Experimental results demonstrate that the proposed methods significantly improve the fidelity of speech synthesized by the TTS system: TE-GAN made the Mel-spectrograms synthesized by Tacotron more similar to those of human speech, and the MOS of the synthesized speech improved from 2.07 to 3.24. [link] [PDF]
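윤동휘, 권예성, 김경협, 박참진, 윤성결, 최은서, 김인중, “WICWIU.v2: A C++-Based Open-Source Deep Learning Framework Supporting Generative Adversarial Networks and Metric Learning”, Journal of KIISE (정보과학회 논문지), vol. 26, no. 5, pp. 231-237, 2020 (invited as an outstanding presentation paper from KCC2019)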

This paper describes how WICWIU (위큐), the first open-source deep learning framework released by a Korean university, has improved in the year since its first release in 2018. In addition to the features of WICWIU.v1, which focus on convolutional neural networks (CNNs), WICWIU.v2 adds a variety of new features supporting generative adversarial networks (GANs) and metric learning. WICWIU.v2 provides classes and functions for implementing basic GAN models, such as vanilla GAN and DCGAN, as well as advanced GAN models including WGAN and BEGAN. Its metric-learning features, including a triplet loss function and an extended data loader that can sample positive and negative examples, are useful for implementing one-shot learning and meta-learning algorithms. WICWIU.v2 can be downloaded from GitHub and is freely available for academic research and commercial software development. [link] [PDF]
박천명, 김지웅, 기윤호, 김지현, 윤성결, 최은서, 김인중, “WICWIU: A C++-Based General-Purpose Open-Source Deep Learning Framework”, Journal of KIISE (정보과학회 논문지), vol. 46, no. 3, pp. 253-259, 2019 (invited as an outstanding paper from KCC2018)

In this paper, we introduce WICWIU, the first open-source deep learning framework released by a Korean university. WICWIU provides a variety of operators and modules, together with a network structure that can represent arbitrary computational graphs, which is sufficient to compose widely used deep learning models such as Inception, ResNet, and DenseNet. WICWIU also supports GPU-based massively parallel computing, which significantly accelerates neural network training. Because its entire API is provided in C++, it is easy for C++ developers to adopt, and being based on C++ gives it an advantage over Python-based frameworks in memory and performance optimization. This also makes it easy to customize the framework itself for resource-constrained environments. WICWIU is readable and extensible because it consists of consistently structured C++ code and APIs, and its Korean documentation makes it especially accessible to Korean developers. WICWIU is released under the Apache 2.0 license, so it can be used free of charge for any research or commercial purpose. [link] [PDF]
김인중, 나기현, 양소희, 장재민, 김윤종, 신원영, 김덕중, “T-Commerce Sales Prediction Using Deep Learning and Statistical Models”, Journal of KIISE: Software and Applications (정보과학회 논문지 소프트웨어 및 응용), vol. 44, no. 8, 2017.

T-commerce is a technology-convergence service in which users make purchases through interactive data broadcasting on bi-directional digital TVs. Because the number of channels and the range of products are limited, maximizing T-commerce revenue requires scheduling broadcast programs so that expected sales are maximized, taking into account the competitiveness of each product in each time slot. To this end, this paper proposes a deep learning method that predicts the sales of each product when it is assigned to each time slot. The proposed method uses a deep neural network that takes the product, time slot, week of the year, holiday status, and weather as input and predicts the sales expected when the product is actually broadcast. In addition, it applies a statistical model and SVD (Singular Value Decomposition) to mitigate the bias and sparsity of the sales data. In experiments on the sales records of W-shopping, an operating T-commerce company, the proposed method achieved an NMAE (Normalized Mean Absolute Error) of 0.12 between predicted and actual sales, confirming its effectiveness. The proposed system was deployed in W-shopping's T-commerce system and used for broadcast scheduling. [link] [PDF]
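The 0.12 NMAE figure depends on how the absolute error is normalized, and the abstract does not give the formula. A common choice, assumed here rather than taken from the paper, divides the total absolute error by the total actual sales:

```python
from typing import Sequence


def nmae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Normalized mean absolute error, ASSUMING the definition
    sum(|y - y_hat|) / sum(|y|). Other works normalize by the mean or range of y."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total_abs_error = sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))
    total_actual = sum(abs(y) for y in actual)
    return total_abs_error / total_actual


# Two hypothetical time slots: absolute errors of 10 and 30 on total sales of 300.
print(round(nmae([100.0, 200.0], [90.0, 230.0]), 4))  # 0.1333
```

Because both sums run over the same number of time slots, this equals the mean absolute error divided by the mean actual sales, so it can be read as "average error as a fraction of average sales".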
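조혜근, 김인중, “Integration of Similar Character Pair Discriminators and a Recognizer Using a Confusion Graph”, Journal of KIISE: Software and Applications (정보과학회 논문지 소프트웨어 및 응용), vol. 39, no. 6, pp. 507-514, 2012.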

Character recognition technology is widely used in many fields. However, despite extensive research, handwritten Hangul recognition has not yet achieved high performance. One of the major difficulties in handwritten Hangul recognition is confusion between similar characters, and various pair-wise discrimination methods have been proposed to overcome it. A pair-wise discriminator is a two-class recognizer specialized for a particular confusing character pair that focuses on the differences between the two characters. Pair-wise discriminators are effective at discriminating specific character pairs; however, because each one can distinguish only two classes, they must be integrated with a baseline recognizer to be used in a practical system. So far, there has been little research on integrating pair-wise discriminators with a baseline recognizer. This paper proposes a method that maximizes recognition performance on languages containing many similar characters by systematically utilizing the information obtained from the baseline recognizer and the pair-wise discriminators. The proposed method stores the confusion probabilities between characters, as well as the reliabilities of the baseline recognizer and the pair-wise discriminators, in a confusion graph. It then selects the final recognition result by systematically integrating the outputs of the baseline recognizer and the pair-wise discriminators using the confusion graph. In experiments on the SERI95a dataset, we achieved an error reduction rate of 8.26% solely by improving the integration method, without modifying either the recognition or the discrimination method. [link] [PDF]
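One way to picture the integration step is to treat the confusion graph as a set of edges between easily confused classes, each annotated with the reliability of its pair-wise discriminator, and to let a discriminator overrule the baseline only when it is more reliable than the baseline and disagrees with it. This is a deliberately simplified sketch: the function names, reliability values, and decision rule are assumptions, and the confusion probabilities the paper also stores in the graph are not used here.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Edge = FrozenSet[str]


def integrate(
    candidates: List[Tuple[str, float]],
    discriminate: Callable[[str, str], float],
    edge_reliability: Dict[Edge, float],
    baseline_reliability: float,
) -> str:
    """candidates: baseline recognizer output [(label, score)], best first.
    discriminate(a, b): probability, per the a/b pair-wise discriminator, that the input is a.
    edge_reliability: hypothetical confusion-graph edges {frozenset({a, b}): reliability}."""
    best = candidates[0][0]
    for label, _score in candidates[1:]:
        reliability = edge_reliability.get(frozenset((best, label)))
        if reliability is None:
            continue  # no confusion edge: keep the current decision
        if reliability > baseline_reliability and discriminate(best, label) < 0.5:
            best = label  # a more reliable discriminator prefers the other class
    return best


graph = {frozenset(("가", "각")): 0.95}
cands = [("가", 0.6), ("각", 0.3), ("나", 0.1)]
disc = lambda a, b: 0.2 if (a, b) == ("가", "각") else 0.8
print(integrate(cands, disc, graph, baseline_reliability=0.9))  # 각
```

With `baseline_reliability=0.97` the same discriminator would not be trusted and the baseline answer 가 would stand, which is the kind of trade-off the reliabilities in the confusion graph are meant to capture.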
류상진, 김인중, “Multiple Classifier Combination Based on an Image Degradation Model for Low-Quality Image Recognition”, KIPS Transactions Part B (정보처리학회 논문지 Part B), vol. 2010, no. 3, pp. 233-238, 2010.

In this paper, we propose a multiple classifier combination method based on image degradation modeling to improve recognition performance on low-quality images. Using an image degradation model, the method generates a set of classifiers, each specialized for a specific image quality. During recognition, it combines the classifiers' results by weighted averaging to decide the final result, where the weight of each classifier is determined dynamically from the estimated quality of the input image: classifiers specialized for that quality receive large weights, and the others receive small weights. As a result, the method adapts effectively to variations in image quality. Moreover, because it uses multiple classifiers, it performs more reliably than a single-classifier system on low-quality images. In experiments, the proposed method achieved higher recognition rates than multiple classifier combination methods that do not consider image quality and than single-classifier methods that do. [link] [PDF]
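The quality-dependent weighting can be sketched as follows. The Gaussian kernel over the distance between the estimated input quality and each classifier's training quality, and its bandwidth `sigma`, are assumptions chosen for illustration; the abstract does not specify the paper's weighting function or quality estimator.

```python
import math
from typing import List, Sequence


def quality_weights(input_quality: float, specialized_qualities: Sequence[float], sigma: float) -> List[float]:
    """Weight each classifier by how close its training quality is to the estimated input quality
    (assumed Gaussian kernel), normalized to sum to 1."""
    raw = [math.exp(-((input_quality - q) ** 2) / (2 * sigma ** 2)) for q in specialized_qualities]
    total = sum(raw)
    return [w / total for w in raw]


def combine(scores: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted average of per-class score vectors, one vector per classifier."""
    num_classes = len(scores[0])
    return [sum(w * s[c] for w, s in zip(weights, scores)) for c in range(num_classes)]


# Two classifiers trained on low-quality (0.2) and high-quality (0.8) images.
weights = quality_weights(0.8, [0.2, 0.8], sigma=0.2)
combined = combine([[0.9, 0.1], [0.2, 0.8]], weights)
print(max(range(len(combined)), key=combined.__getitem__))  # 1
```

Because the weights are recomputed for every input, a blurry image automatically shifts trust toward the classifier trained on blurry images without retraining anything.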
박규로, 김인중, “Example-Based Super-Resolution Reconstruction of Text Images Using an Image Observation Model”, KIPS Transactions Part B (정보처리학회 논문지 B), vol. 17, no. 4, pp. 295-302, 2010.

Example-based super-resolution (EBSR) reconstructs high-resolution images by learning the patch-wise correspondence between high-resolution and low-resolution images, and it can reconstruct a high-resolution image from just a single low-resolution image. However, when applied to text images whose font type and size differ from those of the training images, it often produces a lot of noise. The primary reason is that, in the patch matching step of the reconstruction process, input patches can be matched to inappropriate high-resolution patches in the patch dictionary. In this paper, we propose a new patch matching method to overcome this problem. Using an image observation model, it preserves the correlation between the input and output images and therefore effectively suppresses spurious noise caused by mismatched patches. This not only improves the quality of the output image but also allows the system to use a large patch dictionary containing a variety of font types and sizes, which significantly improves its adaptability to variations in font type and size. In experiments, the proposed method outperformed conventional methods in reconstructing images with various fonts and sizes. Moreover, it improved recognition accuracy from 88.58% to 93.54%, confirming that it is also effective for improving recognition performance. [link] [PDF]
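The change in patch matching can be sketched as scoring each high-resolution dictionary patch by how well it explains the observed low-resolution patch after passing through an image observation model. Here the observation model is assumed to be plain block averaging (blur plus decimation by `factor`); the paper's actual observation model and dictionary features may differ, and all names are hypothetical.

```python
from typing import List, Sequence

Patch = List[List[float]]


def observe(hr: Sequence[Sequence[float]], factor: int) -> Patch:
    """Assumed observation model: average each factor x factor block (blur + downsample)."""
    rows, cols = len(hr) // factor, len(hr[0]) // factor
    return [
        [
            sum(hr[i * factor + di][j * factor + dj] for di in range(factor) for dj in range(factor))
            / factor ** 2
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def observation_error(hr: Sequence[Sequence[float]], lr: Sequence[Sequence[float]], factor: int) -> float:
    """Sum of squared differences between the simulated and the observed low-resolution patch."""
    simulated = observe(hr, factor)
    return sum((a - b) ** 2 for row_s, row_o in zip(simulated, lr) for a, b in zip(row_s, row_o))


def match_patch(lr: Sequence[Sequence[float]], dictionary: Sequence[Patch], factor: int = 2) -> Patch:
    """Pick the high-resolution patch whose simulated observation best explains the input."""
    return min(dictionary, key=lambda hr: observation_error(hr, lr, factor))


blank = [[0.0] * 4 for _ in range(4)]
left_stroke = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
print(match_patch([[1.0, 0.0], [1.0, 0.0]], [blank, left_stroke]) is left_stroke)  # True
```

Comparing in the observed (low-resolution) domain means a dictionary patch is only chosen if it is consistent with what the camera actually saw, which is what lets a large, multi-font dictionary be used without admitting mismatched patches.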
박규로, 김인중, “Fast Patch Retrieval for Example-Based Super-Resolution Using Multi-Phase Candidate Reduction”, Journal of KIISE: Software and Applications (정보과학회 논문지 소프트웨어 및 응용), vol. 37, no. 4, pp. 264-272, 2010.

Example-based super-resolution restores a high-resolution image from a low-resolution image by training on and retrieving image patches; it performs well and can be applied even to a single low-resolution frame. However, it is very slow because retrieving image patches during restoration requires a large number of comparisons, so an efficient patch retrieval algorithm is essential for improving restoration speed. In this paper, we applied various high-dimensional feature retrieval methods that can be used for patch retrieval to a practical example-based super-resolution system and compared their speed. We also propose applying the multi-phase candidate reduction approach, which has been used successfully in character recognition but not in super-resolution, to the patch retrieval step. In the experiments, LSH was the fastest of the conventional methods. The proposed multi-phase candidate reduction method was even faster: for 1024×1024 images, it restored images up to 3.12 times faster than LSH. [link] [PDF]
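Multi-phase candidate reduction can be sketched as a cascade: an early phase ranks all dictionary entries with a cheap distance (here, over only the first few feature dimensions) and keeps a shortlist, and the final phase computes the full distance only on that shortlist. The prefix-distance choice and the `(dims, keep)` phase schedule are assumptions for illustration. The truncated squared distance is a lower bound on the full one, but ranking by it is still a heuristic, so a shortlist that is too small can miss the true nearest patch.

```python
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector, dims: Optional[int] = None) -> float:
    """Squared Euclidean distance over the first `dims` dimensions (all if None)."""
    n = len(a) if dims is None else dims
    return sum((a[i] - b[i]) ** 2 for i in range(n))


def multiphase_search(query: Vector, dictionary: Sequence[Vector], phases: List[Tuple[int, int]]) -> int:
    """Return the index of the (approximate) nearest dictionary entry.
    phases: [(dims, keep), ...] applied in order; each phase shrinks the candidate list."""
    candidates = list(range(len(dictionary)))
    for dims, keep in phases:
        candidates.sort(key=lambda i: sq_dist(query, dictionary[i], dims))
        candidates = candidates[:keep]
    # Final phase: exact distance, computed on the surviving candidates only.
    return min(candidates, key=lambda i: sq_dist(query, dictionary[i]))


patches = [[0, 0, 0, 0], [0, 0, 5, 5], [1, 1, 0, 0], [9, 9, 9, 9], [0, 1, 0, 1]]
print(multiphase_search([0, 1, 0, 1], patches, phases=[(2, 3)]))  # 4
```

In a real system each phase would use progressively more expensive features on progressively fewer candidates, so the full-dimensional comparison is paid only for a small fraction of the dictionary.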
김인중, “Adaptive Binarization of Camera Document Images Through Image Quality Analysis”, Journal of KIISE: Software and Applications (정보과학회논문지: 소프트웨어 및 응용), vol. 34, no. 9, pp. 797-803, Sep. 2007.

Binarization that adapts to variations in image quality is essential for camera-based document recognition. This paper proposes a binarization method that effectively adapts to camera images of various qualities through image quality analysis. First, it analyzes the effect of binarization parameters on the result and proposes a method for measuring the quality of camera images. Then, it statistically analyzes the relationship between the measured image quality and the binarization parameters. Finally, using this analysis, it proposes a binarization method that automatically adapts to the quality of the input image. The experimental results show that there is a significant correlation between image quality and the binarization parameters, and that the proposed method adapts to variations in image quality by estimating appropriate parameters for each image. [link] [PDF]
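The adaptive scheme can be illustrated with a global variant of Niblack's threshold T = m + k·s, where the parameter k is predicted from a measured image quality. For brevity this sketch uses a single global threshold over a 1-D row of pixels, a crude sharpness score as the quality measure, and a linear model k = a + b·quality with hypothetical coefficients `a` and `b`; the paper's quality measure, binarization algorithm, and fitted relationship may all differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def sharpness(row: Sequence[float]) -> float:
    """Hypothetical quality measure: mean absolute difference between neighbouring pixels."""
    return mean(abs(b - a) for a, b in zip(row, row[1:]))


def predict_k(quality: float, a: float, b: float) -> float:
    """Hypothetical linear model, fitted offline, relating image quality to the best k."""
    return a + b * quality


def binarize(row: Sequence[float], a: float = -0.2, b: float = 0.0) -> List[int]:
    """Global mean + k * std threshold with quality-dependent k; 1 marks dark (text) pixels."""
    k = predict_k(sharpness(row), a, b)
    threshold = mean(row) + k * pstdev(row)
    return [1 if p < threshold else 0 for p in row]


print(binarize([10, 10, 200, 200], a=0.0, b=0.0))  # [1, 1, 0, 0]
```

The point of the sketch is the data flow: measure quality first, then let that measurement choose the binarization parameter, rather than fixing k for all images.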
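진유호, 김호연, 김인중, 김진형, “Handwritten Hangul Word Recognition for Small Vocabularies Using Grapheme Combination Types”, Journal of KIISE: Software and Applications (정보과학회논문지, 소프트웨어 및 응용), vol. 28, no. 1, pp. 52-63, 2001.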

There are two kinds of approaches to handwritten word recognition. One recognizes a word by segmenting it into characters and recognizing each character, while the other is a lexicon-driven approach that compares the input image directly with the words in a lexicon. The latter is especially effective when the words to be recognized come from a small vocabulary. This paper proposes a lexicon-driven handwritten Hangul word recognition method that recognizes the input image by sequentially spotting graphemes and finding the optimal combination of the results.
The input image is recognized by matching it against every word in the lexicon. A word is represented by a sequence of graphemes in writing order, while an image is represented by a set of strokes. The graphemes of a word are searched for, one at a time, in the set of strokes extracted from the input image. At each step, an expectation region where the target grapheme is likely to appear is estimated from the previous matching state and the type of the target grapheme, and the grapheme spotter searches for the grapheme only within that region. The grapheme spotter outputs a number of grapheme hypotheses, each composed of a group of strokes, together with their scores. The hypotheses generated at each step form the search space for finding the optimal word matching. For efficiency, we organized the lexicon into a trie and used dynamic programming to find the optimal word matching. We also used several heuristics, such as lexicon reduction and search-space reduction, to increase the recognition speed. The proposed method can recognize unconstrained handwritten words and, because it uses a dynamic lexicon, is also applicable to environments in which the lexicon changes.
The effectiveness of the proposed method was confirmed in an experiment on 613 word images with a lexicon of 39 words, in which it achieved a recognition rate of 98.54%. [link] [PDF]
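Organizing the lexicon as a trie means that words sharing a prefix also share the work of matching that prefix. The sketch below shows only that aspect: a hypothetical `spot(prefix, unit)` scoring function stands in for the grapheme spotter and its expectation regions, syllables stand in for graphemes to keep the example short, and the search is a plain depth-first traversal rather than the paper's dynamic-programming matcher over stroke sets.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set if a lexicon word ends at this node


def build_trie(lexicon: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for word in lexicon:
        node = root
        for unit in word:
            node = node.children.setdefault(unit, TrieNode())
        node.word = word
    return root


def best_word(root: TrieNode, spot: Callable[[str, str], float]) -> Tuple[float, Optional[str]]:
    """Depth-first search; spot(prefix, unit) is called once per trie edge, so a prefix
    shared by several words is scored only once. Scores are summed log-likelihoods."""
    best: Tuple[float, Optional[str]] = (float("-inf"), None)
    stack = [(root, "", 0.0)]
    while stack:
        node, prefix, score = stack.pop()
        if node.word is not None and score > best[0]:
            best = (score, node.word)
        for unit, child in node.children.items():
            stack.append((child, prefix + unit, score + spot(prefix, unit)))
    return best


calls = []
unit_scores = {"가": -0.1, "나": -0.2, "다": -1.0, "라": -2.0}  # hypothetical spotter scores


def spot(prefix: str, unit: str) -> float:
    calls.append(prefix + unit)
    return unit_scores[unit]


score, word = best_word(build_trie(["가나", "가다", "라"]), spot)
print(word, len(calls))  # 가나 4
```

Matching the three words separately would call the spotter five times; the trie needs four because the shared prefix 가 is evaluated once, and the savings grow with the size of the lexicon.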
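김인중, 김진형, “Postprocessing of Name Field Recognition Using Cross-Reference of Hangul, Chinese, and English Names in an Automatic Form Input System”, Journal of KIISE (B) (한국 정보과학회 논문지(B)), vol. 26, no. 7, pp. 900-908, Jul. 1999.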

In this paper, we propose a postprocessing method for name fields in an automatic form reading system. Many forms used in Korea have fields for writing a name in Hangul, Chinese characters, and Roman characters. By exploiting the relationships between these name fields, recognition errors in the names produced by the recognizer can be corrected to achieve better performance. This paper proposes a method that corrects such errors by cross-checking names written in different scripts: it finds the correspondence between the characters and graphemes of the name recognition results in different scripts using dynamic programming, and uses this correspondence to detect and correct recognition errors. Experiments on a form recognition system show that the proposed postprocessing method is very effective. [link] [PDF]
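The cross-checking idea can be sketched with an edit-distance dynamic program: among the Hangul name hypotheses produced by the recognizer, choose the one whose romanization is closest to the recognized Roman-character name. The romanization table, the candidate lists, and the brute-force enumeration of hypotheses are assumptions for illustration; the paper's own correspondence model between characters and graphemes of different scripts is not reproduced here.

```python
from itertools import product
from typing import Dict, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution (0 if equal)
        prev = cur
    return prev[-1]


def correct_hangul_name(candidates: Sequence[Sequence[str]], roman_name: str,
                        romanize: Dict[str, str]) -> str:
    """candidates[k]: recognizer hypotheses for the k-th Hangul syllable.
    romanize: hypothetical syllable-to-Roman table."""
    best = min(product(*candidates),
               key=lambda syllables: edit_distance("".join(romanize[s] for s in syllables), roman_name))
    return "".join(best)


table = {"김": "KIM", "긴": "KIN", "민": "MIN", "수": "SU", "주": "JU"}
print(correct_hangul_name([["긴", "김"], ["민"], ["주", "수"]], "KIMMINSU", table))  # 김민수
```

The same idea works in the other direction: a confidently recognized Hangul name can be used to correct the Roman-character field.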

Domestic Conferences

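한성화, 김현욱, 나기현, 김인중, “Deep Learning-Based Road Image Recognition in the APACHE5 NPU SoC Environment”, KIISE KSC2023.
서주현, 신윤선, 이민영, 김인중, “Deep Learning-Based Road Image Recognition in an SoC Environment for Autonomous Driving”, KISM 2022 Conference (Best Paper Award).
이충현, 김성재, 이고은, 김인중, “Phoneme Segmentation of Speech Signals Using Multi-Level Features”, KIISE KCC2022 (Outstanding Paper Award).
김예원, 이충현, 김성재, 김인중, “Development of TTS- and SVS-Based Web Services Using the Streamlit Framework”, UCWI2022.
오준석, 이찬효, 우옥균, 이세현, 김인중, “WICWIU.v3: A C++-Based Open-Source Deep Learning Framework Supporting Recurrent Neural Networks and Transformers”, KIISE KSC2021.
이민영, 전혜원, 김인중, “Deep Learning-Based Real-Time Human Image Segmentation and Pose Estimation”, KIISE KCC2020.
선한결, 이명희, 김인중, “GAN-Based Viewpoint Transformation for Vehicle Image Data Augmentation”, KIISE KCC2020 (Outstanding Paper Award).
윤동휘, 이명희, 백주열, 김인중, “Deep Learning-Based Drivable Area Detection for Autonomous Vehicles”, KIISE KSC2020.
윤동휘, 권예성, 김경협, 윤성결, 최은서, 김인중, “WICWIU.v2: A C++-Based Open-Source Deep Learning Framework Supporting Generative Adversarial Networks”, KIISE KCC2019 (Outstanding Presentation Paper Award).
최진, 양진혁, 김인중, “Improving the Quality of Deep Learning-Based TTS Using Generative Adversarial Networks”, KIISE KCC2019 (Outstanding Presentation Paper Award).
양진혁, 김인중, “Deep Learning-Based Speech Synthesis Using Noise Attention”, KIISE KCC2019.
박참진, 강민수, 김인중, “Hangul Text Image Recognition Using Convolutional Recurrent Neural Networks”, KIISE KCC2019.
박천명, 김지웅, 기윤호, 김지현, 김인중, “WICWIU: A Readable and Extensible C++-Based Open-Source Deep Learning Framework”, KIISE KCC2018, 2018 (Outstanding Paper Award).
양진혁, 곽효빈, 김인중, “Image-Based Hangul Font Recognition Using Deep Learning”, 2017 Conference on Hangul and Korean Language Information Processing, 2018.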
류상진, 박규로, 김인중, “Low-Resolution Character Image Recognition by Joint Training of Super-Resolution Reconstruction and a Recognizer”, Proceedings of the KIISE 2009 Fall Conference, vol. 36, no. 2(A), pp. 204-207, 2009.

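Recognizing low-resolution character images remains a difficult problem. In this paper, we propose a method for recognizing low-resolution character images by jointly training super-resolution reconstruction and a recognizer. Whereas existing systems did not adequately reflect the characteristics of super-resolved images, the proposed system concatenates the feature vectors of the input low-resolution image and its super-resolved image and uses them together for both training and recognition, so that recognition exploits the complementary information in both images. In experiments, compared with a recognizer trained independently of super-resolution reconstruction, the recognition rate improved from 77.66% to 98.79%, a gain of 21.13 percentage points. [link] [PDF]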
박규로, 김인중, “Recognition-Based Super-Resolution Image Reconstruction”, Proceedings of the KIISE Korea Computer Congress 2009, vol. 36, no. 1(A), pp. 279-283, 2009.

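In this paper, we propose a recognition-based super-resolution image reconstruction method. Whereas conventional super-resolution methods operate independently of the recognition module, the proposed method integrates the super-resolution and recognition processes to optimize the system as a whole. It performs reconstruction and recognition simultaneously by using class-specific prior models in the super-resolution process and selecting the optimal class by referring to the recognizer's response. In experiments, performing reconstruction and recognition simultaneously with the proposed method improved both image quality and recognition performance compared with the conventional approach of performing them independently. [link] [PDF]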
류상진, 김인중, 박규로, “Handwritten Hangul Recognition Based on Multiple Neural Networks Using Overlapped Clustering”, Proceedings of the KIISE Korea Computer Congress 2009, vol. 36, no. 1(A), pp. 536-540, 2009.

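Handwritten Hangul recognition remains a difficult, unsolved problem despite extensive research. In this study, we propose a method for recognizing handwritten Hangul using overlapped clustering and multiple neural networks. Because Hangul has a large number of character classes, it is difficult to recognize with a single neural network. Therefore, neural network-based handwritten Hangul recognizers typically first divide the Hangul characters into several clusters, coarsely classify the input character into one of them, and then perform fine classification within that cluster using a neural network. However, such systems make many errors in the coarse classification stage, which lowers the overall recognition rate. We therefore propose a method that still uses coarse classification to reduce the number of target classes for each neural network, but reduces coarse classification errors by assigning character classes that are frequently confused in that stage to multiple clusters. In experiments, the proposed method achieved a higher recognition rate than a neural network recognizer based on the commonly used six-type classification. [link] [PDF]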
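The overlapping assignment can be sketched as follows: route each class's training samples through the coarse classifier, and let a class join every cluster that receives at least a given fraction of its samples, so frequently confused classes appear in more than one cluster. The routing counts and the `threshold` value are hypothetical, and the paper's clustering criterion may differ.

```python
from typing import Dict, Set


def overlapping_clusters(routing: Dict[str, Dict[int, int]], threshold: float) -> Dict[int, Set[str]]:
    """routing[c][k]: number of training samples of class c that the coarse classifier sends to cluster k.
    A class joins every cluster that receives at least `threshold` of its samples."""
    clusters: Dict[int, Set[str]] = {}
    for label, counts in routing.items():
        total = sum(counts.values())
        for cluster, n in counts.items():
            if n / total >= threshold:
                clusters.setdefault(cluster, set()).add(label)
    return clusters


routing = {"가": {0: 90, 1: 10}, "각": {0: 60, 1: 40}, "나": {1: 100}}
print(sorted(overlapping_clusters(routing, threshold=0.2)[1]))  # ['각', '나']
```

Here 각 belongs to both clusters, so its fine-classification network can still recognize it even when the coarse classifier routes it to cluster 1, at the cost of slightly larger per-cluster networks.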
이충식, 김인중, 신종탁, 김진형, “Fast Handwritten Hangul Word Verifier Using Character Pre-Segmentation and Support Vector Machines”, Proceedings of the IEEK Conference (한국전자공학회 학술대회 논문지), vol. 3, pp. 37-40, Oct. 2000.

This paper presents a fast method for verifying handwritten Hangul address words, based on pre-segmentation and recognition by DP matching. An address line image is over-segmented by analyzing the topology of its connected components and its projection profile. A fast verifier for individual Hangul characters was developed using an SVM (Support Vector Machine). The segmentation hypotheses are represented in a lattice structure, and a best-path search by dynamic programming produces the most probable segmentation path and the final verification score. The word verifier was tested on a database of 310 address images, and the results indicate the promise of this method. [link] [PDF]
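The best-path search over the segmentation lattice can be sketched as a dynamic program over cut points: each character of the target word may occupy the image between any two cut points at most `max_span` segments apart, and each such span is scored by a character verifier. The `score(i, j, ch)` callback stands in for the SVM verifier and, like `max_span`, is a hypothetical placeholder rather than the paper's interface.

```python
import math
from typing import Callable, List, Tuple


def best_segmentation(num_cuts: int, word: str, score: Callable[[int, int, str], float],
                      max_span: int = 3) -> Tuple[float, List[int]]:
    """Match `word` to an over-segmented image with cut points 0..num_cuts-1.
    Returns the best total log-score and the chosen cut points (empty list if no path)."""
    neg = -math.inf
    last = num_cuts - 1
    n = len(word)
    dp = [[neg] * num_cuts for _ in range(n + 1)]   # dp[k][j]: best score with first k chars ending at cut j
    back = [[-1] * num_cuts for _ in range(n + 1)]  # previous cut on the best path
    dp[0][0] = 0.0
    for k in range(1, n + 1):
        for j in range(1, num_cuts):
            for i in range(max(0, j - max_span), j):
                if dp[k - 1][i] == neg:
                    continue
                s = dp[k - 1][i] + score(i, j, word[k - 1])
                if s > dp[k][j]:
                    dp[k][j], back[k][j] = s, i
    if dp[n][last] == neg:
        return neg, []
    cuts = [last]
    for k in range(n, 0, -1):
        cuts.append(back[k][cuts[-1]])
    return dp[n][last], cuts[::-1]


good_spans = {(0, 2, "A"): -0.1, (2, 3, "B"): -0.1}
total, cuts = best_segmentation(4, "AB", lambda i, j, ch: good_spans.get((i, j, ch), -5.0))
print(cuts)  # [0, 2, 3]
```

The DP evaluates each (span, character) pair at most once per word position, which is what keeps verification fast even though the lattice encodes many alternative segmentations.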
김인중, 김진형, “Similar Character Pair Discrimination Using Stroke Importance”, 4th Workshop on Character Recognition, Dec. 2000
진유호, 김호연, 김인중, 김진형, 최영우, “Limited-Vocabulary Hangul Word Recognition Using Grapheme Combination Types”, Proceedings of the KIPS Fall Conference (정보처리학회 추계 학술대회 논문지), vol. 26, no. 7, pp. AI136-AI143, Soongsil University, Oct. 1999.

[link] [PDF]
김인중, 김진형, “A Vectorizer Using Run-Length Codes and the Hough Transform”, Proceedings of the 7th Workshop on Image Processing and Understanding, pp. 53-58, KAIST, Feb. 1995.

University Journals

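김인중, 이강, “Assessment of Program Learning Outcome Achievement of Prospective Computer Engineering Graduates at Handong Global University”, Handong Journal, vol. 8, pp. 130-146, 2010.
김인중, 이강, “Comparison of Academic Achievement between Students in Accredited and Non-Accredited Engineering Programs at Handong Global University”, vol. 9, pp. 147-180, 2011.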

Patents

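Method for synthesizing speech from text with interference-free multi-style control (Registration No. 1024954550000)
Method for lightweighting speech synthesis in a deep learning-based end-to-end speech synthesis system (Registration No. 1022880510000)
Method for extending neural network structures using a deep learning framework (Registration No. 1022199040000)
Intelligent dialogue system using a parallel corpus and dialogue data in another language, and operating method thereof (Registration No. 1021058760000)
Method for providing T-commerce broadcast scheduling information based on big-data deep learning (Registration No. 1018871960000)
Method for providing home shopping and T-commerce curation using big-data technology (Registration No. 1019133580000)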