
Journal articles on the topic 'Encoder and decoder feature'

Consult the top 50 journal articles for your research on the topic 'Encoder and decoder feature.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Shim, Jae-hun, Hyunwoo Yu, Kyeongbo Kong, and Suk-Ju Kang. "FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 2 (2023): 2263–71. http://dx.doi.org/10.1609/aaai.v37i2.25321.

Abstract:
With the success of Vision Transformer (ViT) in image classification, its variants have yielded great success in many downstream vision tasks. Among those, the semantic segmentation task has also benefited greatly from the advance of ViT variants. However, most studies of the transformer for semantic segmentation only focus on designing efficient transformer encoders, rarely giving attention to designing the decoder. Several studies make attempts in using the transformer decoder as the segmentation decoder with class-wise learnable query. Instead, we aim to directly use the encoder features as the queries. This paper proposes the Feature Enhancing Decoder transFormer (FeedFormer) that enhances structural information using the transformer decoder. Our goal is to decode the high-level encoder features using the lowest-level encoder feature. We do this by formulating high-level features as queries, and the lowest-level feature as the key and value. This enhances the high-level features by collecting the structural information from the lowest-level feature. Additionally, we use a simple reformation trick of pushing the encoder blocks to take the place of the existing self-attention module of the decoder to improve efficiency. We show the superiority of our decoder with various light-weight transformer-based decoders on popular semantic segmentation datasets. Despite the minute computation, our model has achieved state-of-the-art performance in the performance computation trade-off. Our model FeedFormer-B0 surpasses SegFormer-B0 with 1.8% higher mIoU and 7.1% less computation on ADE20K, and 1.7% higher mIoU and 14.4% less computation on Cityscapes, respectively. Code will be released at: https://github.com/jhshim1995/FeedFormer.
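The decoding step described in this abstract, high-level encoder features as queries and the lowest-level encoder feature as key and value, corresponds to standard cross-attention. A minimal PyTorch sketch of the idea, with illustrative module and tensor names rather than the authors' released code:

    import torch
    import torch.nn as nn

    class FeatureEnhancingCrossAttention(nn.Module):
        """Cross-attention in which high-level features query the lowest-level feature."""
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, high_level, lowest_level):
            # high_level:   (B, N_high, C) tokens from a deep encoder stage -> queries
            # lowest_level: (B, N_low,  C) tokens from the first encoder stage -> keys and values
            enhanced, _ = self.attn(query=high_level, key=lowest_level, value=lowest_level)
            return self.norm(high_level + enhanced)  # residual keeps the original semantics

    # toy usage: 1/32-scale tokens collect structural detail from 1/4-scale tokens
    high = torch.randn(2, 16 * 16, 256)
    low = torch.randn(2, 128 * 128, 256)
    print(FeatureEnhancingCrossAttention(256)(high, low).shape)  # torch.Size([2, 256, 256])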
2

Wen, Ying, Kai Xie, and Lianghua He. "Segmenting Medical MRI via Recurrent Decoding Cell." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 12452–59. http://dx.doi.org/10.1609/aaai.v34i07.6932.

Abstract:
The encoder-decoder networks are commonly used in medical image segmentation due to their remarkable performance in hierarchical feature fusion. However, the expanding path for feature decoding and spatial recovery does not consider the long-term dependency when fusing feature maps from different layers, and the universal encoder-decoder network does not make full use of the multi-modality information to improve the network robustness especially for segmenting medical MRI. In this paper, we propose a novel feature fusion unit called Recurrent Decoding Cell (RDC) which leverages convolutional RNNs to memorize the long-term context information from the previous layers in the decoding phase. An encoder-decoder network, named Convolutional Recurrent Decoding Network (CRDN), is also proposed based on RDC for segmenting multi-modality medical MRI. CRDN adopts CNN backbone to encode image features and decode them hierarchically through a chain of RDCs to obtain the final high-resolution score map. The evaluation experiments on BrainWeb, MRBrainS and HVSMR datasets demonstrate that the introduction of RDC effectively improves the segmentation accuracy as well as reduces the model size, and the proposed CRDN owns its robustness to image noise and intensity non-uniformity in medical MRI.
3

Sun, Jun, Junbo Zhang, Xuesong Gao, et al. "Fusing Spatial Attention with Spectral-Channel Attention Mechanism for Hyperspectral Image Classification via Encoder–Decoder Networks." Remote Sensing 14, no. 9 (2022): 1968. http://dx.doi.org/10.3390/rs14091968.

Abstract:
In recent years, convolutional neural networks (CNNs) have been widely used in hyperspectral image (HSI) classification. However, feature extraction on hyperspectral data still faces numerous challenges. Existing methods cannot extract spatial and spectral-channel contextual information in a targeted manner. In this paper, we propose an encoder–decoder network that fuses spatial attention and spectral-channel attention for HSI classification from three public HSI datasets to tackle these issues. In terms of feature information fusion, a multi-source attention mechanism including spatial and spectral-channel attention is proposed to encode the spatial and spectral multi-channels contextual information. Moreover, three fusion strategies are proposed to effectively utilize spatial and spectral-channel attention. They are direct aggregation, aggregation on feature space, and Hadamard product. In terms of network development, an encoder–decoder framework is employed for hyperspectral image classification. The encoder is a hierarchical transformer pipeline that can extract long-range context information. Both shallow local features and rich global semantic information are encoded through hierarchical feature expressions. The decoder consists of suitable upsampling, skip connection, and convolution blocks, which fuse multi-scale features efficiently. Compared with other state-of-the-art methods, our approach has greater performance in hyperspectral image classification.
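The three fusion strategies named in this abstract (direct aggregation, aggregation on a feature space, and the Hadamard product) can be sketched in a few lines of PyTorch. This is an illustration of what such strategies typically look like, not the paper's implementation; the module name and the 1x1 projection are assumptions.

    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        """Fuse spatial-attention and spectral-channel-attention feature maps."""
        def __init__(self, channels, mode="hadamard"):
            super().__init__()
            self.mode = mode
            # 1x1 convolution used when aggregating in a shared feature space
            self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, spatial_feat, spectral_feat):
            if self.mode == "sum":       # direct aggregation
                return spatial_feat + spectral_feat
            if self.mode == "concat":    # aggregation on a common feature space
                return self.project(torch.cat([spatial_feat, spectral_feat], dim=1))
            return spatial_feat * spectral_feat  # element-wise (Hadamard) product

    x_spa = torch.randn(1, 64, 32, 32)
    x_spe = torch.randn(1, 64, 32, 32)
    for mode in ("sum", "concat", "hadamard"):
        print(mode, AttentionFusion(64, mode)(x_spa, x_spe).shape)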
4

Alharbi, Majed, Ahmed Stohy, Mohammed Elhenawy, Mahmoud Masoud, and Hamiden El-Wahed Khalifa. "Solving Traveling Salesman Problem with Time Windows Using Hybrid Pointer Networks with Time Features." Sustainability 13, no. 22 (2021): 12906. http://dx.doi.org/10.3390/su132212906.

Abstract:
This paper introduces a time-efficient deep learning-based solution to the traveling salesman problem with time windows (TSPTW). Our goal is to reduce the total tour length traveled by the agent without violating any time limitations. This will aid in decreasing the time required to supply any type of service, as well as lowering the emissions produced by automobiles, allowing our planet to recover from air pollution emissions. The proposed model is a variation of the pointer networks that has a better ability to encode the TSPTW problems. The model proposed in this paper is inspired by our previous work that introduces a hybrid context encoder and a multi-attention decoder. The hybrid encoder primarily comprises the transformer encoder and the graph encoder; these encoders encode the feature vector before passing it to the attention decoder layer. The decoder consists of transformer context and graph context as well. The output attentions from the two decoders are aggregated and used to select the following step in the trip. To the best of our knowledge, our network is the first neural model that will be able to solve medium-size TSPTW problems. Moreover, we conducted sensitivity analysis to explore how the model performance changes as the time window width in the training and testing data changes. The experimental work shows that our proposed model outperforms the state-of-the-art model for TSPTW of sizes 20, 50 and 100 nodes/cities. We expect that our model will become a state-of-the-art methodology for solving the TSPTW problems.
5

Ai, Xinbo, Yunhao Xie, Yinan He, and Yi Zhou. "Improve SegNet with feature pyramid for road scene parsing." E3S Web of Conferences 260 (2021): 03012. http://dx.doi.org/10.1051/e3sconf/202126003012.

Abstract:
Road scene parsing is a common task in semantic segmentation. Its images are characterized by complex scene context and large differences among targets of the same category at different scales. To address these problems, we propose a semantic segmentation model combined with edge detection. We extend the segmentation network with an encoder-decoder structure by adding an edge feature pyramid module, namely Edge Feature Pyramid Network (EFPNet, for short). This module uses edge detection operators to get boundary information and then combines the multiscale features to improve the ability to recognize small targets. EFPNet can make up for the shortcomings of convolutional neural network features, and it helps to produce smooth segmentation. After extracting features of the encoder and decoder, EFPNet uses Euclidean distance to compare the similarity between the representations of the encoder and the decoder, which can increase the decoder’s ability to recover information from the encoder. We evaluated the proposed method on the Cityscapes dataset. The experiments demonstrate that the accuracies are improved by 7.5% and 6.2% over the popular SegNet and ENet, respectively, and the ablation experiment validates the effectiveness of our method.
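One plausible reading of the Euclidean-distance comparison between encoder and decoder representations is an auxiliary consistency term added to the segmentation loss. The sketch below shows one way to compute such a term in PyTorch; the function name and the resizing step are assumptions, not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def encoder_decoder_consistency(enc_feat, dec_feat):
        """Mean Euclidean distance between encoder and decoder feature maps (B, C, H, W)."""
        if dec_feat.shape[-2:] != enc_feat.shape[-2:]:
            # bring the decoder feature to the encoder resolution before comparing
            dec_feat = F.interpolate(dec_feat, size=enc_feat.shape[-2:],
                                     mode="bilinear", align_corners=False)
        return torch.norm(enc_feat - dec_feat, p=2, dim=1).mean()

    # toy usage: the distance would be added to the segmentation loss as a penalty
    enc = torch.randn(2, 64, 64, 64)
    dec = torch.randn(2, 64, 32, 32)
    print(encoder_decoder_consistency(enc, dec))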
6

Jiang, S. L., G. Li, W. Yao, Z. H. Hong, and T. Y. Kuc. "DUAL PYRAMIDS ENCODER-DECODER NETWORK FOR SEMANTIC SEGMENTATION IN GROUND AND AERIAL VIEW IMAGES." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLIII-B2-2020 (August 12, 2020): 605–10. http://dx.doi.org/10.5194/isprs-archives-xliii-b2-2020-605-2020.

Abstract:
Semantic segmentation is a fundamental research task in computer vision, which intends to assign a certain category to every pixel. Currently, most existing methods only utilize the deepest feature map for decoding, while high-level features inevitably get lost during the procedure of down-sampling. In the decoder section, transposed convolution or bilinear interpolation is widely used to restore the size of the encoded feature map; however, few optimizations are applied during the up-sampling process, which is detrimental to the performance for grouping and classification. In this work, we propose a dual pyramids encoder-decoder deep neural network (DPEDNet) to tackle the above issues. The first pyramid integrated and encoded multi-resolution features through sequentially stacked merging, and the second pyramid decoded the features through dense atrous convolution with chained up-sampling. Without post-processing and multi-scale testing, the proposed network achieves state-of-the-art performance on two challenging benchmark image datasets for both ground and aerial view scenes.
7

Abdulaziz AlArfaj, Abeer, and Hanan Ahmed Hosni Mahmoud. "A Moving Object Tracking Technique Using Few Frames with Feature Map Extraction and Feature Fusion." ISPRS International Journal of Geo-Information 11, no. 7 (2022): 379. http://dx.doi.org/10.3390/ijgi11070379.

Abstract:
Moving object tracking techniques using machine and deep learning require large datasets for neural model training. New strategies need to be invented that utilize smaller data training sizes to realize the impact of large-sized datasets. However, current research does not balance the training data size and neural parameters, which creates the problem of inadequacy of the information provided by the low visual data content for parameter optimization. To enhance the performance of tracking moving objects that appear in only a few frames, this research proposes a deep learning model using an abundant encoder–decoder (a high-resolution transformer (HRT) encoder–decoder). An HRT encoder–decoder employs feature map extraction that focuses on high-resolution feature maps that are more representative of the moving object. In addition, we employ the proposed HRT encoder–decoder for feature map extraction and fusion to compensate for the few frames that contain the visual information. Our extensive experiments on the Pascal DOC19 and MS-DS17 datasets show that the HRT encoder–decoder abundant model outperforms those of previous studies involving few frames that include moving objects.
8

Wang, Hongquan, Xinshan Zhu, Chao Ren, Lan Zhang, and Shugen Ma. "A Frequency Attention-Based Dual-Stream Network for Image Inpainting Forensics." Mathematics 11, no. 12 (2023): 2593. http://dx.doi.org/10.3390/math11122593.

Abstract:
The rapid development of digital image inpainting technology is causing serious hidden danger to the security of multimedia information. In this paper, a deep network called frequency attention-based dual-stream network (FADS-Net) is proposed for locating the inpainting region. FADS-Net is established by a dual-stream encoder and an attention-based associative decoder. The dual-stream encoder includes two feature extraction streams, the raw input stream (RIS) and the frequency recalibration stream (FRS). RIS directly captures feature maps from the raw input, while FRS performs feature extraction after recalibrating the input via learning in the frequency domain. In addition, a module based on dense connection is designed to ensure efficient extraction and full fusion of dual-stream features. The attention-based associative decoder consists of a main decoder and two branch decoders. The main decoder performs up-sampling and fine-tuning of fused features by using attention mechanisms and skip connections, and ultimately generates the predicted mask for the inpainted image. Then, two branch decoders are utilized to further supervise the training of two feature streams, ensuring that they both work effectively. A joint loss function is designed to supervise the training of the entire network and two feature extraction streams for ensuring optimal forensic performance. Extensive experimental results demonstrate that the proposed FADS-Net achieves superior localization accuracy and robustness on multiple datasets compared to the state-of-the-art inpainting forensics methods.
9

Li, Xin, Feng Xu, Runliang Xia, et al. "Encoding Contextual Information by Interlacing Transformer and Convolution for Remote Sensing Imagery Semantic Segmentation." Remote Sensing 14, no. 16 (2022): 4065. http://dx.doi.org/10.3390/rs14164065.

Abstract:
Contextual information plays a pivotal role in the semantic segmentation of remote sensing imagery (RSI) due to the imbalanced distributions and ubiquitous intra-class variants. The emergence of the transformer intrigues the revolution of vision tasks with its impressive scalability in establishing long-range dependencies. However, the local patterns, such as inherent structures and spatial details, are broken with the tokenization of the transformer. Therefore, the ICTNet is devised to confront the deficiencies mentioned above. Principally, ICTNet inherits the encoder–decoder architecture. First of all, Swin Transformer blocks (STBs) and convolution blocks (CBs) are deployed and interlaced, accompanied by encoded feature aggregation modules (EFAs) in the encoder stage. This design allows the network to learn the local patterns and distant dependencies and their interactions simultaneously. Moreover, multiple DUpsamplings (DUPs) followed by decoded feature aggregation modules (DFAs) form the decoder of ICTNet. Specifically, the transformation and upsampling loss are shrunken while recovering features. Together with the devised encoder and decoder, the well-rounded context is captured and contributes to the inference most. Extensive experiments are conducted on the ISPRS Vaihingen, Potsdam and DeepGlobe benchmarks. Quantitative and qualitative evaluations exhibit the competitive performance of ICTNet compared to mainstream and state-of-the-art methods. Additionally, the ablation study of DFA and DUP is implemented to validate the effects.
10

Geng, Yaogang, Hongyan Mei, Xiaorong Xue, and Xing Zhang. "Image-Caption Model Based on Fusion Feature." Applied Sciences 12, no. 19 (2022): 9861. http://dx.doi.org/10.3390/app12199861.

Abstract:
The encoder–decoder framework is the main frame of image captioning. The convolutional neural network (CNN) is usually used to extract grid-level features of the image, and the graph convolutional neural network (GCN) is used to extract the image’s region-level features. Grid-level features are poor in semantic information, such as the relationship and location of objects, while regional features lack fine-grained information about images. To address this problem, this paper proposes a fusion-features-based image-captioning model, which includes the fusion feature encoder and LSTM decoder. The fusion-feature encoder is divided into a grid-level feature encoder and a region-level feature encoder. The grid-level feature encoder is a convolutional neural network embedded with squeeze-and-excitation operations so that the model can focus on features that are highly correlated with the caption. The region-level encoder employs node-embedding matrices to enable models to understand different node types and gain richer semantics. Then the features are weighted together by an attention mechanism to guide the decoder LSTM to generate an image caption. Our model was trained and tested on the MS COCO2014 dataset, achieving a Bleu-4 score of 0.399 and a CIDEr score of 1.311. The experimental results indicate that the model can describe the image in detail.
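The squeeze-and-excitation operation embedded in the grid-level encoder re-weights channels using globally pooled context. A standard SE block in PyTorch looks roughly like the sketch below; the reduction ratio and tensor sizes are illustrative, not taken from the paper.

    import torch
    import torch.nn as nn

    class SqueezeExcitation(nn.Module):
        """Channel re-weighting as used in squeeze-and-excitation blocks."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global context per channel
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),                                # excitation: per-channel gates in (0, 1)
            )

        def forward(self, x):
            b, c, _, _ = x.shape
            weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
            return x * weights                               # emphasise the most informative channels

    print(SqueezeExcitation(128)(torch.randn(2, 128, 14, 14)).shape)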
11

Bai, Xiaowei, Yonghong Zhang, and Jujie Wei. "LGFUNet: A Water Extraction Network in SAR Images Based on Multiscale Local Features with Global Information." Sensors 25, no. 12 (2025): 3814. https://doi.org/10.3390/s25123814.

Abstract:
To address existing issues in water extraction from SAR images based on deep learning, such as confusion between mountain shadows and water bodies and difficulty in extracting complex boundary details for continuous water bodies, the LGFUNet model is proposed. The LGFUNet model consists of three parts: the encoder–decoder, the DECASPP module, and the LGFF module. In the encoder–decoder, the Swin-Transformer module is used instead of convolution kernels for feature extraction, enhancing the learning of global information and improving the model’s ability to capture the spatial features of continuous water bodies. The DECASPP module is employed to extract and select multiscale features, focusing on complex water body boundary details. Additionally, a series of LGFF modules are inserted between the encoder and decoder to reduce the semantic gap between the encoder and decoder feature maps and the spatial information loss caused by the encoder’s downsampling process, improving the model’s ability to learn detailed information. Sentinel-1 SAR data from the Qinghai–Tibet Plateau region are selected, and the water extraction performance of the proposed LGFUNet model is compared with that of existing methods such as U-Net, Swin-UNet, and SCUNet++. The results show that the LGFUNet model achieves the best performance.
12

Ma, Shangchen, and Chunlin Song. "Semi-Supervised Drivable Road Segmentation with Expanded Feature Cross-Consistency." Applied Sciences 13, no. 21 (2023): 12036. http://dx.doi.org/10.3390/app132112036.

Abstract:
Drivable road segmentation aims to sense the surrounding environment to keep vehicles within safe road boundaries, which is fundamental in Advanced Driver-Assistance Systems (ADASs). Existing deep learning-based supervised methods are able to achieve good performance in this field with large amounts of human-labeled training data. However, the process of collecting sufficient fine human-labeled data is extremely time-consuming and expensive. To fill this gap, in this paper, we innovatively propose a general yet effective semi-supervised method for drivable road segmentation with lower labeled-data dependency, high accuracy, and high real-time performance. Specifically, a main encoder and a main decoder are trained in the supervised mode with labeled data, generating pseudo labels for the unsupervised training. Then, we innovatively set up both auxiliary encoders and auxiliary decoders in our model that yield feature representations and predictions based on the unlabeled data subjected to different elaborated perturbations. Both auxiliary encoders and decoders can leverage information in unlabeled data by enforcing consistency between predictions of the main modules and those perturbed versions from auxiliary modules. Experimental results on two public datasets (Cityscapes and CamVid) verify that our proposed algorithm can almost reach the same performance, at high FPS, as a fully supervised method trained with 100% labeled data while utilizing only 40% labeled data in the field of drivable road segmentation. In addition, our semi-supervised algorithm has a good potential to be generalized to all models with an encoder–decoder structure.
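The cross-consistency idea, pseudo labels from the main decoder supervising auxiliary decoders that see perturbed features, can be written as a simple unsupervised loss term. A minimal sketch, assuming soft pseudo labels and a mean-squared-error consistency measure (the paper may use a different distance):

    import torch
    import torch.nn.functional as F

    def cross_consistency_loss(main_logits, aux_logits_list):
        """Consistency between the main decoder and perturbed auxiliary decoders.

        main_logits:     (B, K, H, W) predictions on unlabeled images (pseudo targets).
        aux_logits_list: list of (B, K, H, W) predictions from auxiliary decoders that
                         received perturbed feature representations.
        """
        with torch.no_grad():
            target = torch.softmax(main_logits, dim=1)  # pseudo label, gradients blocked
        loss = sum(F.mse_loss(torch.softmax(a, dim=1), target) for a in aux_logits_list)
        return loss / max(len(aux_logits_list), 1)

    main = torch.randn(2, 2, 64, 64)  # two classes: drivable / not drivable
    aux = [main + 0.1 * torch.randn_like(main) for _ in range(3)]
    print(cross_consistency_loss(main, aux))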
13

Zhao, Rui, and Shihong Du. "An Encoder–Decoder with a Residual Network for Fusing Hyperspectral and Panchromatic Remote Sensing Images." Remote Sensing 14, no. 9 (2022): 1981. http://dx.doi.org/10.3390/rs14091981.

Abstract:
For many urban studies it is necessary to obtain remote sensing images with high hyperspectral and spatial resolution by fusing the hyperspectral and panchromatic remote sensing images. In this article, we propose a deep learning model of an encoder–decoder with a residual network (EDRN) for remote sensing image fusion. First, we combined the hyperspectral and panchromatic remote sensing images to circumvent the independence of the hyperspectral and panchromatic image features. Second, we established an encoder–decoder network for extracting representative encoded and decoded deep features. Finally, we established residual networks between the encoder network and the decoder network to enhance the extracted deep features. We evaluated the proposed method on six groups of real-world hyperspectral and panchromatic image datasets, and the experimental results confirmed the superior performance of the proposed method versus six other methods.
14

Jiang, DingLin, Xinwei Luo, and Qifan Shen. "Frequency line detection in spectrograms using a deep neural network with attention." Journal of the Acoustical Society of America 156, no. 5 (2024): 3204–16. http://dx.doi.org/10.1121/10.0034360.

Abstract:
In this paper, a frequency line detection network (FLDNet) is proposed to effectively detect multiple weak frequency lines and time-varying frequency lines in underwater acoustic signals under low signal-to-noise ratios (SNRs). FLDNet adopts an encoder-decoder architecture as the basic framework, where the encoder is designed to obtain multilevel features of the frequency lines, and the decoder is responsible for reconstructing the frequency lines. FLDNet includes attention-based feature fusion modules that combine deep semantic features with shallow features learned by the encoder to reduce noise in the decoder's deep feature representation and improve reconstruction accuracy. In addition, a composite loss function was constructed by using the continuity of frequency lines, which improved the detection performance of frequency lines. After training through simulated signal sets, FLDNet can effectively detect frequency lines in spectrograms of simulated and measured signals. The experimental results indicate that FLDNet is superior to other state-of-the-art methods, even at SNRs as low as −28 dB.
15

Shi, Hongwei, Shiqi Wu, Minghao Ye, and Changda Ma. "A speech separation model improved based on Conv-TasNet network." Journal of Physics: Conference Series 2858, no. 1 (2024): 012033. http://dx.doi.org/10.1088/1742-6596/2858/1/012033.

Abstract:
In the field of single-channel speech separation, the extraction and separation of features from mixed audio have always been the focus and difficulty of research. Currently, mainstream methods mainly suffer from poor generalization ability and issues such as inadequate feature extraction, which leads to the models’ inferior separation capability. This paper proposes an improved DConv-TasNet network model, focusing on the optimization of the encoder/decoder modules and separation modules and utilizing deep dilated encoders/decoders to extract features from mixed speech signals. It enhances feature extraction capability and generalization compared to conventional encoders/decoders. In terms of the separation module, improvements were made to the convolutional blocks within the module by enhancing feature extraction in the channel dimension, leading to improved performance of the separation network. Validation of the model’s performance was conducted using the WSJ0-Mix2 dataset, demonstrating superior performance compared to the Conv-TasNet network.
16

Lan, Meng, Jing Zhang, Fengxiang He, and Lefei Zhang. "Siamese Network with Interactive Transformer for Video Object Segmentation." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 2 (2022): 1228–36. http://dx.doi.org/10.1609/aaai.v36i2.20009.

Abstract:
Semi-supervised video object segmentation (VOS) refers to segmenting the target object in remaining frames given its annotation in the first frame, which has been actively studied in recent years. The key challenge lies in finding effective ways to exploit the spatio-temporal context of past frames to help learn discriminative target representation of current frame. In this paper, we propose a novel Siamese network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames. Technically, we use the transformer encoder and decoder to handle the past frames and current frame separately, i.e., the encoder encodes robust spatio-temporal context of target object from the past frames, while the decoder takes the feature embedding of current frame as the query to retrieve the target from the encoder output. To further enhance the target representation, a feature interaction module (FIM) is devised to promote the information flow between the encoder and decoder. Moreover, we employ the Siamese architecture to extract backbone features of both past and current frames, which enables feature reuse and is more efficient than existing methods. Experimental results on three challenging benchmarks validate the superiority of SITVOS over state-of-the-art methods. Code is available at https://github.com/LANMNG/SITVOS.
17

Sharma, Neha, Sheifali Gupta, Mana Saleh Al Reshan, Adel Sulaiman, Hani Alshahrani, and Asadullah Shaikh. "EfficientNetB0 cum FPN Based Semantic Segmentation of Gastrointestinal Tract Organs in MRI Scans." Diagnostics 13, no. 14 (2023): 2399. http://dx.doi.org/10.3390/diagnostics13142399.

Abstract:
The segmentation of gastrointestinal (GI) organs is crucial in radiation therapy for treating GI cancer. It allows for developing a targeted radiation therapy plan while minimizing radiation exposure to healthy tissue, improving treatment success, and decreasing side effects. Medical diagnostics in GI tract organ segmentation is essential for accurate disease detection, precise differential diagnosis, optimal treatment planning, and efficient disease monitoring. This research presents a hybrid encoder–decoder-based model for segmenting healthy organs in the GI tract in biomedical images of cancer patients, which might help radiation oncologists treat cancer more quickly. Here, EfficientNet B0 is used as a bottom-up encoder architecture for downsampling to capture contextual information by extracting meaningful and discriminative features from input images. The performance of the EfficientNet B0 encoder is compared with that of three encoders: ResNet 50, MobileNet V2, and Timm Gernet. The Feature Pyramid Network (FPN) is a top-down decoder architecture used for upsampling to recover spatial information. The performance of the FPN decoder was compared with that of three decoders: PAN, Linknet, and MAnet. This paper proposes a segmentation model named as the Feature Pyramid Network (FPN), with EfficientNet B0 as the encoder. Furthermore, the proposed hybrid model is analyzed using Adam, Adadelta, SGD, and RMSprop optimizers. Four performance criteria are used to assess the models: the Jaccard and Dice coefficients, model loss, and processing time. The proposed model can achieve Dice coefficient and Jaccard index values of 0.8975 and 0.8832, respectively. The proposed method can assist radiation oncologists in precisely targeting areas hosting cancer cells in the gastrointestinal tract, allowing for more efficient and timely cancer treatment.
18

Wang, Guixian, Dandan Huang, ZhenYe Geng, Zhi Liu, and Jin Duan. "A Novel Encoder-Decoder Structure-based Transformer for Fine-Resolution Remote Sensing Images." Journal of Physics: Conference Series 2517, no. 1 (2023): 012017. http://dx.doi.org/10.1088/1742-6596/2517/1/012017.

Abstract:
Fully convolutional networks (FCN) based on an encoder-decoder structure have become a standard network in the semantic segmentation domain. Encoder-decoder architecture is an effective means to get finer-grained performance. Encoders constantly extract multilevel features, and then use decoders to gradually introduce low-level features into high-level features. Context information is critical for accurate segmentation, which is the main direction of semantic segmentation at present. So many efforts have been made to make better use of this kind of information, including codec structure, dilated (atrous) convolution, and attention mechanisms. However, most of these schemes are based on ResNet or other variants of the convolutional network FCN, which makes them unable to escape the limited local receptive field of convolution itself. In this work, we introduce the Pyramid Vision Transformer (PVT) to replace the traditional fully convolutional network architecture, and design a novel encoder-decoder architecture to more effectively utilize the context information.
19

He, Haiqing, Yan Wei, Fuyang Zhou, and Hai Zhang. "A Deep Neural Network for Road Extraction with the Capability to Remove Foreign Objects with Similar Spectra." International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLVIII-1-2024 (May 10, 2024): 193–99. http://dx.doi.org/10.5194/isprs-archives-xlviii-1-2024-193-2024.

Abstract:
Existing road extraction methods based on deep learning often struggle with distinguishing ground objects that share similar spectral information, such as roads and buildings. Consequently, this study proposes a dual encoder-decoder deep neural network to address road extraction in complex backgrounds. In the feature extraction stage, the first encoder-decoder is designed to extract road features, and the second encoder-decoder is utilized to extract building features. During the feature fusion stage, road features and building features are integrated using a subtraction method. The resultant road features, constrained by building features, enhance the preservation of accurate road feature information. Within the feature fusion stage, road feature maps and building feature maps designated for fusion are input into the convolutional block attention module. This step aims to amplify the features of different channels and extract key information from diverse spatial positions. Subsequently, feature fusion is executed using the element-by-element subtraction method. The outcome is road features constrained by building features, thus preserving more precise road feature information. Experimental results demonstrate that the model successfully learns both road and building features concurrently. It effectively distinguishes between easily confused roads and buildings with similar spectral information, ultimately enhancing the accuracy of road extraction.
20

Li, Hao, Sha Cao, Siyu Jiang, and Tongyang Pan. "Residual Dual Encoder Network using Distance Metric Learning for Intelligent Fault Recognition with Unknown Classes." Journal of Physics: Conference Series 2999, no. 1 (2025): 012004. https://doi.org/10.1088/1742-6596/2999/1/012004.

Abstract:
The paper proposes a residual dual encoder network using distance metric learning for intelligent fault recognition with unknown classes. The network is made up of two encoders and one decoder. In both the encoders and the decoder, residual blocks are used as the main structure for deep feature extraction. Besides, distance metric learning with triplet loss is used to train the residual dual encoder network to obtain features which could represent different health conditions. Benefiting from the metric learning principle, the proposed model could recognize the potential faults in mechanical systems even with a few additional unknown fault classes. The superiority of the residual dual encoder network is demonstrated by comparing it with several intelligent detection methods on three different experimental datasets. Results indicate that the proposed residual dual encoder network could effectively recognize the unknown faults with an average classification accuracy of 98.3%, 99.9% and 94.4% and a recognition rate of 93.8%, 94.1% and 94.8% in three cases.
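The distance-metric-learning part of such an approach is typically driven by a triplet loss on the encoder embeddings. A minimal sketch using PyTorch's built-in TripletMarginLoss; the encoder that produces the embeddings and the rule for rejecting unknown classes are omitted, and the sizes below are illustrative.

    import torch
    import torch.nn as nn

    # Embeddings of the same health condition are pulled together, embeddings of
    # different conditions are pushed apart by at least the margin.
    triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

    embed_dim = 128
    anchor = torch.randn(16, embed_dim)    # encoder output for samples of one condition
    positive = torch.randn(16, embed_dim)  # same condition, different samples
    negative = torch.randn(16, embed_dim)  # a different condition

    print(triplet_loss(anchor, positive, negative))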
21

Li, Weisheng, Minghao Xiang, and Xuesong Liang. "A Dense Encoder–Decoder Network with Feedback Connections for Pan-Sharpening." Remote Sensing 13, no. 22 (2021): 4505. http://dx.doi.org/10.3390/rs13224505.

Abstract:
To meet the need for multispectral images having high spatial resolution in practical applications, we propose a dense encoder–decoder network with feedback connections for pan-sharpening. Our network consists of four parts. The first part consists of two identical subnetworks, one each to extract features from PAN and MS images, respectively. The second part is an efficient feature-extraction block. We hope that the network can focus on features at different scales, so we propose innovative multiscale feature-extraction blocks that fully extract effective features from networks of various depths and widths by using three multiscale feature-extraction blocks and two long-jump connections. The third part is the feature fusion and recovery network. We are inspired by the work on U-Net network improvements to propose a brand new encoder network structure with dense connections that improves network performance through effective connections to encoders and decoders at different scales. The fourth part is a continuous feedback connection operation with overfeedback to refine shallow features, which enables the network to obtain better reconstruction capabilities earlier. To demonstrate the effectiveness of our method, we performed several experiments. Experiments on various satellite datasets show that the proposed method outperforms existing methods. Our results show significant improvements over those from other models in terms of the multiple-target index values used to measure the spectral quality and spatial details of the generated images.
22

Wang, Yiyu, Jungang Xu, and Yingfei Sun. "End-to-End Transformer Based Model for Image Captioning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 3 (2022): 2585–94. http://dx.doi.org/10.1609/aaai.v36i3.20160.

Abstract:
CNN-LSTM based architectures have played an important role in image captioning, but limited by the training efficiency and expression ability, researchers began to explore the CNN-Transformer based models and achieved great success. Meanwhile, almost all recent works adopt Faster R-CNN as the backbone encoder to extract region-level features from given images. However, Faster R-CNN needs a pre-training on an additional dataset, which divides the image captioning task into two stages and limits its potential applications. In this paper, we build a pure Transformer-based model, which integrates image captioning into one stage and realizes end-to-end training. Firstly, we adopt SwinTransformer to replace Faster R-CNN as the backbone encoder to extract grid-level features from given images; Then, referring to Transformer, we build a refining encoder and a decoder. The refining encoder refines the grid features by capturing the intra-relationship between them, and the decoder decodes the refined features into captions word by word. Furthermore, in order to increase the interaction between multi-modal (vision and language) features to enhance the modeling capability, we calculate the mean pooling of grid features as the global feature, then introduce it into refining encoder to refine with grid features together, and add a pre-fusion process of refined global feature and generated words in decoder. To validate the effectiveness of our proposed model, we conduct experiments on MSCOCO dataset. The experimental results compared to existing published works demonstrate that our model achieves new state-of-the-art performances of 138.2% (single model) and 141.0% (ensemble of 4 models) CIDEr scores on 'Karpathy' offline test split and 136.0% (c5) and 138.3% (c40) CIDEr scores on the official online test server. Trained models and source code will be released.
23

Li, Zhong, Hongyi Wang, Qi Han, et al. "Convolutional Neural Network with Multiscale Fusion and Attention Mechanism for Skin Diseases Assisted Diagnosis." Computational Intelligence and Neuroscience 2022 (June 14, 2022): 1–10. http://dx.doi.org/10.1155/2022/8390997.

Abstract:
Melanoma segmentation based on a convolutional neural network (CNN) has recently attracted extensive attention. However, the features captured by a CNN are always local, which results in discontinuous feature extraction. To solve this problem, we propose a novel multiscale feature fusion network (MSFA-Net). MSFA-Net can extract feature information at different scales through a multiscale feature fusion structure (MSF) in the network and then calibrate and restore the extracted information to achieve the purpose of melanoma segmentation. Specifically, based on the popular encoder-decoder structure, we designed three functional modules, namely MSF, asymmetric skip connection structure (ASCS), and calibration decoder (Decoder). In addition, a weighted cross-entropy loss and a two-stage learning rate optimization strategy are designed to train the network more effectively. Compared qualitatively and quantitatively with representative neural network methods with an encoder-decoder structure, such as U-Net, the proposed method achieves advanced performance.
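The weighted cross-entropy loss mentioned above simply assigns a larger weight to the under-represented lesion class. A short PyTorch sketch with illustrative weights; the paper's actual weighting scheme is not specified here.

    import torch
    import torch.nn as nn

    # A larger weight on the (rare) lesion class counteracts the heavy class
    # imbalance between lesion and background pixels.
    class_weights = torch.tensor([0.3, 0.7])       # [background, lesion]; illustrative values
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    logits = torch.randn(4, 2, 128, 128)           # (B, num_classes, H, W) network output
    target = torch.randint(0, 2, (4, 128, 128))    # ground-truth mask, class index per pixel
    print(criterion(logits, target))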
24

Javaloy, Adrián, and Ginés García-Mateos. "Text Normalization Using Encoder–Decoder Networks Based on the Causal Feature Extractor." Applied Sciences 10, no. 13 (2020): 4551. http://dx.doi.org/10.3390/app10134551.

Abstract:
The encoder–decoder architecture is a well-established, effective and widely used approach in many tasks of natural language processing (NLP), among other domains. It consists of two closely-collaborating components: An encoder that transforms the input into an intermediate form, and a decoder producing the output. This paper proposes a new method for the encoder, named Causal Feature Extractor (CFE), based on three main ideas: Causal convolutions, dilatations and bidirectionality. We apply this method to text normalization, which is a ubiquitous problem that appears as the first step of many text-to-speech (TTS) systems. Given a text with symbols, the problem consists in writing the text exactly as it should be read by the TTS system. We make use of an attention-based encoder–decoder architecture using a fine-grained character-level approach rather than the usual word-level one. The proposed CFE is compared to other common encoders, such as convolutional neural networks (CNN) and long-short term memories (LSTM). Experimental results show the feasibility of CFE, achieving better results in terms of accuracy, number of parameters, convergence time, and use of an attention mechanism based on attention matrices. The obtained accuracy ranges from 83.5% to 96.8% correctly normalized sentences, depending on the dataset. Moreover, the proposed method is generic and can be applied to different types of input such as text, audio and images.
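Causal convolutions with dilations, two of the three ingredients of the Causal Feature Extractor, can be implemented by left-padding an ordinary 1-D convolution. A minimal sketch; bidirectionality (a second pass over the reversed sequence) is omitted, and the layer sizes are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalDilatedConv1d(nn.Module):
        """1-D convolution whose output at step t only depends on inputs up to t."""
        def __init__(self, channels, kernel_size=3, dilation=1):
            super().__init__()
            self.left_pad = (kernel_size - 1) * dilation  # pad on the left only
            self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

        def forward(self, x):          # x: (B, C, T)
            return self.conv(F.pad(x, (self.left_pad, 0)))

    # stacking layers with growing dilation enlarges the one-sided receptive field
    encoder = nn.Sequential(*[CausalDilatedConv1d(64, dilation=2 ** i) for i in range(4)])
    print(encoder(torch.randn(2, 64, 100)).shape)  # torch.Size([2, 64, 100])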
25

Nguyen, Quoc Toan. "Defective sewing stitch semantic segmentation using DeeplabV3+ and EfficientNet." Inteligencia Artificial 25, no. 70 (2022): 64–76. http://dx.doi.org/10.4114/intartif.vol25iss70pp64-76.

Abstract:
Defective stitch inspection is an essential part of garment manufacturing quality assurance. Traditional mechanical defect detection systems are effective, but they are usually customized with handcrafted features that must be operated by a human. Deep learning approaches have recently demonstrated exceptional performance in a wide range of computer vision applications. The requirement for precise detail evaluation, combined with the small size of the patterns, undoubtedly increases the difficulty of identification. Therefore, image segmentation (semantic segmentation) was employed for this task. It is identified as a vital research topic in the field of computer vision, being indispensable in a wide range of real-world applications. Semantic segmentation is a method of labeling each pixel in an image. This is in direct contrast to classification, which assigns a single label to the entire image. And multiple objects of the same class are defined as a single entity. DeepLabV3+ architecture, with encoder-decoder architecture, is the proposed technique. EfficientNet models (B0-B2) were applied as encoders for experimental processes. The encoder is utilized to encode feature maps from the input image. The encoder's significant information is used by the decoder for upsampling and reconstruction of output. Finally, the best model is DeeplabV3+ with EfficientNetB1 which can classify segmented defective sewing stitches with superior performance (MeanIoU: 94.14%).
26

Chen, Yu, Ming Yin, Yu Li, and Qian Cai. "CSU-Net: A CNN-Transformer Parallel Network for Multimodal Brain Tumour Segmentation." Electronics 11, no. 14 (2022): 2226. http://dx.doi.org/10.3390/electronics11142226.

Abstract:
Medical image segmentation techniques are vital to medical image processing and analysis. Considering the significant clinical applications of brain tumour image segmentation, it represents a focal point of medical image segmentation research. Most of the work in recent times has been centred on Convolutional Neural Networks (CNN) and Transformers. However, CNN has some deficiencies in modelling long-distance information transfer and contextual processing information, while Transformer is relatively weak in acquiring local information. To overcome the above defects, we propose a novel segmentation network with an “encoder–decoder” architecture, namely CSU-Net. The encoder consists of two parallel feature extraction branches based on CNN and Transformer, respectively, in which the features of the same size are fused. The decoder has a dual Swin Transformer decoder block with two learnable parameters for feature upsampling. The features from multiple resolutions in the encoder and decoder are merged via skip connections. On the BraTS 2020, our model achieves 0.8927, 0.8857, and 0.8188 for the Whole Tumour (WT), Tumour Core (TC), and Enhancing Tumour (ET), respectively, in terms of Dice scores.
27

Zhai, Cong, Liejun Wang, and Jian Yuan. "New Fusion Network with Dual-Branch Encoder and Triple-Branch Decoder for Remote Sensing Image Change Detection." Applied Sciences 13, no. 10 (2023): 6167. http://dx.doi.org/10.3390/app13106167.

Abstract:
Deep learning plays a highly essential role in the domain of remote sensing change detection (CD) due to its high efficiency. From some existing methods, we can observe that the fusion of information at each scale is quite vital for the accuracy of the CD results, especially for the common problems of pseudo-change and the difficult detection of change edges in the CD task. With this in mind, we propose a New Fusion network with Dual-branch Encoder and Triple-branch Decoder (DETDNet) that follows a codec structure as a whole, where the encoder adopts a siamese Res2Net-50 structure to extract the local features of the bitemporal images. As for the decoder in previous works, they usually employed a single branch, and this approach only preserved the fusion features of the encoder’s bitemporal images. Distinguished from these approaches, we adopt the triple-branch architecture in the decoder for the first time. The triple-branch structure preserves not only the dual-branch features from the encoder in the left and right branches, respectively, to learn the effective and powerful individual features of each temporal image but also the fusion features from the encoder in the middle branch. The middle branch utilizes triple-branch aggregation (TA) to realize the feature interaction of the three branches in the decoder, which enhances the integrated features and provides abundant and supplementary bitemporal feature information to improve the CD performance. The triple-branch architecture of the decoder ensures that the respective features of the bitemporal images as well as their fused features are preserved, making the feature extraction more integrated. In addition, the three branches employ a multiscale feature extraction module (MFE) per layer to extract multiscale contextual information and enhance the feature representation capability of the CD. We conducted comparison experiments on the BCDD, LEVIR-CD, and SYSU-CD datasets, which were created in New Zealand, the USA, and Hong Kong, respectively. The data were preprocessed to contain 7434, 10,192, and 20,000 image pairs, respectively. The experimental results show that DETDNet achieves F1 scores of 92.7%, 90.99%, and 81.13%, respectively, which shows better results compared to some recent works, which means that the model is more robust. In addition, the lower FP and FN indicate lower error and misdetection rates. Moreover, from the analysis of the experimental results, compared with some existing methods, the problem of pseudo-changes and the difficulty of detecting small change areas is better solved.
28

Liu, Song, Haiwei Li, Feifei Wang, et al. "Unsupervised Transformer Boundary Autoencoder Network for Hyperspectral Image Change Detection." Remote Sensing 15, no. 7 (2023): 1868. http://dx.doi.org/10.3390/rs15071868.

Abstract:
In the field of remote sensing, change detection is an important monitoring technology. However, effectively extracting the change feature is still a challenge, especially with an unsupervised method. To solve this problem, we propose an unsupervised transformer boundary autoencoder network (UTBANet) in this paper. UTBANet consists of a transformer structure and spectral attention in the encoder part. In addition to reconstructing hyperspectral images, UTBANet also adds a decoder branch for reconstructing edge information. The designed encoder module is used to extract features. First, the transformer structure is used for extracting the global features. Then, spectral attention can find important feature maps and reduce feature redundancy. Furthermore, UTBANet reconstructs the hyperspectral image and boundary information simultaneously through two decoders, which can improve the ability of the encoder to extract edge features. Our experiments demonstrate that the proposed structure significantly improves the performance of change detection. Moreover, comparative experiments show that our method is superior to most existing unsupervised methods.
29

Fang, Han, Yupeng Qiu, Kejiang Chen, Jiyi Zhang, Weiming Zhang, and Ee-Chien Chang. "Flow-Based Robust Watermarking with Invertible Noise Layer for Black-Box Distortions." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 4 (2023): 5054–61. http://dx.doi.org/10.1609/aaai.v37i4.25633.

Abstract:
Deep learning-based digital watermarking frameworks have been widely studied recently. Most existing methods adopt an ``encoder-noise layer-decoder''-based architecture where the embedding and extraction processes are accomplished separately by the encoder and the decoder. However, one potential drawback of such a framework is that the encoder and the decoder may not be well coupled, resulting in the fact that the encoder may embed some redundant features into the host image thus influencing the invisibility and robustness of the whole algorithm. To address this limitation, this paper proposes a flow-based robust watermarking framework. The basic component of such framework is an invertible up-down-sampling neural block that can realize the embedding and extraction simultaneously. As a consequence, the encoded feature could keep high consistency with the feature that the decoder needed, which effectively avoids the embedding of redundant features. In addition, to ensure the robustness of black-box distortion, an invertible noise layer (INL) is designed to simulate the distortion and is served as a noise layer in the training stage. Benefiting from its reversibility, INL is also applied as a preprocessing before extraction to eliminate the distortion, which further improves the robustness of the algorithm. Extensive experiments demonstrate the superiority of the proposed framework in terms of visual quality and robustness. Compared with the state-of-the-art architecture, the visual quality (measured by PSNR) of the proposed framework improves by 2dB and the extraction accuracy after JPEG compression (QF=50) improves by more than 4%. Besides, the robustness against black-box distortions can be greatly achieved with more than 95% extraction accuracy.
30

Zhang, Yusha, and Xiongliang Xiao. "A Dynamic Community Detection Method for Complex Networks Based on Deep Self-Coding Network." Computational Intelligence and Neuroscience 2022 (July 31, 2022): 1–9. http://dx.doi.org/10.1155/2022/7084084.

Abstract:
Aiming at the problem of community detection in complex dynamic networks, a dynamic community detection method based on graph convolution neural network is proposed. An encoding-decoding mechanism is designed to reconstruct the feature information of each node in the graph. A stack of multiple graph convolutional layers is considered as an encoder that encodes the node feature information into the potential vector space, while the decoder employs a simple two-layer perceptron to reconstruct the initial node features from the encoded vector information. The encoding-decoding mechanism achieves a re-evaluation of the initial node features. Subsequently, an additional local feature reconstruction loss is added after the decoder to aid the goal of graph classification. Further, stochastic gradient descent is applied to solve the problem in the loss function. Finally, the proposed model is experimentally validated based on the Karate Club and Football datasets. The experimental results show that the proposed model improves the NMI metric by an average of 7.65% and effectively mitigates the node oversmoothing problem. The proposed model is proved to have good detection accuracy.
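The encoding-decoding mechanism described here, stacked graph convolutions as the encoder and a two-layer perceptron as the decoder, can be sketched compactly. The layer sizes, the dense adjacency handling, and the toy graph below are assumptions for illustration, not the paper's configuration.

    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        """One graph-convolution step: X' = ReLU(A_hat @ X @ W)."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, x, a_hat):
            return torch.relu(a_hat @ self.linear(x))

    class GraphAutoencoder(nn.Module):
        """Stacked GCN encoder + two-layer perceptron decoder over node features."""
        def __init__(self, feat_dim, hidden=64, latent=16):
            super().__init__()
            self.enc1 = GCNLayer(feat_dim, hidden)
            self.enc2 = GCNLayer(hidden, latent)
            self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                         nn.Linear(hidden, feat_dim))

        def forward(self, x, adj):
            # normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2
            a = adj + torch.eye(adj.size(0))
            d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
            a_hat = d_inv_sqrt @ a @ d_inv_sqrt
            z = self.enc2(self.enc1(x, a_hat), a_hat)   # latent node embeddings
            return self.decoder(z), z                   # feature reconstruction + embeddings

    x = torch.randn(34, 8)                              # e.g. Karate Club: 34 nodes, 8 features
    adj = (torch.rand(34, 34) > 0.8).float()
    adj = ((adj + adj.t()) > 0).float()                 # symmetric toy adjacency
    recon, z = GraphAutoencoder(8)(x, adj)
    print(recon.shape, z.shape)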
31

Lei, Zhi, Guixian Zhang, Lijuan Wu, Kui Zhang, and Rongjiao Liang. "A Multi-level Mesh Mutual Attention Model for Visual Question Answering." Data Science and Engineering 7, no. 4 (2022): 339–53. http://dx.doi.org/10.1007/s41019-022-00200-9.

Abstract:
Visual question answering is a complex multimodal task involving images and text, with broad application prospects in human–computer interaction and medical assistance. Therefore, how to deal with the feature interaction and multimodal feature fusion between the critical regions in the image and the keywords in the question is an important issue. To this end, we propose a neural network based on the encoder–decoder structure of the transformer architecture. Specifically, in the encoder, we use multi-head self-attention to mine word–word connections within question features and stack multiple layers of attention to obtain multi-level question features. We propose a mutual attention module to perform information exchange between modalities for better question features and image features representation on the decoder side. Besides, we connect the encoder and decoder in a meshed manner, perform mutual attention operations with multi-level question features, and aggregate information in an adaptive way. We propose a multi-scale fusion module in the fusion stage, which utilizes feature information at different scales to complete modal fusion. We test and validate the model effectiveness on VQA v1 and VQA v2 datasets. Our model achieves better results than state-of-the-art methods.
32

Liu, C., Y. Zhang, and Y. Ou. "COMPONENT SUBSTITUTION NETWORK FOR PAN-SHARPENING VIA SEMI-SUPERVISED LEARNING." ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences V-3-2020 (August 3, 2020): 255–62. http://dx.doi.org/10.5194/isprs-annals-v-3-2020-255-2020.

Abstract:
Pan-sharpening refers to the technology which fuses a low resolution multispectral image (MS) and a high resolution panchromatic (PAN) image into a high resolution multispectral image (HRMS). In this paper, we propose a Component Substitution Network (CSN) for pan-sharpening. By adding a feature exchange module (FEM) to the widely used encoder-decoder framework, we design a network following the general procedure of the traditional component substitution (CS) approaches. The encoder of the network decomposes the input image into spectral feature and structure feature. The FEM regroups the extracted features and combines the spectral feature of the MS image with the structure feature of the PAN image. The decoder is an inverse process of the encoder and reconstructs the image. The MS and the PAN image share the same encoder and decoder, which makes the network robust to spectral and spatial variations. To reduce the burden of data preparation and improve the performance on full-resolution data, the network is trained through semi-supervised learning with image patches at both reduced resolution and full resolution. Experiments performed on GeoEye-1 data verify that the proposed network achieves state-of-the-art performance, and the semi-supervised learning strategy further improves the performance on full-resolution data.
33

Khanh, Trinh Le Ba, Duy-Phuong Dao, Ngoc-Huynh Ho, et al. "Enhancing U-Net with Spatial-Channel Attention Gate for Abnormal Tissue Segmentation in Medical Imaging." Applied Sciences 10, no. 17 (2020): 5729. http://dx.doi.org/10.3390/app10175729.

Abstract:
In recent years, deep learning has dominated medical image segmentation. Encoder-decoder architectures, such as U-Net, can be used in state-of-the-art models with powerful designs that are achieved by implementing skip connections that propagate local information from an encoder path to a decoder path to retrieve detailed spatial information lost by pooling operations. Despite their effectiveness for segmentation, these naïve skip connections still have some disadvantages. First, multi-scale skip connections tend to use unnecessary information and computational resources, where the same low-level encoder features are repeatedly used at multiple scales. Second, the contextual information of the low-level encoder feature is insufficient, leading to poor performance for pixel-wise recognition when concatenating with the corresponding high-level decoder feature. In this study, we propose a novel spatial-channel attention gate that addresses the limitations of plain skip connections. This can be easily integrated into an encoder-decoder network to effectively improve the performance of the image segmentation task. Comprehensive results reveal that our spatial-channel attention gate remarkably enhances the segmentation capability of the U-Net architecture with a minimal computational overhead added. The experimental results show that our proposed method outperforms the conventional deep networks in terms of Dice score, achieving 71.72%.
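For reference, a plain additive attention gate on a skip connection (in the spirit of Attention U-Net) looks like the sketch below; the spatial-channel gate proposed in the paper adds a channel branch on top of this basic form, which is not shown here.

    import torch
    import torch.nn as nn

    class AttentionGate(nn.Module):
        """Additive attention gate applied to an encoder skip connection."""
        def __init__(self, enc_ch, dec_ch, inter_ch):
            super().__init__()
            self.w_enc = nn.Conv2d(enc_ch, inter_ch, kernel_size=1)
            self.w_dec = nn.Conv2d(dec_ch, inter_ch, kernel_size=1)
            self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, kernel_size=1), nn.Sigmoid())

        def forward(self, enc_feat, dec_feat):
            # enc_feat and dec_feat are assumed to share the same spatial size here
            attn = self.psi(torch.relu(self.w_enc(enc_feat) + self.w_dec(dec_feat)))
            return enc_feat * attn   # suppress irrelevant positions before concatenation

    enc = torch.randn(1, 64, 56, 56)    # low-level encoder feature (skip connection)
    dec = torch.randn(1, 128, 56, 56)   # upsampled high-level decoder feature
    print(AttentionGate(64, 128, 32)(enc, dec).shape)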
APA, Harvard, Vancouver, ISO, and other styles
34

Zheng, Chuanpan, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. "GMAN: A Graph Multi-Attention Network for Traffic Prediction." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (2020): 1234–41. http://dx.doi.org/10.1609/aaai.v34i01.5477.

Full text
Abstract:
Long-term traffic prediction is highly challenging due to the complexity of traffic systems and the constantly changing nature of many impacting factors. In this paper, we focus on the spatio-temporal factors, and propose a graph multi-attention network (GMAN) to predict traffic conditions for time steps ahead at different locations on a road network graph. GMAN adopts an encoder-decoder architecture, where both the encoder and the decoder consist of multiple spatio-temporal attention blocks to model the impact of the spatio-temporal factors on traffic conditions. The encoder encodes the input traffic features and the decoder predicts the output sequence. Between the encoder and the decoder, a transform attention layer is applied to convert the encoded traffic features into the sequence representations of future time steps as the input of the decoder. The transform attention mechanism models the direct relationships between historical and future time steps, which helps to alleviate the error propagation problem among prediction time steps. Experimental results on two real-world traffic prediction tasks (i.e., traffic volume prediction and traffic speed prediction) demonstrate the superiority of GMAN. In particular, in the 1 hour ahead prediction, GMAN outperforms state-of-the-art methods by up to 4% improvement in the MAE measure. The source code is available at https://github.com/zhengchuanpan/GMAN.
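A simplified sketch of the transform-attention idea (not the released GMAN code): embeddings of the future time steps act as queries over the encoded historical features, converting a history of length P into a forecast representation of length Q before decoding.

```python
import torch
import torch.nn as nn

class TransformAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hist_feats, future_emb):
        # hist_feats: (B*N, P, dim) encoded historical features per node
        # future_emb: (B*N, Q, dim) spatio-temporal embeddings of the future steps
        out, _ = self.attn(future_emb, hist_feats, hist_feats)
        return out                                 # (B*N, Q, dim), fed to the decoder

hist = torch.randn(32, 12, 64)                     # 12 historical steps
future = torch.randn(32, 12, 64)                   # 12 future steps to predict
print(TransformAttention()(hist, future).shape)    # torch.Size([32, 12, 64])
```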
APA, Harvard, Vancouver, ISO, and other styles
35

Li, Rumei, Liyan Zhang, Zun Wang, and Xiaojuan Li. "FCSwinU: Fourier Convolutions and Swin Transformer UNet for Hyperspectral and Multispectral Image Fusion." Sensors 24, no. 21 (2024): 7023. http://dx.doi.org/10.3390/s24217023.

Full text
Abstract:
The fusion of low-resolution hyperspectral images (LR-HSI) with high-resolution multispectral images (HR-MSI) provides a cost-effective approach to obtaining high-resolution hyperspectral images (HR-HSI). Existing methods primarily based on convolutional neural networks (CNNs) struggle to capture global features and do not adequately address the significant scale and spectral resolution differences between LR-HSI and HR-MSI. To tackle these challenges, our novel FCSwinU network leverages the spectral fast Fourier convolution (SFFC) module for spectral feature extraction and utilizes the Swin Transformer’s self-attention mechanism for multi-scale global feature fusion. FCSwinU employs a UNet-like encoder–decoder framework to effectively merge spatiospectral features. The encoder integrates the Swin Transformer feature abstraction module (SwinTFAM) to encode pixel correlations and perform multi-scale transformations, facilitating the adaptive fusion of hyperspectral and multispectral data. The decoder then employs the Swin Transformer feature reconstruction module (SwinTFRM) to reconstruct the fused features, restoring the original image dimensions and ensuring the precise recovery of spatial and spectral details. Experimental results from three benchmark datasets and a real-world dataset robustly validate the superior performance of our method in both visual representation and quantitative assessment compared to existing fusion methods.
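A minimal sketch, under assumed layout, of a Fourier-convolution block in the spirit of the SFFC module described above: the feature map is transformed with a 2-D real FFT, a 1x1 convolution mixes channels in the frequency domain (real and imaginary parts stacked), and an inverse FFT returns to the spatial domain, giving every output position a global receptive field. This is an illustrative reconstruction, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FourierConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution over stacked real/imag channels in the frequency domain
        self.freq_conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")        # complex, shape (B, C, H, W//2+1)
        f = torch.cat([freq.real, freq.imag], dim=1)   # stack real and imaginary parts
        f = self.freq_conv(f)
        real, imag = f.chunk(2, dim=1)
        freq = torch.complex(real, imag)
        return torch.fft.irfft2(freq, s=(h, w), norm="ortho")

y = FourierConv(16)(torch.randn(2, 16, 32, 32))        # same spatial size, global mixing
```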
APA, Harvard, Vancouver, ISO, and other styles
36

Yang, Yong, Wenzhi Xu, Shuying Huang, and Weiguo Wan. "Low-Light Image Enhancement Network Based on Multi-Scale Feature Complementation." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 3 (2023): 3214–21. http://dx.doi.org/10.1609/aaai.v37i3.25427.

Full text
Abstract:
Images captured in low-light environments have problems of insufficient brightness and low contrast, which will affect subsequent image processing tasks. Although most current enhancement methods can obtain high-contrast images, they still suffer from noise amplification and color distortion. To address these issues, this paper proposes a low-light image enhancement network based on multi-scale feature complementation (LIEN-MFC), which is a U-shaped encoder-decoder network supervised by multiple images of different scales. In the encoder, four feature extraction branches are constructed to extract features of low-light images at different scales. In the decoder, to ensure the integrity of the learned features at each scale, a feature supplementary fusion module (FSFM) is proposed to complement and integrate features from different branches of the encoder and decoder. In addition, a feature restoration module (FRM) and an image reconstruction module (IRM) are built in each branch to reconstruct the restored features and output enhanced images. To better train the network, a joint loss function is defined, in which a discriminative loss term is designed to ensure that the enhanced results better meet the visual properties of the human eye. Extensive experiments on benchmark datasets show that the proposed method outperforms some state-of-the-art methods subjectively and objectively.
APA, Harvard, Vancouver, ISO, and other styles
37

Li, Jianyong, Ge Gao, Lei Yang, Yanhong Liu, and Hongnian Yu. "DEF-Net: A Dual-Encoder Fusion Network for Fundus Retinal Vessel Segmentation." Electronics 11, no. 22 (2022): 3810. http://dx.doi.org/10.3390/electronics11223810.

Full text
Abstract:
The deterioration of numerous eye diseases is highly related to the fundus retinal structures, so automatic retinal vessel segmentation serves as an essential stage for efficient detection of eye-related lesions in clinical practice. Segmentation methods based on encoder-decoder structures exhibit great potential in retinal vessel segmentation tasks, but have limited feature representation ability. In addition, they do not effectively consider information at multiple scales when performing feature fusion, resulting in low fusion efficiency. In this paper, a new model, named DEF-Net, is designed to segment retinal vessels automatically, which consists of a dual-encoder unit and a decoder unit. Combining a recurrent network and a convolutional network, a dual-encoder unit is proposed, which builds a convolutional network branch to extract detailed features and a recurrent network branch to accumulate contextual features, obtaining richer features than a single convolutional network structure. Furthermore, to exploit useful information at multiple scales, a multi-scale fusion block is designed to improve feature fusion efficiency. Extensive experiments have been undertaken to demonstrate the segmentation performance of our proposed DEF-Net.
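A compact sketch of the dual-encoder idea with illustrative names (not the DEF-Net code): one branch is a plain convolution block for detail features, the other a recurrent convolution block that re-applies the same convolution to accumulate context, and the two outputs are fused by concatenation.

```python
import torch
import torch.nn as nn

class RecurrentConv(nn.Module):
    def __init__(self, channels, steps=2):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU())
        self.steps = steps

    def forward(self, x):
        out = self.conv(x)
        for _ in range(self.steps):
            out = self.conv(x + out)        # feed the input back in at every step
        return out

class DualEncoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, 1)
        self.cnn_branch = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU())
        self.rnn_branch = RecurrentConv(out_ch)
        self.fuse = nn.Conv2d(out_ch * 2, out_ch, 1)

    def forward(self, x):
        x = self.proj(x)
        return self.fuse(torch.cat([self.cnn_branch(x), self.rnn_branch(x)], dim=1))

print(DualEncoderBlock(3, 32)(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 32, 64, 64])
```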
APA, Harvard, Vancouver, ISO, and other styles
38

Jiang, Ligang, Wen Li, Zhiming Xiong, et al. "Retinal Vessel Segmentation Based on Self-Attention Feature Selection." Electronics 13, no. 17 (2024): 3514. http://dx.doi.org/10.3390/electronics13173514.

Full text
Abstract:
Many major diseases can cause changes in the morphology of blood vessels, and the segmentation of retinal blood vessels is of great significance for preventing these diseases. Obtaining complete, continuous, and high-resolution segmentation results is very challenging due to the diverse structures of retinal tissues, the complex spatial structures of blood vessels, and the presence of many small vessels. In recent years, deep learning networks like UNet have been widely used in medical image processing. However, the continuous down-sampling operations in UNet can result in the loss of a significant amount of information. Although skip connections between the encoder and decoder can help address this issue, the encoder features still contain a large amount of irrelevant information that cannot be efficiently utilized by the decoder. To suppress this irrelevant information, this paper proposes a feature selection module between the decoder and encoder that utilizes the self-attention mechanism of transformers to accurately and efficiently select the relevant encoder features for the decoder. Additionally, a lightweight Residual Global Context module is proposed to obtain dense global contextual information and establish dependencies between pixels, which can effectively preserve vascular details and segment small vessels accurately and continuously. Experimental results on three publicly available color fundus image datasets (DRIVE, CHASE, and STARE) demonstrate that the proposed algorithm outperforms existing methods in terms of both performance metrics and visual quality.
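A rough sketch, with assumed shapes and names, of attention-based feature selection on a skip connection: decoder features, flattened into tokens, act as queries that pick out the relevant encoder tokens, so only useful low-level information is passed across the skip connection.

```python
import torch
import torch.nn as nn

class SkipFeatureSelection(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, enc_feat, dec_feat):
        b, c, h, w = enc_feat.shape
        enc_tokens = enc_feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        dec_tokens = dec_feat.flatten(2).transpose(1, 2)
        # decoder tokens query the encoder tokens, selecting relevant features
        selected, _ = self.attn(dec_tokens, enc_tokens, enc_tokens)
        return selected.transpose(1, 2).reshape(b, c, h, w)

enc = torch.randn(1, 64, 48, 48)
dec = torch.randn(1, 64, 48, 48)
print(SkipFeatureSelection()(enc, dec).shape)              # torch.Size([1, 64, 48, 48])
```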
APA, Harvard, Vancouver, ISO, and other styles
39

Park, Min-Hong, Jae-Hoon Cho, and Yong-Tae Kim. "CNN Model with Multilayer ASPP and Two-Step Cross-Stage for Semantic Segmentation." Machines 11, no. 2 (2023): 126. http://dx.doi.org/10.3390/machines11020126.

Full text
Abstract:
Currently, interest in deep learning-based semantic segmentation is increasing in various fields such as medicine, autonomous driving, and object segmentation. For example, UNet, a deep learning network with an encoder–decoder structure, is used for image segmentation in the biomedical area, and attempts to segment various objects are made using ASPP, as in Deeplab. Recent studies improve the accuracy of object segmentation through structures that expand the receptive field in various ways. Semantic segmentation has evolved to divide objects of various sizes more accurately and in detail, and various methods have been presented for this. In this paper, we propose a model structure that reduces the overall parameters of the deep learning model while improving accuracy. The proposed model has an encoder–decoder structure: the encoder produces half-scale feature maps with few parameters, and the decoder integrates feature maps of various scales, combining high-resolution detail with forwarded low-resolution features. The integrated feature map learns the feature map of each encoder level through a continuous coupling structure that carries over information from the previous stage. To verify the performance of the model, we trained and evaluated it on the KITTI-360 and Cityscapes datasets, and experimentally confirmed that the proposed method is superior to existing models.
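For reference, a standard ASPP block of the kind this line of work builds on (the paper's multilayer variant stacks such blocks across stages; this sketch is illustrative, not the authors' code): parallel atrous convolutions with different dilation rates widen the receptive field without reducing resolution, and their outputs are concatenated and projected.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # one atrous convolution branch per dilation rate
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(1, 256, 32, 32)
print(ASPP(256, 64)(feat).shape)   # torch.Size([1, 64, 32, 32]) -- resolution preserved
```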
APA, Harvard, Vancouver, ISO, and other styles
40

Sun, Nan, Han Fang, Yuxing Lu, Chengxin Zhao, and Hefei Ling. "END^2: Robust Dual-Decoder Watermarking Framework Against Non-Differentiable Distortions." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 1 (2025): 773–81. https://doi.org/10.1609/aaai.v39i1.32060.

Full text
Abstract:
DNN-based watermarking methods have rapidly advanced, with the "Encoder-Noise Layer-Decoder" (END) framework being the most widely used. To ensure end-to-end training, the noise layer in the framework must be differentiable. However, real-world distortions are often non-differentiable, leading to challenges in end-to-end training. Existing solutions only treat the distortion perturbation as additive noise, which does not fully integrate the effect of distortion in training. To better incorporate non-differentiable distortions into training, we propose a novel dual-decoder architecture (END^2). Unlike the conventional END architecture, our method employs two structurally identical decoders: the Teacher Decoder, processing pure watermarked images, and the Student Decoder, handling distortion-perturbed images. The gradient is backpropagated only through the Teacher Decoder branch to optimize the encoder, thus bypassing the problem of non-differentiability. To ensure resistance to arbitrary distortions, we enforce alignment of the two decoders' feature representations by maximizing the cosine similarity between their intermediate vectors on a hypersphere. Extensive experiments demonstrate that our scheme outperforms state-of-the-art algorithms under various non-differentiable distortions. Moreover, even without the differentiability constraint, our method surpasses baselines with a differentiable noise layer. Our approach is effective and easily implementable across all END architectures, enhancing practicality and generalizability.
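A loss-level sketch, with assumed shapes, of the training signal described above: the teacher decoder sees the clean watermarked image, the student decoder sees the distorted one, their intermediate vectors are aligned by maximizing cosine similarity, and the non-differentiable distortion in front of the student means the encoder is effectively optimized through the teacher branch only. Names and weighting are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dual_decoder_losses(teacher_vec, student_vec, teacher_msg, student_msg, target_msg):
    # teacher_vec / student_vec: (B, D) intermediate decoder features
    # Alignment term: maximize cosine similarity == minimize (1 - cos)
    align = 1.0 - F.cosine_similarity(teacher_vec, student_vec, dim=1).mean()
    # Message-recovery terms for both decoders (target bits in {0, 1})
    msg_loss = F.binary_cross_entropy_with_logits(teacher_msg, target_msg) \
             + F.binary_cross_entropy_with_logits(student_msg, target_msg)
    return align, msg_loss

t_vec, s_vec = torch.randn(4, 256), torch.randn(4, 256)
t_msg, s_msg = torch.randn(4, 30), torch.randn(4, 30)
bits = torch.randint(0, 2, (4, 30)).float()
print(dual_decoder_losses(t_vec, s_vec, t_msg, s_msg, bits))
```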
APA, Harvard, Vancouver, ISO, and other styles
41

Shi, Yanli, and Pengpeng Sheng. "J-Net: Asymmetric Encoder-Decoder for Medical Semantic Segmentation." Security and Communication Networks 2021 (August 30, 2021): 1–8. http://dx.doi.org/10.1155/2021/2139024.

Full text
Abstract:
With the development of deep learning, breakthroughs have been made in the field of semantic segmentation. However, it is difficult to generate fine masks for medical images because they have low contrast, high resolution, and insufficient semantic information. Existing approaches mostly use a pooling layer to reduce the resolution of feature maps, so it is difficult for them to consider whole-image features, resulting in information loss and performance degradation. In this paper, a multiscale asymmetric encoder-decoder semantic segmentation network is proposed. The network consists of two parts, which perform feature extraction and image restoration on the input, respectively. The encoder network obtains multiscale feature information by connecting multiple ASPP modules to form a feature pyramid. Meanwhile, the upsampling layer of each decoder can be connected to the feature map generated by the corresponding ASPP module. Finally, the classification information of each pixel is obtained through the sigmoid function. The performance of the proposed method is verified on publicly available datasets. The experimental evidence shows that the proposed method can take full advantage of multiscale feature information and achieve superior performance with less inference computational cost.
APA, Harvard, Vancouver, ISO, and other styles
42

Masood, Sharjeel, Fawad Ahmed, Suliman A. Alsuhibany, et al. "A Deep Learning-Based Semantic Segmentation Architecture for Autonomous Driving Applications." Wireless Communications and Mobile Computing 2022 (June 18, 2022): 1–12. http://dx.doi.org/10.1155/2022/8684138.

Full text
Abstract:
In recent years, the development of smart transportation has accelerated research on semantic segmentation as it is one of the most important problems in this area. A large receptive field has always been the center of focus when designing convolutional neural networks for semantic segmentation. A majority of recent techniques have used max pooling to increase the receptive field of a network at the expense of decreasing its spatial resolution. Although this idea has shown improved results in object detection applications, when it comes to semantic segmentation, a high spatial resolution also needs to be considered. To address this issue, a new deep learning model, the M-Net, is proposed in this paper which satisfies both a high spatial resolution and a large enough receptive field while keeping the size of the model to a minimum. The proposed network is based on an encoder-decoder architecture. The encoder uses atrous convolution to encode the features at full resolution, and instead of using heavy transposed convolutions, the decoder consists of a multipath feature extraction module that can extract multiscale context information from the encoded features. The experimental results reported in the paper demonstrate the viability of the proposed scheme.
APA, Harvard, Vancouver, ISO, and other styles
43

Geng, Xiaoxiao, Shunping Ji, Meng Lu, and Lingli Zhao. "Multi-Scale Attentive Aggregation for LiDAR Point Cloud Segmentation." Remote Sensing 13, no. 4 (2021): 691. http://dx.doi.org/10.3390/rs13040691.

Full text
Abstract:
Semantic segmentation of LiDAR point clouds has implications in self-driving, robotics, and augmented reality, among others. In this paper, we propose a Multi-Scale Attentive Aggregation Network (MSAAN) to achieve globally consistent point cloud feature representation and superior segmentation performance. First, upon a baseline encoder-decoder architecture for point cloud segmentation, namely RandLA-Net, an attentive skip connection was proposed to replace the commonly used concatenation to balance the encoder and decoder features of the same scales. Second, a channel attentive enhancement module was introduced into the local attention enhancement module to boost the local feature discriminability and aggregate the local channel structure information. Third, we developed a multi-scale feature aggregation method to capture the global structure of a point cloud from both the encoder and the decoder. The experimental results show that our MSAAN significantly outperformed state-of-the-art methods, i.e., at least 15.3% mIoU improvement for scene-2 of the CSPC dataset, 5.2% for scene-5 of the CSPC dataset, and 6.6% for the Toronto3D dataset.
APA, Harvard, Vancouver, ISO, and other styles
44

Lu, Xuwei, Yunlong Zhang, and Congqi Zhang. "CATransU-Net: Cross-attention TransU-Net for field rice pest detection." PLOS One 20, no. 6 (2025): e0326893. https://doi.org/10.1371/journal.pone.0326893.

Full text
Abstract:
Accurate detection of rice pests in the field is a key problem in field pest control. U-Net can effectively extract local image features, and the Transformer is good at dealing with long-distance dependencies. A Cross-Attention TransU-Net (CATransU-Net) model is constructed for paddy pest detection by combining U-Net and the Transformer. It consists of an encoder, a decoder, a dual Transformer-attention module (DTA), and cross-attention skip-connections (CASC), where a dilated residual Inception (DRI) block in the encoder extracts multiscale features, the DTA is added to the bottleneck of the model to efficiently learn nonlocal interactions between encoder features, and the CASC, designed to replace the plain skip-connection between encoder and decoder, models the multi-resolution feature representation. Compared with U-Net and the Transformer, CATransU-Net can extract multiscale features through the DRI and DTA, and enhance feature representation to generate high-resolution insect images through the CASC and decoder. The experimental results on the large-scale multiclass IP102 and AgriPest benchmark datasets verify that CATransU-Net is effective for rice pest extraction, with a precision of 93.51%, about 2% higher than other methods and 9.36% higher than U-Net. The proposed method can be applied to field rice pest detection systems. Code is available at https://github.com/chenchenchen23123121da/CATransU-Net.
APA, Harvard, Vancouver, ISO, and other styles
45

Li, Boliang, Yaming Xu, Yan Wang, and Bo Zhang. "DECTNet: Dual Encoder Network combined convolution and Transformer architecture for medical image segmentation." PLOS ONE 19, no. 4 (2024): e0301019. http://dx.doi.org/10.1371/journal.pone.0301019.

Full text
Abstract:
Automatic and accurate segmentation of medical images plays an essential role in disease diagnosis and treatment planning. Convolutional neural networks have achieved remarkable results in medical image segmentation in the past decade. Meanwhile, deep learning models based on the Transformer architecture have also succeeded tremendously in this domain. However, due to the ambiguity of medical image boundaries and the high complexity of physical organization structures, implementing effective structure extraction and accurate segmentation remains a problem requiring a solution. In this paper, we propose a novel Dual Encoder Network named DECTNet to alleviate this problem. Specifically, DECTNet comprises four components: a convolution-based encoder, a Transformer-based encoder, a feature fusion decoder, and a deep supervision module. The convolutional encoder can extract fine spatial contextual details in images, while the Transformer encoder is designed using a hierarchical Swin Transformer architecture to model global contextual information. The novel feature fusion decoder integrates the multi-scale representations from the two encoders and selects features relevant to the segmentation task via a channel attention mechanism. Further, a deep supervision module is used to accelerate the convergence of the proposed method. Extensive experiments demonstrate that, compared to seven other models, the proposed method achieves state-of-the-art results on four segmentation tasks: skin lesion segmentation, polyp segmentation, Covid-19 lesion segmentation, and MRI cardiac segmentation.
APA, Harvard, Vancouver, ISO, and other styles
46

Chen, Qian, Ze Liu, Yi Zhang, Keren Fu, Qijun Zhao, and Hongwei Du. "RGB-D Salient Object Detection via 3D Convolutional Neural Networks." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (2021): 1063–71. http://dx.doi.org/10.1609/aaai.v35i2.16191.

Full text
Abstract:
RGB-D salient object detection (SOD) recently has attracted increasing research interest and many deep learning methods based on encoder-decoder architectures have emerged. However, most existing RGB-D SOD models conduct feature fusion either in the single encoder or the decoder stage, which hardly guarantees sufficient cross-modal fusion ability. In this paper, we make the first attempt in addressing RGB-D SOD through 3D convolutional neural networks. The proposed model, named RD3D, aims at pre-fusion in the encoder stage and in-depth fusion in the decoder stage to effectively promote the full integration of RGB and depth streams. Specifically, RD3D first conducts pre-fusion across RGB and depth modalities through an inflated 3D encoder, and later provides in-depth feature fusion by designing a 3D decoder equipped with rich back-projection paths (RBPP) for leveraging the extensive aggregation ability of 3D convolutions. With such a progressive fusion strategy involving both the encoder and decoder, effective and thorough interaction between the two modalities can be exploited and boost the detection accuracy. Extensive experiments on six widely used benchmark datasets demonstrate that RD3D performs favorably against 14 state-of-the-art RGB-D SOD approaches in terms of four key evaluation metrics. Our code will be made publicly available: https://github.com/PPOLYpubki/RD3D.
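A minimal sketch, illustrative rather than the RD3D code, of pre-fusion with 3-D convolutions: RGB and depth are stacked along an extra "modality" dimension so that a Conv3d kernel mixes the two streams from the very first layer.

```python
import torch
import torch.nn as nn

rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 1, 224, 224).repeat(1, 3, 1, 1)     # replicate depth to 3 channels
x = torch.stack([rgb, depth], dim=2)                        # (B, 3, 2, H, W): modality axis

# kernel depth 2 spans both modalities, fusing them in a single 3-D convolution
pre_fuse = nn.Conv3d(3, 64, kernel_size=(2, 3, 3), padding=(0, 1, 1))
fused = pre_fuse(x).squeeze(2)                               # (B, 64, H, W): modalities merged
print(fused.shape)
```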
APA, Harvard, Vancouver, ISO, and other styles
47

Xing, Na, Jun Wang, Yuehai Wang, Keqing Ning, and Fuqiang Chen. "Point Cloud Completion Based on Nonlocal Neural Networks with Adaptive Sampling." Information Technology and Control 53, no. 1 (2024): 160–70. http://dx.doi.org/10.5755/j01.itc.53.1.34047.

Full text
Abstract:
Raw point clouds are usually sparse and incomplete, inevitably containing outliers or noise from 3D sensors. In this paper, an improved SA-Net based on an encoder-decoder structure is proposed to make it more robust in predicting complete point clouds. The encoder of the original SA-Net network is very sensitive to noise in the feature extraction process. Therefore, we use PointASNL as the encoder, which weights the area around the initial sampling points through the AS module (Adaptive Sampling Module) and adaptively adjusts the weights of the sampling points to effectively alleviate the bias effect of outliers. In order to fully mine the feature information of point clouds, it captures the neighborhood and long-distance dependencies of sampling points through the LNL module (Local-NonLocal Module), providing more accurate information for point cloud processing. We then use the encoder to extract local geometric features of the incomplete point cloud at different resolutions. An attention mechanism is then introduced to transfer the extracted features to a decoder, which gradually refines the local features to achieve a more realistic result. Experiments on the ShapeNet dataset show that the improved point cloud completion network reduces the average Chamfer distance by 3.50% compared to SA-Net.
APA, Harvard, Vancouver, ISO, and other styles
48

Xing, Yongfeng, Luo Zhong, and Xian Zhong. "An Encoder-Decoder Network Based FCN Architecture for Semantic Segmentation." Wireless Communications and Mobile Computing 2020 (July 7, 2020): 1–9. http://dx.doi.org/10.1155/2020/8861886.

Full text
Abstract:
In recent years, the convolutional neural network (CNN) has made remarkable achievements in semantic segmentation, and semantic segmentation methods have promising application prospects. Current methods mostly use an encoder-decoder architecture to generate pixel-by-pixel segmentation predictions: the encoder extracts feature maps and the decoder recovers their resolution. An improved semantic segmentation method based on the encoder-decoder architecture is proposed, which obtains better segmentation accuracy on several hard classes and significantly reduces computational complexity by modifying the backbone and applying several refinement techniques. The resulting framework achieves good performance on many datasets. In comparison with the traditional architecture, our architecture does not need an additional decoding layer and further reuses the encoder weights, thus reducing the total number of parameters needed for processing. In this paper, a modified focal loss function is also put forward as a replacement for the cross-entropy loss to better handle the class imbalance of the training data. In addition, more context information is added to the decoder module to improve the segmentation results. Experiments show that the presented method achieves better segmentation results. As an integral part of a smart city, multimedia information plays an important role, and semantic segmentation is an important basic technology for building a smart city.
APA, Harvard, Vancouver, ISO, and other styles
49

Jing, Zhenping. "A novel deep fully convolutional encoder-decoder network and similarity analysis for English education text event clustering analysis." Computer Science and Information Systems, no. 00 (2024): 62. http://dx.doi.org/10.2298/csis240418062j.

Full text
Abstract:
Education event clustering for social media aims to achieve short text clustering according to event characteristics in online social networks. Traditional text event clustering suffers from poor classification results and heavy computation. Therefore, we propose a novel deep fully convolutional encoder-decoder network and similarity analysis for English education text event clustering in online social networks. At the encoder end, the features of text events are extracted step by step through the convolution operations of the convolutional layers; the background noise is suppressed layer by layer while the target feature representation is obtained. The decoder end is structurally symmetric to the encoder end: the high-level feature representation obtained by the encoder is deconvolved and up-sampled to recover the target event layer by layer. Based on a linear model, text similarity is calculated and incremental clustering is performed. To verify the effectiveness of the proposed English education text event analysis method, it is compared with other advanced methods. Experiments show that the performance of the proposed method is better than that of the benchmark models.
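A minimal sketch of the incremental clustering step described above, assuming plain cosine similarity and a fixed threshold (both assumptions, not the paper's exact similarity model): each new text event vector joins the most similar existing cluster, or starts a new one when no similarity exceeds the threshold.

```python
import numpy as np

def incremental_cluster(vectors, threshold=0.8):
    centroids, labels = [], []
    for v in vectors:
        v = v / (np.linalg.norm(v) + 1e-12)          # unit-normalize so dot product = cosine
        sims = [float(c @ v) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))      # join the most similar cluster
        else:
            centroids.append(v)                       # start a new cluster
            labels.append(len(centroids) - 1)
    return labels

events = np.random.rand(10, 128)                      # stand-in for encoder feature vectors
print(incremental_cluster(events))
```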
APA, Harvard, Vancouver, ISO, and other styles
50

Chen, Yunfan, and Hyunchul Shin. "Pedestrian Detection at Night in Infrared Images Using an Attention-Guided Encoder-Decoder Convolutional Neural Network." Applied Sciences 10, no. 3 (2020): 809. http://dx.doi.org/10.3390/app10030809.

Full text
Abstract:
Pedestrian-related accidents are much more likely to occur during nighttime, when visible (VI) cameras are much less effective. Unlike VI cameras, infrared (IR) cameras can work in total darkness. However, IR images have several drawbacks, such as low resolution, noise, and thermal energy characteristics that can differ depending on the weather. To overcome these drawbacks, we propose an IR camera system to identify pedestrians at night that uses a novel attention-guided encoder-decoder convolutional neural network (AED-CNN). In AED-CNN, encoder-decoder modules are introduced to generate multi-scale features, in which new skip connection blocks are incorporated into the decoder to combine the feature maps from the encoder and decoder modules. This new architecture increases the context information, which is helpful for extracting discriminative features from low-resolution and noisy IR images. Furthermore, we propose an attention module to re-weight the multi-scale features generated by the encoder-decoder module. The attention mechanism effectively highlights pedestrians while eliminating background interference, which helps to detect pedestrians under various weather conditions. Empirical experiments on two challenging datasets fully demonstrate that our method shows superior performance. Our approach significantly improves the precision over the state-of-the-art method by 5.1% and 23.78% on the Keimyung University (KMU) and Computer Vision Center (CVC)-09 pedestrian datasets, respectively.
APA, Harvard, Vancouver, ISO, and other styles