Introduction
Recently, direct laser writing on Sb2Se3 phase-change thin films has been exploited as a simple, low-cost, and fast approach to create or reprogram photonic integrated circuits (PICs)1. Various photonic components, including waveguides, gratings, ring resonators, couplers, crossings, and interferometers, were created in a single writing step, without the need for additional fabrication processes1. Furthermore, writing imperfections (errors) can easily be corrected by locally adjusting each element through erasing and restoring1. Before this, Delany et al. exploited the combination of the large refractive index contrast and ultralow losses of Sb2Se3 to demonstrate the feasibility of a laser-programmed multiport router, based on writing nonvolatile patterns of weakly scattering perturbations onto a multimode interference device2. Blundell et al. continued this work and, in their very recent article, established the ability to strongly increase the effect per pixel/unit length by increasing the Sb2Se3 film thickness up to 100 nm3. Wu et al. took advantage of direct laser writing of phase patterns optimized by the inverse design technique to achieve reconfigurability in programmable phase-change photonic devices4. They demonstrated a photonic device that can be reconfigured from a wavelength-division demultiplexer to a mode-division demultiplexer4. A rewritable PIC architecture based on selective laser writing on another phase-change material (Sb2S3) has also been realized by Miller et al.5. Liu et al. reported a high-speed patterning approach using 780 nm femtosecond laser pulses to accomplish partial re-amorphization of Sb2S3 and to write and erase multiple microscale color images6. Their approach can find many applications in nonvolatile ultrathin displays.
Photonic neural networks (PNNs) may be the next-generation computing platforms due to their advantages of low power consumption, high parallelism, and light-speed signal processing. On-chip diffractive optical neural networks are one of many implementations of PNNs7,8,9, in which an on-chip high-contrast transmit array (HCTA) metasurface10 functions as an optical neural network hidden layer. Inspired by the previously presented on-chip diffractive optical neural networks7,8,9 and leveraging the technique of direct laser writing on Sb2Se3 thin film, a proof-of-principle of an on-chip programmable diffractive deep neural network is presented in this article for image classification.
Phase-change metasurface
In this study, the on-chip programmable diffractive deep neural network consists of multiple phase-change metasurfaces, each of which performs as a neural network layer. The phase-change metasurface is a one-dimensional amorphous Sb2Se3 (aSb2Se3) rod array in the crystalline Sb2Se3 (cSb2Se3) thin film, as shown in Fig. 1(a)11. Analogous to previous demonstrations1,2,3,4, the metasurface can be written (or rewritten) using direct laser writing. The lattice constant of the metasurface is 500 nm. The Sb2Se3 film is 30 nm thick and is coated with a SiO2 capping layer of 200 nm thickness1. The SiO2 layer protects the Sb2Se3 layer and inhibits its oxidation. Underneath the Sb2Se3, there is a Si3N4 film of 330 nm thickness on a standard oxidized silicon substrate1. A single neuron (or meta-atom) is formed by a single (aSb2Se3) rod, whose geometrical parameters can serve as learnable parameters. As shown in Figs. 1(b) and 1(c), by adjusting the geometrical parameters of an (aSb2Se3) rod (i.e., the rod length and width), the transmission phase shift and amplitude of a single neuron for the TE-polarized guided wave can be tuned. By fixing the (aSb2Se3) rod width at 400 nm and varying the rod length from 300 nm to 4 μm, a phase shift of more than π/2 is attained, while the transmission amplitude remains very close to 1 (Fig. 1(d)). Therefore, in the on-chip programmable diffractive deep neural network, the rod lengths of the meta-atoms (neurons) are chosen as the learnable parameters, which collectively adjust the amplitude and phase profile of the output wavefront. The plots in Fig. 1 are generated using the commercial software package Lumerical FDTD. Throughout the simulations, the fundamental TE mode is selected for excitation, and the x-axis is chosen as the injection axis. The operation wavelength is set to 1.55 μm.
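The mapping from rod length to complex transmission described above can be represented in a design code as an interpolated lookup table. The following sketch illustrates this; the sample phase and amplitude values are illustrative stand-ins, not data extracted from the article's Lumerical simulations.

```python
import numpy as np

# Hypothetical lookup table sketching the trend of Fig. 1(d): transmission
# phase (rad) and amplitude of a meta-atom versus aSb2Se3 rod length (um),
# at a fixed 400 nm rod width. Values below are illustrative placeholders.
rod_lengths = np.array([0.3, 1.0, 2.0, 3.0, 4.0])      # um
phases      = np.array([0.00, 0.45, 0.95, 1.35, 1.65]) # rad, span > pi/2
amplitudes  = np.array([0.99, 0.98, 0.98, 0.97, 0.97]) # near-unity

def meta_atom_response(length_um):
    """Interpolate the complex transmission coefficient for a rod length."""
    phase = np.interp(length_um, rod_lengths, phases)
    amp = np.interp(length_um, rod_lengths, amplitudes)
    return amp * np.exp(1j * phase)

t = meta_atom_response(0.3)  # shortest allowed rod
```

In a full design, the table would be filled by sweeping the rod length in FDTD once and reusing the interpolant during training.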
Fig. 1
(a) Schematic of (aSb2Se3) rod array in the (cSb2Se3) thin film. The insets show the side view and a meta-atom of the structure. The electric field transmission phase (b) and amplitude (c) of the TE-polarized guided wave for a meta-atom versus the (aSb2Se3) rod length and width. (d) Electric field transmission phase and amplitude versus (aSb2Se3) rod length, fixing the rod width at 400 nm. These plots are generated by the commercial software package Lumerical FDTD. The operation wavelength is set to be 1.55 μm.
On-chip programmable diffractive deep neural network
Like artificial deep neural networks, an on-chip programmable diffractive deep neural network comprises one input layer, multiple hidden layers, and one output layer. Each hidden layer of the network is a phase-change metasurface that consists of many meta-atoms (neurons) arranged linearly. Each meta-atom is a weight element that connects to the meta-atoms of adjacent layers through diffraction and interference of light7. The input images are preprocessed, converted to a vector, and then encoded into the amplitude of the input light at the input layer. The output layer is composed of multiple detection regions aligned in a linear configuration. The network is trained on the training dataset using an error-backpropagation algorithm based on the adjoint gradient method described in12,13. For error backpropagation, the Adam optimizer is adopted to perform mini-batch gradient descent14. A decaying learning rate with a decay rate of 0.99 per epoch is applied in the training process. A cost function in terms of the squared errors between the desired set of output intensity distributions (targets) and the realized set of output intensity distributions is defined and minimized iteratively by adjusting the design parameters (i.e., the (aSb2Se3) rod lengths through the metasurfaces). In the adjoint method, the gradient of the cost function with respect to all design variables can be computed using only two full-field simulations12. After numerical training, the inference performance of the designed network is numerically tested using the test dataset and subsequently verified using the 2.5D variational FDTD solver of Lumerical Mode Solution. In verification, the preprocessed input images are encoded as the intensity of the input light in Lumerical Mode Solution slab plane-wave sources.
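The training procedure described above can be sketched as a mini-batch Adam loop with a per-epoch learning-rate decay of 0.99. In the sketch below, `adjoint_gradient` is a placeholder for the two full-field simulations of the adjoint method (here replaced by a quadratic surrogate so the code runs standalone); the rod lengths are clipped to the allowed writing range, an assumption consistent with the ranges quoted later in the article.

```python
import numpy as np

def adjoint_gradient(lengths, x, target):
    # Placeholder: gradient of a quadratic surrogate cost. In the real design
    # this comes from one forward and one adjoint full-field simulation.
    return 2.0 * (lengths - target)

def train(lengths, batches, n_epochs=10, lr=1e-2, decay=0.99,
          beta1=0.9, beta2=0.999, eps=1e-8):
    """Mini-batch Adam with a learning rate decaying by `decay` per epoch."""
    m = np.zeros_like(lengths)  # Adam first moment
    v = np.zeros_like(lengths)  # Adam second moment
    t = 0
    for epoch in range(n_epochs):
        for x, target in batches:
            t += 1
            grad = adjoint_gradient(lengths, x, target)
            m = beta1 * m + (1 - beta1) * grad
            v = beta2 * v + (1 - beta2) * grad ** 2
            m_hat = m / (1 - beta1 ** t)       # bias-corrected moments
            v_hat = v / (1 - beta2 ** t)
            lengths = lengths - lr * m_hat / (np.sqrt(v_hat) + eps)
            lengths = np.clip(lengths, 0.3, 4.0)  # keep rods in [300 nm, 4 um]
        lr *= decay  # decaying learning rate, applied each epoch
    return lengths

trained = train(np.full(4, 3.5), [(None, np.full(4, 1.0))] * 5, n_epochs=20)
```

Clipping after each update is one simple way to respect the fabrication bounds; a reparameterization (e.g., a sigmoid over the allowed range) would be an alternative.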
Here, the capability of the on-chip programmable diffractive deep neural network is benchmarked on two machine-learning tasks: pattern recognition of the English letters X, Y, and Z, and MNIST (Modified National Institute of Standards and Technology) handwritten (0-1-2) digits classification15.
Pattern recognition
First, the on-chip programmable diffractive deep neural network is trained for pattern recognition of the English letters X, Y, and Z. The inputs are binary letter images with 60 pixels (Fig. 2(a)). The dataset is created by randomly flipping the amplitude of single pixels and pixel pairs in the binary letter images of X, Y, and Z7. This generates 5490 images, of which 4590 are used as the training dataset and fed through the metasystem in batches of 10 during each training epoch, and the remaining 900 are used as the test dataset. The input patterns are reshaped from a two-dimensional 10 × 6 matrix to a 60-component one-dimensional vector7 and then encoded as the amplitude of the input light.
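The augmentation described above can be sketched by enumerating all single-pixel and double-pixel flips of each 10 × 6 binary image; the all-zeros base image below is a stand-in for the actual letter glyphs. Note that 60 single flips plus C(60, 2) = 1770 pair flips give 1830 variants per letter, and 3 × 1830 = 5490 matches the dataset size quoted above.

```python
import itertools
import numpy as np

def augment(img):
    """All single- and double-pixel amplitude flips of a binary image."""
    flat = img.flatten()
    n = flat.size
    variants = []
    for k in (1, 2):  # flip one pixel, then every pair of pixels
        for idx in itertools.combinations(range(n), k):
            v = flat.copy()
            v[list(idx)] ^= 1  # flip 0 <-> 1
            variants.append(v.reshape(img.shape))
    return variants

# Stand-in 10 x 6 binary image (the article uses the letter bitmaps here).
base = np.zeros((10, 6), dtype=np.uint8)
variants = augment(base)  # 60 + 1770 = 1830 variants per letter
```

Each variant would then be flattened to a 60-component vector for amplitude encoding, as described in the text.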
The pattern recognizer is composed of five phase-change metasurfaces, each consisting of 60 meta-atoms (Fig. 2(b)). The 30 μm-long metasurfaces are precisely aligned with 8 μm separations. After light exits the fifth metasurface, it propagates 8 μm until it reaches the network output layer with three linearly arranged detectors (corresponding to the letters X, Y, and Z). The weights ((aSb2Se3) rod lengths) are initialized randomly in the range [300 nm, 4 μm] and kept in the same range during training. Figure 2(c) illustrates the evolution of the loss and accuracy on the training set during the learning procedure. After only 3 epochs, the metasystem achieves 100% accuracy in English letter recognition. The blind testing accuracy of the designed metasystem is 100% over the test dataset. Verification with Lumerical 2.5D FDTD also indicates 98.8% matching with the numerical testing results. For verification, 90 random patterns (30 per letter) are selected from the test dataset. The confusion matrices for numerical testing and FDTD testing are shown in Figs. 2(d) and 2(e), respectively. The x-y view of the electric field distribution in the network and the normalized power of the three output detectors for a sample input pattern of the letter Y are presented in Fig. 2(f). Table 1 summarizes the characteristics of the designed pattern recognizer.
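The confusion matrices referenced above can be assembled by taking the detector region with the highest normalized output power as the predicted class. The sketch below uses random powers as stand-ins for the solver outputs; the 90-sample size mirrors the verification set described in the text.

```python
import numpy as np

def confusion_matrix(detector_powers, true_labels, n_classes=3):
    """Rows: true class; columns: predicted class (brightest detector)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for powers, label in zip(detector_powers, true_labels):
        pred = int(np.argmax(powers))  # brightest detector wins
        cm[label, pred] += 1
    return cm

rng = np.random.default_rng(0)
powers = rng.random((90, 3))        # stand-in detector powers, 90 patterns
labels = rng.integers(0, 3, 90)     # stand-in true classes (X, Y, Z)
cm = confusion_matrix(powers, labels)
```

The accuracy is then the trace of the matrix divided by the total number of patterns.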
MNIST handwritten digits classification
For digit classification, the training is performed using 18,623 images of handwritten digits (0-1-2) from the MNIST handwritten digits database. The training images are fed through the network in batches of 64. After the training phase, the designed network is tested on 3147 images of the test dataset. The grayscale 28 × 28-pixel images are down-sampled to 14 × 14 pixels and converted to 196-component vectors. The vectors are then encoded as the amplitude of the input light.
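The preprocessing pipeline described above can be sketched as follows; 2 × 2 block averaging is one plausible down-sampling choice (the article does not specify the method), and the peak normalization for amplitude encoding is likewise an assumption.

```python
import numpy as np

def preprocess(img28):
    """Down-sample a 28x28 grayscale digit to 14x14 and flatten to 196."""
    img = img28.astype(float)
    # 2x2 block average: element [i, a, j, b] = img[2i+a, 2j+b]
    img14 = img.reshape(14, 2, 14, 2).mean(axis=(1, 3))
    vec = img14.flatten()                       # 196-component vector
    peak = vec.max()
    return vec / peak if peak > 0 else vec      # amplitude in [0, 1]

x = preprocess(np.arange(784).reshape(28, 28))  # stand-in digit image
```

Each resulting vector then drives the amplitudes of the 196 input channels, one per meta-atom of the first metasurface.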
Fig. 2
(a) 10 × 6-pixel images of the letters X, Y and Z, (b) 2D schematic of the five-layer network trained to perform as an on-chip pattern recognizer, (c) loss and training accuracy versus epoch during the learning process, (d) the confusion matrix of the network over 900 images of the test dataset, (e) the confusion matrix of the network, computed by 2.5D FDTD of Lumerical Mode Solution, over 90 random images from the test dataset, (f) the x-y view of the electric field distribution at the middle plane of the Sb2Se3 thin film through the whole network and the normalized power of the three output detectors for a representative Y image.
The schematic of the digit classifier is illustrated in Fig. 3(a). It is composed of three phase-change metasurfaces, each with 196 meta-atoms. The one-dimensional metasurfaces are 98 μm long and precisely aligned with 7 μm separations. After light exits the third metasurface, it propagates 7 μm until it reaches the output layer of the network with three linearly aligned detectors (corresponding to the digits 0, 1, and 2). The (aSb2Se3) rod lengths are initialized randomly and kept in the range [300 nm, 5 μm] during training. Figure 3(b) depicts the loss and training accuracy in each epoch during the learning process. After 140 epochs, the training accuracy reaches 92.38% over the whole training dataset. The blind testing accuracy of the final design is 91.86% over the test dataset. From the test dataset images that the designed network successfully classifies, 100 handwritten digit images are randomly selected for 2.5D FDTD verification. For 92 out of 100 images, the FDTD predictions agree with the numerical predictions, i.e., a 92% matching score. The confusion matrices for numerical testing and FDTD testing of the network are shown in Figs. 3(c) and 3(d), respectively. The dependence of the digit classifier performance on the number of metasurfaces (hidden layers) is also investigated in Fig. 3(e). As can be observed, the blind testing accuracy grows from 86.30% for the one-layer network to 94.43% for the four-layer network and then declines to 92.50% for the five-layer network. The matching percentage between numerical testing and FDTD testing is above 90% for all the networks; the one-layer network has the best matching score of 98%, and the five-layer network has the worst at 91%. Given the nearly similar performance of all the networks in Fig. 3(e), structural complexity and footprint may be the main determining factors.
Figure 3(f) shows the x-y view of the electric field distribution in the three-layer metasystem and the normalized power of three output detectors for a sample image of the handwritten digit 2. The summarized characteristics of the presented image classifier are given in Table 1.
Fig. 3
(a) 2D schematic of the three-layer network trained to perform MNIST (0-1-2) handwritten digits classification, (b) loss and training accuracy versus epoch during the learning process, (c) the confusion matrix of the metasystem over 3147 images of the test dataset, (d) the confusion matrix of the metasystem, calculated by 2.5D FDTD of Lumerical Mode Solution, over 100 random images from the test dataset, (e) blind testing accuracy for metasystems with different number of metasurfaces and the associated matching percentage between numerical testing and Lumerical 2.5D FDTD verification over 100 random images from the test dataset, (f) the x-y view of the electric field distribution at the middle plane of the Sb2Se3 thin film through the metasystem and the normalized power of three output detectors for a representative image of the digit 2.
Discussion
Performance
According to the numerical results, a five-layer English letters X, Y, and Z pattern recognizer and a three-layer (0-1-2) MNIST handwritten digits classifier based on the presented on-chip programmable diffractive deep neural network architecture achieve 100% and 91.86% prediction accuracies on the test dataset, respectively. FDTD testing of these structures reveals 98.8% and 92% matching with the numerical testing results, respectively. Compared to the three-class pattern recognizer presented in7 for binary letter images with 15 pixels, which has numerical and FDTD testing accuracies of 98% and 95.8%, respectively, and a footprint of 450 μm×300 μm, the presented scheme has a much smaller footprint of 30 μm×40 μm for the 60-pixel pattern recognizer and 98 μm×21 μm for the more complex digit classifier, while showing comparable accuracies. It is also worth comparing our proposal to the three-class Iris flower classifier presented in8, which has a footprint of 280 μm×1000 μm and numerical and experimental testing accuracies of 90% on the test dataset. Furthermore, neither of those proposals7,8 is programmable: once fabricated, their function is fixed and cannot be retrained for other functions. A programmable multimodal scheme with 85.7% accuracy on multimodal (vision, audio, touch) test sets was previously presented in16; however, its footprint is very large (0.36 mm×1.42 mm). Moreover, its tunable diffractive units are implemented with thermo-optic phase shifters that are large (3 μm×100 μm) and volatile (needing a constant power supply to operate). Table 2 summarizes the characteristics of the previously presented on-chip diffractive optical neural networks and provides a comparison to the scheme presented in this article.
In the previous works based on direct laser writing on Sb2Se3 thin film2,3, nonvolatile pixel perturbation patterns on a multimode interference device were optimized using an iterative method. Wu et al. employed a topology optimization approach to inverse-design the phase pattern of phase-change (Sb2Se3) multimode interference devices in a pixel-wise manner4. Adjoint optimization of on-chip phase-change (Sb2Se3) metasurfaces is a fast and efficient alternative to the optimization approaches presented in2,3,4 for designing phase-change photonic devices. Furthermore, more complicated photonic devices with larger device areas can be designed. Other lithography-free reconfigurable demonstrations of on-chip photonic devices, such as photonic neural networks based on optical coding of spatial-temporal modulations of the imaginary index on an active III–V semiconductor platform17 and multimode interference power splitters based on plasma dispersion in planar silicon from all-optical excitation with spatially modulated pump beams18,19, provide a convenient and cost-efficient paradigm for programmable integrated photonic circuits but suffer from a volatile nature. An on-chip programmable convolutional neural network based on a phase-change metasurface mode converter, using the phase-change material Ge2Sb2Te5 (GST), was also reported in20. A photonic kernel with a 2 × 2 array of GST converters was experimentally demonstrated to perform MNIST handwritten (1–2) digits classification, achieving an experimental accuracy of 91% (91 out of 100 cases). However, the footprint of each GST converter is 80 μm×20 μm. Furthermore, the weight parameters are represented by 64 distinguishable levels created by partial phase transition of GST, which is rather difficult to implement.
The computational scale of many ONN architectures, such as integrated Mach-Zehnder interferometer grids and micro-ring modulator arrays, is typically limited to 4 × 4 matrix-vector multiplications or smaller16. Therefore, most existing ONNs still struggle with classic tasks and small datasets like MNIST and Fashion-MNIST16. On-chip diffractive optical neural networks, however, are expected in the literature to enable large computational scales8,16. Our investigations indicate that these networks are also able to classify only 3–4 classes, suggesting fundamental limitations in their computational scale21. If the scheme presented in this article is benchmarked on the images of all ten (0–9) handwritten digits from the MNIST dataset (for which the distance between two neighboring metasurfaces is 17 μm, the distance between the last metasurface and the output layer is 57 μm, the total device size is 98 μm×91 μm, and all other characteristics are similar to those presented in Table 1), the test accuracy and the FDTD matching score will be 76.02% and 30%, respectively. Therefore, full 10-digit MNIST classification necessitates the integration of 3 to 4 computing modules, each of which classifies 3 digits. More elaboration on the proposed hypothesis was presented in21, the summary of which is that by increasing the number of classification categories, the classification accuracy of on-chip diffractive optical neural networks degrades dramatically (e.g., for this scheme, the testing accuracy decreases from 91.86% for three classes to 76.02% for ten classes, and the matching score decreases from 92% for three classes to 30% for ten classes). Therefore, for the presented scheme, it can be deduced that the FDTD testing accuracy for ten (0–9) digits classification is about 76.02% × 30% ≈ 22.8%, which is a rather low classification accuracy.
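The combined-accuracy estimate above is simply the product of the numerical test accuracy and the FDTD matching score:

```python
# Effective FDTD testing accuracy for 10-class MNIST: numerical test accuracy
# multiplied by the FDTD matching score, as estimated in the text.
test_accuracy = 0.7602
matching_score = 0.30
combined = test_accuracy * matching_score
print(f"{combined:.1%}")  # -> 22.8%
```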
This consequence is underscored by the observation that, for such neural networks, increasing the number of layers, increasing the number of neurons per layer, and/or changing the distance between the layers does not make a significant difference in the classification accuracy, especially the FDTD testing accuracy21. This indicates the limited capability of such a scheme to perform complex (multi-class) tasks21. As described in21, the confusion matrices of many MNIST and Fashion-MNIST benchmarks with different numbers of classes reveal that these networks are able to classify only 3–4 classes correctly.
In the presented study, angular spectrum wave propagation, which exploits fast Fourier transforms (FFTs) to formulate the diffraction, is utilized to numerically simulate the on-chip programmable diffractive deep neural networks12,13. The numerical simulation results are verified using the 2.5D variational FDTD solver of Lumerical Mode Solution. Despite the very close matching between the numerical and FDTD simulation results, a small discrepancy exists between them. This discrepancy is mainly explained by the differences in propagation models and discretization between diffraction-based modeling methods (like the angular spectrum method) and full-wave electromagnetic modeling methods like FDTD.
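A minimal one-dimensional angular spectrum step, as used in the diffraction model above, can be sketched as follows: the field is decomposed into plane waves by an FFT, each component is advanced by its propagation phase over the layer separation, and the result is transformed back. The slab effective index `n_eff` below is an assumed value for illustration, not one taken from the article.

```python
import numpy as np

def angular_spectrum_propagate(field, dx, distance, wavelength, n_eff=1.8):
    """Propagate a 1D complex field by `distance` via the angular spectrum.

    dx, distance, wavelength share the same length unit (e.g., um).
    """
    n = field.size
    k0 = 2 * np.pi / wavelength
    kx = 2 * np.pi * np.fft.fftfreq(n, d=dx)   # transverse wavenumbers
    kz_sq = (n_eff * k0) ** 2 - kx ** 2
    # Complex sqrt makes kz imaginary for evanescent components, which
    # then decay under exp(1j * kz * distance).
    kz = np.sqrt(kz_sq.astype(complex))
    spectrum = np.fft.fft(field)
    return np.fft.ifft(spectrum * np.exp(1j * kz * distance))

# Example: propagate a Gaussian field over an 8 um layer gap at 1.55 um.
field = np.exp(-np.linspace(-5, 5, 256) ** 2).astype(complex)
out = angular_spectrum_propagate(field, dx=0.5, distance=8.0, wavelength=1.55)
```

For purely propagating components (|kx| < n_eff·k0) the transfer function has unit modulus, so total power is conserved between the input and output fields.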
The main challenge for on-chip diffractive optical neural networks is the discrepancy between diffraction-based analysis methods and experimental/full-wave electromagnetic verifications. Many prior works7,8,9,21,22,23,24,25,26 tried to identify the reason for this discrepancy. In this article, we add a few more analyses of such 2D structures to the former analyses. Figure 4 presents the results of these analyses. In Fig. 4, the performance of the (0-1-2) MNIST handwritten digits classifier with different numbers of metasurfaces (neural network layers) is investigated for different lattice constants of the phase-change metasurfaces. The lattice constant of the metasurfaces is set to 300 nm, 500 nm, and 700 nm in Figs. 4(a), 4(b), and 4(c), respectively. For these figures, the (aSb2Se3) rod width is fixed at 250 nm, 400 nm, and 600 nm, respectively, and the (aSb2Se3) rod length is selected as the learnable parameter. As is evident, while the testing accuracies are very close for the three lattice constants, the matching score between numerical and FDTD testing drops as the lattice constant decreases. In particular, for the deeply subwavelength 300 nm lattice constant (Fig. 4(a)), the matching score becomes very low. In Fig. 4(d), the performance of the three-layer (0-1-2) MNIST handwritten digits classifier shown in Fig. 3 is studied as the operation wavelength is changed. The performance of the network remains nearly the same in the wavelength range of 1.5 μm to 1.6 μm, which indicates the broadband operation of the presented scheme. It should be mentioned that with increasing wavelength, a slight decay in the testing accuracy as well as the matching score between the numerical and FDTD testing can be observed. This again indicates that a more deeply subwavelength meta-atom results in a smaller matching score between numerical and FDTD testing. For the study presented in Fig. 4(d), the refractive index of Sb2Se3 between 1.5 μm and 1.6 μm is nearly constant27, and the refractive indices of silicon nitride and silicon oxide are extracted from28,29, respectively.
Fig. 4
These figures illustrate how the subwavelength extent of phase-change metasurfaces affects the performance of the presented architecture. The blind testing accuracy for (0-1-2) MNIST handwritten digits classifier with different numbers of metasurfaces and the associated matching percentage between numerical testing and Lumerical 2.5D FDTD verification over 100 random images from the test dataset are shown for metasystems composed of (a) 300 nm-lattice constant phase-change metasurfaces, (b) 500 nm-lattice constant phase-change metasurfaces, and (c) 700 nm-lattice constant phase-change metasurfaces. (d) The blind testing accuracy and the matching percentage between numerical testing and Lumerical 2.5D FDTD verification over 100 random images from the test dataset, versus operation wavelength for the (0-1-2) MNIST handwritten digits classifier of Fig. 3.
Practical issues
For the presented on-chip optical neural network scheme, there might be a simulation-to-reality gap, as also indicated in8 for on-chip diffractive optical neural networks based on on-chip HCTA metasurfaces. This gap was attributed to errors caused by the fabrication process in8, and an algorithmic compensation method consisting of phase compensation and power compensation was exploited to reduce the negative impact of the errors. The phase compensation stage was implemented based on an online in-situ training procedure. Here, the rewritable nature of the on-chip optical neural network makes it easy to perform the compensation stage. Furthermore, the whole training process of the on-chip optical neural network can easily be accomplished online and in situ by erasing and writing the diffractive units. In the context of on-chip diffractive optical neural networks, the presented programmable scheme can be a breakthrough, because it provides the opportunity to retrain these networks very easily by in-situ training. The direct laser writing technique makes it straightforward to adjust each neuron individually and thereby correct errors to restore the designed functionality1. Using the presented scheme together with the direct laser writing technique will help to identify the main reason for the limited scalability of these networks; if it stems from the inability of Fourier-optics methods to model the evolution of optical fields through the network, then in-situ training would be an ideal solution to this problem.
In our study, an (aSb2Se3) rod array in the (cSb2Se3) thin film is considered as the phase-change metasurface. According to3, direct laser writing of amorphous rods on a crystalline background leads to small pixels with well-defined edges. However, this comes at the cost of much increased losses, as most of the Sb2Se3 layer is crystallized. On the other hand, for a (cSb2Se3) rod array in an (aSb2Se3) thin film, direct laser writing of crystalline rods on an amorphous background offers limited spatial control over the crystalline rods, generally resulting in a larger rod size compared with amorphous rods3. In1, a commercial laser writing system (Heidelberg DWL 66+, 405-nm laser) was utilized to directly write the circuit layouts on a blank (cSb2Se3) thin film. The minimum achievable feature size with such an industrialized laser writing tool was 300 nm. In this article, for the presented on-chip programmable optical neural network designs to comply with commercial laser-writing constraints, the width of the (aSb2Se3) rods is fixed at 400 nm, and the minimum value for the (aSb2Se3) rod lengths is set to 300 nm in the training processes. The thickness of the Sb2Se3 film is assumed to be 30 nm. As mentioned in3, switching depths exceeding 200 nm are achieved for direct-write amorphization pulses in (cSb2Se3) thin films. Therefore, the 30 nm thickness is expected to be well within the regime where the entire thickness of the Sb2Se3 is switched.
As indicated in3, losses stay very low with increasing thickness and length of Sb2Se3 in the amorphous state (in a waveguide). However, losses generally increase exponentially with the length of Sb2Se3 in both the amorphous and crystalline states. While in3 these losses are attributed to surface and volume contributions, arising from scattering by sidewall surface roughness and small polycrystalline domains, Wu et al. considered the scattering induced by the grain boundaries within the (cSb2Se3) waveguide to be the main source of loss1. According to1, the scattering loss can be mitigated by using a lower erasure temperature to form larger crystal grains with sizes on the order of tens of micrometers. Based on3, the insertion loss over the device length reduces dramatically with decreasing film thickness; however, a larger Sb2Se3 film thickness leads to increased modulation per unit length. As a result, according to1,3, a 30 nm-thick Sb2Se3 film will have a propagation loss of 0.003–0.01 dB/µm, which is rather negligible considering the small propagation lengths in the presented schemes.