I am very happy to join the team of Prof. Raja Appuswamy, in the Data Science department, to work on image compression based on AI models, with an application to DNA storage also under consideration. This postdoc is funded by the French Government under the PEPR MoleculArXiv project.
Coding algorithms for long-term storage of digital images on synthetic DNA molecules
Abstract
The current digital world is facing a number of issues, some of them linked to the amount of data being stored. The storage technologies currently on offer cannot absorb the totality of the storage demand, so new data storage technologies have to be developed. DNA molecules are one of the candidates for novel data storage methods. The long lifespan of these molecules makes them a good fit for archiving data that is rarely accessed but needs to be kept for long periods of time. This data, often called “cold”, represents approximately 80% of the data in our digital universe. However, DNA encodes data with 4 symbols (A, C, G and T) rather than the usual binary code (0, 1). For this reason, storing data into DNA requires a specific encoding system capable of translating a binary data stream into a quaternary data stream. In this thesis we focus on new encoding methods from the Deep Learning state of the art, and we adapt those methods for the encoding, decoding, compression and decompression of images on synthetic DNA.
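To give a concrete picture of the binary-to-quaternary translation mentioned above, here is a minimal sketch that simply maps every pair of bits to one nucleotide. This naive mapping is for illustration only: the coders studied in the thesis use constrained codes instead, in order to avoid homopolymers and to keep the GC content balanced.

```python
# Naive binary-to-quaternary transcoding: 2 bits per nucleotide.
# Illustration only; practical DNA coders use constrained codes.
BITS_TO_NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}

def to_dna(bits: str) -> str:
    """Map a binary string (even length) to a nucleotide string."""
    assert len(bits) % 2 == 0, "pad the stream to an even number of bits"
    return "".join(BITS_TO_NT[bits[i:i + 2]] for i in range(0, len(bits), 2))

def to_bits(nucleotides: str) -> str:
    """Inverse mapping: nucleotide string back to a binary string."""
    return "".join(NT_TO_BITS[nt] for nt in nucleotides)

print(to_dna("0110110001"))  # -> "CGTAC"
```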
Jury
- Aline Roumy, Research Director, INRIA, Rennes
- Eitan Yaakobi, Research Director, Technion, Haifa
- Thomas Heinis, Associate Professor, Imperial College London
- Athanassios Skodras, Professor, University of Patras
- Raja Appuswamy, Assistant Professor, EURECOM, Sophia Antipolis
- Dominique Lavenier, Research Director, CNRS, IRISA, Rennes
Over the past years, the ever-growing trend on data storage demand, more specifically for "cold" data (i.e. rarely accessed), has motivated research for alternative systems of data storage. Because of its biochemical characteristics, synthetic DNA molecules are now considered as serious candidates for this new kind of storage. This paper introduces a novel arithmetic coder for DNA data storage, and presents some results on a lossy JPEG 2000 based image compression method adapted for DNA data storage that uses this novel coder. The DNA coding algorithms presented here have been designed to efficiently compress images, encode them into a quaternary code, and finally store them into synthetic DNA molecules. This work also aims at making the compression models better fit the problematic that we encounter when storing data into DNA, namely the fact that the DNA writing, storing and reading methods are error prone processes. The main take away of this work is our arithmetic coder and it's integration into a performant image codec.
Multiple Description Coding (MDC) is an error-resilient source coding method designed for transmission over noisy channels. We present a novel MDC scheme employing a neural network based on implicit neural representation, which involves overfitting the neural representation to each image. Each description is transmitted along with model parameters and its respective latent spaces. Our method has advantages over traditional MDC schemes that utilize autoencoders, such as eliminating the need for model training and offering high flexibility in redundancy adjustment. Experiments demonstrate that our solution is competitive with autoencoder-based MDC and classic MDC based on HEVC, delivering superior visual quality.
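As a rough sketch of the kind of implicit neural representation involved, the snippet below overfits a small coordinate MLP with sine activations to a single image (PyTorch; the layer sizes and activation frequency are illustrative choices). The splitting into descriptions, the latent spaces and the transmission of the model parameters are not shown.

```python
# Overfitting a coordinate MLP f(x, y) -> RGB to one image (illustrative sketch).
import torch
import torch.nn as nn

class SirenLayer(nn.Module):
    def __init__(self, d_in, d_out, w0=30.0):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.w0 = w0

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

model = nn.Sequential(SirenLayer(2, 256), SirenLayer(256, 256), nn.Linear(256, 3))

def fit(image, steps=2000, lr=1e-4):
    """Overfit `model` so that it reproduces one image tensor of shape (H, W, 3)."""
    h, w, _ = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    target = image.reshape(-1, 3)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()  # pixel-wise MSE
        loss.backward()
        opt.step()
    return loss.item()
```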
Over the past years, the ever-growing trend on data storage demand, more specifically for "cold" data (rarely accessed data), has motivated research for alternative systems of data storage. Because of its biochemical characteristics, synthetic DNA molecules are now considered as serious candidates for this new kind of storage. This paper presents some results on lossy image compression methods based on convolutional autoencoders adapted to DNA data storage, with synthetic DNA-adapted entropic and fixed-length codes. The model architectures presented here have been designed to efficiently compress images, encode them into a quaternary code, and finally store them into synthetic DNA molecules. This work also aims at making the compression models better fit the problematics that we encounter when storing data into DNA, namely the fact that the DNA writing, storing and reading methods are error prone processes. The main take aways of this kind of compressive autoencoder are our latent space quantization and the different DNA adapted entropy coders used to encode the quantized latent space, which are an improvement over the fixed length DNA adapted coders that were previously used.
The JPEG Committee has been exploring coding of images in quaternary representations particularly suitable for image archival on DNA storage. The scope of JPEG DNA is to create a standard for efficient coding of images that considers biochemical constraints and offers robustness to the noise introduced by the different stages of a storage process based on synthetic DNA polymers.
At the 100th JPEG meeting, “Additions to the JPEG DNA Common Test Conditions version 2.0” was produced, which supplements the “JPEG DNA Common Test Conditions” by specifying a new constraint to be taken into account when coding images in quaternary representation. In addition, the detailed procedures for evaluating the pre-registered responses to the JPEG DNA Call for Proposals were defined.
Furthermore, the next steps towards a deployed high-performance standard were discussed and defined. In particular, it was decided to request approval of the new work item once the Committee Draft stage has been reached.
The JPEG-DNA AHG has been re-established to work on the preparation of assessment and crosschecking of responses to the JPEG DNA Call for Proposals until the 101st JPEG meeting in October 2023.
The data explosion is one of the greatest challenges of the digital evolution. Storage demand is growing at a rate that the actual capacity of devices cannot match. According to forecasts, the digital universe is expected to exceed 180 zettabytes by 2025, while 80% of this data is rarely accessed ("cold" data), yet deserves to be archived for the long term as part of humanity's memory (photographs, films, computer code, scientific knowledge, etc.). At the same time, conventional storage devices have a lifespan limited to 10 or 20 years and must be replaced frequently to guarantee data reliability, a process that is costly in both money and energy. Recent studies have shown that, because of its biological properties, DNA is a very promising candidate for the long-term archiving of "cold" digital data over centuries. Storing data in the form of DNA molecules requires encoding the information into a quaternary stream composed of the symbols A, C, T and G (the well-known nucleotides), while respecting strict constraints linked to the associated biochemical processes. Moreover, this storage medium introduces unconventional errors, insertions and deletions, that classical error-correction methods cannot handle. Pioneering works have already proposed various algorithms for encoding and protecting data stored in DNA, yet many challenges remain.
The aim of this day is to review the technological advances and the major challenges in the field of molecular storage, highlighting issues related to signal and image processing as well as the theory of error-correcting codes and joint source/channel coding. The day will begin with two introductory tutorials on the subject, followed by more technical talks on the topics above.
Over the past years, the ever-growing trend on data storage demand, more specifically for "cold" data (i.e. rarely accessed), has motivated research for alternative systems of data storage. Because of its biochemical characteristics, synthetic DNA molecules are considered as potential candidates for a new storage paradigm. Because of this trend, several coding solutions have been proposed over the past years for the storage of digital information into DNA. Despite being a promising solution, DNA storage faces two major obstacles: the large cost of synthesis and the noise introduced during sequencing. Additionally, this noise increases when biochemically defined coding constraints are not respected: avoiding homopolymers and patterns, as well as balancing the GC content. This paper describes a novel entropy coder which can be embedded to any block-based image-coding schema and aims to robustify the decoded results. Our proposed solution introduces variability in the generated quaternary streams, reduces the amount of homopolymers and repeated patterns to reduce the probability of errors occurring. In this paper, we integrate the proposed entropy coder into four existing JPEG-inspired DNA coders. We then evaluate the quality-in terms of biochemical constraints-of the encoded data for all the different methods.
The exponentially increasing demand for data storage has been facing more and more challenges during the past years. The energy cost it represents is also increasing, and the availability of storage hardware cannot keep up with the demand. The short lifespan of conventional storage media (10 to 20 years) forces hardware duplication and worsens the situation. The majority of this storage demand concerns "cold" data, data that is very rarely accessed but has to be kept for long periods of time. The coding abilities of synthetic DNA and its long durability (several hundred years) make it a serious candidate as an alternative storage medium for "cold" data. In this paper, we propose a variable-length coding algorithm adapted to DNA data storage with improved performance. The proposed algorithm is based on a modified Shannon-Fano code that respects some biochemical constraints imposed by the synthesis chemistry. We have inserted this code into a JPEG compression algorithm adapted to DNA image storage and we highlight an improvement of the compression ratio ranging from 0.5 up to 2 bits per nucleotide compared to the state-of-the-art solution, without affecting the reconstruction quality.
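For reference, the sketch below constructs a classic binary Shannon-Fano code by recursively splitting the symbol set into two halves of nearly equal probability. The coder proposed in the paper adapts this idea to the quaternary alphabet while respecting the synthesis constraints; that adaptation and the nucleotide mapping are not shown here.

```python
# Classic Shannon-Fano construction (binary codewords), shown only to
# illustrate the principle behind the DNA-adapted variant.
def shannon_fano(symbols):
    """symbols: list of (symbol, probability); returns {symbol: codeword}."""
    symbols = sorted(symbols, key=lambda sp: sp[1], reverse=True)
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(p for _, p in group)
        acc, best_cut, best_diff = 0.0, 1, float("inf")
        for i, (_, p) in enumerate(group[:-1], start=1):
            acc += p
            diff = abs(total - 2 * acc)  # |right half - left half|
            if diff < best_diff:
                best_cut, best_diff = i, diff
        for s, _ in group[:best_cut]:
            codes[s] += "0"
        for s, _ in group[best_cut:]:
            codes[s] += "1"
        split(group[:best_cut])
        split(group[best_cut:])

    split(symbols)
    return codes

print(shannon_fano([("a", 0.4), ("b", 0.3), ("c", 0.2), ("d", 0.1)]))
# -> {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```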