Publications
Publications by category in reverse chronological order.
2025
- Taxi1500: A multilingual dataset for text classification in 1500 languages. Chunlan Ma, Ayyoob ImaniGooghari, Haotian Ye, and 3 more authors. 2025.
While natural language processing tools have been developed extensively for some of the world’s languages, a significant portion of the world’s over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset by first developing applicable topics and employing a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.
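The projection step described in the abstract (annotate the English side, then transfer labels to other languages through aligned verses) can be sketched as follows. This is a minimal illustration of annotation projection in general, not the released Taxi1500 code; the verse IDs, topic labels, and texts below are invented for the example.

```python
# Sketch of annotation projection across parallel Bible verses.
# The verse ID acts as the alignment key: a topic label assigned to an
# English verse is copied to the same verse in every other language.
# All verse IDs, labels, and texts here are illustrative only.

english_labels = {
    "GEN_1_1": "creation",   # hypothetical topic labels on the English side
    "GEN_1_2": "creation",
    "EXO_3_14": "identity",
}

parallel_corpus = {
    "deu": {"GEN_1_1": "Am Anfang schuf Gott ...", "EXO_3_14": "..."},
    "fra": {"GEN_1_1": "Au commencement ...", "GEN_1_2": "..."},
}

def project_labels(english_labels, parallel_corpus):
    """Build (text, label) pairs per target language by shared verse ID."""
    projected = {}
    for lang, verses in parallel_corpus.items():
        projected[lang] = [
            (text, english_labels[vid])
            for vid, text in verses.items()
            if vid in english_labels  # keep only verses labeled in English
        ]
    return projected

datasets = project_labels(english_labels, parallel_corpus)
print(datasets["deu"][0][1])  # label projected onto the German verse
```

Because the Bible corpus is verse-aligned, the same dictionary-join generalizes to any number of target languages with no per-language annotation effort.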
2024
- The Touché23-ValueEval Dataset for Identifying Human Values behind Arguments. Nailia Mirzakhmedova, Johannes Kiesel, Milad Alshomary, and 10 more authors. 2024.
While human values play a crucial role in making arguments persuasive, we currently lack the necessary extensive datasets to develop methods for analyzing the values underlying these arguments on a large scale. To address this gap, we present the Touché23-ValueEval dataset, an expansion of the Webis-ArgValues-22 dataset. We collected and annotated an additional 4780 new arguments, doubling the dataset’s size to 9324 arguments. These arguments were sourced from six diverse sources, covering religious texts, community discussions, free-text arguments, newspaper editorials, and political debates. Each argument is annotated by three crowdworkers for 54 human values, following the methodology established in the original dataset. The Touché23-ValueEval dataset was utilized in SemEval-2023 Task 4 (ValueEval: Identification of Human Values behind Arguments), where an ensemble of transformer models demonstrated state-of-the-art performance. Furthermore, our experiments show that a fine-tuned large language model, Llama-2-7B, achieves comparable results.
- Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language? Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, and 5 more authors. 2024.
Evaluating Large Language Models (LLMs) is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of various topics, including literary comprehension, mathematics, sciences, logic, intelligence testing, etc., aimed at assessing different facets of LLMs such as language comprehension, reasoning, and information retrieval across various educational stages, from lower primary school to upper secondary school; (ii) its inclusion of rich metadata such as human response rates, difficulty levels, and descriptive answers; (iii) its utilization of new data to avoid data contamination issues prevalent in existing frameworks; (iv) its use of original, non-translated data tailored for Persian speakers, ensuring the framework is free from translation challenges and errors while encompassing cultural nuances; and (v) its inherent scalability for future data updates and evaluations without requiring special human effort. Previous works lacked an evaluation framework that combined all of these …
- Assessing computational predictions of antimicrobial resistance phenotypes from microbial genomes. Kaixin Hu, Fernando Meyer, Zhi-Luo Deng, and 4 more authors. Briefings in Bioinformatics, 2024.
The advent of rapid whole-genome sequencing has created new opportunities for computational prediction of antimicrobial resistance (AMR) phenotypes from genomic data. Both rule-based and machine learning (ML) approaches have been explored for this task, but systematic benchmarking is still needed. Here, we evaluated four state-of-the-art ML methods (Kover, PhenotypeSeeker, Seq2Geno2Pheno and Aytan-Aktug), an ML baseline and the rule-based ResFinder by training and testing each of them across 78 species–antibiotic datasets, using a rigorous benchmarking workflow that integrates three evaluation approaches, each paired with three distinct sample splitting methods. Our analysis revealed considerable variation in the performance across techniques and datasets. Whereas ML methods generally excelled for closely related strains, ResFinder excelled for handling divergent genomes. Overall …
- AIMA at SemEval-2024 Task 10: History-based emotion recognition in Hindi-English code-mixed conversations. Mohammad Mahdi Abootorabi, Nona Ghazizadeh, Seyed Arshan Dalili, and 3 more authors. SemEval-2024, 2024.
In this study, we introduce a solution to the SemEval 2024 Task 10 on subtask 1, dedicated to Emotion Recognition in Conversation (ERC) in code-mixed Hindi-English conversations. ERC in code-mixed conversations presents unique challenges, as existing models are typically trained on monolingual datasets and may not perform well on code-mixed data. To address this, we propose a series of models that incorporate both the previous and future context of the current utterance, as well as the sequential information of the conversation. To facilitate the processing of code-mixed data, we developed a Hinglish-to-English translation pipeline to translate the code-mixed conversations into English. We designed four different base models, each utilizing powerful pre-trained encoders to extract features from the input but with varying architectures. By ensembling all of these models, we developed a final model that outperforms all other baselines.
- SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation. Farhan Farsi, Sadra Sabouri, Kian Kashfipour, and 3 more authors. 2024.
Generating coherent and comprehensive responses remains a significant challenge for Question-Answering (QA) systems when working with short answers, especially for low-resource languages like Farsi. We present a novel approach to expand these answers into complete, fluent responses, addressing the critical issue of limited Farsi resources and models. Our methodology employs a two-stage process: first, we develop a dataset using rule-based techniques on Farsi text, followed by a BERT-based ranking system to ensure fluency and comprehensibility. The resulting model demonstrates strong compatibility with existing QA systems, particularly those based on knowledge graphs. Notably, our system exhibits enhanced performance when integrated with large language models using Chain-of-Thought (CoT) prompting, leveraging detailed explanations rather than single-word answers. Our approach significantly improves response quality and coherence compared to baseline systems. We release our dataset to support further research in Farsi QA.
2023
- Ebhaam at SemEval-2023 Task 1: A CLIP-based approach for comparing cross-modality and unimodality in visual word sense disambiguation. Zeinab Taghavi, Parsa Haghighi Naeini, Mohammad Ali Sadraei Javaheri, and 4 more authors. 2023.
This paper presents an approach to tackle the task of Visual Word Sense Disambiguation (Visual-WSD), which involves determining the most appropriate image to represent a given polysemous word in one of its particular senses. The proposed approach leverages the CLIP model, prompt engineering, and text-to-image models such as GLIDE and DALL-E 2 for both image retrieval and generation. To evaluate our approach, we participated in the SemEval 2023 shared task on “Visual Word Sense Disambiguation (Visual-WSD)” using a zero-shot learning setting, where we compared the accuracy of different combinations of tools, including “Simple prompt-based” methods and “Generated prompt-based” methods for prompt engineering using completion models, and text-to-image models for changing input modality from text to image. Moreover, we explored the benefits of cross-modality evaluation between text and candidate images using CLIP. Our experimental results demonstrate that the proposed approach reaches better results than cross-modality approaches, highlighting the potential of prompt engineering and text-to-image models to improve accuracy in Visual-WSD tasks. We assessed our approach in a zero-shot learning scenario and attained an accuracy of 68.75% in our best attempt.
- SinaAI at SemEval-2023 Task 3: A multilingual transformer language model-based approach for the detection of news genre, framing and persuasion techniques. Aryan Sadeghi, Reza Alipour, Kamyar Taeb, and 3 more authors. 2023.
This paper describes SinaAI’s participation in SemEval-2023 Task 3, which involves detecting propaganda in news articles across multiple languages. The task comprises three sub-tasks: (i) genre detection, (ii) news framing, and (iii) persuasion technique identification. The employed dataset includes news articles in nine languages: English, French, Italian, German, Polish, Russian, Georgian, Greek, and Spanish, with labeled instances of news framing, genre, and persuasion techniques. Our approach combines fine-tuning multilingual language models such as XLM, LaBSE, and mBERT with data augmentation techniques. Our experimental results show that XLM outperforms the other models in terms of F1-Micro and F1-Macro, and that an ensemble of XLM and LaBSE achieved the best performance. Our study highlights the effectiveness of multilingual sentence embedding models in multilingual propaganda detection. Our models achieved the highest scores for two languages (Greek and Italian) in sub-task 1 and for one language (Russian) in sub-task 2.
- Borderless Azerbaijani processing: Linguistic resources and a transformer-based approach for Azerbaijani transliteration. Reihaneh Zohrabi, Mostafa Masumi, Omid Ghahroodi, and 4 more authors. 2023.
Recent advancements in neural language models have revolutionized natural language understanding. However, many languages still face the risk of being left behind without the benefits of such advancements, potentially leading to their extinction. One such language is Azerbaijani in Iran, which suffers from limited digital resources and a lack of alignment between spoken and written forms. In contrast, Azerbaijani in the Republic of Azerbaijan has seen more resources and is not considered as low-resource as its Iranian counterpart. In this context, our research focuses on the computational progress made for the Iranian Azerbaijani language. We propose a transliteration model that leverages an Azerbaijani parallel dataset, effectively bridging the gap between the Latin and Persian scripts. By enabling seamless communication between these two scripts, our model facilitates cultural exchange and serves as a valuable tool for transfer learning. The effectiveness of our approach surpasses traditional rule-based methods, as evidenced by significant improvements in performance metrics: we observe a minimum 15% increase in BLEU scores and a reduction of at least one third in edit distance. Furthermore, our model’s online demo is accessible at https://azeri.parsi.ai/.
- SUT at SemEval-2023 Task 1: Prompt generation for visual word sense disambiguation. Omid Ghahroodi, Seyed Arshan Dalili, Sahel Mesforoush, and 1 more author. 2023.
Visual Word Sense Disambiguation (V-WSD) identifies the correct visual sense of a multi-sense word in a specific context. This can be challenging as images may need to provide additional context and words may have multiple senses. A proper V-WSD system can benefit applications like image retrieval and captioning. This paper proposes a Prompt Generation approach to solve this challenge. This approach improves the robustness of language-image models like CLIP to contextual ambiguities and helps them better correlate between textual and visual contexts of different senses of words.
2022
- Peptide microarrays coupled to machine learning reveal individual epitopes from human antibody responses with neutralizing capabilities against SARS-CoV-2. Sven-Kevin Hotop, Susanne Reimering, Aditya Shekhar, and 13 more authors. Emerging Microbes & Infections, 2022.
The coronavirus SARS-CoV-2 is the causative agent for the disease COVID-19. To capture the IgA, IgG, and IgM antibody response of patients infected with SARS-CoV-2 at individual epitope resolution, we constructed planar microarrays of 648 overlapping peptides that cover the four major structural proteins S(pike), N(ucleocapsid), M(embrane), and E(nvelope). The arrays were incubated with sera of 67 SARS-CoV-2 positive and 22 negative control samples. Specific responses to SARS-CoV-2 were detectable, and nine peptides were associated with a more severe course of the disease. A random forest model disclosed that antibody binding to 21 peptides, mostly localized in the S protein, was associated with higher neutralization values in cellular anti-SARS-CoV-2 assays. For antibodies addressing the N-terminus of M, or peptides close to the fusion region of S, protective effects were proven by antibody …
- Keyword-based natural language premise selection for an automatic mathematical statement proving. Doratossadat Dastgheib and Ehsaneddin Asgari. 2022.
Extraction of supportive premises for a mathematical problem can contribute to profound success in improving automatic reasoning systems. One bottleneck in automated theorem proving is the lack of a proper semantic information retrieval system for mathematical texts. In this paper, we show the effect of keyword extraction in the natural language premise selection (NLPS) shared task proposed in TextGraph-16 that seeks to select the most relevant sentences supporting a given mathematical statement.
- GO Bench: Shared-hub for universal benchmarking of machine learning-based protein functional annotations. Andrew M. Dickson*, Ehsaneddin Asgari*, Alice C. McHardy, and 1 more author. Bioinformatics Journal, 2022.
Gene annotation is the problem of mapping proteins to their functions represented as Gene Ontology (GO) terms, typically inferred based on the primary sequences. Gene annotation is a multi-label multi-class classification problem, which has generated growing interest for its uses in the characterization of millions of proteins with unknown functions. However, there is no standard GO dataset used for benchmarking newly developed machine learning models within the bioinformatics community. Thus, the significance of improvements for these models remains unclear.
2021
- EpitopeVec: linear epitope prediction using deep protein sequence embeddings. Akash Bahai, Ehsaneddin Asgari, Mohammad RK Mofrad, and 2 more authors. Bioinformatics, 2021.
B-cell epitopes (BCEs) play a pivotal role in the development of peptide vaccines, immuno-diagnostic reagents and antibody production, and thus in infectious disease prevention and diagnostics in general. Experimental methods used to determine BCEs are costly and time-consuming. Therefore, it is essential to develop computational methods for the rapid identification of BCEs. Although several computational methods have been developed for this task, generalizability is still a major concern, where cross-testing of the classifiers trained and tested on different datasets has revealed accuracies of 51–53%. We describe a new method called EpitopeVec, which uses a combination of residue properties, modified antigenicity scales, and protein language model-based representations (protein vectors) as features of peptides for linear BCE predictions. Extensive …
- TripletProt: deep representation learning of proteins based on Siamese networks. Esmaeil Nourani, Ehsaneddin Asgari, Alice C McHardy, and 1 more author. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2021.
Pretrained representations have recently gained attention in various machine learning applications. Nonetheless, the high computational costs associated with training these models have motivated alternative approaches for representation learning. Herein we introduce TripletProt, a new approach for protein representation learning based on Siamese neural networks. Representation learning of biological entities that captures essential features can alleviate many of the challenges associated with supervised learning in bioinformatics. The most important distinction of our proposed method is its reliance on the protein-protein interaction (PPI) network. The computational cost of the generated representations for any potential application is significantly lower than that of comparable methods, since the length of the representations is significantly smaller than in other approaches. TripletProt offers great potential for the …
- KnowMAN: Weakly Supervised Multinomial Adversarial Networks. Luisa März, Ehsaneddin Asgari, Fabienne Braune, and 2 more authors. Empirical Methods in Natural Language Processing (EMNLP), 2021.
The absence of labeled data for training neural models is often addressed by leveraging knowledge about the specific task, resulting in heuristic but noisy labels. The knowledge is captured in labeling functions, which detect certain regularities or patterns in the training samples and annotate corresponding labels for training. This process of weakly supervised training may result in an over-reliance on the signals captured by the labeling functions and hinder models from exploiting other signals or from generalizing well. We propose KnowMAN, an adversarial scheme that enables control over the influence of signals associated with specific labeling functions. KnowMAN forces the network to learn representations that are invariant to those signals and to pick up other signals that are more generally associated with an output label. KnowMAN strongly improves results compared to direct weakly supervised learning with a pre-trained transformer language model and a feature-based baseline.
2020
- Predicting antimicrobial resistance in Pseudomonas aeruginosa with machine learning-enabled molecular diagnostics. Ariane Khaledi, Aaron Weimann, Monika Schniederjans, and 12 more authors. EMBO Molecular Medicine, 2020.
Limited therapy options due to antibiotic resistance underscore the need for optimization of current diagnostics. In some bacterial species, antimicrobial resistance can be unambiguously predicted based on their genome sequence. In this study, we sequenced the genomes and transcriptomes of 414 drug‐resistant clinical Pseudomonas aeruginosa isolates. By training machine learning classifiers on information about the presence or absence of genes, their sequence variation, and expression profiles, we generated predictive models and identified biomarkers of resistance to four commonly administered antimicrobial drugs. Using these data types alone or in combination resulted in high (0.8–0.9) or very high (> 0.9) sensitivity and predictive values. For all drugs except for ciprofloxacin, gene expression information improved diagnostic performance. Our results pave the way for the development of a molecular …
- EmbLexChange at SemEval-2020 Task 1: Unsupervised embedding-based detection of lexical semantic changes. Ehsaneddin Asgari, Christoph Ringlstetter, and Hinrich Schütze. 2020.
This paper describes EmbLexChange, a system introduced by the “Life-Language” team for SemEval-2020 Task 1 on unsupervised detection of lexical-semantic changes. EmbLexChange is defined as the divergence between the embedding-based profiles of a word w (calculated with respect to a set of reference words) in the source and the target domains (the source and target domains can simply be two time frames t_1 and t_2). The underlying assumption is that a lexical-semantic change of word w would affect its co-occurring words and subsequently alter the neighborhoods in the embedding spaces. We show that by using a resampling framework for the selection of reference words (with conserved senses), we can more reliably detect lexical-semantic changes in English, German, Swedish, and Latin. EmbLexChange achieved second place in the binary detection of semantic changes in SemEval-2020.
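The profile idea in the abstract above can be illustrated with a toy sketch: a word's profile is its vector of similarities to a fixed set of reference words, computed separately in each time period's embedding space, and change is scored as the divergence between the two profiles. The vectors, reference words, and the use of Jensen-Shannon divergence below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def profile(word_vec, ref_vecs):
    """Similarity profile of a word against reference words, as a distribution."""
    sims = ref_vecs @ word_vec / (
        np.linalg.norm(ref_vecs, axis=1) * np.linalg.norm(word_vec)
    )
    expo = np.exp(sims)          # softmax turns similarities into probabilities
    return expo / expo.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence between two profiles (symmetric, finite)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

refs = np.eye(5, 8)                      # toy reference-word vectors
w_t1 = np.eye(1, 8)[0]                   # word vector in time frame t_1
w_t2_stable = np.array([0.9, 0.1, 0, 0, 0, 0, 0, 0.])  # barely moved in t_2
w_t2_shifted = np.eye(1, 8, 3)[0]        # now aligned with a different ref

d_stable = js_divergence(profile(w_t1, refs), profile(w_t2_stable, refs))
d_shifted = js_divergence(profile(w_t1, refs), profile(w_t2_shifted, refs))
print(d_stable < d_shifted)  # larger divergence suggests semantic change
```

Because the profile is computed relative to reference words rather than raw coordinates, the two embedding spaces never need to be aligned with each other.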
- UniSent: Universal Adaptable Sentiment Lexica for 1000+ Languages. Ehsaneddin Asgari, Fabienne Braune, Benjamin Roth, and 2 more authors. Proceedings of the 12th Language Resources and Evaluation Conference (LREC), 2020.
In this paper, we introduce UniSent, universal sentiment lexica for 1000+ languages. Sentiment lexica are vital for sentiment analysis in the absence of document-level annotations, a very common scenario for low-resource languages. To the best of our knowledge, UniSent is the largest sentiment resource to date in terms of the number of covered languages, including many low-resource ones. In this work, we use a massively parallel Bible corpus to project sentiment information from English to other languages for sentiment analysis on Twitter data. We introduce a method called DomDrift to mitigate the huge domain mismatch between Bible and Twitter by a confidence weighting scheme that uses domain-specific embeddings to compare the nearest neighbors of a candidate sentiment word in the source (Bible) and target (Twitter) domains. We evaluate the quality of UniSent on a subset of languages for which manually created ground truth was available: Macedonian, Czech, German, Spanish, and French. We show that the quality of UniSent is comparable to that of manually created sentiment resources when it is used as the sentiment seed for the task of word sentiment prediction on top of embedding representations. In addition, we show that emoticon sentiments can be reliably predicted in the Twitter domain using only UniSent and monolingual embeddings in German, Spanish, French, and Italian. With the publication of this paper, we release the UniSent sentiment lexica.
- Subword sampling for low resource word alignment. Ehsaneddin Asgari, Masoud Jalili Sabet, Philipp Dufter, and 2 more authors. arXiv preprint arXiv:2012.11657, 2020.
Annotation projection is an important area in NLP that can greatly contribute to creating language resources for low-resource languages. Word alignment plays a key role in this setting. However, most existing word alignment methods are designed for the high-resource setting of machine translation, where millions of parallel sentences are available. This amount shrinks to a few thousand sentences when dealing with low-resource languages, where the established IBM models fail. In this paper, we propose subword sampling-based alignment of text units. This method’s hypothesis is that aggregating different granularities of text for certain language pairs can help word-level alignment. For languages for which gold-standard alignments exist, we propose an iterative Bayesian optimization framework to optimize the selection of subwords from the space of possible subword representations of the source and target sentences. We show that the subword sampling method consistently outperforms word-level alignment on six language pairs: English-German, English-French, English-Romanian, English-Persian, English-Hindi, and English-Inuktitut. In addition, we show that the hyperparameters learned for certain language pairs can be applied to other languages without supervision and consistently improve the alignment results. We observe that using parallel sentences together with our proposed subword sampling approach, we obtain similar F1 scores to the use of ’s of parallel sentences in existing word-level fast-align/eflomal alignment methods.
2019
- The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Naihui Zhou, Yuxiang Jiang, Timothy R Bergquist, and 147 more authors. Genome Biology, 2019.
The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here, we report on the results of the third CAFA challenge, CAFA3, which featured an expanded analysis over the previous CAFA rounds, both in terms of the volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility …
- Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Ehsaneddin Asgari, Alice C McHardy, and Mohammad RK Mofrad. Scientific Reports, 2019.
In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning tasks in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an …
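The peptide-pair encoding described above follows the byte-pair encoding (BPE) recipe named in the abstract. A minimal, generic BPE-style merge loop on sequences is sketched below; this is not the authors' PPE implementation (which additionally adds a sampling framework for multiple segmentations), and the toy peptide corpus is invented.

```python
from collections import Counter

def learn_merges(sequences, num_merges):
    """Learn BPE-style merge operations over a corpus of symbol sequences.

    Repeatedly replaces the most frequent adjacent pair of units with a
    single merged unit, as in byte-pair encoding; applied to protein
    sequences, this yields variable-length sub-sequence units.
    """
    corpus = [list(seq) for seq in sequences]  # start from single symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for units in corpus:
            pairs.update(zip(units, units[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent pair
        merges.append((a, b))
        for units in corpus:
            i = 0
            while i < len(units) - 1:
                if units[i] == a and units[i + 1] == b:
                    units[i:i + 2] = [a + b]  # merge the pair in place
                else:
                    i += 1
    return merges, corpus

# Toy peptide corpus: "AG" is the most frequent pair and is merged first.
merges, segmented = learn_merges(["MAGAG", "AGKAG", "MAGK"], num_merges=2)
print(merges[0])  # ('A', 'G')
```

The learned merge list plays the role of the segmentation model: applying the same merges, in order, to an unseen sequence reproduces the trained segmentation.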
- DeepPrime2Sec: deep learning for protein secondary structure prediction from the primary sequences. Ehsaneddin Asgari, Nina Poerner, Alice C McHardy, and 1 more author. BioRxiv, 2019.
Here we investigate deep learning-based prediction of protein secondary structure from the protein primary sequence. We study the function of different features in this task, including one-hot vectors, biophysical features, protein sequence embedding (ProtVec), deep contextualized embedding (known as ELMo), and the Position Specific Scoring Matrix (PSSM). In addition to the role of features, we evaluate various deep learning architectures, including the following models/mechanisms and certain combinations: Bidirectional Long Short-Term Memory (BiLSTM), convolutional neural network (CNN), highway connections, attention mechanism, recurrent neural random fields, and gated multi-scale CNN. Our results suggest that PSSM concatenated to one-hot vectors is the most important feature for the task of secondary structure prediction. Utilizing the CNN-BiLSTM network, we achieved accuracies of 69.9% and 70.4% (using an ensemble of the top-k models) for 8-class protein secondary structure prediction on the CB513 dataset, the most challenging dataset for protein secondary structure prediction. Through error analysis on the best performing model, we showed that misclassification is significantly more common at positions that undergo secondary structure transitions, which is most likely due to inaccurate assignments of the secondary structure at boundary regions. Notably, when ignoring amino acids at secondary structure transitions in the evaluation, the accuracy increases to 90.3%. Furthermore, the best performing model mostly mistook similar structures for one another, indicating that the deep learning model inferred high …
- DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection. Ehsaneddin Asgari, Philipp C Münch, Till R Lesker, and 2 more authors. Bioinformatics, 2019.
Identifying distinctive taxa for microbiome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, substitutes standard operational taxonomic unit (OTU) clustering by segmenting 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively to …
- Life language processing: deep learning-based language-agnostic processing of proteomics, genomics/metagenomics, and human languages. Ehsaneddin Asgari. 2019.
A broad and simple definition of ‘language’ is a set of sequences constructed from a finite set of symbols. By this definition, biological sequences, human languages, and many sequential phenomena that exist in the world can be viewed as languages. Although this definition is simple, it includes languages employing very complicated grammars in the creation of their sequences of symbols. Examples are the biophysical principles governing biological sequences (e.g., DNA, RNA, and protein sequences), as well as the grammars of human languages determining the structure of clauses and sentences. This dissertation uses a language-agnostic point of view in the processing of both biological sequences and human languages. Two main strategies are adopted toward this purpose: (i) character-level, or more accurately, subsequence-level processing of languages, which allows for simple modeling of the sequence similarities …
- Deep genomics and proteomics: Language model-based embedding of biological sequences and their applications in bioinformatics. Ehsaneddin Asgari and Mohammad RK Mofrad. 2019.
Biophysical and biochemical principles govern biological sequences (e.g., DNA, RNA, and protein sequences) similar to the way the grammar of a natural language determines the structure of clauses and sentences. This analogy motivates “life language processing,” that is, treating biological sequences as the output of a certain language and adopting/developing language processing methods to perform analyses and predictions in that language. In this chapter, we present two specific tasks related to life language processing: (1) Developing language-model based representations for biological sequences: the large gap between the number of known sequences (raw data) versus the number of known functions/structures associated with these sequences (metadata), encourages us to develop methods that can obtain prior knowledge from the existing sequences to be used in bioinformatics tasks (e.g., protein …
2018
- MicroPheno: Predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples. Ehsaneddin Asgari, Kiavash Garakani, Alice Carolyn McHardy, and 1 more author. Bioinformatics, 2018.
Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis with proved applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as …
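The k-mer representation mentioned in the abstract above is generic and easy to sketch: each read maps to a normalized vector of k-mer frequencies, and a (sub-)sample of reads can be summarized by averaging those vectors. This is an illustration of the general technique, not the MicroPheno code; the reads below are made up.

```python
from collections import Counter
from itertools import product

def kmer_vector(read, k=3, alphabet="ACGT"):
    """Normalized k-mer frequency vector of a single 16S rRNA read."""
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = max(sum(counts.values()), 1)  # guard against reads shorter than k
    return [counts[km] / total for km in vocab]

def sample_representation(reads, k=3):
    """Represent a (sub-)sample of reads as the mean of its k-mer vectors."""
    vectors = [kmer_vector(r, k) for r in reads]
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Toy 16S fragment reads (invented); k=3 gives a 4^3 = 64-dim representation.
rep = sample_representation(["ACGTACGT", "TACGTTAC"], k=3)
print(len(rep))  # 64
```

Because the vector dimension depends only on k and the alphabet, representations of shallow sub-samples of different sizes remain directly comparable, which is what makes the bootstrapping over sub-samples described above possible.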
- Molecular insights into the mechanisms of SUN1 oligomerization in the nuclear envelopeZeinab Jahed, Darya Fadavi, Uyen T Vu, and 3 more authorsBiophysical journal, 2018
The LINC complex is found in a wide variety of organisms and is formed by the transluminal interaction between outer- and inner-nuclear-membrane KASH and SUN proteins, respectively. Most extensively studied are SUN1 and SUN2 proteins, which are widely expressed in mammals. Although SUN1 and SUN2 play functionally redundant roles in several cellular processes, more recent studies have revealed diverse and distinct functions for SUN1. While several recent in vitro structural studies have revealed the molecular details of various fragments of SUN2, no such structural information is available for SUN1. Herein, we conduct a systematic analysis of the molecular relationships between SUN1 and SUN2, highlighting key similarities and differences that could lead to clues into their distinct functions. We use a wide range of computational tools, including multiple sequence alignments, homology modeling …
2017
- Past, present, future: a computational investigation of the typology of tense in 1000 languagesEhsaneddin Asgari, and Hinrich Schütze2017
We present SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting, to the best of our knowledge, the largest crosslingual computational study performed to date. We extend existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: we only require that a linguistic feature be overtly marked in a few of the thousands of languages, as opposed to requiring that it be marked in all languages under investigation.
- Overview of character-based models for natural language processingHeike Adel, Ehsaneddin Asgari, and Hinrich Schütze2017
Character-based models have become increasingly popular for natural language processing tasks, especially due to the success of neural networks. They make it possible to model text sequences directly, without the need for tokenization, and therefore streamline the traditional preprocessing pipeline. This paper provides an overview of character-based models for a variety of natural language processing tasks. We group existing work into three categories: tokenization-based approaches, bag-of-n-gram models, and end-to-end models. For each category, we present prominent examples of studies, with a particular focus on recent character-based deep learning work.
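As a minimal illustration of the bag-of-n-gram category mentioned above, a character-level featurizer needs no tokenization at all. This is a toy sketch; the `char_ngrams` helper and the `#` boundary marker are illustrative assumptions, not taken from the paper:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Bag of character n-grams extracted directly from raw text.
    '#' marks the start and end of the string so boundary n-grams
    are distinguished from internal ones."""
    padded = f"#{text}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
```

The resulting counts can be fed to any standard classifier, sidestepping the tokenizer entirely.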
- Measuring Countries’ Human Rights Positions in UN Universal Periodic ReviewEhsaneddin Asgari, and Ali Sanaei2017
We use country reviews in the United Nations’ Universal Periodic Review (UPR) to obtain multi-dimensional measures of similarity between countries. The UPR is a mechanism that involves a review of all UN member states. The process is designed to treat all states equally and to be cognizant of the level of development and specificities of the countries under review. One of the unique aspects of the UPR is the peer review, in which any state can offer recommendations to the state under review. The first review cycle was conducted from 2008 to 2012, and the second cycle finished in November 2016. We obtain measures of similarity based on the similarity of the recommendations that states gave to other states, and also based on the similarity of the recommendations that states received from other states.
2016
- Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence (WELD) as a Quantitative Measure of Language DistanceEhsaneddin Asgari, and Mohammad R.K. Mofrad2016
We introduce a new measure of distance between languages based on word embeddings, called word embedding language divergence (WELD). WELD is defined as the divergence between the unified similarity distributions of words across languages. Using this measure, we perform language comparison for fifty natural languages and twelve genetic languages. Our natural language dataset is a collection of sentence-aligned parallel corpora from Bible translations for fifty languages spanning a variety of language families. Although we use parallel corpora, which guarantees the same content in all languages, in many cases languages within the same family interestingly cluster together. In addition to natural languages, we perform language comparison for the coding regions in the genomes of 12 different organisms (4 plants, 6 animals, and 2 human subjects). Our results confirm a significant high-level difference between the genetic language models of humans/animals and those of plants. The proposed method is a step toward defining a quantitative measure of similarity between languages, with applications in language classification, genre identification, dialect identification, and the evaluation of translations.
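A rough sketch of the idea behind WELD, under strong simplifying assumptions: represent each language by a histogram of pairwise cosine similarities between its word embeddings, and compare the histograms with Jensen-Shannon divergence. The helper names, the fixed 20-bin histogram, and the choice of JS divergence here are illustrative, not the paper's exact formulation:

```python
import math

def _cos(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_distribution(embeddings, n_bins=20):
    """Histogram of pairwise cosine similarities over [-1, 1],
    normalized to a probability distribution."""
    words = sorted(embeddings)
    sims = [_cos(embeddings[a], embeddings[b])
            for a in words for b in words if a < b]
    bins = [0] * n_bins
    for s in sims:
        idx = min(int((s + 1) / 2 * n_bins), n_bins - 1)
        bins[idx] += 1
    total = sum(bins) or 1
    return [b / total for b in bins]

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0 and y > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Two languages whose vocabularies induce similar similarity structure would score close to zero under this sketch; larger values suggest divergent structure.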
- Nonsymbolic Text RepresentationHinrich Schuetze, Heike Adel, and Ehsaneddin AsgariarXiv preprint arXiv:1610.00479, 2016
We introduce the first generic text representation model that is completely nonsymbolic, i.e., it does not require the availability of a segmentation or tokenization method that attempts to identify words or other symbolic units in text. This applies to training the parameters of the model on a training corpus as well as to applying it when computing the representation of a new text. We show that our model performs better than prior work on an information extraction and a text denoising task.
- Text Analysis and Automatic Triage of Posts in a Mental Health ForumEhsaneddin Asgari, Soroush Nasiriany, and Mohammad Mofrad2016
We present an approach for the automatic triage of message posts in the ReachOut.com mental health forum, which was a shared task at the 2016 Computational Linguistics and Clinical Psychology (CLPsych) workshop. This effort is aimed at providing the trained moderators of ReachOut.com with a systematic triage of forum posts, enabling them to more efficiently support the young users, aged 14-25, communicating with each other about their issues. We use different features and classifiers to predict the users’ mental health states, marked as green, amber, red, and crisis. Our results show that random forests significantly outperform our baseline multi-class SVM classifier. In addition, we perform a feature importance analysis to characterize the key features in identifying critical posts.
2015
- Continuous distributed representation of biological sequences for deep proteomics and genomicsEhsaneddin Asgari, and Mohammad RK MofradPloS one, 2015
We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) for biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors, which can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use the ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that …
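The preprocessing behind this representation can be sketched as follows: each sequence is split into k lists of non-overlapping k-mers, one per reading frame, which serve as "sentences" for a word2vec-style model; a whole protein is then represented by summing the learned vectors of its k-mers. The function names and the toy embedding dictionary below are illustrative assumptions, not the paper's code:

```python
def to_kmer_sentences(sequence, k=3):
    """Split a sequence into k lists of non-overlapping k-mers,
    one per reading frame (offsets 0..k-1); any trailing residues
    shorter than k are dropped."""
    sentences = []
    for offset in range(k):
        frame = sequence[offset:]
        sentences.append([frame[i:i + k]
                          for i in range(0, len(frame) - k + 1, k)])
    return sentences

def protein_vector(sequence, kmer_embeddings, k=3):
    """Represent a protein as the element-wise sum of its k-mer vectors.
    kmer_embeddings is a hypothetical mapping from k-mer to a trained
    dense vector; unseen k-mers contribute nothing."""
    dims = len(next(iter(kmer_embeddings.values())))
    vec = [0.0] * dims
    for sentence in to_kmer_sentences(sequence, k):
        for kmer in sentence:
            for j, x in enumerate(kmer_embeddings.get(kmer, [0.0] * dims)):
                vec[j] += x
    return vec
```

In practice the k-mer sentences would be fed to an off-the-shelf skip-gram trainer to learn the embedding table; the sketch only shows the sequence-to-vector plumbing.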
2014
- Integration of scientific and social networksMahmood Neshati, Djoerd Hiemstra, Ehsaneddin Asgari, and 1 more authorWorld wide web, 2014
In this paper, we address the problem of scientific-social network integration to find a matching relationship between members of these networks (i.e., the DBLP publication network and the Twitter social network). This task is a crucial step toward building a multi-environment expert finding system, which has recently attracted much attention in the Information Retrieval community. In this paper, the problem of social and scientific network integration is divided into two sub-problems. The first concerns finding those profiles in one network which presumably have a corresponding profile in the other network; the second concerns name disambiguation, i.e., finding the true matching profile among several candidate profiles. Utilizing several name similarity patterns and contextual properties of these networks, we design a focused crawler to find highly probable matching pairs; then the problem …
2013
- Linguistic Resources & Topic Models for the Analysis of Persian PoemsEhsaneddin Asgari, and Jean-Cédric Chappelier2013
This paper describes the use of Natural Language Processing tools, mostly probabilistic topic modeling, to study semantics (word correlations) in a collection of Persian poems consisting of roughly 18k poems from 30 different poets. For this study, we put substantial effort into the preprocessing and into the development of a large-coverage lexicon supporting both modern and ancient Persian. In the analysis step, we obtained meaningful results regarding the correlation between poets and topics, their evolution through time, as well as the correlation between the topics and the metre used in the poems. This work should thus provide valuable results for literature researchers, especially those working on stylistics or comparative literature.
- Confirming the themes and interpretive unity of Ghazal poetry using topic modelsEhsaneddin Asgari, Marzyeh Ghassemi, and Mark Alan Finlayson2013
We apply topic modeling to classifying the genre of the Ghazal, a form common in Persian poetry. We show that a classifier based on automatically generated topics exposes important information with only a small performance penalty: the top discriminative topics can be manually aligned with themes prevalent in the associated genres, as identified by scholars of literature. We also weigh in on a long-standing debate about the interpretive unity of the Ghazal. In particular, we show evidence that, on average, Ghazals seem to have interpretive unity at the level of the full poem, as opposed to just at the level of the couplet. Our dataset is a collection of almost 18,000 Ghazals, comprising over 3 million words. The collection contains poems from 30 different poets and spans nearly 900 years (1080 AD–1968). Generative models tend to perform worse than discriminative models in classification tasks. The hope is, however, that a generative model provides additional insight by allowing decisions to be understood with reference to the generated explanation. We show that this hope holds true for topic models [1] in two specific cases in the domain of the Ghazal. First, we show that, for a 3-class genre classification task, the most discriminative topics correspond to known themes prevalent in the associated genres. These themes have been identified by scholars of Persian poetry, and we manually align the computationally extracted topics with these human-identified themes.
- A Joint Classification Method to Integrate Scientific and Social NetworksMahmood Neshati, Ehsaneddin Asgari, Djoerd Hiemstra, and 1 more author2013
In this paper, we address the problem of scientific-social network integration to find a matching relationship between members of these networks. Utilizing several name similarity patterns and contextual properties of these networks, we design a focused crawler to find highly probable matching pairs; the problem of name disambiguation is then reduced to predicting the label of each candidate pair as either a true or a false match. By defining a matching dependency graph, we propose a joint label prediction model to determine the labels of all candidate pairs simultaneously. An extensive set of experiments has been conducted on six test collections obtained from the DBLP and Twitter networks to show the effectiveness of the proposed joint label prediction model.