ML p(r)ior | Efficient Index for Weighted Sequences

### Efficient Index for Weighted Sequences

2016-02-02
1602.01116 | cs.DS
The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in the text is at least $1/z$. In this article, we present an $O(nz)$-time construction of an $O(nz)$-sized index that can answer pattern matching queries in a weighted text in optimal time improving upon the state of the art by a factor of $z \log z$. Other applications of this data structure include an $O(nz)$-time construction of the weighted prefix table and an $O(nz)$-time computation of all covers of a weighted sequence, which improve upon the state of the art by the same factor.

Highlights - Most important sentences from the article

# Related Articles

2019-04-08
1904.04228 | cs.DS

Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of lengt… show more

Highlights - Most important sentences from the article

2018-11-30
1811.12779 | cs.DS

We describe the first self-indexes able to count and locate pattern occurrences in optimal time with… show more

Highlights - Most important sentences from the article

2019-04-09
1904.04513 | cs.DS

The suffix tree, DAWG, and CDAWG are fundamental indexing structures of a string, with a number of a… show more

Highlights - Most important sentences from the article

2019-05-07
1905.02298 | cs.DS

An elastic-degenerate (ED) string is a sequence of $n$ sets of strings of total length $N$, which wa… show more

Highlights - Most important sentences from the article

2019-01-29
1901.10453 | cs.DS

Converting a set of sequencing reads into a lossless compact data structure that encodes all the rel… show more

Highlights - Most important sentences from the article

2019-02-06
1902.02187 | cs.DS

We present a compressed representation of tries based on top tree compression [ICALP 2013] that work… show more

Highlights - Most important sentences from the article

2018-09-08
1809.02792 | cs.DS

Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned … show more

Highlights - Most important sentences from the article

2018-08-03
1808.01071 | cs.DS

Two strings of equal length are said to parameterized match if there is a bijection that maps the ch… show more

Highlights - Most important sentences from the article

2018-06-05
1806.01804 | cs.DS

We present the first solution to $\tau$-majorities on tree paths. Given a tree of $n$ nodes, each wi… show more

Highlights - Most important sentences from the article

2017-12-22
1712.08573 | cs.DS

In the longest common substring problem, we are given two strings of length $n$ and must find a subs… show more

Highlights - Most important sentences from the article

2019-03-14
1903.06289 | cs.DS

Let $\Sigma$ and $\Pi$ be disjoint alphabets of respective size $\sigma$ and $\pi$. Two strings over… show more

Highlights - Most important sentences from the article

2018-04-23
1804.08731 | cs.DS

In the longest common substring (LCS) problem, we are given two strings $S$ and $T$, each of length … show more

Highlights - Most important sentences from the article

2018-07-31
1807.11702 | cs.DS

Sequence mappability is an important task in genome re-sequencing. In the $(k,m)$-mappability proble… show more

Highlights - Most important sentences from the article

2018-01-23
1801.07449 | cs.DS

We consider a sliding window $W$ over a stream of characters from some alphabet of constant size. Th… show more

Highlights - Most important sentences from the article

2018-03-26
1803.09520 | cs.DS

The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes ba… show more

Highlights - Most important sentences from the article

2019-01-30
1901.10633 | cs.DS

A maximal repetition, or run, in a string, is a periodically maximal substring whose smallest period… show more

Highlights - Most important sentences from the article