Sketching and Indexing for Sequences
3.0 credits
Many of the world's largest and fastest-growing datasets are text, e.g. DNA sequencing data, web pages, logs and social media posts. Such datasets are useful only to the degree we can query, compare and analyze them. Here we discuss two powerful approaches in this area. First, we will cover sketching, which summarizes very large texts in small structures that let us estimate the sizes of sets and of their unions and intersections. This in turn allows us to estimate similarity and find near neighbors. Second, we will discuss indexing, in particular succinct and compressed indexes, which enables us to efficiently search inside very long strings, especially in highly repetitive texts.
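To give a flavor of the sketching idea, here is a minimal Python illustration of MinHash, one well-known sketching technique. It is only a sketch under stated assumptions: the course description does not name specific methods, and the function names (`kmers`, `minhash_sketch`, `estimate_jaccard`), the k-mer length, and the use of seeded BLAKE2b hashes are all choices made for this example. Each text is reduced to a set of k-mers, each set is summarized by a short list of minimum hash values, and the fraction of matching positions estimates the Jaccard similarity (intersection size over union size).

```python
import hashlib


def kmers(text, k=4):
    """Return the set of all length-k substrings of text (empty if len(text) < k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(item, seed):
    """Deterministic 64-bit hash of item under a given seed (illustrative choice)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, num_hashes=128):
    """Summarize a non-empty set by its minimum hash value under each seeded hash."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of positions where the sketches agree."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = kmers("the quick brown fox jumps over the lazy dog")
    b = kmers("the quick brown fox jumped over the lazy dogs")
    print("exact:   ", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_sketch(a), minhash_sketch(b)))
```

The sketch size depends only on `num_hashes`, not on the length of the text, which is why the comparison stays cheap even for very large inputs. More hash functions give a tighter estimate at the cost of a larger sketch.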