<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Algorithm Analysis on Mulatta Blog</title><link>https://blog.mulatta.io/tags/algorithm-analysis/</link><description>Recent content in Algorithm Analysis on Mulatta Blog</description><generator>Hugo -- gohugo.io</generator><language>ko-kr</language><lastBuildDate>Tue, 20 Feb 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.mulatta.io/tags/algorithm-analysis/index.xml" rel="self" type="application/rss+xml"/><item><title>1D Pattern Matching Problem</title><link>https://blog.mulatta.io/post/bioinformatics-algorithm/05-ch1-1d/</link><pubDate>Tue, 20 Feb 2024 00:00:00 +0000</pubDate><guid>https://blog.mulatta.io/post/bioinformatics-algorithm/05-ch1-1d/</guid><description>&lt;h2 id="개요"&gt;Overview
&lt;/h2&gt;&lt;p&gt;The result of the previous problem, 1C (reverse complement), showed that among the four most frequent sequences in &lt;em&gt;Vibrio cholerae&lt;/em&gt;, ATGATCAAG and CTTGATCAT are reverse complements of each other. Can this result be taken as evidence that we have found the DnaA box?&lt;/p&gt;
&lt;p&gt;The result from the previous chapter consisted of &lt;strong&gt;patterns counted over the entire genome, from start to end&lt;/strong&gt;. In other words, it carries no information about how the occurrences are distributed. The patterns we found could therefore be &lt;em&gt;spread evenly across the genome&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Why does this matter? The figure below shows how genome replication proceeds. In bacteria, the origin of replication lies in a single region, whereas in eukaryotes origins of replication are &lt;em&gt;clustered in multiple regions&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;We therefore need to know &lt;em&gt;where the patterns found in the previous chapter are located&lt;/em&gt;. Once the positions are known, we can check how densely those patterns are clustered.&lt;/p&gt;
&lt;p&gt;In this chapter we therefore implement a function that returns &lt;em&gt;&lt;strong&gt;the positions at which an arbitrary input pattern occurs&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Fig 1. (A) Bacterial Replication Initiation, (B) Eukaryotic Replication Initiation - &lt;a class="link" href="https://en.wikipedia.org/wiki/Origin_of_replication" target="_blank" rel="noopener"
&gt;Wikipedia&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="problem"&gt;Problem
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Input: the pattern to find, the genome&lt;/li&gt;
&lt;li&gt;Output: the positions at which the pattern occurs in the genome&lt;/li&gt;
&lt;li&gt;function: return a list of every position (index) at which the string pattern occurs in the genome&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="pseudo-code"&gt;Pseudo-code
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;findPatternIndices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genome&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;patternSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;genomeSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genome&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;indexList&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;--&lt;/span&gt; &lt;span class="n"&gt;empty&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;genomeSize&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;patternSize&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;genome&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;patternSize&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;indexList&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;indexList&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id="evaluation"&gt;Evaluation
&lt;/h2&gt;&lt;h3 id="time-complexity"&gt;Time Complexity
&lt;/h3&gt;&lt;p&gt;Input size: $\left\vert genome \right\vert = n$, $\left\vert pattern \right\vert = k$&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;line[2] ~ line[5]: variable initialization → $O(1)$&lt;/li&gt;
&lt;li&gt;line[7]: iterate over the start positions in the genome → $O(n)$&lt;/li&gt;
&lt;li&gt;line[8]: substring extraction and string comparison → $O(k)$&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;line[9]: append to the list → (in general) $O(1)$&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;Every start position must be examined in all cases; skipping any of them could miss an occurrence, so the result would no longer be correct&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Total Time Complexity&lt;/strong&gt;&lt;/em&gt;: $O(1) + O(n) \times O(k) = O(nk)$&lt;/p&gt;
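&lt;p&gt;The $O(nk)$ bound can be checked on small inputs. The sketch below is not part of the original code: &lt;code&gt;count_char_comparisons&lt;/code&gt; is a hypothetical helper that replaces the slice comparison with an explicit character-by-character loop, assuming the comparison stops at the first mismatch, and counts how many characters are compared.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python" data-lang="python"&gt;
```python
def count_char_comparisons(text, pattern):
    """Count character comparisons made by a naive left-to-right matcher."""
    k = len(pattern)
    comparisons = 0
    for start in range(len(text) - k + 1):
        for offset in range(k):
            comparisons += 1
            if text[start + offset] != pattern[offset]:
                break  # a mismatch ends this window early
    return comparisons


# Worst case: every window matches, so all (n - k + 1) * k comparisons happen.
print(count_char_comparisons("A" * 10, "AAA"))  # (10 - 3 + 1) * 3 = 24
# Best case: every window fails on its first character.
print(count_char_comparisons("C" * 10, "AAA"))  # 10 - 3 + 1 = 8
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With $n = 10$ and $k = 3$, the all-match input reaches the full $(n - k + 1) \times k$ comparisons, while the all-mismatch input needs only one comparison per window. The bound $O(nk)$ therefore describes the worst case.&lt;/p&gt;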
&lt;h2 id="implementation"&gt;Implementation
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Return a list of the indices at which pattern occurs in the given text&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;findPatternIndices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="c1"&gt;# List that stores the start index of each occurrence&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;OccList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="c1"&gt;# Visit every start position that leaves room for a full pattern&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="c1"&gt;# If the substring of pattern length starting at idx equals pattern, store idx&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="n"&gt;OccList&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;OccList&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://rosalind.info/problems/ba1d/" target="_blank" rel="noopener"
&gt;Rosalind&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/mulatta/Bioinforamtics-Algorithm-practice/blob/main/Chapter%201/PatternInText.py" target="_blank" rel="noopener"
&gt;Code in Github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
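&lt;p&gt;As a point of comparison, the same positions can be collected with Python's built-in &lt;code&gt;str.find&lt;/code&gt;. This is a sketch of an alternative, not the implementation used in this post; &lt;code&gt;find_pattern_indices_with_find&lt;/code&gt; is a hypothetical name, and the search restarts one character after each hit so that overlapping occurrences are kept.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python" data-lang="python"&gt;
```python
def find_pattern_indices_with_find(text, pattern):
    """Collect every 0-based start index of pattern in text, overlaps included."""
    indices = []
    start = text.find(pattern)
    while start != -1:
        indices.append(start)
        # Resume one character later so overlapping matches are not skipped.
        start = text.find(pattern, start + 1)
    return indices


# A short sample string in which two of the matches overlap.
print(find_pattern_indices_with_find("GATATATGCATATACTT", "ATAT"))  # [1, 3, 9]
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The matches at indices 1 and 3 share two characters, which is why the search advances by one position rather than by the pattern length.&lt;/p&gt;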
&lt;h2 id="discussion-points"&gt;Discussion Points
&lt;/h2&gt;&lt;h3 id="summary"&gt;Summary
&lt;/h3&gt;&lt;ul&gt;
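&lt;li&gt;The previous result may include patterns that occur evenly across the genome (a distribution that looks more random than a clustered one)&lt;/li&gt;
&lt;li&gt;We therefore need information on &lt;em&gt;how densely clustered&lt;/em&gt; the frequent patterns are (localization)&lt;/li&gt;
&lt;li&gt;To obtain it, we must find the positions at which a given pattern occurs&lt;/li&gt;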
&lt;/ul&gt;
&lt;h3 id="implementation-strategy"&gt;Implementation Strategy
&lt;/h3&gt;&lt;ol&gt;
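&lt;li&gt;Scan the genome string, taking a substring of the pattern's length at each position&lt;/li&gt;
&lt;li&gt;Compare that substring of the genome with the given pattern&lt;/li&gt;
&lt;li&gt;If they are equal, store the position; when the scan finishes, return all stored positions&lt;/li&gt;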
&lt;/ol&gt;
&lt;h3 id="implications"&gt;Implications
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Using this function, the earlier result ATGATCAAG is found at 116556, 149355, &lt;strong&gt;151913&lt;/strong&gt;, &lt;strong&gt;152013&lt;/strong&gt;, &lt;strong&gt;152394&lt;/strong&gt;, 186189, 194276, &amp;hellip; for a total of 17 occurrences&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Of these, &lt;strong&gt;151913&lt;/strong&gt;, &lt;strong&gt;152013&lt;/strong&gt;, &lt;strong&gt;152394&lt;/strong&gt; lie very close together, while the remaining occurrences do not form such a cluster&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This suggests that this region may be where the DnaA boxes are located &lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
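&lt;p&gt;The cluster described above can also be picked out programmatically by grouping positions whose neighbours lie close together. The following is a minimal sketch under stated assumptions: &lt;code&gt;cluster_positions&lt;/code&gt; is a hypothetical helper, the 500 bp gap threshold is an arbitrary illustrative choice, and only the seven positions quoted above are used because the other ten are elided. It is a plain proximity check, not the statistical argument mentioned in the footnote.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python" data-lang="python"&gt;
```python
def cluster_positions(positions, max_gap=500):
    """Group positions so that neighbours within a group are at most max_gap apart."""
    clusters = []
    for pos in sorted(positions):
        # range() membership is a constant-time check for a gap of 0 .. max_gap.
        if clusters and pos - clusters[-1][-1] in range(max_gap + 1):
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Only the positions quoted in the post; max_gap=500 is an assumed threshold.
listed = [116556, 149355, 151913, 152013, 152394, 186189, 194276]
dense = [group for group in cluster_positions(listed) if len(group) != 1]
print(dense)  # [[151913, 152013, 152394]]
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this threshold, the three highlighted positions form the only group with more than one member; a different threshold would change the grouping.&lt;/p&gt;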
&lt;h2 id="reference"&gt;Reference
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Compeau, P., Pevzner, P. (2018). Bioinformatics Algorithms 3/e. Acorn Publishing (에이콘 출판사)&lt;/li&gt;
&lt;li&gt;Replication Initiation figure: &lt;a class="link" href="https://en.wikipedia.org/wiki/Origin_of_replication" target="_blank" rel="noopener"
&gt;https://en.wikipedia.org/wiki/Origin_of_replication&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;A comparison walks a pointer along both strings until the first mismatch or until the shorter string ends. strcmp(str1, str2) → O(s), where s = min(len(str1), len(str2)); here both strings have length k&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;If the initially allocated capacity is exceeded, the array is resized and its values may be copied → O(l), where l = len(indexList). Averaged over many appends, the cost per append remains amortized O(1)&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;The inference rests on statistical grounds rather than on proximity alone&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>1B Frequent Words Problem</title><link>https://blog.mulatta.io/post/bioinformatics-algorithm/03-ch1-1b/</link><pubDate>Tue, 30 Jan 2024 00:00:00 +0000</pubDate><guid>https://blog.mulatta.io/post/bioinformatics-algorithm/03-ch1-1b/</guid><description>&lt;h2 id="개요"&gt;Overview
&lt;/h2&gt;&lt;p&gt;The previous problem asked how often a given pattern occurs, i.e. &lt;em&gt;&lt;strong&gt;Pattern → count&lt;/strong&gt;&lt;/em&gt;. This time the condition is placed on the pattern itself: among all patterns of a chosen length, we look for the most frequent &lt;em&gt;k-mer&lt;/em&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Previous goal: &lt;em&gt;&lt;strong&gt;Pattern → count&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Current goal: &lt;em&gt;&lt;strong&gt;k → Pattern&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A brute-force algorithm can reuse the previously implemented PatternCount, based on the following idea.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;For the k-mer pattern that starts at each character of the string,&lt;/li&gt;
&lt;li&gt;count its occurrences in the input text with PatternCount&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src="https://blog.mulatta.io/post/bioinformatics-algorithm/03-ch1-1b/FrequentWords1.png"
width="3611"
height="1847"
srcset="https://blog.mulatta.io/post/bioinformatics-algorithm/03-ch1-1b/FrequentWords1_hu_ae8f74acd85911c0.png 480w, https://blog.mulatta.io/post/bioinformatics-algorithm/03-ch1-1b/FrequentWords1_hu_a6943c5444a2a357.png 1024w"
loading="lazy"
alt="Pattern count when the index is 0"
class="gallery-image"
data-flex-grow="195"
data-flex-basis="469px"
&gt;
&lt;em&gt;Fig 1. Pattern count when the index is 0&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://blog.mulatta.io/post/bioinformatics-algorithm/03-ch1-1b/FrequentWords2.png"
width="3611"
height="1842"
srcset="https://blog.mulatta.io/post/bioinformatics-algorithm/03-ch1-1b/FrequentWords2_hu_7b0075adcc812376.png 480w, https://blog.mulatta.io/post/bioinformatics-algorithm/03-ch1-1b/FrequentWords2_hu_30c14ce736d2521b.png 1024w"
loading="lazy"
alt="Pattern count when the index is 1"
class="gallery-image"
data-flex-grow="196"
data-flex-basis="470px"
&gt;
&lt;em&gt;Fig 2. Pattern count when the index is 1&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="problem"&gt;Problem
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Input: the full string Text, and the length &lt;em&gt;k&lt;/em&gt; of the string Pattern to look for in Text&lt;/li&gt;
&lt;li&gt;Output: the most frequently occurring &lt;em&gt;k-mer&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;function: k $\to$ Pattern&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="pseudo-code"&gt;Pseudo-code
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;FrequentWords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;FrequentPatterns&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;--&lt;/span&gt; &lt;span class="n"&gt;empty&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;Pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PatternCount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;find&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id="evaluation"&gt;Evaluation
&lt;/h2&gt;&lt;h3 id="time-complexity"&gt;Time Complexity
&lt;/h3&gt;&lt;p&gt;Input size: $\left\vert Text \right\vert = n, \left\vert Pattern \right\vert=k$&lt;/p&gt;
&lt;p&gt;Constraint: $n \ge k$&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;line[2] ~ line[3]: assignment $\to O(1)$&lt;/li&gt;
&lt;li&gt;line[4]: loop over every position where a k-mer can start $\to O(n-k+1)$&lt;/li&gt;
&lt;li&gt;line[5]: string slicing (copies $k$ characters) $\to O(k)$&lt;/li&gt;
&lt;li&gt;line[6]: PatternCount $\to O(nk)$&lt;/li&gt;
&lt;li&gt;line[7]: find index of max count $\to O(n)$
&lt;ul&gt;
&lt;li&gt;Finding the maximum needs only one linear scan of count; no sorting is required&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Total Time Complexity&lt;/strong&gt;&lt;/em&gt;: $O(1) + O(n-k+1) \times (O(k) + O(nk)) + O(n) = O(n^2k)$&lt;/p&gt;
&lt;h2 id="implementation"&gt;Implementation
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Count how often each k-mer occurs in text and return the most frequent k-mer(s)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;FrequentWords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="c1"&gt;# List as long as text; count[i] holds the frequency of the k-mer starting at index i&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="c1"&gt;# Set of most frequent patterns to return; a set stores each pattern only once&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;FrequentPattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="c1"&gt;# Store the frequency of the k-mer starting at index i in count (PatternCount is the previously implemented function)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PatternCount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="c1"&gt;# Add every pattern whose count equals maxCount to FrequentPattern&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;maxCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;maxCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="n"&gt;FrequentPattern&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FrequentPattern&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://rosalind.info/problems/ba1b/" target="_blank" rel="noopener"
&gt;Rosalind&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/mulatta/Bioinforamtics-Algorithm-practice/blob/main/Chapter%201/FrequentWords.py" target="_blank" rel="noopener"
&gt;Code in Github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
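&lt;p&gt;For comparison, the repeated calls to PatternCount can be avoided by counting every k-mer in a single pass with a hash map. This is a sketch of an alternative technique, not the brute-force method implemented above; &lt;code&gt;frequent_words_by_counting&lt;/code&gt; is a hypothetical name, and it relies on &lt;code&gt;collections.Counter&lt;/code&gt; from the Python standard library.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python" data-lang="python"&gt;
```python
from collections import Counter


def frequent_words_by_counting(text, k):
    """Return the set of most frequent k-mers using one counting pass."""
    # Each of the len(text) - k + 1 windows is sliced once and tallied.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return set()  # text is shorter than k, so there is no k-mer to count
    top = max(counts.values())
    return {kmer for kmer, c in counts.items() if c == top}


print(sorted(frequent_words_by_counting("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4)))
# ['CATG', 'GCAT']
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each window costs $O(k)$ to slice and hash, so the expected running time is about $O(nk)$ instead of $O(n^2k)$, at the price of storing one counter entry per distinct k-mer.&lt;/p&gt;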
&lt;h2 id="discussion-points"&gt;Discussion Points
&lt;/h2&gt;&lt;h3 id="summary"&gt;Summary
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;We can determine &lt;strong&gt;which&lt;/strong&gt; pattern is the most frequent in any given sequence.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="implementation-strategy"&gt;Implementation Strategy
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;Use the previously implemented PatternCount to record, in the array count, how many times each possible k-mer occurs in the whole text&lt;/li&gt;
&lt;li&gt;Return the pattern(s) with the largest recorded count&lt;/li&gt;
&lt;li&gt;This shows which pattern occurs most often&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="implications"&gt;Implications
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Combining the previous chapter with the idea from the introduction gives the chain k → Pattern → count, which also tells us which k-mer occurs most often.&lt;/li&gt;
&lt;li&gt;This approach might serve as a clue for identifying which repeated sequence has the highest count when no prior information is available.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next problem, we will consider how often the &lt;em&gt;complementary sequences&lt;/em&gt; related to the patterns found here occur, and examine whether a frequent pattern can be said to be meaningful rather than a coincidence.&lt;/p&gt;
&lt;h2 id="reference"&gt;Reference
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Compeau, P., Pevzner, P. (2018). Bioinformatics Algorithms 3/e. Acorn Publishing (에이콘 출판사)&lt;/li&gt;
&lt;li&gt;Craig, N., Cohen-Fix, O., Green, R., Greider, C., Storz, G., &amp;amp; Wolberger, C. (2010). Molecular biology: Principles of genome function. Oxford University Press.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
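&lt;p&gt;Finding a maximum or minimum does not require sorting; a single linear scan takes $O(n)$. If the list were sorted anyway, merge sort and heapsort take $O(n\log{n})$, while quicksort takes $O(n\log{n})$ on average and $O(n^2)$ in the worst case.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;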
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item></channel></rss>