<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Clump Finding on Mulatta Blog</title><link>https://blog.mulatta.io/tags/clump-finding/</link><description>Recent content in Clump Finding on Mulatta Blog</description><generator>Hugo -- gohugo.io</generator><language>ko-kr</language><lastBuildDate>Tue, 20 Feb 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.mulatta.io/tags/clump-finding/index.xml" rel="self" type="application/rss+xml"/><item><title>1E: The Clump Finding Problem</title><link>https://blog.mulatta.io/post/bioinformatics-algorithm/06-ch1-1e/</link><pubDate>Tue, 20 Feb 2024 00:00:00 +0000</pubDate><guid>https://blog.mulatta.io/post/bioinformatics-algorithm/06-ch1-1e/</guid><description>&lt;h2 id="개요"&gt;Overview
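&lt;/h2&gt;&lt;p&gt;In the previous chapter we &lt;em&gt;observed&lt;/em&gt; that the occurrences of the frequent pattern ATGATCAAG cluster together in the genome. Building on that result, this chapter analyzes quantitatively whether a pattern really forms a clump.&lt;/p&gt;
&lt;p&gt;How can we define what it means to form a clump? Intuitively, a string pattern forms a clump when the &amp;ldquo;density&amp;rdquo; with which it appears in the genome is high. In general, density means mass per unit volume and is defined as follows.&lt;/p&gt;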
&lt;p&gt;$$\rho = m / V$$&lt;/p&gt;
&lt;p&gt;$(\text{where } \rho \text{: density, } m \text{: mass, } V \text{: volume})$&lt;/p&gt;
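&lt;p&gt;Following this approach, the criterion for forming a clump can be defined as follows.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The number of times a pattern appears within an arbitrary region of length L of the whole string&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To pick out only patterns that appear often enough, this chapter adds a parameter t: we only consider patterns that appear at least t times. We also fix the pattern length to k, so every pattern is a k-mer.&lt;/p&gt;
&lt;p&gt;To summarize, as in the figure below, we take a substring of length L from the whole string Genome, count the frequency of every k-mer within that substring, and return each k-mer that appears at least t times as a pattern that forms a clump.&lt;/p&gt;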
&lt;p&gt;$$\text{Find all } K \text{ such that } f(K, S) \ge t \text{ for some } S = \text{genome}[i:i+L],$$&lt;/p&gt;
&lt;p&gt;$$\text{where } i \ge 0 \text{ and } i + L \le \text{length of genome}$$&lt;/p&gt;
&lt;p&gt;$$K \text{: a } k\text{-mer pattern,} \quad f(K, S) \text{: the number of occurrences of } K \text{ in } S$$&lt;/p&gt;
&lt;p&gt;&lt;img src="https://blog.mulatta.io/post/bioinformatics-algorithm/06-ch1-1e/findClumps.png"
width="4326"
height="2080"
srcset="https://blog.mulatta.io/post/bioinformatics-algorithm/06-ch1-1e/findClumps_hu_bc25edba0f818cd8.png 480w, https://blog.mulatta.io/post/bioinformatics-algorithm/06-ch1-1e/findClumps_hu_e90833edb97dcc6f.png 1024w"
loading="lazy"
alt="Idea to calculate and find the clumps in genome"
class="gallery-image"
data-flex-grow="207"
data-flex-basis="499px"
&gt;
&lt;em&gt;Idea to calculate and find the clumps in genome&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="problem"&gt;Problem
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Input: a genome string Text and integers k, L, t&lt;/li&gt;
&lt;li&gt;Output: every distinct k-mer that appears at least t times within some region of length L of Text&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="pseudo-code"&gt;Pseudo-code
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;findClumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genome&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;clumpPatterns&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;--&lt;/span&gt; &lt;span class="n"&gt;empty&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;--&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;genome&lt;/span&gt;&lt;span class="o"&gt;|-&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;substring&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genome&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;--&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;substring&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;occurrence&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;substring&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;clumpPatterns&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;clumpPatterns&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id="evaluation"&gt;Evaluation
&lt;/h2&gt;&lt;h3 id="time-complexity"&gt;Time Complexity
&lt;/h3&gt;&lt;p&gt;The algorithm above takes every position of the genome as a starting point and enumerates every case. This guarantees a correct solution, but part of the work is redundant. A brief discussion of this follows in Discussion.&lt;/p&gt;
&lt;p&gt;Input size: $\left\vert genome \right\vert = n$, with parameters $k$, $L$, $t$&lt;/p&gt;
&lt;p&gt;$\text{Constraints: } n \ge L \ge k \text{ and } L \ge t$&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;line[2]: variable initialization → $O(1)$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;line[4] ~ line[5]: the outer loop visits every window start position in the genome and slices a substring of length L at each one → $O(n - L + 1) \approxeq O(n)$ iterations&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;line[7] ~ line[8]: the inner loop slices a k-mer as the pattern at every position of the substring → $O(L-k+1)\approxeq O(L)$ iterations&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;line[9]: the previously implemented PatternCount can be used → $O(Lk)$ &lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;line[11] ~ line[14]: comparison, set.add()&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;, return → $O(1)$&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Total Time Complexity&lt;/strong&gt;&lt;/em&gt;: $O(1) + O(n) \times O(L) \times O(Lk) + O(1) \approxeq O(L^2 \cdot n \cdot k)$&lt;/p&gt;
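&lt;p&gt;To get a feel for this bound, the loop counts from the analysis can simply be multiplied together. The sketch below does that; &lt;code&gt;brute_force_cost&lt;/code&gt; is a hypothetical helper, and the sizes in the usage line are illustrative assumptions rather than values from the problem.&lt;/p&gt;

```python
def brute_force_cost(n, L, k):
    """Rough upper bound on character comparisons made by the brute-force search.

    Hypothetical helper for illustration: it only multiplies the loop counts
    from the analysis, it does not measure real running time.
    """
    windows = n - L + 1            # outer loop: one window per start position
    kmers_per_window = L - k + 1   # inner loop: one pattern per k-mer start
    # Each pattern triggers one PatternCount scan of the window:
    # (L - k + 1) slice comparisons of up to k characters each.
    scan_cost = kmers_per_window * k
    return windows * kmers_per_window * scan_cost


# Assumed sizes: n = 1000, L = 100, k = 5
print(brute_force_cost(1000, 100, 5))  # prints 41518080
```

&lt;p&gt;Even for a genome of only a thousand characters the bound is already in the tens of millions, which is why the redundancy discussed later matters.&lt;/p&gt;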
&lt;h2 id="implementation"&gt;Implementation
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Slide a window of length L over the genome and return the set of k-mers that occur at least t times in some window&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;myFindClumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genome&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="c1"&gt;# Set that stores the k-mers forming clumps&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;ClumpPattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="c1"&gt;# Loop over every start position of a length-L window in genome&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genome&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="c1"&gt;# Slice the substring (length: L) covered by the current window&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;substr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genome&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="c1"&gt;# Iterate over every k-mer start position in the window&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="c1"&gt;# Take the k-mer starting at position i as the pattern&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;substr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="c1"&gt;# Count how often pattern occurs in the current window (PatternCount is the function from the earlier chapter)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PatternCount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;substr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="c1"&gt;# If the pattern occurs at least t times, add it to ClumpPattern&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;ClumpPattern&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ClumpPattern&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://rosalind.info/problems/ba1e/" target="_blank" rel="noopener"
&gt;Rosalind&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/mulatta/Bioinforamtics-Algorithm-practice/blob/main/Chapter%201/FindClumps.py" target="_blank" rel="noopener"
&gt;Code in Github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
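&lt;p&gt;The implementation above calls &lt;code&gt;PatternCount&lt;/code&gt;, which comes from an earlier post and is not shown here. A minimal sketch of what it might look like follows, assuming it counts overlapping occurrences by comparing slices; the body is an assumption, not the original code.&lt;/p&gt;

```python
def PatternCount(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones.

    Assumed reconstruction of the helper used by myFindClumps.
    """
    k = len(pattern)
    count = 0
    # Compare the length-k slice at every start position with the pattern.
    for i in range(len(text) - k + 1):
        if text[i:i + k] == pattern:
            count += 1
    return count


print(PatternCount("GCGCG", "GCG"))  # prints 2 (the two matches overlap)
```

&lt;p&gt;Each call makes up to L - k + 1 slice comparisons of k characters when the text is a window of length L, which is where the O(Lk) term in the analysis comes from.&lt;/p&gt;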
&lt;h2 id="discussion-points"&gt;Discussion Points
&lt;/h2&gt;&lt;h3 id="summary"&gt;Summary
&lt;/h3&gt;&lt;ul&gt;
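&lt;li&gt;Finding frequent words alone gives no positional information about &lt;strong&gt;&amp;ldquo;where and how often they appear&amp;rdquo;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Defining a clump addresses this: the frequency of occurrence within a region can be measured much like a density&lt;/li&gt;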
&lt;/ul&gt;
&lt;h3 id="implementation-strategy"&gt;Implementation Strategy
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;Slice a substring of length L from genome&lt;/li&gt;
&lt;li&gt;Slice a pattern of length k from the substring&lt;/li&gt;
&lt;li&gt;Count how often the pattern occurs in the substring&lt;/li&gt;
&lt;li&gt;If the count is at least t, store the pattern in clumpPattern&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="implications"&gt;Implications
&lt;/h3&gt;&lt;ul&gt;
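&lt;li&gt;The algorithm implemented here guarantees the correct answer but is not efficient&lt;/li&gt;
&lt;li&gt;The reason is that the window of length L at the current index and the window at the next index share L-1 characters, so most of the counting is repeated (see the figure below)&lt;/li&gt;
&lt;li&gt;This can be addressed later with the frequency array covered in the Charging Station.&lt;/li&gt;
&lt;li&gt;Alternatively, calling FrequentWords inside the window loop instead of PatternCount gives a simpler implementation.&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;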
&lt;/ul&gt;
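&lt;p&gt;One way to remove the redundancy between neighbouring windows is to keep running k-mer counts and update them as the window slides: one k-mer leaves on the left and one enters on the right. The sketch below is only an illustration of that idea; the name &lt;code&gt;find_clumps_sliding&lt;/code&gt; is hypothetical, and it uses &lt;code&gt;collections.Counter&lt;/code&gt; as a stand-in rather than the frequency array from the book.&lt;/p&gt;

```python
from collections import Counter


def find_clumps_sliding(genome, k, L, t):
    """Return the set of k-mers occurring at least t times in some length-L window.

    Hypothetical sketch: counts are updated incrementally instead of
    being recomputed for every window.
    """
    clumps = set()
    if k > L or L > len(genome):
        return clumps

    # Count every k-mer of the first window once.
    counts = Counter(genome[i:i + k] for i in range(L - k + 1))
    clumps.update(kmer for kmer, c in counts.items() if c >= t)

    # Slide the window one position at a time.
    for start in range(1, len(genome) - L + 1):
        # The k-mer that began at the old window start leaves the window.
        counts[genome[start - 1:start - 1 + k]] -= 1
        # The k-mer ending at the new window end enters the window.
        entering = genome[start + L - k:start + L]
        counts[entering] += 1
        # Only the entering k-mer's count went up, so only it can newly reach t.
        if counts[entering] >= t:
            clumps.add(entering)

    return clumps


print(find_clumps_sliding("GCGCGTTACGCG", 3, 5, 2))  # prints {'GCG'}
```

&lt;p&gt;After the first window, each step touches only two k-mers, so the total work is roughly proportional to n times k instead of growing with the square of L.&lt;/p&gt;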
&lt;p&gt;&lt;img src="https://blog.mulatta.io/post/bioinformatics-algorithm/06-ch1-1e/redundancy-in-findclumps.png"
width="4175"
height="2598"
srcset="https://blog.mulatta.io/post/bioinformatics-algorithm/06-ch1-1e/redundancy-in-findclumps_hu_a6f7577ed8c4e602.png 480w, https://blog.mulatta.io/post/bioinformatics-algorithm/06-ch1-1e/redundancy-in-findclumps_hu_800279db219f3d78.png 1024w"
loading="lazy"
alt="Redundancy in Brute-Force Algorithm to find clumps"
class="gallery-image"
data-flex-grow="160"
data-flex-basis="385px"
&gt;
&lt;em&gt;Redundancy in Brute-Force Algorithm to find clumps&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="reference"&gt;Reference
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Compeau, P., Pevzner, P. (2018). Bioinformatics Algorithms 3/e. Acorn Publishing (에이콘 출판사)&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;The input text of the earlier PatternCount(text, pattern) is now the substring of length L, so the cost is O(Lk)&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;In Python, a set is a hash-based structure: insertion is O(1) on average and O(n) in the worst case, and it is treated as O(1) here&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;In this case the time complexity is likewise $O(L^2 \cdot n \cdot k)$&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item></channel></rss>