Python Regular Expressions: How to repeat a repeat of a pattern?
I have a long strand of DNA nucleotides and am looking for
sequences that begin with the start code 'AAA' and end with the stop code
'CCC'. Since nucleotides come in triplets, the number of nucleotides
between the start and end of every sequence I find must be a multiple of
three.
For example, 'AAAGGGCCC' is a valid sequence but 'AAAGCCC' is not.
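The multiple-of-three rule can be written as a regex by wrapping the triplet in a non-capturing group and repeating the group. The snippet below is only a minimal sketch: the name `is_valid` is made up for illustration, and it assumes the strand contains only the letters A, C, G and T.

```python
import re

# (?:[ACGT]{3})* repeats a whole triplet zero or more times.
# Writing the two quantifiers back to back (e.g. \w{3}{2}) raises
# "multiple repeat" in Python's re module, so the inner repeat must be grouped.
VALID = re.compile(r"AAA(?:[ACGT]{3})*CCC")

def is_valid(strand):
    """Return True if strand is 'AAA' + whole triplets + 'CCC'."""
    return VALID.fullmatch(strand) is not None

print(is_valid("AAAGGGCCC"))  # True
print(is_valid("AAAGCCC"))    # False
```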
In addition, before every stop code, I want the longest strand I can find
with respect to a particular reading frame.
For example, if the DNA were 'AAAGGGAAACCC', then both 'AAAGGGAAACCC' and
'AAACCC' would technically be valid, but since they share the same
instance of the stop code, I only want the longest strand of DNA
'AAAGGGAAACCC'. Also, if my strand were 'AAAAGGCCCCC', I must return
'AAAAGGCCC' AND 'AAAGGCCCC' because they are in different reading frames
(the first starts at index 0 and the second at index 1, so their start offsets differ mod 3).
While I think I have the code to search for strings that fulfill the
multiple of 3 requirement and don't overlap, I am not sure how to
implement the second criterion of keeping the same reading frame. My code
below would just return the longest strings that don't overlap, but does
not distinguish between reading frames, so in the above example it would
catch 'AAAAGGCCC' but not 'AAAGGCCCC':
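import math, re
# dna (the input string) and minNucleotide (minimum strand length) are assumed to be defined.
pattern = r"AAA(?:\w{3}){%d,}CCC" % math.ceil((minNucleotide - 6) / 3)
matches = re.finditer(pattern, dna)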
Sorry for being long-winded and thank you for taking a look!
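One possible way to handle the reading-frame criterion is sketched below. It is not a definitive answer. It uses a zero-width lookahead so that overlapping codons are seen, and it remembers the earliest 'AAA' in each frame (index mod 3). The function name `longest_strands` is hypothetical, and the sketch assumes a strand may run past an earlier in-frame 'CCC' and that no minimum length applies.

```python
import re

def longest_strands(dna):
    """For every 'CCC', return the longest in-frame strand that starts with 'AAA'."""
    first_start = {}  # reading frame (index mod 3) -> earliest index of 'AAA'
    results = []
    # The lookahead consumes nothing, so a match is tried at every index
    # and overlapping codons such as the two 'AAA's in 'AAAA' are both found.
    for m in re.finditer(r"(?=(AAA|CCC))", dna):
        pos, codon = m.start(), m.group(1)
        frame = pos % 3
        if codon == "AAA":
            # Keep only the earliest start in this frame (gives the longest strand).
            first_start.setdefault(frame, pos)
        elif frame in first_start:
            # Stop codon with an earlier start in the same frame.
            results.append(dna[first_start[frame]:pos + 3])
    return results

print(longest_strands("AAAGGGAAACCC"))  # ['AAAGGGAAACCC']
print(longest_strands("AAAAGGCCCCC"))   # ['AAAAGGCCC', 'AAAGGCCCC']
```

A minimum length such as minNucleotide could be enforced by filtering the returned list on `len(strand)`.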