Извините за изменения!Наконец, вот рабочая версия с двойной точностью:
#!/bin/bash
# Let's read the file passed as an argument to this script
while IFS='' read -r line || [[ -n "$line" ]]; do
# We check if this line have ">" character in the beginning. That is, if it is the line with amino acids or with protein name
if [[ $(echo ${line} | head -c 1) == ">" ]]; then
# We print out the protein name
echo ${line} | awk {'print $1'}
# next=1 means that the amino acids will be in the next line, not this line, this is just the line with protein name
next=1
fi
# We do the code below only if it is the line with amino acids, so when next is not 1, but 0
if [[ $next == 0 ]] ; then
# We need to have the number of all amino acids to count percentage, so we count the number of all characters
all=${#line}
# And number of amino acids A in this protein, that is the number of "A" characters
A=$(echo "${line}" | awk -F"A" '{print NF-1}')
# Amino acids T, then C and G
T=$(echo "${line}" | awk -F"T" '{print NF-1}')
C=$(echo "${line}" | awk -F"C" '{print NF-1}')
G=$(echo "${line}" | awk -F"G" '{print NF-1}')
# Now let's count what percentage of all amino acids (all characters) are amino acids A (characters "A"). Change scale=2 to scale=3 to increase precision and have 3 digits after the dot, not two
Apercentage=`bc -l <<< "scale=2; 100*${A}/${all}"`
# Same for T, C and G
Tpercentage=`bc -l <<< "scale=2; 100*${T}/${all}"`
Gpercentage=`bc -l <<< "scale=2; 100*${G}/${all}"`
Cpercentage=`bc -l <<< "scale=2; 100*${C}/${all}"`
# We print out calculated percentage of amino acids
printf "A: %s%% T: %s%% G: %s%% C: %s%%\n" "${Apercentage}" "${Tpercentage}" "${Gpercentage}" "${Cpercentage}"
fi
# We reset the "next"
next=0
done < "$1"
Теперь это тестовый файл с именем "fasta":
>Protein1 info
ATCGGGCTGC
>Protein2 info
ATCGGGCTGCGGCC
>Protein2 info
ATCGGGCTGCGGCCCCC
И вывод:
[user@host ]$ ./script.sh fasta
>Protein1
A: 10.00% T: 20.00% G: 40.00% C: 30.00%
>Protein2
A: 7.14% T: 14.28% G: 42.85% C: 35.71%
>Protein2
A: 5.88% T: 11.76% G: 35.29% C: 47.05%
Числа верны.Однострочники mosvy и PaulRM накапливают все «As», «Ts» и «Gs» из предыдущих белков (или делают что-то другое неправильно).Боюсь, они не рассчитывают процент правильно.Только первый белок получил правильные числа в своих однострочниках, следующие белки получили неправильные числа:
Protein1
A: 10% T: 20% G: 40% C: 30%
Protein2
A: 14.2857% T: 28.5714% G: 71.4286% C: 57.1429%
Protein2
A: 17.6471% T: 35.2941% G: 94.1176% C: 94.1176%