Question

Here is the excerpt function:

    function excerpt($text, $phrase, $radius = 100, $ending = "...") {
        if (empty($text) or empty($phrase)) {
            return $this->truncate($text, $radius * 2, $ending);
        }

        $phraseLen = strlen($phrase);
        if ($radius < $phraseLen) {
            $radius = $phraseLen;
        }

        $pos = strpos(strtolower($text), strtolower($phrase));

        $startPos = 0;
        if ($pos > $radius) {
            $startPos = $pos - $radius;
        }

        $textLen = strlen($text);

        $endPos = $pos + $phraseLen + $radius;
        if ($endPos >= $textLen) {
            $endPos = $textLen;
        }

        $excerpt = substr($text, $startPos, $endPos - $startPos);
        if ($startPos != 0) {
            $excerpt = substr_replace($excerpt, $ending, 0, $phraseLen);
        }

        if ($endPos != $textLen) {
            $excerpt = substr_replace($excerpt, $ending, -$phraseLen);
        }

        return $excerpt;
    }

Its shortcoming is that it does not try to cover as many of the searched words as possible: by default it matches the phrase only once.

How can I implement a version that does?

erisco · Answer 1 · January 29, 2010

The code posted here so far did not work for me, so I spent some time thinking up an algorithm to implement. What I have now works decently and does not appear to be a performance problem; feel free to test it. The results are not as polished as Google's snippets, because there is no detection of where sentences start and end. I could add that, but it would be much more complicated, and I would have to throw in the towel on doing it all in a single function. The function is already crowded and could be coded better if, for example, the object manipulations were abstracted into methods.

Anyway, this is what I have, and it should be a good start. The densest excerpt is determined, and the resulting string will be approximately the length of the span you specify. I strongly recommend testing this code, as I have not tested it thoroughly. There are surely problem cases still to be found.

I also encourage anyone to improve on this algorithm, or simply on the code that executes it.

Enjoy.

// string excerpt(string $text, string $phrase, int $span = 100, string $delimiter = '...')
// parameters:
//  $text - text to be searched
//  $phrase - search string
//  $span - approximate length of the excerpt
//  $delimiter - string to use as a suffix and/or prefix if the excerpt is from the middle of a text

function excerpt($text, $phrase, $span = 100, $delimiter = '...') {

  $phrases = preg_split('/\s+/', $phrase);

  $regexp = '/\b(?:';
  foreach ($phrases as $phrase) {
    $regexp .= preg_quote($phrase, '/') . '|';
  }

  $regexp = substr($regexp, 0, -1) . ')\b/i';
  $matches = array();
  preg_match_all($regexp, $text, $matches, PREG_OFFSET_CAPTURE);
  $matches = $matches[0];

  $nodes = array();
  foreach ($matches as $match) {
    $node = new stdClass;
    $node->phraseLength = strlen($match[0]);
    $node->position = $match[1];
    $nodes[] = $node;
  }

  if (count($nodes) > 0) {
    $clust = new stdClass;
    $clust->nodes[] = array_shift($nodes);
    $clust->length = $clust->nodes[0]->phraseLength;
    $clust->i = 0;
    $clusters = new stdClass;
    $clusters->data = array($clust);
    $clusters->i = 0;
    foreach ($nodes as $node) {
      $lastClust = $clusters->data[$clusters->i];
      $lastNode = $lastClust->nodes[$lastClust->i];
      $addedLength = $node->position - $lastNode->position - $lastNode->phraseLength + $node->phraseLength;
      if ($lastClust->length + $addedLength <= $span) {
        $lastClust->nodes[] = $node;
        $lastClust->length += $addedLength;
        $lastClust->i += 1;
      } else {
        if ($addedLength > $span) {
          $newClust = new stdClass;
          $newClust->nodes = array($node);
          $newClust->i = 0;
          $newClust->length = $node->phraseLength;
          $clusters->data[] = $newClust;
          $clusters->i += 1;
        } else {
          $newClust = clone $lastClust;
          while ($newClust->length + $addedLength > $span) {
            $shiftedNode = array_shift($newClust->nodes);
            if ($shiftedNode === null) {
              break;
            }
            $newClust->i -= 1;
            $removedLength = $shiftedNode->phraseLength;
            if (isset($newClust->nodes[0])) {
              $removedLength += $newClust->nodes[0]->position - $shiftedNode->position;
            }
            $newClust->length -= $removedLength;
          }
          if ($newClust->i < 0) {
            $newClust->i = 0;
          }
          $newClust->nodes[] = $node;
          $newClust->length += $addedLength;
          $clusters->data[] = $newClust;
          $clusters->i += 1;
        }
      }
    }
    $bestClust = $clusters->data[0];
    $bestClustSize = count($bestClust->nodes);
    foreach ($clusters->data as $clust) {
      $newClustSize = count($clust->nodes);
      if ($newClustSize > $bestClustSize) {
        $bestClust = $clust;
        $bestClustSize = $newClustSize;
      }
    }
    $clustLeft = $bestClust->nodes[0]->position;
    $clustLen = $bestClust->length;
    $padding = (int) round(($span - $clustLen)/2); // round() returns a float; offsets below must be integers
    $clustLeft -= $padding;
    if ($clustLeft < 0) {
      $clustLen += $clustLeft*-1 + $padding;
      $clustLeft = 0;
    } else {
      $clustLen += $padding*2;
    }
  } else {
    $clustLeft = 0;
    $clustLen = $span;
  }

  $textLen = strlen($text);
  $prefix = '';
  $suffix = '';

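  // Only step forward when not at the start of the text, and never read past its end
  if ($clustLeft > 0 && $clustLeft < $textLen && !ctype_space($text[$clustLeft]) && !ctype_space($text[$clustLeft-1])) {
    while ($clustLeft < $textLen && !ctype_space($text[$clustLeft])) {
      $clustLeft += 1;
    }
    $prefix = $delimiter;
  }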

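  $lastChar = $clustLeft + $clustLen;
  // Guard both ends so a short text or a missing space cannot cause an out-of-range read
  if ($lastChar < $textLen && !ctype_space($text[$lastChar]) && isset($text[$lastChar+1]) && !ctype_space($text[$lastChar+1])) {
    while ($lastChar > $clustLeft && !ctype_space($text[$lastChar])) {
      $lastChar -= 1;
    }
    $suffix = $delimiter;
    $clustLen = $lastChar - $clustLeft;
  }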

  if ($clustLeft > 0) {
    $prefix = $delimiter;
  }

  if ($clustLeft + $clustLen < $textLen) {
    $suffix = $delimiter;
  }

  return $prefix . trim(substr($text, $clustLeft, $clustLen+1)) . $suffix;
}
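
Editor's sketch: the "densest excerpt" idea described in this answer can also be illustrated with a plain sliding window over the sorted match offsets. This is a different and simpler technique than the clustering code above, not a rewrite of it. The helper name `densestWindow` and the sample text are assumptions made for illustration, and the window is measured from the start of its first match to the start of its last match, so the length of the last word is not counted.

```php
<?php
// Hypothetical helper (not part of the answer's code): given match offsets,
// return [start, count] for the window of at most $span characters that
// contains the most matches. On ties the earliest window wins.
function densestWindow(array $positions, int $span): array
{
    sort($positions);
    $bestStart = $positions[0] ?? 0;
    $bestCount = 0;
    $left = 0;
    foreach ($positions as $right => $pos) {
        // Move the left edge forward until the window fits inside $span.
        while ($pos - $positions[$left] > $span) {
            $left++;
        }
        $count = $right - $left + 1;
        if ($count > $bestCount) {
            $bestCount = $count;
            $bestStart = $positions[$left];
        }
    }
    return [$bestStart, $bestCount];
}

// Collect the byte offset of every search word, then pick the densest window.
$text = 'apple pie is nice. Much later on, an apple and a pear sit near a pie.';
preg_match_all('/\b(?:apple|pear|pie)\b/i', $text, $m, PREG_OFFSET_CAPTURE);
$positions = array_column($m[0], 1);
list($start, $count) = densestWindow($positions, 30);
echo substr($text, $start, 30), " ($count matches)\n";
```

With a span of 30 the two matches at the start of the text lose to the three matches near the end, which is the behaviour the question asks for.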

Ben Boyter · Answer 2 · July 12, 2013

I came up with the code below to generate excerpts. You can see it here: https://github.com/boyter/php-excerpt. It works by finding the locations of all matching words and then taking an excerpt based on which words are closest together. In theory this does not sound very good, but in practice it works very well.

It is actually very close to how Sphider generates its snippets (for the record, that code lives in searchfuncs.php, lines 529 to 566). I think the code below is much easier to read and is free of the bugs that exist in Sphider. It also does not use regular expressions, which makes it a bit faster than the other methods I have used.

I blogged about it here: http://www.boyter.org/2013/04/building-a-search-result-extract-generator-in-php/

<?php

// find the locations of each of the words
// Nothing exciting here. The array_unique is required 
// unless you decide to make the words unique before passing in
function _extractLocations($words, $fulltext) {
    $locations = array();
    foreach($words as $word) {
        $wordlen = strlen($word);
        $loc = stripos($fulltext, $word);
        while($loc !== FALSE) {
            $locations[] = $loc;
            $loc = stripos($fulltext, $word, $loc + $wordlen);
        }
    }
    $locations = array_unique($locations);
    sort($locations);

    return $locations;
}

// Work out which is the most relevant portion to display
// This is done by looping over each match and finding the smallest distance between two found 
// strings. The idea being that the closer the terms are the better match the snippet would be. 
// When checking for matches we only change the location if there is a better match. 
// The only exception is where we have only two matches in which case we just take the 
// first as will be equally distant.
function _determineSnipLocation($locations, $prevcount) {
    // If we only have 1 match we dont actually do the for loop so set to the first
    // (fall back to 0 when there were no matches at all)
    $startpos = isset($locations[0]) ? $locations[0] : 0;
    $loccount = count($locations);
    $smallestdiff = PHP_INT_MAX;    

    // If we only have 2 skip as its probably equally relevant
    if(count($locations) > 2) {
        // skip the first as we check 1 behind
        for($i=1; $i < $loccount; $i++) { 
            if($i == $loccount-1) { // at the end
                $diff = $locations[$i] - $locations[$i-1];
            }
            else {
                $diff = $locations[$i+1] - $locations[$i];
            }

            if($smallestdiff > $diff) {
                $smallestdiff = $diff;
                $startpos = $locations[$i];
            }
        }
    }

    $startpos = $startpos > $prevcount ? $startpos - $prevcount : 0;
    return $startpos;
}

// 1/6 ratio on prevcount tends to work pretty well and puts the terms
// in the middle of the extract
function extractRelevant($words, $fulltext, $rellength=300, $prevcount=50, $indicator='...') {

    $textlength = strlen($fulltext);
    if($textlength <= $rellength) {
        return $fulltext;
    }

    $locations = _extractLocations($words, $fulltext);
    $startpos  = _determineSnipLocation($locations,$prevcount);

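    // if we are going to snip too much...
    if($textlength-$startpos < $rellength) {
        // cast to int (substr expects an integer offset) and clamp at 0 so a
        // negative offset is not interpreted as counting from the end of the text
        $startpos = max(0, (int) ($startpos - ($textlength-$startpos)/2));
    }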

    $reltext = substr($fulltext, $startpos, $rellength);

    // check to ensure we dont snip the last word if thats the match
    if( $startpos + $rellength < $textlength) {
        $reltext = substr($reltext, 0, strrpos($reltext, " ")).$indicator; // remove last word
    }

    // If we trimmed from the front add ...
    if($startpos != 0) {
        $reltext = $indicator.substr($reltext, strpos($reltext, " ") + 1); // remove first word
    }

    return $reltext;
}
?>
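
Editor's sketch: one practical difference between the `stripos()` scan used in this answer and the `\b`-anchored regular expression used in Answer 1 is that `stripos()` also reports matches inside longer words. The helper name `wordOffsets` and the sample text below are assumptions made for illustration; the loop mirrors `_extractLocations` above but is not the author's code.

```php
<?php
// Hypothetical helper: collect the offset of every occurrence of each word
// using stripos() only (case-insensitive, no word-boundary check).
function wordOffsets(array $words, string $text): array
{
    $offsets = [];
    foreach ($words as $word) {
        if ($word === '') {
            continue; // an empty needle would never advance the scan
        }
        $pos = stripos($text, $word);
        while ($pos !== false) {
            $offsets[] = $pos;
            $pos = stripos($text, $word, $pos + strlen($word));
        }
    }
    $offsets = array_values(array_unique($offsets));
    sort($offsets);
    return $offsets;
}

$text = 'The cat sat. A category of cats.';

// Substring scan: also finds "cat" inside "category" and "cats".
$plain = wordOffsets(['cat'], $text);
print_r($plain);

// Regex scan with \b: only the standalone word is reported.
preg_match_all('/\bcat\b/i', $text, $m, PREG_OFFSET_CAPTURE);
$whole = array_column($m[0], 1);
print_r($whole);
```

Whether partial-word hits are wanted depends on the search engine in front of the excerpt function, so neither behaviour is presented here as the correct one.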

mgo · Answer 3 · August 27, 2015

I was not able to contact erisco, so I am posting his function with several fixes (most importantly, multibyte support).

/**
 * @param string $text text to be searched
 * @param string $phrase search string
 * @param int $span approximate length of the excerpt
 * @param string $delimiter string to use as a suffix and/or prefix if the excerpt is from the middle of a text
 *
 * @return string
 */
public static function excerpt($text, $phrase, $span = 100, $delimiter = '...')
{
	$phrases = preg_split('/\s+/u', $phrase);
	$regexp = '/\b(?:';
	foreach($phrases as $phrase)
	{
		$regexp.= preg_quote($phrase, '/') . '|';
	}

	$regexp = mb_substr($regexp, 0, -1) .')\b/ui';
	$matches = [];
	preg_match_all($regexp, $text, $matches, PREG_OFFSET_CAPTURE);
	$matches = $matches[0];
	$nodes = [];
	foreach($matches as $match)
	{
		$node = new stdClass;
		$node->phraseLength = mb_strlen($match[0]);
		$node->position = mb_strlen(substr($text, 0, $match[1])); // calculate UTF-8 position (@see https://bugs.php.net/bug.php?id=67487)
		$nodes[] = $node;
	}

	if(count($nodes) > 0)
	{
		$clust = new stdClass;
		$clust->nodes[] = array_shift($nodes);
		$clust->length = $clust->nodes[0]->phraseLength;
		$clust->i = 0;
		$clusters = new stdClass;
		$clusters->data =
		[
			$clust
		];
		$clusters->i = 0;
		foreach($nodes as $node)
		{
			$lastClust = $clusters->data[$clusters->i];
			$lastNode = $lastClust->nodes[$lastClust->i];
			$addedLength = $node->position - $lastNode->position - $lastNode->phraseLength + $node->phraseLength;
			if($lastClust->length + $addedLength <= $span)
			{
				$lastClust->nodes[] = $node;
				$lastClust->length+= $addedLength;
				$lastClust->i++;
			}
			else
			{
				if($addedLength > $span)
				{
					$newClust = new stdClass;
					$newClust->nodes =
					[
						$node
					];
					$newClust->i = 0;
					$newClust->length = $node->phraseLength;
					$clusters->data[] = $newClust;
					$clusters->i++;
				}
				else
				{
					$newClust = clone $lastClust;
					while($newClust->length + $addedLength > $span)
					{
						$shiftedNode = array_shift($newClust->nodes);
						if($shiftedNode === null)
						{
							break;
						}

						$newClust->i--;
						$removedLength = $shiftedNode->phraseLength;
						if(isset($newClust->nodes[0]))
						{
							$removedLength+= $newClust->nodes[0]->position - $shiftedNode->position;
						}

						$newClust->length-= $removedLength;
					}

					if($newClust->i < 0)
					{
						$newClust->i = 0;
					}

					$newClust->nodes[] = $node;
					$newClust->length+= $addedLength;
					$clusters->data[] = $newClust;
					$clusters->i++;
				}
			}
		}

		$bestClust = $clusters->data[0];
		$bestClustSize = count($bestClust->nodes);
		foreach($clusters->data as $clust)
		{
			$newClustSize = count($clust->nodes);
			if($newClustSize > $bestClustSize)
			{
				$bestClust = $clust;
				$bestClustSize = $newClustSize;
			}
		}

		$clustLeft = $bestClust->nodes[0]->position;
		$clustLen = $bestClust->length;
		$padding = intval(round(($span - $clustLen) / 2));
		$clustLeft-= $padding;
		if($clustLeft < 0)
		{
			$clustLen+= $clustLeft * -1 + $padding;
			$clustLeft = 0;
		}
		else
		{
			$clustLen+= $padding * 2;
		}
	}
	else
	{
		$clustLeft = 0;
		$clustLen = $span;
	}

	$textLen = mb_strlen($text);
	$prefix = '';
	$suffix = '';
	if($clustLeft > 0 && !ctype_space(mb_substr($text, $clustLeft, 1))
		&& !ctype_space(mb_substr($text, $clustLeft - 1, 1)))
	{
		$clustLeft++;
		// stop at the end of the text so a missing space cannot loop forever
		while($clustLeft < $textLen && !ctype_space(mb_substr($text, $clustLeft, 1)))
		{
			$clustLeft++;
		}

		$prefix = $delimiter;
	}

	$lastChar = $clustLeft + $clustLen;
	if($lastChar < $textLen && !ctype_space(mb_substr($text, $lastChar, 1))
		&& !ctype_space(mb_substr($text, $lastChar + 1, 1)))
	{
		$lastChar--;
		// stop at the left edge of the excerpt so a missing space cannot loop forever
		while($lastChar > $clustLeft && !ctype_space(mb_substr($text, $lastChar, 1)))
		{
			$lastChar--;
		}

		$suffix = $delimiter;
		$clustLen = $lastChar - $clustLeft;
	}

	if($clustLeft > 0)
	{
		$prefix = $delimiter;
	}
	if($clustLeft + $clustLen < $textLen)
	{
		$suffix = $delimiter;
	}

	return $prefix . trim(mb_substr($text, $clustLeft, $clustLen + 1)) . $suffix;
}
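
Editor's sketch: the key multibyte fix in the function above is the conversion of the byte offset reported by `PREG_OFFSET_CAPTURE` into a character offset. The snippet below isolates that one step. The sample text is an assumption made for illustration, and the code requires the mbstring extension and a UTF-8 encoded source file.

```php
<?php
$text = 'Привет мир';

// PREG_OFFSET_CAPTURE reports offsets in bytes, even with the /u modifier.
preg_match('/мир/u', $text, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];

// Convert to a character offset by measuring the prefix in characters.
$charOffset = mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');

echo "byte offset: $byteOffset, character offset: $charOffset\n";

// The character offset is what mb_substr() expects.
$word = mb_substr($text, $charOffset, 3, 'UTF-8');
echo $word, "\n";
```

Each Cyrillic letter takes two bytes in UTF-8, so the byte offset (13) is larger than the character offset (7); passing the byte offset to `mb_substr()` would point past the intended word.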

OMAR · Answer 4 · January 26, 2010

function excerpt($text, $phrase, $radius = 100, $ending = "...") {
    $phraseLen = strlen($phrase);
    if ($radius < $phraseLen) {
        $radius = $phraseLen;
    }

    // Try each word of the phrase in turn and stop at the first one found
    $phrases = explode(' ', $phrase);

    foreach ($phrases as $phrase) {
        $pos = strpos(strtolower($text), strtolower($phrase));
        // strpos() returns false when there is no match; 0 is a valid position
        if ($pos !== false) break;
    }

    $startPos = 0;
    if ($pos > $radius) {
        $startPos = $pos - $radius;
    }

    $textLen = strlen($text);

    $endPos = $pos + $phraseLen + $radius;
    if ($endPos >= $textLen) {
        $endPos = $textLen;
    }

    $excerpt = substr($text, $startPos, $endPos - $startPos);
    if ($startPos != 0) {
        $excerpt = substr_replace($excerpt, $ending, 0, $phraseLen);
    }

    if ($endPos != $textLen) {
        $excerpt = substr_replace($excerpt, $ending, -$phraseLen);
    }

    return $excerpt;
}

How to generate an excerpt with the most searched words in PHP?
