Spark simulation: finding the most-visited URLs per domain for a website

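Suppose there is an IT education website with several subject areas such as Java, PHP, and .NET. Below is a simulated access log for the site.
The first field is the access timestamp and the second field is the visited URL. Each subject area has its own domain name, as follows: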
java.aaaaaaa.cn
net.aaaaaaa.cn
php.aaaaaaa.cn
20160321101954  http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101954  http://java.aaaaaaa.cn/java/video.shtml
20160321101954  http://java.aaaaaaa.cn/java/teacher.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101954  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101954  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/base.shtml
20160321101954  http://net.aaaaaaa.cn/net/course.shtml
20160321101954  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101954  http://net.aaaaaaa.cn/net/video.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/base.shtml
20160321101954  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101954  http://java.aaaaaaa.cn/java/video.shtml
20160321101954  http://java.aaaaaaa.cn/java/video.shtml
20160321101954  http://net.aaaaaaa.cn/net/video.shtml
20160321101954  http://net.aaaaaaa.cn/net/course.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101955  http://php.aaaaaaa.cn/php/course.shtml
20160321101955  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101955  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/base.shtml
20160321101955  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955  http://php.aaaaaaa.cn/php/video.shtml
20160321101955  http://net.aaaaaaa.cn/net/course.shtml
20160321101955  http://php.aaaaaaa.cn/php/video.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955  http://net.aaaaaaa.cn/net/video.shtml
20160321101955  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101955  http://java.aaaaaaa.cn/java/teacher.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101955  http://net.aaaaaaa.cn/net/video.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101956  http://net.aaaaaaa.cn/net/video.shtml
20160321101956  http://net.aaaaaaa.cn/net/video.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101956  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101956  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/base.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101956  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101956  http://net.aaaaaaa.cn/net/course.shtml
20160321101956  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101956  http://php.aaaaaaa.cn/php/video.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101957  http://java.aaaaaaa.cn/java/teacher.shtml
20160321101957  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101957  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101957  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101957  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101957  http://php.aaaaaaa.cn/php/course.shtml
20160321101957  http://java.aaaaaaa.cn/java/course/base.shtml
20160321101957  http://net.aaaaaaa.cn/net/course.shtml
20160321101957  http://java.aaaaaaa.cn/java/video.shtml
20160321101957  http://php.aaaaaaa.cn/php/video.shtml
20160321101957  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101957  http://java.aaaaaaa.cn/java/video.shtml
20160321101957  http://net.aaaaaaa.cn/net/video.shtml
20160321101957  http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101957  http://net.aaaaaaa.cn/net/course.shtml
20160321101957  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101957  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101958  http://net.aaaaaaa.cn/net/course.shtml
20160321101958  http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101958  http://php.aaaaaaa.cn/php/video.shtml
20160321101958  http://php.aaaaaaa.cn/php/course.shtml
20160321101958  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101958  http://net.aaaaaaa.cn/net/video.shtml
20160321101958  http://java.aaaaaaa.cn/java/course/base.shtml
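
As a small illustration of the log format described above, the sketch below splits one line into its two fields and extracts the domain name with `java.net.URL`. The object name `LogLineDemo` and the helper `parse` are hypothetical names introduced for this example. Splitting on any run of whitespace is an assumption made here so that both tab-separated and space-separated lines are accepted.

```scala
import java.net.URL
import scala.util.Try

object LogLineDemo {
  // Hypothetical helper: turns one log line into (timestamp, url, host).
  // Returns None for lines that do not have exactly two fields
  // or whose second field is not a valid URL.
  def parse(line: String): Option[(String, String, String)] =
    line.trim.split("\\s+") match {
      case Array(ts, url) =>
        Try(new URL(url).getHost).toOption.map(host => (ts, url, host))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    val sample = "20160321101954  http://java.aaaaaaa.cn/java/course/javaee.shtml"
    println(parse(sample))
  }
}
```

Returning an `Option` lets a caller drop malformed lines with `flatMap` instead of failing the whole job on one bad record.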

Requirement: for each domain name, find the three most-visited URLs.
Code:
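
Before involving Spark, the requirement can be sketched on plain Scala collections: count each URL, group the counts by domain name, then keep the top entries of each group. The object name `TopUrlsLocal` and the method `topPerHost` are hypothetical, and the sketch assumes every input string is a valid URL.

```scala
import java.net.URL

object TopUrlsLocal {
  // Counts how often each URL occurs, groups the counts by host,
  // and keeps the n most-visited URLs for every host.
  def topPerHost(urls: Seq[String], n: Int): Map[String, List[(String, Int)]] =
    urls
      .groupBy(identity)
      .map { case (url, hits) => (url, hits.size) }
      .toList
      .groupBy { case (url, _) => new URL(url).getHost }
      .map { case (host, counted) =>
        host -> counted.sortBy { case (_, c) => -c }.take(n)
      }
}
```

The order among URLs with equal counts is not defined in this sketch, so a tie at the cut-off may resolve either way.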
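import java.net.URL

import org.apache.spark.{SparkConf, SparkContext}


object UrlCount {
  def main(args: Array[String]): Unit = {
    val conf  = new SparkConf().setAppName("UrlCount").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Read the log and emit (url, 1) for every line.
    // Splitting on any whitespace run accepts both tab- and space-separated fields.
    val rdd1 = sc.textFile("E:\\aaaaaa.log").map(line => {
      val f = line.trim.split("\\s+")
      (f(1), 1)
    })

    // Total number of visits per URL.
    val rdd2 = rdd1.reduceByKey(_ + _)

    // Attach the host (domain name) to every (url, count) pair.
    val rdd3 = rdd2.map(t => {
      val url = t._1
      val host = new URL(url).getHost
      (host, url, t._2)
    })

    // Group by host and keep the three URLs with the highest counts.
    val rdd4 = rdd3.groupBy(_._1).mapValues(it => {
      it.toList.sortBy(-_._3).take(3)
    })

    // Transformations are lazy: an action is needed to run the job and see the result.
    rdd4.collect().foreach(println)

    sc.stop()
  }
}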
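
Grouping by domain name collects every URL of a domain into one in-memory list before sorting, which may become costly when a domain has very many distinct URLs. One possible alternative, sketched here as an assumption rather than as part of the original program, is to keep only a bounded top-N list while aggregating. The object `TopN` and its functions `add` and `merge` are hypothetical names; the Spark call in the trailing comment uses `aggregateByKey` on a hypothetical pair RDD called `hostToHit`.

```scala
object TopN {
  type Hit = (String, Int) // (url, count)

  // Returns a function that adds one hit to a partial result
  // and keeps only the n entries with the highest counts.
  def add(n: Int): (List[Hit], Hit) => List[Hit] =
    (acc, hit) => (hit :: acc).sortBy(-_._2).take(n)

  // Returns a function that merges two partial results,
  // again keeping only the n entries with the highest counts.
  def merge(n: Int): (List[Hit], List[Hit]) => List[Hit] =
    (left, right) => (left ::: right).sortBy(-_._2).take(n)
}

// Possible use with Spark, assuming hostToHit: RDD[(String, (String, Int))]
// holding (host, (url, count)) pairs:
//   hostToHit.aggregateByKey(List.empty[TopN.Hit])(TopN.add(3), TopN.merge(3))
```

Each partial list never grows beyond n entries, so memory use per domain stays bounded regardless of how many URLs the domain has.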
