Spark Programming, Introduction to RDDs, RDD Transformations, RDD Actions, DataFrame, SparkSQL

Spark architecture: the driver program creates a SparkContext (sc), which connects to and coordinates the workers.
      The DAGScheduler splits the RDD lineage into stages.
      The TaskScheduler packages each stage as a taskSet and dispatches its tasks to the workers.
      RDD transformations (run on the workers, lazy): map, flatMap, join (between two RDDs), groupByKey, reduceByKey, filter
      RDD actions (results return to the driver): take, collect, reduce ...

-------------------------------------------------

ETL: extract, transform, and load the raw data. Typically done with SQL or code (e.g. Scala).

vi preS.py

from pyspark import SparkConf,SparkContext

# pyspark provides SparkConf (the configuration) and SparkContext (sc, the connection between the driver and the workers); a SparkContext is built from a SparkConf

conf = SparkConf().setMaster('local[*]').setAppName('he')
# local[*] means use all available local cores
sc = SparkContext(conf=conf)
print(sc)

rdd = sc.textFile("/user/hadoop/datas.csv")
rdd = rdd.mapPartitions(lambda it: [x for x in it if x != ''])  # drop empty lines
rdd = rdd.map(lambda x: x.split(','))                           # split each CSV line into fields
rdd = rdd.map(lambda x: ((x[0], x[1]), float(x[2])))            # ((key1, key2), value); cast so "+" sums instead of concatenating strings
rdd = rdd.reduceByKey(lambda x, y: x + y)                       # sum the values per key

datas=rdd.collect()

for d in datas:
    print(d)

spark-submit preS.py: spark-submit submits the job; in client mode the Driver runs on the Client machine.

Running spark-submit preS.py is all it takes to submit and execute the script!

RDD characteristics:
    1. An RDD (Resilient Distributed Dataset) is Spark's basic abstraction: a distributed collection of elements.
    2. The data is split into partitions spread across the cluster nodes.
    3. RDDs are created from external data sources or from other RDDs.
    4. Each RDD remembers the lineage used to build it, so lost partitions can be recomputed (fault tolerance).
    5. RDDs can be persisted in memory for repeated use (caching).
    6. Transformations are lazy: an RDD is only computed when an action needs its result.
    7. RDDs are immutable; the RDD API (originally Scala) returns a new RDD from every transformation.
    8. Two kinds of RDD operations: transformations (run on the workers) and actions (results return to the driver); see the sketch after this list.
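A minimal sketch (assuming a live SparkContext `sc`) of points 6-8: transformations build new RDDs lazily, and nothing runs until an action is called.

    nums = sc.parallelize([1, 2, 3, 4])
    doubled = nums.map(lambda x: x * 2)       # transformation: returns a NEW RDD, nums itself is unchanged
    evens = doubled.filter(lambda x: x > 4)   # still nothing has been computed (lazy)
    print(evens.collect())                    # action: triggers the computation on the workers -> [6, 8]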

Creating an RDD (2 ways), sketched further below:

    From an in-memory collection:
        rdd = sc.parallelize([21,4,67,34]): distributes a Python list as an RDD across the workers (the Hadoop cluster)
            rdd.collect(): gathers all elements of the RDD back to the driver (the client machine)

    From an external file:
        sc.textFile("/user/hadoop/datas.csv")
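A small sketch (assuming `sc`; the partition layout shown is illustrative) of how parallelize() splits the data into partitions:

    rdd = sc.parallelize([21, 4, 67, 34], numSlices=2)  # ask for 2 partitions
    print(rdd.getNumPartitions())                       # -> 2
    print(rdd.glom().collect())                         # elements grouped per partition, e.g. [[21, 4], [67, 34]]

    lines = sc.textFile("/user/hadoop/datas.csv")       # one element per line; partitioning follows the HDFS blocks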

RDD operations (2 kinds):
    1. Transformations (e.g. map, filter) return a new RDD. Transformations are Lazy: they are not executed right away, but only when an Action needs their result (Lazy Evaluation). Examples follow after the list.
      Common ones:
        map                     : one-to-one, each input element produces one output element
        flatMap                 : one-to-many, the results are flattened into a single RDD
        filter                  : keeps only the elements that satisfy a predicate
        reduceByKey             : merges the values of each key with a reduce function
        groupByKey              : groups the values of each key
        combineByKey            : takes three functions (createCombiner, mergeValue, mergeCombiners)
        union()                 : union of two RDDs
        intersection()          : intersection of two RDDs
        subtract()              : difference of two RDDs
        cartesian()             : cartesian product of two RDDs
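A quick sketch (assuming `sc`) of the transformations above; the comments show what collect() would return:

    words = sc.parallelize(["a b", "b c"])
    words.map(lambda s: s.split(" ")).collect()       # [['a', 'b'], ['b', 'c']]   one list per element
    words.flatMap(lambda s: s.split(" ")).collect()   # ['a', 'b', 'b', 'c']       flattened

    nums = sc.parallelize([1, 2, 3, 4])
    nums.filter(lambda x: x % 2 == 0).collect()       # [2, 4]

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    pairs.reduceByKey(lambda x, y: x + y).collect()   # [('a', 4), ('b', 2)]  (order may vary)

    other = sc.parallelize([2, 3])
    nums.union(other).collect()                       # [1, 2, 3, 4, 2, 3]
    nums.intersection(other).collect()                # [2, 3]  (order may vary)
    nums.subtract(other).collect()                    # [1, 4]  (order may vary)
    nums.cartesian(other).count()                     # 8 pairs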

    2. Actions (e.g. count, collect) return a value to the driver instead of a new RDD. Actions are what actually trigger computation in Spark. collect(), for example, returns the results to the driver as a Python list. A sketch follows after the list.
      Common ones:
    take collect reduce top count first foreach
    aggregate(a,b,c)     : takes 3 arguments: a zero value, a function that merges one element into the accumulator within a partition, and a function that merges the accumulators of different partitions
    reduce               : aggregates all elements with a binary function
    countByValue         : counts how many times each value occurs
    foreach              : applies a function to every element; it runs on the workers and returns nothing to the driver
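A sketch of these actions (assuming `sc`); the comments show the expected results:

    nums = sc.parallelize([3, 1, 4, 1, 5])
    nums.count()                               # 5
    nums.first()                               # 3
    nums.take(2)                               # [3, 1]
    nums.top(2)                                # [5, 4]
    nums.reduce(lambda x, y: x + y)            # 14
    nums.countByValue()                        # {3: 1, 1: 2, 4: 1, 5: 1}
    # aggregate(zero, merge-within-partition, merge-across-partitions) -> (sum, count)
    nums.aggregate((0, 0),
                   lambda acc, x: (acc[0] + x, acc[1] + 1),
                   lambda a, b: (a[0] + b[0], a[1] + b[1]))   # (14, 5)
    nums.foreach(lambda x: print(x))           # the printing happens on the workers, not on the driver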

RDD cache() (in memory) / persist() (memory and/or disk): avoid recomputing the same RDD!!
    path='hdfs://quickstart.cloudera:8020/user/cloudera/recom/user_friends.csv'
    rows = sc.textFile(path)
    rows.persist()   # or rows.cache()
    rows.count()
    rows.count()
    Only the first action (rows.count()) actually computes the RDD; the second one is served from the cache.
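A related sketch (same `sc` and path) showing an explicit storage level and how to release the cache; MEMORY_AND_DISK spills to disk when memory runs short:

    from pyspark import StorageLevel

    rows = sc.textFile(path)
    rows.persist(StorageLevel.MEMORY_AND_DISK)  # keep in memory, spill to disk if needed
    rows.count()                                # first action: computes and caches the RDD
    rows.count()                                # answered from the cache
    rows.unpersist()                            # drop the cached partitions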

DataFrame: a distributed collection of data organized into named columns. It started out as SchemaRDD: an RDD with a schema that SparkSQL can query. Later Spark versions renamed SchemaRDD to DataFrame.

DataFrame (SchemaRDD) creation (4 ways)
 1. sc.parallelize + toDF() -> DataFrame
    rdd = sc.parallelize([(21, 4), (67, 34)])        # rows as tuples so a schema can be inferred
    df = rdd.toDF(schema=['col1', 'col2'])           # placeholder column names
    df.registerTempTable("mytable")
    mytabledf = spark.sql("SELECT * FROM mytable")   # mytabledf is again a DataFrame

 2. createDataFrame -> DataFrame
    from pyspark.sql import SQLContext, Row
    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])
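A brief sketch of inspecting the DataFrame built in method 2:

    df.printSchema()                  # a: long, intlist: array<bigint>, mapfield: map<string,string>
    df.select("a", "intlist").show()  # tabular preview of two columns
    df.collect()                      # [Row(a=1, intlist=[1, 2, 3], mapfield={'a': 'b'})]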

 3.  json, parquet, AVRO   CSV       SchemaRDD DataFrame。                
    i: JSON         
    sqlContext = SQLContext(sc)
    df = sqlContext.read.json("c:/temp/people.json")
    ii: parquet        
    df=sqlContext.read.parquet("c:/temp/people.parquet")
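The CSV case is not shown above; a sketch assuming Spark 2.x (earlier versions need the spark-csv package) and a hypothetical c:/temp/people.csv with a header row:

    iii: load a CSV file
    df = sqlContext.read.csv("c:/temp/people.csv", header=True, inferSchema=True)
    df.printSchema()   # column names come from the header, types are inferred from the data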

 4. Convert a PYTHON or R DataFrame into a Spark SchemaRDD/DataFrame
    import pandas as pd
    sqlContext = SQLContext(sc)
    path='c:/temp/cs-training.csv'
    # pandas DataFrame -> Spark DataFrame
    scorecard = pd.read_csv(path, header=None, names=['col1','col2',...])   # column names elided here
    spDF = sqlContext.createDataFrame(scorecard)
    # Spark DataFrame -> pandas DataFrame
    spDF.toPandas().head()

  5. SparkSQL: SQL queries return DataFrames (the table has to be registered as a temp table first)
    allpeople = sqlContext.sql("SELECT * FROM people where id=4")
    allpeople.show()
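For completeness, a sketch (reusing the `df` loaded from people.json above, and assuming it has an id column) of the registration step plus the equivalent DataFrame-API query:

    df.registerTempTable("people")                 # make the DataFrame queryable by name
    sqlContext.sql("SELECT * FROM people WHERE id = 4").show()
    df.filter(df.id == 4).show()                   # same result via the DataFrame API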

