How to use streams for ETL of CSV data?

Streams are a built-in feature in Node.js and represent an asynchronous flow of data. Streams are also a way to handle reading and/or writing files. Because a Node.js stream processes data in small chunks, it can help you work with large files that would otherwise exceed your computer's free memory.

This is the fifth article of a series about streams in Node.js. This article is about how to perform ETL operations (Extract, Transform, Load) on CSV data using streams.

Streams in Node.js

  • What is a Stream in Node.js?
  • Connect streams with the pipe method
  • Handle stream errors
  • Connect streams with the pipeline method
  • Extract, transform and load data with streams (this article)

    Overview

    When working with flat data, we can just use the fs module and streams to process the data in a memory-efficient way. Instead of reading all the data into memory, we can read it in small chunks with the help of streams and avoid overconsuming memory.

    In this article we are going to create sample data in a CSV file, extract this data, transform it and load the data.

    A Comma-Separated Values (CSV) file is a delimited text file that uses a comma to separate values. We are going to transform the CSV data into JSON, or better, into ndjson, which is basically a file of JSON records separated by newlines, with the file extension .ndjson. You are probably asking yourself: why don't we just use JSON? The main reason is fault tolerance: if even a single invalid record is written into a JSON file, the entire file is corrupted. The main difference between JSON and ndjson is that in an ndjson file each line must contain a single, complete JSON record. Hence, an ndjson file contains valid JSON records, but the file as a whole is not a valid JSON document. The ndjson format works well with streaming data and large data sets where each record is processed individually.
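
    To make the difference concrete, here is a simplified illustration using two of the sample records (only two columns shown). A regular JSON file wraps everything in one document, so a single invalid record corrupts the whole file:

    [
      { "id": "100", "firstName": "Jobi" },
      { "id": "101", "firstName": "Dacia" }
    ]

    The same records as ndjson, with one complete JSON record per line:

    { "id": "100", "firstName": "Jobi" }
    { "id": "101", "firstName": "Dacia" }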

    We are going to:

  • Create CSV sample data

  • Initialize the project for NPM

  • Create a CSV parser

  • Add a transform stream

  • Run and done


    1. Create CSV data

    Let's create some sample CSV data. You can use the sample data below, or create your own data with FakerJS and convert it to CSV.

    id,firstName,lastName,email,email2,randomized
    100,Jobi,Taam,Jobi.Taam@yopmail.com,Jobi.Taam@gmail.com,Z lsmDLjL
    101,Dacia,Elephus,Dacia.Elephus@yopmail.com,Dacia.Elephus@gmail.com,Za jfPaJof
    102,Arlina,Bibi,Arlina.Bibi@yopmail.com,Arlina.Bibi@gmail.com,zmzlfER
    103,Lindie,Torray,Lindie.Torray@yopmail.com,Lindie.Torray@gmail.com,ibVggFEh
    104,Modestia,Leonard,Modestia.Leonard@yopmail.com,Modestia.Leonard@gmail.com," Tit KCrdh"
    105,Karlee,Cornelia,Karlee.Cornelia@yopmail.com,Karlee.Cornelia@gmail.com,PkQCUXzq
    106,Netty,Travax,Netty.Travax@yopmail.com,Netty.Travax@gmail.com,psJKWDBrXm
    107,Dede,Romelda,Dede.Romelda@yopmail.com,Dede.Romelda@gmail.com,heUrfT
    108,Sissy,Crudden,Sissy.Crudden@yopmail.com,Sissy.Crudden@gmail.com,cDJxC
    109,Sherrie,Sekofski,Sherrie.Sekofski@yopmail.com,Sherrie.Sekofski@gmail.com,dvYHUJ
    110,Sarette,Maryanne,Sarette.Maryanne@yopmail.com,Sarette.Maryanne@gmail.com,rskGIJNF
    111,Selia,Waite,Selia.Waite@yopmail.com,Selia.Waite@gmail.com,DOPBe
    112,Karly,Tjon,Karly.Tjon@yopmail.com,Karly.Tjon@gmail.com,zzef nCMVL
    113,Sherrie,Berriman,Sherrie.Berriman@yopmail.com,Sherrie.Berriman@gmail.com,rQqmjw
    114,Nadine,Greenwald,Nadine.Greenwald@yopmail.com,Nadine.Greenwald@gmail.com,JZsmKafeIf
    115,Antonietta,Gino,Antonietta.Gino@yopmail.com,Antonietta.Gino@gmail.com,IyuCBqwlj
    116,June,Dorothy,June.Dorothy@yopmail.com,June.Dorothy@gmail.com,vyCTyOjt
    117,Belva,Merriott,Belva.Merriott@yopmail.com,Belva.Merriott@gmail.com,MwwiGEjDfR
    118,Robinia,Hollingsworth,Robinia.Hollingsworth@yopmail.com,Robinia.Hollingsworth@gmail.com,wCaIu
    119,Dorthy,Pozzy,Dorthy.Pozzy@yopmail.com,Dorthy.Pozzy@gmail.com,fmWOUCIM
    120,Barbi,Buffum,Barbi.Buffum@yopmail.com,Barbi.Buffum@gmail.com,VOZEKSqrZa
    121,Priscilla,Hourigan,Priscilla.Hourigan@yopmail.com,Priscilla.Hourigan@gmail.com,XouVGeWwJ
    122,Tarra,Hunfredo,Tarra.Hunfredo@yopmail.com,Tarra.Hunfredo@gmail.com,NVzIduxd
    123,Madalyn,Westphal,Madalyn.Westphal@yopmail.com,Madalyn.Westphal@gmail.com,XIDAOx
    124,Ruthe,McAdams,Ruthe.McAdams@yopmail.com,Ruthe.McAdams@gmail.com,iwVelLKZH
    125,Maryellen,Brotherson,Maryellen.Brotherson@yopmail.com,Maryellen.Brotherson@gmail.com,nfoiVBjjqw
    126,Shirlee,Mike,Shirlee.Mike@yopmail.com,Shirlee.Mike@gmail.com,MnTkBSFDfo
    127,Orsola,Giule,Orsola.Giule@yopmail.com,Orsola.Giule@gmail.com,VPrfEYJi
    128,Linzy,Bennie,Linzy.Bennie@yopmail.com,Linzy.Bennie@gmail.com,ZHctp
    129,Vanessa,Cohdwell,Vanessa.Cohdwell@yopmail.com,Vanessa.Cohdwell@gmail.com,RvUcbJihHf
    130,Jaclyn,Salvidor,Jaclyn.Salvidor@yopmail.com,Jaclyn.Salvidor@gmail.com,gbbIxz
    131,Mildrid,Pettiford,Mildrid.Pettiford@yopmail.com,Mildrid.Pettiford@gmail.com,snyeV
    132,Carol-Jean,Eliathas,Carol-Jean.Eliathas@yopmail.com,Carol-Jean.Eliathas@gmail.com,EAAjYHiij
    133,Susette,Ogren,Susette.Ogren@yopmail.com,Susette.Ogren@gmail.com," BhYgr"
    134,Farrah,Suanne,Farrah.Suanne@yopmail.com,Farrah.Suanne@gmail.com,hYZbZIc
    135,Cissiee,Idelia,Cissiee.Idelia@yopmail.com,Cissiee.Idelia@gmail.com,PNuxbvjx
    136,Alleen,Clara,Alleen.Clara@yopmail.com,Alleen.Clara@gmail.com,YkonJWtV
    137,Merry,Letsou,Merry.Letsou@yopmail.com,Merry.Letsou@gmail.com,sLfCumcwco
    138,Fanny,Clywd,Fanny.Clywd@yopmail.com,Fanny.Clywd@gmail.com,Go kx
    139,Trixi,Pascia,Trixi.Pascia@yopmail.com,Trixi.Pascia@gmail.com,lipLcqRAHr
    140,Sandie,Quinn,Sandie.Quinn@yopmail.com,Sandie.Quinn@gmail.com,KrGazhI
    141,Dania,Wenda,Dania.Wenda@yopmail.com,Dania.Wenda@gmail.com,CXzs kDv
    142,Kellen,Vivle,Kellen.Vivle@yopmail.com,Kellen.Vivle@gmail.com,RrKPYqq
    143,Jany,Whittaker,Jany.Whittaker@yopmail.com,Jany.Whittaker@gmail.com,XAIufn
    144,Lusa,Fillbert,Lusa.Fillbert@yopmail.com,Lusa.Fillbert@gmail.com,FBFQnPm
    145,Farrah,Edee,Farrah.Edee@yopmail.com,Farrah.Edee@gmail.com,TrCwKb
    146,Felice,Peonir,Felice.Peonir@yopmail.com,Felice.Peonir@gmail.com,YtVZywf
    147,Starla,Juan,Starla.Juan@yopmail.com,Starla.Juan@gmail.com,aUTvjVNyw
    148,Briney,Elvyn,Briney.Elvyn@yopmail.com,Briney.Elvyn@gmail.com,tCEvgeUbwF
    149,Marcelline,Ricarda,Marcelline.Ricarda@yopmail.com,Marcelline.Ricarda@gmail.com,sDwIlLckbd
    150,Mureil,Rubie,Mureil.Rubie@yopmail.com,Mureil.Rubie@gmail.com,HbcfbKd
    151,Nollie,Dudley,Nollie.Dudley@yopmail.com,Nollie.Dudley@gmail.com,EzjjrNwVUm
    152,Yolane,Melony,Yolane.Melony@yopmail.com,Yolane.Melony@gmail.com,wfqSgpgL
    153,Brena,Reidar,Brena.Reidar@yopmail.com,Brena.Reidar@gmail.com,iTlvaS
    154,Glenda,Sabella,Glenda.Sabella@yopmail.com,Glenda.Sabella@gmail.com,zzaWxeI
    155,Paola,Virgin,Paola.Virgin@yopmail.com,Paola.Virgin@gmail.com,gJO hXTWZl
    156,Aryn,Erich,Aryn.Erich@yopmail.com,Aryn.Erich@gmail.com,qUoLwH
    157,Tiffie,Borrell,Tiffie.Borrell@yopmail.com,Tiffie.Borrell@gmail.com,cIYuVMHwF
    158,Anestassia,Daniele,Anestassia.Daniele@yopmail.com,Anestassia.Daniele@gmail.com,JsDbQbc
    159,Ira,Glovsky,Ira.Glovsky@yopmail.com,Ira.Glovsky@gmail.com,zKITnYXyhC
    160,Sara-Ann,Dannye,Sara-Ann.Dannye@yopmail.com,Sara-Ann.Dannye@gmail.com,wPClmU
    161,Modestia,Zina,Modestia.Zina@yopmail.com,Modestia.Zina@gmail.com,YRwcMqPK
    162,Kelly,Poll,Kelly.Poll@yopmail.com,Kelly.Poll@gmail.com,zgklmO
    163,Ernesta,Swanhildas,Ernesta.Swanhildas@yopmail.com,Ernesta.Swanhildas@gmail.com,tWafP
    164,Giustina,Erminia,Giustina.Erminia@yopmail.com,Giustina.Erminia@gmail.com,XgOKKAps
    165,Jerry,Kravits,Jerry.Kravits@yopmail.com,Jerry.Kravits@gmail.com,olzBzS
    166,Magdalena,Khorma,Magdalena.Khorma@yopmail.com,Magdalena.Khorma@gmail.com,BBKPB
    167,Lory,Pacorro,Lory.Pacorro@yopmail.com,Lory.Pacorro@gmail.com,YmWQB
    168,Carilyn,Ethban,Carilyn.Ethban@yopmail.com,Carilyn.Ethban@gmail.com,KUXenrJh
    169,Tierney,Swigart,Tierney.Swigart@yopmail.com,Tierney.Swigart@gmail.com,iQCQJ
    170,Beverley,Stacy,Beverley.Stacy@yopmail.com,Beverley.Stacy@gmail.com,NMrS Zpa f
    171,Ida,Dex,Ida.Dex@yopmail.com,Ida.Dex@gmail.com,hiIgOCxNg
    172,Sam,Hieronymus,Sam.Hieronymus@yopmail.com,Sam.Hieronymus@gmail.com,dLSkVe
    173,Lonnie,Colyer,Lonnie.Colyer@yopmail.com,Lonnie.Colyer@gmail.com,ZeDosRy
    174,Rori,Ethban,Rori.Ethban@yopmail.com,Rori.Ethban@gmail.com,SXFZQmX
    175,Lelah,Niles,Lelah.Niles@yopmail.com,Lelah.Niles@gmail.com,NwxvCXeszl
    176,Kathi,Hepsibah,Kathi.Hepsibah@yopmail.com,Kathi.Hepsibah@gmail.com,SOcAOSn
    177,Dominga,Cyrie,Dominga.Cyrie@yopmail.com,Dominga.Cyrie@gmail.com,IkjDyuqK
    178,Pearline,Bakerman,Pearline.Bakerman@yopmail.com,Pearline.Bakerman@gmail.com,vHVCkQ
    179,Selma,Gillan,Selma.Gillan@yopmail.com,Selma.Gillan@gmail.com,hSZgpBNsw
    180,Bernardine,Muriel,Bernardine.Muriel@yopmail.com,Bernardine.Muriel@gmail.com,AnSDTDa U
    181,Ermengarde,Hollingsworth,Ermengarde.Hollingsworth@yopmail.com,Ermengarde.Hollingsworth@gmail.com,IYQZ Nmv
    182,Marguerite,Newell,Marguerite.Newell@yopmail.com,Marguerite.Newell@gmail.com,kSaD uaHH
    183,Albertina,Nisbet,Albertina.Nisbet@yopmail.com,Albertina.Nisbet@gmail.com,Y jHyluB
    184,Chere,Torray,Chere.Torray@yopmail.com,Chere.Torray@gmail.com,loElYdo
    185,Vevay,O'Neill,Vevay.O'Neill@yopmail.com,Vevay.O'Neill@gmail.com,uLZSdatVn
    186,Ann-Marie,Gladstone,Ann-Marie.Gladstone@yopmail.com,Ann-Marie.Gladstone@gmail.com,fwKlEksI
    187,Donnie,Lymann,Donnie.Lymann@yopmail.com,Donnie.Lymann@gmail.com,deBrqXyyjf
    188,Myriam,Posner,Myriam.Posner@yopmail.com,Myriam.Posner@gmail.com,gEMZo
    189,Dale,Pitt,Dale.Pitt@yopmail.com,Dale.Pitt@gmail.com,OeMdG
    190,Cindelyn,Thornburg,Cindelyn.Thornburg@yopmail.com,Cindelyn.Thornburg@gmail.com,kvhFmKGoMZ
    191,Maisey,Hertzfeld,Maisey.Hertzfeld@yopmail.com,Maisey.Hertzfeld@gmail.com,OajjJ
    192,Corina,Heisel,Corina.Heisel@yopmail.com,Corina.Heisel@gmail.com,luoDJeHo
    193,Susette,Marcellus,Susette.Marcellus@yopmail.com,Susette.Marcellus@gmail.com,AXHtR AyV
    194,Lanae,Sekofski,Lanae.Sekofski@yopmail.com,Lanae.Sekofski@gmail.com,FgToedU
    195,Linet,Beebe,Linet.Beebe@yopmail.com,Linet.Beebe@gmail.com,DYGfRP
    196,Emilia,Screens,Emilia.Screens@yopmail.com,Emilia.Screens@gmail.com,LXUcleSs
    197,Tierney,Avi,Tierney.Avi@yopmail.com,Tierney.Avi@gmail.com,VegzbHH
    198,Pollyanna,Thar,Pollyanna.Thar@yopmail.com,Pollyanna.Thar@gmail.com,GjYeEGK
    199,Darci,Elephus,Darci.Elephus@yopmail.com,Darci.Elephus@gmail.com,DaQNdN
    


    Create a project folder:

    mkdir node-streams-etl
    


    Create a data folder and the CSV file inside it:

    cd node-streams-etl
    mkdir data
    touch data/sample-data.csv
    


    Copy all the sample data into the csv file and save it. You can copy and paste it, use fs.writeFile in the REPL, or call fs.writeFile from the terminal with the -p flag.
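
    If you would rather script this step, a minimal sketch could look like the following (create-sample-data.js is just a hypothetical helper name, and the rows array has to be filled with the sample lines from above):

    // create-sample-data.js (hypothetical helper, not part of the tutorial code)
    const fs = require('fs');

    // paste the sample lines from above into this array, one entry per line
    const rows = [
      'id,firstName,lastName,email,email2,randomized',
      '100,Jobi,Taam,Jobi.Taam@yopmail.com,Jobi.Taam@gmail.com,Z lsmDLjL',
      // ...remaining rows
    ];

    // write the CSV file that the pipeline will read from the data folder
    fs.writeFileSync('data/sample-data.csv', rows.join('\n') + '\n');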

    2. Initialize the project for NPM

    We are going to use npm packages, so we have to initialize the project to get a package.json.

    npm init -y
    

    Let's add a main file for the code.

    touch index.js
    
    First, we are going to create a readable stream to read the CSV data from sample-data.csv, and a writable stream, which will be the destination. For now, we just copy the sample data. To connect the inputStream and the outputStream we are going to use the pipeline method, because error handling with it is much easier than with the pipe method. Check out the article How to connect streams with the pipeline method.

    const fs = require('fs');
    const { pipeline } = require('stream');
    
    const inputStream = fs.createReadStream('data/sample-data.csv');
    const outputStream = fs.createWriteStream('data/sample-data.ndjson');
    
    pipeline(inputStream, outputStream, err => {
      if (err) {
        console.log('Pipeline encountered an error.', err);
      } else {
        console.log('Pipeline completed successfully.');
      }
    });
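
    You can already run this intermediate version to check that the plumbing works; for now, the output file is simply an unmodified copy of the CSV data:

    node index.js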
    


    3. Create the CSV parser

    We have to convert the CSV file to JSON and, as so often, there is a package for the problem: in this use case, csvtojson. The module parses the header row to get the keys and then parses each following row into a JSON object.

    Let's install it.

    npm install csvtojson
    


    Once it is installed, we can require the module and add it to the pipeline after the inputStream. The data will flow from the CSV file through the CSV parser and then into the output file.
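
    As a quick sanity check of what the parser produces, here is a small standalone sketch using csvtojson's fromString helper (separate from the pipeline we build next):

    const csv = require('csvtojson');

    // the first line becomes the keys, every following line becomes an object
    csv()
      .fromString('id,firstName\n100,Jobi')
      .then(rows => {
        console.log(rows); // [ { id: '100', firstName: 'Jobi' } ]
      });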

    We are going to use the pipeline method, because it has been the preferred way to connect streams and pipe data between them since Node.js v10. It also helps to clean up streams on completion or failure: when an error occurs, the streams involved are destroyed to avoid memory leaks.

    const fs = require('fs');
    const { pipeline } = require('stream');
    const csv = require('csvtojson');
    
    const inputStream = fs.createReadStream('data/sample-data.csv');
    const outputStream = fs.createWriteStream('data/sample-data.ndjson');
    
    const csvParser = csv();
    
    pipeline(inputStream, csvParser, outputStream, err => {
      if (err) {
        console.log('Pipeline encountered an error.', err);
      } else {
        console.log('Pipeline completed successfully.');
      }
    });
    


    4. Add a transform stream

    The data is now emitted to the outputStream as ndjson, with each data row being a valid JSON record. Now we want to transform the data. Since we are using csvtojson, we could use its built-in subscribe method, which lets us handle each record right after it has been parsed. However, we want to create a transform stream instead. Our sample data has the keys id, firstName, lastName, email, email2, randomized. We want to drop the randomized and email properties in each entry and rename email2 to emailBusiness.
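
    For comparison, a rough sketch of the subscribe-based alternative (which we are not using here) could replace the plain csv() call like this, reshaping each parsed record in place:

    // alternative: let csvtojson reshape each record via its subscribe hook
    const csvParser = csv().subscribe(json => {
      // rename email2 and drop the properties we do not want to keep
      json.emailBusiness = json.email2;
      delete json.email2;
      delete json.email;
      delete json.randomized;
    });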

    Transform streams must implement a transform method that receives a chunk of data as its first argument. It also receives the encoding type of the chunk and a callback function.

    // Transform comes from the built-in stream module
    const { Transform } = require('stream');

    const transformStream = new Transform({
      transform(chunk, encoding, cb) {
        try {
          // parse the JSON record emitted by the CSV parser
          const person = JSON.parse(chunk);
          // keep only the properties we want and rename email2 to emailBusiness
          const transformed = {
            id: person.id,
            firstName: person.firstName,
            lastName: person.lastName,
            emailBusiness: person.email2,
          };
          cb(null, JSON.stringify(transformed) + '\n');
        } catch (err) {
          cb(err);
        }
      },
    });
    

    Now let's add the transformStream to the pipeline.

    pipeline(
      inputStream,
      csvParser,
      transformStream,
      outputStream,
      err => {
        if (err) {
          console.log('Pipeline encountered an error.', err);
        } else {
          console.log('Pipeline completed successfully.');
        }
      },
    );
    

    5. Run and done

    Run the application with node index.js; the first records in the ndjson file should look like this:

    {"id":"100","firstName":"Jobi","lastName":"Taam","emailBusiness":"Jobi.Taam@gmail.com"}
    {"id":"101","firstName":"Dacia","lastName":"Elephus","emailBusiness":"Dacia.Elephus@gmail.com"}
    {"id":"102","firstName":"Arlina","lastName":"Bibi","emailBusiness":"Arlina.Bibi@gmail.com"}
    

    Error handling always has to be done when working with streams. Since the pipeline method already handles errors for all the streams involved, the sample project is done.

    Congratulations. 🚀✨

    TL;DR

    • The Newline-delimited JSON (ndjson) format works well with streaming data and large sets of data, where each record is processed individually, and it helps to reduce errors.
    • Using pipeline simplifies error handling and stream cleanup, and it makes combining streams more readable and maintainable.

    Thanks for reading and if you have any questions, use the comment function or send me a message.

    If you want to know more about Node, have a look at these Node Tutorials.

    References (and big thanks):

    HeyNode, Node.js - Streams, MDN - Streams, Format and MIME Type, ndjson, csvtojson
