The Most Pythonic Tools for Solving ML Problems

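Every machine learning project has plenty of Python libraries waiting to be exploited, and there are many articles on the internet about common ones such as NumPy, Pandas, scikit-learn, Seaborn, and Matplotlib. But while reading about all of those libraries, we tend to skip over the basic features of Python itself.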

In this article, I cover some underrated built-in features and libraries that help anyone starting an ML project get the most out of Python. Without wasting any more time, let's get straight to it.

Contents


  • List Comprehension
  • PDB - Python Debugger
  • OS - Operating System
  • Sets
  • Time
  • Venv - Python Virtual Environment
  • Conclusion
  • References

  • List Comprehension

    When a list of items needs to be manipulated and stored in a different list, or manipulated as an intermediate stage before some other operation, list comprehension is a handy tool.

    Let's say there is a list and we want to square all the numbers in the list. The usual loop method would be:

    list1 = [1, 2, 3, 4, 5]
    list2 = []
    for number in list1:
        list2.append(number**2)
    print(list2)
    

    That's a lot of lines for a simple transformation. If we use list comprehension, we can write this in a single line as:

    list2 = [number**2 for number in list1]

    Cool, now what if we have a 2D list (a list of lists) and we want to convert it into a flat list that contains the squares of those numbers?
    The normal solution would be:

    list1 = [[1, 2, 3], [4, 5, 6]]
    list2 = []
    for data in list1:
        for number in data:
            list2.append(number**2)
    print(list2)
    

    A more pythonic solution would be:

    list2 = [number**2 for data in list1 for number in data] 

    Play around with list comprehension, and you'll never want to go back to the normal way!
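
As one way to play around, a comprehension can also carry an `if` clause that filters items while transforming them. The sketch below reuses the same numbers as the first example; the name `even_squares` is made up for illustration.

```python
list1 = [1, 2, 3, 4, 5]

# Keep only the even numbers, then square them.
# The `if` clause is evaluated for each item before the expression on the left.
even_squares = [number**2 for number in list1 if number % 2 == 0]
print(even_squares)  # [4, 16]
```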

    PDB - Python Debugger

    This is the built-in debugger for Python. Let's say we don't know what is going on with a particular code snippet and it is producing an unintended result. One way to debug is to add a print statement and print the variables along with a message like:

    print('Looks like trouble_1...')

    We may need to run the whole code multiple times before we realize where exactly the issue is. This can delay the project by a substantial amount of time.

    The second method is a more Pythonic way to debug snippets in the blink of an eye. PDB works the way debuggers usually do, by setting breakpoints and printing call stacks, but that's the geeky stuff. What it gives you in practice: once execution stops at a breakpoint, you can see all variables and their values at that point, create new variables, and run code as you would in a standalone environment.
    The way to add the PDB before the suspected error code is given below:

    import pdb
    # Correct code segment
    pdb.set_trace()
    # Code here might be a bit sus
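
To make the workflow concrete, here is a minimal sketch of a PDB session. The function `buggy_average` and its off-by-one bug are hypothetical. So that the example runs without a keyboard, the debugger commands (`p` to print a variable, `c` to continue) are fed from a string; in normal use you would simply call `pdb.set_trace()` and type the same commands at the `(Pdb)` prompt.

```python
import io
import pdb


def buggy_average(values, debugger):
    """Hypothetical function with an off-by-one bug in `count`."""
    total = sum(values)
    count = len(values) - 1   # bug: should be len(values)
    debugger.set_trace()      # execution pauses here and PDB takes over
    return total / count


# Scripted stand-in for what you would type at the (Pdb) prompt.
commands = io.StringIO("p total\np count\nc\n")
transcript = io.StringIO()  # captures what PDB would print to the terminal

debugger = pdb.Pdb(stdin=commands, stdout=transcript, nosigint=True, readrc=False)
result = buggy_average([1, 2, 3], debugger)

print(transcript.getvalue())  # shows the paused line and the inspected values
print(result)                 # 3.0, which exposes the wrong divisor
```

Inspecting `count` at the breakpoint shows 2 instead of 3, which points at the faulty line without rerunning the script with extra print statements.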
    

    OS - Operating System

    While running ML code, you'll feel the urge to store intermediate files and artifacts in certain directories, check whether a directory exists, delete your office's files, or run custom shell scripts to hack your neighbor's machine. For all of that, you're going to rely heavily on the os library in Python. It contains almost all the methods you'll ever need to call the operating system's operations.
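
A minimal sketch of the more respectable uses mentioned above: checking whether a directory exists, creating it, and storing an intermediate file. The directory layout and file name are placeholders, and everything happens inside a temporary directory so nothing is left behind.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch cleans up after itself.
with tempfile.TemporaryDirectory() as base_dir:
    # Hypothetical layout for run artifacts.
    artifacts_dir = os.path.join(base_dir, "artifacts", "run_1")

    # Check whether the directory exists and create it (with parents) if not.
    if not os.path.exists(artifacts_dir):
        os.makedirs(artifacts_dir)

    # Store an intermediate file (placeholder content).
    metrics_path = os.path.join(artifacts_dir, "metrics.txt")
    with open(metrics_path, "w") as f:
        f.write("placeholder\n")

    # List what the directory now contains.
    created_files = os.listdir(artifacts_dir)
    print(created_files)  # ['metrics.txt']

    # Delete the file again.
    os.remove(metrics_path)
    remaining_files = os.listdir(artifacts_dir)
```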

    Sets



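    One of Python's built-in data structures is the set. It closely resembles the sets of mathematical set theory. Python sets support various set operations such as intersection, difference, and union.

    Sets are useful when comparing data, finding unique items in a file, extracting common items from data, or performing similar extraction tasks. For example, suppose we have 2 sets: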

    fruits = {"tomato", "apple", "banana", "orange"}
    veggies = {"tomato", "cabbage", "potato", "onion"}
    


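    Now, to find out which food items are both a vegetable and a fruit, we can easily perform a set intersection: fruits.intersection(veggies).

    To do this the usual way, the most naive approach would be to run 2 loops, compare the elements, and keep appending the common food items to another list.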
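
The sketch below puts the two approaches side by side, using the same fruits and veggies as above, and also shows the difference and union operations mentioned earlier. The variable names other than `fruits` and `veggies` are chosen here for illustration.

```python
fruits = {"tomato", "apple", "banana", "orange"}
veggies = {"tomato", "cabbage", "potato", "onion"}

# Set operations (each has an operator shorthand).
both = fruits.intersection(veggies)        # same as fruits & veggies
only_fruits = fruits.difference(veggies)   # same as fruits - veggies
everything = fruits.union(veggies)         # same as fruits | veggies

# The naive alternative: two loops and a separate result list.
common = []
for fruit in fruits:
    for veggie in veggies:
        if fruit == veggie:
            common.append(fruit)

print(both)    # {'tomato'}
print(common)  # ['tomato']
```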

    Time

    An ML engineer's most important resource is time, and there may be times when a script takes far too long to run. Performance issues can creep into the code for various reasons, and until you know which part of the code takes the longest, it is tough to pinpoint the issue. For locating the longest-running snippets, the time library plays an important role.
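
One way to do this is to wrap each suspect step with `time.perf_counter()` and compare the elapsed times. In the sketch below, `load_data` and `train_model` are hypothetical stand-ins that just sleep; replace them with the real steps of your script.

```python
import time


def load_data():
    """Hypothetical fast step."""
    time.sleep(0.01)


def train_model():
    """Hypothetical slow step."""
    time.sleep(0.1)


timings = {}
for step in (load_data, train_model):
    start = time.perf_counter()          # high-resolution clock for measuring intervals
    step()
    timings[step.__name__] = time.perf_counter() - start

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")

# The step with the largest elapsed time is the first place to look.
slowest = max(timings, key=timings.get)
print("slowest step:", slowest)
```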

    Venv - Python Virtual Environment

    This is one of the most important tools an ML engineer uses. It creates an independent environment in which to run scripts, which eliminates many dependency conflicts between projects.

    Let's try to understand this with a scenario. Suppose there are 2 projects that require different versions of the same Python library, with the constraint that only 1 version of the library can be installed at any point in time. Running both scripts on a single machine looks impossible, and within a single environment it is. There are many ways to solve this problem; the simplest is to create 2 different environments, install the required dependencies in each, and run each script in its own environment.

    As an ML engineer, you will work on multiple projects simultaneously, and it's always recommended to use a separate environment for each project so as not to run into dependency issues.
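
Environments are usually created from the command line with `python -m venv <directory>`. The sketch below uses the standard library's `venv.create` to do the same for the two-project scenario. The environment names are hypothetical, and the environments are created in a temporary directory purely for demonstration.

```python
import os
import tempfile
import venv

with tempfile.TemporaryDirectory() as base_dir:
    env_dirs = []
    for name in ("project_a_env", "project_b_env"):  # hypothetical names
        env_dir = os.path.join(base_dir, name)
        # with_pip=False keeps the sketch quick; pass with_pip=True
        # when you want pip available for installing dependencies.
        venv.create(env_dir, with_pip=False)
        env_dirs.append(env_dir)

    # Each environment gets its own pyvenv.cfg and its own site-packages,
    # so the two projects can hold different versions of the same library.
    created = [os.path.exists(os.path.join(d, "pyvenv.cfg")) for d in env_dirs]

print(created)
```

In day-to-day use, you would activate an environment (`source <directory>/bin/activate` on Linux or macOS, `<directory>\Scripts\activate` on Windows) and then install that project's dependencies into it with pip.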

    Conclusion

    These are a few tools I found to be basic and powerful yet underrated for anyone starting an ML project. They will not only boost your productivity but also make you realize why Python is the de facto language for ML projects!

    References

    https://docs.python.org/3/library/pdb.html
    https://docs.python.org/3/library/os.html
    https://docs.python.org/3/library/time.html
    https://docs.python.org/3/tutorial/venv.html
