[自然言語処理100本ノック] 第2章
機械学習 自然言語処理100本ノック
Published : 2020-05-30   Lastmod : 2021-11-07

第2章 UNIXコマンド

結果だけ載せました。正解かどうかは保障しません笑

github

  • githubのjupyter notebook形式のファイルはこちら

筆者の環境

!sw_vers
ProductName:	Mac OS X
ProductVersion:	10.14.6
BuildVersion:	18G95
!python -V
Python 3.5.5 :: Anaconda, Inc.
!bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin18)
Copyright (C) 2007 Free Software Foundation, Inc.

テキストファイルをダウンロードします。

!wget https://nlp100.github.io/data/popular-names.txt -O ./popular-names.txt
--2020-04-19 13:55:27--  https://nlp100.github.io/data/popular-names.txt
nlp100.github.io (nlp100.github.io) をDNSに問いあわせています... 185.199.109.153, 185.199.108.153, 185.199.110.153, ...
nlp100.github.io (nlp100.github.io)|185.199.109.153|:443 に接続しています... 接続しました。
HTTP による接続要求を送信しました、応答を待っています... 200 OK
長さ: 55026 (54K) [text/plain]
`./popular-names.txt' に保存中

./popular-names.txt 100%[===================>]  53.74K  --.-KB/s 時間 0.01s      

2020-04-19 13:55:27 (4.34 MB/s) - `./popular-names.txt' へ保存完了 [55026/55026]

ファイルは、「アメリカで生まれた赤ちゃんの「名前」「性別」「人数」「年」をタブ区切り形式で格納したファイルである」という事ですが、どんなファイルか見てみます。

!head -n 5 popular-names.txt
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
!tail -n 5 popular-names.txt
Benjamin	M	13381	2018
Elijah	M	12886	2018
Lucas	M	12585	2018
Mason	M	12435	2018
Logan	M	12352	2018

解答

共通部分

file_name = './popular-names.txt'

10問

with open(file_name,mode='r') as f:
  print("line number by python : ", len(f.readlines()))

!echo "line number by unix   : "`cat $file_name | wc -l`
line number by python :  2780
line number by unix   :  2780

11問

with open(file_name,mode='r') as f:
  with open('11_python_out', mode='w') as p:
    for s in f.readlines():
      p.write(s.replace('\t',' '))

!expand -t 1 $file_name > 11_unix_out

!md5 $file_name
!md5 11_python_out
!md5 11_unix_out
MD5 (./popular-names.txt) = 8df49072c8f6812bbc9b7c2a1311d3f2
MD5 (11_python_out) = f4b925b1b39a797e1d90af07f1abed33
MD5 (11_unix_out) = f4b925b1b39a797e1d90af07f1abed33

12問

with open(file_name,mode='r') as f:
  with open('col1.txt', mode='w') as c1:
    with open('col2.txt', mode='w') as c2:
      for s in f.readlines():
        c1.write(s.split('\t')[0] + '\n')
        c2.write(s.split('\t')[1] + '\n')
        
!cut -f 1 $file_name > u_col1.txt
!cut -f 2 $file_name > u_col2.txt

!md5 col1.txt
!md5 u_col1.txt
!md5 col2.txt
!md5 u_col2.txt
MD5 (col1.txt) = b87013f2cafe9e8a443480a3fe0e0e9d
MD5 (u_col1.txt) = b87013f2cafe9e8a443480a3fe0e0e9d
MD5 (col2.txt) = 9252e1786bf293c854b88dc7af0ea77c
MD5 (u_col2.txt) = 9252e1786bf293c854b88dc7af0ea77c

13問

with open('col1.txt', mode='r') as c1:
  with open('col2.txt', mode='r') as c2:
    with open('13_merge.txt', mode='w') as w:
      for s1,s2 in zip(c1.readlines(), c2.readlines()):
        w.write(s1.replace('\n','') + '\t' + s2.replace('\n','') + '\n')
        
!paste col1.txt col2.txt > u_13_merge.txt

!md5 13_merge.txt
!md5 u_13_merge.txt
MD5 (13_merge.txt) = 7c49f3d98798fe8b500b25da199202b5
MD5 (u_13_merge.txt) = 7c49f3d98798fe8b500b25da199202b5

14問

def print_n(n):
  with open(file_name,mode='r') as f:
    for i,s in enumerate(f.readlines()[:n]):
      print(s.replace('\n',''))

print('print by python')
print_n(4)
print()
print('print by unix')
!head -n 4 $file_name
print by python
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880

print by unix
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880

15問

def print_n(n):
  with open(file_name,mode='r') as f:
    for i in f.readlines()[-1 * n:]: 
      print(i.replace('\n',''))

print('print by python')
print_n(4)
print()
print('print by unix')
!tail -n 4 $file_name
print by python
Elijah	M	12886	2018
Lucas	M	12585	2018
Mason	M	12435	2018
Logan	M	12352	2018

print by unix
Elijah	M	12886	2018
Lucas	M	12585	2018
Mason	M	12435	2018
Logan	M	12352	2018

16問

def devide_n(n):
  with open(file_name,mode='r') as f:
    lines = f.readlines()
    num = int(len(lines) / n)
    
    for k in range(n + 1):
      with open('16_No_{:06d}.txt'.format(k), mode='w') as f:
        for i in lines[num * k: num * (k + 1)]:
          f.write(i)

devide_n(3)

def get_line(n):
  with open(file_name, mode='r') as f:
    return int(len(f.readlines()) / n)

!split -l {get_line(3)} -a 4 $file_name 

print('files by python')
!ls | grep 16_No | xargs -I{} md5 {}
print()
print('files by unix')
!ls | grep xaa | xargs -I{} md5 {}
files by python
MD5 (16_No_000000.txt) = 906c1ac43d5323ce7da7de854a2867b0
MD5 (16_No_000001.txt) = b577de2d082eac6b9a69ead1d7e8044b
MD5 (16_No_000002.txt) = 370895d0df3a1ed4ac0e7c5b800f6bab
MD5 (16_No_000003.txt) = 0626fe166d70fd61595d92c39ea28759

files by unix
MD5 (xaaaa) = 906c1ac43d5323ce7da7de854a2867b0
MD5 (xaaab) = b577de2d082eac6b9a69ead1d7e8044b
MD5 (xaaac) = 370895d0df3a1ed4ac0e7c5b800f6bab
MD5 (xaaad) = 0626fe166d70fd61595d92c39ea28759

17問

with open(file_name,mode='r') as f:
  s_set = set([s.split('\t')[0] for s in f.readlines()])
  print('### python ###')
  print(sorted(list(s_set)))

print()
print('### unix ###')
!cut -f 1 $file_name | sort | uniq | tr '\n' ', '
### python ###
['Abigail', 'Aiden', 'Alexander', 'Alexis', 'Alice', 'Amanda', 'Amelia', 'Amy', 'Andrew', 'Angela', 'Anna', 'Annie', 'Anthony', 'Ashley', 'Austin', 'Ava', 'Barbara', 'Benjamin', 'Bertha', 'Bessie', 'Betty', 'Brandon', 'Brian', 'Brittany', 'Carol', 'Carolyn', 'Charles', 'Charlotte', 'Chloe', 'Christopher', 'Clara', 'Crystal', 'Cynthia', 'Daniel', 'David', 'Deborah', 'Debra', 'Donald', 'Donna', 'Doris', 'Dorothy', 'Edward', 'Elijah', 'Elizabeth', 'Emily', 'Emma', 'Ethan', 'Ethel', 'Evelyn', 'Florence', 'Frances', 'Frank', 'Gary', 'George', 'Hannah', 'Harper', 'Harry', 'Heather', 'Helen', 'Henry', 'Ida', 'Isabella', 'Jacob', 'James', 'Jason', 'Jayden', 'Jeffrey', 'Jennifer', 'Jessica', 'Joan', 'John', 'Joseph', 'Joshua', 'Judith', 'Julie', 'Justin', 'Karen', 'Kathleen', 'Kelly', 'Kimberly', 'Larry', 'Laura', 'Lauren', 'Liam', 'Lillian', 'Linda', 'Lisa', 'Logan', 'Lori', 'Lucas', 'Madison', 'Margaret', 'Marie', 'Mark', 'Mary', 'Mason', 'Matthew', 'Megan', 'Melissa', 'Mia', 'Michael', 'Michelle', 'Mildred', 'Minnie', 'Nancy', 'Nicholas', 'Nicole', 'Noah', 'Oliver', 'Olivia', 'Pamela', 'Patricia', 'Rachel', 'Rebecca', 'Richard', 'Robert', 'Ronald', 'Ruth', 'Samantha', 'Sandra', 'Sarah', 'Scott', 'Sharon', 'Shirley', 'Sophia', 'Stephanie', 'Steven', 'Susan', 'Tammy', 'Taylor', 'Thomas', 'Tracy', 'Tyler', 'Virginia', 'Walter', 'William']

### unix ###
Abigail,Aiden,Alexander,Alexis,Alice,Amanda,Amelia,Amy,Andrew,Angela,Anna,Annie,Anthony,Ashley,Austin,Ava,Barbara,Benjamin,Bertha,Bessie,Betty,Brandon,Brian,Brittany,Carol,Carolyn,Charles,Charlotte,Chloe,Christopher,Clara,Crystal,Cynthia,Daniel,David,Deborah,Debra,Donald,Donna,Doris,Dorothy,Edward,Elijah,Elizabeth,Emily,Emma,Ethan,Ethel,Evelyn,Florence,Frances,Frank,Gary,George,Hannah,Harper,Harry,Heather,Helen,Henry,Ida,Isabella,Jacob,James,Jason,Jayden,Jeffrey,Jennifer,Jessica,Joan,John,Joseph,Joshua,Judith,Julie,Justin,Karen,Kathleen,Kelly,Kimberly,Larry,Laura,Lauren,Liam,Lillian,Linda,Lisa,Logan,Lori,Lucas,Madison,Margaret,Marie,Mark,Mary,Mason,Matthew,Megan,Melissa,Mia,Michael,Michelle,Mildred,Minnie,Nancy,Nicholas,Nicole,Noah,Oliver,Olivia,Pamela,Patricia,Rachel,Rebecca,Richard,Robert,Ronald,Ruth,Samantha,Sandra,Sarah,Scott,Sharon,Shirley,Sophia,Stephanie,Steven,Susan,Tammy,Taylor,Thomas,Tracy,Tyler,Virginia,Walter,William,

18問

全部表示すると長いので、先頭から10行だけを表示しています。

with open(file_name,mode='r') as f:
  print("### python ###")
  for s in sorted([s.split('\t') for s in f.readlines()],key=lambda x:x[2],reverse=True)[0:10]:
    print(s)

print()
!echo "### unix ###"
!cat $file_name | sort -k 3 -r | head -n 10
### python ###
['Linda', 'F', '99689', '1947\n']
['James', 'M', '9951', '1911\n']
['Mildred', 'F', '9921', '1913\n']
['Mary', 'F', '9889', '1886\n']
['Mary', 'F', '9888', '1887\n']
['John', 'M', '9829', '1900\n']
['Elizabeth', 'F', '9708', '2012\n']
['Anna', 'F', '9687', '1913\n']
['Frances', 'F', '9677', '1914\n']
['John', 'M', '9655', '1880\n']

### unix ###
Linda	F	99689	1947
James	M	9951	1911
Mildred	F	9921	1913
Mary	F	9889	1886
Mary	F	9888	1887
John	M	9829	1900
Elizabeth	F	9708	2012
Anna	F	9687	1913
Frances	F	9677	1914
John	M	9655	1880
sort: Broken pipe

19問

全部表示すると長いので、先頭から10行だけを表示しています。

import collections

with open(file_name,mode='r') as f:
  s_list = [s.split('\t')[0] for s in f.readlines()]
  print('### python sort ###')
  for s in sorted(collections.Counter(s_list).items(),key=lambda x:x[1], reverse=True )[0:10]:
    print(str(s[1]) + ' ' + s[0])

print()
print('### unix sort ###')
!cut -f 1 $file_name | sort | uniq -c | sort -r | head -n 10
### python sort ###
118 James
111 William
108 John
108 Robert
92 Mary
75 Charles
74 Michael
73 Elizabeth
70 Joseph
60 Margaret

### unix sort ###
 118 James
 111 William
 108 Robert
 108 John
  92 Mary
  75 Charles
  74 Michael
  73 Elizabeth
  70 Joseph
  60 Margaret

関連記事

関連する記事