Grep and Awk to Extract Text
Once you start using the command line, it is amazing how many things you can do that were difficult in a GUI system. A common task is to extract data from text files. The text files may be the output of some program, or they may be written documents like saved emails. Here are some of my favorite commands for working with text files.
The grep command is used to find lines of interest in a text file. For example, to search a set of log files for errors, you can write:
grep -i error *.log
This will return all the lines containing “error” in the files with extension “log”, including lines where it appears inside a longer word such as “errors”. The -i option makes the search case-insensitive.
If you need a bit of context around your matches, use the -A (after) and -B (before) options to grep. Each takes a number and prints that many lines after or before every matching line. For example, the following prints two lines before and three lines after each match:
grep -A 3 -B 2 -i error *.log
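To see exactly which lines the context options pull in, here is a minimal sketch that runs the same options against a small numbered file. The file name sample.log and its contents are made up for illustration, and the -A and -B options are assumed to be available, as they are in GNU and BSD grep.

```shell
# Scratch directory with a hypothetical eight-line log file.
workdir=$(mktemp -d)
printf 'line 1\nline 2\nline 3\nerror here\nline 5\nline 6\nline 7\nline 8\n' > "$workdir/sample.log"

# Print the matching line plus 2 lines before it and 3 lines after it.
context=$(grep -B 2 -A 3 -i error "$workdir/sample.log")
echo "$context"
```

The output runs from "line 2" through "line 7": the match itself, the two lines before it, and the three lines after it.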
You can also reverse a search — find lines that do not match — by using the -v option:
grep -i error *.log | grep -v "false alarm"
This will first find the errors in the log files, and then ignore the lines containing “false alarm”.
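The two-stage filter can be tried on throwaway data. This is only a sketch: the files app.log and db.log and their contents are invented for the example.

```shell
# Two hypothetical log files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Service started\nERROR: disk full\n' > app.log
printf 'Error: sensor glitch (false alarm)\nerror: timeout\n' > db.log

# Stage 1 keeps lines containing "error" in any letter case.
# Stage 2 drops those that also contain "false alarm".
real_errors=$(grep -i error *.log | grep -v "false alarm")
echo "$real_errors"
```

Because more than one file is searched, the first grep prefixes each line with its file name. The second grep has no -i, so it only drops lines that contain “false alarm” in exactly that lowercase form.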
Sometimes you need a bit more flexibility. This is where the awk command comes in. For example, assume you have the file english.txt on your computer, containing English-language signs from around the world, and you want to see all the signs from Tokyo. The awk '/pattern1/,/pattern2/' command prints every line from a line matching pattern1 through the next line matching pattern2, inclusive:
awk '/Tokyo.*:\r$/,/^\r$/' english.txt
The first pattern is /Tokyo.*:\r$/. It is a regular expression that matches any line that contains “Tokyo” somewhere on the line and ends with a colon. The \r is there because the file has DOS/Windows line endings; simply remove the \r if the file has Unix/Linux line endings.
The second pattern is /^\r$/. It is a regular expression that matches blank lines. Again, remove the \r if you use a file with Unix/Linux line endings.
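If you are not sure which line endings a file has, one option is to strip the carriage returns first and then use the patterns without \r. The sketch below assumes a made-up file, signs.txt, laid out the same way (a heading ending in a colon, the sign text, then a blank line). The Oslo entry is placeholder text, not from the real file.

```shell
# Hypothetical sample file written with DOS/Windows (CRLF) line endings.
workdir=$(mktemp -d)
printf 'In an Oslo cafe:\r\nPlaceholder sign text.\r\n\r\nIn a Tokyo bar:\r\nSpecial cocktails for the ladies with nuts.\r\n\r\n' > "$workdir/signs.txt"

# tr deletes every carriage return, so awk sees Unix-style lines
# and the range patterns no longer need the \r.
tokyo=$(tr -d '\r' < "$workdir/signs.txt" | awk '/Tokyo.*:$/,/^$/')
echo "$tokyo"
```

Only the Tokyo entry is printed. The range also includes the blank line that ends it, although the shell's command substitution trims that trailing blank line from the variable here.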
The result is:
In a Tokyo hotel:
Is forbidden to steal hotel towels please. If you are not a person
to do such a thing is please not to read notis.

In a Tokyo bar:
Special cocktails for the ladies with nuts.

In a Tokyo shop:
Our nylons cost more than common, but you’ll find they are best in
the long run.

From a brochure of a car rental firm in Tokyo:
When passenger of foot heave in sight, tootle the horn. Trumpet him
melodiously at first, but if he still obstacles your passage
then tootle him with vigor.
The awk command can do many more things. If you regularly work with text files, it is worth taking a closer look at it.
Comment from johnlihanhong@yahoo.com.cn
Time October 4, 2008 at 11:53 am
Thanks for your ideas here. Suppose I have grepped the lines matching the pattern hw="*" and got a list of them. How can I then extract just the text between the quotation marks? Thanks for your suggestions.