Useful Text Manipulation Tidbits For Working With Twitter Messages

Sure, you could find the same answers by searching Google and clicking through StackOverflow links, but I thought I should put these sed/awk/grep/etc. commands, which have proven very useful when working with a large corpus of tweets, together in one place. Hopefully this collection saves others some time searching.

(use the -i flag with sed commands to edit a file in place rather than print the result to standard output)

show only lines between 36 and 150 characters in length

sed -nr '/^.{36,150}$/p' temp.txt
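
To see the length filter in action, here is a small sketch on made-up input. The three sample lines are hypothetical, and -E is used in place of -r on the assumption that your sed accepts it (current GNU and BSD versions do):

```shell
# Build a hypothetical three-line sample: 10, 40, and 200 characters long.
tmp=$(mktemp)
{
  printf '%10s\n' '' | tr ' ' 'a'
  printf '%40s\n' '' | tr ' ' 'b'
  printf '%200s\n' '' | tr ' ' 'c'
} > "$tmp"

# Keep only lines of 36 to 150 characters; only the 40-character line survives.
kept=$(sed -nE '/^.{36,150}$/p' "$tmp")
printf '%s\n' "${#kept}"   # prints 40
rm -f "$tmp"
```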

delete lines longer than 36 characters

sed '/.\{36\}./d' file.txt

delete all empty lines (including lines that contain only whitespace)

sed '/^\s*$/d' file.txt
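
\s is a GNU extension, so on other seds the POSIX class [[:space:]] is a safer bet. A quick sketch on a made-up sample (the two "tweets" are hypothetical):

```shell
# Hypothetical sample: two tweets separated by an empty line and a line of spaces.
sample=$(printf 'first tweet\n\n   \nsecond tweet\n')

# Delete lines that are empty or contain only whitespace.
cleaned=$(printf '%s\n' "$sample" | sed '/^[[:space:]]*$/d')
printf '%s\n' "$cleaned"
```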

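remove non-ascii characters (bytes 128-255; setting LC_ALL=C makes sed treat the input as raw bytes, which matters in a UTF-8 locale)

LC_ALL=C sed 's/[\d128-\d255]//g' file.txt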
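
The \dNNN escapes are specific to GNU sed. If that is not available, tr can do roughly the same job by deleting every byte with the high bit set. This is a swapped-in alternative, not the command above, and the sample string is made up:

```shell
# Hypothetical sample: "café ☕ ok" written as UTF-8 bytes via octal escapes.
# In the C locale tr works byte by byte, so octal 200-377 covers bytes 128-255.
ascii_only=$(printf 'caf\303\251 \342\230\225 ok\n' | LC_ALL=C tr -d '\200-\377')
printf '%s\n' "$ascii_only"
```

Note that the accented letter is dropped entirely rather than converted to a plain "e".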

delete lines starting with a hashtag (comments or Twitter hashtags)

sed '/^#/d' file.txt

print any paragraph containing the matching pattern "brains" (paragraphs are separated by empty lines)

sed '/./{H;$!d};x;/brains/!d' file.txt
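
A small run on made-up text shows what comes out. The three paragraphs are hypothetical, and the script is split into two -e parts with a ; before the closing brace, which should also keep non-GNU seds happy:

```shell
# Hypothetical sample: three paragraphs separated by empty lines.
sample=$(printf 'good morning\nworld\n\ni want brains\nfor dinner\n\nbye\n')

# Collect each paragraph in the hold space, then keep it only if it mentions "brains".
found=$(printf '%s\n' "$sample" | sed -e '/./{H;$!d;}' -e 'x;/brains/!d')
printf '%s\n' "$found"
```

Each matching paragraph is printed with an empty line in front of it, because H always adds a newline before the text it appends.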

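split the file into separate files, one per section, using a delimiter symbol (in this case \xa9, the copyright symbol); the output files are named 1, 2, 3, ... and each begins with the symbol, while any text before the first symbol is skipped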

awk -v RS="\xa9" 'NR > 1 {print RS $0 > (NR-1)}' file.txt
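
Here is a sketch of the same idea that is easier to try out. It assumes a plain ASCII delimiter (~) instead of \xa9, writes to hypothetical part1.txt, part2.txt, ... names in a scratch directory, does not put the delimiter back, and closes each file so awk does not run out of open file handles on a big corpus:

```shell
# Scratch directory with a hypothetical sample: a header, then two "~"-delimited sections.
workdir=$(mktemp -d)
printf 'header~first part\n~second part\n' > "$workdir/file.txt"

# Record 1 is whatever precedes the first "~", so it is skipped.
awk -v RS='~' -v dir="$workdir" '
  NR > 1 {
    out = dir "/part" (NR - 1) ".txt"   # hypothetical output name
    printf "%s", $0 > out
    close(out)
  }' "$workdir/file.txt"

ls "$workdir"
```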

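(j)oin and (p)rint silently all lines of file.txt using ex, dropping lines that start with a hashtag before the join (once the lines are joined there is only one line left to test), then removing carriage returns (^M, the Windows line break) and the character \xa9 (copyright symbol)

ex '+g/^#/d' +%j +%p -scq! file.txt | sed -e 's/\r//g' -e 's/\xa9//g'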

display, for each file in the current directory (*), a count of the lines that contain the hashtag #cool, ignoring case (use the -r flag if you wish to search recursively in subdirectories)

grep -c -i "#cool" *
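
Since -c counts matching lines, a tweet that uses the hashtag twice still counts once. If you want every occurrence, one option is -o (supported by GNU and BSD grep, though not required by POSIX) piped into wc -l. A sketch on made-up tweets:

```shell
# Hypothetical sample: the first line uses the hashtag twice.
sample=$(printf '#cool story #COOL\nnothing here\nso #Cool\n')

# Lines that contain the hashtag at least once.
lines=$(printf '%s\n' "$sample" | grep -c -i '#cool')

# Every occurrence: -o prints each match on its own line, wc -l counts them.
hits=$(( $(printf '%s\n' "$sample" | grep -o -i '#cool' | wc -l) ))

echo "matching lines: $lines, total occurrences: $hits"   # matching lines: 2, total occurrences: 3
```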

search and replace text containing a literal / character, replacing "/grin" with "/cheer" (using @ as the delimiter so the / does not need escaping)

sed 's@/grin@/cheer@g' file.txt

substitute "the" with "ppp" (matching "the" only as a whole word; \< and \> are GNU word-boundary anchors for the start and end of a word)

sed 's/\<the\>/ppp/g' file.txt

replace all instances of "Dang" or "dang" with "Oops"

sed 's/[dD]ang/Oops/g' file.txt
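
One thing to watch for: [dD]ang also matches inside longer words. A sketch comparing the plain pattern with a whole-word version, assuming GNU sed for the \< and \> anchors (the sample line is made up):

```shell
line='Dang, that dangerous dang thing'

# Plain pattern: also rewrites the start of "dangerous".
loose=$(printf '%s\n' "$line" | sed 's/[dD]ang/Oops/g')

# Whole-word pattern: leaves "dangerous" alone.
strict=$(printf '%s\n' "$line" | sed 's/\<[dD]ang\>/Oops/g')

printf '%s\n%s\n' "$loose" "$strict"
```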

greatestusername/TextManipulationTidbits
