.net - C# Regex Issue Getting URLs
To explain briefly, I'm trying to search Google for a keyword, get the URLs of the top 10 results, and save them.
This is a stripped-down command-line version of the code. It should return at least 1 result. If it works for that, I can apply it to the full version of the code and get all the results.
Basically, the code I have right now fails if I try it on the entire source from Google. If I include a random section of code from Google's HTML source, it works fine. To me, that means the regex has an error somewhere.
If there is a better way to do this aside from regex, please let me know. The URLs are between <h3 class="r"><a href=" and " class=l onmousedown="return clk(this.href
I got the regex code from a generator, but it's hard for me to understand regex, since nothing I've read explains it clearly. If you could pick out what's wrong and explain why, I'd appreciate it.
Thanks, Kevin
using System;
using System.Text.RegularExpressions;
using System.Net;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient wc = new WebClient();
            string keyword = "seo nj";
            // Escape the keyword so the space in it is valid inside the query string.
            string html = wc.DownloadString(string.Format("http://www.google.com/search?q={0}", Uri.EscapeDataString(keyword)));

            string re1 = "(<)";              // Single character 1
            string re2 = "(h3)";             // Alphanum 1
            string re3 = "(\\s+)";           // White space 1
            string re4 = "(class)";          // Variable name 1
            string re5 = "(=)";              // Single character 2
            string re6 = "(\"r\")";          // Double quote string 1
            string re7 = "(>)";              // Single character 3
            string re8 = "(<)";              // Single character 4
            string re9 = "([a-z])";          // Single word character (not whitespace) 1
            string re10 = "(\\s+)";          // White space 2
            string re11 = "((?:[a-z][a-z]+))"; // Word 1
            string re12 = "(=)";             // Single character 5
            string re13 = ".*?";             // Non-greedy match on filler (no capture group)
            string re14 = "((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))"; // HTTP URL 1
            string re15 = "(\")";            // Single character 6
            string re16 = "(\\s+)";          // White space 3
            string re17 = "(class)";         // Word 2
            string re18 = "(=)";             // Single character 7
            string re19 = "(l)";             // Single character 8
            string re20 = "(\\s+)";          // White space 4
            string re21 = "(onmousedown)";   // Word 3
            string re22 = "(=)";             // Single character 9
            string re23 = "(\")";            // Single character 10
            string re24 = "(return)";        // Word 4
            string re25 = "(\\s+)";          // White space 5
            string re26 = "(clk)";           // Word 5

            Regex r = new Regex(
                re1 + re2 + re3 + re4 + re5 + re6 + re7 + re8 + re9 + re10 + re11 + re12 + re13 +
                re14 + re15 + re16 + re17 + re18 + re19 + re20 + re21 + re22 + re23 + re24 + re25 + re26,
                RegexOptions.IgnoreCase | RegexOptions.Singleline);

            // Match against the downloaded page source.
            Match m = r.Match(html);
            if (m.Success)
            {
                Console.WriteLine("good");
                // re13 has no capture group, so the URL from re14 lands in group 13.
                string c1 = m.Groups[1].ToString();
                string alphanum1 = m.Groups[2].ToString();
                string ws1 = m.Groups[3].ToString();
                string var1 = m.Groups[4].ToString();
                string c2 = m.Groups[5].ToString();
                string string1 = m.Groups[6].ToString();
                string c3 = m.Groups[7].ToString();
                string c4 = m.Groups[8].ToString();
                string w1 = m.Groups[9].ToString();
                string ws2 = m.Groups[10].ToString();
                string word1 = m.Groups[11].ToString();
                string c5 = m.Groups[12].ToString();
                string httpurl1 = m.Groups[13].ToString();
                string c6 = m.Groups[14].ToString();
                string ws3 = m.Groups[15].ToString();
                string word2 = m.Groups[16].ToString();
                string c7 = m.Groups[17].ToString();
                string c8 = m.Groups[18].ToString();
                string ws4 = m.Groups[19].ToString();
                string word3 = m.Groups[20].ToString();
                string c9 = m.Groups[21].ToString();
                string c10 = m.Groups[22].ToString();
                string word4 = m.Groups[23].ToString();
                string ws5 = m.Groups[24].ToString();
                string word5 = m.Groups[25].ToString();
                //Console.Write("(" + c1 + ")" + "(" + alphanum1 + ")" + "(" + ws1 + ")" + "(" + var1 + ")" + "(" + c2 + ")" + "(" + string1 + ")" + "(" + c3 + ")" + "(" + c4 + ")" + "(" + w1 + ")"
                //    + "(" + ws2 + ")" + "(" + word1 + ")" + "(" + c5 + ")" + "(" + httpurl1 + ")" + "(" + c6 + ")" + "(" + ws3 + ")" + "(" + word2 + ")" + "(" + c7 + ")" + "(" + c8 + ")"
                //    + "(" + ws4 + ")" + "(" + word3 + ")" + "(" + c9 + ")" + "(" + c10 + ")" + "(" + word4 + ")" + "(" + ws5 + ")" + "(" + word5 + ")" + "\n");
                Console.WriteLine(httpurl1);
            }
            else
            {
                Console.WriteLine("bad");
            }
            Console.ReadLine();
        }
    }
}
You're doing it wrong.
Google has an API for doing searches programmatically. Don't put yourself through the pain of trying to parse HTML with regexes when there's a published, supported way to do what you want.
Besides, what you're trying to do, submitting automated searches through Google's web site and scraping the results, is a violation of section 5.3 of the Terms of Service:
You agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)
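On the regex itself, for HTML you are actually permitted to process: the generated pattern captures 25 groups when only the URL is wanted, and `Match` returns only the first hit. Here is a minimal sketch of a smaller pattern with one named group, using `Regex.Matches` to collect every hit. It assumes the markup looks exactly like the two delimiters quoted in the question. The names `ResultUrlExtractor` and `ExtractResultUrls` and the sample HTML are made up for illustration, and it runs against a hard-coded string, not a live page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ResultUrlExtractor
{
    // Only the URL is captured (named group "url"); the rest is matched literally.
    // \s+ and the optional quotes around l tolerate small differences in spacing and quoting.
    private static readonly Regex ResultLink = new Regex(
        @"<h3\s+class=""r"">\s*<a\s+href=""(?<url>https?://[^""]+)""\s+class=""?l""?",
        RegexOptions.IgnoreCase);

    // Returns up to max URLs, in the order they appear in the HTML.
    public static List<string> ExtractResultUrls(string html, int max)
    {
        var urls = new List<string>();
        foreach (Match m in ResultLink.Matches(html))
        {
            if (urls.Count == max) break;
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    static void Main()
    {
        // Hypothetical sample markup built from the delimiters quoted in the question.
        string sample =
            "<h3 class=\"r\"><a href=\"http://example.com/one\" class=l onmousedown=\"return clk(this.href)\">One</a></h3>" +
            "<p>filler</p>" +
            "<h3 class=\"r\"><a href=\"https://example.org/two?x=1\" class=l onmousedown=\"return clk(this.href)\">Two</a></h3>";

        foreach (string url in ExtractResultUrls(sample, 10))
            Console.WriteLine(url);
    }
}
```

Run as written, this prints the two sample URLs, one per line. A pattern like this is still brittle, since any change to the markup breaks it. That is one more reason to prefer a supported API or a real HTML parser where one is available.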
c# .net regex